Extracting topic strings from text is one of the foundations of natural language processing. Traditional extraction methods rely mainly on a "dictionary plus matching" model. Because dictionary updates cannot keep pace with the rate at which new words emerge in online news, and because no dictionary can fully cover the range of subjects that online news addresses, this approach is poorly suited to topic extraction from online news. This paper proposes and implements a new method that extracts news topics without using a dictionary. The method exploits the particular structure of online news by searching for strings that are repeated between the title and the body text. After simple processing, these strings reflect the topic of the news item well. Experimental results show that the method extracts the topics of the great majority of online news items accurately and effectively, and that it meets the needs of automatic news processing. The method is equally applicable to other Asian languages and to Western languages.
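The core idea, finding strings shared by the title and the body, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `candidate_topic_strings`, the minimum length of two characters, and the rule of keeping only maximal matches are assumptions made here to stand in for the "simple processing" that the abstract does not specify. Matching is done character by character, which suits unsegmented text such as Chinese, and punctuation handling is omitted.

```python
from typing import List


def candidate_topic_strings(title: str, body: str, min_len: int = 2) -> List[str]:
    """Return the maximal substrings of `title` that also occur in `body`.

    Hypothetical helper for illustration; `min_len` and the maximality
    filter are assumptions, not details taken from the paper.
    """
    matches = set()
    n = len(title)
    for start in range(n):
        end = start + min_len
        longest = None
        # Extend the substring while it still occurs in the body. If a longer
        # substring occurs, every prefix of it occurs too, so a linear
        # extension from each start position is enough.
        while end <= n and title[start:end] in body:
            longest = title[start:end]
            end += 1
        if longest is not None:
            matches.add(longest)

    # Keep only matches that are not contained in a longer match.
    maximal = [s for s in matches
               if not any(s != t and s in t for t in matches)]
    # Longest first; ties broken by position in the title.
    return sorted(maximal, key=lambda s: (-len(s), title.index(s)))


if __name__ == "__main__":
    title = "北京奥运会开幕"
    body = "奥运会今日在北京开幕。北京奥运会吸引了众多观众。"
    print(candidate_topic_strings(title, body))
    # prints ['北京奥运会', '开幕']
```

No dictionary or word segmenter is involved: the title itself serves as the source of candidates, which is why newly coined words can still be recovered as long as they appear in both the title and the body.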