Retrieval of ancient Chinese texts is currently limited, for the most part, to the level of sections, chapters, and tables of contents; even full-text retrieval is generally based on single Chinese characters. Because no ready-made vocabulary list for ancient texts is available, both the indexing and the retrieval efficiency of these texts suffer. This paper adapts the N-gram method, commonly used for processing modern texts, to ancient texts in order to extract content words. The experimental procedure consists of four steps: automatic segmentation with word-frequency counting; use of a word-extraction dictionary and a stop-word dictionary to obtain candidate words; elimination of spurious n-grams by simple calculation; and manual identification of the content words. The experiment extracted more than 3,000 common words and proper nouns (book titles, place names, personal names, and official titles) from the ancient text Qi Min Yao Shu (《齐民要术》), which indicates that the approach is basically feasible.
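The automatic part of the procedure above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the dictionaries, thresholds, or filtering formula, so the stop-character set `STOP_CHARS`, the `min_freq` threshold, and the rule that drops an n-gram when a longer n-gram with the same frequency contains it are all assumptions made for illustration. Segmentation is approximated here by overlapping character n-grams taken within punctuation-delimited clauses, and the final manual review step is left to a human reader.

```python
import re
from collections import Counter

# Hypothetical stop characters (common function words). The paper's actual
# stop-word dictionary is not given in the abstract.
STOP_CHARS = set("之其而也者以於于則则")


def count_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams inside each run of CJK characters.

    Punctuation and other non-CJK characters act as clause boundaries,
    so no n-gram spans two clauses.
    """
    clauses = re.findall(r"[\u4e00-\u9fff]+", text)
    counts = Counter()
    for clause in clauses:
        for n in range(n_min, n_max + 1):
            for i in range(len(clause) - n + 1):
                counts[clause[i:i + n]] += 1
    return counts


def filter_candidates(counts, min_freq=2, stop_chars=STOP_CHARS):
    """Return candidate words as a dict mapping n-gram to frequency.

    Assumed filtering rules (illustrative only):
    1. keep n-grams seen at least `min_freq` times;
    2. drop n-grams that start or end with a stop character;
    3. drop an n-gram if a longer kept n-gram contains it and has the
       same frequency, since it then never occurs on its own.
    """
    kept = {
        gram: freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in stop_chars
        and gram[-1] not in stop_chars
    }
    result = {}
    for gram, freq in kept.items():
        absorbed = any(
            len(other) > len(gram) and gram in other and kept[other] == freq
            for other in kept
        )
        if not absorbed:
            result[gram] = freq
    return result
```

Calling `filter_candidates(count_ngrams(text))` on a passage returns the surviving candidates with their frequencies, which would then be passed to a human reviewer for the final selection of content words. On a real text such as Qi Min Yao Shu, the thresholds and stop list would need tuning against the corpus.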