Expanding the segmentation lexicon with two-character phrases sharply increases the number of ambiguous segmentations. To address this, we graded the added two-character phrases by their degree of cohesion (凝固度). We first examined and tested the criteria and methods proposed in earlier work. The results show that structure type, "constituent-character substitution rate" (成分字替换率), and "preceding/following ambiguity degree" (前/后接歧义度) are closely related to cohesion, and also to connection type (A/BC vs. AB/C). Of particular value is the substitution rate of the second character, taking the first character as the base, for three structure types: modifier-head (定中), adverbial-head (状中), and verb-object (述宾). Character groups for which this rate is high mostly connect as A/BC, while the remaining character groups mostly connect as AB/C. On this basis, we propose a grading scheme for the lexicon expanded with two-character phrases, together with specific graded disambiguation strategies.
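The abstract does not give a formula for the second-character substitution rate, so the following is only an illustrative sketch under stated assumptions. It assumes the rate for a first character A can be approximated by the number of distinct second characters B observed after A in a list of two-character phrases, and that a three-character string ABC is split as A/BC when that count reaches a cutoff. The function names, the toy phrase list, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def second_char_substitution_counts(phrases):
    """Map each first character to the number of distinct second
    characters observed after it (an assumed stand-in for the
    'substitution rate' named in the abstract)."""
    followers = defaultdict(set)
    for phrase in phrases:
        if len(phrase) != 2:
            continue  # only two-character phrases are counted
        first, second = phrase
        followers[first].add(second)
    return {first: len(seconds) for first, seconds in followers.items()}


def preferred_split(trigram, counts, threshold=3):
    """Guess the connection type of a three-character string ABC.

    Assumption: if the first character combines freely (count at or
    above the hypothetical threshold), prefer A/BC; otherwise AB/C.
    """
    a, b, c = trigram
    if counts.get(a, 0) >= threshold:
        return (a, b + c)  # A/BC
    return (a + b, c)      # AB/C


# Toy data, chosen only to exercise both branches.
phrases = ["大树", "大楼", "大河", "大门", "葡萄"]
counts = second_char_substitution_counts(phrases)
split_free = preferred_split("大树林", counts)
split_bound = preferred_split("葡萄酒", counts)
```

In this toy run, 大 is followed by four distinct characters and so is treated as freely combining, while 葡 is followed by only one. A real system would presumably estimate the rate from corpus statistics and restrict it to the three structure types the abstract singles out; neither step is shown here.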