To address the problem of automatic keyword indexing for Chinese books, this paper introduces the conditional random field (CRF) machine learning algorithm. A large body of existing manually assigned keyword indexing data for Chinese books is used to train a labeling model that captures the semantic relations and rule features among sequence entities. The model is then used for machine prediction to extract book keywords automatically. Two problems are addressed. First, because the choice of CRF model parameters affects the system's labeling performance, comparative experiments are conducted from several angles to determine the best parameter set for the specific problem of Chinese book keyword indexing. Second, the influence of different observation features on keyword indexing is examined, and experiments demonstrate four observation features that effectively improve indexing performance. Finally, an optimal keyword indexing model for Chinese books is established.
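As a rough illustration of how keyword indexing can be cast as the sequence labeling task that a CRF is trained on, the sketch below converts a segmented title and its manually assigned keywords into per-token labels, builds a few observation features, and decodes a label sequence back into keywords. Everything in it is an assumption made for illustration: the abstract does not specify the tag set (a B/I/O scheme is assumed here), the word segmentation, or the four observation features found to be effective, and the function names `bio_tags`, `token_features`, and `decode_keywords` are hypothetical. No CRF training is performed; the sketch covers only the data preparation and decoding steps around it.

```python
from typing import Dict, List, Union


def bio_tags(tokens: List[str], keywords: List[List[str]]) -> List[str]:
    """Label each token B (keyword start), I (inside keyword) or O (outside).

    Assumed tag set; the abstract does not state which scheme was used.
    """
    tags = ["O"] * len(tokens)
    # Try longer keywords first so a short keyword cannot block a longer one.
    for kw in sorted(keywords, key=len, reverse=True):
        n = len(kw)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == kw and span_free:
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return tags


def token_features(tokens: List[str], i: int) -> Dict[str, Union[str, int]]:
    """Illustrative observation features for token i.

    These are generic examples, not the four features studied in the paper.
    """
    return {
        "token": tokens[i],
        "length": len(tokens[i]),
        "position": i,
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def decode_keywords(tokens: List[str], tags: List[str]) -> List[str]:
    """Turn a predicted B/I/O sequence back into keyword strings."""
    found: List[str] = []
    current: List[str] = []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                found.append("".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                found.append("".join(current))
            current = []
    if current:
        found.append("".join(current))
    return found


# Hypothetical segmented title and its manually assigned keywords.
tokens = ["条件", "随机场", "在", "关键词", "标引", "中", "的", "应用"]
gold = [["条件", "随机场"], ["关键词", "标引"]]

tags = bio_tags(tokens, gold)
features = [token_features(tokens, i) for i in range(len(tokens))]
print(tags)
print(decode_keywords(tokens, tags))
```

In a full pipeline, the feature dictionaries and tag sequences would be the training input to a CRF toolkit, and `decode_keywords` would be applied to the tags the trained model predicts for an unseen title.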