[Objective] Building on a review of current approaches to citation metadata extraction, this paper combines semantic knowledge with machine learning to explore a method for extracting citation metadata automatically. [Methods] A neural network model is used to train word vectors on a manually segmented corpus. Because metadata of the same type tends to cluster in one region of the vector space, a support vector machine classifier is then used to classify and label the metadata automatically. [Results] In experiments that use foreign-language citation data as the test set, the proposed method achieves high accuracy and recall, and it handles citations that contain multiple languages and abbreviations particularly well. [Limitations] The method is limited in the fine-grained extraction of temporal content from citation metadata. [Conclusions] The experimental results show that the method performs well at automatically discovering and labeling citation metadata, and that it considerably improves applicability and fault tolerance.
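The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D vectors below are hand-made stand-ins for trained word vectors, the labels `author`, `title`, and `year` are assumed metadata types, and the classifier is a small linear SVM trained one-vs-rest by subgradient descent on the hinge loss rather than any particular SVM library. All function names are hypothetical.

```python
import random

# Hypothetical 2-D stand-ins for word vectors, grouped by metadata type.
# In the paper, vectors would come from a neural model trained on a
# manually segmented citation corpus; these values are invented.
TRAIN = {
    "author": [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0), (1.1, 0.1)],
    "title": [(0.1, 0.9), (0.0, 1.0), (0.2, 0.8), (0.1, 1.1)],
    "year": [(-0.9, -0.9), (-1.0, -0.8), (-0.8, -1.0), (-1.1, -1.0)],
}


def train_linear_svm(xs, ys, epochs=500, lr=0.1, lam=0.01, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) by stochastic subgradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, xs[i])) + b
            # L2 shrinkage on the weights (the bias is left unregularized).
            w = [wj * (1.0 - lr * lam) for wj in w]
            if ys[i] * score < 1.0:
                # Sample is inside the margin: take a hinge-loss subgradient step.
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += lr * ys[i]
    return w, b


def fit_one_vs_rest(train):
    """Train one binary SVM per metadata type (that type vs. all others)."""
    xs = [x for vecs in train.values() for x in vecs]
    models = {}
    for label in train:
        ys = [1 if lab == label else -1 for lab, vecs in train.items() for _ in vecs]
        models[label] = train_linear_svm(xs, ys)
    return models


def classify(models, x):
    """Return the metadata type whose SVM gives the highest decision score."""
    def score(label):
        w, b = models[label]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)


models = fit_one_vs_rest(TRAIN)
print(classify(models, (0.95, 0.1)))
```

The sketch relies on the same observation as the abstract: when vectors of one metadata type sit close together, a margin-based classifier can separate the types, and a new token is labeled by the region its vector falls in.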