【Objective】 This paper analyzes the characteristics of cited text spans in scientific literature and compares the effectiveness of different methods for identifying them. 【Methods】 Using the annotated cited-span data from the CL-SciSumm 2016 shared task, the study examines the length, position, and importance of cited spans and analyzes how these correlate with the length and position of the corresponding citation contexts. It then compares similarity algorithms based on the bag-of-words model, topic models, and the WordNet semantic dictionary on the cited-span identification task. 【Results】 96% of the annotated cited spans are shorter than three sentences, and they occur more often in the front part of an article and in the early part of a section; the mean TextRank weight of cited spans is significantly higher than that of other spans. Cited spans and citation contexts are significantly correlated in length, but show no clear correlation in position. Measured both by MMR and by sentence-level and word-level matching, the bag-of-words method outperforms the semantic-dictionary method, which in turn clearly outperforms the topic-model method. 【Limitations】 The analysis of the concept and properties of cited spans remains at the theoretical level, and both the feature analysis and the method comparison were carried out only on the CL-SciSumm 2016 annotated cited-span data. 【Conclusions】 Word usage in scientific literature is relatively standardized and rigorous, so lexical features play a key role in identifying cited spans.
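The bag-of-words approach that the abstract reports as most effective can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the cosine similarity measure, the function names (`bag_of_words`, `cosine`, `rank_candidates`), and the `top_k` default are all assumptions chosen for illustration. The sketch scores each sentence of the cited paper against a citation context and returns the best matches.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(citation_context, sentences, top_k=2):
    """Return (sentence_index, score) pairs for the top_k most similar sentences.

    top_k=2 is a hypothetical default, loosely motivated by the finding that
    most annotated cited spans are shorter than three sentences.
    """
    query = bag_of_words(citation_context)
    scored = [(cosine(query, bag_of_words(s)), i) for i, s in enumerate(sentences)]
    # Highest score first; ties broken by earlier position in the paper.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]


if __name__ == "__main__":
    # Toy data, invented for this illustration only.
    cited_paper = [
        "We propose a parser trained on treebank data.",
        "The weather was nice.",
        "Results improve over the baseline.",
    ]
    context = "Smith et al. propose a parser trained on treebank data"
    for index, score in rank_candidates(context, cited_paper):
        print(index, round(score, 3))
```

A topic-model or WordNet-based variant would replace only the `cosine` step with a different similarity function, which is what makes the three families of methods directly comparable on the same candidate sentences.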