Topic discovery and evolution in scientific literature based on content and citations

Source: Frontiers of Information Technology & Electronic Engineering
Researchers are increasingly interested in how important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both its own words and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model. The citation-content-LDA topic model is a two-level topic model that uses citation information for 'father' topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the topic evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
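For readers unfamiliar with collapsed Gibbs sampling, the following is a minimal sketch for standard single-level LDA over word tokens only. It is not the two-level citation-content-LDA model summarized above, whose sampling equations are not given in this abstract. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are illustrative assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=100, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative sketch only).

    docs: list of documents, each a list of integer word ids.
    Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Toy corpus (hypothetical): word ids drawn from a vocabulary of size 4.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, vocab_size=4)
```

The sampler integrates out the topic and word distributions and keeps only count tables, which is what makes it "collapsed". A two-level model such as the one described in the abstract would need additional count tables and a joint conditional over 'father' topics and sub-topics.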