Excerpt from the paper
【Objective】 This paper combines the strengths of two approaches: the probabilistic topic extraction of the traditional LDA model, and the ability of co-word network analysis to reveal the structure of relationships among words in the literature. The aim is to reduce the interference of high-frequency words that originate from only a few documents and to improve topic cohesion. 【Methods】 In the topic analysis of abstracts from the traffic law literature, the keywords of the documents are added as a compound dictionary for word segmentation, which improves semantic recognition. The CA-LDA model (Latent Dirichlet Allocation Model with Co-word Analysis) is proposed. It adds co-word network analysis to the traditional LDA model: topological parameters of the co-word network (here, betweenness centrality) serve as weights that control the assignment of words to topics, so that words with both high co-occurrence (betweenness) and high frequency are extracted first. 【Results】 The CA-LDA model yields high-frequency words that co-occur across many documents, and the resulting list of key words is more meaningful for topic analysis. The results reflect not only word-frequency probabilities but also reveal hub words through word associations, giving a deeper understanding of the research hotspots in the field. 【Limitations】 The number of topics K in the CA-LDA model is chosen by cross-validation using the perplexity criterion. If K turns out to be too large in practice, it hinders the classification and organization of literature topics; future research needs to post-process the results to consolidate the topics. 【Conclusions】 The model is applied to the analysis of hot topics in traffic law research and performs well on large-scale literature data. The approach can be extended to the automated processing of large-scale literature data in other fields.
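The weighting idea in the Methods paragraph can be illustrated with a small sketch. The abstract does not give the actual CA-LDA weighting formula, so everything here is an assumption made for illustration: the function names (`build_coword_graph`, `betweenness`, `weighted_ranking`), the toy documents, and the product `frequency * (smoothing + betweenness)` used to combine the two signals. The sketch covers only the co-word weighting step, not LDA inference, and uses Brandes' algorithm on an unweighted co-word graph.

```python
from collections import Counter, deque
from itertools import combinations


def build_coword_graph(docs):
    """Link two words whenever they appear in the same document."""
    graph = {}
    for words in docs:
        unique = sorted(set(words))
        for w in unique:
            graph.setdefault(w, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}   # -1 marks "not reached yet"
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair of endpoints is counted twice.
    return {v: c / 2 for v, c in score.items()}


def weighted_ranking(docs, smoothing=1.0):
    """Rank words by frequency * (smoothing + betweenness); a hypothetical weighting."""
    freq = Counter(w for words in docs for w in words)
    central = betweenness(build_coword_graph(docs))
    weights = {w: freq[w] * (smoothing + central[w]) for w in freq}
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (invented): "drone" is frequent but confined to one document,
# while "liability" bridges three otherwise unrelated documents.
docs = [
    ["liability", "accident", "insurance"],
    ["liability", "driver", "penalty"],
    ["liability", "autonomous", "vehicle"],
    ["drone", "drone", "drone", "drone", "drone", "airspace"],
]
ranking = weighted_ranking(docs)
print(ranking[:2])  # [('liability', 39.0), ('drone', 5.0)]
```

By raw frequency "drone" would rank first, but the bridging word "liability" overtakes it once betweenness is factored in. This mirrors the stated goal of down-weighting high-frequency words that come from only a few documents.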