Identifying Word Translations in Scientific Literature based on Labeled Bilingual Topic Model and Co

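Source: The 17th China National Conference on Computational Linguistics and the 6th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (CCL 2018)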
  To exploit the increasingly rich multilingual resources and multi-label data in scientific literature, and to mine the correlations between languages, this paper proposes a labeled bilingual topic model and a similarity metric based on co-occurrence features, both of which can be applied to the task of identifying word translations. First, on the assumption that the keywords of a scientific article are relevant to its abstract, the keywords are extracted and treated as labels; topics are assigned to these labels, so that the “latent” topics are instantiated. Second, the labeled bilingual topic model is trained on the abstracts to obtain a representation of each word over the topic distribution. Finally, the most similar word across the two languages is matched using the proposed similarity metric. The experimental results show that the labeled bilingual topic model achieves higher precision than a bilingual model based on “latent” topics, and that the co-occurrence features strengthen the attraction between bilingual word pairs, which improves identification.
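The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: each word is taken to be a vector over a shared set of label topics, similarity is plain cosine similarity, and the co-occurrence feature is mixed in linearly through a hypothetical weight `alpha`. The function names, the toy vectors, and the `cooc` table are likewise hypothetical and are not the paper's metric.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_translation(src_word, src_topics, tgt_topics, cooc=None, alpha=0.5):
    """Return (target_word, score) for the best-scoring target word.

    src_topics / tgt_topics: dict mapping word -> topic vector over shared label topics.
    cooc: optional dict mapping (src_word, tgt_word) -> co-occurrence score in [0, 1].
    alpha: ASSUMED linear mixing weight; the paper's real combination is not stated.
    """
    src_vec = src_topics[src_word]
    best_word, best_score = None, float("-inf")
    for tgt_word, tgt_vec in tgt_topics.items():
        score = cosine(src_vec, tgt_vec)
        if cooc is not None:
            # Blend topic similarity with the co-occurrence feature.
            score = (1 - alpha) * score + alpha * cooc.get((src_word, tgt_word), 0.0)
        if score > best_score:
            best_word, best_score = tgt_word, score
    return best_word, best_score


# Toy data: three label topics, two Chinese source words, three English candidates.
src_topics = {"翻译": [0.8, 0.1, 0.1], "主题": [0.1, 0.8, 0.1]}
tgt_topics = {
    "translation": [0.7, 0.2, 0.1],
    "topic": [0.1, 0.7, 0.2],
    "parsing": [0.1, 0.1, 0.8],
}

word, score = best_translation("翻译", src_topics, tgt_topics)
print(word, round(score, 3))
```

With topic similarity alone, candidates whose topic vectors are identical cannot be told apart. In this sketch the co-occurrence term is what breaks such ties, which is one plausible reading of the abstract's claim that co-occurrence features improve identification.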