[Affiliation]: Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China; University of Scien
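[Source]: The 15th China National Conference on Computational Linguistics (CCL 2016) and the 4th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD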
[Abstract]: Unlike previous Mongolian morphological segmentation methods, which rely on large labeled training sets or on complicated rules compiled by linguists, we explore a novel semi-supervised method for a practical application, namely statistical machine translation (SMT), in a low-resource learning setting where a small amount of labeled data and a large amount of unlabeled data are available. First, a CRF-based supervised model is trained on the small labeled set to predict morpheme boundaries. Then, a lexicon-based segmentation model, using the small labeled set as heuristic information, exploits the abundant unlabeled data to compensate for the weaknesses of the first step. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves segmentation compared with purely supervised learning. In addition, we integrate the morphological segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
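The abstract does not give the label scheme or the lexicon model, so the following is only a minimal sketch under stated assumptions. It shows one common way to cast morpheme boundary prediction as per-character labeling (a hypothetical B/I scheme of the kind a CRF tagger could be trained on), plus a simple greedy suffix-stripping pass as a stand-in for a lexicon-based segmenter. The function names, the B/I labels, the `min_stem` parameter, and the romanized toy word are all illustrative assumptions, not the authors' method.

```python
from typing import Iterable, List


def morphemes_to_labels(morphemes: List[str]) -> List[str]:
    """Encode a segmented word as per-character labels.

    Assumed scheme: 'B' marks the first character of a morpheme,
    'I' marks every other character.
    """
    labels: List[str] = []
    for morpheme in morphemes:
        if not morpheme:
            continue  # ignore empty pieces
        labels.append("B")
        labels.extend("I" * (len(morpheme) - 1))
    return labels


def labels_to_morphemes(word: str, labels: List[str]) -> List[str]:
    """Decode predicted B/I labels back into a list of morphemes."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    morphemes: List[str] = []
    for ch, label in zip(word, labels):
        if label == "B" or not morphemes:
            morphemes.append(ch)  # start a new morpheme
        else:
            morphemes[-1] += ch   # continue the current morpheme
    return morphemes


def lexicon_suffix_split(word: str, suffixes: Iterable[str], min_stem: int = 2) -> List[str]:
    """Greedy stand-in for a lexicon-based segmenter (an assumption).

    Repeatedly strips the longest known suffix from the right while
    at least `min_stem` characters of stem remain.
    """
    ordered = sorted((s for s in suffixes if s), key=len, reverse=True)
    stem = word
    parts: List[str] = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                parts.insert(0, suffix)
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return [stem] + parts


if __name__ == "__main__":
    # Toy romanized example, used purely for illustration.
    segmented = lexicon_suffix_split("nomuudaas", {"uud", "aas"})
    labels = morphemes_to_labels(segmented)
    print(segmented, labels)
    print(labels_to_morphemes("nomuudaas", labels))
```

In a pipeline of the kind the abstract outlines, suffix entries could be harvested from the small labeled set and applied to unlabeled words, with the decoded CRF output and the lexicon output then reconciled by a correction step; how the authors actually combine them is not stated here.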