[Purpose/Significance] In both statistical and neural machine translation, training data usually come from heterogeneous sources and span diverse topics and genres, so their domain cannot be guaranteed to match that of the text to be translated. This mismatch gives rise to the domain adaptation problem. Most current domain adaptation methods for machine translation derive topic information from topic models and coarsely split the data into in-domain and out-of-domain, without more explicit domain labels. [Method/Process] This study uses Chinese Library Classification (CLC) numbers as domain labels and applies two methods to label Chinese sentences with domains automatically: a method based on a domain knowledge base built from knowledge organization resources such as paper keywords and a scientific and technical term system, and a deep learning method based on a trained convolutional neural network. A neural network deep fusion model combines the two methods into a better-performing domain labeler. The set of domain labels obtained from the machine translation test set is then used to filter the training data. [Result/Conclusion] In tests on a neural machine translation system with two domain-specific test sets, using only part of the training data yielded translations scoring about 1.3 BLEU higher (5.4% relative) than those obtained with the original training data, demonstrating the effectiveness and feasibility of the proposed method.
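The data selection step described in the abstract (collect the domain labels that occur in the test set, then keep only the training pairs that carry one of those labels) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy keyword table, and the example sentences are all hypothetical, and the keyword lookup merely stands in for the fused knowledge-base and convolutional neural network labeler, which the abstract does not specify in detail. The labels `TP` and `R` are top-level CLC classes (automation and computer technology; medicine and health).

```python
from typing import Callable, Iterable, List, Set, Tuple

# A parallel sentence pair: (Chinese source, target-language translation).
SentencePair = Tuple[str, str]

# Hypothetical keyword table standing in for the real domain labeler.
TOY_KEYWORDS = {"神经网络": "TP", "药物": "R"}


def toy_label(sentence: str) -> str:
    """Return a CLC-style label for a sentence, or "UNK" if no keyword matches."""
    for keyword, label in TOY_KEYWORDS.items():
        if keyword in sentence:
            return label
    return "UNK"


def collect_test_labels(
    test_sources: Iterable[str], label_fn: Callable[[str], str]
) -> Set[str]:
    """Gather the set of domain labels that occur in the test set."""
    return {label_fn(sentence) for sentence in test_sources}


def select_training_pairs(
    train_pairs: Iterable[SentencePair],
    label_fn: Callable[[str], str],
    wanted: Set[str],
) -> List[SentencePair]:
    """Keep only training pairs whose source-side label is in the wanted set."""
    return [(src, tgt) for src, tgt in train_pairs if label_fn(src) in wanted]


# Made-up example data, for illustration only.
train_pairs = [
    ("神经网络需要大量数据", "Neural networks need a lot of data"),
    ("这种药物有副作用", "This drug has side effects"),
    ("今天天气很好", "The weather is nice today"),
]
test_sources = ["卷积神经网络用于分类"]

wanted_labels = collect_test_labels(test_sources, toy_label)
selected = select_training_pairs(train_pairs, toy_label, wanted_labels)
print(wanted_labels, len(selected))
```

In this sketch the labeler is passed in as a function, so a stronger classifier could replace `toy_label` without changing the filtering logic. The selected subset would then be used to train the translation system.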