Research on a Sentence Alignment Method for Web Page Text Based on the Vector Space Model

Source: 11th National Conference on Man-Machine Speech Communication (第十一届全国人机语音通讯学术会议)
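  Besides the content that corresponds across the two languages, parallel web page texts also contain unrelated noise, so methods that rely on the structural similarity of web pages are of limited use for sentence alignment in parallel web pages. Introducing a bilingual translation dictionary or a thesaurus can improve alignment quality, but bilingual dictionaries are limited in size and cannot cover all corresponding vocabulary.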
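  This paper aligns the sentences of parallel web page texts using the similarity computation provided by the vector space model. In this model, each sentence of the web page text is represented as a vector in the feature space. Content words are selected as feature terms, the CHI statistic is used to compute the degree of association between words, the TF-IDF algorithm is used to compute feature-term weights, and the cosine measure is used to compute the similarity between sentence vectors. Experiments were designed with Mongolian-Chinese parallel web pages as the test data, and the results confirm the effectiveness of the method.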
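The abstract names its components but gives no formulas, so the following is only a minimal sketch built from the standard textbook definitions of the CHI statistic, TF-IDF weighting, and cosine similarity. It is not the authors' implementation. All function names are hypothetical, the smoothed IDF variant is an assumption, and the sketch assumes that sentences are already tokenized into content words and that the terms of both languages have been mapped into a shared feature space, a step the abstract does not describe.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """CHI statistic for a 2x2 co-occurrence table of two words.

    a: units containing both words, b: only the first word,
    c: only the second word, d: neither word.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def tfidf_vectors(sentences):
    """Turn token lists into sparse {term: weight} TF-IDF vectors.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, so that
    terms occurring in every sentence still get a non-zero weight.
    """
    n = len(sentences)
    df = Counter(term for s in sentences for term in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append({
            t: (count / len(s)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data: tokens stand for feature terms already mapped to a shared space.
src = [["grassland", "horse"], ["city", "school"]]
tgt = [["school", "city", "student"], ["horse", "grassland"]]

vecs = tfidf_vectors(src + tgt)
src_vecs, tgt_vecs = vecs[:len(src)], vecs[len(src):]

# For each source sentence, pick the most similar target sentence.
best = [
    max(range(len(tgt_vecs)), key=lambda j: cosine(sv, tgt_vecs[j]))
    for sv in src_vecs
]
print(best)  # [1, 0]
```

Here each source sentence is simply paired with its highest-scoring target sentence. A real aligner would probably also need a similarity threshold to reject the noise sentences mentioned above, and a way to handle one-to-many alignments.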