As corpus linguistics develops, it demands ever larger corpora. The rapid growth of electronic publishing has made it possible to acquire large quantities of machine-readable text and build large-scale corpora. The raw text collected this way is unorganized, however, and must be sorted into categories before it can be processed further; doing so by hand is very labor-intensive. This paper presents a method for classifying corpus texts automatically. It introduces the concept of a corpus correlation coefficient and exploits the fact that this coefficient differs between texts of different categories, achieving 93% accuracy in classification into major categories.
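The abstract names a corpus correlation coefficient but does not define it, so the following is only an illustrative sketch of correlation-based classification, not the paper's method. It assumes, purely for illustration, that the coefficient is the Pearson correlation between relative character-frequency profiles and that each category is represented by one reference sample. The function names `char_frequencies`, `correlation`, and `classify`, and the toy reference texts, are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_frequencies(text):
    """Relative frequency of each non-whitespace character in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def correlation(freq_a, freq_b):
    """Pearson correlation of two frequency profiles over their combined vocabulary.

    Assumption: this stands in for the paper's coefficient, whose actual
    definition is not given in the abstract.
    """
    vocab = sorted(set(freq_a) | set(freq_b))
    if len(vocab) < 2:
        return 0.0
    xs = [freq_a.get(ch, 0.0) for ch in vocab]
    ys = [freq_b.get(ch, 0.0) for ch in vocab]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no information to correlate
    return cov / sqrt(var_x * var_y)


def classify(text, reference_texts):
    """Return the best-correlated category and the score for every category."""
    profile = char_frequencies(text)
    scores = {
        category: correlation(profile, char_frequencies(sample))
        for category, sample in reference_texts.items()
    }
    return max(scores, key=scores.get), scores


# Toy reference samples (hypothetical, far smaller than any real corpus).
references = {
    "sports": "球队 比赛 进球 球员 比赛 冠军 球队",
    "finance": "股票 市场 银行 利率 股票 投资 银行",
}
label, scores = classify("球员 进球 比赛", references)
print(label, scores)
```

Character frequencies are used here only because they need no word segmentation for Chinese text; a real system would compare against far larger reference samples per category and might use words or other features instead.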