With the growth of the Internet, a large number of documents have appeared online, and automatic text categorization has become a key technology for processing massive data. Among the many text classification algorithms, kNN has been shown to be one of the best. For most text classification tasks, text preprocessing is the bottleneck, and its quality directly affects classification performance. This paper introduces a new, Gini-based text preprocessing algorithm. It also uses fuzzy set theory to improve the kNN decision rule. Combining the two gives a fuzzy kNN classifier that performs better than traditional kNN. Experimental results show that the improvement is effective and feasible.
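To make the idea of a fuzzy kNN decision rule concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not give. It assumes crisp training labels, cosine similarity over sparse term-weight vectors, and a distance-weighted membership in the style commonly used for fuzzy kNN; the names `cosine`, `fuzzy_knn`, and the fuzzifier `m` are hypothetical choices made for illustration. The Gini-based preprocessing step is not shown.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_knn(query, train, k=3, m=2.0):
    """Return a dict mapping each class to a membership degree in [0, 1].

    train is a list of (vector, label) pairs. Instead of a hard majority
    vote, each of the k nearest neighbours contributes a weight that grows
    as its distance to the query shrinks (assumed weighting, fuzzifier m > 1).
    """
    neighbours = sorted(
        ((cosine(query, vec), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    total = 0.0
    for sim, label in neighbours:
        dist = 1.0 - sim
        # Small constant avoids division by zero for identical documents.
        weight = 1.0 / (dist ** (2.0 / (m - 1.0)) + 1e-9)
        scores[label] += weight
        total += weight
    return {label: score / total for label, score in scores.items()}


# Toy data, invented purely for illustration.
train = [
    ({"ball": 1.0, "team": 1.0}, "sports"),
    ({"ball": 1.0, "goal": 1.0}, "sports"),
    ({"stock": 1.0, "bank": 1.0}, "finance"),
]
memberships = fuzzy_knn({"ball": 1.0, "team": 0.5}, train, k=3)
predicted = max(memberships, key=memberships.get)
```

Unlike a plain majority vote, the result is a graded membership per class, so a document that sits between categories is visible as such before the final label is taken with `max`.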