Automatic categorization of documents into pre-defined taxonomies is a crucial step in data mining and knowledge discovery. Standard machine learning techniques such as support vector machines (SVM) and related large-margin methods have been applied successfully to this task. However, the high dimensionality of the input feature vectors slows classification, while the kernel parameter settings chosen during training and the feature subset selected both affect classification accuracy. The objective of this work is to reduce the dimension of the feature vectors and to optimize the SVM parameters so as to improve both classification accuracy and speed. To speed up classification, we apply rough set theory to reduce the feature vector space; to improve accuracy, we present a genetic algorithm for joint feature selection and SVM parameter optimization. Experimental results indicate that the proposed rough-set-and-genetic-algorithm-based SVM classifier is more effective than a traditional SVM and than other traditional methods such as k-NN and decision trees.
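The abstract does not give the encoding details of the genetic algorithm, but the joint feature-selection-plus-parameter-optimization idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn's `SVC` with an RBF kernel, a synthetic dataset in place of text features, a chromosome made of one bit per feature plus two real genes (log2 C and log2 gamma), and cross-validated accuracy as the fitness function. All names and hyperparameters here are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a (reduced) document-feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
N_FEATURES = X.shape[1]


def fitness(chrom):
    """Chromosome = N_FEATURES bits (feature mask) + [log2 C, log2 gamma]."""
    mask = chrom[:N_FEATURES].astype(bool)
    if not mask.any():                      # empty feature subset is useless
        return 0.0
    C = 2.0 ** chrom[N_FEATURES]
    gamma = 2.0 ** chrom[N_FEATURES + 1]
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    # 3-fold cross-validated accuracy on the selected features.
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()


def evolve(pop_size=20, generations=10):
    # Random initial population: binary feature genes + real parameter genes.
    pop = np.hstack([rng.integers(0, 2, (pop_size, N_FEATURES)).astype(float),
                     rng.uniform(-5.0, 5.0, (pop_size, 2))])
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[:pop_size // 2]]         # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(0, len(parents), 2)]
            cut = rng.integers(1, N_FEATURES + 1)    # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.integers(0, N_FEATURES)       # bit-flip mutation
            child[flip] = 1 - child[flip]
            child[N_FEATURES:] += rng.normal(0, 0.5, 2)  # Gaussian mutation
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()], scores.max()


best, score = evolve()
print(f"best CV accuracy: {score:.3f}, "
      f"features kept: {int(best[:N_FEATURES].sum())} of {N_FEATURES}")
```

The same loop accommodates other fitness functions; in the paper's setting the fitness would be evaluated on rough-set-reduced text features rather than on a synthetic matrix.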