With the growth of networked communication, research on spam filtering has become imperative. However, characteristics of mail datasets such as data sparseness, high dimensionality, and multicollinearity in mail content distinguish spam filtering from general text classification. In this paper, we propose a new Partial Least Squares (PLS) feature extraction method for spam filtering. It extracts latent semantic components, far fewer in number than the full feature set, as linear combinations of the original features, thereby compressing the original data and handling multicollinearity better. Experiments on the CEAS 2006 benchmark datasets (Enron-Spam datasets), evaluated following the TREC spam track, show promising results, and the new method outperforms feature selection by χ² statistics.
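To make the idea of extracting a few latent components by linear combination concrete, the following is a minimal sketch of single-response PLS computed with the NIPALS procedure. It is not the authors' implementation: the choice of NIPALS, the function name `pls1_scores`, and the tiny term-count matrix are all assumptions made purely for illustration, and the data are not drawn from the Enron-Spam corpus.

```python
from math import sqrt


def pls1_scores(X, y, n_components):
    """Return latent score vectors (one list per component) via NIPALS PLS1.

    Hypothetical helper for illustration; X is a list of term-count rows,
    y is a list of labels (1 = spam, 0 = ham).
    """
    n, d = len(X), len(X[0])
    # Center each feature column and the label vector.
    col_means = [sum(row[j] for row in X) / n for j in range(d)]
    E = [[row[j] - col_means[j] for j in range(d)] for row in X]
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]

    scores = []
    for _ in range(n_components):
        # Weight vector: feature-space direction with maximal covariance with the labels.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(d)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left that covaries with the labels
        w = [v / norm for v in w]
        # Score vector: each message projected onto w (a linear combination of terms).
        t = [sum(E[i][j] * w[j] for j in range(d)) for i in range(n)]
        tt = sum(v * v for v in t)
        # Loadings for X and y, then deflate both so the next component is orthogonal.
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(d)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        E = [[E[i][j] - t[i] * p[j] for j in range(d)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        scores.append(t)
    return scores


# Made-up toy data: 6 messages x 5 term counts; first three rows labeled spam.
X = [
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [4, 1, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 0],
    [1, 0, 4, 2, 1],
]
y = [1, 1, 1, 0, 0, 0]

# Compress 5 original features into 2 latent components per message.
scores = pls1_scores(X, y, n_components=2)
```

Each entry of `scores` holds one value per message, so a downstream classifier would see two features instead of five. Because each component is deflated from the data before the next is computed, the score vectors are mutually orthogonal, which is one way such a method sidesteps multicollinearity among the original term features.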