Short text classification based on strong feature thesaurus

Source: Journal of Zhejiang University (English Edition), Series C: Computers & Electronics
Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers have failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in the SFT, the classification accuracy can be improved. In particular, our method appears to be more effective as the classification becomes more fine-grained. Experiments on two short text datasets demonstrate that our approach achieves an improvement over state-of-the-art methods, including support vector machine (SVM) and Naïve Bayes Multinomial.
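The abstract does not spell out how the SFT is built, so the following is only a minimal sketch of the general idea, not the authors' procedure. It ranks terms by information gain on a toy corpus, keeps the top-ranked terms as a stand-in thesaurus, and multiplies their term counts by a boost factor. The LDA component is omitted entirely. The function names, the toy documents, the `top_k` cutoff, and the boost value of 2.0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:  # skip empty partitions to avoid dividing by zero
            conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def build_thesaurus(docs, labels, top_k):
    """Keep the top_k terms by information gain (hypothetical SFT stand-in)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:top_k])


def weighted_counts(doc, thesaurus, boost=2.0):
    """Term counts, with thesaurus terms scaled by an assumed boost factor."""
    counts = Counter(doc)
    return {t: c * (boost if t in thesaurus else 1.0) for t, c in counts.items()}


# Toy, hand-made corpus (not from the paper): tokenized short texts and labels.
docs = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "news"],
    ["football", "match", "news"],
    ["football", "score", "deal"],
]
labels = ["travel", "travel", "sport", "sport"]

thesaurus = build_thesaurus(docs, labels, top_k=2)
features = weighted_counts(docs[1], thesaurus)
```

The boosted counts in `features` could then be passed to any standard classifier, such as an SVM or multinomial Naïve Bayes. How the paper combines LDA topics with IG scores, and what weights it actually uses, is not recoverable from the abstract.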