This paper proposes a phased mechanism for automatically detecting spam pages based on link analysis. First, a decision tree and a link analysis model examine the indegree and outdegree of every node in Wikipedia to produce a candidate list, and a heuristic algorithm is introduced to reduce Type I errors. Next, a classifier is designed to classify the candidate list: the TrustRank and SpamRank algorithms start from a trusted seed set and a spam seed set, respectively, and estimate for each page in the system the probability that it is trustworthy and the probability that it is spam, which reduces Type II errors. The resulting candidate set is then pushed to page editors, and the editors' judgments are fed back to train the model and adjust its weights. The results show that the phased detection model can detect spam pages automatically, with precision and recall of 78.3% and 94%, respectively.
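The two link-based stages described above can be illustrated with a minimal Python sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the toy graph, the function names (`degree_candidates`, `seeded_rank`, `reverse_links`), the fixed degree thresholds (the paper uses a decision tree instead), and the damping factor. Trust is propagated with a seed-biased PageRank in the style of TrustRank. Because the abstract does not say how SpamRank propagates its scores, the sketch borrows the Anti-TrustRank idea of propagating from the spam seeds over reversed links.

```python
from collections import defaultdict


def degree_candidates(links, min_out=3, max_in=1):
    """Flag pages with many outgoing links but few incoming ones.

    The fixed thresholds are hypothetical stand-ins for a learned rule.
    """
    indegree = defaultdict(int)
    for targets in links.values():
        for dst in targets:
            indegree[dst] += 1
    return sorted(
        page for page, targets in links.items()
        if len(targets) >= min_out and indegree[page] <= max_in
    )


def seeded_rank(links, seeds, damping=0.85, iterations=50):
    """Seed-biased PageRank: teleportation returns only to the seed pages."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    seed_set = set(seeds)
    base = {p: (1.0 / len(seed_set) if p in seed_set else 0.0) for p in pages}
    score = dict(base)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * base[p] for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if targets:
                # Split the damped score evenly among the linked pages.
                share = damping * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling page: hand its damped score back to the seeds.
                for p in seed_set:
                    nxt[p] += damping * score[src] * base[p]
        score = nxt
    return score


def reverse_links(links):
    """Return the graph with every edge direction flipped."""
    reversed_links = {page: [] for page in links}
    for src, targets in links.items():
        for dst in targets:
            reversed_links.setdefault(dst, []).append(src)
    return reversed_links


# Hypothetical toy graph: page -> pages it links to.
links = {
    "home": ["about", "docs"],
    "about": ["home"],
    "docs": ["home", "about"],
    "promo": ["home", "about", "docs", "casino"],
    "casino": ["promo"],
}

candidates = degree_candidates(links)
trust = seeded_rank(links, seeds=["home"])
spam = seeded_rank(reverse_links(links), seeds=["casino"])

for page in candidates:
    print(page, "trust:", trust[page], "spam:", spam[page])
```

In this toy graph no trusted page links to `promo`, so it receives no trust, while spam score flows to it from the spam seed. A second-stage classifier could combine the two scores, and editor feedback could then be used to retune the thresholds or weights.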