To address two problems in Internet traffic classification, the bottleneck of labeling traffic classes and the imbalanced distribution of sample counts across classes, this paper proposes a traffic classification method based on Bootstrapping. The method trains an initial classifier on a small number of labeled samples, then iteratively uses unlabeled samples to expand the sample set and update the classifier. When constructing the expanded sample set, the correct classification of an unlabeled sample under a given posterior probability distribution is treated as a probabilistic event, and a new confidence measure is built on this view to reduce the number of noisy samples in the expanded set. A heuristic rule based on probably approximately correct (PAC) learning theory gives priority to minority-class samples when adding to the expanded set, which mitigates the class imbalance. Experimental results show that the Bootstrapping-based traffic classifier can improve overall classification accuracy by 9.46% compared with the initial classifier, and improves minority-class classification accuracy by 2.22% compared with existing semi-supervised learning methods.
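The iterative procedure described in the abstract can be illustrated with a minimal self-training sketch. It is not the paper's algorithm: the abstract does not give the confidence formula or the PAC-based selection rule, so the sketch substitutes a softmax over negative squared distances to class centroids as a placeholder confidence, and a simple inverse-frequency quota as a placeholder for the minority-class heuristic. The function names, the `threshold` and `base_quota` parameters, and the nearest-centroid base classifier are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def posterior(centroids, x):
    """Placeholder posterior (assumption): softmax over negative squared distances."""
    scores = {c: -sum((a - b) ** 2 for a, b in zip(x, m))
              for c, m in centroids.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def predict(centroids, x):
    """Return the class with the highest placeholder posterior."""
    p = posterior(centroids, x)
    return max(p, key=p.get)


def bootstrap(X_lab, y_lab, X_unl, rounds=5, threshold=0.9, base_quota=2):
    """Self-training loop: retrain, pseudo-label confident samples, repeat."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(rounds):
        model = train_centroids(X, y)
        counts = Counter(y)
        largest = max(counts.values())

        # Group confident unlabeled samples by their predicted class.
        candidates = {}
        for idx, x in enumerate(pool):
            p = posterior(model, x)
            label = max(p, key=p.get)
            if p[label] >= threshold:
                candidates.setdefault(label, []).append((p[label], idx))

        # Hypothetical imbalance heuristic: classes that are currently
        # smaller may contribute more pseudo-labeled samples per round.
        picked = {}
        for label, items in candidates.items():
            quota = max(1, round(base_quota * largest / counts[label]))
            items.sort(reverse=True)  # most confident first
            for _conf, idx in items[:quota]:
                picked[idx] = label

        if not picked:  # nothing confident enough: stop early
            break
        for idx, label in picked.items():
            X.append(pool[idx])
            y.append(label)
        pool = [x for i, x in enumerate(pool) if i not in picked]

    return train_centroids(X, y), X, y
```

Calling `bootstrap(X_lab, y_lab, X_unl)` returns the final centroids together with the expanded sample set. Samples whose placeholder confidence stays below `threshold` are never added, which is the sketch's stand-in for filtering noisy samples.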