Abstract: Machine-learning approaches to P2P network traffic identification require large amounts of labeled training data. To address this problem, a P2P traffic identification method based on an improved graph-based semi-supervised support vector machine is proposed. A self-tuning Gaussian kernel function computes the similarity distance between a small number of labeled samples and a large number of unlabeled training samples to construct the graph model, and local distribution information of the training samples is embedded in the label propagation process to obtain labels for the unlabeled samples. All labeled samples are then used to train an SVM for P2P network traffic identification. Experimental results show that the method takes the information of the whole training set into account, improving SVM identification accuracy while greatly reducing the cost of manually labeling training samples.
Keywords: P2P network traffic identification; graph; semi-supervised learning; label propagation
CLC number: TP393  Document code: A
1 Introduction
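Peer-to-peer (P2P) technology has caused many problems for the normal operation of public networks such as the Internet, enterprise intranets, and campus networks, including reduced available bandwidth, degraded quality of service, and increased capacity-expansion costs. Monitoring P2P traffic has therefore become one of the main tasks of network management, and accurate identification of P2P traffic is the basis for such monitoring.
In recent years, supervised machine learning methods based on transport-layer statistical features have become a research focus in P2P traffic identification [1-8]. Although these methods achieve good identification results, they all presuppose correctly labeled training samples. In real network environments, however, unlabeled traffic data is easy to obtain, whereas having domain experts label it correctly is both costly and time-consuming. Since semi-supervised learning can use the information in unlabeled data to improve identification accuracy when labeled samples are scarce, this paper proposes a P2P network traffic identification method based on an improved graph-based semi-supervised support vector machine (Improved Graph Semi-Supervised Learning SVM, IGSSLSVM). Experimental results show that the method takes the information of the whole sample set into account, improving SVM identification accuracy while significantly reducing the cost of manually labeling training samples.
2 Support Vector Machine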
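The support vector machine is built on the structural risk minimization principle of statistical learning theory. Given limited sample information, it seeks the best trade-off between model complexity and learning ability in order to obtain the best generalization performance. Through a kernel function, it maps samples that are not linearly separable in the original feature space into a higher-dimensional feature space, where the decision boundary becomes linear and classification improves.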
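The effect of mapping to a higher-dimensional space can be illustrated with a small sketch. The data, the feature map `phi`, the matching `kernel`, and the hand-picked weights below are all hypothetical choices for illustration and do not come from the paper; no SVM is trained here, the code only shows that a linear rule succeeds after the mapping where no threshold on the raw feature can.

```python
# Illustrative 1-D data (hypothetical): the positive class lies on both sides
# of the negative class, so no single threshold on x separates them.
samples = [(-2.0, 1), (-0.5, -1), (0.5, -1), (2.0, 1)]

def phi(x):
    """Explicit feature map x -> (x, x^2) into a 2-D feature space."""
    return (x, x * x)

def kernel(x, z):
    """Kernel equal to the inner product phi(x) . phi(z)."""
    return x * z + (x * x) * (z * z)

def linear_decision(x, w=(0.0, 1.0), b=-1.0):
    """Linear classifier in feature space: sign(w . phi(x) + b).

    The weights are hand-picked for this toy data, not learned.
    """
    fx = phi(x)
    score = w[0] * fx[0] + w[1] * fx[1] + b
    return 1 if score > 0 else -1

def separable_by_threshold(data):
    """Return True if some threshold on raw x (either orientation) separates the classes."""
    xs = sorted(x for x, _ in data)
    cuts = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    for t in cuts:
        for sign in (1, -1):
            if all((1 if sign * (x - t) > 0 else -1) == y for x, y in data):
                return True
    return False

# True when the linear rule in feature space classifies every sample correctly.
all_correct = all(linear_decision(x) == y for x, y in samples)
```

In practice the feature map is never computed explicitly; an SVM only evaluates the kernel on pairs of samples, which is why `kernel` is written to agree with the inner product of `phi`.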
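The procedure can be stated as follows. For a binary classification problem with n samples, let {(x1,y1),(x2,y2),…,(xn,yn)} be the given training samples and their desired outputs; the goal is to find the optimal weight vector ω and threshold b that minimize the cost function in Eq. (1).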
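Eq. (1) is not reproduced in this excerpt. For orientation only, the textbook soft-margin SVM objective, which is consistent with the description above, is sketched below. The penalty parameter C, the slack variables ξ_i, and the feature map φ are notation assumed here, and the paper's own Eq. (1) may differ in form.

```latex
\min_{\omega,\, b,\, \xi} \; \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i\left(\omega^{\top}\varphi(x_i) + b\right) \ge 1 - \xi_i,\;\;
\xi_i \ge 0,\;\; i = 1, \dots, n
```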
6 Conclusion
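Supervised machine learning has produced many results in P2P traffic identification, but all such methods require a large number of correctly labeled training samples. In real network environments, however, unlabeled traffic data is easy to obtain, whereas having domain experts label it correctly is both costly and time-consuming. Semi-supervised learning can use the information in unlabeled data to improve classification accuracy when labeled samples are scarce, which gives it a clear advantage in P2P traffic identification. This paper therefore proposes a P2P network traffic identification method based on an improved graph-based semi-supervised support vector machine. Experimental results show that the method improves P2P traffic identification accuracy when only a small number of labeled training samples are available.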
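As a rough illustration of the graph-based labeling step that the method builds on, the sketch below runs basic label propagation in the spirit of [9] on a fully connected graph with Gaussian edge weights. It is a baseline under stated assumptions, not the improved algorithm: the kernel width `sigma` is fixed rather than self-tuning, no local distribution term is included, and the function names and toy data are hypothetical.

```python
import math

def gaussian_weight(a, b, sigma):
    """Gaussian (RBF) similarity between two feature vectors."""
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def propagate_labels(points, labels, sigma=1.0, iterations=100):
    """Basic graph label propagation (fixed sigma; an assumed baseline).

    points: list of feature vectors.
    labels: class index for labeled samples, None for unlabeled ones.
    Returns a predicted class index for every sample.
    """
    n = len(points)
    classes = sorted({c for c in labels if c is not None})

    # Fully connected graph with Gaussian edge weights and no self-loops.
    w = [[0.0 if i == j else gaussian_weight(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]

    # Row-normalize the weights to obtain transition probabilities.
    p = []
    for row in w:
        total = sum(row)
        p.append([wij / total for wij in row])

    def initial(c):
        """One-hot distribution for a labeled sample, uniform for an unlabeled one."""
        if c is None:
            return [1.0 / len(classes)] * len(classes)
        return [1.0 if k == c else 0.0 for k in classes]

    y = [initial(c) for c in labels]
    for _ in range(iterations):
        # Each sample takes the weighted average of its neighbors' distributions.
        new_y = [[sum(p[i][j] * y[j][k] for j in range(n))
                  for k in range(len(classes))] for i in range(n)]
        # Clamp labeled samples back to their known labels.
        y = [initial(labels[i]) if labels[i] is not None else new_y[i]
             for i in range(n)]

    return [classes[max(range(len(classes)), key=lambda k: y[i][k])]
            for i in range(n)]

# Toy data (hypothetical): two well-separated clusters, one labeled sample each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0), (3.2, 2.9), (2.9, 3.1)]
labels = [0, None, None, 1, None, None]
predicted = propagate_labels(points, labels, sigma=0.5)
```

In the method described here, the labels obtained this way would be combined with the original labeled samples to train the SVM.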
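In future work, we plan to run experiments on real traffic data from a campus network, with the aim of discovering more traffic features and further refining the algorithm model.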
References
[1] WANG R, LIU Y, YANG Y, et al. Solving the app-level classification problem of P2P traffic via optimized support vector machine[C]//Proc of the 6th Int Conf on Intelligent Systems Design and Applications. Piscataway, NJ: IEEE, 2006: 534-539.
[2] ZUEV D, MOORE A. Traffic classification using statistical approach[G]//LNCS 3431: Proc of the 6th Int Workshop on Passive and Active Network Measurement. Berlin: Springer, 2005: 321-324.
[3] CONSTANTINOU F, MAVROMMATIS P. Identifying known and unknown peer-to-peer traffic[C]//Proc of the 5th IEEE Int Symp on Network Computing and Application. Piscataway, NJ: IEEE, 2006: 93-102.
[4] CHEN H, HU Z, YE Z, et al. Research of P2P traffic classification based on BP neural network[C]//Proc of the 1st Int Symp on Computer Network and Multimedia Technology. Piscataway, NJ: IEEE, 2009: 579-582.
[5] YANG A, JIANG S, DENG H. A P2P network traffic classification method using SVM[C]//Proc of the 9th Int Conf on Young Computer Scientists. Piscataway, NJ: IEEE, 2008: 398-403.
[6] LIU F, LI Z, NIE Q. A new method of P2P network traffic classification based on support vector machine at the host level[C]//Proc of the Int Conf on Information Technology and Computer Science. Piscataway, NJ: IEEE, 2009: 579-582.
[7] LI Zhiyuan, WANG Ruchuan. A P2P network traffic identification method based on machine learning[J]. Journal of Computer Research and Development, 2011, 48(12): 2253-2260. (in Chinese)
[8] GUO Wei, WANG Xichuang, LIU Xiaozhen. P2P traffic identification method based on K-means and twin support vector machine[J]. Journal of Computer Applications, 2013, 33(10): 2734-2738. (in Chinese)
[9] ZHU X, GHAHRAMANI Z. Learning from labeled and unlabeled data with label propagation[R]. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002: 1-7.
[10] DUDA R O, HART P E, STORK D G. Pattern Classification[M]. 2nd ed. New York: Wiley, 2000: 134-143.
[11] MOORE A W, ZUEV D. Internet traffic classification using Bayesian analysis techniques[C]//International Conference on Measurement and Modeling of Computer Systems, Alberta, Canada, 2005: 50-60.