Efficient Computation of k-Medians over Data Streams Under Memory Constraints

来源 :计算机科学技术学报(英文版) | 被引量 : 0次 | 上传用户:zhangzhao322
下载到本地 , 更方便阅读
声明 : 本文档内容版权归属内容提供方 , 如果您对本文有版权争议 , 可与客服联系进行内容授权或下架
论文部分内容阅读
In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms ofapproximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - e)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.
其他文献
CFD验证和确认是气动计算可信度分析的基本活动.通过这类活动将产生大量数据资源.如何有效管理和利用这些数据资源是基准数据库要解决的主要问题.研究并提出了一种可行的共享
Multi-project multi-site location problems are multi-objective combinational optimization ones with discrete variables which are hard to solve. To do so, the ca
德国在若干年前启动的国家CFD项目MEGAFLOW集中了DLR、大学以及航空工业界的许多CFD研究开发工作,目的是开发并验证能满足工业实践所要求的可靠并有效的数值工具,以对全机进
A fast, matrix-free implicit method based on the Lower-Upper Symmetric Gauss-Seidel (LU-SGS) method was developed to solve the three-dimensional incompressible
给出了两种不同的再入飞行器:充气的再入和下滑演示器(IRDT)和Huygens.两者的任务都涉及在靠近轨道的条件下再入并均已于2005年完成.概述了IRDT的使命和具体设计问题,综述了I
Numerical simulations of 3D turbulent flow in a large-bore axial-flow pump coupled with half-elbow suction sump were performed by using CFD approach. The numeri
Direct conversion of methane into C2 hydrocarbons through alternating current electric field enhanced plasma was studied under room temperature, atmospheric pre
为了研究细长体大攻角非对称流态的机理以及开发新的控制技术进行了风洞实验.实验中观察到了侧向力的双稳态状态,并且它的正负指向很容易被来流中或者模型头部的微小扰动切换