Existing techniques for aggregate query processing over out-of-order data streams cannot reduce query processing latency while also guaranteeing the final correctness of the aggregate query results. To address this limitation, this paper designs a distributed aggregate query processing technique for out-of-order data streams that combines a distributed stream processing module with a distributed batch processing module. On the one hand, the technique adaptively optimizes the buffer size used by the stream processing module, subject to a user-given constraint on result quality, so that the query processing latency of the stream path is kept as low as possible. On the other hand, it processes extremely late stream tuples in batch mode, using the historical stream data backed up in a distributed data storage system, and thereby guarantees the final correctness of the aggregate query results. Experiments on a real out-of-order data stream dataset show that, compared with the current best cache-based out-of-order data stream processing technique, the proposed technique has significant advantages in average query processing latency, query result precision, and system scalability.
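The split between a low-latency stream path and a corrective batch path can be illustrated with a minimal single-process sketch. It is not the paper's algorithm: the class name `HybridWindowSum`, the tumbling-window sum, the in-memory list standing in for the distributed store, and above all the fixed `slack` parameter are assumptions made for illustration. The technique described here adapts the buffer size to a user-given quality constraint, which this sketch does not attempt.

```python
from collections import defaultdict


class HybridWindowSum:
    """Tumbling-window sum: slack-buffered stream path plus full batch recompute."""

    def __init__(self, window_size, slack):
        self.window_size = window_size
        self.slack = slack            # hypothetical fixed buffer size, in event-time units
        self.max_ts = 0               # largest event timestamp seen so far
        self.open = defaultdict(int)  # window start -> partial sum still buffered
        self.emitted = {}             # window start -> approximate result already released
        self.store = []               # stand-in for the distributed store of all tuples

    def on_tuple(self, ts, value):
        self.store.append((ts, value))  # every tuple is backed up for the batch path
        start = ts - ts % self.window_size
        if start in self.emitted:
            return                      # too late for the stream path; batch path fixes it
        self.open[start] += value
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - self.slack
        # Release every buffered window that ends at or before the watermark.
        for s in sorted(self.open):
            if s + self.window_size <= watermark:
                self.emitted[s] = self.open.pop(s)

    def batch_correct(self):
        """Recompute exact sums for all windows from the backed-up tuples."""
        exact = defaultdict(int)
        for ts, value in self.store:
            exact[ts - ts % self.window_size] += value
        return dict(exact)


agg = HybridWindowSum(window_size=10, slack=5)
for ts, v in [(1, 1), (3, 2), (12, 4), (16, 8), (4, 16), (27, 32)]:
    agg.on_tuple(ts, v)
print(agg.emitted)          # {0: 3, 10: 12}
print(agg.batch_correct())  # {0: 19, 10: 12, 20: 32}
```

In this run the tuple with timestamp 4 arrives after window [0, 10) has already been released, so the stream result for that window (3) is only approximate, while the batch recompute recovers the exact sum (19). A smaller `slack` releases results sooner but misses more late tuples, which is the trade-off an adaptive buffer size is meant to manage.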