This paper presents Igloo 1.2, a large-scale distributed Web crawler. Igloo adopts a distributed architecture, and a two-level hash mapping algorithm of our own design lets it partition crawling tasks efficiently while allowing the system to scale dynamically. The quality of the crawled pages is an important criterion for evaluating a crawler; Igloo uses the PageRank value as its measure of page quality, which improves the quality of the crawl. The key to faster crawling is removing the performance bottlenecks in the crawler system; we discuss this in detail and propose a UBL database access method based on a "delayed merge" strategy. Experiments show that Igloo quickly reaches high-quality pages while maintaining high performance.
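The abstract does not spell out the two-level hash mapping algorithm, so the following is only an illustrative sketch of one common way such a scheme can be arranged, not Igloo's actual design. It assumes that the first level hashes a URL's host to a fixed logical bucket and the second level maps buckets to crawler nodes through a table, so that adding a node reassigns a few buckets without rehashing any URL. The names `NUM_BUCKETS`, `host_to_bucket`, `build_bucket_table`, `add_node`, and `node_for_url` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

NUM_BUCKETS = 1024  # hypothetical: fixed number of logical buckets


def _stable_hash(text):
    """Hash that is identical across processes (unlike built-in hash())."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def host_to_bucket(host):
    """First level (assumed): host name -> logical bucket."""
    return _stable_hash(host.lower()) % NUM_BUCKETS


def build_bucket_table(nodes):
    """Second level (assumed): bucket -> node, assigned round-robin."""
    return [nodes[b % len(nodes)] for b in range(NUM_BUCKETS)]


def add_node(table, new_node):
    """Give the new node a fair share of buckets; other buckets stay put."""
    nodes = sorted(set(table))
    target = NUM_BUCKETS // (len(nodes) + 1)
    counts = {n: table.count(n) for n in nodes}
    new_table = list(table)
    moved = 0
    for bucket, owner in enumerate(table):
        if moved >= target:
            break
        # Only take buckets from nodes that still hold more than their share.
        if counts[owner] > target:
            new_table[bucket] = new_node
            counts[owner] -= 1
            moved += 1
    return new_table


def node_for_url(url, table):
    """Route a URL to the node that owns its host's bucket."""
    host = urlsplit(url).hostname or ""
    return table[host_to_bucket(host)]


table = build_bucket_table(["node-a", "node-b", "node-c"])
grown = add_node(table, "node-d")
```

In this sketch all pages of one host land on the same node, and growing the cluster only changes the entries of the bucket table that are handed to the new node.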