This paper focuses on the strategy design of web crawlers and on algorithms for computing the topical relevance of web pages, and examines how each algorithm is implemented along with its advantages and disadvantages.

1 Breadth-First Traversal Strategy

Breadth-first search [1] (BFS) has the web crawler start from an initial set of URLs and traverse pages level by level. Only after it has visited every URL linked from the pages of the current level does it move on to the pages of the next level. This process repeats until the crawl task is complete or a traversal stop condition is reached. Therefore, breadth-first traversal is also called
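The level-by-level traversal described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `bfs_crawl`, `get_links`, `max_depth`, `max_pages`, and the in-memory `TOY_WEB` link graph are all assumptions introduced for illustration, and a real crawler would replace `get_links` with code that downloads a page and extracts its links.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}


def toy_links(url):
    """Stand-in for fetching a page and extracting its outgoing links."""
    return TOY_WEB.get(url, [])


def bfs_crawl(seed_urls, get_links, max_depth=None, max_pages=None):
    """Visit URLs level by level and return (url, depth) pairs in visit order.

    max_depth and max_pages are illustrative stop conditions; None disables them.
    """
    seen = set()     # URLs already queued, so no page is visited twice
    queue = deque()  # FIFO queue: shallower pages always leave before deeper ones
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    visited = []
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break  # stop condition: page budget exhausted
        url, depth = queue.popleft()
        visited.append((url, depth))
        if max_depth is not None and depth >= max_depth:
            continue  # stop condition: do not expand links beyond this level
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    for url, depth in bfs_crawl(["a"], toy_links):
        print(depth, url)
```

The first-in, first-out queue is what enforces the level order: every link found on a level-n page is appended behind all remaining level-n pages, so level n+1 is only reached once level n is exhausted. The `seen` set keeps cyclic links (such as `b` linking back to `a`) from being crawled again.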