As the Internet has grown and online information has become richer, traditional information processing has extended to the Internet. Processing information on the Internet often requires downloading Web pages scattered across it to local storage for further processing; this is the core function of the Web page collection tool discussed here. The page collection system combines the link relationships between Web pages with page content and adds a multi-level page filtering module, so it can be used to collect Web pages within a specific domain. Several machines can collect in parallel to improve collection efficiency. Collection metadata is stored in a large-scale database and the collected pages are compressed, which allows the system to support massive data volumes. A dynamic update mechanism keeps the locally downloaded pages up to date.
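The abstract gives no implementation details, so the following is only a minimal sketch of how link-following, content filtering, and compressed storage might fit together. Everything in it is an assumption made for illustration: the names `LinkExtractor`, `is_relevant`, and `collect`, the keyword-count filter, the in-memory `PAGES` dictionary standing in for real HTTP fetches, and the use of `zlib` for compression. The multi-level filter, parallel collection, database storage, and update mechanism of the actual system are not modeled.

```python
import zlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_relevant(html, keywords, min_hits=1):
    """Assumed content filter: keep a page if enough topic keywords occur."""
    text = html.lower()
    return sum(1 for k in keywords if k.lower() in text) >= min_hits


def collect(seed, fetch, keywords, max_pages=100):
    """Breadth-first collection; `fetch` maps a URL to HTML or None."""
    queue, seen, store = deque([seed]), {seed}, {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None or not is_relevant(html, keywords):
            continue  # filtered pages are neither stored nor expanded
        store[url] = zlib.compress(html.encode("utf-8"))  # compressed storage
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return store


# Toy in-memory "web" standing in for real HTTP fetches (assumption).
PAGES = {
    "http://example.org/": '<a href="/a">a</a> <a href="/b">b</a> crawler index',
    "http://example.org/a": 'notes on crawler design <a href="/c">c</a>',
    "http://example.org/b": 'cooking recipes <a href="/d">d</a>',
    "http://example.org/c": "crawler scheduling",
    "http://example.org/d": "crawler page reachable only via an off-topic page",
}

store = collect("http://example.org/", PAGES.get, ["crawler"])
```

In this sketch a page that fails the filter is not expanded, so pages reachable only through off-topic pages are never visited. That is one possible design choice for domain-specific collection; a real system might instead follow links from such pages for a limited depth.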