Heritrix is an open-source web crawler written in Java. HTMLParser is used to parse the crawled pages efficiently and to consolidate the extracted information, which addresses the problem of supplying source data to a specialized search engine. This paper discusses the design and implementation of a Web information collection system built on Heritrix and HTMLParser.