When building a Web-oriented information system, removing useless data such as scripts, advertising links, and navigation links improves the efficiency of information storage and retrieval; merging and segmenting Web documents according to their semantics likewise aids information management. These are the tasks of a Web document cleaning system. In Web document cleaning, both offline rule learning and online document cleaning must rest on an analysis of the structure and content of Web documents. Starting from the general concepts of HTML parsing and taking into account the requirements of a Web document cleaning system, this paper describes the structure of a self-developed HTML parser and discusses in detail the design of its components: the lexicon, the lexical analyzer, and the syntax analyzer.