The vast amount of information on the Web has become an important information source, and extracting information from large numbers of semi-structured or unstructured HTML pages has become an active research topic. However, Web pages are designed for human browsing rather than for automatic processing by applications, so building an extraction system that is both accurate and widely applicable faces many difficulties. Traditional approaches can be roughly divided into interactively generated wrappers and automatically generated wrappers; interactively generated wrappers lack general applicability, while automatically generated wrappers are not accurate enough. This paper proposes a new two-stage, semantics-based, semi-automatic extraction method that minimizes interactive operations while preserving extraction accuracy, and that gradually increases the degree of automation in wrapper generation as more websites are included. Compared with existing methods, the proposed method takes into account both the accuracy of the wrapper's extraction results and the general applicability of the extraction process. Its effectiveness has been verified in a prototype system, and the method has been used to successfully extract 1.2 million HTML pages.