News web pages consist mainly of descriptive text. Compared with the noise content in other regions of the page, the main content contains long passages of continuous text. Based on this characteristic, this paper proposes a web page cleaning method based on pattern matching: the longest text string is matched in the page source code, which accurately locates the main content's source code within the page source and thereby cleans the page. The method can remove noise content from pages of different websites, requires no prior training on a dataset to generate templates, and does not need to build the page's DOM tree. Experiments on structurally homogeneous pages, structurally heterogeneous pages, and pages that do not conform to the XML specification show that the method is effective and that its performance is stable.