This paper presents a new method for eliminating noise in Web page classification. It first describes a representation of a Web page based on HTML tags. Then, through a novel distance formula, it eliminates noise in the similarity measure. After carefully analyzing Web pages, we design an algorithm that distinguishes related hyperlinks from noisy ones. The non-noisy hyperlinks are then used to improve the performance of Web page classification (the CAWN algorithm). Any page can be classified using the text and categories of the neighbor pages related to it. Experimental results show that our approach improves classification accuracy.
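The idea of classifying a page from its non-noisy neighbors can be illustrated with a minimal sketch. It is not the CAWN algorithm itself: the abstract does not give the distance formula, so plain cosine similarity over word counts stands in for it, and the function names, the `threshold` value, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors.

    Stand-in for the paper's distance formula, which the abstract does not give.
    """
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_with_neighbors(page_text, neighbors, threshold=0.2):
    """Vote over the categories of neighbors that look related to the page.

    neighbors: list of (neighbor_text, neighbor_category) pairs.
    threshold: hypothetical cutoff; links scoring below it are treated as noisy.
    Returns the winning category, or None if every link was filtered out.
    """
    page_vec = Counter(page_text.lower().split())
    votes = Counter()
    for text, category in neighbors:
        sim = cosine(page_vec, Counter(text.lower().split()))
        if sim >= threshold:        # keep only non-noisy hyperlinks
            votes[category] += sim  # weight each vote by its similarity
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical neighbor pages: two related links and one noisy advertisement.
neighbors = [
    ("football match score league goal", "sports"),
    ("league table football transfer news", "sports"),
    ("buy cheap watches online discount", "ads"),
]
label = classify_with_neighbors("football league goal highlights", neighbors)
print(label)  # sports
```

In this sketch the advertisement link shares no words with the page, so its similarity is zero and it is discarded before voting. A real system would replace the word-count vectors with the HTML-tag-based representation the paper describes.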