[Objective] This paper takes book-related web pages as its object and studies methods for automatically identifying book web pages and extracting bibliographic information from them. [Methods] Based on an analysis of the tag usage, layout structure, and representation of bibliographic information in different book web pages, a model for automatic book web page identification and bibliographic information extraction is built by defining general rules and applying co-occurrence word and page analysis techniques. [Results] Experiments show that the model achieves a recognition rate of nearly 80% on book web pages from general-purpose websites, and an average extraction accuracy of about 79% for bibliographic information across the various types of book web pages. [Limitations] The thresholds in this method were set by jointly considering the information characteristics of many types of book web pages, but some pages with highly unusual characteristics are still misjudged; further improvement of the algorithm may yield better results. [Conclusions] The method achieves fairly satisfactory results for both automatic identification and bibliographic information extraction across all types of book web pages and generalizes well. It also lays a foundation for research on the organization and management of book web page information and on automatic classification.
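To make the idea of co-occurrence words combined with a threshold more concrete, the following is a minimal sketch, not the authors' model. It treats a page as a book page when enough distinct bibliographic field labels co-occur in its visible text, and extracts the value that follows each label. The label list, the threshold of 3, and the function names (`extract_fields`, `is_book_page`) are all assumptions made for illustration; the abstract does not give the actual rules, word lists, or threshold values.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue words; the paper's actual co-occurrence word list is not given.
FIELD_LABELS = {
    "author": ("作者", "Author"),
    "publisher": ("出版社", "Publisher"),
    "isbn": ("ISBN",),
    "price": ("定价", "Price"),
}
THRESHOLD = 3  # assumed: minimum number of distinct fields that must co-occur


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping the contents of script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def page_lines(html):
    """Return the visible text of an HTML page as a list of chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_fields(html):
    """Return {field: value} for each label followed by a colon and a value."""
    fields = {}
    for line in page_lines(html):
        for field, labels in FIELD_LABELS.items():
            for label in labels:
                # Accept both the ASCII colon and the full-width colon.
                pattern = rf"\s*{re.escape(label)}\s*[::]\s*(.+)"
                m = re.match(pattern, line, re.IGNORECASE)
                if m and field not in fields:
                    fields[field] = m.group(1).strip()
    return fields


def is_book_page(html, threshold=THRESHOLD):
    """Classify a page as a book page if enough distinct fields co-occur."""
    return len(extract_fields(html)) >= threshold
```

A page that lists an author, a publisher, and an ISBN as labelled items would pass the assumed threshold, while an article page that only names an author would not. A fixed threshold of this kind is also where the misjudgments mentioned under Limitations would be expected to arise, since pages with unusual layouts may carry fewer or differently labelled fields.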