Aiming at the fault tolerance, correctness, comprehensiveness, efficiency, and scalability required of a search engine's link extraction module, a new design for a link extraction model is proposed. The model divides the link extraction process into four stages: information extraction, information processing, information analysis, and information storage. In the extraction stage, initial uniform resource identifier (URI) data are obtained from the document through HTML (hypertext markup language) grammatical analysis; in the processing stage, the initial data are refined by a URI resolution algorithm; the analysis stage then further screens and filters the results; finally, the results are stored in a flexible data structure. Comparative tests confirm that this new link extraction model has clear advantages over traditional methods on all of the above metrics.
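The four stages described above can be illustrated with a minimal sketch. This is not the paper's implementation; it is a hypothetical example using Python's standard `html.parser` and `urllib.parse` modules, with the class name `LinkExtractor` and the filtering rule (keep only `http`/`https` links) chosen purely for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Hypothetical sketch of the four-stage link extraction pipeline."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()  # stage 4: flexible storage; a set deduplicates URIs

    def handle_starttag(self, tag, attrs):
        # Stage 1: information extraction via HTML grammatical analysis.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Stage 2: refine the raw reference with URI resolution
                # (resolve relative paths against the base, drop fragments).
                uri, _fragment = urldefrag(urljoin(self.base_url, value))
                # Stage 3: information analysis - screen and filter
                # (here: keep only http/https links, an assumed rule).
                if uri.startswith(("http://", "https://")):
                    self.links.add(uri)

extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<a href="../about.html">About</a> <a href="mailto:x@y.z">Mail</a>')
print(sorted(extractor.links))  # the mailto: link is filtered out
```

In this sketch the relative reference `../about.html` is resolved to `https://example.com/about.html` in stage 2, while the `mailto:` link is rejected in stage 3, mirroring the refine-then-filter order the abstract describes.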