This paper proposes a prototype of an e-government secondary information collection and interaction system that prevents duplicate information collection when the intranet and the extranet are physically isolated. Extranet users use client software to perform a secondary collection of the content behind links to the latest relevant web pages on the Internet, where the links themselves are gathered by the webalert function. The collected results are then passed through a ferry-style transmission device to a storage device and synchronized with the network platform built on the intranet, where intranet users can browse them directly. Both extranet crawling and intranet-extranet data synchronization require extracting an information fingerprint from each web page and comparing fingerprints, so that pages are not crawled or copied twice. The prototype stores the information fingerprints in a HashTrie. Evaluation shows that HashTrie-based fingerprint extraction is 2.28 times faster than Darts (double-array Trie), currently the fastest structure among patent applications. The paper also proposes a new Hash function and implements 12 existing high-speed Hash functions for use with HashTrie. When the dictionary holds more than 500,000 words, PJWHash or SuperFastHash can be used; when it holds 100,000 words, CalcStrCRC32 and ELFHash can be used.
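The fingerprint comparison that the abstract relies on, both during extranet crawling and during intranet-extranet synchronization, can be illustrated with a minimal sketch. The abstract does not describe how a fingerprint is derived from a page or how the HashTrie is laid out, so the code below makes two assumptions: a fingerprint is simply a string, and the HashTrie is replaced by a plain hash-bucket table. The class name `FingerprintStore` and the method `add_if_new` are hypothetical and do not come from the paper. Only `elf_hash` follows a published algorithm, the classic 32-bit ELF string hash, which is one of the functions the abstract names.

```python
from collections import defaultdict


def elf_hash(data: bytes) -> int:
    """Classic 32-bit ELF string hash."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF  # emulate 32-bit unsigned wraparound
        high = h & 0xF0000000               # top four bits
        if high:
            h ^= high >> 24                 # fold the top bits back in
        h &= ~high & 0xFFFFFFFF             # clear the top four bits
    return h


class FingerprintStore:
    """Hypothetical stand-in for the paper's HashTrie (assumption).

    Fingerprints are bucketed by hash value; the full fingerprint is kept
    in the bucket so that hash collisions are not mistaken for duplicates.
    """

    def __init__(self, hash_fn=elf_hash):
        self._hash_fn = hash_fn
        self._buckets = defaultdict(set)

    def add_if_new(self, fingerprint: str) -> bool:
        """Record the fingerprint; return False if it was already present."""
        key = fingerprint.encode("utf-8")
        bucket = self._buckets[self._hash_fn(key)]
        if key in bucket:
            return False
        bucket.add(key)
        return True


# Illustrative data only: the third entry repeats the first and is skipped.
store = FingerprintStore()
pages = ["policy notice 2024", "budget report", "policy notice 2024"]
fresh = [p for p in pages if store.add_if_new(p)]
print(fresh)
```

Because the hash function is a constructor argument, another of the functions named in the abstract could be swapped in, which is the kind of comparison the evaluation describes.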