【Objective】 To address the difficulty of capturing data and of integrating multiple types of digital resources when building featured databases. 【Application background】 Featured resource information is short-lived. The featured database platforms already built in Shaanxi Province differ considerably from one another, offer limited support for RSS interfaces, and use complex data formats. 【Method】 Web data collection technologies such as Drupal Feeds, XPath Parser, Crawls, and Image Grabber are combined with data cleaning and filtering to make web data collection systematic and specialized. 【Result】 The paper discusses RSS collection with Feeds and automatic collection through HTML/XML page analysis, in particular the need to modify collection rules for different featured resources and the problem of collecting streaming media embedded in web pages. 【Conclusion】 The approach enriches the data sources of the Shaanxi Province featured digital resource platform and partially solves the problems of difficult data collection, non-standard data formats, and limited data sources.
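As an illustration of the collect-then-clean workflow summarized in the abstract, the following is a minimal sketch only. The paper itself works with the Drupal Feeds and XPath Parser modules; this example swaps in the Python standard library as a stand-in, and the sample feed, its URLs, and the function name `collect_items` are hypothetical, not taken from the paper. It selects RSS items with a path expression, trims whitespace, and drops incomplete or duplicate records.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; titles and URLs are invented for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example featured-resource feed</title>
    <item>
      <title>  Sample resource A  </title>
      <link>http://example.org/item/1</link>
    </item>
    <item>
      <title>Sample resource A</title>
      <link>http://example.org/item/1</link>
    </item>
    <item>
      <title></title>
      <link>http://example.org/item/2</link>
    </item>
    <item>
      <title>Sample resource B</title>
      <link>http://example.org/item/3</link>
    </item>
  </channel>
</rss>
"""


def collect_items(rss_text):
    """Parse RSS text and return cleaned, de-duplicated item records."""
    root = ET.fromstring(rss_text)
    seen_links = set()
    records = []
    # ElementTree supports a limited XPath subset; this path selects every item.
    for item in root.findall("./channel/item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        # Cleaning step: skip records that lack a title or a link.
        if not title or not link:
            continue
        # Filtering step: skip records whose link was already collected.
        if link in seen_links:
            continue
        seen_links.add(link)
        records.append({"title": title, "link": link})
    return records


if __name__ == "__main__":
    for record in collect_items(SAMPLE_RSS):
        print(record["title"], record["link"])
```

De-duplicating on the link rather than the title is one possible design choice; a real collection rule would likely need adjusting per source, which matches the abstract's point that rules must be modified for different featured resources.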