Web Behavior Prediction and Data Recommendation for Users of a Geoscience Data Sharing Network

Source: Journal of Geo-information Science (地球信息科学学报)
Helping users quickly find the data they need in a networked environment is a long-standing challenge for geoscience data sharing platforms. This paper extracts users' search behavior and dataset access behavior from the web server logs of the National Earth System Science Data Sharing Platform, mines user behavior patterns with a clustering algorithm, and develops online search and access prediction algorithms based on the resulting session clusters. In the data preprocessing stage, the raw server logs are cleaned, users and user sessions are identified, and search terms are extracted. In the pattern mining stage, sessions are clustered with the DBSCAN algorithm. Because session vectors are binary, distances in the clustering algorithm are computed with the Jaccard distance function. The set of search terms contained in each session cluster is treated as one document, and the set of all users' historical search terms as the corpus; the TF-IDF value of each search term is then computed for every cluster. For online search recommendation, the user's search term is looked up in each cluster's TF-IDF values, the cluster in which the term has the highest TF-IDF value is returned, and the high-frequency items of that cluster are given as recommendations. For online access recommendation, the user's real-time access vector serves as the query vector, and the distance between this vector and each cluster center is computed. The clusters are ranked by distance, the nearest cluster is selected, and its high-frequency items are given as recommendations. The experimental results show that search recommendation based on TF-IDF and clustering achieves high precision and recall, and that access recommendation improves substantially over recommendation based on high-frequency statistics alone. The study draws the following conclusions: (1) the access and search behavior of geoscience data sharing users reflects their professional focus, and is more predictable than the behavior of users of general-purpose websites; (2) predicting the behavior of geoscience data sharing users requires a clear definition of user behavior and a suitable distance function to describe behavioral similarity; (3) predicting users' data needs from the TF-IDF values of search terms is feasible, and the resulting recommendations can supplement search results. This research can support the construction of data sharing platforms in the geosciences and improve the quality of sharing services, and it may also offer a technical reference for scientific data sharing in other fields.
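The two online recommendation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy clusters, dataset ids, and search terms are hypothetical, the clusters are assumed to have been produced already by the offline DBSCAN step, and the abstract does not specify the TF-IDF variant or how a cluster center is defined. Here a smoothed IDF is assumed, and the center is assumed to be the set of items that appear in at least half of a cluster's sessions. Excluding items the user has already visited is also an added design choice.

```python
import math
from collections import Counter

# Hypothetical toy clusters, standing in for the output of the offline
# DBSCAN step. Dataset ids ("d1", ...) and search terms are made up.
CLUSTERS = {
    "c0": {"terms": ["soil", "soil", "land use", "soil"],
           "sessions": [{"d1", "d2"}, {"d1", "d3"}, {"d1"}]},
    "c1": {"terms": ["glacier", "snow", "glacier", "land use"],
           "sessions": [{"d7", "d8"}, {"d7", "d9"}]},
}


def tf_idf(term, cluster_id):
    """TF-IDF of a term, treating each cluster's search terms as one document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    terms = CLUSTERS[cluster_id]["terms"]
    tf = terms.count(term) / len(terms)
    df = sum(1 for c in CLUSTERS.values() if term in c["terms"])
    idf = math.log((1 + len(CLUSTERS)) / (1 + df)) + 1
    return tf * idf


def top_items(cluster_id, k, exclude=frozenset()):
    """Most frequently accessed items of a cluster, ties broken by id."""
    counts = Counter(i for s in CLUSTERS[cluster_id]["sessions"] for i in s)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked if item not in exclude][:k]


def jaccard_distance(a, b):
    """Jaccard distance between two item sets (binary session vectors)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster_center(cluster_id):
    """Assumed center: items present in at least half of the member sessions."""
    sessions = CLUSTERS[cluster_id]["sessions"]
    counts = Counter(i for s in sessions for i in s)
    return {i for i, n in counts.items() if 2 * n >= len(sessions)}


def recommend_by_search(term, k=2):
    """Pick the cluster where the term scores highest; return its top items."""
    scores = {cid: tf_idf(term, cid) for cid in CLUSTERS}
    best = max(scores, key=scores.get)
    return top_items(best, k) if scores[best] > 0 else []


def recommend_by_access(visited, k=2):
    """Pick the cluster whose center is nearest to the visited set."""
    nearest = min(
        CLUSTERS,
        key=lambda cid: jaccard_distance(visited, cluster_center(cid)),
    )
    return top_items(nearest, k, exclude=visited)


print(recommend_by_search("glacier"))
print(recommend_by_access({"d7"}))
```

A search term that appears in no cluster scores zero everywhere, so the sketch returns an empty list in that case rather than guessing a cluster.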