Traditional deduplication methods compute the textual similarity of web pages. Cloud disk resources, by contrast, consist mainly of multimedia data files, so their search results can be deduplicated only on the basis of very limited metadata. To address this problem, this work builds on a search engine system developed for cloud disk resource data and analyzes the characteristics of cloud disk resource metadata. The analysis shows that, besides the resource name, the file extension, the occupied storage size, and the user to whom the resource belongs are effective features for identifying duplicate records. On this basis, normalization methods for these features are given, and an unsupervised method is then applied to perform deduplication. Experimental results show that the method effectively deduplicates cloud disk resource search results.
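The abstract does not specify the normalization rules or the unsupervised method, so the following is only a minimal sketch of the general idea under stated assumptions. The `Resource` record, the `normalize` and `deduplicate` functions, the 5% log-scale size bucketing, and the `per_owner` flag are all hypothetical names and choices introduced for illustration. Plain grouping by a normalized key stands in here for the unspecified unsupervised method.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical metadata record for one search result."""
    name: str        # file name, including its extension
    size_bytes: int  # occupied storage size
    owner: str       # user to whom the resource belongs


def normalize(resource, size_tolerance=0.05):
    """Map a record to a comparable key (assumed normalization rules)."""
    stem, dot, ext = resource.name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = resource.name, ""
    # Lowercase the name and drop punctuation, whitespace, and underscores.
    stem = re.sub(r"[\W_]+", "", stem.lower())
    # Bucket sizes on a log scale so that nearly equal sizes usually share
    # a bucket; sizes that straddle a bucket boundary will still be split.
    bucket = round(math.log(max(resource.size_bytes, 1))
                   / math.log(1 + size_tolerance))
    return stem, ext.lower(), bucket


def deduplicate(resources, per_owner=False):
    """Keep the first record of each group of records sharing a key.

    With per_owner=True the owner becomes part of the key, so only
    repeated records from the same user are treated as duplicates.
    """
    seen = set()
    kept = []
    for resource in resources:
        key = normalize(resource)
        if per_owner:
            key += (resource.owner,)
        if key not in seen:
            seen.add(key)
            kept.append(resource)
    return kept
```

Calling `deduplicate(results)` on a list of `Resource` records returns the records in their original order with later key-equal records dropped. How the user attribute should actually enter the decision is not stated in the abstract, which is why it is exposed here only as an optional flag.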