Objective: Determining an appropriate number of clusters and validating the results have long been difficult problems in cluster analysis, and they are especially acute in bioinformatics research. This paper attempts to address them by introducing data mining techniques into this field. Methods: Taking H3 sequences of influenza A virus as an example, the data were split into a training set and a validation set, following data mining practice. Two-stage (two-step) clustering and self-organizing maps were then used for cluster analysis; the validation set was used to verify the clustering results, and the characteristics of each cluster were described. Results: Two-stage clustering searched for an appropriate number of clusters automatically. Both the agreement between the two clustering methods and the check against the validation set confirmed the accuracy of the clustering results. Conclusion: The intelligent clustering techniques in the data mining toolkit can meet the needs of clustering gene sequence data and handle cluster-number determination and result validation well; they merit wider use in this field.
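The train/validate workflow described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the paper: it swaps in a small k-medoids clustering on Hamming distance for the two-stage clustering and self-organizing map, chooses the number of clusters by mean silhouette width rather than the two-stage method's own criterion, and runs on synthetic toy sequences instead of real H3 data. All function and variable names are hypothetical.

```python
import random

def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def nearest(seq, medoids):
    """Index of the medoid closest to seq."""
    return min(range(len(medoids)), key=lambda j: hamming(seq, medoids[j]))

def k_medoids(seqs, k, n_iter=20):
    """Small k-medoids with deterministic farthest-first seeding."""
    medoids = [seqs[0]]
    while len(medoids) < k:
        medoids.append(max(seqs, key=lambda s: min(hamming(s, m) for m in medoids)))
    for _ in range(n_iter):
        labels = [nearest(s, medoids) for s in seqs]
        updated = []
        for j in range(k):
            members = [s for s, lab in zip(seqs, labels) if lab == j]
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda c: sum(hamming(c, m) for m in members))
                           if members else medoids[j])
        if updated == medoids:
            break
        medoids = updated
    return medoids, [nearest(s, medoids) for s in seqs]

def mean_silhouette(seqs, labels):
    """Average silhouette width; points in singleton clusters score 0."""
    total = 0.0
    for i, s in enumerate(seqs):
        by_cluster = {}
        for j, t in enumerate(seqs):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(hamming(s, t))
        means = {c: sum(d) / len(d) for c, d in by_cluster.items()}
        if labels[i] in means and len(means) > 1:
            a = means[labels[i]]                                    # own-cluster distance
            b = min(v for c, v in means.items() if c != labels[i])  # nearest other cluster
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(seqs)

# Toy data (an assumption, not H3 sequences): three "ancestors" that differ at
# every position, each with 20 descendants carrying 3 point mutations.
rng = random.Random(0)
ancestors = ["ACGT" * 15, "CGTA" * 15, "GTAC" * 15]
data = []
for group, anc in enumerate(ancestors):
    for _ in range(20):
        seq = list(anc)
        for p in rng.sample(range(len(seq)), 3):
            seq[p] = rng.choice([b for b in "ACGT" if b != seq[p]])
        data.append((group, "".join(seq)))

# Split into a training set and a validation set (70/30).
rng.shuffle(data)
cut = len(data) * 7 // 10
train, valid = data[:cut], data[cut:]
train_seqs = [s for _, s in train]

# Search for a suitable number of clusters on the training set only.
scores = {}
for k in range(2, 6):
    _, labels = k_medoids(train_seqs, k)
    scores[k] = mean_silhouette(train_seqs, labels)
best_k = max(scores, key=scores.get)

# Validate: assign held-out sequences to the trained medoids and compare fit.
medoids, train_labels = k_medoids(train_seqs, best_k)
valid_labels = [nearest(s, medoids) for _, s in valid]
train_fit = sum(hamming(s, medoids[l]) for s, l in zip(train_seqs, train_labels)) / len(train_seqs)
valid_fit = sum(hamming(s, medoids[l]) for (_, s), l in zip(valid, valid_labels)) / len(valid)
print(best_k, round(train_fit, 2), round(valid_fit, 2))
```

If the held-out sequences sit about as close to their assigned medoids as the training sequences do, the cluster structure found on the training set generalizes. The toy group labels stored in `data` play no part in the clustering and are kept only so the result can be checked. Cross-checking against a second clustering method, as the abstract does with the self-organizing map, is not covered here.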