Clustering algorithms are widely used in microarray (biochip) data analysis to find genes or samples with similar expression. Most existing algorithms require the user to supply parameters, yet choosing those parameters without prior knowledge is very difficult. To address this problem, an iterative clustering algorithm is proposed. First, the dominant set method is used to reorder the genes so that highly similar genes are placed together in a specific region. Because cluster boundaries are usually hard to determine, a criterion is proposed that splits one cluster off from the sorted data set, based on the property that distances between elements inside a cluster are much smaller than distances to elements outside it. Once the cluster found has been removed from the current data set, the same procedure is repeated on the remaining data until the proposed stopping condition is met. The performance of the algorithm is analyzed from several aspects, and the algorithm is applied to clustering microarray expression profile data of the yeast cell cycle. Both the theoretical analysis and the application results show that the algorithm is practical and effective and that it is robust to noise.
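The loop described in the abstract (reorder, split off one cluster, remove it, repeat) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the abstract does not give the ordering procedure, the split criterion, or the stopping condition in detail. Here the ordering is assumed to be the weight ranking produced by replicator dynamics on a similarity matrix (a common way to compute dominant sets), the split criterion is assumed to be the ratio of the smallest inside-to-outside distance to the largest within-cluster distance, and the stopping rule is an assumed threshold `min_ratio`. All function names and the similarity scale are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dominant_set_order(D, n_iter=200):
    """Rank items by replicator-dynamics weight (assumed ordering step)."""
    n = D.shape[0]
    # Assumption: similarity scale is the median pairwise distance.
    scale = max(np.median(D[np.triu_indices(n, k=1)]), 1e-12)
    A = np.exp(-D / scale)
    np.fill_diagonal(A, 0.0)
    x = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        x = x * (A @ x)   # replicator update: x_i <- x_i * (A x)_i
        x /= x.sum()      # dividing by the sum equals dividing by x^T A x
    return np.argsort(-x)  # highest weight first

def best_cut(D, order):
    """Find the prefix length k that best separates the prefix from the rest.

    Assumed criterion: (smallest distance from prefix to the rest) divided by
    (largest distance inside the prefix). Prefixes shorter than 2 are skipped.
    """
    best_k, best_ratio = None, 0.0
    for k in range(2, len(order)):
        inside, outside = order[:k], order[k:]
        intra = D[np.ix_(inside, inside)].max()
        inter = D[np.ix_(inside, outside)].min()
        ratio = inter / max(intra, 1e-12)
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k, best_ratio

def iterative_clustering(X, min_ratio=2.0):
    """Peel off one cluster per round until no cut is clear enough.

    min_ratio is an assumed stopping threshold; whatever remains at the end
    is returned as one final group.
    """
    D_full = pairwise_distances(np.asarray(X, dtype=float))
    remaining = np.arange(len(D_full))
    clusters = []
    while len(remaining) >= 3:
        D = D_full[np.ix_(remaining, remaining)]
        order = dominant_set_order(D)
        k, ratio = best_cut(D, order)
        if k is None or ratio < min_ratio:
            break
        clusters.append(sorted(remaining[order[:k]].tolist()))
        remaining = remaining[order[k:]]
    if len(remaining) > 0:
        clusters.append(sorted(remaining.tolist()))
    return clusters
```

Calling `iterative_clustering(X)` on an array of expression profiles (one row per gene) returns a list of index lists, one per cluster. Note that this sketch still carries two tunable values (`min_ratio` and the similarity scale), whereas the abstract states that the proposed algorithm avoids user-supplied parameters, so the sketch should be read as a structural outline only.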