An increasing number of microbial genomes are being sequenced and deposited in public databases. Building a non-redundant reference sequence database through efficient clustering analysis is important for handling the large volume of available microbial genome-sized sequences and assembled contigs. Toward this aim, we describe Gclust (Genome sequence clustering), a program for clustering the rapidly growing number of complete or draft genome sequences. Using a sparse suffix array algorithm and a genome sequence identity criterion based on extended DNA maximal exact matches (MEMs), Gclust creates clusters from a given set of genome sequences at a given extended-MEM identity threshold. Clustering 1560 complete microbial genome sequences with an average length of 3.4 MB takes less than 7 hours on a 2.27 GHz Intel(R) Xeon(R) CPU using 8 parallel threads. Gclust thus makes it feasible to keep clustering complete or draft microbial genomes as their number continues to grow. The program is freely available for non-commercial use at http://weizhong-lab.ucsd.edu/gclust.
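To make the idea of MEM-based clustering concrete, the following is a minimal Python sketch and not Gclust's actual implementation. It rests on several assumptions that the abstract does not confirm: a naive k-mer index stands in for the sparse suffix array, identity is taken to be the fraction of the query covered by MEMs of at least `min_len` bases, and clusters are built greedily with the longest sequence as representative. The names `maximal_exact_matches`, `mem_identity`, and `greedy_cluster` are hypothetical.

```python
from typing import Dict, List, Tuple


def maximal_exact_matches(ref: str, qry: str, min_len: int) -> List[Tuple[int, int, int]]:
    """Return (ref_start, qry_start, length) for each exact match of at least
    min_len bases that cannot be extended to the left or to the right.

    Assumption: a plain k-mer index replaces the sparse suffix array.
    """
    index: Dict[str, List[int]] = {}
    for i in range(len(ref) - min_len + 1):
        index.setdefault(ref[i:i + min_len], []).append(i)

    mems = []
    for j in range(len(qry) - min_len + 1):
        for i in index.get(qry[j:j + min_len], []):
            # A seed that extends to the left is part of a match that was
            # already reported from an earlier start position.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            length = min_len
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1
            mems.append((i, j, length))
    return mems


def mem_identity(ref: str, qry: str, min_len: int) -> float:
    """Fraction of query positions covered by at least one MEM (an assumed
    definition, used here only for illustration)."""
    if not qry:
        return 0.0
    covered = [False] * len(qry)
    for _, j, length in maximal_exact_matches(ref, qry, min_len):
        for p in range(j, j + length):
            covered[p] = True
    return sum(covered) / len(qry)


def greedy_cluster(seqs: Dict[str, str], threshold: float, min_len: int) -> Dict[str, List[str]]:
    """Greedy incremental clustering: visit sequences from longest to shortest
    and attach each to the first representative it matches at >= threshold."""
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if mem_identity(seqs[rep], seqs[name], min_len) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no match: the sequence starts a new cluster
    return clusters


if __name__ == "__main__":
    genome_a = "ACGTTGCAAGGCTTAACCGGATCGATTACAGG"
    toy = {
        "genome_a": genome_a,
        # One substitution at position 15 and two bases trimmed from the end.
        "genome_b": genome_a[:15] + "T" + genome_a[16:30],
        "genome_c": "TTTTTTTTCCCCCCCCGGGGGGGG",
    }
    print(greedy_cluster(toy, threshold=0.9, min_len=8))
```

The quadratic behaviour of this sketch is acceptable only for toy inputs; a suffix-array index and multithreading, as described above, are what make genome-scale runs practical.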