Background: Next Generation Sequencing (NGS) methodology has dramatically increased the volume of sequencing data and enabled novel biological applications. NGS has been applied to many areas such as RNA-Seq, ChIP-Seq and MeDIP-Seq. In whole genome re-sequencing projects of mammalian genomes, NGS usually generates billions of short read sequences (short reads). The computational cost of aligning such sequences can be very large.

Methods: We propose a new method of aligning Illumina short reads to the reference genome. The novelty of our method lies in the way it indexes the reference. We transform all K-mer subsequences of the reference sequence into natural numbers and use a fixed function to randomize all of them. We then sort these randomized numbers to form an index table. During the mapping process, the K-mer subsequences of short reads are transformed into numbers and then randomized in exactly the same way as the K-mer subsequences of the reference. We make use of the statistical character of the index table to speed up the process of locating the position of each randomized number in the index table. If a K-mer subsequence of a read is found in the index table (we call it a seed), we then use the seed-and-extend method to produce the full alignment.

Results: We compare our method with Bowtie2 and SOAP2 using 3 short read data sets. The alignment rate of our method is comparable with that of Bowtie2 and about 10% higher than that of SOAP2. Our method is 2 to 5 times faster than Bowtie2 and comparable in speed with SOAP2.

Conclusions: Unlike complex data structures such as hash tables or the FM-index, our index table is simply the sorted list of randomized natural numbers of the K-mer subsequences of the reference. Moreover, we can make use of the statistical character of the index table to speed up the process of finding exact matches. Our method is fast and flexible due to the character of the randomized and sorted index table. The results show that the speed and sensitivity of our method are comparable to or better than those of Bowtie2 and SOAP2.
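
The indexing and lookup steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the randomizing function, the value of K, how the "statistical character" of the table is exploited, or the extension step. The sketch assumes a 2-bit base encoding, a hypothetical bijective mixing function `randomize`, a small seed length `K = 8`, a lookup that guesses the table position by assuming the randomized keys are roughly uniformly distributed, and an ungapped mismatch-counting extension. All function names and parameters are illustrative.

```python
from bisect import bisect_left

K = 8                      # assumed seed length (not given in the abstract)
BITS = 2 * K               # 2 bits per base
MASK = (1 << BITS) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a K-mer over A/C/G/T into a natural number, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def randomize(x):
    """Hypothetical fixed mixer; each step is invertible on BITS-bit integers."""
    x = (x * 0x9E3779B1) & MASK      # odd multiplier modulo 2**BITS
    x ^= x >> (BITS // 2)            # xor-shift
    x = (x * 0x85EBCA6B) & MASK      # second odd multiplier
    return x


def build_index(reference):
    """Return sorted randomized keys and the reference position of each key."""
    entries = sorted(
        (randomize(encode(reference[i:i + K])), i)
        for i in range(len(reference) - K + 1)
    )
    keys = [key for key, _ in entries]
    positions = [pos for _, pos in entries]
    return keys, positions


def find(keys, positions, key):
    """Look up a key, starting from a guess based on near-uniform key spread."""
    n = len(keys)
    if n == 0:
        return []
    guess = min(n - 1, (key * n) >> BITS)   # expected rank if keys are uniform
    lo = hi = guess
    step = 1
    while lo > 0 and keys[lo] >= key:       # widen left until keys[lo] < key
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < n - 1 and keys[hi] < key:    # widen right until keys[hi] >= key
        hi = min(n - 1, hi + step)
        step *= 2
    i = bisect_left(keys, key, lo, hi + 1)  # binary search in the small window
    hits = []
    while i < n and keys[i] == key:
        hits.append(positions[i])
        i += 1
    return hits


def map_read(reference, keys, positions, read, max_mismatches=2):
    """Seed with non-overlapping K-mers, then extend by counting mismatches."""
    for offset in range(0, len(read) - K + 1, K):
        seed = randomize(encode(read[offset:offset + K]))
        for pos in find(keys, positions, seed):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                return start, mismatches
    return None
```

A call such as `keys, positions = build_index(reference)` followed by `map_read(reference, keys, positions, read)` returns a `(start, mismatches)` pair or `None`. Because the mixer is a bijection, two different K-mers never share a key in this sketch; a real aligner would also need to handle ambiguous bases, reverse-complement reads, and gapped extension, none of which are shown here.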