Background: Small insertions and deletions (INDELs) constitute the second largest category of genetic variants in the human genome, after single nucleotide polymorphisms (SNPs), and account for 18% of all documented variants. Data from the 1000 Genomes Project indicate an estimated 1 million INDELs per human genome. Like SNPs and large structural variations, INDELs are of significant interest because of their potential to affect the molecular function of genes and thereby cause disease. Despite their functional importance, the lack of a highly accurate bioinformatics tool for INDEL classification has become a major obstacle to understanding their molecular functions.

Methods: In this study, we developed a bioinformatics tool for INDEL prioritization. We constructed a training dataset that includes all the disease-causing and neutral INDELs documented in the Human Gene Mutation Database (HGMD) and the 1000 Genomes Project. For the two sets of INDELs, we systematically investigated a series of features that capture their potential to disrupt both protein structure and pre-mRNA splicing. We then designed a predictor that best distinguishes disease-causing from neutral INDELs using machine-learning techniques.

Results: We focused on the disease relevance of INDELs in exonic regions, which comprise 25,923 disease-causing and 4,643 neutral INDELs. We found that disease-causing INDELs are more likely to occur in predicted structured regions, whereas neutral INDELs are more likely to occur in unstructured, intrinsically disordered regions. We also found that INDEL sites that match more homologous sequences are more likely to be disease-causing. For RNA processing, we found that disease-causing INDELs are twice as likely to disrupt binding sites of RNA-binding proteins. Disease-causing INDELs also tend to be longer and to lie closer to splice sites. In addition, the genomic loci of disease-causing INDELs are more evolutionarily conserved and tend to be located in exons with previous evidence of alternative splicing. We integrated all of these features into a predictor for prioritizing novel INDELs by their potential functional impact. The predictor achieved strong predictive power, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.86 and a Matthews correlation coefficient of 0.65.

Conclusions: With the fast-growing application of next-generation sequencing technology in genetic studies, more and more novel genetic variants will be identified in the near future. The bioinformatics tool reported here will provide a robust and informative means of identifying disease-causing INDELs derived from DNA-sequencing experiments.
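
For readers less familiar with the two figures of merit reported in the Results, the following is a minimal sketch of how an AUC and a Matthews correlation coefficient (MCC) are computed for a binary disease-causing versus neutral classifier. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn), calling an example positive when
    its score is greater than or equal to the threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC from confusion-matrix counts. Returns 0.0 when the
    denominator is zero, a common convention for the undefined case."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: 1 = disease-causing, 0 = neutral.
labels = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.7]

print(roc_auc(labels, scores))                    # 0.875
tp, fp, tn, fn = confusion_counts(labels, scores, 0.5)
print(matthews_corrcoef(tp, fp, tn, fn))          # 0.25
```

AUC is independent of any decision threshold, while MCC depends on the threshold chosen. Because MCC uses all four cells of the confusion matrix, it is often reported alongside AUC when the classes are unbalanced, as with the 25,923 disease-causing and 4,643 neutral INDELs described above.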