Deciphering sequence and structure features of disease-causing small insertions and deletions

Source: The 5th National Conference on Bioinformatics and Systems Biology
  Background: Small insertions and deletions (INDELs) constitute the second largest category of genetic variants in the human genome, after single nucleotide polymorphisms (SNPs), and account for 18% of all documented variants. Data from the 1000 Genomes Project indicate an estimated 1 million INDELs per human genome. Like SNPs and large structural variations, INDELs are of significant interest because they can affect the molecular functions of genes and thereby cause disease. Despite their functional importance, the lack of a highly accurate bioinformatics tool for INDEL classification has become a major obstacle to understanding their molecular functions.

  Methods: In this study, we developed a bioinformatics tool for INDEL prioritization. We constructed a training dataset that includes all the disease-causing and neutral INDELs documented in the Human Gene Mutation Database (HGMD) and the 1000 Genomes Project. For the two sets of INDELs, we systematically investigated a series of features describing their potential to disrupt both protein structure and pre-mRNA splicing. We then designed a predictor that uses machine-learning techniques to distinguish disease-causing from neutral INDELs.

  Results: We focused on the disease relevance of INDELs in exonic regions, which include 25,923 disease-causing and 4,643 neutral INDELs. We found that disease-causing INDELs are more likely to occur in predicted structured regions, whereas neutral INDELs are more likely to occur in unstructured, intrinsically disordered regions. We also found that INDEL sites that match more homologous sequences are more likely to be disease-causing. For RNA processing, we found that disease-causing INDELs are twice as likely to disrupt binding sites of RNA-binding proteins. Disease-causing INDELs also tend to be longer and closer to splice sites. In addition, the genomic loci of disease-causing INDELs are more evolutionarily conserved and tend to lie in exons with previous evidence of alternative splicing. We designed a predictor that integrates all these features to prioritize novel INDELs by their potential molecular functions. The predictor achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.86 and a Matthews correlation coefficient of 0.65.

  Conclusions: With the fast-growing application of next-generation sequencing technology in genetic studies, more and more novel genetic variants will be identified in the near future. The bioinformatics tool reported here will provide a robust and informative means of identifying disease-causing INDELs derived from DNA-sequencing experiments.
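  For readers unfamiliar with the two evaluation metrics reported in the Results, the following is a minimal sketch of how an AUC and a Matthews correlation coefficient can be computed for a binary predictor. It is not the authors' implementation: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the numbers it produces have no relation to the 0.86 and 0.65 reported above.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = disease-causing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count pairwise wins; ties contribute half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease-causing INDEL, 0 = neutral INDEL.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # made-up predictor outputs

# Assumed decision threshold of 0.5 for turning scores into class calls.
calls = [1 if s >= 0.5 else 0 for s in scores]

print("AUC:", roc_auc(labels, scores))
print("MCC:", matthews_corrcoef(labels, calls))
```

  The AUC depends only on how the scores rank the two classes, while the Matthews correlation coefficient depends on the chosen threshold and stays informative when the classes are imbalanced, as they are here (25,923 disease-causing versus 4,643 neutral INDELs).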