Deciphering sequence and structure features of disease-causing small insertions and deletions

Source: Fifth National Conference on Bioinformatics and Systems Biology

Authors: Xinjun Zhang, Huiying Zhao, Yadong Wang, Guohua Wang, Yaoqi Zhou, Yunlong Liu

Affiliations: School of Informatics and Computing, Indiana University, Bloomington, IN, 47408, USA; Center for Computational Bio

Published in: Fifth National Conference on Bioinformatics and Systems Biology

Publication date: 2012, issue 8

Keywords: insertion, deletions, INDELs, genetic variants, next generation sequencing

Excerpt from the paper

Background: Small insertions and deletions (INDELs) constitute the second-largest category of genetic variants in the human genome, after single nucleotide polymorphisms (SNPs), and account for 18% of all documented variants. Data from the 1000 Genomes Project indicate an estimated 1 million INDELs per human genome. Like SNPs and large structural variations, INDELs are of significant interest because they can alter the molecular function of genes and thereby cause disease. Despite their functional importance, the lack of a highly accurate bioinformatics tool for INDEL classification has become a major obstacle to understanding their molecular functions.

Methods: In this study, we developed a bioinformatics tool for INDEL prioritization. We constructed a training dataset that includes all the disease-causing and neutral INDELs documented in the Human Gene Mutation Database (HGMD) and the 1000 Genomes Project. For the two sets of INDELs, we systematically investigated a series of features describing their potential to disrupt both protein structure and pre-mRNA splicing. We then designed a predictor that best distinguishes disease-causing from neutral INDELs using machine-learning techniques.

Results: We focused on the disease relevance of INDELs in exonic regions, which comprise 25,923 disease-causing and 4,643 neutral INDELs. We found that disease-causing INDELs are more likely to occur in predicted structured regions, whereas neutral INDELs are more likely to occur in unstructured, intrinsically disordered regions. We also found that INDEL sites that match more homologous sequences are more likely to be disease-causing. For RNA processing, disease-causing INDELs are twice as likely to disrupt binding sites of RNA-binding proteins. Disease-causing INDELs also tend to be longer and closer to splice sites. In addition, the genomic loci of disease-causing INDELs are more evolutionarily conserved and tend to lie in exons with previous evidence of alternative splicing. We designed a predictor that integrates all these features to prioritize the potential molecular functions of novel INDELs. The predictor achieved strong predictive power, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.86 and a Matthews correlation coefficient of 0.65.

Conclusions: With the rapidly growing use of next-generation sequencing in genetic studies, more and more novel genetic variants will be identified in the near future. The bioinformatics tool reported here will provide a robust and informative means of identifying disease-causing INDELs from DNA-sequencing experiments.
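The two metrics reported in the abstract, ROC AUC and the Matthews correlation coefficient (MCC), can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, with label 1 standing for a disease-causing INDEL and 0 for a neutral one.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data (hypothetical, not taken from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc = roc_auc(labels, scores)

# Binarize the scores at an assumed threshold of 0.5 to obtain class predictions.
preds = [int(s >= 0.5) for s in scores]
tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
mcc = matthews_corrcoef(tp, tn, fp, fn)

print(f"AUC = {auc:.3f}, MCC = {mcc:.3f}")
```

AUC is threshold-free and measures how well the scores rank disease-causing above neutral variants, whereas MCC depends on the chosen threshold and remains informative when the two classes are imbalanced, as they are in the dataset described above.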
