Statistical inference methods for genome-wide region-based association analysis using dimension-reduction techniques

Many common human diseases, such as cancer, schizophrenia, essential hypertension, type 2 diabetes, and cardiovascular disease, are complex diseases. Complex diseases, also known as multifactorial diseases, are controlled by multiple genetic and environmental factors. Although they often aggregate in families, they do not follow a clear-cut pattern of inheritance, which makes it difficult to determine a person's risk of inheriting or passing on these disorders. With recent rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS), which genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of participants, are emerging as promising approaches for identifying SNPs that are marginally associated with complex diseases. In addition, research on gene-gene interactions (epistasis) in GWAS has shed light on some disease-associated pathways and networks and, despite the computational challenge, has improved our understanding of the genetic basis of complex diseases. However, many analytic and interpretive challenges remain in GWAS. It is customary to run SNP-based association or interaction tests across the whole genome to identify causal or associated SNPs with strong marginal or joint epistatic effects on diseases or traits. In other words, the unit of association is the SNP. Such SNP-based analysis, however, usually leads to a heavy computational burden and the well-known multiplicity problem, with a highly inflated risk of type I error and a decreased ability to detect modest effects. In the present study, higher-level units, such as genes or genome regions, were considered to deal with these and related challenges.
Under this framework, we proposed four methods to detect disease-associated genes or gene-gene interactions in the genome, presented in four chapters as follows:
Chapter 1 A new method to test the nonlinear feature in nonlinear principal component analysis
Once SNPs have been allocated to genes or regions, the question remains of how to evaluate the genetic association of each candidate gene or genome region. As powerful multi-marker analysis methods, PCA-based methods are often applied in gene- or region-based association studies. PCA can capture linkage disequilibrium information and avoid multicollinearity between SNPs within a candidate gene or region. However, it extracts only the linear relationships between SNPs. When the relationships are nonlinear, PCA-based methods lose power and a nonlinear PCA model should be used. Therefore, in the present study, we introduced a nonlinearity measure to determine whether the underlying relationships within a given variable set can be described by a linear PCA model or whether a nonlinear PCA model must be used for further study. Applications to two simulated data sets and to data from GAW16 are described to demonstrate its performance. In the two simulated examples, as expected, no violations of the accuracy bounds arise in the linear example, while some of the residual variances fall outside the accuracy bounds in the nonlinear example. For the real data, at least one of the residual variances falls outside the accuracy bounds, implying that a nonlinear PCA model is required for this data set. These results show that the new nonlinearity measure is effective in detecting the type of relationship between variables in a given data set.
With this measure, we can choose a more suitable model to make optimal use of all the information available in a given data set.
Chapter 2 Gene- or region-based association study via kernel principal component analysis
For linear data, PCA-based methods are the better choice for the subsequent association study, while nonlinear approaches should be applied to nonlinear data. Among the nonlinear extensions of PCA, kernel PCA (KPCA) is the best known and most widely adopted. In this study, we proposed combining KPCA with the logistic regression test (LRT) to detect the association between multiple SNPs in a candidate gene or genome region and diseases or traits. The algorithm first conducts KPCA to account for between-SNP relationships in a candidate region, and then applies the LRT to test the association between the kernel principal component (KPC) scores and disease. Simulation results showed that KPCA-LRT was consistently more powerful than principal component analysis combined with the logistic regression test (PCA-LRT) across different sample sizes, significance levels, and relative risks, especially at the gene-wide level (1E-5) and at lower relative risks (RR=1.2, 1.3). Application to four regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT. KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.
Chapter 3 Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model
The gene- or region-based approaches mentioned above, including our newly proposed KPCA-based method, can improve our understanding of the genetic basis of complex diseases. However, all of these approaches can handle only a gene or genome region containing several to tens of markers.
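As a point of reference for the gene-based tests discussed so far, the two steps of KPCA-LRT (kernel PCA on the SNPs of a region, then a logistic-model test on the resulting scores) can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the RBF kernel, its default bandwidth, the fixed number of components, and the use of a likelihood-ratio chi-square test for the logistic model are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def rbf_kernel_pc_scores(X, n_components=2, gamma=None):
    """Return the leading kernel-PC scores of the rows of X (RBF kernel, assumed)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    gamma = 1.0 / m if gamma is None else gamma  # assumed default bandwidth
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * dist2)
    # Centre the kernel matrix, i.e. centre the data in feature space.
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    vals, vecs = np.linalg.eigh(J @ K @ J)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


def logistic_loglik(Z, y, n_iter=100, ridge=1e-8):
    """Maximised log-likelihood of a logistic model with design matrix Z."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):  # Newton-Raphson iterations
        eta = np.clip(Z @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None]) + ridge * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, Z.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = np.clip(Z @ beta, -30.0, 30.0)
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))


def kpca_lrt(X, y, n_components=2):
    """Test a SNP set against case/control status via kernel-PC scores."""
    y = np.asarray(y, dtype=float)
    scores = rbf_kernel_pc_scores(X, n_components)
    scores = scores / scores.std(axis=0)  # rescaling for numerical stability only
    ones = np.ones((len(y), 1))
    ll_null = logistic_loglik(ones, y)                       # intercept only
    ll_full = logistic_loglik(np.hstack([ones, scores]), y)  # intercept + KPCs
    stat = max(2.0 * (ll_full - ll_null), 0.0)
    return stat, float(chi2.sf(stat, df=n_components))


rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(200, 10))  # simulated 0/1/2 genotype codes
status = rng.integers(0, 2, size=200)           # simulated case/control labels
print(kpca_lrt(genotypes, status))
```

Replacing `rbf_kernel_pc_scores` with ordinary PCA scores gives the PCA-LRT comparator under the same assumptions.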
For a large number of SNPs across a candidate region or the whole human genome, the performance of these methods is unsatisfactory. In recent years, sliding-window methods, in which several neighboring SNPs are grouped into a "window", have become a popular strategy for automated GWAS data analysis. In these approaches, the candidate region or the whole genome is divided into many contiguous, overlapping windows, and a gene- or region-based multi-locus association method is then applied within each window. A sliding-window approach can be implemented with a fixed window size or with variable sizes. However, it is not certain whether window sizes that are set in advance or chosen by a specific method are statistically sufficient to achieve optimal detection power. Lin et al. proposed that an exhaustive search of all possible windows of SNPs at the genome level is not only computationally practical but also statistically sufficient to detect common or rare genetic-risk alleles. With the development and widespread application of multiprocessor and multithreading techniques, such "exhaustive" methods have become more feasible in practice. In the present study, under the framework of "exhaustive" search, we first conducted simulations to assess statistical power at different window sizes, and then evaluated performance on real data to test whether the exhaustive strategy can be extended to GWAS data analysis. Results from both the simulations and the real data analysis indicated that power and p-values differed considerably across window sizes. Furthermore, with multiprocessor computing, the proposed exhaustive strategy combined with cluster computing is computationally efficient and feasible for analyzing GWAS data, so it should be more widely adopted in GWAS data analysis.
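The exhaustive scan described above can be sketched as a loop over every contiguous window, with a PCA-based logistic test applied in each one. This is a minimal sketch under stated assumptions, not the thesis implementation: the cap `max_size` on window length, the 80% variance threshold for choosing principal components, the likelihood-ratio form of the logistic test, and all function names are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import chi2


def logistic_loglik(Z, y, n_iter=100, ridge=1e-8):
    """Maximised log-likelihood of a logistic model with design matrix Z."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):  # Newton-Raphson iterations
        eta = np.clip(Z @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None]) + ridge * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, Z.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = np.clip(Z @ beta, -30.0, 30.0)
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))


def pc_scores(G, var_explained=0.8):
    """Standardised scores on the leading PCs reaching `var_explained` variance."""
    Gc = G - G.mean(axis=0)
    U, s, _ = np.linalg.svd(Gc, full_matrices=False)
    var = s ** 2
    if var.sum() <= 1e-12:
        return None  # window without any genotype variation
    cum = np.cumsum(var) / var.sum()
    k = min(int(np.searchsorted(cum, var_explained)) + 1, len(s))
    return U[:, :k] * np.sqrt(G.shape[0])


def exhaustive_window_scan(G, y, max_size=5):
    """Test every contiguous SNP window of 1..max_size markers.

    Returns (start, size, p_value) tuples sorted by p-value.
    """
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = G.shape
    ones = np.ones((n, 1))
    ll_null = logistic_loglik(ones, y)  # intercept-only model, shared by all windows
    results = []
    for size in range(1, min(max_size, m) + 1):
        for start in range(m - size + 1):
            S = pc_scores(G[:, start:start + size])
            if S is None:
                continue
            ll_full = logistic_loglik(np.hstack([ones, S]), y)
            stat = max(2.0 * (ll_full - ll_null), 0.0)
            results.append((start, size, float(chi2.sf(stat, df=S.shape[1]))))
    return sorted(results, key=lambda r: r[2])
```

Each window is tested independently of the others, which is why the scan parallelises naturally across processors or cluster nodes; the p-values returned here are not corrected for the number of windows tested.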
Chapter 4 A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis
For GWAS data sets, it is often of interest to identify SNPs that jointly have an epistatic (interaction) effect on complex diseases. However, most current methods treat the SNP as the unit of association, which leads to several well-known limitations such as multiple testing. Under the gene- or region-based framework, our group has previously proposed a gene-based statistic (the CCU statistic) for detecting gene-gene co-association based on canonical correlation analysis (CCA). When the two genes of interest are unlinked, the co-association between them is equivalent to their interaction effect. The CCU statistic has been shown to perform well in detecting gene-gene co-associations or interactions. However, CCA can detect only linear structure in a data set; if the genomic data contain nonlinear structure, CCA will not be able to detect it. In recent years, kernel CCA (KCCA), a generalization of CCA, has been studied intensively in machine learning, face recognition, and data classification, and has been reported to be successful in many applications. We therefore proposed using KCCA rather than CCA to construct a revised version of the CCU statistic, the kernel CCU (KCCU) statistic, for detecting gene-gene interactions in association studies. Simulation results showed that the power of the KCCU statistic was higher than that of the CCU statistic at all the significance levels, sample sizes, and relative risks considered. Application to the RA data in GAW16 Problem 1 showed that the CCU statistic detected only the interaction between the PTPN22 and C5 genes, while the KCCU statistic identified all the pairwise interactions among the four genes. In summary, the KCCU statistic performed better than the CCU statistic.
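The linear building block that KCCA generalizes can be sketched as follows. The abstract does not give the form of the CCU or KCCU statistic, so the sketch assumes one plausible form: the largest canonical correlation between the two genes is computed separately in cases and in controls, Fisher-transformed, and the difference is scaled by bootstrap variance estimates. That form, the bootstrap settings, and all function names are assumptions made for illustration; a kernel version would replace `first_canonical_corr` with a regularised kernel canonical correlation.

```python
import numpy as np
from scipy.stats import norm


def first_canonical_corr(X, Y):
    """Largest canonical correlation between the columns of X and of Y.

    Assumes both centred matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis of each column space
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))


def fisher_z(r, eps=1e-12):
    """Fisher transform, with r capped just below 1 to keep it finite."""
    return float(np.arctanh(min(r, 1.0 - eps)))


def bootstrap_z(X, Y, n_boot, rng):
    """Fisher-transformed canonical correlations over bootstrap resamples."""
    n = X.shape[0]
    z = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample individuals with replacement
        z[b] = fisher_z(first_canonical_corr(X[idx], Y[idx]))
    return z


def co_association_test(gene_a, gene_b, status, n_boot=200, seed=0):
    """Contrast gene-gene canonical correlation in cases (1) and controls (0)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(gene_a, dtype=float)
    B = np.asarray(gene_b, dtype=float)
    status = np.asarray(status)
    z, var = {}, {}
    for label in (1, 0):
        rows = status == label
        z[label] = fisher_z(first_canonical_corr(A[rows], B[rows]))
        var[label] = bootstrap_z(A[rows], B[rows], n_boot, rng).var(ddof=1)
    u = (z[1] - z[0]) / np.sqrt(var[1] + var[0])  # assumed form of the statistic
    return float(u), float(2.0 * norm.sf(abs(u)))  # two-sided normal p-value
```

Here `gene_a` and `gene_b` are genotype matrices (individuals by SNPs) for the two genes, and `status` holds 1 for cases and 0 for controls.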