This paper uses statistical tests and machine learning to study the association between SNPs (or genes) and diseases (measurable traits). A suitable numerical encoding is first chosen for the SNPs and a corresponding statistical testing procedure is designed; loci associated with the disease or trait are then preliminarily screened by P value. For the screened loci, machine learning methods such as random forest and XGBoost are used to assess the strength of the association between each SNP and the disease or trait from the perspective of out-of-sample prediction. The results show that the framework is effective at screening out SNPs (genes) associated with the disease or trait. Because it considers several classification models, the framework is robust, computationally inexpensive, and allows results to be cross-checked across models. In the future, the framework may also be applied in areas such as finance and social networks.
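The first stage of the framework (numerical encoding followed by P-value screening) can be illustrated with a minimal sketch. The abstract does not say which encoding or which test the authors use, so everything here is an assumption made for illustration: the additive 0/1/2 coding, the Pearson chi-square test on a case/control by genotype table, the threshold `alpha`, and the names `encode_genotypes`, `chi_square_p_value`, `screen_snps`, `snp_assoc` and `snp_null`. The second stage (out-of-sample prediction with random forest or XGBoost) is not shown.

```python
import math

# Assumed additive coding: number of minor alleles carried by each genotype.
ADDITIVE_CODE = {"AA": 0, "Aa": 1, "aA": 1, "aa": 2}


def encode_genotypes(genotypes):
    """Map genotype strings to 0/1/2 under the assumed additive coding."""
    return [ADDITIVE_CODE[g] for g in genotypes]


def chi_square_p_value(codes, labels):
    """Pearson chi-square test of independence between binary case/control
    labels (0 or 1) and genotype codes (0, 1, 2)."""
    table = [[0, 0, 0], [0, 0, 0]]
    for code, label in zip(codes, labels):
        table[label][code] += 1
    n = len(codes)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(3)]
    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    # Degrees of freedom, counting only non-empty rows and columns.
    rows = sum(1 for t in row_totals if t > 0)
    cols = sum(1 for t in col_totals if t > 0)
    df = (rows - 1) * (cols - 1)
    if df == 2:
        # Chi-square survival function has a closed form for 2 df.
        return math.exp(-stat / 2.0)
    if df == 1:
        # For 1 df the survival function reduces to erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(stat / 2.0))
    return 1.0  # degenerate table: no evidence of association


def screen_snps(snp_genotypes, labels, alpha=0.05):
    """Return {snp_id: p_value} for SNPs whose P value falls below alpha."""
    selected = {}
    for snp_id, genotypes in snp_genotypes.items():
        p_value = chi_square_p_value(encode_genotypes(genotypes), labels)
        if p_value < alpha:
            selected[snp_id] = p_value
    return selected


# Synthetic toy data (not from the paper): 50 controls followed by 50 cases.
labels = [0] * 50 + [1] * 50
demo_snps = {
    # Genotype frequencies differ strongly between controls and cases.
    "snp_assoc": ["AA"] * 40 + ["Aa"] * 8 + ["aa"] * 2
                 + ["AA"] * 5 + ["Aa"] * 15 + ["aa"] * 30,
    # Identical genotype frequencies in both groups.
    "snp_null": (["AA"] * 20 + ["Aa"] * 20 + ["aa"] * 10) * 2,
}
selected = screen_snps(demo_snps, labels, alpha=0.05)
```

In a real study with many SNPs, a multiple-testing correction would normally be applied to the threshold; the sketch omits it because the abstract does not describe one.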