This paper uses advanced bioinformatics methods to integrate, for the first time at the whole-genome level, three types of data (gene expression, methylation level, and copy number variation) in order to identify feature genes closely associated with the onset and progression of lung squamous cell carcinoma (LUSC), providing a deeper theoretical basis for explaining its underlying mechanisms and for developing new targeted drugs and therapies. To overcome the impact that the ultra-high dimensionality, high noise, and small sample size of whole-genome data have on the performance of machine learning algorithms, and to guard against information saturation, we combine four feature gene selection methods that assess specificity, relevance, biological function, and contribution to a tumor classification model, and we screen for true feature genes recursively through iterative dimensionality reduction. As a case study, we analyzed gene expression (GE), gene methylation (ME), and copy number variation (CNV) data from stage I-III LUSC patient samples in the TCGA (The Cancer Genome Atlas project) database. The screening yielded 67 GE feature genes, which classified the three sample classes with an average accuracy of 86.29%; 70 ME feature genes, with a corresponding classification accuracy of 90.92%; and 31 CNV feature genes, with a corresponding classification accuracy of 69.16%. Analysis of these three feature gene sets with KEGG (Kyoto Encyclopedia of Genes and Genomes) and IPA (Ingenuity Pathway Analysis), at the metabolic pathway level and the gene regulatory network level, demonstrated that they are closely related at the regulatory level. The results also indicate an important direct relationship between the identified feature genes and LUSC tumor progression, which is important for understanding tumor mechanisms and for developing new targeted therapies.
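The recursive screening idea described above can be illustrated with a small sketch. The abstract does not name the four selection methods, so the code below is not the authors' pipeline: it assumes a univariate F-ratio filter as a stand-in for a specificity criterion, followed by backward elimination driven by the leave-one-out accuracy of a nearest-centroid classifier as a stand-in for "contribution to the classification model". The data are synthetic (not TCGA), and all function names and parameters (`make_toy_data`, `f_ratio`, `loo_accuracy`, `screen_features`, `n_filter`, `n_keep`) are hypothetical.

```python
import random
from statistics import fmean


def make_toy_data(n_per_class=20, n_features=30, n_informative=3, shift=4.0, seed=0):
    """Synthetic stand-in for a stage I/II/III omics matrix (not TCGA data)."""
    rng = random.Random(seed)
    X, y = [], []
    for stage in range(3):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):  # the first columns carry the signal
                row[j] += shift * stage
            X.append(row)
            y.append(stage)
    return X, y


def f_ratio(X, y, j):
    """Between-class over within-class sum of squares for feature j."""
    col = [row[j] for row in X]
    grand = fmean(col)
    between = within = 0.0
    for c in set(y):
        group = [v for v, label in zip(col, y) if label == c]
        m = fmean(group)
        between += len(group) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in group)
    return between / within if within > 0 else 0.0


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for c in sorted(set(y)):
            # Class centroid computed without the held-out sample i.
            members = [X[k] for k in range(len(X)) if k != i and y[k] == c]
            dist = sum((X[i][j] - fmean(r[j] for r in members)) ** 2 for j in features)
            if dist < best_dist:
                best_label, best_dist = c, dist
        correct += best_label == y[i]
    return correct / len(X)


def screen_features(X, y, n_filter=10, n_keep=3):
    """Two-stage screen: univariate filter, then backward elimination."""
    scores = {j: f_ratio(X, y, j) for j in range(len(X[0]))}
    kept = sorted(scores, key=scores.get, reverse=True)[:n_filter]
    while len(kept) > n_keep:
        # Drop the feature whose removal leaves the best accuracy;
        # ties go to the feature with the weakest univariate score.
        def after_removal(j):
            rest = [k for k in kept if k != j]
            return (loo_accuracy(X, y, rest), -scores[j])

        kept.remove(max(kept, key=after_removal))
    return sorted(kept)


if __name__ == "__main__":
    X, y = make_toy_data()
    selected = screen_features(X, y)
    print("selected features:", selected)
    print("leave-one-out accuracy:", loo_accuracy(X, y, selected))
```

The filter stage cheaply shrinks the candidate set before the more expensive elimination loop, whose score depends on which features are still present. A real analysis would also need the selection to be nested inside the cross-validation loop, since scoring features on the same samples used to report accuracy gives optimistic estimates.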