[1] YANG Shuai, GUO Maozu, ZHAO Lingling, et al. The method of 100-kernel weight related genes mining in maize mixed with genetic algorithm and XGboost[J]. CAAI Transactions on Intelligent Systems, 2022, 17(1): 170-180. [doi:10.11992/tis.202105005]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 17
Issue: 2022, No. 1
Pages: 170-180
Column: Artificial Intelligence Deans' Forum
Publication date: 2022-01-05
- Title:
-
The method of 100-kernel weight related genes mining in maize mixed with genetic algorithm and XGboost
- Author(s):
-
YANG Shuai (杨帅)1,2, GUO Maozu (郭茂祖)1,2, ZHAO Lingling (赵玲玲)3, LI Yang (李阳)1,2
-
1. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China;
2. Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, China;
3. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
-
- Keywords:
-
genetic algorithm; eXtreme gradient boosting; machine learning; maize; transcriptome analysis; 100-kernel weight; gene ontology; Kyoto Encyclopedia of Genes and Genomes
- Classification number (CLC):
-
TP391
- DOI:
-
10.11992/tis.202105005
- Abstract:
-
Transcriptome sequencing data obtained by RNA-Seq have high feature dimensionality, so finding phenotype-related genes with traditional bioinformatics methods requires substantial computing resources. Moreover, differential analysis yields a large set of candidate genes, and further screening depends on existing prior knowledge. To address this problem, this paper proposes GA-XGBoost, a transcriptome analysis method that combines a genetic algorithm with XGBoost and uses machine learning to narrow the range of candidate genes for subsequent analysis. Comparative experiments and follow-up analysis of gene associations with the 100-kernel weight trait were carried out on a high-quality maize dataset. The results show that, compared with XGBoost models trained directly on all genes or on the differentially expressed genes, the XGBoost model trained on the candidate genes obtained by the proposed method achieves the lowest MSE in predicting maize 100-kernel weight. Compared with the 1542 differentially expressed genes identified by differential expression analysis, GA-XGBoost reduces the candidate gene set to 48, a 31-fold reduction, indicating that the proposed method can effectively improve the capability and efficiency of transcriptome data analysis.
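The abstract does not spell out how the genetic algorithm and the predictive model interact, so the following is only a minimal sketch of the general idea of wrapper-style feature selection, not the authors' implementation. Each individual is a binary mask over features (genes), and its fitness is the held-out MSE of a model trained on the selected features only. To keep the example dependency-free, a small k-nearest-neighbour regressor stands in for XGBoost, and the data are synthetic. The names `make_toy_data`, `knn_mse`, and `ga_select`, along with all parameter values, are hypothetical.

```python
import random


def make_toy_data(n_samples=60, n_features=12, seed=1):
    """Synthetic stand-in for expression data: only features 0 and 3 drive y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        y = 2.0 * x[0] - 1.5 * x[3] + rng.gauss(0, 0.1)
        rows.append((x, y))
    return rows[:40], rows[40:]  # (train, held-out) split


def knn_mse(mask, train, test, k=3):
    """Stand-in scorer (NOT XGBoost): held-out MSE of a k-NN regressor
    that only sees the features whose mask bit is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")  # an empty feature set cannot predict anything
    total = 0.0
    for x, y in test:
        nearest = sorted(
            train, key=lambda s: sum((s[0][j] - x[j]) ** 2 for j in cols)
        )[:k]
        pred = sum(target for _, target in nearest) / k
        total += (pred - y) ** 2
    return total / len(test)


def ga_select(fitness, n_features, pop_size=20, generations=15, p_mut=0.05, seed=0):
    """Minimise `fitness` over binary masks with a simple elitist GA."""
    rng = random.Random(seed)
    # Seed the population with the all-features mask so the result can
    # never score worse than the "use every feature" baseline.
    pop = [[1] * n_features] + [
        [rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size - 1)
    ]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness)
        history.append(fitness(ranked[0]))  # best score at this generation
        elite = ranked[: pop_size // 2]     # truncation selection, elite kept as-is
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)                        # bit-flip mutation applied
        pop = elite + children
    best = min(pop, key=fitness)
    return best, fitness(best), history


train, test = make_toy_data()
n = len(train[0][0])
score = lambda mask: knn_mse(mask, train, test)

best_mask, best_mse, history = ga_select(score, n)
all_mse = score([1] * n)
selected = [j for j, bit in enumerate(best_mask) if bit]
print("selected features:", selected)
print("MSE with selected features:", best_mse)
print("MSE with all features:", all_mse)
```

Because the all-features mask is in the initial population and the elite is carried over unchanged, the selected subset can never score worse than the full feature set on the split used for fitness. Scoring and selecting on the same held-out split is optimistic, so a real analysis would presumably evaluate the final subset on separate data.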
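Memo
Received: 2021-05-06.
Funding: National Natural Science Foundation of China (62031003, 61871020); Beijing Municipal Universities High-Level Innovation Team Construction Program (IDHT20190506); Sub-project of the National Key Research and Development Program of China (2020YFF0305501); Key Science and Technology Program of the Beijing Municipal Education Commission (KZ201810016019).
Author biographies: YANG Shuai, master's student; main research interests: machine learning and bioinformatics. GUO Maozu, Ph.D., professor and doctoral supervisor; deputy director of the Bioinformatics Technical Committee of the China Computer Federation and standing committee member of the Machine Learning Technical Committee of the Chinese Association for Artificial Intelligence; main research interests: bioinformatics, machine learning, and smart cities; recipient of the Second Prize of the Natural Science Award of the Ministry of Education and the Second Prize of the Wu Wenjun Artificial Intelligence Natural Science Award; has led 10 national-level research projects and published more than 300 academic papers. ZHAO Lingling, associate professor, Ph.D.; main research interests: machine learning, smart cities, and bioinformatics; has led one General Program project, one Young Scientists Fund project, and one Key Program project of the National Natural Science Foundation of China, and has published more than 40 academic papers.
Corresponding author: ZHAO Lingling. E-mail: zhaoll@hit.edu.cn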