[1]朱仕通,董琦.基于二维张量并行策略的大模型加速训练方法[J].智能系统学报,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
 ZHU Shitong,DONG Qi.Accelerated method for training large models based on a 2D tensor parallel strategy[J].CAAI Transactions on Intelligent Systems,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]

基于二维张量并行策略的大模型加速训练方法
Accelerated method for training large models based on a 2D tensor parallel strategy

参考文献/References:
[1] 李蕾, 周延泉, 钟义信. 基于语用的自然语言处理研究与应用初探[J]. 智能系统学报, 2006, 1(2): 1-6.
LI Lei, ZHOU Yanquan, ZHONG Yixin. Pragmatic information based NLP research and application[J]. CAAI transactions on intelligent systems, 2006, 1(2): 1-6.
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30: 5998-6008.
[3] 周飞燕, 金林鹏, 董军. 卷积神经网络研究综述[J]. 计算机学报, 2017, 40(6): 1229-1251.
ZHOU Feiyan, JIN Linpeng, DONG Jun. Review of convolutional neural network[J]. Chinese journal of computers, 2017, 40(6): 1229-1251.
[4] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
[5] LIU Yiheng, HE Hao, HAN Tianle, et al. Understanding LLMs: a comprehensive overview from training to inference[J]. Neurocomputing, 2025, 620: 129190.
[6] MINAEE S, MIKOLOV T, NIKZAD N, et al. Large language models: a survey[EB/OL]. (2024-02-09)[2024-11-21]. https://arxiv.org/abs/2402.06196.
[7] SCHAEFFER R, MIRANDA B, KOYEJO S. Are emergent abilities of large language models a mirage? [EB/OL]. (2023-05-22)[2024-11-21]. https://arxiv.org/abs/2304.15004v2.
[8] WEI J, TAY Y, BOMMASANI R, et al. Emergent abilities of large language models[EB/OL]. (2022-06-15)[2024-11-21]. https://arxiv.org/abs/2206.07682.
[9] KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling laws for neural language models[EB/OL]. (2020-01-23)[2024-11-21]. https://arxiv.org/abs/2001.08361v1.
[10] GORDON M A, DUH K, KAPLAN J. Data and parameter scaling laws for neural machine translation[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana: Association for Computational Linguistics, 2021: 5915-5922.
[11] SEVILLA J, HEIM L, HO A, et al. Compute trends across three eras of machine learning[C]//2022 International Joint Conference on Neural Networks. Padua: IEEE, 2022: 1-8.
[12] LIU Qinghua, JIANG Yuxiang. Dive into big model training[EB/OL]. (2022-07-25)[2024-11-21]. https://arxiv.org/abs/2207.11912v1.
[13] 王楠禔. 基于BERT改进的文本表示模型研究[D]. 重庆: 西南大学, 2019.
WANG Nanzhi. Research on improved text representation model based on BERT[D]. Chongqing: Southwest University, 2019.
[14] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[EB/OL]. (2020-05-28)[2024-11-21]. https://arxiv.org/abs/2005.14165.
[15] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[EB/OL]. (2023-03-15)[2024-11-21]. https://arxiv.org/abs/2303.08774.
[16] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models. [EB/OL]. (2023-02-27)[2024-11-21]. https://arxiv.org/abs/2302.13971.
[17] 李德毅. 网络时代人工智能研究与发展[J]. 智能系统学报, 2009, 4(1): 1-6.
LI Deyi. AI research and development in the network age[J]. CAAI transactions on intelligent systems, 2009, 4(1): 1-6.
[18] 杨春生. 大模型不等于第三次AI浪潮[J]. 智能系统学报, 2023, 18(3): 409.
YANG Chunsheng. Large language models are not the third AI wave[J]. CAAI transactions on intelligent systems, 2023, 18(3): 409.
[19] RAJBHANDARI S, RASLEY J, RUWASE O, et al. ZeRO: memory optimizations toward training trillion parameter models[C]//SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Atlanta: IEEE, 2020: 1-16.
[20] WANG Guanhua, QIN Heyang, JACOBS S A, et al. ZeRO++: extremely efficient collective communication for giant model training[EB/OL]. (2023-06-16)[2024-11-21]. https://arxiv.org/abs/2306.10209v1.
[21] REN J, RAJBHANDARI S, AMINABADI R Y, et al. ZeRO-Offload: democratizing billion-scale model training[C]//2021 USENIX Annual Technical Conference. [S. l.]: USENIX Association, 2021: 551-564.
[22] SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: training multi-billion parameter language models using model parallelism[EB/OL]. (2019-09-17)[2024-11-21]. https://arxiv.org/abs/1909.08053.
[23] NARAYANAN D, SHOEYBI M, CASPER J, et al. Efficient large-scale language model training on GPU clusters using megatron-LM[C]//SC21: International Conference for High Performance Computing, Networking, Storage and Analysis. Saint Louis: Association for Computing Machinery, 2021: 1-14.
[24] NARAYANAN D, PHANISHAYEE A, SHI Kaiyu, et al. Memory-efficient pipeline-parallel DNN training[C]//International Conference on Machine Learning. [S. l.]: PMLR, 2021: 7937-7947.
[25] HARLAP A, NARAYANAN D, PHANISHAYEE A, et al. PipeDream: fast and efficient pipeline parallel DNN training[EB/OL]. (2018-06-08)[2024-11-21]. https://arxiv.org/abs/1806.03377.
[26] NARAYANAN D, HARLAP A, PHANISHAYEE A, et al. PipeDream: generalized pipeline parallelism for DNN training[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville: ACM, 2019: 1-15.
[27] LI Shenggui, XUE Fuzhao, BARANWAL C, et al. Sequence parallelism: long sequence training from system perspective[EB/OL]. (2022-05-21)[2024-11-21]. https://arxiv.org/abs/2105.13120v3.
[28] 李铭. 基于GPU的张量分解及重构方法研究及应用[D]. 成都: 电子科技大学, 2018.
LI Ming. Research and application of tensor decomposition and reconstruction method based on GPU[D]. Chengdu: University of Electronic Science and Technology of China, 2018.
[29] 曹嵘晖, 唐卓, 左知微, 等. 面向机器学习的分布式并行计算关键技术及应用[J]. 智能系统学报, 2021, 16(5): 919-930.
CAO Ronghui, TANG Zhuo, ZUO Zhiwei, et al. Key technologies and applications of distributed parallel computing for machine learning[J]. CAAI transactions on intelligent systems, 2021, 16(5): 919-930.
[30] 舒娜, 刘波, 林伟伟, 等. 分布式机器学习平台与算法综述[J]. 计算机科学, 2019, 46(3): 9-18.
SHU Na, LIU Bo, LIN Weiwei, et al. Survey of distributed machine learning platforms and algorithms[J]. Computer science, 2019, 46(3): 9-18.
相似文献/Similar References:
[1]郭一楠,王斌,巩敦卫,等.实体结构与语义融合的多层注意力知识表示学习[J].智能系统学报,2023,18(3):577.[doi:10.11992/tis.202204026]
 GUO Yinan,WANG Bin,GONG Dunwei,et al.Multi-layer attention knowledge representation learning by integrating entity structure with semantics[J].CAAI Transactions on Intelligent Systems,2023,18(3):577.[doi:10.11992/tis.202204026]
[2]周静,胡怡宇,黄心汉.形状补全引导的Transformer点云目标检测方法[J].智能系统学报,2023,18(4):731.[doi:10.11992/tis.202210038]
 ZHOU Jing,HU Yiyu,HUANG Xinhan.Shape completion-guided Transformer point cloud object detection method[J].CAAI Transactions on Intelligent Systems,2023,18(4):731.[doi:10.11992/tis.202210038]
[3]张少乐,雷涛,王营博,等.基于多尺度金字塔Transformer的人群计数方法[J].智能系统学报,2024,19(1):67.[doi:10.11992/tis.202304044]
 ZHANG Shaole,LEI Tao,WANG Yingbo,et al.A crowd counting network based on multi-scale pyramid Transformer[J].CAAI Transactions on Intelligent Systems,2024,19(1):67.[doi:10.11992/tis.202304044]
[4]程艳,胡建生,赵松华,等.融合Transformer和交互注意力网络的方面级情感分类模型[J].智能系统学报,2024,19(3):728.[doi:10.11992/tis.202303016]
 CHENG Yan,HU Jiansheng,ZHAO Songhua,et al.Aspect-level sentiment classification model combining Transformer and interactive attention network[J].CAAI Transactions on Intelligent Systems,2024,19(3):728.[doi:10.11992/tis.202303016]
[5]邵凯,王明政,王光宇.基于Transformer的多尺度遥感语义分割网络[J].智能系统学报,2024,19(4):920.[doi:10.11992/tis.202304026]
 SHAO Kai,WANG Mingzheng,WANG Guangyu.Transformer-based multiscale remote sensing semantic segmentation network[J].CAAI Transactions on Intelligent Systems,2024,19(4):920.[doi:10.11992/tis.202304026]
[6]刘万军,姜岚,曲海成,等.融合CNN与Transformer的MRI脑肿瘤图像分割[J].智能系统学报,2024,19(4):1007.[doi:10.11992/tis.202301016]
 LIU Wanjun,JIANG Lan,QU Haicheng,et al.MRI brain tumor image segmentation by fusing CNN and Transformer[J].CAAI Transactions on Intelligent Systems,2024,19(4):1007.[doi:10.11992/tis.202301016]
[7]丁贵广,陈辉,王澳,等.视觉深度学习模型压缩加速综述[J].智能系统学报,2024,19(5):1072.[doi:10.11992/tis.202311011]
 DING Guiguang,CHEN Hui,WANG Ao,et al.Review of model compression and acceleration for visual deep learning[J].CAAI Transactions on Intelligent Systems,2024,19(5):1072.[doi:10.11992/tis.202311011]
[8]刘国奇,陈宗玉,刘栋,等.融合边界注意力的特征挖掘息肉小目标网络[J].智能系统学报,2024,19(5):1092.[doi:10.11992/tis.202306025]
 LIU Guoqi,CHEN Zongyu,LIU Dong,et al.A small polyp objects network integrating boundary attention features[J].CAAI Transactions on Intelligent Systems,2024,19(5):1092.[doi:10.11992/tis.202306025]
[9]郝剑龙,刘志斌,张宸,等.基于改进Transformer和超图模型的股票趋势预测方法研究[J].智能系统学报,2024,19(5):1126.[doi:10.11992/tis.202308017]
 HAO Jianlong,LIU Zhibin,ZHANG Chen,et al.Stock trend prediction method based on improved Transformer and hypergraph model[J].CAAI Transactions on Intelligent Systems,2024,19(5):1126.[doi:10.11992/tis.202308017]
[10]黄昱程,肖子旺,武丹凤,等.时空融合与判别力增强的孪生网络目标跟踪方法[J].智能系统学报,2024,19(5):1218.[doi:10.11992/tis.202306005]
 HUANG Yucheng,XIAO Ziwang,WU Danfeng,et al.Spatiotemporal fusion and discriminative augmentation for improved Siamese tracking[J].CAAI Transactions on Intelligent Systems,2024,19(5):1218.[doi:10.11992/tis.202306005]

备注/Memo

Received: 2024-11-21.
About the authors: ZHU Shitong, master's degree candidate, whose main research interests are neural networks, natural language processing, and large language models. E-mail: zst555157@163.com. DONG Qi, senior engineer, Ph.D., whose main research interests are intelligent gaming, autonomous planning, and multi-agent system control. Recipient of the Excellent Doctoral Dissertation Award of the Chinese Institute of Electronics (2017) and the Wu Wenjun Artificial Intelligence Outstanding Youth Award (2019). E-mail: dongqiouc@123.com.
Corresponding author: DONG Qi. E-mail: dongqiouc@123.com

更新日期/Last Update: 2025-09-05