[1]朱仕通,董琦.基于二维张量并行策略的大模型加速训练方法[J].智能系统学报,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
ZHU Shitong,DONG Qi.Accelerated method for training large models based on a 2D tensor parallel strategy[J].CAAI Transactions on Intelligent Systems,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
《智能系统学报》(CAAI Transactions on Intelligent Systems) [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1256-1265
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2025-09-05
Title: Accelerated method for training large models based on a 2D tensor parallel strategy
Author(s): ZHU Shitong (朱仕通), DONG Qi (董琦)
Affiliation: China Academy of Electronics and Information Technology, Beijing 100043, China
Keywords: Transformer; tensor parallel; attention mechanism; natural language processing; artificial intelligence; pretraining; distributed training; distributed communication
CLC number: TP339
DOI: 10.11992/tis.202411023
Abstract: Recent advances in language modeling show that large pretrained models based on the Transformer architecture deliver excellent performance in natural language processing applications. However, limited GPU memory makes training large language models (LLMs) challenging: conventional tensor parallelism requires a single GPU to store all activation values, so it cannot break through the memory bottleneck. To relieve the GPU memory constraint on LLM training and improve training efficiency, this paper proposes a two-dimensional tensor parallelism method (TP2D). TP2D partitions both the input data and the parameter matrices into blocks and distributes them across four GPUs, and it uses distributed communication for high-speed data exchange among the GPUs, enabling truly distributed parallel training. Using GPT-2 as the benchmark model, the soft scaling efficiency and training efficiency of the two training methods were evaluated. Experimental results show that, with four GPUs, 2D tensor parallelism trains 1.84 times faster than conventional tensor parallelism, achieves a soft scaling efficiency of 86%, and reduces memory consumption.
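The abstract describes splitting both the activations and the parameter matrices into blocks over four GPUs and combining partial results through collective communication. The following is a minimal, hedged sketch of a SUMMA-style 2D block matrix multiply on a 2x2 process grid using PyTorch's torch.distributed; the 2x2 grid, the CPU "gloo" backend, and all function and variable names (summa_2d_matmul, worker, the 4x4 toy matrices) are illustrative assumptions chosen so the example is self-contained and runnable on one machine, not the authors' TP2D implementation.

# Illustrative sketch only: SUMMA-style 2D tensor-parallel matmul on a
# 2x2 process grid (4 processes, mirroring the paper's 4-GPU setting).
# Not the authors' code; shapes, names, and the gloo backend are assumptions.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

Q = 2  # grid side length: Q x Q = 4 processes


def summa_2d_matmul(x_block, w_block, row_group, col_group, row, col):
    """Compute this rank's block of Y = X @ W via SUMMA broadcasts."""
    y_block = torch.zeros(x_block.shape[0], w_block.shape[1])
    for k in range(Q):
        # Broadcast X[row][k] along the process row ...
        x_k = x_block.clone() if col == k else torch.empty_like(x_block)
        dist.broadcast(x_k, src=row * Q + k, group=row_group)
        # ... and W[k][col] along the process column.
        w_k = w_block.clone() if row == k else torch.empty_like(w_block)
        dist.broadcast(w_k, src=k * Q + col, group=col_group)
        y_block += x_k @ w_k  # accumulate the local partial product
    return y_block


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    row, col = divmod(rank, Q)
    # Row and column sub-groups; every rank must create all groups in the same order.
    row_groups = [dist.new_group([r * Q + c for c in range(Q)]) for r in range(Q)]
    col_groups = [dist.new_group([r * Q + c for r in range(Q)]) for c in range(Q)]

    torch.manual_seed(0)
    X = torch.randn(4, 4)   # full activations (kept here only to check the result)
    W = torch.randn(4, 4)   # full weight matrix
    bs = 4 // Q             # block size along each dimension
    x_block = X[row * bs:(row + 1) * bs, col * bs:(col + 1) * bs].contiguous()
    w_block = W[row * bs:(row + 1) * bs, col * bs:(col + 1) * bs].contiguous()

    y_block = summa_2d_matmul(x_block, w_block,
                              row_groups[row], col_groups[col], row, col)

    # Check against the full product; in real training each rank would hold
    # only its own blocks, which is what lowers per-GPU memory.
    expected = (X @ W)[row * bs:(row + 1) * bs, col * bs:(col + 1) * bs]
    assert torch.allclose(y_block, expected, atol=1e-5), "block mismatch"
    print(f"rank {rank} (row {row}, col {col}) block OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(Q * Q,), nprocs=Q * Q)

In this sketch each process stores only one block of the activations and one block of the weights, which illustrates how a 2D partitioning can reduce per-GPU memory relative to the 1D tensor parallelism discussed in the abstract.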
Last update: 2025-09-05