[1]ZHU Shitong,DONG Qi.Accelerated method for training large models based on a 2D tensor parallel strategy[J].CAAI Transactions on Intelligent Systems,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1256-1265
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2025-09-05
- Title: Accelerated method for training large models based on a 2D tensor parallel strategy
- Author(s): ZHU Shitong; DONG Qi
- Affiliation: China Academy of Electronics and Information Technology, Beijing 100043, China
- Keywords: Transformer; tensor parallel; attention mechanism; natural language processing; artificial intelligence; pretraining; distributed training; distributed communication
- CLC: TP339
- DOI: 10.11992/tis.202411023
- Abstract: Recent advances in language modeling have shown that large pretrained models based on the Transformer architecture deliver exceptional performance in natural language processing tasks. However, training large language models (LLMs) remains challenging because of the limited memory capacity of GPUs. Traditional tensor parallelism requires a single GPU to store all activation values, making it difficult to overcome this memory bottleneck. To relieve the GPU memory constraint on LLM training and improve training efficiency, this paper proposes a two-dimensional tensor parallelism method (TP2D). TP2D partitions both the input data and the parameter matrices across multiple GPUs and relies on distributed communication for high-speed data exchange between them, enabling truly distributed parallel training and alleviating the memory constraint. GPT-2 was used as the benchmark model to evaluate the soft scaling efficiency and training efficiency of the two training methods. Experimental results show that, on 4 GPUs, the training speed of 2D tensor parallelism is 1.84 times that of conventional tensor parallelism, with a soft scaling efficiency of 86% and reduced memory consumption.
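The abstract describes TP2D as partitioning both the input and the parameter matrices across a grid of GPUs. The snippet below is a minimal single-process sketch of that idea, not the paper's implementation: it tiles an activation matrix X and a weight matrix W over a hypothetical 2×2 device grid and accumulates each output block in SUMMA style, then checks that the blockwise result matches the dense matmul. The grid size q = 2 and all names (block, Y_blocks) are illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes; a 2x2 logical device grid stands in for 4 GPUs.
q = 2
n, d, h = 8, 8, 8          # input rows, hidden dim, output dim (all divisible by q)

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))   # activations
W = rng.standard_normal((d, h))   # weight matrix

def block(M, i, j, q):
    """Return the (i, j) block of M when M is tiled into a q x q grid."""
    r, c = M.shape[0] // q, M.shape[1] // q
    return M[i * r:(i + 1) * r, j * c:(j + 1) * c]

# SUMMA-style 2D-parallel matmul: the device at grid position (i, j) owns
# X_{ij} and W_{ij} and accumulates Y_{ij} = sum_k X_{ik} @ W_{kj}. In a real
# multi-GPU run the X_{ik} / W_{kj} blocks would arrive via row/column
# broadcasts; here we simply index the full matrices to keep the sketch
# single-process.
Y_blocks = [[np.zeros((n // q, h // q)) for _ in range(q)] for _ in range(q)]
for i in range(q):
    for j in range(q):
        for k in range(q):
            Y_blocks[i][j] += block(X, i, k, q) @ block(W, k, j, q)

Y = np.block(Y_blocks)
# The 2D-partitioned result reproduces the dense product X @ W.
assert np.allclose(Y, X @ W)
print("2D tensor-parallel blocks reproduce X @ W:", np.allclose(Y, X @ W))
```

In an actual distributed run, the direct indexing into X and W would be replaced by block exchanges over a communication backend (e.g., broadcasts along the rows and columns of the device grid), which is the high-speed data exchange between GPUs that the abstract refers to.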