[1]朱仕通,董琦.基于二维张量并行策略的大模型加速训练方法[J].智能系统学报,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
ZHU Shitong,DONG Qi.Accelerated method for training large models based on a 2D tensor parallel strategy[J].CAAI Transactions on Intelligent Systems,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
《智能系统学报》(CAAI Transactions on Intelligent Systems) [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1256-1265
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2025-09-05
Title: Accelerated method for training large models based on a 2D tensor parallel strategy
Author(s): ZHU Shitong (朱仕通), DONG Qi (董琦)
Affiliation: China Academy of Electronics and Information Technology, Beijing 100043, China
Keywords: Transformer; tensor parallel; attention mechanism; natural language processing; artificial intelligence; pretraining; distributed training; distributed communication
CLC number: TP339
DOI: 10.11992/tis.202411023
Abstract: Recent advances in language modeling show that large pretrained models based on the Transformer architecture deliver excellent performance in natural language processing applications. However, limited GPU memory makes training large language models (LLMs) challenging: conventional tensor parallelism requires a single GPU to store all activation values, so it cannot break through the memory bottleneck. To relieve the GPU memory constraint on LLM training and improve training efficiency, this paper proposes a two-dimensional tensor parallelism method (TP2D). TP2D partitions both the input data and the parameter matrices into blocks and distributes them across four GPUs, and it uses distributed communication for high-speed data exchange among the GPUs, enabling truly distributed parallel training. Using GPT-2 as the benchmark model, the soft scaling efficiency and training efficiency of the two training methods were evaluated. Experimental results show that, with four GPUs, 2D tensor parallelism trains 1.84 times faster than conventional tensor parallelism, achieves a soft scaling efficiency of 86%, and reduces memory consumption.
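The abstract describes splitting both the activations and the parameter matrices into blocks over four GPUs and combining partial results through collective communication. The following is a minimal, hedged sketch of a SUMMA-style 2D block matrix multiply on a 2x2 process grid using PyTorch's torch.distributed; the 2x2 grid, the CPU "gloo" backend, and all function and variable names (summa_2d_matmul, worker, the 4x4 toy matrices) are illustrative assumptions chosen so the example is self-contained and runnable on one machine, not the authors' TP2D implementation.

# Illustrative sketch only: SUMMA-style 2D tensor-parallel matmul on a
# 2x2 process grid (4 processes, mirroring the paper's 4-GPU setting).
# Not the authors' code; shapes, names, and the gloo backend are assumptions.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

Q = 2  # grid side length: Q x Q = 4 processes


def summa_2d_matmul(x_block, w_block, row_group, col_group, row, col):
    """Compute this rank's block of Y = X @ W via SUMMA broadcasts."""
    y_block = torch.zeros(x_block.shape[0], w_block.shape[1])
    for k in range(Q):
        # Broadcast X[row][k] along the process row ...
        x_k = x_block.clone() if col == k else torch.empty_like(x_block)
        dist.broadcast(x_k, src=row * Q + k, group=row_group)
        # ... and W[k][col] along the process column.
        w_k = w_block.clone() if row == k else torch.empty_like(w_block)
        dist.broadcast(w_k, src=k * Q + col, group=col_group)
        y_block += x_k @ w_k  # accumulate the local partial product
    return y_block


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    row, col = divmod(rank, Q)
    # Row and column sub-groups; every rank must create all groups in the same order.
    row_groups = [dist.new_group([r * Q + c for c in range(Q)]) for r in range(Q)]
    col_groups = [dist.new_group([r * Q + c for r in range(Q)]) for c in range(Q)]

    torch.manual_seed(0)
    X = torch.randn(4, 4)   # full activations (kept here only to check the result)
    W = torch.randn(4, 4)   # full weight matrix
    bs = 4 // Q             # block size along each dimension
    x_block = X[row * bs:(row + 1) * bs, col * bs:(col + 1) * bs].contiguous()
    w_block = W[row * bs:(row + 1) * bs, col * bs:(col + 1) * bs].contiguous()

    y_block = summa_2d_matmul(x_block, w_block,
                              row_groups[row], col_groups[col], row, col)

    # Check against the full product; in real training each rank would hold
    # only its own blocks, which is what lowers per-GPU memory.
    expected = (X @ W)[row * bs:(row + 1) * bs, col * bs:(col + 1) * bs]
    assert torch.allclose(y_block, expected, atol=1e-5), "block mismatch"
    print(f"rank {rank} (row {row}, col {col}) block OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(Q * Q,), nprocs=Q * Q)

In this sketch each process stores only one block of the activations and one block of the weights, which illustrates how a 2D partitioning can reduce per-GPU memory relative to the 1D tensor parallelism discussed in the abstract.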
Last update: 2025-09-05