[1]ZHU Shitong,DONG Qi.Accelerated method for training large models based on a 2D tensor parallel strategy[J].CAAI Transactions on Intelligent Systems,2025,20(5):1256-1265.[doi:10.11992/tis.202411023]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1256-1265
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2025-09-05
- Title: Accelerated method for training large models based on a 2D tensor parallel strategy
- Author(s): ZHU Shitong; DONG Qi
- Affiliation: China Academy of Electronics and Information Technology, Beijing 100043, China
- Keywords: Transformer; tensor parallel; attention mechanism; natural language processing; artificial intelligence; pretraining; distributed training; distributed communication
- CLC: TP339
- DOI: 10.11992/tis.202411023
- Abstract: Recent advances in language modeling have shown that large pretrained models based on the Transformer architecture deliver exceptional performance in natural language processing tasks. However, training large language models (LLMs) remains challenging because of the limited memory capacity of GPUs. Traditional tensor parallelism requires a single GPU to store all activation values, making it difficult to overcome this memory bottleneck. To relieve the GPU memory constraint on LLM training and improve training efficiency, this paper proposes a two-dimensional tensor parallelism method (TP2D). TP2D partitions both the input data and the parameter matrices across multiple GPUs and relies on distributed communication for high-speed data exchange between them, enabling truly distributed parallel training and alleviating the memory constraint. GPT-2 was used as the benchmark model to evaluate the soft scaling efficiency and training efficiency of the two training methods. Experimental results show that, on 4 GPUs, the training speed of 2D tensor parallelism is 1.84 times that of conventional tensor parallelism, with a soft scaling efficiency of 86% and reduced memory consumption.
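The abstract describes TP2D as partitioning both the input and the parameter matrices across a grid of GPUs. The snippet below is a minimal single-process sketch of that idea, not the paper's implementation: it tiles an activation matrix X and a weight matrix W over a hypothetical 2×2 device grid and accumulates each output block in SUMMA style, then checks that the blockwise result matches the dense matmul. The grid size q = 2 and all names (block, Y_blocks) are illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes; a 2x2 logical device grid stands in for 4 GPUs.
q = 2
n, d, h = 8, 8, 8          # input rows, hidden dim, output dim (all divisible by q)

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))   # activations
W = rng.standard_normal((d, h))   # weight matrix

def block(M, i, j, q):
    """Return the (i, j) block of M when M is tiled into a q x q grid."""
    r, c = M.shape[0] // q, M.shape[1] // q
    return M[i * r:(i + 1) * r, j * c:(j + 1) * c]

# SUMMA-style 2D-parallel matmul: the device at grid position (i, j) owns
# X_{ij} and W_{ij} and accumulates Y_{ij} = sum_k X_{ik} @ W_{kj}. In a real
# multi-GPU run the X_{ik} / W_{kj} blocks would arrive via row/column
# broadcasts; here we simply index the full matrices to keep the sketch
# single-process.
Y_blocks = [[np.zeros((n // q, h // q)) for _ in range(q)] for _ in range(q)]
for i in range(q):
    for j in range(q):
        for k in range(q):
            Y_blocks[i][j] += block(X, i, k, q) @ block(W, k, j, q)

Y = np.block(Y_blocks)
# The 2D-partitioned result reproduces the dense product X @ W.
assert np.allclose(Y, X @ W)
print("2D tensor-parallel blocks reproduce X @ W:", np.allclose(Y, X @ W))
```

In an actual distributed run, the direct indexing into X and W would be replaced by block exchanges over a communication backend (e.g., broadcasts along the rows and columns of the device grid), which is the high-speed data exchange between GPUs that the abstract refers to.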