[1] ZHU Shitong, DONG Qi. Accelerated method for training large models based on a 2D tensor parallel strategy[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1256-1265. [doi: 10.11992/tis.202411023]

Accelerated method for training large models based on a 2D tensor parallel strategy
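As context for the title, the sketch below illustrates what a 2D tensor parallel strategy generally refers to: partitioning a matrix multiplication over a logical p x p device grid and accumulating block products, in the style of SUMMA. This is only a hedged, single-process NumPy simulation under assumed sizes (p = 2, n = 8); it is not the method proposed in the cited paper.

```python
# Minimal SUMMA-style sketch of 2D tensor-parallel matrix multiplication,
# simulated in one process. Illustrative only; not the paper's algorithm.
import numpy as np

p = 2            # logical device grid is p x p (assumed for illustration)
n = 8            # global matrix size, divisible by p
b = n // p       # block size owned by each "device"

rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# "Device" (i, j) owns the (b x b) blocks A[i, j] and B[i, j].
A_blk = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
B_blk = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
C_blk = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]

# At step k, block column k of A is broadcast along grid rows and block row k
# of B along grid columns; each device accumulates its local partial product.
for k in range(p):
    for i in range(p):
        for j in range(p):
            C_blk[i][j] += A_blk[i][k] @ B_blk[k][j]

C = np.block(C_blk)
assert np.allclose(C, A @ B)  # block-parallel result matches the dense product
print("2D (SUMMA-style) block matmul matches the reference result")
```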

References:
[1] LI Lei, ZHOU Yanquan, ZHONG Yixin. Pragmatic information based NLP research and application[J]. CAAI transactions on intelligent systems, 2006, 1(2): 1-6.
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30: 5998-6008.
[3] ZHOU Feiyan, JIN Linpeng, DONG Jun. Review of convolutional neural network[J]. Chinese journal of computers, 2017, 40(6): 1229-1251.
[4] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
[5] LIU Yiheng, HE Hao, HAN Tianle, et al. Understanding LLMs: a comprehensive overview from training to inference[J]. Neurocomputing, 2025, 620: 129190.
[6] MINAEE S, MIKOLOV T, NIKZAD N, et al. Large language models: a survey[EB/OL]. (2024-02-09)[2024-11-21]. https://arxiv.org/abs/2402.06196.
[7] SCHAEFFER R, MIRANDA B, KOYEJO S. Are emergent abilities of large language models a mirage?[EB/OL]. (2023-05-22)[2024-11-21]. https://arxiv.org/abs/2304.15004v2.
[8] WEI J, TAY Y, BOMMASANI R, et al. Emergent abilities of large language models[EB/OL]. (2022-06-15)[2024-11-21]. https://arxiv.org/abs/2206.07682.
[9] KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling laws for neural language models[EB/OL]. (2020-01-23)[2024-11-21]. https://arxiv.org/abs/2001.08361v1.
[10] GORDON M A, DUH K, KAPLAN J. Data and parameter scaling laws for neural machine translation[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana: Association for Computational Linguistics, 2021: 5915-5922.
[11] SEVILLA J, HEIM L, HO A, et al. Compute trends across three eras of machine learning[C]//2022 International Joint Conference on Neural Networks. Padua: IEEE, 2022: 1-8.
[12] LIU Qinghua, JIANG Yuxiang. Dive into big model training[EB/OL]. (2022-07-25)[2024-11-21]. https://arxiv.org/abs/2207.11912v1.
[13] WANG Nanzhi. Research on improved text representation model based on BERT[D]. Chongqing: Southwest University, 2019.
[14] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[EB/OL]. (2020-05-28)[2024-11-21]. https://arxiv.org/abs/2005.14165.
[15] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[EB/OL]. (2023-03-15)[2024-11-21]. https://arxiv.org/abs/2303.08774.
[16] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models[EB/OL]. (2023-02-27)[2024-11-21]. https://arxiv.org/abs/2302.13971.
[17] LI Deyi. AI research and development in the network age[J]. CAAI transactions on intelligent systems, 2009, 4(1): 1-6.
[18] YANG Chunsheng. Large language models are not the third AI wave[J]. CAAI transactions on intelligent systems, 2023, 18(3): 409.
[19] RAJBHANDARI S, RASLEY J, RUWASE O, et al. ZeRO: memory optimizations toward training trillion parameter models[C]//SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Atlanta: IEEE, 2020: 1-16.
[20] WANG Guanhua, QIN Heyang, JACOBS S A, et al. ZeRO++: extremely efficient collective communication for giant model training[EB/OL]. (2023-06-16)[2024-11-21]. https://arxiv.org/abs/2306.10209v1.
[21] REN J, RAJBHANDARI S, AMINABADI R Y, et al. ZeRO-Offload: democratizing billion-scale model training[C]//2021 USENIX Annual Technical Conference. [S. l.]: USENIX Association, 2021: 551-564.
[22] SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: training multi-billion parameter language models using model parallelism[EB/OL]. (2019-09-17)[2024-11-21]. https://arxiv.org/abs/1909.08053.
[23] NARAYANAN D, SHOEYBI M, CASPER J, et al. Efficient large-scale language model training on GPU clusters using megatron-LM[C]//SC21: International Conference for High Performance Computing, Networking, Storage and Analysis. Saint Louis: Association for Computing Machinery, 2021: 1-14.
[24] NARAYANAN D, PHANISHAYEE A, SHI Kaiyu, et al. Memory-efficient pipeline-parallel DNN training[C]//International Conference on Machine Learning. [S. l.]: PMLR, 2021: 7937-7947.
[25] HARLAP A, NARAYANAN D, PHANISHAYEE A, et al. PipeDream: fast and efficient pipeline parallel DNN training[EB/OL]. (2018-06-08)[2024-11-21]. https://arxiv.org/abs/1806.03377.
[26] NARAYANAN D, HARLAP A, PHANISHAYEE A, et al. PipeDream: generalized pipeline parallelism for DNN training[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville: ACM, 2019: 1-15.
[27] LI Shenggui, XUE Fuzhao, BARANWAL C, et al. Sequence parallelism: long sequence training from system perspective[EB/OL]. (2022-05-21)[2024-11-21]. https://arxiv.org/abs/2105.13120v3.
[28] LI Ming. Research and application of tensor decomposition and reconstruction method based on GPU[D]. Chengdu: University of Electronic Science and Technology of China, 2018.
[29] CAO Ronghui, TANG Zhuo, ZUO Zhiwei, et al. Key technologies and applications of distributed parallel computing for machine learning[J]. CAAI transactions on intelligent systems, 2021, 16(5): 919-930.
[30] SHU Na, LIU Bo, LIN Weiwei, et al. Survey of distributed machine learning platforms and algorithms[J]. Computer science, 2019, 46(3): 9-18.