MA Xiang, SHEN Guowei, GUO Chun, et al. Dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning[J]. CAAI Transactions on Intelligent Systems, 2023, 18(5): 1099-1107. [doi:10.11992/tis.202209024]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 18
Issue: 2023, No. 5
Pages: 1099-1107
Section: Academic Papers - Machine Learning
Publication date: 2023-09-05
Title: Dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning
Author(s): MA Xiang, SHEN Guowei, GUO Chun, CUI Yunhe, CHEN Yi
Affiliation: College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
Keywords: heterogeneous clusters; machine learning; data parallelism; distributed training; parameter servers; stragglers; stale gradients; large-scale deep learning
CLC number: TP18
DOI: 10.11992/tis.202209024
Abstract: Distributed machine learning has become a common technique for training complex models in artificial intelligence because of its excellent parallel capability. However, GPUs are upgraded and replaced rapidly, so distributed machine learning on heterogeneous clusters is the new normal for data centers and research institutions. Differences in training speed across heterogeneous nodes make it difficult for existing parallel methods to balance the costs of synchronization waiting and stale gradients, which significantly reduces overall training efficiency. To address this problem, a node-state-based method called dynamic adaptive synchronous parallel (DASP) is proposed. A parameter server dynamically manages the state information of each node during training and partitions the nodes into parallel states; each node's parallel state is then adjusted adaptively according to its state information, reducing both the time fast nodes spend waiting to synchronize with the global model parameters and the generation of stale gradients, thereby speeding up convergence. Experimental results on public datasets show that DASP reduces convergence time by 16.9% to 82.1% compared with mainstream methods and makes the training process more stable.
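The abstract describes DASP only at a high level, so the sketch below is an illustrative Python simulation rather than the paper's algorithm: a parameter server records each node's iteration count and lets nodes push updates asynchronously until their lead over the slowest node reaches a bound, at which point they wait. The bounded-lead rule, the class name DASPStyleParameterServer, and the max_lead parameter are assumptions standing in for DASP's adaptive state partitioning.

```python
import random
import threading
import time


class DASPStyleParameterServer:
    """Toy parameter server that tracks each node's progress and decides,
    per update, whether a node proceeds asynchronously or must wait.
    The bounded-lead rule is an assumed stand-in for DASP's adaptive
    state partitioning, which the abstract does not fully specify."""

    def __init__(self, num_nodes: int, max_lead: int = 4):
        self.clock = [0] * num_nodes      # completed iterations per node
        self.max_lead = max_lead          # how far a fast node may run ahead
        self.params = 0.0                 # stand-in for the global model parameters
        self.cond = threading.Condition()

    def push_gradient(self, node_id: int, grad: float, lr: float = 0.1) -> None:
        """Apply one SGD-style update; block only nodes that ran too far ahead."""
        with self.cond:
            # Adaptive rule (assumption): a node whose lead over the slowest
            # node reaches max_lead is treated as "fast" and waits (synchronous
            # behaviour); all other nodes update immediately (asynchronous
            # behaviour), limiting both waiting time and gradient staleness.
            while self.clock[node_id] - min(self.clock) >= self.max_lead:
                self.cond.wait()
            self.params -= lr * grad
            self.clock[node_id] += 1
            self.cond.notify_all()        # wake any fast nodes that were waiting

    def pull_params(self) -> float:
        with self.cond:
            return self.params


def worker(server: DASPStyleParameterServer, node_id: int, steps: int, delay: float) -> None:
    """Simulate one heterogeneous node: slower nodes have a larger delay."""
    for _ in range(steps):
        time.sleep(delay)                 # compute time for one mini-batch
        grad = random.gauss(0.0, 1.0)     # fake gradient
        server.push_gradient(node_id, grad)


if __name__ == "__main__":
    server = DASPStyleParameterServer(num_nodes=2, max_lead=4)
    # Node 0 is roughly five times faster than node 1.
    threads = [threading.Thread(target=worker, args=(server, i, 20, d))
               for i, d in enumerate([0.01, 0.05])]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final params:", server.pull_params(), "clocks:", server.clock)
```

In this toy run, only the fast node ever waits, and only when its lead reaches the bound, so slow nodes never idle and no gradient can become arbitrarily stale. Per the abstract, DASP's contribution is choosing such per-node behaviour dynamically from observed node state rather than from a fixed bound like the one assumed here.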