[1] MA Xiang, SHEN Guowei, GUO Chun, et al. Dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning[J]. CAAI Transactions on Intelligent Systems, 2023, 18(5): 1099-1107. [doi:10.11992/tis.202209024]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 18
Issue: 2023(5)
Pages: 1099-1107
Column: Academic Papers - Machine Learning
Publication date: 2023-09-05
- Title: Dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning
- Author(s): MA Xiang; SHEN Guowei; GUO Chun; CUI Yunhe; CHEN Yi
- Affiliation: College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
- Keywords: heterogeneous clusters; machine learning; data parallel; distributed training; parameter servers; stragglers; stale gradient; large-scale deep learning
- CLC: TP18
- DOI: 10.11992/tis.202209024
- Abstract: Distributed machine learning has become a common technique for training complex artificial intelligence models because of its excellent parallelism. However, GPU hardware is upgraded rapidly, so clusters often mix devices of different generations, and distributed machine learning on such heterogeneous clusters is increasingly adopted by data centers and research institutions. The difference in training speed between heterogeneous nodes makes it difficult for existing parallel strategies to balance the cost of synchronization waits against that of stale gradients, which considerably reduces overall training efficiency. To address this problem, a node-state-based dynamic adaptive parallel strategy, dynamic adaptive synchronous parallel (DASP), is proposed. A parameter server dynamically manages the state information of the nodes during training and partitions the nodes into parallel states; each node's parallel state is then adjusted adaptively according to its state information, which reduces the time fast nodes spend waiting for synchronized global model parameters and limits the generation of stale gradients, thereby accelerating convergence. Experimental results on publicly available datasets show that DASP not only reduces convergence time by 16.9% to 82.1% compared with mainstream strategies but also makes the training process more stable.