[1]MA Xiang,SHEN Guowei,GUO Chun,et al.Dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning[J].CAAI Transactions on Intelligent Systems,2023,18(5):1099-1107.[doi:10.11992/tis.202209024]

Dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning

References:
[1] SZEGEDY C, LIU Wei, JIA Yangqing, et al. Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1-9.
[2] SZEGEDY C, TOSHEV A, ERHAN D. Deep neural networks for object detection[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. New York: ACM, 2013: 2553-2561.
[3] YE Zhengzhe, CANG Yan. A pedestrian detection method based on convolutional neural network[J]. Applied science and technology, 2022, 49(2): 55-62.
[4] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT. Minneapolis: Association for Computational Linguistics, 2019: 4171-4186.
[5] DOU Yonggan, YUAN Xiaotong. Federated learning with implicit stochastic gradient descent optimization[J]. CAAI transactions on intelligent systems, 2022, 17(3): 488-495.
[6] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. New York: Curran Associates Inc, 2019: 8026-8037.
[7] ABADI M, BARHAM P, CHEN Jianmin, et al. TensorFlow: a system for large-scale machine learning[C]//Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. New York: ACM, 2016: 265-283.
[8] CAO Ronghui, TANG Zhuo, ZUO Zhiwei, et al. Key technologies and applications of distributed parallel computing for machine learning[J]. CAAI transactions on intelligent systems, 2021, 16(5): 919-930.
[9] WANG Shuai, LI Dan. Research progress on network performance optimization of distributed machine learning system[J]. Chinese journal of computers, 2022, 45(7): 1384-1411.
[10] MIAO Xupeng, NIE Xiaonan, SHAO Yingxia, et al. Heterogeneity-aware distributed machine learning training via partial reduce[C]//Proceedings of the 2021 International Conference on Management of Data. New York: ACM, 2021: 2262-2270.
[11] SHU Na, LIU Bo, LIN Weiwei, et al. Survey of distributed machine learning platforms and algorithms[J]. Computer science, 2019, 46(3): 9-18.
[12] FAN Wenfei, HE Kun, LI Qian, et al. Graph algorithms: parallelization and scalability[J]. Science China information sciences, 2020, 63(10): 203101.
[13] JIANG Jiawei, CUI Bin, ZHANG Ce, et al. Heterogeneity-aware distributed parameter servers[C]//Proceedings of the 2017 ACM International Conference on Management of Data. New York: ACM, 2017: 463-478.
[14] ZHU Hongrui, YUAN Guojun, YAO Chengji, et al. Survey on network of distributed deep learning training[J]. Journal of computer research and development, 2021, 58(1): 98-115.
[15] XU Ning, CUI Bin, CHEN Lei, et al. Heterogeneous environment aware streaming graph partitioning[J]. IEEE transactions on knowledge and data engineering, 2015, 27(6): 1560-1572.
[16] HO Q, CIPAR J, CUI Henggang, et al. More effective distributed ML via a stale synchronous parallel parameter server[J]. Advances in neural information processing systems, 2013: 1223-1231.
[17] LI M, ANDERSEN D G, SMOLA A, et al. Communication efficient distributed machine learning with the parameter server[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Massachusetts: MIT Press, 2014: 19-27.
[18] ZHAO Xing, AN Aijun, LIU Junfeng, et al. Dynamic stale synchronous parallel distributed training for deep learning[C]//2019 IEEE 39th International Conference on Distributed Computing Systems. Dallas: IEEE, 2019: 1507-1517.
[19] FAN Wenfei, LU Ping, YU Wenyuan, et al. Adaptive asynchronous parallelization of graph algorithms[J]. ACM transactions on database systems, 2020, 45(2): 1-45.
[20] WANG Endong, YAN Ruidong, GUO Zhenhua, et al. A survey of distributed training system and its optimization algorithms[J/OL]. Chinese journal of computers, 2023: 1-29. (2023-04-06)[2023-05-01]. https://kns.cnki.net/kcms/detail/11.1826.tp.20230404.1510.002.html.
[21] CHEN Jianmin, PAN Xinghao, MONGA R, et al. Revisiting distributed synchronous SGD[EB/OL]. (2017-03-21)[2022-07-11]. https://arxiv.org/abs/1604.00981.
[22] TENG M, WOOD F. Bayesian distributed stochastic gradient descent[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018: 6380-6390.
[23] SUN Haifeng, GUI Zhiyi, GUO Song, et al. GSSP: eliminating stragglers through grouping synchronous for distributed deep learning in heterogeneous cluster[J]. IEEE transactions on cloud computing, 2022, 10(4): 2637-2648.
[24] HARLAP A, CUI Henggang, DAI Wei, et al. Addressing the straggler problem for iterative convergent parallel ML[C]//Proceedings of the Seventh ACM Symposium on Cloud Computing. New York: ACM, 2016: 98-111.
[25] XU Hongfei, VAN GENABITH J, XIONG Deyi, et al. Dynamically adjusting transformer batch size by monitoring gradient direction change[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 3519-3524.
[26] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[27] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10)[2022-07-11]. https://arxiv.org/abs/1409.1556.

Copyright © CAAI Transactions on Intelligent Systems