[1] CAO Ronghui, TANG Zhuo, ZUO Zhiwei, et al. Key technologies and applications of distributed parallel computing for machine learning[J]. CAAI Transactions on Intelligent Systems, 2021, 16(5): 919-930. [doi:10.11992/tis.202108010]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 16
Issue: 2021, No. 5
Pages: 919-930
Column: Wu Wenjun Artificial Intelligence Science and Technology Progress Award, First Prize
Publication date: 2021-09-05
- Title: Key technologies and applications of distributed parallel computing for machine learning
- Author(s): CAO Ronghui (1, 2); TANG Zhuo (1, 2); ZUO Zhiwei (1, 2); ZHANG Xuedong (1, 2)
  1. College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
  2. National Supercomputer Center in Changsha, Changsha 410082, China
- Keywords: machine learning; distributed computing; skew data; task space-time scheduling; resource management; energy-saving scheduling; cross-domain resource migration; parallel optimization; graph iteration algorithm; intelligent analysis system
- CLC: TP18
- DOI: 10.11992/tis.202108010
- Abstract:
The computation and iteration processes of machine learning and similar algorithms are becoming increasingly complex, and sufficient computing power is key to deploying artificial intelligence applications effectively. To address this, this paper first proposes a space-time task scheduling algorithm suited to distributed heterogeneous environments with skewed data, which improves the average efficiency of tasks such as machine learning model training. Second, it proposes an efficient resource management system and an energy-saving scheduling algorithm for distributed heterogeneous environments, which realize cross-domain migration of computing resources based on dynamic prediction, together with dynamic voltage/frequency regulation, and thereby reduce the overall energy consumption of the system. Third, it constructs a distributed heterogeneous optimization environment adapted to the iteration of machine learning/deep learning algorithms and proposes a basic method for distributed parallel optimization of machine learning and graph iteration algorithms. Finally, an intelligent analysis system for domain-oriented applications is developed and deployed in manufacturing, transportation, education, healthcare, and other fields, addressing the performance bottlenecks common in efficient data collection, storage, cleaning, fusion, and intelligent analysis.