
[1] YIN Changsheng, YANG Ruopeng, ZHU Wei, et al. A survey on multi-agent hierarchical reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 646-655. [doi:10.11992/tis.201909027]


A survey on multi-agent hierarchical reinforcement learning


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 15 Issue: 2020, No. 4 Pages: 646-655 Column: Survey Publication date: 2020-07-05

Title:: A survey on multi-agent hierarchical reinforcement learning

Author(s):: YIN Changsheng, YANG Ruopeng, ZHU Wei, ZOU Xiaofei, LI Feng; School of Information and Communication, National University of Defense Technology, Wuhan 430010, Hubei, China

Keywords:: artificial intelligence; machine learning; reinforcement learning; multi-agent; survey; deep learning; hierarchical reinforcement learning; application status

CLC number:: TP18

DOI:: 10.11992/tis.201909027

Abstract:: As an important branch of machine learning and artificial intelligence, multi-agent hierarchical reinforcement learning (MAHRL) combines, in a general-purpose form, the collaborative capability of multi-agent systems (MAS) with the decision-making capability of reinforcement learning (RL). By decomposing a complex RL problem into several sub-problems and solving each one separately, it can effectively address the curse of dimensionality. This makes MAHRL a potential approach to intelligent decision making in large-scale, complex settings. This paper first describes the main technologies involved in MAHRL: reinforcement learning (RL), the semi-Markov decision process (SMDP), and multi-agent reinforcement learning (MARL). Then, from the perspective of hierarchy, it reviews the algorithmic principles and research status of four categories of MAHRL methods: option-based, hierarchical abstract machine (HAM)-based, value function decomposition (MAXQ)-based, and end-to-end. Finally, it summarizes the application status of MAHRL in robot control, game decision making, and mission planning.



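Memo

Received: 2019-09-10.
Funding: National Social Science Fund of China (2019-SKJJ-C-083)
About the authors: YIN Changsheng, lecturer, Ph.D.; main research interests: machine learning and intelligent decision making; has published more than 20 academic papers and 3 monographs. YANG Ruopeng, professor and doctoral supervisor; main research interest: intelligent command; in recent years has received one first prize and two third prizes of the Military Science and Technology Progress Award, and has published more than 40 academic papers and more than 10 monographs. ZHU Wei, associate professor; main research interests: machine learning and intelligent decision making.
Corresponding author: YIN Changsheng. E-mail: yincs1989@163.com

Last Update: 2020-07-25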
