<-上一篇/Previous Article 下一篇/Next Article->

[1]杨瑞,严江鹏,李秀.强化学习稀疏奖励算法研究——理论与实验[J].智能系统学报,2020,15(5):888-899.[doi:10.11992/tis.202003031]
　YANG Rui,YAN Jiangpeng,LI Xiu.Survey of sparse reward algorithms in reinforcement learning — theory and experiment[J].CAAI Transactions on Intelligent Systems,2020,15(5):888-899.[doi:10.11992/tis.202003031]

点击复制

强化学习稀疏奖励算法研究——理论与实验

PDF下载 HTML

《智能系统学报》[ISSN 1673-4785/CN 23-1538/TP] 卷: 15 期数: 2020年第5期页码: 888-899 栏目: 学术论文—智能系统出版日期: 2020-09-05

Title:: Survey of sparse reward algorithms in reinforcement learning — theory and experiment

作者:: 杨瑞¹, 严江鹏¹, 李秀^1,2; 1. 清华大学自动化系，北京 100084;
2. 清华大学深圳国际研究生院，广东深圳 518055

Author(s):: YANG Rui¹, YAN Jiangpeng¹, LI Xiu^1,2; 1. Department of Automation, Tsinghua University, Beijing 100084, China;
2. Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China

关键词:: 强化学习; 深度强化学习; 机器学习; 稀疏奖励; 神经网络; 人工智能; 深度学习

Keywords:: reinforcement learning; deep reinforcement learning; machine learning; sparse reward; neural networks; artificial intelligence; deep learning

分类号:: TP181

DOI:: 10.11992/tis.202003031

文献标志码:: A

摘要:: 近年来，强化学习在游戏、机器人控制等序列决策领域都获得了巨大的成功，但是大量实际问题中奖励信号十分稀疏，导致智能体难以从与环境的交互中学习到最优的策略，这一问题被称为稀疏奖励问题。稀疏奖励问题的研究能够促进强化学习实际应用与落地，在强化学习理论研究中具有重要意义。本文调研了稀疏奖励问题的研究现状，以外部引导信息为线索，分别介绍了奖励塑造、模仿学习、课程学习、事后经验回放、好奇心驱动、分层强化学习等方法。本文在稀疏奖励环境Fetch Reach上实现了以上6类方法的代表性算法进行实验验证和比较分析。使用外部引导信息的算法平均表现好于无外部引导信息的算法，但是后者对数据的依赖性更低，两类方法均具有重要的研究意义。最后，本文对稀疏奖励算法研究进行了总结与展望。

Abstract:: In recent years, reinforcement learning has achieved great success in a range of sequential decision-making applications such as games and robotic control. However, the reward signals are very sparse in many real-world situations, which makes it difficult for agents to determine an optimal strategy based on interaction with the environment. This problem is called the sparse reward problem. Research on sparse reward can advance both the theory and actual applications of reinforcement learning. We investigated the current research status of the sparse reward problem and used the external information as the clue to introduce the following six classes of algorithms: reward shaping, imitation learning, curriculum learning, hindsight experience replay, curiosity-driven algorithms, and hierarchical reinforcement learning. To conduct experiments in the sparse reward environment Fetch Reach, we implemented typical algorithms from the above six classes, followed by thorough comparison and analysis of the results. Algorithms that utilize external information were found to outperform those without external information, but the latter are less dependent on data. Both methods have great research significance. At last, summarize the current sparse reward algorithms and forecast future work.

参考文献/References:: [1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge, USA: MIT Press, 1998.
[2] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. 2nd ed. Cambridge: MIT Press, 2018.
[3] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[4] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[5] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning[EB/OL]. California, USA: arXiv, 2019. [2019-10-1] https://arxiv.org/pdf/1912.06680.pdf.
[6] SILVER D. Tutorial: Deep reinforcement learning[C]//Proc. of the 33rd Int. Conf. on Machine Learning (ICML 2016). 2016.
[7] LI Yuxi. Deep reinforcement learning: An overview[EB/OL]. Alberta, Canada: arXiv, 2017. [2019-10-2] https://arxiv.org/pdf/ 1701.07274.pdf.
[8] LI Yuxi. Deep reinforcement learning[EB/OL]. Alberta, Canada: arXiv, 2018. [2019-10-5] https://arxiv.org/pdf/1810.06339.pdf.
[9] Riedmiller M, Hafner R, Lampe T, et al. Learning by playing-solving sparse reward tasks from scratch[EB/OL]. London, UK: arXiv, 2018. [2019-10-20] https://arxiv.org/pdf/1802.10567.pdf.
[10] HOSU I A, REBEDEA T. Playing atari games with deep reinforcement learning and human checkpoint replay[EB/OL]. Bucharest, Romania: arXiv, 2016. [2019-10-21] https://arxiv.org/pdf/1607.05077.pdf.
[11] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al. Hindsight experience replay[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 5048-5058.
[12] 杨惟轶, 白辰甲, 蔡超, 等. 深度强化学习中稀疏奖励问题研究综述[J]. 计算机科学, 2020, 47(3): 182-191
YANG Weiyi, BAI Chenjia, CAI Chao, et al. Survey on sparse reward in deep reinforcement learning[J]. Computer science, 2020, 47(3): 182-191
[13] GULLAPALLI V, BARTO A G. Shaping as a method for accelerating reinforcement learning[C]//Proceedings of the 1992 IEEE International Symposium on Intelligent Control. Glasgow, UK, 1992: 554-559.
[14] HUSSEIN A, GABER M M, ELYAN E, et al. Imitation learning: A survey of learning methods[J]. ACM computing surveys, 2017, 50(2): 1-35.
[15] BENGIO Y, LOURADOUR J, COLLOBERT R, et al. Curriculum learning[C]//Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada, 2009: 41-48.
[16] BURDA Y, EDWARDS H, PATHAK D, et al. Large-scale study of curiosity-driven learning[EB/OL]. California, USA: arXiv, 2018. [2019-10-30] https://arxiv.org/pdf/1808.04355.
[17] 周文吉, 俞扬. 分层强化学习综述[J]. 智能系统学报, 2017, 12(5): 590-594
ZHOU Wenji, YU Yang. Summarize of hierarchical reinforcement learning[J]. CAAI transactions on intelligent systems, 2017, 12(5): 590-594
[18] Plappert M, Andrychowicz M, Ray A, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research[EB/OL]. California, USA: arXiv, 2018. [2019-11-1] https://arxiv.org/pdf/1802.09464.pdf.
[19] 万里鹏, 兰旭光, 张翰博, 等. 深度强化学习理论及其应用综述[J]. 模式识别与人工智能, 2019, 32(1): 67-81
WAN Lipeng, LAN Xuguang, ZHANG Hanbo, et al. A review of deep reinforcement learning theory and application[J]. Pattern recognition and artificial intelligence, 2019, 32(1): 67-81
[20] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing atari with deep reinforcement learning[EB/OL]. London, UK: arXiv, 2013. [2019-11-1] https://arxiv.org/pdf/1312.5602.pdf.
[21] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[22] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine learning, 1992, 8(3/4): 229-256.
[23] KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]//Advances in Neural Information Processing Systems. Colorado, USA, 2000: 1008-1014.
[24] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York, USA, 2016: 1928-1937.
[25] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. California, USA: arXiv, 2017. [2019-11-3] https://arxiv.org/pdf/1707.06347.pdf.
[26] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[EB/OL]. London, UK: arXiv, 2015. [2019-12-25] https://arxiv.org/pdf/1509.02971.pdf.
[27] NG A Y, HARADA D, RUSSELL S. Policy invariance under reward transformations: Theory and application to reward shaping[C]//Proceedings of the Sixteenth International Conference on Machine Learning. Bled, Slovenia, 1999, 99: 278-287.
[28] RANDL?V J, ALSTR?M P. Learning to drive a bicycle using reinforcement learning and shaping[C]//Proceedings of the Fifteenth International Conference on Machine Learning. Madison, USA, 1998, 98: 463-471.
[29] JAGODNIK K M, THOMAS P S, VAN DEN BOGERT A J, et al. Training an actor-critic reinforcement learning controller for arm movement using human-generated rewards[J]. IEEE transactions on neural systems and rehabilitation engineering, 2017, 25(10): 1892-1905.
[30] FERREIRA E, LEFèVRE F. Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management[C]//2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic, 2013: 108-113.
[31] NG A Y, RUSSELL S J. Algorithms for inverse reinforcement learning[C]//Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, USA, 2000, 1: 663-670.
[32] MARTHI B. Automatic shaping and decomposition of reward functions[C]//Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA, 2007: 601-608.
[33] ROSS S, BAGNELL D. Efficient reductions for imitation learning[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Sardinia, Italy, 2010: 661-668.
[34] NAIR A, MCGREW B, ANDRYCHOWICZ M, et al. Overcoming exploration in reinforcement learning with demonstrations[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, QLD, Australia, 2018: 6292-6299.
[35] HO J, ERMON S. Generative adversarial imitation learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 4565-4573.
[36] LIU Yuxuan, GUPTA A, ABBEEL P, et al. Imitation from observation: Learning to imitate behaviors from raw video via context translation[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia, 2018: 1118-1125.
[37] TORABI F, WARNELL G, STONE P. Behavioral cloning from observation[EB/OL]. Texas, USA: arXiv, 2018. [2019-11-1] https://arxiv.org/pdf/1805.01954.pdf.
[38] ELMAN J L. Learning and development in neural networks: The importance of starting small[J]. Cognition, 1993, 48(1): 71-99.
[39] GRAVES A, BELLEMARE M G, MENICK J, et al. Automated curriculum learning for neural networks[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, Australia, 2017: 1311-1320.
[40] OpenAI, AKKAYA I, ANDRYCHOWICZ M, et al. Solving rubik’s cube with a robot hand[EB/OL]. California, USA: arXiv, 2019. [2019-11-2] https://arxiv.org/pdf/1910.07113.pdf.
[41] LANKA S, WU Tianfu. ARCHER: aggressive rewards to counter bias in hindsight experience replay[EB/OL]. NC, USA: arXiv, 2018. [2019-12-3] https://arxiv.org/pdf/1809.02070.
[42] MANELA B, BIESS A. Bias-reduced hindsight experience replay with virtual goal prioritization[EB/OL]. BeerSheva, Israel: arXiv, 2019. [2019-12-3] https://arxiv.org/pdf/1905.05498.pdf.
[43] RAUBER P, UMMADISINGU A, MUTZ F, et al. Hindsight policy gradients[EB/OL]. London, UK: arXiv, 2017. [2019-11-2] https://arxiv.org/pdf/1711.06006.pdf.
[44] SCHMIDHUBER J. A possibility for implementing curiosity and boredom in model-building neural controllers[C]//Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats. Cambridge, USA, 1991: 222-227.
[45] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA, 2017: 16-17.
[46] BELLEMARE M, SRINIVASAN S, OSTROVSKI G, et al. Unifying count-based exploration and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 1471-1479.
[47] STREHL A L, LITTMAN M L. An analysis of model-based interval estimation for Markov decision processes[J]. Journal of computer and system sciences, 2008, 74(8): 1309-1331.
[48] TANG Haoran, HOUTHOOFT R, FOOTE D, et al. # exploration: A study of count-based exploration for deep reinforcement learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 2753-2762.
[49] BURDA Y, EDWARDS H, STORKEY A, et al. Exploration by random network distillation[EB/OL]. California, USA: arXiv, 2018. [2019-5-20] https://arxiv.org/pdf/1810.12894.pdf.
[50] STADIE B C, LEVINE S, ABBEEL P. Incentivizing exploration in reinforcement learning with deep predictive models[EB/OL]. California, USA: arXiv, 2015. [2019-5-2] https://arxiv.org/pdf/1507.00814.pdf.
[51] KINGMA D P, WELLING M. Auto-encoding variational bayes[EB/OL]. Amsterdam, Netherlands: arXiv, 2013. [2019-2-2] https://arxiv.org/pdf/1312.6114.pdf.
[52] RAFATI J, NOELLE D C. Learning representations in model-free hierarchical reinforcement learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu, USA, 2019, 33: 10009-10010.
[53] SUTTON R S, PRECUP D, SINGH S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning[J]. Artificial intelligence, 1999, 112(1-2): 181-211.
[54] KULKARNI T D, NARASIMHAN K R, SAEEDI A, et al. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 3675-3683.
[55] BACON P L, HARB J, PRECUP D. The option-critic architecture[C]//Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, USA, 2017.
[56] FRANS K, HO J, CHEN X, et al. Meta learning shared hierarchies[EB/OL]. California, USA: arXiv, 2017. [2019-11-15] https://arxiv.org/pdf/1710.09767, 2017.
[57] VEZHNEVETS A S, OSINDERO S, SCHAUL T, et al. Feudal networks for hierarchical reinforcement learning[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, Australia, 2017: 3540-3549.
[58] NACHUM O, GU Shixiang, LEE H, et al. Data-efficient hierarchical reinforcement learning[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada, 2018: 3303-3313.
[59] LEVY A, KONIDARIS G, PLATT R, et al. Learning multi-level hierarchies with hindsight[EB/OL]. RI, USA: arXiv, 2017. [2019-12-16] https://arxiv.org/pdf/1712.00948.pdf.
[60] SCHAUL T, HORGAN D, GREGOR K, et al. Universal value function approximators[C]//Proceedings of the 32nd International Conference on International Conference on Machine Learning. Lille, France, 2015: 1312-1320.
[61] SUKHBAATAR S, LIN Zeming, KOSTRIKOV I, et al. Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL]. NY, USA: arXiv, 2017. [2019-12-11] https://arxiv.org/pdf/1703.05407.pdf.
[62] JABRI A, HSU K, EYSENBACH B, et al. Unsupervised curricula for visual meta-reinforcement learning[EB/OL]. California, USA: arXiv, 2019. [2019-12-21] https://arxiv.org/pdf/1912.04226.pdf.
[63] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 5998-6008.
[64] SAHNI H, BUCKLEY T, ABBEEL P, et al. Visual hindsight experience replay[EB/OL]. GA, USA: arXiv, 2019. [2019-10-12] https://arxiv.org/pdf/1901.11529.pdf.
[65] SUKHBAATAR S, DENTON E, SZLAM A, et al. Learning goal embeddings via self-play for hierarchical reinforcement learning[EB/OL]. NY, USA: arXiv, 2018. [2019-11-21] https://arxiv.org/pdf/1811.09083.pdf.
[66] LANIER J B, MCALEER S, BALDI P. Curiosity-driven multi-criteria hindsight experience replay[EB/OL]. California, USA: arXiv, 2019. [2019-12-13] https://arxiv.org/pdf/1906.03710.pdf.

相似文献/References:: [1]连传强,徐昕,吴军,等.面向资源分配问题的Q-CF多智能体强化学习[J].智能系统学报,2011,6(2):95.
　LIAN Chuanqiang,XU Xin,WU Jun,et al.Q-CF multiAgent reinforcement learningfor resource allocation problems[J].CAAI Transactions on Intelligent Systems,2011,6():95.
[2]梁爽,曹其新,王雯珊,等.基于强化学习的多定位组件自动选择方法[J].智能系统学报,2016,11(2):149.[doi:10.11992/tis.201510031]
　LIANG Shuang,CAO Qixin,WANG Wenshan,et al.An automatic switching method for multiple location components based on reinforcement learning[J].CAAI Transactions on Intelligent Systems,2016,11():149.[doi:10.11992/tis.201510031]
[3]张文旭,马磊,王晓东.基于事件驱动的多智能体强化学习研究[J].智能系统学报,2017,12(1):82.[doi:10.11992/tis.201604008]
　ZHANG Wenxu,MA Lei,WANG Xiaodong.Reinforcement learning for event-triggered multi-agent systems[J].CAAI Transactions on Intelligent Systems,2017,12():82.[doi:10.11992/tis.201604008]
[4]张文旭,马磊,贺荟霖,等.强化学习的地-空异构多智能体协作覆盖研究[J].智能系统学报,2018,13(2):202.[doi:10.11992/tis.201609017]
　ZHANG Wenxu,MA Lei,HE Huilin,et al.Air-ground heterogeneous coordination for multi-agent coverage based on reinforced learning[J].CAAI Transactions on Intelligent Systems,2018,13():202.[doi:10.11992/tis.201609017]
[5]徐鹏,谢广明,文家燕,等.事件驱动的强化学习多智能体编队控制[J].智能系统学报,2019,14(1):93.[doi:10.11992/tis.201807010]
　XU Peng,XIE Guangming,WEN Jiayan,et al.Event-triggered reinforcement learning formation control for multi-agent[J].CAAI Transactions on Intelligent Systems,2019,14():93.[doi:10.11992/tis.201807010]
[6]郭宪,方勇纯.仿生机器人运动步态控制：强化学习方法综述[J].智能系统学报,2020,15(1):152.[doi:10.11992/tis.201907052]
　GUO Xian,FANG Yongchun.Locomotion gait control for bionic robots: a review of reinforcement learning methods[J].CAAI Transactions on Intelligent Systems,2020,15():152.[doi:10.11992/tis.201907052]
[7]申翔翔,侯新文,尹传环.深度强化学习中状态注意力机制的研究[J].智能系统学报,2020,15(2):317.[doi:10.11992/tis.201809033]
　SHEN Xiangxiang,HOU Xinwen,YIN Chuanhuan.State attention in deep reinforcement learning[J].CAAI Transactions on Intelligent Systems,2020,15():317.[doi:10.11992/tis.201809033]
[8]殷昌盛,杨若鹏,朱巍,等.多智能体分层强化学习综述[J].智能系统学报,2020,15(4):646.[doi:10.11992/tis.201909027]
　YIN Changsheng,YANG Ruopeng,ZHU Wei,et al.A survey on multi-agent hierarchical reinforcement learning[J].CAAI Transactions on Intelligent Systems,2020,15():646.[doi:10.11992/tis.201909027]
[9]莫宏伟,田朋.基于注意力融合的图像描述生成方法[J].智能系统学报,2020,15(4):740.[doi:10.11992/tis.201910039]
　MO Hongwei,TIAN Peng.An image caption generation method based on attention fusion[J].CAAI Transactions on Intelligent Systems,2020,15():740.[doi:10.11992/tis.201910039]
[10]赵玉新,杜登辉,成小会,等.基于强化学习的海洋移动观测网络观测路径规划方法[J].智能系统学报,2022,17(1):192.[doi:10.11992/tis.202106004]
　ZHAO Yuxin,DU Denghui,CHENG Xiaohui,et al.Path planning for mobile ocean observation network based on reinforcement learning[J].CAAI Transactions on Intelligent Systems,2022,17():192.[doi:10.11992/tis.202106004]
[11]周文吉,俞扬.分层强化学习综述[J].智能系统学报,2017,12(5):590.[doi:10.11992/tis.201706031]
　ZHOU Wenji,YU Yang.Summarize of hierarchical reinforcement learning[J].CAAI Transactions on Intelligent Systems,2017,12():590.[doi:10.11992/tis.201706031]
[12]王作为,徐征,张汝波,等.记忆神经网络在机器人导航领域的应用与研究进展[J].智能系统学报,2020,15(5):835.[doi:10.11992/tis.202002020]
　WANG Zuowei,XU Zheng,ZHANG Rubo,et al.Research progress and application of memory neural network in robot navigation[J].CAAI Transactions on Intelligent Systems,2020,15():835.[doi:10.11992/tis.202002020]

备注/Memo

收稿日期:2020-03-19。
基金项目:国家自然科学基金项目（41876098）
作者简介:杨瑞,硕士研究生,主要研究方向为机器学习与强化学习;严江鹏,博士研究生,主要研究方向为人工智能与计算机视觉;李秀,教授,博士生导师,主要研究方向为智能系统、数据挖掘与模式识别。主持完成国家自然科学基金项目3项、深圳市基础研究项目2项、深圳市技术开发项目1项;参与完成国家863项目4项;目前在研863重大项目1项,国家自然科学基金项目1项。获得国家发明专利授权7项,国家软件著作权5项。发表学术论文100余篇
通讯作者:李秀.E-mail:li.xiu@sz.tsinghua.edu.cn

更新日期/Last Update: 2021-01-15

强化学习稀疏奖励算法研究——理论与实验 PDF下载HTML

备注/Memo

强化学习稀疏奖励算法研究——理论与实验

PDF下载 HTML