YANG Rui, YAN Jiangpeng, LI Xiu. Survey of sparse reward algorithms in reinforcement learning — theory and experiment[J]. CAAI Transactions on Intelligent Systems, 2020, 15(5): 888-899. [doi:10.11992/tis.202003031]

Survey of sparse reward algorithms in reinforcement learning — theory and experiment

References:
[1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge, USA: MIT Press, 1998.
[2] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. 2nd ed. Cambridge, USA: MIT Press, 2018.
[3] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[4] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[5] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning[EB/OL]. California, USA: arXiv, 2019. [2019-10-1] https://arxiv.org/pdf/1912.06680.pdf.
[6] SILVER D. Tutorial: deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). New York, USA, 2016.
[7] LI Yuxi. Deep reinforcement learning: An overview[EB/OL]. Alberta, Canada: arXiv, 2017. [2019-10-2] https://arxiv.org/pdf/1701.07274.pdf.
[8] LI Yuxi. Deep reinforcement learning[EB/OL]. Alberta, Canada: arXiv, 2018. [2019-10-5] https://arxiv.org/pdf/1810.06339.pdf.
[9] RIEDMILLER M, HAFNER R, LAMPE T, et al. Learning by playing - solving sparse reward tasks from scratch[EB/OL]. London, UK: arXiv, 2018. [2019-10-20] https://arxiv.org/pdf/1802.10567.pdf.
[10] HOSU I A, REBEDEA T. Playing atari games with deep reinforcement learning and human checkpoint replay[EB/OL]. Bucharest, Romania: arXiv, 2016. [2019-10-21] https://arxiv.org/pdf/1607.05077.pdf.
[11] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al. Hindsight experience replay[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 5048-5058.
[12] YANG Weiyi, BAI Chenjia, CAI Chao, et al. Survey on sparse reward in deep reinforcement learning[J]. Computer science, 2020, 47(3): 182-191. (in Chinese)
[13] GULLAPALLI V, BARTO A G. Shaping as a method for accelerating reinforcement learning[C]//Proceedings of the 1992 IEEE International Symposium on Intelligent Control. Glasgow, UK, 1992: 554-559.
[14] HUSSEIN A, GABER M M, ELYAN E, et al. Imitation learning: A survey of learning methods[J]. ACM computing surveys, 2017, 50(2): 1-35.
[15] BENGIO Y, LOURADOUR J, COLLOBERT R, et al. Curriculum learning[C]//Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada, 2009: 41-48.
[16] BURDA Y, EDWARDS H, PATHAK D, et al. Large-scale study of curiosity-driven learning[EB/OL]. California, USA: arXiv, 2018. [2019-10-30] https://arxiv.org/pdf/1808.04355.pdf.
[17] ZHOU Wenji, YU Yang. Summarize of hierarchical reinforcement learning[J]. CAAI transactions on intelligent systems, 2017, 12(5): 590-594. (in Chinese)
[18] PLAPPERT M, ANDRYCHOWICZ M, RAY A, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research[EB/OL]. California, USA: arXiv, 2018. [2019-11-1] https://arxiv.org/pdf/1802.09464.pdf.
[19] WAN Lipeng, LAN Xuguang, ZHANG Hanbo, et al. A review of deep reinforcement learning theory and application[J]. Pattern recognition and artificial intelligence, 2019, 32(1): 67-81. (in Chinese)
[20] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing atari with deep reinforcement learning[EB/OL]. London, UK: arXiv, 2013. [2019-11-1] https://arxiv.org/pdf/1312.5602.pdf.
[21] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[22] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine learning, 1992, 8(3/4): 229-256.
[23] KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]//Advances in Neural Information Processing Systems. Colorado, USA, 2000: 1008-1014.
[24] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York, USA, 2016: 1928-1937.
[25] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. California, USA: arXiv, 2017. [2019-11-3] https://arxiv.org/pdf/1707.06347.pdf.
[26] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[EB/OL]. London, UK: arXiv, 2015. [2019-12-25] https://arxiv.org/pdf/1509.02971.pdf.
[27] NG A Y, HARADA D, RUSSELL S. Policy invariance under reward transformations: Theory and application to reward shaping[C]//Proceedings of the Sixteenth International Conference on Machine Learning. Bled, Slovenia, 1999, 99: 278-287.
[28] RANDLØV J, ALSTRØM P. Learning to drive a bicycle using reinforcement learning and shaping[C]//Proceedings of the Fifteenth International Conference on Machine Learning. Madison, USA, 1998, 98: 463-471.
[29] JAGODNIK K M, THOMAS P S, VAN DEN BOGERT A J, et al. Training an actor-critic reinforcement learning controller for arm movement using human-generated rewards[J]. IEEE transactions on neural systems and rehabilitation engineering, 2017, 25(10): 1892-1905.
[30] FERREIRA E, LEFÈVRE F. Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management[C]//2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic, 2013: 108-113.
[31] NG A Y, RUSSELL S J. Algorithms for inverse reinforcement learning[C]//Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, USA, 2000, 1: 663-670.
[32] MARTHI B. Automatic shaping and decomposition of reward functions[C]//Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA, 2007: 601-608.
[33] ROSS S, BAGNELL D. Efficient reductions for imitation learning[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Sardinia, Italy, 2010: 661-668.
[34] NAIR A, MCGREW B, ANDRYCHOWICZ M, et al. Overcoming exploration in reinforcement learning with demonstrations[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia, 2018: 6292-6299.
[35] HO J, ERMON S. Generative adversarial imitation learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 4565-4573.
[36] LIU Yuxuan, GUPTA A, ABBEEL P, et al. Imitation from observation: Learning to imitate behaviors from raw video via context translation[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia, 2018: 1118-1125.
[37] TORABI F, WARNELL G, STONE P. Behavioral cloning from observation[EB/OL]. Texas, USA: arXiv, 2018. [2019-11-1] https://arxiv.org/pdf/1805.01954.pdf.
[38] ELMAN J L. Learning and development in neural networks: The importance of starting small[J]. Cognition, 1993, 48(1): 71-99.
[39] GRAVES A, BELLEMARE M G, MENICK J, et al. Automated curriculum learning for neural networks[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, Australia, 2017: 1311-1320.
[40] OpenAI, AKKAYA I, ANDRYCHOWICZ M, et al. Solving Rubik's cube with a robot hand[EB/OL]. California, USA: arXiv, 2019. [2019-11-2] https://arxiv.org/pdf/1910.07113.pdf.
[41] LANKA S, WU Tianfu. ARCHER: aggressive rewards to counter bias in hindsight experience replay[EB/OL]. North Carolina, USA: arXiv, 2018. [2019-12-3] https://arxiv.org/pdf/1809.02070.pdf.
[42] MANELA B, BIESS A. Bias-reduced hindsight experience replay with virtual goal prioritization[EB/OL]. Beer Sheva, Israel: arXiv, 2019. [2019-12-3] https://arxiv.org/pdf/1905.05498.pdf.
[43] RAUBER P, UMMADISINGU A, MUTZ F, et al. Hindsight policy gradients[EB/OL]. London, UK: arXiv, 2017. [2019-11-2] https://arxiv.org/pdf/1711.06006.pdf.
[44] SCHMIDHUBER J. A possibility for implementing curiosity and boredom in model-building neural controllers[C]//Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats. Cambridge, USA, 1991: 222-227.
[45] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA, 2017: 16-17.
[46] BELLEMARE M, SRINIVASAN S, OSTROVSKI G, et al. Unifying count-based exploration and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 1471-1479.
[47] STREHL A L, LITTMAN M L. An analysis of model-based interval estimation for Markov decision processes[J]. Journal of computer and system sciences, 2008, 74(8): 1309-1331.
[48] TANG Haoran, HOUTHOOFT R, FOOTE D, et al. #Exploration: A study of count-based exploration for deep reinforcement learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 2753-2762.
[49] BURDA Y, EDWARDS H, STORKEY A, et al. Exploration by random network distillation[EB/OL]. California, USA: arXiv, 2018. [2019-5-20] https://arxiv.org/pdf/1810.12894.pdf.
[50] STADIE B C, LEVINE S, ABBEEL P. Incentivizing exploration in reinforcement learning with deep predictive models[EB/OL]. California, USA: arXiv, 2015. [2019-5-2] https://arxiv.org/pdf/1507.00814.pdf.
[51] KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. Amsterdam, Netherlands: arXiv, 2013. [2019-2-2] https://arxiv.org/pdf/1312.6114.pdf.
[52] RAFATI J, NOELLE D C. Learning representations in model-free hierarchical reinforcement learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu, USA, 2019, 33: 10009-10010.
[53] SUTTON R S, PRECUP D, SINGH S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning[J]. Artificial intelligence, 1999, 112(1-2): 181-211.
[54] KULKARNI T D, NARASIMHAN K R, SAEEDI A, et al. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 3675-3683.
[55] BACON P L, HARB J, PRECUP D. The option-critic architecture[C]//Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, USA, 2017.
[56] FRANS K, HO J, CHEN X, et al. Meta learning shared hierarchies[EB/OL]. California, USA: arXiv, 2017. [2019-11-15] https://arxiv.org/pdf/1710.09767.pdf.
[57] VEZHNEVETS A S, OSINDERO S, SCHAUL T, et al. Feudal networks for hierarchical reinforcement learning[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, Australia, 2017: 3540-3549.
[58] NACHUM O, GU Shixiang, LEE H, et al. Data-efficient hierarchical reinforcement learning[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada, 2018: 3303-3313.
[59] LEVY A, KONIDARIS G, PLATT R, et al. Learning multi-level hierarchies with hindsight[EB/OL]. Rhode Island, USA: arXiv, 2017. [2019-12-16] https://arxiv.org/pdf/1712.00948.pdf.
[60] SCHAUL T, HORGAN D, GREGOR K, et al. Universal value function approximators[C]//Proceedings of the 32nd International Conference on International Conference on Machine Learning. Lille, France, 2015: 1312-1320.
[61] SUKHBAATAR S, LIN Zeming, KOSTRIKOV I, et al. Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL]. New York, USA: arXiv, 2017. [2019-12-11] https://arxiv.org/pdf/1703.05407.pdf.
[62] JABRI A, HSU K, EYSENBACH B, et al. Unsupervised curricula for visual meta-reinforcement learning[EB/OL]. California, USA: arXiv, 2019. [2019-12-21] https://arxiv.org/pdf/1912.04226.pdf.
[63] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 5998-6008.
[64] SAHNI H, BUCKLEY T, ABBEEL P, et al. Visual hindsight experience replay[EB/OL]. Georgia, USA: arXiv, 2019. [2019-10-12] https://arxiv.org/pdf/1901.11529.pdf.
[65] SUKHBAATAR S, DENTON E, SZLAM A, et al. Learning goal embeddings via self-play for hierarchical reinforcement learning[EB/OL]. New York, USA: arXiv, 2018. [2019-11-21] https://arxiv.org/pdf/1811.09083.pdf.
[66] LANIER J B, MCALEER S, BALDI P. Curiosity-driven multi-criteria hindsight experience replay[EB/OL]. California, USA: arXiv, 2019. [2019-12-13] https://arxiv.org/pdf/1906.03710.pdf.