[1]杨瑞,严江鹏,李秀.强化学习稀疏奖励算法研究——理论与实验[J].智能系统学报,2020,15(5):888-899.[doi:10.11992/tis.202003031]
 YANG Rui,YAN Jiangpeng,LI Xiu.Survey of sparse reward algorithms in reinforcement learning — theory and experiment[J].CAAI Transactions on Intelligent Systems,2020,15(5):888-899.[doi:10.11992/tis.202003031]

强化学习稀疏奖励算法研究——理论与实验

参考文献/References:
[1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge, USA: MIT Press, 1998.
[2] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. 2nd ed. Cambridge, USA: MIT Press, 2018.
[3] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[4] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[5] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning[EB/OL]. California, USA: arXiv, 2019. [2019-10-1] https://arxiv.org/pdf/1912.06680.pdf.
[6] SILVER D. Tutorial: Deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). New York, USA, 2016.
[7] LI Yuxi. Deep reinforcement learning: An overview[EB/OL]. Alberta, Canada: arXiv, 2017. [2019-10-2] https://arxiv.org/pdf/1701.07274.pdf.
[8] LI Yuxi. Deep reinforcement learning[EB/OL]. Alberta, Canada: arXiv, 2018. [2019-10-5] https://arxiv.org/pdf/1810.06339.pdf.
[9] RIEDMILLER M, HAFNER R, LAMPE T, et al. Learning by playing-solving sparse reward tasks from scratch[EB/OL]. London, UK: arXiv, 2018. [2019-10-20] https://arxiv.org/pdf/1802.10567.pdf.
[10] HOSU I A, REBEDEA T. Playing Atari games with deep reinforcement learning and human checkpoint replay[EB/OL]. Bucharest, Romania: arXiv, 2016. [2019-10-21] https://arxiv.org/pdf/1607.05077.pdf.
[11] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al. Hindsight experience replay[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 5048-5058.
[12] 杨惟轶, 白辰甲, 蔡超, 等. 深度强化学习中稀疏奖励问题研究综述[J]. 计算机科学, 2020, 47(3): 182-191.
YANG Weiyi, BAI Chenjia, CAI Chao, et al. Survey on sparse reward in deep reinforcement learning[J]. Computer science, 2020, 47(3): 182-191.
[13] GULLAPALLI V, BARTO A G. Shaping as a method for accelerating reinforcement learning[C]//Proceedings of the 1992 IEEE International Symposium on Intelligent Control. Glasgow, UK, 1992: 554-559.
[14] HUSSEIN A, GABER M M, ELYAN E, et al. Imitation learning: A survey of learning methods[J]. ACM computing surveys, 2017, 50(2): 1-35.
[15] BENGIO Y, LOURADOUR J, COLLOBERT R, et al. Curriculum learning[C]//Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada, 2009: 41-48.
[16] BURDA Y, EDWARDS H, PATHAK D, et al. Large-scale study of curiosity-driven learning[EB/OL]. California, USA: arXiv, 2018. [2019-10-30] https://arxiv.org/pdf/1808.04355.
[17] 周文吉, 俞扬. 分层强化学习综述[J]. 智能系统学报, 2017, 12(5): 590-594.
ZHOU Wenji, YU Yang. Summarize of hierarchical reinforcement learning[J]. CAAI transactions on intelligent systems, 2017, 12(5): 590-594.
[18] PLAPPERT M, ANDRYCHOWICZ M, RAY A, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research[EB/OL]. California, USA: arXiv, 2018. [2019-11-1] https://arxiv.org/pdf/1802.09464.pdf.
[19] 万里鹏, 兰旭光, 张翰博, 等. 深度强化学习理论及其应用综述[J]. 模式识别与人工智能, 2019, 32(1): 67-81.
WAN Lipeng, LAN Xuguang, ZHANG Hanbo, et al. A review of deep reinforcement learning theory and application[J]. Pattern recognition and artificial intelligence, 2019, 32(1): 67-81.
[20] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing Atari with deep reinforcement learning[EB/OL]. London, UK: arXiv, 2013. [2019-11-1] https://arxiv.org/pdf/1312.5602.pdf.
[21] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[22] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine learning, 1992, 8(3/4): 229-256.
[23] KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]//Advances in Neural Information Processing Systems. Colorado, USA, 2000: 1008-1014.
[24] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York, USA, 2016: 1928-1937.
[25] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. California, USA: arXiv, 2017. [2019-11-3] https://arxiv.org/pdf/1707.06347.pdf.
[26] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[EB/OL]. London, UK: arXiv, 2015. [2019-12-25] https://arxiv.org/pdf/1509.02971.pdf.
[27] NG A Y, HARADA D, RUSSELL S. Policy invariance under reward transformations: Theory and application to reward shaping[C]//Proceedings of the Sixteenth International Conference on Machine Learning. Bled, Slovenia, 1999: 278-287.
[28] RANDLØV J, ALSTRØM P. Learning to drive a bicycle using reinforcement learning and shaping[C]//Proceedings of the Fifteenth International Conference on Machine Learning. Madison, USA, 1998: 463-471.
[29] JAGODNIK K M, THOMAS P S, VAN DEN BOGERT A J, et al. Training an actor-critic reinforcement learning controller for arm movement using human-generated rewards[J]. IEEE transactions on neural systems and rehabilitation engineering, 2017, 25(10): 1892-1905.
[30] FERREIRA E, LEFÈVRE F. Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management[C]//2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic, 2013: 108-113.
[31] NG A Y, RUSSELL S J. Algorithms for inverse reinforcement learning[C]//Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, USA, 2000: 663-670.
[32] MARTHI B. Automatic shaping and decomposition of reward functions[C]//Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA, 2007: 601-608.
[33] ROSS S, BAGNELL D. Efficient reductions for imitation learning[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Sardinia, Italy, 2010: 661-668.
[34] NAIR A, MCGREW B, ANDRYCHOWICZ M, et al. Overcoming exploration in reinforcement learning with demonstrations[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, QLD, Australia, 2018: 6292-6299.
[35] HO J, ERMON S. Generative adversarial imitation learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 4565-4573.
[36] LIU Yuxuan, GUPTA A, ABBEEL P, et al. Imitation from observation: Learning to imitate behaviors from raw video via context translation[C]//2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia, 2018: 1118-1125.
[37] TORABI F, WARNELL G, STONE P. Behavioral cloning from observation[EB/OL]. Texas, USA: arXiv, 2018. [2019-11-1] https://arxiv.org/pdf/1805.01954.pdf.
[38] ELMAN J L. Learning and development in neural networks: The importance of starting small[J]. Cognition, 1993, 48(1): 71-99.
[39] GRAVES A, BELLEMARE M G, MENICK J, et al. Automated curriculum learning for neural networks[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, Australia, 2017: 1311-1320.
[40] OpenAI, AKKAYA I, ANDRYCHOWICZ M, et al. Solving Rubik's cube with a robot hand[EB/OL]. California, USA: arXiv, 2019. [2019-11-2] https://arxiv.org/pdf/1910.07113.pdf.
[41] LANKA S, WU Tianfu. ARCHER: aggressive rewards to counter bias in hindsight experience replay[EB/OL]. NC, USA: arXiv, 2018. [2019-12-3] https://arxiv.org/pdf/1809.02070.
[42] MANELA B, BIESS A. Bias-reduced hindsight experience replay with virtual goal prioritization[EB/OL]. Beer Sheva, Israel: arXiv, 2019. [2019-12-3] https://arxiv.org/pdf/1905.05498.pdf.
[43] RAUBER P, UMMADISINGU A, MUTZ F, et al. Hindsight policy gradients[EB/OL]. London, UK: arXiv, 2017. [2019-11-2] https://arxiv.org/pdf/1711.06006.pdf.
[44] SCHMIDHUBER J. A possibility for implementing curiosity and boredom in model-building neural controllers[C]//Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats. Cambridge, USA, 1991: 222-227.
[45] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA, 2017: 16-17.
[46] BELLEMARE M, SRINIVASAN S, OSTROVSKI G, et al. Unifying count-based exploration and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 1471-1479.
[47] STREHL A L, LITTMAN M L. An analysis of model-based interval estimation for Markov decision processes[J]. Journal of computer and system sciences, 2008, 74(8): 1309-1331.
[48] TANG Haoran, HOUTHOOFT R, FOOTE D, et al. #Exploration: A study of count-based exploration for deep reinforcement learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 2753-2762.
[49] BURDA Y, EDWARDS H, STORKEY A, et al. Exploration by random network distillation[EB/OL]. California, USA: arXiv, 2018. [2019-5-20] https://arxiv.org/pdf/1810.12894.pdf.
[50] STADIE B C, LEVINE S, ABBEEL P. Incentivizing exploration in reinforcement learning with deep predictive models[EB/OL]. California, USA: arXiv, 2015. [2019-5-2] https://arxiv.org/pdf/1507.00814.pdf.
[51] KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. Amsterdam, Netherlands: arXiv, 2013. [2019-2-2] https://arxiv.org/pdf/1312.6114.pdf.
[52] RAFATI J, NOELLE D C. Learning representations in model-free hierarchical reinforcement learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu, USA, 2019, 33: 10009-10010.
[53] SUTTON R S, PRECUP D, SINGH S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning[J]. Artificial intelligence, 1999, 112(1-2): 181-211.
[54] KULKARNI T D, NARASIMHAN K R, SAEEDI A, et al. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 3675-3683.
[55] BACON P L, HARB J, PRECUP D. The option-critic architecture[C]//Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, USA, 2017.
[56] FRANS K, HO J, CHEN X, et al. Meta learning shared hierarchies[EB/OL]. California, USA: arXiv, 2017. [2019-11-15] https://arxiv.org/pdf/1710.09767.
[57] VEZHNEVETS A S, OSINDERO S, SCHAUL T, et al. Feudal networks for hierarchical reinforcement learning[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, Australia, 2017: 3540-3549.
[58] NACHUM O, GU Shixiang, LEE H, et al. Data-efficient hierarchical reinforcement learning[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada, 2018: 3303-3313.
[59] LEVY A, KONIDARIS G, PLATT R, et al. Learning multi-level hierarchies with hindsight[EB/OL]. RI, USA: arXiv, 2017. [2019-12-16] https://arxiv.org/pdf/1712.00948.pdf.
[60] SCHAUL T, HORGAN D, GREGOR K, et al. Universal value function approximators[C]//Proceedings of the 32nd International Conference on International Conference on Machine Learning. Lille, France, 2015: 1312-1320.
[61] SUKHBAATAR S, LIN Zeming, KOSTRIKOV I, et al. Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL]. NY, USA: arXiv, 2017. [2019-12-11] https://arxiv.org/pdf/1703.05407.pdf.
[62] JABRI A, HSU K, EYSENBACH B, et al. Unsupervised curricula for visual meta-reinforcement learning[EB/OL]. California, USA: arXiv, 2019. [2019-12-21] https://arxiv.org/pdf/1912.04226.pdf.
[63] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 5998-6008.
[64] SAHNI H, BUCKLEY T, ABBEEL P, et al. Visual hindsight experience replay[EB/OL]. GA, USA: arXiv, 2019. [2019-10-12] https://arxiv.org/pdf/1901.11529.pdf.
[65] SUKHBAATAR S, DENTON E, SZLAM A, et al. Learning goal embeddings via self-play for hierarchical reinforcement learning[EB/OL]. NY, USA: arXiv, 2018. [2019-11-21] https://arxiv.org/pdf/1811.09083.pdf.
[66] LANIER J B, MCALEER S, BALDI P. Curiosity-driven multi-criteria hindsight experience replay[EB/OL]. California, USA: arXiv, 2019. [2019-12-13] https://arxiv.org/pdf/1906.03710.pdf.
相似文献/Similar References:
[1]连传强,徐昕,吴军,等.面向资源分配问题的Q-CF多智能体强化学习[J].智能系统学报,2011,6(2):95.
LIAN Chuanqiang,XU Xin,WU Jun,et al.Q-CF multiAgent reinforcement learning for resource allocation problems[J].CAAI Transactions on Intelligent Systems,2011,6(2):95.
[2]梁爽,曹其新,王雯珊,等.基于强化学习的多定位组件自动选择方法[J].智能系统学报,2016,11(2):149.[doi:10.11992/tis.201510031]
LIANG Shuang,CAO Qixin,WANG Wenshan,et al.An automatic switching method for multiple location components based on reinforcement learning[J].CAAI Transactions on Intelligent Systems,2016,11(2):149.[doi:10.11992/tis.201510031]
[3]张文旭,马磊,王晓东.基于事件驱动的多智能体强化学习研究[J].智能系统学报,2017,12(1):82.[doi:10.11992/tis.201604008]
ZHANG Wenxu,MA Lei,WANG Xiaodong.Reinforcement learning for event-triggered multi-agent systems[J].CAAI Transactions on Intelligent Systems,2017,12(1):82.[doi:10.11992/tis.201604008]
[4]张文旭,马磊,贺荟霖,等.强化学习的地-空异构多智能体协作覆盖研究[J].智能系统学报,2018,13(2):202.[doi:10.11992/tis.201609017]
ZHANG Wenxu,MA Lei,HE Huilin,et al.Air-ground heterogeneous coordination for multi-agent coverage based on reinforced learning[J].CAAI Transactions on Intelligent Systems,2018,13(2):202.[doi:10.11992/tis.201609017]
[5]徐鹏,谢广明,文家燕,等.事件驱动的强化学习多智能体编队控制[J].智能系统学报,2019,14(1):93.[doi:10.11992/tis.201807010]
XU Peng,XIE Guangming,WEN Jiayan,et al.Event-triggered reinforcement learning formation control for multi-agent[J].CAAI Transactions on Intelligent Systems,2019,14(1):93.[doi:10.11992/tis.201807010]
[6]郭宪,方勇纯.仿生机器人运动步态控制:强化学习方法综述[J].智能系统学报,2020,15(1):152.[doi:10.11992/tis.201907052]
GUO Xian,FANG Yongchun.Locomotion gait control for bionic robots: a review of reinforcement learning methods[J].CAAI Transactions on Intelligent Systems,2020,15(1):152.[doi:10.11992/tis.201907052]
[7]申翔翔,侯新文,尹传环.深度强化学习中状态注意力机制的研究[J].智能系统学报,2020,15(2):317.[doi:10.11992/tis.201809033]
SHEN Xiangxiang,HOU Xinwen,YIN Chuanhuan.State attention in deep reinforcement learning[J].CAAI Transactions on Intelligent Systems,2020,15(2):317.[doi:10.11992/tis.201809033]
[8]殷昌盛,杨若鹏,朱巍,等.多智能体分层强化学习综述[J].智能系统学报,2020,15(4):646.[doi:10.11992/tis.201909027]
YIN Changsheng,YANG Ruopeng,ZHU Wei,et al.A survey on multi-agent hierarchical reinforcement learning[J].CAAI Transactions on Intelligent Systems,2020,15(4):646.[doi:10.11992/tis.201909027]
[9]莫宏伟,田朋.基于注意力融合的图像描述生成方法[J].智能系统学报,2020,15(4):740.[doi:10.11992/tis.201910039]
MO Hongwei,TIAN Peng.An image caption generation method based on attention fusion[J].CAAI Transactions on Intelligent Systems,2020,15(4):740.[doi:10.11992/tis.201910039]
[10]赵玉新,杜登辉,成小会,等.基于强化学习的海洋移动观测网络观测路径规划方法[J].智能系统学报,2022,17(1):192.[doi:10.11992/tis.202106004]
ZHAO Yuxin,DU Denghui,CHENG Xiaohui,et al.Path planning for mobile ocean observation network based on reinforcement learning[J].CAAI Transactions on Intelligent Systems,2022,17(1):192.[doi:10.11992/tis.202106004]
[11]周文吉,俞扬.分层强化学习综述[J].智能系统学报,2017,12(5):590.[doi:10.11992/tis.201706031]
ZHOU Wenji,YU Yang.Summarize of hierarchical reinforcement learning[J].CAAI Transactions on Intelligent Systems,2017,12(5):590.[doi:10.11992/tis.201706031]
[12]王作为,徐征,张汝波,等.记忆神经网络在机器人导航领域的应用与研究进展[J].智能系统学报,2020,15(5):835.[doi:10.11992/tis.202002020]
WANG Zuowei,XU Zheng,ZHANG Rubo,et al.Research progress and application of memory neural network in robot navigation[J].CAAI Transactions on Intelligent Systems,2020,15(5):835.[doi:10.11992/tis.202002020]

备注/Memo

Received: 2020-03-19.
Funding: National Natural Science Foundation of China (41876098).
About the authors: YANG Rui, master's student; main research interests: machine learning and reinforcement learning. YAN Jiangpeng, Ph.D. candidate; main research interests: artificial intelligence and computer vision. LI Xiu, professor and doctoral supervisor; main research interests: intelligent systems, data mining, and pattern recognition. She has led and completed 3 National Natural Science Foundation of China projects, 2 Shenzhen basic research projects, and 1 Shenzhen technology development project, and has participated in 4 completed national 863 Program projects; she currently leads 1 major 863 Program project and 1 National Natural Science Foundation of China project. She holds 7 authorized national invention patents and 5 national software copyrights, and has published more than 100 academic papers.
Corresponding author: LI Xiu. E-mail: li.xiu@sz.tsinghua.edu.cn.

更新日期/Last Update: 2021-01-15