[1]王学宁,陈 伟,张 锰,等.增强学习中的直接策略搜索方法综述[J].智能系统学报,2007,2(01):16-24.
 WANG Xue-ning, CHEN Wei, ZHANG Meng, et al. A survey of direct policy search methods in reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2007, 2(01): 16-24.

A survey of direct policy search methods in reinforcement learning (增强学习中的直接策略搜索方法综述)

CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 2
Issue:
No. 1, 2007
Pages:
16-24
Publication date:
2007-02-25

文章信息/Info

Title:
A survey of direct policy search methods in reinforcement learning
Article ID:
1673-4785(2007)01-0016-09
作者:
王学宁1 陈 伟1 张 锰2 徐 昕1  贺汉根1
1.国防科技大学机电工程与自动化学院,湖南长沙410073;2.北京清河大楼子9,北京100085
Author(s):
WANG Xue-ning1, CHEN Wei1, ZHANG Meng2, XU Xin1, HE Han-gen1
1. School of Electromechanical Engineering and Automation, National University of Defense Technology, Changsha 410073, China;
2. Qinghe Building Zi 9, Beijing 100085, China
关键词:
增强学习; 策略搜索; 策略梯度
Keywords:
reinforcement learning; policy search; policy gradient
CLC number:
TP242
Document code:
A
摘要:
对增强学习中各种策略搜索算法进行了简单介绍,建立了策略梯度方法的理论框架,并且根据这个理论框架的指导,对一些现有的策略梯度算法进行了推广,讨论了近年来出现的提高策略梯度算法收敛速度的几种方法,对于非策略梯度搜索算法的最新进展进行了介绍,对进一步研究工作的方向进行了展望.
Abstract:
The direct policy search methods in reinforcement learning are described, and the theoretical framework of policy gradient methods is presented. According to this framework, some current policy gradient algorithms are generalized. Recent methods for speeding up the convergence of policy gradient algorithms are discussed, and new non-policy-gradient search methods are described. Finally, some directions for future research are given.
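As a minimal sketch of the shared framework behind the surveyed methods (the notation here is illustrative and not taken from this paper), the policy gradient theorem of Sutton et al. [8] states that, for a differentiable stochastic policy \pi_\theta(a \mid s) with performance measure J(\theta),

\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\bigl[\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \,\bigr],

and Williams' REINFORCE [18] estimates this gradient from N sampled trajectories, subtracting a baseline b from the return R_t to reduce variance without introducing bias:

\nabla_\theta J(\theta) \;\approx\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\theta \log \pi_\theta\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr)\,\bigl(R_t^{(i)} - b\bigr).

The variance-reduction, baseline, and natural-gradient techniques discussed in the survey (e.g. [14], [15], [20]-[22]) refine the choice of baseline and of the metric used for the update direction.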

参考文献/References:

[1]徐  昕.增强学习及其在移动机器人导航与控制中的应用研究[D].长沙:国防科技大学, 2002.
XU Xin. Reinforcement learning and its applications in navigation and control of mobile robots[D]. Changsha: National University of Defence Technology, 2002.
[2]SUTTON R, BARTO A. Reinforcement learning: an introduction[M]. MIT Press, 1998.
[3]SINGH S P. Learning to solve Markovian decision processes[D]. University of Massachusetts, 1994.
[4]ROY B V. Learning and value function approximation in complex decision processes[M]. MIT Press, 1998.
[5]WATKINS C. Learning from delayed rewards[D]. Cambridge: University of Cambridge, 1989.
[6]HUMPHRYS M. Action selection methods using reinforcement learning[D]. Cambridge: University of Cambridge, 1996.
[7]BERTSEKAS D P, TSITSIKLIS J N. Neuro-dynamic programming[M]. Belmont, Mass.: Athena Scientific, 1996.
[8]SUTTON R S, MCALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation[A]. In: Advances in Neural Information Processing Systems[C]. Denver, USA, 2000.
[9]BAIRD L C. Residual algorithms: reinforcement learning with function approximation[A]. In: Proc. of the 12th Int. Conf. on Machine Learning[C]. San Francisco, 1995.
[10]TSITSIKLIS J N, ROY V B. Feature-based methods for large scale dynamic programming[J]. Machine Learning, 1996(22): 59-94.
[11]徐昕,贺汉根.神经网络增强学习的梯度算法研究[J].计算机学报,2003,26(2):227-233.
 XU Xin, HE Hangen. A gradient algorithm for reinforcement learning based on neural networks[J]. Chinese Journal of Computers, 2003, 26(2): 227-233.
[12]BAXTER J, BARTLETT P L. Infinite-horizon policy-gradient estimation[J]. Journal of Artificial Intelligence Research, 2001(15): 319-350.
[13]ABERDEEN D A. Policy-gradient algorithms for partially observable Markov decision processes[D]. Australian National University, 2003.
[14]GREENSMITH E, BARTLETT P L, BAXTER J. Variance reduction techniques for gradient estimation in reinforcement learning[J]. Journal of Machine Learning Research, 2002(4): 1471-1530.
[15]王学宁,徐昕,吴涛,贺汉根.策略梯度强化学习中的最优回报基线[J].计算机学报,2005,28(6):1021-1026.
 WANG Xuening, XU Xin, WU Tao, HE Hangen. The optimal reward baseline for policy gradient reinforcement learning[J]. Chinese Journal of Computers, 2005, 28(6): 1021-1026.
[16]SCHWARTZ A. A reinforcement learning method for maximizing undiscounted rewards[A]. In: Proceedings of the Tenth International Conference on Machine Learning[C]. San Mateo, CA: Morgan Kaufmann, 1993.
[17]MAHADEVAN S. To discount or not to discount in reinforcement learning: a case study comparing R-learning and Q-learning[A]. In: Proc. of International Machine Learning Conf[C]. New Brunswick, USA, 1994.
[18]WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992(8): 229-256.
[19]KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[J]. Adv. Neural Inform. Processing Syst., 2000(12): 1008-1014.
[20]AMARI S. Natural gradient works efficiently in learning[J]. Neural Computation, 1998, 10(2): 251-276.
[21]KAKADE S. A natural policy gradient[A]. Advances in Neural Information Processing Systems 14[C]. MIT Press, 2002.
[22]PETERS J, VIJAYAKUMAR S, SCHAAL S. Natural actor-critic[A]. In: 16th European Conference on Machine Learning (ECML 2005)[C]. [s.l.], 2005.
[23]GREENSMITH E. Variance reduction techniques for gradient estimation in reinforcement learning[J]. Journal of Machine Learning Research, 2002(4): 1471-1530.
[24]WEAVER L, TAO N. The optimal reward baseline for gradient-based reinforcement learning[A]. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence[C]. Washington, 2001.
[25]MUNOS R. Geometric variance reduction in Markov chains: application to value function and gradient estimation[J]. Journal of Machine Learning Research, 2006(7): 413-427.
[26]BERENJI H R, VENGEROV D. A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters[J]. IEEE Transactions on Fuzzy Systems, 2003, 11(4): 478-485.
[27]GRUDIC G Z, UNGER L R. Localizing policy gradient estimates to action transitions[A]. Seventeenth Int. Conference on Machine Learning, Stanford University[C]. 2000: 343-350.
[28]SCHRAUDOLPH N N, YU Jin, ABERDEEN D. Fast online policy gradient learning with SMD gain vector adaptation[A]. In: Advances in Neural Information Processing Systems (NIPS)[C]. Cambridge, MA: The MIT Press, 2006.
[29]BAGNELL J A, SCHNEIDER J. Policy search in kernel Hilbert space[EB/OL]. http://citeseer.ist.psu.edu/650945.html, 2005-05-27.
[30]NG A, JORDAN M. PEGASUS: a policy search method for large MDPs and POMDPs[A]. In: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence[C]. San Francisco, 2000.
[31]The reinforcement learning toolbox: reinforcement learning for optimal control tasks[D]. Graz: Graz University of Technology, 2005.
[32]NG A Y, KIM H J, JORDAN M I, SASTRY S. Autonomous helicopter flight via reinforcement learning[A]. In: Neural Information Processing Systems 16[C]. [s.l.], 2004.
[33]STRENS M J, MOORE A. Policy search using paired comparisons[J]. Journal of Machine Learning Research, 2002(3):921-950.
[34]MARTIN C. The essential dynamics algorithm: fast policy search in continuous worlds[R]. MIT Media Laboratory, Vision and Modelling Technical Report, 2004.
[35]WANG X N, XU X, WU T, HE H G, ZHANG M. A hybrid reinforcement learning combined with SVM and its applications[A]. Proceedings of the International Conference on Sensing, Computing and Automation[C]. Chongqing, China, 2006.
[36]GHAVAMZADEH M, MAHADEVAN S. Hierarchical policy gradient algorithms[A]. In: Proceedings of the Twentieth International Conference on Machine Learning[C]. Washington, D.C., 2003.
[37]DIETTERICH T. The MAXQ method for hierarchical reinforcement learning[A]. In: Proceedings of the Fifteenth International Conference on Machine Learning[C]. [s.l.], 1998.
[38]PARR R. Hierarchical control and learning for Markov decision processes[D]. University of California, Berkeley, 1998.
[39]SUTTON R, PRECUP D, SINGH S. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning[J]. Artificial Intelligence, 1999(112): 181-211.

备注/Memo

Received: 2006-07-07.
Foundation item: Supported by the National Natural Science Foundation of China (60234030, 60303012).
About the authors:
WANG Xue-ning, male, born in 1976, Ph.D. candidate. His main research interests include reinforcement learning and intelligent control. He has taken part in one key project and one youth project of the National Natural Science Foundation of China and one National 863 Program project, and has published more than 10 papers, of which 3 are indexed by SCI and 5 by EI.
 E-mail: wxn9576@yahoo.com.cn
CHEN Wei, male, born in 1976, Ph.D. candidate. His main research interests include robot localization and mapping and machine learning. He has taken part in one key project of the National Natural Science Foundation of China.
ZHANG Meng, male, born in 1972, received his M.S. degree from the College of Computer Science, National University of Defense Technology, in 2001. His main research interest is command automation. He has won two second-class and three third-class Military Science and Technology Progress Awards, and has published 12 papers in domestic and international journals, of which 1 is indexed by SCI and 3 by EI.
更新日期/Last Update: 2009-05-05