
[1] WANG Xue-ning, CHEN Wei, ZHANG Men, et al. A survey of direct policy search methods in reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2007, 2(1): 16-24.


A survey of direct policy search methods in reinforcement learning


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 2 Issue: 2007, No. 1 Pages: 16-24 Column: Review Publication date: 2007-02-25

Title:: A survey of direct policy search methods in reinforcement learning

Author(s):: WANG Xue-ning¹; CHEN Wei¹; ZHANG Men²; XU Xin¹; HE Han-gen¹; 1. School of Electromechanical Engineering and Automation, National University of Defense Technology, Changsha 410073, China;
2. Qinghe Building Zi 9, Beijing 100085, China

Keywords:: reinforcement learning; policy search; policy gradient

CLC:: TP242

DOI:: -

Abstract:: Direct policy search methods in reinforcement learning are described, and the theoretical framework of policy gradient methods is presented. Within this framework, several current policy gradient algorithms are generalized. New methods for speeding up policy gradient algorithms are discussed, and new search methods that do not rely on the policy gradient are described. Finally, some directions for future research are given.



Last Update: 2009-05-05
