[1]王学宁, 陈伟, 张锰, 等. 增强学习中的直接策略搜索方法综述[J]. 智能系统学报, 2007, 2(1): 16-24.
 WANG Xue-ning, CHEN Wei, ZHANG Meng, et al. A survey of direct policy search methods in reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2007, 2(1): 16-24.

增强学习中的直接策略搜索方法综述 (A Survey of Direct Policy Search Methods in Reinforcement Learning)

参考文献/References:
[1]徐昕. 增强学习及其在移动机器人导航与控制中的应用研究[D]. 长沙: 国防科技大学, 2002.
XU Xin. Reinforcement learning and its applications in navigation and control of mobile robots[D]. Changsha: National University of Defense Technology, 2002.
[2]SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge, MA: MIT Press, 1998.
[3]SINGH S P. Learning to solve Markovian decision processes[D]. Amherst: University of Massachusetts, 1994.
[4]VAN ROY B. Learning and value function approximation in complex decision processes[D]. Cambridge, MA: Massachusetts Institute of Technology, 1998.
[5]WATKINS C. Learning from delayed rewards[D]. Cambridge: University of Cambridge, 1989.
[6]HUMPHRYS M. Action selection methods using reinforcement learning[D]. Cambridge: University of Cambridge, 1996.
[7]BERTSEKAS D P, TSITSIKLIS J N. Neuro-dynamic programming[M]. Belmont, MA: Athena Scientific, 1996.
[8]SUTTON R S, MCALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation[A]. In: Advances in Neural Information Processing Systems[C]. Denver, USA, 2000.
[9]BAIRD L C. Residual algorithms: reinforcement learning with function approximation[A]. In: Proceedings of the 12th International Conference on Machine Learning[C]. San Francisco, 1995.
[10]TSITSIKLIS J N, VAN ROY B. Feature-based methods for large scale dynamic programming[J]. Machine Learning, 1996(22): 59-94.
[11]徐昕, 贺汉根. 神经网络增强学习的梯度算法研究[J]. 计算机学报, 2003, 26(2): 227-233.
XU Xin, HE Hangen. A gradient algorithm for reinforcement learning based on neural networks[J]. Chinese Journal of Computers, 2003, 26(2): 227-233.
[12]BAXTER J, BARTLETT P L. Infinite-horizon policy-gradient estimation[J]. Journal of Artificial Intelligence Research, 2001(15): 319-350.
[13]ABERDEEN D A. Policy-gradient algorithms for partially observable Markov decision processes[D]. Canberra: Australian National University, 2003.
[14]GREENSMITH E, BARTLETT P L, BAXTER J. Variance reduction techniques for gradient estimates in reinforcement learning[J]. Journal of Machine Learning Research, 2004(5): 1471-1530.
[15]王学宁, 徐昕, 吴涛, 贺汉根. 策略梯度强化学习中的最优回报基线[J]. 计算机学报, 2005, 28(6): 1021-1026.
WANG Xuening, XU Xin, WU Tao, HE Hangen. The optimal reward baseline for policy gradient reinforcement learning[J]. Chinese Journal of Computers, 2005, 28(6): 1021-1026.
[16]SCHWARTZ A. A reinforcement learning method for maximizing undiscounted rewards[A]. In: Proceedings of the Tenth International Conference on Machine Learning[C]. San Mateo, CA: Morgan Kaufmann, 1993.
[17]MAHADEVAN S. To discount or not to discount in reinforcement learning: a case study comparing R-learning and Q-learning[A]. In: Proceedings of the International Machine Learning Conference[C]. New Brunswick, USA, 1994.
[18]WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992(8): 229-256.
[19]KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[J]. Advances in Neural Information Processing Systems, 2000(12): 1008-1014.
[20]AMARI S. Natural gradient works efficiently in learning[J]. Neural Computation, 1998, 10(2): 251-276.
[21]KAKADE S. A natural policy gradient[A]. In: Advances in Neural Information Processing Systems 14[C]. Cambridge, MA: MIT Press, 2002.
[22]PETERS J, VIJAYAKUMAR S, SCHAAL S. Natural actor-critic[A]. In: Proceedings of the 16th European Conference on Machine Learning (ECML 2005)[C]. [s.l.], 2005.
[23]GREENSMITH E. Variance reduction techniques for gradient estimates in reinforcement learning[J]. Journal of Machine Learning Research, 2004(5): 1471-1530.
[24]WEAVER L, TAO N. The optimal reward baseline for gradient-based reinforcement learning[A]. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence[C]. Washington, 2001.
[25]MUNOS R. Geometric variance reduction in Markov chains: application to value function and gradient estimation[J]. Journal of Machine Learning Research, 2006(7): 413-427.
[26]BERENJI H R, VENGEROV D. A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters[J]. IEEE Transactions on Fuzzy Systems, 2003, 11(4): 478-485.
[27]GRUDIC G Z, UNGAR L H. Localizing policy gradient estimates to action transitions[A]. In: Proceedings of the Seventeenth International Conference on Machine Learning[C]. Stanford University, 2000: 343-350.
[28]SCHRAUDOLPH N N, YU Jin, ABERDEEN D. Fast online policy gradient learning with SMD gain vector adaptation[A]. In: Advances in Neural Information Processing Systems (NIPS)[C]. Cambridge, MA: The MIT Press, 2006.
[29]BAGNELL J A, SCHNEIDER J. Policy search in kernel Hilbert space[EB/OL]. http://citeseer.ist.psu.edu/650945.html, 2005-05-27.
[30]NG A, JORDAN M. PEGASUS: a policy search method for large MDPs and POMDPs[A]. In: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence[C]. San Francisco, 2000.
[31]NEUMANN G. The Reinforcement Learning Toolbox: reinforcement learning for optimal control tasks[D]. Graz: Graz University of Technology, 2005.
[32]NG A Y, KIM H J, JORDAN M I, SASTRY S. Autonomous helicopter flight via reinforcement learning[A]. In: Advances in Neural Information Processing Systems 16[C]. [s.l.], 2004.
[33]STRENS M J, MOORE A. Policy search using paired comparisons[J]. Journal of Machine Learning Research, 2002(3): 921-950.
[34]MARTIN C. The essential dynamics algorithm: fast policy search in continuous worlds[R]. MIT Media Laboratory, Vision and Modeling Technical Report, 2004.
[35]WANG X N, XU X, WU T, HE H G, ZHANG M. A hybrid reinforcement learning combined with SVM and its applications[A]. In: Proceedings of the International Conference on Sensing, Computing and Automation[C]. Chongqing, China, 2006.
[36]GHAVAMZADEH M, MAHADEVAN S. Hierarchical policy gradient algorithms[A]. In: Proceedings of the Twentieth International Conference on Machine Learning[C]. Washington, D.C., 2003.
[37]DIETTERICH T. The MAXQ method for hierarchical reinforcement learning[A]. In: Proceedings of the Fifteenth International Conference on Machine Learning[C]. [s.l.], 1998.
[38]PARR R. Hierarchical control and learning for Markov decision processes[D]. Berkeley: University of California, Berkeley, 1998.
[39]SUTTON R, PRECUP D, SINGH S. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning[J]. Artificial Intelligence, 1999(112): 181-211.

备注/Memo

Received: 2006-07-07.
Foundation item: Supported by the National Natural Science Foundation of China (60234030, 60303012).
About the authors:
WANG Xuening, male, born in 1976, is a PhD candidate. His research interests include reinforcement learning and intelligent control. He has taken part in one key project and one youth project of the National Natural Science Foundation of China, as well as one National 863 Program project, and has published more than 10 papers, 3 of which are indexed by SCI and 5 by EI.
E-mail: wxn9576@yahoo.com.cn
CHEN Wei, male, born in 1976, is a PhD candidate. His research interests include robot localization and mapping, and machine learning. He has taken part in one key project of the National Natural Science Foundation of China.
ZHANG Meng, male, born in 1972, received his master's degree from the College of Computer Science, National University of Defense Technology, in 2001. His main research interest is command automation. He has won two second-class and three third-class Military Science and Technology Progress Awards, and has published 12 papers in domestic and international journals, 1 of which is indexed by SCI and 3 by EI.

更新日期/Last Update: 2009-05-05