[1]周娴玮,王宇翔,罗仕鑫,等.基于自适应分位数的离线强化学习算法[J].智能系统学报,2025,20(5):1093-1102.[doi:10.11992/tis.202410016]
ZHOU Xianwei,WANG Yuxiang,LUO Shixin,et al.Offline reinforcement learning with adaptive quantile[J].CAAI Transactions on Intelligent Systems,2025,20(5):1093-1102.[doi:10.11992/tis.202410016]
《智能系统学报》(CAAI Transactions on Intelligent Systems) [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1093-1102
Column: Academic Papers - Machine Learning
Publication date: 2025-09-05
- Title: Offline reinforcement learning with adaptive quantile
- 作者: 周娴玮, 王宇翔, 罗仕鑫, 余松森
  华南师范大学 人工智能学院, 广东 佛山 528225
- Author(s): ZHOU Xianwei, WANG Yuxiang, LUO Shixin, YU Songsen
  School of Artificial Intelligence, South China Normal University, Foshan 528225, China
- 关键词: 离线强化学习; 分布偏移; 外推误差; 策略约束; 模仿学习; 双Q估计; 价值高估; 分位数
- Keywords: offline reinforcement learning; distribution shift; extrapolation error; policy constraint; imitation learning; double Q-estimation; overestimation; quantile
- CLC number: TP301.6
- DOI: 10.11992/tis.202410016
- 摘要: 离线强化学习旨在仅通过使用预先收集的离线数据集进行策略的有效学习,从而减少与环境直接交互所带来的高昂成本。然而,由于缺少环境对智能体行为的交互反馈,从离线数据集中学习到的策略可能会遇到数据分布偏移的问题,进而导致外推误差的不断加剧。当前方法多采用策略约束或模仿学习方法来缓解这一问题,但其学习到的策略通常较为保守。针对上述难题,提出一种基于自适应分位数的方法。具体而言,该方法在双Q估计的基础上进一步利用双Q的估计差值大小对分布外未知动作的价值高估情况进行评估,同时结合分位数思想自适应调整分位数来校正过估计偏差。此外,构建分位数优势函数作为策略约束项权重以平衡智能体对数据集的探索和模仿,从而缓解策略学习的保守性。最后在D4RL (datasets for deep data-driven reinforcement learning) 数据集上验证算法的有效性,该算法在多个任务数据集上表现优异,同时展现出在不同场景应用下的广泛潜力。
- Abstract: Offline reinforcement learning aims to reduce the high cost of direct environmental interaction by learning effective policies solely from precollected offline datasets. However, the absence of interactive feedback can cause a distribution shift between the learned policy and the offline dataset, leading to growing extrapolation errors. Most existing methods address this problem with policy constraints or imitation learning, but the resulting policies are often overly conservative. To address these problems, an adaptive quantile-based method is proposed. Building upon double Q-estimation, the method uses the difference between the two Q-estimates to assess the overestimation of out-of-distribution actions and adaptively adjusts a quantile to correct the overestimation bias. Additionally, a quantile advantage is introduced as the weight of the policy constraint term, balancing exploration and imitation to reduce policy conservativeness. Finally, the proposed approach is validated on the D4RL (datasets for deep data-driven reinforcement learning) benchmark, where it achieves excellent performance across multiple task datasets and shows potential for broad application in various scenarios.
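Since the abstract only outlines the mechanism, the sketch below illustrates, under assumptions, how the two ideas could be wired into a TD3+BC-style actor-critic in PyTorch: the Bellman target takes a value at an adaptive quantile between the two Q-estimates, with the quantile shrinking toward the pessimistic minimum as the gap between the estimates (read as a sign of overestimation on out-of-distribution actions) grows, and a quantile-based advantage weights the behavior-cloning (policy-constraint) term. The function names, the exponential quantile schedule, and the sigmoid weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def adaptive_quantile(q1, q2, kappa=1.0):
    """Map the gap between the twin Q-estimates to a quantile in (0, 0.5].

    A large |q1 - q2| is treated as a sign of overestimation on
    out-of-distribution actions, so tau shrinks toward 0 (close to the
    pessimistic minimum); a small gap lets tau approach 0.5 (the mean of
    the two estimates). The schedule used in the paper may differ.
    """
    gap = (q1 - q2).abs()
    return 0.5 * torch.exp(-kappa * gap)


def quantile_value(q1, q2, tau):
    """Value at quantile tau between the smaller and larger twin estimate."""
    q_min = torch.minimum(q1, q2)
    q_max = torch.maximum(q1, q2)
    return q_min + tau * (q_max - q_min)


def critic_target(reward, not_done, q1_next, q2_next, gamma=0.99):
    """Bellman target built from the adaptive-quantile value of the next state."""
    tau = adaptive_quantile(q1_next, q2_next)
    return reward + gamma * not_done * quantile_value(q1_next, q2_next, tau)


def actor_loss(q1_pi, q2_pi, q1_data, q2_data, pi_action, data_action, alpha=2.5):
    """TD3+BC-style actor loss with an advantage-weighted imitation term.

    The advantage compares the adaptive-quantile value of the policy's action
    with that of the dataset action; where the advantage is negative, the
    behavior-cloning weight grows and pulls the policy back toward the data.
    """
    q_pi = quantile_value(q1_pi, q2_pi, adaptive_quantile(q1_pi, q2_pi))
    q_data = quantile_value(q1_data, q2_data, adaptive_quantile(q1_data, q2_data))
    advantage = (q_pi - q_data).detach()
    bc_weight = torch.sigmoid(-advantage)             # more imitation where the policy looks worse
    lam = alpha / q_pi.abs().mean().detach()          # TD3+BC-style value-scale normalization
    bc_term = ((pi_action - data_action) ** 2).mean(dim=-1)
    return (-lam * q_pi + bc_weight * bc_term).mean()
```

Relative to always taking min(Q1, Q2), letting the quantile rise toward 0.5 when the two estimates agree avoids unnecessary pessimism on well-covered state-action pairs, which is in line with the abstract's stated aim of reducing policy conservativeness.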
备注/Memo
Received: 2024-10-12.
Funding: Major Special Project of Applied Science and Technology R&D of Guangdong Province (2016B020244003); Guangdong Enterprise Science and Technology Commissioner Program (GDKTP2020014000); Guangdong Basic and Applied Basic Research Foundation (2020B1515120089, 2020A1515110783).
About the authors: ZHOU Xianwei, lecturer, Ph.D., whose main research interests are reinforcement learning, robotics, and multi-sensor information fusion, E-mail: 20871147@qq.com; WANG Yuxiang, master's student, whose main research interests are deep reinforcement learning and offline reinforcement learning, E-mail: 2023024285@m.scnu.edu.cn; YU Songsen, professor and postdoctoral researcher, whose main research interests are intelligent perception and information processing. YU Songsen has led one General Program project of the National Natural Science Foundation of China, two Spark Program projects of the Ministry of Science and Technology, and one key project of the Guangdong Basic and Applied Basic Research Foundation; participated in drafting a Guangdong provincial standard for the high-end new electronic information industry; holds 53 granted invention patents; and has published more than 40 academic papers. E-mail: yss8109@163.com.
Corresponding author: YU Songsen. E-mail: yss8109@163.com.
更新日期/Last Update: 2025-09-05