[1] ZHOU Xianwei, WANG Yuxiang, LUO Shixin, et al. Offline reinforcement learning with adaptive quantile[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1093-1102. [doi:10.11992/tis.202410016]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1093-1102
Column: Academic Papers - Machine Learning
Publication date: 2025-09-05
- Title: Offline reinforcement learning with adaptive quantile
- Author(s): ZHOU Xianwei; WANG Yuxiang; LUO Shixin; YU Songsen
- Affiliation: School of Artificial Intelligence, South China Normal University, Foshan 528225, China
- Keywords: offline reinforcement learning; distribution shift; extrapolation error; policy constraint; imitation learning; double Q-estimation; overestimation; quantile
- CLC: TP301.6
- DOI: 10.11992/tis.202410016
- Abstract:
Offline reinforcement learning aims to reduce the high cost of environmental interaction by learning effective policies solely from precollected offline datasets. However, the absence of interactive feedback can cause a distribution shift between the learned policy and the offline dataset, leading to increased extrapolation errors. Most existing methods address this problem with policy constraints or imitation learning, but these often result in overly conservative policies. To address these problems, an adaptive quantile-based method is proposed. Building on double Q-estimation, the relationship between the two Q-estimates is analyzed, and their difference is used to assess overestimation of out-of-distribution actions. The quantile is then adaptively adjusted to correct the overestimation bias. In addition, a quantile advantage is introduced as a weight on the policy constraint term, balancing exploration and imitation to reduce policy conservativeness. Finally, the proposed approach is validated on the D4RL benchmark, where it achieves excellent performance across multiple tasks, demonstrating its potential for broad application.
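
As a rough illustration of the idea described in the abstract, the sketch below shows one way an adaptive quantile could blend double Q-estimates, and how an advantage-style weight could soften a policy constraint term. The function names, the exponential mapping from the Q-gap to the quantile, and the exponential advantage weighting are illustrative assumptions, not the formulas from the paper.

```python
import numpy as np

def adaptive_quantile_target(q1, q2, tau_base=0.5, sensitivity=1.0):
    """Blend double Q-estimates with a quantile that adapts to their gap.

    q1, q2      : arrays of Q-value estimates for the same (s, a) pairs
    tau_base    : quantile used when the two estimates agree
    sensitivity : how strongly a large gap (used here as a proxy for
                  overestimation on out-of-distribution actions) pushes
                  the quantile toward the pessimistic estimate
    """
    q_low = np.minimum(q1, q2)
    q_high = np.maximum(q1, q2)
    gap = q_high - q_low                        # disagreement between the two critics
    # Larger disagreement -> smaller quantile -> closer to the lower (pessimistic) estimate.
    tau = tau_base * np.exp(-sensitivity * gap)
    return (1.0 - tau) * q_low + tau * q_high

def quantile_advantage_weight(q_policy, q_dataset, temperature=1.0):
    """Illustrative advantage-based weight on the policy-constraint (imitation) term:
    dataset actions whose estimated value is not better than the learned policy's
    are imitated less strictly, which reduces conservativeness."""
    adv = q_dataset - q_policy
    return np.clip(np.exp(adv / temperature), 0.0, 100.0)

if __name__ == "__main__":
    q1 = np.array([1.2, 0.8])
    q2 = np.array([1.0, 2.5])
    print(adaptive_quantile_target(q1, q2))     # close to min(q1, q2) where the gap is large
```

In an actor-critic loop, the first function would stand in for the usual min(Q1, Q2) critic target, and the second would scale the behavior-cloning term in the actor loss; again, this pairing is only one reading of the abstract, not the paper's published method.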