[1]周娴玮,王宇翔,罗仕鑫,等.基于自适应分位数的离线强化学习算法[J].智能系统学报,2025,20(5):1093-1102.[doi:10.11992/tis.202410016]
ZHOU Xianwei,WANG Yuxiang,LUO Shixin,et al.Offline reinforcement learning with adaptive quantile[J].CAAI Transactions on Intelligent Systems,2025,20(5):1093-1102.[doi:10.11992/tis.202410016]
《智能系统学报》(CAAI Transactions on Intelligent Systems) [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1093-1102
Column: Academic Papers - Machine Learning
Publication date: 2025-09-05
- Title: Offline reinforcement learning with adaptive quantile
- 作者: 周娴玮, 王宇翔, 罗仕鑫, 余松森
  华南师范大学 人工智能学院, 广东 佛山 528225
- Author(s): ZHOU Xianwei, WANG Yuxiang, LUO Shixin, YU Songsen
  School of Artificial Intelligence, South China Normal University, Foshan 528225, China
- 关键词: 离线强化学习; 分布偏移; 外推误差; 策略约束; 模仿学习; 双Q估计; 价值高估; 分位数
- Keywords: offline reinforcement learning; distribution shift; extrapolation error; policy constraint; imitation learning; double Q-estimation; overestimation; quantile
- CLC number: TP301.6
- DOI: 10.11992/tis.202410016
- 摘要: 离线强化学习旨在仅通过使用预先收集的离线数据集进行策略的有效学习,从而减少与环境直接交互所带来的高昂成本。然而,由于缺少环境对智能体行为的交互反馈,从离线数据集中学习到的策略可能会遇到数据分布偏移的问题,进而导致外推误差的不断加剧。当前方法多采用策略约束或模仿学习方法来缓解这一问题,但其学习到的策略通常较为保守。针对上述难题,提出一种基于自适应分位数的方法。具体而言,该方法在双Q估计的基础上进一步利用双Q的估计差值大小对分布外未知动作的价值高估情况进行评估,同时结合分位数思想自适应调整分位数来校正过估计偏差。此外,构建分位数优势函数作为策略约束项权重以平衡智能体对数据集的探索和模仿,从而缓解策略学习的保守性。最后在D4RL (datasets for deep data-driven reinforcement learning) 数据集上验证算法的有效性,该算法在多个任务数据集上表现优异,同时展现出在不同场景应用下的广泛潜力。
- Abstract: Offline reinforcement learning aims to reduce the high cost of direct environmental interaction by learning effective policies solely from precollected offline datasets. However, the absence of interactive feedback can cause a distribution shift between the learned policy and the offline dataset, leading to growing extrapolation errors. Most existing methods address this problem with policy constraints or imitation learning, but the resulting policies are often overly conservative. To address these problems, an adaptive quantile-based method is proposed. Building upon double Q-estimation, the method uses the difference between the two Q-estimates to assess the overestimation of out-of-distribution actions and adaptively adjusts a quantile to correct the overestimation bias. Additionally, a quantile advantage is introduced as the weight of the policy constraint term, balancing exploration and imitation to reduce policy conservativeness. Finally, the proposed approach is validated on the D4RL (datasets for deep data-driven reinforcement learning) benchmark, where it achieves excellent performance across multiple task datasets and shows potential for broad application in various scenarios.
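Since the abstract only outlines the mechanism, the sketch below illustrates, under assumptions, how the two ideas could be wired into a TD3+BC-style actor-critic in PyTorch: the Bellman target takes a value at an adaptive quantile between the two Q-estimates, with the quantile shrinking toward the pessimistic minimum as the gap between the estimates (read as a sign of overestimation on out-of-distribution actions) grows, and a quantile-based advantage weights the behavior-cloning (policy-constraint) term. The function names, the exponential quantile schedule, and the sigmoid weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def adaptive_quantile(q1, q2, kappa=1.0):
    """Map the gap between the twin Q-estimates to a quantile in (0, 0.5].

    A large |q1 - q2| is treated as a sign of overestimation on
    out-of-distribution actions, so tau shrinks toward 0 (close to the
    pessimistic minimum); a small gap lets tau approach 0.5 (the mean of
    the two estimates). The schedule used in the paper may differ.
    """
    gap = (q1 - q2).abs()
    return 0.5 * torch.exp(-kappa * gap)


def quantile_value(q1, q2, tau):
    """Value at quantile tau between the smaller and larger twin estimate."""
    q_min = torch.minimum(q1, q2)
    q_max = torch.maximum(q1, q2)
    return q_min + tau * (q_max - q_min)


def critic_target(reward, not_done, q1_next, q2_next, gamma=0.99):
    """Bellman target built from the adaptive-quantile value of the next state."""
    tau = adaptive_quantile(q1_next, q2_next)
    return reward + gamma * not_done * quantile_value(q1_next, q2_next, tau)


def actor_loss(q1_pi, q2_pi, q1_data, q2_data, pi_action, data_action, alpha=2.5):
    """TD3+BC-style actor loss with an advantage-weighted imitation term.

    The advantage compares the adaptive-quantile value of the policy's action
    with that of the dataset action; where the advantage is negative, the
    behavior-cloning weight grows and pulls the policy back toward the data.
    """
    q_pi = quantile_value(q1_pi, q2_pi, adaptive_quantile(q1_pi, q2_pi))
    q_data = quantile_value(q1_data, q2_data, adaptive_quantile(q1_data, q2_data))
    advantage = (q_pi - q_data).detach()
    bc_weight = torch.sigmoid(-advantage)             # more imitation where the policy looks worse
    lam = alpha / q_pi.abs().mean().detach()          # TD3+BC-style value-scale normalization
    bc_term = ((pi_action - data_action) ** 2).mean(dim=-1)
    return (-lam * q_pi + bc_weight * bc_term).mean()
```

Relative to always taking min(Q1, Q2), letting the quantile rise toward 0.5 when the two estimates agree avoids unnecessary pessimism on well-covered state-action pairs, which is in line with the abstract's stated aim of reducing policy conservativeness.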
备注/Memo
Received: 2024-10-12.
Funding: Major Special Project of Applied Science and Technology R&D of Guangdong Province (2016B020244003); Guangdong Enterprise Science and Technology Commissioner Program (GDKTP2020014000); Guangdong Basic and Applied Basic Research Foundation (2020B1515120089, 2020A1515110783).
About the authors: ZHOU Xianwei, lecturer, Ph.D., whose main research interests are reinforcement learning, robotics, and multi-sensor information fusion, E-mail: 20871147@qq.com; WANG Yuxiang, master's student, whose main research interests are deep reinforcement learning and offline reinforcement learning, E-mail: 2023024285@m.scnu.edu.cn; YU Songsen, professor and postdoctoral researcher, whose main research interests are intelligent perception and information processing. YU Songsen has led one General Program project of the National Natural Science Foundation of China, two Spark Program projects of the Ministry of Science and Technology, and one key project of the Guangdong Basic and Applied Basic Research Foundation; participated in drafting a Guangdong provincial standard for the high-end new electronic information industry; holds 53 granted invention patents; and has published more than 40 academic papers. E-mail: yss8109@163.com.
Corresponding author: YU Songsen. E-mail: yss8109@163.com.
更新日期/Last Update: 2025-09-05