[1] ZHOU Xianwei, WANG Yuxiang, LUO Shixin, et al. Offline reinforcement learning with adaptive quantile[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1093-1102. [doi:10.11992/tis.202410016]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 20
Issue: 2025, No. 5
Pages: 1093-1102
Column: Academic Papers - Machine Learning
Publication date: 2025-09-05
- Title: Offline reinforcement learning with adaptive quantile
- Author(s): ZHOU Xianwei; WANG Yuxiang; LUO Shixin; YU Songsen
- Affiliation: School of Artificial Intelligence, South China Normal University, Foshan 528225, China
- Keywords: offline reinforcement learning; distribution shift; extrapolation error; policy constraint; imitation learning; double Q-estimation; overestimation; quantile
- CLC: TP301.6
- DOI: 10.11992/tis.202410016
- Abstract:
Offline reinforcement learning aims to reduce the high cost of environmental interaction by learning effective policies solely from precollected offline datasets. However, the absence of interactive feedback can cause a distribution shift between the learned policy and the offline dataset, leading to increased extrapolation errors. Most existing methods address this problem with policy constraints or imitation learning, but these often result in overly conservative policies. To address these problems, an adaptive quantile-based method is proposed. Building on double Q-estimation, the relationship between the two Q-estimates is analyzed, and their difference is used to assess overestimation of out-of-distribution actions. The quantile is then adaptively adjusted to correct the overestimation bias. In addition, a quantile advantage is introduced as a weight on the policy constraint term, balancing exploration and imitation to reduce policy conservativeness. Finally, the proposed approach is validated on the D4RL benchmark, where it achieves excellent performance across multiple tasks, demonstrating its potential for broad application.
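
As a rough illustration of the idea described in the abstract, the sketch below shows one way an adaptive quantile could blend double Q-estimates, and how an advantage-style weight could soften a policy constraint term. The function names, the exponential mapping from the Q-gap to the quantile, and the exponential advantage weighting are illustrative assumptions, not the formulas from the paper.

```python
import numpy as np

def adaptive_quantile_target(q1, q2, tau_base=0.5, sensitivity=1.0):
    """Blend double Q-estimates with a quantile that adapts to their gap.

    q1, q2      : arrays of Q-value estimates for the same (s, a) pairs
    tau_base    : quantile used when the two estimates agree
    sensitivity : how strongly a large gap (used here as a proxy for
                  overestimation on out-of-distribution actions) pushes
                  the quantile toward the pessimistic estimate
    """
    q_low = np.minimum(q1, q2)
    q_high = np.maximum(q1, q2)
    gap = q_high - q_low                        # disagreement between the two critics
    # Larger disagreement -> smaller quantile -> closer to the lower (pessimistic) estimate.
    tau = tau_base * np.exp(-sensitivity * gap)
    return (1.0 - tau) * q_low + tau * q_high

def quantile_advantage_weight(q_policy, q_dataset, temperature=1.0):
    """Illustrative advantage-based weight on the policy-constraint (imitation) term:
    dataset actions whose estimated value is not better than the learned policy's
    are imitated less strictly, which reduces conservativeness."""
    adv = q_dataset - q_policy
    return np.clip(np.exp(adv / temperature), 0.0, 100.0)

if __name__ == "__main__":
    q1 = np.array([1.2, 0.8])
    q2 = np.array([1.0, 2.5])
    print(adaptive_quantile_target(q1, q2))     # close to min(q1, q2) where the gap is large
```

In an actor-critic loop, the first function would stand in for the usual min(Q1, Q2) critic target, and the second would scale the behavior-cloning term in the actor loss; again, this pairing is only one reading of the abstract, not the paper's published method.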