Reinforcement Learning Efficient Training (mostly sample efficiency)
Offline/Batch Reinforcement Learning: Conservative Learning and Mitigating Distribution Shift / Extrapolation Risk
This group focuses on offline/batch RL with little or no online interaction: improving the reliability and effectiveness of learning from fixed datasets via conservative/pessimistic estimation, robustness to distribution shift and insufficient coverage, constraint-, fairness-, and risk-aware objectives, and invariant representations or domain generalization. It also includes the sample-efficiency route from offline data to offline simulation/evaluation. A minimal pessimism sketch follows the list below.
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity(Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Yuejie Chi, 2022, ArXiv Preprint)
- Efficient Diffusion Policies For Offline Reinforcement Learning(Chao Du, Bingyi Kang, Xiao Ma, Tianyu Pang, Shuicheng Yan, 2023, Advances in Neural Information Processing Systems 36)
- A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems(Rafael Figueiredo Prudencio, M. Maximo, E. Colombini, 2022, IEEE Transactions on Neural Networks and Learning Systems)
- On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling and Beyond(Thanh Nguyen-Tang, Raman Arora, 2023, Advances in Neural Information Processing Systems 36)
- Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage(Jose Blanchet, Miao Lu, Tong Zhang, Han Zhong, 2023, ArXiv Preprint)
- Semi-gradient DICE for Offline Constrained Reinforcement Learning(Woosung Kim, JunHo Seo, Jongmin Lee, Byung-Jun Lee, 2025, ArXiv Preprint)
- Data-Driven Offline Decision-Making via Invariant Representation Learning(Han Qi, Yi Su, Aviral Kumar, Sergey Levine, 2022, ArXiv Preprint)
- GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models(Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana, 2023, ArXiv Preprint)
- Revisiting the Minimalist Approach to Offline Reinforcement Learning(Sergey Kolesnikov, Vladislav Kurenkov, Alexander Nikulin, Denis Tarasov, 2023, Advances in Neural Information Processing Systems 36)
- Pessimistic Model Selection for Offline Deep Reinforcement Learning(Chao-Han Huck Yang, Zhengling Qi, Yifan Cui, Pin-Yu Chen, 2021, ArXiv Preprint)
- Robust Reinforcement Learning Using Offline Data(Mohammad Ghavamzadeh, Dileep Kalathil, Kishan Panaganti, Zaiyan Xu, 2022, Advances in Neural Information Processing Systems 35)
- Efficient RL Training for LLMs with Experience Replay(Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos, 2026, ArXiv Preprint)
- NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning(Rongjun Qin, Songyi Gao, Xingyuan Zhang, Zhen Xu, Shengkai Huang, Zewen Li, Weinan Zhang, Yang Yu, 2021, Advances in Neural Information Processing Systems 35)
- Domain Generalization for Robust Model-Based Offline Reinforcement Learning(Alan Clark, Shoaib Ahmed Siddiqui, Robert Kirk, Usman Anwar, Stephen Chung, David Krueger, 2022, ArXiv Preprint)
- Personalization for Web-based Services using Offline Reinforcement Learning(Pavlos Athanasios Apostolopoulos, Zehui Wang, Hanson Wang, Chad Zhou, Kittipat Virochsiri, Norm Zhou, Igor L. Markov, 2021, ArXiv Preprint)
- Towards Data-Driven Offline Simulations for Online Reinforcement Learning(Shengpu Tang, Felipe Vieira Frujeri, Dipendra Misra, Alex Lamb, John Langford, Paul Mineiro, Sebastian Kochman, 2022, ArXiv Preprint)
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity(Johannes Ackermann, Takayuki Osa, Masashi Sugiyama, 2024, ArXiv Preprint)
- Benchmarking Offline Reinforcement Learning Algorithms for E-Commerce Order Fraud Evaluation(Soysal Degirmenci, Chris Jones, 2022, ArXiv Preprint)
- Efficient experience replay architecture for offline reinforcement learning(Longfei Zhang, Yang Feng, Rongxiao Wang, Yue Xu, Naifu Xu, Zeyi Liu, Hang Du, 2023, Robotic Intelligence and Automation)
- Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and Stable Online Fine-Tuning(Alex Beeson, Giovanni Montana, 2022, ArXiv Preprint)
- FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning(Woosung Kim, Jinho Lee, Jongmin Lee, Byung-Jun Lee, 2025, ArXiv Preprint)
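The shared mechanism across much of this group is pessimism under poor coverage. Below is a minimal tabular sketch of lower-confidence-bound value iteration from a fixed dataset; the 1/sqrt(count) penalty form, the constant c_bonus, and the zero default for unvisited pairs are illustrative assumptions (rewards assumed in [0, 1]), not any single paper's algorithm.

```python
import numpy as np

def pessimistic_q_iteration(counts, rewards, transitions,
                            gamma=0.99, c_bonus=1.0, n_iters=200):
    """counts[s, a]: dataset visit counts; rewards[s, a]: empirical mean
    reward; transitions[s, a, s']: empirical transition frequencies."""
    S, A = counts.shape
    q = np.zeros((S, A))
    # Penalize poorly covered state-action pairs: fewer samples, bigger penalty.
    penalty = c_bonus / np.sqrt(np.maximum(counts, 1))
    for _ in range(n_iters):
        v = q.max(axis=1)                      # greedy value per state
        q = rewards + gamma * transitions @ v - penalty
        q[counts == 0] = 0.0                   # pessimistic default
    return q
```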
Experience Replay and Resampling: Priority / Diversity / Reliability / Hindsight to Raise Data Reuse
The common thread of this group is attacking sample efficiency directly through the data-reuse mechanism of training: non-uniform/prioritized sampling in experience replay; replaying high-value transitions driven by reliability, uncertainty, or diversity; and hindsight relabeling and successor representations for sparse-reward learning. It also brings in analyses of how replay affects convergence and staleness, plus synthetic/structured replay (mixup, random reshuffling, symmetric/clustered/forgetful replay) to further raise the update-to-data ratio. A proportional-prioritization sketch follows the list below.
- Reliability-Adjusted Prioritized Experience Replay(Leonard S. Pleiss, Tobias Sutter, Maximilian Schiffer, 2025, ArXiv Preprint)
- Large Batch Experience Replay(Thibault Lahire, Matthieu Geist, Emmanuel Rachelson, 2021, ArXiv Preprint)
- Sample Efficient Experience Replay in Non-stationary Environments(Tianyang Duan, Zongyuan Zhang, Songxiao Guo, Yuanye Zhao, Zheng Lin, Zihan Fang, Yi Liu, Dianxin Luan, Dong Huang, Heming Cui, Yong Cui, 2025, ArXiv Preprint)
- Enhancing Sample Efficiency in Multi-Agent RL with Uncertainty Quantification and Selective Exploration(Tom Danino, Nahum Shimkin, 2025, ArXiv Preprint)
- Robust experience replay sampling for multi-agent reinforcement learning(Isack Thomas Nicholaus, Dae-Ki Kang, 2021, Pattern Recognition Letters)
- Efficient Diversity-based Experience Replay for Deep Reinforcement Learning(Kaiyan Zhao, Yiming Wang, Yuyang Chen, Yan Li, Leong Hou U, Xiaoguang Niu, 2024, ArXiv Preprint)
- Curious Replay for Model-based Adaptation(Isaac Kauvar, Chris Doyle, Linqi Zhou, Nick Haber, 2023, ArXiv Preprint)
- Sampled-data control through model-free reinforcement learning with effective experience replay(Bo Xiao, H.K. Lam, Xiaojie Su, Ziwei Wang, F. P. Lo, Shihong Chen, Eric M. Yeatman, 2023, Journal of Automation and Intelligence)
- Improved exploration-exploitation trade-off through adaptive prioritized experience replay(Hossein Hassani, S. Nikan, Abdallah Shami, 2024, Neurocomputing)
- Solving Robotic Manipulation With Sparse Reward Reinforcement Learning Via Graph-Based Diversity and Proximity(Zhenshan Bing, Hongkuan Zhou, Rui Li, Xiaojie Su, F. Morin, Kai Huang, Alois Knoll, 2023, IEEE Transactions on Industrial Electronics)
- Cluster-based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract)(Taeyoung Kim, Dongsoo Har, 2022, ArXiv Preprint)
- Improving Experience Replay with Successor Representation(Yizhi Yuan, Marcelo G Mattar, 2021, ArXiv Preprint)
- Improving Sample Efficiency in Model-Free Reinforcement Learning from Images(Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, R. Fergus, 2019, Proceedings of the AAAI Conference on Artificial Intelligence)
- Non-Uniform Memory Sampling in Experience Replay(Andrii Krutsylo, 2025, ArXiv Preprint)
- Experience replay is associated with efficient nonlocal learning(Yunzhe Liu, M. Mattar, T. Behrens, N. Daw, R. Dolan, 2021, Science)
- Prioritized Experience Replay based on Multi-armed Bandit(Ximing Liu, Tianqing Zhu, Cuiqing Jiang, Dayong Ye, Fuqing Zhao, 2021, Expert Systems with Applications)
- Uncertainty Prioritized Experience Replay(Rodrigo Carrasco-Davis, Sebastian Lee, Claudia Clopath, Will Dabney, 2025, ArXiv Preprint)
- Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks(Ryan Sander, Wilko Schwarting, Tim Seyde, Igor Gilitschenski, Sertac Karaman, Daniela Rus, 2022, ArXiv Preprint)
- On the Convergence of Experience Replay in Policy Optimization: Characterizing Bias, Variance, and Finite-Time Convergence(Hua Zheng, Wei Xie, M. Ben Feng, 2021, ArXiv Preprint)
- Experience Replay with Random Reshuffling(Yasuhiro Fujita, 2025, ArXiv Preprint)
- Prioritized experience replay based on dynamics priority(Hu Li, Xuezhong Qian, Wei Song, 2024, Scientific Reports)
- Actor Prioritized Experience Replay(Baturay Sağlam, Furkan B. Mutlu, Dogan C. Cicek, Süleyman S. Kozat, 2023, Journal of Artificial Intelligence Research)
- Synthetic Experience Replay(Cong Lu, Philip J. Ball, Yee Whye Teh, Jack Parker-Holder, 2023, ArXiv Preprint)
- Symmetric Replay Training: Enhancing Sample Efficiency in Deep Reinforcement Learning for Combinatorial Optimization(Hyeonah Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park, 2023, ArXiv Preprint)
- Efficient experience replay architecture for offline reinforcement learning(Longfei Zhang, Yang Feng, Rongxiao Wang, Yue Xu, Naifu Xu, Zeyi Liu, Hang Du, 2023, Robotic Intelligence and Automation)
- Clustering experience replay for the effective exploitation in reinforcement learning(Min Li, Tianyi Huang, William Zhu, 2022, Pattern Recognition)
- Forgetful experience replay in hierarchical reinforcement learning from expert demonstrations(Alexey Skrynnik, A. Staroverov, Ermek Aitygulov, Kirill Aksenov, Vasilii Davydov, A. Panov, 2021, Knowledge-Based Systems)
- TROFI: Trajectory-Ranked Offline Inverse Reinforcement Learning(Alessandro Sestini, Joakim Bergdahl, Konrad Tollmar, Andrew D. Bagdanov, Linus Gisslén, 2025, ArXiv Preprint)
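Most entries above build on non-uniform replay. A minimal proportional prioritized replay buffer in the spirit of the standard PER recipe is sketched below; the flat O(n) sampling (a production version would use a sum-tree) and the hyperparameter values are simplifying assumptions.

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current max priority so each one is
        # replayed at least once before its TD error is known.
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps
```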
Exploration Efficiency: Uncertainty / Information-Gain Driven Interaction and Coverage with Few Samples
This group emphasizes exploration mechanisms that obtain effective coverage and faster information gain from fewer interactions: action selection driven by uncertainty, ensemble error, posterior sampling, information gain, or distributional distance measures. It also includes strategic exploration for competitive multi-agent settings, and efficiency gains from decoupled exploration-exploitation and few-shot adaptation (meta-RL frameworks). An ensemble-bonus sketch follows the list below.
- Strategically Efficient Exploration in Competitive Multi-agent Reinforcement Learning(Robert Loftin, Aadirupa Saha, Sam Devlin, Katja Hofmann, 2021, ArXiv Preprint)
- Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning(Ke Sun, Yingnan Zhao, Enze Shi, Yafei Wang, Xiaodong Yan, Bei Jiang, Linglong Kong, 2021, ArXiv Preprint)
- From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning(Junseok Park, Hyeonseo Yang, Min Whoo Lee, Won-Seok Choi, Minsu Lee, Byoung-Tak Zhang, 2025, ArXiv Preprint)
- Dopamine encoding of novelty facilitates efficient uncertainty-driven exploration(Yuhao Wang, Armin Lak, Sanjay Manohar, Rafał Bogacz, 2024, PLOS Computational Biology)
- Query The Agent: Improving sample efficiency through epistemic uncertainty estimation(Julian Alverio, Boris Katz, Andrei Barbu, 2022, ArXiv Preprint)
- Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior(Zhiyuan Zhou, Shreyas Sundara Raman, Henry Sowerby, Michael L. Littman, 2022, ArXiv Preprint)
- Enhancing Exploration Efficiency using Uncertainty-Aware Information Prediction(Seunghwan Kim, Heejung Shin, Gaeun Yim, Changseung Kim, Hyondong Oh, 2024, ArXiv Preprint)
- From Wasserstein to Maximum Mean Discrepancy Barycenters: A Novel Framework for Uncertainty Propagation in Model-Free Reinforcement Learning(Srinjoy Roy, Subhajit Saha, Swagatam Das, 2026, IEEE Transactions on Emerging Topics in Computational Intelligence)
- Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning(Abdul Wahab, Raksha Kumaraswamy, Martha White, 2026, ArXiv Preprint)
- A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning(Christoph Dann, Mehryar Mohri, Tong Zhang, Julian Zimmert, 2022, ArXiv Preprint)
- SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration(Yang Jin, Jun Lv, Han Xue, Wendi Chen, Chuan Wen, Cewu Lu, 2025, ArXiv Preprint)
- Uncertainty Prioritized Experience Replay(Rodrigo Carrasco-Davis, Sebastian Lee, Claudia Clopath, Will Dabney, 2025, ArXiv Preprint)
- A Novel Framework for Uncertainty-Driven Adaptive Exploration(Leonidas Bakopoulos, Georgios Chalkiadakis, 2025, ArXiv Preprint)
- Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning(William F. Whitney, Michael Bloesch, Jost Tobias Springenberg, Abbas Abdolmaleki, Kyunghyun Cho, Martin Riedmiller, 2021, ArXiv Preprint)
- Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning(Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, Doina Precup, 2025, ArXiv Preprint)
- A Tutorial on Meta-Reinforcement Learning(Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, Shimon Whiteson, 2023, ArXiv Preprint)
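As a concrete instance of the uncertainty-driven selection described above, here is a minimal sketch of acting greedily on an ensemble's mean plus disagreement, so that disagreement among ensemble members acts as an exploration bonus. The callable interface and the bonus coefficient beta are assumptions.

```python
import numpy as np

def ensemble_action(q_ensemble, state, beta=1.0):
    """q_ensemble: list of callables, each mapping state -> per-action Q-values."""
    qs = np.stack([q(state) for q in q_ensemble])  # (n_members, n_actions)
    mean, std = qs.mean(axis=0), qs.std(axis=0)
    # Optimism in the face of uncertainty: prefer actions whose value
    # the ensemble disagrees about.
    return int(np.argmax(mean + beta * std))
```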
Model-Based RL and Planning / Imagined Data: Sampling Less with a Model While Correcting Error and Bias
The common thread here is the model-based / planning / virtual-interaction and synthetic-experience route: learning an environment model or model ensemble to plan and imagine with few samples, thereby reducing the need for real interaction, while using uncertainty estimates, conservative/risk-aware objectives, and error bounds to curb the compounding errors caused by model bias. The group also covers surveys of model-based RL and specific applications (quantum control, world-model/planning frameworks, and cross-domain synthetic-data guidance). A Dyna-style sketch follows the list below.
- GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models(Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana, 2023, ArXiv Preprint)
- Curious Replay for Model-based Adaptation(Isaac Kauvar, Chris Doyle, Linqi Zhou, Nick Haber, 2023, ArXiv Preprint)
- Learning to Navigate in Mazes with Novel Layouts using Abstract Top-down Maps(Linfeng Zhao, Lawson L. S. Wong, 2024, ArXiv Preprint)
- Reinforcement learning with learned gadgets to tackle hard quantum problems on real hardware(Akash Kundu, Leopoldo Sarra, 2024, ArXiv Preprint)
- Efficient Diversity-based Experience Replay for Deep Reinforcement Learning(Kaiyan Zhao, Yiming Wang, Yuyang Chen, Yan Li, Leong Hou U, Xiaoguang Niu, 2024, ArXiv Preprint)
- Data-Driven Offline Decision-Making via Invariant Representation Learning(Han Qi, Yi Su, Aviral Kumar, Sergey Levine, 2022, ArXiv Preprint)
- Model-based Reinforcement Learning: A Survey(T. Moerland, J. Broekens, C. Jonker, 2020, … in Machine Learning)
- Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning(Barna Pásztor, Ilija Bogunovic, Andreas Krause, 2021, ArXiv Preprint)
- Maximum Entropy Model-based Reinforcement Learning(Oleg Svidchenko, Aleksei Shpilman, 2021, ArXiv Preprint)
- Addressing Sample Efficiency and Model-bias in Model-based Reinforcement Learning(Akhil S. Anand, Jens Erik Kveen, Fares J. Abu-Dakka, E. Grøtli, Jan Tommy Gravdahl, 2022, 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA))
- Active Exploration for Neural Global Illumination of Variable Scenes(Stavros Diolatzis, Julien Philip, George Drettakis, 2022, ArXiv Preprint)
- Model-Based Reinforcement Learning with SINDy(Rushiv Arora, Bruno Castro da Silva, Eliot Moss, 2022, ArXiv Preprint)
- Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning(Erin J. Talvitie, Zilei Shao, Huiying Li, Jinghan Hu, Jacob Boerma, Rory Zhao, Xintong Wang, 2024, ArXiv Preprint)
- Between Rate-Distortion Theory & Value Equivalence in Model-Based Reinforcement Learning(Dilip Arumugam, Benjamin Van Roy, 2022, ArXiv Preprint)
- Sample Efficient Reinforcement Learning via Model-Ensemble Exploration and Exploitation(Yao Yao, Li Xiao, Zhicheng An, Wanpeng Zhang, Dijun Luo, 2021, ArXiv Preprint)
- Settling the sample complexity of model-based offline reinforcement learning(G Li, L Shi, Y Chen, Y Chi, Y Wei, 2024, The Annals of Statistics)
- Sample-efficient model-based reinforcement learning for quantum control(Irtaza Khalid, Carrie A. Weidner, Edmond Jonckheere, S. G. Schirmer, Frank C. Langbein, 2023, Physical Review Research)
- Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach For Adaptive Brain Stimulation(Michelle Pan, Mariah Schrum, Vivek Myers, Erdem Bıyık, Anca Dragan, 2024, ArXiv Preprint)
- STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning(Gao Huang, Jian Sun, Gang Wang, Yetian Yuan, Weipu Zhang, 2023, Advances in Neural Information Processing Systems 36)
- Towards General Negotiation Strategies with End-to-End Reinforcement Learning(Bram M. Renting, Thomas M. Moerland, Holger H. Hoos, Catholijn M. Jonker, 2024, ArXiv Preprint)
- DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning(Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta, 2025, ArXiv Preprint)
- Efficient experience replay architecture for offline reinforcement learning(Longfei Zhang, Yang Feng, Rongxiao Wang, Yue Xu, Naifu Xu, Zeyi Liu, Hang Du, 2023, Robotic Intelligence and Automation)
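A minimal Dyna-style loop captures the shared pattern in this group: fit a one-step model on real data, generate short imagined rollouts, and truncate them when model uncertainty grows so compounding error stays bounded. The `model`, `agent`, and `real_buffer` interfaces and the uncertainty threshold are assumed here, not any specific paper's API.

```python
# model.fit(batch) does supervised one-step learning; model.predict returns
# an ensemble-mean next state, reward, and a disagreement estimate.
def dyna_training_step(real_buffer, model, agent,
                       n_rollouts=32, horizon=5, max_std=0.1):
    model.fit(real_buffer.sample(256))            # 1. improve the model
    for _ in range(n_rollouts):                   # 2. imagine short rollouts
        s = real_buffer.sample_state()            # branch from real states
        for _ in range(horizon):
            a = agent.act(s)
            s_next, r, std = model.predict(s, a)
            if std > max_std:                     # 3. truncate when the model
                break                             #    is unsure of itself
            agent.update((s, a, r, s_next))       # 4. learn on imagined data
            s = s_next
```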
Reward Engineering and Task-Signal Reshaping: Sparse/Noisy Rewards, Reward Geometry, and Q-Shaping for Sample Efficiency
This group centers on how reshaping the reward signal / task structure affects sample efficiency: for sparse, noisy, or delayed rewards, it uses reward densification, hierarchical/staged rewards, reward geometry (quasimetrics), causally paced curricula, stochastic reward machines, and selective learning to make the learning signal easier to optimize, more stable, and faster to learn from. Under high-dimensional observations it also explains the roots of sample efficiency through reward-to-Q-value shaping (reward shaping / Q-shaping) and information-theoretic / credit-assignment perspectives. A potential-based shaping sketch follows the list below.
- Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications(Sinan Ibrahim, Mostafa Mostafa, Ali Jnadi, Hadi Salloum, Pavel Osinenko, 2024, IEEE Access)
- Reward Function Design in Reinforcement Learning(Jonas Eschmann, 2021, Studies in Computational Intelligence)
- Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior(Zhiyuan Zhou, Shreyas Sundara Raman, Henry Sowerby, Michael L. Littman, 2022, ArXiv Preprint)
- Vanishing Bias Heuristic-guided Reinforcement Learning Algorithm(Qinru Li, Hao Xiang, 2023, ArXiv Preprint)
- Internally Rewarded Reinforcement Learning(Mengdi Li, Xufeng Zhao, Jae Hee Lee, Cornelius Weber, Stefan Wermter, 2023, ArXiv Preprint)
- From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning(Junseok Park, Hyeonseo Yang, Min Whoo Lee, Won-Seok Choi, Minsu Lee, Byoung-Tak Zhang, 2025, ArXiv Preprint)
- Solving Robotic Manipulation With Sparse Reward Reinforcement Learning Via Graph-Based Diversity and Proximity(Zhenshan Bing, Hongkuan Zhou, Rui Li, Xiaojie Su, F. Morin, Kai Huang, Alois Knoll, 2023, IEEE Transactions on Industrial Electronics)
- Revisiting Sparse Rewards for Goal-Reaching Reinforcement Learning(Gautham Vasan, Yan Wang, Fahim Shahriar, James Bergstra, Martin Jagersand, A. Rupam Mahmood, 2024, ArXiv Preprint)
- Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning(Tongzhou Wang, Antonio Torralba, Phillip Isola, Amy Zhang, 2023, ArXiv Preprint)
- Reward Shaping via Diffusion Process in Reinforcement Learning(Peeyush Kumar, 2023, ArXiv Preprint)
- Causal-Paced Deep Reinforcement Learning(Geonwoo Cho, Jaegyun Im, Doyoon Kim, Sundong Kim, 2025, ArXiv Preprint)
- Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning(Subhojyoti Mukherjee, Josiah P. Hanna, Qiaomin Xie, Robert Nowak, 2024, ArXiv Preprint)
- Towards General Negotiation Strategies with End-to-End Reinforcement Learning(Bram M. Renting, Thomas M. Moerland, Holger H. Hoos, Catholijn M. Jonker, 2024, ArXiv Preprint)
- Reinforcement Learning with Stochastic Reward Machines(Jan Corazza, Ivan Gavran, Daniel Neider, 2025, ArXiv Preprint)
- Selective Learning for Sample-Efficient Training in Multi-Agent Sparse Reward Tasks(Xinning Chen, Xuan Liu, Yanwen Ba, Shigeng Zhang, Bo Ding, Kenli Li, 2023, Frontiers in Artificial Intelligence and Applications)
- Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning(Bo Liu, Yihao Feng, Qiang Liu, Peter Stone, 2022, ArXiv Preprint)
- An Information-Theoretic Perspective on Credit Assignment in Reinforcement Learning(Dilip Arumugam, Peter Henderson, Pierre-Luc Bacon, 2021, ArXiv Preprint)
- From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge(Xiefeng Wu, 2024, ArXiv Preprint)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity(Abhishek Gupta, Sham Kakade, Sergey Levine, Aldo Pacchiano, Yuexiang Zhai, 2022, Advances in Neural Information Processing Systems 35)
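The classical grounding for much of this group is potential-based shaping: adding F(s, s') = gamma * phi(s') - phi(s) densifies a sparse reward without changing the optimal policy (Ng et al., 1999). A minimal sketch follows; the goal-distance potential used here is an illustrative assumption.

```python
import numpy as np

def shaped_reward(r, s, s_next, goal, gamma=0.99):
    # Potential: negative Euclidean distance to the goal, so moving
    # closer yields a positive shaping term.
    phi = lambda state: -np.linalg.norm(np.asarray(state) - np.asarray(goal))
    return r + gamma * phi(s_next) - phi(s)
```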
Training Paradigms and Engineering Plug-ins: Lowering the Cost of Each Interaction and Each Update
This group sits at the level of training paradigms, system plug-ins, and engineering workflow: more data-thrifty training setups (e.g., reset replay), the choice of evaluation and experimental paradigms, and how an algorithm's training cost and interaction overhead are organized all reduce the friction of exploiting each sample. It also retains specific training constraints closely tied to sample efficiency (e.g., stepwise fairness constraints) as engineering-ready sources of efficiency. A high update-to-data-ratio sketch follows the list below.
- Sample Efficient Reinforcement Learning with REINFORCE(Junzi Zhang, Jongho Kim, Brendan O'Donoghue, S. Boyd, 2020, Proceedings of the AAAI Conference on Artificial Intelligence)
- Sample-efficient LLM Optimization with Reset Replay(Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian, 2025, ArXiv Preprint)
- How Should We Meta-Learn Reinforcement Learning Algorithms?(Alexander David Goldie, Zilin Wang, Jaron Cohen, Jakob Nicolaus Foerster, Shimon Whiteson, 2025, ArXiv Preprint)
- Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control(Marco Biemann, F. Scheller, Xiufeng Liu, Lizhen Huang, 2021, Applied Energy)
- d3rlpy: An Offline Deep Reinforcement Learning Library(Takuma Seno, Michita Imai, 2021, ArXiv Preprint)
- Reinforcement Learning with Stepwise Fairness Constraints(Zhun Deng, He Sun, Zhiwei Steven Wu, Linjun Zhang, David C. Parkes, 2022, ArXiv Preprint)
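One concrete, paradigm-level lever behind several of these entries is simply taking more gradient updates per environment step (a higher update-to-data ratio). The sketch below assumes generic `env`/`agent`/`buffer` interfaces in the classic Gym 4-tuple style; utd_ratio=4 is an illustrative value.

```python
def train(env, agent, buffer, total_steps=100_000, utd_ratio=4):
    s = env.reset()
    for _ in range(total_steps):
        a = agent.act(s)
        s_next, r, done, _info = env.step(a)
        buffer.add((s, a, r, s_next, done))
        for _ in range(utd_ratio):               # the sample-efficiency lever:
            agent.update(buffer.sample(256))     # reuse each real sample often
        s = env.reset() if done else s_next
```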
Sample Efficiency in Theory: Sample Complexity, Regret, and Provably Efficient Algorithms
The common thread of this group is theoretical backing for provable, quantifiable sample efficiency: sample-complexity upper bounds, online convergence and regret bounds, bias-variance and convergence characterizations of experience replay and model-based methods, and minimax sample complexity or burn-in-free guarantees under specific MDP structures or tabular settings, explaining why certain mechanisms work in the low-sample regime. The standard definitions of regret and PAC sample complexity are written out after the list below.
- Reinforcement Learning with Stepwise Fairness Constraints(Zhun Deng, He Sun, Zhiwei Steven Wu, Linjun Zhang, David C. Parkes, 2022, ArXiv Preprint)
- Sample Efficient Reinforcement Learning with REINFORCE(Junzi Zhang, Jongho Kim, Brendan O'Donoghue, S. Boyd, 2020, Proceedings of the AAAI Conference on Artificial Intelligence)
- Settling the Sample Complexity of Online Reinforcement Learning(Zihan Zhang, Yuxin Chen, Jason D. Lee, S. Du, 2023, Journal of the ACM)
- A Lower Bound for the Sample Complexity of Inverse Reinforcement Learning(Abi Komanduru, Jean Honorio, 2021, ArXiv Preprint)
- On-Policy Deep Reinforcement Learning for the Average-Reward Criterion(Yiming Zhang, Keith W. Ross, 2021, ArXiv Preprint)
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity(Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Yuejie Chi, 2022, ArXiv Preprint)
- Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage(Jose Blanchet, Miao Lu, Tong Zhang, Han Zhong, 2023, ArXiv Preprint)
- Efficient Reinforcement Learning With the Novel N-Step Method and V-Network(Miaomiao Zhang, Shuo Zhang, Xinying Wu, Zhiyi Shi, Xiangyang Deng, Edmond Q. Wu, Xin Xu, 2024, IEEE Transactions on Cybernetics)
- On the sample complexity of actor-critic method for reinforcement learning with function approximation(Harshat Kumar, Alec Koppel, Alejandro Ribeiro, 2019, Machine Learning)
- Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes(Juan Sebastian Rojas, Chi-Guhn Lee, 2024, ArXiv Preprint)
- Convergence and Sample Complexity of Gradient Methods for the Model-Free Linear–Quadratic Regulator Problem(Hesameddin Mohammadi, A. Zare, M. Soltanolkotabi, M. Jovanovi'c, 2019, IEEE Transactions on Automatic Control)
- On the Convergence of Experience Replay in Policy Optimization: Characterizing Bias, Variance, and Finite-Time Convergence(Hua Zheng, Wei Xie, M. Ben Feng, 2021, ArXiv Preprint)
- Settling the sample complexity of model-based offline reinforcement learning(G Li, L Shi, Y Chen, Y Chi, Y Wei, 2024, The Annals of Statistics)
- Between Rate-Distortion Theory & Value Equivalence in Model-Based Reinforcement Learning(Dilip Arumugam, Benjamin Van Roy, 2022, ArXiv Preprint)
- A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning(Christoph Dann, Mehryar Mohri, Tong Zhang, Julian Zimmert, 2022, ArXiv Preprint)
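For orientation, the two quantities these papers most often bound can be stated in standard episodic finite-horizon notation (a generic formulation, not tied to any single paper in the list):

```latex
% Cumulative regret of the policies \pi_1, \dots, \pi_K played over K episodes:
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Bigl(V_1^{\star}(s_1^{k}) - V_1^{\pi_k}(s_1^{k})\Bigr)

% PAC sample complexity: the number of episodes needed so that the returned
% policy \hat{\pi} is \varepsilon-optimal with probability at least 1-\delta:
\Pr\Bigl[V_1^{\star}(s_1) - V_1^{\hat{\pi}}(s_1) \le \varepsilon\Bigr] \;\ge\; 1-\delta
```

Under this notation, a minimax-optimal online bound such as $\min\{\sqrt{SAH^3K}, HK\}$ says the regret is optimal for every sample size $K \ge 1$, i.e., without a burn-in period.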
Specific Domains and Complex Interaction Settings: Interaction Efficiency in Multi-Agent / Control / Real Systems
This group preserves what is distinctive about complex settings and applied validation in sample-efficiency research: multi-agent competition and cooperation (where the interaction budget is tight), and real engineering tasks such as control, system identification, energy management, and UAV data collection. It also includes end-to-end control on real systems and uncertainty-driven model-based RL, highlighting evaluation criteria of few-sample deployability, robustness, and stability.
- Strategically Efficient Exploration in Competitive Multi-agent Reinforcement Learning(Robert Loftin, Aadirupa Saha, Sam Devlin, Katja Hofmann, 2021, ArXiv Preprint)
- Enhancing Sample Efficiency in Multi-Agent RL with Uncertainty Quantification and Selective Exploration(Tom Danino, Nahum Shimkin, 2025, ArXiv Preprint)
- Robust experience replay sampling for multi-agent reinforcement learning(Isack Thomas Nicholaus, Dae-Ki Kang, 2021, Pattern Recognition Letters)
- Reinforcement learning with learned gadgets to tackle hard quantum problems on real hardware(Akash Kundu, Leopoldo Sarra, 2024, ArXiv Preprint)
- Real-time optimal energy management of Microgrid with uncertainties based on deep reinforcement learning(Chen-Hui Guo, Xin Wang, Yihui Zheng, Feng Zhang, 2021, Energy)
- Energy-Efficient UAV-Enabled Data Collection via Wireless Charging: A Reinforcement Learning Approach(Shu Fu, Yujie Tang, Yuan Wu, Ning Zhang, Huaxi Gu, Chen Chen, Min Liu, 2021, IEEE Internet of Things Journal)
- Algorithmic pricing with independent learners and relative experience replay(Bingyan Han, 2021, ArXiv Preprint)
- Reinforcement Learning in System Identification(Jose Antonio Martin H., Oscar Fernandez Vicente, Sergio Perez, Anas Belfadil, Cristina Ibanez-Llano, Freddy Jose Perozo Rondon, Jose Javier Valle, Javier Arechalde Pelaz, 2022, ArXiv Preprint)
- Energy efficient speed planning of electric vehicles for car-following scenario using model-based reinforcement learning(Heeyun Lee, Kyunghyun Kim, Namwook Kim, S. Cha, 2022, Applied Energy)
- End-to-End Reinforcement Learning for Torque Based Variable Height Hopping(Raghav Soni, Daniel Harnack, Hannah Isermann, Sotaro Fushimi, Shivesh Kumar, Frank Kirchner, 2023, ArXiv Preprint)
- Data-Efficient Deep Reinforcement Learning for Attitude Control of Fixed-Wing UAVs: Field Experiments(Eivind Bøhn, Erlend M. Coates, Dirk Reinhardt, Tor Arne Johansen, 2021, ArXiv Preprint)
- Model-Based Reinforcement Learning with SINDy(Rushiv Arora, Bruno Castro da Silva, Eliot Moss, 2022, ArXiv Preprint)
- Uncertainty-Aware Model-Based Reinforcement Learning: Methodology and Application in Autonomous Driving(Jingda Wu, Zhiyu Huang, Chen Lv, 2023, IEEE Transactions on Intelligent Vehicles)
The merged final grouping organizes the sample-efficiency literature into 7 parallel, minimally overlapping strands: (1) conservatism and distribution-shift mitigation in offline/batch RL; (2) experience replay / resampling to raise data reuse (including hindsight and structured replay); (3) exploration driven by information and uncertainty for few-sample interaction; (4) model-based methods with planning / imagined data (including uncertainty correction and synthetic experience); (5) reshaping reward and task signals (sparse/noisy/internal rewards, reward geometry, curricula, representations); (6) training paradigms and engineering/system-level plug-ins that lower the per-interaction cost; and (7) provably efficient theory and algorithmic structure (sample complexity, convergence, regret). Multi-agent, control, and real-system applications and complex settings are kept in a separate group to preserve their distinct concerns around interaction budgets and deployability.
152 related references in total.
Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of "bad" episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.
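For concreteness, a minimal single-trajectory REINFORCE gradient for a tabular softmax policy with log-barrier regularization (the setting the abstract above analyzes) might look as follows; the regularization weight and the per-visit application of the barrier term are simplifying assumptions of this sketch.

```python
import numpy as np

def reinforce_gradient(theta, trajectory, gamma=0.99, barrier=0.01):
    """theta: (S, A) array of logits; trajectory: list of (s, a, r) tuples."""
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G                    # Monte Carlo return-to-go
        pi = np.exp(theta[s] - theta[s].max())
        pi /= pi.sum()                       # softmax policy at state s
        grad_log = -pi
        grad_log[a] += 1.0                   # d log pi(a|s) / d theta[s]
        grad[s] += G * grad_log
        # Gradient of the log-barrier regularizer (barrier/|A|)*sum_a log pi(a|s):
        # it keeps action probabilities bounded away from zero, which the
        # convergence analysis relies on.
        grad[s] += barrier * (1.0 / len(pi) - pi)
    return grad
```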
Training an agent to solve control tasks directly from high-dimensional images with model-free reinforcement learning (RL) has proven difficult. A promising approach is to learn a latent representation together with the control policy. However, fitting a high-capacity encoder using a scarce reward signal is sample inefficient and leads to poor performance. Prior work has shown that auxiliary losses, such as image reconstruction, can aid efficient representation learning. However, incorporating reconstruction loss into an off-policy learning algorithm often leads to training instability. We explore the underlying reasons and identify variational autoencoders, used by previous investigations, as the cause of the divergence. Following these findings, we propose effective techniques to improve training stability. This results in a simple approach capable of matching state-of-the-art model-free and model-based algorithms on MuJoCo control tasks. Furthermore, our approach demonstrates robustness to observational noise, surpassing existing approaches in this setting. Code, results, and videos are anonymously available at https://sites.google.com/view/sac-ae/home.
Learning from demonstration (LfD) approaches have garnered significant interest for teaching social robots a variety of tasks in healthcare, educational, and service domains after they have been deployed. These LfD approaches often require a significant number of demonstrations for a robot to learn a performant model from task demonstrations. However, requiring non-experts to provide numerous demonstrations for a social robot to learn a task is impractical in real-world applications. In this paper, we propose a method to improve the sample efficiency of existing learning from demonstration approaches via data augmentation, dynamic experience replay sizes, and hierarchical Deep Q-Networks (DQN). After validating our methods on two different datasets, results suggest that our proposed hierarchical DQN is effective for improving sample efficiency when learning tasks from demonstration. In the future, such a sample-efficient approach has the potential to improve our ability to apply LfD approaches for social robots to learn tasks in domains where demonstration data is limited, sparse, and imbalanced.
The application of reinforcement learning (RL) in artificial intelligence has become increasingly widespread. However, its drawbacks are also apparent, as it requires a large number of samples for support, making the enhancement of sample efficiency a research focus. To address this issue, we propose a novel N-step method. This method extends the horizon of the agent, enabling it to acquire more long-term effective information, thus resolving the issue of data inefficiency in RL. Additionally, this N-step method can reduce the estimation variance of Q-function, which is one of the factors contributing to estimation errors in Q-function estimation. Apart from high variance, estimation bias in Q-function estimation is another factor leading to estimation errors. To mitigate the estimation bias of Q-function, we design a regularization method based on the V-function, which has been underexplored. The combination of these two methods perfectly addresses the problems of low sample efficiency and inaccurate Q-function estimation in RL. Finally, extensive experiments conducted in discrete and continuous action spaces demonstrate that the proposed novel N-step method, when combined with classical deep Q-network, deep deterministic policy gradient, and TD3 algorithms, is effective, consistently outperforming the classical algorithms.
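The core object in the abstract above is the N-step target, which bootstraps after N rewards instead of one, trading estimation variance against bias. A minimal sketch (omitting the paper's V-function regularizer) is:

```python
def n_step_target(rewards, q_bootstrap, gamma=0.99):
    """rewards: the next N rewards r_t .. r_{t+N-1};
    q_bootstrap: Q(s_{t+N}, a_{t+N}) from the target network."""
    target = q_bootstrap
    for r in reversed(rewards):              # fold in rewards back-to-front:
        target = r + gamma * target          # sum_i gamma^i r_{t+i} + gamma^N Q
    return target
```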
Controlling heating, ventilation and air-conditioning (HVAC) systems is crucial to improving demand-side energy efficiency. At the same time, the thermodynamics of buildings and uncertainties regarding human activities make effective management challenging. While the concept of model-free reinforcement learning demonstrates various advantages over existing strategies, the literature relies heavily on value-based methods that can hardly handle complex HVAC systems. This paper conducts experiments to evaluate four actor-critic algorithms in a simulated data centre. The performance evaluation is based on their ability to maintain thermal stability while increasing energy efficiency and on their adaptability to weather dynamics. Because of the enormous significance of practical use, special attention is paid to data efficiency. Compared to the model-based controller implemented into EnergyPlus, all applied algorithms can reduce energy consumption by at least 10% by simultaneously keeping the hourly average temperature in the desired range. Robustness tests in terms of different reward functions and weather conditions verify these results. With increasing training, we also see a smaller trade-off between thermal stability and energy reduction. Thus, the Soft Actor Critic algorithm achieves a stable performance with ten times less data than on-policy methods. In this regard, we recommend using this algorithm in future experiments, due to both its interesting theoretical properties and its practical results.
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al. [82], achieves a regret on the order of (modulo log factors) $\min\{\sqrt{SAH^3K},\, HK\}$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K \ge 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $SAH^3/\varepsilon^2$ up to log factor, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm (based on a new concept called "profiles") to decouple complicated statistical dependency across the sample trajectories, a long-standing challenge facing the analysis of online RL in the sample-starved regime.
Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.
… We demonstrate that the model-based (or “plug-in”) approach achieves minimax-optimal sample complexity without any burn-in cost for tabular Markov decision processes (MDPs) …
… improve the sample efficiency of the resulting reinforcement learning algorithm, provide a set of analysis techniques to understand improvements in sample complexity from shaping, …
Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this article, we take a step toward demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear–quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log(1/\epsilon)$.
Purpose: Offline reinforcement learning (RL) acquires effective policies by using prior collected large-scale data, while, in some scenarios, collecting data may be hard because it is time-consuming, expensive and dangerous, e.g. health care and autonomous driving, motivating a more efficient offline RL method. The purpose of the study is to introduce an algorithm which attempts to sample the high-value transitions in a prioritized buffer and uniformly sample from a normal experience buffer, improving the sample efficiency of offline reinforcement learning as well as alleviating the "extrapolation error" commonly arising in offline RL.
Design/methodology/approach: The authors propose a new experience replay architecture, which consists of double experience replays, a prioritized experience replay and a normal experience replay, supplying samples for policy updates in different training phases. At the first training stage, the authors sample from the prioritized experience replay according to the calculated priority of each transition. At the second training stage, the authors sample from the normal experience replay uniformly. The combination of the two experience replays is initialized by the same offline data set.
Findings: The proposed method eliminates the out-of-distribution problem in an offline RL regime, and promotes training by leveraging a new efficient experience replay. The authors evaluate their method on the D4RL benchmark, and the results reveal that the algorithm can achieve superior performance over the state-of-the-art offline RL algorithm. The ablation study proves that the authors' experience replay architecture plays an important role in improving final performance, data-efficiency and training stability.
Research limitations/implications: Because of the extra addition of prioritized experience replay, the proposed method increases the computational burden and has the risk of changing the data distribution due to the combined sample strategy. Therefore, researchers are encouraged to use the experience replay block effectively and efficiently.
Practical implications: Offline RL is susceptible to the quality and coverage of pre-collected data, which may not be easy to collect from a specific environment, demanding practitioners to handcraft a behavior policy to interact with the environment for gathering data.
Originality/value: The proposed approach focuses on the experience replay architecture for offline RL, and empirically demonstrates the superiority of the algorithm on data efficiency and final performance over conservative Q-learning across diverse D4RL tasks. In particular, the authors compare different variants of their experience replay block, and the experiments show that the stages at which the priority buffer is sampled play an important role in the algorithm. The algorithm is easy to implement and can be combined with any Q-value approximation-based offline RL method by minor adjustment.
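A minimal sketch of the two-stage double-buffer sampling scheme this abstract describes: early updates draw from a prioritized buffer, later updates draw uniformly, with both buffers initialized from the same offline dataset. The phase boundary and the buffer API are hypothetical.

```python
def sample_batch(step, switch_step, prioritized_buffer, uniform_buffer,
                 batch_size=256):
    if step < switch_step:
        # Stage 1: chase high-value transitions via priorities.
        return prioritized_buffer.sample(batch_size)
    # Stage 2: uniform sampling stabilizes training and limits
    # distribution drift from the offline dataset.
    return uniform_buffer.sample(batch_size)
```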
From the first theoretical propositions in the 1950s to its application in real-world problems, Reinforcement Learning (RL) is still a fascinating and complex class of machine learning algorithms with overgrowing literature in recent years. In this work, we present an extensive and structured literature review and discuss how the Experience Replay (ER) technique has been fundamental in making various RL methods in most relevant problems and different domains more data efficient. ER is the central focus of this review. One of its main contributions is a taxonomy that organizes the many research works and the different RL methods that use ER. Here, the focus is on how RL methods improve and apply ER strategies, demonstrating their specificities and contributions while having ER as a prominent component. Another relevant contribution is the organization in a facet-oriented way, allowing different perspectives of reading, whether based on the fundamental problems of RL, focusing on algorithmic strategies and architectural decisions, or with a view to different applications of RL with ER. Moreover, we start by presenting a detailed formal theoretical foundation of RL and some of the most relevant algorithms and bring from the recent literature some of the main trends, challenges, and advances focused on ER formal basement and how to improve its propositions to make it even more efficient in different methods and domains. Lastly, we discuss challenges and open problems and present relevant paths to feature works.
… a clustering experience replay, called CER, to effectively exploit the experience hidden in all … We construct a new method, TD3_CER, to implement our clustering experience replay on …
… approach to experience sampling can yield improved learning efficiency and performance. … the sampling probabilities of stored transitions in the replay buffer. Unlike existing sampling …
Experience replay has been instrumental in achieving significant advancements in reinforcement learning by increasing the utilization of data. To further improve the sampling efficiency, prioritized experience replay (PER) was proposed. This algorithm prioritizes experiences based on the temporal difference error (TD error), enabling the agent to learn from more valuable experiences stored in the experience pool. While various prioritized algorithms have been proposed, they ignored the dynamic changes of experience value during the training process, merely combining different priority criteria in a fixed or linear manner. In this paper, we present a novel prioritized experience replay algorithm called PERDP, which employs a dynamic priority adjustment framework. PERDP adaptively adjusts the weights of each criterion based on average priority level of the experience pool and evaluates experiences’ value according to current network. We apply this algorithm to the SAC model and conduct experiments in the OpenAI Gym experimental environment. The experiment results demonstrate that the PERDP exhibits superior convergence speed when compared to the PER.
A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of PER. The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms.
Experience replay has been widely used in deep reinforcement learning. The learning algorithm allows online reinforcement learning agents to remember and reuse experiences from the past. In order to further improve the sampling efficiency of experience replay, the most useful experiences are expected to be sampled with higher frequency. Existing methods usually designed their sampling strategy according to a few criteria, but they tended to combine different criteria in a linear or fixed manner, where the strategy was static and independent of the agent learner. This ignores the dynamic attributes of the environment and can only lead to suboptimal performance. In this work, we propose a dynamic experience replay strategy based on the interaction between the agent and environment, called Prioritized Experience Replay based on Multi-armed Bandit (PERMAB). PERMAB can adaptively combine multiple priority criteria to measure the importance of each experience. In particular, the weight of each assessing criterion can be adaptively adjusted from episode to episode according to its respective contribution to agent performance, which guarantees useful criteria are weighted more in the current state. The proposed replay strategy takes both sample informativeness and diversity into consideration, which significantly boosts the learning ability and speed of the game agent. Experimental results show that PERMAB accelerates network learning and achieves better performance compared to baseline algorithms on seven benchmark environments of varying difficulty.
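A minimal sketch of the adaptive multi-criterion weighting idea described above: each priority criterion is treated like a bandit arm whose weight grows with its measured contribution. The exponential update rule and the criterion interface are illustrative assumptions, not PERMAB's exact algorithm.

```python
import numpy as np

class AdaptivePriority:
    def __init__(self, criteria, lr=0.1):
        self.criteria = criteria              # list of fn(transition) -> score
        self.lr = lr
        self.weights = np.ones(len(criteria))

    def priority(self, transition):
        # Priority is a weighted mix of all criteria (normalized weights).
        w = self.weights / self.weights.sum()
        return float(sum(wi * c(transition) for wi, c in zip(w, self.criteria)))

    def feedback(self, per_criterion_gain):
        # per_criterion_gain: episode-level credit for each criterion, e.g.
        # return improvement attributed to the batches it scored highly.
        self.weights *= np.exp(self.lr * np.asarray(per_criterion_gain))
```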
… promising approaches to improve the sample efficiency of reinforcement learning methods. In … Our Forgetful Experience Replay (ForgER) algorithm effectively handles expert data errors …
Model-based reinforcement learning algorithms, which aim to learn a model of the environment to make decisions, are more sample efficient than their model-free counterparts. The sample efficiency of model-based approaches relies on whether the model can well approximate the environment. However, learning an accurate model is challenging, especially in complex and noisy environments. To tackle this problem, we propose the conservative model-based actor-critic (CMBAC), a novel approach that achieves high sample efficiency without the strong reliance on accurate learned models. Specifically, CMBAC learns multiple estimates of the Q-value function from a set of inaccurate models and uses the average of the bottom-k estimates---a conservative estimate---to optimize the policy. An appealing feature of CMBAC is that the conservative estimates effectively encourage the agent to avoid unreliable “promising actions”---whose values are high in only a small fraction of the models. Experiments demonstrate that CMBAC significantly outperforms state-of-the-art approaches in terms of sample efficiency on several challenging control tasks, and the proposed method is more robust than previous methods in noisy environments.
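The conservative bottom-k value estimate at the heart of the abstract above is simple to state; in this sketch, k and the ensemble interface are assumptions.

```python
import numpy as np

def conservative_q(q_estimates, k=2):
    """q_estimates: per-model Q-values for one (s, a) pair, shape (n_models,).
    Averaging the k lowest estimates discounts actions that look good
    in only a small fraction of the models."""
    lowest_k = np.sort(np.asarray(q_estimates))[:k]
    return lowest_k.mean()
```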
We propose a model-based reinforcement learning (RL) approach for noisy time-dependent gate optimization with reduced sample complexity over model-free RL. Sample complexity is defined as the number of controller interactions with the physical system. Leveraging an inductive bias, inspired by recent advances in neural ordinary differential equations (ODEs), we use an autodifferentiable ODE, parametrized by a learnable Hamiltonian ansatz, to represent the model approximating the environment, whose time-dependent part, including the control, is fully known. Control alongside Hamiltonian learning of continuous time-independent parameters is addressed through interactions with the system. We demonstrate an order of magnitude advantage in sample complexity of our method over standard model-free RL in preparing some standard unitary gates with closed and open system dynamics, in realistic computational experiments incorporating single-shot measurements, arbitrary Hilbert space truncations, and uncertainty in Hamiltonian parameters. Also, the learned Hamiltonian can be leveraged by existing control methods like GRAPE (gradient ascent pulse engineering) for further gradient-based optimization with the controllers found by RL as initializations. Our algorithm, which we apply to nitrogen vacancy (NV) centers and transmons, is well suited for controlling partially characterized one- and two-qubit systems.
Model-based reinforcement learning promises to be an effective way to bring reinforcement learning to real-world robotic systems by offering a sample efficient learning approach compared to model-free reinforcement learning. However, model-based reinforcement learning approaches at present struggle to match the performance of model-free ones. This work attempts to fill this gap by improving the performance of model-based reinforcement learning while further improving its sample efficiency. To improve the sample efficiency, an exploration strategy is formulated which maximizes the information gain. The asymptotic performance is improved by compensating for the model-bias using a model-free critic. We have evaluated our proposed approach on four reinforcement learning benchmarking tasks in the openAI gym framework.
To further improve learning efficiency and performance of reinforcement learning (RL), a novel uncertainty-aware model-based RL method is proposed and validated in autonomous driving scenarios in this paper. First, an action-conditioned ensemble model with the capability of uncertainty assessment is established as the environment model. Then, a novel uncertainty-aware model-based RL method is developed based on the adaptive truncation approach, providing virtual interactions between the agent and environment model, and improving RL’s learning efficiency and performance. The proposed method is then implemented in end-to-end autonomous vehicle control tasks, validated and compared with state-of-the-art methods under various driving scenarios. Validation results suggest that the proposed method outperforms the model-free RL approach with respect to learning efficiency, and model-based approach with respect to both efficiency and performance, demonstrating its feasibility and effectiveness.
Transporting suspended payloads is challenging for autonomous aerial vehicles because the payload can cause significant and unpredictable changes to the robot's dynamics. These changes can lead to suboptimal flight performance or even catastrophic failure. Although adaptive control and learning-based methods can in principle adapt to changes in these hybrid robot-payload systems, rapid mid-flight adaptation to payloads that have a priori unknown physical properties remains an open problem. We propose a meta-learning approach that “learns how to learn” models of altered dynamics within seconds of post-connection flight data. Our experiments demonstrate that our online adaptation approach outperforms non-adaptive methods on a series of challenging suspended payload transportation tasks. Videos and other supplemental material are available on our website: https://sites.google.com/view/meta-rl-for-flight
… of our proposed world model structure, we compare it with model-based DRL algorithms that share a similar training pipeline, as discussed in Section 2. However, similarly to [10, 12, 13]…
When facing an unfamiliar environment, animals need to explore to gain new knowledge about which actions provide reward, but also put the newly acquired knowledge to use as quickly as possible. Optimal reinforcement learning strategies should therefore assess the uncertainties of these action-reward associations and utilise them to inform decision making. We propose a novel model whereby direct and indirect striatal pathways act together to estimate both the mean and variance of reward distributions, and mesolimbic dopaminergic neurons provide transient novelty signals, facilitating effective uncertainty-driven exploration. We utilised electrophysiological recording data to verify our model of the basal ganglia, and we fitted exploration strategies derived from the neural model to data from behavioural experiments. We also compared the performance of directed exploration strategies inspired by our basal ganglia model with other exploration algorithms including classic variants of upper confidence bound (UCB) strategy in simulation. The exploration strategies inspired by the basal ganglia model can achieve overall superior performance in simulation, and we found qualitatively similar results in fitting model to behavioural data compared with the fitting of more idealised normative models with less implementation level detail. Overall, our results suggest that transient dopamine levels in the basal ganglia that encode novelty could contribute to an uncertainty representation which efficiently drives exploration in reinforcement learning.
Uncertainty characterization via posteriors followed by Bayesian updates is an acclaimed way to aid the exploration of model-free Reinforcement Learning (RL) algorithms. Motivated by the Bayesian RL literature, we build Maximum Mean Discrepancy Q-Learning (MMD-QL) to enhance the well-known Wasserstein Q-Learning (WQL) for more accurately estimating the posteriors and, as a result, accomplish even better exploration. MMD-QL leverages the Maximum Mean Discrepancy (MMD) barycenter, as for many positive semi-definite kernels that induce smooth function classes, MMD offers a more precise notion of distance between distributions than the Wasserstein distance. We analytically prove that a slightly modified version of MMD-QL is Probably Approximately Correct in Markov Decision Processes (PAC-MDP) under the average loss metric. Thorough experiments on several tabular domains illustrate that MMD-QL mainly surpasses the performance of WQL and other algorithms. Unlike many provably efficient RL algorithms, the framework of MMD-QL can extend to function approximation. Specifically, we incorporate Bootstrapped DQN-style deep networks, giving rise to Maximum Mean Discrepancy Q-Network (MMD-QN). Our experiments on a few Atari games reveal that MMD-QN delivers better than the deep equivalent of WQL, emphasizing the efficacy of MMD barycenters for propagating uncertainty in environments with large state-action space size.
… We seek to understand what facilitates sample-efficient learning from historical … offline reinforcement learning (RL). Further, we are interested in algorithms that enjoy sample efficiency …
… In this paper, we propose an efficient yet powerful policy class for offline reinforcement learning. We show that this method is superior to most existing methods on simulated robotic tasks…
… the offline robust RL problem where the offline data is generated according to a (training) … function approximation, with provable guarantees on the performance of the learned policy. …
With the widespread adoption of deep learning, reinforcement learning (RL) has experienced a dramatic increase in popularity, scaling to previously intractable problems, such as playing complex games from pixel observations, sustaining conversations with humans, and controlling robotic agents. However, there is still a wide range of domains inaccessible to RL due to the high cost and danger of interacting with the environment. Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets. Effective offline RL algorithms have a much wider range of applications than online RL, being particularly appealing for real-world applications, such as education, healthcare, and robotics. In this work, we contribute with a unifying taxonomy to classify offline RL methods. Furthermore, we provide a comprehensive review of the latest algorithmic breakthroughs in the field using a unified notation as well as a review of existing benchmarks’ properties and shortcomings. Additionally, we provide a figure that summarizes the performance of each method and class of methods on different dataset properties, equipping researchers with the tools to decide which type of algorithm is best suited for the problem at hand and identify which classes of algorithms look the most promising. Finally, we provide our perspective on open problems and propose future research directions for this rapidly growing field.
… match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data. The …
In multigoal reinforcement learning (RL), algorithms usually suffer from inefficiency in the collection of successful experiences in tasks with sparse rewards. By utilizing the ideas of relabeling hindsight experience and curriculum learning, some prior works have greatly improved the sample efficiency in robotic manipulation tasks, such as hindsight experience replay (HER), hindsight goal generation (HGG), graph-based HGG (G-HGG), and curriculum-guided HER (CHER). However, none of these can learn efficiently to solve challenging manipulation tasks with distant goals and obstacles, since they rely either on heuristic or simple distance-guided exploration. In this article, we introduce graph-curriculum-guided HGG (GC-HGG), an extension of CHER and G-HGG, which works by selecting hindsight goals on the basis of graph-based proximity and diversity. We evaluated GC-HGG in four challenging manipulation tasks involving obstacles in both simulations and real-world experiments, in which significant enhancements in both sample efficiency and overall success rates over prior works were demonstrated. Videos and codes can be viewed at this link: https://videoviewsite.wixsite.com/gc-hgg.
Optimistic value estimates provide one mechanism for directed exploration in reinforcement learning (RL). The agent acts greedily with respect to an estimate of the value plus what can be seen as a value bonus. The value bonus can be learned by estimating a value function on reward bonuses, propagating local uncertainties around rewards. However, this approach only increases the value bonus for an action retroactively, after seeing a higher reward bonus from that state and action. Such an approach does not encourage the agent to visit a state and action for the first time. In this work, we introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), which maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. The key idea is to design the rewards for these RQFs in such a way that the value bonus can decrease to zero. We show that VBE outperforms Bootstrap DQN and two reward bonus approaches (RND and ACB) on several classic environments used to test exploration, and provide demonstrative experiments showing that it can scale easily to more complex environments like Atari.
Designing effective task sequences is crucial for curriculum reinforcement learning (CRL), where agents must gradually acquire skills by training on intermediate tasks. A key challenge in CRL is to identify tasks that promote exploration, yet are similar enough to support effective transfer. While a recent approach suggests comparing tasks via their Structural Causal Models (SCMs), that method requires access to ground-truth causal structures, an unrealistic assumption in most RL settings. In this work, we propose Causal-Paced Deep Reinforcement Learning (CP-DRL), a curriculum learning framework that approximates SCM differences between tasks from interaction data. This signal captures task novelty, which we combine with the agent's learnability, measured by reward gain, to form a unified objective. Empirically, CP-DRL outperforms existing curriculum methods on the Point Mass benchmark, achieving faster convergence and higher returns. CP-DRL demonstrates reduced variance with comparable final returns in the Bipedal Walker-Trivial setting, and achieves the highest average performance in the Infeasible variant. These results indicate that leveraging causal relationships between tasks can improve the structure-awareness and sample efficiency of curriculum reinforcement learning. We provide the full implementation of CP-DRL to facilitate the reproduction of our main results at https://github.com/Cho-Geonwoo/CP-DRL.
Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Building on this insight, we introduce the metric residual network (MRN), which deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
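The decomposition the abstract describes can be sketched directly. The following is a hedged illustration, not the authors' implementation: the Euclidean choice for the metric head, the softplus-activated residual, and all layer sizes are assumptions; only the form Q(s,a,g) = -(metric + asymmetric residual) comes from the abstract.

```python
# A minimal sketch of a metric-plus-residual decomposition of Q(s, a, g).
import torch
import torch.nn as nn

class MetricResidualQ(nn.Module):
    def __init__(self, state_dim, action_dim, goal_dim, hidden=256, emb=64):
        super().__init__()
        self.sa_enc = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                    nn.ReLU(), nn.Linear(hidden, emb))
        self.g_enc = nn.Sequential(nn.Linear(goal_dim, hidden),
                                   nn.ReLU(), nn.Linear(hidden, emb))
        # Residual head sees both embeddings; softplus keeps it nonnegative,
        # and it need not be symmetric in its two arguments.
        self.res = nn.Sequential(nn.Linear(2 * emb, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, s, a, g):
        x = self.sa_enc(torch.cat([s, a], dim=-1))
        y = self.g_enc(g)
        metric = torch.norm(x - y, dim=-1, keepdim=True)   # symmetric part
        residual = self.res(torch.cat([x, y], dim=-1))     # asymmetric part
        return -(metric + residual)                        # negated sum

q = MetricResidualQ(state_dim=8, action_dim=2, goal_dim=3)
print(q(torch.randn(4, 8), torch.randn(4, 2), torch.randn(4, 3)).shape)
```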
Recent advances in reinforcement learning have demonstrated its ability to solve hard agent-environment interaction tasks at a super-human level. However, the application of reinforcement learning methods to practical and real-world tasks is currently limited due to the sample inefficiency of most state-of-the-art RL algorithms, i.e., the need for a vast number of training episodes. For example, the OpenAI Five algorithm that beat human players in Dota 2 trained for thousands of years of game time. Several approaches exist that tackle the issue of sample inefficiency, either offering more efficient usage of already gathered experience or aiming to gain more relevant and diverse experience via better exploration of an environment. However, to our knowledge, no such approach exists for model-based algorithms, which have shown high sample efficiency in solving hard control tasks with high-dimensional state spaces. This work connects exploration techniques and model-based reinforcement learning. We have designed a novel exploration method that takes into account features of the model-based approach. We also demonstrate through experiments that our method significantly improves the performance of the model-based algorithm Dreamer.
This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Unlike the existing preference-based and ranking-based reinforcement learning paradigms, which rely on human relative preferences over sample pairs, the proposed rating-based reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The rating-based reinforcement learning approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies based on synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new rating-based reinforcement learning approach.
Multi-objective reinforcement learning (MORL) aims to optimize policies in the presence of conflicting objectives, where linear scalarization is commonly used to reduce vector-valued returns into scalar signals. While effective for certain preferences, this approach cannot capture fairness-oriented goals such as Nash social welfare or max-min fairness, which require nonlinear and non-additive trade-offs. Although several online algorithms have been proposed for specific fairness objectives, a unified approach for optimizing nonlinear welfare criteria in the offline setting, where learning must proceed from a fixed dataset, remains unexplored. In this work, we present FairDICE, the first offline MORL framework that directly optimizes nonlinear welfare objectives. FairDICE leverages distribution correction estimation to jointly account for welfare maximization and distributional regularization, enabling stable and sample-efficient learning without requiring explicit preference weights or exhaustive weight search. Across multiple offline benchmarks, FairDICE demonstrates strong fairness-aware performance compared to existing baselines.
The process of meta-learning algorithms from data, instead of relying on manual design, is growing in popularity as a paradigm for improving the performance of machine learning systems. Meta-learning shows particular promise for reinforcement learning (RL), where algorithms are often adapted from supervised or unsupervised learning despite their suboptimality for RL. However, until now there has been a severe lack of comparison between different meta-learning algorithms, such as using evolution to optimise over black-box functions or LLMs to propose code. In this paper, we carry out this empirical comparison of the different approaches when applied to a range of meta-learned algorithms which target different parts of the RL pipeline. In addition to meta-train and meta-test performance, we also investigate factors including the interpretability, sample cost and train time for each meta-learning algorithm. Based on these findings, we propose several guidelines for meta-learning new RL algorithms which will help ensure that future learned algorithms are as performant as possible.
In real-world reinforcement learning applications the learner's observation space is ubiquitously high-dimensional with both relevant and irrelevant information about the task at hand. Learning from high-dimensional observations has been the subject of extensive investigation in supervised learning and statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning, the Exogenous Markov Decision Process (ExoMDP), in which the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner's actions, but evolves in an arbitrary, temporally correlated fashion. We provide a new algorithm, ExoRL, which learns a near-optimal policy with sample complexity polynomial in the size of the endogenous component and nearly independent of the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation going forward.
In goal-reaching reinforcement learning (RL), the optimal value function has a particular geometry, called quasimetric structure. This paper introduces Quasimetric Reinforcement Learning (QRL), a new RL method that utilizes quasimetric models to learn optimal value functions. Distinct from prior approaches, the QRL objective is specifically designed for quasimetrics, and provides strong theoretical recovery guarantees. Empirically, we conduct thorough analyses on a discretized MountainCar environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations.
Attitude control of fixed-wing unmanned aerial vehicles (UAVs) is a difficult control problem in part due to uncertain nonlinear dynamics, actuator constraints, and coupled longitudinal and lateral motions. Current state-of-the-art autopilots are based on linear control and are thus limited in their effectiveness and performance. Deep reinforcement learning (DRL) is a machine learning method to automatically discover optimal control laws through interaction with the controlled system, which can handle complex nonlinear dynamics. We show in this paper that DRL can successfully learn to perform attitude control of a fixed-wing UAV operating directly on the original nonlinear dynamics, requiring as little as three minutes of flight data. We initially train our model in a simulation environment and then deploy the learned controller on the UAV in flight tests, demonstrating comparable performance to the state-of-the-art ArduPlane proportional-integral-derivative (PID) attitude controller with no further online learning required. Learning with significant actuation delay and diversified simulated dynamics was found to be crucial for successful transfer to control of the real UAV. In addition to a qualitative comparison with the ArduPlane autopilot, we present a quantitative assessment based on linear analysis to better understand the learning controller's behavior.
Modern decision-making systems, from robots to web recommendation engines, are expected to adapt: to user preferences, changing circumstances or even new tasks. Yet, it is still uncommon to deploy a dynamically learning agent (rather than a fixed policy) to a production system, as it is perceived as unsafe. Using historical data to reason about learning algorithms, similar to offline policy evaluation (OPE) applied to fixed policies, could help practitioners evaluate and ultimately deploy such adaptive agents to production. In this work, we formalize offline learner simulation (OLS) for reinforcement learning (RL) and propose a novel evaluation protocol that measures both fidelity and efficiency of the simulation. For environments with complex high-dimensional observations, we propose a semi-parametric approach that leverages recent advances in latent state discovery in order to achieve accurate and efficient offline simulations. In preliminary experiments, we show the advantage of our approach compared to fully non-parametric baselines. The code to reproduce these experiments will be made available at https://github.com/microsoft/rl-offline-simulation.
Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts -- which do not require explicit model estimation -- have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption which does not require the full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction.
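As a rough illustration of the pessimism principle in a model-free update (not the paper's exact algorithm or its variance-reduced variant), the bootstrapped target can be lowered by a confidence penalty that shrinks with the visitation count; the penalty form, the constant `c`, and the 1/n step size below are assumptions.

```python
# Hedged sketch: tabular Q-learning with a count-based pessimism penalty.
import numpy as np

def pessimistic_q_update(Q, counts, s, a, r, s_next, gamma=0.99, c=1.0):
    counts[s, a] += 1
    n = counts[s, a]
    lr = 1.0 / n                        # simple 1/n step size
    penalty = c / np.sqrt(n)            # shrinks as (s, a) is seen more often
    target = r + gamma * Q[s_next].max() - penalty
    Q[s, a] += lr * (target - Q[s, a])
    return Q

Q = np.zeros((5, 2)); counts = np.zeros((5, 2))
Q = pessimistic_q_update(Q, counts, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])
```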
Experience replay is a core ingredient of modern deep reinforcement learning, yet its benefits in policy optimization are poorly understood beyond empirical heuristics. This paper develops a novel theoretical framework for experience replay in modern policy gradient methods, where two sources of dependence fundamentally complicate analysis: Markovian correlations along trajectories and policy drift across optimization iterations. We introduce a new proof technique based on auxiliary Markov chains and lag-based decoupling that makes these dependencies tractable. Within this framework, we derive finite-time bias bounds for policy-gradient estimators under replay, identifying how bias scales with the cumulative policy update, the mixing time of the underlying dynamics, and the age of buffered data, thereby formalizing the practitioner's rule of avoiding overly stale replay. We further provide a correlation-aware variance decomposition showing how sample dependence governs gradient variance from replay and when replay is beneficial. Building on these characterizations, we establish the finite-time convergence guarantees for experience-replay-based policy optimization, explicitly quantifying how buffer size, sample correlation, and mixing jointly determine the convergence rate and revealing an inherent bias-variance trade-off: larger buffers can reduce variance by averaging less correlated samples but can increase bias as data become stale. These results offer a principled guide for buffer sizing and replay schedules, bridging prior empirical findings with quantitative theory.
Agents must be able to adapt quickly as an environment changes. We find that existing model-based reinforcement learning agents are unable to do this well, in part because of how they use past experiences to train their world model. Here, we present Curious Replay -- a form of prioritized experience replay tailored to model-based agents through use of a curiosity-based priority signal. Agents using Curious Replay exhibit improved performance in an exploration paradigm inspired by animal behavior and on the Crafter benchmark. DreamerV3 with Curious Replay surpasses state-of-the-art performance on Crafter, achieving a mean score of 19.4 that substantially improves on the previous high score of 14.5 by DreamerV3 with uniform replay, while also maintaining similar performance on the Deepmind Control Suite. Code for Curious Replay is available at https://github.com/AutonomousAgentsLab/curiousreplay
Experience replay plays a crucial role in improving the sample efficiency of deep reinforcement learning agents. Recent advances in experience replay propose using Mixup (Zhang et al., 2018) to further improve sample efficiency via synthetic sample generation. We build upon this technique with Neighborhood Mixup Experience Replay (NMER), a geometrically-grounded replay buffer that interpolates transitions with their closest neighbors in state-action space. NMER preserves a locally linear approximation of the transition manifold by only applying Mixup between transitions with vicinal state-action features. Under NMER, a given transition's set of state-action neighbors is dynamic and episode-agnostic, in turn encouraging greater policy generalizability via inter-episode interpolation. We combine our approach with recent off-policy deep reinforcement learning algorithms and evaluate on continuous control environments. We observe that NMER improves sample efficiency by an average 94% (TD3) and 29% (SAC) over baseline replay buffers, enabling agents to effectively recombine previous experiences and learn from limited data.
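A minimal sketch of the neighborhood-restricted Mixup idea, under assumed details (flat transition arrays, Euclidean nearest-neighbor search, a Beta-distributed mixing coefficient); this is illustrative, not the paper's implementation.

```python
# Sketch: interpolate a sampled transition with its nearest neighbor
# in concatenated state-action space.
import numpy as np

def nmer_sample(states, actions, rewards, next_states, rng, alpha=0.4):
    i = rng.integers(len(states))
    sa = np.concatenate([states, actions], axis=1)
    # nearest neighbor of transition i in state-action space (excluding i)
    d = ((sa - sa[i]) ** 2).sum(1)
    d[i] = np.inf
    j = int(d.argmin())
    lam = rng.beta(alpha, alpha)        # Mixup coefficient
    mix = lambda x: lam * x[i] + (1 - lam) * x[j]
    return mix(states), mix(actions), mix(rewards), mix(next_states)

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 4)); A = rng.normal(size=(100, 2))
R = rng.normal(size=(100, 1)); S2 = rng.normal(size=(100, 4))
print(nmer_sample(S, A, R, S2, rng)[0])
```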
Hindsight Experience Replay (HER) is a technique used in reinforcement learning (RL) that has proven to be very efficient for training off-policy RL-based agents to solve goal-based robotic manipulation tasks using sparse rewards. Even though HER improves the sample efficiency of RL-based agents by learning from mistakes made in past experiences, it does not provide any guidance while exploring the environment. This leads to very large training times due to the volume of experience required to train an agent using this replay strategy. In this paper, we propose a method that uses primitive behaviours that have been previously learned to solve simple tasks in order to guide the agent toward more rewarding actions during exploration while learning other more complex tasks. This guidance, however, is not executed by a manually designed curriculum, but rather by using a critic network to decide at each timestep whether or not to use the actions proposed by the previously-learned primitive policies. We evaluate our method by comparing its performance against HER and other more efficient variations of this algorithm in several block manipulation tasks. We demonstrate that agents can learn a successful policy faster when using our proposed method, both in terms of sample efficiency and computation time. Code is available at https://github.com/franroldans/qmp-her.
Reinforcement learning (RL) in non-stationary environments is challenging, as changing dynamics and rewards quickly make past experiences outdated. Traditional experience replay (ER) methods, especially those using TD-error prioritization, struggle to distinguish between changes caused by the agent's policy and those from the environment, resulting in inefficient learning under dynamic conditions. To address this challenge, we propose the Discrepancy of Environment Dynamics (DoE), a metric that isolates the effects of environment shifts on value functions. Building on this, we introduce Discrepancy of Environment Prioritized Experience Replay (DEER), an adaptive ER framework that prioritizes transitions based on both policy updates and environmental changes. DEER uses a binary classifier to detect environment changes and applies distinct prioritization strategies before and after each shift, enabling more sample-efficient learning. Experiments on four non-stationary benchmarks demonstrate that DEER further improves the performance of off-policy algorithms by 11.54 percent compared to the best-performing state-of-the-art ER methods.
Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a determinantal point process to model the diversity between samples and prioritizes replay accordingly. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.
Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-10 benchmark.
Several algorithms have been proposed to sample non-uniformly the replay buffer of deep Reinforcement Learning (RL) agents to speed-up learning, but very few theoretical foundations of these sampling schemes have been provided. Among others, Prioritized Experience Replay appears as a hyperparameter-sensitive heuristic, even though it can provide good performance. In this work, we cast the replay buffer sampling problem as an importance sampling one for estimating the gradient. This allows deriving the theoretically optimal sampling distribution, yielding the best theoretical convergence speed. Elaborating on the knowledge of the ideal sampling scheme, we exhibit new theoretical foundations of Prioritized Experience Replay. The optimal sampling distribution being intractable, we make several approximations providing good results in practice and introduce, among others, LaBER (Large Batch Experience Replay), an easy-to-code and efficient method for sampling the replay buffer. LaBER, which can be combined with Deep Q-Networks, distributional RL agents or actor-critic methods, yields improved performance over a diverse range of Atari games and PyBullet environments, compared to the base agent it is implemented on and to other prioritization schemes.
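The two-stage scheme the abstract outlines can be sketched as follows; the use of absolute TD errors as surrogate priorities and the factor `m` between the large batch and the mini-batch are assumptions for illustration.

```python
# Sketch of LaBER-style sampling: a large uniform batch, then a
# priority-proportional subsample with importance weights.
import numpy as np

def laber_minibatch(buffer_td_errors, batch_size=32, m=4, rng=None):
    rng = rng or np.random.default_rng()
    # Stage 1: a large uniform batch of size m * batch_size.
    large = rng.integers(len(buffer_td_errors), size=m * batch_size)
    prio = np.abs(buffer_td_errors[large]) + 1e-6
    p = prio / prio.sum()
    # Stage 2: subsample proportionally to priority within the large batch.
    pick = rng.choice(len(large), size=batch_size, p=p)
    idx = large[pick]
    # Importance weights that keep the gradient estimate unbiased.
    weights = prio.mean() / prio[pick]
    return idx, weights

td = np.random.default_rng(0).normal(size=10_000)
idx, w = laber_minibatch(td)
print(idx[:5], w[:5])
```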
Experience replay is a key component in reinforcement learning for stabilizing learning and improving sample efficiency. Its typical implementation samples transitions with replacement from a replay buffer. In contrast, in supervised learning with a fixed dataset, it is a common practice to shuffle the dataset every epoch and consume data sequentially, which is called random reshuffling (RR). RR enjoys theoretically better convergence properties and has been shown to outperform with-replacement sampling empirically. To leverage the benefits of RR in reinforcement learning, we propose sampling methods that extend RR to experience replay, both in uniform and prioritized settings, and analyze their properties via theoretical analysis and simulations. We evaluate our sampling methods on Atari benchmarks, demonstrating their effectiveness in deep reinforcement learning. Code is available at https://github.com/pfnet-research/errr.
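A minimal sketch of what random reshuffling looks like over a (fixed-size) replay buffer, as opposed to with-replacement sampling; growing buffers and the prioritized variant described in the abstract are out of scope here.

```python
# Sketch: shuffle buffer indices once per "epoch" and consume sequentially,
# so each transition appears exactly once per pass.
import numpy as np

class ReshufflingSampler:
    def __init__(self, buffer_size, rng=None):
        self.rng = rng or np.random.default_rng()
        self.order = self.rng.permutation(buffer_size)
        self.pos = 0

    def sample(self, batch_size):
        if self.pos + batch_size > len(self.order):
            self.order = self.rng.permutation(len(self.order))  # new epoch
            self.pos = 0
        batch = self.order[self.pos:self.pos + batch_size]
        self.pos += batch_size
        return batch

sampler = ReshufflingSampler(buffer_size=1000)
print(sampler.sample(8))
```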
Prioritized experience replay is a reinforcement learning technique whereby agents speed up learning by replaying useful past experiences. This usefulness is quantified as the expected gain from replaying the experience, a quantity often approximated as the prediction error (TD-error). However, recent work in neuroscience suggests that, in biological organisms, replay is prioritized not only by gain, but also by "need" -- a quantity measuring the expected relevance of each experience with respect to the current situation. Importantly, this term is not currently considered in algorithms such as prioritized experience replay. In this paper we present a new approach for prioritizing experiences for replay that considers both gain and need. Our proposed algorithms show a significant increase in performance in benchmarks including the Dyna-Q maze and a selection of Atari games.
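A hedged tabular sketch of the gain-times-need idea: gain approximated by the absolute TD error and need by a successor-representation row for the agent's current state. The tabular SR, its TD update, and the product combination are assumptions, not the paper's exact algorithm.

```python
# Sketch: prioritize experiences by gain (|TD error|) * need (SR occupancy).
import numpy as np

def update_sr(M, s, s_next, alpha=0.1, gamma=0.95):
    # TD update of the successor representation row M[s, :].
    onehot = np.zeros(M.shape[1]); onehot[s] = 1.0
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    return M

def priority(td_errors, exp_states, M, current_state):
    gain = np.abs(td_errors)
    need = M[current_state, exp_states]   # expected future occupancy
    return gain * np.maximum(need, 1e-6)

n_states = 6
M = np.eye(n_states)
M = update_sr(M, s=0, s_next=1)
td = np.array([0.5, -1.0, 0.2]); exp_s = np.array([1, 2, 3])
print(priority(td, exp_s, M, current_state=0))
```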
Prioritized experience replay, which improves sample efficiency by selecting relevant transitions to update parameter estimates, is a crucial component of contemporary value-based deep reinforcement learning models. Typically, transitions are prioritized based on their temporal difference error. However, this approach is prone to favoring noisy transitions, even when the value estimation closely approximates the target mean. This phenomenon resembles the noisy TV problem postulated in the exploration literature, in which exploration-guided agents get stuck by mistaking noise for novelty. To mitigate the disruptive effects of noise in value estimation, we propose using epistemic uncertainty estimation to guide the prioritization of transitions from the replay buffer. Epistemic uncertainty quantifies the uncertainty that can be reduced by learning, hence reducing the sampling of transitions generated by unpredictable random processes. We first illustrate the benefits of epistemic uncertainty prioritized replay in two tabular toy models: a simple multi-arm bandit task, and a noisy gridworld. Subsequently, we evaluate our prioritization scheme on the Atari suite, outperforming quantile regression deep Q-learning benchmarks; thus forging a path for the use of uncertainty prioritized replay in reinforcement learning agents.
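One simple way to instantiate epistemic-uncertainty prioritization (an assumption for illustration, not necessarily the paper's estimator) is to use the disagreement of a Q-ensemble as the priority:

```python
# Sketch: priority = standard deviation of ensemble Q-value predictions.
import numpy as np

def epistemic_priorities(ensemble_q):
    """ensemble_q: array of shape (n_members, n_transitions)."""
    return ensemble_q.std(axis=0) + 1e-6

def sample_by_priority(prio, batch_size, rng):
    p = prio / prio.sum()
    return rng.choice(len(prio), size=batch_size, p=p)

rng = np.random.default_rng(0)
ensemble_q = rng.normal(size=(5, 1000))       # 5 ensemble members
prio = epistemic_priorities(ensemble_q)
print(sample_by_priority(prio, 32, rng)[:8])
```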
Deep reinforcement learning (DRL) has significantly advanced the field of combinatorial optimization (CO). However, its practicality is hindered by the necessity for a large number of reward evaluations, especially in scenarios involving computationally intensive function assessments. To enhance the sample efficiency, we propose a simple but effective method, called symmetric replay training (SRT), which can be easily integrated into various DRL methods. Our method leverages high-reward samples to encourage exploration of under-explored symmetric regions without additional online interactions. Through replay training, the policy is trained to maximize the likelihood of the symmetric trajectories of discovered high-reward samples. Experimental results demonstrate the consistent improvement of our method in sample efficiency across diverse DRL methods applied to real-world tasks, such as molecular optimization and hardware design.
A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to flexibly upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings, in both proprioceptive and pixel-based environments. In offline settings, we observe drastic improvements when upsampling small offline datasets and see that additional synthetic data also allows us to effectively train larger networks. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a significant increase in sample efficiency, without any algorithmic changes. We believe that synthetic training data could open the door to realizing the full potential of deep learning for replay-based RL algorithms from limited data. Finally, we open-source our code at https://github.com/conglu1997/SynthER.
This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL($λ$), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with the Fuzzified Bellman Equation (FBE) for continuous control. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves the convergence of the proposed method under standard assumptions. On the Cart-Pole benchmark, Enhanced-FQL($λ$) improves sample efficiency and reduces variance relative to $n$-step fuzzy TD and fuzzy SARSA($λ$), while remaining competitive with the tested DDPG baseline. These results support the proposed framework as an interpretable and computationally compact alternative for moderate-scale continuous control problems.
While Experience Replay, the practice of storing rollouts and reusing them multiple times during training, is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading, and in some cases even improving, final model performance, while preserving policy entropy.
Adaptive brain stimulation can treat neurological conditions such as Parkinson's disease and post-stroke motor deficits by influencing abnormal neural activity. Because of patient heterogeneity, each patient requires a unique stimulation policy to achieve optimal neural responses. Model-free reinforcement learning (MFRL) holds promise in learning effective policies for a variety of similar control tasks, but is limited in domains like brain stimulation by a need for numerous costly environment interactions. In this work we introduce Coprocessor Actor Critic, a novel, model-based reinforcement learning (MBRL) approach for learning neural coprocessor policies for brain stimulation. Our key insight is that coprocessor policy learning is a combination of learning how to act optimally in the world and learning how to induce optimal actions in the world through stimulation of an injured brain. We show that our approach overcomes the limitations of traditional MFRL methods in terms of sample efficiency and task success and outperforms baseline MBRL approaches in a neurologically realistic model of an injured brain.
Existing offline reinforcement learning (RL) algorithms typically assume that training data is either: 1) generated by a known policy, or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset, but make no assumptions about the underlying policies of the demonstrators. This is the most natural setting when collecting data from multiple human operators, yet remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and rewards models. Our results show that models trained with REx exhibit improved domain generalization performance when compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and potentially enable increased exploration.
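For reference, the V-REx penalty from Krueger et al. that DIMORL applies to model learning has a compact form: average the per-domain (per-demonstrator) risks and add a coefficient times their variance, which pushes the model toward risks that are invariant across demonstrators. A minimal sketch follows; the coefficient `beta` and the toy losses are placeholders.

```python
# Sketch of the V-REx objective: mean(risks) + beta * var(risks).
import torch

def rex_loss(per_domain_losses, beta=10.0):
    """per_domain_losses: iterable with one scalar risk per domain."""
    risks = torch.stack(list(per_domain_losses))
    return risks.mean() + beta * risks.var(unbiased=False)

# Hypothetical per-demonstrator model-fitting losses.
losses = [torch.tensor(0.8), torch.tensor(1.2), torch.tensor(0.5)]
print(rex_loss(losses))
```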
In model-based reinforcement learning, simulated experiences from the learned model are often treated as equivalent to experience from the real environment. However, when the model is inaccurate, it can catastrophically interfere with policy learning. Alternatively, the agent might learn about the model's accuracy and selectively use it only when it can provide reliable predictions. We empirically explore model uncertainty measures for selective planning and show that best results require distribution-insensitive inference to estimate the uncertainty over model-based updates. To that end, we propose and evaluate bounding-box inference, which operates on bounding-boxes around sets of possible states and other quantities. We find that bounding-box inference can reliably support effective selective planning.
We draw on the latest advancements in the physics community to propose a novel method for discovering the governing non-linear dynamics of physical systems in reinforcement learning (RL). We establish that this method is capable of discovering the underlying dynamics using significantly fewer trajectories (as little as one rollout with $\leq 30$ time steps) than state-of-the-art model learning algorithms. Further, the technique learns a model that is accurate enough to induce near-optimal policies given significantly fewer trajectories than those required by model-free algorithms. It brings the benefits of model-based RL without requiring a model to be developed in advance, for systems that have physics-based dynamics. To establish the validity and applicability of this algorithm, we conduct experiments on four classic control tasks. We found that an optimal policy trained on the discovered dynamics of the underlying system can generalize well. Further, the learned policy performs well when deployed on the actual physical system, thus bridging the model-to-real-system gap. We further compare our method to state-of-the-art model-based and model-free approaches, and show that our method requires fewer trajectories sampled on the true physical system compared to other methods. Additionally, we explored approximate dynamics models and found that they can also perform well.
Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are limited by being model-based or lack worst-case theoretical guarantees beyond linear MDPs. This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees. We introduce novel proof techniques to show that under suitable conditions, the worst-case regret of our posterior sampling method matches the best known results of optimization-based methods. In the linear MDP setting, the regret of our algorithm scales linearly with the dimension, compared to the quadratic dependence of existing posterior sampling-based exploration algorithms.
The remarkable empirical performance of distributional reinforcement learning (RL) has garnered increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of return distribution beyond only its expectation, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify the significance of this uncertainty-aware regularization from distributional RL on the empirical benefits over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.
Curricula for goal-conditioned reinforcement learning agents typically rely on poor estimates of the agent's epistemic uncertainty or fail to consider the agents' epistemic uncertainty altogether, resulting in poor sample efficiency. We propose a novel algorithm, Query The Agent (QTA), which significantly improves sample efficiency by estimating the agent's epistemic uncertainty throughout the state space and setting goals in highly uncertain areas. Encouraging the agent to collect data in highly uncertain states allows the agent to improve its estimation of the value function rapidly. QTA utilizes a novel technique for estimating epistemic uncertainty, Predictive Uncertainty Networks (PUN), to allow QTA to assess the agent's uncertainty in all previously observed states. We demonstrate that QTA offers decisive sample efficiency improvements over preexisting methods.
In model-free deep reinforcement learning (RL) algorithms, using noisy value estimates to supervise policy evaluation and optimization is detrimental to the sample efficiency. As this noise is heteroscedastic, its effects can be mitigated using uncertainty-based weights in the optimization process. Previous methods rely on sampled ensembles, which do not capture all aspects of uncertainty. We provide a systematic analysis of the sources of uncertainty in the noisy supervision that occurs in RL, and introduce inverse-variance RL, a Bayesian framework which combines probabilistic ensembles and Batch Inverse Variance weighting. We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision. Our results show significant improvement in terms of sample efficiency on discrete and continuous control tasks.
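The core weighting idea can be sketched in a few lines: scale each sample's squared TD error by the reciprocal of its estimated target variance, so noisier supervision contributes less to the update. This is an assumed minimal form for illustration, not the paper's full two-estimator framework.

```python
# Sketch: inverse-variance weighting of noisy TD targets.
import torch

def inverse_variance_loss(q_pred, q_target, target_var, eps=1e-3):
    w = 1.0 / (target_var + eps)
    w = w / w.sum()                      # normalize weights over the batch
    return (w * (q_pred - q_target) ** 2).sum()

q_pred = torch.randn(8); q_target = torch.randn(8)
target_var = torch.rand(8)               # e.g., from a probabilistic ensemble
print(inverse_variance_loss(q_pred, q_target, target_var))
```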
Despite the close connection between exploration and sample efficiency, most state-of-the-art reinforcement learning algorithms include no considerations for exploration beyond maximizing the entropy of the policy. In this work we address this apparent missed opportunity. We observe that the most common formulation of directed exploration in deep RL, known as bonus-based exploration (BBE), suffers from bias and slow coverage in the few-sample regime. This causes BBE to be actively detrimental to policy learning in many control tasks. We show that by decoupling the task policy from the exploration policy, directed exploration can be highly effective for sample-efficient continuous control. Our method, Decoupled Exploration and Exploitation Policies (DEEP), can be combined with any off-policy RL algorithm without modification. When used in conjunction with soft actor-critic, DEEP incurs no performance penalty in densely-rewarding environments. On sparse environments, DEEP gives a several-fold improvement in data efficiency due to better exploration.
Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to lack of principled exploration mechanism within them. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based $Q$ updates, parallel tempering for exploring multiple modes of the posterior of the $Q$ function, and diffusion synthesized state-action samples regularized with $Q$ action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of an LMC based Thompson sampling in continuous control tasks with continuous action spaces.
Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement. Project website: https://ericjin2002.github.io/SOE
Model-based deep reinforcement learning has achieved success in various domains that require high sample efficiencies, such as Go and robotics. However, there are some remaining issues, such as planning efficient exploration to learn more accurate dynamic models, evaluating the uncertainty of the learned models, and more rational utilization of models. To mitigate these issues, we present MEEE, a model-ensemble method that consists of optimistic exploration and weighted exploitation. During exploration, unlike prior methods directly selecting the optimal action that maximizes the expected accumulative return, our agent first generates a set of action candidates and then seeks out the optimal action that takes both expected return and future observation novelty into account. During exploitation, different discounted weights are assigned to imagined transition tuples according to their model uncertainty respectively, which will prevent model predictive error propagation in agent training. Experiments on several challenging continuous control benchmark tasks demonstrate that our approach outperforms other model-free and model-based state-of-the-art methods, especially in sample complexity.
Amazon and other e-commerce sites must employ mechanisms to protect their millions of customers from fraud, such as unauthorized use of credit cards. One such mechanism is order fraud evaluation, where systems evaluate orders for fraud risk, and either "pass" the order, or take an action to mitigate high risk. Order fraud evaluation systems typically use binary classification models that distinguish fraudulent and legitimate orders, to assess risk and take action. We seek to devise a system that considers both financial losses of fraud and long-term customer satisfaction, which may be impaired when incorrect actions are applied to legitimate customers. We propose that taking actions to optimize long-term impact can be formulated as a Reinforcement Learning (RL) problem. Standard RL methods require online interaction with an environment to learn, but this is not desirable in high-stakes applications like order fraud evaluation. Offline RL algorithms learn from logged data collected from the environment, without the need for online interaction, making them suitable for our use case. We show that offline RL methods outperform traditional binary classification solutions in SimStore, a simplified e-commerce simulation that incorporates order fraud risk. We also propose a novel approach to training offline RL policies that adds a new loss term during training, to better align policy exploration with taking correct actions.
Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode. We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation. We analyze our proposed method and show that it performs well in simple continuous control tasks and challenging, high-dimensional locomotion tasks. We show that our method often achieves the oracle performance and performs better than baselines.
The ability to discover optimal behaviour from fixed data sets has the potential to transfer the successes of reinforcement learning (RL) to domains where data collection is acutely problematic. In this offline setting, a key challenge is overcoming overestimation bias for actions not present in the data, which, without the ability to correct via interaction with the environment, can propagate and compound during training, leading to highly sub-optimal policies. One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC), which encourages agents to pick actions closer to the source data. By finding the right balance between RL and BC, such approaches have been shown to be surprisingly effective while requiring minimal changes to the underlying algorithms they are based on. To date, this balance has been held constant, but in this work we explore the idea of tipping this balance towards RL following initial training. Using TD3-BC, we demonstrate that by continuing to train a policy offline while reducing the influence of the BC component we can produce refined policies that outperform the original baseline, as well as match or exceed the performance of more complex alternatives. Furthermore, we demonstrate such an approach can be used for stable online fine-tuning, allowing policies to be safely improved during deployment.
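One simple way to realize the balance shift the abstract describes is a TD3-BC-style actor loss whose behavioural-cloning weight is annealed after initial offline training. This is a sketch under assumptions (the linear decay schedule, the Q-scale normalization, and the placeholder networks), not the authors' exact algorithm.

```python
# Sketch: TD3-BC-style actor objective with a decaying BC weight.
import torch

def td3bc_actor_loss(actor, critic, states, actions, bc_weight):
    pi = actor(states)
    q = critic(states, pi)
    lam = 1.0 / q.abs().mean().detach()            # normalize the Q scale
    bc = ((pi - actions) ** 2).mean()              # behavioural cloning term
    return -(lam * q.mean()) + bc_weight * bc

def decayed_bc_weight(w0, step, decay_steps):
    # Linearly reduce BC influence after the initial offline training phase.
    return w0 * max(0.0, 1.0 - step / decay_steps)

# Placeholder actor/critic for demonstration only.
actor = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
critic = lambda s, a: torch.cat([s, a], -1).sum(-1, keepdim=True)
s = torch.randn(16, 4); a = torch.randn(16, 2)
print(td3bc_actor_loss(actor, critic, s, a,
                       decayed_bc_weight(1.0, 1000, 10_000)))
```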
The goal in offline data-driven decision-making is to synthesize decisions that optimize a black-box utility function, using a previously-collected static dataset, with no active interaction. These problems appear in many forms: offline reinforcement learning (RL), where we must produce actions that optimize the long-term reward, bandits from logged data, where the goal is to determine the correct arm, and offline model-based optimization (MBO) problems, where we must find the optimal design provided access to only a static dataset. A key challenge in all these settings is distributional shift: when we optimize with respect to the input into a model trained from offline data, it is easy to produce an out-of-distribution (OOD) input that appears erroneously good. In contrast to prior approaches that utilize pessimism or conservatism to tackle this problem, in this paper, we formulate offline data-driven decision-making as domain adaptation, where the goal is to make accurate predictions for the value of optimized decisions ("target domain"), when training only on the dataset ("source domain"). This perspective leads to invariant objective models (IOM), our approach for addressing distributional shift by enforcing invariance between the learned representations of the training dataset and optimized decisions. In IOM, if the optimized decisions are too different from the training dataset, the representation will be forced to lose much of the information that distinguishes good designs from bad ones, making all choices seem mediocre. Critically, when the optimizer is aware of this representational tradeoff, it should choose not to stray too far from the training distribution, leading to a natural trade-off between distributional shift and learning performance.
Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing multi-modal action distributions within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for fine-tuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.
In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find an optimal policy purely from an offline dataset that can perform well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($P^2MPO$), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that $P^2MPO$ is sample-efficient with robust partial coverage data, which only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nominal model. By tailoring specific model estimation subroutines for concrete examples of RMDPs, including tabular RMDPs, factored RMDPs, kernel and neural RMDPs, we prove that $P^2MPO$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the dataset size. We highlight that all these examples, except tabular RMDPs, are first identified and proven tractable by this work. Furthermore, we continue our study of robust offline RL in the robust Markov games (RMGs). By extending the double pessimism principle identified for single-agent RMDPs, we propose another algorithm framework that can efficiently find the robust Nash equilibria among players using only robust unilateral (partial) coverage data. To the best of our knowledge, this work proposes the first general learning principle -- double pessimism -- for robust offline RL and shows that it is provably efficient with general function approximation.
Q-shaping is an extension of Q-value initialization and serves as an alternative to reward shaping for incorporating domain knowledge to accelerate agent training, thereby improving sample efficiency by directly shaping Q-values. This approach is both general and robust across diverse tasks, allowing for immediate impact assessment while guaranteeing optimality. We evaluated Q-shaping across 20 different environments using a large language model (LLM) as the heuristic provider. The results demonstrate that Q-shaping significantly enhances sample efficiency, achieving a 16.87% improvement over the best baseline in each environment and a 253.80% improvement compared to LLM-based reward shaping methods. These findings establish Q-shaping as a superior and unbiased alternative to conventional reward shaping in reinforcement learning.
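A toy sketch of the core idea, shaping initial Q-values rather than rewards; the tabular setting and the heuristic below are hypothetical illustrations, not the paper's LLM-provided heuristic.

```python
# Sketch: inject domain knowledge by adding a shaping term directly to
# the initial Q-table instead of modifying the reward function.
import numpy as np

def shape_q_table(n_states, n_actions, heuristic):
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] += heuristic(s, a)   # direct Q-value shaping
    return Q

# Hypothetical heuristic: prefer action 0 in even-numbered states.
heuristic = lambda s, a: 1.0 if (s % 2 == 0 and a == 0) else 0.0
print(shape_q_table(4, 2, heuristic))
```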
Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and is guaranteed to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
Reinforcement learning (RL) agents often face challenges in balancing exploration and exploitation, particularly in environments where sparse or dense rewards bias learning. Biological systems, such as human toddlers, naturally navigate this balance by transitioning from free exploration with sparse rewards to goal-directed behavior guided by increasingly dense rewards. Inspired by this natural progression, we investigate the Toddler-Inspired Reward Transition in goal-oriented RL tasks. Our study focuses on transitioning from sparse to potential-based dense (S2D) rewards while preserving optimal strategies. Through experiments on dynamic robotic arm manipulation and egocentric 3D navigation tasks, we demonstrate that effective S2D reward transitions significantly enhance learning performance and sample efficiency. Additionally, using a Cross-Density Visualizer, we show that S2D transitions smooth the policy loss landscape, resulting in wider minima that improve generalization in RL models. Finally, we reinterpret Tolman's maze experiments, underscoring the critical role of early free exploratory learning in the context of S2D rewards.
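A minimal sketch of an S2D reward schedule, assuming a potential-based shaping term of the form $F = \gamma\Phi(s') - \Phi(s)$ (which is known to preserve optimal policies) and a hypothetical `switch_step` at which the sparse-to-dense transition happens; both the potential function and the switch point are illustrative choices:

```python
import numpy as np

gamma = 0.99
switch_step = 50_000  # hypothetical training step for the S2D transition

def potential(state, goal):
    # Negative distance to the goal as the potential function (an assumption;
    # any potential Phi preserves optimal policies under this shaping form).
    return -np.linalg.norm(np.asarray(state) - np.asarray(goal))

def s2d_reward(state, next_state, goal, reached, step):
    sparse = 1.0 if reached else 0.0
    if step < switch_step:
        return sparse  # phase 1: free exploration under the sparse reward
    # phase 2: add potential-based dense shaping, F = gamma*Phi(s') - Phi(s),
    # which leaves the set of optimal policies unchanged (Ng et al., 1999)
    shaping = gamma * potential(next_state, goal) - potential(state, goal)
    return sparse + shaping
```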
Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As humans, our job in the learning process is to design reward functions to express desired behavior and enable the agent to learn such behavior swiftly. However, designing good reward functions to induce the desired behavior is generally hard, let alone the question of which rewards make learning fast. In this work, we introduce a family of reward structures we call Tiered Reward that addresses both of these questions. We consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space to resolve trade-offs in behavior preference. We prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Next, we introduce Tiered Reward, a class of environment-independent reward functions, and show that it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we demonstrate that Tiered Reward leads to fast learning with multiple tabular and deep reinforcement-learning algorithms.
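The construction below is one plausible instantiation of a tiered reward, not necessarily the paper's exact formula: tiers are ordered from worst (bad states) to best (goal), the goal tier earns zero, and each lower tier is exponentially more costly per step, so policies that climb tiers faster and more reliably are preferred:

```python
# Illustrative environment-independent tiered reward: the exponential
# separation between tiers is one way to make a single step in a higher
# tier always preferable to any amount of time spent in lower tiers.
gamma = 0.99
num_tiers = 4  # tier 0 = bad states, tier num_tiers-1 = goal

def tiered_reward(tier: int) -> float:
    if tier == num_tiers - 1:
        return 0.0  # goal tier: absorbing, zero reward
    # each step in a lower tier costs a factor 1/(1-gamma) more than the
    # tier above it, dominating any achievable discounted return below
    return -((1.0 / (1.0 - gamma)) ** (num_tiers - 2 - tier))

for t in range(num_tiers):
    print(t, tiered_reward(t))  # -10000.0, -100.0, -1.0, 0.0 for gamma=0.99
```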
We study a class of reinforcement learning problems where the reward signals for policy learning are generated by an internal reward model that is dependent on and jointly optimized with the policy. This interdependence between the policy and the reward model leads to an unstable learning process, because reward signals from an immature reward model are noisy and impede policy learning, and conversely, an under-optimized policy impedes reward model learning. We call this learning setting $\textit{Internally Rewarded Reinforcement Learning}$ (IRRL), as the reward is not provided directly by the environment but $\textit{internally}$ by a reward model. In this paper, we formally formulate IRRL and present a class of problems that belong to it. We theoretically derive and empirically analyze the effect of the reward function in IRRL and, based on these analyses, propose the clipped linear reward function. Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
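A minimal sketch of a clipped linear reward function as described above; the scale, bias, and clipping bounds are illustrative hyperparameters, not values from the paper:

```python
import torch

def clipped_linear_reward(logit, lo=-1.0, hi=1.0, scale=1.0, bias=0.0):
    # A linear function of the internal reward model's output, clipped to a
    # bounded range so that noise from an immature reward model cannot
    # produce arbitrarily large, destabilizing reward signals.
    return torch.clamp(scale * logit + bias, min=lo, max=hi)
```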
Many real-world robot learning problems, such as pick-and-place or arriving at a destination, can be seen as problems of reaching a goal state as soon as possible. When formulated as episodic reinforcement learning tasks, these problems can easily be specified to align well with the intended goal: a reward of -1 every time step, with termination upon reaching the goal state. Such formulations are called minimum-time tasks. Despite this simplicity, they are often overlooked in favor of dense rewards due to their perceived difficulty and lack of informativeness. Our studies contrast the two reward paradigms, revealing that the minimum-time task specification not only facilitates learning higher-quality policies but can also surpass dense-reward-based policies on their own performance metrics. Crucially, we also identify the goal-hit rate of the initial policy as a robust early indicator for learning success in such sparse feedback settings. Finally, using four distinct real robotic platforms, we show that it is possible to learn pixel-based policies from scratch within two to three hours using constant negative rewards.
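The minimum-time specification is simple enough to express as a small environment wrapper. The sketch below assumes the Gymnasium API and a user-supplied `is_goal` predicate, since goal detection is task-specific:

```python
import gymnasium as gym

class MinimumTimeWrapper(gym.Wrapper):
    """Replace an environment's reward with the minimum-time specification:
    -1 every step, with the episode ending when `is_goal` says the goal is
    reached. `is_goal` is a hypothetical user-supplied predicate."""

    def __init__(self, env, is_goal):
        super().__init__(env)
        self.is_goal = is_goal

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reached = self.is_goal(obs)
        # -1 per time step and termination (not truncation) on a goal hit,
        # so a policy's return equals minus its number of steps to the goal
        return obs, -1.0, terminated or reached, truncated, info
```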
How do we formalize the challenge of credit assignment in reinforcement learning? Common intuition would draw attention to reward sparsity as a key contributor to difficult credit assignment and traditional heuristics would look to temporal recency for the solution, calling upon the classic eligibility trace. We posit that it is not the sparsity of the reward itself that causes difficulty in credit assignment, but rather the \emph{information sparsity}. We propose to use information theory to define this notion, which we then use to characterize when credit assignment is an obstacle to efficient learning. With this perspective, we outline several information-theoretic mechanisms for measuring credit under a fixed behavior policy, highlighting the potential of information theory as a key tool towards provably-efficient credit assignment.
Deep learning (DL) and reinforcement learning (RL) appear to be indispensable for achieving human-level or super-human AI systems. At the same time, both DL and RL have strong connections with brain functions and with neuroscientific findings. In this review, we summarize talks and discussions in the "Deep Learning and Reinforcement Learning" session of the International Symposium on Artificial Intelligence and Brain Science. In this session, we discussed whether we can achieve a comprehensive understanding of human intelligence based on recent advances in deep learning and reinforcement learning algorithms. Speakers presented their recent studies on technologies that may be key to achieving human-level intelligence.
Optimal control problems are one of the most challenging problems in optimization. This paper presents a new and efficient Reinforcement Learning approach to optimal control …
In this article, we study the application of unmanned aerial vehicles (UAVs) for data collection with wireless charging, which is crucial for providing seamless coverage and improving system performance in next-generation wireless networks. To this end, we propose a reinforcement learning-based approach to plan the route of a UAV that collects sensor data from devices scattered in the physical environment. Specifically, the environment is divided into multiple grids, each with a hovering spot at its center that also houses a wireless charger; when the UAV runs low on energy, it can recharge at the spot while hovering. By taking into account the amount of collected data as well as the energy consumption, we formulate the problem of data collection with a UAV as a Markov decision problem and exploit $Q$-learning to find the optimal policy. In particular, we design the reward function considering the energy efficiency of UAV flight and data collection, based on which the $Q$-table is updated to guide the route of the UAV. Through extensive simulation results, we verify that our proposed reward function achieves better performance in terms of average throughput, data-collection delay, and the energy efficiency of the UAV, in comparison with a conventional capacity-based reward function.
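A compact sketch of the tabular $Q$-learning loop described above with an energy-aware reward; the linear data-versus-energy weighting and the `step_env` interface are illustrative assumptions:

```python
import numpy as np

n_grids, n_actions = 25, 5   # 5x5 grid; actions: stay/N/S/E/W (illustrative)
alpha, gamma, eps = 0.1, 0.95, 0.1
Q = np.zeros((n_grids, n_actions))
rng = np.random.default_rng(0)

def energy_aware_reward(data_collected, energy_used):
    # Trades collected data against flight/hovering energy; the linear
    # weighting of 0.5 is an illustrative choice, not the paper's.
    return data_collected - 0.5 * energy_used

def q_learning_step(s, step_env):
    # epsilon-greedy action selection over grid-movement actions
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, data, energy, done = step_env(s, a)  # hypothetical env interface
    r = energy_aware_reward(data, energy)
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
    return s_next, done
```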
Deep learning has provided new ways of manipulating, processing and analyzing data. It may sometimes achieve results comparable to, or surpassing, human expert performance, and has become a source of inspiration in the era of artificial intelligence. Another subfield of machine learning, reinforcement learning, tries to find an optimal behavior strategy through interactions with the environment. Combining deep learning and reinforcement learning makes it possible to resolve critical issues of data dimensionality and scalability in tasks with sparse reward signals, such as robotic manipulation and control, that neither method can resolve on its own. In this paper, we present recent significant progress in deep reinforcement learning algorithms that tackle the problems of applying RL to robotic manipulation control, such as sample efficiency and generalization. Despite these continuous improvements, the challenges of learning robust and versatile manipulation skills for robots with deep reinforcement learning are still far from being resolved for real-world applications.
To make effective decisions, people need to consider the relationship between actions and outcomes, which are often separated by time and space. The neural mechanisms by which disjoint actions and outcomes are linked remain unknown. One promising hypothesis involves neural replay of nonlocal experience. Using a task that segregates direct from indirect value learning, combined with magnetoencephalography, we examined the role of neural replay in human nonlocal learning. After receipt of a reward, we found significant backward replay of nonlocal experience, with a 160-millisecond state-to-state time lag, which was linked to efficient learning of action values; this backward replay encoded nonlocal experience alone and was physiologically distinct from a faster forward replay (30-ms time lag) associated with a power increase in the ripple band. Backward replay and behavioral evidence of nonlocal learning were more pronounced for experiences of greater benefit for future behavior, consistent with replay being prioritized according to utility (need × gain). These findings support nonlocal replay as a neural mechanism for solving complex credit assignment problems, such as value learning, during model-based reinforcement learning.
… experiences leads to fast convergence if the experiences … yet efficient technique to find suitable samples of experiences to train … It filters the samples of experiences that can benefit more …
… through experience replay. In the acting stage, the most effective experience obtained … stage, the stored experience will be replayed to customized times, which helps enhance the …
Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is a key challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two key sections, we also discuss the potential benefits of model-based RL, like enhanced data efficiency, targeted exploration, and improved stability. Along the survey, we also draw connections to several related RL fields, like hierarchical RL and transfer, and other research disciplines, like behavioural psychology. Altogether, the survey presents a broad conceptual overview of planning-learning combinations for MDP optimization.
… through a data-driven learning process. A model-based reinforcement learning algorithm is … the driving data and are used for reinforcement learning. The proposed algorithm is tested …
… reduces uncertainty on some (discrete) property of interest. We present an efficient method … We start again with a binary case; now, uncertainty is modeled by a binary random function, …
Various strategies for active learning have been proposed in the machine learning literature. In uncertainty sampling, which is among the most popular approaches, the active learner sequentially queries the label of those instances for which its current prediction is maximally uncertain. The predictions as well as the measures used to quantify the degree of uncertainty, such as entropy, are traditionally of a probabilistic nature. Yet, alternative approaches to capturing uncertainty in machine learning, along with corresponding uncertainty measures, have been proposed in recent years. In particular, some of these measures seek to distinguish different sources and to separate different types of uncertainty, such as the reducible (epistemic) and the irreducible (aleatoric) part of the total uncertainty in a prediction. The goal of this paper is to elaborate on the usefulness of such measures for uncertainty sampling, and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.
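One common way to instantiate such measures is with an ensemble, where the total predictive entropy splits into an aleatoric part (the mean member entropy) and an epistemic part (their difference, the BALD mutual information). The sketch below is one such instantiation, not the paper's specific setup:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def uncertainty_scores(probs):
    """probs: array of shape (n_members, n_samples, n_classes) holding the
    predictive distributions of an ensemble (one common approximation of
    the epistemic/aleatoric split; other measures exist)."""
    mean_p = probs.mean(axis=0)
    total = entropy(mean_p)                  # entropy of the mean prediction
    aleatoric = entropy(probs).mean(axis=0)  # mean of the member entropies
    epistemic = total - aleatoric            # mutual information (BALD)
    return total, epistemic, aleatoric

def query_indices(probs, k, measure="epistemic"):
    # uncertainty sampling: pick the k instances the chosen measure ranks
    # as most uncertain and query their labels next
    total, epi, ale = uncertainty_scores(probs)
    scores = {"total": total, "epistemic": epi, "aleatoric": ale}[measure]
    return np.argsort(-scores)[:k]
```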
Machine learning (ML) models, if trained to data sets of high-fidelity quantum simulations, produce accurate and efficient interatomic potentials. Active learning (AL) is a powerful tool to iteratively generate diverse data sets. In this approach, the ML model provides an uncertainty estimate along with its prediction for each new atomic configuration. If the uncertainty estimate passes a certain threshold, then the configuration is included in the data set. Here we develop a strategy to more rapidly discover configurations that meaningfully augment the training data set. The approach, uncertainty-driven dynamics for active learning (UDD-AL), modifies the potential energy surface used in molecular dynamics simulations to favor regions of configuration space for which there is large model uncertainty. The performance of UDD-AL is demonstrated for two AL tasks: sampling the conformational space of glycine and sampling the promotion of proton transfer in acetylacetone. The method is shown to efficiently explore the chemically relevant configuration space, which may be inaccessible using regular dynamical sampling at target temperature conditions. A biasing energy derived from the uncertainty of a neural network ensemble modifies the potential energy surface in molecular dynamics simulations to rapidly discover under-represented structural regions that meaningfully augment the training data set.
While neural networks achieve state-of-the-art performance for many molecular modeling and structure–property prediction tasks, these models can struggle with generalization to out-of-domain examples, exhibit poor sample efficiency, and produce uncalibrated predictions. In this paper, we leverage advances in evidential deep learning to demonstrate a new approach to uncertainty quantification for neural network-based molecular structure–property prediction at no additional computational cost. We develop both evidential 2D message passing neural networks and evidential 3D atomistic neural networks and apply these networks across a range of different tasks. We demonstrate that evidential uncertainties enable (1) calibrated predictions where uncertainty correlates with error, (2) sample-efficient training through uncertainty-guided active learning, and (3) improved experimental validation rates in a retrospective virtual screening campaign. Our results suggest that evidential deep learning can provide an efficient means of uncertainty quantification useful for molecular property prediction, discovery, and design tasks in the chemical and physical sciences.
… of dealing with uncertainty. This paper provides a review of the uncertainty treatment practices in design optimization of structural and multidisciplinary systems under uncertainties. To …
Microgrid (MG) is an effective way to integrate renewable energy into the power system at the consumer side. In the MG, an energy management system (EMS) needs to be deployed to realize efficient utilization and stable operation. To help the EMS make optimal schedule decisions, we propose a real-time dynamic optimal energy management (OEM) approach based on a deep reinforcement learning (DRL) algorithm. Traditionally, the OEM problem is solved by mathematical programming (MP) or heuristic algorithms, which may lead to low computation accuracy or efficiency. In the proposed DRL approach, the MG-OEM is formulated as a Markov decision process (MDP) considering environment uncertainties and then solved by the PPO algorithm. PPO is a policy-based DRL algorithm with continuous state and action spaces, and the approach includes two phases: offline training and online operation. In the training process, PPO learns from historical data to capture the uncertainty characteristics of renewable energy generation and load consumption. Finally, a case study demonstrates the performance of the proposed method; its computation efficiency and accuracy are verified by the online dispatch results.
Uncertainty quantification (UQ) includes the characterization, integration, and propagation of uncertainties that result from stochastic variations and a lack of knowledge or data in the natural world. The Monte Carlo (MC) method is a sampling-based approach that has been widely used for the quantification and propagation of uncertainties. However, the standard MC method is often time-consuming if the simulation-based model is computationally intensive. This article gives an overview of modern MC methods that address the existing challenges of standard MC in the context of UQ. Specifically, multilevel Monte Carlo (MLMC), extending the concept of control variates, achieves a significant reduction of the computational cost by performing most evaluations with low accuracy and corresponding low cost, and relatively few evaluations at high accuracy and corresponding high cost. Multifidelity Monte Carlo (MFMC) accelerates the convergence of standard Monte Carlo by generalizing the control variates with different models having varying fidelities and varying computational costs. The multimodel Monte Carlo method (MMMC), having a different setting from MLMC and MFMC, aims to address the issue of UQ and propagation when data for characterizing probability distributions are limited. Multimodel inference combined with importance sampling is proposed for quantifying and efficiently propagating the uncertainties resulting from small data sets. All three of these modern MC methods achieve a significant improvement of computational efficiency for probabilistic UQ, particularly uncertainty propagation. An algorithm summary and the corresponding code implementation are provided for each of the modern MC methods. The extension and application of these methods are discussed in detail.
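A minimal MLMC sketch of the telescoping estimator $\mathbb{E}[P_L] = \mathbb{E}[P_0] + \sum_{l} \mathbb{E}[P_l - P_{l-1}]$, using an illustrative geometric-Brownian-motion model in which the coupled coarse path reuses the fine path's Brownian increments (the coupling is what keeps the level-difference variance, and hence the cost, small); all parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def coupled_level(level, n, T=1.0, s0=1.0, mu=0.05, sigma=0.2, m0=4):
    """One MLMC level: simulate the fine path with m0 * 2**level Euler steps
    and build the coupled coarse path from the same Brownian increments
    (summed in pairs), so Var[P_fine - P_coarse] is small."""
    steps = m0 * 2 ** level
    dt = T / steps
    dw = rng.normal(0.0, np.sqrt(dt), size=(n, steps))
    s_fine = np.full(n, s0)
    for k in range(steps):
        s_fine = s_fine * (1 + mu * dt + sigma * dw[:, k])
    p_fine = np.maximum(s_fine - 1.0, 0.0)   # illustrative payoff
    if level == 0:
        return p_fine, None
    dw_coarse = dw[:, 0::2] + dw[:, 1::2]    # coarse increments: paired sums
    s_coarse = np.full(n, s0)
    for k in range(steps // 2):
        s_coarse = s_coarse * (1 + mu * 2 * dt + sigma * dw_coarse[:, k])
    return p_fine, np.maximum(s_coarse - 1.0, 0.0)

def mlmc(max_level, n_per_level):
    # telescoping sum, with most samples spent on the cheap coarse levels
    est = 0.0
    for level in range(max_level + 1):
        p_f, p_c = coupled_level(level, n_per_level[level])
        est += (p_f if p_c is None else p_f - p_c).mean()
    return est

print(mlmc(3, [20000, 5000, 1200, 300]))
```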
Despite the growing literature on corporate social responsibility (CSR), little is known about how the board of directors' competence can affect the CSR‐financial performance relationship during severe uncertainties such as the COVID‐19 outbreak. This paper focuses on exploring the mediating role of CSR in the connection between board competence and corporate financial performance amidst global uncertainties. The sample consists of Jordanian companies listed on the Amman Stock Exchange. Data were analyzed using the partial least square structural equation modeling. The findings show that boards' CSR competence has a direct and indirect positive impact on financial performance. Therefore, boards of directors' CSR competence can be seen as enablers for CSR activities. In this regard, companies could invest more in qualifying board directors to be socially responsible and enhance their role in improving corporate financial performance. This study identifies and provides empirical evidence on a critical enabler of CSR activities (i.e., boards of directors' CSR competence) from a developing country perspective. This, in turn, could widen the management and other stakeholders' understanding of CSR‐enhancing factors and therefore increase its efficiency. We provide theoretical and practical implications to guide regulators and businesses to ensure sustainable development.
Offline reinforcement learning (RL) aims at learning a good policy from a batch of collected data, without extra interactions with the environment during training. However, current offline RL benchmarks commonly have a large reality gap, because they involve large datasets collected by highly exploratory policies, and the trained policy is directly evaluated in the environment. In real-world situations, running a highly exploratory policy is prohibited to ensure system safety, the data is commonly very limited, and a trained policy should be well validated before deployment. In this paper, we present a near real-world offline RL benchmark, named NeoRL, which contains datasets from various domains with controlled sizes, and extra test datasets for policy validation. We evaluate existing offline RL algorithms on NeoRL and argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, instead of the dataset reward. The empirical results demonstrate that the tested offline RL algorithms become less competitive with the deterministic policy on many datasets, and the offline policy evaluation hardly helps. The NeoRL suite can be found at this http URL. We hope this work will shed some light on future research and draw more attention when deploying RL in real-world systems.
Learning effective strategies in sparse reward tasks is one of the fundamental challenges in reinforcement learning. This becomes extremely difficult in multi-agent environments, as the concurrent learning of multiple agents induces the non-stationarity problem and a sharply increased joint state space. Existing works have attempted to promote multi-agent cooperation through experience sharing. However, learning from a large collection of shared experiences is inefficient, as there are only a few high-value states in sparse reward tasks, which may instead lead to the curse of dimensionality in large-scale multi-agent systems. This paper focuses on sparse-reward multi-agent cooperative tasks and proposes an effective experience-sharing method, Multi-Agent Selective Learning (MASL), to boost sample-efficient training by reusing valuable experiences from other agents. MASL adopts a retrogression-based selection method to identify high-value traces of agents from the team rewards, based on which some recall traces are generated and shared among agents to motivate effective exploration. Moreover, MASL selectively considers information from other agents to cope with the non-stationarity issue while enabling efficient training for large-scale agents. Experimental results show that MASL significantly improves sample efficiency compared with state-of-the-art MARL algorithms in cooperative tasks with sparse rewards.
Reinforcement Learning (RL) seeks to develop systems capable of autonomous decision-making by learning through interaction with their environment. Central to this process are reward engineering and reward shaping, which are essential for enhancing the efficiency and effectiveness of RL algorithms. These techniques guide agents toward desired behaviors, improve learning stability, and accelerate convergence by addressing challenges such as sparse and delayed rewards. However, the complexity of real-world environments and the computational demands of RL algorithms remain significant obstacles to broader adoption. Recent advancements in deep learning have enabled RL to handle high-dimensional state and action spaces, facilitating applications in robotics, autonomous driving, and complex decision-making tasks. In response to these developments, this paper provides one of the first comprehensive reviews of reward design in RL, with a focus on the methodologies and techniques underpinning reward engineering and shaping. By introducing a detailed taxonomy, critically analyzing current approaches, and highlighting their limitations, this work fills an important gap in the literature, offering insights into how reward structures can be optimized to meet the growing demands of modern AI systems.
… In the former case, we are dealing with a sparse reward … assets as the reward at each time-step is also an example for a setup … ) signal of its performance. In this case, the density of the …
While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
In offline reinforcement learning, in-sample learning methods have been widely used to prevent performance degradation caused by evaluating out-of-distribution actions from the dataset. Extreme Q-learning (XQL) employs a loss function based on the assumption that the Bellman error follows a Gumbel distribution, enabling it to model the soft optimal value function in an in-sample manner. It has demonstrated strong performance in both offline and online reinforcement learning settings. However, issues remain, such as the instability caused by the exponential term in the loss function and the risk of the error distribution deviating from the Gumbel distribution. We therefore propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying a Maclaurin expansion to the loss function of XQL improves robustness against large errors. The approach adjusts the modeled value function between the value function under the behavior policy and the soft optimal value function, thus achieving a trade-off between stability and optimality depending on the order of expansion. It also enables adjusting the assumed error distribution between a normal distribution and a Gumbel distribution. Our method significantly stabilizes learning in online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance on several offline RL tasks from D4RL.
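The sketch below contrasts XQL's Gumbel-derived (linex-style) loss with a Maclaurin-expanded variant of the kind described above; the exact form of the paper's loss may differ, so treat this as a schematic:

```python
import math
import torch

def gumbel_regression_loss(err, beta=1.0):
    # XQL-style loss derived from a Gumbel error assumption; the exp term
    # is the source of instability for large errors
    z = err / beta
    return (torch.exp(z) - z - 1.0).mean()

def maclaurin_xql_loss(err, beta=1.0, order=4):
    """Truncated Maclaurin expansion of the exponential in the loss above:
    exp(z) - z - 1 ~= sum_{k=2}^{order} z**k / k!.
    order=2 yields a quadratic (normal-error) loss; raising the order
    interpolates toward the Gumbel loss while keeping gradients polynomial
    in the error, which is the stability/optimality trade-off the abstract
    describes."""
    z = err / beta
    loss = torch.zeros_like(z)
    for k in range(2, order + 1):
        loss = loss + z ** k / math.factorial(k)
    return loss.mean()
```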
AI methods are used in societally important settings, ranging from credit to employment to housing, and it is crucial to provide fairness in regard to algorithmic decision making. Moreover, many settings are dynamic, with populations responding to sequential decision policies. We introduce the study of reinforcement learning (RL) with stepwise fairness constraints, requiring group fairness at each time step. Our focus is on tabular episodic RL, and we provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violation. Our framework provides useful tools to study the impact of fairness constraints in sequential settings and brings up new challenges in RL.
As image-based deep reinforcement learning tackles more challenging tasks, increasing model size has become an important factor in improving performance. Recent studies achieved this by focusing on the parameter efficiency of scaled networks, typically using Impala-CNN, a 15-layer ResNet-inspired network, as the image encoder. However, while Impala-CNN evidently outperforms older CNN architectures, potential advancements in network design for deep reinforcement learning-specific image encoders remain largely unexplored. We find that replacing the flattening of output feature maps in Impala-CNN with global average pooling leads to a notable performance improvement. This approach outperforms larger and more complex models in the Procgen Benchmark, particularly in terms of generalization. We call our proposed encoder model Impoola-CNN. A decrease in the network's translation sensitivity may be central to this improvement, as we observe the most significant gains in games without agent-centered observations. Our results demonstrate that network scaling is not just about increasing model size - efficient network design is also an essential factor. We make our code available at https://github.com/raphajaner/impoola.
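The change described in this abstract is small enough to show directly: replacing the encoder's flattening with global average pooling. The sketch below assumes a generic Impala-style feature map; the dimensions and layer names are illustrative:

```python
import torch
import torch.nn as nn

def make_encoder_head(channels: int, out_dim: int, use_gap: bool):
    """Final stage of an Impala-style CNN encoder. `use_gap=True` follows
    the Impoola-CNN idea: global average pooling over the feature map
    instead of flattening, which removes the spatial indexing of features
    (lower translation sensitivity) and shrinks the linear layer."""
    if use_gap:
        return nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # (B, C, H, W) -> (B, C, 1, 1)
            nn.Flatten(),             # -> (B, C)
            nn.Linear(channels, out_dim),
        )
    # baseline Impala-CNN head: flattening keeps all C*H*W features
    return nn.Sequential(
        nn.Flatten(),                 # -> (B, C*H*W)
        nn.LazyLinear(out_dim),       # infers C*H*W on the first call
    )

x = torch.randn(8, 64, 8, 8)          # example batch of feature maps
print(make_encoder_head(64, 256, use_gap=True)(x).shape)  # (8, 256)
```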
Learning navigation capabilities in different environments has long been one of the major challenges in decision-making. In this work, we focus on zero-shot navigation ability using given abstract $2$-D top-down maps. Like human navigation by reading a paper map, the agent reads the map as an image when navigating in a novel layout, after learning to navigate on a set of training maps. We propose a model-based reinforcement learning approach for this multi-task learning problem, where it jointly learns a hypermodel that takes top-down maps as input and predicts the weights of the transition network. We use the DeepMind Lab environment and customize layouts using generated maps. Our method can adapt better to novel environments in zero-shot and is more robust to noise.
Quantum computing offers exciting opportunities for simulating complex quantum systems and optimizing large scale combinatorial problems, but its practical use is limited by device noise and constrained connectivity. Designing quantum circuits, which are fundamental to quantum algorithms, is therefore a central challenge in current quantum hardware. Existing reinforcement learning based methods for circuit design lose accuracy when restricted to hardware native gates and device level compilation. Here, we introduce gadget reinforcement learning (GRL), which combines learning with program synthesis to automatically construct composite gates that expand the action space while respecting hardware constraints. We show that this approach improves accuracy, hardware compatibility, and scalability for transverse-field Ising and quantum chemistry problems, reaching systems of up to ten qubits within realistic computational budgets. This framework demonstrates how learned, reusable circuit building blocks can guide the co-design of algorithms and hardware for quantum processors.
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various learning objectives, or subtasks, simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.
Reinforcement Learning has achieved tremendous success in the many Atari games. In this paper we explored with the lunar lander environment and implemented classical methods including Q-Learning, SARSA, MC as well as tiling coding. We also implemented Neural Network based methods including DQN, Double DQN, Clipped DQN. On top of these, we proposed a new algorithm called Heuristic RL which utilizes heuristic to guide the early stage training while alleviating the introduced human bias. Our experiments showed promising results for our proposed methods in the lunar lander environment.
In batch reinforcement learning, there can be poorly explored state-action pairs resulting in poorly learned, inaccurate models and poorly performing associated policies. Various regularization methods can mitigate the problem of learning overly complex models in Markov decision processes (MDPs); however, they operate in technically and intuitively distinct ways and lack a common form in which to compare them. This paper unifies three regularization methods in a common framework -- a weighted average transition matrix. Considering regularization methods in this common form illuminates how the MDP structure and the state-action pair distribution of the batch data set influence the relative performance of regularization methods. We confirm intuitions generated from the common framework by empirical evaluation across a range of MDPs and data collection policies.
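The common form named above, a weighted average transition matrix, can be sketched directly; the uniform prior and the scalar trust weight are illustrative choices (the unified methods differ precisely in how such quantities are chosen):

```python
import numpy as np

def regularized_transition(counts, prior, weight):
    """Weighted average between the empirical (MLE) transition matrix and a
    regularizing prior. counts: (S, A, S) visit counts from the batch;
    prior: next-state distribution broadcastable to (S, A, S); weight in
    [0, 1] sets how much the estimate trusts the data over the prior."""
    totals = counts.sum(axis=-1, keepdims=True)
    mle = counts / np.maximum(totals, 1.0)          # zero rows stay all-zero
    prior = np.broadcast_to(prior, mle.shape)
    blended = weight * mle + (1.0 - weight) * prior
    blended = np.where(totals > 0, blended, prior)  # no data -> prior outright
    return blended / blended.sum(axis=-1, keepdims=True)

S, A = 5, 2
counts = np.random.default_rng(1).integers(0, 4, size=(S, A, S)).astype(float)
P = regularized_transition(counts, np.full(S, 1.0 / S), weight=0.8)
assert np.allclose(P.sum(axis=-1), 1.0)  # every row is a valid distribution
```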
Human-to-human conversation is not just talking and listening. It is an incremental process where participants continually establish a common understanding to rule out misunderstandings. Current language understanding methods for intelligent robots do not consider this. There exist numerous approaches considering non-understandings, but they ignore the incremental process of resolving misunderstandings. In this article, we present a first formalization and experimental validation of incremental action-repair for robotic instruction-following based on reinforcement learning. To evaluate our approach, we propose a collection of benchmark environments for action correction in language-conditioned reinforcement learning, utilizing a synthetic instructor to generate language goals and their corresponding corrections. We show that a reinforcement learning agent can successfully learn to understand incremental corrections of misunderstood instructions.
The research field of automated negotiation has a long history of designing agents that can negotiate with other agents. Such negotiation strategies are traditionally based on manual design and heuristics. More recently, reinforcement learning approaches have also been used to train agents to negotiate. However, negotiation problems are diverse, causing observation and action dimensions to change, which cannot be handled by default linear policy networks. Previous work on this topic has circumvented this issue either by fixing the negotiation problem, causing policies to be non-transferable between negotiation problems or by abstracting the observations and actions into fixed-size representations, causing loss of information and expressiveness due to feature design. We developed an end-to-end reinforcement learning method for diverse negotiation problems by representing observations and actions as a graph and applying graph neural networks in the policy. With empirical evaluations, we show that our method is effective and that we can learn to negotiate with other agents on never-before-seen negotiation problems. Our result opens up new opportunities for reinforcement learning in negotiation agents.
Remote sensing vision tasks require extensive labeled data across multiple, interconnected domains. However, current generative data augmentation frameworks are task-isolated, i.e., each vision task requires training an independent generative model, and they ignore the modeling of geographical information and spatial constraints. To address these issues, we propose TerraGen, a unified layout-to-image generation framework that enables flexible, spatially controllable synthesis of remote sensing imagery for various high-level vision tasks, e.g., detection, segmentation, and extraction. Specifically, TerraGen introduces a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with a multi-scale injection scheme and mask-weighted loss to explicitly encode spatial constraints, from global structures to fine details. Also, we construct the first large-scale multi-task remote sensing layout generation dataset containing 45k images and establish a standardized evaluation protocol for this task. Experimental results show that TerraGen achieves the best generation image quality across diverse tasks. Additionally, TerraGen can be used as a universal data-augmentation generator, enhancing downstream task performance significantly and demonstrating robust cross-task generalization in both full-data and few-shot scenarios.
Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e. AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively. The code is available at https://github.com/Sonettoo/AeroGen.
Unmanned aerial vehicles (UAVs) are now beginning to be deployed for enhancing network performance and coverage in wireless communication. However, due to the limitation of their on-board power and flight time, it is challenging to obtain an optimal resource allocation scheme for the UAV-assisted Internet of Things (IoT). In this paper, we design a new UAV-assisted IoT system relying on the shortest flight path of the UAVs while maximizing the amount of data collected from IoT devices. Then, a deep reinforcement learning-based technique is conceived for finding the optimal trajectory and throughput in a specific coverage area. After training, the UAV has the ability to autonomously collect all the data from user nodes with a significant total sum-rate improvement while minimizing the associated resources used. Numerical results are provided to highlight how our techniques strike a balance between the throughput attained, the trajectory, and the time spent. More explicitly, we characterize the attainable performance in terms of the UAV trajectory, the expected reward and the total sum-rate.
Inverse reinforcement learning (IRL) is the task of finding a reward function that generates a desired optimal policy for a given Markov Decision Process (MDP). This paper develops an information-theoretic lower bound for the sample complexity of the finite state, finite action IRL problem. A geometric construction of $\beta$-strict separable IRL problems using spherical codes is considered. Properties of the ensemble size as well as the Kullback-Leibler divergence between the generated trajectories are derived. The resulting ensemble is then used along with Fano's inequality to derive a sample complexity lower bound of $O(n \log n)$, where $n$ is the number of states in the MDP.
System identification, also known as learning forward models, transfer functions, system dynamics, etc., has a long tradition both in science and engineering in different fields. In particular, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping function from the current state and action to the next state. This problem is commonly defined as a Supervised Learning problem in a direct way. This common approach faces several difficulties due to the inherent complexities of the dynamics to learn, for example, delayed effects, high non-linearity, non-stationarity, partial observability and, more importantly, error accumulation when using bootstrapped predictions (predictions based on past predictions) over large time horizons. Here we explore the use of Reinforcement Learning in this problem. We elaborate on why and how this problem fits naturally and soundly as a Reinforcement Learning problem, and present some experimental results that demonstrate RL is a promising technique to solve these kinds of problems.
In an infinitely repeated general-sum pricing game, independent reinforcement learners may exhibit collusive behavior without any communication, raising concerns about algorithmic collusion. To better understand the learning dynamics, we incorporate agents' relative performance (RP) among competitors using experience replay (ER) techniques. Experimental results indicate that RP considerations play a critical role in long-run outcomes. Agents that are averse to underperformance converge to the Bertrand-Nash equilibrium, while those more tolerant of underperformance tend to charge supra-competitive prices. This finding also helps mitigate the overfitting issue in independent Q-learning. Additionally, the impact of relative ER varies with the number of agents and the choice of algorithms.
Continual learning is the process of training machine learning models on a sequence of tasks where data distributions change over time. A well-known obstacle in this setting is catastrophic forgetting, a phenomenon in which a model drastically loses performance on previously learned tasks when learning new ones. A popular strategy to alleviate this problem is experience replay, in which a subset of old samples is stored in a memory buffer and replayed with new data. Despite continual learning advances focusing on which examples to store and how to incorporate them into the training loss, most approaches assume that sampling from this buffer is uniform by default. We challenge the assumption that uniform sampling is necessarily optimal. We conduct an experiment in which the memory buffer updates the same way in every trial, but the replay probability of each stored sample changes between trials based on different random weight distributions. Specifically, we generate 50 different non-uniform sampling probability weights for each trial and compare their final accuracy to the uniform sampling baseline. We find that there is always at least one distribution that significantly outperforms the baseline across multiple buffer sizes, models, and datasets. These results suggest that more principled adaptive replay policies could yield further gains. We discuss how exploiting this insight could inspire new research on non-uniform memory sampling in continual learning to better mitigate catastrophic forgetting. The code supporting this study is available at https://github.com/DentonJC/memory-sampling.
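The experimental design is easy to reproduce in miniature. The sketch below fixes a random non-uniform replay weight vector per trial and samples a batch under each; actual training on the replayed batch is omitted, and the gamma-distributed weights are an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
buffer_size, n_trials, batch = 500, 50, 32

def sample_batch(weights):
    # replay probability is proportional to the per-sample weight
    p = weights / weights.sum()
    return rng.choice(buffer_size, size=batch, p=p)

# baseline: uniform replay probability over the memory buffer
baseline_idx = sample_batch(np.ones(buffer_size))

# the abstract's experiment in miniature: 50 random non-uniform weight
# vectors, each held fixed for a whole trial, compared against uniform;
# training on the replayed indices alongside new data is omitted here
for trial in range(n_trials):
    weights = rng.gamma(shape=1.0, size=buffer_size)
    idx = sample_batch(weights)
```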
The Origins, Spectral Interpretation, Resource Identification, and Security Regolith Explorer (OSIRIS-REx) spacecraft arrived at its target, near-Earth asteroid 101955 Bennu, in December 2018. After one year of operating in proximity, the team selected a primary site for sample collection. In October 2020, the spacecraft descended to the surface of Bennu and collected a sample. The spacecraft departed Bennu in May 2021 and will return the sample to Earth in September 2023. The analysis of the returned sample will produce key data to determine the history of this B-type asteroid and that of its components and precursor objects. The main goal of the OSIRIS-REx Sample Analysis Plan is to provide a framework for the Sample Analysis Team to meet the Level 1 mission requirement to analyze the returned sample to determine presolar history, formation age, nebular and parent-body alteration history, relation to known meteorites, organic history, space weathering, resurfacing history, and energy balance in the regolith of Bennu. To achieve this goal, this plan establishes a hypothesis-driven framework for coordinated sample analyses, defines the analytical instrumentation and techniques to be applied to the returned sample, provides guidance on the analysis strategy for baseline, overguide, and threshold amounts of returned sample, including a rare or unique lithology, describes the data storage, management, retrieval, and archiving system, establishes a protocol for the implementation of a micro-geographical information system to facilitate co-registration and coordinated analysis of sample science data, outlines the plans for Sample Analysis Readiness Testing, and provides guidance for the transfer of samples from curation to the Sample Analysis Team.
Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy that reuses initial data, which preserves network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM fine-tuning, unlocking greater performance from limited data.
In multi-goal reinforcement learning with a sparse binary reward, training agents is particularly challenging due to a lack of successful experiences. To solve this problem, hindsight experience replay (HER) generates successful experiences even from unsuccessful ones. However, generating successful experiences from uniformly sampled ones is not an efficient process. In this paper, the impact of exploiting the property of achieved goals in generating successful experiences is investigated and a novel cluster-based sampling strategy is proposed. The proposed sampling strategy groups episodes with different achieved goals by using a cluster model and samples experiences in the manner of HER to create the training batch. The proposed method is validated by experiments with three robotic control tasks of the OpenAI Gym. The results of the experiments demonstrate that the proposed method is substantially more sample efficient and achieves better performance than baseline approaches.
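A sketch of cluster-based sampling combined with HER "future" relabeling, assuming k-means over final achieved goals; the clustering model, per-cluster quota, and episode format are illustrative assumptions rather than the paper's exact design:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def cluster_sample_her(episodes, n_clusters=4, batch_episodes=8):
    """Group episodes by their final achieved goal, draw episodes evenly
    across clusters (instead of uniformly over all episodes), then apply
    HER-style relabeling. episodes: list of dicts with an 'achieved'
    (T, goal_dim) array per episode."""
    finals = np.stack([ep["achieved"][-1] for ep in episodes])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(finals)
    picked = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            picked.extend(rng.choice(members, batch_episodes // n_clusters))
    transitions = []
    for i in picked:
        ep = episodes[i]
        t = rng.integers(len(ep["achieved"]) - 1)
        future = rng.integers(t + 1, len(ep["achieved"]))
        # HER 'future' relabeling: pretend a later achieved goal was desired
        new_goal = ep["achieved"][future]
        reward = float(np.allclose(ep["achieved"][t + 1], new_goal))
        transitions.append((i, t, new_goal, reward))
    return transitions

# toy data: 40 episodes of length 20 with 2-D achieved goals
eps = [{"achieved": rng.normal(size=(20, 2))} for _ in range(40)]
print(len(cluster_sample_her(eps)))
```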
Learning in multi-agent systems is highly challenging due to several factors including the non-stationarity introduced by agents' interactions and the combinatorial nature of their state and action spaces. In particular, we consider the Mean-Field Control (MFC) problem which assumes an asymptotically infinite population of identical agents that aim to collaboratively maximize the collective reward. In many cases, solutions of an MFC problem are good approximations for large systems, hence, efficient learning for MFC is valuable for the analogous discrete agent setting with many agents. Specifically, we focus on the case of unknown system dynamics where the goal is to simultaneously optimize for the rewards and learn from experience. We propose an efficient model-based reinforcement learning algorithm, $M^3$-UCRL, that runs in episodes, balances between exploration and exploitation during policy learning, and provably solves this problem. Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC, obtained via a novel mean-field type analysis. To learn the system's dynamics, $M^3$-UCRL can be instantiated with various statistical models, e.g., neural networks or Gaussian Processes. Moreover, we provide a practical parametrization of the core optimization problem that facilitates gradient-based optimization techniques when combined with differentiable dynamics approximation methods such as neural networks.
Legged locomotion is arguably the most suitable and versatile mode for dealing with natural or unstructured terrains. Intensive research into dynamic walking and running controllers has recently yielded great advances, both in the optimal control and reinforcement learning (RL) literature. Hopping is a challenging dynamic task involving a flight phase and has the potential to increase the traversability of legged robots. Model-based control for hopping typically relies on accurate detection of different jump phases, such as lift-off or touchdown, and uses different controllers for each phase. In this paper, we present an end-to-end RL-based torque controller that learns to implicitly detect the relevant jump phases, removing the need to provide manual heuristics for state detection. We also extend a method for simulation-to-reality transfer of the learned controller to contact-rich dynamic tasks, resulting in successful deployment on the robot after training without parameter tuning.
The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment. In this work, we entertain an extreme scenario wherein some combination of immense environment complexity and limited agent capacity entirely precludes identifying an exactly value-equivalent model. In light of this, we embrace a notion of approximate value equivalence and introduce an algorithm for incrementally synthesizing simple and useful approximations of the environment from which an agent might still recover near-optimal behavior. Crucially, we recognize the information-theoretic nature of this lossy environment compression problem and use the appropriate tools of rate-distortion theory to make mathematically precise how value equivalence can lend tractability to otherwise intractable sequential decision-making problems.
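For reference, the (approximate) value equivalence condition and the rate-distortion view can be stated compactly; the notation below follows standard usage rather than the paper's exact formalism:

```latex
% Exact value equivalence over a policy class \Pi and function class \mathcal{V}:
\mathcal{T}_{\hat m}^{\pi} v = \mathcal{T}_{m}^{\pi} v
\qquad \forall\, \pi \in \Pi,\; v \in \mathcal{V},
% relaxed to approximate value equivalence with tolerance \varepsilon:
\sup_{\pi \in \Pi,\, v \in \mathcal{V}}
  \big\| \mathcal{T}_{\hat m}^{\pi} v - \mathcal{T}_{m}^{\pi} v \big\|_{\infty} \le \varepsilon.
% The lossy-compression view then asks for the simplest model within a
% distortion budget D, via the rate-distortion function
R(D) = \min_{p(\hat m \mid m)\,:\; \mathbb{E}[d(m,\hat m)] \le D} I(M ; \hat M).
```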
Uncertainty sampling is a prevalent active learning algorithm that sequentially queries annotations for the data samples the current prediction model is most uncertain about. However, its usage has been largely heuristic: there is no consensus on the proper definition of "uncertainty" for a specific task under a specific loss, nor a theoretical guarantee prescribing a standard protocol for implementing the algorithm. In this work, we systematically examine uncertainty sampling algorithms for binary classification via a notion of equivalent loss, which depends on the uncertainty measure used and the original loss function, and establish that an uncertainty sampling algorithm is optimizing against this equivalent loss. This perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When convexity is preserved, we give a sample complexity result for the equivalent loss and translate it into a binary loss guarantee via the surrogate link function. We prove the asymptotic superiority of uncertainty sampling over passive learning under mild conditions. We also discuss potential extensions, including the pool-based setting and generalizations to multi-class classification and regression.
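A generic least-confidence variant of uncertainty sampling for binary classification, for concreteness; which uncertainty measure is "proper" for a given loss is exactly what the paper's equivalent-loss analysis decides, and this sketch does not:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, X_init, y_init, oracle, n_queries=20):
    """Least-confidence uncertainty sampling for binary classification.

    oracle: callable(x) -> {0, 1}, the annotator being queried.
    Assumes the initial labeled set already contains both classes.
    """
    X_lab, y_lab = list(X_init), list(y_init)
    queried = []
    for _ in range(n_queries):
        model = LogisticRegression().fit(np.asarray(X_lab), np.asarray(y_lab))
        proba = model.predict_proba(X_pool)[:, 1]
        i = int(np.argmin(np.abs(proba - 0.5)))  # point the model is least sure about
        queried.append(i)
        X_lab.append(X_pool[i])
        y_lab.append(oracle(X_pool[i]))
    return queried
```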
Adaptive exploration methods learn complex policies by alternating between exploration and exploitation. An important question for such methods is when to switch from exploration to exploitation and vice versa, which is critical in domains that require learning long and complex action sequences. In this work, we present a generic adaptive exploration framework that employs uncertainty to address this question in a principled manner. Our framework includes previous adaptive exploration approaches as special cases and can incorporate any uncertainty-measuring mechanism of choice, for instance the mechanisms used in intrinsic motivation or epistemic-uncertainty-based exploration methods. We experimentally demonstrate that our framework gives rise to adaptive exploration strategies that outperform standard ones across several environments.
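A minimal sketch of an uncertainty-gated switch, assuming some external uncertainty estimate (ensemble variance, intrinsic bonus, etc.); the paper's framework is more general than this single-threshold rule:

```python
import numpy as np

def select_action(q_values, uncertainty, threshold, rng=None):
    """Uncertainty-gated switch between exploration and exploitation.

    `uncertainty` can come from any mechanism of choice (ensemble variance,
    intrinsic motivation signal, epistemic bonus, ...).
    """
    rng = rng or np.random.default_rng()
    if uncertainty > threshold:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```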
Multi-agent reinforcement learning (MARL) methods have achieved state-of-the-art results on a range of multi-agent tasks. Yet, MARL algorithms typically require significantly more environment interactions than their single-agent counterparts to converge, a problem exacerbated by the difficulty of exploring a large joint action space and the high variance intrinsic to MARL environments. To tackle these issues, we propose a novel algorithm that combines a decomposed centralized critic with decentralized ensemble learning, incorporating several key contributions. The main component of our scheme is a selective exploration method that leverages ensemble kurtosis: we extend the global decomposed critic with a diversity-regularized ensemble of individual critics and use its excess kurtosis to guide exploration toward high-uncertainty states and actions. To improve sample efficiency, we train the centralized critic with a novel truncated variation of the TD($λ$) algorithm, enabling efficient off-policy learning with reduced variance. On the actor side, our algorithm adapts the mixed-samples approach to MARL, mixing on-policy and off-policy loss functions for training the actors; this balances stability and efficiency and outperforms purely off-policy learning. Our evaluation shows the method outperforms state-of-the-art baselines on standard MARL benchmarks, including a variety of SMAC II maps.
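A sketch of the kurtosis-based exploration signal, assuming a critic ensemble; the diversity regularization and how the bonus enters action selection are omitted:

```python
import numpy as np
from scipy.stats import kurtosis

def kurtosis_bonus(ensemble_q):
    """Excess kurtosis of ensemble Q-estimates per action.

    ensemble_q: (n_members, n_actions) array from a critic ensemble.
    fisher=True returns *excess* kurtosis (0 for a Gaussian); heavy-tailed
    disagreement marks high-uncertainty actions worth exploring.
    """
    return kurtosis(np.asarray(ensemble_q), axis=0, fisher=True, bias=False)
```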
Large Multimodal Models (LMMs), harnessing the complementarity among diverse modalities, are often considered more robust than pure Large Language Models (LLMs); yet do LMMs know what they do not know? Three key open questions remain: (1) how to evaluate the uncertainty of diverse LMMs in a unified manner, (2) how to prompt LMMs to reveal their uncertainty, and (3) how to quantify that uncertainty for downstream tasks. To address these challenges, we introduce Uncertainty-o: (1) a model-agnostic framework designed to reveal uncertainty in LMMs regardless of their modalities, architectures, or capabilities; (2) an empirical exploration of multimodal prompt perturbations for uncovering LMM uncertainty, offering insights and findings; and (3) a formulation of multimodal semantic uncertainty that enables quantifying uncertainty from multimodal responses. Experiments across 18 benchmarks spanning various modalities and 10 LMMs (both open- and closed-source) demonstrate the effectiveness of Uncertainty-o in reliably estimating LMM uncertainty, thereby enhancing downstream tasks such as hallucination detection, hallucination mitigation, and uncertainty-aware Chain-of-Thought reasoning.
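One plausible reading of the semantic uncertainty computation, in the spirit of semantic entropy; the `equivalent` predicate (e.g., bidirectional entailment) and the greedy clustering are assumptions, not the paper's exact procedure:

```python
import math

def semantic_uncertainty(responses, equivalent):
    """Entropy over semantic clusters of responses sampled under perturbed
    multimodal prompts.

    responses: list of model outputs for perturbations of one query.
    equivalent: callable(a, b) -> bool deciding semantic equivalence.
    """
    clusters = []
    for r in responses:
        for c in clusters:          # greedy clustering by semantic equivalence
            if equivalent(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])
    n = len(responses)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)  # high entropy = high uncertainty
```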
Autonomous exploration is a crucial capability in robotics, enabling robots to explore unknown environments and build maps without prior knowledge. This paper proposes a method to enhance exploration efficiency by integrating neural-network-based occupancy grid map prediction with an uncertainty-aware Bayesian neural network. The prediction uncertainty is probabilistically integrated into the mutual information objective used for exploration. To demonstrate the effectiveness of the proposed method, we conducted comparative simulations within a frontier exploration framework in a realistic simulator environment against various information metrics. The proposed method showed superior exploration efficiency.
High sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty can significantly improve the sample efficiency of RL in single-agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.
Neural rendering algorithms introduce a fundamentally new approach to photorealistic rendering, typically by learning a neural representation of illumination from large numbers of ground-truth images. When training for a given variable scene, i.e., changing objects, materials, lights, and viewpoint, the space D of possible training data instances quickly becomes unmanageable as the dimensions of variable parameters increase. We introduce a novel Active Exploration method using Markov Chain Monte Carlo, which explores D, generating the samples (i.e., ground-truth renderings) that best help training, and interleaves training with on-the-fly sample generation. We introduce a self-tuning sample reuse strategy to minimize the expensive step of rendering training samples. We apply our approach to a neural generator that learns to render novel scene instances given an explicit parameterization of the scene configuration. Our results show that Active Exploration trains our network much more efficiently than uniform sampling, and together with our resolution enhancement approach, achieves better quality than uniform sampling at convergence. Our method allows interactive rendering of hard light transport paths (e.g., complex caustics) -- which require very high sample counts to be captured -- and provides dynamic scene navigation and manipulation, after training for 5-18 hours depending on the required quality and variations.
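A generic Metropolis-Hastings sketch of the active exploration loop, treating a training-utility score as an unnormalized target density; the paper's proposal distribution and sample-reuse strategy differ:

```python
import numpy as np

def mh_active_exploration(utility, theta0, n_steps=1000, step=0.1, seed=0):
    """Metropolis-Hastings walk over scene parameters theta, spending more
    samples where `utility` (e.g., current training error of the renderer)
    is high, so expensive ground-truth renders are generated where they
    help training most.

    utility: callable(theta) -> nonnegative score treated as an unnormalized
    target density. Generic MH, not the paper's exact scheme.
    """
    rng = np.random.default_rng(seed)
    theta, u = np.asarray(theta0, float), utility(theta0)
    samples = []
    for _ in range(n_steps):
        cand = theta + step * rng.standard_normal(theta.shape)  # random-walk proposal
        u_cand = utility(cand)
        if rng.random() < min(1.0, u_cand / max(u, 1e-12)):     # MH acceptance test
            theta, u = cand, u_cand
        samples.append(theta.copy())
    return samples
```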
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects the depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://github.com/AlbertTan404/Latent-Exploration-Decoding.
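A sketch of LED's decoding rule as described in the abstract (cumulative-sum aggregation of intermediate posteriors, then the max-entropy depth); the actual readout of per-layer logits and candidate selection may differ:

```python
import torch

def led_next_token(layer_logits):
    """Depth-conditioned decoding sketched from the abstract.

    layer_logits: (n_layers, vocab_size) logits read out at each depth.
    """
    posteriors = torch.softmax(layer_logits, dim=-1)        # per-depth posterior
    agg = torch.cumsum(posteriors, dim=0)                   # cumulative-sum aggregation
    agg = agg / agg.sum(dim=-1, keepdim=True)               # renormalize per depth
    entropy = -(agg * agg.clamp_min(1e-12).log()).sum(-1)   # entropy of each depth config
    best = int(torch.argmax(entropy))                       # max-entropy candidate
    return int(torch.multinomial(agg[best], num_samples=1)) # sample exploratory token
```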
This paper proposes a method to probabilistically quantify the moments (mean and variance) of excavated material by aggregating the prior moments of the grade blocks around a given bucket dig location. Modelling the moments as random probability density functions (pdfs) at sampled locations, a sums-of-Gaussians uncertainty estimation is formulated that jointly estimates the location pdfs and the prior uncertainty values from ore-body-knowledge (obk) sub-block models. The moments calculated at each sampled location define a single Gaussian, and these Gaussians are the components of a Gaussian mixture distribution. The overall uncertainty of the excavated material at the given bucket location is thus represented by a Gaussian Mixture Model (GMM), and a moment-matching method is proposed to estimate the moments of the reduced GMM. The method was tested in a region of a Pilbara iron ore deposit situated in the Brockman Iron Formation of the Hamersley Province, Western Australia, and suggests a framework for quantifying the uncertainty in excavated material that has not previously been studied in the literature.
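The moment-matching step itself is standard; for a 1-D mixture it reduces to:

```python
import numpy as np

def match_moments(weights, means, variances):
    """Collapse a 1-D Gaussian mixture to a single Gaussian with identical
    first two moments.

    mean = sum_i w_i * mu_i
    var  = sum_i w_i * (sigma_i^2 + mu_i^2) - mean^2
    """
    w, mu, var = (np.asarray(x, float) for x in (weights, means, variances))
    m = float(np.sum(w * mu))
    v = float(np.sum(w * (var + mu**2)) - m**2)
    return m, v
```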
Machine learning classifiers are probabilistic in nature and thus inevitably involve uncertainty. Predicting the probability that a specific prediction is correct is called uncertainty (or confidence) estimation and is crucial for risk management. Post-hoc calibration can improve a model's uncertainty estimates without retraining and without changing the model. Our work puts forward a geometric approach to uncertainty estimation: roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty, and then calibrate that signal (instead of the model's estimate) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimates than recently proposed approaches through extensive evaluation on multiple datasets and models. In addition, we demonstrate the possibility of running our approach in near-real-time applications. Our code is available on GitHub: https://github.com/NoSleepDeveloper/Geometric-Calibrator.
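A minimal sketch of the pipeline, assuming mean k-NN distance as the geometric signal and isotonic regression as the post-hoc calibrator (both choices illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.isotonic import IsotonicRegression

def fit_geometric_calibrator(X_train, X_val, correct_val, k=5):
    """Use distance to the training set as the uncertainty signal and
    calibrate that signal (not the model's own scores) post hoc.

    correct_val: 1 if the classifier was right on the validation point, else 0.
    Returns (knn, calibrator); smaller distance -> higher P(correct).
    """
    knn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist = knn.kneighbors(X_val)[0].mean(axis=1)          # mean k-NN distance
    calib = IsotonicRegression(increasing=False, out_of_bounds="clip")
    calib.fit(dist, correct_val)                          # distance -> P(correct)
    return knn, calib
```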
Transductive inference is an effective technique for few-shot learning, where query sets are used to update prototypes and thereby improve the model. However, existing methods treat only the classification scores of query instances as confidence, ignoring the uncertainty of those scores. In this paper, we propose a novel method, Uncertainty-Based Network, which models the uncertainty of classification results with the help of mutual information. Specifically, we first augment and classify each query instance and compute the mutual information of the resulting classification scores. This mutual information serves as an uncertainty that weights the classification scores, and an iterative update strategy based on the scores and uncertainties assigns optimal weights to query instances during prototype optimization. Extensive results on four benchmarks show that Uncertainty-Based Network achieves classification accuracy comparable to state-of-the-art methods.
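A common mutual-information decomposition over augmented views (BALD-style) that matches the description above; the paper's exact MI definition may differ:

```python
import numpy as np

def augmentation_mutual_info(probs):
    """Mutual information between predictions over augmented views of one
    query instance.

    probs: (n_augmentations, n_classes) classification scores.
    MI = H(mean prediction) - mean per-view entropy; high MI means the views
    disagree, so the score receives a lower weight in prototype updates.
    """
    p = np.asarray(probs)
    eps = 1e-12
    mean_p = p.mean(axis=0)
    h_mean = -np.sum(mean_p * np.log(mean_p + eps))          # entropy of mean
    mean_h = -np.mean(np.sum(p * np.log(p + eps), axis=1))   # mean entropy
    return h_mean - mean_h
```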
Deep Reinforcement Learning (DRL) has demonstrated great potential for solving sequential decision-making problems in many applications. Despite its promising performance, practical gaps remain when deploying DRL in real-world scenarios. One main barrier is overfitting, which leads to poor generalizability of the learned policy. In particular, for offline DRL with observational data, model selection is challenging because there is no ground truth available for performance demonstration, in contrast to the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which provides a provably effective framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of the DRL model in identifying the optimal policy. Numerical studies demonstrate the superior performance of our approach over existing methods.
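The core pessimistic rule, stripped of the paper's bound construction and bias corrections, amounts to selecting by lower confidence bound:

```python
import numpy as np

def pessimistic_select(value_estimates, value_stds, alpha=1.0):
    """Pick the candidate policy with the best lower confidence bound on its
    estimated value, rather than the best point estimate.

    A generic pessimistic rule in the spirit of PMS; the paper's actual
    bound construction and refinements are more involved.
    """
    lcb = np.asarray(value_estimates) - alpha * np.asarray(value_stds)
    return int(np.argmax(lcb))
```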
Large-scale Web-based services present opportunities for improving UI policies based on observed user interactions. We address challenges of learning such policies through model-free offline Reinforcement Learning (RL) with off-policy training. Deployed in a production system for user authentication in a major social network, it significantly improves long-term objectives. We articulate practical challenges, compare several ML techniques, provide insights on training and evaluation of RL models, and discuss generalizations.
Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation (OPE) and policy optimization. DICE-based offline constrained RL particularly benefits from the flexibility of DICE, as it simultaneously maximizes return while estimating costs in offline settings. However, we have observed that recent approaches designed to enhance the offline RL performance of the DICE framework inadvertently undermine its ability to perform OPE, making them unsuitable for constrained RL scenarios. In this paper, we identify the root cause of this limitation: their reliance on a semi-gradient optimization, which solves a fundamentally different optimization problem and results in failures in cost estimation. Building on these insights, we propose a novel method to enable OPE and constrained RL through semi-gradient DICE. Our method ensures accurate cost estimation and achieves state-of-the-art performance on the offline constrained RL benchmark, DSRL.
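For context, the standard DICE quantities the discussion relies on (not the paper's semi-gradient derivation):

```latex
% Stationary distribution correction ratio estimated by DICE methods:
w_{\pi}(s,a) = \frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)},
% which reduces off-policy evaluation of return and cost to expectations
% under the dataset distribution:
\rho(\pi) = \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[w_{\pi}(s,a)\, r(s,a)\big],
\qquad
c(\pi) = \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[w_{\pi}(s,a)\, c(s,a)\big].
```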
In offline reinforcement learning, agents are trained using only a fixed set of stored transitions derived from a source policy. However, this requires the dataset to be labeled by a reward function. In applied settings such as video game development, the availability of a reward function is not always guaranteed. This paper proposes Trajectory-Ranked OFfline Inverse reinforcement learning (TROFI), a novel approach for learning a policy offline without a pre-defined reward function. TROFI first learns a reward function from human preferences, which it then uses to label the original dataset, making it usable for policy training. In contrast to other approaches, our method does not require optimal trajectories. Through experiments on the D4RL benchmark, we demonstrate that TROFI consistently outperforms baselines and performs comparably to using the ground-truth reward to learn policies. We additionally validate the efficacy of our method in a 3D game environment. Our studies of the reward model highlight the importance of the reward function in this setting: to ensure the alignment of a value function with the actual future discounted reward, it is essential to have a well-engineered and easy-to-learn reward function.
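A minimal sketch of the first stage, assuming a Bradley-Terry objective over ranked trajectory pairs; `reward_model` is an assumed callable, and the learned reward would then label the offline dataset:

```python
import torch.nn.functional as F

def preference_loss(reward_model, traj_preferred, traj_other):
    """Bradley-Terry loss for learning a reward from ranked trajectory pairs:
    the preferred trajectory should receive the higher total reward.

    reward_model: maps a (T, obs_dim) trajectory tensor to per-step rewards.
    """
    r_w = reward_model(traj_preferred).sum()  # return of preferred trajectory
    r_l = reward_model(traj_other).sum()      # return of the other trajectory
    return -F.logsigmoid(r_w - r_l)
```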
Offline reinforcement learning (RL) is a promising direction that allows RL agents to pre-train on large datasets, avoiding the recurrence of expensive data collection. To advance the field, it is crucial to generate large-scale datasets. Compositional RL is particularly appealing for generating such large datasets, since 1) it permits creating many tasks from few components, 2) the task structure may enable trained agents to solve new tasks by combining relevant learned components, and 3) the compositional dimensions provide a notion of task relatedness. This paper provides four offline RL datasets for simulated robotic manipulation created using the $256$ tasks from CompoSuite [Mendez et al., 2022a]. Each dataset is collected from an agent with a different degree of performance, and consists of $256$ million transitions. We provide training and evaluation settings for assessing an agent's ability to learn compositional task policies. Our benchmarking experiments show that current offline RL methods can learn the training tasks to some extent and that compositional methods outperform non-compositional methods. Yet current methods are unable to extract the compositional structure to generalize to unseen tasks, highlighting a need for future research in offline compositional RL.
In this paper, we introduce d3rlpy, an open-source offline deep reinforcement learning (RL) library for Python. d3rlpy supports a set of offline deep RL algorithms as well as off-policy online algorithms via a fully documented plug-and-play API. To address reproducibility, we conduct a large-scale benchmark with the D4RL and Atari 2600 datasets to ensure implementation quality, and provide experimental scripts and full tables of results. The d3rlpy source code can be found on GitHub: \url{https://github.com/takuseno/d3rlpy}.
Reinforcement Learning (RL) models have continually evolved to navigate the exploration-exploitation trade-off in uncertain Markov Decision Processes (MDPs). In this study, I leverage principles of stochastic thermodynamics and system dynamics to explore reward shaping via diffusion processes, which provides an elegant framework for thinking about the exploration-exploitation trade-off. This article sheds light on the relationships between information entropy, stochastic system dynamics, and their influence on entropy production. This allows us to construct a dual-pronged framework that can be interpreted either as a maximum-entropy program for deriving efficient policies or as a modified cost-optimization program accounting for informational costs and benefits. This work presents a novel perspective on the physical nature of information and its implications for online learning in MDPs, providing a better understanding of information-oriented formulations in RL.
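The maximum-entropy program referenced above is conventionally written as:

```latex
% Maximum-entropy objective: expected reward plus an entropy term whose
% temperature \alpha prices the informational cost of a deterministic policy:
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t)
         + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right].
```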
In this paper, we propose an off-policy deep reinforcement learning (DRL) method utilizing the average reward criterion. While most existing DRL methods employ the discounted reward criterion, this can potentially lead to a discrepancy between the training objective and performance metrics in continuing tasks, making the average reward criterion a recommended alternative. We introduce RVI-SAC, an extension of the state-of-the-art off-policy DRL method, Soft Actor-Critic (SAC), to the average reward criterion. Our proposal consists of (1) Critic updates based on RVI Q-learning, (2) Actor updates introduced by the average reward soft policy improvement theorem, and (3) automatic adjustment of the Reset Cost, enabling average reward reinforcement learning to be applied to tasks with termination. We apply our method to Gymnasium's MuJoCo tasks, a subset of locomotion tasks, and demonstrate that RVI-SAC shows competitive performance compared to existing methods.
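For reference, the tabular RVI Q-learning update underlying contribution (1); RVI-SAC extends this to the soft actor-critic setting:

```latex
% Tabular RVI Q-learning: subtract a reference value f(Q) (e.g., Q at a fixed
% state-action pair) so the iterates stay bounded while f(Q) tracks the
% average reward \rho:
Q(s,a) \leftarrow Q(s,a)
  + \alpha\big[r - f(Q) + \max_{a'} Q(s',a') - Q(s,a)\big].
```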
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
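The average-reward criterion targeted here, in place of the discounted return:

```latex
% Long-run average reward of a policy \pi:
\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right].
```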
We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer will learn to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information like access to optimal arms or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pre-training approach that trains a transformer network to learn a near-optimal policy in-context. This approach leverages the shared structure across tasks, does not require access to optimal actions, and can outperform the demonstrator. We validate these claims over a wide variety of structured bandit problems to show that our proposed solution is general and can quickly identify expected rewards on unseen test tasks to support effective exploration.
The merged final grouping organizes the "sample efficiency" research landscape into seven parallel, minimally overlapping main threads: (1) conservative learning and distribution-shift mitigation for offline/batch RL; (2) experience replay and resampling to increase data reuse (including hindsight and structured replay); (3) exploration and information/uncertainty-driven learning from few interactions; (4) model-based learning and planning/imagined data (including uncertainty correction and synthetic experience); (5) reconstruction of reward and task signals (sparse/noisy/intrinsic rewards, reward geometry, curricula, and representations); (6) training paradigms and engineering/system-level plugins that lower the cost per unit of interaction; (7) provably efficient theory and algorithmic structure (sample complexity, convergence, and regret). Applications and complex settings such as multi-agent systems, control, and real-world deployments are grouped separately to preserve their distinctive constraints on interaction budgets and deployability.