Foresight Distribution Adjustment for Off-policy Reinforcement Learning

Published: 06 May 2024

Abstract

Off-policy reinforcement learning algorithms maintain a replay buffer to reuse samples collected by earlier policies. The sampling strategy, which prioritizes certain data in the buffer for training the value function or the policy, has been shown to significantly influence both the sample efficiency and the final performance of the algorithm. However, which experience-prioritization distribution is the best choice has not been explored thoroughly. In this paper, we prove that the post-update policy distribution (i.e., the visitation distribution of the policy after the current iteration of updates) is the best distribution for Q training in terms of benefiting policy improvement. Nevertheless, accessing this "future" distribution is not straightforward. We find that current experiences can be reweighted using critic information to simulate the post-update policy distribution. Technically, we derive the gradient of the visitation distribution with respect to the policy parameters and obtain an explicit expression that approximates the post-update policy distribution. The resulting method, named Foresight Distribution Adjustment (FoDA), integrates seamlessly with conventional off-policy actor-critic algorithms. Our experiments validate FoDA's ability to closely approximate the post-update policy distribution and demonstrate its utility in improving performance on continuous control benchmarks.
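To make the idea of critic-modulated replay sampling concrete, the following Python sketch shows one way a replay buffer's sampling distribution could be reweighted using critic information before an off-policy actor-critic update. This is a minimal illustration, not the paper's FoDA derivation: the ReplayBuffer class, the foresight_weights helper, the softmax-over-advantage weighting rule, the temperature beta, and the network sizes are all illustrative assumptions introduced here for the example.

# A minimal sketch (not the authors' exact FoDA method) of critic-modulated
# replay sampling for an off-policy actor-critic setup. All helper names,
# the temperature `beta`, and the weighting rule are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.size, self.ptr = capacity, 0, 0
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros((capacity, 1), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)

    def add(self, o, a, r, o2):
        self.obs[self.ptr], self.act[self.ptr] = o, a
        self.rew[self.ptr], self.next_obs[self.ptr] = r, o2
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, probs=None):
        # Non-uniform sampling: `probs` (length == self.size) biases the draw
        # toward transitions the adjusted distribution should emphasize.
        idx = np.random.choice(self.size, batch_size, p=probs)
        to_t = lambda x: torch.as_tensor(x[idx])
        return to_t(self.obs), to_t(self.act), to_t(self.rew), to_t(self.next_obs)

def foresight_weights(critic, actor, obs, act, beta=1.0):
    """Heuristic stand-in for approximating the post-update visitation shift:
    upweight transitions whose stored actions the critic scores above the value
    of the current policy's own action, since an improved policy is expected to
    visit such regions more often (an assumption made for this sketch)."""
    with torch.no_grad():
        q_data = critic(torch.cat([obs, act], dim=-1))
        q_pi = critic(torch.cat([obs, actor(obs)], dim=-1))
        advantage = (q_data - q_pi).squeeze(-1)
        w = torch.softmax(beta * advantage, dim=0)  # normalized sampling probs
    return w.numpy().astype(np.float64)

if __name__ == "__main__":
    obs_dim, act_dim = 4, 2
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, act_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))
    buf = ReplayBuffer(1000, obs_dim, act_dim)
    for _ in range(200):  # fill with random transitions for the sketch
        buf.add(np.random.randn(obs_dim), np.random.randn(act_dim),
                np.random.randn(), np.random.randn(obs_dim))

    all_obs = torch.as_tensor(buf.obs[:buf.size])
    all_act = torch.as_tensor(buf.act[:buf.size])
    probs = foresight_weights(critic, actor, all_obs, all_act, beta=0.5)
    probs /= probs.sum()                       # guard against float drift
    o, a, r, o2 = buf.sample(64, probs=probs)  # critic-adjusted minibatch
    print(o.shape, a.shape, r.shape, o2.shape)

In a full actor-critic loop, the minibatch drawn this way would feed the usual Bellman backup for the critic and the policy-gradient step for the actor; the only change relative to uniform replay is the sampling distribution.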



Information

Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024, 2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC

Author Tags

1. experience replay
2. off-policy reinforcement learning
3. reinforcement learning

Qualifiers

• Research-article

Funding Sources

• National Science Foundation of China

Conference

AAMAS '24

Acceptance Rates

Overall Acceptance Rate: 1,155 of 5,036 submissions, 23%

