Foresight Distribution Adjustment for Off-policy Reinforcement Learning

Published: 06 May 2024

Abstract

Off-policy reinforcement learning algorithms maintain a replay buffer to reuse samples collected by earlier policies. The sampling strategy, which prioritizes certain data in the buffer for training the value function or the policy, has been shown to significantly influence both the sample efficiency and the final performance of the algorithm. However, which experience-prioritization distribution is the best choice has not been explored thoroughly. In this paper, we prove that the post-update policy distribution (i.e., the visitation distribution of the policy after the current iteration of updates) is the best distribution for Q training in terms of benefiting policy improvement. Nevertheless, accessing this "future" distribution is not straightforward. We find that current experiences can be reweighted using critic information to simulate the post-update policy distribution. Technically, we derive the gradient of the visitation distribution with respect to the policy parameters and obtain an explicit expression that approximates the post-update policy distribution. The resulting method, named Foresight Distribution Adjustment (FoDA), integrates seamlessly with conventional off-policy actor-critic algorithms. Our experiments validate FoDA's ability to closely approximate the post-update policy distribution and demonstrate its utility in improving performance on continuous control benchmarks.
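To make the idea of critic-modulated replay sampling concrete, the following Python sketch shows one way a replay buffer's sampling distribution could be reweighted using critic information before an off-policy actor-critic update. This is a minimal illustration, not the paper's FoDA derivation: the ReplayBuffer class, the foresight_weights helper, the softmax-over-advantage weighting rule, the temperature beta, and the network sizes are all illustrative assumptions introduced here for the example.

# A minimal sketch (not the authors' exact FoDA method) of critic-modulated
# replay sampling for an off-policy actor-critic setup. All helper names,
# the temperature `beta`, and the weighting rule are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.size, self.ptr = capacity, 0, 0
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros((capacity, 1), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)

    def add(self, o, a, r, o2):
        self.obs[self.ptr], self.act[self.ptr] = o, a
        self.rew[self.ptr], self.next_obs[self.ptr] = r, o2
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, probs=None):
        # Non-uniform sampling: `probs` (length == self.size) biases the draw
        # toward transitions the adjusted distribution should emphasize.
        idx = np.random.choice(self.size, batch_size, p=probs)
        to_t = lambda x: torch.as_tensor(x[idx])
        return to_t(self.obs), to_t(self.act), to_t(self.rew), to_t(self.next_obs)

def foresight_weights(critic, actor, obs, act, beta=1.0):
    """Heuristic stand-in for approximating the post-update visitation shift:
    upweight transitions whose stored actions the critic scores above the value
    of the current policy's own action, since an improved policy is expected to
    visit such regions more often (an assumption made for this sketch)."""
    with torch.no_grad():
        q_data = critic(torch.cat([obs, act], dim=-1))
        q_pi = critic(torch.cat([obs, actor(obs)], dim=-1))
        advantage = (q_data - q_pi).squeeze(-1)
        w = torch.softmax(beta * advantage, dim=0)  # normalized sampling probs
    return w.numpy().astype(np.float64)

if __name__ == "__main__":
    obs_dim, act_dim = 4, 2
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, act_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))
    buf = ReplayBuffer(1000, obs_dim, act_dim)
    for _ in range(200):  # fill with random transitions for the sketch
        buf.add(np.random.randn(obs_dim), np.random.randn(act_dim),
                np.random.randn(), np.random.randn(obs_dim))

    all_obs = torch.as_tensor(buf.obs[:buf.size])
    all_act = torch.as_tensor(buf.act[:buf.size])
    probs = foresight_weights(critic, actor, all_obs, all_act, beta=0.5)
    probs /= probs.sum()                       # guard against float drift
    o, a, r, o2 = buf.sample(64, probs=probs)  # critic-adjusted minibatch
    print(o.shape, a.shape, r.shape, o2.shape)

In a full actor-critic loop, the minibatch drawn this way would feed the usual Bellman backup for the critic and the policy-gradient step for the actor; the only change relative to uniform replay is the sampling distribution.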



Information

Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024, 2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC

Author Tags

1. experience replay
2. off-policy reinforcement learning
3. reinforcement learning

Qualifiers

• Research-article

Funding Sources

• National Science Foundation of China

Conference

AAMAS '24

Acceptance Rates

Overall Acceptance Rate: 1,155 of 5,036 submissions, 23%

