BiES: Adaptive Policy Optimization for Model-Based Offline Reinforcement Learning

Yijun Yang¹¹,¹², Jing Jiang¹¹, Zhuowei Wang¹¹, Qiqi Duan¹² & Yuhui Shi¹²

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13151)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Offline reinforcement learning (RL) aims to train an agent solely from a dataset of historical interactions with the environment, without any further costly or dangerous active exploration. Model-based RL (MbRL) usually achieves promising performance in offline RL thanks to its high sample efficiency and compact modeling of a dynamic environment. However, it may suffer from the bias and error accumulation of the model predictions. Existing methods address this problem by adding a penalty term to the model reward, but they require careful hand-tuning of the penalty and its weight. In this paper, we instead formulate model-based offline RL as a bi-objective optimization in which the first objective aims to maximize the model return and the second objective is adaptive to the learning dynamics of the RL policy. As a result, we do not need to tune the penalty and its weight, and can achieve a more advantageous trade-off between the final model return and the model's uncertainty. We develop an efficient and adaptive policy optimization algorithm, named BiES, that is equipped with an evolution strategy to solve the bi-objective optimization. Experimental results on the D4RL benchmark show that our approach sets a new state of the art and significantly outperforms existing offline RL methods on long-horizon tasks.
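The abstract does not spell out the BiES update rule, so the following is only a generic illustration of the idea of replacing a hand-tuned penalty weight with an adaptive one. It is a minimal sketch, assuming two toy quadratic objectives (hypothetical stand-ins for "model return" and "negative model uncertainty"), antithetic evolution-strategy gradient estimates, and the closed-form two-objective min-norm weighting; none of the function names, objectives, or hyperparameters below come from the paper.

```python
import random


def es_gradients(objectives, theta, sigma=0.1, pairs=50, rng=random):
    """Antithetic evolution-strategy gradient estimates, one per objective.

    The same perturbations are reused for every objective, as if a single
    rollout reported both scores (an assumption of this sketch).
    """
    dim = len(theta)
    grads = [[0.0] * dim for _ in objectives]
    for _ in range(pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [t + sigma * e for t, e in zip(theta, eps)]
        minus = [t - sigma * e for t, e in zip(theta, eps)]
        for g, f in zip(grads, objectives):
            scale = (f(plus) - f(minus)) / (2.0 * sigma * pairs)
            for i in range(dim):
                g[i] += scale * eps[i]
    return grads


def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimising ||alpha * g1 + (1 - alpha) * g2||."""
    diff = [x - y for x, y in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same direction
    alpha = -sum(d * y for d, y in zip(diff, g2)) / denom
    return min(1.0, max(0.0, alpha))


def bi_objective_es(f1, f2, theta, steps=400, lr=0.05, seed=0):
    """Gradient ascent on two objectives with a weight recomputed every step."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g1, g2 = es_gradients([f1, f2], theta, rng=rng)
        alpha = min_norm_weight(g1, g2)
        theta = [t + lr * (alpha * a + (1.0 - alpha) * b)
                 for t, a, b in zip(theta, g1, g2)]
    return theta


# Hypothetical toy objectives with optima at (1, 0) and (-1, 0).
def toy_return(theta):
    return -((theta[0] - 1.0) ** 2 + theta[1] ** 2)


def toy_certainty(theta):
    return -((theta[0] + 1.0) ** 2 + theta[1] ** 2)


start = [0.5, 2.0]
final = bi_objective_es(toy_return, toy_certainty, start)
print(final, toy_return(final), toy_certainty(final))
```

In this toy setting the trade-offs between the two objectives lie on the segment between the two optima, and the iterate settles on that segment without any fixed penalty coefficient being chosen in advance. The weight `alpha` is recomputed from the current gradient estimates at every step, which is the sense in which the trade-off is adaptive here.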

This work is partially supported by the Shenzhen Fundamental Research Program under Grant No. JCYJ20200109141235597.

Notes

1. https://github.com/aravindr93/mjrl/issues/35

Author information

Authors and Affiliations

AAII, University of Technology Sydney, Ultimo, NSW, 2007, Australia
Yijun Yang, Jing Jiang & Zhuowei Wang
Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
Yijun Yang, Qiqi Duan & Yuhui Shi

Corresponding author

Correspondence to Yuhui Shi.

Editor information

Editors and Affiliations

University of Technology Sydney, Sydney, NSW, Australia
Guodong Long
RMIT University, Melbourne, VIC, Australia
Xinghuo Yu
University of Queensland, Brisbane, QLD, Australia
Sen Wang

About this paper

Cite this paper

Yang, Y., Jiang, J., Wang, Z., Duan, Q., Shi, Y. (2022). BiES: Adaptive Policy Optimization for Model-Based Offline Reinforcement Learning. In: Long, G., Yu, X., Wang, S. (eds) AI 2021: Advances in Artificial Intelligence. AI 2022. Lecture Notes in Computer Science, vol 13151. Springer, Cham. https://doi.org/10.1007/978-3-030-97546-3_46

DOI: https://doi.org/10.1007/978-3-030-97546-3_46
Published: 19 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-97545-6
Online ISBN: 978-3-030-97546-3
eBook Packages: Computer Science, Computer Science (R0)
