Abstract
Offline reinforcement learning (RL) aims to train an agent using only a dataset of historical interactions with the environment, without any further costly or dangerous active exploration. Model-based RL (MbRL) usually achieves promising performance in offline RL owing to its high sample efficiency and compact modeling of a dynamic environment. However, it may suffer from bias and error accumulation in the model predictions. Existing methods address this problem by adding a penalty term to the model reward, but they require careful hand-tuning of the penalty and its weight. In this paper, we instead formulate model-based offline RL as a bi-objective optimization problem in which the first objective maximizes the model return and the second objective adapts to the learning dynamics of the RL policy. As a result, we do not need to tune the penalty or its weight, yet we can achieve a more advantageous trade-off between the final model return and the model’s uncertainty. We develop an efficient and adaptive policy optimization algorithm, named BiES, that is equipped with an evolution strategy to solve the bi-objective optimization problem. Experimental results on the D4RL benchmark show that our approach sets a new state of the art and significantly outperforms existing offline RL methods on long-horizon tasks.
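To make the idea of a bi-objective, evolution-strategy update more concrete, the following is a minimal sketch and not the BiES algorithm itself, whose details the abstract does not give. It assumes two toy quadratic objectives (`toy_model_return` and `toy_neg_uncertainty`, hypothetical stand-ins for the model return and the negated model uncertainty), an antithetic evolution-strategy gradient estimate, and the closed-form two-objective min-norm weighting used in multiple-gradient descent. All function names, targets, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical optima of the two toy objectives (assumptions for illustration).
TARGET_RETURN = np.array([1.0, 0.0])
TARGET_CERTAIN = np.array([0.0, 1.0])


def toy_model_return(theta):
    """Toy stand-in for the model return (higher is better)."""
    return -float(np.sum((theta - TARGET_RETURN) ** 2))


def toy_neg_uncertainty(theta):
    """Toy stand-in for the negated model uncertainty (higher is better)."""
    return -float(np.sum((theta - TARGET_CERTAIN) ** 2))


def es_gradient(f, theta, rng, sigma=0.1, n_pairs=32):
    """Antithetic evolution-strategy estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # Evaluate mirrored perturbations and weight the noise by the difference.
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (2.0 * sigma * n_pairs)


def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimising ||alpha * g1 + (1 - alpha) * g2||."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same direction
    return float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))


def optimise(theta0, steps=200, lr=0.05, seed=0):
    """Ascend both objectives along the min-norm combination of ES gradients."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        g1 = es_gradient(toy_model_return, theta, rng)
        g2 = es_gradient(toy_neg_uncertainty, theta, rng)
        alpha = min_norm_weight(g1, g2)  # recomputed every step, no fixed weight
        theta += lr * (alpha * g1 + (1.0 - alpha) * g2)
    return theta
```

Calling `optimise([-2.0, -2.0])` moves the parameters toward the set of trade-off points between the two toy optima. The point of the sketch is that the weight between the objectives is recomputed from the current gradients at each step rather than fixed by hand, which loosely mirrors the motivation stated in the abstract.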
This work is partially supported by the Shenzhen Fundamental Research Program under Grant No. JCYJ20200109141235597.
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, Y., Jiang, J., Wang, Z., Duan, Q., Shi, Y. (2022). BiES: Adaptive Policy Optimization for Model-Based Offline Reinforcement Learning. In: Long, G., Yu, X., Wang, S. (eds) AI 2021: Advances in Artificial Intelligence. AI 2022. Lecture Notes in Computer Science, vol 13151. Springer, Cham. https://doi.org/10.1007/978-3-030-97546-3_46
DOI: https://doi.org/10.1007/978-3-030-97546-3_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-97545-6
Online ISBN: 978-3-030-97546-3
eBook Packages: Computer Science, Computer Science (R0)