ORACLE: End-to-End Model Based Reinforcement Learning

  • Conference paper
  • First Online:

Artificial Intelligence XXXVIII (SGAI-AI 2021)

Abstract

Reinforcement Learning (RL) algorithms seek to maximize some notion of reward. RL agents fall into two categories: model-based and model-free. A model-free agent learns through trial and error in the target environment, whereas a model-based agent trains in a learned or known model of the environment instead.

Model-free reinforcement learning shows promising results in simulated environments but falls short in real-world environments, because trial and error is a poor fit for settings where errors carry an economic cost. Model-based reinforcement learning (MBRL), on the other hand, aims to exploit a known or learned dynamics model, which substantially increases sample efficiency. This paper focuses on learning a dynamics model and using the learned model to train several model-free algorithms by directly sampling from it. However, it is challenging to achieve good accuracy in dynamics models for highly complex domains due to stochasticity and compounding noise in the system. Most model-based RL work focuses on dynamics models that derive policies directly from the observation space, which is problematic because the observation space is often high-dimensional and complex.
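
To make the recipe described above concrete, the sketch below fits a dynamics model on previously logged transitions and then trains a model-free learner entirely by sampling the learned model rather than the real environment. This is a minimal tabular illustration of that Dyna-style loop, not the paper's architecture; the toy dynamics, the empirical transition model, and the use of Q-learning are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Logged real transitions (s, a, r, s'); here generated from a toy
# stand-in for the unknown real dynamics, purely for illustration.
S, A = 8, 4                       # small discrete state and action spaces
counts = np.zeros((S, A, S))      # visit counts for (s, a, s')
reward_sum = np.zeros((S, A))     # accumulated rewards per (s, a)
for _ in range(5000):
    s, a = rng.integers(S), rng.integers(A)
    s_next = (s + a) % S          # toy "real" dynamics
    counts[s, a, s_next] += 1
    reward_sum[s, a] += float(s_next == S - 1)

# Learned dynamics model: empirical next-state distribution and mean reward.
visits = counts.sum(axis=2)
model = counts / np.maximum(visits[..., None], 1)
reward_model = reward_sum / np.maximum(visits, 1)

# Train a model-free learner (tabular Q-learning) purely inside the learned model.
Q = np.zeros((S, A))
for _ in range(20000):
    s, a = rng.integers(S), rng.integers(A)
    s_next = rng.choice(S, p=model[s, a])          # sample the model, not the env
    target = reward_model[s, a] + 0.95 * Q[s_next].max()
    Q[s, a] += 0.1 * (target - Q[s, a])

print("Greedy policy per state:", Q.argmax(axis=1))
```

In the setting targeted by the paper, the logged transitions would come from a system that is already running, and the tabular model and Q-learning above would be replaced by a learned dynamics model and an arbitrary model-free algorithm.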

This paper proposes an end-to-end model-based reinforcement learning algorithm that trains model-free agents to act in an environment without trial and error in the real environment. This method is beneficial for existing installations that already employ a decision-making system, such as an expert system. The proposed algorithm shares the fundamental learning principles of the Dreaming Variational Autoencoder but is substantially different architecturally. We show that the algorithm is more sample efficient and performs comparably with existing model-free approaches. We also demonstrate that the algorithm is actor-agnostic, enabling existing model-free algorithms to operate in a model-based context.
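
The actor-agnostic property can be pictured as wrapping the learned dynamics model behind the same step/reset interface the real environment exposes, so that any existing model-free algorithm can be trained against it unchanged. The sketch below shows one way such a wrapper could look; the class and method names are assumptions for illustration and not the authors' implementation (see the repository linked in the notes for that).

```python
import numpy as np


class LearnedModelEnv:
    """Exposes a learned dynamics model through a Gym-like step/reset API."""

    def __init__(self, transition_model, reward_model, initial_states, horizon=200):
        self.p = transition_model       # p[s, a]: distribution over next states
        self.r = reward_model           # r[s, a]: estimated immediate reward
        self.initial_states = initial_states
        self.horizon = horizon
        self.rng = np.random.default_rng()

    def reset(self):
        self.t = 0
        self.s = int(self.rng.choice(self.initial_states))
        return self.s

    def step(self, action):
        probs = self.p[self.s, action]
        next_s = int(self.rng.choice(len(probs), p=probs))  # imagined transition
        reward = float(self.r[self.s, action])
        self.t += 1
        self.s = next_s
        done = self.t >= self.horizon
        return next_s, reward, done, {}
```

Because the wrapper behaves like an ordinary environment, off-the-shelf model-free agents (for example a DQN- or PPO-style learner) can be trained on it directly, and the real system is only needed to collect the transitions used to fit the model.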

Notes

  1. Here, ‘sufficient’ means training until a satisfactory performance is reached in terms of average return.

  2. We refer the reader to https://github.com/perara/oracle for a detailed implementation in Python.

  3. We take this opportunity to invite the RL community to consider open-source benchmarks for easier comparison of scientific results.

  4. We make the reader aware that the experiments are compute-heavy, hence the small number of experiment iterations. In total, the experiments take approximately 5 days of wall-clock time to train on consumer-level hardware.

References

  1. Andersen, P., Goodwin, M., Granmo, O.: Deep RTS: a game environment for deep reinforcement learning in real-time strategy games. In: 2018 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8 (2018). https://doi.org/10.1109/CIG.2018.8490409

  2. Andersen, P.-A., Goodwin, M., Granmo, O.-C.: The dreaming variational autoencoder for reinforcement learning environments. In: Bramer, M., Petridis, M. (eds.) SGAI 2018. LNCS (LNAI), vol. 11311, pp. 143–155. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04191-5_11

  3. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34(6), 26–38 (2017). https://doi.org/10.1109/MSP.2017.2743240

  4. Chua, K., Calandra, R., McAllister, R., Levine, S.: Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 4754–4765. Curran Associates, Inc. (2018)

  5. Coumans, E., Bai, Y.: PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org

  6. Deisenroth, M., Rasmussen, C.E.: PILCO: a model-based and data-efficient approach to policy search. In: Proceedings of the 28th International Conference on Machine Learning, ICML’11, pp. 465–472. Citeseer (2011)

  7. Doerr, A., et al.: Probabilistic recurrent state-space models. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 1280–1289. PMLR (2018). http://proceedings.mlr.press/v80/doerr18a.html

  8. Draganjac, I., Miklic, D., Kovacic, Z., Vasiljevic, G., Bogdan, S.: Decentralized control of multi-AGV systems in autonomous warehousing applications. IEEE Trans. Autom. Sci. Eng. 13(4), 1433–1447 (2016). https://doi.org/10.1109/TASE.2016.2603781

  9. Fraccaro, M.: Deep latent variable models for sequential data (2018). https://orbit.dtu.dk/en/publications/deep-latent-variable-models-for-sequential-data

  10. Fuchs, A., Heider, Y., Wang, K., Sun, W.C., Kaliske, M.: DNN2: a hyper-parameter reinforcement learning game for self-design of neural network based elasto-plastic constitutive descriptions. Comput. Struct. 249, 106505 (2021). https://doi.org/10.1016/j.compstruc.2021.106505

  11. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)

  12. Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: learning behaviors by latent imagination. In: Proceedings 8th International Conference on Learning Representations, ICLR’20 (2020). https://openreview.net/forum?id=S1lOTC4tDS

  13. Hafner, D., et al.: Learning latent dynamics for planning from pixels. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings 36th International Conference on Machine Learning, ICML’19, vol. 97, pp. 2555–2565. PMLR, Long Beach (2019). http://proceedings.mlr.press/v97/hafner19a/hafner19a.pdf

  14. Hafner, D., Lillicrap, T.P., Norouzi, M., Ba, J.: Mastering atari with discrete world models. In: Proceedings 9th International Conference on Learning Representations, ICLR’21 (2021). https://openreview.net/forum?id=0oabwyZbOu

  15. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learning. In: Proc. 32nd Conference on Artificial Intelligence, AAAI’18, pp. 3215–3222. AAAI Press, New Orleans (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/download/17204/16680

  16. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. In: Silva, R., Globerson, A. (eds.) 34th Conference on Uncertainty in Artificial Intelligence 2018, pp. 876–885. Association for Uncertainty in Artificial Intelligence (2018). http://arxiv.org/abs/1803.05407

  17. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: Introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999). https://doi.org/10.1023/A:1007665907178

  18. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Proceedings of the 2nd International Conference on Learning Representations (2013). https://doi.org/10.1051/0004-6361/201527329, http://arxiv.org/abs/1312.6114

  19. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Proceedings 7th International Conference on Learning Representations, ICLR’19 (2019). https://openreview.net/forum?id=Bkg6RiCqY7

  20. Mallozzi, P., Pelliccione, P., Knauss, A., Berger, C., Mohammadiha, N.: Autonomous vehicles: state of the art, future trends, and challenges. In: Automotive Systems and Software Engineering, pp. 347–367. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-12157-0_16

  21. Moerland, T.M., Broekens, J., Jonker, C.M.: Model-based reinforcement learning: a survey (2020). arXiv preprint arXiv:2006.16712

  22. Ozair, S., Li, Y., Razavi, A., Antonoglou, I., van den Oord, A., Vinyals, O.: Vector quantized models for planning. In: Proceedings 38th International Conference on Machine Learning, ICML’21 (2021). http://arxiv.org/abs/2106.04615

  23. Razavi, A., van den Oord, A., Poole, B., Vinyals, O.: Preventing posterior collapse with delta-VAEs. In: Proceedings 7th International Conference on Learning Representations, ICLR’19 (2019). https://openreview.net/forum?id=BJe0Gn0cY7

  24. Razavi, A., van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with VQ-VAE-2. In: Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 14837–14847. Curran Associates Inc., Vancouver (2019). http://papers.nips.cc/paper/9625-generating-diverse-high-fidelity-images-with-vq-vae-2

  25. Schrittwieser, J., et al.: Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020). https://doi.org/10.1038/s41586-020-03051-4

  26. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017). arXiv preprint arXiv:1707.06347

  27. Seetharaman, P., Wichern, G., Pardo, B., Roux, J.L.: Autoclip: adaptive gradient clipping for source separation networks. In: IEEE International Workshop on Machine Learning for Signal Processing, MLSP, vol. 2020-September. IEEE Computer Society (2020). https://doi.org/10.1109/MLSP49062.2020.9231926

  28. Sutton, R.S.: Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bull. 2(4), 160–163 (1991). https://doi.org/10.1145/122344.122377

  29. Varghese, N.V., Mahmoud, Q.H.: A survey of multi-task deep reinforcement learning. Electronics 9(9) (2020). https://doi.org/10.3390/electronics9091363

  30. Yu, C., Liu, J., Nemati, S.: Reinforcement learning in healthcare: a survey (2019). arXiv preprint arXiv:1908.08796

Author information

Correspondence to Per-Arne Andersen.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Andersen, P.-A., Goodwin, M., Granmo, O.-C. (2021). ORACLE: End-to-End Model Based Reinforcement Learning. In: Bramer, M., Ellis, R. (eds.) Artificial Intelligence XXXVIII. SGAI-AI 2021. Lecture Notes in Computer Science, vol. 13101. Springer, Cham. https://doi.org/10.1007/978-3-030-91100-3_4

  • DOI: https://doi.org/10.1007/978-3-030-91100-3_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91099-0

  • Online ISBN: 978-3-030-91100-3
