Upper confident bound advantage function proximal policy optimization

Cluster Computing

Abstract

Proximal Policy Optimization (PPO) is one of the classic and most effective algorithms in Deep Reinforcement Learning (DRL). However, PPO still has two problems. First, PPO restricts each policy update to a limited range, which leaves it prone to insufficient exploration. Second, PPO uses a mini-batch update scheme, which introduces interference from negative advantage estimates. To address these issues, we propose a new model-free algorithm, called Upper Confident Bound Advantage Function Proximal Policy Optimization (UCB-AF), which estimates the confidence of the advantage estimate through Hoeffding's inequality and adjusts the advantage estimate upward with an upper confidence bound. Moreover, compared with PPO in multiple complex environments, our method not only improves exploration ability but also enjoys a better performance bound.
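
The idea described above can be illustrated with a minimal sketch (not the authors' implementation) of how a Hoeffding-based confidence radius could be added to a mini-batch of advantage estimates to obtain an optimistic, upper-confidence-bound estimate. The function name, the confidence level delta, and the assumed bound on the advantage range are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (illustrative only): add a Hoeffding upper-confidence-bound
# term to mini-batch advantage estimates before the PPO policy update.
# `delta` and `advantage_range` are assumed values, not from the paper.
import numpy as np

def hoeffding_ucb_advantages(advantages, delta=0.05, advantage_range=1.0):
    """Shift advantage estimates upward by a Hoeffding confidence radius.

    For n i.i.d. samples each bounded in an interval of width R, Hoeffding's
    inequality gives a (1 - delta) confidence radius of
        R * sqrt(ln(2 / delta) / (2 * n)).
    Adding this radius yields an optimistic (UCB) advantage estimate.
    """
    n = len(advantages)
    radius = advantage_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return advantages + radius

# Example: optimistic advantages for one PPO mini-batch of GAE estimates.
batch_advantages = np.random.randn(64)
optimistic_advantages = hoeffding_ucb_advantages(batch_advantages)
```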


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.


Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants No. 62003207 and No. 61773350, and the China Postdoctoral Science Foundation funded project No. 2021M690629. Wei Zhang is the corresponding author, and we declare that there is no conflict of interest regarding the publication of this article.

Funding

This work is supported by the National Natural Science Foundation of China under Grants No. 62003207 and No. 61773350, and China Postdoctoral Science Foundation funded project No. 2021M690629.

Author information

Contributions

All authors contributed to this work in different ways. GX, WZ, ZH, and GL performed the conceptualization, methodology, modeling, validation, and results analysis. GX and WZ wrote the original draft of this manuscript. All authors commented on previous versions of the manuscript and read and approved its current version.

Corresponding author

Correspondence to Wei Zhang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethical approval

Ethics approval was not required for this research.

Research involving human and animal participants

This work did not involve humans or animals.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xie, G., Zhang, W., Hu, Z. et al. Upper confident bound advantage function proximal policy optimization. Cluster Comput 26, 2001–2010 (2023). https://doi.org/10.1007/s10586-022-03742-9
