Abstract
Proximal Policy Optimization (PPO) is one of the classic and most widely used algorithms in Deep Reinforcement Learning (DRL). However, PPO still suffers from two problems. First, PPO restricts each policy update to a limited range, which exposes it to the risk of insufficient exploration. Second, PPO adopts a mini-batch update scheme, which leads to interference from negative advantage estimates. To address these issues, we propose a new model-free algorithm, called Upper Confident Bound Advantage Function Proximal Policy Optimization (UCB-AF), which estimates the confidence of the advantage estimation through Hoeffding's inequality and adjusts the advantage estimate upward with an upper confidence bound. Moreover, compared to PPO in multiple complex environments, our method not only improves exploration ability but also enjoys a better performance bound.
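As a rough illustration of the mechanism described in the abstract, the sketch below adds a Hoeffding-style upper confidence bound to an empirical advantage estimate as an optimism bonus. This is a minimal sketch under our own assumptions (i.i.d. mini-batch samples bounded in an interval of known width, and the hypothetical helper name hoeffding_ucb_advantage); it is not the authors' exact UCB-AF formulation.

import numpy as np

def hoeffding_ucb_advantage(adv_samples, sample_range, delta=0.05):
    # Illustrative only: optimistic advantage estimate via Hoeffding's inequality.
    # adv_samples  : 1-D array of advantage estimates from a mini-batch,
    #                assumed i.i.d. and bounded in an interval of width sample_range.
    # sample_range : assumed width (b - a) of that interval.
    # delta        : failure probability of the confidence bound.
    n = len(adv_samples)
    mean_adv = np.mean(adv_samples)
    # Hoeffding's inequality: with probability at least 1 - delta,
    # |mean_adv - E[A]| <= (b - a) * sqrt(ln(2 / delta) / (2 * n)).
    bonus = sample_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    # Upper confidence bound: empirical advantage plus the optimism bonus.
    return mean_adv + bonus

# Example usage: a mini-batch of advantage estimates for one state-action pair.
adv = np.array([0.3, -0.1, 0.5, 0.2, 0.0, 0.4])
print(hoeffding_ucb_advantage(adv, sample_range=2.0))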
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants No. 62003207 and No. 61773350, and by the China Postdoctoral Science Foundation funded project No. 2021M690629. Wei Zhang is the corresponding author, and we declare that there is no conflict of interest regarding the publication of this article.
Funding
This work is supported by the National Natural Science Foundation of China under Grants No. 62003207 and No. 61773350, and by the China Postdoctoral Science Foundation funded project No. 2021M690629.
Author information
Authors and Affiliations
Contributions
All authors contributed to this work from different aspects. GX, WZ, ZH, and GL performed the conceptualization, methodology, modeling, validation, and results analysis. GX and WZ wrote the original draft of this manuscript. All authors commented on previous versions of the manuscript, and then read and approved its current version.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical approval
Ethics approval was not required for this research.
Human and animal participants
This work did not involve humans or animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, G., Zhang, W., Hu, Z. et al. Upper confident bound advantage function proximal policy optimization. Cluster Comput 26, 2001–2010 (2023). https://doi.org/10.1007/s10586-022-03742-9