Abstract
Proximal Policy Optimization (PPO) is one of the classic and most widely used algorithms in Deep Reinforcement Learning (DRL). However, PPO still suffers from two problems. First, PPO restricts each policy update to a limited range, which exposes it to the risk of insufficient exploration. Second, PPO adopts a mini-batch update scheme, which leads to interference from negative advantage estimates. To address these issues, we propose a new model-free algorithm, called Upper Confident Bound Advantage Function Proximal Policy Optimization (UCB-AF), which estimates the confidence of the advantage estimation through Hoeffding's inequality and adjusts the advantage estimate upward with an upper confidence bound. Moreover, compared to PPO in multiple complex environments, our method not only improves exploration ability but also enjoys a better performance bound.
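As a rough illustration of the mechanism described in the abstract, the sketch below adds a Hoeffding-style upper confidence bound to an empirical advantage estimate as an optimism bonus. This is a minimal sketch under our own assumptions (i.i.d. mini-batch samples bounded in an interval of known width, and the hypothetical helper name hoeffding_ucb_advantage); it is not the authors' exact UCB-AF formulation.

import numpy as np

def hoeffding_ucb_advantage(adv_samples, sample_range, delta=0.05):
    # Illustrative only: optimistic advantage estimate via Hoeffding's inequality.
    # adv_samples  : 1-D array of advantage estimates from a mini-batch,
    #                assumed i.i.d. and bounded in an interval of width sample_range.
    # sample_range : assumed width (b - a) of that interval.
    # delta        : failure probability of the confidence bound.
    n = len(adv_samples)
    mean_adv = np.mean(adv_samples)
    # Hoeffding's inequality: with probability at least 1 - delta,
    # |mean_adv - E[A]| <= (b - a) * sqrt(ln(2 / delta) / (2 * n)).
    bonus = sample_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    # Upper confidence bound: empirical advantage plus the optimism bonus.
    return mean_adv + bonus

# Example usage: a mini-batch of advantage estimates for one state-action pair.
adv = np.array([0.3, -0.1, 0.5, 0.2, 0.0, 0.4])
print(hoeffding_ucb_advantage(adv, sample_range=2.0))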
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants No. 62003207 and No. 61773350, and by the China Postdoctoral Science Foundation funded project No. 2021M690629. Wei Zhang is the corresponding author, and we declare that there is no conflict of interest regarding the publication of this article.
Funding
This work is supported by the National Natural Science Foundation of China under Grants No. 62003207 and No. 61773350, and by the China Postdoctoral Science Foundation funded project No. 2021M690629.
Author information
Authors and Affiliations
Contributions
All authors contributed to this work from different aspects. GX, WZ, ZH, and GL performed the conceptualization, methodology, modeling, validation, and results analysis. GX and WZ wrote the original draft of this manuscript. All authors commented on previous versions of the manuscript, and then read and approved its current version.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical approval
Ethics approval was not required for this research.
Human and animal participants
This work did not involve humans or animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, G., Zhang, W., Hu, Z. et al. Upper confident bound advantage function proximal policy optimization. Cluster Comput 26, 2001–2010 (2023). https://doi.org/10.1007/s10586-022-03742-9