The Sufficiency of Off-Policyness and Soft Clipping: PPO Is Still Insufficient according to an Off-Policy Measure

Authors

  • Xing Chen, School of Artificial Intelligence, Jilin University, Changchun, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China
  • Dongcui Diao, Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
  • Hechang Chen, School of Artificial Intelligence, Jilin University, Changchun, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China
  • Hengshuai Yao, Department of Computing Science, University of Alberta, Edmonton, Canada
  • Haiyin Piao, School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
  • Zhixiao Sun, School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
  • Zhiwei Yang, School of Artificial Intelligence, Jilin University, Changchun, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China
  • Randy Goebel, Department of Computing Science, University of Alberta, Edmonton, Canada; Alberta Machine Intelligence Institute, University of Alberta, Edmonton, Canada
  • Bei Jiang, Alberta Machine Intelligence Institute, University of Alberta, Edmonton, Canada; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
  • Yi Chang, School of Artificial Intelligence, Jilin University, Changchun, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China

DOI:

https://doi.org/10.1609/aaai.v37i6.25864

Keywords:

ML: Reinforcement Learning Algorithms, ML: Optimization

Abstract

The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Do better policies exist outside of this space? Using a novel surrogate objective that employs the sigmoid function (which provides an interesting way of exploration), we find that the answer is "yes", and that these better policies are in fact located very far from the clipped space. We show that PPO is insufficient in "off-policyness", according to an off-policy metric called DEON. Our algorithm explores a much larger policy space than PPO and maximizes the Conservative Policy Iteration (CPI) objective better than PPO during training. To the best of our knowledge, all existing PPO methods use the clipping operation and optimize within the clipped policy space. Our method is the first of its kind, and it advances the understanding of CPI optimization and policy gradient methods. Code is available at https://github.com/raincchio/P3O.
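To make the clipping discussion concrete, below is a minimal numerical sketch contrasting PPO's hard-clipped surrogate with a generic sigmoid-shaped "soft clipping" surrogate. It is an illustration only: the function names, the temperature parameter tau, and the exact form of the soft objective are assumptions made for this sketch, not the P3O objective, which is defined in the paper and the linked repository.

import numpy as np

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    # Standard PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    # Outside [1 - eps, 1 + eps] the clipped term is constant in r, so the
    # gradient with respect to the policy vanishes there.
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

def sigmoid_soft_surrogate(ratio, advantage, tau=2.0):
    # Hypothetical soft surrogate (not the paper's): weight the advantage by
    # a sigmoid of the log-ratio, so the objective keeps changing smoothly,
    # and keeps a nonzero gradient, even far from ratio = 1.
    log_ratio = np.log(ratio)
    weight = 1.0 / (1.0 + np.exp(-tau * log_ratio))
    return weight * advantage

if __name__ == "__main__":
    advantage = 1.0
    print("ratio   hard-clip   sigmoid-soft")
    for r in [0.5, 0.9, 1.0, 1.2, 2.0, 5.0]:
        print(f"{r:5.2f}  {ppo_clip_surrogate(r, advantage):9.3f}"
              f"  {sigmoid_soft_surrogate(r, advantage):12.3f}")

The printed table simply shows that the hard-clipped value is flat for ratios beyond 1 + eps, while the sigmoid-weighted value keeps varying, which is the qualitative property the abstract refers to when it describes exploring policies far outside the clipped space.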

Published

2023-06-26

How to Cite

Chen, X., Diao, D., Chen, H., Yao, H., Piao, H., Sun, Z., Yang, Z., Goebel, R., Jiang, B., & Chang, Y. (2023). The Sufficiency of Off-Policyness and Soft Clipping: PPO Is Still Insufficient according to an Off-Policy Measure. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6), 7078-7086. https://doi.org/10.1609/aaai.v37i6.25864

Issue

Vol. 37 No. 6 (2023)

Section

AAAI Technical Track on Machine Learning I