Abstract
For actor-critic methods in reinforcement learning, it is vital to learn a useful critic so that the actor can be guided efficiently and properly. Previous methods mainly seek to estimate more accurate Q-values. However, in continuous control scenarios where the actor is updated via the deterministic policy gradient, only the action gradient (AG) of the critic is actually used to update the actor. Leveraging the action gradient of Q-functions for policy guidance is therefore a promising way to achieve higher sample efficiency. Nevertheless, we empirically find that directly incorporating the action gradient into critic learning degrades the agent's performance, as it can easily become trapped in local maxima. To fully exploit the benefits of the action gradient while escaping local optima, we propose Periodic Regularized Action Gradient (PRAG), which periodically involves the action gradient in critic learning and additionally maximizes the target value. On a set of MuJoCo continuous control tasks, we show that PRAG achieves higher sample efficiency and better final performance than common model-free baselines, without much extra training cost. Our code is available at: https://github.com/Galaxy-Li/PRAG.
X. Li and Z. Qiao—Equal contribution.
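The abstract describes the method only at a high level. Below is a minimal, hypothetical PyTorch sketch of what a "periodic action-gradient regularizer plus target-value maximization" critic update could look like in a DDPG/TD3-style setting. The function name, hyperparameters (period, ag_coef, n_candidates, noise_std), and the exact form of both extra terms are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def prag_critic_update(critic, critic_target, actor_target, batch, optimizer,
                       step, gamma=0.99, period=2, ag_coef=0.05,
                       n_candidates=4, noise_std=0.1):
    # Unpack a replay-buffer batch of tensors; `done` is a 0/1 flag.
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # "Maximize the target value" (one plausible reading): evaluate the
        # target-policy action plus a few perturbed candidate actions and keep
        # the largest target Q-value.
        base = actor_target(next_state)
        candidates = [base] + [
            (base + noise_std * torch.randn_like(base)).clamp(-1.0, 1.0)
            for _ in range(n_candidates)
        ]
        next_q = torch.stack(
            [critic_target(next_state, a) for a in candidates]
        ).max(dim=0).values
        target_q = reward + gamma * (1.0 - done) * next_q

    # Standard TD loss on the current critic.
    q = critic(state, action)
    loss = F.mse_loss(q, target_q)

    # Periodic action-gradient (AG) regularization: only every `period` updates,
    # penalize the critic's gradient w.r.t. the action (double backprop), so the
    # AG later used by the deterministic policy gradient stays well behaved.
    if step % period == 0:
        action_req = action.detach().requires_grad_(True)
        q_ag = critic(state, action_req)
        action_grad = torch.autograd.grad(q_ag.sum(), action_req,
                                          create_graph=True)[0]
        loss = loss + ag_coef * action_grad.pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()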
Acknowledgement
The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported in part by the Science and Technology Innovation 2030-Key Project under Grant 2021ZD0201404.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, X. et al. (2022). PRAG: Periodic Regularized Action Gradient for Efficient Continuous Control. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13631. Springer, Cham. https://doi.org/10.1007/978-3-031-20868-3_8