Abstract
For actor-critic methods in reinforcement learning, it is vital to learn a useful critic so that the actor can be guided efficiently and properly. Previous methods mainly seek to estimate more accurate Q-values. However, in continuous control scenarios where the actor is updated via the deterministic policy gradient, only the action gradient (AG) of the critic is actually used to update the actor. Leveraging the action gradient of Q-functions for policy guidance is therefore a promising way to achieve higher sample efficiency. Nevertheless, we empirically find that directly incorporating the action gradient into critic learning degrades the agent's performance, as it can easily become trapped in local maxima. To fully exploit the benefits of the action gradient while escaping local optima, we propose Periodic Regularized Action Gradient (PRAG), which periodically involves the action gradient in critic learning and additionally maximizes the target value. On a set of MuJoCo continuous control tasks, we show that PRAG achieves higher sample efficiency and better final performance than common model-free baselines, without much extra training cost. Our code is available at: https://github.com/Galaxy-Li/PRAG.
X. Li and Z. Qiao—Equal contribution.
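The abstract describes the method only at a high level. Below is a minimal, hypothetical PyTorch sketch of what a "periodic action-gradient regularizer plus target-value maximization" critic update could look like in a DDPG/TD3-style setting. The function name, hyperparameters (period, ag_coef, n_candidates, noise_std), and the exact form of both extra terms are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def prag_critic_update(critic, critic_target, actor_target, batch, optimizer,
                       step, gamma=0.99, period=2, ag_coef=0.05,
                       n_candidates=4, noise_std=0.1):
    # Unpack a replay-buffer batch of tensors; `done` is a 0/1 flag.
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # "Maximize the target value" (one plausible reading): evaluate the
        # target-policy action plus a few perturbed candidate actions and keep
        # the largest target Q-value.
        base = actor_target(next_state)
        candidates = [base] + [
            (base + noise_std * torch.randn_like(base)).clamp(-1.0, 1.0)
            for _ in range(n_candidates)
        ]
        next_q = torch.stack(
            [critic_target(next_state, a) for a in candidates]
        ).max(dim=0).values
        target_q = reward + gamma * (1.0 - done) * next_q

    # Standard TD loss on the current critic.
    q = critic(state, action)
    loss = F.mse_loss(q, target_q)

    # Periodic action-gradient (AG) regularization: only every `period` updates,
    # penalize the critic's gradient w.r.t. the action (double backprop), so the
    # AG later used by the deterministic policy gradient stays well behaved.
    if step % period == 0:
        action_req = action.detach().requires_grad_(True)
        q_ag = critic(state, action_req)
        action_grad = torch.autograd.grad(q_ag.sum(), action_req,
                                          create_graph=True)[0]
        loss = loss + ag_coef * action_grad.pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()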
Acknowledgement
The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported in part by the Science and Technology Innovation 2030-Key Project under Grant 2021ZD0201404.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, X. et al. (2022). PRAG: Periodic Regularized Action Gradient for Efficient Continuous Control. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13631. Springer, Cham. https://doi.org/10.1007/978-3-031-20868-3_8