DOI: 10.1145/1553374.1553501

Fast gradient-descent methods for temporal-difference learning with linear function approximation

Published: 14 June 2009

Abstract

Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
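As a concrete illustration of the update described in the abstract, the sketch below contrasts conventional linear TD(0) with the TDC update: TDC performs the same update as TD plus a correction term that is zero at initialization, because the secondary weight vector w starts at zero, and both updates cost O(n) per transition. This is a minimal sketch based on the abstract's description, not the authors' reference implementation; the variable names (theta, w, alpha, beta) and the exact arrangement of the correction term are assumptions consistent with the paper's presentation.

```python
import numpy as np

def td0_update(theta, phi, phi_next, reward, alpha, gamma):
    """One step of conventional linear TD(0) (for comparison)."""
    delta = reward + gamma * phi_next.dot(theta) - phi.dot(theta)  # TD error
    return theta + alpha * delta * phi

def tdc_update(theta, w, phi, phi_next, reward, alpha, beta, gamma):
    """One step of linear TD with gradient correction (TDC), O(n) per step.

    Illustrative sketch: the first term matches conventional TD; the second
    term is the gradient correction, which vanishes while w == 0.
    """
    delta = reward + gamma * phi_next.dot(theta) - phi.dot(theta)  # TD error
    theta = theta + alpha * (delta * phi - gamma * phi_next * phi.dot(w))
    # Secondary weights track the expected TD error projected onto the
    # features; maintaining them is the source of the doubled computation.
    w = w + beta * (delta - phi.dot(w)) * phi
    return theta, w

# Toy usage with random features, just to show the O(n) per-step cost.
rng = np.random.default_rng(0)
n = 8
theta, w = np.zeros(n), np.zeros(n)
for _ in range(100):
    phi, phi_next = rng.random(n), rng.random(n)
    theta, w = tdc_update(theta, w, phi, phi_next, reward=rng.random(),
                          alpha=0.01, beta=0.05, gamma=0.9)
```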

References

[1] Antos, A., Szepesvári, Cs., Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 71:89--129.
[2] Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pp. 30--37.
[3] Baird, L. C. (1999). Reinforcement Learning Through Gradient Descent. PhD thesis, Carnegie Mellon University.
[4] Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics 23(2):357--365.
[5] Borkar, V. S. (1997). Stochastic approximation with two timescales. Systems and Control Letters 29:291--294.
[6] Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447--469.
[7] Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning 49:233--246.
[8] Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22:33--57.
[9] Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning 8:341--362.
[10] Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-squares temporal difference learning. In Proceedings of AAAI, pp. 356--361.
[11] Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks 2:331--349.
[12] Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pp. 417--424.
[13] Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy learning with recognizers. Advances in Neural Information Processing Systems 18.
[14] Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. In Proceedings of the 20th IJCAI, pp. 1053--1058.
[15] Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.
[16] Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning 3:9--44.
[17] Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181--211.
[18] Sutton, R. S., Szepesvári, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21.




Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009
1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2009


Qualifiers

  • Research-article

Conference

ICML '09
Sponsor:
  • Microsoft Research

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%

Article Metrics

  • Downloads (last 12 months): 142
  • Downloads (last 6 weeks): 18

Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) On Convergence Rate of MRetrace. Mathematics 12(18):2930. DOI: 10.3390/math12182930. Online publication date: 20-Sep-2024.
  • (2024) Distributed Multi-Agent Gradient Based Q-Learning with Linear Function Approximation. 2024 European Control Conference (ECC), pp. 2500--2505. DOI: 10.23919/ECC64448.2024.10590764. Online publication date: 25-Jun-2024.
  • (2024) Joint Principal Component Analysis and Supervised k Means Algorithm via Non-Iterative Analytic Optimization Approach. IEEE Transactions on Signal Processing 72:1348--1360. DOI: 10.1109/TSP.2024.3365944. Online publication date: 1-Jan-2024.
  • (2024) High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation. IEEE Transactions on Information Theory 70(8):5969--5999. DOI: 10.1109/TIT.2024.3394685. Online publication date: Aug-2024.
  • (2024) Continuous-Time Distributed Dynamic Programming for Networked Multi-Agent Markov Decision Processes. 2024 IEEE 18th International Conference on Control & Automation (ICCA), pp. 960--967. DOI: 10.1109/ICCA62789.2024.10591854. Online publication date: 18-Jun-2024.
  • (2024) Reinforcement learning and bandits for speech and language processing: Tutorial, review and outlook. Expert Systems with Applications 238:122254. DOI: 10.1016/j.eswa.2023.122254. Online publication date: Mar-2024.
  • (2024) Multi-agent Gradient-Based Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning. International Journal of Computational Intelligence Systems 17(1). DOI: 10.1007/s44196-024-00560-2. Online publication date: 24-Jun-2024.
  • (2024) On the Analysis of Model-Free Methods for the Linear Quadratic Regulator. Journal of the Operations Research Society of China. DOI: 10.1007/s40305-024-00546-z. Online publication date: 9-Jul-2024.
  • (2024) Finite-time error bounds for Greedy-GQ. Machine Learning 113(9):5981--6018. DOI: 10.1007/s10994-024-06542-x. Online publication date: 30-Apr-2024.
  • (2024) Human operator decision support for highly transient industrial processes: a reinforcement learning approach. Journal of Intelligent Manufacturing. DOI: 10.1007/s10845-023-02295-x. Online publication date: 11-Jan-2024.
