Abstract
We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.
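The abstract concerns convergence of Q-learning viewed as a stochastic approximation scheme. As a concrete illustration (not taken from the paper), the sketch below implements tabular Q-learning with state-action-dependent step sizes α = 1/n(s,a), which satisfy the standard stochastic-approximation conditions Σα = ∞, Σα² < ∞. The toy two-state MDP, function names, and parameters are all hypothetical choices for the example.

```python
import random

def q_learning(step, reward, n_states, n_actions, gamma=0.9,
               episodes=500, horizon=50, epsilon=0.2, seed=0):
    """Tabular Q-learning sketch.

    step(s, a, rng) -> next state; reward(s, a) -> immediate reward.
    Uses alpha = 1 / visits(s, a), a Robbins-Monro step-size schedule.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy exploration keeps every (s, a) pair visited
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2 = step(s, a, rng)
            visits[s][a] += 1
            alpha = 1.0 / visits[s][a]  # sum(alpha) = inf, sum(alpha^2) < inf
            # standard Q-learning update toward the one-step Bellman target
            Q[s][a] += alpha * (reward(s, a) + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

# Hypothetical two-state chain: action 1 toggles the state, action 0 stays;
# occupying state 1 pays reward 1, so moving to (and staying in) state 1
# should come to dominate the learned Q-values.
Q = q_learning(step=lambda s, a, rng: 1 - s if a == 1 else s,
               reward=lambda s, a: 1.0 if s == 1 else 0.0,
               n_states=2, n_actions=2)
```

With enough visits the iterates approach the Bellman fixed point; for this chain the learned values should rank "move toward state 1" above "stay in state 0", and "stay in state 1" above "leave it".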
Tsitsiklis, J.N. Asynchronous Stochastic Approximation and Q-Learning. Machine Learning 16, 185–202 (1994). https://doi.org/10.1023/A:1022689125041