
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Published: 01 March 2000

Abstract

An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that, when followed during learning, result in convergence to both optimal values and optimal policies.
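
For concreteness, the sketch below illustrates the kind of single-step on-policy algorithm the abstract describes: a tabular Sarsa-style update paired with an epsilon-greedy policy whose exploration decays over time, so the policy becomes greedy in the limit while every action continues to be tried during learning. The toy MDP, the decay schedule, and the step-size rule are illustrative assumptions for this sketch, not the paper's exact construction or conditions.

    import numpy as np

    # Illustrative tabular Sarsa(0)-style update with decaying epsilon-greedy exploration.
    # The toy MDP, the decay schedule, and the step-size rule are assumptions made for this
    # sketch, not the constructions analyzed in the paper.

    n_states, n_actions = 2, 2
    gamma = 0.9
    rng = np.random.default_rng(0)

    def step(s, a):
        # Toy dynamics: action 0 stays in place, action 1 flips the state;
        # reward 1 whenever the next state is state 1.
        s_next = s if a == 0 else 1 - s
        return s_next, (1.0 if s_next == 1 else 0.0)

    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))

    def epsilon_greedy(s, t):
        eps = 1.0 / (1.0 + t / 100.0)   # exploration decays toward zero over time
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    s, a = 0, epsilon_greedy(0, 0)
    for t in range(20000):
        s_next, r = step(s, a)
        a_next = epsilon_greedy(s_next, t)   # on-policy: bootstrap on the action actually taken next
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]           # step sizes satisfying the usual stochastic-approximation conditions
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next

    print(Q)  # the greedy policy w.r.t. Q should pick action 1 in state 0 and action 0 in state 1

Replacing the decaying schedule with a constant epsilon would be one simple instance of the persistent-exploration regime the abstract also mentions.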

Information & Contributors

Information

Published In

cover image Machine Language
Machine Language  Volume 38, Issue 3
March 2000
95 pages
ISSN:0885-6125
Issue’s Table of Contents

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. Markov decision processes
  2. convergence
  3. on-policy
  4. reinforcement-learning
