Abstract
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
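To make the "independent estimation of the average reward and the relative values" concrete, the following is a minimal sketch of a tabular R-learning agent in the spirit of Schwartz's algorithm. The grid-world action names, learning rates, and the epsilon-greedy exploration routine are illustrative assumptions rather than the exact settings used in the paper's experiments.

import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]   # hypothetical grid-world action set

R = defaultdict(float)    # relative action values R(s, a)
rho = 0.0                 # running estimate of the average reward per step

ALPHA = 0.1               # learning rate for the relative values
BETA = 0.01               # separate learning rate for the average-reward estimate
EPSILON = 0.1             # exploration rate for epsilon-greedy action selection

def greedy_value(s):
    # max over actions of the relative value in state s
    return max(R[(s, a)] for a in ACTIONS)

def choose_action(s):
    # Epsilon-greedy exploration; the empirical results suggest R-learning is
    # quite sensitive to how this exploration is carried out.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: R[(s, a)])

def r_learning_update(s, a, reward, s_next):
    # One tabular R-learning step on the transition (s, a, reward, s_next).
    global rho
    best_here = greedy_value(s)        # max_a R(s, a) before the update
    best_next = greedy_value(s_next)   # max_a' R(s_next, a')
    was_greedy = (R[(s, a)] == best_here)
    # Relative-value update toward the undiscounted target r - rho + max_a' R(s', a')
    R[(s, a)] += ALPHA * (reward - rho + best_next - R[(s, a)])
    # The average-reward estimate has its own update rule and step size, and is
    # adjusted only on greedy steps, keeping it decoupled from the relative values.
    if was_greedy:
        rho += BETA * (reward + best_next - best_here - rho)

Because the average reward rho is maintained by its own update with its own step size, rather than being read off the relative values, this sketch reflects the independent-estimation property that the overview identifies as crucial to convergence.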
Cite this article
Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Mach Learn 22, 159–195 (1996). https://doi.org/10.1007/BF00114727