Abstract
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
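To make the "independent estimation of the average reward and the relative values" concrete, the following is a minimal sketch of a tabular R-learning agent in the spirit of Schwartz's algorithm. The grid-world action names, learning rates, and the epsilon-greedy exploration routine are illustrative assumptions rather than the exact settings used in the paper's experiments.

import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]   # hypothetical grid-world action set

R = defaultdict(float)    # relative action values R(s, a)
rho = 0.0                 # running estimate of the average reward per step

ALPHA = 0.1               # learning rate for the relative values
BETA = 0.01               # separate learning rate for the average-reward estimate
EPSILON = 0.1             # exploration rate for epsilon-greedy action selection

def greedy_value(s):
    # max over actions of the relative value in state s
    return max(R[(s, a)] for a in ACTIONS)

def choose_action(s):
    # Epsilon-greedy exploration; the empirical results suggest R-learning is
    # quite sensitive to how this exploration is carried out.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: R[(s, a)])

def r_learning_update(s, a, reward, s_next):
    # One tabular R-learning step on the transition (s, a, reward, s_next).
    global rho
    best_here = greedy_value(s)        # max_a R(s, a) before the update
    best_next = greedy_value(s_next)   # max_a' R(s_next, a')
    was_greedy = (R[(s, a)] == best_here)
    # Relative-value update toward the undiscounted target r - rho + max_a' R(s', a')
    R[(s, a)] += ALPHA * (reward - rho + best_next - R[(s, a)])
    # The average-reward estimate has its own update rule and step size, and is
    # adjusted only on greedy steps, keeping it decoupled from the relative values.
    if was_greedy:
        rho += BETA * (reward + best_next - best_here - rho)

Because the average reward rho is maintained by its own update with its own step size, rather than being read off the relative values, this sketch reflects the independent-estimation property that the overview identifies as crucial to convergence.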
Cite this article
Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Mach Learn 22, 159–195 (1996). https://doi.org/10.1007/BF00114727