Abstract
We present an algorithm that aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy; these bounds are only slightly worse than the original bounds for UCRL.
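To give a rough feel for what aggregation via confidence intervals can look like, the following Python sketch groups states whose estimated mean rewards have overlapping Hoeffding-style confidence intervals. This is only an illustrative sketch under stated assumptions: the function names, the particular interval width, and the greedy grouping rule are chosen here for exposition and are not the algorithm analysed in the paper.

```python
import numpy as np

def confidence_widths(visit_counts, t, delta=0.05):
    # Hoeffding-style confidence interval widths for per-state reward
    # estimates after visit_counts visits at time step t.
    # (Illustrative form; the paper's confidence intervals may differ.)
    return np.sqrt(np.log(2.0 * len(visit_counts) * t / delta) /
                   (2.0 * np.maximum(visit_counts, 1)))

def aggregate_states(reward_estimates, widths):
    # Greedily group states whose reward confidence intervals overlap,
    # scanning states in order of increasing estimated reward.
    # Returns an array mapping each state index to an aggregate-class label.
    order = np.argsort(reward_estimates)
    labels = np.empty(len(reward_estimates), dtype=int)
    current_class, anchor = 0, order[0]
    for s in order:
        # Open a new aggregate class once the interval of state s no longer
        # overlaps the interval of the current class anchor.
        if (reward_estimates[s] - widths[s] >
                reward_estimates[anchor] + widths[anchor]):
            current_class += 1
            anchor = s
        labels[s] = current_class
    return labels

# Example: six states with noisy reward estimates from unequal sample sizes.
rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.12, 0.5, 0.52, 0.9, 0.88])
counts = rng.integers(5, 50, size=6)
estimates = true_rewards + rng.normal(0, 0.02, size=6)
print(aggregate_states(estimates, confidence_widths(counts, t=1000)))
```

In this toy example the six states collapse into three aggregate classes, one per cluster of similar rewards; as the visit counts grow, the intervals shrink and the aggregation is refined accordingly.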
Acknowledgements
The author would like to thank Peter Auer for discussion of some preliminary ideas, Shiau Hong Lim for sharing his power plug adapter at Cumberland Lodge, and the anonymous reviewers for their comments, which helped to improve the paper. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 216886 (PASCAL2 Network of Excellence) and no. 231495 (project CompLACS). The final version of this paper was prepared while the author was supported by the Austrian Science Fund (FWF): J 3259-N13.
Cite this article
Ortner, R. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Ann Oper Res 208, 321–336 (2013). https://doi.org/10.1007/s10479-012-1064-y