Abstract
We present an algorithm that aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy; these bounds are only slightly worse than the original bounds for UCRL.
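To give a rough feel for what aggregation via confidence intervals can look like, the following Python sketch groups states whose estimated mean rewards have overlapping Hoeffding-style confidence intervals. This is only an illustrative sketch under stated assumptions: the function names, the particular interval width, and the greedy grouping rule are chosen here for exposition and are not the algorithm analysed in the paper.

```python
import numpy as np

def confidence_widths(visit_counts, t, delta=0.05):
    # Hoeffding-style confidence interval widths for per-state reward
    # estimates after visit_counts visits at time step t.
    # (Illustrative form; the paper's confidence intervals may differ.)
    return np.sqrt(np.log(2.0 * len(visit_counts) * t / delta) /
                   (2.0 * np.maximum(visit_counts, 1)))

def aggregate_states(reward_estimates, widths):
    # Greedily group states whose reward confidence intervals overlap,
    # scanning states in order of increasing estimated reward.
    # Returns an array mapping each state index to an aggregate-class label.
    order = np.argsort(reward_estimates)
    labels = np.empty(len(reward_estimates), dtype=int)
    current_class, anchor = 0, order[0]
    for s in order:
        # Open a new aggregate class once the interval of state s no longer
        # overlaps the interval of the current class anchor.
        if (reward_estimates[s] - widths[s] >
                reward_estimates[anchor] + widths[anchor]):
            current_class += 1
            anchor = s
        labels[s] = current_class
    return labels

# Example: six states with noisy reward estimates from unequal sample sizes.
rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.12, 0.5, 0.52, 0.9, 0.88])
counts = rng.integers(5, 50, size=6)
estimates = true_rewards + rng.normal(0, 0.02, size=6)
print(aggregate_states(estimates, confidence_widths(counts, t=1000)))
```

In this toy example the six states collapse into three aggregate classes, one per cluster of similar rewards; as the visit counts grow, the intervals shrink and the aggregation is refined accordingly.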
Acknowledgements
The author would like to thank Peter Auer for discussion of some preliminary ideas, Shiau Hong Lim for sharing his power plug adapter at Cumberland Lodge, and the anonymous reviewers for their comments, which helped to improve the paper. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 216886 (PASCAL2 Network of Excellence) and no. 231495 (project CompLACS). The final version of this paper was prepared while the author was supported by the Austrian Science Fund (FWF): J 3259-N13.
Cite this article
Ortner, R. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Ann Oper Res 208, 321–336 (2013). https://doi.org/10.1007/s10479-012-1064-y