Abstract
Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.
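To make the semi-Markov decision process view concrete, the sketch below shows tabular SMDP Q-learning over options: a chosen activity runs under its own policy until it terminates, and the backup discounts the next state's value by gamma**k for a k-step execution. This is a minimal illustration, not the article's own code; the Corridor task, the Option class, and all names are hypothetical, and only the semi-Markov update rule itself is standard.

```python
import random
from collections import defaultdict


class Corridor:
    """Toy task: states 0..n-1, action 0 moves left, action 1 moves right,
    reward 1.0 on reaching the rightmost state, which is terminal.
    (Hypothetical example environment, not from the article.)"""
    def __init__(self, n=12):
        self.n = n
    def reset(self):
        return 0
    def is_terminal(self, s):
        return s == self.n - 1
    def step(self, s, a):
        s2 = max(0, s - 1) if a == 0 else min(self.n - 1, s + 1)
        return s2, (1.0 if s2 == self.n - 1 else 0.0), self.is_terminal(s2)


class Option:
    """A temporally-extended activity: its own policy plus a termination test."""
    def __init__(self, policy, terminates):
        self.policy = policy          # state -> primitive action
        self.terminates = terminates  # state -> True when the option ends


def run_option(env, s, option, gamma):
    """Follow the option's policy until it terminates or the task ends.
    Returns the discounted return accumulated inside the option, the state
    where it stopped, and its duration k."""
    ret, disc, k = 0.0, 1.0, 0
    while True:
        s, r, done = env.step(s, option.policy(s))
        ret += disc * r
        disc *= gamma
        k += 1
        if done or option.terminates(s):
            return ret, s, k


def smdp_q_learning(env, options, episodes=300, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular SMDP Q-learning over options: decisions are made only when an
    option terminates, and the backup discounts by gamma**k for a k-step option."""
    Q = defaultdict(float)  # Q[(state, option index)]
    for _ in range(episodes):
        s = env.reset()
        while not env.is_terminal(s):
            # epsilon-greedy choice among options at decision points
            o = (random.randrange(len(options)) if random.random() < eps
                 else max(range(len(options)), key=lambda i: Q[(s, i)]))
            R, s2, k = run_option(env, s, options[o], gamma)
            best = max(Q[(s2, i)] for i in range(len(options)))
            Q[(s, o)] += alpha * (R + gamma ** k * best - Q[(s, o)])  # semi-Markov backup
            s = s2
    return Q


if __name__ == "__main__":
    env = Corridor()
    options = [
        Option(lambda s: 0, lambda s: True),        # primitive "step left"
        Option(lambda s: 1, lambda s: True),        # primitive "step right"
        Option(lambda s: 1, lambda s: s % 4 == 0),  # "run right to the next landmark"
    ]
    Q = smdp_q_learning(env, options)
    greedy = [max(range(len(options)), key=lambda i: Q[(s, i)]) for s in range(env.n - 1)]
    print("greedy option per state:", greedy)
```

The gamma**k factor is what distinguishes the semi-Markov backup from ordinary one-step Q-learning: credit is assigned to the temporally-extended activity as a whole rather than to each primitive step, which is the property the hierarchical methods reviewed in the article build on.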
Barto, A.G., Mahadevan, S. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems 13, 341–379 (2003). https://doi.org/10.1023/A:1025696116075