
Hierarchical reinforcement learning with the MAXQ value function decomposition

Published: 01 November 2000

Abstract

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics--as a subroutine hierarchy--and a declarative semantics--as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this nonhierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
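
To make the additive decomposition concrete, here is a minimal, hypothetical Python sketch (not code from the paper) of the core identity the abstract describes: the value of invoking a subtask a within a parent task i in state s is split as Q(i, s, a) = V(a, s) + C(i, s, a), where V(a, s) is the value of the subtask itself and C(i, s, a) is a "completion" value for finishing the parent task afterwards. The class name, the toy two-task hierarchy, the state labels, and all numbers below are illustrative assumptions, not the paper's implementation or its experimental domains.

    from collections import defaultdict

    class MaxQNode:
        """One subtask in a MAXQ-style hierarchy (illustrative sketch only)."""

        def __init__(self, name, children=None, primitive_reward=None):
            self.name = name
            self.children = children or []                   # subtasks this task may invoke
            self.primitive_reward = primitive_reward or {}   # expected one-step reward per state (primitive actions)
            self.C = defaultdict(float)                      # completion function C(i, s, a), keyed by (state, child name)

        def is_primitive(self):
            return not self.children

        def V(self, s):
            """Projected value of executing this subtask starting in state s."""
            if self.is_primitive():
                return self.primitive_reward.get(s, 0.0)
            # For a composite task, V(i, s) = max over child subtasks a of Q(i, s, a).
            return max(self.Q(s, child) for child in self.children)

        def Q(self, s, child):
            """Additive decomposition: Q(i, s, a) = V(a, s) + C(i, s, a)."""
            return child.V(s) + self.C[(s, child.name)]

    # Toy two-level hierarchy with made-up states, rewards, and completion values.
    walk = MaxQNode("walk", primitive_reward={"s0": -1.0})
    ride = MaxQNode("ride", primitive_reward={"s0": -2.0})
    root = MaxQNode("root", children=[walk, ride])
    root.C[("s0", "walk")] = 1.0   # in MAXQ-Q these completion values are learned online
    root.C[("s0", "ride")] = 3.5
    print(root.V("s0"))            # 1.5: ride's own value (-2.0) plus its completion value (3.5)

In the paper the V and C terms are maintained by the MAXQ-Q learning algorithm and are what make the state abstractions possible; in this sketch they are fixed numbers whose only purpose is to show how the two terms combine.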

References

[1]
Bellman, R. E. (1957). Dynamic Programming. Princeton University Press.
[2]
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
[3]
Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1104-1111.
[4]
Currie, K., & Tate, A. (1991). O-Plan: The open planning architecture. Artificial Intelligence, 52 (1), 49-86.
[5]
Dayan, P., & Hinton, G. E. (1993). Feudal reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 5, pp. 271-278. Morgan Kaufmann, San Francisco, CA.
[6]
Dean, T., & Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. Tech. rep. CS-95-10, Department of Computer Science, Brown University, Providence, Rhode Island.
[7]
Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 118-126. Morgan Kaufmann.
[8]
Fikes, R. E., Hart, P. E., & Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251-288.
[9]
Forgy, C. L. (1982). Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19 (1), 17-37.
[10]
Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pp. 220-229, San Francisco, CA. Morgan Kaufmann.
[11]
Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.
[12]
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6 (6), 1185-1201.
[13]
Kaelbling, L. P. (1993). Hierarchical reinforcement learning: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167-173, San Francisco, CA. Morgan Kaufmann.
[14]
Kalmár, Z., Szepesvári, C., & Lőrincz, A. (1998). Module-based reinforcement learning for a real robot. Machine Learning, 31, 55-85.
[15]
Knoblock, C. A. (1990). Learning abstraction hierarchies for problem solving. In Proceedings of the Eighth National Conference on Artificial Intelligence, pp. 923-928, Boston, MA. AAAI Press.
[16]
Korf, R. E. (1985). Macro-operators: A weak method for learning. Artificial Intelligence, 26 (1), 35-77.
[17]
Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Ph.D. thesis, Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA.
[18]
Moore, A. W., Baird, L., & Kaelbling, L. P. (1999). Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1316-1323, San Francisco, CA. Morgan Kaufmann.
[19]
Parr, R. (1998a). Flexible decomposition algorithms for weakly coupled Markov decision problems. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pp. 422-430, San Francisco, CA. Morgan Kaufmann.
[20]
Parr, R. (1998b). Hierarchical control and learning for Markov decision processes. Ph.D. thesis, University of California, Berkeley, California.
[21]
Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, Vol. 10, pp. 1043-1049, Cambridge, MA. MIT Press.
[22]
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
[23]
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Tech. rep. CUED/F-INFENG/TR 166, Cambridge University Engineering Department, Cambridge, England.
[24]
Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5 (2), 115-135.
[25]
Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (1998). Convergence results for single-step on-policy reinforcement-learning algorithms. Tech. rep., University of Colorado, Department of Computer Science, Boulder, CO. To appear in Machine Learning.
[26]
Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8, 323-339.
[27]
Sutton, R. S., Singh, S., Precup, D., & Ravindran, B. (1999). Improved switching among temporally abstract actions. In Advances in Neural Information Processing Systems, Vol. 11, pp. 1066-1072. MIT Press.
[28]
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[29]
Sutton, R. S., Precup, D., & Singh, S. (1998). Between MDPs and Semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales. Tech. rep., University of Massachusetts, Department of Computer and Information Sciences, Amherst, MA. To appear in Artificial Intelligence.
[30]
Tadepalli, P., & Dietterich, T. G. (1997). Hierarchical explanation-based reinforcement learning. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 358-366, San Francisco, CA. Morgan Kaufmann.
[31]
Tambe, M., & Rosenbloom, P. S. (1994). Investigating production system representations for non-combinatorial match. Artificial Intelligence, 68 (1), 155-199.
[32]
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge. (To be reprinted by MIT Press.).
[33]
Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8, 279-292.



    Published In

    Journal of Artificial Intelligence Research, Volume 13, Issue 1
    August 2000
    335 pages

    Publisher

    AI Access Foundation

    El Segundo, CA, United States

    Publication History

    Published: 01 November 2000
    Received: 01 November 1999
    Published in JAIR Volume 13, Issue 1



    Cited By

    • (2024) Faster MIL-based Subgoal Identification for Reinforcement Learning by Tuning Fewer Hyperparameters. ACM Transactions on Autonomous and Adaptive Systems, 19(2), 1-29. DOI: 10.1145/3643852
    • (2023) Creating multi-level skill hierarchies in reinforcement learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 48472-48484. DOI: 10.5555/3666122.3668226
    • (2023) Train once, get a family. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 47081-47104. DOI: 10.5555/3666122.3668161
    • (2023) Hierarchies of reward machines. Proceedings of the 40th International Conference on Machine Learning, pp. 10494-10541. DOI: 10.5555/3618408.3618832
    • (2023) Scaling marginalized importance sampling to high-dimensional state-spaces via state abstraction. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, pp. 9417-9425. DOI: 10.1609/aaai.v37i8.26128
    • (2022) Hardness in Markov decision processes. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 14824-14838. DOI: 10.5555/3600270.3601348
    • (2022) An In-Depth Analysis of Cooperative Multi-Robot Hierarchical Reinforcement Learning. Proceedings of the 7th International Conference on Sustainable Information Engineering and Technology, pp. 119-126. DOI: 10.1145/3568231.3568258
    • (2022) Graph-Based Design of Hierarchical Reinforcement Learning Agents. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1003-1009. DOI: 10.1109/IROS40897.2019.8968252
    • (2022) Team-Imitate-Synchronize for Solving Dec-POMDPs. Machine Learning and Knowledge Discovery in Databases, pp. 216-232. DOI: 10.1007/978-3-031-26412-2_14
    • (2021) Active Screening for Recurrent Diseases: A Reinforcement Learning Approach. Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 992-1000. DOI: 10.5555/3463952.3464068
