
Efficient solution algorithms for factored MDPs

Published: 01 October 2003

Abstract

This paper addresses the problem of planning under uncertainty in large Markov Decision Processes (MDPs). Factored MDPs represent a complex state space using state variables and the transition model using a dynamic Bayesian network. This representation often allows an exponential reduction in the representation size of structured MDPs, but the complexity of exact solution algorithms for such MDPs can grow exponentially in the representation size. In this paper, we present two approximate solution algorithms that exploit structure in factored MDPs. Both use an approximate value function represented as a linear combination of basis functions, where each basis function involves only a small subset of the domain variables. A key contribution of this paper is that it shows how the basic operations of both algorithms can be performed efficiently in closed form, by exploiting both additive and context-specific structure in a factored MDP. A central element of our algorithms is a novel linear program decomposition technique, analogous to variable elimination in Bayesian networks, which reduces an exponentially large LP to a provably equivalent, polynomial-sized one. One algorithm uses approximate linear programming, and the second uses approximate dynamic programming. Our dynamic programming algorithm is novel in that it uses an approximation based on max-norm, a technique that more directly minimizes the terms that appear in error bounds for approximate MDP algorithms. We provide experimental results on problems with over 10^40 states, demonstrating a promising indication of the scalability of our approach, and compare our algorithm to an existing state-of-the-art approach, showing, in some problems, exponential gains in computation time.
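
As an illustration of the approximate linear programming formulation mentioned above, the following minimal Python sketch solves the ALP for a tiny, randomly generated MDP by enumerating every state-action constraint with scipy. It is not the paper's factored algorithm: the transition model, rewards, and basis functions below are hypothetical toy data, and the paper's central contribution is to avoid exactly this exhaustive enumeration by decomposing the LP, in a manner analogous to variable elimination, into a provably equivalent, polynomial-sized program.

# Approximate linear programming (ALP) sketch on a toy MDP.
# Hypothetical data; constraints are enumerated exhaustively, unlike the
# factored LP decomposition developed in the paper.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
# P[a, s, s'] = transition probability, R[s, a] = reward (toy data)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Basis functions h_i(s); the approximate value is V(s) = sum_i w_i h_i(s).
H = np.column_stack([
    np.ones(n_states),               # constant basis function
    np.arange(n_states) / n_states,  # toy "index" feature
    np.arange(n_states) % 2,         # toy parity feature
])

# Objective: minimize sum_s alpha(s) V(s) = (alpha @ H) @ w,
# with uniform state-relevance weights alpha.
alpha = np.full(n_states, 1.0 / n_states)
c = alpha @ H

# Constraints: V(s) >= R(s,a) + gamma * E[V(s') | s, a] for every s and a,
# rewritten as -(H - gamma * P[a] @ H) @ w <= -R[:, a] for linprog.
A_ub, b_ub = [], []
for a in range(n_actions):
    A_ub.append(-(H - gamma * P[a] @ H))
    b_ub.append(-R[:, a])

res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
              bounds=[(None, None)] * H.shape[1])
w = res.x
print("basis weights:", w)
print("approximate value function:", H @ w)

Because a constant basis function is included, the LP is always feasible, and satisfying the Bellman constraints for every action forces the approximation to upper-bound the optimal value function; minimizing the state-relevance-weighted sum of values therefore selects the tightest upper bound representable in the basis.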

Published In

Journal of Artificial Intelligence Research, Volume 19, Issue 1
July 2003
652 pages

Publisher

AI Access Foundation

El Segundo, CA, United States

Publication History

Published: 01 October 2003
Received: 01 January 2002
Published in JAIR Volume 19, Issue 1

Cited By

  • (2024) Factored MDP based Moving Target Defense with Dynamic Threat Modeling. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 2165-2167. DOI: 10.5555/3635637.3663095. Online publication date: 6-May-2024
  • (2023) Learning dynamic attribute-factored world models for efficient multi-object reinforcement learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 19117-19144. DOI: 10.5555/3666122.3666960. Online publication date: 10-Dec-2023
  • (2023) Differential privacy in cooperative multiagent planning. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 347-357. DOI: 10.5555/3625834.3625867. Online publication date: 31-Jul-2023
  • (2023) Scalable safe policy improvement via Monte Carlo tree search. Proceedings of the 40th International Conference on Machine Learning, pp. 3732-3756. DOI: 10.5555/3618408.3618560. Online publication date: 23-Jul-2023
  • (2023) Planning to chronicle. International Journal of Robotics Research, 42(6), pp. 412-432. DOI: 10.1177/02783649211069154. Online publication date: 1-May-2023
  • (2023) Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning. ACM SIGMETRICS Performance Evaluation Review, 51(1), pp. 83-84. DOI: 10.1145/3606376.3593545. Online publication date: 27-Jun-2023
  • (2023) Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1), pp. 1-51. DOI: 10.1145/3579443. Online publication date: 2-Mar-2023
  • (2023) Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning. Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 83-84. DOI: 10.1145/3578338.3593545. Online publication date: 19-Jun-2023
  • (2023) Invariant Policy Learning: A Causal Perspective. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), pp. 8606-8620. DOI: 10.1109/TPAMI.2022.3232363. Online publication date: 1-Jul-2023
  • (2023) Teamwork Reinforcement Learning With Concave Utilities. IEEE Transactions on Mobile Computing, 23(5), pp. 5709-5721. DOI: 10.1109/TMC.2023.3315120. Online publication date: 13-Sep-2023
