Abstract
We consider semi-Markov decision processes with finite state and action spaces and a general multichain structure. A form of limiting ratio average (undiscounted) reward is the criterion for comparing different policies. The main result is that the value vector and a pure optimal semi-stationary policy (i.e., a policy which depends only on the initial state and the current state) for such an SMDP can be computed directly from optimal solutions of a finite set of linear programming (LP) problems, whose cardinality equals the number of states. To be more precise, we prove that the single LP associated with a fixed initial state provides the value and an optimal pure stationary policy of the corresponding SMDP. The relation between the set of feasible solutions of each LP and the set of stationary policies is also analyzed. Examples are worked out to illustrate the algorithm.
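As a brief recap of the criterion (using the assumed shorthand \(r(s,a)\) and \(\tau (s,a)\) for the immediate reward and mean sojourn time in state \(s\) under action \(a\), and \(E_{s}^{\pi }\) for expectation under policy \(\pi \) starting from \(s\); the precise LP formulation is given in the body of the paper), the limiting ratio average reward is
\[
\phi (s,\pi )\;=\;\liminf _{n\rightarrow \infty }\;\frac{E_{s}^{\pi }\left[ \sum _{m=0}^{n} r(s_m,a_m)\right] }{E_{s}^{\pi }\left[ \sum _{m=0}^{n} \tau (s_m,a_m)\right] },
\]
and the LP associated with a fixed initial state \(s_0\) delivers both the value \(\phi (s_0)\) and a pure stationary policy \(f^{s_0}\) that is optimal for that initial state; the pure optimal semi-stationary policy then plays, in current state \(s\), the action \(f^{s_0}(s)\).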
Notes
For simplicity of notation, we denote by \([V]_i\) the \(i\)-th component of a vector \(V\); the \(j\)-th row and the \(k\)-th column of a matrix \(A\) are denoted by \(A_{j\cdot }\) and \(A_{\cdot k}\), respectively.
References
Baykal-Gürsoy, M. (2010). Semi-Markov decision processes. Wiley Encyclopedia of Operations Research and Management Science, doi:10.1002/9780470400531.eorms0757.
Bellman, R. E. (1962). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.
Bewley, T., & Kohlberg, E. (1978). On stochastic games with stationary optimal strategies. Mathematics of Operations Research, 3(2), 104–125.
Blackwell, D. (1962). Discrete dynamic programming. Annals of Mathematical Statistics, 33, 719–726.
Chen, D., & Trivedi, K. S. (2005). Optimization for condition-based maintenance with semi-Markov decision process. Reliability Engineering & System Safety, 90(1), 25–29.
Denardo, E. V., & Fox, B. L. (1968). Multichain Markov renewal programs. SIAM Journal on Applied Mathematics, 16(3), 468–487.
Derman, C. (1962). On sequential decisions and Markov chains. Management Science, 9(1), 16–24.
Derman, C. (1964). On sequential control processes. Annals of Mathematical Statistics, 35(1), 341–349.
Doob, J. L. (1953). Stochastic processes. Hoboken: Wiley.
Federgruen, A., Hordijk, A., & Tijms, H. C. (1978). A note on simultaneous recurrence conditions on a set of denumerable stochastic matrices. Journal of Applied Probability, 15(4), 842–847.
Feinberg, E. A. (1994). Constrained semi-Markov decision processes with average rewards. Mathematical Methods of Operations Research, 39(3), 257–288.
Fox, B. (1966). Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6), 1418–1432.
Hordijk, A., & Kallenberg, L. C. M. (1979). Linear programming and Markov decision chains. Management Science, 25(4), 352–362.
Howard, R. A. (1960). Dynamic programming and Markov processes. New York: Wiley.
Howard, R. A. (1963). Semi-Markovian decision processes. In Proceedings of the International Statistical Institute, Ottawa.
Jaśkiewicz, A. (2004). On the equivalence of two expected average cost criteria for Semi-Markov control processes. Mathematics of Operations Research, 29, 326–338.
Jaśkiewicz, A., & Nowak, A. (2007). Average optimality for semi-Markov control processes. Morfismos, 11(1), 15–36.
Jewell, W. S. (1963). Markov-renewal programming. Operations Research, 11, 938–971.
Jianyong, L., & Xiaobo, Z. (2004). On average reward semi-Markov decision processes with a general multichain structure. Mathematics of Operations Research, 29(2), 339–352.
Mondal, P., & Sinha, S. (2015). Ordered field property for semi-Markov games when one player controls transition probabilities and transition times. International Game Theory Review, 17(2), 1540022-1–1540022-26.
Mondal, P. (2015). Linear programming and zero-sum two-person undiscounted Semi-Markov games. Asia-Pacific Journal of Operational Research, 32(6), 1550043-1–1550043-20.
Mondal, P. (2016a). On undiscounted Semi-Markov decision processes with absorbing states. Mathematical Methods of Operations Research, 83(2), 161–177.
Mondal, P. (2016b). Completely mixed strategies for single controller unichain Semi-Markov games with undiscounted payoffs. Operational Research, doi:10.1007/s12351-016-0272-7.
Mondal, P. (2017). On zero-sum two-person undiscounted semi-Markov games with a multichain structure. Advances in Applied Probability, 49(3), 826–849. doi:10.1017/apr.2017.23.
Osaki, S., & Mine, H. (1968). Linear programming algorithms for Semi-Markovian decision processes. Journal of Mathematical Analysis and Applications, 22, 356–381.
Ross, S. M. (1970). Applied probability models with optimization applications. San Francisco: Holden-Day.
Schäl, M. (1992). On the second optimality equation for semi-Markov decision models. Mathematics of Operations Research, 17(2), 470–486.
Schweitzer, P. J. (1971). Iterative solution of the functional equations of undiscounted Markov renewal programming. Journal of Mathematical Analysis and Applications, 34(3), 495–501.
Sennott, L. I. (1989). Average cost semi-Markov decision processes and the control of queueing systems. Probability in the Engineering and Informational Sciences, 3(2), 247–272.
Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences, USA, 39, 1095–1100.
Sinha, S., & Mondal, P. (2017). Semi-Markov decision processes with limiting ratio average rewards. Journal of Mathematical Analysis and Applications, 455(1), 864–871. doi:10.1016/j.jmaa.2017.06.017.
White, D. J. (1993). A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44(11), 1073–1096.
Acknowledgements
I am grateful to Prof. T. Parthasarathy of CMI & ISI Chennai. Some of the ideas presented in this paper result from a fruitful discussion with him during the International Conference & Workshop on “Game Theory and Optimization”, June 6–10, 2016, at IIT Madras. This paper is dedicated to the celebration of the 75th birthday of Prof. T. Parthasarathy, who has made significant contributions to the theory of games and linear complementarity problems. I am also thankful to Prof. S. Sinha of Jadavpur University, Kolkata, for many valuable suggestions. I would like to thank the two anonymous referees for their valuable and detailed comments, which have helped structure this paper better.
Appendix (Proof of Theorem 1)
This result is published in Sinha and Mondal (2017).
We use the following simple lemma to prove Theorem 1.
Lemma 11
Let \(\{a_n\}\) be a bounded sequence in \(\mathbf {R}^m\) and let \(f: \mathbf {R}^m\rightarrow \mathbf {R}\) be a continuous function. Then
Proof of Theorem 1
Let \(|S|=z\). For simplicity of notation, we assume that \(|A(s)|=|A|=k\) for all \(s\in S\). Let \({s_0}\in S\) be a fixed initial state. We prove that if the SMDP starts at \({s_0}\), then there exists a pure stationary policy \(f^{s_0}\in \mathscr {F}^\mathscr {P}\) such that \(\phi ({s_0},f^{s_0})\ge \phi ({s_0},\pi )\) for all \(\pi \in \varPi \). This proves the theorem, because if \(\{f^{s_0}\mid {s_0}\in S\}\) are as above, then the semi-stationary policy \(\xi ^*=(f^1,f^2,\ldots ,f^z)\) is optimal in the SMDP.
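In symbols (a restatement of the reduction just described, using the ad hoc notation \(\xi ^*(s\mid s_0)\), not from the paper, for the action that \(\xi ^*\) prescribes in current state \(s\) when the initial state was \(s_0\)):
\[
\xi ^*(s\mid s_0)\;=\;f^{s_0}(s),\qquad \text{so that}\qquad \phi ({s_0},\xi ^*)\;=\;\phi ({s_0},f^{s_0})\;\ge \;\phi ({s_0},\pi )\quad \forall \,\pi \in \varPi ,\ s_0\in S.
\]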
For every \(\pi \in \varPi \), consider the \(zk\)-component vectors
Note that \(x_{ns'a}\) represents the average, over \(m=0,1,2,\ldots ,n\), of the probabilities that the decision maker visits a given state \(s'\) and chooses a given action \(a\in A(s')\) at step \(m\), when the system starts at the fixed state \(s_0\).
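A minimal sketch of what these vectors presumably look like, written out from the verbal description above (the paper's own display is not reproduced here):
\[
x_{ns'a}\;=\;\frac{1}{n+1}\sum _{m=0}^{n}\mathrm {Pr}^{\pi }\bigl (s_m=s',\,a_m=a\mid s_0\bigr ),\qquad \psi _n^{\pi }({s_0})\;=\;\bigl (x_{ns'a}\bigr )_{s'\in S,\,a\in A(s')}\in [0,1]^{zk}.
\]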
Let \(H^{\pi }({s_0})\) be the set of limit points of \(\{\psi _n^\pi ({s_0})\}\) as \(n\rightarrow \infty \). If \(\pi \in \mathscr {F}\), we denote \(\psi ^{\pi }({s_0})=\lim \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})\) (which exists by Lemma 1(a)). Let \(H({s_0})=\bigcup _{\pi \in \varPi }H^{\pi }({s_0})\), \(H'({s_0})=\bigcup _{\pi \in \mathscr {F}}H^{\pi }({s_0})\) and \(H''({s_0})=\bigcup _{\pi \in \mathscr {F}^{\mathscr {P}}}H^{\pi }({s_0})\) and let \(\bar{H}'({s_0})\) and \(\bar{H}''({s_0})\) denote the closure of the convex hulls of \(H'({s_0})\) and \(H''({s_0})\) respectively. If the decision maker in the SMDP uses \(\pi \in \varPi \) with initial state \({s_0}\in S\), then from (1) we have
where \(r\) and \(\tau \) are, respectively, the immediate reward and mean sojourn time vectors of order \(zk\), with components ordered in conformity with \(\psi _n^\pi ({s_0})\).
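Written out from this description, relation (25) presumably takes the form (a reconstruction consistent with the surrounding text, not the paper's own display):
\[
\phi ({s_0},\pi )\;=\;\liminf _{n\rightarrow \infty }\;\frac{\bigl (\psi _n^{\pi }({s_0})\bigr )^{T} r}{\bigl (\psi _n^{\pi }({s_0})\bigr )^{T} \tau }.
\]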
Define \(\varTheta (a_n)=\frac{a_n^T\cdot r}{a_n^T\cdot {\tau }}\) for \(a_n\in [0,1]^{zk}\) with \(a_n^T\mathbf{{1}}=1\), where \(\mathbf{1}\) is the vector of order \(zk\) with all coordinates equal to 1. Then \(\{a_n\}\) is a bounded sequence, and both the numerator and the denominator of \(\varTheta (a_n)\) are bounded linear functions of \(a_n\), with the denominator nonzero. Thus \(\varTheta \) is a continuous real-valued function.
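For completeness, here is the one-line reason the denominator cannot vanish (assuming, as is standard for such models, that every mean sojourn time \([\tau ]_{(s',b)}\) is strictly positive): for any \(a_n\in [0,1]^{zk}\) with \(a_n^T\mathbf{{1}}=1\),
\[
a_n^T\cdot \tau \;\ge \;\min _{s'\in S,\,b\in A(s')}[\tau ]_{(s',b)}\;>\;0,
\]
so \(\varTheta \) is a ratio of continuous functions whose denominator is bounded away from zero on this simplex, and hence continuous there.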
When \(a_n=\psi _n^\pi ({s_0})\), from (25) using Lemma 11, we obtain
where equality holds if \(\pi \in \mathscr {F}\). Derman (1964, p. 342, Theorem 1, part (a)) showed that \(H({s_0})\subset \bar{H}'({s_0})=\bar{H}''({s_0})\). By this argument, \(\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})\in H({s_0})\subset \bar{H}'({s_0})=\bar{H}''({s_0})\); thus there exist nonnegative scalars \(\lambda _1,\lambda _2,\ldots ,\lambda _d\) adding up to unity (where \(d=|\mathscr {F}^{\mathscr {P}}|=k^z\) and \(\mathscr {F}^{\mathscr {P}}=\{f_1,f_2,\ldots ,f_d\}\)) such that \(\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})=\sum _{l=1}^{d}\lambda _l\psi ^{f_l}({s_0})\), where \(\psi ^{f_l}({s_0})=\lim \limits _{n\rightarrow \infty }\psi _n^{f_l}({s_0})\). Thus, for every \(\pi \in \varPi \), \(\phi ({s_0},\pi )\) is bounded above by the optimal value of a linear fractional programming problem in the weights \(\lambda \), sketched below.
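A minimal sketch of this linear fractional programming bound, written out from the preceding identities (the weights \(\lambda _l\) and the vectors \(\psi ^{f_l}({s_0})\) are as above; the paper's own display is not reproduced here):
\[
\phi ({s_0},\pi )\;\le \;\max _{\lambda }\;\frac{\sum _{l=1}^{d}\lambda _l\,\bigl (\psi ^{f_l}({s_0})\bigr )^{T} r}{\sum _{l=1}^{d}\lambda _l\,\bigl (\psi ^{f_l}({s_0})\bigr )^{T} \tau }\qquad \text{subject to}\quad \lambda _l\ge 0\ \ \forall \,l,\qquad \sum _{l=1}^{d}\lambda _l=1.
\]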
The above linear fractional programming problem has an optimal solution at an extreme point of the feasible region \(\{\lambda _l\ge 0\ \forall \,l=1,2,\ldots ,d;\ \sum \limits _{l=1}^d \lambda _l=1\}\), and hence it corresponds to an optimal pure stationary policy \(f^{s_0}\) satisfying \(\phi ({s_0},f^{s_0})\ge \phi ({s_0},\pi )\) for all \(\pi \in \varPi \). This completes the proof. \(\square \)
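A short justification of the extreme-point claim (a standard fact about linear fractional objectives over polytopes, not an argument taken from the paper): the objective is a ratio of affine functions of \(\lambda \) with a positive denominator, hence quasi-convex, so its maximum over the simplex is attained at some vertex \(e_{l^*}\), where it reduces to
\[
\frac{\bigl (\psi ^{f_{l^*}}({s_0})\bigr )^{T} r}{\bigl (\psi ^{f_{l^*}}({s_0})\bigr )^{T} \tau }\;=\;\phi ({s_0},f_{l^*}),
\]
the last equality holding because, for a stationary policy, \(\phi \) equals \(\varTheta \) evaluated at the limit vector \(\psi ^{f_{l^*}}({s_0})\); taking \(f^{s_0}=f_{l^*}\) gives the required policy.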
Keywords
- Semi-Markov decision processes
- Limiting ratio average reward
- Multichain structure
- Pure optimal semi-stationary policies
- Linear programming