Computing semi-stationary optimal policies for multichain semi-Markov decision processes

Game theory and optimization

Annals of Operations Research

Abstract

We consider semi-Markov decision processes (SMDPs) with finite state and action spaces and a general multichain structure. A form of limiting ratio average (undiscounted) reward is the criterion for comparing different policies. The main result is that the value vector and a pure optimal semi-stationary policy (i.e., a policy which depends only on the initial state and the current state) for such an SMDP can be computed directly from an optimal solution of a finite set (whose cardinality equals the number of states) of linear programming (LP) problems. To be more precise, we prove that the single LP associated with a fixed initial state provides the value and an optimal pure stationary policy of the corresponding SMDP. The relation between the set of feasible solutions of each LP and the set of stationary policies is also analyzed. Examples are worked out to describe the algorithm.


Notes

  1. For simplicity of notation, we shall denote by \([V]_i\) the i-th component of the vector V; the j-th row and the k-th column of the matrix A are denoted by \(A_{j\cdot }\) and \(A_{\cdot k}\), respectively.

References

  • Baykal-Gürsoy, M. (2010). Semi-Markov decision processes. Wiley Encyclopedia of Operations Research and Management Science. doi:10.1002/9780470400531.eorms0757.

  • Bellman, R. E. (1962). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.

  • Bewley, T., & Kohlberg, E. (1978). On stochastic games with stationary optimal strategies. Mathematics of Operations Research, 3(2), 104–125.

  • Blackwell, D. (1962). Discrete dynamic programming. Annals of Mathematical Statistics, 33, 719–726.

  • Chen, D., & Trivedi, K. S. (2005). Optimization for condition-based maintenance with semi-Markov decision process. Reliability Engineering & System Safety, 90(1), 25–29.

  • Denardo, E. V., & Fox, B. L. (1968). Multichain Markov renewal programs. SIAM Journal on Applied Mathematics, 16(3), 468–487.

  • Derman, C. (1962). On sequential decisions and Markov chains. Management Science, 9(1), 16–24.

  • Derman, C. (1964). On sequential control processes. Annals of Mathematical Statistics, 35(1), 341–349.

  • Doob, J. L. (1953). Stochastic processes. Hoboken: Wiley.

  • Federgruen, A., Hordijk, A., & Tijms, H. C. (1978). A note on simultaneous recurrence conditions on a set of denumerable stochastic matrices. Journal of Applied Probability, 15(4), 842–847.

  • Feinberg, E. A. (1994). Constrained semi-Markov decision processes with average rewards. Mathematical Methods of Operations Research, 39(3), 257–288.

  • Fox, B. (1966). Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6), 1418–1432.

  • Hordijk, A., & Kallenberg, L. C. M. (1979). Linear programming and Markov decision chains. Management Science, 25(4), 352–362.

  • Howard, R. A. (1960). Dynamic programming and Markov processes. New York: Wiley.

  • Howard, R. A. (1963). Semi-Markovian decision processes. Proceedings of the International Statistical Institute, Ottawa.

  • Jaśkiewicz, A. (2004). On the equivalence of two expected average cost criteria for semi-Markov control processes. Mathematics of Operations Research, 29, 326–338.

  • Jaśkiewicz, A., & Nowak, A. (2007). Average optimality for semi-Markov control processes. Morfismos, 11(1), 15–36.

  • Jewell, W. S. (1963). Markov-renewal programming. Operations Research, 11, 938–971.

  • Jianyong, L., & Xiaobo, Z. (2004). On average reward semi-Markov decision processes with a general multichain structure. Mathematics of Operations Research, 29(2), 339–352.

  • Mondal, P., & Sinha, S. (2015). Ordered field property for semi-Markov games when one player controls transition probabilities and transition times. International Game Theory Review, 17(2), 1540022-1–1540022-26.

  • Mondal, P. (2015). Linear programming and zero-sum two-person undiscounted semi-Markov games. Asia-Pacific Journal of Operational Research, 32(6), 1550043-1–1550043-20.

  • Mondal, P. (2016a). On undiscounted semi-Markov decision processes with absorbing states. Mathematical Methods of Operations Research, 83(2), 161–177.

  • Mondal, P. (2016b). Completely mixed strategies for single controller unichain semi-Markov games with undiscounted payoffs. Operational Research. doi:10.1007/s12351-016-0272-7.

  • Mondal, P. (2017). On zero-sum two-person undiscounted semi-Markov games with a multichain structure. Advances in Applied Probability, 49(3), 826–849. doi:10.1017/apr.2017.23.

  • Osaki, S., & Mine, H. (1968). Linear programming algorithms for semi-Markovian decision processes. Journal of Mathematical Analysis and Applications, 22, 356–381.

  • Ross, S. M. (1970). Applied probability models with optimization applications. San Francisco: Holden-Day.

  • Schäl, M. (1992). On the second optimality equation for semi-Markov decision models. Mathematics of Operations Research, 17(2), 470–486.

  • Schweitzer, P. J. (1971). Iterative solution of the functional equations of undiscounted Markov renewal programming. Journal of Mathematical Analysis and Applications, 34(3), 495–501.

  • Sennott, L. I. (1989). Average cost semi-Markov decision processes and the control of queueing systems. Probability in the Engineering and Informational Sciences, 3(2), 247–272.

  • Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences of the USA, 39, 1095–1100.

  • Sinha, S., & Mondal, P. (2017). Semi-Markov decision processes with limiting ratio average rewards. Journal of Mathematical Analysis and Applications, 455(1), 864–871. doi:10.1016/j.jmaa.2017.06.017.

  • White, D. J. (1993). A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44(11), 1073–1096.


Acknowledgements

I am grateful to Prof. T. Parthasarathy of CMI & ISI Chennai. Some ideas presented in this paper result from a fruitful discussion with him during the International Conference & Workshop on “Game Theory and Optimization”, June 6–10, 2016, at IIT Madras. This paper is dedicated to Prof. T. Parthasarathy on the occasion of his 75th birthday; he has made significant contributions to the theory of games and linear complementarity problems. I am also thankful to Prof. S. Sinha of Jadavpur University, Kolkata, for many valuable suggestions. I would like to thank the two anonymous referees for their valuable and detailed comments, which have helped structure this paper better.

Corresponding author

Correspondence to Prasenjit Mondal.

Appendix (Proof of Theorem 1)

This result is published in Sinha and Mondal (2017).

We use the following simple lemma to prove Theorem 1.

Lemma 11

Let \(\{a_n\}\) be a bounded sequence in \(\mathbf {R}^m\) and let \(f: \mathbf {R}^m\rightarrow \mathbf {R}\) be a continuous function. Then

$$\begin{aligned} f(\liminf _{n\rightarrow \infty }a_n)\ge \liminf _{n\rightarrow \infty }f(a_n)~\text{ and }~f(\limsup _{n\rightarrow \infty }a_n)\le \limsup _{n\rightarrow \infty }f(a_n). \end{aligned}$$

Proof of Theorem 1

Let \(|S|=z\). For simplicity, we assume that \(|A(s)|=|A|=k\) for all \(s\in S\). Let \({s_0}\in S\) be a fixed initial state. We prove that if the SMDP starts at \({s_0}\), then there exists a pure stationary policy \(f^{s_0}\in \mathscr {F}^\mathscr {P}\) such that \(\phi ({s_0},f^{s_0})\ge \phi ({s_0},\pi )\) for all \(\pi \in \varPi \). This proves the theorem, because if \(\{f^{s_0}\mid {s_0}\in S\}\) are chosen in this way, then the semi-stationary policy \(\xi ^*=(f^1,f^2,\ldots ,f^z)\) is optimal in the SMDP.
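For concreteness (this sketch is purely illustrative and not part of the original argument), a semi-stationary policy assembled from the family \(\{f^{s_0}\mid s_0\in S\}\) can be viewed as a lookup table indexed by the initial state and the current state; the states and actions below are hypothetical.

```python
# Illustrative sketch only: hypothetical 3-state, 2-action SMDP.
# f_by_initial[s0][s] is the action prescribed by the pure stationary
# policy f^{s0} that is used when the play starts in state s0.
f_by_initial = {
    0: {0: 1, 1: 0, 2: 1},   # f^0
    1: {0: 0, 1: 0, 2: 1},   # f^1
    2: {0: 1, 1: 1, 2: 0},   # f^2
}

def semi_stationary_action(initial_state: int, current_state: int) -> int:
    """The semi-stationary policy xi* depends only on the initial state
    and the current state, not on the rest of the history."""
    return f_by_initial[initial_state][current_state]

# Example: play started in state 2 and is currently in state 1.
print(semi_stationary_action(2, 1))
```

Once the table is fixed, the action chosen at any decision epoch ignores the rest of the history.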

For every \(\pi \in \varPi \), consider the zk-component vectors

$$\begin{aligned} \psi _n^\pi ({s_0})=\{x_{n11},\ldots ,x_{n1k},\ldots ,x_{nz1},\ldots ,x_{nzk}\},\quad n\in \mathbf {N}, \end{aligned}$$

where \(x_{ns'a}=(n+1)^{-1}\sum \limits _{m=0}^{n}P(X_m=s',A_m=a\mid X_0={s_0})\) for \(s'=1,2,\ldots ,z\) and \(a=1,2,\ldots ,k\).

Note that \(x_{ns'a}\) represents the average, over the steps \(m=0,1,2,\ldots ,n\), of the probabilities that the decision maker visits a given state \(s'\) and chooses a given action \(a\in A(s')\) at step m when the system starts at a fixed state \(s_0\).

Let \(H^{\pi }({s_0})\) be the set of limit points of \(\{\psi _n^\pi ({s_0})\}\) as \(n\rightarrow \infty \). If \(\pi \in \mathscr {F}\), we denote \(\psi ^{\pi }({s_0})=\lim \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})\) (which exists by Lemma 1(a)). Let \(H({s_0})=\bigcup _{\pi \in \varPi }H^{\pi }({s_0})\), \(H'({s_0})=\bigcup _{\pi \in \mathscr {F}}H^{\pi }({s_0})\) and \(H''({s_0})=\bigcup _{\pi \in \mathscr {F}^{\mathscr {P}}}H^{\pi }({s_0})\) and let \(\bar{H}'({s_0})\) and \(\bar{H}''({s_0})\) denote the closure of the convex hulls of \(H'({s_0})\) and \(H''({s_0})\) respectively. If the decision maker in the SMDP uses \(\pi \in \varPi \) with initial state \({s_0}\in S\), then from (1) we have

$$\begin{aligned} \phi ({s_0},\pi )=\liminf _{n\rightarrow \infty }\frac{\sum \limits _{s'\in S}\sum \limits _{a\in A}x_{ns'a}r(s',a)}{\sum \limits _{s'\in S}\sum \limits _{a\in A}x_{ns'a}\tau (s',a)}=\liminf _{n\rightarrow \infty }\frac{[\psi _n^\pi ({s_0})]^T\cdot r}{[\psi _n^\pi ({s_0})]^T\cdot \tau }, \end{aligned}$$
(25)

where r and \(\tau \) are, respectively, the immediate reward and the mean sojourn time vectors of order zk, with components ordered in conformity with \(\psi _n^\pi ({s_0})\).
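As a numerical illustration of (25) (with made-up data, not taken from the paper), the sketch below propagates the state distribution of the embedded chain induced by a pure stationary policy to obtain \(\psi _n^\pi ({s_0})\), and then evaluates the ratio \([\psi _n^\pi ({s_0})]^T\cdot r/[\psi _n^\pi ({s_0})]^T\cdot \tau \) for a large n.

```python
import numpy as np

# Hypothetical data (illustration only): z = 2 states, k = 2 actions.
z, k = 2, 2
p = np.array([                      # p[s, a, s']: embedded transition law
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
r = np.array([[1.0, 4.0], [2.0, 0.5]])    # immediate rewards r(s, a)
tau = np.array([[1.0, 2.0], [1.5, 1.0]])  # mean sojourn times tau(s, a) > 0

def psi_n(f, s0, n):
    """zk-vector of the averages x_{n s' a} of P(X_m = s', A_m = a | X_0 = s0),
    m = 0,...,n, under the pure stationary policy f (a list: state -> action)."""
    P_f = np.array([p[s, f[s]] for s in range(z)])   # chain induced by f
    dist = np.zeros(z)
    dist[s0] = 1.0                                   # distribution of X_0
    acc = np.zeros((z, k))
    for _ in range(n + 1):
        for s in range(z):
            acc[s, f[s]] += dist[s]                  # action f(s) is used in s
        dist = dist @ P_f                            # distribution of X_{m+1}
    return (acc / (n + 1)).ravel()

f = [0, 1]                                           # a pure stationary policy
x = psi_n(f, s0=0, n=5000)
print((x @ r.ravel()) / (x @ tau.ravel()))           # ratio in (25) for large n
```

Since the policy used here is stationary, \(\psi _n^\pi ({s_0})\) converges (Lemma 1(a)), so the displayed ratio approximates \(\phi ({s_0},\pi )\).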

Define \(\varTheta (a_n)=\frac{a_n^T\cdot r}{a_n^T\cdot {\tau }}\), where \( a_n\in [0,1]^{zk}\) and \(a_n^T\mathbf{{1}}=1\) (\(\mathbf {1}\) is a vector of order zk with all coordinates equal to 1). Then \(\{a_n\}\) is a bounded sequence, and both the numerator and the denominator of \(\varTheta (a_n)\) are bounded linear functions, the denominator being nonzero because the mean sojourn times \(\tau (s',a)\) are positive. Thus, \(\varTheta \) is a continuous real-valued function.

When \(a_n=\psi _n^\pi ({s_0})\), from (25) using Lemma 11, we obtain

$$\begin{aligned} \phi ({s_0},\pi )=\liminf _{n\rightarrow \infty }\varTheta (\psi _n^\pi ({s_0}))\le \varTheta (\liminf _{n\rightarrow \infty }\psi _n^\pi ({s_0}))= \frac{[\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})]^T\cdot r}{[\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})]^T\cdot \tau }, \end{aligned}$$

where the equality occurs if \(\pi \in \mathscr {F}\). In Derman (1964) (p. 342, Theorem 1, part (a)), it was shown that \(H({s_0})\subset \bar{H}'({s_0})=\bar{H}''({s_0})\). Hence \(\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})\in H({s_0})\subset \bar{H}'({s_0})=\bar{H}''({s_0})\), so there exist nonnegative scalars \(\lambda _1,\lambda _2,\ldots ,\lambda _d\) summing to unity (where \(d=|\mathscr {F}^{\mathscr {P}}|=k^z\) and \(\mathscr {F}^{\mathscr {P}}=\{f_1,f_2,\ldots ,f_d\}\)) such that \(\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})=\sum _{l=1}^{d}\lambda _l\psi ^{f_l}({s_0})\), where \(\psi ^{f_l}({s_0})=\lim \limits _{n\rightarrow \infty }\psi _n^{f_l}({s_0})\). Thus, for all \(\pi \in \varPi \),

$$\begin{aligned} \phi ({s_0},\pi )\le \max \limits _{\lambda _l\ge 0,\sum \lambda _l=1} \frac{[\sum \limits _{l=1}^{d}\lambda _l\psi ^{f_l}({s_0})]^T\cdot r}{[\sum \limits _{l=1}^{d}\lambda _l\psi ^{f_l}({s_0})]^T\cdot \tau }=\max \limits _{\lambda _l\ge 0,\sum \lambda _l=1} \frac{\sum \limits _{l=1}^{d}\lambda _l\bigl \{[\psi ^{f_l}({s_0})]^T\cdot r\bigr \}}{\sum \limits _{l=1}^{d}\lambda _l\bigl \{[\psi ^{f_l}({s_0})]^T\cdot \tau \bigr \}}. \end{aligned}$$

The above linear fractional programming problem attains its optimum at an extreme point of the feasible region \(\{\lambda _l\ge 0, \forall l=1,2,\ldots ,d; \sum \limits _{l=1}^d \lambda _l=1\}\), and such an extreme point corresponds to a pure stationary policy \(f^{s_0}\) with \(\phi ({s_0},f^{s_0})\ge \phi ({s_0},\pi )~ \forall \pi \in \varPi \). This completes the proof. \(\square \)
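To make the last step concrete (again with hypothetical data): the extreme points of the feasible region are the unit vectors, each of which picks out a single pure stationary policy, so the maximum of the linear fractional objective can be located by comparing the ratios achieved by the pure policies. The sketch below illustrates this vertex-enumeration view; it is not the LP-based algorithm developed in the paper.

```python
import numpy as np

# Hypothetical data: column l of Psi stands for the limiting vector
# psi^{f_l}(s0) of the l-th pure stationary policy; r and tau are the
# reward and mean sojourn-time vectors in the matching order.
Psi = np.array([[0.6, 0.1, 0.0],
                [0.0, 0.4, 0.3],
                [0.4, 0.0, 0.2],
                [0.0, 0.5, 0.5]])
r   = np.array([1.0, 4.0, 2.0, 0.5])
tau = np.array([1.0, 2.0, 1.5, 1.0])

# The maximum over the simplex of (Psi @ lam).r / (Psi @ lam).tau is
# attained at a unit vector e_l, i.e., at some pure policy f_l, because
# the denominator is positive and the objective is linear fractional.
ratios = (r @ Psi) / (tau @ Psi)   # ratio achieved by each pure policy
best = int(np.argmax(ratios))
print(best, ratios[best])          # index of f^{s0} and its value phi(s0, f^{s0})
```

The reported index identifies the pure stationary policy \(f^{s_0}\) attaining the bound for the fixed initial state \(s_0\).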


Cite this article

Mondal, P. Computing semi-stationary optimal policies for multichain semi-Markov decision processes. Ann Oper Res 287, 843–865 (2020). https://doi.org/10.1007/s10479-017-2686-x
