Abstract
We consider semi-Markov decision processes with finite state and action spaces and a general multichain structure. A form of limiting ratio average (undiscounted) reward is the criterion for comparing different policies. The main result is that the value vector and a pure optimal semi-stationary policy (i.e., a policy which depends only on the initial state and the current state) for such an SMDP can be computed directly from optimal solutions of a finite set of linear programming (LP) problems, whose cardinality equals the number of states. To be more precise, we prove that the single LP associated with a fixed initial state provides the value and an optimal pure stationary policy of the corresponding SMDP. The relation between the set of feasible solutions of each LP and the set of stationary policies is also analyzed. Examples are worked out to illustrate the algorithm.
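As a brief recap of the criterion (using the assumed shorthand \(r(s,a)\) and \(\tau (s,a)\) for the immediate reward and mean sojourn time in state \(s\) under action \(a\), and \(E_{s}^{\pi }\) for expectation under policy \(\pi \) starting from \(s\); the precise LP formulation is given in the body of the paper), the limiting ratio average reward is
\[
\phi (s,\pi )\;=\;\liminf _{n\rightarrow \infty }\;\frac{E_{s}^{\pi }\left[ \sum _{m=0}^{n} r(s_m,a_m)\right] }{E_{s}^{\pi }\left[ \sum _{m=0}^{n} \tau (s_m,a_m)\right] },
\]
and the LP associated with a fixed initial state \(s_0\) delivers both the value \(\phi (s_0)\) and a pure stationary policy \(f^{s_0}\) that is optimal for that initial state; the pure optimal semi-stationary policy then plays, in current state \(s\), the action \(f^{s_0}(s)\).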
Notes
For simplicity of notation, we denote by \([V]_i\) the \(i\)-th component of a vector \(V\); the \(j\)-th row and the \(k\)-th column of a matrix \(A\) are denoted by \(A_{j\cdot }\) and \(A_{\cdot k}\), respectively.
References
Baykal-Gürsoy, M. (2010). Semi-Markov decision processes. Wiley Encyclopedia of Operations Research and Management Science, doi:10.1002/9780470400531.eorms0757.
Bellman, R. E. (1962). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.
Bewley, T., & Kohlberg, E. (1978). On stochastic games with stationary optimal strategies. Mathematics of Operations Research, 3(2), 104–125.
Blackwell, D. (1962). Discrete dynamic programming. Annals of Mathematical Statistics, 33, 719–726.
Chen, D., & Trivedi, K. S. (2005). Optimization for condition-based maintenance with semi-Markov decision process. Reliability Engineering & System Safety, 90(1), 25–29.
Denardo, E. V., & Fox, B. L. (1968). Multichain Markov renewal programs. SIAM Journal on Applied Mathematics, 16(3), 468–487.
Derman, C. (1962). On sequential decisions and Markov chains. Management Science, 9(1), 16–24.
Derman, C. (1964). On sequential control processes. Annals of Mathematical Statistics, 35(1), 341–349.
Doob, J. L. (1953). Stochastic processes. Hoboken: Wiley.
Federgruen, A., Hordijk, A., & Tijms, H. C. (1978). A note on simultaneous recurrence conditions on a set of denumerable stochastic matrices. Journal of Applied Probability, 15(4), 842–847.
Feinberg, E. A. (1994). Constrained semi-Markov decision processes with average rewards. Mathematical Methods of Operations Research, 39(3), 257–288.
Fox, B. (1966). Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6), 1418–1432.
Hordijk, A., & Kallenberg, L. C. M. (1979). Linear programming and Markov decision chains. Management Science, 25(4), 352–362.
Howard, R. A. (1960). Dynamic programming and Markov processes. New York: Wiley.
Howard, R. A. (1963). Semi-Markovian decision processes. In Proceedings of the International Statistical Institute, Ottawa.
Jaśkiewicz, A. (2004). On the equivalence of two expected average cost criteria for Semi-Markov control processes. Mathematics of Operations Research, 29, 326–338.
Jaśkiewicz, A., & Nowak, A. (2007). Average optimality for semi-Markov control processes. Morfismos, 11(1), 15–36.
Jewell, W. S. (1963). Markov-renewal programming. Operations Research, 11, 938–971.
Jianyong, L., & Xiaobo, Z. (2004). On average reward semi-Markov decision processes with a general multichain structure. Mathematics of Operations Research, 29(2), 339–352.
Mondal, P., & Sinha, S. (2015). Ordered field property for semi-Markov games when one player controls transition probabilities and transition times. International Game Theory Review, 17(2), 1540022-1–1540022-26.
Mondal, P. (2015). Linear programming and zero-sum two-person undiscounted Semi-Markov games. Asia-Pacific Journal of Operational Research, 32(6), 1550043-1–1550043-20.
Mondal, P. (2016a). On undiscounted Semi-Markov decision processes with absorbing states. Mathematical Methods of Operations Research, 83(2), 161–177.
Mondal, P. (2016b). Completely mixed strategies for single controller unichain Semi-Markov games with undiscounted payoffs. Operational Research, doi:10.1007/s12351-016-0272-7.
Mondal, P. (2017). On zero-sum two-person undiscounted semi-Markov games with a multichain structure. Advances in Applied Probability, 49(3), 826–849. doi:10.1017/apr.2017.23.
Osaki, S., & Mine, H. (1968). Linear programming algorithms for Semi-Markovian decision processes. Journal of Mathematical Analysis and Applications, 22, 356–381.
Ross, S. M. (1970). Applied probability models with optimization applications. San Francisco: Holden-Day.
Schäl, M. (1992). On the second optimality equation for semi-Markov decision models. Mathematics of Operations Research, 17(2), 470–486.
Schweitzer, P. J. (1971). Iterative solution of the functional equations of undiscounted Markov renewal programming. Journal of Mathematical Analysis and Applications, 34(3), 495–501.
Sennott, L. I. (1989). Average cost semi-Markov decision processes and the control of queueing systems. Probability in the Engineering and Informational Sciences, 3(2), 247–272.
Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences, USA, 39, 1095–1100.
Sinha, S., & Mondal, P. (2017). Semi-Markov decision processes with limiting ratio average rewards. Journal of Mathematical Analysis and Applications, 455(1), 864–871. doi:10.1016/j.jmaa.2017.06.017.
White, D. J. (1993). A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44(11), 1073–1096.
Acknowledgements
I am grateful to Prof. T. Parthasarathy of CMI & ISI Chennai. Some of the ideas presented in this paper result from a fruitful discussion with him during the International Conference & Workshop on “Game Theory and Optimization”, June 6–10, 2016, at IIT Madras. This paper is dedicated to the celebration of the 75th birthday of Prof. T. Parthasarathy, who has made significant contributions to the theory of games and linear complementarity problems. I am also thankful to Prof. S. Sinha of Jadavpur University, Kolkata, for many valuable suggestions. I would like to thank the two anonymous referees for their valuable and detailed comments, which have helped structure this paper better.
Appendix (Proof of Theorem 1)
This result is published in Sinha and Mondal (2017).
We use the following simple lemma to prove Theorem 1.
Lemma 11
Let \(\{a_n\}\) be a bounded sequence in \(\mathbf {R}^m\) and let \(f: \mathbf {R}^m\rightarrow \mathbf {R}\) be a continuous function. Then
Proof of Theorem 1
Let \(|S|=z\). For simplicity of notation, we assume that \(|A(s)|=|A|=k\) for all \(s\in S\). Let \({s_0}\in S\) be a fixed initial state. We prove that if the SMDP starts at \({s_0}\), then there exists a pure stationary policy \(f^{s_0}\in \mathscr {F}^\mathscr {P}\) such that \(\phi ({s_0},f^{s_0})\ge \phi ({s_0},\pi )\) for all \(\pi \in \varPi \). This proves the theorem, because if \(\{f^{s_0}\mid {s_0}\in S\}\) are as above, then the semi-stationary policy \(\xi ^*=(f^1,f^2,\ldots ,f^z)\) is optimal in the SMDP.
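In symbols (a restatement of the reduction just described, using the ad hoc notation \(\xi ^*(s\mid s_0)\), not from the paper, for the action that \(\xi ^*\) prescribes in current state \(s\) when the initial state was \(s_0\)):
\[
\xi ^*(s\mid s_0)\;=\;f^{s_0}(s),\qquad \text{so that}\qquad \phi ({s_0},\xi ^*)\;=\;\phi ({s_0},f^{s_0})\;\ge \;\phi ({s_0},\pi )\quad \forall \,\pi \in \varPi ,\ s_0\in S.
\]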
For every \(\pi \in \varPi \), consider the \(zk\)-component vectors
Note that \(x_{ns'a}\) represents the average, over \(m=0,1,2,\ldots ,n\), of the probabilities that the decision maker visits a given state \(s'\) and chooses a given action \(a\in A(s')\) at step \(m\), when the system starts at the fixed state \(s_0\).
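A minimal sketch of what these vectors presumably look like, written out from the verbal description above (the paper's own display is not reproduced here):
\[
x_{ns'a}\;=\;\frac{1}{n+1}\sum _{m=0}^{n}\mathrm {Pr}^{\pi }\bigl (s_m=s',\,a_m=a\mid s_0\bigr ),\qquad \psi _n^{\pi }({s_0})\;=\;\bigl (x_{ns'a}\bigr )_{s'\in S,\,a\in A(s')}\in [0,1]^{zk}.
\]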
Let \(H^{\pi }({s_0})\) be the set of limit points of \(\{\psi _n^\pi ({s_0})\}\) as \(n\rightarrow \infty \). If \(\pi \in \mathscr {F}\), we denote \(\psi ^{\pi }({s_0})=\lim \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})\) (which exists by Lemma 1(a)). Let \(H({s_0})=\bigcup _{\pi \in \varPi }H^{\pi }({s_0})\), \(H'({s_0})=\bigcup _{\pi \in \mathscr {F}}H^{\pi }({s_0})\) and \(H''({s_0})=\bigcup _{\pi \in \mathscr {F}^{\mathscr {P}}}H^{\pi }({s_0})\) and let \(\bar{H}'({s_0})\) and \(\bar{H}''({s_0})\) denote the closure of the convex hulls of \(H'({s_0})\) and \(H''({s_0})\) respectively. If the decision maker in the SMDP uses \(\pi \in \varPi \) with initial state \({s_0}\in S\), then from (1) we have
where \(r\) and \(\tau \) are, respectively, the immediate reward and mean sojourn time vectors of order \(zk\), with components ordered in conformity with \(\psi _n^\pi ({s_0})\).
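Written out from this description, relation (25) presumably takes the form (a reconstruction consistent with the surrounding text, not the paper's own display):
\[
\phi ({s_0},\pi )\;=\;\liminf _{n\rightarrow \infty }\;\frac{\bigl (\psi _n^{\pi }({s_0})\bigr )^{T} r}{\bigl (\psi _n^{\pi }({s_0})\bigr )^{T} \tau }.
\]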
Define \(\varTheta (a_n)=\frac{a_n^T\cdot r}{a_n^T\cdot {\tau }}\) for \(a_n\in [0,1]^{zk}\) with \(a_n^T\mathbf{{1}}=1\), where \(\mathbf{1}\) is the vector of order \(zk\) with all coordinates equal to 1. Then \(\{a_n\}\) is a bounded sequence, and both the numerator and the denominator of \(\varTheta (a_n)\) are bounded linear functions of \(a_n\), with the denominator nonzero. Thus \(\varTheta \) is a continuous real-valued function.
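For completeness, here is the one-line reason the denominator cannot vanish (assuming, as is standard for such models, that every mean sojourn time \([\tau ]_{(s',b)}\) is strictly positive): for any \(a_n\in [0,1]^{zk}\) with \(a_n^T\mathbf{{1}}=1\),
\[
a_n^T\cdot \tau \;\ge \;\min _{s'\in S,\,b\in A(s')}[\tau ]_{(s',b)}\;>\;0,
\]
so \(\varTheta \) is a ratio of continuous functions whose denominator is bounded away from zero on this simplex, and hence continuous there.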
When \(a_n=\psi _n^\pi ({s_0})\), from (25) using Lemma 11, we obtain
where equality holds if \(\pi \in \mathscr {F}\). Derman (1964, p. 342, Theorem 1, part (a)) showed that \(H({s_0})\subset \bar{H}'({s_0})=\bar{H}''({s_0})\). By this argument, \(\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})\in H({s_0})\subset \bar{H}'({s_0})=\bar{H}''({s_0})\); thus there exist nonnegative scalars \(\lambda _1,\lambda _2,\ldots ,\lambda _d\) adding up to unity (where \(d=|\mathscr {F}^{\mathscr {P}}|=k^z\) and \(\mathscr {F}^{\mathscr {P}}=\{f_1,f_2,\ldots ,f_d\}\)) such that \(\liminf \limits _{n\rightarrow \infty }\psi _n^\pi ({s_0})=\sum _{l=1}^{d}\lambda _l\psi ^{f_l}({s_0})\), where \(\psi ^{f_l}({s_0})=\lim \limits _{n\rightarrow \infty }\psi _n^{f_l}({s_0})\). Thus, for every \(\pi \in \varPi \), \(\phi ({s_0},\pi )\) is bounded above by the optimal value of a linear fractional programming problem in the weights \(\lambda \), sketched below.
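A minimal sketch of this linear fractional programming bound, written out from the preceding identities (the weights \(\lambda _l\) and the vectors \(\psi ^{f_l}({s_0})\) are as above; the paper's own display is not reproduced here):
\[
\phi ({s_0},\pi )\;\le \;\max _{\lambda }\;\frac{\sum _{l=1}^{d}\lambda _l\,\bigl (\psi ^{f_l}({s_0})\bigr )^{T} r}{\sum _{l=1}^{d}\lambda _l\,\bigl (\psi ^{f_l}({s_0})\bigr )^{T} \tau }\qquad \text{subject to}\quad \lambda _l\ge 0\ \ \forall \,l,\qquad \sum _{l=1}^{d}\lambda _l=1.
\]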
The above linear fractional programming problem has an optimal solution at an extreme point of the feasible region \(\{\lambda _l\ge 0\ \forall \,l=1,2,\ldots ,d;\ \sum \limits _{l=1}^d \lambda _l=1\}\), and hence it corresponds to an optimal pure stationary policy \(f^{s_0}\) satisfying \(\phi ({s_0},f^{s_0})\ge \phi ({s_0},\pi )\) for all \(\pi \in \varPi \). This completes the proof. \(\square \)
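A short justification of the extreme-point claim (a standard fact about linear fractional objectives over polytopes, not an argument taken from the paper): the objective is a ratio of affine functions of \(\lambda \) with a positive denominator, hence quasi-convex, so its maximum over the simplex is attained at some vertex \(e_{l^*}\), where it reduces to
\[
\frac{\bigl (\psi ^{f_{l^*}}({s_0})\bigr )^{T} r}{\bigl (\psi ^{f_{l^*}}({s_0})\bigr )^{T} \tau }\;=\;\phi ({s_0},f_{l^*}),
\]
the last equality holding because, for a stationary policy, \(\phi \) equals \(\varTheta \) evaluated at the limit vector \(\psi ^{f_{l^*}}({s_0})\); taking \(f^{s_0}=f_{l^*}\) gives the required policy.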
Keywords
- Semi-Markov decision processes
- Limiting ratio average reward
- Multichain structure
- Pure optimal semi-stationary policies
- Linear programming