Abstract
We discuss a new applied probability model: a system whose evolution is described by a Markov chain (MC) with a known transition matrix on a discrete state space, where at each moment of discrete time a decision maker can apply one of three actions: continue, quit, or restart the MC at one of a finite number of fixed “restarting” points. This model generalizes a model of Katehakis and Veinott (Math. Oper. Res. 12:262, 1987), in which a restart to a unique point was allowed without any fee and the quit action was absent. Both models are related to the Gittins index and to another index defined in a Whittle family of stopping retirement problems. We propose a transparent recursive finite algorithm that solves our model in O(n³) operations.
References
Denardo, E. V., Rothblum, U. G., & Van der Heyden, L. (2004). Index policies for stochastic search in a forest with an application to R&D project management. Mathematics of Operations Research, 29, 162–181. doi:10.1287/moor.1030.0072.
Derman, C. (1970). Finite state Markovian decision processes. Orlando: Academic Press.
Feinberg, E. A., & Shwartz, A. (2002). Handbook of Markov decision processes. International series in operations research & management science (Vol. 40). Boston: Kluwer Academic.
Grassmann, W. K., Taksar, M. I., & Heyman, D. P. (1985). Regenerative analysis and steady state distributions for Markov chains. Operations Research, 33, 1107–1116. doi:10.1287/opre.33.5.1107.
Irle, A. (1980). On the best choice problem with random population size. Mathematical Methods of Operations Research, 24, 177–190. doi:10.1007/BF01919245.
Katehakis, M., & Derman, C. (1986). Computing optimal sequential allocation rules in clinical trials, Lecture notes—monograph series (Vol. 8, pp. 29–39).
Katehakis, M. N., & Veinott, A. F. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268. doi:10.1287/moor.12.2.262.
Kemeny, J. G., Snell, J. L., & Knapp, A. W. (1976). Finite Markov chains. Princeton: Springer.
Mitten, L. G. (1960). An analytic solution to the least cost testing sequence problem. International Journal of Industrial Engineering, 11, 17.
Puterman, M. (2005). Markov decision processes: discrete stochastic dynamic programming. New York: Wiley.
Sheskin, T. J. (1985). Technical note—a Markov chain partitioning algorithm for computing steady state probabilities. Operations Research, 33, 228–235.
Sheskin, T. J. (1999). State reduction in a Markov decision process. International Journal of Mathematical Education in Science and Technology, 30(2), 167–185.
Sonin, I. (1999a). The elimination algorithm for the problem of optimal stopping. Mathematical Methods of Operations Research, 49(1), 111–123. doi:10.1007/s001860050016.
Sonin, I. (1999b). The state reduction and related algorithms and their applications to the study of Markov chains, graph theory, and the optimal stopping problem. Advances in Mathematics, 145, 159–188. doi:10.1006/aima.1998.1813.
Sonin, I. (2006). The optimal stopping of a Markov chain and recursive solution of Poisson and Bellman equations. In Y. Kabanov, R. Liptser, & J. Stoyanov (Eds.), From stochastic calculus to mathematical finance. The Shiryaev Festschrift (pp. 609–621). Berlin: Springer.
Sonin, I. (2008). A generalized Gittins index for a Markov chain and its recursive calculation. Statistics & Probability Letters, 78(12), 1526–1533. doi:10.1016/j.spl.2008.01.049.
Sonin, I. M. (2011). Optimal stopping of Markov chains and three abstract optimization problems. Stochastics, 83(4–6), 405–414. doi:10.1080/17442508.2010.514051.
Sonin, I., & Presman, E. (1972). The problem of best choice in the case of a random number of objects. Theory of Probability and Its Applications, 17, 695–706.
Sonin, I., & Thornton, J. (2001). Recursive algorithm for the fundamental/group inverse matrix of a Markov chain from an explicit formula. SIAM Journal on Matrix Analysis and Applications, 23(1), 209–224. doi:10.1137/S0895479899351234.
Stewart, W. (2009). Probability, Markov chains, queues, and simulation: the mathematical basis of performance modeling. Princeton: Princeton University Press.
Tsitsiklis, J. N. (1994). A short proof of the Gittins index theorem. The Annals of Applied Probability, 4, 194–199. doi:10.1214/aoap/1177005207.
Whittle, P. (1980). Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 42(2), 143–149.
Ye, Y. (2011). The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4), 593–603. doi:10.1287/moor.1110.0516.
Appendix
5.1 Three abstract optimization problems
The common part of all three problems described above in Sect. 3 is a maximization over the set of all partitions of the state set X into three sets: continuation, restart and quit regions. This is a special case of a very general situation of optimization on an abstract index set.
Let us consider the following three abstract optimization problems 1, 2 and 3. Suppose there is an abstract index set U, and let A={a_u} and B={b_u} be two sets of real numbers indexed by the elements of U. Suppose also that an assumption (U) holds (see Sonin 2011).
In all three problems a decision maker (DM) knows the sets U, A, and B.
Problem 1 (restart problem). Find solution(s) of the equation
$$h=\sup_{u\in U}\bigl[a_{u}+(1-b_{u})h\bigr]\equiv H(h). \qquad (41)$$
Equation (41) is the Bellman (optimality) equation for the “value of the game”, i.e. the supremum over all possible strategies in an optimization problem admitting two equivalent interpretations. In both cases the set U represents the set of available actions. A DM can select one of them, say u, and then she obtains a reward $a_{u}$; according to the first interpretation, with probability $b_{u}$ the game is terminated, and with complementary probability $1-b_{u}$ she is again in the initial situation, i.e. she can select any action. Her goal is to maximize the total (undiscounted) reward. According to the second interpretation, the game continues sequentially without the possibility of random termination, but the value $1-b_{u}$ is now not a probability but a discount factor applied to the future rewards after action u was used at the first step.
Problem 2 (ratio (cycle) problem). Find
$$\alpha=\sup_{u\in U}\frac{a_{u}}{b_{u}}. \qquad (42)$$
The interpretation of this problem is straightforward: a DM can select an action u only once, and her goal is to maximize the ratio in (42), the one-step reward per “chance of termination”. Since the game is terminated after the first trial anyway, $1/b_{u}$ has the interpretation of a “multiplier” applied to the “direct” reward $a_{u}$.
Let H(k) be the function defined by the right-hand side of (41).
Problem 3 (a parametric family of retirement problems). Find w, defined as follows: given a parameter k, −∞<k<∞, let
$$v(k)=k\vee H(k),\qquad w=\inf\bigl\{k:v(k)=k\bigr\}. \qquad (43)$$
In this problem, given the number k, a DM has the following one-step choice: to obtain k immediately, or to select an action u once, obtain the reward $a_{u}$, and after that additionally obtain k with probability $1-b_{u}$, and zero with the complementary probability.
Using the fact that the functions H(k) and v(k), −∞<k<∞, are nondecreasing, continuous, and convex (concave up), the following theorem was proved in Sonin (2011).
Theorem 5 (Abstract optimization equality)
(a) The solution h of (41) is finite and unique;
(b) h = α = w;
(c) the optimal index (or an optimizing sequence) for any of the three problems is the optimal index (an optimizing sequence, respectively) for the other two problems.
A brief discussion of one more problem, initially analyzed in the one-page seminal paper Mitten (1960), and of its relation to the GI and GGI, can be found in Sonin (2008).
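To see Theorem 5 at work, here is a small numerical sketch in Python (the index set U and the numbers $a_u$, $b_u$ are invented purely for illustration and do not come from the paper). It computes the fixed point h of (41) by iteration, the ratio α of (42) directly, and the retirement threshold w of (43) by bisection; the three values coincide, and the same index attains all three optima.

```python
# Toy check of Theorem 5 (abstract optimization equality).
# U is a finite index set; a[u] are rewards, b[u] in (0, 1] are
# termination probabilities.  All numbers are invented for illustration.
a = {"u1": 1.0, "u2": 2.5, "u3": 0.4}
b = {"u1": 0.50, "u2": 0.80, "u3": 0.10}
U = list(a)

def H(k):
    """Right-hand side of (41): H(k) = max_u [a_u + (1 - b_u) k]."""
    return max(a[u] + (1.0 - b[u]) * k for u in U)

# Problem 1: solve h = H(h) by fixed-point iteration
# (H is a contraction with modulus max_u (1 - b_u) < 1).
h = 0.0
for _ in range(500):
    h = H(h)

# Problem 2: alpha = max_u a_u / b_u, attained at u_star.
alpha = max(a[u] / b[u] for u in U)
u_star = max(U, key=lambda u: a[u] / b[u])

# Problem 3: w = inf{k : v(k) = k}, v(k) = max(k, H(k));
# since k - H(k) is nondecreasing, w can be found by bisection.
lo, hi = 0.0, 1e6
for _ in range(100):
    mid = (lo + hi) / 2
    if H(mid) > mid:
        lo = mid
    else:
        hi = mid
w = hi

print(h, alpha, w, u_star)   # h, alpha and w agree up to numerical tolerance
```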
In the next section we shall formally show how the three problems described in Sect. 3 can be presented as abstract problems.
5.2 Proofs of some statements
Proof of Proposition 3
For any set $S\subset X$, by the definition of $\tau=\tau_{S}$ and the definition of the function $g(x|k)$, we have
Using the definitions of $R^{\pi}(x)$ and $Q^{\pi}(x)$ in (26) and (25), the definition of the function $g(x|k)$ and the definition of the partition $\pi_{0}$, we can see that for all partitions $\pi$ with the same set $S$ we have $C^{\pi}(x)=C^{\pi_{0}}(x)$ and $R_{q}^{\pi}(x)+R_{r}^{\pi}(x)+Q_{r}^{\pi}(x)k\leq R_{q}^{\pi_{0}}(x)+R_{r}^{\pi_{0}}(x)+Q_{r}^{\pi_{0}}(x)k$. Hence $R^{\pi}(x)+(1-Q^{\pi}(x))k\leq R^{\pi_{0}}(x)+(1-Q^{\pi_{0}}(x))k$. Thus $\sup_{\pi>0}[\ldots]=\sup_{\tau>0}[\ldots]$, where $[\ldots]$ is the expression in square brackets in (36). □
Proof of Theorem 3, using Theorem 4
Given $x\in X$, let us introduce the set of indices $U=\{u\}=\{\pi>0\}\cup\{0\}$, where $\pi>0$ is any 3-partition such that $x\in C$, and $0$ is an index such that $a_{0}=q(x)-r(x)$, $b_{0}=1$. For any $u\neq 0$, i.e. for a partition $\pi=(C,S_{q},S_{r})$, $S=S_{q}\cup S_{r}$, and $\tau=\tau_{S}$, we define $R^{\pi}(x)=E_{x}\sum_{n=0}^{\tau-1}c(Z_{n})$, the total expected reward until moment $\tau$, and $Q^{\pi}(x)=P_{x}(Z_{\tau}\in S_{q})$, the probability of termination on $[0,\tau)$. The rewards $a_{u}$ and probabilities $b_{u}$ are defined as $a_{u}=R^{\pi}(x)-r(x)$, $b_{u}=Q^{\pi}(x)$. Then the function $H(k)$ coincides with $\sup_{\tau>0}E_{x}g(Z_{\tau})$, where $g(x)=k$. Respectively, $v(x|k)=k\vee H(k)=\sup_{\tau\geq 0}E_{x}g(Z_{\tau})$, i.e. $v(x|k)$ is the value function in an OS problem for the MC in model $M(k)$. Then, according to formulas (29), (30), (33) and (34), the three Problems 1, 2, 3 from Sect. 4 are represented as the three abstract problems of this section, and we can apply Theorem 4. □
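For concreteness, the quantities $R^{\pi}(x)$ and $Q^{\pi}(x)$ appearing in this proof can be computed for a small chain by solving linear systems on the continuation set $C$, as in the following sketch. The four-state chain, the one-step rewards and the partition are hypothetical (not taken from the paper), and numpy is assumed.

```python
import numpy as np

# Hypothetical 4-state chain: C = {0, 1} (continue), S_q = {2} (quit),
# S_r = {3} (restart); transition matrix and rewards are made up.
P = np.array([[0.2, 0.5, 0.2, 0.1],
              [0.3, 0.3, 0.1, 0.3],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
c = np.array([1.0, 2.0])            # one-step rewards c(x) for x in C
C, S_q = [0, 1], [2]

P_CC = P[np.ix_(C, C)]              # transitions inside C
P_Cq = P[np.ix_(C, S_q)]            # transitions from C into the quit set

# R^pi(x) = E_x sum_{n < tau} c(Z_n): solve (I - P_CC) R = c on C.
R = np.linalg.solve(np.eye(len(C)) - P_CC, c)

# Q^pi(x) = P_x(Z_tau in S_q): solve (I - P_CC) Q = P_Cq * 1.
Q = np.linalg.solve(np.eye(len(C)) - P_CC, P_Cq.sum(axis=1))

print(R)   # expected reward accumulated before the exit time tau
print(Q)   # probability that the chain leaves C through S_q
```

Here $(I-P_{CC})^{-1}$ plays the role of the fundamental matrix of the chain restricted to $C$.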
Proof of formula (15)
Note that if $D=\emptyset$ then $W_{\emptyset}=P_{\emptyset}=P$, and if $D=\{y\}$ then $p_{D\diagdown y}(\cdot,y)=p(\cdot,y)$, and formula (14) with $w'=w_{D}$ and $w=p$ for $z=y$ implies that
so formula (15) holds. It remains to prove that if it holds for a set $D\subset X$, then it holds for the set $D\cup z$, $z\notin D$. For $D\subset X$ let us define the matrix $S_{D}$ as follows: $s_{D}(\cdot,y)=w_{D}(\cdot,y)=p_{D}(\cdot,y)$ for $y\notin D$ and $s_{D}(\cdot,y)=p_{D\diagdown y}(\cdot,y)$ for $y\in D$. To simplify notation let us denote the matrices $W_{D}=W_{1}$, $W_{D\cup z}=W_{2}$, $S_{D}=S_{1}$, $S_{D\cup z}=S_{2}$. If formula (15) holds for a set $D\subset X$, then two equivalent statements are true
The first formula is our assumption, and the second formula follows from the first one if we apply the first formula for $x=y$ to obtain the equality $1-s_{1}(y,y)=1/(1+w_{1}(y,y))$. Our goal is to prove
For $y=z$ formula (47) can be checked directly: formula (14) for $y=z$ implies $w_{2}(\cdot,z)=w_{1}(\cdot,z)/(1-w_{1}(z,z))$, and for $z\notin D$ by the definition of $S_{2}$ we have $s_{2}(\cdot,z)=s_{1}(\cdot,z)$, and by the definition of $S_{1}$ we have $s_{1}(\cdot,z)=w_{1}(\cdot,z)$. For $y\in D$, by the definition of $W_{2}$, i.e. by formula (14), using formula (46) and the equality $w_{1}(\cdot,z)=s_{1}(\cdot,z)$ (since $z\notin D$), we have
By the definition of $S_{2}=S_{D\cup z}$, using formula (14) with $w'=p_{(D\cup z)\diagdown y}$ and $w=p_{D\diagdown y}$ for $y\in D$, we have
where $A=1/(1-p_{D\diagdown y}(z,z))$. Using formula (14) with $w'=p_{D}$ and $w=p_{D\diagdown y}$, we have
Applying this formula for $x=y$, we obtain $\frac{p_{D\diagdown y}(y,z)}{1-p_{D\diagdown y}(y,y)}=p_{D}(y,z)=s_{D}(y,z)=s_{1}(y,z)$ and hence
Using formula (50) for $x=z$, we obtain $p_{D\diagdown y}(z,z)=s_{1}(z,z)-s_{1}(z,y)s_{1}(y,z)$, so
Now using this equality and formula (50), formula (49) can be written as
Using $x=y$ in (52), we obtain $s_{2}(y,y)=s_{1}(y,y)(1-s_{1}(z,z))A+s_{1}(y,z)s_{1}(z,y)A$. Then, using formula (51), we obtain $1-s_{2}(y,y)=(1-s_{1}(y,y))(1-s_{1}(z,z))A$, and dividing both parts of formula (52) by $1-s_{2}(y,y)$, we obtain that (48) for $y\in D$ can be written as (47). □
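Formula (14) itself is not reproduced in this appendix. The way it is used above, for example to obtain $w_{2}(\cdot,z)=w_{1}(\cdot,z)/(1-w_{1}(z,z))$, is consistent with an update of the form $w'(x,y)=w(x,y)+w(x,z)\,w(z,y)/(1-w(z,z))$, the state-elimination identity familiar from the state reduction approach (cf. Sonin 1999b; Sonin & Thornton 2001). Under that assumption only, the following sketch builds the matrices $W_{D}$ recursively, starting from $W_{\emptyset}=P$ and adding one state at a time; the 3-state substochastic matrix is invented for illustration.

```python
import numpy as np

def add_state(w, z):
    """One application of the (reconstructed) formula (14):
    w'(x, y) = w(x, y) + w(x, z) * w(z, y) / (1 - w(z, z))."""
    return w + np.outer(w[:, z] / (1.0 - w[z, z]), w[z, :])

# Made-up substochastic matrix P; with these numbers 1 - w(z, z)
# stays positive along the recursion.
P = np.array([[0.1, 0.4, 0.3],
              [0.2, 0.2, 0.4],
              [0.3, 0.1, 0.2]])

# Build W_D for D = {0, 1}, starting from W_empty = P.
W = P.copy()
for z in [0, 1]:
    W = add_state(W, z)

print(W)
```

This is only a sketch of the recursion pattern; the actual formulas (14) and (15) and the meaning of $W_{D}$ are given in the body of the paper.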
Sonin, I.M., Steinberg, C. Continue, quit, restart probability model. Ann Oper Res 241, 295–318 (2016). https://doi.org/10.1007/s10479-012-1089-2