Markov 123

Probability and Statistics with
Reliability, Queuing and Computer

Science Applications: Chapter 8 on
Continuous-Time Markov Chains
Kishor Trivedi
Non-State Space Models

Recall that non-state-space models like RBDs
and FTs can easily be formulated and (assuming
statistical independence) solved for system
reliability, system availability and system MTTF
Each component can have attached to it
A probability of failure
A failure rate
A distribution of time to failure
Steady-state and instantaneous unavailability
Markov chain
To model complex interactions between components,
use other kinds of models like Markov chains or
more generally state space models.
Many examples of dependencies among system
components have been observed in practice and
captured by Markov models.
MARKOV CHAINS
State-space based model
States represent various conditions of the system
Transitions between states indicate occurrences of
events
State-Space-Based Models
States and labeled state transitions
State can keep track of:
Number of functioning resources of each type
States of recovery for each failed resource
Number of tasks of each type waiting at each
resource
Allocation of resources to tasks
A transition:
Can occur from any state to any other state
Can represent a simple or a compound event
State-Space-Based Model (Continued)

Drawn as a directed graph
Transition label:
Probability: homogeneous discrete-time Markov chain (DTMC)
Rate: homogeneous continuous-time Markov chain (CTMC)
Time-dependent rate: non-homogeneous CTMC
Distribution function: semi-Markov process (SMP)
Two distribution functions; Markov regenerative process (MRGP)
MARKOV CHAINS
(Continued)
For continuous-time Markov chains (CTMCs)

the time variable associated with the system
evolution is continuous
We will mean a CTMC whenever we speak of
Markov model (chain)
Chapter 8
Continuous Time Markov Chains
Formal Definition
A discrete-state continuous-time stochastic process
is called a Markov chain if
for t0 < t1 < t2 < . < tn < t , the conditional pmf
satisfies the following Markov property:
A CTMC is characterized by state changes that can

occur at any arbitrary time
Index space is continuous.
The state space is discrete valued.
Continuous Time Markov Chain

(CTMC)
A CTMC can be completely described by:
Initial state probability vector for X(t0):
Transition probability functions (over an
interval)
pmf of X(t)
Using the theorem of total probability
If v = 0 in the above equation, we get
Homogenous CTMCs
is a (time-)homogenous CTMC iff
Or, the conditional pmf satisfies:
A CTMC is said to be irreducible if every state can

be reached from every other state, with a non-zero
probability.
A state is said to be absorbing if no other state can be
reached from it with non-zero probability.
Notion of transient, recurrent non-null, recurrent null
are the same as in a DTMC. There is no notion of
periodicity in a CTMC, however.
CTMC Dynamics
Chapman-Kolmogorov Equation
Note that these transition probabilities are functions of

elapsed time and not of the number of elapsed steps
The direct use of the this equation is difficult unlike the
case of DTMC where we could anchor on one-step
transition probabilities
Hence the notion of rates of transitions which follows
next
Transition Rates
Define the rates (probabilities per unit time):
net rate out of state j at time t:
the rate from state i to state j at time t:
Kolmogorov Differential
Equation
The transition probabilities and transition rates are,
Dividing both sides by h and taking the limit,
Kolmogorov Differential Equation

(contd.)
Kolmogorovs backward equation,
Writing these eqs. in the matrix form,
Homogeneous CTMC
Specialize to HCTMC (Kolmogorov diff. eqn) :
In the matrix form, (Matrix Q is called the infinitesimal

generator matrix (or simply Generator
Matrix))
CTMC Steady-state Solution
Steady state solution of CTMC obtained by solving the

following balance equations:
Irreducible CTMCs with all states recurrent non-null will have

+ve steady-state {j} values that are unique and independent of
the initial probability vector. All states of a finite irreducible
CTMC will be recurrent non-null.
Measures of interest may be computed by assigning reward
rates to states and computing expected steady state reward rate:
CTMC Measures
Measures of interest may be computed by assigning reward rates to states and

computing expected reward rate at time t:
Expected accumulated reward (over an interval of time)
Lj(t) is the expected time spent in state j during (0,t)
(LTODE)
Markov Availability Model
2-State Markov Availability Model
UP
1
DN
0
1
MTTF
1
MTTR
1) Steady-state balance equations for each state:

Rate of flow IN = rate of flow OUT
State1:
State0:
0 1
1
0
2 unknowns, 2 equations, but there is only
one independent
equation.

(Continued)
0 1 1
Need an additional equation:
1
1 1 1 1
1
MTTF
1 Ass
1 1 MTTF MTTR
1
MTTR
1 Ass
MTTF MTTR
Downtime in minutes per year =
MTTR
MTTF MTTR
* 8760*60
Ass 0.99999 1 Ass 10 5 DTMY 5.356 min

(Continued)
2) Transient Availability for each state:

Rate of buildup = rate of flow IN - rate of flow OUT
d 1
0 (t ) 1 (t )
dt
d 1
(1 1 (t )) 1 (t )
since 0 (t ) 1 (t ) 1 we have
dt
d 1
( ) 1 (t )
dt
This equation can be solved to obtain assuming 1(0)=1
( ) t
1 (t ) A(t )

(Continued)
3)
R (t ) e t
4) Steady State Availability:
lim A(t ) Ass

t
Markov availability model

Assume we have a two-component parallel
redundant system with repair rate .
Assume that the failure rate of both the components
is .
When both the components have failed, the system
is considered to have failed.

(Continued)
Let the number of properly functioning components be the

state of the system. The state space is {0,1,2} where 0 is
the system down state.
We wish to examine effects of shared vs. non-shared repair.

(Continued)
2
2
Non-shared (independent)
repair
Shared repair

(Continued)
Note: Non-shared case can be modeled & solved using a RBD

or a FTREE but shared case needs the use of Markov chains.
Steady-state balance equations

For any state:
Rate of flow in = Rate of flow out
Consider the shared case
i: steady state probability that system is in state i
2 2 1
( ) 1 2 2 0
1 0

(Continued)
Hence
Since
We have
or
2
1 1 0
2
0 1 2 1

0 0
1
0
2

1 2
2
0 1
2

(Continued)
Steady-state unavailability = 0= 1 - Ashared
Similarly for non-shared case,
steady-state unavailability = 1 - Anon-shared
Downtime in minutes per year = (1 - A)* 8760*60
1 Anon shared
2
2
1
2

A larger example
Return to the 2 control and 3 voice channels example and
assume that the control channel failure rate is c, voice channel
failure rate is v.
Repair rates are c and v, respectively. Assuming a single
shared repair facility and control channel having preemptive
repair priority over voice channels, draw the state diagram of a
Markov availability model. Using SHARPE GUI, solve the
Markov chain for steady-state and instantaneous availability.
WFS Example
A Workstations-Fileserver
Example
Computing system consisting of:
A file-server
Two workstations
Computing network connecting them
System operational as long as:

One of the Workstations
and
The file-server are operational
Computer network is assumed to be faultfree
The WFS Example
Markov Chain for WFS Example

Assuming exponentially distributed times to
failure
w : failure rate of workstation
f : failure rate of file-server
Assume that components are repairable

w: repair rate of workstation
f: repair rate of file-server
File-server has (preemptive) priority for

repair over workstations (such repair priority
cannot be captured by non-state-space
models)
Markov Availability Model for

WFS
w
2w
2,1
1,1
0,1
w
f
w
f
2w
2,0
1,0
0,0
Since all states are reachable from every other states, the
CTMC is irreducible. Furthermore, all states are positive
recurrent.

WFS (Continued)
In this figure, the label (i,j) of each state is
interpreted as follows: i represents the number of
workstations that are still functioning and j is 1
or 0 depending on whether the file-server is up
or down respectively.
Markov Model
Let {X(t), t > 0} represent a finite-state
Continuous Time Markov Chain (CTMC)
with state space .
Infinitesimal Generator Matrix Q = [qij]:
qij (i !j) : transition rate from state i to state j
qii = - qi=
j i qij , the diagonal element

WFS (Continued)
For the example problem, with the states ordered
as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0) the Q
matrix is given by:
f
2w
0
0
0
( f 2 w )
)
0
2
0
0
f
f
w
w
0
( w f w )
f
w
0
= w
0
0
)
0
f
f
w
w
0
0
w
0
( w f ) f
0
0
0
0
f
f
Markov Model (steady-state)

: Steady-state probability vector
Q 0,
( ( 2,1) , ( 2, 0) , (1,1) , (1,0 ) , ( 0,1) , ( 0, 0) )

These are called steady-state balance
equations
rate of flow
, in = rate of flow out
after solving A
for
SS
availability
obtain
( 2,1) Steady-state
(1,1)
Markov Model (transient)

(t):transient state probability vector
(0): initial probability vector of the CTMC
Transient behavior described by the
Kolmogorov differential equation (KDE):
d
(t ) (t )Q,
dt
given (0)

We compute the availability of the system:System is
available as long as it is in states (2,1) and (1,1).
Instantaneous availability of the system:
A (t ) ( 2 ,1) (t ) (1,1) (t )
Ass
lim
A
(
t
)
Availability (Continued)
Interval Availability:
Steady-State Availability:
There are three kinds of Availabilities!

Instantaneous, Interval & Steady-state
A (t )
I
A( x) dx
t
Expected uptime in (0, t ]
ASS lim A(t ) lim AI (t )

t

(Continued)
t
Define L(t ) (u )du

o
L(i,j)(t): Expected Total Time Spent in State (i,j)

during (0,t)
Integrating the KDE, we get the LTODE:
d
L(t ) L(t )Q (0) ,
dt
L ( 0) 0
Interval availability
AI (t )
L( 2,1) (t ) L(1,1) (t )
t
Markov Availability Model Results
w 0.0001 hr 1 , f 0.00005 hr 1 , w 1.0 hr 1 , f 0.5 hr 1
Ass 0.9999
2-component Availability model with

finite Detection delay
2-component availability model
Steady state availability Ass = 1-0
Failure detection stage takes random time, EXP()
Down states are 0 and 1D Ass = 1- 0- 1D

Therefore, steady state unavailability U() is given by
Redundant System with Finite

Detection Switchover Time
After solving the Markov model, we obtain steady-state probabilities:
Can solve in closed-form or using SHARPE
2 , 1D , 1 , 0
Asys 2 1 (or 1D )
Closed-form

2
2

0 [1
] 1

( ) 22
0
1
E

1
E
2
1
1D
( ) E
2

1
2
22 E
1
A 2 1 r 1 D
2

2
( 2
r
)/ E
2
( )
2-component availability model

with imperfect coverage
Coverage factor = c (conditional probability
that the fault is correctly handled)
1C state is a reboot (down) state.
2-components availability
model : delay + imperfect
coverage
Model has detection delay + imperfect

coverage
Down states are 0, 1C and 1D.
Modeling Software Faults

Operating System Failure
Availability model with hardware and software (OS)
redundancy; operational phase; Heisenbugs
Probability & Statistics with Reliability,
Queuing and Computer Science Applications
(2nd ed.)
K. S. Trivedi
John Wiley, 2001.
Assumptions
Hardware failures are permanent

A repair or replacement action
while OS failures are cleared by a
reboot
Repair or reboot takes place at
rates and for the hardware
and OS, respectively.
Webserver Availability Model

with warm Replication
Two nodes for hardware redundancy
Each node has a copy of the webserver (software
redundancy replication)
Primary node can fail
Secondary node can fail
Primary process can fail
Secondary process can fail
Failures may have imperfect coverage
Time delay for fault detection
Model of a real system developed at Avaya Labs
Modeling Software Faults

Application Failure
Availability model with passive redundancy
(warm replication) of application; Operational phase;
Heisenbugs or hardware transients
Assumptions
A web server software,

that fails at the rate p
running on a machine
that fails at the rate m
Mean time to detect

server process failure 1 and the mean time to
p
detect machine failure
-1m
The mean restart time

of a machine -1m
The mean restart time

of a server -1p
Performance and Reliability Evaluation of Passive Replication Schemes in Application Lev

Fault-Tolerance
S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yagnik
Proc. of the 29th Intl. Symp. On Fault-Tolerant Computing, FTCS-29, June 1999.
Parameters
Process MTTF = 10 days (1/p)
Node MTTF = 20 days (1/n)
Process polling interval = 2 seconds (1/p)
Mean process restart time = 30 seconds (1/p)
Mean process failover time = 2 minutes (1/n)
Switching time with mean 1/ s
C = 0.95
Solution for warm replication
Modeling an N+1 Protection

System
Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model - approximation
method using a 3-stage Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

N = Number of protected units (we use N=1)
= Unit failure rate
= Unit restoration rate
T = deterministic time between routine diagnostics
c = Probability that a protection switch
successfully restores service
d = Probability that a failure in the standby unit is
detected
Outline
Using a Semi-Markov model - approximation
method using a 3-stage Erlang distribution
Hot Standby with different coverages

Normal
(1+1)
(1-d)
(1-c)
(c+d)
Protection
Switch
Failure
Failure to
Detect
Protection
Fault
Simplex
(1)
2
Failed
(0)
Normal:
1
Protection Switch Failure:
2
Simplex:
3
Failure to detect protection fault: 4
Failed:
5
N=1
Diagnostics; Using a rate

approximation
Normal
(1+1)
(1-d)
(1-c)
(c+d)
Protection
Switch
Failure
Simplex
(1)
2
Failed
(0)
Time to diagnostic is
exponentially distributed
with mean T/2
2/T
Failure to
Detect
Protection
Fault
Normal:
1
Protection Switch Failure:
2
Simplex:
3
Failure to detect protection fault: 4
Failed:
5
N=1
Outline
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
1.8
Comparison of
probability density functions (pdf)
1.6
1.4
1.2
1
pdf
3-stage Erlang pdf

U(0,1) pdf
0.8
0.6
0.4
0.2
0
T=1
time
1.2
Comparison of cumulative distribution

functions (cdf)
cdf
0.8
3-stage Erlang cdf
0.6
U(0,1) cdf
0.4
0.2
T=1
time

Normal
(1+1)
(1-d)
(1-c)
(c+d)
Protection
Switch
Failure
Time to
diagnostic is
uniformly
distributed over
(0,T) approximated
by a 3-stage
Erlang with
mean T/2
Simplex
(1)
s1
s2
6/T
Failure to
Detect
Protection
Fault
6/T
Failed
(0)
6/T
Outline
Erlang distribution
Using a Semi-Markov model approximation method using an Erlang

distribution (N=1) E(t) -> 3-stage Erlang distribution
given by,
31
Normal
(1+1)
(1-d)
(1-c)
(c+d)
Time to
diagnostic is
uniformly
distributed over
(0,T) approximated
by a 3-stage
Erlang
distribution
with mean T/2
Protection
Switch
Failure
Simplex
(1)
Failure to
Detect
Protection
E(t)
Fault
Failed
(0)
k 0
( T6 t ) k
k!
T6 t
Outline
Erlang distribution
Using Equations of the underlying

Semi-Markov Process
Steady state solution
One step transition probability matrix, P of the
embedded DTMC
0
0
1-c
2
0
0
0
0
cd
2
0
T
1
(
1
e
)
T
1
1-d
2
0
0
0
0
1
T
(1 e
0

Semi-Markov Process (Continued)
Solve v vP to obtain,
v v,
1 c
1 2
v,
1 d
1 2
v,
v1 , (
(1 c )
2( )
1 d
2
(1
where v1 1 1c 1d (1c ) 1 1d (1 1 (1e T ))

2
2 ( )
1
T
(1 e
)))v1

Time to the next diagnostic is uniformly distributed over (0,T)
H i (t ) : CDF of the sojourn time in state i

H1 (t ) 1 e 2 t ,
H 3 (t ) 1 e
( )t
H 2 (t ) 1 e ( )t ,
,
H 5 (t ) 1 e
tT
1,
tT
H 4 (t )
1 (1 Tt )e t ,
2 t

hi : mean sojourn time in state i [1-H i(t)]dt

0
h1
1
2
, h2
, h3
, h4
1
1
T 2
(1 e
State probabilities of the SMP are given by,
vi hi
5
v jhj
j 1
Unavailability 2 5
), h5
1
2
Outline
Erlang distribution

Parameter values assumed:
N=1
c = 0.9
d = 0.9
= 0.0001 / hour
= 1 / hour
T = 1 hour
Results obtained
Steady state availability
Probability of being in states Normal,

Simplex, or Failure to Detect Protection Fault
Steady state unavailability
Probability of being in states Protection Switch

Failure, or Failed (0)
Average downtime in steady state
Steady state unavailability * Number of minutes

in a year
Average #units available

2*PNormal + 1*PSimplex +1*PFailuretoDetectProtectionFault
Diagnostic start time

approximation
Steady state availability
Steady state
unavailability
Avg. downtime in
steady state
(Minutes/year)
Avg. #units available (out of 1 + 1

spare)
Exp(2/T)
9.99989992e-01
1.00075983e-05
5.25999
1.99977503e+00
3-stage Erlang with mean

T/2
9.99989992e-01
1.00075983e-05
5.25999
1.99977503e+00
Semi-Markov (3-stage
Erlang approx. with mean
T/2)
9.99989992e-01
1.00075983e-05
5.25999
1.99977503e+00
Semi-Markov Process
equations-U([0,T])
9.99989992e-01
1.00075983e-05
5.25999
1.99977503e+00
Markov Reliability Model
Markov reliability model

with repair
Consider the 2-component parallel system (no delay +

perfect cov) but disallow repair from system down state
Note that state 0 is now an absorbing state. The state
diagram is given in the following figure.
This reliability model with repair cannot be modeled using
a reliability block diagram or a fault tree. We need to
resort to Markov chains. (This is a form of dependency
since in order to repair a component you need to know the
status of the other component).

with repair (Continued)
Absorbing state
Markov chain has an absorbing state. In the

steady-state, system will be in state 0 with
probability 1. Hence transient analysis is of
interest. States 1 and 2 are transient states.

Assume that the initial state of the Markov chain
is 2, that is, 2(0) = 1, k (0) = 0 for k = 0, 1.
Then the system of differential Equations is written
based on:
rate of buildup = rate of flow in - rate of flow out
for each state

d 2 (t )
2 2 (t ) 1 (t )
dt
d 1 (t )
2 2 (t ) ( ) 1 (t )
dt
d 0 (t )
1 (t )
dt

After solving these equations, we get
R(t) = 2(t) +1(t)
Recalling that MTTF R(t ) dt

0
MTTF
2
2 2
, we get:

Note that the MTTF of the two component
parallel redundant system, in the absence
of a repair facility (i.e., = 0), would have
been equal to the first term,
3 / ( 2* ), in the above expression.
Therefore, the effect of a repair facility is to
increase the mean life by / (2*2), or by a
factor (1 22 ) 1
3
Markov Reliability Model with

Repair ( WFS Example)
Assume that the computer system does not recover if
both workstations fail, or if the file-server fails
Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states while (2,1) and (1,1) are transient states.
Note: we have made a simplification that, once the CTMC reaches a system failure state, we
do not allow any more transitions.
Markov Model with Absorbing

States
If we solve for (2,1)(t) and 1,1)(t) then
R(t)= (2,1)(t) + 1,1)(t)
For a Markov chain with absorbing states:
A: the set of absorbing states
B = - A: the set of remaining states
i,j): Mean time spent in state i,j until absorption
(i , j )
( i , j ) ( x ) dx ,
QB B (0)
QB B (0)
(i, j ) B
Markov Model with Absorbing

States (Continued)
QB derived from Q by restricting it to only
states in B
Mean time to absorption MTTA is given as:
MTTA
( i , j )B
(i , j )

Repair (Continued)
QB
( f 2w )
2w
( w f w )
First Solve

Repair (Continued)
Then :
next solve
Then :
Mean time to failure is 19992 hours.

without Repair
Assume that neither workstations nor fileserver is repairable

without Repair (Continued)
States (0,1), (1,0) and (2,0)

become absorbing states

without Repair (Continued)
QB
( f 2 w )
2w
( f w )
Mean time to failure is 9333 hours.

with Imperfect Coverage
Markov model
Next consider a modification of the above
example proposed by Arnold as a model of
duplex processors of an electronic
switching system. We assume that not all
faults are recoverable and that c is the
coverage factor which denotes the
conditional probability that the system
recovers given that a fault has occurred.
The state diagram is now given by the
following picture:
Now allow for Imperfect

coverage
c
Markov model
(Continued)
Assume that the initial state is 2 so that:
2 (0) 1, 0 (0) 1 (0) 0
Then the system of differential equations are:

d2 (t )
2c2 (t ) 2 (1 c) 2 (t ) 1 (t )
dt
d1 (t )
2c2 (t ) ( ) 1 (t )
dt
d0 (t )
2 (1 c) 2 (t ) 1 (t )
dt
Markov model
(Continued)
After solving the differential equations we obtain:

R(t)=2(t) + 1(t)
From R(t), we can system MTTF:
(1 2c)
MTTF
2[ (1 c)]
It should be clear that the system MTTF and system reliability are
critically dependent on the coverage factor.

Markov 123

Uploaded by

Copyright:

Available Formats

Markov 123

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Markov 123

Uploaded by

Copyright:

Available Formats

Probability and Statistics with

Reliability, Queuing and Computer

Non-State Space Models

State-Space-Based Model (Continued)

For continuous-time Markov chains (CTMCs)

A CTMC is characterized by state changes that can

Continuous Time Markov Chain

If v = 0 in the above equation, we get

is a (time-)homogenous CTMC iff

Or, the conditional pmf satisfies:

A CTMC is said to be irreducible if every state can

Note that these transition probabilities are functions of

the rate from state i to state j at time t:

Dividing both sides by h and taking the limit,

Kolmogorov Differential Equation

In the matrix form, (Matrix Q is called the infinitesimal

CTMC Steady-state Solution

Steady state solution of CTMC obtained by solving the

Irreducible CTMCs with all states recurrent non-null will have

Measures of interest may be computed by assigning reward rates to states and

Expected accumulated reward (over an interval of time)

Lj(t) is the expected time spent in state j during (0,t)

Markov Availability Model

2-State Markov Availability Model

1) Steady-state balance equations for each state:

2-State Markov Availability Model

Need an additional equation:

Downtime in minutes per year =

Ass 0.99999 1 Ass 10 5 DTMY 5.356 min

2-State Markov Availability Model

2) Transient Availability for each state:

2-State Markov Availability Model

4) Steady State Availability:

lim A(t ) Ass

Markov availability model

Markov availability model

Let the number of properly functioning components be the

Markov availability model

Markov availability model

Note: Non-shared case can be modeled & solved using a RBD

Steady-state balance equations

i: steady state probability that system is in state i

Steady-state balance equations

Steady-state balance equations

Downtime in minutes per year = (1 - A)* 8760*60

Steady-state balance equations

System operational as long as:

Computer network is assumed to be faultfree

The WFS Example

Markov Chain for WFS Example

Assume that components are repairable

File-server has (preemptive) priority for

Markov Availability Model for

Markov Availability Model for

j i qij , the diagonal element

Markov Availability Model for

Markov Model (steady-state)

( ( 2,1) , ( 2, 0) , (1,1) , (1,0 ) , ( 0,1) , ( 0, 0) )

Markov Model (transient)