Edited by
Süleyman Özekici
Department of Industrial Engineering
Boğaziçi University
80815 Bebek-İstanbul, Turkey
In cooperation with
Erhan Çınlar
Princeton University
Frank Van der Duyn Schouten
Tilburg University
Nozer D. Singpurwalla
The George Washington University
Jack P.C. Kleijnen
Tilburg University
Rommert Dekker
Erasmus University Rotterdam
Springer
Proceedings of the NATO Advanced Study Institute on Current Issues
and Challenges in the Reliability and Maintenance of Complex Systems,
held in Kemer-Antalya, Turkey, June 12-22, 1995.
1. Introduction
Reliability theory has attracted much interest in recent times. This becomes
evident if one considers the number of publications in this field. Such numbers
are available in the MATH DATABASE of STN International, the Scientific &
Technical Information Network. This database is the online version of
Zentralblatt für Mathematik/Mathematics Abstracts and contains all entries in
the Zentralblatt since 1972. The following table shows the number of
publications from 1972 up to and including 1994 for some keywords:
establishes the link between the cumulative hazard and the survival function.
Modeling in reliability theory is mainly concerned with additional information
about the state of a system, which can be gathered during the operating time
of the system. This additional information leads to updated predictions about
the proneness of the system to failure. There are many ways to introduce such
additional information into the model. In a general setting Arjas (1993) uses
marked point processes to describe this flow of information in an instructive
way. In the following, some examples of how to introduce additional
information are given.
[Figure: block diagram of a three-component system, component 1 in series with the parallel pair of components 2 and 3]
the single components. As long as all components are intact only a failure of
component 1 leads to system failure. If one of the components 2 or 3 fails
first then the next component failure is a system failure.
Under the classical assumption that all components work independently,
i.e., the random variables Xi, i = 1, ... , n are independent, the investigations
concentrate on the following problems:
- Determining the system lifetime distribution from the known component
lifetime distributions or finding at least bounds for this distribution.
- Are certain properties of the component lifetime distributions like IFR
(increasing failure rate: $\lambda(t)$ nondecreasing) or IFRA (increasing
failure rate average: $(1/t)\Lambda(t)$ nondecreasing) preserved by forming
monotone systems? One of these closure theorems states, for example, that the
distribution of the system lifetime is IFRA if all component lifetimes have
IFRA distributions.
- In what way does a certain component contribute to the function of the
whole system? The answer to this question leads to the definition of several
importance measures. A short survey of the importance of components in
a monotone coherent system has been given by Natvig (1988).
A basic reference for monotone coherent systems is still the book of Barlow
and Proschan (1975). More recent related publications, which contain many
generalizations, are Aven (1992) and Shaked and Shanthikumar (1990). For the
state of the art we refer to the contributions of Aven (1996a, 1996b), Van der
Duyn Schouten (1996) and Özekici (1996a, 1996b) to this volume.
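For the three-component example above (component 1 in series with the parallel pair of components 2 and 3), the system reliability can be computed from independent component distributions by summing the structure function over all component states. A minimal sketch (function and parameter names are our own; here the arguments of the structure function are 1 if the component works):

```python
import itertools
import math

def phi(x1, x2, x3):
    """Structure function: the system works iff component 1 works and (2 or 3) works."""
    return x1 and (x2 or x3)

def system_reliability(t, rels):
    """P(system works at t) for independent components with survival functions rels."""
    r = [f(t) for f in rels]
    total = 0.0
    for states in itertools.product([0, 1], repeat=len(r)):
        if phi(*states):
            p = 1.0
            for s, ri in zip(states, r):
                p *= ri if s else (1.0 - ri)
            total += p
    return total
```

For i.i.d. Exp(1) components this reduces to the closed form $2r^2 - r^3$ with $r = e^{-t}$, which can serve as a check of the enumeration.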
Additional information about the lifetime $\zeta$ can also be introduced into the
model in a quite different way. If the state or damage of the system at time
$t \in \mathbb{R}_+$ can be observed and this damage is described by a random variable
$X_t$, then the lifetime of the system
$$\zeta = \inf\{t \in \mathbb{R}_+ : X_t \geq S\}$$
6 Uwe Jensen
can be defined as the first time the damage hits a given level $S$. Here $S$ can be
a constant or, more generally, a random variable independent of the damage
process. Some examples of damage processes $X = (X_t)$ of this kind are the
following.
Wiener Process. The damage process is a Wiener process with positive
drift starting at 0, and the failure threshold $S$ is a positive constant. Then
the lifetime of the system is known to have an inverse Gaussian distribution.
Models of this kind are especially of interest if one considers different
environmental conditions under which the system is working, as for example in
so-called burn-in models. Accelerated aging caused by additional stress or
different environmental conditions can be described by a change of time. Let
$\tau : \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing function; then $Z_t := X_{\tau(t)}$ denotes the actually
observed damage. The time transformation $\tau$ drives the speed of the
deterioration. Following Doksum (1991), one possible way to express different
stress levels $\beta_i$ in time intervals $[t_i, t_{i+1})$, $0 = t_0 < t_1 < \cdots < t_k$,
$i = 0, 1, \ldots, k-1$, $k \in \mathbb{N}$, is the choice
$$t \mapsto \tau(t) = \sum_{j=0}^{i-1} \beta_j (t_{j+1} - t_j) + \beta_i (t - t_i), \qquad t \in [t_i, t_{i+1}).$$
In this case it is easily seen that if $F_0$ is the inverse Gaussian distribution
function of $\zeta = \inf\{t \in \mathbb{R}_+ : X_t \geq S\}$, and $F$ is the distribution function of the
lifetime $\zeta_a = \inf\{t \in \mathbb{R}_+ : Z_t \geq S\}$ under accelerated aging, then $F(t) = F_0(\tau(t))$.
Some further references on accelerated aging can be found in Doksum (1991).
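The piecewise-linear time transformation above translates directly into code. A minimal sketch (function and parameter names are our own), assuming stress level `betas[i]` applies on the interval `[ts[i], ts[i+1])`:

```python
import bisect

def tau(t, ts, betas):
    """Piecewise-linear time transformation:
    tau(t) = sum_{j<i} beta_j (t_{j+1} - t_j) + beta_i (t - t_i) for t in [t_i, t_{i+1}),
    with ts = [t_0, ..., t_k], t_0 = 0, and stress levels betas = [beta_0, ..., beta_{k-1}]."""
    # locate the interval index i containing t (last level applies beyond t_{k-1})
    i = min(bisect.bisect_right(ts, t) - 1, len(betas) - 1)
    return sum(betas[j] * (ts[j + 1] - ts[j]) for j in range(i)) + betas[i] * (t - ts[i])
```

With all stress levels equal to 1 the transformation is the identity; doubling a level doubles the slope of $\tau$ on that interval, which accelerates the observed deterioration accordingly.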
The failure time distribution for damage processes with more general failure
thresholds is investigated by Domine (1996), among others. A generalization
in another direction is to consider a random time change, which means that $\tau$
is a stochastic process. By this, randomly varying environmental conditions
can be modeled. This idea has been developed by Çınlar (1984) for semi-Markov
processes and further by Çınlar and Özekici (1987) and by Çınlar et al. (1989).
Compound Point Processes. Processes of this kind describe so-called
shock processes, where the system is subjected to shocks which occur from
time to time and add a random amount to the damage. The successive
times of occurrence of shocks, $T_n$, are given by an increasing sequence
$0 < T_1 \leq T_2 \leq \cdots$ of random variables, where the inequality is strict unless
$T_n = \infty$. Each time point $T_n$ is associated with a real-valued random mark
$M_n$ which describes the additional damage caused by the $n$-th shock. The
marked point process is denoted $(T, M) = (T_n, M_n)_{n \in \mathbb{N}}$. From this marked
point process the corresponding compound point process $X$ with
$$X_t = \sum_{n=1}^{\infty} I_{\{T_n \leq t\}} M_n$$
can be derived.
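A compound-point-process lifetime is easy to simulate. The sketch below is our own illustration (the choices of Poisson shock times and exponentially distributed marks are assumptions, not fixed by the model); it returns the first time the accumulated damage reaches the threshold $S$:

```python
import random

def shock_lifetime(shock_rate, mark_mean, S, rng):
    """First time the cumulative damage X_t = sum_{T_n <= t} M_n reaches level S."""
    t, damage = 0.0, 0.0
    while damage < S:
        t += rng.expovariate(shock_rate)             # next shock time T_n
        damage += rng.expovariate(1.0 / mark_mean)   # mark M_n added by the shock
    return t

rng = random.Random(0)
samples = [shock_lifetime(1.0, 1.0, 5.0, rng) for _ in range(2000)]
```

Raising the threshold $S$ stochastically increases the lifetime, as the simulated means readily confirm.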
Stochastic Models of Reliability and Maintenance: An Overview 7
1.3 Maintenance
In the last two subsections various ways of modeling the lifetime of a technical
system by introducing additional information were described. In addition to
such models it is often useful to take maintenance actions into account to
prolong the lifetime, to increase the availability and to reduce the probability
of an unpredictable failure. The most important maintenance actions include:
- Preventive replacements of parts of the system or of the whole system
- Providing spare parts
- Providing repair facilities
- Inspections to check the state of (parts of) the system if not observed
continuously
Taking maintenance actions into account leads, depending on the specific
model, to one of the following problem fields.
Availability Analysis. If the system or parts of it are repaired or replaced
when failures occur the problem is to characterize the performance of the
system. Different measures of performance can be defined, as for example:
- The probability that the system is intact at a certain time point or in a
given time interval
- The mean time to first failure of the system
- The probability distribution of the downtime of the system in a given time
interval.
Of course, a lot of other measures and generalizations of the above ones
have been investigated. An overview of different performance measures for
monotone systems is given by Aven (1996a) in his contribution to this volume.
Optimization Models. If a valuation structure is given, i.e. costs of re-
placements, repairs, downtime ... and gains, then one is naturally led to the
problem of planning the maintenance action so as to minimize (maximize) a
given cost (gain) criterion. Examples of such criteria are expected costs per
unit time or total expected discounted costs. Surveys of these models can be
found in the review articles mentioned below.
One can imagine that thousands of models (and papers) can be created by
combining the different types of lifetime models with different maintenance
actions. Instead of providing a long and inevitably incomplete list of
references, some of the surveys and review articles will be mentioned.
In Sections 1.1 and 1.2 it was pointed out in what way additional information
can lead to a reliability model. But it is also important to note that in one
and the same model different observation levels are possible, i.e. the amount
of actual available information about the state of a system may vary. So for
example in optimization models the optimal strategy will strongly depend on
the available information. The following two examples will show the effect of
different degrees of information.
Simpson's paradox. This paradox says that if one compares the death
rates in two countries, say A and B, then it is possible that the crude overall
death rate in country A is higher than in B although all age-specific death
rates in B are higher than in A. This can be transferred to reliability in the
following sense. Considering a two-component parallel system, the failure rate
of the system lifetime may increase although the component lifetimes have
decreasing failure rates. The following proposition, which can be proved by
some elementary calculations, yields an example of this.
Under appropriate conditions on the parameters, the failure rate $\lambda$ of the
lifetime $\zeta$ increases, whereas the component lifetimes $X_i$ have decreasing
failure rates.
This example shows that it makes a great difference whether only the
system lifetime can be observed (aging property: IFR) or additional
information about the component lifetimes is available (aging property: DFR).
In addition, one may also notice that the aging property of the system lifetime
of a complex system does not only depend on the joint distribution of the
component lifetimes but of course also on the structure function. Consider,
instead of a two-component parallel system, a series system where the
component lifetimes have the same distributions as in the proposition. Then
the failure rate of $\zeta_{ser} = X_1 \wedge X_2$ decreases, whereas $\zeta_{par} = X_1 \vee X_2$ has an
increasing failure rate.
Predictable Lifetime. The Wiener process $X = (X_t)_{t \in \mathbb{R}_+}$ with positive
drift $\mu$ and variance scaling parameter $\sigma$ serves, as mentioned before, as a
popular damage threshold model. $X$ can be represented as $X_t = \sigma B_t + \mu t$,
where $B$ is standard Brownian motion. If one assumes that the failure level $S$
$$Z_t = Z_0 + \int_0^t f_s\,ds + M_t, \tag{2.1}$$
where $f = (f_t)_{t \in \mathbb{R}_+}$ is a stochastic process with $E\bigl(\int_0^t |f_s|\,ds\bigr) < \infty$ for all
$t \in \mathbb{R}_+$, $E|Z_0| < \infty$, and $M = (M_t)_{t \in \mathbb{R}_+}$ is a martingale which starts in 0:
$M_0 = 0$. A martingale is the mathematical model of a fair game with constant
expectation function: $EM_0 = 0 = EM_t$ for all $t \in \mathbb{R}_+$. Since the drift part
in the above decomposition is continuous, a process $Z$ which admits such a
representation is called a smooth semimartingale, or smooth $\mathbb{F}$-semimartingale
if one wants to emphasize that $Z$ is adapted to the filtration $\mathbb{F}$. For details
and basic results concerning smooth semimartingales see Jensen (1989).
First let us consider the simple indicator process $Z_t = I_{\{\zeta \leq t\}}$, where $\zeta$
is the lifetime random variable defined on the basic probability space. The
paths of this indicator process are constant, except for one jump from 0 to
1 at $\zeta$. The general model now simply consists of the assumption that this
indicator process has a smooth $\mathbb{F}$-semimartingale representation:
$$I_{\{\zeta \leq t\}} = \int_0^t I_{\{\zeta > s\}} \lambda_s\,ds + M_t. \tag{2.2}$$
The process $\lambda = (\lambda_t)_{t \in \mathbb{R}_+}$ is called the failure rate or hazard rate process, and the
compensator $A_t = \int_0^t I_{\{\zeta > s\}} \lambda_s\,ds$ is called the hazard process. Before investigating
under what conditions such a representation exists, some examples are given.
Example 2.1. If the failure rate process $\lambda$ is deterministic, then forming
expectations leads to the integral equation
$$\bar F(t) = 1 - \int_0^t \bar F(s)\,\lambda_s\,ds.$$
The unique solution $\bar F(t) = \exp\{-\int_0^t \lambda_s\,ds\}$ is just equation (1.1). This shows
that if the hazard rate process $\lambda$ is deterministic, then it coincides with the
standard failure rate.
Example 2.2. In continuation of the example of a 3-component complex
system in Section 1.1, it is assumed that the component lifetimes $X_1, X_2, X_3$
are i.i.d. exponentially distributed with parameter $\alpha > 0$. What is the failure
rate process corresponding to the lifetime $\zeta = X_1 \wedge (X_2 \vee X_3)$? This depends
on the information level, i.e. the filtration $\mathbb{F}$.
- $\mathcal{F}_t = \sigma(I_{\{X_1 \leq s\}}, I_{\{X_2 \leq s\}}, I_{\{X_3 \leq s\}}, 0 \leq s \leq t)$. Observing on the component
level means that $\mathcal{F}_t$ is generated by the indicator processes of the component
lifetimes up to time $t$. It can be shown that the failure rate process of
the system lifetime is given by $\lambda_t = \alpha(1 + I_{\{X_2 \leq t\}} + I_{\{X_3 \leq t\}})$. As long as
all components work, the rate is $\alpha$, due to component 1. When one of the
two parallel components 2 or 3 fails first, the rate switches to $2\alpha$.
- $\mathcal{F}_t = \sigma(I_{\{\zeta \leq s\}}, 0 \leq s \leq t)$. If only the system lifetime can be observed, then
the failure rate process reduces to the ordinary deterministic failure rate
where $\Gamma(t)$ is the set of critical components at time $t$, the failure of which
would immediately result in a system failure, i.e. $i \in \Gamma(t)$ if and only if
If $Z$ is a smooth $\mathbb{F}$-semimartingale with representation
$$Z_t = Z_0 + \int_0^t f_s\,ds + M_t,$$
then the projection theorem of filtering theory (see Jensen (1989) for detailed
references) ensures that such a representation also applies to the conditional
expectation $\hat Z$ with $\hat Z_t = E(Z_t\,|\,\mathcal{A}_t)$:
$$\hat Z_t = \hat Z_0 + \int_0^t \hat f_s\,ds + \hat M_t, \tag{2.4}$$
where $\hat f_t$ is some suitable version of the conditional expectation $E(f_t\,|\,\mathcal{A}_t)$ and
$\hat M$ is an $\mathbb{A}$-martingale. This projection theorem can be applied to the lifetime
indicator process $V_t = I_{\{\zeta \leq t\}}$ with representation (2.2). If the lifetime can be
observed, i.e. $\{\zeta \leq s\} \in \mathcal{A}_t$ for all $0 \leq s \leq t$, which is assumed throughout,
then the change of the information level from $\mathbb{F}$ to $\mathbb{A}$ leads from (2.2) to the
representation
The following example from Heinrich and Jensen (1992) illustrates the role
of partial information.
Example 2.5. Consider a two-component parallel system with i.i.d. random
variables $X_i$, $i = 1, 2$, describing the component lifetimes, which follow an
exponential distribution with parameter $\alpha$. Then the system lifetime is
$\zeta = X_1 \vee X_2$ and the "complete information" filtration is given by
$\mathcal{F}_t = \sigma(I_{\{X_1 \leq s\}}, I_{\{X_2 \leq s\}}, 0 \leq s \leq t)$. In this case the $\mathbb{F}$-semimartingale
representation (2.2) is given by
$$\mathcal{A}_t = \begin{cases} \sigma(I_{\{\zeta \leq s\}},\ 0 \leq s \leq t) & \text{for } 0 \leq t < h, \\ \sigma(I_{\{\zeta \leq s\}},\ I_{\{X_1 \leq u\}},\ I_{\{X_2 \leq u\}},\ 0 \leq s \leq t,\ 0 \leq u \leq t-h) & \text{for } t \geq h, \end{cases}$$
$$\hat\lambda_t = \begin{cases} 2\alpha\bigl(1 - (2 - e^{-\alpha t})^{-1}\bigr) & \text{for } 0 \leq t < h, \\ \alpha\bigl(2 - I_{\{X_1 > t-h\}}e^{-\alpha h} - I_{\{X_2 > t-h\}}e^{-\alpha h}\bigr) & \text{for } t \geq h. \end{cases}$$
c) Information about $\zeta$ and $X_1$:
$$\mathcal{A}_t = \sigma(I_{\{\zeta \leq s\}},\ I_{\{X_1 \leq s\}},\ 0 \leq s \leq t), \qquad \hat\lambda_t = \alpha\bigl(I_{\{X_1 \leq t\}} + I_{\{X_1 > t\}} P(X_2 \leq t,\ \ldots)\bigr).$$
d) Information only about $\zeta$:
$$\mathcal{A}_t = \sigma(I_{\{\zeta \leq s\}},\ 0 \leq s \leq t), \qquad \hat\lambda_t = 2\alpha\bigl(1 - (2 - e^{-\alpha t})^{-1}\bigr).$$
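The deterministic black-box rate in part d) can be checked against the elementary definition $\lambda(t) = f(t)/(1 - F(t))$ for $\zeta = X_1 \vee X_2$ with i.i.d. Exp($\alpha$) components. A quick numerical sketch (function names are ours):

```python
import math

def hazard_from_distribution(t, a):
    """Hazard rate of max(X1, X2), X_i i.i.d. Exp(a), via f(t) / (1 - F(t))."""
    F = (1.0 - math.exp(-a * t)) ** 2                      # P(max <= t)
    f = 2.0 * a * math.exp(-a * t) * (1.0 - math.exp(-a * t))  # density
    return f / (1.0 - F)

def hazard_closed_form(t, a):
    """The formula 2a(1 - (2 - e^{-a t})^{-1}) from the example."""
    return 2.0 * a * (1.0 - 1.0 / (2.0 - math.exp(-a * t)))
```

Both expressions agree for all $t > 0$, since $1 - F(t) = e^{-\alpha t}(2 - e^{-\alpha t})$.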
The failure rate corresponding to $\mathbb{A}^d$ in part d) of this example is the standard
deterministic failure rate, because $\{\zeta > t\}$ is an atom of $\mathcal{A}_t$, so that $\hat\lambda_t$
can always be chosen to be deterministic on $\{\zeta > t\}$. Example 2.1 showed
that such deterministic failure rates satisfy the well-known exponential
formula (1.1). One might ask under what conditions such an exponential formula
extends also to random failure rate processes. This question was referred to
briefly in Arjas (1989) and answered in Yashin and Arjas (1988) to some
extent. The following treatment differs slightly in that the starting point is the
basic model (2.2). The failure rate process $\lambda$ is assumed to be observable on
some level $\mathbb{A}$, i.e. $\lambda$ is adapted to that filtration. This observation level can be
somewhere between the trivial filtration $\mathbb{G} = (\mathcal{G}_t)_{t \in \mathbb{R}_+}$, $\mathcal{G}_t = \{\emptyset, \Omega\}$, which
does not allow for any random information, and the basic complete information
filtration $\mathbb{F}$. So $\zeta$ itself need not be observable at level $\mathbb{A}$ (and should
not, if we want to arrive at an exponential formula). Using the projection
theorem one obtains
$$\bar F_t = 1 - \int_0^t \bar F_{s-} \lambda_s\,ds - M_t.$$
Under mild conditions an $\mathbb{A}$-martingale $L$ can be found such that $M$ can
be represented as the (stochastic) integral $M_t = \int_0^t \bar F_{s-}\,dL_s$. With the
semimartingale $Z$, $Z_t = -\int_0^t \lambda_s\,ds - L_t$, equation (2.6) becomes
$$\bar F_t = 1 + \int_0^t \bar F_{s-}\,dZ_s.$$
The unique solution of this integral equation is given by the so-called Doléans
exponential
$$\bar F_t = \exp\{Z_t^c\} \prod_{s \leq t} (1 + \Delta Z_s),$$
where $Z^c$ denotes the continuous part of $Z$ and $\Delta Z_s$ its jumps.
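For a deterministic, continuous $\lambda$ the Doléans exponential has no jump factors and reduces to the exponential formula $\bar F(t) = \exp\{-\int_0^t \lambda_s\,ds\}$ of Example 2.1. A short Euler-scheme sketch (rate function and step count are illustrative choices of ours) confirms numerically that this solves $\bar F_t = 1 - \int_0^t \bar F_s \lambda_s\,ds$:

```python
import math

def survival_euler(lam, t, n=100000):
    """Solve F(t) = 1 - integral_0^t F(s) lam(s) ds by the explicit Euler scheme."""
    dt = t / n
    F = 1.0
    for i in range(n):
        F -= F * lam(i * dt) * dt
    return F

lam = lambda s: s                       # increasing (IFR) rate
approx = survival_euler(lam, 2.0)       # numerical solution at t = 2
exact = math.exp(-2.0 ** 2 / 2.0)       # exp{-∫₀² s ds} = exp(-t²/2)
```

The two values agree to the accuracy of the Euler discretization.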
It is (perhaps) no surprise that the total lifetime after a black box minimal
repair is stochastically greater than after a physical minimal repair:
$$N_t = \sum_{n=1}^{\infty} I_{\{T_n \leq t\}}$$
counts the number of minimal repairs up to time $t$ and is adapted to some
filtration $\mathbb{F}$. Similar to the failure time model (2.2), it is now assumed that $N$
has an absolutely continuous compensator:
$$N_t = \int_0^t \lambda_s\,ds + M_t, \tag{3.1}$$
where $M'$ is the stopped martingale $M$, $M'_t = M_{t \wedge T_1}$. The time to first failure
corresponds to the original lifetime $\zeta = T_1$.
Example 3.2. Different types of minimal repair processes are characterized
by different intensities $\lambda$.
a) Poisson process with constant intensity $\lambda_t \equiv \lambda$. The times between
successive 'minimal' repairs are independent $\mathrm{Exp}(\lambda)$ distributed random
variables. This is the simple case in which repairs have the same effect as
replacements with new items.
b) If in a) the intensity is not constant but a random variable $\lambda(\omega)$ which is
known at the time origin ($\lambda$ is $\mathcal{F}_0$-measurable), then the process is called a
doubly stochastic Poisson process or Cox process.
c) If in a) the intensity is not constant but a time-dependent deterministic
function $\lambda_t = \lambda(t)$, then the process is a non-homogeneous Poisson process.
Most attention in the literature on minimal repairs has been paid to this
case of black box minimal repairs, in which, after repairs, the failure
intensity remains the same as if the system had not failed before. In the case of
the parallel system in Example 3.1 one has $\lambda(t) = \frac{2(1 - e^{-t})}{2 - e^{-t}}$.
d) The general case: $\lambda$ is $\mathbb{F}$-adapted. This applies to the physical minimal
repair in Example 3.1: $\lambda_t = I_{\{X_1 \wedge X_2 \leq t\}}$.
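Case c), the black box minimal repair, is a non-homogeneous Poisson process and can be simulated by thinning. The sketch below (our own illustration) uses the rate $\lambda(t) = 2(1-e^{-t})/(2-e^{-t})$ from Example 3.1, which is bounded by 1:

```python
import math
import random

def nhpp_times(lam, lam_max, horizon, rng):
    """Failure (minimal-repair) times of an NHPP with intensity lam(t), by thinning."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam_max)          # candidate point of a rate-lam_max Poisson process
        if t > horizon:
            return times
        if rng.random() * lam_max <= lam(t):   # accept with probability lam(t)/lam_max
            times.append(t)

lam = lambda t: 2.0 * (1.0 - math.exp(-t)) / (2.0 - math.exp(-t))
rng = random.Random(1)
counts = [len(nhpp_times(lam, 1.0, 5.0, rng)) for _ in range(4000)]
```

The expected number of repairs on $[0, 5]$ is $\int_0^5 \lambda(s)\,ds = 5 - \ln(2 - e^{-5})$, which the simulated average approximates.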
Example 3.1 suggests comparing the effects of minimal repairs on different
information levels. However, it seems difficult to define such point processes on
different levels. One possible way is sketched in the following, where
considerations are restricted to the given $\mathbb{F}$-level of the basic model (3.1) and the
'black-box level' $\mathbb{A}$ generated by $\zeta = T_1$: $\mathcal{A}_t = \sigma(I_{\{T_1 \leq s\}}, 0 \leq s \leq t)$. Proceeding
from the representation (3.1), the time to first failure is governed by the
$\mathbb{F}$-hazard rate process $\lambda$ for $t \in (0, \zeta]$. The change to the $\mathbb{A}$-level by
conditioning leads to the failure rate $\hat\lambda$, $\hat\lambda_t = E(\lambda_t\,|\,\mathcal{A}_t)$. As described in
Section 2.2, $\hat\lambda$ can be chosen deterministically. For the time to first failure
we have the two representations. The process
$$N'_t = \sum_{n=1}^{\infty} I_{\{T'_n \leq t\}} = \int_0^t \hat\lambda_s\,ds + M'_t$$
describes the minimal repair process on the $\mathbb{A}$-level. Comparing these two
information levels, Example 3.1 might suggest $EN_t \geq EN'_t$ for all positive
$t$. A general comparison, also for arbitrary subfiltrations, seems to be an open
problem (see Arjas 1989 and Natvig 1990).
Example 3.3. In the two-component parallel system of Example 3.1 we
have the failure rate process $\lambda_t = I_{\{X_1 \wedge X_2 \leq t\}}$ on the component level and
$\hat\lambda_t = \frac{2(1 - e^{-t})}{2 - e^{-t}}$ on the black-box level. So one has two descriptions of the
same random lifetime $\zeta = T_1$:
$$I_{\{T_1 \leq t\}} = \int_0^t I_{\{T_1 > s\}} I_{\{X_1 \wedge X_2 \leq s\}}\,ds + M_t = \int_0^t I_{\{T_1 > s\}} \frac{2(1 - e^{-s})}{2 - e^{-s}}\,ds + \hat M_t.$$
The process $N$ counts the number of minimal repairs on the component
level:
$$N_t = \int_0^t I_{\{X_1 \wedge X_2 \leq s\}}\,ds + M_t.$$
This is a delayed Poisson process, the (repair) intensity of which is equal to 1
after the first component failure. The process $N'$ counts the number of minimal
repairs on the black-box level:
$$N'_t = \int_0^t \frac{2(1 - e^{-s})}{2 - e^{-s}}\,ds + M'_t.$$
To interpret this result one should note that on the component level only the
critical component which caused the system to fail is repaired. A black box
repair, which is a replacement by a system of the same age that has not yet
failed, can be a replacement by a system with both components working.
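The two mean value functions of Example 3.3 are available in closed form: on the component level $EN_t = \int_0^t P(X_1 \wedge X_2 \leq s)\,ds = t - (1 - e^{-2t})/2$, and on the black-box level $EN'_t = \int_0^t \frac{2(1-e^{-s})}{2-e^{-s}}\,ds = t - \ln(2 - e^{-t})$. A two-line sketch makes the suggested ordering $EN_t \geq EN'_t$ easy to check numerically:

```python
import math

def EN_component(t):
    """E N_t = integral_0^t P(X1 ∧ X2 <= s) ds for i.i.d. Exp(1) components."""
    return t - (1.0 - math.exp(-2.0 * t)) / 2.0

def EN_blackbox(t):
    """E N'_t = integral_0^t 2(1 - e^-s) / (2 - e^-s) ds = t - ln(2 - e^-t)."""
    return t - math.log(2.0 - math.exp(-t))
```

The pointwise ordering of the integrands, $1 - e^{-2s} \geq 2(1-e^{-s})/(2-e^{-s})$, makes the comparison hold for every $t$.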
$$q_i = \lim_{h \to 0+} \frac{1}{h} P(Y_h \neq i\,|\,Y_0 = i), \qquad q_{ij} = \lim_{h \to 0+} \frac{1}{h} P(Y_h = j\,|\,Y_0 = i).$$
- The time points of failures (minimal repairs) $0 < T_1 < T_2 < \cdots$ form a
point process and $N = (N_t)_{t \in \mathbb{R}_+}$ is the corresponding counting process:
$$N_t = \sum_{n=1}^{\infty} I_{\{T_n \leq t\}}.$$
It is assumed that $N$ has a stochastic intensity $\lambda^Y$ which depends on the
state $j$ of the environment $Y$.
- There is an initial capital $u$ and an income of constant rate $c > 0$ per unit
time.
Now the process $R$, given by
$$R_t = u + ct - \sum_{n=1}^{N_t} X_n,$$
describes the available capital at time $t$ as the difference of the income and
the total amount of costs for minimal repairs up to time $t$.
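The risk-type process $R$ can be simulated directly. In the sketch below the Poisson repair epochs and exponential repair costs are illustrative assumptions of ours, not fixed by the model:

```python
import random

def capital_at(t, u, c, repair_rate, cost_mean, rng):
    """R_t = u + c*t - sum of the N_t minimal-repair costs X_n up to time t."""
    capital = u + c * t
    s = rng.expovariate(repair_rate)                 # first repair epoch T_1
    while s <= t:
        capital -= rng.expovariate(1.0 / cost_mean)  # repair cost X_n
        s += rng.expovariate(repair_rate)            # next repair epoch
    return capital

rng = random.Random(3)
mean_R5 = sum(capital_at(5.0, 10.0, 2.0, 1.0, 1.0, rng) for _ in range(5000)) / 5000
```

By Wald's identity the mean capital is $u + ct - \lambda t\,EX_1$, here $10 + 10 - 5 = 15$, which the simulated average approximates.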
where $U_t(j) = I_{\{Y_t = j\}}$ is the indicator of the state at time $t$ and $k_j \in \mathbb{R}$, $j \in S$,
are stopping costs (for inspection and replacement) which may depend on the
stopping state. The process $Z$ cannot be observed directly, because only the
failure time points and the costs for minimal repairs are known to an observer.
The observation filtration $\mathbb{A} = (\mathcal{A}_t)_{t \in \mathbb{R}_+}$ is given by
$$Z_t = u - \sum_{j=1}^{m} k_j U_0(j) + \int_0^t \sum_{j=1}^{m} U_s(j)\,r_j\,ds + M_t, \quad t \in \mathbb{R}_+, \tag{3.2}$$
$$r_j = c - \lambda_j \mu - \sum_{v \neq j} (k_v - k_j) q_{jv}.$$
$$\hat Z_t = E(Z_t\,|\,\mathcal{A}_t) = u - \sum_{j=1}^{m} k_j \hat U_0(j) + \int_0^t \sum_{j=1}^{m} \hat U_s(j)\,r_j\,ds + \hat M_t, \quad t \in \mathbb{R}_+. \tag{3.3}$$
$$\tau^* = \inf\Bigl\{t \in \mathbb{R}_+ : \sum_{j=1}^{m} \hat U_t(j)\,r_j < 0\Bigr\}, \tag{3.5}$$
the first time the conditional expectation of the net gain rate falls below 0.
Theorem 3.1. Let $\tau^*$ be the $\mathbb{A}$-stopping time (3.5) and assume that
conditions (3.4) hold true. If in addition $q_{im} > \lambda_m - \lambda_i$, $i = 1, \ldots, m-1$, then $\tau^*$ is
optimal.
A proof can be found in Jensen and Hsu (1993). The additional condition
$q_{im} > \lambda_m - \lambda_i$ ensures that the integrand of the drift term $g_t = \sum_{j=1}^{m} \hat U_t(j) r_j$
has non-increasing paths, and the monotone case applies. But in any case,
under conditions (3.4), $g = (g_t)_{t \in \mathbb{R}_+}$ is a supermartingale and $\tau^*$ is optimal in
a smaller set of $\mathbb{A}$-stopping times with finite expectation. Of special interest
is the case $m = 2$, for which an explicit solution of the stopping problem will
be given.
The Case of m=2 States. For two states the stopping problem can be
reformulated as follows. At an unobservable random time, say $\sigma$, there occurs
a switch from state 1 to state 2. Detect this change as well as possible by
means of the failure process observations. The conditions (3.4) now read
where $d_n = (1 - \hat U_{T_n}(2))^{-1}$ and $g_n(t) = (q - (\lambda_2 - \lambda_1))(t - T_n)$. The stopping
time $\tau^*$ in (3.5) can now be written as
Remark 3.1. If the failure rates in both states coincide, i.e. $\lambda_1 = \lambda_2$, the
observation of the failure time points should give no additional information
about the change time point from state 1 to state 2. Indeed, in this case the
conditional distribution of $\sigma$ is deterministic.
In this section the basic lifetime model is combined with the possibility of
preventive replacements. A system with random lifetime $\zeta > 0$ is replaced by
a new equivalent one after failure. A preventive replacement can be carried
out before failure. There are costs for each replacement, and an additional
amount has to be paid for replacements after failures. The aim is to determine
an optimal replacement policy with respect to some cost criterion.
There is an extensive literature about models of this kind, which is
surveyed in the overviews by Pierskalla and Voelker (1976), Sherif and
Smith (1981) and Valdez-Flores and Feldman (1989) mentioned before. Several
cost criteria are known, among which the long run average cost per
unit time criterion is by far the most popular one. A general set-up for
cost minimizing problems is introduced in Jensen (1990), similar to Aven
and Bergman (1986). It allows for specialization in different directions. As
an example, the total expected discounted cost criterion as described by
Aven (1983) will be applied. What goes beyond the results in Aven and
$$I_{\{\zeta \leq t\}} = \int_0^t I_{\{\zeta > s\}} \lambda_s\,ds + M_t,$$
$$Z_t = c + \int_0^t I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\Bigl(-c + \lambda_s \frac{k}{\alpha}\Bigr)ds + R_t = c + \int_0^t I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\,r_s\,ds + R_t, \tag{4.2}$$
where $r_s = \frac{1}{\alpha}(-\alpha c + \lambda_s k)$ is a cost rate and $R = (R_t)_{t \in \mathbb{R}_+}$ is a uniformly
integrable $\mathbb{F}$-martingale.
$$K_\tau = \frac{EZ_\tau}{E(1 - e^{-\alpha \tau})}, \qquad b_l = \frac{c}{E(1 - e^{-\alpha \zeta})} \leq K \leq b_u = \frac{E\bigl((c + k)e^{-\alpha \zeta}\bigr)}{E(1 - e^{-\alpha \zeta})}. \tag{4.3}$$
$$Y_t = -c + \int_0^t I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\,(K^* - r_s)\,ds + R_t.$$
So if the cost rate $r$ crosses $K^*$ only once from below, then it is optimal to
stop the first time $r$ hits $K^*$, since $ER_\tau = 0$ for all $\tau \in C^{\mathbb{F}}$. If $r$ has this
monotonicity property, then instead of considering all stopping times $\tau \in C^{\mathbb{F}}$
one may restrict the search for an optimal stopping time to the class of
indexed stopping times.
From $EY_\rho = 0$ it follows then that the optimal stopping level $x^*$ is given by
and the question to what extent the information level influences the cost
minimum has to be investigated.
Considerations are now restricted to coherent monotone systems with random
component lifetimes $X_i > 0$, $i = 1, 2, \ldots, n$, $n \in \mathbb{N}$, and structure function
$\phi : \{0,1\}^n \to \{0,1\}$ as described in Section 1.1. The system lifetime $\zeta$ is given
by $\zeta = \inf\{t \in \mathbb{R}_+ : \phi_t = 1\}$, where $\phi_t = \phi(I_{\{X_1 \leq t\}}, I_{\{X_2 \leq t\}}, \ldots, I_{\{X_n \leq t\}}) = I_{\{\zeta \leq t\}}$
indicates the state of the system at time $t$. It is assumed that $\phi_t$ admits
a semimartingale representation with failure rate process $\lambda$ with respect to
the filtration $\mathbb{F}$ generated by the component lifetimes:
[Figure: block diagram of a two-component parallel system]
The idea behind Freund's model is that after the failure of one component
in a two-component parallel system the stress placed on the surviving
component is changed. As long as both components work, the lifetimes follow
independent exponential distributions with parameters $\beta_1$ and $\beta_2$. When one
of the components fails, the parameter of the surviving component is switched
to $\bar\beta_1$ or $\bar\beta_2$, respectively.
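Freund's construction translates directly into a sampler. A minimal sketch (parameter names are ours: `b1`, `b2` are the initial rates, `bb1`, `bb2` the switched rates):

```python
import random

def freund_sample(b1, b2, bb1, bb2, rng):
    """One draw (Y1, Y2) from Freund's bivariate exponential distribution."""
    t1 = rng.expovariate(b1)   # candidate failure time of component 1
    t2 = rng.expovariate(b2)   # candidate failure time of component 2
    if t1 < t2:
        # component 1 fails first; survivor 2 continues at the switched rate bb2
        return t1, t1 + rng.expovariate(bb2)
    # component 2 fails first; survivor 1 continues at the switched rate bb1
    return t2 + rng.expovariate(bb1), t2

rng = random.Random(7)
draws = [freund_sample(1.0, 2.0, 1.5, 3.5, rng) for _ in range(20000)]
```

The time of the first failure is $\mathrm{Exp}(\beta_1 + \beta_2)$ distributed, so its sample mean should be close to $1/(\beta_1 + \beta_2)$.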
Marshall and Olkin proposed a bivariate exponential distribution for a
two-component system where the components are subjected to shocks. The
components may fail separately or both at the same time due to such shocks.
This model includes the possibility of a common cause of failure which de-
stroys the whole system at once.
As a combination of these two models the following bivariate distribution
can be derived. Let the pair $(Y_1, Y_2)$ of random variables be distributed
according to the model of Freund, and let $Y_{12}$ be another positive random
variable, independent of $Y_1$ and $Y_2$, exponentially distributed with parameter
$\beta_{12}$. Then $(X_1, X_2)$ with $X_1 = Y_1 \wedge Y_{12}$, $X_2 = Y_2 \wedge Y_{12}$ is said to follow a
combined exponential distribution. For brevity the notation $\gamma_i = \beta_1 + \beta_2 - \bar\beta_i$,
$i \in \{1, 2\}$, and $\beta = \beta_1 + \beta_2 + \beta_{12}$ is introduced. The survival function
where here and in the following $\gamma_i \neq 0$, $i \in \{1, 2\}$, is assumed. For $\bar\beta_i = \beta_i$ this
formula reduces to the Marshall-Olkin distribution, and for $\beta_{12} = 0$ (4.7)
gives the Freund distribution. A detailed derivation, statistical properties and
methods of parameter estimation of this combined exponential distribution
can be found in Heinrich and Jensen (1995). From (4.7) the distribution $H$
of the system lifetime $\zeta = X_1 \vee X_2$ can be obtained:
Since the optimal value $x^*$ lies between the bounds $b_l$ and $b_u$, considerations
can be restricted to the cases $x \geq b_l > \frac{\beta_{12}}{\alpha} - c$. In the first case,
$\frac{\beta_{12}}{\alpha} - c < x \leq \frac{\gamma_1 + \beta_{12}}{\alpha} - c$, one has $\rho_x = X_1 \wedge X_2$ and
$$c\,E(e^{-\alpha \rho_x}) + E\bigl(I_{\{\zeta \leq \rho_x\}} e^{-\alpha \rho_x}\bigr) = \frac{c\beta}{\beta + \alpha} + \frac{\beta_{12}}{\beta + \alpha}, \qquad E(1 - e^{-\alpha \rho_x}) = \frac{\alpha}{\beta + \alpha}.$$
This yields $x_1^* = \alpha^{-1}(c\beta + \beta_{12})$ for $\frac{\beta_{12}}{\alpha} - c < x \leq \frac{\gamma_1 + \beta_{12}}{\alpha} - c$.
Altogether the optimal replacement time and stopping level are
$$\rho_{x^*} = \begin{cases} X_1 \wedge X_2 & \text{for } 0 < c \leq c_1, \\ X_1 & \text{for } c_1 < c \leq c_2, \\ \zeta & \text{for } c_2 < c, \end{cases} \qquad x^* = \begin{cases} x_1^* & \text{for } 0 < c \leq c_1, \\ x_2^* & \text{for } c_1 < c \leq c_2, \\ x_3^* & \text{for } c_2 < c. \end{cases}$$
The explicit formulas for the optimal stopping value were only presented here
to show how the procedure works and that even in seemingly simple cases
extensive calculations are necessary. The essential conclusion can be drawn
from the structure of the optimal policy. For small values of $c$ (note that the
penalty costs for failures are $k = 1$), it is optimal to stop and replace the system
at the first component failure. For mid-range values of $c$ the replacement
should take place when the "better" component with a lower residual failure
rate ($\bar\beta_1 \leq \bar\beta_2$) fails. If the "worse" component fails first, this results in an
intentional replacement after system failure. For high values of $c$ preventive
replacements do not pay, and it is optimal to wait until system failure. In this
case the optimal stopping value is equal to the upper bound $x^* = b_u$.
4.3.2 Information about $X_1$ and $\zeta$. The failure rate process corresponding
to this observation level $\mathbb{A}$ is given by
The paths of this failure rate process depend only on the observable component
lifetime $X_1$, as required, and not on $X_2$. The paths are non-decreasing,
so that the same procedure as in the last Section 4.3.1 can be applied. For
$\gamma_1 = \beta_1 + \beta_2 - \bar\beta_1 > 0$ the following results can be obtained:
$$\rho_{x^*} = \begin{cases} X_1 \wedge b^* & \text{for } 0 < c \leq c_1, \\ X_1 & \text{for } c_1 < c \leq c_2, \\ \zeta & \text{for } c_2 < c, \end{cases} \qquad x^* = \begin{cases} x_1^* & \text{for } 0 < c \leq c_1, \\ x_2^* & \text{for } c_1 < c \leq c_2, \\ x_3^* & \text{for } c_2 < c. \end{cases}$$
The constants $c_1, c_2$ and the stopping values $x_2^*, x_3^*$ are the same as in
Section 4.3.1. What is optimal on a higher information level and can be
observed on a lower information level must be optimal on the latter too. So
only the case $0 < c \leq c_1$ is new. In this case the optimal replacement time is
$X_1 \wedge b^*$ with a constant $b^*$, which is the unique solution of the equation:
In this case the replacement times $\rho_x = \zeta \wedge b$, $b \in \mathbb{R}_+ \cup \{\infty\}$, are the
well-known age-replacement policies. Even if $\lambda$ is not monotone, such a policy is
optimal on this $\mathbb{A}$-level. The optimal values $b^*$ and $x^*$ have to be determined
by minimizing $K_{\rho_x}$ as a function of $b$.
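The discounted criterion $K_\rho = \bigl[c\,E(e^{-\alpha\rho}) + k\,E(I_{\{\zeta \leq \rho\}} e^{-\alpha\rho})\bigr] / E(1 - e^{-\alpha\rho})$ can be minimized over age-replacement times $\rho = \zeta \wedge b$ by simple Monte Carlo. The sketch below uses a Weibull lifetime purely as an illustration; all names and parameter values are ours:

```python
import math
import random

def K_age(b, lifetimes, c, k, alpha):
    """Discounted cost criterion for the age-replacement time rho = min(zeta, b)."""
    num = den = 0.0
    for z in lifetimes:
        rho = min(z, b)
        disc = math.exp(-alpha * rho)
        num += (c + (k if z <= rho else 0.0)) * disc   # extra penalty k if failure occurred
        den += 1.0 - disc
    return num / den

rng = random.Random(11)
zetas = [rng.weibullvariate(1.0, 2.0) for _ in range(20000)]   # IFR lifetime
c, k, alpha = 1.0, 3.0, 0.1
best_b = min((b / 10.0 for b in range(1, 51)), key=lambda b: K_age(b, zetas, c, k, alpha))
```

Very small replacement ages are penalized by the vanishing denominator (frequent replacements), so the grid minimum lies at an interior or boundary age depending on the cost ratio $k/c$.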
4.3.4 Numerical Examples. The following tables show the effects of
changes of two parameters, the replacement cost parameter $c$ and the
"dependence parameter" $\beta_{12}$. To be able to compare the cost minima $K^* = x^*$,
both tables refer to the same set of parameters: $\beta_1 = 1$, $\beta_2 = 3$, $\bar\beta_1 = 1.5$,
$\bar\beta_2 = 3.5$, $\alpha = 0.08$. The optimal replacement times are denoted:
a: $\rho_{x^*} = X_1 \wedge X_2$   b: $\rho_{x^*} = X_1$   c: $\rho_{x^*} = X_1 \wedge b^*$   d: $\rho_{x^*} = \zeta \wedge b^*$   e: $\rho_{x^*} = \zeta = X_1 \vee X_2$
Table 4.1. $\beta_1 = 1$, $\beta_2 = 3$, $\beta_{12} = 0.5$, $\bar\beta_1 = 1.5$, $\bar\beta_2 = 3.5$, $\alpha = 0.08$

  c    |   IF   | Information level (cost minimum $x^*$, optimal policy)
 0.01  |  6.453 |  6.813 a |  9.910 c | 11.003 d | 20.506
 0.10  |  8.280 | 11.875 a | 17.208 c | 19.678 d | 22.333
 0.50  | 16.402 | 28.543 b | 28.543 b | 30.455 e | 30.455
 1.00  | 26.553 | 39.764 b | 39.764 b | 40.606 e | 40.606
 2.00  | 46.856 | 60.900 e | 60.900 e | 60.900 e | 60.900

Table 4.2. $\beta_1 = 1$, $\beta_2 = 3$, $\bar\beta_1 = 1.5$, $\bar\beta_2 = 3.5$, $c = 0.1$, $\alpha = 0.08$

 $\beta_{12}$ |    IF   | Information level (cost minimum $x^*$, optimal policy)
  0.00  |   1.505 |   5.000 a |  10.739 c |  13.231 d |  16.552
  0.10  |   2.859 |   6.375 a |  12.032 c |  14.520 d |  17.698
  1.00  |  15.067 |  18.750 a |  23.688 c |  26.132 d |  28.235
 10.00  | 138.106 | 142.500 b | 142.500 b | 144.168 e | 144.168
 50.00  | 687.677 | 689.448 e | 689.448 e | 689.448 e | 689.448
Table 4.1 shows the cost minima $x^*$ for different values of $c$. For small
values of $c$ the influence of the information level is greater than for moderate
values. For $c > 1.394$ preventive replacements do not pay, and additional
information concerning $\zeta$ is not profitable. Table 4.2 shows how the cost minimum
depends on the parameter $\beta_{12}$. For increasing values of $\beta_{12}$ the difference
between the cost minima on different information levels decreases, because the
probability of a common failure of both components increases and therefore
extra information about a single component is not profitable.
Fatigue Crack Growth
Erhan Çınlar
Department of Civil Engineering and Operations Research, Princeton University,
Princeton, NJ 08544, USA
Summary. Metal fatigue is a major cause of failure of mechanical and structural
components. We review the fracture mechanics of fatigue and the Paris-Erdogan
law for the mean behavior. After a consideration of the experimental data reported
by Virkler et al. (1979), we propose a continuous semimarkov process to model
crack growth. The model accounts for the randomness of the material and views
the crack as a motion in a random field.
1. Introduction
Metal fatigue is a major cause for failure of mechanical and structural compo-
nents. It is widely recognized to be a random phenomenon, two main reasons
being the randomness in stress loading and the random variations in the
material properties.
From the point of view of fracture mechanics, the fatigue damage of a
component is measured by the size of the dominant crack, and the failure
is defined to occur when that crack's size reaches a critical magnitude. The
cracks in question are in the range from 10 to 50 mm, so that micromechanical
considerations need not be made explicit. Indeed, micro level considerations
will be used to derive, through probabilistic reasoning, the likely laws for the
growth of macroscopic cracks.
Our model for crack growth is based on two considerations. First, the
effect of material inhomogeneities is viewed as a Markov random field on
the plane. Second, the motion of the crack tip is seen as a continuous increasing
semimarkov process in that Markov random field. As such, our model is a
refinement of all models known to us except one. In fact, as we shall make
clear, the better accepted models in the literature are averaged versions of ours.
However, our model is best described in terms of the process of primary
interest, namely, the random time T_a it takes the crack to reach size a. Our
model is that it is infinitesimally a gamma process given the Markov random
field for the strength of the material.
In Section 2, we review the deterministic equation of Paris and Erdogan
(1963), which is based on fracture-mechanical considerations. In Section 3, we
describe the results of the experimental work by Virkler et al. (1979), which
seems to be the only work suitable for probabilistic reasoning. In Section 4,
we review various stochastic models proposed and the insights to be gained
from them. In Section 5, we describe our model, justify its basic assumptions,
and point out its relationships to earlier models. Finally, in Section 6, we
present a summary of our model and some results.
2. Deterministic Analysis
Early work on fatigue crack growth has been from the point of view of
fracture mechanics. In general, all such work treats cracks in infinite sheets
subjected to a uniform stress perpendicular to the crack. It relates the crack
length, 2a, to the number N of cycles of load applied, with the stress range σ
and some material constants Cᵢ.
The single form in which all crack propagation laws can be written is

    da/dN = f(σ, a, Cᵢ) ,                                    (2.1)

and the problem is to figure out the function f. Note, incidentally, the treat-
ment of N as a continuous variable.
Paris and Erdogan (1963) present a critical evaluation of various forms of
the equation proposed by earlier researchers. Here are some such models (see
Paris and Erdogan (1963) for references):

    da/dN = C₃ σ² a / (C₂ − σ) ,                             (2.2)

called Head's formula;

    da/dN = σ³ a / C₄ ,                                      (2.3)

proposed by Frost and Dugdale;

    da/dN = f(σ(1 + 2√(a/ρ))) ,                              (2.4)

proposed by McEvilly and Illg, where ρ is the end radius of the elliptical hole
with semimajor axis a, and where f is obtained empirically to be

    f(x) = 0.00509x − 5.472 − 34/(x − 34) .
From the continuum mechanics point of view, the essential factor is the
stress-intensity factor k. The latter reflects the effect of external load and
configuration on the intensity of the whole stress field around the crack tip.
Moreover, for various configurations, the crack tip stress fields have the same
form. Therefore, the stress-intensity factor k should control the rate of crack
extension, that is, the law should be

    da/dN = C₅ k^p .                                         (2.5)
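With the center-crack stress-intensity factor k = σ√(πa), equation (2.5) can be integrated numerically to recover N as a function of a. The constants below are illustrative values only, not taken from the text:

```python
import numpy as np

# Paris-type law da/dN = C5 * k(a)^p with k(a) = sigma * sqrt(pi * a);
# C5, p, sigma, a0, ac are illustrative, not material data from the chapter.
C5, p, sigma = 1e-11, 3.0, 100.0
a0, ac = 0.009, 0.050                    # initial and critical half-lengths (m)

a = np.linspace(a0, ac, 10_000)
dN_da = 1.0 / (C5 * (sigma * np.sqrt(np.pi * a)) ** p)   # dN/da = 1/(da/dN)

# cumulative trapezoid: N(a) = integral of dN/da' from a0 to a
widths = np.diff(a)
N = np.concatenate(([0.0], np.cumsum((dN_da[:-1] + dN_da[1:]) / 2 * widths)))
cycles_to_failure = N[-1]
```

Since da/dN grows like a^{p/2}, most of the computed life is spent while the crack is still short.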
3. Experimental Facts
A typical crack propagation test consists of a wide plate with a central crack
of some initial length 2a₀ subjected to a uniform tension σ applied repeatedly.
During a single test, σ is kept constant, and the data consist of half crack
lengths a and the corresponding cycle numbers N_a (at which the crack length
becomes 2a).
Early experiments were conducted by engineers unfamiliar with statistical
or probabilistic thinking. Such experiments do not have the idea of replica-
tion and are therefore useless for our purposes. The first significant study
of random variability occurs in Virkler et al. (1979). We shall present their
results with some care, because their results are instructive and because they
have influenced the stochastic modeling of crack growth ever since.
Virkler et al. have designed their experiments with careful attention to
the random nature of the phenomenon. They have performed the test on
68 identical specimens coming from a single lot of material, an aluminum
alloy. Replicate tests were conducted under identical load and environmental
conditions: constant amplitude loading and careful experimental control,
with measurement errors of the order of 0.00141 mm (which can be ignored).
For each one of the 68 specimens, they started a crack and brought the
crack to the half size a₀ = 9.00 mm, and then they recorded the number
N_{a_i} of cycles it took to bring the crack to size a_i for 164 fixed values
a₁, a₂, ..., a₁₆₄. Thus, what we have is a random sample of size 68 from a
stochastic process {N_a, a ≥ 9.00}, each sample path having 164 data points
corresponding to N_{a₀} = 0, N_{a₁}, ..., N_{a₁₆₄}. Figure 3.1 shows the
essentials of their data. For purposes of clarity, instead of 68 paths, we have
drawn only 5; instead of 164 data points (a_i, N_{a_i}), we have marked only
3 per path.
There are two qualitative observations we can make immediately: sample
paths do cross, that is, there is a good amount of randomness, but neither
a nor N is Markovian. In fact, cracks that grow fast in the beginning seem
to grow fast at all times, modulo some randomness. There are three other
observations made by Virkler et al. We list these next.

Fig. 3.1. Crack length against number of cycles: five of the 68 sample paths,
with three data points marked per path
Distribution of increments of N
For each fixed crack size a_i, the data consist of a random sample of size 68.
Thus, for instance, for a_i = 38.20 mm and a_{i+1} = 38.60 mm, we have 68
independent observations for the random variable N_{i+1} − N_i, which is the
number of cycles it takes to grow the crack from size 38.20 to 38.60. The
flavor of the data is in Figure 3.2, where the horizontal axis is logarithmic
and the vertical axis is chosen so that the cumulative normal distribution
would be a straight line.
As can be seen, the 3-parameter log-normal distribution is a good fit. In fact,
Virkler et al. (1979) have repeated this exercise with each a_i, i = 1, 2, ..., 163,
and tested the goodness of fit of five different distributions. Of these 163 fits,
the 3-parameter log-normal distribution had the best fit 137 times, the
3-parameter gamma distribution 16 times, the 2-parameter log-normal 7 times,
and the Weibull distribution 3 times.
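The fitting device behind these counts is the normal-probability plot just described: with the right threshold, the logarithm of the shifted increments plots as a straight line. A self-contained sketch on synthetic increments (not Virkler's data), using a probability-plot correlation as the straightness measure:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
# synthetic cycle-count increments: shifted (3-parameter) log-normal
shift = 400.0
x = shift + rng.lognormal(mean=5.0, sigma=0.8, size=68)

def probability_plot_corr(y):
    """Correlation of sorted data with standard normal quantiles; a value
    near 1 means the normal-probability plot is nearly a straight line."""
    n = len(y)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    q = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(np.sort(y), q)[0, 1]

r2 = probability_plot_corr(np.log(x))          # 2-parameter log-normal fit
r3 = probability_plot_corr(np.log(x - shift))  # 3-parameter fit (true shift)
```

Accounting for the threshold straightens the plot, which is why the 3-parameter variant wins most of the 163 comparisons.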
Fig. 3.2. Fit of the log-normal distribution to the number of cycles needed to bring
crack length to a = 3.82 cm
Distribution of da/dN
Virkler et al. have also looked at the distribution of the slope da/dN as a
function of ΔK, the stress-intensity factor range. According to the deterministic
Prediction of a versus N
There is a final, negative and useful result in Virkler et al. Using the distri-
butions fitted to da/dN at various ΔK levels, they have simulated 68 sample
paths for {N_a : a ≥ 9.00 mm} under the assumption that da/dN values at
different ΔK levels are independent. The simulation yields paths that cap-
ture the mean path quite well, but the variance is much smaller than with
the real data. We conclude from this that the assumed independence does
not hold true. This confirms the observed behavior, where the growth rate
shows abrupt changes.
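This variance deficit is easy to reproduce: give each path a persistent random rate factor, then destroy the within-path dependence by resampling increments across paths. The construction and all numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_steps = 68, 164

# each specimen carries a persistent random rate factor (material effect)
rate = rng.lognormal(0.0, 0.3, size=(n_paths, 1))
noise = rng.lognormal(0.0, 0.2, size=(n_paths, n_steps))
dependent = np.cumsum(rate * noise, axis=1)        # increments share `rate`

# resampling increments independently across paths removes the shared factor
pooled = (rate * noise).ravel()
independent = np.cumsum(rng.choice(pooled, size=(n_paths, n_steps)), axis=1)

var_dep = dependent[:, -1].var()     # path-to-path variance, dependent case
var_ind = independent[:, -1].var()   # much smaller under independence
```

Both constructions match in mean, but the independent version grossly understates the path-to-path variance, exactly as in the simulation of Virkler et al.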
4. Stochastic Models
An almost exhaustive survey of stochastic models that have been proposed
in the past can be found in Sobczyk and Spencer (1992). Their survey, sup-
plemented by a few recent papers, can be summarized as follows.
Stochastic models proposed have always been for the crack growth pro-
cesses parametrized by cycle counts, the latter being treated as a continuous
"time" parameter. There are three basic models, which we describe now.
In many ways the simplest model was proposed by Kozin and Bogdanoff;
see Kozin and Bogdanoff (1989). Their model is a Markov chain {Aₙ : n =
0, 1, ...} where A₀ is the initial crack size and Aₙ is the crack size after
n cycles. This Markov chain is increasing, the state space is the set of all
positive integers, and each transition is either from a to a or from a to a + 1.
It follows that, starting at a, the crack size stays at a for a geometric random
amount of time with some mean 1/p(a) and then jumps to a + 1, stays at
a + 1 some random time with a geometric distribution with mean 1/p(a + 1)
and then jumps to a + 2, and so on. Kozin and Bogdanoff claim a good fit to
data.
If we make the parameter continuous, so that we are talking of the crack
size A_t at time t, the model becomes a pure birth process with state space
{1, 2, ...}. The model captures the essentials of crack growth, but without
attempting to account for material inhomogeneity.
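A sketch of this chain, with a hypothetical transition probability p(a) that increases with the crack size:

```python
import random

def cycles_to_reach(a0, a_crit, p, rng):
    """Bogdanoff-Kozin-type chain: each cycle the crack grows from a to a+1
    with probability p(a), else stays put (geometric sojourn, mean 1/p(a))."""
    a, n = a0, 0
    while a < a_crit:
        n += 1
        if rng.random() < p(a):
            a += 1
    return n

rng = random.Random(0)
p = lambda a: min(0.9, 0.01 * a)       # hypothetical, increasing in a
lifetimes = [cycles_to_reach(9, 30, p, rng) for _ in range(200)]
mean_life = sum(lifetimes) / len(lifetimes)
# the mean number of cycles is the sum of 1/p(a) over a = 9, ..., 29
expected = sum(1.0 / p(a) for a in range(9, 30))
```

The geometric sojourns make the cycle count to reach a critical size a sum of independent geometric variables, which is what makes this model so tractable.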
The second basic model takes the crack size to be

    A_t = a₀ + Y₁ + Y₂ + ... + Y_{N(t)} ,

where N(t) is a Poisson process and the Yᵢ are independent and identically
distributed random variables independent of the process N(t), t ≥ 0. In other
words, A is a compound Poisson process. Here, N(t) is viewed as the num-
ber of jumps in crack size, which presumably corresponds to the high level
exceedances in stress, and Yᵢ is the jump size for the ith jump. Its proponents
cite experiments performed by Kogajew and Liebiedinskij (1983) as a justifica-
tion for assuming that the distribution of Yᵢ is exponential with a random
parameter. See also Sobczyk and Trebicki (1991) for the same model, but
with the Yᵢ correlated.
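A minimal simulation of this cumulative-jump model, with illustrative rate and jump parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def crack_size(t, a0, jump_rate, mean_jump, rng):
    """Compound Poisson crack size at time t: Poisson number of jumps,
    i.i.d. exponential jump sizes (all parameters illustrative)."""
    n_jumps = rng.poisson(jump_rate * t)
    return a0 + rng.exponential(mean_jump, size=n_jumps).sum()

samples = np.array([crack_size(100.0, 9.0, 0.5, 0.1, rng) for _ in range(2000)])
# E[A_t - a0] = jump_rate * t * mean_jump = 0.5 * 100 * 0.1 = 5
mean_growth = samples.mean() - 9.0
```

The sample paths are increasing step functions, so the model respects monotone growth, but within a path all randomness comes from the jump times and sizes, not from the material.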
The third basic model randomizes a deterministic growth law. It has the form

    dA_t/dt = f(ΔK, K_max, S, R, A_t) · X_t ,                 (4.1)

where A_t is the crack size at time t, f is some positive function, ΔK is the
stress intensity factor range, K_max is the maximum stress intensity factor,
S is the stress amplitude, R is the stress ratio, and finally X_t is some process
to be specified.
The essential point here is that the deterministic model

    da/dt = f(ΔK, K_max, S, R, a)

is being randomized by multiplying the right hand side by some random pro-
cess X_t and replacing the deterministic function a(t) by the random process
A_t.
Different authors have argued variously regarding the process X_t; see the
references at the end of Spencer et al. (1989). Some have taken X_t = X₀ for
all t, and at the other extreme, some have taken X to be white noise. The
former, the random constant case, fails to account for random inhomogeneities
adequately. The latter should be written as a stochastic differential equation
(holding everything but A fixed)
    dA_t = g(A_t) dW_t ,

where W is the Wiener process. Of course, this is unacceptable, since it is
impossible to make A increasing.
The role of X_t in (4.1) should be to account for material inhomogeneities.
Thus, as was argued by Ortiz and Kiremidjian (1988) and by Spencer et al.
(1989), X_t should have the form

    X_t = Y(A_t) ,                                            (4.2)

where Y(a) stands for the material properties at the point a. Further, Y
should be a positive process, and they propose that

    Y(a) = e^{Z(a)} ,  a ≥ 0,                                 (4.3)

where Z is an Ornstein-Uhlenbeck process, that is, a Gaussian process with
mean 0 and covariance function

    (α²/2β) e^{−β|a−a'|} ,  a, a' ≥ 0.                        (4.4)

Thus, in its essentials, (4.1) becomes a deterministic growth law modulated
by the random field e^{Z(a)}.
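In essentials, then, the randomized law multiplies a deterministic rate by e^{Z(a)}. A discretized sketch of such a model, with hypothetical f, α and β (none are fitted values):

```python
import numpy as np

rng = np.random.default_rng(3)

beta, alpha = 2.0, 0.5                 # OU reversion rate and noise scale
f = lambda a: 1e-3 * a ** 1.5          # hypothetical deterministic growth law

da = 1e-3
a_grid = np.arange(9.0, 30.0, da)

# exact OU recursion on the grid; stationary law is N(0, alpha^2 / (2 beta))
sd = alpha / np.sqrt(2 * beta)
phi = np.exp(-beta * da)
z = np.empty(a_grid.size)
z[0] = rng.normal(0.0, sd)
for i in range(1, z.size):
    z[i] = phi * z[i - 1] + rng.normal(0.0, sd * np.sqrt(1 - phi ** 2))

# time to cross [a, a + da] is da / (f(a) e^{Z(a)}); accumulate to get T_a
T = np.cumsum(da / (f(a_grid) * np.exp(z)))
```

Because Z is continuous in a, growth rates change smoothly but persistently along a path, unlike white-noise randomizations.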
5. Proposed Model
We shall let A_t denote the crack size at time t, "time" being the continuous
analog of the cumulative number of cycles. We shall take A₀ = a₀ fixed. As the
functional inverse of A, we let T_a denote the time at which the crack size
exceeds a; more precisely,

    T_a = inf {t : A_t > a} ,  a ≥ a₀ .                       (5.1)

Our model will be directly for the process T = {T_a : a ≥ a₀}, but we shall
motivate and justify our model by considerations on A = {A_t : t ≥ 0}.

Fig. 5.1. Crack size against time (number of cycles) in the microscale

horizontal axis. For the function f, we choose the exponential function, which
seems justified by both the statistical data and experience. Thus, writing
simply Z_a instead of Z(a, 0), we put our assumptions regarding the stress
level at the crack tip when the size is a. Here σ is the macroscopic stress
magnitude.
Hypothesis 5.2. a) When the crack size is a, the actual stress at the crack
tip is

The law of M itself is specified by its mean measure μ(db, dt), which gives
the mean number of points in the small box with sides db and dt. In our case,
since T is conditionally a process with independent increments, the measure
μ is random and depends on Z. It is clear that μ has the form

    μ(da, dt) = f(σ, a, Z_a, t) da dt

for some positive function f. Moreover, since T is to be strictly increasing
(so that A be continuous), we must have

    ∫₀^∞ f(σ, a, Z_a, t) dt = +∞ .                            (5.3)

Also, in our case, the conditional mean rate of increase of T_a should be finite
and in agreement with the Virkler et al. data, which requires us to have

    ∫₀^∞ f(σ, a, Z_a, t) t dt = c₀ (σ e^{−Z_a})^q a^{−p} ,    (5.4)

where p and q are material constants, σ is the macroscopic stress level, and
Z is the Ornstein-Uhlenbeck process described in Hypothesis (5.2).
Stripped to its essentials, μ has the form

    μ(da, dt) = g(a) (e^{−h(a)t} / t) da dt .
Recall that time is measured continuously but stands for the cumulative
number of cycles, that A_t denotes the crack size at time t, with A₀ = a₀
fixed, and that T_a is the time at which the crack size exceeds a. Finally, let
us introduce a new process S to simplify notation:

    (6.1)

We think of S_a as the random stress intensity factor when the crack size is a.
Throughout this section we assume that Hypotheses (5.1), (5.2), (5.3)
hold. The following describe the processes S, A, T one by one.

Intensity process

    (q² α² / 2β) e^{−β|a−b|} ,  a, b ≥ 0.                     (6.3)

    dZ_a = −β Z_a da + α dW_a ,                               (6.4)

with W denoting the Wiener process, and Z₀ being independent of W and
having the Gaussian distribution with mean 0 and variance α²/2β.
In other words, given the process S, the random variable M(B) has the
Poisson distribution

    e^{−μ(B)} μ(B)^k / k! ,  k = 0, 1, 2, ...                 (6.6)

where

    μ(B) = ∫∫_B μ(da, dt) = ∫∫_B S_a (e^{−a^p t} / t) da dt .
In terms of M, the process T is defined as follows:

    T_a = Σ_{a₀ < a_i < a} t_i .                              (6.8)

Increments of T are conditionally independent given S (or, equivalently,
given Z). It follows from (6.7) that, for a₀ ≤ a < b,

    E[e^{−λ(T_b − T_a)} | S] = exp[ ∫_a^b S_u log( u^p / (λ + u^p) ) du ] .   (6.10)
Unconditional expectation and Laplace transforms can be obtained from
these by taking expectations. For instance,
(6.12)
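The gamma structure makes such Laplace transforms easy to check numerically: discretizing the Lévy density S_a e^{−a^p t}/t over a grid gives independent gamma increments, and the product of their Laplace transforms is exp[Σ S_a Δa log(a^p/(λ + a^p))]. A Monte Carlo sketch with a hypothetical intensity S_a:

```python
import numpy as np

rng = np.random.default_rng(5)

p, lam = 2.0, 0.05
S = lambda a: 3.0 * a                  # hypothetical conditional intensity S_a
da = 0.01
a_grid = np.arange(9.0, 12.0, da)

# discretized increment over [a, a + da]: Gamma(shape S_a*da, rate a^p)
shapes = S(a_grid) * da
rates = a_grid ** p
incs = rng.gamma(shape=shapes, scale=1.0 / rates, size=(10_000, a_grid.size))
T_incr = incs.sum(axis=1)              # samples of T_b - T_a with a=9, b=12

empirical = np.exp(-lam * T_incr).mean()
# the Laplace transform of Gamma(shape k, rate r) is (r / (r + lam))^k
theoretical = np.exp(np.sum(shapes * np.log(rates / (lam + rates))))
```

The empirical and closed-form values agree to Monte Carlo accuracy, which is exactly the content of the conditional Laplace transform (6.10) above.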
Crack size
References
Çınlar, E.: On Increasing Continuous Processes. Stoch. Proc. Appl. 9, 147-
154 (1979)
Çınlar, E.: On a Generalization of Gamma Processes. J. Appl. Prob. 17, 467-480
(1980)
Kogajew, V.H., Liebiedinskij, S.G.: Probabilistic Model of Fatigue Crack Growth
(In Russian). Mashinoviedinije 4, 78-83 (1983)
Kozin, F., Bogdanoff, J.L.: Recent Thought on Probabilistic Fatigue Crack Growth.
Appl. Mech. Rev. 42, S121-S127 (1989)
Noronha, P.J. et al.: Fastener Hole Quality, I and II. Tech. Report AFFDL-TR-78-
206, Wright-Patterson Air Force Base, Ohio (1978)
Ortiz, K., Kiremidjian, A.: Stochastic Modeling of Fatigue Crack Growth. Engng.
Fracture Mechanics 29, 317-334 (1988)
Paris, P.C., Erdogan, F.: A Critical Analysis of Crack Propagation Laws. J. Basic
Engng. 85, 528-534 (1963)
Sobczyk, K.: Modelling of Random Fatigue Crack Growth. Engng. Fracture Me-
chanics 24, 609-623 (1986)
Sobczyk, K., Spencer Jr., B.F.: Random Fatigue: From Data to Theory. Boston:
Academic Press 1992
Sobczyk, K., Trebicki, J.: Modelling of Random Fatigue by Cumulative Jump Pro-
cesses. Engng. Fracture Mechanics 34, 477-493 (1989)
Sobczyk, K., Trebicki, J.: Cumulative Jump-Correlated Model for Random Fatigue.
Engng. Fracture Mechanics 40, 201-210 (1991)
Spencer Jr., B.F., Tang, J., Artley, M.E.: Stochastic Approach to Modeling Fatigue
Crack Growth. AIAA Journal 27, 1628-1635 (1989)
Virkler, D.A., Hillberry, B.M., Goel, P.K.: The Statistical Nature of Fatigue Crack
Propagation. J. Engng. Materials Tech. Trans. ASME 101, 148-153 (1979)
Predictive Modeling for Fatigue Crack
Propagation via Linearizing Time
Transformations
Panickos N. Palettas¹ and Prem K. Goel²
¹ Department of Statistics, Virginia Polytechnic Institute and State University,
Blacksburg, VA 24061-0439, USA
² Department of Statistics, The Ohio State University, Columbus, OH 43219-1247
1. Introduction
    (2.1)

where τᵢ denotes the change point and δᵢ the shift in the growth rate relative
to the mean for the ith unit.
The complexities imposed by the longitudinal nature of the raw FCP
data, as well as the constraints imposed by the monotonicity of m_N(a) on
the error terms in (2.1), can be overcome by a model implied by (2.1), in terms
of successive increments in m_N(aᵢⱼ). Thus, at the expense of a possibly
increased error variance, we use the shifting regimes regression model

    (2.2)

    (2.4)

where j = 1, 2, ..., kᵢ, and i = 0, 1, 2, ..., p.
3. Model Implementation
The fitting of the proposed model to data is made cumbersome by the
nonlinearity in (2.4) with respect to the unknown change points τᵢ, i =
0, 1, ..., p, and by the need to estimate the linearizing transformation m_N(a).
Nevertheless, an iterative fitting algorithm in the spirit of alternating
conditional expectations (Breiman and Friedman 1985) is conceptually simple
to implement. More specifically, starting with a set of initial values for βᵢ
and δᵢ, i = 0, 1, 2, ..., p, least squares estimates of these parameters can be
obtained by iterative repetition of the following two alternating conditional
estimation steps:
Step 1: Given the current set of estimates for βᵢ, δᵢ and τᵢ, i = 1, 2, ..., p, the
linearizing transformation g(a) = m_N(a) is obtained by means of a non-
parametric smoother of the scatter plot {(aᵢⱼ, βᵢNᵢⱼ + δᵢ(Nᵢⱼ − τᵢ)₊); j =
1, 2, ..., kᵢ, i = 1, 2, ..., p}. The supersmoother (Friedman and Stuetzle,
1982), in particular, is easy to use and generally works reasonably well.
Step 2(a): Given the transformation m_N(a) and τᵢ = τ̂ᵢ, i = 1, 2, ..., p,
the least squares estimates for the parameters βᵢ and δᵢ, i = 1, 2, ..., p, are
straightforward to obtain.
Step 2(b): Given m_N(a), βᵢ = β̂ᵢ and δᵢ = δ̂ᵢ, the current estimate τ̂ᵢ
of the change point is recursively updated to α̂₂/(α̂₁ − α̂₃), where α̂₁ is
the estimate for the slope in (2.2) fitted to the data, from only the ith
replicate test, with abscissas to the left of τ̂ᵢ, and α̂₂, α̂₃ are the estimates
for the intercept and the slope in (2.2) fitted to the data, from only the ith
replicate test, with abscissas to the right of τ̂ᵢ.
In other words, the shifting regimes model in (2.2) is fitted to the data
for each replicate test separately to the left and to the right of the current
estimate of the change point. The point of intersection of these line segments
gives the new estimate of the change point.
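The intersection update can be sketched on synthetic piecewise-linear data; the data-generating model and all numbers here are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(6)

# piecewise-linear data with a slope shift at the true change point
true_tau, b1, b2 = 5.0, 1.0, 2.5
x = np.linspace(0.0, 10.0, 200)
y = np.where(x < true_tau, b1 * x,
             b1 * true_tau + b2 * (x - true_tau)) + rng.normal(0, 0.05, x.size)

tau = 3.0                                       # deliberately poor start
for _ in range(20):
    left, right = x < tau, x >= tau
    sl, il = np.polyfit(x[left], y[left], 1)    # slope, intercept (left)
    sr, ir = np.polyfit(x[right], y[right], 1)  # slope, intercept (right)
    tau = (ir - il) / (sl - sr)                 # intersection of the two lines
```

When there is a genuine shift, the iteration settles quickly near the true change point; with no real shift the two slopes nearly coincide, the denominator vanishes, and the update diverges, which is the divergence diagnostic described above.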
In theory, convergence of the estimation algorithm in Step 2(b) can be
adversely affected by outliers and by spurious effects due to random errors
in the neighborhood of the change point. In practice, however, the procedure
is typically well behaved, yielding a sequence of estimates that converges quite
rapidly to the overall least squares estimate of the change point. Yet, when-
ever a shift is actually questionable, the algorithm also diverges rapidly.
Thus divergence becomes an indicator of this phenomenon.
Fitting (2.4) to the training set is done as described above, with the only
difference in Step 2 being the need to account for the constraint in (2.3). Thus,
given the linearizing transformation m_N(a), obtained as in Step 1, βᵢ =
β̂ᵢ, and τᵢ = τ̂ᵢ, i = 1, 2, ..., p, the model in (2.4) is just a simple linear
regression model with unknown slope δ. Likewise, given m_N(a), δ = δ̂, and
τᵢ = τ̂ᵢ, i = 1, 2, ..., p, the model in (2.4) is also linear with respect to the
parameters β₁, β₂, ..., β_p, which may be estimated by least squares. Finally,
given m_N(a), δ = δ̂, and βᵢ = β̂ᵢ, i = 1, 2, ..., p, the analog of Step 2(b)
provides a recursive updating scheme, in which τ̂ᵢ is repeatedly replaced by
α̂₀/[δ̂(1 − α̂₁)] until convergence, where α̂₀ and α̂₁ denote the least squares
estimates for the parameters in the linear regression model (3.1).
Fig. 4.1. Sample FCP Tests from 2024-T3 Aluminum Alloy (Virkler et al. 1979)
number of sample functions, a dense grid, and regularly spaced crack lengths
featured in the Virkler et al data.
Figures 4.3 and 4.4 clearly indicate that the residuals from the shifting
regimes model (2.4) fitted to the data in the training set are free of any ap-
parent trend, indicating a satisfactory fit. Also notable are the lack of any
indication of serial correlation and the absence of clear evidence of
heteroscedasticity.
Figures 4.5 and 4.6, corresponding to the first sample unit in the training
set, further support the conclusions of adequate fit and lack of serial cor-
relation and absence of any sizable heteroscedasticity. Thus the assumption
of independent homoscedastic error terms, eij, j = 1,2, ... ,ki, seems to be
highly tenable. The validity of these assumptions is certainly essential to the
validity of bootstrap prediction regions, based on (2.4), that are presented in
Section 5.
Fig. 4.3. Residual increments in the transformed crack length, m_N(aᵢⱼ), from the
shifting regimes model (2.4) fitted to the training set
Fig. 4.4. Residual increments in the transformed crack length, m_N(aᵢⱼ), from the
shifting regimes model (2.4) fitted to the training set
Fig. 4.5. Residual increments in the transformed crack length, mN(aij), from the
shifting regimes model (2.4) fitted to the data from the Replicate Test 1
Fig. 4.6. Residual increments in the transformed crack length, mN(aij), from the
shifting regimes model (2.4) fitted to the data from the Replicate Test 1
Fig. 5.1. The FCP curves for the Replicate Tests 5, 17, 47, 49, and 68 considered
for prediction. Early growth data (the solid portion of each curve) are used as
information available for predictions
obtained as in (5.2). In each case the observed sample FCP curves are also
shown for comparison. Again, the solid portion of each curve indicates the
range of the data used to fit the model in (5.1), while the dotted portion was
assumed to be the future to be predicted.
Fig. 5.2. Prediction intervals for the Replicate Test 5. Data from 9 to 15 mm crack
length (solid curve) used as available information
Fig. 5.3. Prediction intervals for the Replicate Test 17. Data from 9 to 15 mm in
crack length (solid curve) used as available information
Fig. 5.4. Prediction intervals for the Replicate Test 17. Data from 9 to 20 mm in
crack length (solid curve) used as available information
Fig. 5.5. Prediction intervals for the Replicate Test 17. Data from 9 to 30 mm in
crack length (solid curve) used as available information
Fig. 5.6. Prediction intervals for the Replicate Test 68. Data from 9 to 30 mm in
crack length (solid curve) used as available information
Fig. 5.7. Prediction intervals for the Replicate Test 49. Data from 9 to 15 mm in
crack length (solid curve) used as available information
Fig. 5.8. Prediction intervals for the Replicate Test 49. Data from 9 to 20 mm in
crack length (solid curve) used as available information
Fig. 5.9. Prediction intervals for the Replicate Test 47. Data from 9 to 20 mm in
crack length (solid curve) used as available information
Fig. 5.10. Prediction intervals for the Replicate Test 47. Data from 9 to 30 mm in
crack length (solid curve) used as available information
References
Virkler, D.A., Hillberry, B.M., Goel, P.K.: The Statistical Nature of Fatigue Crack
Propagation. Technical Report No. 78-43. U.S. Air Force Flight Dynamics Lab-
oratory (1978)
Virkler, D.A., Hillberry, B.M., Goel, P.K.: The Statistical Nature of Fatigue Crack
Propagation. Journal of Engineering Materials Technology 101, 148-153 (1979)
The Case for Probabilistic Physics of Failure
Max Mendel
Department of Industrial Engineering and Operations Research, University of
California, Berkeley, CA 94720, USA
1. Introduction
2. What is PPoF?
physics of the failure mechanism that replaces the data that are often used to
determine the probability distribution.
To make this distinction concrete with an example, consider the problem
of specifying a lifetime distribution for machine tools such as drill bits. One
might choose some parametric family for the lifetime, such as a Weibull
distribution:

    F̄(x) = e^{−(x/α)^β} ,                                    (2.1)

where F̄(x) = Prob(Lifetime > x) is the survival function. This choice fixes the
distribution up to two parameters, α and β, which can then be estimated
from lifetime data. Table 2.1 shows estimates of these parameters for various
cutting speeds. Notice that the shape parameter increases and the scale
parameter decreases with cutting speed. Many data points were required to
obtain these estimates, and more will be needed to determine the probability
model at other cutting speeds.
Table 2.1. Weibull scale and shape parameters for multiple feed rates as reported
by Negishi and Aoki (1976)

 Feed rate (mm/rev) | Shape β | Scale α
       0.265        |  0.632  |  1245
       0.335        |  0.725  |   423
       0.425        |  0.624  |   480
       0.850        |  0.531  |   715
       1.060        |  0.760  |   120
       1.320        |  0.850  |    86
       1.700        |  1.325  |    40
Consider now the following PPoF approach to the same problem. Assume
that wear is the dominant failure mechanism. Wear is studied extensively in
the field of tribology and is documented in so-called wear curves. Figure 2.1
provides several examples. Qualitatively, we expect the probability of failure
to increase with increasing wear. Quantitatively, we can assume that:
Assessment 2.1. If one bit has twice the wear of another, then it is twice
as likely to fail in an upcoming infinitesimal interval.
This assumption is further discussed in Section 3. For a potentially infinite
supply of drill bits, the assumption can be shown to imply that lifetimes are
conditionally independent and identically distributed according to:

    F̄(x|θ) = e^{−G(x)/θ} .

Here G(x) is the area under the wear curve evaluated at the lifetime x of a
generic bit, and θ is the limiting average value of the G(Xᵢ) as i ranges over
the bits in the supply:

    θ = lim_{N→∞} (1/N) Σᵢ₌₁^N G(Xᵢ) .
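Numerically, G is just the running area under the wear curve, and the survival function follows directly. A sketch with a hypothetical wear curve (fast run-in followed by steady wear):

```python
import numpy as np

# hypothetical wear curve: run-in followed by steady wear (mm vs. usage)
x = np.linspace(0.0, 10.0, 1001)
wear = 0.3 * (1.0 - np.exp(-3.0 * x)) + 0.05 * x

# G(x) = area under the wear curve up to x (cumulative trapezoid)
dx = x[1] - x[0]
G = np.concatenate(([0.0], np.cumsum((wear[:-1] + wear[1:]) / 2 * dx)))

theta = 1.0                          # stand-in for the limiting average
survival = np.exp(-G / theta)        # F(x | theta) = exp(-G(x) / theta)
```

Since the hazard is proportional to the wear, a steeper wear curve shifts the whole survival curve down; this is the mechanism behind the cutting-speed predictions discussed next.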
Fig. 2.1. Wear curves for M10 cemented carbide tools with various coatings (cut
speed v = 200 m/min; feed rate f = 0.41 mm/rev; cut depth a = 2 mm; work
piece: grey cast iron bar, hardness 170 HB) (from Schintlmeister et al. 1984).
We now compare these two solutions. First notice that in the PPoF model
all the components of the model have a direct tribological meaning. To com-
pare, we can think of the Weibull model as a tribological PPoF model to-
gether with the assumption that the wear curve is a power law. This latter
assumption is not too bad as can be seen from Figure 2.1, although it does
underestimate the effects of run-in wear. Under this assumption, the shape
parameter $\beta$ would be determined by the wear curve itself and does not need
to be estimated from data. The role of the scale parameter $\alpha$ is played by the
average integrated wear $\theta$. This parameter is not fixed by the wear curves;
given $G$, $\theta$ is a function of the unknown lifetimes, making it a random variable
itself.
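The correspondence can be made explicit with one integral (a sketch; $c > 0$ and $\beta > 0$ are the generic constants of the assumed power-law wear curve $w$):

```latex
w(x) = c\,x^{\beta-1}
\quad\Longrightarrow\quad
G(x) = \int_0^x c\,u^{\beta-1}\,du = \frac{c}{\beta}\,x^{\beta}
\quad\Longrightarrow\quad
F(x \mid \theta) = e^{-G(x)/\theta}
  = \exp\!\left[-\left(\frac{x}{\alpha}\right)^{\beta}\right],
\qquad
\alpha = \left(\frac{\beta\theta}{c}\right)^{1/\beta}.
```

The Weibull shape $\beta$ is thus fixed by the wear curve, while the Weibull scale $\alpha$ is a deterministic function of the average integrated wear $\theta$.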
Endowing the components with a tribological meaning has several ad-
vantages. First, the task of assessing the parameters is simplified; the shape
parameter follows directly from the wear curve and by giving a physical mean-
ing to the scale parameter one can imagine that it is easier to assess a prior
for it. More importantly, however, it makes it possible to actually predict
the reliability of the bits at various cutting speeds. Figure 2.2 shows a set
of wear curves at various cutting speeds. Notice that the curves climb faster
with increasing cutting speed. By substituting these curves into the proba-
bility model we can predict the probabilistic behavior of the bits at various
cutting speeds. If one assumes that the wear curve can be approximated by a
power law, then it follows that the shape parameter increases with increasing
The Case for Probabilistic Physics of Failure 73
cutting speed. We also expect the average cumulative wear to decrease. This
is fully corroborated by the empirical data in Table 2.1.
Fig. 2.2. Wear curves (wear in mm versus usage in cycles) for high-speed and
normal cutting conditions.
Two critical remarks on the PPoF approach are appropriate here. The first
concerns the claim that no data are needed to determine the model. In the drill bit
example, this should be taken to mean that no lifetime data are needed.
The wear curves are data based, but this is data concerning the wear on the
tool's face. The PPoF approach eliminates the need to take additional data.
The second is to point out the weak link in the derivation: the assessment
that links wear to probability in Assessment 2.1. This assessment is necessarily
subjective and one may disagree with it. It is generally impossible to
avoid subjective assessments altogether in lifetime modelling. The choice of a
Weibull model is subjective, even if this choice is based on some type of
data-based identification method, since the choice of such a method is subjective.
From an engineering perspective, the goal is to provide simple statements
that relate directly to the relevant engineering quantities. An engineer can
then choose to agree or disagree. This is a critical component of PPoF and
several alternative methods for making assessments are overviewed in the
next section.
Figure 3.1 divides the PPoF approach into three steps: identifying the failure
mechanism, making a probabilistic assessment with respect to the mechanism,
and translating this into a likelihood model. This section analyzes these
steps in more detail.
Fig. 3.1. The three steps of the PPoF approach: failure mechanism, assessment,
likelihood model.
3.1 Failure Mechanism
First is the identification of the failure mechanism. The simplest models
result when there is one failure mechanism that is dominant. Again, this is a
subjective engineering assumption. Multiple failure mechanisms are handled
using the theorem of total probability as follows:
$$F(x) = \sum_{m} F(x \mid \text{mechanism } m)\, P\{\text{mechanism } m\}.$$
The conditional probabilities are then handled the same way as the models
pertaining to a single mechanism. The marginal probabilities for the failure
mechanism have to be assessed using other means, though. They serve as the
weights that measure the importance of the various mechanisms. Choosing a
dominant failure mechanism, then, corresponds to assigning probability 1 to
that mechanism.
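A minimal numerical sketch of this mixing step in Python; the two mechanism models and the assessed weights are hypothetical:

```python
import math

# Hypothetical survival models for two failure mechanisms (wear and
# fracture), each of the PPoF form exp(-G(x)/theta).
def survival_wear(x):
    return math.exp(-(x / 50.0) ** 2)      # assumed wear model

def survival_fracture(x):
    return math.exp(-(x / 80.0) ** 1.3)    # assumed fracture model

# Assessed marginal probabilities of the mechanisms (the "weights").
p_wear, p_fracture = 0.7, 0.3

def survival(x):
    """Theorem of total probability: mix the conditional models."""
    return p_wear * survival_wear(x) + p_fracture * survival_fracture(x)

print(survival(50.0))
```

Choosing a dominant mechanism corresponds to setting one of the weights to 1, which collapses the mixture back to a single-mechanism model.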
Other relevant factors such as multiple failure sites are handled in a similar
way. Although extending the analysis in this way is straightforward from a
theoretical perspective, it greatly increases the modelling and assessment
efforts. In practice, it may therefore be more expedient to limit oneself to
a small number of important mechanisms rather than attempting to be as
inclusive as possible.
3.2 Assessment
The assessment step relates the failure mechanism to the probability model.
The assessment should be simple and relate directly to the relevant engineer-
ing quantities.
An example is the assessment "twice the wear, twice the probability of
failure in an upcoming small interval" that was used in the introduction to
assess a lifetime distribution for drill bits. This example is taken from Chick
and Mendel (1994). To make this comparison precise, we have to consider
a batch of, say, N items (drill bits). Denote the vector of their lifetimes by
$x = (x_1, \dots, x_N)$ and let $x_i$ and $x_j$ be the lifetimes of two different items.
Let h be the upcoming time interval. Then, the assessment becomes:
from which the expression given in the introduction follows after a passage
to the limit as $N \to \infty$.
This example can be applied to many other damage models apart from
wear. For instance, in fatigue fracture, it is customary to express the fatigue
damage g( n) after n cycles as follows:
$$g(n) = \frac{n}{2}\left(\frac{\Delta\gamma}{2\varepsilon_f'}\right)^{-1/c}.$$
Here $\Delta\gamma$ is the shear strain range (in percent), $\varepsilon_f'$ is the fatigue ductility
coefficient, and $c$ is the fatigue ductility exponent. A probability model that
is consistent with the statement "twice the damage, twice the probability
of failure" is:
$$F(n) = \exp\left(-\frac{n^2}{\theta\left(\dfrac{\Delta\gamma}{2\varepsilon_f'}\right)^{1/c}}\right). \qquad (3.1)$$
This is a Weibull model with the average cumulative damage as scale param-
eter and a shape parameter of 2.
The assessment "twice the 'damage', twice the density of failure" can be
applied to any scalar damage model. Although it is a very simple assumption,
it does have certain attractive characteristics. It gives an entire lifetime dis-
tribution, integrates the damage model into this distribution, and it does not
introduce abstract parameters. Compare this with the usual Coffin-Manson
life equation, which gives only the median life:
$$N_{50} = \frac{1}{2}\left(\frac{\Delta\gamma}{2\varepsilon_f'}\right)^{1/c}.$$
If a lifetime distribution is needed, it is common to choose a Weibull that
has the same median. However, this involves the introduction of a new and
abstract shape parameter that has to be estimated from data. Notice, by the
way, that the median pertaining to (3.1) is,
$$m_{50} = \left(\theta\left(\frac{\Delta\gamma}{2\varepsilon_f'}\right)^{1/c}\left(-\ln[0.5]\right)\right)^{1/2},$$
which is in general not equal to the Coffin-Manson median. When, however,
the average cumulative damage is close to the Coffin-Manson median, then
the two are quite close. However, the model in (3.1) provides a mechanism
for adjusting the median based on an observed average.
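Taking the two median expressions at face value, i.e. reading the Coffin-Manson median as $N_{50} = \frac{1}{2}(\Delta\gamma/2\varepsilon_f')^{1/c}$ and the median of (3.1) as $m_{50} = (\theta\,(\Delta\gamma/2\varepsilon_f')^{1/c}\ln 2)^{1/2}$, a short Python sketch compares the two. The material constants and $\theta$ are hypothetical, chosen so that the average cumulative damage is close to the Coffin-Manson median:

```python
import math

# Hypothetical material constants (illustration only, not measured values).
dgamma = 0.04   # shear strain range (percent)
eps_f = 1.0     # fatigue ductility coefficient
c = -0.5        # fatigue ductility exponent (typically negative)
theta = 900.0   # scale parameter: average cumulative damage

base = (dgamma / (2.0 * eps_f)) ** (1.0 / c)

# Coffin-Manson median life.
N50 = 0.5 * base

# Median of the shape-2 Weibull model (3.1): solve F(m) = 0.5.
m50 = math.sqrt(theta * base * math.log(2.0))

print(N50, m50)  # close but not equal
```

Changing $\theta$ (an observed average) moves $m_{50}$ while $N_{50}$ stays fixed, which is exactly the adjustment mechanism described above.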
When there is more physical structure available than a simple scalar dam-
age model, we can take a more sophisticated approach based on indifference
or invariance. The idea is to identify sets of outcomes that are equally likely
or, equivalently, identify a set of transformations that leave the distribution
invariant. This way of assessing likelihood models was pioneered in the statis-
tics literature by DeFinetti (1964) and further extended by several others (see
Bernardo and Smith 1994 for an overview). For engineering applications, vec-
tor fields on manifolds are a convenient way for identifying equi-probable sets
or, alternatively, to function as the (infinitesimal) generators for the invari-
ance transformations.
To illustrate this, consider the example in Figure 3.2. This is taken from
Shortle and Mendel (1996). A rotor is placed on a shaft which is suspended
by two bearings; in the figure, either the pair $B_1$ and $B_2$ or the pair $B_1$ and $B_2'$.
Inaccuracies in the manufacture of this assembly lead to imbalances. These
lead to torques in the bearings which cause the bearings to fail. We model
the torques probabilistically. There are two sources of imbalance: (1) Static
imbalance, which occurs when the rotor's center of mass is off the axis of
rotation, and (2) dynamic imbalance, which occurs when the rotor's principal
axes of inertia are not aligned with the axis of rotation.
Consider only the dynamic imbalance. To model it probabilistically, we
need to put a distribution on the space of inertia tensors. This is a six-dimensional
space; it is spanned by 3 normal moments of inertia and 3 cross
moments, which are usually arranged in an inertia matrix. The resulting density
of the bearing torque $\tau$ is
$$f(\tau \mid I_1, I_2, I_3) = \frac{1}{2k}\left[\left(k - \sqrt{k^2 - \tau^2}\right)^{1/2} + \left(k + \sqrt{k^2 - \tau^2}\right)^{1/2}\right]\left(k^2 - \tau^2\right)^{-1/2}. \qquad (3.3)$$
Here $k = \omega^2 |I_3 - I_1|/2$ is the maximum torque required to spin the rotor
at an angular velocity of $\omega$. This density is shown in Figure 3.3. It shows
clearly the problem in the manufacture: without control, it is much more
likely to produce an assembly that leads to high torques than it is one with
low torques.
Fig. 3.3. Bearing-torque density $f(\tau)$ for a randomly oriented rotor (here $k = 10$).
Fig. 4.1. Manufacture of a rotor by drilling the center hole on a drill press. $\theta$ is
the unknown error angle from the vertical.
Fig. 4.2. Conditional density of the bearing torque $\tau$ as a function of the drilling
height for either suspension case (panels: minimize the worst-case $\tau$ and minimize
$E(\tau)$; the optimal drill height is marked).
bearings are on one side of the rotor. The PPoF analysis allows us to control the
chances on the various bearing torques by redesigning the manufacture.
5. Conclusions
The PPoF approach uses the physics of the failure process to derive a prob-
ability model. The paper argues that this is useful when there is no data
and when we wish to use probability in the design phase of an engineering
system. However, how would a PPoF approach use lifetime data when this is
available? This is the question this concluding section addresses.
The PPoF approach yields a likelihood model. This likelihood model can
be used to process any data that may be available. The situation is summa-
rized in Figure 5.1. The failure mechanism produces both the physics of the
failure and the failure data. The physics of failure leads to a PPoF likelihood
model that combines with data to provide an updated model.
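As a sketch of this updating step, assume the PPoF likelihood has the survival form $e^{-G(x)/\theta}$ with a hypothetical $G(x) = x^2$, hypothetical lifetime data, and a uniform grid prior on $\theta$ (all names and numbers here are illustrative, not part of the PPoF derivation):

```python
import math

def G(x):
    # Assumed power-law integrated wear (illustration only).
    return x ** 2

def log_density(x, theta):
    # f(x | theta) = G'(x)/theta * exp(-G(x)/theta); here G'(x) = 2x.
    return math.log(2.0 * x / theta) - G(x) / theta

# Observed lifetimes (hypothetical data).
data = [1.2, 0.8, 1.5, 1.1]

# Grid prior on theta: uniform over a plausible range.
thetas = [0.2 * k for k in range(1, 51)]           # 0.2 .. 10.0
log_post = [sum(log_density(x, t) for x in data) for t in thetas]

# Normalize to get posterior weights (subtract max for stability).
m = max(log_post)
w = [math.exp(lp - m) for lp in log_post]
total = sum(w)
post = [wi / total for wi in w]

theta_hat = sum(t * p for t, p in zip(thetas, post))  # posterior mean
print(round(theta_hat, 3))
```

The PPoF model supplies the likelihood; the data then merely update the distribution of the physically meaningful parameter $\theta$.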
Fig. 5.1. The failure mechanism produces the physics of failure and the failure
data; the PPoF likelihood derived from the physics is combined with the data
through the Bayes formalism.
particular set of bits. If the physics of failure addresses a single system, the
PPoF approach will specify both likelihood and prior, although it is not clear
how useful the distinction is then. Thus, although the PPoF approach covers
this, additional information concerning the physics of failure of a particular
system has to be introduced to derive a prior.
References
Chick, S.E., Mendel, M.B.: Deriving Accelerated Lifetime Models from Engineering
Curves with an Application to Tribology. 40th IES Annual Technical Meeting
Proceedings (1994)
Chick, S.E., Mendel, M.B.: Using Wear Curves to Predict the Cost of Changes in
Cutting Conditions. ASME Journal of Engineering for Industry. To appear
(1996)
de Finetti, B.: La Prévision: ses Lois Logiques, ses Sources Subjectives. Annales de
l'Institut Henri Poincare 7, 1-68 (1937). English translation in: Kyburg, Jr.,
H.E., Smokler, H.E.(eds.): Studies in Subjective Probability. New York: Wiley
1964
Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. New York: Wiley 1994
Negishi, H., Aoki, K.: Investigations of Reliability of Carbide Cutting Tools (1st
Report). Precision Machining (Journal of the Japanese Society of Precision
Engineers) 42 (6-extra), 459-464 (1976)
Diaconis P., Freedman D.: A Dozen de Finetti-style Results in Search of a Theory.
Annales de l'Institut Henri Poincare 23, 397-423 (1987)
Schintlmeister, W., Wallgram, W., Kanz, J., Gigl, K.: Cutting Tool Materials
Coated by Chemical Vapour Deposition. In: Dowson, D.(ed.): Wear, a Cele-
bration Volume. Lausanne: Elsevier 1984, pp. 153-169
Shortle, J.F., Mendel, M.B.: Probabilistic Design of Rotors: Minimizing Static and
Dynamic Imbalance. Technical Report #95-29, ESRC (1995)
Shortle, J.F., Mendel, M.B.: Predicting Dynamic Imbalance in Rotors. Probabilistic
Engineering Mechanics. To appear (1996)
Dynamic Modelling of Discrete Time
Reliability Systems
Moshe Shaked1*, J. George Shanthikumar2**, Jose Benigno Valdez-Torres3
1 Department of Mathematics, University of Arizona, Tucson, AZ 85721-0001, USA
2 The Walter A. Haas School of Management, University of California, Berkeley,
CA 94720, USA
3 Escuela de Ciencias Quimicas, Universidad Autonoma de Sinaloa, Culiacan,
Sinaloa, Mexico
Summary. In this paper we summarize recent results that have been obtained
in Shaked et al. (1994, 1995) on the dynamic modelling of reliability systems in
discrete time. Discrete time models of reliability systems are appropriate when the
system operates in cycles or the system is monitored at discrete time epochs. On
the other hand, discrete failure times arise naturally in several common situations
in reliability theory where clock time is not the best scale on which to describe
lifetime. Specifically, we model the dynamic behavior of the components of a relia-
bility system by discrete multivariate conditional hazard rates (which is equivalent
to specifying the joint lifetime distribution of the components). This representation
allows one to extend the basic model to incorporate repairs and replacements
of components in a natural way. An algorithm to construct sample paths of the
dynamics of the components based on the discrete multivariate conditional hazard
rate is described. This algorithm can be used to simulate the system behavior and
can be used for numerical studies as well as for analytic stochastic comparisons. We
use this construction to study stochastic comparison of life times in the hazard rate
and other stochastic orderings (of vectors of discrete dependent random lifetimes).
1. Introduction
This paper surveys and summarizes recent results which have been obtained
by Shaked et al. (1994, 1995) in the dynamic modelling of reliability systems
in discrete time. One may choose to model the dynamics of a reliability
system in discrete time when it is operated in cycles and the observation is the
number of cycles successfully completed prior to failure. In other situations
a device may be monitored only once per time period and the observation
then is the number of time periods successfully completed prior to the failure
of the device. On the other hand discrete failure times in reliability systems
may arise naturally in several common situations where clock time is not the
best scale on which to describe lifetimes. For example, in weapons reliability,
the number of rounds fired until failure is more important than age in failure
* Supported in part by the NSF Grant DMS 9303891
** Supported in part by the NSF Grant DMS 9308149
84 Moshe Shaked et al.
and in the modelling of the landing gear of an aeroplane the number of take-offs
and landings is more important.
The time-dynamic modelling of multi-component reliability systems using
a marked point process approach in continuous time was initially proposed
by Arjas (1981a, 1981b). These works were further extended by Arjas and
Norros (1984, 1989) and Norros (1985, 1986). The continuous analog of the
work described here was originally carried out in a series of papers start-
ing with Shaked and Shanthikumar (1986a, 1986b, 1987a, 1987b). Specif-
ically, among other things, a definition of multivariate conditional hazard
rate functions was introduced in Shaked and Shanthikumar (1986a). The
usefulness of these functions for modelling imperfect repair in the multivari-
ate setting (Shaked and Shanthikumar 1986a) and for characterizing aging
in the multivariate setting (see Shaked and Shanthikumar 1988, 1991a) have
been demonstrated. Several notions of probabilistic ordering among vectors
of random lifetimes, using this dynamic modelling, are studied in Shaked and
Shanthikumar (1987b). A new hazard rate ordering relation among such random
vectors is defined and its relationship to other probabilistic orderings
is studied in Shaked and Shanthikumar (1990). A summary of these results
(in the context of continuous time modelling) can be found in Shaked and
Shanthikumar (1993b). The results of the present paper can be looked at as
a discrete parallel development of the absolutely continuous case summarized
in Shaked and Shanthikumar (1993b). However, in the discrete case there
are some technical problems which do not appear in the absolutely continuous
case. These require the different methodology which is used in the present
paper.
The notion of discrete multivariate conditional hazard rate functions is
presented in Section 2. In Section 3 we present an algorithm (called the dis-
crete dynamic construction) which can construct dynamically, using the dis-
crete multivariate conditional hazard rate functions, a random vector having
a desirable distribution. This algorithm may be used for simulation purposes,
but here we illustrate its use as a technical tool for proving stochastic ordering
among multi-component reliability systems. In Section 4 we give the defini-
tions of the probabilistic orderings which are studied later in the paper. A
result, which states that the discrete multivariate hazard rate ordering implies
stochastic ordering, is proved in Section 4. In the same section, we study the
relationship between the discrete likelihood ordering and the discrete hazard
rate ordering. In Section 5 we discuss the dependence structure among the
components. A summary is provided in Section 6.
Dynamic Modelling of Discrete Time Reliability Systems 85
$$\hat T_J = e. \qquad (3.2)$$
For $i \in \bar J$ the algorithm does not define $\hat T_i$ in this step; these $\hat T_i$'s will be
defined in a later step. Upon determination of $J$ and $\hat T_J$ the algorithm sets
$t = 2$ and then proceeds to Step $t$.
Thus, upon exit from Step 1, some of the $\hat T_i$'s (if any) have been determined
already as described in (3.2), and the other $\hat T_i$'s (i.e., for $i \in \bar J$) are
still to be determined. Therefore $\hat T_{\bar J} > e$. (If $J = \emptyset$ then after Step 1 one has
$\hat T > e$.)
Step $t$. Upon entrance to this step some of the $\hat T_i$'s (if any) have already
been determined. Suppose that the algorithm has already determined the
$\hat T_i$'s with $i \in I$ for some set $I \subset \{1, 2, \dots, n\}$. More explicitly, suppose that
upon entrance to this step we already know that $\hat T_I = t_I$ (where, of course,
$t_I < te$) and that $\hat T_{\bar I} \ge te$. The algorithm now chooses a set $J \subset \bar I$ with
probability $\lambda_{J \mid I}(t \mid t_I)$ and defines (if $J \neq \emptyset$)
$$\hat T_J = te.$$
For $i \in \overline{I \cup J}$ the algorithm does not define $\hat T_i$ in this step; these $\hat T_i$'s (if any)
will be determined in a later step. From Step $t$ the algorithm proceeds to Step
$t + 1$ provided $\overline{I \cup J} \neq \emptyset$. Otherwise the construction is complete.
Thus, upon exit from Step $t$, the $\hat T_i$'s with $i \in I \cup J$ have been determined
already. The other $\hat T_i$'s (if any) are still to be determined, that is, $\hat T_{\overline{I \cup J}} > te$.
Upon entrance to Step $t + 1$ (if ever) we already know the values of $\hat T_i$ for
$i \in I \cup J$.
The algorithm performs the steps in sequence until all the $\hat T_i$'s have been
determined. With probability one this will happen in a finite number of steps
whenever $P\{T_i < \infty,\ i = 1, 2, \dots, n\} = 1$.
From the construction it is clear that $\hat T$ has the discrete multivariate
conditional hazard rate functions of $T$. Since the discrete multivariate conditional
hazard rate functions uniquely determine the probability function, it
follows that $\hat T =_{st} T$.
The discrete dynamic construction can be used to simulate discrete de-
pendent lifetimes. This can be done by generating a sequence of independent
uniform random variables $\{U_t,\ t \in \mathbb{N}_{++}\}$ and using $U_t$ in order to generate
the required probabilities in Step $t$, $t \in \mathbb{N}_{++}$. In this paper, however, we use
the discrete dynamic construction as a technical tool for proving Theorem 4.1
in Section 4.
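As an illustration of such a simulation, here is a Python sketch for a simple exchangeable special case in which, while $m$ components are working, each fails in the current step independently with a probability $p_m$ that depends only on $m$. This special case and all names are assumptions for illustration; the general construction draws the failing set $J$ from the conditional hazard rate functions:

```python
import random

def simulate_lifetimes(p, seed=0):
    """Discrete dynamic construction, exchangeable special case:
    while m components are alive, each fails in the current step
    independently with probability p[m]."""
    rng = random.Random(seed)
    n = len(p) - 1                    # p[m] indexed by number working, m = 1..n
    alive = set(range(n))
    T = [None] * n
    t = 1
    while alive:
        m = len(alive)
        J = {i for i in alive if rng.random() < p[m]}
        for i in J:
            T[i] = t                  # components in J fail at time t
        alive -= J
        t += 1
    return T

# p[m]: per-step failure probability when m components work (hypothetical).
p = [None, 0.5, 0.3, 0.2]            # n = 3 components
lifetimes = simulate_lifetimes(p, seed=42)
print(lifetimes)
```

With the seed fixed the run is reproducible; since every $p_m$ is positive, all lifetimes are determined in a finite number of steps, matching the termination condition stated above.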
4.1 Definitions
Let $X = (X_1, X_2, \dots, X_n)$ and $Y = (Y_1, Y_2, \dots, Y_n)$ be two discrete random
vectors taking on values in $\{\dots, -1, 0, 1, \dots\}^n = \mathbb{Z}^n$. The random vector
$X$ is said to be stochastically smaller than the random vector $Y$ (denoted
$X \le_{st} Y$) if $E\phi(X) \le E\phi(Y)$ for all increasing functions $\phi$ for which the
expectations exist.
assure the stochastic ordering relation between two vectors of discrete random
vectors.
In order to define the next ordering (the one we call the hazard rate
ordering) we need to introduce some notation. This ordering will be used
only in order to compare vectors of discrete random lifetimes. Therefore, we
assume now that $X$ and $Y$ can take on values only in $\mathbb{N}_{++}$.
For $t \in \mathbb{N}_{++}$ let $h_t$ denote a realization of the failure times of $n$ components
up to time $t$, exclusive. That is, if $X_1, X_2, \dots, X_n$ are the discrete
random lifetimes of the components, then $h_t$ is an event of the form
$\{X_I = x_I,\ X_{\bar I} \ge te\}$ for some $I \subset \{1, 2, \dots, n\}$ and $x_I < te$. On such events
we condition the probabilities in the definition (2.1) of the discrete multivariate
conditional hazard rate functions. Such an event will be called a history.
Fix a $t \in \mathbb{N}_{++}$. If $h_t$ and $h'_t$ are two histories such that in $h_t$ there are
more failures than in $h'_t$ and every component which failed in $h'_t$ also failed
in $h_t$, and, for components which failed in both histories, the failures in $h_t$
are earlier than the failures in $h'_t$, then we say that $h_t \le h'_t$. More explicitly,
if $h_t$ is a history associated with $X$ of the form $\{X_I = x_I,\ X_{\bar I} \ge te\}$ and $h'_t$ is
a history associated with $Y$ of the form $\{Y_A = y_A,\ Y_{\bar A} \ge te\}$ then $h_t \le h'_t$
if, and only if, $A \subset I$ and $x_A \le y_A$ (of course, we also have $x_{I-A} < te$ and
$y_A < te$).
Remark 4.1. Before proceeding, we note a 1-1 association between $\{0,1\}^n$
and the set of subsets of $\{1, 2, \dots, n\}$. For each point $u \in \{0,1\}^n$ let $A(u) \subset
\{1, 2, \dots, n\}$ be the set of the coordinates of $u$ which are 1's. Conversely, for
each set $A = \{i_1, i_2, \dots, i_k\} \subset \{1, 2, \dots, n\}$ let $u(A) \in \{0,1\}^n$ be the vector
which has 1's in places $i_1, i_2, \dots, i_k$ and 0's elsewhere.
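The 1-1 association of Remark 4.1 is easy to spell out in code (a few lines of Python, 1-indexed to match the text; the function names are illustrative):

```python
def subset_of(u):
    """A(u): the set of (1-indexed) coordinates of u which are 1."""
    return {i for i, ui in enumerate(u, start=1) if ui == 1}

def vector_of(A, n):
    """u(A): the 0/1 vector of length n with 1's exactly in the places of A."""
    return tuple(1 if i in A else 0 for i in range(1, n + 1))

u = (1, 0, 1, 1)
A = subset_of(u)              # A == {1, 3, 4}
assert vector_of(A, 4) == u   # the two maps are inverse to each other
print(A)
```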
Let $\mu_{\cdot \mid \cdot}(\cdot \mid \cdot)$ denote the discrete multivariate conditional hazard rate
functions of $X$ (as defined in (2.1)). Similarly, let $\eta_{\cdot \mid \cdot}(\cdot \mid \cdot)$ be the hazard rate
functions of $Y$.
Given a history $h_t$, associated with $X$, of the form $\{X_I = x_I,\ X_{\bar I} \ge te\}$,
we define now a probability measure $Q_{h_t}$ on $\{0,1\}^n$ as follows. For $A \subset \bar I$ set
$$Q_{h_t}\big(u(I \cup A)\big) = \mu_{A \mid I}(t \mid x_I), \qquad (4.2)$$
and let the mass of $Q_{h_t}$ on all other points of $\{0,1\}^n$ be 0. It is obvious
that $Q_{h_t}$ is a proper probability measure; it corresponds to the indicators of
the components that have failed by time $t$, inclusive. We call $Q_{\cdot}$ the discrete
multivariate conditional hazard rate measure of $X$.
Similarly, given a history $h'_t$, associated with $Y$, one can define, as in (4.2),
the discrete multivariate conditional hazard rate measure of $Y$. It is denoted
by $R_{\cdot}$.
$$\mu_{J \mid I}(t \mid x_I) = p_{n-|I|}^{|J|}\,(1 - p_{n-|I|})^{n-|I|-|J|}, \qquad J \subset \bar I,\ I \subset \{1, 2, \dots, n\},\ t \in \mathbb{N}_{++}.$$
Let $Y$ have the same distribution but with parameters $q_n, q_{n-1}, \dots, q_1$ rather
than $p_n, p_{n-1}, \dots, p_1$. That is, suppose that the discrete multivariate conditional
hazard rate functions of $Y$ are
$$X \le_{st} Y. \qquad (4.9)$$
Proof. The proof will be done by constructing, on the same probability space,
two random vectors $(\hat X_1, \hat X_2, \dots, \hat X_n)$ and $(\hat Y_1, \hat Y_2, \dots, \hat Y_n)$ such that
$$\hat X =_{st} X, \qquad (4.10)$$
$$\hat Y =_{st} Y, \quad \text{and} \qquad (4.11)$$
$$\hat X \le \hat Y \quad \text{a.s.} \qquad (4.12)$$
From (4.10), (4.11) and (4.12) one obtains (4.9).
Denote the discrete multivariate conditional hazard rate functions of $X$
by $\mu_{\cdot \mid \cdot}(\cdot \mid \cdot)$ and of $Y$ by $\eta_{\cdot \mid \cdot}(\cdot \mid \cdot)$.
The construction of $\hat X$ and $\hat Y$ will be done in steps indexed by $t \in \mathbb{N}_{++}$.
Here, as in the discrete dynamic construction, we describe an algorithm in
which $t$ is to be thought of as a value of discrete time. In Step $t$ it is determined
which $\hat X_i$'s (if any) and which $\hat Y_i$'s (if any) are equal to $t$.
Step 1. The algorithm enters this step with the obvious information that
$\hat X \ge e$ and $\hat Y \ge e$. Consider $Q_{h_1}$ as in (4.3) with $t = 1$ and $I = \emptyset$ (because
$h_1 = \{X \ge e\}$). Consider $R_{h_1}$ as in (4.3) with $t = 1$ and $I = \emptyset$ except that
here $\eta$ replaces $\mu$. From (4.3) it follows that $Q_{h_1} \ge_{st} R_{h_1}$. Therefore random
vectors $U_1$ and $V_1$, which can take on values in $\{0,1\}^n$, can be defined on
the same probability space such that $U_1$ has the probability measure $Q_{h_1}$, $V_1$
has the probability measure $R_{h_1}$ and $U_1 \ge V_1$ with probability one (see, e.g.,
Kamae et al. 1977). Let $S_1$ be the joint probability measure of $(U_1, V_1)$. The
algorithm now chooses a realization $(u_1, v_1)$ according to $S_1$.
Let $A \subset \{1, 2, \dots, n\}$ be the set associated with $u_1$ as described in Remark
4.1. Similarly, let $A' \subset \{1, 2, \dots, n\}$ be the set associated with $v_1$. Since
$u_1 \ge v_1$ it follows that $A \supset A'$. Of course $A'$ or $A$ may be empty.
Define
$$\hat X_A = e, \qquad \hat Y_{A'} = e,$$
set $t = 2$ and proceed to Step $t$.
Upon exit from Step 1 some of the $\hat X_i$'s and some of the $\hat Y_i$'s (if any) have
been determined and it is known, then, that $\hat X_{\bar A} > e$ and $\hat Y_{\bar{A'}} > e$. It follows
that we already have $\hat X_{A'} \le \hat Y_{A'}$.
Step $t$. Upon entrance to this step some of the $\hat X_i$'s and some of the $\hat Y_i$'s
(if any) have already been determined. Suppose that the $\hat X_i$'s have been
determined for all $i \in A$ for some set $A \subset \{1, 2, \dots, n\}$. More explicitly,
suppose that $\hat X_A = x_A$, $\hat X_{\bar A} \ge te$. Suppose, also, that the $\hat Y_i$'s have been
determined for $i \in A'$ for some set $A' \subset \{1, 2, \dots, n\}$. More explicitly, suppose
$\hat Y_{A'} = y_{A'}$, $\hat Y_{\bar{A'}} \ge te$. By the induction hypothesis, $A \supset A'$, $x_A < te$, $x_{A'} \le
y_{A'} < te$. Therefore, if we define $h_t = \{X_{A'} = x_{A'},\ X_{A-A'} = x_{A-A'},\ X_{\bar A} \ge
te\}$ and $h'_t = \{Y_{A'} = y_{A'},\ Y_{\bar{A'}} \ge te\}$ we have $h_t \le h'_t$. Consider now $Q_{h_t}$ and
$R_{h'_t}$ as defined in Section 4. From (4.3) it follows that $Q_{h_t} \ge_{st} R_{h'_t}$. Therefore,
random vectors $U_t$ and $V_t$, taking on values in $\{0,1\}^n$, can be defined, on
the same probability space, such that $U_t$ is distributed according to $Q_{h_t}$, $V_t$
is distributed according to $R_{h'_t}$, and $U_t \ge V_t$ with probability one. Let $S_t$
be the joint probability measure of $(U_t, V_t)$. The algorithm now chooses a
realization $(u_t, v_t)$ according to $S_t$.
Let $B \subset \{1, 2, \dots, n\}$ be the set associated with $u_t$ as described in Remark
4.1 and let $B' \subset \{1, 2, \dots, n\}$ be the set similarly associated with $v_t$. From
the definition of $Q_{h_t}$ it is clear that $B \supset A$. Similarly, from the definition of $R_{h'_t}$
it is seen that $B' \supset A'$. Also, since $u_t \ge v_t$ it follows that $B \supset B'$. Define
XB < VB a.s ..
Notice that not necessarily all the Yi's with i E B have been determined by
Step t. The Yi's with i E B - B' have not been determined yet, but they
must satisfy Yi > t.
Performing the steps of this procedure in sequence, the algorithm finally
determines all the $\hat X_i$'s and $\hat Y_i$'s using the construction for all $h_t$ and $h'_t$ which
are realized. The resulting $\hat X$ and $\hat Y$ must satisfy (4.12). The $\hat X$ satisfies (4.10)
because it is marginally constructed as in the discrete dynamic construction.
Similarly, $\hat Y$ satisfies (4.11). $\square$
As an example of the use of Theorem 4.1, consider the $X$ and the $Y$
defined in Example 4.1. It has been shown in Example 4.1 that $X \le_h Y$. It
follows from Theorem 4.1 that $X \le_{st} Y$.
$$X \le_h Y. \qquad (4.13)$$
Proof. Denote the discrete density of $X$ by $f$ and of $Y$ by $g$.
Split $\{1, 2, \dots, n\}$ into three mutually exclusive sets $I$, $J$ and $L$ (so that
$L = \overline{I \cup J}$). Fix $x_I$, $x_J$, $y_I$ and $t \in \mathbb{N}_{++}$ such that $x_I \le y_I < te$ and $x_J < te$.
Let $h_t = \{X_I = x_I,\ X_J = x_J,\ X_L \ge te\}$ and $h'_t = \{Y_I = y_I,\ Y_{J \cup L} \ge te\}$.
First we show that
$$[(X_I, X_J, X_L) \mid h_t] \le_{lr} [(Y_I, Y_J, Y_L) \mid h'_t]. \qquad (4.14)$$
The discrete density of $[(X_I, X_J, X_L) \mid h_t]$ at $(a_I, a_J, a_L)$ is proportional to
$f(a_I, a_J, a_L)$ provided $a_I = x_I$, $a_J = x_J$, $a_L \ge te$, and is 0 otherwise. The
discrete density of $[(Y_I, Y_J, Y_L) \mid Y_I = y_I,\ Y_{J \cup L} \ge te]$ at $(b_I, b_J, b_L)$ is
proportional to $g(b_I, b_J, b_L)$ provided $b_I = y_I$, $b_J \ge te$, $b_L \ge te$, and is 0
otherwise. Thus (4.14) holds if
$$f(a_I, a_J, a_L)\, g(b_I, b_J, b_L) \le f(a_I \wedge b_I,\ a_J \wedge b_J,\ a_L \wedge b_L)\; g(a_I \vee b_I,\ a_J \vee b_J,\ a_L \vee b_L) \qquad (4.15)$$
on the relevant supports. Since $x_I \le y_I < te$ and $x_J < te$, it follows that (4.15) holds if
$$f(x_I, x_J, a_L)\, g(y_I, b_J, b_L) \le f(x_I, x_J, a_L \wedge b_L)\, g(y_I, b_J, a_L \vee b_L)$$
for $b_J \ge te$, $a_L \ge te$ and $b_L \ge te$. But this follows from the assumption that
$X \le_{lr} Y$. Thus (4.14) holds.
Since $\le_{lr} \Rightarrow \le_{st}$ (see, e.g., Karlin and Rinott (1980) or Whitt (1982)) it
follows from (4.14) that
Now define
$$W_i = \begin{cases} 1 & \text{if } X_i \le t; \\ 0 & \text{if } X_i > t; \end{cases}
\qquad \text{and} \qquad
Z_i = \begin{cases} 1 & \text{if } Y_i \le t; \\ 0 & \text{if } Y_i > t. \end{cases}$$
From (4.16) it follows that
$$W \ge_{st} Z. \qquad (4.17)$$
The conditional distribution of $W$ given $h_t$ is determined by $\mu_{A \mid I \cup J}(t \mid x_I, x_J)$,
$A \subset \overline{I \cup J}$, which are the discrete multivariate conditional hazard rate
functions conditioned on $h_t$. This distribution is the one which is associated with
the discrete multivariate conditional hazard rate measure $Q_{h_t}$ of $X$ (see Section
4 for its definition). Similarly, the conditional distribution of $Z$ given $h'_t$
is the one associated with the discrete multivariate conditional hazard rate
measure $R_{h'_t}$ of $Y$. And (4.17) is equivalent to
when one tries to study them. One use of Theorem 4.2 is to show that the
positive dependence notion defined by $X \le_{lr} X$ implies the positive dependence
notion defined by $X \le_h X$.
In this paper we have presented the discrete multivariate hazard rate functions
as the time-dynamic models of multi-component reliability systems
and studied stochastic order relationships among them. These orderings are
discrete analogues of the continuous orderings of Shaked and Shanthikumar
(1990b), but the technical difficulties which are encountered while studying
the discrete orderings are different from those involved with the continuous
orderings of Shaked and Shanthikumar (1990b).
In Shaked and Shanthikumar (1990b) an ordering relation, called the
cumulative hazard ordering and denoted by $\le_{ch}$, is also studied. An analogue of
this ordering is not studied here because a "correct" discrete analogue of $\le_{ch}$
is not easy to identify; see Valdez-Torres (1989).
Shaked and Shanthikumar (1991a) used the orderings of Shaked and Shanthikumar
(1990b) in order to define several multivariate aging notions for continuous
dependent random lifetimes, such as MIFR (multivariate increasing
failure rate) and a kind of multivariate logconcavity which was called MPF$_2$
(multivariate Pólya frequency of order 2). Similar discrete analogues can be
developed using the discrete multivariate orderings of the present paper. We
may do it elsewhere.
References
Karlin, S., Rinott, Y.: Classes of Orderings of Measures and Related Correlation
Inequalities. I. Multivariate Totally Positive Distributions. Journal of Multivari-
ate Analysis 10, 467-498 (1980)
Norros, I.: Systems Weakened by Failures. Stochastic Processes and Their Applications
20, 181-196 (1985)
Norros, I.: A Compensator Representation of Multivariate Life Length Distribu-
tions, with Applications. Scandinavian Journal of Statistics 13, 99-112 (1986)
Ross, S.M.: A Model in Which Component Failure Rates Depend on the Working
Set. Naval Research Logistics Quarterly 31, 297-300 (1984)
Shaked, M., Shanthikumar, J.G.: Multivariate Imperfect Repair. Operations Re-
search 34, 437-448 (1986a)
Shaked, M., Shanthikumar, J.G.: The Total Hazard Construction, Antithetic Vari-
ates and Simulation of Stochastic Systems. Stochastic Models 2, 237-249
(1986b)
Shaked, M., Shanthikumar, J.G.: The Multivariate Hazard Construction. Stochastic
Processes and Their Applications 24, 241-258 (1987a)
Shaked, M., Shanthikumar, J.G.: Multivariate Hazard Rates and Stochastic Order-
ing. Advances in Applied Probability 19, 123-137 (1987b)
Shaked, M., Shanthikumar, J.G.: Multivariate Conditional Hazard Rates and the
MIFRA and MIFR Properties. Journal of Applied Probability 25, 150-168
(1988)
Shaked, M., Shanthikumar, J.G.: Multivariate Stochastic Orderings and Positive
Dependence in Reliability Theory. Mathematics of Operations Research 15,
545-552 (1990)
Shaked, M., Shanthikumar, J.G.: Dynamic Multivariate Aging Notions in Reliabil-
ity Theory. Stochastic Processes and Their Applications 38, 85-97 (1991a)
Shaked, M., Shanthikumar, J.G.: Dynamic Construction and Simulation of Random
Vectors. In: Block, H.W., Sampson, A., Savits, T.H. (eds.): Topics in Statistical
Dependence. IMS Lecture Notes (1991b), pp. 415-433
Shaked, M., Shanthikumar, J.G.: Dynamic Multivariate Mean Residual Functions.
Journal of Applied Probability 28, 613-629 (1991c)
Shaked, M., Shanthikumar, J.G.: Dynamic Conditional Marginal Distributions in
Reliability Theory. Journal of Applied Probability 30, 421-428 (1993a)
Shaked, M., Shanthikumar, J.G.: Multivariate Conditional Hazard Rate and Mean
Residual Life Functions and Their Applications. In: Barlow, R.E., Clarotti,
C.A., Spizzichino, F. (eds.): Reliability and Decision Making. Chapman and
Hall: New York 1993b, pp. 137-155
Shaked, M., Shanthikumar, J.G., Valdez-Torres, J.B.: Discrete Probabilistic Order-
ing in Reliability Theory. Statistica Sinica 4, 567-579 (1994)
Shaked, M., Shanthikumar, J.G., Valdez-Torres, J.B.: Discrete Hazard Rate Func-
tions. Computers and Operations Research 22, 391-402 (1995)
Valdez-Torres, J.B.: Multivariate Discrete Failure Rates with Some Applications.
Ph.D. Dissertation. University of Arizona (1989)
Whitt, W.: Multivariate Monotone Likelihood Ratio and Uniform Conditional
Stochastic Order. Journal of Applied Probability 19, 695-701 (1982)
Reliability Analysis via Corrections
Igor N. Kovalenko 1 ,2
1 STORM Research Centre, University of North London, 166-220 Holloway Road,
London N7 8DB, United Kingdom
2 V.M. Glushkov Institute of Cybernetics, National Academy of Sciences, Ukraine,
40 Glushkov Avenue, Kiev 252207, Ukraine
1. Introductory Remarks
Let me cite from Asmussen and Rubinstein (1995): "Analytical and even
'good' asymptotical expressions for ... rare event probabilities ... are only
available for a very small class of systems." I fully agree with this opinion,
but my experience suggests that it is not fruitless to seek ever more
general queueing models admitting the derivation of asymptotic or
approximate expressions for, say, reliability parameters. And in case there
is no explicit formula for the desired parameter, one can very often choose an
appropriate formula for a close, slightly changed system, and then calculate
the necessary corrections.
The purpose of the present paper is the derivation of such corrections in
three problems typical of the investigation of complex systems reliability.
For simplicity, only the simple queueing system M/G/2/2 is considered
throughout the paper, but the approach is fruitful in much more general
cases as well.
A short annotated bibliography is attached.
q₀ = λ³ ∬ ⋯ ,

where B̄(t) = 1 − B(t) and B̂(t) = ∫ₜ^∞ B̄(z) dz, provided the integral is finite.
Consider a random variable T₀ that vanishes in each of two cases: (i) no
system failure occurs within the busy period; (ii) a non-monotonic failure
occurs in the same period; and that is defined as the length of the first system
failure interval in case (iii), when a monotonic path failure occurs within the
busy period. In the example being considered
E T₀ = ∭ ⋯

For small λ,

E T₀ ≈ ½ λ³ ∫₀^∞ z B̂(z) dz,

provided the right-hand side is finite.
The cited asymptotic expressions are well known, but non-monotonic failure
paths can contribute substantially in practical cases. Consider, for example,
the exponential case B̄(t) = e^(−μt), t ≥ 0. Set p = λ/μ, and let q₁ be
defined as q₁ = q − q₀. For small p,
q₀ ≈ (1/4) p³,  q₁ ≈ (3/8) p⁴,  q₂ ≈ (9/16) p⁵.
p        q₁/q₀     q₂/q₀
0.1      0.15      0.022
0.01     0.015     0.0002
0.001    0.0015    2·10⁻⁶
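As a quick numerical check (a sketch, not part of the original analysis), the relative contributions implied by the asymptotic expansions are q₁/q₀ = 1.5p and q₂/q₀ = 2.25p², which reproduce the table above:

```python
# Sketch: relative contributions of non-monotone failure paths,
# assuming the small-p asymptotics q0 ~ p^3/4, q1 ~ 3p^4/8, q2 ~ 9p^5/16.
def relative_contributions(p):
    q0 = p**3 / 4
    q1 = 3 * p**4 / 8
    q2 = 9 * p**5 / 16
    return q1 / q0, q2 / q0   # equal to 1.5*p and 2.25*p**2

for p in (0.1, 0.01, 0.001):
    r1, r2 = relative_contributions(p)
    print(f"p={p}: q1/q0={r1:.4g}, q2/q0={r2:.4g}")
```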
E T / E T₀ ≤ (1 + p)(1 + p/2)²,  and  E T / E T₀ − 1 ≈ 2p,  p → 0.
We have

q = q₀ + Δq,   T = T₀ + ΔT,

where Δq and ΔT should be estimated via simulation.
Many recent investigations deal with the elaboration of variance reduction
methods for the estimation of rare event probabilities. We should mention
the monographs Asmussen (1987) and Rubinstein and Shapiro (1993). In both
of them the score function method, generalizing traditional importance
sampling, was developed. I suggest an analytical computation of q₀ and T₀,
whereas simulation estimates are applied for the computation of the correction
terms Δq and ΔT. The approach of stratified sampling combined with the
score function method is used.
Let I_A denote the indicator of a random event A. Then

θₖ = I_{Aₖ} ∏_{j=1}^{rₖ} (λ e^(−λ xₖⱼ)) / (λ₀ e^(−λ₀ xₖⱼ)),

where xₖⱼ, 1 ≤ j ≤ rₖ, denote the failure-free times in trial k, and λ₀ is the
parameter of the sampling exponential law.
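A minimal sketch of this likelihood-ratio reweighting (illustrative only; the event, parameters, and sample size are assumptions): failure-free times are drawn from an exponential law with parameter λ₀, and each trial is weighted by θ, giving an unbiased estimate under the original parameter λ. Here the "rare event" is simply {x > t}:

```python
import math
import random

def is_estimate(lam, lam0, t, n=100_000, seed=1):
    """Importance-sampling estimate of P(X > t) for X ~ Exp(lam),
    sampling from Exp(lam0) and weighting each draw by the likelihood
    ratio theta = lam*exp(-lam*x) / (lam0*exp(-lam0*x))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam0)          # sampled under the twisted law
        theta = (lam * math.exp(-lam * x)) / (lam0 * math.exp(-lam0 * x))
        if x > t:                          # indicator of the rare event
            total += theta
    return total / n

# Compare with the exact value exp(-lam*t)
est = is_estimate(lam=1.0, lam0=0.2, t=5.0)
print(est, math.exp(-5.0))
```

Sampling with the smaller rate λ₀ makes the rare event frequent, and the weight θ corrects the bias; the empirical variance is far below that of crude Monte Carlo.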
Consider, for example, the construction of a small-variance unbiased
estimate of Δq for the exponential case B̄(t) = e^(−μt), which can be reduced
to a stopped random walk with transition probabilities

1 → 2 : 1;   2 → 3 : b;   3 → 4 : b;
2 → 1 : 1 − b;   3 → 2 : 1 − b.

The value of b is chosen as

b = (1 + √10)/6 ≈ 0.6937.
Then

σ²[Δq̂] ≈ (1/n) · 0.6604 p⁴

for small p, whereas q₀ ≈ (1/4) p³.
The bounds

σ[Δq̂/q₀] ≤ C p,   σ[ΔT̂/T₀] ≤ C₁ p

can be established for a wide class of queueing systems, the constants C and
C₁ depending on an appropriate moment of the repair time distribution.
A further improvement can be suggested: compute q₀ + q₁ analytically
and use a correction computed by simulation. The simulated variable
in a single trial has the form ⋯
and up-transitions of the process ν(t) have rate λₖ = (4 − k)λ whenever
ν(t) = k, 0 ≤ k ≤ 4. For the latter system the parameter q can be
estimated in a similar manner as is done in Section 3. For a fixed B(t) and
small λ
A′ p̄ = ū,

where ū = (1, 0, 0, 0, 0)ᵀ. The following approximate equation holds for A′:

A′ = ( ⋯ )
00
where p = ).//-1, 1
/-I
= f0 B(t)dt. We have
Let P₀(t) denote the pointwise availability of the system. It is well known
that

P₀(t) → μ/(λ + μ)  as t → ∞

and

h(t) → λμ/(λ + μ)  as t → ∞,

where h(t) is the up-to-down renewal rate [indeed, h(t) = λP₀(t)]. It is
important, however, to estimate the deviations of both functions from their
steady-state limits. Kovalenko and Birolini (1995) derive an exponential
two-sided bound for P₀(t):
Δ(t) = Σ_{n=0}^∞ (−1)ⁿ pⁿ B̄ ∗ B^{∗(n)}(t)

holds, and hence some Monte Carlo procedures can be derived for the
computation of a non-stationary correction.
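As an illustration of such a Monte Carlo computation (a sketch with assumed parameter values, not the authors' procedure), one can simulate the alternating up-down process directly and compare the transient availability P₀(t) with its steady-state limit μ/(λ + μ):

```python
import random

def transient_availability(lam, mu, t, n=50_000, seed=7):
    """Monte Carlo estimate of P0(t): the fraction of sample paths of an
    alternating renewal process (Exp(lam) up-times, Exp(mu) down-times)
    that are in the 'up' state at time t, starting up at time 0."""
    rng = random.Random(seed)
    up_count = 0
    for _ in range(n):
        clock, up = 0.0, True
        while True:
            clock += rng.expovariate(lam if up else mu)
            if clock > t:
                break          # state at time t is the current one
            up = not up
        up_count += up
    return up_count / n

lam, mu = 1.0, 10.0
print(transient_availability(lam, mu, 5.0), mu / (lam + mu))
```

For exponential up- and down-times the deviation from the limit decays at rate λ + μ, so at t = 5 the estimate is already indistinguishable from μ/(λ + μ) up to Monte Carlo noise.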
Appendix: Bibliography
Asmussen, S.: Applied Probability and Queues. New York: Wiley 1987
Asmussen, S.: Light Traffic Equivalence in Single Server Queues. Ann. Appl. Prob.
2, 555-574 (1992)
Baccelli, F., Schmidt, V.: Taylor Expansions for Poisson Driven (max, +)-linear
systems. Research Report No. 2494, INRIA (1995)
Birolini, A.: Quality and Reliability of Technical Systems. Berlin: Springer 1994
Blaszczyszyn, B., Rolski, T., Schmidt, V.: Light Traffic Approximations in Queues
and Related Stochastic Models. In: Dshalalow, J.H. (ed.): Advances in Queueing.
Boca Raton: CRC Press 1995, pp. 379-406
Cohen, J.W.: The Single Server Queue. Amsterdam: North Holland 1982
Çınlar, E.: Fatigue Crack Growth. In this volume (1996), pp. 37-52
Towards Rational Age-Based Failure Modelling

1. Introduction

3. Repairable Items
We shall restrict our attention in this paper to items that normally are
repaired upon failure, replacement being only an alternative maintenance
action. This situation, rather than the single-failure case in which an item is
replaced upon its first failure, is the one encountered with most items of
importance: the large and the expensive ones, e.g., engines and machines, and
even circuit boards in electronic devices. For such items the term "life
distribution" is clearly a misnomer (hence the quotation marks we used
earlier), but conforming with the general literature we shall retain the term
here.
Since the subject of this paper is age-based failure modelling, we shall
assume here that repairs are minimal (see Barlow and Proschan 1975), so
that the past history of failures has no impact on the future of the failure
process and only age counts. This essentially qualitative property
translates elegantly into a specific stochastic process, namely a (time
non-homogeneous) Poisson process [this follows immediately from the
independent-increment property of the (unit-jump) failure process throughout
the life of an item, which is implied by the very definition of a minimal repair,
coupled with a corresponding characterization of Poisson processes (Çınlar
1975)]. Moreover, it can be shown that the intensity function of this Poisson
process is exactly the hazard function r(·) of the underlying life distribution
F(·). These two functions are interrelated by the one-to-one relationship:
F̄(x) = exp(−∫₀ˣ r(t) dt) = e^(−x μ(x)), with μ(x) = R(x)/x the average hazard
over [0, x]. For example,

μ(x) = ln(ax + b), x ≥ 0, corresponds to F̄(x) = (ax + b)^(−x);

(c) μ(x) = ax/(1 + x), x ≥ 0, a > 0, corresponds to F̄(x) = e^(−ax²/(1+x));

and μ(x) = eˣ, x ≥ 0, corresponds to F̄(x) = e^(−x eˣ).
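The correspondence in example (c) can be checked numerically; the sketch below (with an assumed value of a) verifies that −ln F̄(x) equals the cumulative hazard x·μ(x):

```python
import math

a = 2.0  # assumed parameter value, for illustration only

def mu(x):
    """Average hazard of example (c): mu(x) = a*x/(1+x)."""
    return a * x / (1 + x)

def survival(x):
    """Corresponding survival function: exp(-a*x**2/(1+x))."""
    return math.exp(-a * x**2 / (1 + x))

# -ln Fbar(x) must equal x * mu(x) = R(x), the cumulative hazard
for x in (0.5, 1.0, 3.0, 10.0):
    assert abs(-math.log(survival(x)) - x * mu(x)) < 1e-12
print("average-hazard identity verified")
```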
The first thing to observe is that even though the functional forms of μ(x)
in the above four examples are basic and "nice", the resulting life distributions
are rarely, if ever, used and are unlikely to be arrived at by direct "guessing".
Thus, perfectly legitimate candidates for age-based failure modelling have
been ignored merely because the "niceness" of the mathematical functional
forms has been placed on the "wrong guys".
We note that all the above μ(x) are monotonically increasing (technically
corresponding to the IFRA aging property; see the discussion later), which is
the natural state of affairs with mechanical systems (at least once the burn-in
period is over), with the different functional forms above reflecting different
monotonicity characteristics (rate of increase, etc.). Continuing this pattern,
an extensive map of life distributions can be created, interrelated through
the monotonicity characteristics of their corresponding μ(x) functions. This
scheme of life distributions should serve as the knowledge base for the
fitting procedure.
The most basic such property is merely that μ(x) is an increasing function of x.
In the present context of minimal repairs and the Poisson failure process thus
generated, so that μ(x) = R(x)/x, this property of μ(x) is technically identical
to IFRA (we say "technically" because the IFRA notion was devised for the
single-failure case). We have thus "rescued" this mathematical property by
identifying a context where it is meaningful and useful. The payoff is that all
the mathematical results obtained for the IFRA (Barlow and Proschan 1975)
can be directly applied for the case considered here, namely, repairable items
with minimal repairs. In particular, this includes probably the most useful of
these results, namely the closure property of (coherent) systems, i.e. (when
phrased in the present context): if for each component the expected number
of failures per unit of time increases with age, then the same holds for the
system itself (assuming operational independence between the components).
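For a series system under minimal repair this closure property is immediate, since the system failure intensity is the sum of the component intensities; a numeric sketch (with made-up component intensity functions):

```python
# Sketch: two components with increasing failure intensities r1, r2;
# the series-system intensity r1 + r2 is then increasing as well.
r1 = lambda x: 0.1 * x          # component 1 intensity (increasing in age)
r2 = lambda x: 0.02 * x**2      # component 2 intensity (increasing in age)
system = lambda x: r1(x) + r2(x)

ages = [0.5, 1.0, 2.0, 4.0]
values = [system(x) for x in ages]
assert all(v1 < v2 for v1, v2 in zip(values, values[1:]))  # increasing in age
print(values)
```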
References
Barlow, R.E., Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart and Winston 1975
Berg, M.: Reliability Analysis for Mission-Critical Items. Naval Research Logistics
34,417-429 (1987)
Berg, M.: Age-Based Failure Modelling: A Hazard-Function Approach. CentER
Discussion Paper (No. 9569), Tilburg University (1995)
Çınlar, E.: Introduction to Stochastic Processes. Englewood Cliffs: Prentice-Hall
1975
Çınlar, E.: Fatigue Crack Growth. In this volume (1996), pp. 37-52
Goel, P., Palettas, P.N.: Predictive Modelling for Fatigue Crack Propagation via
Linearizing Transformations. In this volume (1996), pp. 53-69
Mendel, M.: The Case for Probabilistic Physics of Failure. In this volume (1996),
pp. 70-82
Özekici, S.: Complex Systems in Random Environments. In this volume (1996),
pp. 137-157
Part II
1. Introduction
for maintenance and replacement decisions include failure data of the
equipment under consideration, which are usually neither widely available
nor easy to obtain. This makes the application of mathematical models to
support maintenance and replacement decisions less obvious.
A second reason that is often put forward to explain the lack of success
in applications of maintenance and replacement models is the simplicity of
the models compared to the complex environment where the applications
occur. In particular the fact that up to ten years ago the vast majority of
the models were concerned with one single piece of equipment operating in
a fixed environment was considered as an intrinsic barrier for applications.
However, one should realise that this argument is also valid for waiting time
and inventory applications. The booming interest in polling models in queue-
ing theory and in multi-item inventory control models in logistics reflects this
increasing need for more realistic modelling of complex management prob-
lems. From this point of view also the increasing interest for multicomponent
maintenance models can be understood. In this context we should realise,
however, that the availability of reliable data becomes even more important
for successful applications of theoretical developments in this area. Successful
case studies on practical maintenance models are badly needed to convince
management of the potential cost savings in this management field. In sec-
tion 4 we will briefly describe the application of one of the models in the area
of road management. For other implementations of maintenance models we
refer to Dekker and Van Rijn (1996) and Groenendijk (1996).
This chapter is organised as follows. In Section 2 we present an overview
of the models to be discussed in this chapter. We also indicate the various
economic backgrounds that justify the choice of these models. In Section 3
we address the problem of how to structure the (corrective) maintenance of
parallel and identical units. In Section 4 preventive maintenance of parallel
and non-identical units is considered, while in Section 5 we address the
problem of how to combine, in an economically optimal fashion, corrective
and preventive maintenance actions on a number of independent units. Finally in
Section 6 we pay attention to models which take explicitly into account that
maintenance activities should be considered as an intrinsic part of produc-
tion schedules, implying that scheduling of preventive maintenance activities
should not only be based on the physical condition of the equipment but also
on its immediate impact on the production process in which it operates.
operation). But also in case the unavailability of one single unit prevents the
system from operating, there might be room for combination of maintenance
activities. In particular the moments at which corrective maintenance activ-
ities are called for, might be used to carry out preventive maintenance on
non-failed, but deteriorated, units. Such a policy might reduce the number
of unexpected corrective maintenance activities at fairly low costs, since pre-
ventive maintenance, when combined with corrective maintenance, can be
carried out without substantial additional expenses. Models to describe de-
cision problems of this kind have been studied by many authors. In Section
5 we describe some results based on Haurie and L'Ecuyer (1982), Ozekici
(1988), Van der Duyn Schouten and Vanneste (1990, 1993) and Wijnmalen
and Hontelez (1996).
A major difficulty experienced in practical situations is that maintenance
and production are considered as responsibilities of different departments.
The maintenance department prefers to do preventive maintenance at those
moments at which the (maintenance) workload is low, while the production
department prefers to carry out maintenance activities when the demand rate
is low. Unfortunately, in general those dips in workload will not coincide.
The most sensible solution is to make the production department responsible
for maintenance of their own equipment as long as technical skills are not
prohibitive in this respect. In Section 8 we describe a model which is aimed
at illustrating the possible effects of this integration in terms of cost savings.
The presentation is based on Van der Duyn Schouten and Vanneste (1995)
and De Waal and Vanneste (1995).
Theorem 3.1. Suppose the system starts at time 0 with all units in opera-
tional condition. When
Proof. The proof proceeds in three steps. First of all it is noted that in the
search for the optimal policy we can restrict ourselves to policies characterized
by two critical numbers m and l: repair l units whenever the number of failed
units reaches level m, for some 1 ≤ l ≤ m ≤ n. This result is based on
the observations that the number of failed units increases in steps of size one,
while any repair reduces the number of failed units. Secondly, it is shown
by renewal reward arguments that the average cost g(m, l) per unit of time
for the policy with critical numbers m and l is given by
Theorem 3.2. (i) The optimal critical repair level m* is the smallest integer m
for which

Σ_{k=0}^{m−1} k/(n − k) ≥ ⋯

(ii) For large values of n, the optimal critical repair level m* is asymptotically
proportional to √n.
Proof. See Assaf and Shanthikumar (1987).
Ritchken and Wilson (1990) consider the case of general lifetime distri-
butions, also with instantaneous repair. They restrict attention to a class of
policies characterized by two critical numbers m and T, implying that a main-
tenance activity (including repair of all failed units and overhaul of all
non-failed units) is started if and only if the number of failed units has reached the
level m or T units of time have passed since the previous maintenance activ-
ity. Since only combined maintenance on all units is considered the moments
at which maintenance is started are renewal points for the process describing
the ages of each of the individual components. From the analysis by Assaf
and Shanthikumar (1987) it can be concluded that, in case of exponentially
distributed lifetimes, for the optimal policy within this class we have T = 00.
Note that the expected time between two subsequent maintenance activ-
ities equals
Similar expressions can be obtained for the total expected costs during one
cycle, which provides an explicit expression for the average costs per unit of
time as a function of the control parameters m and T. Using some properties
of this function Ritchken and Wilson present an algorithm to compute the
optimal values of m and T from a finite number of function evaluations.
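With exponential lifetimes, the time Sₘ until the m-th failure is a sum of exponential phases with rates (n − k)λ, k = 0, …, m − 1, so the expected cycle length E[min(Sₘ, T)] can be estimated by simulation. This is a sketch with assumed parameter values, not the authors' closed-form expression:

```python
import random

def expected_cycle_length(n, m, T, lam, trials=100_000, seed=3):
    """Estimate E[min(S_m, T)], where S_m is the time of the m-th failure
    among n units with i.i.d. Exp(lam) lifetimes and no repair in between:
    the k-th inter-failure time is Exp((n - k)*lam), k = 0, ..., m-1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = 0.0
        for k in range(m):
            s += rng.expovariate((n - k) * lam)
            if s >= T:
                s = T          # maintenance is triggered by the clock
                break
        total += min(s, T)
    return total / trials

print(expected_cycle_length(n=5, m=3, T=2.0, lam=1.0))
```

For m = 1 the estimate can be checked against the exact value (1 − e^(−nλT))/(nλ).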
Assaf and Shanthikumar also show that under the assumptions of their
model there exists an optimal policy which does not allow operational units
to idle. Jansen and Van der Duyn Schouten (1995) show that this conclusion
is not correct in case the repair is not instantaneous. They consider the case
where the costs for production losses far outweigh the actual repair costs
of the machines, i.e. C₂(k) and C₃(k) are both assumed to be equal to zero.
However, the costs for production losses C₁(k) are non-decreasing and convex
in k. The lifetime distributions are again exponential with parameter λ (as
in the case of Assaf and Shanthikumar). Also the repair time of a single
unit is exponentially distributed with parameter μ. There are no economies of
scale in repair time, i.e. the total length of the repair time of two units is the
sum of two exponentials, each with parameter μ. Due to the assumptions on
the cost functions, it follows, in correspondence with the result of Assaf and
Shanthikumar, that the optimal policy will not allow a repair on a failed unit
to be postponed until other units have failed (the critical repair level is equal
to 1). In order to investigate whether it is profitable to allow operational units
to idle, we assume that the running speed of each individual unit is adjustable
between 0 and 1. Using speed x for a unit simply means that the failure rate
of this unit is reduced from λ to xλ. So using speed 0 means that a unit
is completely idle. Consequently, when i units are operational, the total
production speed of all units together can be controlled within the interval
[0, i]. The function C₁(·) that represents the costs of loss of production is
now assumed to have a continuous argument. Jansen and Van der Duyn
Schouten (1995) consider the case of restricted repair capacity (meaning that
the number of available repair servers s is smaller than or equal to n, the
number of units). In this presentation we will only deal with the case of ample
repair capacity (s = n).
This control problem can be formulated as a semi-Markov decision model
with discrete state space {0, 1, ..., n} and continuous action space [0, i] in state
i. State i corresponds to the situation that i units are available and n − i
are under repair. Taking action a ∈ [0, i] means that the system produces at
capacity i − a, while capacity a is kept in reserve. Note that putting a unit in
reserve has a negative impact on the present productivity level, but
has the advantage that this unit is not subject to failure and hence is available
with certainty when the next unit breaks down. In case the running speeds of
individual units are not adjustable, the action space in state i simply reduces
to {0, 1, ..., i}. Now in state i, only transitions to states i − 1 and i + 1 will
occur, since the capacity that is kept in reserve is available again at the next
decision epoch. This leads to the following transition probabilities, expected
transition times, and expected one-step transition costs for the semi-Markov
decision process:
p_{i,i−1}(a) = (i − a)λ / [(i − a)λ + (n − i)μ]   (a ≤ i),
p_{i,i+1}(a) = (n − i)μ / [(i − a)λ + (n − i)μ]   (a ≤ i; j = i + 1),
τᵢ(a) = 1 / [(i − a)λ + (n − i)μ]   (a ≤ i),      (3.1)
cᵢ(a) = C₁(n − i + a) τᵢ(a)   (a ≤ i).
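These quantities are straightforward to tabulate; the sketch below (all parameter values and the cost function are assumptions for illustration) builds the transition probabilities, expected sojourn time, and expected one-step cost for a single state-action pair from the rates described above:

```python
def smdp_data(i, a, n, lam, mu, C1):
    """Transition data in state i under action a (a <= i): produced
    capacity i - a fails at total rate (i - a)*lam, while the n - i
    units in repair complete at total rate (n - i)*mu."""
    assert 0 <= a <= i <= n
    rate = (i - a) * lam + (n - i) * mu
    p_down = (i - a) * lam / rate       # transition to state i - 1
    p_up = (n - i) * mu / rate          # transition to state i + 1
    tau = 1.0 / rate                    # expected sojourn time
    cost = C1(n - i + a) * tau          # expected one-step cost
    return p_down, p_up, tau, cost

C1 = lambda k: 3.0 * k                  # assumed linear loss-of-production cost
print(smdp_data(i=3, a=1, n=5, lam=1.0, mu=2.0, C1=C1))
```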
The average cost optimality equations thus become
Hence
A fixed policy y = (y₁, ..., yₙ) can now be analyzed by means of a birth-death
process with transition rates yᵢλ (from state i down to i − 1) and (n − i)μ
(from state i up to i + 1), giving

πᵢ = (∏_{j=1}^{i} yⱼ)⁻¹ · (n!/(n − i)!) · (μ/λ)ⁱ · π₀   (i = 1, ..., n).   (3.5)
This gives the following expression for the average costs as a function of the
control rates yᵢ (i = 1, ..., n), where yᵢ denotes the actual productivity level
when i units are available:

g(y) = Σ_{i=0}^{n} πᵢ C₁(n − yᵢ).   (3.6)

The optimal policy is determined by the control rates yᵢ that minimize (3.6)
subject to yᵢ ≤ i (i = 1, ..., n). Moreover, (3.6) can be used to construct
an efficient policy iteration algorithm as follows. In any step of the policy
iteration procedure, the following system of equations has to be solved for
some fixed policy y = (y₁, ..., yₙ):

vᵢ(y) = [C₁(n − yᵢ) − g(y) + yᵢλ vᵢ₋₁(y) + (n − i)μ vᵢ₊₁(y)] / [yᵢλ + (n − i)μ]
   (i = 0, ..., n).   (3.7)

Since the average costs g(y) can be calculated from (3.6), system (3.7) can
be solved recursively as follows (compare (3.3)):

v₀(y) = 0,
v₀(y) − v₁(y) = [C₁(n) − g(y)] / (nμ),
⋯   (3.8)
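For a fixed policy, the stationary distribution and the average cost can be computed directly from detailed balance; a sketch under assumed parameters (failure rate yᵢλ in state i, repair rate (n − i)μ, and an illustrative linear cost function):

```python
def evaluate_policy(y, n, lam, mu, C1):
    """Evaluate a fixed policy y[1..n] (productivity when i units are
    available): birth-death stationary distribution and average cost g(y)."""
    # unnormalized pi_i via detailed balance:
    #   pi_i * y_i * lam = pi_{i-1} * (n - i + 1) * mu
    w = [1.0]
    for i in range(1, n + 1):
        w.append(w[-1] * (n - i + 1) * mu / (y[i] * lam))
    z = sum(w)
    pi = [x / z for x in w]
    # average cost: cost rate C1(n - y_i) weighted by pi_i (y_0 = 0)
    g = sum(pi[i] * C1(n - (y[i] if i else 0)) for i in range(n + 1))
    return pi, g

n, lam, mu = 3, 1.0, 2.0
y = {1: 1, 2: 2, 3: 3}          # run every available unit at full speed
C1 = lambda k: 5.0 * k          # assumed cost of k units of lost capacity
pi, g = evaluate_policy(y, n, lam, mu, C1)
print(pi, g)
assert abs(sum(pi) - 1.0) < 1e-12
```

Brute-force search over feasible y using this evaluation is exactly the kind of optimization that (3.6) enables without running a full Markov decision algorithm.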
Using this efficient method, the policy iteration algorithm converges relatively
fast to the optimal policy, compared to the value iteration algorithm, both in
the number of iterations and in calculation time. For n = 5 the procedure
takes in most cases fewer than 5 iterations and less than 10 seconds of
calculation time on an 80386DX microprocessor to find the optimal policy.
Maintenance Policies for Multicomponent Systems: An Overview 125
Theorem 3.3.

yᵢ ≤ yᵢ₊₁   (i = 0, ..., n − 1).
For the complete proof of this theorem we refer to Jansen and Van der Duyn
Schouten (1995). Here we only provide a global indication of the various steps
of the proof, since these steps are more or less typical for proving structural
results of this kind. First define

and

Hence

0 ≤ v_{i,α} − v_{n,α} ≤ C₁(n).

Now, using Theorem V.2.2 (ii) in Ross (1983), we conclude that there exists
a sequence of discount factors αₖ ↑ 1 such that

From this limiting relation it follows that vᵢ inherits the monotonicity and
convexity of v_{i,α} (see also Ross 1983, p. 96, Remark 1).

Finally we note that, using (3.3) and differentiating the expression that
has to be minimized, it can be seen that the optimal control in state i equals
min{i, zᵢ}, with zᵢ satisfying
Table 3.1. Minimal average costs and optimal productivity levels for various values
of the workload

p      g₀        g₁, g₂    y₁   y₂    y₃    y₄    y₅
0.01   0.05146   0.05146   1    2     3     4     5
                 0.05144   1    2     3     4     5
1      7.5000    7.4806    1    2     3     4     4
                 7.4793    1    2     3     3.99  4.19
2      12.222    12.206    1    2     3     3     4
                 12.198    1    2     3     3.37  3.58
5      18.056    18.032    1    2     2     3     3
                 18.026    1    1.91  2.22  2.46  2.64
10     21.074    21.059    1    1     2     2     2
                 21.039    1    1.40  1.64  1.83  1.99
available. The optimal control limit type policy can then be determined by a
straightforward one-dimensional search procedure.
where

Δ(k) = Σ_{j=1}^{n} (−1)^{j+1} Σ_{{a₁,...,aⱼ} ⊂ {1,...,n}} lcm(k_{a₁}, ..., k_{aⱼ})⁻¹

and Mⱼ(t) denotes the renewal function generated by Fⱼ(t) (see Dagpunar
1981).
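The inclusion-exclusion sum can be evaluated directly with `math.lcm` (a sketch; the interpretation of Δ(k) as the long-run fraction of basic intervals at which at least one unit is replaced is an assumption here):

```python
import math
from itertools import combinations

def delta(k):
    """Inclusion-exclusion sum over nonempty subsets of k:
    sum_j (-1)**(j+1) * sum over subsets A of size j of 1/lcm(A)."""
    n = len(k)
    total = 0.0
    for j in range(1, n + 1):
        for subset in combinations(k, j):
            total += (-1) ** (j + 1) / math.lcm(*subset)
    return total

# e.g. k = (2, 3): 1/2 + 1/3 - 1/6 = 2/3, the density of integers
# divisible by 2 or 3
print(delta((2, 3)))
```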
hⱼ(k, T) := (aⱼ + cⱼ Mⱼ(kT)) / (kT)
Step 0. Determine for every individual unit j the optimal replacement cycle
Tⱼ, i.e. solve n independent one-dimensional optimization problems of the
following type

Let T := min_{1≤j≤n} Tⱼ.

Step 1. For every j ∈ {1, ..., n} determine the smallest value of kⱼ such that

and go to Step 1.
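A sketch of Steps 0-1 (the power-law renewal-function approximation M(t) = (t/η)^β and all parameter values are assumptions for illustration): each unit's stand-alone optimal cycle Tⱼ is found by a grid search, T is set to the smallest of them, and each unit is then restricted to replacement at integer multiples kT:

```python
def make_h(a, c, eta, beta):
    """Cost rate h(T) = (a + c*M(T)) / T with M(t) = (t/eta)**beta."""
    return lambda T: (a + c * (T / eta) ** beta) / T

def argmin_grid(f, lo, hi, steps=20_000):
    """Crude one-dimensional grid search for the minimizer of f."""
    xs = [lo + (hi - lo) * i / steps for i in range(1, steps + 1)]
    return min(xs, key=f)

units = [make_h(1.0, 0.5, 1.0, 2.0), make_h(2.0, 0.4, 1.5, 3.0)]

# Step 0: individual optima and the basic interval T
T_j = [argmin_grid(h, 0.01, 10.0) for h in units]
T = min(T_j)

# Step 1: best integer multiple k_j * T for each unit
k_j = [min(range(1, 50), key=lambda k, h=h: h(k * T)) for h in units]
print(T_j, T, k_j)
```

For the first unit h(T) = 1/T + 0.5T, whose minimizer is T = √2, so the grid search can be sanity-checked against the analytic optimum.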
In Vos De Wael (1995) an application of this model to the maintenance
of road traffic control systems is described. In this situation a unit consists
of a group of light bulbs functioning in a traffic control system on a certain
road crossing. Bulbs with the same burning and cost characteristics are put
together into one single group. A typical replacement rule used in practice is:
replace all bulbs serving red lights every three months, the bulbs serving the
green lights every six months, and the bulbs serving the yellow lights every
year. The following numerical example illustrates the algorithm.
Here Erl(λ; k) denotes the Erlang distribution with parameters λ and k,
while Weib(λ; a) denotes the Weibull distribution with parameters λ and a.

Step 0. T₁ = 0.344; T₂ = 0.229; T₃ = 0.372; T₄ = 0.924; T₅ = 1.218; T = 0.229.
optimal due to their special cost structure). Haurie and L'Ecuyer show by
counterexample that the optimal policy is not necessarily monotone, not even
when the lifetime distribution is IFR. A policy π is called monotone when
π(x) ⊆ π(y) whenever x ≤ y.
Özekici (1988) considers the same model under much more general
assumptions on the aging process (not necessarily independent lifetime
distributions) and on the cost functions. In this situation it is certainly not
always optimal to start a maintenance action only when a failure has occurred.
In Özekici (1988), the following structural result is shown:
Theorem 5.1. The optimal policy π* has the following property:
Policy class B: a complete system overhaul is carried out at the first time
epoch at which an individual component enters state 2 or 3 after the first
moment at which the number of doubtful components has reached the level K.

The difference between the two control rules is rather subtle and concerns
the decision to be made when the number of doubtful components has reached
the level K. Under policy B a system overhaul will certainly be performed
at the first subsequent epoch at which one of the components turns bad or
fails. However, when this component was a doubtful one, a system overhaul is
not carried out under policy A, because the number of doubtful components
decreases from K to K − 1. For both types of policies explicit expressions are
derived for the average number of system overhauls per unit of time as well
as the expected number of preventive and corrective maintenance actions on
individual units per system lifetime. The authors also show how this model
can be used as an approximation for the situation where the failures are
governed by a lifetime distribution. Assuming an IFR lifetime distribution,
they propose to identify the age interval [0, r] with the good state, the age
interval [r, R] with the doubtful state, and the interval [R, ∞) with the bad
state. Apart from the control limit K, the parameters r and R can also be
used as control variables.
Numerical investigations show that this approximation gives fairly good
results and certainly can be used to support the decision of how to choose the
relevant control variables. In particular, it is noteworthy that the quality
of the approximations improves when the number of units increases. The
validation of the approximation is done by simulation.
To conclude this section we mention a recent paper by Wijnmalen and
Hontelez (1996), in which a promising computational procedure is proposed
to find "good" policies for the general model introduced in this section.
Attention is restricted to policies which are characterized by two vectors
(M₁, M₂, ..., Mₙ) and (m₁, m₂, ..., mₙ), as a straightforward generalization
of the (m, M) policies introduced in Vergin and Scriabin (1977) for the
two-unit system. A maintenance action on component j is compulsory as soon
as its age (or condition) exceeds level Mⱼ; and when maintenance on another
unit is carried out, unit j is included in this maintenance operation
whenever its age (or condition) exceeds mⱼ. As mentioned earlier, an exact
optimal policy for this decision process has a rather complex structure. Hence,
Van der Duyn Schouten and Vanneste (1995) restrict attention to the subclass
of (m, M, k)-policies. An (m, M, k)-policy prescribes to start a PM action if
and only if the age i of the production unit and the buffer content x satisfy
the relations i ≥ M and k ≤ x < K, or the relations i ≥ m and x = K.
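The decision rule of an (m, M, k)-policy is a simple predicate; a sketch (the parameter values are illustrative assumptions):

```python
def start_pm(i, x, m, M, k, K):
    """(m, M, k)-policy: start preventive maintenance iff the unit's age i
    and buffer content x satisfy (i >= M and k <= x < K) or
    (i >= m and x == K)."""
    return (i >= M and k <= x < K) or (i >= m and x == K)

m, M, k, K = 2, 5, 3, 10    # assumed control parameters
assert start_pm(6, 4, m, M, k, K)        # old unit, enough buffer
assert not start_pm(6, 2, m, M, k, K)    # buffer below k: keep producing
assert start_pm(3, 10, m, M, k, K)       # full buffer: PM already at age m
assert not start_pm(1, 10, m, M, k, K)   # too young even with full buffer
print("policy rule checks passed")
```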
Numerical experiments show that the class of (m, M, k)-policies performs
very well in the sense that the optimal policy within this restricted class is in
general less than 1% away from the overall optimal policy. The advantages of
the (m, M, k)-policies over the overall optimal policy are twofold. First of all
these policies are relatively easy to implement, and secondly the performance
of these policies can be determined analytically, as is shown in Van der Duyn
Schouten and Vanneste (1995). The latter advantage enables us to do (brute
force) optimization within the class of (m, M, k)-policies in a much more
efficient way than overall optimization by Markov decision theory. The effect
of using both buffer content and condition of the unit as indicators for the
advisability of PM, can be quantified by comparing the best policy in the
(m, M, k)-class with the optimal age replacement policy, in which only the
condition of the unit counts. Numerical examples show that the differences
are typically in the range from 0 to 25%.
In De Waal and Vanneste (1995) a detailed analysis is presented of the
transient behaviour of the buffer content process under various maintenance
strategies.
References
Assaf, D., Shanthikumar, J.G.: Optimal Group Maintenance Policies with Contin-
uous and Periodic Inspections. Management Sci. 33, 1440-1452 (1987)
Dagpunar, J.S.: Formulation of a Multi Item Single Supplier Inventory Problem.
Journal of the Operational Research Society 33, 285-286 (1981)
Dekker, R., van Rijn, C.: PROMPT, A Decision Support System for Opportunity-
Based Preventive Maintenance. In this volume (1996), pp. 530-549
Dekker, R., Frenk, J.B.G., Wildeman, R.E.: How to Determine Maintenance Fre-
quencies for Multi-component Systems? A General Approach. In this volume
(1996), pp. 239-280
De Waal, P.R., Vanneste, S.G.: System Effectiveness of a Production Unit with an
Output Buffer. Shell Research, Rep. AMER. 94.010 (1995)
Federgruen, A., Groenevelt, H., Tijms, H.C.: Coordinated Replenishments in a
Multi-Item Inventory System with Compound Poisson Demands. Management
Sci. 30, 344-357 (1984)
Goyal, S.K., Kusy, M.I.: Determining Economic Maintenance Frequency for a Fam-
ily of Machines. Journal of the Operational Research Society 36, 1125-1128
(1985)
Groenendijk, W.: Maintenance Management System: Structure, Interfaces and Im-
plementation. In this volume (1996), pp. 519-529
Haurie, A., L'Ecuyer, P.: A Stochastic Control Approach to Group Preventive Re-
placement in a Multicomponent System. IEEE Trans Automat. Control 27,
387-393 (1982)
Jansen, J., Van der Duyn Schouten, F.A.: Maintenance Optimization on Parallel
Production Units. IMA J. Math. Appl. Bus. Indust. 6, 113-134 (1995)
Ozekici, S.: Optimal Periodic Replacement of Multi-Component Reliability Sys-
tems. Operations Res. 36, 542-552 (1988)
Ritchken, P., Wilson, J.G.: (m, T) Group Maintenance Policies. Management Sci.
36, 632-639 (1990)
Ross, S.M.: Introduction to Stochastic Dynamic Programming. Orlando: Academic
Press 1983
Van der Duyn Schouten, F.A., Vanneste, S.G.: Analysis and Computation of (n, N)-
Strategies for Maintenance of a Two-Component System. European Journal of
Operational Research 48, 260-274 (1990)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Two Simple Control Policies for a
Multicomponent Maintenance System. Operations Res. 41, 1125-1136 (1993)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Maintenance Optimization of a
Production System with Buffer Capacity. European Journal of Operational Re-
search 82, 323-338 (1995)
Van der Duyn Schouten, F.A., Wartenhorst, P.: Transient Analysis of a Two-Unit
Standby System with Markovian Degrading Units. Management Sci. 40, 418-
428 (1994)
Van Eijs, M.J.G.: On the Determination of the Control Parameters of the Optimal
Can-Order Strategy. Z. Oper. Res. 39, 289-304 (1994)
Vergin, R.C., Scriabin, M.: Maintenance Scheduling for Multicomponent Equip-
ment. AIlE Transactions 9, 297-305 (1977)
Vos De Wael, S.: Strategies for Lampremplace (In Dutch). Masters Thesis. Tilburg
University (1995)
Wijnmalen, D.J.D., Hontelez, J.A.M.: Coordinated Condition Based Repair Strate-
gies for Components of a Multicomponent Maintenance System with Discounts.
European Journal of Operational Research. To appear (1996)
Complex Systems in Random Environments
Süleyman Özekici
1. Introduction
Most of the literature on stochastic models in operations research and management science involves models where the parameters remain unchanged. In cases where they do change, the change is usually indexed by the time factor only, leading to dynamic models. There are many real-life applications where these parameters, whether they involve the deterministic or stochastic structure of the model, change randomly with respect to a randomly changing environmental factor. Thus, the model parameters can be viewed as stochastic processes rather than simple deterministic constants, as in stationary models, or deterministic functions of time, as in dynamic models.
This paper takes a close look at complex stochastic models that oper-
ate in a randomly changing environment which affects the deterministic and
stochastic model parameters. Here, complexity is due not only to the variety
in the number of components of the model, but also to the fact that these
components are interrelated through their common environmental process.
For example, the demand processes for a multi-item inventory model, the
customer arrival processes to various queues in a network and the compo-
nent lifetimes in a multi-component reliability model will be dependent due
to the fact that they are all subject to a common environmental process. In all of our models, we assume that stochastic dependence is only due to this common environmental process.
2. Inventory Models
We now discuss the periodic review model of Özekici and Parlar (1995)
which unites the notions discussed above through an environmental process
which not only affects the demand, but also the supply availability and the
cost parameters. Consider a single-product inventory system which is inspected periodically over an infinite planning horizon. The state of the environment observed at the beginning of period n is represented by Y_n, and we assume that Y = {Y_n; n ≥ 0} is a time-homogeneous Markov chain on a discrete state space E with a given transition matrix P having elements P(i,j) = P[Y_{n+1} = j | Y_n = i]. Let X_n denote the inventory position observed at the beginning of period n. The basic assumption of this model is
that all parameters during a period depend on the state of the environment
at the beginning of that period. Therefore, the decision maker observes both
the inventory position and the environmental state to decide on the order quantity.
(2.3)

for n ≥ 0. Let V_i(x) be the optimal expected total discounted cost if the initial environment is i and the inventory position is x. Then, using the Markov property one can easily verify that V_i(x) satisfies the dynamic programming equation (DPE)
with

L_i(y) = h_i ∫_0^y Λ_i(dz) (y − z) + p_i ∫_y^∞ Λ_i(dz) (z − y).   (2.6)
The derivation of the DPE (2.4) is quite straightforward. Here, y is not the policy but a real number which represents the order-up-to level. This notation is used interchangeably to denote a policy as well as a decision whenever our intention is clear from the expressions. Note that in environment i with the inventory position at level x, a decision y ≥ x is taken only if the supply is available, which happens with probability u_i. In this case, a fixed cost K_i is incurred only if y > x and the purchase cost is c_i(y − x). Moreover, the one-period expected holding and shortage cost is L_i(y) and, as usual, the sum on the right-hand side of (2.5) is the expected optimal cost from the next period onward. If the supply is not available, which happens with probability 1 − u_i, then no ordering cost is incurred and the inventory position stays at the level x.
For any real-valued function f : E × R → R, define the mapping T as

y*(i, x) = S_i − x if x ≤ s_i, and y*(i, x) = 0 if x > s_i,

for i ∈ E, x ∈ R.
For any i ∈ E, S_i is the smallest minimizer of G_i so that G_i(S_i) ≤ G_i(y) for all y ∈ R. Moreover, s_i ≤ S_i can be computed by solving
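As a small numerical sketch, an environment-dependent (s_i, S_i) rule of this kind can be evaluated by simulation. All numbers below (the transition matrix, Poisson demand means, supply availability probabilities u_i, the policy levels, and the costs) are hypothetical choices for illustration, not taken from the model above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state environment: 0 = "normal", 1 = "disrupted".
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])        # environment transition matrix P(i, j)
demand_mean = [4.0, 8.0]          # Poisson demand mean per environment
u = [1.0, 0.6]                    # u_i: probability that supply is available
s = [3, 6]                        # reorder points s_i (illustrative)
S = [10, 16]                      # order-up-to levels S_i (illustrative)
h, p = 1.0, 9.0                   # holding / shortage cost per unit per period

def average_cost(n_periods=200_000):
    i, x, total = 0, 0, 0.0
    for _ in range(n_periods):
        # environment-dependent (s_i, S_i) rule, subject to supply availability
        if x <= s[i] and rng.random() < u[i]:
            x = S[i]
        x -= rng.poisson(demand_mean[i])
        total += h * max(x, 0) + p * max(-x, 0)
        i = rng.choice(2, p=P[i])  # environment makes a Markov transition
    return total / n_periods

print(f"simulated average cost per period: {average_cost():.2f}")
```

Such a simulation only evaluates a given pair of level vectors; finding the optimal (s_i, S_i) still requires the dynamic programming recursion of the text.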
Suppose that there are m items in store and the demand for the k'th one in period n is given by D_n^k so that

(2.11)

for any item k and n ≥ 0. Here x represents the inventory position of the whole system and x_k represents the inventory position of the k'th item. Let V_i(x) be the optimal expected total discounted cost if the initial environment is i and the inventory position is x. Then, using the Markov property one can easily verify that V_i(x) satisfies the dynamic programming equation (DPE)
V_i(x) = min_{y ≥ x} { K_i δ(Σ_{k=1}^m (y_k − x_k)) + Σ_{k=1}^m u_i^k K_i^k δ(y_k − x_k) + Σ_{b ∈ {0,1}^m} u_i(b) G_i(x + b(y − x)) } − Σ_{k=1}^m c_i^k x_k   (2.12)

for i ∈ E, x ∈ R^m, where δ(z) is the indicator function which is equal to 0 only if z = 0 and 1 otherwise, and

G_i(y) = Σ_{k=1}^m c_i^k y_k + L_i(y) + β Σ_{j∈E} P(i,j) ∫_{R^m} Λ_i(dz) V_j(y − z)   (2.13)
with

L_i(y) = Σ_{k=1}^m ( h_i^k ∫_0^{y_k} Λ_i^k(dz) (y_k − z) + p_i^k ∫_{y_k}^∞ Λ_i^k(dz) (z − y_k) ).   (2.14)
3. Queueing Models
Queueing models also involve stochastic and deterministic parameters that
are subject to variations depending on some environmental factors. The cus-
tomer arrival rate as well as the service rate are not necessarily constants
that remain intact throughout the entire operation of the queueing system.
The environmental process in this case could represent any factor that may influence these rates. Arrival rates of vehicles to a highway and their service rates on that highway obviously depend on weather conditions; the production rate of a machine or work station depends on how well it is performing physically and, in particular, this rate would be zero if it is in a failed state. In server-vacation models, the service rate is zero if the server is "vacationing". Production rates are routinely changed due to work schedules, and many businesses go through slack periods where hardly any customers arrive.
A queueing model where the arrival and service rates depend on a randomly changing two-state environment was first introduced by Eisen and Tainiter (1963).
where A_0(i,j) = μ_i I(i,j), A_1(i,j) = Q(i,j) − (λ_i + μ_i) I(i,j), and A_2(i,j) = λ_i I(i,j) are all M × M matrices.
Suppose that the environment has the stationary distribution π_i = lim_{t→+∞} P[Y_t = i] that can be determined by solving πQ = 0 and π1 = 1. The average arrival rate is πλ = Σ_{i∈E} π_i λ_i and the average service rate is πμ = Σ_{i∈E} π_i μ_i, so that the traffic intensity is now expressed as ρ = πλ/πμ.
The stationary distribution

v_n(i) = lim_{t→∞} P[X_t = n, Y_t = i]   (3.3)
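The stationary environment distribution and the resulting traffic intensity are easy to compute numerically. The generator and the rate vectors below are hypothetical, chosen only to illustrate the computation πQ = 0, π1 = 1:

```python
import numpy as np

# Hypothetical two-state environment Y (e.g. fair / bad weather).
Q = np.array([[-0.2,  0.2],
              [ 1.0, -1.0]])      # generator of the environment process
lam = np.array([3.0, 0.5])        # arrival rate lambda_i per environment
mu  = np.array([4.0, 4.0])        # service rate mu_i per environment

# Solve pi Q = 0 together with pi 1 = 1 by stacking both conditions.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

rho = (pi @ lam) / (pi @ mu)      # traffic intensity rho = pi·lambda / pi·mu
print(pi, rho)                    # pi is approximately [0.833, 0.167]
```

With these numbers ρ < 1, so the modulated queue is stable even though the arrival rate in the "fair" state alone would be close to the service rate.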
4. Reliability Models
In reliability and maintenance models, it is generally assumed that a device
always works in a given fixed environment. The probability law of the dete-
rioration and failure process thus remains intact throughout its useful life.
The life distribution and the corresponding failure rate function are taken to be the ones obtained through statistical life testing procedures that are usually conducted under ideal laboratory conditions by the manufacturer of the device. Data on lifetimes may also be collected while the device is in operation to estimate the life distribution. In any case, the basic assumption is
that the prevailing environmental conditions either do not change in time or,
in case they do, they have no effect on the deterioration and failure of the
device. Therefore, statistical procedures for estimating the life distribution parameters, and decisions related to replacement and repair, are based on the calendar age of the item.
There has been growing interest in recent years in reliability and main-
tenance models where the main emphasis is placed on the so-called intrinsic
age of a device rather than its real age. This is necessitated by the fact that
devices often work in varying environments during which they are subject
to varying environmental conditions with significant effects on performance.
The deterioration and failure process therefore depends on the environment,
and it no longer makes much sense to measure the age in real time without
taking into consideration the different environments that the device has op-
erated in. There are many examples where this important factor can not be
neglected or overlooked. Consider, for example, a jet engine which is subject
to varying atmospheric conditions like pressure, temperature, humidity, and
mechanical vibrations during take-off, cruising, and landing. The changes in
these conditions cause the engine to deteriorate, or age, according to a set of
rules which may well deviate substantially from the usual one that measures
the age in real time irrespective of the environment.
As a matter of fact, the intrinsic age concept is being used routinely in
practice in one form or another. In aviation, the calendar age of an airplane
since the time it was actually manufactured is not of primary importance
in determining maintenance policies. Rather, the number of take-offs and
landings, total time spent cruising in fair conditions or turbulence, or total
miles flown since manufacturing or since the last major overhaul are more
important factors.
Another example is a machine or a workstation in a manufacturing sys-
tem which may be subject to varying loading patterns depending on the
production schedule. In this case, the atmospheric conditions do not neces-
sarily change too much in time, and the environment is now represented by
varying loading patterns so that, for example, the workstation ages faster
when it is overloaded, slower when it is underloaded, and not at all when
it is not loaded or kept idle. Therefore, the term "environment" is used in
a loose sense here so that it represents any set of conditions that affect the
deterioration and aging of the device.
Once again, it is also routine practice in manufacturing systems to mea-
sure the age of a workstation not with respect to its real age since installation,
but with respect to another criterion like the number of parts produced by
the workstation since its installation or since the last maintenance. Here, the
environment can be the production rate which can be set at different levels
depending on the production schedule or on a usually cyclic workload re-
quired during the production shifts on any given day. It is reasonable then
to suppose that if the production rate is increased or decreased, the worksta-
tion ages faster or slower respectively. In case the production rate is zero, the
workstation should not age at all or possibly at a very low failure rate that
accounts for the effect of deterioration in real time.
(4.2)

for n ≥ 1. It follows trivially that (4.2) leads to the recursive formula

P_i[L > n] = q(i) Σ_{j∈E} P(i,j) P_j[L > n − 1]   (4.3)

with the obvious boundary condition P_i[L > 0] = 1, where P_i[A] = P[A | Y_0 = i] for any event A.
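Survival probabilities of this form are cheap to compute by iterating the recursion from the boundary condition. The environment chain and the per-period survival probabilities q(i) below are hypothetical:

```python
import numpy as np

# Hypothetical 3-state environment, ordered from "good" to "bad".
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])      # environment transition matrix
q = np.array([0.99, 0.95, 0.80])     # survival probability q(i) per period

def survival(i, n):
    """P_i[L > n] via the recursion P_i[L > n] = q(i) * sum_j P(i,j) * P_j[L > n-1]."""
    v = np.ones(len(q))              # boundary condition: P_j[L > 0] = 1
    for _ in range(n):
        v = q * (P @ v)              # one backward step of the recursion
    return v[i]

print(survival(0, 10))               # 10-period survival from the best state
```

Because the chain here is increasing toward the "bad" absorbing state and q(i) is decreasing in i, this example also matches the hypotheses of Theorem 4.1 below.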
Life distribution classifications play an important role in many problems
on reliability and maintenance. Recall that stochastic processes are often
classified with respect to the life distribution classification of their first passage times. In particular, supposing that the state space of Y is ordered as E = {0, 1, 2, ...}, the Markov chain Y is said to be an IFRA (DFRA) process if the first passage time T_j = inf{n ≥ 0 : Y_n ≥ j} has a discrete IFRA (DFRA) distribution on {Y_0 = i} for any i < j.
Theorem 4.1. Suppose that q(i) is decreasing (increasing) in i ∈ E; if Y is an increasing IFRA (DFRA) process, then L has an IFRA (DFRA) distribution on {Y_0 = i} for any i ∈ E.
Theorem 4.1 states that if the environmental process increases such that
the states get "worse" in the IFRA sense with decreasing survival probabil-
ities, then the lifetime L has an IFRA distribution. One of the implications
here is that the probability of failure in the next period increases in time.
The opposite conditions yield the DFRA case.
If the device consists of m components connected in series so that component k survives a period in environment i with probability q_k(i) and fails with probability 1 − q_k(i), then the characterization provided by (4.1)-(4.3) holds for the lifetime L_k of component k. The conditional joint distribution is

(4.6)

and the recursion still holds true with P_i[L > 0] = 1. Comparison of (4.6) with (4.3) reveals the obvious fact that the series system can be regarded as a single component that has survival probability q(i) = Π_{k=1}^m q_k(i) in environment i. So, the life distribution classification provided in Theorem 4.1 can easily be extended to the complex case with many stochastically dependent components.
where A_t = (A_t^1, A_t^2, ..., A_t^m) is the intrinsic age of the system at time t, which consists of the intrinsic ages of the m components, Y_t = (Y_t^1, Y_t^2, ..., Y_t^d) is the environmental process with state space E that reflects the states of various environmental factors, and f is the intrinsic hazard rate function. For example, Y_t^1 can be the calendar time t, Y_t^2 could be the temperature at time t, Y_t^3 could be the pressure at time t, etc. Moreover, f is of the form f(i, x) = (f_1(i, x), f_2(i, x), ..., f_m(i, x)) where f_k(i, x) is the intrinsic aging rate of component k in environment i if the intrinsic ages of the components are given by the vector x = (x_1, x_2, ..., x_m).
In our exposition, we will further specialize this basic model by adopting the notation and terminology of Özekici (1995), who analyzed the optimal maintenance problem of a single-component device operating in a random environment. In particular, we suppose that the state space E is discrete and f_k(i, x) = f_k(i, x_k), so that the intrinsic aging rate of any component k depends only on the environment and the intrinsic age of that component, independent of the ages of all other components. This implies that both stochastic dependence among the components and intrinsic aging of each component depend only on environmental factors that the system as a whole is subjected to. Furthermore, we will relate the intrinsic failure rate function f_k(i, x) to the failure rate function r_i^k(t) of component k while it operates in environment i. We will now present the details of the specific construction of the intrinsic aging process.
Let L_k denote the lifetime of the k'th component while L represents the lifetime of the system. Suppose, for now, that the environment remains fixed at some state i ∈ E so that Y_t = i for all t ≥ 0. In any environment i ∈ E, the life distribution of component k is given by the cumulative distribution function

F_i^k(t) = P[L_k ≤ t | Y = i]   (4.8)

with failure rate function r_i^k(t) and hazard function R_i^k(t) = ∫_0^t r_i^k(s) ds, so that the survival probability function F̄_i^k = 1 − F_i^k can be written as
P[L_1 > u_1, L_2 > u_2, ..., L_m > u_m | Y = i] = exp(− Σ_{k=1}^m R_i^k(u_k))   (4.10)
(4.12)
Therefore, in the fixed environment i ∈ E, it follows that if the intrinsic age is measured by the hazard function, then component k has an exponentially distributed intrinsic lifetime with parameter 1. Moreover, its intrinsic clock ticks at the rate r_i^k(t) at time t. If the real time is t, then the intrinsic clock shows time R_i^k(t). Similarly, when the intrinsic time is x, the corresponding real time is given by the inverse function

(4.13)

In other words, it takes n_i^k(x) units of real time operation to age a brand new component to intrinsic age x in environment i.
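As a sketch of this clock-change idea, take a hypothetical Weibull hazard for a fixed environment; the hazard function R and its inverse n then convert real time into intrinsic age and back:

```python
# Hypothetical Weibull hazard in a fixed environment i:
# r(t) = (beta/eta) * (t/eta)**(beta - 1), so R(t) = (t/eta)**beta.
beta, eta = 2.0, 100.0

def R(t):
    """Hazard function: the intrinsic age shown by the clock at real time t."""
    return (t / eta) ** beta

def n(x):
    """Inverse of R: real operating time needed to reach intrinsic age x."""
    return eta * x ** (1.0 / beta)

t = 50.0
x = R(t)              # intrinsic age after 50 units of real time -> 0.25
print(x, n(x))        # n(R(t)) recovers t = 50.0
```

On the intrinsic scale the lifetime is exponential with parameter 1, so the survival probability at real time t is exp(−R(t)), as in (4.10) with m = 1.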
Let a_i^k = lim_{t→+∞} R_i^k(t) denote the maximum intrinsic age that component k can reach while operating in environment i ∈ E, and let t_i^k = inf{t ≥ 0; R_i^k(t) = a_i^k} denote the time when this maximum age is reached. In most environments a_i^k = t_i^k = +∞, but it is also possible that a_i^k, t_i^k < +∞. In particular, if 0 ∈ E represents an environmental state during which the component is kept idle, then r_0^k = R_0^k = a_0^k = t_0^k = 0. Moreover, if a_i^k < +∞, then r_i^k(s) = 0 and R_i^k(s) = a_i^k for all s ≥ t_i^k. This is equivalent to saying that once the component reaches the intrinsic age a_i^k, it does not fail or age any more in environment i. As a matter of fact, if a_i^k < +∞, then the life distribution is defective with P[L_k = +∞ | Y = i] = e^{−a_i^k} > 0 and the device may function forever without failing. This may also correspond to the case where, upon reaching the critical age a_i^k, the component is used no more in environment i. Note that the definition (4.13) implies that n_i^k(x) = +∞ whenever x ≥ a_i^k. Throughout the remainder of this article the intrinsic age, intrinsic time, and intrinsic lifetime will be referred to as simply the age, time, and lifetime unless stated otherwise.
(4.15)

for any n ≥ 0, s ≤ T_{n+1} − T_n, and initial age A_0^k ≥ 0. The model therefore supposes that if the component has already reached the critical maximum age a_{X_n}^k by time T_n, it is either kept idle or it does not fail or age any more throughout the n'th environment X_n. An equivalent definition is provided by the derivative

(4.18)
References
Arjas, E.: The Failure and Hazard Process in Multivariate Reliability Systems.
Mathematics of Operations Research 6, 551-562 (1981)
Bellman, R., Glicksberg, I., and Gross, 0.: On the Optimal Inventory Equation.
Management Science 2, 83-104 (1955)
Bertsekas, D.P.: Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs: Prentice-Hall 1987
Çınlar, E.: Markov Additive Processes: I. Z. Wahrscheinlichkeitstheorie verw. Geb. 24, 85-93 (1972a)
Çınlar, E.: Markov Additive Processes: II. Z. Wahrscheinlichkeitstheorie verw. Geb. 24, 95-121 (1972b)
Çınlar, E.: Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-Hall 1975
Çınlar, E.: Shock and Wear Models and Markov Additive Processes. In: Shimi, I.N., Tsokos, C.P. (eds.): The Theory and Applications of Reliability 1. New York: Academic Press 1977, pp. 193-214
Çınlar, E., Özekici, S.: Reliability of Complex Devices in Random Environments. Probability in the Engineering and Informational Sciences 1, 97-115 (1987)
Çınlar, E., Shaked, M., Shanthikumar, J.G.: On Lifetimes Influenced by a Common Environment. Stochastic Processes and Their Applications 33, 347-359 (1989)
Eisen, M., Tainiter, M.: Stochastic Variations in Queuing Processes. Operations
Research 11, 922-927 (1963)
Ezhov, I.I., Skorohod, A.V.: Markov Processes with Homogeneous Second Component: I. Teor. Verojatn. Primen. 14, 1-13 (1969)
Feldman, R.: A Continuous Review (s, S) Inventory System in a Random Environment. Journal of Applied Probability 15, 654-659 (1978)
Gaver, D. P.: Random Hazard in Reliability Problems. Technometrics 5, 211-226
(1963)
Gupta, D.: The (Q, r) Inventory System with an Unreliable Supplier. Technical
Report. School of Business, McMaster University (1993)
Gürler, Ü., Parlar, M.: An Inventory Problem with Two Randomly Available Suppliers. Technical Report. School of Business, McMaster University (1995)
Iglehart, D.L.: Dynamic Programming and Stationary Analysis of Inventory Prob-
lems. In: Scarf, H.E., Gilford, D.M., Shelly, M.W. (eds.): Multistage Inventory
Models and Techniques. Stanford: Stanford University Press 1963
Iglehart, D.L., Karlin, S.: Optimal Policy for Dynamic Inventory Process with Non-
stationary Stochastic Demands. In: Arrow, K.J., Karlin, S., Scarf, H. (eds.):
Studies in Applied Probability and Management Science. Stanford: Stanford
University Press 1962 pp. 127-147
Kalymon, B.: Stochastic Prices in a Single Item Inventory Purchasing Model. Op-
erations Research 19, 1434-1458 (1971)
Lefevre, C., Malice, M.P.: On a System of Components with Joint Lifetimes Dis-
tributed as a Mixture of Exponential Laws. Journal of Applied Probability 26,
202-208 (1989)
Lefevre, C., Milhaud, X.: On the Association of the Lifelengths of Components Subjected to a Stochastic Environment. Advances in Applied Probability 22, 961-964 (1990)
Lindley, D.V., Singpurwalla, N.D.: Multivariate Distributions for the Lifelengths of
Components of a System Sharing a Common Environment. Journal of Applied
Probability 23, 418-431 (1986)
Nahmias, S.: Production and Operations Analysis. 2nd edition. Homewood: Irwin
1993
Neuts, M.F.: The M/M/1 Queue with Randomly Varying Arrival and Service Rates. Opsearch 15, 139-157 (1978a)
Neuts, M.F.: Further Results on the M/M/1 Queue with Randomly Varying Rates. Opsearch 15, 158-168 (1978b)
Neuts, M.F.: Matrix-Geometric Solutions in Stochastic Models. Baltimore: Johns Hopkins University Press 1981
Neveu, J.: Une Généralisation des Processus à Accroissements Positifs Indépendants. Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg 25, 36-61 (1961)
Özekici, S.: Optimal Control of Storage Models with Markov Additive Inputs. Ph.D. Dissertation, Northwestern University (1979)
Özekici, S.: Optimal Maintenance Policies in Random Environments. European Journal of Operational Research 82, 283-294 (1995)
Özekici, S.: Optimal Replacement of Complex Devices. In this volume (1996a), pp. 158-169
Özekici, S.: Markov Modulated Bernoulli Process. Technical Report. Department of Industrial Engineering, Boğaziçi University (1996b)
Özekici, S., Parlar, M.: Periodic-Review Inventory Models in Random Environments. Technical Report. School of Business, McMaster University (1995)
Özekici, S., Sevilir, M.: Maintenance of a Device with Environment Dependent Survival Probabilities. Technical Report. Department of Industrial Engineering, Boğaziçi University (1996)
Parlar, M.: Continuous-Review Inventory Problem Where Supply Interruptions Follow a Semi-Markov Process. Technical Report. School of Business (1993)
Parlar, M., Berkin, D.: Future Supply Uncertainty in EOQ Models. Naval Research
Logistics 38, 107-121 (1991)
Prabhu, N.U., Zhu, Y.: Markov-Modulated Queueing Systems. Queueing Systems
5, 215-246 (1989)
Purdue, P.: The M/M/1 Queue in a Markovian Environment. Operations Research 22, 562-569 (1974)
Sethi, S.P., Cheng, F.: Optimality of (s, S) Policies in Inventory Models with Marko-
vian Demand Processes. Technical Report. Faculty of Management, University
of Toronto (1993)
Shaked, M., Shanthikumar, J. G.: Some Replacement Policies in a Random Envi-
ronment. Probability in the Engineering and Informational Sciences 3,117-134
(1989)
Silver, E.A.: Operations Research in Inventory Management: A Review and Cri-
tique. Operations Research 29, 628-645 (1981)
Singpurwalla, N.D., Youngren, M.A.: Multivariate Distributions Induced by Dy-
namic Environments. Scandinavian Journal of Statistics 20, 251-261 (1993)
Song, J.S., Zipkin, P.: Inventory Control in a Fluctuating Demand Environment. Operations Research 41, 351-370 (1993)
Zheng, Y.S.: A Simple Proof of Optimality of (s, S) Policies in Infinite-Horizon
Inventory Systems. Journal of Applied Probability 28, 802-810 (1991)
Optimal Replacement of Complex Devices
Süleyman Özekici
1. Introduction
Optimization problems involving complex systems are quite challenging due to the multidimensionality created by the large number of components or
subsystems that make up the whole system. These problems are further com-
plicated by the fact that, in many cases, the components or subsystems are
stochastically and economically dependent. We suppose that dependence is
induced by a randomly changing environment that all components or subsys-
tems operate in.
The formulation of the optimization problem, the characterization of optimal policies, and the solution procedure are undoubtedly more complicated.
It is well-known that the structure of optimal policies may be quite complex
in multicomponent systems even when there are no environmental fluctua-
tions. However, it is surprising that, under fairly reasonable conditions, the
environmental process does not increase the complexity of the structure of
optimal policies or the solution procedures. We will demonstrate this con-
jecture on the optimal component replacement problem and show that the
random environment does not actually create optimal policies which are far
more complex than those obtained in the standard single-environment models. This chapter builds on the intrinsic aging model described in Section 4 of Özekici (1996) in this volume; our notation and terminology will follow those introduced there.
Preventive replacement is perhaps the most widely used maintenance policy to prevent the device from failing during operation and thereby incurring excessive failure costs. The fixed environment case with several cost structures and objectives is discussed extensively in the literature by many authors and, in
most cases, an age replacement policy is optimal if the life distribution is IFR.
Özekici (1985) provides an example along this direction, and a discussion of the optimality conditions for control-limit policies can be found in So (1992).
Throughout the remainder of this chapter, we make a similar assumption by requiring that the failure rate functions {r_i^k(·); i ∈ E} are all increasing. Note that this assumption implies that a_i^k = R_i^k(+∞) = +∞ for all i ∈ E except for idle environmental states with a_i^k = 0 and r_i^k = 0 identically. Therefore, H_i^k(x, t) = R_i^k(n_i^k(x) + t) for all x ∈ R_+, t ≥ 0 and i ∈ E except for the idleness case where H_i^k(x, t) = x.
3. Multicomponent Replacement Model
We now consider a series system with m components that operates under the
randomly changing environmental process Y. All components age intrinsically
as described in Section 4 of Özekici (1996) in this volume. Recall that for any component k, r_i^k(·) is the failure rate function and R_i^k(·) is the hazard function in environment i (i.e., R_i^k(t) = ∫_0^t r_i^k(s) ds). Similarly, H_i^k(x_k, t) = R_i^k(n_i^k(x_k) + t) is the intrinsic age of component k at time t in environment i if the initial age is x_k, and τ_i^k(x_k, u) = n_i^k(x_k + u) − n_i^k(x_k) is the amount of real-time operation required in environment i to age the k'th component intrinsically u time units given that its initial age is x_k.
The lifetime of component k is denoted by L_k and L = min_k L_k is the lifetime of the system. The age process of the system is A = (A^1, A^2, ..., A^m) where A_t^k is the intrinsic age of component k at time t. The construction of A^k is described in detail by equations (4.15) and (4.16) in Özekici (1996) in this volume. It is clear that the process A takes values in the state space S = [0, +∞]^m where +∞ denotes a failed component. For any x = (x_1, x_2, ..., x_m) ∈ S, x_k ∈ [0, +∞] is the intrinsic age of component k. We also define H_i(x, t) = (H_i^1(x_1, t), H_i^2(x_2, t), ..., H_i^m(x_m, t)) ∈ S to be the intrinsic age of the system at time t in environment i if the initial age is x ∈ S.
f = R * g

where R = Σ_{n=0}^∞ Q^n is the Markov renewal kernel corresponding to Q.
At the beginning of each environment, the age of the system is observed and a decision is made whether or not to replace each component k. This is represented by the binary variable s^k, which is 1 only if component k is replaced. Therefore, the decisions on all the components are represented by s = (s^1, s^2, ..., s^m) ∈ J = {0, 1}^m. If the system is observed to be in state x at the beginning of environment i, then a state-dependent cost c_i(x) is incurred. For any decision s, the cost of replacement is p_i(s); this gives the cost of replacing the components {1 ≤ k ≤ m; s^k = 1}. Finally, the downtime cost is d_i(t) if the system is down for t units of time in environment i.
Assumption 3.1. For the optimal replacement problem, the following conditions hold for all i ∈ E and components k:
a. r_i^k(t) is increasing in t,
b. c_i(x) is increasing in x,
c. d_i(t) is increasing in t,
d. p_i(s) is increasing in s,
e. r, s ∈ J with rs = 0 ⟹ p_i(r + s) ≤ p_i(r) + p_i(s).
These assumptions are quite reasonable and they do not impose unnecessary restrictions on our problem. The first one requires that all life distributions are IFR in all of the environments, an assumption that is often made in optimal replacement problems. The second one simply states that as the system gets "older" it costs more. In particular, if there are only failure costs involved so that the failure cost of component k in environment i is c_i^k, then it suffices to take

c_i(x) = Σ_{k=1}^m c_i^k 1{x_k = +∞}.
The third assumption states that the downtime cost increases as the system
stays down for a longer duration of time. According to the fourth one, the
replacement cost increases as more components are replaced. Finally, the
last assumption reflects the economies of scale involved in replacing many
components at the same time. This is an important assumption which makes
the components economically dependent as well. For example, the cost of
replacing components, say, 1 and 2 at the same time is less than or equal to the cost of replacing them separately at different times. This fact is often true due
to possible set-up costs involved in making replacements. If the preventive
replacement cost is p_i^k for component k in environment i and there is a fixed replacement cost K_i, then the replacement cost function

p_i(s) = K_i δ(Σ_{k=1}^m s^k) + Σ_{k=1}^m p_i^k s^k   (3.3)

satisfies this assumption.
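The subadditivity required in part (e) of Assumption 3.1 can be checked mechanically for a fixed-cost-plus-linear replacement cost of this kind; the fixed cost K and the component costs below are hypothetical:

```python
from itertools import product

K = 5.0                  # hypothetical fixed replacement cost K_i
p = [2.0, 3.0, 4.0]      # hypothetical component replacement costs p_i^k

def cost(s):
    """p_i(s): fixed cost if any component is replaced, plus per-component costs."""
    return K * any(s) + sum(pk * sk for pk, sk in zip(p, s))

# Check: r, s in J with r*s = 0 implies p_i(r + s) <= p_i(r) + p_i(s).
for r in product([0, 1], repeat=3):
    for s_ in product([0, 1], repeat=3):
        if all(rk * sk == 0 for rk, sk in zip(r, s_)):
            joint = tuple(rk + sk for rk, sk in zip(r, s_))
            assert cost(joint) <= cost(r) + cost(s_)
print("subadditivity holds for all disjoint decision pairs")
```

The inequality is strict exactly when both r and s replace at least one component, since the fixed cost K is then paid twice on the right-hand side; this is the economy of scale behind opportunistic replacement.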
(3.5)

for i ∈ E, x ∈ [0, +∞)^m.
even though component 1 is at the same age as point A. This is called "opportunistic replacement" since component 1 is replaced by making use of the opportunity of replacing component 2. Since an "old" component 2 is being replaced optimally, it may be best to replace component 1 at the same time. Opportunistic replacement is due to the fact that n_1 ≤ N_1 and n_2 ≤ N_2. This could be further clarified by considering the case where the components are stochastically independent while component 1 has an exponentially distributed lifetime and component 2 has an
and Jorgenson (1963) showed that the optimal policy has the simple form
given in Figure 3.2 for a fixed environment model. Note that component 1 is
replaced only at failure (F) since it has an exponential lifetime. Component
2 is optimally replaced at the critical age N_2, but if component 1 has failed and must be replaced, then component 2 can be replaced opportunistically as early as age n_2 ≤ N_2.
Fig. 3.2. Optimal policy when one component has exponential lifetime
Note that the (n, N) policy provides substantial simplification when compared with the optimal policy in Figure 3.1. The fact that n_1 ≤ N_1 and
References
1. Introduction
Here we present the general structure of the framework in a continuous-time setting; extensions follow later. Consider a component (for ease of terminology we use this term; it may also be a part of a system) which deteriorates in time and which can be returned to the as-good-as-new condition by a preventive maintenance activity. The main question the framework focuses on is when to execute the activity, and the answer will be based on cost considerations. We primarily consider long-term average costs as the objective criterion, as that best reflects what one should do in the long term. The central notion in the framework is the so-called marginal expected cost of deferring the execution of the activity for an infinitesimally small interval. We first consider the case in which the activity can be carried out at any moment against the same cost c_p. In this case it is natural to speak of the marginal deterioration cost rate, denoted by m(·), which is assumed to be a continuous and piecewise differentiable function of the time t since the previous execution of the activity. We will now show that these assumptions are sufficient to determine an average optimal maintenance interval.
Let M(t) := ∫₀ᵗ m(x)dx, i.e. the total expected costs due to deterioration
over an interval of t time units, when the component was new at the start. It
easily follows from renewal theory that the average costs g(t) per time unit
when executing the activity every t time units amount to

g(t) = (c^p + M(t)) / t.    (2.1)

Differentiation yields

g'(t) = (Ψ(t) − c^p) / t²,  where Ψ(t) := t·m(t) − M(t).    (2.2)
Equation (2.2) is the key for the analysis of g(t). Notice that Ψ(t) is increas-
ing (decreasing) if m(t) is increasing (decreasing). The following theorem
summarizes the relations between the behaviour of m(t) and the existence of
an average cost minimum. Part (iv) is a generalization of results for exist-
ing models (see, e.g., Barlow and Proschan 1965 for the block replacement
model); the other parts are simple new results.
Theorem 2.1.
(i) if m(t) is decreasing or constant on [t₀, t₁] and m(t₀) < g(t₀), then g(t)
is also decreasing on [t₀, t₁],
(ii) if m₀(t) = m₁(t) + c, for some c and all t > 0, then g₀(t) and g₁(t) have
the same extremes,
(iii) if m₀(t) is nonincreasing on (0, t₀) and increases thereafter, then g₀(t)
has the same minima as g₁(t), where m₁(t) = m₀(t₀) for t < t₀ and m₁(t) =
m₀(t) else, and c₁^p = c₀^p + ∫₀^{t₀} (m₀(t) − m₀(t₀)) dt,
(iv) if m(t) increases strictly for t > t₀, where m(t₀) < g(t₀), and if either
(a) lim_{t→∞} m(t) = ∞, or
(b) lim_{t→∞} m(t) = c and lim_{t→∞} [ct − M(t)] > c^p, for some c > 0,
then g(t) has a minimum, say g* in t*, which is unique on [t₀, ∞); moreover,

m(t) − g(t)  < 0  for t₀ < t < t*,
             = 0  for t = t*,          (2.3)
             > 0  for t > t*,

and

m(t) − g*  < 0  for t₀ < t < t*,
           = 0  for t = t*,            (2.4)
           > 0  for t > t*,

(v) if m(t) increases strictly for t > t₀, where m(t₀) < g(t₀), lim_{t→∞} m(t) =
c and lim_{t→∞} [ct − M(t)] < c^p, for some c > 0, then g(t) is decreasing for
t > t₀,
(vi) if m(t) is convex on [t₀, T], where m(t₀) < g(t₀), and ((T − t₀)/2)[m(T) −
m(t₀)] > c^p + ∫₀^{t₀} (m(t) − m(t₀)) dt, then g(t) has a minimum, say g* in t*,
which is unique on (t₀, T), and (2.3) and (2.4) hold on [t₀, T]. If t₀ = 0,
then it is sufficient that (T/2)[m(T) − m(0)] > c^p.
Proof. (i) Notice that m(t₀) < g(t₀) implies that Ψ(t₀) < c^p. If m(t) is
decreasing or constant, then Ψ(t) is also decreasing or constant and the
result is immediate.
(ii) If m₀(t) = m₁(t) + c, then Ψ₀(t) = Ψ₁(t) and the result is immediate.
(iii) According to (i), neither g₀(·) nor g₁(·) has a minimum before t₀. Notice
next that for t > t₀ we have Ψ₁(t) = c₁^p if and only if Ψ₀(t) = c₀^p, from
which the assertion follows.
174 Rommert Dekker
(iv) Notice that Ψ(t) − Ψ(t₀) = ∫_{t₀}^{t} (m(t) − m(x)) dx + t₀[m(t) − m(t₀)] >
(t₁ − t₀)[m(t) − m(t₁)], t > t₀, for any t₁ ∈ (t₀, t). Hence, Ψ(t) increases
strictly to infinity if m(t) does so. Since Ψ(t₀) < c^p by (i), Ψ(t) passes
the level c^p only once for t > t₀, which guarantees the uniqueness. If
lim_{t→∞} [ct − M(t)] = d for some d, then it easily follows that lim_{t→∞} g(t) = c.
Moreover, for t large enough, say t > t_ε, we have M(t) < ct − d + ε and
g(t) < c + [c^p − d + ε]/t for any ε > 0. Hence if c^p − d < 0, then g(t)
approaches c from below, implying that it must have a finite minimum.
The uniqueness of the minimum follows from the fact that (2.2) implies
that m(t) intersects g(t) in minima from below and in maxima from above.
As m(t) is strictly increasing on [t₀, ∞), there can be no maxima in that
region.
(v) Notice that in this case g(t) approaches its limit c from above. If g(t)
had a minimum, it would also have to have a maximum. Since in each
extreme Ψ(t) = c^p, there is a contradiction: Ψ(t) is increasing because
m(t) increases, and it can therefore cross c^p only once for t > t₀.
Accordingly, g(t) is decreasing for t > t₀.
(vi) If m(t) is convex on [t₀, T], then M(T) − M(t₀) < (T − t₀)[m(T) +
m(t₀)]/2. Inserting this in Ψ(T) and using assertion (iii) shows after some
algebra that Ψ(T) > c^p, from which the results follow in the same way as
before. □
Remark. A decreasing m(t) may be due to burn-in (or initial) failures. Part
(iii) of this theorem shows that we only need to estimate their contribution to
the total costs, and that we can leave the burn-in failures out of the modelling
of m(t), provided that a compensation is made for them in c^p. In this way we
can take care of the bathtub curve.
Relation (2.4) can be interpreted in the following way (Berg 1980 was the
first to introduce it). Consider at time t the two options: (a) maintain
now or (b) defer maintenance for a time dt. For option (b) the expected costs
over [t, t + dt] amount to m(t)dt + c^p. For option (a) there are direct costs
c^p, and the renewal is dt time units earlier than under option (b). To compensate
for this time difference we associate costs g*dt with the interval, which gives
total expected costs of c^p + g*dt for option (a). Subtraction then yields
that maintaining is cost-effective if m(t) − g* > 0. The myopic stopping rule
"maintain if m(t) − g* ≥ 0" is therefore average optimal. Although a simple
enumeration to locate the average-cost minimum usually suffices in practice,
one can speed up calculations by using relations (2.3) and (2.4) and applying
a bisection procedure.
Special cases:
(i) if m(t) = at^{β−1}, a > 0, then Ψ(t) = a(1 − 1/β)t^β, which increases if
β > 1. In that case t* = [βc^p / (a(β − 1))]^{1/β}.
(ii) if m(t) = at + b, a, b > 0, then Ψ(t) = ½at², and t* = √(2c^p/a).
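The bisection procedure mentioned above can be sketched as follows: since Ψ(t) = t·m(t) − M(t) is increasing wherever m(t) is, one can bisect on Ψ(t) = c^p and check the result against the closed form of special case (ii). The cost-rate parameters below are assumed example data, not taken from the text.

```python
# A minimal bisection sketch for locating t* from Psi(t) = t*m(t) - M(t) = c_p,
# which is equivalent to m(t) - g(t) = 0 at the optimum.

def t_star_bisection(m, M, c_p, lo=1e-9, hi=1.0, tol=1e-10):
    """Bisect on Psi(t) = t*m(t) - M(t), assumed increasing past the root."""
    psi = lambda t: t * m(t) - M(t)
    while psi(hi) < c_p:          # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid) < c_p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Special case (ii): m(t) = a*t + b gives Psi(t) = a*t**2/2, so t* = (2*c_p/a)**0.5.
a, b, c_p = 2.0, 1.0, 9.0
m = lambda t: a * t + b
M = lambda t: a * t * t / 2.0 + b * t
print(round(t_star_bisection(m, M, c_p), 6))   # -> 3.0, matching (2*9/2)**0.5
```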
Equation (2.2) also allows us to do some sensitivity analysis. We have
A Framework for Single-Parameter Maintenance Activities 175
Theorem 2.2.
(i) if m2(t) = Aml (t) with A > 1 and cf = ~, then t; < ti,
(ii) if m2(t) - ml (t) increases in t and cf = ~, then t; < ti,
(iii) if ~ > cf and m2(t) = ml (t), then t; > ti.
Proof Notice that W;(t) -W~ (t) = t[m;(t) - m~ (t)]. For case (ii) we now have
that W2(t)-Wl(t) increases in t and that W2(t) reaches the level cf earlier than
Wl (t) from which the assertion follows. In case (i), W2(t) -WI (t) is increasing
if WI (t) increases and the same argument holds. Assertion (iii) is also a direct
consequence of (2.2). D
2.4 Extensions
discretised: i.e. m(t) indicates the expected deterioration costs until the next
time moment.
v^λ(t) = [c^p + ∫₀ᵗ m(y)e^{−λy} dy] / [1 − e^{−λt}],    (2.5)
which leads to a similar analysis as for the average costs (see also Section 2.5).
see Dekker and Smeitink (1991). In a similar, but far more complicated way,
they derive inequalities (2.3) and (2.4) with m(t) replaced by η(t) = ∫₀^∞ m(t +
z) dP(Y ≤ z), the expected deterioration costs until the next opportunity.
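Under the additional assumption of exponentially distributed times between opportunities with rate lam (an assumption made here purely for illustration), η(t) = λ∫₀^∞ m(t+z)e^{−λz}dz, which can be approximated numerically:

```python
# Sketch: expected deterioration cost until the next opportunity, assuming
# exponential inter-opportunity times.  Parameter values are illustrative.
import math

def eta(m, t, lam, n=100000, zmax=None):
    """Approximate Integral_0^inf m(t+z) * lam * exp(-lam*z) dz (midpoint rule)."""
    if zmax is None:
        zmax = 40.0 / lam          # captures essentially all exponential mass
    h = zmax / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * h
        total += m(t + z) * lam * math.exp(-lam * z) * h
    return total

# For linear m(t) = a*t one gets eta(t) = a*t + a/lam exactly, a handy check.
a, lam = 2.0, 0.5
approx = eta(lambda x: a * x, t=3.0, lam=lam)
print(round(approx, 4))   # close to 2*3 + 2/0.5 = 10.0
```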
where L(t) = ∫₀ᵗ (1 − F(x)) dx indicates the expected cycle length. It is easily
shown that g'(t) = (m(t) − g(t))(1 − F(t))/L(t). Let Φ(t) be the analogue
of Ψ(t), i.e. Φ(t) = m(t)L(t) − ∫₀ᵗ m(x)(1 − F(x)) dx. Hence g'(t) = 0 if and
only if m(t) − g(t) = 0, i.e. Φ(t) = c^p. Notice further that Φ'(t) = m'(t)L(t).
We are now in a position to formulate a theorem similar to Theorem 2.1,
whose proof is analogous.
Theorem 2.3.
(i) if m(t) is nonincreasing on [t₀, t₁] and m(t₀) < g(t₀), then g(t) has no
minimum on [t₀, t₁],
(ii) if m₀(t) = m₁(t) + c, for some c and all t > 0, then g₀(t) and g₁(t) have
the same extremes,
(iii) if m₀(t) is nonincreasing on (0, t₀) and increases thereafter, then g₀(t)
has the same minima as g₁(t), where m₁(t) = m₀(t₀) for t < t₀ and
m₁(t) = m₀(t) else, and c₁^p = c₀^p + ∫₀^{t₀} (m₀(x) − m₀(t₀))(1 − F(x)) dx,
(iv) if m(t) increases strictly for t > t₀, where m(t₀) < g(t₀), and either
(a) lim_{t→∞} m(t) = ∞, or
(b) lim_{t→∞} m(t) = c, where c > lim_{t→∞} g(t), for some c > 0,
then g(t) has a minimum, say g* in t*, which is unique on [t₀, ∞); more-
over,

m(t) − g(t)  < 0  for t₀ < t < t*,
             = 0  for t = t*,          (2.8)
             > 0  for t > t*,

and

m(t) − g*  < 0  for t₀ < t < t*,
           = 0  for t = t*,            (2.9)
           > 0  for t > t*.
In case of the age replacement model the marginal deterioration cost rate
m(x) amounts to (c^f − c^p)r(x), where r(x) denotes the hazard rate, r(x) =
f(x)/(1 − F(x)). In that case the numerator of (2.6) equals c^p + (c^f − c^p)F(t).
Notice that the discounted cost case (see Section 2.4) can be regarded
as a special case of the extended framework by considering discounting as a
truncation of the system lifetime. Hence the cdf of the time to system renewal,
F(x), should be defined as F(x) = 1 − e^{−λx}, where λ is the continuous
discount rate. In this case the expected lifetime equals 1/λ and hence the
total discounted costs per unit of time equal λv^λ(t), where v^λ(t) is given by
(2.5).
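For illustration, the age replacement special case can be evaluated numerically: with m(x) = (c^f − c^p)r(x) the average costs become g(t) = (c^p + (c^f − c^p)F(t))/L(t), which a crude grid search can minimise. The Weibull lifetime and all cost figures below are assumed example data.

```python
# Age replacement seen through the framework: minimise
# g(t) = (c_p + (c_f - c_p)*F(t)) / L(t),  L(t) = Integral_0^t (1-F(x))dx.
import math

c_p, c_f, beta, lam = 1.0, 5.0, 2.0, 1.0        # assumed example data

F = lambda t: 1.0 - math.exp(-((t / lam) ** beta))

def L(t, n=2000):
    """Midpoint-rule approximation of the expected cycle length."""
    h = t / n
    return sum((1.0 - F((i + 0.5) * h)) * h for i in range(n))

def g(t):
    return (c_p + (c_f - c_p) * F(t)) / L(t)

# Crude grid search for the cost-minimising preventive replacement age.
ts = [0.05 * k for k in range(1, 200)]
t_best = min(ts, key=g)
print(round(t_best, 2), round(g(t_best), 3))
```

At the minimum the marginal cost rate m(t) = (c^f − c^p)r(t) roughly equals g(t), in line with relation (2.3).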
The main problem in using the age replacement extension for planning and
combining is that we can no longer predict in advance whether we will replace
at some time t, as that depends on the possible occurrence of failures in
between. A correct analysis requires conditioning on all possible events
between the moment of planning and the expected moment of execution.
This directly leads to intractable models in the case of multiple
components. A heuristic way out is to do a conditional planning, assuming
that no failures occur in the planning horizon and taking the actual ages into
account. This is a reasonable approach, since numerical experiments show that
in cases where preventive maintenance is really cost-effective, F(t*) is quite
small (up to 20%). Implementing this approach on a rolling horizon basis (i.e.
adapting the planning in the course of time with the occurrence of events) takes
care of failures. This idea was pursued in Dekker et al. (1993) in a discrete
time case.
Aven and Bergman (1986) argue that the objective function in many main-
tenance optimisation models can be written as:
deviating x time units from the optimum t* for a short-term shift, long-term
shift and permanent shift respectively. It is easy to see that

h_S(x) = M(t* + x) + M(t* − x) − 2M(t*),    (3.1)

h_L(x) = ∫_{t*}^{t*+x} (m(y) − g*) dy,    (3.2)

h_P(x) = g(t* + x) − g(t*) = h_L(x)/(t* + x),    (3.3)
where g* denotes the minimum long-term average costs. These penalty func-
tions can not only be used to assess the cost-effectiveness of any special sales
offer, but also for priority setting and to assist in combining activities or
integrating maintenance planning with production planning. Notice that the
penalty functions have the following properties: they are always nonnegative
and they are zero for x = 0. Furthermore, h_S(·) is symmetric around zero.
These penalty functions indicate the expected cost of deviating from the
optimum interval. It may happen, however, that the present state already
deviates from the optimum, and that one does not need to take into account
the costs of arriving in the present state, but is interested in the extra costs
of deviating even further. More specifically, suppose one is at t time units,
t > t*, since the last execution of the activity. The expected costs of deferring
the activity (in this case there is no other option) for another x time units
amount to (we only consider the long-term shift)

h_L(x) = ∫_t^{t+x} (m(y) − g*) (1 − F(y))/(1 − F(t)) dy,  x > 0.    (3.5)
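The three penalty functions (3.1)-(3.3) can be sketched for the linear cost rate of special case (ii), where t* = √(2c^p/a) and the integrals evaluate in closed form; all parameter values below are assumed for illustration.

```python
# Penalty functions (3.1)-(3.3) for the linear case m(t) = a*t + b.
import math

a, b, c_p = 2.0, 1.0, 9.0
m = lambda t: a * t + b
M = lambda t: a * t * t / 2.0 + b * t
t_star = math.sqrt(2.0 * c_p / a)                 # = 3.0 here
g_star = (c_p + M(t_star)) / t_star               # minimum average costs g*

def h_S(x):                                       # short-term shift, eq. (3.1)
    return M(t_star + x) + M(t_star - x) - 2.0 * M(t_star)

def h_L(x):                                       # long-term shift, eq. (3.2);
    return M(t_star + x) - M(t_star) - g_star * x # closed form of the integral

def h_P(x):                                       # permanent shift, eq. (3.3)
    return h_L(x) / (t_star + x)

x = 1.0
print(round(h_S(x), 4), round(h_L(x), 4), round(h_P(x), 4))   # prints 2.0 1.0 0.25
```

One can check directly that h_P(1) = g(t*+1) − g(t*): here g(4) = (9+20)/4 = 7.25 and g(3) = 7, so the difference is indeed 0.25.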
Dekker et al. (1996). It is called static since fixed combinations are made.
Other long-term approaches apply a variable combining, based on the state
of the other components, such as the (n, N) policies.
Consider n maintenance activities aᵢ, i = 1, …, n, which, if carried out
alone, cost cᵢ^p, i = 1, …, n. All activities share the same set-up work. Hence
if k activities are carried out together, the cost savings from joint execution
amount to (k − 1)c^s, where c^s is the cost of the set-up work. Suppose next
that the set-up work is done every t time units and that activity i is carried
out every kᵢ-th time, i.e. with an interval of kᵢt, where kᵢ is an integer decision
variable. The total long-term average costs g(t, k₁, …, kₙ) now amount to
(4.1)
In this section we will consider short-term combining and show that the
penalty functions derived in Section 3 allow a cost-effectiveness evalua-
tion of combinations and assist in the timing of the execution. The main idea
is to apply a decomposition approach; that is, we first determine for each
activity its preferred execution moment and derive its penalty function. Next
we consider groups of activities, for which the preferred moment of execution
follows from a minimisation of the sum of the penalty functions involved. If
this sum is less than the set-up savings of a joint execution, combining
is cost-effective. Corrective maintenance work can also be involved in the
combination, provided that it is known at the outset of planning. In case it is
deferrable, a penalty function for deferring should be determined. Determin-
ing the optimal groups can be formulated as a set-partitioning problem (see
Dekker et al. 1992). Wildeman et al. (1992) show that under certain condi-
tions the optimal grouping consists of groups with consecutive initial planning
moments, which allows the formulation of an O(n²) dynamic programming
algorithm (n being the number of activities).
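The consecutive-group dynamic program can be sketched as follows. Purely for illustration we use a hypothetical quadratic penalty wᵢ(u − tᵢ)² around each preferred moment tᵢ (the actual penalties would be the h functions of Section 3), and every activity joined to a group saves one set-up c^s:

```python
# Sketch of the O(n^2) consecutive-group DP (cf. Wildeman et al. 1992),
# with illustrative quadratic penalties; not the authors' exact formulation.

def group_cost(ts, ws, i, j, c_s):
    """Penalty of executing activities i..j jointly, minus set-up savings."""
    u = sum(w * t for w, t in zip(ws[i:j+1], ts[i:j+1])) / sum(ws[i:j+1])
    penalty = sum(w * (u - t) ** 2 for w, t in zip(ws[i:j+1], ts[i:j+1]))
    return penalty - (j - i) * c_s          # (group size - 1) set-ups saved

def best_grouping(ts, ws, c_s):
    n = len(ts)
    f = [0.0] * (n + 1)                     # f[j] = min cost of activities 0..j-1
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        f[j], back[j] = min(
            (f[i] + group_cost(ts, ws, i, j - 1, c_s), i) for i in range(j))
    groups, j = [], n                       # recover the optimal partition
    while j > 0:
        groups.append(list(range(back[j], j)))
        j = back[j]
    return f[n], groups[::-1]

cost, groups = best_grouping([10.0, 12.0, 50.0, 53.0], [1.0, 1.0, 1.0, 1.0], 15.0)
print(groups)    # nearby activities are grouped: [[0, 1], [2, 3]]
```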
Example. Table 5.1 provides data on 8 maintenance activities, each of which
replaces a unit. Deterioration costs of unit i are primarily due to small failures,
upon which a minimal repair is done. These occur independently of the state
of other units and the cost rate amounts to mᵢ(xᵢ) = cᵢ^f (βᵢ/λᵢ)(xᵢ/λᵢ)^{βᵢ−1},
where xᵢ denotes unit i's age. Special case (i) (see Section 2.2) gives a formula
for the individually optimal replacement age, which we denote by xᵢ*. Finally,
let tᵢ be the resulting initial planning moment (counted from the start of the
planning horizon).
The resulting penalty functions are shown in Figure 5.1 (the numbers
refer to the activities).
We consider combining under short-term shifts, in which case the penalty
costs are given by equation (3.1). The planning horizon is [0,220]. As in
the previous section we assume that combining execution of any k activities
saves k - 1 times the set-up work (for any k), which is estimated at 15
cost units (about 10% of the preventive maintenance costs of an activity).
Using the algorithm of Wildeman et al. (1992) yields as optimal groupings:
{1,2,3} executed at day 12.6, {4,5} at day 97.9 and {6,7,8} at day 192.9.
The savings (set-up cost reduction minus penalty costs) for the combinations
amount to 29.4, 14.4 and 28.2 respectively. Total savings amount to 72.0,
which constitutes 6% of total preventive maintenance costs.
Dekker et al. (1993) give an analysis of the performance of this combina-
tion method for a more complex case where components are replaced using a
discrete time age replacement. They apply a conditional planning (assuming
no failures in the planning horizon) on a rolling horizon basis (implement the
decision for the current epoch, observe the new state at the next epoch and
make a new planning). They use the discrete version of the penalty func-
tions (3.5). They consider combining both for a finite and infinite horizon
and compare their planning method with an optimal solution obtained by
solving a large scale Markov decision chain numerically (which was tractable
up to four identical components only). It appears that for high set-up costs
and many components the cost allocation in the component decomposition
has to be changed because components are almost always replaced together.
When that has been done the loss of their strategy compared to the optimal
one is less than 1%.
Fig. 5.1. Penalty functions (costs versus time over the planning horizon [0, 220]) of the eight activities; the curve numbers refer to the activities.
6. Priority Setting
Maintenance is usually classified into corrective and preventive work. The
first originates from a directly foreseeable, or already observed, malfunction-
ing of systems, the second from a preconceived plan to keep systems in
good condition in the long run. Often the first type of work is the most
urgent. The maintenance capacity needed to take care of it may fluctuate
severely in time, due to the random character of failures. Hence preventive
work is often delayed in favor of corrective work. Accordingly, there is
usually a large backlog of preventive work, with the implication that an indi-
vidual preventive maintenance activity is either delayed for an unknown time
or never carried out at all. Most maintenance organisations have problems in
managing the backlog. It will be clear that the results of maintenance optimi-
sation decrease in value if the maintenance organisation is not able to do the
work on time, which is especially a problem for the many small maintenance
activities. Priority criterion functions, embedded in management information
systems, can be of help.
Here we propose the use of the penalty functions h_L(x) (or h_S(x) if ap-
propriate, see Section 3) as priority functions, where the long-term objective
is the average costs. Although they are formulated for a continuous time
setting, where at each moment a decision can be taken, they can easily be
Many problems are so complex that either the optimal strategy is unlikely
to have a simple structure, or the computational effort to determine it may
be prohibitive. In those cases one has to resort to approximate solutions. The
framework does allow the derivation of meaningful and often well-performing
heuristics. Here we will show how they can be derived. First we state the
underlying philosophy.
We fix an action (e.g. either to replace or not) and focus on the tim-
ing aspect by considering at each moment whether deferring the action is
cost-effective. The replacement criterion is based on a comparison between
the local deterioration costs and the minimum average costs, i.e. equation
(2.2). Local deterioration costs are usually easy to determine, contrary to
the minimum average costs. For the latter one basically needs to enumerate
all possible deterioration and action possibilities. If there are many options,
computational problems arise. Hence the heuristic criteria approximate the
minimum average costs by either restricting the number of options, or by
comparing with a simpler model. Concluding, the replacement criteria read:
"replace if m(t, I(t)) − g ≥ 0", where I(t) stands for all relevant information
available at time t and g for the minimum average costs in a suitable simpler,
but consistent, model.
Example. Dekker and Roelvink (1995) present such a heuristic for the fol-
lowing problem. Consider a maintenance package consisting of n activities,
each addressing one component within a unit. Upon failure of a component,
only the corresponding activity is executed, with the result that only that
component is renewed; the conditions of the other components remain the
same (upon a failure during operations only the respective activity is carried
out, whilst there is no time left to do the other activities). On a preventive
basis always the full package is executed, since that is only done when the
system is not needed. Hence the problem is when to execute the full package.
A simple strategy is to execute it at fixed time intervals (block replacement),
yet under this policy relatively new components may be replaced preven-
tively. On the other hand, it is relatively simple to calculate the minimum
average costs under this policy (it involves minimising a one-dimensional
function only); let us denote these costs by g_B*. Suppose now that at time
t the ages of all components are available, denoted by z₁, …, zₙ, and that
we consider the problem in a continuous time frame. Then the local deteriora-
tion costs amount to Σⱼ cⱼ^f rⱼ(zⱼ), where rⱼ(·) and cⱼ^f stand for component j's
hazard rate and failure costs respectively. Accordingly, we have the following
replacement criterion: "replace if Σⱼ cⱼ^f rⱼ(zⱼ) − g_B* ≥ 0". The results obtained
by Dekker and Roelvink (1995) indicate that the difference in average costs
between this policy and the average optimal policy (which has been com-
puted for a 2-component case) is less than 1%, whereas the improvement
over block replacement varies between 0% and 10% (of total average costs).
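A criterion of this style can be sketched in a few lines. The Weibull hazards and the value of g_B* below are assumed for illustration; in practice g_B* would come from optimising the block policy.

```python
# Sketch of a group-replacement criterion in the style of Dekker and Roelvink
# (1995): execute the full package when the local deterioration cost rate of
# the installed components exceeds the block-policy minimum average costs.

def local_cost_rate(ages, c_f, beta, lam):
    """Sum_j c_f[j] * r_j(z_j) with Weibull hazards r(z) = (b/l)*(z/l)**(b-1)."""
    return sum(c * (b / l) * (z / l) ** (b - 1.0)
               for z, c, b, l in zip(ages, c_f, beta, lam))

def replace_package(ages, c_f, beta, lam, g_block):
    return local_cost_rate(ages, c_f, beta, lam) - g_block >= 0.0

ages = [2.0, 3.5]          # current component ages z_1, z_2 (assumed)
c_f  = [5.0, 4.0]          # failure costs (assumed)
beta = [2.0, 2.0]          # Weibull shape parameters (assumed)
lam  = [4.0, 5.0]          # Weibull scale parameters (assumed)
print(replace_package(ages, c_f, beta, lam, g_block=2.0))
```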
8. Conclusions
In this paper we presented a framework for optimisation models which allows
integration with priority setting, planning and combination of activities. Fur-
ther research is required to investigate whether more models can be incor-
porated into the framework, and whether other models can be converted to
allow combining and planning as done in this paper.
References
Aven, T., Bergman, B.: Optimal Replacement Times - A General Set-up. J. Appl.
Prob. 23, 432-442 (1986)
Baker, R.D., Christer, A.H.: Review of Delay-Time OR Modelling of Engineering
Aspects of Maintenance. Eur. Journ. Oper. Res. 73, 407-422 (1994)
Barlow, R.E., Proschan, F.: Mathematical Theory of Reliability. New York: John
Wiley 1965
Berg, M., Epstein, B.: A Note on a Modified Block Replacement Strategy with
Increasing Running Costs. Nav. Res. Log. Quat. 26, 157-159 (1979)
Berg, M.: A Marginal Cost Analysis for Preventive Replacement Policies. Eur.
Journ. Oper. Res. 4, 136-142 (1980)
Berg, M., Cleroux, R.: A Marginal Cost Analysis for an Age Replacement Policy
for Units with Minimal Repair. Infor. 20, 258-263 (1982)
Berg, M.: The Marginal Cost Analysis and Its Application to Repair and Replace-
ment Policies. Eur. Journ. Oper. Res. 82, 214-240 (1995)
Cho, D.I., Parlar, M.: A Survey of Maintenance Models for Multi-Unit Systems.
Eur. Journ. Oper. Res. 51, 1-23 (1991)
Dekker, R.: Applications of Maintenance Optimisation Models: A Review and
Analysis. Report Econometric Institute 9228/A, Erasmus University Rotter-
dam (1992)
Dekker, R.: Integrating Optimisation, Priority Setting, Planning and Combining of
Maintenance Activities. Eur. Journ. Oper. Res. 82, 225-240 (1995)
Dekker, R., Dijkstra, M.C.: Opportunity-Based Age Replacement: Exponentially
Distributed Times Between Opportunities. Naval Res. Log. 39, 175-190 (1992)
Dekker, R., Roelvink, I.F.K.: Marginal Cost Criteria for Group Replacement. Eur.
Journ. Oper. Res. 84, 467-480 (1995)
Dekker, R., Smeitink, E.: Opportunity-Based Block Replacement: The Single Com-
ponent Case. Eur. Journ. Oper. Res. 53, 46-63 (1991)
Dekker, R., Smeitink, E.: Preventive Maintenance at Opportunities of Restricted
Duration. Naval. Res. Log. 41, 335-353 (1994)
Dekker, R., Smit, A.C.J.M., Loosekoot, J.E.: Combining Maintenance Activities in
an Operational Planning Phase. IMA Journ. of Math. Appl. in Buss. Ind. 3,
315-332 (1992)
Dekker, R., Wildeman, R.E., Van Egmond, R.: Joint Replacement in an Opera-
tional Planning Phase. Report Econometric Institute 9438/A (revised version),
Erasmus University Rotterdam (1993)
Dekker, R., Frenk, J.B.G., Wildeman, R. E.: How to Determine Maintenance Fre-
quencies for Multi-component Systems? A General Approach. In this volume
(1996), pp. 239-280
Kamath, A.R.R., Al-Zuhairi, A.M., Keller, A.Z., Selman, A.C.: A Study of Ambu-
lance Reliability in a Metropolitan Borough. ReI. Eng. 9, 133-152 (1984)
McCall, J. J.: Maintenance Policies for Stochastically Failing Equipment: A Survey.
Mgmt. Sci. 11, 493-524 (1965)
Noortwijk, J.M. van, Dekker, R., Cooke, R.M., Mazzuchi, T.A.: Expert Judgment
in Maintenance Optimisation. IEEE Trans. on Reliab. 41, 427-432 (1992)
Pierskalla, W.P., Voelker, J.A.: A Survey of Maintenance Models: The Control and
Surveillance of Deteriorating Systems. Nav. Res. Log. Quat. 23, 353-388 (1979)
Pintelon, L.: Performance Reporting and Decision Tools for Maintenance Manage-
ment. Ph.D. Dissertation, University of Leuven (1990)
Sherif, Y.S., Smith, M.L.: Optimal Maintenance Models for Systems Subject To
Failure - A Review. Nav. Res. Log. Quat. 28, 47-74 (1981)
Valdez-Flores, C., Feldman, R.M.: A Survey of Preventive Maintenance Models
for Stochastically Deteriorating Single Unit Systems. Nav. Res. Log. Quat. 36,
419-446 (1989)
Wildeman, R.E., Dekker, R., Smit, A.C.J.M.: Combining Activities in an Opera-
tional Planning Phase: A Dynamic Programming Approach. Report Economet-
ric Institute 9424/A (revised version), Erasmus University Rotterdam (1992)
Economics Oriented Maintenance Analysis and
the Marginal Cost Approach
Menachem P. Berg
Department of Statistics, University of Haifa, Mount Carmel, Haifa 31905, Israel
1. Introduction
The basic maintenance policy in the age replacement family is the age re-
placement policy (ARP), under which an item is replaced at failure or oth-
erwise preventively when it reaches a certain critical age. We first note that
a policy of this type is mainly appropriate (but not exclusively so,
because of the tempting simplicity of its implementation) when the failure
modelling of the item is of the age-based type - that is, a 'black-box' statis-
tical approach that relates failures to age, or usage, through statistical data
gathering and inference procedures. (Other types of failure modelling, which
attempt to go deeper into what causes a failure, may have the disadvantage
of being prohibitively expensive in terms of the data required to infer on
the multitude of additional parameters, as well as of being non-robust with
regard to the usually simplified set of assumptions employed.)
The issue of age-based failure modelling and its ramifications is consid-
ered in Berg (1995b), where it is argued that the pivotal quantity for
analysis should be the hazard (or failure rate) function r(·) rather than the
life distribution F(·). As demonstrated there, this change of starting point
can have a concrete impact on the statistical procedure, despite the equiva-
lence of the probability information in the two functions, which determine
each other through the mathematical relationship

1 − F(x) = exp(−∫₀ˣ r(u) du).    (2.1)
In that regard we shall also note in the sequel that the hazard function, the
cornerstone of age-based failure modelling, indeed integrates well into the
analysis of the ARP: the age-dependent maintenance policy.
Once the ARP has been chosen as the maintenance policy employed, the
only undecided question is what critical age T for preventive replacement is
to be used. Given the hazard function r(·) which, as stated above, contains
all the probability information we use, and the costs c₁ and c₂ of failure and
preventive replacements, respectively, the optimal T is the value which
minimizes our cost objective function. The most commonly used such objec-
tive function is the long-term expected cost per unit of time; expressing
it in terms of T, using the renewal-reward theorem (e.g., Ross 1983), we find
(Barlow and Proschan 1965)

C(T) = [c₁F(T) + c₂(1 − F(T))] / ∫₀ᵀ (1 − F(x)) dx.    (2.2)

The optimality equation for the critical age then reads

r(T) ∫₀ᵀ (1 − F(x)) dx + 1 − F(T) = c₁/(c₁ − c₂),    (2.3)

and the resulting minimum cost rate is

C* = C(T*) = (c₁ − c₂) r(T*).    (2.4)
Some conclusions can be drawn and observations made from examining
the optimality equation (2.3). Firstly, we see that T* depends on the replace-
ment costs only through their ratio since, trivially, c₁/(c₁ − c₂) = (1 − c₂/c₁)⁻¹.
That is clear, as we can set the money unit as we desire. Then, it can easily
be verified that if r(·) is increasing or, using the common terminology, "F(·) is
IFR (increasing failure rate)", the left-hand side of (2.3) is increasing in T,
which ensures that T* must be unique. It can, however, still be ∞, in case
there is no finite solution to (2.3), which essentially says that an ARP is only
superior to a sheer failure replacement policy if enough is gained by preven-
tive maintenance in terms of replacement cost savings to offset the wastage
in replacing operative items.
So far so good, but one may still have certain queries about these
results, like: why is the IFR property so focal in ensuring a unique optimum?
Is there a simple, intuitively clear condition for T* to be finite? And what is
behind the resulting concise (and "elegant") expression for C* in (2.4)?
It is clear that the probabilistic approach does not touch upon such issues,
because it is only a mathematical tool, while these issues are con-
cerned with the true underlying nature of matters, namely the economics of
maintenance. We shall now adopt an approach that makes economic consid-
erations the starting point of the analysis; in particular, all the queries
above are consequently removed.
192 Menachem P. Berg
V₂(x, Δ) - the expected costs (failure or preventive replacement costs in-
cluded) in (x, x + Δ] if the preventive replacement is deferred to age x + Δ
(for an infinitesimal Δ).
Then, the marginal cost of a preventive replacement at age x is defined as

η(x) = lim_{Δ↓0} [V₂(x, Δ) − V₁(x)] / Δ,    (2.5)
and the resulting function of age on the left-hand side of (2.5) is the MCF. The
rationale is clear: the decision when to make a preventive replacement can be
decomposed into a sequence of decisions, at each age x, of whether to carry out
a preventive replacement now or wait another (infinitesimally short)
time period. We note the implicit conditioning embedded in the definition of
the MCF: for the preventive replacement at age x to be at all relevant,
that age must first be survived.
For the ARP we have, on basic principles,

V₁(x) = c₂  and
V₂(x, Δ) = r(x)Δc₁ + (1 − r(x)Δ)c₂  (ignoring o(Δ) terms).    (2.6)

Substituting (2.6) into (2.5) we immediately obtain the MCF associated with
the ARP,

η(x) = (c₁ − c₂) r(x).    (2.7)
The MCF can also be utilized to obtain the cost objective function C(T).
For that, however, we also need an appropriately selected underlying renewal
process (see Berg 1995a for elaboration) whose renewal-interval c.d.f. is de-
noted by G(·). For the ARP a convenient choice is that of the service-life of
an item, so that replacement moments of any kind constitute renewal epochs
for that purpose. Thus, by the definition of the ARP, we have

G(x) = F(x) for x < T,  G(x) = 1 for x ≥ T.    (2.8)

Now, by the very definition of the MCF, with its abovementioned conditional
nature (for more details see Berg 1995a), the expected cost during the service-
life of an item is

D(T) = c₂ + ∫₀ᵀ η(x)(1 − G(x)) dx.    (2.9)
Economics Oriented Maintenance Analysis 193
Note that (2.9), once (2.7) and (2.8) are substituted into it, coincides with the
numerator of (2.2). C(T) is then obtained by dividing D(T) by the expected
service-life

U(T) = ∫₀^∞ (1 − G(x)) dx = ∫₀ᵀ (1 − F(x)) dx,    (2.10)

by (2.8).
Comparing the derivation of D(T) in the "classical" approach and the one
here we note that the main difference is that there we have an overall renewal-
interval calculation whereas here the procedure is broken into two steps. First
we have a micro-type calculation where we obtain, in a rather straightforward
manner and usually on mere basic principles, the MCF. This function is then
used, in a given formula, to yield D(T). While in mathematically simple
situations like the (basic) ARP the superiority of the approach here is not
obvious, as both methods provide an easy calculation, it is in the model
generalizations and policy extensions considered later that the simplification
of the setting and facilitation of the mathematics become apparent.
We now proceed to use the MCF for optimization. Invoking a basic
principle from mathematical economics, we have that the optimal critical age
T* is a solution of the equation

η(T) = C(T),    (2.11)

which, as can be checked, is equivalent to (2.3).
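The optimality equation (2.11) is easy to solve numerically: η(T) = (c₁ − c₂)r(T) is cheap to evaluate, and C(T) follows from (2.2). The sketch below bisects on η(T) − C(T) for a Weibull lifetime; all cost and distribution parameters are assumed example data.

```python
# Solving eta(T) = C(T) for a Weibull lifetime (illustrative parameters).
import math

c1, c2, b, l = 10.0, 2.0, 2.0, 1.0

F   = lambda t: 1.0 - math.exp(-((t / l) ** b))
r   = lambda t: (b / l) * (t / l) ** (b - 1.0)   # Weibull hazard rate
eta = lambda t: (c1 - c2) * r(t)                 # the MCF, eq. (2.7)

def C(t, n=4000):
    """Long-term cost rate C(t) of eq. (2.2), with a midpoint-rule integral."""
    h = t / n
    U = sum((1.0 - F((i + 0.5) * h)) * h for i in range(n))
    return (c2 + (c1 - c2) * F(t)) / U

def T_star(lo=1e-3, hi=5.0, tol=1e-6):
    """Bisection on eta(T) - C(T); for IFR lifetimes the crossing is unique."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eta(mid) < C(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

T = T_star()
print(round(T, 3), round(C(T), 3))   # at the optimum C(T*) = eta(T*)
```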
Apart from delivering the required expressions and the simple setting of
the optimality equation, the marginal cost approach also clarifies the above-
stated queries.
First, invoking another economics-based principle, we have that a sufficient
condition for the existence of a unique T* is that the MCF is increasing, and
then T* is finite if and only if

η(∞) > C(∞).    (2.12)

(The more general version of this last result, which covers non-monotonic
MCFs, is that η(T) and C(T) intersect at all the extrema of C(T), and
only there, so that η(·) crosses from below at the minima and from above at
the maxima (Berg 1980).) The above economics-based principle also clarifies
the role of the IFR property here, since by the functional form of (2.7), if r(·)
increases so does η(·).
The above clearly demonstrates the use of marginal cost analysis as a
comprehensive tool for the study of the ARP. But then, not less importantly,
it is the insightfulness and smoothness of the procedures that bring about
the much-valued virtue of the approach: it enables clear and straightforward
model generalizations and policy extensions, of much usefulness for real-life
maintenance planning, which are otherwise demanding and cumbersome in
the mathematics (as is clearly revealed by comparing with other works that
have considered some of these generalizations - see specific references later).
194 Menachem P. Berg
the case δ → ∞ corresponds to the classical MRP, with the repair cost figure
there replaced by the expected repair cost here (conditional on a repair
decision),

c_3 = ∫_0^δ u dL(u) / L(δ)   (5.1)

c_2, and

c_3 r(x)Δ + c_2   (5.2)

(5.5)
Several comments are in order here. Firstly, we observe how easily the
marginal cost approach accommodates this non-trivial policy extension
(compare with the far more laborious mathematics in Cleroux et al. 1979,
where the other approach is employed). This is once more due to the use here
of an approach designed to deal with economics-oriented modelling, as is
essentially the case here.
We already noted that the RCL policy is a compromise between the "ex-
treme" cases of the ARP and the MRP but with (5.5) we have now obtained
an exact mathematical relationship among them.
To find D(T), through the application of formula (2.9), we still need the
distribution of the service-life, G(·). Probably the simplest way to find this
distribution here is to resort to the hazard function of G(·) (and recall here
our earlier argument about the pivotal role of hazard functions in "black-box"
(age-based) failure modelling contexts).
On basic principles we have

(5.6)

as the service life of an item will be terminated in the next interval of
length Δ, given that it survived age x, if a failure occurs there and the
repair cost incurred exceeds δ. Consequently, by the basic relationship in
(2.1) between life distributions and hazard functions, we immediately obtain

(5.7)

by (5.6).
Combining η(·) and G(·), from (5.5) and (5.7), respectively, we can proceed
with our standard procedure and use formula (2.9) to obtain D(T). Equating
C(T) to η(T) yields the optimality equation for T* in this case.

This is a further extension of the basic policy where we still adopt the RCL
rule, only it is now made age-dependent, i.e., it is a function δ(x) of the
age at failure x (rather than a constant). Realistically, this could reflect,
through a decreasing function δ(·), a lessening readiness to invest in the
repair of an item as it gets older.
c_3(x) = ∫_0^{δ(x)} u dL(u) / L(δ(x))   (6.2)
For the implementation of the RCL policy we need to specify the RCL, and this
could be done either directly, on the basis of some relevant costing
considerations, or by making δ subject to optimization with respect to the
overall cost objective function (and thus treating the latter, for that
purpose, as a function of two variables: T and δ). This task, however,
becomes harder in the age-dependent case, and having an idea of the
functional form of δ(x) is very helpful. For instance, if we want to find an
optimal RCL, then instead of a mathematically hard functional optimization
procedure which searches for the optimal function δ(·), we only need to
optimize over the constant coefficients of a given function.
To approach this issue we take a closer look at the (economic) rationale of
using an age-dependent RCL. If we assume that the item's "revenue" is
age-invariant, then the main reason to be less ready to invest in an older
item is the increased maintenance costs (and if the item's revenues are
age-dependent, for instance if the output reduces with age, then we can use
the model with running costs to account for that). Thus, the difference
between the constant RCL δ and the age-dependent one is a function of the
anticipated repair costs. This argument can be made formal by specifying a
future planning horizon of some length d (to be determined, as usual in
economic planning, on the basis of some exogenous considerations: budgeting
or operational) and then setting

δ(x) = δ − Z_x(d)   (8.1)

where Z_x(d) is the expected repair cost for an item of age x in the next
time interval of length d (assuming no replacement there). Consequently,
by the NHPP,

Z_x(d) = c_3 ∫_x^{x+d} r(y) dy   (8.2)

(note that in the absence of aging, i.e., constant r(·), Z_x(d) is constant,
independent of x, and can thus be absorbed in δ so that, in this case, a
constant RCL is appropriate). When the repair cost distribution is
age-dependent this can be generalized, by utilizing the independent increment
property of the NHPP, to obtain

Z_x(d) = ∫_x^{x+d} c_3(y) r(y) dy   (8.3)
Example: Let

1 − F(x) = e^{−θx^γ}, x ≥ 0; θ, γ > 0,   (8.4)

i.e., a Weibull life distribution, and

c_3(x) = c_3 x^σ   (8.5)
Then r(x) = θγx^{γ−1} and from (8.1), combined with (8.3), we find

δ(x) = δ − (c_3θγ/(γ+σ)) [(x+d)^{γ+σ} − x^{γ+σ}]   (8.6)
We first note that even though the Weibull distribution is a commonly used
life distribution and c_3(x) is of a reasonable functional form, the resulting
functional form for the RCL in (8.6) is unlikely to be assessed directly.
Indeed, it is most likely, as a matter of routine practice, that δ(·) will be
given a "nice" functional form, as, for instance, the (arbitrarily chosen)
linear and exponential functional forms for δ(·) in Berg et al. (1986).

For the function in (8.6) we have that δ(x) is decreasing if and only if
γ + σ ≥ 1. Thus an IFR Weibull, i.e., γ ≥ 1, is sufficient for a decreasing
tendency to repair an item as it gets older, regardless of whether the
mean-repair-cost function c_3(x) is increasing or not. However, if γ < 1 then
c_3(x) has to increase fast enough; more precisely, we need σ > 1 − γ to
ensure such monotonicity behavior of δ(x).
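The monotonicity condition γ + σ ≥ 1 for (8.6) can be checked numerically; all parameter values in the sketch below are hypothetical.

```python
# Numeric check of the monotonicity of delta(x) in (8.6):
# delta(x) = delta - (c3*theta*gamma/(gamma+sigma)) *
#            ((x+d)**(gamma+sigma) - x**(gamma+sigma)).
# All parameter values are hypothetical.
def delta_x(x, delta=100.0, c3=1.0, theta=0.5, gam=1.5, sigma=0.3, d=2.0):
    p = gam + sigma
    return delta - (c3 * theta * gam / p) * ((x + d) ** p - x ** p)

xs = [0.5 * i for i in range(20)]

# gamma + sigma = 1.8 >= 1: delta(x) should be decreasing in x
vals = [delta_x(x) for x in xs]
decreasing = all(a > b for a, b in zip(vals, vals[1:]))

# gamma + sigma = 0.7 < 1: the bracket grows sublinearly, so delta(x) increases
vals2 = [delta_x(x, gam=0.5, sigma=0.2) for x in xs]
increasing = all(a < b for a, b in zip(vals2, vals2[1:]))
```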
W_α(T): the expected long-term total discounted costs of an ARP with critical
age T when the discounting factor is α.

While this is an alteration it is also, in a sense, a generalization, since
as α → 0 we are back to the previous case, only that an appropriate
modification is needed because W_α(T) is a total cost whereas C(T) is an
average cost. Since C(T) is in fact a rate of costs, to achieve
commensuration we would need to transform W_α(T) into a rate, and following
Howard (1971, p. 853) we note that a long-term continuous cost rate of
αW_α(T) yields, when discounted, a total of W_α(T) (simply:
∫_0^∞ e^{−αt} αW_α(T) dt = W_α(T)).
The derivation of Wa(T) here is a simple exercise through the renewal
type equation:
which concurs with the general relevant theoretical relationship (e.g., Ross
1970, p. 163).
The marginal cost function is obtained here using the straightforward
derivations

V_1(x) = c_2
V_2(x, Δ) = r(x)Δc_1 + (1 − r(x)Δ)c_2 e^{−αΔ}

so that, by (2.5),

η_α(x) = (c_1 − c_2) r(x) − c_2 α   (9.3)

As expected, we have, comparing with (2.7),

η_α(x) = η(x) − c_2 α   (9.4)

and

lim_{α→0} T*(α) = T*   (9.5)
The rest of the marginal cost procedure continues as before, with η_α(T) and
αW_α(T) replacing η(T) and C(T), respectively, so that if η_α(T) is
increasing, a unique T* exists, and it is then finite if η_α(∞) > αW_α(∞).
Also, in general, η_α(T) intersects αW_α(T) at the extrema of the latter (and
only there): from below at the minima and from above at the maxima.
The application of the marginal cost approach to the block replacement policy
(BRP) follows similar lines. The main change is that it is now the length of
the (fixed) preventive replacement intervals, still denoted by T, which is the
argument of the MCF and that now preventive replacements constitute the
selected underlying renewal process (Berg 1995a). Thus, here
1 − G(x) = 1 if x < T, and 0 if x ≥ T   (11.1)

and

U(T) = T   (11.2)
Once the MCF η(T) is computed, it is combined, via formula (2.9), with the
distribution in (11.1) to generate D(T). Then division by U(T), in (11.2),
yields C(T). The optimality equation for T* is

C(T) = η(T).
The marginal cost analysis of the (basic) BRP is considered in detail in
Berg (1980), and the extensions to the RCL policy in Berg and Cleroux
(1982b), for constant δ, and in Berg and Cleroux (1991), for an age-dependent
δ(·). The case of the expected total discounted costs as the objective function is
also considered in Berg (1980) while the extension to opportunistic policies
has been studied in Dekker and Smeitink (1991).
12. Conclusion
The main message here is that in the study of maintenance policies, within
costing frameworks, the approach should be economics-based with probabil-
ity tools being subservient to that. This shift of orientation, compared to
commonly used ones in the literature of mathematical maintenance theory,
turns out to be not merely a conceptual change but one with a clear effect
on both the analysis and optimization of maintenance policies.
The specific mathematical economics concept used is that of the marginal
cost of (preventive) maintenance, at a given age or time, as far as the main
maintenance policies of age replacement and block replacement are concerned,
which gives rise to the marginal cost function. This last function is
utilized both for the derivation of the cost objective function of the
long-term expected costs and for its optimization, including the
investigation of properties of the optimal solution. The method proves
mathematically smooth
and effective for different model generalizations and policy extensions that
make the maintenance planning more realistic.
References
Barlow, R., Proschan, F.: Mathematical Theory of Reliability. New York: Wiley
1965
Berg, M.: Optimal Replacement Policies for Two-Unit Machines with Increasing
Running Costs - I. Stochastic Processes and Their Applications 4, 89-106 (1976)
Berg, M.: Marginal Cost Analysis for Preventive Replacement Policies. European
Journal of Operational Research 4, 136-142 (1980)
Berg, M.: A Preventive Replacement Policy for Units Subject To Intermittent De-
mand. Operations Research 32, 584-595 (1984)
Berg, M.: The Marginal Cost Analysis and Its Application to Repair and Replace-
ment Policies. European Journal of Operational Research 82, 214-224 (1995a)
Berg, M.: Age-Dependent Failure Modelling: A Hazard-Function Approach. Cen-
tER Discussion Paper (No. 9569), Tilburg University (1995b)
Berg, M., Bienvenu, M., Cleroux, R.: Age Replacement Policy with Age-Dependent
Minimal Repair. Infor 24, 26-32 (1986)
Berg, M., Cleroux, R.: A Marginal Cost Analysis for an Age Replacement Policy
with Minimal Repair. Infor 20, 258-263 (1982a)
Berg, M., Cleroux, R.: The Block Replacement Problem with Minimal Repair and
Random Repair Costs. Journal of Statistical Computation and Simulation 15,
1-7 (1982b)
Berg, M., Cleroux, R.: Maintenance Policies with Jointly Optimal Repair and Re-
placement Actions. Proceedings of Relectronics 91. Eighth Symposium on Re-
liability and Electronics, Budapest (1991)
Berg, M., Epstein, B.: A Note on a Modified Block Replacement Policy for Units
with Increasing Marginal Running Costs. Naval Research Logistics Quarterly
26, 157-179 (1979)
Block, H.W., Borges, W.S., Savits, T.H.: A General Age Replacement Model with
Minimal Repair. Naval Research Logistics 35, 365-372 (1988)
Çınlar, E.: Introduction to Stochastic Processes. Englewood Cliffs: Prentice-Hall
1975
Cleroux, R., Dubuc, S., Tilquin, C.: The Age Replacement Problem with Minimal
Repair and Random Repair Costs. Operations Research 27, 1158-1167 (1979)
Cleroux, R., Hanscomb, M.: Age Replacement with Adjustment and Depreciation
Costs and Interest Charges. Technometrics 16, 235-239 (1974)
Dekker, R., Dijkstra, C.: Opportunity-Based Age Replacement: Exponentially Dis-
tributed Times Between Opportunities. Naval Research Logistics 39, 175-190
(1992)
Dekker, R., Smeitink, E.: Opportunity Based Block Replacement. European Journal
of Operational Research 53, 46-62 (1991)
Drinkwater, R., Hastings, N.: An Economic Replacement Model. Operational Re-
search Quarterly 18, 69-71 (1967)
Howard, R.: Dynamic Probabilistic Systems: Semi-Markov and Decision Processes.
Vol. II. New York: Wiley 1971
Ross, S.: Applied Probability Models with Optimization Applications. San
Francisco: Holden-Day 1970
Ross, S.: Stochastic Processes. New York: Wiley 1983
Availability Analysis of Monotone Systems
Terje Aven
Rogaland University Centre, Ullandhaug, 4004 Stavanger, Norway
1. Introduction
2. Model
Let φ(t) be a non-negative random variable representing the state or the
performance level of the system at time t, t ∈ J = (u, u + s]. The interval J
is the time interval over which we observe the system. We assume that φ(t)
can take one of M + 1 values

φ_0, φ_1, ..., φ_M   (φ_0 < φ_1 < ... < φ_M).

The M + 1 states represent successive levels of performance, ranging from the
perfect functioning level φ_M down to the complete failure level φ_0. In a
flow network we interpret φ(t) as the throughput rate of the system at time t.
An example of a flow network is given in Section 3. In the following we will
use the term "throughput rate" also in the general case. The system comprises
n components, numbered consecutively from 1 to n. Let X_i(t) be a random
variable representing the state of component i at time t, i = 1, 2, ..., n,
t ∈ J. We assume that X_i(t) can be in one of two states, which we denote
x_{i0} and x_{i1} (x_{i0} < x_{i1}). The states x_{i1} and x_{i0} represent
the functioning state and the non-functioning state, respectively. If a
component fails, it is
repaired and put into operation again. We assume that the time to failure
of component i has a distribution Fi(t) with mean MTTFi and the time to
repair has a distribution Hi(t) with mean MTTRi. We assume that all times
to failure and repair times are stochastically independent. Furthermore, we
assume that, for each component, the distribution of the sum of an operation
period and its subsequent repair is non-periodic, i.e. not a discrete
distribution with values in {τ, 2τ, 3τ, ...} for some τ, and that, with
probability one, two or more components cannot fail at the same time.
Let p_i(t) = P(X_i(t) = x_{i1}). Thus p_i(t) represents the availability of
component i at time t. The above assumptions guarantee that p_i(t) converges
to

p_i ≡ MTTF_i / (MTTF_i + MTTR_i)

when t → ∞.
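The limiting availability p_i = MTTF_i/(MTTF_i + MTTR_i) can be illustrated by simulating the alternating renewal process directly; the gamma distributions and parameter values below are hypothetical.

```python
import random

# Monte Carlo sketch of the limiting availability: one component alternates
# between gamma-distributed operation and repair periods (hypothetical
# parameters); the long-run fraction of up-time should approach
# p_i = MTTF_i / (MTTF_i + MTTR_i).
random.seed(1)
up_shape, up_scale = 2.0, 50.0       # MTTF = 2 * 50  = 100
dn_shape, dn_scale = 2.0, 2.5        # MTTR = 2 * 2.5 = 5

up = down = 0.0
for _ in range(100_000):             # many operation/repair cycles
    up += random.gammavariate(up_shape, up_scale)
    down += random.gammavariate(dn_shape, dn_scale)

sim_avail = up / (up + down)         # simulated long-run availability
exact = 100.0 / (100.0 + 5.0)        # MTTF / (MTTF + MTTR)
```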
We also assume that there exists a reference level at time t, D(t), which
expresses a desirable level of system performance at time t. The reference
level D(t) is a non-negative random variable, taking values in
D = {d_0, d_1, ..., d_w}. For a flow network system we interpret D(t) as the
demand rate at time t. In the following we will use the term "demand rate"
also in the general case. The system throughput rate is assumed to be a
function of the states of the components and the demand rate, i.e.

φ(t) = φ(X_1(t), ..., X_n(t), D(t))
Notation

q = (q_1, q_2, ..., q_n)
h(q) = P(φ(X, d) ≥ k), where q_i = P(X_i = x_{i1}) and the X_i are independent
(·_i, q) = (q_1, ..., q_{i−1}, ·, q_{i+1}, ..., q_n)
I(·) = indicator function, which equals 1 if the argument is true and 0
otherwise
3. Performance Measures
Below we list some relevant performance measures for the system. The measures
will be denoted I_{1a}, I_{1b}, ..., I_{2a}, I_{2b}, ..., etc. For some other
closely related measures, see Haukaas (1995) and Haukaas and Aven (1994).
1. I_{1a} = probability distribution of φ(t) given a demand rate D(t) = d
   I_{1b} = E[φ(t) | D(t) = d]
   I_{1c} = P(φ(t) ≥ D(t))
2. Let V denote the number of system failures in J, i.e. the number of times
   the throughput rate falls below a given level k.
   I_{2a} = probability distribution of V
   I_{2b} = EV
   I_{2c} = P(V = 0) = P(φ(t) ≥ k, t ∈ J)
   Some closely related measures are obtained by replacing k by D(t).
3. Let Y denote the lost throughput (volume) in J.
   I_{3a} = probability distribution of Y
   I_{3b} = EY
4. Let

   Z = (1/|J|) ∫_J I(φ(t) = D(t)) dt

   The random variable Z represents the portion of time the throughput rate
   equals the demand rate.
   I_{4a} = probability distribution of Z
   I_{4b} = EZ
   The measure I_{4b} is called "demand availability".

I_{1c} = P(φ(t) ≥ D(t)) = Σ_{i=0}^{w} P(φ(t) ≥ d_i | D(t) = d_i) P(D(t) = d_i)
Example

Figure 4.1 shows a simple example of a flow network model. Flow (gas/oil) is
transmitted from A to B. The system comprises four two-state components, with
x_{i0} = 0, i = 1, 2, 3, x_{40} = 1, x_{11} = x_{21} = 1, x_{31} = x_{41} = 2.
Hence components 1 and 2 are binary components, component 3 has possible
states 0 and 2, and component 4 has possible states 1 and 2. The states of
the components are interpreted as flow capacity rates for the components.
The demand is assumed to be equal to 2, and the state/level of the system is
defined as the maximum flow that can be transmitted from A to B. If, for
example, the component states are x_1 = 0, x_2 = 1, x_3 = 2 and x_4 = 2,
then the flow throughput equals 1, i.e. φ(0, 1, 2, 2) = 1. The time unit is
hours.

p_i = 0.96, i = 1, 2
p_3 = 0.99
p_4 = 0.98
P(φ(t) = 2) = P(X_1(t) = 1, X_2(t) = 1, X_3(t) = 2, X_4(t) = 2)
            = 0.96 × 0.96 × 0.99 × 0.98 = 0.894

P(φ(t) ≥ 1) = P({X_1(t) = 1 or X_2(t) = 1}, X_3(t) = 2)
            = P(X_1(t) = 1 or X_2(t) = 1) P(X_3(t) = 2)
            = {1 − P(X_1(t) = 0) P(X_2(t) = 0)} P(X_3(t) = 2)
            = 0.9984 × 0.99 = 0.988

Eφ(t)/2 = (0.094 × 1 + 0.894 × 2)/2 = 0.941
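The probabilities above can be reproduced by enumerating the component state vectors. Since Figure 4.1 is not reproduced here, the structure function φ = min(x_1 + x_2, x_3, x_4) used below is an assumption, chosen to be consistent with the computations in the text.

```python
from itertools import product

# Enumeration sketch reproducing the example's probabilities. The structure
# function phi = min(x1 + x2, x3, x4) is an assumed topology consistent with
# the computed values; the actual Figure 4.1 is not shown here.
states = {1: (0, 1), 2: (0, 1), 3: (0, 2), 4: (1, 2)}   # (low, high) states
avail = {1: 0.96, 2: 0.96, 3: 0.99, 4: 0.98}            # P(high state)

dist = {}                                               # distribution of phi(t)
for combo in product(*(states[i] for i in (1, 2, 3, 4))):
    prob = 1.0
    for i, x in zip((1, 2, 3, 4), combo):
        prob *= avail[i] if x == states[i][1] else 1.0 - avail[i]
    phi = min(combo[0] + combo[1], combo[2], combo[3])
    dist[phi] = dist.get(phi, 0.0) + prob

p2 = dist.get(2, 0.0)                                   # P(phi(t) = 2)
p_ge1 = sum(v for k, v in dist.items() if k >= 1)       # P(phi(t) >= 1)
demand_avail = sum(k * v for k, v in dist.items()) / 2  # E phi(t) / demand
```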
First we state some well-known asymptotic results for the (expected) number
of system failures below state k. To simplify the computations we assume
that the demand D(t) is a constant d.

Let N(t) denote the number of system failures (relative to level k) in
[0, t]. It can be shown, using results from the theory of counting processes
and renewal processes, see e.g. Aven (1992), that

lim_{t→∞} N(t)/t = λ_Φ ≡ Σ_{i=1}^{n} [h(1_i, p) − h(0_i, p)] / (MTTF_i + MTTR_i)   (4.1)
There does not exist any general formula for the distribution of the number
of system failures V = N(u + s) − N(u). Only in some special cases is it
possible to obtain practical computation formulae. For example, if all
components have exponentially distributed life times, it is possible to
derive a simple approximation formula. Let N_i(t) denote the number of
failures of component i in the time interval [0, t]. Then, if the repair
times are small compared to the life times and the life times are
exponentially distributed with parameter λ_i, it follows that the number of
failures of component i in the time interval (u, u + s], N_i(u + s) − N_i(u),
is approximately Poisson distributed with parameter λ_i s. If the system is a
series system, and we make the same assumptions as above, the number of
system failures in the interval (u, u + s] is approximately Poisson
distributed with parameter Σ_{i=1}^{n} λ_i s. The number of system failures
in [0, t], N(t), is approximately governed by a Poisson process with
intensity Σ_{i=1}^{n} λ_i.
If the components are highly available and have constant failure rates, the
Poisson distribution will produce good approximations also for more general
systems. The parameter, i.e. the mean, of the Poisson distribution is in
practice calculated by using the asymptotic system failure rate λ_Φ defined
by (4.1). Assuming exponential life times, X(t) is a regenerative process
with renewal cycles given by the times between consecutive visits to the
best state (x_{11}, x_{21}, ..., x_{n1}), and it can be shown that N(t/θ)
converges in distribution to a Poisson variable with mean t when
λ_i MTTR_i → 0, i = 1, 2, ..., n, where θ is a suitable normalising factor,
see Aven and Haukaas (1996a), Aven and Jensen (1996), Gertsbakh (1984),
Gnedenko and Solovyev (1975), Solovyev (1971) and Ushakov (1994). Suitable
normalising factors include p/E_0T, 1/ET_Φ and λ_Φ, where p equals the
probability that a system failure occurs in a renewal cycle, E_0T equals the
expected length of the renewal cycle given that a system failure does not
occur in the cycle, and ET_Φ equals the expected time to the first system
failure. The asymptotic exponentiality follows by applying a generalized
version of the following well-known result (cf. Gertsbakh 1984, Kalashnikov
1989, Keilson 1975 and Kovalenko 1994):
Let T_j, j = 1, 2, ..., be a sequence of non-negative i.i.d. random
variables with distribution function F(t), and let ν be a geometrically
distributed random variable with parameter p, independent of {T_j}, i.e.
P(ν = k) = p(1 − p)^{k−1}, k = 1, 2, .... Then, if ET_1 = a ∈ (0, ∞),

(p/a) Σ_{j=1}^{ν} T_j

converges in distribution to an exponential distribution with parameter 1
as p → 0.
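A small Monte Carlo experiment illustrates this exponential limit; the gamma summand distribution and all parameter values are arbitrary choices for the sketch.

```python
import math
import random

# Monte Carlo sketch of the exponential limit: with nu ~ geometric(p),
# independent of i.i.d. T_j ~ Gamma(shape, scale) with mean a = shape*scale,
# the normalised sum (p/a) * sum_{j=1}^{nu} T_j is approximately
# exponential(1) for small p. (The sum of nu i.i.d. Gamma(shape, scale)
# variables is Gamma(nu*shape, scale), which lets us draw it in one step.)
random.seed(2)
p, shape, scale = 0.01, 2.0, 3.0
a = shape * scale
runs = 20_000

samples = []
for _ in range(runs):
    # nu ~ geometric(p) by inversion: P(nu = k) = p(1-p)^(k-1)
    nu = 1 + int(math.log(1.0 - random.random()) / math.log(1.0 - p))
    total = random.gammavariate(nu * shape, scale)
    samples.append(p / a * total)

mean = sum(samples) / runs                      # should be close to 1
frac = sum(x <= 1.0 for x in samples) / runs    # should be close to 1 - e^{-1}
```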
Refer to Aven and Haukaas (1996a) and Haukaas (1995) for a study of the
accuracy of the Poisson approximation and a study of the performance of the
different normalising factors. See also Appendix B.
It is also possible to use Markov models to compute the measures of
category 2. Refer to Muppala et al. (1996) in this volume.
The measure I_{2c} is a special case of the measure I_{2a}, the distribution
of V. The measure I_{2c} can therefore be computed (approximated) using the above
approach. The reader should consult the recent work by Smith (1995) for some
interesting new results related to this measure. The measure I_{2c} has also
been studied by Natvig (1984, 1991). He has, however, focused on finding
bounds on I_{2c} under various assumptions. We will not look closer into this
problem here.
Example continued

Consider first the number of times the system state is below level 2. In this
case the system can be viewed as a series system. The number of system
failures is approximately governed by a Poisson process, with an intensity
which equals the sum of the failure rates:
1/480 + 1/480 + 1/990 + 1/490 ≈ 7 × 10^{−3}. The formula (4.2) gives
approximately the same intensity value.

Consider now the number of times the system state is below level 1. To
compute the expectation EV, we use formula (4.2):

EV/s ≈ (1 − p_2)p_3 λ_1 p_1 + (1 − p_1)p_3 λ_2 p_2
       + [1 − (1 − p_1)(1 − p_2)] λ_3 p_3 = 1.1 × 10^{−3}

Hence the average number of times the state of the system equals 0 is
approximately 10 (8760 × 1.1 × 10^{−3}) per year. The distribution of the
number of times the system state is below level 1 can be accurately
approximated by a Poisson distribution, see Haukaas (1995).
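The two intensity figures of the example can be reproduced directly; the cycle lengths MTTF_i + MTTR_i = 480, 480, 990 and 490 hours are the values implied by the text, and λ_i p_i = 1/(MTTF_i + MTTR_i).

```python
# Reproducing the example's intensity calculations. The cycle lengths
# MTTF_i + MTTR_i (480, 480, 990, 490 hours) are the values implied by the
# text; lambda_i * p_i = 1 / (MTTF_i + MTTR_i).
cycle = {1: 480.0, 2: 480.0, 3: 990.0, 4: 490.0}
p = {1: 0.96, 2: 0.96, 3: 0.99, 4: 0.98}

# below level 2 the system acts as a series system:
lam_series = sum(1.0 / c for c in cycle.values())        # ~ 7e-3 per hour

# below level 1, formula (4.2): a failure of component 1, 2 or 3 takes the
# system to state 0 only if it is critical at that moment
ev_rate = ((1 - p[2]) * p[3] / cycle[1]
           + (1 - p[1]) * p[3] / cycle[2]
           + (1 - (1 - p[1]) * (1 - p[2])) / cycle[3])   # ~ 1.1e-3 per hour
per_year = 8760 * ev_rate                                # ~ 10 per year
```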
From a practical point of view, it is not possible to find the exact
distribution of Y, the lost throughput (volume) in J, using analytical
methods. It is, however, possible to obtain an approximate distribution in
many cases. Writing

Y = ∫_J (D(t) − φ(t)) dt = Σ_l ∫_{a_l}^{b_l} (D(t) − φ(t)) dt ≡ Σ_l Y_l

and assuming that the lost volumes in the intervals (a_l, b_l], the Y_l, are
approximately independent and identically distributed, it follows by the
Central Limit Theorem that Y has an approximate normal distribution for large
s, cf. Asmussen (1987), Theorem 3.2, p. 136. To guarantee independent,
identical distributions and the asymptotic normality, the process
D(t) − φ(t) must be a regenerative process.

The means of Y and Y_l can be calculated using Fubini's Theorem:

E ∫ Z(t) dt = ∫ EZ(t) dt
where Z(t) is one of the processes D(t) − φ(t), D(t) and φ(t). Using limiting
probabilities we can easily obtain approximate values for this mean. Hence
the problem has been reduced to computing measures of category 1.

To compute the variance, the asymptotic results in Asmussen (1987) can be
used in the case that Z(t) is a regenerative process. Here we will, however,
consider some simple alternative methods. For the sake of simplicity we
assume that the demand D(t) equals the maximum system state φ_M. We restrict
attention to the variable Y. We assume exponential life times, so that φ is a
regenerative process. The methods normally give good approximations for
highly available systems. Initially we look at the case where the system has
only two states, so that a "system failure" is well-defined.

Let Y_l denote the throughput loss in the lth interval J_l ≡ (a_l, b_l],
where a_l and b_l are constants. Then, by assuming that the Y_l are
approximately independent and that the probability of two or more system
failures occurring in J_l is small, we obtain

Var(Y) ≈ Σ_l Var(Y_l)
where H(y) equals the distribution of the component repair times. The
steady-state formula represents the system downtime distribution at time
"t = ∞", and does not depend on the life time distribution. The asymptotic
distribution gives very good approximations if it is likely that the number
of component failures at the occurrence of the system failure is relatively
large for each component.
From formula (4.3) we see that for a series system of highly available
components, we can write
(4.4)
Example continued

We assume, as an approximation, that the component availabilities p_i(t)
equal the limiting availabilities for all t in the interval. Then we can
calculate an approximate distribution of the lost throughput in this time
interval by using a normal distribution with mean equal to
8760 × 2 × (1 − 0.941) = 8760 × 0.118 = 1033. It remains to calculate an
approximate variance using the approach described above. For both system
failure states, we can in this case consider the system as a series-type
structure and use formula (4.4), observing that the probability of having
more than one component down at the same time is small. Some straightforward
calculations give

Var(Y) ≈ 2 × 10^4

Using this value for the variance we can easily calculate an approximate
probability distribution for the lost throughput in J:

where σ² and τ² denote the variances of the up-times and the downtimes,
respectively.

The approximation by a normal distribution gives poor results if the interval
is small. Then alternative calculation methods must be used, see Haukaas and
Aven (1996b), Smith (1995) and Appendix B.
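Given the mean (1033) and variance (2 × 10^4) obtained above, approximate probabilities for the lost throughput follow from the normal distribution; the threshold 1200 below is an arbitrary illustrative value.

```python
import math

# Normal-approximation sketch for the lost throughput Y in J, using the mean
# (1033) and variance (2e4) from the example; the threshold 1200 is an
# arbitrary illustrative value.
mean, var = 1033.0, 2.0e4
sd = math.sqrt(var)

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_below_1200 = norm_cdf((1200.0 - mean) / sd)   # P(Y <= 1200), approximately
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd     # approximate 95% interval
```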
Acknowledgement. The author would like to thank M.A.J. Smith, Erasmus
University, A. Csenki, University of Bradford, and I. Kovalenko, Ukrainian
Academy of Sciences, for valuable discussions.
References
Natvig, B.: Multistate Coherent Systems. In: Johnson and Kotz (eds.):
Encyclopedia of Statistical Science 5. New York: Wiley (1984)
Natvig, B.: Strict and Exact Bounds for the Availabilities in a Fixed Time Interval
for Multistate Monotone Systems. Research Report. University of Oslo (1991)
Natvig, B., Streller, A.: The Steady State Behaviour of Multistate Monotone Sys-
tems. J. Appl. Prob. 21, 826-835 (1984)
Ostebo, R.: System Effectiveness Assessment in Offshore Field Development Us-
ing Life Cycle Performance Simulation. Proceedings of Annual Reliability and
Maintainability Symposium (RAMS). Atlanta (1993)
Ross, S.M.: Applied Probability Models with Optimization Applications. San
Francisco: Holden-Day 1970
Smith, M.A.J.: The interval availability of complex systems. Research Report. Eras-
mus University Rotterdam (1995)
Smit, A.C.J.M., van Rijn, C.F.H., Vanneste, S. G.: SPARC: A Comprehensive Re-
liability Engineering Tool. In: Flamm, J. (ed.): Proceedings of the 6th ESReDA
Seminar on Maintenance and System Effectiveness. Chamonix (1995)
Solovyev, A.D.: Asymptotic Behavior of the Time of First Occurrence of a Rare
Event. Engineering Cybernetics 9, 1038-1048 (1971)
Streller, A.: A Generalization of Cumulative Processes. Elektr. Informationsverarb.
Kybern. 16, 449-460 (1980)
Takacs, L.: On Certain Sojourn Time Problems in the Theory of Stochastic Pro-
cesses. Acta Math. Acad. Sci. Hungar. 8, 169-191 (1957)
Ushakov, I.A. (ed.): Handbook of Reliability Engineering. New York: Wiley 1994
Appendix

Consider a parallel system of n identical components with repair time
distribution H. The steady-state downtime distribution of the system given
system failure is

1 − [1 − H(y)] [ ∫_y^∞ [1 − H(x)] dx / MTTR ]^{n−1}
To see this, let R* be the remaining repair time, at a given point in time,
of a failed component in steady state. It is well known from the theory of
alternating renewal processes that the probability distribution of R* is
given by

P(R* > r) = ∫_r^∞ [1 − H(x)] dx / MTTR,

cf. e.g. Birolini (1994). Let Y* be the downtime of a system failure that
occurs at a given point in time in steady state, caused by the failure of
component i. Then, since the processes are stochastically independent, it
follows that

P(Y* > y) = [1 − H(y)] [ ∫_y^∞ [1 − H(x)] dx / MTTR ]^{n−1}
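The remaining-repair-time formula can be verified by simulation, using the fact from stationary renewal theory that the repair interval covering a random time point is length-biased; for a Gamma(k, θ) repair time the length-biased version is Gamma(k+1, θ), and the remaining part is a uniform fraction of it. All parameter values below are hypothetical.

```python
import math
import random

# Monte Carlo sketch of P(R* > r) = (int_r^inf [1 - H(x)] dx) / MTTR for
# Gamma(2, theta) repair times (hypothetical parameters). In steady state the
# repair interval covering a random time point is length-biased: for
# Gamma(k, theta) the length-biased version is Gamma(k+1, theta), and the
# remaining part is a uniform fraction of it.
random.seed(4)
k, theta = 2.0, 3.0                  # repair time ~ Gamma(2, 3), MTTR = 6
r0, runs = 5.0, 100_000

exceed = 0
for _ in range(runs):
    biased = random.gammavariate(k + 1.0, theta)   # length-biased repair time
    remaining = random.random() * biased           # uniform position within it
    exceed += remaining > r0
sim_tail = exceed / runs

# numeric evaluation of int_{r0}^inf [1 - H(x)] dx / MTTR (trapezoid rule);
# for Gamma(2, theta): 1 - H(x) = (1 + x/theta) * exp(-x/theta)
def Hbar(x):
    return (1.0 + x / theta) * math.exp(-x / theta)

n, upper = 4000, 120.0
step = (upper - r0) / n
integral = sum((Hbar(r0 + i * step) + Hbar(r0 + (i + 1) * step)) / 2 * step
               for i in range(n))
formula_tail = integral / (k * theta)
```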
Next we assume that φ is a parallel system of not necessarily identical
components. Then the steady-state downtime distribution given system failure
is given by

where

β_j = (1/MTTR_j) / Σ_{i=1}^{n} (1/MTTR_i)

denotes the steady-state probability that component j causes a system
failure. This result is shown as above for identical components, the
difference being that we have to take into consideration which component
causes the system failure, and the probability of this event given system
failure. In view of (4.1)
and (4.2), the probability that component j causes system failure equals

[ Π_{i≠j} (1 − p_i) / (MTTF_j + MTTR_j) ] / [ Σ_{k=1}^{n} Π_{i≠k} (1 − p_i) / (MTTF_k + MTTR_k) ]
This formula will produce good approximations for highly available
components, see Aven and Haukaas (1996b), Haukaas (1995) and Haukaas and
Aven (1996a).

The references Haukaas (1995), Haukaas and Aven (1996a) and Smith (1995)
also include a transient analysis of the downtime distribution. Formulae are
established which give improved results for the first system failure in the
time interval.
A useful lemma

Consider a unit that is put into operation at time zero. At failures the unit
is repaired and put into operation again. Let R_j, j = 1, 2, ..., denote the
consecutive repair times (downtimes). We assume that the R_j are
stochastically independent. Let H_{R_j}(r) denote the distribution of R_j.
Furthermore, let N*(s) denote the number of system failures after s
operational time units. In addition, define

N*(s−) = lim_{z↑s} N*(z)

Assume that the repair times are independent of the process N*(s). Let Z(s)
denote the total downtime associated with the operating time s, but not
including s, i.e.

Z(s) = Σ_{i=1}^{N*(s−)} R_i

and let

T(s) = s + Z(s)

We see that T(s) represents the calendar time after an operation time of s
time units and the completion of the repairs associated with the failures
that occurred up to, but not including, s.

Now, let Y(t) denote the total downtime of the unit in the time interval
[0, t]. The following lemma gives an exact expression for the probability
distribution of Y(t).
P(Y(t) ≤ y) = P(Z(t − y) ≤ y)
            = Σ_{n=0}^{∞} P( Σ_{i=1}^{n} R_i ≤ y, N*((t − y)−) = n )
            = Σ_{n=0}^{∞} H^{(n)}(y) P(N*((t − y)−) = n)

where H^{(n)} denotes the distribution of Σ_{i=1}^{n} R_i (with H^{(0)}(y) ≡ 1).
We have used that the repair times are independent of the process N*(s).

Remark B.1. Different versions of the above lemma have been formulated and
proved, cf. Birolini (1985, 1994), Donatiello and Iyer (1987), Funaki and
Yashimoto (1994) and Takács (1957). The above proof seems to be the
simplest one.
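The lemma can be checked by simulation in a case where its conditions hold exactly: exponential operating times, so that N*(s) is a Poisson process in operational time, and i.i.d. exponential repairs, so that the convolutions H^{(n)} are Erlang distributions. All parameter values are hypothetical.

```python
import math
import random

# Monte Carlo check of the lemma for a unit with exponential(lam) operating
# times (N*(s) Poisson with mean lam*s) and i.i.d. exponential(mu) repairs;
# parameters are hypothetical.
random.seed(3)
lam, mu, t = 0.05, 0.5, 100.0

def sim_downtime():
    clock, down = 0.0, 0.0
    while clock < t:
        clock += random.expovariate(lam)        # operating period
        if clock >= t:
            break
        rep = random.expovariate(mu)            # repair period
        down += min(rep, t - clock)             # downtime inside [0, t]
        clock += rep
    return down

def exact_cdf(y):
    # P(Y(t) <= y) = sum_n H^(n)(y) P(N*((t-y)-) = n); H^(n) is Erlang(n, mu)
    m = lam * (t - y)
    pois = math.exp(-m)                 # P(N = 0)
    term = math.exp(-mu * y)            # e^{-mu*y} (mu*y)^0 / 0!
    tail = term                         # running sum of the Erlang tail terms
    total = 1.0 * pois                  # n = 0 contribution: H^(0)(y) = 1
    for n in range(1, 80):
        pois *= m / n                   # P(N = n)
        total += (1.0 - tail) * pois    # Erlang(n, mu) CDF at y is 1 - tail
        term *= mu * y / n              # next tail term
        tail += term
    return total

runs, ys = 20_000, [5.0, 10.0, 20.0]
emp = {y: 0 for y in ys}
for _ in range(runs):
    d = sim_downtime()
    for y in ys:
        emp[y] += d <= y
diffs = [abs(emp[y] / runs - exact_cdf(y)) for y in ys]
```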
To motivate this result, we note that the expected number of system failures
per unit of time, when considering calendar time, is approximately equal to
λ_Φ, given by (see Section 4.2)

λ_Φ = Σ_{i=1}^{n} [h(1_i, p) − h(0_i, p)] / (MTTF_i + MTTR_i)

Then, observing that the ratio between calendar time and operational time is
approximately 1/h(p), we see that the expected number of system failures per
unit of time when considering operational time, E(N*(s + v) − N*(s))/v, is
approximately equal to λ_Φ/h(p). It is not difficult to see that this
expectation is approximately independent of the history of N* up to s,
noting that the state process X frequently restarts itself probabilistically
(i.e. X = (1, 1, ..., 1)).

The system downtimes are approximately identically distributed with
distribution G(r) (see Section 4.3 and Appendix A), independent of N*, and
approximately independent, observing that the state process X with high
probability restarts itself just after a system failure.

As an approximation we can therefore assume that the conditions of the lemma
are satisfied, with N*(s) approximately Poisson distributed with parameter
λ_Φ s/h(p).
Now, using the above lemma, it follows that

P_t(y) ≡ P(Y(t) ≤ y) ≈ Σ_{n=0}^{∞} G^{(n)}(y) P(N*((t − y)−) = n)   (B.1)
Note that the above lemma does not require identically distributed downtimes.
Hence formula (B.1) can also be used with G^{(n)}(y) taken as the convolution
of not necessarily identically distributed downtimes given system failure,
cf. the analysis in Haukaas (1995) and Smith (1995).

In the case that the expected number of system failures in the interval is
small, significantly less than 1, P_t(y) can be accurately approximated by
some simple bounds. The lower bound follows by including only the first two
terms of the sum in P_t(y), whereas the upper bound follows by using the
inequality
the downtime distribution, P(Y(t) ≤ y), in the case that the components are
highly available. Table B.1 below shows the simulation results for the
parallel system analysed in Section 4.1. The system comprises 2 identical
components, with

The length of the time interval is 8760 units of time. Both components are
assumed to be functioning at time zero. The number of simulation runs was
30000, so the standard deviation is bounded by (0.5 × 0.5/30000)^{1/2} ≈ 0.003.
     y    Simulated   Approximated
     0      0.246        0.249
     2      0.281        0.284
     4      0.320        0.319
     6      0.361        0.358
     8      0.405        0.399
    10      0.451        0.445
    12      0.501        0.495
    14      0.554        0.546
    16      0.610        0.604
    18      0.670        0.664
    20      0.733        0.729
    22      0.763        0.758
    24      0.792        0.786
    26      0.819        0.814
    28      0.845        0.840
    30      0.868        0.866
    32      0.890        0.887
    34      0.909        0.906
    36      0.926        0.924
    38      0.940        0.938
    40      0.950        0.949
    45      0.969        0.969
    50      0.982        0.982
    60      0.994        0.994
    70      0.998        0.998
    80      0.999        1.000
We see that the approximation is very good for this case. Refer to Aven
and Jensen (1996) for some formal asymptotic results related to the downtime
distribution of Y(t).
Optimal Replacement of Monotone
Repairable Systems
Terje Aven
Rogaland University Centre, Ullandhaug, 4004 Stavanger, Norway
1. Introduction
Failures occur according to a non-homogeneous Poisson process with an
intensity function λ(t). The expected number of minimal repairs in the time
interval [0, T] is EN(T) = ∫_0^T λ(t) dt. The cost of a minimal repair is c (c > 0)
and the cost of a replacement equals K (K > 0).
The long run average cost per unit of time when adopting this minimal
repair/replacement policy equals

    B^T = ( c ∫_0^T λ(t) dt + K ) / T

From this expression it is straightforward to find an optimal T.
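As a sketch, with an intensity chosen purely for illustration (not taken from the text), take λ(t) = βt^{β−1} with β > 1, so that ∫_0^T λ(t) dt = T^β and B^T = (cT^β + K)/T is minimized at T* = (K/(c(β−1)))^{1/β}:

```python
def avg_cost(T, c, K, beta):
    """B^T = (c * integral_0^T beta*t**(beta-1) dt + K) / T = (c*T**beta + K) / T."""
    return (c * T ** beta + K) / T

def t_opt(c, K, beta):
    """Stationary point of avg_cost; valid for beta > 1 (increasing intensity)."""
    return (K / (c * (beta - 1))) ** (1.0 / beta)

# illustrative parameter values
c, K, beta = 1.0, 2.0, 2.0
T_star = t_opt(c, K, beta)   # sqrt(2) for these values
```

A numerical perturbation of T_star confirms that it is a local (here global) minimum of the average cost.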
    λ(t) = v(X(t))

where v(x) is a deterministic function. The interpretation of λ(t) is that, given
the history of the system up to time t, the probability that the system fails
in the interval (t, t+h] is approximately λ(t)h. We assume that the failure
intensity process λ(t) is non-decreasing.
If the failure intensity process depends only on the state process X(t) and
not on the failure process N(t), we can interpret the repairs as minimal: a
repair which changes neither the age of the system nor the information about
the condition of the system. In this case, the running information about the
condition of the system can be thought of as being related to a system which is
always functioning.
The following simple cost structure is assumed: a planned replacement
of the system costs K (> 0) and a repair/replacement at system failure costs
c (> 0).
It is assumed that the systems generated by replacements are stochasti-
cally independent and identical, the same replacement policy is used for each
system and the replacement and repairs take negligible time.
The problem is to determine a replacement time minimizing the long run
(expected) cost per unit time.
Let M^T and S^T denote the expected cost associated with a replacement
cycle and the expected length of a replacement cycle, respectively. We restrict
attention to T's having M^T < ∞ and S^T < ∞. Then, using Ross (1970),
Theorem 3.16, the long run (expected) cost per unit time can be written:

    B^T = M^T / S^T = ( c EN(T) + K ) / ET     (3.1)
Define L^T_δ = M^T − δS^T. The stopping time T_δ = inf{t : a(t) ≥ δ} minimizes

    L^T_δ = M^T − δS^T = E ∫_0^T [a(t) − δ] dt + K

The results of Aven and Bergman (1986) follow. Let B(δ) = B^{T_δ}.
The stopping time T_{δ*}, where δ* = inf_T B^T, minimizes B^T. The
value δ* is given as the unique solution of the equation δ = B(δ).
Moreover, if δ > δ*, then δ > B(δ); if δ < δ*, then δ < B(δ); B(δ)
is non-increasing for δ ≤ δ*, non-decreasing for δ ≥ δ*, and B(δ) is
left-continuous.
Choose any δ_1 such that P(T_{δ_1} > 0) > 0, and set iteratively δ_{n+1} = B(δ_n).
Remark 3.1. The above algorithm usually converges very fast. Standard
numerical iterative methods, for example the bisection method
or the modified regula falsi (see e.g. Conte and de Boor 1972, Section 2), can
be used in addition to the above algorithm to locate δ*. We must
then start with δ_a ≤ δ_b such that δ_a ≤ B(δ_a) and δ_b ≥ B(δ_b). Then
we have δ_a ≤ δ* ≤ δ_b.
Remark 3.2. If we restrict attention to stopping times T which are
bounded by a stopping time S, say, satisfying ES < ∞, and a(t) is
non-decreasing for t ≤ S, then the above results are valid with T_δ
replaced by min{T_δ, S}. The stopping time S could for example be
the point in time of the m-th system failure.
    B(δ) = [ c ∫_0^∞ ∫ I(cv(x) < δ) v(x) Q_t(dx) dt + K ] / [ ∫_0^∞ ∫ I(cv(x) < δ) Q_t(dx) dt ]     (3.3)
Note that if X(t) is a vector process, then one of the components may be the
time.
Below we apply the above model to analyse a monotone system comprising
n components.
    φ(t) = φ(X(t))

where X(t) = (X_1(t), X_2(t), ..., X_n(t)) and φ(x) is the structure function of
the system. The structure function φ(x) is assumed to be monotone, i.e.
- φ(0) = 0 and φ(1) = 1, and
- the structure function φ(x) is non-decreasing in each argument.
Let N_i(t) denote the number of failures of component i in [0, t], and N(t) the
number of system failures in the same interval. The counting process N_i(t) is
assumed to have an intensity process λ_i(t). Hence the failure process of the
system, N(t), has an intensity λ(t) given by

    λ(t) = Σ_{i=1}^n λ_i(t) X_i(t) (1 − φ(0_i, X(t))) φ(X(t))     (3.4)

where φ(·_i, x) = φ(x_1, ..., x_{i−1}, ·, x_{i+1}, ..., x_n).
Observe that X_i(t)(1 − φ(0_i, X(t))) φ(X(t)) is either 0 or 1, and equals 1 if
and only if the system is functioning, component i is functioning, and the
system fails if component i fails.
Hence we have a special case of the general set-up described above pro-
vided that the intensity process is non-decreasing.
Below we look closer at two special cases.
3.2.1 Replacement at System Failures. First we consider the case where
all components are replaced at system failure, but no repairs are carried
out before system failures. Hence in this case we have X_i(t) = I(t < R_i),
where R_i is a random variable representing the time to the first failure of
the i-th component, i = 1, 2, ..., n. We assume that component i has a lifetime
distribution F_i(t) with failure rate equal to r_i(t). The n components are
assumed to be independent.
It follows that we have a special case of the general model, with λ_i(t) =
r_i(t)X_i(t) and S (cf. Remark 3.2) equal to the failure time of the system. It
is not difficult to see that X_i(t)(1 − φ(0_i, X(t))) is non-decreasing for t < S.
Thus the failure intensity process is non-decreasing if the failure rates r_i(t)
are non-decreasing.
Let v(t, x) = Σ_{i=1}^n r_i(t) x_i (1 − φ(0_i, x)) and

    G(t, x) = P(X(t) = x) = Π_{i=1}^n [1 − F_i(t)]^{x_i} [F_i(t)]^{1−x_i}
    B(δ) = [ c Σ_x ∫_0^∞ I(cv(t,x) < δ) v(t,x) φ(x) G(t,x) dt + K ] / [ Σ_x ∫_0^∞ I(cv(t,x) < δ) φ(x) G(t,x) dt ]

         = [ c Σ_{x: φ(x)=1} ∫_0^∞ I(cv(t,x) < δ) v(t,x) G(t,x) dt + K ] / [ Σ_{x: φ(x)=1} ∫_0^∞ I(cv(t,x) < δ) G(t,x) dt ]     (3.5)
If the failure rates are constant, i.e. r_i(t) = r_i, then v(t, x) = v(x) and

    ∫_0^∞ G(t, x) dt = expected sojourn time in state x = P(x) / Σ_{i=1}^n r_i x_i

where P(x) represents the probability that the process X(t) visits state x,
observing that the sojourn time in state x, given that the process visits this
state, has an exponential distribution with mean 1/Σ_{i=1}^n r_i x_i. The probability
P(x) can be found using standard Markov theory, see Ross (1970),
Proposition 4.11. Note that the probability of a transition from state (1_j, y)
to state (0_j, y) equals r_j / Σ_{i=1}^n r_i y_i with y_j = 1. It follows that
Numerical example

Assume that φ(x) = 1 − (1 − x_1)(1 − x_2), i.e. the system is a parallel system
comprising two components. Assume that the components have constant
failure rates r_i given by

    r_1 = 1, r_2 = 4

Furthermore assume that K = 1 and c = 6. Then we can easily find the
optimal replacement policy. Since cv(x) can only take the values 0, 6 and 24,
it suffices to consider the following three replacement policies:
1. Replace the components at system failures only.
2. If component 1 fails before component 2, replace component 1 at failure
(due to exponentiality this action is equivalent to a system replacement).
If component 2 fails before component 1, replace both components at
system failure. This policy corresponds to a δ value equal to 24.
3. Replace each component at failure. This policy corresponds to a δ value
equal to 6.
To compute B(δ) we use formula (3.5). It is not difficult to show that

    P(1,0) = P(R_2 < R_1) = r_2 / (r_1 + r_2) = 0.80
    P(0,1) = P(R_1 < R_2) = r_1 / (r_1 + r_2) = 0.20

Clearly P(1,1) = 1. It follows that the expected sojourn times in the states
(1,1), (1,0) and (0,1) equal 0.20, 0.80 and 0.05, respectively. From this we find
that B(∞) = 7/(0.20 + 0.80 + 0.05) ≈ 6.7, B(24) = (6 · 0.80 + 1)/(0.20 + 0.80) =
5.8 and B(6) = 1/0.20 = 5.0. Thus it is optimal to use policy 3: replace each
component at failure.
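The three policy values can be reproduced directly from the sojourn-time argument above; the following sketch simply re-derives B(∞), B(24) and B(6) for the stated data (r_1 = 1, r_2 = 4, K = 1, c = 6):

```python
r1, r2, K, c = 1.0, 4.0, 1.0, 6.0

s11 = 1.0 / (r1 + r2)   # expected sojourn time in state (1,1): 0.20
p10 = r2 / (r1 + r2)    # P((1,0) is visited) = P(R2 < R1) = 0.80
p01 = r1 / (r1 + r2)    # P((0,1) is visited) = P(R1 < R2) = 0.20
s10 = p10 / r1          # expected sojourn time in (1,0): 0.80
s01 = p01 / r2          # expected sojourn time in (0,1): 0.05

B_inf = (c + K) / (s11 + s10 + s01)   # policy 1: replace at system failure only
B_24  = (c * p10 + K) / (s11 + s10)   # policy 2: delta = 24
B_6   = K / s11                       # policy 3: delta = 6, replace at first failure
```

Evaluating these gives B(∞) ≈ 6.67, B(24) = 5.8 and B(6) = 5.0, confirming that policy 3 is optimal.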
Now if the accumulated damage is u and the number of failures is k, and
a shock occurs which causes an amount of damage y, then the system fails
with probability p_k(u + y). Then it is not difficult to see that the system
failure counting process N(t) has failure intensity process
    λ(t) = ν ∫_0^∞ p_{N(t)}(U(t) + y) dH(y)
For a formal proof of this result, see Aven (1987). We see that if p_k(u) is
non-decreasing in u for each k and non-decreasing in k for each u, then the
failure intensity process is non-decreasing.
If p_k(u) = p(u) for all k, the model represents a kind of minimal repair
model, since the system after a repair has "forgotten that it failed".
Numerical example

Suppose the parameters of the model are

    ν = 1, K = 1, c = 2, Y_i ≡ 1, p_k(u) = 1 − e^{−u/4}

Hence V(t) = U(t) and

    λ(t) = 1 − e^{−(U(t)+1)/4}

Using formula (3.3) it is not difficult to find the optimal policy: replace the
system when the number of shocks, V(t), equals 3. The average cost function
then equals 1.1.
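The stated optimum can be checked by evaluating the average cost of the policy "replace at the m-th shock". Under one reading of the example (an interpretation, not a formula quoted from the text), the k-th shock causes a failure with probability p(U + Y) = 1 − e^{−k/4}, so the expected failure cost per cycle is c Σ_{k=1}^m (1 − e^{−k/4}) and the expected cycle length is m/ν:

```python
import math

def avg_cost(m, nu=1.0, K=1.0, c=2.0):
    """Average cost of replacing at the m-th shock: replacement cost K plus
    expected failure cost c * sum_k (1 - exp(-k/4)), over cycle length m/nu."""
    failures = sum(1.0 - math.exp(-k / 4.0) for k in range(1, m + 1))
    return (c * failures + K) / (m / nu)

best_m = min(range(1, 20), key=avg_cost)   # minimised at m = 3
```

The minimum is attained at m = 3 with average cost ≈ 1.09, matching the value 1.1 reported above after rounding.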
3.3.2 Monotone System of n Components. Consider a monotone system
φ of n components.
Assume that shocks occur to the system according to a Poisson process
V(t) with rate ν; shock j causes a random amount of damage Y_{ij} on component
i. At a shock the system fails with a given probability. A component/system
failure can occur only at the occurrence of a shock.
Let U and N denote the vectors of component processes as defined in
Section 3.3.1, and let Y_i denote the vector of damages at the i-th shock. We
assume that the Y_i's are independent and identically distributed. Let X(t)
denote the vector of component states as defined in Section 3.2. As in the
general set-up, N(t) and λ(t) denote the system failure process and intensity,
respectively.
If the state of the components is x, the accumulated damage is u, the
number of component failures is k, and a shock occurs which causes an
amount of damage y, then the state of the system equals x' with probability
r_{x,k}(x', u + y).
Then it is not difficult to see that the failure counting process N(t) has
failure intensity process

    λ(t) = φ(X(t)) ν ∫ Σ_{x': φ(x')=0} r_{X(t),N(t)}(x', U(t) + y) dH(y)
In addition we have a cost associated with system failures. It is not difficult
to see that this cost equals kN(T) + b ∫_0^T [1 − φ(t)] dt, where N(t) represents
the number of system failures in [0, t].
Thus (4.1) expresses the expected cost per unit of time, and the problem of
finding an optimal replacement time is reduced to that of minimizing this
function with respect to T.
Using that N_i(t) is a counting process with intensity process λ_i(Z_i(t))X_i(t),
it follows that

    (4.3)

where φ(1_i, X(t)) − φ(0_i, X(t)) equals 1 if and only if component i is critical,
i.e. the state of component i determines whether the system functions or not.
Combining (4.1), (4.2) and (4.3) we get
    B^T = ( E ∫_0^T a(t) dt + K ) / ET     (4.4)
where

    a(t) = Σ_{i=1}^n [c_i + k(φ(1_i, X(t)) − φ(0_i, X(t)))] λ_i(Z_i(t)) X_i(t) + b[1 − φ(t)]     (4.5)
Observe that Z_i(t) ≈ t if the downtimes are relatively small compared to the
uptimes.
We see from the above expression for BT that it is basically identical
to the one analysed in Aven and Bergman (1986). Unfortunately, a(t) does
not have non-decreasing sample paths. Hence we cannot apply the results of
Aven and Bergman (1986).
In theory, Markov decision processes can be used to analyse the optimization
problem. The Markov decision process is characterized by a stochastic
process Y_t, t ≥ 0, defined here by
Let η_T denote the first component failure after T. Then from (4.4) it follows
that

    B(T,S) = [ ∫_0^T E a(t) dt + ∫_T^S E I(t < η_T) a(t) dt + K ] / [ T + ∫_T^S P(t < η_T) dt ]
where a(t) is defined by (4.5). To compute B(T,S) we will make use of the
approximation Z_i(t) ≈ t. This means that the downtimes are relatively small
compared to the uptimes. Using that the structure function of a monotone
system can be written as a sum of products of component states, with each
term of the sum multiplied by a constant, it is seen that

    E a(t) = Σ_l v_l(t) E Π_{i∈A_l} X_i(t)

for some deterministic functions v_l(t) and sets A_l ⊂ {1, 2, ..., n}. It suffices
therefore to calculate expressions of the form
    ∫_0^T v_l(t) E Π_{i∈A_l} X_i(t) dt     (4.6)

and

    ∫_T^S v_l(t) E [ Π_{i∈A_l} X_i(t) I(t < η_T) ] dt     (4.7)
To compute (4.6) we make use of the following formula for q_i(t) = 1 − p_i(t):

where

    H_i(y, t) = P(S_{i,N_i(t)} ≤ y)

It is seen that P(X_i(t) = 0 | S_{i,N_i(t)} = y) ≈ G_i(t − y), and using that

    H_i(y, t) = P(S_{i,N_i(t)} ≤ y) = P(N_i(t) − N_i(y) = 0) ≈ e^{−(Λ_i(t) − Λ_i(y))}

formula (4.8) follows.
The accuracy of formula (4.8) is studied in Sandve (1996).
It remains to compute (4.7). Here we shall present a very simple approximation
formula. Observing that I(t < η_T) = 1 means that there are no
component failures in the interval (T, t], and the components are most likely
to be up at time T, we have

    P(t < η_T) ≈ Π_{i=1}^n P(N_i(t) − N_i(T) = 0)
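Since each N_i has cumulative intensity Λ_i, the no-failure probabilities in the product can in turn be approximated by e^{−(Λ_i(t)−Λ_i(T))}, as in (4.8). A minimal sketch of the resulting product approximation, with the Λ_i supplied as callables (an illustrative encoding, not the chapter's notation):

```python
import math

def prob_no_failure(T, t, Lambda_list):
    """P(t < eta_T) ~ prod_i P(N_i(t) - N_i(T) = 0)
                    ~ exp(-sum_i (Lambda_i(t) - Lambda_i(T)))."""
    return math.exp(-sum(L(t) - L(T) for L in Lambda_list))
```

For example, two components with Λ_i(t) = t give P(t < η_T) ≈ e^{−2(t−T)}.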
4.4 Remarks
The (T, S) policy can be improved by taking into account which component
fails. Instead of replacing the system at the first component failure after
T (assuming this occurs before S), we might replace the system at the first
component failure resulting in a critical component, or wait until the first
system failure after T.
If T* minimizes L^T_{δ*}, where δ* = inf_T B^T, then T* also minimizes B^T. Hence
we can focus on L^T_δ.
It is clear from the expression of L^T_δ that an optimal policy will be greater
than or equal to the stopping time

    T_δ = inf{ t : a(t) ≥ δ }

Using the optimal average cost B(T,S) as an approximation δ̂ for δ* we can
obtain an improved replacement policy (T_δ̂, S).
An alternative replacement policy is obtained by considering the time
points where component failures occur as decision points. Let T_i be the point
in time of the i-th component failure and let F_i denote the history up to
time T_i. Then, based on F_i, we determine a time R_i (∈ [0, ∞]) such that the
system is replaced at T_i + R_i if T_i + R_i < T_{i+1}. The value of R_i is determined
by minimizing the conditional expected cost from T_i until the next decision
point or replacement time, whichever occurs first, i.e. R_i minimizes

    g(r) = ∫_{T_i}^{T_i + r} E[ (a(t) − δ) I(t < T_{i+1}) | F_i ] dt
References
Conte, S.D., de Boor, C.: Elementary Numerical Analysis. New York: McGraw-Hill
1972
Dekker, R.: A Framework for Single-Parameter Maintenance Activities and its Use
in Optimization, Priority Setting and Combining, In this volume (1996), pp.
170-188
Jensen, U.: A General Replacement Model. ZOR - Methods and Models of Opera-
tions Research 34, 423-439 (1990)
Pierskalla, W.P., Voelker, J.A.: A Survey of Maintenance Models: the Control
and Surveillance of Deteriorating Systems. Naval Res. Logist. Quart. 23, 353-388
(1976)
Ross, S.M.: Applied Probability Models with Optimization Applications. San Fran-
cisco: Holden-Day 1970
Sandve, K.: Cost Analysis and Optimal Maintenance Planning of a Monotone, Re-
pairable System. Ph.D. Thesis. Rogaland University Centre and Robert Gordon
University. In progress (1996)
Taylor, H.M.: Optimal Replacement Policy Under Additive Damage and Other
Failure Models. Naval Res. Logist. Quart. 22, 1-18 (1975)
Valdez-Flores, C., Feldman, R.M.: A Survey of Preventive Maintenance Models for
Stochastically Deteriorating Single-Unit Systems. Naval Res. Logist. Quart. 36,
419-446 (1989)
How to Determine Maintenance Frequencies
for Multi-Component Systems?
A General Approach
Rommert Dekker, Hans Frenk and Ralph E. Wildeman
Econometric Institute, Erasmus University Rotterdam, 3000 DR Rotterdam, The
Netherlands
1. Introduction
A technical system (such as a transportation fleet, a machine, a road, or
a building) mostly contains many different components. The cost of main-
taining a component of such a technical system often consists of a cost that
depends on the component involved and of a fixed cost that only depends on
the system. The system-dependent cost is called the set-up cost and is shared
by all maintenance activities carried out simultaneously on components of
the system. The set-up cost can consist of, for example, the down-time cost
due to production loss if the system cannot be used during maintenance, or
of the preparation cost associated with erecting a scaffolding or opening a
machine. Set-up costs can be saved when maintenance activities on different
components are executed simultaneously, since execution of a group of activ-
ities requires only one set-up. This can yield considerable cost savings, and
therefore the development of optimisation models for multiple components is
an important research issue.
For a literature overview of the field of maintenance of multi-component
systems, we refer to Van der Duyn Schouten (1996) in this volume. An-
other review is given by Cho and Parlar (1991). By now there are several
methods that can handle multiple components. However, most of them suffer
from intractability when the number of components grows, unless a special
structure is assumed. For instance, the maintenance of a deteriorating sys-
tem is frequently described using Markov decision theory (see, for example
Howard 1960, who was the first to use such a problem formulation). Since
the state space in such problems grows exponentially with the number of
components, the Markov decision modelling of multi-component systems is
not tractable for more than three non-identical components (see, for example
Backert and Rippin 1985). For problems with many components, heuristic
methods can be applied. For instance, Dekker and Roelvink (1995) present
a heuristic replacement criterion for the case in which a fixed group of components
is always replaced. Van der Duyn Schouten and Vanneste (1990) study structured
strategies, viz. (n, N)-strategies, but provide an algorithm for only two iden-
tical components. Summarising, these models are of limited practical use,
since reasonable numbers of components cannot be handled.
An approach that can handle many components was introduced by Goyal
and Kusy (1985) and Goyal and Gunasekaran (1992). In this approach a
basis interval for maintenance is taken and it is assumed that components can
only be maintained at integer multiples of this interval, thereby saving set-up
costs. The authors present an algorithm that iteratively determines the basis
interval and the integer multiples. The algorithm has two disadvantages. The
first is that only components with a very specific deterioration structure can
be handled, which makes it more difficult to fit practical situations and makes
it impossible to apply it to well-known maintenance models. The second
disadvantage is that the algorithm often gives solutions that are not optimal
and that there is no information on how good the solutions are (see Van
Egmond et al. 1995).
The idea of using a basic cycle time and individual integer multiples was
first applied in the definition of the joint-replenishment problem in inventory
theory, see Goyal (1973); the joint-replenishment problem can be considered
as a special case of the maintenance problem of Goyal and Kusy (1985). A
method to solve the joint-replenishment problem to optimality was presented
by Goyal (1974). However, this method is based on enumeration and is com-
putationally prohibitive. Moreover, it is not clear how this method can be
extended to deal with the more general cost functions arising in maintenance
optimisation. Many heuristics have appeared in the joint-replenishment literature
(see Goyal and Satir 1989). But again, it is not clear how these heuristics
will perform for the more general maintenance cost functions.
In this chapter we present a general approach for the coordination of maintenance
frequencies, thereby pursuing the idea of Goyal and Gunasekaran
(1992) and Goyal and Kusy (1985). With the approach we can easily solve
the model of Goyal et al. to optimality, but we can also incorporate other
maintenance models like minimal repair, inspection and block replacement.
2. Problem Definition
Consider a multi-component system with components i, i = 1, ... , n. Cre-
ating an occasion for preventive maintenance on one or more of these com-
ponents involves a set-up cost S, independent of how many components are
maintained. The set-up cost can be due to, for example, system down-time.
Because of this set-up cost S there is an economic dependence between the
individual components.
In this chapter we consider preventive maintenance activities of the block
type, that is, the determination of the next execution time depends only on
the time passed since the latest execution. Otherwise, for example in case of
age replacement, execution of maintenance can no longer be coordinated and
one has to use opportunity or modified block-replacement policies.
On an occasion for maintenance, component i can be preventively maintained
at an extra cost of c_i^p. Let M_i(x) be the expected cumulative deterioration
costs of component i (due to failures, repairs, etc.), x time units after its
latest preventive maintenance. We assume that M_i(·) is continuous and that
after preventive maintenance a component can be considered as good as new.
Consequently, the average costs Φ_i(x) of component i, when component i is
preventively maintained on an occasion each x time units, amount to

    Φ_i(x) = ( c_i^p + M_i(x) ) / x     (2.1)
Since the function M_i(·) is continuous, the function Φ_i(·) is also continuous.
To reduce costs by exploiting the economic dependence between compo-
nents, maintenance on individual components can be combined. We assume
where lcm(k_{a_1}, ..., k_{a_j}) denotes the least common multiple of the integers
k_{a_1}, ..., k_{a_j}. Notice that Δ(k) ≤ 1 and that Δ(k) ≥ (min_i{k_i})^{-1}. Consequently,
if min_i{k_i} = 1, then Δ(k) = 1.
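One concrete reading of the correction factor (an interpretation consistent with the stated bounds, not a formula quoted from the text) is the fraction of the candidate occasions 1, ..., lcm(k) at which at least one component is actually maintained, since set-up cost is only incurred on those occasions:

```python
from math import lcm  # Python 3.9+

def correction_factor(k):
    """Fraction of occasions 1..lcm(k) at which some component i,
    maintained every k_i occasions, is due. Satisfies
    (min k_i)^-1 <= Delta(k) <= 1, with Delta(k) = 1 when min k_i = 1."""
    L = lcm(*k)
    used = sum(1 for m in range(1, L + 1) if any(m % ki == 0 for ki in k))
    return used / L
```

For example, k = (2, 3) gives maintenance at occasions {2, 3, 4, 6} out of lcm = 6, hence Δ(k) = 2/3.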
Goyal (1982), however, criticises the formulation of Dagpunar (1982).
In the maintenance context (see Goyal and Kusy 1985 and Goyal and Gu-
nasekaran 1992), but also in the formulation of the joint-replenishment prob-
lem found in the inventory literature, the correction factor is usually ne-
glected, or equivalently, assumed to be equal to 1. This is correct under the
assumption that the set-up cost is also incurred at occasions at which no
actual maintenance is carried out.
We will consider here two different problem formulations, one with the
correction factor and another without. With the correction factor we have
the following problem:

    inf { S Δ(k)/T + Σ_{i=1}^n Φ_i(k_i T) : k_i ∈ ℕ, T > 0 }     (2.3)
Denote now by v(Pc), v(P) the optimal objective value of (Pc), (P) respectively,
and by T(Pc), T(P) an optimal T (if it exists) for these problems.
Notice that if T(Pc) and (k_1(T(Pc)), k_2(T(Pc)), ..., k_n(T(Pc))) ∈ ℕ^n are optimal
for (Pc), then T = 1/T(Pc) and the same values of k_i, i = 1, ..., n,
are optimal for the optimisation problem (2.3). Analogously, if T(P) and
(k_1(T(P)), k_2(T(P)), ..., k_n(T(P))) ∈ ℕ^n are optimal for (P), then T =
1/T(P) and (k_1(T(P)), k_2(T(P)), ..., k_n(T(P))) are optimal for problem (2.4).
Let v(P_rel) be the optimal objective value of (P_rel) and let T(P_rel) be a
corresponding optimal solution of (P_rel) (if it exists).
For this relaxation it clearly follows that v(P) ≥ v(P_rel). Without any
assumptions on Φ_i(·), it can be shown that v(P_rel) is also a lower bound on
v(Pc). This is established in the following lemma.

Lemma 3.1. It follows that v(P) ≥ v(Pc) ≥ v(P_rel).
Proof. Since for every vector k = (k_1, ..., k_n) it holds that Δ(k) ≤ 1, the first
inequality follows immediately. To prove the second inequality, we observe
that for every ε > 0 there exists a vector (T_ε, k_1(T_ε), ..., k_n(T_ε)), feasible
for (Pc), whose objective value is at most v(Pc) + ε. Since k_i(T_ε) ≥ 1 for
every i, this objective value is bounded below by that of the relaxation, and
consequently

    v(Pc) + ε ≥ S Δ(k(T_ε))/T_ε + Σ_{i=1}^n Φ_i(k_i(T_ε) T_ε) ≥ v(P_rel)

Since ε > 0 was arbitrary, v(Pc) ≥ v(P_rel).
and so for every T > 1/x_1^* the objective function of (P_rel) evaluated in T is
larger than the objective function evaluated in the point 1/x_1^*. This implies
T(P_rel) ≤ 1/x_1^* and the desired result is proved. □
In Section 4 we will simplify the objective function of problem (P_rel)
by imposing some assumptions on the functions Φ_i(·). In order to simplify
the objective function of problem (P), we also need some assumptions on
the same functions Φ_i(·). However, before introducing these assumptions, we
discuss the literature on problem (P).
Goyal and Kusy (1985) and Goyal and Gunasekaran (1992) apply an iterative
algorithm to solve problem (2.4) in the previous section (equivalent with (P))
for their specific deterioration-cost functions. The authors initialise each k_i =
V(Prel) is a lower bound on both v(Pc) and v(P), we can decide whether this
feasible solution is good enough.
If this feasible solution is not good enough, we subsequently apply a
global-optimisation procedure to the simplified problem (P) in an interval
that is obtained by the relaxation and that contains an optimal T(P). For
the special cases of Goyal et al., the minimal-repair model and the inspection
model, it is then possible to find in little time a solution to (P) with an objective
value that has an arbitrarily small deviation from the optimal value v(P).
For the block-replacement model this is not possible, but application of a fast
golden-section search heuristic yields a good solution as well. In all cases our
approach outperforms that of Goyal et al. Our approach can also be applied
to find an optimal solution to the joint-replenishment problem, see Dekker et
al. (1995). In that case the procedure can be made even more efficient, since
the cost functions in that problem have a very simple form.
With a solution to problem (P), we then have an improved upper bound
v(P) on v(Pc). If this is close to v(P_rel), then by Lemma 3.1 it is also close
to v(Pc), and so we have a good solution of (Pc) as well.
We will now simplify, under certain conditions, the objective function of
problem (P).

Definition 3.1. A function f(x), x ∈ (0, ∞), is called unimodal on (0, ∞)
with respect to b ≥ 0 if f(x) is decreasing for x ≤ b and increasing for
x ≥ b. That is, f(y) ≥ f(x) for every y ≤ x ≤ b, and f(y) ≥ f(x) for every
y ≥ x ≥ b.
Observe that by this definition it is immediately clear that any increasing
function f(x), x ∈ (0, ∞), is unimodal on (0, ∞) with respect to b = 0.
Assumption 3.2. For each i = 1, ..., n the optimisation problem (P_i) given
by inf{Φ_i(x) : x > 0} has a finite optimal solution x_i^* > 0. Furthermore, for
each i the function Φ_i(·) is unimodal on (0, ∞) with respect to x_i^*.

By Assumption 3.2 the objective function of problem (P) can be simplified
considerably. To this end consider the intervals I_i(k) := [k/x_i^*, (k+1)/x_i^*], k =
0, 1, ..., introduced in Section 3.1, and observe that if t ∈ I_i(k) and k ≥ 1, then
it holds that k/t ≤ x_i^* ≤ (k+1)/t, so that

    x_i^* ≤ (k+1)/t ≤ (k+2)/t ≤ (k+3)/t ≤ ...

and

    x_i^* ≥ k/t ≥ (k−1)/t ≥ ... ≥ 1/t.
Fig. 3.1. An example of the function g_i(·). The thin lines are the graphs of the
functions Φ_i(1/t), Φ_i(2/t), ..., Φ_i(5/t). The (bold) graph of g_i(·) is the lower envelope
of these functions.
we only need to consider c < ∞. Observe now that for any y with b_i < y < x
it holds that

    (M_i(x) − xc) − (M_i(y) − yc) = ( (M_i(x) − M_i(y))/(x − y) − c ) (x − y).

Since by the first part of this lemma we have that the function x → (M_i(x) −
M_i(y))/(x − y) is increasing on (y, ∞) and lim_{x→∞} (M_i(x) − M_i(y))/(x − y) =
c, it follows by the above equality that M_i(x) − xc ≤ M_i(y) − yc. □
Using Lemma 3.3 we can show the following result. Observe that the first
part of this lemma improves a result given by Dekker (1995).

Lemma 3.4. If M_i(·) is concave on (0, b_i) and convex on (b_i, ∞) for some
b_i ≥ 0, then the set of optimal solutions of the optimisation problem (P_i)
given by inf{Φ_i(x) : x > 0} is nonempty and compact if and only if
lim_{x→∞} M_i(x) − xc < −c_i^p, with c := lim_{x→∞} M_i(x)/x. Moreover, it follows
for any optimal solution x_i^* of (P_i) that x_i^* ≥ b_i and that the function Φ_i(·)
is unimodal on (0, ∞) with respect to x_i^*.
Proof. If for some b_i > 0 the function M_i(·) is concave on (0, b_i), then the
function c_i^p + M_i(·) is also concave on (0, b_i). This implies for every 0 < z_1 <
z_2 < b_i that c_i^p + M_i(z_1) = c_i^p + M_i((z_1/z_2)z_2) > (z_1/z_2)(c_i^p + M_i(z_2)). Hence,
by equation (2.1) it follows that Φ_i(z_1) > Φ_i(z_2) and, consequently, that Φ_i(·)
(0, ∞). By Lemma 3.4 we then have that Φ_i(·) is unimodal with respect
to x_i^*.
2. Special Case of Goyal and Gunasekaran. It is easy to show (by setting
the derivative of Φ_i(·) to zero) that the optimisation problem (P_i) has an
optimal solution x_i^* = {2(c_i^p − a_i X_i Y_i)/(b_i Y_i^2) + X_i^2}^{1/2}. This solution is
finite and positive if and only if b_i and Y_i are strictly larger than zero
and c_i^p > X_i Y_i (a_i − b_i X_i Y_i / 2), and by the assumption that x_i^* > 0 we
can assume that this is the case.
We have that M_i(x) = ∫_0^{Y_i(x − X_i)} (a_i + b_i t) dt = a_i Y_i (x − X_i) + b_i Y_i^2 (x −
X_i)^2 / 2, so that M_i''(x) = b_i Y_i^2 > 0 and, as a result, M_i(·) is (strictly)
convex on (0, ∞). By Lemma 3.4 we then have that Φ_i(·) is unimodal
with respect to x_i^*.
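The closed form for x_i^* can be checked numerically against definition (2.1) of Φ_i; the parameter values below are arbitrary choices satisfying the positivity condition c_i^p > X_i Y_i (a_i − b_i X_i Y_i / 2):

```python
def M(x, a, b, X, Y):
    """M_i(x) = a*Y*(x - X) + b*Y**2*(x - X)**2 / 2."""
    return a * Y * (x - X) + b * Y ** 2 * (x - X) ** 2 / 2.0

def phi(x, cp, a, b, X, Y):
    """Phi_i(x) = (c_i^p + M_i(x)) / x, equation (2.1)."""
    return (cp + M(x, a, b, X, Y)) / x

def x_star(cp, a, b, X, Y):
    """Closed-form minimiser of phi for this deterioration-cost model."""
    return (2.0 * (cp - a * X * Y) / (b * Y ** 2) + X ** 2) ** 0.5

# illustrative parameters (cp > X*Y*(a - b*X*Y/2) holds: 5 > 0.25)
cp, a, b, X, Y = 5.0, 1.0, 2.0, 0.5, 1.0
xs = x_star(cp, a, b, X, Y)
```

Perturbing xs in either direction increases phi, confirming that the stationary point is the minimum.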
3. Minimal-Repair Model. If the rate of occurrence of failures r_i(·) is unimodal
with respect to a value b_i ≥ 0, then as M_i(x) = c_i ∫_0^x r_i(t) dt it follows
that M_i'(·) is decreasing on (0, b_i) and increasing on (b_i, ∞). Hence
M_i(·) is concave on (0, b_i) and convex on (b_i, ∞). Since the optimisation
problem (P_i) has a finite solution x_i^* > 0, we then have by Lemma 3.4
that Φ_i(·) is unimodal with respect to x_i^*.
Notice that if b_i = 0, then r_i(·) is increasing on (0, ∞) and M_i(·) is convex
on (0, ∞). If r_i(·) is unimodal with respect to a b_i strictly larger than
zero, then Φ_i(·) follows a bathtub pattern. In Lemma 3.4 we showed that
for this case x_i^* ≥ b_i. As the function M_i(·) is convex on (b_i, ∞), it is
a fortiori convex on (x_i^*, ∞), a result that will be used later to prove
that the relaxation (P_rel) of (P) is a convex-programming problem (see
Lemma 4.2).
4. Inspection Model. Since M_i(x) = c_i^r ∫_0^x F_i(t) dt, we have that M_i'(x) is
increasing on (0, ∞), and hence that M_i(x) is convex on (0, ∞). Since
the optimisation problem (P_i) has a finite solution x_i^* > 0, we then have
by Lemma 3.4 that Φ_i(·) is unimodal with respect to x_i^*.
Consequently, if for each i = 1, ..., n one of the above models is used (possibly
different models for different i), then Φ_i(·) is unimodal with respect to x_i^* and
so we have verified that Assumption 3.2 is satisfied. □

Observe that by Lemma 3.4 an easy necessary and sufficient condition for the
existence of only finite optimal solutions of (P_i) is presented for both cases 3
and 4 above.
In Figure 3.2 an example of the objective function of problem (P) under
Assumption 3.2 is given. In general this objective function has several local
minima, even for the simple models described above. This is due to the shape
of the functions g_i(·) and it is inherent to the fact that the k_i have to be integer.
In the following section we show that when problem (P_rel) is considered,
often a much easier problem is obtained; for the special cases of Goyal et
al., the minimal-repair model and the inspection model the relaxation (P_rel)
turns out to be a single-variable convex-programming problem and so it is
easy to solve.
Fig. 3.2. An example of the objective function of problem (P); there are many
local minima.
Assumption 4.1. For each i = 1, ..., n the optimisation problem (P_i) given
by inf{Φ_i(x) : x > 0} has a finite optimal solution x_i^* > 0. Furthermore, for
each i = 1, ..., n it holds that Φ_i(·) is increasing on (x_i^*, ∞).
Theorem 3.1 showed for the special cases of Goyal et al., the minimal-repair
model with a unimodal rate of occurrence of failures, and the inspection
model, that Assumption 3.2 is satisfied when (P_i) has a finite solution x_i^* > 0.
As a result, Assumption 4.1 is also satisfied for these models.
By Assumption 4.1 the objective function of problem (Prel) can be sim-
plified. Analogously to equation (3.1) we have for

    g_i^(R)(t) := { φ_i(1/t)    if t ≤ 1/x_i^*,        (4.1)
                  { φ_i(x_i^*)  if t ≥ 1/x_i^*,

that g_i^(R)(t) = inf{φ_i(k_i/t) : k_i ≥ 1}. In Figure 4.1 an example of the
function g_i^(R)(·) is given.
Fig. 4.1. An example of the function g_i^(R)(·). Notice the similarity with the graph
of g_i(·) in Figure 3.1.
Let (R) denote the optimisation problem inf{ST + Σ_{i=1}^n g_i^(R)(T) : T > 0},
with v(R) its optimal objective value and T(R) an optimal solution. Notice
that by Assumption 4.1 it follows that v(R) = v(Prel) and
T(R) = T(Prel), since (R) and (Prel) are equivalent under this assumption.
Remember, if we use (R) we always assume that Assumption 4.1 holds.
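Definition (4.1) can be sketched directly. The cost function phi(x) = (C_P + A·x²)/x below is a hypothetical example (not the chapter's data), chosen so that its minimiser x* = √(C_P/A) is available in closed form:

```python
import math

# Sketch of the relaxed cost function g^(R) of (4.1).  The individual
# average-cost function phi(x) = (c^p + M(x))/x with M(x) = a*x**2 is a
# hypothetical example; its finite minimiser is x* = sqrt(c^p / a).
C_P, A = 4.0, 1.0

def phi(x):
    return (C_P + A * x * x) / x

X_STAR = math.sqrt(C_P / A)  # = 2.0 for these illustrative parameters

def g_R(t):
    """g^(R)(t) = phi(1/t) if t <= 1/x*, and the constant phi(x*) otherwise."""
    return phi(1.0 / t) if t <= 1.0 / X_STAR else phi(X_STAR)
```

On (0, 1/x*] the component is maintained every 1/t time units (k = 1); beyond 1/x* the relaxation lets the real-valued multiplier k = t·x* keep the interval at x*, so the cost stays at the minimum φ(x*).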
We will now consider a class of functions φ_i(·) that satisfy Assumption 4.1.
Lemma 4.1. If the optimisation problem (P_i) given by inf{φ_i(x) : x > 0}
has a finite optimal solution x_i^* > 0 and the function M_i(·) is convex on
(x_i^*, ∞), then the function φ_i(·) is increasing on (x_i^*, ∞).
Proof. Since the function M_i(·) is convex on (x_i^*, ∞), it follows by Theo-
rem 3.51 of Martos (1975) that φ_i(t) = (c_i^p + M_i(t))/t is a so-called quasicon-
vex function on (x_i^*, ∞). Since inf{φ_i(x) : x > 0} has an optimal solution
x_i^* > 0, the desired result follows by Proposition 3.8 of Avriel et al. (1988).
□
Under the same condition as imposed in Lemma 4.1, one can prove ad-
ditionally that the function g_i^(R)(·) is convex. Consequently, if the condition
of Lemma 4.1 holds for each i, the optimisation problem (R) is a univariate
convex-programming problem and so it is easy to solve. The convexity of the
function g_i^(R)(·) is established by the following lemma.
Lemma 4.2. If the function M_i(·) is convex on (x_i^*, ∞), then the function
g_i^(R)(·) is convex on (0, ∞).
Proof. Let b_i ≥ 0 be such that M_i(·) is convex on (b_i, ∞), define for a function f
the slope

    s_f(t, t_0) = (f(t) − f(t_0)) / (t − t_0),

and let f(t) := tφ_i(t) and g(t) := φ_i(1/t). It is easy to verify that

    s_f(t, t_0) = φ_i(t_0) − (1/t_0) s_g(1/t, 1/t_0).    (4.2)
The well-known criterion of increasing slopes valid for convex functions (see,
e.g., Proposition 1.1.4 in Chapter I of Hiriart-Urruty and Lemaréchal 1993)
yields for the convex function f(t) = tφ_i(t) on (b_i, ∞) that s_f(t, t_0) is in-
creasing in t > b_i for every t_0 > b_i. By (4.2) this implies that φ_i(t_0) −
(1/t_0) s_g(1/t, 1/t_0) is increasing in t > b_i for every t_0 > b_i. Since φ_i(t_0) and
1/t_0 are constants, the function −s_g(1/t, 1/t_0) is then increasing in t > b_i for
How to Determine Maintenance Frequencies? 257
every to > k Hence, sg(llt, lito) is increasing as a function of lit < lib; for
every lito < lib;, which is equivalent with Sg(x, xo) is increasing in x < lib;
for every Xo < lib;. Using again the criterion of increasing slopes for convex
functions we obtain that get) = ~;(llt) is convex on (0, lib;).
If M; (.) is convex on (xi, 00 ), that is, if bi = xi, then we have that
t - ~;(llt) is convex on (0, l/xt), which completes the proof. (Notice that
if MiO is convex on (0,00), that is, if bi = 0, then we have that t - ~i(llt)
is also convex on (0,00).) 0
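The increasing-slopes argument of the proof can be illustrated numerically; the function φ below is a hypothetical example with M(x) = a·x², which is convex on (0, ∞):

```python
# Numerical illustration of Lemma 4.2, under the hypothetical cost function
# phi(x) = (c^p + a*x**2)/x: the slopes s_g(t, t0) of g(t) = phi(1/t)
# increase in t, i.e. g is convex.
C_P, A = 4.0, 1.0

def phi(x):
    return (C_P + A * x * x) / x

def g(t):                 # g(t) = phi(1/t) = c^p * t + a/t
    return phi(1.0 / t)

def slope(t, t0):         # the slope s_g(t, t0) used in the proof
    return (g(t) - g(t0)) / (t - t0)

t0 = 0.5
slopes = [slope(t, t0) for t in (0.6, 0.8, 1.0, 1.5)]
assert all(s1 <= s2 for s1, s2 in zip(slopes, slopes[1:]))  # increasing slopes
```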
We can now apply the above results to the special cases of Goyal et al.,
the minimal-repair model and the inspection model.
Theorem 4.1. If each (P_i), i = 1, ..., n, has a finite solution x_i^* > 0 and is
formulated according to one of the special cases of Goyal et al., the minimal-
repair model with a unimodal rate of occurrence of failures or the inspection
model, then problem (Prel) is equivalent with problem (R) and (R) is a convex-
programming problem.
Proof. In the proof of Theorem 3.1 we showed that for the minimal-repair
model with a unimodal rate of occurrence of failures the function M_i(·) is
convex on (x_i^*, ∞). In case of an increasing rate of occurrence of failures,
M_i(·) is even convex on (0, ∞), and thus a fortiori convex on (x_i^*, ∞). We
also showed that for the special cases of Goyal et al. and the inspection model
the function M_i(·) is convex on (0, ∞), so that M_i(·) is a fortiori convex
on (x_i^*, ∞). Consequently, if for each i = 1, ..., n one of the above models is
used (possibly different models for different i), then by Lemma 4.2 the cor-
responding g_i^(R)(·) are convex, so that problem (R) is a convex-programming
problem. □
In Figure 4.2 an example of the objective function of problem (R) is given.
We can now explain why we applied in the previous section the trans-
formation of T into 1/T in the original optimisation problem (2.4). We saw
that (R) is a convex-programming problem if each function g_i^(R)(·) is con-
vex on (0, ∞). In the proof of Lemma 4.2 we showed that this is the case
if each function t → φ_i(1/t) is convex on (0, 1/x_i^*). We showed furthermore
that the function t → φ_i(1/t) is convex on (0, 1/x_i^*) if M_i(·) is convex on
(x_i^*, ∞) (which is generally the case for the models described before). If we
did not apply the transformation of T into 1/T, we would obtain that the
corresponding relaxation is a convex-programming problem only if each func-
tion φ_i(·) is convex on (x_i^*, ∞). This is a much more restrictive condition and
it is in general not true (not even for the models mentioned before). Sum-
marising, the transformation of T into 1/T causes the relaxation to be a
convex-programming problem for the models described before, a result that
otherwise does not generally hold.
If (R) is a convex-programming problem, it can easily be solved to op-
timality. When the functions g_i^(R)(·) are differentiable (which is the case if
Fig. 4.2. An example of the objective function of problem (R).
the functions φ_i(·) are differentiable), we can set the derivative of the cost
function in (R) equal to zero and subsequently find an optimal solution with
the bisection method. When the functions g_i^(R)(·) are not differentiable, we
can apply a golden-section search. (For a description of these methods, see
Chapter 8 of Bazaraa et al. 1993.)
To apply these procedures it is necessary to have a lower and an upper
bound on an optimal value T(R). If we assume again without loss of generality
that 1/x_n^* ≤ 1/x_{n−1}^* ≤ ... ≤ 1/x_1^*, then for any optimal T(R) of (R) it follows
by Lemma 3.2 that 0 < T(R) ≤ 1/x_1^*.
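The bisection on the derivative can be sketched as follows, for two hypothetical components with φ_i(x) = (c_i^p + x²)/x (illustrative data, not the chapter's):

```python
import math

# Sketch: solve the convex problem (R) by bisection on the numerical
# derivative of T -> S*T + sum_i g_i^(R)(T) over (0, 1/x_1^*].
S = 1.0
CPS = [4.0, 9.0]                          # assumed c_i^p values (a_i = 1)
X_STAR = [math.sqrt(cp) for cp in CPS]    # minimisers x_i^* = 2 and 3

def phi(x, cp):
    return (cp + x * x) / x

def g_R(T, i):
    if T <= 1.0 / X_STAR[i]:
        return phi(1.0 / T, CPS[i])
    return phi(X_STAR[i], CPS[i])

def obj(T):
    return S * T + sum(g_R(T, i) for i in range(len(CPS)))

def d_obj(T, h=1e-7):                     # central-difference derivative
    return (obj(T + h) - obj(T - h)) / (2 * h)

lo, hi = 1e-3, 1.0 / min(X_STAR)          # 0 < T(R) <= 1/x_1^*
while hi - lo > 1e-9:
    mid = (lo + hi) / 2
    if d_obj(mid) < 0:
        lo = mid
    else:
        hi = mid
T_R = (lo + hi) / 2                       # minimiser of the convex objective
```

For this toy instance the objective equals 5T + 1/T + 6 near its minimum, so the bisection converges to T(R) = 1/√5, inside the interval (0, 1/x_1^*] = (0, 0.5].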
Once (R) is solved we have an optimal T(R). If additionally (R) is a
convex-programming problem, it is possible to derive an easy dominance
result for the optimal solution of (P). In order to do so, we first need the
following lemma.
Lemma 4.3. If T(R) ≤ 1/x_n^* is an optimal solution of problem (R) (and
Assumption 4.1 holds), then (T(R), 1, ..., 1) is an optimal solution of (Pc)
and of (P). Moreover, if there does not exist an optimal solution of (R) within
the interval (0, 1/x_n^*), then any optimal solution T(P) of (P) is bounded from
below by 1/x_n^*.
Proof. Since T(R) ≤ 1/x_n^* is an optimal solution of problem (R) it follows by
Assumption 4.1 that the optimal scalars k_i(T(R)), i = 1, ..., n, are equal to
one and so (T(R), k_1(T(R)), ..., k_n(T(R))) is also a feasible solution of prob-
lem (Pc) and (P). Hence we obtain that v(R) = S T(R) + Σ_{i=1}^n φ_i(1/T(R)) ≥
v(P), and this yields by Lemma 3.1 that v(R) = v(Pc) = v(P), implying that
(T(R), 1, ..., 1) is also an optimal solution of (Pc) and of (P).
To prove the second part, observe, since the functions g_i(·) and g_i^(R)(·)
(see (3.1) and (4.1)) are identical on (0, 1/x_n^*] (by Assumption 4.1), that
T < 1/x_n^* is a local optimal solution of problem (R) if and only if T is a local
optimal solution of problem (P). Hence, if there is no local optimal solution
of (R) within the interval (0, 1/x_n^*), then there is no local optimal solution of
problem (P) within (0, 1/x_n^*), and this yields T(P) ≥ 1/x_n^*. □
If (R) is a convex-programming problem, then Lemma 4.3 yields the fol-
lowing result. If T(R) ≤ 1/x_n^*, then T(R) is an optimal solution of (Pc) and
(P) and the optimal scalars k_i, i = 1, ..., n, are equal to one. If T(R) > 1/x_n^*,
we evaluate the objective function of (R) in 1/x_n^* and if this value equals v(R),
then 1/x_n^* is also an optimal solution of (R), so that by Lemma 4.3 it follows
that 1/x_n^* is an optimal solution of (Pc) and (P) as well. Finally, if the ob-
jective function of (R) in 1/x_n^* is larger than v(R), then there does not exist
a local optimal solution of (R) within (0, 1/x_n^*) (since the objective func-
tion of (R) is convex) and thus it follows by Lemma 4.3 that T(P) ≥ 1/x_n^*.
Consequently, we have shown the following corollary.
Corollary 4.1. Suppose (R) is a convex-programming problem. If T(R) >
1/x_n^* and the objective function of (R) in 1/x_n^* is larger than v(R), then for
any optimal solution T(P) of (P) it follows that T(P) ≥ 1/x_n^*. Otherwise,
an optimal T(P) is given by T(P) = min{1/x_n^*, T(R)}.
Observe for T(R) > 1/x_n^* that T(R) may not be an optimal solution
of problem (P). Besides, the values of k_i corresponding with T(R) are not
necessarily integer, implying that the optimal solution of (R) is in general not
feasible for (P) when T(R) > 1/x_n^*. Consequently, the first thing to do when
T(R) > 1/x_n^* is to find a feasible solution for (P) (which is consequently also
a feasible solution for problem (Pc)).
A straightforward way of finding a feasible solution for (Pc) and (P) is
to substitute the value of T(R) in (3.1). This is specified by the following
Feasibility Procedure (FP).
Feasibility Procedure
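The substitution step can be sketched as follows; the cost functions and the value of T(R) are hypothetical, and the integer choice exploits the unimodality of φ_i with respect to x_i^*:

```python
import math

# Sketch of the Feasibility Procedure: fix T(R) from the relaxation and pick
# for each component the best integer multiplier k_i, i.e. the integer k >= 1
# minimising phi_i(k / T(R)).  By unimodality of phi_i it suffices to compare
# the two integers around T(R) * x_i^*.  All data here are hypothetical.
def phi(x, cp):                        # assumed form (c_i^p + x^2)/x
    return (cp + x * x) / x

def k_FP(T_R, cp):
    x_star = math.sqrt(cp)             # minimiser of phi
    k = max(1, math.floor(T_R * x_star))
    return min(k, k + 1, key=lambda j: phi(j / T_R, cp))

T_R = 0.45
CPS = [4.0, 9.0, 100.0]
ks = [k_FP(T_R, cp) for cp in CPS]     # feasible integer multipliers
```

The resulting (T(R), k_1(FP), ..., k_n(FP)) is feasible for (P), and its objective value is an upper bound on v(P).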
Improved-Feasibility Procedure
1. Let k_i(IFP) = k_i(FP), i = 1, ..., n, with k_i(FP) the values given by
the Feasibility Procedure FP, and let

    v(FP) = S T(R) + Σ_{i=1}^n g_i(T(R))

be the corresponding objective value.
2. Solve the optimisation problem

    inf{ ST + Σ_{i=1}^n φ_i(k_i(IFP)/T) : T > 0 },

i.e., determine the best value of T given the fixed integers k_i(IFP).
Lemma 4.4. Consider the optimisation problem (P_1) given by

    v(P_1) = inf{ ST + Σ_{i=1}^n φ_i(1/T) : T > 0 },

with v(P_1) the optimal objective value and T(P_1) an optimal T. If for each
i = 1, ..., n the function M_i(·) is convex and differentiable on (0, ∞), and
for at least one i ∈ {1, ..., n} the function M_i(·) is strictly convex on (0, ∞),
and the differentiable convex-programming problem (R) has no global optimal
solution within (0, 1/x_n^*), then T(P) ≥ T(P_1) ≥ 1/x_n^*.
Proof. If there does not exist a global optimal solution of (R) in (0, 1/x_n^*),
then it can be shown analogously to Lemma 4.3 that T(P_1) ≥ 1/x_n^*.
To prove the inequality T(P) ≥ T(P_1), notice first that (P_1) equals the
optimisation problem (P) when all k_i are fixed to the value 1. Consequently,
(P_1) is a more restricted problem than (P) and it is easy to verify that
v(P) ≤ v(P_1). Furthermore, if T(P) and certain values of k_i are optimal
for (P), then it is easy to see that if the functions φ_i(·) are differentiable the
following holds:

    S = Σ_{i=1}^n (k_i/T(P)²) φ_i'(k_i/T(P)),

so that

    v(P) = S T(P) + Σ_{i=1}^n φ_i(k_i/T(P)) = Σ_{i=1}^n [ (k_i/T(P)) φ_i'(k_i/T(P)) + φ_i(k_i/T(P)) ].

It is easily verified that

    x φ_i'(x) + φ_i(x) = M_i'(x),

so that

    v(P) = Σ_{i=1}^n M_i'(k_i/T(P)).    (4.4)
Analogously, it can be shown for the optimal objective value of (P_1) that

    v(P_1) = Σ_{i=1}^n M_i'(1/T(P_1)).    (4.5)
Suppose now that the inequality T(P) ≥ T(P_1) does not hold, that is, T(P) <
T(P_1). Since the functions M_i(·) are (strictly) convex and, consequently, the
functions M_i'(·) are (strictly) increasing, this implies that (use (4.4) and (4.5))

    v(P) = Σ_{i=1}^n M_i'(k_i/T(P))
         ≥ Σ_{i=1}^n M_i'(1/T(P))
         > Σ_{i=1}^n M_i'(1/T(P_1))
         = v(P_1),

which is in contradiction with v(P) ≤ v(P_1). Hence, T(P) ≥ T(P_1). □
A rough upper bound on T(P) is obtained by the following lemma.
Lemma 4.5. For an optimal T(P) of (P) it holds that

    T(P) ≤ (1/S){ v(FP) − Σ_{i=1}^n φ_i(x_i^*) }.

This follows since v(FP) ≥ v(P) = S T(P) + Σ_{i=1}^n g_i(T(P)) ≥ S T(P) +
Σ_{i=1}^n φ_i(x_i^*). If (R) is a convex-programming problem, let furthermore T_up
be the smallest T ≥ T(R) for which the objective function of (R) equals v(FP);
then T_up is also an upper bound on an optimal T(P). Indeed, in
T = (1/S){ v(FP) − Σ_{i=1}^n φ_i(x_i^*) } we have

    S T + Σ_{i=1}^n g_i^(R)(T) ≥ S T + Σ_{i=1}^n φ_i(x_i^*) = v(FP) = S T_up + Σ_{i=1}^n g_i^(R)(T_up).

That is, in T = (1/S){ v(FP) − Σ_{i=1}^n φ_i(x_i^*) } the objective function of (R)
is not smaller than in T_up. Since (R) is a convex-programming problem
and T_up is the smallest T ≥ T(R) for which the objective function of (R)
equals v(FP), we have that T ≥ T_up. □
Notice that the upper bound T_up can easily be found with a bisection on the
interval [T(R), (1/S){ v(FP) − Σ_{i=1}^n φ_i(x_i^*) }].
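The bisection for T_up can be sketched on a toy convex objective; the function, its minimiser and the value v(FP) below are illustrative, not the chapter's:

```python
import math

# Sketch of locating T_up: right of its minimiser T(R), the convex objective
# of (R) increases and crosses v(FP) exactly once; bisection finds the
# crossing point, i.e. the smallest T >= T(R) where the two values are equal.
def obj_R(T):                      # assumed convex objective of (R)
    return 5.0 * T + 1.0 / T + 6.0

T_R = 1.0 / math.sqrt(5.0)         # its minimiser
V_FP = 11.0                        # assumed feasible value, obj_R(T_R) < V_FP

lo, hi = T_R, 5.0                  # assumed right end beyond the crossing
while hi - lo > 1e-10:
    mid = (lo + hi) / 2
    if obj_R(mid) < V_FP:
        lo = mid
    else:
        hi = mid
T_up = (lo + hi) / 2               # obj_R(T_up) = V_FP
```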
It cannot generally be proved that the objective function of (R) is equal
to v(FP) for a value of T ≤ T(R), but if it is, we analogously have a lower
bound T_low on T(P).
Lemma 4.7. If there is a T ≤ T(R) for which the objective function of (R)
is equal to v(FP), let then T_low be the largest T ≤ T(R) for which this holds.
If (R) is a convex-programming problem then T_low is a lower bound on T(P).
Proof. For values of T < T_low the objective function of (R) is larger than or
equal to v(FP), since (R) is a convex-programming problem and the min-
imum is obtained in T(R). Since (R) is a relaxation of (P), the objective
function of (P) is also larger than or equal to v(FP) for values of T < T_low,
so that T_low is a lower bound on T(P). □
Fig. 4.3. A lower bound T_low and an upper bound T_up on an optimal T(P) are
found where the objective function of relaxation (R) equals v(FP), the value of the
objective function of problem (P) in T(R).
In Figure 4.3 it is illustrated how the bounds T_low and T_up are generated.
If (R) is a convex-programming problem and the lower bound T_low exists,
then it can easily be found as follows. We first check whether T_low ≥ 1/x_n^*,
with 1/x_n^* the lower bound given by Corollary 4.1. To this end we compute the
objective function of (R) in 1/x_n^* and check whether it is smaller than v(FP).
If so, then T_low < 1/x_n^* and otherwise T_low ≥ 1/x_n^*. In the latter case we can
easily find T_low with a bisection on the interval [1/x_n^*, T(R)].
Notice that if (R) is a convex-programming problem, it can be useful to
apply the IFP. In that case the bounds T_up and T_low derived above may
be improved when the objective value v(FP) is replaced by v(IFP), since
v(IFP) ≤ v(FP).
In this subsection we derived a number of lower and upper bounds on
T(P). The results are summarised in Table 4.1.
From Table 4.1 we can find the bounds that can be used, dependent on
certain conditions. For example, for the special cases of Goyal et al., the
minimal-repair model with a unimodal rate of occurrence of failures and the
inspection model, we showed in Theorem 4.1 that (R) is a convex-program-
ming problem. This is already sufficient to use all bounds of Table 4.1, except
the lower bound T(P_1). To use the bound T(P_1), each M_i(·) must be con-
vex on (0, ∞) and at least one M_i(·) must be strictly convex. We showed in
the proof of Theorem 3.1 that each M_i(·) is convex on (0, ∞) for the mod-
els described above (with an increasing rate of occurrence of failures for the
minimal-repair model). For the special cases of Goyal et al. each M_i(·) is
even strictly convex on (0, ∞), so that the bound T(P_1) can then always be
used. For the minimal-repair and inspection model at least one M_i(·) must
be strictly convex.
Let now T_l be the largest lower bound and T_u be the smallest upper bound
that can be used for a specific problem; then we have that T(P) ∈ [T_l, T_u].
Consequently, it is sufficient to apply a global-optimisation technique on the
interval [T_l, T_u] to find a value for T(P).
Lipschitz Optimisation
Efficient global-optimisation techniques exist for the case that the objective
function of (P) is Lipschitz. A univariate function is said to be Lipschitz if
for each pair x and y the absolute difference of the function values in these
points is smaller than or equal to a constant (called the Lipschitz constant)
multiplied by the absolute distance between x and y. More formally: a function
f is Lipschitz with constant L if |f(x) − f(y)| ≤ L|x − y| for all x and y.
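A minimal sketch of what the Lipschitz property buys: with a valid constant L, a grid search with step h = 2ε/L is guaranteed to come within ε of the global minimum. The function and constant below are illustrative only:

```python
# Lipschitz global minimisation on [a, b]: if |f(x) - f(y)| <= L*|x - y|,
# evaluating f on a grid of step h yields a value within L*h/2 of the true
# global minimum.  f and L here are illustrative, not the chapter's objective.
f = lambda x: 5.0 * x + 1.0 / x      # |f'(x)| = |5 - 1/x^2| <= 5 on [1, 3]
L_CONST = 5.0                        # a valid Lipschitz constant on [1, 3]
a, b, eps = 1.0, 3.0, 1e-3
h = 2.0 * eps / L_CONST              # grid step guaranteeing error <= eps
n = int((b - a) / h)
best = min(f(a + i * h) for i in range(n + 1))
```

More refined methods (e.g. Piyavskii's sawtooth algorithm) use the same bound adaptively, with local or dynamically updated Lipschitz constants, and evaluate far fewer points.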
We can also apply alternative methods that do not use the notion of Lipschitz
optimisation. One such method is golden-section search. Golden-section
search is usually applied (and is optimal) for functions that are strictly uni-
modal, which the objective function of (P) generally is not. However, we will
apply an approach in which the interval [T_l, T_u] is divided into a number
of subintervals of equal length, on each of which a golden-section search is
applied. The best point of these intervals is taken as the solution. We then divide
the subintervals into intervals that are half as long and apply a golden-section
search on each again. The doubling of the number of intervals is repeated until
no improvement is found. We refer to this approach as the multiple-interval
golden-section search heuristic, the results of which are given in Section 5.
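The heuristic described above can be sketched as follows; the objective f is an illustrative multimodal function, not the chapter's cost function:

```python
import math

# Multiple-interval golden-section search: split [lo, hi] into m equal
# subintervals, run a golden-section search on each, keep the best point,
# then double m until no further improvement is found.
GR = (math.sqrt(5.0) - 1.0) / 2.0            # golden-ratio conjugate

def golden_section(f, a, b, tol=1e-7):
    c, d = b - GR * (b - a), a + GR * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - GR * (b - a)
        else:
            a, c = c, d
            d = a + GR * (b - a)
    return 0.5 * (a + b)

def multi_interval_gss(f, lo, hi, m=4, tol=1e-7):
    def best_of(m):
        w = (hi - lo) / m
        return min((golden_section(f, lo + j * w, lo + (j + 1) * w, tol)
                    for j in range(m)), key=f)
    best = best_of(m)
    while True:
        m *= 2
        cand = best_of(m)
        if f(cand) >= f(best) - 1e-9:        # no improvement: stop
            return best
        best = cand

# Illustrative multimodal objective with local minima near x = 1 and x = 3.
f = lambda x: (x - 1.0) ** 2 * (x - 3.0) ** 2 + 0.1 * x
x_best = multi_interval_gss(f, 0.0, 4.0)
```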
5. Numerical Results
In this section the solution procedure for (P) described in the previous section
will be investigated and compared with the iterative approach of
Goyal et al. This will first be done for the special case of Goyal and Kusy,
the minimal-repair model with an increasing rate of occurrence of failures,
and the inspection model, in which cases an optimal value v(P) of (P) can
be found by Lipschitz optimisation. This makes it possible to make a good
comparison and also to investigate the performance of the multiple-interval
golden-section search heuristic. Subsequently, the performance of the solution
procedure for the block-replacement model is investigated, using the golden-
section search heuristic. All algorithms are implemented in Borland Pascal
version 7.0 on a 66 MHz personal computer.
By considering the gap between v(R) and v(P) we are by Lemma 3.1 able
to say something about the optimal objective value v(Pc) of (Pc). We will
not investigate problem (Pc) any further, since incorporation of the correction
factor Δ(k) in a solution procedure is too time consuming.
For all models we have six different values for the number n of components
and seven different values for the set-up cost S. This yields forty-two different
combinations of n and S, and for each of these combinations a hundred random
problem instances are taken by choosing random values for the remaining
parameters. For the minimal-repair, inspection and block-replacement model
the lifetime distribution of component i is given by a Weibull-(λ_i, β_i) dis-
tribution (a Weibull-(λ, β) distributed random variable has cumulative
distribution function F(t) = 1 − e^{−(t/λ)^β}). The data are summarised in Ta-
ble 5.1.
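The instance generation can be sketched as follows; the parameter ranges are assumptions for illustration, since Table 5.1 is not reproduced here:

```python
import math
import random

# Sketch of the random instance generation: component lifetimes follow a
# Weibull-(lambda_i, beta_i) distribution with F(t) = 1 - exp(-(t/lambda)^beta).
def weibull_cdf(t, lam, beta):
    return 1.0 - math.exp(-((t / lam) ** beta))

def random_instance(n, seed=0):
    rng = random.Random(seed)
    return [{"lam": rng.uniform(1.0, 20.0),   # assumed scale range
             "beta": rng.uniform(1.5, 4.0)}   # beta_i > 1: increasing ROCOF
            for _ in range(n)]
```

Keeping β_i > 1 ensures an increasing rate of occurrence of failures, as required for the minimal-repair model.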
Results for the special case of Goyal and Kusy, the minimal-repair
model and the inspection model
For the special case of Goyal and Kusy, the minimal-repair model and the
inspection model, the value v(P) can be determined by Lipschitz optimisation
with an arbitrary deviation from the optimal value; we allowed a relative
deviation of 10^{−4} (i.e., 0.01%). In Table 5.2 the relevant results of the 4200
problem instances for each model are given.
Notice first that from this table it follows that the difference between the
relaxed solution v(R) and the optimal objective value v(P) of problem (P)
is not very large. On average the gap is approximately one per cent or less
and the maximum deviation is 5.566% for the model of Goyal and Kusy and
with μ_i and σ_i the expectation and the standard deviation of the lifetime distribution
of component i. Notice that for the inspection model we take c_i^f ≥ c_i^p/μ_i + 1 and for
the block-replacement model c_i^f ≥ 2c_i^p/(1 − σ_i²/μ_i²) + 1. This guarantees the existence
of a finite minimum x_i^* for the individual average-cost function φ_i(·). In Dekker (1995)
it is shown that for the inspection model a finite minimum for φ_i(·) exists if
c_i^p < c_i^f μ_i, and, a fortiori, if c_i^f ≥ c_i^p/μ_i + 1. For the block-replacement model it can
be shown (see also Dekker 1995) that a finite minimum exists if c_i^f > 2c_i^p/(1 − σ_i²/μ_i²).
Notice finally that since β_i > 1, the rate of occurrence of failures for the minimal-repair
model is increasing.
even smaller for the other models. By Lemma 3.1 we have that the optimal
objective value v(Pc) of problem (Pc) will deviate even less from v(R). This
implies that if one wants to find a solution to problem (Pc), it is better to solve
the easier problem (P) first. Since the gap between v(P) and v(R) is often
small, this yields a solution that will in most cases suffice. Only when the
gap is considered not small enough, one can subsequently apply a heuristic
to problem (Pc) to try to find an objective value that is smaller than v(P).
From the table it can be seen that solving the relaxation takes very little
time. A subsequent application of the FP requires only one function evalu-
ation for each component and this takes a negligible amount of time, which
is why for the FP no running times are given in Table 5.2. Applying the
IFP also takes little time. (All running times in Table 5.2 are higher for
the inspection model than for the special case of Goyal and Kusy and the
minimal-repair model, since for the inspection model a numerical routine has
to be applied for each function evaluation, whereas for the other two models
the cost functions can be computed analytically.) Notice that some deviations
are negative. This is due to the relative deviation of 0.01 % allowed in the op-
timal objective value determined by the Lipschitz optimisation; a heuristic
can give a solution with an objective value up to 0.01% smaller than that
according to the Lipschitz-optimisation procedure.
As can be expected, the algorithm of Goyal and Kusy outperforms the
algorithm of Goyal and Gunasekaran. This is explained from the fact that
Goyal and Kusy take the optimal ki given a value of T, whereas Goyal and
Table 5.2. Results of 4200 Random Examples for the Special Case of Goyal and
Kusy, the Minimal-Repair Model and the Inspection Model
Gunasekaran take for each ki the rounded optimal real value. However, the
differences between the two algorithms are small.
The feasible solution corresponding with the relaxation (i.e., obtained by
application of the FP) is in most cases better than that of the algorithms
of Goyal et al. Only for the special case of Goyal and Kusy does the FP
perform somewhat worse. For the minimal-repair and inspection model the FP
performs much better.
In all cases the IFP (that is, an intelligent modification of the approach
of Goyal et al.) outperforms the iterative algorithms of Goyal et al., while
the running times of the IFP are equal or shorter. The differences are smallest
for the special case of Goyal and Kusy. This can be explained from the fact
that in the model of Goyal and Kusy there is little variance possible in the
lifetime distributions of the components, mainly because the exponent e has
to be the same for all components. In the inspection model, however, there
can be large differences in the individual lifetime distributions, and this can
cause much larger deviations for the iterative algorithms of Goyal et al.; the
average deviation for Goyal and Kusy's algorithm is then 1.253% and the
maximum deviation even 66.188%, which is much higher than the deviations
for the IFP. The IFP performs well for all models.
Since for many examples the algorithms of Goyal et al. and the IFP find
the optimal solution, the average deviations of these algorithms do not differ
so much (in many cases the deviation is zero per cent). However, there is
a considerable difference in the number of times that large deviations were
generated. This is illustrated in Table 5.3, which gives the percentage of the
examples in which the IFP and the algorithm of Goyal and Kusy had a devia-
tion larger than 1% and 5% for the three models discussed in this subsection.
From this table it is clear that the IFP performs much better than the algo-
Table 5.3. Percentage of the Examples Where the IFP and the Algorithm of Goyal
and Kusy Generated Deviations of More Than 1% and 5%.

Algorithm                        Deviation > 1%   Deviation > 5%
Special Case of Goyal and Kusy
  IFP                                 12.86            1.79
  Goyal and Kusy                      27.50            2.10
Minimal-Repair Model
  IFP                                  1.57
  Goyal and Kusy                      12.38            1.64
Inspection Model
  IFP                                  3.12            0.05
  Goyal and Kusy                      26.50            6.69
rithm of Goyal and Kusy and that if the algorithm of Goyal and Kusy does
not give the optimal solution, the deviation can be large. The conclusion is
that solving the relaxation and subsequently applying the improved feasibility
procedure is better than and at least as fast as the iterative algorithms of Goyal
et al. This also implies that the algorithms of Goyal et al. can be improved
considerably if another initialisation of the k_i and T is taken, viz. according
to the solution of the relaxation.
The deviation of 66.188% in Table 5.2 occurs for one of the problem
instances of the inspection model with n = 5 and S = 10. The parameters and
results are given in Table 5.4. The large deviation for the algorithm of Goyal
and Kusy can be explained as follows. In the first iteration of the algorithm all
k_i are initialised at the value one. The corresponding T is then determined; it
equals 5.87. In the following iteration it is investigated for each component i
Table 5.4. Parameters and Results for the Problem Instance of the Inspection
Model for Which the Algorithm of Goyal and Kusy Performs Worst

S = 10
Component    c_i^p      c_i^f     λ_i    β_i    x_i^*
1           247.00    962.00      1     3.50    0.83
2           472.00    475.00      9     3.45    5.99
3           344.00    511.00     20     1.71    7.04
4           459.00    528.00     14     3.90    8.45
5           225.00    541.00     17     2.47    6.45

Solution of the algorithm of Goyal and Kusy:
T = 5.87, k_i = 1, 1, 1, 1, 1
corresponding objective value v(GK) = 1173.77
100% × (v(GK) − v(P))/v(P) = 66.188%
Optimal solution: T = 0.85
whether a larger integer value for k_i given T = 5.87 yields lower individual
average costs. This is not the case, as can also be expected considering the
individual x_i^* in the last column of Table 5.4. Take, for example, k_2 = 2 for
component 2. This implies that component 2 is inspected every 2 × 5.87 =
11.74 time units, whereas its optimal inspection interval has length x_2^* =
5.99. The value 5.87 turns out to be a better alternative than 11.74, which
also turns out to be the case for the other components. Consequently, the
algorithm terminates with T = 5.87 and all k_i equal to one. For component 1
this implies that it is inspected every 5.87 time units whereas the optimal
inspection interval has length 0.83. Since for component 1 the failure cost c_1^f
per unit time is relatively large, this implies a large deviation; the individual
average-cost function of component 1 is relatively steep. It would be much
better to take a smaller T and to increase the k_i for components 2, 3, 4, 5
accordingly, which is indeed reflected by the optimal T, which equals 0.85.
From the results of Table 5.2 it can further be seen that the multiple-
interval golden-section search heuristic performs very well in all cases. The
average deviation is almost zero, and the maximum deviation is relatively
small. The heuristic is initialised with four subintervals and this number is
doubled until no improvement is found. It turned out that four subintervals
are mostly sufficient. The running time of the heuristic is also quite moderate: less
than a second for the special case of Goyal and Kusy and the minimal-repair
model, and almost 12 seconds for the inspection model (where a numerical
routine has to be applied for each function evaluation). This is not much
compared to, for example, the algorithms of Goyal et al.
Usually, Lipschitz optimisation can take much time. For the special cases
in this subsection, Lipschitz optimisation can be made much faster by ap-
plying a dynamic Lipschitz constant. As can be
seen from this table, the running time increases somewhat more than linearly
in the number n of components and decreases in the set-up cost S. The almost
linear increase is a nice result when it is considered that Lipschitz
optimisation is an optimal solution procedure and that alternative optimal
procedures published so far in the literature (see, for example, Goyal 1974 in
the inventory context) involve only enumeration methods with exponentially
growing running times. The fact that the running time decreases if S increases
is due to a steeper objective function for larger S. A larger S causes smaller
upper bounds for T(P) and, as a result, smaller intervals on which Lips-
chitz optimisation has to be applied. The running time also depends on the
precision that is required. For less precision Lipschitz optimisation becomes
much faster. Future generations of computers will make the advantage of the
golden-section search heuristic over Lipschitz optimisation less important.
We can conclude that if a solution is required in little time, we can solve
the relaxation and apply the improved feasibility procedure to obtain a so-
lution with a deviation of less than one per cent on average. The improved
feasibility procedure outperforms the algorithms of Goyal et al. not only in
time and average deviation, but the maximum deviation is also much smaller.
When precision is more important, we can apply the golden-section search
Table 5.6. Results of 4200 Random Examples for the Block-Replacement Model

Relaxation (R):
  Average running time relaxation (sec.)                 0.23
  Average deviation (R)   (v(P) − v(R))/v(R)             0.402%
  Minimum deviation (R)                                  0.000%
  Maximum deviation (R)                                  2.708%
Feasibility Procedure (FP):
  Average deviation FP    (v(FP) − v(P))/v(P)            0.196%
  Minimum deviation FP                                   0.000%
  Maximum deviation FP                                  12.217%
Improved Feasibility Procedure (IFP):
  Average running time IFP (sec.)                        1.30
  Average deviation IFP   (v(IFP) − v(P))/v(P)           0.051%
  Minimum deviation IFP                                 −0.002%
  Maximum deviation IFP                                  5.921%
Golden-Section Search (GSS):
  Average running time GSS (sec.)                       10.26
Goyal and Kusy (GK):
  Average running time GK (sec.)                         3.72
  Average deviation GK    (v(GK) − v(P))/v(P)            0.658%
  Minimum deviation GK                                  −0.222%
  Maximum deviation GK                                  39.680%
Goyal and Gunasekaran (GG):
  Average running time GG (sec.)                         3.54
  Average deviation GG    (v(GG) − v(P))/v(P)            0.943%
  Minimum deviation GG                                  −0.222%
  Maximum deviation GG                                  41.003%
From this table it follows again that the gap between v(R) and v(P)
is small: maximally 2.637% and only 0.399% on average. This implies that
also for the block-replacement model it is better to first solve problem (P)
rather than problem (Pc), since the solution thus obtained will in many cases be
sufficiently good. If the gap is not small enough, one can subsequently apply
a heuristic to problem (Pc).
The average running time of the relaxation is again very small. It is larger
than the average running time of, for example, the inspection model, since
golden-section search is not applied once but four times, according to one
iteration of the multiple-interval golden-section search heuristic.
Also in this case the algorithm of Goyal and Kusy outperforms the algo-
rithm of Goyal and Gunasekaran, though the differences are small. The FP
already outperforms the algorithms of Goyal et al. and the IFP performs even
better. The average deviation is 0.658% for the algorithm of Goyal and Kusy
and only 0.051% for the IFP. Besides, the maximum deviation for the IFP is
quite moderate, 5.921%, whereas for the algorithm of Goyal and Kusy this
can be as large as 39.680% (and for the algorithm of Goyal and Gunasekaran
even larger). It can happen that the algorithms of Goyal et al. sometimes
perform slightly better than the IFP, reflected in the minimum deviations of
-0.222% for the algorithms of Goyal et al. and -0.002% for the IFP.
The golden-section search heuristic applied to solve problem (P) needed
again four intervals in most cases. The average running time of the heuristic is
10.26 seconds, which is not much compared to, for example, the algorithms of
Goyal et al. Remember that the solutions of the algorithms of Goyal et al. and
of the IFP are compared with the solutions according to the golden-section
search heuristic. Notice that the negative deviations of −0.222% and −0.002%
imply that both the algorithms of Goyal et al. and the IFP can in some cases
be better than the golden-section search heuristic, though the differences are
small. This implies that the golden-section search heuristic is not optimal, but
that was already clear from the results in the previous subsection. However,
in most cases the heuristic is better than the other algorithms, as reflected in
the average deviations of 0.658% and 0.943% for the algorithms of Goyal et al.
and 0.051% for the IFP, compared to the heuristic.
The conclusion here is again that when a solution is required in little time,
we can solve the relaxation and apply the (improved) feasibility procedure;
this is better than the algorithms of Goyal et al. (especially the maximum
deviation is much smaller). When precision is more important, we can apply
the golden-section search heuristic, at the cost of somewhat more time.
6. Conclusions
In this chapter we presented a general approach for the coordination of main-
tenance frequencies. We extended an approach by Goyal et al. that deals
with components with a very specific deterioration structure and that does
not indicate how good the obtained solutions are. Extension of this approach
enabled incorporation of well-known maintenance models like minimal re-
pair, inspection and block replacement. We presented an alternative solu-
tion approach that can solve these models to optimality (except the block-
replacement model, for which our approach is used as a heuristic).
The solution of a relaxed problem followed by the application of a feasibil-
ity procedure yields a solution in little time and less than one per cent above
the minimal value. This approach outperforms the approach of Goyal et al.
When precision is more important, a fast heuristic based on golden-section
search can be applied to obtain a solution with a deviation of almost zero
per cent. For the special cases of Goyal et al., the minimal-repair model and
the inspection model, application of a procedure using a dynamic Lipschitz
constant yields a solution with an arbitrarily small deviation from an optimal
solution, with running times somewhat larger than those of the golden-section
search heuristic.
In the solution approach of this chapter many maintenance-optimisation
models can be incorporated. Not only the minimal-repair, inspection and
block-replacement models, but many others can be handled as well. It is
also easily possible to combine different maintenance activities, for example
to combine the inspection of a component with the replacement of another.
Altogether, the approach presented here is a flexible and powerful tool for
the coordination of maintenance frequencies for multiple components.
References
Avriel, M., Diewert, W.E., Schaible, S., Zang, I.: Generalized Concavity. New York:
Plenum Press 1988
Bäckert, W., Rippin, D.W.T.: The Determination of Maintenance Strategies for
Plants Subject to Breakdown. Computers and Chemical Engineering 9, 113-
126 (1985)
Barros, A.I., Dekker, R., Frenk, J.B.G., van Weeren, S.: Optimizing a General
Replacement Model by Fractional Programming Techniques. Technical Report.
Econometric Institute, Erasmus University Rotterdam (1995)
Bazaraa, M.S., Sherali, H.D., Shetty, C.M.: Nonlinear Programming Theory and
Algorithms. New York: Wiley 1993
Cho, D.I., Parlar, M.: A Survey of Maintenance Models for Multi-Unit Systems.
European Journal of Operational Research 51, 1-23 (1991)
Dagpunar, J.S.: Formulation of a Multi Item Single Supplier Inventory Problem.
Journal of the Operational Research Society 33, 285-286 (1982)
Dekker, R.: Integrating Optimisation, Priority Setting, Planning and Combining of
Maintenance Activities. European Journal of Operational Research 82, 225-240
(1995)
Dekker, R., Frenk, J.B.G., Wildeman, R.E.: An Efficient Optimal Solution Method
for the Joint Replenishment Problem. European Journal of Operational Re-
search. To appear (1996)
278 Rommert Dekker et al.
Dekker, R., Roelvink, I.F.K.: Marginal Cost Criteria for Preventive Replacement of
a Group of Components. European Journal of Operational Research 84, 467-
480 (1995)
Goyal, S.K.: Determination of Economic Packaging Frequency for Items Jointly
Replenished. Management Science 20, 293-298 (1973)
Goyal, S.K.: Determination of Optimum Packaging Frequency of Items Jointly Re-
plenished. Management Science 21, 436-443 (1974)
Goyal, S.K.: A Note on Formulation of the Multi-Item Single Supplier Inventory
Problem. Journal of the Operational Research Society 33, 287-288 (1982)
Goyal, S.K., Gunasekaran, A.: Determining Economic Maintenance Frequency of a
Transport Fleet. International Journal of Systems Science 4, 655-659 (1992)
Goyal, S.K., Kusy, M.I.: Determining Economic Maintenance Frequency for a Fam-
ily of Machines. Journal of the Operational Research Society 36, 1125-1128
(1985)
Goyal, S.K., Satir, A.T.: Joint Replenishment Inventory Control: Deterministic and
Stochastic Models. European Journal of Operational Research 38, 2-13 (1989)
Hiriart-Urruty, J.-B., Lemarechal, C.: Convex Analysis and Minimization Algo-
rithms I: Fundamentals. A Series of Comprehensive Studies in Mathematics.
Vol. 305. Berlin: Springer 1993
Horst, R., Pardalos, P.M.: Handbook of Global Optimization. Dordrecht: Kluwer
1995
Howard, R.A.: Dynamic Programming and Markov Processes. New York: Wiley
1960.
Martos, B.: Nonlinear Programming: Theory and Methods. Budapest: Akadémiai
Kiadó 1975
Smeitink, E., Dekker, R.: A Simple Approximation to the Renewal Function. IEEE
Transactions on Reliability 39, 71-75 (1990)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Analysis and Computation of (n, N)-
Strategies for Maintenance of a Two-Component System. European Journal of
Operational Research 48, 260-274 (1990)
Van der Duyn Schouten, F. A.: Stochastic Models of Reliability and Maintenance:
An Overview. In this volume (1996), pp. 117-136
Van Egmond, R., Dekker, R., Wildeman, R.E.: Correspondence on: Determining
economic maintenance frequency of a transport fleet. International Journal of
Systems Science 26, 1755-1757 (1995)
Appendix
L = S + Σ_{i=1}^{n} L_i ,   (A.1)
with S the set-up cost. Consequently, we have to find an expression for L_i. To do so, consider an arbitrary i ∈ {1, ..., n} and determine which of the intervals I_i^{(k)} (see Section 3) overlap with the interval [T_l, T_u]. Clearly, this is the case for each k with ⌊T_l x_i^*⌋ ≤ k ≤ ⌊T_u x_i^*⌋. Now define L_i^{(k)} as the Lipschitz constant of g_i(·) on I_i^{(k)} for each of these k ≥ 1. If ⌊T_l x_i^*⌋ = 0, then let L_i^{(0)} be the Lipschitz constant of g_i(·) on [1/x_i^u, 1/x_i^*]. We will show that

L_i = max_k { L_i^{(k)} } ,   (A.2)

where k ranges from max{0, ⌊T_l x_i^*⌋} to ⌊T_u x_i^*⌋.
To prove this, observe first that if t_1, t_2 belong to the same interval I_i^{(k)}, then by definition

|g_i(t_1) − g_i(t_2)| ≤ L_i^{(k)} |t_1 − t_2| ≤ L_i |t_1 − t_2| .

If t_1, t_2 do not belong to the same interval, then assume without loss of generality that g_i(t_1) ≥ g_i(t_2). For t_1 < t_2 with t_1 belonging to I_i^{(k)} it then follows that

0 ≤ g_i(t_1) − g_i(t_2) ≤ g_i(t_1) − g_i((k+1)/x_i^*)
  ≤ L_i^{(k)} ((k+1)/x_i^* − t_1)
  ≤ L_i^{(k)} (t_2 − t_1)
  ≤ L_i |t_1 − t_2| .

The other case t_2 < t_1 can be derived in a similar way, and so we have shown that

|g_i(t_1) − g_i(t_2)| ≤ L_i |t_1 − t_2| ,

with L_i according to (A.2).
If we now find an expression for the Lipschitz constant L_i^{(k)}, then with (A.1) and (A.2) we have an expression for the Lipschitz constant L. In the proof of Lemma 4.2 we showed that if M_i(t) is convex on (0, ∞), then φ_i(1/t) is also convex on (0, ∞). We saw in the proof of Theorem 3.1 that M_i(t) is convex on (0, ∞) for the special cases of Goyal et al., the minimal-repair model with an increasing rate of occurrence of failures, and the inspection model. Consequently, for these models φ_i(1/t) is convex on (0, ∞). This implies that the derivative of the function φ_i(1/t) is increasing, and consequently we obtain that for all t_1 ≤ t_2 ∈ [1/x_i^u, 1/x_i^*]:

|g_i(t_1) − g_i(t_2)| = |φ_i(1/t_1) − φ_i(1/t_2)| ≤ [ − (d/dt) φ_i(1/t) |_{t=t_1} ] |t_1 − t_2| ,

so that

L_i^{(0)} = (x_i^u)^2 φ_i'(x_i^u) .   (A.3)
By the same argument we find, for k ≥ 1,

L_i^{(k)} = max{ ((k+1)/k)^2 (x_i^*)^2 φ_i'(((k+1)/k) x_i^*) , (k/(k+1))^2 (x_i^*)^2 φ_i'((k/(k+1)) x_i^*) } .   (A.4)
Notice that both arguments in (A.4) are decreasing in k since φ_i'(·) is increasing. This implies that L_i^{(k)} is maximal for the smallest admissible value of k. Consequently, (A.2) becomes

L_i = L_i^{(⌊T_l x_i^*⌋)} if ⌊T_l x_i^*⌋ ≥ 1 ,  and  L_i = max{ L_i^{(1)}, L_i^{(0)} } if ⌊T_l x_i^*⌋ = 0 .
1. Introduction
continuous distribution G_i(t) (i = 0, 1). The distributions are such that their hazard rate functions satisfy r_0(t) ≤ r_1(t), ∀ t ≥ 0.
In many cases it may be natural to consider C_1, ..., C_N to be exchangeable; in such cases T_1, ..., T_N are exchangeable as well.
The above can be seen as an appropriate probability model for describing
some of the situations in which infant mortality may be present.
In such situations, burn-in procedures are to be considered; in other words, it may be convenient to observe all the individuals for a while at the beginning of their life in order to overcome the problem of early failures. A related decision problem is that of optimally choosing the duration of the burn-in period.
The paper will be divided into three parts.
In the first part we study different aspects of the distribution of T_1, ..., T_N, namely the joint survival function, their univariate and multivariate conditional hazard rate functions, dependence properties, univariate and multivariate aging properties, and extendibility. An important aspect is that the joint distribution of T_1, ..., T_N is characterized by means of only N, G_i(t) (i = 0, 1), and the distribution of M = Σ_{j=1}^N C_j; it is interesting to study how the afore-mentioned properties are influenced by the choice of G_i and of the distribution of M. It will furthermore be of interest to study the evolution of the distribution of the number of weak individuals in the residual population during a life-testing experiment; this in turn will enable us to describe the evolution of the distribution of the residual lifetimes of surviving individuals. This study extends the one begun in Iovino and Spizzichino (1993). It will also achieve the goal of providing a tutorial presentation; indeed it allows us to illustrate a number of general concepts by showing how they are manifested in the case at hand.
The second part is devoted to a discussion of critical aspects. First of all we define the concepts of early failures and infant mortality and formulate the problem of optimally choosing the length of the burn-in period. The discussion aims to clarify the relationships among the present model, what is usually referred to as mixed populations, and more general situations where infant mortality can be present. Such a discussion is indeed needed, since confusion between different situations can arise. An interesting feature of these topics is the frequent presence of apparent paradoxes (e.g. those connected with observed decreasing failure rates in mixed populations). This calls for a precise statement of the model and a careful use of language. Our scheme aims to unify and put into precise terms different models used in various fields of application. It will particularly stress the primary role of the probability distribution of M in heterogeneous populations.
In the third part we study in some more detail the concepts introduced
before and discuss the problem of the optimal choice of the duration of the
burn-in test, presenting results concerning the model of heterogeneous populations. In the frame of this special model we develop some arguments introduced in Spizzichino (1991) with respect to sequential stopping procedures.
The computation of the optimal procedure for stopping the burn-in, however, is computationally demanding; for this reason it is convenient to consider also the concept of open loop feedback optimal stopping procedures. A
formal definition of the latter will then be given together with some heuristic
illustration.
As we shall see, a specific burn-in decision problem is determined by both
the structure of associated costs and the structure of the probability model.
The arguments to be presented can be used flexibly and be applied in many
different areas; special forms of the costs will be imposed by applications to
any specific field.
As mentioned, we shall in particular show examples of cost functions which describe the cases when individuals U_1, ..., U_N are devices to be used for building a coherent system.
As far as the probability-model structure is concerned, particular cases of special interest are those with exponential G_i(t) and those with a binomial distribution for M. The latter condition is equivalent to independence among T_1, ..., T_N and deserves special attention for the following two reasons:
- it has been often (sometimes implicitly) assumed in the past literature on
the subject;
- its treatment provides an introduction to open loop feedback optimal pro-
cedures for stopping the burn-in.
An example will be presented at the end of Section 4.
In this section we aim to carry out a study of the probability model for observable lifetimes of the cohort of individuals coming from two different subcohorts; in particular we shall point out that the distribution of the number M of individuals in the weak subcohort and the distributions G_i(t) of lifetimes in the two subcohorts influence dependence and aging properties of the lifetimes. Such properties have an impact on the form of the solution of the burn-in stopping problem.
We start by presenting the notation that will be used in the paper.
Let C ≡ (C_1, ..., C_N) be a vector of N exchangeable binary random variables and let M ≡ Σ_{i=1}^N C_i. Denote
284 Fabio Spizzichino
By a well-known fact about exchangeable events (see e.g. de Finetti 1970 and
Kendall 1967), it is
(2.2)
In particular, w^{(1)}(1) = Σ_{k=1}^N (k/N) w^{(N)}(k) = E(M)/N.
Let G_0, G_1 be given probability distributions on [0, +∞); we assume G_0, G_1 to be absolutely continuous and such that their respective failure rate functions r_i(t) = g_i(t)/Ḡ_i(t) (i = 0, 1) satisfy the inequality

(2.3)

Furthermore we assume

μ_i = ∫_0^∞ t g_i(t) dt < +∞ .
We consider T_1, ..., T_N to be non-negative random variables, which will be interpreted as lifetimes of individuals U_1, ..., U_N in our heterogeneous population P; we assume that, for c ≡ (c_1, ..., c_N) ∈ {0, 1}^N, it is

P{T_1 > t_1, ..., T_N > t_N | C_1 = c_1, ..., C_N = c_N} = ∏_{j=1}^N Ḡ_{c_j}(t_j) .   (2.4)

In other words, T_1, ..., T_N are conditionally independent given C_1, ..., C_N, each with conditional one-dimensional survival function equal to Ḡ_0 or to Ḡ_1, depending on the value taken by the corresponding C_j.
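The conditional-independence structure in (2.4) is easy to simulate; in the sketch below, the exponential choices for G_0, G_1 and the binomial distribution for M are illustrative assumptions only, not part of the model. The run also previews the burn-in effect discussed later: the fraction of weak units among survivors falls below the population fraction p.

```python
import random

def simulate_population(N, p, rate0, rate1, rng):
    """Draw C_1..C_N i.i.d. Bernoulli(p) (so that M ~ b(N, p)) and,
    conditionally on C, independent lifetimes with survival function
    Gbar_0 or Gbar_1 (here exponential with rates rate0 < rate1)."""
    C = [1 if rng.random() < p else 0 for _ in range(N)]
    T = [rng.expovariate(rate1 if c else rate0) for c in C]
    return C, T

rng = random.Random(1)
N, p, s = 10, 0.3, 1.0          # population size, weak fraction, burn-in length
weak_surv = tot_surv = 0
for _ in range(20000):
    C, T = simulate_population(N, p, rate0=0.2, rate1=2.0, rng=rng)
    for c, t in zip(C, T):
        if t > s:               # the unit survives the burn-in
            tot_surv += 1
            weak_surv += c
frac_weak = weak_surv / tot_surv  # fraction of weak units among survivors
```

With these rates the analytic survivor weak-fraction is p·e^{−2}/(p·e^{−2} + (1−p)·e^{−0.2}) ≈ 0.066, well below p = 0.3, as the simulation confirms.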
Under (2.4) and the assumption that C_1, ..., C_N are exchangeable, the joint distribution of T_1, ..., T_N turns out to be exchangeable as well, and it is completely characterized by G_0, G_1 and by the probabilities w^{(N)}(k) (k = 0, 1, ..., N); more precisely, as far as the joint survival function is concerned, we obtain, by combining (2.1) and (2.4):
Proposition 2.1.
(2.5)
A Probabilistic Model for Heterogeneous Populations 285
D[n; t; s] ≡ {T_{i_1} = t_1, ..., T_{i_n} = t_n ; T_{j_1} > s_1, ..., T_{j_{N−n}} > s_{N−n}}   (2.13)

where (I ≡ {i_1, i_2, ..., i_n}, J ≡ {j_1, j_2, ..., j_{N−n}}) is an arbitrary pair of complementary subsets of the index set {1, 2, ..., N} (possibly I = ∅ or J = ∅). The symbol D[0; s] will in particular stand for {T_1 > s_1, T_2 > s_2, ..., T_N > s_N}.
We are interested in studying the conditional distribution of the residual lifetimes T_{j_1} − s_1, T_{j_2} − s_2, ..., T_{j_{N−n}} − s_{N−n} given the history D[n; t; s] (of course in the case n < N). Taking into account the conditional independence of T_1, ..., T_N given C, we readily obtain

F̄^{(N−n)}(ξ | D[n; t; s]) ≡ P{T_{j_1} − s_1 > ξ_1, ..., T_{j_{N−n}} − s_{N−n} > ξ_{N−n} | D[n; t; s]}
= Σ_{c ∈ {0,1}^N} P{C = c | D[n; t; s]} ∏_{j ∈ J} Ḡ_{c_j}(s_j + ξ_j) / Ḡ_{c_j}(s_j)   (2.14)
D[h; t; s] ≡ {T_{(1)} = t_1, ..., T_{(h)} = t_h ; T_{j_1} > s, ..., T_{j_{N−h}} > s}

where 0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_h ≤ s, and h denotes the value at time s of the stochastic process

H_s ≡ Σ_{j=1}^N 1_{[T_j ≤ s]} .
However, a slightly different way to look at F̄^{(N−h)}(ξ | D[h; t; s]) may turn out to be more convenient. Let

M_0 ≡ M, the number of weak units in the population P at time s = 0;
M^{(s)} ≡ Σ_{r=1}^h C_{i_r}, the number of weak units among the units which failed up to time s;
M_s ≡ M − M^{(s)}, the number of weak units in the residual population at time s, ∀ s > 0;
N_s ≡ N − H_s, the total number of units in the residual population at time s, ∀ s > 0;
w_s^{(N−h)}(k | t) ≡ P{M_s = k | D[h; t; s]}, k = 0, 1, ..., N − h;
p_s^{(N−h)}(k | t) ≡ w_s^{(N−h)}(k | t) / C(N−h, k), k = 0, 1, ..., N − h,

with C(N−h, k) the binomial coefficient.
Note now that the probability model describing the population of the units surviving at the time s is analogous to the one of the original population P, only we must respectively replace

F̄^{(N−h)}(ξ | D[h; t; s]) = Σ_{c ∈ {0,1}^{N−h}} p_s^{(N−h)}(Σ_{i=1}^{N−h} c_i | t) ∏_{j=1}^{N−h} Ḡ_{c_j}(s + ξ_j) / Ḡ_{c_j}(s)   (2.18)
and, by analogy with (2.9), the one-dimensional conditional survival function of a single residual lifetime is obtained; the corresponding conditional failure rate is

lim_{ξ→0+} (1/ξ) P{T_j < s + ξ | D[h; t; s]} = lim_{ξ→0+} (1/ξ) {1 − F̄^{(1)}(ξ | D[h; t; s])}
= [E(M_s | D[h; t; s]) / (N − h)] r_1(s) + [(N − h − E(M_s | D[h; t; s])) / (N − h)] r_0(s) .   (2.20)
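For a single unit (N = 1, h = 0) the mixture-hazard identity behind (2.20) can be checked directly: the survivor's hazard is the posterior-weighted average of r_1(s) and r_0(s). The exponential subpopulations below are an assumption made for concreteness.

```python
import math

# Posterior weak-probability of a unit surviving to time s, and the
# resulting hazard; mirrors (2.20) with N = 1, h = 0.
# Exponential subpopulations with rates lam0 < lam1 are assumed.
p, lam0, lam1 = 0.3, 0.2, 2.0

def surv(s):     # unconditional survival: p*Gbar_1 + (1 - p)*Gbar_0
    return p * math.exp(-lam1 * s) + (1 - p) * math.exp(-lam0 * s)

def hazard(s):   # density over survival of the mixture
    dens = p * lam1 * math.exp(-lam1 * s) + (1 - p) * lam0 * math.exp(-lam0 * s)
    return dens / surv(s)

def w(s):        # P{C = 1 | T > s}
    return p * math.exp(-lam1 * s) / surv(s)

s = 1.7
mix = w(s) * lam1 + (1 - w(s)) * lam0   # w(s)*r_1 + (1 - w(s))*r_0
```

Here hazard(s) and mix agree exactly, and w(s) < p for s > 0, in line with the non-increasingness of the posterior weak-probability noted in Remark 2.1.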
Remark 2.2. Equation (2.20) might also be proven in a more formal way, by applying to our case general results about stochastic filtering of point processes (Brémaud 1982; see also Koch 1986 and Arjas 1992). First note that the two processes M_s and H_s have a crucial role. What we can observe is the evolution of H_s, while we are of course interested in estimating at any time s the actual value of M_s, which cannot be observed; the joint distribution of residual lifetimes, and then the future evolution of H_s, depend directly on M_s.
Before continuing, we pay further attention to the probabilities p_s^{(N−h)}(k | t) (k = 0, ..., N − h) entering in formula (2.18). The p_s^{(N−h)}(k | t)'s are in particular needed for the computation of E(M_s | D[h; t; s]), which appears in the expression (2.20) for the multivariate conditional hazard function and in the definition of the open loop feedback optimal procedures to be given in Section 4.
where we let

z(s) ≡ Ḡ_1(s) / Ḡ_0(s)  and  W(v, h) ≡ Σ_{m=0}^{N−h} p^{(N)}(v + m) C(N−h, m) [z(s)]^m ,

w_s^{(N−h)}(k | t) ≡ Σ_{v=0}^h P{M_s = k | M^{(s)} = v, H_s = h} P{M^{(s)} = v | D[h; t; s]} ,

whence, evaluating the term P{(M^{(s)} = v) ∩ (H_s = h)},

E(M_s | D[h; t; s]) = Σ_k k C(N−h, k) [z(s)]^k Σ_{v=0}^h p^{(N)}(k + v) W(v, h) P{M^{(s)} = v | D[h; t; s]} .   (2.22)
We now turn to the study of some aging and dependence properties of the joint distribution of the lifetimes T_1, ..., T_N; we want to point out that these properties are influenced by the distribution of M and by G_i(t) (i = 0, 1). On the other hand, as already mentioned, they have an influence on qualitative properties of the optimal procedures for stopping the burn-in test. Some precise results in this direction may be obtained in future research.
First we consider a result concerning aging properties of the one-dimensional marginal F̄^{(1)}. By taking into account condition (2.3) and Proposition 2.2, one readily obtains
Proposition 2.4.
(a) If Ḡ_0(s) and Ḡ_1(s) are DFR (Decreasing Failure Rate) then F̄^{(1)} is DFR.
(b) If Ḡ_0(s) and Ḡ_1(s) are NWU (New Worse than Used) then F̄^{(1)} is NWU.
whence the result follows. □

Proposition 2.5. It is
(a) Cov(T_1, T_2) = Cov(C_1, C_2)(μ_1 − μ_0)^2 ,
(b) F̄^{(2)}(s_1, s_2) − F̄^{(1)}(s_1) F̄^{(1)}(s_2) = Cov(C_1, C_2)[Ḡ_0(s_1) − Ḡ_1(s_1)][Ḡ_0(s_2) − Ḡ_1(s_2)] .
+ w^{(2)}(2) ∫_0^∞ ∫_0^∞ t_1 t_2 g_1(t_1) g_1(t_2) dt_1 dt_2
+ (1/2) w^{(2)}(1) { ∫_0^∞ ∫_0^∞ t_1 t_2 g_0(t_1) g_1(t_2) dt_1 dt_2 + ∫_0^∞ ∫_0^∞ t_1 t_2 g_0(t_2) g_1(t_1) dt_1 dt_2 }

P{C_i = 1 | D[0; s]}
= K Σ_c p^{(N)}(2 + Σ_{l≠i, l≠j} c_l) Ḡ_1(s_i) Ḡ_1(s_j) ∏_{l≠i, l≠j} Ḡ_{c_l}(s_l)
+ K Σ_c p^{(N)}(1 + Σ_{l≠i, l≠j} c_l) Ḡ_1(s_i) Ḡ_0(s_j) ∏_{l≠i, l≠j} Ḡ_{c_l}(s_l) .

Let us rewrite the above identity in the shortened form

P{C_i = 1 | D[0; s]} = Ḡ_1(s_i) Ḡ_1(s_j) W'(s) + Ḡ_1(s_i) Ḡ_0(s_j) W''(s) ,

where W'(s) and W''(s) are positive quantities. Similarly,

P{C_j = 1 | D[0; s]} = Ḡ_1(s_i) Ḡ_1(s_j) W'(s) + Ḡ_0(s_i) Ḡ_1(s_j) W''(s) .

Whence, under the condition (2.3) and for s_i < s_j,

P{C_i = 1 | D[0; s]} − P{C_j = 1 | D[0; s]} = W''(s)[Ḡ_1(s_i) Ḡ_0(s_j) − Ḡ_0(s_i) Ḡ_1(s_j)] ≥ 0 . □
Now we compare P{T_i − s_i > ξ | D[0; s]} with P{T_j − s_j > ξ | D[0; s]} for two different indexes i and j. We are in particular interested in obtaining sufficient conditions under which the following implication holds:

∀ ξ > 0 , s_i < s_j ⇒ P{T_i − s_i > ξ | D[0; s]} < P{T_j − s_j > ξ | D[0; s]} .   (2.23)
In this respect we have the following result.

Proposition 2.6. Under the assumption (2.3), a sufficient condition for the validity of the implication (2.23) is that one of the following sets of conditions holds:
(a) r_0(t) and [Ḡ_0(t+ξ)/Ḡ_0(t) − Ḡ_1(t+ξ)/Ḡ_1(t)] are non-increasing functions of t, for any ξ > 0.
(b) r_1(t) and [Ḡ_1(t+ξ)/Ḡ_1(t) − Ḡ_0(t+ξ)/Ḡ_0(t)] are non-increasing functions of t, for any ξ > 0.
Proof. Consider the set of conditions (a). By letting n = 0 and ξ_1 = ξ, ξ_2 = ... = ξ_N = 0 in (2.14), we can obtain

P{T_i − s_i > ξ | D[0; s]} = P{C_i = 0 | D[0; s]} Ḡ_0(s_i + ξ)/Ḡ_0(s_i) + P{C_i = 1 | D[0; s]} Ḡ_1(s_i + ξ)/Ḡ_1(s_i) ,

whose right-hand side can be rewritten in the form

Now compare P{T_i − s_i > ξ | D[0; s]} with P{T_j − s_j > ξ | D[0; s]}. By our hypotheses and by (2.3), {Ḡ_0(s+ξ)/Ḡ_0(s) − Ḡ_1(s+ξ)/Ḡ_1(s)} is non-negative and non-increasing ∀ s ≥ 0, and Ḡ_0(s+ξ)/Ḡ_0(s) is non-decreasing. Thus the implication (2.23) is seen to be valid by taking into account Lemma 2.2.
Under the set of conditions (b), an analogous proof can be given. □

Of course Proposition 2.6 only gives sufficient conditions for the implication in (2.23). It is to be stressed that these conditions are satisfied when G_0 and G_1 are exponential distributions.
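The exponential case can be checked in a few lines: the ratio Ḡ_i(t+ξ)/Ḡ_i(t) is then constant in t, so both requirements of condition (a) hold trivially. The rates below are illustrative.

```python
import math

# Rates are illustrative; lam0 < lam1 matches r_0 <= r_1 in (2.3).
lam0, lam1, xi = 0.2, 2.0, 0.5
ratio = lambda lam, t: math.exp(-lam * (t + xi)) / math.exp(-lam * t)
diff = lambda t: ratio(lam0, t) - ratio(lam1, t)   # bracket in condition (a)
vals = [diff(t) for t in (0.0, 1.0, 5.0, 25.0)]    # constant in t
```

The bracketed difference equals e^{−λ_0 ξ} − e^{−λ_1 ξ} for every t, which is non-negative and (being constant) non-increasing, and r_0(t) = λ_0 is constant as well.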
(2.24)

is less severe than D[h; t; s] (see Shaked and Shanthikumar 1990). In some cases we can be interested in checking the validity of the implication:

D[h; t; s] ⪯ D[h'; t'; s'] ⇒ μ_s^{(h)}(t) ≥ μ_{s'}^{(h')}(t') .

We remark that in the special case h = h', t = t', the above implication is a condition of negative aging, while, under the condition h = h', s = s', it can
where y'_{(1)} ≤ y'_{(2)} ≤ ... ≤ y'_{(N)} and y''_{(1)} ≤ y''_{(2)} ≤ ... ≤ y''_{(N)} are the order statistics of (y'_1, ..., y'_N) and (y''_1, ..., y''_N), respectively. This is denoted by y' ≺ y''. A function ψ : ℝ^N → ℝ is Schur-convex if it is non-decreasing with respect to the majorization ordering:

(y'_1, ..., y'_N) ≺ (y''_1, ..., y''_N) implies ψ(y'_1, ..., y'_N) ≤ ψ(y''_1, ..., y''_N) .

The following characterization aims to clarify the connection between the phenomenon of infant mortality and the Schur-convexity property of F̄^{(N)}.

Lemma 2.3. (Spizzichino 1992) F̄^{(N)} is Schur-convex if and only if the implication (2.23) holds.

By combining Proposition 2.6 with the latter result, we immediately obtain

Proposition 2.8. Under the assumptions of Proposition 2.6, F̄^{(N)} is Schur-convex.

(b) and (c) of Proposition 2.7 are trivially satisfied for any pair s and s'. Then the inequality μ_s^{(h)}(t) ≥ μ_{s'}^{(h')}(t') is equivalent to (a); furthermore an immediate application of Proposition 2.8 shows that F̄^{(N)} is Schur-convex.
(2.25)

C_1, ..., C_N are infinitely extendible if and only if they are conditionally i.i.d., given a random quantity Θ taking values in [0, 1]; in other words, if and only if it is

(2.26)

Proposition 2.9.
(a) R(T_1, ..., T_N) ≥ R(C_1, ..., C_N).
(b) T_1, ..., T_N are i.i.d. if and only if M ~ b(N, p) for some p ∈ [0, 1].
(c) If (2.26) holds then T_1, ..., T_N are conditionally i.i.d.
Proof. (a) Let the binary random quantities C_1, ..., C_N be R-extendible with R > N; then we can consider the R-dimensional survival function

where w^{(R)}(l) = C(R, l) p^{(R)}(l) are the probabilities in (2.25). The joint survival function F̄^{(N)} of T_1, ..., T_N is the N-dimensional marginal of F̄^{(R)}, and F̄^{(R)} is obviously exchangeable. So R(T_1, ..., T_N) ≥ R(C_1, ..., C_N).

(b) M ~ b(N, p) is equivalent to independence among C_1, ..., C_N. On the other hand, independence among C_1, ..., C_N is equivalent to independence among T_1, ..., T_N. Note that in this case (2.5) becomes

Σ_{c ∈ {0,1}^N} ∫_0^1 p^{Σ_{j=1}^N c_j} (1 − p)^{N − Σ_{j=1}^N c_j} dπ(p) ∏_{j=1}^N Ḡ_{c_j}(s_j)
= ∫_0^1 ∏_{j=1}^N [p Ḡ_1(s_j) + (1 − p) Ḡ_0(s_j)] dπ(p) .   (2.27)  □
We recall here that the condition Cov(X_1, X_2) < 0, for an arbitrary pair of exchangeable random quantities X_1, X_2, implies finite extendibility.
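Identity (2.27) can be verified numerically for a discrete mixing distribution π; the two-point π and the exponential survival functions below are illustrative assumptions, not from the text.

```python
import math
from itertools import product

# Exponential survival functions and a two-point mixing distribution pi
# are illustrative choices.
lam0, lam1 = 0.2, 2.0
G0 = lambda s: math.exp(-lam0 * s)      # Gbar_0
G1 = lambda s: math.exp(-lam1 * s)      # Gbar_1
pi = {0.2: 0.4, 0.7: 0.6}               # P{p = 0.2} = 0.4, P{p = 0.7} = 0.6
s = [0.3, 1.1, 2.0]                     # ages s_1..s_N, N = 3
N = len(s)

# Left-hand side of (2.27): sum over c in {0,1}^N of the mixed binomial
# weight times the product of the matching survival functions.
lhs = sum(
    w * p ** sum(c) * (1 - p) ** (N - sum(c))
    * math.prod((G1 if cj else G0)(sj) for cj, sj in zip(c, s))
    for c in product((0, 1), repeat=N)
    for p, w in pi.items()
)
# Right-hand side: mixture of products of one-dimensional mixtures.
rhs = sum(
    w * math.prod(p * G1(sj) + (1 - p) * G0(sj) for sj in s)
    for p, w in pi.items()
)
```

The two sides agree to machine precision, since (2.27) is just the binomial expansion of the product on the right.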
t_{k+1} > max{t_1, ..., t_k} ⇒ ψ_{k+1}(t_1, ..., t_k, t_{k+1}) ≥ ψ_k(t_1, ..., t_k) .

Practical examples will be shown in the next section.
We will say that a set of subsequent observed failure times t_1 ≤ ... ≤ t_N contains early failures if, for some 1 ≤ h < N, one has

In words, the inequality (3.1) says that we have h "early" failures at subsequent times t_1, ..., t_h if the times t_1, ..., t_h are so short that the following circumstance happens: the gain obtainable from putting, at time t_h, the (N − h) surviving components into operation would be greater than the gain obtained from putting all the N components into operation at time 0.
Suppose that, at time 0, we start testing U_1, ..., U_N simultaneously (assumed to be of age 0 at time 0), progressively observing possible failures and taking records of the different failure times. In this way, up to any time s, we observe a dynamic history of the form D[h; t; s]. Define

where k = N − h is the remaining number of components and T_{j_1} − s, ..., T_{j_k} − s are their residual lifetimes. Ψ(h; t; s) is the expected gain from putting into operation the components surviving a test of duration s, conditional on the failure history observed in the test.

Let now σ be a stopping time with respect to the filtration {F_t} (F_t generated by {H_s; 0 ≤ s ≤ t}), with H_s ≡ Σ_{j=1}^N 1_{[T_j ≤ s]}.
the expected gain deriving from putting all the components (of age 0) into operation.

It is to be stressed that, in these cases, burn-in has a special interpretation: it is a procedure to eliminate substandard components from P (but not necessarily all of them). By taking into account that α(s) ≡ P{C_i = 1 | T_i > s} is non-increasing in s (see Remark 2.1), one can show that the distribution of M_s is stochastically non-increasing in s. Thus in particular we see that the effect of burn-in is to decrease the proportion of surviving weak components.
We point out that the model of heterogeneous populations corresponds to different situations according to the different possible types of distributions {w^{(N)}(k); k = 0, 1, ..., N} for M = Σ_{j=1}^N C_j. Such different types, in their turn, correspond to different forms of dependence for the lifetimes T_1, ..., T_N. To illustrate this, we shall now examine a number of special cases, while clarifying the differences between the different situations from a statistical point of view.
(A) We start with the special case of a heterogeneous population P for which p (0 < p < 1) is the known probability that any element chosen from P is substandard and the conditions C_1, ..., C_N are assessed to be independent; this is equivalent to assuming that the distribution of M is b(N, p). By (b) in Proposition 2.9, T_1, ..., T_N are independent and identically distributed as well, and thus (2.5) becomes
(B) Consider now the case in which C_1, ..., C_N are conditionally independent and identically distributed, i.e., (2.26) holds; by (c) of Proposition 2.9, T_1, ..., T_N are also conditionally independent and identically distributed (using the language of frequentist probability, we could say that this case corresponds to (A) with p unknown).

Think again of a burn-in with a duration s > 0, and consider the group of components that survive at time s. The conditional probability distribution of M_s is still of the form (2.26), where N is replaced by N_s and π is replaced by a new mixing distribution π(· | D[h; t; s]) depending on s and on the history observed up to s. During burn-in two different processes take place: we eliminate weak components from P and, simultaneously, we learn about p. Of course this is a case of positive dependence among T_1, ..., T_N: the distribution of M_s conditional on a history D[h; t; s] is stochastically greater than the distribution of M_s conditional on a different history D[h'; t'; s'] if D[h; t; s] ⪯ D[h'; t'; s'] in the sense of Definition 2.1. The conditional distribution F̄^{(N−h)}(ξ | D[h; t; s]) of residual lifetimes can be obtained by means of a suitable modification of formula (2.27).
(C) We analyze here a case of positive dependence different from (B). Suppose we assess P{C_1 = C_2 = ... = C_N} = 1, i.e. P{C_1 = C_2 = ... = C_N = 1} = q, P{C_1 = C_2 = ... = C_N = 0} = 1 − q; namely, the distribution of M is concentrated on the two extreme values 0 and N. In words, this means that all components are in the same (unknown) condition.

It is easy to see that, at any time s, the conditional distribution of M_s remains of this same kind: P{M_s = N_s} = 1 − P{M_s = 0}. P{M_s = N_s} of course depends on the observed history; more precisely, by applying the Bayes formula, we have
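A hedged sketch of the resulting Bayes update (not reproducing the text's exact formula): the posterior probability that the whole batch is weak weighs the likelihood of the observed history under "all weak" against "all strong". Exponential lifetimes are an illustrative assumption.

```python
import math

def posterior_all_weak(q, N, times, s, lam0=0.2, lam1=2.0):
    """P{all units weak | h failures at 'times', N - h survivors at s},
    for case (C): M concentrated on {0, N} with prior q = P{M = N}.
    Exponential lifetimes with rates lam0 < lam1 are assumed."""
    h = len(times)
    lik1 = math.prod(lam1 * math.exp(-lam1 * t) for t in times) \
        * math.exp(-lam1 * s) ** (N - h)     # likelihood if all weak
    lik0 = math.prod(lam0 * math.exp(-lam0 * t) for t in times) \
        * math.exp(-lam0 * s) ** (N - h)     # likelihood if all strong
    return q * lik1 / (q * lik1 + (1 - q) * lik0)

# Two early failures push the posterior towards "all weak"; no failures
# at all push it towards "all strong".
q_post_early = posterior_all_weak(0.5, N=5, times=[0.1, 0.2], s=0.5)
q_post_none = posterior_all_weak(0.5, N=5, times=[], s=0.5)
```

This makes concrete why, in case (C), observing other components is informative about a unit that is never burned in.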
(D) The last special case of distribution for M that we consider is one of negative dependence among T_1, ..., T_N. Suppose we are sure from the beginning about the value of M: for some 0 < k < N, it is

w^{(N)}(k) = 1 , w^{(N)}(n) = 0 for n ≠ k .

Most of the existing literature deals with the following extension of case (A): populations P are considered which are mixtures of more than two subpopulations, so that the one-dimensional density functions of the components' lifetimes are of the form

f(t) = ∫ g(t; λ) dP(λ)   (3.8)
about the hypothesis {C_j = 0} than testing the component U_j (1 ≤ j ≤ N) for a while.

This would be reversed in case (C). In such a case we can test the hypothesis {C_j = 0} without burning-in U_j: this can be done by testing a number r of other components U_{j_1}, ..., U_{j_r} up to the failures of all of them. In this way we can learn about {C_j = 0} or {C_j = 1}, leaving U_j completely new (i.e., of age 0); under (3.9) the latter may turn out to be a more convenient procedure.
In this section we explain in some more detail the concept of an optimal sequential procedure for stopping the burn-in, introduced in the last section, and show some fundamental facts concerning the case of a heterogeneous population of components.
Consider a burn-in experiment according to which all the components U_1, ..., U_N, belonging to P, are simultaneously put under test at time 0, progressively recording all the subsequent observed failure times t_1, t_2, ... up to a pre-fixed stopping time σ. At σ the experiment is stopped and all the surviving components are delivered to operation or, in any case, are kept and considered to be usable later on for assembling some wanted system. In this case we say that we adopted the procedure σ for stopping the burn-in.

Up to any time s > 0 we observe a history of the form D[h; t; s]. As already mentioned above, it is of course necessary for σ to be a rule such that, at any time s, we are in a position to establish whether {σ ≤ s} or {σ > s} based on the information carried by D[h; t; s]. If in particular we fix σ = s, for some value s, already before starting the experiment, we say that
σ = T_{(h)} (corresponding to stopping the test as soon as the h-th failure has occurred); p_σ^{(N−h)}(t_1, t_2, ..., t_h) will in general be a function of t_1, t_2, ..., t_h.
To denote the above, we shall write

σ = {p_σ^{(N)}, p_σ^{(N−1)}(t_1), ..., p_σ^{(1)}(t_1, ..., t_{N−1})} .   (4.1)

The subscript σ can be omitted when unnecessary.
Now consider the expected gain W_σ, defined in (3.3). Before going ahead, note that W_σ will obviously depend on the joint distribution of T_1, ..., T_N, which is in its turn determined by the distribution of M, ω^{(N)} ≡ {w^{(N)}(0), w^{(N)}(1), ..., w^{(N)}(N)}, and by the pair of one-dimensional survival functions Ḡ_0(s) and Ḡ_1(s), via equation (2.5).

Our task in the following is to express W_σ in terms of the representation (4.1) and to characterize the optimal stopping procedure σ*, as defined in (3.5). To this aim we must adopt a dynamic point of view; for that we must in general look at the conditional expected gain, given that the history D[h; t; s] has been observed, if continuing with a procedure σ. This may be denoted by

W_σ(h; t; s)  or  W(p_σ^{(N−h)}, p_σ^{(N−h−1)}, ..., p_σ^{(1)} | D[h; t; s]) .
Consider now the conditional survival function of the variable T_{(h+1)} − s given the history D[h; t; s] observed up to s. T_{(h+1)} − s is the waiting time up to the next failure after the instant s, and its conditional survival function is given by

P{T_{(h+1)} − s > ξ | D[h; t; s]} = F̄^{(N−h)}(ξ, ..., ξ | D[h; t; s]) ,   (4.3)

with f(ξ | D[h; t; s]) the corresponding density function:

f(ξ | D[h; t; s]) = − (d/dξ) P{T_{(h+1)} − s > ξ | D[h; t; s]} .   (4.4)
Note that, for an arbitrary stopping time σ, the following identity holds

W*(h; t; s) ≡ sup_σ W_σ(h; t; s) .

Thus an optimal stopping procedure does exist and can be described as follows: after observing a history D[h; t; s]:
(i) stop at s, if

σ* ≡ inf {s ≥ 0 : W*(H_s; T_{(1)}, ..., T_{(H_s)}; s) = Ψ(H_s; T_{(1)}, ..., T_{(H_s)}; s)} .   (4.8)

In other words,

p_{σ*}^{(N−h)}(t) = inf {s ≥ t_h : W*(h; t; s) = Ψ(h; t; s)} ,   (4.9)

and

W_{σ*}(h; t; s) = W*(h; t; s) .

In particular, σ* is optimal in the sense of the definition in (3.5).
A Probabilistic Model for Heterogeneous Populations 309
Remark 4.1. In order to obtain the optimal stopping time σ*, one must pre-
viously compute the functions Ψ(·; ·; ·) and W*(·; ·; ·). W*(h; t; s) can be com-
puted in terms of the functions W*(h + 1; ·; ·) and Ψ(h; ·; ·).
The stopping time σ* is optimal in the sense of Bayes optimality, and
the history of already observed failure times T_(1) = t_1, ..., T_(h) = t_h is of
course taken into account in the dynamic characterization of σ*, since it
influences the conditional distribution of residual lifetimes of the surviving
components. Actually, for G_0(t) and G_1(t) given, such distribution is deter-
mined by the conditional distribution of M (the number of those substandard
components which are still surviving at s). So p*^{(N-h)}(t) in (4.9) depends on
t only through the conditional probabilities ω^{(N-h)}(k | t) (k = 0, ..., N − h).
Qualitative properties of p*^{(N-h)}(t) are then affected by the kind of stochastic
dependence among C_1, ..., C_N.
We now turn to write down special forms of the functions ψ_k which
reasonably describe the cases when U_1, ..., U_N are components to be possibly
used for assembling a reliability system.
First of all it can be natural to assume
ψ_k(t_1, ..., t_k) = −δ,  k < n   (4.10)
for some non-negative quantity δ and some n ≤ N. This means that we have
a loss, or at best no gain, if fewer than n components are available.
For n ≤ k ≤ N, the following practical examples can be given.
1. ψ_k(t_1, ..., t_k) = Σ_{j=1}^{k} J(t_j)  (components to be used separately, one
independently of another)
In the general case, finding σ* is not a feasible task. For this reason we
do not pursue further the analysis of the computation of p*^{(N-h)}(t); rather
we prefer to concentrate attention on Open Loop Feedback Optimal (OLFO)
procedures. Open loop feedback optimality is a general concept from Optimal
Control Theory (see Runggaldier 1993, for a transposition to the burn-in
problem).
We shall denote by p̂^{(N)}, p̂^{(N-1)}(t_1), ..., p̂^{(1)}(t_1, ..., t_{N-1}) the functions
characterizing the OLFO procedure.
In order to define the p̂'s, it is previously necessary to analyze the special
case M ~ b(N, p) (0 ≤ p ≤ 1), considered at point (A) in Section 3. As
we saw, this corresponds to the assumption that T_1, ..., T_N are independent
random quantities.
The problem of computing p*^{(N-h)}(t) is much simpler in this case than in
the general case; indeed p*^{(N-h)}(t) is simply a function of the arguments h
and t_h, which will be denoted by p*^{(N-h)}(·).
In order to obtain p*^{(N-h)}(·), the following arguments are to be taken into
account.
Conditionally on D[h; t; s], the residual lifetimes T_{i_1} − s, ..., T_{i_{N-h}} − s are
independent and their one-dimensional survival function p̄^{(1)}(ξ) is provided
by formula (3.7).
We can then write
Ψ(h; t; s) ≡ Ψ(h; s) = ∫_0^∞ ∫_0^∞ ... ∫_0^∞ ψ_{N-h}(ξ_1, ..., ξ_{N-h}) ...
It will furthermore be
Remark 4.2. In the case considered just above, T_1, ..., T_N are independent
variables distributed according to the survival function F̄^{(1)} given in (2.9).
For given N (initial number of components) and cost functions ψ_k, F̄^{(1)} com-
pletely determines the quantity p*^{(N)} initiating the optimal burn-in procedure
σ*. For the subsequent developments we then use the symbol p*^{(N)}(F̄^{(1)}).
Remark 4.3. In the case of independence, we can say that infant mortality is
present if
p*^{(N)}(F̄^{(1)}) > 0 .
Thus, for the probability model defined by the assumption of independence
and by F̄^{(1)}, we see that infant mortality depends on the structure of the
reliability system to be built, which determines the form of the ψ_k's.
Now we turn to consider the OLFO procedure for stopping the burn-in.
The functions p̂^{(N)}, p̂^{(N-1)}(t_1), ..., p̂^{(1)}(t_1, ..., t_{N-1}) are defined as fol-
lows.
At time s = 0, let
p̂^{(N)} = p*^{(N)}(F̄^{(1)})
as if we were in the case of stochastic independence.
For t_1 < p̂^{(N)}, let
ψ_k(t_1, ..., t_k) = −c(N − k) + C Σ_{j=1}^{k} 1_{[t_j > τ]} − L Σ_{j=1}^{k} 1_{[t_j ≤ τ]},  k = 0, 1, ..., N   (4.16)
where L > C > c > 0 are given quantities and τ > 0 is a fixed mission time.
G_i(t) = exp{−λ_i t},  t ≥ 0,  i = 0, 1   (4.17)
In order to obtain p̂^{(N)}, p̂^{(N-1)}(t_1), ..., p̂^{(1)}(t_1, ..., t_{N-1}) we must previously
consider the case of independence, characterized by
(4.19)
where
γ(p) = λ_1 p exp{−λ_1 p} / [λ_1 p exp{−λ_1 p} + λ_0 (1 − p) exp{−λ_0 p}]
(see also Clarotti and Spizzichino 1990).
As far as the OLFO procedure is concerned we then have
z(s) = exp{−(λ_1 − λ_0) s} .
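In the exponential special case, the role of the ratio z(s) can be illustrated numerically. The sketch below assumes the standard two-point mixture reading of the model (substandard components with rate λ₁, standard with rate λ₀, prior mixture weight p); the particular numerical values are invented for illustration, not taken from the paper.

```python
import math

def posterior_substandard(p, lam1, lam0, s):
    """P(component is substandard | it has survived burn-in up to time s),
    by Bayes' rule for a two-point exponential mixture: substandard
    lifetimes ~ exp(lam1) with prior weight p, standard ~ exp(lam0)."""
    num = p * math.exp(-lam1 * s)
    den = num + (1 - p) * math.exp(-lam0 * s)
    return num / den

def z(s, lam1, lam0):
    """The ratio exp{-(lam1 - lam0) s} from the text; with lam1 > lam0 it
    decreases in s, so surviving components look increasingly standard."""
    return math.exp(-(lam1 - lam0) * s)

# Illustrative values (assumptions, not from the paper):
p, lam1, lam0 = 0.2, 2.0, 0.1
for s in (0.0, 1.0, 3.0):
    w = posterior_substandard(p, lam1, lam0, s)
    print(f"s={s:.1f}  P(substandard | survival)={w:.4f}")
```

The posterior weight equals p·z(s) / (p·z(s) + 1 − p), which is why the burn-in decision in this case depends on the elapsed time only through z(s).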
Acknowledgement. I thank colleagues Menachem Berg and Uwe Jensen for useful
discussions and comments. I would also like to thank the organizing committee of
the Antalya NATO-ASI meeting for the excellent organization and hospitality.
Partial support of CNR Progetto Strategico Applicazioni della Matematica per la
Tecnologia e la Società is also acknowledged.
314 Fabio Spizzichino
1. Introduction
The term "reliability" is used in the same sense for software as it is for
hardware (Musa et al. 1987). It is the probability of failure-free execution
of a program for a specified period, use, and environment. For example, a
program may have a reliability of 0.99 for 8 hours of execution. Note that
the relevant time is execution time, the actual time that the processor is exe-
cuting the program. The definition of software reliability in analogous terms
to hardware reliability is deliberate, because we want to be able to combine
reliabilities of hardware and software components to obtain system reliability.
The cause of failure in software is different from that in hardware; it is erroneous
or incomplete design rather than wear, fatigue, burnout, etc. It should not be
surprising that we use compatible definitions even though failure mechanisms
are different; we already employ a common definition across hardware even
though hardware has many different failure mechanisms. Note that hardware
can also fail from design errors; in this sense, software reliability theory could
be applied to some hardware situations.
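The numbers in the definition above are connected by the relation R = exp(−λτ) when the failure intensity λ is constant over execution time τ. This sketch makes that simplifying assumption (the simplest case, not the general one) to convert the paper's example into a failure intensity:

```python
import math

def failure_intensity(rel, exec_hours):
    """Constant failure intensity (failures per execution hour) implied by
    R = exp(-lambda * t), i.e. lambda = -ln(R) / t."""
    return -math.log(rel) / exec_hours

def reliability(lam, exec_hours):
    """Probability of failure-free execution for exec_hours at intensity lam."""
    return math.exp(-lam * exec_hours)

# The paper's example: reliability 0.99 over 8 hours of execution
lam = failure_intensity(0.99, 8.0)
print(f"{lam * 1000:.2f} failures per 1000 execution hours")  # ≈ 1.26
```

This is why failure intensity (used later in units of failures per 1000 execution hours) and reliability are interchangeable specifications under the constant-intensity assumption.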
Software reliability engineering has spread rapidly in practice because of
the substantial benefits it provides and the relatively low cost of implemen-
tation.
2. Benefits
The benefits derived from software reliability engineering start in the system
engineering phase. Quantitative expression of reliability needs enables sup-
pliers of software-based products to more precisely understand the needs of
users of these products. Assuming that a product is designed to deliver the
functionality required, user satisfaction (the concept of "quality") depends
on multiple factors, but perhaps the three salient ones are reliability, delivery
date, and cost. These quality attributes interact with each other; to obtain
increased reliability requires longer development time or greater cost or both.
If rapid delivery of a product is essential to meet a user's needs, something
must give: either reliability will suffer or cost will escalate. When you can
analyze a user's conflicting needs with respect to these quality attributes and
set more precise goals, you set the stage for a higher level of user satisfaction.
Software reliability engineering includes quantitatively determining how
users will employ a system and uses this information to both tune the system
to this pattern of use and to focus development attention on the operations
that are used the most and/or are most critical. A "critical" operation is one
whose failure will have a severe impact in terms of risk to human life, cost, or
level of service. This focus speeds up development and reduces costs because
we don't waste time and effort on infrequently used, noncritical operations.
Software reliability engineering reduces the risk of unsatisfactory reliabil-
ity by engineering and tracking reliability during development.
An Overview of Software Reliability Engineering 321
3. Nature of Practice
Software reliability engineering consists of seven principal activities, spread
out over the software life cycle:
1. developing the operational profile,
2. defining "failure" with severity classes,
3. setting failure intensity objectives,
4. engineering the product and the development process to meet the failure
intensity objectives,
5. certifying the failure intensities of acquired software components,
6. reducing and assuring failure intensities during test, and
7. monitoring field failure intensities against objectives.
The principal strategies are fault tolerance, reviews, and test. We must determine
the contribution each strategy must make to the overall failure intensity, con-
sidering the effects on development time, development cost, and operational
efficiency. When the failure intensity objective is high, testing alone may be
sufficient. As the objective is reduced (made more stringent), we must increas-
ingly use requirements, design, and code reviews. Very low failure intensity
objectives require the use of fault tolerant features.
The third sub activity is to use the operational profile and a list of critical
operations to allocate process resources (primarily people). Allocations are
made with respect to operations, which are externally initiated tasks such
as commands or transactions. You can speed up the delivery of operations
that are heavily employed or critical to users by operational development,
the organization and scheduling of development by operation rather than by
module. You can reduce cost with the concept of reduced operation software
(ROS). This is the analog of RISC (reduced instruction set computing). You
reduce the total number of operations that must be implemented by elimi-
nating or finding other ways to accomplish the infrequently used, noncritical
operations. For example, you may replace a complex operation by a sequence
of simpler basic operations, possibly with some manual intervention. Any loss
in operational efficiency is small because the operations replaced occur only
rarely, and it is more than compensated for by development cost savings.
Note that there are three regions: reject, continue, and accept. As long as
failure times remain in the continue region, you keep testing. As soon as a
failure time crosses into a reject or accept region, you can reject or accept the
software based on the discrimination ratio, risk levels, and failure intensity
objective that have been set. For example, in Figure 3.1, the first two failures
(at 15 and 25 CPU hr) plot in the continue region. The third failure occurs
at 100 CPU hours; it is in the accept region, permitting the component to
be accepted. It is possible for software that experiences no failures to be
accepted; in this example, this would happen after 40 CPU hours of failure-
free operation.
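Charts of this kind derive from Wald's sequential probability ratio test. The sketch below is one plausible construction, testing the failure intensity objective against the objective multiplied by the discrimination ratio; the particular objective, ratio, and risk levels are assumptions, not the values behind Figure 3.1:

```python
import math

def sprt_region(n_failures, t_hours, lam_objective, discrimination_ratio,
                alpha=0.1, beta=0.1):
    """Wald SPRT for a Poisson failure process: H0 intensity lam_objective
    (accept the software) versus H1 intensity discrimination_ratio times
    larger (reject). alpha and beta are the two risk levels."""
    gamma = discrimination_ratio
    # Log-likelihood ratio after n failures in t hours of exposure
    llr = n_failures * math.log(gamma) - (gamma - 1.0) * lam_objective * t_hours
    if llr >= math.log((1 - beta) / alpha):
        return "reject"
    if llr <= math.log(beta / (1 - alpha)):
        return "accept"
    return "continue"

# Illustrative check with assumed parameters:
for n, t in [(1, 15.0), (2, 25.0), (3, 100.0)]:
    print(n, t, sprt_region(n, t, lam_objective=0.05, discrimination_ratio=2.0))
```

With these assumed parameters, the three failure times of the example (15, 25, and 100 CPU hr) fall in the continue, continue, and accept regions respectively, matching the qualitative behavior described.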
Fig. 3.1. Failure number versus failure time (CPU hr), showing the reject, continue, and accept regions
also serve to increase the level of reliability assurance. Alternatively you may
think of these test phases as periods in which failure intensity is reduced and
we increase our assurance that it is reduced. The reduction comes about, of
course, as we experience failures and we search out and remove the faults
that are causing them.
A model of this failure intensity reduction is shown in Figure 3.2. The
actual reduction is discontinuous. The removal of each fault causes a dis-
continuity whose size depends on how often that fault is activated by the
usage pattern (operational profile) of the software. Software reliability mod-
els generally focus on test periods; they are generally nonincreasing, and they
are usually expressed in execution time (Musa et al. 1987). Most of them are
based on nonhomogeneous Poisson processes. Maximum likelihood estimation
is commonly used to determine their parameters, although this is certainly
not a requirement. The models that have been most commonly employed in
practice are the Musa-Okumoto logarithmic Poisson execution time model
(Musa and Okumoto 1984) and the Musa basic execution time model (Musa
1975).
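Both models mentioned have simple closed forms for the failure intensity as a function of execution time; the sketch below states them under their common parameterizations (initial intensity λ₀, total expected failures ν₀, decay parameter θ). The numeric values are illustrative assumptions:

```python
import math

def basic_intensity(tau, lam0, nu0):
    """Musa basic execution time model: failure intensity after tau
    execution hours decays exponentially; lam0 = initial intensity,
    nu0 = total expected failures."""
    return lam0 * math.exp(-(lam0 / nu0) * tau)

def log_poisson_intensity(tau, lam0, theta):
    """Musa-Okumoto logarithmic Poisson model: intensity decays
    hyperbolically; theta = failure intensity decay parameter."""
    return lam0 / (lam0 * theta * tau + 1.0)

def log_poisson_mean_failures(tau, lam0, theta):
    """Expected cumulative failures by execution time tau (the mean value
    function of the nonhomogeneous Poisson process)."""
    return math.log(lam0 * theta * tau + 1.0) / theta

# Illustrative parameters (assumptions):
lam0, nu0, theta = 10.0, 100.0, 0.05
for tau in (0.0, 10.0, 50.0):
    print(tau, basic_intensity(tau, lam0, nu0),
          log_poisson_intensity(tau, lam0, theta))
```

The basic model assumes each fault correction reduces intensity by the same amount; the logarithmic Poisson model assumes early corrections reduce it more than later ones.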
Fig. 3.2. Failure intensity reduction during test (failure intensity versus execution time)
good balance between high confidence levels and the necessarily large ranges
associated with such intervals. Figure 3.3 indicates how the confidence inter-
val typically decreases with execution time, as failure intensity estimates are
based on more and more data. Note that we are concerned principally with
the upper confidence limit; we don't care how much failure intensity might
be lower than what we have estimated.
Fig. 3.3. Failure intensity estimate with confidence interval around the nominal value, narrowing with execution time
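The way the interval narrows as data accumulate can be illustrated with a simple large-sample approximation for a constant failure intensity (a deliberate simplification; the estimators actually used come from the nonhomogeneous models above):

```python
import math

def intensity_ci(n_failures, t_hours, z=1.15):
    """Point estimate and approximate two-sided 75% confidence interval
    (z = 1.15 is the standard normal quantile for 75% coverage) for a
    constant failure intensity, from n failures observed in t hours.
    Large-sample normal approximation: lam_hat +/- z * sqrt(n) / t."""
    lam_hat = n_failures / t_hours
    half = z * math.sqrt(n_failures) / t_hours
    return lam_hat - half, lam_hat, lam_hat + half

# Same underlying intensity, growing amounts of data: the interval shrinks.
for n, t in [(4, 40.0), (16, 160.0), (64, 640.0)]:
    lo, mid, hi = intensity_ci(n, t)
    print(f"n={n:3d}  estimate={mid:.3f}  75% CI=({lo:.3f}, {hi:.3f})")
```

The half-width scales as 1/√n for a fixed estimated intensity, which is the shrinking behavior sketched in Figure 3.3.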
The procedure for estimating failure intensity during system test or beta
test is straightforward, although there are refinements for special situations
such as program evolution, absence of execution time information, etc. (Musa
et al. 1987). The system is tested by selecting runs in accordance with the
operational profile. Failures are identified and failure times are recorded. The
failure data is input to a reliability estimation program. Such programs use
reliability models and estimation techniques (as noted above) to estimate
failure intensity and its confidence interval or intervals. You compare failure
intensity with your failure intensity objective on a periodic basis. This typi-
cally occurs daily for short test periods and weekly for long ones. As noted
previously, you may have multiple failure intensity objectives to account for
such situations as failure severity classes. In this case, you have corresponding
multiple failure intensity measurements.
328 John D. Musa
The comparison is used initially to highlight the need for corrective ac-
tions, such as changing the levels of resources devoted to testing, changing
testing schedules, or renegotiating delivery dates or failure intensity objec-
tives. When the failure intensity reaches the objective, one of the criteria
that guides release to the next phase is satisfied. We usually track the upper
confidence bound of estimated failure intensity, because we want to establish
meeting the objective at some level of confidence.
The failure intensity estimates generated by a software reliability estima-
tion program for the system test phase of a software development project
are shown in Figure 3.4. The center line is the maximum likelihood estimate;
the other two lines are the bounds of the 75 % confidence intervals. Note
that the test phase covers almost four months, during which time the failure
intensity is substantially reduced (the vertical axis is logarithmic, tending to
deemphasize the reduction). The "noise" in the plots represents not only the
discontinuous nature of failure intensity reduction but also natural random
variation (the estimates are made from relatively small sample sizes early in
test).
Fig. 3.4. Failure intensity estimates and 75% confidence intervals during system test (failures/1000 hr, logarithmic scale; August through November)
However, you will note a significant upward trend in September that dom-
inates the variation resulting from random effects. This was the sign of a
potential problem requiring investigation. The investigation showed that, un-
known to the testers, some developers had added additional new features to
the system, introducing additional faults and driving up the failure intensity.
This is a graphic example of how tracking failure intensity during test can
uncover problems.
[Figure: field failure intensity (failures/1000 hr, logarithmic scale) plotted against the service objective, with 75% confidence interval; the region above the objective is marked unsatisfactory, the region below satisfactory]
In this figure, the center line again represents the maximum likelihood
estimate of failure intensity, with the two other lines representing the 75 %
confidence bounds. We will focus on the upper confidence bound. Note the
sawtooth pattern. Each release of new features causes a jump in failure in-
tensity that results from the new faults introduced. Then in the periods
between releases, failure intensity declines as the failures experienced lead
to removal of the faults causing them. Observation of this behavior leads to
a simple policy to implement in the field to stabilize field reliability. When
the upper confidence bound of failure intensity exceeds the failure intensity
objective, freeze the system (allow no new feature introduction). When the
upper confidence bound of failure intensity falls well below the objective, you
can consider adding new features. The size of the permissible addition can
be guided by how far below the failure intensity objective you are.
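The field stabilization policy described above can be sketched as a small decision rule. The `margin` threshold defining "well below the objective" is an assumption here, since the text leaves its exact value to judgment:

```python
def release_policy(ucl_intensity, objective, margin=0.5):
    """Field stabilization policy: freeze feature introduction while the
    upper confidence limit (UCL) of failure intensity exceeds the objective;
    allow new features once the UCL falls well below it (here 'well below'
    means under an assumed fraction `margin` of the objective)."""
    if ucl_intensity > objective:
        return "freeze: no new feature introduction"
    if ucl_intensity < margin * objective:
        return "add features: sized by the gap below the objective"
    return "hold: near the objective, defer additions"

print(release_policy(120.0, 100.0))  # UCL above objective -> freeze
print(release_policy(30.0, 100.0))   # UCL well below -> add features
```

Using the upper confidence bound rather than the point estimate builds the desired level of assurance directly into the rule.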
Failure intensity in the field can be estimated with the same model and
estimation method, and hence the same program, as used for system and beta
test. In some cases, faults are not removed in the field between releases. In
that situation, the failure intensity is time invariant. The program will simply
yield model parameters that characterize a zero reliability growth case of the
models.
4. Research Questions
5. Summary
A study by the Strategic Planning Institute (Buzzell and Gale 1987) shows
that customer-perceived quality is the factor with the strongest influence on
long-term profitability of a company. Users view achieving the right balance
among reliability, delivery date, and cost as having the greatest effect on their
perception of quality. Since one of the main purposes of software reliability
engineering is achieving this balance in software-based systems, this discipline
is an extraordinarily important one. Finding solutions to some of the research
needs can stimulate rapid progress. Finally, there is a compelling need to
educate software and reliability engineers in this technology and practice.
Acknowledgement. The author is indebted to James Cusick for his helpful com-
ments.
References
Buzzell, R.D., Gale, B.T.: The PIMS Principles - Linking Strategy to Performance.
The Free Press 1987, p. 109
Musa, J.D.: A Theory of Software Reliability and Its Application. IEEE Transac-
tions on Software Engineering 1, 312-327 (1975)
Musa, J.D.: Operational Profiles in Software Reliability Engineering. IEEE Software
10 (2), 14-32 (1993)
Musa, J.D.: The Operational Profile. In this volume (1996), pp. 333-344
Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Predic-
tion, Application. New York: McGraw-Hill 1987
Musa, J.D., Okumoto, K.: A Logarithmic Poisson Execution Time Model for Soft-
ware Reliability Measurement. Proceedings of the 7th International Conference
on Software Engineering. Orlando 1984, pp. 230-238
The Operational Profile
John D. Musa
AT & T Bell Laboratories, 480 Red Hill Road, Middletown, NJ 07748-3052, USA
Summary. Operational profiles are an important part of the technology and prac-
tice of software reliability engineering. The concept was developed originally (Musa
et al. 1987) to make it possible to specify the nature of the use of a software-
based system so that testing could be made as realistic as possible and so that
reliability measurements would reflect that realism. However, the operational pro-
file rapidly became useful for additional purposes in software reliability engineering
(Musa 1993). In fact, it is also proving useful for purposes outside of software re-
liability engineering as well. This paper gives an overview of operational profile
practice, discussing what the operational profile is, why it is important, and how
it is developed and applied. It also presents some current open research questions;
work in these areas can be expected to affect the practice of the future.
1. Definition
We will first define the term "operation" and then show how this leads to the
concept of the operational profile. An operation is an externally-initiated task
performed by a system "as built." We contrast it with a function, which is
an externally-initiated task to be performed by a system, as viewed by users.
The idea or the need for the task ordinarily first arises in the minds of users,
who transmit it to system engineers as a requirement. It is sometimes first
conceived by developers, however. At this stage it is a function. As the system
is designed by system architects and developers, functions evolve into and are
implemented as operations. Functions often map one-to-one to operations,
but the mapping is also often more complex, driven by performance and other
needs. Examples of operations (and functions) include specific commands,
transactions, and processing of external events.
An operation or function is generally initiated and followed by an external
intervention, which may come from a human or another machine. Operations
(and functions) are not restricted to one machine; they may be executed over
several machines and thus can be used for distributed systems. Further, they
can be executed in segments separated in time. Thus, they are essentially
logical concepts that are not closely tied to hardware.
We will later refer to sequences of operations and functions that may be
initiated to implement a work process; these are called, respectively, opera-
tional scenarios and functional scenarios. Since these sequence patterns may
occur repetitively, and since interactions may occur between the operations,
the scenarios must be considered when testing.
The operational profile is now simply a set of operations and their prob-
abilities of occurrence. For example, suppose we have a system that receives
various alarms and processes them, taking actions that depend on the partic-
ular alarms. Table 1.1 shows a possible operational profile for such a system.
A functional profile is a set of functions and their probabilities of occurrence;
it is thus the exact analog of an operational profile.
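The alarm-handling example can be made concrete: an operational profile is just a set of operation-probability pairs, and operations can then be drawn at random in proportion to their field probabilities, as the paper later describes for test selection. The operation names and probabilities below are invented for illustration, not taken from Table 1.1:

```python
import random

# Hypothetical operational profile (illustrative names and probabilities):
profile = {
    "process fan failure alarm": 0.556,
    "process fire alarm": 0.278,
    "process security alarm": 0.166,
}

def select_operation(profile, rng=random):
    """Pick the next operation to test with probability equal to its
    occurrence probability in the field."""
    ops = list(profile)
    weights = [profile[op] for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]

random.seed(1)
counts = {op: 0 for op in profile}
for _ in range(10000):
    counts[select_operation(profile)] += 1
print(counts)  # selection frequencies track the profile probabilities
```

Driving test selection this way makes the measured failure intensity reflect what users will actually experience.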
2. Benefits
3. Development
[Figure: example profile attributes — dialing type (standard = 0.8, abbreviated = 0.2) and call destination (internal = 0.3, external = 0.1)]
The three activities needed to develop operational profiles, once the four basic
decisions are made, are sequential and must be done for each system mode.
They are:
1. Identify user types,
2. Develop the functional profile, and
3. Convert the functional profile to the operational profile.
The first two activities are commonly performed by system engineers. The
third activity is usually done by system designers (architects) and developers,
although system engineers may be involved.
3.2.1 Identify User Types. User types are sets of users who are expected
to employ the system in the same way. In order to identify user types, you
must first identify customer types for the system. A customer type is a set of
customers that are expected to have the same user types. For example, for
Fone Follower, educational organizations and medical organizations might
represent two different customer types. Different universities are, of course,
different customers, but they belong to the same customer type because they
can be expected to have the same user types.
Next, you consider each customer type and list all its user types. You then
create a consolidated list of user types, eliminating duplications. Continuing
with the Fone Follower example, suppose that the educational organizations
customer type only has the user type "users without pagers." Assume that
the medical organizations customer type has two user types, "users without
pagers" and "users with pagers." The consolidated user type list is:
For the user type "user with pagers" the function list is:
[Table: the function list (including "Update, Follow" and "Update, Voice mail") with occurrence data, and the conversion of initial functional occurrence probabilities to operational occurrence probabilities; e.g. the function "Forward call, nonpaging" (occurrence probability 0.54) converts to the operation "Follow" with occurrence probability 0.740]
Fig. 3.2. Conversion of explicit functional profile to explicit operational profile for Fone Follower
The operations Follow, Page, and Voice Mail will be critical.
If the critical operations occur rarely, we will need to create an additional
system mode that includes them, and devote enough test time to that system
mode to be able to assure with reasonable confidence that the failure intensity
objective for the critical operations can be met.
342 John D. Musa
4. Application
During the requirements and design phases and even part of the implemen-
tation phase, one employs the functional profile because function to opera-
tion mapping is still evolving and the operational profile isn't yet ready. The
functional profiles of the system modes are averaged, the system modes being
weighted by the proportion of execution time they represent. This average
functional profile and the critical function list are used to allocate system en-
gineering, system design, and implementation resources and priorities. They
are used to manage the potentially schedule-delaying requirements, design,
and code reviews so that they are maximally effective within the deadlines
they must meet. They are used to guide operational development, where de-
velopment is divided and managed by functions and then operations rather
than modules, and releases are scheduled so that the most used and most
critical operations are delivered first. Finally, they support the system engi-
neering of reduced operation software (ROS). As previously noted, ROS is
the software analog of RISC. The functional profile and critical function list
are used to highlight infrequently used, noncritical operations in the context
of what it costs to develop them. In many cases, the goals of these opera-
tions can be attained in other ways, perhaps by combining simpler operations
or by incorporating manual interventions. In many cases, the operations are
sufficiently unimportant that they can be eliminated.
Testing is done on a system mode basis. Recall that we may have a system
mode of critical operations that we provide with extra execution time so that
we can obtain sufficient confidence that it meets its failure intensity objective.
The operational profile for each system mode is used to manage the first stage
of test selection, choice of the operation that will be executed. The probability
that an operation will be selected for test is made to match the probability
that the operation occurs in the field. We use the operational scenario list
to bring out interactions between operations that occur. When an operation
is selected that starts an identified operational scenario, we execute the rest
of the scenario some proportion of the time before returning to operational
profile selection.
5. Research Questions
The area of operational profiles is young but very dynamic. Hence there are
many research needs and opportunities that will shape the practice of the
future. Two of the most important areas involve project trials. The concepts
of using operational profiles to system engineer reduced operation software
(ROS) and to guide operational development have been investigated to the
point of indicating feasibility and promise of substantial benefits. However,
they have not been extensively tested on projects. Project trials should de-
velop much useful information about how to best practice these two ideas.
6. Summary
Acknowledgement. The author is indebted to James Cusick for his helpful com-
ments.
1. Introduction
Over the last two decades, a considerable amount of effort has been devoted
to developing probability models for describing the failure of software. Such
models help assess software reliability, which is a measure of the quality of
software. Like hardware reliability, software reliability is defined as the prob-
ability of failure-free operation of a computer code for a specified period of
time, called the mission time, in a specified environment, called the oper-
ational profile; see, for example, Musa and Okumoto (1984). However, the
causes of software failure (a notion that will be made more precise later)
are different from those of hardware failure, and whereas hardware reliability
tends to decrease with mission time, software can, in principle, be
100% reliable for any mission time.
Software fails because there are errors, called "flaws" or "bugs" in the logic
of a software code. These flaws are caused by human error. Hardware fails
because of material defects and/or wear, both of which initiate and propagate
microscopic cracks that lead to failure. With hardware failures, the random
element is, most often, the time taken for the dominant microscopic crack to
propagate beyond a threshold. Thus meaningful probability models for the
time to hardware failure should take cognizance of the rates at which the
cracks grow in different media and under different loadings. With the failure
of software, the situation is quite different. We first need to be more precise
346 Nozer D. Singpurwalla and Refik Soyer
We have said before, that with hardware failures the random element is the
time it takes for a crack to propagate beyond a threshold. With software
failures it is the uncertainty about the presence, the location and the en-
counter with a bug that induces randomness. There are two types of random
variables that can be conceived, the first being binary and the second being
continuous. We shall first discuss the nature of the binary random variables
and propose some plausible probability models for it.
Suppose that X_i, i = 1, 2, ..., k is a binary random variable which takes
the value 1 if the i-th type of input results in a desired (correct) output within
Assessing the Reliability of Software: An Overview 347
its allowable service time; otherwise X_i takes the value zero. The number of
distinct input types is assumed to be k. Let p_i denote the probability that
X_i = 1. If p_i = p, i = 1, ..., k, and if the X_i's are assumed to be independent,
were p to be known, then a naive measure of the reliability of the software
would be p. If n ≤ k distinct input types were to be tested and Σ_{i=1}^n X_i
observed, then an estimator of p would be Σ_{i=1}^n X_i / n. If the number of distinct
input types can be conceptually extendible to infinity, then the sequence of
Xi's, i = 1,2, ... , could be judged exchangeable and by virtue of de Finetti's
representation theorem p would have a prior distribution π(p) which would
then be a naive measure of the reliability of the software. Correspondingly,
if Σ_{i=1}^n X_i / n were available, then the posterior distribution of p would be a
naive measure of the reliability of the software. We say that p (or its prior
and posterior distributions) are naive measures of the reliability, because in
assuming the conditional independence of the X_i's and the fact that p_i = p,
i = 1, ..., k, we have de facto ignored the possibility that some input types
may be encountered more often than the others, and that some input types
may not be encountered at all. A more realistic approach would be to assume
the Pi'S are generated by a common distribution which then describes the
reliability of the software. Assuming that the Pi'S are generated by a common
distribution entails modeling the joint distribution of the Pi'S by a two-stage
hierarchical model, as is done by Chen and Singpurwalla (1996). The idea of
a hierarchical two-stage model for Bernoulli data on software failures remains
to be explored.
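The Beta family is the standard conjugate choice for the prior π(p) in the Bernoulli setting just described. A minimal sketch of the naive Bayesian measure of reliability, with illustrative prior parameters and data (the numbers are assumptions for illustration):

```python
def beta_posterior(a, b, successes, trials):
    """Posterior Beta(a + s, b + n - s) for the probability p that an input
    type yields a correct output, after observing s successes in n trials
    (conjugate updating of a Beta(a, b) prior)."""
    return a + successes, b + (trials - successes)

def posterior_mean(a, b):
    """Posterior mean of p, a naive point measure of software reliability."""
    return a / (a + b)

# Uniform prior Beta(1, 1); suppose 48 of 50 tested input types succeed.
a, b = beta_posterior(1.0, 1.0, successes=48, trials=50)
print(posterior_mean(a, b))  # (1 + 48) / (2 + 50) = 49/52 ≈ 0.942
```

The hierarchical two-stage model mentioned in the text generalizes this by letting each p_i be drawn from a common distribution instead of forcing p_i = p.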
The second type of random variable used for modeling software reliability
pertains to the times between software failures. It is motivated by the notion
that the arrival times, to the software, of the different input types are random.
As before, those inputs which traverse through their designated paths in the
logic engine will produce desired outputs. Those which do not, because of
bugs in the engine, will produce erroneous outputs. For assessing software
reliability, one observes T_1, T_2, ..., where T_i is the time between the
(i − 1)st and the i-th software failure. With this conceptualization, even though
the failure of software is not generated stochastically, the detection of errors is
stochastic, and the end result is that there is an underlying random process
that governs the failure characteristics of software.
Most of the well known models for assessing software reliability are cen-
tered around the interfailure times T_1, T_2, ..., or the point processes that they
generate; see Singpurwalla and Wilson (1994). Sections 2 and 3 of this
paper provide an overview. Whereas the monitoring of time is conventional for
assessing reliability, we see several issues that arise when this convention is
applied to software reliability. For one, monitoring the times between failures
ignores the amount of time needed to process an input. Thus an input that is
executed successfully but which takes a long time to process will contribute
more to the reliability than one which takes a short time to process. Second,
also ignored is the fact that between two successive failure times there could
348 Nozer D. Singpurwalla and Refik Soyer
be several successful iterations of inputs that are of the same type. Thus,
in principle there could be an interfailure time of infinite length. Of course
one may argue that monitoring the interfailure times takes into account the
frequency with which the different types of inputs occur and in so doing the
assessed reliability tends to be more realistic than the one which assumes that
all the input types occur with equal frequency. In view of the above consider-
ations it appears that a meaningful way to model the software failure history
is by a marked point process (cf. Arjas and Haara 1984) wherein associated
with each inter-arrival time, say Z_i, i = 1, 2, ..., there is an indicator D_i,
with D_i = 1 if the i-th input is successfully processed and D_i = 0 otherwise.
Progress in this direction has been initiated by Eric Slud of the University of
Maryland at College Park (personal communication).
The point process approach to software reliability modeling has also been
considered by Miller (1986), Fakhre-Zakeri and Slud (1995), Kuo and Yang
(1995a) and by Chen and Singpurwalla (1995). These authors have been able
to unify most of the existing models in software reliability by adopting a
point process perspective. Some of this work is reviewed in Section 4 of this
paper.
2. Model Classification
Many of the proposed models for software reliability that are based on ob-
serving times between software failures can be classified into two categories:
Type I models, which describe the interfailure times themselves, and Type II
models, which describe the number of failures over time. Type I models that
specify the failure rates of the interfailure times are said to be of Type I-1;
those that model the times between failures directly, without reference to a
failure rate, are said to be of Type I-2. These models have the advantage over
Type I-1 in that they directly model the times between failure, which are
observable quantities, and not the more abstract failure rates, which are
unobservable. For example, as a simple case, one could declare that T_{i+1} =
ρT_i + ε, where ρ ≥ 0 is a constant and ε is a disturbance term (typically some
random variable with mean 0). Then ρ < 1 would indicate decreasing times
between failure (software reliability expected to become worse), ρ = 1 would
indicate no change in software reliability, whilst ρ > 1 indicates increasing
times between failure (software reliability expected to improve). The simple
relationship of this example is known as an auto-regressive process of order 1;
in general, one could say that T_{i+1} = f(T_1, T_2, ..., T_i) + ε for some function
f.
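The order-1 autoregression above is easy to simulate. The sketch below uses made-up values of ρ and of the disturbance variance, purely to illustrate how ρ > 1 tends to produce increasing interfailure times:

```python
import random

# Simulate the Type I-2 relationship T_{i+1} = rho * T_i + eps, with eps a
# mean-zero Gaussian disturbance. All numerical values are illustrative.
def simulate_interfailure_times(t0, rho, n, sigma=0.1, seed=42):
    random.seed(seed)
    times = [t0]
    for _ in range(n - 1):
        nxt = rho * times[-1] + random.gauss(0.0, sigma)
        times.append(max(nxt, 0.0))  # interfailure times cannot be negative
    return times

# With rho = 1.2 the deterministic part grows geometrically, so the
# interfailure times (and hence the assessed reliability) tend to increase:
ts = simulate_interfailure_times(t0=1.0, rho=1.2, n=20)
```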
The category labeled Type II, modeling the number of failures, uses a
point process to count failures. Let M(t) be the number of failures of the
software that are observed during time [0, t). Often M(t) is modeled by a
Poisson process with mean value function μ(t), where μ(t) is non-decreasing
and, for the purposes of this paper, differentiable. The mean number of fail-
ures at time t is given by μ(t). The different models of this type specify a
different function μ(t). The Poisson process is chosen because in many ways
it is the simplest point process to work with. The point process approach has
become increasingly popular in recent years. There is no reason why point
processes other than the Poisson could not be used.
used and has formed the basis for many models developed since. It is a Type
I-1 model; it models times between failure by considering their failure rates.
Jelinski and Moranda reasoned as follows. Suppose that the total number of
bugs in the program is N (which can be related to the size of the code), and
suppose that each time the software fails, one bug is corrected. The failure
rate of T_i is then assumed constant, proportional to N − i + 1, which is
the number of bugs remaining in the program. In other words
λ_i = λ(N − i + 1), i = 1, 2, ..., N, for some constant λ > 0. (3.5)
and that instead of λ_i decreasing with certainty, as is assumed in the JM
model, they merely required that the sequence of λ_i's be stochastically de-
creasing, i.e. P(λ_{i+1} < λ) ≥ P(λ_i < λ), for i = 1, 2, ... and λ ≥ 0.
If one assumes a gamma distribution for λ_i with shape parameter α and
scale Ψ(i), where Ψ is a monotonically increasing function of i, then
π(λ_i | α, Ψ(i)) = [Ψ(i)]^α λ_i^{α−1} e^{−λ_iΨ(i)} / Γ(α), λ_i ≥ 0, (3.6)
and the required ordering on the distribution of the λ_i's is achieved. The
function Ψ(i) is supposed to describe the quality of the programmer and
the programming task. The authors give equations for the distribution of T_i
from the instant of the (i − 1)st repair and from an arbitrary time-point,
and give an estimate of the instantaneous failure rate. They also investigate
the possibility of an unknown Ψ(i), and consider goodness-of-fit tests for
deciding on a suitable family of functional forms for Ψ. It can be shown that
the reliability function for T_i is given by
R_{T_i}(t | α, Ψ(i)) = [Ψ(i) / (Ψ(i) + t)]^α. (3.7)
(3.8)
Because one is able to specify a failure rate for this model, it is considered
to be of Type 1-1. This model has received quite a lot of attention and has been
the subject of various modifications; for example see the model of Littlewood
(1980).
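The reliability function (3.7) is simple to compute. The sketch below uses illustrative values of α and Ψ(i) to show that, since Ψ is increasing in i, the reliability at a fixed mission time t improves with each repair:

```python
# Littlewood-Verrall reliability function R_{T_i}(t) = (psi / (psi + t))**alpha,
# per equation (3.7). Parameter values below are illustrative only.
def lv_reliability(t, alpha, psi_i):
    return (psi_i / (psi_i + t)) ** alpha

# psi(i) grows with i, so reliability at fixed t = 1 improves after repairs:
r_early = lv_reliability(t=1.0, alpha=2.0, psi_i=5.0)    # (5/6)**2
r_later = lv_reliability(t=1.0, alpha=2.0, psi_i=10.0)   # (10/11)**2
print(r_early < r_later)  # True
```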
An alternate structure to the Littlewood and Verrall model was considered
in Soyer (1992), where the author considered E(λ_i | α, β) = αi^β, with values
of β < 0 (β > 0) implying that the λ_i's are decreasing (increasing). It was recognized
that the proposed model fits into the framework of general linear models,
and linear Bayes methods were used for inference. A generalization of the
model was presented by assuming that α and β are only locally constant, that
is, changing with i.
3.1.3 Imperfect Debugging Model (Goel and Okumoto 1978). This
is an attempt to improve upon the JM model by altering its assumption that
a perfect fix of a bug always occurs. Goel and Okumoto's Imperfect Debugging
Model is like the Jelinski and Moranda model, but assumes that there is a
probability p, 0 ≤ p ≤ 1, of fixing a bug when it is encountered. This means
that after i faults have been found, we expect i × p faults, instead of i faults,
to have been corrected. The failure rate of T_i is then proportional to N − p(i − 1).
T_i = θ_i T_{i−1}^λ. (3.13)
The authors then make the following assumptions, which greatly facil-
itate the analysis of this model. They assume the T_i's to be lognormally
distributed, that is to say that the log T_i's have a normal distribution, and that
they are all scaled so that T_i ≥ 1. The θ_i's are also assumed to be lognormal,
with median 1 and variance σ_θ² (the conventional notation is Λ(1, σ_θ²)). Then,
by taking logs on the relationship above, they obtain
(3.15)
where w_i is N(0, W_i) with W_i known. When λ is known, the expressions
for log T_i and θ_i together form a Kalman filter model, on which there is
also an extensive literature. When λ is not known, an adaptive Kalman filter
model results, for which there are no closed-form results; instead, the authors
propose an approximation.
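The Kalman filter machinery invoked above can be illustrated with a generic scalar predict/update cycle. This is a minimal textbook sketch under assumed state and observation variances, not the authors' exact specification, and all numerical values are made up:

```python
# One predict/update cycle of a scalar Kalman filter.
#   State:       s_i = G * s_{i-1} + w_i,  w_i ~ N(0, W)
#   Observation: y_i = F * s_i + v_i,      v_i ~ N(0, V)
# (m, C) is the posterior mean/variance for the previous state.
def kalman_step(m, C, y, F, G=1.0, W=0.05, V=0.1):
    a, R = G * m, G * G * C + W              # prior for the current state
    f, Q = F * a, F * F * R + V              # one-step forecast of y_i
    K = R * F / Q                            # Kalman gain
    return a + K * (y - f), R - K * F * R    # posterior mean and variance

# Filter three (made-up) observations, starting from a diffuse N(0, 1) prior:
m, C = 0.0, 1.0
for y in [0.8, 1.1, 0.9]:
    m, C = kalman_step(m, C, y, F=1.0)
```

As is typical, the posterior variance C shrinks as observations accumulate, while the mean m tracks the data.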
In Singpurwalla and Soyer (1992), the authors discuss inference for the
generalizations of these models when the variance σ_θ² is unknown, and con-
duct a comparison between two of the models for the θ_i's, the exchangeable
model and the adaptive Kalman filter model, using the "System 40" data
of Musa (1979). The former model is found to be more robust to the initial
choice of its parameters and is able to track the data better. In the case of
the adaptive Kalman filter model, the paper also discusses Bayesian inference
for the parameter λ.
3.1.7 Bayes Empirical Bayes or Hierarchical Model (Mazzuchi and
Soyer 1988). In 1988 Mazzuchi and Soyer proposed a Bayes Empirical
Bayes or Hierarchical extension to the Littlewood and Verrall model. As
with the original model, they assumed T_i to be exponentially distributed
with scale λ_i. Then they proposed two ideas for describing λ_i, here called
model A and model B.
Model A. Still assume that λ_i is described by a gamma distribution, but with
parameters α and β. Now assume that α and β are independent and that
they themselves are described by probability distributions; α by a uniform
and β by another gamma. In other words,
π(α | ν) = 1/ν, 0 ≤ α ≤ ν, (3.16)
(3.17)
and that Ψ(i) = β_0 + β_1 i, except now place probability distributions on α,
β_0 and β_1 as follows:
π(α | ω) = 1/ω, 0 ≤ α ≤ ω, (3.22)
(3.23)
and ĵ is given as
ĵ = (Σ_{i=1}^{n} m_i) / (Σ_{i=0}^{n−1} k_i). (3.24)
This is one of the first examples of the Type II class of models.
3.2.2 Time-dependent Error Detection Model (Goel and Okumoto
1979). This is the second Type II model that we will consider. First, the au-
thors make the assumption that the expected number of software failures to
time t, given by the mean value function μ(t), is non-decreasing and bounded
above. Specifically, μ(0) = 0 and lim_{t→∞} μ(t) = a, where a represents the
expected number of errors in the software. They also assume that the ex-
pected number of failures in the time interval (t, t + Δt) is proportional to
the number of undetected errors. This yields
μ(t) = a(1 − e^{−bt}), (3.26)
and hence
λ(t) = μ′(t) = abe^{−bt}.
The function μ(t) is used to define a Poisson process, and the distribution
of M(t) is given by the well known formula
P(M(t) = n) = e^{−μ(t)} [μ(t)]^n / n!, n = 0, 1, 2, .... (3.27)
Two assumptions of the JM model are modified here. First, the total number
of errors in the software is a random variable with mean a, contrasted with
the fixed but unknown number in the JM model. Secondly, the times between
successive failures are assumed dependent here, whilst the JM model assumes
independence. Goel and Okumoto claim that these modifications are a better
description of the actual occurrence of failures in software.
The authors present various relevant formulae; for instance, the distribu-
tion of T_i, given that the time to the (i − 1)st failure was t, is given by
(3.28)
Let t_1, t_2, ..., t_n be the observed times between successive failures. Maximum
likelihood estimators for a and b are the solutions to the equations
n/a = 1 − exp(−bs_n),
n/b = Σ_{i=1}^{n} s_i + as_n e^{−bs_n}, where s_i = Σ_{j=1}^{i} t_j, (3.29)
which must be obtained numerically.
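One way to solve the ML equations numerically is to eliminate a, which leaves a single equation in b that can be bracketed and solved by bisection. The sketch below uses fabricated interfailure-time data purely for illustration:

```python
import math

# ML equations (3.29) for the Goel-Okumoto model:
#   n/a = 1 - exp(-b*s_n)   and   n/b = sum_i s_i + a*s_n*exp(-b*s_n).
# Substituting a = n / (1 - exp(-b*s_n)) gives one equation g(b) = 0.
t = [10.0, 12.0, 15.0, 20.0, 30.0, 45.0, 70.0]   # made-up interfailure times
s = [sum(t[:i + 1]) for i in range(len(t))]       # cumulative times s_i
n, sn = len(t), s[-1]

def g(b):
    a = n / (1.0 - math.exp(-b * sn))
    return n / b - sum(s) - a * sn * math.exp(-b * sn)

# Bisection: g is positive near b = 0 and negative for large b.
lo, hi = 1e-6, 1.0
while hi - lo > 1e-12:
    mid = 0.5 * (lo + hi)
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid
b_hat = 0.5 * (lo + hi)
a_hat = n / (1.0 - math.exp(-b_hat * sn))
```

Note that a_hat always exceeds the number of observed failures n, as it should, since a is the expected total number of errors.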
Experience has shown that often the rate of faults in software increases
initially before eventually decreasing, and so in Goel (1983) the model was
modified to account for this by letting
(3.30)
where a is still the total number of bugs and band c describe the quality of
testing. Goel and Okumoto's model has spawned a plethora of similar non-
homogeneous Poisson process models, each based on different assumptions
as to the expected detection of errors. An overview of such models may be
found in Yamada (1991).
3.2.3 Logarithmic Poisson Execution Time Model (Musa and Oku-
moto 1984). The Logarithmic Poisson Execution Time Model of Musa
and Okumoto has gained much popularity in recent years. Unlike the model
of Goel and Okumoto, this model has not been motivated by directly pos-
tulating a form for the intensity function λ(t) of a Poisson process. Rather,
first λ(t) is modeled via μ(t), the expected number of failures in time [0, t),
via the relationship
λ(t) = λ / (λθt + 1),
μ(t) = ln(λθt + 1) / θ. (3.32)
P(M(t) = n) = [ln(λθt + 1)]^n / (θ^n (λθt + 1)^{1/θ} n!), n = 0, 1, ..., (3.33)
and that the density of T_{i+1}, given that S_i = τ, is
(3.34)
Estimation of the parameters of the model (3.33) has been done via the
method of maximum likelihood and via a Bayesian approach involving the use
of expert opinion. The method of maximum likelihood is described in detail
by Musa and Okumoto (1984); some difficulties in using this approach are
given by Campod6nico and Singpurwalla (1994). An outline of the Bayesian
approach, which applies to any non-homogeneous Poisson process (NHPP) is
given below; it is abstracted from Campod6nico and Singpurwalla (1994).
Consider an NHPP with a mean value function μ(t). Suppose that μ(t)
contains two unknown parameters, and suppose that an analyst (software
reliability assessor), A, asks an expert (software developer, debugger, user,
etc.), E, to think about μ(t), and to choose two points in time, say T_1 and
T_2, 0 < T_1 < T_2, for which E can provide opinions on μ(T_1) and μ(T_2). Let
μ_1 = μ(T_1) and μ_2 = μ(T_2).
Because μ_1 and μ_2 are unknown parameters, E treats them as random
quantities, and conceptualizes their distributions as P_{μ_1}(·) and P_{μ_2}(·), re-
spectively. Then, for each i (i = 1, 2), E declares to A two numbers m_i and s_i,
as measures of the location and the scale of P_{μ_i}(·), respectively. For example,
m_i and s_i may be declared as the mean and the standard deviation of P_{μ_i}(·).
It is important to bear in mind that even though E has declared the m_i and
the s_i to be measures of location and scale, it is possible that in A's mind,
what E declares may not reflect the true opinions of E, and the procedures
that follow provide for this possibility.
Suppose that for A, the model (3.31) is the appropriate one to consider.
That is, for t > 0, the mean value function of the NHPP is of the form
μ(t | θ, λ) = (1/θ) ln(λθt + 1). A Bayesian analysis of the NHPP requires that A
construct a joint prior distribution for the parameters (θ, λ), and our goal is
to show how the information provided by E can be used by A to induce the
required prior. To do this, A may first construct a joint prior distribution of
μ_1 and μ_2. For this, we observe that the system of equations
the expertise of E, and A's perceived correlation between M_1 and M_2, and
between S_1 and S_2. Since it is the same individual, namely the expert E, that
provides A information about both μ_1 and μ_2, and also since μ_1 < μ_2 < (T_2/T_1)μ_1,
it is reasonable to suppose that in A's opinion, M_1 and M_2 will be dependent,
and positively so. To summarize, given m_1, m_2, s_1 and s_2, A needs to obtain
P(μ_1, μ_2 | m_1, m_2, s_1, s_2), which incorporates the above dependencies and the
expertise of the expert.
To proceed further, A uses Bayes' law and writes
where ∝ denotes "proportional to". The second term on the right-hand side of
the above expression is A's prior opinion about J.t1 and J.t2, and the first term
is A's likelihood which, by the product rule of probability, can be factored as
A7. P_{S_1}(s_1 | μ_1, μ_2) is exponential with mean (μ_2 − μ_1). This implies that as
the disparity between μ_1 and μ_2 increases, the uncertainty about S_1 becomes
larger and larger.
A8. P(μ_1, μ_2) is relatively constant over the range of μ_1 and μ_2 on which the
likelihood is appreciable.
(3.36)
where 0 < μ_1 < μ_2 < (T_2/T_1)μ_1, Φ(u) = ∫_{−∞}^{u} exp{−x²/2}/√(2π) dx, and a, b, γ
and k are parameters specified by A.
Observe that A2 contains four parameters, a, b, 'Y and k; these are in-
troduced to capture A's view of the biases and the expertise of E. Thus, for
example, with b = 1, a denotes the amount of bias by which A believes that
E overestimates μ_2. If A thinks that E overestimates (underestimates) μ_2 by
10%, then a = 0 and b = 1.1 (0.9). If A thinks that E tends to exaggerate
(is overcautious about) the precision of E's assessment, then γ > 1 (γ < 1). The
parameter k describes A's views as to how cautious E is in discriminating
between μ_1 and μ_2. These parameters do impact the resulting prior. For
instance, a large value for γ will imply a large uncertainty on the predictions.
The joint prior distribution π(μ_1, μ_2) given by (3.36) can be evaluated
numerically for any specified values of T_1, T_2, m_1, m_2, s_1, s_2, a, b, γ and
k. Note that T_1, T_2, m_1, m_2, s_1 and s_2 are obtained via expert opinion; the
parameters a, b, γ and k bring flexibility into the analysis by allowing the
analyst to evaluate the expert's skills in a formal manner. If the analyst has
no opinion on the expertise of the expert, or chooses not to incorporate such
opinions into the analysis, then a = 0, b = 1, γ = 1 and k = 0 or 1.
The relationship between the parameters (θ, λ) of the logarithmic Poisson
execution time model and (μ_1, μ_2) is given by
and the (unconditional) probability of k failures in the interval (s, t], s < t,
for k = 0, 1, 2, ...:
The last two quantities given above are known as the predictive distri-
butions. These are used to provide a measure of uncertainty associated with
the predicted number of failures in a specified interval.
A computer code that facilitates computations involving the above inte-
grals has been developed by Campod6nico (1993), and can be made available
to potential users.
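The closed-form relationship between (θ, λ) and (μ_1, μ_2) is not reproduced here, but the inversion can be sketched numerically: for a fixed θ, the equation μ_1 = μ(T_1) determines λ, and θ is then found by bisection on μ_2 = μ(T_2). The elicited values below are made up, and a solution requires μ_1 < μ_2 < (T_2/T_1)μ_1, as noted above:

```python
import math

# Recover (theta, lam) of mu(t) = ln(lam*theta*t + 1)/theta from elicited
# values mu1 = mu(T1) and mu2 = mu(T2). Hypothetical illustration only.
def solve_mo_parameters(T1, T2, mu1, mu2, lo=1e-6, hi=50.0):
    def lam_of(theta):                       # exact, from mu1 = mu(T1)
        return (math.exp(theta * mu1) - 1.0) / (theta * T1)
    def h(theta):                            # residual of mu2 = mu(T2)
        return math.log(lam_of(theta) * theta * T2 + 1.0) / theta - mu2
    for _ in range(200):                     # bisection on theta
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:
            hi = mid
        else:
            lo = mid
    theta = 0.5 * (lo + hi)
    return theta, lam_of(theta)

theta, lam = solve_mo_parameters(T1=10.0, T2=100.0, mu1=5.0, mu2=12.0)
```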
4. Model Unification
From the material of the previous section it is apparent that unlike hardware
reliability where a few probability models like the Weibull play a dominant
time. They then model the software failure process as a self-exciting point
process (cf. Snyder and Miller 1991, p. 287) and show that all the models dis-
cussed in Section 3 including the Kalman filter based ones by Singpurwalla
and Soyer and by Chen and Singpurwalla are special cases of such processes.
Furthermore, the intensity function of the point process is indeed what soft-
ware engineers (like Jelinski and Moranda, Littlewood and Verrall, Schick
and Wolverton, etc.) refer to as the failure rate of software. This work, plus
the preceding papers by Miller, (F-Z)S and Koch and Spreij should signal a
shift in the paradigm of software reliability modeling from its current focus on
the failure rate to that of counting process theory and martingale dynamics.
The work of Kuo and Yang (1995a) is also noteworthy, because these
authors introduce the idea of using record value statistics (cf. Glick 1978) for
modeling software failures when new faults may be introduced during the
process of correcting other faults. The unifying theme of Kuo and Yang is the
use of the non-homogeneous Poisson process (NHPP); their focus of attention
is Bayesian inference using the Gibbs sampling approach. An overview of the
main ideas of Kuo and Yang is given next.
Suppose that at the beginning of software testing there is an unknown
number of faults, say N. Then, the first n ≤ N epochs of software failure can
be modeled as the first n order statistics of N independent and identically
distributed (i.i.d.) random variables having density f. This idea parallels that
of Miller (1986), who restricts attention to the case of f being exponential.
The authors refer to their set-up as the general order statistics (GOS) model.
When f is exponential, we get the model by Jelinski and Moranda. By varying
f we can obtain analogues to the Jelinski-Moranda model.
Let M(t) be the number of software failures in (0, t], and let μ(t) =
E(M(t)) be its expectation. We assume that μ(t) is differentiable, and let
λ(t) = dμ(t)/dt. Suppose that the prior on N is a Poisson with mean θ. Then
it can be shown (cf. Langberg and Singpurwalla 1985) that M(t) is an NHPP
with μ(t) = θF(t), where F is the cumulative distribution function of f, and
intensity function λ(t). With F(t) = 1 − e^{−βt}, μ(t) = θ(1 − e^{−βt}), and the
resulting process for M(t) is the model of Goel and Okumoto. Processes for
which N has a Poisson distribution and lim_{t→∞} μ(t) < ∞ are referred to by
Kuo and Yang as "NHPP-I" processes. Nonhomogeneous Poisson processes
with lim_{t→∞} μ(t) = ∞ are called "NHPP-II" processes. An example of an
NHPP-II process is the model by Musa and Okumoto (1984).
We now turn attention to record value statistics. Suppose that S_1, S_2, ...,
are independent and identically distributed random variables with density
function f. We define the sequence of record values {X_n}, n ≥ 1, and record
times R_k, k ≥ 1, as follows:
R_1 = 1,
R_k = min{i | i > R_{k−1}, S_i > S_{R_{k−1}}}, k ≥ 2, and
X_k = S_{R_k}, k ≥ 1.
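The definition of record times and record values translates directly into code. A short sketch, on an arbitrary made-up sequence:

```python
# Record times R_k and record values X_k of a sequence, per the definition
# above: R_1 = 1, and R_k is the first index after R_{k-1} whose value
# strictly exceeds the previous record.
def record_values(seq):
    times, values = [1], [seq[0]]
    for i, s in enumerate(seq[1:], start=2):   # 1-based indexing, as in the text
        if s > values[-1]:
            times.append(i)
            values.append(s)
    return times, values

R, X = record_values([3.0, 1.0, 4.0, 1.0, 5.0, 2.0])
print(R, X)  # records occur at positions 1, 3, 5 with values 3.0, 4.0, 5.0
```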
References
Snyder, D.L., Miller, M.I.: Random Point Processes in Time and Space. Second
Edition. New York: Springer 1991
Soyer, R.: Monitoring Software Reliability Using Non-Gaussian Dynamic Models.
Proceedings of the Engineering Systems Design and Analysis Conference 1,
419-423 (1992)
van Pul, M.C.: Asymptotic Properties of Statistical Models in Software Reliability.
Scand. J. Statist. 19, 235-254 (1992)
Yamada, S.: Software Quality/Reliability Measurement and Assessment: Software
Reliability Growth Models and Data Analysis. J. Inform. Process. 14, 254-266
(1991)
The Role of Decision Analysis in Software Engineering
Jason Merrick and Nozer D. Singpurwalla
Department of Operations Research, The George Washington University,
Washington, DC 20052, USA
Summary. There are many decisions involved in the creation of a reliable software
system. In this paper we demonstrate the use of Bayesian decision theory for mak-
ing decisions in software engineering. We give two examples of such decisions; the
first concerns the choice of a software house to use when an organization identifies a
particular software requirement. The second decision pertains to an optimal testing
strategy that a software house should adopt before releasing a piece of software.
We consider both single and multiple stage testing and utilize existing software
reliability models to determine the optimal rule.
1. Introduction
panies. To attain a higher level, certain key process areas (or KPA's) must
be satisfied, in addition to the preceding level. To facilitate a discussion of
the model, we introduce the following notation. Let
, if i = 1,2,3,4,
(2.1)
, if i=5.
For the i-th maturity level, there are n_i associated KPA's, where
K_{i,j} = 1 (0) denotes the satisfaction (or otherwise) of the j-th KPA
associated with the i-th level.
For the j-th KPA at the i-th level, there are r_{i,j} questions, where
X_{i,j,k} = 1 (0) denotes that the answer to the k-th such question is a
Yes (No).
The vector of questions for the j-th KPA associated with the i-th maturity
level, (X_{i,j,1}, ..., X_{i,j,r_{i,j}}), is denoted by Q_{i,j}. The set of all such
vectors for the i-th level is denoted by Q_i, and the responses to the entire
questionnaire are denoted by Q.
The CMM model can be represented by the tree in Figure 2.1, showing
the event hierarchy. This structure shows that for the event Mi to occur, the
Fig. 2.1. The event hierarchy of the CMM model
event M_{i−1} must occur and all of K_{i,1}, ..., K_{i,n_i} must take the value one.
The decision of whether the j-th KPA of the i-th level is satisfied or not, is
based upon
(1/r_{i,j}) Σ_{k=1}^{r_{i,j}} X_{i,j,k}.
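The statistic above is simply the fraction of Yes answers for a KPA. A minimal sketch follows; the 0.8 cut-off is a hypothetical choice for illustration, not one prescribed in the text:

```python
# Decide whether a KPA is satisfied from its questionnaire responses
# X_{i,j,k} (1 = Yes, 0 = No). The threshold is a made-up illustration.
def kpa_satisfied(answers, threshold=0.8):
    fraction = sum(answers) / len(answers)
    return fraction >= threshold

print(kpa_satisfied([1, 1, 1, 0, 1]))  # 4/5 of the answers are Yes -> True
```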
The decision tree is shown in Figure 2.2. The tree consists of one decision
node, V, the choice of which house to hire, and k random nodes, R_1, ..., R_k.
We denote by d_s the profit the company would make if they used the software
system offered by the s-th software house under the assumption of bug-free
operation. The random node for each decision, R_s for s = 1, ..., k, gives the
actual cost due to delays; this cost is denoted by the random variable C. The
utilities of the decisions to be made are thus given by
where
If all these values are below the utility for not purchasing a software system,
then the optimal decision is not to purchase a software system. Otherwise,
the optimal choice is the house with the highest expected utility.
for k ≥ 1,
for k = 0.
376 Jason Merrick and Nozer D. Singpurwalla
3.1.1 The Decision of Optimal Test Duration. The decision tree for
the optimal test duration, τ, is shown in Figure 3.1. The tree consists of two
Fig. 3.1. The decision tree for the optimal test duration in single-stage testing
decision nodes, V_1 and V_2, and two random nodes, R_1 and R_2. At V_1 we
choose a value for τ. At V_2 the only possible action is to release the software.
The failures of the system when tested for time τ are observed at R_1, and the
failures of the system after delivery are observed at R_2.
The utility function U_R(N, M(τ), τ) reflects the cost to the software house
of the remaining (N − M(τ)) bugs in the system delivered, after testing for
time τ and discovering M(τ) = k bugs at times t^{(k)}. Obviously, to make
a useful choice of the test duration a suitable form for this utility must be
specified. A plausible form is given in Section 3.1.2.
To obtain an optimal choice of τ we start at the terminus of the tree and
work backwards, following the principle of maximization of expected utility.
The expected utility at the random node R_2 is found by averaging the utility
of N bugs over the distribution of N given k failures during the test period
τ and running times t^{(k)}; thus
(3.3)
3.1.3 Application to an Error Counting Model. As the observable
quantity in the testing procedure is the number of failures in a fixed test pe-
riod, we must model the reliability of the software through an error-counting
model. There are many models that have been proposed for this purpose;
a model which has the qualities of simplicity and realism was proposed by
Jelinski and Moranda (1972). Although this model has come under much
criticism, it is sufficient for the purpose of demonstrating the optimal testing
strategy outlined in Section 3.1.
Letting T_1, T_2, ... denote the successive running times of the software
between failures, we define for t > 0
P(T_i > t | N, Δ) = e^{−Δ(N−i+1)t}. (3.4)
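The survival probability (3.4) and its reliability-growth behaviour can be illustrated directly; the parameter values below are made up:

```python
import math

# Jelinski-Moranda survival probability (3.4):
# P(T_i > t | N, Delta) = exp(-Delta * (N - i + 1) * t).
def jm_survival(t, i, N, delta):
    return math.exp(-delta * (N - i + 1) * t)

# As bugs are removed, the failure rate Delta*(N - i + 1) falls, so the
# survival probability at a fixed t increases with i:
p_first = jm_survival(t=1.0, i=1, N=10, delta=0.1)    # 10 bugs remain
p_last = jm_survival(t=1.0, i=10, N=10, delta=0.1)    # 1 bug remains
print(p_first < p_last)  # True
```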
The general approach, outlined above, for finding the optimal test dura-
tion for single-stage testing can be used with this model. This involves finding
expressions for the posterior distribution of N, given that M(τ) = k, t^{(k)} and
the test duration τ, and the prior predictive probabilities that M(τ) = k and
Σ_{i=1}^{M(τ)} T_i = t, given the test duration τ, for k = 0, 1, 2, .... The distribu-
tions, see Singpurwalla (1991), are given by
P(N = k + j | M(τ) = k, t^{(k)}) = w e^{−θ} θ^{k+j} / j! (μ + S + jτ)^{−(a+k)}, (3.5)
P(M(τ) = k | μ) = ∫_0^∞ [e^{−θ(1−e^{−λτ})} (θ(1 − e^{−λτ}))^k / k!] [e^{−μλ} (μλ)^{a−1} μ / Γ(a)] dλ; (3.6)
and
P(Σ_{i=1}^{M(τ)} T_i = t | M(τ) = k) = ∫_0^∞ [e^{−μλ} (μλ)^{a−1} μ / Γ(a)] [λ^k b^k e^{−λt} / (k − 1)!]
Σ_{j=0}^{k_0} (−1)^j (k choose j) (t − jτ)^{k−1} dλ, (3.7)
for k_0τ < t < (k_0 + 1)τ and k_0 = 0, 1, ..., (k − 1), where b = (1 − e^{−λτ})^{−1}.
The final expression for the expected utility given τ is given in Singpur-
walla (1991). The complexity of this expression is evident from its constituent
parts, (3.5), (3.6) and (3.7), which must be substituted in (3.2); the use of
computational methods is necessary for its calculation. A software implemen-
tation of this decision method is available. For details refer to the World Wide
Web page of The George Washington University's Institute of Reliability and
Risk Analysis (http://www.seas.gwu.edu/seas/institutes/irra).
Example. To use the above method to make an optimal choice of test duration
for a software system, the decision maker's prior beliefs must be specified. These
are the parameter θ of the Poisson distributed prior for N, and the shape, a, and
the scale, μ, of the gamma distributed prior for Δ. To illustrate the sensitivity
of the results to these input parameters, various values were chosen and the
expected utility curves were calculated.
The utility of delivering a useless, bug-ridden system, a_1, was chosen to
be −10. The utility of delivering a bug-free system, a_1 + a_2, was chosen to be
100, so a_2 was 110. The utility of fixing a bug discovered during testing, C,
was 0.1, and the utility of testing for time T, f(T), was chosen to be simply
T.
Figure 3.2 shows the curves of the expected utilities when θ takes the
values 5, 10, 15, 20, 25 and 30, with a and μ fixed at 10 and 1 respectively.
The path of the optimal test duration, τ_0, for each value of θ is shown. As
can be seen, τ_0 increases with θ, but the expected utility of a test of duration
τ_0 decreases.
Fig. 3.2. Total expected utility U(τ) as a function of test time τ, for θ = 5, 10, 15, 20, 25 and 30
Figure 3.3 shows the curves of the expected utilities for a taking the values
2, 3, 4 and 5, with θ and μ fixed at 10 and 1 respectively. Again the path of
the optimal test duration, τ_0, is shown. In this case, as a increases, both the
optimal test duration and the maximum utility increase.
Fig. 3.3. Total expected utility U(τ) as a function of test time τ, for a = 2, 3, 4 and 5
In the multiple-stage testing strategy used by Morali and Soyer (1995), the
software is tested until a bug is detected, located and corrected. A decision is
made whether to stop testing and release the software or to continue testing
for another stage. The decision at each stage is based on our belief of whether
testing for another stage would be beneficial. We therefore set up the problem
as a sequential decision problem and give a one-stage look ahead decision rule
for a given class of utility functions.
3.2.1 The Sequential Decision Problem. We denote the life length of
the software in the i-th stage of testing by T_i, for i = 1, 2, .... A common
view of software is that it does not age or deteriorate with time. Thus, it
is assumed that the failure rate is constant if the code is not changed. The
failure rate of the software during the i-th phase, i.e. the failure rate after
the (i − 1)st modification, is denoted by θ_i. Thus the random variables T_i
are exponentially distributed with parameter θ_i. At the end of the i-th stage
of testing, our decision is based upon T^{(i)} = {T^{(0)}, T_1, ..., T_i}, where T^{(0)}
denotes the prior information about the failure characteristics of the software
before testing.
Fig. 3.5. The decision tree for the software release problem
for further stages. Thus we have at stage i a decision node, V_i, where i =
0, 1, 2, ...; the choice at this node is whether to STOP and release the software
or to TEST the software for another stage.
The utility of a test of duration t is denoted U_D(t), and the utility associ-
ated with releasing a piece of software with failure rate θ is denoted U_R(θ).
The solution of the tree follows the usual path of maximization of expected
utilities as in the previous sections. This means that at each node we must
look at the expected utilities for the STOP and the TEST decisions and take
the maximum. After i stages of testing the expected utility of the STOP
decision is given by E[U_R(θ_{i+1}) | T^{(i)}] and the expected utility for the TEST
decision is given by E[U_D(T_{i+1}) | T^{(i)}] + U_{i+1}, where
U_i^{(δ)} = Σ_{j=1}^{δ} E[U_D(T_{i+j}) | T^{(i)}] + E[U_R(θ_{i+δ+1}) | T^{(i)}]
is the additional expected utility associated with testing for δ more stages
after the i-th modification to the software.
In Morali and Soyer (1995), a theorem is given that shows the existence of
an optimal stopping rule under certain conditions on the expected utilities.
It states that if E[U_D(T_j) | T^{(i)}] is increasing in j and E[U_R(θ_j) | T^{(i)}] is
discrete convex in j, for j = i + 1, ..., then the optimal stopping rule for the
tree in Figure 3.5 is
(3.11)
where ε_i ~ Beta(γα_{i−1}, (1 − γ)α_{i−1}) and ρ, γ and α_{i−1} are known non-
negative quantities, with 0 < γ < 1. The relationship in (3.11) implies that
θ_i < ρθ_{i−1}. It is next assumed that, given X^{(i−1)}, θ_{i−1} has a gamma distri-
bution with shape parameter α_{i−1} and scale parameter β_{i−1}.
A prior is specified on the failure rate of the software before testing through the parameters α_0 and β_0. The moments of the predictive distributions of the observables and the posterior distributions of the parameters can be found in closed form.
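The closed-form updating rests on a conjugate step that can be made concrete. As an illustrative sketch (not the full Chen and Singpurwalla model; the stage-to-stage propagation through ρ and γ is omitted here), a Gamma(α, β) prior on a constant failure rate θ combined with one observed exponential time-to-failure t gives a Gamma(α + 1, β + t) posterior:

```python
def update_gamma_exponential(alpha, beta, t):
    """Conjugate update: a Gamma(alpha, beta) prior (shape-rate) on the
    failure rate theta, combined with one observed exponential
    time-to-failure t, yields a Gamma(alpha + 1, beta + t) posterior."""
    return alpha + 1.0, beta + t

def posterior_mean(alpha, beta):
    # Mean of a Gamma(shape=alpha, rate=beta) distribution.
    return alpha / beta

# Example: prior Gamma(2, 2); observe a failure after t = 3 time units.
a, b = update_gamma_exponential(2.0, 2.0, 3.0)
print(posterior_mean(a, b))  # 3/5 = 0.6
```

The same shape/rate bookkeeping underlies the (α_i, β_i) recursions used below.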
Chen and Singpurwalla (1994) note that the parameter ρ provides information about whether the reliability of the software is being improved or not. When bugs are corrected it is possible that further bugs are introduced. If ρ < 1 then the failure rate of the software is strictly decreasing from one stage to the next. If ρ > 1 then the failure rate may be increasing. However, the value of ρ will be unknown and so a prior distribution is assigned. This prior is updated with the test data using the standard Bayesian machinery; the likelihood can be obtained from the predictive distribution of T_i given T_{i−1} and ρ. Thus we can track the growth or decay in the reliability of the software through the distribution of ρ given the lengths of the test stages, T_i for i = 1, 2, ....
3.2.3 A One-Stage Look Ahead Decision Rule. To apply the model proposed by Chen and Singpurwalla (1994) to the decision methodology outlined in Section 3.2.1, we must first specify the utility functions, U_D(T_j) and U_R(θ_{i+1}).
The utility function U_D(T_j) can be reasonably assumed to be decreasing in T_j. Defining the cost per time unit of testing as k_D, we obtain the utility function

U_D(T_j) = −k_D T_j.   (3.12)

If a company releases an unreliable piece of software, there will be an associated loss in profits. Morali and Soyer (1995) offer the following utility function to express this loss in terms of the failure rate of the released software

U_R(θ) = −k_R θ.   (3.13)
To use the optimal stopping rule for the i-th stage given in (3.10), the applicability of the theorem given in Morali and Soyer (1995) must first be shown for this model and these utility functions; this is examined for the cases ρ = 1 and ρ > 1. For the case ρ > 1, the sufficient conditions are γρ < 1 and γ > 0.5, while for ρ = 1 the sufficient condition is γ > 0.5.
As ρ is treated as an unknown, we assert the optimality of the one-stage look ahead rule in (3.10) using probability statements. The utilities for the stopping rule conditional on ρ are given in Morali and Soyer (1995) as

U_i^(0) = −{k_R (α_i/β_i)(γρ)},
U_i^(1) = −{k_D β_i/(ρ(γα_i − 1)) + k_R (α_i/β_i)(γρ)²}.   (3.14)

Using the posterior distribution of ρ, we can average the utilities in (3.14) to obtain the utilities U_i^(0) and U_i^(1) unconditional on ρ, as required for the stopping rule.
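The resulting one-stage look ahead comparison, conditional on ρ, can be sketched numerically. The expressions coded below are an assumption of this sketch, namely U_i^(0) = −k_R(α_i/β_i)γρ and U_i^(1) = −[k_D β_i/(ρ(γα_i − 1)) + k_R(α_i/β_i)(γρ)²], which require γα_i > 1:

```python
def u_stop(alpha, beta, gamma, rho, k_R):
    # Expected utility of releasing now: -k_R * E[theta_{i+1} | data],
    # with E[theta_{i+1} | data] = gamma * rho * alpha / beta.
    return -k_R * (alpha / beta) * gamma * rho

def u_test(alpha, beta, gamma, rho, k_D, k_R):
    # Expected utility of one more stage: expected testing cost plus the
    # release utility after one further update (needs gamma * alpha > 1).
    return -(k_D * beta / (rho * (gamma * alpha - 1.0))
             + k_R * (alpha / beta) * (gamma * rho) ** 2)

def stop(alpha, beta, gamma, rho, k_D, k_R):
    # One-stage look ahead: STOP when releasing now is at least as good.
    return u_stop(alpha, beta, gamma, rho, k_R) >= u_test(
        alpha, beta, gamma, rho, k_D, k_R)
```

With a large k_R relative to k_D, testing remains attractive until the posterior failure rate is small; averaging these quantities over the posterior of ρ gives the unconditional comparison.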
Thus we proceed by giving prior distributions for the reliability of the software system, running the software until a failure occurs, and updating the model using this test data. The decision is then made by first checking the conditions for optimality of the one-stage look ahead rule; if the probability that these conditions hold is sufficient, then the expected utilities for releasing the software, E[U_i^(0)], and for testing for a further stage, E[U_i^(1)], are computed. The decision rule, given in (3.10), is then applied. If the decision is to STOP then the software is released and our decision process is finished. Otherwise, another stage of testing must be performed and we effectively start from the beginning of the procedure using the posteriors obtained as our priors.
Example. A 100-point discretized beta distribution on the range [1, 2] was chosen as the prior on ρ; this prior distribution is discussed in Morali and Soyer (1995). The parameters of the beta distribution, c and d, were chosen to be 1.25 and 5 respectively. The prior parameters α_0 and β_0 were both chosen to be 2 and the parameter γ was given the value 0.8. The utility constants k_D and k_R were chosen to be 1 and 100,000 respectively.
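This discretized prior can be sketched directly; the function name and the mid-point grid placement are assumptions of this sketch, not details taken from Morali and Soyer (1995):

```python
def discretized_beta_prior(c, d, lo=1.0, hi=2.0, k=100):
    """k-point discretization of a Beta(c, d) density rescaled to [lo, hi].
    The mass at each grid point is proportional to the density there."""
    grid = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    dens = [((x - lo) / (hi - lo)) ** (c - 1) * ((hi - x) / (hi - lo)) ** (d - 1)
            for x in grid]
    total = sum(dens)                      # normalizing constant cancels
    return grid, [w / total for w in dens]

rho_grid, rho_prior = discretized_beta_prior(1.25, 5.0)
print(sum(rho_prior))  # approximately 1
```

With c = 1.25 and d = 5 the prior mass is concentrated near ρ = 1, i.e. near "no change in reliability from stage to stage".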
We note that this set-up does not guarantee the optimality of the one-stage look ahead decision rule, because γρ > 1 for ρ > 1.25. However, as can be seen from the plots in Figure 3.6, the probability that ρ > 1.25 decreases over subsequent stages of testing. The conditions for the optimality of the rule are therefore likely to hold at the later stages.
Figure 3.7 shows the expected additional costs for further testing of the
software after having tested it for 0, 2, 3 and 5 stages. It can be seen from
the graph showing the additional expected costs of further testing after the
fifth stage that under the one-step look ahead decision rule, given in (3.10),
one would release the software after stage 5.
Fig. 3.6. Distributions of ρ at the 0th, 2nd, 3rd and 5th stage of testing

Fig. 3.7. The expected additional costs of testing for further stages after the 0th, 2nd, 3rd and 5th stage

4. Conclusion
The Role of Decision Analysis in Software Engineering 387
Acknowledgement. This research was supported by the Army Research Office grant DAAH04-93-G-0020 and the Air Force Office of Scientific Research grant AFOSR-F49620-95-1-0107.
References
Chen, Y., Singpurwalla, N.D.: A Non-Gaussian Kalman Filter Model for Tracking
Software Reliability. Statistica Sinica 4, 535-548 (1994)
Dalal, S.R., Mallows, C.L.: When Should One Stop Testing Software? J. Amer.
Statist. Assoc. 83, 872-879 (1988)
van Dorp, J.R., Mazzuchi, T.A., Soyer, R.: Sequential Inference and Decision Mak-
ing During Product Development. Under review (1994)
Forman, E.H., Singpurwalla N.D.: An Empirical Stopping Rule for Debugging and
Testing Computer Software. J. Amer. Statist. Assoc. 72, 750-757 (1977)
Forman, E.H., Singpurwalla N.D. : Optimal Time Intervals for Testing Hypotheses
on Computer Software Errors. IEEE Trans. Rel. R-28, 250-253 (1979)
French, S.: Decision Theory: An Introduction to the Mathematics of Rationality.
New York: Wiley 1986
Humphrey, W.S.: Managing the Software Process. SEI (The SEI Series in Software
Engineering). Reading: Addison-Wesley 1989
Jelinski, Z., Moranda, P. B.: Software Reliability Research. Computer Performance
Evaluation. New York: Academic Press 1972, pp. 485-502
Langberg, N., Singpurwalla, N.D.: A Unification of Some Software Reliability Mod-
els. SIAM J. Sci. Statist. Comput. 6, 781-790 (1985)
Landry, C., Singpurwalla, N.D.: A Probabilistic Capability Maturity Model for
Rating Software Development Houses. Technical Report IRRA-TR-95/3. IRRA
(1995)
Littlewood, B., Verrall, J.L.: A Bayesian Reliability Growth Model For Computer Software. J. Royal Statist. Soc. C 22, 332-346 (1973)
Morali, N. , Soyer, R.: Optimal Stopping Rules for Software Testing. Under review
(1995)
Musa, J.D., Okumoto, K.: Software Reliability Models: Concepts, Classification, Comparisons and Practice. Electronic Systems Effectiveness and Life Cycle Costing. New York: Springer 1982, pp. 395-423
Okumoto, K., Goel, A.L.: Optimum Release Time For Software Systems, Based on
Reliability and Cost Criteria. J. Syst. Software 1, 315-318 (1980)
Paulk, M.C., Curtis, B., Weber, C.V.: Capability Maturity Model, Version 1.1.
IEEE Software, (1993a)
Paulk, M.C., Curtis, B., Weber, C.V.: Capability Maturity Model, Version 1.1. Technical Report CMU/SEI-93-TR-24. SEI (1993b)
Ross, S.M.: Software Reliability: The Stopping Rule Problem. IEEE Trans. Software Eng. SE-11, 1472-1476 (1985)
Singpurwalla, N.D.: Pre-Posterior Analysis in Software Testing. Statistical Data
Analysis and Inference. Amsterdam: North-Holland 1989
Singpurwalla, N.D.: Determining an Optimal Time Interval for Testing and Debug-
ging Software. IEEE Trans. Software Eng. SE-17, 313-319 (1991)
Yamada, S., Narihisa, H., Osaki, S.: Optimum Release Policies for a Software Sys-
tem with a Scheduled Software Delivery Time. J. Roy. Statist. Soc. B 54, (1984)
Zacks, S.: Sequential Procedures in Software Reliability Testing. In: Recent Ad-
vances in Life-Testing and Reliability. Boca Raton: CRC Press 1995, pp. 107-
126
Analysis of Software Failure Data
Refik Soyer
Department of Management Science, The George Washington University, Washington DC 20052, USA
1. Introduction
Analysis of software failure data is the most practical test of the validity of software reliability models. Implementation of the models presented by Singpurwalla and Soyer (1996) in this volume requires estimation of the unknown model parameters. In this chapter, we adopt the Bayesian point of view to analyze software failure data using some of the software reliability models. The Bayesian approach provides a coherent framework for making inference via the probability calculus and decision making via maximization of expected utility (see Merrick and Singpurwalla 1996 in this volume). In so doing, it also provides a formalism to incorporate expert opinion as discussed in Singpurwalla and Soyer (1996). In addition, Bayesian estimation does not suffer from the well documented difficulties of maximum likelihood estimation (see, for example, Meinhold and Singpurwalla 1983 and Campodónico and Singpurwalla 1994).
We consider Bayesian analysis of software failure data using four different models. For each model, we present details concerning Bayesian inference, and discuss what insights about the reliability of software can be obtained from the models when they are applied to real data. We also discuss comparison of the predictive performance of competing models. In Bayesian analysis of some of the models, the relevant posterior and predictive distributions cannot be obtained analytically. In such cases, posterior approximation methods such as the one proposed by Lindley (1980) and Markov chain Monte Carlo (MCMC) methods such as the Gibbs sampler (see, for example, Gelfand and Smith 1990) facilitate the Bayesian analysis. An overview of these methods is also given.
In Section 2, we discuss the hierarchical Bayes setup of the Littlewood-Verrall (1973) model proposed by Mazzuchi and Soyer (1988) and present inference results. We analyze the Naval Tactical Data System data of Jelinski and Moranda (1972) and compare two competing models considered by Mazzuchi and Soyer (1988). We also discuss the Gibbs sampling approach of Kuo and Tang (1995) to a generalization of these models. In Section 3, we present the analysis of the 'System 40' data of Musa (1979) using the Kalman filter type models of Singpurwalla and Soyer (1985, 1992) and Chen and Singpurwalla (1994). In Section 4 we present the Bayesian analysis of the logarithmic Poisson execution time model of Musa and Okumoto (1984) which was developed by Campodónico and Singpurwalla (1994).
π(β_0 | a, b, β_1) = (b^a / Γ(a)) (β_0 + β_1)^{a−1} e^{−b(β_0 + β_1)},  β_0 ≥ −β_1,

P(λ_n | t^(n)) = ∫∫∫ P(λ_n | t^(n), α, β_0, β_1) π(α, β_0, β_1 | t^(n)) dα dβ_0 dβ_1,   (2.1)

where P(λ_n | t^(n), α, β_0, β_1) is the conditional posterior distribution of λ_n given (α, β_0, β_1) and π(α, β_0, β_1 | t^(n)) is the joint posterior distribution of (α, β_0, β_1).
It can be shown (using the assumptions of Mazzuchi and Soyer) that given T_n = t_n, α, β_0, β_1, λ_n is independent of all other T_i's with density

p(λ_n | t^(n), α, β_0, β_1) = (λ_n)^α (t_n + β_0 + β_1 n)^{α+1} e^{−λ_n(t_n + β_0 + β_1 n)} / Γ(α + 1),   (2.2)

π(α, β_0, β_1 | t^(n)) ∝ L(α, β_0, β_1 | t^(n)) π(α, β_0, β_1),   (2.3)

where L(α, β_0, β_1 | t^(n)) is the likelihood function of (α, β_0, β_1) and π(α, β_0, β_1) is the prior, in which dependence on the hyperparameters is suppressed. It can be shown that
∫ U(θ) e^{Λ(θ)} dθ / ∫ e^{Λ(θ)} dθ,   (2.6)

where θ = (θ_1, θ_2, ..., θ_m) is an m-dimensional vector of parameters. For example, if U(θ) = θ, then the above integral is the mean of the posterior distribution of θ. Lindley's approximation is concerned with the asymptotic behavior of (2.6) as the sample size gets large. The idea is to obtain a Taylor series expansion of all the functions of θ in (2.6) about θ̂, the posterior mode. The approximation to (2.6) is:
E[U(θ) | t^(n)] ≈ U(θ̂) + (1/2) Σ_i Σ_j U_{i,j} σ_{i,j} + (1/2) Σ_i Σ_j Σ_k Σ_l Λ_{i,j,k} σ_{i,j} σ_{k,l} U_l,

where σ_{i,j} is the (i, j) element of the inverse of the matrix {−Λ_{i,j}} and

U_i = ∂U(θ)/∂θ_i |_{θ=θ̂},  U_{i,j} = ∂²U(θ)/∂θ_i∂θ_j |_{θ=θ̂},  Λ_{i,j,k} = ∂³Λ(θ)/∂θ_i∂θ_j∂θ_k |_{θ=θ̂}.
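In the scalar case (m = 1) the approximation reduces to E[U(θ) | data] ≈ U(θ̂) + ½U″(θ̂)σ² + ½Λ‴(θ̂)σ⁴U′(θ̂), with σ² = −1/Λ″(θ̂). A small sketch applies this to a case where the answer is known exactly: a Gamma(a, b) posterior with U(θ) = θ, for which the scalar Lindley formula recovers the exact posterior mean a/b:

```python
def lindley_posterior_mean(a, b):
    """Scalar Lindley approximation to E[theta | data] when the posterior
    is Gamma(a, b) (shape-rate), i.e. Lambda(theta) = (a-1) log theta - b theta.
    Here U(theta) = theta, so U' = 1 and U'' = 0."""
    mode = (a - 1.0) / b                    # posterior mode theta-hat
    sigma2 = mode ** 2 / (a - 1.0)          # -1 / Lambda''(theta-hat)
    lam3 = 2.0 * (a - 1.0) / mode ** 3      # Lambda'''(theta-hat)
    return mode + 0.5 * lam3 * sigma2 ** 2  # the U'' term vanishes

print(lindley_posterior_mean(10.0, 2.0))  # exact posterior mean is 10/2 = 5
```

The correction term moves the answer from the mode (a − 1)/b up to the mean a/b, which illustrates why the method works well when the posterior is smooth and unimodal.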
The data is based on trouble reports from one of the larger modules, the A-module. The data consists of the number of days between the 26 failures that occurred during the production phase of the software.
In analyzing the data, Mazzuchi and Soyer selected (arbitrarily) the values a = 10, b = 0.1, v = 500 for Model A. For Model B, the values a = 10, b = 0.1, c = 2, d = 0.25, w = 500 were selected so that, initially, the two models were similar. In particular, the above parameters were selected so that the prior distribution of α was the same for both models, and the prior distribution of β_0 + β_1 for Model B was the same as the prior distribution of β for Model A. The Lindley approximation was used by the authors to obtain the posterior means of the λ_i's, the predictive distributions, and the predictive means of the T_i's at each stage for both models.
Table 2.1 presents the actual times between failures along with the predictive means of the T_i's for each model. Except for an almost uniform difference, the behavior of the predictive means from the two models is very similar. As pointed out by Mazzuchi and Soyer (1988), the predictive means of the two models differ by β_1/(α − 1), given that β/(α − 1) for Model A is equivalent to (β_0 + β_1)/(α − 1) for Model B. This difference is due to the growth parameter β_1 of Model B.
The plot of the posterior means of λ_i (the posterior mean of the failure rate of the i-th time between failures) versus i gives an impression of the behavior of the failure rates from one stage to another. This in turn displays the overall effect of the modifications at each stage. This is shown in Figure 2.1. Though both models pick up an apparent reliability growth during the initial and later stages of testing and an apparent reliability decay during the middle stages, Model A is more responsive to the pattern changes present in the failure data. This is indeed understandable since the underlying structure of Model B is stronger due to the stochastic ordering assumption, and this assumption is at odds with the data observed in the middle stages.
Mazzuchi and Soyer (1987) analyzed the same data by using the posterior
approximation technique of Tierney and Kadane (1986) and obtained almost
identical results.
(2.10)

where p(t_i | t^(i−1), A) and p(t_i | t^(i−1), B) denote the predictive density of T_i given t^(i−1), evaluated at the observed value t_i, for models A and B respectively.
If the posterior ratio is greater than 1, then Model A is preferred to Model B; otherwise the reverse is true. Equation (2.10) provides a global measure for comparing the two models. An alternative strategy is to compare the predictive performance of the models with respect to each observation. Such a local measure is given by the likelihood ratio:

p(t_i | t^(i−1), A) / p(t_i | t^(i−1), B).   (2.11)
As before, if the likelihood ratio is greater than 1, then Model A is the preferred model for the i-th observation.
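Both measures are easy to compute once the one-step-ahead predictive densities are available; with equal prior model probabilities the global posterior ratio is simply the product of the local likelihood ratios. The numeric values below are hypothetical stand-ins for predictive densities evaluated at the observed failure times:

```python
def local_ratios(pred_A, pred_B):
    """Stage-by-stage likelihood ratios p(t_i | t^(i-1), A) / p(t_i | t^(i-1), B).
    pred_A and pred_B hold the one-step-ahead predictive densities of each
    model, each evaluated at the observed t_i."""
    return [a / b for a, b in zip(pred_A, pred_B)]

def posterior_ratio(pred_A, pred_B):
    # Global measure: with equal prior model probabilities the posterior
    # ratio of Model A to Model B is the product of the local ratios.
    r = 1.0
    for lr in local_ratios(pred_A, pred_B):
        r *= lr
    return r

# Hypothetical predictive densities at three observed failure times:
pA = [0.12, 0.30, 0.25]
pB = [0.10, 0.35, 0.20]
print(posterior_ratio(pA, pB))  # here > 1, so Model A is preferred overall
```

Note that the two measures can disagree: a model may win at most individual stages yet lose globally if it is badly wrong at a few observations.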
Mazzuchi and Soyer (1988) compared the predictive performances of the
two models using both global and local measures as shown in Figure 2.2.
Using only the global criterion, Model B would be preferred to Model A.
Fig. 2.1. Posterior means of λ_i at each testing stage for Models A and B

Fig. 2.2. Likelihood ratios and posterior ratios of Model A to Model B at each testing stage
and

θ_1^1 drawn from π(θ_1 | θ_2^0, ..., θ_m^0, t^(n)),   (2.12)
θ_2^1 drawn from π(θ_2 | θ_1^1, θ_3^0, ..., θ_m^0, t^(n)),   (2.13)

is generated and, under some mild regularity conditions, the distribution of θ^k converges to the posterior distribution, π(θ | t^(n)), as k → ∞; thus θ^k is a sample point from π(θ | t^(n)). Thus, to generate a sample from π(θ | t^(n)), one alternative is to generate s independent Gibbs sequences of k iterations each and use the k-th value from each sequence as a sample point from π(θ | t^(n)). For a more detailed discussion of the Gibbs sampler and other related Monte Carlo methods, see Gelfand and Smith (1990). Once a sample θ^1, θ^2, ..., θ^r is obtained from the posterior distribution π(θ | t^(n)), the marginal posterior distributions of the θ_j's and their moments can be approximated from the sample points θ_j^1, θ_j^2, ..., θ_j^r.
If the full conditional distributions are not of known distributional form
or if they do not exist in closed form, then to facilitate the implementation
of the Gibbs sampler, some random variable generation method such as the
adaptive rejection procedure of Gilks and Wild (1992) can be employed.
In analyzing Model B, Kuo and Tang (1995) assumed independent gamma distributions for the parameters (β_0, β_1), that is, (β_j | a_j, b_j) ~ Gamma(a_j, b_j), j = 0, 1. As before, (α | w) ~ Uniform(0, w).
Let λ = (λ_1, λ_2, ..., λ_n) and λ^(−j) = {λ_i : i ≠ j}, for j = 1, 2, ..., n. After n stages of testing, the implementation of the Gibbs sampler requires the full conditionals:

p(λ_j | λ^(−j), α, β_0, β_1, t^(n)), j = 1, 2, ..., n;
p(β_j | λ, α, β_{1−j}, t^(n)), j = 0, 1; and p(α | λ, β_0, β_1, t^(n)).
Specifying p(λ_j | λ^(−j), α, β_0, β_1, t^(n)) is easy, but the form for p(β_j | λ, α, β_{1−j}, t^(n)) is a complicated mixture. To alleviate this difficulty Kuo and Tang (1995) use data augmentation by introducing a latent variable z_j which has a binomial distribution with parameter α and cell probability r_j = β_1 j/(β_0 + β_1 j). Defining z = (z_1, z_2, ..., z_n), it can be shown that

(λ_j | λ^(−j), α, β_0, β_1, z, t^(n)) ~ Gamma(α + 1, t_j + β_0 + β_1 j),
(β_0 | λ, α, β_1, z, t^(n)) ~ Gamma(a_0 + Σ_{j=1}^n (α − z_j), b_0 + Σ_{j=1}^n λ_j),
(β_1 | λ, α, β_0, z, t^(n)) ~ Gamma(a_1 + Σ_{j=1}^n z_j, b_1 + Σ_{j=1}^n j λ_j),   (2.14)

where z_j ~ Bin(α, r_j) and the distribution of α is

(2.15)

The random variable α can be easily generated using the adaptive rejection procedure of Gilks and Wild (1992) or the Metropolis algorithm as used by Kuo and Tang (1995).
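A minimal sketch of this sampler follows, using only the Python standard library. Two simplifying assumptions are made: α is held fixed at an integer value (in Kuo and Tang (1995) it is sampled, e.g. by adaptive rejection or Metropolis), so that z_j ~ Bin(α, r_j) can be drawn as a sum of Bernoulli trials; and the data and hyperparameter values are purely illustrative:

```python
import random

random.seed(42)

def gibbs(t, alpha, a0, b0, a1, b1, iters=2000):
    """Gibbs sampler for (lambda_1..lambda_n, beta0, beta1) with alpha held
    fixed, cycling through the full conditionals in (2.14).
    random.gammavariate takes (shape, scale), so scale = 1 / rate."""
    n = len(t)
    beta0, beta1 = 1.0, 1.0
    lam = [1.0] * n
    for _ in range(iters):
        for j in range(1, n + 1):
            rate = t[j - 1] + beta0 + beta1 * j
            lam[j - 1] = random.gammavariate(alpha + 1, 1.0 / rate)
        # latent z_j ~ Bin(alpha, r_j) with r_j = beta1*j / (beta0 + beta1*j)
        z = [sum(random.random() < beta1 * j / (beta0 + beta1 * j)
                 for _ in range(alpha)) for j in range(1, n + 1)]
        beta0 = random.gammavariate(a0 + sum(alpha - zj for zj in z),
                                    1.0 / (b0 + sum(lam)))
        beta1 = random.gammavariate(a1 + sum(z),
                                    1.0 / (b1 + sum(j * lam[j - 1]
                                                    for j in range(1, n + 1))))
    return lam, beta0, beta1

lam, beta0, beta1 = gibbs([5.0, 7.0, 10.0, 12.0], alpha=2,
                          a0=1, b0=1, a1=1, b1=1)
```

In practice one would discard an initial burn-in and retain many draws (or several independent sequences) rather than the single final state returned here.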
The second class of models we will consider for data analysis are those which
directly model the time between failures. These were classified as Type 1-
2 models in Singpurwalla and Soyer (1996). In the sequel we will discuss
inference for two examples of these models.
(3.1)

C_n = (σ² + s_{n−1}) / (1 + y²_{n−1}(σ² + s_{n−1})),   (3.4)

and y_n/y_{n−1} is the least squares estimator of θ_n.
Similarly, the posterior of λ is also a Student-t density with γ_n degrees of freedom, mean m_n, and variance δ_n s_n/(γ_n − 2), where

(3.5)

Finally, the predictive distribution of Y_{n+1} given y^(n) is a Student-t with γ_n degrees of freedom, mean m_n y_n, and variance δ_n(1 + σ² + s_n)/(γ_n − 2). As noted by the authors, there is no tractable updating procedure when σ² is unknown. This model will be referred to as Model A.
The second model considered for θ_j was the autoregression

(3.7)

where w_j ~ N(0, σ²_w), with σ²_w known. Values of α < (>) 1 reflect a belief that the initial modifications show more (less) improvement than the latter ones, and α = 1 implies a maturing of the growth process. Singpurwalla and Soyer (1992) described uncertainty about α by a beta distribution over (a, b) with parameters β_1 and β_2. Uncertainty about θ_0 is described by a normal distribution with mean θ̄_0 and variance C_0, both specified. As noted by the authors, when α is not known an adaptive Kalman filter model results, for which there are no closed form results. The authors used Lindley's approximation for making inference in this model.
Given α, the posterior inference is obtained via the ordinary Kalman filter solution. For example, given y^(n), the conditional distribution p(θ_n | y^(n), α) is a Student-t distribution with degrees of freedom γ_n, variance δ_n C_n/(γ_n − 2), and mean θ̄_n, where

θ̄_n = (αθ̄_{n−1} + R_n y_n y_{n−1}) / (1 + R_n y²_{n−1}),  R_n = α² C_{n−1} + σ²_w,   (3.8)

C_n = R_n / (1 + R_n y²_{n−1}),  and  δ_n = δ_{n−1} + (y_n − αθ̄_{n−1} y_{n−1})² / (1 + R_n y²_{n−1}).   (3.9)
Figure 3.1 shows the plots of the posterior mean of θ_n under Models A and B. The plots suggest an overall growth in reliability, since the values of the posterior mean tend to hover above 1, at least during the initial stages of testing.
Figure 3.2 shows a plot of m_n (the mean of the posterior distribution of λ in Model A) versus stages of testing. The plot shows that for n ≥ 25, m_n settles down to a value of about 1.03. This suggests that the overall policy of making changes to the software results in a consistent growth in reliability. Figure 3.2 also shows the mode of the posterior distribution of α in Model B. The posterior mode α̂ is below 1 for n ≥ 2, settling down to the value 0.96 for n ≥ 25; this suggests that the θ_n's stochastically decrease in n, that is, the initial phases of testing lead to a larger growth in reliability than the latter ones. Thus it appears that the conclusions about reliability growth based on the two models are supportive of each other.
A comparison of the predictive performances of the two models was considered by the authors using the logarithm of the posterior ratios of Model A to B for each stage n. It was found that Model A is preferred to Model B.
Fig. 3.1. Posterior means of θ_n in Model A and in Model B, versus testing stage

Fig. 3.2. Posterior means of λ in Model A and posterior modes of α in Model B, versus testing stage
where u_n = C u_{n−1} + t_n. The one-step-ahead forecast distribution can also be obtained as

p(t_n | t^(n−1)) ∝ t_n^{w_n − 1} / (C u_{n−1} + t_n)^{σ_{n−1} + w_n}.   (3.13)

They noted that the value of the parameter C is critical in assessing whether the times between failures are increasing or decreasing, with values of C ≈ 1 implying a substantial growth in reliability, whereas values of C close to zero imply a drastic reduction in reliability. Intermediate values of C would imply a growth or decay in reliability depending on t^(n−1). The authors described uncertainty about C by a uniform distribution over (0, 1). As a result the closed form nature of the inference was lost, and the authors used a Gibbs sampler to simulate the posterior and predictive distributions.
As an alternative to the Gibbs sampler, we can use a discretization of the uniform density over (0, 1). If we consider a k-point discretization, then, given that T_n = t_n is observed, the posterior distribution of C is obtained via the standard Bayesian machinery as

(3.15)

where the likelihood term p(t_n | C_l, t^(n−1)) is the predictive density given by (3.13). Once the posterior distribution (3.15) is available, the unconditional posterior distribution of θ_n can be obtained by averaging out (3.12) with respect to this posterior distribution.
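The grid update (3.15) is a one-line application of Bayes' rule over the k support points. In the sketch below the predictive density (3.13) is replaced by a hypothetical stand-in, since only its role in the update matters here:

```python
import math

def update_C_posterior(prior, grid, pred_density, t_n, history):
    """One step of the k-point discretized update (3.15):
    posterior(C_l) is proportional to prior(C_l) * p(t_n | C_l, t^(n-1))."""
    post = [p * pred_density(t_n, c, history) for p, c in zip(prior, grid)]
    total = sum(post)
    return [w / total for w in post]

# Hypothetical stand-in for the predictive density (3.13): an exponential
# whose rate shrinks as C and the accumulated history grow.
def pred_density(t, c, history):
    rate = 1.0 / (1.0 + c * sum(history))
    return rate * math.exp(-rate * t)

k = 200
grid = [(l + 0.5) / k for l in range(k)]   # 200-point grid on (0, 1)
prior = [1.0 / k] * k                      # discretized uniform prior on C
post = update_C_posterior(prior, grid, pred_density, 3.0, [5.0, 7.0])
```

Repeating this step at each new failure time, with the previous posterior as the new prior, gives the sequentially updated distribution of C used in the System 40 analysis.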
Chen and Singpurwalla (1994) analyzed the System 40 data of Musa (1979) and compared the predictive performance of the model with that of the exchangeable model of Singpurwalla and Soyer (1985) using posterior ratios; they concluded that the non-Gaussian Kalman filter model outperformed the Singpurwalla and Soyer model. In what follows, we present an analysis of the System 40 data using the first 51 observations and a 200-point discretization of the uniform prior on C. Following Chen and Singpurwalla we choose w_n = v_n = σ_n = 2 for all n and u_0 = 500.
and L_n the logarithm of the posterior ratios (or product of the likelihood ratios), that is,
We consider the logarithmic Poisson execution time model of Musa and Okumoto (1984), which was discussed in Singpurwalla and Soyer (1996), with mean value function μ(t) = ln(λθt + 1)/θ. Following the expert opinion framework of Campodónico and Singpurwalla (1994), we assume that a joint prior probability distribution, π(μ_1, μ_2), is elicited for μ_1 = μ(T_1) and μ_2 = μ(T_2).
[Figure: logarithms of the likelihood ratios of Model A to Model B, versus testing stage]
(4.1)

(μ(t_j; μ_1, μ_2) − μ(t_{j−1}; μ_1, μ_2))^{n_j} e^{−(μ(t_j; μ_1, μ_2) − μ(t_{j−1}; μ_1, μ_2))} / n_j!
406 Refik Soyer
Once the posterior distribution, π(μ_1, μ_2 | D), is obtained, the quantities of interest discussed in Singpurwalla and Soyer (1996) can be evaluated by replacing the prior π(μ_1, μ_2) by the posterior π(μ_1, μ_2 | D). For example, the probability of k failures in the interval (s, t], s < t; k = 0, 1, 2, ..., is given by

(4.2)

and the (unconditional) expected number of failures in the interval (s, t], s < t, is:
(t, t + 1], for t = 0, 1, ..., 4, as obtained by Campodónico and Singpurwalla (1994), and the MLEs; in both cases, the authors use the data up to time t to predict the number of failures in the next hour. As pointed out by the authors, the MLE is not available for the first two intervals. The Bayesian prediction for the first interval is based on the prior alone. The authors also show that the mean square error (MSE) of the Bayesian predictions is lower than that of the MLE for the specific choice of prior parameters.
References
Mazzuchi, T.A., Soyer, R.: A Bayes Empirical-Bayes Model for Software Reliability. IEEE Trans. Rel. R-37, 248-254 (1988)
Meinhold, R.J., Singpurwalla, N.D.: Bayesian Analysis of a Commonly Used Model
for Describing Software Failures. Statistician 32, 168-173 (1983)
Merrick J., Singpurwalla, N.D.: The Role of Decision Analysis in Software Engi-
neering. In this volume (1996), pp. 368-388
Musa, J.D.: Software Reliability Data. IEEE Computing Society Repository (1979)
Musa, J.D., Okumoto, K.: A Logarithmic Poisson Execution Time Model for Software Reliability Measurement. Proceedings of the 7th International Conference on Software Engineering. Orlando 1984, pp. 230-237
Roberts, H.V.: Probabilistic Prediction. J. Amer. Statist. Assoc. 60, 50-61 (1965)
Singpurwalla, N.D., Soyer, R.: Assessing (Software) Reliability Growth Using a Random Coefficient Autoregressive Process and its Ramifications. IEEE Trans. Soft. Eng. SE-11, 1456-1464 (1985)
Singpurwalla, N.D., Soyer, R.: Non-Homogeneous Autoregressive Processes for
Tracking (Software) Reliability Growth, and Their Bayesian Analysis. J. Roy.
Statist. Soc. B 54, 145-156 (1992)
Singpurwalla, N.D., Soyer, R.: Assessing the Reliability of Software: An overview.
In this volume (1996), pp. 345-367
Tierney, L., Kadane, J.B.: Accurate Approximations for Posterior Moments and Marginal Densities. J. Amer. Statist. Assoc. 81, 82-86 (1986)
Part IV
Summary. This chapter gives a tutorial survey on the use of simple statistical techniques for the control of the runlength in simulation. The objective of the simulation study may be either short-term operational decision-making or long-term strategic decision-making. These decision types correspond with two types of simulation: terminating and steady-state simulations. First, terminating simulation is discussed. At the preliminary end of a simulation run, a confidence interval for the simulation response can be derived, using either the Student statistic or alternative statistics (in case of non-normal simulation responses). From the resulting interval the definitive runlength can be derived. Next, steady-state simulation is discussed. Such a simulation may be examined through renewal analysis. Both simulation types may have responses that are not expected values, but either proportions or quantiles. Whatever the simulation type or simulation response, the required length of the simulation run may be reduced through simple variance reduction techniques, namely common pseudorandom numbers, antithetic numbers, and control variates or regression sampling. Importance sampling is necessary in rare event simulation. Finally, a general technique, namely jackknifing, is presented to reduce possible bias of estimated simulation responses and to construct robust confidence intervals for the responses.
1. Introduction
The objective of this chapter is to give a tutorial survey on the use of simple statistical techniques for the control of the runlength in simulation. The following questions are addressed:
(i) How should the simulation run be initialized; for example, should a simulation of a repairman system start with all machines running?
(ii) How long should this run be continued; for instance, should 1000 machine breakdowns or one month be simulated?
(iii) How should the accuracy (or precision) of the simulation response be assessed: what is a (say) 90% confidence interval for the simulation response?
(iv) If this precision is too low, how much longer should the system be simulated (with fixed inputs)?
(v) To further improve the accuracy, can 'tricks' (Variance Reduction Techniques or VRTs) be used?
412 Jack P.C. Kleijnen
For didactic reasons it seems useful to consider the following repairman example (many more examples and references can be found in the survey, Jensen 1996, in this book). There are m machines that are maintained by a crew of r repairmen (mechanics). Machine j has a stochastic time between failures (say) X_{1j} with j = 1, ..., m; notice that stochastic variables are shown in capitals, their realizations in lower case letters. The time to repair machine j by repairman i (with i = 1, ..., r) is X_{2ij}; that is, mechanic i may be specialized in the repair of machine j. However, most analytical models assume that X_{1j} and X_{2ij} do not depend on i and j, which simplifies the notation to X_1 and X_2. In simulation this assumption is not necessary. Yet, to simplify the example, let us make the same assumption as in those analytical models; that is, X_1 and X_2 have negative exponential (Ne) distributions with rates λ (failure rate, the reciprocal of the Mean Time To Failure or MTTF) and μ (repair rate). Furthermore, different priority rules may be implemented: First-In-First-Out (FIFO), Shortest-Processing-Time (SPT), and so on. A flowchart for the simulation of this model is given in Kleijnen and Van Groenendaal (1992, pp. 108-109) (that chart, however, should be corrected: replace the variable TIME by TAT). A standard textbook on simulation is Law and Kelton (1991). Obviously this example is a Discrete-Event Dynamic System (DEDS): the system changes state at discrete, not necessarily equidistant, points of time.
Readers familiar with Markov analysis will notice that for the repairman system with Poisson failure and repair processes a complete state description is given by a single state variable (say) Y with y ∈ {0, 1, ..., m}, which denotes the number of machines that are running or 'up'. Obviously, the number of idle mechanics is uniquely determined by y: that number is max(r − (m − y), 0). Notice that since Poisson processes are memoryless, it is not necessary to know how long a particular machine has been running, or how long a particular mechanic has been working on a machine (also see the discussion on renewal analysis in Section 2.2). Let p_y denote the steady-state probability of the system being in state y. Obviously p_y also gives the steady-state percentage of time that the system is in state y.
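For this exponential case the p_y can be computed exactly from the product-form solution of the underlying birth-death chain on the number of failed machines; a sketch (the function name and the parameter values are illustrative assumptions):

```python
def steady_state_up_probs(m, r, lam, mu):
    """Steady-state distribution of Y = number of machines up, for the
    machine-repairman model with exponential failure rate lam per running
    machine and repair rate mu per busy repairman. Works on the birth-death
    chain over k = m - y = number of failed machines."""
    # Detailed balance: q[k] * (rate k -> k+1) = q[k+1] * (rate k+1 -> k).
    q = [1.0]
    for k in range(m):
        fail_rate = (m - k) * lam          # a running machine fails
        repair_rate = min(k + 1, r) * mu   # a busy repairman finishes
        q.append(q[-1] * fail_rate / repair_rate)
    total = sum(q)
    p_failed = [x / total for x in q]
    return {m - k: p_failed[k] for k in range(m + 1)}  # keyed by y = up

p = steady_state_up_probs(m=3, r=1, lam=0.1, mu=1.0)
print(1.0 - p[0])  # steady-state availability: P(at least one machine up)
```

Such an analytic benchmark is useful for validating a simulation program before it is applied to variants (priority rules, non-exponential times) that Markov analysis cannot handle.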
Management may be interested in several types of response (performance
measure, criterion). In a computer center they may be interested in the per-
centage of time that at least one machine is up (in the steady state, this
percentage is 1 − p_0). They may also be concerned about the percentage of
time that at least two machines are up (1 − p_0 − p_1), because customer ser-
vice is better when two computers (instead of one computer) are up: faster
turnaround time. However, for simplicity this chapter concentrates on a sin-
gle variable; for example, p = 1 − p_0. Multi-variate responses can be handled
through Bonferroni's inequality; see Kleijnen (1987).
Now consider the simulation of this system. Let Z denote simulated avail-
ability, defined as the percentage of simulated time that at least one machine
is running: 0 ≤ z ≤ 1. So the response of a simulation run is Z = Z(m, r). In
other words, a simulation run is a single time path that has fixed values for all
Simulation: Runlength Selection and Variance Reduction Techniques 413
its inputs. In this example, these inputs are m and r, and the parameters of
the input distributions λ (failure rate) and μ (repair rate). A special variable
is the pseudorandom number seed (say) R_0, which has positive integers as
realizations. Alternative sources for this seed will be discussed in the section
on VRTs (Section 4).
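Such a repairman run can be sketched in a few lines. Because all times are exponential, the state y (number of machines up) is Markovian, so a simple next-event recursion suffices; the function name, parameter values, and time horizon below are illustrative, not taken from the text.

```python
import random

def simulate_availability(m, r, lam, mu, horizon, seed=None):
    """Next-event simulation of the repairman model: m machines failing
    at rate lam each, r mechanics repairing at rate mu each. Returns Z,
    the fraction of [0, horizon] during which at least one machine is up."""
    rng = random.Random(seed)
    y = m                       # all machines up at the start
    t = up_time = 0.0
    while t < horizon:
        fail_rate = y * lam
        repair_rate = min(m - y, r) * mu
        total = fail_rate + repair_rate
        dt = min(rng.expovariate(total), horizon - t)
        if y >= 1:
            up_time += dt       # state is constant between events
        t += dt
        if t >= horizon:
            break
        if rng.random() < fail_rate / total:
            y -= 1              # next event is a failure
        else:
            y += 1              # next event is a repair completion
    return up_time / horizon
```

Reusing the same seed R_0 reproduces the run exactly, which is what the common-seeds VRT of Section 4 exploits.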
This chapter is organized as follows. Section 2 covers short-term opera-
tional decisions versus long-term strategic decisions, which correspond with
terminating and steady-state simulations respectively. Section 2.1 derives
confidence intervals for terminating simulations, using either Student's statis-
tic or alternative statistics (in case of non-normal simulation responses). From
this interval the number of necessary simulation runs in a terminating simu-
lation is derived (stopping rule). Section 2.2 covers steady-state simulations,
concentrating on renewal analysis of such simulations, including approximate
renewal states. Section 3 covers proportions and quantiles, as alternatives for
the expected value. Section 4 covers VRTs. Simple VRTs are common pseu-
dorandom numbers, antithetic numbers, and control variates or regression
sampling. Importance sampling is necessary in rare event simulation. Section
5 covers jackknifing, which is a general technique for reducing possible bias
and constructing robust confidence intervals. Section 6 gives a brief summary
and conclusions. This chapter is based on Kleijnen and Van Groenendaal
(1992, pp. 187-203).
Note: Questions such as 'how many mechanics should be hired, and which
priority rule should be selected?' are addressed in Kleijnen (1996), which is
the companion chapter in this volume.
month. Such simulations are called terminating. From the viewpoint of math-
ematical statistics (not from the viewpoint of Markov analysis), terminating
simulations are easier to analyze. For didactic reasons, these simulations will
be discussed first (in Section 2.1); steady-state simulations will follow (Section
2.2).
Other examples of terminating simulations are queueing problems in a
bank that is open only between 9 a.m. and 4 p.m.; peak hours in traffic in-
tersections and telephone exchanges (simulation starts before the peak and
finishes after the rush hour); simulations of the life of a machine (simula-
tion begins with installation of the machine and ends when the machine is
scrapped).
S_Z² = Σ_{h=1}^{d} (Z_h − Z̄)² / (d − 1)    (2.1)
t_{d−1} = (Z̄ − p) / (S_Z / √d)    (2.2)
Obviously, the Z_h are i.i.d. Then consider the following 1 − α one-sided con-
fidence interval for E(Z) = p:
p ≤ Z̄ + t_{d−1}^{α} S_Z / √d    (2.5)
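A minimal sketch of this recipe; the Student critical value is passed in by hand (here t_{9; 0.05} ≈ 1.833) and the replication responses are hypothetical numbers, to keep the sketch self-contained.

```python
import math
import random

def one_sided_ci(z, t_crit):
    """Sample mean and upper 1-alpha confidence bound for E(Z) = p from
    d i.i.d. replication responses z; the sample variance follows (2.1):
    S_Z^2 = sum_h (Z_h - Zbar)^2 / (d - 1)."""
    d = len(z)
    zbar = sum(z) / d
    s2 = sum((zh - zbar) ** 2 for zh in z) / (d - 1)
    return zbar, zbar + t_crit * math.sqrt(s2 / d)

# d = 10 hypothetical terminating replications of the availability Z:
rng = random.Random(42)
z = [0.9 + 0.05 * rng.random() for _ in range(10)]
zbar, upper = one_sided_ci(z, t_crit=1.833)   # t_{9; 0.05}
```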
By definition, the steady state is reached only after a very long simulation
run. In practice, the start-up phase is often eliminated. Next the approach
of the preceding subsection (Section 2.1) could be applied. However, given d
runs, the transient phase is then eliminated d times: waste of computer time.
Moreover, it is not known when exactly the transient phase is over. Therefore
suppose the simulationists execute a single long run (not several replicated
runs); also see Kleijnen and Van Groenendaal (1992, pp. 190-191).
Assume a wide-sense stationary process. Most of these processes have positive
autocorrelation; for example, if a machine must wait long for a repairman to
become available, then the next machine that breaks down must probably
wait long too. This positive correlation implies that the traditional formula
for the variance estimation based on (2.1) has large bias; for example, for an
M/M/1 model it is known that a traffic load of 0.5 gives an estimate that
is wrong by a factor 10; for a 0.9 load this factor becomes 360 (see Kleijnen
all machines are up again. Cycle responses are i.i.d. Also see Muppala et al.
(1996) in this volume, and the discussion on nearly renewal states at the end
of Section 2.2.
Denote the length of the renewal cycle by L, and the cycle response (for
example, availability time during a cycle) by W. Then it is well-known that
the steady-state mean response (availability percentage) is

p = E(W) / E(L)
To keep the notation simple, we ignore the fact that 0.80d, l, and u are not
necessarily integers. Actually these three real variables must be replaced by
their integer parts.
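One common textbook recipe for this (the point estimate is the order statistic with index equal to the integer part of q·d; the indices l and u below come from the normal approximation to the binomial, an assumption of this sketch, not necessarily the formulas of the cited references):

```python
import math

def quantile_point_estimate(z, q):
    """Distribution-free point estimate of the q-quantile of the response:
    the order statistic with index int(q * d), taking integer parts as the
    text remarks (0-based indexing below)."""
    zs = sorted(z)
    return zs[max(int(q * len(zs)) - 1, 0)]

def quantile_ci_indices(d, q, z_crit):
    """Order-statistic indices l and u bounding a two-sided distribution-free
    confidence interval for the q-quantile, via the normal approximation to
    the binomial; z_crit is the standard-normal critical value."""
    center = q * d
    spread = z_crit * math.sqrt(d * q * (1 - q))
    l = max(int(center - spread), 1)
    u = min(int(center + spread) + 1, d)
    return l, u
```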
The estimation of proportions and quantiles in terminating and steady-
state simulations is further discussed in Kleijnen (1987, pp. 36-40) and Klei-
jnen and Van Groenendaal (1992, pp. 195-197).
In the what-if approach there is not so much interest in the absolute magni-
tudes of the results of the simulation, as in the differences among the results
for various values of the parameters (such as λ and μ) and input variables (m
and r). Therefore it seems intuitively appropriate to examine simulated sys-
tems under equal conditions (environments). For example, when comparing
different numbers of mechanics (say) r_1 and r_2, then the simulation may use
the same times between successive failures of machines (denote the succes-
sive realizations of X_1 by X_{1t}; that is, by X_{11}, X_{12}, ...). This implies the use
of the same stream of pseudorandom numbers for system variants #1 and
#2 (r_1 and r_2 repairmen respectively). In that case the two responses Z(r_1)
and Z(r_2) are correlated. Hence
var[Z(r_1) − Z(r_2)] = var[Z(r_1)] + var[Z(r_2)] − 2ρ[Z(r_1), Z(r_2)] √(var[Z(r_1)] var[Z(r_2)])    (4.1)
where ρ[Z(r_1), Z(r_2)] denotes the linear correlation coefficient between Z(r_1)
and Z(r_2). So if the use of the same pseudorandom numbers results in positive
correlation, then the variance of the difference decreases.
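The variance reduction in (4.1) is easy to demonstrate numerically. The response function below is a toy stand-in for Z(r), not the repairman model; only the seed handling matters. With a common seed the two responses are strongly positively correlated, so the variance of their difference drops.

```python
import random
import statistics

def response(r, seed):
    # Toy Z(r): average of 100 exponential 'waiting times' with rate r.
    rng = random.Random(seed)
    return sum(rng.expovariate(r) for _ in range(100)) / 100

def diff_estimates(d, common):
    """d replications of Z(r1) - Z(r2), with common or independent seeds."""
    diffs = []
    for h in range(d):
        s2 = h if common else 10_000 + h   # reuse seed h, or take a fresh one
        diffs.append(response(1, h) - response(2, s2))
    return diffs

var_crn = statistics.variance(diff_estimates(200, common=True))
var_ind = statistics.variance(diff_estimates(200, common=False))
# var_crn comes out several times smaller than var_ind.
```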
In complicated models it may be difficult to realize a strong positive corre-
lation. Therefore separate sequences of pseudorandom numbers are used per
'process'; for example, in the repairman example use one seed for the times
between failures (X_1), and a different seed for the repair times (X_2). How
should these seeds be selected? One seed may be sampled through the com-
puter's internal clock. However, sampling the other seed(s) in this way may
cause overlap among the various streams (making failure and repair times de-
pendent). For certain generators, there are tables with seeds 100,000 apart.
and substitute this U for Z in equations (2.1) through (2.3), to find a confi-
dence interval for the mean difference.
However, there are more than two responses when Design Of Experiments
(DOE) is used. Then regression analysis or Analysis of Variance (ANOVA) is
applied. Classic ANOVA, however, assumes independent responses. Now we
can use either Generalized Least Squares (GLS) or Ordinary Least Squares
(OLS) with adjusted standard errors for the estimated regression parameters.
In practice this complication is often overlooked. For further discussion see
Kleijnen (1996), the companion chapter in this volume.
β = ρ(X_1, Z) σ_Z / σ_{X_1}    (4.5)
In practice, however, the correlation ρ(X_1, Z) is unknown. So this cor-
relation is estimated. Actually, replacing the three factors in the right-hand
side of the preceding equation by their classic estimates results in the OLS
estimate (say) β̂ of the regression parameter β in the regression model
Z_h = β_0 + β X_{1;h} + e_h    (4.6)
Therefore the technique of control variates is also called regression sampling.
Obviously, β in the latter equation is estimated from the d replications that
give d i.i.d. pairs (X_{1;h}, Z_h).
The OLS estimator of β (the optimal correction coefficient (4.5) for the
control variate estimator (4.4)) gives a new control variate estimator. Let
Z̄ denote the sample average of the responses Z_h, X̄_1 the average over d
replications of the average failure time per run, and β̂ the OLS estimator
of β in (4.6) or in (4.5) based on the d pairs (Z_h, X̄_{1;h}). Then the new control
variate estimator is
Z̄_c = Z̄ − β̂ (X̄_1 − 1/λ)    (4.7)
This formula can be easily interpreted, when we remember that the estimated
regression model goes through the center of gravity, (X̄_1, Z̄).
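The estimator (4.7) can be sketched as follows; the toy data-generating model and the known mean E(X_1) = 1.0 are illustrative assumptions.

```python
import random

def control_variate_estimate(z, x1, x1_mean):
    """Control-variate (regression sampling) estimate of E(Z):
    Zbar - beta_hat * (X1bar - E(X1)), with beta_hat the OLS slope of
    the responses z on the control variate x1 whose mean x1_mean is known."""
    d = len(z)
    zbar = sum(z) / d
    xbar = sum(x1) / d
    sxx = sum((x - xbar) ** 2 for x in x1)
    sxz = sum((x - xbar) * (zz - zbar) for x, zz in zip(x1, z))
    return zbar - (sxz / sxx) * (xbar - x1_mean)

# Toy experiment: Z depends linearly on X1 (E(X1) = 1.0) plus small noise,
# so E(Z) = 0.5 + 0.3 = 0.8.
rng = random.Random(7)
x1 = [rng.expovariate(1.0) for _ in range(50)]
z = [0.5 + 0.3 * x + 0.05 * rng.gauss(0.0, 1.0) for x in x1]
zc = control_variate_estimate(z, x1, x1_mean=1.0)
```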
The example can be extended: take as control variates not only the time
between failures (X_1), but also the repairmen's service time (X_2). This re-
quires multiple regression analysis. A better idea may be to use the traffic
load, (λ/μ)(m/r). Actually, the explanatory variables in the regression model
may be selected such that the multiple correlation coefficient R², adjusted for
the number of control variates, is maximized.
A complication is that estimation of β leads to a biased control variate; see
(4.7). Moreover, the construction of a confidence interval for E(Z) becomes
problematic. These problems can be solved, either assuming multivariate nor-
mality for (Z, X_1, X_2) or using the robust technique of jackknifing (see
Section 5).
The preceding VRTs relied on the correlation between the responses of com-
parable simulated systems (common seeds, Section 4.1), or between the res-
ponses of antithetic runs (Section 4.2), or between input and output of a run
(Section 4.3). The simulation program itself was not affected; only seeds were
changed or inputs were monitored. Importance sampling, however, drastically
changes the sampling process of the simulation model. This technique is more
sophisticated, but it is necessary when simulating rare events; for example,
in a dependable system unavailability occurs with a probability of (say) one
in a million replicated months. But then a million replicated months must be
simulated, to expect to see (only) one breakdown of the system!
The basic idea of importance sampling is to change the probabilities of
the inputs such that the probability of the response increases; of course the
resulting estimator must be corrected in order to get an unbiased estimator.
This idea can be explained simply in the case of non-dynamic simulation, also
known as Monte Carlo sampling, as the following example demonstrates.
Consider the integral

ξ = ∫_{1/ρ}^{∞} (1/x) λe^{−λx} dx,  with λ > 0, ρ > 0    (4.8)
The value of this integral can be estimated (other techniques are integral cal-
culus and numerical approximation). Crude Monte Carlo proceeds as follows.
(i) Sample x from Ne(λ).
(ii) Substitute the sampled value x into the 'response'

g(x) = 1/x  if x > 1/ρ;  g(x) = 0  otherwise    (4.9)
E[g*(X)] = ∫_0^∞ g(x) (f(x)/h(x)) h(x) dx = ∫_0^∞ g(x) f(x) dx = ξ    (4.11)
It is quite easy to derive the optimal form of h(x), which results in minimum
variance.
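Both estimators for this integral fit in a few lines. Sampling from Ne(λ_0) with λ_0 < λ pushes probability mass into the tail x > 1/ρ; the particular choice λ_0 = 0.2 below is our illustrative assumption, not a derived optimum.

```python
import math
import random

def crude_mc(lam, rho, n, seed=0):
    """Crude Monte Carlo for xi: sample X from Ne(lam) and average
    g(X) = 1/X if X > 1/rho, else 0 (cf. (4.9))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam)
        if x > 1.0 / rho:
            total += 1.0 / x
    return total / n

def importance_sampling(lam, lam0, rho, n, seed=0):
    """Importance sampling: sample X from Ne(lam0) and weight g(X) by the
    likelihood ratio f(x)/h(x) = (lam/lam0) * exp(-(lam - lam0) * x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam0)
        if x > 1.0 / rho:
            total += (lam / lam0) * math.exp(-(lam - lam0) * x) / x
    return total / n

# lam = 1, rho = 0.2: the event X > 5 is fairly rare (probability e^-5).
est_crude = crude_mc(lam=1.0, rho=0.2, n=20_000, seed=1)
est_is = importance_sampling(lam=1.0, lam0=0.2, rho=0.2, n=20_000, seed=1)
```

The importance-sampling estimate hits the region x > 1/ρ far more often, so its variance is much smaller for the same n.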
For dynamic systems (such as the repairman simulation) a sequence of
inputs must be sampled; for example, successive times between machine fail-
ures X_{11}, X_{12}, .... These inputs are assumed to be i.i.d., so their joint density
function is given by
f(x_{11}, x_{12}, ...) = λe^{−λx_{11}} · λe^{−λx_{12}} · ...    (4.12)
Suppose crude Monte Carlo and importance sampling use the same type of
input distribution (negative exponential) but with different parameters, λ
and λ_0 respectively. Then the likelihood ratio becomes

f(x_{11}, x_{12}, ...) / h(x_{11}, x_{12}, ...) = (λ/λ_0)e^{−(λ−λ_0)x_{11}} · (λ/λ_0)e^{−(λ−λ_0)x_{12}} · ...
5. Jackknifing
M = Z_{(0.5d)}    (5.1)

where Z_{(h)} still denotes the order statistic; actually, 0.5d should be replaced
by its integer part (see equation 3.4).
Now partition those d observations into (say) g groups of equal size
v (= d/g); g may be equal to d (so v = 1). We shall concentrate on the case
of groups of size one. Eliminate one observation, say, observation h (with
h = 1, ... , d). Calculate the same estimator from the remaining (d - 1) obser-
vations. For example, after dropping the first observation Zl, recalculate the
median. Denote the order statistic after eliminating observation h by Z_{−h;(j)}
with j = 1, ..., d − 1; for example, after eliminating observation 2 the biggest
observation is Z_{−2;(d−1)}. In the example of the median, dropping observation
1 gives the estimator of the median
Each time, eliminate another observation. This gives d estimators. The hth
pseudovalue (say) P_h is defined as the following linear combination of the
original and the h-th estimator of (say) the median:
P_h = d M − (d − 1) M_{−h}    (5.4)
It can be proved that if the original estimator is biased, then the jackknifed
estimator has less bias.
Moreover, jackknifing gives the following robust confidence interval. Treat
the d pseudovalues P_h as d i.i.d. variables: compute the 1 − α confidence
interval from the Student statistic with d − 1 degrees of freedom, using (2.1)
through (2.3), replacing Z by P.
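The pseudovalue recipe and the resulting Student interval can be sketched as follows; the data, the critical value t_{9; 0.025} ≈ 2.262, and the simple 'upper' median are illustrative choices.

```python
import math

def jackknife_pseudovalues(estimator, data):
    """Pseudovalues P_h = d*T - (d-1)*T_{-h}, where T is the estimator on
    all d observations and T_{-h} the same estimator with observation h
    eliminated (groups of size one)."""
    d = len(data)
    t_all = estimator(data)
    return [d * t_all - (d - 1) * estimator(data[:h] + data[h + 1:])
            for h in range(d)]

def student_interval(pseudo, t_crit):
    """Jackknifed point estimate and confidence half-width, treating the
    pseudovalues as i.i.d. and using Student's statistic with d-1 df."""
    d = len(pseudo)
    pbar = sum(pseudo) / d
    s2 = sum((p - pbar) ** 2 for p in pseudo) / (d - 1)
    return pbar, t_crit * math.sqrt(s2 / d)

def median(xs):
    return sorted(xs)[len(xs) // 2]   # order statistic Z_(int(.5 d)), cf. (5.1)

data = [1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 1.1, 0.9, 2.2, 1.6]
pseudo = jackknife_pseudovalues(median, data)
est, hw = student_interval(pseudo, t_crit=2.262)   # t_{9; 0.025}
```

For the sample mean the pseudovalues reduce to the observations themselves, which is a quick sanity check on the recipe.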
Let us consider one more example. The VRT of control variates was based
on d i.i.d. pairs (Z_h, X_{1;h}); see Section 4.3. Now eliminate pair h, and calculate
the control variate estimator, using (4.7):
Z̄_{−h;c} = Z̄_{−h} − β̂_{−h} (X̄_{−h;1} − 1/λ)    (5.5)
where Z̄_{−h} denotes the sample average of the responses after elimination
of Z_h, X̄_{−h;1} denotes the average failure time after eliminating run h with
its average X̄_{1;h}, and β̂_{−h} is the OLS estimator based on the remaining
d − 1 pairs. (Note that E(X̄_{−h;1}) = E(X_1) = 1/λ.) This Z̄_{−h;c} gives the
pseudovalue
P_h = d Z̄_c − (d − 1) Z̄_{−h;c}    (5.6)
where Z̄_c is the control variate estimator based on all d pairs; see (4.7).
Jackknifed renewal analysis (Section 2.2) is discussed in Kleijnen and Van
Groenendaal (1992, pp. 202-203); jackknifed GLS (Section 4.1) is discussed
in Kleijnen et al. (1987).
Jackknifing is related to bootstrapping, which samples from the set of
d observations; see Efron (1982), Efron and Tibshirani (1993), and Cheng
(1995).
6. Conclusion
This chapter addressed the following questions (also see the introduction,
Section 1):
(i) How to initialize the simulation run?
This chapter emphasized the distinction between terminating and steady-
state simulations. In a terminating simulation we may start with the situation
at the end of the last replication; in a steady-state simulation we may start with
all machines running, if that is a renewal state.
(ii) How to assess the accuracy of the simulation response at the end of the
simulation run?
Accuracy may be quantified by a 1 − α confidence interval for the simulation
response. This interval may assume normality (Student's statistic) or not
(Johnson's modified Student statistic, distribution-free statistics).
(iii) How to improve this accuracy, if it is too low: how much longer to sim-
ulate the system?
A confidence interval with a fixed length can be derived by sequential statis-
tical procedures. The resulting stopping rule selects the number of necessary
runs with the terminating simulation or the number of renewal cycles with
the steady-state simulation. The latter type of simulation may also use 'ap-
proximate' renewal states.
Further, this chapter covered proportions and quantiles, as alternatives
for the expected value.
(iv) Which 'tricks' to use, in order to improve this accuracy?
References
Aven, T.: Availability Analysis of Monotone Systems. In this volume (1996), pp.
206-223
Bartels, R.: The Rank Version of Von Neumann's Ratio Test for Randomness.
Journal of the American Statistical Association 77, 40-46 (1982)
Cheng, R.C.H.: Bootstrap Methods in Computer Simulation Experiments. In: Alex-
opoulos, C., Kang, K., Lilegdon, W.R., Goldsman, D. (eds.): Proceedings of the
Winter Simulation Conference (1995)
Conover, W.J.: Practical Non-parametric Statistics. New York: Wiley 1971
Crane, M.A., Lemoine, A.J.: An Introduction to the Regenerative Method for Sim-
ulation Analysis. Berlin: Springer 1977
Efron, B.: The Jackknife, the Bootstrap, and Other Resampling Plans. CBMS-NSF
Series. Philadelphia: SIAM 1982
Efron, B., Tibshirani, R.J.: Introduction to the Bootstrap. London: Chapman and
Hall 1993
Fishman, G.S.: Focussed Issue on Variance Reduction Methods in Simulation: In-
troduction. Management Science 35, 1277 (1989)
Glynn, P.W., Iglehart, D.L.: Importance Sampling for Stochastic Simulation. Man-
agement Science 35, 1367-1392 (1989)
Heidelberger, P., Shahabuddin, P., Nicola, V.: Bounded Relative Error in Estimat-
ing Transient Measures of Highly Dependable Non-Markovian Systems. In this
volume (1996), pp. 487-515
Jensen, U.: Stochastic Models of Reliability and Maintenance: An Overview. In this
volume (1996), pp. 3-36
Kleijnen, J.P.C.: Statistical Techniques in Simulation (Two Volumes). New York:
Marcel Dekker 1974/1975
Kleijnen, J.P.C.: Statistical Tools for Simulation Practitioners. New York: Marcel
Dekker 1987
Kleijnen, J.P.C.: Simulation: Sensitivity Analysis and Optimization through Re-
gression Analysis and Experimental Design. In this volume (1996), pp. 429-441
Kleijnen, J.P.C., Karremans, P.C.A., Oortwijn, W.K., Van Groenendaal, W.J.H.:
Jackknifing Estimated Weighted Least Squares: JEWLS. Communications In
Statistics, Theory and Methods 16, 747-764 (1987)
Kleijnen, J.P.C., Kloppenburg, G.L.J., Meeuwsen, F.L.: Testing the Mean of an
Asymmetric Population: Johnson's Modified t-Test Revisited. Communications
in Statistics, Simulation and Computation 15, 715-732 (1986)
Kleijnen, J.P.C., Rubinstein, R.Y.: Optimization and Sensitivity Analysis of Com-
puter Simulation Models by the Score Function Method. European Journal of
Operational Research. To appear (1996)
Kleijnen, J.P.C., Van Groenendaal, W.J.H.: Simulation: A Statistical Perspective.
Chichester: Wiley 1992
Law A.M., Kelton W.D.: Simulation Modeling and Analysis. Second Edition. New
York: McGraw-Hill 1991
Miller, R.G.: The Jackknife - A Review. Biometrika 61, 1-15 (1974)
Muppala, J.K., Malhotra, M., Trivedi, K.S.: Markov Dependability Models of Com-
plex Systems: Analysis Techniques. In this volume (1996), pp. 442-486
Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems; Sensitivity Analysis and
Stochastic Optimization by the Score Function Method. New York: Wiley 1992
Tew, J.D., Wilson, J.R.: Estimating Simulation Metamodels Using Combined
Correlation-Based Variance Reduction Techniques. IIE Transactions 26, 2-16
(1994)
Simulation: Sensitivity Analysis and
Optimization Through Regression Analysis
and Experimental Design
Jack P.C. Kleijnen
Department of Information Systems and Auditing and Center for Economic Re-
search (CentER), School of Management and Economics, Tilburg University, 5000
LE Tilburg, The Netherlands
Summary. This chapter gives a tutorial survey on the use of statistical techniques
in sensitivity analysis, including the application of these techniques to optimization
and validation of simulation models. Sensitivity analysis is divided into two phases.
The first phase is a pilot stage, which consists of screening or searching for the im-
portant factors; a simple technique is sequential bifurcation. In the second phase,
regression analysis is used to approximate the input/output behavior of the sim-
ulation model. This regression analysis gives better results when the simulation
experiment is well designed, using classical statistical designs such as fractional fac-
torials. To optimize the simulated system, Response Surface Methodology (RSM)
is applied; RSM combines regression analysis, statistical designs, and steepest as-
cent. To validate a simulation model that lacks input/output data, again regression
analysis and statistical designs are applied. Several case studies are summarized;
they illustrate how in practice statistical techniques can make simulation studies
give more general results, in less time.
1. Introduction
this chapter will show. The literature on statistical designs uses the term
factor to denote a parameter, input variable or module.
2. Validation: is the simulation model an adequate representation of the cor-
responding system in the real world? This chapter addresses only part of
the validation problem.
To answer these practical questions, this chapter takes techniques from the
science of mathematical statistics (briefly, statistics). It is not surprising that
statistics is so important in simulation: by definition, simulation means that
a model is 'solved' - not by mathematical analysis (see many other chapters
in this volume) or by numerical methods (see Muppala et al. 1996) - but by
experimentation. But experimentation requires a good design and a good
analysis! DOE with its concomitant analysis is a standard topic in statistics.
However, the standard statistical techniques must be adapted such that they
account for the particularities of simulation. For example, there are a great
many factors in many practical simulation models. Indeed, one application
(discussed later) has hundreds of factors, whereas standard DOE assumes
only up to (say) fifteen factors. Moreover, stochastic simulation models use
pseudorandom numbers, which means that the analysts have much more
control over the noise in their experiments than the investigators have in
standard statistical applications (for example, common and antithetic seeds
may be used; see the companion chapter, Kleijnen 1996).
2. Sensitivity Analysis
The vast literature on simulation does not provide a standard definition of
sensitivity analysis. In this chapter, sensitivity analysis is defined as the sys-
tematic investigation of the reaction of the simulation responses to extreme
values of the model's input or to drastic changes in the model's structure. For
example, what happens to the system's availability, when the MTTF doubles;
what happens if the priority rule changes from FIFO to SPT? So the focus
in this chapter is not on marginal changes (or perturbations) in the input
values.
Moreover, the simulation model is treated as a black box: the simulation
inputs and outputs are observed, and from this input/output behavior the
factor effects are estimated. This approach is standard in DOE.
DOE has advantages and disadvantages. One benefit is that this approach
can be applied to all simulation models. A drawback is that it cannot take
advantage of the specific structure of a given simulation model, so it may take
many simulation runs to perform the sensitivity analysis. But DOE requires
fewer runs than the intuitive approach often followed in practice (see the
one-factor-at-a-time approach in Section 2.3.1).
Note: The intricacies of the specific simulation model at hand are con-
sidered in perturbation analysis and in modern importance sampling, also
known as score function; see Ho and Cao (1991), Glynn and Iglehart (1989),
and Rubinstein and Shapiro (1993) respectively. Perturbation analysis and
score function require only one run. Unfortunately, these methods also require
more mathematical sophistication.
In the pilot phase of a simulation study there are usually a great many poten-
tially important factors. For example, in the repairman system of Section
1 there are m failure rates and rm repair rates; r, m, and the queueing pri-
ority rule may also be factors. It is the mission of science to come up with a
short list of the most important factors; it is unacceptable to say 'everything
depends on everything else': parsimony principle.
In practice, analysts often restrict their study to a few factors, usually no
more than ten. Those factors are selected through intuition, prior knowledge,
and the like. The factors that are ignored (kept constant) are, explicitly
or implicitly, assumed to be unimportant. For example, in the repairman
example, it is traditional to assume equal MTTFs (1/λ_j = 1/λ) and equal
repair rates (μ_{ij} = μ). Of course, such an assumption severely restricts the
generality of the simulation study!
The statistics literature includes screening designs. These designs provide
scientific methods for finding the important factors. There are several types
of screening designs: random, supersaturated, group screening designs, and
so on; see Kleijnen (1987).
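The bifurcation idea behind one such group-screening design, sequential bifurcation, can be sketched for an additive model with non-negative main effects. This toy version (the 128-factor model and the simple pruning rule are illustrative) differs from the actual procedure of Bettonvil and Kleijnen (1995), which reuses runs more cleverly and can handle interactions.

```python
def sequential_bifurcation(effect_of_group, n_factors):
    """Split-and-prune screening: effect_of_group(lo, hi) returns the
    aggregated effect of factors lo..hi-1 (in a real study, the difference
    of two simulation runs). Groups with zero aggregated effect are
    discarded whole; the rest are bisected until single important
    factors are isolated. Returns (important factors, groups evaluated)."""
    important = []
    stack = [(0, n_factors)]
    runs = 0
    while stack:
        lo, hi = stack.pop()
        runs += 1
        if effect_of_group(lo, hi) <= 0:
            continue                    # whole group unimportant
        if hi - lo == 1:
            important.append(lo)        # isolated an important factor
        else:
            mid = (lo + hi) // 2
            stack.append((lo, mid))
            stack.append((mid, hi))
    return sorted(important), runs

# Toy model: 128 factors, only factors 3 and 90 matter (hypothetical).
betas = [0.0] * 128
betas[3] = 5.0
betas[90] = 2.0

def group_effect(lo, hi):
    return sum(betas[lo:hi])

found, runs = sequential_bifurcation(group_effect, 128)
```

Both important factors are found after evaluating far fewer groups than the 128 runs a one-factor-at-a-time search would need.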
inal 281 factors. Some of these 15 factors surprised the ecological experts, so
sequential bifurcation turns out to be a powerful statistical (black box) tech-
nique. Moreover, had the analysts assumed no interactions between factors,
then sequential bifurcation would have halved the number of runs (154/2 =
77 runs).
The ecological case study concerns a deterministic simulation model (con-
sisting of a set of non-linear difference equations). There is a need for more
research, applying sequential bifurcation to large random simulations, such
as simulations of reliability and maintenance of complex systems.
with

var(ε_i) = σ_i²

where the error variance varies with the input combination of the random
simulation model. (So Y, the response of the stochastic simulation, has a mean
and a variance that both depend on the input.) Then weighted least squares
(with the standard deviations σ_i as weights) yields unbiased estimators of
the factor effects, but with smaller variances than OLS gives.
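For a single factor, weighted least squares reduces to a few lines; the straight-line metamodel and the weights 1/σ_i² are the usual textbook choice, sketched here under that assumption.

```python
def wls_line(x, y, sigma):
    """Weighted least squares for a straight line y = b0 + b1 x, with
    per-observation standard deviations sigma used as weights
    w_i = 1 / sigma_i^2 (heterogeneous response variances).
    Returns (b0, b1)."""
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Data lying exactly on y = 1 + 2x: WLS recovers it whatever the weights.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = wls_line(x, y, sigma=[1.0, 2.0, 0.5, 1.0])
```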
Common pseudorandom number seeds can be used to simulate different
factor combinations (see the companion chapter, Kleijnen 1996). Then GLS
gives minimum variance, unbiased estimators. Unfortunately, in practice the
variances and covariances of the simulation responses Y are unknown, so they
must be estimated. The following equation gives the classic covariance esti-
mator, assuming d independent replications (or simulation runs) per factor
combination (so Y_{ig} and Y_{i'g} are correlated, but Y_{ig} and Y_{ig'} are not):

cov(Y_i, Y_{i'}) = Σ_{g=1}^{d} (Y_{ig} − Ȳ_i)(Y_{i'g} − Ȳ_{i'}) / (d − 1)    (2.3)
Fortunately, the resulting estimated GLS gives good results; see Kleijnen and
Van Groenendaal (1992).
Of course, it is necessary to check the fitted regression metamodel: is it an
adequate approximation of the underlying simulation model? Therefore the
metamodel may be used to predict the outcomes for new factor combinations
of the simulation model. So replace β in the specified metamodel by the esti-
mate β̂, and substitute new combinations of x (there are n old combinations).
Compare the predictions ŷ with the simulation responses y.
A refinement is cross-validation: do not add new combinations (which
require more computer time), but eliminate one old combination (say) com-
bination i and re-estimate the regression model from the remaining n - 1
combinations. Repeat this elimination for all values of i (i = 1, ..., n; see
equation (2.1)). This approach resembles jackknifing, discussed in the com-
panion chapter, Kleijnen (1996). Statistical details are discussed in Kleijnen
and Van Groenendaal (1992).
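Cross-validation of a metamodel with one explanatory variable can be sketched as follows (a deliberately minimal setting; real metamodels have several factors):

```python
def ols_line(x, y):
    """Ordinary least squares for y = b0 + b1 x; returns (b0, b1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

def cross_validate(x, y):
    """Leave-one-out cross-validation: drop combination i, re-estimate the
    metamodel from the remaining n-1 combinations, and predict the deleted
    response. Returns the prediction errors y_i - yhat_{-i}."""
    errors = []
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        b0, b1 = ols_line(xs, ys)
        errors.append(y[i] - (b0 + b1 * x[i]))
    return errors

# Responses roughly on y = 2x (hypothetical simulation outputs):
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
errs = cross_validate(x, y)
```

Large prediction errors for some combination i would signal that the fitted metamodel is not an adequate approximation there.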
Applications of regression metamodeling will be discussed below (Sec-
tion 2.3 through Section 4).
be estimated. In practice, these full factorial designs are sometimes used in-
deed (but high-order interactions are hard to interpret). See Kleijnen (1987).
2.3.4 Quadratic Effects: Curvature. If the quadratic effects β_{hh} in Equa-
tion (2.1) are to be estimated, then at least k extra runs are needed (since
h runs from 1 to k). Moreover, each factor must be simulated for more than
two values.
Popular in statistics and in simulation are central composite designs. They
have five values per factor, and require relatively many runs (n > q). For example, if
there are k = 2 factors, then q = 6 effects are to be estimated, but as many as
n = 9 factor combinations are simulated. See Kleijnen (1987) and Kleijnen
and Van Groenendaal (1992).
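For k = 2 the design can be written out explicitly. The axial distance α = √2, which produces the five values per factor mentioned above, is a common choice but an assumption of this sketch.

```python
import math

def central_composite(k=2, alpha=math.sqrt(2)):
    """Central composite design: 2^k factorial corners, 2k axial ('star')
    points at distance alpha, and one center point. For k = 2 this gives
    the n = 9 combinations mentioned in the text, enough to estimate the
    q = 6 effects of a full second-order model."""
    points = []
    # 2^k factorial corners at coded levels -1 and +1
    for i in range(2 ** k):
        points.append(tuple(1.0 if (i >> j) & 1 else -1.0 for j in range(k)))
    # 2k axial points: one factor at +/- alpha, the others at 0
    for j in range(k):
        for sign in (-1.0, 1.0):
            p = [0.0] * k
            p[j] = sign * alpha
            points.append(tuple(p))
    points.append((0.0,) * k)     # center point
    return points

design = central_composite()
```

Each factor takes the five coded values −α, −1, 0, 1, α, so quadratic effects are estimable.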
Applications are found in the optimization of simulation models; see Sec-
tion 4.
3. Validation
This paper concentrates on the role of sensitivity analysis (Section 2) in
validation; other statistical techniques for validation and verification are dis-
cussed in Kleijnen (1995a). Obviously, validation is one of the first questions
that must be answered in a simulation study; for didactic reasons, validation
is discussed in this section.
True validation requires that data on the real system be available. In prac-
tice, the amount of data varies greatly: data on failures of nuclear installations
are rare, whereas electronically captured data on computer performance and
on supermarket sales are abundant.
If data are available, then many statistical techniques can be applied. For
example, simulated and real data on the response can be compared through
the Student statistic for paired observations (see the companion chapter,
Kleijnen 1996), assuming the simulation is fed with real-life input data: trace
driven simulation. A better test uses regression analysis; see Kleijnen et al.
(1996).
However, if no data are available, then the following type of sensitivity
analysis can be used. The clients of the analysts do have qualitative knowl-
edge of certain parts of the real system; that is, these clients do know in
which direction certain factors affect the response of the corresponding mod-
ule in the simulation model (also see the discussion on sequential bifurcation
in Section 2.1.1). If the regression metamodel (see Section 2.2.2) gives an
estimated factor effect with the wrong sign, this is a strong indication of a
wrong simulation model or a wrong computer program.
Applications in ecological and military modeling are given in Kleijnen
et al. (1992) and Kleijnen (1995b) respectively. These applications further
show that the validity of a simulation model is restricted to a certain domain
of factor combinations, which corresponds with the experimental frame in
Zeigler (1976), a seminal book on modeling and simulation.
Moreover, the regression metamodel shows which factors are most im-
portant. If possible, information on these factors should be collected, for
validation purposes.
5. Conclusions
In the introduction (Section 1) the following questions were raised:
1. What if: what happens if the analysts change parameters, input variables
or modules of a simulation model? This question is closely related to
sensitivity analysis and optimization.
2. Validation: is the simulation model an adequate representation of the
corresponding system in the real world?
These questions were answered as follows.
In the initial phase of a simulation it is often necessary to perform screen-
ing: which factors among the multitude of potential factors are really im-
portant? The goal of screening is to reduce the multitude of factors to the
really important ones, which are further explored in the next phase. The technique of sequential
bifurcation is a simple, efficient, and effective screening technique.
Once the important factors are identified, further analysis with fewer as-
sumptions (no known signs) may use regression analysis. It generalizes the
results of the simulation experiment, since it characterizes the input/output
behavior of the simulation model.
Design Of Experiments (DOE) can give good estimators of the main ef-
fects, interactions, and quadratic effects that occur in the regression model.
These designs require relatively few simulation runs.
Once these factor effects are quantified, they can be used in
(i) validation, especially if there are no data on the input/output of the sim-
ulation model or its modules;
(ii) optimization through RSM, which builds on regression analysis and ex-
perimental designs.
These statistical techniques have already been applied many times in prac-
tical simulation studies, in many domains. Hopefully, this survey will stim-
ulate even more analysts to apply these techniques. The goal is to make
simulation studies give more general results, in less time.
In the meantime the research on statistical techniques adapted to simu-
lation continues in Europe, America and elsewhere.
References
Bettonvil, B., Kleijnen, J.P.C.: Searching for the Important Factors in Simulation
Models with Many Factors. Tilburg University (1995)
Glynn, P.W., Iglehart, D.L.: Importance Sampling for Stochastic Simulation. Man-
agement Science 35, 1367-1392 (1989)
Ho, Y., Cao, X.: Perturbation Analysis of Discrete Event Systems. Dordrecht:
Kluwer 1991
Hood, S.J., Welch, P.D.: Response Surface Methodology and its Application in Simulation. Proceedings of the Winter Simulation Conference (1993)
Kleijnen, J.P.C.: Statistical Tools for Simulation Practitioners. New York: Marcel
Dekker 1987
Kleijnen, J.P.C.: Simulation and Optimization in Production Planning: A Case
Study. Decision Support Systems 9, 269-280 (1993)
Kleijnen, J.P.C.: Verification and Validation of Simulation Models. European Jour-
nal of Operational Research 82, 145-162 (1995a)
Kleijnen, J.P.C.: Case-Study: Statistical Validation of Simulation Models. European Journal of Operational Research 87, 21-34 (1995b)
Kleijnen, J.P.C.: Sensitivity Analysis and Optimization in Simulation: Design of
Experiments and Case Studies. In: Alexopoulos, C., Kang, K., Lilegdon, W. R.,
Goldsman, D. (eds.): Proceedings of the Winter Simulation Conference (1995c)
Kleijnen, J.P.C.: Simulation: Runlength Selection and Variance Reduction Tech-
niques. In this volume (1996), pp. 411-428
Kleijnen, J.P.C., Bettonvil, B., Van Groenendaal, W.: Validation of Simulation Models: Regression Analysis Revisited. Tilburg University 1996
Kleijnen, J.P.C., Van Groenendaal, W.: Simulation: A Statistical Perspective. Chichester: Wiley 1992
Kleijnen, J.P.C., Van Ham, G., Rotmans, J.: Techniques for Sensitivity Analysis of Simulation Models: A Case Study of the CO2 Greenhouse Effect. Simulation 58, 410-417 (1992)
Muppala, J.K., Malhotra, M., Trivedi, K.S.: Markov Dependability Models of Complex Systems: Analysis Techniques. In this volume (1996), pp. 442-486
Nash, S.G. : Software Survey NLP. OR/MS Today 22, 60-71 (1995)
Oren, T.I.: Three Simulation Experimentation Environments: SIMAD, SIMGEST and E/SLAM. In: Proceedings of the 1993 European Simulation Symposium. La Jolla: Society for Computer Simulation 1993
Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems: Sensitivity Analysis and
Stochastic Optimization via the Score Function Method. New York: Wiley 1993
Van Groenendaal, W.: Investment Analysis and DSS for Gas Transmission on Java.
Tilburg University (1994)
Van Meel, J.: The Dynamics of Business Engineering. Delft University (1994)
Zeigler, B.: Theory of Modelling and Simulation. New York: Wiley 1976
Markov Dependability Models of Complex
Systems: Analysis Techniques
Jogesh K. Muppala¹, Manish Malhotra², and Kishor S. Trivedi³
¹ Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
² AT&T Bell Laboratories, Holmdel, NJ 07733, USA
³ Center for Advanced Computing & Communication, Department of Electrical Engineering, Duke University, Durham, NC 27708-0291, USA
Summary. Continuous time Markov chains are commonly used for modelling large
systems, in order to study their performance and dependability. In this paper, we
review solution techniques for Markov and Markov reward models. Several meth-
ods are presented for the transient analysis of Markov models, ranging from fully-
symbolic to fully-numeric. The Markov reward model is explored further, and meth-
ods for computing various reward based measures are discussed including the ex-
pected values of rewards and the distributions of accumulated rewards. We also
briefly discuss the different types of dependencies that arise in dependability mod-
elling of systems, and show how Markov models can handle some of these depen-
dencies. Finally, we briefly review the Markov regenerative process, which relaxes
some of the constraints imposed by the Markov process.
1. Introduction
resource, and (2) transitions between states, which represent the change of
the system state due to the occurrence of a simple or a compound event such
as the failure of one or more resources, the completion of executing tasks, or
the arrival of jobs.
A Markov chain is a special case of a discrete-state stochastic process in
which the current state completely captures the past history pertaining to the
system's evolution. Markov chains can be classified into discrete-time Markov
chains (DTMCs) and continuous-time Markov chains (CTMCs), depending
on whether the events can occur at fixed intervals or at any time; that is,
whether the time variable associated with the system's evolution is discrete
or continuous. This paper is restricted to continuous-time Markov chains.
Further information on Markov chains may be found in (Trivedi 1982).
In a graphical representation of a Markov chain, states are denoted by
circles with meaningful labels attached. Transitions between states are rep-
resented by directed arcs drawn from the originating state to the destina-
tion state. Depending on whether the Markov chain is a discrete-time or a
continuous-time Markov chain, either a probability or a rate is associated
with a transition, respectively.
In this section, we present a brief introduction to the concepts and the no-
tation for Markov and Markov reward models. We shall illustrate the Markov
chain concepts using a simple example.
the file-server are exponentially distributed with the parameters μ_w and μ_f
respectively. The file-server has repair priority over the workstations. We also
assume that whenever the system is down, no further failures can take place.
Hence, when the file-server is down, the workstations cannot fail. Similarly
when both the workstations are down, the file-server does not fail.
Fig. 2.2. Continuous-time Markov chain for the computer system of Fig. 2.1

With the states ordered as (2,1), (1,1), (0,1), (1,0), (2,0), the generator matrix is

Q = [ -(2λ_w + λ_f)    2λ_w                0       0       λ_f
       μ_w            -(μ_w + λ_w + λ_f)   λ_w     λ_f     0
       0               μ_w                -μ_w     0       0
       0               μ_f                 0      -μ_f     0
       μ_f             0                   0       0      -μ_f ]
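As a numerical illustration of this workstation/file-server model, the generator can be assembled and solved for the steady-state probability vector π (from πQ = 0 with Σπ_i = 1), giving the steady-state availability as one simple measure. The state ordering (2,1), (1,1), (0,1), (1,0), (2,0) and the value of μ_f are assumptions for this sketch; λ_w, λ_f, μ_w follow the example's values:

```python
import numpy as np

# Assumed state order: (2,1),(1,1),(0,1),(1,0),(2,0); mu_f is an assumed value.
lw, lf, mw, mf = 1e-4, 5e-5, 1.0, 0.5   # lambda_w, lambda_f, mu_w, mu_f
Q = np.array([
    [-(2*lw + lf),  2*lw,             0.0,  0.0,  lf ],
    [ mw,          -(mw + lw + lf),   lw,   lf,   0.0],
    [ 0.0,          mw,              -mw,   0.0,  0.0],
    [ 0.0,          mf,               0.0, -mf,   0.0],
    [ mf,           0.0,              0.0,  0.0, -mf ],
])

# Solve pi Q = 0 with sum(pi) = 1 via least squares on the augmented system.
A = np.vstack([Q.T, np.ones(5)])
b = np.zeros(6); b[-1] = 1.0
pi = np.linalg.lstsq(A, b, rcond=None)[0]
steady_avail = pi[0] + pi[1]   # system up in states (2,1) and (1,1)
print(steady_avail)
```

The same Q matrix feeds the transient solution methods discussed later in the chapter.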
We note that states of a CTMC will most often be vectors. However, the
discrete state space of a CTMC can always be mapped into positive integers.
We will therefore assume a state space of {1, 2, ..., n}.
2.1.1 Instantaneous Transient Analysis. Let P_i(t) = Pr{Z(t) = i} be
the unconditional probability of the CTMC being in state i at time t. Then
the row vector P(t) = [P1 (t), P2 (t), ... , Pn(t)] represents the transient state
probability vector of the CTMC. The behavior of the CTMC can be described
by the following Kolmogorov differential equation:

d/dt P(t) = P(t)Q ,    (2.1)

where P(0) represents the initial probability vector (at time t = 0) of the CTMC.
2.1.2 Cumulative Transient Analysis. Define L(t) = ∫_0^t P(u)du. Then
L_i(t) is the expected total time spent by the CTMC in state i during the
interval [0, t). L(t) satisfies the differential equation:

d/dt L(t) = L(t)Q + P(0) ,    L(0) = 0 ,    (2.2)
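A minimal sketch of solving this cumulative equation numerically: classical RK4 integration of dL/dt = L(t)Q + P(0) from L(0) = 0. The 2-state chain and its rates below are illustrative, not from the text:

```python
import numpy as np

# Illustrative 2-state chain: failure rate a, repair rate b (assumed values).
a, b = 1.0, 2.0
Q = np.array([[-a, a], [b, -b]])
p0 = np.array([1.0, 0.0])

# dL/dt = L Q + P(0), integrated over [0, 1) with classical RK4.
f = lambda L: L @ Q + p0
L, h = np.zeros(2), 1e-3
for _ in range(1000):
    k1 = f(L)
    k2 = f(L + h / 2 * k1)
    k3 = f(L + h / 2 * k2)
    k4 = f(L + h * k3)
    L += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
print(L)   # expected total time spent in each state during [0, 1)
```

For this chain, P_1(t) = b/(a+b) + a/(a+b) e^{-(a+b)t}, so L can be checked against the analytic integral; note also that the components of L(t) must sum to t.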
Fig. 2.3. Instantaneous and interval availability of the example computer system (time in hours)
By assuming that the example computer system does not recover when-
ever both workstations fail, or whenever the file-server fails, we make the
states (0, 1), (1,0), and (2, 0) the absorbing states. The corresponding Markov
chain is shown in Figure 2.4. This gives the following new matrix QB:
The mean time to failure MTTF of the computer system, which is the
same as the mean time to absorption for the Markov chain given in Figure 2.4,
is obtained as

MTTF = Z_(2,1) + Z_(1,1) .

Assuming that λ_w = 0.0001 hr⁻¹, λ_f = 0.00005 hr⁻¹, and μ_w = 1.0 hr⁻¹,
we obtain the mean time to failure as 19992 hours.
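The mean time to absorption can be computed directly by restricting Q to the transient (up) states and solving one linear system. A sketch, assuming the transient states are ordered (2,1), (1,1), with the rates just given:

```python
import numpy as np

# Restrict Q to the transient states (2,1) and (1,1); tau = -Q_T^{-1} 1 gives
# the expected time to absorption starting from each transient state.
lw, lf, mw = 1e-4, 5e-5, 1.0   # lambda_w, lambda_f, mu_w from the text
QT = np.array([[-(2*lw + lf),  2*lw],
               [ mw,          -(mw + lw + lf)]])
tau = np.linalg.solve(-QT, np.ones(2))
print(round(tau[0]))   # MTTF starting from (2,1): prints 19992
```

The first component reproduces the 19992 hours quoted above.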
Furthermore, since this Markov chain has absorbing states, we can also
compute the reliability of the system. The reliability R(t) is the probability
that the system is functioning throughout the interval [0, t). Since all system
failure states are absorbing, it follows that if the system is functioning at
time t, it must be functioning throughout the interval [0, t). Thus,
R(t) = P(2,1)(t) + P(1,1)(t).
The reliability for the example computer system is plotted in Figure 2.5.
Fig. 2.5. Reliability of the example computer system (time in hours)
the instantaneous reward rate of the Markov reward model (MRM). Let Y(t)
denote the accumulated reward in the interval [0, t).
Y(t) = ∫_0^t X(τ)dτ .
2.2.1 Expected Rewards. The expected instantaneous reward rate E[X(t)],
the expected accumulated reward E[Y(t)], and the steady-state expected reward
rate E[X] = E[X(∞)] can be computed as

E[X(t)] = Σ_{i∈Ω} r_i P_i(t) + Σ_{i,j∈Ω} R_ij φ_ij(t) ,

E[Y(t)] = Σ_{i∈Ω} r_i L_i(t) + Σ_{i,j∈Ω} R_ij N_ij(t) ,

E[X] = Σ_{i∈Ω} r_i π_i + Σ_{i,j∈Ω} R_ij φ_ij ,

where φ_ij(t) and φ_ij denote the expected frequency with which the transition
from state i to state j is traversed in the Markov chain at time t and in
steady-state, respectively; N_ij(t) is the expected number of such traversals of
the transition from state i to state j during the interval [0, t).
For a Markov chain with absorbing states, the expected accumulated reward
until absorption E[Y(∞)] can be computed as

E[Y(∞)] = Σ_{i∈Ω} r_i ∫_0^∞ P_i(τ)dτ + Σ_{i,j∈Ω} R_ij N_ij = Σ_{i∈Ω} r_i Z_i + Σ_{i,j∈Ω} R_ij N_ij ,
where Nij is the expected number of traversals of the transition from state i
to state j until absorption.
Furthermore, we note that if h_i is the expected holding time for the CTMC
in state i, then h_i = 1/|q_ii|. If φ_i represents the frequency with which state
i is visited in steady-state, then φ_i = π_i/h_i = π_i|q_ii|. Given that the CTMC
is in state i, the probability ν_ij that the next transition will be to state j is
given by ν_ij = q_ij/|q_ii|. Thus, we can compute φ_ij as

φ_ij = ν_ij φ_i = q_ij π_i .

Similarly, we can prove that φ_ij(t) = q_ij P_i(t). Hence the expressions for the
expected instantaneous reward rate and the expected steady-state reward
rate can be rewritten as
E[X(t)] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) P_i(t)

and

E[X] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) π_i .
By a similar argument, if n_i is the expected number of visits to state i
until absorption, then n_i = Z_i/h_i = Z_i|q_ii|. Then

N_ij = ν_ij n_i = q_ij Z_i .
Similarly, we can also prove that
N_ij(t) = ν_ij n_i(t) = q_ij L_i(t) .
Thus, the expressions for the expected accumulated reward until absorption,
and the expected accumulated reward in the interval [0, t) may be rewritten
as
E[Y(∞)] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) Z_i

and

E[Y(t)] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) L_i(t) .
2.2.2 Distribution of Reward Measures. Assuming only reward rates
(no impulse rewards) are assigned, the distribution of X(t), P[X(t) ≤ x], can
be computed as

P[X(t) ≤ x] = Σ_{i∈Ω: r_i ≤ x} P_i(t) .
The distribution of X can be computed similarly.
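The expected steady-state reward rate and the distribution of the reward rate are both simple sums once the state probabilities are available. A sketch with an illustrative 3-state chain and reward-rate assignment (no impulse rewards), where π comes from πQ = 0 with Σπ_i = 1:

```python
import numpy as np

# Illustrative 3-state generator and per-state reward rates r_i.
Q = np.array([[-2.0,  2.0,  0.0],
              [ 1.0, -1.5,  0.5],
              [ 0.0,  3.0, -3.0]])
r = np.array([1.0, 0.5, 0.0])

# Steady-state vector: solve pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

EX = r @ pi                          # E[X] = sum_i r_i pi_i

def reward_cdf(x, p):
    # P[X <= x]: add the probabilities of states whose reward rate r_i <= x.
    return p[r <= x].sum()

print(EX, reward_cdf(0.5, pi))
```

Replacing π by the transient vector P(t) gives E[X(t)] and P[X(t) ≤ x] in exactly the same way.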
The distribution of accumulated reward until absorption, P[Y(∞) ≤ y],
and the distribution of accumulated reward over a finite horizon, P[Y(t) ≤ y],
on the other hand, are difficult to compute. Numerical methods for computing
these distributions will be discussed in a later section.
3. Computational Difficulties
Two major difficulties that arise in numerical computation of transient be-
havior of Markov chains are largeness and stiffness.
3.1 Largeness
Most Markov models of real systems are very large. The actual model (reliability
or performance) may be specified using a high-level description such as
stochastic Petri nets (Ajmone et al. 1984). However, these high level models
are solved after conversion to a Markov model that is typically very large.
Practical models, in general, give rise to hundreds of thousands of states (Ibe
and Trivedi 1990). Two basic approaches to overcome largeness are:
- Largeness-avoidance: One could use state-truncation techniques based on
avoiding generation of low probability states (Boyd et al. 1988, Kantz and
Trivedi 1991, Li and Silvester 1984, and Van Dijk 1991) and model-level
decomposition (Ciardo and Trivedi 1993 and Tomek and Trivedi 1991).
- Largeness-tolerance: In this approach, a concise method of description and
automated generation of the CTMC is used. Sparsity of Markov chains is
exploited to reduce the space requirements. However, no model reduction is
employed. Appropriate data structures for sparse matrix storage are used.
Sparsity preserving solution methods are used, which result in consider-
able reduction in computational complexity. CTMCs with several hundred
thousand states have been solved using this approach. We shall consider
largeness-tolerance methods in this paper.
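The largeness-tolerance idea of sparse storage plus sparsity-preserving operations can be sketched with a hand-rolled CSR (compressed sparse row) structure; the vector-matrix product xQ below is the kernel used by the iterative solvers discussed later. The tiny matrix is illustrative:

```python
import numpy as np

# Minimal CSR storage: nonzero values, their column indices, and row pointers.
class CSR:
    def __init__(self, dense):
        self.vals, self.cols, self.rowptr = [], [], [0]
        for row in dense:
            for j, v in enumerate(row):
                if v != 0.0:
                    self.vals.append(v)
                    self.cols.append(j)
            self.rowptr.append(len(self.vals))

    def left_multiply(self, x):
        # y = x @ Q, touching only the eta nonzero entries of Q.
        y = np.zeros(len(self.rowptr) - 1)
        for i in range(len(x)):
            for k in range(self.rowptr[i], self.rowptr[i + 1]):
                y[self.cols[k]] += x[i] * self.vals[k]
        return y

Q = [[-1.0, 1.0, 0.0], [0.0, -2.0, 2.0], [3.0, 0.0, -3.0]]
x = np.array([0.2, 0.3, 0.5])
print(CSR(Q).left_multiply(x))  # equals x @ Q
```

Each product costs O(η) rather than O(n²), which is what makes chains with hundreds of thousands of states tractable.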
3.2 Stiffness
arc from that place. The firing of a transition is an atomic action in which
one or more tokens are removed from each input place of the transition, and
one or more tokens are added to each output place of the transition, possibly
resulting in a new marking of the PN. Upon firing the transition, the number
of tokens deposited in each of its output places is equal to the cardinality of
the output arc. Each distinct marking of the PN constitutes a separate state of
the PN. A marking is reachable from another marking, if there is a sequence of
transition firings starting from the original marking which results in the new
marking. The reachability set (graph) of a PN is the set (graph) of markings
that are reachable from the initial marking (connected by the arcs labeled by
the transitions whose firing causes the corresponding change of marking). In
any marking of the PN, multiple transitions may be simultaneously enabled.
Another type of arc in a Petri net is the inhibitor arc. An inhibitor arc
drawn from a place to a transition, means that the transition cannot fire if
the place contains at least as many tokens as the cardinality of the inhibitor
arc.
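These enabling and firing rules can be sketched directly. The tiny net, place names, and cardinalities below are illustrative; a transition is enabled when every input place holds at least the input-arc cardinality and no inhibitor place reaches its inhibitor-arc cardinality:

```python
# Minimal Petri-net firing sketch; the net and names are illustrative.
def enabled(marking, t):
    ok = all(marking[p] >= n for p, n in t["in"].items())            # all inputs satisfied
    blocked = any(marking[p] >= n for p, n in t["inhibit"].items())  # any inhibitor disables
    return ok and not blocked

def fire(marking, t):
    m = dict(marking)
    for p, n in t["in"].items():
        m[p] -= n          # remove tokens from each input place
    for p, n in t["out"].items():
        m[p] += n          # deposit tokens per output-arc cardinality
    return m

t1 = {"in": {"p1": 1}, "out": {"p2": 2}, "inhibit": {"p3": 1}}
m0 = {"p1": 1, "p2": 0, "p3": 0}
print(enabled(m0, t1), fire(m0, t1))
```

Generating the reachability set amounts to repeatedly applying `fire` to every enabled transition of every marking found so far.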
Extensions to PN have been considered by associating firing times with
the transitions. By requiring exponentially distributed firing times, we ob-
tain stochastic Petri nets (SPNs). The underlying reachability graph of an SPN
is isomorphic to a continuous time Markov chain (CTMC). Further gen-
eralization of SPNs has been introduced in (Ajmone et al. 1984) allowing
transitions to have either zero firing times (immediate transitions) or ex-
ponentially distributed firing times (timed transitions), giving rise to the
generalized stochastic Petri net (GSPN). In this paper, timed transitions are
represented by hollow rectangles, whereas immediate transitions are repre-
sented by thin bars. The markings of a GSPN are classified into two types. A
marking is vanishing if any immediate transition is enabled in the marking. A
marking is tangible if only timed transitions or no transitions are enabled in
the marking. Conflicts among immediate transitions in a vanishing marking
are resolved using a random switch (Ajmone et al. 1984).
Although GSPNs provide a useful high-level language for evaluating large
systems, representation of the intricate behavior of such systems often leads
to a large and complex structure of the GSPN. To alleviate some of these
problems, several structural extensions to Petri nets are described in (Ciardo
et al. 1989), which increase the modelling power of GSPNs. These extensions
include guards (enabling functions), general marking dependency, variable
cardinality arcs, and priorities. Some of these structural constructs are also
used in stochastic activity networks (SANs) (Sanders and Meyer 1986) and
GSPNs (Chiola 1985). Stochastic extensions were also added to GSPNs to
permit the specification of reward rates at the net level, resulting in stochastic
reward nets (SRN). All these extensions will be described in the following
subsections.
To illustrate the concepts further, we consider an SRN model for the
computer system example. We consider one further extension to this model,
Fig. 4.2. The reachability graph for the SRN model
The corresponding continuous time Markov chain may be derived from the
reachability graph by eliminating the vanishing markings. The corresponding
CTMC model is shown in Figure 4.3. The algorithm for converting from the
SRN to the CTMC description may be found in Ciardo et al. (1993).
the input conditions must be satisfied), and a logical "OR" for inhibitor arcs
(any inhibitor condition is sufficient to disable the transition). For instance,
a guard such as (#(P1) ≥ 3 ∨ #(P2) ≥ 2) ∧ (#(P3) = 5 ∨ #(P4) ≤ 1) is difficult
to represent graphically.
4.1.6 Output measures. For a SRN, all the output measures are expressed
in terms of the expected values of reward rate functions. Depending on the
quantity of interest, an appropriate reward rate is defined. In this section we
are not considering impulse rewards, but they can be easily added.
Suppose X represents the random variable corresponding to the steady-
state reward rate describing a measure of interest. A general expression for
the expected reward rate in steady-state is
E[X] = Σ_{k∈T} r_k π_k ,

where T is the set of tangible markings (no time is spent in the vanishing
markings), π_k is the steady-state probability of (tangible) marking k, and r_k
is the reward rate in marking k.
Analogously, let X(t) represent the random variable corresponding to the
instantaneous reward rate of interest. The expression for the expected
instantaneous reward rate at time t becomes

E[X(t)] = Σ_{k∈T} r_k P_k(t) ,
where Pk(t) is the probability of being in marking k at time t.
Similarly, let Y(t) represent the random variable corresponding to the
accumulated reward in the interval [0, t), and let Y(∞) represent the
corresponding random variable for the accumulated reward until absorption. The
expressions for the expected accumulated reward in the interval [0, t) and the
expected accumulated reward until absorption are

E[Y(t)] = Σ_{k∈T} r_k ∫_0^t P_k(x)dx ,

and

E[Y(∞)] = Σ_{k∈T} r_k ∫_0^∞ P_k(x)dx .
In the example model derived above, we assign appropriate reward rates
to the markings of the SRN to compute interesting measures. For example,
to compute the system availability, the reward rate r_i associated with the
tangible marking i is given by

r_i = 1 if #(wsup, i) > 0 ∧ #(fsup, i) = 1, and r_i = 0 otherwise.
The instantaneous availability computed for the system for three different
values of the coverage parameter c, is plotted in Figure 4.4. As expected, the
availability decreases with the decrease in the coverage parameter c.
Fig. 4.4. Instantaneous availability of the system for coverage values c = 1.0, c = 0.9, and c = 0.8 (time in hours)
5. System Dependencies
Earlier we mentioned that continuous-time Markov chains can easily represent
many of the failure and repair dependencies that arise in the modelling
of computer systems. In this section we describe the kinds of dependencies
arising in practice that can be handled by CTMCs. It is often assumed in
the dependability community that the failures of components are independent.
When dependencies are considered, they are usually modeled through
the use of multivariate distributions. Here we present the following
seven kinds of system behavior that can be easily represented by CTMCs
without resorting to complex mechanisms.
1. Imperfect coverage: Common-mode failures occur occasionally in com-
plex systems; that is, the failure of a component may induce the failure
of the entire system, since the system is unable to recover from the com-
ponent failure. We can use the imperfect coverage concept to model this
behavior. As an example, consider a system composed of two identical
processors. Upon failure of one of the processors, the system may recover
and continue functioning with a single processor. Such a fault is said to
be covered. Alternatively the system may not recover from the proces-
sor failure, causing the entire system to fail; the corresponding fault is
said to be not covered. We assume that upon failure of a processor, the
system recovers with probability c (covered failure) or the system fails
to recover with probability 1 - c (uncovered failure). The system has
imperfect coverage if c < 1.0. This system can be modeled by a CTMC
with three states, as shown in Figure 5.1(a). This dependence may easily
be mapped into the shock model of failure, as shown in the three-state
travel time may also be involved, where the repair personnel need to
travel to the site. However, this travel time appears only once, inde-
pendent of the number of components waiting for repairs. Furthermore,
both imperfect repair and faulty replacements can also be considered.
Once again, Markov and SPN models have been used to capture such
behavior (Ibe et al. 1989 and Muppala et al. 1992).
5. Hardware-software co-dependence: Failure of software usually does not
impact the underlying hardware, so the hardware can continue to execute
other software. However, failure of the hardware automatically implies
that the software running on the hardware will fail. This implied failure
of the software (upon failure of the underlying hardware) can also be
modeled through Markov chains.
6. Performance-dependability dependence: The system's performance and
dependability are also correlated, due to the following causes:
a) The failure of some components may in turn increase the load im-
posed upon the remaining components. Consequently the failure
rates of the functioning components might increase. This can be mod-
eled in Markov chains by making the failure rates dependent on the
number of functioning/failed components.
b) Degradable systems, which continue to function even in the presence
of failures, are best characterized by a combined evaluation of their
performance and dependability. This has led to the development of
performability concepts (Meyer 1982 and Trivedi et al. 1992) based
on Markov reward models (Howard 1971).
c) Inadequate performance behavior of a system may sometimes be con-
strued as a failure (Logothetis and Trivedi 1995). For example, in a
client-server based distributed system, a large delay in the server re-
sponding to a client request, may prompt the client to assume that
the server has failed.
7. Phased mission models: Phased mission models are common in situations
in which the system's configuration and behavior change across different
phases (Dugan 1991 and Kim and Park 1994); for example, a flight control
system has at least three distinct phases: take-off, cruising, and landing.
The failure rates as well as system requirements may be dependent upon
the phase. Markov chains can be used to develop phased mission models,
such that the final state probabilities of one phase are mapped into the
initial state probabilities in another phase. Note that both the structure
of the CTMC and the set of UP and DOWN states may change with the
phase (Somani et al. 1992).
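The phase-to-phase mapping in item 7 can be sketched directly: solve each phase's CTMC transiently and feed the final probability vector of one phase in as the initial vector of the next. The two single-failure-state generators and phase durations below are invented for illustration, and the transient solver is a compact vector-level randomization:

```python
import numpy as np

def transient(Q, p0, t, terms=400):
    # vector-level randomization: p(t) = sum_i e^{-qt}(qt)^i/i! * p0 (I + Q/q)^i
    q = 1.02 * max(abs(Q[i, i]) for i in range(len(Q)))
    P = np.eye(len(Q)) + Q / q
    v = p0.copy()
    w = np.exp(-q * t)
    out = w * v
    for i in range(1, terms):
        v = v @ P
        w *= q * t / i
        out = out + w * v
    return out

# Illustrative two-phase mission; state 1 is an absorbing failure state.
Q_takeoff = np.array([[-0.01, 0.01], [0.0, 0.0]])
Q_cruise  = np.array([[-0.001, 0.001], [0.0, 0.0]])
p = np.array([1.0, 0.0])
p = transient(Q_takeoff, p, 0.5)   # phase 1
p = transient(Q_cruise, p, 10.0)   # phase 2: initial vector = phase-1 final vector
print(p[0])                        # P(system still up after both phases)
```

In general the phases may have different state spaces, in which case the mapping between final and initial vectors is a (possibly many-to-one) state correspondence rather than the identity used here.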
appropriate, not only the solution methods for P(t), but also for L(t) and π
are explored.
We note that the Kolmogorov differential equation (2.1) is a first order linear
differential equation that can be solved using Laplace transforms (Trivedi
1982). Taking the Laplace transform on both sides of the equation, we get
sP(s) - P(0) = P(s)Q .

Rearranging the terms,

P(s) = P(0)(sI - Q)^{-1} ,
where I is the identity matrix. The transient state probability vector is ob-
tained by computing the inverse Laplace transform of P(s). In general, com-
puting the inverse Laplace transform for this equation is extremely difficult,
except for Markov chains with very small state spaces; details may be found
in (Trivedi 1982). The advantage of this method is that the solution thus ob-
tained will be closed-form and fully symbolic in both the system parameters
and time t. In principle this approach can also be used to compute L(t).
Suppose the matrix Q has m ≤ n distinct eigenvalues, say λ_1, λ_2, ..., λ_m,
arranged in non-decreasing order of magnitude. Since Q is singular, λ_1 = 0.
Let d_i be the multiplicity of λ_i. The general solution for the state probability
P_i(t) can be written as

P_i(t) = Σ_{j=1}^{m} Σ_{k=1}^{d_j} a_jk t^{k-1} e^{λ_j t} ,
where the a_jk's are constants. The state probabilities can be easily computed
once the eigenvalues λ_j of the Q matrix and the constants a_jk are computed.
For an acyclic Markov chain, the diagonal elements of the Q matrix yield
the required eigenvalues. Using the convolution integration approach (Trivedi
1982), an O(n²) algorithm has been developed in Marie et al. (1987). With a
sparse Q matrix, the algorithm can be further simplified to obtain an O(η)
algorithm, where η is the number of non-zero entries in the Q matrix.
For a general Markov chain, an O(n³) algorithm has been developed in
Tardif et al. (1988) and Ramesh and Trivedi (1995). They first determine
the eigenvalues for the Q matrix, using the QR algorithm (Wilkinson and
Reinsch 1971). Subsequently, the ajk constants are determined by solving a
linear system of equations.
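For a diagonalizable Q with distinct eigenvalues, the spectral solution reduces to P(t) = P(0) V diag(e^{λ_j t}) V^{-1}, a special case of the general form above (a defective Q would need the t^{k-1} terms). A sketch on an illustrative 2-state chain:

```python
import numpy as np

# Illustrative 2-state chain; eigenvalues of Q are 0 and -(a+b).
a, b = 1.0, 2.0
Q = np.array([[-a, a], [b, -b]])
p0 = np.array([1.0, 0.0])

lam, V = np.linalg.eig(Q)   # lam contains lambda_1 = 0 (Q is singular)

def P(t):
    # P(t) = P(0) V diag(exp(lam * t)) V^{-1}
    return (p0 @ V) * np.exp(lam * t) @ np.linalg.inv(V)

print(P(1.0)[0])   # analytic value: b/(a+b) + a/(a+b) * exp(-(a+b))
```

Once the (one-time) eigendecomposition is done, evaluating P(t) at any t is cheap, which is the appeal of the method for small state spaces.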
This method yields a closed-form solution for the state probabilities, as
a function of the time variable t. In general, this method cannot be used
for Markov chains with large state spaces (≥ 400 states), because the QR
algorithm produces a full upper Hessenberg matrix causing space and time
limitations. We are thus forced to resort to fully numerical solution methods
that are discussed next.
e^{Qt} = Σ_{i=0}^{∞} (Qt)^i / i! ,

P(t) = Σ_{i=0}^{∞} Π(i) e^{-qt} (qt)^i / i! ,    (6.2)
where q ≥ max_i |q_ii|; Π(i) is the state probability vector of the underlying
discrete-time Markov chain (DTMC) after step i. Π(i) is computed iteratively:

Π(0) = P(0) ,    (6.3)
Π(i) = Π(i-1)Q* ,    (6.4)
where Q* = Q/q + I. In practice, the summation in equation (6.2) is carried
out up to a finite number of terms k, called the right truncation point. The
number of terms required to meet a given error tolerance ε is computed from

1 - Σ_{i=0}^{k} e^{-qt} (qt)^i / i! ≤ ε .
As qt increases, the Poisson distribution thins from the left; that is, the terms
in the summation for small i become less significant. Thus it may be profitable
to start the summation at a value l > 0, called the left truncation point (see
De Souza and Gail 1989 and Reibman and Trivedi 1988), to avoid the less
significant terms. In this case, equation (6.2) reduces to

P(t) ≈ Σ_{i=l}^{k} Π(i) e^{-qt} (qt)^i / i! .    (6.5)

We compute the values of l and k from the specified truncation error tolerance
ε, using

Σ_{i=0}^{l-1} e^{-qt} (qt)^i / i! ≤ ε/2 ,    1 - Σ_{i=0}^{k} e^{-qt} (qt)^i / i! ≤ ε/2 .
Randomization has several desirable properties. We can bound the error
due to truncation of the infinite series. Thus given a truncation error tol-
erance requirement, we can precompute the number of terms of the series
needed to satisfy this tolerance. Since this method involves only additions
and multiplications and no subtractions, it is not subject to severe roundoff
errors.
One of the main problems with randomization is its O(ηqt) complexity
(Reibman and Trivedi 1988). The number of terms needed for randomization
between the left and the right truncation point is O(√(qt)). However, it is
necessary to obtain the DTMC state probability vector at l, the left truncation
point, and l is O(qt). Thus we need to compute O(qt) matrix-vector
multiplications. Instead of using successive matrix-vector multiplies (MVMs) to
compute this vector, we could use the matrix squaring method and change the
complexity of computing Π(l) from O(ηqt) to O(n³ log(qt)) (Reibman and
Trivedi 1988), where n is the number of states in the Markov chain. However,
the problem with this method is that squaring results in fill-in (reducing
sparsity), and hence it is not feasible for CTMCs with large state spaces.
When qt is large, computing the Poisson probabilities, especially near
the tails of the distribution, may result in underflow problems (Fox and
Glynn 1988). We thus choose to use the method suggested by Fox and Glynn
(1988) to compute l and r. This method computes the Poisson probabilities
e^{-qt}(qt)^i / i! for all i = l, l+1, ..., r-1, r, and is designed to
avoid the underflow problems.
We have suggested a modified randomization-based method (Malhotra
et al. 1994) that addresses some of the problems caused by large values of
qt. Our method is based on recognizing the steady-state for the underlying
DTMC. We can take advantage of this fact, and rewrite the randomization
equations in such a way that further computation is minimized. One nicety
of our method is that the computation time is now controlled by the sub-
dominant eigenvalue of the DTMC matrix rather than by qt. Thus stiffness
as seen by the new randomization algorithm, is the same as that seen by the
power method (see Section 6.4.1) used for computing the steady-state solution
for the CTMC (Stewart and Goyal 1985). In our experience with a variety
since this assures that the DTMC is aperiodic (Goyal et al. 1987). Note that
we do not require that the CTMC (or the DTMC) be irreducible. Indeed
we allow a more general structure with one or more recurrent classes and
a (possibly empty) transient class of states. Let Π* denote the steady-state
probability vector of the DTMC.
Assume that the probability vector for the underlying DTMC attains
steady-state at the S-th iteration, so that ||Π(S) - Π*|| is bounded above
by a given error tolerance. Three different cases arise in the computation of
the transient state probability vector of the CTMC: (1) S > k, (2) l < S ≤ k,
and (3) S ≤ l. We examine each of these cases individually. In the following
equations we will denote the transient state probability of the CTMC
computed by the new randomization algorithm as P(t).
Case 1 (S > k): In this case the steady-state detection has no effect, and the
probability vector is calculated using equation (6.5).
Case 2 (l < S ≤ k): Consider equation (6.5). By using Π(i) = Π(S), i > S,
the equation can be rewritten with the right truncation point k set to ∞:

P(t) = Σ_{i=l}^{S} Π(i) e^{-qt} (qt)^i / i! + Π(S) (1 - Σ_{i=0}^{S} e^{-qt} (qt)^i / i!) .
Case 3 (S ≤ l): The DTMC reaches steady-state before the left truncation
point. In this case, no additional computation is necessary and P(t) is set
equal to Π(S).
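The steady-state-detection idea can be sketched as a small modification of the plain randomization loop: once successive DTMC vectors agree to within a threshold δ (the S-th step), all remaining Poisson mass is assigned to Π(S). This is a simplified illustration of the scheme, not the exact algorithm of Malhotra et al.; the 2-state chain is illustrative:

```python
import numpy as np

def transient_ssd(Q, p0, t, tol=1e-12, delta=1e-12):
    # Randomization with steady-state detection for the underlying DTMC.
    q = 1.02 * max(abs(Q[i, i]) for i in range(len(Q)))
    P = np.eye(len(Q)) + Q / q
    v = p0.copy()
    w = np.exp(-q * t)
    out, acc, i = w * v, w, 0
    while 1.0 - acc > tol:
        i += 1
        nxt = v @ P
        if np.abs(nxt - v).max() < delta:   # DTMC converged at step S = i
            out = out + (1.0 - acc) * nxt   # give Pi(S) all remaining Poisson mass
            return out
        v = nxt
        w *= q * t / i
        out = out + w * v
        acc += w
    return out

Q = np.array([[-1.0, 1.0], [2.0, -2.0]])
p_t = transient_ssd(Q, np.array([1.0, 0.0]), 50.0)
print(p_t)   # close to the steady state [2/3, 1/3]
```

For qt = 102 here, the DTMC converges after a few dozen steps, so the loop stops long before the O(qt) terms plain randomization would need; the stopping step is governed by the subdominant eigenvalue of the DTMC matrix, as stated above.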
For stiff problems, the number of terms needed to meet the truncation
error tolerance requirements is often very large. However as shown above, if
L(t) = (1/q) Σ_{i=0}^{∞} Π(i) Σ_{j=i+1}^{∞} e^{-qt} (qt)^j / j!
     = (1/q) Σ_{i=0}^{∞} Π(i) (1 - Σ_{j=0}^{i} e^{-qt} (qt)^j / j!) .    (6.7)

This is again a summation of an infinite series, which can be evaluated up to
the first k significant terms (Reibman and Trivedi 1989), resulting in

L(t) = (1/q) Σ_{i=0}^{k} Π(i) (1 - Σ_{j=0}^{i} e^{-qt} (qt)^j / j!) .    (6.8)
The error due to truncation, ε(k)(t), is bounded above by
Case A (S > k): In this case equation (6.8) is unaffected and the summation
is carried out up to k terms.
Case B (S ≤ k): In this case equation (6.8) is modified as follows:

L(t) = (1/q) Σ_{i=0}^{S} Π(i) Σ_{j=i+1}^{∞} e^{-qt} (qt)^j / j!
     + (1/q) Π(S) Σ_{i=S+1}^{∞} Σ_{j=i+1}^{∞} e^{-qt} (qt)^j / j! .    (6.10)
Given a time point t at which the solution is required, the authors select a
time point t_0 such that t = 2^m t_0. The value t_0 is chosen such that qt_0 < 0.1,
to ensure that the Poisson terms e^{-qt_0}(qt_0)^i / i! decrease very fast, and thus
the summation can be truncated after fewer than 10 terms. Then, using Horner's
algorithm, they compute P(t_0) with the truncated summation. The value of m is
chosen to be

m = ⌊log_2[4(η + 3)qt]⌋ .

They use the randomization equations to compute P(t_0) first. Noting that
if t_k = 2t_{k-1}, then P(t_k) = P(t_{k-1})², they use matrix squaring to compute
P(t_k) for different values of t_k until P(t) is computed. Then P(t) can be
obtained from equation (6.9).
This method also permits the solution of the Markov chain simultaneously
for different time points t_k that are 2^k multiples of t_0. In their experience, this
method yields a faster solution for stiff Markov chains compared with normal
randomization. However, they note that the computation of the matrix P(t_k)
through matrix squaring results in some fill-in, so the sparseness of the matrix
is lost. This may affect the tractability of the uniformized power method for
large problems.
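The matrix-squaring idea can be sketched as follows (again on an assumed two-state generator; the rates and the truncation point l_0 are illustrative). The one-step matrix over t_0 = t/2^m is built by truncated randomization and then squared m times:

```python
import math

lam, mu = 0.5, 2.0                       # assumed rates
Q = [[-lam, lam], [mu, -mu]]
q = max(-Q[i][i] for i in range(2))
Qstar = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(2)] for i in range(2)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def uniformized_power(t, m, l0=8):
    t0 = t / 2 ** m                      # m chosen so that q * t0 < 0.1
    assert q * t0 < 0.1
    # M(t0) = sum_{l=0}^{l0} e^{-q t0} (q t0)^l / l! (Q*)^l  (truncated randomization)
    term = math.exp(-q * t0)
    Qpow = [[1.0 if i == j else 0.0 for j in range(2)] for i in range(2)]
    M = [[term * Qpow[i][j] for j in range(2)] for i in range(2)]
    for l in range(1, l0 + 1):
        Qpow = matmul(Qpow, Qstar)
        term *= q * t0 / l
        M = [[M[i][j] + term * Qpow[i][j] for j in range(2)] for i in range(2)]
    for _ in range(m):                   # P(t_k) = P(t_{k-1})^2, k = 1..m
        M = matmul(M, M)
    return M                             # approximates exp(Q t)

M = uniformized_power(t=4.0, m=8)
print(M[0][0])                           # P(up at t | up at 0)
```

Note the fill-in issue mentioned above does not show on a dense 2x2 example; it matters precisely when Q is large and sparse.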
6.3.3 Adaptive Uniformization. Another method based on randomization
is adaptive uniformization (AU), proposed by van Moorsel and Sanders
(1994). This method is suitable for stiff models and for models with infinite
state space, even when the transition rates are not uniformly bounded.
For the underlying DTMC, they define the set of active states at
step n (n = 0, 1, 2, ...) as the set \Omega_n \subseteq \Omega with

\Omega_n = \{ i \in \Omega \mid \Pi_i(n) > 0 \} .

Then for n = 0, 1, 2, ..., they define q_n = \sup\{ q_i \mid i \in \Omega_n \}, where q_i is the
total exit rate of state i, as the adapted uniformization rate. The corresponding
adapted infinitesimal generator matrix at step n, Q(n) = [q_{ij}(n)], is defined as

q_{ij}(n) = q_{ij} if i \in \Omega_n, and q_{ij}(n) = 0 otherwise.

Similarly, the adapted transition matrices for the DTMC are defined as

Q^*(n) = I + Q(n)/q_n ,  n = 0, 1, 2, ...
Now define a stochastic process T = \{T_n, n = 0, 1, 2, ...\}, where

T_n = Exp(q_0) + Exp(q_1) + ... + Exp(q_{n-1}), and T_0 = 0,

with Exp(q_i) representing an exponentially distributed random variable with
rate q_i. Furthermore, define U_n(t) as the probability of exactly n jumps in
the interval [0, t]:

U_n(t) = P\{ T_n \le t \wedge T_{n+1} > t \},  t \ge 0, n = 0, 1, 2, ...
The transient state probabilities are then given by

P(t) = P(0) \sum_{n=0}^{\infty} \Bigl( \prod_{i=0}^{n-1} Q^*(i) \Bigr) U_n(t) = \sum_{n=0}^{\infty} \Pi(n) U_n(t) ,

with

\Pi(0) = P(0) and \Pi(n) = \Pi(n-1) Q^*(n-1), n = 1, 2, ...

The infinite summation is truncated after N_a steps, where N_a is chosen such that

\sum_{n=0}^{N_a} U_n(t) \ge 1 - \epsilon ,

where \epsilon is the desired accuracy. They call the pure birth process with transition
rates q_0, q_1, ... the AU-jump process, and the DTMC subordinated to
the AU-jump process the AU process.
They note that in general the AU method requires fewer steps than the
standard uniformization for a given accuracy. However, each step of the AU
method requires more computation. Typically, AU is better than standard
randomization for t < t*, where t* is the turning point. For t > t* AU
becomes computationally more intensive than standard randomization.
When the state space is infinite, Grassmann (1991) suggests a method
called dynamic uniformization, which is also based on the concept of active
states but uses a fixed value of q. However, this method does not yield accurate
results, because there exists a value of t at which one of the transition rates
out of an active state will exceed q. Adaptive uniformization does not suffer
from this problem, since the value of q is not fixed, but is selected at each
step n based on the set of active states.
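As an illustration (our own, not from van Moorsel and Sanders), consider a pure-birth Yule-type chain with exit rate q_i = (i+1)\lambda out of state i: the rates are unbounded, so no single uniformization rate exists, yet AU applies. Starting in state 0, the AU process is deterministic (state n after n jumps), so P_n(t) = U_n(t); for distinct rates the jump probabilities U_n(t) have the closed hypoexponential form used below:

```python
import math

lam = 1.0

def q(i):
    # state-dependent exit rate: unbounded in i, so standard uniformization fails
    return (i + 1) * lam

def U(n, t):
    """P{exactly n jumps of the AU-jump process in [0, t]}:
    T_n = Exp(q_0) + ... + Exp(q_{n-1}); closed form for distinct rates."""
    rates = [q(k) for k in range(n + 1)]          # q_0 .. q_n
    coeff = 1.0
    for k in range(n):
        coeff *= rates[k]                         # prod_{k=0}^{n-1} q_k
    total = 0.0
    for k in range(n + 1):
        denom = 1.0
        for j in range(n + 1):
            if j != k:
                denom *= rates[j] - rates[k]
        total += math.exp(-rates[k] * t) / denom
    return coeff * total

# For this chain P_n(t) = U_n(t); the known Yule law is e^{-t}(1 - e^{-t})^n.
t = 0.7
probs = [U(n, t) for n in range(30)]
print(probs[0], sum(probs))   # truncation after N_a = 30 steps captures ~all mass
```

The per-step cost of evaluating U_n(t) is what makes each AU step more expensive than a standard randomization step, as noted above.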
6.3.4 ODE-based Methods. Numerical solution of Markov chains re-
quires the solution of a system of ODEs for which standard techniques are
known. Different methods can be used for different kinds of problems. For
example, stiff methods can be used for stiff systems (or stiff Markov chains).
Methods also differ in the accuracy of the solution yielded and computational
complexity.
ODE solution methods discretize the solution interval into a finite number
of time points {t_1, t_2, ..., t_i, ..., t_n}. Given a solution at t_i, the solution at
t_i + h (= t_{i+1}) is computed. Advancement in time is made with step size
h, until the time at which the solution is desired (we call it the mission time)
is reached. Commonly, the step-size is not constant, but varies from step to
step. ODE solution methods can be classified into two categories: explicit and
implicit.
For stiff systems, the step size of an explicit method may need to be ex-
tremely small to achieve the desired accuracy (Gear 1971). However, when
the step size becomes very small, the round-off effects become significant and
computational cost increases greatly (as many more time steps are needed).
Implicit ODE methods, on the other hand, are inherently stable as they do
not force a decrease in the step-size to maintain stability. The stability of
implicit methods can be characterized by the following definitions. A method
is said to be A-stable if, when applied to the differential equation \dot{y} = \lambda y
with a fixed positive step size h and a (complex) constant \lambda with a negative
real part, all numerical approximations to the actual solution tend to zero as
n \to \infty (Gear 1971); n is the number of mesh points, which divide the solution
interval. For extremely stiff problems, even A-stability does not suffice to ensure
that rapidly decaying solution components decay rapidly in the numerical
approximation as well, without large decrease in the step-size. This could
lead to a phenomenon called ringing, i.e., the successively computed values
tend to be of the same magnitude but of opposite sign (y_{i+1} \approx -y_i). To prevent
ringing, the step-size must be reduced further (Bank et al. 1985), which
leads us back to the same problem. Axelsson (1969) defined methods to be
stiffly A-stable if, for the equation \dot{y} = \lambda y, y_{i+1}/y_i \to 0 as Re(\lambda h) \to -\infty.
This property is also known as L-stability (Lambert 1991). In this paper, we
describe two L-stable ODE methods.
TR-BDF2 Method. This is a re-starting cyclic multi-step composite method
that uses one step of TR (trapezoidal rule) and one step of BDF2 (second
order backward difference formula) (Bank et al. 1985). This method borrows
its L-stability from the L-stability of the backward difference formula, while
the TR step provides the desirable property of re-starting. A single step of
TR-BDF2 is composed of a TR step from t_i to t_i + \gamma h and a BDF2 step
from t_i + \gamma h to t_{i+1}, where 0 < \gamma < 1. For the system of equations (2.1),
the local truncation error (LTE) of this method is

\varepsilon(h) = \frac{-3\gamma^2 + 4\gamma - 2}{12(2 - \gamma)} \, h^3 P(t) Q^3 ,
where h is the step size at time t. Direct estimation of the LTE vector is
perhaps most accurate, but it requires three matrix-vector multiplications.
For the Markov chain solution, these approximations take the form of
matrix polynomials (polynomials in hQ). When computing state probabilities
using equation (2.1), these methods yield a linear algebraic system at each
time step:
P(t + h) \sum_{i=0}^{r} \alpha_i (hQ)^i = P(t) \sum_{i=0}^{r-1} \beta_i (hQ)^i ,   (6.15)
where \alpha_i and \beta_i are constants whose values are determined based upon the
order of the method desired. The 0th power of hQ is defined to be the identity
matrix I. In general, these methods involve higher powers of the generator
matrix Q. Substituting r = 2 into equation (6.15), we get a third order
L-stable method:
P(t + h) \Bigl( I - \frac{2}{3} hQ + \frac{1}{6} h^2 Q^2 \Bigr) = P(t) \Bigl( I + \frac{1}{3} hQ \Bigr) .   (6.16)
Similarly, using r = 3, a fifth order method may be derived. In principle, we
could derive methods of even higher order. However, with higher orders, we
also need to compute higher powers of the Q matrix, which means increased
computational complexity. We restrict ourselves to the third order method,
described by equation (6.16).
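A single step of the third order method (6.16) can be sketched as follows (the two-state generator and its rates are our own assumptions; the 2x2 system x A = b is solved directly by Cramer's rule, whereas a real implementation would use the sparse iterative solvers discussed below). The large-h step at the end illustrates L-stability: the fast transient is damped rather than rung:

```python
import math

lam, mu = 0.5, 2.0                        # assumed rates; exact solution known
Q = [[-lam, lam], [mu, -mu]]

def step(p, h):
    """One step of eq. (6.16): P(t+h)(I - 2/3 hQ + 1/6 h^2 Q^2) = P(t)(I + 1/3 hQ)."""
    Q2 = [[sum(Q[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    A = [[(1.0 if i == j else 0.0) - (2.0 * h / 3.0) * Q[i][j] + (h * h / 6.0) * Q2[i][j]
          for j in range(2)] for i in range(2)]
    b = [p[j] + (h / 3.0) * sum(p[i] * Q[i][j] for i in range(2)) for j in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]   # solve row system x A = b
    return [(b[0] * A[1][1] - b[1] * A[1][0]) / det,
            (b[1] * A[0][0] - b[0] * A[0][1]) / det]

p = [1.0, 0.0]
for _ in range(8):                        # 8 steps of h = 0.5 up to t = 4
    p = step(p, 0.5)
print(p[0])                               # close to 0.8 + 0.2 e^{-10}

p_big = step([1.0, 0.0], 1000.0)          # L-stability: one huge stiff step
print(p_big)                              # lands near the steady state [0.8, 0.2]
```

Because Q has zero row sums, the LHS matrix has unit row sums, so probability mass is conserved exactly at every step.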
Various possibilities exist for solving the system in equation (6.16). One
possibility is to factor the matrix polynomial on the left-hand side:

P(t + h)(I - r_1 hQ)(I - r_2 hQ) = P(t) \Bigl( I + \frac{1}{3} hQ \Bigr) ,   (6.17)

where r_1 and r_2 are the roots of the polynomial x^2 - \frac{2}{3} x + \frac{1}{6}. This system
can be solved by solving two systems:

X(I - r_2 hQ) = P(t) \Bigl( I + \frac{1}{3} hQ \Bigr)   (6.18)

P(t + h)(I - r_1 hQ) = X .   (6.19)

Unfortunately, the roots r_1 and r_2 are complex conjugates; hence this approach
will require the use of complex arithmetic.
For the third order implicit Runge-Kutta method, the LTE vector at t + h
is given by

\varepsilon(h) = \frac{1}{72} h^4 P(t) Q^4 .   (6.20)
The LHS matrices ((2 - \gamma)I - (1 - \gamma)hQ for TR-BDF2, and I - \frac{2}{3} hQ + \frac{1}{6} h^2 Q^2 for the
implicit Runge-Kutta method) are diagonally dominant because of the special
structure of the Q matrix, which helps faster convergence of the iterative
solvers. However, if Gauss-Seidel does not converge within a specified number
of iterations, then we switch to SOR. If convergence is still not achieved, then
either the tolerance is relaxed by a constant factor or we could switch to a
sparse direct method.
In the next step, the LTE vector at the end of time step is calculated.
A scalar estimate of LTE is obtained as a suitable norm (L_1, L_2, or L_\infty) of
the LTE vector. If the scalar LTE estimate is within the error tolerance, then
the step is accepted. If the end of the solution interval is reached, then the
procedure ends. Otherwise a new step-size is computed, based on the step-size
control technique, such that it is less than h_max. The above steps are
repeated, starting from the step in which the LHS matrix is computed. If the
scalar error estimate is not within the error tolerance, then the step-size is
reduced and the above time-step is repeated. If the step-size must be reduced
below h_min, then one can either increase the error tolerance
or switch to another ODE solver with a higher order of accuracy. Note that
since we work with local truncation errors, we require that error tolerance be
specified as the local error tolerance and not as the global error tolerance. It
is hard to estimate global error from the local errors occurring at each time
step. However, it is reasonable to assume that controlling local errors would
help bound the global error. There exist several step-size control techniques.
We use the following:
h_{opt} = h \left( \frac{\text{local tolerance}}{\text{LTE}} \right)^{1/(order+1)} ,   (6.21)
where order is the order of accuracy of the single-step method.
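In code, the update (6.21) is a one-liner; the safety factor below is our own addition (common practice, not part of the formula), included to guard against immediate re-rejection of the new step:

```python
def new_step_size(h, lte, tol, order, safety=0.9):
    # eq. (6.21): h_opt = h * (tol / LTE)^(1/(order+1)), times a safety factor
    return safety * h * (tol / lte) ** (1.0 / (order + 1))

# LTE exactly at tolerance: keep h (up to the safety factor);
# LTE a thousand times too large with a 2nd order method: shrink h tenfold
print(new_step_size(0.5, 1e-6, 1e-6, order=2))
print(new_step_size(0.5, 1e-3, 1e-6, order=2))
```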
Computational Complexity. The computational complexity of ODE solution
methods has traditionally been evaluated in terms of the number of function
evaluations. In our case each function evaluation is a matrix-vector multipli-
cation. For implicit methods, computational complexity is heavily dependent
on the linear system solver. Each iteration of an iterative linear system solver
takes O(\eta) time, where \eta is the number of non-zero entries in the Q matrix.
However, the number of iterations until convergence can not be bounded a
priori. Let s be the number of time-steps required by the ODE-solver to
compute the state probability vector at the mission time.
For the TR-BDF2 method with an iterative linear system solver, the
complexity is O(Is\eta), where I is the average number of iterations per linear
system solution. For the implicit Runge-Kutta method, we analyze the
case where the LHS matrix polynomial is computed directly. Computing the
matrix polynomial involves squaring the matrix and three matrix additions.
Squaring the matrix takes O(n\eta) time, where n is the number of states in the
Markov chain. The squaring of Q results in some fill-in. Suppose \eta' denotes
the number of non-zeroes in the squared matrix, and f the fill-in ratio (\eta'/\eta).
We found that f increases with n. For most of the Markov chains we tried,
f was not more than 10 percent. Having computed the LHS matrix, the
remaining computation occurs in solving the linear system of n equations.
Using an iterative solver, the total time-complexity is O(n\eta + Is\eta'), where
I is the average number of iterations per linear system solution. We found
that usually not more than two to three iterations are required for iterative
methods to converge.
6.3.5 Hybrid Methods. These methods combine explicit (non-stiff) and
implicit (stiff) ODE methods for numerical transient analysis of Markov mod-
els. This approach (Malhotra 1996) is based on the property that stiff Markov
chains are non-stiff for an initial phase of the solution interval. A non-stiff
ODE method is used to solve the model for this phase, and a stiff ODE
method for the rest of the duration until the mission time. A formal crite-
rion to determine the length of the non-stiff phase is described. A significant
outcome of this approach is that the accuracy requirement automatically becomes
part of the characterization of model stiffness. Two specific methods based on this approach
are implemented in Malhotra (1996). Both methods use the fourth order
Runge-Kutta-Fehlberg method as the non-stiff method. One uses the TR-
BDF2 method as the stiff method, whereas the other uses an implicit Runge-
Kutta method. Results from solving several models show that the resulting
methods are much more efficient than the corresponding stiff methods (TR-
BDF2 and implicit Runge-Kutta). The implementation details are similar to
those of the standard ODE implementation, with some minor modifications
required to be able to switch from the non-stiff ODE method to the stiff ODE
method, upon detection of stiffness in the Markov chain.
The solution of Markov reward models involves the computation of the expec-
tations and the distributions of various reward measures that were reviewed
earlier. In this section we briefly discuss some of the recent developments in
the solution of Markov reward models.
The expressions for expected values of the reward measures (which were
derived in Section 2.2.1) show that these measures are dependent on the state
probabilities, P(t), and 11', and the expected accumulated times in the states,
L(t) and z. Thus the computation of these measures is straightforward, once
the state probabilities and the expected accumulated times are computed.
7.2.1 Computing P[Y(\infty) \le y]. Beaudry (1978) first described a method
for computing P[Y(\infty) \le y], the distribution of accumulated reward until
absorption. She assumed that all non-absorbing states have positive reward
rates assigned to them. Given a Markov chain \{Z(t), t \ge 0\} with a reward
rate structure defined such that state i of the chain is assigned a reward rate
of r_i, a new Markov chain \{\tilde{Z}(t), t \ge 0\} is constructed by dividing the
transition rates out of state i by r_i. It can be proved that the distribution
of the time to absorption of this new Markov chain yields the distribution of
accumulated reward until absorption, P[Y(\infty) \le y], for the original Markov
chain. The reason this is true is that the sojourn time in state i of the original
Markov chain is speeded up or slowed down according to whether r_i is smaller
or larger than 1. Thus, for state i, a sojourn time of \tau in \{Z(t), t \ge 0\} is
equivalent to a sojourn time of \tau r_i in \{\tilde{Z}(t), t \ge 0\}.
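A tiny numerical illustration of this transformation (our own construction, not from the text: a hypothetical series system 0 -> 1 -> 2 with state 2 absorbing). Dividing the exit rates by the reward rates turns the accumulated-reward distribution into an ordinary time-to-absorption distribution, which we cross-check by simulating the original rewards:

```python
import math
import random

# Assumed series system: 0 -> 1 -> 2 (absorbing), exit rates a, b;
# reward rates r0, r1 on the two transient states.
a, b, r0, r1 = 1.0, 3.0, 2.0, 0.5

# Beaudry's transformation: divide the transition rates out of state i by r_i.
a_hat, b_hat = a / r0, b / r1

def cdf_absorption(t, l1, l2):
    # P(Exp(l1) + Exp(l2) <= t): hypoexponential CDF for distinct rates
    return 1.0 - (l2 * math.exp(-l1 * t) - l1 * math.exp(-l2 * t)) / (l2 - l1)

# P[Y(inf) <= y] of the original chain equals the time-to-absorption CDF
# of the transformed chain; cross-check by simulating accumulated reward.
random.seed(1)
y, n, hits = 2.5, 200_000, 0
for _ in range(n):
    reward = r0 * random.expovariate(a) + r1 * random.expovariate(b)
    hits += reward <= y
print(hits / n, cdf_absorption(y, a_hat, b_hat))  # the two agree closely
```

The cross-check works because r * Exp(lam) is distributed as Exp(lam / r), which is exactly the rate scaling the transformation performs.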
Ciardo et al. (1990) extended Beaudry's method to allow for non-absorbing
states with zero reward rates, and also allowed for the underlying process to
be semi-Markovian. They note that when the reward rates are zero, the above
transformation yields states from which the transition rates are infinite. Such
a situation actually occurs in the solution of generalized stochastic Petri net
(GSPN) models (Ajmone et al. 1984), where vanishing states occur in the un-
derlying stochastic process describing the behavior of the GSPN. These states
are handled by eliminating them; that is, constructing a stochastic process
that contains only those states with non-zero sojourn times. The same prin-
ciple is used in this situation; that is eliminate those states with zero reward
rates. The solution of the time to absorption for the resulting stochastic pro-
cess yields the distribution of accumulated reward until absorption.
Note, however, that both solution methods consider only reward rates
assigned to the states of the (semi-)Markov process; they do not take into
account impulse rewards.
7.2.2 Computing P[Y(t) \le y]. The computation of P[Y(t) \le y] is in
general difficult. Several numerical methods to solve this problem have been
presented in the literature. Meyer (1982) obtained a solution for acyclic
Markov reward models with the reward rate r_i being a monotonic function
of the state labeling.
Considering the complement of the distribution function, let us denote
\bar{Y}(t, y) = P[Y(t) > y]. Kulkarni et al. (1986) derived a double Laplace transform
system relating \bar{Y}(t, y) and the reward rates:

(sI + uR - Q) \bar{Y}^{\sim *}(u, s) = e ,

where \bar{Y}^{\sim *}(u, s) is \bar{Y}(t, y) with a Laplace-Stieltjes transform (\sim) taken with
respect to y, followed by a Laplace transform (*) taken with respect to t;
R = diag[r_1, r_2, ..., r_n] is a diagonal matrix, and e is a column vector with
all elements equal to 1. Smith et al. (1988) developed a double-transform
inversion method to solve the above system of equations.
The matrix K(t) = [Kij(t)] is called the kernel of the Markov renewal
sequence. The time instances {Tn} are called the regeneration instances.
Definition 8.2 (Kulkarni 1995). Given a Markov renewal sequence {(Y_n, T_n), n \ge 0}
with the kernel K(t), define N(t) = \sup\{ n \ge 0 : T_n \le t \} as the number of
regeneration instances in (0, t].

The conditional transient probabilities V_{ij}(t) = P\{ Z(t) = j \mid Y_0 = i \} of the
MRGP satisfy the Markov renewal equation

V(t) = E(t) + \int_0^t dK(u) \, V(t - u) ,   (8.1)

where E(t) = [e_{ij}(t)] is called the local kernel, while K(t) = [K_{ij}(t)] is called
the global kernel of the MRGP.
Given the initial probability vector P(O), we can compute the system state
probabilities at time t as
P(t) = P(O)V(t) .
References
Abdallah, H., Marie, R.: The Uniformized Power Method for Transient Solutions
of Markov Processes. Computers and Operations Research 20, 515-526 (1993)
Ajmone Marsan, M., Conte, G., Balbo, G.: A Class of Generalized Stochastic Petri Nets
for the Performance Evaluation of Multiprocessor Systems. ACM Transactions
on Computer Systems 2, 93-122 (1984)
Axelsson, 0.: A Class of A-Stable Methods. BIT 9, 185-199 (1969)
Bank, R.E. et al.: Transient Simulation of Silicon Devices and Circuits. IEEE Trans-
actions on Computer-Aided Design 4, 436-451 (1985)
Baskett, F. et al.: Open, Closed and Mixed Networks of Queues with Different
Classes of Customers. Journal of the ACM 22, 248-260 (1975)
Beaudry, M.D.: Performance-Related Reliability Measures for Computing Systems.
IEEE Transactions on Computers C-27, 540-547 (1978)
Bobbio, A., Trivedi, K.S.: An Aggregation Technique for the Transient Analysis of
Stiff Markov Chains. IEEE Transactions on Computers C-35, 803-814 (1986)
Boyd, M. et al.: An approach to solving large reliability models. Proceedings of
IEEE/AIAA DASC Symposium. San Diego (1988)
Carrasco, J.A., Figueras, J.: METFAC: Design and Implementation of a Software
Tool for Modeling and Evaluation of Complex Fault-Tolerant Computing Sys-
tems. Proceedings of the IEEE International Symposium on Fault-Tolerant
Computing. Los Alamitos: IEEE Computer Society Press 1986
Chiola, G.: A Software Package for the Analysis of Generalized Stochastic Petri
Net Models. Proceedings of the International Workshop on Timed Petri Nets.
Los Alamitos: IEEE Computer Society Press 1985, pp. 136-143
Ciardo, G. et al.: SPNP: Stochastic Petri Net package. Proceedings of the Interna-
tional Workshop on Petri Nets and Performance Models. Los Alamitos: IEEE
Computer Society Press 1989, pp. 142-150
Ciardo, G. et al.: Performability Analysis Using Semi-Markov Reward Processes.
IEEE Transactions on Computers C-39, 1251-1264 (1990)
Ciardo, G. et al.: Automated Generation and Analysis of Markov Reward Models
Using Stochastic Reward Nets. In: Meyer, C., Plemmons, R.J. (eds.): Linear
Algebra, Markov Chains, and Queueing Models. IMA Volumes in Mathematics
and its Applications 48 . Heidelberg: Springer 1993, pp. 145-191
Ciardo, G., Trivedi, K.S.: A Decomposition Approach for Stochastic Petri Net Mod-
els. Performance Evaluation 18, 37-59 (1993)
Clarotti, C.: The Markov Approach to Calculating System Reliability: Computa-
tional Problems. In: Serra, A., Barlow, R.E. (eds.): Proceedings of the Interna-
tional School of Physics. Course XCIV. Amsterdam: North-Holland 1986, pp.
55-66.
Couvillion, J.A. et al.: Performability Modeling with Ultrasan. IEEE Software 8,
69-80 (1991)
de Souza e Silva, E., Gail, H.R.: Calculating Availability and Performability Mea-
sures of Repairable Computer Systems Using Randomization. Journal of the
ACM 36, 171-193 (1989)
de Souza e Silva, E. et al.: Calculating Transient Distributions of Cumulative Re-
ward. Proceedings of the SIGMETRICS'95 (1995), pp. 231-240
Donatiello, L., Grassi, V.: On Evaluating the Cumulative Performance Distribu-
tion of Fault-tolerant Computer Systems. IEEE Transactions on Computers
40, 1301-1307 (1991)
Duff, I. et al.: Direct Methods for Sparse Matrices. Oxford: Oxford University Press
1986
Dugan, J.: Automated Analysis of Phased-Mission Reliability. IEEE Transactions
on Reliability 40, 45-55 (1991)
Dugan, J.B. et al.: Extended Stochastic Petri Nets: Applications and Analysis. In:
Gelenbe, E. (ed.) : Performance '84. Amsterdam: North-Holland 1984
Dugan, J.B. et al.: The Hybrid Automated Reliability Predictor. AIAA Journal of
Guidance, Control and Dynamics 9, 319-331 (1986)
Fox, B.L., Glynn, P.W.: Computing Poisson Probabilities. Communications of the
ACM 31, 440-445 (1988)
Gear, C.: Numerical Initial Value Problems in Ordinary Differential Equations.
Englewood Cliffs: Prentice-Hall 1971
Geist, R., Trivedi, K.S.: Reliability Estimation of Fault-Tolerant Systems: Tools
and Techniques. IEEE Computer 23, 52-61 (1990)
German, R., Lindemann, C.: Analysis of Stochastic Petri Nets by the Method of
Supplementary Variables. Performance Evaluation 20, 317-335 (1994)
Golub, G.H., Van Loan, C.F.: Matrix Computations. Second Edition. Baltimore: Johns
Hopkins University Press 1989
Goyal, A. et al.: Probabilistic Modeling of Computer System Availability. Annals
of Operations Research 8, 285-306 (1987)
Grassmann, W.K.: Means and Variances of Time Averages in Markovian Environ-
ments. European Journal of Operations Research 31, 132-139 (1987)
Grassmann, W.K.: Finding Transient Solutions in Markovian Event Systems
through Randomization. In: Stewart, W.J. (ed.) : Numerical Solution of Markov
Chains. New York: Marcel Dekker 1991
Haverkort, B.R. et al.: DyQNtool - A Performability Modeling Tool Based on the
Dynamic Queuing Network Concept. In: Computer Performance Evaluation:
Modelling Techniques and Tools. Amsterdam (1992), pp. 181-195
Haverkort, B.R., Trivedi, K.S.: Specification Techniques for Markov Reward Models.
Discrete Event Dynamic Systems: Theory and Applications 3, 219-247 (1993)
Howard, R.A.: Dynamic Probabilistic Systems: Semi-Markov and Decision Pro-
cesses. Vol. II. New York: Wiley 1971
Ibe, O.C., Trivedi, K.S.: Stochastic Petri Net Models of Polling Systems. IEEE
Journal on Selected Areas in Communication 8, (1990)
Ibe, O.C. et al.: Stochastic Petri Net Modeling of VAX Cluster System Availability.
In: Proceedings of the International Workshop on Petri Nets and Performance
Models. Los Alamitos: IEEE Computer Society Press 1989, pp. 112-121
Jensen, A.: Markov Chains as an Aid in the Study of Markov Processes. Skand.
Aktuarietidskr. 36, 87-91 (1953)
Johnson, S.C., Butler, R.W.: Automated Generation of Reliability Models. In: Pro-
ceedings of the Annual Reliability and Maintainability Symposium (1988), pp.
17-22
Kantz, H., Trivedi, K.S.: Reliability Modeling of the MARS System: A Case Study
in the Use of Different Tools and Techniques. In: Proceedings of the Fourth
International Workshop on Petri Nets and Performance Models. Los Alamitos:
IEEE Computer Society Press 1991
Keilson, J.: Markov Chain Models: Rarity and Exponentiality. Berlin: Springer 1979
Kim, K., Park, K.: Phased-Mission System Reliability Under Markov Environment.
IEEE Transactions on Reliability 43, 301-309 (1994)
Kulkarni, V.G.: Modeling and Analysis of Stochastic Systems. Chapman and Hall
1995
Kulkarni, V.G. et al.: On Modeling the Performance and Reliability of Multi-Mode
Computer Systems. Journal of System Software 6, 175-182 (1986)
Lambert, J.: Numerical Methods for Ordinary Differential Systems. New York:
Wiley 1991
Lazowska, E.D. et al.: Quantitative System Performance. Englewood Cliffs:
Prentice-Hall 1984
Levy, Y., Wirth, P.E.: A Unifying Approach to Performance and Reliability Objec-
tives. In: Bonatti, M. (ed.): Teletraffic Science for New Cost-Effective Systems,
Networks and Services, ITC-12. Amsterdam: North-Holland 1989, pp. 1173-
1179.
Li, V., Silvester, J.: Performance Analysis of Networks with Unreliable Components.
IEEE Transactions on Commun. COM-32, 1105-1110 (1984)
Logothetis, D. et al.: Markov Regenerative Models. In: Proceedings of the Interna-
tional Computer Performance and Dependability Symposium. Erlangen (1995)
Logothetis, D., Trivedi, K.S.: The Effect of Detection and Restoration Times for
Error Recovery in Communication Networks. In: MILCOM (1995)
Malhotra, M.: A Computationally Efficient Technique for Transient Analysis of
Repairable Markovian Systems. Performance Evaluation. To appear (1996)
Malhotra, M. et al.: Stiffness-Tolerant Methods for Transient Analysis of Stiff
Markov Chains. International Journal on Microelectronics and Reliability 34,
1825-1841 (1994)
Marie, R.A. et al.: Transient Analysis of Acyclic Markov Chains. Performance Eval-
uation 7, 175-194 (1987)
Meyer, J.F.: On Evaluating the Performability of Degradable Computing Systems.
IEEE Transactions on Computers C-29, 720-731 (1980)
Meyer, J.F.: Closed-Form Solutions of Performability. IEEE Transactions on Com-
puters C-31, 648-657 (1982)
Miranker, W.: Numerical Methods for Stiff Equations and Singular Perturbation
Problems. Dordrecht: D. Reidel 1981
Moler, C., Van Loan, C.F.: Nineteen Dubious Ways to Compute the Exponential of
a Matrix. SIAM Review 20, 801-835 (1978)
Muppala, J.K. et al.: Dependability Modeling of a Heterogeneous VAX Cluster
System Using Stochastic Reward Nets. In: Avresky, D.R. (ed.) : Hardware and
Software Fault Tolerance in Parallel Computing Systems. Ellis Horwood Ltd.
1992, pp. 33-59
Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Englewood Cliffs:
Prentice-Hall 1981
Qureshi, M., Sanders, W.: Reward Model Solution Methods with Impulse and Rate
Rewards: An Algorithm and Numerical Results. Performance Evaluation 20,
413-436 (1994)
Ramesh, A.V., Trivedi, K.: Semi-Numerical Transient Analysis of Markov Models.
In: Proceedings of the 33rd ACM Southeast Conference (1995), pp. 13-23
Reibman, A. et al.: Markov and Markov Reward Model Transient Analysis: An
Overview of Numerical Approaches. European Journal of Operations Research
40, 257-267 (1989)
Reibman, A.L., Trivedi, K.S.: Numerical Transient Analysis of Markov Models.
Computers and Operations Research 15, 19-36 (1988)
Reibman, A.L., Trivedi, K.S.: Transient Analysis of Cumulative Measures of Markov
Model Behavior. Stochastic Models 5, 683-710 (1989)
Sahner, R.A. et al.: Performance and Reliability Analysis of Computer Systems:
An Example-Based Approach Using the SHARPE Software Package. Boston:
Kluwer 1995
Sanders, W.H., Meyer, J.F.: METASAN: A Performability Evaluation Tool Based
on Stochastic Activity Networks. In: Proceedings of the ACM-IEEE Computer
Society Fall Joint Computer Conference. Los Alamitos: IEEE Computer Society
Press 1986, pp. 807-816
Smith, R.M. et al.: Performability Analysis: Measures, an Algorithm, and a Case
Study. IEEE Transactions on Computers C-37, 406-417 (1988)
Summary. This paper deals with fast simulation techniques for estimating tran-
sient measures in highly dependable systems. The systems we consider consist of
components with generally distributed lifetimes and repair times, with complex
interaction among components. As is well known, standard simulation of highly
dependable systems is very inefficient and importance sampling is widely used to
improve efficiency. We present two new techniques, one of which is based on the
uniformization approach to simulation, and the other is a natural extension of the
uniformization approach which we call exponential transformation. We show that
under certain assumptions, these techniques have the bounded relative error prop-
erty, i.e., the relative error of the simulation estimate remains bounded as compo-
nents become more and more reliable, unlike standard simulation in which it tends
to infinity. This implies that only a fixed number of observations are required to
achieve a given relative error, no matter how rare the failure events are.
1. Introduction
Repairable systems with general repair and failure distributions are inher-
ently difficult to handle analytically or numerically, mainly because they do
not fall into the Markov, or semi-Markov, chain framework. HARP (Dugan
et al. 1986) and CARE (Stiffler and Bryant 1982) deal with methods to com-
pute dependability measures in large, but mostly non-repairable, Markovian
and non-Markovian systems. Analytical methods and numerical algorithms
for computing dependability measures of general non-Markovian repairable
systems are virtually non-existent.
An alternative approach is to use Monte Carlo simulation. Standard
Monte Carlo simulation is inefficient for highly dependable systems due to
the rarity of system failure events (Geist and Trivedi 1983). This results in
very long simulation run lengths to achieve a reasonable degree of accuracy.
One technique that is widely used to speed up simulations in highly depend-
able systems is importance sampling. In importance sampling we change the
This paper was originally published in ACM Transactions on Modeling and
Computer Simulation 4, 137-164 (1994). © 1994, Association for Computing
Machinery, Inc. (ACM). Reprinted with permission.
488 Philip Heidelberger et al.
probabilistic dynamics of the system for simulation purposes. The new prob-
ability measure induces system failures to occur more frequently. Then we
make adjustments to the sample outputs to obtain an unbiased estimator.
The main problem in applying importance sampling to stochastic systems is
the design and implementation of specific importance sampling distributions
in order to obtain significant variance reductions, which imply significant
speed-ups of the simulation.
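The mechanics of the change of measure can be shown with a minimal toy example (our own, not one of the estimators analyzed in this paper): estimating the probability that an Exp(lam) failure time with lam = 1e-5 falls in [0, 1]. Sampling from a much faster Exp(1) and weighting each observation by the likelihood ratio keeps the estimator unbiased while making the "rare" event common:

```python
import math
import random

random.seed(7)
lam, t = 1e-5, 1.0            # rare event: failure before t has p ~ 1e-5
p_exact = 1.0 - math.exp(-lam * t)

def is_estimate(n, lam_is=1.0):
    """Importance sampling: draw failure times from the faster Exp(lam_is)
    and weight each observation by the likelihood ratio f(x)/g(x)."""
    total = 0.0
    for _ in range(n):
        x = random.expovariate(lam_is)
        if x <= t:                                    # failure observed
            w = (lam * math.exp(-lam * x)) / (lam_is * math.exp(-lam_is * x))
            total += w
    return total / n                                  # unbiased for p_exact

est = is_estimate(50_000)
print(est, p_exact)           # close, with only 50 000 samples
```

Here the likelihood ratio stays within a bounded multiple of p_exact on the event of interest, which is the mechanism behind the bounded relative error property discussed in this paper.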
Importance sampling, combined with the theory of large deviations, has
also proven effective in estimating buffer overflow probabilities in queueing
networks (see, e.g., Parekh and Walrand 1989, Frater et al. 1991 and Sad-
owsky 1991). An approach, other than importance sampling, for variance
reduction when estimating long-run averages affected by recoveries from rare
failure events is reported in Moorsel et al. (1991). A survey on using impor-
tance sampling to estimate rare event probabilities in queueing and reliability
models is given in Heidelberger (1995), and a survey on fast simulation of rare
events in reliability models is given in Nicola et al. (1993).
A considerable amount of work has been done in using importance sam-
pling for the fast simulation of highly dependable systems that consist of
highly reliable components with exponentially distributed failure and repair
times. In this case, the system is modeled as a continuous time Markov chain
(CTMC) with transitions of two types - component failures and component
repairs. Certain combinations of failed components cause the system to fail.
Typically, in the embedded Markov chain, component failure transitions hap-
pen with a much lower probability as compared to the component repair
transitions. The new importance sampling distribution is chosen in such a
way that component failure transitions occur with a much higher probability
than in the original system. This is called failure biasing and was introduced
in Lewis and Bohm (1984) in the context of reliability estimation. In Goyal
et al. (1992), it was further adapted to the estimation of steady state unavail-
ability, mean time to failure, and expected interval availability. Modifications
to the failure biasing heuristic were introduced in Shahabuddin (1990), Goyal
et al. (1992) and Shahabuddin (1994a) (balanced failure biasing), Carrasco
(1991a) (failure distance-based failure biasing) and Juneja and Shahabud-
din (1992) (failure biasing for Markovian systems with more general repair
policies).
In the estimation of transient measures in Markovian systems, besides
increasing the component failure transition probabilities of the embedded
Markov chain, we also have to increase the rates of transition in certain states
of the CTMC (that have very low transition rates), so that a sufficient number
of transitions happen in the given time horizon. For example, a technique
called forcing (Lewis and Bohm 1984, Goyal et al. 1992) causes the first
component failure time to occur within the time horizon, thus increasing the
probability of a system failure occurring during that time. Failure biasing,
in conjunction with forcing gives good results for time horizons that are
Bounded Relative Error in Estimating Transient Measures 489
small. However, these techniques fail to work for larger time horizons. For
such cases, a method based on estimating Laplace transform functions is
studied in Carrasco (1991b) and another one based on estimating bounds to
the transient measure (rather than estimating the actual measure) is studied
in Shahabuddin (1994b) and Shahabuddin and Nakayama (1993).
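Forcing can be sketched as conditional sampling of the first failure time. The snippet below is an illustration of ours under the assumption of an exponentially distributed first failure time (not the cited papers' implementation): it samples T given {T ≤ t} by inverting the truncated CDF, and the likelihood-ratio weight is just P(T ≤ t).

```python
import math, random

def forced_failure_time(lam, t, u):
    """Sample T ~ Exp(lam) conditioned on {T <= t} by inverting the
    truncated CDF; returns the sample and the likelihood-ratio weight,
    which equals P(T <= t) under the original distribution."""
    p_hit = 1.0 - math.exp(-lam * t)          # P(T <= t): the weight
    T = -math.log(1.0 - u * p_hit) / lam      # inverse of the conditional CDF
    return T, p_hit

# A highly reliable component (rate 1e-5) is forced to fail within t = 100.
rng = random.Random(42)
lam, t = 1e-5, 100.0
samples = [forced_failure_time(lam, t, rng.random()) for _ in range(1000)]
```

Every forced sample lands inside the horizon, while the small weight P(T ≤ t) keeps the overall estimator unbiased.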
Importance sampling has also been used for the fast simulation of highly
dependable systems with general component failure and repair distributions,
where the components are highly reliable. In Nicola et al. (1990), ideas for
accelerating component failure events using importance sampling, have been
combined with a clock rescheduling approach to devise a technique for fast
simulation. Analogous to the Markovian case, for transient measures, the fail-
ure acceleration combines two approaches: forcing and failure biasing. The
technique seems to work well in practice and gives orders of magnitude of vari-
ance reduction. Another importance sampling approach, using different forms
of forcing and failure biasing, to estimate unreliability in semi-Markov mod-
els of highly reliable systems is described in Geist and Smotherman (1989).
Their approach also extends to certain models with global time dependency.
Theoretical work in the area of importance sampling for highly depend-
able systems was started in Shahabuddin (1994a). In that paper, a large class
of highly dependable Markovian systems (which includes systems of the type
in Goyal and Lavenberg 1987) was modeled, and it was shown that for the
case of estimating steady state measures, the modification of the failure bi-
asing technique called balanced failure biasing has a desirable property of
bounded relative error. This implies that the simulation run-length for a de-
sired relative error remains bounded as component failure rates tend to zero.
This is in contrast to naive simulation in which the simulation run length
for a desired relative error tends to infinity as component failure rates tend
to zero. These bounded relative error results were extended to gradient esti-
mation (using balanced failure biasing) in Markovian systems in Nakayama
(1991) and to estimation of transient measures (using balanced failure biasing
and forcing) in Markovian systems in Shahabuddin (1994b) and Shahabud-
din and Nakayama (1993). Additional results on failure biasing for Markovian
systems are given in Nakayama (1993, 1994). However, until now, no tech-
nique has been proved to have the bounded relative error property for the
case of non-Markovian systems.
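The contrast between bounded and unbounded relative error can be made concrete with the Bernoulli variance formula. The helper below is a back-of-the-envelope illustration of ours, not from the paper: it computes the naive-simulation run length needed for a target relative error.

```python
def replications_needed(gamma, rel_err):
    """Replications of naive simulation needed so that the standard error of
    the estimator of gamma equals rel_err * gamma.  The per-sample variance
    of the Bernoulli indicator is gamma * (1 - gamma)."""
    return (1.0 - gamma) / (gamma * rel_err ** 2)

# As the rare-event probability shrinks by a factor of 100, the naive run
# length grows by roughly the same factor: the unbounded-relative-error
# behaviour described above.
n_small = replications_needed(1e-6, 0.10)
n_tiny = replications_needed(1e-8, 0.10)
```

Under bounded relative error, by contrast, the required run length stays bounded as the event probability tends to zero.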
In this paper, we describe two different approaches to applying impor-
tance sampling for estimating system unreliability in non-Markovian sys-
tems. Then for a large class of highly dependable systems, we prove that
the two techniques have the property of bounded relative error. They also
seem to be easier to implement as compared to the clock rescheduling ap-
proach as they avoid rescheduling failure events and use only the exponential
distribution for failure event generation. The first approach is based on uni-
formization (Jensen 1953, Lewis and Shedler 1979, Shanthikumar 1986) and
the second uses a technique which we call exponential transformation.
490 Philip Heidelberger et al.
The models that will concern us are essentially those that can be con-
structed using the SAVE (System Availability Estimator) modeling language
(see Goyal and Lavenberg 1987), except that general failure time and repair
time distributions will be allowed. However, in this paper we will consider
models that can be constructed using only a subset of the SAVE modeling
language. More specifically, we will consider models in which components
can be in one of two states: operational and failed. The SAVE modeling lan-
guage permits components to be in two additional states: spare and dormant.
(In SAVE, a component becomes dormant if its operation depends upon the
operation of some other component and that other component fails. For ex-
ample, a processor may not be operational unless its power supply is also
operational, and if the power supply fails, the processor is then considered
dormant. Different failure rates may be specified for the operational, spare
and dormant states.) While the use of these additional states can be handled
within our framework, the notation becomes more complex and so will not
be considered in this paper.
We assume that there are N components which can fail and be repaired.
Let Gi(x) denote the failure time distribution of component i, and let hi(x) be the
hazard rate (see Barlow and Proschan 1981) associated with this distribution:
hi(x) = gi(x)/Ḡi(x), where gi(x) is the probability density function of Gi(x)
and Ḡi(x) = 1 - Gi(x). We will assume that gi(x) > 0 for all x > 0. A
component can fail in several failure modes, each mode occurring with a
certain probability. Let Pij be the probability of component i failing in mode
j, given that it fails. When component i fails in mode j, with probability Pijk
it can instantaneously "affect" a subset Sijk of other components, causing
them to fail as well. This is called failure propagation. A component may
have different repair time distributions in different failure modes. However,
for the sake of notational simplicity, we will assume that all modes have the
same repair time distribution. Let ri(x) denote the hazard rate associated
with the repair time distribution of the ith component. There is a set of
repairmen who repair failed components according to some fairly arbitrary
priority mechanism. For the purposes of this paper, details of the repair
processes are not crucial, and so will not be described in detail. However, we
allow general repair distributions and use of the SAVE "repair depends upon"
construct which permits modeling situations in which a component cannot
be repaired unless some other specified set of components is operational. We
do assume that no repairs are instantaneous. More specific conditions will be
given in Sections 3 and 4.
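As a numerical reminder of the hazard rate definition used above, here is a minimal sketch of ours (the Weibull family is just a convenient example; it is not one of the paper's test distributions).

```python
import math

def weibull_pdf(x, k, lam):
    """Density of a Weibull distribution with shape k and scale lam."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_sf(x, k, lam):
    """Survival function G-bar(x) = 1 - G(x) of the same Weibull."""
    return math.exp(-((x / lam) ** k))

def hazard(pdf, sf, x):
    """h(x) = g(x) / (1 - G(x)), as in the definition above."""
    return pdf(x) / sf(x)

# For a Weibull the hazard has the closed form (k/lam) * (x/lam)^(k-1);
# with k = 2, lam = 10, x = 3 this is 0.2 * 0.3 = 0.06.
h = hazard(lambda x: weibull_pdf(x, 2.0, 10.0),
           lambda x: weibull_sf(x, 2.0, 10.0), 3.0)
```

Setting k = 1 recovers the exponential distribution, whose hazard is the constant 1/lam, matching the memoryless special case discussed for Markovian models.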
Another assumption (property) is that the system is composed of highly
reliable components, so that the component failure rates are much smaller
than the repair rates. To make this precise, we assume that the component
mean repair times are of order one, and there exists a small (but positive)
parameter f such that
(2.1)
for all x ~ 0, where the .Ai's and bi'S are positive constants with bi ~ 1. We
also assume that the ri(x)'s are constants, i.e., independent of ε. Finally, we
assume that the failure mode probabilities (the pij's) and the failure propaga-
tion probabilities (the pijk's) are also constants, though this assumption is not
essential. Inequality 2.1, which bounds the failure rates in terms of ε, is the
natural generalization of the assumption in Shahabuddin (1994a) that, with
exponential distributions, the component failure rates are given by λi ε^bi. We
will consider the limiting behavior of the unreliability estimates as ε → 0,
i.e., as components become more reliable. In Section 3, we will consider the
case where

    hi(x) ≥ λ̲i ε^bi                                                     (2.2)

for all x ≥ 0, where the λ̲i's are positive constants. In Sections 4 and 5 we will
remove that assumption. The bounded hazard rate implicit in Inequality 2.1
    γ(ε, t) = PG(ε)(TF ≤ t),                                             (2.3)
where t is the time horizon and the subscript G(ε) denotes a system in which
the distributions of the component failure times are given by hazard rate
functions satisfying Inequality 2.1. For small ε and fixed t, γ(ε, t) ≈ 0, i.e.,
the event {TF ≤ t} is a rare event. In fact, we show in this paper that γ(ε, t)
is Θ(ε^r) for some r > 0 (a function f(ε) is Θ(ε^r) if there exist two constants
K1 and K2 such that K1 ε^r ≤ f(ε) ≤ K2 ε^r for all sufficiently small ε > 0),
and hence γ(ε, t) → 0 as ε → 0. Now consider the problem of estimating
γ(ε, t) = EG(ε)[I(TF ≤ t)], where I(.) is the indicator function. In standard
(naive) simulation we generate n independent replications from time 0 to time
min(TF, t) to obtain samples of I(TF ≤ t), say I1, I2, ..., In. Then Σ_{i=1}^{n} Ii/n
is an unbiased estimator of γ(ε, t). The variance of this estimator is given
by σ²G(ε)(I(TF ≤ t))/n. Note that σ²G(ε)(I(TF ≤ t)) = γ(ε, t) - γ²(ε, t) is
also Θ(ε^r). Thus, for a fixed n, the relative error (which is proportional to
σG(ε)(I(TF ≤ t))/(√n γ(ε, t))) goes to ∞ as ε → 0. This is the main problem
in standard simulation of highly dependable systems. Importance sampling
is a well known technique to overcome this inherent difficulty. We illustrate
its basic idea by means of a simple example. (For a detailed discussion of the
concept see, for example, Hammersley and Handscomb 1964 and Glynn and
Iglehart 1989.) Let f(.) be a probability density function (pdf) on the real
line and let A be a set on the real line which is rare with respect to f(.).
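A self-contained instance of this idea (our own example, not from the text): to estimate a Gaussian tail probability P(X > a), sample from a density f'(.) shifted toward the rare set and weight each sample that lands in the set by the likelihood ratio f(x)/f'(x).

```python
import math, random

def normal_pdf(x, mu=0.0):
    """Standard-width normal density centered at mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def is_estimate(a, n, seed=0):
    """Estimate P(X > a) for X ~ N(0,1) by sampling from the tilted
    density N(a,1) and weighting hits by f(x)/f'(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(a, 1.0)                            # tilted sample
        if x > a:                                        # indicator of A
            total += normal_pdf(x) / normal_pdf(x, mu=a) # likelihood ratio
    return total / n

# P(X > 4) is about 3.17e-5; naive simulation would need millions of
# replications, while the shifted sampler hits the set half the time.
est = is_estimate(4.0, 20_000)
```

The change of measure concentrates samples on the rare set; the likelihood-ratio weights restore unbiasedness, exactly the mechanism exploited by failure biasing and forcing above.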
- Pseudo events: Let NP(t) denote the total number of pseudo events in (0, t)
and let Pj be the time of the j-th pseudo event.
In a uniformization-based simulation, events are obtained by "thinning"
the Poisson process {Nβ(t)} as follows. Suppose an event of {Nβ(s)} occurs
at time S. Then that event is a

    component i failure  w.p. λi(S)/β,
    component i repair   w.p. μi(S)/β,                                   (3.3)
    pseudo event         w.p. 1 - e(S)/β.

Notice that Nβ(t) = NF(t) + NR(t) + NP(t) and that if the upper bound of
Inequality 2.1 is satisfied for all components, then the probability of a failure
event is very low.
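The thinning step of equation (3.3) can be sketched as follows. This is a minimal illustration of ours with made-up constant rates; in the actual models the rates λi(S) and μi(S) depend on the system state at time S.

```python
import random

def uniformized_step(rates_fail, rates_repair, beta, rng):
    """Classify one event of the rate-beta Poisson process by thinning:
    a component-i failure w.p. lambda_i/beta, a component-i repair
    w.p. mu_i/beta, and otherwise a pseudo event."""
    u = rng.random() * beta
    for i, lam in enumerate(rates_fail):
        if u < lam:
            return ("fail", i)
        u -= lam
    for i, mu in enumerate(rates_repair):
        if u < mu:
            return ("repair", i)
        u -= mu
    return ("pseudo", None)

# Hypothetical snapshot: component 0 is up with a tiny failure rate,
# component 1 is down and under repair at rate 1; beta bounds the total rate.
rng = random.Random(7)
lam, mu, beta = [1e-4, 0.0], [0.0, 1.0], 2.0
events = [uniformized_step(lam, mu, beta, rng) for _ in range(10_000)]
```

With highly reliable components almost every non-pseudo event is a repair, which is precisely why the failure probabilities must be inflated by importance sampling.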
To implement importance sampling within a uniformization framework
simply involves changing the thinning probabilities in equation (3.3). (We
specifically assume that all failure modes and components affected through
failure propagation are sampled from their given distributions.) This, in turn,
is accomplished by using new failure and repair rates, λ'i(s) and μ'i(s). In the
new system (i.e., the system simulated using importance sampling), the total
failure rate is λ'F(s) = Σi λ'i(s), the total repair rate is μ'R(s) = Σi μ'i(s) and
the total event rate is e'(s) = λ'F(s) + μ'R(s). We assume that e'(s) ≤ β w.p.
one for all s ≤ t, so that β is a valid uniformization rate for both the original
and the new systems (and both processes can be simulated by thinning the
same Poisson process {Nβ(s)}). In the new system, an event from {Nβ(s)}
at time S is a

    component i failure  w.p. λ'i(S)/β,
    component i repair   w.p. μ'i(S)/β,                                  (3.6)
    pseudo event         w.p. 1 - e'(S)/β.
The likelihood ratios associated with the repair events and the pseudo events
are

    Lu(R, ε, t) = Π_{i=1}^{N} Π_{j=1}^{NR(i,t)} μi(Rij)/μ'i(Rij)             (3.7)

    Lu(P, ε, t) = Π_{j=1}^{NP(t)} [β - e(Pj)]/[β - e'(Pj)],                  (3.8)

where Rij denotes the time of the j-th repair event of component i.
    component i failure  w.p. λ'i(S)/β = [e'(S)/β][λ'F(S)/e'(S)][λ'i(S)/λ'F(S)],
    component i repair   w.p. μ'i(S)/β = [e'(S)/β][μ'R(S)/e'(S)][μ'i(S)/μ'R(S)],
    pseudo event         w.p. 1 - e'(S)/β.                               (3.10)
According to equation (3.10), we can view the selection of the event
as occurring in multiple steps. For example, to get a component i fail-
ure, we first must have a "real" event (i.e., failure or repair) which occurs
w.p. e'(S)/β. Then, the event must be a failure event, which occurs w.p.
p'F(S) ≡ λ'F(S)/e'(S), and finally, the event must be a type i failure, which
occurs w.p. f'i(S) ≡ λ'i(S)/λ'F(S).
In balanced failure biasing, we make the probability of a failure event
constant, say Pf, whenever repairs are ongoing. Thus, in uniformization, given
that an event is real (and there are ongoing repairs), we fix p'F(S) = Pf.
Next, in balanced failure biasing, given that an event is a failure, we choose
the failing component uniformly from among the operational components. In
uniformization, this simply corresponds to setting f'i(S) = 1/|O(S-)| (the
number of operational components just before time S).
In balanced failure biasing, if an event is a repair, the relative probabilities
of selecting which component gets repaired are unchanged. Thus, in
uniformization, we set μ'i(S)/μ'R(S) = μi(S)/μR(S).
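The resulting thinning probabilities can be sketched as follows. This is our own simplification (not the paper's code): repairs are lumped into one aggregate mass rather than split per component, and p_real stands for e'(S)/β, the probability that the uniformized event is "real".

```python
def balanced_thinning_probs(operational, repairs_ongoing, p_fail, p_real):
    """Thinning probabilities for one uniformized event under balanced
    failure biasing.  Given a real event (prob. p_real) with repairs
    ongoing, it is a failure w.p. p_fail, and the failing component is
    chosen uniformly from the operational set; when no repairs are
    pending, the only real events are failures."""
    probs = {}
    pf = p_fail if repairs_ongoing else 1.0
    for i in operational:
        probs[("fail", i)] = p_real * pf / len(operational)
    # Repair events keep their original relative probabilities; here we
    # only expose their total mass.
    probs[("repair", "*")] = p_real * (1.0 - pf) if repairs_ongoing else 0.0
    probs[("pseudo", None)] = 1.0 - p_real
    return probs

# Three operational components, repairs ongoing, Pf = 0.5, e'(S)/beta = 0.8.
p = balanced_thinning_probs(operational=[0, 1, 2], repairs_ongoing=True,
                            p_fail=0.5, p_real=0.8)
```

Each operational component gets an equal share of the (inflated) failure mass, which is the "balanced" part of balanced failure biasing.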
Theorem 3.1. Suppose there exist positive finite constants λ̲i, λ̄i, μ̲ and μ̄
such that λ̲i ε^bi ≤ hi(x) ≤ λ̄i ε^bi and μ̲ ≤ ri(x) ≤ μ̄ for all i and 0 ≤ x ≤ t.
Then there exist positive finite constants r, a(t) and b(t) such that, as ε → 0,

    a(t) ε^r ≤ γ(ε, t) ≤ b(t) ε^r.                                       (3.13)
(3.14)
where αm and αj are the products of the failure mode and failure propaga-
tion probabilities. The explanation for each term in equation (3.14) is quite
evident. For example, if a Poisson event occurs at any time s in (0, t), then
the probability that it is a type i failure is λi(s)/β ≥ λ̲i ε^bi/β. This proves
the lower bound for γ(ε, t).
The upper bound will be shown by deriving an upper bound on the likeli-
hood ratio Lu(ε, t) when the system is sampled using an importance sampling
distribution satisfying certain properties. Specifically, we assume that condi-
tion (3.9) is satisfied and that there exist positive finite constants λ̲', λ̄', μ̲', μ̄'
and β' such that

    λ̲' ≤ λ'i(s) ≤ λ̄'  whenever λi(s) > 0,
    μ̲' ≤ μ'i(s) ≤ μ̄'  whenever μi(s) > 0,                               (3.15)
    β' ≤ β - e'(s)  whenever β - e(s) > 0.
We assume that sampling is stopped at time τ = min(t, TF). Now for any
sample path such that TF ≤ t, Σ_{i=1}^{N} NF(i, t) bi ≥ r, and therefore the failure
event likelihood ratio, Lu(F, ε, τ), satisfies
    Lu(F, ε, τ) ≤ Π_{i=1}^{N} Π_{j=1}^{NF(i,τ)} (λ̄i ε^bi / λ̲')
               ≤ cF^{NF(τ)} ε^{Σ_{i=1}^{N} NF(i,τ) bi} ≤ cF^{NF(τ)} ε^r       (3.16)
    Lu(ε, τ) ≤ cR^{NR(τ)} cP^{NP(τ)} cF^{NF(τ)} ε^r ≤ c1^{Nβ(t)} ε^r         (3.17)
Theorem 3.2. Suppose hi(x) ≤ λ̄i ε^bi and ri(x) ≤ μ̄ for all i and 0 ≤ x ≤ t, and
e(s) ≤ β w.p. one for all 0 ≤ s ≤ t. If the importance sampling distribution
satisfies e'(s) ≤ β w.p. one for all 0 ≤ s ≤ t, and equations (3.9) and (3.15),
then there exists a positive finite constant c(t) such that, as ε → 0,

    EU[γ(U, ε, t)²] ≤ c(t) ε^{2r}.                                       (3.19)

If, in addition, hi(x) ≥ λ̲i ε^bi and ri(x) ≥ μ̲ for all i and 0 ≤ x ≤ t, then
Since repairs are sampled from their given distributions, the likelihood ratio
does not contain any repair event terms. Similar to Section 3, the likelihood
ratio takes on a simple form:
    L'U(F, ε, t) = Π_{i=1}^{N} Π_{j=1}^{NF(i,t)} λi(Tij)/λ'i(Tij)            (4.4)

(4.5)
etc. This complication arises from the possibility of having repair events in
the minimum distance failure path. We will describe some fairly general,
albeit somewhat indirect, conditions for which the lower bound is true, and
then give specific examples of repair queueing disciplines and repair service
distributions that satisfy these conditions. In order to do so, we need to
introduce some new notation. A sample path consists of an ordered sequence
of events (failures and repairs) and the times of those events. Let Ei denote
the type of the i-th event, i.e., Ei = fkj if the event is a component k failure
in failure mode j and Ei = rk if the event is a repair of component k. Note
that E1 is always a component failure event. (We could allow simultaneous
repair of components, but will not consider that here since it complicates
the notation. Also, for simplicity, we will assume that the failure modes
completely specify which components are failed through failure propagation
on each failure.) Let Ti denote the time of the i-th event (failure or repair).
As in the proof of Theorem 3.1, define r to be the minimum distance over
all possible sample paths in the set {TF < t}. (Note that the minimum
distance r is actually a function of t, the repair disciplines, and the repair
time distributions. However, we will assume that these factors are fixed and
suppress the dependence of r on them in our notation.) The sequence of
events till system failure, in any sample path with the minimum distance r,
will be called a most-likely event sequence. (Note that, in any system, there
are only a countable number of most-likely event sequences, but there
are an uncountably infinite number of sample paths corresponding to any
given most-likely event sequence.)
Assumption A: There exists a most-likely event sequence P = (e1, e2, ..., en),
constants 0 = t0 < t1 < ... < tn < t and a constant δ > 0, with the following
property: let

    Pk = {tj-1 < Tj < tj, Ej = ej for 1 ≤ j ≤ k, Tk+1 > tk}              (4.6)

for 1 ≤ k ≤ n (P0 ≡ Ω) and let Rk (Fk) be the set of repair (respectively,
failure) events in (tk-1, tk) for 1 ≤ k ≤ n. Assume that, for all ε small enough
and 1 ≤ k ≤ n,

    P(Rk = {ek} | Pk-1, Fk = ∅) ≥ δ   if ek is a repair event,           (4.7)
    P(Rk = ∅ | Pk-1, Fk = {ek}) ≥ δ   if ek is a failure event.          (4.8)
Assumption A basically states that the events of P occur in the correct
sequence with positive probability, given that the preceding failure and re-
pair events (in P) occur within certain time intervals. More specifically, the
assumptions imply that the interval [0, t) can be broken up into subintervals.
Equation (4.7) implies that if the k-th event is supposed to be a repair, then
there exists an interval such that a repair occurs in that interval with positive
probability. Similarly, equation (4.8) states that, if the k-th interval is sup-
posed to contain a failure event, then no repair events occur in that interval
with positive probability.
Before proving the bounded relative error property, we will verify that
these conditions hold for several cases of interest. Let Ri denote a random
variable whose distribution is the repair time distribution of the ith compo-
nent.
Example 1: Consider systems with an arbitrary number of repairmen that
repair components with any non-preemptive priority repair discipline (with
any non-preemptive repair discipline - like FCFS, non-preemptive last come
first served (LCFS), etc., - used between members of the same priority class).
Assume that at least one most-likely event sequence does not contain any
repair completion events. This condition is always true in systems that do
not have failure propagation. In such systems none of the most-likely event
sequences include repair completion events. Repairs are assumed to be non-
instantaneous, i.e., P(Ri > 0) = 1 for all i. Hence there exists a constant
t̃0 > 0 such that P(Ri > t̃0) > 0 for all i. Let δmin = min{P(Ri > t̃0) : 1 ≤
i ≤ N} and let t0 = min{t̃0, t/2}. Clearly P(Ri > t0) ≥ δmin for all i. Let us see
why systems of this type satisfy Assumption A.
We will show that Assumption A holds if we choose P as a most likely
event sequence with no repair completion events, ti = i t0/n for 1 ≤ i ≤ n
and δ = (δmin)^n. To see this, note that since we only have failure events
in the most likely event sequence, we only have to check equation (4.8) for
1 ≤ k ≤ n. The failure of the ith component in the most likely event sequence
(at time Ti) may begin a repair process if a repairman (that repairs this
component) is free. If it does begin a repair process, then, since
P(Ri > t0) ≥ δmin, the probability that this repair process finishes after
(absolute) time t0 is at least δmin. Hence the probability that all of the
repair processes started before t0 (i.e., that may have been started at the
times of the failure events in the most likely event sequence) finish after t0
is at least δ = (δmin)^n. This in turn implies the conditions of equation
(4.8). □
Example 2: Consider systems with a single repairman, with any non-
preemptive priority repair discipline (with any non-preemptive repair dis-
cipline - like FCFS, non-preemptive LCFS, etc. - used between members
of the same priority class), in which the most likely event sequences may
contain repair completion events. Again, assume that the repairs are non-
instantaneous. Let us see now why systems of this type satisfy assumption
A.
First consider the case where a most likely event sequence has two repair
completions, with ml > 0 failure events before the 1st repair completion, m2
failure events between the 1st and 2nd repair completions, and m3 > 0 failure
events after the 2nd repair completion. First we will assume that the repair
completions are non-consecutive (i.e., m2 > 0) and then show how to extend
it to the consecutive case. Without loss of generality, assume that the first
three components that start repair in this most likely path are Component 1,
Component 2 and Component 3, respectively. Since completion of the repairs
of Component 1 and Component 2 (in the most likely event sequence) occurs
before t, P(R1 + R2 < t) > 0. Hence there exist positive constants s1 and
s2, with s1 + s2 < t, such that for all Δ > 0, P(s1 - Δ < R1 < s1 + Δ) > 0
and P(s2 - Δ < R2 < s2 + Δ) > 0. Then the ti's are chosen as follows.
The interval corresponding to the first failure event is chosen small enough
so that if the repair times are near s1 and s2 then the second repair completes
before time t. The repair times are confined sufficiently close to the respective
si's (i.e., the Δ in (si - Δ < Ri < si + Δ) is small enough) so that 1) the
interval corresponding to the first failure does not overlap with the interval
corresponding to the 1st repair completion, 2) the intervals corresponding
to the repair completions do not overlap and 3) with positive probability,
the third repair does not complete within the interval corresponding to the
second repair completion, i.e., if s3 > 0 is such that P(R3 > s3) > 0, then it
is enough that the width of the interval corresponding to the second repair
completion be chosen smaller than s3. We choose the width for the 1st failure
interval and the Δ corresponding to R1 and R2 to be the same; call it Δ0.
We make sure that Δ0 is small enough so that all the above criteria are
satisfied. More formally, let
Next, choose t_{m1+1} = s1 + 2Δ0, as with probability at least δ the first repair
completes in [s1 - Δ0, s1 + 2Δ0]. The second repair starts as soon as the
first repair completes. By equation (4.10) (for j = 2) and equation (4.12),
with probability at least δ, the second repair does not complete in the interval
[s1 + 2Δ0, s1 + s2 - 2Δ0] (note that by equation (4.9), s1 + s2 - 2Δ0 > s1 + 2Δ0).
Hence choose t_{m1+1+m2} = s1 + s2 - 2Δ0, and the intermediate ti's evenly
between t_{m1+1} and t_{m1+1+m2}, i.e.,

    ti = t_{m1+1} + [(i - (m1 + 1))/m2] (t_{m1+1+m2} - t_{m1+1})

for m1 + 1 < i < m1 + 1 + m2.
Since with probability at least δ the second repair completes in the interval
[s1 + s2 - 2Δ0, s1 + s2 + 3Δ0], choose t_{m1+1+m2+1} = s1 + s2 + 3Δ0. The third
repair starts when the second repair completes. By equation (4.9), equation
(4.11) and equation (4.12),

    P(R3 > 6Δ0) ≥ δ,                                                     (4.13)

i.e., with probability at least δ, the third repair does not complete in the
interval [s1 + s2 - 2Δ0, s1 + s2 + 4Δ0]. Hence choose t_{m1+1+m2+1+m3} =
s1 + s2 + 4Δ0, and the remaining ti's evenly between t_{m1+1+m2+1} and
t_{m1+1+m2+1+m3}, i.e.,

    ti = t_{m1+1+m2+1} + [(i - (m1 + 1 + m2 + 1))/m3] (t_{m1+1+m2+1+m3} - t_{m1+1+m2+1})

for m1 + 1 + m2 + 1 < i < m1 + 1 + m2 + 1 + m3. Note that by equation
(4.9), t_{m1+1+m2+1+m3} < t.
For the case where m2 = 0, we extend the interval corresponding to the
first repair completion from [s1 - Δ0, s1 + 2Δ0] to [s1 - Δ0, s1 + s2 - 2Δ0]
(note that s1 + s2 - 2Δ0 is the beginning of the second repair interval). The
other intervals remain unchanged.
This argument can easily be extended to cases where the most likely
path contains more than two repair completions. Let us say that there are l
repair completions, with m1, m2, ..., ml denoting the respective numbers of
intermediate failure events and ml+1 denoting the number of failure events af-
ter the last repair completion. We will assume that the repair completions are
non-consecutive, though (as in the two repair completion case) our arguments
can easily be extended to the consecutive case. Define s1, s2, ..., sl, sl+1 and
δ1, δ2, ..., δl, δl+1 analogous to the two repair completion case and choose
(4.14)
Let

    Δ0 = min{s1/3, s2/5, ..., sl/(2l+1), sl+1/(2l+2), (t - Σ_{i=1}^{l} si)/(l+3)}.   (4.15)
Choose t1 = Δ0. Then t_{Σ_{k=1}^{j} mk + j - 1} (the start of the interval correspond-
ing to the j-th repair completion) may be chosen as Σ_{k=1}^{j} sk - jΔ0 and
t_{Σ_{k=1}^{j} mk + j} may be chosen as Σ_{k=1}^{j} sk + (j + 1)Δ0. The intervals
corresponding to the intermediate failure events may be chosen to be evenly
distributed between the above intervals. Finally, choose t_{Σ_{k=1}^{l+1} mk + l}
as t_{Σ_{k=1}^{l} mk + l} + Δ0, and choose the intervals corresponding to the
remaining failure events evenly distributed between t_{Σ_{k=1}^{l} mk + l} and
t_{Σ_{k=1}^{l+1} mk + l}. □
It is possible to verify that other situations also satisfy these assumptions,
although it is difficult to state simple, direct conditions on the underlying
repair disciplines and distributions for which Assumption A is valid.
Theorem 4.1. Suppose there exist positive finite constants λ̲i and λ̄i such
that λ̲i ε^bi ≤ hi(x) ≤ λ̄i ε^bi for all i and 0 ≤ x ≤ t and that Assumption A
holds. Then there exist positive finite constants r, a(t) and b(t) such that, as
ε → 0,

    a(t) ε^r ≤ γ(ε, t) ≤ b(t) ε^r.                                       (4.16)
Proof. To prove the lower bound, notice that P(TF ≤ t) ≥ P(Pn) =
Π_{k=1}^{n} P(Pk | Pk-1) (with P(P1 | P0) ≡ P(P1)). Assume that the process is simu-
lated using uniformization (at rate β) of failure events as described earlier.
Now consider P(Pk | Pk-1) for 1 ≤ k ≤ n. Note that given Pk-1, the event
Pk implies that there is only one event in the interval (tk-1, tk). Thus, if ek
is a repair event, then
(4.17)
The first term on the right hand side of equation (4.17) is at least δ by
equation (4.7) of Assumption A, while the second term is greater than the
probability that a Poisson process with rate β has no events in the interval
(tk-1, tk). Thus, in this case P(Pk | Pk-1) is greater than some function (of
tk-1, tk, β and δ) that is independent of ε. Similarly, if ek is a failure event,
then
(4.18)
The first term on the right hand side of equation (4.18) is at least δ by
equation (4.8) of Assumption A, while the second term is greater than the
probability that a Poisson process with rate β has exactly one event in the
interval (tk-1, tk) times the probability of accepting event ek as the failure
event. This latter probability is at least λ̲i ε^bi / β if the event is a component
i failure. Thus P(Pn) ≥ ε^r times a function of t (and β, δ, and the failure
mode and failure propagation probabilities) as desired.
The proof of the upper bound is also similar to that in Theorem 3.1. We
assume that

    λ̲' ≤ λ'i(s) ≤ λ̄'  whenever λi(s) > 0,
                                                                         (4.19)
    β' ≤ β - λ'F(s)  whenever β - λF(s) > 0
5. Exponential Transformation
In this section, we describe an alternative importance sampling procedure
that gets around a potential computational inefficiency using uniformization:
the generation of pseudo events. The method, which we call exponential trans-
formation, is based on the following observation. Consider a uniformization-
based simulation using rate β for the Poisson process {Nβ(s)}. Suppose each
event in {Nβ(s)} is accepted as a failure event with fixed probability p, i.e.,
λ'F(s)/β = p for all s. Then the time between accepted failure events has an
exponential distribution with rate α = βp. This suggests simply sampling
the time to the next failure event from an exponential distribution with rate
α; this is basically what the exponential transformation method does.
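The observation motivating the method can be checked empirically. The sketch below is a sanity check of ours: thinning a rate-β Poisson process with constant acceptance probability p produces inter-acceptance times that are exponential with rate α = βp, so the accepted times can be sampled directly.

```python
import random

def next_failure_via_thinning(beta, p, rng):
    """Time to the next accepted failure event when each event of a
    rate-beta Poisson process is accepted w.p. p (a geometric number of
    exponential gaps, including the many rejected pseudo events)."""
    t = 0.0
    while True:
        t += rng.expovariate(beta)
        if rng.random() < p:
            return t

def next_failure_direct(beta, p, rng):
    """Exponential transformation: sample the same time in one step as
    an exponential with rate alpha = beta * p."""
    return rng.expovariate(beta * p)

rng = random.Random(3)
beta, p, n = 2.0, 0.1, 20_000
mean_thin = sum(next_failure_via_thinning(beta, p, rng) for _ in range(n)) / n
mean_direct = sum(next_failure_direct(beta, p, rng) for _ in range(n)) / n
```

Both estimates of the mean agree with 1/(βp); the direct sampler simply skips the generation of the rejected pseudo events, which is the computational saving the method is after.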
We first describe the method in more detail and present its likelihood
ratio, and then show that the method possesses the bounded relative error
property.
Let Rn denote the time of the next scheduled repair completion in the
system just after the event that took place at time Tn-1. An exponential
random variable En with some chosen rate αn is sampled. If Tn-1 + En ≤ Rn,
then the next system event is a failure and Tn = Tn-1 + En. In this case,
component i is chosen as the failing component with some chosen probability
qi(n) (provided component i is operational). The likelihood of such a failure
event is qi(n) αn e^{-αn δn}, where δn = Tn - Tn-1. On the other hand, if
Tn-1 + En > Rn, then the next system event is a repair and Tn = Rn. The
likelihood of such a repair event is e^{-αn δn}. Define F(n) = i if component
i fails at time Tn and let γn = qF(n)(n) αn for a failure event and γn = 1
for a repair event. Let N(t) denote the number of events in (0, t). Then the
likelihood associated with sampling the failure times is

    PE(t) = [Π_{n=1}^{N(t)} γn e^{-αn δn}] e^{-α_{N(t)+1}(t - T_{N(t)})}.    (5.1)
The term on the right in equation (5.1) represents the probability that the
last inter-failure time exceeds the remainder of t. However, since sampling
stops at time τ = min(t, TF), this term does not appear in the likelihood
if TF < t; this can be formally accommodated in equation (5.1) by setting
αn = 0 for n ≥ N(TF) + 1.
Let NF(i, t) denote the number of times that component i fails in the
interval (0, t). NF(i, t) counts only the times that component i fails of its own
accord, but not the times that the component fails because it is affected by
some other component. Let Mi(t) denote the number of times that component
i's failure clock is reset but does not expire of its own accord in (0, t). Mi(t)
counts the number of times that component i fails because it is affected by
some other component, plus one if component i is operational at time t. Let
Xij, j = 1, ..., NF(i, t), denote the age of component i when it fails of its own
accord for the j-th time, and let Yij(t), j = 1, ..., Mi(t), denote the age of
component i's clock when it is caused to fail by some other component for
the j-th time, or its age at time t. Then

    PG(t) = Π_{i=1}^{N} [Π_{j=1}^{NF(i,t)} gi(Xij)] [Π_{j=1}^{Mi(t)} Ḡi(Yij(t))]   (5.2)
is the likelihood associated with the failure times of the sample path under the
original failure distributions. Defining LE(ε, t) = PG(t)/PE(t) and γ(E, ε, t) =
LE(ε, τ) 1{TF ≤ t}, we have γ(ε, t) = EE[γ(E, ε, t)], where the subscript E refers
to sampling with exponential distributions as described above.
We will assume that αn and qi(n) are chosen such that they have the
following property: there exist positive finite constants q̲, q̄, α̲ and ᾱ such
that

    q̲ ≤ qi(n) ≤ q̄                                                       (5.3)

whenever component i is operational, and

    α̲ ≤ αn ≤ ᾱ                                                          (5.4)
with probability one. We call this type of importance sampling "generalized
balanced failure biasing with exponential transformation." When qi(n) =
1/|O(Tn-1)|, we call the method "balanced failure biasing with exponential
transformation." As in the uniformization approach, there is considerable
flexibility in how to choose the rates αn. Specific heuristics for doing so are
discussed in Nicola et al. (1992) and Heidelberger (1992), and will also be
described briefly in Section 6.
Proof. The required lower bound on γ(ε, t) is true by Theorem 4.1. Thus
we only need to prove that EE[γ(E, ε, t)²] ≤ f(t) ε^{2r} for some function f(t).
We begin by establishing an upper bound for the numerator, PG(t), of the
likelihood ratio. Notice that gi(Xij) = hi(Xij) Ḡi(Xij) ≤ λ̄i ε^bi. Thus, on
{TF ≤ t},

    PG(t) ≤ Π_{i=1}^{N} λ̄i^{NF(i,t)} ε^{bi NF(i,t)} ≤ λ̄^{NF(t)} ε^r          (5.5)
used for importance sampling even when the failure distributions do not have
bounded hazard rates. However, in this case, the method is not guaranteed
to possess the bounded relative error property.
6. Experimental Results
In this section, we present the results of experiments to test the effectiveness
of the exponential transformation method. Additional experimental results
are presented in Nicola et al. (1992) and Heidelberger et al. (1992) for both the
exponential transformation and uniformization-based importance sampling
approaches.
The test model we consider has two types of components and a single
repairman. There are three components of type one and two components of
type two. The system is considered operational if at least one component
of each type is operational. The repairman fixes components according to a
preemptive priority discipline, with type two components having the highest
priority. Components of type one have a constant repair time with mean
one, and components of type two have repair times uniformly distributed on
(5, 10). The model may include "failure propagation," i.e., the failure of one
component may cause other components to fail
at the same time. Specifically, we assume that with probability a, a failure
of component type two causes two components of type one to also fail (and
with probability 1 - a the component affects no other components). We call
a the components-affected probability, and we consider two cases: a = 0 (no
failure propagation) and a = 0.25. The performance measure of interest is
the probability that the system fails before time t = 100.
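The performance measure above, an unreliability P(T_F ≤ t), is exactly the kind of quantity estimated by replication-based simulation and reported with a relative confidence-interval half-width, as in Table 6.1. A minimal sketch (not the authors' code; the Bernoulli "system fails before t" indicator is stubbed with a known probability, and the function name is illustrative):

```python
# Crude Monte Carlo estimate of p = P(T_F <= t), reported with the
# relative half-width of a 99% normal-approximation confidence interval.
import math
import random

def crude_mc(p_true, n, seed=1):
    rng = random.Random(seed)
    # Stub: each replication is a Bernoulli trial "system failed before t".
    hits = sum(1 for _ in range(n) if rng.random() < p_true)
    p_hat = hits / n
    # 99% confidence interval via the normal approximation (z = 2.576).
    half = 2.576 * math.sqrt(p_hat * (1.0 - p_hat) / n)
    rel_half = half / p_hat if p_hat > 0 else float("inf")
    return p_hat, rel_half

p_hat, rel = crude_mc(0.08, 256_000)  # 256,000 replications, as in the paper
```

For probabilities around 10⁻¹ this gives relative half-widths of a few percent, but for the rare cases (p of order 10⁻⁹ and smaller) crude Monte Carlo is hopeless, which is the motivation for importance sampling.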
The failure distributions are parameterized by ε, which measures the rarity
of component failures. We consider two types of failure distributions: Erlang
with two stages and Hyperexponential with two phases. We let E₂(ε) denote
the Erlang distribution with two stages and failure rate 2ε in each stage. The
mean of this distribution is 1/ε, and the probability of failure within a fixed
time t is O(ε²) for small ε (since two exponentials with rate 2ε need to fail
within time t). The Hyperexponential distribution is denoted by H₂(ε) and has
coefficient of variation equal to two. The parameterization of the Hyperexponential was
chosen so as to equalize P(E₂(ε) ≤ 100) and P(H₂(ε) ≤ 100) for a particular
value of ε (ε = 10⁻⁶, corresponding to configurations 7 and 8 below). Specifically,
with probability 0.7373, H₂(ε) is exponential with rate λ(ε) = 2.66ε,
and with probability 0.2627, H₂(ε) is exponential with rate 12λ(ε).
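The O(ε²) behaviour of the two-stage Erlang failure probability can be checked against its closed-form c.d.f. A small sketch (my own check, not from the paper):

```python
# P(E2(eps) <= t) for the two-stage Erlang with rate 2*eps per stage:
# the closed form is 1 - exp(-2*eps*t) * (1 + 2*eps*t).
import math

def erlang2_cdf(eps, t):
    """P(E2(eps) <= t): Erlang with two stages, each stage rate 2*eps."""
    x = 2.0 * eps * t
    return 1.0 - math.exp(-x) * (1.0 + x)

p6 = erlang2_cdf(1e-6, 100.0)   # ~2e-8
p7 = erlang2_cdf(1e-7, 100.0)   # ~2e-10
ratio = p6 / p7                 # should be close to 100 if P = O(eps^2)
```

Decreasing ε by a factor of 10 decreases the failure probability by roughly a factor of 100, confirming the quadratic scaling for fixed t.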
We consider eight different configurations of this system with two different
values of ε (ε = 0.01 and ε = 0.0001) for each configuration. These
configurations are listed in Table 6.1. The configurations were chosen so that
a diverse range of most-likely failure paths occurs among the configurations.
Consider configuration 1, in which both component types have E₂(ε) distributions
and there is no failure propagation. In this case, P(T_F ≤ t) = O(ε⁴) and
Bounded Relative Error in Estimating Transient Measures 511
the most-likely failure path consists of two failures of component type two.
(These O(εⁿ) estimates should not be taken too literally; they assume that
t is fixed and ε → 0. For example, for ε = 0.01 and t = 100 (= 1/ε), this
assumption is clearly violated.) In configuration 4, component type one has
an E₂(ε) distribution, component type two has an E₂(ε^1.5) distribution, and
a = 0.25. This is an example of an "unbalanced" system (see Goyal et al.
1992), since component type two is much more reliable than component type
one. For configuration 4, P(T_F ≤ t) = O(ε⁵) and the most-likely failure path
consists of one failure of component type one and one affecting failure of
component type two, i.e., a failure of type two which causes two components
of type one to fail with it. For configuration 3, P(T_F ≤ t) = O(ε⁶) and there
are two most-likely failure paths: three failures of component type one, or
two failures of component type two. For configuration 2, P(T_F ≤ t) = O(ε⁴)
and there are (at least) four different types of most-likely failure paths:
1. two failures of type two,
2. one failure of type one and one affecting failure of type two,
3. an affecting failure of type two, a repair of type two, and an affecting
failure of type two,
4. an affecting failure of type two, two repairs of type two, and an affecting
failure of type two.
Similar analyses can also be made for configurations 5-8.
Each configuration was simulated for 256,000 replications using the exponential
transformation. The parameter settings for the exponential transformation
were based on earlier experiments described in Nicola et al. (1990) and
Heidelberger et al. (1992). The rate of the first transition, α₁, was chosen so
that an exponential with rate α₁ is less than t = 100 with probability 0.8.
(This is called approximate forcing.) When repairs are ongoing, the values of
α_n were chosen so as to make the probability of failure before repair completion
approximately equal to p = 1/3 (p is called the biasing probability). This
was done as follows. Let 1/μ_n denote the mean repair time of the component
in repair at time T_{n-1} (1/μ_n = 0.5 for type one components and 1/μ_n = 7.5
for type two components). Then α_n is chosen so that α_n/(α_n + μ_n) = 1/3. For
exponentially distributed repairs, this makes the biasing probability exactly
equal to 1/3. Balancing was done by equalizing the probabilities of which
component type fails upon a failure event (since there is more than one
component of each type). Importance sampling was, in effect, "turned off"
(by making α_n small, i.e., close to the original hazard rates) whenever the
system returned to the state with all components operational.
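The biasing rate α_n follows directly from the condition α_n/(α_n + μ_n) = p. A minimal sketch (function name and layout are my own):

```python
# Solve a/(a + mu) = p for the failure-biasing rate a, given the mean
# repair time 1/mu of the component in repair, as in the paper's setup.
def biasing_rate(mean_repair, p=1.0 / 3.0):
    mu = 1.0 / mean_repair
    return p * mu / (1.0 - p)

a_type1 = biasing_rate(0.5)   # type one components: 1/mu = 0.5
a_type2 = biasing_rate(7.5)   # type two components: 1/mu = 7.5
```

For type one components this gives α_n = 1.0, and for type two components α_n = 1/15 ≈ 0.067; in both cases α_n/(α_n + μ_n) = 1/3 exactly.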
The point estimates and the relative half-widths of 99% confidence intervals
are displayed in Table 6.1. Notice that system failures before time
t = 100 are not particularly rare for ε = 0.01, especially in configurations 1, 2,
5 and 6. Indeed, for ε = 0.01, the importance sampling is not very effective
and can actually result in some variance increase. For example, using standard
simulation when ε = 0.01, the relative half-widths of 99% confidence
512 Philip Heidelberger et al.
Table 6.1. Estimates of the unreliability at time t = 100, γ(ε, 100), for the model
with two component types, along with estimated relative half-widths of 99% confidence
intervals. The estimates were obtained from 256,000 replications using the
exponential transformation.

No.  Type 1 Failure  Type 2 Failure  Components Affected  ε = 0.01              ε = 0.0001
     Distribution    Distribution    Probability a
1    E₂(ε)           E₂(ε)           0                    8.19 x 10⁻²  (2.2%)   7.34 x 10⁻⁹   (3.4%)
2    E₂(ε)           E₂(ε)           0.25                 1.17 x 10⁻¹  (2.0%)   1.03 x 10⁻⁸   (2.7%)
3    E₂(ε)           E₂(ε^1.5)       0                    2.40 x 10⁻⁴  (6.1%)   4.49 x 10⁻¹⁵  (11.5%)
4    E₂(ε)           E₂(ε^1.5)       0.25                 9.68 x 10⁻⁴  (14.5%)  2.86 x 10⁻¹²  (4.7%)
5    H₂(ε)           E₂(ε)           0                    8.16 x 10⁻²  (1.5%)   7.34 x 10⁻⁹   (3.4%)
6    H₂(ε)           E₂(ε)           0.25                 8.94 x 10⁻²  (1.5%)   9.60 x 10⁻⁹   (2.7%)
7    H₂(ε^1.5)       E₂(ε^1.5)       0                    5.69 x 10⁻⁵  (5.4%)   2.48 x 10⁻¹⁵  (5.8%)
8    H₂(ε)           E₂(ε^1.5)       0.25                 2.45 x 10⁻⁴  (5.3%)   2.14 x 10⁻¹²  (3.9%)
7. Summary
This paper has considered the problem of efficiently simulating the system
failure time distribution in models of highly dependable systems with non-
Markovian component failure distributions. Several importance sampling ap-
proaches were described. These approaches are a natural generalization of
approaches used in Markovian systems. We proved (under appropriate technical
conditions) that these approaches are all effective as component failure
events become rarer. Specifically, we showed that, for a fixed time horizon t,
the relative error of the resulting estimators remains bounded as the component
failure events become rare.
Acknowledgement. Parts of Sections 1, 2 and 5.1 are taken from Nicola et al. (1992).
This material is reprinted with permission of the IEEE.
References
Barlow, R.E., Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart and Winston 1981
Brown, M.: Error Bounds for Exponential Approximations of Geometric Convolu-
tions. The Annals of Probability 18, 1388-1402 (1990)
Carrasco, J. A.: Failure Distance-Based Simulation of Repairable Fault-Tolerant
Systems. Proceedings of the Fifth International Conference on Modelling Tech-
niques and Tools for Computer Performance Evaluation (1991a), pp. 337-351
Lewis, E.E., Bohm, F.: Monte Carlo Simulation of Markov Unreliability Models.
Nuclear Engineering and Design 77, 49-62 (1984)
Lewis, P.A.W., Shedler, G.S.: Simulation of Nonhomogeneous Poisson Processes by
Thinning. Naval Research Logistics Quarterly 26, 403-413 (1979)
Moorsel, A.P.A. van, Haverkort, B.R., Niemegeers, I.G.: Fault Injection Simula-
tion: A Variance Reduction Technique for Systems with Rare Events. Depend-
able Computing for Critical Applications 2. Berlin: Springer 1991, pp. 115-134
Nakayama, M.K.: A Characterization of the Simple Failure Biasing Method for Sim-
ulations of Highly Reliable Markovian Systems. ACM Transactions on Modeling
and Computer Simulation 4, 52-88 (1994)
Nakayama, M.K.: General Conditions for Bounded Relative Error in Simulations of
Highly Reliable Markovian Systems. IBM Research Report RC 18993. Yorktown
Heights, New York (1993)
Nakayama, M.K.: Simulation of Highly Reliable Markovian and Non-Markovian
Systems. Ph.D. Dissertation, Department of Operations Research, Stanford Uni-
versity (1991)
Nicola, V.F., Heidelberger, P., Shahabuddin, P.: Uniformization and Exponential
Transformation: Techniques for Fast Simulation of Highly Dependable Non-
Markovian Systems. Proceedings of the Twenty-Second International Sympo-
sium on Fault-Tolerant Computing. IEEE Computer Society Press 1992, pp.
130-139
Nicola, V.F., Nakayama, M.K., Heidelberger, P., Goyal, A.: Fast Simulation of De-
pendability Models with General Failure, Repair and Maintenance Processes.
Proceedings of the Twentieth International Symposium on Fault-Tolerant Com-
puting. IEEE Computer Society Press 1991, pp. 491-498
Nicola, V.F., Shahabuddin, P., Heidelberger, P.: Techniques for Fast Simulation of
Highly Dependable Systems. Proceedings of the Second International Workshop
on Performability Modelling of Computer and Communication Systems (1993)
Nicola, V.F., Shahabuddin, P., Heidelberger, P., Glynn, P.W.: Fast Simulation of
Steady-State Availability in Non-Markovian Highly Dependable Systems. Pro-
ceedings of the Twenty-Third International Symposium on Fault-Tolerant Com-
puting. IEEE Computer Society Press 1992, pp. 38-47
Parekh, S., Walrand, J.: A Quick Simulation Method for Excessive Backlogs in
Networks of Queues. IEEE Transactions on Automatic Control 34, 54-56 (1989)
Sadowsky, J.S.: Large Deviations and Efficient Simulation of Excessive Backlogs
in a GI/G/m Queue. IEEE Transactions on Automatic Control 36, 1383-1394
(1991)
Shahabuddin, P.: Simulation and Analysis of Highly Reliable Systems. Ph.D. Dis-
sertation, Department of Operations Research, Stanford University (1990)
Shahabuddin, P.: Importance Sampling for the Simulation of Highly Reliable
Markovian Systems. Management Science 40, 333-352 (1994a)
Shahabuddin, P.: Fast Transient Simulation of Markovian Models of Highly De-
pendable Systems. Performance Evaluation 20, 267-286 (1994b)
Shahabuddin, P., Nakayama, M. K.: Estimation of Reliability and its Derivatives
for Large Time Horizons in Markovian Systems. Proceedings of 1993 Winter
Simulation Conference. IEEE Press 1993, pp. 422-429
Shanthikumar, J. G.: Uniformization and Hybrid Simulation/Analytic Models of
Renewal Processes. Operations Research 34, 573-580 (1986)
Stiffler, J., Bryant, L.: CARE III Phase III Report-Mathematical Description.
NASA Contractor Report 3566 (1982)
Van Dijk, N.M.: On a Simple Proof of Uniformization for Continuous and Discrete-
State Continuous-Time Markov Chains. Adv. Appl. Prob. 22, 749-750 (1990)
Part V
1. Introduction
Companies conduct their business in different ways, even when they produce
identical products. Their business processes differ as a result of their specific
business principles, policies and strategies. The management systems supporting
these processes will therefore also differ between companies.
This implies that the maintenance management system supporting one
company's business process may not be applicable to another. This paper is
therefore limited to describing some of the generic steps in structuring
the maintenance management system, the critical success factors, where to
focus, and how to measure. The discussion is furthermore restricted to the
operational phase; maintenance input into design is not covered by this
paper.
Throughout this paper our definition of maintenance will be:
"The combination of all technical and associated administrative actions
intended to retain an item in, or restore it to, a state in which it can perform
its required function."
To set the scene, during the Operations phase the general objectives for
maintenance within the Oil & Gas industry are:
- to safeguard the technical integrity of all surface facilities;
520 Wim Groenendijk
acts as the framework within which the activities are defined, including a
description of their logical sequences and their interrelationships with other
activities and other processes.
2.2 Structure
1. the description of the process, activities and tasks designed to meet cor-
porate and customer requirements with performance measurement and
feedback systems to enable control and continuous improvement;
2. policies, standards and procedures related to the process and activities;
3. controls appropriate to the risks and critical activities of the process;
4. an organisational structure that matches the process, with tasks and
responsibilities defined for each critical activity;
5. a description of the main competencies required from staff to supervise
and carry out the activity/task;
6. information and data systems to enable control and improvement.
3.2.2 Schedule. The activities that have been planned eventually need to
be sequenced so that the work is done as efficiently as possible and with the
least impact on availability.
Special attention needs to be paid to identifying concurrent activities, i.e.,
to determining whether they can safely be executed simultaneously. Also, an
appropriate change control mechanism needs to be in place to manage short-
term deviations from the plan in response to operational conditions (e.g.
breakdowns, weather conditions etc.).
Finally, detailed work packages are prepared and the required resources
called off.
3.2.3 Execute. The execution phase is where the activities identified on the
plan actually take place. This is where most of the resources are consumed,
therefore it is important to ensure that proper controls are employed to ensure
efficient utilisation of resources. Also, control of concurrent operations needs
to be in accordance with the plans and schedules. Finally, documents and
drawings need to be revised and updated to reflect implemented changes.
3.2.4 Analyse. Analysis is an aid to decision making. In turn, data is the
feedstock of analysis, and the choice of which data to collect, from a company
perspective, is driven by decision making needs. Decision making is forward
looking. Traditionally, maintenance management information systems have
collected vast amounts of information about equipment failures as an aid to
formulate maintenance options.
Analysis of maintenance effectiveness will reveal performance against the
desired state of technical integrity of the facilities. Analysis of maintenance
efficiency is used to control resource performance.
3.2.5 Improve. This is the stage where, based on the previous analysis,
improvement options are selected. These options should be selected after
studying their impact on long-term, preferably life-cycle, profitability. Discounted
cash flow methods are used to compare the different options with
respect to their economic viability. The various proposals should then be
ranked and prioritised according to their cost/benefit ratio and introduced
into the plans.
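The ranking step described above can be sketched in a few lines. A hypothetical illustration (option names, cash flows and the 10% discount rate are my own assumptions, not from the paper):

```python
# Compare improvement options on net present value (NPV) of their cash
# flows and rank the economically viable ones, as the text describes.
def npv(cash_flows, rate=0.10):
    """Discounted cash flow: year-0 investment is negative, later
    entries are yearly benefits."""
    return sum(cf / (1.0 + rate) ** year for year, cf in enumerate(cash_flows))

options = {
    "upgrade seals":    [-100.0, 40.0, 40.0, 40.0, 40.0],
    "extra inspection": [-20.0, 10.0, 10.0, 10.0],
}
# Rank options by NPV, best first; only positive-NPV options would be
# introduced into the plans.
ranked = sorted(options, key=lambda name: npv(options[name]), reverse=True)
```

Both illustrative options have positive NPV here; in practice only the viable, highest-ranked proposals would be carried through change control into the plans.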
Appropriate change control procedures need to be in place to ensure that
all relevant systems, standards and procedures are updated to reflect the
changes being implemented.
1. Effectiveness (an attribute indicating that the products meet specified fit
for purpose requirements);
2. Efficiency (an attribute indicating that the products are produced with
minimum use of resources);
3. Flexibility (effective and efficient in face of change).
6. R&D Contribution
6.1 Reliability Engineering
in the Oil and Gas Industry, who have collected reliability data for various
topsides equipment on offshore production platforms. The data present an
industry-wide benchmarking opportunity for offshore reliability data. The true
potential of the data, however, has still to be discovered by many; it includes
structured feedback of equipment reliability data to manufacturers, use in the
selection and purchasing of equipment, etc.
1. Introduction
a fruitful area: review papers mention hundreds of articles (Sherif and
Smith 1981 list 524 papers), and many more have appeared since. The impact
of all these papers on actual operations has been marginal, and in few other
fields does there seem to be such a discrepancy between theory and practice. This
gap is being narrowed by improvements in, and cost reductions of, computer
technology, which make computers and software available to the maintenance
function as well, and thereby allow the use of sophisticated models.
In this paper we describe a decision support system, called PROMPT,
which uses operations research models to assist the maintenance manager
in optimising preventive maintenance and to support him in executing pre-
ventive maintenance at the right time. It is an attempt to bridge the gap
between theory and practice, yet most of the theory needed was only devel-
oped during the construction of the d.s.s. PROMPT addresses the preventive
maintenance that is carried out to reduce downtime or to secure safety. It
offers both planning and scheduling tools to the user and is especially devel-
oped to make use of maintenance opportunities, thereby avoiding scheduled
downtime.
In this paper we will first give an overview of the problem characteris-
tics for which PROMPT was developed. Thereafter we give an overview of
PROMPT, its models, and what it does. Furthermore, we report our experiences
from a field test of the PROMPT system, which covered both the
initialisation of the system and the effect of its advice. Finally we give
an evaluation of the system.
The PROMPT system which is described in this paper is the successor
of an earlier prototype described in Van Aken et al. (1984). Like the
present PROMPT system, the prototype was directed at giving advice for
opportunity-based preventive maintenance. Although this system was
considered to be successful, it had two major shortcomings. First of
all, its objective was to increase reliability, whereas at a later stage not all
failures were considered to be of equal importance. Secondly, it could not
indicate how much preventive maintenance is cost-effective, as the models
assumed that more preventive maintenance always implied more reliability.
The present PROMPT system, as described here, is a completely new sys-
tem, in which we took advantage of the experiences obtained with the earlier
prototype.
There are no systems comparable to PROMPT. Dekker (1992) gives an
overview of maintenance decision support systems. Most of them are
tactical tools, which address a single unit or component and allow a single
action on it to be optimised. Some maintenance management information systems
contain a reliability module, but hardly ever an optimisation module, and
certainly not one for opportunity maintenance.
532 Rommert Dekker and Cyp van Rijn
2. Problem Description
If preventive maintenance is applied to a unit, there is a preference to carry
it out only at those moments in time when the unit is not required for pro-
duction. In some cases, where units are used continuously (e.g. in the process
industry), this may cause problems. Execution of preventive maintenance is
then restricted to costly annual shutdowns. In some systems, however, short-lasting
interruptions of production occur from time to time for a variety of reasons,
e.g. breakdowns of, or maintenance on, essential units. During these interruptions
some other units are not required and can be maintained preventively;
in this case we speak of maintenance opportunities. Unfortunately, these
opportunities can usually not be predicted in advance. Because of the random
occurrence of opportunities and their limited duration, traditional
maintenance planning fails to make effective use of them.
The objective of PROMPT is to give decision support for opportunity-
based preventive maintenance. For PROMPT an opportunity is defined as
any moment in time at which preventive maintenance can be carried out with-
out adverse effects of a unit shutdown being incurred. The user of PROMPT
has to identify the opportunities and report them to the system to get advice.
PROMPT assumes that although opportunities occur randomly, they
do occur repeatedly and provide 'enough' time for preventive
maintenance. PROMPT primarily focuses on routine preventive maintenance,
since first-line maintenance (greasing, etc.) can be executed during normal
operations, and major overhauls are too large for opportunities and have to be
planned in advance.
An opportunity-based policy is of importance for continuously used equip-
ment, for which downtime costs are high. Examples of such equipment are
gas turbine driven power generators at offshore production platforms. A typ-
ical aspect of offshore platforms is that all the equipment has to be installed
in a limited amount of space, and that therefore as little equipment as possible
was installed in the design phase, making downtime costs high. Another
aspect is that the production of the platform has a high economic value.
Although production is usually not lost but rather deferred, there is a strong
incentive to recover the large investments as soon as possible, and therefore
even deferred production has a high cost value.
To make effective use of opportunities, preventive maintenance has to
be split up into packages which can be fully carried out at an opportunity.
Mechanical, instrument and electrical maintenance are all included, and
different age indicators, such as run-hours and numbers of starts and stops, are allowed.
The tasks PROMPT has to carry out are threefold. First of all, it should
indicate how much preventive maintenance, if any, is cost-effective, and for
each maintenance package it should determine an optimal policy. Secondly,
PROMPT should schedule the cost-effective packages in such a way that
the optimal interval is adhered to as closely as possible. For safety-related
maintenance, PROMPT assumes that a maximal interval can be specified by
PROMPT, A DSS for Opportunity-Based Preventive Maintenance 533
the user. In that case PROMPT tries to execute the safety-related packages at
the last possible opportunity within the required interval. Finally, PROMPT
should record failures and preventive maintenance results, so that in the
course of time a better insight into failures can be obtained.
3. An Outline of PROMPT
3.1 Introduction
Any real system can be decomposed into units, parts and components. Several
hierarchies may exist, but we use the one applied by maintenance. Such a
decomposition is important, since it imposes many rules on maintenance, and
since different information may be available at each level, which has to be
translated to the other levels. In agreement with the operating company for which
PROMPT was developed, the following hierarchy was assumed. In PROMPT
a system is defined as a collection of units performing a specified task with a
measurable output, on which a lost or deferred product value can be set.
PROMPT assumes that a system is built up of subsystems in a series
configuration. A subsystem consists of one or multiple units in parallel, fulfilling
a certain task. A unit can be a gas turbine, compressor, pump or any other
physical entity. The unit is the highest level at which PROMPT gives advice.
PROMPT assumes that if an opportunity occurs for a unit, it applies to all parts
of the unit. The planned maintenance routines are subdivided into
non-overlapping maintenance packages, each consisting of one or more maintenance
activities. Apart from this maintenance hierarchy, PROMPT also
considers a physical hierarchy, in which the unit is subdivided into elements,
each having one or more failure modes. The user is free to define the elements,
which need not correspond to a single maintenance activity.
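The two hierarchies can be made concrete with a small data structure. A minimal sketch (class and attribute names are my own, not PROMPT's):

```python
# System -> subsystems (series) -> units (parallel); each unit carries a
# maintenance hierarchy (packages of activities) and a physical hierarchy
# (elements with failure modes), mirroring the decomposition described above.
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    failure_modes: list = field(default_factory=list)

@dataclass
class Unit:
    name: str
    packages: list = field(default_factory=list)   # maintenance hierarchy
    elements: list = field(default_factory=list)   # physical hierarchy

@dataclass
class Subsystem:
    units: list = field(default_factory=list)      # units in parallel

@dataclass
class System:
    subsystems: list = field(default_factory=list)  # subsystems in series

# Example: one subsystem with two parallel gas turbines.
system = System([Subsystem([Unit("gas turbine A"), Unit("gas turbine B")])])
```

The unit level is where advice is given; packages and elements hang off the unit independently, reflecting that elements need not correspond to a single maintenance activity.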
Basically, the cost penalty for a given unit is calculated as follows. First, the
states (either working or failed) of all other units in the subsystem are
enumerated. Next, for each combined state of the other units, the costs
caused by losing the unit in question are determined. The cost penalty then
follows as a weighted summation of the costs per state, each multiplied by
the probability of occurrence of that state. Notice that this is a marginal cost
value, i.e., the cost incurred by one hour of extra downtime of a unit; it is
not an allocation of the actual downtime costs over the units. The model
is an extension of the so-called k-out-of-n availability models to non-identical
machines and varying demand. It is not clear whether this cost allocation
is the best one. There is almost no literature in this respect, although the
problem arises in most systems with parallel units. Almost all papers assume
either that the cost penalty is given for a particular unit, or that the unit has
only one failure mode, which is an unrealistic assumption. For utility units
we considered the production systems sustained by them. The costs of downtime of
a utility unit then follow from the loss or deferment of production of those
production systems which have to be shut down because of the loss of the utility
unit. In this assessment we take into account the availability of other utility
units which are capable of taking over the duty of the unit considered.
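The state-enumeration calculation above can be sketched directly. A hypothetical illustration (availabilities, the cost function and all names are assumptions for the example, not PROMPT's actual data):

```python
# Marginal downtime cost penalty of a unit: enumerate the up/down states
# of the *other* units in its subsystem, take the cost of losing the unit
# in each state, and weight by the probability of that state.
from itertools import product

def marginal_downtime_cost(other_availabilities, cost_given_state):
    """other_availabilities: long-run availability of each other unit.
    cost_given_state: maps a tuple of up/down booleans for the other
    units to the extra cost per hour of losing the unit in question."""
    total = 0.0
    for state in product([True, False], repeat=len(other_availabilities)):
        prob = 1.0
        for up, avail in zip(state, other_availabilities):
            prob *= avail if up else (1.0 - avail)
        total += prob * cost_given_state(state)
    return total

# Two other parallel units, each 90% available; losing "our" unit only
# costs deferred production (say 1000/h) if both others are already down.
c = marginal_downtime_cost([0.9, 0.9],
                           lambda s: 0.0 if any(s) else 1000.0)
```

Here the penalty is 0.1 × 0.1 × 1000 = 10 per hour: a marginal value, not an allocation of actual downtime costs.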
The unit downtime cost penalty was explicitly stored in the database of
the d.s.s., with the idea that it might change over time, e.g. because of
depletion of the field from which the platform was producing.
After having established a cost penalty for unit downtime, we consider in
this section the positive effects of each preventive maintenance activity
in detail. For simplicity of language, we regard in this section
all elements addressed by one activity as being one component (although in
practice this is not necessarily the case). In PROMPT a failure is defined as
"any event after which a component stops functioning in a prescribed way".
In general two types of failures should be distinguished, viz. revealed and
unrevealed failures, and a separate failure model should be used for each. A
failure model describes the relationship between a failure and its consequences
and contains a quantitative mechanism for predicting failures. The latter
operates through probability distributions, which may be in any type of condition
indicator (e.g. calendar time, run-hours, etc.), as long as the indicator is
predictable in time. PROMPT's failure models have been set up in such a way
that they are consistent with the findings of inspections.
3.5.1 The Revealed Failure Model. The revealed failure model assumes
that a failure is noticed directly and that an appropriate action is undertaken.
The consequences of the failure are assumed to occur directly after the failure.
As a result of the failure the unit may break down with a certain probability,
P_ud (to be specified per component). Costs of failure are split up into indirect
costs due to unit downtime (in case the unit breaks down) and direct costs due to
repair of the component. If the expected downtime amounts to d hours, the
unit downtime cost penalty to c_ud and the repair costs to c_r, then the total
expected cost of failure c_f is given by c_f = c_r + P_ud d c_ud. The time to failure is
modelled through a two-parameter Weibull distribution with shape parameter
λ and scale parameter β. Other data the user had to specify included the average
downtime in case of a unit breakdown, the average time needed for repair, the average
number of men required for repair, and additional material costs (normal
material costs were incorporated as a surcharge on the manpower costs).
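The expected failure cost formula is a one-liner; a sketch with illustrative numbers (all cost values are my own assumptions):

```python
# Expected cost of a revealed failure: c_f = c_r + P_ud * d * c_ud, where
# c_r is the repair cost, P_ud the probability the unit breaks down,
# d the expected downtime in hours, and c_ud the downtime cost per hour.
def expected_failure_cost(c_r, p_ud, d, c_ud):
    return c_r + p_ud * d * c_ud

# Illustrative numbers: repair 500, 40% breakdown chance, 8 h downtime
# at a penalty of 250 per hour.
cf = expected_failure_cost(c_r=500.0, p_ud=0.4, d=8.0, c_ud=250.0)
```

With these numbers c_f = 500 + 0.4 × 8 × 250 = 1300, of which the indirect (downtime) part dominates, as is typical for continuously used equipment.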
On the other hand, it is usually neither economical nor convenient from an
administrative point of view to execute all maintenance activities separately.
Hence activities were grouped into packages. We therefore assumed that the
user would be able to define maintenance packages. It further appeared
practical to advise only full maintenance packages, even if one of the activities
had already been carried out because of a failure. Furthermore, failures
usually provided no time for preventive maintenance, as the failed component had to
be repaired as soon as possible and no time was left over. The user had to be
given the freedom to report either a renewal or a repair of the component to
its state before the failure.
more than three components. Jorgenson et al. (1967) presented a model for
multiple components with exponential times between opportunities, but did
not specify how to optimise. None of the models was able to deal with a
restricted opportunity duration, nor with different failure models.
We therefore developed novel models to deal with this complex problem.
In our case opportunities are created by causes outside the unit and upon
failure of one of its components the unit is repaired as soon as possible and
no time for further preventive maintenance is available. Our first conclusion
was therefore that the only correlation between the packages consisted of
competing for the restricted time at an opportunity. Accordingly we reduced
the original problem to the following: "determine for each maintenance pack-
age separately an optimum policy, which indicates when it should be carried
out at an opportunity, independently of all other packages. Furthermore, de-
termine from the outcomes of these models a priority measure with which
maintenance packages should be executed at a given opportunity". To this
end we introduced the so-called one-opportunity-look-ahead policies, which
can be considered as a generalisation of the marginal cost approach (originally
introduced by Berg 1980). At each opportunity these policies compare,
for each package, the cost of deferring the execution to the next opportunity
with the minimum long-term costs. In the next sections the approach is
discussed in more detail.
3.9.2 Maintenance Activity Optimisation Models. Consider a maintenance
activity addressing a revealed failure of a specific component. Basically,
the maintenance optimisation can be tackled by the age or block replacement
model, with the extra restriction that preventive maintenance is restricted to
opportunities. We took the block replacement model, since it can be extended
to multiple activities in a package and non-exponential times between
opportunities can be handled (for age replacement, only exponentially
distributed times between opportunities can be handled; non-exponential times
become very difficult, see Dekker and Dijkstra 1992). We did modify the
block policy to avoid replacing new components, but that will be explained
later. In the block replacement model a component is replaced preventively
every t time units at cost c_p, and upon failure at cost c_f (> c_p). Let
F(t) be the c.d.f. of the time to failure and let M(t) be the associated renewal
function, indicating the expected number of failures in [0, t]. The long-term
average costs g(t) follow easily from renewal theory and amount to
g(t) = (c_p + c_f M(t))/t.
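This objective can be evaluated numerically once the renewal function is available. A minimal sketch under the standard block replacement model, g(t) = (c_p + c_f M(t))/t, with M(t) computed from a simple discretisation of the renewal equation (Weibull parameters, cost values and step size are illustrative assumptions, not PROMPT's data):

```python
# Block replacement: minimise g(t) = (c_p + c_f * M(t)) / t, where M(t)
# solves the renewal equation M(t) = F(t) + integral_0^t M(t-x) dF(x).
import math

def weibull_cdf(t, shape=2.0, scale=1.0):
    return 1.0 - math.exp(-((t / scale) ** shape))

def renewal_function(cdf, t_max, h):
    """Discretised renewal equation on a grid with step h."""
    n = int(t_max / h)
    F = [cdf(i * h) for i in range(n + 1)]
    M = [0.0] * (n + 1)
    for i in range(1, n + 1):
        M[i] = F[i] + sum((F[j] - F[j - 1]) * M[i - j]
                          for j in range(1, i + 1))
    return M

c_p, c_f, h, t_max = 1.0, 10.0, 0.005, 2.0
M = renewal_function(weibull_cdf, t_max, h)
g = [(c_p + c_f * M[i]) / (i * h) for i in range(1, len(M))]
t_star = (g.index(min(g)) + 1) * h   # minimising replacement interval
```

With an increasing failure rate (Weibull shape 2) and c_f substantially above c_p, g(t) has an interior minimum; here t* comes out near 0.3 times the scale parameter.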
Dekker and Smeitink (1990) show that the same conditions are needed
for the existence of a unique minimum t* of g_Y(t) as for the standard block
replacement model, and that t* is the unique solution of the following
optimality equation:
$$c_f\,E[M(t+Y) - M(t)] - g_Y\,E[Y]\ \begin{cases} < 0 & \text{for } 0 < t < t^*, \\ = 0 & \text{for } t = t^*, \\ > 0 & \text{for } t > t^*, \end{cases} \eqno(3.4)$$
where g_Y denotes the minimum average costs. Notice that c_f E[M(t+Y) −
M(t)] can be interpreted as the expected cost of deferring execution of the
activity from the present opportunity at time t to the next one, Y time units
ahead.
The analysis of the opportunity block replacement model does not make
use of the interpretation of the cost over an interval. In fact, any other cost
function may be used as well (as is also remarked in Dekker 1995). Accordingly,
the analysis carries over easily to the unrevealed failure model, with M(t)
replaced by ∫₀ᵗ (1 − F(x)) dx.
To calculate the integrals in equation (3.3), we initially approximated Z_t
by a three-point distribution with reasonably chosen values and
probabilities. Later, in Dekker and Smeitink (1990), it appeared that Z_t can
be approximated by the forward recurrence time of a Coxian-2 distribution
when the coefficient of variation is larger than 0.5, and by the stationary
excess distribution otherwise. For the renewal function a simple but
effective approximation was developed (see Smeitink and Dekker 1989).
540 Rommert Dekker and Cyp van Rijn
$$RC(t) = \sum_{i=1}^{n_r} c_{f,i}\,E[M_i(t+Y) - M_i(t)] + \sum_{j=1}^{n_u} c_{f,j}\,E\left[\int_t^{t+Y} (1 - F_j(x))\,dx\right] - g_Y\,E[Y], \qquad (3.6)$$
where the first sum runs over the revealed-failure activities and the second over the unrevealed-failure activities in the package,
with $g_Y$ the minimum average costs of the total package. We can interpret
RC(t) as the expected costs of deferring the execution of the package to the
next opportunity, Y time units ahead, minus the long-term average costs over
that time. Hence it is an ideal candidate for ranking packages. Notice that at
an opportunity we only have to calculate the first part of RC(t); as $g_Y$ can
be stored in the database, we only need to calculate it upon initialisation of
the d.s.s.
The idea is now to execute those maintenance packages with the highest
ranking value until the opportunity is fully used. Notice that the ranking
criterion is myopic: a package may be delayed multiple times at an opportu-
nity. Including that effect, however, was considered to be too complex. The
procedure was tested in Dekker and Smeitink (1994) and performed quite
well.
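The greedy selection rule described above can be sketched in a few lines. The package names, RC values and durations below are invented for illustration, and the real PROMPT selection logic is certainly richer than this:

```python
from dataclasses import dataclass

@dataclass
class Package:
    name: str
    rc: float      # RC(t): expected cost of deferring to the next opportunity
    hours: float   # execution time required

def select_packages(packages, capacity_hours):
    """Myopic rule: execute the highest-ranking packages until the
    opportunity is fully used (names and numbers are hypothetical)."""
    chosen = []
    for p in sorted(packages, key=lambda p: p.rc, reverse=True):
        if p.hours <= capacity_hours:
            chosen.append(p.name)
            capacity_hours -= p.hours
    return chosen

candidates = [Package("pump-A", 120.0, 6.0),
              Package("valve-B", 45.0, 2.0),
              Package("motor-C", 300.0, 10.0)]
```

For example, `select_packages(candidates, 12.0)` picks the 10-hour package with the highest RC and then fills the remaining 2 hours with the small valve package.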
Next, we modified the block policy to take recent failure replacements
into account. If for some revealed-failure components actual ages were known,
we replaced the renewal function in equation (3.6) by the expected number
of failures given the present age(s), using the c.d.f. and its convolutions. This
PROMPT, A DSS for Opportunity-Based Preventive Maintenance 541
idea was elaborated in Dekker and Roelvink (1995) and appeared to cover
most of the cost-performance difference between age and block replacement,
even in the multi-component case.
Finally, we wanted to allow the user to enter a specific interval (either
as a point value or as a three-point distribution) to the next opportunity, which
could differ from the long-term distribution of the time between opportunities. In that case we replace the r.v. Y in equation (3.6) by the interval
specified.
3.11 Software
Although the company at which a field test would later take place had an extensive maintenance management information system in use, we decided to
develop PROMPT separately from it, with the intention of making connecting
links once PROMPT had demonstrated its value. One of the reasons
was that PROMPT needs more detailed information than is in the
maintenance management information system.
The main part of the PROMPT software consists of a database which has
been written in a 4th generation database language, in order to secure easy
reporting facilities. As language we chose FOCUS, in order to provide com-
patibility between a mainframe and a PC version. The optimisation occurs
through Fortran subroutines.
The total code consists of some 20,000 lines. Although PROMPT was originally
set up for a personal computer (PC), we later switched to a mainframe,
as the complexity was too large to be handled by the then existing PCs (IBM
PC-AT) and the PC version of FOCUS.
This was in fact a major task. Before PROMPT, routine maintenance was
lumped together in large packages of, say, 150 hours, which were executed
during the yearly platform shutdown. Each task had to be written down in
detail, with exact specifications of the equipment addressed. Thereafter all
activities had to be combined into maintenance packages. Although there are
optimisation aspects involved, this was done purely by engineering judgment,
grouping those activities which could easily be executed together. The type
of maintenance activity could be either mechanical, instrument or electrical. Furthermore, for each package one had to determine the best condition
indicator, being either run-hours, calendar time or the number of starts and stops.
Although the model developed to assess cost penalties for unit downtime was
considered to be quite general, the field test revealed that practice has many
Experiences with the software were positive. Although we used the term decision support system for PROMPT, it is better described as a structured
decision system, as the support it provides is always of the same form. As
the main advice is given at an opportunity, and as opportunities occur repeatedly,
there is much to be said for structured advice. Developing such large computer
systems does put a different light on mathematical optimisation. The larger
software becomes, the more difficult it is to handle and check beforehand. Software errors can produce completely wrong results and thereby destroy all
value of the optimisation.
A major problem encountered concerned database integrity. In order to
secure this, all kinds of protection mechanisms were built into the system,
next to the already existing protection mechanisms provided by FOCUS. This
made it very time consuming for the user to change data that had been entered
erroneously. The user wanted the flexibility of changing data as in a
spreadsheet, but that is not what database languages provide. Especially the
so-called key variables, around which the database is structured, are extremely
difficult to change. Users do not always have beforehand the right description
of their database elements, implying that difficult changes have to
be made later on, or that a user is left with a database that is difficult to understand. The
latter may be a cause of future errors.
4.7 Evaluation
test. The actual PROMPT advice was in fact very flexible, and provided
exactly what they needed. The time needed to do the failure mode analysis
and assessment of failure time distribution was considerable (as it often is).
There were some complaints about the complexity and the difficulties of managing
the database. A final version of PROMPT will have to be simplified and require far less data input, certainly when the cost of initialising
PROMPT is compared with the amount of money involved in the part of maintenance
suitable for execution at opportunities on the units in question. Besides, other
problems may overshadow PROMPT temporarily, thereby destroying the discipline needed to maintain it (on the platform in question there was a lengthy
shutdown caused by other reasons).
5. Conclusions
PROMPT can be considered a major step forward in applying scientific
methods to maintenance management. It has its pros and cons. Its pros are
undoubtedly the structured approach leading to an optimisation of preventive
maintenance. Its cons, however, mainly consist of it being a complex system and
the long effort required to initialise it. Future work will be directed at
reducing the initialisation effort and simplifying the system while keeping the
benefits of the structured approach.
Acknowledgement. The authors would like to thank Messrs. van Oorschot, Cooper and Hartley from Shell Expro Aberdeen for their cooperation on the PROMPT project. The
actual development of PROMPT was done by Ernest Montagne, Joop van Aken,
Dick Turpin and the authors.
References
Barlow, R.E., Hunter, L.C.: Optimum Preventive Maintenance Policies. Oper. Res.
8, 90-100 (1960)
Barlow, R.E., Proschan, F.: Mathematical Theory of Reliability. New York: Wiley
1965
Backert, W., Rippin, D.W.T.: The Determination of Maintenance Strategies for
Plants Subject to Breakdown. Computers & Chemical Engineering 9, 113-126
(1985)
Berg, M.B.: A Marginal Cost Analysis for Preventive Maintenance Policies. European Journal of Operational Research 4, 136-142 (1980)
Cho, D.I., Parlar, M.: A Survey of Maintenance Models for Multi-Unit Systems.
European Journal of Operational Research 51, 1-23 (1991)
Dekker, R.: Use of Expert Judgment for Maintenance Optimization. First report of
the ESRRDA Project group on expert judgment (1989)
Dekker, R.: Applications of Maintenance Optimisation Models: A Review and
Analysis. Report Econometric Institute 9228/A, Erasmus University Rotterdam (1992)
Appendix
A. Example of Advice
Summary. The delay time model is an inspection model which has been used
extensively in many case studies of the development of recommended maintenance
policies both for industrial plant and buildings. An introduction to the method-
ology is given, together with some recent modelling and inferential developments.
The focus is on the estimation of model parameters by fitting to so-called objective
data, rather than on the use of subjective data. There is a comprehensive bibliog-
raphy. Hitherto unpublished work presented here includes statistical inference for
multicomponent systems and new ways of modelling imperfect inspection.
1. Introduction
This chapter describes a preventive maintenance model that has been suc-
cessfully used in many case studies since 1982. The focus here is on recent
developments in the model, and the reader is referred to papers such as
Baker and Christer (1994) for a more general account of the method, and its
historical evolution.
In this chapter, a simple case study is presented to give the flavour of
the method, and after a fuller description of the model and of the estimation
of model parameters, some more complex case studies are discussed. Finally
some current ideas for model development are mentioned.
For those readers familiar with the delay-time model, the work presented
here for the first time is the derivation of the likelihood function of the NHPP
multicomponent model from the component-tracking model in Section 2.4.1,
the Empirical Bayes multicomponent model in Section 2.5, the remarks on
stochastic cost in Section 3 and the new imperfect-inspection model in Section 7.2.2.
1.1 Background
1.2 Terminology
It is necessary to define some terms.
We are concerned with delay-time modelling of one or more machines or
systems liable to (costly) failure. The model has been applied to a variety of
systems, most often industrial plant, but also to building maintenance. The
machines may be very large and complex, such as power presses or production
lines, or small items with few components, such as infusion pumps and other
items of hospital equipment.
Failure is taken here to mean a breakdown or catastrophic event, after
which the system is unusable until repaired or replaced. It may also be simply
a deterioration to a state such that the repair can no longer be postponed.
This is especially true in building maintenance. Preventive maintenance is
some activity carried out at intervals, with the intention of reducing or elim-
inating the number of failures occurring, or of reducing the consequences of
failure in terms of, say, downtime or operating cost.
The concept of failure delay time, or simply delay time, is central to the
DTM. Failure is regarded as a two-stage process. First, at some time u with
distribution function G(u) and p.d.f. g(u), a component of the system becomes
recognisable as defective; the defective component subsequently fails after
some further interval h, with distribution function F(h) and p.d.f. f(h). Preventive
552 Rose Baker
Fig. 1.1. How inspections prevent failures in the component tracking model. The
horizontal axis represents time, and the open circles represent the origination of
defects, the closed circles represent failures, and the vertical lines inspections. The
third defect has originated but has been detected and repaired at inspection and
so has not caused a failure.
Fig. 1.2. How inspections prevent failures for the NHPP model. The horizontal
axis represents time and the open circles represent the origination of defects, the
closed circles represent failures and the vertical lines inspections. With periodic
inspections, as in the lower part of the figure, the second, fourth, fifth and eighth
defects have now been detected and repaired, and so have not caused failures.
Fig. 1.3. Downtime per week (hours) as a function of the interval $\Delta$ between inspections for
a simple case study of a canning line from Christer and Waller (1984c)
Typical additional assumptions for a simple model that tracks key com-
ponents are:
Set C
1. Each component has only one failure mode.
2. f and g are modelled as exponential or Weibull distributions.
3. The age of the system, as distinct from the age of the component, does
not influence the distributions G and F.
4. Repairs are statistically equivalent to replacements, so that the faulty
component is restored to an 'as-new' condition.
5. The key components of a machine are assumed independent, i.e. the
failure of one will not affect the subsequent functioning of another.
6. If more than one machine in a set is modelled, machines are assumed to
behave identically and to have uniform usage.
7. Total breakdown repair time is negligible compared to operating time.
Additional model assumptions for a simple model where individual compo-
nents are not tracked are:
1. The number of components is very large, and the probability of any given
component becoming defective in a specified period is very small, so that
defects arise in a NHPP.
2. Defects are repaired sufficiently well that the probability of any given re-
paired component again becoming defective is infinitesimally small. This
assumption is required in order not to jeopardise the NHPP of defect
arrival times. (For example, imperfect repair would cause a clustering of
defect arrival times).
In practice, the NHPP is a good approximation for any complex machine
where individual components cannot be tracked. The first set of model as-
sumptions (set A) cannot be changed without ceasing to have a recognisable
'delay-time' model. When the main function of maintenance is not the re-
placement or repair of defective parts, but is, for example, age-based replace-
ment of components, then the DTM is not applicable. However, it is possible
to include other effects of maintenance besides replacement of defective parts,
for example the 'rejuvenation' or premature ageing of machinery by beneficial
or hazardous inspection, as discussed in Section 7.1.
In general, the more specific model assumptions in sets B and C can be
relaxed or varied to suit the problem at hand.
Given a model constructed according to these assumptions, the mainte-
nance activity is understood well enough to calculate optimum maintenance
policies. This will often mean simply finding the optimum frequency of main-
tenance. The development will again differ according to the criterion chosen
for optimisation, e.g. minimum cost, minimum downtime or maximum out-
put. It is possible to devise an optimum policy for component tracking models
by which maintenance occurs at irregularly-spaced epochs after renewal of a
component, and this is discussed in Christer (1991b).
B Breakdown (failure)
N Inspection and no defect found
Y Inspection and defect found
E End of observation period.
Event N will be referred to as a negative inspection, event Y as a positive
inspection. In addition, the following event types are useful:
S Start of observation period
R Replacement (on a B or Y)
X Denotes any event
Event S is equivalent to an R event. We wish to write down the likelihood
of observing a sequence of events $X_1 \ldots X_n$ of types B, E, Y and N at times
$t_1 \ldots t_n$. The key to doing this is the multiplication law of likelihood, i.e.
$$P_{NB|R}(t_n, t) = \int_{t_n}^{t} g(u)\,f(t-u)\,du. \qquad (2.2)$$
Maintenance Optimisation with the Delay Time Model 561
Since g(u) is the p.d.f. for a defect arising at time u, and f(h) is the p.d.f.
for a breakdown occurring a time h later, g(u)f(t - u) is the p.d.f. of a failure
at t arising from a defect at u, and the integration sums over all possible
times u. These can only occur after the last moment at which there was known
to be no defect, $t_n$, and before the breakdown time t.
It is also true that
$$P_{NB|R}(t_n, t) = P_{N|R}(0, t_n)\,P_{B|RN}(t_n, t),$$
where $P_{N|R}(0, t_n)$ is the probability of a negative inspection at time $t_n$
from renewal, and $P_{B|RN}(t_n, t)$ is the p.d.f. of a breakdown, conditional on
that negative inspection.
- $P_{NE|R}(t_n, t)$ is the probability of a (possibly null) sequence of negative
inspections of which the last is at $t_n$, and no breakdown before observation
ceases at time t from the last renewal:
$$P_{NE|R}(t_n, t) = 1 - G(t) + \int_{t_n}^{t} g(u)\,(1 - F(t-u))\,du.$$
The first term, 1 - G(t), is the probability that no defect arises before time
t, and the second contribution to the probability of no failure is that a
defect does arise at some time u > $t_n$ but does not lead to a failure before time
t. The product g(u)(1 - F(t-u)) is the p.d.f. that a defect arises at u and
that there is no failure before time t, and the integration sums over all
possible times u, after the last negative inspection at $t_n$ and before time t.
As before, the probability of no event may be written as the probability
of a sequence of negative inspections, multiplied by the probability of no
event given such a sequence, i.e.
$$P_{NE|R}(t_n, t) = P_{N|R}(0, t_n)\,P_{E|RN}(t_n, t).$$
- $P_{NY|R}(t_n, t)$ is the probability of a sequence of negative inspections of
which the last occurs at $t_n$, followed by a positive inspection at time t
from the last renewal:
$$P_{NY|R}(t_n, t) = \int_{t_n}^{t} g(u)\,(1 - F(t-u))\,du. \qquad (2.3)$$
The p.d.f. for a fault arising at time u is g(u), and the probability of no
breakdown before t is (1 - F(t-u)). The integration sums over all possible
times of fault origin u.
$$g(u) = \beta_1\alpha_1^{\beta_1}u^{\beta_1-1}e^{-(\alpha_1 u)^{\beta_1}},\qquad
F(h) = 1 - e^{-(\alpha_2 h)^{\beta_2}},\qquad
f(h) = \beta_2\alpha_2^{\beta_2}h^{\beta_2-1}e^{-(\alpha_2 h)^{\beta_2}},$$
where $\alpha_1, \alpha_2$ are scale parameters and $\beta_1, \beta_2$ are shape parameters.
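For concreteness, the building-block probabilities (2.2) and (2.3) can be evaluated by straightforward numerical integration. The Python sketch below uses the Weibull forms for g, F and f with illustrative parameters; the helper names are ours, not from the chapter:

```python
import numpy as np

def weibull_pdf(x, a, b):
    # b a^b x^(b-1) exp(-(a x)^b), the parameterisation used in the text
    x = np.asarray(x, dtype=float)
    return b * a**b * x ** (b - 1) * np.exp(-((a * x) ** b))

def weibull_cdf(x, a, b):
    return 1.0 - np.exp(-((a * np.asarray(x, dtype=float)) ** b))

def trapezoid(y, x):
    # simple trapezoidal rule, to stay independent of numpy version details
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def p_nb(tn, t, a1, b1, a2, b2, n=4000):
    # eq. (2.2): defect arises at u in (tn, t), failure a time t - u later
    u = np.linspace(tn, t, n)
    return trapezoid(weibull_pdf(u, a1, b1) * weibull_pdf(t - u, a2, b2), u)

def p_ny(tn, t, a1, b1, a2, b2, n=4000):
    # eq. (2.3): defect arises at u in (tn, t) and is still unfailed at t
    u = np.linspace(tn, t, n)
    return trapezoid(weibull_pdf(u, a1, b1)
                     * (1.0 - weibull_cdf(t - u, a2, b2)), u)
```

With $\beta_1 = \beta_2 = 1$ both densities reduce to exponentials, for which (2.2) and (2.3) have simple closed forms, a convenient check on the quadrature.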
The likelihood is calculated by accumulating the product of these three
terms. Each renewal may be followed by a sequence of negative inspections,
and this must terminate in an event of type B, E, or Y. Event E is really
'no event'. The likelihood $\mathcal{L}$ for a total of $n_B$ breakdowns at times $t_i$, $n_E$
'no failure before observation ceases' events at times $t_j$, and $n_Y$ positive
inspections at times $t_k$, is
$$\mathcal{L} = \prod_{i=1}^{n_B} P_{NB|R}(\cdot, t_i)\prod_{j=1}^{n_E} P_{NE|R}(\cdot, t_j)\prod_{k=1}^{n_Y} P_{NY|R}(\cdot, t_k), \qquad (2.4)$$
where in each factor the first argument is the time of the last negative inspection preceding the event.
The case of a machine comprising two components is discussed first. They are
assumed to be mutually independent in that the state of either component is
assumed not to affect that of the other. There are two possible scenarios: when
component A fails, component B is either not inspected (case 1) or inspected
and replaced if visibly defective (case 2). Happily, both are tractable.
In case 1, the two components are completely independent: nothing that
happens to either of them can affect the other, and the likelihood factorizes.
The log-likelihood is the sum of the log-likelihoods for each component, $\log\mathcal{L} =
\log\mathcal{L}_A + \log\mathcal{L}_B$. In case 2, they are no longer independent, because a failure
of A will cause the replacement of B if B is visibly defective, and vice versa.
Happily, the likelihood can still be written in factored form, even though the
components are not now independent. A failure of either component (A, say)
So far it has been assumed that inspections always find a visible defect if it is
there. In the case of imperfect inspection, there is a probability r < 1 that a
defect is found if it exists. Successive 'trials' or inspections are independent.
This is equivalent to saying that a (perfect) inspection is carried out with
probability r, and that with probability 1- r the inspection is omitted. The
inspection is regarded as omitted merely as far as our state of knowledge
of the machine is concerned: it is not omitted as regards cost, downtime,
and other such consequences. The component-tracking model was developed
for the imperfect inspection case in Baker and Wang (1993). The logic is
complicated, and is not reproduced here.
The component tracking model cannot be used for complex plant where there
are very many components and where detailed records are lacking. Following a Pareto analysis of failure modes, any unreliable and hence frequently
replaced components could be modelled as above, and the remainder of the
defect and failure types grouped into q classes. In what follows it is assumed
that there is a 1:1 correspondence between defect types and failure modes;
however, it is straightforward to generalise the model to the situation where
several types of defect lead to a single failure mode, or vice versa. Each class
of defect is assumed to arise in a NHPP (nonhomogeneous Poisson process)
with intensity $\lambda_p(u)$ for the pth class, and to generate a failure at time t > u
according to the distribution $F_p(t-u)$.
The NHPP model is now derived as a limiting case of the component
tracking model, and its likelihood function found.
2.4.1 The NHPP. Assume that there are a large number M of components, and consider the distribution function of the time to a defect arising for the mth component, $G_m(u) = 1 - \exp\{-\int_0^u \lambda_m(x)\,dx\}$, where
$\lambda_m(x)$ is the corresponding hazard at time x. Let $\lambda_m(x) \to 0$ and $M \to \infty$
such that the total hazard $\lambda(x) = \sum_{m=1}^{M}\lambda_m(x)$ is finite. Then
$G_m(u) \to \int_0^u \lambda_m(x)\,dx$. Group components into q classes $A_1 \ldots A_q$, so
that the expected number of defects arising by time u in the pth class is
$\sum_{m\in A_p} G_m(u) \to \sum_{m\in A_p} \int_0^u \lambda_m(x)\,dx$. Note that the LHS of this equation
is an approximation, because once a component has developed a defect,
the expected number of defects due to it shortly afterwards at some time
u' > u is not $G_m(u')$. However, as $\lambda_m \to 0$, a vanishingly small fraction
of components will have developed defects by time u, so that the expected
number of defects by time u tends to the RHS expression, as long as the
hazard of failure still tends to 0 as $M \to \infty$ after repair or replacement. A process whose expected number of defects by time u is a function only of u is
an NHPP, and so this is a good model of defect arrival for complex plant.
Gamma, Weibull and log-logistic distributions for G all lead to the power
law process $\lambda(u) = \alpha u^{\beta-1}$, and the exponential distribution to the special case
of an HPP, where $\beta = 1$. The Gompertz distribution leads to the loglinear
process $\lambda(u) = \alpha\exp\{\beta u\}$.
2.4.2 The Likelihood Function. Suppose that 'events' (failures or detection of defects) may be observed at epochs $t_1 \ldots t_n$. This means that failures
are interval censored; they occur during the interval $(t_{i-1}, t_i)$. It is straightforward to revert later to the case where the timing of failures is known exactly.
Some of the $t_i$ will however be times when inspections are carried out.
Then
$$\mathcal{L} = \prod_{m=1}^{M}\prod_{i=1}^{n} p_{im}^{s_{im}} \prod_{m'}\prod_{i=1}^{n} (1 - p_{im'}), \qquad (2.5)$$
where $p_{im}$ is the probability of an event of the appropriate type (failure, or defect
found at inspection) for the mth component at the ith time, $s_{im}$ is the number
of such events (either 0 or 1), and the final product runs over all components $m'$
that have not given rise to an event of either type.
The $p_{im}$ are proportional to $\lambda_m$, so as $M \to \infty$, $p_{im} \to 0$. The final
product can now be taken over all M components rather than those not
suffering any event, and is equal to $\exp\{-\sum_{i=1}^{n}\sum_{m=1}^{M} p_{im}\}$ in the limit of
$M \to \infty$.
The likelihood is now a product of Poisson-type terms:
$$\mathcal{L} = \prod_{i=1}^{n}\prod_{m=1}^{M} p_{im}^{s_{im}}\exp\{-p_{im}\}. \qquad (2.6)$$
Grouping components into q classes, the number of events $a_{ip}$ in the pth
class is $a_{ip} = \sum_{m\in A_p} s_{im}$, and the mean number in that class is $\mu_{ip} =
\sum_{m\in A_p} p_{im}$. Here $a_{ip}$ follows a Poisson distribution, because it is the sum of
a number of Poisson random variates $s_{im}$, and so the full likelihood is
$$\mathcal{L} = \prod_{i=1}^{n}\prod_{p=1}^{q} \mu_{ip}^{a_{ip}}\exp\{-\mu_{ip}\}/a_{ip}!$$
When the ith epoch is a failure time,
$$\mu_{ip} = \sum_{j=1}^{i}\prod_{k=j}^{i-1}(1 - r_k)\int_{t_{j-1}}^{t_j}\lambda_p(x)\{F(t_i - x) - F(t_{i-1} - x)\}\,dx, \qquad (2.8)$$
with the convention that F(t) = 0 if t < 0, and when the ith time is a time
when an inspection took place,
$$\mu_{ip} = r_i\sum_{j=1}^{i}\prod_{k=j}^{i-1}(1 - r_k)\int_{t_{j-1}}^{t_j}\lambda_p(x)\{1 - F(t_i - x)\}\,dx. \qquad (2.9)$$
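A minimal numerical sketch of equation (2.9) follows, assuming a constant (HPP) defect arrival rate, an exponential delay-time distribution and illustrative inspection times; the function and variable names are hypothetical, not from the chapter:

```python
import numpy as np

def mu_inspection(i, times, r, lam, Fcdf, n=2000):
    """mu_ip of eq. (2.9): expected defects found at the inspection at
    times[i]. times[0] = 0 is the start; r[k] is the detection probability
    at the kth inspection. A defect arriving in (times[j-1], times[j]) must
    survive inspections j..i-1 undetected and not have failed by times[i]."""
    total = 0.0
    for j in range(1, i + 1):
        miss = float(np.prod([1.0 - r[k] for k in range(j, i)]))  # empty = 1
        x = np.linspace(times[j - 1], times[j], n)
        y = lam(x) * (1.0 - Fcdf(times[i] - x))
        total += miss * float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)
    return r[i] * total

# Illustrative: HPP rate 0.3, exponential delay with mean 5.
times = [0.0, 10.0, 20.0, 30.0]
r = {1: 1.0, 2: 1.0, 3: 1.0}          # perfect inspection
lam = lambda x: 0.3 * np.ones_like(x)
Fcdf = lambda h: 1.0 - np.exp(-np.asarray(h) / 5.0)
```

Under perfect inspection every defect still present is found, so each $\mu_{ip}$ depends only on the most recent inter-inspection interval; with r < 1 the earlier intervals contribute through the undetected-survival product.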
The likelihood can be rewritten as
$$\mathcal{L} = \prod_{p=1}^{q}\left[\left\{\frac{\left(\sum_{j=1}^{n} a_{jp}\right)!}{\prod_{j=1}^{n} a_{jp}!}\prod_{i=1}^{n}\left(\mu_{ip}\Big/\sum_{j=1}^{n}\mu_{jp}\right)^{a_{ip}}\right\} \times \frac{\left(\sum_{j=1}^{n}\mu_{jp}\right)^{\sum_{i=1}^{n} a_{ip}}}{\left(\sum_{j=1}^{n} a_{jp}\right)!}\exp\left\{-\sum_{j=1}^{n}\mu_{jp}\right\}\right]. \qquad (2.10)$$
at time t after an inspection, and from equation (2.9) the expected number
of defects found at any inspection is given by equation (2.11).
The expected total number of defects detected in any way will be found to
equal $\lambda\Delta$, as it must.
The downtime D per unit time is modelled as
$$D(\Delta) = \frac{\lambda\Delta\,b(\Delta)\,c_2 + c_1}{\Delta + c_1}, \qquad (2.12)$$
where $c_1$ is the downtime due to an inspection, $c_2$ is the downtime due to
a failure, and $b(\Delta)$ is the fraction of defects manifesting as failures. From
equation (2.11)
The simplest form for F is the exponential distribution, $F(t) = 1 - \exp(-\xi t)$,
so that
$$\ell = d\log\lambda - \lambda M\Delta + \sum_{j=1}^{k}\log\{1 - \exp(-\xi t_j)\} + (d - k)\log\{1 - \exp(-\xi\Delta)\} - (d - k)\log\xi + \text{constant}. \qquad (2.15)$$
Since in general $1 - e^{-x} < x$ for x > 0, the RHS is always positive. Hence the
curvature of $\ell$ is always negative, and so all stationary values are maxima. It
follows that there is only one solution for $\hat\xi$, as if there were more there would
of necessity be a minimum of $\ell$ also. This is a practically useful result when
maximising likelihood functions numerically, as any maximum found by the
function optimizer must be the maximum.
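The uniqueness claim is easy to check numerically: evaluating the $\xi$-dependent terms of the log-likelihood (2.15) on a grid for a small invented data set, the profile should rise and then fall exactly once. All data values below are made up for illustration:

```python
import numpy as np

# Hypothetical data: d defects in total, k of them seen as failures at
# times tj within an inspection interval of length Delta.
d, Delta = 10, 10.0
tj = np.array([1.0, 3.0, 5.0])
k = len(tj)

def ell(xi):
    # the terms of (2.15) that depend on xi
    return (np.log1p(-np.exp(-xi * tj)).sum()
            + (d - k) * np.log1p(-np.exp(-xi * Delta))
            - (d - k) * np.log(xi))

xi_grid = np.linspace(0.01, 5.0, 2000)
vals = np.array([ell(x) for x in xi_grid])
rising = np.diff(vals) > 0
# a single interior maximum: the profile switches from rising to falling once
switches = int(np.sum(rising[:-1] & ~rising[1:]))
xi_hat = xi_grid[np.argmax(vals)]
```

Multiplying the score by $\xi$ gives a strictly decreasing function of $\xi$ for these terms, which is why exactly one switch is expected.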
$\{-\partial^2\ell/\partial\xi^2\,|_{\xi=\hat\xi}\}^{-1}$ estimates the variance of $\hat\xi$ for a particular realisation
of the random process, but it is also possible to derive the expected variance
$\{E\{-\partial^2\ell/\partial\xi^2\,|_{\xi=\hat\xi}\}\}^{-1}$, which may be shown to apply for large sample sizes
d. To derive this from equation (2.18), the sums are replaced by probability
integrals: in general
$$\frac{1}{d}\sum_{j=1}^{k} f(t_j) \to \int_0^{\Delta} \frac{F(t)}{\Delta}\,f(t)\,dt,$$
where $I(z) = \int_0^z x^2\,dx/(e^x - 1)$. The standard deviation of $\hat\xi$, $\sigma \propto d^{-1/2}$.
It is interesting to compare the variance of the ML estimator with that
of the intuitive estimator $\tilde\xi$ obtained by equating the observed and predicted
fraction of defects that manifest as failures, i.e.
$$\frac{1 - e^{-\tilde\xi\Delta}}{\tilde\xi\Delta} = 1 - k/d. \qquad (2.20)$$
Using the usual large-sample delta notation, where $\delta\xi = \tilde\xi - E\{\tilde\xi\}$ and will
be small, differentiating equation (2.20) with respect to $\xi$ gives
$$\{e^{-\xi\Delta} - (1 - e^{-\xi\Delta})/(\xi\Delta)\}\,\delta\xi/\xi = -\delta k/d.$$
Squaring and taking expectations, the RHS becomes $p(1-p)/d$ with $p = 1 - (1 - e^{-\xi\Delta})/(\xi\Delta)$,
as k obeys the binomial distribution, and substituting from equation (2.20) gives
$$\sigma_{\tilde\xi}^2 = \frac{\xi^2(1 - e^{-\xi\Delta})(\xi\Delta - 1 + e^{-\xi\Delta})}{d\,(1 - (1 + \xi\Delta)e^{-\xi\Delta})^2}. \qquad (2.21)$$
Figure 2.1 shows the large-sample variances of the ML and naive estimators.
The naive estimator is less efficient than the ML estimator, but its efficiency
approaches 100% as $\xi\Delta \to 0$, and from equations (2.19) and (2.20) both estimators then have variance $2\xi/(d\Delta)$. The ML estimator is intuitively better
because it uses the information about failure times in the data. This shows
the advantage of the ML approach.
Fig. 2.1. Large-sample variances $\sigma^2$ for method of moments and maximum likelihood
estimates of $\xi$, plotted against $\xi\Delta$, where $\xi^{-1}$ is the mean delay time.
(2.23)
$$\mathcal{L} = \prod_{p=1}^{q} \frac{\left(\prod_{i=1}^{n}\mu_{ip}^{a_{ip}}/a_{ip}!\right)(1 + \nu)(1 + 2\nu)\cdots\left(1 + \left(\sum_{i=1}^{n} a_{ip} - 1\right)\nu\right)}{\left(1 + \sum_{i=1}^{n}\mu_{ip}\nu\right)^{\gamma + \sum_{i=1}^{n} a_{ip}}} \qquad (2.25)$$
It can be readily seen that as $\gamma \to \infty$ the likelihood function reverts to its
original form, with all scale factors equal, as it must, because then v(x) is a
Dirac delta-function $\delta(x - 1)$.
What has been described can be understood from the frequentist viewpoint as a random-effects model, whose parameters are estimated by maximum likelihood. From a Bayesian viewpoint, $v(x\mid\gamma)$ is a prior distribution,
whose parameters have been (heretically) estimated from the data. The individual scale parameters $a_p$ must also be estimated, as they are needed for
the cost model. They are the usual means of posterior distributions, i.e.
the cost model. They are the usual means of posterior distributions, i.e.
1000 .cp(aiplx)v(xli')x dx
(2.26)
A
p
a = 1000 .cp(aiplx)v(xli') dx
This gives
"n aip + ,
"n
L...,j-1 A
ap = (2.27)
A
L.."i=1 pip + ,
A
i=1 ;=1
If there are no data available on a particular failure type, $\hat a_p = 1$, so that this
failure mode is taken as having the mean defect arrival scale factor.
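Equation (2.27) is a simple shrinkage formula, sketched below with made-up counts; a failure mode with plenty of data keeps roughly its empirical rate ratio, while a mode with no data at all is pulled to the prior mean of 1, as noted in the text:

```python
def eb_scale_factor(a_ip, mu_ip, gamma):
    """Posterior-mean scale factor of eq. (2.27): shrinks the raw ratio
    sum(a) / sum(mu) towards 1, more strongly for sparse failure modes."""
    return (sum(a_ip) + gamma) / (sum(mu_ip) + gamma)

# A mode with plenty of data keeps (almost) its empirical ratio...
rich = eb_scale_factor([8, 6, 7], [3.0, 3.0, 4.0], gamma=2.0)
# ...while a mode with no observed events is pulled to the prior mean 1.
empty = eb_scale_factor([], [], gamma=2.0)
```

Here `gamma` plays the role of $\hat\gamma$; larger values mean a tighter prior and stronger shrinkage towards 1.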
Rather than using the mean of the posterior distribution as a point estimate for $a_p$, it may be preferable to calculate the expected cost per unit
time conditional on values of the $a_p$, and then to take the expectation of cost
per unit time (or whatever quantity is to be optimised) with respect to the
(estimated) posterior distribution of the $a_p$.
Given more than one machine, they may be assumed identical, in which
case likelihood functions for each machine multiply to give the total likelihood
function, or the EB method may again be used, to regard key parameters for
each machine as drawn from a population of such parameters. This accounts
nicely for random differences in operating conditions or quality of parts, and
is discussed briefly in Baker and Wang (1993).
A general account of the EB method is given in Maritz and Lwin (1989).
2.5.1 Computational Problems. Note that without the EB modification,
the likelihood factorises:
$$\mathcal{L} = \prod_{p=1}^{q} \mathcal{L}_p,$$
and parameters for each failure mode can be estimated in turn; the estimates
of various parameters for a given failure mode will be mutually correlated,
but parameter estimates will not correlate with estimates of parameters for
other failure modes. The computational burden is light with a minimum of
two parameters per failure type. It becomes greater if (say) a common shape
parameter f3 is assumed for each defect arrival rate, because all parameters
3. Cost Models
The modelling process of model formulation, model fitting and model re-
finement has the benefit that it forces the investigator to examine his or
her assumptions, and to clarify the meaning of the data in discussions with
management. Some model parameters, such as the probability r of detecting
a defect that is present, are of intrinsic interest, and the modelling process
could thus lead to changes in practice.
However, the main aim of modelling the failure and inspection processes
and fitting the model to data is to be able to calculate from the model long-
term cost per unit time, or downtime per unit time, and to choose decision
variables (usually the interval $\Delta$ between inspections) to minimise one of
these measures of cost.
This is an area where more modelling effort should be applied, as existing
cost models are very simple. When the defect arrival process is a NHPP,
the rate of occurrence of failures will change (often it will increase) as the
system ages. Hence the frequency of inspection must also increase. In the cost
models given here, the NHPP must be approximated by a stepwise HPP, and
a different optimum inspection interval found for each step.
For the q failure-mode model, the cost per unit time is
"q {c(J) E{N(J)) + c(i) E{N(i))} + I
( A) = L.Jp=l P P P P
(3.1)
c~ .1+d '
where c~J) is the average cost of a failure for the pth failure mode, E{NJJ)) the
expected number of failures over the inspection interval, c~i) is the average
cost of repairing a defect at inspection for the pth failure mode, E(NJi))
the expected number of defects found at inspection, I is the cost of the
inspection, and d is the average downtime incurred. Expected numbers of
failures and defects are calculated using formulae (2.8) and (2.9), where now
actual inspection timings are replaced by an infinite sequence of inspections
occurring regularly every .1 time units.
Often some terms in equation (3.1) are negligible: d may be very
small, I may be small, or conversely all the cost of an inspection may be due
to downtime and the extra cost incurred per defect $c_p^{(i)}$ may be negligible.
In general $c(\Delta)$ has a minimum value, as long as the average total cost of
repairing a defect at inspection is less than that of a failure.
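To illustrate equation (3.1), the sketch below specialises it to a single failure mode with HPP defect arrivals of rate $\lambda$, exponential delay times of rate $\xi$ and perfect inspection, for which per interval $E(N^{(f)}) = \lambda(\Delta - (1 - e^{-\xi\Delta})/\xi)$ and $E(N^{(i)}) = \lambda(1 - e^{-\xi\Delta})/\xi$. All parameter values are illustrative, not taken from the chapter:

```python
import numpy as np

# Illustrative single-mode cost model.
lam, xi = 0.4, 0.25                  # defect arrival rate, delay-time rate
c_f, c_i, I, d = 100.0, 10.0, 20.0, 0.5

def cost_rate(Delta):
    # expected failures and inspection repairs per inspection interval
    n_fail = lam * (Delta - (1.0 - np.exp(-xi * Delta)) / xi)  # E(N^(f))
    n_insp = lam * (1.0 - np.exp(-xi * Delta)) / xi            # E(N^(i))
    return (c_f * n_fail + c_i * n_insp + I) / (Delta + d)

grid = np.linspace(0.5, 60.0, 5000)
costs = np.array([cost_rate(D) for D in grid])
D_opt = grid[np.argmin(costs)]
```

The cost rate blows up for very small $\Delta$ (inspection cost dominates) and grows again for large $\Delta$ (failures dominate), so a simple grid search finds the interior minimum.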
For the component tracking model, there must presumably either be more
than one machine under consideration, or the machine has had all major
components replaced several times. Unless one of these cases holds, the in-
vestigator will have been unable to obtain enough data to estimate the model
parameters. In both these cases, regularly spaced inspections are a reasonable
option (otherwise, for an expensive machine that was ageing, the frequency of
inspection should change with machine age). In the hospital equipment study,
machines such as infusion pumps are serviced every 6 months regardless of
age. To obtain the optimum policy, it is necessary to know the service lifetime
of machines, and the age distribution of machines in service. If machines are
purchased regularly, and old machines removed from service, equation (3.1)
still applies, where now the expected numbers of failures and defects, and the
average costs, are those for a machine chosen at random from those in service.
For the HPP defect arrival model, opportunistic inspections increase the
value of c_p^{(f)}. Besides repairing the part of the machine responsible for the
pth failure mode, a general inspection is carried out. If the cost of this
inspection is proportional to the number of defects found, then c_p^{(f)} will be an
increasing function of Δ. The simplest way to find the value of c_p^{(f)} numerically
for a given Δ is to simulate the process over a long time period.
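Such a simulation can be sketched as follows, for a single failure mode with HPP defect arrivals, exponential delay times, and an opportunistic general inspection at each failure that clears all outstanding defects at a cost per defect cleared; all names and parameter values are illustrative assumptions, not the paper's model.

```python
import random
random.seed(1)

lam, mean_h = 0.3, 4.0          # defect arrival rate, mean delay time (assumed)
c_base, c_extra = 100.0, 15.0   # base failure cost; cost per extra defect cleared

def mean_failure_cost(delta, horizon=100000.0):
    """Average cost per failure when each failure triggers an opportunistic
    general inspection; planned inspections occur every delta time units."""
    # pre-generate defect arrivals (HPP) with their would-be failure times
    arrivals, t = [], 0.0
    while True:
        t += random.expovariate(lam)
        if t > horizon:
            break
        arrivals.append((t, t + random.expovariate(1.0 / mean_h)))
    live, total, n_fail, i = [], 0.0, 0, 0
    next_insp = delta
    while True:
        fail = min((f for (_, f) in live), default=float("inf"))
        nxt_arr = arrivals[i][0] if i < len(arrivals) else float("inf")
        event = min(fail, nxt_arr, next_insp)
        if event > horizon:
            break
        if event == nxt_arr:
            live.append(arrivals[i]); i += 1
        elif event == fail:
            n_fail += 1
            total += c_base + c_extra * (len(live) - 1)  # opportunistic clearing cost
            live = []                                    # general inspection clears defects
        else:
            live = []                                    # planned inspection clears defects
            next_insp += delta
    return total / max(n_fail, 1)

for D in (2.0, 5.0, 10.0):
    print("Delta = %4.1f -> mean cost per failure %.1f" % (D, mean_failure_cost(D)))
```

Longer intervals leave more defects outstanding when a failure occurs, so the simulated average failure cost increases with Δ, as the text argues.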
This highlights a general problem with equation (3.1): some of the
parameters appearing in it may themselves be functions of Δ, and if so, the value of Δ
minimising the cost will change. In the recent study of Christer et al. (1995) it was
found that downtime per failure was an increasing function of Δ. This was
not due to opportunistic inspections. In the model used in Christer et al.
576 Rose Baker
(1995), all failure modes were lumped together, and it was thought that the
more expensive failure modes had longer delay-times, and so were more likely
to occur during long intervals between inspections. Discriminating between
different failure modes removes this difficulty.
A likely reason why c_p^{(i)} might increase with Δ is that repair time or cost
increases with the elapsed delay-time, i.e. the time since the defect became
visible. As time passes, defects grow and require more effort to fix, e.g. a
crack might grow in size. This behaviour can be modelled by allowing cost to
be a stochastic function of elapsed delay-time, giving a distribution function
Pr(C ≤ c) for cost c:
$$
\Pr(C \le c)_{ip} = r_i \sum_{j=1}^{i} \prod_{k=j}^{i-1} (1 - r_k) \int_{t_{j-1}}^{t_j} \lambda_p(x)\, \Pr(C \le c \mid t_i - x)\, dx. \qquad (3.2)
$$
Here Pr(C ≤ c | t_i − x) is the probability that the random cost does
not exceed c, given that the defect arose at time x. This can be modelled
by any survival distribution, for example a gamma distribution, whose mean
is an increasing function of ti - x. Model parameters can be estimated by
multiplying the likelihood function by the likelihood of observing the repair
costs.
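The extra factor in the likelihood can be sketched as follows, assuming (as suggested above) a gamma cost distribution whose mean increases with elapsed delay time; the linear mean function, the function names, and all values are illustrative assumptions.

```python
import math

def gamma_logpdf(c, shape, mean):
    """Log-density of a gamma distribution parameterised by shape and mean."""
    scale = mean / shape
    return ((shape - 1) * math.log(c) - c / scale
            - shape * math.log(scale) - math.lgamma(shape))

def cost_loglik(costs, taus, shape, a, b):
    """Log-likelihood of observed repair costs, gamma-distributed with mean
    a + b*tau, where tau is the elapsed delay time at repair; this term is
    added to the event log-likelihood when estimating the model."""
    return sum(gamma_logpdf(c, shape, a + b * t) for c, t in zip(costs, taus))

costs = [12.0, 30.0, 55.0]   # observed repair costs (invented data)
taus = [0.5, 2.0, 6.0]       # elapsed delay times at repair (invented data)
print(cost_loglik(costs, taus, shape=2.0, a=10.0, b=5.0))
```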
Unfortunately, when inspection has been carried out perfectly regularly
every Δ time units, and when λ_p(x) is constant, the dependence of the
distribution of cost on elapsed delay time at inspection cannot be estimated from
the likelihood. Only when there is very irregular maintenance, or there are
opportunistic inspections, can the likelihood function based on equation (3.2)
be used to estimate the cost parameters.
of time from last inspection. Failures occurring at many different times are
lumped together into one histogram class, and this gives a large enough number
of failures per class for the goodness of prediction to be assessed
both visually and by a chi-squared test. If inspection is effective at removing
defects, there should be few failures occurring soon after an inspection.
The observed and predicted numbers of faults found at inspection can
also be plotted, possibly breaking the period of observation up into several
intervals. Observed and predicted numbers of failures can likewise be plotted for
intervals over the period of observation, and this shows whether the function
h for NHPP models is of the right form.
The usual chi-squared statistic can be calculated for such tests. The number of
degrees of freedom should be the number of independent counts of events, minus
the number of fitted model parameters. However, the number of degrees
of freedom for the chi-squared is only known approximately: first, because
parameter estimation by maximum likelihood gives more accurate estimates
than those obtained by minimising a χ², so that fewer degrees of freedom
need be subtracted; and secondly, because only part of the data appears in any
one chi-squared, so that again fewer degrees of freedom should be subtracted.
This difficulty is not usually a real problem in practice.
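As a small worked illustration of the test described above (the observed and predicted counts are invented for the example, and the degrees of freedom are approximate, as just discussed):

```python
obs = [14, 9, 6, 4, 3, 2]                # failures per histogram class (invented)
pred = [12.1, 9.8, 7.0, 4.6, 2.9, 1.6]   # model-predicted counts (invented)

# Pearson chi-squared statistic comparing observed and predicted counts
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, pred))
n_params = 2                             # fitted model parameters, e.g. rate and mean delay
df = len(obs) - n_params                 # approximate degrees of freedom
print("chi-squared = %.2f on ~%d df" % (chi2, df))
```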
a suboptimal inspection policy will increase the cost per unit time away from the
minimum by some amount, called the excess cost. For a given sample
size, the expected excess cost can be calculated, and hence the sample size
required to achieve a given excess cost can be found.
These calculations are given in detail in Baker and Scarf (1995). The
important result can be seen without doing any mathematics: because
of the quadratic nature of a minimum, even large errors in the
estimated optimum Δ will incur only a small excess cost. In fact, the excess cost is
inversely proportional to sample size.
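The quadratic nature of the minimum is easy to illustrate with a toy cost rate of the common form a/Δ + bΔ; this functional form and the numbers are an assumption for illustration, not the paper's model.

```python
import math

a, b = 50.0, 2.0
c = lambda D: a / D + b * D              # toy convex cost rate (illustrative)
D_opt = math.sqrt(a / b)                 # exact minimiser of a/D + b*D

for err in (0.1, 0.2, 0.5):              # relative error in the estimated optimum
    D_hat = D_opt * (1 + err)
    excess = c(D_hat) / c(D_opt) - 1.0   # relative excess cost
    print("%.0f%% error in Delta -> %.2f%% excess cost" % (err * 100, excess * 100))
```

Even a 50% error in the estimated interval raises the cost rate by only about 8% here, which is the point being made.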
5. Case Studies
In a recent case study (Christer et al. 1995) the HPP model was used to
optimise maintenance practice for an extrusion press, a key item of plant
for a copper-products manufacturer in the NW of the UK. Fortunately, the
company had already tried various maintenance practices, such as failure
maintenance, and daily and weekly maintenance, so that total downtime per
hour of press operation could be found. Table 5.1 shows these results, and
Table 5.2 shows the results of fitting the HPP delay-time model.
As can be seen, five models for the delay-time distribution were fitted.
Model 1 is exponential, model 4 Weibull, while model 5 is a Weibull
distribution with the scale parameter α a random variate from a gamma
distribution. Thus model 4 generalises model 1, and model 5 generalises model 4.
Table 5.1. Percentage of downtime for an extrusion press under various mainte-
nance regimes. This includes downtime due to failures and downtime due to main-
tenance.

PM policy          percentage downtime per press hour
                   production record   objective method
no PM                    5.47                5.53
1 week PM cycle          4.06                4.05
1 day PM cycle           2.45                1.85
Table 5.2. Fitted values of parameters for the extrusion press data: models and
fitted values of parameters.

model                 (1)                  (2)                   (3)
choice F(h)      1 − e^{−αh}      1 − (1 + h/β)^{−γ}     1 − (1 − p)e^{−αh}
for t_{i−1} < t ≤ t_i, and ψ(t) → ψ(t_effective). The sum is nugatory if j > i − 1,
i.e. if i = 1, so that no inspection has yet occurred.
The survival function S(u) = 1 − G(u) proves unexpectedly complicated
when δ ≠ 0. Let S_0 be the survival function when δ = 0. The equation
$$
S_0(u) = \exp\left( -\int_0^u \psi(t)\, dt \right) \qquad (7.1)
$$
is the key to calculating S(u). For t_{i−1} < t < t_i, the hazard is
$\psi\bigl(t - \sum_{j=1}^{i-1} \mathrm{Min}\{t_j - t_{j-1}, \delta\}\bigr)$. The integral
$\int_0^u \psi(t_{\mathrm{effective}})\, dt$ must then be carried out piecewise, and is
$$
\int_0^u \psi(t_{\mathrm{effective}})\, dt = \sum_{i=1}^{n+1} \int_{t_{i-1}}^{t_i} \psi\left( t - \sum_{j=1}^{i-1} \mathrm{Min}\{t_j - t_{j-1}, \delta\} \right) dt, \qquad (7.2)
$$
where a total of n inspections have been carried out by time u from renewal,
and t_0 = 0, t_{n+1} = u.
It is now possible to write down the survival function S, using the equation
$$
\exp\left( -\int_{t_{i-1}}^{t_i} \psi(t)\, dt \right) = S_0(t_i) / S_0(t_{i-1}),
$$
derived from equation (7.1). Treating each term in the summation in equa-
tion (7.2) in this way, and remembering that S_0(0) = 1, finally
$$
S(u) = \prod_{i=1}^{n+1} \frac{ S_0\left( t_i - \sum_{j=1}^{i-1} \mathrm{Min}\{t_j - t_{j-1}, \delta\} \right) }{ S_0\left( t_{i-1} - \sum_{j=1}^{i-1} \mathrm{Min}\{t_j - t_{j-1}, \delta\} \right) }, \qquad (7.3)
$$
where u appears on the right-hand side in the guise of t_{n+1}. Clearly, for
exponential distributions the additional terms due to δ cancel, as they must,
because when the hazard ψ is a constant, rejuvenation can have no effect
upon it.
The pdf g(u) = −dS(u)/du obtained by differentiating equation (7.3) is
$$
g(u) = \psi\left( u - \sum_{j=1}^{n} \mathrm{Min}\{t_j - t_{j-1}, \delta\} \right) S(u)
$$
for u > t_n. In terms solely of the original survival function S_0 and pdf g_0,
the pdf is
$$
g(u) = g_0\left( u - \sum_{j=1}^{n} \mathrm{Min}\{t_j - t_{j-1}, \delta\} \right) S(u) \Big/ S_0\left( u - \sum_{j=1}^{n} \mathrm{Min}\{t_j - t_{j-1}, \delta\} \right),
$$
where S is as defined in equation (7.3).
It is now possible to compute G(u) and g(u) when δ is nonzero, provided the
original distribution function G_0(u) and pdf g_0(u) can be computed.
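This computation is straightforward to carry out numerically. The sketch below takes an illustrative Weibull baseline for S_0 and g_0 and regular inspection times; all function names and parameter values are assumptions for illustration.

```python
import math

def S0(t, shape=2.0, scale=10.0):
    """Baseline (delta = 0) survival function; a Weibull is an illustrative choice."""
    return math.exp(-((max(t, 0.0) / scale) ** shape))

def g0(t, shape=2.0, scale=10.0):
    """Baseline Weibull pdf, g0 = -dS0/dt."""
    t = max(t, 1e-12)
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))

def S(u, insp, delta, S0=S0):
    """Survival to u under rejuvenation, per equation (7.3): inspection i
    reduces effective age by min(t_i - t_{i-1}, delta)."""
    ts = [0.0] + [t for t in insp if t < u] + [u]
    prod, shift = 1.0, 0.0
    for i in range(1, len(ts)):
        prod *= S0(ts[i] - shift) / S0(ts[i - 1] - shift)
        shift += min(ts[i] - ts[i - 1], delta)
    return prod

def g(u, insp, delta):
    """pdf g(u) = g0(u - shift) S(u) / S0(u - shift), shift being the total rejuvenation."""
    ib = [t for t in insp if t < u]
    shift = sum(min(b - a, delta) for a, b in zip([0.0] + ib[:-1], ib))
    return g0(u - shift) * S(u, insp, delta) / S0(u - shift)

insp = [5.0, 10.0, 15.0]
print(S(20.0, insp, 0.0), S(20.0, insp, 3.0))  # rejuvenation raises survival (IFR case)
```

With δ = 0 the product telescopes back to S_0(u), and for an exponential baseline the δ-terms cancel exactly, matching the remark after equation (7.3).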
Whether rejuvenation would be an improvement depends on whether
the hazard of a defect developing is increasing or decreasing with age:
restoring the machine to an earlier and more unreliable state would not be
an advantage. The basic concept of changing the component's effective age is
still valid for such DFOM (decreasing force of mortality) distributions, but
here it is the increase in age that must be restricted. It is simplest to write
$$
t \to t_{\mathrm{effective}} = t + \sum_{j=1}^{i-1} \mathrm{Min}\{t_j - t_{j-1}, \delta\},
$$
and to define δ as the increase in age conferred by the inspection. However, for
DFOM distributions the rationale of this approach, the notion of restoration
to a younger and more reliable state, is lacking.
$$
\lambda \int_{\eta}^{\Delta+\eta} \{1 - F(u)\}\, du,
$$
and a failure intensity at time t from inspection of λF(t + η). As η → ∞,
no defects are found at inspection, and the failure intensity approaches the
defect arrival rate λ. The log-likelihood is
$$
\ell = d \log\lambda - \lambda M \Delta + \sum_{j=1}^{k} \log F(t_j + \eta)
+ (d - k) \log \int_{\eta}^{\Delta+\eta} \{1 - F(x)\}\, dx + \mathrm{constant}. \qquad (7.4)
$$
$$
b(\Delta, \eta) = \frac{1}{\Delta} \int_{\eta}^{\Delta+\eta} F(u)\, du, \qquad (7.5)
$$
This simple form makes calculation of costs per unit time easier than in the
'r-model'. For the exponential delay-time distribution, the estimating equation for η,
∂ℓ/∂η = 0, reduces to
$$
\sum_{j=1}^{k} \frac{1}{e^{\alpha(t_j + \eta)} - 1} = d - k.
$$
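The left side of this estimating equation is strictly decreasing in η, so it can be solved by bisection. A minimal sketch, assuming the exponential rate α is known; the function names and data are illustrative assumptions.

```python
import math

def lhs(eta, times, alpha):
    """Left side of the estimating equation (exponential delay times)."""
    return sum(1.0 / (math.exp(alpha * (t + eta)) - 1.0) for t in times)

def solve_eta(times, alpha, d_minus_k, lo=1e-9, hi=100.0):
    """Bisection: lhs is strictly decreasing in eta, so the sign change
    of lhs - (d - k) brackets the root."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lhs(mid, times, alpha) > d_minus_k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

times = [0.4, 1.1, 2.3, 3.0, 4.2]   # failure times since last inspection (invented)
alpha, d_minus_k = 0.3, 2.0
eta = solve_eta(times, alpha, d_minus_k)
print("eta ~ %.3f" % eta)
```

A root exists provided d − k lies below the left side evaluated at η = 0, which holds for these illustrative values.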
Acknowledgement. I would like to thank Professor Tony Christer, Dr. Philip Scarf,
and all my colleagues in the Maintenance Research Group for helpful and stimu-
lating discussions on this presentation of our joint work.
References
Baker, R.D. , Wang, W.: Estimating the Delay-Time Distribution of Faults in Re-
pairable Machinery from Failure Data. IMA Journal of Mathematics Applied
in Business and Industry 3, 259-281 (1991)
Baker, R.D., Wang, W.: Developing and Testing the Delay-Time model. Journal of
the OR Society 44, 361-374 (1993)
Cerone, P.: On a Simplified Delay-Time Model of Reliability of Equipment Subject
to Inspection Monitoring. J. Opl. Res. Soc. 42, 505-511 (1991)
Chilcott, J.B., Christer, A.H.: Modelling of Condition-Based Maintenance at the
Coal Face. International Journal of Production Economics 22, 1-11 (1991)
Christer, A.H.: Innovatory Decision Making. In: Bowen, K. , White, D.J.(eds.):
Proc. NATO Conference on Role and Effectiveness of Decision Theory in Prac-
tice (1976)
Christer, A.H.: Modelling Inspection Policies for Building Maintenance. J. Opl. Res.
Soc. 33, 723-732 (1982)
Christer, A.H.: Operational Research Applied to Industrial Maintenance and Re-
placement. In: Eglese, R., Rand, G. (eds.): Developments in Operational Research. Ox-
ford: Pergamon Press 1984, pp. 31-58
Christer, A.H.: Delay-Time Model of Reliability of Equipment Subject to Inspection
Monitoring. J. Opl. Res. Soc. 38, 329-334 (1987)
Christer, A.H.: Condition-Based Inspection Models of Major Civil-Engineering
Structures. J. Opl. Res. Soc. 39, 71-82 (1988)
Christer, A.H.: Modelling for Control of Maintenance for Production. In: On-
derhoud en Logistiek (Op weg naar integrale beheersing). Eindhoven: Samsom/Nive 1991a
Christer, A.H.: Prototype Modelling of Irregular Condition Monitoring of Produc-
tion Plant. IMA Journal of Mathematics Applied in Business and Industry 3,
219-232 (1991b)
Christer, A.H., Redmond, D.F.: A Recent Mathematical Development in Mainte-
nance Theory. IMA Journal of Mathematics Applied in Business and Industry
2, 97-108 (1990)
Christer, A.H., Redmond, D.F.: Revising Models of Maintenance and Inspection.
International Journal of Production Economics 24, 227-234 (1992)
Christer, A.H., Waller, W.M.: Delay Time Models of Industrial Maintenance Prob-
lems. J. Opl. Res. Soc. 35, 401-406 (1984a)
Christer, A.H., Waller, W.M.: An Operational Research Approach to Planned Main-
tenance: Modelling P.M. for a Vehicle Fleet. J. Opl. Res. Soc. 35, 967-984
(1984b)
Christer, A.H., Waller, W.M.: Reducing Production Downtime Using Delay-Time
Analysis. J. Opl. Res. Soc. 35, 499-512 (1984c)
Christer, A.H., Wang, W.: A Model of Condition Monitoring of a Production Plant.
International Journal of Production Research 9, 2199-2211 (1992)
Christer, A.H., Wang, W.: A Delay-Time Based Maintenance Model of a Multi-
component System. Technical Report MCS-94-13. Mathematics Dept., Salford
University (1994)
Christer, A.H., Whitelaw, J.: An O.R. Approach to Breakdown Maintenance Prob-
lem Recognition. J. Opl. Res. Soc. 34, 1041-1052 (1983)
Christer, A.H., Wang, W., Baker, R.D., Sharp, J.: Modelling Maintenance Practice
of Production Plants Using the Delay Time Concept. IMA Journal of Mathe-
matics Applied in Business and Industry 6, 67-83 (1995)
Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics. 4th edition. High
Wycombe: Griffin 1979
Maritz, J.S., Lwin, T.: Empirical Bayes Methods. London: Chapman and Hall 1989
Maintenance Optimisation with the Delay Time Model 587
O'Hagan, A.: Kendall's Advanced Theory of Statistics: Bayesian Inference. Vol. 2B.
London: Edward Arnold 1994
Pellegrin, C.: A Graphical Procedure for an On-Condition Maintenance Policy:
Imperfect-Inspection Model and Interpretation. IMA Journal of Mathematics
Applied in Business and Industry 3, 177-191 (1991)
Sakamoto, Y., Ishiguro, M., Kitagawa, G.: Akaike Information Criterion Statistics.
Tokyo: KTK Publishing House 1986
Shwartz, M., Plough, A.L.: Models to Aid in Cancer Screening Programs. In: Cor-
nell, R. (ed.): Statistical Methods for Cancer Studies. New York: Marcel Dekker
1984
Thomas, L.C., Gaver, D.P., Jacobs, P.A.: Inspection Models and Their Application.
IMA Journal of Mathematics Applied in Business and Industry 3, 283-303
(1991)
Xie, M.: On the Solution of Renewal-Type Integral Equations. Commun. Statist.
B 18, 281-293 (1989)
Valdez-Flores, C., Feldman, R.M.: A Survey of Preventive Maintenance Models
for Stochastically Deteriorating Single-Unit Systems. Naval Research Logistics
Quarterly 36, 419-446 (1989)