Abhijit Gosavi
Simulation-Based Optimization
Parametric Optimization Techniques and Reinforcement Learning
Second Edition
Operations Research/Computer Science
Interfaces Series
Volume 55
Series Editors:
Ramesh Sharda
Oklahoma State University, Stillwater, Oklahoma, USA
Stefan Voß
University of Hamburg, Hamburg, Germany
Abhijit Gosavi
Department of Engineering Management and Systems Engineering
Missouri University of Science and Technology
Rolla, MO, USA
ISSN 1387-666X
ISBN 978-1-4899-7490-7 ISBN 978-1-4899-7491-4 (eBook)
DOI 10.1007/978-1-4899-7491-4
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014947352
This book is written for students and researchers in the fields of industrial engineering, computer science, operations research, management science, electrical engineering, and applied mathematics. The aim is to introduce the reader to a subset of topics on simulation-based optimization of large-scale and complex stochastic (random) systems. Our goal is not to cover all the topics studied under this broad field. Rather, it is to expose the reader to a selected set of topics that have recently produced breakthroughs in this area. As such, much of the material focusses on some of the key recent advances.
Those working on problems involving stochastic discrete-event systems and optimization may find useful material here. Furthermore, the book is self-contained, but only to an extent; a background in elementary college calculus (basics of differential and integral calculus) and linear algebra (matrices and vectors) is expected. Much of the book attempts to cover the topics from an intuitive perspective that appeals to the engineer.
In this book, we have referred to any stochastic optimization problem related to a discrete-event system that can be solved with computer simulation as a simulation-optimization problem. Our focus is on those simulation-optimization techniques that do not require any a priori knowledge of the structure of the objective function (loss function), i.e., the closed form of the objective function or the underlying probabilistic structure. In this sense, the techniques we cover are model-free. The techniques we cover do, however, require all the information typically required to construct a simulation model of the discrete-event system.
Preface
Acknowledgments
List of Figures
List of Tables
1. BACKGROUND
 1 Motivation
  1.1 Main Branches
  1.2 Difficulties with Classical Optimization
  1.3 Recent Advances in Simulation Optimization
 2 Goals and Limitations
  2.1 Goals
  2.2 Limitations
 3 Notation
  3.1 Some Basic Conventions
  3.2 Vector Notation
  3.3 Notation for Matrices
  3.4 Notation for n-tuples
  3.5 Notation for Sets
  3.6 Notation for Sequences
  3.7 Notation for Transformations
  3.8 Max, Min, and Arg Max
 4 Organization
2. SIMULATION BASICS
 1 Chapter Overview
 2 Introduction
 3 Models
 4 Simulation Modeling
  4.1 Random Number Generation
Chapter 1
BACKGROUND
1. Motivation
This book seeks to introduce the reader to the rapidly evolving subject called simulation-based optimization. The topic is not entirely young: ever since computers started making an impact on scientific research, and it became possible to analyze random systems with computer programs that generate random numbers, engineers have wanted to optimize systems using simulation models. However, it is only recently that noteworthy success in realizing this objective has been met in practice.
Path-breaking work in computational operations research in areas such as non-linear programming (simultaneous perturbation) and dynamic programming (reinforcement learning) has now made it possible for us to use simulation in conjunction with optimization techniques. This has given simulation the kind of power that it did not have in the past, when simulation optimization was usually treated as a synonym for the relatively sluggish (although robust) response surface method.
The power of computers has increased dramatically over the years, of course, and it continues to increase. This has helped increase the speed of running computer programs, and has provided an incentive to study simulation-based optimization. But the overriding factor in favor of simulation optimization in recent times is the remarkable research that has taken place in various areas of computational operations research. We mean research that has either given birth to new methods or improved existing ones. Older simulation-based methods, e.g., response surfaces, were known to work with any given distribution for the random variables. However, as mentioned above, they required a large amount of computational time, and hence closed-form methods were often preferred to them. Some of the recent advances in simulation-based optimization seek to change this perception about simulation-based optimization.
2.1. Goals
The main goal of this book is to introduce the reader to a selection of
topics within simulation-based optimization of discrete-event systems,
concentrating on parametric optimization and control optimization.
We have elected to cover only a subset of topics in this vast field; some
of the key topics that we cover are: response surfaces with neural
networks, simultaneous perturbation, meta-heuristics (for simulation
optimization), reinforcement learning, and learning automata. Theoretical presentation of some algorithms has been supplemented by engineering case studies. The intent behind their presentation is to demonstrate the use of simulation-based optimization on real-life problems. We also hope that from reading the chapters related to convergence, the reader will gain an understanding of the theoretical methods used to mathematically establish that a given algorithm works. Overwhelming the reader with mathematical convergence arguments (e.g., theorems and proofs) was not our intention, and therefore material of that nature is covered towards the end in separate chapters.
A central theme in this book is the development of optimization models that can be combined easily with simulators, in order to optimize complex systems for which analytical models are not easy to construct. Consequently, the focus is on the optimization model: in the context of parametric (static) optimization, the underlying system is assumed to have no special structural properties, while in the context of control (dynamic) optimization, the underlying system is assumed to be driven by Markov chains.
2.2. Limitations
Firstly, we note that we have restricted our discussion to discrete-
event systems. Therefore, systems modeled by continuous-event
dynamics, e.g., Brownian motion or deterministic systems modeled
by differential equations, are outside the scope of this text. Secondly,
within control optimization, we concentrate only on systems governed
by Markov chains, in particular the Markov and semi-Markov decision
processes. Hence, systems modeled by partially observable Markov
processes and other kinds of jump processes are not studied here.
Finally, and very importantly, we have not covered any model-based
techniques, i.e., techniques that exploit the structure of the problem.
Usually, model-based techniques assume some prior knowledge of the
problem structure, which could either be the objective function’s
properties (in parametric optimization) or the transition probability
functions (in control optimization). Rather, our focus is on model-
free techniques that require no prior knowledge of the system (other
than, of course, what is needed to construct a simulation model).
It is also important to point out that the algorithms presented in this book are restricted to systems (a) for which the distributions of the random variables are known (or can be determined via data collection) and (b) that are known to reach steady state. Non-stationary systems or unstable systems that never reach steady state, perhaps due to the occurrence of rare events, are not considered in this book. Finally, we note that any simulation model continues to be accurate only if the underlying random variables continue to follow the distributions assumed. If the distributions change, our simulation-optimization model will break down and should not be used.
3. Notation
We now define much of the notation used in this book. The discussion is in general terms and refers to some conventions that we
have adopted. Vector notation has been avoided as much as possible;
although it is more compact and elegant in comparison to component
notation, we believe that component notation, in which all quantities
are scalar, is usually easier to understand.
The max norm of a vector x with components x(1), x(2), . . . , x(N) is defined as ||x||_\infty = \max_i |x(i)|, where |a| denotes the absolute value of a. The max norm will be equivalent to the sup norm or the infinity norm (for the analysis in this book). Whenever ||.|| is used without a subscript, it will be assumed to equal the max norm.
If \{a_n\}_{n=1}^{\infty} = \{10, 20, 30, \ldots\}, then a_3 = 30.
and
(x_5, y_5) = T(x_4, y_4) = T^2(x_3, y_3).
x = \max_{i \in S} [a(i)].
This means that x equals the maximum of the values that a can assume. So if set S is defined as below:
S = {1, 2, 3},
and
a(1) = 1, a(2) = 10, and a(3) = −2,
then x = 10. Similarly, min will be used in the context of the minimum
value. Now read the following notation carefully.
y = \arg\max_{i \in S} [a(i)].
Here, y denotes the argument or the element index associated with the
maximum value. So, if set S and the values of a are defined as above,
then
y = 2,
so that a(y) is the maximum value for a. It is to be noted that arg min has a similar meaning in the context of the minimum.
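To make this notation concrete, the following minimal C sketch (using the values of a(i) from the example above) computes both the max and the arg max over the set S = {1, 2, 3}; it is an illustration, not code from the book.

```c
#include <stdio.h>

/* A minimal sketch of max and arg max over the finite set S = {1, 2, 3},
   using the values a(1) = 1, a(2) = 10, a(3) = -2 from the example above. */
int main(void)
{
    double a[] = {1.0, 10.0, -2.0};   /* a(1), a(2), a(3) */
    int y = 1;                        /* arg max, 1-indexed */
    double x = a[0];                  /* running maximum */

    for (int i = 2; i <= 3; i++) {
        if (a[i - 1] > x) { x = a[i - 1]; y = i; }
    }
    printf("x = %g, y = %d\n", x, y); /* prints x = 10, y = 2 */
    return 0;
}
```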
In Table 1.1, we present a list of acronyms and abbreviations that
we have used in this book.
4. Organization
The rest of this book is organized as follows. Chap. 2 covers basic concepts related to discrete-event simulation. Chap. 3 is meant to present an overview of optimization with simulation. Chap. 4 deals with the response surface methodology, which is used in conjunction with simulation optimization. This chapter also presents the topic of neural networks in some detail. Chap. 5 covers the main techniques for parametric optimization with simulation. Chap. 6 discusses the classical theory of stochastic dynamic programming. Chap. 7 focuses on reinforcement learning. Chap. 8 covers automata theory in the context of solving Markov and semi-Markov decision problems. Chap. 9 deals with some fundamental concepts from mathematical analysis. These concepts will be needed for understanding the subsequent chapters on convergence. Convergence issues related to parametric optimization methods are presented in Chap. 10, while Chap. 11 discusses convergence issues related to control optimization methods.
Computer codes for this book can be found at:
http://web.mst.edu/~gosavia/bookcodes.html
Chapter 2
SIMULATION BASICS
1. Chapter Overview
This chapter has been written to introduce the topic of discrete-
event simulation. To comprehend the material presented in this chapter, some background in the theory of probability is needed, some of
which is in the Appendix. Two of the main topics covered in this
chapter are random number generation and simulation modeling of
random systems. Readers familiar with this material can skip this
chapter without loss of continuity.
2. Introduction
A system is usually defined as a collection of entities that interact
with each other. A simple example is the queue that forms in front
of a teller in a bank: see Fig. 2.1. The entities in this system are the
people (customers), who arrive to get served, and the teller (server),
who provides service.
The behavior of a system can be described in terms of the so-called
state of the system. In our queuing system, one possible definition for
the system state is the length of the queue, i.e., the number of people
waiting in the queue. Frequently, we are interested in how a system
changes over time, i.e., how the state changes as time passes. In a real
banking queue, the queue length fluctuates with time, and thus the
behavior of the system can also be said to change with time. Systems
which change their state with time are called dynamic systems. In this
book, we will restrict our attention to discrete-event systems. In a
Server
Arrival
Departure
Customers
3. Models
To understand, to analyze, and to predict the behavior of systems (both random and deterministic), operations researchers construct models. These models are typically abstract models, unlike physical models, e.g., a miniature airplane. Abstract models take the form of equations, functions, inequations (inequalities), computer programs, etc. To understand how useful abstract models can
be, consider the simple model from Newtonian physics: v = u + gt.
This equation predicts the speed of a freely-falling body that has been
in the air for t time units after starting its descent at a speed of u.
4. Simulation Modeling
Determining the distributions of the governing random variables
is the first step towards modeling a stochastic system, regardless of
whether mathematical or simulation models are used. In mathematical
models, the pdfs (or cdfs) of the governing random variables are used in
the closed forms obtained. In simulation models, the pdfs (or cdfs) are
used to generate random numbers for the variables concerned. These
random numbers are then used to imitate, within a computer, the
behavior of the system. Imitating the behavior of a system essentially
means re-creating the events that occur in the real-world system that
is being imitated.
How does one determine the distribution of a random variable? Usually, one has to collect data on the values of the random variable from the real-life system. It is then usually possible to fit a distribution to this data, a process called distribution fitting. For a good discussion on distribution fitting, see e.g., [188].
An important issue in stochastic analysis is related to the number
of random variables in the system. Generally, the larger the number of governing random variables in a system, the more complicated its analysis. This is especially true of analysis with mathematical models. On the other hand, simulating a system with several governing random variables has become a trivial task with modern-day simulation packages.
Our main strategy in simulation is to re-create, within a computer
program, the events that take place in the real-life system. The re-
creation of events is based on using suitable values for the governing
random variables. For this, one needs a mechanism for generating
values of the governing random variables. We will first discuss how to
create random values for these variables and then discuss how to use
the random values to re-create the events.
Truly random numbers must be generated from a physical process or obtained from a real system. Having said that, for all practical purposes, artificial (or pseudo) random numbers generated by computers are usually sufficient in simulations. Artificial random number generation schemes are required to satisfy some statistical tests in order to be acceptable. Needless to add, the schemes that we have at our disposal today do pass these tests. We will discuss one such scheme. The so-called linear congruential generator of random numbers [221] is given by the following equation:
I_{j+1} = (a I_j) \bmod m,
where a and m are suitably chosen positive integers.
Suppose the scheme produces integers such as 12, 4, 8, 16. Dividing each integer by m yields one random number between 0 and 1 for every integer in the original set. Then, for large values of m, this set of random numbers from 0 to 1 will approximate a set of natural random numbers from the uniform distribution. Recall that the pdf of the random variable has the same value at every point in the uniform distribution.
The maximum number of integers that may be generated in this
process before it starts repeating itself is m − 1. Also, if I0 , which is
known as the seed, equals 0, the sequence will only contain zeroes.
A suitable choice of a and m yields a set of m − 1 numbers such that each integer between 1 and m − 1 occurs once at some point.
An important question is: Is it acceptable to use a sequence of ran-
dom numbers with repeating subsets, e.g., (12, 4, 8, 16, 12, 4, 8, 16, . . .)?
The answer is no because the numbers (12, 4, 8, 16) repeat and are
therefore deterministic. This is a serious problem. Now, unfortunately,
random numbers from the linear congruential generator must repeat after a finite period. Therefore, the only way out of this is to generate a sequence that has a sufficiently long period, such that we are finished with using the sequence before it begins to repeat itself. Fortunately,
if m = 2^31 − 1, then with a suitable choice of a, it is possible to generate a sequence that has a period of m − 1. Thus, if the number of random numbers needed is less than m − 1 = 2,147,483,646 (for most applications, this is sufficient), we have a set of random numbers with no repetitions.
Schemes with small periods produce erroneous results. Furthermore, repeating numbers are also not independent. In fact, repetitions
imply that the numbers stray far away from the uniform distribution
that they seek to approximate.
If the largest number in a computer's memory is 2^31 − 1, then a legal value for a that goes with m = 2^31 − 1 is 16,807. These values of m and a cannot be implemented naively in the computer program.
The reason is easy to see. In the multiplication of a by Ij , where the
latter can be of the order of m, one runs into trouble as the product is
often larger than m, the largest number that the computer can store.
A clever trick from Schrage [266] helps us circumvent this difficulty. Let [x/y] denote the integer part of the quotient obtained after dividing x by y. Using Schrage's approximate factorization, if q = [m/a] and r = m \bmod a, then a I_j \bmod m can be computed without overflow as a(I_j \bmod q) − r[I_j/q], adding m to the result whenever it turns out negative.
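The following minimal C sketch, assuming the standard choices a = 16,807 and m = 2^31 − 1 discussed above, shows how Schrage's factorization can be coded; it is an illustration, not the book's own code.

```c
#include <stdio.h>

/* A minimal sketch of the linear congruential generator
   I_{j+1} = (a * I_j) mod m with a = 16807 and m = 2^31 - 1,
   using Schrage's approximate factorization to avoid overflow. */
#define A 16807L
#define M 2147483647L          /* m = 2^31 - 1 */
#define Q (M / A)              /* q = [m/a] = 127773 */
#define R (M % A)              /* r = m mod a = 2836 */

static long seed = 1;          /* I_0, the seed; must not be 0 */

double rand01(void)            /* returns a random number in (0, 1) */
{
    long k = seed / Q;
    seed = A * (seed - k * Q) - R * k;  /* a*(I mod q) - r*[I/q] */
    if (seed < 0)
        seed += M;                      /* correct a negative result */
    return (double)seed / M;
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("%f\n", rand01());
    return 0;
}
```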
To re-create events within a computer program, we will use the single-server queuing example that has been discussed above.
In the single-server queue, there are two governing variables, both of
which are random:
1. The time between successive arrivals to the system: ta
2. The time taken by the server to give service to one customer: ts .
We now generate values for the elements of the two sequences. Since
we know the distributions, it is possible to generate values for them.
E.g.,
let the first 7 values for ta be: 10.1, 2.3, 1, 0.9, 3.5, 1.2, 6.4
and those for ts be: 0.1, 3.2, 1.19, 4.9, 1.1, 1.7, 1.5.
Now \{t(n)\}_{n=1}^{\infty} will denote the following sequence: \{t(1), t(2), \ldots\}.
These values, we will show below, will lead to re-enacting the real-life
queue.
If one observes the queue from real life and collects data for any of
these sequences from there, the values obtained may not necessarily
be identical to those shown above. Then, how will the above lead to a
re-enactment of the real-life system within our simulation model? The
answer is that the elements of this sequence belong to the distribution
of the random variable in the real-life system. In other words, the
above sequence could very well be a sequence from the real-life system.
We will discuss later why this is sufficient for our goals in simulation
modeling.
Now, from the two sequences, one can construct a sequence of the
events that occur. The events here are of two types:
1. A customer enters the system (arrival).
2. A customer is serviced and leaves the system (departure).
Our task of re-creating events boils down to the task of finding
the time of occurrence of each event. In the single-server queuing
system (see Fig. 2.2), when a customer arrives to find that the server
is idle, he or she directly goes to the server without waiting. An
arriving customer who finds that the server is busy providing service
to someone becomes either the first person in the queue or joins the
queue’s end. The arrivals in this case, we will assume, occur regardless
of the number of people waiting in the queue.
To analyze the behavior of our system, the first task is to determine the clock time of each event as it occurs. This task, as stated above, can be accomplished using the generated values of t_a and t_s.
[Figure 2.2: time axis starting at 0.0, showing the arrival epochs on one side and the departure epochs 10.2, 15.6, 16.79, 21.69, 22.79, and 24.49 on the other.]
The third customer arrives at a clock time of 13.4, but the second departure occurs only at 15.6. Hence the third customer must wait in the queue till 15.6, when he/she joins service. The third departure will thus occur at a clock time of 15.6 plus the service time of the third customer, which is 1.19. Therefore, the departure of the third customer will occur at a clock time of 15.6 + 1.19 = 16.79. The fourth customer arrives at a clock
time of 14.3 but the third departs only at 16.79. It is clear that the
fourth customer will depart at a clock time of 16.79 + 4.9 = 21.69. In
this way, we can find that the fifth departure will occur at a clock time
of 22.79 and the sixth at a clock time of 24.49. The seventh arrival
occurs at a clock time of 25.4, which is after the sixth customer has
departed. Hence when the seventh customer enters the system, there
is nobody in the queue, and the seventh customer departs some time
after the clock time of 25.4. We will analyze the system until the time
when the clock strikes 25.4.
Now, from the sequence of events constructed, we can collect data
related to system parameters of interest. First consider server utilization. From the observations above, it is clear that the server was idle from a clock time of 0 until the clock struck 10.1, i.e., for a time interval of 10.1. Then again it was idle from the time 10.2 (the clock
time of the first departure) until the second arrival at a clock time of
12.4, i.e., for 2.2 time units. Finally, it was idle from the time 24.49
(the clock time of the sixth departure) until the seventh arrival at a clock time of 25.4, i.e., for 0.91 time units. Thus, based on our observations, we can state that the system was idle for a total time of 10.1 + 2.2 + 0.91 = 13.21 out of the total time of 25.4 time units for
which we observed the system. Then, the server utilization (fraction of time for which the server was busy) is clearly: 1 − 13.21/25.4 = 0.4799.
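The re-creation of events described above is easy to automate. The following minimal C sketch, using the inter-arrival and service times of our example, computes the departure epochs and the server utilization over the horizon observed (25.4 time units); it reproduces the value 0.4799 obtained above.

```c
#include <stdio.h>

/* A minimal sketch that re-creates the events of the single-server queue
   from the inter-arrival times (ta) and service times (ts) used in the text,
   and computes the server utilization until the seventh arrival. */
int main(void)
{
    double ta[7] = {10.1, 2.3, 1.0, 0.9, 3.5, 1.2, 6.4}; /* inter-arrival times */
    double ts[7] = {0.1, 3.2, 1.19, 4.9, 1.1, 1.7, 1.5}; /* service times */
    double arrival = 0.0, depart = 0.0, busy = 0.0;

    for (int i = 0; i < 6; i++) {                /* first six customers */
        arrival += ta[i];
        /* service starts on arrival if the server is free, else at the
           previous departure epoch */
        double start = (arrival > depart) ? arrival : depart;
        depart = start + ts[i];
        busy += ts[i];
        printf("customer %d: arrives %.2f, departs %.2f\n",
               i + 1, arrival, depart);
    }
    double horizon = arrival + ta[6];            /* 7th arrival: 25.4 */
    printf("utilization = %.4f\n", busy / horizon);  /* prints 0.4799 */
    return 0;
}
```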
If one were to create very long sequences for the inter-arrival times
and the service times, one could then obtain estimates of the utilization
of the server over a long run. Of course, this kind of a task should
be left to the computer, but the point is that computer programs are
thus able to collect estimates of parameters measured over a long run.
It should now be clear to the reader that although the sequences generated may not be identical to sequences obtained from actual observation of the original system, what we are really interested in are the estimates of system parameters, e.g., long-run utilization. As long as the sequences of values for the governing random variables are generated from the correct distributions, reliable estimates of these parameters
can be obtained. For instance, an estimate like long-run utilization
will approach a constant as the simulation horizon (25.4 time units in
our example) approaches infinity.
Many other parameters for the queuing system can also be measured
similarly. Let E[W ] denote the average customer waiting time in the
queue. Intuitively, it follows that the average waiting time can be
found by summing the waiting times of a large number (read infinity)
of customers and dividing the sum by the number of customers. Thus
if wi denotes the queue waiting time of the ith customer, the long-run
average waiting time should be
E[W] = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} w_i}{n}.   (2.3)
Similarly, the long-run average number in the queue can be defined as:
E[Q] = \lim_{T \to \infty} \frac{\int_0^T Q(t)\,dt}{T},   (2.4)
where Q(t) denotes the number in the queue at time t. Now, to use
these definitions, in practice, one must treat ∞ as a large number.
For estimating the long-run average waiting time, we can use the
following formula:
\tilde{W} = \frac{\sum_{i=1}^{n} w_i}{n},   (2.5)
If \tilde{W}_i denotes such an estimate obtained from the ith replication, then
E[W] \approx \frac{\sum_{i=1}^{k} \tilde{W}_i}{k},
provided k is large enough.
Averaging over many independent estimates (obtained from many
replications), therefore, provides a good estimate for the true mean of
the parameter we are interested in. However, doing multiple replications is not the only way to generate independent samples. There are
other methods, e.g., the batch means method, which uses one long
replication. The batch means method is a very intelligent method (see
Schmeiser [264]) that divides the output data from a long replication
into a small number of large batches, after deletion of some data. The
means of these batches, it can be shown, can be treated as independent
samples. These samples are then used to estimate the mean. We now
present the mathematical result that allows us to perform statistical
computations from means.
5. Concluding Remarks
Simulation of dynamic systems using random numbers was the main
topic covered in this chapter. However, this book is not about the
methodology of simulation, and hence our discussion was not comprehensive. Nevertheless, it is an important topic in the context of
simulation-based optimization, and the reader is strongly urged to get
a clear understanding of it. A detailed discussion on writing simulation
programs in C can be found in [234, 188]. For an in-depth discussion
on tests for random numbers, see Knuth [177].
Chapter 3
SIMULATION-BASED OPTIMIZATION: AN OVERVIEW
1. Chapter Overview
The purpose of this short chapter is to discuss the role played by computer simulation in simulation-based optimization. Simulation-based optimization revolves around methods that require the maximization (or minimization) of the net rewards (or costs) obtained from a random system. We will be concerned with two types of optimization problems: (1) parametric optimization (also called static optimization) and (2) control optimization (also called dynamic optimization).
2. Parametric Optimization
Parametric optimization is the problem of finding the values of decision variables (parameters) that maximize or minimize some function of the decision variables. In general, we can express this problem as: Maximize or Minimize f(x(1), x(2), . . . , x(N)), where the N decision variables are x(1), x(2), . . . , x(N). It is also possible that there are some constraints on the values of the decision variables.
In the above, f (.) denotes a function of the decision variables. It is
generally referred to as the objective function. It also goes by other
names, e.g., the performance metric (measure), the cost function, the
loss function, and the penalty function. Now, consider the following
example.
b. Minimize
f(x) = \int_{-\infty}^{\infty} 8[x − 5]^{−0.3} g(x)\,dx
Why avoid the closed form? The main reason is that in many real-world stochastic problems, the objective function is too complex to be obtained in closed form. Simulation, however, can be used to generate independent samples X_1, X_2, . . . , X_n of a random variable X whose mean equals the objective function value, so that:
E(X) \approx \frac{X_1 + X_2 + \cdots + X_n}{n}.
The above follows from the strong law of large numbers (see Theorem
2.1), provided the samples are independent and n is sufficiently large.
This “estimate” plays the role of the objective function value.
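As a rough illustration, the following C sketch shows how such an estimate might be formed. Here simulate_once() is a hypothetical stand-in for one replication of a simulation model, and the noisy response inside it is invented purely for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* A minimal sketch: estimate the objective function value E[X] at a given
   decision variable x by averaging n independent simulation samples. */
double simulate_once(double x)
{
    /* hypothetical noisy response: (x - 3)^2 plus uniform noise in [-0.5, 0.5] */
    double noise = (double)rand() / RAND_MAX - 0.5;
    return (x - 3.0) * (x - 3.0) + noise;
}

double estimate_f(double x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += simulate_once(x);
    return sum / n;               /* (X_1 + ... + X_n) / n */
}

int main(void)
{
    printf("f(2.0) is approximately %f\n", estimate_f(2.0, 10000));
    return 0;
}
```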
Combining simulation with numerical parametric optimization
methods is easier said than done. There are many reasons for this.
First, the estimate of the objective function is not perfect and contains "noise." Fortunately, the effect of noise can often be minimized.
Second, parametric optimization methods that require a very large
number of function evaluations to generate a good solution may not
be of much use in practice, since even one function evaluation via
simulation usually takes a considerable amount of computer time (one
function evaluation in turn requires several samples, i.e., replications
or batches).
We would like to reiterate that the role simulation can play in parametric optimization is limited to estimating the function value. Simulation on its own is not an optimization technique. But, as stated
above, combining simulation with optimization is possible in many
cases, and this opens an avenue along which many real-life systems
may be optimized. In subsequent chapters, we will deal with a number
of techniques that can be combined with simulation to obtain solutions
in a reasonable amount of computer time.
3. Control Optimization
The problem of control optimization is different from the problem of parametric optimization in many respects. Hence, considerable work in operations research has occurred in developing specialized techniques
for control optimization.
A system is defined as a collection of entities (such as people and
machines) that interact with each other. A dynamic system is one
in which the system changes in some way from time to time. To
detect changes in the system, we describe the system using a numerical
attribute called state. Then, a change in the value of the attribute can
be interpreted as a change in the system.
A stochastic system is a dynamic system in which the state changes
randomly. For example, consider a queue that builds up in front of
a counter in a supermarket. Let the state of the system be denoted
by the number of people waiting in the queue. Then, clearly, the queue is a stochastic system because the number of people in the queue fluctuates randomly. The randomness in the queuing system could be due to the random nature of the inter-arrival and service times.
The problem may be expressed as: Maximize f(μ(1), μ(2), . . . , μ(|S|)), where μ(i) denotes the action selected in state i and f(.) denotes the objective function. In large problems, |S| may be of the order of thousands or millions.
Dynamic programming is a well-known and efficient technique for
solving many control optimization problems encountered in discrete-
event systems governed by Markov chains. It requires the computation
of a so-called value function for every state.
Simulation's role: It turns out that every element of the value function of dynamic programming can be expressed as an expectation of a random variable. Also, fortunately, it is the case that simulation can
be used to generate samples of this random variable. Let us denote
the random variable by X and its ith sample by Xi . Then, using the
strong law of large numbers (see Theorem 2.1), the value function at
each state can be estimated by using:
E(X) \approx \frac{X_1 + X_2 + \cdots + X_n}{n},
provided n is sufficiently large and the samples are independent. As stated above, simulation can be used in conjunction with dynamic programming to generate a large number of independent samples of this random variable.
4. Concluding Remarks
The goal of this chapter was to introduce the reader to two types of problems that will be solved in this book, namely parametric optimization and control optimization (of systems with the Markov property).
We remind the reader that parametric optimization is also popularly
known as static optimization, while control optimization is popularly
known as dynamic optimization.
The book is not written to be a comprehensive source on the topic
of “simulation-based optimization.” Rather, our treatment is aimed at
providing an introduction to this topic with a focus on some important
breakthroughs in this area. In particular, our treatment of parametric
optimization is devoted to model-free methods that do not require any
properties of the objective function's closed form. In the case of control optimization, we only cover problems related to systems governed by Markov chains whose transition models are hard to obtain.
Chapter 4
PARAMETRIC OPTIMIZATION: RESPONSE SURFACES AND NEURAL NETWORKS
1. Chapter Overview
This chapter will discuss one of the oldest simulation-based methods
of parametric optimization, namely, the response surface method
(RSM). While RSM is admittedly primitive for the purposes of simulation optimization, it is still a very robust technique that is often
used when other methods fail. It hinges on a rather simple idea,
which is to obtain an approximate form of the objective function by
simulating the system at a finite number of points carefully sampled
from the function space. Traditionally, RSM has used regression
over the sampled points to find an approximate form of the objective
function.
We will also discuss a more powerful alternative to regression in this chapter, namely, neural networks. Our analysis of neural networks will concentrate on exploring their roots, which lie in the principles of steepest gradient descent (or steepest descent for short) and least square error minimization, and on their use in simulation optimization. We will first discuss the theory of regression-based traditional response surfaces. Thereafter, we will present a response surface technique that uses neural networks, which we call the neuro-response surface technique.
2. RSM: An Overview
The problem considered in this chapter is the "parametric-optimization problem" discussed in Chap. 3. For the sake of convenience, we reproduce the problem statement here.
3. RSM: Details
As stated above, RSM consists of several steps. In what follows, we
will discuss each step in some detail.
3.1. Sampling
Sampling of points (data pieces) from the function space is an
issue that has been studied by statisticians for several years [213].
Proper sampling of the function space requires a good design of experiments.
[Figure 4.1: sampled points in the function space; each x denotes a sampled point at which simulation is performed.]
minimize SSE \equiv \sum_{p=1}^{n} (e_p)^2.
\frac{\partial}{\partial a}(SSE) = 0.
Calculating the partial derivative, the above becomes:
2 \sum_{p=1}^{n} (y_p − a − b x_p)(−1) = 0, which simplifies to:
na + b \sum_{p=1}^{n} x_p = \sum_{p=1}^{n} y_p, noting that \sum_{p=1}^{n} 1 = n.   (4.3)
Similarly, \frac{\partial}{\partial b}(SSE) = 0 implies that: 2 \sum_{p=1}^{n} (y_p − a − b x_p)(−x_p) = 0,
which simplifies to: a \sum_{p=1}^{n} x_p + b \sum_{p=1}^{n} x_p^2 = \sum_{p=1}^{n} x_p y_p.   (4.4)
which when solved yield a = 2.2759 and b = 0.1879. Thus the metamodel is: y = 2.2759 + 0.1879x.
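The normal equations (4.3) and (4.4) are easily solved on a computer. The following minimal C sketch does so for an illustrative data set; the data points below are hypothetical and are not the data pieces of the example above.

```c
#include <stdio.h>

/* A minimal sketch that solves the normal equations (4.3)-(4.4) for the
   straight-line metamodel y = a + b x; the data points are illustrative. */
int main(void)
{
    double xs[] = {1, 2, 3, 4, 5};
    double ys[] = {2.5, 2.6, 2.9, 3.0, 3.2};
    int n = 5;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;

    for (int p = 0; p < n; p++) {
        sx += xs[p]; sy += ys[p];
        sxx += xs[p] * xs[p]; sxy += xs[p] * ys[p];
    }
    /* solve: n a + b sx = sy  and  a sx + b sxx = sxy */
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double a = (sy - b * sx) / n;
    printf("a = %f, b = %f\n", a, b);
    return 0;
}
```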
φ = a + bx + cy.   (4.5)
For a metamodel with three independent variables, φ = a + bx + cy + dz, the corresponding normal equations are:
na + b \sum_{p=1}^{n} x_p + c \sum_{p=1}^{n} y_p + d \sum_{p=1}^{n} z_p = \sum_{p=1}^{n} φ_p,
a \sum_{p=1}^{n} x_p + b \sum_{p=1}^{n} x_p^2 + c \sum_{p=1}^{n} x_p y_p + d \sum_{p=1}^{n} x_p z_p = \sum_{p=1}^{n} x_p φ_p,
a \sum_{p=1}^{n} y_p + b \sum_{p=1}^{n} y_p x_p + c \sum_{p=1}^{n} y_p^2 + d \sum_{p=1}^{n} y_p z_p = \sum_{p=1}^{n} y_p φ_p,
and
a \sum_{p=1}^{n} z_p + b \sum_{p=1}^{n} z_p x_p + c \sum_{p=1}^{n} z_p y_p + d \sum_{p=1}^{n} z_p^2 = \sum_{p=1}^{n} z_p φ_p.
[Figure 4.2: the function space divided into zones A, B, C, and D, each containing sampled points (marked x) at which simulation is performed.]
y = a + bx + cx^2,   (4.6)
which can be converted to the equation of a plane, y = a + bx + cz, by setting z = x^2.
With this replacement, the equations of the plane can be used for the metamodel. See Fig. 4.3 for a non-linear form with one independent variable and Fig. 4.4 for a non-linear form with two independent variables. Other non-linear forms can be similarly obtained by using the mechanism of regression explained in the case of a straight line or plane.
[Figure 4.3: a non-linear metamodel form with one independent variable; each x denotes a sampled point at which simulation is performed.]
[Figure 4.4: a non-linear metamodel form with two independent variables, plotted against the X-, Y-, and Z-axes.]
The coefficient of determination is defined as r^2 = 1 − SSE/SST,
where SST is given by: SST = \sum_{p=1}^{n} (y_p − \bar{y})^2,
and SSE is defined as: SSE = \sum_{p=1}^{n} (y_p − y_p^{predicted})^2.
In the above, ȳ is the mean of the yp terms and yppredicted is the value
predicted by the model for the pth data piece. We have defined SSE
during the discussion on fitting straight lines and planes. Note that
those definitions were special cases of the definition given here.
Now, r^2 denotes the proportion of variation in the data that is explained by the metamodel assumed in calculating SSE. Hence a large value of r^2 (i.e., a value close to 1) usually indicates that the metamodel assumed is a good fit. In a very rough sense, the reliability of the r^2 parameter increases with the value of n. However, the r^2 parameter can be misleading if there are several variables and hence should be used cautiously. The reader is also referred to [173] for further reading on this topic.
y = 5x + 6 when 0 ≤ x ≤ 5,
y = 3x + 16 when 5 < x ≤ 10,
y = −4x + 86 when 10 < x ≤ 12, and
y = −3x + 74 when x > 12.
It is not hard to see that the peak is at x = 10, around which the slope changes its sign. Thus x = 10 is the optimal point.
The approach used above is quite crude. It can be made more sophisticated by adding a few stages to it; the response surface method is often used in multiple stages to make it more effective. One first uses a rough metamodel (possibly piecewise linear) to get a general idea of the region in which the optimal point(s) may lie (as shown above via Example B). Then one zeroes in on that region and uses a more non-linear metamodel in that region. In Example B, the optimal point is likely to lie in the region close to x = 10. One can now take the next step, which is to use a non-linear metamodel around 10. This form can then be used to find a more precise location of the optimum. It makes a lot of sense to use more replications in the second stage than in the first. A multi-stage approach can become quite time-consuming, but it is more reliable.
Remark: Regression is often referred to as a model-based method
because it assumes the knowledge of the metamodel for the objective
function.
In the next section, we will study a model-free mechanism for
function-fitting—the neural network. Neural networks are of two
types—linear and non-linear. It is the non-linear neural network that
is model-free.
Thus
\frac{\partial E}{\partial w(i)} = − \sum_{p=1}^{n} (y_p − o_p) x_p(i).   (4.9)
Here the output for the pth data piece is given by:
o_p = \sum_{j=0}^{k} w(j) x_p(j).
Each weight is then updated using:
w(i) \leftarrow w(i) + \mu \sum_{p=1}^{n} (y_p − o_p) x_p(i).
For a network with N independent variables (plus the node whose input is fixed at 1), the output is:
o_p = \sum_{j=0}^{N} w(j) x_p(j).
Figure 4.5. A linear network, i.e., a neuron with three input nodes and one output node: The approximated function is a plane with two independent variables, x(1) and x(2). The node with input x(0) = 1 assumes the role of the constant a in regression
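A minimal C sketch of training such a linear network with the batch update rule derived above follows; the data pieces, the step size, and the iteration count are illustrative assumptions.

```c
#include <stdio.h>

/* A minimal sketch of batch training for the linear neuron of Fig. 4.5:
   o_p = sum_j w(j) x_p(j), with x_p(0) = 1 playing the role of the
   constant a. Data values are illustrative. */
#define K 2      /* independent variables */
#define N 4      /* data pieces */

int main(void)
{
    double x[N][K + 1] = {{1, 0.2, 0.5}, {1, 0.3, 0.9},
                          {1, 0.1, 0.9}, {1, 0.4, 0.3}};
    double y[N] = {3.54, 4.44, 3.46, 3.90};
    double w[K + 1] = {0.1, 0.1, 0.1}, mu = 0.05;

    for (int iter = 0; iter < 5000; iter++) {
        double dw[K + 1] = {0};
        for (int p = 0; p < N; p++) {
            double o = 0;
            for (int j = 0; j <= K; j++)
                o += w[j] * x[p][j];
            for (int i = 0; i <= K; i++)
                dw[i] += (y[p] - o) * x[p][i];   /* gradient from Eq. (4.9) */
        }
        for (int i = 0; i <= K; i++)
            w[i] += mu * dw[i];                  /* batch update */
    }
    printf("w(0) = %f, w(1) = %f, w(2) = %f\n", w[0], w[1], w[2]);
    return 0;
}
```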
Figure 4.6. A non-linear neural network with an input layer, one hidden layer, and one output node: The term w(i, h) denotes the weight on the link from the ith input node to the hth hidden node. The term x(h) denotes the weight on the link from the hth hidden node to the output node
The thresholding function used above is g(v^*) = \frac{1}{1 + e^{−v^*}}, which goes by the name sigmoid function. There are other functions that can be used for thresholding. We will understand the role played by functions such as these when we derive the backprop algorithm.
Thus the actual value of the hth hidden node, v(h), using the sigmoid function, is given by:
v(h) = \frac{1}{1 + e^{−v^*(h)}}.
Let x(h) denote the weight on the link from the hth hidden node to
the output node. Then the output node’s value is given by:
o = \sum_{h=1}^{H} x(h) v(h),   (4.11)
where v(h) denotes the actual (thresholded) value of the hth hidden
node and H denotes the number of hidden nodes. Now we will demonstrate these ideas with a simple example.
Example C: Let us consider a neural network with three input nodes,
two hidden nodes, and one output node, as shown in Fig. 4.7. Let the
input values be: u1 = 0.23, u2 = 0.43, and u3 = 0.12. Let the weights
from the input nodes to the hidden nodes be: w(1, 1) = 1.5, w(1, 2) = 4.7, w(2, 1) = 3.7, w(2, 2) = 8.9, w(3, 1) = 6.7, and w(3, 2) = 4.8. Let the weights from the hidden nodes to the output node be x(1) = 4.7 and x(2) = 8.9. Then using the formulas given above:
v^*(1) = \sum_{i=1}^{3} w(i, 1) u(i) = (1.5)(0.23) + (3.7)(0.43) + (6.7)(0.12) = 2.74,
and similarly v^*(2) = (4.7)(0.23) + (8.9)(0.43) + (4.8)(0.12) = 5.484. Applying the sigmoid function, v(1) = 1/(1 + e^{−2.74}) ≈ 0.9394 and v(2) = 1/(1 + e^{−5.484}) ≈ 0.9959, so that the output is o = (4.7)(0.9394) + (8.9)(0.9959) ≈ 13.2782 (see Fig. 4.7).
Remark 1. From the above example, one can infer that large values
for inputs and the w(i, h) terms will produce v(h) values that are very
close to 1. This implies that for large values of inputs (and weights),
the network will lose its discriminatory power. It turns out that even
Figure 4.7. Deriving the value of the output (13.2782) for given values of inputs and weights

for values such as 50, we have 1/(1 + e^{−50}) ≈ 1. If all data pieces are in this
range, then all of them will produce the same output o for a given set
of weights. And this will not work. One way out of this problem is to
use the following trick. Normalize the raw inputs to values between
0 and 1. Usually the range of values that the input can take on is
known. So if the minimum possible value for the ith input is a(i) and
the maximum is b(i) then we should first normalize our data using the
following principle:
u(i) = \frac{u_{raw}(i) − a(i)}{b(i) − a(i)}.
So for example, if the values are: u_{raw}(1) = 2, a(1) = 0, and b(1) = 17, then the value of u(1) to be fed into the neural network should be:
u(1) = \frac{2 − 0}{17 − 0} = 0.117647.
Remark 2. An alternative way to work around this difficulty is to
modify the sigmoid function as shown below:
v = \frac{1}{1 + e^{−v^*/M}}, where M > 1.
This produces a somewhat similar effect but then one must choose M
carefully.
Remark 3. The w(i, h) terms should also remain at small values for retaining the discriminatory power of the neural network. As we will see later, these terms are in danger of becoming too large. We will discuss this issue later again.
The subscript p will be now used as an index for the data piece.
(If the concept of a “data piece” is not clear, we suggest you review
Example A in Sect. 3.2.1.) Thus yp will denote the function value of
the pth data piece obtained from simulation. For the same reason,
vp (h) will denote the value of the hth hidden node when the pth data
piece is used as an input to the node. So also, up (i) will denote the
value of the input for the ith input node when the pth data piece is
used as input to the neural network. The notations w(i, h) and x(h),
however, will never carry this subscript because they do not change
with every data piece.
The sum of squared errors is now SSE = \sum_{p=1}^{n} (y_p − o_p)^2, where n is the total number of data pieces available. For the backprop algorithm, instead of minimizing SSE, we will minimize SSE/2. (If SSE/2 is minimized, SSE will be minimized too.) In the following section, we will discuss the derivation of the backprop algorithm.
The following star-marked section can be skipped without loss of
continuity in the first reading.
\frac{\partial E}{\partial x(h)} = \frac{1}{2} \frac{\partial}{\partial x(h)} \sum_{p=1}^{n} (y_p − o_p)^2
= \frac{1}{2} \sum_{p=1}^{n} 2 (y_p − o_p) \frac{\partial}{\partial x(h)} (y_p − o_p)
= \sum_{p=1}^{n} (y_p − o_p) \left( − \frac{\partial o_p}{\partial x(h)} \right)
= − \sum_{p=1}^{n} (y_p − o_p) v_p(h).
The last equation follows from the fact that o_p = \sum_{i=1}^{H} x(i) v_p(i), where H is the number of hidden nodes. Thus we can conclude that:
\frac{\partial E}{\partial x(h)} = − \sum_{p=1}^{n} (y_p − o_p) v_p(h).   (4.18)
\frac{\partial E}{\partial w(i, h)} = \frac{1}{2} \frac{\partial}{\partial w(i, h)} \sum_{p=1}^{n} (y_p − o_p)^2
= \frac{1}{2} \sum_{p=1}^{n} 2 (y_p − o_p) \frac{\partial (y_p − o_p)}{\partial w(i, h)}
= \sum_{p=1}^{n} (y_p − o_p) \left( − \frac{\partial o_p}{\partial w(i, h)} \right)
= \sum_{p=1}^{n} (y_p − o_p) \left( − \frac{\partial o_p}{\partial v_p(h)} \cdot \frac{\partial v_p(h)}{\partial w(i, h)} \right)
= − \sum_{p=1}^{n} (y_p − o_p) x(h) \frac{\partial v_p(h)}{\partial w(i, h)} \quad \left( \text{since } o_p = \sum_{i=1}^{H} x(i) v_p(i) \right)
= − \sum_{p=1}^{n} (y_p − o_p) x(h) \frac{\partial v_p(h)}{\partial v_p^*(h)} \cdot \frac{\partial v_p^*(h)}{\partial w(i, h)}
= − \sum_{p=1}^{n} (y_p − o_p) x(h) v_p(h) (1 − v_p(h)) u_p(i).
The last step uses the following two facts:
v_p^*(h) = \sum_{j=1}^{I} w(j, h) u_p(j) implies \frac{\partial v_p^*(h)}{\partial w(i, h)} = u_p(i);
v_p(h) = \frac{1}{1 + e^{−v_p^*(h)}} implies \frac{\partial v_p(h)}{\partial v_p^*(h)} = v_p(h)[1 − v_p(h)].
Thus in conclusion,
\frac{\partial E}{\partial w(i, h)} = − \sum_{p=1}^{n} [y_p − o_p] x(h) v_p(h)[1 − v_p(h)] u_p(i).   (4.19)
Figure 4.8. A neural network with a bias node: The topmost node is the bias node. The weight on its direct link to the output node is b
o = (b)(1) + \sum_{h=1}^{H} x(h) v(h).   (4.22)
\frac{\partial E}{\partial b} = \frac{1}{2} \frac{\partial}{\partial b} \sum_{p=1}^{n} (y_p − o_p)^2
= \frac{1}{2} \sum_{p=1}^{n} 2 (y_p − o_p) \frac{\partial}{\partial b} (y_p − o_p)
= \sum_{p=1}^{n} (y_p − o_p) \left( − \frac{\partial o_p}{\partial b} \right)
= − \sum_{p=1}^{n} (y_p − o_p) \cdot 1.
The last equation follows from the fact that o_p = b + \sum_{i=1}^{H} x(i) v_p(i). Thus,
\frac{\partial E}{\partial b} = − \sum_{p=1}^{n} (y_p − o_p).   (4.23)
Let H denote the number of hidden nodes and I the number of input nodes. (Note that I will include the bias node.) The algorithm will be terminated when the absolute value of the difference between the SSE in successive iterations is less than tolerance, a pre-specified small number, e.g., 0.001.
Step 1: Set all weights—that is, w(i, h), x(h), and b for i = 1, 2, . . . , I,
and h = 1, 2, . . . , H, to small random numbers between 0 and 0.5.
Set the value of SSEold to a large value. The available data for the
pth piece is (up , yp ) where up denotes a vector with I components.
Set m to 0.
Step 2: Compute each of the vp∗ (h) terms for h = 1, 2, . . . , H and
p = 1, 2, . . . , n using
v_p^*(h) = \sum_{j=1}^{I} w(j, h) u_p(j).
Step 5:
Update b using: b \leftarrow b + \mu \sum_{p=1}^{n} (y_p − o_p).
Update each w(i, h) using:
w(i, h) \leftarrow w(i, h) + \mu \sum_{p=1}^{n} (y_p − o_p) x(h) v_p(h) (1 − v_p(h)) u_p(i).
Step 6: Increment m by 1. Calculate SSE_{new} using SSE_{new} = \sum_{p=1}^{n} (y_p − o_p)^2. Update the value of \mu using m as discussed above. If |SSE_{new} − SSE_{old}| < tolerance, STOP. Otherwise, set SSE_{old} = SSE_{new}, and then return to Step 2.
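The following minimal C sketch implements the batch algorithm above for a small network; the training data, network sizes, step size, and fixed iteration count (used in place of the tolerance test and the step-size decay) are all illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NI 2    /* input nodes (excluding the bias node) */
#define NH 3    /* hidden nodes */
#define NP 4    /* data pieces */

double sigmoid(double v) { return 1.0 / (1.0 + exp(-v)); }

int main(void)
{
    /* hypothetical training data (u_p, y_p) */
    double u[NP][NI] = {{0.2, 0.5}, {0.3, 0.9}, {0.1, 0.9}, {0.4, 0.3}};
    double y[NP] = {3.54, 4.44, 3.46, 3.90};
    double w[NI][NH], x[NH], b, mu = 0.01;

    /* Step 1: small random initial weights between 0 and 0.5 */
    for (int i = 0; i < NI; i++)
        for (int h = 0; h < NH; h++)
            w[i][h] = 0.5 * rand() / RAND_MAX;
    for (int h = 0; h < NH; h++)
        x[h] = 0.5 * rand() / RAND_MAX;
    b = 0.5 * rand() / RAND_MAX;

    for (int m = 0; m < 20000; m++) {   /* fixed iteration count */
        double db = 0, dx[NH] = {0}, dw[NI][NH] = {{0}}, sse = 0;
        for (int p = 0; p < NP; p++) {
            double v[NH], o = b;
            /* forward pass through the hidden layer */
            for (int h = 0; h < NH; h++) {
                double vstar = 0;
                for (int i = 0; i < NI; i++)
                    vstar += w[i][h] * u[p][i];
                v[h] = sigmoid(vstar);
                o += x[h] * v[h];
            }
            double err = y[p] - o;
            sse += err * err;
            /* accumulate gradients from Eqs. (4.23), (4.18), (4.19) */
            db += err;
            for (int h = 0; h < NH; h++) {
                dx[h] += err * v[h];
                for (int i = 0; i < NI; i++)
                    dw[i][h] += err * x[h] * v[h] * (1 - v[h]) * u[p][i];
            }
        }
        /* Step 5: batch updates */
        b += mu * db;
        for (int h = 0; h < NH; h++) x[h] += mu * dx[h];
        for (int i = 0; i < NI; i++)
            for (int h = 0; h < NH; h++) w[i][h] += mu * dw[i][h];
        if (m % 5000 == 0) printf("iteration %d: SSE = %f\n", m, sse);
    }
    return 0;
}
```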
Step 1: Set all weights—that is, w(i, h), x(h), and b, for i=1, 2, . . . , I,
h = 1, 2, . . . , H, to small random numbers. The available data for
the pth piece is (up , yp ) where up denotes a vector with I compo-
nents. Set m to 0 and mmax to the maximum number of iterations
for which the algorithm is to be run.
v_p^*(h) = \sum_{j=1}^{I} w(j, h) u_p(j).
[Figure: the neural network of Fig. 4.8 augmented with a virtual bias node; virtual bias weights connect a unit-valued node to the hidden nodes, in addition to the weights w(i, h), x(h), and the bias weight b on the direct link to the output node.]
4. After performing the backprop for a very long time, the weights can become large. This can pose a problem for the net (it loses its discriminatory power). One way out of this is to multiply each weight in each iteration by (1 − \mu\gamma/2), where \gamma is another step size, less than 1.
i    h    w(i, h)
1    1    2.062851
1    2    2.214936
1    3    2.122674
2    1    0.496078
2    2    0.464138
2    3    0.493996

h    x(h)        vb(h)
1    2.298128    −0.961674
2    2.478416    −1.026590
3    2.383986    −0.993016
Point x1 x2 y y predicted
1 0.2 0.5 3.54 3.82
2 0.3 0.9 4.44 4.54
3 0.1 0.9 3.46 3.78
5. Concluding Remarks
The chapter was meant to serve as an introduction to the technique
of response surfaces in simulation-based optimization. A relatively
new topic of combining response surfaces with neural networks was
also introduced. The topic of neural networks will surface one more
time in this book in the context of control optimization.
Bibliographic Remarks: The technique of response surfaces is now widely used
in the industry. The method was developed around the end of the Second World
War. For a comprehensive survey of RSM, the reader is referred to [213]. For
RSM-based simulation optimization, the reader is referred to [281, 172, 19]. The
so-called kriging methodology [149], which is based on interpolation rather than
function fitting (e.g., least-squares minimization), has also been used in simula-
tion metamodeling: see [174, 9]. Use of neural networks in RSM is a relatively
recent development (see e.g., [165, 259, 220, 230, 231, 12, 87, 194, 202]). More recent developments in the field of neural networks include radial basis functions; see [210, 199].
The idea of neural networks, however, is not very new. Widrow and Hoff's work [323] on linear neural networks appeared in 1960. Research in non-linear neural networks was triggered by the pioneering work of Werbos [313], a Ph.D. dissertation from the year 1974. See also [257] from 1986, which explained the methodology in detail. Since then countless papers have been written on the methodology and uses of neural networks. The textbooks [132, 199] also contain excellent discussions. Our account in this chapter follows Law and Kelton [188] and Mitchell [205].
Neural networks remain, even today, a topic of ongoing research. We end with a
simple exercise that the reader is urged to carry out.
Exercise: Evaluate the function f (x) at the following 20 points.
f(x) = 2x^2 + \frac{\ln(x^3)}{x − 1}, where 1 ≤ x ≤ 10.
Now, using this data, train the batch version of backprop. Then, with the trained
network, predict the function at the following points.
Test the difference between the actual value and the value predicted by the network.
(Use codes from [121].)
Chapter 5
PARAMETRIC OPTIMIZATION: STOCHASTIC GRADIENTS AND ADAPTIVE SEARCH
1. Chapter Overview
This chapter focusses on simulation-based techniques for solving
stochastic problems of parametric optimization, also popularly called
static optimization problems. Such problems have been defined in
Chap. 3.
At the very outset, we would like to state that our discussion will
be limited to model-free techniques, i.e., techniques that do not
require structural properties of the objective function. By structural
properties, we mean the availability of the analytical closed form of
the objective function, the availability of the distribution (or density)
functions of random variables in the objective function, or the ability
to manipulate the integrals and derivatives within the analytical form
of the objective function. As stated previously, our interest in this
book lies in complex stochastic optimization problems with large-scale
solution spaces. For such problems, it is usually difficult to obtain
the kind of structural properties typically needed by model-based
techniques, such as likelihood ratios or score functions (which require
the distributions of random variables within the objective function)
and infinitesimal perturbation analysis. Model-based techniques have
been studied widely in the literature; see [91, 253] and Chap. 15 of
Spall [281] for an extensive coverage.
Model-free techniques are sometimes also called black-box techniques. Essentially, most model-free techniques are numeric, i.e., they rely on the objective function's value and not on its closed form. Usually, when these techniques are used, one assumes that it is possible to estimate the objective function's value at any given point via simulation.
2. Continuous Optimization
The problem of continuous parametric optimization can be described formally as:
Minimize f(x(1), x(2), . . . , x(N)),
where x(i) denotes the ith decision variable, N denotes the number
of decision variables, and f (.) denotes the objective function. (Any
maximization problem can be converted to a minimization problem by
reversing the sign of the objective function f (.), i.e., maximize f (x) ≡
minimize −f (x).)
As discussed in Chap. 3, we are interested in functions with
stochastic elements, whose analytical expressions are unknown be-
cause it is difficult to obtain them. As a result, simulation may have
to be used to find estimates of the function value. Now, we will present
an approach that uses the gradient of the function for optimization.
Step 3. If all the partial derivatives equal zero or are sufficiently close
to zero, STOP. Otherwise increment m by 1, and return to Step 2.
f(x, y) = 2x^2 + 4y^2 − x − y − 4.
\frac{\partial f(x, y)}{\partial x} = 4x − 1 = 0; \quad \frac{\partial f(x, y)}{\partial y} = 8y − 1 = 0,
which yields x = 0.25 and y = 0.125.
It has been shown (see [8] and references therein) that Eq. (5.2) (central differences) has
statistical properties superior to those of Eq. (5.3); i.e., the error pro-
duced due to the approximation of h by a positive quantity is less
with the central differences formula. (We will prove this in Chap. 10.)
(2) The function evaluations at (x + h) and (x − h) must be performed
using common random numbers. This means that both function eval-
uations should use the same set of random numbers in the replica-
tions. For instance, if a set of random numbers is used in replication
3 of f (x + h), then the same set should be used in replication 3 of
f (x − h). Using common random numbers has been proven to be a
“good” strategy—through the viewpoint of statistics. See [8] for addi-
tional details.
A simple example is now used to illustrate the numerical computing of a derivative. The derivative of the function
f(x) = 2x^3 − 1 with respect to x is 6x^2.
Therefore the actual value of the derivative when x = 1 is 6. Now from Eq. (5.2), using h = 0.1, the derivative is found to be:
\frac{[2(1 + 0.1)^3 − 1] − [2(1 − 0.1)^3 − 1]}{(2)(0.1)} = 6.02,
and using h = 0.01, the derivative is found to be:
\frac{[2(1 + 0.01)^3 − 1] − [2(1 − 0.01)^3 − 1]}{(2)(0.01)} = 6.0002.
As h becomes smaller, we approach the true value of the derivative. The above demonstrates that the value of the derivative can be approximated with small values for h. When the analytic function is unavailable, as is the case with objective functions of complex stochastic systems, one may use numerical approximations such as these for computing derivatives.
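The following minimal C sketch reproduces the central-differences computation for the example above; it is an illustration only.

```c
#include <stdio.h>
#include <math.h>

/* A minimal sketch of the central-differences estimate of a derivative,
   reproducing the f(x) = 2x^3 - 1 example from the text at x = 1. */
double f(double x) { return 2 * pow(x, 3) - 1; }

double central_diff(double (*g)(double), double x, double h)
{
    return (g(x + h) - g(x - h)) / (2 * h);
}

int main(void)
{
    printf("h = 0.10: %f\n", central_diff(f, 1.0, 0.10));  /* about 6.02 */
    printf("h = 0.01: %f\n", central_diff(f, 1.0, 0.01));  /* about 6.0002 */
    return 0;
}
```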
The so-called "finite difference" formula for estimating the derivative is the formula in Eq. (5.2) or Eq. (5.3). In problems with many decision variables, the finite difference method runs into trouble since its computational burden becomes overwhelming. Here is why. Consider the case with N decision variables:
x(1), x(2), . . . , x(N ).
In each iteration of the steepest-descent algorithm, one then has to
calculate N partial derivatives of the function. Note that the general
expression using central differences is:
\frac{\partial f(x)}{\partial x(i)} = \frac{f(x(1), x(2), . . . , x(i)+h, . . . , x(N)) − f(x(1), x(2), . . . , x(i)−h, . . . , x(N))}{2h}.
(i) f (x(1) + h(1), x(2) + h(2)), and (ii) f (x(1) − h(1), x(2) − h(2)).
These two evaluations would then be used to find the two partial
derivatives. It is perhaps clear now that regardless of the number of
variables, we will only need two evaluations. The formula for this esti-
mate of the derivative is provided formally in Step 3 of the algorithm
description that follows.
Notice that, in the above, one needs the values of the derivatives,
which were obtained in Step 3.
Step 5. Increment m by 1 and update μm using some step size de-
caying rule (discussed below). If μm < μmin then STOP; otherwise,
return to Step 1.
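A minimal C sketch in the spirit of simultaneous perturbation follows. It estimates all partial derivatives from just two function evaluations per iteration; the noisy objective reuses the function from the steepest-descent example above, and the constant step sizes (in place of the decaying rules discussed in the text) are simplifying assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 2   /* number of decision variables */

/* hypothetical noisy objective; stands in for a simulation estimate */
double f(double *x)
{
    double noise = (double)rand() / RAND_MAX - 0.5;
    return 2*x[0]*x[0] + 4*x[1]*x[1] - x[0] - x[1] - 4 + 0.01 * noise;
}

int main(void)
{
    double x[N] = {1.0, 1.0}, c = 0.1, mu = 0.05;

    for (int m = 0; m < 1000; m++) {
        double h[N], xp[N], xm[N];
        for (int i = 0; i < N; i++) {
            double delta = (rand() % 2 == 0) ? 1.0 : -1.0; /* Bernoulli +/-1 */
            h[i] = c * delta;               /* perturb all variables at once */
            xp[i] = x[i] + h[i];
            xm[i] = x[i] - h[i];
        }
        double fp = f(xp), fm = f(xm);      /* only two evaluations needed */
        for (int i = 0; i < N; i++)
            x[i] -= mu * (fp - fm) / (2 * h[i]);  /* steepest-descent step */
    }
    printf("x = (%f, %f)\n", x[0], x[1]);   /* should approach (0.25, 0.125) */
    return 0;
}
```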
Step 1. From the set P, select the following three solutions: the solution with the maximum objective function value, to be denoted by x_max; the solution with the second largest objective function value, to be denoted by x_sl; and the solution with the lowest objective function value, to be denoted by x_min. Now compute the so-called centroid as follows:
x_c \leftarrow \frac{1}{N}\left[ −x_{max} + \sum_{i=1}^{N+1} x(i) \right].
(The above is a centroid of all the points except for x_max.) Then compute the so-called reflected point, e.g., via x_r \leftarrow 2x_c − x_{max}.
If f (xexp ) < f (xr ), set xnew ← xexp . Otherwise, set xnew ← xr . Go
to Step 6.
Step 4. We come here if the reflected point is better than xsl . Set
xnew ← xr , and go to Step 6.
Step 5. We come here if the reflected point is worse than x_sl. The operation performed here is called contraction.
Step 6. Remove the old xmax from the polygon, i.e., set xmax ← xnew
and return to Step 1.
3. Discrete Optimization
Discrete parametric optimization is actually harder than continuous parametric optimization since the function may have gaps, and hence derivatives may be of little use. Even when the function can be evaluated exactly, discrete parametric optimization leads to a difficult problem, unless the problem has a special structure. Without the closed form, structure is hard to find, and hence structure is not available in the model-free context.
We will make the following important assumption regarding discrete
parametric optimization problems. We will assume that the solution
space is finite (although possibly quite large). Like in the continuous
case, we assume that it is possible to estimate the function at any
given point using simulation, although the estimate may not be exact,
i.e., it may contain some noise/error.
Now, if the solution space is manageably small, say composed of 100
points, then the problem can often be solved by an exhaustive search
of the solution space. An exhaustive search should be conducted only
if it can be performed in a reasonable amount of time. Generally, in an
exhaustive search, one evaluates the function with a pre-determined
number of replications (samples) at all the points in the solution space.
What constitutes a manageably small space may depend on how com-
plex the system is. For an M/M/1 queuing simulation written in C (see
[188] for a computer program), testing the function even at 500 points
may not take too much time, since M/M/1 is a simple stochastic sys-
tem defined by just two random variables. However, if the simulation
is more complex, the time taken to evaluate the function at even one
point can be significant, and hence the size of a “manageable” space
may be smaller. With the increasing power of computers, this size is
likely to increase.
If the solution space is large, i.e., several thousand or more points,
it becomes necessary to use algorithms that can find good solutions
S²(i, l) = [1/(m − 1)] Σ_{j=1}^{m} [X(i, j) − X(l, j) + X̄(l, m) − X̄(i, m)]².
Step 3. Compute:
N̆il = ( h²_KN S²(i, l) / δ² )⁻,
where (a)⁻ denotes the largest integer smaller than a. Let Ni = max_{l≠i} N̆il.
If m ≥ (1 + maxi Ni ), declare the solution with the maximum value
for X̄(i, m) as the best solution, and STOP.
Otherwise, set p ← m, and go to the next step.
Step 4.
Let Is = {i : i ∈ I and X̄(i, p) ≥ X̄(l, p) − Wil (p) ∀l ∈ I, l = i}, where
Wil(p) = max{ 0, (δ/(2p)) [ h²_KN S²(i, l)/δ² − p ] }.
Then set: I ← Is .
Step 5. If |I| = 1, declare the solution whose index is still in I as the
best solution, and STOP. Otherwise, go to Step 6.
Step 6. Take one additional observation for each system in I and set
p ← p + 1. If p = 1 + maxi Ni , declare the solution whose index is
in I and has the maximum value for X̄(i, p) as the best solution,
and STOP. Otherwise, go to Step 4.
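The quantities in Steps 2 through 4 above can be computed as in the following Python sketch (ours; X_i and X_l hold the replications of systems i and l, and h_kn and delta stand for the constants h_KN and δ of the procedure):

import math

def kn_screen_quantities(X_i, X_l, m, h_kn, delta):
    xbar_i = sum(X_i[:m]) / m
    xbar_l = sum(X_l[:m]) / m
    # S^2(i, l) from Step 2:
    s2 = sum((X_i[j] - X_l[j] + xbar_l - xbar_i) ** 2
             for j in range(m)) / (m - 1)
    # (a)^- : the largest integer smaller than a (Step 3):
    n_breve = math.ceil((h_kn ** 2) * s2 / (delta ** 2)) - 1
    return s2, n_breve

def w_il(p, s2, h_kn, delta):
    # The screening threshold W_il(p) of Step 4.
    return max(0.0, (delta / (2.0 * p)) * ((h_kn ** 2) * s2 / (delta ** 2) - p))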
3.2. Meta-heuristics
When we have several hundred or several thousand solutions in
the solution space, neither ranking and selection methods nor ex-
haustive enumeration can be used directly. We may then resort to
using meta-heuristics. Since it becomes difficult to use a variable
number of replications, as needed in ranking and selection, with meta-
heuristics, one usually uses a large, but fixed, pre-determined number
of replications (samples) in evaluating the function at any point in
the solution space. As stated above, meta-heuristics do not have sat-
isfactory convergence properties, but often work well in practice on
large-scale discrete-optimization problems. In this subsection, we will
{1, 2, . . . , 10}.
Now consider a solution (3, 7). A neighbor of this solution is (4, 6),
which is obtained by making the following changes in the solution
(3, 7).
3 −→ 4 and 7 −→ 6.
It is not difficult to see that these changes produced a solution—
(4, 6)—that lies in the “neighborhood” of a given solution (3, 7).
Neighbors can also be produced by more complex changes.
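A simple neighbor generator of this kind can be sketched in Python as follows (ours; it perturbs each coordinate by −1, 0, or +1 and clips the result to the set {1, 2, . . . , 10} assumed in the example, which is one of several reasonable design choices):

import random

def neighbor(x, low=1, high=10):
    # E.g., (3, 7) may become (4, 6): each coordinate moves by at most 1.
    y = [xi + random.choice([-1, 0, 1]) for xi in x]
    return [min(max(yi, low), high) for yi in y]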
Clearly, the effectiveness of the meta-heuristic algorithm will depend
on the effectiveness of the neighbor generation strategy. Almost all the
algorithms that we will discuss in the remainder of this chapter will
where the value of c(i) is the least increment permitted for the ith
decision variable. Thus, for instance, if the ith decision variable
assumes values from an equally spaced set, {2, 4, 6, 8, . . . , 20}, then
c(i) is clearly 2. Obviously, this definition of c(.) is appropriate for de-
cision variables that have equally spaced values. For variables that do
not take values from equally spaced sets, one must select c(.) in a way
such that y(.) becomes a feasible solution for every c(.) selected. We
also note that c(.) does not have to be the least increment permitted.
For variables assuming values from equally spaced sets, c(.) can be
any integer multiple of the least increment. We now illustrate the
hit-and-run strategy with an example.
{1, 2, . . . , 10}.
y = (1 − 1, 7 + 1) = (0, 8).
The above solution is not feasible, since it does not belong to the
solution space. Hence, we perform one more attempt to generate a
neighbor. Let us assume that on this occasion, the random number
generator produces the following values: H(1) = 1 and H(2) = 1.
y = (1 + 1, 7 + 1) = (2, 8),
Step 2. Clearly: xbest = (1, 5) and xworst = (4, 10). Let the ran-
domly selected neighbor (xnew ) of the best solution be (2, 6).
Replacing the worst solution by the new solution, our new popula-
tion becomes:
(2, 4), (1, 5), (2, 6), and (3, 2).
values. These strings are treated as the genes of the cross-over and
mutation strategy, and then combined to produce superior progeny.
See [281] for a detailed coverage of this topic.
In the above move, for the first decision variable, the "mutation" is 2 → 3, and for the second decision variable, the "mutation" is 1 → 2.
The tabu list is a finite-sized list of mutations that keeps chang-
ing over time. We now present step-by-step details of a tabu search
algorithm. The algorithm is presented in terms of minimization of the
objective function value.
Steps in tabu search. Let m denote the iteration number in the
algorithm. Let mmax denote the maximum number of iterations to be
performed. Like in the genetic algorithm, mmax has to be pre-specified,
and there is no rule to find an optimal value for it. Also, as stated
earlier, this number is based on the available computer time to run
the algorithm.
xcurrent ← xnew.
If the new solution is better than the best obtained so far, replace the best obtained so far by the current; that is, if f(xcurrent) < f(xbest), set: xbest ← xcurrent.
The tabu list is thus a list of mutations that have been made re-
cently. Maintaining the list avoids the re-evaluation of solutions that
were examined recently. This is perhaps a distinguishing feature of this
algorithm. It must be added, however, that in simulation optimization
even if a solution is re-examined, it is not necessary to re-simulate the
system. All the evaluated solutions can be stored in a so-called binary
tree, which is a computer programming construct. Once a solution is
simulated, its objective function value can be fetched every time it is
needed from the binary tree, making re-simulation unnecessary.
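A minimal Python sketch of this idea follows (ours; a dictionary plays the role of the binary tree here, since any associative container mapping a solution to its stored objective value serves the purpose):

class EvaluationCache:
    def __init__(self, f_sim):
        self.f_sim = f_sim        # the (expensive) simulation-based evaluator
        self.store = {}
    def evaluate(self, x):
        key = tuple(x)
        if key not in self.store:             # simulate only on a miss
            self.store[key] = self.f_sim(x)
        return self.store[key]                # fetched, not re-simulated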
Examples of tabu lists. Consider a problem with two decision vari-
ables: s and q, where each decision variable can assume values from:
the future, while those that produce “poor” objective function values
are punished by a reduction in their probabilities. We now present a
formal description of the underlying mechanism.
Let (x(1), x(2), . . . , x(N )) denote N decision variables (parameters),
where x(i) takes values from the finite set A(i). Thus, A(i) denotes
the finite set of values that are permitted for decision variable i. Let
pm (i, a) denote the probability of selecting the value a for the ith
decision variable in the mth iteration of the algorithm. As stated
above, the algorithm starts as a pure random search. Mathemati-
cally, this implies that: p¹(i, a) = 1/|A(i)| for i = 1, 2, . . . , N, and every a ∈ A(i). The updating scheme of the algorithm that we will see below has to ensure that Σ_{a∈A(i)} p^m(i, a) = 1 for every i and every
m. Since the probabilities are updated using the objective function
values, the objective function value has to be normalized to a value
between 0 and 1. This is achieved via:
F = (R − Rmin) / (Rmax − Rmin), (5.6)
where R denotes the actual (or raw) objective function value, F
denotes the normalized objective function value, Rmax denotes the
maximum value for the actual objective function, and Rmin denotes
the minimum value for the actual objective function. Knowledge of
Rmax and Rmin is necessary for this algorithm. If these values are not
known, one must use guessed estimates.
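With Eq. (5.6) as reconstructed above, the normalization is one line of Python (a sketch; R_min and R_max may be the guessed estimates just mentioned):

def normalize(R, R_min, R_max):
    # Maps the raw objective value R into [0, 1]; larger is better.
    return (R - R_min) / (R_max - R_min)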
The best normalized objective function value, obtained thus far
in the algorithm, will be denoted by B(i, a) for i = 1, 2, . . . , N and
a ∈ A(i). We will need a constant step size, to be denoted by μ, in
the updating. In general, μ ∈ (0, 1); e.g., μ = 0.1. We present the
algorithm in terms of maximizing the objective function value.
Steps in LAST.
Step 3. Evaluate the objective function value associated with x. Let
the value obtained be denoted by R. Calculate the normalized
objective function value, F, using Eq. (5.6). If F > Fbest, set Fbest ← F.
Step 4. Set i = 1.
Step 6. Set
p^{m+1}(i, x(i)) ← 1 − Σ_{a=1, a≠x(i)}^{|A(i)|} p^{m+1}(i, a).
Let the value of x selected in the (m + 1)th iteration be (2, 1). In other
words, for the first decision variable a = 2 was selected and for the
second a = 1 was selected. Let the objective function value, F , be 0.1.
Fbest, as should be clear from the B matrix, is assumed to be 0.4.
We now show all the calculations to be performed at the end of this
iteration.
Now from Step 5, since B(1, 1) < B(1, 2), p(1, 1) will decrease and
will be updated as follows.
The probability p(1, 3) will increase, since B(1, 3) > B(1, 2), and the
updating will be as follows:
p^{m+1}(1, 3) = p^m(1, 3) + μ[B(1, 3) − B(1, 2)] [1 − p^m(1, 3)] p^m(1, 2) / (3 − 1).
And finally from Step 6, p(1, 2) will be updated as follows:
Since both B(1, 2) and B(2, 1) are greater than F , the new response
will not change the B matrix. Thus the new B matrix will be identical
to the old. And then we conduct the (m + 2)th iteration, and the
process continues.
U ≤ exp(−Δ/T), set: xcurrent ← xnew; else keep xcurrent unchanged.
2. Patient: This strategy was proposed in [248, 249], where one does
not terminate a phase until a better solution is found. Thus the
T (P + 1) = λT (P ), (5.8)
in which 0 < λ < 1 (e.g., λ = 0.99) and T (0) = C, where C > 0 is
user-specified. An equivalent rule [333] is T (P ) = C(λ)P .
3. Rational : A rule based on a rational function is (also used in neural
networks and reinforcement learning)
T(P) = C / (B + P), (5.9)
where C and B are user-specified positive constants, e.g., C = 1
and B = 0 [291]; C > 1 and B > C (used typically in rein-
forcement learning, see e.g., [113]). Another rule is from [193]:
T(P) = T(0)/(1 + B · T(0) · P), where B ≪ 1.
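The decay rules above are easy to state in Python; here is a sketch (ours) of the exponential and rational rules, plus the logarithmic rule used in the worked example later in this section (B > 0 is assumed in temp_rational to avoid division by zero at P = 0, whereas the text's cited example uses B = 0 with [291]):

import math

def temp_exponential(P, C=100.0, lam=0.99):
    return C * (lam ** P)            # T(P) = C * lambda^P, as in Eq. (5.8)

def temp_rational(P, C=1.0, B=1.0):
    return C / (B + P)               # T(P) = C / (B + P), as in Eq. (5.9)

def temp_logarithmic(P, C=100.0):
    return C / math.log(2 + P)       # T(P) = C / ln(2 + P)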
Remark 3. The expression exp(−Δ/T), with which U (the random number between 0 and 1) is compared, needs to be studied carefully.
For small positive values of T , this expression also assumes small pos-
itive values. When the expression is small, so is the probability of
accepting a worse solution. A general rule of simulated annealing is
that T should decrease as the number of phases increases. Thus, for
instance if the rules discussed above are used for temperature decay, C
(and other relevant constants) should be chosen in a manner such that
when P is zero, exp(−Δ/T) is significantly larger than zero. Otherwise
the probability of selecting a worse solution will be very small at the
start itself, which can cause the algorithm to get trapped in the nearest
local optimum, essentially negating the idea of exploring. Note also
that if the temperature starts at a high value and is never reduced,
and in addition, the number of iterations per phase keeps increasing,
the algorithm essentially becomes a “wanderer,” which is equivalent
to a pure random search. Thus, one should start with a sufficiently
high temperature and decay the temperature.
Remark 4. When we use simulated annealing with a simulator, we
assume that the estimate produced by the simulator is “close” to the
actual function value. In reality, there is some noise/error. Fortu-
nately, as long as the noise is not too “large,” the algorithm’s behavior
is not impacted (see Chap. 10). It may be a good idea, however, to
increase the accuracy of the function estimation process, by increasing
the number of replications, as the algorithm progresses.
Remark 5. The reason for allowing the algorithm to move to worse
solutions is to provide it with the opportunity of moving away from
a local optimum and finding the global optimum. See Fig. 5.1 (see
page 76). A simulated annealing algorithm that finds X in Fig. 5.1
may still escape from it and go on to find the global optimum, Y .
Remember that if the algorithm moves out of a local optimum and that
local optimum happens to be a global optimum, the global optimum is
not lost because the best solution is always retained in the algorithm’s
memory; such algorithms are called memory-based.
Remark 6. Finally, an important question is: how many phases
(Pmax ) should be performed? The answer depends on how the tem-
perature is reduced. When the temperature approaches small values
at which no exploration occurs, the algorithm should be stopped. The
rate at which the temperature is reduced depends on how much time
is available to the user. The slower the decay, the greater the chance of exploring the entire solution space and finding the global optimum.
Example. We will demonstrate a few steps in the simulated annealing
algorithm with an example. The example will be one of minimization.
Consider a problem with two decision variables, x and y, each taking
values from the set: {1, 2, 3, 4, 5, 6}. We will assume that one iteration
is allowed per phase. The temperature is decayed using the following
rule: T (P ) = 100/ln(2 + P ). Let the current solution be: xcurrent =
(3, 4). The same solution is also the best solution currently; in other
words: xbest = (3, 4). Let f (xcurrent ) be 1,400.
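The acceptance test for this example can be sketched as follows (ours; the candidate value f(xnew) = 1,450 below is hypothetical, chosen only to exercise the worse-solution branch):

import math, random

def sa_accept(f_current, f_new, T):
    # Always accept an improvement; accept a worse point with
    # probability exp(-Delta/T), where Delta = f_new - f_current.
    if f_new <= f_current:
        return True
    return random.random() <= math.exp(-(f_new - f_current) / T)

T = 100.0 / math.log(2 + 0)              # phase P = 0, so T is about 144.3
accepted = sa_accept(1400.0, 1450.0, T)  # hypothetical f(x_new) = 1450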
on the values of the objective function of the current point and the
point to which a move is being considered by the algorithm. Also, the
neighbor-generating strategy will use a stochastic generator matrix
that we first discuss.
Stochastic generator matrix. Consider a simple problem where
there are three solutions, which are indexed as 1, 2, and 3. Now
consider the following matrix:
G = [ 0     0.2   0.8
      0.3   0     0.7
      0.1   0.9   0   ].
The elements in any row of this matrix sum to 1, which means
that it is a so-called stochastic matrix. Such matrices will be covered
extensively from the next chapter onwards in the control optimization
setting. Here it is sufficient for the reader to view this matrix as
an entity that can randomly generate a new solution from a current
solution. The generation mechanism works as follows: If the algorithm
is currently in a solution indexed by i, then the probability with which
it will be moved to a solution indexed by j is given by G(i, j). We consider
an example next.
Assume that the algorithm is in the solution indexed by 2. Then,
the probability that it will move to the solution with index i is given
by G(2, i). Thus, it will move to the solution indexed as 1 with a
probability of G(2, 1) = 0.3 and to the solution indexed as 3 with
a probability of G(2, 3) = 0.7. In order to achieve the move, one
generates a uniformly distributed random number, U , between 0 and 1.
If U ≤ 0.3, the neighbor (new solution) is the solution indexed as 1,
while if U > 0.3, the solution indexed as 3 becomes the neighbor. Note
that we have assumed the diagonal elements in G to be 0 above, since
it will be clearly inefficient in simulation optimization to consider the
same point again as a neighbor.
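In code, generating a neighbor from the matrix G amounts to sampling from one of its rows (a Python sketch, ours; indices run from 0 here rather than from 1):

import random

G = [[0.0, 0.2, 0.8],
     [0.3, 0.0, 0.7],
     [0.1, 0.9, 0.0]]

def next_solution(G, i):
    # Sample j with probability G[i][j] via the inverse-transform method.
    u, cum = random.random(), 0.0
    for j, p in enumerate(G[i]):
        cum += p
        if u <= cum:
            return j
    return len(G[i]) - 1    # guard against floating-point round-off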
A(current, new) = 1, if f(xnew) ≤ f(xcurrent);
A(current, new) = exp( −[f(xnew) − f(xcurrent)]/T ), otherwise. (5.10)
where T > 0 does not depend on the iteration, but may depend on the
values of the objective function. Hence here T should not necessarily
be viewed as the temperature of simulated annealing. In general T is a
function of f (xnew ) and f (xcurrent ). Other mechanisms for generating
the matrix A can also be used as long as A(current, new) equals 1
when the function finds an improved/equally good point.
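For concreteness, the acceptance probability of Eq. (5.10) can be written as follows (a sketch, ours):

import math

def bas_acceptance_prob(f_new, f_current, T):
    # T > 0 is fixed across iterations here, unlike the decaying
    # temperature of simulated annealing.
    if f_new <= f_current:
        return 1.0
    return math.exp(-(f_new - f_current) / T)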
Remark 2. Stopping criteria other than mmax can also be used.
Essentially, the value of mmax depends on the time available to the an-
alyst. In global optimization, unless the algorithm has the opportunity
to sample the entire solution space, the chances of finding the global
optimum are low. Hence, the higher this value, the better the performance
is likely to be.
Remark 3. The algorithm follows the format of simulated an-
nealing with an important difference: The exploration/backtracking
probability in BAS does not depend on the iteration number but only
on the objective function values of the current and the new solution.
In simulated annealing, the exploration probability depends, in addi-
tion to the objective function values, on the “temperature,” which in
turn depends on the number of iterations the algorithm has performed
thus far. Also, the convergence properties of the two algorithms are
markedly different, which we will discuss in Chap. 10.
Remark 4. Note that we did not define stochastic generator matrices
in simulated annealing or LAST, because it was not necessary to store
them explicitly in the computer’s memory. However, intrinsically, such
generators exist underlying all SAS techniques. In simulated anneal-
ing, we discussed the hit-and-run strategy. Parameters underlying the
hit-and-run strategy can in fact be used to compute this matrix. In
LAST, we can generate this matrix from the probabilities used in the
solution. For instance, in a three-solution problem assume that in a
given iteration, the probabilities of selecting the solutions, indexed 1,
2, and 3, are 0.2, 0.3 and 0.5 respectively. Then, the stochastic gener-
ator matrix in LAST for that iteration is:
G = [ 0.2   0.3   0.5
      0.2   0.3   0.5
      0.2   0.3   0.5 ].
When W (i) = |N (i)|, we will refer to W (i) as the friendliness coef-
ficient of i, indicating that it is a measure of how many candidate
solutions (neighbors) can be generated from a solution.
The basic idea in the stochastic ruler is straightforward. Assume
that a and b are the lower and upper bounds, respectively, of the
objective function, which we wish to minimize. Further assume that
both bounds are known. Clearly then, the value of the objective func-
tion at the optimal solution is a or a value very close to a. Now if we
generate random numbers from the distribution U nif (a, b), then the
probability that a generated random number will exceed a solution’s
objective function value should equal 1 when the solution is the opti-
mal solution. However, if the solution is not optimal, this probability
should be less than 1. Also, if the solution is at its worst point (where
the objective function value is b), this probability should be 0. The
stochastic ruler essentially seeks to maximize this probability, striving
to move the solution to points where this probability is increased. In
the limit, it reaches a point where this probability is maximized, which
should clearly be the optimal solution.
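A minimal sketch of the resulting acceptance test follows (ours; estimates holds successive noisy estimates of the objective at the candidate point, and each estimate is compared against its own freshly drawn Unif(a, b) ruler):

import random

def ruler_accept(estimates, a, b):
    # Accept the candidate only if every estimate beats its ruler;
    # near-optimal points pass this test with high probability.
    return all(e <= random.uniform(a, b) for e in estimates)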
where any ties that occur are broken randomly. Then, Yi∗ contains
the best objective function value.
then the algorithm retracts to the entire feasible region in the next
iteration, i.e., F(m + 1) = S, and essentially starts all over again.
Example 3. Assume that the promising region in the mth iteration is a singleton set, T. Then M = 1 and K = M + 1 = 2, so Y1 = T, and the surrounding region, i.e., S\T, equals Y2.
Example 4. Assume that the promising region in the mth iteration
is the entire feasible region S. Let M = 2. Then, we construct a total
of K = M = 2 sub-regions, such that Y1 ∪ Y2 = S.
4. Concluding Remarks
Our discussion in this chapter was restricted by design to model-
free search techniques. Our discussion for the continuous case was
limited to finite differences, simultaneous perturbation, and the down-
hill simplex. In discrete optimization, we covered two meta-heuristics,
namely the genetic algorithm and tabu search, and five SAS tech-
niques, namely simulated annealing, BAS, LAST, the stochastic ruler,
and nested partitions. A number of other techniques that we were
unable to cover include meta-heuristics, such as scatter search [105],
ant colony optimization [80], and particle swarm optimization [163],
and SAS techniques, such as GRASP [84], MRAS [146], and COM-
PASS [142]. MRAS is a recent development that needs special mention
because MRAS generates solutions from an “intermediate probabilistic
model” [146] on the solution space, which is updated iteratively after
each function evaluation and may lead to an intelligent search like in
the case of LAST.
CONTROL OPTIMIZATION WITH STOCHASTIC DYNAMIC PROGRAMMING
1. Chapter Overview
This chapter focuses on a problem of control optimization, in
particular the Markov decision problem (or process). Our discussions
will be at a very elementary level, and we will not attempt to prove
any theorems. The central aim of this chapter is to introduce the
reader to classical dynamic programming in the context of solving
Markov decision problems. In the next chapter, the same ideas will be
presented in the context of simulation-based dynamic programming.
The main concepts presented in this chapter are (1) Markov chains,
(2) Markov decision problems, (3) semi-Markov decision problems,
and (4) classical dynamic programming methods.
2. Stochastic Processes
We begin with a discussion on stochastic processes. A stochastic
(or random) process, roughly speaking, is an entity that has a prop-
erty which changes randomly with time. We refer to this changing
property as the state of the stochastic process. A stochastic process
is usually associated with a stochastic system. Read Chap. 2 for a
definition of a stochastic system. The concept of a stochastic process
is best understood with an example.
Consider a queue of persons that forms in a bank. Let us assume
that there is a single server (teller) serving the queue. See Fig. 6.1.
The queuing system is an example of a stochastic system. We need
to investigate further the nature of this queuing system to identify
properties, associated with the queue, that change randomly with time.
Let us denote
The number of customers in the queue at time t by X(t) and
The number of busy servers at time t by Y (t).
Then, clearly, X(t) will change its value from time to time and so
will Y (t). By its definition, Y (t) will equal 1 when the teller is busy
serving customers, and will equal 0 when it is idle.
Figure 6.1. A single-server queue: customers waiting in the queue and a customer being served at the server
Now if the state of the system is recorded after unit time, X(t)
could take on values such as: 3, 3, 4, 5, 4, 4, 3 . . . The set {X(t)|t =
1, 2, · · · , ∞}, then, defines a stochastic process. Mathematically, the
sequence of values that X(t) assumes in this example is a stochastic
process.
Similarly, {Y (t)|t = 1, 2, · · · , ∞} denotes another stochastic process
underlying the same queuing system. For example, Y (t) could take on
values such as 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, . . .
It should be clear now that more than one stochastic process may be
associated with any given stochastic system. The stochastic processes
X and Y differ in their definition of the system state. For X, the
state is the number of customers in the queue and for Y , the state is
the number of busy servers.
An analyst selects the stochastic process that is of interest to
him/her. E.g., an analyst interested in studying the utilization of
the server (i.e., proportion of time the server is busy) will choose Y ,
while the analyst interested in studying the length of the queue will
choose X. See Fig. 6.2 for a pictorial explanation of the word “state.”
In general, choosing the appropriate definition of the state of a
system is a part of “modeling.” The state must be defined in a manner
suitable for the optimization problem under consideration. To under-
stand this better, consider the following definition of state. Let Z(t)
denote the total number of persons in the queue with black hair. Now,
Figure 6.2. A queue in two different states: the "state" is defined by the number in the queue
Figure 6.3. Schematic of a two-state Markov chain, where circles denote states
later.) Hence, after unit time, the system either switches (moves) to
a new state or else the system returns to the current state. We will
refer to this phenomenon as a state transition.
To understand this phenomenon better, consider Fig. 6.3. The figure
shows two states, which are denoted by circles, numbered 1 and 2.
The arrows show the possible ways of transiting. This system has
two states: 1 and 2. Assuming that we first observe the system when
it is in state 1, it may for instance follow the trajectory given by:
1, 1, 2, 1, 1, 1, 2, 2, 1, 2, . . .
A state transition in a Markov process is usually a probabilistic, i.e.,
random, affair. Consider the Markov process in Fig. 6.3. Let us further
assume that in its first visit to state 1, from state 1 the system jumped
to state 2. In its next visit to state 1, the system may not jump to
state 2 again; it may jump back to state 1. This should clarify that
the transitions in a Markov chain are “random” affairs.
We now need to discuss our convention regarding the time needed
for one jump (transition). In a Markov process, how much time is spent
in one transition is really irrelevant to its analysis. As such, even if the
time is not always unity, or even if it is not a constant, we assume it
to be unity for our analysis. If the time spent in the transition becomes
an integral part of how the Markov chain is analyzed, then the Markov
process is not an appropriate model. In that case, the semi-Markov
process becomes more appropriate, as we will see below.
When we study real-life systems using Markov processes, it usu-
ally becomes necessary to define a performance metric for the real-life
system. It is in this context that one has to be careful with how the
unit time convention is interpreted. A common example of a perfor-
mance metric is: average reward per unit time. In the case of a Markov
process, the phrase “per unit time” in the definition of average reward
actually means “per jump” or “per transition.” (In the so-called semi-
Markov process that we will study later, the two phrases have different
meanings.)
Another important property of the Markov process needs to be
studied here. In a Markov process, the probability that the process
jumps from a state i to a state j does not depend on the states vis-
ited by the system before coming to i. This is called the memoryless
property. This property distinguishes a Markov process from other
stochastic processes, and as such it needs to be understood clearly.
Because of the memoryless property, one can associate a probability
with a transition from a state i to a state j, that is,
i −→ j.
1, 3, 2, 1, 1, 1, 2, 1, 3, 1, 1, 2, . . .
Assume that: P (3, 1) = 0.2 and P (3, 2) = 0.8. When the system visits
3 for the first time in the above, it jumps to 2. Now, the probability of
jumping to 2 is 0.8, and that of jumping to 1 is 0.2. When the system
revisits 3, the probability of jumping to 2 will remain at 0.8, and that
of jumping to 1 at 0.2. Whenever the system comes to 3, its probability
of jumping to 2 will always be 0.8 and that of jumping to 1 be 0.2. In
other words, when the system comes to a state i, the state to which
it jumps depends only on the transition probabilities: P (i, 1), P (i, 2)
and P (i, 3). These probabilities are not affected by the sequence of
states visited before coming to i. Thus, when it comes to jumping to a
new state, the process does not “remember” what states it has had to
go through in the past. The state to which it jumps depends only on
the current state (say i) and on the probabilities of jumping from that
state to other states, i.e., P (i, 1), P (i, 2) and P (i, 3). In general, when
the system is ready to leave state i, the next state j depends only on
P (i, j). Furthermore, P (i, j) is completely independent of where the
system has been before coming to i.
We now give an example of a non-Markovian process. Assume that
a process has three states, numbered 1, 2, and 3. X(t), as before,
denotes the system state at time t. Assume that the law governing
this process is given by:
where f (i, j) is the probability that the next state is j given that the
current state is i. Also f (i, j) is a constant for given values of i and j.
Carefully note the difference between Eqs. (6.2) and (6.1). Where
the process resides one step before its current state has no influence on
a Markov process. It should be obvious that in the Markov process,
the transition probability (the probability of going from one state to another in one step) depends on two quantities:
the present state (i) and the next state (j). In a non-Markovian pro-
cess, such as the one defined by Eq. (6.1), the transition probability
depended on the current state (i), the next state (j), and the previous
state (l). An implication is that even if both processes have the same number of states, we will have to deal with many more probabilities in the non-Markovian process.
The quantity f (i, j) is an element of a two-dimensional matrix. Note
that f (i, j) is actually P (i, j), the one-step transition probability of
jumping from i to j, which we have defined earlier.
All the transition probabilities of a Markov process can be conve-
niently stored in a matrix. This matrix is called the one-step tran-
sition probability matrix or simply the transition probability ma-
trix, usually abbreviated as TPM. An example of a TPM with three
states is:
P = [ 0.7   0.2   0.1
      0.4   0.2   0.4
      0.6   0.1   0.3 ].    (6.3)
P (i, j) here denotes the (i, j)th element of the matrix, P, i.e., the
element in the ith row and the jth column of P. In other words, P (i, j)
denotes the one-step transition probability of jumping from state i to
state j. Thus, for example, P (3, 1), which is 0.6 above, denotes the
one-step transition probability of going from state 3 to state 1.
We will also assume that a finite amount of time is taken in any
transition and that no time is actually spent in a state. This is one
convention (there are others), and we will stick to it in this book. Also,
note that by our convention, the time spent in a transition is unity (1).
In summary, a Markov process possesses three important properties:
(1) the jumpy property, (2) the memoryless property, and (3) the unit
time property (by our convention).
Figure 6.4. Schematic of a two-state Markov chain, where circles denote states, arrows depict possible transitions, and the numbers on the arrows denote the probabilities of those transitions
Figures 6.5 and 6.6 show some more examples of Markov chains
with three and four states respectively. In this book, we will consider
Markov chains with a finite number of states.
Estimating the values of the elements of the TPM is often quite
difficult. This is because, in many real-life systems, the TPM is
very large, and evaluating any given element in the TPM requires
the setting up of complicated expressions, which may involve multiple
integrals. In subsequent chapters, this issue will be discussed in depth.
For the Markov process, the time taken in any transition is equal,
and hence the limiting probability of a state also denotes the propor-
tion of time spent in transitions to that particular state.
We will now show how we can obtain the limiting probabilities from
the TPM without raising the TPM to large powers. The following
important result provides a very convenient way for obtaining the
limiting probabilities.
Theorem 6.2 Let Π(i) denote the limiting probability of state i, and
let S denote the set of states in the Markov chain. Then the limiting
probabilities for all the states in the Markov chain can be obtained
from the transition probabilities by solving the following set of linear
equations:
Σ_{i=1}^{|S|} Π(i) P(i, j) = Π(j), for every j ∈ S, (6.4)
and Σ_{j=1}^{|S|} Π(j) = 1. (6.5)
Equations (6.4) and (6.5) are often collectively called the invariance
equation, since they help us determine the invariant (limiting) proba-
bilities. Equation (6.4) is often expressed in the matrix form as Π P = Π, where Π denotes the row vector of the limiting probabilities.
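As an illustration, the following Python sketch (ours, using numpy) computes the limiting probabilities of the TPM in Eq. (6.3) by stacking the balance equations of Eq. (6.4) with the normalization equation of Eq. (6.5) and solving the resulting consistent, over-determined system:

import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.2, 0.4],
              [0.6, 0.1, 0.3]])            # the TPM of Eq. (6.3)

n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])   # (P' - I) Pi = 0 and sum(Pi) = 1
b = np.zeros(n + 1); b[-1] = 1.0
Pi, *_ = np.linalg.lstsq(A, b, rcond=None)     # the limiting probabilities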
and 6.8. Another type of state is the absorbing state. Once the system
enters any absorbing state, it can never get out of that state and it
remains there.
An ergodic Markov chain is one in which all states are recurrent
and no absorbing states are present. Ergodic chains are also called
irreducible chains. All regular Markov chains are ergodic, but the
converse is not true. (Regular chains were defined in Sect. 3.1.1.) For
instance, a chain that is not regular may be ergodic. Consider the
Markov chain with the following TPM:
[ 0   1
  1   0 ].
This chain is not regular, but ergodic. It is ergodic because both states
are visited infinitely many times in an infinite viewing.
Figures 6.7 and 6.8. Schematics of Markov chains containing transient and recurrent states
called the embedded Markov chain. The main difference between the
semi-Markov process and the Markov process lies in the time taken in
transitions.
In general, when the distributions for the transition times are
arbitrary, the process goes by the name semi-Markov. If the time in
every transition is an exponentially distributed random variable, the
stochastic process is referred to as a continuous time Markov process.
Some authors refer to what we have called the continuous time
Markov process as the “Markov process,” and by a “Markov chain,”
they mean what we have referred to as the Markov process.
There is, however, a critical difference between the Markov chain
underlying a Markov process and that underlying a semi-Markov pro-
cess. In a semi-Markov process, the system jumps, but not necessarily
after unit time, and when it jumps, it jumps to a state that is different
than the current state. In other words, in a semi-Markov process, the
system cannot jump back to the current state. (However, in a semi-
Markov decision process, which we will discuss later, jumping back to
the current state is permitted.) In a Markov process, on the other
hand, the system can return to the current state after one jump.
If the time spent in the transitions is a deterministic quantity, the
semi-Markov process has a transition time matrix analogous to the
TPM, e.g.,
[ −   17.2
  1   −    ].
For an example of the most general model in which some or all of the
transition times are random variables from any given distributions,
consider the following transition time matrix:
[ −         unif(5, 6)
  expo(5)   −          ],
where unif (min, max) denotes a random number from the uniform
distribution with parameters, min and max, and expo(μ) denotes the
same from the exponential distribution with parameter μ.
When we analyze a semi-Markov process, we begin by analyzing
the Markov chain embedded in it. The next step usually is to analyze
the time spent in each transition. As we will see later, the semi-
Markov process is more powerful than the Markov process in modeling
real-life systems, although very often its analysis can prove to be more
complicated.
Now consider a policy μ̂ = (2, 1). The TPM associated with this
policy will contain the transition probabilities of action 2 in state 1
and the transition probabilities of action 1 in state 2. The TPM of
policy μ̂ is thus
0.1 0.9
Pμ̂ = .
0.4 0.6
Figure 6.10. Schematic showing how the TPM of policy (2, 1) is constructed from the TPMs of actions 1 and 2
Now consider a policy μ̂ = (2, 1). Like in the TPM case, the TRM
associated with this policy will contain the immediate reward of action
2 in state 1 and the immediate rewards of action 1 in state 2. Thus
the TRM of policy μ̂ can be written as
Rμ̂ = [ 45    80
       −14   6  ].
The TPM and the TRM of a policy together contain all the informa-
tion one needs to evaluate the policy in an MDP. In terms of notation,
we will denote the immediate reward, earned in going from state i to
state j, under the influence of action a, by:
r(i, a, j).
r(i, μ(i), j)
because μ(i) is the action that will be selected in state i when policy
μ̂ is used.
Performance metric. To compare policies, one must define a perfor-
mance metric (objective function). Naturally, the performance metric
should involve reward and cost elements. To give a simple analogy, in
a linear programming problem, one judges each solution on the basis
of the value of the associated objective function. Any optimization
problem has a performance metric, which is also called the objective
function. In this book, for the most part, the MDP will be studied
with respect to two performance metrics. They are:
1. Expected reward per unit time calculated over an infinitely long
trajectory of system states: We will refer to this metric as the
average reward.
2. Expected total discounted reward calculated over an infinitely long
trajectory of system states: We will refer to this metric as the
discounted reward.
Of the two performance metrics, average reward is easier to understand, although the average reward MDP is more
difficult to analyze for its convergence properties. Hence, we will begin
our discussion with the average reward performance criterion. Dis-
counted reward will be defined later.
We first need to define the expected immediate reward of a state
under the influence of a given action. Consider the following scenario.
An action a is selected in state i. Under the influence of this action,
the system can jump to three states: 1, 2, and 3 with probabilities of
0.2, 0.3, and 0.5, respectively, earning immediate rewards of 10, 12, and −14 in the corresponding transitions. (Figure: state i with arcs to states 1, 2, and 3, each arc labeled (x, y), where x = transition probability and y = transition reward.) The expected immediate reward of state i under action a is then: r̄(i, a) = 0.2(10) + 0.3(12) + 0.5(−14) = −1.4.
earned in each visit to state i under policy μ̂ is r̄(i, μ(i)). Then the
total long-run expected reward earned in k transitions for this MDP
can be written as:
Πμ̂ (i) denotes the limiting probability of state i when the system
(and hence the underlying Markov chain) is run with the policy μ̂
Assumption 6.1 The state space S and the action spaces A(i) for every i ∈ S are finite (although possibly quite large).
μ̂1 = (1, 1), μ̂2 = (1, 2), μ̂3 = (2, 1), and μ̂4 = (2, 2).
The TPMs and TRMs of these policies are constructed from the
individual TPMs and TRMs of each action. The TPMs are:
0.7 0.3 0.7 0.3
Pμ̂1 = ; Pμ̂2 = ;
0.4 0.6 0.2 0.8
0.9 0.1 0.9 0.1
Pμ̂3 = ; Pμ̂4 = .
0.4 0.6 0.2 0.8
(Figure: schematic of Example A, a two-state MDP; each arc is labeled (a, p, r), where a = action, p = transition probability, and r = immediate reward. The arc labels are (1, 0.7, 6), (1, 0.3, −5), (1, 0.4, 7), (1, 0.6, 12), (2, 0.9, 10), (2, 0.1, 17), (2, 0.2, −14), and (2, 0.8, 13).)
From the TPMs, using Eqs. (6.4) and (6.5), one can find the limiting
probabilities of the states associated with each policy. They are:
r̄(1, μ1 (1)) = p(1, μ1 (1), 1)r(1, μ1 (1), 1) + p(1, μ1 (1), 2)r(1, μ1 (1), 2)
r̄(2, μ1 (2)) = p(2, μ1 (2), 1)r(2, μ1 (2), 1) + p(2, μ1 (2), 2)r(2, μ1 (2), 2)
r̄(1, μ2 (1)) = p(1, μ2 (1), 1)r(1, μ2 (1), 1) + p(1, μ2 (1), 2)r(1, μ2 (1), 2)
r̄(2, μ2 (2)) = p(2, μ2 (2), 1)r(2, μ2 (2), 1) + p(2, μ2 (2), 2)r(2, μ2 (2), 2)
r̄(1, μ3 (1)) = p(1, μ3 (1), 1)r(1, μ3 (1), 1) + p(1, μ3 (1), 2)r(1, μ3 (1), 2)
r̄(2, μ3 (2)) = p(2, μ3 (2), 1)r(2, μ3 (2), 1) + p(2, μ3 (2), 2)r(2, μ3 (2), 2)
r̄(1, μ4 (1)) = p(1, μ4 (1), 1)r(1, μ4 (1), 1) + p(1, μ4 (1), 2)r(1, μ4 (1), 2)
r̄(2, μ4 (2)) = p(2, μ4 (2), 1)r(2, μ4 (2), 1) + p(2, μ4 (2), 2)r(2, μ4 (2), 2)
Thus:
ρμ̂1 = Πμ̂1 (1)r̄(1, μ1 (1)) + Πμ̂1 (2)r̄(2, μ1 (2))
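The computation just carried out for each policy is mechanical and can be sketched in Python (ours; P_mu is the TPM of the policy and rbar_mu the vector of expected immediate rewards r̄(i, μ(i))):

import numpy as np

def average_reward(P_mu, rbar_mu):
    # rho = sum_i Pi(i) * rbar(i, mu(i)), with Pi from Eqs. (6.4)-(6.5).
    n = P_mu.shape[0]
    A = np.vstack([P_mu.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    Pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(Pi @ rbar_mu)

Exhaustive evaluation then amounts to calling average_reward for each of the four policies and picking the one with the largest ρ.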
If possible, one should set μk+1 (i) = μk (i) for each i. The signifi-
cance of ∈ in the above needs to be understood clearly. There may
be more than one action that satisfies the argmax operator. Thus
there may be multiple candidates for μk+1 (i). However, the latter
is selected in a way such that μk+1 (i) = μk (i) if possible.
Step 4. If the new policy is identical to the old one, that is, if
μk+1 (i) = μk (i) for each i, then stop and set μ∗ (i) = μk (i) for
every i. Otherwise, increment k by 1, and go back to the second
step.
Table 6.1. Calculations in policy iteration for average reward MDPs on Example A
mechanism here employs the span seminorm, also called the span. We will denote the span seminorm of a vector in this book by sp(·) and define it as: sp(x) = max_i x(i) − min_i x(i).
Step 3: If
sp(J^{k+1} − J^k) < ε,
go to Step 4. Otherwise increase k by 1, and go back to Step 2.
Step 4: For each i ∈ S, choose
d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ],
and stop. The ε-optimal policy is d̂.
The implication of ε-optimality (in Step 4 above) needs to be understood. The smaller the value of ε, the closer we get to the optimal policy. Usually, for small values of ε, one obtains policies very close to optimal. The span of the difference vector (J^{k+1} − J^k) keeps getting smaller in every iteration, and hence for a given positive value of ε, the algorithm terminates in a finite number of iterations.
After calculations in this step for all states are complete, set ρ =
J k+1 (i∗ ).
Table 6.2. Calculations in value iteration for average reward MDPs: Note that the
values get unbounded but the span of the difference vector gets smaller with every
iteration. We start with J 1 (1) = J 1 (2) = 0
Step 4: If
sp(J^{k+1} − J^k) < ε,
go to Step 5. Otherwise increase k by 1 and go back to Step 2.
Step 5: For each i ∈ S, choose
d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ],
k denotes the number of transitions (or time assuming that each tran-
sition takes unit time) over which the system is observed, xs denotes
the state from where the sth jump or state transition occurs under the
policy μ̂, and E denotes the expectation operator over all trajectories
that start under the condition specified within the square brackets.
If you have trouble understanding why we use lim inf here, you may
replace it by lim at this stage.
It can be shown that for policies with regular Markov chains, the
average reward is independent of the starting state i, and hence, ρ(i)
can be replaced by ρ. Intuitively, the above expression says that the
average reward for a given policy is
(the expected sum of rewards earned in a very long trajectory) / (the number of transitions in the same trajectory).
In the above, we assume that the associated policy is pursued within
the trajectory. We now discuss the other important performance
metric typically studied with an MDP: the discounted reward.
Table 6.3. Calculations in relative value iteration for average reward MDPs: ε = 0.001; the ε-optimal policy is found at k = 12; J¹(1) = J¹(2) = 0
The idea of discounting is related to the fact that the value of money
reduces with time. To give a simple example: a dollar tomorrow is
worth less than a dollar today. The discounting factor is the fraction
by which money gets devalued in unit time. So for instance, if I earn $3
today, $5 tomorrow, $6 the day after tomorrow, and if the discounting
factor is 0.9 per day, then the present worth of my earnings will be:
3 + (0.9)5 + (0.9)²6.
The reason for raising 0.9 to the power of 2 is that tomorrow, the
present worth of day-after-tomorrow’s earning will be 0.9(6). Hence
today, the present worth of this amount will be 0.9[0.9(6)] = (0.9)²6.
In general, if the discounting factor is λ, and if e(t) denotes the
earning in the tth period of time, then the present worth of earnings
over n periods of time can be denoted by: e(1) + λe(2) + λ²e(3) + · · · + λ^{n−1}e(n).
In Eq. (6.19),
E[ Σ_{s=1}^{k} λ^{s−1} r(xs, μ(xs), xs+1) | x1 = i ] =
E[ r(i, μ(i), x2) + λ r(x2, μ(x2), x3) + · · · + λ^{k−1} r(xk, μ(xk), xk+1) ].
This should make it obvious that the discounted reward of a policy is
measured using the format discussed in Eq. (6.18).
The above means that the optimal policy will have a value function
vector that satisfies the following property: each element of the vector
is greater than or equal to the corresponding element of the value
function vector of any other policy. This concept is best explained
with an example.
Consider a 2-state Markov chain with 4 allowable policies denoted
by μ̂1 , μ̂2 , μ̂3 , and μ̂4 . Let the value function vector be defined by
vμ̂1 (1) = 3; vμ̂2 (1) = 8; vμ̂3 (1) = −4; vμ̂4 (1) = 12;
vμ̂1 (2) = 7; vμ̂2 (2) = 15; vμ̂3 (2) = 1; vμ̂4 (2) = 42;
Now, from our definition of an optimal policy, policy μ̂4 should be the
optimal policy since the value function vector assumes the maximum
value for this policy for each state. Now, the following question should
arise in your mind at this stage. What if there is no policy for which
the value function is maximized for each state? For instance consider
the following scenario:
vμ̂1 (1) = 3; vμ̂2 (1) = 8; vμ̂3 (1) = −4; vμ̂4 (1) = 12;
vμ̂1 (2) = 7; vμ̂2 (2) = 15; vμ̂3 (2) = 1; vμ̂4 (2) = −5;
In the above setting, there is no one policy for which the value function
is maximized for all the states. Fortunately, it has been proved that
under the assumptions we have made above, there exists an optimal
policy; in other words, there exists a policy for which the value function
is maximized in each state. The interested reader is referred to [30,
270], among other sources, for the proof of this.
The important point that we need to address next is: how does
one find the value function of any given policy? Equation (6.19) does
not provide us with any direct mechanism for this purpose. Like in
the average reward case, we will need to turn to the Bellman policy
equation.
By solving the Bellman equation, one can obtain the value func-
tion vector associated with a given policy. Clearly, the value function
vectors associated with each policy can be evaluated by solving the
respective Bellman equations. Then, from the value function vectors
obtained, it is possible to determine the optimal policy. This method
is called the method of exhaustive enumeration.
Like in the average reward case, the method of exhaustive enu-
meration is not a very efficient method to solve the MDP, since its
computational burden is enormous. For a problem of 10 states with
two allowable actions in each, one would need to evaluate 2¹⁰ policies.
The method of policy iteration is considerably more efficient.
Step 1. Set k = 1. Here k will denote the iteration number. Let the
number of states be |S|. Select any policy in an arbitrary manner.
Let us denote the policy selected in the kth iteration by μ̂k . Let μ̂∗
denote the optimal policy.
h^k(i) = r̄(i, μk(i)) + λ Σ_{j=1}^{|S|} p(i, μk(i), j) h^k(j). (6.21)
Step 4. If the new policy is identical to the old one, i.e., if μk+1 (i) =
μk (i) for each i, then stop and set μ∗ (i) = μk (i) for every i. Oth-
erwise, increment k by 1, and return to Step 2.
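For the discounted case, the policy evaluation and policy improvement steps reduce to one linear solve and one argmax, as in this Python sketch (ours; P[a] is the TPM of action a and rbar[a] the corresponding vector of expected immediate rewards):

import numpy as np

def evaluate_policy(P_mu, rbar_mu, lam):
    # Eq. (6.21): h = rbar + lam * P_mu h, i.e., (I - lam * P_mu) h = rbar.
    n = P_mu.shape[0]
    return np.linalg.solve(np.eye(n) - lam * P_mu, rbar_mu)

def improve_policy(P, rbar, h, lam):
    # One improvement step: pick, in each state, an action maximizing
    # rbar(i, a) + lam * sum_j p(i, a, j) h(j).
    q = np.array([rbar[a] + lam * P[a] @ h for a in range(len(P))])
    return q.argmax(axis=0)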
Like in the average reward case, we will next discuss the value
iteration method. The value iteration method is also called the method
of successive approximations (in the discounted reward case). This is
because the successive application of the Bellman operator in the dis-
counted case does lead one to the optimal value function. Recall that
in the average reward case, the value iteration operator may not keep
the iterates bounded. Fortunately this is not the case in the discounted
problem.
J*(i) = max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j) J*(j) ] for each i ∈ S. (6.22)
The notation is similar to that defined for the average reward case.
Equation (6.22), i.e., the Bellman optimality equation for discounted
reward contains the max operator; hence, it cannot be solved using
linear algebra techniques, e.g., Gaussian elimination. However, the
value iteration method forms a convenient solution method. In value
iteration, one starts with some arbitrary values for the value function
vector. Then a transformation, derived from the Bellman optimality
equation, is applied on the vector successively until the vector starts
approaching a fixed value. The fixed value is also called a fixed point.
We will discuss issues such as convergence to fixed points in Chap. 11
in a more mathematically rigorous framework. However, at this stage,
it is important to get an intuitive feel for a fixed point.
If a transformation has a unique fixed point, then no matter what
vector you start with, if you keep applying the transformation repeat-
edly, you will eventually reach the fixed point. Several operations
research algorithms are based on such transformations.
We will now present step-by-step details of the value iteration algo-
rithm. In Step 3, we will need to calculate the max norm of a vector.
See Chap. 1 for a definition of max norm. We will use the notation ||.||
to denote the max norm.
Steps in value iteration for MDPs.
Step 1: Set k = 1. Select arbitrary values for the elements of a vector
of size |S|, and call the vector J¹. Specify ε > 0.
Step 2: For each i ∈ S, compute:
J^{k+1}(i) ← max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ].
Step 3: If
||J^{k+1} − J^k|| < ε(1 − λ)/(2λ),
go to Step 4. Otherwise increase k by 1 and go back to Step 2.
Step 4: For each i ∈ S, choose
d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ] and stop.
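The four steps above translate directly into the following Python sketch (ours; P[a] and rbar[a] hold the model for action a):

import numpy as np

def value_iteration(P, rbar, lam, eps=1e-3):
    n = P[0].shape[0]
    J = np.zeros(n)                      # arbitrary starting vector
    while True:
        Q = np.array([rbar[a] + lam * P[a] @ J for a in range(len(P))])
        J_new = Q.max(axis=0)            # the Step 2 update
        # Step 3: max norm checked against eps * (1 - lam) / (2 * lam).
        if np.max(np.abs(J_new - J)) < eps * (1 - lam) / (2 * lam):
            return J_new, Q.argmax(axis=0)   # eps-optimal values and policy
        J = J_new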
Table 6.5. Calculations in value iteration for discounted reward MDPs: The value of ε is 0.001. The norm is checked against 0.5ε(1 − λ)/λ = 0.000125. When k = 53, the ε-optimal policy is found; we start with J(1) = J(2) = 0
Table 6.6. Gauss-Seidel value iteration for discounted reward MDPs: Here ε = 0.001; the norm is checked against 0.5ε(1 − λ)/λ = 0.000125; the ε-optimal policy is found at k = 33; we start with J¹(1) = J¹(2) = 0
(Figure: a deterministic transition from state i to state j under a policy d̂, earning the immediate reward r(i, d(i), j); v(i) and v(j) denote the values of the two states.)
Let us next discuss what happens when the state transitions are
probabilistic and there is a discounting factor λ. When the system
is in a state i, it may jump to any one of the states in the system.
Consider Fig. 6.14.
Figure 6.14. A transition from state i to a state j, one of the several states to which the system may jump, earning r(i, d(i), j)
vd̂(i) = Σ_{j=1}^{|S|} p(i, d(i), j) [ r(i, d(i), j) + λ vd̂(j) ],
which turns out to be the Bellman equation for a policy d̂ for state i.
We hope that this discussion has served as an intuitive basis for the
Bellman policy equation.
The Bellman optimality equation has a similar intuitive explanation: in each transition, to obtain the optimal value function at the current state i, one takes the maximum, over all actions, of the sum of the immediate reward earned in going to the next state j and the "best" (optimal) value function of state j. Of course, like in the policy equation, we must compute an expectation over all values of j.
expectation over all values of j. We now discuss semi-Markov decision
problems.
Assume that the SMDP has two states numbered 1 and 2. Also,
assume that the time spent in a transition from state 1 is uniformly
distributed with a minimum value of 1 and a maximum of 2 (Unif(1,2)),
while the same from state 2 is exponentially distributed with a mean
of 3 (EXPO(3)); these times are the same for every action. Then, for
generating the TTMs, we need to use the following values. For all
values of a,
t̄(1, a, 1) = t̄(1, a, 2) = 1.5; t̄(2, a, 1) = t̄(2, a, 2) = 3.
Obviously, the time could follow any distribution. If the distributions
are not available, we must have access to the expected values of the
transition time, so that we have values for each t̄(i, a, j) term in the
model. These values are needed for solving the problem via dynamic
programming.
It could also be that the time spent depends on the action. Thus, for instance, we could also represent the distributions within the TTM. We will use Ta to denote the TTM for action a. For a 2-state,
2-action problem, consider the following data:
T1 = [ Unif(5, 6)  12; Unif(14, 16)  Unif(5, 12) ];  T2 = [ Unif(45, 60)  Unif(32, 64); Unif(14, 16)  Unif(12, 15) ].
does not return to itself after one transition. This implies that the
natural process remains in a state for a certain amount of time and
then jumps to a different state.
The decision process has a different nature. It records only those
states in which an action needs to be selected by the decision-maker.
Thus, the decision process may come back to itself after one transition.
A decision-making state is one in which the decision-maker makes a
decision. All states in a Markov chain may not be decision-making
states; there may be several states in which no decision is made. Thus
typically a subset of the states in the Markov chain tends to be the set
of decision-making states. Clearly, as the name suggests, the decision-
making process records only the decision-making states.
For example, consider a Markov chain with four states numbered 1, 2, 3, and 4. States 1 and 2 are decision-making states while 3 and
4 are not. Now consider the following trajectory:
1, 3, 4, 3, 2, 3, 2, 4, 2, 3, 4, 3, 4, 3, 4, 3, 4, 1.
In this trajectory, the NP will look identical to what we see above.
The DMP however will be:
1, 2, 2, 2, 1.
This example also explains why the NP may change several times
between one change of the DMP. It should also be clear that the DMP
and NP coincide on the decision-making states (1 and 2).
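In code, extracting the DMP from the NP is a one-line filter (a Python sketch, ours):

trajectory = [1, 3, 4, 3, 2, 3, 2, 4, 2, 3, 4, 3, 4, 3, 4, 3, 4, 1]
decision_states = {1, 2}
dmp = [s for s in trajectory if s in decision_states]
# dmp == [1, 2, 2, 2, 1]; the NP is the full trajectory.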
We need to calculate the value functions of only the decision-making
states. In our discussions on MDPs, when we said “state,” we meant
a decision-making state. Technically, for the MDP, the non-decision-
making states enter the analysis only when we calculate the imme-
diate rewards earned in a transition from a decision-making state to
another decision-making state. In the SMDP, calculation of the tran-
sition rewards and the transition times needs taking into account the
non-decision-making states visited. This is because in the transition
from one decision-making state to another, the system may have vis-
ited non-decision-making states multiple times, which can dictate (1)
the value of the immediate reward earned in the transition and (2) the
transition time.
In simulation-based DP (reinforcement learning), the issue of iden-
tifying non-decision-making states becomes less critical because the
simulator calculates the transition reward and transition time in tran-
sitions between decision-making states; as such we need not worry
about the existence of non-decision-making states. However, if one
wished to set up the model, i.e., the TRM and the TPM, careful
attention must be paid to this issue.
where xs is the state from where the sth jump (or state transition)
occurs. The expectation is over the different trajectories that may be
followed under the conditions within the square brackets.
The notation inf denotes the infimum (and sup denotes the supre-
mum). An intuitive meaning of inf is minimum and that of sup is
maximum. Technically, the infimum (supremum) is not equivalent to
the minimum (maximum); however at this stage you can use the two
interchangeably. The use of the infimum here implies that the average reward of a policy is the minimum value of the total reward divided by the total time in the trajectory. Thus, it provides us with the lowest possible value for the average reward.
It can be shown that the average reward is not affected by the state
from which the trajectory of the system starts. Therefore, one can get
rid of i in the definition of average reward. The average reward on
the other hand depends on the policy used. Solving the SMDP means
finding the policy that returns the highest average reward.
where t̄(i, a, j) is the expected time spent in one transition from state
i to state j under the influence of action a. Now, the average reward
of an SMDP can also be defined as:
ρμ̂ = [ Σ_{i=1}^{|S|} Πμ̂(i) r̄(i, μ(i)) ] / [ Σ_{i=1}^{|S|} Πμ̂(i) t̄(i, μ(i)) ], (6.23)
where
r̄(i, μ(i)) and t̄(i, μ(i)) denote the expected immediate reward
earned and the expected time spent, respectively, in a transition
from state i under policy μ̂ and
Πμ̂ (i) denotes the limiting probability of the underlying Markov
chain for state i when policy μ̂ is followed.
The numerator in the above denotes the expected immediate reward
in any given transition, while the denominator denotes the expected
time spent in any transition. The above formulation (see e.g., [30])
is based on the renewal reward theorem (see Johns and Miller [155]),
which essentially states that
expected reward earned in a cycle
ρ = average reward per unit time = . (6.24)
expected time spent in a cycle
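In code, Eq. (6.23) is a small extension of the MDP computation (a Python sketch, ours; tbar_mu holds the expected transition times t̄(i, μ(i))):

import numpy as np

def smdp_average_reward(P_mu, rbar_mu, tbar_mu):
    # rho = sum_i Pi(i) rbar(i) / sum_i Pi(i) tbar(i), per Eq. (6.23).
    n = P_mu.shape[0]
    A = np.vstack([P_mu.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    Pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(Pi @ rbar_mu) / float(Pi @ tbar_mu)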
μ̂1 = (1, 1), μ̂2 = (1, 2), μ̂3 = (2, 1), and μ̂4 = (2, 2).
Tμ̂1 = [1 5; 120 60]; Tμ̂2 = [1 5; 7 2]; Tμ̂3 = [50 75; 120 60]; Tμ̂4 = [50 75; 7 2].
The TPMs and TRMs were calculated in Sect. 3.3.2. The value of
each t̄(i, μ(i)) term can be calculated from the TTMs in a manner
similar to that used for calculation of r̄(i, μ(i)). The values are:
t̄(1, μ1 (1)) = p(1, μ1 (1), 1)t̄(1, μ1 (1), 1) + p(1, μ1 (1), 2)t̄(1, μ1 (1), 2)
Step 1. Set k = 1. Here k will denote the iteration number. Let the
number of states be |S|. Select any policy in an arbitrary manner.
Let us denote the policy selected by μ̂k . Let μ̂∗ denote the optimal
policy.
Step 2. (Policy Evaluation) Solve the following linear system of
equations.
h^k(i) = r̄(i, μk(i)) − ρ^k t̄(i, μk(i)) + Σ_{j=1}^{|S|} p(i, μk(i), j) h^k(j). (6.26)
J*(i) = max_{a∈A(i)} [ r̄(i, a) − ρ* t̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J*(j) ] for each i ∈ S. (6.27)
The following remarks will explain the notation.
The J ∗ terms are the unknowns. They are the components of the
optimal value function vector J ∗ . The number of elements in the
vector J ∗ equals the number of states in the SMDP.
The term t̄(i, a) denotes the expected time of transition from state
i when action a is selected in state i.
The term ρ∗ denotes the average reward associated with the optimal
policy.
Now in an MDP, although ρ∗ is unknown, it is acceptable to replace
ρ∗ by 0 (which is the practice in regular value iteration for MDPs),
or to replace it by the value function associated with some state of
the Markov chain (which is the practice in relative value iteration
for all (i, a) pairs, where J k (i) denotes the estimate of the value func-
tion element for the ith state in the kth iteration of the value iteration
algorithm. Let us define W (i, a), which deletes the ρ∗ and also the time
[Table 6.7. Calculations in policy iteration for average reward SMDPs (Example B)]
Now, consider an SMDP with two actions in each state, where t̄(i, 1) ≠ t̄(i, 2). For this case, the above value iteration update can be written as:

$$J^{k+1}(i) \leftarrow \max\{W(i, 1),\, W(i, 2)\}. \qquad (6.29)$$
If regular value iteration, as defined for the MDP, is used here, one
must not only ignore the ρ∗ term but also the time term. Then, an
update based on a regular value iteration for the SMDP will be (we
will show below that the following equation is meaningless):
$$J^{k+1}(i) \leftarrow \max_{a \in A(i)} \left[ \bar{r}(i, a) + \sum_{j=1}^{|S|} p(i, a, j)\,J^k(j) \right], \qquad (6.30)$$
where the replacements for r̄(i, a) and p(i, a, j) are denoted by r̄ϑ (i, a)
and pϑ (i, a, j), respectively, and are defined as:
$$h_{\hat{\mu}}(i) = \bar{r}(i, \mu(i)) + \sum_{j=1}^{|S|} e^{-\gamma \bar{t}(i, \mu(i), j)}\,p(i, \mu(i), j)\,h_{\hat{\mu}}(j)$$

$$h^k(i) = \bar{r}(i, \mu_k(i)) + \sum_{j=1}^{|S|} e^{-\gamma \bar{t}(i, \mu_k(i), j)}\,p(i, \mu_k(i), j)\,h^k(j). \qquad (6.33)$$
Step 3: If

$$sp(J^{k+1} - J^k) < \epsilon,$$

go to Step 4. Otherwise increase k by 1, and go back to Step 2.

Step 4: For each i ∈ S, choose

$$d(i) \in \arg\max_{a \in A(i)} \left[ \bar{r}(i, a) + \sum_{j=1}^{|S|} e^{-\gamma \bar{t}(i, a, j)}\,p(i, a, j)\,J^k(j) \right],$$

and stop.
$$\sum_{j \in S} p(i, \mu(i), j)\,r_L(i, \mu(i), j) + R(i, \mu(i)) + \sum_{j \in S} \int_0^{\infty} e^{-\gamma\tau} f_{i,\mu(i),j}(\tau)\,J_{\mu}(j)\,d\tau,$$

$$\text{where } R(i, a) = \sum_{j \in S} r_C(i, a, j) \int_0^{\infty} \frac{1 - e^{-\gamma\tau}}{\gamma}\,f_{i,a,j}(\tau)\,d\tau$$

$$\max_{a \in A(i)} \left[ \sum_{j \in S} p(i, a, j)\,r_L(i, a, j) + R(i, a) + \sum_{j \in S} \int_0^{\infty} e^{-\gamma\tau} f_{i,a,j}(\tau)\,J(j)\,d\tau \right].$$
Step 3b. If

$$\|W^q - J^k\| < \epsilon(1 - \lambda)/2\lambda,$$

go to Step 4. Otherwise go to Step 3c.

Step 3c. If q = m_k, go to Step 3e. Otherwise, for each i ∈ S, compute:

$$W^{q+1}(i) \leftarrow \bar{r}(i, \mu_{k+1}(i)) + \lambda \sum_{j \in S} p(i, \mu_{k+1}(i), j)\,W^q(j).$$
Step 3b. If

$$sp(W^q - J^k) \le \epsilon,$$

go to Step 4. Otherwise go to Step 3c.

Step 3c. If q = m_k, go to Step 3e. Otherwise, for each i ∈ S, compute:

$$W^{q+1}(i) = \bar{r}(i, \mu_{k+1}(i)) + \sum_{j \in S} p(i, \mu_{k+1}(i), j)\,W^q(j).$$
Minimize ρ subject to

$$\rho + v(i) - \sum_{j=1}^{|S|} p(i, \mu(i), j)\,v(j) \ge \bar{r}(i, \mu(i)) \quad \text{for } i = 1, 2, \ldots, |S| \text{ and all } \mu(i) \in A(i).$$
where x∗ (i, a) denotes the optimal value of x(i, a) obtained from solv-
ing the LP above, and d(i, a) will contain the optimal policy. Here
for all policies μ̂, then v(i) is an upper bound for the optimal value
v*(i). This paves the way for an LP. The formulation, in which the v
terms are the decision variables and the x(j) terms are fixed positive
scalars (an LP objective cannot contain products of decision variables), is:

Minimize $\sum_{j=1}^{|S|} x(j)\,v(j)$ subject to

$$v(i) - \lambda \sum_{j=1}^{|S|} p(i, \mu(i), j)\,v(j) \ge \bar{r}(i, \mu(i)) \quad \text{for } i = 1, 2, \ldots, |S| \text{ and all } \mu(i) \in A(i),$$

where x(j) > 0 for j = 1, 2, . . . , |S| with $\sum_{j=1}^{|S|} x(j) = 1$, and v(j) is unrestricted in sign (URS) for j = 1, 2, . . . , |S|.
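As a sketch (assuming scipy and a hypothetical two-state, two-action MDP; the x(j) here are the fixed positive weights), the LP above can be handed to a generic solver as follows.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action discounted MDP (lambda = 0.8).
lam = 0.8
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),    # P[a][i, j]
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
r_bar = {1: np.array([6.0, -5.0]), 2: np.array([10.0, 17.0])}
x = np.array([0.5, 0.5])           # fixed positive weights summing to 1

# Constraint v(i) - lam * sum_j p(i,a,j) v(j) >= r_bar(i,a) is rewritten
# as (lam * P_a[i] - e_i) . v <= -r_bar(i,a) for linprog's A_ub v <= b_ub.
A_ub, b_ub = [], []
for a in P:
    for i in range(2):
        A_ub.append(lam * P[a][i] - np.eye(2)[i])
        b_ub.append(-r_bar[a][i])

res = linprog(c=x, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * 2)   # v is unrestricted in sign
print(res.x)                               # the optimal value function
```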
Note that in the infinite horizon setting, the total expected reward is
usually infinite, but that is not the case here. As such, the total
expected reward is a useful metric in the finite horizon MDP.
In this setting, every time the Markov chain jumps, we will
assume that the number of stages (or time) elapsed since the start
[Figure: a horizontal sequence of stages 1, 2, . . . , T, T + 1; stages 1 through T are the decision-making stages, and stage T + 1 is the terminal (non-decision-making) stage.]
the optimal solution. We will now discuss the main idea underlying
the backward recursion technique for solving the problem.
The backward recursion technique starts with finding the values of
the states in the T th stage (the final decision-making stage). For this,
it uses (6.37). In the latter, one needs the values of the states in
the next stage. We assume the values in the (T + 1)th stage to be
known (they will all be zero by our convention). Having determined
the values in the T th stage, we will move one stage backwards, and
then determine the values in the (T − 1)th stage.
The values in the (T − 1)th stage will be determined by using the
values in the T th stage. In this way, we will proceed backward one
stage at a time and find the values of all the stages. During the evaluation of the values, the optimal actions in each of the states will also be identified using the Bellman equation. We now present a step-by-step description of the backward recursion algorithm in the context of discounted reward. The expected total reward version is obtained by setting λ = 1 in the discounted reward algorithm.
A Backward Recursion. Review the notation provided for Eq. (6.37). For each stage s = T, T − 1, . . . , 1 and each state i, the recursion evaluates

$$J^*(i, s) = \max_{a \in A(i, s)} \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) + \lambda J^*(j, s + 1) \right],$$

with J*(j, T + 1) = 0 for all j by our convention.
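The following sketch (numpy assumed; the transition data are hypothetical and, for simplicity, stationary across stages) implements the backward recursion just described; setting λ = 1 gives the total expected reward case.

```python
import numpy as np

def backward_recursion(P, R, T, lam=0.8):
    """Finite-horizon DP sketch: P[a][i, j] and R[a][i, j] are the
    transition probabilities and rewards for action a; the values of
    stage T + 1 are zero by convention."""
    n = P[0].shape[0]
    J = np.zeros(n)                       # values of stage s + 1
    policy = np.zeros((T, n), dtype=int)
    for s in range(T - 1, -1, -1):        # stages T, T-1, ..., 1 (0-indexed)
        Q = np.array([(P[a] * (R[a] + lam * J)).sum(axis=1)
                      for a in range(len(P))])
        policy[s] = Q.argmax(axis=0)      # optimal action per state, stage s
        J = Q.max(axis=0)
    return J, policy

P = [np.array([[0.7, 0.3], [0.4, 0.6]]),
     np.array([[0.9, 0.1], [0.2, 0.8]])]
R = [np.array([[6.0, -5.0], [7.0, 12.0]]),
     np.array([[10.0, 17.0], [-14.0, 13.0]])]
print(backward_recursion(P, R, T=5))
```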
11. Conclusions
This chapter discussed the fundamental ideas underlying MDPs and
SMDPs. The focus was on a finite state and action space within
discrete-event systems. The important methods of value and policy
iteration (DP) were discussed, and the two forms of the Bellman equa-
tion, the optimality equation and the policy equation, were presented
for both average and discounted reward. The modified policy iteration
algorithm was also discussed. Linear programming for solving MDPs, along with finite-horizon control, was covered briefly towards the end. Our goal in this chapter was to provide some of the theory
underlying dynamic programming for solving MDPs and SMDPs that
can also be used in the simulation-based context of the next chapter.
as good as new. (5) Let i denote the number of days elapsed since
the last preventive maintenance or repair (subsequent to a failure);
then the probability of failure during the ith day can be modeled as
1 − ξψ i+2 , where ξ and ψ are scalars in the interval (0, 1), whose values
can be estimated from the data for time between successive failures of
the system.
We will use i to denote the state of the system, since this leads to a
Markov chain. In order to construct a finite Markov chain, we define
for any given positive value of ε, ī to be the minimum integer
value of i such that the probability of failure on the īth day is greater than
or equal to (1 − ε). Since we will set ε to some pre-fixed value, we can
drop ε from our notation. In theory, the line will have some probability
of not failing after any given day, making the state space infinite, but
our definition of ī permits truncation of the infinite state space to a
finite one. The resulting state space will be: S = {0, 1, 2, . . . , ī}. This
means that the probability of failure on the īth day (which is very
close to 1) will be assumed to equal 1.
Clearly, when a maintenance or repair is performed, i will be set
to 0. If a successful day of production occurs, i.e., the line does not
fail during the day, the state of the system is incremented by 1. The
action space is: {produce, maintain}. Cm and Cr denote the cost
of one maintenance and one repair respectively. Then, we have the
following transition probabilities for the system.
For action produce: For i = 0, 1, 2, . . . , ī − 1,

$$p(i, \text{produce}, i + 1) = \xi\psi^{i+2}; \qquad p(i, \text{produce}, 0) = 1 - \xi\psi^{i+2}.$$

For i = ī, p(i, produce, 0) = 1. For all other cases not specified above,
p(., produce, .) = 0. Further, for all values of i, p(i, maintain, 0) = 1,
with p(., maintain, .) = 0 in all other cases.
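A small sketch (numpy assumed; the values of ξ, ψ, and ε are hypothetical) of how the truncated state space and the two transition probability matrices above can be constructed:

```python
import numpy as np

xi, psi, eps = 0.99, 0.96, 0.01

# Truncation point: smallest i with failure probability >= 1 - eps.
i_bar = 0
while 1.0 - xi * psi ** (i_bar + 2) < 1.0 - eps:
    i_bar += 1
n = i_bar + 1                        # states 0, 1, ..., i_bar

P_produce = np.zeros((n, n))
for i in range(i_bar):
    p_fail = 1.0 - xi * psi ** (i + 2)
    P_produce[i, 0] = p_fail         # failure: repair resets state to 0
    P_produce[i, i + 1] = 1.0 - p_fail   # a successful production day
P_produce[i_bar, 0] = 1.0            # failure assumed certain at i_bar

P_maintain = np.zeros((n, n))
P_maintain[:, 0] = 1.0               # maintenance always resets state to 0
```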
1. Chapter Overview
This chapter focuses on a relatively new methodology called
reinforcement learning (RL). RL will be presented here as a form
of simulation-based dynamic programming, primarily used for solving
Markov and semi-Markov decision problems. Pioneering work in the
area of RL was performed within the artificial intelligence commu-
nity, which views it as a “machine learning” method. This perhaps
explains the roots of the word “learning” in the name reinforcement
learning. We also note that within the artificial intelligence commu-
nity, “learning” is sometimes used to describe function approximation,
e.g., regression. Some kind of function approximation, as we will see
below, usually accompanies RL. The word “reinforcement” is linked
to the fact that RL algorithms can be viewed as agents that learn
through trial and error (feedback).
But other names have also been suggested for RL. Some examples
are neuro-dynamic programming (NDP) (see Bertsekas and Tsitsik-
lis [33]) and adaptive or approximate dynamic programming (ADP)
(coined in Werbös [314]). Although we will stick to the original name,
reinforcement learning, we emphasize that our presentation here will
be through the viewpoint of dynamic programming.
For this chapter, the reader should review the material presented in
the previous chapter. In writing this chapter, we have followed an order
that differs somewhat from the one followed in the previous chapter.
[Table 7.1. A comparison of RL, DP, and heuristics: note that both DP and RL use the MDP model.]

Unlike DP, RL does not require the transition probability matrices; it
does need the distributions of the random variables that govern the
system's behavior. The transition rewards and the transition times
are automatically calculated within a simulator. The avoidance of
transition probabilities is not a miracle, but a fact backed by simple
mathematics. We will discuss this issue in great detail.
To summarize our discussion, RL is a useful technique for large-scale
MDPs and SMDPs on which DP is infeasible and heuristics provide
poor solutions. In general, however, if one has access to the transition
probabilities, rewards, and times, DP should be used because it is
guaranteed to generate optimal solutions, and RL is not necessary
there.
[Figure: a flowchart contrasting reinforcement learning with classical dynamic programming. In both, the inputs are the distributions of the governing random variables. In classical DP, these inputs are used to generate the transition probability and reward matrices, which a DP algorithm then converts into an optimal solution. In RL, an RL algorithm run inside a simulator produces a near-optimal solution directly.]
to generate the TPMs and TRMs, and the next step is to use these
matrices in a suitable algorithm to generate a solution. In RL, we
do not estimate the TPM or the TRM but instead simulate the sys-
tem using the distributions of the governing random variables. Then,
within the simulator, a suitable algorithm (clearly different than the
DP algorithm) is used to obtain a solution.
In what follows in this section, we discuss some fundamental
RL-related concepts. A great deal of RL theory is based on the
Q-factor, the Robbins-Monro algorithm [247], and step sizes, and on
how these ideas come together to help solve MDPs and SMDPs within
simulators. Our discussion in this chapter will be geared towards
helping build an intuitive understanding of the algorithms in RL.
Hence, we will derive the algorithms from their DP counterparts to
strengthen our intuition. More sophisticated arguments of convergence
and existence will be dealt with later (in Chap. 11).
For the reader’s convenience, we now define the following sets again:
1. S: the set of states in the system. These states are those in which
decisions are made. In other words, S denotes the set of decision-
making states. Unless otherwise specified, in this book, a state will
mean the same thing as a decision-making state.
2. A(i): the set of actions allowed in state i.
Both types of sets, S and A(i) (for all i ∈ S), will be assumed to
be finite in our discussion. We will also assume that the Markov chain
associated with every policy in the MDP (or the SMDP) is regular.
Please review the previous chapter for a definition of regularity.
3.1. Q-Factors
RL algorithms (for the most part) use the value function of DP. In
RL, the value function is stored in the form of the so-called Q-factors.
Recall the definition of the value function associated with the optimal
policy for discounted reward MDPs. It should also be recalled that
this value function is defined by the Bellman optimality equation,
which we restate here:
$$J^*(i) = \max_{a \in A(i)} \left( \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) + \lambda J^*(j) \right] \right) \text{ for all } i \in S, \qquad (7.1)$$

where J*(i) denotes the ith element of the value function vector associated with the optimal policy
for all (i, a). Equation (7.4) is an extremely important equation. It can
be viewed as the Q-factor version of the Bellman optimality equation
for discounted reward MDPs. This equation leads us to a Q-factor
version of value iteration. This version can be used to determine the
optimal Q-factors for a given state and forms the Q-factor counterpart
of the value iteration algorithm of DP. We now present its step-by-step
details.
Step 1: Set k = 1, specify ε > 0, and select arbitrary values for the vector Q⁰, e.g., set Q⁰(i, a) = 0 for all i ∈ S and a ∈ A(i).

Step 2: For each i ∈ S and a ∈ A(i), compute:

$$Q^{k+1}(i, a) \leftarrow \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) + \lambda \max_{b \in A(j)} Q^k(j, b) \right].$$
$$J^{k+1}(i) = \max_{a \in A(i)} Q^{k+1}(i, a) \quad \text{and} \quad J^k(i) = \max_{a \in A(i)} Q^k(i, a).$$
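Here is a minimal sketch of the Q-factor version of value iteration (numpy assumed; the stopping rule on the maximum change is a simplification of the ε-based check in the text, and the data are hypothetical).

```python
import numpy as np

def q_factor_value_iteration(P, R, lam=0.8, eps=1e-8):
    """Q-factor value iteration: repeatedly apply the update in Step 2."""
    n_actions, n = len(P), P[0].shape[0]
    Q = np.zeros((n, n_actions))
    while True:
        J = Q.max(axis=1)                       # J^k(i) = max_a Q^k(i, a)
        Q_new = np.array([(P[a] * (R[a] + lam * J)).sum(axis=1)
                          for a in range(n_actions)]).T
        if np.abs(Q_new - Q).max() < eps:       # a simple stopping rule
            return Q_new, Q_new.argmax(axis=1)  # Q-factors, greedy policy
        Q = Q_new

P = [np.array([[0.7, 0.3], [0.4, 0.6]]),
     np.array([[0.9, 0.1], [0.2, 0.8]])]
R = [np.array([[6.0, -5.0], [7.0, 12.0]]),
     np.array([[10.0, 17.0], [-14.0, 13.0]])]
print(q_factor_value_iteration(P, R))
```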
Now,

$$X^{k+1} = \frac{\sum_{i=1}^{k+1} x_i}{k+1} = \frac{\sum_{i=1}^{k} x_i + x_{k+1}}{k+1} = \frac{X^k k + x_{k+1}}{k+1} \quad \text{(using Eq. (7.5))}$$

$$= \frac{X^k k + X^k - X^k + x_{k+1}}{k+1} = \frac{X^k(k+1) - X^k + x_{k+1}}{k+1}$$

$$= \frac{X^k(k+1)}{k+1} - \frac{X^k}{k+1} + \frac{x_{k+1}}{k+1} = X^k - \frac{X^k}{k+1} + \frac{x_{k+1}}{k+1}$$

$$= \left(1 - \alpha^{k+1}\right) X^k + \alpha^{k+1} x_{k+1} \quad \text{if } \alpha^{k+1} = 1/(k+1),$$

i.e.,

$$X^{k+1} = \left(1 - \alpha^{k+1}\right) X^k + \alpha^{k+1} x_{k+1}. \qquad (7.6)$$
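A quick numerical check of Eq. (7.6) (numpy assumed): running the incremental update with α^k = 1/k over a stream of samples reproduces the batch sample mean.

```python
import numpy as np

# The incremental update of Eq. (7.6) with step size alpha_k = 1/k
# reproduces the running sample mean exactly.
rng = np.random.default_rng(0)
samples = rng.normal(5.0, 2.0, size=1000)

X = 0.0
for k, x in enumerate(samples, start=1):
    alpha = 1.0 / k
    X = (1 - alpha) * X + alpha * x

print(X, samples.mean())   # the two agree up to float round-off
```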
$$= E[\text{SAMPLE}], \qquad (7.9)$$

where the quantity in the square brackets of (7.8) is the random variable to which the expectation operator E[·] is applied. Thus, if samples of the random variable can be generated within a simulator, it is possible to use the Robbins-Monro scheme for evaluating the Q-factor.
Instead of using Eq. (7.7) to estimate the Q-factors (as shown in the
Q-factor version of value iteration), we could instead use the Robbins-
Monro scheme in a simulator. Using the Robbins-Monro algorithm
(see Eq. (7.6)), Eq. (7.7) becomes:
$$Q^{k+1}(i, a) \leftarrow (1 - \alpha^{k+1})\,Q^k(i, a) + \alpha^{k+1} \left[ r(i, a, j) + \lambda \max_{b \in A(j)} Q^k(j, b) \right] \qquad (7.10)$$

for each (i, a) pair.
Perhaps the most exciting feature of the above is that it is devoid
of transition probabilities! That is, we do not need to know the transition
probabilities of the underlying Markov chain in order to use the
above in an algorithm. All we need is a simulator of the system.
Thus, the mechanism shown in (7.10) enables us to avoid transition
probabilities in RL. An algorithm that does not use (or need) tran-
sition probabilities in its updating equations is called a model-free
algorithm.
What we have derived above is the main update in the Q-learning
algorithm for discounted MDPs. This was first invented by Watkins
[312]. The above discussion was provided to show that the Q-Learning
algorithm can be derived from the Bellman optimality equation for
discounted reward MDPs.
usually needs Q-factors from some other state, and the latest estimates
of these Q-factors from the other state are used in this scenario.
It is to be understood that in such a haphazard style of updating, at
any given time, it is usually the case that different Q-factors have been
updated with differing frequencies.
Updating in this style is called asynchronous updating. In asyn-
chronous updating, the Q-factors used within an update may not have
been updated in the past with the same frequency. Fortunately, it
can be shown that under suitable conditions on the step size and
the algorithm, even with asynchronism, the algorithm can produce
an optimal solution.
Now, thus far, we have defined the step size as

$$\alpha^k = 1/k.$$

Another rule is

$$\alpha^k = A/(B + k),$$

with e.g., A = 5 and B = 10. With suitable values for the scalars A and
B (the tuning parameters), this rule does not decay as fast as 1/k and
can potentially work well. However, for it to work well, it is necessary
to determine suitable values for A and B. Finally, a rule that does not
have any scalar parameters to be tuned and does not decay as fast as
1/k is the following log-rule:

$$\alpha^k = \frac{\log(k)}{k},$$
where log denotes the natural logarithm. Experiments with these rules
on small problems have been conducted in Gosavi [113].
It is necessary to point out that to obtain convergence to optimal
solutions, it is essential that the step sizes follow a set of conditions.
Reinforcement Learning 211
Some of the other conditions needed are more technical, and the reader
is referred to [46]. Fortunately, the A/(B + k) rule and the log rule
satisfy the two conditions specified above and those in [46]. Under
the standard mathematical analysis in the literature, convergence of
the algorithm to optimality can be ensured when all of these conditions
are satisfied by the step sizes. We will assume in this chapter that the
step sizes will be updated using one of the rules documented above.
A constant is an attractive choice for the step size since it may
ensure rapid convergence if its value is close to 1. However, a constant
step size violates the second condition above, i.e., $\sum_{k=1}^{\infty} (\alpha^k)^2 < \infty$,
which is why constant step sizes are usually not used in RL.
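For reference, the three decaying rules discussed above can be written as simple functions of k (a sketch; A = 5 and B = 10 are just the illustrative values mentioned earlier):

```python
import math

def rule_one_over_k(k):          # decays quickly; no tuning parameters
    return 1.0 / k

def rule_a_over_b_plus_k(k, A=5.0, B=10.0):   # A, B need tuning
    return A / (B + k)

def rule_log(k):                 # log-rule; start at k = 2 so log(k) > 0
    return math.log(k) / k

for k in (2, 10, 100, 1000):
    print(k, rule_one_over_k(k), rule_a_over_b_plus_k(k), rule_log(k))
```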
4. MDPs
In this section, we will discuss simulation-based RL algorithms for
MDPs in detail. This section forms the heart of this chapter, and
indeed of the part of the book devoted to control optimization. We
will build upon the ideas developed in the previous section. We will
first consider discounted reward and then average reward.
Steps in Q-Learning.
Step 1. Initialize the Q-factors. In other words, set for all (l, u) where
l ∈ S and u ∈ A(l): Q(l, u) = 0. Set k, the number of state
transitions, to 0. We will run the algorithm for kmax iterations,
where kmax is chosen to be a sufficiently large number. Start system
simulation at any arbitrary state.
[Figure 7.2. The updating of the Q-factors in a simulator: each arrow denotes a state transition in the simulator (select an action a in state i, simulate it, go to j, and extract the value of r(i, a, j)). After going to state j, the Q-factor for the previous state i and the action a selected in i, that is, Q(i, a), is updated.]
The policy (solution) generated by the algorithm is d̂. Stop.
Please make note of the following.
Also, arg maxb∈A(j) Q(j, b) denotes the action associated with the
maximum Q-factor in state j.
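As a sketch of the whole procedure (numpy assumed; sampling from known matrices P and R stands in for the discrete-event simulator, and the 1/V(i, a) step-size rule is one of the options discussed earlier):

```python
import numpy as np

def q_learning(P, R, lam=0.8, k_max=100_000, seed=0):
    """Q-Learning driven by a simulator; here the 'simulator' is just
    sampling from P[a][i, j] and reading R[a][i, j]."""
    rng = np.random.default_rng(seed)
    n_actions, n = len(P), P[0].shape[0]
    Q = np.zeros((n, n_actions))
    visits = np.zeros((n, n_actions))    # per-pair counts for the step size
    i = 0
    for _ in range(k_max):
        a = rng.integers(n_actions)      # explore all actions uniformly
        j = rng.choice(n, p=P[a][i])     # simulate the transition
        r = R[a][i, j]                   # extract the immediate reward
        visits[i, a] += 1
        alpha = 1.0 / visits[i, a]       # the 1/V(i, a) rule
        Q[i, a] += alpha * (r + lam * Q[j].max() - Q[i, a])
        i = j
    return Q, Q.argmax(axis=1)           # learned Q-factors and policy
```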
[Figure 7.3. Trial and error mechanism of RL: the action a selected by the RL agent (algorithm) is fed into the simulator (environment). The simulator simulates the action, and the resultant feedback r(i, a, j) (immediate reward) obtained is fed back into the knowledge-base (Q-factors) of the agent. The agent uses the RL algorithm to update its knowledge-base, becomes smarter in the process, and then selects a better action.]
and select the other action with probability B^k/k, where e.g., B^k = 0.5
for all k. It is clear that when such an exploratory strategy is pursued,
the learning agent will select the non-greedy (exploratory) action
with probability B^k/k. As k starts becoming large, the probability of
selecting the non-greedy (exploratory) action diminishes. At the end,
when k is very large, the action selected will clearly be a greedy action.
As such, at the end, the action selected will also be the action
prescribed by the policy learned by the algorithm. Another example
for decaying the exploration would be as follows:

$$p^k = p^{k-1}A \ \text{ with } A < 1,$$

or via any scheme for decreasing step sizes.
GLIE strategy: We now discuss an important class of exploration
that is required for some algorithms such as SARSA and R-SMART.
In the so-called GLIE (greedy in the limit with infinite exploration)
policy [277], the exploration is reduced in a manner such that in the
limit, one obtains a greedy policy and still all the state-action pairs
are visited infinitely often. An example of such a policy is one that
would use the scheme in (7.12) with B k = A/V̄ k (i) where 0 < A < 1
and V̄ k (i) denotes the number of times state i has been visited thus
far in the simulator. That a policy of this nature satisfies the GLIE
property can be shown (see [277] for proof and other such schemes).
It is important to note, however, that this specific example of a GLIE
policy will need keeping track of an additional variable, V̄ k (i), for each
state.
r(i, a, j) = r(2, 1, 1) = 7;
and so on.
[Table 7.2. Q-factors for Example A under a number of different step-size rules; here p^k = 1/|A(i)|.]
iteration. Note that all the step-size rules produce the optimal policy:
(2, 1). The rule α = 1/k produces the optimal policy, but the values
stray considerably from the optimal values generated by the Q-factor
value iteration. This casts some doubt on the usefulness of the 1/k
rule, which has been used by many researchers (including this author
before [113] was written). Note that 1/k decays very quickly to 0,
which can perhaps lead to computer round-off errors and can cause
problems. One advantage of the 1/k rule (other than its theoretical
guarantees) is that it does not have any tuning parameters, e.g., A
and B. The log-rule (log(k)/k, starting at k = 2) has an advantage in
that it does not need any tuning parameters, unlike the A/(B + k) rule,
and yet produces values that are better than those of the 1/k rule. Finally,
in the last row of the table, we present the results of an algorithm in
which we use the rule 1/k separately for each state-action pair. In
other words, a separate step-size is used for each Q(i, a), and the step-
size is defined as 1/V (i, a), where V (i, a) is the number of times the
state-action pair (i, a) was tried. This works well, as is clear from the
table, but requires a separate V (i, a) for each state-action pair. In
other words, this rule increases the storage burden of the algorithm.
In summary, it appears that the log-rule not only works well, but also
does not need tuning of parameters.
$$Q_{\hat{\mu}}(i, a) = \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) + \lambda J_{\hat{\mu}}(j) \right], \qquad (7.13)$$

where J_μ̂ is the value function vector associated with the policy μ̂.
Notice the difference with the definition of the Q-factor in value iteration, which is:

$$Q(i, a) = \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) + \lambda J^*(j) \right], \qquad (7.14)$$

where J* denotes the value function vector associated with the optimal policy.
Using the definition in Eq. (7.13), we can develop a version of policy
iteration in terms of Q-factors. Now, from the Bellman equation for a
given policy μ̂, which is also called the Poisson equation, we have that:
$$J_{\hat{\mu}}(i) = \sum_{j=1}^{|S|} p(i, \mu(i), j) \left[ r(i, \mu(i), j) + \lambda J_{\hat{\mu}}(j) \right], \quad \forall i. \qquad (7.15)$$

$$Q_{\hat{\mu}}(i, a) = \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) + \lambda Q_{\hat{\mu}}(j, \mu(j)) \right], \quad \forall i,\ a \in A(i). \qquad (7.17)$$
It is clear that Eq. (7.17) is a Q-factor version of the equation used
in the policy evaluation phase of policy iteration and is thus useful in
devising a Q-factor version of policy iteration, which we present next.
It is a critical equation on which much of our subsequent analysis is
based. It is in fact the Q-factor version of the Bellman policy equation
for discounted reward.
Then, using the Robbins-Monro scheme (see (7.6)), Eq. (7.18) can
be written, for all state-action pairs (i, a), as:
Note also that since Q_μ̂(j, μ(j)) ≡ J_μ̂(j), we have that (7.22) above
and the equation underlying Q-P-Learning, i.e., Eq. (7.17), are in fact
the same. We will discuss a version of API below.
Remark 2: The policy improvement step in the policy iteration
algorithm of classical DP is given by:
$$\mu_{k+1}(i) \in \arg\max_{a \in A(i)} \left[ \sum_{j \in S} p(i, a, j) \left[ r(i, a, j) + \lambda J_{\hat{\mu}_k}(j) \right] \right].$$
Equation (7.23) follows from Eq. (7.17) as long as Q-P -Learning con-
verges to the solution of Eq. (7.17). We thus showed that using the
policy in the newly generated Q-factors is equivalent to performing the
policy improvement step in classical DP.
Remark 3: In Q-P -Learning, exploration has to be carried out at
its maximum rate, i.e., every action has to be tried with the same
probability. Without this, there is danger of not identifying an im-
proved policy at the end of an episode. Since the algorithm evaluates
a given policy in one episode, it would be incorrect to bias the explo-
ration in favor of that policy. This is because (i) the definition of the
Q-factor does not have any such bias and (ii) with such a bias, we may
never explore the actions that could potentially become optimal at a
later stage; the latter can lead to sub-optimal values of the Q-factors.
It is perhaps evident from Remark 2 that incorrect Q-factors, which
could result from inadequate exploration, will corrupt the policy im-
provement step.
Remark 4: A word about the number of iterations within an episode,
i.e., nmax , is appropriate here. This quantity obviously needs to be
large although finite. But how large should it be? Unless this num-
ber is large enough, inaccuracies in the Q-factors estimated in a given
episode can lead to a new policy that is actually worse than the cur-
rent policy. This phenomenon is part of what is called chattering (or
oscillation), which can be undesirable [33]. Setting nmax to 1 or some
other very small integer may cause severe chattering, as has been
observed in the case of an algorithm called ambitious API [33].
Remark 5: In comparison to Q-Learning, Q-P -Learning will require
additional time (especially with large values of nmax ), since in each
episode, Q-P -Learning performs a Q-Learning-like evaluation. In
other words, every episode resembles an independent application
Step 1. Initialize the Q-factors, Q(l, u), for all (l, u) pairs. Let n
denote the number of state transitions (iterations) of the algorithm.
Initialize nmax to a large number. Start a simulation. Set n = 1.
[Figure 7.4. A schematic showing the updating in SARSA: the quantity above each arrow is the action selected and that below is the immediate reward earned in the transition (action w takes the system from s to i with reward r_imm; action a takes it from i to j with reward r(i, a, j)). Note that we update Q(s, w) after the transition from i to j is complete.]
Step 1. Initialize kmax , nmax , and mmax to large numbers. Let the
number of algorithm iterations be denoted by k. Set k = 1. Select
any policy arbitrarily and call it μ̂k .
Step 2 (Policy Evaluation: Estimating Value Function.) Start
fresh simulation. Initialize J(l) = 0 for all l ∈ S. Let the current
system state be i. Set n, the number of iterations within the policy
evaluation episode, to 1.
Step 2a. Simulate action a = μk (i) in state i.
Step 2b. Let the next state encountered in the simulator be j. Let
r(i, a, j) be the immediate reward earned in the transition from
state i to state j. Update α. Then update J(i) using:
The reader should note that the value of the J(.) function used in
Step 3 is that obtained at the end of Step 2. Also, note that the need
for Q-factor evaluation (Step 3) arises from the fact that since the tran-
sition probabilities are unknown, it is not possible to perform policy
$$Q(i, a) = \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) - \rho^* + \max_{b \in A(j)} Q(j, b) \right], \qquad (7.25)$$
for all (i, a) pairs. The only difficulty with this algorithm is that ρ∗ is
not known in advance! Like in DP, we will use the notion of relative
value iteration to circumvent this difficulty.
T R ← T R + r(i, a, j); k ← k + 1.
As a result, the policy learned is (2, 1). When this policy is run in a
simulator, the average reward obtained is 10.56. Note that Q(1, 1) =
10.07 ≈ 10.56.
5. SMDPs
In this section, we will first discuss the discounted reward case
for solving the generalized semi-Markov decision problems (SMDPs),
using RL, and then discuss the average reward case. We remind the
reader that the SMDP is a more powerful model than the MDP, be-
cause it explicitly models the time spent in a transition. In the MDP,
the time spent is the same for every transition. The RL algorithms for
SMDPs use extensions of Q-Learning and Q-P -Learning for MDPs.
Please refer to value and policy iteration for SMDPs and also review
definitions from Chap. 6.
$$Q(i, a) \leftarrow (1 - \alpha)\,Q(i, a) + \alpha \left[ r_L(i, a, j) + r_C(i, a, j)\,\frac{1 - e^{-\gamma t(i,a,j)}}{\gamma} + e^{-\gamma t(i,a,j)} \max_{b \in A(j)} Q(j, b) \right], \qquad (7.26)$$
where t(i, a, j) is the possibly random amount of time it takes to tran-
sition from i to j under a and γ is the rate of return or rate of interest.
A version of this algorithm without the lump sum reward appears in
[52], while the version with the lump sum reward and a convergence
proof appears in [119].
where fi,a,j (.) is the pdf of the transition time from i to j under a’s
influence.
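As a sketch, a single application of update (7.26) might look as follows (Python; the function and argument names are illustrative, not from the source).

```python
import math

def smdp_q_update(Q, i, a, j, r_lump, r_rate, t, alpha, gamma):
    """One application of update (7.26): r_lump is the lump-sum reward
    r_L(i,a,j); r_rate is the reward rate r_C(i,a,j) earned continuously
    over the (possibly random) transition time t = t(i,a,j)."""
    discounted_rate_reward = r_rate * (1.0 - math.exp(-gamma * t)) / gamma
    target = (r_lump + discounted_rate_reward
              + math.exp(-gamma * t) * max(Q[j]))
    Q[i][a] = (1 - alpha) * Q[i][a] + alpha * target
    return Q

# Example: Q stored as a list of per-state lists of Q-factors.
Q = [[0.0, 0.0], [0.0, 0.0]]
Q = smdp_q_update(Q, i=0, a=1, j=1, r_lump=2.0, r_rate=1.5,
                  t=3.2, alpha=0.1, gamma=0.05)
```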
note that the superscript, k, used above with α and β, indicates that
the step sizes are functions of k. The superscript has been suppressed
elsewhere to increase clarity. Step-size rules such as αk = log(k)/k and
β k = A/(B + k) (with suitable values of A and B) satisfy Eq. (7.31);
other rules that satisfy the relationship can also be identified in prac-
tice. The condition in Eq. (7.31) is a defining feature of the two-time-
scale framework. Whenever two time scales are used, the two step
sizes must satisfy this condition.
tions, one can construct an imaginary SSP from any average reward
MDP easily. This SSP has some nice properties and is easy to solve.
What is interesting is that the solution to this SSP is an optimal so-
lution of the original MDP as well!
The construction of the associated SSP requires that we consider
any state in the MDP (provided that every Markov chain is regu-
lar, an assumption we make throughout) to be the absorbing state;
we call this state the distinguished state and denote it by i∗ . In the
associated SSP, if the distinguished state is the next state (j) in a tran-
sition and the Q-factor of the previous state (i) is being updated after
the transition occurs, the value of zero will replace the distinguished
state’s Q-factor. (Remember that when we update a Q-factor, we need
Q-factors from the next state as well.) However, when the turn comes
to update a Q-factor of the distinguished state, it will be updated just
like any other Q-factor. Also, very importantly, the SSP requires that
we replace the immediate reward, r(i, a, j), by r(i, a, j) − ρ∗ . Since ρ∗
is unknown at the start, we will update ρ via Eq. (7.29), under the
two-time-scale conditions (discussed previously).
Now, clearly, we are interested here in solving the SMDP
rather than the MDP. It was shown in [119] that similar to the MDP,
by using a distinguished (absorbing) state i∗ , one can construct an
SSP for an SMDP as well—such that the solution of the SSP is iden-
tical to that of the original SMDP. For the SMDP, solving the asso-
ciated SSP requires that we replace the immediate reward, r(i, a, j),
by r(i, a, j) − ρ∗ t̄(i, a, j), where note that we have the time element
(t(i, a, j)) in the modified immediate reward. Then, provided the value
of ρ∗ is known in advance, the following value iteration algorithm can
be used to solve the SSP and hence the SMDP:
$$Q(i, a) \leftarrow \sum_{j \in S} p(i, a, j) \left[ r(i, a, j) - \rho^*\,\bar{t}(i, a, j) + I(j \ne i^*) \max_{b \in A(j)} Q(j, b) \right], \qquad (7.33)$$

where I(·) is an indicator function that equals 1 when j ≠ i* and
equals 0 when j = i*. Of course, one must update ρ in a manner such
that it reaches ρ∗ in the limit. We will update ρ on a second time
scale—in a manner identical to that used in the CF-version.
Equation (7.33) thus serves as the basis for deriving a Q-Learning
algorithm (see Eq. (7.34) below). Convergence for the resulting algo-
rithm under certain conditions is shown in [119]. We now present steps
in the algorithm.
$$Q(i, a) \leftarrow (1 - \alpha)\,Q(i, a) + \alpha \left[ r(i, a, j) - \rho\,t(i, a, j) + I(j \ne i^*) \max_{b \in A(j)} Q(j, b) \right] \qquad (7.34)$$
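A sketch of how an update of this kind might be applied inside a simulator is given below (Python; all names are illustrative, and the ρ update shown, a Robbins-Monro step towards the cumulative ratio TR/TT on the slower time scale β, is one plausible form consistent with the two-time-scale discussion above, not necessarily the exact rule of Eq. (7.29)).

```python
def ssp_q_update(Q, rho, i, a, j, r, t, alpha, beta, i_star, TR, TT):
    """One transition's worth of updating for the SSP-version (sketch).
    When the next state j is the distinguished state i_star, its
    Q-factors are replaced by zero, per the SSP construction above."""
    next_val = 0.0 if j == i_star else max(Q[j])
    Q[i][a] = (1 - alpha) * Q[i][a] + alpha * (r - rho * t + next_val)
    # Slower time scale (beta decays faster than alpha): track rho.
    TR, TT = TR + r, TT + t
    rho = (1 - beta) * rho + beta * (TR / TT)
    return Q, rho, TR, TT
```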
policy equation) keeps the iterates bounded and also converges. For
average reward MDPs, one uses relative value iteration (derived from
the Bellman policy equation). For average reward SMDPs, we cannot
use relative value iteration without discretization. But, after a suit-
able modification, one can use the Bellman equation for value iteration
for a given policy. The Bellman equation relevant to this case is the
Bellman policy (or Poisson) equation, which can be expressed in terms
of Q-factors as follows.
$$Q(i, a) = \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) - \rho_{\hat{\mu}}\,\bar{t}(i, a, j) + Q(j, \mu(j)) \right].$$
The above is the Bellman policy equation for the policy μ̂ and in
general does not have a unique solution, which may pose problems in
RL. We now discuss two approaches to bypass this difficulty.
Like in value-iteration-based RL, one approach is to solve the asso-
ciated SSP, and use a variant of the above equation that applies for
the SSP. Hence the first question is, what does the equation look like
for the SSP? Using the notion of a distinguished state i∗ , we have the
following Bellman policy equation:
$$Q(i, a) = \sum_{j=1}^{|S|} p(i, a, j) \left[ r(i, a, j) - \rho_{\hat{\mu}}\,\bar{t}(i, a, j) + I(j \ne i^*)\,Q(j, \mu(j)) \right]. \qquad (7.35)$$
Now, the second question is how does one obtain the value of ρμ̂ ?
This can be resolved as follows: Before updating the Q-factors using
this equation, one estimates the average reward of the policy μ̂ in a
simulator. Thus, in Step 2, we estimate the average reward of the
policy, whose Q-factors are evaluated later in Step 3 using an update
based on Eq. (7.35).
The second approach is to use the CF-version. We will discuss that
later. We first present steps in the SSP-version of Q-P -Learning for
average reward SMDPs.
where η is a scalar that satisfies 0 < η < 1. For any given value of ρ,
it can be shown that the above equation has a unique solution; this is
6. Model-Building Algorithms
Scattered in the literature on RL, one finds some papers that discuss
a class of algorithms called model-based algorithms. These algorithms
actually build the transition probability model in some form and at-
tempt to use the Bellman equation in its original form, i.e., with its
transition probabilities intact. At the beginning of the chapter, we
have discussed a mechanism to first build the transition probability
model within the simulator and then use DP. Model-based algorithms
are similar in spirit, but they do not wait for the model to be built.
Rather, they start updating the value function, or the Q-factors, while
the model is being simultaneously built. The question that arises is
this: What is the advantage underlying taking the additional step
of building the model when we already have the model-free Bellman
equation in terms of Q-factors (which we have used all along)? The
answer is that oftentimes model-free algorithms exhibit unstable be-
havior, and their performance also depends on the choice of the step-
size. Model-based algorithms hold the promise of being more stable
than their model-free counterparts [324].
6.1. RTDP
We first present model-building algorithms that compute the value
function rather than the Q-factors. The central idea here is to estimate
the immediate reward function, r̄(i, a), and the transition probability
function, p(i, a, j). While this estimation is being done, we use a step-
size-based asynchronous DP algorithm. The estimates of the immedi-
ate reward and transition probabilities will be denoted by r̃(i, a) and
p̃(i, a, j), respectively.
The following algorithm, called RTDP (Real Time Dynamic Pro-
gramming), is based on the work in [17]. It is designed for discounted
reward MDPs.
The policy (solution) generated by the algorithm is d̂. Stop.
Step 1. Set for all (l, u), where l ∈ S and u ∈ A(l), Q(l, u) ← 0,
r̃(l, u) ← 0, and Qn (l, u) ← 0. Note that Qn (l, u) will denote the
estimate of the maximum Q-factor for the next state when action
u is chosen in state l. Set k, the number of state changes, to 0. Set
kmax , which denotes the maximum number of iterations for which
the algorithm is run, to a sufficiently large number; note that the
algorithm runs iteratively between Steps 2 and 6. Select step sizes
α and β using the two-time-scale structure discussed previously.
Start system simulation at any arbitrary state.
Step 2. Let the current state be i. Select action a with a probability
of 1/|A(i)|.
Step 3. Simulate action a. Let the next state be j. Increment k by 1.
Note that the update in Step 5 uses samples; but Step 4 uses es-
timates of the expected immediate reward and the expected value of
the Q-factor of the next state, making the algorithm model-based. Its
critical feature is that it avoids estimating the transition probabilities
and also computing the expectation with it. The steps for the average
reward case are same with the following changes: Any one state-action
pair, to be denoted by (i∗ , a∗ ), is selected in Step 1, and the update in
Step 4 is changed to the following:
(7.37)
Note that Q(j, T + 1, b) = 0 for all j ∈ S and b ∈ A(j, T + 1).
For total expected reward, one should set λ = 1. The algorithm can
be analyzed as a special case of the SSP (see [116]; see also [94]). We
present a step-by-step description below.
8. Function Approximation
In a look-up table, each Q-factor is stored individually in a table
in the computer’s memory. As stated in the beginning of the chapter,
for large state-action spaces which require millions of Q-factors to be
stored, look-up tables are ruled out. This, we remind the reader, is
called the curse of dimensionality.
Consider an MDP with a million state-action pairs. Using model-free
algorithms, one can avoid the huge transition probability matrices
associated with the MDP. However, one must still find some means of
storing the one million Q-factors. We will now discuss some strategies
that allow this.
$$\phi(1) = 2s^2(1) + 4s(2) + 5s(3); \quad \phi(2) = 4s(1) + 2s^2(2) + 9s(3), \qquad (7.39)$$

where φ(l) denotes the feature in the lth dimension (or the lth feature)
and is a scalar for every l. The ideas of feature creation are useful in
all methods of function approximation, not just state aggregation. We
will explore them further below.
Features and architecture. We now seek to generalize the ideas
underlying feature creation. From here onwards, our discussion will
be in terms of Q-factors or state-action values, which are more com-
monly needed than the value function in simulation-based optimiza-
tion. Hence, the basis expression will be denoted in terms of the feature
and the action: φ(l, a) for l = 1, 2, . . . , n and for every action a, where
n denotes the number of features. We will assume that the state has
d dimensions. Then, in general, the feature extraction map can be
expressed as:
$$\begin{bmatrix} F_a(1,1) & F_a(1,2) & \cdots & F_a(1,d) \\ F_a(2,1) & F_a(2,2) & \cdots & F_a(2,d) \\ \vdots & \vdots & \ddots & \vdots \\ F_a(n,1) & F_a(n,2) & \cdots & F_a(n,d) \end{bmatrix} \cdot \begin{bmatrix} s(1) \\ s(2) \\ \vdots \\ s(d) \end{bmatrix} = \begin{bmatrix} \phi(1,a) \\ \phi(2,a) \\ \vdots \\ \phi(n,a) \end{bmatrix},$$
where the matrix, whose elements are Fa (., .), denotes the feature ex-
traction map for action a; the feature extraction map converts the state
space to the feature space. Then the above matrix representation for
Eq. (7.38) is:
$$\begin{bmatrix} 2 & 4 & 8 \end{bmatrix} \cdot \begin{bmatrix} s(1) \\ s(2) \\ s(3) \end{bmatrix} = \phi;$$
where w(l, a) denotes the lth weight for action a. In the above, the
weights are the only unknowns (remember the basis expressions are
known) and must be estimated (and updated) via some approach (usu-
ally neurons or regression). If the architecture, such as the one above,
fits the actual Q-function well, then instead of storing all the Q-factors,
one now needs only n scalars for each action. Typically, n should be
chosen so that it is much smaller than the size of the state space.
Herein lies the power of function approximation. Function approxi-
mation enables us to replicate the behavior of look-up tables without
storing all the Q-factors. Of course, this is easier said than done,
[Figure: the state-action space is transformed by the feature extraction map into the feature space; the function architecture then combines the features with the weights.]
$$F(i) = w(1,1) + w(2,1)\,i + w(3,1)\,i^2; \quad G(i) = w(1,2) + w(2,2)\,i + w(3,2)\,i^2 + w(4,2)\,i^3,$$

where i^k denotes i raised to the kth power. Here, we will assume that
to obtain n features, the one-dimensional state i will be written as:
s = [i 0 0 . . . 0]^T, with (n − 1) zeroes.
1. Using one non-linear neural network over the entire state space
with backpropagation as our main tool.
2. Using a pre-specified non-linear function to approximate the state
space, e.g., Example 3, using regression as the main tool.
3. Using a piecewise linear function to approximate the state space
with a neuron or linear regression within each piece. Example 4
(presented later) will illustrate this approach.
4. Using a piecewise non-linear function to approximate the state
space with a non-linear neural network within each piece. This
will amount to using several non-linear neural networks.
In general for function approximation to be successful, it must repro-
duce the behavior that RL with look-up tables would have produced
(when the state space is large, think of an imaginary look-up table).
In other words, any given Q-factor must remain at roughly the same
value that look-up tables would have produced. The solution with the
look-up tables, remember, is an optimal or a near-optimal solution.
Difficulties. We now describe the main challenges faced in function
fitting.
the weights that represent Q(i, a). E.g., consider Eq. (7.40). The
Q-factor is thus a function of the weights. When a state-action pair
(i, a) is visited, the weights get updated. Unfortunately, since every
Q-factor is a function of the weights, the update in weights causes
all Q-factors to change, not just the (i, a)-th pair that was visited.
Hence the new (updated) Q-factors may not equal the Q-factors
that would have been produced by a look-up table, which would
have changed only the (i, a)th Q-factor.
We refer to this kind of behavior, where the updates in one area of
the state space spill over into areas where they are not supposed
to, as the spill-over effect. With a look-up table, on the other
hand, only one Q-factor changes in a given iteration, and the others
remain unchanged. A function-approximation-coupled algorithm is
supposed to imitate the algorithm coupled with a look-up table and should
change only one Q-value in one iteration, leaving the others unchanged,
at least approximately.
3. Noise due to single sample: In model-free RL, we use a single
sample instead of the expectation, which tends to create
noise in the updating algorithm. While this poses no problems
with look-up tables, the noise can create a significant difficulty
with function fitting (see [326, 317]). Unless a model is available
a priori, this kind of noise cannot be avoided; see, however, [115]
for some model-building algorithms that can partially avoid this
noise. With the incremental least squares (regression) algorithm
that we will discuss below, Bertsekas and Tsitsiklis [33] state that
for Q-Learning, convergence properties are unknown. We also note
that in general non-linear neural networks can get trapped in local
optima and may display unstable behavior.
Clearly, the Bellman error denotes the sum of the squared differences
between the actual Q-factors obtained from a look-up table and those
given by the function approximator. Now, using the above definition
of the Q-factor in the Bellman-error expression, we will compute the
partial derivative of the Bellman error with respect to the weights.
For easing notation, we will assume that the weights, which essentially
form a matrix, can be stored in a vector, which will be denoted by w.
Thus, a Q-factor expressed in terms of this vector of weights will be
denoted by Qw (., .). Then, we will use the derivative in a steepest-
descent algorithm.
$$\frac{\partial BE}{\partial w(l,a)} = -\sum_{i,a} \frac{\partial Q_w(i,a)}{\partial w(l,a)} \left[ \sum_{j=1}^{|S|} p(i,a,j) \left( r(i,a,j) + \lambda \max_{b \in A(j)} Q_w(j,b) \right) - Q_w(i,a) \right].$$
We can now remove the terms p(i, a, j) that need summation over j,
thereby replacing an expectation with a single sample; this is similar
to the use of Robbins-Monro algorithm for deriving Q-Learning. The
end result is the following definition for the partial derivative:
$$\frac{\partial BE}{\partial w(l,a)} = -\frac{\partial Q_w(i,a)}{\partial w(l,a)} \left[ r(i,a,j) + \lambda \max_{b \in A(j)} Q_w(j,b) - Q_w(i,a) \right].$$
Now, the above can be combined with the following steepest descent
algorithm:
$$w(l, a) \leftarrow w(l, a) - \beta\,\frac{\partial BE}{\partial w(l, a)} \quad \text{for all } (l, a), \qquad (7.42)$$

where β is the step size. For using the above, we must determine the
expressions for ∂Q_w(i, a)/∂w(l, a), which can be done easily from the
architecture. In general,

$$\frac{\partial Q_w(i, a)}{\partial w(l, a)} = \phi(l, a),$$
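Putting the sampled derivative and (7.42) together for a linear architecture, one transition's update of the weights might be sketched as follows (numpy assumed; the feature map phi and all other names are illustrative).

```python
import numpy as np

def linear_q_update(w, phi, i, a, j, r, lam, beta, n_actions):
    """One steepest-descent step on the sampled Bellman error for the
    visited pair (i, a), using dQ/dw(l, a) = phi(l, a). Here phi(state)
    returns the feature vector and w holds one weight row per action."""
    q_old = w[a] @ phi(i)                               # Q_w(i, a)
    q_next = max(w[b] @ phi(j) for b in range(n_actions))
    w[a] += beta * (r + lam * q_next - q_old) * phi(i)  # gradient step
    return w

# Example: the two-feature architecture phi(1,a) = 1, phi(2,a) = i.
phi = lambda i: np.array([1.0, float(i)])
w = np.zeros((2, 2))            # two actions, two weights each
w = linear_q_update(w, phi, i=3, a=0, j=4, r=6.0,
                    lam=0.8, beta=0.01, n_actions=2)
```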
Step 3b. Then use Qnew and Qold to update the weights of the neu-
ral network associated with action a via an incremental algorithm.
In the neural network, Qnew will serve as the “target” value (yp in
Chap. 4), Qold as the output (op in Chap. 4), and i as the input.
Step 4. Increment k by 1. If k < kmax , set i ← j; then go to Step 2.
Otherwise, go to Step 5.
Step 5. The policy learned is stored in the weights of the neural net-
work. To determine the action associated with a state, find the
outputs of the neural networks associated with the actions that are
allowed in that state. The action(s) with the maximum value for
the output is the action dictated by the policy learned. Stop.
We reiterate that the neural network in Step 3b is updated in an
incremental style. Sometimes in practice, just one (or two) iteration(s)
of training is sufficient in the updating process of the neural network.
In other words, when a Q-factor is to be updated, only one iteration
of updating is done within the neural network. Usually too many it-
erations in the neural network can lead to “over-fitting.” Over-fitting
implies that the Q-factor values in parts of the state space other than
those being trained are incorrectly updated (spill-over effect). It is im-
portant that training be localized and limited to the state in question.
Steps with incremental regression. Here the steps will be similar
to the steps above with the understanding that the weights of the
neural network are now replaced by those of the regression model;
Step 3a will not be needed, and Step 3b will be different. In Step 3b, we
will first compute the values of the Q-factors for (i, a) and those for all
actions associated with state j by plugging the input features into the
regression model. Then a combination of Eqs. (7.41) and (7.42) will
be used to update the weights.
Step 3b. The current step in turn may contain a number of steps and
involves the neural network updating. Set m = 0, where m is the
number of iterations used within the neural network. Set mmax , the
maximum number of iterations for neuronal updating, to a suitable
value (we will discuss this value below).
Step 3b(i). Update the weights of the neuron associated to action a
as follows:
$$w(1, a) \leftarrow w(1, a) + \mu\,(Q_{new} - Q_{old}) \cdot 1; \quad w(2, a) \leftarrow w(2, a) + \mu\,(Q_{new} - Q_{old}) \cdot i. \qquad (7.44)$$
simple example studied above in which the basis functions for a state
i are: φ(1, a) = 1 and φ(2, a) = i for both actions. This implies that:
$$\frac{\partial Q_w(i, a)}{\partial w(1, a)} = 1; \qquad \frac{\partial Q_w(i, a)}{\partial w(2, a)} = i. \qquad (7.45)$$
The steps will be similar to those described for the neuron with the
following difference in Step 3.
Step 3. Evaluate Qold and Qnext as discussed above in Step 3 of the
neuron-based algorithm. Then, update the weights as follows:
$$w(1, a) \leftarrow w(1, a) + \mu \left[ r(i, a, j) + \lambda Q_{next} - Q_{old} \right] \cdot 1; \quad w(2, a) \leftarrow w(2, a) + \mu \left[ r(i, a, j) + \lambda Q_{next} - Q_{old} \right] \cdot i.$$
$$Q(i, a) = w_1(1, a) + w_1(2, a)\,i \ \text{ for } i \le 5; \qquad Q(i, a) = w_2(1, a) + w_2(2, a)\,i \ \text{ for } i > 5,$$
where wc (., a) denotes the weight in the cth compartment for action a.
The training for points in the zone i ≤ 5 would not spill over into
the training for points in the other zone. Clearly, one can construct
multiple compartments, and compartments do not have to be of the
same size.
[Figure 7.6. A feature extraction mapping that transforms the actual 2-dimensional state space (bottom) into a more regular feature space: within each compartment in the feature space, a neuron can be placed.]
9. Conclusions
This chapter was meant to serve as an introduction to the funda-
mental ideas related to RL. Many RL algorithms based on Q-factors
were discussed. Their DP roots were exposed and step-by-step de-
tails of some algorithms were presented. Some methods of function
approximation of the Q-function were discussed. Brief accounts of
model-building algorithms and finite horizon problems were also presented.
At this point, we summarize the relationship between RL algorithms
and their DP counterparts. The two main algorithms of DP, value
and policy iteration, are based on the Bellman equation that contains
the elements of the value function as the unknowns. One can, as
discussed in this chapter, derive a Q-factor version of the Bellman
equation. Most RL algorithms, like Q-Learning and Q-P -Learning, are
based on the Q-factor version of the Bellman equation. Table 7.3 shows
the DP roots of some of the RL algorithms that we have discussed in
this chapter.
  DP                                        RL
  Bellman optimality equation               Q-Learning
    (Q-factor version)
  Bellman policy equation                   Q-P-Learning
    (Q-factor version)
  Value iteration                           Q-Learning
  Relative value iteration                  Relative Q-Learning
  Modified policy iteration                 Q-P-Learning and CAP-I
in Samuel [260] and Klopf [176]. Holland [140] is also an early work related to
temporal differences. Some other related research can be found in Holland [141]
and Booker [44].
Textbooks. Two textbooks that appeared before the one you are reading and laid
the foundation for the science of RL are [33] and [288].
Neuro-dynamic programming (NDP) [33], an outstanding book, strengthened
the connection between RL and DP. This book is strongly recommended to the
reader for foundational concepts on RL. It not only discusses a treasure of algorith-
mic concepts likely to stimulate further research in coming years, but also presents
a detailed convergence analysis of many RL algorithms. The name NDP is used
to emphasize the connection of DP-based RL with function approximation (neural
networks).
The book of Sutton and Barto [288] provides a very accessible and intuitive intro-
duction to reinforcement learning, including numerous fundamental ideas ranging
from temporal differences, through Q-Learning and SARSA, to actor-critics and
function approximation. The perspective of machine learning, rather than opera-
tions research, is used in this text, and the reader should find numerous examples
from artificial intelligence for illustration.
The reader is also referred to Chapter 6 in Vol II of Bertsekas [30], which focusses
on Approximate Dynamic Programming (ADP) and discusses a number of recent
advances in this field, particularly in the context of function approximation and
temporal differences. The acronym ADP, which is also used to mean Adaptive
Dynamic Programming, is often used to refer to simulation-based DP and RL
schemes for solving MDPs that employ regression-based function approximators,
e.g., linear least squares. It was coined in Werbös [318] and is being used widely in
the literature now.
More recently, a number of books have appeared on related topics, and we sur-
vey a subset of these. Chang et al. [62] discuss a number of recent paradigms,
including those based on stochastic policy search and MRAS for simulation-based
solutions of MDPs. Szepesvári [289] presents a very crisp and clear overview of
many RL algorithms. Numerous other books contain related material, but em-
phasize specific topics: function approximation [56], stochastic approximation [48],
sensitivity-based learning [57], post-decision-making [237], and knowledge-gradient
learning [238].
1. Chapter Overview
In this chapter, we discuss an approach for solving Markov decision
problems (MDPs) and Semi-Markov decision problems (SMDPs) using
an approach that employs the so-called action-selection probabilities
instead of the Q-factors required in reinforcement learning (RL). The
underlying philosophy of this approach can be explained as follows.
The action-selection probabilities, which are stored in some form either
directly or indirectly, are used to guide the search. As a result, we have
a stochastic search in which each action is considered to be equally
good at the start, but using feedback from the system about the eff-
ectiveness of each action, the algorithm updates the action-selection
probabilities—leading the system to the optimal policy at the end. It
should be clear to the reader that like RL, this approach also uses
feedback from the system, but unlike RL, it stores action-selection
probabilities.
The two methods in the class of “stochastic search” that we cover
are widely known as learning automata (or automata theory) and
actor critics (or adaptive critics). Automata theory for solving
MDPs/SMDPs will be referred to as Markov Chain Automata Theory
(MCAT) in this book. MCAT does not use the Bellman equation of
any kind, while actor critics use the Bellman policy equation (Poisson
equation). In Sect. 2, we discuss MCAT, and in Sect. 3, we discuss
actor critics, abbreviated as ACs. MCAT will be treated for average
reward MDPs and SMDPs, while ACs will be treated for discounted
and average reward MDPs and average reward SMDPs. It is recom-
mended that the reader become familiar with the basics of MDPs and
SMDPs from Chap. 6 before reading this chapter.
The response, as we will see later, is used to generate the feedback used
to update the action selection probabilities. The response is calculated
as the total reward earned since the last visit to i divided by the
number of state transitions since the last visit to i. Thus, for MDPs,
the response for state i is given by
$$s(i) = \frac{R(i)}{N(i)},$$
where R(i) denotes the total reward earned since the last visit to state
i and N (i) denotes the number of state transitions that have occurred
since the last visit to i. For the SMDP, N (i) is replaced by T (i), where
T (i) denotes the total time spent in the simulator since the last visit
to i. Thus, for SMDPs, the response for state i is:
$$s(i) = \frac{R(i)}{T(i)}.$$
The response is then normalized to convert it into a scalar quantity
that lies between 0 and 1. The normalized response is called feedback.
The normalization is performed via:
$$\phi(i) = \frac{s(i) - s_{min}}{s_{max} - s_{min}}, \qquad (8.1)$$
where smin is the minimum response possible in the system and smax
is the maximum response possible in the system.
As stated above, the feedback is used to update the action-selection
probability p(i, x) where x denotes the action selected in the last visit
to i. Without normalization, we will see later, the updated action-
selection probabilities can exceed 1 or become negative.
Many schemes have been suggested in the literature to update the
action-selection probabilities. All schemes are designed to punish the
bad actions and reward the good ones. A popular scheme, known
as the Reward-Inaction scheme, will be covered in detail because it
appears to be one of the more popular ones [215].
Reward-Inaction Scheme. As discussed above, a trajectory of
states is simulated. An iteration is said to be performed when one
transitions from one state to another. Consider the instant at which
the system visits a state i. At this time, the action-selection probabil-
ities for state i are updated. To this end, the feedback φ(i) has to be
computed as shown in Eq. (8.1). Let L(i) denote the action taken in
the last visit to state i and α denote the step size or learning rate. Let
pk (i, x) denote the action-selection probability of action x in state i in
272 SIMULATION-BASED OPTIMIZATION
the kth iteration of the algorithm. For the sake of simplicity, let us
assume for the time being that no more than two actions per state are
allowed. At the very beginning, before starting the simulation, one sets
x to any of the two actions. The value of x is not changed later during
the algorithm’s progress. Then, using the Reward-Inaction scheme,
the action-selection probability of an action x is updated via the rule
given below:
$$p^{k+1}(i, x) \leftarrow p^k(i, x) + \alpha\,\phi(i)\,I(L(i) = x) - \alpha\,\phi(i)\,p^k(i, x), \qquad (8.2)$$

where I(·) is the indicator function that equals 1 if the condition inside
the brackets is satisfied and is 0 otherwise. Note that the above scheme
is used to update an action x out of the two actions numbered 1 and 2.
Clearly, after the above update, the other action will have to be set to
1 − p^{k+1}(i, x).
The reward-inaction scheme is so named because a good action’s
effects are rewarded while those of a poor action are ignored (inac-
tion). How this is ensured can be explained as follows. Assume that x
is the action that was selected in the last visit to i. Then, the change
in the value of p(i, x) will equal φ × α[1 − p(i, x)]. Thus, if the re-
sponse is strong (feedback close to 1), the probability will be increased
significantly, while if the response is weak (feedback close to 0), the
increase will not be very significant. Likewise, if the action x was not
selected in the last visit, the change will equal −φ × αp(i, x). Hence,
if the action was not selected, but the response was strong, its prob-
ability will be reduced significantly, but if the response was weak, its
probability will not see a significant change.
Now, we present step-by-step details of an MCAT algorithm.
2. Let L(i) denote the action that was selected in the last visit to i.
Update p(i, x) using the Reward-Inaction rule (8.2):

$$p(i, x) \leftarrow p(i, x) + \alpha\,\phi(i)\,I(L(i) = x) - \alpha\,\phi(i)\,p(i, x),$$

where I(·) is the indicator function that equals 1 when the condition
within brackets is satisfied and equals 0 otherwise. Then, update
the other action (the action other than x) so that the sum of the
probabilities of actions for state i is 1.
3. With probability p(i, a), select an action a from the set A(i).
4. Set L(i) ← a, Cr (i) ← T R, and Ct (i) ← T T. Then, simulate
action a. Let the next system state be j. Also let t(i, a, j) denote
the (random) transition time, and r(i, a, j) denote the immediate
reward earned in the transition resulting from selecting action a in
state i.
5. Set TR ← TR + r(i, a, j); TT ← TT + t(i, a, j).
6. Set i ← j and k ← k + 1. If k < MAXSTEPS, return to Step 1;
otherwise STOP.
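A compact sketch of the loop formed by the steps above follows; simulate_action (returning the next state, the immediate reward, and the transition time) and compute_feedback (the normalized response of Eq. (8.1)) are hypothetical stand-ins for the simulator and the feedback computation, and the update function is the one sketched earlier.

    import random

    def mcat(states, actions, p, alpha, max_steps,
             simulate_action, compute_feedback):
        """Sketch of the MCAT loop; p maps (state, action) -> probability."""
        L, Cr, Ct = {}, {}, {}       # last action, cum. reward/time at last visit
        TR, TT = 0.0, 0.0            # total reward and total time so far
        i, k = states[0], 0
        while k < max_steps:
            if i in L:               # Steps 1-2: compute feedback, update p
                phi = compute_feedback(TR, TT, Cr[i], Ct[i])
                reward_inaction_update(p, i, 1, L[i], phi, alpha)
            # Step 3: select an action with the current probabilities
            a = random.choices(actions, weights=[p[(i, u)] for u in actions])[0]
            # Step 4: record the visit, then simulate the chosen action
            L[i], Cr[i], Ct[i] = a, TR, TT
            j, r, t = simulate_action(i, a)
            TR, TT = TR + r, TT + t  # Step 5
            i, k = j, k + 1          # Step 6
        return p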
3. Actor Critics
Actor critics (ACs), also called adaptive critics, have a rather long
history [328, 18, 316]. Like MCAT, ACs use action-selection proba-
bilities to guide the search, but like RL, they also use the Bellman
lim_{k→∞} β^k / α^k = 0;

note that we suppressed the superscript k above for clarity's sake.
An example of step-size rules that satisfy the above condition is:

α^k = log(k) / k;    β^k = A / (B + k).
Note that β converges to 0 faster than α, and hence the time scale
that uses β is called the slower time scale while the time scale that
uses α is called the faster time scale. Since β converges to 0 faster, it
is as if the faster time scale sees the slower time scale as moving very
slowly, i.e., as if the values on the slower time scale are fixed.
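For instance, with the illustrative constants A = 1 and B = 10 in the rules above, a quick check confirms that the ratio β^k/α^k indeed tends to 0:

    import math

    A, B = 1.0, 10.0                   # illustrative constants
    for k in [10, 100, 10**3, 10**4, 10**5]:
        alpha_k = math.log(k) / k      # faster time scale
        beta_k = A / (B + k)           # slower time scale
        print(k, beta_k / alpha_k)     # ratio shrinks (like 1/log(k)) to 0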
If H(i, a) > H̄, set H(i, a) ← H̄. If H(i, a) < −H̄, set H(i, a) ← −H̄.
Convergence results assure us that the value of J(i∗ ) should converge
at the end to the vicinity of ρ∗ , the optimal average reward.
If H(i, a) > H̄, set H(i, a) ← H̄. If H(i, a) < −H̄, set H(i, a) ← −H̄.
Then update ρ, TR, and TT as follows:
where γ̄ is the step size on the slowest time scale. The step
sizes must satisfy the following two rules:

lim_{k→∞} γ̄^k / β^k = 0  and  lim_{k→∞} β^k / α^k = 0.

Selecting a suitable value for η may need experimentation (see [183]).
4. Concluding Remarks
This short chapter was meant to introduce you to solving MDPs/
SMDPs via stochastic search, i.e., stochastic policies in which the
action selection probabilities (or their surrogates) are directly updated.
Our discussion was limited to some of the earliest advances, i.e., learn-
ing automata and actor critics. We discuss some of the more recent
developments in the bibliographic remarks.
Chapter 9
CONVERGENCE: BACKGROUND
MATERIAL
1. Chapter Overview
This chapter introduces some fundamental mathematical notions
that will be useful in understanding the analysis presented in the sub-
sequent chapters. The aim of this chapter is to introduce elements of
the mathematical framework needed for analyzing the convergence of
algorithms discussed in this book. Much of the material presented in
this chapter is related to mathematical analysis, and hence a reader
with a good grasp of mathematical analysis may skip this chapter.
To follow Chap. 10, the reader should read all material up to and in-
cluding Theorem 9.2 in this chapter. All the ideas developed in this
chapter will be needed in Chap. 11.
So far in the book we have restricted ourselves to an intuitive under-
standing of why algorithms generate optimal solutions. In this chapter,
our aim is to make a transition from the nebulous world of intuition
to a more solid mathematical world. We have made every attempt
to make this transition as gentle as possible. Apart from the obvious
fact that an algorithm's usefulness remains in doubt until mathematical
arguments establish it, there are at least two other reasons for studying
mathematical arguments related to an algorithm’s ability to generate
optimal solutions: (i) mathematical analysis leads to the identifica-
tion of conditions under which optimal solutions can be generated and
(ii) mathematical analysis provides insights into the working mecha-
nism of the algorithm. The reader not interested in everything in this
chapter is advised to read ahead, and then come back to this chapter
as and when the proof of a result needs to be understood. We will
begin this chapter with a discussion on vectors and vector spaces.
3. Norms
A norm, in a given vector space V , is a scalar quantity associated
with a given vector in that space. To give a rough geometric inter-
pretation, it denotes the length of a vector. There are many ways
to define a norm, and we will discuss some standard norms and their
properties.
The so-called max norm, which is also called infinity norm or
sup norm, is defined as follows.
||x||_∞ = max_i |x(i)|.
In the above definition, ||x||∞ denotes the max norm of the vector
x, and x(i) denotes the ith element of the vector x. The following
example will illustrate the idea.
Example. a = (12, −13, 9) and b = (10, 12, 14) are two vectors in ℜ³.
Their max norms are:

||a||_∞ = max{|12|, |−13|, |9|} = 13 and ||b||_∞ = max{|10|, |12|, |14|} = 14.

Example 2. Consider the vector space defined by the set ℜ⁶⁹ equipped with
the max norm. The norm for this space is:

||x||_∞ = max_{i∈{1,2,...,69}} |x(i)|.
The set from which the values of x come is called the domain of the function, and the set
from which the values of y come is called the range of the function.
The domain and range of a function can be denoted using the following
notation.
f : A → B,
where A is the domain and B is the range. This is read as: “a function
f from A to B.”
The example given above is the simplest form of a function. Let
us consider some more complicated functions. Consider the following
function.
y = 4 + 5x₁ + 3x₂    (9.1)

in which each of x₁ and x₂ takes on values from the set ℜ. Now, a
general notation to express this function is y = f(x₁, x₂). This function
(given in Eq. (9.1)) clearly picks up a vector such as (1, 2) and assigns
a value of 15 to y. Thus, the function (9.1) can be represented as a
set of ordered triples (x1 , x2 , y)—examples of which are: (1, 2, 15) and
(0.1, 1.2, 8.1), and so on. This makes it possible to view this function
as an operation, whose input is a vector of the form (x1 , x2 ) and whose
output is a scalar. In other words, the domain of the function is ℜ²
and its range is ℜ.
Functions that deal with vectors are also called mappings or maps
or transformations. It is not hard to see that we can define a func-
tion from ℜ² to ℜ². An example of such a function is defined by the
following two equations, which can be written compactly as x′ = Ax + B:

x′₁ = 4x₁ + 5x₂ + 9,
x′₂ = 5x₁ + 7x₂ + 7.

Here x = (x₁, x₂)ᵀ and x′ = (x′₁, x′₂)ᵀ are the two-dimensional vectors
in question and A and B are matrices. (xᵀ denotes the transpose of the
vector x.) For the example under consideration, the matrices are:

A = [ 4  5 ]        B = [ 9 ]
    [ 5  7 ],           [ 7 ].
F²(a) ≡ F(F(a)).
6. Mathematical Induction
The principle of mathematical induction will be used on many
occasions in this book. As such, it is important that you understand
it clearly. Before we present it, let us define J to be the set of
positive integers, i.e., J = {1, 2, 3, . . .}.
This theorem implies that the relation holds for R(2) from the fact
that R(1) is true, and from the truth of R(2) one can show the same
for R(3). In this way, it is true for all n in J . All these are intuitive
arguments. We now present a rigorous proof.
Prove that
1 + 3 + 5 + · · · + (2p − 1) = p².
(We will use LHS to denote the left hand side of the relation and
RHS the right hand side. Notice that it is easy to use the formulation
for the arithmetic progression series to prove the above, but our intent
here is to demonstrate how induction proofs work.)
Now, when p = 1, LHS = 1 and RHS = 1² = 1, and thus the
relation (the equation in this case) is true when p = 1.
Next let us assume that it is true when p = k, and hence

1 + 3 + 5 + · · · + (2k − 1) = k².

Now when p = (k + 1) we have that:

LHS = 1 + 3 + 5 + · · · + (2k − 1) + (2(k + 1) − 1)
    = k² + 2(k + 1) − 1
    = k² + 2k + 1
    = (k + 1)²
    = RHS when p = k + 1.
7. Sequences
It is assumed that you are familiar with the notion of a sequence.
A familiar example of a sequence is:
a, ar, ar², ar³, . . .
Here x_p denotes the pth term of the sequence. For the geometric
sequence shown above:

x_p = a r^{p−1}.
A sequence can be viewed as a function whose domain is the set
of positive integers (1, 2, . . .) and the range is the set that includes all
possible values that the terms of the sequence can take on.
We are often interested in finding the value of the sum of the first
m terms of a sequence. Let us consider the geometric sequence given
above. The sum of the first m terms of this sequence, it can be proved,
is given by:

S_m = a(1 − r^m) / (1 − r).    (9.3)
The sums themselves form the sequence

{S_1, S_2, S_3, . . .},

which we can denote by {S_m}_{m=1}^∞. We will prove (9.3) using
mathematical induction.
mathematical induction.
Proof Since the first term is a, S_1 should be a. Plugging in 1 for m
in (9.3) indeed yields a, so clearly the relation is true for m = 1.
Now let us assume that it is true for m = k; that is,
S_k = a(1 − r^k) / (1 − r).
The (k + 1)th term is (from the way the sequence is defined) ar^k.
Then the sum of the first (k + 1) terms is the sum of the first k terms
and the (k + 1)th term. Thus:
S_{k+1} = a(1 − r^k)/(1 − r) + ar^k
        = [a(1 − r^k) + ar^k(1 − r)] / (1 − r)
        = [a − ar^k + ar^k − ar^{k+1}] / (1 − r) = a(1 − r^{k+1}) / (1 − r).
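A quick numerical check of Eq. (9.3), using the arbitrary illustrative values a = 3 and r = 0.5:

    a, r = 3.0, 0.5                          # any a and any r != 1 will do
    running_sum = 0.0
    for m in range(1, 11):
        running_sum += a * r ** (m - 1)      # add the mth term, a*r^(m-1)
        formula = a * (1 - r ** m) / (1 - r) # Eq. (9.3)
        assert abs(running_sum - formula) < 1e-12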
From this stage, we will gradually introduce rigor into our discussion
on sequences. In particular, we need to define some concepts such as
convergence and Cauchy sequences. We will use the abbreviation iff to
mean “if and only if.” The implication of ‘iff’ needs to be understood.
When we say that the condition A holds iff B is true, the following is
implied: A is true if B is true and B is true if A is true.
Definition 9.1 A Convergent Sequence: A sequence {a_p}_{p=1}^∞ is
said to converge to a real number A iff for any ε > 0, there exists a
positive integer N such that for all p ≥ N, we have that

|a_p − A| < ε.
ap+1 ≥ ap .
7.3. Boundedness
We next define the concepts of “bounded above” and “bounded
below,” in the context of a sequence. A sequence {a_p}_{p=1}^∞ is said to be
bounded below, if there exists a finite value L such that:
ap ≥ L
for all values of p. L is then called a lower bound for the sequence.
Similarly, a sequence is said to be bounded above, if there exists
a finite value U such that:
ap ≤ U
for all values of p. U is then called an upper bound for the sequence.
The sequence:
{1, 2, 3, . . .}
294 SIMULATION-BASED OPTIMIZATION
is bounded below, but not above. Notice that 1 is a lower bound for
this sequence and so are 0 and −1 and −1.5. But 1 is the highest lower
bound (often called the infimum).
In the sequence:
{1, 1/2, 1/3, . . .},
1 is an upper bound and so are 2 and 3 etc. Here 1 is the lowest of
the upper bounds (also called the supremum).
A sequence that is bounded both above and below is said to be a
bounded sequence.
We will next examine a useful result related to decreasing (increas-
ing) sequences that are bounded below (above). The result states
that a decreasing (increasing) sequence that is bounded below (above)
converges.
Intuitively, it should be clear that a decreasing sequence, that is, a
sequence in which each term is less than or equal to the previous term,
should converge to a finite value because the values of the terms cannot
go below a finite value M . So the terms keep decreasing and once
they reach a point below which they cannot go, they stop decreasing;
so the sequence should converge. Now, let us prove this idea using
precise mathematics. One should remember that to prove convergence,
one must show that the sequence satisfies Definition 9.1.
Theorem 9.2 A decreasing (increasing) sequence converges if it is
bounded below (above).
Proof We will work out the proof for the decreasing sequence case.
For the increasing sequence case, the proof can be worked out in a
similar fashion. Let us denote the sequence by {a_p}_{p=1}^∞. Let L be the
highest of the lower bounds on the sequence. Then, for any p, since L
is a lower bound,
ap ≥ L. (9.5)
Choose a strictly positive value for the variable ε. Then ε > 0. Then
L + ε, which is greater than L, is not a lower bound. (Note that L
is the highest lower bound.) Then, it follows that there exists an N,
such that

a_N < L + ε.    (9.6)
Then, for p ≥ N, since it is a decreasing sequence,

a_p ≤ a_N.

Combining the above with Inequations (9.5) and (9.6), we have
that for p ≥ N:

L ≤ a_p ≤ a_N < L + ε,

i.e., |a_p − L| < ε for all p ≥ N. By Definition 9.1, the sequence thus
converges to L.
Consider the sequence whose pth term is a(p) = 1/p. One can
keep increasing the value of p to get positive terms that keep getting
smaller. The terms of course approach 0, but we can never find a
finite value for p for which a(p) will equal 0. Hence 0 is not a point
in the range of this sequence. And yet 0 is an accumulation point of
the range. This is because any neighborhood of 0 contains infinitely
many points of the range of the sequence.
We are now at a point to discuss the famous Bolzano-Weierstrass
theorem. We will not present its proof, so as to not distract the reader
from our major theme which is the convergence of a Cauchy sequence.
The proof can be found in any undergraduate text on mathematical
analysis such as Gaughan [95] or Douglass [81]. This theorem will be
needed in proving that Cauchy sequences are convergent.
Theorem 9.4 (Bolzano-Weierstrass Theorem:) Every bounded
infinite set of real numbers has at least one accumulation point.
The theorem describes a very important property of bounded infi-
nite sets of real numbers. We will illustrate the statement of the
theorem using an example.
Consider the interval (1, 2). It is bounded (by 1 below and 2 above)
and has infinitely many points. Hence it must have at least one
accumulation point. We have actually discussed above how any point
in this set is an accumulation point.
A finite set does not have any accumulation point because, by
definition, it has a finite number of points, and hence none of its neigh-
borhoods can contain an infinite number of points.
The next result is a key result that will be used in the convergence
analysis of reinforcement learning algorithms.
Theorem 9.5 A Cauchy sequence is convergent.
Proof Let {a_p}_{p=1}^∞ be a Cauchy sequence. The range of the sequence
can be an infinite or a finite set. Let us handle the finite case first, as
it is easier.
Let {s_1, s_2, . . . , s_r}, consisting of r terms, denote a finite set repre-
senting the range of the sequence {a_p}_{p=1}^∞. Now if we choose

ε = min{|s_i − s_j| : i ≠ j; i, j = 1, 2, . . . , r},

then there is a positive integer N such that for any k, m ≥ N we have
that |a_k − a_m| < ε. Now it is the case that a_k = s_c and a_m = s_d for
some c and some d belonging to the set {1, 2, . . . , r}. Thus:

|s_c − s_d| = |a_k − a_m| < ε = min{|s_i − s_j| : i ≠ j; i, j = 1, 2, . . . , r}.
From the above, it is clear that the absolute value of the difference
between sc and sd is strictly less than the minimum of the absolute
value of the differences between the terms. As a result, |ak − am | = 0,
i.e., ak = am for k, m ≥ N . This implies that from some point (N )
onwards, the sequence values are constant. This means that at this
point the sequence converges to a finite quantity. Thus the convergence
of the Cauchy sequence with a finite range is established.
Next, let us assume that the set we refer to as the range of the
sequence is infinite. Let us call the range S. The range of a Cauchy
sequence is bounded by Theorem 9.3. Since S is infinite and bounded,
by the Bolzano-Weierstrass theorem (Theorem 9.4), S must have an
accumulation point; let us call it x. We will prove that the sequence
converges to x.
Choose an ε > 0. Since x is an accumulation point of S, by its
definition, the interval (x − ε/2, x + ε/2) is a neighborhood of x that contains
infinitely many points of S. Now, since {a_p}_{p=1}^∞ is a Cauchy sequence,
there is a positive integer N such that for k, m ≥ N, |a_k − a_m| < ε/2.
Since (x − ε/2, x + ε/2) contains infinitely many points of S, and hence
infinitely many terms of the sequence {a_p}_{p=1}^∞, there exists a t such
that t ≥ N and that a_t is a point in the interval (x − ε/2, x + ε/2).
Then, for any p ≥ N,

|a_p − x| ≤ |a_p − a_t| + |a_t − x| < ε/2 + ε/2 = ε,

so the sequence converges to x.
Consider two sequences {a_p}_{p=1}^∞ and {b_p}_{p=1}^∞ converging to the real
numbers A and B, respectively. Then,
(i) If a_p ≥ 0 for all p ∈ J, then A ≥ 0.
(ii) If a_p ≤ b_p for all p ∈ J, then A ≤ B.
(iii) If there exists a scalar C ∈ ℜ for which C ≤ b_p for all p ∈ J,
then C ≤ B. In a similar manner, if a_p ≤ C for all p ∈ J, then
A ≤ C.
the sequences {y_p}_{p=1}^∞ and {z_p}_{p=1}^∞, we have that lim_{p→∞} y_p ≤ A.
8. Sequences in ℜⁿ
Thus far, our discussions have been limited to scalar sequences.
A scalar sequence is one whose terms are scalar quantities. Scalar
sequences are also called sequences in ℜ. The reason for this is that
scalar quantities are members of the set ℜ.
A sequence in ℜⁿ is a sequence whose terms are vectors. We will
refer to this sequence as a vector sequence. For example, consider
a sequence {a⃗_p}_{p=1}^∞ whose pth term is defined as:

a⃗_p = (1/p, p² + 1).

This sequence, starting at p = 1, will take on the following values:

{(1, 2), (1/2, 5), (1/3, 10), . . .}.
The above is an example of a sequence in ℜ². This concept extends
nicely to any dimension. Each of the individual scalar sequences in
such a sequence is called a coordinate sequence. Thus, in the example
given above, {1/p}_{p=1}^∞ and {p² + 1}_{p=1}^∞ are the coordinate sequences
of {a⃗_p}_{p=1}^∞.
Remark. We will use the notation a_p(i) to denote the pth term of the
ith coordinate sequence of the vector sequence {a⃗_p}_{p=1}^∞. For instance,
in the example given above,

a⃗_p = (a_p(1), a_p(2)),

where a_p(1) = 1/p and a_p(2) = p² + 1.
We will next define what is meant by a Cauchy sequence in ℜⁿ.
9. Cauchy Sequences in ℜⁿ
The concept of Cauchiness also extends elegantly from sequences in
ℜ (where we have seen it) to sequences in ℜⁿ. The Cauchy condition
in ℜⁿ (Definition 9.6) reads: a sequence {a⃗_p}_{p=1}^∞ in ℜⁿ is Cauchy if for
any ε > 0, there exists a positive integer N such that ||a⃗_k − a⃗_m|| < ε
for all k, m ≥ N. Lemma 9.9, proved next, states that if a vector
sequence is Cauchy, then each of its coordinate sequences {a_p(i)}_{p=1}^∞
is Cauchy in ℜ, for i = 1, 2, . . . , n.
Proof We will prove the result for the max norm. The result can be
proved for any norm. Since {a⃗_p}_{p=1}^∞ is Cauchy, we have that for a
given value of ε > 0, there exists an N such that for any m, k ≥ N,

||a⃗_k − a⃗_m|| < ε.

From the definition of the max norm, we have that for any k, m ≥ N,

max_i |a_k(i) − a_m(i)| < ε.

Now in this inequality, the < relation holds for the max over i in the
left-hand side. Hence the result must be true for all i. Thus, for any
k, m ≥ N,

|a_k(i) − a_m(i)| < ε

is true for all i. This implies that each coordinate sequence is Cauchy
in ℜ.
c = a − b.
A vector x that satisfies

F x = x

is called a fixed point of the transformation F.
Note that the definition says nothing about contractions, and in fact,
F may have a fixed point even if it is not contractive.
Let us illustrate the idea of generating a fixed point with a contrac-
tive mapping using an example. Consider the following vectors.
We will apply the transformation G, in which x_k(i) denotes the ith
component of the vector x⃗^k. Let us assume that G is contractive (this
can be proved) and is defined as:

x_{k+1}(1) = 5 + 0.2x_k(1) + 0.1x_k(2),
x_{k+1}(2) = 10 + 0.1x_k(1) + 0.2x_k(2).
Table 9.1 shows the results of applying transformation G repeatedly.
A careful observation of the table will reveal the following. With every
iteration, the difference vector c⃗^k becomes smaller. Note that when
k = 12, the vectors a⃗^k and b⃗^k have become one, if one ignores anything
beyond the fifth place after the decimal point. The two vectors become
one, strictly speaking, only when k = ∞. In summary, one starts
with two different vectors but eventually arrives at a fixed point which,
in this case, seems to be very close to (7.936507, 13.492062).
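The computation behind Table 9.1 is easy to reproduce. In the sketch below, the two starting vectors are arbitrary (the book's originals are not reproduced here); any pair of starting vectors is carried to the same fixed point.

    def G(x):
        """One application of the contractive transformation G."""
        return (5 + 0.2 * x[0] + 0.1 * x[1],
                10 + 0.1 * x[0] + 0.2 * x[1])

    a, b = (0.0, 0.0), (20.0, -5.0)   # arbitrary starting vectors
    for k in range(60):
        a, b = G(a), G(b)
    print(a)   # both print (7.936507..., 13.492063...), the fixed point
    print(b)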
What we have demonstrated above is an important property of a
contractive mapping. One may start with any vector, but successive
applications of the mapping transforms the vector into a unique vector.
See Figs. 9.1–9.3 to get a geometric feel for how a contraction map-
ping keeps “shrinking” the vectors in 2 space, and carries any given
vector to a unique fixed point. The figures are related to the data
given in Table 9.1. The contraction mapping is very much like a dog
that carries every bone (read vector) that it gets to its own hidey hole
(read a unique fixed point), regardless of the size of the bone or where
the bone has come from.
Let us now examine a more mathematically precise definition of a
contraction mapping.
Definition 9.8 A mapping (or transformation) F is said to be a con-
traction mapping in ℜⁿ if there exists a λ such that 0 ≤ λ < 1 and

||F v⃗ − F u⃗|| ≤ λ||v⃗ − u⃗||

for all v⃗, u⃗ in ℜⁿ.
As is clear from the definition, the norm represents what was referred
to as “length” in our informal discussion on contraction mappings.
By applying F repeatedly, one obtains a sequence of vectors. Con-
sider a vector v⃗⁰ on which the transformation F is applied repeatedly.
This will form a sequence of vectors, which we will denote by {v⃗^k}_{k=0}^∞.
Here v⃗^k denotes the kth term of the sequence. It must be noted that
each term is itself a vector. The relationship between the terms is
given by

v⃗^{k+1} = F v⃗^k.
Table 9.1. Table showing the change in values of the vectors a and b after repeated
applications of G
Figure 9.1. The thin line represents vector a, the dark line represents the vector
b, and the dotted line the vector c. This is before applying G
Thus the sequence can also be denoted as {v⃗⁰, v⃗¹, v⃗², . . .}. It is clear
that v⃗¹ = F v⃗⁰, v⃗² = F v⃗¹, and so on. Now if ||F v⃗⁰ − F u⃗⁰|| ≤ λ||v⃗⁰ −
u⃗⁰|| for all vectors in ℜⁿ, then
Figure 9.2. The thin line represents vector a, the dark line represents the vector
b, and the dotted line the vector c. This is the picture after one application of G.
Notice that the vectors have come closer
Figure 9.3. This is the picture after 11 applications of G. By now the vectors are
almost on the top of each other, and it is difficult to distinguish between them
converges to v∗ .
In the above:
Line (9.11) follows from the definition of F (see Eq. (9.10)).
Line (9.12) follows from Inequation (9.7) by setting u⃗⁰ = v⃗^{m′−m},
which can be shown to imply that:

F^m u⃗⁰ = F^{m′} v⃗⁰.
Since m′ > m, the above ensures that the vector sequence {v⃗^k}_{k=0}^∞
is Cauchy. From Lemma 9.9, it follows that if the vector sequence
satisfies the Cauchy condition (see Definition 9.6), then each coordi-
nate sequence is also Cauchy. From Theorem 9.5, a Cauchy sequence
converges, and thus each coordinate sequence converges to a finite num-
ber. Consequently, the vector sequence converges to a finite-valued
vector. Let us denote the limit by v⃗∗. Hence we have that
0 ≤ ||F v⃗∗ − v⃗∗||
  = ||F v⃗∗ − v⃗^k + v⃗^k − v⃗∗||
  ≤ ||F v⃗∗ − v⃗^k|| + ||v⃗^k − v⃗∗||
  = ||F v⃗∗ − F v⃗^{k−1}|| + ||v⃗^k − v⃗∗||
  ≤ λ||v⃗∗ − v⃗^{k−1}|| + ||v⃗^k − v⃗∗||.    (9.18)
From (9.17), both terms of the right hand side of (9.18) can be made
arbitrarily small by choosing a large enough k. Hence
||Fv∗ − v∗ || = 0.
This implies that Fv∗ = v∗ , and so v∗ is a fixed point of F .
What remains to be shown is that the fixed point v∗ is unique.
To prove this, let us assume that there are two vectors a and b that
satisfy x = F x. Hence
a = Fa, and b = Fb.
Then using the contraction property, one has that:

λ||b⃗ − a⃗|| ≥ ||F b⃗ − F a⃗|| = ||b⃗ − a⃗||,

which implies that:

λ||b⃗ − a⃗|| ≥ ||b⃗ − a⃗||.
Since a norm is always non-negative, and since λ < 1, the above must
imply that ||b − a|| = 0. As a result, a = b, and uniqueness follows.
11.2.2 A Ball
We define an open ball in ℜⁿ that is centered on x⃗ and has a radius
of r ∈ ℜ, where r > 0, to be the following set:

B(x⃗, r) ≡ {y⃗ ∈ ℜⁿ : ||y⃗ − x⃗|| < r},

where ||.|| denotes any norm. For instance, for n = 1, the ball is simply
the open interval (x − r, x + r). Similarly, if n = 2, then the ball can
be viewed geometrically as an open circle whose radius is r and whose
center is x⃗.
dx/dt = −x / (1 + t).    (9.20)
By separation of variables, we have that the solution is

x = x₀(1 + t₀) / (1 + t),
where x0 and t0 are obtained from the boundary condition to elimi-
nate the constant of integration. Here, we can denote the solution in
general as:
φ(t) = x₀(1 + t₀) / (1 + t).
The implication is that the equilibrium point is not only stable but
in addition, eventually the ODE’s solution will converge to the equi-
librium point. In the ODE of Eq. (9.20), where x = 0 is the unique
equilibrium point, it is easy to see that:
lim_{t→∞} φ(t) = 0,
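A crude forward-Euler check of this solution, with the illustrative boundary condition x₀ = 1 at t₀ = 0:

    x0, t0, dt = 1.0, 0.0, 1e-4          # illustrative boundary condition
    x, t = x0, t0
    for _ in range(200000):              # integrate up to t = 20
        x += dt * (-x / (1 + t))         # Euler step for dx/dt = -x/(1+t)
        t += dt
    print(x, x0 * (1 + t0) / (1 + t))    # numerical vs. closed-form solution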
Here, w^k(l) denotes the noise term involved in the kth iteration of
the algorithm while updating X^k(l). We now place some conditions
on our algorithm. For the result that we will present below, all of
these conditions should hold.
Assumption 9.11 The function F (.) is Lipschitz continuous.
The precise implication of the following condition will become
clearer later. Essentially what the following condition ensures is that
the effect of the noise vanishes in the limit, i.e., as if it never existed!
Assumption 9.12 For l = 1, 2, . . . , N and for every k, the following
should be true about the noise terms:

E[w^k(l) | F^k] = 0;
E[(w^k(l))² | F^k] ≤ z₁ + z₂||X⃗^k||²;    (9.22)

where z₁ and z₂ are scalar constants and ||.|| could be any norm.
It is not hard to see that the first condition within the assumption
above essentially states that the conditional expectation of the noise
(the condition being that the history of the algorithm is known to us)
is 0. The second condition in (9.22) states that the second (conditional)
moment of the noise is bounded by a function of the iterate. If the
iterate is bounded, this condition holds.
Assumption 9.13 The step size α^k satisfies the following conditions:

Σ_{k=1}^∞ α^k = ∞;    (9.23)

Σ_{k=1}^∞ (α^k)² < ∞.    (9.24)
The ODE for this algorithm will be (compare the above to Eq. (9.21)):

dx/dt = 5x.
Thus, all you need to identify the ODE is the algorithm’s transforma-
tion, F (.).
We are now at a position to present the important result from
[184], which forms the cornerstone of convergence theory of stochastic
approximation schemes via the ODE method.
Theorem 9.16 Consider the synchronous stochastic approximation
scheme defined in Eq. (9.21). If Assumptions 9.11–9.15 hold, then
with probability 1, the sequence {X⃗^k}_{k=1}^∞ converges to x⃗∗.
The proof of the above is rather deep, and involves some additional
results that are beyond the scope of this text. The result is very
powerful; it implies that if we can show these assumptions to hold, the
algorithm is guaranteed to converge. The implications of the result
are somewhat intuitive, and we will discuss those below. Also, we will
show in subsequent chapters that this result, or some of its variants,
can be used to show the convergence of simultaneous perturbation and
many reinforcement learning algorithms.
We note that the result above also holds in a noise-free setting, i.e.,
when the noise term w(l) = 0 for every l in every iteration of the
algorithm. As stated above, when we have a noise-free algorithm, the
condition in (9.24) in Assumption 9.13 is not needed.
The intuition underlying the above result is very appealing. It im-
plies that the effect of noise in the noisy algorithm will vanish in the
limit if Assumption 9.12 is shown to be true. In other words, it is
as if noise never existed in the algorithm and that we were using the
following update:

X^{k+1}(l) = X^k(l) + α^k F(X⃗^k)(l).
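To see the scheme at work, consider the illustrative choice F(x) = −x, for which the associated ODE dx/dt = −x has the unique asymptotically stable equilibrium x∗ = 0, together with zero-mean Gaussian noise and the step sizes α^k = 1/k, which satisfy Assumption 9.13; this is a sketch under those assumptions, not an algorithm from the text.

    import random

    x = 10.0                             # arbitrary starting iterate
    for k in range(1, 200001):
        alpha = 1.0 / k                  # satisfies (9.23) and (9.24)
        noise = random.gauss(0.0, 1.0)   # zero-mean, bounded-variance noise
        x = x + alpha * (-x + noise)     # update with F(x) = -x plus noise
    print(x)                             # hovers very near x* = 0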
Bibliographic Remarks. All the material in this chapter until Sect. 11 is classical,
and some of it is more than a hundred years old. Consequently, most of this material
can be found in any standard text on mathematical analysis. The results that we
presented will be needed in subsequent chapters. Gaughan [95] and Douglass [81]
cover most of the topics dealt with here until Sect. 11. The fixed point theorem can
be found in Rudin [254].
Material in Sect. 11 on ODEs can be found in [54], and Theorem 9.16 is from
Kushner and Clark [184] (see also [185, 48]). Work on ODEs and stochastic ap-
proximation has originated from the work of Ljung [192].
Chapter 10
CONVERGENCE ANALYSIS
OF PARAMETRIC OPTIMIZATION
METHODS
1. Chapter Overview
This chapter deals with some simple convergence results related to
the parametric optimization methods discussed in Chap. 5. The main
idea underlying convergence analysis of an algorithm is to identify
(mathematically) the solution to which the algorithm converges.
Hence to prove that an algorithm works, one must show that the
algorithm converges to the optimal solution. In this chapter, this is
precisely what we will attempt to do with some algorithms of Chap. 5.
The convergence of simulated annealing requires some understanding
of Markov chains and transition probabilities. To this end, it is suf-
ficient to read all sections of Chap. 6 up to and including Sect. 3.1.3.
It is also necessary to read about convergent sequences. For this
purpose, it is sufficient to read all the material in Chap. 9 up to and
including Theorem 9.2. Otherwise, all that is needed to read this
chapter is a basic familiarity with the material of Chap. 5.
Our discussion on the analysis of the steepest-descent rule begins in
Sect. 3. Before discussing the mathematical details, we review defini-
tions of some elementary ideas from calculus and a simple theorem.
2. Preliminaries
In this section, we define some basic concepts needed for understand-
ing convergence of continuous parametric optimization. The material
in this section should serve as a refresher. All of these concepts will
be required in this chapter. Readers familiar with them can skip this
section without loss of continuity.
We now illustrate this idea with the following example function from
ℜ to ℜ:

f(x) = 63x² + 5x.

Note that this function must be continuous since for any c ∈ ℜ,
lim_{x→c} f(x) = 63c² + 5c = f(c). Now consider the function
f(x) = (x² − 5x + 6)/(x² − 6x + 8), which is not defined at x = 2,
although

lim_{x→2} f(x) = lim_{x→2} (x − 2)(x − 3) / [(x − 2)(x − 4)]
             = lim_{x→2} (x − 3)/(x − 4) = 1/2.

This implies, from the definition above, that the function is not con-
tinuous at x = 2, and hence the function is not continuous on ℜ.
Note that the definition does not require the condition to hold for
every ε, but only says that there exists an ε that satisfies the condition
above.
This means that the global minimum is the minimum of all the local
minima.
The definitions for local and global maxima can be similarly
formulated. See Fig. 10.1 to get geometric intuition for strict local
optima and saddle points. See Fig. 10.2 for an illustration of the
difference between local and global optima.
Figure 10.1. Strict local optima and saddle points in function minimization
Figure 10.2. Local and global optima in function minimization
A function can be expanded in a Taylor's series around a point if its
derivatives of all orders exist and if some other conditions hold. A function for
which derivatives of all orders exist is also called a smooth function.
We present this result without proof.
Proof If |h| in the Taylor's series is a small quantity, one can ignore
terms with h raised to the power 2 and higher. Then setting x = x∗ in
the Taylor's series, and selecting a sufficiently small value for |h|, one
has that:

f(x∗ + h) = f(x∗) + h · df(x)/dx |_{x=x∗}.    (10.3)
We will use contradiction logic. Let us assume that x∗ is not a sta-
tionary point. Then,

df(x)/dx |_{x=x∗} ≠ 0, i.e., either df(x)/dx |_{x=x∗} > 0 or df(x)/dx |_{x=x∗} < 0.

In either case, by selecting a suitable sign for h, one can always have
that:

h · df(x)/dx |_{x=x∗} < 0.
Using the above in Eq. (10.3), one has that f(x∗ + h) < f(x∗). (10.4)
However, since x∗ is a local minimum, there exists an ε > 0 such that
for every h that satisfies |h| < ε, we have that f(x∗ ± |h|) ≥ f(x∗).
This implies that f(x∗ + h) ≥ f(x∗), which contradicts inequality (10.4).
As a result, the local minimum must be a stationary point.
3. Steepest Descent
In this book, the principle of steepest descent was discussed in the
context of neural networks and also simultaneous perturbation/finite
differences. Hence, we now present some elementary analysis of this
rule. We will prove that the steepest-descent rule converges to a
stationary point of the function it seeks to optimize under certain
conditions.
The main transformation in steepest descent is:

x^{m+1}(i) = x^m(i) − μ ∂f(x⃗)/∂x(i) |_{x⃗=x⃗^m}  for i = 1, 2, . . . , k,    (10.5)
Proof In the proof, we will use the Euclidean norm; hence, || · || will
denote || · ||₂. Consider two vectors, x⃗ and z⃗, in ℜ^k, and ζ ∈ ℜ. Let

g(ζ) = f(x⃗ + ζz⃗).
Then, from the chain rule, one has that:

dg(ζ)/dζ = [z⃗]ᵀ ∇f(x⃗ + ζz⃗).    (10.8)
We will use this below. We have

f(x⃗ + z⃗) − f(x⃗) = g(1) − g(0)   (from the definition of g(ζ))
  = ∫₀¹ dg(ζ) = ∫₀¹ [dg(ζ)/dζ] dζ
  = ∫₀¹ [z⃗]ᵀ ∇f(x⃗ + ζz⃗) dζ   (from (10.8))
  ≤ ∫₀¹ [z⃗]ᵀ ∇f(x⃗) dζ + | ∫₀¹ z⃗ᵀ(∇f(x⃗ + ζz⃗) − ∇f(x⃗)) dζ |
  ≤ ∫₀¹ [z⃗]ᵀ ∇f(x⃗) dζ + ∫₀¹ ||z⃗|| · ||∇f(x⃗ + ζz⃗) − ∇f(x⃗)|| dζ
  ≤ ∫₀¹ [z⃗]ᵀ ∇f(x⃗) dζ + ||z⃗|| ∫₀¹ Lζ||z⃗|| dζ   (from (10.7))
  = [z⃗]ᵀ ∇f(x⃗) · 1 + L||z⃗||² · (1/2).
Note: (10.9) follows from the fact that for the Euclidean norm,
|z⃗ᵀy⃗| ≤ ||z⃗|| · ||y⃗|| (the Cauchy-Schwarz inequality).
for any m. In other words, the values of the objective function f (x m )
for m = 1, 2, 3, . . . form a decreasing sequence. A decreasing sequence
that is bounded below converges (see Theorem 9.2 from Chap. 9) to a
finite number. Hence:
lim_{m→∞} μ(1 − μL/2) ||∇f(x⃗^m)||² ≤ 0.

In the above, all the quantities in the left-hand side are ≥ 0.
Consequently,

lim_{m→∞} ∇f(x⃗^m) = 0⃗.
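As an illustration of the rule in Eq. (10.5), take the function f(x⃗) = x(1)² + 2x(2)² (an assumption made here only for illustration), whose gradient is Lipschitz with constant L = 4, so any μ < 2/L = 0.5 satisfies the step-size condition used above:

    def grad_f(x):
        """Gradient of the illustrative function f(x) = x1^2 + 2*x2^2."""
        return [2.0 * x[0], 4.0 * x[1]]

    mu = 0.1                 # satisfies mu < 2/L, with L = 4 here
    x = [5.0, -3.0]          # arbitrary starting point
    for m in range(200):
        g = grad_f(x)
        x = [x[i] - mu * g[i] for i in range(2)]   # the update (10.5)
    print(x)                 # approaches the stationary point (0, 0)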
Proof The forward difference formula assumes that all terms of the
order of h² and higher orders of h are negligible. If h is small, this is
a reasonable assumption, but it produces an error nevertheless. The
central difference formula, on the other hand, does not neglect terms of
the order of h²; it neglects the terms of the order of h³ and higher
orders of h. From the Taylor series, ignoring terms with h² and higher
orders of h, we have that:

f(x + h) = f(x) + h df(x)/dx.

This, after re-arrangement of terms, yields:

df(x)/dx = [f(x + h) − f(x)] / h,

which of course is the forward difference formula given in Eq. (5.3).
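The two error orders are easy to observe numerically; with the illustrative function f(x) = eˣ at x = 1 (true derivative e), shrinking h by a factor of 10 cuts the forward-difference error by roughly 10, but cuts the central-difference error by roughly 100:

    import math

    f, x = math.exp, 1.0
    true_derivative = math.exp(1.0)
    for h in [1e-1, 1e-2, 1e-3]:
        forward = (f(x + h) - f(x)) / h            # error of order h
        central = (f(x + h) - f(x - h)) / (2 * h)  # error of order h^2
        print(h, abs(forward - true_derivative), abs(central - true_derivative))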
5. Simultaneous Perturbation
Material in this section is devoted to a convergence analysis of
simultaneous perturbation. We will discuss convergence of the algo-
rithm under three progressively weaker sets of conditions. We first
need to define some notation.
2. Dm (i) will denote the true value of the partial derivative of the
function under consideration with respect to the ith variable at the
mth iteration of the algorithm; the derivative will be calculated at
x⃗ = x⃗^m. Thus mathematically:

D^m(i) ≡ ∂f(x⃗)/∂x(i) |_{x⃗=x⃗^m}.    (10.13)
The above assumes that exact values of the function are available
in the computation above. In what follows, we will drop h from the
subscript of S m , but it will be understood that every simultaneous
perturbation estimate will depend on the vector h. Also, the vector
h is computed using Eq. (5.4).
Assumption 10.6 Let the step size satisfy the following conditions:

lim_{l→∞} Σ_{m=1}^{l} μ^m = ∞;    lim_{l→∞} Σ_{m=1}^{l} (μ^m)² < ∞.    (10.17)
f(x(1) + h(1), x(2) + h(2)) = f(x(1), x(2)) + h(1) ∂f(x⃗)/∂x(1) + h(2) ∂f(x⃗)/∂x(2)
+ (1/2!) [ [h(1)]² ∂²f(x⃗)/∂x²(1) + 2h(1)h(2) ∂²f(x⃗)/∂x(1)∂x(2) + [h(2)]² ∂²f(x⃗)/∂x²(2) ] + · · ·
(10.20)
The proof of the above can be found in any standard calculus text.
We are now at a position to prove Lemma 10.11 from [108].
Assumption 10.10 The function f is smooth (i.e., infinitely differ-
entiable).
Lemma 10.11 If Assumption 10.10 is true of the function defined
in the update in Eq. (10.15), the noise terms in the update satisfy
Assumption 10.7.
Proof We will use the Euclidean norm below, i.e., ||.|| will mean
||.||2 . The proof’s road map is as follows: First a relationship is de-
veloped between the simultaneous perturbation estimate and the true
derivative, which helps define the noise—that will be shown to satisfy
Assumption 10.7.
We will assume for the time being that k = 2. From the Taylor series
result, i.e., Eq. (10.20), ignoring terms with h³ and higher orders of h
and suppressing the superscript m, we have that:

f(x(1) + h(1), x(2) + h(2)) = f(x(1), x(2)) + h(1) ∂f(x⃗)/∂x(1) + h(2) ∂f(x⃗)/∂x(2)
+ (1/2!) [ [h(1)]² ∂²f(x⃗)/∂x²(1) + 2h(1)h(2) ∂²f(x⃗)/∂x(1)∂x(2) + [h(2)]² ∂²f(x⃗)/∂x²(2) ].    (10.21)
From the same Taylor series, we also have that:

f(x(1) − h(1), x(2) − h(2)) = f(x(1), x(2)) − h(1) ∂f(x⃗)/∂x(1) − h(2) ∂f(x⃗)/∂x(2)
+ (1/2!) [ [h(1)]² ∂²f(x⃗)/∂x²(1) + 2h(1)h(2) ∂²f(x⃗)/∂x(1)∂x(2) + [h(2)]² ∂²f(x⃗)/∂x²(2) ].    (10.22)
Subtracting Eq. (10.22) from Eq. (10.21), we have

f(x(1) + h(1), x(2) + h(2)) − f(x(1) − h(1), x(2) − h(2)) = 2h(1) ∂f(x⃗)/∂x(1) + 2h(2) ∂f(x⃗)/∂x(2).

From the above, by re-arranging terms, we have:

[f(x(1) + h(1), x(2) + h(2)) − f(x(1) − h(1), x(2) − h(2))] / (2h(1))
  = ∂f(x⃗)/∂x(1) + [h(2)/h(1)] ∂f(x⃗)/∂x(2).
Let us also define the set of the history of the algorithm up to and
including the mth iteration by:

F^m = {x⃗⁰, x⃗¹, . . . , x⃗^m, S⃗⁰, S⃗¹, . . . , S⃗^m, μ⁰, μ¹, . . . , μ^m}.
From Eq. (5.4), we know that for any (i, j) pair where i ∈ K and j ∈ K,

h^m(j) / h^m(i) = H^m(j) / H^m(i).
Now, if the history of the algorithm is known, from the Bernoulli
distribution used in computing H (see algorithm description in Chap. 5
and Eq. (5.4)), it follows that for any given (i, j) pair, where i ∈ K and
j ∈ K,

E[ (h^m(j)/h^m(i)) ∂f(x⃗)/∂x(j) | F^m ] = E[ (H^m(j)/H^m(i)) ∂f(x⃗)/∂x(j) | F^m ]
  = [(0.5)(−1) + (0.5)(1)] E[ (1/H^m(i)) ∂f(x⃗)/∂x(j) | F^m ]
  = 0.
From the above equation and from Eq. (10.26), it follows that for every
i ∈ K,

E[w^m(i) | F^m] = Σ_{j=1; j≠i}^{k} 0 = 0.

Further, note that for any j ≠ i,

h(j)/h(i) = 1 or −1.    (10.27)
h(i)
The above follows from noting that A represents the sum of products
within the square of each wm (.); the above also employs the Euclidean
norm and exploits (10.27) and the fact that each of the partial deriva-
tives is bounded.
We note that in the proof above, the noise did not represent the noise
induced by simulation; rather it is the noise in the derivative due to
Spall’s formula. Remember that Spall’s formula does not compute
the exact derivative. We now show the convergence of simultaneous
perturbation.
is the solution indexed by i. This leads to the following model for the
transition probability of the underlying Markov chain:
P(i, j) = G(i, j)A(i, j)  when i ≠ j;
P(i, i) = G(i, i)A(i, i) + Σ_{j≠i} G(i, j)(1 − A(i, j)),    (10.28)

where A(i, i) = 1. Note that in the above, the expression for P(i, j)
with i ≠ j follows from the elementary product rule of probabilities,
while the same for P(i, i) follows from the fact that the algorithm
remains in the same solution either if it is generated again and
accepted, or if some other solution is generated but rejected. Note
that A(i, i) must equal 1 for the probabilities to be well-defined.
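The model in (10.28) is straightforward to assemble in code; in the sketch below, the generation matrix G and the acceptance matrix A are hypothetical inputs (each row of G summing to 1), and each row of the resulting P then sums to 1 as well.

    import numpy as np

    def sas_transition_matrix(G, A):
        """Build P per Eq. (10.28): off the diagonal, a move to j must be
        generated and accepted; the diagonal also absorbs the probability
        of generated-but-rejected moves."""
        n = G.shape[0]
        A = A.copy()
        np.fill_diagonal(A, 1.0)      # A(i, i) = 1 by convention
        P = G * A                     # P(i, j) = G(i, j)A(i, j) for j != i
        for i in range(n):
            rejected = sum(G[i, j] * (1.0 - A[i, j])
                           for j in range(n) if j != i)
            P[i, i] = G[i, i] + rejected
        return P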
In some SAS algorithms, we will have a unique Markov chain that
will define its behavior. In some algorithms, however, the Markov
chain will change with every iteration or after a few iterations. When
the Markov chain does not change with iterations, it will be called a
stationary or homogeneous Markov chain, while if it changes with iter-
ations, it will be called a non-stationary or non-homogeneous Markov
chain. These ideas will be clarified further in the context of every
algorithm that we will analyze.
Convergence metrics. To analyze the convergence of any SAS tech-
nique, we will be interested in answering the following three questions:
(1) Will the algorithm ever visit the global optimum? (2) How many
iterations will be needed, on average, for the first visit to (the first
hitting of) the global optimum? (3) Will the algorithm eventually be
absorbed into the global optimum?
The first question is related to whether the algorithm will ever reach
the global optimum and is the most fundamental of questions we ask
of any optimization algorithm. The second question revolves around
how long the first hitting time will be. The last question is regard-
ing whether the algorithm will eventually be absorbed into the global
optimum.
For some algorithms, as we will see below, it is possible to answer all
three questions. But, for any SAS technique, we would like to have at
the very least an answer to the first question. Answers to the second
question may provide information about the rate of convergence of
the algorithm. The reader should note that an affirmative response
to the first question does not imply that the algorithm will take fewer
iterations than exhaustive search. All it ensures is that eventually, we
will visit the global optimum. But there is nothing specified about
how long that might take! Hence, one must also answer the second
question and hopefully establish that the rate of convergence is faster
than that of exhaustive search. Finally, an affirmative answer to the
third question assures us that in the limit the algorithm in some sense
converges to the optimal solution.
It should be clear now that if the algorithm converges to the optimal
solution in the limit (answer to the third question), then the answer to
the first question is yes. However, producing an answer to the second
question is more important from the practical standpoint since it gives
us an idea of how long an algorithm may take before identifying an
optimal solution. Oftentimes answering this question is the hardest
task.
We also note that the number of iterations to first strike/hit the
optimal is likely to be a random variable in an SAS technique. There-
fore, one is interested in the mean and also in the variance (and pos-
sibly higher order moments) of this number. Finally, we note that in
algorithms where the “best solution thus far” is maintained, the num-
ber of iterations to first strike the optimal is sufficient to measure the
rate of convergence. But in algorithms like the nested partitions and
stochastic ruler, where one does not keep this in memory, one may have
to use other ways to measure the rate of convergence. We will explore
answers to these questions for some stochastic search algorithms now.
We begin with pure random search where our discussion follows [333].
Proof Let q(j), where q(j) > 0 for every j, denote the probability
of selecting the solution indexed by j, such that Σ_{j=1}^{|X|} q(j) = 1. Then
the transitions from one solution to another in the algorithm can be
modeled by a stationary Markov chain whose transition probabilities
are defined as: P(i, j) = q(j) for every i. Clearly then, the Markov
chain is regular, and hence ergodic. Hence, every state, including the
global optimal solution, will be visited infinitely often in the limit, and
we are done.
We now present a result (see e.g., [333]) that computes the mean
and variance of the number of iterations needed for the first strike at
the global optimum for any pure random search.
Proof The result can be found in any text on Markov chains, e.g.,
[251].
Let i∗ denote the index of the global optimum from the finite set of
solutions denoted by X .
Theorem 10.25 Let M (i) denote the number of iterations needed for
the first strike at the global optimum provided the algorithm starts at
solution i. Under Assumption 10.22, we have that the mean number of
iterations for the first strike can be computed by solving the following
system of linear equations. For i = 1, 2, . . . , |X |,
E[M(i)] = 1 + Σ_{j∈X, j≠i∗} P(i, j) E[M(j)].
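For pure random search, where P(i, j) = q(j) for every i, the linear system of Theorem 10.25 is easy to solve directly; a sketch, with a hypothetical selection distribution q:

    import numpy as np

    q = np.array([0.2, 0.3, 0.5])   # hypothetical selection probabilities
    i_star = 2                      # index of the global optimum
    n = len(q)
    P = np.tile(q, (n, 1))          # pure random search: P[i, j] = q[j]
    # Solve E[M(i)] = 1 + sum over j != i_star of P[i, j] * E[M(j)].
    Q = P.copy()
    Q[:, i_star] = 0.0              # drop the j = i_star terms from the sum
    EM = np.linalg.solve(np.eye(n) - Q, np.ones(n))
    print(EM)                       # every entry equals 1/q[i_star] = 2 here

The answer 1/q(i∗) is what one expects here: for pure random search, the number of iterations to first generate the optimum is geometrically distributed with parameter q(i∗).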
Markov chain. We will analyze the first category, where the Markov
chain is stationary, in detail and provide references for the analysis of
the second category to the interested reader.
We begin with a simple result, often called the time reversibility
result of a Markov chain, that we will need later.
which together with Eq. (10.29) implies from elementary Markov chain
theory that y is the limiting probability vector of the Markov chain.
From the above definitions and the model in (10.28), it follows that:
and then claim that πT is the limiting probability (steady-state) vector
of the Markov chain whose transition probability from state i to state
j equals PT (i, j). To prove this claim, some work is needed, which we
now present.
Thus, from the above, one can conclude that when 1 < j < i,
Case 2: i = j. This case is trivial since the left- and right-hand sides
of the above are identical.
Then since πT (i)PT (i, j) = πT (j)PT (j, i) is true for all (i, j)-pairs, from
Lemma 10.26, we conclude that the vector πT is the limiting probabil-
ity vector of the Markov chain. Now, we consider what happens as T
tends to 0.
lim_{T→0} π_T = (s, s lim_{T→0} A_T(1, 2), s lim_{T→0} A_T(1, 3), . . . , s lim_{T→0} A_T(1, |X|))
  = (s, 0, 0, . . . , 0)  (using Eq. (10.35))
  = (1, 0, 0, . . . , 0)  (since the limiting probabilities sum to 1).
From the strong law of large numbers (see Theorem 2.1), η₁ and
η₂ can be made arbitrarily small, i.e., for a given value of ε > 0, a
sufficiently large number of replications (samples) can be selected
such that with probability 1, η₁ ≤ ε and η₂ ≤ ε. By choosing
ε = −Δ/2, we have that the following will be true with probability 1:

η₁ ≤ −Δ/2 and η₂ ≤ −Δ/2.    (10.37)
To prove that our result holds for Case 1, we need to show that the
relationship in (10.36) is satisfied. Let us first consider Scenario 1.
What we show next can be shown for each of the other scenarios
in an analogous manner.
at the global optimum, x∗ , we will have that P (x∗ , x∗ ) = 1, i.e., the
Markov chain will be an absorbing one in which the global optimum is
the absorbing state, assuming we have a unique global optimum. This
implies that eventually, the system will be absorbed into the global
optimum. The reader is referred to [6] for additional analysis and
insights.
7. Concluding Remarks
Our goal in this chapter was to present a subset of the results in
the convergence theory of model-free parametric optimization tech-
niques. The goal was not very ambitious in that
we restricted our attention to results that can be proved without us-
ing very complicated mathematical arguments. The initial result on
the steepest-descent rule was presented because it is repeatedly used
throughout this book in various contexts. Overall, we presented some
preliminary analysis related to the convergence of steepest descent and
some SAS techniques. The bibliographic remarks mention several ref-
erences to additional works that cover this topic in greater depth.
Chapter 11
CONVERGENCE ANALYSIS
OF CONTROL OPTIMIZATION
METHODS
1. Chapter Overview
This chapter will discuss the proofs of optimality of a subset of
algorithms discussed in the context of control optimization. The
chapter is organized as follows. We begin in Sect. 2 with some defini-
tions and notation related to discounted and average reward Markov
decision problems (MDPs). Subsequently, we present convergence
theory related to dynamic programming (DP) for MDPs in Sects. 3
and 4. In Sect. 5, we discuss some selected topics related to semi-
MDPs (SMDPs). Thereafter, from Sect. 6, we present a selected
collection of topics related to convergence of reinforcement learning
(RL) algorithms.
For DP, we begin by establishing that the Bellman equation can
indeed be used to generate an optimal solution. Then we prove that the
classical versions of value and policy iteration can be used to generate
optimal solutions. It has already been discussed that the classical
value-function-based algorithms have Q-factor equivalents. For RL, we
first present some fundamental results from stochastic approximation.
Thereafter, we use these results to prove convergence of the algorithm.
The reader will need material from Chap. 9 and should go back to
review material from there, as and when it is required.
then be viewed as a vector that gets transformed every time the Bell-
man transformation is applied on it.
We will first define a couple of transformations related to the
discounted reward problem, and then define the corresponding trans-
formations for the average reward problem.
The transformation T , you will recognize, is the one that we use
in the value iteration algorithm for discounted reward (of course, it is
derived from the Bellman optimality equation). T is defined as:
T J(i) = max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j)J(j) ]  for all i ∈ S,    (11.1)

where T J(i) denotes the ith component of the vector T(J⃗). Although
all the terms used here have been defined in previous chapters, we
repeat the definitions for the sake of your convenience.
The symbols i and j stand for the states of the Markov chain and
are members of S—the set of states. The notation |S| denotes the
number of elements in this set.
a denotes an action and A(i) denotes the set of actions allowed in
state i.
λ stands for the discounting factor.
p(i, a, j) denotes the probability of transition (of the Markov chain)
in one step from state i to state j when action a is selected in state i.
r̄(i, a) denotes the expected immediate reward earned in a one-
step transition (of the Markov chain) when action a is selected in
state i. The term r̄(i, a) is defined as shown below:
r̄(i, a) = Σ_{j=1}^{|S|} p(i, a, j) r(i, a, j),    (11.2)
Tμ̂ J(i) = r̄(i, μ(i)) + λ Σ_{j=1}^{|S|} p(i, μ(i), j)J(j)  for all i ∈ S.    (11.3)
Lμ̂ J(i) = r̄(i, μ(i)) + Σ_{j=1}^{|S|} p(i, μ(i), j)J(j).    (11.5)
All the relevant terms have been defined in the context of T and Tμ̂ .
Some more notation and definitions are needed:
Next, we will show that the result holds for Tμ̂, i.e., for policy μ̂. The
proof is similar to the one above. Since for all i ∈ S,

Tμ̂ J(i) = r̄(i, μ(i)) + λ Σ_{j∈S} p(i, μ(i), j)J(j)
        ≤ r̄(i, μ(i)) + λ Σ_{j∈S} p(i, μ(i), j)J′(j)
        = Tμ̂ J′(i),

the result must hold for k = 1.
Now assuming that the result is true when k = m,
J(i) ≤ J′(i) for all i ∈ S will imply that Tμ̂^m J(i) ≤ Tμ̂^m J′(i).
Then for all i ∈ S,

Tμ̂^{m+1} J(i) = r̄(i, μ(i)) + λ Σ_{j∈S} p(i, μ(i), j) Tμ̂^m J(j)
             ≤ r̄(i, μ(i)) + λ Σ_{j∈S} p(i, μ(i), j) Tμ̂^m J′(j)
xs denotes the state from which the sth jump of the Markov chain
of the policy occurs, λk equals λ raised to the kth power, and Eμ̂ ,
the expectation over the trajectory of states produced by policy μ̂, is
defined as follows:
Eμ̂ [ λ^k h(x_{k+1}) + Σ_{s=1}^{k} λ^{s−1} r(x_s, μ(x_s), x_{s+1}) | x₁ = i ]
≡ Σ_{x₂∈S} p(x₁, μ(x₁), x₂)[r(x₁, μ(x₁), x₂)] +
Σ_{x₂∈S} p(x₁, μ(x₁), x₂) × Σ_{x₃∈S} p(x₂, μ(x₂), x₃)[λ r(x₂, μ(x₂), x₃)] + · · · +
Σ_{x₂∈S} p(x₁, μ(x₁), x₂) × Σ_{x₃∈S} p(x₂, μ(x₂), x₃) × · · · × Σ_{x_{k+1}∈S} p(x_k, μ(x_k), x_{k+1}) ×
[ λ^{k−1} r(x_k, μ(x_k), x_{k+1}) + λ^k h(x_{k+1}) ].
(11.6)
We now present a simple proof for this, but the reader can skip it
without loss of continuity.
Proof We will use induction on k. From the definition of Tμ̂ in
Eq. (11.3), for all i ∈ S,

Tμ̂ h(i) = Σ_{j∈S} p(i, μ(i), j) [r(i, μ(i), j) + λh(j)]
        = Eμ̂ [ λh(x₂) + Σ_{s=1}^{1} λ^{s−1} r(x_s, μ(x_s), x_{s+1}) | x₁ = i ], when j = x₂,
and thus the result holds for k = 1. We must now show that the result
holds for k = m + 1. Assuming the result to hold for k = m, we have:
For all i ∈ S,  Tμ̂^m h(i) = Eμ̂ [ λ^m h(x_{m+1}) + Σ_{s=1}^{m} λ^{s−1} r(x_s, μ(x_s), x_{s+1}) | x₁ = i ],
where (11.8) follows from Eq. (11.3), (11.9) follows from Eq. (11.7), and
(11.10) follows from the definition of Eμ̂ in (11.6). Then, for x1 = i,
we have that for all i ∈ S:
Tμ̂^{m+1} h(i) = Eμ̂ [ λ^{m+1} h(x_{m+2}) + Σ_{s=1}^{m+1} λ^{s−1} r(x_s, μ(x_s), x_{s+1}) | x₁ = i ].
3. The state space and the action space in every problem (MDP or
SMDP) are finite. Further, the Markov chain of any policy in the
MDP/SMDP is regular (regularity is discussed in Chap. 6).
Definition 11.2 The optimal value function vector J⃗∗ for the discounted
MDP is defined, for all i ∈ S, as J∗(i) = max_{μ̂} Jμ̂(i), where the
maximum is taken over all admissible policies.
In other words, a(i) denotes an action in the ith state that will maxi-
mize the quantity in the square brackets above. This implies that:

T J(i) = r̄(i, a(i)) + λ Σ_{j∈S} p(i, a(i), j)J(j)  for every i ∈ S₁.
Similarly, let b(i) ∈ arg max_{u∈A(i)} [ r̄(i, u) + λ Σ_{j∈S} p(i, u, j)J′(j) ].
Since action b(i) maximizes the quantity in the square brackets above,
for every i ∈ S₁,

T J′(i) ≥ r̄(i, a(i)) + λ Σ_{j∈S} p(i, a(i), j)J′(j).
Then, for every i ∈ S₁,

−T J′(i) ≤ −[ r̄(i, a(i)) + λ Σ_{j∈S} p(i, a(i), j)J′(j) ].

Combining this with the definition of T J(i) and the fact that T J(i) ≥
T J′(i) for all i ∈ S₁, we have that for every i ∈ S₁:

0 ≤ T J(i) − T J′(i)
  ≤ [ r̄(i, a(i)) + λ Σ_{j∈S} p(i, a(i), j)J(j) ] − [ r̄(i, a(i)) + λ Σ_{j∈S} p(i, a(i), j)J′(j) ]
  = λ Σ_{j∈S} p(i, a(i), j)[J(j) − J′(j)]
  ≤ λ Σ_{j∈S} p(i, a(i), j) max_j |J(j) − J′(j)|
  = λ max_j |J(j) − J′(j)| ( Σ_{j∈S} p(i, a(i), j) )
Case 2: Using logic similar to that used above, one can show that:
This follows from the fact that the LHS of each of the inequalities
(11.13) and (11.14) has to be positive. Hence, when the absolute value
of the LHS is selected, both (11.13) and (11.14) will imply (11.15).
Since inequality (11.15) holds for any i ∈ S, it also holds for the
value of i ∈ S that maximizes |T J(i) − T J′(i)|. Therefore:

max_i |T J(i) − T J′(i)| ≤ λ||J⃗ − J⃗′||;  i.e., ||T J⃗ − T J⃗′|| ≤ λ||J⃗ − J⃗′||.
The next result shows that the property also holds for the mapping
associated with a given policy.
Proof The proof is very similar to that above. Let state space S =
S1 ∪ S2 where S1 and S2 will be defined below.
Thus, for all i ∈ S₁: Tμ̂ J(i) − Tμ̂ J′(i) ≤ λ||J⃗ − J⃗′||.
Case 2 can be argued as in the proof above, and the result follows
using very similar arguments.
We will now present two key results, Propositions 11.5 and 11.6, that
will help in analyzing the Bellman optimality and the Bellman policy
equation, respectively.
Proposition 11.5 (i) For any bounded function h : S → ℜ, the
following limit exists:

lim_{k→∞} T^k h(i)  for all i ∈ S.
we have from the above that

|A| ≤ Mλ^P / (1 − λ),

which implies that:

−Mλ^P/(1 − λ) ≤ A ≤ Mλ^P/(1 − λ).

Multiplying the above inequations by −1, we have that

Mλ^P/(1 − λ) ≥ −A ≥ −Mλ^P/(1 − λ).

Adding Jμ̂(i) to each side, we have:

Jμ̂(i) + Mλ^P/(1 − λ) ≥ Jμ̂(i) − A ≥ Jμ̂(i) − Mλ^P/(1 − λ).

Using (11.16), the above can be written as:

Jμ̂(i) + Mλ^P/(1 − λ) ≥ Eμ̂ [ Σ_{s=1}^{P} λ^{s−1} r(x_s, μ(x_s), x_{s+1}) | x₁ = i ]
  ≥ Jμ̂(i) − Mλ^P/(1 − λ).    (11.17)
Now, for any bounded function h(.), we have that max_{j∈S} |h(j)| ≥
h(j) for any j ∈ S. Hence, it follows that

−max_{j∈S} |h(j)| ≤ Eμ̂ [h(x_{P+1}) | x₁ = i] ≤ max_{j∈S} |h(j)|,

where Eμ̂ [. | x₁ = i] is used as defined in Eq. (11.6). Then, since λ > 0,
one has that:

max_{j∈S} |h(j)| λ^P ≥ Eμ̂ [ λ^P h(x_{P+1}) | x₁ = i ] ≥ −max_{j∈S} |h(j)| λ^P.    (11.18)
Adding (11.17) and (11.18), we obtain

Jμ̂(i) + Mλ^P/(1 − λ) + max_{j∈S} |h(j)| λ^P
  ≥ Eμ̂ [ Σ_{s=1}^{P} λ^{s−1} r(x_s, μ(x_s), x_{s+1}) + λ^P h(x_{P+1}) | x₁ = i ]
  ≥ Jμ̂(i) − Mλ^P/(1 − λ) − max_{j∈S} |h(j)| λ^P.

Using Lemma 11.1, the above becomes:

Jμ̂(i) + Mλ^P/(1 − λ) + max_{j∈S} |h(j)| λ^P ≥ Tμ̂^P h(i) ≥ Jμ̂(i) − Mλ^P/(1 − λ) − max_{j∈S} |h(j)| λ^P.
(11.19)
We take the limit as P → ∞; since lim_{P→∞} λ^P = 0, we then have
The result implies that associated with any policy μ̂, there is a value
function vector denoted by Jμ̂ that can be obtained by applying the
transformation Tμ̂ on any given vector infinitely many times. Also
note the following:
The ith element of this vector denotes the expected total discounted
reward earned over an infinite time horizon, if one starts at state i
and follows policy μ̂.
In contrast, the ith element of the vector J∗ denotes the expected
total discounted reward earned over an infinite time horizon, if one
starts at state i and follows the optimal policy.
We now formally present the optimality of the Bellman equation.
Proposition 11.7 Consider the system of equations defined by:

h(i) = max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j∈S} p(i, a, j)h(j) ]  for all i ∈ S.    (11.21)
The optimal value function vector J⃗∗, which is defined in Definition 11.2,
is in fact a solution of the Bellman equation given above. Furthermore,
J⃗∗ is the unique solution.
Thus, for every i, v^k(i) ≤ Tμ̂_{k+1} v^k(i). Then for every i, using the
monotonicity result from Sect. 2.2, it follows that

v^k(i) ≤ Tμ̂_{k+1}^P v^k(i)  for any P.

Since the above is also true when P → ∞, one has that for all i,

v^k(i) ≤ lim_{P→∞} Tμ̂_{k+1}^P v^k(i).

From Proposition 11.6, we know that lim_{P→∞} Tμ̂_{k+1}^P v^k(i) exists and
equals Jμ̂_{k+1}(i), where Jμ̂_{k+1}(.) is the value function of policy μ̂_{k+1}.
Hence

v^k(i) ≤ Jμ̂_{k+1}(i)  for all i.    (11.25)
Now, by Proposition 11.8, Jμ̂_{k+1}(i) satisfies the Bellman policy
equation for μ̂_{k+1}.
From Eqs. (11.26) and (11.27), it is clear that both vectors v⃗^{k+1} and
J⃗μ̂_{k+1} satisfy the equation

h⃗ = Tμ̂_{k+1} h⃗.

Since this equation has a unique solution (Proposition 11.8), it follows
that

v⃗^{k+1} = J⃗μ̂_{k+1}.
This means that in each iteration (k) the value of the vector (v k )
either increases or does not change. This iterative process cannot go
on infinitely as the total number of policies is finite (since the number
of states and the number of actions are finite). In other words, the
process must terminate in a finite number of steps. When the policy
repeats, i.e., when μk (i) = μk+1 (i) for all i, it is the case that one has
obtained the optimal solution. Here is why:
for all i. The first equality sign follows from (11.23) and the last
equality sign follows from (11.24). Thus: v k (i) = T v k (i) for all i.
In other words, the Bellman optimality equation has been solved.
By Proposition 11.7, we have that μ̂k is the optimal policy.
Proof J_d̂ denotes the value function vector associated with the policy d̂.
Note that this vector is never actually calculated in the value iteration
algorithm; it is the reward vector associated with the solution policy
returned by the algorithm. From Proposition 11.6, we know that
lim_{P→∞} T_d̂^P h converges for any h ∈ ℜ^{|S|} and that the limit is J_d̂.
From Proposition 11.8, we know that

T_d̂ J_d̂ = J_d̂.    (11.29)
Now from Step 4, from the way d̂ is selected, it follows that for any
vector h,

T_d̂ h = T h.    (11.30)

We will use this fact below. It follows from the properties of norms
(see Sect. 3 of Chap. 9) that:

||J_d̂ − J∗|| ≤ ||J_d̂ − J^{k+1}|| + ||J^{k+1} − J∗||.    (11.31)

Now,

||J_d̂ − J^{k+1}|| = ||T_d̂ J_d̂ − T J^k||   (using (11.29) and J^{k+1} = T J^k)
  = ||T_d̂ J_d̂ − T_d̂ J^k||   (using (11.30))
  ≤ λ||J_d̂ − J^k|| ≤ λ||J_d̂ − J^{k+1}|| + λ||J^{k+1} − J^k||.

Rearranging terms, ||J_d̂ − J^{k+1}|| ≤ [λ/(1 − λ)] ||J^{k+1} − J^k||.    (11.32)
Similarly,

||J^{k+1} − J∗|| ≤ ||J^{k+1} − T J^{k+1}|| + ||T J^{k+1} − J∗||   (a standard norm property)
  = ||T J^k − T J^{k+1}|| + ||T J^{k+1} − J∗||
  ≤ λ||J^k − J^{k+1}|| + ||T J^{k+1} − J∗||
  = λ||J^k − J^{k+1}|| + ||T J^{k+1} − T J∗||   (since T J∗ = J∗)
  ≤ λ||J^k − J^{k+1}|| + λ||J^{k+1} − J∗||.

Rearranging terms, ||J^{k+1} − J∗|| ≤ [λ/(1 − λ)] ||J^{k+1} − J^k||.    (11.33)
Using (11.32) and (11.33) in (11.31), one has that ||Jdˆ − J∗ || ≤
λ
2 1−λ ||J k+1 − J k || < ; the last inequality in the above stems from
Step 3 of the algorithm. Thus: ||Jdˆ − J∗ || < .
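The bound just derived translates directly into a stopping rule. Below is a minimal Python sketch of value iteration that stops when $\|\vec J^{k+1} - \vec J^k\|_\infty < \epsilon(1-\lambda)/(2\lambda)$, which by the argument above guarantees an $\epsilon$-optimal greedy policy; the array layout mirrors the earlier sketch and is again an assumption for illustration.

```python
import numpy as np

def value_iteration(P, r, lam, epsilon, n_states, n_actions):
    """Value iteration with the epsilon-termination rule analyzed above:
    stopping when ||J^{k+1} - J^k|| < epsilon*(1-lam)/(2*lam) guarantees
    ||J_dhat - J*|| < epsilon for the returned greedy policy dhat."""
    J = np.zeros(n_states)
    tol = epsilon * (1 - lam) / (2 * lam)
    while True:
        # One application of the Bellman optimality operator T
        Q = np.array([[P[a][i] @ (r[a][i] + lam * J) for a in range(n_actions)]
                      for i in range(n_states)])
        J_next = Q.max(axis=1)
        if np.max(np.abs(J_next - J)) < tol:   # Step 3 of the algorithm
            d = Q.argmax(axis=1)               # Step 4: greedy policy dhat
            return d, J_next
        J = J_next
```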
When the Markov chain of the policy $\hat\mu$ is regular, the average reward is the same for every starting state, and then, we have that:
$$\rho_{\hat\mu} \equiv \rho_{\hat\mu}(i) \quad \text{for any } i \in S.$$
It is thus important to recognize that when the Markov chain of the
policy is regular, we have a unique value for its average reward that
does not depend on the starting state. The maximum attainable value
for the average reward over all admissible policies is the optimal aver-
age reward, denoted by ρ∗ . We now turn our attention to the Bellman
equation.
The Bellman equations for average reward were used without proof
in previous chapters. Our first result below is a key result related to
the Bellman equations: both the optimality and the policy versions.
Our analysis will be under some conditions that we describe later. The
first part of the result will prove that if a solution exists to the Bellman
policy equation for average reward, then the scalar ρ in the equation
will equal the average reward of the policy in question. The second part
will show that if a solution exists to the Bellman optimality equation,
then the unknown scalar in the equation will equal the average reward
of the optimal policy and that the optimal policy will be identifiable
from the solution of the equation.
Note that we will assume that a solution exists to the Bellman
equation. That solutions exist to these equations can be proved, but it
is beyond our scope here. The interested reader is referred to [30, 242]
for proofs.
Proposition 11.11 (i) If a scalar $\rho$ and a $|S|$-dimensional vector $\vec h_{\hat\mu}$, where $|S|$ denotes the number of elements in the set of states in the Markov chain, $S$, satisfy
$$\rho + h_{\hat\mu}(i) = \bar r(i,\mu(i)) + \sum_{j=1}^{|S|} p(i,\mu(i),j)\,h_{\hat\mu}(j), \quad i = 1, 2, \ldots, |S|, \tag{11.34}$$
then $\rho$ is the average reward associated with the policy $\hat\mu$ defined in Definition 11.3.
(ii) Assume that one of the stationary, deterministic policies is optimal. If a scalar $\rho^*$ and a $|S|$-dimensional vector $\vec J^*$ satisfy
$$\rho^* + J^*(i) = \max_{u\in A(i)}\left[\bar r(i,u) + \sum_{j=1}^{|S|} p(i,u,j)\,J^*(j)\right], \quad i = 1, 2, \ldots, |S|, \tag{11.35}$$
then $\rho^*$ equals the average reward of the optimal policy, and any policy that attains the max in the RHS of the above is an optimal policy.
for $i = 1, 2, \ldots, |S|$. The above can be written in the vector form as:
$$L_{\hat\mu}^k \vec h_{\hat\mu} = k\rho\,\vec e + \vec h_{\hat\mu}. \tag{11.38}$$
We will use an induction argument for the proof. From Eq. (11.36), the above is true when $k = 1$. Let us assume that the above is true when $k = m$. Then, we have that
$$L_{\hat\mu}^m \vec h_{\hat\mu} = m\rho\,\vec e + \vec h_{\hat\mu}.$$
where $x_s$ is the state from where the $s$th jump of the Markov chain occurs and $A$ is a finite scalar. Using the above and Eq. (11.37), we have that:
$$E_{\hat\mu}\left[A + \sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \,\Big|\, x_1 = i\right] = k\rho + h_{\hat\mu}(i).$$
Therefore,
$$\frac{E_{\hat\mu}[A]}{k} + \frac{E_{\hat\mu}\left[\sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \mid x_1 = i\right]}{k} = \rho + \frac{h_{\hat\mu}(i)}{k}.$$
Taking limits as $k \to \infty$, we have:
$$\lim_{k\to\infty} \frac{E_{\hat\mu}\left[\sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \mid x_1 = i\right]}{k} = \rho.$$
(The above follows from the fact that $\lim_{k\to\infty} X/k = 0$ if $X \in \mathbb{R}$ is finite.) In words, this means, from the definition of average reward, that the average reward of the policy $\hat\mu$ is indeed $\rho$, and the first part of the proposition is thus established.
Now for the second part. Using the first part of the proposition,
one can show that a policy, let us call it μ̂∗ , which attains the max in
the RHS of Eq. (11.35), produces an average reward of ρ∗ .
We will now show that any stationary and deterministic policy that
deviates from μ̂∗ will produce an average reward lower than or equal
to ρ∗ . This will establish, under our assumption that one of the sta-
tionary deterministic policies has to be optimal, that the policy μ̂∗
will generate the maximum possible value for the average reward and
will therefore be an optimal policy. Thus, all we need to show is that
a policy, μ̂, which does not necessarily attain the max in Eq. (11.35),
produces an average reward less than or equal to ρ∗ .
Equation (11.35) can be written in vector form as:
$$\rho^*\vec e + \vec J^* = L(\vec J^*). \tag{11.39}$$
This proves that Eq. (11.40) holds for $k = 1$. Assuming that it holds when $k = m$, we have that:
$$L_{\hat\mu}^m \vec J^* \le m\rho^*\vec e + \vec J^*.$$
Using the fact that $L_{\hat\mu}$ is monotonic, from the results presented in Sect. 2.2, it follows that
$$\begin{aligned}
L_{\hat\mu} L_{\hat\mu}^m \vec J^* &\le L_{\hat\mu}\left(m\rho^*\vec e + \vec J^*\right) \qquad (11.42)\\
&= m\rho^*\vec e + L_{\hat\mu}\vec J^*\\
&\le m\rho^*\vec e + \rho^*\vec e + \vec J^* \quad \text{(using Eq. (11.41))}\\
&= (m+1)\rho^*\vec e + \vec J^*.
\end{aligned}$$
This establishes Eq. (11.40). The following bears similarity to the proof of the first part of this proposition. Using Lemma 11.2, we have for all $i$:
$$L_{\hat\mu}^k J^*(i) = E_{\hat\mu}\left[A + \sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \,\Big|\, x_1 = i\right],$$
where $x_s$ is the state from which the $s$th jump of the Markov chain occurs and $A$ is a finite scalar. Using this and Eq. (11.40), we have that:
$$E_{\hat\mu}\left[A + \sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \,\Big|\, x_1 = i\right] \le k\rho^* + J^*(i).$$
Therefore,
$$\frac{E_{\hat\mu}[A]}{k} + \frac{E_{\hat\mu}\left[\sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \mid x_1 = i\right]}{k} \le \rho^* + \frac{J^*(i)}{k}.$$
Taking the limits with $k \to \infty$, we have:
$$\lim_{k\to\infty} \frac{E_{\hat\mu}\left[\sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \mid x_1 = i\right]}{k} \le \rho^*.$$
(As before, the above follows from limk→∞ X/k = 0 for finite X.)
In words, this means that the average reward of the policy μ̂ is less
than or equal to ρ∗ . This implies that the policy that attains the max
in the RHS of Eq. (11.35) is indeed the optimal policy since no other
policy can beat it.
Proof From Theorem 6.2 on page 133 ($\vec\Pi P = \vec\Pi$), one can write that for all $j \in S$:
$$\sum_{i\in S}\Pi_{\hat\mu}(i)\,p(i,\mu(i),j) = \Pi_{\hat\mu}(j),$$
which can be written as:
$$\sum_{i\in S}\Pi_{\hat\mu}(i)\,p(i,\mu(i),j) - \Pi_{\hat\mu}(j) = 0.$$
Hence, for all $j \in S$:
$$\sum_{i\in S}\Pi_{\hat\mu}(i)\,p(i,\mu(i),j)\,h(j) - \Pi_{\hat\mu}(j)\,h(j) = 0.$$
Then summing the LHS of the above equation over all $j$, one obtains:
$$\sum_{j\in S}\left[\sum_{i\in S}\Pi_{\hat\mu}(i)\,p(i,\mu(i),j)\,h(j) - \Pi_{\hat\mu}(j)\,h(j)\right] = 0. \tag{11.43}$$
This establishes Lemma 11.12. If the last step is not clear, see the
Remark below.
Remark. The last step in the lemma above can be verified for a
two-state Markov chain. From ΠP = Π, if P denotes the transition
probability matrix for policy μ̂, we have that:
Π(1)P (1, 1) + Π(2)P (2, 1) = Π(1);
Π(1)P (1, 2) + Π(2)P (2, 2) = Π(2).
(Note that in the above, $P(i,j) \equiv p(i,\mu(i),j)$.) Multiplying both sides of the first equation by $h(1)$ and those of the second by $h(2)$, and then adding the resulting equations, one has that:
$$\Pi(1)P(1,1)h(1) + \Pi(2)P(2,1)h(1) + \Pi(1)P(1,2)h(2) + \Pi(2)P(2,2)h(2) = \Pi(1)h(1) + \Pi(2)h(2).$$
This can be written, after rearranging the terms, as:
$$\Pi(1)\sum_{j=1}^{2}P(1,j)h(j) + \Pi(2)\sum_{j=1}^{2}P(2,j)h(j) - \Pi(1)h(1) - \Pi(2)h(2) = 0,$$
which can be written as:
$$\sum_{i=1}^{2}\Pi(i)\sum_{j=1}^{2}p(i,\mu(i),j)h(j) - \sum_{i=1}^{2}\Pi(i)h(i) = 0.$$
This should explain the last step of the proof of Lemma 11.12.
The reader should now review the steps in policy iteration on page 152. We now present Lemma 11.13, which will be used directly in our main result that proves the convergence of policy iteration.
Since $\Pi(i) \ge 0$ for any policy and any $i$, we can write from the above that:
$$\sum_{i\in S}\Pi_{\hat\mu_{k+1}}(i)\left[\bar r(i,\mu_{k+1}(i)) + \sum_{j\in S} p(i,\mu_{k+1}(i),j)\,h_k(j) - \rho_k - h_k(i)\right] \ge 0.$$
Step 3: If
$$sp(\vec J^{k+1} - \vec J^k) < \epsilon,$$
go to Step 4. Otherwise increase $k$ by 1 and go back to Step 2.
Step 4: For each $i \in S$ choose
$$d(i) \in \arg\max_{a\in A(i)}\left[\bar r(i,a) + \sum_{j\in S} p(i,a,j)\,J^k(j)\right] \tag{11.47}$$
and stop. The $\epsilon$-optimal policy is $\hat d$.
Then for each $i \in S$, set $v^{k+1}(i) = W^{k+1}(i) - W^{k+1}(i^*)$.
Step 3: If $sp(\vec v^{k+1} - \vec v^k) < \epsilon$, go to Step 4; else increase $k$ by 1 and return to Step 2.
Step 4: For each $i \in S$ choose
$$d(i) \in \arg\max_{a\in A(i)}\left[\bar r(i,a) + \sum_{j\in S} p(i,a,j)\,v^k(j)\right].$$
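The steps above translate into a short program. The following is a minimal Python sketch of relative value iteration with the span-seminorm stopping rule; the array layout (P[a][i][j], r[a][i][j]) and the distinguished state index are our own illustrative assumptions.

```python
import numpy as np

def relative_value_iteration(P, r, epsilon, n_states, n_actions, i_star=0):
    """Relative value iteration for average reward, stopping when
    sp(v^{k+1} - v^k) < epsilon (span seminorm)."""
    v = np.zeros(n_states)
    while True:
        # Apply the average-reward Bellman operator, then subtract the
        # value of a distinguished state i_star to keep iterates bounded.
        Q = np.array([[P[a][i] @ (r[a][i] + v) for a in range(n_actions)]
                      for i in range(n_states)])
        W = Q.max(axis=1)
        v_next = W - W[i_star]
        span = (v_next - v).max() - (v_next - v).min()   # span seminorm
        if span < epsilon:
            return Q.argmax(axis=1), v_next
        v = v_next
```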
where $R(i,j)$ denotes the element of the $i$th row and the $j$th column in $R$. Further, we define
$$B_R(i,j) = \sum_{l=1}^{C} b_R(i,j,l) \quad \text{for every } (i,j) \in W \times W.$$
From the above, one can summarize that for any $(i,j) \in C \times C$:
$$\max_{i\in C} R(i) - \min_{i\in C} R(i) \le R(i) - R(j).$$
We will now present the second of the two critical lemmas. But before that, we provide some necessary notation and definitions. If transformation $L$ is applied $m$ times on a vector $\vec z^k \in \mathbb{R}^{|S|}$, the resulting vector is denoted by $L^m \vec z^k$.
This implies that for every $i \in S$ and any vector $\vec x^k \in \mathbb{R}^{|S|}$, using the definition of $L$ in Eq. (11.4),
$$L\vec x^k(i) \equiv L_{d_{x^k}}\vec x^k(i). \tag{11.54}$$
Lemma 11.17 Let $M$ be any positive, finite integer. Consider any two vectors $\vec x^1$ and $\vec y^1$ that have $|S|$ components. Using $P_\mu$ to denote the transition probability matrix associated with deterministic policy $\hat\mu$, we define two matrices:
$$A^M_x \equiv P_{d_{x^M}} P_{d_{x^{M-1}}} \cdots P_{d_{x^1}} \quad \text{and} \quad A^M_y \equiv P_{d_{y^M}} P_{d_{y^{M-1}}} \cdots P_{d_{y^1}}.$$
Then, $sp(L^M\vec y^1 - L^M\vec x^1) \le \alpha_A\, sp(\vec y^1 - \vec x^1)$, where the matrix $A$ is constructed from $A^M_y$ and $A^M_x$.
Then
$$\begin{aligned}
sp(L^M\vec y^1 - L^M\vec x^1) &= \left\{L^M y^1(s^*) - L^M x^1(s^*)\right\} - \left\{L^M y^1(s_*) - L^M x^1(s_*)\right\}\\
&\le A^M_y(\vec y^1 - \vec x^1)(s^*) - A^M_x(\vec y^1 - \vec x^1)(s_*) \quad \text{(from (11.57) and (11.58))}
\end{aligned}$$
(b) Let $\vec v^k$ denote the iterate vector in the $k$th iteration of relative value iteration. We will first show that:
$$\vec v^k = \vec J^k - \sum_{l=1}^{k}\zeta^l\,\vec e. \tag{11.60}$$
Assuming that (11.60) holds for $k = m$, we have
$$\begin{aligned}
v^{m+1}(i) &= \max_{a\in A(i)}\left(\sum_{j\in S} p(i,a,j)\left[r(i,a,j) + v^m(j)\right]\right) - \zeta^{m+1} \quad \text{(from (11.61))}\\
&= \max_{a\in A(i)}\left(\sum_{j\in S} p(i,a,j)\left[r(i,a,j) + J^m(j)\right]\right) - \sum_{l=1}^{m+1}\zeta^l \quad \text{(using (11.60) for } k = m\text{)}\\
&= J^{m+1}(i) - \sum_{l=1}^{m+1}\zeta^l \quad \text{(from Step 2 of value iteration),}
\end{aligned}$$
Thus, since the difference vectors have the same span in both algorithms and both algorithms choose the same sequence of maximizing actions, we have that both algorithms terminate at the same policy for a given value of $\epsilon$. Both algorithms will therefore terminate with $\epsilon$-optimal solutions when using the span criterion. But it has been seen in practice that relative value iteration also keeps the iterates bounded, which indicates that relative value iteration will not only terminate with an $\epsilon$-optimal solution but will also be numerically stable.
5. DP: SMDPs
Discounted reward SMDPs were only briefly covered in the context
of dynamic programming under the general assumption. Hence, we
will not cover discounted reward SMDPs in this section; we will discuss
them in the context of RL algorithms later. In this section, we focus
on average reward SMDPs. We first define average reward of a policy
in an SMDP. Thereafter, we present the main result related to the
Bellman equations.
Definition 11.5 The average reward of a policy $\hat\mu$ in an SMDP is defined as:
$$\rho_{\hat\mu}(i) \equiv \liminf_{k\to\infty} \frac{E_{\hat\mu}\left[\sum_{s=1}^{k} r(x_s,\mu(x_s),x_{s+1}) \mid x_1 = i\right]}{E_{\hat\mu}\left[\sum_{s=1}^{k} t(x_s,\mu(x_s),x_{s+1}) \mid x_1 = i\right]},$$
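This ratio definition is easy to estimate by simulation. The following Python sketch accumulates reward and sojourn time along one long trajectory; the single-transition simulator `sim_step` is a hypothetical interface assumed for illustration.

```python
import numpy as np

def estimate_smdp_average_reward(sim_step, mu, i0, n_jumps=100_000, seed=0):
    """Monte Carlo estimate of Definition 11.5: cumulative reward divided
    by cumulative sojourn time under policy mu, starting in state i0.
    `sim_step(state, action, rng) -> (next_state, reward, time)` is a
    hypothetical one-transition simulator of the SMDP."""
    rng = np.random.default_rng(seed)
    total_r, total_t, state = 0.0, 0.0, i0
    for _ in range(n_jumps):
        state, r, t = sim_step(state, mu[state], rng)
        total_r += r
        total_t += t
    return total_r / total_t   # approaches rho_mu for a regular chain
```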
where $z_1$ and $z_2$ are scalar constants and $||\cdot||$ could be any norm.
Condition 3. Standard step-size conditions of stochastic approximation (Assumption 9.13 of Chap. 9): the step size $\alpha^k(l)$ satisfies the following conditions for every $l = 1, 2, \ldots, N$:
$$\sum_{k=1}^{\infty}\alpha^k(l) = \infty; \qquad \sum_{k=1}^{\infty}\left(\alpha^k(l)\right)^2 < \infty.$$
Then, with probability 1, the following limit must exist for all $(l, l')$ pairs, where each of $l$ and $l'$ assumes values from $\{1, 2, \ldots, N\}$, and any $z > 0$:
$$\lim_{k\to\infty} \frac{\sum_{m=V^k(l)}^{K^k(z,l)} \alpha^m(l)}{\sum_{m=V^k(l')}^{K^k(z,l')} \alpha^m(l')}.$$
Then, the sequence $\{\vec X^k\}_{k=1}^{\infty}$ converges to $\vec x^*$ with probability 1.
As long as each state-action pair is visited infinitely often in the limit, the EDU condition is satisfied. Usually, a judicious choice of exploration ensures that each state-action pair is tried infinitely often. Thus, although these conditions appear formidable to prove, unless one selects non-standard step sizes or non-standard exploration, they are automatically satisfied.
Conditions 1 and 2 are usually straightforward to show in RL. Con-
dition 1 is usually easy to show for the Q-factor version of the Bellman
equation. Condition 2 is usually satisfied for most standard RL al-
gorithms based on the notion of one-step updates. Condition 3 is
satisfied by our standard step sizes, but note that the condition is de-
fined in terms of a separate step size for each iterate (l) unlike in the
synchronous case. Now, Condition 3 can be easily met with a sep-
arate step size for each iterate, but this would significantly increase
our storage burden. Fortunately, it can be shown (see e.g., Chap. 7
in [48], p. 80) that a single step size for all iterates which is updated
after every iteration (i.e., whenever k is incremented) also satisfies this
condition, although it makes the resulting step size random. That it
becomes random poses no problems in our analysis. This leads to the
following:
Important note: When we use Theorem 11.21 in analyzing specific RL algorithms, we will drop $l$ from our notation for the step size and denote the step size by $\alpha^k$, since we will be assuming that a single step size is used for all values of $l$.
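To illustrate Condition 3 with a single shared step size, here is one common tapering rule in Python; the particular constants are illustrative choices, not prescriptions from the theory.

```python
def step_size(k, A=150.0, B=300.0):
    """Tapering rule alpha^k = A / (B + k): it satisfies
    sum_k alpha^k = infinity (terms decay like 1/k) and
    sum_k (alpha^k)^2 < infinity (squares decay like 1/k^2)."""
    return A / (B + k)

# A single step size shared by all iterates: update alpha whenever k
# increments, regardless of which Q-factor (l) was updated at that step.
```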
It is necessary to point out that showing Condition 4 (asymptotic stability of the ODE's equilibrium) and Condition 5 (boundedness of iterates) usually requires additional work, and these conditions should never be assumed to hold automatically. In fact, much of the analysis of an RL algorithm's convergence may hinge on establishing these two conditions for the algorithm concerned.
The proof of the above is beyond the scope of this text, and the
interested reader is referred to [48, p. 127]. The above is a very useful
condition that will be used if F (.) is contractive. When the con-
traction is not present, showing this condition can require significant
additional mathematics.
Theorem 11.23 Assume that Conditions 1–3 hold for the stochastic approximation scheme concerned. Consider the scaled function $F_c(\cdot)$ derived from $F(\cdot)$ for $c > 0$. Now assume that the limit $F_\infty(\cdot)$ of the scaled function exists. Further assume that the origin in $\mathbb{R}^N$ is the globally asymptotically stable equilibrium for the ODE:
$$\frac{d\vec x}{dt} = F_\infty(\vec x). \tag{11.65}$$
Then, with probability 1, the iterates $\vec X^k$ remain bounded.
The above result turns out to be very useful in showing the bound-
edness of iterates in many RL algorithms. Again, its proof is beyond
our scope here; the interested reader is referred to [49, Theorem 2.1]
or [48, Theorem 7; Chap. 3].
An interesting fact related to the above is that if (i) Condition 4
holds and (ii) the scaled limit F∞ (.) exists, then oftentimes in rein-
forcement learning, the ODE in Eq. (11.65) does indeed have the origin
as the globally asymptotically stable equilibrium, i.e., Condition 5 also
holds. However, all of this needs to be carefully verified separately for
every algorithm.
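To see the scaling device numerically, note that the scaled function used in this section is $F_c(\vec x) = F(c\vec x)/c$ (Definition 11.6). The Python sketch below computes it for the Q-Learning operator on a toy model; the array shapes and indexing are our own assumptions for illustration.

```python
import numpy as np

# Scaling trick used with Theorem 11.23: F_c(x) = F(c x) / c.
# For the Q-Learning operator F(Q)(i,a) = sum_j P[i,a,j]*(r[i,a,j]
# + lam*max_b Q[j,b]) - Q[i,a], the rewards vanish as c grows,
# leaving the "reward-free" operator F_infinity.

def F(Q, P, r, lam):
    # Q, rbar indexed as [i, a]; P, r as [i, a, j]
    maxQ = Q.max(axis=1)                                  # max_b Q(j, b)
    return lam * np.einsum('iaj,j->ia', P, maxQ) \
        + np.einsum('iaj,iaj->ia', P, r) - Q

def F_scaled(Q, P, r, lam, c):
    return F(c * Q, P, r, lam) / c                        # F_c(Q)

# As c -> infinity, F_scaled approaches F with all rewards set to 0;
# the ODE dq/dt = F_inf(q) then has the origin as its equilibrium.
```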
In a two-time-scale scheme, one class of iterates is said to belong to the slower time scale, while the other is said to belong to the faster time scale.
The framework described above is called the two-time-scale frame-
work. It has been popular in electrical engineering for many years.
However, it was a result from Borkar [45] that established conditions
for convergence and provided it with a solid footing. This framework
is useful for showing convergence of R-SMART. In what follows, we
will present the framework more formally.
Let $\vec X^k$ denote the vector (class) of iterates on the faster time scale and $\vec Y^k$ that on the slower time scale. We will assume that we have $N_1$ iterates on the faster time scale and $N_2$ iterates on the slower time scale. Further, we assume that the underlying random process in the system generates within the simulator two trajectories, where $\Theta_1^k$ denotes the iterate from the faster time scale updated in the $k$th iteration while $\Theta_2^k$ denotes the iterate from the slower time scale updated in the $k$th iteration. Thus for $k = 1, 2, \ldots$:
$$X^{k+1}(\Theta_1^k) = X^k(\Theta_1^k) + \alpha^k(\Theta_1^k)\left[F(\vec X^k, \vec Y^k)(\Theta_1^k) + w_1^k(\Theta_1^k)\right];$$
$$Y^{k+1}(\Theta_2^k) = Y^k(\Theta_2^k) + \beta^k(\Theta_2^k)\left[G(\vec X^k, \vec Y^k)(\Theta_2^k) + w_2^k(\Theta_2^k)\right],$$
where
$\alpha^k(\cdot)$ and $\beta^k(\cdot)$ are the step sizes for the faster and slower time-scale iterates, respectively;
$F(\cdot,\cdot)$ and $G(\cdot,\cdot)$ denote the transformations driving the faster and slower time-scale updates, respectively;
$\vec w_1^k$ and $\vec w_2^k$ denote the noise vectors in the $k$th iteration on the faster and slower time scales, respectively.
Note that $F(\cdot,\cdot)$ is a function of $\vec X$ and $\vec Y$, and the same is true of $G(\cdot,\cdot)$. Clearly, then, the fates of the iterates are intertwined (or coupled) because their values are inter-dependent.
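The coupled updates are easy to express in code. The following is a generic Python sketch of a two-time-scale loop; the transformations F and G, and the particular step-size schedules, are assumptions supplied for illustration.

```python
import numpy as np

def two_time_scale(F, G, x0, y0, n_iters=50_000):
    """Generic two-time-scale stochastic approximation sketch.
    F drives the faster iterates x; G drives the slower iterates y.
    Both return noisy estimates, e.g. from a simulator (assumptions)."""
    x, y = np.array(x0, float), np.array(y0, float)
    for k in range(1, n_iters + 1):
        alpha = 1.0 / k                            # faster step size
        beta = 1.0 / (1.0 + k * np.log(k + 1))     # slower: beta/alpha -> 0
        x = x + alpha * F(x, y)                    # sees y as quasi-static
        y = y + beta * G(x, y)                     # drifts on the slower scale
    return x, y
```

The key design point is the ratio $\beta^k/\alpha^k \to 0$: the faster iterate effectively equilibrates between successive moves of the slower one.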
where $z_1, z_2, z_3, z_1', z_2',$ and $z_3'$ are scalar constants and $||\cdot||$ could be any norm.
Condition 3. Step-size conditions of stochastic approximation: The step sizes satisfy the usual tapering conditions for every $l_1 = 1, 2, \ldots, N_1$ and $l_2 = 1, 2, \ldots, N_2$:
$$\sum_{k=1}^{\infty}\alpha^k(l_1) = \infty; \quad \sum_{k=1}^{\infty}\left(\alpha^k(l_1)\right)^2 < \infty; \quad \sum_{k=1}^{\infty}\beta^k(l_2) = \infty; \quad \sum_{k=1}^{\infty}\left(\beta^k(l_2)\right)^2 < \infty.$$
In addition, the step sizes must satisfy the following condition for every $(l_1, l_2)$ pair:
$$\limsup_{k\to\infty} \frac{\beta^k(l_2)}{\alpha^k(l_1)} = 0; \tag{11.67}$$
with a single step size on each time scale, this reduces to
$$\lim_{k\to\infty} \frac{\beta^k}{\alpha^k} = 0.$$
(ii) If the values of the Y -iterates are frozen (i.e., fixed at any vector),
the X-iterates converge (to a solution that is Lipschitz continuous
in Y );
where $\vec J^* \in \mathbb{R}^{|S|}$ and $\vec J^*$ is the unique solution of Equation (11.21). Then, from Eq. (11.21), we have that for any $i \in S$:
$$J^*(i) = \max_{a\in A(i)} Q(i,a),$$
which from Eq. (11.70) implies that for all $(i,a)$ pairs,
$$Q(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + \lambda\max_{b\in A(j)} Q(j,b)\right].$$
The above is a Q-factor version of the Bellman optimality equation in (11.21). Thus the above equation can be replaced in Proposition 11.7 to obtain the following result.
Proposition 11.25 Consider the system of equations defined as follows. For all $i \in S$ and $a \in A(i)$:
$$Q(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + \lambda\max_{b\in A(j)} Q(j,b)\right].$$
The vector $\vec Q^*$ that solves this equation is the optimal Q-factor vector, i.e., if for all $i \in S$, $\mu^*(i) \in \arg\max_{a\in A(i)} Q^*(i,a)$, then $\hat\mu^*$ denotes an optimal policy.
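When the model is known, the equation in Proposition 11.25 can be solved directly by successive approximation, since its RHS is a contraction. The following Python sketch does this; the array shapes (P[i,a,j], r[i,a,j]) are our own assumptions for illustration, and in the RL setting these arrays would of course not be available.

```python
import numpy as np

def q_factor_value_iteration(P, r, lam, n_iters=1000):
    """Computes the optimal Q-factors of Proposition 11.25 by repeatedly
    applying the Q-version of the Bellman optimality operator."""
    n_states, n_actions = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # target(i,a,j) = r(i,a,j) + lam * max_b Q(j,b)
        target = r + lam * Q.max(axis=1)[None, None, :]
        Q = np.einsum('iaj,iaj->ia', P, target)   # expectation over j
    mu = Q.argmax(axis=1)      # greedy policy from Q* is optimal
    return Q, mu
```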
where $\vec J_{\hat\mu} \in \mathbb{R}^{|S|}$ and $\vec J_{\hat\mu}$ is the unique solution of the linear Equation (11.22). From Eq. (11.22), then, we have that for any $i \in S$: $J_{\hat\mu}(i) = Q(i,\mu(i))$, which from Eq. (11.71) implies that for all $(i,a)$ pairs,
$$Q(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + \lambda Q(j,\mu(j))\right].$$
where $\vec J^*$ and $\rho^*$ together solve Eq. (11.35); then, we have from the above and Eq. (11.73) that for any $i \in S$: $J^*(i) = \max_{a\in A(i)} Q(i,a)$, which from Eq. (11.73) implies that for all $(i,a)$ pairs,
$$Q(i,a) = \bar r(i,a) + \sum_{j\in S} p(i,a,j)\max_{b\in A(j)} Q(j,b) - \rho^*.$$
The above is a Q-factor version of the Bellman optimality equation in (11.35). The above equation can be replaced in the second part of Proposition 11.11 to obtain the following result.
Proposition 11.27 Consider the following system of equations, defined for all $i \in S$ and $a \in A(i)$, and for any scalar $\rho \in \mathbb{R}$:
$$Q(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + \max_{b\in A(j)} Q(j,b) - \rho\right]. \tag{11.74}$$
If there exist a vector $\vec Q^*$ and a scalar $\rho^* \in \mathbb{R}$ that together satisfy the above equation, then $\rho^*$ is the average reward of the policy $\hat\mu^*$, where for all $i \in S$, $\mu^*(i) \in \arg\max_{a\in A(i)} Q^*(i,a)$. Further, $\hat\mu^*$ denotes an optimal policy.
where $\vec h_{\hat\mu}$ and $\rho$ are solutions of Equation (11.34); then, we have from the above and Eq. (11.34) that for any $i \in S$: $h_{\hat\mu}(i) = Q(i,\mu(i))$, which from the equation above implies that for all $(i,a)$ pairs,
$$Q(i,a) = \bar r(i,a) + \sum_{j\in S} p(i,a,j)\,Q(j,\mu(j)) - \rho.$$
The above is a Q-factor version of the Bellman policy equation in (11.34). The above equation can be replaced in the first part of Proposition 11.11 to obtain the following result.
Proposition 11.28 Consider a policy $\hat\mu$. Further, consider the following system of equations, defined for all $i \in S$ and $a \in A(i)$, and for any scalar $\rho \in \mathbb{R}$:
$$Q(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + Q(j,\mu(j)) - \rho\right].$$
$$F\vec Q^k(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\left[r(i,a,j) + \lambda\max_{b\in A(j)} Q^k(j,b)\right];$$
Now, using the relation defined in Eq. (11.76), we can re-write the above:
$$Q^{k+1}(i,a) = Q^k(i,a) + \alpha^k\left[F'\vec Q^k(i,a) + w^k(i,a)\right],$$
Thus for any $(i,a)$:
$$\left|F\vec Q_1^k(i,a) - F\vec Q_2^k(i,a)\right| \le \lambda\,||\vec Q_1^k - \vec Q_2^k||_\infty.$$
Since the above holds for all values of $(i,a)$, it also holds for the values that maximize the left hand side of the above. Therefore
$$||F\vec Q_1^k - F\vec Q_2^k||_\infty \le \lambda\,||\vec Q_1^k - \vec Q_2^k||_\infty.$$
This proof appears in Gosavi [111]. We will later present two other proofs that exploit Theorem 11.23. The boundedness argument is next provided as a separate lemma.
Proof We first claim that for every state-action pair $(i,a)$:
$$\begin{aligned}
|Q^2(i,a)| &\le (1-\alpha)|Q^1(i,a)| + \alpha\Big|r(i,a,j) + \lambda\max_{b\in A(j)} Q^1(j,b)\Big|\\
&\le (1-\alpha)M + \alpha M + \alpha\lambda M \quad \text{(from (11.81) and (11.80))}\\
&\le (1-\alpha)M + \alpha M + \lambda M \quad \text{(from the fact that } \alpha \le 1\text{)}\\
&= M(1+\lambda).
\end{aligned}$$
From the above, our claim in (11.79) is true for $k = 1$. Now assuming that the claim is true when $k = m$, we have that for all $(i,a) \in (S \times A(i))$:
$$|Q^{m+1}(i,a)| \le (1-\alpha)|Q^m(i,a)| + \alpha\Big|r(i,a,j) + \lambda\max_{b\in A(j)} Q^m(j,b)\Big| \le (1-\alpha)M(1+\lambda+\lambda^2+\cdots+\lambda^m) + \alpha M + \alpha\lambda M(1+\lambda+\cdots+\lambda^m) \le M(1+\lambda+\cdots+\lambda^{m+1}) \le \frac{M}{1-\lambda},$$
so the iterates remain bounded for all $k$.
Eigenvalue Proof. For this proof, we need a basic result from ODEs (see Theorem 4.1 on page 151 of [54]), which is as follows.
$$F_\infty\vec Q^k(i,a) \equiv \lim_{c\to\infty} F_c\vec Q^k(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\,\lambda\max_{b\in A(j)} Q^k(j,b) - Q^k(i,a).$$
where $Q^*(i,c)$ denotes the limiting Q-factor for the state-action pair $(i,c)$. Thus, the set $M_1(i)$ contains all the actions that maximize the optimal Q-factor in state $i$. Similarly, for each $i \in S$, let
$$M_2(i) = \operatorname{arg2max}_{c\in A(i)} Q^*(i,c),$$
where the set $M_2(i)$ contains all the actions that produce the second-highest value for the limiting Q-factor in state $i$. These sets will be non-empty (provided we have at least two actions, which we assume to be true) and can possibly be singletons. (For the sake of simplicity, the reader may assume these sets to be singletons and generalize later.)
Now, for any state i ∈ S and for any a1 ∈ M1 (i), let Q̃k1 (i, a1 ) denote
the value of the Q-factor of state i and action a1 in the kth iteration.
Similarly, for each state i ∈ S and for a2 ∈ M2 (i), let Q̃k2 (i, a2 ) denote
the value of the Q-factor of state i and action a2 in the kth iteration.
Note that here a1 and a2 assume values from the sets defined above.
Let J be the set of positive integers and let k ∈ J .
Proposition 11.34 With probability 1, there exists a positive integer
K such that for k ≥ K, for each i ∈ S, and for every (a1 , a2 )-pair,
Q̃k1 (i, a1 ) > Q̃k2 (i, a2 ), where a1 ∈ M1 (i) and a2 ∈ M2 (i). (11.85)
The above implies that the algorithm can be terminated in a finite number of iterations with probability 1. This is because, when for every state the estimate of the highest Q-factor starts exceeding the estimate of the second-highest Q-factor, the algorithm generates the optimal policy.
Proof We will first assume that we are working with one specific $(a_1, a_2)$-pair. This allows us to suppress $a_1$ and $a_2$ in the notation, thereby simplifying it. Thus, $\tilde Q_1^k(i,a_1)$ will be replaced by $\tilde Q_1^k(i)$ and $\tilde Q_2^k(i,a_2)$ will be replaced by $\tilde Q_2^k(i)$. The result can be shown to hold for every such pair.
We first define the absolute value of the difference between the limiting value and the estimate in the $k$th iteration. To this end, for each $i \in S$, let
$$e_1^k(i) = |\tilde Q_1^k(i) - Q_1^*(i)| \quad \text{and} \quad e_2^k(i) = |\tilde Q_2^k(i) - Q_2^*(i)|.$$
From the above, it is easy to see that we can have four different cases for the values of the estimates, depending on the sign of the difference:
Case 1: $\tilde Q_1^k(i) = Q_1^*(i) - e_1^k(i)$, and $\tilde Q_2^k(i) = Q_2^*(i) + e_2^k(i)$;
Case 2: $\tilde Q_1^k(i) = Q_1^*(i) - e_1^k(i)$, and $\tilde Q_2^k(i) = Q_2^*(i) - e_2^k(i)$;
Case 3: $\tilde Q_1^k(i) = Q_1^*(i) + e_1^k(i)$, and $\tilde Q_2^k(i) = Q_2^*(i) + e_2^k(i)$;
Case 4: $\tilde Q_1^k(i) = Q_1^*(i) + e_1^k(i)$, and $\tilde Q_2^k(i) = Q_2^*(i) - e_2^k(i)$.
Let $D(i) \equiv Q_1^*(i) - Q_2^*(i)$ for all $i \in S$. We have $D(i) > 0$ for all $i \in S$ because of the following. By its definition, $D(i) \ge 0$ for any $i$; and $D(i) = 0$ only in the situation in which all actions are equally good for the state in question, in which case there is nothing to be proved.
We will now assume that there exists a value K for k such that for
each i, both ek1 (i) and ek2 (i) are less than D(i)/2 when k ≥ K. We
will prove later that this assumption holds in Q-Learning. Thus, our
immediate goal is to show inequality (11.85) for each case, under this
assumption.
We first consider Case 1.
$$\tilde Q_1^k(i) - \tilde Q_2^k(i) = Q_1^*(i) - Q_2^*(i) - e_2^k(i) - e_1^k(i) \quad \text{(from Case 1)} = D(i) - \left(e_1^k(i) + e_2^k(i)\right) > D(i) - D(i) = 0.$$
Then, $\tilde Q_1^k(i) > \tilde Q_2^k(i)$ for all $i$ and $k \ge K$, proving inequality (11.85). The rest of the cases should be obvious from drawing a simple figure, but we present details. As in Case 1, for Case 4, we can show that
$$\tilde Q_1^k(i) - \tilde Q_2^k(i) = D(i) + \left(e_1^k(i) + e_2^k(i)\right) > 0, \quad \text{since } e_l^k(i) \ge 0 \text{ for } l = 1, 2.$$
For Case 2, since $e_2^k(i) \ge 0$, we have that
$$\frac{D(i)}{2} + e_2^k(i) \ge \frac{D(i)}{2}.$$
Also, since $e_1^k(i) < \frac{D(i)}{2}$, we have that
$$\frac{D(i)}{2} - e_1^k(i) > 0.$$
Combining the two inequalities above, we have that $D(i) - \left(e_1^k(i) - e_2^k(i)\right) > \frac{D(i)}{2}$. Then, we have that $\tilde Q_1^k(i) - \tilde Q_2^k(i) = D(i) - \left(e_1^k(i) - e_2^k(i)\right) > \frac{D(i)}{2} > 0$.
$$\lim_{k\to\infty}\tilde Q_1^k(i) = Q_1^*(i) \quad \text{and} \quad \lim_{k\to\infty}\tilde Q_2^k(i) = Q_2^*(i).$$
Hence, for any given $\epsilon > 0$, there exists a value $k_1 \in \mathcal{J}$ for which $|\tilde Q_1^k(i) - Q_1^*(i)| < \epsilon$ when $k \ge k_1$. Similarly, for any given $\epsilon > 0$, there exists a value $k_2 \in \mathcal{J}$ for which $|\tilde Q_2^k(i) - Q_2^*(i)| < \epsilon$ when $k \ge k_2$. Selecting $\epsilon = D(i)/2$ (that $D(i) > 0$ has been shown above) and $K \equiv \max\{k_1, k_2\}$, we have that for $k \ge K$, with probability 1,
$$|\tilde Q_1^k(i) - Q_1^*(i)| < \epsilon = D(i)/2, \quad \text{i.e.,} \quad e_1^k(i) < D(i)/2.$$
Similarly, using the same value for $\epsilon$, one can show that for $k \ge K$, with probability 1, $e_2^k(i) < D(i)/2$, which proves our assumption. The result above did not depend on any specific value of $a_1$ or $a_2$, and can be similarly shown for every $(a_1, a_2)$-pair.
An issue related to the above is: how should the algorithm be terminated? The reader may recall that, as in Q-Learning, convergence in value iteration occurs in the limit, i.e., as the number of iterations tends to infinity. In value iteration, however, we can use the norm of a difference vector to terminate the algorithm. Unfortunately, in RL this is not possible, for the following reason: only one Q-factor gets updated in each iteration, leading to a situation where the number of times a state-action pair has been updated thus far (the updating frequency) is unlikely to be the same for all state-action pairs at any given (algorithm) iteration. Hence, computing the norm or span of the difference vector is essentially not useful.
In practice, for termination, we run the algorithm for as long as we
can, i.e., for a pre-specified, fixed number of iterations. It makes sense
to use an appropriate step size such that the step size remains reason-
ably large until the pre-specified number of iterations are complete.
When the step size becomes too small, e.g., $10^{-6}$, the Q-factors cease to change due to computer round-off errors, and there is no point in continuing further. Another way is to terminate the algorithm if the policy has not changed in the last several iterations. But this requires checking the policy after every iteration (or at least after every few iterations), which may be computationally burdensome (and impossible for huge state-action spaces).
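A compromise is to check the greedy policy only periodically. The Python sketch below illustrates this termination scheme; the `update_once` function, the check interval, and the patience threshold are hypothetical choices for illustration.

```python
import numpy as np

def run_with_policy_check(update_once, Q, n_iters=200_000,
                          check_every=1000, patience=20):
    """Practical termination sketch: stop when the greedy policy has not
    changed over `patience` consecutive checks. `update_once(Q, k)` is a
    hypothetical function performing one asynchronous RL update in place."""
    stable, last_policy = 0, Q.argmax(axis=1)
    for k in range(1, n_iters + 1):
        update_once(Q, k)
        if k % check_every == 0:           # avoid per-step policy scans
            policy = Q.argmax(axis=1)
            stable = stable + 1 if np.array_equal(policy, last_policy) else 0
            last_policy = policy
            if stable >= patience:
                break
    return Q, last_policy
```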
Let the optimal policy be denoted by $\hat\mu^*$. The convergence result for the algorithm is as follows.
Proposition 11.35 When the step sizes and action selection used in the algorithm satisfy Conditions 3, 6, and 7 of Theorem 11.21, with probability 1, the sequence of policies generated by the Relative Q-Learning algorithm, $\{\hat\mu^k\}_{k=1}^{\infty}$, converges to $\hat\mu^*$.
$$F\vec Q^k(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\left[r(i,a,j) + \max_{b\in A(j)} Q^k(j,b)\right] - Q^k(i^*, a^*).$$
Then, we can define $F'(\cdot)$ via Eq. (11.76). We further define a transformation $f(\cdot)$ as follows:
$$f\vec Q^k(i,a) = r(i,a,\xi^k) + \max_{b\in A(\xi^k)} Q^k(\xi^k, b) - Q^k(i^*, a^*),$$
To show Condition 5, note that the above theorem also holds for the special case when all the immediate rewards are set to 0, i.e., $r(i,a,j) = 0$ for all $i \in S$, $j \in S$, and $a \in A(i)$. Now if we compute the scaled function $F_c(\cdot)$, as defined in Definition 11.6, we can show (see Eq. (11.83) and its accompanying discussion) that:
$$F_\infty\vec Q^k(i,a) = \lim_{c\to\infty} F_c\vec Q^k(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\max_{b\in A(j)} Q^k(j,b) - Q^k(i^*,a^*) - Q^k(i,a);$$
Proposition 11.37 When the step sizes and the action selection satisfy Conditions 3, 6, and 7 of Theorem 11.21, with probability 1, the sequence of iterates generated within Step 2 of the CAP-I algorithm, $\{\vec J^n\}_{n=1}^{\infty}$, converges to the unique solution of the Bellman policy equation, i.e., to the value function vector associated with policy $\hat\mu$.
Proof The main theme underlying this proof is very similar to that of Q-Learning. As usual, we first define some transformations: For all $i \in S$,
$$F\vec J^n(i) = \sum_{j=1}^{|S|} p(i,\mu(i),j)\left[r(i,\mu(i),j) + \lambda J^n(j)\right];$$
$$F'\vec J^n(i) = \sum_{j=1}^{|S|} p(i,\mu(i),j)\left[r(i,\mu(i),j) + \lambda J^n(j)\right] - J^n(i);$$
$$f\vec J^n(i) = r(i,\mu(i),\xi^n) + \lambda J^n(\xi^n);$$
$$\frac{d\vec j}{dt} = F'(\vec j), \tag{11.89}$$
where $\vec j$ denotes the continuous-valued variable underlying the iterate $\vec J$. We now need to evaluate the conditions of Theorem 11.21.
The result implies that the limit point $\vec Q^\infty$ will equal $E[Z(i,a)]$, i.e., for all $(i,a)$ pairs:
$$Q^\infty(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + \lambda J(j)\right],$$
Proof For this proof, we first note that since (i) the update of a given
Q-factor is unrelated to that of any other Q-factor (this is because
the updating equation does not contain any other Q-factor) and (ii)
each state-action pair is tried with the same frequency (this is because
each action is tried with the same probability in every state), we can
analyze the convergence of each Q-factor independently (separately).
For the proof, we will rely on the standard arguments used in the literature to show convergence of a Robbins-Monro algorithm (see e.g., [33]). The latter can be shown to be a special case of the stochastic gradient algorithm discussed in Chap. 10. As such, the result associated with stochastic gradients (Theorem 10.8) can be exploited for the analysis after finding a so-called "potential function" needed in the stochastic gradient algorithm.
We define the potential function $g: \mathbb{R}^N \to \mathbb{R}$ such that for any $(i,a)$ pair:
$$g\vec Q^m(i,a) = \left(Q^m(i,a) - E[Z(i,a)]\right)^2 / 2.$$
Then, for any $(i,a)$-pair,
$$\frac{d\left[g\vec Q^m(i,a)\right]}{dQ^m(i,a)} = Q^m(i,a) - E[Z(i,a)]. \tag{11.92}$$
We further define the noise term as follows for every $(i,a)$-pair:
$$w^m(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) + \lambda J(j)\right] - \left[r(i,\mu(i),\xi^m) + \lambda J(\xi^m)\right] = E[Z(i,a)] - \left[r(i,\mu(i),\xi^m) + \lambda J(\xi^m)\right]. \tag{11.93}$$
Combining (11.92) and (11.93), we can write the update in (11.90) as:
$$Q^{m+1}(i,a) \leftarrow Q^m(i,a) - \alpha^m\left[\frac{d\left[g\vec Q^m(i,a)\right]}{dQ^m(i,a)} + w^m(i,a)\right],$$
Since $J(\cdot)$ is bounded and since we clearly start with finite values for the Q-factors, $M$ has to be finite. We will use an induction argument. We show the case for $m = 1$ as follows:
$$|Q^1(i,a)| \le (1-\alpha)|Q^0(i,a)| + \alpha|r(i,a,j) + \lambda J(j)| \le (1-\alpha)M + \alpha M = M.$$
Now assuming that the claim is true when $m = P$, we have that for all $(i,a)$: $|Q^P(i,a)| \le M$. Then,
$$|Q^{P+1}(i,a)| \le (1-\alpha)|Q^P(i,a)| + \alpha|r(i,a,j) + \lambda J(j)| \le (1-\alpha)M + \alpha M = M.$$
Proposition 11.40 When step sizes and the action selection satisfy Conditions 3, 6, and 7 of Theorem 11.21, with probability 1, the sequence of iterates generated within Step 2 of the Q-P-Learning algorithm for discounted reward MDPs, $\{\vec Q^n\}_{n=1}^{\infty}$, converges to the unique solution of the Q-factor version of the Bellman policy equation, i.e., to the value function vector associated with policy $\hat\mu$.
Proof The proof will be very similar to that of Q-Learning, since the
policy evaluation phase in Q-P -Learning can essentially be viewed as
Q-Learning performed for a fixed policy. We will use μ̂ to denote the
policy being evaluated. Hence, using the notation of the algorithm,
for all $i \in S$:
$$\mu(i) \in \arg\max_{c\in A(i)} P(i,c).$$
$$F\vec Q^n(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\left[r(i,a,j) + \lambda Q^n(j,\mu(j))\right].$$
Then, we can define $F'(\cdot)$ via Eq. (11.76). We further define a transformation $f(\cdot)$ as follows:
$$f\vec Q^n(i,a) = r(i,a,\xi^n) + \lambda Q^n(\xi^n, \mu(\xi^n)),$$
$$= \lambda\,||\vec Q_1^n - \vec Q_2^n||_\infty \cdot 1.$$
Since the above holds for all values of $(i,a)$, it also holds for the values that maximize the left hand side of the above. Therefore
$$||F\vec Q_1^n - F\vec Q_2^n||_\infty \le \lambda\,||\vec Q_1^n - \vec Q_2^n||_\infty.$$
$$\sum_{j\in S} p(i,a,j)\left[r_L(i,a,j) + r_C(i,a,j)\,\frac{1 - e^{-\gamma\bar t(i,a,j)}}{\gamma} + e^{-\gamma\bar t(i,a,j)}\max_{b\in A(j)} Q^k(j,b)\right].$$
Proof First note that since the time term is always strictly positive ($\bar t(\cdot,\cdot,\cdot) > 0$) and since $\gamma > 0$, there exists a scalar $\bar\lambda$ in the interval $(0,1)$ such that
$$\max_{i,j\in S;\, a\in A(i)} e^{-\gamma\bar t(i,a,j)} \le \bar\lambda.$$
Consider two vectors $\vec Q_1^k$ and $\vec Q_2^k$ in $\mathbb{R}^N$. From the definition of $F(\cdot)$ above, it follows that:
$$F\vec Q_1^k(i,a) - F\vec Q_2^k(i,a) = \sum_{j=1}^{|S|} e^{-\gamma\bar t(i,a,j)}\,p(i,a,j)\left[\max_{b\in A(j)} Q_1^k(j,b) - \max_{b\in A(j)} Q_2^k(j,b)\right].$$
From this, we can write that for any $(i,a)$ pair:
$$\begin{aligned}
\left|F\vec Q_1^k(i,a) - F\vec Q_2^k(i,a)\right| &\le \sum_{j=1}^{|S|} p(i,a,j)\,e^{-\gamma\bar t(i,a,j)}\left|\max_{b\in A(j)} Q_1^k(j,b) - \max_{b\in A(j)} Q_2^k(j,b)\right|\\
&\le \max_{i,j\in S;\,a\in A(i)} e^{-\gamma\bar t(i,a,j)} \sum_{j=1}^{|S|} p(i,a,j)\left|\max_{b\in A(j)} Q_1^k(j,b) - \max_{b\in A(j)} Q_2^k(j,b)\right|\\
&\le \bar\lambda \sum_{j=1}^{|S|} p(i,a,j)\max_{j\in S,\,b\in A(j)}\left|Q_1^k(j,b) - Q_2^k(j,b)\right|\\
&= \bar\lambda \sum_{j=1}^{|S|} p(i,a,j)\,||\vec Q_1^k - \vec Q_2^k||_\infty = \bar\lambda\,||\vec Q_1^k - \vec Q_2^k||_\infty \sum_{j=1}^{|S|} p(i,a,j) = \bar\lambda\,||\vec Q_1^k - \vec Q_2^k||_\infty \cdot 1.
\end{aligned}$$
Proposition 11.42 [30] When all admissible policies are proper and the starting state is unique, the optimal value function vector $\vec J^*$ for the SSP satisfies the following equation: For all $i \in S'$,
$$J^*(i) = \max_{a\in A(i)}\left[\bar r(i,a) + \sum_{j\in S'} p(i,a,j)\,J^*(j)\right]. \tag{11.95}$$
Further, for any given stationary policy $\hat\mu$, there exists a unique solution to the following equation:
$$J_{\hat\mu}(i) = \bar r(i,\mu(i)) + \sum_{j\in S'} p(i,\mu(i),j)\,J_{\hat\mu}(j) \quad \text{for each } i \in S',$$
such that the scalar $J_{\hat\mu}(i)$ equals the expected value of the total reward earned in a trajectory starting in state $i$ until the termination state is reached.
Note that in the above, the summation in the RHS is over $S'$, which excludes the termination state. The termination state is a part of the state space $S$. Because the termination state is not a part of the summation, we will have that, for one or more $i \in S'$,
$$\sum_{j\in S'} p(i,\mu(i),j) < 1 \quad \text{under any given policy } \hat\mu.$$
The above implies that the transition probability matrix used in the Bellman equation is not stochastic, i.e., in at least one row the elements do not sum to 1. This is an important property of the SSP's transition probabilities that the reader needs to keep in mind.
A Q-factor version of Eq. (11.95) would be as follows:
$$Q^*(i,a) = \bar r(i,a) + \sum_{j\in S'} p(i,a,j)\max_{b\in A(j)} Q^*(j,b). \tag{11.96}$$
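The substochastic rows mentioned above are exactly what a program must respect when solving Eq. (11.96) by successive approximation. Below is a minimal Python sketch under the assumption that all policies are proper; the array layout is our own, with the termination state simply excluded from the transition array.

```python
import numpy as np

def ssp_q_iteration(P, rbar, n_iters=2000):
    """Q-factor value iteration for an SSP in the style of Eq. (11.96).
    P[i,a,j] sums over j to LESS than 1 in rows that can jump to the
    termination state (which is excluded from index j); rbar[i,a] is the
    expected immediate reward. All policies are assumed proper."""
    n_states, n_actions = rbar.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # The missing probability mass contributes 0 to the expectation,
        # encoding the zero value of the termination state.
        Q = rbar + np.einsum('iaj,j->ia', P, Q.max(axis=1))
    return Q
```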
We now present a key result which shows that the average reward
SMDP can be viewed as a special case of the SSP under some con-
ditions. The motivation for this is that the SSP has some attractive
properties that can be used to solve the SMDP.
Transformation of SMDP to a fictitious κ-SSP: Consider any
recurrent state in the SMDP, and number it K. We will call this state
the distinguished state in the SMDP. Define the immediate rewards
for a transition from i to j (where i, j ∈ S) under any action a ∈ A(i)
in the new problem to be:
r(i, a, j) − κt(i, a, j),
where κ is any scalar (the value will be defined later). In this problem,
the distinguished state will serve as the termination state as well as the
starting state. This implies that once the system enters K, no further
transitions are possible in that trajectory, and hence the above problem
is an SSP. The fictitious SSP so generated will be called a κ-SSP. We
now present the associated result which establishes the equivalence.
Lemma 11.45 Let $R_K(\hat\mu)$ and $T_K(\hat\mu)$ denote the expected value of the total reward and the expected value of the total time, respectively, in one "cycle" from $K$ to $K$ when the policy pursued in the cycle is $\hat\mu$. Further, define
$$\tilde\rho \equiv \max_{\hat\mu} \frac{R_K(\hat\mu)}{T_K(\hat\mu)}. \tag{11.99}$$
Note that in the RHS of the above, we have omitted h(K) since K
is the termination state for the SSP. Now, for any given policy μ̂, the
value function of the state K in the SSP can be written as:
The above follows from the definition of the value function, which says
that it equals the expected value of the sum of the reward function (in
this case r(., ., .) − ρ̃t(., ., .)) starting from K and ending at K. Now,
again by definition,
Lemma 11.46 $\tilde\rho = \rho^*$.
For the proof of this lemma, we need the following fundamental result
[30, 242] (we will use the notation Ak to mean matrix A raised to the
kth power):
Lemma 11.47 Let $P_{\hat\mu}$ denote the transition probability matrix of a policy $\hat\mu$ with $n$ states, where the matrix is stochastic. Then,
$$\lim_{m\to\infty} \frac{\sum_{k=1}^{m} P_{\hat\mu}^k}{m} = P_{\hat\mu}^{**}, \quad \text{where } P_{\hat\mu}^{**} \text{ is an } n \times n \text{ matrix}$$
such that $P_{\hat\mu}^{**}(i,j)$ denotes the steady-state probability of being in state $j$ provided the system started in state $i$ and policy $\hat\mu$ is being used in all states.
For an SMDP where all states are recurrent, each row in the matrix $P_{\hat\mu}^{**}$ will be identical, and the $(i,j)$th term in every row of the matrix will equal $\Pi_{\hat\mu}(j)$, i.e., the steady-state probability of being in state $j$ under $\hat\mu$.
Proof (of Lemma 11.46) We now define $\vec J_0 = \vec h$, and for some stationary policy $\hat\mu$, using $\vec r_{\hat\mu}$ to denote the vector whose $i$th element is $\bar r(i,\mu(i))$, we define the sequence $\{\vec J_k\}_{k=1}^{\infty}$ via
$$\vec J_{k+1} = \vec r_{\hat\mu} + P_{\hat\mu}\vec J_k,$$
where $P_{\hat\mu}$ denotes the transition probability matrix in the SMDP associated with the policy $\hat\mu$. Note that this matrix is stochastic, and so is the transition probability matrix underlying any action defined in Eq. (11.102). Then, if $\vec\tau_{\hat\mu}$ denotes the vector whose $i$th element is $\bar t(i,\mu(i))$ for all $i \in S$, we will show via induction that:
$$\tilde\rho \sum_{k=1}^{m} P_{\hat\mu}^k\vec\tau_{\hat\mu} + \vec J_0 \ge \vec J_m. \tag{11.104}$$
Now, from Eq. (11.102), for any given stationary policy $\hat\mu$, we have, using (11.103),
$$\tilde\rho P_{\hat\mu}\vec\tau_{\hat\mu} + \vec J_0 \ge \vec r_{\hat\mu} + P_{\hat\mu}\vec J_0 = \vec J_1; \tag{11.105}$$
$$\tilde\rho \sum_{k=1}^{n} P_{\hat\mu}^{k+1}\vec\tau_{\hat\mu} + \vec J_1 \ge \vec J_{n+1}. \tag{11.106}$$
Adding (11.106) and (11.105), we have
$$\tilde\rho \sum_{k=1}^{n+1} P_{\hat\mu}^{k}\vec\tau_{\hat\mu} + \vec J_0 \ge \vec J_{n+1},$$
which completes the induction. Then dividing both sides of (11.104) by $m$ and taking the limit as $m \to \infty$, we have from Theorem 9.7:
$$\tilde\rho \lim_{m\to\infty}\frac{\sum_{k=1}^{m} P_{\hat\mu}^k\vec\tau_{\hat\mu}}{m} + \lim_{m\to\infty}\frac{\vec J_0}{m} \ge \lim_{m\to\infty}\frac{\vec J_m}{m}. \tag{11.107}$$
Also, note that from its definition $J_m(i)$ denotes the total expected reward earned starting from state $i$, and hence $\lim_{m\to\infty}\frac{\vec J_m}{m} = \bar R_{\hat\mu}\vec e$, where $\bar R_{\hat\mu}$ denotes the expected reward in one state transition under policy $\hat\mu$. Since $J_0(i)$ is bounded for every $i$, $\lim_{m\to\infty}\frac{\vec J_0}{m} = 0\vec e$. Then, we can write (11.107) as:
$$\tilde\rho\,\bar T_{\hat\mu}\vec e \ge \bar R_{\hat\mu}\vec e, \quad \text{i.e.,} \quad \tilde\rho\,\vec e \ge \frac{\bar R_{\hat\mu}}{\bar T_{\hat\mu}}\vec e, \tag{11.108}$$
where $\bar T_{\hat\mu}$ equals the expected time spent in one transition under $\hat\mu$; the renewal reward theorem (see Eq. (6.23); [155, 251]) implies that $\bar R_{\hat\mu}/\bar T_{\hat\mu}$ equals the average reward of the policy $\hat\mu$, i.e., $\tilde\rho \ge \rho_{\hat\mu}$. The equality in (11.108) applies only when one uses the policy $\hat\mu^*$ that uses the max operator in (11.102), i.e., only when $\tilde\rho$ equals the average reward of that policy in particular. Now, what (11.108) implies is that the average reward of every policy other than $\hat\mu^*$ will be less than $\tilde\rho$. Clearly then $\hat\mu^*$ must be optimal, i.e., $\tilde\rho = \rho_{\hat\mu^*} = \rho^*$.
Equation (11.102) with $\tilde\rho = \rho^*$ is the Bellman equation for the average reward SMDP with $h(K) = 0$. Note that the distinguished state, $K$, in the SSP is actually a regular state in the SMDP, whose value function may not necessarily equal zero. This suggests working with a transformation of the form
$$h(i) = \max_{a\in A(i)}\left[\bar r(i,a) - \rho^*\,\bar t(i,a) + \sum_{j\in S} p(i,a,j)\,I(j \ne K)\,h(j)\right] \quad \text{for all } i \in S, \tag{11.109}$$
where $I(\cdot)$, the indicator function, returns a 1 when the condition inside the brackets is true and a zero otherwise. The attractive feature of the above is that the associated transformation is fortunately contractive (with respect to the weighted max norm, which can be proved) and behaves gracefully in numerical computations.
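The cycle quantities $R_K(\hat\mu)$ and $T_K(\hat\mu)$ from Lemma 11.45 are also straightforward to estimate by simulation via the renewal reward theorem. The following Python sketch does so; the single-transition simulator `sim_step` is a hypothetical interface, and all states are assumed recurrent so that every cycle terminates.

```python
import numpy as np

def cycle_average_reward(sim_step, mu, K, n_cycles=10_000, seed=0):
    """Renewal-reward estimate of rho_mu = R_K(mu) / T_K(mu): average the
    total reward and total time over cycles from distinguished state K
    back to K. `sim_step(state, action, rng) -> (next_state, reward, time)`
    is a hypothetical one-transition simulator (an assumption here)."""
    rng = np.random.default_rng(seed)
    cycle_r, cycle_t = [], []
    for _ in range(n_cycles):
        state, R, T = K, 0.0, 0.0
        while True:
            state, r, t = sim_step(state, mu[state], rng)
            R += r
            T += t
            if state == K:        # a cycle completes on return to K
                break
        cycle_r.append(R)
        cycle_t.append(T)
    return np.mean(cycle_r) / np.mean(cycle_t)   # -> R_K(mu)/T_K(mu)
```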
The following result is the counterpart of Proposition 11.44 for a
given policy.
Proof The proof is similar to that of Lemma 11.45, using the second part of Proposition 11.42. Note that the renewal reward theorem implies that
$$\rho_{\hat\mu} = \frac{R_K(\hat\mu)}{T_K(\hat\mu)}.$$
Then, as discussed above, $h_{\hat\mu}(K) = R_K(\hat\mu) - \rho_{\hat\mu} T_K(\hat\mu)$, which from the definition of $\rho_{\hat\mu}$ implies that $h_{\hat\mu}(K) = 0$. Then, we can write Eq. (11.110) as follows: For $i = 1, 2, \ldots, K$:
$$h_{\hat\mu}(i) = \bar r(i,\mu(i)) - \rho_{\hat\mu}\,\bar t(i,\mu(i)) + \sum_{j=1}^{K} p(i,\mu(i),j)\,h_{\hat\mu}(j).$$
In other words, what this result shows is that when Condition 4b′ holds, the slower iterate converges with probability 1 to a globally asymptotically stable equilibrium of the ODE in (11.69). In the proof, we will use the shorthand notation $x^k \to x^*$ when we mean the sequence $\{x^k\}_{k=1}^{\infty}$ converges to $x^*$.
Proof We will drop the iterate index $l_2$ from the notation of the step size for the unique slower iterate. For the proof, we need the following limits shown in [235].
$$\lim_{k\to\infty}\prod_{n=K}^{k+1}(1 - \beta^n) = 0 \quad \text{for any finite } K. \tag{11.112}$$
Now the derivative condition, i.e., Condition 4b′ (iii), implies that there exist negative upper and lower bounds on the derivative, i.e., there exist $C_1, C_2 \in \mathbb{R}$ where $0 < C_1 \le C_2$ such that:
Since the above is true for any finite integral value of $k$, we have that it holds for $k = K$ and $k = K+1$. In this style, we can also show the above result when the sandwiched term is $\Delta^{K+3}, \Delta^{K+4}, \ldots$. In general, then, for any $M > K$, we obtain:
$$\prod_{n=K}^{M}(1 - C_2\beta^n)\,\Delta^K + \sum_{n=K}^{M}\left[\prod_{m=n+1}^{M}(1 - C_2\beta^m)\right]\beta^n\delta^n \;\le\; \Delta^{M+1} \;\le\; \prod_{n=K}^{M}(1 - C_1\beta^n)\,\Delta^K + \sum_{n=K}^{M}\left[\prod_{m=n+1}^{M}(1 - C_1\beta^m)\right]\beta^n\delta^n.$$
We now take the limits as $M \to \infty$ on the above. Then, via Theorem 9.7 and using (11.112) and (11.113), $0 \le \lim_{M\to\infty}\Delta^{M+1} \le 0$. Then, Theorem 9.8 implies that $\Delta^{M+1} \to 0$. Identical arguments can now be repeated for the case $Y_2 > Y_1$, i.e., for (11.118), to obtain the same conclusion.
9.2.3 R-SMART
We now discuss the convergence properties of R-SMART under some
conditions. R-SMART is a two-time-scale algorithm, and we will use
Theorem 11.24 to establish convergence.
We will first consider the CF-version. The core of the CF-version
of R-SMART can be expressed by the following transformations. On
the faster time scale we have:
$$Q^{k+1}(i,a) = Q^k(i,a) + \alpha^k\left[r(i,a,\xi^k) - \rho^k t(i,a,\xi^k) + \eta\max_{b\in A(\xi^k)} Q^k(\xi^k,b) - Q^k(i,a)\right], \tag{11.122}$$
where $\xi^k$ is a random variable that depends on $(i,a)$ and $k$; on the slower time scale we have:
$$\rho^{k+1} = \rho^k + \beta^k\, I\!\left(a \in \arg\max_{u\in A(i)} Q^k(i,u)\right)\left[\frac{TR^k}{TT^k} - \rho^k\right];$$
$$TR^{k+1} = TR^k + I\!\left(a \in \arg\max_{u\in A(i)} Q^k(i,u)\right) r(i,a,\xi^k); \tag{11.123}$$
$$TT^{k+1} = TT^k + I\!\left(a \in \arg\max_{u\in A(i)} Q^k(i,u)\right) t(i,a,\xi^k);$$
note that in the above, we use the indicator function in order to account for the fact that $TR^k$, $TT^k$, and $\rho^k$ are updated only when a greedy action is chosen in the simulator. We denote the optimal policy by $\hat\mu^*$ and the policy generated in the $k$th iteration by:
$$\mu_k(i) \in \arg\max_{a\in A(i)} Q^k(i,a) \quad \text{for all } i \in S. \tag{11.124}$$
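Before turning to the analysis, the updates (11.122)-(11.124) can be sketched in Python as follows. The simulator interface, the exploration probability, and the step-size constants are our own illustrative assumptions; only the update structure follows the equations above.

```python
import numpy as np

def r_smart_cf(sim_step, n_states, n_actions, eta=0.99, n_iters=500_000, seed=0):
    """Sketch of the CF-version updates (11.122)-(11.123) of R-SMART.
    `sim_step(i, a, rng) -> (j, reward, time)` is a hypothetical simulator."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    rho, TR, TT, i = 0.0, 0.0, 1e-9, 0
    for k in range(1, n_iters + 1):
        greedy = Q[i].argmax()
        a = greedy if rng.random() > 0.1 else rng.integers(n_actions)
        j, r, t = sim_step(i, a, rng)
        alpha = 150.0 / (300.0 + k)                      # faster step size
        beta = 10.0 / (1000.0 + k * np.log(k + 1))       # slower: beta/alpha -> 0
        # Faster time scale (11.122): eta-discounted Q-update
        Q[i, a] += alpha * (r - rho * t + eta * Q[j].max() - Q[i, a])
        if a == greedy:
            # Slower time scale (11.123): rho updated only on greedy actions
            TR += r
            TT += t
            rho += beta * (TR / TT - rho)
        i = j
    return Q, rho
```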
Then, for all $(i,a)$ pairs, $F'(\vec Q^k, \rho^k)(i,a) = F(\vec Q^k, \rho^k)(i,a) - Q^k(i,a)$. We further define the noise term as follows: For all $(i,a)$ pairs,
$$w_1^k(i,a) = r(i,a,\xi^k) - \rho^k t(i,a,\xi^k) + \eta\max_{b\in A(\xi^k)} Q^k(\xi^k,b) - F(\vec Q^k, \rho^k)(i,a).$$
Then, as in the case of Q-Learning, we can write the updating transformation on the faster time scale in our algorithm, (11.122), as:
$$Q^{k+1}(i,a) = Q^k(i,a) + \alpha^k\left[F'(\vec Q^k, \rho^k)(i,a) + w_1^k(i,a)\right],$$
which is of the same form as the updating scheme for the faster time scale defined for Theorem 11.24 (replace $\vec X^k$ by $\vec Q^k$ and $l$ by $(i,a)$). Then, if we fix the value of $\rho^k$ to some constant, $\breve\rho$, we can invoke the following ODE as in Theorem 11.24:
$$\frac{d\vec q}{dt} = F'(\vec q, \breve\rho), \tag{11.125}$$
where $\vec q$ denotes the continuous-valued variable underlying the iterate $\vec Q$.
We now define the functions underlying the iterate on the slower time scale. For all $(i,a)$ pairs,
$$G(\vec Q^k, \rho^k)(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\left[TR^k/TT^k - \rho^k\right];$$
$$G'(\vec Q^k, \rho^k)(i,a) = G(\vec Q^k, \rho^k)(i,a) + \rho^k;$$
$$w_2^k(i,a) = TR^k/TT^k - G'(\vec Q^k, \rho^k)(i,a);$$
the above allows us to express the update on the slower time scale in the algorithm, Eq. (11.123), as:
$$\rho^{k+1} = \rho^k + \beta^k\, I\!\left(a \in \arg\max_{u\in A(i)} Q^k(i,u)\right)\left[G(\vec Q^k, \rho^k)(i,a) + w_2^k(i,a)\right].$$
Proof We will first analyze the iterate on the slower time scale and claim that:
$$|\rho^k| \le M \quad \text{for all } k, \tag{11.126}$$
where $M$, a positive finite scalar, is defined as:
$$M = \max\left\{\frac{\max_{i,j\in S,\,a\in A(i)} |r(i,a,j)|}{\min_{i,j\in S,\,a\in A(i)} t(i,a,j)},\ \rho^1\right\}.$$
$$\begin{aligned}
&= \lim_{c\to\infty}\sum_{j=1}^{|S|} p(i,a,j)\left[\frac{r(i,a,j) - \rho^k\bar t(i,a,j)}{c} + \eta\,\frac{\max_{b\in A(j)} cQ^k(j,b)}{c}\right] - \lim_{c\to\infty}\frac{cQ^k(i,a)}{c}\\
&= \eta\sum_{j=1}^{|S|} p(i,a,j)\max_{b\in A(j)} Q^k(j,b) - Q^k(i,a) \quad \text{(since } \rho^k \text{ is bounded);}
\end{aligned}$$
it is not hard to see that $F'_\infty(\vec Q^k, \rho^k)$ is a special case of the transformation $F'(\vec Q^k, a)$ with the immediate rewards and times set to 0, where $a$ is any fixed scalar. But $F(\vec Q^k, a)$ is contractive, and hence via Theorem 11.22, the ODE $\frac{d\vec q}{dt} = F'_\infty(\vec q, a)$ has a globally asymptotically stable equilibrium. But note that the origin is the only equilibrium point for this ODE. Then, from Theorem 11.23, it follows that the sequence $\{\vec Q^k\}_{k=1}^{\infty}$ must be bounded with probability 1.
which is clearly Lipschitz in $Q^k(\cdot,\cdot)$. We have thus shown that Condition 4a holds.
To show Condition 4b, consider Condition 4b′ of Sect. 9.2.2, setting $y^* = \rho^*$, where $\rho^*$ is the optimal average reward of the SMDP. Note that $N_2 = 1$ for our algorithm (part (i) of Condition 4b′). Now, under Assumption 7.1, when the value of $\rho^k$ in the faster iterate is fixed to $\rho^*$, the faster iterates will converge to $\vec Q^*$, a solution of the Bellman optimality equation. Since the slower iterate updates only when a greedy policy is chosen, in the limit, the slower iterate must converge to the average reward of the policy contained in $\vec Q^*$, which must be optimal. Thus, the lockstep condition (part (ii) of Condition 4b′) holds for $y^* = \rho^*$. The derivative condition (part (iii) of Condition 4b′) is true by assumption. Thus, Condition 4b′ is true for our two-time-scale algorithm. Then, Proposition 11.49 implies that Condition 4b in Theorem 11.24 must hold. Theorem 11.24 can now be invoked to ensure convergence to the optimal solution of the SMDP with probability 1.
The main transformations related to the faster time scale will be: For all $i \in S$ and $a \in A(i)$,
$$F(\vec Q^k, \rho^k)(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) - \rho^k\,\bar t(i,a,j) + I(j \ne i^*)\max_{b\in A(j)} Q^k(j,b)\right]. \tag{11.127}$$
$$F'(\vec Q^k, \rho^k)(i,a) = F(\vec Q^k, \rho^k)(i,a) - Q^k(i,a).$$
$$w_1^k(i,a) = r(i,a,\xi^k) - \rho^k t(i,a,\xi^k) + I(\xi^k \ne i^*)\max_{b\in A(\xi^k)} Q^k(\xi^k,b) - F(\vec Q^k, \rho^k)(i,a).$$
Then, as in the case of Q-Learning, we can write the updating transformation on the faster time scale in our algorithm as:
$$Q^{k+1}(i,a) = Q^k(i,a) + \alpha^k\left[F'(\vec Q^k, \rho^k)(i,a) + w_1^k(i,a)\right] \quad \forall (i,a);$$
the updates on the slower time scale and the associated functions will be identical to those for the CF-version. Also, the policy generated by the algorithm in the $k$th iteration will be given as in Eq. (11.124). Our main convergence result is as follows:
Proposition 11.52 Assume that the step sizes used in the algorithm satisfy Conditions 3 and 6 of Theorem 11.24 and that GLIE policies are used in the learning. Further assume that part (iii) of Condition 4b′ from Sect. 9.2.2 holds. Then, with probability 1, the sequence of policies generated by the SSP-version of R-SMART, $\{\hat\mu^k\}_{k=1}^{\infty}$, converges to $\hat\mu^*$. The result above assumes that all states are recurrent under every policy and that one of the stationary deterministic policies is optimal.
Proof We will first show (via Lemma 11.53) that the transformation $F(\cdot)$ underlying the faster iterate, as defined in (11.127), is contractive with respect to a weighted max norm. The rest of the proof will be very similar to that of Proposition 11.50. Note, however, that because we use a distinguished state $i^*$ as an absorbing state in the algorithm, we will essentially be solving an SSP here; but Proposition 11.44 will ensure that the SSP's solution will also solve the Bellman optimality equation for the SMDP concerned, and we will be done. We first show the contractive property.
Lemma 11.53 When $\rho^k$ is fixed to any constant $\breve\rho \in \mathbb{R}$, the transformation $F(\cdot)$ defined in (11.127) is contractive with respect to a weighted max norm.
$$F\vec Q_1^k(i,a) - F\vec Q_2^k(i,a) = \sum_{j\in S} p(i,a,j)\left[\max_{b\in A(j)} Q_1^k(j,b) - \max_{b\in A(j)} Q_2^k(j,b)\right].$$
Then, for any $(i,a)$-pair:
$$\begin{aligned}
\left|F\vec Q_1^k(i,a) - F\vec Q_2^k(i,a)\right| &\le \sum_{j\in S} p(i,a,j)\left|\max_{b\in A(j)} Q_1^k(j,b) - \max_{b\in A(j)} Q_2^k(j,b)\right|\\
&\le \sum_{j\in S} p(i,a,j)\max_{b\in A(j)}\left|Q_1^k(j,b) - Q_2^k(j,b)\right|\\
&\le \sum_{j\in S} p(i,a,j)\max_{j\in S,\,b\in A(j)}\left|Q_1^k(j,b) - Q_2^k(j,b)\right|\\
&= \sum_{j\in S} p(i,a,j)\,\upsilon(j,b)\,||\vec Q_1^k - \vec Q_2^k||_\upsilon \quad \text{for any } b\in A(j)\\
&\le \vartheta\,\upsilon(i,a)\,||\vec Q_1^k - \vec Q_2^k||_\upsilon
\end{aligned}$$
with $0 \le \vartheta < 1$, where the last but one line follows from the definition of the weighted max norm (see Appendix) and the last line from Lemma 11.43 and the definition of $\vartheta$ in Eq. (11.97). Then, we have that
$$\frac{\left|F(\vec Q_1^k)(i,a) - F(\vec Q_2^k)(i,a)\right|}{\upsilon(i,a)} \le \vartheta\,||\vec Q_1^k - \vec Q_2^k||_\upsilon.$$
Via usual arguments,
$$||F\vec Q_1^k - F\vec Q_2^k||_\upsilon \le \vartheta\,||\vec Q_1^k - \vec Q_2^k||_\upsilon.$$
$$Q^{k+1}(i,a) = \bar r(i,a) - \tilde\rho\,\bar t(i,a) + \sum_{j\in S} p(i,a,j)\, I(j \ne i^*)\max_{b\in A(j)} Q^k(j,b) \quad \forall (i,a).$$
9.2.4 Q-P-Learning
We now analyze the Q-P -learning algorithm for average reward
SMDPs. As usual, in the case of algorithms based on policy itera-
tion, our analysis will be restricted to the policy evaluation phase. We
begin with analyzing the CF-version. The equation that this algo-
rithm seeks to solve is the η-version of the Bellman policy equation for
a given policy μ̂, i.e., Eq. (7.36).
Proposition 11.54 Assume the step sizes and the action selection satisfy Conditions 3, 6, and 7 of Theorem 11.21. Further assume that Assumption 7.2 (Chap. 7) is true for the SMDP concerned and $\eta$ is chosen such that $\eta \in (\bar\eta, 1)$. Then, with probability 1, the sequence of iterates generated within Step 3 of the CF-version of Q-P-Learning for average reward SMDPs, $\{\vec Q^n\}_{n=1}^{\infty}$, converges to the unique solution of Equation (7.36), i.e., to the value function vector associated with policy $\hat\mu$.
Proof The proof will be very similar to that of Q-P-Learning for discounted reward MDPs. In Step 2, an estimate of the average reward of the policy being evaluated, $\hat\mu$ (the policy is contained in the current values of the P-factors), is generated. The transformation $F(\cdot)$ for this algorithm will be as follows:
$$F\vec Q^n(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\left[r(i,a,j) - \rho_{\hat\mu}\,\bar t(i,a,j) + \eta Q^n(j,\mu(j))\right].$$
Proof The proof will be along the lines of that for Q-P-Learning for discounted reward MDPs (see Proposition 11.40). We will first show that the transformation $F(\cdot)$ is contractive. We define $F(\cdot)$ as follows:
$$F\vec Q^n(i,a) = \sum_{j\in S} p(i,a,j)\left[r(i,a,j) - \rho_{\hat\mu}\,\bar t(i,a,j) + Q^n(j,\mu(j))\right].$$
Then, for any $(i,a)$-pair:
$$\left|F\vec Q_1^n(i,a) - F\vec Q_2^n(i,a)\right| \le \sum_{j\in S} p(i,a,j)\left|Q_1^n(j,\mu(j)) - Q_2^n(j,\mu(j))\right|$$
$$\begin{aligned}
&\le \sum_{j\in S} p(i,a,j)\max_{j\in S,\,b\in A(j)}\left|Q_1^n(j,b) - Q_2^n(j,b)\right|\\
&= \sum_{j\in S} p(i,a,j)\,\upsilon(j,b)\,||\vec Q_1^n - \vec Q_2^n||_\upsilon \quad \text{for any } b\in A(j)\\
&\le \vartheta\,\upsilon(i,a)\,||\vec Q_1^n - \vec Q_2^n||_\upsilon
\end{aligned}$$
with $0 \le \vartheta < 1$, where the last but one line follows from the definition of the weighted max norm (see Appendix) and the last line from Lemma 11.43 and the definition of $\vartheta$ in Eq. (11.97). Then, we have that
$$\frac{\left|F(\vec Q_1^n)(i,a) - F(\vec Q_2^n)(i,a)\right|}{\upsilon(i,a)} \le \vartheta\,||\vec Q_1^n - \vec Q_2^n||_\upsilon.$$
Via usual arguments,
$$||F\vec Q_1^n - F\vec Q_2^n||_\upsilon \le \vartheta\,||\vec Q_1^n - \vec Q_2^n||_\upsilon.$$
Let the optimal policy be denoted by $\hat\mu^*$. For the convergence result, we define $F(\cdot)$ as follows:
$$F\vec Q^k(i,s,a) = \sum_{j\in S} p(i,s,a,j,s+1)\left[r(i,s,a,j,s+1) + \max_{b\in A(j,s+1)} Q^k(j,s+1,b)\right].$$
Proposition 11.57 When the step sizes and action selection used in the algorithm satisfy Conditions 3, 6, and 7 of Theorem 11.21, all the states in the system are recurrent under every policy, and there is a unique starting state, then, with probability 1, the sequence of policies generated by the finite-horizon Q-Learning algorithm, $\{\hat\mu^k\}_{k=1}^{\infty}$, converges to $\hat\mu^*$.
which is of the standard form. Then, we can invoke the following ODE as in Condition 4 of Theorem 11.21:
$$\frac{d\vec q}{dt} = F'(\vec q), \tag{11.129}$$
11. Conclusions
This chapter was meant to introduce the reader to some basic re-
sults in the convergence theory of DP and RL. While the material
was not meant to be comprehensive, it is hoped that the reader has
gained an appreciation for the formal ideas underlying the convergence
theory. Our goal for DP was to show that the solutions of the Bell-
man equation are useful and that the algorithms of policy and value
iteration converge. In RL, our goal was very modest—only that of pre-
senting convergence of some algorithms via key results from stochastic
approximation theory (based on ODEs and two-time-scale updating).
DP theory. Our accounts, which show that a solution of the Bellman optimality equation for both the discounted and the average reward cases is indeed optimal, follow Vol II of Bertsekas [30]. The convergence of value iteration, via the fixed point
Vol II of Bertsekas [30]. The convergence of value iteration, via the fixed point
theorem, is due to Blackwell [42]. The convergence proof for policy iteration for
discounted reward, presented in this book, follows from Vol II of Bertsekas [30]
and the references therein. The analysis of policy iteration for average reward
is from Howard [144]. The discussion on span semi-norms and the statement of
Theorem 11.15 is from Puterman [242]. Our account of convergence of value it-
eration and relative value iteration for average reward is based on the results in
Gosavi [119].
RL theory. For RL, the main result (Proposition 11.21) related to synchronous
conditions follows from [46, 136, 49]; see [48] for a textbook-based treatment of this
topic. Two-time-scale asynchronous convergence result is based on results from
Borkar [45, 46].
Convergence of Q-Learning has appeared in a number of papers. Some of the ear-
liest proofs can be found in [300, 151, 290]. A proof based on ODEs was developed
later in [49], which used a result from [46]. The ODE analysis requires showing
boundedness of iterates. In our account, the proof based on basic principles for
showing boundedness of Q-Learning is from [111]. See also [300] for yet another
boundedness proof for Q-Learning. The general approach to show boundedness
(that works for many RL algorithms and has been used extensively here) is based
on showing the link between a contraction and a globally asymptotically stable
equilibrium of the ODE concerned (Theorem 11.22 above is from [48]) and a link
between the asymptotically stable equilibrium and boundedness (Theorem 11.23
above is from [49]). The eigenvalue-based analysis for showing boundedness, which
exploits these results, can be found in [119]. The analysis of finite convergence of
Q-Learning is from [109]. The convergence of Relative Q-Learning can be found in
[2, 49].
The "lockstep" condition in Condition 4b of Proposition 11.49 is not our original work; it can be found in many two-time-scale algorithms, and has been explicitly used in the proofs of an SSP algorithm in [2] and R-SMART [119]. However, it was never presented in the general format that we present here, which makes it a candidate for application in two-time-scale stochastic approximation algorithms; when the condition holds, it should further ease the analysis of a two-time-scale algorithm. The derivative condition in Condition 4b was formally used in [119], but is also exploited (indirectly) in [2].
SSP. The connection between the SSP and the MDP was made via a remark-
able result in Bertsekas [30, vol I]. The result connecting the SSP to the SMDP
(Proposition 11.44), which is the basis of the SSP-versions of R-SMART and Q-
P -Learning, is from Gosavi [119]. The analysis of R-SMART for the SSP-version
and the CF-version can be found in [119]. An analysis which assumed that ρ starts
in the vicinity of ρ∗ can be found in [110]. The convergence of the SSP-versions
and regular versions of Q-P -learning for average reward can be collectively found
in Gosavi [109, 118]. The contraction property of the SSP’s transformation, shown
here via Lemma 11.53 for the Bellman optimality equation and Lemma 11.56 for
the Bellman policy equation, are extensions of results for the value function from
[33] to the Q-factor and are based on [109]. Lemma 11.43 is also a Q-factor exten-
sion of a result in [33], and can be found in [109]. Lemma 11.46 for SMDPs, based
on the renewal reward theorem, is from Gosavi [119]. Our analysis of finite-horizon
Q-Learning, under the conditions of a unique starting state and proper policies, is
based on the contraction argument in Lemma 11.53. See [332] for a more recent
analysis of the SSP in which some of these conditions can be relaxed. See also [38]
for an analysis of the finite horizon algorithm under conditions weaker than those
imposed here.
Miscellaneous. The convergence of API for MDPs (discounted reward) and
Q-P -Learning for MDPs (average and discounted reward) is from [120]. See [33]
and references therein for convergence analysis with function approximation. The
convergence of SARSA has been established in [277].
Chapter 12
CASE STUDIES
1. Chapter Overview
In this chapter, we will describe some case studies related to
simulation-based optimization. We will provide a general description
of the problem and of the approach used in the solution process. For
more specific numeric details, the readers are referred to the references
provided. We present three case studies for model-free simulation
optimization related to airline revenue management, preventive main-
tenance of machines, and buffer allocation in production lines in detail.
We also present a heuristic rule in each case, which can be used for
benchmarking the simulation-optimization performance. Such heuris-
tics are typically problem-dependent and may produce high-quality
solutions. Also, without a benchmark, it is difficult to gage the
performance of the simulation-optimization algorithm on large-scale
problems where the optimal solution cannot be determined. We enu-
merate numerous other case studies, pointing the reader to appropriate
references for further reading.
and the key to making profit lies in controlling seat allocation and
overbooking properly. Our discussion here will follow [123, 127, 109].
It is known to airlines that not every customer has the same
expectation from the service provided. For instance, some customers
like direct flights, while some are willing to fly with a few stopovers
if it means a cheaper ticket. More importantly, some customers book
tickets considerably in advance of their journey, while others (usually
business-related travelers) tend to book a few days before the flight's
departure. Airline companies take advantage of these differences
by selling seats of a flight at different prices. Thus for instance, a
customer who desires fewer stopovers or arrives late in the booking
process is charged a higher fare. A customer (generally a business
traveler) who needs a ticket that is refundable, usually because of a
higher likelihood of the cancellation of his/her trip, is also charged a
higher fare.
All of the above factors lead to a situation where airlines internally
(without telling the customers) divide passengers into different fare
classes or products based on their needs and the circumstances.
Passengers within the same fare class (or product) pay the same
(or roughly the same) fare.
It makes business sense to place upper limits on the number of seats
to be sold in each fare class. This ensures that some seats are reserved
for higher fare-class passengers (who provide higher revenues), who
tend to arrive late in the booking period. Also, demand for the lower
fare classes is usually higher than that for the higher fare classes, and
unless some limits are imposed, the plane is likely to be occupied
primarily by lower fare-class passengers.
This is an undesirable situation for airline profits. On the other hand,
if the limits imposed are very high, it is quite possible that the plane
will not be full at takeoff. Thus finding the right values for these limits
becomes an important problem for the carrier.
The “overbooking” aspect adds to the complexity of this problem.
Some customers cancel their tickets and some fail to show up at the
flight time (no-shows). As a result, airlines tend to overbook (sell more
tickets than the number of seats available) flights, anticipating such
events. This can minimize the chances of flying planes with empty
seats. It may be noted that a seat in an airplane, like a hotel room
or vegetables in a supermarket, is a perishable item, and loses all its
value as soon as the flight takes off. However, the downside of excessive
overbooking is the risk of not having sufficient capacity for all the
ticket-holders at takeoff. When this happens, i.e., when the number of
ticket-holders exceeds the seats available, the airline must deny boarding
to some customers, which typically entails compensation costs and a loss
of goodwill.
Figure 12.1. A typical hub-and-spoke network in which Chicago serves as the hub
and Miami, Boston, and Denver serve as the spokes
where Yi denotes the random number of requests for class i that will
arrive during the booking horizon, Sji denotes the number of seats to
be protected from class j for a higher class i, Vi and Vj are the fares
where C is the capacity of the plane. The booking limit for class n (the
highest class) is C, since these customers are always desirable. Cancellations
and no-shows are accounted for by multiplying the capacity of the aircraft
by an overbooking factor. Thus, if C is the capacity of the flight and Cp is
the probability of cancellation, then the modified capacity of the aircraft is
calculated to be C/(1 − Cp), which replaces C in the calculations above. This
is done so that at takeoff the expected number of customers present roughly
equals the capacity of the plane. For example, C = 100 and Cp = 0.1 yield a
modified capacity of about 111 seats.
EMSR-b. A variant of the above heuristic, called EMSR-b (also
credited to Belobaba; see also [27]), is also popular in the literature.
Under some conditions, it can outperform EMSR-a (see [294] for a
comprehensive treatment of this issue). In this heuristic, the attempt
is to convert the n-fare-class problem, for each class, into a problem
with only two classes during the solution. Let
$$\bar{Y}_i \equiv \sum_{j=i}^{n} Y_j$$
denote the sum of the random demands of all classes above i and
including i (again note that a higher class provides higher revenue by
our convention). Then, we can define a so-called aggregate revenue for
the ith class to be the weighted mean revenue of all classes above i
and including i as follows:
$$\bar{V}_i = \frac{\sum_{j=i}^{n} V_j \, E[Y_j]}{\sum_{j=i}^{n} E[Y_j]}.$$
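To make the aggregation concrete, here is a minimal Python sketch (with
hypothetical fares, demand moments, and capacity) that computes V̄i for each
class and then applies the standard Littlewood-type rule to the aggregated
two-class problem, assuming independent, normally distributed demands; the
precise protection-level step of EMSR-b appears on pages omitted from this
excerpt. The overbooking-adjusted capacity C/(1 − Cp) described earlier is
used in place of the physical capacity.

import math
from statistics import NormalDist

# Hypothetical inputs: fares V_1 <= ... <= V_n (class n highest, per the
# convention above) and mean/std of each class demand Y_i.
fares = [100.0, 160.0, 250.0, 420.0]
mu    = [60.0, 40.0, 25.0, 10.0]
sd    = [20.0, 15.0, 10.0, 5.0]

C, Cp = 180, 0.10                  # physical capacity, cancellation probability
cap = C / (1.0 - Cp)               # overbooking-adjusted capacity

z = NormalDist()
n = len(fares)
for i in range(1, n):              # aggregate classes i..n-1 (0-indexed)
    w = sum(mu[i:])
    V_bar = sum(V * m for V, m in zip(fares[i:], mu[i:])) / w  # aggregate revenue
    sd_bar = math.sqrt(sum(s * s for s in sd[i:]))             # independence assumed
    # Littlewood-type rule on the aggregated two-class problem:
    # protect y seats, where Pr(Ybar_i > y) = V_{i-1} / V_bar.
    y = w + sd_bar * z.inv_cdf(1.0 - fares[i - 1] / V_bar)
    print(f"protect {y:5.1f} seats for classes {i}..{n - 1}; "
          f"booking limit below: {max(cap - y, 0.0):5.1f}")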
where Cl denotes the capacity of the plane on the lth leg, n denotes
the number of products, and L denotes the total number of legs. The
value of zj , the decision variable, could be used as the booking limit
for product j. However, this value can be significantly improved upon
by combining the results of this LP with EMSR-b, as discussed next.
The Displacement Adjusted REvenue (DARE) for the jth product
that uses leg l, i.e., DAREjl , is computed as follows. For j = 1, 2, . . . , n
and every l ∈ Dj ,
$$\text{DARE}_{jl} = V_j - \sum_{i \neq l;\; i \in D_j} B_i,$$
where Bi denotes the dual (shadow) price associated with the ith
capacity constraint (see (12.1)) in the LP. Then DAREjl can be treated
as the virtual revenue of product j on leg l, i.e., Vjl = DAREjl.
Finally, EMSR-b is employed on each leg separately, treating the
virtual revenue as the actual revenue. On every leg, products that are
relevant may have to be re-ordered according to their DARE values;
the higher the DARE value, the higher the class (according to our con-
vention). The demand for each relevant product on every leg has to be
determined from that of individual products. Then EMSR-b is applied
on each leg in the network. If the booking limit for product j on leg l
is denoted by BL^l_j, a customer requesting a given product is accepted
if and only if the conditions with respect to all the relevant booking
limits are satisfied: if φj(t) denotes the number of seats sold for product
j by time t in the booking horizon, then a request for product j is
accepted if φj(t) < BL^l_j for every leg l used by product j; otherwise
the request is rejected. It is thus entirely possible that a customer meets
the above condition on one leg but not on another; if the condition fails
on any leg, the customer is rejected.
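A minimal Python sketch of this acceptance rule, with hypothetical data
structures standing in for Dj, the booking limits BL^l_j, and φj(t):

# legs_of[j] plays the role of D_j (legs used by product j), BL[(j, l)]
# holds the booking limit BL^l_j, and sold[j] is phi_j(t), seats sold so far.

def accept(j, legs_of, BL, sold):
    # Accept product j only if its booking limit holds on every leg it uses.
    return all(sold[j] < BL[(j, l)] for l in legs_of[j])

legs_of = {0: {0}, 1: {0, 1}}                    # product 1 uses two legs
BL = {(0, 0): 50, (1, 0): 30, (1, 1): 40}
sold = {0: 12, 1: 30}
print(accept(0, legs_of, BL, sold))              # True: 12 < 50
print(accept(1, legs_of, BL, sold))              # False: 30 < 30 fails on leg 0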
3. Preventive Maintenance
Preventive maintenance has acquired a special place in modern
manufacturing management with the advent of the so-called “lean”
philosophy. According to the lean philosophy, an untimely breakdown
of a machine is viewed as a source of muda, a Japanese term for waste.
Indeed, an untimely breakdown of a machine can disrupt production
schedules and reduce production rates. If a machine happens to be a
bottleneck, it is especially important that it be kept in a working state
almost all the time. Total Productive Maintenance, like many other
management philosophies, relies on the age-old reliability principle,
which states that if a machine is maintained in a preventive manner,
its up-time (availability) is raised.
Figure 12.2. The graph of the total cost (TC) against the time of maintenance
shows that there is an optimal time to maintain
factory has a single machine), or else the next machine could serve as
the customer. The demand, when it arrives, depletes the buffer by 1,
while the machine, when it produces 1 unit, fills the buffer by 1 unit.
There is a limit to how much the buffer can hold. When this limit is
reached, the machine goes on vacation (stops working) and remains on
vacation until the buffer level drops to a predetermined level. The time
for producing a part is a random variable; the time between failures,
the time for a repair, the time between demand arrivals, and the time
for a maintenance are also random variables. The age of the machine
can be measured by the number of units produced since the last repair
or maintenance. (The age can also be measured by the time elapsed
since the last repair or maintenance.)
Figure 12.3. The machine fills the buffer with the product, while the demand
empties it
$$g(T) = \frac{E[C]}{E[\theta]},$$
where C is the (random) cost in one (renewal) cycle and θ is the (random)
time consumed by one (renewal) cycle. Let X denote the time to failure of
the machine. Then E[C] can be written as:
where f(x) denotes the pdf of the random variable X, Tm denotes the
mean time for maintenance, and Tr denotes the mean time for repair.
The age-replacement heuristic was used as the benchmarking technique
in [72] and [124] because of its robust performance on complex
systems.
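To make the renewal-reward computation concrete, the following Python
sketch estimates g(T) for the age-replacement heuristic by Monte Carlo,
using the ratio-of-means estimator of E[C]/E[θ]. The cycle logic (repair at
cost Cr with mean downtime Tr if failure occurs before age T; preventive
maintenance at cost Cm with mean downtime Tm otherwise) is the standard
age-replacement model; the numerical parameters are borrowed loosely from
the case reported below and are otherwise illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: repair/maintenance costs, mean repair and
# maintenance downtimes, and an Erl(8, 0.08) time to failure (mean 100).
C_r, C_m = 5.0, 2.0
T_r, T_m = 200.0, 12.5
shape, rate = 8, 0.08

def cost_rate(T, n_cycles=200_000):
    # Monte Carlo estimate of g(T) = E[C] / E[theta] under age replacement at T.
    X = rng.gamma(shape, 1.0 / rate, size=n_cycles)  # Erlang = gamma, integer shape
    failed = X < T
    cost = np.where(failed, C_r, C_m)                # repair if failed, else maintain
    time = np.where(failed, X + T_r, T + T_m)        # cycle length (mean downtimes)
    return cost.sum() / time.sum()                   # ratio-of-means estimator

for T in (40.0, 60.0, 80.0, 100.0, 120.0):
    print(T, round(cost_rate(T), 5))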
Computational results. We now present some computational
results obtained with R-SMART on a small-scale problem. The buffer
uses an (S, s) policy, i.e., when the inventory in the buffer reaches S,
the machine goes on vacation and remains on vacation until the inventory
falls to s. The production time, the time between failures, and the time
for repair all have the Erlang distribution Erl(n, λ), whose mean is n/λ
(see Appendix). The time for maintenance has a uniform distribution,
Unif(a, b). The inter-arrival time for the demand has the exponential
distribution Expo(μ), whose mean equals 1/μ.
We show details for one case: (S, s) = (3, 2), with Expo(1/10) for the
time between arrivals, Erl(8, 0.08) for the time between failures,
Erl(2, 0.01) for the repair time, Erl(8, 0.8) for the production time,
and Unif(5, 20) for the maintenance time; Cr = $5, Cm = $2, and the
profit per unit of demand sold is $1. The CF-version of R-SMART with
η = 0.999 produced a near-optimal solution of ρ = $0.033 per unit time.
The policy turns out to have a threshold nature (a concept defined in
the parametric optimization model), with the following thresholds:
c1 = 5, c2 = 6, and c3 = 7; when the buffer is empty, the action is
always to produce. Both R-SMART and SMART have been tested successfully
on this and numerous other cases [72, 110].
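The reported threshold policy can be encoded compactly. A minimal sketch
follows, assuming (this is our reading, not spelled out in the text) that
c_b is an age threshold for buffer level b: the machine is maintained once
its age reaches c_b, and always produces when the buffer is empty.

# Hypothetical encoding of the reported threshold policy; the mapping of
# thresholds c_1 = 5, c_2 = 6, c_3 = 7 to buffer levels is our assumption.
thresholds = {1: 5, 2: 6, 3: 7}

def action(buffer_level, age):
    # Always produce when the buffer is empty.
    if buffer_level == 0:
        return "produce"
    return "maintain" if age >= thresholds[buffer_level] else "produce"

print(action(0, 9), action(2, 6), action(3, 5))  # produce maintain produce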
In other words, products move in a line from one machine to the next.
Kanbans reduce the risk of overproduction, minimize inventory, and
maximize flow in the production shop.
Consider Fig. 12.4. In between the first and the second machine,
there exists a buffer (kanban) or container. When the machine preceding
a buffer completes its operation, the product (essentially a batch
of parts) goes into the buffer. The machine following the buffer gets its
supply from this buffer, i.e., the preceding buffer in Fig. 12.4. The first
machine gets its supply from raw material, which we assume is always
available. In a kanban-controlled system, there is a limit on the buffer
size. When the limit is reached, the previous machine is not allowed
to produce any parts until there is space in the buffer. The previous
machine is then said to be blocked. If a machine suffers from lack of
material, due to an empty buffer, it is said to be starved. A machine
is idle when it is starved or blocked. Idleness in machines may reduce
the throughput rate of the line, i.e., the number of batches produced
by the line in unit time.
Figure 12.4. A production line: raw material enters at the first machine,
machines are separated by buffers, and the finished product emerges at the end
$$\text{Maximize } f(\vec{x}) \text{ such that } \sum_{i=1}^{k} x(i) = M \text{ or } \sum_{i=1}^{k} x(i) \leq M \qquad (12.2)$$
where L(i) is the mean lead time (production time on the machine plus
waiting time in the queue in front of the machine) of a batch (which is
made up of one or more parts) on machine i, λ is the rate of demand
for the batch from the line, and Ω ≥ 1 (e.g., Ω = 2) is a factor of
safety.
(In many texts, the phrase “number of kanbans” is used to mean
buffer size. Thus, x(i) denotes the number of kanbans for machine i.
Note that one kanban is generally associated with a batch. Thus if n(i)
denotes the number of parts associated with a kanban on machine i,
$$C_a^2 = \frac{\sigma_a^2}{(1/\lambda)^2}; \qquad C_s^2 = \frac{\sigma_s^2}{(1/\mu)^2}.$$
Using Marchal’s approximation [198], one has that the mean waiting
time in the queue in front of the machine for a batch is:
Using the above approximation, one can determine C² for the inter-arrival
time and the production time at every machine, which allows one to use
Marchal's approximation to determine the lead time L(i) at the ith machine
for each i. Then, Eq. (12.4) yields the buffer size for the machine concerned.
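Since both Marchal's waiting-time expression and Eq. (12.4) fall on pages
omitted from this excerpt, the Python sketch below is only a stand-in: it
uses the well-known Kingman-style G/G/1 approximation
Wq ≈ [(Ca² + Cs²)/2] · [ρ/(1 − ρ)] · (1/μ) for the waiting time, and sizes
the buffer as x(i) = ⌈Ω λ L(i)/n(i)⌉, a common kanban-sizing rule consistent
with the quantities named above; all machine data are hypothetical.

import math

# Hypothetical line data: demand rate lam and inter-arrival-time variance,
# and for each machine a (mu, service-time variance, parts per kanban) triple.
lam, var_a = 0.08, 100.0
machines = [(0.10, 60.0, 1), (0.12, 40.0, 1), (0.09, 80.0, 1)]
Omega = 2.0                                   # factor of safety

Ca2 = var_a / (1.0 / lam) ** 2                # squared coeff. of variation, arrivals
for i, (mu, var_s, n_i) in enumerate(machines, start=1):
    Cs2 = var_s / (1.0 / mu) ** 2             # squared coeff. of variation, service
    rho = lam / mu
    # Kingman-style G/G/1 waiting-time approximation (stand-in for Marchal's):
    Wq = ((Ca2 + Cs2) / 2.0) * (rho / (1.0 - rho)) * (1.0 / mu)
    L_i = Wq + 1.0 / mu                       # mean lead time at machine i
    x_i = math.ceil(Omega * lam * L_i / n_i)  # kanbans (buffer size) at machine i
    print(i, round(L_i, 2), x_i)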
Machine    α      β      γ
1          0.25   0.1    0.01
2          0.2    0.3    0.02
3          0.3    0.5    0.04
6. Conclusions
We conclude with some final thoughts on computational aspects of
both parametric and control optimization.
Parametric optimization. From our review of the literature, the
genetic algorithm appears to be one of the most popular techniques
for discrete problems, although there is no general agreement on which
algorithm is best. It is the oldest algorithm in this family, and it is
relatively easy to code, which could be one reason for its popularity.
Tabu search has also seen a large number of useful applications. Stochastic
adaptive search techniques [333] have considerable theoretical backing.
Although their convergence guarantees are asymptotic (in the limit), so
that they often generate sub-optimal solutions in finite time, their
solutions are usually of good quality. In comparison to classical RSM,
stochastic adaptive search and meta-heuristics sometimes take less
computational time. In the field of continuous optimization, simultaneous
perturbation appears to be a remarkable development. Compared to finite
differences, it usually takes less computational time. It does get trapped
in local optima and may hence require multiple starts. We must remember
that the parametric optimization techniques developed here do require
fine-tuning of several parameters, e.g., temperature, tabu-list length,
step sizes, or other scalars, in order to obtain the best behavior, which
can increase computational time.
Control optimization. In this area, we covered RL and stochastic
policy search techniques. The use of both methods on large-scale prob-
lems started in the late 1980s and continues to grow in popularity. Getting
an RL algorithm to work on a real-life case study usually requires
that the simulator be written in a language such as C or MATLAB, so
that RL-related functions and function-approximation routines can be
incorporated into the simulator. The reason is that in RL, unlike in
parametric optimization, the function is not evaluated at fixed
where f(x) denotes the so-called probability density function or pdf of X. Clearly,
then

$$\Pr(X = x_i) = \int_{x_i}^{x_i} f(x)\,dx = 0.$$
As a result, we can obtain the pdf from the cdf by differentiation as follows:

$$f(x) = \frac{dF(x)}{dx}.$$
The mean, expected value, or average value of a continuous random variable
defined over an interval (a, b) is defined as:

$$E[X] = \int_a^b x f(x)\,dx.$$
The positive square root of the variance is called the standard deviation of the
random variable and is often denoted by σ:

$$\sigma(X) = \sqrt{\text{Var}[X]}.$$
Geometric: If the random variable denotes the number of trials required until
the first success, where the probability of success in each trial is p, then it
is said to have the geometric distribution.
Poisson: For the Poisson distribution with parameter λ,

$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots; \qquad E[X] = \lambda; \quad \text{Var}[X] = \lambda.$$
Note that the mean and variance of the Poisson distribution are equal.
Continuous Distributions: We now discuss some well-known continuous
distributions.
Uniform: X has Unif(a, b) implies that

$$f(x) = \frac{1}{b-a} \;\text{ if } a \leq x \leq b, \quad \text{and } f(x) = 0 \text{ otherwise}.$$
Normal: For the normal distribution,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Note that μ and σ are parameters used to define the pdf. It turns out that
E[X] = μ and Var[X] = σ².
Erlang distribution: Erl(n, λ), where n is a positive integer, has the following
properties:

$$f(x) = \frac{\lambda^n x^{n-1} e^{-\lambda x}}{(n-1)!} \quad \text{for } x \geq 0; \qquad E[X] = \frac{n}{\lambda}; \quad \text{Var}[X] = \frac{n}{\lambda^2}.$$
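Since Erl(n, λ) is, by construction, the sum of n independent Expo(λ) random
variables, these moments are easy to verify numerically; the following Python
check (a simple sketch with an arbitrary seed and sample size) confirms them
for the Erl(8, 0.08) distribution used in the maintenance case study earlier.

import numpy as np

# Numerical check of the Erlang moments: Erl(n, lam) is the sum of n
# independent Expo(lam) variables, so the sample mean and variance should
# be close to n/lam and n/lam^2.
rng = np.random.default_rng(1)
n, lam = 8, 0.08
samples = rng.exponential(1.0 / lam, size=(200_000, n)).sum(axis=1)
print(round(samples.mean(), 1), n / lam)       # both near 100.0
print(round(samples.var(), 0), n / lam ** 2)   # both near 1250.0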
$$\nu(A) = \max_i(|\psi_i|)$$
Weighted max norm:

$$||\vec{x}||_\upsilon = \max_i \frac{|x(i)|}{\upsilon(i)},$$

where |a| denotes the absolute value of a ∈ ℝ and $\vec{\upsilon} = (\upsilon(1), \upsilon(2), \ldots, \upsilon(N))$
denotes a vector of weights such that all weights are positive.
Contraction with respect to a weighted max norm: A mapping (or
transformation) F is said to be a contraction mapping in ℝⁿ with respect to a
weighted max norm if there exists a λ, where 0 ≤ λ < 1, and a vector $\vec{\upsilon}$ of n
positive components such that

$$||F\vec{v} - F\vec{u}||_\upsilon \leq \lambda\, ||\vec{v} - \vec{u}||_\upsilon \quad \text{for all } \vec{v}, \vec{u} \in \mathbb{R}^n.$$
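A standard fact (stated here as background, not taken from the text) is that
for the weighted max norm the induced norm of a matrix A is
max_i (1/υ(i)) Σ_j |A_ij| υ(j); if this quantity is below 1 for some positive
weight vector, the affine map F(x) = Ax + b is a contraction in the sense just
defined. A small Python illustration with an arbitrary matrix and weights:

import numpy as np

A = np.array([[0.5, 0.2],
              [0.1, 0.6]])
b = np.array([1.0, -2.0])
v = np.array([1.0, 2.0])           # positive weights

def wmax_norm(x, v):
    # ||x||_v = max_i |x(i)| / v(i)
    return np.max(np.abs(x) / v)

lam = np.max((np.abs(A) @ v) / v)  # induced norm = contraction modulus
print("modulus:", lam)             # 0.9 < 1 here, so F is a contraction

rng = np.random.default_rng(0)
u, w = rng.normal(size=2), rng.normal(size=2)
lhs = wmax_norm((A @ w + b) - (A @ u + b), v)
print(lhs <= lam * wmax_norm(w - u, v))  # True, as the definition requires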
Bibliography
[17] A.G. Barto, S.J. Bradtke, S.P. Singh, Learning to act using
real-time dynamic programming. Artif. Intell. 72, 81–138
(1995)
[146] J. Hu, M.C. Fu, S.I. Marcus, A model reference adaptive search
method for global optimization. Oper. Res. 55, 549–568 (2007)
[147] J. Hu, M.C. Fu, S.I. Marcus, A model reference adaptive search
method for stochastic global optimization. Commun. Inf. Syst.
8, 245–276 (2008)
[148] J. Hu, M.P. Wellman, Nash Q-Learning for general-sum
stochastic games. J. Mach. Learn. Res. 4, 1039–1069 (2003)
[149] D. Huang, T.T. Allen, W.I. Notz, N. Zeng, Global optimization
of stochastic black-box systems via sequential kriging
meta-models. J. Global Optim. 34, 441–466 (2006)
[150] S. Ishii, W. Yoshida, J. Yoshimoto, Control of
exploitation-exploration meta-parameter in reinforcement
learning. Neural Netw. 15, 665–687 (2002)
[151] T. Jaakkola, M. Jordan, S. Singh, On the convergence of
stochastic iterative dynamic programming algorithms. Neural
Comput. 6(6), 1185–1201 (1994)
[152] S.H. Jacobson, L.W. Schruben, A harmonic analysis approach
to simulation sensitivity analysis. IIE Trans. 31(3), 231–243
(1999)
[153] A. Jalali, M. Ferguson, Computationally efficient adaptive
control algorithms for Markov chains, in Proceedings of the
29th IEEE Conference on Decision and Control, Honolulu,
1989, pp. 1283–1288
[154] J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-Fuzzy and Soft
Computing (Prentice Hall, Upper Saddle River, 1997)
[155] M.V. Johns Jr., R.G. Miller Jr., Average renewal loss rates.
Ann. Math. Stat. 34(2), 396–401 (1963)
[156] S.A. Johnson, J.R. Stedinger, C.A. Shoemaker, Y. Li, J.A.
Tejada-Guibert, Numerical solution of continuous state
dynamic programs using linear and spline interpolation. Oper.
Res. 41(3), 484–500 (1993)
[157] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement
learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996)
[158] P. Kanerva, Sparse Distributed Memory (MIT, Cambridge,
MA, 1988)
[313] P.J. Werbos, Beyond regression: new tools for prediction and
analysis in the behavioral sciences, PhD thesis, Harvard University,
Cambridge, MA, May 1974
Index

Symbols
cdf, 473, 474
pdf, 474
pmf, 473

A
accumulation point, 297
acronyms, 11
Actor Critics, 276
  MDPs, 278
Actor Critics: MDPs, 277
age replacement, 462
airline revenue management, 451
approximate policy iteration, 225, 229, 266
  ambitious, 231
  conservative, 230, 417

B
backpropagation, 53
backtracking adaptive search
  convergence, 341
  algorithm, 110
backward recursion, 192
basis functions, 253
Bayesian Learning, 103
behavior, 13
Bellman equation, 151, 167
  average reward, 151
  discounted reward, 162
  optimality proof, 358
  optimality proof for average reward case, 372
Bellman error, 257, 258, 263
binary trees, 96
Bolzano-Weierstrass, 298
bounded sequence, 293
buffer optimization, 463

C
C language, 470
CAP-I, 229
cardinality, 8
case study, 153, 193, 268
Cauchy sequence, 295
central differences, 77, 327
Chapman-Kolmogorov theorem, 130
closed form, 30, 38
computational operations research, 1
continuous time Markov process, 136
contraction mapping, 303, 304, 360
  weighted max norm, 476
control charts, 468
control optimization, 2, 32
convergence of sequences, 292
  with probability 1, 309
coordinate sequence, 301
cost function, 29
critical point, 310
cross-over, 94
curse of dimensionality, 199
curse of modeling, 198

D
DARE, 457
DAVN, 457
decision-making process, 171
decreasing sequence, 293
DeTSMDP, 170
differential equations, 390
discounted reward, 160
discrete-event systems, 14
domain, 285
dynamic optimization, 2
dynamic programming, 2, 33, 150, 159
dynamic systems, 13

E
EMSR, 200, 455
equilibrium, 310
  asymptotically stable, 312
  globally asymptotically stable, 312
  stable, 312
ergodic, 135
Euclidean norm, 8, 325
exhaustive enumeration, 147, 174
exploration, 216, 226

F
feedback, 215, 271
finite differences, 78, 327
finite horizon, 189
finite horizon problems, 248
fit, 46
fixed point theorem, 307
forward differences, 77, 327
function
  continuous, 320
  continuously differentiable, 320
function approximation, 249, 267
  difficulties, 256
  neural networks, 260
  architecture, 253
function-fitting, 255
function fitting, 47

G
Gauss elimination, 154
Gauss-Seidel algorithm, 166
genetic algorithm, 92
GLIE, 217, 439
global optimum, 60, 322
gradient descent, 49
  convergence, 325

H
H-Learning
  discounted reward, 245
heuristic, 199, 455
hyper-plane, 43

I
identity matrix, 8
immediate reward, 143
increasing sequence, 293
incremental, 51
induction, 287
infinity norm, 284
inverse function method, 19
irreducible, 135

J
jackknife, 68
jumps, 125

K
kanban, 464
Kim-Nelson method, 88

L
LAST, 100
  convergence, 340
layer
  hidden, 53
  input, 53
  output, 53
learning, 214
learning automata
  control optimization, 269
  parametric optimization, 100
learning rate, 271
limiting probabilities, 132
linear programming, 188, 189
Lipschitz condition, 329
local optimum, 60, 74, 321
loss function, 29
LSTD, 259

M
Manhattan norm, 8
mapping, 286
Markov chain, 129
  embedded, 169
Markov decision problems, 137
  convergence, 351
  reinforcement learning, 197
Markov process, 125
mathematical programming, 2
MATLAB, 470
max norm, 7, 284
MCAT, 269
MDPs, 137, 269
  convergence, 351
  reinforcement learning, 197
MDPs: learning automata, 270
memory-based, 109
memoryless, 126
meta-heuristics, 89
metamodel, 38
model-based, 48, 71
model-building algorithms, 244
model-free, 71
modified policy iteration, 184
monotonicity, 354
multi-starts, 76
multiple comparisons, 86
multiple starts, 76
mutation, 94

N
n-step transition probabilities, 130
n-tuple, 8, 283
natural process, 171
neighbor, 90
neighborhood, 297
Nested Partitions, 116
neural networks, 48
  backpropagation, 53
  in reinforcement learning, 260, 262
  linear, 48
  non-linear, 53
neuro-dynamic programming, 197, 266
Neuro-RSM, 47
neuron, 48, 255
nodes, 53
Non-derivative Methods, 83
non-linear programming, 49, 72
normalization, 271
normed vector spaces, 285
norms, 7, 284
notation, 6
  matrix, 8
  product, 7
  sequence, 9
  sets, 8
  sum, 7
  vector, 7

O
objective function, 29, 144
off-line, 215
on-line, 215
ordinary differential equations, 390
overbook, 452
overfitting, 64, 261, 263

P
parametric optimization, 2, 29
  continuous, 72
  discrete, 85
partial derivatives, 320
performance metric, 144
phase, 106
policy iteration, 182
  average reward MDPs, 152
  convergence proof for average reward case, 379
  convergence proof for discounted case, 367
  discounted reward MDPs, 163
  SMDPs, 176
population-based, 93
preventive maintenance, 459
pure random search
  convergence, 338

Q
Q-factor, 204
  boundedness, 407
  definition, 205
  Policy iteration, 222
  Value iteration, 206
Q-Learning
  convergence, 404
  model-building, 246, 247
  steps in algorithm, 212
  worked-out example, 218
Q-P-Learning
  average reward MDPs, 234
  discounted reward MDPs, 224
  semi-Markov decision problems, 241
queue, 123

R
R-SMART, 233, 237
  Convergence, 438
radial basis function, 255
random process, 123
random system, 14
range, 285
ranking and selection, 86
regression, 40, 255, 263
  linear, 40, 48
  non-linear, 44
  piecewise, 43
regular Markov chain, 131, 344
Reinforcement Learning, 34
  average reward MDPs, 231
  convergence, 400, 404
  Discounted reward MDPs, 211
  finite convergence, 411
  introduction, 197
  MDP convergence, 404
  MDPs, 211
  SMDP convergence, 424
  SMDPs, 234
Relative Q-Learning, 232
relative value iteration, 156
renewal theory, 462
replication, 25
response, 270
response surface method, 1, 37
revenue management, 451
Reward-Inaction Scheme, 271
Rinott method, 87
Robbins-Monro algorithm, 204, 207, 208
RSM, 37

S
sample path approach, 72
SARSA, 227
scalar, 7
seed, 18, 25
Semi-Markov decision problems
  average reward DP, 173
  definition, 169
  discounted reward DP, 180
  reinforcement learning, 237
sequence, 9, 290
sets, 8
sigmoid, 55
simulated annealing
  algorithm, 104
  convergence, 343
simulation, 16
  noise, 347
simultaneous perturbation, 4, 79, 328
SMDPs
  average reward DP, 173
  definition, 169
  discounted reward DP, 180
  Learning Automata, 272
  reinforcement learning, 237
SMDPs: learning automata, 270
SSP, 233, 239, 242, 249, 267, 426
state, 13, 32, 123
state aggregation, 250
static optimization, 2
stationary point, 321
steepest descent, 58, 324
  convergence, 325
step size, 73, 81, 101, 210, 274
stochastic adaptive search, 86, 90
  convergence, 336
Stochastic approximation
  asynchronous convergence, 390
  synchronous convergence, 313
stochastic gradient methods, 83, 329
stochastic optimization, 2
stochastic process, 123
stochastic ruler
  convergence, 349
stochastic ruler: algorithm, 113
stochastic shortest path, 233, 239, 242, 249, 267, 426
stochastic system, 14
straight line, 40
strong law of large numbers, 26
sup-norm, 284
symmetric neighborhood, 344
system, 1, 3, 4, 32, 123
  definition, 13

T
tabu search, 95
  mutations, 95
  tabu list, 95
Taylor series, 322
Taylor's theorem, 322, 330
temperature, 106
thresholding, 55
TPM, 128, 141
transfer line, 463
transformation, 9, 286
transition probability matrix, 128
transition reward matrix, 143
transition time matrix, 170
transpose, 8
trial-and-error, 214
TRM, 143
TTM, 170
two time scales, 238, 277, 434

U
uniformization, 170, 179

V
validation, 68
value iteration, 166, 183
  Average reward MDPs, 154
  convergence for average reward case, 386
  convergence proof for discounted case, 370
  discounted reward MDPs, 164
  SMDPs, 177
vector, 7, 282, 283
vector spaces, 282

W
weighted max norm, 476
Weighted max norm contraction mapping, 476
Widrow-Hoff algorithm, 48

Y
yield management, 451