CONDITIONAL EXPECTATION AND MARTINGALES

1. Introduction
The stochastic processes known as martingales were invented by Joseph Doob a little over
fifty years ago at the University of Illinois. Their introduction changed the face of modern
probability theory. Martingales now play a central role in probabilistic potential theory, in
stochastic calculus, and (as some of you will learn) in mathematical finance. The purpose of
these notes is to give an introduction to martingale theory, and to illustrate how martingales
arise in the study of Markov chains and random walks.

2. The Wald Identities for Random Walks


A martingale is a discrete-time random process {Xn }n≥0 with the property that for every
bounded stopping time τ , the Optional Sampling Formula
(1) EXτ = EX0
is valid. This may be taken as the definition of a martingale, but usually isn’t. The traditional
definition, together with a proof that this definition implies the Optional Sampling Formula,
will be given later.
Some of the simplest and most important martingales occur in connection with sums
of independent, identically distributed random variables (random walks). We shall begin
by investigating several of these, with the object of showing how the Optional Sampling
formula of martingale theory can be used to do concrete calculations of interesting quantities
connected with the behavior of the random walk. Later in the notes we will show that the
Optional Sampling Formula is valid for the martingales that we introduce here.

2.1. Martingales associated with random walks. Let ξ0, ξ1, . . . be independent, identically distributed random variables, and let Sn = ξ1 + ξ2 + · · · + ξn be the nth partial sum.
Denote by µ, σ 2 , and ϕ(θ) the mean, variance, and moment generating function of ξ1 , that
is,
µ = Eξ1 ,
σ 2 = E(ξ1 − µ)2 , and
ϕ(θ) = E exp{θξ1 }.
Corresponding to each of these scalar quantities is a martingale:
(2) Mn := Sn − nµ,
(3) Vn := (Sn − nµ)2 − nσ 2 , and
(4) Zn (θ) := exp{θSn }/ϕ(θ)n .
Observe that there is a separate process Zn (θ) for every real value of θ such that ϕ(θ) < ∞.
It is easily checked that the expectations of these random variables are EMn = 0, EVn = 0,
and EZn (θ) = 1 for all n, and so the Optional Sampling Formula holds at least for constant
stopping times. The Optional Sampling Formulas for the martingales Mn , Vn , and Zn (θ),
with arbitrary stopping times, are known as the Wald Identities.
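As a quick numerical sanity check (not part of the original notes), the following Python sketch simulates a random walk whose steps are standard normal — an arbitrary choice, for which µ = 0, σ² = 1, and ϕ(θ) = exp(θ²/2) — and verifies that the sample means of Mn, Vn, and Zn(θ) stay near 0, 0, and 1 for several values of n.

    import numpy as np

    rng = np.random.default_rng(0)
    num_paths, n_max, theta = 200_000, 20, 0.3

    # Steps xi_1, ..., xi_n: standard normal, so mu = 0, sigma^2 = 1,
    # and the moment generating function is phi(theta) = exp(theta^2 / 2).
    xi = rng.standard_normal((num_paths, n_max))
    S = xi.cumsum(axis=1)                      # S_n for n = 1, ..., n_max
    mu, sigma2 = 0.0, 1.0
    phi = np.exp(theta ** 2 / 2)

    for n in (5, 10, 20):
        Sn = S[:, n - 1]
        M = Sn - n * mu                        # M_n = S_n - n*mu
        V = (Sn - n * mu) ** 2 - n * sigma2    # V_n = (S_n - n*mu)^2 - n*sigma^2
        Z = np.exp(theta * Sn) / phi ** n      # Z_n(theta) = exp(theta*S_n) / phi(theta)^n
        print(n, M.mean(), V.mean(), Z.mean())  # approximately 0, 0, 1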
2.2. The Wald Identities. Recall that a stopping time (relative to the sequence Y0, Y1, . . . )
is a nonnegative-integer-valued random variable τ such that, for each n = 0, 1, 2, . . . , the
event {τ = n} depends only on the values of Y0 , Y1 , . . . , Yn . If Yn is one of the three sequences
Mn , Vn , Zn (θ) defined above, this is equivalent to requiring that the event {τ = n} depend
only on the values ξ1 , ξ2 , . . . , ξn . Thus, the random variables
τ (a) = min{n : Sn ≥ a};
T (b) = min{n : (Sn − nµ)2 ≥ 1.96σ 2 n}
are stopping times. Note that they are not bounded stopping times. It is important in
dealing with unbounded stopping times to realize that, in certain circumstances, the Optional
Sampling Formula may fail to hold. It is also important to keep in mind that any unbounded
stopping time may be approximated, in a certain sense, by bounded stopping times: in
particular, for any stopping time τ (not necessarily bounded) and any fixed nonrandom
integer n ≥ 0, the truncated random variable τ ∧ n is a bounded stopping time. (Exercise:
Prove this!) Moreover, as n → ∞, the truncated times τ ∧ n converge to τ . (Exercise:
Verify this!)
First Wald Identity . Assume that the random variables ξj have finite first moment, and
let µ = Eξ1 . Then for any stopping time τ with finite expectation,
(5) ESτ = µEτ.
Second Wald Identity . Assume that the random variables ξj have finite second moment,
and let µ = Eξ1 and σ² = E(ξ1 − µ)². Then for any stopping time τ with finite expectation,
(6) E(Sτ − µτ)² = σ²Eτ.
Third Wald Identity . Assume that the moment generating function ϕ(θ) = Eeθξ1 of the
random variables ξj is finite at the argument θ. Then for any bounded stopping time τ,
(7) E[exp{θSτ}/ϕ(θ)^τ] = 1.
Note that the hypothesis on the stopping time τ is stronger in the Third Wald Identity
than in the first two. Later we will see an example where equation (7) fails even though
Eτ < ∞.
All three of these theorems will ultimately be deduced from Theorem 1 below. However,
it is not difficult to prove them directly (at least for bounded stopping times τ ), and this is
instructive. We shall limit our attention to the Third Wald Identity.
Proof of the Third Wald Identity. The key to this is that the expectation of a product is the
product of the expectations, provided that the factors in the product are independent. Fix
indices 0 ≤ k < m. The event {τ = k} depends only on the random variables ξ1 , ξ2 , . . . , ξk ,
and so is independent of the random variable ξm. Similarly, the product e^{θSk} 1{τ = k} is independent of Σ_{m=k+1}^{n} ξm. Consequently, by the product rule, for any n ≥ k,
(8) E exp{θSn}1{τ = k} = E exp{θSk} exp{θ(Sn − Sk)}1{τ = k}
                       = E exp{θ(Sn − Sk)} · E exp{θSk}1{τ = k}
                       = ϕ(θ)^{n−k} E e^{θSk} 1{τ = k}.
Here 1F denotes the indicator random variable for the event F , that is, the random variable
that takes the value 1 on F and 0 on F c .
Suppose now that τ is a bounded stopping time, that is, that there is a nonrandom integer
n < ∞ such that τ ≤ n. Then by equation (8),

E[exp{θSτ}/ϕ(θ)^τ] = Σ_{k=0}^{n} E[(exp{θSτ}/ϕ(θ)^τ) 1{τ = k}]
                   = Σ_{k=0}^{n} E[(exp{θSk}/ϕ(θ)^k) 1{τ = k}]
                   = Σ_{k=0}^{n} E[(exp{θSk}/ϕ(θ)^k)(exp{θ(Sn − Sk)}/ϕ(θ)^{n−k}) 1{τ = k}]
                   = Σ_{k=0}^{n} E[(exp{θSn}/ϕ(θ)^n) 1{τ = k}]
                   = E[exp{θSn}/ϕ(θ)^n]
                   = 1.

(The passage from the second line to the third uses equation (8): inserting the factor exp{θ(Sn − Sk)}/ϕ(θ)^{n−k}, which is independent of the other factors and has expectation 1, does not change the expectation.) □

2.3. First Application: The Gambler’s Ruin Problem. Two gamblers, Fats and
Slim, play the following game: Fats repeatedly tosses a fair coin. After each toss that
comes up H, Slim pays Fats one dollar. After each toss that comes up T , Fats pays Slim
one dollar. The game continues until either one or the other gambler runs out of money. If
Fats starts with $A and Slim starts with $B,
(A) What is the probability that, when the game ends, Fats has all the cash?
(B) What is the expected duration of the game?

This may be recast as an optional stopping problem. Let X0 , X1 , . . . be the sequence of


play-by-play increments in Fats’ fortune: thus, Xi = ±1 according to whether the ith toss
is H or T . The total change in Fats’ fortune after n plays is
Sn = Σ_{i=1}^{n} Xi.
The game continues until time τ , where
τ = min{n : Sn = +A or − B}.
Solution to Problem A. It is not difficult to see that τ is a stopping time relative to the sequence X0, X1, . . . , and we shall prove below by an elementary argument that Eτ < ∞. Consequently, by the First Wald Identity, ESτ = µEτ = 0, that is,
0 = AP{Sτ = A} − BP{Sτ = −B}.
Since Sτ must be either A or −B, the two probabilities in this equation must sum to 1.
Hence, we have two equations in two unknowns, which we may solve to obtain the solution
to Problem (A):
(9) P{Sτ = A} = B/(A + B) and
    P{Sτ = −B} = A/(A + B).

Solution to Problem B. To solve Problem (B), we shall appeal to the Second Wald
Identity. Since µ = 0 and σ 2 = 1, and since Eτ < ∞ (see below), the Second Wald Identity
implies that
ESτ² = Eτ.
But Sτ takes one of only two possible values, +A and −B, and by equations (9) above, the
probability distribution is known. Thus,
ESτ² = A²B/(A + B) + B²A/(A + B) = AB.
This proves that
(10) Eτ = AB.
To justify the calculations that led to equations (9) and (10), we must show that Eτ < ∞.
To this end, set M = A + B, and consider how the game proceeds in blocks of M consecutive
tosses. In each such block, it is possible that there will be M consecutive Heads, in which
case the game will end and Fats will take home the money. For each such block of M
consecutive tosses, this happens with probability 1/2M . Thus, the probability that the game
has not ended after the first kM tosses is at most (1 − 2^{−M})^k. Thus,
P{τ > kM} ≤ (1 − 2^{−M})^k,
that is, the distribution of τ has an exponentially decaying tail. This implies that Eτ < ∞.
(Exercise: Explain why.)
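The following Monte Carlo check (an illustration, not from the notes; the stakes A = 3, B = 7 and the number of trials are arbitrary) plays the game repeatedly and compares the observed frequency of Fats winning, and the observed average duration, with formulas (9) and (10).

    import random

    random.seed(1)
    A, B, trials = 3, 7, 100_000
    wins, total_time = 0, 0

    for _ in range(trials):
        s, t = 0, 0                      # S_n and the number of tosses so far
        while -B < s < A:                # play until Fats is up A or down B
            s += random.choice((1, -1))  # fair coin: +1 for Heads, -1 for Tails
            t += 1
        wins += (s == A)
        total_time += t

    print("P{Fats wins}:", wins / trials, "  vs  B/(A+B) =", B / (A + B))
    print("E[duration]: ", total_time / trials, "  vs  A*B =", A * B)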

2.4. Second Application: Distribution of a First-Passage Time. Once again let Sn be the simple nearest-neighbor random walk on the integers, that is, Sn = Σ_{j=1}^{n} Xj, where
X0 , X1 , . . . are independent, identically distributed random variables that take the values
±1 with probability 1/2 each. We know that the Markov chain Sn is recurrent, and so it
will eventually visit each integer. Hence, the random variable
T = min{n : Sn = 1}
is finite with probability one, and it is readily apparent that it is a stopping time. It is not
the case that ET < ∞, because if it were then we would have a contradiction to Wald’s
First Identity:
1 = EST = EX1 ET = 0.
Earlier in the course we obtained formulas for the distribution of T , in two different ways:
first, by using the reflection principle, and second, by analyzing the Renewal Equation for
the probability generating function. (Refer to HW set 2.) Now we will show that this
distribution can be obtained in a radically different way, using Wald’s Third Identity. Fix
θ > 0, and let z = EeθX1 = cosh θ. For each positive integer n, the random variable T ∧ n
is a bounded stopping time, and so Wald’s Third Identity implies that
(11) E[exp{θS_{T∧n}}/z^{T∧n}] = 1.
We will argue that it is permissible to take n → ∞ in this identity. Suppose for the moment
that it is; then since ST ≡ 1, the limiting form of the identity will read
(12) e^θ · E[z^{−T}] = 1.
Set s = 1/z. Since z = cosh θ ≥ 1, it follows (by solving a quadratic equation: Exercise!)
that
e^{−θ} = (1 − √(1 − s²))/s.
Thus, the limiting form (12) of the Wald Identity may be rewritten so as to give the following
exact formula for the probability generating function of the stopping time T :

(13) E[s^T] = (1 − √(1 − s²))/s.
You should note that this is exactly the same formula that we obtained by analyzing the
Renewal Equation earlier in the course.
To justify taking n → ∞ in the Wald Identity (11) above, we shall use the Dominated
Convergence Theorem. First, note that since T < ∞ (at least with probability one),
lim_{n→∞} exp{θS_{T∧n}}/z^{T∧n} = exp{θST}/z^T.
Hence, by the DCT, it will follow that the limiting equation (12) holds provided the inte-
grands are dominated by an integrable random variable. For this, examine the numerator
and the denominator separately. Since θ > 0, the random variable e^{θS_{T∧n}} cannot be larger than e^θ, because on the one hand, ST = 1, and on the other, if T > n then S_{T∧n} = Sn ≤ 0 and so e^{θS_{T∧n}} ≤ 1. The denominator is even easier: since z = cosh θ ≥ 1, it is always the case that
z^{T∧n} ≥ 1. Thus,
exp{θS_{T∧n}}/z^{T∧n} ≤ e^θ,
and so the integrands are uniformly bounded.
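As a numerical check on formula (13) (not part of the notes), the sketch below estimates Es^T by simulation; the value s = 0.9, the number of trials, and the cap on the walk length are arbitrary choices, the cap being harmless because s^T is negligible for very long paths.

    import math
    import random

    random.seed(2)
    s, trials, cap = 0.9, 20_000, 5_000
    acc = 0.0

    for _ in range(trials):
        pos, t = 0, 0
        while pos < 1 and t < cap:       # run the walk until it first reaches +1
            pos += random.choice((1, -1))
            t += 1
        if pos == 1:
            acc += s ** t                # contribute s^T; truncated paths would
                                         # contribute essentially nothing since s < 1

    exact = (1 - math.sqrt(1 - s ** 2)) / s
    print("simulated E[s^T]:", acc / trials, "  exact:", exact)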
3. Conditional Expectation
The theory of martingales rests on the notion of conditional expectation. For martingales
in discrete probability spaces, or for martingales built from independent random variables,
such as those considered in section 2, this presents no difficulties, as conditional expectation
can be defined in an elementary manner. In general, however, a more sophisticated view of
conditional expectation is needed.

3.1. Definition of Conditional Expectation. The conditional expectation of a discrete


random variable X given the value y of another discrete random variable Y may be defined
in terms of the conditional distribution of X given the event {Y = y}: in particular,
(14) E(X | Y = y) = Σ_x x P(X = x | Y = y),

where the sum is over the set of all possible values x of X. Note that this expression depends
on the value y – this will be important in what follows. For discrete random variables that
take values in finite sets there are no difficulties regarding possible divergence of the sum, nor
is there any difficulty regarding the meaning of the conditional probability P (X = x | Y = y).
For continuous random variables, or, worse, random variables that are neither discrete nor
have probability densities, the definition (14) is problematic. There are two main difficulties:
(a) If X is not discrete, then the sum must be replaced by an integral of some sort; and (b) If
Y is not discrete then it is no longer clear how to define the conditional probabilities P(X =
x | Y = y). Fortunately, there is an alternative way of defining conditional expectation that
works in both the discrete and the indiscrete cases, and additionally allows for conditioning
not only on the value of a single random variable, but for conditioning simultaneously on
the values of finitely or even countably many random variables or random vectors:
Definition 1. Let X be a real-valued random variable such that either E|X| < ∞ or X ≥ 0,
and let Y1 , Y2 , . . . , Ym be random variables taking values in a set Y.1 The conditional expec-
tation E(X | Y1 , Y2 , . . . , Ym ) is the unique (measurable) real-valued function of Y1 , Y2 , . . . , Ym
such that for every bounded (measurable) function g(Y1 , Y2 , . . . , Ym ),
(15) EXg(Y1 , Y2 , . . . , Ym ) = E(E(X | Y1 , Y2 , . . . , Ym )g(Y1 , Y2 , . . . , Ym )).
It is by no means clear a priori that such a function E(X | Y1 , Y2 , . . . , Ym ) should always
exist, nor is it obvious that it should be unique. In fact, the existence of such a function
is an important theorem of measure theory, called the Radon-Nikodym theorem. You will
see its proof next year, in Prof. Wichura’s class. The uniqueness of the function is not so
difficult to prove:
Proof of Uniqueness. For convenience, denote by Y = (Y1 , Y2 , . . . , Ym ) the random vector on
which we are conditioning. Suppose that there are distinct functions h1 (y) and h2 (y) such

1More precisely, Y1 , Y2 , . . . , Ym take values in a measurable space (Y, G). If the set Y is finite, there is no
need to worry about measurability issues.
that, for every bounded function g(y),
EXg(Y ) = Eh1 (Y )g(Y ) and
EXg(Y ) = Eh2 (Y )g(Y ).
Then by the linearity of ordinary expectation (taking the difference of the two equations) it
must be the case that for every bounded function g(y),
0 = E(h1 (Y ) − h2 (Y ))g(Y );
in particular, this equation must hold for the function g(y) that is 1 if h1 (y) > h2 (y) and
0 otherwise. But this implies that P{h1(Y) > h2(Y)} = 0. A similar argument shows that
P{h2(Y) > h1(Y)} = 0. It follows that P{h1(Y) = h2(Y)} = 1. □

3.2. Equivalence of the Naive and Modern Definitions. It is not difficult to show that
the naive definition (14) coincides with the modern Definition 1 when the random variable
X and the random vector Y = (Y1 , Y2 , . . . , Ym ) are discrete and assume only finitely many
possible values with positive probability. Define
h(y) = Σ_x x P(X = x | Y = y) = Σ_x x · P{X = x and Y = y}/P{Y = y}
for those values of y such that P {Y = y} > 0. To show that E(X | Y ) = h(Y ), it suffices,
by Definition 1, to show that, for any bounded function g(y),
EXg(Y ) = Eh(Y )g(Y ).
But
EXg(Y) = Σ_y Σ_x x g(y) P{X = x and Y = y}
       = Σ_y g(y)P{Y = y} Σ_x x P(X = x | Y = y)
       = Σ_y g(y)P{Y = y} h(y)
       = Eh(Y)g(Y). □
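To make Definition 1 concrete, here is a small computation (an illustration, not from the notes) for a pair of dependent discrete random variables with an arbitrarily chosen joint distribution: it builds h(y) = E(X | Y = y) as in (14) and checks the defining identity EXg(Y) = Eh(Y)g(Y) for a particular bounded test function g.

    # A small joint distribution for (X, Y) on {0,1,2} x {0,1}; the weights are arbitrary.
    pmf = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.15,
           (0, 1): 0.20, (1, 1): 0.05, (2, 1): 0.25}

    def marginal_Y(y):
        return sum(p for (x, yy), p in pmf.items() if yy == y)

    def h(y):
        # h(y) = E(X | Y = y) = sum_x x * P{X = x, Y = y} / P{Y = y}
        return sum(x * p for (x, yy), p in pmf.items() if yy == y) / marginal_Y(y)

    def g(y):                     # an arbitrary bounded test function
        return 1.0 if y == 1 else -2.0

    lhs = sum(x * g(y) * p for (x, y), p in pmf.items())     # E[X g(Y)]
    rhs = sum(h(y) * g(y) * p for (x, y), p in pmf.items())  # E[h(Y) g(Y)]
    print(lhs, rhs)               # the two numbers agree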

3.3. Properties of Conditional Expectation. The raw definition 1 can be rather clumsy
to work with directly. In this section we present a short list of important rules for manipu-
lating and calculating conditional expectations. The bottom line will be that, in many im-
portant respects, conditional expectations behave like ordinary expectations, with random
quantities that are functions of the conditioning random variable being treated as constants.
In stating the properties below, it is convenient to use the abbreviation Y = (Y1 , Y2 , . . . , Yn ).

Summary: Basic Properties of Conditional Expectation.


(1) Definition: EXg(Y ) = EE(X | Y )g(Y ) for all nonnegative functions g(y).
(2) Linearity: E(aU + bV |Y ) = aE(U |Y ) + bE(V |Y ) for all scalars a, b ∈ R.
(3) Positivity: If X ≥ 0 then E(X|Y ) ≥ 0.
(4) Stability: If X is a function of Y , then E(XZ|Y ) = XE(Z|Y ).
(5) Independence Law: If X is independent of Y then E(X|Y ) = EX is constant.
(6) Tower Property: If Z is a function of Y then E(E(X | Y ) | Z) = E(X | Z).
(7) Expectation Law: E(E(X|Y )) = EX.
(8) Constants: For any scalar a, E(a|Y ) = a.
(9) Jensen Inequalities: If ϕ : R → R is convex and E|X| < ∞ then
E(ϕ(X)) ≥ ϕ(EX) and
E(ϕ(X)|Y ) ≥ ϕ(E(X|Y )).
With the exception of the Jensen inequalities (9), all of these properties may be proved easily,
using only the definition 1 and elementary properties of ordinary expectation. To give an
idea of how these arguments go, we shall outline the proofs of the Linearity, Positivity, and
Independence properties below. You should try to check the Stability Property yourself.
The Jensen inequality is of a somewhat different character, but it is not difficult to prove –
see below.

Note: The definition (1) requires only that the equation EXg(Y ) = EE(X | Y )g(Y ) be
valid for bounded functions g. A standard argument from measure theory, which you will
learn next year, then implies that it holds for all functions such that the product Xg(Y )
has finite first moment. Similarly, Property (4) holds provided the product has finite first
moment.
Proof of the Positivity Property. The idea is to exploit the defining property (15) of condi-
tional expectation. First, suppose that X ≥ 0. Define B to be the set of possible values of Y
where the conditional expectation E(X | Y ) < 0, so that the event {E(X | Y ) < 0} coincides
with the event {Y ∈ B}. Then by equation (15),
EX1B (Y ) = E(E(X | Y )1B (Y )).
Since X ≥ 0, the left side of this equality is nonnegative; but by definition of B, the right
side is negative unless P {Y ∈ B} = 0. It follows that P {Y ∈ B} = 0, that is, E(X | Y ) ≥ 0
with probability one. 

Proof of the Linearity Property. Since each of the conditional expectations E(U | Y ) and
E(V | Y ) is a function of Y , so must be the linear combination aE(U | Y ) + bE(V | Y ).
Thus, by Definition 1, to show that this linear combination is the conditional expectation
E(aU + bV | Y ), it suffices to show that it satisfies equation (15), that is, that for every
bounded nonnegative function g(Y ),
(16) E(aU + bV )g(Y ) = E(aE(U | Y ) + bE(V | Y ))g(Y ).
But equation (15) holds for X = U and for X = V :
EU g(Y ) = EE(U | Y )g(Y ),
EV g(Y ) = EE(V | Y )g(Y ).
Multiplying these equations by a and b, respectively, and then adding gives (16), because
the unconditional expectation operator is linear. 

Proof of the Independence Property. This relies on the fact that if U and V are independent,
integrable random variables whose product U V is also integrable, then E(U V ) = EU EV .
Now suppose that X is independent of Y , and let g(Y ) be any bounded (measurable) function
of Y . Then EXg(Y ) = EXEg(Y ) = E(EX)g(Y ). Since any constant, and in particular
EX, is trivially a function of Y , Definition 1 implies that EX must be the conditional
expectation E(X | Y). □

Proof of the Jensen Inequalities. One of the basic properties of convex functions is that every
point on the graph of a convex function ϕ has a support line: that is, for every argument
x∗ ∈ R there is a linear function yx∗ (x) = ax + b such that
ϕ(x∗ ) = yx∗ (x∗ ) and
ϕ(x) ≥ yx∗ (x) for all x ∈ R.
Let X be a random variable such that E|X| < ∞, so that the expectation EX is well-
defined and finite. Let yEX (x) = ax + b be the support line to the convex function at
the point (EX, ϕ(EX)). Then by definition of a support line, yEX (EX) = ϕ(EX); also,
yEX (X) ≤ ϕ(X), and so
EyEX (X) ≤ Eϕ(X).
But because yEX (x) = ax + b is a linear function of x,
EyEX (X) = yEX (EX) = ϕ(EX).
This proves the Jensen inequality for ordinary expectation. The proof for conditional ex-
pectation is similar. For any value of Y , let yE(X|Y ) (x) be the support line at the point
(E(X|Y ), ϕ(E(X|Y ))). Then yE(X|Y ) (E(X|Y )) = ϕ(E(X|Y )), and for every value of X,
yE(X|Y ) (X) ≤ ϕ(X). Consequently, by the linearity and positivity properties of conditional
expectation,
ϕ(E(X|Y )) = yE(X|Y ) (E(X|Y ))
= E(yE(X|Y ) (X)|Y )
≤ E(ϕ(X)|Y ).


4. Martingales
4.1. Definition of a Martingale. Let X0 , X1 , . . . and Y0 , Y1 , . . . be sequences of random
variables defined on a common probability space, with Xn being real-valued and integrable
(that is, E|Xn | < ∞), and Yn valued in an arbitrary (measurable) space. In common
applications in stochastic process theory, the random variables Yn might be the successive
states in a Markov chain, and the random variables Xn numerical functions of the states Yn .
The sequence X0 , X1 , . . . is said to be adapted to the sequence Y0 , Y1 , . . . if for each n the
random variable Xn is a function2 of Y1 , Y2 , . . . , Yn . The sequence X0 , X1 , . . . is said to be a
martingale relative to the sequence Y0 , Y1 , . . . if it is adapted, and if for every n,
(17) E(Xn+1 | Y0 , Y1 , Y2 , . . . , Yn ) = Xn .
Similarly, it is said to be a supermartingale (respectively, submartingale) if for every n,
(18) E(Xn+1 | Y0 , Y1 , Y2 , . . . , Yn ) ≤ (≥)Xn .
Observe that any martingale is automatically both a submartingale and a supermartingale.
4.2. Martingales and Martingale Difference Sequences. The most basic examples of
martingales are sums of independent, mean zero random variables. Let Y0 , Y1 , . . . be a
sequence of independent, identically distributed random variables such that EYn = 0. Then
the sequence of partial sums
(19) Xn = Σ_{j=1}^{n} Yj

is a martingale relative to the sequence 0, Y1 , Y2 , . . . . This is easily verified, using the linearity
and stability properties and the Independence Law for conditional expectation:
E(Xn+1 | Y1 , Y2 , . . . , Yn ) = E(Xn + Yn+1 | Y1 , Y2 , . . . , Yn )
= E(Xn | Y1 , Y2 , . . . , Yn ) + E(Yn+1 | Y1 , Y2 , . . . , Yn )
= Xn + EYn+1
= Xn .
The importance of martingales in modern probability stems at least in part from the fact
that most of the essential properties of sums of independent, identically distributed random
variables are inherited (with minor modification) by martingales: As you will learn, there are
versions of the SLLN, the Central Limit Theorem, the Wald identities, and the Chebyshev,
Markov, and Kolmogorov inequalities for martingales.3 To get some appreciation of why this
might be so, consider the decomposition of a martingale {Xn } as a partial sum process:
(20) Xn = X0 + Σ_{j=1}^{n} ξj, where ξj = Xj − Xj−1.

Proposition 1. The martingale difference sequence {ξn } has the following properties: (a)
the random variable ξn is a function of Y1 , Y2 , . . . , Yn ; and (b) for every n ≥ 0,
(21) E(ξn+1 | Y1 , Y2 , . . . , Yn ) = 0.
Proof. This is another three-line calculation using the properties of conditional expectation
and the definition of a martingale. 

2Here and in what follows, the term function will always mean measurable function. If you do not know
yet what this means, don’t worry about it: all reasonable functions are measurable.
3We won’t have a chance to discuss all of these in this class, but you can rest assured that Prof. Wichura
will give a complete tour next year.
Corollary 1. Let {Xn } be a martingale relative to {Yn }, with martingale difference sequence
{ξn }. Then for every n ≥ 0,
(22) EXn = EX0 .
Moreover, if X0 = 0 and EXn² < ∞ then
(23) EXn² = Σ_{j=1}^{n} Eξj².

Proof. The first property follows almost trivially from Proposition 1 and the Expectation
Law for conditional expectation, as these together imply that Eξn = 0 for each n. Summing
and using the linearity of ordinary expectation, one obtains (22).
The second property is only slightly more difficult: First, observe that each of the terms
ξj has finite variance, because it is the difference of two random variables with finite second
moments. (That EXj² < ∞ follows from the hypothesis that EXn² < ∞: by the martingale and Tower properties, Xj = E(Xn | Y1 , Y2 , . . . , Yj ), so EXj² ≤ EXn² by the conditional Jensen inequality and the Expectation Law.) Consequently, all of the products ξi ξj have finite first moments, by
the Cauchy-Schwartz inequality. Next, if j ≤ k ≤ n then ξj is a function of Y1 , Y2 , . . . , Yj ,
and therefore also a function of Y1 , Y2 , . . . , Yk . Thus, by Properties (1), (4), (6), and (7) of
conditional expectation, if j ≤ k ≤ n then
Eξjξk+1 = EE(ξjξk+1 | Y1 , Y2 , . . . , Yk )
        = E(ξj E(ξk+1 | Y1 , Y2 , . . . , Yk ))
        = E(ξj · 0) = 0.
The variance of Xn may now be calculated in exactly the same manner as for sums of
independent random variables with mean zero:
EXn² = E(Σ_{j=1}^{n} ξj)²
     = E Σ_{j=1}^{n} Σ_{k=1}^{n} ξjξk
     = Σ_{j=1}^{n} Σ_{k=1}^{n} Eξjξk
     = Σ_{j=1}^{n} Eξj² + 2 Σ_{j<k} Eξjξk
     = Σ_{j=1}^{n} Eξj² + 0.
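As a numerical illustration of (23) (not in the notes), the sketch below builds a martingale with X0 = 0 whose increments are not independent — each ξj is an independent fair sign scaled by a function of the current position, so that E(ξj | past) = 0 — and checks that EXn² agrees with the sum of the Eξj²; the scaling function and the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    paths, n = 100_000, 8

    X = np.zeros(paths)
    sum_var = 0.0
    for j in range(1, n + 1):
        eps = rng.choice([-1.0, 1.0], size=paths)   # fair signs, independent of the past
        xi = eps * (1.0 + 0.5 * np.abs(X))          # E(xi_j | past) = 0, so X is a martingale
        sum_var += np.mean(xi ** 2)                 # accumulates estimates of E xi_j^2
        X = X + xi

    print("E X_n^2        :", np.mean(X ** 2))
    print("sum of E xi_j^2:", sum_var)              # the two agree up to Monte Carlo error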

4.3. Some Examples of Martingales.


4.3.1. Paul Lévy’s Martingales. Let X be any integrable random variable. Then the se-
quence Xn defined by Xn = E(X|Y0 , Y1 , . . . , Yn ) is a martingale, by the Tower Property of
conditional expectation.
4.3.2. Random Walk Martingales. Let Y0, Y1, . . . be a sequence of independent, identically distributed random variables such that EYn = 0. Then the sequence Xn = Σ_{j=1}^{n} Yj is a martingale, as we have seen.
4.3.3. Second Moment Martingales. Once again let Y0 , Y1 , . . . be a sequence of independent,
identically distributed random variables such that EYn = 0 and EYn2 = σ 2 < ∞. Then the
sequence
(24) (Σ_{j=1}^{n} Yj)² − σ²n

is a martingale (again relative to the sequence 0, Y1 , Y2 , . . . ). This is also easy to check.


4.3.4. Likelihood Ratio Martingales: Bernoulli Case. Let X0, X1, . . . be a sequence of independent, identically distributed Bernoulli-p random variables, and let Sn = Σ_{j=1}^{n} Xj. Note that Sn has the binomial-(n, p) distribution. Define


(25) Zn = (q/p)^{2Sn−n}, where q = 1 − p.
Then Z0 , Z1 , . . . is a martingale relative to the usual sequence. Once again, this is easy to
check. The martingale {Zn }n≥0 is quite useful in certain random walk problems, as we have
already seen.
4.3.5. Likelihood Ratio Martingales in General. Let X0 , X1 , . . . be independent, identically
distributed random variables whose moment generating function ϕ(θ) = EeθXi is finite for
some value θ ≠ 0. Define
(26) Zn = Zn(θ) = Π_{j=1}^{n} e^{θXj}/ϕ(θ) = e^{θSn}/ϕ(θ)^n.
Then Zn is a martingale. (It is called a likelihood ratio martingale because the random
variable Zn is the likelihood ratio dPθ /dP0 based on the sample X1 , X2 , . . . , Xn for probability
measures Pθ and P0 in a certain exponential family.)
4.3.6. Galton-Watson Martingales. Let Z0 = 1, Z1 , Z2 , . . . be a Galton-Watson process whose
offspring distribution has mean µ > 0. Denote by ϕ(s) = EsZ1 the probability generating
function of the offspring distribution, and by ζ the smallest nonnegative root of the equation
ϕ(ζ) = ζ.
Proposition 2. Each of the following is a nonnegative martingale:
Mn := Zn/µ^n; and
Wn := ζ^{Zn}.
Proof. Homework. 
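A quick simulation (not part of the notes) of the first martingale in Proposition 2, using Poisson(µ) offspring with µ = 1.5 as an arbitrary choice: the sample mean of Mn = Zn/µ^n stays near 1 in every generation, as the martingale property requires.

    import numpy as np

    rng = np.random.default_rng(4)
    mu, generations, paths = 1.5, 8, 100_000

    # Z_0 = 1; each individual has an independent Poisson(mu) number of offspring.
    Z = np.ones(paths, dtype=np.int64)
    for n in range(1, generations + 1):
        # Total offspring of Z_{n-1} individuals is Poisson(mu * Z_{n-1}) by superposition.
        Z = rng.poisson(mu * Z)
        print(n, (Z / mu ** n).mean())   # E[Z_n / mu^n] stays near 1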
4.3.7. Harmonic Functions and Markov Chains. Yes, surely enough, martingales also arise
in connection with Markov chains; in fact, Doob’s motivation in inventing them was to
connect the world of potential theory for Markov processes with the classical theory of sums
of independent random variables.4 Let Y0, Y1, Y2, . . . be a Markov chain on a denumerable
state space Y with transition probability matrix P. A real-valued function h : Y → R is
called harmonic for the transition probability matrix P if
(27) Ph = h,
equivalently, if for every x ∈ Y,
(28) h(x) = Σ_{y∈Y} p(x, y)h(y) = E^x h(Y1).

Here E^x denotes the expectation corresponding to the probability measure P^x under which
P^x{Y0 = x} = 1. Notice the similarity between equation (28) and the equation for the
stationary distribution – one is just the transpose of the other.
Proposition 3. If h is harmonic for the transition probability matrix P then for every
starting state x ∈ Y the sequence h(Yn) is a martingale under the probability measure P^x.
Proof. This is once again nothing more than a routine calculation. The key is the Markov
property, which allows us to rewrite any conditional expectation on Y0 , Y1 , Y2 , . . . , Yn as a
conditional expectation on Yn . Thus,
E(h(Yn+1) | Y0, Y1, Y2, . . . , Yn) = E(h(Yn+1) | Yn)
                                   = Σ_{y∈Y} h(y)p(Yn, y)
                                   = h(Yn). □
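As a concrete instance (not from the notes), take the nearest-neighbor random walk on {0, 1, . . . , N} absorbed at both endpoints. The function h(x) = x is harmonic for its transition matrix, and the sketch below verifies Ph = h and that E^x h(Yn) = h(x) for several n.

    import numpy as np

    N = 5
    P = np.zeros((N + 1, N + 1))
    P[0, 0] = P[N, N] = 1.0                 # absorbing endpoints
    for x in range(1, N):
        P[x, x - 1] = P[x, x + 1] = 0.5     # fair nearest-neighbor steps in the interior

    h = np.arange(N + 1, dtype=float)       # h(x) = x
    print(np.allclose(P @ h, h))            # True: h is harmonic, Ph = h

    x0 = 2
    dist = np.zeros(N + 1); dist[x0] = 1.0  # distribution of Y_0 under P^{x0}
    for n in range(1, 6):
        dist = dist @ P                     # distribution of Y_n
        print(n, dist @ h)                  # E^{x0} h(Y_n) stays equal to h(x0) = 2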


4.3.8. Submartingales from Martingales. Let {Xn }n≥0 be a martingale relative to the se-
quence Y0 , Y1 , . . . . Let ϕ : R → R be a convex function such that Eϕ(Xn ) < ∞ for each
n ≥ 0. Then the sequence {Zn }n≥0 defined by
(29) Zn = ϕ(Xn )
is a submartingale. This is a consequence of the Jensen inequality and the martingale
property of {Xn }n≥0 :
E(Zn+1 | Y0, Y1, . . . , Yn) = E(ϕ(Xn+1) | Y0, Y1, . . . , Yn)
                             ≥ ϕ(E(Xn+1 | Y0, Y1, . . . , Yn))
                             = ϕ(Xn) = Zn.
Useful special cases: (a) ϕ(x) = x2 , and (b) ϕ(x) = exp{θx}.
4See his 800-page book Classical Potential Theory and its Probabilistic Counterpart for more on this.
5. Martingale and Submartingale Transforms
According to the Merriam-Webster Collegiate Dictionary, a martingale is
any of several systems of betting in which a player increases the stake usually
by doubling each time a bet is lost.
The use of the term in the theory of probability derives from the connection with fair games
or fair bets; and the importance of the theoretical construct in the world of finance also
derives from the connection with fair bets. Seen in this light, the notion of a martingale
transform, which we are about to introduce, becomes most natural. Informally, a martingale
transform is nothing more than a system of placing bets on a fair game.
5.1. Martingale Transforms. A formal definition of a martingale transform requires two
auxiliary notions: martingale differences and predictable sequences. Let X0, X1, . . . be a martingale relative to another sequence Y0, Y1, . . . . For n = 1, 2, . . . , define
(30) ξn = Xn − Xn−1
to be the martingale difference sequence associated with the martingale Xn.
A predictable sequence Z1 , Z2 , . . . relative to the sequence Y0 , Y1 , . . . is a sequence of ran-
dom variables such that for each n = 1, 2, . . . the random variable Zn is a function of
Y0 , Y1 , . . . , Yn−1 . In gambling (and financial) contexts, Zn might represent the size (say, in
dollars) of a bet placed on the nth play of a game, while ξn represents the (random) payoff
of the nth play per dollar bet. The requirement that the sequence Zn be predictable in such
contexts is merely an assertion that the gambler not be clairvoyant.
Definition 2. Let X0 , X1 , . . . be a martingale relative to Y0 , Y1 , . . . and let ξn = Xn − Xn−1
be the associated martingale difference sequence. Let Z0 , Z1 , . . . be a predictable sequence
relative to Y0 , Y1 , . . . . Then the martingale transform {(Z · X)n }n≥0 is defined by
(31) (Z · X)n = Z0X0 + Σ_{k=1}^{n} Zkξk.

Example: The St. Petersburg Game. In this game, a referee tosses a fair coin repeat-
edly, with results ξ1 , ξ2 , . . . , where ξn = +1 if the nth toss is a Head and ξn = −1 if the nth
toss is a Tail. Before each toss, a gambler is allowed to place a wager of size Wn (in roubles)
on the outcome of the next toss. The size of the wager Wn may depend on the observed
tosses ξ1 , ξ2 , . . . , ξn−1 , but not on ξn (or on any of the future tosses); thus, the sequence
{Wn }n≥1 is predictable relative to {ξn }n≥1 . If the nth toss is a Head, the gambler nets +Wn ,
but if the nth toss is a Tail, the gambler loses Wn . Thus, the net winnings Sn after n tosses
is the martingale transform
Sn = (W · X)n = Σ_{k=1}^{n} Wkξk,
where Xn = ξ1 + ξ2 + · · · + ξn . 
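As an illustration (not part of the notes) of the fact, recorded in Proposition 4 below, that betting on a fair game cannot produce a positive expected profit, the following simulation plays the St. Petersburg game with the classical "double the stake after every loss" strategy — a predictable sequence — and estimates the expected net winnings after a fixed number of tosses.

    import random

    random.seed(5)
    paths, n = 200_000, 10
    total = 0.0

    for _ in range(paths):
        winnings, bet = 0.0, 1.0            # the stake is chosen before each toss
        for _ in range(n):
            if random.random() < 0.5:       # Heads: win the current stake
                winnings += bet
                bet = 1.0                   # reset after a win
            else:                           # Tails: lose the current stake
                winnings -= bet
                bet *= 2.0                  # double after a loss
        total += winnings

    print("E[net winnings after", n, "tosses]:", total / paths)   # approximately 0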

The most important fact about martingale transforms is that they are martingales in their
own right, as the next proposition asserts:
Proposition 4. Assume that the predictable sequence {Zn }n≥0 consists of bounded random
variables. Then the martingale transform {(Z · X)n }n≥0 is itself a martingale relative to
{Yn }n≥0 .
Proof. This is a simple exercise in the use of the linearity and stability properties of condi-
tional expectation:
E((Z · X)n+1 | Y1 , Y2 , . . . , Yn ) = (Z · X)n + E(Zn+1 ξn+1 | Y1 , Y2 , . . . , Yn )
= (Z · X)n + Zn+1 E(ξn+1 | Y1 , Y2 , . . . , Yn )
= (Z · X)n ,
the last equation because {ξn }n≥1 is a martingale difference sequence relative to {Yn }n≥0 . 

5.2. Submartingale Transforms. Submartingales and supermartingales may also be transformed, using equation (31), but the resulting sequences will not necessarily be sub- or
super-martingales unless the predictable sequence {Zn }n≥0 consists of nonnegative random
variables.
Definition 3. Let X0 , X1 , . . . be a sub- (respectively, super-) martingale relative to Y0 , Y1 , . . .
and let ξn = Xn − Xn−1 be the associated sub- (super-) martingale difference sequence. Let
Z0 , Z1 , . . . be a predictable sequence relative to Y0 , Y1 , . . . consisting of bounded nonnegative
random variables. Then the submartingale transform (respectively, supermartingale trans-
form) {(Z · X)n }n≥0 is defined by
(32) (Z · X)n = Z0X0 + Σ_{k=1}^{n} Zkξk.

Proposition 5. If the terms Zn of the predictable sequence are nonnegative and bounded,
and if {Xn }n≥0 is a submartingale, then the submartingale transform (Z · X)n is also a
submartingale. Moreover, if, for each n ≥ 0,
(33) 0 ≤ Zn ≤ 1,
then
(34) E(Z · X)n ≤ EXn .
Proof. To show that (Z · X)n is a submartingale, it suffices to verify that the differences
Zk ξk constitute a submartingale difference sequence. Since Zk is a predictable sequence, the
differences Zk ξk are adapted to {Yk }k≥0 , and
E(Zk ξk | Y1 , Y2 , . . . , Yk−1 ) = Zk E(ξk | Y1 , Y2 , . . . , Yk−1 ).
Since ξk is a submartingale difference sequence, E(ξk | Y1 , Y2 , . . . , Yk−1 ) ≥ 0; and therefore,
since 0 ≤ Zk ≤ 1,
0 ≤ E(Zk ξk | Y1 , Y2 , . . . , Yk−1 ) ≤ E(ξk | Y1 , Y2 , . . . , Yk−1 ).
Consequently, Zk ξk is a submartingale difference sequence. Moreover, by taking expectations
in the last inequalities, we have
E(Zk ξk ) ≤ Eξk ,
which implies (34). 
There is a similar result for supermartingales:
Proposition 6. If {Xn }n≥0 is a supermartingale, and if the terms Zn of the predictable
sequence are nonnegative and bounded, then {(Z · X)n }n≥0 is a supermartingale; and if
inequality (33) holds for each n ≥ 0 then
(35) E(Z · X)n ≥ EXn .

6. Optional Stopping
The cornerstone of martingale theory is Doob’s Optional Sampling Theorem. This states,
roughly, that “stopping” a martingale at a random time τ does not alter the expected
“payoff”, provided the decision about when to stop is based solely on information available
up to τ . Such random times are called stopping times.5
Theorem 1. (Optional Sampling Theorem) Let {Xn }n∈Z+ be a martingale, submartingale ,
or supermartingale relative to a sequence {Yn }n≥0 , and let τ be a stopping time. Then for
any n ∈ N,
EXτ ∧n = EX0 (martingales)
EXτ ∧n ≤ EX0 (supermartingales)
EXτ ∧n ≤ EXn (submartingales)

Proof. The easiest proof is based on the fact that martingale (respectively, submartingale, su-
permartingale) transforms are martingales (respectively, submartingales, supermartingales).
The connection between transforms and the Optional Sampling Theorem is that the sequence
{Xτ ∧n }n≥0 may be represented as a transform of the sequence {Xn }n≥0 :
(36) Xτ ∧n = (Z · X)n
where
(37) Zn = 1 if τ ≥ n, and Zn = 0 if τ < n.
The equation (36) is easy to verify. Note that Z0 = 1, since τ is nonnegative; consequently,
(Z · X)n = X0 + Σ_{j=1}^{n} Zj(Xj − Xj−1)
         = X0 + Σ_{j=1}^{τ∧n} (Xj − Xj−1)
         = X_{τ∧n},
since the last sum telescopes.
5In some of the older literature, they are called Markov times or optional times.
In order that the sequence {(Z · X)n }n≥0 be a martingale transform (respectively, sub- or
super- martingale transform) it must be the case that the sequence {Zn }n≥0 is predictable.
This is where the assumption that τ is a stopping time enters: Since τ is a stopping time,
for each fixed m the event that τ = m depends only on Y0 , Y1 , Y2 , . . . , Ym . Hence, the event
{τ ≥ n} = (∪_{m=0}^{n−1} {τ = m})^c
depends only on Y0 , Y1 , . . . , Yn−1 . But this event is the same as the event that Zn = 1;
this proves that Zn is a function only of Y0 , Y1 , . . . , Yn−1 , and so the sequence {Zn }n≥0 is
predictable.
The first two assertions of the Optional Sampling Theorem now follow easily from Propo-
sitions 4 and 5, in view of the “Conservation of Expectation” properties of martingales and
supermartingales. For instance, if {Xn}n≥0 is a martingale, then since martingale transforms
are themselves martingales, and since expectation is “preserved” for martingales,
EXτ ∧n = E(Z · X)n = E(Z · X)0 = EX0 .
A similar argument establishes the corresponding result for supermartingales. Finally, the
last assertion, regarding the case where {Xn }n≥0 is a submartingale, follows from inequality
(34), since the terms Zn of the predictable sequence are between 0 and 1. 
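The representation X_{τ∧n} = (Z · X)n with Zn = 1{τ ≥ n} used in the proof above can be checked mechanically. The sketch below (an illustration, not from the notes, with arbitrary parameters) does so path by path for the coin-tossing walk stopped when it first hits A or −B, and also confirms that EX_{τ∧n} ≈ EX0 = 0.

    import random

    random.seed(6)
    A, B, n, paths = 3, 4, 25, 20_000
    mean_stopped = 0.0

    for _ in range(paths):
        X = [0]
        for _ in range(n):                       # one path of the walk, X_0 = 0
            X.append(X[-1] + random.choice((1, -1)))
        # tau = first time the walk hits A or -B (set to n + 1 if that never happens here)
        tau = next((k for k, x in enumerate(X) if x >= A or x <= -B), n + 1)

        transform = 0.0                          # (Z . X)_n with X_0 = 0
        for j in range(1, n + 1):
            Z_j = 1.0 if tau >= j else 0.0       # predictable: determined by X_0, ..., X_{j-1}
            transform += Z_j * (X[j] - X[j - 1])

        assert transform == X[min(tau, n)]       # X_{tau ^ n} = (Z . X)_n, path by path
        mean_stopped += X[min(tau, n)]

    print("E[X_{tau ^ n}]:", mean_stopped / paths)   # approximately E[X_0] = 0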

7. Maximal Inequalities
The Optional Sampling Theorem has immediate implications concerning the pathwise
behavior of martingales, submartingales, and supermartingales. The most elementary of
these concern the maxima of the sample paths, and so are called maximal inequalities.
Proposition 7. Let {Xn }n≥0 be a sub- or super-martingale relative to {Yn }n≥0 , and for each
n ≥ 0 define
(38) Mn = max_{0≤m≤n} Xm, and
(39) M∞ = sup_{0≤m<∞} Xm = lim_{n→∞} Mn.

Then for any scalar α > 0 and any n ≥ 1,


(40) P {Mn ≥ α} ≤ E(Xn ∨ 0)/α if {Xn }n≥0 is a submartingale, and
(41) P {M∞ ≥ α} ≤ EX0 /α if {Xn }n≥0 is a nonnegative supermartingale.

Proof. Assume first that {Xn }n≥0 is a submartingale. Without loss of generality, we may
assume that each Xn ≥ 0, because if not we may replace the original submartingale Xn by
the larger submartingale Xn ∨ 0. Define τ to be the smallest n ≥ 0 such that Xn ≥ α, or
+∞ if there is no such n. Then for any nonrandom n ≥ 0, the truncation τ ∧ n is a stopping
time and so, by the Optional Sampling Theorem,
EXτ ∧n ≤ EXn .
But because the random variables Xm are nonnegative, and because Xτ ∧n ≥ α on the event
that τ ≤ n,

EX_{τ∧n} ≥ E[X_{τ∧n} 1{τ ≤ n}]
         ≥ E[α 1{τ ≤ n}]
         = αP{τ ≤ n}.

This proves the inequality (40).


The proof of inequality (41) is similar, but needs an additional limiting argument. First,
for any finite n ≥ 0, an argument parallel to that of the preceding paragraph shows that

P {Mn ≥ α} ≤ EX0 /α.

Now the random variables Mn are nondecreasing in n, and converge up to M∞, so for any ε > 0, the event that M∞ ≥ α is contained in the event that Mn ≥ α − ε for some n. But by the last displayed inequality and the monotone convergence theorem, the probability of this is no larger than EX0/(α − ε). Since ε > 0 may be taken arbitrarily small, inequality (41) follows. □

Example: The St. Petersburg Game, Revisited. In Dostoevsky’s novel The Gambler,
the hero (?) is faced with the task of winning a certain amount of money at the roulette
table, starting with a fixed stake strictly less than the amount he wishes to take home from
the casino. What strategy for allocating his stake will maximize his chance of reaching
his objective? Here we will consider an analogous problem for the somewhat simpler St.
Petersburg game described earlier. Suppose that the gambler starts with 100 roubles, and
that he wishes to maximize his chance of leaving with 200 roubles. There is a very simple
strategy that gives him a .5 probability of reaching his objective: stake all 100 roubles on
the first coin toss, and quit the game after one play. Is there a strategy that will give the
gambler more than a .5 probability of reaching the objective?
The answer is NO, and we may prove this by appealing to the Maximal Inequality (41)
for supermartingales. Let {Wn }n≥0 be any predictable sequence (recall that, for a non-
clairvoyant bettor, the sequence of wagers must be predictable). Then the gambler’s fortune
after n plays equals
Fn = 100 + Σ_{k=1}^{n} Wkξk,

where ξn is the martingale difference sequence of ±1 valued random variables recording


whether the coin tosses are Heads or Tails. By Proposition 4, the sequence Fn is a martingale.
Since each Fn ≥ 0, the Maximal Inequality for nonnegative supermartingales applies, and
we conclude that
P{sup_{n≥0} Fn ≥ 200} ≤ EF0/200 = 100/200 = 1/2.
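A simulation (not part of the notes) comparing two admissible strategies for a gambler with 100 roubles — stake everything on each toss, or stake a fixed 10 roubles per toss — finds that each reaches 200 roubles with probability about 1/2 and never more, in line with the bound just derived; the stake sizes are arbitrary choices.

    import random

    random.seed(7)
    paths, goal, start = 50_000, 200, 100

    def play(stake_rule):
        f = float(start)
        while 0 < f < goal:
            bet = min(stake_rule(f), f)      # never stake more than the current fortune
            f += bet if random.random() < 0.5 else -bet
        return f >= goal

    bold = sum(play(lambda f: f) for _ in range(paths)) / paths      # all-in on each toss
    timid = sum(play(lambda f: 10.0) for _ in range(paths)) / paths  # 10 roubles per toss
    print("P{reach 200} bold:", bold, "  timid:", timid)             # both about 1/2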
8. Upcrossings Inequalities
The Maximal Inequalities limit the extent to which a submartingale or supermartingale
may deviate from its initial value. In particular, if Xn is a submartingale that is bounded
in L1 then the maximal inequality implies that sup Xn < ∞ with probability one. The Up-
crossings Inequalities, which we shall discuss next, limit the extent to which a submartingale
or supermartingale may fluctuate around its initial value.
Fix a sequence Xn of real random variables. For any fixed constants α < β, define the up-
crossings count Nn ((α, β]) to be the number of times that the finite sequence X0 , X1 , X2 , . . . , Xn
crosses from the interval (−∞, α] to the interval (β, ∞). Equivalently, define stopping times
(42) σ0 := min{n ≥ 0 : Xn ≤ α} τ1 := min{n ≥ σ0 : Xn > β};
σ1 := min{n ≥ τ1 : Xn ≤ α} τ2 := min{n ≥ σ1 : Xn > β};
···
σm := min{n ≥ τm : Xn ≤ α} τm+1 := min{n ≥ σm : Xn > β};
then
Nn ((α, β]) = max{m : τm ≤ n}.
Proposition 8. Let Xn be a submartingale relative to Yn . Then for any scalars α < β and
all nonnegative integers m, n,
(43) (β − α)ENn ((α, β]) ≤ E(Xn ∨ 0) + |α|.
Consequently, if sup EXn < ∞, then EN∞ ((α, β]) < ∞, and so the sequence {Xn }n≥0 makes
only finitely many upcrossings of any interval (α, β].
Proof. The trick is similar to that used in the proof of the Maximal Inequalities: define an
appropriate submartingale transform and then use Proposition 5. We begin by making two
simplifications: First, it is enough to consider the special case α = 0, because the general
case may be reduced to this by replacing the original submartingale Xn by the submartingale
Xn′ = Xn − α. (Note that this changes the expectation in the inequality by at most |α|.)
Second, if α = 0, then it is enough to consider the special case where Xn is a nonnegative
submartingale, because if Xn is not nonnegative, it may be replaced by Xn″ = Xn ∨ 0, as this
does not change the number of upcrossings of (0, β] or the value of E(Xn ∨ 0).
Thus, assume that α = 0 and that Xn ≥ 0. Use the stopping times σm , τm defined above
(with α = 0) to define a predictable sequence Zn as follows:
Zn = 0 if n ≤ σ0 ;
Zn = 1 if σm < n ≤ τm+1 ;
Zn = 0 if τm < n ≤ σm .
(Exercise: Verify that this is a predictable sequence.) This sequence has alternating blocks
of 0s and 1s (not necessarily all finite). Over any complete finite block of 1s, the increments
ξk must sum to at least β, because at the beginning of a block (some time σm ) the value
of X is 0, and at the end (the next τm ), the value is back above β. Furthermore, over any
incomplete block of 1s (even one which will never terminate!), the sum of the increments ξk
will be ≥ 0, because at the beginning σm of the block the value Xσm = 0 and Xn never goes
below 0. Hence,
βNn((0, β]) ≤ Σ_{i=1}^{n} Ziξi = (Z · X)n.
Therefore, taking expectations and applying Proposition 5,
βENn((0, β]) ≤ E(Z · X)n ≤ EXn,
and undoing the two reductions made at the outset yields inequality (43). □


9. The Martingale Convergence Theorem


Martingale Convergence Theorem. Let {Xn} be an L¹-bounded submartingale relative
to a sequence {Yn }, that is, a submartingale such that supn E|Xn | < ∞. Then with probability
one the limit
(44) lim_{n→∞} Xn := X∞
exists, is finite, and has finite first moment.
Proof. By the Upcrossings Inequality, for any interval (α, β] with rational endpoints the
sequence {Xn }n≥0 can make only finitely many upcrossings of (α, β]. Equivalently, the
probability that {Xn } makes infinitely many upcrossings of (α, β] is zero. Since there are
only countably many intervals (α, β] with rational endpoints, and since the union of countably
many events of probability zero is an event of probability zero, it follows that with probability
one there is no rational interval (α, β] such that Xn makes infinitely many upcrossings of
(α, β].
Now if xn is a sequence of real numbers that makes only finitely many upcrossings of any
rational interval, then xn must converge to a finite or infinite limit (this is an easy exercise
in elementary real analysis). Thus, it follows that with probability one X∞ := lim_{n→∞} Xn
exists (but may be ±∞). But Fatou’s Lemma implies that
E|X∞| ≤ lim inf_{n→∞} E|Xn| < ∞,
and so in particular the limit X∞ is finite with probability one. 

Corollary 2. Every nonnegative supermartingale converges almost surely.


Proof. If Xn is a nonnegative supermartingale, then −Xn is a nonpositive submartingale.
Moreover, because Xn ≥ 0,
0 ≤ E|Xn | = EXn ≤ EX0 ,
the latter because Xn is a supermartingale. Therefore −Xn is an L¹-bounded submartingale,
to which the Martingale Convergence Theorem applies. 
Note: The Martingale Convergence Theorem asserts, among other things, that the limit
X∞ has finite first moment. However, it is not necessarily the case that E|Xn − X∞ | → 0.
Consider, for example, the martingale Xn that records your fortune at time n when you play
the St. Petersburg game with the “double-or-nothing” strategy on every play. At the first
time you toss a Tail, you will lose your entire fortune and have 0 forever after. Since this is
(almost) certain to happen eventually, Xn → 0 almost surely. But EXn = 1 ≠ 0 for every n!
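The double-or-nothing martingale is easy to simulate (an illustration, not from the notes): starting from X0 = 1, the fraction of surviving paths is 2^{−n}, so Xn → 0 almost surely, while the sample mean of Xn stays near 1 because it is carried by ever fewer, ever larger paths.

    import numpy as np

    rng = np.random.default_rng(8)
    paths, n_max = 1_000_000, 10

    X = np.ones(paths)                           # X_0 = 1 on every path
    for n in range(1, n_max + 1):
        heads = rng.random(paths) < 0.5
        X = np.where(heads, 2.0 * X, 0.0)        # double on Heads, lose everything on Tails
        print(n, "P{X_n > 0} =", (X > 0).mean(), "  E X_n =", X.mean())
    # The survival probability 2^{-n} tends to 0 (so X_n -> 0 a.s.), while the sample mean
    # stays near 1; for large n the mean is carried by a tiny fraction of very large paths,
    # which is exactly why E|X_n - X_infinity| does not tend to 0.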
(To be continued.)

