Foundations of Quantitative Finance Book IV
Chapman & Hall/CRC Financial Mathematics Series
Series Editors
M.A.H. Dempster
Centre for Financial Research
Department of Pure Mathematics and Statistics
University of Cambridge, UK
Dilip B. Madan
Robert H. Smith School of Business
University of Maryland, USA
Rama Cont
Department of Mathematics
Imperial College, UK
Robert A. Jarrow
Lynch Professor of Investment Management
Johnson Graduate School of Management
Cornell University, USA
Commodities: Fundamental Theory of Futures, Forwards, and Derivatives Pricing, Second Edition
Edited by M.A.H. Dempster and Ke Tang
Sustainable Life Insurance: Managing Risk Appetite for Insurance Savings & Retirement Products
Aymeric Kalife with Saad Mouti, Ludovic Goudenege, Xiaolu Tan, and Mounir Bellmane
Robert R. Reitano
Brandeis International Business School
Waltham, MA
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003264583
Typeset in CMR10
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
to Dorothy and Domenic
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
Contents
Preface
Author
Introduction
3 Order Statistics
3.1 M-Samples and Order Statistics
3.2 Distribution Functions of Order Statistics
3.3 Density Functions of Order Statistics
3.4 Joint Distribution of All Order Statistics
3.5 Density Functions on R^n
3.6 Multivariate Order Functions
3.6.1 Joint Density of All Order Statistics
3.6.2 Marginal Densities and Distributions
3.6.3 Conditional Densities and Distributions
3.7 The Rényi Representation Theorem
Bibliography
Index
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
Preface
The idea for a reference book on the mathematical foundations of quantitative finance
has been with me throughout my professional and academic careers in this field, but the
commitment to finally write it didn’t materialize until completing my first “introductory”
book in 2010.
My original academic studies were in “pure” mathematics in a field of mathematical
analysis, and neither applications generally nor finance in particular were then even on
my mind. But on completion of my degree, I decided to temporarily investigate a career in
applied math, becoming an actuary, and in short order became enamored with mathematical
applications in finance.
One of my first inquiries was into better understanding yield curve risk management, ultimately introducing the notion of partial durations and related immunization strategies. This
experience led me to recognize the power of greater precision in the mathematical specification and solution of even an age-old problem. From there my commitment to mathematical
finance was complete, and my temporary investigation into this field became permanent.
In my personal studies, I found that there were a great many books in finance that
focused on markets, instruments, models and strategies, and which typically provided an
informal acknowledgement of the background mathematics. There were also many books
in mathematical finance focusing on more advanced mathematical models and methods,
and typically written at a level of mathematical sophistication requiring a reader to have
significant formal training and the time and motivation to derive omitted details.
The challenge of acquiring expertise is compounded by the fact that the field of quantitative finance utilizes advanced mathematical theories and models from a number of fields.
While there are many good references on any of these topics, most are again written at
a level beyond many students, practitioners and even researchers of quantitative finance.
Such books develop materials with an eye to comprehensiveness in the given subject matter,
rather than with an eye toward efficiently curating and developing the theories needed for
applications in quantitative finance.
Thus the overriding goal I have for this collection of books is to provide a complete and
detailed development of the many foundational mathematical theories and results one finds
referenced in popular resources in finance and quantitative finance. The included topics
have been curated from a vast mathematics and finance literature for the express purpose
of supporting applications in quantitative finance.
I originally budgeted 700 pages per book, in two volumes. It soon became obvious
this was too limiting, and two volumes ultimately turned into ten. In the end, each book
was dedicated to a specific area of mathematics or probability theory, with a variety of
applications to finance that are relevant to the needs of financial mathematicians.
My target readers are students, practitioners and researchers in finance who are quantitatively literate, and recognize the need for the materials and formal developments presented.
My hope is that the approach taken in these books will motivate readers to navigate these
details and master these materials.
Most importantly for a reference work, all ten volumes are extensively self-referenced.
The reader can enter the collection at any point of interest, and then using the references
cited, work backwards to prior books to fill in needed details. This approach also works for
a course on a given volume’s subject matter, with earlier books used for reference, and for
both course-based and self-study approaches to sequential studies.
The reader will find that the developments herein are presented at a much greater level
of detail than most advanced quantitative finance books. Such developments are of necessity
typically longer, more meticulously reasoned, and therefore can be more demanding on the
reader. Thus before committing to a detailed line-by-line study of a given result, it is always
more efficient to first scan the derivation once or twice to better understand the overall logic
flow.
I hope the additional details presented will support your journey to better understanding.
I am grateful for the support of my family: Lisa, Michael, David, and Jeffrey, as well as
the support of friends and colleagues at Brandeis International Business School.
Robert R. Reitano
Brandeis International Business School
Author
Introduction
The series is logically sequential. Books I, III, and V develop foundational mathematical
results needed for the probability theory and finance applications of Books II, IV, and
VI, respectively. Then Books VII, VIII, and IX develop results in the theory of stochastic
processes. While these latter three books introduce ideas from finance as appropriate, the
final realization of the applications of these stochastic models to finance is deferred to Book
X.
This Book IV, Distribution Functions and Expectations, extends the investigations of Book II using the formidable tools afforded by the Riemann, Lebesgue, and Riemann-Stieltjes integration theories of Book III.
To set the stage, Chapter 1 opens with a short review of the key results on distribution
functions from Books I and II. The focus here is on the connections between distribution
functions of random variables and random vectors and distribution functions induced by
Borel measures on R and R^n. A complete functional characterization of distribution functions on R is then derived, providing a natural link between general probability theory and
the discrete and continuous theories commonly encountered. This leads to an investigation
into the existence of density functions associated with various distribution functions. Here
the integration theories from Book III are recalled to frame this investigation, and the general results to be seen in Book VI using the Book V integration theory are introduced. The
chapter ends with a catalog of many common distribution and density functions from the
discrete and continuous probability theories.
Chapter 2 investigates transformations of random variables. For example, given a random
variable X and associated distribution/density function, what is the distribution/density
function of the random variable g(X) given a Borel measurable function g(x)? More generally, what are the distribution functions and densities of sums and ratios of random variables, where now g(x) is a multivariate function? The first section addresses the distribution
function question for strictly monotonic g(x) and the density question when such g(x) is
differentiable. More general transformations are deferred to Book VI using the change of
variable results from the integration theory of Book V. A number of results are then derived for the distribution and density functions of sums of independent random variables
using the integration theories of Book III. The various forms of such distribution functions
then reflect the assumptions made on the underlying distribution functions and/or density
functions. Examples and exercises connect the theory with the Chapter 1 catalog of distribution functions. The chapter ends with an investigation into ratios of independent random
variables, as well as an example using dependent random variables.
A special example of an order statistic was introduced in Chapter 9 of Book II on
extreme value theory, where this random variable was defined as the maximum of a collection
of independent, identically distributed random variables. Order statistics, the subject of
Chapter 3, generalize this notion, converting such a collection into ordered random variables.
Distribution and density functions of such variates are first derived, before turning to the
joint distribution function of all order statistics. This latter derivation introduces needed
combinatorial ideas as well as results on multivariate integration from Book III and a more
general result from Book V. Various density functions of order statistics are then derived,
beginning with the joint density and then proceeding to the various marginal and conditional
densities of these random variables. The final investigation is into the Rényi representation
theorem for the order statistics of an exponential distribution. While seemingly of narrow applicability as a result about exponential variables, this theorem will be seen to be more widely applicable.
Expectations of random variables and transformed random variables are introduced in
Chapter 4 in the general context of a Riemann-Stieltjes integral. In the special case of discrete or continuous probability theory, this definition reduces to the familiar notions from
these theories using Book III results. But this definition also raises existence and consistency questions. The roadmap to a final solution is outlined, foretelling needed results from
the integration theory of Book V and the final detailed resolution in Book VI. Various
moments, the moment generating function, and properties of such are then developed, as
well as examples from the distribution functions introduced earlier. Moment inequalities of
Chebyshev, Jensen, Kolmogorov, Cauchy-Schwarz, Hölder, and Lyapunov are derived, before turning to the question of uniqueness of moments and the moment generating function.
The chapter ends with an investigation of weak convergence of distributions and moment
limits, developing a number of results underlying the “method of moments.”
Given a random variable X defined on a probability space, Chapter 4 of Book II derived
the theoretical basis for, and several constructions of, a probability space on which could be
defined a countable collection of independent random variables, identically distributed with
X. Such spaces provide a rigorous framework for the laws of large numbers of that book, and
the limit theorems of this book’s Chapter 6. This framework is in the background for Chapter
5, but the focus here is on the actual generation of random sample collections using the
previous theory and the various distribution functions introduced in previous chapters. The
various sections then exemplify simulation approaches for discrete distributions, and then
continuous distributions, using the left-continuous inverse function F*(y) and independent,
continuous uniform variates commonly provided by various mathematical software. For
generating normal, lognormal, and Student’s T variates, these constructions are, at best,
approximate, and the chapter derives the exact constructions underlying the Box-Muller
transform and the Bailey transform, respectively. The final section turns to the simulation
of order statistics, both directly and with the aid of the Rényi representation theorem.
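The Box-Muller transform mentioned here is easily sketched. The following is a generic illustration rather than the book's derivation; the sample size and seed are arbitrary choices. Two independent uniform variates map to two independent standard normal variates:

```python
import numpy as np

# Box-Muller transform: a pair of independent uniform(0,1) variates (u1, u2)
# maps to a pair of independent standard normal variates (z1, z2).
rng = np.random.default_rng(42)
n = 100_000
u1 = rng.uniform(size=n)
u2 = rng.uniform(size=n)

r = np.sqrt(-2.0 * np.log(u1))     # radius, driven by u1
z1 = r * np.cos(2.0 * np.pi * u2)  # angle, driven by u2
z2 = r * np.sin(2.0 * np.pi * u2)
```

Exponentiating, np.exp(z1) then yields a lognormal variate from the same construction.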
Chapter 6 begins with a more formal short review of the theoretical framework of Book
II for the construction of a probability space on which a countable collection of independent,
identically distributed random variables can be defined, and thus on which limit theorems
of various types can be addressed. The first section then addresses weak convergence of
various distribution function sequences. Among those studied are the Student’s T, Poisson,
DeMoivre-Laplace, and a first version of the central limit theorem, as well as Smirnov’s result
on uniform order statistics, a general result on exponential order statistics, and finally a limit
theorem on quantiles. The next section generalizes the study of laws of large numbers of
Book II using moment defined limits, and proves a limit theorem on extreme value theory
identified in that book. The final section studies empirical distribution functions, and in
particular, derives the Glivenko-Cantelli theorem on convergence of empirical distributions
to the underlying distribution function. Kolmogorov’s theorem on the limiting distribution
of the maximum error in an empirical distribution is also discussed, as are related results.
Continuing the study initiated in Chapter 9 of Book II, Chapter 7 again has two main
themes. The first topic is large deviation theory. Following a summary of the main result
and open questions of Book II, the section introduces and exemplifies the Chernoff bound,
which requires the existence of the moment generating function. Following an analysis of
properties of this bound, and introducing tilted distributions and their relevant properties,
the section concludes with the Cramér-Chernoff theorem, which conclusively settles the
open questions of Book II. The second major section is on extreme value theory and focuses
on two matters. The first is a study of the Hill estimator for the extreme value index γ for
γ > 0, the index values most commonly encountered in finance applications. This estimator
is introduced and exemplified in the context of Pareto distributions, and the Hill result of
convergence with probability 1 derived, along with a variety of related results. For this,
earlier developments in order statistics will play a prominent role, as does a representation
theorem of Karamata. The second major investigation is into the Pickands-Balkema-de
Haan theorem, a result that identifies the limiting distribution of certain conditional tail
distributions. This final result was approximated in the Book II development, but here it can
be derived in detail with another representation theorem of Karamata. Using an example,
it is then shown that the convergence promised by this result need not be fast.
I hope this book and the other books in the collection serve you well.
Notation 0.1 (Referencing within FQF Series) To simplify the referencing of results
from other books in this series, we use the following convention.
A reference to “Proposition I.3.33” is a reference to Proposition 3.33 of Book I, while
“Chapter III.4” is a reference to Chapter 4 of Book III, and “II.(8.5)” is a reference to
formula (8.5) of Book II, and so forth.
1
Distribution and Density Functions
Notation 1.1 (µ → λ) In Book II, probability spaces were generally denoted by (S, E, µ),
where S is the measure space, here also called a “sample” space, E is the sigma algebra of
measurable sets, here also called the collection of “events,” and µ is the probability measure
defined on all sets in E. In this book we retain most of this notational convention. However,
because µ will often be called upon in later chapters to represent the “mean” of a given
distribution as is conventional, we will represent the probability measure herein by λ or by
another Greek letter.
Definition 1.2 (Random variable) Given a probability space (S, E, λ), a random variable (r.v.) is a real-valued function:
X : S −→ R,
so that for all intervals (a, b) ⊂ R:
X^{-1}(a, b) ∈ E.
This result also reflects the characterizing properties of such distribution functions. A
function F : R → R can be identified as a distribution function by Proposition II.3.6:
Proposition 1.4 (Identifying a distribution function) Let F (x) be an increasing
function, which is right continuous and satisfies F (−∞) = 0 and F (∞) = 1, defined as
limits.
Then there exists a probability space (S, E, λ) and random variable X so that F (x) =
λ[X −1 (−∞, x]]. In other words, every such function is the distribution function of a random
variable.
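As a concrete illustration of this proposition, the left-continuous inverse F*(y) = inf{x : F(x) ≥ y}, which reappears in Chapter 5, converts a uniform variate U into a variate X = F*(U) with distribution function F. Below is a minimal numerical sketch, assuming an exponential F and a generic bisection search for F*; the search bracket and tolerance are arbitrary choices, and for this F the closed form F*(y) = −ln(1 − y) would also do:

```python
import numpy as np

def F(x):
    # Example distribution function: exponential, F(x) = 1 - e^{-x} for x >= 0.
    return 1.0 - np.exp(-x) if x >= 0 else 0.0

def F_star(y, lo=-1.0, hi=60.0, tol=1e-9):
    # Left-continuous inverse F*(y) = inf{x : F(x) >= y}, by bisection.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) >= y:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
u = rng.uniform(size=20_000)
x = np.array([F_star(ui) for ui in u])  # X = F*(U)

# The empirical distribution of X should approximate F:
for t in (0.5, 1.0, 2.0):
    print(t, np.mean(x <= t), F(t))
```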
Distribution functions are also intimately linked with Borel measures by Proposition
II.6.3. Recall that A0 denoted the semi-algebra of right semi-closed intervals (a, b], A the
associated algebra of finite disjoint unions of such sets, and B(R) the Borel sigma algebra
of Definition I.2.13. By definition, A ⊂ B(R).
The outer measure λ∗_A is defined in I.(5.8), induced by the set function λ_A defined on A_0 by (1.6) and extended to A by finite additivity. A set A is said to be λ∗_A-measurable, or Carathéodory measurable, and the collection of such sets is denoted M_F(R). Then:
2. M_F(R) is a complete sigma algebra and thus contains every set A ⊂ R with λ∗_A(A) = 0.
3. M_F(R) contains the Borel sigma algebra, B(R) ⊂ M_F(R).
4. If λ_F denotes the restriction of λ∗_A to M_F(R), then λ_F is a probability measure and hence (R, M_F(R), λ_F) is a complete probability space.
5. The probability measure λ_F is the unique extension of λ_A from A to the smallest sigma algebra generated by A, which is B(R) by Proposition I.8.1.
6. For all A ∈ B(R):
λ_F(A) = λ(X^{-1}(A)). (1.7)
Definition 1.7 (Random vector) Given a probability space (S, E, λ), a random vector is a mapping X : S −→ R^n, so that for all A ∈ B(R^n), the sigma algebra of Borel measurable sets on R^n:
X^{-1}(A) ∈ E. (1.9)
The joint distribution function (d.f.), or joint cumulative distribution function (c.d.f.), associated with X, denoted by F or F_X, is then defined on (x_1, x_2, ..., x_n) ∈ R^n by:
F(x_1, x_2, ..., x_n) = λ[X^{-1}(∏_{j=1}^n (−∞, x_j])]. (1.10)
Properties of such functions, and the link to Borel measures on R^n, were summarized in Proposition II.6.9. But first, recall the definitions of continuous from above and n-increasing.
Each x = (x_1, ..., x_n) in the summation is one of the 2^n vertices of this rectangle, so x_i = a_i or x_i = b_i, and sgn(x) is defined as −1 if the number of a_i-components of x is odd, and +1 otherwise.
These properties are common to all joint distribution functions, and are exactly the
properties needed to generate Borel measures, by Proposition II.6.9. Below we quote this
result and add properties from Propositions I.8.15 and I.8.16.
λ_F(A) = λ(X^{-1}(A)), (1.14)
and thus by (1.10), for A = ∏_{j=1}^n (−∞, x_j]:
λ_F(∏_{j=1}^n (−∞, x_j]) = F(x).
Joint distribution functions induce both marginal distribution functions and con-
ditional distribution functions. We recall Definitions II.3.34 and II.3.39.
2. General Case: Given F(x_1, x_2, ..., x_n) and I = {i_1, ..., i_m} ⊂ {1, 2, ..., n}, let x_J ≡ (x_{j_1}, x_{j_2}, ..., x_{j_{n−m}}) for j_k ∈ J ≡ Ĩ, the complement of I. The marginal distribution function F_I(x_I) ≡ F_I(x_{i_1}, x_{i_2}, ..., x_{i_m}) is defined on R^m by:
Finally, recall the notion of independent random variables from Definition II.3.47,
and the characterization of the joint distribution function of such variables by Proposition
II.3.53.
If X_j : S −→ R^{n_j} are random vectors on (S, E, λ), j = 1, 2, ..., n, we say that {X_j}_{j=1}^n are independent random vectors if {σ(X_j)}_{j=1}^n are independent sigma algebras. That is, given {B_j}_{j=1}^n with B_j ∈ σ(X_j):
λ(∩_{j=1}^n B_j) = ∏_{j=1}^n λ(B_j). (1.20)
Equivalently, given {A_j}_{j=1}^n with A_j ∈ B(R^{n_j}):
λ(∩_{j=1}^n X_j^{-1}(A_j)) = ∏_{j=1}^n λ(X_j^{-1}(A_j)). (1.21)
A countable collection of random variables {X_j}_{j=1}^∞ defined on (S, E, λ) are said to be independent random variables if given any finite index subcollection J = (j(1), j(2), ..., j(n)), {X_{j(i)}}_{i=1}^n are independent random variables. The analogous definition of independence applies to a countable collection of random vectors.
Countably many random variables {X_j}_{j=1}^∞ defined on (S, E, λ) are independent random variables if and only if for every finite index subcollection J = (j(1), j(2), ..., j(n)):
F(x_{j(1)}, x_{j(2)}, ..., x_{j(n)}) = ∏_{i=1}^n F_{j(i)}(x_{j(i)}). (1.23)
These results are valid for random vectors X_j : S −→ R^{n_j}, noting that F_j(x_j) are joint distribution functions on R^{n_j} as given in (1.8) or (1.10).
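A quick empirical illustration of (1.23) for two independent random variables; the particular marginals, the sample size, and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1 = rng.uniform(size=n)       # X1 uniform on (0,1): F1(s) = s on [0,1]
x2 = rng.exponential(size=n)   # X2 exponential(1): F2(t) = 1 - e^{-t}

s, t = 0.4, 1.0
joint = np.mean((x1 <= s) & (x2 <= t))  # empirical joint F(s, t)
product = s * (1.0 - np.exp(-t))        # F1(s) * F2(t)
print(joint, product)
```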
where:
f_n(x) = 0, x < x_n,
         u_n, x = x_n,
         u_n + v_n, x > x_n.
In other words,
f(x) = Σ_{x_n ≤ x} u_n + Σ_{x_n < x} v_n.
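To make the formula concrete, here is a small sketch that evaluates a saltus function from hypothetical jump data {x_n, u_n, v_n}; the jump points and jump sizes are illustrative inventions:

```python
def saltus(x, xs, us, vs):
    # f(x) = sum of u_n over {n : x_n <= x}, plus sum of v_n over {n : x_n < x}
    left = sum(u for xn, u in zip(xs, us) if xn <= x)
    right = sum(v for xn, v in zip(xs, vs) if xn < x)
    return left + right

# Hypothetical jump points and jump sizes:
xs = [0.0, 1.0, 2.0]
us = [0.2, 0.3, 0.1]   # left jumps: u_n = f(x_n) - f(x_n^-)
vs = [0.1, 0.0, 0.05]  # right jumps: v_n = f(x_n^+) - f(x_n)

print(saltus(-1.0, xs, us, vs))  # no jump points at or below -1
print(saltus(0.0, xs, us, vs))   # only the u-jump at x_n = 0 contributes
print(saltus(2.5, xs, us, vs))   # all u_n and all v_n contribute
```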
In addition to saltus functions, recall the following functions from Definitions III.3.49
and III.3.54.
for any finite collection of disjoint subintervals {(x'_i, x_i)}_{i=1}^n ⊂ [a, b] with:
Σ_{i=1}^n |x_i − x'_i| < δ.
Remark 1.17 The relevant facts from Book III on such functions are:
• Proposition III.3.53: Singular functions exist, and this proposition illustrates this
with the Cantor function of Definition III.3.51, named for Georg Cantor (1845–
1918).
• Proposition III.3.62: Absolutely continuous functions are characterized as fol-
lows:
A function f(x) is absolutely continuous on [a, b] if and only if f(x) equals the Lebesgue integral of its derivative on this interval:
f(x) = f(a) + (L)∫_a^x f'(y) dy.
Implicit in this result is that f'(x) exists almost everywhere and is Lebesgue integrable. Any continuously differentiable function f(x) is absolutely continuous by the mean value theorem (Remark III.3.56), and then the above representation is also valid as a Riemann integral.
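This characterization can be illustrated numerically. The sketch below uses the logistic distribution function as an example of a continuously differentiable F, approximating the integral of F' by a simple trapezoid rule; the interval endpoints and grid size are arbitrary choices:

```python
import numpy as np

F = lambda x: 1.0 / (1.0 + np.exp(-x))             # logistic distribution function
dF = lambda x: np.exp(-x) / (1.0 + np.exp(-x))**2  # its derivative F'(x)

a, x = -5.0, 2.0
grid = np.linspace(a, x, 100_001)
vals = dF(grid)
# Trapezoid approximation of the integral of F' over [a, x]:
integral = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))

print(F(a) + integral, F(x))  # the two sides of the representation
```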
where:
Proof. By Proposition 1.3, F(x) is increasing and thus differentiable almost everywhere by Proposition III.3.12. In addition, F(x) is continuous from the right, has left limits, and has at most countably many points of discontinuity, which we denote {x_n}_{n=1}^∞. At such points define:
u_n = F(x_n) − F(x_n^−),
Then F_SLT(x) is increasing, and by definition F_SLT(−∞) = 0 and F_SLT(∞) = 1, defined as limits. To prove right continuity, let x and ε > 0 be given. Then for x ≤ y:
F_SLT(y) − F_SLT(x) = Σ_{x<x_n≤y} u_n / α,
where this summation is finite or countable. In the first case, there is δ so that for y ≤ x + δ this summation is zero, and this difference is then bounded by ε. If countable, then since Σ_{n=1}^∞ u_n ≤ 1, the summation is convergent and can be made arbitrarily small by eliminating finitely many terms. Reducing y to eliminate this finite set, the above difference can again be made arbitrarily small. Thus F_SLT(x) is a saltus distribution function by Definition 1.15 and Proposition 1.4.
Next, let G(x) ≡ F(x) − αF_SLT(x). If y > x then F(y^−) ≥ F(x^−) and:
αF_SLT(y) − αF_SLT(x) ≡ Σ_{x<x_n≤y} [F(x_n) − F(x_n^−)]
≤ F(y) − F(x).
This obtains G(y) − G(x) ≥ 0, and so G(x) is increasing. Also, by consideration of the component functions, G(−∞) = 0 and G(∞) = 1 − α, defined as limits.
Further, G(x) is continuous. First, right continuity follows from right continuity of the components. If x is a left discontinuity of G(x), so G(x) − G(x^−) > 0, then this implies:
But this is a contradiction. Either x is a continuity point of F, so F(x) = F(x^−) and this sum converges to 0, or x = x_n is a left discontinuity of F and this sum converges to u_n = F(x_n) − F(x_n^−) by construction.
By Proposition III.3.12, increasing G(x) is differentiable almost everywhere, G'(x) ≥ 0 by Corollary III.3.13, and G'(x) is Lebesgue integrable on every interval [a, b] by Proposition III.3.19, with:
(L)∫_a^b G'(y) dy ≤ G(b) − G(a). (1)
It follows that the value of this integral increases as a → −∞ and/or b → ∞, and that the limit exists since bounded by:
0 ≤ (L)∫_{−∞}^∞ G'(y) dy ≤ 1 − α.
If this improper integral equals 0 then define β = 0, and otherwise, let β ≡ (L)∫_{−∞}^∞ G'(y) dy.
Define:
F̃_AC(x) ≡ (L)∫_{−∞}^x G'(y) dy. (2)
By (2) and (3), the integrals of G'(y) and F̃'_AC(y) agree over (−∞, x] for all x, and so also β ≡ (L)∫_{−∞}^∞ F̃'_AC(y) dy.
Now define:
F_AC(x) ≡ (L)∫_{−∞}^x F̃'_AC(y) dy / β. (1.26)
Then FAC (x) = F̃AC (x)/β is absolutely continuous, and is a distribution function since
FAC (−∞) = 0 and FAC (∞) = 1.
Finally consider H(x) ≡ G(x)−βFAC (x). Note that H(−∞) = 0 and H(∞) = 1−α−β.
As a difference of continuous increasing functions, H(x) is continuous and of bounded
variation (Proposition III.3.29), and differentiable almost everywhere by Corollary III.3.32.
But H'(x) ≡ G'(x) − βF'_AC(x) = 0 a.e., and this obtains by item 3 of Proposition III.2.49 and (1):
βF_AC(b) − βF_AC(a) = β∫_a^b F'_AC(y) dy = ∫_a^b G'(y) dy ≤ G(b) − G(a).
satisfies:
C_i(u_i) = u_i.
The joint distribution functions C(u1 , u2 , ..., un ) are called copulas.
While the above proposition applies to the marginals {Fj (x)}nj=1 of F (x1 , x2 , ..., xn ),
there does not appear to be a feasible way to use this to decompose F (x1 , x2 , ..., xn ).
defined on R with:
(L)∫_{−∞}^∞ f(y) dy = 1.
We say that f(x) is a density function associated with F(x) in the Lebesgue sense if for all x:
F(x) = (L)∫_{−∞}^x f(y) dy. (1.29)
A density f (x) cannot be unique since if g(x) = f (x) a.e., then g(x) is also a density for
F (x) by item 3 of Proposition III.2.31.
This then leads to the question:
Which distribution functions, or equivalently, which random variables X on (S, E, λ),
have density functions in the Lebesgue sense?
A density function can also be defined relative to the measure λ_F induced by F(x) of Proposition 1.5, and this will prove useful for generalizations in Book V. We say f(x) is a density function associated with λ_F if for all x and A_x ≡ (−∞, x]:
λ_F(A_x) = (L)∫_{A_x} f(y) dy ≡ (L)∫_{−∞}^x f(y) dy.
By definition such a function satisfies the integrability condition above since λ_F(R) = 1.
By finite additivity of measures and item 7 of Proposition III.2.31, it then follows that for any right semi-closed interval A_{(a,b]} = (a, b]:
λ_F((a, b]) = (L)∫_{A_{(a,b]}} f(y) dy ≡ (L)∫_a^b f(y) dy,
What is the connection between λ_F(A) and the density f(x) for other Borel sets A ∈ B(R)?
A very good start to an answer would be to investigate the set function λ_f defined on the Borel sigma algebra B(R) by:
λ_f(A) = (L)∫_A f(y) dy. (1.30)
The set function λf is well-defined since such Lebesgue integrals are well-defined by Defi-
nition III.2.9.
By definition, the set function λf and the Borel measure λF induced by F agree on
A0 , the semi-algebra of right, semi-closed intervals. By extension they also agree on the
associated algebra A of finite disjoint unions of such sets. The uniqueness of extensions
theorem of Proposition I.6.14 then assures that for all A ∈ B (R):
λf (A) = λF (A),
since B (R) is the smallest sigma algebra that contains A by Proposition I.8.1.
Generalizing, it will be proved in Book V that given any Lebesgue integrable function
f (x), the set function λf of (1.30) is always a measure on B (R) .
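The agreement of λ_f and λ_F on intervals, and the additivity of λ_f over a finite disjoint union, can be checked numerically for a concrete density. The sketch below uses the exponential density; the intervals and grid size are arbitrary choices:

```python
import math

F = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0  # exponential distribution function
f = lambda y: math.exp(-y) if y > 0 else 0.0        # an associated density

def lam_f(a, b, n=20_000):
    # lambda_f((a,b]) = integral of f over (a,b], approximated by a trapezoid rule
    h = (b - a) / n
    return sum(0.5 * (f(a + i * h) + f(a + (i + 1) * h)) * h for i in range(n))

def lam_F(a, b):
    # Borel measure induced by F on a right semi-closed interval: F(b) - F(a)
    return F(b) - F(a)

# Agreement on intervals, and additivity over a finite disjoint union:
intervals = [(0.0, 1.0), (2.0, 3.0), (5.0, 7.0)]
union_f = sum(lam_f(a, b) for a, b in intervals)
union_F = sum(lam_F(a, b) for a, b in intervals)
print(union_f, union_F)
```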
Returning to the existence question on a density function associated with the distribution
function F (x), there are two ways to frame an answer.
1. By Proposition III.3.62, a Lebesgue measurable density f(x) that satisfies (1.29) exists if and only if F(x) is an absolutely continuous function on every interval [a, b].
This follows because (1.29) and item 7 of Proposition III.2.31 obtain for all a:
F(x) = F(a) + (L)∫_a^x f(y) dy.
This proposition also states that (1.29) is satisfied with f(x) ≡ F'(x), and then by item 8 of Proposition III.2.31, it is also satisfied by any measurable function f(x) such that f(x) = F'(x) a.e.
Thus F(x) has a density function f(x) in the Lebesgue sense if and only if F(x) is absolutely continuous.
2. By item 3 of Proposition III.2.31, if the induced measure λ_F has a density function, then λ_F(A) = 0 for every measurable set with m(A) = 0, where m denotes Lebesgue measure. This follows from (1.30) and λ_F = λ_f, since by Definition III.2.9:
∫_A f(y) dy ≡ ∫ χ_A(y)f(y) dy.
Here χ_A(y) = 1 on A and is 0 otherwise, and thus λ_F(A) = 0 since χ_A(y)f(y) = 0 a.e.
Thus if the measure λ_F induced by F(x) has a density function f(x), then λ_F(A) = 0 for every set for which m(A) = 0.
This will follow from a deep and general result known as the Radon-Nikodým theorem, named for Johann Radon (1887–1956) who proved this result on R^n, and Otto Nikodým (1887–1974) who generalized Radon's result to all σ-finite measure spaces.
A consequence of this result, which will see generalizations in many ways, is then:
The measure λF induced by F (x) has a density function f (x) if and only if λF (A) = 0
for every set for which m(A) = 0.
Then only F_AC(x) has a density function in the Lebesgue sense, with:
f_AC(x) = F'_AC(x) a.e.
Proof. This result is Proposition III.3.62, which states that F (x) has a density if and only
if F (x) is absolutely continuous.
λ_{F_AC} ≪ m.
This notation is read: the Borel measure λ_{F_AC} is absolutely continuous with respect to Lebesgue measure m. While this notation is suggestive, that m(A) = 0 forces λ_{F_AC}(A) = 0, it is not intended to imply any other relationship between these measures on other sets.
For F_SLT, let E_1 ≡ {x_n}_{n=1}^∞, which in the notation of the proof of Proposition 1.18 are the discontinuities of F(x). Then m(E_1) = λ_{F_SLT}(Ẽ_1) = 0, where Ẽ_1 ≡ R − E_1 denotes the complement of E_1. This follows because, in the notation of Proposition 1.3, this measure is 0 on every set that is outside E_1. For F_SN, let E_2 be defined as the set of measure 0 on which F'_SN(x) ≠ 0. Then by definition m(E_2) = 0, while λ_{F_SN}(Ẽ_2) = 0 by Proposition I.5.30.
Thus, the sets on which $m$ and $\lambda_{F_{SLT}}$ are "supported," meaning on which they have nonzero measure, are complementary. The same is true for $m$ and $\lambda_{F_{SN}}$. In Book V this relationship will be denoted:
\[ m \perp \lambda_{F_{SLT}}, \qquad m \perp \lambda_{F_{SN}}. \]
This reads: Lebesgue measure $m$ and the Borel measure $\lambda_{F_{SLT}}$ (respectively $\lambda_{F_{SN}}$) are mutually singular.
Now define $E = E_1 \cup E_2$. Then $m(E) = \lambda_{F_{SLT}}(\tilde{E}) = \lambda_{F_{SN}}(\tilde{E}) = 0$, and thus:
\[ m \perp (\lambda_{F_{SLT}} + \lambda_{F_{SN}}). \]
and we have derived a special case of Lebesgue's decomposition theorem, named for Henri Lebesgue (1875–1941).
\[ \lambda_F = \nu_{ac} + \nu_s. \]
By the discussion above, when $\lambda_F$ is the Borel measure induced by a distribution function $F(x)$, then:
\[ \nu_{ac} = \lambda_{F_{AC}}, \qquad \nu_s = \lambda_{F_{SLT}} + \lambda_{F_{SN}}, \]
is such a decomposition.
on $\mathbb{R}$ with:
\[ (R)\int_{-\infty}^{\infty} f(y)\,dy = 1. \]
We say that $f(x)$ is a density function associated with $F(x)$ in the Riemann sense if for all $x$:
\[ F(x) = (R)\int_{-\infty}^{x} f(y)\,dy. \tag{1.31} \]
Again in this context, such a density f (x) cannot be unique, at least if bounded. By the
Lebesgue existence theorem of Proposition III.1.22, if the above f (x) is bounded, then it
must be continuous almost everywhere on every interval [a, b], and thus continuous almost
everywhere. If g(x) is bounded and continuous almost everywhere, and g(x) = f (x) a.e.,
then g(x) is also a density for F (x). This follows because by the proof of Proposition III.1.22,
the value of the Riemann integral is determined by the continuity points of the integrand.
Proposition III.1.33 then states that such a distribution function $F(x)$ is differentiable almost everywhere, and $F'(x) = f(x)$ at each continuity point of $f(x)$.
For bounded densities, it follows from Proposition III.2.56 that the Lebesgue integral of
f (x) over R is 1, and that F (x) is also definable as a Lebesgue integral. Thus every density
in the Riemann sense is a density in the Lebesgue sense.
Continuous probability theory further specializes the above discussion to continuous
densities f (x). Such densities are unique within the class of continuous functions, and now
Proposition III.1.33 assures that F 0 (x) = f (x) for all x. This result also obtains that the
only distribution functions that have continuous densities are the continuously differentiable
distribution functions.
Exercise 1.24 (Continuous densities are unique) Prove that if a distribution function $F(x)$ has two continuous density functions $f(x)$ and $\tilde{f}(x)$ in the Riemann sense, then $f(x) = \tilde{f}(x)$ for all $x$. Hint: If $f(x_0) > \tilde{f}(x_0)$, then by continuity this inequality applies on $(x_0 - \epsilon, x_0 + \epsilon)$. Calculate $F(x_0 + \epsilon) - F(x_0 - \epsilon)$.
But an evaluation with Riemann-Stieltjes sums of Definition III.4.3 obtains for all $[a, b]$:
Letting $a \to -\infty$, then $G(-\infty) = F(-\infty) = 0$ obtains that $G(b) = F(b)$ for all $b$, and thus:
\[ F(x) \equiv \int_{-\infty}^{x} dF. \tag{1.32} \]
By Proposition 1.18:
and each of these distribution functions can be represented as in (1.32). Thus by item 5 of Proposition III.4.24:
\[ F(x) = \alpha \int_{-\infty}^{x} dF_{SLT} + \beta \int_{-\infty}^{x} dF_{AC} + \gamma \int_{-\infty}^{x} dF_{SN}. \tag{1.33} \]
We now can obtain density functions for two of these distribution functions by Propo-
sition III.4.28. Further, the density functions in (1.34) and (1.35) are the density functions
of discrete probability theory, and continuous probability theory, respectively.
and thus:
\[ F_{SLT}(x) = \sum_{x_n \leq x} f_{SLT}(x_n). \tag{1.34} \]
defined as a Riemann integral. In this case, we define the unique continuous density function $f_{AC}$ of $F_{AC}$ by:
\[ f_{AC}(x) = F'_{AC}(x), \]
and thus:
\[ F_{AC}(x) = (R)\int_{-\infty}^{x} f_{AC}(y)\,dy. \tag{1.35} \]
Remark 1.26 (On Proposition 1.25) A few comments on the above result.
1. If $F_{SLT}(x)$ is defined as in (1.25) and $\{x_n\}_{n=1}^{\infty}$ has accumulation points, it is conventional to extend (1.34) to this case even though this does not follow directly from Proposition III.4.28. Instead, we can use right continuity of distribution functions. For any $x_n$, it follows by definition that:
\[ F_{SLT}(x_n) = \sum_{x'_n \leq x_n} f_{SLT}(x'_n). \tag{1} \]
Given an accumulation point $x \neq x_n$ for any $n$, consider $\{x_n > x\}$. If $x_n \to x$, then by right continuity of $F_{SLT}$ and (1):
\[ F_{SLT}(x) = \lim_{n \to \infty} F_{SLT}(x_n) = \sum_{x'_n \leq x} f_{SLT}(x'_n). \]
3. Finally, there is little hope of defining a density function for a singular distribution function $F_{SN}(x)$. Since $F'_{SN}(x) = 0$ almost everywhere, the Lebesgue integral of this function is zero, and (1.35) cannot be valid. Similarly, a representation as in (1.34) would in general not be feasible, summing over all $x_\alpha \leq x$ for which $F'_{SN}(x_\alpha) \neq 0$. For the Cantor function of Definition III.3.51, for example, such points are uncountable.
In virtually all cases in discrete probability theory, it is the probability density func-
tions that are explicitly defined in a given application. Frequently encountered examples of
discrete distribution functions are the discrete rectangular, binomial, geometric, negative
binomial, and Poisson distribution functions.
Example 1.27 1. Discrete Rectangular Distribution: The defining collection $\{x_j\}_{j=1}^{n}$ for this distribution is finite and can otherwise be arbitrary. However, this collection is conventionally taken as $\{j/n\}_{j=1}^{n}$, and so $\{x_j\}_{j=1}^{n} \subset [0, 1]$, and the discrete rectangular random variable is modeled by $X^R : S \longrightarrow [0, 1]$.
For given $n$, the probability density function of the discrete rectangular distribution, also called the discrete uniform distribution, is defined on $\{j/n\}_{j=1}^{n}$ by:
\[ \lambda(S_j) = 1/n. \]
By rescaling, this distribution can be translated to any interval $[a, b]$, defining
\[ Y^R = (b - a)X^R + a. \]
2. Binomial Distribution: For given $p$, $0 < p < 1$, the standard binomial random variable is defined $X_1^B : S \longrightarrow \{0, 1\}$, where the associated p.d.f. is defined by:
\[ f(1) = p, \qquad f(0) = p' \equiv 1 - p. \]
A simple application for this random variable is as a model for a single coin flip. So $S = \{H, T\}$, a probability measure is defined on $S$ by $\lambda(H) = p$ and $\lambda(T) = p'$, and the random variable is defined by $X_1^B(H) \equiv 1$ and $X_1^B(T) \equiv 0$. This random variable is sometimes referred to as a Bernoulli trial, and the associated distribution function as the Bernoulli distribution, after Jakob Bernoulli (1654–1705).
This standard formulation is then translated to a shifted standard binomial random variable $Y_1^B = b + (a - b)X_1^B$, which is defined:
\[ Y_1^B = \begin{cases} a, & \Pr = p, \\ b, & \Pr = p', \end{cases} \]
where the example of $b = -a$ is common in discrete time asset price modeling, for example.
3. General Binomial: The binomial model can also be extended to accommodate sample spaces of $n$-coin flips, producing the general binomial random variable with two parameters, $p$ and $n \in \mathbb{N}$. That is, $S = \{(F_1, F_2, ..., F_n) \mid F_j \in \{H, T\}\}$, and $X_n^B$ is defined as the "head counting" random variable:
\[ X_n^B(F_1, F_2, ..., F_n) = \sum_{j=1}^{n} X_1^B(F_j). \]
It is apparent that $X_n^B$ assumes values $0, 1, 2, ..., n$, and using a standard combinatorial analysis, that the associated probabilities are given for $j = 0, 1, .., n$ by:
\[ X_n^B = j, \qquad \Pr = \binom{n}{j} p^j (1 - p)^{n-j}. \]
Recall that $\binom{n}{j}$ denotes the binomial coefficient defined by:
\[ \binom{n}{j} = \frac{n!}{(n - j)!\, j!}, \tag{1.41} \]
where $0! = 1$ by convention. This expression is sometimes denoted ${}_nC_j$ and read, "n choose j."
The name "binomial coefficient" follows from the expansion of a binomial $a + b$ raised to the power $n$, producing the binomial theorem:
\[ (a + b)^n = \sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j}. \tag{1.42} \]
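As a small numerical companion to (1.41) and (1.42), the following sketch (Python, standard library only; the helper name `binomial_pmf` is ours, not the text's) checks the binomial theorem and the fact that the probabilities in (1.40) sum to 1:

```python
from math import comb

def binomial_pmf(j, n, p):
    # Probability that the head-counting variable assumes the value j, per (1.40).
    return comb(n, j) * p**j * (1 - p)**(n - j)

# Check the binomial theorem (1.42) for a = 2, b = 3, n = 5:
a, b, n = 2.0, 3.0, 5
lhs = (a + b)**n
rhs = sum(comb(n, j) * a**j * b**(n - j) for j in range(n + 1))
assert abs(lhs - rhs) < 1e-9

# The probabilities over j = 0, ..., n sum to 1, i.e. (1.42) with a = p, b = 1 - p:
total = sum(binomial_pmf(j, 10, 0.3) for j in range(11))
assert abs(total - 1.0) < 1e-12
```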
\[ S = \{H,\, TH,\, TTH,\, TTTH, ....\}, \]
and the random variable $X^G$ defined as the number of flips before the first $H$. Consequently, $f_G(j)$ above is the probability in $S$ of the sequence of $j$-$T$s and then 1-$H$. That is, the probability that the first $H$ occurs after $j$-$T$s. Of course, $f_G(j)$ is indeed a p.d.f. in that $\sum_{j=0}^{\infty} p(1 - p)^j = 1$, as follows from (1.44), letting $j \to \infty$.
The geometric distribution is sometimes parametrized as:
and then represents the probability of the first head in a coin flip sequence appearing on flip $j$. These representations are conceptually equivalent, but mathematically distinct due to the shift in domain.
One way of generalizing the geometric distribution is to allow the probability of a head to
vary with the sequential number of the coin flip. This is the basic model in all financial
calculations relating to payments contingent on death or survival, as well as to
various other vitality-based outcomes. Specifically, if
where $f_{GG}(j)$ is the probability of the first head appearing on flip $j$. By convention, $\prod_{k=1}^{0}(1 - p_k) \equiv 1$ when $j = 1$.
If $p_k = p > 0$ for all $k$, then $f_G(j)$ is a p.d.f. as noted above. With nonconstant probabilities, this conclusion is also true with some restrictions to assure that the distribution function is bounded. For example, if $0 < a \leq p_k \leq b < 1$ for all $k$, then the summation is finite since $f_{GG}(j) < b(1 - a)^{j-1}$, and thus $\sum_{j=1}^{\infty} f_{GG}(j) < b/a$ by a geometric series summation. Additional restrictions are required to make this sum equal to 1. Details are left to the interested reader, or see Reitano (2010), pp. 314–319, for additional discussion.
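This bound is easy to probe numerically. The sketch below (Python; the particular flip-dependent probabilities are hypothetical choices satisfying $0 < a \leq p_k \leq b < 1$, not values from the text) evaluates a truncated sum of $f_{GG}(j)$ and confirms it respects the $b/a$ bound:

```python
def f_GG(j, p):
    # Probability the first head occurs on flip j, with flip-dependent head
    # probabilities p[0], p[1], ... (p[k-1] plays the role of p_k in the text).
    prob_no_head = 1.0
    for k in range(j - 1):
        prob_no_head *= 1.0 - p[k]
    return prob_no_head * p[j - 1]

# Illustrative nonconstant probabilities with 0 < a <= p_k <= b < 1:
a_lo, b_hi = 0.2, 0.6
p = [0.2 + 0.4 * ((k % 3) / 2) for k in range(200)]  # cycles through 0.2, 0.4, 0.6

total = sum(f_GG(j, p) for j in range(1, 201))
assert total <= b_hi / a_lo + 1e-12   # the b/a bound from the text
assert total <= 1.0 + 1e-12           # in fact a (sub)probability
```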
5. Negative Binomial Distribution: The name of this distribution calls out yet another
connection to the binomial distribution, and here we generalize the idea behind the geo-
metric distribution. There, fG (j) was defined as the probability of j-T s before the first
H. The negative binomial, fN B (j) introduces another parameter k, and is defined as the
probability of j-T s before the kth-H. So when k = 1, the negative binomial is the same
as the geometric.
The p.d.f. is then defined with parameters $p$ with $0 < p < 1$ and $k \in \mathbb{N}$ by:
\[ f_{NB}(j) = \binom{j + k - 1}{k - 1} p^k (1 - p)^j, \qquad j = 0, 1, 2, .. \tag{1.46} \]
This formula can be derived analogously to that for the geometric by considering the
sample space of coin flip sequences, each terminated on the occurrence of the kth-H.
The probability of a sequence with j-T s and k-Hs is then pk (1 − p)j . Next, we must
determine the number of such sequences in the sample space. Since every such sequence
terminates with an $H$, there are only the first $j + k - 1$ positions that need to be addressed. Each such sequence is then determined by the placement of the first $(k - 1)$-$H$s, and so the total count of these sequences is $\binom{j+k-1}{k-1}$. Multiplying the probability and the count, we have (1.46).
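A quick numerical sketch of (1.46) (Python; the helper name `f_NB` and the parameter values are ours) confirms the reduction to the geometric when $k = 1$ and that the probabilities sum to 1:

```python
from math import comb

def f_NB(j, k, p):
    # Probability of j tails before the k-th head, per (1.46).
    return comb(j + k - 1, k - 1) * p**k * (1 - p)**j

p = 0.3
# With k = 1 the negative binomial reduces to the geometric f_G(j) = p(1-p)^j:
for j in range(10):
    assert abs(f_NB(j, 1, p) - p * (1 - p)**j) < 1e-15

# The probabilities sum to 1; a long truncated sum gets very close:
total = sum(f_NB(j, 4, p) for j in range(500))
assert abs(total - 1.0) < 1e-9
```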
6. Poisson Distribution: The Poisson distribution is named for Siméon-Denis Pois-
son (1781–1840) who discovered this distribution and studied its properties. This distri-
bution is characterized by a single parameter λ > 0, and is defined on the nonnegative
integers by:
\[ f_P(j) = e^{-\lambda} \frac{\lambda^j}{j!}, \qquad j = 0, 1, 2, .... \tag{1.47} \]
That $\sum_{j=0}^{\infty} f_P(j) = 1$ is an application of the Taylor series expansion for $e^{\lambda}$:
\[ e^{\lambda} = \sum_{j=0}^{\infty} \lambda^j / j! \]
One important application of the Poisson distribution is provided by the Poisson Limit theorem of Proposition II.1.11. See also Proposition 6.5. This result states that with $\lambda = np$ fixed:
\[ \lim_{n \to \infty} \binom{n}{j} p^j (1 - p)^{n-j} = e^{-\lambda} \frac{\lambda^j}{j!}. \]
Thus when the binomial parameter $p$ is small, and $n$ is large, the binomial probabilities in (1.40) can be approximated:
\[ \binom{n}{j} p^j (1 - p)^{n-j} \simeq e^{-np} \frac{(np)^j}{j!}. \tag{1.48} \]
Here "$p$ small and $n$ large" is usually taken to mean $n \geq 100$ and $np \leq 10$.
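The quality of the approximation (1.48) in this regime can be seen numerically. The sketch below (Python; the particular values $n = 1000$, $p = 0.005$ are illustrative choices satisfying $n \geq 100$, $np \leq 10$) compares the two p.d.f.s term by term:

```python
from math import comb, exp, factorial

def binom_prob(j, n, p):
    # Binomial probability per (1.40).
    return comb(n, j) * p**j * (1 - p)**(n - j)

def poisson_prob(j, lam):
    # Poisson probability per (1.47).
    return exp(-lam) * lam**j / factorial(j)

n, p = 1000, 0.005        # lambda = np = 5, within the usual rule of thumb
lam = n * p
max_err = max(abs(binom_prob(j, n, p) - poisson_prob(j, lam)) for j in range(30))
assert max_err < 2e-3     # (1.48) is close term by term in this regime
```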
Another important property of the Poisson distribution is that it characterizes “arrivals”
during a given period of time under reasonable and frequently encountered assumptions.
For example, the model might be one of automobile arrivals at a stop light; or telephone
calls to a switchboard; or internet searches to a server; or radioactive particles to a
Geiger counter; or insurance claims of any type (injuries, deaths, automobile accidents,
etc.) from a large group of policyholders; or defaults from a large portfolio of loans and
bonds; etc. For this result, see pp. 300–301 in Reitano (2010).
There are many distributions with continuous density functions used in finance and in
other applications, but some of the most common are the uniform, exponential, gamma
(including chi-squared), beta, normal, lognormal, and Cauchy. In addition, other examples
are found with the extreme value distributions discussed in Chapter II.9 and Chapter
7, as well as Student’s T (also called Student T) and F distributions introduced in Example
2.28.
It should be noted that it is not required that a continuous density function $f(x)$ be continuous on $\mathbb{R}$, but only on its domain of definition. Then by Proposition III.1.33, $F'(x) = f(x)$ at every such continuity point.
For the following Chapter II.4 results, let $(S, \mathcal{E}, \lambda)$ be given and $X : (S, \mathcal{E}, \lambda) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ a random variable with distribution function $F(x)$ and left-continuous inverse $F^*(y)$. Let $(S', \mathcal{E}', \lambda')$ be given and $X_U : (S', \mathcal{E}', \lambda') \to ((0, 1), \mathcal{B}((0, 1)), m)$ a random variable with continuous uniform distribution function. Then:
Proposition II.4.5: If $F(x)$ is continuous and strictly increasing, so $F^* = F^{-1}$ by Proposition II.3.22, then:
(a) $F(X) : (S, \mathcal{E}, \lambda) \to ((0, 1), \mathcal{B}((0, 1)), m)$, defined by $F(X)(s) = F(X(s))$, is a random variable on $S$ which has a continuous uniform distribution on $(0, 1)$, meaning $F_{F(X)}(x) = x$.
(b) $F^{-1}(X_U) : (S', \mathcal{E}', \lambda') \to (\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$, defined by $F^{-1}(X_U)(s') = F^{-1}(X_U(s'))$, is a random variable on $S'$ with distribution function $F(x)$.
(a) $F(X) : (S, \mathcal{E}, \lambda) \to ((0, 1), \mathcal{B}((0, 1)), m)$, defined as above, has distribution function $F_{F(X)}(x) \leq x$, and this function is continuous uniform with $F_{F(X)}(x) = x$ if and only if $F(x)$ is continuous;
(b) $F^*(X_U) : (S', \mathcal{E}', \lambda') \to (\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$, defined as above, is a random variable on $S'$ with distribution function $F(x)$.
Proposition II.4.9:
(a) If $\{U_j\}_{j=1}^{n}$ are independent, continuous uniformly distributed random variables, then $\{X_j\}_{j=1}^{n} \equiv \{F^*(U_j)\}_{j=1}^{n}$ are independent random variables with distribution function $F(x)$.
(b) If $F(x)$ is continuous and $\{X_j\}_{j=1}^{n}$ are independent random variables with distribution function $F(x)$, then $\{U_j\}_{j=1}^{n} \equiv \{F(X_j)\}_{j=1}^{n}$ are independent, continuous uniformly distributed random variables.
The significance of these Book II results is that one can convert a uniformly distributed random sample $\{U_j\}_{j=1}^{n}$ into a sample of the $\{X_j\}_{j=1}^{n}$ variables by defining:
\[ X_j = F^*(U_j), \]
with $F^*(U_j)$ defined as in (1.52). And note that despite being defined as an infimum, the value of $F^*(U_j)$ is truly in the domain of the random variable $X$ because $F(x)$ is right continuous.
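This conversion is easy to sketch in code. The example below (Python; the exponential distribution is used purely as an illustration because its left-continuous inverse $F^*(u) = -\ln(1-u)/\lambda$ is available in closed form, and all names are ours) transforms uniform draws and checks the sample against the known exponential mean $1/\lambda$:

```python
import random
from math import log

def F_star(u, lam=1.0):
    # Left-continuous inverse of the exponential F(x) = 1 - exp(-lam * x):
    # F*(u) = inf{x | F(x) >= u} = -ln(1 - u) / lam, for u in (0, 1).
    return -log(1.0 - u) / lam

random.seed(12345)
lam = 2.0
sample = [F_star(random.random(), lam) for _ in range(200_000)]

# The transformed uniforms behave like exponential draws: mean should be 1/lam.
mean = sum(sample) / len(sample)
assert abs(mean - 1.0 / lam) < 0.01
```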
We will return to an application of these results in Chapter 5.
2. Exponential Distribution: The exponential density function is defined with parameter $\lambda > 0$:
\[ f_E(x) = \lambda e^{-\lambda x}, \qquad x \geq 0, \tag{1.53} \]
and $f_E(x) = 0$ for $x < 0$. The associated distribution function is then:
\[ F_E(x) = 1 - e^{-\lambda x}, \qquad x \geq 0, \]
and $F_E(x) = 0$ for $x < 0$. When $\lambda = 1$, this is often called the standard exponential distribution.
The exponential distribution is important in the generation of ordered random sam-
ples. See Chapter 3.
and since $\Gamma(1) = 1$, this function generalizes the factorial function in that for any integer $n$:
\[ \Gamma(n) = (n - 1)! \tag{1.58} \]
This identity also provides the logical foundation for defining $0! = 1$, as noted in the discussion of the general binomial distribution.
Another interesting identity for the gamma function is:
\[ \Gamma(1/2) = \sqrt{\pi}. \tag{1.59} \]
4. Beta Distribution: The beta distribution contains two shape parameters, $v > 0$ and $w > 0$, and is defined on the interval $[0, 1]$ by the density function:
\[ f_\beta(x) = \frac{1}{B(v, w)} x^{v-1} (1 - x)^{w-1}. \tag{1.61} \]
The beta function $B(v, w)$ is defined by a definite integral which in general requires numerical evaluation:
\[ B(v, w) = \int_0^1 y^{v-1} (1 - y)^{w-1}\,dy. \tag{1.62} \]
By definition, therefore, $\int_0^1 f_\beta(x)\,dx = 1$.
If $v$ or $w$ or both parameters are less than 1, the beta density and the integrand of $B(v, w)$ are unbounded at $x = 0$ or $x = 1$ or both. However, the integral converges in the limit as an improper integral:
\[ B(v, w) = \lim_{a \to 0^+,\ b \to 1^-} \int_a^b y^{v-1} (1 - y)^{w-1}\,dy, \]
A final substitution $x = zy$ in the inner integral, and two applications of (1.56), produces the identity:
\[ B(v, w) = \frac{\Gamma(v)\Gamma(w)}{\Gamma(v + w)}. \tag{1.63} \]
Thus for integer $n$ and $m$, it follows from (1.58) that:
\[ B(n, m) = \frac{(n - 1)!\,(m - 1)!}{(n + m - 1)!}. \tag{1.64} \]
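The identity (1.63) can be checked numerically against the defining integral (1.62). The sketch below (Python; midpoint quadrature is our illustrative choice, used because midpoints avoid the possibly unbounded endpoints when $v < 1$ or $w < 1$) compares the two for a few parameter pairs:

```python
from math import gamma

def beta_numeric(v, w, steps=200_000):
    # Midpoint-rule evaluation of the integral (1.62).
    h = 1.0 / steps
    return sum(((i + 0.5) * h)**(v - 1) * (1 - (i + 0.5) * h)**(w - 1)
               for i in range(steps)) * h

for v, w in [(2.0, 3.0), (0.5, 0.5), (1.5, 4.0)]:
    exact = gamma(v) * gamma(w) / gamma(v + w)   # identity (1.63)
    assert abs(beta_numeric(v, w) - exact) < 1e-2
```

Note that $B(2, 3) = 1!\,2!/4! = 1/12$ by (1.64), and $B(1/2, 1/2) = \Gamma(1/2)^2 = \pi$ by (1.59); the unbounded integrand in the latter case slows the quadrature, hence the loose tolerance.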
(a) When applied to approximate the binomial distribution, this result is called the De
Moivre-Laplace theorem, named for Abraham de Moivre (1667–1754), who
demonstrated the special case of p = 1/2, and Pierre-Simon Laplace (1749–
1827), who years later generalized to all p, 0 < p < 1. See Proposition 6.11.
(b) In the most general cases, this result is known as the central limit theorem. See
Proposition 6.13 for one example, and Book VI for others.
Remark 1.31 ($\int_{-\infty}^{\infty} f_N(x)\,dx = 1$) By change of variable, $\int_{-\infty}^{\infty} f_N(x)\,dx = 1$ if and only if $\int_{-\infty}^{\infty} \phi(x)\,dx = 1$. While there are no elementary proofs of the latter identity, there is a clever derivation that involves embedding the integral $\int_{-\infty}^{\infty} \phi(x)\,dx$ into a 2-dimensional Riemann integral, and the ability to move back and forth between 2-dimensional and iterated Riemann integrals. This is largely justified in Corollary III.1.77, but the earlier result requires a generalization to improper integrals. The derivation below also requires a change of variables to polar coordinates which allows direct calculation, a step that is perhaps familiar to the reader but one that will not be formally justified until Book V.
In detail:
\[ \int_{-\infty}^{\infty} \phi(x)\,dx \int_{-\infty}^{\infty} \phi(y)\,dy = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(-\left(x^2 + y^2\right)/2\right) dx\,dy, \]
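While the polar-coordinate argument gives the exact value, the conclusion $\int_{-\infty}^{\infty} \phi(x)\,dx = 1$ is also easy to confirm numerically. A minimal sketch (Python; the truncation to $[-10, 10]$ is our choice, and discards tail mass below $10^{-22}$):

```python
from math import exp, pi, sqrt

def phi(x):
    # Standard normal density.
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

# Composite trapezoidal rule over [-10, 10]:
steps = 100_000
h = 20.0 / steps
total = sum(phi(-10.0 + i * h) for i in range(steps + 1)) * h
total -= 0.5 * h * (phi(-10.0) + phi(10.0))  # trapezoid endpoint correction
assert abs(total - 1.0) < 1e-10
```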
Example 1.32 (Example II.1.7) Example II.1.7 examined default experience on a loan
portfolio, and of particular interest was the modeling of default losses. For a portfolio of n
loans, if X denotes the number of defaults, then X : S → {0, 1, 2, ..., n} is discrete, while if
Y : S → R denotes the dollars of loss, then the model for Y will typically be “mixed.”
This follows because, with fixed default probability $p$ say, the probability of no defaults assuming independence is given by:
\[ \Pr\{0 \text{ defaults}\} = (1 - p)^n. \]
This probability will typically be quite large compared to $\Pr\{0 < Y \leq \epsilon\}$, which is the probability of one or several defaults but very low loss amounts.
For example, assume that each bond’s loss given default is uniformly distributed on
[0, Lj ], where Lj denotes the loan amount on the jth bond. Then F (y) will have a discrete
part at y = 0, and then a continuous or mixed distribution for y > 0. In the former case,
which we confirm below:
where χ[0,∞) (y) = 1 for y ∈ [0, ∞) and is 0 otherwise, and FAC (y) is absolutely continuous
with FAC (0) = 0.
As $F(y)$ is continuous from the right as a distribution, and certainly not left continuous at $y = 0$, to investigate left continuity for $y > 0$, recall the law of total probability in Proposition II.1.35. If $Y$ denotes total losses and $0 < \epsilon < y$, then noting that $\Pr\{y - \epsilon < Y \leq y \mid 0 \text{ defaults}\} = 0$:
\[ \Pr\{y - \epsilon < Y \leq y\} = \sum_{k=1}^{n} \Pr\{y - \epsilon < Y \leq y \mid k \text{ def.}\} \Pr\{k \text{ def.}\}, \tag{1.71} \]
Since:
\[ \Pr\{k \text{ defaults}\} = \binom{n}{k} p^k (1 - p)^{n-k}, \]
this can be stated in terms of $F(y)$:
\[ F(y) - F(y - \epsilon) = \sum_{k=1}^{n} \binom{n}{k} p^k (1 - p)^{n-k} \left[F_k(y) - F_k(y - \epsilon)\right] \tag{1} \]
likely to default:
\[ F_k(y) - F_k(y - \epsilon) = \frac{1}{N_k} \sum_{i=1}^{N_k} \left[F_k^{(i)}(y) - F_k^{(i)}(y - \epsilon)\right], \tag{2} \]
where now $F_k^{(i)}(y)$ is the distribution function of the sum of $k$ losses from the $i$th collection of $k$ bonds.
In other words, $F_k^{(i)}(y)$ is the distribution function of the sum of $k$ continuously distributed variables, here assumed to be uniformly distributed. We claim each such $F_k^{(i)}(y)$ is continuous for $y > 0$, and thus by (1), (2) and (1.71), so too is $F(y)$, and thus $F(y)$ is a mixed distribution as claimed.
By (2.13), if $X$ and $Y$ are independent random variables on a probability space $(S, \mathcal{E}, \lambda)$ with continuously differentiable distribution functions $F_X$ and $F_Y$ with continuous density functions $F'_X = f_X$ and $F'_Y = f_Y$, then the density function of $Z = X + Y$ is continuous and given by the Riemann integrals:
\[ f_Z(z) = \int f_X(z - y) f_Y(y)\,dy = \int f_Y(z - x) f_X(x)\,dx. \tag{3} \]
Thus by induction, each $F_k^{(i)}(y)$ has a continuous density $f_k^{(i)}(y)$ and is continuous as claimed.
Exercise 1.33 Given two loans with loss given default Yj uniformly distributed on
[0, Lj ] for j = 1, 2, use (3) to determine fY (y) for Y = Y1 + Y2 . Assume L1 < L2 to
avoid ambiguity.
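The convolution (3) can also be evaluated numerically, which provides a check against any closed-form answer to this exercise. The sketch below (Python; $L_1 = 1$, $L_2 = 3$ are hypothetical loan amounts with $L_1 < L_2$, and the quadrature is ours, not a solution from the text):

```python
def f_uniform(x, L):
    # Density of a loss uniformly distributed on [0, L].
    return 1.0 / L if 0.0 <= x <= L else 0.0

def f_sum(y, L1, L2, steps=100_000):
    # Numeric version of the convolution (3): integrate f1(y - t) f2(t) dt
    # over t in [0, L2] by the midpoint rule.
    h = L2 / steps
    return sum(f_uniform(y - (i + 0.5) * h, L1) * (1.0 / L2)
               for i in range(steps)) * h

L1, L2 = 1.0, 3.0
# On [L1, L2] the density is flat at 1/L2 (the plateau of a trapezoid):
assert abs(f_sum(2.0, L1, L2) - 1.0 / L2) < 1e-3
# At y = 0.5 the density is y/(L1*L2), the rising edge of the trapezoid:
assert abs(f_sum(0.5, L1, L2) - 0.5 / (L1 * L2)) < 1e-3
```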
2
Transformed Random Variables
In this chapter, we investigate the distribution and density functions associated with various
transformations of random variables. Many of these results will be generalized in Book VI
with the aid of the general integration theory of Book V.
\[ F_Y(y) = \lambda\left(X^{-1}\left(g^{-1}(-\infty, y]\right)\right), \tag{2.1} \]
1. Increasing $g(x)$:
\[ F_Y(y) = F_X(g^{-1}(y)), \tag{2.2} \]
2. Decreasing $g(x)$:
\[ F_Y(y) = 1 - F_X(g^{-1}(y)^-), \tag{2.3} \]
where $F_X(x^-) \equiv \lim_{y \to x^-} F_X(y)$ denotes the left limit at $x$.
DOI: 10.1201/9781003264583-2
If $g(x)$ is also continuously differentiable with $g'(x) \neq 0$ for all $x$, then $\frac{dg^{-1}(y)}{dy} = 1/g'(g^{-1}(y))$ is well defined. Further, $\frac{dg^{-1}(y)}{dy}$ is continuous since $g'(x) \neq 0$ and is continuous, and $g^{-1}$ is continuous since $g$ is continuous and strictly increasing. Thus if $F_X(x)$ is continuously differentiable with density function $f_X(x) = F'_X(x)$, then $F_Y(y)$ is differentiable and has an associated continuous density function $f_Y(y) \equiv F'_Y(y)$:
\[ f_Y(y) = f_X(g^{-1}(y)) \frac{dg^{-1}(y)}{dy}. \tag{1} \]
\[ F_Y(y) = 1 - F_X(g^{-1}(y)). \]
If $F_X(x)$ and $g(x)$ are continuously differentiable, $f_Y(y) \equiv F'_Y(y)$ is continuous and given by:
\[ f_Y(y) = -f_X(g^{-1}(y)) \frac{dg^{-1}(y)}{dy}. \tag{2} \]
Since $\frac{dg^{-1}(y)}{dy} > 0$ for increasing $g(x)$, and $\frac{dg^{-1}(y)}{dy} < 0$ for decreasing $g(x)$, (2.4) follows from (1) and (2).
For the result in (2.5), it follows by the continuity of $f_Y(y)$ and Proposition III.1.32 that for every interval $[a, b]$:
\[ F_Y(y) = F_Y(a) + (R)\int_a^y f_X(g^{-1}(z)) \left|\frac{dg^{-1}(z)}{dz}\right| dz, \qquad y \in [a, b], \]
and this can be iterated with $a \to -\infty$, since then $F_Y(a) \to 0$ by Proposition 1.3.
\[ F_Y(y) = 1 - F_X(g^{-1}(y)). \]
It follows from this that $F_Y(y)$ as defined in (2.2), or (2.3) as modified above, is absolutely continuous on every interval $[a, b]$ by Definition III.3.54. Thus $f_Y(y) \equiv F'_Y(y)$ exists a.e. by Proposition III.3.59, and of necessity is given by (2.4) a.e. Then by Proposition III.3.62, (2.5) is satisfied as a Lebesgue integral:
\[ F_Y(y) = (L)\int_{-\infty}^{y} f_X(g^{-1}(z)) \left|\frac{dg^{-1}(z)}{dz}\right| dz, \]
The above framework can sometimes be adapted by first principles in situations where $g(x)$ is not strictly monotonic. However, general measurable $g(x)$ will require better tools. See Book VI.
The associated density function can then be calculated, and using the symmetry of $f_X$, we derive that for $y > 0$:
\[ f_Y(y) = f_X(\sqrt{y})/\sqrt{y} = \frac{1}{\sqrt{2\pi}} y^{-1/2} \exp(-y/2). \]
Comparing this to (1.55), we see that $Y$ has a gamma density with $\lambda = \alpha = 1/2$ if:
\[ \Gamma(1/2) = \sqrt{\pi}. \]
Remark 2.5 (Sum of independent squared normals is $\chi^2_{n\ d.f.}$) As noted in Remark 1.30, when a random variable $Y$ has a gamma distribution with $\lambda = 1/2$ and $\alpha = n/2$, it is said to have a chi-squared distribution with $n$ degrees of freedom, which is denoted $\chi^2_{n\ d.f.}$. The density function of the chi-squared distribution with $n$ degrees of freedom is thus given by (1.55) with the above parameters and defined on $x \geq 0$ by:
\[ f_{\chi^2_{n\ d.f.}}(x) = \frac{1}{2^{n/2}\Gamma(n/2)} x^{n/2-1} e^{-x/2}. \tag{2.6} \]
As proved above, if $X$ is standard normal then $X^2$ is $\chi^2_{1\ d.f.}$. We will see in item 5 of Section 4.4.1 that if $\{X_i\}_{i=1}^{n}$ are independent standard normals, then $\sum_{i=1}^{n} X_i^2$ is $\chi^2_{n\ d.f.}$.
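This claim can be probed by simulation. The sketch below (Python; the check against mean $n$ and variance $2n$ uses the standard moment facts for the chi-squared distribution, which are not derived in this section) sums squared normal draws and compares sample moments:

```python
import random
random.seed(2023)

def chi2_sample(n):
    # Sum of n independent squared standard normal draws.
    return sum(random.gauss(0.0, 1.0)**2 for _ in range(n))

n, trials = 5, 100_000
draws = [chi2_sample(n) for _ in range(trials)]
mean = sum(draws) / trials
var = sum((d - mean)**2 for d in draws) / trials

# A chi-squared variable with n degrees of freedom has mean n and variance 2n.
assert abs(mean - n) < 0.1
assert abs(var - 2 * n) < 0.5
```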
$\{v_m\}_{m=1}^{M}$, each with a sum of 1. Let $X, Y$ be independent (Definition 1.13) discrete random variables on a probability space $(S, \mathcal{E}, \lambda)$ with respective distribution functions $F_X(x)$ and $F_Y(y)$, and associated density functions $f_X(x)$ and $f_Y(y)$:
\[ F_X(x) = \sum_{x_n \leq x} u_n, \qquad F_Y(y) = \sum_{y_m \leq y} v_m, \tag{2.7} \]
\[ f_X(x_n) = u_n, \qquad f_Y(y_m) = v_m. \]
In other words, density functions are defined to be 0 outside the points $\{x_n\}_{n=1}^{N}$ and $\{y_m\}_{m=1}^{M}$.
Proposition 2.6 ($Z = X + Y$: $F_Z(z)$ as a sum, discrete $F_X$, $F_Y$) Let $X$ and $Y$ be independent random variables on a probability space $(S, \mathcal{E}, \lambda)$ with saltus distribution functions given in (2.7). Then the distribution function of $Z = X + Y$ is given by:
\[ F_Z(z) = \sum_{m=1}^{M} F_X(z - y_m) f_Y(y_m) = \sum_{n=1}^{N} F_Y(z - x_n) f_X(x_n). \tag{2.8} \]
The next result transforms the above representation into a form that will be seen to gen-
eralize below. The reader is invited to derive the analogous result for the second expression
in (2.8) as an exercise.
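The formula (2.8) is easy to verify on a small example. The sketch below (Python; the point masses are arbitrary illustrative values, not from the text) evaluates the first expression in (2.8) and cross-checks it against direct enumeration of $\Pr[X + Y \leq z]$:

```python
# Two hypothetical discrete variables: X on {0, 1, 2}, Y on {0, 1}.
x_pts, fX = [0, 1, 2], {0: 0.2, 1: 0.5, 2: 0.3}
y_pts, fY = [0, 1], {0: 0.4, 1: 0.6}

def F(pts, f, t):
    # Distribution function of a saltus distribution: sum f over points <= t.
    return sum(f[p] for p in pts if p <= t)

def FZ(z):
    # First expression in (2.8): F_Z(z) = sum_m F_X(z - y_m) f_Y(y_m).
    return sum(F(x_pts, fX, z - y) * fY[y] for y in y_pts)

# Cross-check against direct enumeration of Pr[X + Y <= z]:
for z in [-1, 0, 0.5, 1, 2, 3, 4]:
    direct = sum(fX[x] * fY[y] for x in x_pts for y in y_pts if x + y <= z)
    assert abs(FZ(z) - direct) < 1e-12
```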
and note that $a > 0$ by the assumptions on disjointedness of domains and absence of accumulation points. Let $P = \{w_k\}_{k=-K'}^{K''}$ with $w_k < w_{k+1}$ for all $k$ be a partition of mesh size $\mu < a$ of an interval $I$ that contains all $z - x_n$, $y_m$, where $I = \mathbb{R}$ is possible. Since there are no accumulation points, we can assume that $P$ does not contain any $z - x_n$ or $y_m$. For notational consistency, assume that $\{z - x_n\}_{n=1}^{N}$ has been reordered to increasing $\{z - x_n\}_{n=-N'}^{N''}$, and $\{y_m\}_{m=1}^{M}$ to $\{y_m\}_{m=-M'}^{M''}$, noting that any or all of these set index bounds may be infinite.
With $z$ fixed as above, recall Definition III.4.6 for the lower and upper Darboux sums:
\[ L(F_X(z - y), F_Y(y), P) = \sum_{k=-K'}^{K''} m_k(z) \left[F_Y(w_{k+1}) - F_Y(w_k)\right], \]
\[ U(F_X(z - y), F_Y(y), P) = \sum_{k=-K'}^{K''} M_k(z) \left[F_Y(w_{k+1}) - F_Y(w_k)\right]. \]
Here $m_k(z) = \inf\{F_X(z - w) \mid w \in [w_k, w_{k+1}]\}$, and $M_k(z)$ is similarly defined in terms of the supremum. These sums are convergent since $0 \leq m_k, M_k \leq 1$ and $\sum_{k=-K'}^{K''} \left[F_Y(w_{k+1}) - F_Y(w_k)\right] = 1$.
By construction, each interval $(w_k, w_{k+1})$ contains at most one $y_m$. If $y_m \in (w_k, w_{k+1})$, then $F_Y(w_{k+1}) - F_Y(w_k) = v_m \equiv f_Y(y_m)$. Conversely, $F_Y(w_{k+1}) - F_Y(w_k) = 0$ if $(w_k, w_{k+1})$ contains no $y_m$. Thus:
\[ L(F_X(z - y), F_Y(y), P) = \sum_{m=-M'}^{M''} m_{k(m)}(z) f_Y(y_m), \tag{1} \]
\[ U(F_X(z - y), F_Y(y), P) = \sum_{m=-M'}^{M''} M_{k(m)}(z) f_Y(y_m). \]
Here $m_{k(m)}(z)$ and $M_{k(m)}(z)$ are defined on $(w_{k(m)}, w_{k(m)+1})$, where this notation implies that $y_m \in (w_{k(m)}, w_{k(m)+1})$.
Given such an interval $[w_k, w_{k+1}]$ which contains some $y_m$, as this interval cannot contain any $z - x_n$, it follows that $F_X(z - w)$ is constant there, and thus $m_k(z) = M_k(z) = F_X(z - y_m)$. It now follows from (1) that for any partition $P$ with mesh size $\mu < a$:
and further, these Darboux sums agree with $F_Z(z)$ in (2.8) by Proposition III.4.12.
Remark 2.8 (On $\{z - x_n\}_{n=1}^{N} \cap \{y_m\}_{m=1}^{M} = \emptyset$) Corollary 2.7 made the assumption that $\{z - x_n\}_{n=1}^{N} \cap \{y_m\}_{m=1}^{M} = \emptyset$, and this assumption cannot be weakened. By Proposition III.4.15, if a Riemann-Stieltjes integral over $[a, b]$ exists, then the integrand and integrator can have no common discontinuities. As $\{z - x_n\}_{n=1}^{N}$ and $\{y_m\}_{m=1}^{M}$ are the discontinuities of $F_X(z - y)$ and $F_Y(y)$, this restriction is necessary since existence as an integral over $\mathbb{R}$ implies existence over all such $[a, b]$. Thus the representation in (2.9) is valid for all $z$ outside an at most countable set, defined by $\{y_m + x_n\}$ for all $m$ and $n$.
Proof. By continuity of the distribution functions, these integrals are well defined over
every bounded interval [a, b] by Proposition III.4.17. By boundedness, these integrals have
well defined limits as a → −∞ and b → ∞ by Proposition III.4.21. It then follows from
Proposition III.4.12 that integrals can be evaluated by any sequence of Riemann-Stieltjes
summations with partition mesh size µ → 0.
To this end, define $P_n \equiv \{j/2^n\}_{j \in \mathbb{Z}}$, a partition of $(-\infty, \infty)$ of mesh size $\mu = 2^{-n}$. Then the first integral in (2.10) equals the limit as $n \to \infty$ of the Riemann-Stieltjes sums:
\[ \sum_{j=-\infty}^{\infty} F_X(z - y_{j,n}) \left[F_Y((j + 1)/2^n) - F_Y(j/2^n)\right], \tag{1} \]
\[ B_j^{(n)}(z) = \{X \leq z - j/2^n\} \cap A_j^{(n)}, \]
and let $B^{(n)}(z) = \bigcup_{j=-\infty}^{\infty} B_j^{(n)}(z)$. We claim that $\{B^{(n)}(z)\}_{n=1}^{\infty}$ is a nested collection, $B^{(n+1)}(z) \subset B^{(n)}(z)$, and:
\[ \bigcap_{n=1}^{\infty} B^{(n)}(z) = B(z) \equiv \{X + Y \leq z\}. \tag{2} \]
If $s \in B_{2j}^{(n+1)}(z)$ then:
Comparing to (1), this is the limit of Riemann-Stieltjes sums with tags yj,n = j/2n , and the
proof is complete for the first integral in (2.10).
The second integral follows identically with a change of notation.
and this integral is well defined as a limit of integrals over [a, b] as a → −∞ and b → ∞.
This follows from Proposition III.4.21 since FY (z − x) − FZ (zn − x) is continuous and
bounded, and FX (x) is increasing and bounded.
Thus by item 3 of Proposition III.4.24 applied to such intervals and taking a limit, and
then applying item 2 of that result:
\[ |F_Z(z) - F_Z(z_n)| \leq \int_{-\infty}^{\infty} |F_Y(z - x) - F_Y(z_n - x)|\,dF_X(x) \leq \sup |F_Y(z - x) - F_Y(z_n - x)|, \]
since $\int dF_X(x) = 1$. To complete the proof, we claim that this supremum converges to 0 as
n → ∞.
To this end, let $\epsilon > 0$ be given and choose $m$ such that $1/m < \epsilon/2$. For $0 \leq j \leq m - 1$ define:
\[ A_j^{(m)} = F_Y^{-1}\left[j/m, (j + 1)/m\right]. \]
Then $\bigcup_{j=0}^{m-1} A_j^{(m)} = \mathbb{R}$ by definition, and each $A_j^{(m)}$ is a nonempty interval since $F_Y$ is continuous.
Define $\delta = \min_j \left|A_j^{(m)}\right|$, where $\left|A_j^{(m)}\right|$ denotes interval length. Note that $\delta > 0$ since this is a finite collection of intervals, and all $\left|A_j^{(m)}\right| > 0$ by continuity of $F_Y(y)$. Assume that $y < y'$ are given with $|y - y'| < \delta$. If $y, y' \in A_j^{(m)}$ for some $j$, then $|F_Y(y) - F_Y(y')| < 1/m < \epsilon/2$ by construction. Otherwise $y \in A_j^{(m)}$ and $y' \in A_{j+1}^{(m)}$ for some $j$, and:
\[ |F_Y(y) - F_Y(y')| \leq |F_Y(y) - F_Y((j + 1)/m)| + |F_Y((j + 1)/m) - F_Y(y')| < \epsilon. \]
Thus if $|z_n - z| < \delta$, then $\sup |F_Y(z - x) - F_Y(z_n - x)| < \epsilon$ and the proof of continuity of $F_Z(z)$ is complete.
For absolute continuity, assume that $F_X(x)$ has this property and that $\epsilon > 0$ is given. Recalling Definition III.3.54, choose $\delta$ so that $\sum_{i=1}^{n} |F_X(x_i) - F_X(x'_i)| < \epsilon$ for any collection $\{(x'_i, x_i)\}_{i=1}^{n}$ of disjoint intervals with $\sum_{i=1}^{n} |x_i - x'_i| < \delta$. Given disjoint $\{(z'_i, z_i)\}_{i=1}^{n}$ with $\sum_{i=1}^{n} |z_i - z'_i| < \delta$, it then follows from properties 1 and 3 of Proposition III.4.24 that:
\[ \sum_{i=1}^{n} |F_Z(z_i) - F_Z(z'_i)| \leq \int_{-\infty}^{\infty} \sum_{i=1}^{n} |F_X(z_i - y) - F_X(z'_i - y)|\,dF_Y(y) < \epsilon \int_{-\infty}^{\infty} dF_Y(y) = \epsilon. \]
When one or both distribution functions in Proposition 2.9 has a continuous derivative,
as is the common assumption in continuous probability theory, we can convert one or both
of the above Riemann-Stieltjes integrals to Riemann integrals.
In Book VI, this result will be generalized using the general integration theory of Book
V. There it will be proved that if one or both of these distribution functions is absolutely
continuous, and thus only differentiable almost everywhere by Proposition III.3.59, a similar
conclusion results, but we must then interpret the integrals below in the sense of Lebesgue.
Corollary 2.11 ($Z = X + Y$: $F_Z(z)$ as an R integral, continuous $f_X$, $f_Y$) Let $X$ and $Y$ be independent random variables on a probability space $(S, \mathcal{E}, \lambda)$ with continuously differentiable distribution functions $F_X$ and $F_Y$ with continuous density functions $f_X \equiv F'_X$ and $f_Y \equiv F'_Y$. Then the distribution function of $Z = X + Y$ is continuously differentiable and given as Riemann integrals by:
\[ F_Z(z) = (R)\int_{-\infty}^{\infty} F_X(z - y) f_Y(y)\,dy = (R)\int_{-\infty}^{\infty} F_Y(z - x) f_X(x)\,dx. \tag{2.11} \]
If only one of $F_X(x)$ and $F_Y(y)$ is continuously differentiable with the other continuous, then this representation is valid for the associated density function, and then $F_Z(z)$ is continuous.
Proof. This representation as a Riemann integral follows directly from Proposition III.4.28.
Continuity follows from Corollary 2.10, while continuous differentiability will be addressed
in Proposition 2.17.
Exercise 2.12 When both distribution functions are continuously differentiable, show that
these two integrals are equal by integration by parts of III.(1.30). Hint: Start with integrals
over [−N, N ] and consider limits. Recall Remark III.1.35.
Notation 2.13 (Convolution) In the terminology and notation of Book V, the first expression in (2.8) and (2.11) for the distribution function $F_Z(z)$ equals the convolution of $F_X$ and $f_Y$. Analogously, the second expression equals the convolution of $F_Y$ and $f_X$. Notationally:
\[ F_Z(z) = F_X * f_Y(z) = F_Y * f_X(z). \]
Convolutions are commutative, meaning that:
as a change of variables in the integral verifies. The verification is similar for (2.8).
Convolution of three or more functions proves to be associative, and thus can be defined
iteratively, by
f ∗ g ∗ h(z) ≡ (f ∗ g) ∗ h(z) = f ∗ (g ∗ h) (z).
Associativity will be proved with the aid of Fubini’s theorem in Book V.
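As a numerical sanity check on (2.11) (a sketch of ours, not from the text): for X and Y independent exponentials with rate λ, the sum Z is gamma with α = 2, so F_Z(z) = 1 − e^{−λz}(1 + λz) in closed form. The following compares this to a Simpson-rule evaluation of the convolution integral; the quadrature routine and parameter choices are illustrative assumptions.

```python
import math

def simpson(f, a, b, n=2000):
    # Composite Simpson's rule on [a, b] with n (even) subintervals.
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += f(a + i * h) * (4 if i % 2 else 2)
    return s * h / 3

def F_Z(z, lam=1.0):
    # F_Z(z) = int F_X(z - y) f_Y(y) dy; since F_X(z - y) = 0 for y >= z
    # and f_Y(y) = 0 for y < 0, the integral reduces to one over [0, z].
    F_X = lambda x: 1 - math.exp(-lam * x) if x > 0 else 0.0
    f_Y = lambda y: lam * math.exp(-lam * y)
    return simpson(lambda y: F_X(z - y) * f_Y(y), 0.0, z)

# Closed form for the gamma(lam, alpha = 2) distribution function:
exact = lambda z, lam=1.0: 1 - math.exp(-lam * z) * (1 + lam * z)

print(F_Z(2.0), exact(2.0))
```

The two values agree to quadrature accuracy, consistent with the corollary.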
Example 2.14 We apply the above formulas in (2.8) and (2.11) to previously introduced
distribution functions.
1. Sums of Binomials are Binomial: Let X and Y be independent and have binomial
distributions as in (1.40) with common parameter p, but with respective parameters n
and m. Then by (2.8):

F_Z(z) = \sum_{j=0}^{z} F_X(z-j) \binom{m}{j} p^j (1-p)^{m-j}
= \sum_{j=0}^{z} \sum_{k=0}^{z-j} \binom{n}{k} p^k (1-p)^{n-k} \binom{m}{j} p^j (1-p)^{m-j}
= \sum_{j=0}^{z} \sum_{k=0}^{z-j} \binom{n}{k} \binom{m}{j} p^{j+k} (1-p)^{n+m-(j+k)}.

Setting i = j + k and reordering this double summation obtains:

F_Z(z) = \sum_{i=0}^{z} \sum_{j=0}^{i} \binom{n}{i-j} \binom{m}{j} p^i (1-p)^{n+m-i}.

Now:

\sum_{j=0}^{i} \binom{n}{i-j} \binom{m}{j} = \binom{m+n}{i}.
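A quick exact check of this collapse (an illustrative sketch; the function names are ours): with i = j + k and the above identity, the double sum reduces to the binomial(n + m, p) distribution function.

```python
from math import comb

# Compare the double sum for F_Z(z) with the binomial(n + m, p)
# distribution function; they agree term by term via the identity above.
def F_Z_double_sum(z, n, m, p):
    return sum(comb(n, k) * comb(m, j) * p ** (j + k) * (1 - p) ** (n + m - j - k)
               for j in range(z + 1) for k in range(z - j + 1))

def F_binom(z, N, p):
    return sum(comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(z + 1))

n, m, p = 5, 7, 0.3
print(max(abs(F_Z_double_sum(z, n, m, p) - F_binom(z, n + m, p))
          for z in range(n + m + 1)))
```

The maximum discrepancy is at floating-point round-off level.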
38 Transformed Random Variables
f_Z(z) = λ^2 z e^{-λz},  z ≥ 0.

It then follows that f_Z(z) is the gamma density function in (1.55) with parameters λ
and α = 2.
3. Sums of Exponentials and the Poisson: By Exercise 2.15, the sum of k independent
exponentials with common parameter λ produces a gamma with parameters λ and α = k,
and this motivates an interesting connection between sums of such exponentials and the
Poisson distribution.
Given such independent exponentials {X_j}_{j=1}^{∞}, let S_n = \sum_{j=1}^{n} X_j and define a new
random variable N by:

N ≡ max{n | S_n ≤ 1}.  (1)

Then N = n if and only if S_n ≤ 1 < S_{n+1}, and hence N ≥ n if and only if S_n ≤ 1. By
Exercise 2.15, the distribution function of S_n is given in (1.60), and since Pr[S_n ≤ 1] =
F_{S_n}(1) we obtain:

Pr[N ≥ n] = e^{-λ} \sum_{j=n}^{\infty} \frac{λ^j}{j!}.
Now:
Pr[N = n] = Pr[N ≥ n] − Pr[N ≥ n + 1],
and thus:
Pr[N = n] = e^{-λ} \frac{λ^n}{n!}.
In other words, N in (1) has a Poisson distribution with parameter λ.
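A simulation sketch of this connection (the parameter and sample-size choices are ours): sampling N = max{n | S_n ≤ 1} from exponential interarrival times and comparing the empirical frequencies to the Poisson(λ) probabilities.

```python
import math
import random

def sample_N(lam, rng):
    # N = max{n : S_n <= 1}, with N = 0 when the first exponential exceeds 1.
    s, n = 0.0, 0
    while True:
        s += rng.expovariate(lam)
        if s > 1.0:
            return n
        n += 1

rng = random.Random(1234)
lam, trials = 2.0, 100_000
counts = {}
for _ in range(trials):
    n = sample_N(lam, rng)
    counts[n] = counts.get(n, 0) + 1

for n in range(5):
    poisson = math.exp(-lam) * lam ** n / math.factorial(n)
    print(n, counts.get(n, 0) / trials, round(poisson, 4))
```

The empirical frequencies track the Poisson(λ) probability mass function to Monte Carlo accuracy.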
1. Sums of Binomials are Binomial: Generalize item 1 above and prove by induction
that the sum of N such independent binomials with parameters p and n_i has a binomial
distribution with parameters p and n = \sum_{i=1}^{N} n_i.
2. Sums of Exponentials are Gamma: Generalize item 2 above and prove by induction
that the sum of k independent exponentials with the same parameter λ has a gamma
distribution with parameters λ and α = k.
Note: This result will be further generalized in Section 4.4.1 to the statement that sums
of independent gamma random variables are gamma as long as they have a common λ
parameter. Then the resultant α parameter of the sum satisfies α = \sum_i α_i, where {α_i}_i
are the parameters of the individual gamma variates.
Now Z(S) = {x_n + y_m}_{n,m}, and so if z ∉ Z(S) then f_Z(z) = 0 and (2.12) is satisfied.
Otherwise, given z_{n,m} ≡ x'_n + y'_m, (1) and (1.3) obtain the first expression in (2.12):
Proof. We derive the first representation, with the second derived by a change of variable.
To first investigate f_Z(z) as defined above, let:

f_Z^{(n)}(z) = (R)\int_{-n}^{n} f_X(z-y) f_Y(y)\,dy.  (1)

Then f_Z^{(n)}(z) is well-defined as a Riemann integral by continuity of the integrand and Proposition
III.1.15.
To see that f_Z^{(n)}(z) is continuous for all n, let z_m → z and assume without loss of
generality that |z_m − z| < 1 for all m. Then by Proposition III.1.23:

\big| f_Z^{(n)}(z) − f_Z^{(n)}(z_m) \big| ≤ \int_{-n}^{n} |f_X(z-y) − f_X(z_m-y)| f_Y(y)\,dy.
Hence f_Z^{(n)}(z) is continuous and Riemann integrable on [−N, N] for all N by Proposition
III.1.15, and the improper integral is defined when it exists by:

\int_{-\infty}^{\infty} f_Z^{(n)}(z)\,dz ≡ \lim_{N\to\infty} \int_{-N}^{N} \int_{-n}^{n} f_X(z-y) f_Y(y)\,dy\,dz.
By continuity of the integrand, the Riemann integral of fX (z − y)fY (y) is well defined as
an integral over the rectangle [−N, N ] × [−n, n] by Proposition III.1.63. Corollary III.1.77
then justifies reversing the iterated integrals:
\int_{-\infty}^{\infty} f_Z^{(n)}(z)\,dz ≡ \lim_{N\to\infty} \int_{-n}^{n} \int_{-N}^{N} f_X(z-y)\,dz\, f_Y(y)\,dy
= \lim_{N\to\infty} \int_{-n}^{n} [F_X(N-y) − F_X(−N−y)] f_Y(y)\,dy.
Defining GN (y) ≡ [FX (N − y) − FX (−N − y)] fY (y), note that GN (y) → fY (y) point-
wise and this sequence is monotonically increasing. Thus interpreted as a Lebesgue integral
(Proposition III.2.18), Lebesgue’s monotone convergence theorem obtains:
\lim_{N\to\infty} (L)\int_{-n}^{n} G_N(y)\,dy = (L)\int_{-n}^{n} f_Y(y)\,dy.

Each of these integrals equals the Riemann counterparts by Proposition III.2.18 and thus:

(R)\int_{-\infty}^{\infty} f_Z^{(n)}(z)\,dz = \int_{-n}^{n} f_Y(y)\,dy = F_Y(n) − F_Y(−n).  (2)
For the integral of f_Z(z), it follows by definition that f_Z^{(n)}(z) → f_Z(z) pointwise, and
again the sequence is monotonically increasing by nonnegativity of the integrand in (1).
Another application of monotone convergence and (2) obtains:

(L)\int_{-\infty}^{\infty} f_Z(z)\,dz = \lim_{n\to\infty} \int_{-\infty}^{\infty} f_Z^{(n)}(z)\,dz = 1.
Although we will not prove this until Book V, fZ (z) is in fact continuous, and thus the
Riemann and Lebesgue integrals agree over every bounded interval [−N, N ] by Proposition
III.2.18, so taking the limit as N → ∞ :
(R)\int_{-\infty}^{\infty} f_Z(z)\,dz = 1.  (3)
In summary, fZ (z) is continuous, nonnegative, and integrates to 1, and thus has all the
properties of a density function.
Now define the distribution function F̃_Z(z) associated with f_Z(z):

F̃_Z(z) = (R)\int_{-\infty}^{z} f_Z(w)\,dw ≡ (R)\int_{-\infty}^{z} \int_{-\infty}^{\infty} f_X(w-y) f_Y(y)\,dy\,dw.  (4)

Note that F̃_Z(z) is continuously differentiable by Proposition III.1.33 and F̃'_Z(z) = f_Z(z). If
reversing the order of integration in (4) is justified, it will then follow that F̃Z (z) = FZ (z).
When interpreted as Lebesgue integrals using Proposition III.2.56, this justification is
found in Book V in either Fubini’s theorem or Tonelli’s theorem. This reversal of iterated
integrals can also be seen as a generalization of the Riemann result in Corollary III.1.77.
To prove it here with the same approach as above, note that by definition of the improper
Riemann integral:

F̃_Z(z) = \int_{-\infty}^{z} \Big[ \lim_{N\to\infty} H_N(w) \Big] dw,

where:

H_N(w) ≡ \int_{-N}^{N} f_X(w-y) f_Y(y)\,dy.
Switching back and forth between Riemann and Lebesgue integrals, we have by Lebesgue’s
monotone convergence theorem that:
F̃_Z(z) = \lim_{N\to\infty} \int_{-\infty}^{z} \int_{-N}^{N} f_X(w-y) f_Y(y)\,dy\,dw.  (5)
Then by (5) and Corollary 2.11, it follows that F̃Z (z) = FZ (z), and the proof is
complete.
Comparing with (1.65), Z is seen to have a normal distribution with µ = 0 and σ 2 = 1/2.
2. Average of two independent standard Cauchy is standard Cauchy: Let X
and Y be independent and have standard Cauchy distributions as in (1.69). In this example,
we derive that Z ≡ X/2 + Y /2 also has a standard Cauchy distribution.
As in item 1, by (2.13):
f_Z(z) = \frac{4}{\pi^2} \int_{-\infty}^{\infty} \frac{dx}{[1 + 4x^2]\,[1 + 4(z-x)^2]}.

To better exploit the symmetries of this integrand, substitute x = y + z/2 and evaluate
g(2z) to obtain:

g(2z) = \frac{4}{\pi} \int_{-\infty}^{\infty} \frac{1 + 4z^2}{[1 + 4(y+z)^2]\,[1 + 4(y-z)^2]}\,dy.  (2)
To prove that g(2z) = 1 for all z, we prove that g(0) = 1 and g'(2z) = 0 for all z.

a. g(0) = 1: Setting z = 0 in (2) and substituting x = 2y:

g(0) = \frac{2}{\pi} \int_{-\infty}^{\infty} \frac{dx}{[1 + x^2]^2},
Ratios of Random Variables 43
and this is seen to equal 1 after two applications of integration by parts, obtaining the
antiderivative:

\int \cos^2 θ\,dθ = \frac{1}{2}(\cos θ \sin θ + θ) + C.
b. g'(2z) = 0 for all z: Rewriting (2) with A(y, z) ≡ 1 + 4(z+y)^2:

g(2z) = \frac{4}{\pi} \int_{-\infty}^{\infty} \frac{1 + 4z^2}{A(y,z)A(-y,z)}\,dy.

This integral is absolutely convergent for all z since the integrand is O(1/y^2) at ±∞. In
addition, the derivative of this integrand:

\frac{\partial}{\partial z} \frac{1 + 4z^2}{A(y,z)A(-y,z)}
= \frac{8y A(y,z)A(-y,z) − 8(1 + 4z^2)\big[(z+y)A(-y,z) − (z−y)A(y,z)\big]}{[A(y,z)A(-y,z)]^2},  (3)
For any z, h(y, z) is an odd function of y, meaning that h(−y, z) = −h(y, z). The
integral of any odd function is 0 by substitution, and thus the proof of (1) is complete.
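The constancy of g can also be checked numerically (an illustrative sketch of ours; the truncation limits and grid are assumed choices): the convolution integral for f_Z should reproduce the standard Cauchy density 1/(π(1 + z²)) at every z.

```python
import math

def simpson(f, a, b, n):
    # Composite Simpson's rule on [a, b] with n (even) subintervals.
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += f(a + i * h) * (4 if i % 2 else 2)
    return s * h / 3

def f_Z(z, L=100.0, n=200_000):
    # Truncated convolution integral for Z = X/2 + Y/2, X, Y standard Cauchy;
    # the integrand is O(1/x^4), so the tails beyond [-L, L] are negligible.
    g = lambda x: 1.0 / ((1 + 4 * x * x) * (1 + 4 * (z - x) ** 2))
    return (4 / math.pi ** 2) * simpson(g, -L, L, n)

for z in (0.0, 0.7, 2.0):
    print(z, f_Z(z), 1 / (math.pi * (1 + z * z)))
```

At each z the quadrature value matches the standard Cauchy density, consistent with g(2z) = 1.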
what is the distribution function of the random variable Z ≡ X/Y ? What is the density
function of Z? To avoid definitional problems, it will generally be assumed that Y has
range (0, ∞) and thus FY (0) = 0. This assumption can be weakened somewhat when Y is
discrete.
Now λ[B_m] = f_Y(y_m) by definition. By independence and Proposition II.1.34, and since
y_m > 0 for all m:
The random variable Z in this case is well-defined under a more general assumption than
F_Y(0) = 0 as assumed above, and it is apparent that what is necessary is that f_Y(0) = 0
for the associated density function. To simplify the development, we generalize the notation
underlying (2.7) to {y_m}_{m=−M'}^{M} ⊂ R with M, M' ≤ ∞, and associated real nonnegative
sequence {ν_m}_{m=−M'}^{M} with sum of 1. Here we assume that y_m < 0 for m ≤ −1, y_0 = ν_0 = 0,
and y_m > 0 for m ≥ 1.
The next result provides the density function f_Z(z) in the more general setting of Proposition
2.21, but reduces to that for Proposition 2.20 by restricting the summation to m = 1, ..., M.
Proof. By definition, Z(S) = {x_n/y_m}_{n,m}, and so if z ∉ Z(S) then f_Z(z) = 0 and (2.17)
is satisfied. Otherwise, given z_{n,m} ≡ x'_n/y'_m, it follows from (1.3) and (2.16) that:
If y_m > 0:

\sum_{x_n ≤ z_{n,m} y_m} f_X(x_n) − \lim_{z\to z_{n,m}^-} \sum_{x_n ≤ z y_m} f_X(x_n) = \lim_{z\to z_{n,m}^-} \sum_{z y_m < x_n ≤ z_{n,m} y_m} f_X(x_n)
= f_X(z_{n,m} y_m).

For y_m < 0:

\sum_{x_n ≥ z_{n,m} y_m} f_X(x_n) − \lim_{z\to z_{n,m}^-} \sum_{x_n ≥ z y_m} f_X(x_n) = \lim_{z\to z_{n,m}^-} \sum_{z_{n,m} y_m ≤ x_n < z y_m} f_X(x_n)
= f_X(z_{n,m} y_m).
Proof. If z = 0 then FZ (0) = FX (0) by definition, and this is obtained from (2.18). We
now prove this result for z > 0 and assign z < 0 as an exercise.
As FX and FY are continuous and increasing functions, this integral is well defined over
every bounded interval [0, b] by Proposition III.4.17. By boundedness of these functions,
this integral has a well defined limit as b → ∞ by Proposition III.4.21. It then follows by
Proposition III.4.12 that integrals can be evaluated by any sequence of Riemann-Stieltjes
summations with partition mesh sizes µ → 0.
To this end, define P_n ≡ {j/2^n}_{j∈Z^+}, a partition of [0, ∞) of mesh size µ = 2^{−n}, where
Z^+ includes 0. Then the integral in (2.18) equals the limit as n → ∞ of the Riemann-Stieltjes
sums:

\sum_{j=0}^{\infty} F_X(z y_{j,n}) [F_Y((j+1)/2^n) − F_Y(j/2^n)],  (1)

B_j^{(n)}(z) = \{X ≤ z(j+1)/2^n\} ∩ A_j^{(n)},

and let B^{(n)}(z) = \bigcup_{j=0}^{\infty} B_j^{(n)}(z).
We claim that {B^{(n)}(z)}_{n=1}^{\infty} is a nested collection, B^{(n+1)}(z) ⊂ B^{(n)}(z), and:

\bigcap_{n=1}^{\infty} B^{(n)}(z) = B(z) ≡ \{X/Y ≤ z\}.  (2)
which in turn follows from the definitions. If s ∈ B_{2j}^{(n+1)}(z) then:
Comparing to (1), this is the limit of Riemann-Stieltjes sums with tags y_{j,n} = (j+1)/2^n,
and the proof is complete for z > 0.
(n)
Exercise 2.24 (z < 0) Complete the above proof for the case z < 0. Hint: Define Bj (z) =
T (n)
{X ≤ zj/2n } Aj .
As was noted above Proposition 2.21, the assumption that F_Y(0) = 0 is more than
is needed. Since λ[Y^{−1}(0)] = 0 by (1.3) for continuous distribution functions, we can
and this is obtained from (2.19). The proofs of this result for z > 0 and z < 0 are similar to
those above, so we omit some of the details.
By continuity and monotonicity of the distribution functions, this integral is well defined
over every bounded interval [a, b] by Proposition III.4.17. By boundedness of these functions,
this integral has a well defined limit as a → −∞ and b → ∞ by Proposition III.4.21. It then
follows by Proposition III.4.12 that integrals can be evaluated by any sequence of Riemann-
Stieltjes summations with partition mesh size µ → 0.
To this end, define P_n ≡ {j/2^n}_{j∈Z}, a partition of (−∞, ∞) of mesh size µ = 2^{−n}. Then
the integrals in (2.19) equal the limit as n → ∞ of the Riemann-Stieltjes sums:

\sum_{j=0}^{\infty} F_X(z y_{j,n}) [F_Y((j+1)/2^n) − F_Y(j/2^n)]  (1)
+ \sum_{j=-\infty}^{-1} [1 − F_X(z y_{j,n})] [F_Y((j+1)/2^n) − F_Y(j/2^n)],
Further:

B(z) ≡ B_+(z) ∪ B_−(z) = \{X/Y ≤ z\}.  (3)

Similarly for z < 0, define B_j^{(n)}(z) ∈ E by:

B_j^{(n)}(z) = \begin{cases} \{X ≤ zj/2^n\} ∩ A_j^{(n)}, & j ≥ 0, \\ \{X ≥ z(j+1)/2^n\} ∩ A_j^{(n)}, & j < 0, \end{cases}

and B_±^{(n)}(z), B_±(z) and B(z) as above. Again the B_±^{(n)}(z)-collections are nested and (2)
and (3) are verified.
For either case of z < 0 or z > 0, we have by finite additivity and continuity from above:

F(z) = λ[B(z)] = λ[B_+(z)] + λ[B_−(z)] = \lim_{n\to\infty} λ[B_+^{(n)}(z)] + \lim_{n\to\infty} λ[B_−^{(n)}(z)].

As in Proposition 2.23 and Exercise 2.24, λ[B_+^{(n)}(z)] equals the first Riemann-Stieltjes sum
in (1) for y_{j,n} that reflects the X-boundaries in the definitions of B_j^{(n)}(z). Thus this limit
agrees with the first Riemann-Stieltjes integral in (2.19).
For the second limit, assume that z > 0. Then by countable additivity and then independence,
recalling that λ[X^{−1}(zj/2^n)] = 0 for all such points by continuity of F_X:

λ[B_−^{(n)}(z)] = \sum_{j=-\infty}^{-1} λ\big[\{X ≥ zj/2^n\} ∩ A_j^{(n)}\big]
= \sum_{j=-\infty}^{-1} [1 − F_X(zj/2^n)] [F_Y((j+1)/2^n) − F_Y(j/2^n)].
The same derivation holds for z < 0. Thus this limit agrees with the second Riemann-
Stieltjes integral in (2.19).
The equivalence of (2.19) and (2.20) follows from \int_{-\infty}^{0} dF_Y(y) = F_Y(0), which in turn
follows from Riemann-Stieltjes sums.
When f_Z(z) so defined is continuous, then F_Z(z) is definable as a Riemann integral and
F'_Z(z) = f_Z(z).
Proof. As was the case for Proposition 2.17, the subtlety here relates to the existence of
this integral. For this proof, we use Tonelli's theorem of Book V to justify reversing iterated
integrals.
From (2.21) and a substitution into this Riemann integral, recalling Remark III.1.35:

F_Z(z) = \int_{0}^{\infty} \int_{-\infty}^{zy} f_X(w)\,dw\, f_Y(y)\,dy + \int_{-\infty}^{0} \int_{zy}^{\infty} f_X(w)\,dw\, f_Y(y)\,dy
= \int_{0}^{\infty} \int_{-\infty}^{z} y f_X(vy) f_Y(y)\,dv\,dy + \int_{-\infty}^{0} \int_{z}^{-\infty} y f_X(vy) f_Y(y)\,dv\,dy
= \int_{0}^{\infty} \int_{-\infty}^{z} y f_X(vy) f_Y(y)\,dv\,dy − \int_{-\infty}^{0} \int_{-\infty}^{z} y f_X(vy) f_Y(y)\,dv\,dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{z} |y| f_X(vy) f_Y(y)\,dv\,dy.
As this iterated integral equals FZ (z), it is well-defined and finite as a Riemann inte-
gral, though Corollary III.1.77 does not apply due to the unbounded domain of integration.
However, by continuity of the integrand, this integral is well-defined and finite when inter-
preted as a Lebesgue integral by Proposition III.2.56. Tonelli’s theorem then obtains that
this iterated Lebesgue integral can be reversed:
F_Z(z) = (L)\int_{-\infty}^{z} \Big[ \int_{-\infty}^{\infty} |y| f_X(vy) f_Y(y)\,dy \Big] dv.  (1)
is Lebesgue integrable. By Proposition III.3.39, (1) obtains F'_Z(z) = g(z) for almost all z.
Thus from the Lebesgue theory we obtain that:

F'_Z(z) = g(z), a.e.,  (2)

and g(z) so defined is indeed a density function associated with F_Z(z) in the Lebesgue sense:

F_Z(z) = (L)\int_{-\infty}^{z} g(w)\,dw.  (3)
Finally, the integrand in (2.22) is continuous for any z by assumption. It is thus Riemann
integrable over any bounded interval [a, b] by Proposition III.1.15, and further by Proposition
III.2.18:

(R)\int_{a}^{b} |y| f_X(zy) f_Y(y)\,dy = (L)\int_{a}^{b} |y| f_X(zy) f_Y(y)\,dy.
As this is true for every interval [a, b], and the integral on the right is well defined as
a → −∞ and b → ∞, it follows that fZ (z) in (2.22) satisfies fZ (z) = g(z) for all z and so
from (1) and (2):
F_Z(z) = (L)\int_{-\infty}^{z} f_Z(w)\,dw,  F'_Z(z) = f_Z(z), a.e.
If f_Z(z) is continuous, then F_Z(z) is definable as a Riemann integral by the same argument
as above, and F'_Z(z) = f_Z(z) by Proposition III.1.33.
Example 2.28 1. Ratio of gammas and the beta distribution: Let X and Y be inde-
pendent gamma random variables with common λ parameter and respective parameters
of α1 and α2 . If Z = X/Y, we have from substituting (1.55) into (2.22) that for z > 0 :
Specifically, let F_β(w) denote the beta distribution function with parameters v = α_1 and
w = α_2. Integrating f_Z(z) and substituting y = x/(1 + x) obtains:

F_Z(z) = \frac{Γ(α_1 + α_2)}{Γ(α_1)Γ(α_2)} \int_{0}^{z} \frac{x^{α_1-1}}{(1 + x)^{α_1+α_2}}\,dx
= \frac{Γ(α_1 + α_2)}{Γ(α_1)Γ(α_2)} \int_{0}^{z/(1+z)} y^{α_1-1} (1 − y)^{α_2-1}\,dy
= F_β\Big(\frac{z}{1+z}\Big).
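A Monte Carlo sketch of this identity (the parameter choices are ours): with α₁ = α₂ the beta density is symmetric about 1/2, so F_Z(1) = F_β(1/2) = 1/2, which can be checked empirically with `random.gammavariate`.

```python
import random

rng = random.Random(7)
alpha, trials = 2.0, 200_000
hits = 0
for _ in range(trials):
    x = rng.gammavariate(alpha, 1.0)  # gamma with shape alpha, scale 1
    y = rng.gammavariate(alpha, 1.0)
    hits += (x / y) <= 1.0
print(hits / trials)  # should be close to F_beta(1/2) = 1/2
```

The empirical frequency of {X/Y ≤ 1} is 1/2 to Monte Carlo accuracy, as the beta representation predicts.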
2. Ratio of chi-squared and the F-distribution: In the special case of item 1 where
λ = 1/2, α1 = n/2, and α2 = m/2, these X and Y gamma variates are called chi-
squared variables with respective degrees of freedom of n and m as noted in
Remark 1.30. In this special case, the random variable:
F ≡ \frac{m}{n} Z = \frac{X/n}{Y/m},
Now let:

X = \frac{\bar{Z} − µ}{σ/\sqrt{n}},  Y^2 = \frac{n s^2}{σ^2}.
In a given application, Z̄ and s² are simply numbers. But as discussed in Chapter II.4
and continued in Chapter 5, we can interpret Z̄ and s² as random variables by interpreting
{Z_i}_{i=1}^{n} as independent, identically distributed random variables on some probability space.
By Section 4.2.4, the mean of Z̄ is µ and its variance is σ²/n, and thus X has mean 0 and
variance 1. Indeed X is standard normal by item 4 of Section 4.4.1.
We now motivate the fact that Y² is chi-squared with n − 1 degrees of freedom. To this
end:

n s^2 = \sum_{i=1}^{n} \big[(Z_i − µ) − (\bar{Z} − µ)\big]^2 = \sum_{i=1}^{n} (Z_i − µ)^2 − n(\bar{Z} − µ)^2,
The first result will be applied in Chapter 5 on simulating samples of random variables.
Proposition 2.30 (Z = X/(X + Y) is beta for F_X, F_Y gamma) Let X and Y be independent
gamma random variables defined on a probability space (S, E, λ), with parameters α_1, λ and
α_2, λ respectively. Define the random variable:

Z = \frac{X}{X + Y}.

Then Z is defined on the interval (0, 1), and has density function f_Z(z) that is independent
of the parameter λ:

f_Z(z) = \frac{Γ(α_1 + α_2)}{Γ(α_1)Γ(α_2)} z^{α_1-1} (1 − z)^{α_2-1}.  (2.28)
Thus we can redefine X and Y to have range (0, ∞) without changing their distributions or
the calculations below.
From (1.55),
Comparing with (1.56), this integral equals Γ(α1 + α2 ) and the derivation of (2.28) is
complete.
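A simulation sketch of Proposition 2.30 (parameters and function names are ours): Z = X/(X + Y) should have the beta mean α₁/(α₁ + α₂) regardless of the common scale parameter.

```python
import random

def mean_Z(alpha1, alpha2, scale, trials=200_000, seed=11):
    # Empirical mean of Z = X/(X + Y) for independent gammas with a
    # common scale (scale = 1/lambda in the text's parameterization).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.gammavariate(alpha1, scale)
        y = rng.gammavariate(alpha2, scale)
        total += x / (x + y)
    return total / trials

# The beta(2, 3) mean is 2/5 = 0.4, for either scale choice.
print(mean_Z(2.0, 3.0, 1.0), mean_Z(2.0, 3.0, 5.0, seed=12))
```

Both runs return roughly 0.4, illustrating that the distribution of Z does not depend on λ.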
X_{(k,M)} ≡ X_{(k)}.
Notation 3.1 (Order of order statistics) Perhaps ironically, there is no universal no-
tational convention for the order of order statistics. In some references, order statistics
are ordered in the natural numerical order, so X(k) ≤ X(k+1) as above. However, it is not un-
common to see order statistics denoted so that X(1) is the largest, and hence X(k+1) ≤ X(k) .
In this section, we derive the distribution and density functions of kth order statistics,
including various multivariate results, and introduce the Rényi representation theorem
for exponential order statistics which will be applied in Section 7.2 on extreme value theory.
Definition 3.2 (M-Sample) Let a probability space (S, E, λ) and random variable X :
S → R be given. With M finite or infinite, a collection of random variables {X_j}_{j=1}^{M}
defined on a probability space (S', E', λ') is said to be an M-sample of X, or a sample
of X when M is implied, if this collection is independent, and identically distributed
with X (i.i.d.-X):

1. {X_j}_{j=1}^{M} are independent if given (i_1, ..., i_m) ⊂ (1, 2, ..., M) and {A_j}_{j=1}^{m} ⊂ B(R):

λ'\Big[\bigcap_{j=1}^{m} X_{i_j}^{-1}(A_j)\Big] = \prod_{j=1}^{m} λ'[X_{i_j}^{-1}(A_j)].  (3.1)

2. {X_j}_{j=1}^{M} are identically distributed with X if for all j, and all A ∈ B(R):
DOI: 10.1201/9781003264583-3 57
58 Order Statistics
1. {X_j}_{j=1}^{M} are independent if given (i_1, ..., i_m) ⊂ (1, 2, ..., M), then for all x =
(x_{i_1}, ..., x_{i_m}) ∈ R^m:

F(x_{i_1}, ..., x_{i_m}) = \prod_{j=1}^{m} F_j(x_{i_j}),  (3.3)

1. S' ≡ S^M: {X_j}_{j=1}^{M} is defined by the projection mapping:

X_j(s) = s_j.
The terminology random sample from a random variable X implies the result of
some experimental or other empirical process by which numerical values {Xj (sj )}M j=1
are observed, generated, or otherwise obtained. For this collection to be called a “random
sample” requires that these variates be deemed independent, and each governed by the
distribution function underlying X. In practice, such determinations are made with a mix
of science and judgment.
Thus {X_j}_{j=1}^{M} as an M-sample, and {X_j(s)}_{j=1}^{M} as a random sample, are intimately
related. In theory, the collection {X_j(s)}_{j=1}^{M} as defined above has the potential to be a
random sample for any s ∈ S', but not all random samples are equally useful.
For example, if {X_j}_{j=1}^{M} represents the outcomes of M coin flips, then certainly there is
an s ∈ S 0 for which Xj (s) = H for all j. In isolation, this would seem to be an odd example
of a “random sample,” especially for M large, but it could be an example of such a random
sample from the experimental or other empirical process by which it was obtained.
In order to be deemed random samples, it must be the case that the experimental or
other empirical process employed to produce these, if repeated N times, would result in a
collection of random samples {{X_j(s_k)}_{j=1}^{M}}_{k=1}^{N} which at least approximately satisfied the
Distribution Functions of Order Statistics 59
Definition 3.3 (Order statistics) Let {X_j}_{j=1}^{M} be independent, identically distributed
random variables on a probability space (S', E', λ'). Define the collection of kth order
statistics {X_{(k)}}_{k=1}^{M} by:

• X_{(1)} ≡ min{X_j}_{j=1}^{M};
Proposition 3.4 (Order statistics are random variables) The collection of kth order
statistics {X_{(k)}}_{k=1}^{M} of Definition 3.3 are random variables on (S', E', λ').
Proof. That X_{(1)} and X_{(M)} are random variables on (S', E', λ') follows from Proposition
I.3.47. For other X_{(k)}, define Y_k = \min_j\{\max(X_j − X_{(k−1)}, 0)\}, which is a random variable
by this proposition. Then X_{(k)} = X_{(k−1)} + Y_k is a random variable by Proposition I.3.30.
Notation 3.5 ((S', E', λ') → (S, E, λ)) We now drop the added notation of (S', E', λ'),
which was largely used to distinguish this space from the space (S, E, λ) on which the original
X was defined. But for the rest of this chapter we will only be investigating an i.i.d.
collection {X_j}_{j=1}^{M} and its associated kth order statistics {X_{(k)}}_{k=1}^{M}, and for this we resort
back to the simpler notation (S, E, λ).
Proposition 3.6 (F_{(k)}(x)) Given independent random variables {X_j}_{j=1}^{M} with common
distribution function F(x), the distribution function of the kth order statistic F_{(k)}(x) is
defined on the same domain as is F(x) and given by:

F_{(k)}(x) = \sum_{j=k}^{M} \binom{M}{j} F^j(x) (1 − F(x))^{M-j}.  (3.5)
Proof. For a given ordering of independent variates (X_1, ..., X_M) and j subscripts specified,
the probability that exactly these j variates are less than or equal to x and the remaining
M − j variates greater than x is F^j(x)(1 − F(x))^{M-j}. There are \binom{M}{j} such specifications
possible, so by independence, the probability that exactly j variates are less than or equal to
x is \binom{M}{j} F^j(x)(1 − F(x))^{M-j}. If A_j ⊂ S denotes the event on which exactly j variates are
less than or equal to x, then {A_j}_{j=0}^{M} are disjoint and union to S.
As noted above, the event X_{(k)} ≤ x is the union of such A_j-events for j ≥ k. Addition
of probabilities is justified by the disjointness of this collection and finite additivity of λ.
In particular, the distribution functions for the smallest and largest uniform variate are
respectively given on [0, 1] by:

F_{(1)}(x) = 1 − (1 − x)^M;  F_{(M)}(x) = x^M.
Consequently, the first order statistic X(1) from a sample of M exponentials with param-
eter λ is exponentially distributed with parameter λM. This observation will be expanded
upon in the section below on the Rényi representation theorem on order statis-
tics.
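A quick simulation of this observation (sample sizes and parameters are ours): the empirical frequency of {min ≤ x} for M exponential(λ) variates should match 1 − e^{−λMx}.

```python
import math
import random

rng = random.Random(3)
lam, M, trials, x = 1.5, 4, 100_000, 0.2
# Empirical P(X_(1) <= x) for the minimum of M exponential(lam) variates.
hits = sum(min(rng.expovariate(lam) for _ in range(M)) <= x
           for _ in range(trials))
print(hits / trials, 1 - math.exp(-lam * M * x))
```

The two values agree to Monte Carlo accuracy, consistent with X_(1) being exponential with parameter λM.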
If F(x) is continuously differentiable, then f(x) = F'(x) for all x by the Riemann version
of this result in Proposition III.1.32, and:

F(x) = (R)\int_{-\infty}^{x} f(y)\,dy.
In either case, the given assumption on F (x) yields the analogous property on the distribu-
tion function of F(k) (x) in (3.5).
To simplify the next statement, we assume F (x) is continuously differentiable. The
absolutely continuous result is derived by qualifying that the density function is defined
almost everywhere, and obtains the associated distribution function as a Lebesgue integral
as above.
Proposition 3.8 (f_{(k)}(x), continuous f(x)) If F(x) is continuously differentiable with
associated continuous density f(x), then the density function f_{(k)}(x) of the kth order statistic
is continuous and given by:

f_{(k)}(x) = c(k) F^{k-1}(x) (1 − F(x))^{M-k} f(x).  (3.6)

Proof. As F_{(k)}(x) in (3.5) is continuously differentiable, the associated density satisfies
f_{(k)}(x) = F'_{(k)}(x) by Proposition III.1.32, and is given by:

F'_{(k)}(x) = \sum_{j=k}^{M} \binom{M}{j} \big[ j F^{j-1}(x) (1 − F(x))^{M-j} − (M − j) F^j(x) (1 − F(x))^{M-j-1} \big] f(x).
Example 3.9 (f_{(k)}(x) beta for f(x) uniform) Let U be continuous and uniformly distributed
on [0, 1]. Then the density function of the kth order statistic of a sample of M is a
beta density with parameters v = k and w = M − k + 1:

f_{(k)}(x) = \frac{Γ(M + 1)}{Γ(k)Γ(M − k + 1)} x^{k-1} (1 − x)^{M-k}.  (3.8)

This follows from (3.6) since F(x) = x here, by (1.61) and the last expression in (3.7).
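A simulation sketch of (3.8) (sample sizes are ours): the beta(k, M − k + 1) distribution has mean k/(M + 1), so the kth order statistic of M uniforms should average to that value.

```python
import random

rng = random.Random(5)
M, k, trials = 5, 2, 100_000
total = 0.0
for _ in range(trials):
    sample = sorted(rng.random() for _ in range(M))
    total += sample[k - 1]  # the kth order statistic X_(k)
print(total / trials, k / (M + 1))  # beta(2, 4) mean is 1/3
```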
For this investigation, we will transform this probability statement into probability state-
ments on the various Xj , and then obtain the connection between F(1,...,M ) (x1 , ..., xM ) and
the common distribution F (x). To do this, we introduce a partitioning of RM into nearly
disjoint sets with various implied orderings of components.
Recall the following notion of a permutation. This definition also applies to M = ∞,
but we will not use this generalization except in Exercise 3.11.
Exercise 3.11 (Number of permutations) Show that for M finite, there are M! possible
permutations including the identity permutation, π(x_j) = x_j.
If M = ∞, then there are uncountably many. Hint for M = ∞: Given a binary expansion
for b ∈ [0, 1), b = 0.b_1 b_2 ⋯ with each b_j ∈ {0, 1}, define a permutation π_b on {x_j}_{j=1}^{∞} so
that for each j:
Given a permutation π : (1, 2, ..., M) → (π(1), ..., π(M)) where M < ∞, define the set
D_π ⊂ R^M by:

D_π = \{(x_1, ..., x_M) \,|\, x_{π(1)} ≤ x_{π(2)} ≤ ··· ≤ x_{π(M)}\}.

The set D_π is thus the collection of points in R^M which are weakly ordered by π, meaning
that equality is allowed. Further:

\bigcup_{π} D_π = R^M,

where this union is over all M! permutations.
Define the set D°_π ⊂ D_π:

D°_π = \{(x_1, ..., x_M) \,|\, x_{π(1)} < x_{π(2)} < ··· < x_{π(M)}\}.

The set D°_π is thus the collection of points in R^M which are strongly ordered by π.
Example 3.12 (Dπ and order statistics) For points in R3 , there are 3! = 6 permuta-
tions as noted in Exercise 3.11. These can be identified in terms of the results of π on the
coordinate indexes,
π : (1, 2, 3) → (π(1), π(2), π(3)).
Joint Distribution of All Order Statistics 63
D_π = \{(x_1, x_2, x_3) \,|\, x_2 ≤ x_3 ≤ x_1\}.

To connect with order statistics, assume that {X_j}_{j=1}^{3} is a collection of independent,
identically distributed random variables defined on (S, E, λ), and let X ≡ (X_1, X_2, X_3) be the
associated random vector. Given π, define A_π by:

A_π ≡ X^{-1}(D_π) = \{s \,|\, X_{π(1)} ≤ X_{π(2)} ≤ X_{π(3)}\}.

Thus on A_π:

X_{(1)} = X_{π(1)}, X_{(2)} = X_{π(2)}, X_{(3)} = X_{π(3)}.

For example, if π = (2, 3, 1) then:

A_π = \{s \,|\, X_2 ≤ X_3 ≤ X_1\},

and on A_π:

X_{(1)} = X_2, X_{(2)} = X_3, X_{(3)} = X_1.
We will see below that D_π ∈ B(R^3). As X is a random vector, this obtains A_π ∈ E by
(1.9). Then since S = \bigcup_π A_π, the D_π-sets provide a way of decomposing S into measurable
sets in which the ordering of {X_j}_{j=1}^{3} is known.
But this ordering is based on ≤ rather than <, and thus it is not uniquely defined. In
general, D_{π_1} ∩ D_{π_2} ≠ ∅.
In the next result we investigate the intersection sets D_π ∩ D_{π'}, and will show that these
have Lebesgue measure 0 in R^M. For certain distribution functions F(x), this will then imply
that X^{-1}(D_{π_1} ∩ D_{π_2}) = A_{π_1} ∩ A_{π_2} has λ-measure 0 in S. This will then yield detailed
information on a decomposition of S into measurable sets with well-defined orderings of
{X_j}_{j=1}^{M}.
Proposition 3.13 (D_π, D°_π ∈ B(R^M), {D°_π}_π are disjoint, m(D_{π_1} ∩ D_{π_2}) = 0) With the
notation above, D_π is closed and D°_π is open. Thus D_π, D°_π ∈ B(R^M) and are Lebesgue measurable
for all π. If π_1 ≠ π_2, then:

D°_{π_1} ∩ D°_{π_2} = ∅,  m(D_{π_1} ∩ D_{π_2}) = 0.

Proof. The set D_π is closed by Exercises III.4.46 and 3.14 as the intersection of M − 1
closed sets:

\{x_{π(j)} ≤ x_{π(j+1)}\}_{j=1}^{M-1}.

Closed sets are Borel measurable by Definition I.2.13, and hence Lebesgue measurable by
Proposition I.2.38.
Each D°_π is open by the same exercises as the intersection of M − 1 open sets:

\{x_{π(j)} < x_{π(j+1)}\}_{j=1}^{M-1}.

If π_1 ≠ π_2, then D°_{π_1} ∩ D°_{π_2} = ∅, since no point can satisfy both strict orderings:

x_{π_1(1)} < ··· < x_{π_1(M)} and x_{π_2(1)} < ··· < x_{π_2(M)}.
Exercise 3.14 Prove that \{x_{π(j)} ≤ x_{π(j+1)}\} is closed and \{x_{π(j)} < x_{π(j+1)}\} is open in
R^M. Hint: Recall Definition I.2.10.
Verify that any intersection set D_{π_1} ∩ D_{π_2} for π_1 ≠ π_2 has Lebesgue measure 0. Hint:
Recall Definition III.1.64 and use (1).
Before proving the general result for F(1,...,M ) (x1 , ..., xM ), we develop the idea in an
example.
As noted in the remarks before (1.30), it will be proved in Book V that for any Borel
measurable set A ∈ B(R^3) and random vector X = (X_1, X_2, X_3) defined on (S, E, λ):

Pr[(X_1, X_2, X_3) ∈ A] = \int_A f(x_1, x_2, x_3)\,dx,  (3.10)

where dx denotes Lebesgue measure on R^3.
This identity is true for general M, and extends the result that for x = (x_1, ..., x_M) and
A = \prod_{i=1}^{M} (−∞, x_i], this integral obtains F(x) = λ[(X_1, ..., X_M)^{-1}A]. By (1.14), with λ_F
the probability measure on (R^M, B(R^M)) induced by the distribution function F(x), the
result in (3.10) can also be expressed:

λ_F(A) = \int_A f(x)\,dx.
In the special case where f(x_1, x_2, x_3) is continuous almost everywhere, these integrals can
also be interpreted as Riemann integrals by Propositions III.1.68 (extended with III.(1.53))
and III.2.56.
By (3.10) and (1):

Pr[(X_1, X_2, X_3) ∈ D_π] = \int_{R} \int_{x_2 ≤ x_3} \int_{x_1 ≤ x_2} f(x_1) f(x_2) f(x_3)\,dx_1\,dx_2\,dx_3
= \int_{R} \int_{x_2 ≤ x_3} F(x_2) f(x_2) f(x_3)\,dx_2\,dx_3
= \frac{1}{2} \int_{R} F^2(x_3) f(x_3)\,dx_3
= 1/3!.

This same result is produced for any permutation π, with only a change in the definition of
the iterated integrals and the order of the integrations.
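A Monte Carlo sketch of this 1/3! computation (the setup is ours): for i.i.d. continuous variates, the particular ordering X₂ ≤ X₃ ≤ X₁ of Example 3.12 should occur with probability 1/6.

```python
import random

rng = random.Random(9)
trials = 120_000
hits = 0
for _ in range(trials):
    x1, x2, x3 = rng.random(), rng.random(), rng.random()
    hits += (x2 <= x3 <= x1)  # the ordering defining this D_pi
print(hits / trials, 1 / 6)
```

The empirical frequency is close to 1/6, and by symmetry the same holds for each of the 3! orderings.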
For any π, since F(x) is absolutely continuous and thus continuous, the same calculation
produces:

Pr[(X_1, X_2, X_3) ∈ D°_π] = Pr[(X_1, X_2, X_3) ∈ D_π].

For example:

\int_{x_1 < x_2} f(x_1)\,dx_1 = F(x_2^-) = F(x_2).

Recalling Exercise III.1.71, this can also be understood in terms of the boundaries of D_π
which have measure zero by Exercise 3.14.
In M dimensions, the same calculations obtain that for any π:

Pr[(X_1, ..., X_M) ∈ D_π] = 1/M!,

and thus:

\sum_{π} Pr[(X_1, ..., X_M) ∈ D_π] = 1.  (3.11)

Since {D_π}_π is not a disjoint collection, it is important to recognize that (3.11) is not
a consequence of finite additivity of λ. Instead, it is absolute continuity of the distribution
function F(x) that produced this result, because this assumption assured that for any
permutation:

Pr[(X_1, ..., X_M) ∈ D_π] = Pr[(X_1, ..., X_M) ∈ D°_π],

or equivalently:

Pr[(X_1, ..., X_M) ∈ (D_π − D°_π)] = 0.
That (3.11) can fail in the absence of absolute continuity is exemplified next.
Example 3.16 (Discrete F(x)) If X : S → {0, 1} is binomial, with λ[X^{-1}(1)] = p with
0 < p < 1, then with M = 2 there are only 2 permutations:

• If π_1 : (1, 2) → (1, 2), then (X_1, X_2)^{-1}D_{π_1} = \{X_1 ≤ X_2\} and D_{π_1} =
\{(0, 0), (0, 1), (1, 1)\}.
• If π_2 : (1, 2) → (2, 1), then (X_1, X_2)^{-1}D_{π_2} = \{X_2 ≤ X_1\} and D_{π_2} =
\{(0, 0), (1, 0), (1, 1)\}.

In either case:

λ[(X_1, X_2)^{-1}D_π] = 1 − p(1 − p),

and so:

\sum_{π} λ[(X_1, X_2)^{-1}D_π] = 2[1 − p(1 − p)] > 1.

This sum exceeds 1 by exactly p^2 + (1 − p)^2, and this is because D_{π_1} ∩ D_{π_2} =
\{(0, 0), (1, 1)\} and:

λ[(X_1, X_2)^{-1}(D_{π_1} ∩ D_{π_2})] = p^2 + (1 − p)^2.
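The overlap can be checked by exact enumeration over the four outcomes of (X₁, X₂) (a sketch of ours with p = 0.3):

```python
p = 0.3
# Joint probabilities of the four outcomes of (X1, X2), by independence.
prob = {(a, b): (p if a else 1 - p) * (p if b else 1 - p)
        for a in (0, 1) for b in (0, 1)}
p1 = sum(v for (a, b), v in prob.items() if a <= b)  # lambda[{X1 <= X2}]
p2 = sum(v for (a, b), v in prob.items() if b <= a)  # lambda[{X2 <= X1}]
print(p1 + p2, 1 + p ** 2 + (1 - p) ** 2)  # excess over 1 is the overlap
```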
With the above warm-up, we now turn to the main result. The reader may want to
supplement this proof with a graphical depiction of the various sets in R2 .
Notation 3.17 Note that given random variables (X_1, ..., X_M), (X_{π(1)}, ..., X_{π(M)}) denotes
the reordering given a permutation π, while (X_{(1)}, ..., X_{(M)}) denotes the associated
vector of order statistics.

Proposition 3.18 (F_{(1,...,M)}(x_1, ..., x_M), A.C. F(x)) Given independent, identically distributed
random variables (X_1, ..., X_M) defined on (S, E, λ) with absolutely continuous distribution
function F(x), the joint distribution function F_{(1,...,M)} for all order statistics
(X_{(1)}, ..., X_{(M)}) is defined on x_1 ≤ x_2 ≤ ··· ≤ x_M by:

F_{(1,...,M)}(x_1, ..., x_M) = M! F(x_1) \prod_{j=2}^{M} [F(x_j) − F(x_{j-1})].  (3.12)
Then as sets in S:

\{(X_{(1)}, ..., X_{(M)}) ∈ R_x\} = \bigcup_{π} \{(X_1, ..., X_M) ≤ (x_{π(1)}, ..., x_{π(M)})\}
= \bigcup_{π} \{(X_1, ..., X_M) ∈ D_π ∩ R_x\}.

Thus:

F_{(1,...,M)}(x_1, ..., x_M) ≡ Pr[(X_{(1)}, ..., X_{(M)}) ∈ R_x]
= λ\Big[(X_1, ..., X_M)^{-1}\Big[\bigcup_{π} (D_π ∩ R_x)\Big]\Big]
= λ\Big[\bigcup_{π} (X_1, ..., X_M)^{-1}(D_π ∩ R_x)\Big].
Density Functions on Rn 67
identity.
As {X_{π^{-1}(j)}}_{j=1}^{M} is just a reordering of {X_j}_{j=1}^{M}, this obtains:

F_{(1,...,M)}(x_1, ..., x_M) = \sum_{π} F(x_1)[F(x_2) − F(x_1)] ··· [F(x_M) − F(x_{M-1})]
= M! F(x_1)[F(x_2) − F(x_1)] ··· [F(x_M) − F(x_{M-1})].
Example 3.19 (Discrete F(y)) When the distribution function is not absolutely continuous,
the above result is not valid. For X binomial as in Example 3.16, a calculation shows
that F_{(1,2)}(0, 1) = 1 − p^2, since this is the probability that the smaller variate is less than or
equal to 0, and the larger is less than or equal to 1. Only (1, 1) fails this criterion. On the
other hand, 2F(0)[F(1) − F(0)] = 2p(1 − p).
The domain of integration is also denoted R_x ≡ {y ≤ x}, shorthand for {y_j ≤ x_j for all j}. This domain of integration also appears implicitly in Definition 1.6 for F(x):

F(x_1, x_2, ..., x_n) = λ[X^{−1}R_x].
As in Section 1.3, the integral in (3.13) may be defined in the sense of Riemann or
Lebesgue, depending on the properties of the function f (x). In either case:
∫_{ℝⁿ} f(y) dy = 1,
and such densities are not unique since in both integration theories, integrands may be
changed pointwise in various ways without changing the value of the integral. Recalling
Exercise 1.24, the primary exception to this observation is for the Riemann theory when
f (x) is continuous. Then one can say that this density function is unique among continuous
functions.
In one variable, singular and saltus distribution functions are examples that illustrate
that not all distribution functions have associated density functions in the above sense. The
same is true for joint distribution functions. For example if F (x) is the joint distribution
function of independent random variables with a common singular distribution function,
then by Proposition 1.14:

F(x) = ∏_{i=1}^{n} F(x_i).
If F(x) had an associated density function f(x), then integrating out the other variables would show that the marginal distributions have density functions, contradicting the fact that a singular F(x_i) has no density.
In one dimension, the existence of a density function f(x) in the sense of a Lebesgue integral required absolute continuity of the distribution function F(x) by Proposition III.3.62. Then F′(x) exists almost everywhere, and we can take as a density any function f(x) with f(x) = F′(x) a.e. When such a distribution function is also continuously differentiable, then a density function exists in the sense of a Riemann integral by Proposition III.1.32. Now F′(x) is continuous, and we can again take f(x) = F′(x) as a density.
For joint distribution functions, the Riemann theory is essentially the same for a continuously differentiable joint distribution function F(x), by which is meant that f(x) as defined in (3.14) below is a continuous function. For densities in the Lebesgue sense, it will
be seen in Book V that a density exists only when the distribution function is absolutely continuous. Generalizing the discussion in Summary 1.20 and Remark 1.21, absolute continuity
will there be defined in terms of the associated induced probability measure λF defined on
Rn , recalling Proposition 1.9.
Turning to some details, assume that such f (x) exists in the Riemann sense. Then by
Corollary III.1.77, the integral for F (x) in (3.13) can be expressed as an iterated integral:
F(x) = (R) ∫_{−∞}^{x_n} ⋯ ∫_{−∞}^{x_1} f(y_1, y_2, ..., y_n) dy_1 dy_2 ⋯ dy_n,
where by “iterated” is meant that these integrals can be evaluated one at a time, in this or in any given order. While this corollary was stated in terms of integrals over bounded rectangles R = ∏_{j=1}^{n} (a_j, x_j], since F(x) → 0 as all a_j → −∞, this representation is valid over R_x.
If f(y_1, y_2, ..., y_n) is continuous, Proposition III.1.76 states, with the same generalization, that for any (x_1, ..., x_{n−1}), the function:

g(x_1, ..., x_{n−1}, y_n) ≡ (R) ∫_{−∞}^{x_{n−1}} ⋯ ∫_{−∞}^{x_1} f(y_1, ..., y_{n−1}, y_n) dy_1 ⋯ dy_{n−1},

is such that F is differentiable in x_n with:

∂F/∂x_n = g(x_1, x_2, ..., x_n).

In other words:

∂F/∂x_n = (R) ∫_{−∞}^{x_{n−1}} ⋯ ∫_{−∞}^{x_1} f(y_1, y_2, ..., y_{n−1}, x_n) dy_1 dy_2 ⋯ dy_{n−1},
xj ≤ xj+1 by definition. In detail, assume that continuous f(1,...,M ) (x1 , ..., xM ) exists. Then
for x1 ≤ x2 ≤ · · · ≤ xM :
F_{(1,...,M)}(x_1, ..., x_M) = (R) ∫_{x_{M−1}}^{x_M} ⋯ ∫_{x_1}^{x_2} ∫_{−∞}^{x_1} f_{(1,...,M)}(y_1, y_2, ..., y_M) dy_1 dy_2 ⋯ dy_M.   (3.15)
This formula significantly complicates the relationship between f(1,...,M ) (x1 , ..., xM ) and
derivatives of F(1,...,M ) (x1 , ..., xM ) compared with the result in (3.14). Fortunately for the
current application, this approach need not be followed.
An application of (3.16) will be given below for the Rényi representation theorem.
Proposition 3.20 (f(1,...,M ) (x1 , ..., xM ), continuous f (x)) Given independent, identi-
cally distributed random variables (X1 , ..., XM ) defined on (S, E, λ) with continuously dif-
ferentiable distribution function F (x) and associated continuous density function f (x),
the joint density function f(1,...,M ) of all order statistics is continuous and given for
x_1 ≤ x_2 ≤ ⋯ ≤ x_M by:

f_{(1,...,M)}(x_1, ..., x_M) = M! f(x_1) f(x_2) ⋯ f(x_M).   (3.16)
Proof. It is apparent from (3.16) that f(1,...,M ) (x1 , ..., xM ) is continuous, so we need only
show that (3.15) is satisfied.
From (3.12) it must be proved that for x_1 ≤ x_2 ≤ ⋯ ≤ x_M:

M! F(x_1) ∏_{j=2}^{M} [F(x_j) − F(x_{j−1})] = (R) ∫_{x_{M−1}}^{x_M} ⋯ ∫_{x_1}^{x_2} ∫_{−∞}^{x_1} M! f(y_1) f(y_2) ⋯ f(y_M) dy_1 dy_2 ⋯ dy_M.   (1)

Now for j ≥ 2:

∫_{x_{j−1}}^{x_j} f(y_j) dy_j = F(x_j) − F(x_{j−1}),

while for j = 1:

∫_{−∞}^{x_1} f(y_1) dy_1 = F(x_1),
and the result follows.
Exercise 3.21 (∫ f_{(1,...,M)}(x_1, ..., x_M) dx = 1) Prove that f_{(1,...,M)}(x_1, ..., x_M) in (3.16) is indeed a density function and integrates to 1. Hint: Decompose the integral using the D_π sets and generalize Example 3.15.
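For the uniform distribution, the integral in this exercise can also be checked by simulation: f_{(1,...,M)} equals M! on the ordered region of the unit cube, so the integral is M! times the probability that an iid uniform sample is already sorted. A Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import math
import random

random.seed(1)

def integral_estimate(M, n=200_000):
    """Estimate the integral of f_(1,...,M)(y) = M! * 1{y1 <= ... <= yM}
    over the unit cube, i.e. M! times Pr(an iid uniform sample is sorted)."""
    hits = 0
    for _ in range(n):
        y = [random.random() for _ in range(M)]
        if all(y[i] <= y[i + 1] for i in range(M - 1)):
            hits += 1
    return math.factorial(M) * hits / n

est = integral_estimate(3)
assert abs(est - 1.0) < 0.05   # the joint order-statistic density integrates to 1
```

Since a sample of M distinct values is sorted with probability 1/M!, the estimate converges to 1, as the exercise requires.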
Thus:

F_I(x_{i_1}, x_{i_2}, ..., x_{i_m}) = (R) ∫_{−∞}^{x_I} f_I(y_I) dy_I,
Multivariate Order Functions 71
For the current application to F(1,...,M ) (x1 , ..., xM ), we must again take more care with
this integration since the variables are ordered. For example with xJ = x1 , it would make
no sense to let x1 → ∞ in the distribution function since x1 ≤ x2 ≤ · · · ≤ xM and so
x1 → ∞ here really means x1 → x2 . Thus integrating the x1 variate of the density function
over (−∞, ∞) must be interpreted as the integral over (−∞, x2 ].
In general, for the definition of a marginal distribution function we must interpret “∞” as the upper boundary point of the domain of the variable, and this differs depending on which indexes are in the x_J vector. Similar comments apply to the lower limit of integration: −∞ must also be interpreted in terms of the lower boundary point of the domain of the given variable.
The resulting calculations can become tedious, remembering there are 2^M − 2 possibilities, so we provide the results for marginal densities when:
1. I = (1, ..., j) with 1 ≤ j < M, deriving the marginal density functions f_{(1,...,j)}(x_1, ..., x_j);

2. I = (i, ..., j) with 1 ≤ i < j ≤ M, deriving the marginal density functions f_{(i,...,j)}(x_i, ..., x_j);

3. I = (i, j) with 1 ≤ i < j ≤ M, deriving the marginal density functions f_{(i,j)}(x_i, x_j);

4. I = (i) with 1 ≤ i ≤ M, deriving the marginal density functions f_{(i)}(x_i).
f_{(1,...,j)}(x_1, ..., x_j) = [M!/(M − j)!] f(x_1) f(x_2) ⋯ f(x_j) [1 − F(x_j)]^{M−j}.   (3.18)
Proof. We calculate the marginal density f_{(1,...,j)}(x_1, ..., x_j) from f_{(1,...,M)}(x_1, ..., x_M), first dividing by M! f(x_1) f(x_2) ⋯ f(x_j) and suppressing the (R) to simplify notation:

f_{(1,...,j)}(x_1, ..., x_j) / [M! f(x_1) f(x_2) ⋯ f(x_j)]
  = ∫_{x_j}^{∞} ∫_{y_{j+1}}^{∞} ⋯ ∫_{y_{M−2}}^{∞} [∫_{y_{M−1}}^{∞} f(y_M) dy_M] f(y_{M−1}) dy_{M−1} ⋯ f(y_{j+1}) dy_{j+1}
  = ∫_{x_j}^{∞} ∫_{y_{j+1}}^{∞} ⋯ ∫_{y_{M−2}}^{∞} [1 − F(y_{M−1})] f(y_{M−1}) dy_{M−1} ⋯ f(y_{j+1}) dy_{j+1}
  ⋮
  = [1 − F(x_j)]^{M−j} / (M − j)!,
Proposition 3.23 (f_{(i,...,j)}(x_i, ..., x_j), 1 ≤ i < j ≤ M, continuous f(x)) Given independent, identically distributed random variables (X_1, ..., X_M) defined on (S, E, λ) with continuously differentiable distribution function F(x) and associated continuous density function f(x), and I = (i, ..., j) with 1 ≤ i < j ≤ M, the marginal density function f_{(i,...,j)}(x_i, ..., x_j) is continuous and given on x_i ≤ ⋯ ≤ x_j by:

f_{(i,...,j)}(x_i, ..., x_j) = [M!/((M − j)!(i − 1)!)] f(x_i) f(x_{i+1}) ⋯ f(x_j) [1 − F(x_j)]^{M−j} F^{i−1}(x_i).   (3.19)
Proof. The density f_{(i,...,j)}(x_i, ..., x_j) is derived from f_{(1,...,j)}(x_1, ..., x_j) in (3.18), first dividing by [M!/(M − j)!] f(x_i) f(x_{i+1}) ⋯ f(x_j) [1 − F(x_j)]^{M−j} to simplify notation:
f_{(i,j)}(x_i, x_j) = [M!/((M − j)!(j − i − 1)!(i − 1)!)] f(x_i) f(x_j)
                      × [1 − F(x_j)]^{M−j} [F(x_j) − F(x_i)]^{j−i−1} F^{i−1}(x_i).   (3.20)
Proof. The density f_{(i,j)}(x_i, x_j) is derived from f_{(i,...,j)}(x_i, ..., x_j) in (3.19) by integrating the variates x_k with i < k < j from x_{k−1} to x_j. Dividing by [M!/((M − j)!(i − 1)!)] f(x_i) f(x_j) [1 − F(x_j)]^{M−j} F^{i−1}(x_i) obtains:

  ⋮
  = [F(x_j) − F(x_i)]^{j−i−1} / (j − i − 1)!,
Recall Remark II.3.37, that marginal distributions have no “memory” of the variates
xJ → ∞, and thus the marginal FI (xi1 , xi2 , ..., xim ) is the joint distribution function for the
random vector (Xi1 , Xi2 , ..., Xim ). The same is true for densities of course, and the following
result is no surprise. That is, the marginal density f(i) (xi ) agrees with the density function
for the ith order statistic in (3.16).
f_{(i)}(x_i) = [M!/((M − i)!(i − 1)!)] f(x_i) F^{i−1}(x_i) [1 − F(x_i)]^{M−i}.   (3.21)
Proof. From (3.20) with 1 ≤ i < M and j = i + 1, f_{(i,i+1)}(x_i, x_{i+1}) is defined on x_i ≤ x_{i+1} by:

f_{(i,i+1)}(x_i, x_{i+1}) = [M!/((M − i − 1)!(i − 1)!)] f(x_i) f(x_{i+1}) [1 − F(x_{i+1})]^{M−i−1} F^{i−1}(x_i).
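As a sanity check on (3.21) in a concrete case: for the uniform distribution on (0, 1), (3.21) reduces to the Beta(i, M − i + 1) density, so X_(i) has mean i/(M + 1). A simulation sketch (the parameter choices are arbitrary):

```python
import random

random.seed(2)

M, i, n = 5, 2, 100_000
total = 0.0
for _ in range(n):
    sample = sorted(random.random() for _ in range(M))
    total += sample[i - 1]          # the i-th order statistic X_(i)
mean = total / n

# For uniform F, (3.21) is the Beta(i, M - i + 1) density, whose mean is i/(M + 1).
assert abs(mean - i / (M + 1)) < 0.01
```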
For the next result we recall a commonly encountered and related notion from elementary probability theory. Given a bivariate distribution function F(x, y), we seek to define a conditional distribution function where the y-conditional set B is replaced by a single point B ≡ y_0. In many applications of interest it will be the case that λ[Y^{−1}(y_0)] = 0, for example when the marginal distribution function F_Y(y) is continuous, and thus the above definition is not applicable. However, the intuition is compelling: given the distribution function F(x, y) and Y = y_0, there must be a distribution function of x:

F(x|y_0) ≡ F(x, y|y = y_0),

such that this distribution is parametrized by y_0.
We will return to a very general model for this notion and related ideas in Book VI in the
study of conditional probability measures and conditional expectations, but for the
current application it is enough to recall Example II.3.42. There was developed an approach
to defining the conditional distribution function F (x|y0 ) from the bivariate continuous joint
distribution function F (x, y).
Dropping the subscript on y, this derivation defined F(x|y) as the limit:

F(x|y) ≡ lim_{Δy→0} F(x, y|Y ∈ [y, y + Δy]).

Assuming that ∂F(y)/∂y ≡ f(y) ≠ 0, where F(y), f(y) are the associated marginal distribution and density functions, the result derived there was:

F(x|y) = [∂F(x, y)/∂y] / [∂F(y)/∂y],   f(x|y) = f(x, y)/f(y).   (3.23)
The goal of this section is to apply the result from this example to derive the conditional
density function f(i+1|i) (xi+1 |xi ) from the conditional distribution function F(i+1|i) (xi+1 |xi )
of X(i+1) given X(i) . The same analysis can be applied to F(j|i) (xj |xi ) for j > i and is left
as an exercise.
Proposition 3.27 (F(i+1|i) (xi+1 |xi ), f(i+1|i) (xi+1 |xi ), 1 ≤ i < M, continuous f (x))
Given independent, identically distributed random variables (X1 , ..., XM ) defined on (S, E, λ)
with continuously differentiable distribution function F (x) and associated continuous density
function f (x), and i with 1 ≤ i < M, the conditional distribution function F(i+1|i) (xi+1 |xi )
is given on xi+1 ≥ xi by:
F_{(i+1|i)}(x_{i+1}|x_i) = 1 − [(1 − F(x_{i+1}))/(1 − F(x_i))]^{M−i},   (3.24)

and the associated conditional density function f_{(i+1|i)}(x_{i+1}|x_i) is:

f_{(i+1|i)}(x_{i+1}|x_i) = (M − i) f(x_{i+1}) (1 − F(x_{i+1}))^{M−i−1} / (1 − F(x_i))^{M−i}.   (3.25)
Proof. The marginal distribution function F(i,i+1) (xi , xi+1 ) can be expressed as in (3.22)
using the marginal density function f(i,i+1) (xi , xi+1 ) given in (3.20) for xi ≤ xi+1 :
F_{(i,i+1)}(x_i, x_{i+1}) = ∫_{−∞}^{x_i} ∫_{x}^{x_{i+1}} f_{(i,i+1)}(x, y) dy dx.

Using (3.23) and then the fundamental theorem of calculus of Proposition III.1.33:

F_{(i+1|i)}(x_{i+1}|x_i) = [∂F_{(i,i+1)}(x_i, x_{i+1})/∂x_i] / [∂F_{(i,i+1)}(x_i, ∞)/∂x_i]
  = ∫_{x_i}^{x_{i+1}} f_{(i,i+1)}(x_i, y) dy / ∫_{x_i}^{∞} f_{(i,i+1)}(x_i, y) dy
  = ∫_{x_i}^{x_{i+1}} f(y)[1 − F(y)]^{M−i−1} dy / ∫_{x_i}^{∞} f(y)[1 − F(y)]^{M−i−1} dy.
The Rényi Representation Theorem 75
In the last step, the factorial constants and the common factor of f (xi )F i−1 (xi ) cancel from
numerator and denominator. These integrals can be evaluated by substitution to produce
(3.24).
Given x_i ≤ x_{i+1}:

F_{(i+1|i)}(x_{i+1}|x_i) = ∫_{x_i}^{x_{i+1}} f_{(i+1|i)}(y|x_i) dy,

and thus (3.25) follows by differentiating (3.24) with respect to x_{i+1}.
That the distribution function of X(i+1) depends on the value of X(i) here is logically
expected because it must be the case that X(i+1) ≥ X(i) . We will investigate this further in
the study of the Rényi representation theorem on order statistics, but here look at an
example.
Example 3.28 (F (x) exponential) If F (x) is the exponential distribution of (1.54) with
parameter λ, then (3.24) becomes:
F^E_{(i+1|i)}(x_{i+1}|x_i) = 1 − e^{−λ(M−i)(x_{i+1}−x_i)}.
While the distribution function of X(i+1) depends on the value of X(i) , the distribution
function of the difference, X(i+1) − X(i) does not.
This formula states that whatever is the value of X(i) , the value of X(i+1) is given by:
X(i+1) = X(i) + Yi ,
where Yi ≡ X(i+1) − X(i) is exponentially distributed with parameter λ(M − i). And this is
true for all i.
The remarkable insight in the development of the Rényi representation theorem is that {Y_i}_{i=0}^{M−1} so defined are independent exponentials.
Recall the definition of conditional probability of Definition 1.12. When applied to the
event Pr{X ≤ x + y|X > x} for exponential X with F_E(x) = 1 − e^{−λx}:

Pr{X ≤ x + y | X > x} ≡ Pr{x < X ≤ x + y} / Pr{X > x}
  = [F_E(x + y) − F_E(x)] / [1 − F_E(x)]
  = F_E(y).   (1)
In other words, letting x = X(k) , this calculation states that the distribution function of
the excess variate Y ≡ X(k+1) − X(k) is independent of X(k) .
On an intuitive level it is clear that such a statement could not possibly be true for many distribution functions. Indeed, if F_U(x) = x is the uniform distribution function, then a calculation produces:

[F_U(x + y) − F_U(x)] / [1 − F_U(x)] = [min(x + y, 1) − x] / (1 − x),
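Both conditional calculations are easy to verify numerically. A small sketch contrasting the exponential with the uniform (the parameter λ and evaluation points are arbitrary choices):

```python
import math

def cond_excess(F, x, y):
    """Pr(X <= x + y | X > x) = [F(x + y) - F(x)] / [1 - F(x)]."""
    return (F(x + y) - F(x)) / (1 - F(x))

lam = 2.0
F_exp = lambda x: 1 - math.exp(-lam * x)    # exponential distribution
F_uni = lambda x: min(max(x, 0.0), 1.0)     # uniform distribution on (0, 1)

y = 0.25
# Exponential: the conditional excess distribution is F_E(y), the same for every x.
vals = [cond_excess(F_exp, x, y) for x in (0.0, 0.5, 2.0)]
assert all(abs(v - F_exp(y)) < 1e-12 for v in vals)

# Uniform: the same quantity depends on x, so memorylessness fails.
assert abs(cond_excess(F_uni, 0.0, y) - cond_excess(F_uni, 0.5, y)) > 0.05
```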
f(x + y) = f(x) + f(y),

then there is a constant c so that f(x) = cx. Defining f(x) = ln[1 − F(x)], Cauchy’s functional equation is equivalent to:

1 − F(x + y) = [1 − F(x)][1 − F(y)],

and his conclusion is then that F(x) = 1 − e^{cx}. This proves that only the exponential distribution F(x) = F_E(x) satisfies (1).
However, perhaps it is possible that for a given distribution function F(x), x and y are again independent, but with:

[F(x + y) − F(x)] / [1 − F(x)] = F_1(y),   (2)

where F_1(y) is a different distribution function. This would not contradict Cauchy’s result, but would provide another example for which the excess variable y was independent of x.
Exercise 3.29 ((2) ⇒ F_1 = F = F_E) Show that if (2) is satisfied with F(x) a differentiable distribution function, then F(x) is the exponential distribution and thus F_1 = F. Hint: By (2), since F_1(y) is independent of x, the x-derivative of [F(x + y) − F(x)] / [1 − F(x)] is zero, and this obtains that F′(x)/[1 − F(x)] is constant.
FU (x) = FV (x),
for all x. Sometimes for expediency, as in the next statement, one states simply that two
random variables are equal. But this always means “equal in distribution.”
where {Y_j}_{j=1}^{M} are independent exponential variates with respective parameters {λ(M − j + 1)}_{j=1}^{M}.
Proof. This proof requires another integral of the joint density function f_{(1,...,M)}(x_1, ..., x_M) of (3.16). The goal is to show that the joint distribution function G_Y(a_1, a_2, ..., a_M) of (Y_1, ..., Y_M), where Y_k ≡ X_(k) − X_(k−1) with X_(0) ≡ 0, satisfies:

G_Y(a_1, a_2, ..., a_M) = ∏_{k=1}^{M} G_k(a_k),

where G_k(x) is the distribution function of an exponential variate with parameter λ(M − k + 1). This then proves that {X_(k) − X_(k−1)}_{k=1}^{M} are independent random variables with these distributions by Proposition 1.14.
Since x_k ≥ x_{k−1} for all k:

G_Y(a_1, a_2, ..., a_M)
  = Pr[X_(M) − X_(M−1) ≤ a_M, ..., X_(2) − X_(1) ≤ a_2, X_(1) ≤ a_1]
  = M! ∫_{−∞}^{a_1} ⋯ ∫_{x_{M−2}}^{x_{M−2}+a_{M−1}} (∫_{x_{M−1}}^{x_{M−1}+a_M} f(x_M) dx_M) f(x_{M−1}) dx_{M−1} ⋯ f(x_1) dx_1
  = M! ∫_{−∞}^{a_1} ⋯ ∫_{x_{M−2}}^{x_{M−2}+a_{M−1}} [F(x_{M−1} + a_M) − F(x_{M−1})] f(x_{M−1}) dx_{M−1} ⋯ f(x_1) dx_1.
Iterating, with f_k(x) ≡ kλe^{−kλx} denoting the exponential density with parameter kλ, the kth such integration produces:

(1/(k + 1)) [1 − e^{−λk a_{M−k+1}}] f_{k+1}(x_{M−k}).
Thus:

G_Y(a_1, a_2, ..., a_M)
  = (M!/2) [1 − e^{−λa_M}] ∫_{−∞}^{a_1} ⋯ [∫_{x_{M−2}}^{x_{M−2}+a_{M−1}} f_2(x_{M−1}) dx_{M−1}] ⋯ f(x_1) dx_1
  = (M!/3!) [1 − e^{−λa_M}][1 − e^{−2λa_{M−1}}] × ∫_{−∞}^{a_1} ⋯ ∫_{x_{M−3}}^{x_{M−3}+a_{M−2}} f_3(x_{M−2}) dx_{M−2} ⋯ f(x_1) dx_1
  ⋮
  = ∏_{k=2}^{M} [1 − e^{−(M−k+1)λa_k}] ∫_{−∞}^{a_1} f_M(x_1) dx_1
  = ∏_{k=1}^{M} [1 − e^{−(M−k+1)λa_k}].
An important application of this representation theorem is that any of the kth order statistics, or any sequential grouping of kth order statistics, can be generated directly as a sum of independent exponential random variables. This is in contrast to the definitional procedure whereby the entire collection {X_j}_{j=1}^{M} would need to be generated, then reordered to {X_(j)}_{j=1}^{M} to identify each variate or grouping.
To generate all {X_(j)}_{j=1}^{M} requires the independent {Y_k}_{k=1}^{M} defined above, with Y_k exponential with parameter λ(M − k + 1). To then generate a larger ordered collection of M′ > M variates requires {Y′_i}_{i=1}^{M′}, with Y′_i exponential with parameter λ(M′ − i + 1). However, given {Y_k}_{k=1}^{M}, only independent {Y′_i}_{i=1}^{M′−M} need be so generated. For k > 0, Y′_{M′−M+k} is exponential with parameter λ(M′ − [M′ − M + k] + 1) = λ(M − k + 1), so we may take Y′_{M′−M+k} = Y_k.
The following corollary provides a simpler version of this representation, in that now all
independent exponentials are standard exponentials.
Corollary 3.31 (Rényi Representation Theorem) Let {X_k}_{k=1}^{M} denote independent random variables from an exponential distribution with parameter λ, and {X_(k)}_{k=1}^{M} the associated ordered random variables. Then in distribution:

X_(k) =_d ∑_{j=1}^{k} E_j / [λ(M − j + 1)],   (3.27)

where {E_j}_{j=1}^{k} are independent standard exponential variates.
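In sampling terms, (3.27) says that the ordered exponential sample can be built by cumulatively summing scaled standard exponentials, with no sorting step. A simulation sketch comparing this with the sort-based construction (all parameters are illustrative):

```python
import random

random.seed(3)

def order_stats_via_renyi(M, lam):
    """Build (X_(1), ..., X_(M)) as cumulative sums of independent standard
    exponentials E_j scaled by 1/(lam * (M - j + 1)), per (3.27)."""
    x, total = [], 0.0
    for j in range(1, M + 1):
        total += random.expovariate(1.0) / (lam * (M - j + 1))
        x.append(total)
    return x

M, lam, n = 5, 1.0, 50_000
renyi_min = sum(order_stats_via_renyi(M, lam)[0] for _ in range(n)) / n
sort_min = sum(min(random.expovariate(lam) for _ in range(M)) for _ in range(n)) / n

# X_(1) is exponential with parameter M * lam, so both means are near 1/(M * lam).
assert abs(renyi_min - 1 / (M * lam)) < 0.01
assert abs(sort_min - 1 / (M * lam)) < 0.01
```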
The final result reflects definitions from the next chapter on expectations of random
variables, but is included here for completeness. If unfamiliar with these notions, the reader
should read ahead and come back to this result.
Corollary 3.32 (Rényi Representation Theorem) Let {X_k}_{k=1}^{M} denote independent random variables from an exponential distribution with parameter λ, and {X_(k)}_{k=1}^{M} the associated ordered random variables. Then denoting by μ_(k) and σ²_(k) the mean and variance of X_(k):

μ_(k) = (1/λ) ∑_{j=1}^{k} 1/(M − j + 1) = (1/λ) ∑_{j=M−k+1}^{M} 1/j,   (3.28)

σ²_(k) = (1/λ²) ∑_{j=1}^{k} 1/(M − j + 1)² = (1/λ²) ∑_{j=M−k+1}^{M} 1/j².   (3.29)

Further, with M_(k)(t) denoting the moment generating function of X_(k):

M_(k)(t) = ∏_{j=1}^{k} [1 − t/(λ(M − j + 1))]^{−1},   |t| < λ(M − k + 1).   (3.30)
Proof. By the prior proposition X(k) is the sum of k independent exponentials with param-
eters λ (M − j + 1) for j = 1 to k. These results then follow from Section 4.2.4 on moments
of sums of random variables, using (4.72) and (4.73) with α = 1.
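The two sums in (3.28) and (3.29) run over the same terms in opposite orders, which is easy to confirm directly (the values of M, k, and λ below are arbitrary):

```python
def mu_sigma2(M, k, lam):
    """Mean and variance of X_(k) per (3.28)/(3.29), each computed both ways."""
    mu_a = sum(1 / (M - j + 1) for j in range(1, k + 1)) / lam
    mu_b = sum(1 / j for j in range(M - k + 1, M + 1)) / lam
    v_a = sum(1 / (M - j + 1) ** 2 for j in range(1, k + 1)) / lam ** 2
    v_b = sum(1 / j ** 2 for j in range(M - k + 1, M + 1)) / lam ** 2
    return mu_a, mu_b, v_a, v_b

mu_a, mu_b, v_a, v_b = mu_sigma2(M=10, k=4, lam=0.5)
assert abs(mu_a - mu_b) < 1e-12 and abs(v_a - v_b) < 1e-12

# For k = M, mu_(M) is the full harmonic sum divided by lam.
m_full, _, _, _ = mu_sigma2(M=10, k=10, lam=1.0)
assert abs(m_full - sum(1 / j for j in range(1, 11))) < 1e-12
```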
μ_(M) ≈ (1/λ) ln M.

More generally, by comparing the series to the integral of 1/x:

ln N + 1/N < ∑_{j=1}^{N} 1/j < ln N + 1.

Thus:

(1/λ)[ln M + 1/M] < μ_(M) < (1/λ)[ln M + 1],

and for k < M:

(1/λ)[ln(M/(M − k)) − (M − 1)/M] < μ_(k) < (1/λ)[ln(M/(M − k)) + (M − k − 1)/(M − k)].
4
Expectations of Random Variables 1
In this chapter we begin the study of “expectations” of random variables and introduce some
of their important properties. These results can be largely appreciated with the current state
of our theoretical development in the special cases of continuously differentiable and discrete
distribution functions. But as will be summarized below, even this development requires a
leap of faith regarding the fundamental definitions.
This definitional ambiguity will be investigated here, and the framework for a resolution
will be outlined. But these matters can only be finally resolved in Book VI with the more
advanced integration theory of Book V.
when:

∫_{−∞}^{∞} |g(x)| dF < ∞.   (4.2)
1. Existence: Since F(x) is increasing and bounded, Proposition III.4.21 assures that this integral exists for g(x) continuous and bounded. The existence theory also applies to |g(x)|. Of course boundedness of g(x) is a big restriction, but it will be seen that at least in the special cases of integrators addressed in Proposition III.4.28, this integral also exists for certain unbounded integrands.
2. Random Vectors: The above definition is equally applicable when X : S → Rn is
a random vector defined on a probability space (S, E, λ) with joint distribution function
DOI: 10.1201/9781003264583-4 81
F (x), and g : Rn → R a Borel measurable function. This then uses the Riemann-Stieltjes
theory and results from Section III.4.4.
Proof. Once we prove that |ag(x) + bh(x)| is integrable and thus E[ag(x) + bh(x)] exists,
the equality in (4.3) follows from Proposition III.4.24.
By the triangle inequality:

|ag(x) + bh(x)| ≤ |a| |g(x)| + |b| |h(x)|,

and hence linearity of the integral and integrability of |g(x)| and |h(x)| obtains the result.
For distribution functions of random variables, any such function can be decomposed as in (1.24) of Proposition 1.18:

F(x) = αF_SLT(x) + βF_AC(x) + γF_SN(x).
At least in the case of continuous and bounded g(x), the existence theory of Proposition
III.4.21 applies to the Riemann-Stieltjes integrals defined with respect to each of the three
component functions, and then Proposition III.4.24 assures that for such g(x) :
∫ g(x) dF = α ∫ g(x) dF_SLT + β ∫ g(x) dF_AC + γ ∫ g(x) dF_SN.
When F_AC(x) is continuously differentiable with f_AC(x) ≡ F′_AC(x), it follows from Proposition III.1.32 that, defined as a Riemann integral:

F_AC(x) = (R) ∫_{−∞}^{x} f_AC(y) dy.
Using Lebesgue integration, this last result generalizes to arbitrary absolutely continuous functions by Proposition III.3.62, where now f_AC(x) ≡ F′_AC(x) a.e., recalling that F′_AC(x) exists a.e. by Proposition III.3.59.
For these last two conclusions, each noted proposition stated that for any a, such F_AC(x) could be expressed:

F_AC(x) = F_AC(a) + (R/L) ∫_{a}^{x} f_AC(y) dy,
where fAC (x) is defined as above. Since FAC (a) → 0 as a → −∞ by Proposition 1.3, the
result follows.
For discrete FSLT (x) and continuously differentiable FAC (x) component functions in the
decomposition of F (x), the Riemann-Stieltjes integral can be recast as follows by Proposition
General Definitions 83
III.4.28. For more general absolutely continuous FAC (x), the above Lebesgue integrals do
not fit neatly within the Riemann-Stieltjes framework of this result. But we will see in Book
V that these integrals are intimately related to the Lebesgue-Stieltjes framework studied
there, and that there will be a parallel result for Lebesgue-Stieltjes integrals that looks much
like item 1 of Proposition III.4.28.
when:

∑_n |g(x_n)| f_SLT(x_n) + (R) ∫_{−∞}^{∞} |g(x)| f_AC(x) dx < ∞.   (4.5)
In many applications using distribution functions as defined above, only one of F_SLT(x) or F_AC(x) will be present, and thus E[g(X)] will be defined in terms of only one of the components in (4.4). These are then the standard applications in the discrete and continuous probability theories.
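As a concrete illustration with both components present: let X take the value 1 with probability 1/2, and be standard exponential with probability 1/2; then E[X²] = 0.5 · 1 + 0.5 · 2 = 1.5. In the sketch below, a midpoint-rule sum stands in for the Riemann integral, and the truncation limit is a numerical convenience:

```python
import math

def expectation_mixed(g, atoms, f_ac, a=0.0, b=40.0, n=100_000):
    """E[g(X)] per (4.4): a sum over the atoms of the saltus component plus a
    midpoint-rule approximation of the Riemann integral against f_AC. The
    component weights are absorbed into the atom probabilities and f_ac."""
    discrete = sum(g(x) * p for x, p in atoms)
    h = (b - a) / n
    continuous = sum(
        g(a + (i + 0.5) * h) * f_ac(a + (i + 0.5) * h) for i in range(n)
    ) * h
    return discrete + continuous

atoms = [(1.0, 0.5)]                     # point mass of 1/2 at x = 1
f_ac = lambda x: 0.5 * math.exp(-x)      # 1/2 times the Exp(1) density
val = expectation_mixed(lambda x: x * x, atoms, f_ac)
assert abs(val - 1.5) < 1e-3             # 0.5 * 1 + 0.5 * 2
```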
Then FY (y) is well defined since g −1 (−∞, y] ∈ B(R) assures that X −1 [g −1 (−∞, y]] ∈ E,
and FY (y) is an increasing and right continuous function as are all distribution functions
by Proposition 1.3.
We now have two approaches to the definition of the expectation of Y ≡ g(X) :
1. As a function of X:

E[g(X)] = ∫_{−∞}^{∞} g(x) dF_X.
For the special cases of Definition 4.4, we can derive insights to the affirmative answer
to (4.6).
• F_X(x) = F^X_SLT(x) a saltus function, and g(x) monotonic.

Assume that F_X(x) is defined by {x_n}_{n=−∞}^{∞} in increasing order, with probabilities {f_X(x_n)}_{n=−∞}^{∞}, and define y_n ≡ g(x_n).

If g(x) is increasing, then {y_n}_{n=−∞}^{∞} is an increasing sequence and:
then (4.6) is satisfied with Y ≡ g(X). Hint: From (2.2), F_Y(y) = F_X(g^{−1}(y)). Investigate the Riemann-Stieltjes summations for ∫ y dF_Y, approximating ΔF_Y with ΔF_X, and noting that if {y_j} is a partition for the dF_Y-integral, then by continuity and monotonicity of g^{−1}(y) it follows that {g^{−1}(y_j)} is a partition for the dF_X-integral.
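For a finite discrete X and increasing g, the two computations of the expectation can be placed side by side; the equality is transparent because Y ≡ g(X) carries the same probabilities. The support and g below are arbitrary choices:

```python
# Discrete X on {-1, 0, 2} with increasing g(x) = x^3: computing E[g(X)]
# against the distribution of X, or E[Y] against the induced distribution
# of Y = g(X), sums exactly the same terms.
xs = [-1.0, 0.0, 2.0]
probs = [0.2, 0.5, 0.3]
g = lambda x: x ** 3

e_from_x = sum(g(x) * p for x, p in zip(xs, probs))

ys = [g(x) for x in xs]               # increasing, since g is increasing
e_from_y = sum(y * p for y, p in zip(ys, probs))

assert abs(e_from_x - e_from_y) < 1e-12
```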
This definition requires the development of an integration theory on the measure space
(S, E, λ) which will be addressed in Book V. But for now, we mention that such inte-
grals possess many of the familiar properties seen in the development of the Riemann
(Proposition III.1.72), Lebesgue (Proposition III.2.49), and Riemann-Stieltjes integrals
(Proposition III.4.24).
Summary 4.6 (On ∫_S X(s) dλ(s)) Three important and likely familiar properties of integrals from Book III to be seen in Book V are summarized below. But before continuing, the reader may want to look back to Definitions III.2.2 and III.2.4 to recall how
uing, the reader may want to look back to Definitions III.2.2 and III.2.4 to recall how
we initiated the development of a Lebesgue integral, and then to Remark III.2.6 which
attempted to set the stage for future generalizations, including the current one.
(b) Linearity: If X and Y are integrable, which means that (4.8) is satisfied for each, then so too is aX + bY for all a, b ∈ ℝ, and:

∫_S [aX(s) + bY(s)] dλ(s) = a ∫_S X(s) dλ(s) + b ∫_S Y(s) dλ(s).   (4.10)
Exercise 4.7 (X ≥ 0 and E[X] = 0 imply X = 0, λ-a.e.) Prove that (4.9) and (4.11)
imply that if X ≥ 0 and E[X] = 0, then X = 0, λ-a.e., meaning outside a set of λ-measure
0. Hint: Let An = {X > 1/n}, and prove that λ(An ) = 0. Now apply continuity from above
of Proposition I.2.45 to A = {X > 0}.
Exercise 4.8 (X = 0, λ-a.e. implies E[X] = 0) Prove that (4.11) and (4.10) imply that
if X = 0, λ-a.e., then E[X] = 0. Hint: Both 0 ≤ X, λ-a.e. and 0 ≤ −X, λ-a.e. are true.
Exercise 4.9 (Triangle inequality: |E[X]| ≤ E[|X|]) Prove the triangle inequality, that:

|E[X]| ≤ E[|X|].
Once E[X] is defined as in (4.7), the ambiguities of the previous section disappear.
Defining a new random variable Y ≡ g(X) with Borel measurable g(x), then:

E[Y] ≡ ∫_S Y(s) dλ(s) ≡ ∫_S g(X(s)) dλ(s) ≡ E[g(X)].
Since the general definition does not even mention the distribution function of the
variable we are integrating, it matters not whether we consider Y as the random variable,
or X as the random variable which is then composed with Borel measurable g(x). As
measurable functions on S, Y (s) ≡ g(X(s)) for all s by definition.
2. Change of Variables I - Transformation from S to R
While this approach circumvents the apparent definitional problem, it raises the question
of how in any given application one actually evaluates such an integral on S. If F (x) is the
distribution function of X, and λFX the associated Borel measure on R as summarized
in Proposition 1.5, it will be proved that for any Borel measurable function g :
∫_S g(X(s)) dλ(s) = ∫_ℝ g(x) dλ_{F_X}.   (4.13)
The integral on the right is a Lebesgue-Stieltjes integral, and was briefly intro-
duced in Section III.4.1.2. It is named for Henri Lebesgue (1875–1941) and Thomas
Stieltjes (1856–1894).
The original Stieltjes integral, which modified the Riemann approach and is now known as the Riemann-Stieltjes integral, was the subject of most of Chapter III.4. As the Lebesgue integral was introduced in 1904, it is apparent from the above timelines that Stieltjes did not develop the modification that is now known as the Lebesgue-Stieltjes integral. However, this integral adapts the original Lebesgue idea in much the same way as the Stieltjes integral originally modified the Riemann idea, and thus this name was born.
and thus as integrals on ℝ, this implies a change of variables result that for y = g(x):

∫_{−∞}^{∞} g(x) dλ_{F_X} = ∫_{−∞}^{∞} y dλ_{F_Y}.
When g(x) is continuous it will turn out that Lebesgue-Stieltjes and Riemann-Stieltjes integrals agree, and thus for example:

∫_{−∞}^{∞} g(x) dλ_F = ∫_{−∞}^{∞} g(x) dF.
(a) In the special case where F(x) = F_AC(x) is absolutely continuous, then with f_AC(x) ≡ F′_AC(x) defined almost everywhere and Lebesgue measurable (Proposition III.3.59), and Lebesgue measurable g(x):

∫_{−∞}^{∞} g(x) dλ_F = (L) ∫_{−∞}^{∞} g(x) f_AC(x) dx,   (4.14)
(b) In the special case where F(x) = F_SLT(x) is discrete with discontinuity set {x_n} having no accumulation points, then with f_SLT(x_n) ≡ F(x_n) − F(x_n^−) and continuous g(x):

∫_{−∞}^{∞} g(x) dλ_F = ∫_{−∞}^{∞} g(x) dF = ∑_n g(x_n) f_SLT(x_n),   (4.16)

as in Proposition III.4.28.
Remark 4.10 (Random vectors) The above program of study will also apply when X :
S → Rn is a random vector defined on a probability space (S, E, λ), F (x) is the associated
joint distribution function, and g : Rn → R a Borel measurable function. We will review
the multivariate version of steps 2 and 3 in Section 4.2.3 on moments of sums of random
variables.
2. Central Moments

The central moments are defined with g(x) = (x − μ)ⁿ, where μ denotes the mean of the distribution.
3. Absolute Moments

There are both absolute moments and absolute central moments, defined respectively in terms of g(x) = |x|ⁿ and g(x) = |x − μ|ⁿ. Of course, the absolute value is redundant when n is an even integer. By definition, these moments exist whenever the associated moments and central moments exist, due to the constraint in (4.2).
There is no standard notation for these moments, but μ′_{|n|} and μ_{|n|} seem self-explanatory and will be used in this text.
the integral in (4.26) is related to the bilateral or two-sided Laplace transform of the
Borel measure λF , and named for Pierre-Simon Laplace (1749–1827). However, it is
then conventional to use the exponential e−tx in this definition, and to define this function
on complex t = a + ib.
Thus the moment generating function is the two-sided Laplace transform of the Borel
measure λF restricted to the real numbers, and with reversed orientation.
In Book VI, we will introduce the characteristic function of a distribution, and this
will be explicitly defined in terms of the Fourier transform of this measure, again restricted
to the real line. This transform is named for Jean-Baptiste Joseph Fourier (1768–1830)
and will be studied in Book V.
Exercise 4.16 (Nonexistence of M_X(t), t ≠ 0) Using (4.27) and the continuous density function for the lognormal distribution in (1.67) defined on [0, ∞), investigate the conclusion that M_LN(t) exists only for t = 0. The solution to this will be found in Example 4.32.
Based on the above splitting of the integral defining MX (t) into negative and positive
domains of integration, we have the simple result:
Proposition 4.17 (Simple existence criterion) If the first integral in (4.29) exists for some t′_0 < 0, and the second exists for some t″_0 > 0, then M_X(t) is well defined on (−t_0, t_0) for t_0 = min[−t′_0, t″_0].

Proof. If ∫_{−∞}^{0} e^{t′_0 x} dF(x) < ∞, then e^{tx} ≤ e^{t′_0 x} for x ≤ 0 obtains that ∫_{−∞}^{0} e^{tx} dF(x) < ∞ for t′_0 < t ≤ 0 by Proposition III.4.24. As the second integral automatically exists for all t ≤ 0 as noted above, M_X(t) exists on (t′_0, 0].

Similarly, if the second integral exists for some t″_0 > 0, then M_X(t) exists on [0, t″_0), and the result follows.
Finally, we note a simple but useful result which is an application of (4.3).
Proposition 4.18 (MaX+b (t)) Assume that MX (t) exists for t ∈ (−t0 , t0 ) with t0 > 0,
and define Y ≡ aX + b for a, b ∈ R. Then MY (t) exists for t ∈ (−t0 / |a| , t0 / |a|), and:
M_Y(t) = e^{bt} M_X(at).   (4.30)
Proof. By definition and (4.3):

M_Y(t) = E[e^{(aX+b)t}] = e^{bt} E[e^{atX}],
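As a numerical check of (4.30), take X exponential with parameter λ, for which M_X(t) = λ/(λ − t) for t < λ. The integrator below approximates E[e^{t(aX+b)}] directly by a midpoint rule; all parameter values and the truncation limit are arbitrary choices:

```python
import math

def mgf_exp(t, lam):
    """Closed form M_X(t) = lam/(lam - t) for exponential X, t < lam."""
    return lam / (lam - t)

def mgf_affine_numeric(t, lam, a, b, n=200_000, hi=80.0):
    """Midpoint-rule approximation of E[exp(t(aX + b))] for X ~ Exp(lam)."""
    h = hi / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += math.exp(t * (a * x + b)) * lam * math.exp(-lam * x) * h
    return total

lam, a, b, t = 1.0, 2.0, 0.5, 0.3
lhs = mgf_affine_numeric(t, lam, a, b)
rhs = math.exp(b * t) * mgf_exp(a * t, lam)   # (4.30): M_Y(t) = e^{bt} M_X(at)
assert abs(lhs - rhs) < 1e-3
```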
the existence of E[g_i(X_i)] for all i assures the existence of E[X] by the triangle inequality:

|X| ≤ ∑_{i=1}^{n} |g_i(X_i)|.

As a special case, it follows that if E[X_i] exists for all i, then E[∑_{i=1}^{n} X_i] exists with:

E[∑_{i=1}^{n} X_i] = ∑_{i=1}^{n} E[X_i].   (4.32)
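Note that (4.32) requires no independence, which a simulation with strongly dependent summands illustrates (the choice X_2 = X_1² is an arbitrary example):

```python
import random

random.seed(4)

# Linearity needs no independence: take X2 = X1^2 with X1 uniform on (0, 1).
n = 200_000
s_sum = s1 = s2 = 0.0
for _ in range(n):
    x1 = random.random()
    x2 = x1 * x1                      # perfectly dependent on x1
    s_sum += x1 + x2
    s1 += x1
    s2 += x2

assert abs(s_sum / n - (s1 / n + s2 / n)) < 1e-9   # E[X1 + X2] = E[X1] + E[X2]
assert abs(s_sum / n - (0.5 + 1 / 3)) < 0.01       # = 1/2 + 1/3 here
```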
where y = (y_1, ..., y_n) and dy denotes the Lebesgue product measure on ℝⁿ. It will then follow that:

∫_{ℝⁿ} g(x_1, x_2, ..., x_n) dλ_F = (L) ∫_{ℝⁿ} g(x_1, x_2, ..., x_n) f(x_1, x_2, ..., x_n) dx.   (4.35)
This important result from Book V is Fubini’s theorem, named for Guido Fubini
(1879–1943). It states that given the constraint in (4.5), this integral over Rn can be
evaluated in an iterated fashion, one variable at a time, and in any order.
When f and g are continuous, the integral in (4.36) is definable as a Riemann integral as
in item 3, and this transformation to iterated integrals was proved as Corollary III.1.77.
In the case of independent random variables, Proposition 1.14 states that the dis-
tribution function of independent {Xi }ni=1 is given by:
$$F(x_1, x_2, ..., x_n) = \prod_{j=1}^{n} F_j(x_j).$$
This follows for continuous densities by Corollary III.1.77, and in the general case by
Tonelli’s theorem in Book V. See also Section 5.1.3 for more detail.
Thus (4.36) can be expressed:
$$\int_{\mathbb{R}^n} g(x_1, x_2, ..., x_n)\,d\lambda_F = (\mathcal{L})\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(x_1, x_2, ..., x_n)\prod_{j=1}^{n} f_j(x_j)\,dx_1 dx_2...dx_n. \qquad (4.37)$$
When $g(x_1, x_2, ..., x_n)$ and all $f_j(x_j)$ are continuous, the integral in (4.37) can be interpreted as a Riemann integral, as noted in item 3.
94 Expectations of Random Variables 1
Example 4.20 ($X = \sum_{i=1}^{n} g_i(X_i)$; $X = \prod_{i=1}^{n} g_i(X_i)$ for independent $X_i$) The above development is somewhat abstract but was needed to connect familiar mathematical manipulations to a rigorous framework.
As an application, it is worthwhile to revisit the integrals of sums from Example 4.19
in the context of (4.36), and then consider product functions of independent random
variables.
(a) $X = \sum_{i=1}^{n} g_i(X_i)$, general $\{X_i\}_{i=1}^{n}$
When $\{X_i\}_{i=1}^{n}$ are independent, then (4.31) follows from (4.37) and linearity of the Lebesgue integral by Proposition III.2.49. In detail:
$$\int_{\mathbb{R}^n} X\,d\lambda_F = (\mathcal{L})\sum_{i=1}^{n}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g_i(x_i)\prod_{j=1}^{n} f_j(x_j)\,dx_1 dx_2...dx_n = (\mathcal{L})\sum_{i=1}^{n}\int_{-\infty}^{\infty} g_i(x_i) f_i(x_i)\,dx_i,$$
since:
$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\prod_{j\neq i} f_j(x_j)\,dx_j = 1.$$
But independence is not needed for this result. In the general case:
$$\int_{\mathbb{R}^n} X\,d\lambda_F = (\mathcal{L})\sum_{i=1}^{n}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g_i(x_i) f(x_1, x_2, ..., x_n)\,dx_1 dx_2...dx_n = (\mathcal{L})\sum_{i=1}^{n}\int_{-\infty}^{\infty} g_i(x_i) f_i(x_i)\,dx_i,$$
where $f_i(x_i)$ denotes the marginal density function of $X_i$.
(b) $X = \prod_{i=1}^{n} g_i(X_i)$, independent $\{X_i\}_{i=1}^{n}$
In this case:
$$\int_{\mathbb{R}^n} X\,d\lambda_F = (\mathcal{L})\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\prod_{i=1}^{n} f_i(x_i) g_i(x_i)\,dx_1 dx_2...dx_n = (\mathcal{L})\prod_{i=1}^{n}\int_{-\infty}^{\infty} g_i(x_i) f_i(x_i)\,dx_i.$$
1. Mean:
Let $X = \sum_{i=1}^{n} X_i$ where $\{X_i\}_{i=1}^{n}$ are random variables defined on $(S, \mathcal{E}, \lambda)$. Using (4.32) obtains that:
$$E^{(n)}[X] = \sum_{i=1}^{n} E^{(n)}[X_i].$$
In the case of independent variates, (4.37) and $\int_{\mathbb{R}} f(x_j)\,dx_j = 1$ for all $j$ obtains:
$$E^{(n)}[X_i] = \int_{-\infty}^{\infty} x_i f(x_i)\,dx_i\prod_{j\neq i}\int_{-\infty}^{\infty} f(x_j)\,dx_j = \mu_i,$$
2. mth Moment:
Using the multinomial theorem, which is to be proved in Exercise 4.22:
$$\left(\sum_{i=1}^{n} X_i\right)^m = \sum_{m_1, m_2, .., m_n}\frac{m!}{m_1! m_2!...m_n!} X_1^{m_1} X_2^{m_2}...X_n^{m_n}, \qquad (4.40)$$
where this summation is over all distinct $n$-tuples $(m_1, m_2, .., m_n)$ with $m_j \geq 0$ and $\sum_{j=1}^{n} m_j = m$.
Using the same approach as for the mean, it follows that for independent variates:
$$E^{(n)}\left[X_1^{m_1} X_2^{m_2}...X_n^{m_n}\right] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_1^{m_1} x_2^{m_2}...x_n^{m_n} f(x_1, x_2, ..., x_n)\,dx_1 dx_2...dx_n$$
Exercise 4.22 (Multinomial theorem) Prove the formula in (4.40) using induction
on m.
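Though the proof is by induction, (4.40) is easy to check numerically for small $n$ and $m$. A minimal Python sketch (the function name `multinomial_expansion` is ours, purely illustrative):

```python
import math
from itertools import product as iproduct

def multinomial_expansion(xs, m):
    """Right side of (4.40): sum of m!/(m1!...mn!) * x1^m1 * ... * xn^mn
    over all n-tuples (m1,...,mn) with mj >= 0 and sum mj = m."""
    total = 0.0
    for ms in iproduct(range(m + 1), repeat=len(xs)):
        if sum(ms) != m:
            continue
        coeff = math.factorial(m)
        for mj in ms:
            coeff //= math.factorial(mj)   # multinomial coefficient, exact integer
        term = float(coeff)
        for xj, mj in zip(xs, ms):
            term *= xj ** mj
        total += term
    return total

xs, m = [1.5, -0.5, 2.0], 4
print(sum(xs) ** m, multinomial_expansion(xs, m))  # both ≈ 81.0
```

Enumerating all tuples and filtering on the sum is wasteful but transparent; it mirrors the index set of the summation exactly.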
where this summation is over all distinct $n$-tuples $(m_1, m_2, .., m_n)$ with $m_j \geq 0$ and $\sum_{j=1}^{n} m_j = m$. Here $\mu_{m_i}^{(i)}$ is the $m_i$th central moment of $X_i$, and so $\mu_0^{(i)} = 1$ and $\mu_1^{(i)} = 0$.
4. Variance for Independent Variates:
With $X = \sum_{i=1}^{n} X_i$, it follows from (4.39) that $X - E^{(n)}[X] = \sum_{i=1}^{n}(X_i - \mu_i)$, and so:
$$\left(X - E^{(n)}[X]\right)^2 = \sum_{i=1}^{n}(X_i - \mu_i)^2 + 2\sum_{j<i}(X_j - \mu_j)(X_i - \mu_i). \qquad (1)$$
The second summation in (1) is sometimes expressed as $\sum_{j\neq i}$ without the coefficient 2.
With (4.37) applied to independent random variables:
$$E^{(n)}\left[(X_i - \mu_i)^2\right] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}(x_i - \mu_i)^2 f(x_1, x_2, ..., x_n)\,dx_1 dx_2...dx_n = \int_{-\infty}^{\infty}(x_i - \mu_i)^2 f(x_i)\,dx_i = \sigma_i^2,$$
The terms of this summation again involve a marginal density function of (3.17):
$$f(x_i, x_j) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1, x_2, ..., x_n)\prod_{k\neq i,j} dx_k,$$
and so:
$$E^{(n)}\left[(X_j - \mu_j)(X_i - \mu_i)\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x_j - \mu_j)(x_i - \mu_i) f(x_i, x_j)\,dx_i dx_j \equiv E^{(2)}\left[(X_j - \mu_j)(X_i - \mu_i)\right].$$
The correlation between X and Y, denoted corr(X, Y ) and often ρ(X, Y ) or ρXY ,
is defined:
$$\mathrm{corr}(X, Y) \equiv \frac{\mathrm{cov}(X, Y)}{\sigma_X\sigma_Y}, \qquad (4.45)$$
where $\sigma_X, \sigma_Y$ are the standard deviations of these variates.
Hence, the general formula for the variance of a summation can be expressed:
$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\sigma_i^2 + 2\sum_{j<i}\mathrm{cov}(X_i, X_j), \qquad (4.46)$$
or equivalently:
$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\sigma_i^2 + 2\sum_{j<i}\rho_{ij}\sigma_i\sigma_j. \qquad (4.47)$$
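As an illustrative Monte Carlo check of (4.46), the following Python sketch simulates a correlated pair and compares the sample variance of the sum against the theoretical decomposition; the simulation setup (uniform variates and the specific dependence) is our own choice, not from the text:

```python
import random

random.seed(1)
N = 200_000

# X1 ~ U(0,1); X2 = X1 + U(0,1), so X1 and X2 are positively correlated.
x1 = [random.uniform(0, 1) for _ in range(N)]
x2 = [a + random.uniform(0, 1) for a in x1]

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

# (4.46) with theoretical inputs: Var[X1] = 1/12, Var[X2] = 2/12,
# cov(X1, X2) = 1/12, so Var[X1 + X2] = 1/12 + 2/12 + 2*(1/12) = 5/12.
sample_var = var([a + b for a, b in zip(x1, x2)])
print(sample_var)  # ≈ 5/12 ≈ 0.4167
```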
and so:
$$M_X(t) = \prod_{i=1}^{n} M_{X_i}(t). \qquad (4.48)$$
Thus if $M_{X_i}(t)$ exists for $t \in (-t_0^{(i)}, t_0^{(i)})$, then $M_X(t)$ exists on $(-t_0, t_0) \equiv \bigcap_{i=1}^{n}(-t_0^{(i)}, t_0^{(i)})$, and on this interval (4.48) is satisfied.
This result is a special case of (4.38) since $\exp\left(\sum_{i=1}^{n} tX_i\right) = \prod_{i=1}^{n}\exp(tX_i)$.
Thus for $n = 2$:
$$\sigma^2 = \mu_2' - \mu^2. \qquad (4.50)$$
Proof. Left as an exercise. Hint: Recall the binomial theorem in (1.42).
The next result states that in order for $M_X(t)$ to exist for some interval $(-t_0, t_0)$ with $t_0 > 0$, it is necessary that $\mu_n'$ and hence $\mu_n$ exist for all $n$. In other words, the existence of the moment generating function is at least as restrictive on a distribution function as is the existence of all finite moments.
As will be seen in Examples 4.32 and 4.58, the existence of $M_X(t)$ is even more restrictive than the existence of all moments. Specifically, there are infinitely many distributions for which $\mu_n'$ exists for all $n$, but for which $M_X(t)$ exists only for $t = 0$.
Proposition 4.25 ($M_X(t)$ implies $\mu_n'$, $\mu_n$ all $n$) If $M_X(t)$ exists for some interval $(-t_0, t_0)$ with $t_0 > 0$, then $\mu_n'$ and thus $\mu_n$ exist for all $n$.
Proof. Choose $t$ with $0 < t < t_0$. Since $e^{t|x|} \le e^{tx} + e^{-tx}$, the existence of $M_X(t)$ and $M_X(-t)$ implies by items 1 and 2 of Proposition III.4.24:
$$\int_{-\infty}^{\infty} e^{t|x|}dF \le M_X(t) + M_X(-t) < \infty. \qquad (1)$$
Given $n$, $e^{t|x|} \ge |x|^n$ for $|x|/\ln|x| \ge n/t$. Since $|x|/\ln|x|$ is increasing and unbounded, choose $\{x_n\}_{n=1}^{\infty}$ with $x_n > 0$ and $x_n/\ln x_n \ge n/t$. Then by (1):
$$\int_{-\infty}^{\infty}|x|^n dF = \int_{|x|<x_n}|x|^n dF + \int_{|x|\ge x_n}|x|^n dF \le c\,x_n^n + \int_{|x|\ge x_n} e^{t|x|}dF \le c\,x_n^n + M_X(t) + M_X(-t),$$
where:
$$c = \int_{|x|<x_n} dF = F(x_n^-) - F(-x_n) \le 1.$$
Thus $\mu_n'$ exists for every $n$, and the existence of $\mu_n$ for all $n$ is Proposition 4.24.
Moments of Distributions 99
The name “moment generating” function for $M_X(t)$ is justified in the next two propositions. When $M_X(t)$ exists, it not only assures the existence of all moments by the prior result, but this function can also be used to generate these moments.
To make these more general proofs rigorous requires the Lebesgue dominated convergence theorem for Lebesgue-Stieltjes integrals. This result generalizes the result for the Lebesgue integral of Proposition III.2.52, and will be proved in Book V. As noted above, it will also be proved in Book V that Lebesgue-Stieltjes and Riemann-Stieltjes integrals agree for continuous integrands, and thus this result obtains the Lebesgue dominated convergence theorem for Riemann-Stieltjes integrals used below.
To make this result more accessible, we note some important special cases which require
“mostly” prior results.
1. As noted in the proof, if F (x) = 1 for x ≥ b and F (x) = 0 for x ≤ a, then Proposition
III.4.27 obtains the needed convergence result.
2. When the distribution function $F(x)$ is discrete or continuously differentiable, the respective Riemann-Stieltjes integrals reduce to summations or Riemann integrals by Proposition III.4.28.
(a) For discrete distributions, we can appeal to standard results on absolutely convergent series.
(b) In the continuously differentiable case, these Riemann integrals agree with Lebesgue integrals over every bounded interval $[a, b]$ by Proposition III.1.15, and such integrals equal the respective Lebesgue integrals by Proposition III.2.18. Letting $a \to -\infty$ and $b \to \infty$ obtains that the improper Riemann integral exists if the Lebesgue integral exists. Thus the Lebesgue dominated convergence result of Proposition III.2.52 suffices.
3. If $F(x)$ is absolutely continuous and thus has Lebesgue measurable density $f(x)$ by Proposition III.3.62, the Riemann-Stieltjes integrals below equal Lebesgue-Stieltjes integrals (Book V). Then as in (4.14), this integral reduces to a Lebesgue integral, and again the Book III dominated convergence theorem suffices.
Proposition 4.26 ($M_X(t)$ implies $\mu_n' = M_X^{(n)}(0)$) Assume that $M_X(t)$ exists for some interval $(-t_0, t_0)$ with $t_0 > 0$.
Then $M_X(t)$ is infinitely differentiable on this interval, and for all $n$, the $n$th derivative of $M_X(t)$ is given by:
$$M_X^{(n)}(t) = \int_{-\infty}^{\infty} y^n e^{ty}dF. \qquad (4.51)$$
Thus the moments of $X$ are given by:
$$\mu_n' = M_X^{(n)}(0). \qquad (4.52)$$
Proof. By (4.26):
$$M_X(t) \equiv \int_{-\infty}^{\infty} e^{tx}dF(x).$$
This result would be obvious if we could assume a generalization of the Leibniz rule of
Proposition III.1.40 to differentiate under the integral sign, but a proof is required to justify
this.
Proceeding by induction, since $M_X^{(0)}(t) \equiv M_X(t)$ by definition, we prove that for $n \ge 0$:
$$\text{If } M_X^{(n)}(t) = \int_{-\infty}^{\infty} x^n e^{tx}dF, \text{ then } M_X^{(n+1)}(t) = \int_{-\infty}^{\infty} x^{n+1} e^{tx}dF. \qquad (1)$$
Let:
$$f_m^+(x) = x^n e^{tx} m\left[e^{x/m} - 1\right], \qquad f_m^-(x) = x^n e^{tx} m\left[1 - e^{-x/m}\right].$$
The assumed existence of $M_X^{(n)}(t)$ for $t \in (-t_0, t_0)$ assures that for $m$ large:
$$m\left[M_X^{(n)}(t + 1/m) - M_X^{(n)}(t)\right] = \int_{-\infty}^{\infty} f_m^+(x)dF, \qquad m\left[M_X^{(n)}(t) - M_X^{(n)}(t - 1/m)\right] = \int_{-\infty}^{\infty} f_m^-(x)dF. \qquad (2)$$
Both integrals on the right are well-defined and bounded if $t \pm 1/m \in (-t_0, t_0)$.
Now:
$$m\left[1 - e^{-x/m}\right] \le x \le m\left[e^{x/m} - 1\right],$$
Since all bounding functions are integrable for $t \pm 1/m \in (-t_0, t_0)$, item 2 of Proposition III.4.24 assures that:
$$\int_{-\infty}^{\infty} x^{n+1} e^{tx}dF < \infty, \qquad (3)$$
1. If $n$ is even, then $f_m^-(x) \le f_m^+(x)$ since $e^{-x/m} + e^{x/m} \ge 2$ by convexity of the exponential function. Thus $g_m(x) = f_m^-(x)$, and this obtains by (2) and (4) that $M_X^{(n+1)}(t) = \int_{-\infty}^{\infty} x^{n+1} e^{tx}dF$.
2. If $n$ is odd, then $f_m^-(x) \le f_m^+(x)$ for $x \ge 0$ and $f_m^-(x) \ge f_m^+(x)$ for $x \le 0$, but the same conclusion will follow if it can be proved that as $m \to \infty$:
$$\int_{0}^{\infty}\left[f_m^+(x) - f_m^-(x)\right]dF \to 0.$$
Since $f_m^+(x) - f_m^-(x) \to 0$ pointwise, this result will again follow by Lebesgue's dominated convergence theorem if the integrand is appropriately dominated.
A Taylor series analysis shows that if $t + 1/m_0 \in (-t_0, t_0)$ and $m \ge m_0$, then for $x \ge 0$:
$$0 \le f_m^+(x) - f_m^-(x) \le 2x^{n+1}e^{(t+1/m)x} \le 2x^{n+1}e^{(t+1/m_0)x},$$
The next result provides a second insight to the name moment generating function by way of a power series representation using the moments $\{\mu_n'\}_{n=1}^{\infty}$. This proof again requires the Book V generalization of the Lebesgue dominated convergence theorem, and more specifically, the corollary to this result that generalizes Corollary III.2.53. This needed result addresses the question of when the integral of an infinite sum of functions equals the sum of the associated integrals.
Note that the power series representation in (4.53) provides an alternative proof of the
infinite differentiability of MX (t) and derivation of (4.52). See for example Proposition 9.111
in Reitano (2010).
Proposition 4.27 (Power series for $M_X(t)$) If $M_X(t)$ exists for some interval $(-t_0, t_0)$ with $t_0 > 0$, then on this interval:
$$M_X(t) = \sum_{n=0}^{\infty}\frac{\mu_n'}{n!}t^n, \qquad (4.53)$$
where $\{\mu_n'\}_{n=1}^{\infty}$ are the moments of $X$.
Proof. Because $e^{|tx|} \le e^{tx} + e^{-tx}$, the existence of $M_X(t)$ for $|t| < t_0$ assures that $\int e^{|tx|}dF < \infty$.
The exponential Taylor series:
$$e^{tx} \equiv \sum_{n=0}^{\infty}\frac{(tx)^n}{n!}, \qquad (4.54)$$
is absolutely convergent in $x$ for all $t$. Hence the partial sums:
$$\sum_{n=0}^{N}\frac{|tx|^n}{n!} \le e^{|tx|},$$
are bounded by an integrable function, and this implies by the triangle inequality that for all $N$:
$$\left|\sum_{n=0}^{N}\frac{(tx)^n}{n!}\right| \le e^{|tx|}.$$
Hence as $N \to \infty$:
$$\sum_{n=0}^{N}\frac{(tx)^n}{n!} \to e^{tx},$$
and these partial sums are dominated by an integrable function. The above noted corollary to the Lebesgue dominated convergence theorem of Book V obtains for $|t| < t_0$:
$$M_X(t) = \lim_{N\to\infty}\sum_{n=0}^{N}\frac{t^n}{n!}\int x^n dF = \sum_{n=0}^{\infty}\frac{\mu_n'}{n!}t^n.$$
For the purpose of calculating the mean, variance, and third central moment of a distribution, the following corollary is often useful.
Corollary 4.28 (The cumulant generating function: $\ln M_X(t)$) If $M_X(t)$ exists for some interval $(-t_0, t_0)$ with $t_0 > 0$, then defining the cumulant generating function $g_X(t) \equiv \ln M_X(t)$:
$$\mu = g_X'(0), \qquad \sigma^2 = g_X''(0), \qquad \mu_3 = g_X^{(3)}(0). \qquad (4.55)$$
Proof. Left as an exercise.
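For a concrete check of (4.55), the following sketch (our own, not from the text) uses the normal cumulant generating function $g(t) = \mu t + \sigma^2 t^2/2$ implied by (4.78), and recovers $\mu$, $\sigma^2$ and $\mu_3 = 0$ by central finite differences at $t = 0$:

```python
import math

# Normal MGF from (4.78): M(t) = exp(mu*t + sigma^2 * t^2 / 2),
# so g(t) = ln M(t) = mu*t + sigma^2 * t^2 / 2 exactly.
mu, sigma = 1.5, 2.0

def g(t):
    return mu * t + 0.5 * sigma ** 2 * t ** 2

h = 1e-4
g1 = (g(h) - g(-h)) / (2 * h)                              # ≈ g'(0)  = mu
g2 = (g(h) - 2 * g(0.0) + g(-h)) / h ** 2                  # ≈ g''(0) = sigma^2
g3 = (g(2*h) - 2*g(h) + 2*g(-h) - g(-2*h)) / (2 * h**3)    # ≈ g'''(0) = mu_3 = 0

print(g1, g2, g3)  # ≈ 1.5, 4.0, 0.0 per (4.55)
```

The third cumulant of a normal is 0, consistent with all odd central moments vanishing by symmetry.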
Remark 4.29 (On cumulants) If $M_X(t)$ exists for $t \in (-t_0, t_0)$, then $g_X(t) \equiv \ln M_X(t)$ exists as a Taylor series for all such $t$ since $M_X(t) > 0$ by definition.
The cumulants of $X$ are then defined in terms of this Taylor series:
$$\ln M_X(t) = \sum_{n=0}^{\infty}\kappa_n\frac{t^n}{n!},$$
and by (4.30):
2. Binomial Distribution:
Let $f_{B_n}(j)$ denote the probability density function of the sum of $n$ independent standard binomials with $0 < p < 1$ as in (1.40):
$$f_{B_n}(j) = \binom{n}{j}p^j(1-p)^{n-j}, \qquad j = 0, 1, .., n.$$
and it again follows that the resulting series is convergent for $(1-p)e^t < 1$, and:
$$M_{NB}(t) = \left(\frac{p}{1-(1-p)e^t}\right)^k, \qquad t < -\ln(1-p). \qquad (4.66)$$
$$\mu_{NB} = \frac{k(1-p)}{p}, \qquad \sigma_{NB}^2 = \frac{k(1-p)}{p^2}. \qquad (4.67)$$
5. Poisson Distribution:
Recalling (1.47):
$$f_P(j) = \frac{e^{-\lambda}\lambda^j}{j!}, \qquad j = 0, 1, 2, ...,$$
and the moment generating function is calculated to be $M_P(t) = \exp\left(\lambda\left(e^t - 1\right)\right)$ for all $t$, from which:
$$\mu_P = \lambda, \qquad \sigma_P^2 = \lambda. \qquad (4.69)$$
$$\mu_U = \frac{b+a}{2}, \qquad \sigma_U^2 = \frac{(b-a)^2}{12}, \qquad (4.70)$$
and:
$$M_U(t) = \frac{e^{bt} - e^{at}}{t(b-a)}, \qquad t \in \mathbb{R}. \qquad (4.71)$$
Note that MU (t) is well defined at t = 0 despite the apparent singularity, as justified
using the exponential Taylor series in (4.54).
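In numerical work this removable singularity matters: evaluating (4.71) directly near $t = 0$ divides two near-zero quantities. A hedged Python sketch (the cutoff $10^{-6}$ and the truncation of the Taylor series are arbitrary choices of ours):

```python
import math

a, b = -1.0, 3.0

def mgf_uniform(t):
    """(4.71), with the exponential Taylor series (4.54) used near t = 0,
    where the closed form is 0/0 and numerically unstable."""
    if abs(t) < 1e-6:
        # (e^{bt} - e^{at})/(t(b-a)) = 1 + (a+b)t/2 + O(t^2)
        return 1.0 + 0.5 * (a + b) * t
    return (math.exp(b * t) - math.exp(a * t)) / (t * (b - a))

print(mgf_uniform(0.0))   # 1.0, as M(0) = E[e^0] = 1 for any distribution
print(mgf_uniform(1e-9))  # ≈ 1.0 + 1e-9, no 0/0 blow-up
```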
$$f_E(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.$$
This distribution is a special case of the more general gamma density defined with a shape parameter $\alpha$ and scale parameter $\lambda > 0$ in (1.55) by:
$$f_\Gamma(x) = \frac{1}{\Gamma(\alpha)}\lambda^\alpha x^{\alpha-1}e^{-\lambda x}, \qquad x \ge 0,$$
with the gamma function $\Gamma(\alpha)$ defined by:
$$\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha-1}e^{-x}dx.$$
Moments of the gamma are derived by integration by parts and (1.57), that for $\alpha > 1$:
$$\Gamma(\alpha + 1) = \alpha\Gamma(\alpha),$$
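The recursion $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$ can be checked with the standard library's gamma function, and applying it once inside the moment integral gives the gamma mean $\alpha/\lambda$ (this last manipulation is standard, not quoted from the text):

```python
import math

# Gamma(alpha + 1) = alpha * Gamma(alpha) for several alpha:
for alpha in [0.5, 1.0, 2.7, 10.0]:
    lhs = math.gamma(alpha + 1.0)
    rhs = alpha * math.gamma(alpha)
    assert abs(lhs - rhs) <= 1e-12 * abs(rhs)

# First moment of the gamma density: integrating x * f(x) and using the
# identity once gives E[X] = Gamma(alpha + 1) / (Gamma(alpha) * lam) = alpha / lam.
alpha, lam = 2.7, 1.3
mean = math.gamma(alpha + 1.0) / (math.gamma(alpha) * lam)
print(mean)  # ≈ 2.7 / 1.3 ≈ 2.077
```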
3. Beta Distribution
The beta density contains two shape parameters, v > 0, w > 0, and is defined on the
interval [0, 1] by the density function in (1.61):
$$f_\beta(x) = \frac{1}{B(v, w)}x^{v-1}(1-x)^{w-1},$$
where the beta function $B(v, w)$ is defined by a definite integral in (1.62):
$$B(v, w) = \int_{0}^{1} y^{v-1}(1-y)^{w-1}dy.$$
The beta function $B(v, w)$ satisfies an important identity which is useful in evaluating these moments:
$$B(v+1, w) = \frac{v}{v+w}B(v, w), \qquad (4.74)$$
4. Cauchy Distribution
The Cauchy distribution of (1.68) is of interest as an example of a distribution that
has no finite moments. This density function is defined on R as a function of a location
parameter x0 ∈ R and a scale parameter γ > 0 by:
$$f_C(x) = \frac{1}{\pi\gamma}\,\frac{1}{1 + \left(\left[x - x_0\right]/\gamma\right)^2},$$
while the standard Cauchy distribution is parameterized with $x_0 = 0$ and $\gamma = 1$ to:
$$f_C(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$
$$\frac{1}{\pi}\int_{-N}^{N}\frac{|x|\,dx}{1+x^2} = \frac{1}{\pi}\int_{0}^{N}\frac{2x\,dx}{1+x^2} = \frac{1}{\pi}\ln\left(1 + N^2\right).$$
Hence the Cauchy distribution has no mean, and thus by the introduction to Section
4.2.5, it has no finite moments.
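The divergence of the truncated absolute-moment integral is easy to see numerically. A minimal trapezoid-rule sketch in Python (the grid sizes are arbitrary choices of ours):

```python
import math

def truncated_abs_mean(N, steps=200_000):
    """Trapezoid estimate of (1/pi) * integral of |x|/(1+x^2) over [-N, N]."""
    h = 2.0 * N / steps
    total = 0.0
    for i in range(steps + 1):
        x = -N + i * h
        w = 0.5 if i in (0, steps) else 1.0   # trapezoid endpoint weights
        total += w * abs(x) / (1.0 + x * x)
    return total * h / math.pi

for N in [10.0, 100.0, 1000.0]:
    print(N, truncated_abs_mean(N), math.log(1.0 + N * N) / math.pi)
```

Each truncated value closely matches $\ln(1+N^2)/\pi$, which grows without bound: the Cauchy tail is too heavy for a mean to exist.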
5. Normal Distribution
The normal density is defined on (−∞, ∞), depends on a location parameter µ ∈ R
and a scale parameter σ > 0, and is defined in (1.65) by:
$$f_N(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
The associated unit or standard normal distribution $\phi(x)$ is defined in (1.66) with $\mu = 0$ and $\sigma = 1$:
$$\phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right).$$
As noted earlier, there is no elementary derivation of the fact that $\phi(x)$, and hence $f_N(x)$, integrate to 1. However, all central moments exist because $\exp\left(-x^2/2\right) < x^{-n}$ for all $n$ as $x \to \infty$. In addition, all odd central moments are 0 by symmetry, and for even moments an integration by parts obtains:
$$\int_{-\infty}^{\infty} x^{2m}\phi(x)dx = (2m-1)\int_{-\infty}^{\infty} x^{2m-2}\phi(x)dx.$$
Hence, justifying the notational convention of parameterizing the normal with $\mu$ and $\sigma^2$, we have that:
$$\mu_{N,1}' = \mu, \qquad \mu_{N,2} = \sigma^2, \qquad \mu_{N,2m} = \frac{\sigma^{2m}(2m)!}{2^m m!}, \qquad \mu_{N,2m+1} = 0. \qquad (4.77)$$
The moment generating function is derived by completing the square in the exponential function, and then a substitution, to produce:
$$M_N(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right), \qquad t \in \mathbb{R}, \qquad (4.78)$$
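The even-moment formula in (4.77) is consistent with the integration-by-parts recursion above, here scaled by $\sigma$ (the scaling step is our own, elementary manipulation); a quick Python check:

```python
import math

sigma = 1.7

def mu_even(m):
    """(4.77): mu_{2m} = sigma^(2m) * (2m)! / (2^m * m!); mu_even(0) = 1."""
    return sigma ** (2 * m) * math.factorial(2 * m) / (2 ** m * math.factorial(m))

# Integration-by-parts recursion scaled by sigma: mu_{2m} = (2m-1) * sigma^2 * mu_{2m-2}
for m in range(1, 8):
    closed = mu_even(m)
    recursed = (2 * m - 1) * sigma ** 2 * mu_even(m - 1)
    assert abs(closed - recursed) <= 1e-10 * closed

print(mu_even(1), mu_even(2))  # sigma^2 and 3*sigma^4
```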
6. Lognormal Distribution:
The lognormal distribution is defined on [0, ∞), depends on a location parameter
µ ∈ R and a shape parameter σ > 0, and has probability density function given in
(1.67):
$$f_L(x) = \frac{1}{\sigma x\sqrt{2\pi}}\exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right).$$
The substitution y = (ln x − µ) /σ into the integral of fL (x) produces the integral of the
unit normal φ(y), and moments of all orders exist for the lognormal and are calculated
using the same substitution:
$$\mu_{L,n}' = e^{n\mu}M_\Phi(n\sigma).$$
In other words, the moments of the lognormal can be calculated from the moment
generating function of the unit normal.
Specifically, using (4.79) obtains:
$$\mu_{L,n}' = e^{n\mu + (n\sigma)^2/2}, \qquad \mu_L = e^{\mu + \sigma^2/2}, \qquad \sigma_L^2 = e^{2\mu+\sigma^2}\left(e^{\sigma^2} - 1\right). \qquad (4.80)$$
moments assures the existence of $M_X(t)$ on an open interval $(-t_0, t_0)$. The lognormal distribution provides the classical counterexample, in that while $\mu_{nL}'$ exists for all $n$ by (4.80), the series:
$$\sum_{n=0}^{\infty}\frac{t^n\mu_{nL}'}{n!} = \sum_{n=0}^{\infty}\frac{t^n}{n!}e^{n\mu + (n\sigma)^2/2},$$
cannot converge for any $t \neq 0$, and so $M_{LN}(t)$ cannot exist except for $t = 0$.
To see this, recalling Section I.3.4.2 for definitions of limits superior and inferior, the ratio test states that for a positive series $\sum_{n=0}^{\infty} c_n$:
$$\limsup\frac{c_{n+1}}{c_n} < 1 \Rightarrow \sum_{n=0}^{\infty} c_n < \infty,$$
$$\liminf\frac{c_{n+1}}{c_n} > 1 \Rightarrow \sum_{n=0}^{\infty} c_n = \infty.$$
As the log of this expression is unbounded for $t \neq 0$, so too is this ratio, and thus the alternating series does not converge.
In summary, despite having all finite moments, the moment generating function $M_{LN}(t)$ of the lognormal distribution does not exist except for $t = 0$. By Example 4.58, there are infinitely many distribution functions with this same property.
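The divergence is visible numerically: with $\mu = 0$ and $\sigma = 1$, the $n$th series term is $t^n e^{n^2/2}/n!$, whose logarithm eventually increases without bound for any $t \neq 0$. Working in logs avoids floating-point overflow (a sketch of ours):

```python
import math

def log_term(t, n):
    """Natural log of the n-th series term t^n * mu'_n / n!, with mu = 0 and
    sigma = 1 so that mu'_n = exp(n^2 / 2); logs avoid overflow."""
    return n * math.log(t) + 0.5 * n * n - math.lgamma(n + 1)

t = 0.01  # even a tiny positive t fails
print(log_term(t, 1), log_term(t, 30), log_term(t, 60))  # ≈ -4.1, then large and growing
```

The quadratic $n^2/2$ in the exponent eventually dominates both $n\ln t$ and $\ln n!$, so the terms cannot tend to 0.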
For results on series, see for example Chapter 6 in Reitano (2010).
The following result is known as Markov’s inequality, named for Andrey Markov
(1856–1922), a student of Chebyshev.
$$\Pr[|X| \ge t] \le \frac{E[|X|]}{t}. \qquad (4.87)$$
Proof. This is a restatement of (4.83) with n = 1.
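A Monte Carlo illustration of (4.87) with Exponential(1) variates, for which $E[|X|] = 1$ (the choice of distribution and sample size are ours):

```python
import random

random.seed(7)
N = 100_000
# Exponential(1) samples: E[|X|] = 1, so Markov's bound is Pr[X >= t] <= 1/t.
xs = [random.expovariate(1.0) for _ in range(N)]

for t in [1.0, 2.0, 5.0]:
    tail = sum(1 for x in xs if x >= t) / N
    print(t, tail, 1.0 / t)   # empirical tail vs. Markov bound
```

The true tails here are $e^{-t}$, well under the bound; Markov's inequality is crude but requires only a first moment.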
The final result is intuitively reasonable, but the proof is subtle because we will need to
revisit the transformations of Section 4.1.2.
Remark 4.37 (On $\sigma(X)$) If $X$ is a random variable on $(S, \mathcal{E}, \lambda)$, then $X^{-1}(B) \in \mathcal{E}$ for every Borel set $B \in \mathcal{B}(\mathbb{R})$ by Exercise II.3.3, and thus $X^{-1}(\mathcal{B}(\mathbb{R})) \subset \mathcal{E}$. By Exercise II.3.44, $X^{-1}(\mathcal{B}(\mathbb{R}))$ is the smallest sigma algebra on $S$ with respect to which $X$ is measurable, and recalling Definition II.3.43, this sigma algebra is denoted $\sigma(X)$. Thus:
The same change of variables result noted in (4.13), but applied to the integral over A,
obtains:
$$\int_{A}|X(s)|^n d\lambda(s) = \int_{X(A)}|x|^n d\lambda_{F_X} \equiv \int_{-\infty}^{\infty}|x|^n\chi_{X(A)}(x)d\lambda_{F_X},$$
Thus:
$$E\left[|X|^n\chi_A\right] \ge \int_{|x|\ge t}|x|^n\chi_{X(A)}(x)d\lambda_{F_X} \ge t^n\int_{|x|\ge t}\chi_{X(A)}(x)d\lambda_{F_X} = t^n\int_{-\infty}^{\infty}\chi_{[X(A)\cap\{|x|\ge t\}]}(x)d\lambda_{F_X} = t^n\Pr\left[\{|X| \ge t\}\bigcap A\right].$$
This last step is justified by the observation that by definition of the induced measure $\lambda_{F_X}$ and (1.7):
$$\int_{-\infty}^{\infty}\chi_{[X(A)\cap\{|x|\ge t\}]}(x)d\lambda_{F_X} = \lambda_{F_X}\left[X(A)\bigcap\{|x| \ge t\}\right] = \lambda\left[\{|X| \ge t\}\bigcap A\right].$$
• A convex function is one for which the secant line is above the graph of the function
over this interval,
• A concave function is one for which the secant line is below the graph of the function
on this interval.
More formally:
When the inequalities are strict for t ∈ (0, 1), such functions are referred to as strictly
concave and strictly convex, respectively.
An important result from calculus, which we do not prove but can be found in Section
9.6 of Reitano (2010) and elsewhere, is the following.
$$\Pr\left[\max_{1\le k\le n}\left|\sum_{j=1}^{k}\left(Y_j - \mu_j\right)\right| \ge t\right] \le \sum_{j=1}^{n}\frac{\sigma_j^2}{t^2}.$$
While this result requires that $\{X_j\}_{j=1}^{n}$ be independent random variables, it does not require that they be “identically distributed.” Consequently this result applies to samples of variates, which are independent and identically distributed (i.i.d.), and to other independent collections of random variables.
Proposition 4.43 (Kolmogorov's inequality) Let $\{X_i\}_{i=1}^{n}$ be independent random variables defined on a probability space $(S, \mathcal{E}, \lambda)$ with all $E[X_j] = 0$ and $\mathrm{Var}[X_j] = \sigma_j^2$. Then for $t > 0$:
$$\Pr\left[\max_{1\le k\le n}\left|\sum_{j=1}^{k} X_j\right| \ge t\right] \le \sum_{j=1}^{n}\frac{\sigma_j^2}{t^2}. \qquad (4.94)$$
Proof. With $S_k \equiv \sum_{j=1}^{k} X_j$, define $A_k = \{s \in S\,|\,|S_k| \ge t \text{ and } |S_j| < t \text{ for } j < k\}$, letting $S_0 \equiv 0$ for $A_1$. Then $\{A_k\}_{k=1}^{n}$ are disjoint measurable sets, and $\sum_{k=1}^{n}\chi_{A_k}(s) \le 1$, with $s \in S\setminus\bigcup_{k=1}^{n} A_k$ if $\max_{1\le k\le n}|S_k(s)| < t$. Recall that $\chi_{A_k}(s)$ is the characteristic function of $A_k$, defined as 1 for $s \in A_k$ and 0 otherwise.
Thus:
$$S_n^2 \ge \sum_{k=1}^{n}\chi_{A_k}S_n^2,$$
Now $S_n^2 = S_k^2 + 2S_k(S_n - S_k) + (S_n - S_k)^2$, and so by the same steps:
$$E\left[S_n^2\right] \ge \sum_{k=1}^{n}E\left[\chi_{A_k}S_k^2\right] + 2\sum_{k=1}^{n}E\left[\chi_{A_k}S_k(S_n - S_k)\right] + \sum_{k=1}^{n}E\left[\chi_{A_k}(S_n - S_k)^2\right] \ge \sum_{k=1}^{n}E\left[\chi_{A_k}S_k^2\right]. \qquad (1)$$
To justify the last step, $E\left[\chi_{A_k}(S_n - S_k)^2\right] \ge 0$ by (4.11), and we claim that $E\left[\chi_{A_k}S_k(S_n - S_k)\right] = 0$. For this, $\chi_{A_k}S_k = \sum_{j=1}^{k}\chi_{A_k}X_j$ and $S_n - S_k = \sum_{j=k+1}^{n}X_j$ are independent by Exercise II.3.50 and Proposition II.3.56. Thus by (4.38) and the assumption that all $E[X_j] = 0$:
$$E\left[\chi_{A_k}S_k(S_n - S_k)\right] = E\left[\chi_{A_k}S_k\right]E\left[S_n - S_k\right] = 0.$$
Now $\chi_{A_k}S_k^2 \ge t^2\chi_{A_k}$ by the definition of $A_k$, and then $E\left[\chi_{A_k}S_k^2\right] \ge t^2E\left[\chi_{A_k}\right]$ by (4.11) and (4.10), and so $E\left[\chi_{A_k}S_k^2\right] \ge t^2\lambda[A_k]$. Thus by (1) and disjointness of $\{A_k\}_{k=1}^{n}$:
$$E\left[S_n^2\right] \ge t^2\sum_{k=1}^{n}\lambda[A_k] = t^2\lambda\left[\bigcup_{k=1}^{n}A_k\right] \equiv t^2\Pr\left[\max_{1\le k\le n}|S_k| \ge t\right].$$
From the independence of $\{X_i\}_{i=1}^{n}$, $E\left[S_n^2\right] = \sum_{j=1}^{n}\sigma_j^2$ by (4.43), and the proof is complete.
$$\max_{1\le k\le n}\Pr\left[\left|\sum_{j=1}^{k} X_j\right| \ge t\right] \le \sum_{j=1}^{n}\frac{\sigma_j^2}{t^2}.$$
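A Monte Carlo illustration of (4.94) for a symmetric $\pm 1$ random walk (all parameter choices below are ours):

```python
import random

random.seed(3)
trials, n, t = 20_000, 50, 15.0
# X_j = +/-1 with equal probability: E[X_j] = 0, sigma_j^2 = 1.
exceed = 0
for _ in range(trials):
    s, max_abs = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        max_abs = max(max_abs, abs(s))
    if max_abs >= t:
        exceed += 1

prob = exceed / trials
bound = n / t ** 2     # sum of sigma_j^2 / t^2 = 50/225 ≈ 0.222
print(prob, bound)     # estimated probability is well under the bound
```

Note that the bound controls the running maximum of the partial sums, not just the final sum, which is what makes the inequality stronger than Chebyshev's.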
To set the stage, assume that X and Y are random variables defined on a probability
space (S, E, λ). Then XY is a random variable by Proposition I.3.30, with expectation
defined by (4.7) when (4.8) is satisfied. This latter constraint is that E[|XY |] be finite,
where:
$$E[|XY|] \equiv \int_{S}|X(s)Y(s)|\,d\lambda(s).$$
However, the integrability of |X| and |Y | does not in general imply the integrability of
|XY | .
Example 4.45 (Integrability of XY) With $(S, \mathcal{E}, \lambda) = ([0,1], \mathcal{B}([0,1]), m)$, the random variables $X(s) = s^{-a}$ and $Y(s) = s^{-b}$ are both integrable for $0 < a, b < 1$, but $XY = s^{-(a+b)}$ will not be integrable if $a + b \ge 1$.
But note that if both $E[X^2]$ and $E[Y^2]$ exist, it then follows that $0 < a, b < 1/2$. So now $a + b < 1$ and $XY$ is integrable.
Moment Inequalities 115
In addition, there is equality in (4.95) if and only if, outside a set of $\lambda$-measure 0, $cX + dY \equiv 0$ for real $c, d \in \mathbb{R}$, not both 0.
Proof. To simplify the notation of the proof, we represent expectations in terms of the integrals on $S$ as in (4.7).
For real $t$, define the function $f(t) = E\left[(tX + Y)^2\right]$:
$$f(t) = \int_{S}\left[tX(s) + Y(s)\right]^2 d\lambda(s).$$
By the quadratic formula, if $a \neq 0$, then the polynomial $p(t) \equiv at^2 + bt + c \ge 0$ for all $t$ if and only if the discriminant $b^2 - 4ac \le 0$. This follows because if $b^2 - 4ac > 0$, then $p(t)$ has two distinct real roots at which $p'(t) \neq 0$, and thus by continuity, $p(t) < 0$ on an interval between these roots, a contradiction.
If $E[|X|^2] \neq 0$, then this discriminant inequality obtains:
$$\left(E[XY]\right)^2 \le E[|X|^2]E[|Y|^2].$$
If $E[|X|^2] = 0$, then $2tE[XY] + E[|Y|^2] \ge 0$ for all $t$ by (1), and this is possible if and only if $E[XY] = 0$. This proves (4.95).
If $cX + dY \equiv 0$ for $c, d \neq 0$, it follows that $X = \alpha Y$ with $\alpha = -d/c$. Then (4.95) holds with equality by (4.10).
If (4.95) holds with equality, $f(t)$ in (1) can be restated:
$$f(t) = \left(t\left(E[|X|^2]\right)^{1/2} + \left(E[|Y|^2]\right)^{1/2}\right)^2 \ge 0.$$
If $E[|X|^2] \neq 0$, then $f(c) = 0$ for $c = -\left(E[|Y|^2]\right)^{1/2}/\left(E[|X|^2]\right)^{1/2}$, noting that $c$ could be 0. In any case, $E\left[(cX + Y)^2\right] = 0$ by definition. It then follows that $cX + Y = 0$, $\lambda$-a.e. by Exercise 4.7. If $E[|X|^2] = 0$, then $X = 0$, $\lambda$-a.e., and so $X + dY \equiv 0$, $\lambda$-a.e. for $d = 0$.
One may wonder from the above result whether $E[|XY|]$ exists as required by Definition 4.1, and also satisfies (4.95). The triangle inequality in (4.12) of Exercise 4.9:
Proof. Let $\tilde{X} = |X|$ and $\tilde{Y} = |Y|$, both random variables by Proposition I.3.47. Then $E[\tilde{X}^2] = E[|X|^2]$ and $E[\tilde{Y}^2] = E[|Y|^2]$ exist, and so by (4.95):
$$E[\tilde{X}\tilde{Y}] \le \left(E[|X|^2]\right)^{1/2}\left(E[|Y|^2]\right)^{1/2}.$$
and so by (4.96):
$$\mu_{|2k+1|}' \le \left(\mu_{2k}'\mu_{2k+2}'\right)^{1/2}.$$
For central moments, let $X = (Z - \mu_Z)^k$ and $Y = (Z - \mu_Z)^{k+1}$, and this obtains:
$$\mu_{|2k+1|} \le \left(\mu_{2k}\mu_{2k+2}\right)^{1/2}.$$
Using (4.12), these inequalities provide upper bounds for the absolute value of the odd moments $\mu_{2k+1}'$ and $\mu_{2k+1}$, in terms of the respective even moments:
$$\left|\mu_{2k+1}'\right| \le \left(\mu_{2k}'\mu_{2k+2}'\right)^{1/2}, \qquad \left|\mu_{2k+1}\right| \le \left(\mu_{2k}\mu_{2k+2}\right)^{1/2}. \qquad (4.97)$$
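For instance, for the Exponential($\lambda$) distribution, with the standard moment formula $\mu_n' = n!/\lambda^n$ (consistent with the gamma moments above; since $X \ge 0$, these coincide with the absolute moments), (4.97) reduces to $\left((2k+1)!\right)^2 \le (2k)!\,(2k+2)!$. A quick Python check:

```python
import math

# Exponential(lam) moments: mu'_n = n!/lam^n (X >= 0, so these coincide
# with the absolute moments). Checking the odd-moment bound of (4.97):
lam = 2.0

def mom(n):
    return math.factorial(n) / lam ** n

for k in range(6):
    odd = mom(2 * k + 1)
    bound = math.sqrt(mom(2 * k) * mom(2 * k + 2))
    assert odd <= bound
    print(2 * k + 1, odd, bound)
```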
$$-\sigma_X\sigma_Y \le E\left[(X - \mu_X)(Y - \mu_Y)\right] \le \sigma_X\sigma_Y. \qquad (4.98)$$
Corollary 4.50 ($|\mathrm{corr}(X, Y)| \le 1$) Given random variables $X, Y$ with finite second moments, the covariance $\mathrm{cov}(X, Y)$ exists and satisfies:
$$\left|\mathrm{cov}(X, Y)\right| \le \sigma_X\sigma_Y,$$
and thus:
$$\left|\mathrm{corr}(X, Y)\right| \le 1. \qquad (4.99)$$
Remark 4.52 (Conjugate indexes) When $p, q$ satisfy $1 < p, q < \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$, they are called conjugate indexes, and sometimes Hölder conjugate indexes. By defining $\frac{1}{\infty} = 0$, the pair $(1, \infty)$ is also called conjugate. Many results on conjugate indexes can be extended to this limiting pair, including Hölder's inequality below.
Example 4.53 (Integrability of XY) With $(S, \mathcal{E}, \lambda) = ([0,1], \mathcal{B}([0,1]), m)$, the random variables $X(s) = s^{-a}$ and $Y(s) = s^{-b}$ are both integrable for $0 < a, b < 1$, but $XY = s^{-(a+b)}$ will not be integrable if $a + b \ge 1$. However, if both $E[X^p]$ and $E[Y^q]$ exist with $1 < p, q < \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$, then it follows that $0 < a < 1/p$ and $0 < b < 1/q$. In this case, $a + b < \frac{1}{p} + \frac{1}{q} = 1$, and $XY$ is integrable.
We now prove Hölder’s inequality, and note that this reduces to the Cauchy-Schwarz
inequality of (4.96) with p = q = 2.
Proposition 4.54 (Hölder's inequality) Given $p, q$ with $1 < p, q < \infty$ and $1/p + 1/q = 1$, assume that $E[|X|^p] < \infty$ and $E[|Y|^q] < \infty$ for random variables $X, Y$ defined on a probability space $(S, \mathcal{E}, \lambda)$.
Then $E[|XY|] < \infty$ and:
$$E[|XY|] \le \left(E[|X|^p]\right)^{1/p}\left(E[|Y|^q]\right)^{1/q}. \qquad (4.101)$$
For $p = 1$, if $E[|X|] < \infty$ and $\sup[|Y|] < \infty$, then $E[|XY|] < \infty$ and:
$$E[|XY|] \le E[|X|]\sup[|Y|]. \qquad (4.102)$$
Proof. If either or both $E[|X|^p] = 0$ and $E[|Y|^q] = 0$, then either or both $X = 0$ and $Y = 0$ outside a set of $\lambda$-measure 0 by Exercise 4.7. Thus $XY = 0$ outside such a set, $E[|XY|] = 0$ by Exercise 4.8, and (4.101) follows in these cases.
Thus assuming that $E[|X|^p] \neq 0$ and $E[|Y|^q] \neq 0$, apply Young's inequality with $a = |X|/\left(E[|X|^p]\right)^{1/p}$ and $b = |Y|/\left(E[|Y|^q]\right)^{1/q}$:
$$\frac{|XY|}{\left(E[|X|^p]\right)^{1/p}\left(E[|Y|^q]\right)^{1/q}} \le \frac{1}{p}\frac{|X|^p}{E[|X|^p]} + \frac{1}{q}\frac{|Y|^q}{E[|Y|^q]}.$$
The existence of $E[|X|^p]$ and $E[|Y|^q]$ then assures the existence of $E[|XY|]$ by (4.11). Taking expectations, the right-hand side reduces to 1 for conjugate indexes, and (4.101) again follows.
If $p = 1$ and $q = \infty$, then from $|XY| \le |X|\sup|Y|$ we obtain by (4.11) and (4.10):
$$E[|XY|] \le \sup|Y|\,E[|X|].$$
This obtains $|E[XY]| \le E[|XY|]$, and Hölder's inequality can be stated:
$$\left|E[XY]\right| \le \left(E[|X|^p]\right)^{1/p}\left(E[|Y|^q]\right)^{1/q}. \qquad (4.103)$$
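Since (4.101) holds for any probability measure, it holds in particular for an empirical (sample) measure, where it becomes the classical discrete Hölder inequality. A Python illustration (the sample construction and choice of conjugate pair are ours):

```python
import random

random.seed(11)
N = 50_000
p, q = 3.0, 1.5          # conjugate indexes: 1/3 + 2/3 = 1
xs = [random.uniform(-1, 1) for _ in range(N)]
ys = [random.gauss(0, 1) for _ in range(N)]

def e(vals):             # expectation under the empirical measure (equal weights 1/N)
    return sum(vals) / len(vals)

lhs = e([abs(x * y) for x, y in zip(xs, ys)])
rhs = e([abs(x) ** p for x in xs]) ** (1 / p) * e([abs(y) ** q for y in ys]) ** (1 / q)
print(lhs, rhs)          # lhs <= rhs, deterministically, for any sample
```

The inequality holds exactly here (not just in expectation), because the empirical measure is itself a probability measure on $N$ points.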
There is an important corollary result on moments which is easily proved using Hölder’s
inequality, called Lyapunov’s inequality. This result is named for Aleksandr Lyapunov
(1857–1918) and provides a lower bound on the growth rate of moments.
Corollary 4.56 (Lyapunov's inequality) Given a random variable $X$ on a probability space $(S, \mathcal{E}, \lambda)$, then for $0 < \alpha < \beta$ and assuming all moments exist:
$$\left(E[|X|^\alpha]\right)^{1/\alpha} \le \left(E[|X|^\beta]\right)^{1/\beta}. \qquad (4.104)$$
Uniqueness of Moments 119
Proof. Let $p = \beta/\alpha$, then apply Hölder's inequality to $|X|^\alpha$ and $Y \equiv 1$. That $E[Y] = 1$ follows from (4.7) since $S$ is a probability space.
Example 4.57 (Moment inequalities) Lyapunov's inequality implies that absolute moments of a random variable $X$ must grow at least geometrically, when such moments exist.
1. $\left(\mu_{|m|}'\right)^{1/m}$ and $\left(\mu_{|m|}\right)^{1/m}$ increase with $m$.
3. For $m = 1$ in item 2:
$$\left(\mu_{|1|}'\right)^n \le \mu_{|n|}'.$$
Example 4.58 (Heyde) The lognormal distribution in (1.67) is defined with $\mu = 0$ and $\sigma = 1$:
$$f_L(x) = \frac{1}{x\sqrt{2\pi}}\exp\left(-\frac{1}{2}(\ln x)^2\right), \qquad x \ge 0.$$
Now define for $-1 \le \alpha \le 1$:
$$f_\alpha(x) = f_L(x)\left[1 + \alpha\sin(2\pi\ln x)\right], \qquad x \ge 0.$$
The first integral is equal to 1 as this is the normal density of (1.66). The second integral is well defined since $|\sin(2\pi y)| \le 1$, and equals 0 by symmetry since the integrand $g(y) \equiv \exp\left(-y^2/2\right)\sin(2\pi y)$ is an odd function, meaning that $g(-y) = -g(y)$.
To see that $f_\alpha(x)$ has the same moments as $f_L(x)$ by (4.1), we show that for $n = 1, 2, ...$:
$$\int_{0}^{\infty}x^n dF_L(x) = \int_{0}^{\infty}x^n dF_\alpha(x).$$
To this end, making the substitution $x = \exp(y + n)$, and noting that $\sin x = \sin(x + 2\pi n)$ for any integer $n$ produces:
$$I_n = \int_{-\infty}^{\infty}\exp\left(yn + n^2\right)\exp\left(-(y + n)^2/2\right)\sin(2\pi(y + n))dy = \exp\left(n^2/2\right)\int_{-\infty}^{\infty}\exp\left(-y^2/2\right)\sin(2\pi y)dy = 0.$$
This again follows because the integrand $g(y) = \exp\left(-y^2/2\right)\sin(2\pi y)$ is absolutely integrable by boundedness of $\sin(2\pi y)$, and is an odd function.
Hence for any such $\alpha$, $f_\alpha(x)$ is a density function with the same moments as the lognormal $f_L(x)$.
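The vanishing of these perturbation integrals can be confirmed by quadrature. With $\mu = 0$ and $\sigma = 1$, substituting $u = \ln x$ turns $\int_0^\infty x^n f_L(x)\sin(2\pi\ln x)\,dx$ into a rapidly decaying integral over $\mathbb{R}$ (the truncation range and grid below are arbitrary choices of ours):

```python
import math

def perturbation_moment(n, lo=-15.0, hi=15.0, steps=100_000):
    """Trapezoid estimate, after the substitution u = ln x, of
       (1/sqrt(2*pi)) * integral of exp(n*u - u^2/2) * sin(2*pi*u) du."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * math.exp(n * u - 0.5 * u * u) * math.sin(2.0 * math.pi * u)
    return total * h / math.sqrt(2.0 * math.pi)

for n in range(4):
    print(n, perturbation_moment(n))  # all ≈ 0: the sin term adds nothing to any moment
```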
With the above example in hand, we now begin an investigation toward a positive result on uniqueness of moments. The next proposition states that given the distribution function $F(x)$, if all moments exist and the power series $\sum_{n=0}^{\infty}\mu_n' t^n/n!$ converges for $t \in (-t_0, t_0)$ with $t_0 > 0$, then $M_F(t)$ exists. Thus by Proposition 4.27, $M_F(t)$ is given by this power series on this interval.
Note that given the next result, we will not be able to conclude that $F(x)$ is the only distribution function with these moments. It could well be possible that there exists a second distribution function $G(x)$ with $M_F(t) = M_G(t)$ on $(-t_0, t_0)$ and thus, $F(x)$ and $G(x)$ have the same moments yet $F(x) \neq G(x)$. See Proposition 4.61 and Corollary 4.62.
Proposition 4.59 (Existence of $M_F(t)$) Given the distribution function $F(x)$, assume that $\mu_n'$ exists for all $n$ and that the power series $\sum_{n=0}^{\infty}\mu_n' t^n/n!$ converges absolutely on $(-t_0, t_0)$ with $t_0 > 0$.
Then the moment generating function $M_F(t)$ of $F(x)$ exists on $(-t_0, t_0)$, and is given by this series.
Proof. For $t \in (-t_0, t_0)$, the triangle inequality of item 3 of Proposition III.4.24, and then the triangle inequality for sums obtains:
$$\left|\int_{-\infty}^{\infty}\sum_{j=0}^{n}\frac{(tx)^j}{j!}dF(x)\right| \le \sum_{j=0}^{n}\int_{-\infty}^{\infty}\left|\frac{(tx)^j}{j!}\right|dF(x) \equiv \sum_{j=0}^{n}\frac{|t|^j}{j!}\mu_{|j|}', \qquad (1)$$
where $\mu_{|j|}'$ denotes the absolute $j$th moment, $\mu_{|j|}' = E\left[|X|^j\right]$.
We show below that the upper summation in (1) is bounded for all $n$ for any such $t$. Assuming this, and applying integral properties from Proposition III.4.24:
$$\left|\int_{-\infty}^{\infty}e^{tx}dF(x) - \sum_{j=0}^{n}\frac{\mu_j' t^j}{j!}\right| = \left|\int_{-\infty}^{\infty}e^{tx}dF(x) - \int_{-\infty}^{\infty}\sum_{j=0}^{n}\frac{(tx)^j}{j!}dF(x)\right| \le \int_{-\infty}^{\infty}\sum_{j=n+1}^{\infty}\left|\frac{(tx)^j}{j!}\right|dF(x) \le \sum_{j=n+1}^{\infty}\frac{|t|^j\mu_{|j|}'}{j!}. \qquad (2)$$
$$2n^+ \equiv \begin{cases} 2n, & \mu_{2n}' \ge \mu_{2n+2}', \\ 2n+2, & \mu_{2n}' < \mu_{2n+2}'. \end{cases}$$
Given $t \in (-t_0, t_0)$, let $s = \lambda t$ where $\lambda < 1$. We can assume that $t \neq 0$ since the existence of $M_F(0)$ is always assured. Then:
$$\frac{\mu_{|2n+1|}'|s|^{2n+1}}{(2n+1)!} \le c_n\frac{\mu_{2n^+}'|t|^{2n^+}}{(2n^+)!},$$
where:
$$c_n \equiv \begin{cases} |t|\lambda^{2n+1}/(2n+1), & 2n^+ = 2n, \\ (2n+2)\lambda^{2n+1}/|t|, & 2n^+ = 2n+2. \end{cases}$$
In either case, $c_n \to 0$ as $n \to \infty$.
Now for each $n$, define:
Then since $\mu_{|2n|}' = \mu_{2n}'$ by definition, splitting the summation in (1) into even and odd indexes obtains:
$$\sum_{j=0}^{2n}\frac{|s|^j\mu_{|j|}'}{j!} \le \sum_{j=0}^{n}\frac{|t|^{2j}\mu_{2j}'}{(2j)!}\left(1 + d_{j+1}^- + d_j^+\right).$$
The following result provides an alternative test for the existence of $M_F(t)$ given $\{\mu_n'\}_{n=1}^{\infty}$, stated in terms of a growth bound on even moments.
Corollary 4.60 (Existence of $M_F(t)$) Given the distribution function $F(x)$, assume that $\mu_n'$ exists for all $n$ and that:
$$\limsup\frac{\left(\mu_{2n}'\right)^{1/2n}}{2n} = r < \infty.$$
Then $M_F(t)$ exists for $|t| < 1/r$, and by Proposition 4.27 is given on this interval by the series in (4.53).
Proof. Recall that the limit superior of a sequence $\{a_n\}_{n=1}^{\infty}$ is defined:
$$\limsup a_n \equiv \inf_{n}\sup_{m\ge n}a_m = \lim_{n\to\infty}\sup_{m\ge n}a_m,$$
where the final equality follows since $\sup_{m\ge n}a_m$ is decreasing with $n$. See Section I.3.4.2 for more on $\limsup$.
The assumed bound above then implies that for all but at most finitely many $n$:
$$\mu_{2n}' \le (2nr)^{2n}.$$
By (4.97), this obtains that for all but at most finitely many $n$:
$$\left|\mu_{2n+1}'\right| \le \left(\mu_{2n}'\mu_{2n+2}'\right)^{1/2} \le \left(2(n+1)r\right)^{2n+1}.$$
Hence in all cases of even or odd $m$, with at most finitely many exceptions:
$$\left|\mu_m'\right| \le \left((m+1)r\right)^m,$$
and so:
$$\sum_{m=0}^{\infty}\frac{\left|\mu_m'\right||t|^m}{m!} \le \sum_{m=0}^{\infty}\frac{\left(|t|(m+1)r\right)^m}{m!} = \sum_{m=0}^{\infty}\left(|t|r\right)^m\frac{(m+1)^m}{m!}.$$
Using the ratio test noted in Example 4.32, it follows that this series is convergent if $|t| < 1/r$, and the result follows from Proposition 4.59.
Returning to the investigation of this section, we now know that the existence of all
moments, and then either a certain growth bound or the convergence of an associated
power series, assures the existence of a moment generating function. So now the question
might be reframed:
Does the existence of a moment generating function assure the uniqueness of moments? In other words, can two different distribution functions have the same moments and moment generating functions? While Example 4.58 showed that different distribution functions can have the same moments, it must be noted that none of those distribution functions have a moment generating function by Example 4.32.
The following proposition settles this question and will be proved in Book VI using
properties of the characteristic function of F (x). It provides a key test for when a
moment collection uniquely defines a distribution function. Corollary 4.62 provides another
test that uses this result.
Proposition 4.61 (Uniqueness of Moments) Given the distribution function $F(x)$, assume that $\mu'_n$ exists for all $n$ and that $\sum_{n=0}^{\infty} \mu'_n t^n/n!$ converges absolutely on $(-t_0, t_0)$ with $t_0 > 0$.
Then $F(x)$ is the only distribution function with these moments.
Proof. See the section on uniqueness of moments in the Book VI chapter on the charac-
teristic function.
The following corollary provides a uniqueness result that complements the results of Propositions 4.25 and 4.27. These stated:
1. If $M_X(t)$ exists for $t \in (-t_0, t_0)$ with $t_0 > 0$, then so too do all moments $\{\mu'_n\}_{n=1}^{\infty}$ of $X$;
Corollary 4.62 (Uniqueness of $M_F(t)$) Given the distribution function $F(x)$, assume that $M_F(t) \equiv \int_{-\infty}^{\infty} e^{tx}\,dF$ exists on $(-t_0, t_0)$ with $t_0 > 0$.
Then $F(x)$ is the only distribution function with this moment generating function, and the only distribution function with the associated moments.
Proof. If $F(x)$ has moment generating function $M_F(t)$, which converges on the interval $(-t_0, t_0)$ with $t_0 > 0$, then by Propositions 4.25 and 4.27, $F(x)$ has moments of all orders defined by $\mu'_n = M^{(n)}(0)$, and $M_F(t)$ is given by the power series $\sum_{n=0}^{\infty} \mu'_n t^n/n!$ on $(-t_0, t_0)$. Thus by Proposition 4.61, $F(x)$ is the only distribution function with these moments.
If $G(x)$ is another distribution function with moment generating function $M_F(t)$, convergent on $(-t'_0, t'_0)$ with $t'_0 > 0$, then by the same argument $G(x)$ has the same moments and convergent power series as $F(x)$. Proposition 4.61 then yields $G(x) = F(x)$, and hence $F(x)$ is the only distribution function with this moment generating function.
Thus X has the moment generating function of a normal variate with parameters µ and
σ 2 as above. By Corollary 4.62, only the normal variate has this moment generating
function, and the result is proved.
5. Sum of squared normals is chi-squared:
Recall Remark 1.30, that when a random variable $Y$ has a gamma distribution with $\lambda = 1/2$ and $\alpha = n/2$, it is said to have a chi-squared distribution with $n$ degrees of freedom, which is often denoted $\chi^2_n$ d.f.
As proved in Example 2.4, if $X$ is standard normal, then $X^2$ is chi-squared with 1 degree of freedom. If $\{X_i\}_{i=1}^{n}$ are independent standard normals, then $\{X_i^2\}_{i=1}^{n}$ are independent chi-squared by Proposition II.3.56. Thus as an application of item 3, as a sum of independent gammas with common parameters $\lambda = 1/2$ and $\alpha = 1/2$, $\sum_{i=1}^{n} X_i^2$ is gamma with $\lambda = 1/2$ and $\alpha = n/2$.
That is, $\sum_{i=1}^{n} X_i^2$ is $\chi^2_n$ d.f. as noted in Remark 2.5.
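A quick Monte Carlo sanity check of this conclusion (ours, using only the Python standard library): a chi-squared variate with $n$ degrees of freedom has mean $n$ and variance $2n$, so a sum of $n = 3$ squared standard normals should exhibit sample mean near 3 and sample variance near 6.

```python
import random

random.seed(7)
n, trials = 3, 200_000

# Sum of n squared independent standard normals; per item 5 this is chi-squared
# with n degrees of freedom, so the mean is n and the variance is 2n.
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(mean, var)  # approximately 3 and 6
```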
6. Student T construction in Example 2.29:
Recall Example 2.29 on Student’s T distribution. Simplifying notation, it was concluded
that there are random variables A, B, and C with A + B = C, where A and B are inde-
pendent and both A and C are chi-squared with 1 and n degrees of freedom, respectively.
The claim there was that this assured that B is also chi-squared, with n − 1 degrees of freedom.
To prove this, the moment generating functions of A and C are given in (4.73) with $\lambda_A = \lambda_C = 1/2$, $\alpha_A = 1/2$, and $\alpha_C = n/2$. Assuming that B has a moment generating function, an application of (4.48) and Corollary 4.62 then obtains that B is also chi-squared with $n - 1$ degrees of freedom.
This assumption on B is verified by noting that the moments of A and B sum to the
moments of C, and thus, the convergence of the moment series for A and C assures
the convergence of the moment series for B. Thus by Proposition 4.59, the moment
generating function for B exists.
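The MGF step of this argument can be checked numerically (our sketch, assuming the standard gamma form $M(t) = (1 - t/\lambda)^{-\alpha}$, which with $\lambda = 1/2$ gives $(1-2t)^{-\alpha}$): independence gives $M_B(t) = M_C(t)/M_A(t)$, which should equal the chi-squared MGF with $n - 1$ degrees of freedom.

```python
# Numeric check of the MGF ratio argument: M_A and M_C are chi-squared MGFs
# (1 - 2t)^(-1/2) and (1 - 2t)^(-n/2); their quotient M_B should be the
# chi-squared MGF with n - 1 degrees of freedom, (1 - 2t)^(-(n-1)/2).
n = 5
for t in (-0.3, -0.1, 0.1, 0.2):          # any t < 1/2
    M_A = (1 - 2 * t) ** (-0.5)           # chi-squared, 1 d.f.
    M_C = (1 - 2 * t) ** (-n / 2)         # chi-squared, n d.f.
    M_B = M_C / M_A
    assert abs(M_B - (1 - 2 * t) ** (-(n - 1) / 2)) < 1e-12
print("MGF ratio matches chi-squared with", n - 1, "d.f.")
```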
expectations, these results will be generalized with the aid of the integration theory of Book
V.
Assume that a distribution function $F(x)$ is uniquely determined by its moment collection $\{\mu'_n\}_{n=1}^{\infty}$, and that there exists a sequence of distribution functions $\{F_m\}_{m=1}^{\infty}$ with moment collections $\{\mu'_{m,n}\}_{n=1}^{\infty}$ where $\mu'_{m,n} \to \mu'_n$ for each $n$. Are we then able to conclude that $F_m$ converges weakly to the distribution function $F$, denoted $F_m \Rightarrow F$?
Recalling Definition II.8.2, this means that $F_m(x) \to F(x)$ for every continuity point of $F$.
As discussed in Section II.8.1, the notion $F_m \Rightarrow F$ can be equivalently stated in other ways:
1. Given random variables $\{X_m\}_{m=1}^{\infty}$ associated with $\{F_m\}_{m=1}^{\infty}$ and $X$ associated with $F$ (recall Proposition II.3.6):
$F_m \Rightarrow F$ is equivalent to $X_m \Rightarrow X$, meaning that $X_m$ converges in distribution to the random variable $X$.
2. Given the Borel measures $\{\lambda_{F_m}\}_{m=1}^{\infty}$ induced by $\{F_m\}_{m=1}^{\infty}$ and $\lambda_F$ induced by $F$ (recall Proposition 1.5):
$F_m \Rightarrow F$ is equivalent to $\lambda_{F_m} \Rightarrow \lambda_F$, meaning that $\lambda_{F_m}$ converges weakly to $\lambda_F$.
In probability theory, the method of moments is:
• The name often given to the framework within which one can assert that $F_m \Rightarrow F$ by demonstrating that $\mu'_{m,n} \to \mu'_n$ for each $n$, or conversely.
• The name given to the process whereby one calculates the moments of a random sample
to estimate the parameters of an assumed underlying distribution function. This is ac-
complished by equating the distribution’s parametric moment formulas to the numerical
values calculated from the sample and solving.
The focus of this section is on method of moments in the first sense.
Any method of moments in this sense must require that the distribution function $F$ be uniquely determined by its moment collection $\{\mu'_n\}_{n=1}^{\infty}$. For example, assume that $\{F_m\}_{m=1}^{\infty}$ is given and $\mu'_{m,n} \to e^{n\mu + (n\sigma)^2/2}$ for each $n$. Even though these limits are recognizable as the moments of the lognormal distribution, it was shown in Example 4.58 that these are also the moments of an infinite number of other distribution functions. So certainly there can be no useful statement concerning $F_m \Rightarrow F$.
The main result of this section is Proposition 4.76, which states that in the case where $F$ is uniquely defined by its moments, if $\mu'_{m,n} \to \mu'_n$ for each $n$, then $F_m \Rightarrow F$. Proposition 4.77 provides the same conclusion based on convergence of moment generating functions.
For these results, we will first need positive results in the opposite direction. For example, if $F_m \Rightarrow F$ with moment collections $\{\{\mu'_{m,n}\}_{n=1}^{\infty}\}_{m=1}^{\infty}$ and $\{\mu'_n\}_{n=1}^{\infty}$, must it be the case that $\mu'_{m,n} \to \mu'_n$ for each $n$? Perhaps surprisingly, the answer is in the negative, as illustrated in Example 4.64 and the subsequent discussion.
Recall Definition II.8.16 on the notion of tightness of a family of measures $\{\lambda_m\}_{m=1}^{\infty}$ or distribution functions $\{F_m\}_{m=1}^{\infty}$. The equivalence of definitions in the latter case was addressed in Remark II.8.17 and is here left as an optional exercise.
Definition 4.63 (Tight sequence: $\{\lambda_m\}_{m=1}^{\infty}$ or $\{F_m\}_{m=1}^{\infty}$) A sequence of probability measures $\{\lambda_m\}_{m=1}^{\infty}$ is said to be tight if for any $\epsilon > 0$ there is a finite interval $(a, b]$ so that $\lambda_m((a, b]) > 1 - \epsilon$ for all $m$.
A sequence of distribution functions $\{F_m\}_{m=1}^{\infty}$ is said to be tight if for any $\epsilon > 0$ there is a finite interval $(a, b]$ so that $F_m(b) - F_m(a) > 1 - \epsilon$ for all $m$, or equivalently, $F_m(b) > 1 - \epsilon$ and $F_m(a) < \epsilon$ for all $m$.
1. $\{F_m\}_{m=1}^{\infty}$ is not tight:
Define discrete density functions $\{f_m\}_{m=1}^{\infty}$ by $f_m(m) = 1$, $f_m(x) = 0$ for $x \neq m$, and thus $F_m(x) = \chi_{[m,\infty)}(x)$:
$$F_m(x) = \begin{cases} 0, & x < m, \\ 1, & m \le x. \end{cases}$$
Now, $F_m(x) \to F(x) \equiv 0$ for all $x$, but by definition it does not follow that $F_m \Rightarrow F$ since $F$ is not even a distribution function. That $F$ is not a distribution function is consistent with the result of Exercise II.8.21, since the associated distribution functions $\{F_m\}_{m=1}^{\infty}$ are not tight.
In this case $\mu'_{m,n} \nrightarrow \mu'_n$ for any $n$, since $\mu'_{m,n} = m^n$ for all $n$ while $\mu'_n = 0$.
Conjecture: It seems natural to hypothesize that if $F_m \Rightarrow F$ with $F$ a distribution function, then we will obtain the positive result that $\mu'_{m,n} \to \mu'_n$ for all $n$. By Proposition II.8.18, if $F$ is a distribution function and $F_m \Rightarrow F$, then $\{F_m\}_{m=1}^{\infty}$ must be tight, so we next look at such an example.
2. $\{F_m\}_{m=1}^{\infty}$ is tight:
Define discrete density functions $\{f_m\}_{m=1}^{\infty}$ by $f_m(0) = 1 - 1/m$, $f_m(m) = 1/m$, and $f_m(x) = 0$ for $x \neq 0, m$. Then:
$$F_m(x) = \begin{cases} 0, & x < 0, \\ 1 - 1/m, & 0 \le x < m, \\ 1, & m \le x. \end{cases}$$
Before proceeding to a solution, we investigate the second example further, and demon-
strate why the desired conclusion was not achieved even with tightness of the sequence
{Fm }∞m=1 .
Example 4.65 (First analysis of Example 4.64: $\{F_m\}_{m=1}^{\infty}$ is tight) Assume that $\{F_m\}_{m=1}^{\infty}$ is a tight sequence of distribution functions and $F_m \Rightarrow F$ for a distribution function $F$. By Skorokhod's representation theorem of Proposition II.8.30, we can define random variables $\{X_m\}_{m=1}^{\infty}$ and $X$ on the Lebesgue measure space $((0,1), B(0,1), m_L)$ with these distribution functions, and for which $X_m(t) \to X(t)$ for all $t \in (0,1)$. As in (4.7), but now as Lebesgue integrals:
$$\mu'_{m,n} = \int_0^1 X_m^n(t)\,dm_L, \qquad \mu'_n = \int_0^1 X^n(t)\,dm_L. \tag{1}$$
In the second example above, an application of the Skorokhod construction yields that $X_m(t) \equiv 0$ for $t \in (0, 1-1/m]$ and $X_m(t) \equiv m$ for $t \in (1-1/m, 1)$, while $X(t) \equiv 0$. As assured by the Skorokhod result, $X_m(t) \to X(t)$ for all $t \in (0,1)$. Not surprisingly, $\mu'_{m,n} = m^{n-1}$ and $\mu'_n = 0$ as before, but now calculated as in (1).
The assumption that $\{F_m\}_{m=1}^{\infty}$ is tight implies that for any $\epsilon > 0$ there is an interval, say $(-N_\epsilon, N_\epsilon]$, such that for all $m$:
$$F_m(N_\epsilon) - F_m(-N_\epsilon) \ge 1 - \epsilon.$$
By definition of the distribution function, that $F_m(x) \equiv m_L\left[X_m^{-1}(-\infty, x]\right]$, this obtains:
$$m_L\left[X_m^{-1}(-N_\epsilon, N_\epsilon]\right] \ge 1 - \epsilon.$$
Hence:
$$m_L\left[X_m^{-1}\left[(-\infty, -N_\epsilon] \cup (N_\epsilon, \infty)\right]\right] \le \epsilon. \tag{2}$$
We can replace $(-N_\epsilon, N_\epsilon]$ by a slightly larger open interval for notational simplicity, and then have, using the same notation:
$$\mu'_{m,n} = \int_{|X_m| < N_\epsilon} X_m^n(t)\,dm_L + \int_{|X_m| \ge N_\epsilon} X_m^n(t)\,dm_L. \tag{3}$$
As will be seen in the proof of Proposition 4.72, the first integral will converge as $m \to \infty$ by the bounded convergence theorem of Proposition III.2.22. This result applies because for any $N_\epsilon$, $m_L[|X_m| < N_\epsilon] \le m_L[(0,1)] = 1$, while $|X_m(t)| \le N_\epsilon$ on this set for all $m$ by definition, and $X_m(t) \to X(t)$ for all $t \in (0,1)$. Thus the bounded convergence theorem applies and obtains that the first integral converges:
$$\int_{|X_m| < N_\epsilon} X_m^n(t)\,dm_L \to \int_{|X| < N_\epsilon} X^n(t)\,dm_L. \tag{4}$$
Hence given tightness, the challenge of achieving the desired result that $\mu'_{m,n} \to \mu'_n$ for each $n$ is the convergence of the second integral in (3), with unbounded integrands. In other words, the outstanding question is as follows.
Given that $X_m(t) \to X(t)$ for all $t$ and $m_L\{t \mid |X_m(t)| \ge N_\epsilon\} \le \epsilon$ by (2), what assumption will assure that:
$$\int_{|X_m| \ge N_\epsilon} X_m^n(t)\,dm_L \to \int_{|X_m| \ge N_\epsilon} X^n(t)\,dm_L = 0? \tag{5}$$
Certainly, tightness is inadequate for such an assurance, as the second example above illustrates. That $m_L\{t \mid |X_m(t)| \ge N_\epsilon\} \le \epsilon$ for all $m$ does not even assure that the integrals of $[X_m(t)]^n$ over these sets remain bounded. In detail, given any such $N_\epsilon$ and any $m > N_\epsilon$:
$$\int_{|X_m| \ge N_\epsilon} X_m^n(t)\,dm_L = \int_{1-1/m}^{1} m^n\,dm_L = m^{n-1}. \tag{6}$$
For this example, and as will be the case in general, the failure of µ0m,n → µ0n for each n
will be linked to the second integral in (3) not converging as needed in (5). The first integral
in (3) will always be assured to converge as in (4) by the bounded convergence theorem.
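The arithmetic of this example is simple enough to tabulate directly (our illustration): $\mu'_{m,n} = (1 - 1/m)\cdot 0^n + (1/m)\,m^n = m^{n-1}$, which stays at 1 for $n = 1$ and diverges for $n \ge 2$, never approaching the limit moments $\mu'_n = 0$.

```python
# Moments of Example 4.64(2): f_m(0) = 1 - 1/m and f_m(m) = 1/m, so
# mu'_{m,n} = (1 - 1/m) * 0**n + (1/m) * m**n = m**(n-1), while the weak
# limit F has all moments 0. Tightness alone does not force moment convergence.
def moment(m, n):
    return (1 - 1 / m) * 0 ** n + (1 / m) * m ** n

for n in (1, 2, 3):
    print(n, [moment(m, n) for m in (10, 100, 1000)])
# n = 1: the moments stay at 1; n >= 2: they diverge -- never the limit moment 0
```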
Returning to Book II and keeping the above notation, there were three "integration to the limit" results whereby pointwise convergence of $X_m^n \to X^n$ implied convergence of the associated Lebesgue integrals:
Though items 2 or 3 may apply in special situations, none of these results is generally
applicable to the current investigation because the criteria needed are very strong.
As it turns out, there is another general criterion, called uniform integrability, which
will provide another integration to the limit result which will serve our purpose. We in-
troduce this definition next in the context of a general probability space, even though we
currently only require this notion for Lebesgue integrals. The general integrals in this defi-
nition were introduced in Section 4.1.2 and will be developed in Book V.
Definition 4.66 (Uniform integrability) A sequence of random variables $\{X_m\}_{m=1}^{\infty}$ defined on a probability space $(S, \mathcal{E}, \lambda)$ is said to be uniformly integrable (U.I.) if:
$$\lim_{N\to\infty} \sup_m \int_{|X_m| \ge N} |X_m(s)|\,d\lambda(s) = 0. \tag{4.106}$$
Remark 4.67 (On uniform integrability) A few comments on this definition are war-
ranted.
Thus $\{X_m\}_{m=1}^{\infty}$ are all integrable, and the associated integrals are uniformly bounded.
3. Uniform integrability is assured if $\{X_m\}_{m=1}^{\infty}$ are dominated by an integrable $Y$, meaning that $|X_m(s)| \le |Y(s)|$ for all $m$ and $\int_S |Y(s)|\,d\lambda(s) < \infty$. In Book V, this result will be seen to be true in a general measure space, and not just in a probability space. But uniform integrability is weaker than the assumption that $|X_m(s)| \le |Y(s)|$ for such $Y$. See Exercises 4.68 and 4.69.
4. If $\{X_m\}_{m=1}^{\infty}$ are integrable and uniformly bounded, so $|X_m(s)| \le c$ for all $m$, this implies uniform integrability in a probability space. But this conclusion is not valid in a general measure space, as the example of $X_m \equiv \chi_{[m, m+1]}(x)$ defined on $(\mathbb{R}, B(\mathbb{R}), m)$ illustrates.
The following exercises are stated for the probability space ((0, 1), B(0, 1), mL ) so the
more familiar Lebesgue integration theory can be used. But note that these results generalize
to arbitrary probability spaces using the integration theory introduced in Section 4.1.2.
Exercise 4.68 (All $|X_m| \le |Y|$ for integrable $Y$ $\Rightarrow$ $\{X_m\}_{m=1}^{\infty}$ U.I.) Prove that if $\{X_m\}_{m=1}^{\infty}$ are defined on the Lebesgue probability space $((0,1), B(0,1), m_L)$ and $|X_m(t)| \le |Y(t)|$ for all $m$ where $\int_0^1 |Y(t)|\,dm_L(t) < \infty$, then $\{X_m\}_{m=1}^{\infty}$ are uniformly integrable. Hint: Item 4 of Proposition III.2.49.
Exercise 4.69 ($\{X_m\}_{m=1}^{\infty}$ U.I. $\nRightarrow$ All $|X_m| \le |Y|$ for integrable $Y$) On the Lebesgue probability space $((0,1), B(0,1), m_L)$, develop an example of uniformly integrable random variables $\{X_m\}_{m=1}^{\infty}$ for which there is no integrable $Y$ with $|X_m| \le |Y|$. Hint: $\{X_m\}_{m=1}^{\infty}$ must be unbounded and uniformly integrable. Identify variates that are tightly bounded by $Y(t) = 1/t$, which is not integrable.
Exercise 4.70 ($\sup_m \int_{(0,1)} |X_m|^{1+\epsilon}\,dm_L < \infty$ $\Rightarrow$ $\{X_m\}_{m=1}^{\infty}$ U.I.) Let $\{X_m(t)\}_{m=1}^{\infty}$ be a random variable sequence on the Lebesgue probability space $((0,1), B(0,1), m_L)$. Prove that if for some $\epsilon > 0$:
$$\sup_m \int_{(0,1)} |X_m(t)|^{1+\epsilon}\,dm_L(t) = K < \infty,$$
then $\{X_m\}_{m=1}^{\infty}$ are uniformly integrable.
Determine c(N ).
Turning to the next result, it will perhaps be surprising that it is stated without mention of the uniform integrability of the variates $\{X_m^n\}_{m=1}^{\infty}$ for each $n$. Indeed, this result is completely silent on these variates, and simply assumes that $F_m \Rightarrow F$ and that all distribution functions have all moments.
To obtain the necessary uniform integrability of the unmentioned variates $\{X_m^n\}_{m=1}^{\infty}$, it makes a more easily verifiable assumption on the boundedness of the even moments $\{\mu'_{m,2n}\}_{m=1}^{\infty}$ for each $n$. As will be seen in the proof, this assures the uniform integrability assumption, and this is enough to obtain the desired conclusion that $\mu'_{m,n} \to \mu'_n$ for all $n$, as noted in Example 4.71.
Proposition 4.72 (When $F_m \Rightarrow F$ implies that $\mu'_{m,n} \to \mu'_n$ for all $n$) Let $\{F_m\}_{m=1}^{\infty}$ and $F$ be distribution functions with respective moment collections $\{\{\mu'_{m,n}\}_{n=1}^{\infty}\}_{m=1}^{\infty}$ and $\{\mu'_n\}_{n=1}^{\infty}$.
If $F_m \Rightarrow F$ and $\{\mu'_{m,2n}\}_{m=1}^{\infty}$ is bounded for every $n$, then $\mu'_{m,n} \to \mu'_n$ for all $n$.
Proof. By Skorokhod's representation theorem of Proposition II.8.30, define random variables $\{X_m\}_{m=1}^{\infty}$ and $X$ on the Lebesgue measure space $((0,1), B(0,1), m_L)$ with respective distribution functions $\{F_m\}_{m=1}^{\infty}$ and $F$, and for which $X_m(t) \to X(t)$ for all $t \in (0,1)$. For given $N$ define:
$$X_m^{(N)} = \begin{cases} X_m, & |X_m| < N, \\ 0, & |X_m| \ge N, \end{cases} \qquad X^{(N)} = \begin{cases} X, & |X| < N, \\ 0, & |X| \ge N. \end{cases}$$
For each $n$ and $N$, $\left[X_m^{(N)}\right]^n \to \left[X^{(N)}\right]^n$ for all $t \in (0,1)$, and so by the bounded convergence theorem of Proposition III.2.22:
$$\int_{(0,1)} \left[X_m^{(N)}(t)\right]^n dm_L \to \int_{(0,1)} \left[X^{(N)}(t)\right]^n dm_L. \tag{1}$$
arbitrarily small as $N \to \infty$. To complete the proof, we show that the assumption on boundedness of even moments assures that $\{[X_m(t)]^n\}$ are uniformly integrable for any $n$, and thus the first term on the right in (3) converges to zero as $N \to \infty$ by Definition 4.66.
To this end, the assumed boundedness of $\{\mu'_{m,2n}\}_{m=1}^{\infty}$ implies that:
$$\sup_m \int_{(0,1)} |X_m(t)|^{2n}\,dm_L = K_n < \infty.$$
Hence:
$$\sup_m \int_{|X_m(t)| \ge N} |X_m(t)|^n\,dm_L \le \sup_m \int_{|X_m(t)| \ge N} |X_m(t)|^{2n}\,dm_L / N^n \le K_n/N^n,$$
and $\{[X_m(t)]^n\}_{m=1}^{\infty}$ are uniformly integrable.
It then follows from (3) that for all $n$:
$$\limsup_m \left| \int_{(0,1)} [X_m(t)]^n\,dm_L - \int_{(0,1)} [X(t)]^n\,dm_L \right| = 0,$$
and since these terms are nonnegative, this assures that the corresponding limit is 0. Thus $\mu'_{m,n} \to \mu'_n$ for all $n$.
The existence of all moments for $F$ was assumed above, but could have been part of the conclusion.
Corollary 4.73 (When $F_m \Rightarrow F$ implies that $\mu'_{m,n} \to \mu'_n$ for all $n$) Let $\{F_m\}_{m=1}^{\infty}$ be distribution functions with moment collections $\{\{\mu'_{m,n}\}_{n=1}^{\infty}\}_{m=1}^{\infty}$. If $F_m \Rightarrow F$ and $\{\mu'_{m,2n}\}_{m=1}^{\infty}$ is bounded for every $n$, then $F$ has moments of all orders $\{\mu'_n\}_{n=1}^{\infty}$, and $\mu'_{m,n} \to \mu'_n$ for all $n$.
Proof. Once the existence of moments for $F$ is demonstrated, the above proof applies.
To this end, let $\{X_m\}_{m=1}^{\infty}$ and $X$ be defined as above. Then since $X_m(t) \to X(t)$ for all $t \in (0,1)$, Fatou's lemma of Proposition III.2.34 obtains:
$$E\left[|X|^{2n}\right] \le \liminf_m E\left[|X_m|^{2n}\right] = \liminf_m \mu'_{m,2n} < \infty.$$
Proposition 4.74 (When $F_m \Rightarrow F$ implies that $M_m(t) \to M(t)$) Let $\{F_m\}_{m=1}^{\infty}$ and $F$ be distribution functions with respective moment generating functions $\{M_m(t)\}_{m=1}^{\infty}$ and $M(t)$, all defined on a common interval $(-t_0, t_0)$ with $t_0 > 0$.
If $F_m \Rightarrow F$ and $\{M_m(t)\}_{m=1}^{\infty}$ is bounded for each $t \in (-t_0, t_0)$, then $M_m(t) \to M(t)$ for $t \in (-t_0, t_0)$.
Proof. Using the notation of the proof of Proposition 4.72:
$$\int_{(0,1)} \exp\left[t X_m^{(N)}(s)\right] dm_L(s) \to \int_{(0,1)} \exp\left[t X^{(N)}(s)\right] dm_L(s),$$
The last integral converges to 0 as $N \to \infty$ for all $t \in (-t_0, t_0)$ by the existence of $M(t)$. To complete the proof, we show that the first term in this upper bound will also have limit 0 as $N \to \infty$, by proving that $\{\exp[tX_m(s)]\}_{m=1}^{\infty}$ are uniformly integrable for $t \in (-t_0, t_0)$.
Exercise 4.70 proves that uniform integrability follows if for some $\epsilon > 0$:
$$\sup_m \int_{(0,1)} \left(\exp[tX_m(s)]\right)^{1+\epsilon} dm_L(s) = K < \infty.$$
By definition:
$$\int_{(0,1)} \left(\exp[tX_m(s)]\right)^{1+\epsilon} dm_L(s) = M_m(t[1+\epsilon]),$$
and hence this is bounded by assumption when $t(1+\epsilon) \in (-t_0, t_0)$. For $t \in (-t_0, t_0)$, such $\epsilon > 0$ always exists.
The proof is complete by the same final step as for Proposition 4.72.
As was the case for Proposition 4.72 as confirmed in Corollary 4.73, the existence of
M (t) could have been part of the conclusion of the prior result.
Corollary 4.75 (When $F_m \Rightarrow F$ implies that $M_m(t) \to M(t)$) Let $\{F_m\}_{m=1}^{\infty}$ be distribution functions with moment generating functions $\{M_m(t)\}_{m=1}^{\infty}$ defined on a common interval $(-t_0, t_0)$ with $t_0 > 0$. If $F_m \Rightarrow F$ and $\{M_m(t)\}_{m=1}^{\infty}$ is bounded for each $t \in (-t_0, t_0)$, then $F$ has a moment generating function $M(t)$, and $M_m(t) \to M(t)$ for $t \in (-t_0, t_0)$.
Proof. As for Corollary 4.73, the existence of $M(t)$ for each $t \in (-t_0, t_0)$ follows from Fatou's lemma of Proposition III.2.34.
We are now ready for the main results on method of moments. These address the
reverse implications to those above. Namely, when can we infer that Fm ⇒ F from conver-
gence of moments, or convergence of moment generating functions?
As noted in the introduction, it will always be assumed that the distribution function $F$ is uniquely determined by its moment collection $\{\mu'_n\}_{n=1}^{\infty}$. Proposition 4.61 and Corollary 4.62 provide criteria which assure this.
While Proposition 4.76 on convergence of moments is addressed first, it is Proposition
4.77 on convergence of the associated moment generating functions that will provide the
result in the form most commonly used in applications. Many important results will be
derived based on this proposition in Section 6.2.
Proposition 4.76 (Method of moments: When $\mu'_{m,n} \to \mu'_n$ for each $n$ implies $F_m \Rightarrow F$) Assume that a distribution function $F$ is uniquely determined by its moment collection $\{\mu'_n\}_{n=1}^{\infty}$.
If $\{F_m\}_{m=1}^{\infty}$ is a sequence of distribution functions with moment collections $\{\{\mu'_{m,n}\}_{n=1}^{\infty}\}_{m=1}^{\infty}$, and $\mu'_{m,n} \to \mu'_n$ for each $n$, then $F_m \Rightarrow F$.
Proof. Since $\mu'_{m,2} \to \mu'_2$, the collection $\{\mu'_{m,2}\}_{m=1}^{\infty}$ is bounded, say by $K$. Let $\{X_m\}_{m=1}^{\infty}$ be random variables with distribution functions $\{F_m\}_{m=1}^{\infty}$, constructed for example using Skorokhod's representation theorem of Proposition II.8.30. Then by Chebyshev's inequality in (4.83), for all $m$:
$$\Pr[|X_m| \ge t] \le K/t^2.$$
This implies by Definition 4.63 that the collection of distribution functions $\{F_m\}_{m=1}^{\infty}$ is tight.
By Helly's selection theorem of Proposition II.8.14, there exists a subsequence $\{F_{m_k}\}_{k=1}^{\infty}$ and a right continuous, increasing function $\tilde{F}$, so that $F_{m_k}(x) \to \tilde{F}(x)$ at all continuity points of $\tilde{F}$. By Proposition II.8.20, $\tilde{F}$ is a distribution function because $\{F_m\}_{m=1}^{\infty}$ is tight, and thus $F_{m_k} \Rightarrow \tilde{F}$.
Now since $\mu'_{m,n} \to \mu'_n$ for each $n$, it follows that $\mu'_{m_k,n} \to \mu'_n$ for each $n$ for this subsequence. If it can be proved that $\mu'_n$ is the $n$th moment of $\tilde{F}$, it would follow that $\tilde{F} = F$ by the assumption that $F$ is uniquely determined by its moment collection. Then by Corollary II.8.22 to Helly's selection theorem, since every such Helly subsequence satisfies $F_{m_k} \Rightarrow F$, we can conclude that $F_m \Rightarrow F$.
To prove that $\mu'_n$ is the $n$th moment of $\tilde{F}$, note that the convergence assumption $\mu'_{m,n} \to \mu'_n$ for each $n$ assures that $\{\mu'_{m_k,2n}\}_{k=1}^{\infty}$ is bounded for all $n$. Thus from Corollary 4.73, $F_{m_k} \Rightarrow \tilde{F}$ assures that $\tilde{F}$ has moments $\{\tilde{\mu}'_n\}_{n=1}^{\infty}$ and $\mu'_{m_k,n} \to \tilde{\mu}'_n$ for each $n$. But since $\mu'_{m,n} \to \mu'_n$ for each $n$, we conclude that $\tilde{\mu}'_n = \mu'_n$ for all $n$.
Proof. The existence of $M(t)$ on $(-t_0, t_0)$ assures that $F$ is uniquely determined by its moments by Corollary 4.62, and that its moments are given by $\mu'_n = M^{(n)}(0)$ by Proposition 4.26. Let $\{X_m\}_{m=1}^{\infty}$ be random variables with distribution functions $\{F_m\}_{m=1}^{\infty}$.
By (4.85), for any $t' \in (-t_0, t_0)$:
$$\Pr[|X_m| \ge t] \le e^{-tt'}\left[M_m(t') + M_m(-t')\right] \le c(t')\,e^{-tt'}.$$
This upper bound exists since $M_m(\pm t') \to M(\pm t')$ for all $t'$. Letting $t \to \infty$ obtains that the collection of distribution functions $\{F_m\}_{m=1}^{\infty}$ is tight.
As in the above proof, there is a subsequence $\{F_{m_k}\}_{k=1}^{\infty}$ and a distribution function $\tilde{F}$ so that $F_{m_k} \Rightarrow \tilde{F}$. The proof that $\tilde{M}(t)$ exists and $M_{m_k}(t) \to \tilde{M}(t)$ for $t \in (-t_0, t_0)$, and thus that $\tilde{M}(t) = M(t)$, follows from Corollary 4.75 as for the proof above. Corollary 4.62 on the uniqueness of the moment generating functions now assures that $\tilde{F} = F$ and thus $F_{m_k} \Rightarrow F$.
As this conclusion is true for any Helly subsequence, it follows as above that $F_m \Rightarrow F$.
5
Simulating Samples of RVs – Examples
We begin by recalling Proposition II.4.9. For this result, a continuous uniformly distributed random variable $U$ is defined on some probability space:
$$U : (S, \mathcal{E}, \lambda) \to ((0,1), B(0,1), m), \tag{5.1}$$
with $m$ denoting Lebesgue measure. The distribution function is $F_U(y) = y$ for $y \in (0,1)$, and this is an example of the uniform distribution of item 1 of Example 1.28.
Proposition II.4.9: Let $(S, \mathcal{E}, \lambda)$ be given, and $X : (S, \mathcal{E}, \lambda) \to (\mathbb{R}, B(\mathbb{R}), m)$ a random variable with distribution function $F(x)$ and left-continuous inverse $F^*(y)$ as given in (1.52) of Definition 1.29:
$$F^*(y) = \inf\{x \mid F(x) \ge y\}.$$
1. If $\{U_j\}_{j=1}^{M}$ are independent, continuous uniformly distributed random variables, then:
$$\{X_j\}_{j=1}^{M} \equiv \{F^*(U_j)\}_{j=1}^{M}$$
is a collection of independent random variables, each with distribution function $F(x)$.
In the next section we apply this result to develop random samples for a number of
the distributions introduced in Section 1.4. Following that, we investigate the generation of
random ordered samples.
We will assume throughout this chapter that one has access to computer software for generating the collections $\{U_j\}_{j=1}^{M}$ of independent, continuous uniformly distributed random variables. Generating samples of such $U$-variates is now very easy since most mathematical software has a built-in function which does exactly this. For example, in Microsoft Excel this function is called RAND(), while in MathWorks MATLAB it is rand, and so forth.
See Chapter II.4 for the theoretical framework for random samples, or the introduction
to Chapter 6 on Limit Theorems for a discussion of the Book II results.
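Proposition II.4.9 translates directly into code. The following Python sketch (the function names and the bisection tolerance are ours) computes the left-continuous inverse numerically by bisection, so it applies for $y \in (0,1)$ even when $F^*$ has no closed form; the sanity check uses the exponential distribution with $F(x) = 1 - e^{-x}$, whose sample mean should be near 1.

```python
import math
import random

def F_star(F, y, lo=-1e9, hi=1e9, tol=1e-9):
    """Left-continuous inverse F*(y) = inf{x | F(x) >= y}, found by bisection.
    Assumes F is an increasing distribution function with F(lo) < y <= F(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) >= y:
            hi = mid          # mid remains a candidate for the infimum
        else:
            lo = mid
    return hi

def sample(F, M):
    """A random sample {F*(U_j)} of size M from the distribution F."""
    return [F_star(F, random.random()) for _ in range(M)]

random.seed(1)
xs = sample(lambda x: 1 - math.exp(-x) if x > 0 else 0.0, 20_000)
print(sum(xs) / len(xs))  # approximately 1, the exponential mean for lambda = 1
```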
words, $X(S) = \{a_i\}$. Since $X$ is measurable by definition, $X^{-1}(a_i) \in \mathcal{E}$ for all $i$, and the probability density function $f(x)$ is defined on $\mathbb{R}$ by:
$$f(x) \equiv \lambda\left[X^{-1}(x)\right].$$
Thus $f(x) = 0$ for $x \notin \{a_i\}$, and otherwise $f(a_i)$ is often denoted $p_i$.
In theory, a discrete random variable can have $a \in \{a_i\}$ with $\lambda\left[X^{-1}(a)\right] = 0$, but this is never seen in practice. One can simply define a new random variable $X'$ to equal $X$ everywhere except on $X^{-1}(a)$, and on this set define $X'$ to be any other $a_i$-value. Then all probability statements about $X$ and $X'$ will be identical.
Thus it is always assumed that $p_i = f(a_i) > 0$ for all $i$, and it is commonly said that $f(a_i) \equiv \lambda\left[X^{-1}(a_i)\right]$ is the probability that $X = a_i$:
$$f(a_i) \equiv \Pr[X = a_i].$$
The associated distribution function $F(x)$ is defined:
$$F(x) \equiv \lambda\left[X^{-1}(-\infty, x]\right],$$
It follows from (5.2) that the distribution function $F(x)$ is defined as the bounded step function:
$$F(x) = \sum_{j \le i} p_j, \qquad x \in [a_i, a_{i+1}). \tag{5.3}$$
Note that $F(x)$ is right continuous, as is every distribution function by Notation II.3.5 and Proposition I.3.60.
Remark 5.1 (On $F(x)$ and $\{[a_i, a_{i+1})\}$) When $\{a_i\}$ has no accumulation points, it then follows that this collection is unbounded if and only if it is infinite. Thus it will only be true in case 3 that:
$$\bigcup_i [a_i, a_{i+1}) = \mathbb{R}.$$
However, by extension $F(x)$ is nonetheless well defined on $\mathbb{R}$ by (5.3) in the bounded or semi-bounded cases. In essence, we interpret $x \in [a_i, a_{i+1})$ to be well-defined as long as at least one of $a_i$ and $a_{i+1}$ is in the collection $\{a_i\}$. For example in case 1, $[a_N, a_{N+1}) \equiv [a_N, \infty)$, while $[a_{-1}, a_0) \equiv (-\infty, a_0)$. Thus:
As this type of distribution function is rarely encountered in applications, we will not dwell
on it, but note that the approach of this section can be adapted to these cases.
Given $F(x)$ as in (5.3) and defined by finite or infinite $\{a_i\}$, with no accumulation points in the latter case, the left-continuous inverse $F^*(y)$ as defined in (1.52) can be expressed as follows. With $F(a_i) \equiv \sum_{j \le i} p_j$ by (5.2), for $y \in (0,1)$:
$$F^*(y) = \sum_i a_i\,\chi_{(F(a_{i-1}), F(a_i)]}(y), \tag{5.4}$$
where $\chi_{(F(a_{i-1}), F(a_i)]}(y)$ is the characteristic function of the interval $(F(a_{i-1}), F(a_i)]$, defined to be 1 if $y \in (F(a_{i-1}), F(a_i)]$ and 0 otherwise.
Remark 5.2 (On $F^*(y)$ and $\{(F(a_{i-1}), F(a_i)]\}$) As was noted in Remark 5.1 for (5.3), we interpret $(F(a_{i-1}), F(a_i)]$ to be well-defined as long as one variable is in the collection $\{a_i\}$. For example in case 1,
• While $F^*(y)$ is formally represented as a finite or infinite sum, there is only one nonzero term for any $y$ due to the disjointedness of the interval collection $\{(F(a_{i-1}), F(a_i)]\}$.
• $F^*(y)$ is seen to be increasing and left continuous, as is true in general by Proposition II.3.16.
• By Proposition II.4.9 noted in the introduction to this chapter, if $U$ is a continuous uniformly distributed random variable on $(0,1)$, then $\tilde{X} \equiv F^*(U)$ has the same distribution function as the random variable $X$ that we started with. This follows because $\tilde{X} = a_i$ if $U \in (F(a_{i-1}), F(a_i)]$, and this occurs with probability:
$$F_U[F(a_i)] - F_U[F(a_{i-1})] = p_i.$$
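The sum in (5.4) translates into a simple scan of cumulative probabilities. A Python sketch (ours): with outcomes $\{a_i\}$ and probabilities $\{p_i\}$, $\tilde{X} = a_i$ exactly when $U$ lands in $(F(a_{i-1}), F(a_i)]$.

```python
import random

# Discrete inverse transform per (5.4): scan the cumulative probabilities
# F(a_i) until the uniform draw u falls in (F(a_{i-1}), F(a_i)].
def discrete_sample(a, p, M, rng=random.random):
    out = []
    for _ in range(M):
        u, cum = rng(), 0.0
        for ai, pi in zip(a, p):
            cum += pi                 # cum = F(a_i)
            if u <= cum:              # u in (F(a_{i-1}), F(a_i)]
                out.append(ai)
                break
    return out

random.seed(3)
xs = discrete_sample([0, 1, 2], [0.5, 0.3, 0.2], 100_000)
freqs = [xs.count(v) / len(xs) for v in (0, 1, 2)]
print(freqs)  # approximately [0.5, 0.3, 0.2]
```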
Alternatively, since $X^B_{n,j}$ is the sum of $n$ independent standard binomials, it can also be simulated by sums of variates simulated as in item 2. However, this is not a computationally efficient approach unless one is interested in both the component variates as well as the sum.
4. Geometric Distribution, parameter $p$, defined in (1.43):
An example of case 2 of Remark 5.1, here $\{a_i\} \equiv \{k\}_{k=0}^{\infty}$, which defines the collection $\{[a_i, a_{i+1})\}$, supplemented to include $(-\infty, 0)$.
Given independent, continuous uniform $\{U_j\}_{j=1}^{M}$, the variates $X_j^G \equiv F_G^*(U_j)$ are defined as in (5.4), but this calculation is simplified by the fact that $F_G^*(y)$ can be explicitly calculated. With $F_G(j)$ defined in (1.44) by $F_G(j) = 1 - (1-p)^{j+1}$, the definition in (1.52) produces:
$$F_G^*(y) = \inf\left\{j \;\Big|\; \frac{\ln(1-y)}{\ln(1-p)} \le j+1\right\}.$$
Using the ceiling function $\lceil y \rceil$ of (5.5), it follows that:
$$F_G^*(y) = \left\lceil \frac{\ln(1-y)}{\ln(1-p)} \right\rceil - 1.$$
Hence given $\{U_j\}_{j=1}^{M}$:
$$X_j^G = \left\lceil \frac{\ln(1-U_j)}{\ln(1-p)} \right\rceil - 1.$$
Since $U_j$ has the same distribution as $1 - U_j$, this can be expressed:
$$X_j^G = \left\lceil \frac{\ln U_j}{\ln(1-p)} \right\rceil - 1.$$
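This closed form makes the geometric case especially easy to simulate. A Python sketch (ours), under the convention here that $X^G$ counts failures before the first success, so that $E[X^G] = (1-p)/p$:

```python
import math
import random

# Geometric sampler via the closed-form inverse: X = ceil(ln U / ln(1 - p)) - 1.
# With density f(k) = p(1-p)^k on k = 0, 1, 2, ..., the mean is (1 - p)/p.
def geometric_sample(p, M, rng=random.random):
    return [math.ceil(math.log(rng()) / math.log(1 - p)) - 1 for _ in range(M)]

random.seed(5)
p = 0.3
xs = geometric_sample(p, 100_000)
print(sum(xs) / len(xs))  # approximately (1 - 0.3)/0.3 = 2.33
```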
is a Poisson random variable with parameter $\lambda$. As derived in the next section, from independent, continuous uniform $\{U_j\}_{j=1}^{M}$, such exponentials are generated by:
$$X_j^E = -\frac{\ln U_j}{\lambda}.$$
Further, these exponentials are independent by Proposition II.3.56.
Thus given independent continuous uniform $\{U_j\}_{j=1}^{M}$, independent Poisson random variables can be produced by:
$$X^P = \max\left\{n \;\Big|\; \prod_{j=1}^{n} U_j \ge e^{-\lambda}\right\}.$$
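This product criterion can be sketched directly (our code): since $\prod_{j=1}^{n} U_j \ge e^{-\lambda}$ is equivalent to $\sum_{j=1}^{n} (-\ln U_j)/\lambda \le 1$, the sampler counts exponential arrivals until their running sum exceeds 1.

```python
import math
import random

# Poisson sampler via products of uniforms: X = max{n : U_1 * ... * U_n >= e^(-lambda)}.
# Equivalently, count exponential(lambda) variates -ln(U_j)/lambda while their sum <= 1.
def poisson_sample(lam, M, rng=random.random):
    threshold = math.exp(-lam)
    out = []
    for _ in range(M):
        n, prod = 0, rng()
        while prod >= threshold:
            n += 1
            prod *= rng()
        out.append(n)
    return out

random.seed(11)
xs = poisson_sample(2.0, 100_000)
print(sum(xs) / len(xs))  # approximately lambda = 2
```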
$$X_j^E = -\frac{1}{\lambda}\ln(1 - U_j),$$
or equivalently:
$$X_j^E = -\frac{1}{\lambda}\ln U_j,$$
since $U_j$ has the same distribution as $1 - U_j$.
3. Gamma Distribution, parameters $\lambda$, $\alpha$, defined in (1.55):
The gamma distribution function is not explicitly invertible for general $\alpha$, but for $\alpha = k \in \mathbb{N}$ a positive integer, Exercise 2.15 proves that this random variable is the sum of $k$ independent exponential random variables with parameter $\lambda$. Hence, given independent, continuous uniform $\{U_j\}_{j=1}^{kM}$, we can define $\{X_j^\Gamma\}_{j=1}^{M}$ in terms of $k$-sums or $k$-products of exponential variables from item 2:
$$X_j^\Gamma = -\frac{1}{\lambda} \sum_{i=(j-1)k+1}^{jk} \ln U_i = -\frac{1}{\lambda} \ln \prod_{i=(j-1)k+1}^{jk} U_i.$$
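For integer $\alpha = k$, this product form gives a direct sampler. A Python sketch (ours); the sample mean should be near the gamma mean $k/\lambda$:

```python
import math
import random

# Gamma(lambda, alpha = k) sampler as a sum of k independent exponentials,
# each -ln(U)/lambda; the resulting mean is k/lambda.
def gamma_sample(lam, k, M, rng=random.random):
    return [-sum(math.log(rng()) for _ in range(k)) / lam for _ in range(M)]

random.seed(13)
xs = gamma_sample(0.5, 3, 100_000)
print(sum(xs) / len(xs))  # approximately k/lambda = 6
```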
$$G_\gamma^{-1}(y) = \frac{[-\ln y]^{-\gamma} - 1}{\gamma},$$
while for $\gamma = 0$:
$$G_0^{-1}(y) = -\ln[-\ln y].$$
$$X_j^{EV} = G_\gamma^{-1}(U_j),$$
• Generate numerical estimates of Φ(x) for discrete x values, then approximate Φ−1 (U )
with interpolation.
• Applying the central limit theorem of Proposition 6.13 obtains that for $n$ large, if $\{X_j\}_{j=1}^{n}$ are independent and identically distributed random variables with finite mean and variance, then the distribution function of the random variable:
$$Z_n \equiv \frac{\sum_{j=1}^{n} X_j - nE[X]}{\sqrt{n\,\operatorname{Var}[X]}}$$
is approximately that of a standard normal. Applying this with independent continuous uniform variates, for which $E[U] = 1/2$ and $\operatorname{Var}[U] = 1/12$, obtains the approximate normal variates:
$$Z_j^{\Phi} = \frac{\sum_{k=(j-1)n+1}^{jn} U_k - n/2}{\sqrt{n/12}}.$$
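A common concrete choice (our illustration, not prescribed by the text) is $n = 12$, since then $n/2 = 6$ and $\sqrt{n/12} = 1$, so $Z^\Phi = \sum_{k=1}^{12} U_k - 6$:

```python
import random

# Approximate standard normal via the CLT recipe with n = 12 uniforms;
# the normalization reduces to (sum - 6) since n/2 = 6 and n/12 = 1.
def normal_approx(M, rng=random.random):
    return [sum(rng() for _ in range(12)) - 6.0 for _ in range(M)]

random.seed(17)
zs = normal_approx(100_000)
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean ** 2
print(mean, var)  # approximately 0 and 1
```

Note that this approximation has bounded support $(-6, 6)$, so it cannot reproduce extreme normal tails.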
A very powerful approach to generating exact pairs of independent normal variates was
introduced in 1958 by G. E. P. Box (1919–2013) and Mervin E. Muller (1928–2018).
This approach is often referred to as the Box–Muller transform, since it transforms
independent pairs (U1 , U2 ) of continuous, uniform variates into independent pairs (Z1 , Z2 )
of standard normal variates.
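The transform itself is short to implement. A Python sketch of the classical form (ours): with $R = \sqrt{-2\ln U_1}$ and $\Theta = 2\pi U_2$, the pair $(R\cos\Theta, R\sin\Theta)$ is a pair of independent standard normals.

```python
import math
import random

# Box-Muller transform: maps independent uniforms (U1, U2) on (0,1) to a pair
# of independent standard normals via R = sqrt(-2 ln U1), theta = 2 pi U2.
def box_muller(rng=random.random):
    u1, u2 = rng(), rng()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

random.seed(19)
zs = [z for _ in range(50_000) for z in box_muller()]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean ** 2
print(mean, var)  # approximately 0 and 1
```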
The following proof requires a couple of results. The first has been noted before, while
the second generalizes the result in (2.4) of Proposition 2.1. Both will be formally derived
in Book VI using the integration theory of Book V:
where $F_i(x_i)$ is the distribution function of $X_i$. If all distribution functions are continuously differentiable with associated continuous density functions, then this implies that as Riemann integrals:
$$\int_{R_x} f(y)\,dy = \prod_{i=1}^{n} \int_{-\infty}^{x_i} f_i(y_i)\,dy_i,$$
where $y = (y_1, ..., y_n)$, $dy$ reflects the Riemann integral in $\mathbb{R}^n$ of Section III.1.3, and with $x = (x_1, ..., x_n)$, $R_x \equiv \prod_{i=1}^{n} (-\infty, x_i]$.
An application of Corollary III.1.77 for the integral on the left, justifying the lower limit extension to $-\infty$, obtains that for all $(x_1, ..., x_n)$:
$$\int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(y_1, ..., y_n)\,dy_1 \cdots dy_n = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} \prod_{i=1}^{n} f_i(y_i)\,dy_1 \cdots dy_n.$$
It will be seen in Book VI that this conclusion does not require continuity of densities,
but then requires the Lebesgue generalization of the Riemann Corollary III.1.77, to be
derived in Fubini’s or Tonelli’s theorems. Further, with these results, it is only necessary
to assume that {Xi }ni=1 have density functions, and then the density function for X is
assured to exist. This last observation is needed below.
2. Densities of transformed random vectors: Let X ≡ (X1 , ..., Xn ) be a random vec-
tor defined on a probability space (S, E, λ) with distribution function F (x1 , ..., xn ) and
density f (x1 , ..., xn ). Define the random variable Y = g(X) where g : Rn → Rn is con-
tinuously differentiable and one-to-one. What can be said about the distribution and
density functions of Y ?
Generalizing (2.4) of Proposition 2.1, it will be proved in Book VI that:
$$f_Y(y) = f\left(g^{-1}(y)\right)\left|\det\frac{\partial g^{-1}}{\partial y}\right|, \qquad (5.7)$$
where $\det\frac{\partial g^{-1}}{\partial y}$ is the determinant of the Jacobian matrix of the transformation $g^{-1}$.
This transformation is well-defined by the assumption that g is one-to-one.
Given a transformation g −1 (y), which we denote by T : Rn → Rn for notational sim-
plicity, denote:
T (y) = (t1 (y), t2 (y), · · · , tn (y)) ,
where $t_i : \mathbb{R}^n \to \mathbb{R}$ for all i. When this transformation has differentiable component functions, the Jacobian matrix associated with T, denoted $\frac{\partial T}{\partial y}$ or $T'(y)$, is defined by:
$$\frac{\partial T}{\partial y} \equiv \begin{pmatrix} \frac{\partial t_1}{\partial y_1} & \frac{\partial t_1}{\partial y_2} & \cdots & \frac{\partial t_1}{\partial y_n} \\ \frac{\partial t_2}{\partial y_1} & \frac{\partial t_2}{\partial y_2} & \cdots & \frac{\partial t_2}{\partial y_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial t_n}{\partial y_1} & \frac{\partial t_n}{\partial y_2} & \cdots & \frac{\partial t_n}{\partial y_n} \end{pmatrix}. \qquad (5.8)$$
This matrix and its determinant are named for Carl Gustav Jacob Jacobi (1804–
1851), an early developer of determinants and their applications in analysis.
Then $T_1$ and $T_2$ are Student T variates with ν degrees of freedom, but are not independent random variables.
Proof. Define a random variable $R \equiv \sqrt{\nu\left(U_1^{-2/\nu} - 1\right)}$. Since $U_1 : (S', E', \lambda') \to ((0,1), \mathcal{B}(0,1), m)$ from (5.1), R is well defined. A calculation obtains the distribution $F_R(r)$:
$$F_R(r) = 1 - F_U\left(\exp\left(-\frac{\nu}{2}\ln\left(\frac{r^2+\nu}{\nu}\right)\right)^-\right),$$
where F (x− ) denotes the left limit at x. By continuity of FU (x) = x it follows that:
$$F_R(r) = 1 - \left(\frac{r^2+\nu}{\nu}\right)^{-\nu/2},$$
and the associated density function is given by Proposition III.1.33:
$$f_R(r) = r\left(\frac{r^2+\nu}{\nu}\right)^{-\nu/2-1}.$$
Define the random variable S = 2πU2 , and thus fS (s) = 1/2π on (0, 2π). Independence
of U1 and U2 assures independence of R and S by Proposition II.3.56. Thus as noted above,
the joint density fR,S (r, s) exists on (0, ∞) × (0, 2π) and is given by:
$$f_{R,S}(r,s) = \frac{1}{2\pi}\,r\left(\frac{r^2+\nu}{\nu}\right)^{-\nu/2-1}.$$
Recall the transformation g : (0, ∞) × (0, 2π) → R2 in the proof of Proposition 5.4:
g(r, s) = (r cos s, r sin s),
with inverse:
$$g^{-1}(x,y) = \left(\sqrt{x^2+y^2},\ \arctan(y/x)\right).$$
Define as in (5.10):
(T1 , T2 ) = g(R, S),
and the joint density function $f_{T_1,T_2}(x,y)$ can be obtained from (5.7):
$$f_{T_1,T_2}(x,y) = f_{R,S}\left(\sqrt{x^2+y^2},\ \arctan(y/x)\right)\left|\det\frac{\partial g^{-1}}{\partial(x,y)}\right| = \frac{1}{2\pi}\sqrt{x^2+y^2}\left(\frac{x^2+y^2}{\nu}+1\right)^{-\nu/2-1}\left|\det\frac{\partial g^{-1}}{\partial(x,y)}\right|.$$
The Jacobian matrix has determinant $1/\sqrt{x^2+y^2}$ as above, and thus:
$$f_{T_1,T_2}(x,y) = \frac{1}{2\pi}\left(1+\frac{x^2+y^2}{\nu}\right)^{-\nu/2-1}. \qquad (1)$$
and thus by Proposition III.1.33, the associated density function is seen to be:
$$f_{T_1}(x) = \int_{-\infty}^{\infty} f_{T_1,T_2}(x,w)\,dw.$$
which is seen to follow from the derivative of $f(t) = x^{-t}$ at $t = 0$. Thus for ν large, the
Bailey transform is very similar to the Box–Muller transform.
This is consistent with the result of Proposition 6.4 that as ν → ∞, Student’s T converges
in distribution to the standard normal. In the notation of Definition II.5.19, Tν →d Z, or
Tν ⇒ Z, as ν → ∞.
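The construction in the proof above can be sketched in code. This is an illustrative sketch only (function names are ours), using $R = \sqrt{\nu(U_1^{-2/\nu}-1)}$, $S = 2\pi U_2$, and $(T_1, T_2) = (R\cos S, R\sin S)$ as in (5.10):

```python
import math
import random

def bailey_t_pair(u1, u2, nu):
    """Generate a pair of (dependent) Student T variates with nu degrees of
    freedom from two independent uniforms, via R = sqrt(nu*(u1**(-2/nu) - 1))
    and S = 2*pi*u2, as in the proof above."""
    r = math.sqrt(nu * (u1 ** (-2.0 / nu) - 1.0))
    s = 2.0 * math.pi * u2
    return r * math.cos(s), r * math.sin(s)

rng = random.Random(2)
nu = 5
ts = [bailey_t_pair(1.0 - rng.random(), rng.random(), nu)[0]
      for _ in range(100_000)]
mean = sum(ts) / len(ts)
var = sum(v * v for v in ts) / len(ts) - mean ** 2
# For nu = 5 the T distribution has mean 0 and variance nu/(nu - 2) = 5/3.
```

By Exercise 5.8, taking only one variate per uniform pair, as here, produces an independent sample.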
Exercise 5.8 (Independent $\{T_j\}_{j=1}^M$) Given independent, continuous uniform $\{U_j\}_{j=1}^{2M}$, assume that (5.10) is applied in pairs, so that $T_{2k}$ and $T_{2k+1}$ are generated from $U_{2k}$ and $U_{2k+1}$. Prove that $\{T(k)\}_{k=1}^{M}$ are independent, Student T variates with ν degrees of freedom, where $T(k) \in \{T_{2k}, T_{2k+1}\}$ for each k. Hint: Proposition II.3.56.
Here we assume that $X_i = F^*(U_i)$ for notational simplicity, but this is not essential.
In the case of normal and Student T variates, for example, these variates would be
generated from pairs of uniform variates and the formulas in (5.9) and (5.10).
Each M-sample $\{F^*(U_i)\}_{i=(j-1)M+1}^{jM}$ of X variates is then reordered to determine $\{X_{(k)j}\}_{k=1}^{M}$:
$$\{F^*(U_i)\}_{i=(j-1)M+1}^{jM} \to \left\{X_{(k)j}\right\}_{k=1}^{M}, \quad j = 1, ..., N,$$
$$\left\{\left\{X_{(k)j}\right\}_{k=1}^{M}\right\}_{j=1}^{N} = \left\{\left\{F^*\left(U_{(k)j}\right)\right\}_{k=1}^{M}\right\}_{j=1}^{N}.$$
Thus it is not necessary to evaluate $F^*(U_i)$ for every $U_i$ in the various M-samples unless all order statistics are required. For example, if only a sample of $X_{(k)}$ is required for a fixed k, then only $F^*\left(U_{(k)j}\right)$ need be evaluated for such $U_{(k)j}$.
When seeking ordered samples for normal, lognormal and Student T, this alternative
approach is not an option since the ordering of these variates is not predictable based
on the orderings of the pairs of continuous, uniform variates.
Summary: When $X_i = F^*(U_i)$, this approach requires MN continuous, uniformly distributed random variables $\{U_i\}_{i=1}^{MN}$, and KN evaluations of $F^*\left(U_{(k)j}\right)$ for $1 \le K \le M$, to produce an N-sample of the order statistics $\left\{\{X_{(k_i)j}\}_{i=1}^{K}\right\}_{j=1}^{N}$.
When $X_i$ is generated from pairs of uniform variates, this approach requires MN (normal, lognormal) or 2MN (Student T) continuous, uniformly distributed random variables, and an equal number of evaluations of the formulas in (5.9) and (5.10).
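The complete-sample method of item 1 can be sketched as follows. This is an illustration under the assumption of a standard exponential target, so that $F^*(u) = -\ln(1-u)$; the function name is ours:

```python
import math
import random

def kth_order_exponential_samples(M, N, k, seed=3):
    """N-sample of the kth order statistic from M-samples of standard
    exponentials: sort the underlying uniforms, then apply the increasing
    inverse F*(u) = -ln(1 - u) only to the uniform order statistic needed."""
    rng = random.Random(seed)
    out = []
    for _ in range(N):
        us = sorted(rng.random() for _ in range(M))
        out.append(-math.log(1.0 - us[k - 1]))  # F* preserves the ordering
    return out

samples = kth_order_exponential_samples(M=9, N=20_000, k=5)
avg = sum(samples) / len(samples)
```

As a check on such a simulation, the mean of the 5th order statistic of 9 standard exponentials is $\sum_{i=5}^{9} 1/i \approx 0.746$, and the sample average should be close to this.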
2. Direct kth Order Statistics 1; when X = F ∗ (U ) is calculable
Recall Example 3.9, that the kth order statistic $U_{(k)}$ from an M-sample of continuous, uniform variates has a beta distribution with $v = k$ and $w = M-k+1$. This would appear promising for the generation of N-samples $\{U_{(k)j}\}_{j=1}^{N}$ of kth order uniform variates, and then N-samples $\{X_{(k)j}\}_{j=1}^{N}$ in the case where $X_i = F^*(U_i)$.
But based on item 4 of the prior section, to generate one beta random variable with $v = k$ and $w = M-k+1$ requires $v + w = M+1$ continuous, uniformly distributed random variables. Thus to generate $\{U_{(k)j}\}_{j=1}^{N}$ requires $\{U_i\}_{i=1}^{(M+1)N}$, and this is more than that needed for the complete sample method in item 1 despite seeking only a sample of one order variate.
Summary: When $X_i = F^*(U_i)$, this approach requires $(M+1)N$ continuous, uniformly distributed random variables $\{U_i\}_{i=1}^{(M+1)N}$, with each $(M+1)$-sequence used to produce one beta $U_{(k)j}$ as above, and then N evaluations of $F^*\left(U_{(k)j}\right)$ to produce an N-sample of the kth order statistics $\{X_{(k)j}\}_{j=1}^{N}$.
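The beta approach of item 2 can be sketched directly. In this illustration, Python's random.betavariate stands in for the $(M+1)$-uniform construction of item 4 of the prior section (the function name is ours):

```python
import random

def kth_order_uniform_direct(M, N, k, seed=4):
    """N-sample of the kth order statistic U_(k) of an M-sample of uniforms,
    drawn directly from its Beta(k, M - k + 1) distribution per Example 3.9.
    random.betavariate stands in for the (M + 1)-uniform construction."""
    rng = random.Random(seed)
    return [rng.betavariate(k, M - k + 1) for _ in range(N)]

samples = kth_order_uniform_direct(M=9, N=20_000, k=5)
avg = sum(samples) / len(samples)
# The mean of a Beta(k, M - k + 1) variate is k/(M + 1) = 0.5 here.
```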
3. Direct kth Order Statistics 2; when $X_{(k)} = F_{(k)}^*(U)$ is calculable
Since the distribution function $F_{(k)}(x)$ for $X_{(k)}$ in (3.5) is known, if it is possible to determine its left-continuous inverse $F_{(k)}^*(y)$, then by definition only one N-sample $\{U_j\}_{j=1}^{N}$ would be needed to obtain a sample of $X_{(k)}$:
$$\left\{X_{(k)j}\right\}_{j=1}^{N} = \left\{F_{(k)}^*(U_j)\right\}_{j=1}^{N}.$$
However, since even the mathematically simple uniform distribution produced $F_{(k)}(x)$ with the complexity of the beta distribution, it is unlikely that many examples will be encountered for which one can calculate $F_{(k)}^*(y)$ directly. But in general, to determine $F_{(k)}^*(U)$ using (3.5) requires two steps, the first of which is almost surely numerical:
$$\{U_j\}_{j=1}^{n} \equiv \left\{1 - \exp\left(-X_j^E\right)\right\}_{j=1}^{n},$$
Both the distributional results and the independence results are Proposition II.4.9 in the
introduction to this chapter, since FE is continuous.
Continuing the investigations of Chapters II.5 and II.8, in this chapter we study limit
theorems grouped into three categories:
This study will be supplemented in Book VI with more general results related to weak
convergence of measures and the central limit theorem.
6.1 Introduction
Throughout this chapter we will repeatedly encounter the notion of a collection of n independent random variables, often identically distributed but sometimes not, and will typically want to address measure-theoretic questions related to the summation of such variables, and the limits of such sums as $n \to \infty$. This appears intuitively plausible and yet requires some justification to avoid building the house of emerging theories on a foundation of sand. Among the questions that need to be addressed are:
Q1. Given a random variable $X : (S, E, \lambda) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ with distribution function F(x), we will want to study properties of a sample $\{X_j\}_{j=1}^{n}$ of X. By a sample or n-sample is meant (Definition 3.2) that these are independent random variables defined on some probability space, that are identically distributed as X.
The fundamental questions are:
How is this sample constructed, and on what probability space is this sample defined?
If $\{X_j\}_{j=1}^{n}$ is then such a sample of X defined on the space identified, then an expression like $\sum_{j=1}^{n} X_j$ is again a random variable on this same space by Proposition I.3.30. Thus the measure on this space provides meaning to probability statements on such summations, and underlies the definition of the associated distribution function.
Q2. If $\{X_j\}_{j=1}^{n}$ is a sample of X defined on the probability space identified in Q1, what does it mean to let $n \to \infty$? Is this increasing collection still defined on the same space so that probability statements regarding this sum, or properties of its distribution function, are all defined relative to a given probability measure?
Q3. More generally, if $X_j : (S_j, E_j, \lambda_j) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ are random variables with distribution functions $F_j(x)$, is there a common probability space on which all can be defined, and on which these will be independent random variables? If so, then an expression like
In Chapter II.4, the construction sought in Q1 was developed using the infinite dimensional probability space theory of Chapter I.9. While this construction targeted the application to Q1, the Book I theory applied equally well to the more general construction needed for Q3. We summarize below the Book II construction in this more general context, noting that the Book II result can be viewed as a change of notation. While the Book II construction allowed for probability spaces for either finite or infinite samples, we summarize the infinite dimensional models to simultaneously address Q2.
To this end, assume that we are given random variables $\{X_j\}_{j=1}^{\infty}$ defined on probability spaces $\{(S_j, E_j, \lambda_j)\}_{j=1}^{\infty}$ with distribution functions $\{F_j\}_{j=1}^{\infty}$. In the Q1 application of Book II, we started with an infinite collection of copies of X and $(S, E, \lambda)$, and the indexing just allowed a basis for referencing members of these collections. As summarized in Propositions 1.3 and 1.5, each $F_j$ is increasing and right continuous, and gives rise to a Borel measure $\mu_{F_j}$ defined on $\mathcal{B}(\mathbb{R})$ that is defined on right semi-closed intervals by:
Thus $\mu_{F_j}(\mathbb{R}) = 1$ and $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_{F_j})$ is a probability space for all j.
We can now apply the construction of Chapter I.9 to $\{(\mathbb{R}_j, \mathcal{B}(\mathbb{R}_j), \mu_{F_j})\}_{j=1}^{\infty}$, where $\mathbb{R}_j$ and $\mathcal{B}(\mathbb{R}_j)$ are indexed only for notational purposes. In addition, all $\mu_{F_j}$ will equal $\mu_F$ in the Q1 context of a single random variable, but are also allowed to be different to accommodate the Q3 model.
The infinite dimensional probability space $(\mathbb{R}^{\mathbb{N}}, \sigma(\mathcal{A}^+), \mu_{\mathbb{N}})$, and a complete counterpart $(\mathbb{R}^{\mathbb{N}}, \sigma(\mathbb{R}^{\mathbb{N}}), \mu_{\mathbb{N}})$, can now be constructed as in Chapter I.9. For this probability space:
$$\mathbb{R}^{\mathbb{N}} \equiv \{(x_1, x_2, ...)\,|\,x_j \in \mathbb{R}_j\},$$
and $\mu_{\mathbb{N}}$ is uniquely defined on $\sigma(\mathcal{A}^+)$, the smallest sigma algebra containing the algebra $\mathcal{A}^+$ of general finite dimensional measurable rectangles or general cylinder sets in $\mathbb{R}^{\mathbb{N}}$. Thus:
$$\mathcal{A}^+ \subset \sigma\left(\mathcal{A}^+\right) \subset \sigma(\mathbb{R}^{\mathbb{N}}). \qquad (6.2)$$
In more detail, H ∈ A+ if for some positive integer n and n-tuple of positive integers
J = (j(1), j(2), ..., j(n)) :
which is the projection mapping defined on $\mathbb{R}^{\mathbb{N}}$ to the jth coordinate, and often denoted $\pi_j$. We verify below that such $X_j'$ is measurable, and thus a random variable on this space.
The following result is a modest generalization of Proposition II.4.4. It uses the same
Book I theory, and answers the above questions.
For the application to Q1, all $(S_j, E_j, \lambda_j) = (S, E, \lambda)$, all $X_j = X$, meaning each has the distribution of X, and thus all $F_j = F$. Then for the conclusion, each $X_j'$ has the distribution of X. In other words, $\{X_j'\}_{j=1}^{\infty}$ are independent and identically distributed in this case. This is essentially the context of Proposition II.4.4.
Proposition 6.1 (A probability space for independent $\{X_j\}_{j=1}^{\infty}$) Let probability spaces $\{(S_j, E_j, \lambda_j)\}_{j=1}^{\infty}$ and random variables $X_j : S_j \to \mathbb{R}$ with distribution functions $\{F_j\}_{j=1}^{\infty}$ be given. With the notation above, let $(S', E', \lambda')$ denote $(\mathbb{R}^{\mathbb{N}}, \sigma(\mathbb{R}^{\mathbb{N}}), \mu_{\mathbb{N}})$ or $(\mathbb{R}^{\mathbb{N}}, \sigma(\mathcal{A}^+), \mu_{\mathbb{N}})$ of Proposition I.9.20.
Then $\{X_j'\}_{j=1}^{\infty}$ as defined in (6.4) is a sample of $\{X_j\}_{j=1}^{\infty}$, meaning these are independent random variables defined on $(S', E', \lambda')$ with respective distribution functions $\{F_j\}_{j=1}^{\infty}$.
Proof. First, $X_j'$ is measurable and thus a random variable on $\mathbb{R}^{\mathbb{N}}$ since for $A \in \mathcal{B}(\mathbb{R})$, (6.4) yields:
$$\left(X_j'\right)^{-1}(A) = \{x \in \mathbb{R}^{\mathbb{N}}\,|\,x_j \in A\} \in \mathcal{A}^+.$$
Thus $\left(X_j'\right)^{-1}(A)$ is a general cylinder set and an element of either sigma algebra by (6.2). Further, if $A \in \mathcal{B}(\mathbb{R})$ and $H \equiv \left(X_j'\right)^{-1}(A)$, then from (6.3) and (6.1):
$$\mu'\left[\left(X_j'\right)^{-1}(A)\right] = \mu_{F_j}(A) \equiv \lambda_j\left(X_j^{-1}(A)\right),$$
and thus $\{X_{j(k)}'\}_{k=1}^{n}$ are independent random variables.
Notation 6.2 It is somewhat of a notational burden to distinguish between $\{X_j\}_{j=1}^{\infty}$ defined on $\{(S_j, E_j, \lambda_j)\}_{j=1}^{\infty}$, and independent $\{X_j'\}_{j=1}^{\infty}$ defined on $(S', E', \lambda')$. So this formality is often suppressed and statements are made about independent, and possibly identically distributed, random variables $\{X_j\}_{j=1}^{\infty}$ defined on a probability space $(S, E, \mu)$.
With the construction of Proposition 6.1, the above questions can be answered:
If $\{X_j\}_{j=1}^{n}$ are random variables defined on the new space $(S, E, \mu)$, then $\sum_{j=1}^{n} X_j$ is a measurable function on $(S, E, \mu)$ by Proposition I.3.30 and thus a random variable. Since probability statements on such $\sum_{j=1}^{n} X_j$ are made in this probability space for any n, these statements remain well defined as $n \to \infty$.
Remark 6.3 The above framework will not be formally mentioned again in the next two
sections on weak convergence of distributions and laws of large numbers.
In the final section on convergence of empirical distribution functions, some additional
comments will be required.
At $\pm\infty$, this integrand has order of magnitude $O\left(t^{m-(n+1)}\right)$, and thus this integral will converge only when $m-(n+1) < -1$, or for $m \le n-1$. However, using the more powerful tools of characteristic functions in Book VI, a proof of this result in the flavor of Proposition 4.77 will be possible.
More directly, for all $s \in \mathbb{R}$:
$$n\ln(1+s/n) \to s \text{ as } n \to \infty.$$
This follows from the definition of $f'(0)$ for $f(x) = \ln(1+sx)$, noting that this function is well-defined in an open interval about $x = 0$ for any s. This then obtains:
$$(1+s/n)^n \to e^s \text{ as } n \to \infty. \qquad (6.5)$$
which is the functional part of the standard normal density function $\varphi(t)$ in (1.66). The coefficient derivation, that:
$$\frac{\Gamma((n+1)/2)}{\sqrt{\pi n}\,\Gamma(n/2)} \to \frac{1}{\sqrt{2\pi}},$$
will certainly be messier and is not pursued here. But even once done, it will still take an integration to the limit result (recall Section III.2.6) to prove that $f_{T_n}(t) \to \varphi(t)$ for all t implies that for all x:
$$F_{T_n}(x) = \int_{-\infty}^{x} f_{T_n}(t)\,dt \to \int_{-\infty}^{x}\varphi(t)\,dt = \Phi(x).$$
These details can be settled successfully, but there is a proof that avoids such technical
details. It uses a weak law of large numbers result (6.30) of Proposition 6.35, and then
Slutsky’s theorem from Book II.
Proposition 6.4 ($F_{T_n} \Rightarrow \Phi$) Let $F_{T_n}(x)$ denote the distribution function of Student's T with $n \in \mathbb{N}$ degrees of freedom, and $\Phi(x)$ the distribution function of the standard normal. Then as $n \to \infty$:
$$F_{T_n} \Rightarrow \Phi. \qquad (6.6)$$
Proof. Recall from item 3 of Example 2.28 that $f_{T_n}(t)$ is the density function of the random variable:
$$T_n \equiv \frac{X}{\sqrt{Y/n}},$$
where X is standard normal, and Y is chi-squared with n degrees of freedom.
If $\{X_i\}_{i=1}^{n}$ are independent standard normals, then $\{X_i^2\}_{i=1}^{n}$ are independent by Proposition II.3.56, and then $\sum_{i=1}^{n} X_i^2$ is chi-squared with n degrees of freedom by item 5 of Section 4.4.1. Thus by (6.30):
$$\frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i^2 \to_P E\left[X_1^2\right] = 1.$$
Convergence in probability was introduced in Definition II.5.11, and the above result means that for every $\epsilon > 0$:
$$\lim_{n\to\infty}\Pr\left[\left|\frac{Y}{n} - 1\right| \ge \epsilon\right] = 0.$$
It then follows that $\sqrt{Y/n} \to_P 1$, since $\left|\sqrt{Y/n} - 1\right| \le \left|Y/n - 1\right|$ by $\sqrt{Y/n} + 1 \ge 1$, and thus:
$$\left\{\left|\sqrt{Y/n} - 1\right| \ge \epsilon\right\} \subset \left\{\left|\frac{Y}{n} - 1\right| \ge \epsilon\right\}.$$
Further, $X \to_d X$ by definition, since defining $X_n' = X$ for all n obtains that $X_n' \to_d X$. Summarizing, $\sqrt{Y/n} \to_P 1$ and $X \to_d X$. By item 3 of Slutsky's theorem of Proposition II.5.29, it then follows that:
$$\frac{X}{\sqrt{Y/n}} \to_d X,$$
and the proof is complete.
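The ratio construction in this proof can be simulated directly. A sketch (function name ours), showing that for n moderately large the tail probabilities of $T_n = X/\sqrt{Y/n}$ are close to those of the standard normal:

```python
import math
import random

def student_t_via_ratio(n, num_samples, seed=5):
    """Simulate T_n = X / sqrt(Y/n), with X standard normal and Y a sum of
    n squared standard normals (chi-squared with n degrees of freedom)."""
    rng = random.Random(seed)
    out = []
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        y = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        out.append(x / math.sqrt(y / n))
    return out

ts = student_t_via_ratio(n=100, num_samples=10_000)
frac_beyond_2 = sum(1 for v in ts if abs(v) > 2.0) / len(ts)
# For the standard normal limit, P[|Z| > 2] is approximately 0.0455.
```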
Thus:
$$\mu_{B_{n,\lambda/n}} = \mu_{P_\lambda}, \quad \sigma^2_{B_{n,\lambda/n}} \to \sigma^2_{P_\lambda} \text{ as } n \to \infty,$$
where $\mu_{P_\lambda}$ and $\sigma^2_{P_\lambda}$ are the mean and variance of the Poisson with parameter λ by (4.69).
Remarkably, this next result proves that not only do a couple of binomial moments converge
to the Poisson moments, but the entire binomial distribution function converges to the
Poisson distribution.
We will see in the next section that the Poisson can also be the weak limit of sums of
independent but nonhomogeneous binomials, meaning with different values of p.
Proposition 6.5 (Poisson limit theorem) Let $F_{B_{n,p}}$ denote the distribution function of the general binomial with parameters n, p associated with the density function in (1.40), and $F_{P_\lambda}(j)$ the distribution function of the Poisson with parameter λ associated with the density function in (1.47).
If $\lambda = np$ is fixed, then as $n \to \infty$:
$$F_{B_{n,\lambda/n}} \Rightarrow F_{P_\lambda}. \qquad (6.7)$$
Proof. Let $M_{B_{n,\lambda/n}}(t)$ and $M_{P_\lambda}(t)$ denote the respective moment generating functions of (4.63) and (4.68):
$$M_{B_{n,\lambda/n}}(t) = \left(1 + \frac{\lambda}{n}\left(e^t - 1\right)\right)^n,$$
and:
$$M_{P_\lambda}(t) = \exp\left[\lambda\left(e^t - 1\right)\right].$$
By (6.5), $M_{B_{n,\lambda/n}}(t) \to M_{P_\lambda}(t)$ for all t, and thus $F_{B_{n,\lambda/n}} \Rightarrow F_{P_\lambda}$ by Proposition 4.77.
Example 6.6 (Modeling bond defaults; insurance losses) The Poisson limit theorem has immediate applications in finance in any situation in which one is modeling the binomial outcomes of a large group with each member of the group having the same or similar probabilities of the event being observed.
For example, in a relatively large portfolio of bonds with similar credit ratings, or similarly rated bank loans of various types, the event of default over a given period is fundamentally binomial with p equalling the probability of default over this period. Assume we have n = 200 loans, each with default probability p = 0.02 in one year, and define the random variable N equal to the number of defaults. One can then model N exactly as the sum of 200 binomials, or approximately as a Poisson random variable with λ = np = 4. It is then true that in either model, the assumption of independence from loan to loan is often reasonable, though the assumed value of p is highly dependent on the economic cycle.
One can similarly model a variety of insurance loss events this way. For example, death, disability, hospitalization, etc., can be modeled as sums of binomials or approximately Poisson random variables, as long as the group being modeled is reasonably homogeneous and individuals have similar values for p. Similarly, various automobile and homeowner insurance loss events can be modeled with binomials or Poisson random variables when the groups modeled are reasonably homogeneous in terms of claim probabilities. Independence is again often a reasonable assumption.
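The quality of the Poisson approximation in the loan example can be checked numerically. A sketch with the example's parameters n = 200, p = 0.02, λ = 4 (function names are ours):

```python
import math

def binom_pmf(n, p, j):
    """Exact probability of j defaults among n independent loans."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)

def poisson_pmf(lam, j):
    """Poisson probability of j events with parameter lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

n, p = 200, 0.02
lam = n * p  # expected defaults: 4
gaps = [abs(binom_pmf(n, p, j) - poisson_pmf(lam, j)) for j in range(15)]
max_gap = max(gaps)
# Each pointwise gap between the two models is well under one percent.
```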
This result can even be generalized further to more general "binomials" which assume that:
$$\Pr\left[X_{nm}^B = 1\right] = p_{nm},$$
$$\Pr\left[X_{nm}^B = 0\right] = 1 - p_{nm} - \epsilon_{nm},$$
$$\Pr\left[X_{nm}^B \ge 2\right] = \epsilon_{nm},$$
where $\max_m\{\epsilon_{nm}\} \to 0$ as $n \to \infty$.
In order to ensure that all of the probabilities $p_{nm}$ become small as $n \to \infty$, and thus this remains a law of "small numbers," it is necessary to also require that $\max_m\{p_{nm}\} \to 0$ as $n \to \infty$.
Below we state and prove a version of intermediate generality, but first a definition.
Definition 6.7 (Triangular array) A collection of random variables $\{\{X_{n,m}\}_{m=1}^{n}\}_{n=1}^{\infty}$ is called a triangular array if for each n, the random variables $\{X_{n,m}\}_{m=1}^{n}$ are independent. The same definition applies if $1 \le m \le m_n$ where $m_n \to \infty$ as $n \to \infty$, though we will have no need of the more general notion.
Proposition 6.8 (Weak law of small numbers) Let $\{\{X_{n,m}\}_{m=1}^{n}\}_{n=1}^{\infty}$ be a triangular array where for each n, $\{X_{nm}^B\}_{m=1}^{n}$ are independent, standard binomial variables with parameters $\{p_{nm}\}_{m=1}^{n}$, where as $n \to \infty$:
$$\sum_{m=1}^{n} p_{nm} \to \lambda > 0 \quad\text{and}\quad \max_m\{p_{nm}\} \to 0.$$
If $S_n = \sum_{m=1}^{n} X_{nm}^B$, then as $n \to \infty$:
$$F_{S_n} \Rightarrow F_{P_\lambda}, \qquad (6.8)$$
where $F_{P_\lambda}$ denotes the distribution function of the Poisson with parameter λ.
Proof. By (4.48):
$$M_{S_n}(t) = \prod_{m=1}^{n}\left(1 + p_{nm}\left(e^t - 1\right)\right).$$
Similarly,
$$M_{B_{n,\lambda/n}}(t) = \left(1 + \frac{\lambda}{n}\left(e^t - 1\right)\right)^n,$$
where $M_{B_{n,\lambda/n}}(t)$ is the moment generating function of the sum of n independent, identically distributed binomials with $p = \lambda/n$.
By the Poisson limit theorem, $M_{B_{n,\lambda/n}}(t) \to M_{P_\lambda}(t)$ for all t, and hence (6.8) will be proven if it can be shown that $M_{S_n}(t)/M_{B_{n,\lambda/n}}(t) \to 1$ for all t. For this we prove that $\ln\left[M_{S_n}(t)/M_{B_{n,\lambda/n}}(t)\right] \to 0$ for all t. This logarithm is well-defined since $M(t) > 0$, and this limit obtains the needed result because the exponential function is continuous.
To this end, fix t and for arbitrary $\epsilon$ with $0 < \epsilon < 1$, define N so that for all $n \ge N$ and all $m \le n$:
$$p_{nm}\left|e^t - 1\right| < \epsilon \quad\text{and}\quad \frac{\lambda}{n}\left|e^t - 1\right| < \epsilon.$$
The first bound is possible since $\max_m\{p_{nm}\} \to 0$ as $n \to \infty$, and the second is apparent.
Recall the Taylor series for $\ln(1+x)$, that for $|x| < 1$:
$$\ln(1+x) = \sum_{j=1}^{\infty}(-1)^{j+1}\frac{x^j}{j}. \qquad (6.9)$$
By the above bounds, both $\ln\left(1 + p_{nm}\left(e^t - 1\right)\right)$ and $\ln\left(1 + \frac{\lambda}{n}\left(e^t - 1\right)\right)$ can be expanded as in (6.9). Here the interchange in summations is justified by the absolute convergence of these series for $n \ge N$.
Now for $j \ge 2$, since $p_{nm} < \epsilon/|a|$ and $\lambda/n < \epsilon/|a|$:
$$\left|p_{nm}^j - \left(\frac{\lambda}{n}\right)^j\right| = \left|p_{nm} - \frac{\lambda}{n}\right|\left|\sum_{k=0}^{j-1} p_{nm}^{j-k-1}\left(\frac{\lambda}{n}\right)^k\right| < \left|p_{nm} - \frac{\lambda}{n}\right|\sum_{k=0}^{j-1}\left(\frac{\epsilon}{|a|}\right)^{j-1} \le j\left|p_{nm} - \frac{\lambda}{n}\right|\left(\frac{\epsilon}{|a|}\right)^{j-1}.$$
Example 6.9 (Modeling bond defaults) This theorem allows the examples in the prior
section to be more generally applied. Specifically, a portfolio of bonds with various credit
ratings can also be modeled with a Poisson random variable. Similarly for various insurance
applications. The requirement that maxm {pnm } → 0 as n → ∞ provides a constraint,
however, that the default or insurance claim probabilities should individually be small.
For example, given n = 400 bonds, 100 with p1 = 0.002, 200 with p2 = 0.005 and 100
with p3 = 0.01, one could reasonably model the number of defaults in this portfolio as a
Poisson random variable with λ = 100p1 + 200p2 + 100p3 = 2.2.
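The heterogeneous portfolio of this example can be checked against its Poisson model exactly. A sketch (function name ours) computing the exact distribution of the default count by convolution, and the Poisson(2.2) probabilities recursively to avoid huge factorials:

```python
import math

def number_of_defaults_pmf(ps):
    """Exact pmf of the default count for independent loans with (possibly
    different) default probabilities ps, built by repeated convolution."""
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for j, q in enumerate(pmf):
            new[j] += q * (1.0 - p)       # loan does not default
            new[j + 1] += q * p           # loan defaults
        pmf = new
    return pmf

ps = [0.002] * 100 + [0.005] * 200 + [0.01] * 100
pmf = number_of_defaults_pmf(ps)
lam = sum(ps)  # 2.2 as in the example
poisson = [math.exp(-lam)]
for j in range(1, len(pmf)):
    poisson.append(poisson[-1] * lam / j)
max_gap = max(abs(a - b) for a, b in zip(pmf, poisson))
```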
Exercise 6.10 ($\Pr[a \le X_n^B \le b] \to 0$) Prove for fixed a and b that $\Pr\left[a \le X_n^B \le b\right] \to 0$ as $n \to \infty$. Hint: Stirling's formula of (4.105).
In contrast to the Poisson limit theorem, here p is fixed and hence $np \to \infty$. Thus both the mean and variance of $X_n^B$ grow without bound as $n \to \infty$.
To investigate quantitatively the limiting probabilities under this distribution as $n \to \infty$, some form of "scaling" is necessary to stabilize this distribution. The approach used by Abraham de Moivre (1667–1754) in the special case of $p = 1/2$, and many years later generalized by Pierre-Simon Laplace (1749–1827) to all p with $0 < p < 1$, was to consider what is often called the normalized random variable $Y_n^B$. This variable is defined:
$$Y_n^B \equiv \frac{X_n^B - E\left[X_n^B\right]}{\sqrt{Var\left[X_n^B\right]}} = \frac{X_n^B - \mu_B}{\sigma_B}. \qquad (6.10)$$
Since $\mu_B = np$ and $\sigma_B = \sqrt{np(1-p)}$ are constants for each n, the random variable $Y_n^B$ has the same binomial probabilities as does $X_n^B$ in the sense that:
$$\Pr\left[Y_n^B = j' \equiv \frac{j - np}{\sqrt{np(1-p)}}\right] = \Pr\left[X_n^B = j\right].$$
$$E\left[Y_n^B\right] = 0, \quad Var\left[Y_n^B\right] = 1.$$
In other words, in terms of mean and variance, the distribution of $Y_n^B$ "stays put," in contrast to the distribution of $X_n^B$ that wanders off with n.
Consequently, with mean and variance both constant and independent of n, the question
of investigating and potentially identifying the limiting distribution of YnB as n → ∞ is
better defined and its pursuit more compelling. The following proposition identifies this
limiting result.
Note that the variate $Y_n^B$ in (6.11) can also be expressed as a summation:
$$Y_n^B \equiv \sum_{m=1}^{n} Y_{1,m}^B,$$
where for each n and independent standard binomials $\{X_{1,m}^B\}_{m=1}^{n}$:
$$Y_{1,m}^B \equiv \frac{X_{1,m}^B - p}{\sqrt{np(1-p)}}.$$
Hence $\{Y_{1,m}^B\}_{m=1}^{n}$ are independent, binomial variables with fixed p, and with $\mu_{Y_1} = 0$ and $\sigma_{Y_1} = 1/\sqrt{n}$.
Proposition 6.11 (De Moivre-Laplace theorem) Let $F_{Y_n^B}(j')$ denote the distribution function of the normalized general binomial $Y_n^B$ with parameters n, p, where $0 < p < 1$. In other words,
$$F_{Y_n^B}(j') \equiv \sum_{k' \le j'} f_{Y_n^B}(k'),$$
where $f_{Y_n^B}(j') \equiv f_{B_n}(j)$ with $j \equiv j'\sqrt{np(1-p)} + np$ for $0 \le j \le n$.
Then as $n \to \infty$:
$$F_{Y_n^B} \Rightarrow \Phi, \qquad (6.11)$$
where $\Phi(x)$ denotes the distribution function of the standard normal associated with the density function in (1.66).
Proof. Let $M_{B_{n,p}}(t)$ denote the moment generating function of the general binomial, which from (4.63) is given for all t by:
$$M_{B_{n,p}}(t) = \left(1 + p\left(e^t - 1\right)\right)^n.$$
From (4.30) and (4.63), the moment generating function for $Y_n^B$ is defined for all t as follows, denoting $q \equiv 1-p$:
$$M_{Y_n^B}(t) = \exp\left(-\frac{npt}{\sqrt{npq}}\right)M_{B_{n,p}}\left(\frac{t}{\sqrt{npq}}\right)$$
$$= \exp\left(-\frac{npt}{\sqrt{npq}}\right)\left(1 + p\left(\exp\left(\frac{t}{\sqrt{npq}}\right) - 1\right)\right)^n$$
$$= \left(q\exp\left(-\frac{pt}{\sqrt{npq}}\right) + p\exp\left(\frac{qt}{\sqrt{npq}}\right)\right)^n$$
$$= \left(q\exp\left(-t\sqrt{p/q}\big/\sqrt{n}\right) + p\exp\left(t\sqrt{q/p}\big/\sqrt{n}\right)\right)^n.$$
The moment generating function of the standard normal $M_\Phi(t)$ is defined for all t and given in (4.79) by:
$$M_\Phi(t) = \exp\left(t^2/2\right).$$
To prove that $M_{Y_n^B}(t) \to M_\Phi(t)$ for all t, it is sufficient to demonstrate that:
$$\ln M_{Y_n^B}(t) \to t^2/2,$$
for all t, since $M_{Y_n^B}(t)$ is nonnegative and the exponential function is continuous.
To this end:
$$\ln M_{Y_n^B}(t) = n\ln\left[q\exp\left(-t\sqrt{p/q}\big/\sqrt{n}\right) + p\exp\left(t\sqrt{q/p}\big/\sqrt{n}\right)\right].$$
Remark 6.12 (On binomial approximations with the normal) Because $\Phi(x)$ is continuous everywhere, the De Moivre-Laplace theorem conclusion of $F_{Y_n^B} \Rightarrow \Phi$ implies that for all y:
$$F_{Y_n^B}(y) \to \Phi(y).$$
With $j' \equiv \frac{j - \mu_B}{\sigma_B}$, where $\mu_B \equiv np$, $\sigma_B \equiv \sqrt{npq}$ and $0 \le j \le n$, it follows that for any real number w:
$$F_{B_{n,p}}(w) = \sum_{j \le w}\frac{n!}{j!(n-j)!}p^j q^{n-j} = \sum_{j' \le (w-\mu_B)/\sigma_B}\frac{n!}{j!(n-j)!}p^j q^{n-j} = F_{Y_n^B}\left(\frac{w - \mu_B}{\sigma_B}\right).$$
In applications where by definition $n < \infty$, the above limiting result for the binomial provides an approximation that for n "large," meaning that $n \ge 30$ or so for most applications:
$$F_{B_{n,p}}(w) \approx \Phi\left(\frac{w - \mu_B}{\sigma_B}\right).$$
The corresponding approximation for $F_{B_{n,p}}(w) - F_{B_{n,p}}(v)$ when $w > v$:
$$F_{B_{n,p}}(w) - F_{B_{n,p}}(v) \approx \Phi\left(\frac{w - \mu_B}{\sigma_B}\right) - \Phi\left(\frac{v - \mu_B}{\sigma_B}\right). \qquad (6.12)$$
In applications one is typically interested in integer values of w and/or v, and this approximation cannot therefore be uniformly useful. For example, if $w = j$ an integer and real $v < j$, we would conclude as $v \to j$ that:
$$F_{B_{n,p}}(j) - F_{B_{n,p}}(v) \to f_{B_{n,p}}(j) > 0,$$
while
$$\Phi\left(\frac{j - \mu_B}{\sigma_B}\right) - \Phi\left(\frac{v - \mu_B}{\sigma_B}\right) \to 0.$$
While not apparent from the above approach to the proof, the problem here can be investigated with a more direct analysis of the binomial probabilities using Stirling's formula in (4.105), and some careful calculations as in Proposition 8.24 of Reitano (2010). This reveals that if $j_n' \to y$ as $n \to \infty$, where each $j_n'$ is in the range of the random variable $Y_n^B$, then:
$$\lim_{n\to\infty}\sqrt{npq}\Pr\left\{Y_n^B = j_n'\right\} = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}. \qquad (6.13)$$
This produces the approximation:
$$f_{Y_n^B}(j_n') \approx \frac{1}{\sqrt{2\pi npq}}\exp\left(-(j_n')^2/2\right),$$
which for $0 \le j \le n$ is equivalent to:
$$f_{B_{n,p}}(j) \approx \frac{1}{\sqrt{2\pi npq}}\exp\left[-\frac{1}{2}\left(\frac{j - \mu_B}{\sigma_B}\right)^2\right]. \qquad (6.14)$$
This approximation for $f_{Y_n^B}(j')$ can be interpreted as a single term in the Riemann summation which approximates the integral of the standard normal density function $e^{-y^2/2}\big/\sqrt{2\pi}$ using $\Delta y = 1/\sigma_B$ where $\sigma_B = \sqrt{npq}$. This value of $\Delta y$ is seen to equal $j_n'(k+1) - j_n'(k)$, where $\{j_n'(k)\}_{k=0}^{n}$ denote the $n+1$ values of this variate.
Hence, the discrete probability fBn,p (j) in (6.14) is approximated by the standard normal
probability of an interval of length 1/σ B which contains (j − µB ) /σ B . Alternatively, for
some λ with 0 < λ < 1, one wants to approximate fBn,p (j) by the integral of the standard
normal density over [a, b] with a = [j − µB − (1 − λ)] /σ B and b = [j − µB + λ] /σ B , an
interval of length 1/σ B .
The conventional solution is $\lambda = 1/2$, which makes a half interval adjustment, or half integer adjustment, for the above approximation. This produces:
$$f_{B_{n,p}}(j) \approx \Phi\left(\frac{j + 1/2 - \mu_B}{\sigma_B}\right) - \Phi\left(\frac{j - 1/2 - \mu_B}{\sigma_B}\right), \qquad (6.15)$$
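The half-integer adjustment of (6.15) is easy to verify numerically. A sketch (function names and the choice n = 100, p = 0.4, j = 42 are ours), using math.erf for $\Phi$:

```python
import math

def phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_pmf(n, p, j):
    """Exact binomial probability of j successes in n trials."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)

n, p = 100, 0.4
mu = n * p
sigma = math.sqrt(n * p * (1.0 - p))
j = 42
exact = binom_pmf(n, p, j)
# (6.15): normal probability of an interval of length 1/sigma around j'.
approx = phi((j + 0.5 - mu) / sigma) - phi((j - 0.5 - mu) / sigma)
```

For these parameters the two values agree to three decimal places, illustrating why the adjustment is the conventional choice.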
1. The sequence of variances does not grow too fast, to preclude latter terms in the series
from increasingly dominating the summation, and,
2. A requirement that the sequence of variances does not converge to 0 so quickly that the
average variance converges to 0.
These theorems can be equivalently stated in terms of the sum of independent random variables or their average. This is because by (4.39) and (4.3):
$$E\left[\frac{1}{n}\sum_{j=1}^{n} X_j\right] = \frac{1}{n}E\left[\sum_{j=1}^{n} X_j\right],$$
So while the ranges of the sum and average of independent random variables are quite different, the associated normalized random variables are identical.
Consequently, central limit theorems in general, and the De Moivre-Laplace theorem in particular, apply to the sums of random variables if and only if they apply to the averages of random variables. More generally, if a given version of a central limit theorem applies to $\sum_{j=1}^{n} X_j$ for independent $\{X_j\}$, then it applies to $\sum_{j=1}^{n} Y_j$ where $Y_j = aX_j + b$ for arbitrary constants $a \ne 0$ and b.
In this section, we provide a proof of a simplified version of the central limit theorem
in the case of independent, identically distributed random variables which have moments
of all orders and a well-defined moment generating function. Mechanically, the proof will
be quite similar to that of the De Moivre-Laplace theorem, except that we will have to
accommodate a more general form of MX (t).
In Book VI, using more powerful tools than the moment generating function, we will
present more general versions of this result, with far weaker assumptions.
Proposition 6.13 (Central limit theorem 1) Let $F_X$ denote the distribution function of a random variable X with mean and variance denoted µ and $\sigma^2$, and moment generating function $M_X(t)$ which exists for $t \in (-t_0, t_0)$ with $t_0 > 0$. Let $Y_n$ denote the normalized random variable associated with the sum or average of n independent values of X, respectively defined as in (6.10):
$$Y_n = \frac{\sum_{j=1}^{n} X_j - n\mu}{\sqrt{n}\,\sigma} = \frac{\frac{1}{n}\sum_{j=1}^{n} X_j - \mu}{\sigma/\sqrt{n}}.$$
Then as $n \to \infty$,
$$F_{Y_n} \Rightarrow \Phi, \qquad (6.18)$$
where $\Phi(x)$ denotes the distribution function of the standard normal.
Proof. By (4.48) and (4.30):
$$M_{Y_n}(t) = \left[\exp\left(-\frac{\mu t}{\sqrt{n}\sigma}\right)M_X\left(\frac{t}{\sqrt{n}\sigma}\right)\right]^n, \qquad (1)$$
and the goal is to prove that for all t:
$$M_{Y_n}(t) \to \exp\left(\frac{t^2}{2}\right) = M_\Phi(t),$$
or equivalently:
$$\ln M_{Y_n}(t) \to \frac{t^2}{2}.$$
By (4.53):
$$M_X\left(\frac{t}{\sqrt{n}\sigma}\right) = \sum_{j=0}^{\infty}\frac{1}{j!}\mu_j'\left(\frac{t}{\sqrt{n}\sigma}\right)^j.$$
Recalling that $\mu_0' = 1$, $\mu_1' = \mu$ and $\mu_2' = \sigma^2 + \mu^2$:
$$M_X\left(\frac{t}{\sqrt{n}\sigma}\right) = 1 + \frac{\mu}{\sqrt{n}\sigma}t + \frac{\sigma^2 + \mu^2}{2n\sigma^2}t^2 + n^{-\frac{3}{2}}E_1(n), \qquad (2)$$
where:
$$E_1(n) = \sum_{j=3}^{\infty}\frac{1}{j!}\mu_j'\left(\frac{t}{\sigma}\right)^j n^{\frac{3-j}{2}}.$$
Since $M_X(t)$ is by assumption absolutely convergent for $|t| < t_0$, $M_X\left(t/\left[\sqrt{n}\sigma\right]\right)$ and hence $E_1(n)$ are absolutely convergent for $|t| < \sqrt{n}\sigma t_0$, and so for any t, $E_1(n) \to \mu_3'(t/\sigma)^3/6$ as $n \to \infty$.
Similarly, using the Taylor series in (4.54) obtains:
$$\exp\left(-\frac{\mu t}{\sqrt{n}\sigma}\right) = \sum_{j=0}^{\infty}\frac{1}{j!}\left(-\frac{\mu t}{\sqrt{n}\sigma}\right)^j = 1 - \frac{\mu}{\sqrt{n}\sigma}t + \frac{\mu^2}{2n\sigma^2}t^2 + n^{-\frac{3}{2}}E_2(n), \qquad (3)$$
where:
$$E_2(n) = \sum_{j=3}^{\infty}\frac{1}{j!}\left(-\frac{\mu t}{\sigma}\right)^j n^{\frac{3-j}{2}}.$$
Again this series is absolutely convergent for all t, and $E_2(n) \to -\mu^3(t/\sigma)^3/6$ for any t as $n \to \infty$.
With a bit of algebra on (2) and (3):
$$\exp\left(-\frac{\mu t}{\sqrt{n}\sigma}\right)M_X\left(\frac{t}{\sqrt{n}\sigma}\right) = 1 + \frac{t^2}{2n} + n^{-\frac{3}{2}}E_3(n),$$
where the error term $E_3(n)$ is absolutely convergent for $|t| < \sqrt{n}\sigma t_0$, and $E_3(n) \to -\mu^3\mu_3'(t/\sigma)^6/36$ as $n \to \infty$. To obtain $M_{Y_n}(t)$ in (1), this expression can now be raised to the nth power, a logarithm taken, and the function $\ln(1+x)$ expanded in a Taylor series as in (6.9).
To simplify, note that we only need to keep track of the powers of n that are needed for the final limit, meaning only those terms that will not converge to zero as $n \to \infty$. This produces:
$$\ln M_{Y_n}(t) = n\ln\left(1 + \frac{t^2}{2n} + n^{-\frac{3}{2}}E_3(n)\right)$$
$$= n\left[\frac{t^2}{2n} + n^{-\frac{3}{2}}E_3(n) - \frac{1}{2}\left(\frac{t^2}{2n} + n^{-\frac{3}{2}}E_3(n)\right)^2 + O\left(n^{-3}\right)\right]$$
$$= t^2/2 + O\left(n^{-1/2}\right).$$
To justify the second step in which the power series expansion for $\ln(1+x)$ is invoked with $x = t^2/(2n) + n^{-\frac{3}{2}}E_3(n)$, it must be verified that $|x| < 1$ for any t for n large enough. But since:
$$|x| \le \frac{t^2}{2n} + n^{-\frac{3}{2}}\left|E_3(n)\right|,$$
where $E_3(n)$ is continuous and bounded, the conclusion follows.
Hence for all t:
$$\ln M_{Y_n}(t) \to \frac{t^2}{2},$$
as $n \to \infty$. Equivalently $M_{Y_n}(t) \to M_\Phi(t)$ for all t, so $F_{Y_n} \Rightarrow \Phi$ by Proposition 4.77.
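The limit $\ln M_{Y_n}(t) \to t^2/2$ in this proof can be checked numerically from (1) for a concrete X. A sketch (function name ours), assuming X uniform on (0, 1), for which $M_X(t) = (e^t - 1)/t$, $\mu = 1/2$ and $\sigma^2 = 1/12$:

```python
import math

def log_mgf_Yn(t, n):
    """ln M_{Y_n}(t) computed from (1) for X uniform on (0,1), where
    M_X(t) = (e^t - 1)/t, mu = 1/2 and sigma = 1/sqrt(12)."""
    mu, sigma = 0.5, 1.0 / math.sqrt(12.0)
    s = t / (math.sqrt(n) * sigma)
    return n * (math.log((math.exp(s) - 1.0) / s) - mu * s)

t = 1.5
vals = [log_mgf_Yn(t, n) for n in (10, 100, 10_000)]
# The values should approach t**2 / 2 = 1.125 as n grows.
```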
The assumption in this version of the central limit theorem, that $M_X(t)$ exists and is convergent for all $t \in (-t_0, t_0)$ with $t_0 > 0$, is quite strong. As was proved in Proposition
4.25, this implies that the associated distribution function of X has finite moments of all
orders.
For independent and identically distributed random variables, the conclusion in (6.18)
will be seen to be valid under the assumption that X has only two finite moments, a mean
and variance. It is also valid more generally for sums of independent random variables which
are not identically distributed. But in neither case will the tools of this book suffice for the
demonstration.
The problem with the current approach is that the moment generating function is a blunt instrument. If it exists on an open interval (−t₀, t₀) with t₀ > 0, then all moments exist by Proposition 4.25. There is no way to adapt this argument to random variables with only finitely many moments.
In Book V, following the development of a general integration theory, the Fourier
transform of a measurable function will be introduced. In the same way that the moment
generating function of a random variable is defined relative to the Laplace transform of this
measurable function as noted in Remark 4.15, in Book VI, the characteristic function
of a random variable will be defined in terms of the Fourier transform. It will there be seen
that unlike the moment generating function, which may or may not exist, characteristic
functions always exist.
Characteristic functions will also be seen to uniquely identify distribution functions, so
they are useful in proofs in the same way that moment generating functions are useful. But
more generally, the associated proofs can also be implemented for random variables with
only finitely many moments. The moments of a distribution, to the extent they exist, will
be seen to appear in the series expansion of the characteristic function in a familiar way,
reminiscent of (4.53).
Proposition 6.15 (Smirnov's Limit Theorem) Let {U₍k₎}ⁿₖ₌₁ be the order statistics from a continuous uniform distribution on [0, 1]. Define:
$$U'_{(k)} \equiv \frac{U_{(k)} - b_n}{a_n},$$
where:
$$b_n = \frac{k-1}{n-1}, \qquad a_n = \sqrt{\frac{b_n(1-b_n)}{n-1}},$$
and let F₍k₎ denote the distribution function of U′₍k₎.
Then if k → ∞ and n − k → ∞ as n → ∞:
$$F_{(k)} \Rightarrow \Phi, \qquad (6.19)$$
where Φ is the distribution function of the standard normal.
Proof. The density function g₍k₎(y) of U₍k₎ is given on [0, 1] by:
$$g_{(k)}(y) = \frac{n!}{(k-1)!\,(n-k)!}\,y^{k-1}(1-y)^{n-k}.$$
The density f₍k₎(y) of U′₍k₎ is thus defined on [−bₙ/aₙ, (1 − bₙ)/aₙ] by:
$$f_{(k)}(y) = \frac{n!}{(k-1)!\,(n-k)!}\,a_n\left(a_n y + b_n\right)^{k-1}\left(1 - a_n y - b_n\right)^{n-k}$$
$$= \frac{n!}{(k-1)!\,(n-k)!}\,a_n b_n^{k-1}(1-b_n)^{n-k}\left(1 + \frac{a_n}{b_n}y\right)^{k-1}\left(1 - \frac{a_n}{1-b_n}y\right)^{n-k}. \qquad (1)$$
The coefficient in (1) simplifies with Stirling's formula of (4.105). Here we use the notation "∼" to denote that as n → ∞, the expressions have the same limit.
$$\frac{n!}{(k-1)!\,(n-k)!}\,a_n b_n^{k-1}(1-b_n)^{n-k}$$
$$\sim \frac{1}{\sqrt{2\pi}}\cdot\frac{n^{n+1/2}e^{-1}}{(k-1)^{k-1/2}(n-k)^{n-k+1/2}}\left(\frac{k-1}{n-1}\right)^{k-1/2}\left(\frac{n-k}{n-1}\right)^{n-k+1/2}\left(\frac{1}{n-1}\right)^{1/2}$$
$$= \frac{1}{\sqrt{2\pi}}\left(1 + \frac{1}{n-1}\right)^{n+1/2}e^{-1}.$$
Taking logs, and applying the Taylor series expansion in (6.9), which is justified noting that both (aₙ/bₙ)y and (aₙ/(1 − bₙ))y have absolute value less than 1:
$$\ln\left[\left(1 + \frac{a_n}{b_n}y\right)^{k-1}\left(1 - \frac{a_n}{1-b_n}y\right)^{n-k}\right]$$
$$= (k-1)\sum_{j=1}^{\infty}(-1)^{j+1}\left(\frac{a_n}{b_n}\right)^j\frac{y^j}{j} + (n-k)\sum_{j=1}^{\infty}\left[-\left(\frac{a_n}{1-b_n}\right)^j\right]\frac{y^j}{j}.$$
The coefficient c_j of y^j/j for j ≥ 1 is:
$$c_j \equiv (-1)^{j+1}(k-1)\left(\frac{a_n}{b_n}\right)^j - (n-k)\left(\frac{a_n}{1-b_n}\right)^j$$
$$= (-1)^{j+1}(k-1)\left(\frac{n-k}{(n-1)(k-1)}\right)^{j/2} - (n-k)\left(\frac{k-1}{(n-1)(n-k)}\right)^{j/2}$$
$$= (-1)^{j+1}(k-1)\left(\frac{-1}{n-1} + \frac{1}{k-1}\right)^{j/2} - (n-k)\left(\frac{-1}{n-1} + \frac{1}{n-k}\right)^{j/2}.$$
From this expression it follows that c₁ = 0, c₂ = −1, and since k → ∞ and n − k → ∞, the coefficient of y^j/j converges to zero as n → ∞ for j ≥ 3.
In summary, for all y:
$$f_{(k)}(y) \to \varphi(y) \equiv \frac{1}{\sqrt{2\pi}}e^{-y^2/2},$$
as n → ∞.
The final step is to prove convergence of the associated distribution functions, F₍k₎(y) → Φ(y) for all y. To this end, let h₍k₎(y) ≡ max{φ(y) − f₍k₎(y), 0}, and note that:
$$\left|f_{(k)}(y) - \varphi(y)\right| = f_{(k)}(y) - \varphi(y) + 2h_{(k)}(y).$$
Definition II.2.9 and item 4 of Proposition II.2.49 obtain that for all measurable A:
$$\int_A \left|f_{(k)}(x) - \varphi(x)\right|dx \to 0.$$
Taking A = (−∞, y] and using the triangle inequality of that proposition's item 7:
$$\left|\int_{-\infty}^{y} f_{(k)}(x)dx - \int_{-\infty}^{y}\varphi(x)dx\right| \le \int_{-\infty}^{y}\left|f_{(k)}(x) - \varphi(x)\right|dx \to 0.$$
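A quick simulation, offered as an illustration under assumed parameters and not part of the text, makes Proposition 6.15 concrete: with k and n − k both large, the normalized order statistic U′₍k₎ is approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sketch of Proposition 6.15: normalize the k-th uniform
# order statistic U_(k) by b_n = (k-1)/(n-1), a_n = sqrt(b_n(1-b_n)/(n-1)).
n, k, trials = 1_000, 400, 10_000
U = np.sort(rng.uniform(size=(trials, n)), axis=1)
b_n = (k - 1) / (n - 1)
a_n = np.sqrt(b_n * (1 - b_n) / (n - 1))
U_prime = (U[:, k - 1] - b_n) / a_n   # U_(k) is the k-th smallest

print(round(U_prime.mean(), 2), round(U_prime.std(), 2))
```

Both printed moments should be close to those of Φ, reflecting the density convergence f₍k₎(y) → φ(y) proved above.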
Remark 6.16 (Scheffé’s Theorem) The last paragraphs of the above proof derive a spe-
cial case of Scheffé’s Theorem, named for a 1947 result of Henry Scheffé (1907–1977).
This theorem states that pointwise convergence of density functions assures pointwise con-
vergence of the associated distribution functions, and thus the weak convergence of distribu-
tions.
It is more generally true for density functions defined on arbitrary measure spaces and
will be proved in Book VI using the general version of Lebesgue’s dominated convergence
theorem of Book V. Otherwise, the general proof will be identical with that above.
In the special case where kn /n → q, the qth quantile of the uniform distribution for
0 < q < 1, Smirnov’s limit theorem can be stated in a simpler way. The proof is an
application of Proposition II.9.16.
Corollary 6.17 (Smirnov's Limit Theorem) Let {U₍k₎}ⁿₖ₌₁ be the order statistics from a continuous, uniform distribution on [0, 1], and kₙ a sequence so that kₙ/n → q for 0 < q < 1. Define:
$$U''_{(k_n)} \equiv \frac{U_{(k_n)} - q}{\sqrt{q(1-q)/n}},$$
and let G₍kₙ₎ denote the distribution function of U″₍kₙ₎.
Then as n → ∞:
$$G_{(k_n)} \Rightarrow \Phi, \qquad (6.20)$$
where Φ(x) is the distribution function of the standard normal.
Proof. Since kₙ/n → q ∈ (0, 1) implies that both kₙ → ∞ and n − kₙ → ∞ as n → ∞, the above proposition assures that F₍kₙ₎ ⇒ Φ, where F₍kₙ₎ is the distribution function of the variable:
$$U'_{(k_n)} \equiv \frac{U_{(k_n)} - b_n}{a_n},$$
with aₙ and bₙ defined as above in terms of kₙ.
By Proposition II.9.16, if cₙ and dₙ are sequences that satisfy:
$$c_n \to c, \qquad d_n \to d,$$
then:
$$F_{(k_n)}(c_n y + d_n) \Rightarrow \Phi(cy + d),$$
or equivalently:
$$\Pr\left[U'_{(k_n)} \le c_n y + d_n\right] \to \Pr[Z \le cy + d]. \qquad (1)$$
Define:
$$c_n = \frac{\sqrt{q(1-q)/n}}{a_n}, \qquad d_n = \frac{q - b_n}{a_n},$$
and note that:
$$\Pr\left[U'_{(k_n)} \le c_n y + d_n\right] = \Pr\left[U''_{(k_n)} \le y\right] = G_{(k_n)}(y). \qquad (2)$$
Thus by (1) and (2), the proof will be complete by showing that cₙ → 1 and dₙ → 0.
Now kₙ/n − q → 0 by assumption, and thus (kₙ − 1)/(n − 1) → q and (n − kₙ)/(n − 1) → 1 − q. This obtains:
$$c_n = \frac{\sqrt{q(1-q)/n}}{a_n} = \sqrt{q(1-q)\cdot\frac{n-1}{k_n-1}\cdot\frac{n-1}{n-k_n}\cdot\frac{n-1}{n}} \to 1,$$
and:
$$d_n = \frac{q - b_n}{a_n} = \left(q - \frac{k_n-1}{n-1}\right)\left(\frac{(k_n-1)(n-k_n)}{(n-1)^3}\right)^{-1/2} \to 0.$$
Proposition 6.18 (General quantile limits) Let {X₍k₎}ⁿₖ₌₁ be the order statistics from a distribution function F which is continuous and strictly increasing in a neighborhood of F⁻¹(q) for 0 < q < 1, and differentiable at F⁻¹(q) with F′(F⁻¹(q)) ≠ 0. Given a sequence {kₙ} with kₙ/n → q as n → ∞, define:
$$X'_{(k_n)} \equiv \frac{X_{(k_n)} - F^{-1}(q)}{\sqrt{q(1-q)/n}\,\big/\,F'(F^{-1}(q))},$$
and let F₍kₙ₎ denote the distribution function of X′₍kₙ₎.
Then as n → ∞:
$$F_{(k_n)} \Rightarrow \Phi, \qquad (6.21)$$
where Φ(x) is the distribution function of the standard normal.
Proof. Given continuous, uniform U on (0, 1), define X = F*(U) where F* is the left-continuous inverse of F, and recall from Proposition II.4.9 and the Chapter 5 introduction that X has distribution function F. If {U₍k₎}ⁿₖ₌₁ are order statistics from this uniform distribution, define X₍k₎ = F*(U₍k₎), order statistics for X since F* is increasing by Proposition II.3.16.
If kₙ is a sequence as defined above, then Corollary 6.17 applies. In the notation of weak convergence of random variables (Definition II.5.19):
$$c_n\left(U_{(k_n)} - q\right) \Rightarrow Z,$$
where cₙ ≡ 1/√(q(1 − q)/n), and Z is a standard normal random variable. Because cₙ → ∞, the ∆-Method of Proposition II.8.40 obtains that if F*(y) is differentiable at y = q, then:
$$c_n\left(F^*(U_{(k_n)}) - F^*(q)\right) \Rightarrow (F^*)'(q)\,Z. \qquad (1)$$
As G(x) is continuous and strictly increasing on D ≡ {x | F(a) < G(x) < F(b)} = (a, b), Corollary II.3.23 obtains that G* = G⁻¹ on (a, b). But G = F on (a, b), and thus F* = F⁻¹ on (a, b), and in particular, F*(q) = F⁻¹(q).
Since F is differentiable at F⁻¹(q) with F′(F⁻¹(q)) ≠ 0, and F* = F⁻¹ near q:
$$\left(F^{-1}\right)'(q) = 1/F'(F^{-1}(q)).$$
Thus with F*(U₍kₙ₎) ≡ X₍kₙ₎, (1) obtains:
$$c_n\left(X_{(k_n)} - F^{-1}(q)\right) \Rightarrow Z/F'(F^{-1}(q)),$$
which is (6.21).
Remark 6.19 (On F′(F⁻¹(q))) It seems natural to expect that F′(F⁻¹(q)) should be expressible in terms of the density function associated with the distribution function F(x). As was seen in Book III, not every distribution function has an associated density function, meaning a measurable function f(x) so that for all x:
$$F(x) = (\mathcal{L})\int_{-\infty}^{x} f(y)dy,$$
defined as a Lebesgue integral. This is true even if F(x) is continuous as assumed above.
Existence of a density function f(x) requires that F(x) be absolutely continuous, introduced in Definition III.3.54. Absolute continuity is a stronger condition than either uniform continuity or bounded variation, and is weaker than continuous differentiability. Interestingly, F(x) in the above result need not be absolutely continuous even in the interval (F⁻¹(q) − ε, F⁻¹(q) + ε) of the proof, where it is strictly increasing and continuous. As an example, let F(x) = x + F_C(x) on (a, b) ⊂ (0, 1), where F_C(x) is the Cantor function of Definition III.3.51, which is continuous, increasing, and not absolutely continuous by Example III.3.57. The function F(x) is strictly increasing and continuous on (a, b) by construction, and not absolutely continuous.
When F(x) is absolutely continuous, then F′(x) exists almost everywhere and is Lebesgue measurable by Proposition III.3.59. The Lebesgue integral above is then satisfied with f(x) ≡ F′(x) by Proposition III.3.62, generalized as before since F(a) → 0 as a → −∞ by Proposition 1.3. This representation of F(x) is also satisfied by any function g(x) with g(x) = F′(x) a.e., meaning outside a set of Lebesgue measure 0.
Thus even in this case of absolutely continuous F(x), the derivative F′(x) is not uniquely defined in terms of the density function f(x), since this density is not unique. This is also seen in Proposition III.3.39: if F(x) is expressible as above with measurable f(x), then F′(x) exists almost everywhere and F′(x) = f(x) a.e.
In the very special case of continuous probability theory, where F(x) is continuously differentiable and f(x) is continuous, we obtain an affirmative conclusion to the inquiry of this remark. Then F′(x) = f(x) everywhere by Proposition III.1.33. And in this special case, F⁻¹(q) = x_q, the qth quantile of F, which is well defined since F is locally strictly increasing. Thus:
$$F'(F^{-1}(q)) = f(x_q).$$
Example 6.20 (Estimating F⁻¹(q)) It follows from Proposition 6.18 that X′₍kₙ₎ is approximately standard normally distributed for n large. Hence {|X′₍kₙ₎| ≤ 1.96} defines an approximate 95% confidence interval for this variate, recalling that the 97.5th quantile of the standard normal is z₀.₉₇₅ = 1.96.
Defining kₙ = ⌊qn⌋, the greatest integer less than or equal to qn, it follows that kₙ/n → q as n → ∞. This then obtains a confidence interval for the quantile F⁻¹(q) of X given a random sample ordered variate X̂₍⌊qn⌋₎:
$$\left|\widehat{X}_{(\lfloor qn\rfloor)} - F^{-1}(q)\right| \le \frac{1.96\sqrt{q(1-q)}}{\sqrt{n}\,F'(F^{-1}(q))}.$$
Of course, F and F′ are generally unknown, but assuming F(x) has a continuous density f(x), F′(F⁻¹(q)) ≈ f(X̂₍⌊qn⌋₎) would need to be estimated.
Using the normal density φ(x) of (1.66), it is reasonable to assume that φ(X̂₍⌊qn⌋₎) < f(X̂₍⌊qn⌋₎) since φ(x) has "skinny" tails. Thus replacing F′(F⁻¹(q)) by φ(X̂₍kₙ₎) provides an upper bound for this inequality and a conservative confidence interval:
$$\left|\widehat{X}_{(\lfloor qn\rfloor)} - F^{-1}(q)\right| \le \frac{1.96\sqrt{q(1-q)}}{\sqrt{n}\,\varphi(\widehat{X}_{(k_n)})}.$$
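This interval can be exercised numerically. The sketch below is an illustration only; the standard normal population, q = 0.75, and the quantile value 0.6745 are assumptions, not from the text. It checks the coverage of the interval built with the true F′(F⁻¹(q)).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sketch: coverage of the 95% interval for F^{-1}(q) when
# sampling a standard normal, using the true density at the quantile.
q, n, trials = 0.75, 4_000, 2_000
z_q = 0.6745                      # approximate 75th standard normal quantile
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
half_width = 1.96 * np.sqrt(q * (1 - q)) / (np.sqrt(n) * phi(z_q))

X = np.sort(rng.normal(size=(trials, n)), axis=1)
sample_q = X[:, int(q * n) - 1]   # the floor(qn)-th order statistic
coverage = np.mean(np.abs(sample_q - z_q) <= half_width)
print(round(coverage, 2))
```

The printed coverage should sit near the nominal 95%; replacing the true density with φ evaluated at the sample quantile, as in the conservative interval, can only widen it.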
Given the order statistics {X₍k₎}ⁿₖ₌₁, define:
$$Y_{k,n} = \frac{1}{k}\sum_{j=0}^{k-1}\left(X_{(n-j)} - X_{(n-k)}\right).$$
In other words, Y_{k,n} equals the average of the k "gaps" between X₍n−k₎ and the higher order variates X₍n−j₎ for j = 0 to k − 1.
This result states that if properly normalized, Y_{k,n} is asymptotically normal as n → ∞ and k → ∞.
Example 6.21 Though the proof does not require that k and n increase proportionately, if k = ⌊(1 − q)n⌋ for 0 < q < 1, then k → ∞ as n → ∞. Recall that ⌊x⌋ denotes the greatest integer less than or equal to x, where the greatest integer function, also called the floor function, is defined by:
$$\lfloor x\rfloor \equiv \max\{m \in \mathbb{Z}\,|\,m \le x\}.$$
Then:
$$X_{(n-k)} = X_{(n-\lfloor(1-q)n\rfloor)},$$
and since n − ⌊(1 − q)n⌋ is within 1 of ⌊qn + 1⌋, X₍n−k₎ is essentially the (qn + 1)st order statistic.
Thus Y_{k,n} above is the average of the gaps between this (qn + 1)st order statistic and all higher order statistics X₍m₎ with m > ⌊qn + 1⌋.
The following proposition identifies Y′_{k,n} as a normalized version of Y_{k,n}. This will be justified in the proof, where it will be seen that E[Y_{k,n}] = 1 and Var[Y_{k,n}] = 1/k. Define:
$$Y'_{k,n} = \frac{Y_{k,n} - 1}{1/\sqrt{k}},$$
and let F_{k,n} denote the distribution function of Y′_{k,n}.
Then as n → ∞ and k → ∞:
$$F_{k,n} \Rightarrow \Phi, \qquad (6.22)$$
and Y_{k,n} converges to 1 with probability 1:
$$Y_{k,n} \to_1 1. \qquad (6.23)$$
Thus:
$$\sum_{j=0}^{k-1}\left(X_{(n-j)} - X_{(n-k)}\right) = \sum_{j=0}^{k-1}\sum_{l=1}^{k-j}\frac{E_{l+n-k}}{k-l+1}$$
$$= \sum_{l=1}^{k}\sum_{j=0}^{k-l}\frac{E_{l+n-k}}{k-l+1}$$
$$= \sum_{l=1}^{k} E_{l+n-k}. \qquad (1)$$
Since the collection {E_j}^∞_{j=1} used in the Rényi representation theorem are independent standard exponentials, we can for notational simplicity replace {E_{l+n−k}}ᵏₗ₌₁ in this summation with {E_l}ᵏₗ₌₁. Hence by (1) and the definition of Y_{k,n}:
$$Y_{k,n} = \frac{1}{k}\sum_{i=1}^{k}E_i,$$
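The identity Y_{k,n} = (1/k)Σᵏᵢ₌₁ Eᵢ can be checked by direct simulation of exponentials (an illustration; all parameters are assumptions): the mean of k standard exponentials has mean 1 and variance 1/k, so √k(Y_{k,n} − 1) is close to standard normal for large k.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sketch: a mean of k i.i.d. standard exponentials has
# E = 1 and Var = 1/k, so sqrt(k)*(Y - 1) is approximately N(0, 1).
k, trials = 2_500, 5_000
Y = rng.exponential(1.0, size=(trials, k)).mean(axis=1)
Z = np.sqrt(k) * (Y - 1)

print(round(Z.mean(), 2), round(Z.std(), 2))
```

This is exactly the normalization Y′_{k,n} of the proposition, with the order-statistic gaps replaced by their exponential representation.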
If {Xₙ} is a finite or infinite collection of random variables, the sigma algebra generated by {Xₙ}, denoted σ(X₁, X₂, ...), is the smallest sigma algebra with respect to which each Xₙ is measurable. By Proposition II.3.45:
$$\sigma(X_1, X_2, ...) = \sigma\left(\bigcup\nolimits_n \sigma(X_n)\right),$$
meaning σ(X₁, X₂, ...) is the sigma algebra generated by the collection of sigma algebras {σ(Xₙ)}.
By measurability of random variables, σ(X) ⊂ E for any X, and σ(X₁, X₂, ...) ⊂ E for any {Xₙ}.
Definition 6.24 (Independent random variables) Random variables {Xₙ}^∞ₙ₌₁ defined on a probability space (S, E, λ) are said to be independent random variables if {σ(Xₙ)}^∞ₙ₌₁ are independent sigma algebras. That is, given any finite index subcollection J = (j(1), j(2), ..., j(n)), and {B_{j(i)}}ⁿᵢ₌₁ with B_{j(i)} ∈ σ(X_{j(i)}):
$$\lambda\left(\bigcap_{i=1}^{n}B_{j(i)}\right) = \prod_{i=1}^{n}\lambda\left(B_{j(i)}\right).$$
Finally, we recall the notion of the tail sigma algebra T associated with an arbitrary collection of random variables {Xₙ}^∞ₙ₌₁. The intersection of sigma algebras is a sigma algebra by Proposition I.2.8, and note that the sigma algebras in (6.25) are nested.
Definition 6.25 (Tail sigma algebra: T ≡ T({Xₙ}^∞ₙ₌₁)) Given a probability space (S, E, λ) and a countable collection of random variables {Xₙ}^∞ₙ₌₁, the tail sigma algebra associated with {Xₙ}^∞ₙ₌₁ and denoted T ≡ T({Xₙ}^∞ₙ₌₁), is defined:
$$\mathcal{T} = \bigcap_{n=1}^{\infty}\sigma(X_n, X_{n+1}, X_{n+2}, ...), \qquad (6.25)$$
where σ(Xₙ, Xₙ₊₁, Xₙ₊₂, ...) is the sigma algebra generated by {X_j}^∞_{j=n}.
Thus T ⊂ E, and a tail event is any set A ∈ T .
An example of a tail event, which is at the heart of this section, is the convergence set A of a countable collection of random variables {Xₙ}^∞ₙ₌₁. Note that it is not apparent from this definition even that A ∈ E.
Definition 6.26 (Convergence set) Given a countable collection of random variables {Xₙ}^∞ₙ₌₁ defined on (S, E, λ), define the convergence set A ⊂ S by:
$$A = \left\{\sum_{n=1}^{\infty}X_n(s) \text{ converges}\right\}.$$
To prove that A is a tail event, A ∈ T ≡ T({Xₙ}^∞ₙ₌₁), recall that a series of real numbers Σ^∞ₙ₌₁ aₙ converges if and only if this series satisfies the Cauchy convergence criterion, also called the Cauchy criterion, and named for Augustin-Louis Cauchy (1789–1857).
Definition 6.27 (Cauchy convergence criterion) A series Σ^∞ₙ₌₁ aₙ satisfies the Cauchy criterion if given any ε > 0 there is an N so that |Σⁿⱼ₌ₘ aⱼ| < ε for all n ≥ m ≥ N.
Example 6.28 (Convergence sets are tail events) That the convergence set A ∈ T ≡ T({Xₙ}^∞ₙ₌₁) is intuitively plausible because the convergence of a series does not depend on any finite number of terms.
More formally, we can by the Cauchy criterion define the convergence set as follows, using rational ε:
$$A = \bigcap_{\epsilon\in\mathbb{Q}^+}\bigcup_{N=1}^{\infty}\bigcap_{n\ge m\ge N}\left\{\left|\sum_{j=m}^{n}X_j(s)\right| < \epsilon\right\}. \qquad (6.26)$$
Because {|Σⁿⱼ₌ₘ Xⱼ(s)| < ε} ∈ σ(Xₘ, Xₘ₊₁, Xₘ₊₂, ...), it follows that A ∈ σ(Xₘ, Xₘ₊₁, Xₘ₊₂, ...) for every m, since convergence does not depend on the first m terms, and thus A ∈ T.
In the next two sections we continue the study of convergence of series. The weak laws of large numbers will provide additional information on the weaker notion of convergence in probability, while the strong laws of large numbers will provide additional information on when such series converge with probability one, meaning:
$$\lambda\left[\left\{\sum_{n=1}^{\infty}X_n(s) \text{ converges}\right\}\right] = 1.$$
Notation 6.33 (Pr statements) Using the probability notation Pr is often preferred in probability theory to using the measure-theoretic notation with λ. By definition:
$$\Pr[X \in A] \equiv \lambda\left[\{s \in \mathcal{S}\,|\,X(s) \in A\}\right].$$
Weak laws are most often stated in the context of:
$$\frac{1}{n}S_n \equiv \frac{1}{n}\sum_{j=1}^{n}X_j,$$
with {X_j}^∞_{j=1} a sequence of independent random variables. When random variables are assumed identically distributed, we will investigate convergence in probability to µ ≡ E[X], though this assumption is not necessary and there are other versions of this result. This average of the first n random variables is also denoted:
$$\bar{X}_n \equiv \frac{1}{n}\sum_{j=1}^{n}X_j.$$
We begin with a version of a weak law with the simplest proof, due largely to a strong assumption. It states that when the sequence {X_j}^∞_{j=1} is independent and identically distributed, with finite mean and variance, then the associated average sequence {X̄ₙ} converges in probability to the mean.
Proposition 6.35 (WLLN 1) Let {X_j}^∞_{j=1} be a sequence of independent, identically distributed random variables defined on a probability space (S, E, λ) with finite mean µ and variance σ². Then:
$$\frac{1}{n}\sum_{j=1}^{n}X_j \to_P \mu. \qquad (6.30)$$
Proof. By (4.39) and (4.43):
$$Var\left[\frac{1}{n}\sum_{j=1}^{n}X_j\right] = \frac{\sigma^2}{n}, \qquad E\left[\frac{1}{n}\sum_{j=1}^{n}X_j\right] = \mu.$$
Thus Chebyshev's inequality obtains that for any ε > 0:
$$\Pr\left[\left|\frac{1}{n}\sum_{j=1}^{n}X_j - \mu\right| \ge \epsilon\right] \le \frac{\sigma^2}{n\epsilon^2},$$
and (6.30) follows.
Example 6.36 (Bernoulli's theorem) A special case of this version of the weak law is Bernoulli's theorem of Proposition II.5.3. This result stated that the average of independent, identically distributed binomial random variables {X_j^B}^∞_{j=1} converges in probability to the binomial probability p. The assumptions of the above proposition are satisfied for this variate since X_j^B has finite mean and variance, and unsurprisingly, so too is the conclusion of this result, since p = E[X_j^B].
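A simulation (an illustration; p, ε, and the sample sizes are assumptions) shows both the convergence in probability and the Chebyshev bound σ²/(nε²) from the proof above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative sketch of WLLN 1 / Bernoulli's theorem: the probability
# that the Bernoulli(p) average misses p by eps shrinks with n, and is
# dominated by the Chebyshev bound p(1-p)/(n*eps^2).
p, eps, trials = 0.3, 0.05, 5_000
probs, bounds = [], []
for n in (100, 1_000, 10_000):
    means = rng.binomial(n, p, size=trials) / n
    probs.append(np.mean(np.abs(means - p) >= eps))
    bounds.append(p * (1 - p) / (n * eps**2))

print([round(x, 3) for x in probs])
print([round(x, 3) for x in bounds])
```

The Chebyshev bound is loose for small n; the actual miss probabilities decay much faster, as the normal approximation would predict.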
While the existence of σ² provides a very simple proof thanks to Chebyshev, the above proposition remains true with only the assumption of the existence of the first moment µ. The proof is significantly harder, due principally to the need for the general Lebesgue dominated convergence theorem of Book V for integrals as in (4.7). This theorem is needed for a technical result on "truncations" of random variables which may appear elementary, but resists direct and more elementary validation.
In cases where E[X] as defined in (4.7) can be transformed to a Lebesgue integral as in (4.14), the Lebesgue dominated convergence theorem of Proposition III.2.52 suffices.
Definition 6.37 (Truncation of X) Let X be a random variable on (S, E, λ). Given m > 0, define a random variable X⁽ᵐ⁾, called the truncation of X at m, by:
$$X^{(m)} = \begin{cases} X, & |X| \le m \\ 0, & |X| > m. \end{cases}$$
Proposition 6.38 Let X be a random variable on (S, E, λ) with finite mean E[X]. Then:
1. X⁽ᵐ⁾ has finite mean for all m > 0.
2. E[X⁽ᵐ⁾] → E[X] as m → ∞, and likewise E[|X − X⁽ᵐ⁾|] → 0.
Proof. For item 1, recall the integral representation of E[X] in (4.7). Applying (4.12) and (4.11) obtains that X⁽ᵐ⁾ has finite mean:
$$\left|E\left[X^{(m)}\right]\right| \le E\left[\left|X^{(m)}\right|\right] \le E\left[|X|\right] < \infty.$$
For item 2, define Zₘ ≡ X − X⁽ᵐ⁾. Then Zₘ → 0 pointwise on S, and |Zₘ| ≤ |X| with |X| integrable by existence of E[X] and (4.8). Thus by the general version of Lebesgue's dominated convergence theorem of Book V, E[Zₘ] → 0 and E[|Zₘ|] → 0 as m → ∞, and the result follows.
Exercise 6.39 Provide the details of item 2 above in the case where X = f (x) is a Lebesgue
measurable function that is integrable on R (or Rn ). Given m, define the truncation of f (x)
as above, denoting this fm (x). Use Proposition III.2.52 to complete the proof that for any
α > 0 there exists M = M (α) so that for all m ≥ M :
E [|f − fm |] < α.
Now assume that f (x) = φ(x), the standard normal density function in (1.66), and
formulaically determine M (α) in terms of the normal distribution Φ(x).
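For the standard normal case of the exercise, E[|X − X⁽ᵐ⁾|] = E[|X|; |X| > m], which has the closed form 2φ(m), so the truncation error can be checked by simulation. This sketch is an illustration; the sample size and the grid of m values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative sketch: for standard normal X, the truncation X^(m)
# zeroes X when |X| > m, so E|X - X^(m)| = E[|X|; |X| > m] = 2*phi(m).
X = rng.normal(size=2_000_000)
ests, exacts = [], []
for m in (1.0, 2.0, 3.0):
    Xm = np.where(np.abs(X) <= m, X, 0.0)   # truncation of X at m
    ests.append(np.mean(np.abs(X - Xm)))
    exacts.append(2 * np.exp(-m**2 / 2) / np.sqrt(2 * np.pi))

print([round(e, 3) for e in ests])
print([round(e, 3) for e in exacts])
```

The rapid decay in m reflects the thin normal tails; M(α) can therefore be taken quite small here relative to heavy-tailed examples.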
Proposition 6.40 (WLLN 2) Let {X_j}^∞_{j=1} be a sequence of independent, identically distributed random variables defined on (S, E, λ) with finite mean µ ≡ E[X]. Then (6.30) is satisfied:
$$\frac{1}{n}\sum_{j=1}^{n}X_j \to_P \mu.$$
Proof. Given ε > 0, we prove by truncation that for any δ > 0 there exists N so that for n ≥ N:
$$\Pr\left[\left|\frac{1}{n}\sum_{j=1}^{n}X_j - E[X]\right| \ge \epsilon\right] < \delta. \qquad (1)$$
The reader should verify that (1) implies the convergence in probability of (6.30).
Given m, let X_j^{(m)} denote the truncation of X_j, and define X̄ₙ ≡ (1/n)Σⁿⱼ₌₁ X_j and X̄ₙ^{(m)} ≡ (1/n)Σⁿⱼ₌₁ X_j^{(m)}. By the triangle inequality:
$$\left|\bar{X}_n - E[X]\right| \le \left|\bar{X}_n - \bar{X}_n^{(m)}\right| + \left|\bar{X}_n^{(m)} - E[X^{(m)}]\right| + \left|E[X^{(m)}] - E[X]\right|.$$
By considering the defining sets in S and using subadditivity of the measure λ, it follows that:
$$\Pr\left[\left|\bar{X}_n - E[X]\right| \ge \epsilon\right] \le \Pr\left[\left|\bar{X}_n - \bar{X}_n^{(m)}\right| \ge \epsilon/3\right]$$
$$+ \Pr\left[\left|\bar{X}_n^{(m)} - E[X^{(m)}]\right| \ge \epsilon/3\right] \qquad (2)$$
$$+ \Pr\left[\left|E[X^{(m)}] - E[X]\right| \ge \epsilon/3\right].$$
Given α > 0 to be specified below, there exists M = M(α) by Proposition 6.38 so that E[|X − X⁽ᵐ⁾|] < α for any truncation X⁽ᵐ⁾ with m ≥ M(α). We now prove that each of the probabilities on the right in (2) can be made arbitrarily small by making α small.
1. By the triangle inequality and identical distributions:
$$E\left[\left|\bar{X}_n - \bar{X}_n^{(m)}\right|\right] \le \frac{1}{n}\sum_{j=1}^{n}E\left[\left|X_j - X_j^{(m)}\right|\right] = E\left[\left|X - X^{(m)}\right|\right] < \alpha.$$
2. Let g(x) ≡ χ₍₋ₘ,ₘ₎(x), the characteristic function of the interval [−m, m], defined to equal 1 on this interval and 0 elsewhere. Using Proposition II.3.56 and Borel measurable g(x), independence of {X_j} and X_j^{(m)} = g(X_j)X_j assure independence of {X_j^{(m)}}. Thus by independence, (4.43), and Proposition 6.38:
$$Var\left[\bar{X}_n^{(m)}\right] \le mE[|X|]/n.$$
As E[X⁽ᵐ⁾] = E[X̄ₙ⁽ᵐ⁾] by (4.39), Chebyshev's inequality in (4.82) obtains:
$$\Pr\left[\left|\bar{X}_n^{(m)} - E[X^{(m)}]\right| \ge \epsilon/3\right] \le \frac{9mE[|X|]}{n\epsilon^2}. \qquad (3b)$$
The final generalization of Proposition 6.35 again assumes finite second moments, and applies to an independent sequence of random variables {X_j}^∞_{j=1} with arbitrary distributions. However, an assumption will be needed on the growth rates of the mean and variance of Σⁿⱼ₌₁ X_j as functions of n. By independence, E[Σⁿⱼ₌₁ X_j] = Σⁿⱼ₌₁ E[X_j] by (4.39), and Var[Σⁿⱼ₌₁ X_j] = Σⁿⱼ₌₁ Var(X_j) by (4.43).
Note that the requirements of the next result are automatically satisfied when {X_j}^∞_{j=1} are also identically distributed.
Proposition 6.41 (WLLN 3) Let {X_j}^∞_{j=1} be a sequence of independent random variables defined on (S, E, λ) with finite means {µ_j}^∞_{j=1} and variances {σ²_j}^∞_{j=1}. Denote:
$$m_n = \sum_{j=1}^{n}\mu_j, \qquad s_n^2 = \sum_{j=1}^{n}\sigma_j^2.$$
If mₙ/n → µ and s²ₙ/n² → 0, then:
$$\frac{1}{n}\sum_{j=1}^{n}X_j \to_P \mu.$$
Proof. Letting X̄ₙ ≡ (1/n)Σⁿⱼ₌₁ X_j, the triangle inequality obtains:
$$\left|\bar{X}_n - \mu\right| \le \left|\bar{X}_n - \frac{m_n}{n}\right| + \left|\frac{m_n}{n} - \mu\right|.$$
By consideration of the defining sets in S:
$$\Pr\left[\left|\bar{X}_n - \mu\right| < \epsilon\right] \ge \Pr\left[\left|\bar{X}_n - \frac{m_n}{n}\right| < \epsilon - \left|\frac{m_n}{n} - \mu\right|\right]. \qquad (1)$$
Applying Chebyshev's inequality with E[X̄ₙ] = mₙ/n and Var[X̄ₙ] = s²ₙ/n²:
$$\Pr\left[\left|\bar{X}_n - \frac{m_n}{n}\right| < \epsilon - \left|\frac{m_n}{n} - \mu\right|\right] \ge 1 - \frac{s_n^2/n^2}{\left[\epsilon - |m_n/n - \mu|\right]^2}.$$
Remark 6.42 (On further generalizations) In the above result, the assumption that mₙ/n → µ can be eliminated to obtain the result:
$$\frac{1}{n}\sum_{j=1}^{n}\left(X_j - \mu_j\right) \to_P 0.$$
Similarly, we can replace the independence assumption in any version of the weak law by the assumption that the correlations in (4.45) satisfy ρᵢⱼ ≤ 0 for all i, j. Then Var[Σⁿⱼ₌₁ X_j/n] ≤ s²ₙ/n² by (4.47), and the proof goes through without change.
A positive result is possible even in the nonnegative correlation case if we assume, for example, that for some 0 < r < 1, ρᵢⱼ ≤ r^{|i−j|} and σ²ⱼ ≤ B for all j.
Exercise 6.43 Let {X_j}^∞_{j=1} be a sequence of random variables with the same finite mean µ, and variances σ²ⱼ with σ²ⱼ ≤ B for all j. Show that if ρᵢⱼ ≤ r^{|i−j|} for some 0 < r < 1, then (6.30) remains true. Hint: Repeat the above proof with the new variance estimate from (4.47).
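A natural test case for this exercise (an illustration; the AR(1) construction and all parameters are assumptions, not from the text) is a stationary AR(1) sequence, which has exactly ρᵢⱼ = r^{|i−j|}: its sample average still settles at the common mean µ.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative sketch for Exercise 6.43: a stationary AR(1) sequence with
# unit variance has corr(X_i, X_j) = r^{|i-j|}; its average still
# converges in probability to the common mean mu.
mu, r, n, trials = 2.0, 0.5, 20_000, 200
X = np.empty((trials, n))
X[:, 0] = rng.normal(size=trials)
innovations = rng.normal(size=(trials, n)) * np.sqrt(1 - r**2)
for j in range(1, n):
    X[:, j] = r * X[:, j - 1] + innovations[:, j]
avg = (X + mu).mean(axis=1)

print(round(avg.mean(), 2), round(np.mean(np.abs(avg - mu) >= 0.05), 2))
```

Relative to the i.i.d. case, the variance of the average is inflated by roughly the factor (1 + r)/(1 − r), which is the effect the bound in (4.47) controls.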
This notion is also called convergence almost surely, and then denoted Yₙ →a.s. Y.
As was the case for weak laws, strong laws are most often stated in the context of:
$$Y_n = \frac{1}{n}S_n \equiv \frac{1}{n}\sum_{j=1}^{n}X_j,$$
with {X_j}^∞_{j=1} a sequence of independent random variables. When identically distributed, we will investigate convergence with probability 1 to µ ≡ E[X], though this assumption is not necessary and there are other versions of this result. The average of the first n random variables is also denoted:
$$\bar{X}_n \equiv \frac{1}{n}\sum_{j=1}^{n}X_j.$$
Weak laws and strong laws can be related as follows. Let {X_j}^∞_{j=1} be a sequence of independent, identically distributed random variables defined on a probability space (S, E, λ). Define the set Aₙ(ε) ⊂ S as in II.(5.3):
$$A_n(\epsilon) \equiv \left\{\left|\frac{1}{n}\sum_{j=1}^{n}X_j - \mu\right| \ge \epsilon\right\}.$$
It is an exercise to check that Aₙ(ε) ∈ E and thus is measurable. In the first two versions of the weak law in Propositions 6.35 and 6.40, it was proved that for such independent and identically distributed random variables and any ε > 0:
$$\lambda\left[A_n(\epsilon)\right] \to 0 \text{ as } n \to \infty.$$
As noted in Section II.5.1.1, the sets {Aₙ(ε)}^∞ₙ₌₁ are not necessarily nested, and it is possible to have both Aₙ₊₁(ε) − Aₙ(ε) ≠ ∅ and Aₙ(ε) − Aₙ₊₁(ε) ≠ ∅. Hence there is no apparent event in S associated with weak laws that is definable in terms of some limit of Aₙ(ε) as n → ∞.
For the strong laws, the goal is to determine the measure of the strong convergence set C_S on which this averaging series converges to µ:
$$C_S \equiv \left\{\frac{1}{n}\sum_{j=1}^{n}X_j \to \mu\right\} = \left\{\frac{1}{n}\sum_{j=1}^{n}\left(X_j - \mu\right) \to 0\right\}.$$
This definition was introduced in II.(5.8) in the context of binomial {Xj }∞ j=1 , and called
the “convergence set,” but here we add the qualifier “strong” to distinguish this set from
the general convergence set A of Definition 6.26.
We prove below in Proposition 6.46 that CS ∈ E and is thus measurable. Strong laws
identify conditions under which λ [CS ] = 1.
The strong convergence set C_S is closely related to the Aₙ(ε)-sets, but we must first recall Definition II.2.1 for the notion of the limit superior of a sequence of sets {Aₙ}^∞ₙ₌₁. We include the other limit notions for completeness.
Definition 6.45 (lim sup Aₙ, lim inf Aₙ, lim Aₙ) Given a measure space (S, E, λ) and a countable collection of sets {Aₙ}^∞ₙ₌₁ ⊂ E, define:
1. Limit superior:
$$\limsup\nolimits_n A_n = \bigcap_{n=1}^{\infty}\bigcup_{k=n}^{\infty}A_k. \qquad (6.32)$$
2. Limit inferior:
$$\liminf\nolimits_n A_n = \bigcup_{n=1}^{\infty}\bigcap_{k=n}^{\infty}A_k. \qquad (6.33)$$
3. Limit, when lim supₙ Aₙ = lim infₙ Aₙ ≡ A:
$$\lim\nolimits_n A_n \equiv A. \qquad (6.34)$$
It is common notationally to omit the subscript n when clear from the context.
By definition of C_S:
$$C_S = \bigcap_{j=1}^{\infty}\bigcup_{N=1}^{\infty}\bigcap_{n\ge N}\tilde{A}_n(1/j),$$
where Ãₙ(1/j) denotes the complement of Aₙ(1/j). Thus C_S ∈ E since Aₙ(ε) ∈ E for all n and ε.
Using De Morgan’s laws of Exercise I.2.2 and (6.32), the complement of CS is given by:
[∞
C
eS = lim sup An (1/j). (1)
j=1
h i
If λ [CS ] = 1 then λ C
eS = 0, and thus λ [lim supn An ()] = 0 for all = 1/j by (1). This
is then true for all > 0 since if > 0 :
Conversely, by the nested property in (2) and continuity from below (Proposition I.2.45) of the measure λ:
$$\lambda\left[\tilde{C}_S\right] = \lim_{j\to\infty}\lambda\left[\limsup\nolimits_n A_n(1/j)\right].$$
If λ[lim supₙ Aₙ(1/j)] = 0 for all j, then λ[C̃_S] = 0 and thus λ[C_S] = 1.
Remark 6.47 (SLLN proof strategy) To prove a strong law, we must by the above proposition show that for any ε > 0:
$$\lambda\left[\limsup\nolimits_n A_n(\epsilon)\right] = 0.$$
A powerful tool for proving such a statement is the first result of the Borel-Cantelli lemma of Proposition II.2.6, due to Cantelli:
Cantelli: Given a measure space (S, E, λ) and a countable collection of sets {Aₙ}^∞ₙ₌₁ ⊂ E:
$$\sum_{n=1}^{\infty}\lambda(A_n) < \infty \Longrightarrow \lambda\left(\limsup A_n\right) = 0. \qquad (6.36)$$
Thus if we can show that for any ε > 0:
$$\sum_{n=1}^{\infty}\lambda(A_n(\epsilon)) < \infty,$$
then λ(lim sup Aₙ(ε)) = 0 for all ε > 0, and the strong convergence set C_S then has probability 1 by Proposition 6.46.
We now derive two strong laws. As for the weak law development, the first result will
have excess assumptions to more easily highlight the application of Borel-Cantelli. Much
like the assumption of the existence of σ 2 in the first version of the weak law, we will require
the existence of the fourth central moment µ4 to facilitate the application of Chebyshev’s
inequality.
The existence of µ₄ assures the existence of all lower order moments, as noted in the introduction to Section 4.2.5. Consequently, we do not have to separately assume the existence of the mean µ, and this is actually part of the conclusion.
Proposition 6.48 (SLLN 1) Let {X_j}^∞_{j=1} be a sequence of independent, identically distributed random variables on a probability space (S, E, λ) with fourth central moment µ₄. Then with µ ≡ E[X_j]:
$$\frac{1}{n}\sum_{j=1}^{n}X_j \to_1 \mu.$$
In other words, (1/n)Σⁿⱼ₌₁ X_j converges to the mean µ with probability 1:
$$\Pr\left[\frac{1}{n}\sum_{j=1}^{n}X_j \to \mu\right] = 1. \qquad (6.37)$$
Proof. Let:
$$S_n \equiv \sum_{j=1}^{n}\left(X_j - \mu\right).$$
Expanding Sₙ⁴ and taking expectations, only two types of terms survive: n terms of the form E[(X_j − µ)⁴], and 3n(n − 1) terms of the form E[(X_j − µ)²]E[(X_i − µ)²] with i ≠ j. This follows because all other terms have at least one factor such as E[(X_j − µ)], which equals 0.
By Lyapunov’s inequality in (4.104):
h i h i
2 2 4
E (Xj − µ) E (Xi − µ) ≤ E (Xj − µ) ,
and hence:
E Sn4 ≤ 3n2 − 2n µ4 .
Example 6.49 (Borel’s theorem) A special case of this version of the strong law is
Borel’s theorem presented in Proposition II.5.9. This stated that the average of indepen-
dent, identically distributed binomial random variables {XjB }∞
j=1 , converged with probability
B
1 to the binomial probability p = E[Xj ] by (4.62).
All moments are finite for the binomial by Proposition 4.25, since this variate has a
moment generating function in (4.63). Thus the above assumption that µ4 < ∞ is satisfied.
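Borel's theorem is a statement about a single sample path, which a simulation can display (an illustration; p and the path length are assumptions): the running average along one coin-flip path settles at p and stays there.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative sketch of Borel's theorem: along a single Bernoulli(p)
# path, the running average converges to p; the deviation over the whole
# tail of the path (not just at its end) becomes small.
p, n = 0.6, 200_000
flips = rng.binomial(1, p, size=n)
running = np.cumsum(flips) / np.arange(1, n + 1)
tail_dev = np.max(np.abs(running[100_000:] - p))

print(round(running[-1], 3), round(tail_dev, 3))
```

Bounding the maximal deviation over the whole tail, rather than at a single n, is what separates the strong law from the weak law in this picture.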
Like the weak law, it is also true that the strong law is valid for independent and
identically distributed random variables assuming only the existence of a first moment. See
for example Feller (1968) or Billingsley (1995).
We provide a version of this result that is somewhat weaker because it requires a second moment. But this version is also somewhat more general, as it is also applicable to independent random variables which need not be identically distributed.
For the proof, we will use Kolmogorov's inequality in (4.94), and modify the definition of Aₙ(ε) to a related set A′ₙ(ε).
Proposition 6.50 (SLLN 2) Let {X_j}^∞_{j=1} be a sequence of independent random variables on a probability space (S, E, λ) with means {µ_j}^∞_{j=1}, and variances {σ²_j}^∞_{j=1} with Σ^∞_{j=1} σ²_j/j² < ∞. Then:
$$\frac{1}{n}\sum_{j=1}^{n}\left(X_j - \mu_j\right) \to_1 0.$$
In other words:
$$\Pr\left[\frac{1}{n}\sum_{j=1}^{n}\left(X_j - \mu_j\right) \to 0\right] = 1. \qquad (6.38)$$
It is an exercise to show that A″ₙ(ε) ∈ E (Hint: Proposition I.3.47). The inclusion A′ₙ(ε) ⊂ A″ₙ(ε) follows because if |Σᵏⱼ₌₁ Y_j|/k ≥ ε for any k with 2ⁿ⁻¹ < k ≤ 2ⁿ, then:
$$\max_{2^{n-1}<k\le 2^n}\left|\sum_{j=1}^{k}Y_j\right| \ge \epsilon 2^{n-1}$$
by definition.
By Kolmogorov's inequality in (4.94), the probability of the event A″ₙ(ε) is bounded:
$$\Pr[A''_n(\epsilon)] \le \frac{1}{\epsilon^2 2^{2n-2}}\sum_{j=1}^{2^n}\sigma_j^2.$$
$$X_{(k_n)} \to_1 x^*. \qquad (6.40)$$
Proof. For r < x*, define the random variable Z_j = χ₍r,∞₎(X_j). Each Z_j is binomially distributed with:
$$p \equiv \Pr[Z_j = 1] = 1 - F(r).$$
Thus by (4.62), E[Z_j] = 1 − F(r) and Var[Z_j] = F(r)(1 − F(r)).
By Proposition 6.48 on the strong law, as n → ∞:
$$\frac{1}{n}\sum_{j=1}^{n}Z_j \to_1 1 - F(r). \qquad (1)$$
Since F is continuous and F(r) → 1 as r → x*, it follows from (1) and the definition of Z_j that for any r < x*:
$$\frac{1}{n}\sum_{j=1}^{n}\chi_{(r,\infty)}(X_j) \to_1 1 - F(r) > 0. \qquad (2)$$
But kₙ/n → 1 by assumption, and this implies that the right hand summation converges to 0 with probability 1. This contradicts (2), that this summation converges to 1 − F(r) > 0. Thus for all r < x*:
$$\lambda\left[\limsup\nolimits_n \{X_{(k_n)} \le r\}\right] = 0,$$
and by complementarity:
$$\lambda\left[\liminf\nolimits_n \{X_{(k_n)} > r\}\right] = 1. \qquad (3)$$
then:
$$\lambda\left[\bigcap_{j=1}^{\infty}B_j\right] = 1, \qquad (4)$$
where by Definition 6.45:
$$\bigcap_{j=1}^{\infty}B_j = \bigcap_{j=1}^{\infty}\bigcup_{n=1}^{\infty}\bigcap_{m=n}^{\infty}\{X_{(k_m)} > r_j\}.$$
If s ∈ ⋂^∞ⱼ₌₁ B_j, then for every j there exists n so that X₍kₘ₎(s) > r_j for all m ≥ n, and thus lim_{m→∞} X₍kₘ₎(s) > r_j. As this is true for all r_j, it follows from lim_{m→∞} X₍kₘ₎(s) ≤ x* and r_j → x* that lim_{m→∞} X₍kₘ₎(s) = x*. Hence:
$$\bigcap_{j=1}^{\infty}B_j \subset \left\{\lim_{m\to\infty}X_{(k_m)} = x^*\right\},$$
Corollary 6.53 (X₍kₙ₎ →₁ x_*) Let {X_j}^∞_{j=1} be a sequence of independent, identically distributed random variables on a probability space (S, E, λ) with a continuous distribution function F. For each n let {X₍k₎}ⁿₖ₌₁ be the order statistics associated with {X_j}ⁿⱼ₌₁.
If kₙ/n → 0, then X₍kₙ₎ → x_* with probability 1:
$$X_{(k_n)} \to_1 x_*, \qquad (6.41)$$
where:
$$x_* \equiv \sup\{x\,|\,F(x) = 0\}, \qquad x_* \equiv -\infty \text{ if } F(x) > 0 \text{ for all } x.$$
Proof. Left as an exercise. Hint: Define the random variable Y = −X. Relate F_X(x) and F_Y(y), and show that x_* = −y*. Use the above result by relating kₙ for X with k′ₙ for Y.
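Both endpoint results can be visualized at once for uniform samples, where x* = 1 and x_* = 0. This sketch is an illustration; the choices kₙ = n − ⌊√n⌋ and kₙ = ⌊√n⌋ are assumptions satisfying kₙ/n → 1 and kₙ/n → 0 respectively.

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative sketch: for uniform [0,1] samples, an order statistic with
# k_n/n -> 1 tends to x* = 1, and one with k_n/n -> 0 tends to x_* = 0.
highs, lows = [], []
for n in (1_000, 100_000):
    X = np.sort(rng.uniform(size=n))
    k_hi = n - int(np.sqrt(n))        # k_hi / n -> 1
    k_lo = int(np.sqrt(n))            # k_lo / n -> 0
    highs.append(X[k_hi - 1])
    lows.append(X[k_lo - 1])

print([round(x, 3) for x in highs], [round(x, 3) for x in lows])
```

As n grows, the two order statistics squeeze toward the endpoints of the support, which is the content of (6.40) and (6.41).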
a random variable with a given distribution function. In particular it reflects the necessary
assumption that this data is more or less stable, or stationary, over the time period
observed. This issue is often addressed informally based on an intuitive understanding of
the selection process for the given data, rather than by a formal analysis which seeks to
prove that such a distribution function must exist.
For example, if one looked at a series of observations of the month-end value of a given
equity index over the last 25 years, it would hardly seem logical to define this as a random
variable and then attempt to identify the associated distribution function. Outside of inher-
ent volatility and market corrections, most everyone’s expectation is that the value of this
index will increase over time, due at least to inflation and productivity. So while we can
formally identify a “distribution function” for the given historical data, it is not clear that
such an effort will produce something of value, if by “of value” we mean, providing some
predictive insights for the future.
On the other hand, if we instead convert this price series into a monthly return series,
there would seem to be a better argument that the distribution of returns over time had
some stability, and hence some predictability. This argument is then challenged by the fact
that such observations can potentially provide very different insights when grouped into
various subperiods.
This forces the analyst to choose between several competing interpretations, a few being:
1. There is no single underlying distribution function for the given data because, for ex-
ample, different periods have different distributions and the timing and magnitude of
the change between distributions is unpredictable.
2. There is an underlying distribution function but no single period reveals all of its qual-
ities; thus more data over longer periods is needed to reveal the ultimate distribution.
3. There may have been an underlying distribution function, but due to a significant event,
the distribution in the future can be expected to be different.
Unfortunately, there is no universally accepted approach to resolving which of these or
other interpretations is correct in a given situation. Ultimately, such a data analysis and
the assumptions made to justify this analysis are part of the quantitative analyst’s model
building process, within which many other assumptions will also likely be made.
Are the assumptions valid? Is the model correct? What questions can the model answer?
Perhaps the best summary comment on this matter is one attributed to the statistician
George E. P. Box (1919–2013). While there are many versions of his dictum, a commonly
cited version is the quote:
“All models are wrong, but some are useful.”
In this section, we study empirical distribution functions constructed from samples
from a given fixed distribution, and investigate convergence of this empirical distribution
to the underlying theoretical distribution function as the sample size increases.
Thus nothing in this section will ease the plight of the analyst in deciding if the random
samples obtained are indeed from a given distribution. What these results confirm is that
IF the random sample has such an underlying distribution function, then the empirical
distributions constructed will provide insights to it.
can be constructed and defined in terms of the first n components of a point in an
infinite-dimensional space (R^N, σ(R^N), µ_N), and this allows one to investigate various
results as n → ∞.
That {Xj }nj=1 is a “sample” means that this collection is i.i.d., or independent and
identically distributed.
But there are two interpretations for this collection in this section:
1. Consistent with Chapter 6, these are independent, identically distributed random vari-
ables for any n, constructed on a probability space such as (RN , σ(RN ), µN ).
2. Consistent with empirical analysis, these are a random sample of the variate X, which
can be envisioned as the numerical values {Xj (s)}nj=1 for some s ∈ RN . What makes
this numerical sample “random” is that these are deemed to have been obtained from a
process that, if repeated many times, would produce collections which are approximately
i.i.d.
While all empirical analysis is based on the random samples of item 2, one can only make
probabilistic statements on the resulting empirical distributions by interpreting such samples
within the context of item 1. Hence this section alternates between these perspectives.
Given a random sample, one approach to visualizing the underlying assumed-to-exist dis-
tribution function F (x) is the construction of a histogram, which focuses on the underlying
assumed-to-exist density function f (x). In this construction one assigns a probability of
1/n to each observed variate, and this can be formally justified by Proposition II.4.9 of the
Chapter 5 introduction, as follows.
If F (x) is a given distribution function, such a sample can be produced as in Chapter
5 by {F ∗ (Uj )}nj=1 , where {Uj }nj=1 is a numerical sample from the continuous, uniform
distribution on (0, 1), and F ∗ (y) denotes the left-continuous inverse of F (x). Each such Uj
is independent and uniformly distributed, so for any interval [c, d] ⊂ (0, 1) :
Pr{Uj ∈ [c, d]} = d − c.
By continuity of measures, this same result is true for (c, d) ⊂ (0, 1) or semi-closed intervals.
By independence:
Pr{(U_1, ..., U_n) ∈ ∏_{j=1}^n [c_j, d_j]} = ∏_{j=1}^n (d_j − c_j).
Thus the value of each variate Uj is deemed equally likely, as is the value of each observed
variate Xj ≡ F ∗ (Uj ) , and one logically assigns a probability of 1/n to each value of Xj .
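As a computational sketch of this construction (illustrative only; the Exponential(1) test case and the helper names are assumptions, not from the text):

```python
import math
import random

def sample_from_F(n, F_star, seed=0):
    """Draw {X_j}_{j=1}^n = {F*(U_j)} with U_j i.i.d. uniform on (0,1)."""
    rng = random.Random(seed)
    return [F_star(rng.random()) for _ in range(n)]

# Exponential(1) as a concrete test case: F(x) = 1 - e^{-x} for x >= 0 has
# left-continuous inverse F*(y) = -ln(1 - y) on (0,1).
F_star = lambda y: -math.log(1.0 - y)
sample = sample_from_F(10_000, F_star)
mean = sum(sample) / len(sample)   # should be near E[X] = 1
```

Each X_j = F∗(U_j) then has distribution F, and assigning probability 1/n to each observed value mirrors the uniform sampling of the U_j.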
A histogram is then a graphical depiction of the sample {Xj }nj=1 with each observation
given probability 1/n, and this provides visual clues on the underlying density function f (x).
There are two approaches for creating a histogram.
1. If {X(j) }nj=1 are the associated order statistics of the sample, then for each j we assume
X(j) ∈ [aj , bj ) where {[aj , bj )}nj=1 are disjoint intervals with union equal to the assumed
range of X and:
∫_{a_j}^{b_j} f(x) dx = 1/n.
For example, one could define b_j = (X_(j) + X_(j+1))/2 = a_{j+1}, and extend symmetrically
to a_1 and b_n. Alternatively, one could define a_j = X_(j) = b_{j−1}.
In this representation, the empirical density function f(x) is assumed constant on
each interval:
f(x) = 1/(n(b_j − a_j)),  x ∈ [a_j, b_j).   (6.42)
192 Limit Theorems
Thus a histogram is not a discrete probability density function, but a piecewise con-
tinuous approximation to the unknown density f (x). The goal of these constructions is to
provide some information on the “shape” of f (x). Requiring aj+1 = bj and ck+1 = dk reflects
the assumption that the range of X is connected. In the first approach, the various intervals
will have different lengths and this uniform probability assumption will get translated into
the various values for f (x) by (6.42). In the second approach the bin intervals have fixed
length, and the various values for f (x) given by (6.43) provide a frequency count for that
interval.
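The first, equal-probability construction can be sketched as follows (a hypothetical illustration; the function name and the small test sample are assumptions):

```python
def equal_probability_histogram(sample):
    """Approach 1: bins [a_j, b_j) with midpoints between adjacent order
    statistics, each bin assigned probability 1/n, so the empirical density
    is f(x) = 1/(n (b_j - a_j)) on bin j, as in (6.42)."""
    xs = sorted(sample)              # order statistics X_(1) <= ... <= X_(n)
    n = len(xs)
    bounds = [xs[0] - (xs[1] - xs[0]) / 2]        # extend symmetrically to a_1
    bounds += [(xs[j] + xs[j + 1]) / 2 for j in range(n - 1)]
    bounds += [xs[-1] + (xs[-1] - xs[-2]) / 2]    # ... and to b_n
    heights = [1.0 / (n * (bounds[j + 1] - bounds[j])) for j in range(n)]
    return bounds, heights

bounds, heights = equal_probability_histogram([1.0, 2.0, 4.0, 8.0])
# total probability = sum of height * width, which equals 1 by construction
total = sum(h * (bounds[j + 1] - bounds[j]) for j, h in enumerate(heights))
```

Each bin carries probability exactly 1/n, so bins over sparse regions of the sample are wide and low, tracing the shape of the unknown density f(x).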
An alternative approach, which we study in this section, reflects the empirical distri-
bution function defined as follows.
Definition 6.54 (Empirical distribution function) Given a sample {Xj }nj=1 of a ran-
dom variable X defined on a probability space (S, E, λ), the associated empirical distri-
bution function, denoted Fn (x) ≡ Fn (x| {Xj }nj=1 ), is defined:
F_n(x) = (1/n) Σ_{j=1}^n χ_(−∞,x](X_j),   (6.44)
where χ_(−∞,x](y) is the characteristic function of (−∞, x], defined to equal 1 for y ∈
(−∞, x] and 0 otherwise.
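Definition 6.54 translates directly into code; a minimal sketch (the helper name is an assumption):

```python
import bisect

def empirical_cdf(sample):
    """Return F_n as in (6.44): F_n(x) = (1/n) * #{j : X_j <= x}."""
    xs = sorted(sample)
    n = len(xs)
    def Fn(x):
        # number of X_j with X_j <= x, via binary search on the sorted sample
        return bisect.bisect_right(xs, x) / n
    return Fn

Fn = empirical_cdf([3.0, 1.0, 2.0, 2.0])
```

By construction this F_n is increasing and right-continuous, with a jump of (multiplicity)/n at each sample point, consistent with the discussion that follows.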
Given {Xj }nj=1 , we can look at Fn (x) from at least two perspectives based on the com-
ments above:
1. For a fixed random sample {Xj }nj=1 , Fn (x) ≡ Fn (x| {Xj }) is indeed a distribution
function by Proposition 1.4. Specifically, Fn (x) is increasing and right-continuous, and
when defined as limits, Fn (−∞) = 0 and Fn (∞) = 1.
In addition, Fn (x) is continuous as a function of x except at {Xj }nj=1 with:
F_n(X_j) = F_n(X_j^−) + 1/n,
where F_n(X_j^−) denotes the left limit of F_n(x) at X_j.
This interpretation is conceptually consistent with the histograms above, but Fn (x)
only equals the distribution function associated with these empirical density functions
in case 1 with aj = X(j) = bj−1 . In the other two constructions the discontinuities of
the associated distribution functions will occur at {aj }nj=1 and {ck }m k=1 rather than at
{Xj }nj=1 . In addition, the increases at the discontinuities associated with (6.43) will not
in general equal 1/n.
2. For fixed x and {X_j}_{j=1}^n interpreted as i.i.d. random variables on the probability space
(R^N, σ(R^N), µ_N), let {Y_j}_{j=1}^n be defined by Y_j ≡ χ_(−∞,x](X_j):
Y_j : (R^N, σ(R^N), µ_N) → (R, B(R), m).
Then {Y_j}_{j=1}^n are standard binomial variables defined on this probability space, with:
Pr{Y_j = 1} = F(x),  Pr{Y_j = 0} = 1 − F(x).
Further, {Yj }nj=1 are independent by Proposition II.3.56, since g(y) = χ(−∞,x] (y) is
Borel measurable for any x.
Thus, for each x :
F_n(x) ≡ (1/n) Σ_{j=1}^n Y_j,   (6.45)
is a random variable on this probability space, and in fact is a general binomial with
parameters n and p = F (x).
Hence, in this case we have by (4.62) that for each x:
E[F_n(x)] = F(x),  Var[F_n(x)] = F(x)(1 − F(x))/n.   (6.46)
Based on the second interpretation, the following proposition presents two results which
are corollaries to the earlier results of this section. For these results we fix x and consider
N
Fn (x) of (6.45) as a random variable on the probability space (S, E, λ) ≡ (RN , σ(R ), µN ).
Thus in this result, F (x) is also a constant.
Proposition 6.55 (Limit results for Fn (x), fixed x) Let {Xj }nj=1 be independent, iden-
tically distributed random variables on (S, E, λ) with distribution function F (x), and define
Fn (x) as in (6.45).
Then:
1. For each x:
F_n(x) →_1 F(x).
In other words,
λ{F_n(x) → F(x)} = 1.
2. For each x, with:
Z_n = (F_n(x) − F(x))/√(F(x)(1 − F(x))/n),
it follows that Z_n ⇒ Z as n → ∞, where Z is a standard normal variable.
Proof. Item 1 is an immediate consequence of the strong law of large numbers of Proposition
6.48, while item 2 is an application of the central limit theorem of Proposition 6.13 using
(6.46).
The above proposition provides the important insight that random samples are informa-
tive relative to the underlying distribution function F at each x, when these random samples
are defined as realizations of i.i.d variates. Not only does Fn (x) converge to F (x) for each
x with probability 1, but we can in theory make approximate probability statements about
F (x) based on the random sample values {Xj }nj=1 and value of Fn (x).
Example 6.56 (Confidence limits for F(x), fixed x) As is typical in estimating con-
fidence intervals about an unknown binomial parameter p ≡ F(x), we do not know the exact
standard deviation √(F(x)(1 − F(x))/n), and must approximate this with the sample value:
σ̂ ≡ √(F_n(x)(1 − F_n(x))/n).
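This approximation can be sketched numerically (the choice z = 1.96 for an approximate 95% level, and the function name, are assumptions of the illustration):

```python
import math

def pointwise_ci(Fn_x, n, z=1.96):
    """Approximate confidence interval for F(x) at a fixed x, using the normal
    approximation to the binomial n*F_n(x) and the sample estimate
    sigma_hat = sqrt(F_n(x)(1 - F_n(x))/n) of the unknown standard deviation."""
    sigma_hat = math.sqrt(Fn_x * (1.0 - Fn_x) / n)
    return (max(Fn_x - z * sigma_hat, 0.0), min(Fn_x + z * sigma_hat, 1.0))

lo, hi = pointwise_ci(0.30, 100)
# sigma_hat = sqrt(0.3 * 0.7 / 100) ~ 0.0458, so the interval is roughly (0.21, 0.39)
```

Note that this interval is pointwise in x; the DKW band discussed below this section controls all x simultaneously and is therefore wider.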
Hence, for each x we can estimate the value of F(x) from a random sample as above.
Further, we have probability 1 convergence F_n(x) →_1 F(x), where F_n(x) is interpreted as a
random variable, and F(x) a constant. However, this theory does not immediately support
any conclusions on the extent to which F_n converges to F more generally.
Indeed, for each x the above convergence result implies existence of a measurable set
A_x ⊂ S with λ(A_x) = 0 and F_n(x) → F(x) for s ∈ S − A_x. But as there are uncountably
many such x, there is no theory to suggest that the union ∪_x A_x is measurable, and even
if measurable, it is possible that λ(∪_x A_x) > 0. For example, m({x}) = 0 for all x ∈ [0, 1],
but m(∪_x {x}) = 1.
Consequently, it is possible in theory that S − ∪_x A_x, the set on which F_n(x) → F(x)
for all x, is not measurable, or measurable with probability less than 1.
Remark 6.57 (On right continuity of F(x)) It is tempting to think that right continu-
ity of the distribution function F(x) would simplify the above discussion. Specifically, define
A = ∪_{x∈Q} A_x, so A is the union of A_x for rational x. Then since Q is countable, it follows
that A ∈ E is measurable with λ(A) = 0. Further, for all s ∈ S − A, a set of probability 1,
we have that F_n(x) → F(x) for all x ∈ Q.
It then seems compelling by right continuity that for all s ∈ S − A, F_n(x) → F(x)
for all x ∈ R − Q as well.
What is clear is that if x ∈ R − Q, and if {xm } ⊂ Q with xm > x and xm → x, then:
Although |F_n(x) − F(x)| is a random variable on (S, E, λ) ≡ (R^N, σ(R^N), µ_N) for every x,
the definition of Dn (s) requires the supremum over uncountably many x. Hence Proposition
I.3.47 does not apply, and we cannot immediately assert that Dn (s) is a measurable function.
However, by right continuity it can be verified as an exercise that:
In other words, Dn (s) → 0 with probability 1, and thus Fn (x) → F (x) uniformly in x
outside a set A ∈ E with λ (A) = 0.
Proof. Define:
F_n(x^−) = (1/n) Σ_{j=1}^n χ_(−∞,x)(X_j(s)),
and note that Fn (x− ) is again a binomial random variable on (S, E, λ), now with probability
p− = F (x− ), the left limit of F at x. Another application of the strong law of large numbers
proves as in Proposition 6.55 that for each x, Fn (x− ) → F (x− ) outside a set Bx ∈ E of
λ-measure 0.
With F ∗ denoting the left-continuous inverse of F defined in (1.52), let xk/m = F ∗ (k/m)
for integer m, and 1 ≤ k ≤ m − 1. By Proposition II.3.19, F (F ∗ (y)− ) ≤ y ≤ F (F ∗ (y)) for
all y ∈ (0, 1). Letting y = k/m obtains for 1 ≤ k ≤ m − 1 :
F(x_{k/m}^−) ≤ k/m ≤ F(x_{k/m}).
Then for 2 ≤ k ≤ m − 1:
F(x_{k/m}^−) − F(x_{(k−1)/m}) ≤ 1/m,   (1a)
and:
F(x_{1/m}^−) ≤ 1/m,  (m − 1)/m ≤ F(x_{(m−1)/m}).   (1b)
Now define:
D_{n,m}(s) = max{ max_{1≤k≤m} |F_n(x_{k/m}) − F(x_{k/m})|, max_{1≤k≤m} |F_n(x_{k/m}^−) − F(x_{k/m}^−)| }.
To see this, let (k − 1)/m ≤ x < k/m where 2 ≤ k ≤ m − 1. Then by the monotonicity
of distributions and the inequality in (1a) :
Hence:
A similar analysis applies when 0 < x < 1/m and (m − 1)/m < x < 1, and is left as an
exercise. This now proves (2).
Recalling Proposition 6.55, let:
A = ∪_{k,m} (A_{x_{k/m}} ∪ B_{x_{k/m}}),
where A_{x_{k/m}} ⊂ S is the exceptional set of measure zero outside of which F_n(x_{k/m}) →
F(x_{k/m}), and B_{x_{k/m}} ⊂ S is similarly defined relative to F_n(x_{k/m}^−) → F(x_{k/m}^−). Then A is
a countable union of sets of measure zero and hence λ(A) = 0. Further, if s ∈ S−A, then
Dn,m (s) → 0 as n → ∞ for any m by definition.
Hence by (2), if s ∈ S−A then limn→∞ Dn (s) ≤ 1/m for every m, proving (6.49).
Exercise 6.61 Complete the proof of (2) of the above proof by extending (3) to 0 < x < 1/m
and (m − 1)/m < x < 1. Hint: Repeat above derivation with (1b), and recall the limits in
(1.4) and (1.5).
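Although D_n involves a supremum over uncountably many x, for continuous F it reduces to a finite maximum over the order statistics, since F_n is constant between them. A sketch of this standard computation (the reduction formula is a well-known device, not taken from the text):

```python
def kolmogorov_distance(sample, F):
    """D_n = sup_x |F_n(x) - F(x)| for continuous F: since F_n is constant
    between order statistics, the supremum is attained by comparing F(X_(k))
    against k/n (from the right) and (k-1)/n (from the left)."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(k / n - F(x), F(x) - (k - 1) / n)
               for k, x in enumerate(xs, start=1))

# Check against the uniform distribution on (0,1): for the evenly spaced
# "sample" 0.05, 0.15, ..., 0.95 with F(x) = x, every comparison gives 0.05,
# so D_n = 1/(2n) = 0.05 here.
F = lambda x: min(max(x, 0.0), 1.0)
grid = [(2 * k - 1) / 20 for k in range(1, 11)]
d = kolmogorov_distance(grid, F)
```

For an i.i.d. sample from F, the Glivenko-Cantelli result above says this quantity tends to 0 with probability 1 as n → ∞.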
Kolmogorov derived this result by first proving that the limiting distribution in (6.50)
is independent of F (x) for continuous distributions, and then explicitly derived the limiting
distribution for the uniform distribution function FU (x) associated with the density function
defined in (1.51). This reduction utilizes the tools already developed so we provide this detail.
Proposition II.4.9 also obtains that if F is continuous, then the distribution function of
F (X) is the continuous uniform distribution:
With Uj ≡ F (Xj ), if {Xj }nj=1 are independent then {Uj }nj=1 are independent uniform vari-
ates by Proposition II.4.9. Thus by (2) :
F_n(F∗(y)) = (1/n) Σ_{j=1}^n χ_(−∞,y](U_j) ≡ F_{U,n}(y),
Several years after Kolmogorov’s result, Nikolai Smirnov (1900–1966) in 1939 derived
a comparable limit theorem but for the measure:
and in the same year developed results for the difference between two empirical distributions.
For example, on Dn+ (s) he proved:
Proposition 6.65 (Smirnov’s theorem) Let {Xj }nj=1 be independent, identically dis-
tributed random variables on (S, E, λ) with continuous distribution function F (x). Let
F_{D_n^+}(t) denote the distribution function of D_n^+ as defined in (6.51). Then for all t > 0:
F_{D_n^+}(t/√n) → 1 − e^{−2t²},   (6.52)
In 1956 this earlier work was used to develop a type of large deviation estimate as
discussed in the next section, but for Dn with finite n. The first result is called the
Dvoretzky-Kiefer-Wolfowitz inequality or the Dvoretzky-Kiefer-Wolfowitz the-
orem, and named for Aryeh (Arie) Dvoretzky (1916–2008), Jack Kiefer (1924–1981)
and Jacob Wolfowitz (1910–1981). Their original result stated the inequality below in
(6.53) with an undefined coefficient C. This result was then improved in 1990 by Pascal
Massart by the derivation of a sharp estimate of C = 2. By sharp is meant that it is the
best possible bound.
Note that the following result does not require that F be continuous.
Equivalently,
Pr[D_n ≤ t] ≥ 1 − 2e^{−2nt²},  t > 0.   (6.54)
Example 6.67 (Confidence band for F(x)) The DKW theorem allows the estimation
of a “confidence band” about the entire empirical distribution function which will contain
the theoretical distribution function with the degree of confidence implied by (6.54).
For example, if n = 100 and a 95% confidence band for F(x) is desired, we choose t so
that 1 − 2e^{−200t²} = 0.95, producing t = 0.13581. Then with F_n(x) denoting the empirical
distribution function of (6.45) based on the given sample of 100, (6.53) implies the 95%
confidence band:
max(F_n(x) − 0.13581, 0) ≤ F(x) ≤ min(F_n(x) + 0.13581, 1).
Letting Fn (x, s) be defined as in (6.45) with Yj (s) = χ(−∞,x] (Xj (s)), this confidence
band implies that if A ⊂ S is defined by:
A = {s | max(F_n(x, s) − 0.13581, 0) ≤ F(x) ≤ min(F_n(x, s) + 0.13581, 1) for all x},
then:
λ [A] ≥ 0.95.
For a 100α% confidence band, the solution to 1 − 2e^{−2nt²} = α is given by:
t = [(1/(2n)) ln(2/(1 − α))]^{1/2}.
If k is the largest integer with k/n ≤ t, then the lower bound of 0 is achieved when x ≤
X(k) , the kth order statistic, and the upper bound of 1 is achieved when x ≥ X(n−k) . Thus
the confidence interval is informative primarily for X(k) < x < X(n−k) , an interval that
decreases as α increases.
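A minimal sketch of these band calculations (function names are assumptions):

```python
import math

def dkw_half_width(n, alpha):
    """Half-width t solving 1 - 2 exp(-2 n t^2) = alpha, as in Example 6.67."""
    return math.sqrt(math.log(2.0 / (1.0 - alpha)) / (2.0 * n))

def dkw_band(Fn_x, n, alpha):
    """Band [max(F_n(x) - t, 0), min(F_n(x) + t, 1)] that contains F(x) for
    all x simultaneously with probability at least alpha, per (6.53)/(6.54)."""
    t = dkw_half_width(n, alpha)
    return max(Fn_x - t, 0.0), min(Fn_x + t, 1.0)

t = dkw_half_width(100, 0.95)   # ~ 0.13581, matching the example above
lo, hi = dkw_band(0.5, 100, 0.95)
```

Since the half-width shrinks like 1/√n, quadrupling the sample size halves the width of the band at every x.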
7
Estimating Tail Events 2
In this chapter, we continue the investigations initiated in Chapter II.9, now using properties
of expectations and the moment generating function. As in Book II, we investigate properties
of the right tail of the distribution, focusing on results as n → ∞ :
• Large Deviation Theory studies tail probabilities related to the average of n inde-
pendent random variables;
• Extreme Value Theory studies the limiting distribution of the maximum of n in-
dependent random variables.
Summary 7.1 (Book II) If {Xj }nj=1 are independent and identically distributed on a
probability space (S, E, λ), define:
S_n ≡ Σ_{j=1}^n X_j,
In this section, we develop additional insights on the bounding function π(t). We begin
with the explicit identification of another bounding function when a moment generating
function exists, called the Chernoff bound. After developing properties of this bound, we
prove the Cramér-Chernoff theorem, which asserts that this bound is arbitrarily close
to the π(t)-bound as n → ∞.
Pr[|X| ≥ t] ≤ E[|X|]/t,
and it follows that for θ > 0:
Pr{S_n ≥ nt} = Pr{exp(θS_n) ≥ exp(nθt)} ≤ E[exp(θS_n)]/exp(nθt) = M_X^n(θ) exp(−nθt),
and so:
Pr {Sn ≥ nt} ≤ exp [−n [θt − ln MX (θ)]] .
As this inequality is true for all θ > 0, we have proved the following:
Proposition 7.2 (Bound for Pr{S_n ≥ nt}) If S_n = Σ_{j=1}^n X_j with {X_j}_{j=1}^n indepen-
dent and identically distributed on a probability space (S, E, λ), and if M_X(θ) exists on
(−t_0, t_0) with t_0 > 0, then for any t > 0:
Pr{S_n ≥ nt} ≤ exp[−n sup_{θ≥0}(θt − ln M_X(θ))].   (7.1)
Proof. The bound in (7.1) was derived above from Markov's inequality, while (7.2) follows
from the definition of π(t) and Proposition II.9.2. In detail, the definition of π_n obtains that
π_n/n ≤ −sup_{θ≥0}[θt − ln M_X(θ)]. As this is true for all n, the result follows from Proposition
II.9.2 and the definition of π(t).
Remark 7.3 (Chernoff bound) The upper bound in (7.1) is known as the Chernoff
bound, and named for Herman Chernoff.
Γ(θ) ≡ θt − ln MX (θ),
Then Γ(θ∗ (t)) is called the rate function for X, and it follows from (7.1) that the Chernoff
bound can be expressed:
Pr {Sn ≥ nt} ≤ exp [−nΓ(θ∗ (t))] . (7.3)
Remark 7.5 (Questions on the Chernoff bound) There are two immediate questions
that arise from the Chernoff bound:
1. When is Γ(θ∗ (t)) ≡ supθ≥0 (θt − ln MX (θ)) > 0, so that the bound in (7.3) provides an
estimate of the exponential decay rate for Pr {Sn ≥ nt}?
Large Deviation Theory 2 203
2. If Γ(θ∗(t)) > 0, is this estimate of exponential decay close to the best possible using π(t),
recalling from (7.2) that this estimate is, at the moment, just an upper bound to e^{nπ(t)}?
Example 7.6 In this example, we investigate the Chernoff bound for binomial and normal
variates.
1. Binomial Xj :
From (4.63), M_B(θ) = 1 + p(e^θ − 1), and so:
Γ_B(θ) = θt − ln(1 + p(e^θ − 1)).
It is an exercise to check that Γ_B′′(θ) < 0 for all θ, and hence Γ_B(θ) is concave and has
a maximum when Γ_B′(θ) = 0.
A calculation shows that this equation is solvable for any t with 0 < t < 1, and has a
solution at θ∗ ≡ θ∗(t) defined by:
θ∗ = ln[t(1 − p)/(p(1 − t))].   (1)
As a function of t, the rate function ΓB (θ∗ (t)) is strictly increasing for p < t < 1 since:
It follows from this and ΓB (θ∗ (p)) = 0 that ΓB (θ∗ (t)) > 0 for such t. Further, ΓB (θ∗ (t))
is a convex function of t.
Hence for t with p < t < 1 :
and Pr{S_n ≥ nt} decreases exponentially as n → ∞. This bound in (7.4) can also be
expressed using (1):
Pr{S_n ≥ nt} ≤ [(t/p)^t ((1 − t)/(1 − p))^{1−t}]^{−n},
2. Normal Xj :
From (4.78), M_N(θ) = exp(µθ + σ²θ²/2), and so:
Γ_N(θ) = (t − µ)θ − σ²θ²/2.
It again follows that Γ_N′′(θ) < 0 for all θ, and hence Γ_N(θ) is concave and has a maximum
when Γ_N′(θ) = 0.
A calculation obtains that this maximum occurs at θ∗ ≡ θ∗(t):
θ∗ = (t − µ)/σ²,
and thus θ∗ > 0 if and only if t > µ = E[X]. We then have that supθ≥0 ΓN (θ) = ΓN (θ∗ )
where:
Γ_N(θ∗(t)) = (t − µ)²/(2σ²),   (7.5)
and the rate function for the normal is increasing and convex for t > µ.
Thus by (7.3):
Pr{S_n ≥ nt} ≤ exp[−n(t − µ)²/(2σ²)].   (7.6)
Setting n = 1, this bound can be compared to that derived in II.(9.49) of Example II.9.48:
Pr{S_1 ≥ t} ≤ (1/√(2π)) (σ/(t − µ)) exp[−(t − µ)²/(2σ²)].
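Both rate functions can be checked numerically against exact tail probabilities; in this sketch (parameter choices are illustrative, not from the text) the Chernoff bounds dominate the exact tails, as (7.4) and (7.6) require:

```python
import math

def binomial_rate(t, p):
    """Rate function Gamma_B(theta*(t)) for binomial X_j: substituting
    theta* = ln[t(1-p)/(p(1-t))] into Gamma_B yields
    t ln(t/p) + (1-t) ln((1-t)/(1-p)) for p < t < 1."""
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

def normal_rate(t, mu, sigma):
    """Rate function Gamma_N(theta*(t)) = (t - mu)^2 / (2 sigma^2), per (7.5)."""
    return (t - mu) ** 2 / (2 * sigma**2)

# Binomial check: exact tail Pr{S_n >= nt} vs. the Chernoff bound exp(-n Gamma).
n, p, t = 100, 0.3, 0.5
chernoff_b = math.exp(-n * binomial_rate(t, p))
exact_b = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
              for k in range(math.ceil(n * t), n + 1))

# Normal check: S_n ~ N(n mu, n sigma^2), so the exact tail is available via erfc.
mu, sigma, tn = 0.0, 1.0, 0.5
chernoff_n = math.exp(-n * normal_rate(tn, mu, sigma))
exact_n = 0.5 * math.erfc((tn - mu) * math.sqrt(n) / (sigma * math.sqrt(2.0)))
```

In both cases the bound holds for every n, while the ratio of bound to exact tail grows only polynomially in n, anticipating the tightness result proved later in this section.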
We are now ready to address the first question posed in Remark 7.5. The next result
states that when MX (θ) exists and t > E[X], the bound in (7.1) or (7.3) is meaningful
in the sense that it produces exponential decay as n → ∞. We also prove that Γ(θ) is a
concave function, a property seen in the above examples.
Perhaps ironically, the critically important first result will have a short, easy proof. The
second result on the concavity of Γ(θ) will require a good deal more work. For it we will
need to derive a special case of a Book V change of variables result. See Remark 7.8.
Proposition 7.7 (supθ≥0 Γ(θ) > 0; Γ(θ) is concave) Assume that MX (θ) exists on
(−θ0 , θ0 ) with θ0 > 0.
If t > E[X], then Γ(θ) ≡ θt − ln MX (θ) > 0 for some θ > 0, and hence:
where F is the distribution function of X. Since eθx is continuous and F is bounded and
increasing, this integral exists for all y by Proposition III.4.17. It is left as an exercise to
prove that Fθ (y) is continuous, increasing, and has appropriate limits at ±∞, and thus
Fθ(y) is a distribution function of a random variable Y by Proposition 1.4.
We now prove that E[Y^n] ≡ ∫_{−∞}^∞ y^n dFθ exists for all n, and that:
E[Y^n] = M_X^{(n)}(θ)/M_X(θ),   (1)
where M_X^{(n)}(θ) is the nth derivative of M_X(θ).
Once proved, this assures concavity of Γ(θ). By direct calculation of Γ′′(θ), then (1) and
(4.50):
Γ′′(θ) = (M_X′(θ)/M_X(θ))² − M_X′′(θ)/M_X(θ) = −Var[Y] ≤ 0.
We now prove that the integral in (2) equals E[Y n ] as claimed in (1).
First, since Fθ(y) is increasing and y^n is continuous, ∫_a^b y^n dFθ exists for any bounded
interval [a, b] by Proposition III.4.17. Then by Definition III.4.3, for any ε > 0 there exists
a partition {[y_{j−1}, y_j]}_{j=1}^m of [a, b] so that with arbitrary tags ỹ_j ∈ [y_{j−1}, y_j], this integral
can be approximated within ε by a Riemann-Stieltjes summation:
| ∫_a^b y^n dFθ − Σ_{j=1}^m ỹ_j^n [Fθ(y_j) − Fθ(y_{j−1})] | < ε.   (3)
By Definition III.4.3, the result in (3) is valid for any refinement (Definition III.4.7) of this
partition. Recalling that θ ∈ (−θ_0, θ_0) is fixed, we refine this partition as necessary so that
over each such interval [y_{j−1}, y_j], if M_j and m_j denote the maximum and minimum values
of y^n:
|M_j − m_j| ≤ ε e^{−|θ|y_j} M_X(θ)/2^{j+1}.   (4)
and Definition III.4.3 can again be applied to produce a partition {[y_{j_k−1}, y_{j_k}]}_{k=1}^{n_j} of each
[y_{j−1}, y_j] so that for arbitrary tags y′_{j_k} ∈ [y_{j_k−1}, y_{j_k}]:
| [Fθ(y_j) − Fθ(y_{j−1})] − (1/M_X(θ)) Σ_{k=1}^{n_j} e^{θy′_{j_k}} [F(y_{j_k}) − F(y_{j_k−1})] | < ε/(2^{j+1} M_j^n).
Remark 7.8 (Change of variables and (6)) It will be seen in Book V that the transition
from the dFθ -integral representation for E[Y n ] to the dF -integral representation in (6) is a
special case of a general change of variables result.
From (7.8):
Fθ(y) = (1/M_X(θ)) ∫_{−∞}^y e^{θx} dF.
It seems compelling to wonder, as would appear “obvious” at least notationally, if:
dFθ = (1/M_X(θ)) e^{θy} dF?
This notational “trick” was justified in the above proof by approximating the dFθ -integral
on the left with sums, converting the ∆Fθ terms to dF -integrals which can then be approxi-
mated with ∆F terms, where the final summation also approximates the dF -integral on the
right.
In Book V, this change of variable result will be justified more generally for Lebesgue-
Stieltjes integrals, which will be seen to agree with Riemann-Stieltjes integrals when the
integrand is continuous.
Example 7.9 (Existence of MY (t)) As an application of this general result, or derived
using the approach of the above proof, if θ ∈ (−θ_0, θ_0) then M_Y(t) exists for t ∈ (−t_0, t_0)
where (θ − t_0, θ + t_0) ⊂ (−θ_0, θ_0).
Replacing y^n with e^{ty} in (7.9) obtains:
M_Y(t) ≡ ∫_{−∞}^∞ e^{ty} dFθ = M_X(θ + t)/M_X(θ).   (7.10)
Definition 7.10 (Twisted/tilted distribution) The distribution function Fθ in (7.8)
is referred to as the twisted distribution, or the tilted distribution, associated with F.
Sometimes either of these names is preceded by “exponentially,” as in exponentially
tilted distribution.
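For a concrete illustration of tilting: completing the square shows that the tilted distribution of N(µ, σ²) is again normal, namely N(µ + θσ², σ²). A numerical sketch (grid and parameter values are assumptions of the illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def tilted_pdf(x, theta, mu, sigma):
    """g(x) = e^{theta x} f(x) / M_X(theta), with M_X the normal MGF of (4.78)."""
    M = math.exp(mu * theta + 0.5 * sigma**2 * theta**2)
    return math.exp(theta * x) * normal_pdf(x, mu, sigma) / M

# Tilting N(0,1) by theta = 0.7 should produce N(0.7, 1): check total mass and
# mean by a Riemann sum over a grid wide enough that the tails are negligible.
theta, h = 0.7, 0.01
xs = [-10.0 + k * h for k in range(2001)]
mass = sum(tilted_pdf(x, theta, 0.0, 1.0) * h for x in xs)
mean = sum(x * tilted_pdf(x, theta, 0.0, 1.0) * h for x in xs)
```

The mean shift µ → µ + θσ² is exactly the mechanism exploited in the Cramér-Chernoff proof below: θ∗ is chosen so that the tilted variables are centered at t.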
Proposition 7.7 assures that if t > E[X], then sup_{θ≥0}(θt − ln M_X(θ)) > 0 and that
Γ(θ) ≡ θt − ln M_X(θ) is a concave function. Hence since Γ(θ) is also differentiable, if we can
solve Γ′(θ) = 0 for given t with solution θ∗(t) > 0, then:
sup_{θ≥0} Γ(θ) = Γ(θ∗(t)).
This was the approach taken in Example 7.6.
Solving Γ′(θ) = 0 is equivalent to solving the following for θ∗(t):
t = M_X′(θ∗(t))/M_X(θ∗(t)).   (7.11)
By (4.50) and (1) of the above proof, [M_X′(θ)/M_X(θ)]′ = Var[Y], and so M_X′(θ)/M_X(θ) is
an increasing function. Since M_X′(0)/M_X(0) = µ, it follows that if the equation in (7.11) is
solvable for t > µ, then it will be solvable with θ∗(t) > 0.
That said, the above proposition does not assure that we can always determine the value
of θ∗(t) which achieves this supremum using this approach, because M_X′(θ)/M_X(θ) may be
bounded, and hence (7.11) will be unsolvable for t > max[M_X′(θ)/M_X(θ)].
Example 7.11 (Bounded M_X′(θ)/M_X(θ)) Given density function f(x) = Cx^{−3}e^{−x} on
x ≥ 1, where C is chosen so that f integrates to 1, then M_X(θ) exists when θ ≤ 1. Further,
M_X′(θ)/M_X(θ) is increasing but bounded when θ ≤ 1 since:
M_X′(θ)/M_X(θ) ≤ M_X′(1)/M_X(1) = ∫_1^∞ x^{−2} dx / ∫_1^∞ x^{−3} dx = 2.
Hence, the equation in (7.11) has no solution for t > 2.
then the bound in (7.3) is exact in the limit as n → ∞. In other words, the Chernoff bound
is tight.
For this result, we assume that the distribution function F (x) of X is absolutely con-
tinuous, recalling the discussion in Section 1.3.1. The same proof works if F (x) is a saltus
function, replacing integrals below with summations. The key point of these assumptions is
to assure that X has a density function, so in particular it is assumed that the decomposition
of F (x) in Proposition 1.18 has no singular component.
Remark 7.12 (On the proof and θ# (t)) A new function θ# (t) is introduced in the
statement of this proposition which is defined in terms of supθ Γ(θ), while θ∗ (t) of Defi-
nition 7.4 is defined in terms of supθ≥0 Γ(θ). However, the first development in the proof
will be that θ# (t) = θ∗ (t) when t > µ.
Also, for an absolutely continuous distribution function F (x), the proof below again
requires Fubini’s theorem of Book V, which allows the evaluation of a multivariate Lebesgue
integral in terms of iterated 1-dimensional integrals. When F (x) is a saltus function with
discrete density, the Book V general machinery is not needed. Similarly, if F (x) is absolutely
continuous with continuous density f (x) ≡ F 0 (x), then the integrals below become Riemann
integrals and Fubini’s result is replaced by Corollary III.1.77.
Proposition 7.13 (Cramér-Chernoff theorem) Let S_n = Σ_{j=1}^n X_j with {X_j}_{j=1}^n in-
dependent and identically distributed with a distribution function F(x) containing no sin-
gular part, and where M_X(θ) exists on (−θ_0, θ_0) with θ_0 > 0. Assume that for given
t > µ ≡ E[X], that θ#(t) > 0 exists with θ#(t) ∈ (0, θ_0) and:
Proof. The upper bound in (7.12) is obtained by (7.3) if we can prove that θ# (t) = θ∗ (t)
for t > µ, meaning:
supθ Γ(θ) = supθ≥0 Γ(θ). (1)
Because Γ(0) = 0, (1) will follow by showing that Γ(θ) < 0 for θ < 0.
Since f(x) ≡ e^{θx} is convex, M_X(θ) ≥ e^{θµ} by Jensen's inequality in (4.93). If θ < 0 and
t > µ then exp[−θ(t − µ)] > 1, and so:
and thus by Proposition 1.5 this integral equals Pr{X ∈ (a, b]}. It will be seen in Book V
that for any measurable set A that Pr{X ∈ A} is given by (L) ∫_A f(x)dx, and that this
generalizes to joint density functions.
Thus for any δ > 0, and dropping the notation (L):
Pr{S_n ≥ nt} ≥ Pr{nt ≤ Σ_{j=1}^n X_j ≤ n(t + δ)}
            = ∫_A f̃(x)dx
            = ∫···∫_A f(x_1)...f(x_n) dx_1...dx_n,   (2)
where A ≡ {nt ≤ Σ_{j=1}^n x_j ≤ n(t + δ)},
Given t > µ and θ∗ ≡ θ∗ (t) = θ# (t), define the twisted density function g(x)
associated with the density f (x) :
g(x) = exp[θ∗x] f(x)/M_X(θ∗).
This is indeed a density function: it is nonnegative, and it integrates to 1 by definition of
M_X(θ∗).
Let Y be a random variable associated with the distribution function G(x) defined by
g(x), which is assured to exist by Proposition 1.4. Then by (2) :
Pr{Σ_{j=1}^n X_j ≥ nt}
≥ (M_X^n(θ∗)/exp[n(t + δ)θ∗]) ∫···∫_A (∏_{j=1}^n exp[θ∗x_j]/M_X^n(θ∗)) f(x_1)...f(x_n) dx_1...dx_n
≡ (M_X^n(θ∗)/exp[n(t + δ)θ∗]) ∫···∫_A g(x_1)...g(x_n) dx_1...dx_n
= (M_X^n(θ∗)/exp[n(t + δ)θ∗]) Pr{nt ≤ Σ_{j=1}^n Y_j ≤ n(t + δ)}.   (3)
In the first step, ∏_{j=1}^n exp[θ∗x_j] ≤ exp[n(t + δ)θ∗] since Σ_{j=1}^n x_j ≤ n(t + δ) by the
definition of A. The last step reflects the above comment relating Pr[Y ∈ A] with the integral
of the density for Y over A.
Now:
M_X^n(θ∗)/exp[n(t + δ)θ∗] = exp[n ln M_X(θ∗) − n(t + δ)θ∗]
                          = exp[−nΓ(θ∗)] exp[−nθ∗δ].   (4)
In addition, E[Y] = M_X′(θ∗)/M_X(θ∗) by (1) in the proof of Proposition 7.7. Since θ∗
maximizes concave and differentiable Γ(θ), it follows that Γ′(θ∗) = 0. A calculation with
Γ(θ) = θt − ln M_X(θ) then produces:
E[Y] = t.   (5)
Also, the standard deviation σ_Y of Y exists by (1) of the prior proof and (4.50), while M_Y(s)
exists by Example 7.9.
Combining (3)–(5):
Pr{Σ_{j=1}^n X_j ≥ nt} ≥ exp[−nΓ(θ∗)] exp[−nθ∗δ] ×
Pr{0 ≤ Σ_{j=1}^n (Y_j − E[Y])/(σ_Y √n) ≤ δ√n/σ_Y}.   (6)
The central limit theorem of Proposition 6.13 then obtains that for any δ > 0, the probability
expression in (6) converges to 1/2 as n → ∞.
Choose 0 < δ ≤ ε/(2θ∗). Then since exp[−nθ∗δ] ≥ exp[−nε/2], determine N_1 so that
for n ≥ N_1:
exp[−nθ∗δ] ≥ exp[−nε/2] ≥ 4 exp[−nε].
For this same δ, let N_2 be defined so that for n ≥ N_2:
Pr{0 ≤ Σ_{j=1}^n (Y_j − E[Y])/(σ_Y √n) ≤ δ√n/σ_Y} ≥ 1/4.
Then for n ≥ max[N_1, N_2]:
Pr{Σ_{j=1}^n X_j ≥ nt} ≥ exp[−n(Γ(θ∗) + ε)].
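The tightness asserted by the theorem can be observed numerically for binomial variates, comparing −(1/n) ln Pr{S_n ≥ nt} with the rate function of Example 7.6 (the sample sizes chosen are illustrative):

```python
import math

def binomial_rate(t, p):
    """Rate function Gamma_B(theta*(t)) for binomial X_j, as in Example 7.6."""
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

def empirical_rate(n, p, t):
    """-(1/n) ln Pr{S_n >= nt}, computed from the exact binomial tail."""
    tail = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(math.ceil(n * t), n + 1))
    return -math.log(tail) / n

p, t = 0.3, 0.5
gamma_star = binomial_rate(t, p)
# The gap -(1/n) ln Pr{S_n >= nt} - Gamma(theta*) is positive, since Chernoff
# is an upper bound on the tail, and shrinks toward 0 as n grows.
gaps = [empirical_rate(n, p, t) - gamma_star for n in (50, 200, 800)]
```

The gap decays like (ln n)/n, consistent with the bound being exact only in the exponential rate, not in the polynomial prefactor.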
The next section below summarizes the Book II results on the Fisher-Tippett-Gnedenko
theorem in order to introduce the Hill estimator γ H for the extreme value index γ > 0,
and then investigate this estimator and some of its properties in the following sections. For
finance applications, γ > 0 is the range of extreme value indexes of central interest.
The final section returns to the Gnedenko-Pickands-Balkema-de Haan theorem, and
provides a proof when γ > 0.
F^n(a_n x + b_n) ⇒ G(x),
Definition 7.16 (Extreme value index; Domain of attraction) The single family of
distributions identified and parametrized by γ is called the extreme value class of distri-
butions, and the distribution function Gγ (ax + b) is called a generalized extreme value
distribution, abbreviated GEV.
The parameter γ ∈ R is called the extreme value index.
When $F^n(a_n x + b_n) \Rightarrow G_\gamma(Ax + B)$, we say that the distribution function $F$ is in the domain of attraction of $G_\gamma$, denoted $F \in D(G_\gamma)$.
Proposition II.9.45 derived the von Mises’ condition, named for Richard von Mises
(1883–1953), which identified how the extreme value index γ could be derived in the case
of a twice continuously differentiable distribution function F (x).
For its statement, recall x∗ as defined in (6.39):
Proposition 7.17 (von Mises’ condition) Let F (x) be a twice continuously differen-
tiable distribution function with F 0 (x) > 0 for some interval (x0 , x∗ ). If:
$$\lim_{x\to x^*}\left(\frac{1-F}{F'}\right)'(x) = \gamma, \quad (7.16)$$
then $F \in D(G_\gamma)$.
The von Mises’ condition provides one approach to determining if a given distribution
function is in the domain of attraction of Gγ for some γ.
Example 7.18 (F (x) normal, or Pareto) With Φ denoting the standard normal distri-
bution function, it was shown that Φ ∈ D(G0 ) in Example II.9.48 using the von Mises’
condition.
If $F(x)$ is the Pareto distribution of (7.17) with $x_0 = 1$ for simplicity, so $F_P(x) = 1 - x^{-1/\gamma}$ for $x \ge 1$, then $F(x)$ satisfies the requirements for the von Mises result. As $(1-F)/F' = \gamma x$, it follows from (7.16) that $F_P \in D(G_\gamma)$.
When dealing with data sets from unknown distributions in finance and other disciplines,
the question naturally arises as to how one can determine if the data is consistent with that
from a distribution F with F ∈ D(Gγ ) for some γ. As F (x) is unknown in this case, certainly
so too are its various derivatives. Thus the von Mises’ condition can only provide a workable
approach when dealing with data if an underlying distribution function is first estimated
and proves to be sufficiently differentiable.
While there are many approaches to this estimation problem, a popular and frequently
used approach is the Hill estimator, introduced in 1975 by Bruce M. Hill. We study this
estimator in the case of greatest interest in finance applications, and that is when γ > 0.
$$F_P(x) = 1 - \left(\frac{x}{x_0}\right)^{-1/\gamma}, \quad x \ge x_0, \quad (7.17)$$
where $\gamma > 0$. This model is reflective of many data observations made in finance and elsewhere, and is often parametrized with $\alpha = 1/\gamma$.
Extreme Value Theory 2 213
Parametrized as in (7.17), this distribution function satisfies the requirements of the von
Mises’ result, and as noted in Example 7.18, such distributions satisfy FP ∈ D(Gγ ). Since
γ > 0 here, it follows that Pareto, power law, or Zipf-type distributions are in the domain
of attraction of the Fréchet class of distributions, also called Type II extreme value
distributions, as noted in Examples II.9.10 and II.9.13.
The Hill estimator for γ > 0 is defined as follows.
Definition 7.19 (Hill estimator for $\gamma > 0$) Let $\{X_i\}_{i=1}^n$ be a random sample of a given variate, and $\{X_{(j)}\}_{j=1}^n$ the associated order statistics.
The Hill estimator $\gamma_H \equiv \gamma_H^{(k,n)}$ is based on the $k+1$ largest variates, $\{X_{(n-j)}\}_{j=0}^k$, and defined as an average of $k$ log ratios:
$$\gamma_H \equiv \frac{1}{k}\sum_{j=0}^{k-1}\ln\frac{X_{(n-j)}}{X_{(n-k)}}, \quad (7.18)$$
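A direct implementation of (7.18) is straightforward; the following sketch (hypothetical code, not from the text) computes $\gamma_H^{(k,n)}$ from the sorted sample and checks it on a deterministic grid of exact Pareto quantiles for $F_P(x) = 1 - x^{-1/\gamma}$:

```python
import math

def hill_estimator(sample, k):
    """Hill estimator of (7.18): the average of the k log ratios
    ln(X_(n-j) / X_(n-k)), j = 0, ..., k-1, computed from the k+1
    largest order statistics of the sample."""
    xs = sorted(sample)              # xs[-1] = X_(n), xs[-(k+1)] = X_(n-k)
    x_nk = xs[-(k + 1)]
    return sum(math.log(xs[-(j + 1)] / x_nk) for j in range(k)) / k

# Deterministic check: exact quantiles of F_P(x) = 1 - x^{-1/gamma}, x >= 1,
# via the inverse F_P^*(u) = (1 - u)^{-gamma} (assumed illustrative parameters).
gamma = 0.4
n = 10_000
sample = [(1 - (i + 0.5) / n) ** (-gamma) for i in range(n)]
est = hill_estimator(sample, k=500)
```

On this noiseless quantile grid the estimate reproduces $\gamma$ to within a fraction of a percent; on random samples it carries the sampling error discussed later in the chapter.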
Example 7.20 (Pareto distribution: $\gamma_H^{(k,n)}$ and MLE) Given a random sample $\{X_j\}_{j=1}^n$ that is assumed to come from a Pareto distribution with unknown parameter $\gamma$, we show that the Hill estimator provides a solution that equals what is known as the maximum likelihood estimate of this parameter.
Let $F_P(x)$ be given as in (7.17) with $x_0 = 1$ for simplicity, and recall that $F_P \in D(G_\gamma)$ by Example 7.18. Given the ordered sample $\{X_{(j)}\}_{j=1}^n$ and $k$, define the conditional distribution function for $x \ge X_{(n-k)}$ as in Definition 1.12 (see also Example II.3.40), where we temporarily denote $\alpha = 1/\gamma$ for notational simplicity:
$$F_P\big(x \,\big|\, x \ge X_{(n-k)}\big) = \frac{F_P(x) - F_P(X_{(n-k)})}{1 - F_P(X_{(n-k)})} = \Big[X_{(n-k)}^{-\alpha} - x^{-\alpha}\Big]\Big/ X_{(n-k)}^{-\alpha}.$$
The conditional density function is then given as the derivative of this differentiable function by Proposition III.1.33:
$$f_P\big(x \,\big|\, x \ge X_{(n-k)}\big) = \alpha x^{-(\alpha+1)}\big/ X_{(n-k)}^{-\alpha}.$$
In many applications it is reasonable to assume that the given sample $\{X_j\}_{j=1}^n$ has density function $f(x;\alpha)$, parametrized by and therefore "conditional" on an unknown parameter $\alpha$ to be estimated. The conditional likelihood function, and sometimes the likelihood function of the sample, is defined:
$$L[\{X_j\}_{j=1}^n; \alpha] \equiv \prod_{j=1}^n f(X_j;\alpha).$$
Given this assumption on the parametric form of the density, L[{Xj }nj=1 ; α] informally
provides the “probability of this sample” given any parameter α.
A logical objective is therefore to maximize L as a function of α, producing the con-
ditional maximum likelihood estimate for α, often called the maximum likelihood
estimate/estimator or MLE. By maximizing L, the given parameter α provides a model
which maximizes the probability of the observed sample among the family of distributions
{f (x; α)} parametrized by α.
Applying this approach to the sample $\{X_{(n-j)}\}_{j=0}^k$ and conditional density $f_P(x;\alpha) \equiv \alpha x^{-(\alpha+1)}/X_{(n-k)}^{-\alpha}$ obtains the likelihood function:
$$L\big[\{X_{(n-j)}\}_{j=0}^k; \alpha\big] = \alpha^k \prod_{j=0}^{k-1}\Big[X_{(n-j)}^{-(\alpha+1)}\big/ X_{(n-k)}^{-\alpha}\Big].$$
The maximum likelihood estimate $\alpha_{MLE}$ equals the value of $\alpha$ that solves $\frac{\partial \ln L}{\partial\alpha} = 0$ if $\frac{\partial^2 \ln L}{\partial\alpha^2} < 0$, which is verifiable in this case. A calculation yields:
$$\alpha_{MLE} = 1/\gamma_H.$$
It is then checked as an exercise that if parametrized as a function of $\gamma$, one again has that $\frac{\partial \ln L}{\partial\gamma} = 0$ and $\frac{\partial^2 \ln L}{\partial\gamma^2} < 0$ when $\gamma = \gamma_H$, and so $\gamma_{MLE} = \gamma_H$.
Thus we have proved the following result:
Proposition 7.21 (Pareto distribution: $\gamma_H^{(k,n)} = \gamma_{MLE}$) Given a sample $\{X_j\}_{j=1}^n$ that is assumed to follow a Pareto distribution with unknown parameter $\gamma$, if $\gamma_{MLE}$ denotes the maximum likelihood estimate for $\gamma$ based on $\{X_{(n-j)}\}_{j=0}^k$, then:
$$\gamma_{MLE} = \gamma_H^{(k,n)}, \quad (7.20)$$
where $\gamma_H^{(k,n)}$ is the Hill estimator based on this same subsample.
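A numerical check of this identity under the stated Pareto assumption (illustrative code, not from the text): maximizing the conditional log-likelihood of Example 7.20 by a crude grid search should land at $\alpha_{MLE} = 1/\gamma_H$. The sample below is a deterministic quantile grid, an assumption made for reproducibility:

```python
import math

gamma_true = 0.5
n, k = 2000, 200

# Deterministic Pareto "sample": quantiles of F_P(x) = 1 - x^{-1/gamma}, x >= 1
xs = sorted((1 - (i + 0.5) / n) ** (-gamma_true) for i in range(n))
top = [xs[-(j + 1)] for j in range(k + 1)]      # X_(n), X_(n-1), ..., X_(n-k)
x_nk = top[k]

gamma_hill = sum(math.log(top[j] / x_nk) for j in range(k)) / k

# Conditional log-likelihood of Example 7.20:
# ln L = k ln(alpha) - (alpha + 1) * sum_j ln X_(n-j) + k * alpha * ln X_(n-k)
sum_logs = sum(math.log(top[j]) for j in range(k))

def log_lik(alpha):
    return k * math.log(alpha) - (alpha + 1) * sum_logs + k * alpha * math.log(x_nk)

# Crude grid search over alpha; ln L is strictly concave in alpha, so the
# grid maximizer sits next to the exact solution alpha_MLE = 1/gamma_H.
alpha_mle = max((0.01 * i for i in range(1, 1000)), key=log_lik)
```

The grid maximizer agrees with $1/\gamma_H$ up to the grid resolution, as (7.20) predicts.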
Example 7.22 (Pareto distribution: $\gamma_H^{(k,n)}$ as a random variable) The collection $\{X_j\}_{j=1}^n$ above denoted a random sample from a random variable $X$ defined on some probability space with given distribution function $F(x)$. By random sample is meant that these variates have the same distribution function $F(x)$ as $X$, and are independent. See Example II.4.6 for a discussion on random samples, and the meaning of random samples having such properties.
Given the constructions of Chapter II.4 and summarized in Section 6.1, one can also construct a probability space $(\mathcal S, \mathcal E, \lambda)$ and a collection of independent, identically distributed (i.i.d.) random variables $\{X_j\}_{j=1}^\infty$ with:
• $\{X_j\}_{j=1}^\infty$ identically distributed, meaning $F_{X_j}(x) = F(x)$ for all $j$;
• $\{X_j\}_{j=1}^\infty$ independent in the sense of Definition 1.13.
Within this framework, $\gamma_H^{(k,n)}$ is seen to be a random variable on $(\mathcal S, \mathcal E, \lambda)$ given any $n$ and $k$, and thus we can investigate distributional properties.
One model for such $\{X_j\}_{j=1}^\infty$ and $(\mathcal S, \mathcal E, \lambda)$ is given in Proposition II.4.13, where $X_j \equiv F^*(U_j)$, $\{U_j\}_{j=1}^\infty$ are i.i.d. random variables with a continuous, uniform distribution, and $F^*(y)$ is the left-continuous inverse of $F(x)$ as given in Definition 1.29.
If $F(x) = F_P(x)$ in (7.17) with $x_0 = 1$ for notational simplicity, then:
$$F^*(y) = (1-y)^{-\gamma}$$
is a continuous, strictly increasing function. Thus given $n$ and $k$, the variates in the Hill estimator $\{X_{(n-j)}\}_{j=0}^k$ are defined:
$$X_{(n-j)} \equiv \big(1 - U_{(n-j)}\big)^{-\gamma},$$
and so by (7.18):
$$\gamma_H^{(k,n)} \equiv \frac{1}{k}\sum_{j=0}^{k-1}\ln\frac{X_{(n-j)}}{X_{(n-k)}} = \frac{1}{k}\sum_{j=0}^{k-1}\ln\left[\frac{\big(1-U_{(n-j)}\big)^{-\gamma}}{\big(1-U_{(n-k)}\big)^{-\gamma}}\right] = \frac{\gamma}{k}\sum_{j=0}^{k-1}\Big[-\ln\big(1-U_{(n-j)}\big) - \big(-\ln\big(1-U_{(n-k)}\big)\big)\Big].$$
The transformation $E = -\ln(1-U)$ is increasing, so the order statistics are preserved. Thus $\{E_{(n-j)}\}_{j=0}^k = \{-\ln(1-U_{(n-j)})\}_{j=0}^k$ are the higher order statistics of a sample $\{E_j\}_{j=1}^n$ of standard exponential variates. Since $X_j \equiv (1-U_j)^{-\gamma}$ it follows that $E_j = \ln X_j/\gamma$, and so $\{E_j\}_{j=1}^n$ are independent random variables by Proposition II.3.56.
The Hill estimator for $\{X_{(n-j)}\}_{j=0}^k$ can thus be expressed:
$$\gamma_H^{(k,n)} = \frac{\gamma}{k}\sum_{j=0}^{k-1}\Big[E_{(n-j)} - E_{(n-k)}\Big]. \quad (1)$$
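The representation (1) can be checked numerically under the Pareto model: a sketch (illustrative code, not from the text) simulating uniforms, building $X_{(j)}$ and $E_{(j)}$ as above, and confirming that the direct Hill formula (7.18) and the exponential-spacings form (1) agree to floating-point precision:

```python
import math
import random

random.seed(7)
gamma = 0.7
n, k = 5000, 300

us = sorted(random.random() for _ in range(n))     # U_(1) <= ... <= U_(n)
xs = [(1 - u) ** (-gamma) for u in us]             # X_(j) = (1 - U_(j))^{-gamma}, same order
es = [-math.log(1 - u) for u in us]                # E_(j) = -ln(1 - U_(j)): exponential order stats

# Direct Hill formula (7.18) on the k+1 largest variates
hill_direct = sum(math.log(xs[-(j + 1)] / xs[-(k + 1)]) for j in range(k)) / k

# Exponential-spacings representation (1)
hill_exp = gamma * sum(es[-(j + 1)] - es[-(k + 1)] for j in range(k)) / k
```

The two expressions are algebraically identical, so they match to rounding error; both fluctuate around $\gamma$ with the sampling error governed by $k$.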
Proof. Both results are an immediate application of Proposition 6.22. Since $\gamma_H^{(k,n)}$ can be expressed as in (1), $\gamma_H^{(k,n)}/\gamma$ equals the variate denoted $Y_{k,n}$ in that result. In addition, the conclusion of that result, that $\gamma_H^{(k,n)}/\gamma \to_1 1$, is equivalent to (7.22) by definition.
For general $F \in D(G_\gamma)$ with $\gamma > 0$, we cannot expect that $F(x)$ is Pareto. However, Proposition II.9.38 obtained that if $\gamma > 0$, then for all $x \ge -1/\gamma$:
$$\lim_{t\to x^*}\frac{1 - F(t + x h_a(t))}{1 - F(t)} = (1+\gamma x)^{-1/\gamma}.$$
Recall that $h_a(t) \equiv a_c\big(\frac{1}{1-F(t)}\big)$, where $a_c(t)$ is the normalizing function in Corollary II.9.35. This is defined by $a_c(t) \equiv c\,a_{\lfloor t\rfloor}$ with $\lfloor t\rfloor$ the greatest integer function:
$$\lfloor t\rfloor = \max\{n \mid n \le t\},$$
where $\{a_n\}_{n=1}^\infty$ is the sequence in the Fisher-Tippett-Gnedenko theorem, and $c$ is the constant in II.(9.32) of Proposition II.9.34. And as defined in (6.39), $x^* = \inf\{x \mid F(x) = 1\}$, with $x^* \equiv \infty$ if $F(x) < 1$ for all $x$.
To investigate the implications of this general result, note that:
$$1 - \frac{1-F(t + x h_a(t))}{1-F(t)} = \frac{F(t + x h_a(t)) - F(t)}{1-F(t)} = F(t + x h_a(t) \mid X > t),$$
where the conditional distribution function on the right is as defined in Definition 1.12 (see also Example II.3.40).
Thus for $F \in D(G_\gamma)$ with $\gamma > 0$, Proposition II.9.38 asserts that the conditional distribution function $F(t + x h_a(t) \mid X > t)$ has limit as $t \to x^*$:
$$\lim_{t\to x^*} F(t + x h_a(t) \mid X > t) = 1 - (1+\gamma x)^{-1/\gamma}.$$
With a reparametrization $y = 1+\gamma x$, this implies that the following conditional distribution is asymptotically Pareto, meaning:
$$\lim_{t\to x^*} F\Big(t + \frac{y-1}{\gamma}\,h_a(t)\ \Big|\ X > t\Big) = 1 - y^{-1/\gamma}.$$
Written in terms of the $\Pr$ notation:
$$\lim_{t\to x^*} \Pr\Big[X \le t + \frac{y-1}{\gamma}\,h_a(t)\ \Big|\ X > t\Big] = 1 - y^{-1/\gamma}.$$
In summary, while general F ∈ D(Gγ ) with γ > 0 is not a Pareto distribution, it is
asymptotically Pareto in the above sense. Put another way, the conditional asymptotic tail
of F ∈ D(Gγ ) is Pareto. Thus there is some hope that the properties of the Hill estimator
for the Pareto distribution will imply similar properties for such F.
In the next three sections, we develop the Hill estimator result in three steps:
1. If F ∈ D(Gγ ) with γ > 0, then F (x) is conditionally asymptotically Pareto.
We will refine the above analysis of the behavior of $F(x)$ and show that $\gamma > 0$ if and only if $x^* = \infty$, and then in (7.32) that:
$$\lim_{t\to\infty}\Pr[X \le tx \mid X > t] = 1 - x^{-1/\gamma}.$$
The final section will address asymptotic normality of the Hill estimator, generalizing
Proposition 7.23.
$$\lim_{t\to\infty}\frac{U(tx) - U(t)}{a_c(t)} = \frac{x^\gamma - 1}{\gamma}. \quad (7.23)$$
Here as above, $a_c(t) \equiv c\,a_{\lfloor t\rfloor}$, where $\lfloor t\rfloor$ is the greatest integer function, $\{a_n\}_{n=1}^\infty$ is the sequence in the Fisher-Tippett-Gnedenko theorem, and $c$ is the constant in II.(9.32) of Proposition II.9.34.
The function $U(t)$ is the left-continuous inverse of $1/(1-F(x))$:
$$U(t) \equiv \left(\frac{1}{1-F}\right)^*(t), \quad (7.24)$$
and is defined on $t > 1$. An alternative and perhaps more intuitive representation for $U(t)$ is given in Exercise II.9.26:
$$U(t) \equiv F^*(1 - 1/t). \quad (7.25)$$
We begin by investigating and refining the limit in (7.23). The result is simple, but the derivation is subtle.
Proposition 7.24 (Asymptotics of $\frac{U(t)}{a_c(t)}$ and $\frac{U(tx)}{U(t)}$) If $F \in D(G_\gamma)$ with $\gamma > 0$, then $U(t) \to \infty$ as $t \to \infty$ and:
$$\lim_{t\to\infty}\frac{U(t)}{a_c(t)} = \frac{1}{\gamma}. \quad (7.26)$$
Also, for $x > 0$:
$$\lim_{t\to\infty}\frac{U(tx)}{U(t)} = x^\gamma. \quad (7.27)$$
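A quick numeric illustration of (7.27) under an assumed non-Pareto $F \in D(G_\gamma)$ (a sketch, not from the text): for the shifted Pareto $F(x) = 1 - (1+x)^{-1/\gamma}$ on $x \ge 0$, solving $1 - (1+x)^{-1/\gamma} = 1 - 1/t$ gives $U(t) = t^\gamma - 1$, and the ratio $U(tx)/U(t)$ approaches $x^\gamma$ as $t$ grows:

```python
gamma = 0.5

def U(t):
    # For the shifted Pareto F(x) = 1 - (1 + x)^(-1/gamma) on x >= 0, the
    # left-continuous inverse of 1/(1 - F) is U(t) = t^gamma - 1, t > 1.
    return t ** gamma - 1

x = 3.0
ratios = [U(t * x) / U(t) for t in (1e2, 1e4, 1e6)]
limit = x ** gamma       # the limit asserted in (7.27)
```

Here $U$ is regularly varying but not exactly a power function, so the ratio only converges to $x^\gamma$; the error shrinks visibly as $t$ increases through the three values.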
Proof. Simplifying notation, we write $a_c(t) = c\,a(t)$, and thus $a(t) \equiv a_{\lfloor t\rfloor}$. Defining:
$$V_t(x) \equiv \frac{U(tx) - U(t)}{a_c(t)},$$
it follows that since $\frac{a_c(tx)}{a_c(t)} = \frac{a(tx)}{a(t)}$:
Using the inequalities in (3) applied to each term in the first summation yields:
$$Z^\gamma(1-\epsilon)\lim_{n\to\infty}\left[\sum_{k=N}^n \frac{U(Z^k) - U(Z^{k-1})}{U(Z^n)}\right] \le \lim_{n\to\infty}\frac{U(Z^{n+1})}{U(Z^n)} \le Z^\gamma(1+\epsilon)\lim_{n\to\infty}\left[\sum_{k=N}^n \frac{U(Z^k) - U(Z^{k-1})}{U(Z^n)}\right].$$
The limits of the summations in these bounds are seen to equal 1, recalling that $U(t) \to \infty$ as $t \to \infty$. Thus for any $\epsilon < 1 - Z^{-\gamma}$:
$$Z^\gamma(1-\epsilon) \le \lim_{n\to\infty}\frac{U(Z^{n+1})}{U(Z^n)} \le Z^\gamma(1+\epsilon),$$
Letting $t = Z^n$ in (5) then suggests that $\lim_{t\to\infty} U(tx)/U(t)$ exists for all $x > 1$ and equals the limit in (7.27), but this must be formalized.
Given $Z > 1$ and $y > 1$, let $n(y)$ be defined as the integer such that $Z^{n(y)} \le y < Z^{n(y)+1}$. Then because $U$ is increasing, if $t, x > 1$:
$$\lim_{t\to\infty}\frac{U\big(Z^{n(x)+2} Z^{n(t)}\big)}{U\big(Z^{n(t)}\big)} \le (xZ^2)^\gamma.$$
Thus:
$$\left(\frac{x}{Z^2}\right)^\gamma \le \lim_{t\to\infty}\frac{U(tx)}{U(t)} \le (xZ^2)^\gamma,$$
$$\lim_{t\to\infty}\frac{f(tx)}{f(t)} = x^\alpha. \quad (7.29)$$
Example 7.26 (Varying at infinity: $F \in D(G_\gamma)$ with $\gamma > 0$) If $F \in D(G_\gamma)$ with $\gamma > 0$, then both:
• $U(y) \equiv \left(\frac{1}{1-F}\right)^*(y)$, by (7.27), and
• $a(t)$, by (7.28),
are regularly varying at infinity with index $\gamma$.
Before continuing the current development we identify a corollary result to (7.27) which
was promised in Remark II.9.36, regarding the normalizing sequences an and bn in the
statement of the Fisher-Tippett-Gnedenko theorem.
Thus for these normalizing sequences, $A = 1/\gamma$ and $B = -1/\gamma$ in Proposition 7.15, where:
$$H_n(x) = \frac{1}{n\big[1 - F(xU(n))\big]}, \qquad K(x) = x^{1/\gamma}.$$
Though not distribution functions, such weak convergence is defined in Remark II.8.5 as pointwise convergence at continuity points of $K^*(x)$.
First:
$$K^*(x) = \inf\{y \mid y^{1/\gamma} \ge x\} = x^\gamma.$$
For $H_n(x)$:
$$H_n^*(x) = \frac{1}{U(n)}\inf\{z \mid F(z) \ge 1 - 1/nx\} = \frac{U(nx)}{U(n)}.$$
By Corollary II.8.28, the proof of which extends to increasing function sequences by Corollary II.3.27, $H_n^*(x) \Rightarrow K^*(x)$ implies that $H_n(x) \Rightarrow K(x)$. This obtains that for all $x \ge 0$:
$$\lim_{n\to\infty} n\big[1 - F(U(n)x)\big] = x^{-1/\gamma}. \quad (1)$$
Thus $1 - F(U(n)x) \to 0$ as $n \to \infty$.
The next step in this development is to convert the limiting results for U to limiting
results for the distribution function F. The following proposition states that if F ∈ D(Gγ )
with γ > 0, then the conditional distribution function Pr[X ≤ tx|X > t] is asymptotically
Pareto as t → ∞ for all x > 0.
$$\lim_{t\to\infty}\frac{1-F(tx)}{1-F(t)} = x^{-1/\gamma}. \quad (7.31)$$
Equivalently, $F \in D(G_\gamma)$ with $\gamma > 0$ if and only if $x^* = \infty$ and, for all $x > 0$, the conditional distribution function $\Pr[X \le tx \mid X > t]$ is asymptotically Pareto as $t \to \infty$:
$$\lim_{t\to\infty}\Pr[X \le tx \mid X > t] = 1 - x^{-1/\gamma}. \quad (7.32)$$
Proof. The equivalence of the limits in (7.31) and (7.32) is Definition 1.12, recalling Example II.3.40:
$$1 - \frac{1-F(tx)}{1-F(t)} = \frac{F(tx) - F(t)}{1-F(t)} = \Pr[X \le tx \mid X > t].$$
Assume that $F \in D(G_\gamma)$ with $\gamma > 0$. By Proposition 7.24 and (7.25), $U(t) = F^*(1-1/t) \to \infty$ as $t \to \infty$. This implies that $F(x) < 1$ for all $x$, and so $x^* = \infty$ by definition.
By Proposition II.3.19:
$$F^*(F(t)) \le t \le F^*\big(F(t)^+\big),$$
and since $U\big(\frac{1}{1-F(t)}\big) = F^*(F(t))$ by (7.24), for any $\epsilon > 0$:
$$U\left(\frac{1-\epsilon}{1-F(t)}\right) \le t \le U\left(\frac{1+\epsilon}{1-F(t)}\right),$$
and hence:
$$\frac{U\big(\frac{y}{1-F(t)}\big)}{U\big(\frac{1+\epsilon}{1-F(t)}\big)} \le \frac{1}{t}\,U\left(\frac{y}{1-F(t)}\right) \le \frac{U\big(\frac{y}{1-F(t)}\big)}{U\big(\frac{1-\epsilon}{1-F(t)}\big)}.$$
Since $F \in D(G_\gamma)$ with $\gamma > 0$, (7.27) applies. With $t' = \frac{1\pm\epsilon}{1-F(t)}$ and $x = \frac{y}{1\pm\epsilon}$, it follows that as $t' \to \infty$:
$$\frac{U\big(\frac{y}{1-F(t)}\big)}{U\big(\frac{1\pm\epsilon}{1-F(t)}\big)} = \frac{U(t'x)}{U(t')} \to x^\gamma.$$
But $t' \to \infty$ if and only if $F(t) \to 1$, if and only if $t \to \infty$, and so for all $\epsilon > 0$:
$$\left(\frac{y}{1+\epsilon}\right)^\gamma \le \lim_{t\to\infty}\frac{1}{t}\,U\left(\frac{y}{1-F(t)}\right) \le \left(\frac{y}{1-\epsilon}\right)^\gamma.$$
Hence:
$$\lim_{t\to\infty}\frac{1}{t}\,U\left(\frac{y}{1-F(t)}\right) = y^\gamma. \quad (1)$$
Defining $g_n(y) = \frac{1}{n}U\big(\frac{y}{1-F(n)}\big)$ and $g(y) = y^\gamma$, then $g_n(y) \to g(y)$ for all $y > 0$ by (1). This implies that $g_n^*(x) \to g^*(x)$ for each continuity point of $g^*$ by Proposition II.8.27. A calculation obtains that $g^*(x) = x^{1/\gamma}$ and $g_n^*(x) = \frac{1-F(n)}{1-F(nx)}$, and so for $x > 0$:
$$\frac{1-F(n)}{1-F(nx)} \to x^{1/\gamma}, \quad n \to \infty. \quad (2)$$
For real $t$ define the integer $n(t)$ so that $n(t) \le t < n(t)+1$. Then for $x > 0$, by monotonicity of $F$:
$$\frac{1-F([n(t)+1]x)}{1-F(n(t))} \le \frac{1-F(tx)}{1-F(t)} \le \frac{1-F(n(t)x)}{1-F(n(t)+1)}.$$
Thus (7.31) follows from (2) and an exercise that $x^* = \infty$ implies that $\frac{1-F(n(t))}{1-F(n(t)+1)} \to 1$. Hint: $F(t) \to 1$.
Conversely, if $x^* = \infty$ and (7.31) is satisfied, then (2) follows, as does (1) by Corollary II.8.28, that $g_n^*(x) \Rightarrow g^*(x)$ implies that $g_n(y) \Rightarrow g(y)$. This corollary applies to this sequence of increasing functions $g_n(y)$ by the same proof, but replacing the reference to Proposition II.3.26 with one to Corollary II.3.27.
Now let $s = \frac{1}{1-F(t)}$. Since $U(s) = F^*(1-1/s) = F^*(F(t))$:
$$\frac{U(sy)}{U(s)} = \frac{t}{F^*(F(t))}\cdot\frac{1}{t}\,U\left(\frac{y}{1-F(t)}\right), \quad (3)$$
and so:
$$\lim_{s\to\infty}\frac{U(sy)}{U(s)} = y^\gamma.$$
Defining normalizing sequences $a_n = \gamma U(n)$ and $b_n = U(n)$ obtains that for all $y > 0$:
$$\lim_{n\to\infty}\frac{U(ny) - b_n}{a_n} = \frac{y^\gamma - 1}{\gamma}. \quad (4)$$
We now recall that the first step of the proof of the Fisher-Tippett-Gnedenko theorem of Proposition 7.15 was to derive (4), and then proceed to obtain from this that $F \in D(G_\gamma)$.
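A numeric sketch of (7.31)-(7.32) (illustrative code, not from the text): the shifted Pareto $F(x) = 1-(1+x)^{-1/\gamma}$ is in $D(G_\gamma)$ but is not itself Pareto, yet its tail ratio converges to $x^{-1/\gamma}$ as the proposition asserts:

```python
gamma = 0.25

def tail(x):
    # 1 - F(x) for the shifted Pareto F(x) = 1 - (1 + x)^(-1/gamma), x >= 0,
    # which is in D(G_gamma) with gamma > 0 but is not a Pareto distribution.
    return (1 + x) ** (-1 / gamma)

x = 2.0
ratios = [tail(t * x) / tail(t) for t in (1e1, 1e3, 1e5)]
limit = x ** (-1 / gamma)   # the asymptotic Pareto tail of (7.31)
```

Equivalently, $\Pr[X \le tx \mid X > t] = 1 - \text{ratio}$ converges to the Pareto conditional distribution $1 - x^{-1/\gamma}$ of (7.32).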
Example 7.29 (Varying at infinity: $F \in D(G_\gamma)$ with $\gamma > 0$) Adding to Example 7.26, the result in (7.31) states that if $F \in D(G_\gamma)$ with $\gamma > 0$, then:
• $1 - F$ is regularly varying at infinity with index $-1/\gamma$.
Hence if $F \in D(G_\gamma)$ with $\gamma > 0$, then $1 - F \in RV_{-1/\gamma}$.
This can also be expressed in an even more descriptive way. If $F \in D(G_\gamma)$ for $\gamma > 0$, then as $x \to \infty$:
$$F(x) = 1 - L(x)\,x^{-1/\gamma}, \quad L \in RV_0, \quad (7.33)$$
which is to say that $L$ is slowly varying at infinity. This result was derived by Gnedenko, and follows from (7.31) by considering $L(tx)/L(t)$.
Thus, if F ∈ D(Gγ ) for γ > 0, then (7.33) states that F has a fat tail in the sense that
1 − F (x) effectively decays like a power function. This is a distributional observation often
identified in finance applications.
Since the criterion that determines if F ∈ D(Gγ ) is based on a property of the distribu-
tion function F n (x) of the maximum of a sample of n variates, it seems logical to expect
that this property is preserved in the various “tail” conditional distribution functions of
F (x). The answer is in the affirmative, as is the converse, and is proved with the criterion
of Proposition 7.28.
Corollary 7.30 (Conditional tail distributions of $F \in D(G_\gamma)$ with $\gamma > 0$) Given a distribution function $F \in D(G_\gamma)$ with $\gamma > 0$, define for $y > 0$:
• The conditional tail distribution function $F_y(x)$, defined on $x \ge y$ by:
$$F_y(x) \equiv \frac{F(x) - F(y)}{1 - F(y)}. \quad (7.34)$$
• The relative conditional tail distribution function $\tilde F_y(x)$, defined on $x \ge 1$ by:
$$\tilde F_y(x) \equiv \frac{F(xy) - F(y)}{1 - F(y)}. \quad (7.35)$$
Then $F_y, \tilde F_y \in D(G_\gamma)$.
Conversely, if either $F_y \in D(G_\gamma)$ or $\tilde F_y \in D(G_\gamma)$ for $\gamma > 0$ and some $y > 0$, then $F \in D(G_\gamma)$.
Proof. First note that by definition, if $x^* = \infty$ for $F(x)$, then $x^* = \infty$ for $F_y$ and $\tilde F_y$.
For $x > 0$ and $t \ge y/x$:
$$\frac{1-F_y(tx)}{1-F_y(t)} = \frac{1-F(tx)}{1-F(t)}. \quad (1)$$
Similarly, for $x > 0$ and $t \ge 1/x$:
$$\frac{1-\tilde F_y(tx)}{1-\tilde F_y(t)} = \frac{1-F(tyx)}{1-F(ty)}. \quad (2)$$
Thus $F_y$ and $\tilde F_y$ satisfy (7.31), and $F_y, \tilde F_y \in D(G_\gamma)$ by Proposition 7.28.
The converse follows from the identities in (1) and (2).
Example 7.31 (Pareto conditional tail distributions) If $F(x)$ is the Pareto distribution of (7.17), defined for $x \ge x_0$ by:
$$F(x) = 1 - \left(\frac{x}{x_0}\right)^{-1/\gamma},$$
then for all $t$:
$$\frac{1-F(tx)}{1-F(t)} = x^{-1/\gamma}.$$
Given $y \ge x_0$:
$$F_y(x) \equiv 1 - \left(\frac{x}{y}\right)^{-1/\gamma},$$
so $F_y(x)$ is Pareto on $x \ge y$.
Similarly:
$$\tilde F_y(x) = 1 - x^{-1/\gamma},$$
so $\tilde F_y(x)$ is Pareto on $x \ge 1$.
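A simulation sketch of this example (assumed illustrative parameter values): sampling from the Pareto via inverse transform and conditioning on $X > y$, the empirical tail of the rescaled exceedances $X/y$ should match the Pareto tail $x^{-1/\gamma}$ of $\tilde F_y$:

```python
import random

random.seed(11)
gamma, x0 = 0.5, 1.0
y, x = 3.0, 2.0

n = 200_000
# Inverse transform: X = x0 * (1 - U)^{-gamma} has the Pareto distribution (7.17)
sample = [x0 * (1 - random.random()) ** (-gamma) for _ in range(n)]

cond = [s / y for s in sample if s > y]             # rescaled exceedances of y
frac = sum(1 for c in cond if c > x) / len(cond)    # empirical Pr[X/y > x | X > y]
target = x ** (-1 / gamma)                          # tail of tilde-F_y: x^{-1/gamma}
```

The empirical conditional tail fraction agrees with $x^{-1/\gamma}$ up to sampling error, reflecting that the rescaled conditional distribution is again Pareto with the same index.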
We present one last result and corollary that improve the above proposition regarding the asymptotic Pareto-like behavior of $F \in D(G_\gamma)$ with $\gamma > 0$. The proposition sharpens the result above by providing a uniform estimate of convergence. This proposition is a special case of Karamata's Representation theorem, named for Jovan Karamata (1902-1967). This result is also true with modifications for $\gamma < 0$ and $\gamma = 0$, but we do not develop this theory. See de Haan and Ferreira (2006).
Remark 7.32 (Pareto and Karamata's representation theorem) In the special case of Karamata's result where $c(t) \equiv c$ and $g(t) \equiv \gamma t$, the result (7.37) states that for $t > t_0$:
$$1 - F(t) \equiv c\left(\frac{t}{t_0}\right)^{-1/\gamma}.$$
Thus $F(t)$ is a Pareto distribution.
Hence for $x \ge 1$:
$$\exp\left[-(\gamma-\epsilon)^{-1}\int_t^{tx}\frac{ds}{s}\right] \le \exp\left[-\int_t^{tx}\frac{ds}{g(s)}\right] \le \exp\left[-(\gamma+\epsilon)^{-1}\int_t^{tx}\frac{ds}{s}\right].$$
Thus for $t \ge T$, $x \ge 1$:
$$x^{-1/(\gamma-\epsilon)} \le \exp\left[-\int_t^{tx}\frac{ds}{g(s)}\right] \le x^{-1/(\gamma+\epsilon)},$$
and since $\frac{c(tx)}{c(t)} \to 1$ for all $x \ge 1$, (7.37) obtains:
$$x^{-1/(\gamma-\epsilon)} \le \lim_{t\to\infty}\frac{1-F(tx)}{1-F(t)} \le x^{-1/(\gamma+\epsilon)}.$$
$$r(t) = \frac{1-F(t)}{\int_t^\infty (1-F(x))\,\frac{dx}{x}}. \quad (1)$$
In Proposition 7.35 below it will be proved that the Lebesgue integral in the definition of $r(t)$ is finite for all $t \ge T$ with $T$ to be defined, and in (7.39) we prove that $\lim_{t\to\infty} r(t) = 1/\gamma$. The derivative identities below hold a.e., meaning almost everywhere, defined as outside a set of Lebesgue measure 0. Thus if $t_0 \ge T$, it follows from Proposition III.3.62 and (1) that:
$$-\int_{t_0}^t \frac{r(s)}{s}\,ds = \ln\int_t^\infty (1-F(x))\,\frac{dx}{x} - \ln\int_{t_0}^\infty (1-F(x))\,\frac{dx}{x} = \ln\frac{1-F(t)}{r(t)} - \ln\frac{1-F(t_0)}{r(t_0)}.$$
Rewriting:
$$\exp\left[-\int_{t_0}^t \frac{r(s)}{s}\,ds\right] = \frac{1-F(t)}{r(t)}\cdot\frac{r(t_0)}{1-F(t_0)},$$
and so:
$$1-F(t) = r(t)\,\frac{1-F(t_0)}{r(t_0)}\exp\left[-\int_{t_0}^t \frac{r(s)}{s}\,ds\right].$$
Defining:
$$c(t) = r(t)\,\frac{1-F(t_0)}{r(t_0)}, \qquad \frac{1}{g(s)} = \frac{r(s)}{s},$$
yields (7.37).
$$c \equiv \lim_{t\to\infty} c(t) = \frac{1}{\gamma}\int_{t_0}^\infty (1-F(x))\,\frac{dx}{x} \in (0,\infty),$$
and:
$$\lim_{t\to\infty}\frac{g(t)}{t} = \gamma.$$
The following corollary proves that for $F \in D(G_\gamma)$ with $\gamma > 0$, not only are the conditional distributions of $F(x)$ asymptotically Pareto as proved in Proposition 7.28, but these conditional distributions are bounded by Pareto distributions with arbitrarily close tail indexes of $\frac{1}{\gamma\pm\epsilon}$ if $t$ is large enough.
Proof. Given (7.36), for any $\epsilon > 0$ with $\epsilon < 1$ there is a $T'$ so that for $t \ge T'$:
$$\gamma - \epsilon \le \frac{g(t)}{t} \le \gamma + \epsilon, \qquad c(1-\epsilon/3) \le c(t) \le c(1+\epsilon/3). \quad (1)$$
As in the above proof:
$$\frac{1-F(tx)}{1-F(t)} = \frac{c(tx)}{c(t)}\exp\left[-\int_t^{tx}\frac{ds}{g(s)}\right],$$
and:
$$x^{-1/(\gamma-\epsilon)} \le \exp\left[-\int_t^{tx}\frac{ds}{g(s)}\right] \le x^{-1/(\gamma+\epsilon)}.$$
Thus for $t \ge T'$ and all $x \ge 1$:
$$\frac{1-\epsilon/3}{1+\epsilon/3}\,x^{-1/(\gamma-\epsilon)} \le \frac{1-F(tx)}{1-F(t)} \le \frac{1+\epsilon/3}{1-\epsilon/3}\,x^{-1/(\gamma+\epsilon)}.$$
The result now follows since $\frac{1+\epsilon/3}{1-\epsilon/3} \le 1+\epsilon$ and $\frac{1-\epsilon/3}{1+\epsilon/3} \ge 1-\epsilon$.
1. Provide an approximation to the given γ for finite t, meaning when based on sample
variates above a given order statistic or quantile;
We derive this first result in this section, and a version of the second result in the next
section.
For the approximation result, the proposition below provides another representation for
γ based on an integral of the distribution function F introduced in the proof of Proposi-
tion 7.33. This representation will then support the conclusion that the Hill estimator γ H
approximates the exact value of γ for F ∈ D(Gγ ) with γ > 0.
The formula for γ > 0 in (7.39) generalizes von Mises’ condition in the sense that it
is valid for all such F without differentiability conditions. On the other hand, von Mises’
condition is valid for all γ.
While the proposition below is stated for γ > 0, it can also be formulated with modifi-
cations in the cases of γ < 0 and γ = 0. We do not develop this theory, but reference de
Haan and Ferreira (2006).
$$\frac{1-F(te)}{1-F(t)} \le (1+\epsilon)\,e^{-1/\gamma} < e^{\epsilon - 1/\gamma}.$$
Iterating:
$$\frac{1-F(te^n)}{1-F(t)} = \prod_{k=1}^n \frac{1-F(te^k)}{1-F(te^{k-1})} \le e^{n(\epsilon - 1/\gamma)}.$$
Given $x > 1$, let $n \equiv \lceil \ln x\rceil$, the least integer greater than or equal to $\ln x$. Then $e^{n-1} \le x$, so by monotonicity of $F$ and $n-1 \ge \ln x - 1$:
$$\frac{1-F(tx)}{1-F(t)} \le e^{(n-1)(\epsilon - 1/\gamma)} \le e^{1/\gamma - \epsilon}\,x^{\epsilon - 1/\gamma}.$$
Hence:
$$\frac{1}{x}\cdot\frac{1-F(tx)}{1-F(t)} \le e^{1/\gamma - \epsilon}\,x^{\epsilon - 1/\gamma - 1},$$
$$\frac{r(t)}{t} = -\frac{b'(t)}{b(t)}, \quad \text{a.e.}$$
Now $b(t) > 0$ for all such $t$ since $F(x) < 1$ for all $x > 0$ by assumption. Thus $\ln b(t)$ is well defined, monotonic, differentiable a.e. by Proposition III.3.12, and thus $[\ln b(t)]' = \frac{b'(t)}{b(t)}$ a.e.
Consequently, Proposition III.3.62 obtains:
$$\int_T^t \frac{r(y)}{y}\,dy = -\ln\int_t^\infty (1-F(y))\,\frac{dy}{y} + \ln\int_T^\infty (1-F(y))\,\frac{dy}{y},$$
and so by definition of $r(t)$:
$$1-F(t) = r(t)\int_t^\infty (1-F(y))\,\frac{dy}{y} = r(t)\left[\int_T^\infty (1-F(y))\,\frac{dy}{y}\right]\exp\left[-\int_T^t \frac{r(y)}{y}\,dy\right].$$
With a similar expression for $1-F(tx)$, and a change of variable:
$$\frac{1-F(tx)}{1-F(t)} = \frac{r(tx)}{r(t)}\exp\left[-\int_t^{tx}\frac{r(y)}{y}\,dy\right] = \frac{r(tx)}{r(t)}\exp\left[-\int_1^x \frac{r(ty)}{y}\,dy\right].$$
Letting $t \to \infty$ obtains from (7.39) that $\frac{r(tx)}{r(t)} \to 1$ and $r(ty) \to 1/\gamma$, and so by Lebesgue's dominated convergence theorem:
$$\lim_{t\to\infty}\frac{1-F(tx)}{1-F(t)} = x^{-1/\gamma}.$$
With most of the hard work done and the integral formula for γ derived in (7.39), we
will now demonstrate that the formula in (7.18) for the Hill estimator γ H approximates the
value of this integral. To this end, we begin with a transformation of the formula in (7.39)
into a Riemann-Stieltjes integral of Chapter III.4.
The integral in (7.40) is well-defined over bounded intervals by Proposition III.4.17,
since F (x) is increasing and ln(x/t) is continuous, and this extends to the improper integral
as will be addressed in the proof.
Proposition 7.36 (Another formula for $\gamma$ for $F \in D(G_\gamma)$, $\gamma > 0$) Given a distribution function $F \in D(G_\gamma)$ with $\gamma > 0$:
$$\gamma = \lim_{t\to\infty}\frac{\int_t^\infty \ln\frac{x}{t}\,dF(x)}{1-F(t)}, \quad (7.40)$$
where the integral is defined as a Riemann-Stieltjes integral.
Proof. Given $F \in D(G_\gamma)$ with $\gamma > 0$, define $h(x) \equiv 1-F(x)$ and $k(x) \equiv \ln\frac{x}{t}$ for fixed $t$.
The integral on the right is defined as a Riemann integral by Proposition III.1.22, since $k'(x)$ is continuous, and $h(x)$ is monotonic and differentiable almost everywhere by Proposition III.3.12, and so continuous almost everywhere.
Substituting for $h(x)$ and $k'(x) = 1/x$ obtains:
$$\int_t^N h(x)\,dk = \int_t^N (1-F(x))\,\frac{dx}{x}.$$
Since the limit as $N \to \infty$ of the integral on the right exists by Proposition 7.35:
$$\lim_{N\to\infty}\int_t^N h(x)\,dk = \int_t^\infty (1-F(x))\,\frac{dx}{x}. \quad (1)$$
Using the integration by parts formula for Riemann-Stieltjes integrals of Proposition III.4.14:
$$\int_t^N h(x)\,dk = h(N)k(N) - h(t)k(t) - \int_t^N k(x)\,dh. \quad (2)$$
Now $h(t)$ is finite and $k(t) = 0$ by definition. Also, by (7.38), if $F \in D(G_\gamma)$ with $\gamma > 0$, then for any $\epsilon > 0$ there is a $C_\epsilon = (1+\epsilon)\,t^{1/(\gamma+\epsilon)}$ so that:
$$1 - F(N) \le C_\epsilon\, N^{-1/(\gamma+\epsilon)},$$
and hence as $N \to \infty$:
$$h(N)k(N) = (1-F(N))\ln\frac{N}{t} \to 0. \quad (3)$$
Combining (1)-(3) obtains:
$$\int_t^\infty (1-F(x))\,\frac{dx}{x} = -\lim_{N\to\infty}\int_t^N k(x)\,dh = -\int_t^\infty \ln\frac{x}{t}\,d(1-F),$$
where the integral on the right is again defined as a Riemann-Stieltjes integral.
By the construction of a Riemann-Stieltjes integral, using $-(1-F)$ or $F$ as an integrator produces the same result. This plus (7.39) proves (7.40).
Remark 7.37 (γ and Conditional expectation) The formula in (7.40) can be inter-
preted in the context of expectations as defined in (4.1).
To this end, let X be a random variable defined on a probability space (S, E, λ) with
distribution function F ∈ D(Gγ ) with γ > 0, and let t ≥ T be fixed. Given F (x), this random
variable and probability space exist by Proposition 1.4. Define the conditional distribution
function Ft (x) for x ≥ t by:
$$F_t(x) \equiv \frac{F(x) - F(t)}{1 - F(t)}.$$
Since $F(t)$ is a constant, items 5 and 6 of Proposition III.4.24 obtain:
$$\frac{\int_t^\infty \ln\frac{x}{t}\,dF}{1-F(t)} = \int_t^\infty \ln\frac{x}{t}\,dF_t(x).$$
Then by (4.1):
$$\int_t^\infty \ln\frac{x}{t}\,dF_t(x) = E\left[\ln\frac{X}{t}\,\Big|\,X \ge t\right],$$
where by notational convention $E[\ln(X/t)]$ of (4.1) is changed as above to emphasize the use of the conditional distribution function $F_t(x)$.
Then by (7.40):
$$\gamma = \lim_{t\to\infty}\int_t^\infty \ln\frac{x}{t}\,dF_t(x) = \lim_{t\to\infty}\frac{\int_t^\infty \ln\frac{x}{t}\,dF(x)}{1-F(t)}, \quad (7.41)$$
and equivalently:
$$\gamma = \lim_{t\to\infty} E\left[\ln\frac{X}{t}\,\Big|\,X \ge t\right].$$
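The conditional-expectation formula suggests an estimator: average $\ln(X/t)$ over the sample points exceeding a high threshold $t$. A simulation sketch (assumed model and threshold, not from the text) using the shifted Pareto $F(x) = 1-(1+x)^{-1/\gamma}$, which is in $D(G_\gamma)$ but is not Pareto, so the formula holds only in the limit:

```python
import math
import random

random.seed(5)
gamma = 0.5

# X = (1 - U)^{-gamma} - 1 has F(x) = 1 - (1 + x)^{-1/gamma}: in D(G_gamma), not Pareto
n = 200_000
sample = [(1 - random.random()) ** (-gamma) - 1 for _ in range(n)]

def cond_mean_log(t):
    # Empirical version of E[ln(X/t) | X >= t]
    logs = [math.log(s / t) for s in sample if s >= t]
    return sum(logs) / len(logs)

est = cond_mean_log(20.0)
```

At a finite threshold the estimate carries both a bias (the limit has not been reached) and sampling noise from the relatively few exceedances, but it is already in the neighborhood of $\gamma$.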
With γ defined in (7.41) as a limit as t → ∞, an approximation can be achieved for γ by
evaluating this expression for large enough t.
Given a sample of random variates $\{X_i\}_{i=1}^n$ with distribution $F \in D(G_\gamma)$ with $\gamma > 0$, and associated order statistics $\{X_{(j)}\}_{j=1}^n$, it follows from Remark 7.37 that for $n-k$ large:
$$\gamma \approx \frac{\int_{X_{(n-k)}}^\infty \ln\big(x/X_{(n-k)}\big)\,dF}{1 - F(X_{(n-k)})}.$$
This is a nice formula, but for estimation of γ it is not yet useful because F (x) depends on
γ and is thus unknown.
The next result states that γ H ≈ γ when F ∈ D(Gγ ) with γ > 0. The goal of the proof is
to approximate the integral in (7.41), which is based on the unknown distribution function
F, with an integral based on the empirical distribution function Fn implied by the given
sample {Xi }ni=1 . This empirical distribution function was introduced in (6.44) and assigns
a probability of $\frac{1}{n}$ to each variate:
$$F_n(x) = \frac{1}{n}\sum_{j=1}^n \chi_{(-\infty,x]}(X_j).$$
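A minimal sketch of $F_n$ (illustrative code, not from the text):

```python
def F_n(sample, x):
    """Empirical distribution function of (6.44): assigns probability 1/n
    to each variate, so F_n(x) is the fraction of sample points <= x."""
    return sum(1 for s in sample if s <= x) / len(sample)

sample = [2.0, 0.5, 3.0, 1.0]
vals = [F_n(sample, x) for x in (0.0, 1.0, 2.5, 3.0)]
```

$F_n$ is a right-continuous step function jumping by $1/n$ at each sample point, which is what makes Riemann-Stieltjes integrals against $F_n$ collapse into the finite sums appearing in the Hill estimator.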
The informal justification for this approximation is to choose $n$ so that $\sup_x|F_n(x) - F(x)| < \epsilon$, since then $\sup_x|\Delta F_n - \Delta F| < 2\epsilon$ for any term in the defining Riemann-Stieltjes summations. However, while intuitively plausible, to be made rigorous this requires another approximation, because $\ln\big(x/X_{(n-k)}\big)$ is unbounded.
In the next section it will be demonstrated, under more clearly defined conditions, that $\gamma_H \to_1 \gamma$. That is, $\gamma_H$ converges with probability 1 to $\gamma$, as defined in (6.31).
Proposition 7.38 ($\gamma_H \approx \gamma$ for $F \in D(G_\gamma)$, $\gamma > 0$) Given a distribution function $F \in D(G_\gamma)$ with $\gamma > 0$, and a random sample $\{X_i\}_{i=1}^n$ of $X$ defined on $(\mathcal S, \mathcal E, \lambda)$ with distribution $F$ and order statistics $\{X_{(j)}\}_{j=1}^n$, then for $n$ and $X_{(n-k)}$ large:
$$\gamma_H^{(k,n)} \equiv \frac{1}{k}\sum_{j=0}^{k-1}\Big[\ln X_{(n-j)} - \ln X_{(n-k)}\Big] \approx \gamma. \quad (7.42)$$
For given $\epsilon$, choose $N$ by the Glivenko-Cantelli theorem so that with probability 1, for $n \ge N$:
$$\sup_x |F_n(x) - F(x)| < \epsilon. \quad (2)$$
Given $\{X_i\}_{i=1}^n$ with $n \ge N$ and associated $\{X_{(j)}\}_{j=1}^n$, assume that $n$ and $n-k$ are large enough so that $X_{(n-k)} \ge T$ and $X_{(n)} \ge t_0(X_{(n-k)})$. This is possible since $X_{(n)} \to \infty$ and $X_{(n-k_n)} \to \infty$ with probability 1 for $n \to \infty$ and $k_n/n \to 0$ by Proposition 6.52, since $x^* = \infty$ by Proposition 7.28. Thus by (1):
$$\left|\frac{\int_{X_{(n-k)}}^{X_{(n)}} \ln\frac{x}{X_{(n-k)}}\,dF(x)}{1-F(X_{(n-k)})} - \gamma\right| < 2\epsilon. \quad (3)$$
Now by Definition III.4.3 of the Riemann-Stieltjes integral, given this $\epsilon$ there exists $\delta$ so that for any partition $\{x_i\}_{i=0}^m$ of $[X_{(n-k)}, X_{(n)}]$ of mesh size $\delta$ or smaller:
$$\left|\int_{X_{(n-k)}}^{X_{(n)}} \ln\frac{x}{X_{(n-k)}}\,dF(x) - \sum_{i=1}^m \ln\frac{\tilde x_i}{X_{(n-k)}}\,\Delta F_i\right| < \epsilon, \quad (4)$$
Combining (4) and (5), and reflecting the definition of $F_n$ in the associated Riemann-Stieltjes integral, obtains:
$$\left|\int_{X_{(n-k)}}^{X_{(n)}} \ln\frac{x}{X_{(n-k)}}\,dF(x) - \frac{1}{n}\sum_{j=0}^{k-1}\ln\frac{X_{(n-j)}}{X_{(n-k)}}\right| = \left|\int_{X_{(n-k)}}^{X_{(n)}} \ln\frac{x}{X_{(n-k)}}\,dF(x) - \frac{k}{n}\,\gamma_H^{(k,n)}\right| \le \epsilon\left(1 + 2m\ln\frac{X_{(n)}}{X_{(n-k)}}\right).$$
Let:
$$I \equiv \frac{\int_{X_{(n-k)}}^{X_{(n)}} \ln\frac{x}{X_{(n-k)}}\,dF(x)}{1-F(X_{(n-k)})}.$$
Comparing (3) and (6) obtains that for any $\epsilon > 0$, large samples will obtain that $I$ is within $2\epsilon$ of $\gamma$, and within $M\epsilon$ of a multiple of $\gamma_H^{(k,n)}$, where this multiple is approximately 1 by the Glivenko-Cantelli theorem. This supports the assertion of an approximation for large samples, but it is not clear here that $M$ is uniformly bounded, and thus we cannot assert more on the quality of this approximation in the limit.
But far beyond this, Karamata's representation theorem applies not only to distribution functions, but to all functions $f$ that are regularly varying at infinity with index $\alpha \in \mathbb{R}$. As noted in (7.29) of Definition 7.25, this terminology means that for all $x \ge x_0 > 0$:
$$\lim_{t\to\infty}\frac{f(tx)}{f(t)} = x^\alpha,$$
and is denoted $f \in RV_\alpha$.
We again need this general result only for the case $\alpha > 0$, and it will be applied to the function $U(t) \equiv \left(\frac{1}{1-F}\right)^*(t)$, the left-continuous inverse of $1/(1-F)$, which is not a distribution function. This representation theorem will then provide the needed uniform estimate of $U(tx)/U(t)$ for $t \in (t_0,\infty)$ and all $x \ge 1$, just as Proposition 7.33 supported such a uniform estimate of $(1-F(tx))/(1-F(t))$ in (7.38) of Corollary 7.34.
We state the general result without proof.
Proposition 7.39 (Karamata’s Representation theorem) A function f is regularly
varying at infinity with index α if and only if there are positive measurable functions c and
g, so that for all t ∈ (t0 , ∞) with t0 > 0 :
Z t
h(s)
f (t) = c(t) exp ds , (7.43)
t0 s
where
lim c(t) = c ∈ (0, ∞), lim h(t) = α. (7.44)
t→∞ t→∞
Proof. See de Haan and Ferreira, Theorem B.1.6.
The proof of the needed corollary result now follows the proof of Corollary 7.34.
Corollary 7.40 (Bounding $\frac{f(tx)}{f(t)}$ by Pareto) Given a function $f(x)$ that is regularly varying at infinity with index $\alpha$, then for any $\epsilon > 0$ with $\epsilon < 1$ there is a $T$ so that for $t \ge T$ and all $x \ge 1$:
$$(1-\epsilon)\,x^{\alpha-\epsilon} \le \frac{f(tx)}{f(t)} \le (1+\epsilon)\,x^{\alpha+\epsilon}. \quad (7.45)$$
Proof. If (7.43) is satisfied for $t > t_0$, then for $x > 1$:
$$\frac{f(tx)}{f(t)} = \frac{c(tx)}{c(t)}\exp\left[\int_t^{tx}\frac{h(s)}{s}\,ds\right].$$
By the limits in (7.44), for any $\epsilon > 0$ with $\epsilon < 1$ there is a $T$ so that for $t \ge T$:
$$\alpha - \epsilon \le h(t) \le \alpha + \epsilon, \qquad c(1-\epsilon/3) \le c(t) \le c(1+\epsilon/3).$$
Hence for $x > 1$:
$$\exp\left[(\alpha-\epsilon)\int_t^{tx}\frac{ds}{s}\right] \le \exp\left[\int_t^{tx}\frac{h(s)}{s}\,ds\right] \le \exp\left[(\alpha+\epsilon)\int_t^{tx}\frac{ds}{s}\right], \quad (1)$$
and equivalently:
$$x^{\alpha-\epsilon} \le \exp\left[\int_t^{tx}\frac{h(s)}{s}\,ds\right] \le x^{\alpha+\epsilon}. \quad (2)$$
The bounds in (1) and (2) then obtain:
$$\frac{1-\epsilon/3}{1+\epsilon/3}\,x^{\alpha-\epsilon} \le \frac{f(tx)}{f(t)} \le \frac{1+\epsilon/3}{1-\epsilon/3}\,x^{\alpha+\epsilon},$$
and the result in (7.45) then follows from $\frac{1+\epsilon/3}{1-\epsilon/3} \le 1+\epsilon$ and $\frac{1-\epsilon/3}{1+\epsilon/3} \ge 1-\epsilon$.
We now turn to the result that if $F \in D(G_\gamma)$ with $\gamma > 0$, then recalling Definition 6.44, the Hill estimator $\gamma_H$ converges to $\gamma$ with probability 1 as $n \to \infty$. To achieve this convergence result we will also require that $k \to \infty$, and that the implied quantile $q_{k,n} \equiv 1 - k/n$ of the base order statistic $X_{(n-k)}$ converge to 1, or equivalently $k/n \to 0$. This assures that $n-k \to \infty$, and thus the base variate in the Hill calculation satisfies $X_{(n-k)} \to \infty$ by Proposition 6.52.
We identify the probability space on which these variates are defined to give meaning to the associated probability statements.
Proposition 7.41 (Hill estimator: $\gamma_H^{(k,n)} \to_1 \gamma$) Let $\{X_i\}_{i=1}^\infty$ be independent and identically distributed random variables defined on a probability space $(\mathcal S, \mathcal E, \lambda)$ with distribution function $F \in D(G_\gamma)$ with $\gamma > 0$, and let $\gamma_H^{(k,n)}$ denote the Hill estimator defined in (7.18) with given $k, n$.
Then as $k, n \to \infty$ and $k/n \to 0$, the estimator $\gamma_H^{(k,n)}$ converges to $\gamma$ with probability 1:
$$\gamma_H^{(k,n)} \to_1 \gamma. \quad (7.46)$$
Proof. We first apply Corollary 7.40 to the left-continuous inverse function
U(t) ≡ (1/(1 − F))^∗(t), which is regularly varying at infinity with index γ by (7.27). This
obtains by (7.45) that for any ε > 0 there is a T ≡ T(ε) so that for t ≥ T and all x ≥ 1 :

$$(1 - \epsilon)\, x^{\gamma - \epsilon} \le \frac{U(tx)}{U(t)} \le (1 + \epsilon)\, x^{\gamma + \epsilon},$$

and so, defining:

$$Z_{k,n} \equiv \frac{1}{k} \sum_{j=0}^{k-1} \ln \frac{Y_{(n-j)}}{Y_{(n-k)}}.$$
Extreme Value Theory 2 235
We now show that as k, n → ∞ and k/n → 0, Z_{k,n} converges to 1 with probability 1 :

$$Z_{k,n} \to_1 1. \qquad (4)$$

Equivalently, writing X_{(n−j)} ≡ ln Y_{(n−j)} :

$$Z_{k,n} = \frac{1}{k} \sum_{j=0}^{k-1} \left( X_{(n-j)} - X_{(n-k)} \right).$$
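The convergence in (7.46) is easy to observe by simulation. The sketch below draws from an exact Pareto distribution with 1 − F(x) = x^{−1/γ} by inverse transform, and evaluates the Hill estimator in its usual mean log-excess form, consistent with Z_{k,n} above; the sample sizes and seed are illustrative choices:

```python
import random, math

def hill_estimator(sample, k):
    # Hill estimator: mean log-excess of the top k order statistics
    # over the (k+1)-st largest value X_(n-k)
    xs = sorted(sample)
    n = len(xs)
    base = xs[n - 1 - k]  # X_(n-k)
    return sum(math.log(xs[n - 1 - j] / base) for j in range(k)) / k

random.seed(3)
gamma = 0.5                # so 1 - F(x) = x^(-1/gamma) = x^(-2)
n, k = 200_000, 2_000      # k large with k/n small, as the proposition requires
# inverse transform: if U is uniform on (0,1), U^(-gamma) is Pareto
sample = [random.random() ** (-gamma) for _ in range(n)]
est = hill_estimator(sample, k)
print(f"gamma = {gamma}, Hill estimate = {est:.4f}")
```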
Then F ∈ D(Gγ ).
Proof. See de Haan and Ferreira, Theorem 3.2.4.
$$\frac{U(tx) - U(t)}{a(t)} \to \frac{x^\gamma - 1}{\gamma},$$

with U(t) ≡ (1/(1 − F))^∗(t), the left-continuous inverse of 1/(1 − F). Here, the constant c in
the earlier formula is integrated into the definition of a(t), and so here a(t) ≡ a_c(t). The
right-hand limit is defined to be ln x when γ = 0, which equals lim_{γ→0} (x^γ − 1)/γ.
In order to obtain the asymptotic normality result noted above, the distribution function
F must satisfy an additional assumption known as a second-order condition, which
provides information on the rate of convergence in the above limit.
1. There is a function A(t) with A(t) → 0 as t → ∞, and there exists T so that A(t) does
not change sign for t ≥ T ;
2. There is a function H(x) so that as t → ∞ :

$$\frac{1}{A(t)} \left( \frac{U(tx) - U(t)}{a(t)} - \frac{x^\gamma - 1}{\gamma} \right) \to H(x). \qquad (7.47)$$

By (7.26), recalling as above that here a(t) ≡ a_c(t), this is equivalent to:

$$\frac{1}{A(t)} \left( \frac{U(tx)}{U(t)} - x^\gamma \right) \to H(x).$$

For well-definedness, it is required that H(x) is not a multiple of (x^γ − 1)/γ. It can then
be shown that H(x) has the form:

$$H(x) = x^\gamma\, \frac{x^\rho - 1}{\rho},$$

where ρ ≤ 0. When ρ = 0 :

$$H(x) \equiv x^\gamma \ln x,$$

which equals x^γ lim_{ρ→0} (x^ρ − 1)/ρ.
The significance of the parameter ρ is that it is then the case that A(t) ∈ RVρ , meaning
that A(t) is regularly varying at infinity with index ρ.
We then have the following result.
Proposition 7.44 (Hill estimator: asymptotic normality) Let F ∈ D(Gγ ) with γ >
0 and assume that F satisfies a second-order condition with ρ ≤ 0. Let F_{γ′_H} denote the
distribution function of the normalized Hill estimator:

$$\gamma'_H = \frac{\gamma_H - \gamma}{1/\sqrt{k}}.$$

If:

$$\lambda \equiv \lim_{n \to \infty} \sqrt{k}\, A\!\left( \frac{n}{k} \right) < \infty,$$

as k, n → ∞ and k/n → 0, then:

$$F_{\gamma'_H} \Rightarrow N\!\left( \frac{\lambda}{1 - \rho},\, \gamma^2 \right), \qquad (7.48)$$

where N(λ/(1 − ρ), γ²) denotes the normal distribution with mean λ/(1 − ρ) and variance γ².
Proof. See de Haan and Ferreira, Theorem 3.2.5.
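For an exact Pareto distribution, U(t) = t^γ and so U(tx)/U(t) − x^γ vanishes identically; the second-order term then contributes nothing and λ = 0, so (7.48) predicts that √k (γ_H − γ) is approximately N(0, γ²). A Monte Carlo sketch of this prediction (sample sizes, replication count, and seed are illustrative choices):

```python
import random, math

def hill(xs_sorted, k):
    # Hill estimator from a pre-sorted sample: mean log-excess of the
    # top k order statistics over X_(n-k)
    n = len(xs_sorted)
    base = xs_sorted[n - 1 - k]
    return sum(math.log(xs_sorted[n - 1 - j] / base) for j in range(k)) / k

random.seed(7)
gamma, n, k, reps = 1.0, 10_000, 500, 300
zs = []
for _ in range(reps):
    xs = sorted(random.random() ** (-gamma) for _ in range(n))  # exact Pareto
    zs.append(math.sqrt(k) * (hill(xs, k) - gamma))  # normalized Hill estimator
mean = sum(zs) / reps
sd = math.sqrt(sum((z - mean) ** 2 for z in zs) / (reps - 1))
print(f"normalized Hill estimator: mean = {mean:.3f}, sd = {sd:.3f}")
```

With λ = 0 the sample mean should be near 0 and the sample standard deviation near γ = 1.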
Remark 7.45 (On λ) Since n/k → ∞, it follows that A(n/k) → 0 by the definition
of the second-order condition. However, the assumption that k/n → 0 implies that k/n may
approach 0 at any rate as a function of n, and it is thus also the case that n/k may approach
∞ at any rate as a function of n.
Consequently, the value of the parameter λ = lim_{n→∞} √(k_n) A(n/k_n) can depend on the
actual sequence {k_n} of parameters used, and need not be finite.
The answer provided by this theorem is Proposition 7.15, which states that if such
sequences and distribution exist, so F ∈ D(G) in the notation of domains of attraction,
then there are real constants A > 0, B, and γ so that G(x) = Gγ (Ax + B) with Gγ (x)
defined for γ ≠ 0 by:

$$G_\gamma(x) = \exp\left( -(1 + \gamma x)^{-1/\gamma} \right), \qquad 1 + \gamma x \ge 0.$$

Proposition 7.28 then provided a characterization of such F ∈ D(Gγ ) for γ > 0 in (7.31),
that for x > 0 :

$$\lim_{t \to \infty} \frac{1 - F(tx)}{1 - F(t)} = x^{-1/\gamma}.$$
This is restated in an even more descriptive way in (7.33), that if F ∈ D(Gγ ) for γ > 0,
then as x → ∞ :

$$F(x) = 1 - L(x)\, x^{-1/\gamma}, \qquad L \in RV_0,$$

where L ∈ RV0 means that L is slowly varying at infinity as in Definition 7.25.
The Pickands-Balkema-de Haan theorem investigates another “tail” distribution,
specifically the conditional probability distribution of exceedances, and the analysis
underlying this result is often referred to as the peaks over threshold method. This
investigation was initiated in Book II, where the following was proved in Proposition II.9.38:
• For all x with x > −1/γ when γ ≥ 0, where −1/0 ≡ −∞, or,
• For 0 ≤ x < −1/γ when γ < 0 :

$$\lim_{t \to x^*} \frac{1 - F(t + x h_a(t))}{1 - F(t)} = \begin{cases} (1 + \gamma x)^{-1/\gamma}, & \gamma \ne 0, \\ \exp(-x), & \gamma = 0. \end{cases}$$
Since:

$$1 - \frac{1 - F(t + x h_a(t))}{1 - F(t)} = \frac{F(t + x h_a(t)) - F(t)}{1 - F(t)},$$

this result can also be expressed as a conditional probability statement. For example with
γ ≠ 0 :

$$\lim_{t \to x^*} \Pr\left[ X \le t + x h_a(t) \mid X > t \right] = 1 - (1 + \gamma x)^{-1/\gamma}.$$

Thus for fixed t “large” relative to x∗ :

$$\Pr\left[ X \le t + y \mid X > t \right] \approx 1 - \left( 1 + \frac{\gamma}{h_a(t)}\, y \right)^{-1/\gamma}. \qquad (7.49)$$
These limiting and approximating distributions are examples of the generalized Pareto
distribution function Hγ,0,β (x), introduced in Definition II.9.39.
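The peaks over threshold approximation in (7.49) can be observed empirically. The sketch below samples an exact Pareto distribution, for which the conditional excess distribution coincides with its Pareto limit at every threshold, and compares empirical exceedance probabilities to that limit; the threshold, sample size, and seed are illustrative choices:

```python
import random

random.seed(11)
gamma, n, t = 0.5, 200_000, 5.0
# inverse transform sampling: 1 - F(x) = x^(-1/gamma) = x^(-2)
sample = [random.random() ** (-gamma) for _ in range(n)]
excess = [x - t for x in sample if x > t]  # exceedances over the threshold t
for y in [1.0, 5.0, 20.0]:
    emp = sum(1 for e in excess if e <= y) / len(excess)
    gpd = 1 - (1 + y / t) ** (-1 / gamma)  # Pareto limit of Pr[X <= t+y | X > t]
    print(f"y = {y:5.1f}: empirical {emp:.4f} vs Pareto limit {gpd:.4f}")
```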
Definition 7.47 (Generalized Pareto distribution) The distribution function Hγ,t,β (x)
defined for γ ≠ 0 and β > 0 by:

$$H_{\gamma,t,\beta}(x) \equiv 1 - \left( 1 + \frac{\gamma}{\beta}(x - t) \right)^{-1/\gamma}, \qquad (7.50)$$

is called a generalized Pareto distribution, abbreviated GPD. When γ = 0, H0,t,β (x)
is defined as the limit of Hγ,t,β (x) as γ → 0 :

$$H_{0,t,\beta}(x) \equiv 1 - \exp\left( -\frac{1}{\beta}(x - t) \right). \qquad (7.51)$$

The distribution Hγ,t,β (x) is defined for x ≥ t when γ ≥ 0, and for t ≤ x ≤ t − β/γ
when γ < 0.
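A direct transcription of (7.50) and (7.51) into Python (the function name and default parameters are ours):

```python
import math

def gpd_cdf(x, gamma, t=0.0, beta=1.0):
    # Generalized Pareto distribution H_{gamma,t,beta}(x) of (7.50)/(7.51)
    if x < t:
        return 0.0
    z = (x - t) / beta
    if gamma == 0.0:
        return 1.0 - math.exp(-z)             # (7.51): the gamma -> 0 limit
    if gamma < 0 and x > t - beta / gamma:    # upper endpoint when gamma < 0
        return 1.0
    return 1.0 - (1.0 + gamma * z) ** (-1.0 / gamma)

# the gamma -> 0 case of (7.51) agrees with (7.50) for small gamma:
for x in [0.5, 1.0, 3.0]:
    assert abs(gpd_cdf(x, 1e-8) - gpd_cdf(x, 0.0)) < 1e-6
print("H_{gamma,t,beta} matches its exponential limit as gamma -> 0")
```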
Remark 7.48 For the result below, we will primarily be interested in Hγ,0,β (x) with γ > 0 :

$$H_{\gamma,0,\beta}(x) \equiv 1 - \left( 1 + \frac{\gamma}{\beta}\, x \right)^{-1/\gamma}, \qquad (7.52)$$

but we introduce the more general notation because it is commonly cited. Our interest in
Hγ,0,β (x) with γ > 0 is motivated by (7.49).
Note that for γ > 0 :

$$H_{\gamma,t,\gamma t}(x) \equiv 1 - \left( \frac{x}{t} \right)^{-1/\gamma}, \qquad x \ge t,$$

and:

$$H_{\gamma,0,\gamma t}(x) \equiv 1 - \left( 1 + \frac{x}{t} \right)^{-1/\gamma}, \qquad x \ge 0,$$

representing two common parametrizations of a standard Pareto distribution.
It is common to represent the exponential index by α, and so α = 1/γ > 0.
The asymptotic result of Proposition II.9.38 for the conditional distribution function
can be improved as stated in Proposition II.9.44, though there without proof. With the aid
of Corollary 7.34 to Karamata's representation theorem, this earlier result can be proved
in the case γ > 0, in which case x∗ = ∞ by Proposition 7.28. Define the conditional
distribution function of exceedances:

$$F_t(y) \equiv \Pr\left[ X \le t + y \mid X > t \right] = \frac{F(t + y) - F(t)}{1 - F(t)}.$$

Then the approximation in (7.49) is uniform in y in the sense that there exists a positive
function β(t), so that:

$$\lim_{t \to \infty} \sup_{0 \le y < \infty} \left| F_t(y) - H_{\gamma,0,\beta(t)}(y) \right| = 0. \qquad (7.53)$$

Further, (7.53) is true with β(t) ≡ γt, so Hγ,0,β(t) (y) ≡ Hγ,0,γt (y), and thus Ft (y) is
asymptotically Pareto:

$$\Pr\left[ X \le t + y \mid X > t \right] \to 1 - \left( 1 + \frac{y}{t} \right)^{-1/\gamma} \text{ as } t \to \infty. \qquad (7.54)$$

Further, the error in this approximation converges to 0 uniformly in y ≥ 0.
If β(t) is any other function which satisfies (7.53), then as t → ∞ :

$$\frac{\beta(t)}{\gamma t} \to 1.$$

Proof. By Corollary 7.34, for any ε > 0 there is a T so that for t ≥ T and all y ≥ 0 :

$$(1 - \epsilon) \left( 1 + \frac{y}{t} \right)^{-1/(\gamma - \epsilon)} \le \frac{1 - F(t + y)}{1 - F(t)} \le (1 + \epsilon) \left( 1 + \frac{y}{t} \right)^{-1/(\gamma + \epsilon)}.$$
Hence bounds for the difference between Ft (y) and Hγ,0,β(t) (y) are:

$$(1 - \epsilon) \left( 1 + \frac{y}{t} \right)^{-1/(\gamma - \epsilon)} - \left( 1 + \frac{\gamma}{\beta(t)}\, y \right)^{-1/\gamma}$$
$$\le \frac{1 - F(t + y)}{1 - F(t)} - \left( 1 + \frac{\gamma}{\beta(t)}\, y \right)^{-1/\gamma} \qquad (1)$$
$$\le (1 + \epsilon) \left( 1 + \frac{y}{t} \right)^{-1/(\gamma + \epsilon)} - \left( 1 + \frac{\gamma}{\beta(t)}\, y \right)^{-1/\gamma}.$$

The proof of (7.53) will be completed by proving that for β(t) ≡ γt, the supremum in y
of both these bounds converges to 0 as t → x∗.
To this end we investigate the upper bound and leave the lower bound as an exercise.
With β(t) ≡ γt, the upper bound becomes:

$$M(y) \equiv (1 + \epsilon) \left( 1 + \frac{y}{t} \right)^{-1/(\gamma + \epsilon)} - \left( 1 + \frac{y}{t} \right)^{-1/\gamma}.$$

Letting w = y/t, a ≡ 1/(γ + ε) and δ ≡ ε/γ obtains for 0 ≤ w < ∞ :

$$M(wt) = (1 + \gamma \delta)(1 + w)^{-a} - (1 + w)^{-a(1 + \delta)}.$$

For any t :

$$\sup_{0 \le y < \infty} M(y) = \sup_{0 \le w < \infty} M(wt).$$

Now M(0) = γδ, M(∞) = 0, M(wt) ≥ 0, and M(wt) can be differentiated in w to reveal
that M′(w) ≤ 0 for all w, and this obtains:

$$0 \le M(w) \le \gamma \delta = \epsilon.$$

As t → ∞, the first supremum converges to 0 as proved above, while the second converges
to 0 by assumption.
Thus as t → ∞ :

$$2^{-1/\gamma} - \left( 1 + \frac{\gamma}{\beta(t)}\, t \right)^{-1/\gamma} \to 0,$$

which obtains β(t)/γt → 1.
Remark 7.50 (Hγ,0,γt (y) vs. Hγ,0,β (y)) While the Pareto distribution Hγ,0,γt (y) is the
exact asymptotic limit for the conditional distribution function Ft (y) as t → ∞, it is common
in applications to assume the more general model of the generalized Pareto distribution,
Hγ,0,β (y). Given the chosen threshold t and data set, this approach provides two parameters
to be determined by maximum likelihood or another estimation method, rather than one.
The desirability of two parameters is reinforced by the fact that the convergence to Pareto
can be very slow indeed.
To illustrate, consider the distribution function:

$$F(x) \equiv 1 - \frac{\ln x}{x}, \qquad x \ge e.$$
Recalling (1.24), F (x) is a mixed distribution function with both a continuous component
on (e, ∞) and a saltus component at x = e.
Since for all x > 0 :

$$\lim_{t \to \infty} \frac{1 - F(tx)}{1 - F(t)} = x^{-1},$$

F ∈ D(G1 ) by (7.31).
For t > e :

$$F_t(y) = 1 - \frac{t}{t + y} \cdot \frac{\ln(t + y)}{\ln t},$$

while:

$$H_{1,0,t}(y) = 1 - \frac{t}{t + y}.$$

The value of:

$$\sup_y \left[ H_{1,0,t}(y) - F_t(y) \right],$$

is found by calculus to occur at ŷ ≡ (e − 1)t. Thus:

$$\sup_y \left[ H_{1,0,t}(y) - F_t(y) \right] = \frac{1}{e \ln t},$$

which converges to zero very slowly as t → ∞. For example, to halve this supremum one
must square the threshold t.
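This slow convergence is easy to verify numerically: evaluating H_{1,0,t}(y) − F_t(y) at ŷ = (e − 1)t recovers 1/(e ln t), and squaring t (here 10⁴ → 10⁸) halves the supremum:

```python
import math

def sup_diff(t):
    # H_{1,0,t}(y) - F_t(y) for F(x) = 1 - ln(x)/x, evaluated at the
    # maximizer y_hat = (e - 1)t; the difference equals 1/(e ln t)
    y = (math.e - 1) * t
    Ft = 1 - (t / (t + y)) * math.log(t + y) / math.log(t)
    H = 1 - t / (t + y)
    return H - Ft

for t in [1e2, 1e4, 1e8]:
    print(f"t = {t:.0e}: sup = {sup_diff(t):.6f}, "
          f"1/(e ln t) = {1 / (math.e * math.log(t)):.6f}")
```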
Bibliography
I have listed below a number of textbook references for the mathematics and finance pre-
sented in this series of books. All provide both theoretical and applied materials in their
respective areas that are beyond those developed here and are worth pursuing by those
interested in gaining a greater depth or breadth of knowledge. This list is by no means
complete and is intended only as a guide to further study. In addition, various published
research papers have been identified in some chapters where these results were discussed.
The reader will no doubt observe that the mathematics references are somewhat older
than the finance references and upon web searching will find that some older texts have
been updated to newer editions, sometimes with additional authors. Since I own and use
the editions below, I decided to present these editions rather than reference the newer
editions which I have not reviewed. As many of these older texts are considered “classics,”
they are also likely to be found in university and other libraries.
That said, there are undoubtedly many very good new texts by both new and established
authors with similar titles that are also worth investigating. One that I will at the risk of
immodesty recommend for more introductory materials on mathematics, probability theory
and finance is:
[1] Reitano, Robert, R. Introduction to Quantitative Finance: A Math Tool Kit. Cam-
bridge, MA: The MIT Press, 2010.
Topology, Measure, and Integration
[2] Doob, J. L. Measure Theory. New York, NY: Springer-Verlag, 1994.
[3] Dugundji, James. Topology. Boston, MA: Allyn and Bacon, 1970.
[4] Edwards, Jr., C. H. Advanced Calculus of Several Variables. New York, NY: Academic
Press, 1973.
[5] Gemignani, M. C. Elementary Topology. Reading, MA: Addison-Wesley Publishing,
1967.
[6] Halmos, Paul R. Measure Theory. New York, NY: D. Van Nostrand, 1950.
[7] Hewitt, Edwin, and Karl Stromberg. Real and Abstract Analysis. New York, NY:
Springer-Verlag, 1965.
[8] Royden, H. L. Real Analysis, 2nd Edition. New York, NY: The MacMillan Company,
1971.
[9] Rudin, Walter. Principles of Mathematical Analysis, 3rd Edition. New York, NY:
McGraw-Hill, 1976.
[10] Rudin, Walter. Real and Complex Analysis, 2nd Edition. New York, NY: McGraw-
Hill, 1974.
[11] Shilov, G. E., and B. L. Gurevich. Integral, Measure & Derivative: A Unified Approach.
New York, NY: Dover Publications, 1977.
[12] Strang, Gilbert. Introduction to Linear Algebra, 4th Edition. Wellesley, MA: Cam-
bridge Press, 2009.
Probability Theory & Stochastic Processes
[13] Billingsley, Patrick. Probability and Measure, 3rd Edition. New York, NY: John Wiley
& Sons, 1995.
[14] Chung, K. L., and R. J. Williams. Introduction to Stochastic Integration. Boston, MA:
Birkhäuser, 1983.
[15] Davidson, James. Stochastic Limit Theory. New York, NY: Oxford University Press,
1997.
[16] de Haan, Laurens, and Ana Ferreira. Extreme Value Theory, An Introduction. New
York, NY: Springer Science, 2006.
[17] Durrett, Richard. Probability: Theory and Examples, 2nd Edition. Belmont, CA:
Wadsworth Publishing, 1996.
[18] Durrett, Richard. Stochastic Calculus, A Practical Introduction. Boca Raton, FL:
CRC Press, 1996.
[19] Feller, William. An Introduction to Probability Theory and Its Applications, Volume
I. New York, NY: John Wiley & Sons, 1968.
[20] Feller, William. An Introduction to Probability Theory and Its Applications, Volume
II, 2nd Edition. New York, NY: John Wiley & Sons, 1971.
[21] Friedman, Avner. Stochastic Differential Equations and Applications, Volume 1 and
2. New York, NY: Academic Press, 1975.
[22] Ikeda, Nobuyuki, and Shinzo Watanabe. Stochastic Differential Equations and Diffu-
sion Processes. Tokyo: Kodansha Scientific, 1981.
[23] Karatzas, Ioannis, and Steven E. Shreve. Brownian Motion and Stochastic Calculus.
New York, NY: Springer-Verlag, 1988.
[24] Kloeden, Peter E., and Eckhard Platen. Numerical Solution of Stochastic Differential
Equations. New York, NY: Springer-Verlag, 1992.
[25] Lowther, George. Almost Sure: A Maths Blog on Stochastic Calculus,
https://almostsure.wordpress.com/stochastic-calculus/
[26] Lukacs, Eugene. Characteristic Functions. New York, NY: Hafner Publishing, 1960.
[27] Nelsen, Roger B. An Introduction to Copulas, 2nd Edition. New York, NY: Springer
Science, 2006.
[28] Øksendal, Bernt. Stochastic Differential Equations, An Introduction with Applications,
5th Edition. New York, NY: Springer-Verlag, 1998.
[29] Protter, Phillip. Stochastic Integration and Differential Equations, A New Approach.
New York, NY: Springer-Verlag, 1992.
[30] Revuz, Daniel, and Marc Yor. Continuous Martingales and Brownian Motion, 3rd
Edition. New York, NY: Springer-Verlag, 1991.
[31] Rogers, L. C. G., and D. Williams. Diffusions, Markov Processes and Martingales,
Volume 1, Foundations, 2nd Edition. Cambridge, UK: Cambridge University Press,
2000.
[32] Rogers, L. C. G., and D. Williams. Diffusions, Markov Processes and Martingales,
Volume 2, Itô Calculus, 2nd Edition. Cambridge, UK: Cambridge University Press,
2000.
[33] Sato, Ken-Iti. Lévy Processes and Infinitely Divisible Distributions. Cambridge, UK:
Cambridge University Press, 1999.
[34] Schilling, René L. and Lothar Partzsch. Brownian Motion: An Introduction to Stochas-
tic Processes, 2nd Edition. Berlin/Boston: Walter de Gruyter GmbH, 2014.
[35] Schuss, Zeev, Theory and Applications of Stochastic Differential Equations. New York,
NY: John Wiley and Sons, 1980.
Finance Applications
[36] Etheridge, Alison. A Course in Financial Calculus. Cambridge, UK: Cambridge Uni-
versity Press, 2002.
[37] Embrechts, Paul, Claudia Klüppelberg, and Thomas Mikosch. Modelling Extremal
Events for Insurance and Finance. New York, NY: Springer-Verlag, 1997.
[38] Hunt, P. J., and J. E. Kennedy. Financial Derivatives in Theory and Practice, Revised
Edition. Chichester, UK: John Wiley & Sons, 2004.
[39] McLeish, Don L. Monte Carlo Simulation and Finance. New York, NY: John Wiley,
2005.
[40] McNeil, Alexander J., Rüdiger Frey, and Paul Embrechts. Quantitative Risk Manage-
ment: Concepts, Techniques, and Tools. Princeton, NJ.: Princeton University Press,
2005.
Research Papers for Book IV
[41] Bailey, R. W. “Polar generation of random variates with the t-distribution.” Mathe-
matics of Computation, 62(206), 779–781, 1994.
[42] Balkema, A., de Haan, L. “Residual life time at great age.” Annals of Probability, 2,
792–804, 1974.
[43] Box, G. E. P., Muller, Mervin E. “A Note on the Generation of Random Normal
Deviates.” Annals of Mathematical Statistics, 29(2), 610–611, 1958.
[44] Cantelli, F. P. “Sulla determinazione empirica delle leggi di probabilita.” Giornale
dell Istituto Italiano degli Attuari, 4, 221–424, 1933.
[45] Dvoretzky, A., Kiefer, J., Wolfowitz, J. “Asymptotic minimax character of the sample
distribution function and of the classical multinomial estimator.” Annals of Mathe-
matical Statistics, 27(3), 642–669, 1956.
[46] Fisher, R. A., Tippett, L. H. C. “Limiting forms of the frequency distribution of the
largest or smallest member of a sample.” Mathematical Proceedings of the Cambridge
Philosophical Society, 24, 180–190, 1928.
[47] Glivenko, V. “Sulla determinazione empirica della legge di probabilita.” Giornale dell
Istituto Italiano degli Attuari, 4, 92–99, 1933.
[48] Gnedenko, B. “Sur la distribution limite du terme maximum d’une série aléatoire.”
Annals of Mathematics, 44, 423–453, 1943.
[49] Heyde, C. C. “On a Property of the Lognormal Distribution.” In: Maller R., Basawa I.,
Hall P., Seneta E. (eds) Selected Works of C.C. Heyde. Selected Works in Probability
and Statistics. New York, NY: Springer, 2010.
[50] Hill, B. “A simple general approach to inference about the tail of a distribution.” The
Annals of Statistics, 3(5):1163–1174, 1975.
[51] Kolmogorov, A. “Sulla determinazione empirica di una legge di distribuzione.” Gior-
nale dell Istituto Italiano degli Attuari, 4, 83–91, 1933.
[52] Makarov, Mikhail. “Applications of exact extreme value theorem.” Journal of Oper-
ational Risk, 2(1), 115–120, 2007.
[53] Massart, P. “The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality.” The
Annals of Probability, 18(3), 1269–1283, 1990.
[54] Pickands, J. “Statistical inference using extreme order statistics.” Annals of Statistics,
3, 119–131, 1975.
[55] Rényi, Alfréd. “On the theory of order statistics.” Acta Mathematica Academiae
Scientiarum Hungarica, 4, 191–231, 1953.
[58] Smirnov, N. V. “On the estimation of the discrepancy between empirical curves of
distribution for two independent samples.” Moscow University Mathematics Bulletin,
2(2), 3–11, 1939.
[59] Smirnov, N. V. “Table for estimating the goodness of fit of empirical distributions.”
Annals of Mathematical Statistics, 19, 279–281, 1948.