Essentials of Probability Theory For Statisticians
Series Editors
Francesca Dominici, Harvard School of Public Health, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, Second Edition
P. J. Bickel and K. A. Doksum
Introduction to Probability
J. K. Blitzstein and J. Hwang
Second Edition
R. Caulcutt
Analysis of Variance, Design, and Regression: Linear Modeling for Unbalanced Data,
Second Edition
R. Christensen
Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians
R. Christensen, W. Johnson, A. Branscum, and T.E. Hanson
Extending the Linear Model with R: Generalized Linear, Mixed Effects and
Nonparametric Regression Models, Second Edition
J.J. Faraway
Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical
and Count Data
M. Friendly and D. Meyer
Second Edition
D. Gamerman and H.F. Lopes
Richly Parameterized Linear Models: Additive, Time Series, and Spatial Models Using
Random Effects
J.S. Hodges
Principles of Uncertainty
J.B. Kadane
Mathematical Statistics
K. Knight
Elements of Simulation
B.J.T. Morgan
Michael A. Proschan
National Institute of Allergy and Infectious Diseases, NIH
Pamela A. Shaw
University of Pennsylvania
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources.
Reasonable efforts have been made to publish reliable data and information, but the
author and publisher cannot assume responsibility for the validity of all materials or the
consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted,
reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other
means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, please
access www.copyright.com (http://www.copyright.com/) or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-
8400. CCC is a not-for-profit organization that provides licenses and registration for a
variety of users. For organizations that have been granted a photocopy license by the
CCC, a separate system of payment has been arranged.
Description: Boca Raton: Taylor & Francis, 2016. | Series: Chapman & Hall/CRC Texts in Statistical Science | A CRC title. | Includes
The book is organized as follows. The first chapter is intended as a broad introduction
to why more rigor is needed to take that next step to graduate-level probability. Chapter
1 makes use of intriguing paradoxes to motivate the need for rigor. The chapter assumes
knowledge of basic probability and statistics. We return to some of these paradoxes
later in the book. Chapter 2, on countability, contains essential background material with
which some readers may already be familiar. Depending on the background of students,
instructors may want to begin with this chapter or have students review it on their own
and refer to it when needed. The remaining chapters are in the order that we believe is
most logical. Chapters 3 and 4 contain the backbone of probability: sigma-fields,
probability measures, and random variables and vectors. Chapter 5 introduces and
contrasts Lebesgue-Stieltjes integration with the more familiar Riemann-Stieltjes
integration, while Chapter 6 covers different modes of convergence. Chapters 7 and 8
concern laws of large numbers and central limit theorems. Chapter 9 contains additional
results on convergence in distribution, including the delta method, while Chapter 10
covers the extremely important topic of conditional probability, expectation, and
distribution. Chapter 11 contains many interesting applications from our actual
experience. Other applications are interspersed throughout the book, but those in
Chapter 11 are more detailed. Many examples in Chapter 11 rely on material covered in
Chapters 9–10, and would therefore be difficult to present much earlier. The book
concludes with two appendices. Appendix A is a brief review of prerequisite material,
while Appendix B contains useful probability distributions and their properties. Each
chapter contains a chapter review of key results, and exercises are intended to
constantly reinforce important concepts.
We would like to express our extreme gratitude to Robert Taylor (Clemson University),
Jie Yang (University of Illinois at Chicago), Wlodek Byrc (University of Cincinnati),
and Radu Herbei (Ohio State University) for reviewing the book. They gave us very
helpful suggestions and additional material, and caught typos and other errors. The
hardest part of the book for us was constructing exercises, and the reviewers provided
additional problems and suggestions for those as well.
Index of Statistical Applications and
Notable Examples
In this text, we illustrate the importance of probability theory to statistics. We have
included a number of illustrative examples that use key results from probability to gain
insight on the behavior of some commonly used statistical tests, as well as examples that
consider implications for design of clinical trials. Here, we highlight our more
statistical applications and a few other notable examples.
Statistical Applications
1. Clinical Trials
2. Test statistics
- Permutation test: Examples 8.10, 8.20, 10.20, 10.45, 10.53; Sections 11.6, 11.7, 11.8
- Sample variance: Examples 4.44, 6.53, 7.6, 8.5, 9.7; Exercise: Section 7.1, #4
- Regression: Examples 7.7, 8.7; Section 11.8; Exercises: Section 6.2.3, #7; Section 7.1, #9
Introduction
This book is intended to provide a rigorous treatment of probability theory at the
graduate level. The reader is assumed to have a working knowledge of probability and
statistics at the undergraduate level. Certain things were over-simplified in more
elementary courses because you were likely not ready for probability in its full
generality. But now you are like a boxer who has built up enough experience and
confidence to face the next higher level of competition. Do not be discouraged if it
seems difficult at first. It will become easier as you learn certain techniques that will be
used repeatedly. We will highlight the most important of these techniques by writing
three stars (***) next to them and including them in summaries of key results found at the
end of each chapter.
You will learn different methods of proofs that will be useful for establishing classic
probability results, as well as more generally in your graduate career and beyond. Early
chapters build a probability foundation, after which we intersperse examples aimed at
making seemingly esoteric mathematical constructs more intuitive. Necessary elements
in definitions and conditions in theorems will become clear through these examples.
Counterexamples will be used to further clarify nuances in meaning and expose common
fallacies in logic.
At this point you may be asking yourself two questions: (1) Why is what I have learned
so far not considered rigorous? (2) Why is more rigor needed? The answers will
become clearer over time, but we hope this chapter gives you some partial answers.
Because this chapter presents an introductory survey of problems that will be dealt with
in depth in later material, it is somewhat less formal than subsequent chapters.
Flip a biased coin with probability p of heads infinitely many times. Let X1, X2, X3, ...
be the outcomes, with Xi = 0 denoting tails and Xi = 1 denoting heads on the ith flip.
Now form the random number

Y = 0.X1X2X3 ... = X1(1/2) + X2(1/2)^2 + X3(1/2)^3 + ...     (1.1)

written in base 2. The first digit
X1 determines whether Y is in the first half [0, 1/2) (corresponding to X1 = 0) or
second half [1/2, 1] (corresponding to X1 = 1). Whichever half Y is in, X2 determines
whether Y is in the first or second half of that half, etc. (see Figure 1.1).
Figure 1.1
Base 2 representation of a number Y ∈ [0, 1]. X1 determines which half, [0, 1/2) or
[1/2, 1], Y is in; X2 determines which half of that half Y is in, etc.
What is the probability mass function or density of the random quantity Y? If 0.x1x2 ...
is the base 2 representation of y, then P(Y = y) = P(X1 = x1)P(X2 = x2) ... = 0 if p ∈
(0, 1), because each of the infinitely many terms in the product is either p or (1 − p).
Because the probability of Y exactly equaling any given number y is 0, Y is not a
discrete random variable. In the special case that p = 1/2, Y is uniformly distributed
because Y is equally likely to be in the first or second half of [0, 1], then equally likely
to be in either half of that half, etc. But what distribution does Y have if p ∈ (0, 1) and p
≠ 1/2? It is by no means obvious, but we will show in Example 7.3 of Chapter 7 that for
p ≠ 1/2, the distribution of Y has no density!
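As a quick sanity check on this construction, the following sketch (ours, not from the text) simulates Y by truncating the series in Equation (1.1) at 53 flips, enough for double precision. With p = 1/2 the sample mean is near 1/2, consistent with a uniform Y; with p = 0.9 it is near E(Y) = Σ p/2^i = p = 0.9, the mass piling up near 1.

```python
import random

def sample_Y(p, n_flips=53, rng=random):
    """Simulate Y = sum_i X_i / 2^i with X_i ~ Bernoulli(p),
    truncated at n_flips terms (enough for double precision)."""
    y = 0.0
    for i in range(1, n_flips + 1):
        if rng.random() < p:
            y += 0.5 ** i
    return y

random.seed(0)
fair = [sample_Y(0.5) for _ in range(10000)]
biased = [sample_Y(0.9) for _ in range(10000)]
# With p = 1/2 the sample mean is near 1/2 (uniform case);
# with p = 0.9 it is near 0.9, since E(Y) = p.
print(sum(fair) / len(fair), sum(biased) / len(biased))
```

Of course no finite simulation can exhibit the failure of Y to have a density; that is precisely the kind of statement that needs the machinery of later chapters.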
Another way to think of the Xi in this example is that they represent treatment
assignment (Xi = 0 means placebo, Xi = 1 means treatment) for individuals in a
randomized clinical trial. Suppose that in a trial of size n, there is a planned imbalance
in that roughly twice as many patients are assigned to treatment as to placebo. If we
imagine an infinitely large clinical trial, the imbalance is so great that Y fails to have a
density because of the preponderance of ones in its base 2 representation. We can also
generate a random variable with no density by creating too much balance. Clinical trials
often randomize using permuted blocks, whereby the number of patients assigned to
treatment and placebo is forced to be balanced after every 2 patients, for example.
Denote the assignments by X1, X2, X3, ..., again with Xi = 0 and Xi = 1 denoting
placebo or treatment, respectively, for patient i. With permuted blocks of size 2, exactly
one of X1, X2 is 1, exactly one of X3, X4 is 1, etc. In this case there is so much balance
in an infinitely large clinical trial that again the random number defined by Equation
(1.1) has no density (Example 5.33 of Section 5.6).
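The effect of forced balance can also be seen numerically. The sketch below (our own, not from the book) simulates Y of Equation (1.1) under permuted blocks of size 2. The mean of Y is still 1/2, but the forced balance shrinks the variance well below the iid value of 1/12; the limiting distribution is the one shown in Example 5.33 to have no density.

```python
import random

def sample_Y_blocked(n_pairs=26, rng=random):
    """Y from Equation (1.1) when assignments come in permuted blocks of
    size 2: within each consecutive pair, exactly one X is 1."""
    y, i = 0.0, 1
    for _ in range(n_pairs):
        pair = (1, 0) if rng.random() < 0.5 else (0, 1)
        for x in pair:
            y += x * 0.5 ** i
            i += 1
    return y

def sample_Y_iid(n_flips=52, rng=random):
    """Y from Equation (1.1) with iid fair flips, for comparison."""
    return sum(0.5 ** i for i in range(1, n_flips + 1) if rng.random() < 0.5)

random.seed(0)
blocked = [sample_Y_blocked() for _ in range(20000)]
iid = [sample_Y_iid() for _ in range(20000)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Blocked randomization leaves the mean at 1/2 but cuts the variance
# well below the iid value of 1/12.
print(var(blocked), var(iid))
```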
Figure 1.2
Conditioning on Y − X = 0 when (X, Y) are iid N(0, 1).
Formulation 2 is that Nature generates an X for each patient, but by serendipity, two
people happen to have identical values of X. In other words, we observe β0 + β1X1 +
ε1 and β0 + β1X2 + ε2, and we condition on X1 = X2 = X. This seems equivalent to
Formulation 1, but it is not. Conditioning on X1 = X2 = X actually changes the
distribution of X, but exactly how? Without loss of generality, take X1 and X2 to be
independent standard normals, and consider the conditional distribution of X2 given X1
= X2. One seemingly slick way to compute it is to formulate the event {X1 = X2} as
{X2 − X1 = 0} and obtain the conditional distribution of X2 given that X2 − X1 = 0.
This is easy because the joint distribution of (X2, X2 − X1) is bivariate normal with
mean vector (0, 0), variances (1, 2), and correlation coefficient 1/√2. Using a standard
formula for the conditional distribution of two jointly normal random variables, we find
that the distribution of X2 given that X2 − X1 = 0 is normal with mean 0 and variance
1/2; its density is

f(x) = (1/√π) exp(−x^2), −∞ < x < ∞.     (1.2)
Another way to think about the event {X1 = X2} is {X2/X1 = 1}. We can obtain the
joint density g(u, v) of (U, V) = (X2, X2/X1) by computing the Jacobian of the
transformation. Setting v = 1 and renormalizing g(u, 1) yields the conditional density of
X2 given that X2/X1 = 1:

f(x) = |x| exp(−x^2), −∞ < x < ∞.     (1.3)
Expression (1.3) is similar, but not identical, to Equation (1.2). The two different
conditional distributions of X2 given X1 = X2 give different answers! Of course, there
are many other ways to express the fact that X1 = X2. This example shows that, although
we can define conditional distributions given the value of a random variable, there is no
unique way to define conditional distributions given that two continuous random
variables agree. Conditioning on events of probability 0 always requires great care, and
should be avoided when possible. Formulation 1 is preferable because it sidesteps
these difficulties.
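The clash between the two answers can be seen in a small simulation (our sketch, not from the text): approximate the two conditionings by keeping pairs with |X2 − X1| < ε versus |X2/X1 − 1| < ε. The retained X2 values have variance near 1/2 under the first conditioning and near 1 under the second, matching the two conditional densities above.

```python
import random

random.seed(0)
eps = 0.05
diff_kept, ratio_kept = [], []
for _ in range(400_000):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    if abs(x2 - x1) < eps:                   # conditioning via X2 - X1 = 0
        diff_kept.append(x2)
    if x1 != 0 and abs(x2 / x1 - 1) < eps:   # conditioning via X2/X1 = 1
        ratio_kept.append(x2)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Variance near 1/2 for the difference conditioning, near 1 for the ratio:
# two different answers to "the" conditional distribution given X1 = X2.
print(var(diff_kept), var(ratio_kept))
```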
The following is an example illustrating the care needed in formulating the experiment.
Example 1.4. The two envelopes paradox: Improperly formulating the experiment
Have you seen the commercials telling you how much money people who switch to their
auto insurance company save? Each company claims that people who switch save
money, and that is correct. The impression given is that you could save a fortune by
switching from company A to company B, and then switching back to A, then back to B,
etc. That is incorrect. The error is analogous to the reasoning in the following paradox.
Consider two envelopes, one containing twice as much money as the other. You hold
one of the envelopes, and you are trying to decide whether to exchange it for the other
one. You argue that if your envelope contains x dollars, then the other envelope is
equally likely to contain either x/2 dollars or 2x dollars. Therefore, the expected amount
of money you will have if you switch is (1/2)(x/2) + (1/2)(2x) = (5/4)x > x. Therefore,
you decide you should switch envelopes. But the same argument can be used to
conclude that you should switch again!
You might wonder what is wrong with letting X be the random amount of money in your
envelope, and saying that the amount in the other envelope is

Y = X/2 with probability 1/2, or Y = 2X with probability 1/2.     (1.4)

Actually, this is true. Untrue is the conclusion that E(Y) = (1/2)E(X/2) + (1/2)E(2X) =
(5/4)E(X) > E(X). This would be valid if the choice of either X/2 or 2X were
independent of the value of X. In that case we could condition on X = x and replace X
by x in Equation (1.4). The problem is that the choice of either X/2 or 2X depends on
the value x of X. Very small values of x make it less likely that your envelope contains
the doubled amount, whereas very large values of x make it more likely that your
envelope contains the doubled amount. To see why this invalidates the formula E(Y) =
(1/2)E(X/2)+(1/2)E(2X), imagine generating a standard normal deviate Z1 and setting

Z2 = |Z1|.     (1.5)

Note that

Z2 = −Z1 with probability 1/2 and Z2 = Z1 with probability 1/2,     (1.6)

so you might think that conditioned on Z1 = z1, Equation (1.6) holds with Z1 replaced
by z1. In that case E(Z2|Z1 = z1) = (1/2)(−z1) + (1/2)(z1) = 0 and E(Z2) = 0. But from
Equation (1.5), Z2 = |Z1| > 0 with probability 1, so E(Z2) must be strictly positive. The
error was in thinking that once we condition on Z1 = z1, Equation (1.6) holds with Z1
replaced by z1. In reality, if Z1 = z1 < 0, then the probabilities in Equation (1.6) are 1
and 0, whereas if Z1 = z1 ≥ 0, then the probabilities in Equation (1.6) are 0 and 1.
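A two-line simulation (ours, not from the text) confirms that the faulty conditioning argument fails here: the sample mean of Z2 = |Z1| is near E|Z1| = √(2/π) ≈ 0.80, not 0.

```python
import math
import random

random.seed(1)
n = 100_000
z2 = [abs(random.gauss(0, 1)) for _ in range(n)]  # Z2 = |Z1|, Equation (1.5)
mean_z2 = sum(z2) / n
# Both values are near 0.80; the naive conditioning argument predicted 0.
print(mean_z2, math.sqrt(2 / math.pi))
```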
A similar error in reasoning applies in the auto insurance setting. People who switch
from company A to company B do save hundreds of dollars, but that is because the
people who switch are the ones most dissatisfied with their rates. If X is your current
rate and you switch companies, it is probably because X is large. If you could save
hundreds by switching, irrespective of X, then you would benefit by switching back and
forth. The ads are truthful in the sense that people who switch do save money, but that
does not necessarily mean that you will save by switching; that depends on whether your
X is large or small.
Suppose you have a collection of infinitely many balls and a box with an unlimited
capacity. At 1 minute to midnight, you put 10 balls in the box and remove 1. At 1/2
minute to midnight, you put 10 more balls in the box and remove 1. At 1/4 minute to
midnight, you put 10 more balls in the box and remove 1, etc. Continue this process of
putting 10 in and removing 1 at 1/2^(n−1) minutes to midnight for each n. How many balls are
in the box at midnight?
We must first dispel one enticing but incorrect answer. Some argue that we will never
reach midnight because each time we halve the time remaining, there will always be
half left. But this same line of reasoning can be used to argue that motion is impossible:
to travel 1 meter, we must first travel 1/2 meter, leaving 1/2 meter left, then we must
travel 1/4 meter, leaving 1/4 meter left, etc. This argument, known as Zeno's paradox, is
belied by the fact that we seem to have no trouble moving! The paradox disappears
when we recognize that there is a 1-1 correspondence between distance and time; if it
takes 1 second to travel 1 meter, then it takes only half a second to travel 1/2 meter, etc.,
so the total amount of time taken is 1 + 1/2 + 1/4 + ... = 2 seconds. Assume in the puzzle
that we take, at the current time, half as long to put in and remove balls as we took at the
preceding time. Then we will indeed reach midnight.
Notice that the total number of balls put into the box is 10 + 10 + 10 + ... = ∞, and the total
number taken out is 1 + 1 + 1 + ... = ∞. Thus, the total number of balls in the box can be
thought of as ∞ − ∞. But at each time, we put in 10 times as many balls as we take out.
Therefore, it is natural to think that there will be infinitely many balls in the box at
midnight. Surprisingly, this is not necessarily the case. In fact, there is actually no one
right answer to the puzzle. To see this, imagine that the balls are all numbered 1, 2, ...,
and consider some alternative ways to conduct the experiment.
1. At 1 minute to midnight, put balls 1–10 in the box and remove ball 1. At 1/2
minute to midnight, put balls 11–20 in the box and remove ball 2. At 1/4 minute to
midnight, put balls 21–30 into the box and remove ball 3, etc. So how many balls
are left at midnight? None. If there were a ball, what number would be on it? It is
not number 1 because we removed that ball at 1 minute to midnight. It is not
number 2 because we removed that ball at 1/2 minute to midnight. It cannot be ball
number n because that ball was removed at 1/2^(n−1) minutes to midnight. Therefore,
there are 0 balls in the box at midnight under this formulation.
2. At 1 minute to midnight, put balls 1–10 in the box and remove ball 2. At 1/2 minute
to midnight, put balls 11–20 in the box and remove ball 3, etc. Now there is
exactly one ball in the box at midnight because ball number 1 is the only one that
was never removed.
3. At 1 minute to midnight, put balls 1–10 in the box and remove ball 1. At 1/2
minute to midnight, put balls 11–20 in the box and remove ball 11. At 1/4 minute to
midnight, put balls 21–30 in the box and remove ball 21, etc. Now there are
infinitely many balls in the box because balls 2–10, 12–20, 22–30, etc. were never
removed.
It is mind boggling that the answer to the puzzle depends on which numbered ball is
removed at each given time point. The puzzle demonstrates that there is no single way to
define ∞ − ∞.
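The three formulations can be checked mechanically. The sketch below (ours, not from the book) runs each scheme for finitely many steps; the balls left "at midnight" are exactly those never removed at any finite step.

```python
def survivors(removed_at, n_steps):
    """Run the puzzle for n_steps: at step k, add balls 10(k-1)+1 .. 10k,
    then remove ball removed_at(k). Return the set left in the box."""
    box = set()
    for k in range(1, n_steps + 1):
        box.update(range(10 * (k - 1) + 1, 10 * k + 1))
        box.discard(removed_at(k))
    return box

s1 = survivors(lambda k: k, 1000)                 # scheme 1: remove ball k
s2 = survivors(lambda k: k + 1, 1000)             # scheme 2: remove ball k+1
s3 = survivors(lambda k: 10 * (k - 1) + 1, 1000)  # scheme 3: remove ball 10(k-1)+1
# Scheme 1: every fixed ball m is gone once k >= m, so no ball survives forever.
# Scheme 2: ball 1 is never removed. Scheme 3: 9 balls per step survive.
print(min(s1), 1 in s2, len(s3))
```

Note what the finite runs show: under scheme 1 the smallest surviving ball number grows without bound, so no fixed ball is in the box "at midnight," even though the box always holds 9n balls after n steps.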
Examples 1.1–1.5 may seem pedantic, but there are real-world implications of
insistence on probabilistic rigor. The following is a good illustration.
For the above reasons, clinical trialists insist on pre-specifying all analyses. For
example, it is invalid to change from a t-test to a sign test after noticing that signs of
differences between treatment and control are all positive. This temptation proved too
great for your first author early in his career (see page 773 of Stewart et al., 1991).
Neither is it valid to focus exclusively on the one subgroup that shows a treatment
benefit. Adaptive methods allow changes after seeing data, but the rules for deciding
whether and how to make such changes are pre-specified. An extremely important
question is whether it is ever possible to allow changes that were not pre-specified.
Can this be done using a permutation test in a way that maintains a rigorous
probabilistic foundation? If so, then unanticipated and untoward events need not ruin a
trial. We tackle this topic in portions of Chapter 11.
We hope that the paradoxes in this chapter sparked interest and convinced you that
failure to think carefully about probability can lead to nonsensical conclusions. This is
especially true in the precarious world of conditional probability. Our goal for this
book is to provide you with the rigorous foundation needed to avoid paradoxes and
provide valid proofs. We do so by presenting classical probability theory enriched with
illustrative examples in biostatistics. These involve such topics as outlier tests,
monitoring clinical trials, and using adaptive methods to make design changes on the
basis of accumulating data.
Chapter 2
Size Matters
2.1 Cardinality
Elementary probability theory taught us that there was a difference between discrete and
continuous random variables, but exactly what does "discrete" really mean? Certainly X
is discrete if it has only a finite number of possible values, but it might be discrete even if
it has infinitely many possible values. For instance, a Poisson random variable takes
values 0, 1, 2, ... Still, the set {0, 1, 2, ...} feels very different from, say, the interval [0, 1]. This
chapter will make precise how these different types of infinite sets differ. You will learn
important counting techniques and arguments for determining whether infinite sets like
{0, 1, 2, ...} and [0, 1] are equally numerous. In the process, you will see that [0, 1] is
actually more numerous.
For any set A we can talk about the number of elements in A, called the cardinality of A
and denoted card(A). If A has only finitely many elements, then cardinality is easy. But
what is card(A) if A has infinitely many elements? Answering infinity is correct, but
only half the story. The fact is that some infinite sets have more members than other
infinite sets. Consider the infinite set A consisting of the integers. We can imagine listing
them in a systematic way, specifying which element is first, second, third, etc.:

Position:  1   2   3   4   5   6   7  ...
Element:   0   1  −1   2  −2   3  −3  ...

The top row shows the position on the list, with 1 meaning first, 2 meaning second, etc.,
and the bottom row shows the elements of A. Thus, the first element is 0, followed by 1,
−1, etc. Such a list is a 1-1 correspondence between the natural numbers (top row)
and the elements of A (bottom row). We are essentially counting the elements of A.
Even though A has infinitely many members, each one will be counted eventually. In
fact, we can specify exactly where each integer will appear on the list: integer n is the
(2n)th item if n is positive, and the (2|n| + 1)th item if n is negative or 0. This leads us to
the following definition.

Definition 2.1. A set A is countable if it is finite or if its elements can be arranged in a
list a1, a2, ... (that is, put in 1-1 correspondence with the natural numbers); in the latter
case A is countably infinite. A set that is not countable is called uncountable.
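The explicit positions can be coded directly. This sketch (our own) maps each integer to its position on the list 0, 1, −1, 2, −2, ... and back, confirming the 1-1 correspondence.

```python
def position(n):
    """Position of integer n on the list 0, 1, -1, 2, -2, ...:
    2n if n > 0, and 2|n| + 1 if n <= 0."""
    return 2 * n if n > 0 else 2 * abs(n) + 1

def integer_at(k):
    """Inverse map: the integer sitting at position k (k = 1, 2, 3, ...)."""
    return k // 2 if k % 2 == 0 else -(k // 2)

print([integer_at(k) for k in range(1, 8)])  # [0, 1, -1, 2, -2, 3, -3]
```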
An example of a set whose elements cannot be listed systematically is the interval [0,
1]. To see that [0, 1] is uncountable, suppose that you could construct an exhaustive list
x1, x2, x3, ... of its members, written as decimal expansions:
To see that this list cannot be complete, let a1 be any digit other than the first digit of x1;
for instance, if the first digit of x1 is 7, take a1 = 5. Now let a2 be any digit other than
the second digit of x2; if the second digit of x2 is 4, take a2 = 0, for example. Let a3 be
any digit other than the third digit of x3, etc. It seems that the number .a1a2a3 ... is in
[0, 1] but is not on the list. It differs from x1 in the first digit; it differs from x2 in the
second digit; it differs from x3 in the third digit, etc. It differs from each of the xn in at
least one digit. This argument applies to any attempted list of [0, 1], and therefore
appears to prove that [0, 1] is uncountable.
There is a slight flaw in the above reasoning. Some numbers in [0, 1] have two different
decimal representations. For example, 0.1 can be written as 0.1000... or as 0.0999...
Therefore, two different representations of the same number can differ from each other
in a given digit. To circumvent this problem, modify the above argument as follows. For
any listing x1, x2, ... of the numbers in [0, 1], let a1 be any digit other than 0, 9, or the
first digit of x1; let a2 be any digit other than 0, 9, or the second digit of x2, etc. Then
.a1a2 ... is a number in [0, 1] that is not equivalent to any of the x1, x2, ... (because its
digits cannot end in .000 ... or .999 ...). This method of proof is the celebrated diagonal
technique of G. Cantor, who made significant contributions to set theory in the late
1800s. We have established the following result.

Theorem 2.2. The interval [0, 1] is uncountable.
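Cantor's construction is effective: given any finite prefix of a purported list, it produces a number that provably differs from every entry. A small sketch (ours), following the modified argument that avoids the digits 0 and 9:

```python
def diagonal(xs):
    """Given decimal strings like '0.7134', return a number in [0, 1]
    whose (i+1)-th digit differs from that of xs[i], using only the
    digits 4 and 5 so it avoids 0s and 9s (dual representations)."""
    digits = []
    for i, x in enumerate(xs):
        d = x[2 + i]                       # (i+1)-th digit after the point
        digits.append('5' if d != '5' else '4')
    return '0.' + ''.join(digits)

xs = ['0.7134', '0.2468', '0.9999', '0.1415']
y = diagonal(xs)
print(y)  # '0.5554': differs from xs[i] in digit i+1, so it is not on the list
```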
We now have a prototype for each type of infinite set, countable (the integers) and
uncountable ([0, 1]). From these we can build new examples.
Proposition 2.3. The direct product of two countable sets is countable
The direct product A × B (the set of all ordered pairs (ai, bj)) of countable sets A and B
is countable.
Proof. We prove this when A = {ai} and B = {bi} are countably infinite. Form the
matrix of ordered pairs (ai, bj), as shown in Figure 2.1.
Figure 2.1
The direct product (set of ordered pairs) of two countably infinite sets.
List the elements of the matrix by traversing its diagonals, as shown in Figure 2.1 and
Table 2.1.
Table 2.1
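The diagonal traversal in the proof is easy to write down explicitly. This sketch (ours) enumerates all pairs (i, j) of natural numbers by sweeping the diagonals i + j = 2, 3, ..., so every pair is reached at some finite position:

```python
from itertools import count, islice

def pairs():
    """Enumerate N x N diagonal by diagonal: all (i, j) with i + j = s,
    for s = 2, 3, 4, ... Every pair appears exactly once."""
    for s in count(2):
        for i in range(1, s):
            yield (i, s - i)

first10 = list(islice(pairs(), 10))
print(first10)
# [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), (2, 3), (3, 2), (4, 1)]
```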
The following result is a helpful tool in building new countable and uncountable sets
from old ones. Its proof is left as an exercise.

Proposition 2.4. Any subset of a countable set is countable.

We will make use of the following result repeatedly throughout this book.

Proposition 2.5. The set of rational numbers is countable.
Proof. By Proposition 2.3, the set D = {(i, j) : i = 1, 2, ... and j = 1, 2, ...} is countable.
By definition, the rational numbers are of the form i/j for (i, j) in a subset C ⊆ D,
namely the set of (i, j) such that i and j have no common factors. By Proposition 2.4, C
is countable. Because there is a 1-1 correspondence between the rationals and C, the
rationals are also countable.
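Filtering the diagonal sweep of Proposition 2.3 down to lowest-terms pairs gives an explicit listing of the positive rationals, mirroring the proof (a sketch of ours, not the book's notation):

```python
from math import gcd
from itertools import count, islice

def positive_rationals():
    """List i/j in lowest terms by sweeping diagonals i + j = 2, 3, ...;
    this realizes the subset C of D from the proof: pairs (i, j) with
    no common factor."""
    for s in count(2):
        for i in range(1, s):
            j = s - i
            if gcd(i, j) == 1:
                yield (i, j)

first = list(islice(positive_rationals(), 8))
print(first)  # [(1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (2, 3), (3, 2)]
```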
Proposition 2.6. A countable union of countable sets is countable
If A1, A2, ... are countable sets, then their union ∪i Ai is countable.

Proof. Assume first that each Ai is countably infinite, say Ai = {ai1, ai2, ...}. Let A be
the matrix whose rows are the Ai. If the aij are distinct, then ∪i Ai consists of the
elements of A, which can be put in 1-1 correspondence with the pairs (i, j), i = 1, 2, ...,
j = 1, 2, ... This set of pairs is countable by Proposition 2.3. If the aij are not all
distinct, then ∪i Ai corresponds to a subset of the elements of A. By Proposition 2.4,
∪i Ai is countable.
Although we assumed that each Ai is countably infinite, we can extend the proof to the
case when one or more of the Ai are finite. Simply augment each such Ai by countably
infinitely many new members. The countable union of these extended sets is countable
by the proof in the preceding paragraph. The countable union of the original sets is a
subset of this countable set, so is countable by Proposition 2.4.
We have seen that there are at least two different cardinalities for infinite sets:
countable and uncountable. Are there infinite sets that are smaller than countable? Are
some uncountable sets larger than others? To answer these questions we must first
define what is meant by two infinite sets having the same cardinality.
Two sets A and B are said to have the same cardinality if there is a 1-1
correspondence between A and B.
Remark 2.8.
Two infinite sets can have the same cardinality even if one is a subset of the other, and
even if the set difference between them has infinitely many elements. For example, the
set of all integers and the set of negative integers have the same cardinality even though
their set difference contains infinitely many elements, namely the nonnegative integers.
The following result shows that no infinite set has smaller cardinality than countable
infinity.
Proposition 2.9. Countable infinity is, in a sense, the smallest cardinality an infinite
set can have
1. Every infinite set A contains a countably infinite subset.
2. Augmenting an infinite set A by a countable set B does not change its cardinality.

Proof. For the first result, select a countably infinite subset A′ = {a1, a2, ...} from A by
first selecting a1 from A, then selecting a2 from A \ {a1}, then selecting a3 from A \
{a1, a2}, etc.

For the second result, we must prove that if A can be put in 1-1 correspondence with a
set C, then so can A augmented by B = {b1, b2, ...}. Let f denote the 1-1
correspondence between A and C. By part 1, we can select a countably infinite subset A′
= {a1, a2, ...} of A. Separating A into A′ and A \ A′ and retaining the correspondence f between A
and C yields Table 2.2.
Table 2.2
Now retain the correspondence between A \ A′ and C, but modify the correspondence
between A′ and C to obtain Table 2.3. This defines a 1-1 correspondence between
Table 2.3
C and the augmentation of A with B, demonstrating that these two sets have the same
cardinality. Therefore, augmenting an infinite set with a countably infinite set does not
change its cardinality. We omit the straightforward extension for the case in which B is
finite.
We next address the question of whether some uncountable sets are larger than others by
studying the countability of various strings of 0s and 1s. Doing so also allows us to see
a connection between a countable set like the natural numbers and the uncountable set
[0, 1].
Proposition 2.10.
1. The set of strings (x1, x2, ...) of 0s and 1s such that the number of 1s is finite is
countable. Likewise, the set of strings (x1, x2, ...) of 0s and 1s such that the number
of 0s is finite is countable.
2. The set of all possible strings (x1, x2, ...) of 0s and 1s has the same cardinality as
[0, 1].
Proof. Each string (x1, x2, ...) of 0s and 1s with only finitely many 1s corresponds to the
base 2 representation 0.x1x2 ... of a unique rational number in [0, 1]. Therefore, the set
in part 1 is in 1-1 correspondence with a subset of the rationals, and is therefore
countable by Propositions 2.4 and 2.5. The second statement of part 1 follows similarly.
To prove the second part, let B be the set of all strings of 0s and 1s. It is tempting to
argue that B is in 1-1 correspondence with [0, 1] because 0.x1x2 ... xn ... is the base 2
representation of a number in [0, 1]. But some numbers in [0, 1] have two base 2
representations (e.g., 1/2 = .1000 ... or .0111 ...). If A is the set of strings of 0s and 1s
that do not end in 111..., then A is in 1-1 correspondence with [0, 1]. Moreover, B is A
augmented by the set C of strings of 0s and 1s ending in 111 ... But C is countable
because each c ∈ C corresponds to a unique finite string of 0s and 1s, ending in 0,
before the infinite string of 1s. By part 1, C is countable. Because augmenting an infinite
set A by a countable set C does not change its cardinality (Proposition 2.9), B also has
the cardinality of [0, 1].
Proposition 2.10 can also be recast as a statement about subsets of a given set, as the
following important remark shows.
Remark 2.11.
If Ω is any set, the set of subsets of Ω can be put in 1-1 correspondence with the set of
strings of 0s and 1s, where the length of the strings is card(Ω). To see this, let A ⊆ Ω. For
each x ∈ Ω, write a 1 below x if x ∈ A, and a 0 below x if x ∉ A. For instance, if Ω is
the set of natural numbers, then A = {1, 2, 4} is represented by

x:      1  2  3  4  5  6  ...
string: 1  1  0  1  0  0  ...
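For a finite piece of Ω this correspondence is one line of code. A sketch (ours): the indicator string of A = {1, 2, 4} inside {1, ..., 6} is 110100, and the map is invertible.

```python
def indicator_string(A, n):
    """Bit string for a subset A of {1, ..., n}: position i is 1 iff i is in A."""
    return ''.join('1' if i in A else '0' for i in range(1, n + 1))

def subset_of(bits):
    """Inverse map: recover the subset from its bit string."""
    return {i + 1 for i, b in enumerate(bits) if b == '1'}

s = indicator_string({1, 2, 4}, 6)
print(s, subset_of(s))  # 110100 {1, 2, 4}
```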
Remark 2.12.
Remark 2.11 and Proposition 2.10 imply that if Ω is any countably infinite set, then the
set of all subsets of Ω has the same cardinality as [0, 1], while the set of all finite
subsets of Ω is countable.
We close this chapter with an axiom that at first blush seems totally obvious. For a finite
number of nonempty sets A1, ..., An, no one would question the fact that we can select
one element, ai, from set Ai. But what if the number of sets is uncountably infinite? Can
we still select one element from each set? The Axiom of Choice asserts that the answer
is yes.

Axiom 2.13 (Axiom of Choice). Let At, where t ranges over an arbitrary index set, be a
collection of nonempty sets. From each set At, we can select one element.
Most mathematicians accept the Axiom of Choice, though a minority do not. The
following example illustrates why there might be some doubt, in the minds of some,
about whether one can always pick one member from each of uncountably many sets.
Example 2.14. How not to choose
For each t (0, 1), let At = {s (0, 1) : s t is rational}. The Axiom of Choice
asserts that we can create a set A consisting of one member of each At. Here is an
interesting fallacy in trying to construct an explicit procedure for doing this. At step 0,
make A empty. For t (0, 1), step t is to determine whether t is in any of the sets As for
s < t. If so, then we do not add t to A because we want A to include only one member of
each At. If t is not in As for any s < t, then put t in A. Exercise 10 is to show that A
constructed in this way is empty. No new members can ever be added at step t for any t ∈ (0, 1) because potential new members have presumably already been added. This amusing example is reminiscent of Yogi Berra's saying, "Nobody goes to that restaurant anymore; it's too crowded."
Exercises
1. Which has greater cardinality, the set of integers or the set of even integers? Justify
your answer.
2. Prove that the set of irrational numbers in [0, 1] is uncountable.
11. Imagine a 1 × 1 square containing lights at each pair (r, s), where r and s are both
rational numbers in the interval [0, 1].
2.2 Summary
1. The cardinality of a set is, roughly speaking, its number of elements.
2. Infinite sets can have different cardinalities. Can the set be put in 1-1
correspondence with the natural numbers?
(a) If yes, the set is countably infinite (the smallest infinite set). Prototypic
example: rational numbers.
(b) If no, the set is uncountably infinite. Prototypic example: [0, 1].
3. A countable union of countable sets is countable.
4. Two sets have the same cardinality if there is a 1-1 correspondence between them.
(a) The set S of strings x1, x2, ..., each xi = 0 or 1, corresponds to base 2
representations of numbers in [0, 1], so card(S) = card([0, 1]).
i. The set of all subsets of the natural numbers has the same cardinality as [0, 1].
ii. The set of all subsets of the natural numbers with a finite number of elements is countable.
Chapter 3
For hundreds of years, some people, including the great Sir Isaac Newton, have
believed that there are prophecies encoded in the first 5 books of the Old Testament (the
Torah) that can be revealed only by skipping letters in a regular pattern. For instance,
one might read every 50th letter or every 100th letter, but the specified spacing must be
maintained throughout the entire text. Most of the time this produces gibberish, but it
occasionally produces tantalizing words and phrases like "Bin Laden" and "twin towers" that seem to foretell historic events. Books and DVDs about the so-called
Bible code (Drosnin, 1998; The History Channel, 2003) maintain that the probability of
the observed patterns is so tiny that chance has been ruled out as an explanation. But
how does one even define a random experiment? After all, letters are not being drawn at
random, but are chosen to form words that are in turn strung together to make sensible
sentences. Where is the randomness?
A vocal skeptic (Brendan McKay) showed that similar messages could be found by
skipping letters in any book of similar length such as Moby-Dick. In effect, he
formulated the random experiment by treating the Torah as being drawn at random from
the set of books of similar length.
Examples such as the above inspire a very careful definition of the random experiment
and probability theory presented in this chapter. We use the triple (Ω, 𝒜, P) of probability theory, called the probability space, where Ω is the sample space of all possible outcomes of an experiment, 𝒜 is the set of events (subsets of Ω) we are allowed to consider, and P is a probability measure. We also consider more general measures μ, not just probability measures, in which case (Ω, 𝒜, μ) is called a measure space. You probably have some familiarity with Ω and P, but not necessarily with 𝒜 because you may not have been aware of the need to restrict the set of events you could consider. We
will see that such restriction is sometimes necessary to ensure that probability satisfies
key properties. These properties and the collection of events we may consider are
intertwined, and we find ourselves in a Catch-22. We cannot understand the need for
restricting the events we can consider without understanding the properties we want
probability measures to have, and we cannot describe these properties without
specifying the events for which we want them to hold. A more complete understanding
of the elements of probability theory may come only after reading the entire chapter.
1. All randomness stems from the selection of ω; once ω is known, so too are
the values of all random variables.
2. A single probability measure P determines the distribution of all random variables.
3. Without loss of generality, we can take the sample space to be the unit interval [0, 1] and P to be Lebesgue measure, which in layman's terms corresponds to drawing ω at random from [0, 1].
In fact, rigorous treatment of probability theory stemmed largely from one basic
question: can we define a probability measure that satisfies certain properties and
allows us to pick randomly and uniformly from [0, 1]? The surprising answer is "not without restricting the events we are allowed to consider." A great deal of thought went
into how to restrict the allowable sets.
The chapter is organized as follows. Section 3.2 studies the collection of events we
are allowed to consider. Section 3.3 shows that this collection of allowable sets
includes a very important one not usually covered in elementary probability courses,
namely that infinitely many of a collection A1, A2, ... of events occurred. Section 3.4
presents the key axioms of probability measures and the consequences of those axioms.
Section 3.5 shows that to achieve a uniform probability measure, we must restrict the
allowable sets. Section 3.6 discusses scenarios under which it is not possible to sample
uniformly, while Section 3.7 concludes with the problems of trying to perform
probability calculations after the fact.
3.2 Sigma-Fields
Here is where we begin to diverge from what you have seen in more elementary
courses. Rather than considering any event, we restrict the set of events we are allowed
to consider to a certain pre-specified collection 𝒜 of subsets of Ω. We want 𝒜 to be large enough to contain all of the interesting sets we might want to consider, but not so large that it admits pathological sets that prevent probability from having certain desirable properties (see Section 3.5). The class 𝒜 of allowable events must have
certain properties. For instance, if E is allowable, then EC should also be allowable
because we naturally want to be able to consider the event that E did not occur. We
would also like to be able to consider countable unions and intersections. It turns out
that the tersest conditions yielding the sets we would like to be able to consider are
given in the following definition.
Remark 3.3.
1. Any field is also closed under finite unions because we can apply the paired union
result repeatedly. It is also closed under finite intersections by an argument similar
to that given in item 2 below.
2. Any sigma-field is closed under countable intersections because ∩i Ei = (∪i EiC)C and each EiC belongs to 𝒜.
3. Because any field or sigma-field 𝒜 is nonempty, it contains a set E, and therefore EC. Consequently, any field or sigma-field contains Ω = E ∪ EC and ∅ = ΩC.
Example 3.4.
For any sample space Ω, the smallest and largest sigma-fields are the trivial sigma-field {∅, Ω} and the total sigma-field (also called the power sigma-field) T(Ω) = {all subsets of Ω}, respectively.
Remember that we are allowed to determine only whether or not event E occurred for E ∈ 𝒜, not for sets outside 𝒜. The trivial sigma-field is not very useful because the only events that we are allowed to consider are the entire sample space and the empty set. In other words, we can determine only whether something or nothing happened. With the total sigma-field we can determine, for each subset E ⊂ Ω, whether or not event E occurred. The trouble with T(Ω) is that it may be too large to ensure that the properties we desire for a probability measure hold for all E ∈ 𝒜 (see Section 3.5). Therefore, we want 𝒜 to be large, but not too large. We would certainly like it to contain each simple event {ω}, ω ∈ Ω.
Suppose Ω is a countable set like {0, 1, 2, ...}. We certainly want 𝒜 to contain each singleton. But any sigma-field that contains each singleton must contain all subsets of Ω. This follows from the fact that any subset E of Ω is a countable union of singletons, and sigma-fields are closed under countable unions. Therefore, when the sample space is countable, we should always use the total sigma-field.
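For a finite sample space the same closure argument can be checked by brute force; this sketch (ours; the function names are our own) closes the singletons under complements and unions and recovers the total sigma-field:

```python
from itertools import chain, combinations

# Finite-sample-space sketch (ours) of the text's argument: closing the
# singletons under complements and unions yields every subset of omega,
# i.e., the total sigma-field T(omega).
def power_set(omega):
    omega = list(omega)
    return {frozenset(s) for s in
            chain.from_iterable(combinations(omega, r)
                                for r in range(len(omega) + 1))}

def generated_by_singletons(omega):
    omega = frozenset(omega)
    sets = {frozenset({w}) for w in omega}
    while True:  # add complements and pairwise unions until nothing new appears
        new = {omega - s for s in sets} | {a | b for a in sets for b in sets}
        if new <= sets:
            return sets
        sets |= new
```

In the countable case the unions are countable rather than finite, but the conclusion is the same: the generated sigma-field is the full power set.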
Suppose that Ω is an uncountable set like R or [0, 1]. If we want 𝒜 to include each singleton {ω}, then 𝒜 must contain each countable subset E. It must also contain the complement of each countable set. Each set whose complement is countable is called co-countable. Therefore, any sigma-field containing the singletons must contain all countable and co-countable sets. It is an exercise to prove that the collection of countable and co-countable sets is itself a sigma-field. Therefore, the smallest sigma-field containing the singletons is the collection of countable and co-countable sets. Unfortunately, this sigma-field is not large enough to be useful. For instance, if Ω = [0, 1], then we are not allowed to consider events like [0, 1/2] because it is neither countable nor co-countable.
We continue our quest for the smallest sigma-field that contains all of the useful subsets of the uncountable set R. We have seen that the collection of countable and co-countable sets is not a good choice because it excludes common sets like [0, 1/2]. We certainly want to include all intervals. But remember that we do not want 𝒜 to be too large, else it might contain pathological sets that cause probability to behave strangely. Therefore, we would like to find the smallest sigma-field 𝒜 that contains all of the intervals. That is, we want 𝒜 to contain all intervals, and if 𝒜′ is any other sigma-field containing the intervals, then 𝒜 ⊂ 𝒜′. The smallest sigma-field containing a collection 𝒞, denoted σ(𝒞), is called the sigma-field generated by 𝒞. We want to find σ(intervals). Any sigma-field containing the intervals must also contain all finite unions of disjoint sets of the form (a, b], (−∞, a], or (b, ∞). Therefore, if this latter collection is a sigma-field, it must be σ(intervals). Unfortunately, it is not a sigma-field:
Example 3.8. If Ω = R, the collection 𝒜0 of finite unions of disjoint sets of the form (a, b], (−∞, a], or (b, ∞) is a field, but not a sigma-field. It generates σ(intervals).
Continuing our search for σ(intervals), consider the collection {𝒜t} of all sigma-fields containing the intervals. There is at least one, namely the total sigma-field T(R) of all subsets of the line. Let 𝒜 = ∩t 𝒜t. We claim that 𝒜 = σ(intervals). We need only show that 𝒜 is a sigma-field, because any other sigma-field containing the intervals clearly contains 𝒜.
Proof. First note that 𝒜 is nonempty because Ω ∈ 𝒜t and ∅ ∈ 𝒜t for each t, and therefore Ω and ∅ belong to 𝒜. Also, if E ∈ 𝒜, then E belongs to each 𝒜t. Thus, EC belongs to each 𝒜t, and therefore to 𝒜. Similarly, if E1, E2, ... belong to each 𝒜t, then E = ∪i Ei also belongs to each 𝒜t, so E belongs to 𝒜. It follows that 𝒜 is a sigma-field.
The smallest sigma-field containing the intervals of R, namely the intersection of all sigma-fields containing the intervals, is called the Borel sigma-field and denoted ℬ or ℬ1. The sets in ℬ are called Borel sets. The Borel sets of R that are subsets of [0, 1] are denoted ℬ[0, 1].
The Borel sigma-field contains many sets other than intervals, including, for example:
1. Countable unions of intervals, and therefore all open sets by Proposition A.15.
2. All closed sets by part 1 and the fact that complements of sets in ℬ are in ℬ.
3. The rational numbers by part 2, because each singleton {r} with r rational is a closed set (it is easy to see that its complement is open), and the set of rational numbers is a countable union of these closed sets.
4. The irrational numbers, being the complement of the Borel set of rational numbers.
It is actually difficult to construct a set that is not a Borel set. Nonetheless, we will see
later that an even larger sigma-field, called the Lebesgue sets, is also useful in
probability and measure theory.
It is sometimes helpful to consider the events {−∞} or {+∞}. These sets are not in the Borel sigma-field because they are not subsets of R = (−∞, ∞). To consider them as events, we must extend the real line by augmenting it with {−∞, +∞}.
Proof. Exercise.
The smallest sigma-field containing the set of all possible rectangles is called the 2-dimensional Borel sigma-field, and denoted ℬ2. Sets in ℬ2 are called two-dimensional
Borel sets. Figure 3.1 shows one such set.
Figure 3.1
Two-dimensional Borel sets include direct products of one-dimensional Borel sets. Shown here is B1 × B2, where B1 = (0.2, 0.4) ∪ {0.6} and B2 = {0.2} ∪ (0.4, 0.8).
ℬ2 contains all sets of the form B1 × B2, where B1 and B2 are one-dimensional Borel sets.
Step 1: For a given interval I, let 𝒞I be the collection of one-dimensional Borel sets B such that I × B ∈ ℬ2. Then 𝒞I contains all intervals because ℬ2 contains products of intervals. Also, 𝒞I is a sigma-field because ℬ2 is. For instance, 𝒞I is closed under countable unions because if B1, B2, ... ∈ 𝒞I, then I × (∪i Bi) = ∪i (I × Bi) ∈ ℬ2 because each I × Bi ∈ ℬ2. Also, the following argument shows that 𝒞I is closed under complementation. If B ∈ 𝒞I, then
I × BC = (I × R) ∩ (I × B)C. (3.1)
By assumption, I × B ∈ ℬ2, so (I × B)C ∈ ℬ2. Also, I × R ∈ ℬ2 because ℬ2 contains all products of intervals, and I × R is a product of intervals. We have shown that both members of the intersection in (3.1) are in ℬ2, so I × BC is in ℬ2 because ℬ2 is a sigma-field. This means that BC ∈ 𝒞I. We have shown that 𝒞I is a sigma-field containing all intervals, so 𝒞I contains the smallest sigma-field containing the intervals, namely ℬ1. Thus, I × B ∈ ℬ2 for each B ∈ ℬ1, which completes the proof of step 1.
Step 2: Fix B ∈ ℬ1 and define 𝒞B to be the collection of one-dimensional Borel sets C such that C × B ∈ ℬ2. By step 1, 𝒞B contains all intervals. An argument similar to that used in step 1 shows that 𝒞B is a sigma-field. Because 𝒞B is a sigma-field containing all intervals, 𝒞B contains ℬ1. This completes the proof of step 2.
The two-dimensional Borel sets form a very rich class that extends well beyond just direct products of one-dimensional Borel sets. For instance,
C = {(x, y) ∈ (0, 1) × (0, 1) : y < x}
is a two-dimensional Borel set that is not a direct product. It is not a direct product because each x ∈ (0, 1) is the first element of at least one point in C, and likewise each y ∈ (0, 1) is the second element of at least one point in C. Therefore, if C were a direct product, it would have to contain the entire square (0, 1) × (0, 1), which it does not. To see that C is still a two-dimensional Borel set, note that C may be written as ∪r (r, 1) × (0, r), where the union is over all rationals r in (0, 1) (see Figure 3.2). Because C is a countable union of rectangles, it is in ℬ2.
Figure 3.2
The union of the rectangles (r, 1) × (0, r) for r = 1/4, 1/2, and 3/4. Taking the union over all rationals r in (0, 1) fills the entire triangle C.
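The decomposition of the triangle can be checked numerically; in this sketch (ours, not the text's), any point with 0 < y < x < 1 is captured by the rectangle (r, 1) × (0, r) for a rational r strictly between y and x:

```python
from fractions import Fraction

# Sketch (ours) of the text's decomposition C = union over rationals r in
# (0, 1) of the rectangles (r, 1) x (0, r): a point with 0 < y < x < 1 lies
# in the rectangle for any rational r with y < r < x.
def rational_between(y, x):
    # The midpoint rounded to a nearby fraction; for well-separated y < x
    # this stays strictly between them.
    return Fraction((y + x) / 2).limit_denominator(10**6)

def in_rectangle(point, r):
    px, py = point
    return r < px < 1 and 0 < py < r
```

The existence of a rational strictly between y and x is exactly what makes a countable union of rectangles suffice.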
Any open set O in R2 is in ℬ2. To see this, note that each point p ∈ O is contained in a rectangle (r, R) × (s, S) ⊂ O, where r, R, s and S are rational (Figure 3.3). Each such rectangle is a two-dimensional Borel set, so O, being a countable union of two-dimensional Borel sets, is also a two-dimensional Borel set. Therefore, ℬ2 contains all open sets. It must also contain all complements of open sets, namely all closed sets. We have proven the following.
Figure 3.3
Each point P in an open set is encased in a rectangle (r, R) × (s, S) with r, R, s, and S rational.
We saw that in one dimension, the collection of finite disjoint unions of intervals of the form (a, b], (−∞, a], or (b, ∞) is a field generating ℬ1 (see Example 3.8). This was very important because a common technique is to define a probability measure on a field and then extend it to a sigma-field. The same technique is useful in higher dimensions, so we need to find a field generating ℬk.
The collection 𝒜0 of finite unions of disjoint sets of the form B1 × ... × Bk, where each Bi is a one-dimensional Borel set, is a field generating ℬk.
Proof. We established for k = 2 that products of Borel sets are in ℬk, and the proof is readily extended to higher k. Because each such product is in ℬk, finite unions of disjoint product sets are also in ℬk. The proof that 𝒜0 is a field is deferred to Section 4.5.2, where we prove a more general result about product spaces.
Even though k-dimensional Borel sets include most sets of interest, we will see later
that an even larger sigma-field, called k-dimensional Lebesgue sets, is useful in
probability and measure theory.
Exercises
(a) .
(b) .
(c) .
3. Let Ω = [0, 1], A1 = [0, 1/2), and A2 = [1/4, 3/4). Enumerate the sets in σ(A1, A2), the smallest sigma-field containing A1 and A2.
4. Let Ω = {1, 2, ...} and A = {2, 4, 6, ...}. Enumerate the sets in σ(A), the smallest sigma-field containing A. What is σ(A1, A2, ...), where Ai = {2i}, i = 1, 2, ...?
5. Let B(0, r) be the two-dimensional open ball centered at 0 with radius r; i.e., B(0, r) = {(x, y) : x2 + y2 < r2}. If Ω = B(0, 1), what is σ(B(0, 1/2), B(0, 3/4))?
6. Show that if Ω = R2, the set of all open and closed sets of Ω is not a sigma-field.
7. Give an example to show that the union of two sigma-fields need not be a sigma-field.
8. Let Ω be a countably infinite set like {0, 1, 2, ...}, and let 𝒜 be the set of finite and co-finite subsets (A is co-finite if AC is finite). Show that 𝒜 is a field, but not a sigma-field.
9. * Prove that if Ω is uncountable, then the set of countable and co-countable sets (recall that A is co-countable if AC is countable) is a sigma-field.
10. * Show that the one-dimensional Borel sigma-field is generated by sets of the form (−∞, x], x ∈ R. That is, the smallest sigma-field containing the sets (−∞, x], x ∈ R, is the Borel sigma-field. The same is true if (−∞, x] is replaced by (−∞, x).
11. * Prove that the collection in Example 3.8 is a field, but not a sigma-field.
12. Let ℬ denote the Borel sets in R, and let ℬ[0, 1] be the Borel subsets of [0, 1], defined as {B ∩ [0, 1] : B ∈ ℬ}. Prove that ℬ[0, 1] is a sigma-field of subsets of [0, 1].
13. * The countable collection {Ai} is said to be a partition of Ω if the Ai are disjoint and ∪i Ai = Ω. Prove that the sigma-field generated by {Ai} consists of all unions of the Ai. What is its cardinality? Hint: think about Proposition 2.10.
14. Suppose that {Ai} and {Bi} are partitions of Ω and {Bi} is finer than {Ai}, meaning that each Ai is a union of members of {Bi}. Then σ({Ai}) ⊂ σ({Bi}).
15. Complete the following steps to show that a sigma-field cannot be countably infinite. Suppose that 𝒜 contains the nonempty sets A1, A2, ....
(a) Show that each set of the form ∩i Bi, where each Bi is either Ai or AiC, is in 𝒜.
(b) What is wrong with the following argument: the set 𝒞 = {∩i Bi : each Bi is either Ai or AiC} is in 1:1 correspondence with the set of all infinitely long strings of 0s and 1s because we can imagine Ai as a 1 and AiC as a 0. Therefore, by Proposition 2.10, 𝒞 must have the same cardinality as [0, 1]. Hint: to see that this argument is wrong, suppose that the Ai are disjoint.
(c) Show that any two non-identical sets in 𝒞 are disjoint. What is the cardinality of the set of all countable unions of 𝒞 sets? What can you conclude about the cardinality of 𝒜?
In summary:
{An i.o.} = ∩N≥1 ∪n≥N An, which belongs to 𝒜.
The complement of {An i.o.}, namely the event that An occurs only finitely often, is also in 𝒜 because 𝒜 is closed under complementation. This event is, by DeMorgan's laws (Proposition A.4), ∪N≥1 ∩n≥N AnC. This expression can also be deduced from general principles: if ω is in only finitely many of the An, then there must be an N for which ω is not in An for any n ≥ N. That is, ω must be in AnC for all n ≥ N. This must occur for at least one N, which translates into a union over N. Thus again we arrive at the expression ∪N≥1 ∩n≥N AnC for the event that An occurs only finitely often.
A measure μ is a nonnegative function with domain the allowable sets 𝒜 such that if E1, E2, ... ∈ 𝒜 are disjoint, then μ(E1 ∪ E2 ∪ ...) = μ(E1) + μ(E2) + .... A probability measure P is a measure such that P(Ω) = 1.
Example 3.23. Let Ω be a countable set and 𝒜 = T(Ω) be the total sigma-field of all subsets of Ω. For A ∈ 𝒜, define μ(A) = the number of elements of A. Then μ is clearly nonnegative and countably additive because the number of elements in the union of disjoint sets is the sum of the numbers of elements in the individual sets. Therefore, μ is a measure, called counting measure, on (Ω, 𝒜). Counting measure is not a probability measure if Ω has more than one element because μ(Ω) > 1.
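Counting measure's additivity is easy to check directly; a minimal sketch (ours, with finite sets standing in for the countable sample space):

```python
# Sketch (ours) of counting measure on finite sets: mu(A) = number of
# elements of A, which is additive over disjoint sets.
def counting_measure(A):
    return len(A)

E1, E2, E3 = {0, 1}, {2}, {5, 6, 7}   # pairwise disjoint
union = E1 | E2 | E3
```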
Example 3.24. Let Ω = {ω1, ω2, ...} and 𝒜 be as defined in Example 3.23. Suppose that pi > 0 and Σi pi = 1. For A ∈ 𝒜, define P(A) to be the sum of the pi for which ωi ∈ A. This is well defined because the summand is nonnegative, so we get the same value irrespective of the order in which we add (Proposition A.44). Then P is nonnegative and countably additive because if Ii and I index the elements of Ei and ∪i Ei, respectively, where E1, E2, ... are disjoint, then P(∪i Ei) = Σj∈I pj = Σi (Σj∈Ii pj) = Σi P(Ei). Also, P(Ω) = Σi pi = 1. Therefore, P is a probability measure on (Ω, 𝒜). This example shows that any probability mass function (the binomial, Poisson, hypergeometric, etc.) specifies a probability measure.
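The construction in Example 3.24 can be illustrated with the Poisson mass function (our choice of the pi; any positive pi summing to 1 would do):

```python
from math import exp, factorial

# Sketch of Example 3.24 with the Poisson(2) mass function standing in for
# the p_i (our choice): P(A) is the sum of p_i over i in A.
def poisson_pmf(i, lam=2.0):
    return exp(-lam) * lam**i / factorial(i)

def P(A, lam=2.0):
    # Probability measure induced by the mass function.
    return sum(poisson_pmf(i, lam) for i in A)

E1, E2 = {0, 2}, {1, 5}   # disjoint events
```

Additivity holds term by term, and summing the pmf over a long initial segment of {0, 1, 2, ...} shows P(Ω) = 1 up to truncation error.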
Examples 3.23 and 3.24 involved a countable sample space. The most important measure on the uncountable sample space R is the following.
There is a measure μL defined on a sigma-field ℒ containing the Borel sets such that μL(C) = length(C) if C is an interval; μL is called Lebesgue measure, and the sets in ℒ are called Lebesgue sets. If we restrict attention to Lebesgue sets that are subsets of [0, 1], μL is a probability measure. We can think of the experiment as picking a number ω at random from the unit interval. We will see in Section 4.6 that Lebesgue measure on [0, 1] is the only probability measure we ever need.
Courses in real analysis spend a good deal of effort proving the existence of Lebesgue
measure. We present a very brief outline of the construction, which may be found in
Royden (1968).
Students are often puzzled that each possible outcome ω has probability 0, yet the probability of [0, 1] is 1. Every elementary event {ω} is impossible (i.e., has probability 0), yet something is sure to happen.
It turns out that the definition of probability measure leads to a myriad of consequences,
the most basic of which are the following.
If P is a probability measure:
1. P(∅) = 0.
2. If E1, ..., En ∈ 𝒜 are disjoint, then P(E1 ∪ ... ∪ En) = P(E1) + ... + P(En).
3. 0 ≤ P(E) ≤ 1 for each E ∈ 𝒜.
4. If E ∈ 𝒜, P(EC) = 1 − P(E).
5. If E1 ⊂ E2, then P(E1) ≤ P(E2).
Proof. To see part 1, note that any event E may be written as the disjoint union E ∪ ∅ ∪ ∅ ∪ .... By countable additivity, P(E) = P(E) + P(∅) + P(∅) + ..., from which it follows that P(∅) = 0.
The following result, that any countable union can be written as a countable union of disjoint sets, is extremely useful in probability theory.
Proposition 3.27. Any countable union E1 ∪ ... ∪ En (n finite or infinite) can be written as the union D1 ∪ ... ∪ Dn of disjoint sets, where D1 = E1 and Di = Ei ∩ (E1 ∪ ... ∪ Ei−1)C for i > 1.
Proof. It is clear that the Di defined above are disjoint. We will prove that each ω in ∪i Ei is in ∪i Di and vice versa. Note that each step in the following proof is valid whether n is a positive integer or +∞. If ω ∈ ∪i Ei, then ω ∈ Ei for some finite i ≤ n. Let I be the smallest i ≤ n for which ω ∈ Ei. Then by definition, ω is in EI but not in any of the previous Ei. That is, ω ∈ DI, so ω ∈ ∪i Di. Thus, if ω ∈ ∪i Ei, then ω ∈ ∪i Di. To prove the converse, note that if ω ∈ ∪i Di, then ω ∈ Di for some finite i ≤ n. But then ω ∈ Ei, and hence ω ∈ ∪i Ei. We have shown that each ω in ∪i Ei is in ∪i Di and vice versa, and therefore ∪i Ei = ∪i Di.
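The disjointification in Proposition 3.27 is easy to compute; a sketch (ours, not the text's):

```python
# Sketch (ours) of the disjointification in Proposition 3.27:
# D_1 = E_1 and D_i = E_i with everything in E_1, ..., E_{i-1} removed.
def disjointify(sets):
    disjoint, seen = [], set()
    for E in sets:
        disjoint.append(set(E) - seen)
        seen |= set(E)
    return disjoint

E = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
D = disjointify(E)
```

The resulting sets are pairwise disjoint and have the same union as the original collection, which is all the proposition asserts.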
If E1, E2, ... are any sets (not necessarily disjoint), then P(E1 ∪ E2 ∪ ...) ≤ P(E1) + P(E2) + ....
Figure 3.4
Sets decreasing to ∅.
proving that if En ↑ E, then P(En) ↑ P(E). Note that each of the above steps is valid if P is replaced by an arbitrary measure μ.
We can use the continuity property of probability in conjunction with the following very
useful technique to prove many interesting results. Suppose we know that a certain
property, call it p, holds for all sets in a field, 𝒜0, and we want to prove that it holds for all sets in the sigma-field σ(𝒜0) generated by 𝒜0. Define ℳ to be the collection of sets such that property p holds. Then by assumption, 𝒜0 ⊂ ℳ, and we want to prove that σ(𝒜0) ⊂ ℳ. The technique is to prove that ℳ is a monotone class, defined as follows.
Theorem 3.32. (Chung, 1974, page 18) Let 𝒜0 be a field, σ(𝒜0) be the smallest sigma-field containing 𝒜0, and ℳ be a monotone class containing 𝒜0. Then σ(𝒜0) ⊂ ℳ.
In summary, the technique that will be used repeatedly to show that a property, say property p, holds for all sets in a sigma-field σ(𝒜0) generated by a field 𝒜0, is as follows.
Important technique for showing a property holds for all sets in a sigma-field
1. Let ℳ be the collection of sets in σ(𝒜0) such that property p holds, and show that ℳ contains 𝒜0.
2. Show that ℳ is a monotone class, namely nonempty and closed under countable increasing or decreasing sets.
3. Use Theorem 3.32 to deduce that property p holds for each set in σ(𝒜0), thereby proving the result.
Two probability measures P1 and P2 agreeing on a field 𝒜0 also agree on the sigma-field σ(𝒜0) generated by 𝒜0.
Proof. Let ℳ denote the collection of events E such that P1(E) = P2(E), where P1 and P2 are the two probability measures. By assumption, 𝒜0 ⊂ ℳ. Now suppose that E1, E2, ... ∈ ℳ and En ↑ E or En ↓ E. By the continuity property of probability (Proposition 3.30), P1(E) = lim P1(En) = lim P2(En) = P2(E). This shows that E ∈ ℳ, so ℳ is closed under monotone limits. That is, ℳ is a monotone class containing 𝒜0. By Theorem 3.32, ℳ contains σ(𝒜0), completing the proof.
The uniqueness part follows from Proposition 3.33. The existence part is proven by defining an outer measure P* analogous to that used to show the existence of Lebesgue measure (see discussion following Example 3.25). That is, P*(A) is defined to be inf Σi P(Ai), where the infimum is over all countable collections A1, A2, ... with Ai ∈ 𝒜0 and A ⊂ ∪i Ai. Then P* restricted to the sets A such that P*(B) = P*(B ∩ A) + P*(B ∩ AC) for all sets B is the desired extension. We omit the details of the proof. See Billingsley (2012) for details.
The completion of the k-dimensional Borel sets ℬk is the Lebesgue sigma-field ℒk. The sets in ℒk are called k-dimensional Lebesgue sets.
The k-dimensional Lebesgue sets include all subsets of Borel sets of measure 0. It is an
interesting fact that the cardinality of Lebesgue sets is strictly larger than that of Borel
sets.
Exercises
5. Flip a fair coin countably infinitely many times. The outcome is an infinite string such as 0, 1, 1, 0, ..., where 0 denotes tails and 1 denotes heads on a given flip. Let Ω denote the set of all possible infinite strings. It can be shown that each ω has probability (1/2)(1/2)... (1/2)... = 0 because the outcomes for different flips are independent (the reader is assumed to have some familiarity with independence from elementary probability) and each has probability 1/2. What is wrong with the following argument? Because Ω is the collection of all possible outcomes, 1 = P(Ω) = Σω∈Ω P({ω}) = Σω∈Ω 0 = 0.
(a) Show that the probability of the event An that ball number 1 is not
removed at any of steps 1, 2, ..., n is 1/(n + 1).
(b) What is the probability of the event A that ball number 1 is never
removed from the box? Justify your answer.
(c) Show that with probability 1 the box is empty at midnight.
11. Suppose that n people all have distinct hats. Shuffle the hats and pass them back in random order. What is the probability that at least one person gets his or her hat back? Hint: apply the inclusion-exclusion formula with Ei = {person i gets his or her own hat back}. Show that the probability in question is a Taylor series approximation to 1 − exp(−1).
12. Does an arbitrary measure μ have the continuity property of Proposition 3.30 for
decreasing sets? Hint: consider counting measure (Example 3.23) and the sets En =
{n, n + 1, ...}.
13. * Let r1, r2, ... be an enumeration of the rational numbers (such an enumeration exists because the rational numbers are countable). For given ε > 0, let Ii be an interval of length ε/2i containing ri. What can you say about the Lebesgue measure of ∪i Ii? What can you conclude from this about the Lebesgue measure of the rationals?
14. * What is the Lebesgue measure of any countable set?
15. Use Proposition 3.27 to prove Proposition 3.28.
What exactly does it mean to select randomly and uniformly from the unit circle C? The corresponding probability measure P should have the property that for any subset A of C, P(A) should have the same value as for any rotation of A by an angle θ (Figure 3.5). We will show that there is no such probability measure if 𝒜 consists of all subsets of C.
Figure 3.5
Bayesian statisticians treat their uncertainty about the value of parameters by specifying
prior distributions on them. For instance, to reflect complete uncertainty about a
Bernoulli probability parameter p, a Bayesian might specify a uniform distribution on
[0, 1] (Lebesgue measure on the Borel subsets of [0, 1]). But how does one reflect
complete uncertainty about the mean of a normal distribution? There is no uniform
probability measure on the entire line; Lebesgue measure is not a probability measure
because (R) = . Use of Lebesgue measure on R conveys complete certainty that is
enormous; after all, the measure of any finite interval relative to the measure of the
entire line is 0. It is very strange to think that has essentially no chance of being in any
finite interval. Nonetheless, this so-called improper prior can be used to reproduce
classical confidence intervals and tests (see Section 2.9 of Gelman, Carlin, Stern, and
Rubin, 2004).
Another interesting scenario in which one cannot randomly and uniformly sample is
from a countably infinite set. For instance, a Bayesian specifying a uniform prior for a
Bernoulli parameter p has a problem if p is known to be rational. One cannot construct a
uniform distribution on a countably infinite set because it would lead to a contradiction
of countable additivity (see Exercise 1). Thus, there is no analog reflecting complete
uncertainty about a countably infinite set like the rationals. Interestingly, if we had
required probability measures to be only finitely additive instead of countably additive,
we could define a uniform measure on a countably infinite set (Kadane and O'Hagan,
1995).
Another consequence of being unable to pick randomly from a countably infinite set
concerns exchangeable random variables. The variables X1, ..., Xn are said to be exchangeable if (Xπ1, ..., Xπn) has the same distribution as (X1, ..., Xn) for any permutation (π1, ..., πn) of (1, 2, ..., n). An infinite set of random variables X1, X2, ... is said to be
exchangeable if every finite subset is exchangeable. Exchangeability is important for the
validity of certain tests such as permutation tests.
The following argument suggests that we can always create exchangeable random
variables by randomly permuting indices. For any finite set of random variables (X1, ...,
Xn), exchangeable or not, randomly permute the observed values x1, ..., xn. The
resulting random variables are exchangeable. For instance, if the observed values of X1
and X2 are x1 = 0 and x2 = 1, then the distribution of the random permutation (Y1, Y2) is
P{(Y1, Y2) = (0, 1)} = P{(Y1, Y2) = (1, 0)} = 1/2, (3.5)
which is exchangeable. Notice that the original distribution of (X1, X2) is arbitrary and
irrelevant. Thus, the class of n-dimensional exchangeable binary random variables is
very broad.
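The random-permutation argument can be checked by enumerating the permutations; a sketch (ours, not the text's):

```python
from itertools import permutations
from collections import Counter

# Sketch (ours) of the random-permutation construction: assigning equal
# probability to every ordering of the observed values yields an
# exchangeable distribution, whatever the original distribution was.
def permuted_distribution(values):
    perms = list(permutations(values))
    p = 1 / len(perms)
    dist = Counter()
    for perm in perms:
        dist[perm] += p
    return dict(dist)

dist = permuted_distribution((0, 1))
# As in (3.5): (0, 1) and (1, 0) each get probability 1/2.
```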
This technique of randomly permuting indices does not work for an infinite set of
random variables. In fact, Exercise 2 asserts the impossibility of randomly permuting a
countably infinite set such that each permutation has the same probability. The inability
to apply the random permutation technique severely limits the set of possible
distributions. In fact, it can be shown (De Finetti's theorem; see Heath and Sudderth (1976) for an elementary proof) that every infinite set of exchangeable binary random variables has the property that each finite subset X1, ..., Xn is conditionally iid Bernoulli (P) given some random variable P. That is, X1, ..., Xn have a mixed Bernoulli
distribution. One consequence is that a pair (X1, X2) from an infinite string of
exchangeable binary random variables cannot have the distribution (3.5).
Exercises
1. Prove the ironic fact that despite the existence of a uniform measure (Lebesgue
measure) on the uncountable set [0, 1], there is no uniform probability measure on
a countably infinite set. That is, if Ω is countably infinite, there is no probability measure P assigning equal probability to each ω ∈ Ω.
2. Argue that there is no way to randomly permute the set of natural numbers such
that each permutation has the same probability. Hint: if we could, then what would
be the distribution of the first element?
On July 26, 1976, NASA's Viking Orbiter spacecraft took a picture of a region of Mars
that appears to show a human face (Figure 3.6). Some have concocted elaborate
explanations involving ancient aliens who are trying to tell us they exist. After all, they
argue, what is the probability of encountering an image that so closely resembles a
human face? But if the image had looked like a bicycle, they would have inquired about
the probability of so closely resembling a bicycle. Likewise, if the image had appeared
on the moon, they would have asked the probability of seeing such a pattern on the
moon. Therefore, even though each possible might have probability 0, the
collection of s that would have triggered some recognized pattern may have non-
negligible probability. The real question is what is the probability of finding,
somewhere in the accessible part of the universe, some pattern that we would
recognize? That probability may very well be high. We will examine this example more
thoroughly in the next chapter. We will see that, as with the Bible code controversy, the key
is to phrase the question so that it represents a well-defined random experiment.
Figure 3.6
3.8 Summary
1. A probability space is a triple (Ω, F, P), where:
7. *** A very useful technique for proving that a property p holds for all sets in
σ(F0), where F0 is a field, is as follows.
(a) Define C to be the class of sets for which p holds, and show
that C contains F0.
(b) Prove that C is a monotone class (nonempty, closed under increasing
unions and decreasing intersections).
(c) Invoke Theorem 3.32.
Chapter 4
One thing we took for granted in a calculation such as the one above is that any event
of interest involving X corresponds to an allowable set E ∈ F; otherwise, we would not be allowed
to calculate the probability of E. This motivates the following definition.
Figure 4.1 illustrates the concept of an inverse image X⁻¹(B) for the random variable
on . The Borel set B is , and
is in (0, 1).
Figure 4.1
The random variable for . If B is the Borel set ,
then is in .
Let Ω = [0, 1], and let F = B[0, 1] be the Borel subsets of Ω. Then the following are
random variables:
1. .
2. .
3. .
After observing the random variable X, we can determine, for each F ∈ σ(X), whether
or not F occurred. In Example 4.2, σ(X1) ⊂ σ(X2). This is an
indication that X2 gives more information than X1 about which ω was actually drawn in
the experiment: by observing X1 we can tell only whether event F occurred for F = ∅ and
F = [0, 1], whereas by observing X2, we can tell whether F occurred for ∅, [0, 1], [0,
1/2), and [1/2, 1]. In other words, we can narrow the possible values of ω somewhat.
Also, σ(X2) ⊂ σ(X3), so X3 is even more informative than X2. In fact, X3 tells us exactly
which ω was drawn in the experiment, whereas X2 gives us only partial information
about ω, namely whether ω ∈ [0, 1/2) or ω ∈ [1/2, 1].
A function f : R → R is said to be Borel if f⁻¹(B) is a Borel set for each Borel set B.
The class of Borel functions is very broad. We will soon see that this class includes all
continuous functions, but it also includes other types of functions typically encountered
in practice.
Let X(ω) be a random variable and Y(ω) = f(X(ω)) for some Borel function f.
Then
1. Y is a random variable.
2. σ(Y) ⊂ σ(X).
3. If, additionally, X = g(Y) for a Borel function g, then σ(X) = σ(Y).
Proof. Exercise.
Throughout this book we will emphasize the importance of thinking about a random
variable in terms of the sigma-field it generates.
Proof. The necessity part follows from the fact that (−∞, x] is a Borel set of R. To see
the sufficiency part, assume that {X ≤ x} ∈ F for each x, and let A be the collection of
Borel sets B such that X⁻¹(B) ∈ F. It is an exercise (Exercise 4) to show that A is a sigma-field.
Therefore, A is a sigma-field containing all intervals (−∞, x], x ∈ R. This means
A contains σ{(−∞, x], x ∈ R}, the smallest σ-field containing {(−∞, x], x ∈ R}. But by
Exercise 10 of Section 3.2.3, σ{(−∞, x], x ∈ R} is the Borel sigma-field B. Therefore,
B ⊂ A. That is, X is F-measurable by Definition 4.1. We have shown that Definition 4.1
is equivalent to {X ≤ x} ∈ F for each x ∈ R.
We are now in a position to prove that continuous functions are Borel measurable.
We next consider limits, infs, sups, liminfs and limsups of a sequence of random
variables. This can lead to infinite values, and our definition of random variables
requires them to be finite. We therefore extend the definition as follows.
If Xn(ω) is a random variable for each n, then the following are extended random
variables.
1. inf Xn(ω).
2. sup Xn(ω).
3. lim inf Xn(ω).
4. lim sup Xn(ω).
5. lim Xn if it exists.
Proof.
7. Suppose that X and Y are random variables on (Ω, F, P), and suppose that F ∈ F.
Prove that
Figure 4.2
To emphasize the correspondence between the original and induced probability spaces,
we sometimes write X : (Ω, F) → (R, B). This does not mean that X(F) ∈ B for each F ∈ F; this
need not hold. Rather, it means that X⁻¹(B) ∈ F for each B ∈ B (i.e., X is F-measurable).
Figure 4.3
, and X(ω) is the number of zeros of ω in the interval [0, 2]. The
probability measure P on (Ω, F) induces a measure P on (R, B), namely the measure
assigning probability 1/2^i to the natural number 2i, i = 1, 2, ...
Exercises
That is, is the intersection of all sigma-fields containing all of the sets
. Again it is difficult to picture this sigma-field, but as we add more
random variables to a collection, the sigma-field generated by the collection enlarges.
For example, with stochastic processes, we observe data Xt over time. The information
up to and including time t is Ft = σ(Xs, s ≤ t). The longer we observe the process, the more
information we get, reflected by the fact that Fs ⊂ Ft for s ≤ t.
In one dimension we observed that σ(X), the smallest sigma-field containing X⁻¹(B)
for all B ∈ B, is just {X⁻¹(B), B ∈ B}. This follows from the fact that this latter collection is
a sigma-field (see Exercise 3 of Section 4.1.1). There is a k-dimensional analog of this
result. Recall that we defined σ(X1, ..., Xk) to be the smallest sigma-field for which each Xi
is measurable. We can also express σ(X1, ..., Xk) as follows.
Example 4.14.
Roll a die and let F = T(Ω) be the total sigma-field of all subsets of Ω = {1, 2, 3, 4, 5,
6}. Let X1(ω) be the indicator that the outcome of the random roll was an even
number and X2(ω) be the indicator that it was 3 or less. The sigma-field generated by
X1 is {∅, {2, 4, 6}, {1, 3, 5}, Ω}, while the sigma-field generated by X2 is
{∅, {1, 2, 3}, {4, 5, 6}, Ω}. The sigma-field generated by (X1, X2) is the smallest sigma-field
containing all of the sets ∅, {2, 4, 6}, {1, 3, 5}, {1, 2, 3}, {4, 5, 6}, {1, 2, 3, 4, 5,
6}. We can either directly find this sigma-field or use the fact that σ(X1, X2) = {(X1,
X2)⁻¹(B), B ∈ B²}. Consider the latter approach.
1. Suppose that B contains none of the four pairs (0, 0), (0, 1), (1, 0), or (1, 1). Then
(X1, X2)⁻¹(B) = ∅.
When we include all of these sets, we get {∅, {2}, {5}, {1, 3}, {4, 6}, {2, 5}, {1, 2, 3},
{1, 3, 5}, {2, 4, 6}, {4, 5, 6}, {1, 3, 4, 6}, {1, 2, 3, 5}, {2, 4, 5, 6}, {1, 2, 3, 4, 6}, {1,
3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}. Notice that this sigma-field contains each of σ(X1) and
σ(X2) because (X1, X2) is more informative about ω than either X1 or X2 alone.
Exercises
1. Let Ω be the set of integers and F = T(Ω) be the total sigma-field (all subsets of
Ω). Let and Y(ω) = |ω|. Find σ(X) and σ(Y). Prove that σ(X, Y) = F.
Interpret these results in terms of the amount of information about ω contained in X,
Y.
2. Let (Ω, F, P) be the unit interval equipped with the Borel sets and Lebesgue
measure. For t ∈ [0, 1], define Xt(ω) = I(t = ω). Is Xt a random variable? What is the
sigma-field generated by Xt, 0 ≤ t ≤ 1?
3. Flip two coins, so that Ω = {(H, H), (H, T), (T, H), (T, T)}, where H and T mean
heads and tails, respectively. Take F = T(Ω), the total sigma-field. Let X1 be the
number of heads and X2 be the number of tails. Show that σ(X1) = σ(X2) = σ(X1,
X2) and explain in terms of the information content of the random variables.
The random vector (X1, ..., Xk) induces a probability measure P on (R^k, B^k) through
Example 4.16.
Example 4.17.
Exercises
Remark 4.20.
The converse of Proposition 4.19 is also true. That is, any function satisfying the three
conditions of Proposition 4.19 is the distribution function of a random variable X
defined on some probability space (, , P). We will see this in Section 4.6.
The fact that F(x) is monotone means that the left- and right-hand limits F(x0−) and
F(x0+) both exist at each point x0 (why?). Thus, F(x) cannot exhibit the kind of pathological behavior
that, say, sin(1/x) does as x → 0 (Figure 4.4). There are only two possibilities: the left
and right limits either agree or disagree. If they agree, then F(x) is continuous at x0. If
they disagree, then F(x) has a jump discontinuity at x0. A d.f. can have infinitely many
jumps, but only countably many, as we will see in Proposition 4.21.
Figure 4.4
f(x) = sin(1/x) exhibits bizarre behavior with no limit as x → 0 (top panel). A d.f. F(x)
cannot exhibit such behavior; the only type of discontinuity of F(x) is a jump (bottom
panel).
We next prove that the set of discontinuities of F is countable. Because the smallest and
largest possible values of F(x) are 0 and 1, respectively, there cannot be more than n
jumps of size 1/n or more. Thus, the set of jumps of size 1/n or more is finite. The
set of all jumps is ∪n {jumps of size 1/n or more}, a countable union of finite (hence
countable) sets. By Proposition 2.6, the set of discontinuities of F is countable.
Remember from Section 4.1.2 that the probability measure P on (Ω, F) defines a probability
measure P on the Borel sets of R via P(B) = P{ω : X(ω) ∈ B}. We can calculate
probabilities using either P on F or P on B; they give the same answer. For B = (−∞, x],
P(B) = F(x). It turns out that once we know F(x), we know P(B) = P{ω : X(ω) ∈ B}
for each Borel set B, as the following result shows.
Proof. The distribution function defines P(C) for all sets C of the form (−∞, a], a ∈ R.
For C of the form (a, b], P(C) = F(b) − F(a), so P(C) is also determined by F. For C of
the form (b, ∞), P(C) = 1 − F(b) is determined by F. By countable additivity of
probability measures, P(C) is determined by F for all sets C in the field B0 of unions of
a finite number of disjoint sets of the form (a, b], (−∞, a], or (b, ∞) (see Example 3.8 of
Section 3.2). But B0 generates the Borel sigma-field, so Proposition 3.33 shows that the
probability P(B) of any Borel set B is uniquely determined from the probabilities of
sets C ∈ B0, which we have shown are determined from F. This shows that the
distribution function F(x) of X completely and uniquely determines the probability
measure P induced by X.
Let x ∈ R and y ∈ R be any real numbers such that |y − x| < δ, and without loss of
generality, take x < y. Then |F(y) − F(x)| < ε/2 < ε if x and y are both between −A and A.
On the other hand, if x and y are both less than −A, then
A similar argument works if −A ≤ x ≤ A and y > A. We need not consider the case x <
−A and y > A because we took δ < 2A. This completes the proof that F is uniformly
continuous.
Exercises
The multivariate distribution function (also called the joint distribution function) F(x1,
..., xk) of (X1, ..., Xk) is defined to be F(x1, ..., xk) = P(X1 ≤ x1, ..., Xk ≤ xk).
Example 4.25.
The Lebesgue measures of the three sets above are , 1/2, and ,
respectively. If x1 ≥ 1, then the left side of Equation (4.6) is {X2 ≤ x2}, which has
Lebesgue measure if 0 ≤ x2 < . Therefore, the distribution function for (X1,
X2) is
This example illustrates that careful bookkeeping is sometimes required to calculate a
multivariate distribution function.
In one dimension, a d.f. F(x) has only countably many discontinuities, so it is natural to
explore whether this is true for multivariate distribution functions as well. The key
question is then: under what circumstances is a multivariate d.f. continuous? We already
know that it must be continuous from above, so we could look for conditions under
which it is continuous from below; i.e., F(x1n, ..., xkn) → F(x1, ..., xk) for every sequence
such that each xin approaches xi from the left. Of
course a general function f can be continuous from above and below, but not continuous,
even if k = 2. For example:
Example 4.28. Continuous from above and below, but not continuous
The key difference between a function like f of Example 4.28 and a multivariate d.f. F is
that F is increasing in each argument. This precludes counterexamples like Example
4.28.
A multivariate d.f. is continuous at (x1, ..., xk) if and only if it is continuous from below
at (x1, ..., xk).
Proof. We know from Proposition 4.27 that F is continuous from above, so it suffices to
prove that for a multivariate d.f., continuity from above and below is equivalent to
continuity. Of course continuity implies continuity from above and below, so it suffices
to prove the other direction. Suppose that F is continuous from above and below. If F
were not continuous, then there would be an ε > 0 and a sequence xn = (x1n, ..., xkn)
converging to x such that either F(xn) ≤ F(x) − ε for infinitely many n, or F(xn) ≥ F(x) + ε for
infinitely many n. If F(xn) ≤ F(x) − ε for infinitely many n, then F(yn) ≤ F(x) − ε for infinitely
many n, where each component of yn is the minimum of the corresponding components of
xn and x (see Figure 4.5); this is because yn ≤ xn componentwise and F is increasing in each
argument. Similarly, if F(xn) ≥ F(x) + ε for infinitely many n, then F(zn) ≥ F(x) + ε
for infinitely many n, where each component of zn is the maximum of the corresponding
components of xn and x. The yn approach x from below and the zn approach x from above.
Therefore, if there is any sequence violating continuity, then
there is a sequence violating either continuity from below or continuity from above.
This proves that for a multivariate d.f., continuity from above and below implies
continuity.
Figure 4.5
For any sequence (x1n, x2n) approaching (x1, x2) such that F(x1n, x2n) ≤ F(x1, x2) − ε for
infinitely many n, we can find a sequence approaching (x1, x2) from the
southwest such that the same inequality holds for infinitely many n. Take (min(x1n, x1), min(x2n, x2)),
which projects points from other quadrants onto the southwest quadrant.
Let F be a multivariate distribution function. For a given (x1, ..., xk), let
. Then F is continuous at (x1, ..., xk) if and
only if pi = 0 for all i = 1, ..., k.
A sufficient condition for continuity of the multivariate distribution function F at (x1, ...,
xk) is that P(Xi = xi) = 0 for i = 1, ..., k.
Proposition 4.21 implies that there are only countably many x such that P(Xi = x) > 0
because each such x is a discontinuity point of the univariate d.f. of Xi. We call the line
xi = x an axis line of discontinuity of a multivariate d.f. F if F is discontinuous at (x1, ...,
xi = x, ..., xk) for one or more values of x1, ..., xi−1, xi+1, ..., xk. Let ℓi denote the
axis lines of discontinuity. Then Corollary 4.31 implies the following result (see Figure
4.6).
Figure 4.6
All of the discontinuities of a bivariate distribution function must lie on countably many
horizontal/vertical lines.
Notice that Proposition 4.32 does not say that there are only countably many points of
discontinuity of F. That is not true. For instance, let X1 be Bernoulli(1/2) and
independent of X2 (again, the reader is assumed to have some familiarity with
independence from elementary probability), which is standard normal; then
F(x1, x2) = P(X1 ≤ x1)Φ(x2), where Φ is the standard normal distribution function.
Then F is discontinuous at the uncountably many points (0, x2) for all x2 ∈ R and (1,
x2) for all x2 ∈ R, though all points of discontinuity lie on only two lines, x1 = 0 and
x1 = 1.
From a multivariate distribution function we can obtain the marginal distribution
function for each Xi making up X. For example, if X = (X1, X2) has joint distribution
function F(x1, x2), the marginal distribution function for X1 is
F1(x1) = lim F(x1, x2) as x2 → ∞. Similarly, if X = (X1, ..., Xk), the joint
distribution function for any subset of the random variables can be obtained from the joint
distribution function F(x1, ..., xk) by setting the remaining x values to A and letting A
→ ∞. For instance, the joint distribution function for (X1, X2) is the limit of F(x1, x2, A,
..., A) as A → ∞, and the marginal distribution function for X1 is the limit of F(x1, A, ..., A) as A → ∞.
Exercises
3. If F(x, y) is the distribution function for (X, Y), find expressions for:
except that Equation (4.7) does not require P(A1) > 0. Therefore, we adopt (4.7) as the
preferred definition of independence of two events A1 and A2.
We would like to extend this definition to n > 2 events. It may seem like the natural
extension of independence to n events A1, ..., An is to require P(A1 ∩ A2 ∩ ··· ∩ An) =
P(A1)P(A2) ··· P(An), but
this is not sufficient. For example, roll a die and let A1 be the event that the number is
odd, A2 be the event that the number is even, and A3 be the empty set ∅. Then
P(A1 ∩ A2 ∩ A3) = 0 = P(A1)P(A2)P(A3), yet A1 and A2 are mutually exclusive, so knowledge of
whether A1 occurred gives us complete knowledge of whether A2 occurred. Indeed,
P(A1 ∩ A2) = 0 ≠ P(A1)P(A2) = 1/4. Therefore, we need to impose additional conditions in the
definition of independence of multiple events.
It is an important fact that the product rule for computing the probability of independent
events holds for a countably infinite number of events as well.
We now extend the definition of independence from events to random variables X1, ...,
Xn. Intuitively, independence of random variables X1, ..., Xn should mean
independence of events associated with these random variables. That is,
P(Xi1 ∈ B1, ..., Xik ∈ Bk) = P(Xi1 ∈ B1) ··· P(Xik ∈ Bk) for every subcollection Xi1, ..., Xik and all
one-dimensional Borel sets B1, ..., Bk. It suffices that
P(X1 ∈ B1, ..., Xn ∈ Bn) = P(X1 ∈ B1) ··· P(Xn ∈ Bn) for all one-dimensional Borel sets because we can
always write P(Xi1 ∈ B1, ..., Xik ∈ Bk) as an intersection involving all of X1, ..., Xn by
augmenting the intersection with terms {Xj ∈ R} for the omitted indices j. For instance, if we are trying
to determine whether X1, ..., X5 are independent, we can write the event
It is helpful to have another way of checking for independence because Definition 4.36
requires consideration of all Borel sets. The following is a very useful result.
for all x1, x2, ..., xn, where F1, F2, ..., Fn are the d.f.s of X1, X2, ..., Xn.
Note that there is a more immediate proof if we were willing to accept the idea that, for
given marginal distributions F1, ..., Fn, there exist independent random variables X1, ...,
Xn with those marginal distributions. In that case, we would know that there exist
independent random variables with joint distribution function F1(x1)F2(x2) ...Fn(xn),
and because the joint distribution function completely determines P(X1 ∈ B1, ..., Xn ∈ Bn) for
any Borel sets B1, ..., Bn (Proposition 4.26), this would prove the result. However, we
do not establish the existence of independent random variables with given distribution
functions until Section 4.6.
Example 4.38. Independent base 2 digits from randomly picking a number in [0, 1]
Let Ω be [0, 1], and let X1, X2, ... be the digits in the base 2 representation of
ω. That is, ω = X1(ω)/2 + X2(ω)/2² + ···, so X1(ω) tells whether ω is in the left or right half
of [0, 1], X2(ω) tells whether ω is in the left or right half of that half, etc. The event
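As a numerical sanity check (not part of the text), the following sketch extracts the first base 2 digits of uniformly drawn numbers and confirms empirically that each digit behaves like a Bernoulli(1/2) variable and that pairs of digits behave independently:

```python
import random

random.seed(2)

def base2_digits(omega, k):
    """Return the first k base 2 digits X1(omega), ..., Xk(omega) of omega in [0, 1)."""
    digits = []
    for _ in range(k):
        omega *= 2
        d = int(omega)   # 0 if omega was in the left half, 1 if in the right half
        digits.append(d)
        omega -= d
    return digits

# Empirical check that the digits behave like iid Bernoulli(1/2).
n = 100_000
samples = [base2_digits(random.random(), 3) for _ in range(n)]
p1 = sum(s[0] for s in samples) / n            # should be near 1/2
p11 = sum(s[0] * s[1] for s in samples) / n    # should be near (1/2)(1/2) = 1/4
```

The frequency of X1 = 1 should be near 1/2, and the frequency of X1 = X2 = 1 near 1/4, consistent with independent Bernoulli(1/2) digits.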
1. If X1, ..., Xn are independent and f1, ..., fn are Borel functions, then f1(X1), ...,
fn(Xn) are independent.
2. If Xt, t ∈ T, is a countably or uncountably infinite collection of independent random
variables and ft, t ∈ T, are Borel functions, then ft(Xt), t ∈ T, are independent
random variables.
Proof. For item 1, let B1, ..., Bn be arbitrary Borel sets. Then
{f1(X1) ∈ B1, ..., fn(Xn) ∈ Bn} = {X1 ∈ f1⁻¹(B1), ..., Xn ∈ fn⁻¹(Bn)}, where each fi⁻¹(Bi) is a one-dimensional Borel set
because fi is a Borel function. It follows from this and independence of the Xi that
P(f1(X1) ∈ B1, ..., fn(Xn) ∈ Bn) = P(f1(X1) ∈ B1) ··· P(fn(Xn) ∈ Bn). Therefore, f1(X1), ..., fn(Xn) are
independent random variables.
Item 2 follows from the fact that every finite subcollection is independent by the proof
of item 1.
We next proceed to the concept of independence of random vectors X1, ..., Xn with
possibly different dimensions k1, ..., kn. The natural extension of Definition 4.36 is the
following.
1. A finite collection of random vectors X1, ..., Xn of dimensions k1, ..., kn is defined
to be independent if P(X1 ∈ B1, ..., Xn ∈ Bn) = P(X1 ∈ B1) ··· P(Xn ∈ Bn) for arbitrary Borel sets B1,
..., Bn of respective dimensions k1, ..., kn.
2. A countably or uncountably infinite collection of random vectors of finite
dimensions is defined to be independent if each finite subcollection of vectors
satisfies condition 1.
Note that Definition 4.41 does not imply independence within the components of Xi,
only between collections. For instance, suppose that a generic collection Xi consists of
values of several covariates measured on person i; covariates on the same person are
dependent, but the collections Xi are independent of each other because they are
measured on different people.
Some results for independence of random variables have analogs for independence of
random vectors. The following are two examples.
X1, ..., Xn of respective dimensions k1, ..., kn are independent by Definition 4.41 if and
only if
for all x1, x2, ..., xn, where F1, F2, ..., Fn are the multivariate distribution functions of
X1, X2,..., Xn.
If X1, ..., Xn are independent random vectors of respective dimensions k1, ..., kn and
fi are Borel functions (i.e., B^ki-measurable), i = 1, ..., n, then f1(X1), ..., fn(Xn) are
independent random variables.
Example 4.44. Independence of sample mean and variance, and of fitted and residual
Suppose that (Y1, ..., Yn) are independent N(μ, σ²) random variables, and let
R = (Y1 − Ȳ, ..., Yn − Ȳ) be the residual vector. It is known that Ȳ and R are independent (more
generally, the residual vector in regression is independent of the fitted vector). It
follows from Proposition 4.43 that f(R) is independent of Ȳ for any function f
that is B^n-measurable. Taking f(R) = Σ(Yi − Ȳ)²/(n − 1) shows that the sample variance
of a normal sample is independent of the sample mean Ȳ. Interestingly,
the normal distribution is the only one with this property.
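This independence can be seen numerically. The following simulation sketch (illustrative; the sample size and replication count are arbitrary choices) estimates the correlation between the sample mean and sample variance, which must be near 0 for normal samples; for a skewed distribution such as the exponential, the two are visibly correlated. Zero correlation is of course only a necessary consequence of independence, not a proof of it.

```python
import random

random.seed(3)

def corr(a, b):
    """Sample correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

def mean_and_var(sample):
    """Sample mean and sample variance (divisor n - 1)."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((y - m) ** 2 for y in sample) / (n - 1)
    return m, s2

reps, n = 50_000, 5

# Normal samples: mean and variance are independent, so their
# correlation should be near 0.
norm = [mean_and_var([random.gauss(0, 1) for _ in range(n)]) for _ in range(reps)]
corr_normal = corr([m for m, _ in norm], [v for _, v in norm])

# Exponential samples (skewed): mean and variance are dependent and,
# in fact, noticeably positively correlated.
expo = [mean_and_var([random.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]
corr_exp = corr([m for m, _ in expo], [v for _, v in expo])
```

The normal correlation estimate should be indistinguishable from 0 up to simulation error, while the exponential one is substantially positive.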
Let X1, ..., Xn be any random variables, and let I1, ..., Ik be sets of indices, i.e., subsets
of {1, 2, ..., n}. Let Xi consist of the components whose indices correspond to Ii. For
instance, with n = 4, k = 2, I1 = {1, 2, 3} and I2 = {3, 4}, then X1 = (X1, X2, X3) and
X2 = (X3, X4). In this case X1 and X2 have overlapping components because X3 is in
both X1 and X2. If I1, ..., Ik had been disjoint, we would say that X1, ..., Xk are non-
overlapping collections of (X1, ..., Xn).
Proposition 4.47.
If F1, ..., Fn are independent fields, then σ(F1), ..., σ(Fn) are independent sigma-fields.
Proof when n = 2.
Let F ∈ F1 and define C to be the collection of sets G ∈ σ(F2) such that F and G are
independent. Then C contains F2 by assumption. Now suppose that G1, G2, ... are C-sets
that increase or decrease to G. Then F ∩ Gn increase or decrease to F ∩ G. The
continuity property of probability (Proposition 3.30) implies that
P(F ∩ G) = lim P(F ∩ Gn) = lim P(F)P(Gn) = P(F)P(G). Therefore, F and G are independent, so G
∈ C. This shows that C is a monotone class. By the monotone class theorem, C contains
σ(F2). Therefore, every set in σ(F2) is independent of every set in F1. Now let G ∈
σ(F2) and define D to be the collection of F ∈ σ(F1) such that F and G are
independent. By what we just proved, D contains F1. The continuity property of
probability (Proposition 3.30) implies that D is a monotone class. By the monotone
class theorem, D contains σ(F1). Therefore, each set in σ(F1) is independent of each set
in σ(F2), completing the proof when n = 2.
Remark 4.50.
Recall Example 3.6 purporting to show a human face on Mars, spurring some to
conclude that ancient astronauts must be responsible. We argued that the probability of
seeing some recognizable pattern somewhere in the universe could be quite high. The
actual process by which patterns on planets are formed is very complicated, but imagine
the over-simplified scenario of constructing a pattern completely at random using the
lights of Exercise 11 of Section 2.1. Recall that lights are placed at positions (r, s) for
all rational r and s, and each light may be turned on or off.
Proponents of the ancient astronaut theory might argue that even a very simple pattern
such as a specific line segment has probability 0 because it requires a specific infinite
set of lights to be turned on. If X1, X2, ... are the iid Bernoulli (1/2) indicators of those
specific lights being on, then P(X1 = 1, X2 = 1, ...) = (1/2)(1/2)... = 0 by Proposition
4.35. This same argument can be applied to any pattern requiring a specific infinite set
of lights to be turned on. For instance, a circle CR of radius R must have probability 0.
Even the set {CR} of circles of all rational radii is countable. Moreover, the number N
of distinct recognizable patterns (i.e., the set of objects on earth), though very large, is
finite. Even if we allow different sizes through all rational scale factors, the set of
recognizable patterns is countable, so by countable additivity the probability of seeing
some recognizable pattern would be 0. Ancient astronaut theorists might conclude from
this that a chance appearance of such a recognizable pattern is impossible.
One problem with the above reasoning is that the human eye will perceive a line
segment even if not all of the lights are turned on. For instance, if enough equally spaced
lights along a line segment are turned on, the human eye will perceive the entire line
segment (see Figure 4.7). Therefore, because the probability of any finite set of lights
being turned on is nonzero, the probability of perceiving a given line segment is
nonzero. Similarly, the probability of stringing together multiple apparent line segments
is also nonzero, though it may be tiny. A sufficiently long string of apparent line
segments can mimic any pattern. For instance, Figure 4.8 shows 8, 16, and 64 line
segments strung together to give the appearance of a circle. Therefore, in any specific
square of a given size, the probability of seeing what appears to be a human face is e >
0, though it may be tiny. Now consider a countably infinite set of non-overlapping
squares. Under the random lighting scenario, the patterns observed in these different
squares are independent. The probability of seeing what appears to be a human face on
at least one of these squares is 1 - (1 - e)(1 - e)... = 1-0= 1. Non-believers in the ancient
astronaut theory can argue that not only is seeing such a pattern not unusual, it is
guaranteed!
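The arithmetic of the last step can be made concrete. The following sketch (with an arbitrary illustrative value ε = 10⁻⁶, not one from the text) computes 1 − (1 − ε)^n for large n, showing how quickly the probability of at least one apparent face approaches 1 as the number of independent squares grows:

```python
# If each square independently shows an apparent face with tiny probability
# eps, the chance that at least one of n squares shows one is 1 - (1 - eps)^n,
# which tends to 1 as n grows. (eps = 1e-6 is an arbitrary illustrative value.)
eps = 1e-6

def prob_at_least_one(n):
    return 1 - (1 - eps) ** n

p_million = prob_at_least_one(10**6)        # already substantial (about 1 - 1/e)
p_ten_million = prob_at_least_one(10**7)    # nearly certain
```

With a million squares the probability is already about 1 − 1/e ≈ 0.63, and with ten million it exceeds 0.9999.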
Figure 4.7
From left to right: 11, 101, and 5001 lights turned on along the line segment y = 2x, x
∈ [0, 1].
Figure 4.8
From left to right: 8, 16, and 64 consecutive line segments joined together to give the
appearance of a circle.
Expression (4.13) consists of the set of ω in at least one of Di1, Di2, and Di3 for each i.
Consider the matrix whose ith row consists of Di1, Di2, Di3, and imagine each possible
way to pick one of the three Ds from each of the k rows. Let ji be the index (1, 2, or 3)
of the D selected in row i. For instance, the boldfaced Ds below correspond to the
choice j1 = 2, j2 = 1, ..., jk = 3.
There are 3^k different ways to pick j1, ..., jk. For each, form the intersection
D1j1 ∩ D2j2 ∩ ··· ∩ Dkjk. Expression (4.13) consists of all possible unions of these intersections,
where each ji ranges over the set {1, 2, 3}. Because the three sets Di1, Di2, Di3 are
disjoint, two of these intersections are disjoint unless they correspond to the same choice
j1, ..., jk. Moreover, each intersection is clearly a product set because (C11 × ··· × Ck1) ∩
(C12 × ··· × Ck2) = (C11 ∩ C12) × ··· × (Ck1 ∩ Ck2) for any sets Ci1, Ci2, i = 1, ..., k.
Therefore, E^c is a union of a finite number of disjoint product sets. That is, F0 is
closed under complements.
We show next that F0 is closed under intersections of pairs, which will establish that
F0 is a field because closure under complements and paired intersections also implies
closure under paired unions. Note that
is the union of a finite number of disjoint product sets. Thus, F0 is closed under
intersection of pairs, completing the proof that F0 is a field.
We can use a similar construction to create n independent random variables. In the next
section, we show how to embed a product structure on Ω = [0, 1] with P being
Lebesgue measure. This will allow us to create infinitely many independent random
variables with given marginal distributions.
Exercises
Show that X and Y² are independent, but X and Y are not independent. What is
wrong with the following argument: since X and Y² are independent, X and Y = (Y²)^(1/2)
are independent by Proposition 4.39?
3. Prove that if A1, ..., An are independent events and each Bi is either Ai or its complement, then
B1, ..., Bn are also independent.
4. Prove that if X1 and X2 are random variables each taking values 0 or 1, then X1
and X2 are independent if and only if E(X1X2) = E(X1)E(X2). That is, two
binary random variables are independent if and only if they are uncorrelated.
5. Prove that if A1, A2, ... is a countably infinite sequence of independent events, then
P(A1 ∩ A2 ∩ ···) = P(A1)P(A2) ···.
6. Let Ω = {ω1, ω2, ω3, ω4}, where ω1 = (−1, −1), ω2 = (−1, +1), ω3 = (+1, −1), ω4
= (+1, +1). Let X(ω) be the indicator that the first component of ω is +1, and Y(ω)
be the indicator that the second component of ω is +1. Find a set of probabilities
p1, p2, p3, p4 for ω1, ω2, ω3, ω4 such that X and Y are independent. Find another
set of probabilities such that X and Y are not independent.
7. Let X be a Bernoulli random variable with parameter p, 0 < p < 1. What are
necessary and sufficient conditions for a Borel function f(X) to be independent of
X?
8. Suppose that X is a random variable taking only 10 possible values, all distinct.
The sets on which X takes those values are F1, ..., F10. You must determine
whether X is independent of another random variable, Y. Does the determination
of whether they are independent depend on the set of possible values {x1, ..., x10}
of X? Explain.
9. Flip a fair coin 3 times, and let Xi be the indicator that flip i is heads, i = 1, 2, 3,
and X4 be the indicator that the number of heads is even. Prove that each pair of
random variables is independent, as is each trio, but X1, X2, X3, X4 are not
independent.
10. Prove that a random variable X is independent of itself if and only if P(X = c) = 1
for some constant c.
11. Prove that a sigma-field is independent of itself if and only if each of its sets has
probability 0 or 1.
12. Prove that if X is a random variable and f is a Borel function such that X and f(X)
are independent, then there is some constant c such that P(f(X) = c) = 1.
13. Let , and for t ∈ [0, 1], let Xt = I(ω = t). Are {Xt, t ∈ [0, 1]}
independent?
14. Prove that the collection A1 in step 1 of the proof of Proposition 4.37 contains the
field B0 of Example 3.8.
15. It can be shown that if (Y1, ..., Yn) have a multivariate normal distribution with
, i = 1, ..., n, then any two subcollections of the random variables are
independent if and only if each correlation of a member of the first subcollection
and a member of the second subcollection is 0. Use this fact to prove that if Y1, ...,
Yn are iid normals, then Ȳ is independent of (Y1 − Ȳ, ..., Yn − Ȳ). What can you conclude
from this about the sample mean and sample variance of iid normals?
16. Show that if the Yi are iid from any non-degenerate distribution F (i.e., Yi is not a
constant), then the residuals Yi − Ȳ cannot be independent.
17. Prove that the random variables X1, X2, ... are independent by Definition 4.36 if
and only if the sigma-fields σ(X1), σ(X2), ... are independent by Definition 4.48.
This argument breaks down if F is not continuous or strictly increasing. For instance, let
F(x) = 0 for x < 1/2 and 1 for x ≥ 1/2. Then F⁻¹(u) is not even defined for 0 < u < 1.
Let F(x) be any distribution function and 0 < u < 1. Then the inverse probability
transformation F⁻¹(u) is defined to be inf{x : F(x) ≥ u} (Figure 4.9).
Figure 4.9
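The generalized inverse can be computed numerically for any nondecreasing, right-continuous F. The following sketch (a bisection search; the bracketing interval [−10⁶, 10⁶] is an arbitrary assumption) locates inf{x : F(x) ≥ u} and illustrates the degenerate d.f. above, for which F⁻¹(u) = 1/2 for every 0 < u < 1:

```python
def generalized_inverse(F, u, lo=-1e6, hi=1e6, tol=1e-9):
    """inf{x : F(x) >= u} for a nondecreasing, right-continuous F,
    located by bisection over [lo, hi] (assumes the infimum lies inside)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) >= u:
            hi = mid   # mid is in {x : F(x) >= u}, so the infimum is <= mid
        else:
            lo = mid   # F(mid) < u, so the infimum is > mid
    return hi

# The degenerate d.f. from the text: F(x) = 0 for x < 1/2 and 1 for x >= 1/2.
# The ordinary inverse is undefined for 0 < u < 1, but the generalized
# inverse equals 1/2 for every such u.
F_step = lambda x: 0.0 if x < 0.5 else 1.0
x = generalized_inverse(F_step, 0.3)
```

Bisection works here precisely because F is nondecreasing: the set {x : F(x) ≥ u} is an interval extending to +∞, so its left endpoint can be bracketed.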
Proposition 4.54.
F⁻¹(u) ≤ t if and only if u ≤ F(t).
Inequality (4.18) follows from the fact that F⁻¹(u) is a lower bound on the set
Au = {x : F(x) ≥ u}. In other words, all x such that F(x) ≥ u are at least as large as F⁻¹(u); in
particular, if u ≤ F(t), then t ∈ Au, so t ≥ F⁻¹(u). This concludes the proof of inequality (4.18).
To see inequality (4.19), assume first that t is strictly larger than F⁻¹(u). Then t is not a
lower bound on Au (because F⁻¹(u) is the greatest lower bound on Au), so there exists
a point xn in Au with xn < t. Therefore, F(xn) ≤ F(t). Because xn ∈ Au, F(xn) ≥ u. We
have shown that u ≤ F(xn) ≤ F(t),
proving that F(t) ≥ u for t strictly larger than F⁻¹(u). To see that the inequality also
holds if t = F⁻¹(u), note the following.
1. By what we have just proven, F(t + 1/n) ≥ u because t + 1/n is strictly larger than F⁻¹(u).
Take the limit of both sides of this inequality to see that u ≤ lim F(t + 1/n).
2. lim F(t + 1/n) = F(t) by right-continuity of F. Thus, u ≤ F(t).
This completes the proof of inequality (4.19), and therefore, of Proposition 4.54.
Let Ω = (0, 1), F = B(0, 1), and P be Lebesgue measure. Then F⁻¹(ω) is a random
variable with distribution function F(x).
Proof. That F⁻¹(ω) is a random variable follows from the monotonicity of F⁻¹, the fact
that monotone functions are Borel measurable, and Proposition 4.5. By Proposition
4.54, the event F⁻¹(ω) ≤ t is equivalent to ω ≤ F(t), so P{ω : F⁻¹(ω) ≤ t} = P{ω ∈ (0,
F(t)]} = F(t) because the Lebesgue measure of an interval is its length.
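This result is the basis of inverse-CDF sampling. As an illustration (using the exponential distribution, where F⁻¹ has a closed form; the sample size is an arbitrary choice), the following sketch draws F⁻¹(U) for uniform U and checks two consequences of the conclusion:

```python
import math
import random

random.seed(6)

# Inverse probability transform for the exponential(1) distribution:
# F(x) = 1 - exp(-x) for x >= 0, so F^{-1}(u) = -log(1 - u).
def exp_inverse_cdf(u):
    return -math.log(1.0 - u)

n = 100_000
draws = [exp_inverse_cdf(random.random()) for _ in range(n)]

# Two checks that the draws follow F: the mean of exponential(1) is 1,
# and about half the draws should fall below the median log 2.
mean = sum(draws) / n
frac_below_median = sum(d < math.log(2) for d in draws) / n
```

The empirical mean should be near 1 and the fraction below log 2 near 1/2, as the exponential(1) distribution requires.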
Here is a visual way to understand the fact that Y = F⁻¹(ω) ~ F(y). Suppose that F is a
distribution function with density function f with respect to Lebesgue measure. Imagine a
device called a quincunx with balls rolling down a board, hitting nails, and bouncing
either to the left or right with equal probability. We use quincunxes in Chapter 8 to help
understand the central limit theorem. Suppose that the first row has a single nail at
horizontal position corresponding to the median of F, namely F⁻¹(1/2). This is shown in
Figure 4.10, where F is the lognormal distribution function. The ball bounces to either
the first quartile (F⁻¹(1/4)) or third quartile (F⁻¹(3/4)) in the second row, with equal
probability. If it hits the nail at F⁻¹(1/4) in the second row, it is equally likely to bounce
to either F⁻¹(1/8) or F⁻¹(3/8) in the third row. Likewise, if it hits the nail at F⁻¹(3/4)
in the second row, it is equally likely to bounce to either F⁻¹(5/8) or F⁻¹(7/8) in the
third row. The dotted lines in Figure 4.10 show the possible paths of a ball rolling
down a quincunx with 4 rows. Approximately 4 out of every 8 balls fall into the
leftmost of the equal-width bins at the bottom because 4 of the 8 equally likely paths
lead to that bin. The shape of the empirical histogram formed by the balls mimics the
shape of the density function f. Indeed, as we increase the number of rows and equal-
width bins at the bottom, the shape of the empirical density function becomes
indistinguishable from f.
Figure 4.10
Quincunx depiction of Y = F⁻¹(U), where U = 0.ω₁ω₂ ... and the ωᵢ are iid Bernoulli
(1/2). The deflection in row i is left or right when ωᵢ is 0 or 1, respectively.
The above quincunx is simply a visual corroboration of the fact that F⁻¹(ω) has
distribution function F. After all, if ω has base 2 representation 0.ω₁ω₂ ... = ω₁/2 + ω₂/2²
+ ..., then ω < 1/2 corresponds to ω₁ = 0 and ω > 1/2 corresponds to ω₁ = 1.
Therefore, left or right deflections in the first row correspond to ω₁ = 0 or 1,
respectively. Likewise, left or right deflections in the second row correspond to ω₂ = 0
or 1, respectively, etc. The randomly selected ω dictates, through its base 2
representation, the full set of left or right deflections in the rows of the quincunx. Note
that we ignored the possibility that the base 2 representation of ω terminates because
this event has probability 0.
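The quincunx mechanism can be simulated directly: draw the binary digits ω₁, ω₂, ... as fair coin flips, track the dyadic bin containing ω after m rows, and record F⁻¹ evaluated at the bin's midpoint. A sketch follows; the choice of the standard lognormal for F, with F⁻¹(u) = exp(Φ⁻¹(u)), matches Figure 4.10 but is otherwise arbitrary, and 10 rows is an arbitrary depth.

```python
import math
import random
from statistics import NormalDist, median

random.seed(2)
PHI_INV = NormalDist().inv_cdf  # standard normal quantile function

def lognormal_quantile(u):
    # F^{-1}(u) = exp(Phi^{-1}(u)) for the standard lognormal distribution
    return math.exp(PHI_INV(u))

rows = 10
balls = 20_000
positions = []
for _ in range(balls):
    u = 0.0
    for i in range(1, rows + 1):
        b = random.getrandbits(1)  # row-i deflection: 0 = left, 1 = right
        u += b / 2 ** i
    u += 1 / 2 ** (rows + 1)       # midpoint of the final dyadic bin
    positions.append(lognormal_quantile(u))

# A histogram of `positions` mimics the lognormal density; in particular
# the sample median is near F^{-1}(1/2) = exp(0) = 1.
med = median(positions)
```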
Can we extend this argument to obtain a countably infinite collection (X₁, X₂, ...) of
independent random variables defined on Ω = (0, 1) with respective distribution
functions F₁, F₂, ...? To create two independent random variables, we used two disjoint
subsets of indices, namely the even and odd numbers. To generate an infinite number of
independent random variables, we must define infinitely many disjoint subcollections
I₁, I₂, ..., each with a countably infinite number of indices. One way to do this is to let
I₁ = {2¹, 2², 2³, ...}, I₂ = {3¹, 3², 3³, ...}, I₃ = {5¹, 5², 5³, ...}, etc., so that
Iₙ = {pₙ¹, pₙ², pₙ³, ...}, where pₙ is the nth prime. We claim that I₁, I₂, I₃, ... are disjoint
collections. If not, then there would be an integer k such that k ∈ Iᵢ ∩ Iⱼ, where i ≠ j. But
this would violate the fact that every positive integer can be written uniquely as a
product of primes. Define Uₙ = Σₘ ω_{pₙᵐ}/2ᵐ. Then U₁, U₂, ... are independent by
Proposition 4.47, and each is uniformly distributed, so Xₙ = Fₙ⁻¹(Uₙ) are independent
with respective distributions F₁, F₂, ...
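The construction can be carried out numerically. The sketch below draws the binary digits of a single ω lazily (only the digits at positions pₙᵐ are ever needed), builds U₁, U₂, U₃ from the disjoint digit positions {2ᵐ}, {3ᵐ}, {5ᵐ}, and checks that the resulting uniforms are (empirically) uncorrelated. Truncating each Uₙ at 14 digits is an arbitrary choice.

```python
import random

random.seed(3)

def draw_uniforms(primes=(2, 3, 5), depth=14):
    """One draw of omega, returning uniforms built from disjoint digit sets."""
    digits = {}  # binary digits of omega, drawn lazily as iid Bernoulli(1/2)

    def digit(index):
        if index not in digits:
            digits[index] = random.getrandbits(1)
        return digits[index]

    # U_n = sum_m digit(p_n^m) / 2^m, truncated at `depth` digits
    return tuple(
        sum(digit(p ** m) / 2 ** m for m in range(1, depth + 1)) for p in primes
    )

draws = [draw_uniforms() for _ in range(5_000)]
u1 = [d[0] for d in draws]
u2 = [d[1] for d in draws]

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

r12 = corr(u1, u2)  # near 0: digits at positions 2^m and 3^m never overlap
```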
Remark 4.57.
Note that to create the independent random variables in the development above, we
needed to ensure that the indices corresponding to the base 2 representations of the
different variables were disjoint. We could still create random variables X₁ and X₂
with distributions F₁ and F₂ if the indices were overlapping, but X₁ and X₂ would not
necessarily be independent. For instance, ω and 1 − ω are both uniformly distributed, so
F₁⁻¹(ω) and F₂⁻¹(1 − ω) have distributions F₁ and F₂, but are not independent.
One application of F⁻¹(U) ~ F is the normal scores rank test comparing the location
parameters of two groups. Before describing this particular test, we note that rank tests
in general can be an attractive alternative to parametric tests when there may be outliers
or the distributions have heavy tails. Unlike the original data, ranks cannot be too
extreme, so a rank test can confer a substantial power advantage over the t-test if the
true distribution has heavy tails. The first step of a rank test combines data from the
treatment and control groups and ranks them from 1 (smallest) to n (largest). A
commonly used statistic, the Wilcoxon rank sum statistic (Hollander and Wolfe, 1973),
sums the ranks of the treatment observations and compares that sum to its null
distribution. If the distribution of the data is skewed or has fat tails, the Wilcoxon test
can have substantially higher power than the t-test. On the other hand, it can be shown
that if the data really are normally distributed, the power of the Wilcoxon rank sum test
is approximately 5% lower than that of the t-test, asymptotically.
An ingenious alternative to the Wilcoxon test is the following procedure. Assume the
data come from a continuous distribution function. Generate standard normal
observations randomly, and replace the ith rank statistic of the original data with the ith
order statistic from the standard normal data. The replacement data are from a normal
distribution, so a t-test is automatically valid. This almost magical method can be
viewed as a two-step procedure. The first step is conceptual; if we knew the
distribution function F for the Xi, we could imagine replacing the original data X1, ...,
Xn by U1 = F(X1), ..., Un = F(Xn). It can be shown (Exercise 7) that each Ui has a
uniform distribution on (0, 1). The second step replaces the Uᵢ by the standard normal
deviates Zᵢ = Φ⁻¹(Uᵢ). The downside of the method is that inference depends on the
normal data randomly generated. Two different people applying the same test will get
different answers. Even the same researcher repeating the test will get a different p-
value. An alternative is to repeat many times the procedure of randomly generating
standard normal order statistics and use the sample mean of each order statistic. That
would reduce variability and make the procedure more repeatable. As the number of
repetitions tends to infinity, the sample mean of the ith order statistic tends to E{Z(i)},
the expected value of the ith order statistic. Therefore, we can dispense with generating
random deviates and just replace the ith order statistic from the original data with the
expected value of the ith order statistic from a standard normal distribution. This
adaptation of the procedure is known as the normal scores test. Unlike the Wilcoxon
test, the normal scores test loses no power compared to a t-test if the data are normally
distributed and the sample size tends to ∞.
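A sketch of the normal scores test in Python follows. The exact expected order statistics E{Z₍ᵢ₎} have no simple closed form, so this sketch substitutes the van der Waerden scores Φ⁻¹(r/(n + 1)), a standard approximation not specified in the text above; the standardization uses the exact mean and variance, under the permutation null, of a sum of scores sampled without replacement.

```python
import random
from statistics import NormalDist

random.seed(4)
PHI_INV = NormalDist().inv_cdf

def normal_scores_z(treatment, control):
    """Two-sample normal scores statistic, standardized under the permutation null.

    Rank r in the pooled sample is replaced by the van der Waerden score
    Phi^{-1}(r / (n + 1)), an approximation to E{Z_(r)}. Assumes no ties
    (continuous data).
    """
    pooled = sorted(treatment + control)
    n, m = len(pooled), len(treatment)
    score = {v: PHI_INV((r + 1) / (n + 1)) for r, v in enumerate(pooled)}
    s = sum(score[v] for v in treatment)
    scores = list(score.values())
    mean_all = sum(scores) / n
    # mean and variance of a sum of m scores drawn without replacement
    mean_s = m * mean_all
    var_s = (m * (n - m) / (n * (n - 1))) * sum((a - mean_all) ** 2 for a in scores)
    return (s - mean_s) / var_s ** 0.5

treated = [random.gauss(1.0, 1.0) for _ in range(50)]   # location shifted by 1
controls = [random.gauss(0.0, 1.0) for _ in range(50)]
z = normal_scores_z(treated, controls)  # large positive z suggests a shift
```

Because the scores depend only on the ranks, swapping the two groups simply negates the statistic.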
Let X ~ F and Y ~ G. We would like to generate a pair (X, Y) with arbitrary marginal
distributions F and G, such that X and Y are correlated. One way begins by generating a
pair (V₁, V₂) from a bivariate normal distribution with standard normal marginals and
correlation ρ. Expression (8.35) of Section 8.6 shows how to transform iid standard
normals to achieve bivariate normals with the desired correlation matrix. Next, compute
U₁ = Φ(V₁), U₂ = Φ(V₂). The marginal distribution of each Uᵢ is uniform [0, 1],
although the Us are correlated because (V₁, V₂) are correlated. Now take X = F⁻¹(U₁)
and Y = G⁻¹(U₂). Then X has marginal distribution F and Y has marginal distribution
G. Also, X and Y are correlated because (V₁, V₂) are correlated and Φ, F⁻¹, and G⁻¹ are
monotone functions. Larger values of ρ generate more correlated values of (X, Y), as
seen in Figure 4.11. Pairs (U₁, U₂) and (X, Y) are shown on the left and right,
respectively, for ρ = −0.9 (top), 0.0 (middle) and 0.9 (bottom) and X and Y marginally
exponential with parameter 1. The middle panel corresponds to independent data, while
the top and bottom show highly negatively and positively correlated pairs, respectively.
Figure 4.11
Copula model. The left side shows correlated uniforms (U₁, U₂) generated as Uᵢ =
Φ(Vᵢ), i = 1, 2, where (V₁, V₂) are bivariate normal with correlation ρ = −0.9 (top),
ρ = 0 (middle), and ρ = 0.9 (bottom). The right side shows correlated exponentials
generated by X = F⁻¹(U₁) and Y = F⁻¹(U₂), where F(t) = 1 − exp(−t).
In this example, the function (F⁻¹(U₁), G⁻¹(U₂)) converting correlated uniforms (U₁,
U₂) to correlated observations with marginal distribution functions F and G is called a
copula. Sklar's theorem (Sklar, 1959) asserts that it is always possible to generate an
arbitrary (X₁, ..., Xₖ) from correlated uniforms using a copula. This is easy to prove
when the marginal distribution functions F₁, ..., Fₖ are continuous: (U₁, ..., Uₖ) =
(F₁(X₁), ..., Fₖ(Xₖ)) are correlated uniforms and (F₁⁻¹(U₁), ..., Fₖ⁻¹(Uₖ)) = (X₁, ..., Xₖ)
with probability 1 (exercise). It follows that (F₁⁻¹(U₁), ..., Fₖ⁻¹(Uₖ)) has the same joint
distribution as (X₁, ..., Xₖ).
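The whole pipeline, correlated normals to correlated uniforms to correlated exponentials, fits in a few lines. A sketch using the same F(t) = 1 − exp(−t) as Figure 4.11; the transformation V₂ = ρV₁ + √(1 − ρ²)Z used to create the bivariate normal anticipates Expression (8.35).

```python
import math
import random
from statistics import NormalDist

random.seed(5)
PHI = NormalDist().cdf

def correlated_exponentials(rho):
    # (V1, V2): bivariate normal, standard normal marginals, correlation rho
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    v1 = z1
    v2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2
    # U_i = Phi(V_i) are correlated uniforms; F^{-1}(u) = -log(1 - u) is Exp(1)
    return -math.log(1 - PHI(v1)), -math.log(1 - PHI(v2))

pairs = [correlated_exponentials(0.9) for _ in range(50_000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

mean_x = sum(xs) / len(xs)  # each marginal is Exp(1), so the mean is near 1

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

r = corr(xs, ys)  # strongly positive, though below 0.9 after the transforms
```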
Exercises
1. Use the inverse probability transformation to construct a random variable that has a
uniform distribution on [0, a]: F(x) = x/a for 0 ≤ x ≤ a.
2. Use the inverse probability transformation to construct a random variable that has
an exponential distribution with parameter λ: F(x) = 1 − exp(−λx).
3. Use the inverse probability transformation to construct a random variable that has a
Weibull distribution: F(x) = 1 − exp(−λxᵃ), x ≥ 0.
5. Give an explicit formula for using the copula method to construct two negatively
correlated exponential random variables with parameter 1.
6. Give an explicit formula for using the copula method to construct two strongly
positively correlated random variables with the following marginal probability
mass function:
7. Suppose that the distribution function F(x) for a random variable is continuous.
Prove that F(X) is a random variable and has a uniform distribution. That is,
P(F(X) ≤ u) = u for each u ∈ (0, 1). Show this first when F is strictly increasing,
and then extend the proof to arbitrary continuous F.
8. We showed how to find countably infinitely many disjoint sets of indices Iₙ =
{pₙ¹, pₙ², pₙ³, ...}, where the pₙ are primes, and this allowed us to generate countably
infinitely many independent random variables from a single random draw of ω
from (0, 1). Can we find uncountably many disjoint sets of indices, and thereby
generate uncountably many independent random variables from a single random
draw of ω from (0, 1)? Explain.
4.7 Summary
1.
(a) A random variable is a function X(ω) on Ω such that X⁻¹(B) ∈ 𝒜 for each Borel set B.
(b) A random vector X(ω) = (X₁(ω), ..., Xₖ(ω)) is a function on Ω such that X⁻¹(B) ∈ 𝒜 for each B ∈ ℬₖ.
2. A random vector induces a probability measure P_X on ℬₖ through P_X(B) =
P{ω : X(ω) ∈ B} for B ∈ ℬₖ.
3. The distribution function F(x₁, ..., xₖ) = P(X₁ ≤ x₁, ..., Xₖ ≤ xₖ) completely
determines P_X(B) for every B ∈ ℬₖ.
(a) A univariate distribution function is right-continuous; it is continuous at x
if and only if it is left-continuous at x, and there are only countably many
points of discontinuity.
(b) A multivariate distribution function is continuous from above; it is
continuous at x if and only if it is continuous from below at x, and there are
only countably many axis lines of discontinuity.
4.
(a) Without loss of generality, we can assume that the experiment consists of
drawing a single number ω at random from the unit interval.
(b) Drawing ω = 0.X₁X₂ ... randomly is equivalent to flipping countably
infinitely many coins: X₁ = 1 (heads on coin 1) means ω is in the right half of
(0, 1), X₂ = 1 (heads on coin 2) means ω is in the right half of that half, etc.
(c) We can use the inverse probability transformation to define independent
random variables X₁, X₂, ... on Ω = (0, 1) with arbitrary distributions F₁,
F₂, ...
Chapter 5
In calculus you learned that the Riemann integral of a function f(x) over an interval [a,
b] is defined as a limit of partial sums. We partition [a, b] into a = x₀ < x₁ < ... < xₙ = b
and form the sum
Σᵢ f(ξᵢ) Δxᵢ,    (5.1)
where ξᵢ is a point in the ith interval [xᵢ₋₁, xᵢ) and Δxᵢ = xᵢ − xᵢ₋₁. Note that ξᵢ and
Δxᵢ depend on n as well, but we have suppressed the notation for simplicity. Then f is
said to be Riemann integrable if we get the same limit as n → ∞ and maxᵢ Δxᵢ → 0,
regardless of the intermediate points ξᵢ selected (Figure 5.1).
Figure 5.1
Partial sums of the form (5.1), where the intervals are the same size and the intermediate
point ξᵢ is the leftmost (upper panel) or rightmost (bottom panel) point in each interval.
The function
f(x) = I(x is irrational), 0 ≤ x ≤ 1,    (5.2)
is not Riemann integrable on [0, 1] because the value of the limit (5.1) for f
depends on which intermediate point is selected; if ξᵢ is rational for each i, then the
limit is 0, whereas if ξᵢ is irrational for each i, the limit is 1. If the ξᵢ alternate between
rational and irrational, the limit does not exist.
An alternative way to define an integral divides the y-axis, rather than the x-axis, into
equal-sized intervals Aᵢ. Form the partial sum
Σᵢ yᵢ μ{f⁻¹(Aᵢ)},    (5.3)
where yᵢ is a point in the ith y-interval and μ is Lebesgue measure. That is, for each y-
interval we select an intermediate value yᵢ and multiply by the Lebesgue measure of the
set of xs that get mapped into the y-interval (see Figure 5.2). This assumes that f is a
(Lebesgue) measurable function, so that the Lebesgue measure of f⁻¹(Aᵢ) is defined. If
we get the same limiting value I as the common width of the intervals forming the y-
partition tends to 0, irrespective of the intermediate values yᵢ selected, then I is an
alternative way to define the integral of f(x). This alternative method is called the
Lebesgue integral.
Figure 5.2
The horizontal lines show one interval Aᵢ of a partition of the y-axis. The set f⁻¹(Aᵢ) of
xs that get mapped into Aᵢ is shown on the x-axis. Each term of Expression (5.3) is the
product of an intermediate y value in Aᵢ and μ{f⁻¹(Aᵢ)}.
The advantage of partitioning the y-axis instead of the x-axis is that the y values within a
given interval are automatically close to each other. There is no longer any need for f(x)
to be a well-behaved function of x. We pay a small price in that a simple width, Δxᵢ, in
the Riemann sum (5.1) is replaced by a more intimidating expression, μ{f⁻¹(Aᵢ)}, in
the Lebesgue sum (5.3). Nonetheless, we are able to compute the Lebesgue measure of
even bizarre sets f⁻¹(Aᵢ).
Now return to the function (5.2) and consider its Lebesgue integral. The function f
takes only two possible values, 0 or 1. If the common width of the y-intervals is small
enough, then 0 and 1 will be in separate intervals (Figure 5.3). Therefore, if Aᵢ and Aⱼ
are the intervals containing y = 0 and y = 1, respectively, then Expression (5.3) is
yᵢ μ{f⁻¹(Aᵢ)} + yⱼ μ{f⁻¹(Aⱼ)}, where yᵢ and yⱼ are intermediate values in Aᵢ and Aⱼ. But f⁻¹(Aᵢ)
and f⁻¹(Aⱼ) consist of the rationals and irrationals, respectively, in [0, 1], whose
respective Lebesgue measures are 0 and 1. Therefore, Expression (5.3) is yⱼ. As the
common width of the intervals tends to 0, yⱼ = yⱼ,ₙ tends to 1. Therefore, the Lebesgue
integral exists and equals 1.
Figure 5.3
Partitioning the y-axis to form a Lebesgue partial sum for the function (5.2). When the
common interval width is small enough, the intervals containing 0 and 1 are distinct.
We can define the integral of a function f with respect to an arbitrary measure μ (not just
Lebesgue measure) on an arbitrary space (not just the line). If μ is a measure on a
measurable space (Ω, 𝒜), we can define the integral
∫ f dμ
by partitioning the y-axis into intervals of equal width, forming the sum of Expression
(5.3), and then taking the limit as the common width tends to 0. There are two potential
problems: 1) the limit may not exist and 2) the limit may depend on the intermediate
points yᵢ selected. We avoid both of these problems if f is nonnegative and we choose the
intervals in a clever way, as we see in the next section.
Continue dividing intervals in half and forming partial sums of the form of Expression
(5.3), where yᵢ is the smallest value in the interval. At step m, we get Aₘᵢ = [(i −
1)/2ᵐ, i/2ᵐ) and
Sₘ = Σᵢ ((i − 1)/2ᵐ) μ{f⁻¹(Aₘᵢ)}.    (5.7)
We have assumed that f(ω) < ∞ because if f(ω) = ∞, then f(ω) is not contained in any of
the intervals Aₘᵢ. We now extend the definition of integration when f(ω) is a
nonnegative function such that f⁻¹(B) ∈ 𝒜 for each B in the extended Borel sigma-field
(see Remark 3.13). Suppose the measure of the set of ω such that f(ω) = ∞ is 0. We
adopt the convention that ∞ · 0 = 0, so the value of the integral does not change if f is
infinite only on a set of measure 0. On the other hand, if the nonnegative function f(ω)
satisfies f(ω) = ∞ on A, where μ(A) > 0, then the Lebesgue-Stieltjes integral ∫ f dμ
is defined to be ∞.
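The dyadic approximation Sₘ can be computed numerically. A sketch follows; the choice f(x) = x² on [0, 1] with Lebesgue measure is arbitrary, and the measure of each preimage f⁻¹(Aₘᵢ) is approximated by counting points on a fine grid. The partial sums increase toward ∫ f dμ = 1/3, each underestimating it by less than the interval width 1/2ᵐ.

```python
def lebesgue_partial_sum(f, m, grid_n=200_000):
    """S_m = sum_i ((i-1)/2^m) * mu{f^{-1}(A_mi)} for f on [0, 1].

    mu{x : f(x) in [(i-1)/2^m, i/2^m)} is approximated by counting grid points.
    """
    width = 1.0 / 2 ** m
    counts = {}
    for k in range(grid_n):
        x = (k + 0.5) / grid_n
        i = int(f(x) / width)        # bin whose left endpoint is i * width
        counts[i] = counts.get(i, 0) + 1
    # each term: (smallest value in the bin) * (approximate measure of preimage)
    return sum(i * width * (c / grid_n) for i, c in counts.items())

square = lambda x: x * x             # integral over [0, 1] is 1/3
s4 = lebesgue_partial_sum(square, 4)
s8 = lebesgue_partial_sum(square, 8) # finer partition: closer to 1/3 from below
```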
The reader may wonder why yᵢ of Expression (5.3) is chosen to be the smallest number
in the given y-interval instead of requiring the sum to converge to the same limit
irrespective of the intermediate value yᵢ selected in the given interval. It turns out that
this happens automatically if μ is a finite measure such as a probability measure. To see
this, suppose that we use instead the largest value, yᵢ = i/2ᵐ, in each interval and take
the limit of
Σᵢ (i/2ᵐ) μ{f⁻¹(Aₘᵢ)}.    (5.8)
If Expression (5.7) tends to ∞, then so does Expression (5.8). On the other hand,
suppose that Expression (5.7) tends to a finite limit L. Then Expressions (5.8) and (5.7)
differ by at most Σᵢ (1/2ᵐ) μ{f⁻¹(Aₘᵢ)} ≤ (1/2ᵐ) μ{ω : f(ω) < ∞} → 0.
The last step follows from the fact that μ{ω : f(ω) < ∞} is no greater than μ(Ω) < ∞.
Because the limit of Expression (5.3) for an arbitrary choice of intermediate value yᵢ is
bounded below by the limit when the smallest yᵢ is chosen and bounded above by the
limit when the largest yᵢ is chosen, this shows that for finite measures, we get the same
limit for Expression (5.3) regardless of which intermediate value yᵢ is selected.
For non-finite measures such as Lebesgue measure on the line, we do not necessarily get
the same answer when we use the largest value in each interval instead of the smallest.
For instance, consider the function f(ω) ≡ 0 and Lebesgue measure μ. By any reasonable
definition, ∫ f dμ should be 0, yet that is not what we get if we use the largest value in
each interval. Expression (5.8) is ∞ for each m because the term for the first interval,
Aₘ₁, is (1/2ᵐ) μ{f⁻¹(Aₘ₁)} = (1/2ᵐ) μ(−∞, ∞) = ∞. That is why this formulation of the
Lebesgue-Stieltjes integral approximates f from below by using the smallest number in
each interval. It should be noted, however, that there are several equivalent alternative
definitions of the Lebesgue integral.
Proof. Exercise.
Exercises
5. Prove that if f and g are nonnegative measurable functions with f(ω) ≤ g(ω) for
all ω, then ∫ f dμ ≤ ∫ g dμ.
6. Prove Proposition 5.3.
One common expectation encountered in statistics is that of the squared deviation from the
mean, E{X − E(X)}², called the variance of X. The standard deviation of X is the
square root of the variance. Statisticians often use the notation σ² = var(X) and σ = sd(X).
Finally, if X and Y are random variables with means μ_X, μ_Y and finite variances σ²_X and
σ²_Y, the covariance σ_XY and correlation ρ_XY between X and Y are
σ_XY = E{(X − μ_X)(Y − μ_Y)} and ρ_XY = σ_XY/(σ_X σ_Y).
We have defined integration over the entire sample space Ω, though we have suppressed
the notation ∫_Ω f dμ to simply ∫ f dμ. We define integration over any set A ∈ 𝒜 as
follows.
If A ∈ 𝒜, ∫_A f dμ is defined to be ∫ f I_A dμ.
This definition ensures that we can always consider the region of integration to be the
entire space Ω.
The way we have defined the expectation of a random variable X involves the measure
P on the original probability space (Ω, 𝒜, P), but more elementary courses define
expectation without ever specifying (Ω, 𝒜, P). If we look more carefully, we see that we
do not really need to know the original space. Recall that the random variable X
induces a probability measure P_X on the line through P_X(B) = P{ω : X(ω) ∈ B} for each
one-dimensional Borel set B. Therefore, the approximating sum
Σᵢ ((i − 1)/2ᵐ) P{ω : X(ω) ∈ Aₘᵢ} of Expression (5.7) for a nonnegative random variable
X(ω) can be written in terms of the induced measure P_X as Σᵢ ((i − 1)/2ᵐ) P_X(Aₘᵢ).
Similarly, the limit E(X) depends only on P_X. Consequently:
3. Modulus inequality*
(a) |∫ f dμ| ≤ ∫ |f| dμ.
(b) |E(X)| ≤ E(|X|).
4. Preservation of ordering*: if f(ω) ≤ g(ω) for all ω, then ∫ f dμ ≤ ∫ g dμ.
3. The result is trivially true if the right side is ∞, so it suffices to consider f such
that ∫ |f| dμ < ∞, and therefore both ∫ f⁺ dμ < ∞ and ∫ f⁻ dμ < ∞. In that case,
|∫ f dμ| = |∫ f⁺ dμ − ∫ f⁻ dμ| ≤ ∫ f⁺ dμ + ∫ f⁻ dμ = ∫ |f| dμ.
Note that the modulus inequality is analogous to the triangle inequality; if we replace the
integral in the modulus inequality with a finite sum, we get the triangle inequality.
A common technique is to use the above properties in conjunction with indicator
functions to prove certain results, as in the following example.
We would like to use the above properties to prove that if E(|X|ᵖ) < ∞ for some real
number p ≥ 1, then E(|X|) < ∞. It is tempting to argue that |X| ≤ |X|ᵖ, so E(|X|) ≤ E(|X|ᵖ) <
∞ by the preservation of ordering property, but this is not quite right because if |X(ω)| <
1, then |X| > |X|ᵖ. We can modify the argument by using indicator functions.
If we can show that E{|X|I(|X| ≤ 1)} < ∞ and E{|X|I(|X| > 1)} < ∞, then the result will
follow from property 1. To show that E{|X|I(|X| ≤ 1)} < ∞, note that |X|I(|X| ≤ 1) ≤ 1,
so by the preservation of ordering property, E{|X|I(|X| ≤ 1)} ≤ E(1) = 1 < ∞. To show
that E{|X|I(|X| > 1)} < ∞, note that |X|I(|X| > 1) ≤ |X|ᵖI(|X| > 1) ≤ |X|ᵖ. Therefore,
E{|X|I(|X| > 1)} ≤ E(|X|ᵖ) < ∞. Property 1 now shows that
E(|X|) = E{|X|I(|X| ≤ 1)} + E{|X|I(|X| > 1)} < ∞,
completing the proof. Thus, for example, if the second moment E(X²) is finite, then
E(|X|) is finite, so E(X) exists and is finite by property 2. Hence, for random variables
with a finite second moment, the mean exists and is finite.
A very important result whose proof also uses indicator functions is the following.
If X is a random variable and k is a positive integer, then E(|X|ᵏ) < ∞ if and only if
Σₙ₌₁^∞ nᵏ⁻¹ P(|X| ≥ n) < ∞.
A more mundane way to view the trick of rearranging the double sum into rows and
columns using Expression (5.13) is as follows.
One question that comes up repeatedly in probability theory and real analysis is under
what conditions the pointwise convergence of fₙ(ω) to f(ω) implies that
∫ fₙ dμ → ∫ f dμ.    (5.15)
That is, when can we interchange the limit and the integral? To see that some conditions
are needed, let μ be Lebesgue measure on Ω = (0, 1) and consider the measurable
functions fₙ(ω) = n I(0 < ω < 1/n): fₙ(ω) → 0 for each ω, yet ∫ fₙ dμ = 1 for every n.
The importance of Fatou's lemma is that when we ponder whether Equation (5.15)
holds, we no longer have to wonder whether the integral on the right is finite; if the limit
on the left is finite, then the integral on the right is finite.
1. Suppose that, for ω outside a null set N (i.e., μ(N) = 0), fₙ(ω) ≥ 0 for all n and
fₙ(ω) ↑ f(ω). Then ∫ fₙ dμ → ∫ f dμ.
2. In random variable terminology, if P{ω : Xₙ(ω) ≥ 0 for all n and Xₙ(ω) ↑ X(ω)} =
1, then E(Xₙ) → E(X).
1. Suppose that for all ω outside a null set, fₙ(ω) → f(ω) and |fₙ(ω)| ≤ g(ω), where
∫ g dμ < ∞. Then ∫ fₙ dμ → ∫ f dμ.
2. In random variable terminology, if P{Xₙ(ω) → X(ω) and |Xₙ(ω)| ≤ Y(ω) for all n} = 1,
where E(Y) < ∞, then E(Xₙ) → E(X).
Let Xₙ be a random variable such that P{Xₙ(ω) → X(ω)} = 1 and P{|Xₙ| ≤ c} = 1 for
each n, where c is a constant. Then E(Xₙ) → E(X).
It is important to realize that the MCT and DCT apply to all integrals, and because sums
are just integrals with respect to counting measure, they apply to sums as well.
Example 5.15.
Suppose we want to evaluate the limit as n → ∞ of a sum Σ_{ω=1}^∞ fₙ(ω). Write the sum
in more suggestive integral notation as follows. Let μ be counting measure (see Example 3.23)
on Ω = {1, 2, ...} and set fₙ(ω) to be the ωth summand for ω = 1, 2, ... Then fₙ(ω) → f(ω)
as n → ∞ for each ω. Furthermore, |fₙ(ω)| ≤ g(ω) for each n, where Σ_{ω=1}^∞ g(ω) < ∞. By
the DCT, Σ_{ω=1}^∞ fₙ(ω) = ∫ fₙ dμ → ∫ f dμ = Σ_{ω=1}^∞ f(ω).
Example 5.16.
Here is an example of the use of the dominated convergence theorem to show that if
E(|X|) < and an , then . By definition, means
. Then converges to 0 for each for
which X() < . Moreover, |Yn| |X|, and E(|X|) < by assumption. By the DCT,
E(YS) 0.
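Example 5.16 can be corroborated by Monte Carlo. A sketch follows; the Exp(1) distribution for X and the truncation points 0, 2, 4, 8 are arbitrary choices. The estimates of E{X I(|X| > aₙ)} shrink toward 0 as aₙ grows.

```python
import random

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # E(|X|) = 1 < infinity

def tail_mean(a):
    # Monte Carlo estimate of E{X I(|X| > a)}
    return sum(x for x in xs if abs(x) > a) / len(xs)

vals = [tail_mean(a) for a in (0.0, 2.0, 4.0, 8.0)]
# For Exp(1) the exact value is (a + 1) * exp(-a): 1, ~0.41, ~0.09, ~0.003
```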
Exercises
13. Suppose that X is a random variable with density function f(x), and consider
E(X) = ∫ x f(x) dx. Assume further that f is symmetric about 0 (i.e., f(−x) = f(x) for all
x ∈ R). Is the following argument correct?
E(X) = ∫ x f(x) dx = 0
because g(x) = x f(x) satisfies g(−x) = −g(x). Hint: consider f(x) = {π(1 + x²)}⁻¹;
are E(X⁻) and E(X⁺) finite?
14. Show that if fₙ are nonnegative measurable functions such that fₙ(ω) ↓ f(ω), then it is
not necessarily the case that ∫ fₙ dμ → ∫ f dμ for an arbitrary
measure μ. Hint: let μ be counting measure and fₙ(ω) be the indicator that ω ≥ n.
15. Use the preservation of ordering property to give another proof of the fact that if f
is a nonnegative, measurable function, then ∫ f dμ = 0 if and only if μ(A) = 0,
where A = {ω : f(ω) > 0}. Hint: if μ(A) > 0, then there must be a positive integer n
such that μ{ω : f(ω) > 1/n} > 0.
16. Use elementary integration properties and Fatou's lemma to prove the monotone
convergence theorem.
Proof.
Applying Markov's inequality to the random variable (X − μ)² yields Chebychev's
inequality:
If X is a random variable with mean μ and variance σ², and c > 0, then P(|X − μ| ≥ c)
≤ σ²/c².
The result follows immediately from Markov's inequality and the fact that |X − μ| ≥ c is
equivalent to (X − μ)² ≥ c².
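Chebychev's inequality is easy to check empirically. A sketch follows; standard normal draws are an arbitrary choice. At each c the observed tail frequency sits below σ²/c², though often far below it, since the bound must hold for every distribution with that variance.

```python
import random
import statistics

random.seed(8)
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]
mu = statistics.fmean(xs)
var = statistics.pvariance(xs, mu)

checks = []
for c in (1.0, 2.0, 3.0):
    tail = sum(abs(x - mu) >= c for x in xs) / len(xs)  # P(|X - mu| >= c)
    checks.append((tail, var / c ** 2))                  # Chebychev bound
```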
Figure 5.4
Top: convex function. The secant line (solid line) joining any two points on the curve
lies on or above the curve. Equivalently, for each x there is a line through (x, f(x)) that
lies entirely on or below the curve (dotted line). Bottom: concave function. The secant
line (solid line) joining any two points on the curve lies on or below the curve.
Equivalently, for each x there is a line through (x, f(x)) that lies entirely on or above the
curve (dotted line).
Suppose that f(x) is a convex function and that X is a random variable with finite mean
μ. If E(|f(X)|) < ∞, then E{f(X)} ≥ f{E(X)}.
Proof. Because f(x) is convex, there exists a line y = f(μ) + b(x − μ) passing through (μ,
f(μ)) that lies entirely below or on the curve. That is, f(x) ≥ f(μ) + b(x − μ). Now
replace x by the random variable X and take the expected value of both sides to get, by
the preservation of ordering property, E{f(X)} ≥ f(μ) + b{E(X) − μ} = f(μ).
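Jensen's inequality can be checked numerically. A sketch follows; the convex function f(x) = eˣ and Uniform(−1, 1) draws are arbitrary choices. The sample mean of f(X) exceeds f applied to the sample mean.

```python
import math
import random
import statistics

random.seed(9)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

lhs = statistics.fmean(math.exp(x) for x in xs)  # estimates E{f(X)}
rhs = math.exp(statistics.fmean(xs))             # estimates f{E(X)}
# Exact values here: E(e^X) = sinh(1), about 1.175, and e^{E(X)} = e^0 = 1
```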
Corollary 5.20.
Proof. Exercise.
Note that Corollary 5.20 is another way to see that if E(|X|ᵖ) is finite for some p ≥ 1,
then E(|X|) < ∞ (see also Example 5.8).
In some cases we want to bound the expectation of a product of two random variables X
and Y. One basic inequality comes from the fact that (x − y)² = x² + y² − 2xy, from
which 2xy = x² + y² − (x − y)² ≤ x² + y². That is,
xy ≤ (x² + y²)/2.    (5.18)
If we replace x and y with random variables |X| and |Y|, we deduce that
E(|XY|) ≤ {E(X²) + E(Y²)}/2.    (5.19)
Thus, if X and Y have finite second moments, then E(XY) is finite as well.
A generalization of inequality (5.18) is
xy ≤ x^p/p + y^q/q    (5.20)
for positive x and y and 1/p + 1/q = 1. Inequality (5.18) is the special case when p = q =
2, although (5.18) holds for positive or negative x, y, whereas (5.20) requires positive x
and y. Replacing x and y with random variables leads to Hölder's inequality:
E(|XY|) ≤ {E(|X|^p)}^{1/p} {E(|Y|^q)}^{1/q}.    (5.21)
Proof. Replace x and y in inequality (5.20) with |X| and |Y|, random variables with
E(|X|^p) = E(|Y|^q) = 1. Then E(|XY|) ≤ E(|X|^p)/p + E(|Y|^q)/q = 1/p + 1/q = 1. This proves that
(5.21) holds when E(|X|^p) = E(|Y|^q) = 1. More generally, if E(|X|^p) > 0 and
E(|Y|^q) > 0, let X′ = X/{E(|X|^p)}^{1/p} and Y′ = Y/{E(|Y|^q)}^{1/q}. Then the pth moment of |X′| and qth moment of |Y′|
are both 1 and
E(|X′Y′|) = E(|XY|)/[{E(|X|^p)}^{1/p} {E(|Y|^q)}^{1/q}] ≤ 1,
which proves the result whenever E(|X|^p) > 0 and E(|Y|^q) > 0. The result is immediate if
either E(|X|^p) = 0 (which implies that X = 0 with probability 1) or E(|Y|^q) = 0 (which
implies that Y = 0 with probability 1).
Another important inequality allows us to conclude that whenever |X|^p and |Y|^p are
integrable, then so is (X + Y)^p or (X − Y)^p.
Proposition 5.23 (Minkowski's inequality).
{E(|X + Y|^p)}^{1/p} ≤ {E(|X|^p)}^{1/p} + {E(|Y|^p)}^{1/p} and {E(|X − Y|^p)}^{1/p} ≤ {E(|X|^p)}^{1/p} + {E(|Y|^p)}^{1/p}.
Proof.
If p ≥ 1, then
E(|X + Y|^p) ≤ E{(|X| + |Y|) |X + Y|^{p−1}} = E{|X| |X + Y|^{p−1}} + E{|Y| |X + Y|^{p−1}}.
By Hölder's inequality, E{|X| |X + Y|^{p−1}} ≤ {E(|X|^p)}^{1/p} [E{|X + Y|^{(p−1)q}}]^{1/q}
whenever 1/p + 1/q = 1. But 1/p + 1/q = 1 means that (p − 1)q = p. Therefore,
E{|X| |X + Y|^{p−1}} ≤ {E(|X|^p)}^{1/p} {E(|X + Y|^p)}^{1/q}.
Similarly,
E{|Y| |X + Y|^{p−1}} ≤ {E(|Y|^p)}^{1/p} {E(|X + Y|^p)}^{1/q}.
Adding the last two inequalities gives
E(|X + Y|^p) ≤ [{E(|X|^p)}^{1/p} + {E(|Y|^p)}^{1/p}] {E(|X + Y|^p)}^{1/q}.    (5.28)
Dividing both sides of Equation (5.28) by {E(|X + Y|^p)}^{1/q} and noting that 1 − 1/q = 1/p
yields the stated result.
Exercises
1. State and prove a result analogous to Jensens inequality, but for concave functions.
2. Prove that if X has mean 0, variance σ², and finite fourth moment μ₄ = E(X⁴), then
σ⁴ ≤ μ₄.
3. Prove the Schwarz inequality.
4. Prove Corollary 5.20.
5. Prove that Markov's inequality is strict if P(|X| > c) > 0 or E{|X|I(|X| < c)} > 0.
Does this imply that Markov's inequality is strict unless |X| = c with probability 1?
(Hint: consider X taking values c and 0 with probabilities p and 1 − p).
6. Prove that the inequality in Jensen's inequality is strict unless f(X) = f(μ) + b(X − μ) with
probability 1 for some constant b.
7. Prove that if 0 < σ_X < ∞ and 0 < σ_Y < ∞, the correlation coefficient ρ = σ_XY/
(σ_X σ_Y) between X and Y is between −1 and +1.
8. Suppose that xᵢ are positive numbers. The sample geometric mean is defined by
x̄_G = (∏ᵢ₌₁ⁿ xᵢ)^{1/n}. Note that ln(x̄_G) = (1/n) Σᵢ₌₁ⁿ ln(xᵢ). Using this representation, prove that the
arithmetic mean is always at least as large as the geometric mean.
9. The sample harmonic mean of numbers x₁, ..., xₙ is defined by x̄_H = {(1/n) Σᵢ₌₁ⁿ (1/xᵢ)}⁻¹.
Show that the following ordering holds for positive numbers: harmonic mean ≤
geometric mean ≤ arithmetic mean. Does this inequality hold without the restriction
that xᵢ > 0, i = 1, ..., n?
10. Let f0(x) and f1(x) be density functions with . Then
.
11. Suppose that X and Y are independent nonnegative, nonconstant random variables
with mean 1 and both U = X/Y and V = Y/X have finite mean. Prove that U and V
cannot both have mean 1.
5.5 Iterated Integrals and More on Independence
We pointed out the connection between integration over the original sample space Ω
with its probability measure P and integration over R using the induced probability
measure defined by its distribution function F(x). The same connection exists for
integrals of a function of more than one variable. For example, suppose we wish to
compute the expectation of the random variable g(X, Y), where X and Y are arbitrary
random variables on (Ω, 𝒜, P) and g is a Borel function of (x, y). We get the same
answer whether we integrate over the original sample space using P or over the
product space R × R using the induced measure P_{X,Y}:
E{g(X, Y)} = ∫ g(X(ω), Y(ω)) dP(ω) = ∫ g(x, y) dP_{X,Y}(x, y).
The integral on the right is defined the same way as was the integral on the original
probability space, with Ω replaced by R × R and P replaced by P_{X,Y}.
Remark 5.25.
For a precise statement of the theorems in their general form, consider the following
measure spaces.
One important scenario under which we can interchange the order of integration is when
g is nonnegative.
If each μᵢ is sigma-finite and g(a₁, a₂) is nonnegative and measurable with respect to
𝒜₁ × 𝒜₂, then ∫ g(a₁, a₂) dμ₂(a₂) is measurable with respect to 𝒜₁, and
∫∫ g(a₁, a₂) dμ₂(a₂) dμ₁(a₁) = ∫∫ g(a₁, a₂) dμ₁(a₁) dμ₂(a₂).
These powerful results have implications for independence. In particular, they can be
used to show that independent random variables are uncorrelated.
If X and Y are independent random variables such that E(|X|) < ∞ and E(|Y|) < ∞, then
E(XY) = E(X)E(Y); i.e., X and Y are uncorrelated.
Proof. Let μ_X and μ_Y be the marginal distributions for X and Y, respectively,
and μ_X × μ_Y be the product probability measure on R × R. By Tonelli's theorem,
E(|XY|) = ∫∫ |xy| dμ_X(x) dμ_Y(y) = ∫ |x| dμ_X(x) ∫ |y| dμ_Y(y) = E(|X|)E(|Y|) < ∞.
Therefore, XY is integrable over the product measure. By Fubini's theorem, the above
steps can be repeated without the absolute values, yielding E(XY) = E(X)E(Y).
Because cov(X, Y) = E(XY) − E(X)E(Y), E(XY) = E(X)E(Y) is equivalent to cov(X,
Y) = 0.
The reverse direction does not hold. That is, X and Y can be uncorrelated without being
independent. For instance, if Z ~ N(0, 1), then Z and Z² are uncorrelated because E(Z·Z²)
= E(Z³) = 0 = E(Z)E(Z²). On the other hand, Z and Z² are clearly not independent. If
they were, then with p = P(−1 ≤ Z ≤ 1), we would have P(−1 ≤ Z ≤ 1, Z² ≤ 1) = p · p = p²,
whereas in fact P(−1 ≤ Z ≤ 1, Z² ≤ 1) = p ≠ p² because 0 < p < 1.
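Both facts, that independence implies zero correlation but not conversely, show up clearly in simulation. A sketch using Z ~ N(0, 1): the sample covariance of Z with an independent normal and with Z² are both near 0, yet the joint probability P(−1 ≤ Z ≤ 1, Z² ≤ 1) is far from the product of the marginal probabilities.

```python
import random
import statistics

random.seed(10)
n = 200_000
zs = [random.gauss(0.0, 1.0) for _ in range(n)]
ws = [random.gauss(0.0, 1.0) for _ in range(n)]  # independent of zs

def sample_cov(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return statistics.fmean((x - ma) * (y - mb) for x, y in zip(a, b))

cov_indep = sample_cov(zs, ws)                  # independent => near 0
cov_z_z2 = sample_cov(zs, [z * z for z in zs])  # E(Z^3) = 0 => near 0, yet dependent

# Dependence of Z and Z^2: the events {|Z| <= 1} and {Z^2 <= 1} coincide,
# so the joint probability is p, not p^2.
p_joint = sum(abs(z) <= 1 and z * z <= 1 for z in zs) / n
p_prod = (sum(abs(z) <= 1 for z in zs) / n) * (sum(z * z <= 1 for z in zs) / n)
```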
Suppose that not only are X and Y uncorrelated, but every Borel function of X is
uncorrelated with every Borel function of Y (whenever the correlation exists). Then X
and Y are independent.
X and Y are independent if and only if f(X) and g(Y) are uncorrelated (i.e.,
E{f(X)g(Y)} = E{f(X)}E{g(Y)}) for all Borel functions f and g such that E{f(X)g(Y)} exists.
Proof. If X and Y are independent, then so are f(X) and g(Y) by Proposition 4.39. By
Proposition 5.29, cov{f(X), g(Y)} = 0, proving the "only if" direction. Now suppose that f(X)
and g(Y) are uncorrelated for all Borel functions f and g such that the correlation exists.
Take f(X) = I(X ≤ x) and g(Y) = I(Y ≤ y). Then P(X ≤ x, Y ≤ y) = E{f(X)g(Y)} =
E{f(X)}E{g(Y)} = P(X ≤ x)P(Y ≤ y).
This holds for arbitrary x and y, so the joint distribution function F(x, y) for X and Y
factors into F_X(x)F_Y(y). By Proposition 4.37, X and Y are independent. This completes
the proof of the "if" direction.
Reversing the order of integration can be helpful in other settings as well. For instance,
Proposition 5.9 states that if X is a random variable and k is a positive integer, then
E(|X|ᵏ) < ∞ if and only if Σₙ₌₁^∞ nᵏ⁻¹ P(|X| ≥ n) < ∞. One way to prove this is as follows,
where ⌊|X|⌋ denotes the greatest integer less than or equal to |X|. The reversal of sum and
integral follows from Tonelli's theorem (the sum is an integral with respect to counting
measure). Also, we see from Figure 5.5 that
Figure 5.5
The sum is the area of the bars embedded in the graph of f(t) = tᵏ; it
is at least as large as the area under the curve between 0 and j (top panel), and no
greater than the area under the curve between 1 and j + 1 (bottom panel).
Exercises
1. Suppose that cov{Y, f(X)} = 0 for every Borel function f such that cov{Y, f(X)}
exists. Show that this does not necessarily imply that X and Y are independent.
Hint: let Z be N(0, 1), and set Y = Z and X = Z2.
2. Let X be a nonnegative random variable with distribution function F(x). Prove that
E(X) = ∫₀^∞ {1 − F(x)} dx.
3. Prove that , for any distribution function F and constant a ≥ 0.
5.6 Densities
Elementary probability courses usually assume that a continuous random variable X has
a probability density function f(x), defined as follows.
If there is a nonnegative function f(x) such that the distribution function F(x) of the
random variable X satisfies F(x) = ∫_{−∞}^x f(t) dt for all x ∈ R, then f is said to be a probability
density function (density for short).
Proof. Let 𝒞 be the collection of Borel sets A such that P(X ∈ A) = ∫_A f(x) dx. Then 𝒞 contains
all sets of the form (−∞, x] by Definition 5.31. It is easy to show that 𝒞 must contain all
intervals, and that it contains the field in Example 3.8. Moreover, 𝒞 is a monotone class
because if Aₙ ∈ 𝒞 and Aₙ ↑ A, then P(X ∈ A) = lim P(X ∈ Aₙ) = lim ∫_{Aₙ} f(x) dx = ∫_A f(x) dx.
Therefore, A ∈ 𝒞. The same argument works if Aₙ ↓ A. Therefore, 𝒞 is a monotone class
containing the field in Example 3.8. The result now follows from the monotone class
theorem (Theorem 3.32).
Not all continuous random variables have a density, as the following example shows.
Put a blue and red ball in an urn, and then draw them without replacement. If Xi is the
indicator that the ith draw is blue, i = 1, 2, then each of the outcomes (X1 = 1, X2 = 0)
and (X1 = 0, X2 = 1) has probability 1/2. Now repeat the experiment and let (X3, X4)
be the result of the next two draws, and continue in this way ad infinitum. Now form the
random number Y = 0.X1X2X3 ... = X1/2 + X2/2² + X3/2³ + ... That is, the Xi are the
base 2 representation of the random number Y in the unit interval. Notice that Y is not a
discrete random variable because the probability of each outcome 0.x1x2x3 ... is 0.
Therefore, you might think that Y must have a density f(y). In that case,
P(Y ∈ A) = ∫_A f(y) dy (5.34)
for every Borel set A. Let A be the set of numbers in [0, 1] whose base 2 representation
has precisely one 1 and one 0 in the first two digits, the next two digits, the next two
digits, etc. Because of the way Y was constructed, P(Y ∈ A) = 1. But under Lebesgue
measure, the probability of exactly one 1 and one 0 in any given pair is 1/2, so the
Lebesgue measure of A is (1/2)(1/2)(1/2) ... = 0 by Proposition 4.35. Moreover, if the
Lebesgue measure of A is 0, then the integral in Equation (5.34) is 0 by Proposition 5.3.
Therefore,
1 = P(Y ∈ A) = ∫_A f(y) dy = 0,
a contradiction. Therefore, Equation (5.34) cannot hold for every Borel set A. In other
words, Y is not discrete, yet has no density. Note that Y is equivalent to the random
number formed using permuted block randomization, with blocks of size 2, discussed at
the end of Example 1.1.
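The construction above can be sketched numerically. The following Python snippet (our own illustration, not from the text; the function name and block counts are arbitrary) draws blocks of size 2 without replacement and confirms that every resulting pair of bits has exactly one 1, so Y always lands in the Lebesgue-null set A of Example 5.33:

```python
import random

def permuted_block_bits(n_pairs, rng):
    """Draw n_pairs blocks of size 2 without replacement from {blue, red};
    return the indicator bits X1, X2, ... (1 = blue draw)."""
    bits = []
    for _ in range(n_pairs):
        block = [1, 0]
        rng.shuffle(block)   # each ordering of the pair has probability 1/2
        bits.extend(block)
    return bits

rng = random.Random(0)
bits = permuted_block_bits(1000, rng)

# Every consecutive pair (X_{2i-1}, X_{2i}) contains exactly one 1, so the
# number Y = sum X_i / 2^i always lies in the set A, which has Lebesgue
# measure (1/2)(1/2)... = 0.
assert all(bits[2*i] + bits[2*i + 1] == 1 for i in range(1000))

y = sum(b / 2**(i + 1) for i, b in enumerate(bits))
assert 0 < y < 1
```

Under iid fair-coin bits (Lebesgue measure), the probability that the first k pairs all have exactly one 1 is (1/2)^k, which tends to 0, matching the measure-zero computation in the example.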
In Example 5.33, the key to showing that there was no density function was to construct
a Borel set A of Lebesgue measure 0, yet P(Y ∈ A) > 0. If every Borel set A with
Lebesgue measure 0 had P(Y ∈ A) = 0, we would not have been able to construct a
counterexample of the type in Example 5.33. In fact, we will soon see that we cannot
construct a counterexample of any kind if this condition holds. This motivates the
following definition.
The following is a famous theorem of probability and measure theory stated for the
special case of Lebesgue measure. Chapter 10 presents the more general Radon-
Nikodym theorem for arbitrary measures.
There is another way to view a density function. The fundamental theorem of calculus
for Riemann integrals asserts that if f is any function that is continuous at b, and F is a
function such that F(x) = ∫_a^x f(t) dt, then F is differentiable at b and F′(b) = f(b). This holds
for Lebesgue integrals as well. In fact, a stronger result holds:
Theorem 5.36.
Proof. We prove only the fundamental theorem of calculus. Suppose that f is continuous
at b, and let ε > 0.
We can view Theorem 5.36 as saying that a density function f(x) is the derivative
(called the Radon-Nikodym derivative) of the distribution function F(x) with respect to
Lebesgue measure. We can generalize this result in different ways. For now we consider
two-dimensional density functions.
The pair (X, Y) is said to have density function f(x, y) with respect to two-dimensional
Lebesgue measure if f(x, y) is nonnegative and F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds for all
x ∈ R, y ∈ R, where ds dt denotes integration with respect to two-dimensional Lebesgue measure.
Figure 5.6
Level curve f(x,y) = 0.05 of the bivariate normal density function f(x,y) for ρ = 0
(circle), ρ = 0.9 (upward slanting ellipse) and ρ = −0.9 (downward slanting ellipse).
Because a probability density function is nonnegative, Tonelli's theorem implies that (1)
we can write F(x, y) as an iterated integral, and (2) it makes no difference
whether we integrate first over s or t.
The proof of the following result is similar to that of Proposition 5.32, and is omitted.
Proposition 5.38. Bivariate density defines probability of any 2-dimensional Borel set
If f(x,y) is a density function for the random vector (X, Y) with respect to two-
dimensional Lebesgue measure, then P{(X, Y) ∈ B} = ∫∫_B f(x, y) dx dy for every two-
dimensional Borel set B.
Proposition 5.39.
Let f(x, y) and F(x, y) be the density and distribution functions for the pair (X, Y). Then
∂²F(x, y)/(∂x ∂y) exists and equals f(x, y) except on a set of (x, y) values of two-dimensional
Lebesgue measure 0.
that approach f from below. Definition 5.1 says that ∫ f dμ is lim_{m→∞} ∫ fm dμ, the limit
of integrals of the elementary functions (5.36) approximating f.
Definition 5.1 implies that the integral of any nonnegative simple function taking values
ai on sets Ai, i = 1, ..., k is Σ_{i=1}^{k} ai μ(Ai) (Exercise 3 of Section 5.2.1). By the MCT, the
integral of any nonnegative measurable function f is the limit of integrals of the fm of
(5.37). In fact, the MCT ensures that we would get the same answer using any
nonnegative simple functions, not just the specific functions in Expression (5.37). This
leads to the following equivalent definition of integration.
1. If f(ω) is a nonnegative simple function taking values a1, ..., an, define ∫ f dμ to
be Σ_{i=1}^{n} ai μ(f = ai).
2. If f(ω) is any nonnegative measurable function, define ∫ f dμ by lim_{m→∞} ∫ fm dμ,
where the fm are any simple functions such that fm ↑ f.
3. If f is any measurable function, define ∫ f dμ by ∫ f⁺ dμ − ∫ f⁻ dμ, provided
that this expression is not of the form ∞ − ∞. Otherwise, ∫ f dμ is undefined.
Assume first that X and Y are independent, nonnegative simple random variables,
X = Σ_{i=1}^{m} ai I(Ai) and Y = Σ_{j=1}^{n} bj I(Bj), all ai ≥ 0, bj ≥ 0. Then each (Ai, Bj) pair of events are
independent. Also, XY is a nonnegative simple random variable taking values ai bj with
probability P(Ai)P(Bj), i = 1, ..., m, j = 1, ..., n. Therefore, by definition of expectation for
nonnegative simple functions,
This proves the result for independent, nonnegative simple random variables X and Y.
Now suppose that X and Y are arbitrary nonnegative, independent random variables.
There exist sequences Xn and Yn of nonnegative simple random variables increasing to
X and Y respectively such that Xn and Yn are independent. This follows from the fact
that Xn can be constructed solely from X and Yn solely from Y, and X and Y are
independent. Note that Zn = XnYn are nonnegative simple functions increasing to Z =
XY. It follows that
(5.39)
This proves the result for arbitrary nonnegative, independent random variables X and Y
with finite means.
The last step is to prove that the result holds for arbitrary independent random variables
X and Y with finite mean. Write X as X⁺ − X⁻ and Y as Y⁺ − Y⁻. The pair (X⁻, X⁺)
are Borel functions of X, and (Y⁻, Y⁺) are Borel functions of Y. It follows that (X⁻, X⁺)
is independent of (Y⁻, Y⁺), so
5.8 Summary
1. Comparison between Lebesgue and Riemann integration:
(a) Lebesgue partitions the y-axis, Riemann the x-axis.
(b) Poorly behaved functions like f(ω) = I(ω is rational) are Lebesgue-
integrable but not Riemann-integrable.
5. Important inequalities:
(e) Schwarz: |E(XY)| ≤ {E(X²)E(Y²)}^{1/2}.
6. A density function f(x) for a probability measure μX with distribution function F(x)
satisfies μX(B) = ∫_B f(x) dx for all Borel sets B.
(a) Radon-Nikodym theorem A density function exists if and only if μX is
absolutely continuous with respect to Lebesgue measure (i.e., μX(A) = 0 for
all Borel sets A with Lebesgue measure 0).
(b) F′(x) exists and equals f(x) except on a set of Lebesgue measure 0.
(c) Fundamental theorem of calculus F′(x) = f(x) if f is continuous at x.
Modes of Convergence
An important part of the theory of statistical inference is understanding what happens to
an estimator like the mean or median as the sample size n tends to ∞. Several different
questions might come to mind. First, does the estimator get close to the true parameter value as n
→ ∞? There are several alternative ways to define "get close to," namely:
These are called almost sure convergence, convergence in probability, and convergence
in Lp, respectively.
Now suppose that Zn is the usual t-statistic defined above and Z̃n is the same thing
except that the usual variance estimate s² is replaced by the maximum likelihood
estimator σ̂², so that σ̂² is just (n − 1)/n times s². To show that we
reach essentially the same reject/do not reject conclusion whether we use Zn or Z̃n, we
must show more than just that the distributions of Zn and Z̃n are close. For example, if Xi
are iid N(0,1) under the null hypothesis, X1 and √n X̄n have identical null distributions,
namely standard normal, but the reject/do not reject decisions based on these two test
statistics can be completely different in any given sample. To conclude that Zn and Z̃n
reach essentially the same conclusion in any given sample, we need to show that Zn − Z̃n
is close to 0 in the sense of 1, 2, or 3 above.
Then Xn(ω) still converges to 0 as n → ∞ for ω < 1, but has no limit when ω = 1
because the sequence Xn(1) alternates between −1 and +1.
Even though the behavior of Xn(ω) for ω = 1 was different in Equations (6.1) and (6.2),
this is irrelevant in the sense that {ω = 1} has probability 0 of occurring anyway. It
makes sense to ignore sets of probability 0, and this motivates the following definition.
1. The sequence of random variables X1(ω), X2(ω), ... is said to converge almost
surely (or with probability 1) to X(ω) if, for each fixed ω outside a set of
probability 0, the sequence of numbers Xn(ω) converges to X(ω) as n → ∞. We
write Xn → X a.s.
2. The sequence of functions f1(ω), f2(ω), ... on an arbitrary measure space is
said to converge almost everywhere (a.e.) to f(ω) if, for each ω outside a set of μ-
measure 0, the sequence of numbers fn(ω) converges to f(ω).
The only difference between convergence almost everywhere and convergence almost
surely is whether the measure μ is a probability measure. We focus almost exclusively
on probability measures, though some results that extend to general measures are very
helpful.
Example 6.2.
Example 6.2 shows how helpful it is when assessing almost sure convergence to fix ω
and think about the sequence of numbers X1(ω) = x1, ..., Xn(ω) = xn, ... Nonetheless, we
need not necessarily know the underlying probability space to know that Xn converges
almost surely, as the following example illustrates.
Example 6.3.
Suppose that Y(ω) is a finite random variable on any probability space, and let Xn(ω) =
Y(ω)/n. For each ω, Y(ω) is just a finite number, so Y(ω)/n → 0 as n → ∞. This holds
for every ω, so Xn → 0 almost surely.
Now let us relax the assumption that Y is finite for every ω. Assume instead that Y(ω) is
finite with probability 1. Because Xn(ω) = Y(ω)/n → 0 for every ω such that Y(ω) is
finite, the exceptional set on which Xn does not converge to 0 has probability 0. In other
words, it is still true that Xn → 0 a.s.
Example 6.4.
In the preceding examples, the limiting random variable X(ω) was a constant (0) with
probability 1, but there are many examples in which the limiting random variable is not
constant. For example, let X be any random variable, and let Xn = X for every n. Then
Xn → X a.s. trivially. Similarly, if Y is any random variable that is finite with probability 1,
then Xn = {1 + Y(ω)/n}^n → X(ω) = exp{Y(ω)} for each ω for which Y(ω) is finite
because (1 + a/n)^n → exp(a) for each finite constant a. Therefore, Xn → X a.s.
Proof. These all follow almost immediately from the corresponding properties of
convergence of sequences of numbers. For instance, for part 3, Xn(ω) → X(ω) except
on a set N1 of probability 0, and Yn(ω) → Y(ω) except on a set N2 of probability 0.
Therefore, outside the set N = N1 ∪ N2 of probability P(N1 ∪ N2) ≤ P(N1) + P(N2) =
0, Xn(ω) + Yn(ω) → X(ω) + Y(ω). The proofs of the other items are left as an exercise.
One implication of Proposition 6.5 concerns one- and two-sample estimators. For
instance, in a one-sample setting, we may be interested in estimating a (finite) mean μ
using the sample mean X̄n of iid random variables. We will show in Chapter 7 that X̄n
converges almost surely to μ. In a two-sample setting, we estimate the difference in
means μ1 − μ2 by the difference in sample means. By part 3 of Proposition 6.5,
this estimator converges almost surely to μ1 − μ2. Similarly, in a one-sample binary
outcome setting, we are interested in estimating the probability p of an event such as
death by 28 days. Chapter 7 shows that the sample proportion of people dying by 28
days converges almost surely to p. In a two-sample setting, we may be interested in
estimating the relative risk p1/p2, the ratio of event probabilities in the two groups. By
part 5 of Proposition 6.5, the sample relative risk p̂1/p̂2 converges almost surely to p1/p2
if p2 > 0.
Suppose that X1, X2, ... are iid Bernoulli random variables with parameter 0 < p < 1. Is
there a random variable X such that Xn → X almost surely? Again fix ω and ask
yourself under what conditions will the sequence of numbers X1(ω) = x1, X2(ω) = x2, ...
converge? Because each Xn(ω) is 0 or 1, lim Xn(ω) exists if and only if Xn(ω) consists
of all zeroes or all ones from some point on. To see this, suppose that, on the contrary,
no matter how far we go in the sequence, xN, xN+1, ... has both zeroes and ones. Then xn
cannot converge to 0 or 1 or anything else for that matter because the sequence
oscillates between 0 and 1 indefinitely. Thus, if we can argue that the set of ω such that
XN(ω), XN+1(ω), ... are all zeroes or all ones for some N has probability zero, then
we have shown that Xn diverges almost surely.
For any given N, let AN be the event that XN = 0, XN+1 = 0, XN+2 = 0, ..., and BN be the event
that XN = 1, XN+1 = 1, XN+2 = 1, ... Because the Xn are independent,
P(AN) = (1 − p)(1 − p) ... = 0, and P(BN) = p · p ... = 0 by Proposition 4.35. Therefore,
P(AN ∪ BN) = 0. The probability that there is some N for which the
sequence terminates with all zeroes or all ones is
P(∪N (AN ∪ BN)) ≤ ΣN P(AN ∪ BN) = 0. We have shown that the set of ω such that the
sequence of numbers X1(ω) = x1, X2(ω) = x2, ... terminates with all zeroes or all ones
has probability 0. Therefore, the probability that Xn(ω) converges is 0; Xn cannot
converge almost surely to any random variable.
It makes sense that the Xn cannot converge almost surely. If Xn converged almost surely,
then knowledge of the value of Xn should give us a lot of information about the value of
Xn+1. But if the Xn are iid, then Xn gives us no information about Xn+1. These two
views are consistent only if the Xn's are constant.
Example 6.7 can be generalized to any iid random variables Xn. Almost sure
convergence of Xn precludes them from being iid unless they are constants (exercise).
Exercises
1. Let Xn(ω) = 1 and Yn(ω) = I(ω > 1/n), where I denotes an indicator function. Does
Xn/Yn converge for every ω ∈ [0,1]? Does it converge almost surely to a random
variable? If so, what random variable?
2. Let
3. In the preceding problem, reverse the words "rational" and "irrational." Does Yn
converge almost surely to a random variable? If so, specify a random variable
that Yn converges almost surely to.
4. For each n, divide [0,1] into [0, 1/n), [1/n, 2/n), ..., [(n − 1)/n, 1], and let Xn be the
left endpoint of the interval containing ω. Prove that Xn converges almost surely to
a random variable, and identify the random variable.
5. Prove that if , then . Does ln(Yn) converge almost surely to a
finite random variable?
6. Suppose that Xn converges almost surely to X. Suppose further that the distribution
and density functions of X are F(x) and f(x), respectively. Let
In a study of patients with very advanced cancer, the time Xi from study entry to death is
observed for all n patients. Assume that Xi are iid with distribution function F(x). We
are trying to estimate F(x) for different x. For a fixed x, the number of patients dying by
time x is binomial with n trials and success probability F(x). If denotes the
proportion of patients dying by time x, then
Let Xi be as in Example 6.9, but now assume that F(x) has a unique median . That is,
there is a unique number such that and . Let be the median
from a sample of size n. That is, is the middle observation if n is odd, and the average
of the middle two observations if n is even. The event implies that , where
is the proportion of Xis that are less than or equal to . Therefore, if p = F( ),
then p < 1/2 and
Step 0: ω must reside somewhere in [0,1]. If X0(ω) is the indicator that ω ∈ [0,1], then
X0(ω) = 1 (Row 0 of Figure 6.1).
Step 1: Divide [0,1] in half, creating two bins [0, 1/2) and [1/2, 1] of size 1/2¹ =
1/2. ω must be in exactly one of the two bins. If X1(ω) and X2(ω) are the indicator
functions that ω is in the first and second bins, respectively, then exactly one of
X1(ω) and X2(ω) is 1 (Row 1 of Figure 6.1).
Step 2: Divide each of the bins in Step 1 in half, creating 4 mutually exclusive
bins: [0, 1/4), [1/4, 1/2), [1/2, 3/4), and [3/4, 1], each of length 1/2² = 1/4. Let
X3(ω), ..., X6(ω) be the indicator functions that ω lies in the first, second, third,
and fourth bins, respectively. For each ω, exactly one of X3(ω), X4(ω), X5(ω), and
X6(ω) is 1 because ω is in exactly one of the mutually exclusive and exhaustive
bins (Row 2 of Figure 6.1).
Step k: Divide each bin in Step k − 1 in half, creating 2^k bins of length 1/2^k and
define 2^k random variables Xi(ω) accordingly. More formally, n = 2⁰ + 2¹ + ... +
2^{k−1} + j for some j = 0, ..., 2^k − 1 corresponds to a specific bin at step k, and
Xn(ω) is the indicator that ω is in that bin. Exactly one of the 2^k random variables
Xi created at step k is 1.
Figure 6.1
The darkened parts of [0, 1] show where the Bernoulli random variables Xi(ω) in
Example 6.11 take the value 1.
As n → ∞, so does k, and the bin length tends to 0. Because ω is in only one of the 2^k
bins, |Xn(ω) − 0| = 0 except when ω is in that bin of length 1/2^k. Therefore,
P(|Xn − 0| > ε) ≤ 1/2^k → 0, so Xn → 0 in probability. But for each ω, exactly one of the 2^k random variables
Xi(ω) in Row k of Figure 6.1 is 1. Therefore, for each ω, Xn(ω) = 0 for infinitely many
n and Xn(ω) = 1 for infinitely many n. It follows that for no ω can the sequence of
numbers Xn(ω) converge. This shows that Xn → 0 in probability, but with probability 1, Xn(ω) fails to
converge.
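The moving-bins construction of Example 6.11 can be checked mechanically. The sketch below (Python; our own indexing helpers, not from the text) builds Xn(ω) for the first eleven steps and confirms both claims: each step contributes exactly one 1 for a fixed ω (so Xn(ω) cannot converge), while the bin containing ω shrinks (so P(Xn = 1) → 0):

```python
from fractions import Fraction

def bin_of(n):
    """Map index n to (k, j): bin j (0-based) at step k, where step k
    contributes the 2^k indices starting at 2^0 + ... + 2^(k-1)."""
    k, first = 0, 0
    while first + 2**k <= n:
        first += 2**k
        k += 1
    return k, n - first

def X(n, omega):
    """X_n(omega): indicator that omega lies in the bin of index n."""
    k, j = bin_of(n)
    lo, hi = Fraction(j, 2**k), Fraction(j + 1, 2**k)
    return 1 if (lo <= omega < hi or (j == 2**k - 1 and omega == 1)) else 0

omega = Fraction(3, 10)
vals = [X(n, omega) for n in range(2**11 - 1)]   # steps 0 through 10

# Exactly one indicator per step equals 1 at this omega: 11 ones in all.
assert sum(vals) == 11
# Within step 10 (indices 1023..2046) there is a single 1: the bin
# containing omega has probability 2^-10, illustrating X_n -> 0 in probability.
assert vals[1024:].count(1) == 1
```

The same check passes for any ω in [0, 1], mirroring the argument that Xn → 0 in probability while Xn(ω) oscillates between 0 and 1 for every ω.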
We have just seen that we can have convergence in probability but not almost surely, but
what about the other way around? It is by no means obvious, but almost sure
convergence implies convergence in probability. We prove this in Section 6.2.3.
If Xn → X in probability and Yn → Y in probability, then:
1. If also Xn → X′ in probability, then P(X = X′) = 1.
2. f(Xn) → f(X) in probability for any continuous function f. In fact, f : R → R need only be a Borel
function whose set D of discontinuities is a Borel set with P(X ∈ D) =
0.
3. Xn + Yn → X + Y in probability.
4. XnYn → XY in probability.
5. Xn/Yn → X/Y in probability, provided that P(Y = 0) = 0.
Exercises
(a) Xn → X in probability.
(b) For each ε > 0, P(|Xn − X| > ε) → 0 (that is, the ≥ symbol in Definition 6.8
can be replaced by >).
(c) For each ε > 0, P(Xn − X > ε) → 0 and P(Xn − X < −ε) → 0 as n → ∞.
6. Prove that if , then there exists an N such that for n > N.
7. Prove parts 1 and 2 of Proposition 6.12.
8. Prove parts 4 and 5 of Proposition 6.12.
9. Prove that in Example 6.10.
10. If X1,X2,... are iid random variables with a non-degenerate distribution function,
can Xn converge in probability to a constant c? If so, give an example. If not, prove
that it cannot happen.
11. If X1, ..., Xn are identically distributed (not necessarily independent) with E(|Xi|) <
∞, then Xn/n → 0 in probability. Hint: P(|Xn/n| > ε) = P(|X1| > nε), where nε → ∞ as n → ∞.
6.1.3 Convergence in Lp
Another common measure of the closeness of an estimator θ̂ to a parameter θ is the
mean squared error (MSE), E(θ̂ − θ)². This is analogous to using Σi (xi − yi)² as a measure
of how close vectors x and y are. But just as we take the square root of Σi (xi − yi)² to get
the Euclidean distance between two vectors x and y, we take the square root of the MSE
to get a distance between θ̂ and θ. This so-called L2 distance is a special case of the Lp
distance defined as follows.
If X and Y are two random variables with finite pth moment, p > 0, the Lp distance
between X and Y is defined to be {E(|X − Y|^p)}^{1/p}.
It can be seen that for p ≥ 1, the Lp distance satisfies the criteria of a distance measure
d, namely that 1) d(X, Y) ≥ 0, with d(X, Y) = 0 if and only if X = Y, 2) d(X, Y) = d(Y, X),
and 3) d(X, Z) ≤ d(X, Y) + d(Y, Z). The first two properties are obvious, and the third follows
easily from Minkowski's inequality (Proposition 5.24).
Closely related to the L2 distance between two random variables is the mean squared
error (MSE), as noted above. Let X be an estimator of a parameter θ, and Y be the
constant θ. Assuming X has a mean and finite variance, the MSE is:
E(X − θ)² = var(X) + bias²,
where bias is E(X) − θ. If X is unbiased (i.e., the bias is 0), then its variance, the square
of the L2 distance between X and θ, indicates how close X tends to be to θ. The L1
distance E(|X − Y|) is also commonly used in statistics.
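The bias-variance decomposition of the MSE can be verified exactly on a toy discrete estimator. In the sketch below (Python; the values, probabilities, and θ are illustrative choices of ours, not from the text), E(X − θ)² computed directly agrees with var(X) + bias²:

```python
# Toy discrete estimator X of theta (illustrative numbers).
values = [1.0, 2.0, 4.0]
probs  = [0.5, 0.3, 0.2]
theta  = 2.5

mean = sum(p * v for p, v in zip(probs, values))              # E(X)
var  = sum(p * (v - mean)**2 for p, v in zip(probs, values))  # var(X)
bias = mean - theta                                           # E(X) - theta
mse  = sum(p * (v - theta)**2 for p, v in zip(probs, values)) # E(X - theta)^2

# MSE = variance + bias^2, exactly (up to float rounding).
assert abs(mse - (var + bias**2)) < 1e-12
```

Here the decomposition gives 1.29 + (−0.6)² = 1.65, matching the direct computation of E(X − θ)².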
Proposition 6.16.
Corollary 6.18. Sample mean of iid observations with finite variance is consistent
If X1, X2, ... are iid with mean μ and finite variance σ², then the sample mean X̄n
converges to μ in L2.
Proof. This follows from Proposition 6.17 and the fact that X̄n has bias 0 and variance
σ²/n.
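The σ²/n rate in Corollary 6.18 can be confirmed by exact enumeration rather than simulation. The sketch below (Python; n and p are illustrative) enumerates all 2^n outcomes for the mean of n iid Bernoulli(p) variables and checks that E{(p̂ − p)²} equals p(1 − p)/n:

```python
from itertools import product

n, p = 6, 0.3
mse = 0.0
for outcome in product([0, 1], repeat=n):
    successes = sum(outcome)
    prob = p**successes * (1 - p)**(n - successes)  # P(this outcome)
    p_hat = successes / n                           # sample mean
    mse += prob * (p_hat - p)**2                    # contribution to E(p_hat - p)^2

# Unbiased estimator, so MSE = var(p_hat) = p(1 - p)/n exactly.
assert abs(mse - p * (1 - p) / n) < 1e-12
```

Since the L2 distance between p̂ and p is the square root of this quantity, it is √(p(1 − p)/n) → 0, which is the L2 convergence asserted by the corollary.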
Proposition 6.19.
Exercises.
1. Let
Show that Xn → 0 in L1, but not in L2. Are there any examples in which Xn → 0
in L2, but not in L1?
2. What does Proposition 6.19 imply about the MSE of two-sample estimators such
as the difference in means or proportions when their one-sample MSEs converge
to 0?
3. Prove part 1 of Proposition 6.19.
4. Show by a counterexample that Xn converging to X in Lp and Yn converging to Y
in Lp does not necessarily imply that XnYn converges to XY in Lp.
5. Show by counterexample that Xn converging to X in Lp does not necessarily imply
that f(Xn) converges to f(X) in Lp for a continuous function f.
6. Prove that if , then .
which converges to
Not only does this not coincide with the distribution function in Equation (6.6), but it is
not even a distribution function because it is not right continuous at x = 1. Therefore,
requiring that Fn(x) converge to F(x) for every x seems too strong. The problem is at the
lone point x = 1 where F(x) is discontinuous. At all continuity points x of F(x), Fn(x) →
F(x). This leads us to the following definition.
Let Xn and X have distribution functions Fn(x) and F(x), respectively. Then Xn is said
to converge in distribution to X if Fn(x) → F(x) for every continuity point x of F(x). We
write Xn → X in distribution.
Example 6.21.
Many biological and other applications involve testing whether data are uniformly
distributed over time or space. For example, we might be interested in whether heart
attacks are equally likely to occur any time of day. However, we usually do not know
the exact times of patients' heart attacks. Therefore, we may have to settle for identifying
the hour during which it occurred. The question is, if we can pin down the time to a
sufficiently small interval, can we ignore the fact that the data are actually from a
discrete, rather than continuous, uniform distribution? To be more concrete, consider a
setting in which the original interval of time or space has been scaled to have length 1.
Suppose we first identify which half of the interval contains the observation. Having
identified the correct half, we then identify which half of that half contains the
observation, etc. After n steps, the observation should have a discrete uniform
distribution on the dyadic rationals of order n. We will show that this discrete uniform
converges in distribution to a continuous uniform.
Let Xn have a discrete uniform distribution on the dyadic rationals of order n. That is,
P(Xn = i/2^n) = 1/2^n, i = 1, ..., 2^n. We show that Xn → X in distribution, where X is uniform (0,1). Let xd be a
dyadic rational of some order. That is, xd = i/2^m for some m and some i = 1, ..., 2^m. Then
for n ≥ m, Fn(xd) = i 2^{n−m}/2^n = xd. Therefore, Fn(xd) → xd for every dyadic rational
number. Now suppose that x is an arbitrary number in the interval (0,1). Let x1 and x2
be arbitrary dyadic rationals with x1 < x < x2. Then Fn(x1) ≤ Fn(x) ≤ Fn(x2). It follows that
lim sup Fn(x) ≤ lim Fn(x2) = x2. Similarly, lim inf Fn(x) ≥ lim Fn(x1) = x1. Because x1 < x and x2 > x are
arbitrary dyadic rationals and we can find dyadic rationals arbitrarily close to x (either
less than or greater than x), lim sup Fn(x) ≤ x and lim inf Fn(x) ≥ x; i.e., Fn(x) → x. Therefore,
Fn(x) → F(x). This being true for every x ∈ (0,1), Xn → X in distribution, where X is uniform (0,1).
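The convergence in Example 6.22 is very concrete: the discrete uniform distribution function is within 2^{−n} of the uniform one everywhere. A quick numerical sketch (Python; the evaluation points are arbitrary choices of ours):

```python
from math import floor

def F_n(n, x):
    """Distribution function of the discrete uniform on {i/2^n : i = 1, ..., 2^n},
    evaluated at x in (0, 1): the fraction of support points <= x."""
    return floor(2**n * x) / 2**n

# |F_n(x) - x| <= 2^-n for every x in (0, 1), so F_n(x) -> F(x) = x.
for n in [1, 5, 10, 20]:
    for x in [0.1, 0.3, 1/3, 0.75, 0.999]:
        assert abs(F_n(n, x) - x) <= 2**-n
```

This uniform bound is stronger than pointwise convergence and explains why, for fine enough discretization, the discrete and continuous uniform models are statistically interchangeable.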
The same technique used in Example 6.22 can be used to prove the following result.
Proof. Exercise.
Proof.
For each positive x, Fn(x) = Σ_{k=0}^{⌊x⌋} P(Xn = k), where ⌊x⌋ is the greatest integer less than or equal
to x. Each term of the finite sum converges to the corresponding Poisson probability, so
Fn(x) → F(x), where X is Poisson with parameter λ, completing the
proof.
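The termwise convergence in the proof is easy to see numerically. The sketch below (Python; λ and n are illustrative) compares the binomial(n, λ/n) probabilities with the Poisson(λ) probabilities for small k:

```python
from math import comb, exp, factorial

lam, n = 2.0, 10_000
p = lam / n
for k in range(6):
    binom   = comb(n, k) * p**k * (1 - p)**(n - k)   # P(X_n = k)
    poisson = exp(-lam) * lam**k / factorial(k)      # P(X = k), X ~ Poisson(lam)
    # Each binomial term is already within 10^-3 of its Poisson limit.
    assert abs(binom - poisson) < 1e-3
```

The discrepancy shrinks as n grows (roughly like λ²/n), which is why the Poisson approximation to the binomial is so accurate for rare events.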
Researchers conducting multiple statistical tests in the same study recognize that the
probability of at least one error, called the familywise error rate (FWE), is inflated if
each test uses level α. The Bonferroni correction using α/n for each of the n tests ensures
that the FWE is no greater than α. The Bonferroni correction
tends to become quite conservative if the test statistics are highly correlated, especially
if the number of comparisons is large. An interesting question is: what happens to the
FWE for a large number of independent comparisons?
The probability of exactly k errors assuming the null hypothesis for each test is
computed as follows. For each test, whether an error is made is Bernoulli with
probability pn = α/n. Therefore, the total number of errors Xn is binomial with
parameters n and pn. By the law of small numbers (Proposition 6.24), Xn converges in
distribution to a Poisson with parameter α. The probability of no
errors tends to exp(−α). Typically, α is .05, in which case the probability
exp(−.05) of no errors is approximately .951. Accordingly, the FWE is approximately 1
− .951 = .049. Because the FWE is very close to the intended value of .05, the
Bonferroni method is only slightly conservative for a large number of independent
comparisons.
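For n independent level-(α/n) tests the FWE is exactly 1 − (1 − α/n)^n, so the limiting value 1 − exp(−α) can be checked directly (a sketch in Python; α = .05 as in the text):

```python
from math import exp

alpha = 0.05

def fwe(n):
    """Exact FWE for n independent tests, each at level alpha/n."""
    return 1 - (1 - alpha / n)**n

# The exact FWE approaches 1 - exp(-alpha) ~ .049 as n grows.
assert abs(fwe(10_000) - (1 - exp(-alpha))) < 1e-6
assert round(1 - exp(-alpha), 3) == 0.049
```

Even for moderate n the exact FWE is already within a fraction of a percent of the limit, consistent with the claim that Bonferroni is only slightly conservative under independence.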
Example 6.26. Relative error in estimating tail probabilities may not tend to 0
Convergence in distribution is also called weak convergence, and for good reason. The
random variables Xn need not be close to X in any conventional sense. In fact, because
convergence in distribution depends only on the distributions of the random variables,
there is no requirement for X1,..., Xn,... to be defined on the same probability space. In
contrast, convergence almost surely, in probability, and in Lp do require the random
variables to be defined on the same probability space. Fortunately, the following very
useful result shows that if Xn converges in distribution to X, we can define random
variables on a common probability space that have the same marginal distributions and
converge almost surely.
The Skorokhod representation theorem greatly facilitates proofs, as we see in the next
result.
Proof of the forward direction. Notice that whether E{f(Xn)} → E{f(X)} depends only on the marginal
distributions of Xn and X, not on joint distributions. Therefore, it suffices to prove that
E{f(X̃n)} → E{f(X̃)}, where X̃n and X̃ are as stated in Theorem 6.28. Because X̃n → X̃ almost
surely and f is continuous, f(X̃n) converges almost surely to f(X̃) by part 2 of
Proposition 6.5. Because f is bounded, the bounded convergence theorem (Theorem
5.13) implies that E{f(X̃n)} → E{f(X̃)}, completing the proof.
Proposition 6.30.
Proof. Exercise.
Exercises
Figure 6.2
Connections between modes of convergence.
If Xn → X in Lp, then Xn → X in probability.
The statement that the sample mean converges in probability to μ is known as a law of
large numbers, or in the common vernacular, a law of averages. In the next chapter, we
will strengthen this result.
Note that convergence in probability does not imply convergence in Lp, as the following
example shows.
Let
If Xn → X in probability, then Xn → X in distribution.
Proof. The easiest way to prove this uses Proposition 6.29 (exercise). However, the
following proof using only the definition of convergence in distribution is instructive.
Let Fn and F be the distribution functions of Xn and X, and let x be a continuity point of F. Let
ε > 0 be such that x − ε and x + ε are also continuity points of F. Notice that
Similarly,
We have shown that lim sup Fn(x) ≤ F(x + ε) and lim inf Fn(x) ≥ F(x − ε). But ε was arbitrary subject only
to x − ε and x + ε being continuity points of F. We can find a sequence εk ↓ 0 such that
x − εk and x + εk are continuity points of F (exercise). Therefore, lim Fn(x) = F(x),
proving that Fn(x) → F(x) for each continuity point x of F.
The simplest example showing that the reverse direction does not hold is when the Xn
have the same distribution, but are on different probability spaces (see Remark 6.27).
Then Xn converges in distribution to X1, but Xn cannot converge in probability if they
are on different probability spaces. Another example is the following.
Let X1, X2, ... be iid Bernoulli (1/2) random variables. Then Xn converges in
distribution to X1, but Xn does not converge in probability to X1 because
P(|Xn − X1| > 1/2) = P(Xn ≠ X1) = 1/2 for each n ≥ 2.
Xn → X almost surely if and only if, for each ε > 0, P(|Xn − X| > ε i.o.) = 0.
Examples 6.7 and 6.11 and Proposition 6.37 illustrate that the concept of events
occurring infinitely often in n is very useful in convergence settings. The following is an
indispensable tool for determining the probability of events occurring infinitely often.
1. If An, n = 1, 2, ... is an infinite sequence of events and Σ P(An) < ∞, then P(An i.o.) =
0. This remains true if P is replaced by an arbitrary measure μ.
2. If the An are independent and Σ P(An) = ∞, then P(An i.o.) = 1.
Example 6.39.
Suppose that the Bernoulli parameter pn had been 1/n instead of 1/n². Because Σ 1/n = ∞
and the events {Xn = 1} are independent, part 2 of the Borel-Cantelli lemma implies
that P(Xn = 1 i.o.) = 1. Of course P(Xn = 0 i.o.) = 1 as well, so P(Xn = 0 i.o. and Xn = 1 i.o.) = 1.
But each ω for which Xn(ω) = 0 i.o. and Xn(ω) = 1 i.o. is an ω for which Xn(ω) does
not converge. Therefore, if pn had been 1/n, Xn would not have converged almost surely
to any random variable, although Xn would still have converged to 0 in probability.
Even though Xn does not converge almost surely to 0 if pn = 1/n, the subsequence X_{n²}
does converge almost surely to 0 by part 1 of the Borel-Cantelli lemma:
P(X_{n²} = 1 i.o.) = 0 because Σ 1/n² < ∞. We will soon see that this example is typical in
that it is always possible to find a subsequence converging almost surely to X whenever
Xn → X in probability.
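The dichotomy in Example 6.39 rests on two partial-sum facts: Σ 1/n diverges while Σ 1/n² converges. A numerical sketch (Python; the cutoff 10⁶ is arbitrary):

```python
# Partial sums behind Example 6.39.
N = 1_000_000
harmonic = sum(1/n for n in range(1, N + 1))      # diverges like log(N)
p_series = sum(1/n**2 for n in range(1, N + 1))   # converges to pi^2/6

# Divergence: the harmonic partial sum has already passed 14
# (log(10^6) + Euler's constant is about 14.39).
assert harmonic > 14
# Convergence: the p-series partial sums stay below pi^2/6 ~ 1.6449341.
assert p_series < 1.6449341
```

By part 2 of Borel-Cantelli, the divergent sum forces Xn = 1 infinitely often when pn = 1/n; by part 1, the convergent sum along the subsequence n² forces X_{n²} = 1 only finitely often.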
Exercises
6. Using the same technique as in the proof of the first part of the Borel-Cantelli
lemma (namely, using a sum of indicator random variables) compute the following
expected numbers:
(a) One method of testing whether basketball players have hot and cold
streaks is as follows. Let Xi be 1 if shot i is made and −1 if it is missed. The
number of sign changes in consecutive shots measures how streaky the
player is: a very small number of sign changes means the player had long
streaks of made or missed shots. For example, −1, −1, −1, +1, +1, +1, +1, +1,
+1, +1 contains only one sign change and has a streak of 3 missed shots
followed by 7 made shots. Under the null hypothesis that the Xi are iid
Bernoulli p (i.e., streaks occur randomly), what is the expected number of
sign changes in n shots? Hint: let Yi = I(Xi ≠ Xi−1).
(b) What is the expected number of different values appearing in a bootstrap
sample (a sample drawn with replacement) of size n from {x1, ..., xn}? Show
that the expected proportion of values not appearing in the bootstrap sample is
approximately exp(−1) if n is large.
7. Regression analysis assumes that errors from different observations are
independent. One way to test this assumption is to count the numbers n+ and n− of
positive and negative residuals, and the number nR of runs of the same sign (see
pages 198-200 of Chatterjee and Hadi, 2006). For example, if the sequence of
signs of the residuals is + + − − − + − − − −, then n+ = 3, n− = 7, and nR = 4.
Assume that the residuals are exchangeable (each permutation has the same joint
distribution). Using indicator functions, prove that the expected number of runs,
given n+ and n−, is 2n+n−/(n+ + n−) + 1.
8. Suppose that Y1, Y2,... are independent. Show that if and only if
for some constant B.
9. Let A1, A2, ... be a countable sequence of independent events with P(Ai) < 1 for
each i. Then P(∪ Ai) = 1 if and only if P(Ai i.o.) = 1.
10. Prove that if A1,..., An are events such that , then .
11. Give an example to show that independence is required in the second part of the
Borel-Cantelli lemma.
12. * Show that almost sure convergence and convergence in Lp do not imply each
other without further conditions. Specifically, use the Borel-Cantelli lemma to
construct an example in which Xn takes only two possible values, one of which is
0, and:
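Two of the claims in the exercises above can be checked numerically. The sketch below (Python; the sample sizes and the (n+, n−) values are illustrative) verifies the exp(−1) limit in exercise 6(b) and the expected-runs formula in exercise 7 by exhaustive enumeration over equally likely arrangements:

```python
from itertools import combinations
from math import exp

# Exercise 6(b): each original value is missing from a bootstrap sample
# of size n with probability (1 - 1/n)^n, which tends to exp(-1).
def prop_missing(n):
    return (1 - 1/n)**n

assert abs(prop_missing(1_000_000) - exp(-1)) < 1e-6

# Exercise 7: under exchangeability every arrangement of n+ plus signs and
# n- minus signs is equally likely, so average the run count over all of them
# and compare with 2 n+ n- / (n+ + n-) + 1.
def expected_runs(n_plus, n_minus):
    n = n_plus + n_minus
    total = count = 0
    for plus_positions in combinations(range(n), n_plus):
        signs = [1 if i in plus_positions else -1 for i in range(n)]
        total += 1 + sum(signs[i] != signs[i - 1] for i in range(1, n))
        count += 1
    return total / count

assert abs(expected_runs(3, 7) - (2*3*7/10 + 1)) < 1e-12
assert abs(expected_runs(2, 2) - (2*2*2/4 + 1)) < 1e-12
```

The enumeration is exact, so the agreement is to floating-point precision rather than Monte Carlo accuracy.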
Suppose (n1, n2, ...) is a sequence and (i1, i2, ...) is a subsequence. That is, ij = n_{kj}
for some k1 < k2 < ... Letting (N) denote (n1, n2, ...) and (I) denote (i1, i2, ...), we write
(I) ⊂ (N). We use the term subsequence to refer to either (I) or xi for i ∈ (I).
Proof. It is clear that if xn → x, then every subsequence xm, m ∈ (M), also converges to
x. To prove the other direction, we use the contrapositive. If xn does not converge to x,
then we can find an ε > 0 and a subsequence (K) such that |xk − x| > ε for all k ∈ (K). But
then clearly no further subsequence of (K) can converge to x. Therefore, if xn does not
converge to x, then it cannot be true that every subsequence contains a further
subsequence converging to x. By the contrapositive, this means that if every
subsequence contains a further subsequence converging to x, then xn → x.
Proof. Suppose first that Xn → X in probability, and let (M) be a subsequence. For a fixed number
k, ε = 1/k is a positive number and P(|Xm − X| > 1/k) → 0 as m → ∞ along (M). This means we
can find mk ∈ (M) such that
The proof of the reverse direction for part 1 is very similar to the proof of Proposition
6.41, and is left as an exercise.
Subsequence arguments can be used to prove many useful results. For instance, consider
part 3 of Proposition 6.12. Let (M) be an arbitrary subsequence of (N) = (1, 2, ...). We
will prove that there is a further subsequence (I) such that Xi + Yi → X + Y almost surely along (I).
Because Xm → X in probability along (M), there exists a subsequence (J) such that Xj → X almost surely along (J) by
Proposition 6.42. Because Yj → Y in probability along (J), Proposition 6.42 implies that there exists a
further subsequence (I) such that Yi → Y almost surely along (I). Along the subsequence (I), Xi → X almost surely and
Yi → Y almost surely. By Proposition 6.5, Xi + Yi → X + Y almost surely along (I). We have shown that an arbitrary
subsequence (M) has a further subsequence (I) such that Xi + Yi → X + Y almost surely along (I). By
Proposition 6.42, Xn + Yn → X + Y in probability, completing the proof. The same type of subsequence
argument can be used for the other parts of Proposition 6.12 as well. We leave this as an
exercise.
Subsequence arguments can also be used for convergence in distribution. The proof of
the following result is very similar to the proof of Proposition 6.41 and is left as an
exercise.
Notice that it is not sufficient for every subsequence to contain a further subsequence that
converges in distribution; the limit must be the same distribution, as the following example
shows.
Example 6.45.
Let Xn be N(0,1) if n is odd and N(0,2) if n is even. Then every subsequence contains a
further subsequence converging to a distribution function, but Xn does not converge in
distribution because the odd-indexed terms converge in distribution to N(0,1), while the
even-indexed terms converge in distribution to N(0,2).
Sketch of Proof. Let r1, r2,... be an enumeration of the rational numbers. The ith row of Table
6.1 shows the set {Fn(ri), n = 1, 2,...}, a bounded infinite set of numbers. Consider the first row.
By the Bolzano-Weierstrass theorem (Theorem A.17), there is a limit point, so there is a
subsequence (M1) such that Fm(r1) → y1 as m → ∞ along (M1). For instance, that
subsequence might be the bolded elements 1,3,4,6,7,... shown in row 1. In the second
row, consider the bounded infinite set {Fm(r2), m ∈ (M1)}. Again by the Bolzano-Weierstrass
theorem, there must be a further subsequence (M2) such that Fm(r2) → y2 as m → ∞
along (M2). For instance, that subsequence of 1,3,4,6,7,... might be 1,4,6,7,..., the
bolded sequence in row 2. Continuing in this fashion, we can find in row i + 1 a
subsequence (Mi+1) of (Mi) such that Fm(ri+1) → yi+1 as m → ∞ along (Mi+1). Now
define a subsequence (M) by taking the first bolded element of row 1, the second bolded
element of row 2, the third bolded element of row 3, etc. In Table 6.1, this corresponds
to the subsequence (M) = (1,4,7,...). Along this subsequence, Fm(ri) → yi = F(ri) as m → ∞
along (M) for each rational ri.
Table 6.1
Notice that the function F(x) in Helly's selection theorem satisfies two of the three
conditions of a distribution function. Conspicuously absent is the condition that
F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞. The next example shows that this condition is not necessarily
satisfied.
Let Xn take the values −n and +n with probability 1/2 each.
Then for any subsequence (M), the distribution function Fm for Xm satisfies Fm(x) →
1/2 as m → ∞ along (M). Therefore, there is no subsequence (M) such that Fm(x)
converges to a distribution function F(x) as m → ∞ along (M).
To ensure that the limiting F(x) is a distribution function, we need another condition
called tightness.
The sequence of distribution functions F1(x), F2(x),... is said to be tight if for each ε >
0 there is a number M such that Fn(−M) < ε and 1 − Fn(M) < ε for all n.
It is clear that the sequence of distribution functions in Example 6.47 is not tight because
for any M, 1 − Fn(M) = 1/2 for all n > M. Thus, tightness precludes examples such as
Example 6.47.
Corollary 6.50.
If Xn converges in distribution to X, then the sequence of distribution functions Fn is tight.
We close this section by collecting part 2 of Proposition 6.5, part 2 of Proposition 6.12,
and Proposition 6.30 into a single continuous mapping theorem, also called the Mann-
Wald theorem. The idea is that if Xn converges to X in some manner, then a continuous
function g(Xn) ought to converge to g(X) in the same manner. We proved these results
earlier, but we are including them here because they can also be proven using
subsequence arguments (see Exercise 4, for example).
If g is a continuous function and Xn converges to X
1. almost surely,
2. in probability, or
3. in distribution,
then g(Xn) converges to g(X) in the same sense.
Exercises
3. Determine and demonstrate whether the distribution functions associated with the
following sequences of random variables are tight.
(a)
Let y, y − ε, and y + ε be continuity points of the distribution function F(y) for Y. This is
always possible because there are only countably many discontinuity points of F.
Similarly,
Many estimators θ̂n are asymptotically normally distributed with mean θ and some
variance vn, meaning that (θ̂n − θ)/√vn converges in distribution to Z, where Z is N(0,1). If v̂n is an estimator of
the variance such that v̂n/vn converges in probability to 1, then Slutsky's theorem implies that (θ̂n − θ)/√v̂n also converges in distribution to Z.
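This use of Slutsky's theorem can be checked by simulation. The sketch below is our own illustration (the sample distribution, sample sizes, and function names are our choices, not the text's): it studentizes sample means of exponential data, replacing the true variance by the sample variance, and checks that roughly 95% of the studentized values land in [−1.96, 1.96].

```python
import math
import random

def studentized_means(n, reps, rng):
    """Simulate reps values of sqrt(n)*(xbar - mu)/s for exponential(1)
    samples (mu = 1). By the CLT plus Slutsky's theorem, these are
    approximately N(0,1) when n is large."""
    zs = []
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        zs.append(math.sqrt(n) * (xbar - 1.0) / s)
    return zs

zs = studentized_means(n=200, reps=4000, rng=random.Random(1))
coverage = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(round(coverage, 3))  # close to the nominal 0.95
```

Replacing √vn by √v̂n changes nothing in the limit precisely because v̂n/vn → 1 in probability.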
Exercises
6. What is the problem with the following attempted proof of Slutsky's theorem? As
noted in the first paragraph of the proof, it suffices to prove that if Xn converges in distribution to X and Yn converges in probability to c,
then Xn + Yn converges in distribution to X + c.
Definition 6.54.
Proof. We prove part 3 when f is continuous on Rk. Let Yn = f(Xn), and Y = f(X). If
g is a bounded continuous function, then h(Xn) defined as g(f(Xn)) is a bounded
continuous function of Xn. Because Xn converges in distribution to X, E{h(Xn)} → E{h(X)} by Definition 6.58. That is,
E{g(Yn)} → E{g(Y)}. Because this is true for any bounded continuous function g, Yn converges in distribution to Y.
Exercises
1. Show that Proposition 6.55 is not necessarily true for convergence in distribution.
For example, let (Xn,Yn) be bivariate normal with means (0,0), variances (1,1)
and correlation
6.4 Summary
1. Xn can converge to X: almost surely, in probability, in Lp, or in distribution.
2. *** To compute the expected number of events A1, A2,... that occur, form
the sum of indicators I(A1) + I(A2) + ... and take the expected value: E{∑ I(Ai)} = ∑ P(Ai).
3. *** Borel-Cantelli: if ∑ P(Ai) < ∞, then P(Ai i.o.) = 0; if the Ai are independent
and ∑ P(Ai) = ∞, then P(Ai i.o.) = 1.
We defer proofs to Section 7.2. For now we consider carefully the meaning of the
WLLN and SLLN. Note first that the SLLN implies the WLLN because almost sure
convergence implies convergence in probability (Proposition 6.37). The sample mean
X̄n is an estimator of μ = E(X), so the WLLN says that there is high probability that this
estimator will be within ε of μ when the sample size is large enough. To understand the
SLLN, consider a specific ω, which amounts to considering a specific infinite sequence
x1, x2,..., where xi = Xi(ω). For some ω the sequence x1, x2,... results in sample means
x̄1, x̄2,... that converge to μ. For others it might converge to a different number or to ±∞,
or it might fail to converge. For example, suppose that X1, X2,... are iid standard
normals. One ω might generate the infinite sample 1,1,...,1,..., in which case the
sequence of observed sample means is also 1,1,..., which converges to 1 ≠ μ = 0.
Another ω might generate the sequence −1, 1, −3, 3, −5, 5,... In that case the observed
sample mean is 0 if n is even and −1 if n is odd, so it fails to converge for that ω.
These ω each have probability 0, as does every other single ω in this experiment.
However, the SLLN says that the set of ω, and therefore the set of infinite samples, for
which the observed sequence of sample means converges to μ has probability 1. Therefore, the
probability that our infinite sample is one for which X̄n either fails to converge or
converges to something other than μ is 0.
If X1, X2,... are iid Bernoulli (p) random variables, X̄n is the proportion of X1, X2,...,
Xn that are 1. The WLLN says that X̄n converges in probability to E(X1) = p. In other
words, the probability that X̄n differs from p by more than ε tends to 0 as n → ∞, no
matter how small ε is. The SLLN says that the set of ω such that X̄n(ω) converges to p has
probability 1. Again it is helpful to think about different infinite samples of outcomes xi
= Xi(ω), i = 1, 2,... One ω might generate 0,0,...,0,..., in which case x̄n → 0. Another ω
might generate 1,1,...,1,..., in which case x̄n → 1, etc. The SLLN says that the set of
infinitely long strings of 0s and 1s such that x̄n converges to p has probability 1.
Example 7.2.
We can recast the WLLN and SLLN for Bernoulli (1/2) random variables in terms of
Lebesgue measure. Recall that when ω is drawn from the uniform distribution on [0,1], we can write ω in base 2 as
0.X1X2..., where the Xi are iid Bernoulli (1/2) random variables. Therefore, the
WLLN states that with high probability, the proportion of ones in the first n digits of the
base 2 representation of a number drawn randomly from [0,1] will be close to 1/2 if n is
large enough. The SLLN states that the proportion of ones in the base 2 representation of
ω tends to 1/2 for all ω in a set of Lebesgue measure 1.
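This is easy to watch happen numerically. In the sketch below (our own illustration), we generate the base 2 digits of a uniform draw directly as iid Bernoulli(1/2) bits and track the running proportion of ones, which settles near 1/2 as the SLLN predicts.

```python
import random

# The base 2 digits of a uniform [0,1] draw are iid Bernoulli(1/2),
# so we generate the digit sequence directly and track the running
# proportion of ones; the SLLN says it tends to 1/2.
rng = random.Random(7)
ones, running = 0, {}
for i in range(1, 100001):
    ones += rng.getrandbits(1)
    if i in (100, 10000, 100000):
        running[i] = ones / i
print(running)
```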
Example 1.1 involved flipping an unfair coin with probability p ≠ 1/2 of heads infinitely many
times and letting Xi be the indicator that the ith flip is heads, i = 1, 2,... We stated
without proof that the random variable Y with base 2 representation 0.X1X2... has no density
function. We are now in a position to prove this. Let μY be the probability measure for
the random variable Y, and let A be the set of numbers ω ∈ [0,1] whose base 2
representations have the property that the proportion of ones in the first n digits of ω
converges to p as n → ∞. Then μY(A) is the probability that (X1 + ... + Xn)/n → p as n
→ ∞, which is 1 by the SLLN. If Y had a density f(y), then μY(A) = ∫A f(y) dy. But the
Lebesgue measure of A is 0 because, under Lebesgue measure, the digits of the base 2
representation of a number y in [0,1] are iid Bernoulli (1/2); with probability 1 the
proportion of ones must converge to 1/2 ≠ p as n → ∞ by the SLLN. Therefore, f(y)I(y ∈ A) = 0
except on a set of Lebesgue measure 0. It follows from Proposition 5.3 that
∫A f(y) dy = 0. Therefore, if Y had a density f(y), then 1 = μY(A) = ∫A f(y) dy = 0,
a contradiction. It follows that Y cannot have a density. See also Example 5.33, which
shows that permuted block randomization leads to a base 2 representation that also has
no density. It is interesting that too much forced imbalance from using an unfair coin or
too much forced balance from using permuted block randomization both lead to random
numbers with no density. Simple randomization is "just right" in that it leads to a
random number that has a density (the uniform density) on [0,1].
Another way to look at this example is to imagine how the density f(y) of Y would have
to appear, if it existed. Suppose p = 2/3. The density would have to be twice as high on
the second half as on the first half of [0,1]. The density's relative appearance on [0,1/2]
must be a microcosm of its appearance on [0,1]; it is twice as high on the second half of
[0,1/2] as it is on the first half, etc. The same is true on the two halves of [1/2,1], and
likewise on intervals of length 1/4, 1/8, etc. This self-similarity property causes no
problem when p = 1/2 because then the density f(y) is constant. When p ≠ 1/2, it is
impossible to draw a density that has this self-similarity property. The reader can gain
an appreciation of the problem by performing the following simulation. For n = 10,
generate thousands of replications of X1/2 + X2/2² + ... + Xn/2ⁿ, where the Xi are iid Bernoulli (p). Make a
histogram. Then repeat with n = 20 and n = 50.
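A minimal version of that simulation (our own sketch; the function and variable names are ours) with p = 2/3 shows the effect: at every n, the histogram mass on the upper half of [0,1] is roughly twice that on the lower half, and the same imbalance repeats inside each half, which is the self-similarity that rules out a density.

```python
import random

def y_trunc(n, p, rng):
    """One replication of X1/2 + X2/2**2 + ... + Xn/2**n, Xi iid Bernoulli(p)."""
    return sum((1 if rng.random() < p else 0) / 2 ** i for i in range(1, n + 1))

def histogram(values, bins):
    """Counts of values falling in bins equal subintervals of [0,1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

rng = random.Random(3)
p = 2 / 3
for n in (10, 20, 50):
    vals = [y_trunc(n, p, rng) for _ in range(5000)]
    print(n, histogram(vals, 8))  # mass on [1/2,1] is about twice that on [0,1/2)
```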
Return to Example 6.9, involving estimation of the distribution function F(x). We used
the fact that nF̂n(x) is binomial (n, p = F(x)) to prove that F̂n(x) converges in
probability to F(x). We can strengthen this result because F̂n(x), being the sample mean of
n iid Bernoullis I(Xi ≤ x), converges almost surely to E{I(X1 ≤ x)} = F(x) by the SLLN. Therefore, at
each point x, the empirical distribution function converges almost surely to the true
distribution function. Likewise, we can strengthen Example 6.10 to prove that when the
distribution function has a unique median m, the sample median converges not just in
probability, but almost surely, to m (exercise).
We can say even more than what is stated in Example 7.4. Recall that in Example 6.9,
we showed that P{|F̂n(x) − F(x)| > ε} ≤ F(x){1 − F(x)}/(nε²) for all x. Also,
F(x){1 − F(x)} ≤ 1/4. Therefore, P{|F̂n(x) − F(x)| > ε} ≤ 1/(4nε²) for all x. This tells us
that the convergence in probability of F̂n(x) to F(x) is uniform in x.
The Glivenko-Cantelli theorem stated below strengthens this to uniform almost sure
convergence of F̂n(x) to F(x).
The sample distribution function F̂n(x) for iid random variables X1,..., Xn satisfies
sup_x |F̂n(x) − F(x)| → 0 almost surely as n → ∞.
We do not prove the Glivenko-Cantelli theorem, but note that Example 7.4 shows that
for each fixed x, P{F̂n(x) → F(x)} = 1. It follows that P{F̂n(r) → F(r) for all rational r} = 1. Because
the set of rational numbers is a dense set, F̂n(x) → F(x) at each continuity point x of F for ω outside a null set. This
shows a small part of what is needed, but it does not show that the convergence holds for all
x, not just continuity points of F, and it does not show that the convergence is uniform in x.
The proof is simplified considerably under the additional assumption that F is
continuous. See Pólya's theorem (Theorem 9.2) for details.
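For a concrete check (our own sketch, not part of the text), the sup distance in the Glivenko-Cantelli theorem can be computed exactly for uniform [0,1] data, because the supremum of |F̂n(x) − x| is attained at the jump points of the empirical distribution function.

```python
import random

def sup_ecdf_distance(sample):
    """sup_x |Fhat_n(x) - F(x)| for F(x) = x (uniform [0,1] data).
    At the kth order statistic, Fhat_n jumps from k/n to (k+1)/n,
    so the supremum is attained at one of these jumps."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(xs))

rng = random.Random(11)
for n in (100, 1000, 10000):
    d = sup_ecdf_distance([rng.random() for _ in range(n)])
    print(n, round(d, 4))  # tends to 0 as n grows
```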
It follows that
By the SLLN applied to the random variables Xi², (1/n)∑ Xi² converges almost surely to E(X1²).
Furthermore, n/(n − 1) → 1 and X̄n → μ almost surely, so Expression (7.2) converges almost surely to (1){E(X1²) −
μ²} = σ². This shows that the sample variance converges almost surely to the population
variance σ². Likewise, the pooled variance estimate in a t-test converges almost surely
to the population variance, as does the pooled variance estimate in analysis of variance
with k groups (exercise).
By arguments similar to those of Example 7.6, we can show that Rn converges almost
surely to ρ as n → ∞ (exercise). Notice that no assumption of bivariate normality is
required: the joint distribution of (X, Y) is arbitrary. Because almost sure convergence
implies convergence in probability, there is high probability that Rn will be close to ρ if
the sample size n is large.
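The convergence of Rn to ρ can be seen in a small simulation (our own sketch; the pairs happen to be built from normals for convenience, though the result requires no normality).

```python
import math
import random

def sample_corr(pairs):
    """Pearson sample correlation coefficient Rn."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rho, rng, pairs = 0.6, random.Random(5), []
for n in (100, 1000, 10000):
    while len(pairs) < n:
        x = rng.gauss(0, 1)
        pairs.append((x, rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)))
    print(n, round(sample_corr(pairs), 3))  # settles near rho = 0.6
```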
Another way to study the association between X and Y is through the regression
equation
where the errors εi are assumed iid from a distribution with mean 0. Notice that here we
are treating the Xi as fixed constants xi. That is, we assume that the relationship (7.3)
holds for fixed xi. We study conditioning in much greater detail in a subsequent chapter,
but the reader is assumed to have some familiarity with this concept. The least squares
estimate of β1 is
To demonstrate that the convergence is uniform in x ∈ [0,1], let ε > 0 be given. We must
show that there exists an N such that |Bn(x) − f(x)| < ε for all n ≥ N and x ∈ [0,1].
Because f is uniformly continuous, there exists a δ > 0 such that |f(x) − f(y)| < ε/2 whenever
|x − y| < δ. Let A be the event that |Sn/n − x| < δ, and let Ac be its complement. Then
The Bn are known as Bernstein polynomials. We have used results from probability
theory to prove that the Bn uniformly approximate f.
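The Bernstein polynomial Bn(x) is just E{f(Sn/n)} with Sn binomial (n, x), so it is easy to evaluate directly. The sketch below (our own illustration) shows the maximum approximation error shrinking as n grows for f(x) = |x − 1/2|, a continuous function that is not differentiable at 1/2.

```python
from math import comb

def bernstein(f, n, x):
    """B_n(x) = sum_{k=0}^n f(k/n) C(n,k) x^k (1-x)^(n-k),
    i.e., E{f(S_n/n)} with S_n binomial(n, x)."""
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

f = lambda t: abs(t - 0.5)  # continuous on [0,1], not differentiable at 1/2
for n in (10, 50, 200):
    err = max(abs(bernstein(f, n, i / 100) - f(i / 100)) for i in range(101))
    print(n, round(err, 4))  # maximum error over a grid shrinks with n
```

A useful sanity check: for f(t) = t², the identity E{(Sn/n)²} = x² + x(1 − x)/n gives Bn exactly.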
In data analysis, we often fit simple curves such as lines or quadratics to data. But
sometimes these simple curves do not fit very well. We can add higher degree terms such
as cubics and higher. One problem is that with higher order terms, the y values at some
x's strongly influence the curve fitting at other x locations. For example, Figure 7.1
shows the best fitting fifth degree polynomial applied to data whose actual model is Y =
cos(x) + ε over the interval [0, 6]. The fit is quite poor. It can be improved by using a
tenth order polynomial, but it is undesirable to have to continue fitting higher order
polynomials until we find one that looks acceptable. An alternative that avoids this
problem is to use spline fitting. This involves estimating y(x) locally by averaging the y
values of nearby x values and forming different polynomials in different regions of the
curve. The resulting fit is quite good (Figure 7.2), and the process obviates the need to
check the plot to make sure the curve fits the data.
Figure 7.1
The best fitting fifth degree polynomial fit to data generated from the model y = cos(x) +
ε.
Figure 7.2
Spline curve fit to the data of Figure 7.1.
One cubic spline method uses Bezier curves. Let (x0, y0),..., (x3, y3) be 4 points whose
x values are equally spaced. We can always reparameterize to make the x's equally
spaced on [0,1]. For instance, if the original values x̃0,..., x̃3 are not on [0,1], take xk = (x̃k − x̃0)/(x̃3 − x̃0), where xk = k/3,
k = 0,1,2,3, is between 0 and 1. Therefore, assume without loss of generality that xk =
k/3, k = 0,1,2,3. Define yk = f(xk) and, for 0 ≤ x ≤ 1, define
Figure 7.3
Bezier polynomial fit to four points. The curve goes through the first and fourth points,
the derivative at the first point equals the slope of the secant line joining the first two
points, and the derivative at the last point equals the slope of the secant line joining the
third and fourth points. These properties facilitate the joining of contiguous Bezier
curves.
The connection between Bezier spline fitting and Bernstein polynomials is that the
Bezier curves that are joined together are Bernstein polynomials of degree 3.
Bernstein's approximation theorem (Theorem 7.8) involves convergence of Bn(x) for n
large; the Bezier curves in cubic spline fitting use the small value n = 3. Nonetheless,
the method of local approximation is the same as that of Bernstein polynomials.
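A degree 3 Bernstein polynomial in the yk gives a cubic Bezier curve over [0,1], and the endpoint and tangent properties listed in the Figure 7.3 caption can be verified numerically. This sketch is our own illustration; the point values are arbitrary.

```python
from math import comb

def bezier3(ys, x):
    """Cubic Bernstein/Bezier polynomial sum_k ys[k] C(3,k) x^k (1-x)^(3-k)
    for points at xk = k/3, 0 <= x <= 1."""
    return sum(ys[k] * comb(3, k) * x ** k * (1 - x) ** (3 - k)
               for k in range(4))

ys = [1.0, 2.0, 1.5, 0.0]                  # arbitrary y0,...,y3
print(bezier3(ys, 0.0), bezier3(ys, 1.0))  # equals y0 and y3
# The derivative at x = 0 is 3*(y1 - y0): the slope of the secant line
# joining (0, y0) and (1/3, y1), since the x spacing is 1/3.
h = 1e-6
print((bezier3(ys, h) - bezier3(ys, 0.0)) / h)
```

The matching tangent property at x = 1 is what lets contiguous Bezier pieces join smoothly.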
Exercises
1. Prove that if X1,..., Xn are iid Bernoulli (1/2) random variables, then
(a) if r > 1.
(b) With probability 1, infinitely many Xi must be 0.
(c) P(A) = 0, where A = {fewer than 40 percent of X1,..., Xn are 1 for
infinitely many n}.
2. The following statements about iid random variables X1, X2,... with mean μ are all
consequences of the WLLN or SLLN. For each, say whether the result follows
from the WLLN or requires the SLLN.
5. Pick countably infinitely many letters randomly and independently from the 26-
letter alphabet. Let X1 be the indicator that letters 1-10 spell "ridiculous", X2 be
the indicator that letters 2-11 spell "ridiculous", X3 be the indicator that letters 3-
12 spell "ridiculous", etc.
6. Let (X1, Y1), (X2, Y2),... be independent pairs, though Xi and Yi may be correlated.
Assume that the Xi are identically distributed with finite mean E(X1) = μX and the
Yi are identically distributed with finite mean E(Y1) = μY. Which of the following
are necessarily true?
(a) .
(b) You flip a single coin. If it is heads, set , n = 1, 2,... If it is
tails, set , n = 1, 2, .... Then .
(c) You flip a coin for each i. Set Ui = Xi if the ith flip is heads, and Ui = Yi
if the ith flip is tails. Then .
(d) You generate an infinite sequence U1, U2,... as described in part c, and let
Z1 = lim Ūn if this limit exists and is finite, and Z1 = 0 if the
limit either does not exist or is infinite. Repeat the entire experiment
countably infinitely many times, and let Z1, Z2,... be the results. Then
(e) Form all possible combinations obtained by selecting either Xi or Yi, i =
1,2,... Let s index the different combinations; for the sth combination, let Zs
denote the corresponding limit in part d. Then .
7. A gambler begins with an amount A0 of money. Each time he bets, he bets half of
his money, and each time he has a 60% chance of winning.
(a) Show that the amount of money An that the gambler has after n bets may be
written as An = A0 X1 X2 ··· Xn for iid random variables Xi. Show that E(An) → ∞ as n
→ ∞.
(b) Show that An → 0 a.s.
8. Prove that if (X1, Y1),..., (Xn, Yn) are iid pairs with 0 < var(X) < ∞ and 0 <
var(Y) < ∞, then the sample correlation coefficient Rn converges almost surely to
ρ. Prove that Rn converges to ρ in L1 as well.
where
10. Use the weak law of large numbers to prove that if Yn is Poisson with parameter n,
n = 1,2,..., then P(|Yn/n − 1| > ε) → 0 as n → ∞ for every ε > 0 (hint: what is the
distribution of the sum of n independent Poisson(1)s?). Can the strong law of large
numbers be used to conclude that Yn/n → 1 almost surely?
11. Prove that if f is a bounded continuous function on x ≥ 0, then
exp(−nx) ∑k≥0 f(k/n)(nx)^k/k! → f(x) as n → ∞ (see the proof of Theorem 7.8).
Proof. Without loss of generality, we can take Xn to have mean 0 because we can
consider Xn − μ. We first argue that S(n²)/n² → 0 almost surely as n → ∞. This follows from
Chebychev's inequality coupled with the Borel-Cantelli lemma, as follows.
Because ∑ P(Mn > ε) < ∞, the Borel-Cantelli lemma implies that P(Mn > ε i.o.) = 0. By the
same argument as in the preceding paragraph, this implies that Mn → 0 almost surely. Putting these
results together, we get
Notice that Theorem 7.10 requires a finite variance, but does not require the Xi to be
identically distributed or even independent. It is clear that without some condition akin
to independence, we could easily concoct counterexamples to laws of large numbers.
For example, let X1 be any non-degenerate random variable with finite mean μ, and
define X2 = X1, X3 = X1,... Then (X1 + X2 + ... + Xn)/n = X1 does not converge to μ
almost surely or in probability. Theorem 7.10 eliminates such a counterexample by
requiring the Xi to be uncorrelated.
Another useful technique in the proofs of laws of large numbers is truncation. Truncated
random variables behave very similarly to their untruncated counterparts, but have the
advantage of possessing finite moments of all orders. Let Xi be identically distributed
(not necessarily independent) with finite mean μ. Let
The truncated random variables Yi have finite variance regardless of whether the Xi do.
Also, Yi ≠ Xi implies that |Xi| > i, so P(Yi ≠ Xi) ≤ P(|Xi| > i). Therefore, if the Xi are iid with
finite mean,
by Proposition 5.9 because E(|X1|) < ∞. By part 1 of the Borel-Cantelli lemma, P(Yi ≠
Xi i.o.) = 0. In other words, with probability 1, only finitely many Yi differ from the
corresponding Xi. This implies that X̄n − Ȳn → 0 almost surely.
Let Xi be identically distributed (not necessarily independent) with finite mean μ, and
Yi be the truncated random variables defined by Equation (7.10). Then
1. X̄n converges in probability to μ if and only if Ȳn converges in probability to μ.
2. X̄n converges almost surely to μ if and only if Ȳn converges almost surely to μ.
Proof. These results follow from the fact that X̄n − Ȳn → 0 almost surely. For example, suppose Ȳn → μ almost surely.
Write X̄n as (X̄n − Ȳn) + Ȳn and note that X̄n − Ȳn → 0 almost surely because P(Yi ≠ Xi i.o.) = 0. Therefore, X̄n → μ almost surely.
We are now in a position to prove a version of the WLLN requiring only finite mean and
pairwise independence.
If X1,..., Xn,... are identically distributed random variables with mean μ, and all pairs
(Xi, Xj), i ≠ j, are independent, then X̄n converges in probability to μ.
Therefore,
This shows that P(|Ȳn − μ| > ε) is bounded, but does not show that it converges to 0. We
can improve this technique as follows. For any sequence an < n,
If we choose an suitably and use inequality (7.12), then the first term on the right side of
Equation (7.14) is no greater than
It is interesting to ask what happens to the sample mean of independent and identically
distributed random variables X1, X2,... such that E(X) does not exist or is not finite.
Does X̄n still converge to a constant with probability 1? For simplicity, assume that the
Xn are nonnegative iid random variables with mean ∞. If we let Yn(A) = XnI(Xn ≤ A),
then the Yn(A) are iid with finite mean, so Ȳn(A) → E{Y1(A)} almost surely as n → ∞ by the SLLN. It
follows that
Now lift the restriction that the Xn be nonnegative. If Xn has mean ±∞, then exactly one
of the nonnegative random variables Xn⁺ and Xn⁻ has infinite mean. The above argument
can be used to show that X̄n converges a.s. to ±∞. For instance, if E(Xn⁺) = ∞ and
E(Xn⁻) < ∞, then X̄n → ∞ almost surely. We have shown that X̄n cannot
converge almost surely to a finite constant if E(Xi) = ±∞. It can also be shown that if
E(Xi) does not exist, X̄n does not converge almost surely to a constant. The following theorem
summarizes these conclusions.
If X1, X2,... are iid, then X̄n converges almost surely to some finite constant c if and
only if E(X1) = μ is finite, in which case c = μ.
The mean of a Cauchy random variable does not exist, although the median is θ. We will
see in the next chapter that X̄n has the same Cauchy distribution F as a single observation, so that
P(|X̄n − θ| > ε) does not tend to 0. This means that the sample mean is no better than a single
observation at estimating θ. In particular, X̄n does not converge to θ almost surely as the
sample size tends to ∞. Of course, X̄n does not converge almost surely to any other
constant either, which is consistent with Theorem 7.13.
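This failure of averaging is easy to see by simulation (our own sketch, drawing standard Cauchy variates by the inverse-CDF method): the fraction of sample means exceeding 1 in absolute value stays near P(|X1| > 1) = 1/2 rather than shrinking as n grows.

```python
import math
import random

def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF tan(pi*(U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(13)
n, reps = 1000, 2000
means = [sum(cauchy(rng) for _ in range(n)) / n for _ in range(reps)]
# The sample mean of n standard Cauchys is again standard Cauchy, so
# P(|mean| > 1) stays at 1/2 no matter how large n is: no concentration.
frac = sum(abs(m) > 1 for m in means) / reps
print(round(frac, 3))
```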
Exercises
The sequence Sn = X1 + ... + Xn, where the Xi are iid with distribution (7.17), is called a random
walk. When p = 1/2, it is called a symmetric random walk.
We represent a random walk with a graph of the points (n, Sn) in the plane, with linear
interpolation between points, as shown in Figure 7.4. The behavior of random walks
has been studied extensively (Feller, 1968), first in connection with gambling, but with
many other applications as well. We give only a few examples and results for symmetric
random walks that can be proven using laws of large numbers.
Figure 7.4
A random walk.
Example 7.16.
With probability 1, the symmetric random walk eventually reaches every level 0, −1,
+1, −2, +2,... In fact, with probability 1 it reaches each level infinitely often, including
infinitely many returns to the origin.
Sketch of proof. First we prove that with probability 1 the random walk eventually
reaches level +1. After one step, Sn reaches either −1 or +1 with probability 1/2 each.
Let A1 be the event S1 = −1. Now imagine a random walk beginning at level −1 instead
of 0. Proposition 7.17 ensures that it will reach either −3 or +1 eventually, and by
symmetry, it is equally likely to reach either of these first (see Figure 7.5). Let A2 be the
event that it reaches −3 before reaching +1. Now imagine a random walk beginning at
level −3. By Proposition 7.17, the random walk eventually reaches either −7 or +1, and
by symmetry, it is equally likely to reach either first. Let A3 denote the event that it
reaches −7 first, etc. The probability that Sn never reaches level +1 is P(A1 ∩ A2 ∩ ...) = lim (1/2)^k = 0.
A similar proof can be used for other positive k, or one can proceed by induction; the new
random walk starting at level 1 must eventually end up one level higher, etc. Also, the
same proof works for negative k by symmetry.
Figure 7.5
For a random walk never to reach level +1, it must reach level −1 before level +1 (left
panel). It begins anew at (1, −1), and then it must reach level −3 before reaching level
+1 (right panel), etc.
To see that Sn must reach each level infinitely often, note that if Sn(ω) reached level k
only finitely many times, then there would be an N = N(ω) such that either SN, SN+1,...
are all less than k or all larger than k. Either way, infinitely many levels (either all
levels > k or all levels < k) would never be reached at or after observation N. Only
finitely many of those levels could have been reached before observation N, so for that
ω, some levels would never be reached. Such ω have probability 0 by what we proved
in the preceding paragraph. Therefore, each level is reached infinitely often. This also
implies that Sn returns to the origin infinitely often.
Consider the implications of Proposition 7.18 on Example 7.16. Suppose that the
treatments are equally effective and we monitor the study after every new patient is
evaluated. The WLLN ensures high probability that the proportion of patients for whom
treatment A does better than treatment B will be close to 1/2. Nonetheless, the
difference between the number of patients doing better on treatment A and its expected
number, n/2, can be very substantial.
The number of returns to the origin of a random walk has many important applications.
The example offered below relates to the potential for selection bias in certain types of
clinical trials.
Well-designed clinical trials keep the patient and investigators blinded to the treatment
given. In some cases, maintaining the blind is virtually impossible. For example,
suppose that one treatment is surgery and the other is a pill. Once the patient receives
treatment, he or she and the investigator know the treatment received. Therefore, an
investigator could keep track of all previous treatment assignments. Let Xi be −1 or +1
if patient i is assigned pill or surgery, respectively. An investigator keeping track of all
previous assignments knows Sn, the difference between the numbers of patients
assigned to surgery versus pill. If simple randomization is used, keeping track of past
assignments will not help the investigator predict the next assignment. But most methods
of randomization try to force more balance in the numbers of patients assigned to the
two treatments than simple randomization would. For example, permuted block
randomization with blocks of size 2k is equivalent to placing k C's (denoting control)
and k T's (denoting treatment) in a box and drawing without replacement for the next 2k
patients. Balance is forced after every 2k patients. Forced balance means that the
investigator usually has a better than 50 percent chance of correctly guessing the next
assignment. With blocks of size 2, the investigator is sure to correctly guess the second
patient in every block.
Having a better than 50 percent chance of correctly guessing the next assignment could
lead to selection bias, a phenomenon whereby patients assigned to treatment are
systematically different from those assigned to control. This could happen if the
investigator "vetoes" potential new patients until the "right" one comes along. For
instance, an investigator may believe that treatment helps. If the investigator knows that
the next patient will receive the new treatment, he or she may unwittingly veto healthy
looking patients until a particularly sick one who really needs the treatment arrives.
The end result is that patients receiving treatment are sicker than control patients.
Likewise, an investigator who is confident that the treatment works may introduce bias
such that those receiving treatment are healthier than the controls to ensure that the trial
produces the "right" answer that treatment works. In either case the investigator's bias
could influence the results of the trial. Such bias is usually subtle, but to understand its
potential influence, we consider an extreme scenario whereby an investigator
intentionally tries to influence the results of a trial with permuted block randomization
and a continuous outcome Y. The truth is that the treatment has no effect on Y. An
investigator who believes that the next patient will be assigned control (respectively,
treatment) selects one with mean μ (respectively, μ + δ). The probability of a false
positive result depends on the number of correct guesses of the investigator (Proschan,
1994).
The guessing strategy used by the investigator is as follows. The first guess is
determined by the flip of a fair coin. After the first patient is assigned, the investigator
continues guessing the opposite treatment until the numbers assigned to the two
treatments balance again. Let τ1 be the index of the observation when things balance
again. That is, τ1 is the first index n ≥ 1 such that Sn = 0, i.e., the first return to the
origin. The investigator will be correct exactly one time more than he or she is incorrect
on patients 2, 3,..., τ1 (see Figure 7.6). Moreover, he or she has probability 1/2 of being
correct on patient 1. The same thing happens over the next interval τ1 + 1,..., τ1 + τ2,
where τ1 + τ2 is the index of the second time of balance, i.e., the index of the second
return of the random walk to the origin. The same thing happens between all successive
returns to the origin. If the randomization forces balance at the end of the trial, then the
number of correct treatment assignment guesses depends crucially on the total number of
returns to the origin (forced or unforced).
Figure 7.6
The first step is +1, so we predict −1 for each of steps 2-6, at which time the random
walk returns to the origin. Steps 2-6 contain one more correct than incorrect predictions,
while step 1 has probability 1/2 of being correctly predicted. Thus, over steps 1-6, the
number correct minus number incorrect is 1 + Y1, where Y1 = 1 or −1 with probability
1/2 each. The same thing happens between any two consecutive returns to the origin.
Note that simple randomization avoids this problem because we can correctly predict
only half the assignments, on average. Keeping track of previous treatment assignments
offers no advantage because the conditional probability that the next assignment is to
treatment A is 1/2 regardless of previous assignments.
It turns out that thinking in terms of the investigator in Example 7.19 leads to the
following useful result for symmetric random walks.
Theorem 7.20.
The expected number of returns to the origin of a symmetric random walk Sk before
time 2n (not including any return at time 2n) is E(|S2n|) − 1.
Proof. Consider the investigator trying to predict treatment assignments in Example 7.19
with 2n patients when simple randomization is used. The difference between the number
of correct and incorrect predictions over the interval between returns i and i + 1 to the
origin is 1 + Yi, where Yi = ±1 with probability 1/2 each (see Figure 7.6). Let L be the time
of the last return to the origin before 2n, with L = 0 if there is no return before time 2n.
The difference between the numbers of correct and incorrect predictions from step L + 1
to step 2n is either −|S2n| or −|S2n| + 2, depending on whether the prediction on step L +
1 is incorrect or correct, respectively. Thus, the number correct minus number incorrect
from step L + 1 to step 2n is −|S2n| + U, where U is 0 or 2 with probability 1/2 each. Over all
2n steps, the number correct minus number incorrect is ∑i=1..N (1 + Yi) − |S2n| + U, where N is
the number of returns to the origin before time 2n. It is not difficult to show that
E{∑i=1..N (1 + Yi)} = E(N) (see Exercise 1), so the expected number correct minus the expected
number incorrect is E(N) − E(|S2n|) + 1. (7.18)
But for any guessing strategy, the number correct minus number incorrect must have
mean 0. This is because the steps of the random walk are iid; the conditional probability
of being correct on step k, given the results on steps 1,..., k − 1, is still 1/2. Equating
Expression (7.18) to 0 gives the stated result.
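Because a symmetric random walk of length 2n has only 2^(2n) equally likely paths, Theorem 7.20 can be verified exactly by enumeration for small n. This check is our own illustration, not part of the text.

```python
from itertools import product

def expected_returns_check(n):
    """Enumerate all 2^(2n) equally likely paths of a symmetric random
    walk and return (expected returns to 0 before time 2n, E|S_2n| - 1).
    Theorem 7.20 says the two numbers are equal."""
    total_returns, total_abs = 0, 0
    for steps in product((-1, 1), repeat=2 * n):
        s, returns = 0, 0
        for t, step in enumerate(steps, start=1):
            s += step
            if s == 0 and t < 2 * n:   # exclude any return at time 2n itself
                returns += 1
        total_returns += returns
        total_abs += abs(s)
    paths = 2 ** (2 * n)
    return total_returns / paths, total_abs / paths - 1

for n in (1, 2, 3, 4):
    print(n, expected_returns_check(n))  # the two entries agree
```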
What is the expected amount of time until a random walk returns to the origin? If this
expectation were finite, then the number of returns before time 2n would be of order 2n.
To see this, let τ1 be the time to the first return to the origin (note: τ1 could be > 2n), τ2
be the time between the first and second return, etc. Then τ1, τ2,... are iid random
variables. If the mean time ν between returns were finite, then (τ1 + ... + τk)/k would
converge almost surely to ν as k → ∞ by the SLLN. But each ω for which (τ1 + ... + τk)/k → ν is an ω
for which N2n/(2n) → 1/ν, where N2n is the number of returns by time 2n. Therefore, if ν < ∞, then N2n/(2n) would converge
almost surely to 1/ν. Moreover, N2n/(2n) is bounded by 1, so the BCT would imply that
E(N2n)/(2n) also converges to 1/ν. In other words, the expected number of returns by time 2n would be
a constant times 2n, i.e., it would be of order 2n. However, this is all predicated on
the mean time between returns being finite. If it is not finite, then the mean number of
returns need not be of order 2n. One way to show that the mean time between returns is
not finite is to show that the mean number of returns is not of order 2n. To that end, use
Theorem 7.20 and note that
E(N2n) = E(|S2n|) − 1 ≤ {E(S2n2)}1/2 − 1 = (2n)1/2 − 1.
The inequality in the first line is from Corollary 5.20 with p = 2. We have shown that the
expected number of returns before 2n is of order at most (2n)1/2. Because the expected
number of returns would have been of order 2n if the expected return time were finite,
this proves that the expected return time must be infinite. We have proven the following
theorem.
Theorem 7.21. Infinite expected time to return to origin for symmetric random walks
The expected time for a symmetric random walk to return to the origin is infinite.
What we have shown is that although a symmetric random walk is guaranteed to return
to the origin infinitely often, the time it takes to return can be extremely long, such that
its expectation is infinite. Very long times to return mean that the random walk could
spend a high proportion of time above the x-axis or a high proportion of time below the
x-axis. See Feller (1968) for a formalization of this fact known as an arcsin law.
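The infinite mean can be made concrete numerically. The probability that the first return occurs at time 2k is C(2k, k)/{(2k − 1)4k}, the standard first-return formula for the symmetric walk; the sketch below (not from the text) shows the return probabilities summing to 1 while the partial sums of 2k times those probabilities grow without bound, roughly like a constant times K1/2.

```python
from math import comb

def first_return_pmf(k):
    """P(first return to 0 of a symmetric walk occurs at time 2k)."""
    return comb(2 * k, k) / ((2 * k - 1) * 4 ** k)

for K in (10, 100, 1000):
    total_prob = sum(first_return_pmf(k) for k in range(1, K + 1))
    partial_mean = sum(2 * k * first_return_pmf(k) for k in range(1, K + 1))
    # total_prob -> 1, but partial_mean keeps growing like sqrt(K)
    print(K, round(total_prob, 4), round(partial_mean, 2))
```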
Exercises
1. Let X1, X2,... be iid random variables independent of the positive integer-valued
random variable N with E(N) < ∞. If E(|X1|) < ∞, then E(X1 + ⋯ + XN) = E(N)E(X1).
2. Let Sn be a non-symmetric random walk (i.e., p ≠ 1/2) and k be any integer. Prove
that with probability 1, Sn reaches level k only finitely many times. Prove also that
P(Sn reaches level k infinitely many times for some integer k) is 0.
7.4 Summary
1. The sample mean X̄n of iid random variables with mean μ converges to μ in two
ways:
(a) SLLN: X̄n converges to μ almost surely.
(b) WLLN: X̄n converges to μ in probability.
2. (Kolmogorov) If X1,X2,... are iid, then X̄n converges almost surely to some finite
constant c if and only if E(X1) = μ is finite, in which case c = μ.
(a) The Yi are bounded, so they have finite moments of all orders.
(b) if and only if and if and only if .
4. If Xi are iid Bernoulli (p), Sn = 2(X1 + ⋯ + Xn) − n is called a random walk. If p = 1/2, Sn
reaches every level infinitely often, but the expected time to return to the origin is
infinite.
Chapter 8
Many test statistics and estimators involve sample means of iid random variables X1,...,
Xn. If we make assumptions about the Xi, we can get the exact distribution of the sample
mean X̄n. For example, if the Xi are normally distributed, so is X̄n. But in reality, we do
not know the distribution of most data. It would be nice if the distribution of X̄n,
properly standardized, did not depend on the distribution of the Xi. In this chapter we
show that if n is large and var(Xi) = σ2 < ∞, the distribution of a properly standardized
version of X̄n is approximately the same, namely standard normal, regardless of the
distribution of X1,..., Xn. Later in the chapter we show that, under certain conditions, a
similar result holds if the Xi are independent but not identically distributed.
Imagine that we know nothing about the central limit theorem, but we know only that
some standardized version (X̄n − an)/bn converges in distribution to some non-constant
random variable Z. How can we deduce the proper standardization of X̄n and the
distribution of Z? It is natural to subtract the mean μ of X̄n to give the standardized version
mean 0. Therefore, it makes sense to take an = μ. The question then becomes: what
sequence bn makes Zn = (X̄n − μ)/bn converge in distribution to a non-constant random variable
Z? The variance of Zn is σ2/(nbn2), which has a finite, nonzero limit only if bn is of
order n−1/2; taking bn = σ/n1/2 gives Zn variance exactly 1.
The next question is: what is the distribution of the limiting random variable Z? If the Xi
are normally distributed, then Zn = n1/2(X̄n − μ)/σ has an exact standard normal distribution,
so the limiting random variable Z must be standard normal in that case. But we are
assuming that Zn converges in distribution to the same Z regardless of the distribution of
the Xi. Therefore, that Z must be standard normal. We have surmised, but not proven, the
most famous theorem in probability, stated in Theorem 8.1.
In some settings we have continuous measurements of the same people at two time
points or under two different conditions (e.g., treatment and placebo). We are interested
in the paired differences D1,..., Dn for the n people in the study. Assume that D1,..., Dn
are iid from a distribution with mean μ and finite variance σ2. The one-sample t-
statistic for testing the null hypothesis that μ = 0 is Tn = n1/2D̄n/s, where s is the sample standard deviation of the Di.
Suppose that our assumption that the data are normal is wrong. Suppose that the correct
distribution is F(d), which has mean 0 and variance σ2. What happens to the type I error
rate of the t-test for large sample sizes? Notice that Tn = Zn(σ/s), where Zn = n1/2D̄n/σ.
Let X1,..., Xm and Y1,..., Yn be values of a continuous variable in two different groups.
Assume that the Xi are iid with mean μX, the Yi are iid with mean μY and independent
of the Xi, and var(Xi) = var(Yi) = σ2. We wish to test the null hypothesis that μX = μY.
The two-sample t-statistic is
Tm,n = (X̄m − Ȳn)/{sp(1/m + 1/n)1/2},
where sp is the pooled sample standard deviation.
Suppose that Xi and Yi have arbitrary distributions F and G with the same mean and
same variance σ2. Suppose that m = mn is such that mn/(mn + n) → λ ∈ (0, 1) as n → ∞. The
CLT implies that m1/2(X̄m − μX)/σ and n1/2(Ȳn − μY)/σ each converge in distribution to N(0,1).
Then Tm,n converges in distribution to N(0,1) under the null hypothesis.
In some cases we are interested in tests or confidence intervals on the variance σ2 of iid
random variables X1,..., Xn. For example, we may have such a huge amount of data
from one population that we are confident that its variance is 1; we have a smaller
number of observations from another population, and we want to test whether its
variance is also 1. We estimate σ2 using the sample variance s2 = (n − 1)−1Σi=1n(Xi − X̄n)2.
Without loss of generality, assume that E(Xi) = 0 because subtracting E(X) from every
observation does not change the sample or population variance. The usual test and
confidence interval on σ2 assume the Xi are normally distributed, in which case
(n − 1)s2/σ2 has a chi-squared distribution with n − 1 degrees of freedom. For example, to
test whether σ2 = 1 at level α, we reject the null hypothesis if (n − 1)s2/1 exceeds the (1
− α)th quantile of a chi-squared distribution with n − 1 degrees of freedom.
Now suppose we have a large sample size and do not want to assume that the data are
normal. How can we determine the asymptotic distribution of the sample variance?
Assume that μ4 = E(Xi4) < ∞. Apply the CLT to the iid random variables Xi2:
n1/2{(1/n)Σi=1nXi2 − σ2} converges in distribution to N(0, μ4 − σ4).
Also, s2 = {n/(n − 1)}{(1/n)Σi=1nXi2 − X̄n2},
and note that the last term, after multiplication by n1/2, tends to 0 in probability. By Slutsky's theorem, we can ignore it. We have
shown that s2 is asymptotically normal with mean σ2 and variance (μ4 − σ4)/n. On
the other hand, if we had assumed normality, then as noted previously, (n − 1)s2/σ2
follows a chi-squared distribution with n − 1 degrees of freedom. We can write a chi-
squared random variable with n − 1 degrees of freedom as U1 + ⋯ + Un−1, where the Ui are iid
chi-squared random variables with 1 degree of freedom. By the CLT, this sum is
asymptotically normal with mean (n − 1) and variance 2(n − 1).
The numerators are identical, and the ratio of the denominators tends to 1 if and only if
μ4 − σ4 = 2σ4, i.e., μ4 = 3σ4. This is a peculiar condition holding for the normal distribution, but not
necessarily for other distributions. Therefore, using the test based on normality gives the
wrong answer, even asymptotically, if that normality assumption is wrong. This is in
sharp contrast to the t-test of means. As we saw in Example 8.3, the t-test is
asymptotically correct regardless of the underlying distribution of the data.
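This robustness is easy to see in simulation. The sketch below (an illustration, not from the text) applies the nominal-5% two-sided one-sample t-test to markedly skewed data, centered Exponential(1), and recovers a type I error rate near 0.05 for moderate n.

```python
import math
import random
import statistics

rng = random.Random(7)
n, reps, crit = 200, 4000, 1.96          # approximate 5% two-sided cutoff

rejections = 0
for _ in range(reps):
    # Centered Exponential(1) data: mean 0 under H0, but far from normal
    xs = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
    t = math.sqrt(n) * statistics.mean(xs) / statistics.stdev(xs)
    if abs(t) > crit:
        rejections += 1

print(rejections / reps)                 # close to the nominal 0.05
```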
where Xi and Yi are the baseline and end of study values of the continuous outcome, zi
is the treatment indicator (1 for treatment, 0 for control), and the εi are iid errors from a
distribution with mean 0 and variance σε2. Example 7.7 treated the xi as fixed constants,
whereas Exercise 9 of Section 7.3 treated them as random variables. Here we regard
them as random variables. The treatment effect estimator using this ANCOVA model is
where bars denote sample means, T and C denote treatment and control groups, and β̂1 is
the slope estimator. We saw in Exercise 9 of Section 7.3 that β̂1 converges in probability to β1. Replace β̂1 with β1.
Then (8.4) is ŪT − ŪC, where Ui = Yi − β1Xi. In the treatment arm, the Ui are iid with mean β0
+ β2 and variance σU2, say. By the CLT, ŪT is asymptotically normal with mean β0 + β2 and
variance σU2/nT. Similarly, in the control arm, the Ui are iid with mean β0 and variance σU2;
ŪC is asymptotically normal with mean β0 and variance σU2/nC. Also, because the
treatment and control observations are independent, ŪT and ŪC are independent.
Therefore,
It is an exercise to show that the same thing holds if we replace β1 on the left side of
(8.5) by its estimator β̂1. Therefore, if the εi are iid mean 0 random variables from any
distribution with finite variance, the ANCOVA procedure has asymptotically the
correct coverage probability.
We next apply the CLT to iid Bernoulli (p) random variables X1,..., Xn. The sum
Sn = X1 + ⋯ + Xn has a binomial distribution with parameters n and p. The CLT says that
(Sn − np)/{np(1 − p)}1/2 converges in distribution to a N(0,1).
Example 8.8.
The actual analysis was flawed on many levels: (1) different incorrect answers on
multiple choice exams are almost never equally likely to be selected, (2) whether two
students match on one answer may not be independent of whether they match on another
question, and (3) the professor erroneously used a match probability of 1/16 instead of
1/4.
Our focus is on how well the CLT approximates the binomial probability of 13 or more
correct answers out of 16 when p = 1/4. Under the professor's assumptions, the number
of matches S16 is the sum of 16 independent Bernoulli trials with p = 1/4. Therefore,
E(S16) = 16(1/4) = 4 and var(S16) = 16(1/4)(3/4) = 3. The CLT approximation is
Is this close to the binomial probability of 4 × 10−6? That depends on whether we are
talking about close in an absolute or relative sense. The difference between 4 × 10−6
and 10−9 is tiny, yet 10−9 is 4,000 times smaller than 4 × 10−6. When one is attempting
to demonstrate a very remote probability, 4 × 10−6 and 10−9 are very different, though
either would be considered strong evidence against S.
In general, the CLT approximation to the binomial works better when p is closer to 1/2.
For example, suppose p = 1/2. Even if n is as small as 10, the probability of 7 or more
successes is 0.17 under the binomial and 0.16 under the CLT approximation.
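These tail probabilities are easy to reproduce numerically. The sketch below (not from the text) computes the exact binomial tails; the continuity-corrected normal value printed for the p = 1/2 case may differ slightly from the approximation quoted above.

```python
from math import comb, erf, sqrt

def binom_tail(n, p, k):
    """Exact P(S_n >= k) for S_n ~ binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_tail(z):
    """P(N(0,1) >= z) via the error function."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Far tail with p = 1/4: the exact probability is about 4 x 10^-6
print(binom_tail(16, 0.25, 13))

# Moderate tail with p = 1/2, n = 10: the normal approximation is close
exact = binom_tail(10, 0.5, 7)
approx = normal_tail((6.5 - 5.0) / sqrt(2.5))   # continuity correction
print(round(exact, 3), round(approx, 3))
```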
Exercises
Example 8.9.
Let X1, X2,... be iid with mean μ and finite variance σ2, and let Y have an arbitrary
distribution with mean μY and variance σY2. Replace X1 with Y; that is, define Y1 = Y,
Yi = Xi for i ≥ 2. The following argument shows that Y1 + ⋯ + Yn is asymptotically normal
with mean μY + (n − 1)μ and variance nσ2.
By the CLT,
Let
By Slutsky's theorem, the standardized sum converges in distribution to N(0,1).
The conclusion from this example is that the CLT still holds if we replace one of n iid
random variables with an arbitrary random variable with finite variance σY2. The same is
true if we replace any finite and fixed set of the Xi by other random variables Yi.
A very useful device for developing intuition about central limit theorems is the
quincunx (Figure 8.1). The standard quincunx simulates the sum of iid random variables
Xi taking values −1 or +1 with probability 1/2 each. Balls roll down a board toward a
triangular array of nails. When a ball hits the nail in the first row, it is equally likely to
bounce left (X1 = −1) or right (X1 = +1). Whichever way it bounces, it then strikes a
nail in the second row and bounces left (X2 = −1) or right (X2 = +1) with equal
probability, etc. Each left or right deflection in a given row represents one binary ±1
random variable. Bins at the end of the board collect the balls after they pass through the
n rows of nails. If the numbers of left and right deflections are equal, the ball will
collect in the middle bin (Sn = 0), whereas if all deflections are to the right, the ball
will collect in the rightmost bin (Sn = n), etc. The ball's location at the end is the sum Sn
of n iid ±1 deflections, and therefore has the distribution of 2Y − n, where Y is binomial
with parameters n and 1/2. The numbers of balls in the different bins form a frequency
histogram estimating the distribution of Sn. With a large number of balls, this
empirical distribution is a good approximation to the asymptotic distribution of Sn,
which is N(0,n) by the CLT.
Figure 8.1
Quincunx.
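A quincunx is easy to mimic in software. The sketch below (illustrative code, not part of the text) drops many balls through n rows of equally likely ±1 deflections and prints a text histogram of the bins; the mound shape approximates the N(0, n) limit.

```python
import random
from collections import Counter

rng = random.Random(3)
n_rows, n_balls = 10, 50000

counts = Counter()
for _ in range(n_balls):
    # each row deflects the ball left (-1) or right (+1) with equal chance
    position = sum(rng.choice((-1, 1)) for _ in range(n_rows))
    counts[position] += 1

for bin_pos in sorted(counts):           # bins at -n, -n+2, ..., n
    bar = "#" * (counts[bin_pos] * 60 // n_balls)
    print(f"{bin_pos:+3d} {bar}")
```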
We can modify the quincunx to allow different sized deflections in different rows. This
helps us envision scenarios when a central limit theorem might hold even if the random
variables are independent but not identically distributed. As long as the deflection sizes
in the different rows are not wildly different, the distribution of the balls at the bottom is
approximately normal.
Suppose data consist of iid paired differences Di from a distribution symmetric about μ
and with finite variance, and we wish to test the null hypothesis that μ = 0. A
permutation test in this setting corresponds to treating the data di as fixed numbers, and
regarding −di and +di as equally likely. For instance, if d1 = 8, we treat −8 or +8 as
equally likely. The different observations are still independent, binary observations, but
are not identically distributed because the di have different magnitudes. A quincunx with
deflection size |di| in row i represents the permutation distribution. Intuitively, because
the di arose as iid observations from some distribution with finite variance, they will
not differ so radically that they cause the distribution of balls at the bottom to be
bimodal or have some other non-normal shape. Later we prove that a central limit
theorem holds in this setting.
If the size of the deflection in one row dominates the sizes in other rows, the distribution
of the sum may not be asymptotically normal. We illustrate how to use a quincunx to
construct an example for which the sum is not asymptotically normal.
Suppose that U1, U2, ... are iid Bernoulli (1/2). To create a quincunx whose first row
deflection has size 2, let X1 be 4U1 − 2. We want row i ≥ 2 to have deflection size 2−(i−1),
so let Xi = 2−(i−1)(2Ui − 1) for i ≥ 2. Then the sum of magnitudes of all deflections from row 2
onward is 2−1 + 2−2 + ⋯ = 1 (Figure 8.2). Because this is only half as large as the
deflection in the first row, balls can never end up between −1 and 1. That is,
P(−1 < Sn < 1) = 0 for every n. In this example, E(Sn) = 0 and var(Sn) → 13/3
as n → ∞. If the CLT held, then Sn/{var(Sn)}1/2 would converge
in distribution to N(0,1). Slutsky's theorem would then imply that
Sn converges in distribution to N(0, 13/3). Clearly this cannot be the case when P(−1 < Sn < 1) is exactly 0 for
every n. Therefore, the CLT cannot hold; Sn is not asymptotically normal with mean 0
and variance var(Sn).
Figure 8.2
A quincunx whose first deflection has magnitude 2, and the sum of magnitudes of all
subsequent deflections is 1. Then P(−1 < Sn < 1) = 0 for all n.
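Because there are only 2n sign patterns, the claim that this walk can never land in (−1, 1) can be verified exhaustively for moderate n. A sketch (not from the text):

```python
from itertools import product

def min_abs_sn(n):
    """Smallest |S_n| over all 2^n outcomes of Example 8.11's walk:
    first deflection has magnitude 2, deflection i >= 2 has magnitude 2**-(i-1)."""
    sizes = [2.0] + [2.0 ** -(i - 1) for i in range(2, n + 1)]
    return min(abs(sum(size * sign for size, sign in zip(sizes, signs)))
               for signs in product((-1, 1), repeat=n))

for n in (1, 4, 8, 12):
    print(n, min_abs_sn(n))   # always exceeds 1, so P(-1 < S_n < 1) = 0
```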
In Example 8.11, we used intuition from a quincunx to concoct an example, but then we
needed to show rigorously that Sn/{var(Sn)}1/2 was not asymptotically N(0,1). Quincunxes
can provide the key idea, but rigor is required to prove normality or lack thereof.
Allowing distributions to change with n makes it even easier to construct examples such
that Sn is not asymptotically normal with mean 0 and variance var(Sn). Figure 8.3
shows a quincunx whose first deflection is much larger than the subsequent 4
deflections. The resulting distribution is bimodal. If we simultaneously add more rows
and increase the size of the first deflection, the distribution of the sum will be a mixture
of normals. If the deflection size in the first row is so large that there is complete
separation between the two humps, the probability of being in a small interval about 0 is
0. The limiting random variable will also have probability 0 of being in a sufficiently
small interval about 0, as in Example 8.11 and the following example.
Figure 8.3
A quincunx whose first deflection is much larger than the subsequent 4, creating a
bimodal distribution.
Example 8.12. Another example of CLT not holding if one variable dominates
Let U1, U2,... be iid Bernoulli random variables with parameter 1/2. We create a
quincunx whose first row has a deflection size of n by defining Xn1 = (2n)U1 − n. Make
subsequent deflections have size 1 by defining Xnj = 2Uj − 1, j = 2,..., n. The variance
of the sum is var(Sn) = n2 + n − 1. It follows that Sn/{var(Sn)}1/2 converges in distribution to
2U1 − 1, which takes values ±1 with probability 1/2 each; in particular, Sn is not
asymptotically normal with mean 0 and variance var(Sn).
By making the deflection size in the first row of order n1/2 instead of n, we can create a
sum whose asymptotic distribution is a mixture of normals, as we see in the following
example.
Again let U1, U2,... be iid Bernoulli random variables with parameter 1/2, but now
define Xn1 = (2n1/2)U1 − n1/2, Xnj = 2Uj − 1, j = 2,...,n. This corresponds to a quincunx whose first
deflection is only of size n1/2 instead of n. Unlike in Example 8.12, the two humps of
the distribution of Sn are no longer completely separated. Therefore, the asymptotic
distribution of Sn/{var(Sn)}1/2 no longer puts probability 0 on an interval about 0. Instead, it
is a mixture of normals. More specifically, Sn/{var(Sn)}1/2 converges in distribution to W, where
W is N(−2−1/2, 1/2) or N(2−1/2, 1/2) with probability 1/2 each (exercise). Therefore, the CLT does not hold for Xnj.
One condition preventing scenarios like Examples 8.12 and 8.13 is the Lindeberg
condition:
Σi=1n E{Xni2 I(|Xni| ≥ εsn)}/sn2 → 0 as n → ∞, where sn2 = var(Sn), (8.10)
for each ε > 0. The term E{Xni2 I(|Xni| ≥ εsn)} is essentially a tail variance for the random
variable Xni. Therefore, the Lindeberg condition says that the sum of tail variances must
be a negligible fraction of the total variance.
We now show that the Lindeberg condition is not satisfied in Example 8.12, where
sn2 = var(Sn) = n2 + n − 1. Take ε = 1/2. The first term of the sum of Expression (8.10) is
E{Xn12 I(|Xn1| ≥ sn/2)}/sn2. But Xn12 = n2, which exceeds (1/4)(n2 + n − 1) for all n = 1, 2,...
Therefore, the first term of the sum of Expression (8.10) is n2/(n2 + n − 1). Each subsequent term
is 0 for n ≥ 2 because |Xni| = 1 < (1/2)(n2 + n − 1)1/2 for n ≥ 2 and i = 2,
3,..., n. Thus, the left side of Expression (8.10) is n2/(n2 + n − 1), which tends to 1, not 0. Therefore, the
Lindeberg condition is not satisfied. In fact, if the Lindeberg condition were satisfied,
then a CLT would hold: Sn/{var(Sn)}1/2 would converge in distribution to N(0,1) (Theorem 8.14, the Lindeberg CLT).
The proof may be found in Billingsley (2012). We illustrate through several examples
how to apply the theorem.
For each n, let Un1,..., Unn be iid Bernoulli random variables with parameter pn such
that npn(1 − pn) → ∞. Let Xni = Uni − pn, so that the Xni have mean 0. We will show that the
Lindeberg condition is satisfied. Note that sn2 = var(Sn) = n var(Xn1) = n var(Un1) = npn(1 −
pn) → ∞. Consequently, εsn → ∞ for each ε > 0. Whenever n is large enough that
εsn > 1, E{Xni2 I(|Xni| ≥ εsn)} is 0 because |Xni| ≤ 1 for all n and i. Therefore, there exists an N
such that each term of the sum in Expression (8.10) is 0 for n ≥ N. This implies that the
sum is 0 for n ≥ N, so the limit of Expression (8.10) is 0. That is, the Lindeberg
condition is satisfied. By Theorem 8.14, Σi=1n(Uni − pn)/{npn(1 − pn)}1/2 converges in distribution to N(0,1).
For each n, let Un1,..., Unn be iid Bernoulli random variables with parameter pn such
that npn → λ > 0. By the law of small numbers (Proposition 6.24), the binomial (n,pn)
distribution of Sn = Un1 + ⋯ + Unn is asymptotically Poisson (λ). If we center the Bernoullis by
subtracting pn from each, then Slutsky's theorem implies that
Σi=1n(Uni − pn)/{var(Sn)}1/2 converges in distribution to (X − λ)/λ1/2, where X is Poisson (λ). Given that this is not a normal
distribution, the Lindeberg condition cannot be satisfied. It is an exercise to verify
directly that the Lindeberg condition is not satisfied.
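The non-normal limit is visible numerically: the binomial (n, λ/n) probabilities are already very close to the Poisson (λ) probabilities for modest n. The values λ = 2 and n = 500 below are arbitrary choices for illustration.

```python
from math import comb, exp, factorial

lam, n = 2.0, 500
p = lam / n

binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8)]
poisson = [exp(-lam) * lam**k / factorial(k) for k in range(8)]
for k in range(8):
    # a skewed, discrete limit: Poisson, not normal
    print(k, round(binom[k], 4), round(poisson[k], 4))
```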
Example 8.16 is noteworthy because no single random variable dominates the others the
way the first random variable did in Example 8.12. One condition ensuring that no
variable dominates is the uniform asymptotic negligibility condition
max1≤i≤n var(Xni)/var(Sn) → 0 as n → ∞. (8.11)
We have seen that the Lindeberg condition is sufficient for Sn to be asymptotically N(0,
var(Sn)). When (8.11) holds, it is necessary as well:
Theorem 8.17.
Suppose that Xni are independent with mean 0 and satisfy the uniform asymptotic
negligibility condition (8.11). If Sn/{var(Sn)}1/2 converges in distribution to N(0,1), then the Lindeberg condition (8.10) is
satisfied.
Remark 8.18. Lindeberg-Feller
Combining the Lindeberg and Feller theorems, we see that under the uniform asymptotic
negligibility condition (8.11), Sn/{var(Sn)}1/2 converges in distribution to N(0,1) if and only if the Lindeberg condition
(8.10) is satisfied. This is sometimes called the Lindeberg-Feller theorem.
In Example 8.15, it was easy to verify the Lindeberg condition because the sum in
Expression (8.10) became 0 for n sufficiently large. Another setting in which the
Lindeberg condition is easy to verify is when higher moments are available and
Lyapounov's condition is satisfied:
Σi=1n E(|Xni|r)/{var(Sn)}r/2 → 0 as n → ∞ for some r > 2.
Theorem 8.19.
Lyapounov's condition for some r > 2 implies the Lindeberg condition (8.10).
Proof (exercise).
Our final example requires slightly more work to verify the Lindeberg condition.
Consider the permutation test setting of Example 8.10. The permutation distribution of
D1 + ⋯ + Dn is obtained by fixing di and defining Xni = di or −di, with probability 1/2 each; the
distribution of Sn = Xn1 + ⋯ + Xnn is the permutation distribution of D1 + ⋯ + Dn. We will show that, for
almost all ω, the permutation distribution is asymptotically normal by verifying that the
Lindeberg condition holds with probability 1.
Let Ln denote the expression on the right side of Equation (8.13). We must prove that Ln
→ 0 as n → ∞.
Before we fixed D1 = d1, D2 = d2,..., they were iid random variables. By the SLLN,
Σj=1n Dj2/n → E(D2) for almost all ω as n → ∞, so Σj=1n Dj2 → ∞ almost surely. Thus, for the fixed
numbers d1, d2,..., Σj=1n dj2 → ∞ as n → ∞. This means that for any A we can determine an
integer NA such that ε2Σj=1n dj2 > A for all n ≥ NA.
It follows that, for n ≥ NA, Ln is at most
Σj=1n dj2 I(dj2 > A)/Σj=1n dj2. (8.15)
Apply the SLLN separately to the numerator and denominator: for ω in a set of
probability 1, the observed sequence dj = Dj(ω) is such that Expression (8.15)
converges to E{D2I(D2 > A)}/E(D2) as n → ∞. We conclude that lim supn Ln ≤ E{D2I(D2 > A)}/E(D2).
This holds for arbitrarily large A, and this expression tends to 0 as A → ∞. Because
E(D2) < ∞, E{D2I(D2 > A)} → 0 as A → ∞, so Ln → 0 and the Lindeberg condition is satisfied.
We conclude that, for almost all ω (equivalently, for almost all sequences d1, d2,...), the
permutation distribution of Sn = Xn1 + ⋯ + Xnn is asymptotically normal with mean 0 and variance
Σi=1n di2. In other words, the one-sided permutation test is asymptotically equivalent to
rejecting the null hypothesis if Zn > zα, where
Zn = Σi=1n Di/(Σi=1n Di2)1/2 (8.16)
and zα is the (1 − α)th quantile of the standard normal distribution. Note that Zn is very
closely related to the usual t-statistic, except that the variance estimate is
(1/n)Σi=1n Di2 instead of s2 = Σi=1n(Di − D̄n)2/(n − 1). It is an exercise to show that these two variance estimates
are asymptotically equivalent under the null hypothesis.
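Under sign flipping the permutation sum has mean 0 and variance Σ di2 for fixed d1,..., dn, so a one-sided permutation p-value can be compared with the tail area of the matching normal approximation. The sketch below (illustrative, with made-up Gaussian differences) shows the two p-values agreeing closely.

```python
import math
import random

def perm_p_value(diffs, n_resamples=5000, seed=11):
    """One-sided permutation p-value: randomly flip signs of the observed
    differences and count sums at least as large as the observed sum."""
    rng = random.Random(seed)
    observed = sum(diffs)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped >= observed:
            hits += 1
    return hits / n_resamples

rng = random.Random(5)
diffs = [rng.gauss(0.4, 1.0) for _ in range(40)]   # hypothetical data

# Normal approximation: sum(D) is approximately N(0, sum(D^2)) under H0
z = sum(diffs) / math.sqrt(sum(d * d for d in diffs))
normal_p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))
print(round(perm_p_value(diffs), 3), round(normal_p, 3))
```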
We end this section with an important theorem on the rate of convergence of the
normalized statistic to the standard normal distribution.
Let Xi be iid with mean μ and variance σ2, and set ρ = E(|Xi − μ|3) < ∞. Let Fn(z) be the
distribution function for n1/2(X̄n − μ)/σ, and Φ(z) be the standard normal distribution
function. There is a universal constant C such that
supz |Fn(z) − Φ(z)| ≤ Cρ/(σ3n1/2)
for all n = 1,2,.... Here, universal means that the same C can be used regardless of the
distribution of the Xi (subject only to having mean μ, variance σ2, and third absolute
moment ρ < ∞).
Exercises
1. Imagine infinitely many quincunxes, one with a single row, another with two rows,
another with three rows, etc. Roll one ball on each quincunx. What is the
probability that the ball is in the rightmost bin of infinitely many of the quincunxes?
2. Let X1 have distribution F with mean 0 and variance 1, and X2,X3,... be iid with
point mass distributions at 0. That is, Xi = 0 with probability 1 for i ≥ 2. What is
the asymptotic distribution of Sn = X1 + ⋯ + Xn? Does the CLT hold?
3. In Example 8.16, prove directly that the Lindeberg condition does not hold.
4. Let Uni be independent Bernoullis with probability pn, 1 ≤ i ≤ n, and let Xni = Uni − pn.
Prove that if ε < pn < 1 − ε for all n, where ε > 0, then Xni satisfies Lyapounov's
condition with r = 3.
5. Let Xni, 1 ≤ i ≤ n, be independent and uniformly distributed on (−an, an), an > 0.
Prove that Lyapounov's condition holds for r = 3.
6. Let Yi be iid random variables taking values ±1 with probability 1/2 each. Prove
that the random variables Xni = (i/n)Yi satisfy Lyapounov's condition with r = 3.
7. Prove that the Lindeberg CLT (Theorem 8.14) implies the standard CLT (Theorem
8.1).
8. What is the asymptotic distribution of Sn in Example 8.11? Hint: recall that
Σi=1∞ Ui/2i is the base 2 representation of a number picked randomly from [0,1].
9. Let Di be iid from a distribution with mean 0 and finite variance σ2, and let Tn be
the usual one-sample t-statistic.
With Zn defined by Equation (8.16), prove that Tn − Zn converges in probability to 0 under the null hypothesis
that E(Di) = 0. What does this say about how the t-test and permutation test
compare under the null hypothesis if n is large?
10. Consider the context of the preceding problem. To simulate what might happen to
the t-test if there is an outlier, replace the nth observation by n1/2.
Figure 8.4
for constants β0, β, and θ. If β > 0, then θ corresponds to the time when blood pressure is
highest. Equation (8.17) is linear in β0 and β, but not θ. We can re-parameterize Equation
(8.17) to make it linear in all coefficients using the formula for the cosine of a
difference. This leads to
where β1 = β cos(θ) and γ1 = β sin(θ). Thus, the first harmonic includes both the cos(x)
and sin(x) term. Equation (8.18) is linear in the coefficients β0, β1 and γ1. This simple
model accommodates a single peak and single trough in the interval [0, 2π]. Figure 8.5
shows that the first harmonic model applied to the data in Figure 8.4 does not seem to
fit. For example, it does not seem to represent what is happening between hours 0 and 6.
Figure 8.5
First harmonic does not fit the blood pressure data well.
To better fit the data, we can add a second harmonic consisting of cos(2x) and sin(2x):
In matrix notation,
This model allows up to two peaks and two troughs, and fits the data quite well. In fact,
the curve in Figure 8.4 uses a two harmonic model.
Table 8.1 summarizes the results of our harmonic regression. Notice that the coefficients
for a given term do not depend on whether other terms are included in the model. For
example, the constant term remains 105.0521 whether 0, 1, or 2 harmonics are included
in the model; the coefficient for cos(x) is 6.6431 whether 1 or 2 harmonics are
included in the model. This is because the predictor vectors (the columns of the design
matrix in the middle of Equation (8.20)) are all orthogonal.
Table 8.1
cos(2x) 0.8448
sin(2x) 3.1463
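The invariance of coefficients under orthogonal predictors is easy to demonstrate. The sketch below uses made-up harmonic data on an equally spaced grid (not the blood pressure data of Table 8.1) and shows that the cos(x) coefficient is unchanged by adjusting for another harmonic.

```python
import math

# Hypothetical data: n equally spaced points x_i = 2*pi*i/n and responses
# built from a few harmonics; the numbers are invented for illustration.
n = 24
xs = [2 * math.pi * i / n for i in range(1, n + 1)]
ys = [105 + 7 * math.cos(x) + 3 * math.sin(2 * x) + math.sin(5 * x) for x in xs]

def coef(basis, resp):
    """Univariate least-squares coefficient of one predictor column."""
    return (sum(b * r for b, r in zip(basis, resp))
            / sum(b * b for b in basis))

cos1 = [math.cos(x) for x in xs]
sin2 = [math.sin(2 * x) for x in xs]

b_alone = coef(cos1, ys)                               # cos(x) fit alone
resid = [r - coef(sin2, ys) * b for r, b in zip(ys, sin2)]
b_adjusted = coef(cos1, resid)                         # after removing sin(2x)
# orthogonality of the harmonic columns makes the two coefficients equal
print(round(b_alone, 6), round(b_adjusted, 6))
```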
If the two harmonic model does not fit, we can add more harmonics. In fact, we can
overfit the data using a saturated model with as many parameters as there are
observations:
The regression coefficient for each term of the saturated model is the same as it would
be in univariate regression. For 0 < k < n/2,
When we write the regression model in matrix notation as in Equation (8.20), we see
that the n vectors of the design matrix form an orthogonal basis for Rn, so Equation
(8.21) reproduces the Ys without error. But the Ys are completely arbitrary; they could
represent values of an arbitrary function f(x) with domain the n equally spaced values xi
= 2πi/n. In other words, for any such function f(x), we have
Now suppose that f(x) is the probability mass function of a random variable X taking on
n possible, equally-spaced values xi = 2πi/n, i = 1,...,n. Then E{cos(kX)} = Σi=1n cos(kxi)f(xi), and
E{sin(kX)} = Σi=1n sin(kxi)f(xi). Once we know E{cos(kX)} and E{sin(kX)} for all k = 1,..., n/2, we can
reproduce the probability mass function f, and therefore the distribution function of X.
The same argument shows that for any discrete random variable taking on n equally-
spaced values (not necessarily restricted to the interval [−2π, 2π]), knowledge of
E{cos(tX)} and E{sin(tX)} for sufficiently many values of t reproduces the distribution
function F(x) of X. As n increases, the number of t for which we need to know
E{cos(tX)} and E{sin(tX)} to reproduce F also increases.
8.4 Characteristic Functions
The previous section motivates the following definition of a characteristic function.
The key result that we have motivated through harmonic regression is that the
characteristic function, as its name implies, characterizes a distribution function.
The characteristic function φ(t) uniquely determines the distribution function F(x) of the
random variable X. That is, if F1 and F2 are distribution functions with the same
characteristic function φ(t), then F1 = F2.
1. If z = a + bi, then max(|a|, |b|) ≤ |z| ≤ |a| + |b|.
2. Triangle inequality for complex numbers: If zj, j = 1,...,n are complex numbers,
then |z1 + ⋯ + zn| ≤ |z1| + ⋯ + |zn|.
Figure 8.6
Part 1 of Proposition 8.24 says that the length of the hypotenuse in a right triangle is
between the length of the longer leg and the sum of lengths of the two legs.
If Σj=1∞ |zj| < ∞, then Σj=1∞ zj exists, is finite, and has the same value for any rearrangement
of terms.
We are now in a position to extend power series to complex arguments. Any real power
series Σn=0∞ anxn converging absolutely for |x| < R has a complex analog Σn=0∞ anzn converging
for |z| < R. This follows from part 1 of Proposition 8.24 because, with Re denoting the
real part of a complex number, Σn=0∞ |Re(anzn)| ≤ Σn=0∞ |an||z|n < ∞ by assumption if
|z| < R, and similarly for the imaginary part.
The series in Definition 8.26 converges for all complex numbers z by Proposition 8.25
because Σn=0∞ |z|n/n! is the (convergent) Taylor series for the function exp(|z|) of the real-
valued argument |z|.
We can now express the characteristic function in terms of the exponential function:
Proposition 8.27. φ(t) = E{exp(itX)}.
Proposition 8.28. Product rule extends to exponential functions with complex arguments:
exp(z1 + z2) = exp(z1)exp(z2) for all complex z1, z2.
Proof.
The reversal of order of summation in the second line is justified either using Tonelli's
theorem or as follows. The double sum may be viewed as a single sum of countably
many terms. The norm of the summand is |z1|j|z2|k/(j! k!), a nonnegative
sequence of real numbers whose sum is invariant to order of summation. Going
backwards from line 2 to line 1 but replacing the summand by its norm, we find that
the sum of norms is exp(|z1|)exp(|z2|) < ∞. The reversal of order of summation in line 2
is now justified by Proposition 8.25.
1. |φ(t)| ≤ 1 for all t.
2. If a and b are constants, the ch.f of aX + b is φaX+b(t) = exp(itb)φX(at).
3. If X = c with probability 1 for some constant c, then φ(t) = exp(itc).
4. φX(−t) is the complex conjugate of φX(t).
5. X is symmetric about 0 (i.e., X and −X have the same distribution functions) if and
only if φX(t) is real.
6. φ(t) is a uniformly continuous function of t.
Going from step 2 to step 3 follows from the fact that |E{exp(itX)}| ≤ E{|exp(itX)|} = 1 because
|exp(itX)| = 1 (alternatively, we could apply Jensen's inequality).
Item 5 is proven as follows. If X and −X have the same distribution functions, then
E{sin(tX)} = E{sin(−tX)} = −E{sin(tX)}, which implies that E{sin(tX)} must be 0. Thus, if X and
−X have the same distribution functions, then the imaginary part of φX(t) is 0 (i.e., φX(t)
is real). To go the opposite direction, suppose that φX(t) is real. Note that
E{cos(tX)} = E{cos(−tX)} because cos(θ) = cos(−θ) for all θ. Therefore, the real parts of the
ch.f.s of X and −X coincide. The fact that φX(t) is real means that E{sin(tX)} = 0, so
E{sin(t(−X))} = −E{sin(tX)} = 0. Therefore, the imaginary parts of the ch.f.s of X and −X
coincide as well. We have shown that X and −X have the same ch.f.s. By Theorem 8.23,
X and −X have the same distribution functions.
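Item 5 can also be checked by Monte Carlo: for data symmetric about 0 the imaginary part E{sin(tX)} of the empirical ch.f. is near 0, while for skewed data it is not. A sketch (not from the text) with simulated normal and exponential samples:

```python
import cmath
import random

def empirical_chf(samples, t):
    """Monte Carlo estimate of E[exp(itX)]."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

rng = random.Random(9)
sym = [rng.gauss(0.0, 1.0) for _ in range(20000)]        # symmetric about 0
skew = [rng.expovariate(1.0) for _ in range(20000)]      # not symmetric

for t in (0.5, 1.0, 2.0):
    print(t,
          round(empirical_chf(sym, t).imag, 3),    # near 0: ch.f. is real
          round(empirical_chf(skew, t).imag, 3))   # clearly nonzero
```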
Table 8.2 shows the characteristic function for some common distributions.
Table 8.2
Distribution — Characteristic function
Binomial (n, p) — {1 − p + p exp(it)}n
Bernoulli (p) — 1 − p + p exp(it)
Poisson (λ) — exp{λ(exp(it) − 1)}
Geometric (p) — p/{1 − (1 − p)exp(it)}
Uniform (a, b) — {exp(itb) − exp(ita)}/{it(b − a)}
Normal (μ, σ2) — exp(itμ − σ2t2/2)
Cauchy (μ, σ) — exp(itμ − σ|t|)
Gamma (α, λ) — (1 − it/λ)−α
Chi-squared (k) — (1 − 2it)−k/2
Exponential (λ) — (1 − it/λ)−1
Proposition 8.30. Product rule for the characteristic function of the sum of independent
random variables
If X and Y are independent, then φX+Y(t) = φX(t)φY(t).
This is the ch.f. of a random variable. The result now follows from
Theorem 8.23.
Let Xi be iid with characteristic function φX(t) with no real roots, and let Fj(x) be the
distribution of X1 + ⋯ + Xj, j = 1,... Suppose that Y1 and Y2 are independent with Y1 ~ Fk
and Y1 + Y2 ~ Fn for some n > k. Then Y2 ~ Fn−k.
Proof. Let φX(t) be the ch.f. of X. By Proposition 8.30, the ch.f.s of Y1 and Y1 + Y2 are
{φX(t)}k and {φX(t)}n. Also, the ch.f. of Y1 + Y2 is φY1(t)φY2(t) by Proposition 8.30. Thus,
{φX(t)}kφY2(t) = {φX(t)}n. Because φX(t) has no real roots, we can divide both sides of this
equation by {φX(t)}k, yielding φY2(t) = {φX(t)}n−k. By Theorem 8.23, Y2 has distribution function
Fn−k.
Proof. It is helpful to consider the real and imaginary parts R(t) and I(t) of φ(t)
separately. By the mean value theorem from calculus,
{cos((t + δ)x) − cos(tx)}/δ = −x sin(ξ),
where ξ is between tx and tx + δx. Now take the limit as δ → 0. The magnitude of the
integrand is |x sin(ξ)| ≤ |x|, which is integrable by assumption. Moreover, the integrand
tends to −x sin(tx) as δ → 0 because ξ → tx. The DCT implies that the integral tends to
∫ −x sin(tx) dF(x) as δ → 0. This proves that R(t) is differentiable and
R′(t) = E{−X sin(tX)}. A similar argument shows that the derivative of I(t) is E{X cos(tX)}. It
follows that the derivative of φ(t) is E{−X sin(tX)} + iE{X cos(tX)} = iE{X exp(itX)}. This
proves that the result is true for k = 1. The rest of the proof using induction is similar
and is left as an exercise.
The moment generating function was so-named because it also recovers moments, as
seen by the following result.
Exercises
1. Use ch.f.s to prove that if X1 and X2 are independent Poissons with respective
parameters λ1 and λ2, then X1 + X2 is Poisson with parameter λ1 + λ2.
2. Use ch.f.s to prove that if X1 and X2 are independent exponentials with parameter
λ, then X1 + X2 is gamma with parameters 2 and λ.
3. Use ch.f.s to prove that if X1 and X2 are iid random variables, then the distribution
function for X1 − X2 is symmetric about 0.
4. Use the representation cos(t) = (1/2)exp(it) + (1/2)exp(−it) to read out the probability mass
function that corresponds to the characteristic function cos(t). Describe the
distribution corresponding to the ch.f. {cos(t)}n.
5. Use the CLT in conjunction with the preceding problem to deduce that {cos(t/√n)}^n converges to exp(−t²/2) as n → ∞. Then verify this fact directly. Hint: write the log of {cos(t/√n)}^n as n ln{cos(t/√n)} (this is not problematic because cos(t/√n) is nonnegative for n sufficiently large) and use L'Hospital's rule as many times as needed.
6. Let Y be a mixture of two normal random variables: Y = X1 or Y = X2 with probability π and 1 − π, respectively, where Xj ~ N(μj, σj²). Show that Y has ch.f. π exp(iμ1t − σ1²t²/2) + (1 − π) exp(iμ2t − σ2²t²/2).
7. Use ch.f.s to prove that the distribution of the sample mean of n iid observations from the Cauchy distribution with parameters μ and σ is Cauchy with parameters μ and σ.
8. Let Y1, Y2 be iid Cauchy with parameters μ and σ, and let λ ∈ (0, 1). Use ch.f.s to deduce the distribution of λY1 + (1 − λ)Y2.
9. The geometric distribution is the distribution of the number of failures before the
first success in iid Bernoulli trials with success probability p. Given its ch.f. in
Table 8.2, determine the ch.f. of the number of failures before the sth success.
10. Suppose that Y1 and Y2 are independent, Y1 has a chi-squared distribution with k
degrees of freedom, and Y1 + Y2 has a chi-squared distribution with n degrees of
freedom. Prove that Y2 has a chi-squared distribution with n − k degrees of
freedom.
11. Suppose that Z1 and Z2 are independent, Z1 ~ N(μ1, σ1²), and Z1 + Z2 ~ N(μ, σ²), where σ² > σ1². Prove that Z2 ~ N(μ − μ1, σ² − σ1²).
12. Show that the following are NOT ch.f.s.
Let Xn be a sequence of random variables with ch.f.s φn(t), and let X be a random variable with ch.f. φ(t). Then Xn converges in distribution to X if and only if φn(t) → φ(t) as n → ∞ for every t.
Now suppose that φn(t) → φ(t) for all t. It can be shown that the sequence of distribution functions Fn(x) for Xn is tight. We will prove that every subsequence contains a further subsequence converging to the distribution function F for X. Proposition 6.44 will then imply that Xn converges in distribution to X.
If X is a random variable with mean 0 and variance 1, then its characteristic function φ(t) satisfies φ(t) = 1 − t²/2 + o(t²) as t → 0.
Proof. Taylor's theorem for a function of a real variable (Theorem A.49) and Proposition 8.33 imply that the real and imaginary parts of φ satisfy
It is helpful to start with symmetric random variables because the characteristic function is real (Proposition 8.29, part 5). The ch.f. of Yn = n^(−1/2)(X1 + ... + Xn) is also real and is given by
{φ(t/√n)}^n = {1 − t²/(2n) + rn}^n,   (8.30)
where the remainder term rn is o{t²/(2n)}. Now take the natural logarithm of both sides to get
n ln{1 − t²/(2n) + rn} → −t²/2 as n → ∞.
Therefore, the ch.f. of Yn tends to exp(−t²/2), the ch.f. of a N(0,1) random variable. By Theorem 8.36, Yn converges in distribution to N(0,1). This completes the proof when the Xi are symmetric about 0.
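For the simplest symmetric case, Xi = ±1 with probability 1/2 each, the ch.f. is φ(t) = cos(t), and the convergence {cos(t/√n)}^n → exp(−t²/2) can be watched directly (this is also exercise 5 of the preceding exercise set). A minimal sketch at one arbitrary value of t:

```python
import math

t = 1.3                               # arbitrary fixed argument
limit = math.exp(-t**2 / 2)           # N(0,1) ch.f. at t

# Ch.f. of Y_n = n^(-1/2)(X_1 + ... + X_n) when the X_i are +-1 coin flips
vals = [math.cos(t / math.sqrt(n)) ** n for n in (10, 100, 10_000)]
errors = [abs(v - limit) for v in vals]
```

The error decreases roughly like 1/n, consistent with the remainder term rn = o{t²/(2n)}.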
The proof in the preceding subsection was seamless because the ch.f.s for Xi and Yn
are real when the Xi are symmetric about 0. When the Xi are arbitrary iid random
variables with mean 0 and variance 1, the ch.f. of Yn is complex-valued. Taking its
logarithm is problematic because the logarithm of a complex variable is not unique. To
avoid this problem, we must extend a result from calculus to complex numbers.
Lemma 8.38. The exponential function as a limit of (1 + zn/n)^n
If zn is a sequence of complex numbers with zn → z, then (1 + zn/n)^n → exp(z) as n → ∞.
Proof.
Now take the limit as m → ∞ of Equation (8.34). The limit of the left side is
, while the limit of the right side is 0. We conclude that , so
.
Even without assuming that xn has a finite limit, the same argument shows that any
convergent subsequence xnk must converge to exp(z). Furthermore, no subsequence xnk
can converge to an infinite limit because implies that satisfies
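Lemma 8.38 is easy to illustrate numerically for a complex argument, the situation that arises below when zn → −t²/2 plus a complex remainder. The particular z used here is arbitrary.

```python
import cmath

def power_approx(z, n):
    """(1 + z/n)^n, which Lemma 8.38 says tends to exp(z)."""
    return (1 + z / n) ** n

z = complex(-0.845, 0.3)   # arbitrary complex number
errors = [abs(power_approx(z, n) - cmath.exp(z)) for n in (10, 1000, 100_000)]
```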
To use this lemma to prove the CLT, note that Expression (8.30) remains valid even if the distribution of Xi is not symmetric about 0. The only difference is that the remainder term rn is complex instead of real. Therefore, the ch.f. of Yn = n^(−1/2)(X1 + ... + Xn) is of the form (1 + zn/n)^n, where zn → −t²/2 as n → ∞ because nrn → 0. By Lemma 8.38, the ch.f. of Yn converges to exp(−t²/2), the ch.f. of a standard normal deviate. By Theorem 8.36, Yn converges in distribution to N(0,1). This completes the proof of the standard CLT.
Exercises
1. Use ch.f.s to give another proof of the fact that if Xn converges in distribution to X, Yn converges in distribution to Y, and Xn and Yn are independent, then Xn + Yn converges in distribution to X + Y, where X and Y are independent (see also Problem 2 of Section 6.3).
2. Use ch.f.s to give another proof of the fact that, in Example 8.13, the asymptotic distribution of the suitably normalized sample mean is a mixture of two normals (see Problem 6 of Section 8.4).
3. Modify Example 8.13 so that the first random variable is −n^(1/2), 0, or +n^(1/2) with probability 1/3 each. Show that the asymptotic distribution of the suitably normalized sample mean is a mixture of three normals.
4. Use characteristic functions to prove the law of small numbers (Proposition 6.24).
The key results for ch.f.s of random variables carry over to ch.f.s of random vectors as
well. For example, the following is a generalization of Proposition 8.23.
Proposition 8.40. Extension of Proposition 8.23: the multivariate ch.f. characterizes the
distribution of X
A ch.f. φ(t) uniquely determines the distribution function F(x) of the random vector X.
That is, two random vectors with the same ch.f.s have the same distribution functions as
well.
Proposition 8.41. Extension of Proposition 8.30: product rule for the ch.f. of a sum of
independent random vectors
Proof. For t ∈ R^k, let Y1 = t′X1 and Y2 = t′X2. Then Y1 and Y2 are independent 1-dimensional random variables. By Proposition 8.30, the ch.f. of Y1 + Y2 is φY1(s)φY2(s). Take s = 1 to deduce the result for two random vectors. The result for n random vectors follows by induction.
Let Xn be a sequence of k-dimensional random vectors with ch.f.s φn(t), and let X be a k-dimensional random vector with ch.f. φ(t). Then Xn converges in distribution to X if and only if φn(t) → φ(t) as n → ∞ for all t ∈ R^k.
Notice that we can obtain the ch.f. of a random vector from knowledge of the distribution of t′X for each t ∈ R^k. Moreover, the ch.f. completely determines the distribution function of X. Thus, we can deduce the multivariate distribution function of X from the distribution functions of t′X for all t ∈ R^k. This suggests the possibility of deducing the limiting distribution function of a sequence Xn from the limiting distributions of t′Xn, t ∈ R^k. This important reduction technique, known as the Cramér-Wold device, can be proven using Proposition 8.42.
We begin by defining the multivariate normal distribution function, starting with the
bivariate normal. Its density was given in Section 5.6, but here we provide an
alternative definition in terms of linear combinations of independent standard normals.
Let Z1, Z2 be iid standard normals, and consider the joint distribution of two linear combinations, Y1 = a11Z1 + a12Z2, Y2 = a21Z1 + a22Z2. We can write Y in matrix notation as Y = AZ, where Y = (Y1, Y2)′, Z = (Z1, Z2)′, and A is the 2 × 2 matrix of coefficients aij.
It is easy to see that Y = AZ has variances σ1² = a11² + a12² and σ2² = a21² + a22² and correlation ρ = (a11a21 + a12a22)/(σ1σ2). We can add the constant vector μ = (μ1, μ2)′ to Y to make the mean of Y equal μ. A random vector Y = AZ + μ, where A is a 2 × 2 matrix and Z = (Z1, Z2)′ are two iid standard normals, is said to have a bivariate normal distribution.
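The construction Y = AZ + μ can be checked by simulation: the sample covariance matrix of simulated Y should approach AA′ and the sample mean should approach μ. The matrix A and vector μ below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.0],
              [1.0, 1.5]])            # hypothetical 2 x 2 coefficient matrix
mu = np.array([1.0, -1.0])            # hypothetical mean vector

n = 300_000
Z = rng.standard_normal((n, 2))       # rows are iid standard normal pairs
Y = Z @ A.T + mu                      # Y = A Z + mu, one row at a time

cov_exact = A @ A.T                   # covariance of A Z is A A'
cov_sample = np.cov(Y.T)
mean_sample = Y.mean(axis=0)
```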
More generally, let Z = (Z1,..., Zk)′ be iid standard normals, and let Y = AZ, where A is an n × k matrix. Each Yi is normal with mean 0. The covariance matrix of Y, defined by E(YY′), is
E{AZ(AZ)′} = AE(ZZ′)A′ = AA′.
Any covariance matrix is positive definite, meaning that a′Σa ≥ 0 for any k-dimensional vector a (we say strictly positive definite if this quantity is strictly positive for every a ≠ 0). Again the key question is whether, given an arbitrary covariance matrix Σ, we can find a matrix A such that cov(AZ) = Σ. The following result from linear algebra provides an affirmative answer.
The characteristic function of the normal distribution with mean vector 0 and covariance matrix Σ is exp(−t′Σt/2).
Proof. The ch.f. for each Zi is exp(−ti²/2). By Proposition 8.30, the ch.f. for Z is
E{exp(it′Z)} = exp(−t1²/2) ··· exp(−tk²/2) = exp(−t′t/2). The ch.f. for AZ is
E{exp(it′AZ)} = E[exp{i(A′t)′Z}] = exp{−(A′t)′(A′t)/2} = exp(−t′AA′t/2) = exp(−t′Σt/2).
Let X1, X2,... be iid k-dimensional random vectors with mean vector μ and finite covariance matrix Σ, and let Sn = X1 + ... + Xn. Then n^(−1/2)(Sn − nμ) converges in distribution to a multivariate normal with mean vector 0 and covariance matrix Σ.
Proof. Let Yi = t′(Xi − μ). Then the Yi are iid random variables with mean 0 and variance t′Σt. By the CLT for random variables, n^(−1/2)(Y1 + ... + Yn) = t′{n^(−1/2)(Sn − nμ)} converges in distribution to N(0, t′Σt). By the Cramér-Wold device (Theorem 8.43), n^(−1/2)(Sn − nμ) converges in distribution to a multivariate normal with mean vector 0 and covariance matrix Σ.
Consider a medical study with two outcomes, say 30-day mortality and 30-day cardiovascular mortality. Let Xi and Yi be the indicators of 30-day death and cardiovascular death for patient i. Then (Xi, Yi), i = 1,..., n are independent pairs, though of course Xi and Yi are dependent. The covariance matrix Σ for (Xi, Yi) is given by σ11 = pX(1 − pX), σ22 = pY(1 − pY), and σ12 = ρ{pX(1 − pX)pY(1 − pY)}^(1/2), where ρ is the correlation between Xi and Yi. With n patients, the sample proportions of patients with the respective events are sample means (p̂X, p̂Y). The multivariate CLT implies that (p̂X, p̂Y) is asymptotically normal with mean (pX, pY) and covariance matrix (1/n)Σ. That is, n^(1/2){(p̂X, p̂Y) − (pX, pY)} converges in distribution to a bivariate normal random vector with mean vector (0, 0)′ and covariance matrix Σ.
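A simulation sketch of this bivariate CLT, with hypothetical event probabilities and with cardiovascular death nested within death so that the covariance is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
pX = 0.3           # hypothetical P(30-day death)
q = 0.5            # hypothetical P(cardiovascular death | death)
pY = pX * q        # cardiovascular death implies death in this sketch
n, reps = 500, 10_000

X = rng.random((reps, n)) < pX
Y = X & (rng.random((reps, n)) < q)

# Because Y_i <= X_i here, cov(X_i, Y_i) = E(X_i Y_i) - pX pY = pY (1 - pX).
Sigma = np.array([[pX * (1 - pX), pY * (1 - pX)],
                  [pY * (1 - pX), pY * (1 - pY)]])

# sqrt(n) {(phat_X, phat_Y) - (pX, pY)} across many replications
Zn = np.sqrt(n) * np.column_stack([X.mean(axis=1) - pX, Y.mean(axis=1) - pY])
Sigma_hat = np.cov(Zn.T)
```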
One application of the multivariate CLT involves chi-squared statistics and goodness of
fit tests. For example, we may want to know whether the flu is equally likely to occur in
the 4 different seasons, spring, summer, fall, and winter. We have data from a large
number, n, of patients with flu. Each patient's data Y is either (1,0,0,0), (0,1,0,0), (0,0,1,0) or (0,0,0,1) depending on whether flu occurred in spring, summer, fall, or winter, respectively. The total numbers of patients with flu in the different seasons are the components of the sum, Sn = Y1 + ... + Yn, of n independent vectors. Under the hypothesis that flu is equally likely to occur in any season, the expected number of observations in each season is n/4. The chi-squared statistic is of the form Σj (Snj − n/4)²/(n/4), where Snj is the jth component of Sn and the sum is over j = 1,..., 4.
More generally, with k categories, the chi-squared statistic for testing uniformity is
Σj (Snj − n/k)²/(n/k), the sum being over j = 1,..., k.
We prove that under the null hypothesis, the above goodness of fit statistic converges in distribution to a chi-squared random variable with k − 1 degrees of freedom as n → ∞.
Note that Sn is the sum of n iid vectors; a generic vector Y has mean E(Y) = (1/k,..., 1/k)′. Each component Yi of Y is Bernoulli (1/k), so its variance is (1/k)(1 − 1/k). Also, if i ≠ j, then cov(Yi, Yj) = E(YiYj) − E(Yi)E(Yj). But YiYj = 0 because only one of Y1,..., Yk is nonzero. Thus, cov(Yi, Yj) = −(1/k)². Therefore, the covariance matrix Σ = cov(Y) has diagonal elements (1/k)(1 − 1/k) and off-diagonal elements −(1/k)². By the multivariate CLT, n^(−1/2){Sn − n(1/k,..., 1/k)′} converges in distribution to a multivariate normal vector U with mean vector 0 and covariance matrix Σ. By the Mann-Wald mapping theorem (Theorem 6.59), the goodness of fit statistic converges in distribution to k(U1² + ... + Uk²).
Proof. Exercise.
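A simulation sketch of the goodness-of-fit limit: under uniformity the statistic should have mean close to k − 1 and variance close to 2(k − 1), the mean and variance of a chi-squared with k − 1 degrees of freedom. The values of n, k, and the replication count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
k, n, reps = 4, 1000, 20_000

# Each row is (S_n1, ..., S_nk), the category counts for one simulated study
counts = rng.multinomial(n, [1 / k] * k, size=reps)
stat = ((counts - n / k) ** 2 / (n / k)).sum(axis=1)

mean_stat, var_stat = stat.mean(), stat.var()   # compare with k-1 and 2(k-1)
```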
Example 8.51.
Clinical trials are monitored several times to ensure patient safety and determine
whether efficacy of treatment has been established. Many test statistics involve sums of iid random variables with mean 0 under the null hypothesis. Therefore, if we monitor for efficacy k times, we are examining overlapping sums SN1,..., SNk, where SN = X1 + ... + XN, E(Xj) = 0, var(Xj) = σ². Let ZNi = SNi/(σ√Ni). To protect against falsely declaring a treatment benefit, we must determine the joint distribution of (ZN1,..., ZNk) and construct boundaries b1,..., bk such that P(ZN1 < b1,..., ZNk < bk) = 1 − α.
Note that
where and .
Let N → ∞, and assume that tNi → ti, i = 1,..., k. By the CLT for iid random variables (Theorem 8.1), each YNi converges in distribution to a normal limit as N → ∞. By Slutsky's theorem (Theorem 6.52),
Also, the YNi are independent because they involve non-overlapping sums. Therefore,
the Yi are independent (see Exercise 2 of Section 6.3).
Let g : R^k → R^k be the continuous function g(y) = (y1, y1 + y2,..., y1 + ... + yk). By the Mann-Wald mapping theorem (Theorem 6.59),
Although we assumed known variance σ², the same result holds if we have an estimator σ̂ of σ available at information fraction tNi. In that case
Exercises
3. Let Y have a trivariate normal distribution with zero means, unit variances, and pairwise correlations ρ12, ρ13, and ρ23. Show that Y has the same distribution as AZ, where Z = (Z1, Z2, Z3)′ are iid standard normals and
6. Let X have a standard normal distribution. Flip a fair coin and define Y = X if the coin lands heads and Y = −X if it lands tails.
Show that X and Y each have a standard normal distribution and cov(X, Y) = 0 but
X and Y are not independent. Why does this not contradict the preceding problem?
7. Suppose that X1 and X2 are iid N(μ, σ²) random variables. Use bivariate ch.f.s to prove that X1 − X2 and X1 + X2 are independent.
8. Let (X, Y, Z) be independent with respective (finite) means μX, μY, μZ and respective (finite) variances σX², σY², σZ². Let (Xi, Yi, Zi), i = 1,..., n, be independent replications of (X, Y, Z). Show that the asymptotic distribution as n → ∞ is bivariate normal, and find the asymptotic mean vector and covariance matrix.
9. Let Y be multivariate normal with mean vector 0 and strictly positive definite covariance matrix Σ. Let Σ^(1/2) be a symmetric square root of Σ; i.e., (Σ^(1/2))′ = Σ^(1/2) and Σ^(1/2)Σ^(1/2) = Σ. Define Z = (Σ^(1/2))^(−1)Y. What is the distribution of Z?
10. Let Γ be a k × k orthogonal matrix; i.e., Γ′Γ = I.
(a) Prove that ‖Γy‖ = ‖y‖ for all k-dimensional vectors y. That is, orthogonal transformations preserve length.
(b) Prove that if Y1,..., Yk are iid normals, then the components of Z = ΓY are also independent. That is, orthogonal transformations of iid normal random variables preserve independence.
11. Helmert transformation. The Helmert transformation for iid N(μ, σ²) random variables Y1,..., Yn is Z = HY, where
8.7 Summary
1. The ch.f. of a random variable X is φ(t) = φX(t) = E{exp(itX)}.
4. Lindeberg CLT: Let Xni, i = 1,..., n be independent with mean μni and variance σni², and let sn² = σn1² + ... + σnn². If the Lindeberg condition
Example 9.1. Test statistic converges in distribution, but p-value does not converge to
that of the asymptotic distribution
Suppose that X1,..., Xn are uniformly distributed on the interval [0, θ], and we test the null hypothesis that θ = 1 versus the alternative hypothesis that θ > 1. It can be shown that the most powerful test rejects the null hypothesis for large values of Yn = max(X1,..., Xn). Suppose that the null hypothesis is true, so Xi ~ uniform [0, 1]. The distribution function Fn(y) for Yn is Fn(y) = y^n, 0 ≤ y ≤ 1. The test at level α rejects the null hypothesis if Yn ≥ yn,α, where yn,α = (1 − α)^(1/n). Notice that for each fixed y, Fn(y) → F(y), where F(y) = 0 for y < 1 and F(y) = 1 for y ≥ 1.
We can recast this example in terms of p-values. The (random) p-value is 1 − Fn(Yn) = 1 − Yn^n. For arbitrary p, the probability that this p-value is p or less is P(1 − Yn^n ≤ p) = P{Yn ≥ (1 − p)^(1/n)} = 1 − (1 − p) = p. This is not surprising; because the null distribution of Yn is continuous, the p-value has a uniform distribution on [0, 1]. Thus, there is a 50 percent chance that the p-value will be 1/2 or less. On the other hand, if we approximate the p-value by 1 − F(Yn), then with probability 1 this approximate p-value will be 1.
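A simulation sketch of the contrast: the exact p-value 1 − Yn^n is uniform on [0, 1] for every n, while the p-value computed from the limiting d.f. F (degenerate at 1) equals 1 for every realization. The sample size and replication count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 50, 50_000

Yn = rng.random((reps, n)).max(axis=1)   # test statistic under theta = 1
p_exact = 1 - Yn ** n                     # exact p-value, Uniform(0, 1)

frac_half = (p_exact <= 0.5).mean()       # should be about 1/2
# Approximate p-value 1 - F(Yn): F is degenerate at 1 and Yn < 1 a.s.,
# so the approximation equals 1 for every simulated sample.
p_approx = 1.0
```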
Proof. Let ε > 0. We must show that we can find an N such that |Fn(x) − F(x)| < ε for all x and n ≥ N.
Because Fn converges in distribution, Fn is a tight sequence of distribution functions by Corollary 6.50. Therefore, we can find a value t1 such that P(|Xn| > t1) < ε/2 for all n. Also, there is a t2 such that P(|X| > t2) < ε/2 (why?). Let T = max(t1, t2). Then for x < −T,
For x > T,
Now consider the closed interval [−T, T]. Because the limiting distribution function F is continuous on the compact set [−T, T], F is uniformly continuous on [−T, T] by Proposition A.62. It follows that we can find a δ such that |F(x) − F(y)| < ε/3 whenever x, y ∈ [−T, T] and |x − y| < δ.
1. Divide the interval [−T, T] into M equal intervals of length 2T/M, where M is chosen large enough that 2T/M < δ: E0 = −T, E1 = −T + 2T/M,..., EM = T (Figure 9.1).
2. Because the left and right endpoints Li = Ei and Ri = Ei+1 of each interval satisfy Ri − Li < δ, |F(Ri) − F(Li)| < ε/3 for each i = 0,..., M − 1.
3. Choose N large enough that |Fn(Ei) − F(Ei)| < ε/3 for i = 0,..., M and n ≥ N. This is possible because Fn(Ei) → F(Ei) as n → ∞ for the finite set i = 0,..., M.
Figure 9.1
Now consider an arbitrary x in the interval [−T, T]. Then x must lie within one of the M intervals, say [Li, Ri). For n ≥ N,
completing the proof.
The null distribution functions of many standardized test statistics are continuous: standard normal when the CLT applies, chi-squared for Wald tests, etc. If Fn and F denote the actual and asymptotic distribution functions of a standardized test statistic Zn, the absolute value of the difference between the actual and approximate one-sided p-values is |{1 − Fn(z)} − {1 − F(z)}| ≤ sup_x |Fn(x) − F(x)|. By Pólya's theorem, this difference tends to 0. Therefore, using the asymptotic distribution to approximate a p-value is valid if the asymptotic distribution is continuous.
Pólya's theorem implies uniform convergence of not just P(Xn ≤ x) to P(X ≤ x), but of P(Xn ∈ I) to P(X ∈ I) for all intervals:
Proof. Uniformity of convergence for intervals of the form (a, b], a and b finite, follows from
P{Xn ∈ (a, b]} = Fn(b) − Fn(a).
For sets of the form [a, b], a and b finite, use the fact that P(Xn ∈ [a, b]) = P(Xn = a) + P{Xn ∈ (a, b]} in conjunction with the results we have just proven. The proof for other types of intervals is similar and left as an exercise.
Theorem 9.5. Pólya's theorem in R^k: uniform convergence over product sets of intervals
The proof for k = 2 is left as an exercise. Ranga Rao (1962) proves a more general
result implying Theorem 9.5.
Let Xn and X be random variables with density functions fn(x) and f(x) with respect to a measure μ. If fn(x) → f(x) except on a set of μ-measure 0, then Xn converges in distribution to X and sup_B |P(Xn ∈ B) − P(X ∈ B)| → 0, where the supremum is over Borel sets B.
Proof.
Exercises
1. Let Xn have density function fn(x) = 1 + cos(2πx)/n for x ∈ [0, 1]. Prove that P(Xn ∈ B) converges to the Lebesgue measure of B for every Borel set B.
3. Prove part of Theorem 9.5 in R², namely that if (Xn, Yn) ~ Fn(x, y) converges in distribution to (X, Y) ~ F(x, y) and F is continuous, then Fn converges uniformly to F. Do this in 3 steps: (1) for given ε > 0, prove there is a bound B such that P(|Xn| > B) ≤ ε and P(|Yn| > B) ≤ ε for all n; (2) use the fact that F is continuous on the compact set C = [−B, B] × [−B, B] to divide C into squares such that |Fn(x, y) − F(x, y)| is arbitrarily small for (x, y) corresponding to the corners of the squares; (3) use the fact that, within each square, |Fn(x2, y2) − Fn(x1, y1)| and |F(x2, y2) − F(x1, y1)| are maximized when (x1, y1) and (x2, y2) are at the southwest and northeast corners.
4. Recall that in Exercise 11 of Section 3.4, there are n people, each with a different hat. The hats are shuffled and passed back in random order. Let Yn be the number of people who get their own hat back. You used the inclusion-exclusion formula to see that P(Yn = 0) → exp(−1). Extend this result by proving that Yn converges in distribution to Y as n → ∞, where Y has a Poisson distribution with parameter 1. Conclude that sup_A |P(Yn ∈ A) − P(Y ∈ A)| → 0, where the supremum is over all subsets of Ω = {0, 1, 2,...}. Hint: P(Yn = k) = C(n, k) P(the first k people get their own hat back and none of the remaining n − k people get their own hat back).
Example 8.5 showed that the distribution of the variance s² of a sample of n iid observations is asymptotically normal, but for smaller sample sizes, s² has a right-skewed distribution even if the observations are normally distributed. For example, suppose that the Xi are iid N(0, 1). The exact density for Y = (n − 1)s²/σ² = (n − 1)s² is chi-squared with n − 1 degrees of freedom:
This density and its normal approximation with the same mean and variance are displayed for n = 10 as solid and dotted lines, respectively, in the top panel of Figure 9.2. The normal approximation does not fit well because the chi-squared density is right-skewed.
Figure 9.2
Top: The exact density of (n − 1)s² (solid line) and its asymptotic normal approximation (dashed line). Bottom: The exact density of ln{(n − 1)s²} (solid line) and its asymptotic normal approximation (dashed line).
To make the above argument for asymptotic normality rigorous, we must do better than saying that ln(x) is approximately ln(x0) + (1/x0)(x − x0). The following result fills in the details.
Proof. Note first that . This follows from the fact that, for an 0,
because and . The caveat that an 0 is no restriction because an
. Furthermore,
where
Although Proposition 9.8 does not require f′(μ) ≠ 0, the distribution of f′(μ)X is degenerate if f′(μ) = 0. The statement that Yn is asymptotically normal or asymptotically chi-squared, etc., means that there are sequences of numbers an and bn such that (Yn − an)/bn converges in distribution to Y, where Y is a non-degenerate normal random variable or non-degenerate chi-squared random variable, etc. Thus, in the following corollary, we impose the constraint that f′(μ) ≠ 0.
Transformations are also used to stabilize the variance. This happens when the
variance of an estimator depends on the parameter it estimates, as in the following
example.
For instance, suppose that 100 people have each amassed 2 years of follow-up, and there have been 34 events. Then t = 100 × 2 = 200 person-years, λ̂ = 34/200 = 0.17 events per year, and the 95% confidence interval for λ is
{√0.17 ± 1.96/(2√200)}² = (0.12, 0.23).
Therefore, we can be 95% confident that the true disease rate is between 0.12 per person-year and 0.23 per person-year.
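The interval above can be reproduced in a few lines. The square-root (variance-stabilizing) form {√λ̂ ± 1.96/(2√t)}² is an assumption consistent with this section's theme and with the endpoints quoted in the text.

```python
import math

D, t = 34, 200.0                      # events and person-years from the example
rate = D / t                          # 0.17 events per person-year

# sqrt(D/t) is approximately normal with mean sqrt(lambda) and variance 1/(4t),
# so square the endpoints of sqrt(rate) -+ 1.96 / (2 sqrt(t)).
half = 1.96 / (2 * math.sqrt(t))
lo = (math.sqrt(rate) - half) ** 2
hi = (math.sqrt(rate) + half) ** 2
```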
Consider the comparison of two groups with respect to a binary outcome like progression of disease. Let p̂1 and p̂2 denote the sample proportions with disease progression in the two groups, and p1 > 0 and p2 > 0 denote their expectations. The relative risk estimate is p̂1/p̂2. To determine the asymptotic distribution of the relative risk estimator, we first take logs: ln(p̂1/p̂2) = ln(p̂1) − ln(p̂2). Apply the delta method to ln(p̂). By the CLT, p̂ is AN{p, p(1 − p)/n}. Then f(x) = ln(x) satisfies f′(p) = 1/p. By Corollary 9.10, ln(p̂) is AN{ln(p), (1 − p)/(np)}. Also, p̂1 and p̂2 are independent. Therefore, the asymptotic distribution of ln(p̂1) − ln(p̂2) is normal with mean ln(p1) − ln(p2) and variance (1 − p1)/(n1p1) + (1 − p2)/(n2p2), where n1 and n2 are the sample sizes in the two groups. By Slutsky's theorem, we can replace p1 and p2 with p̂1 and p̂2. Thus,
which can be used to construct the following asymptotically valid 100(1 − α) percent confidence intervals for the logarithm of the relative risk:
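A sketch of the resulting interval for a hypothetical two-group dataset (the counts below are invented for illustration):

```python
import math

x1, n1 = 30, 100   # hypothetical events / sample size, group 1
x2, n2 = 15, 100   # hypothetical events / sample size, group 2
p1, p2 = x1 / n1, x2 / n2

log_rr = math.log(p1 / p2)
se = math.sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))

# 95% interval for ln(RR), then exponentiate to get an interval for RR itself
ci_rr = (math.exp(log_rr - 1.96 * se), math.exp(log_rr + 1.96 * se))
```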
Proof. If a function g(x, y) is differentiable at (x0, y0), then the partial derivatives gx(x0, y0) and gy(x0, y0) exist and comprise the derivative of g(x, y) at (x0, y0) (see Section A.6.3). By definition of derivative,
as (x, y) → (x0, y0).
As with Proposition 9.8, Proposition 9.13 does not require any additional conditions, but the limiting random variable is degenerate if both fx(μx, μy) and fy(μx, μy) are 0. For this reason, the next corollary assumes that at least one of these partial derivatives is nonzero.
Using Slutsky's theorem in conjunction with the fact that n2/n1 → λ, we conclude that
where Z ~ N(0, 1).
One arrives at the same asymptotic distribution for the relative risk using the asymptotic distribution of its logarithm derived in Example 9.12 and applying the delta method to f(x) = exp(x) (exercise).
Example 9.17. Estimator has infinite mean, but limiting random variable has finite mean
If p̂ is the sample proportion of ones among iid Bernoulli random variables X1,..., Xn, then by the CLT, p̂ is asymptotically normal with mean p = E(X1) and variance p(1 − p)/n. It follows by the delta method that ln(p̂) is asymptotically normal with mean ln(p) and variance (1 − p)/(np) (see Example 9.12). That is, Zn = {ln(p̂) − ln(p)}/{(1 − p)/(np)}^(1/2) converges in distribution to Z ~ N(0, 1). However, E(|Zn|) = ∞ for each n because p̂ has positive probability of being 0. Therefore, E(Zn) = −∞, whereas E(Z) = 0.
Even if Xn has finite mean, it is not necessarily the case that if Xn converges in distribution to X, then E(Xn) → E(X). In Example 6.34, Xn = exp(n) with probability 1/n and 0 with probability 1 − 1/n. Xn converges in probability (and therefore in distribution) to 0, but E(Xn) = exp(n)/n converges to ∞. The limiting random variable is degenerate, but we can easily modify the example to make the limiting variable non-degenerate. For example, consider Un = Xn + Y, where Y is any mean 0 random variable independent of Xn. Then Un converges in distribution to the non-degenerate random variable U = Y. Furthermore, E(Un) = exp(n)/n → ∞, whereas E(U) = 0.
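For the example just given, the tail expectation E{|Xn| I(|Xn| > A)} can be written down exactly, which makes the failure of uniform integrability (defined below) concrete: no single A controls it for all n.

```python
import math

def tail_expectation(n, A):
    """E{|X_n| I(|X_n| > A)} for X_n = exp(n) w.p. 1/n, and 0 otherwise."""
    return math.exp(n) / n if math.exp(n) > A else 0.0

A = 1e6
tails = [tail_expectation(n, A) for n in (5, 20, 40)]
# For any fixed A the tail expectation is 0 for small n, but equals exp(n)/n
# once exp(n) exceeds A, so it is unbounded in n.
```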
We now give a heuristic motivation of an additional condition required to ensure that the
convergence of Xn to X in distribution and the finiteness of E(Xn) imply that E(Xn)
converges to E(X). Following this informal presentation, we make the arguments
rigorous.
Note first that whether E(Xn) converges to E(X) depends only on the distribution functions of Xn and X. By the Skorokhod representation theorem, we can assume that the Xn and X are on the same probability space and Xn → X almost surely. For now we assume also that E(|X|) < ∞, though we later show that this is not necessary. Then
where the parts under the absolute value sign are numbered from left to right. We can show that Part 3 can be made small for all n by choosing A large enough. Once we choose a large value of A, we can make Part 1 small for sufficiently large n by the DCT because (|Xn| − |X|)I(|Xn| ≤ A) converges almost surely to 0 and is dominated by the integrable random variable A + |X|. Therefore, whether we can make |E(|Xn|) − E(|X|)| small depends entirely on whether we can choose A large enough to make Part 2 of Expression (9.9) small for all n. This leads us to the following definition.
The sequence Xn is said to be uniformly integrable (UI) if for each ε > 0 there is an A such that E{|Xn| I(|Xn| > A)} < ε for all n.
We are now in a position to make rigorous the arguments leading to Definition 9.18.
Proof. Because uniform integrability and convergence of means depend only on the distribution functions of Xn and X, we may assume, by the Skorokhod representation theorem, that the Xn and X are defined on the same probability space and that Xn → X almost surely.
To prove item 1, assume that Xn is UI. We will show first that E(|X|) < ∞. We can see that E(|Xn|) is bounded by taking ε = 1 in the definition of uniform integrability. That is, there is an A such that E{|Xn| I(|Xn| > A)} < 1 for all n, so
We will prove next that E(|Xn|) − E(|X|) → 0. Let ε > 0. We must find an N such that |E(|Xn|) − E(|X|)| < ε for n ≥ N. Apply the triangle inequality to Expression (9.9) and note that Part 2 and Part 3 are nonnegative to deduce that:
Write Part 3 as
Items 1 and 2 show that Part 3 is less than ε/3, while item 3 shows that Part 2 is less than ε/3. Thus, we have demonstrated that
We can choose N such that |Part 1| < ε/3 for n ≥ N because |E{(|Xn| − |X|)I(|Xn| ≤ A)}| → 0 by the DCT (see explanation in the paragraph following Equation (9.9)). Thus, |E(|Xn|) − E(|X|)| < ε/3 + 2ε/3 = ε. This completes the proof that E(|Xn|) → E(|X|).
To prove item 2 of Theorem 9.19, suppose that Xn is not UI. We will show that E(|Xn|) cannot converge to E(|X|). Because Xn is not UI, there exists an ε* > 0 such that for any A > 0, E{|Xn| I(|Xn| > A)} ≥ ε* for infinitely many n. Choose A large enough that Part 3 of Expression (9.9) is less than ε*/3 for all n. Then choose N large enough that the absolute value of Part 1 of Expression (9.9) is less than ε*/3 for n ≥ N. Because there are infinitely many n ≥ N such that the middle term of Expression (9.9) exceeds ε*, there are infinitely many n ≥ N such that |E(|Xn|) − E(|X|)| ≥ ε* − (2/3)ε* = ε*/3. It follows that E(|Xn|) cannot converge to E(|X|).
Proof. Exercise.
Corollary 9.21.
Let Xi be iid random variables with mean μ and variance σ², and let Yn = n^(1/2)(X̄n − μ). Then E(|Yn|) → E(|Y|), where Y ~ N(0, σ²).
Example 9.22.
Theorem 7.20 shows that the expected number of returns of a symmetric random walk before time 2n is E(|S2n|) − 1. From this and a crude bound, we argued that the expected number of returns is of order √n instead of order 2n, and used this to conclude that the expected return time must be infinite. We are now able to derive a much better estimate of the expected number of returns before time 2n. By Corollary 9.21, E(|S2n|)/√(2n) → √(2/π), so the expected number of returns to the origin of a symmetric random walk is asymptotic to √(4n/π) = 2√(n/π).
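A simulation sketch of this asymptotic (the walk length and replication count are arbitrary): the average of |S2n| should track √(4n/π), and the average number of visits to 0 should be close to E(|S2n|) − 1.

```python
import numpy as np

rng = np.random.default_rng(6)
two_n, reps = 2000, 5_000    # 2n steps per walk; number of simulated walks

steps = rng.integers(0, 2, size=(reps, two_n), dtype=np.int8) * 2 - 1
S = steps.cumsum(axis=1, dtype=np.int32)          # partial sums S_1, ..., S_2n

mean_abs = np.abs(S[:, -1]).mean()                # estimates E|S_2n|
predicted = np.sqrt(2 * two_n / np.pi)            # sqrt(2/pi) * sqrt(2n)
mean_returns = (S == 0).sum(axis=1).mean()        # returns to 0 by time 2n
```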
Proof. Exercise.
We can use uniform integrability to strengthen the DCT (Proposition 6.43) by weakening
its hypotheses from convergence in probability to convergence in distribution.
Proof. By Proposition 9.23, Xn is UI. The result then follows from Theorem 9.19.
Proof. Assume that (Xn − an)/bn converges in distribution to F and (Xn − αn)/βn converges in distribution to G, and let X′n and X″n be independent with the same distribution as Xn. Then Y′n = (X′n − an)/bn and Y″n = (X″n − an)/bn are independent with the same distribution as Yn. It follows that
We claim this implies that bn/βn converges to a positive number (remember that bn > 0 and βn > 0). Suppose not. Then either bn/βn → 0 or bn/βn does not converge. If bn/βn → 0, then the right side of Equation (9.11) would converge to 0, so its limiting distribution would be degenerate at 0. Hence, bn/βn cannot converge to 0. Also, suppose that bn/βn did not converge. Then there would be subsequences {nj} and {nk} such that bn/βn converges to r along {nj}, bn/βn converges to s along {nk}, and r < s (s could be +∞). If s is finite, Slutsky's theorem implies that the right side of Equation (9.11) converges in distribution to r(Y′ − Y″) along {nj} and to s(Y′ − Y″) along {nk}. Unless Y′ − Y″ is degenerate at 0 (which happens only if Y′ and Y″ are degenerate), this contradicts the fact that the left side of (9.11) converges in distribution. We conclude that bn/βn converges. We also get a contradiction if s = +∞.
Also,
By an argument similar to that used above, (an − αn)/βn must converge. If (an − αn)/βn → c and bn/βn → d, then the argument preceding Theorem 9.25 shows that G(z) = F{(z − c)/d}.
Exercises
1. Suppose that we reject a null hypothesis for small values of a test statistic Yn that is uniformly distributed on An ∪ Bn under the null hypothesis, where An = [−1/n, 1/n] and Bn = [1 − 1/n, 1 + 1/n]. Show that the approximate p-value using the asymptotic null distribution of Yn is not necessarily close to the exact p-value.
(a) (−∞, x).
(b) (x, ∞).
(c) [x, ∞).
(d) [a, b).
3. Let Xn and X have probability mass functions fn and f on the integers k = 0, ±1, ±2,..., and suppose that fn(k) → f(k) for each k. Without using Scheffé's theorem, prove that Xn converges in distribution to X. Then prove a stronger result using Scheffé's theorem.
4. Find a sequence of density functions converging to a non-density function.
5. If X1,..., Xn are iid with mean 0 and variance σ² < ∞, prove that nX̄n²/σ² converges in distribution to a central chi-squared random variable with 1 degree of freedom, which is not normal. Why does this not contradict the delta method?
6. Use the asymptotic distribution of derived in Example 9.12 in
conjunction with the one-dimensional delta method to prove that the asymptotic
distribution of the relative risk is given by Equation (9.8).
7. Give an example to show that Proposition 9.25 is false if we remove the condition
that F and G be non-degenerate.
8. Let Sn be the sum of n iid Cauchy random variables with parameters μ and σ (see Table 8.2 and Exercise 7 of Section 8.4). Do there exist normalizing constants an and bn such that (Sn − an)/bn converges in distribution to a standard normal random variable? If so, find them. If not, explain why not.
9. Suppose you have two trick coins having probabilities 0.20 and 0.80 of heads. Randomly choose a coin, and then flip it ad infinitum. Let Xi be the indicator of heads for flip i, and let X̄n be the sample proportion of heads after n flips. Does X̄n converge to a constant (either almost surely or in probability)? If so, what is the constant? Does X̄n converge in distribution? If so, to what distribution? Is (X̄n − an)/bn asymptotically normal for some an and bn?
10. Prove Proposition 9.20.
11. Let Xi be iid with E(|X1|) < ∞. Prove that Sn/n is UI.
12. Prove that if Xn converges in distribution to X and sup_n E(|Xn|^(1+r)) < ∞ for some r > 0, then Xn is UI.
(a) Prove that if pn = p for all n, then 1/Xn is asymptotically normal, and
determine its asymptotic mean and variance (i.e., the mean and variance of the
asymptotic distribution of 1/Xn). How do these compare with the exact mean
and variance of 1/Xn? Note that 1/Xn is infinite if Xn = 0.
(b) If pn = λ/n for some constant λ > 0, prove that 1/Xn does not converge in distribution to a finite-valued random variable.
14. Let μn be the binomial probability measure with parameters n and pn, where npn → λ. If μ is the Poisson probability measure with parameter λ, prove the following improvement of the law of small numbers (Proposition 6.24): sup_B |μn(B) − μ(B)| → 0.
15. Consider a permutation test in a paired data setting, as in Examples 8.10 and 8.20. Let pn = pn(Zn) be the exact, one-tailed permutation p-value corresponding to the statistic Zn, and let p̃n be the approximate p-value 1 − Φ(Zn). Using what was shown in Example 8.20, prove that pn − p̃n converges in probability to 0.
16. Prove Proposition 9.23.
9.6 Summary
1. Pólya's theorem (convergence in distribution to a continuous d.f. implies uniform convergence): If Xn converges in distribution to X, where the limiting distribution function F is continuous, then sup_x |Fn(x) − F(x)| → 0.
2. Scheffé's theorem (convergence of densities implies uniform convergence of probability measures over all Borel sets): If Xn and X have densities fn and f with respect to a measure μ, and fn(x) → f(x) except on a set of μ-measure 0, then Xn converges in distribution to X and sup_B |P(Xn ∈ B) − P(X ∈ B)| → 0.
3. Delta method
Conditioning also allows different entities to interpret data from their own perspectives.
For instance, ideally, a medical diagnostic test should declare the disease present if the
patient truly has it, and absent if the patient does not have it. The probabilities of these
conclusions are known as sensitivity and specificity, respectively. These are conditional
probabilities of correct diagnoses given that the patient does or does not truly have the
disease. The doctor wants to ensure a small proportion of incorrect diagnoses, hence
high sensitivity and specificity. The patient is concerned only about the accuracy of his
or her own diagnosis: "Given that the test was positive, what is the probability that I
really have the disease?" or "Given that the test was negative, what is the probability
that I am really disease free?" These are different conditional probabilities, known as
the positive predictive value and negative predictive value, respectively.
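These four quantities are linked by Bayes' rule; a minimal sketch, with hypothetical sensitivity, specificity, and prevalence values:

```python
# Sketch of PPV and NPV via Bayes' rule; the sensitivity, specificity, and
# prevalence values used below are hypothetical.

def predictive_values(sens, spec, prev):
    """Return (PPV, NPV) for given sensitivity, specificity, prevalence."""
    p_pos = sens * prev + (1 - spec) * (1 - prev)   # P(test positive)
    ppv = sens * prev / p_pos                       # P(disease | positive)
    p_neg = spec * (1 - prev) + (1 - sens) * prev   # P(test negative)
    npv = spec * (1 - prev) / p_neg                 # P(no disease | negative)
    return ppv, npv

ppv, npv = predictive_values(sens=0.95, spec=0.90, prev=0.01)
```

At 1% prevalence even a quite accurate test has a modest PPV (under 10% here), which is exactly the patient-centered conditional probability discussed above.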
The importance of conditioning may be matched only by the care required to avoid
mistakes while carrying it out. Many paradoxes, including the two envelope paradox of
Example 1.4, involve errors in conditional probability or expectation. These generally
involve conditioning on sets of probability 0; conditioning on sets of positive
probability does not cause problems. If E(|Y|) < ∞ and B is any Borel set with P(X ∈
B) > 0, the expected value of Y given that X ∈ B is unambiguously defined by
E(Y | X ∈ B) = E{Y I(X ∈ B)}/P(X ∈ B).
Let (X, Y) have joint density function or probability mass function f(x, y), and let g(x)
be the marginal density (or mass function) of X. The conditional density (or mass
function) of Y given X = x is defined as h(y | x) = f(x, y)/g(x) if g(x) ≠ 0. It does not
matter how we define h(y | x) when g(x) = 0. For each x such that g(x) > 0, h(y | x) is a
density function (or mass function) in y, to which there corresponds a conditional
distribution function H(y | x) = ∫(−∞, y] h(t | x) dt or Σt≤y h(t | x); H(y | x) has all of
the properties of an ordinary distribution function in y.
Note that the definition of E(Y | X = x) when g(x) = 0 is arbitrary. We could define it to
be any fixed number.
In a vaccine clinical trial, let NP and NV be the numbers of disease events in the
placebo and vaccine arms, respectively. A common assumption is that NP and NV are
independent Poissons with parameters λPTP and λVTV, where P and V denote the sets
of indices for the placebo and vaccine arms, TP and TV are the total amounts of
follow-up time in the two arms, and λP and λV are the placebo and vaccine event rates
per unit time. Then N = NP + NV is Poisson with parameter λPTP + λVTV. The joint
probability mass function of (N, NP) is P(N = n, NP = k) = P(NP = k)P(NV = n − k),
k = 0, 1,..., n.
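A consequence of this setup, used repeatedly in vaccine trials, is the Poisson split: given N = n, NP is binomial with success probability λPTP/(λPTP + λVTV). A simulation sketch, with hypothetical rates and follow-up times:

```python
# Sketch of the Poisson split: if NP ~ Poisson(lamP*TP) and
# NV ~ Poisson(lamV*TV) are independent, then given N = NP + NV = n,
# NP is Binomial(n, pi) with pi = lamP*TP / (lamP*TP + lamV*TV).
# The rates and follow-up times below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
lamP, lamV, TP, TV = 0.04, 0.01, 500.0, 500.0
pi = lamP * TP / (lamP * TP + lamV * TV)     # = 0.8 here

NP = rng.poisson(lamP * TP, size=200_000)    # mean 20
NV = rng.poisson(lamV * TV, size=200_000)    # mean 5
N = NP + NV

n = 25
cond_mean = NP[N == n].mean()                # should be close to n*pi = 20
```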
More generally, if we substitute the random variable X(ω) for the value x in Equation
(10.2), we get the random variable
We have taken the first big step toward a more rigorous definition of conditional
expectation when there is a probability density or mass function, namely conditioning on
a random variable rather than on a value of the random variable. The key property of Z
defined by (10.6) is that it has the same conditional expectation given X ∈ B (defined
by Equation (10.1)) as Y does for all Borel sets B such that P(X ∈ B) > 0. For instance,
when (X, Y) are continuous with joint density function f(x, y), one can verify that
E(Z | X ∈ B) = E(Y | X ∈ B), and multiply both sides by P(X ∈ B) > 0 to deduce the
equivalent condition E{Z I(X ∈ B)} = E{Y I(X ∈ B)} (Equation (10.8)). In fact,
Equation (10.8) holds even if P(X ∈ B) = 0, because in that case both sides are 0.
We have shown that if (X, Y) has joint density function f(x,y), then Equation (10.8)
holds. We can also start with Equation (10.8) as the definition of conditional
expectation and reproduce Equation (10.2) and therefore (10.6). Equation (10.8)
motivates the more rigorous definition of E(Y | X) given in the next section.
Exercises
1. Let X and Y be independent Bernoulli (p) random variables, and let S = X+Y.
What is the conditional probability mass function of Y given S = s for each of s =
0,1, 2? What is the conditional expected value of Y given the random variable S?
2. Verify directly that in the previous problem, Z = E(Y | S) satisfies Equation (10.8)
with X in this expression replaced by S.
3. If X and Y are independent with respective densities f(x) and g(y) and E(|Y|) < ∞,
what is E(Y | X = x)? What about Z = E(Y | X)? Verify directly that Z satisfies
Equation (10.8).
4. Let U1 and U2 be independent observations from a uniform distribution on [0,1],
and let X = min(U1, U2) and Y = max(U1, U2). What is the joint density function
for (X,Y)? Using this density, find Z = E(Y | X). Verify directly that Z satisfies
Equation (10.8).
5. Let Y have a discrete uniform distribution on {−1, 1, −2, 2,..., −n, n}. I.e., P(Y = y)
= 1/(2n) for y = ±i, i = 1,..., n. Define X = |Y|. What is E(Y | X = x)? What about Z
= E(Y | X)? Verify directly that Z satisfies Equation (10.8).
6. Notice that Expression (10.2) assumes that (X, Y) has a density with respect to
two-dimensional Lebesgue measure or counting measure. Generalize Expression
(10.2) to allow (X, Y) to have a density with respect to an arbitrary product
measure μX × μY.
7. A mixed Bernoulli distribution results from first observing the value p from a
random variable P with density f(p), and then observing a random variable Y from
a Bernoulli (p) distribution.
(a) Determine the density function g(p, y) of the pair (P, Y) with respect to the
product measure μL × μC, where μL and μC are Lebesgue measure on [0,1]
and counting measure on {0,1}, respectively.
(b) Use your result from the preceding problem to prove that E(Y | P) = P a.s.
Notice that Definition 10.2 says "A conditional expected value" and not "The
conditional expected value." The definition allows more than one conditional
expectation. We can change the value of E(Y | X) on a set N ∈ σ(X) with P(N) = 0, and
it will still satisfy the definition of conditional expectation.
We have proven the following result when there is a probability density function. The
proof for probability mass functions is similar and left as an exercise.
If (X, Y) has joint density or mass function f (x, y) and X has marginal density or mass
function g(x), then one version of E(Y | X) is given by Equation (10.6).
Proof. Assume first that Y is a nonnegative simple random variable on σ(X). Then
Y = Σi ai I(Fi), where each Fi ∈ σ(X). Each Fi is of the form X−1(Bi) for some Borel set
Bi ∈ ℬ. Then Y = Σi ai I(X ∈ Bi) = φ(X), where φ(x) = Σi ai I(x ∈ Bi) is clearly a Borel function.
Now suppose that Y is any nonnegative random variable. Then Yn ↑ Y, where each
Yn is a simple random variable on σ(X). By what we just proved, Yn = φn(X) for Borel
functions φn. This means that φn{X(ω)} ↑ Y(ω) for each ω, so φn(x) must converge to
some function φ(x) for each x ∈ 𝒳, the range of X(ω). The problem is that 𝒳 need not be
an extended Borel set, which means that the function defined by Expression (10.9)
need not be an extended Borel function. For example, φ(x) = limn φn(x) could be 1
for all x ∉ 𝒳, in which case Expression (10.9) is not an extended Borel function
because φ−1({1}) is not a Borel set. We can avoid this quandary by defining φ(x) by
φ(x) = lim supn φn(x) for all x ∈ R. Proposition 4.9 implies that φ is an extended Borel function, and Y
= φ(X).
Now suppose that Y is any random variable on σ(X). Then Y = Y+ − Y−, where Y+ and
Y− are nonnegative σ(X)-measurable random variables and Y+(ω) − Y−(ω) is not of the
form ∞ − ∞. By what we have just proven, Y+ = φ1(X) and Y− = φ2(X) for some
extended Borel functions φ1 and φ2 such that φ1(x) − φ2(x) is not of the form ∞ − ∞.
Then Y = φ1(X) − φ2(X), and φ1 − φ2 is an extended Borel function.
Notation 10.5.
Let (X, Y) have probability mass function f(x, y), and let g(x) be the probability mass
function of X, where g(0) = 0. Define U = X³. The joint probability mass function of (U,
Y) is P(U = u, Y = y) = P({X = u^{1/3}} ∩ {Y = y}) = f(u^{1/3}, y), and the marginal
probability mass function of U is P(U = u) = P(X = u^{1/3}) = g(u^{1/3}). It follows that
Σy y f(u^{1/3}, y)/g(u^{1/3}) is a version of E(Y | U = u). But of course U^{1/3} = X, so
Σy y f(X, y)/g(X) is a version of E(Y | U). In other words, E(Y | X) is a version of E(Y | U).
We are not saying that E(Y | U = u) = E(Y | X = u); that is not true. However, as random
variables, E(Y | X³)(ω) and E(Y | X)(ω) are the same function of ω. We get the same
information whether we condition on X or on X³ because X and X³ generate the same sigma-field.
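The point that conditioning on X and on U = X³ yields the same random variable can be checked exactly on a small hypothetical joint pmf:

```python
# Exact check on a hypothetical joint pmf: E(Y | X) and E(Y | U) with
# U = X**3 agree as random variables, since sigma(X) = sigma(X**3).
from fractions import Fraction as F

f = {(-1, 0): F(1, 8), (-1, 1): F(1, 8),    # joint pmf f(x, y)
     (1, 0):  F(1, 4), (1, 1):  F(1, 8),
     (2, 0):  F(1, 8), (2, 1):  F(1, 4)}

def cond_exp_given(value_map):
    """E(Y | value_map(X) = v), returned as a dict keyed by v."""
    num, den = {}, {}
    for (x, y), p in f.items():
        v = value_map(x)
        num[v] = num.get(v, 0) + y * p
        den[v] = den.get(v, 0) + p
    return {v: num[v] / den[v] for v in num}

eY_given_X = cond_exp_given(lambda x: x)        # E(Y | X = x)
eY_given_U = cond_exp_given(lambda x: x ** 3)   # E(Y | U = u)

# As functions of omega they coincide: E(Y|X=x) == E(Y|U=x**3) for all x.
same = all(eY_given_X[x] == eY_given_U[x ** 3] for x in (-1, 1, 2))
```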
The observation that E(Y | X) depends only on the sigma-field generated by X leads us
to the following generalization of Definition 10.2.
We often surmise the conditional expectation Z = E(Y | 𝒜) from general principles, and then
prove that Z is E(Y | 𝒜) using Definition 10.7. That is, we establish that Z is 𝒜-measurable
and satisfies E{Z I(A)} = E{Y I(A)} for all A ∈ 𝒜. We illustrate this technique with
the following example that is the flip side of Example 10.6. It shows that two random
variables that are almost the same can generate very different sigma-fields, and
therefore very different conditional expectations.
We have demonstrated that E(Y | X) = 1/2 a.s., yet E{Y | X + (1/1000)Y} = E{Y | (X,
Y)} = Y a.s. In other words, even though X and X + (1/1000)Y are very close to each
other, conditioning on X yields a dramatically different answer than conditioning on X +
(1/1000)Y. The same result obtains if we replace 1/1000 by 1/10^10 or 1/10^100, etc.
The following result shows that two versions of E(Y | 𝒜) can differ only on a set of
probability 0.
If E(|Y|) < ∞, there is always at least one version of E(Y | 𝒜). Two versions, Z1 and Z2,
of E(Y | 𝒜) are equal with probability 1.
Proof. The existence part follows from a deep result in analysis called the Radon-
Nikodym theorem. See Section 10.9 for details. To prove uniqueness, let Z1 and Z2 be
two versions of E(Y | 𝒜). Because Z1 and Z2 are both 𝒜-measurable, the event A =
{Z1 > Z2} is in 𝒜. Also, Z1 and Z2 are integrable. Therefore,
Certain results for conditional expectation follow almost immediately from the
definition. For instance, if we take A = Ω, then E{E(Y | 𝒜)} = E(Y). We have
proven the following.
Example 10.11.
Example 10.12.
In imaging studies of the lungs, we may express the burden of a disease by the total
volume of diseased lesions. Let N be the number of lesions for a given patient, and Yi
be the volume of lesion i. Assuming that the number of lesions is small, it may be
reasonable to assume that N is bounded and independent of the Yi's (this probably would
not be reasonable if N is large, in which case the larger the number of lesions, the
smaller their volumes must be). Let μY and μN be the (finite) means of Y and N,
respectively. The disease burden is SN = Y1 + ⋯ + YN. Also, because N is bounded by some
integer B, E(|SN|) ≤ E(|Y1| + ⋯ + |YB|) = BE(|Y1|) < ∞. Therefore, E(SN) is well
defined. To find E(SN), first condition on N = n. Then SN = Sn, the sum of n independent
observations, each with mean μY. The conditional mean is nμY, so E(SN) =
E{E(SN | N)} = E(NμY) = μNμY.
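A simulation sketch of E(SN) = μNμY; the lesion-count and lesion-volume distributions below are hypothetical:

```python
# Sketch of E(S_N) = mu_N * mu_Y when the bounded count N is independent
# of iid volumes Y_i, as in the lesion-burden example. The distributions
# (N uniform on {0,...,5}, Y exponential with mean 2) are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
B = 5                                      # bound on the number of lesions
reps = 50_000
N = rng.integers(0, B + 1, size=reps)      # mu_N = 2.5
mu_Y = 2.0
S = np.array([rng.exponential(mu_Y, n).sum() for n in N])
# S.mean() should be close to mu_N * mu_Y = 5.0
```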
Let Y and Yn, n = 1, 2,..., be integrable random variables on (Ω, ℱ, P), and let 𝒜 ⊂ ℱ
be a sigma-field. Then
1. a.s.
2. If P(Y1 ≤ Y2) = 1, then E(Y1 | 𝒜) ≤ E(Y2 | 𝒜) a.s.
3. If Yn ↑ Y a.s., then E(Yn | 𝒜) ↑ E(Y | 𝒜) a.s.
4. If Yn ↓ Y a.s., then E(Yn | 𝒜) ↓ E(Y | 𝒜) a.s.
5. DCT for conditional expectation If Yn → Y a.s. and |Yn| ≤ U a.s., where E(U) < ∞, then
E(Yn | 𝒜) → E(Y | 𝒜) a.s.
Proof. The general method of proof for conditional expectation of Y given 𝒜 is to show
that the candidate random variable Z is 𝒜-measurable and has the same expectation as Y
over sets A ∈ 𝒜.
For parts 3 and 4, we prove the results first for nonnegative random variables. For
example, for part 3, the E(Yn | 𝒜) are 𝒜-measurable random variables and are increasing by
part 2. Therefore, the limit Z = limn E(Yn | 𝒜) exists (Proposition A.33) and is 𝒜-measurable
(Proposition 4.9). We will demonstrate that for each A ∈ 𝒜,
E{Z I(A)} = E{Y I(A)},
which will show that Z satisfies the definition of E(Y | 𝒜). Note that Yn I(A) ↑ Y I(A) a.s. and
each term is nonnegative, so the MCT implies that E{Yn I(A)} → E{Y I(A)}. The MCT also implies that
E{E(Yn | 𝒜) I(A)} → E{Z I(A)}. Therefore, for each A ∈ 𝒜,
E{Z I(A)} = lim E{E(Yn | 𝒜) I(A)} = lim E{Yn I(A)} = E{Y I(A)}.
A similar argument shows that part 4 holds for nonnegative random variables. To prove
parts 3 and 4 for arbitrary Yn, write Yn as Yn = Yn+ − Yn− and use the fact that Yn+ ↑ Y+ and Yn− ↓ Y−.
When we condition on information that fixes the value of a random variable, we can
essentially treat that random variable as a constant. For instance, when we condition on
Y1, then Y1Y2 behaves as if Y1 were a constant: E(Y1Y2 | Y1) = Y1E(Y2 | Y1) when the
expectations exist. Specifically:
If Y1 = Σi ai I(Ai), Ai ∈ 𝒜, then Y1E(Y2 | 𝒜) = Σi ai I(Ai)E(Y2 | 𝒜), which is 𝒜-measurable.
Moreover, each ai I(Ai)E(Y2 | 𝒜) is integrable by definition of E(Y2 | 𝒜). It follows that for each
A ∈ 𝒜,
E{Y1E(Y2 | 𝒜) I(A)} = Σi ai E{E(Y2 | 𝒜) I(Ai ∩ A)} = Σi ai E{Y2 I(Ai ∩ A)} = E{Y1Y2 I(A)}.
The first line is finite if and only if the last line is finite, and the last line is finite
because E(|Y1Y2|) is assumed finite. Thus, the result holds when Y1 is any nonnegative
𝒜-measurable random variable.
Exercises
1. Let Y have a discrete uniform distribution on {1, 2,..., n}, and let X = Y². Find
E(Y | X = x). Does E(Y | X = x) match what you got for E(Y | X = x) for X = |Y| in
Problem 5 in the preceding section? Now compute Z = E(Y | X) and compare it
with your answer for E(Y | X) in Problem 5 in the preceding section.
2. Let Y be as defined in the preceding problem, but let X = Y³ instead of Y². Find
E(Y | X = x) and Z = E(Y | X). Does Z match your answer in the preceding
problem?
3. Tell whether the following is true or false. If it is true, prove it. If it is false, give a
counterexample. If E(Y | X1) = E(Y | X2), then X1 = X2 almost surely.
Proof. Except on a null set N1, the following conditions hold for all rational r.
To see the monotonicity property, note that for each pair r1 < r2 of rationals, the set
N(r1, r2) = {ω : F(r1, ω) > F(r2, ω)} has probability 0 by part 2 of Proposition 10.13. The set of pairs
(r1, r2) of rational numbers such that F(r1, ω) > F(r2, ω) is the countable union,
N1 = ∪(r1, r2) N(r1, r2), of sets of probability 0, so P(N1) = 0. This shows that
except on a null set, F(r, ω) is monotone increasing in r. The limits as r → ±∞ follow from
properties 3 and 4 of Proposition 10.13 because I(Y ≤ r) converges almost surely to 0 or
1 as r → −∞ or r → +∞, respectively. Thus, except on a null set N1, conditions (10.16) are
satisfied.
for all rational numbers r. To see this, note that for a particular r, part 4 of Proposition
10.13 implies that condition (10.17) holds except on a null set N2(r). The set of ω for
which condition (10.17) fails to hold for at least one rational r is the countable union
N2 = ∪r N2(r) of null sets, so P(N2) = 0. Therefore, outside the null set N2, condition
(10.17) holds.
We have defined the candidate distribution function F(r, ω) on the rational numbers in a
way that, except for ω in the null set N = N1 ∪ N2, conditions (10.16) and (10.17) hold. We
now define F(y, ω) for ω ∈ N^C and y irrational: F(y, ω) = lim inf {F(r, ω) : r rational, r ↓ y}.
Then F(y, ω), being a liminf of 𝒜-measurable random variables, is also 𝒜-measurable. For each y, whether
rational or irrational, there is a sequence of rational numbers rn decreasing to y such
that F(rn, ω) → F(y, ω) as n → ∞. This fact can be used to show that F(y, ω) is right-
continuous on N^C for each y. To see this, let yn ↓ y. We can find a rational number rn ≥
yn such that rn − yn < 1/n and F(rn, ω) < F(yn, ω) + 1/n.
Therefore, F(rn, ω) − 1/n < F(yn, ω) ≤ F(rn, ω) for all n. The limits of the left and right
sides as n → ∞ are both F(y, ω), so F(yn, ω) → F(y, ω) as n → ∞. We have established
that F(y, ω) is right continuous. It is also easy to show that F(y, ω) satisfies conditions (10.16).
Notation 10.16.
It seems like we have attained all we need. After all, the distribution function for a
random variable Y determines the probability that Y ∈ B for every Borel set B (see
Proposition 4.22). There is only one glitch: we have shown only that the conditional
distribution function F(y, ω) is 𝒜-measurable for each fixed y. How can we be sure that
P(Y ∈ B | 𝒜) is 𝒜-measurable for an arbitrary Borel set B? Fortunately, it is.
Proof. We use the notation Fω(y) for F(y, ω) to emphasize that we are fixing ω and
regarding F as a function of y. Define ν(B, ω) by ν(B, ω) = ∫ I(y ∈ B) dFω(y). It is an
exercise to show that the set ℬ0 of Borel sets B such that ν(B, ω) is 𝒜-measurable is a
monotone class containing the field in Proposition 3.8. By the monotone class theorem
(Theorem 3.32), ℬ0 contains all Borel sets.
The reason for defining a conditional probability measure for Y given 𝒜 is to facilitate
the calculation of conditional expected values of functions of Y given 𝒜.
The proof follows the familiar pattern of beginning with nonnegative simple g, then
extending to all nonnegative Borel functions, then to all Borel functions (exercise).
Example 10.19. Simon and Simon, 2011: rerandomization tests protect type I error rate
There are many different ways to randomize patients to treatment (T) or control (C) in a
clinical trial, including: (1) simple randomization, akin to flipping a fair coin for each
new patient; (2) permuted block randomization, whereby k patients in each block of size
2k are assigned to T, the other k to C; (3) Efron's biased coin design, whereby a fair
coin is used whenever the numbers of T's and C's are balanced, and an unfair coin with
probability, say 2/3, favoring the under-represented treatment is used when the numbers
of T's and C's are unbalanced; and (4) various covariate-adaptive schemes making it more
likely that the treatment assigned to the next patient balances the covariate distributions
across the arms.
A very general principle is to "analyze as you randomize." To do this, treat all data in
the clinical trial other than the treatment labels as fixed constants and re-generate the
randomization sequence using whatever method was used to generate the original
labels. For this rerandomized dataset, compute the value of the test statistic T. Repeat
this process of rerandomizing the patients and computing the test statistic until all
possible rerandomizations have been included. This generates the rerandomization
distribution F(t) of T. For a one-tailed test rejecting the null hypothesis for large values
of T, determine c* = inf{t : F(t) ≥ 1 − α}. By the right-continuity of distribution functions,
F(c*) ≥ 1 − α.
Reject the null hypothesis if the statistic Torig corresponding to the original randomization
exceeds c*. This test is called a rerandomization test. This is a generalization of a
permutation test that accommodates any randomization scheme.
Let 𝒜 be the sigma-field generated by all data other than the treatment labels. The
rerandomization distribution is the conditional distribution function of T given 𝒜. By
construction, 1 − F(c*) = P(T > c* | 𝒜) ≤ α. In other words, the conditional type I error
rate given the data is no greater than α. By Proposition 10.10, the unconditional type I
error rate, P(T > c*), is simply E[E{I(T > c*) | 𝒜}] ≤ α. Therefore, any rerandomization
test controls the type I error rate both conditional on the observed data and
unconditionally.
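The procedure above can be sketched for a tiny trial randomized with permuted blocks of size 4; the outcomes, original labels, and choice of statistic (sum of block-wise treatment-minus-control differences) below are hypothetical:

```python
# Hedged sketch of a rerandomization test with permuted blocks of size 4.
# All outcomes are held fixed; only the treatment labels are re-generated
# using the original randomization scheme. Data below are hypothetical.
import itertools

blocks = [[1.2, 0.4, -0.3, 0.8], [0.9, 1.5, 0.1, 0.6]]   # fixed outcomes
orig = [(1, 1, -1, -1), (1, -1, 1, -1)]                   # +1 = T, -1 = C

def stat(assignments):
    # sum over blocks of the treatment-minus-control mean difference
    return sum(sum(z * x for z, x in zip(zs, xs)) / 2
               for zs, xs in zip(assignments, blocks))

# every permuted-block assignment: each block has C(4,2) = 6 label patterns
patterns = [p for p in itertools.product([1, -1], repeat=4) if sum(p) == 0]
rerand = [stat(a) for a in itertools.product(patterns, repeat=len(blocks))]

t_orig = stat(orig)
p_value = sum(t >= t_orig for t in rerand) / len(rerand)   # one-tailed
```

With 6 patterns per block, the rerandomization distribution here has 36 equally likely values; rejecting when the original statistic exceeds its (1 − α)th quantile controls the type I error rate as described above.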
Consider a clinical trial using random permuted blocks of size 4 to assign patients to
treatment (T) or control (C). Let Xi1, Xi2, Xi3, and Xi4 be the observations on a
continuous outcome for patients in block i. The treatment less control difference in
block i is Di = (1/2) Σj Zij Xij, where Zij is +1 if patient j of block i is assigned to treatment,
and −1 if assigned to control. Let 𝒳 = (Xij, i = 1, 2,..., j = 1,..., 4) be the infinite set of
data from which the first 4n patients constitute the clinical trial. Let Tn
be the test statistic, and suppose that we reject the null hypothesis for large values of Tn.
The second line follows from the DCT for conditional expectation because the integrand is
dominated by 1. We have thus shown that the permutation test is asymptotically
equivalent to rejecting the null hypothesis when the standardized statistic exceeds z_α,
where z_α is the (1 − α)th quantile of the standard normal distribution. It follows that
under the null hypothesis, the rerandomization test is asymptotically
equivalent to rejecting the null hypothesis if the standardized statistic exceeds z_α. Also,
the difference between the usual t-statistic and this standardized statistic converges
almost surely to 0, so the rerandomization test is asymptotically equivalent to the t-test
under random permuted block randomization.
Proof. We prove only the first result. Proofs of the others are similar and left as
exercises. Let Fω(y) be a regular conditional distribution function of Y given 𝒜.
Because Fω(y) is a distribution function in y for fixed ω, the usual Jensen's inequality
implies that ∫φ(y) dFω(y) ≥ φ{∫y dFω(y)} for each ω. But ∫φ(y) dFω(y) and ∫y dFω(y) are versions of
E{φ(Y) | 𝒜} and E(Y | 𝒜), respectively, from which the result follows.
Notice that with the Markov and Chebychev inequalities for conditional expectation, C is
allowed to be any 𝒜-measurable random variable, whereas with the usual Markov and
Chebychev inequalities, C is a constant. Remember that once we condition on 𝒜, any 𝒜-
measurable random variable essentially becomes a constant.
Let 𝒜 be a sigma-field.
Suppose that Y, Y1, and Y2 are random variables with finite second moments, and 𝒜
is a sigma-field.
1. a.s.
2. a.s.
3. If C is an 𝒜-measurable random variable with E(C²) < ∞, then
a.s.
4. If C1 and C2 are 𝒜-measurable random variables with E(C1²) < ∞ and E(C2²) < ∞, then
a.s.
We close this section with extensions of results for conditional distribution functions of
random variables to conditional distribution functions of random vectors.
Exercises
1. Let (X, Y) take the values (0,0), (0,1), (1,0), and (1,1) with probabilities p00, p01,
p10, and p11, respectively, where p00 + p01 + p10 + p11 = 1 and p00 +p01 > 0,
p10 + p11 > 0.
2. Roll a die and let X denote the number of dots showing. Then independently
generate Y ~ U(0,1), and set Z = X + Y.
and X̄i is the sample mean in arm i. Under the null hypothesis, μi = μ0, i = 1,..., k,
and without loss of generality, assume that μi = 0, i = 0, 1,..., k. Therefore, assume
that all observations have mean 0.
5. Fisher's least significant difference (LSD) procedure for testing whether means
μ1,..., μk are equal declares μ1 ≠ μ2 if the t-statistic comparing arms 1 and 2
and the F-statistic comparing all means are both significant at level α. When the
common variance σ² is known, this is equivalent to rejecting the null hypothesis if
both of the corresponding statistics exceed their level-α critical values, where
6. Let Y be a random variable with finite mean, and suppose that E{exp(Y)} < ∞.
What is the probability that ?
7. Prove Markov's inequality for conditional expectation (part 2 of Proposition
10.21).
8. Prove Chebychev's inequality for conditional expectation (part 3 of Proposition
10.21).
9. Prove Hölder's inequality for conditional expectation (part 4 of Proposition
10.21).
10. Prove Schwarz's inequality for conditional expectation (part 5 of Proposition
10.21).
11. Prove Minkowski's inequality for conditional expectation (part 6 of Proposition
10.21).
12. Prove parts 1 and 2 of Proposition 10.23.
13. Prove parts 3 and 4 of Proposition 10.23.
14. Complete the proof of Proposition 10.17 by showing that the set of Borel sets B
such that ν(B, ω) is 𝒜-measurable is a monotone class containing the field in
Proposition 3.8.
15. Consider the Z-statistic comparing two means with known finite variance σ².
Suppose that nX remains fixed, and let F be the distribution function for X̄. Assume
that Y1, Y2,... are iid with mean μY. Show that the asymptotic (as nY → ∞ and nX
remains fixed) conditional distribution function for Z given X̄ is normal and
determine its asymptotic mean and variance. What is the asymptotic unconditional
distribution of Z as nY → ∞ and nX remains fixed?
16. Prove Proposition 10.18, first when g is simple, then when g is nonnegative, then
when g is an arbitrary Borel function.
Example 10.27.
Now suppose that we are attempting to predict the value of a random variable Y that
might be difficult to measure. For instance, Y might require invasive medical imaging.
But suppose that with less invasive imaging techniques, we have a set of variables
X1,..., Xk with which we might be able to predict Y accurately. Let 𝒜 be σ(X1,..., Xk),
the sigma-field generated by X1,..., Xk. We will estimate Y using some function of
X1,..., Xk, hence an 𝒜-measurable random variable U. We want to minimize the mean-
squared error E{(Y − U)²} when using U to estimate Y.
In Figure 10.1, the plane represents the vector space of 𝒜-measurable random variables
in L2, while the vector above the plane is the random variable Y. It is clear from the
figure that among all vectors U in the plane, the one minimizing the squared residual (Y
− U)² is the projection of Y onto the plane. That is, Z = E(Y | 𝒜) is the estimator of Y that
minimizes the mean-squared error E{(Y − U)²}.
Figure 10.1
Conditional expectation as a projection. The conditional expected value Z = E(Y | 𝒜) is
such that Y − Z is orthogonal to every random variable in L2(𝒜), the set of 𝒜-measurable
random variables in L2.
In Example 10.27, we gave a geometric argument for the fact that E(Y | 𝒜) minimizes E{(Y
− U)²} among 𝒜-measurable functions U ∈ L2. We now give a more formal proof of
this fact.
Proof.
This shows that E(Y | 𝒜) minimizes E{(Y − U)²} among all 𝒜-measurable functions U ∈
L2.
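A small discrete check of this projection property; the joint pmf of (X, Y) below is hypothetical, and a grid of competing predictors stands in for "all" functions of X:

```python
# Sketch: among functions of X, the conditional mean E(Y | X) minimizes
# mean-squared error. The joint pmf of (X, Y) below is hypothetical.
pmf = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}

def mse(u):
    # E{(Y - u(X))^2} for a predictor given as a dict x -> u(x)
    return sum(p * (y - u[x]) ** 2 for (x, y), p in pmf.items())

# conditional mean E(Y | X = x)
cond_mean = {x: sum(p * y for (xx, y), p in pmf.items() if xx == x) /
                sum(p for (xx, y), p in pmf.items() if xx == x)
             for x in (0, 1)}

best = mse(cond_mean)
# grid search over competing predictors: none does better
grid = [{0: a / 20, 1: b / 20} for a in range(21) for b in range(21)]
beaten = any(mse(u) < best - 1e-12 for u in grid)
```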
Therefore,
var(Y) = E{var(Y | 𝒜)} + var{E(Y | 𝒜)} (10.23)
and
cov(Y1, Y2) = E{cov(Y1, Y2 | 𝒜)} + cov{E(Y1 | 𝒜), E(Y2 | 𝒜)}. (10.24)
Helpful mnemonic devices for (10.23) and (10.24) are V = VE+EV and C = CE+EC.
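A simulation sketch of the variance decomposition under an assumed two-group model (group means, standard deviations, and probabilities below are hypothetical):

```python
# Sketch verifying var(Y) = E{var(Y|A)} + var{E(Y|A)} ("V = VE + EV")
# on a hypothetical two-group model: draw a group G, then Y given G.
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.0, 3.0])     # E(Y | G = g), hypothetical
sds = np.array([1.0, 2.0])       # sd(Y | G = g), hypothetical
pG = np.array([0.5, 0.5])

G = rng.choice(2, p=pG, size=500_000)
Y = rng.normal(means[G], sds[G])

lhs = Y.var()                                         # var(Y)
ev = pG @ sds**2                                      # E{var(Y|G)} = 2.5
ve = np.average((means - pG @ means)**2, weights=pG)  # var{E(Y|G)} = 2.25
rhs = ev + ve
```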
Other results are also apparent from viewing conditional expectation as a projection.
For instance, if Y is already 𝒜-measurable, then the projection of Y onto L2(𝒜) is Y
itself. That is, E(Y | 𝒜) = Y almost surely if Y is 𝒜-measurable. Also, suppose that 𝒜1 ⊂ 𝒜2.
Projecting Y first onto the larger sigma-field 𝒜2, and then onto the sub-sigma-field 𝒜1, is
equivalent to projecting Y directly onto 𝒜1. That is, E{E(Y | 𝒜2) | 𝒜1} = E(Y | 𝒜1) almost surely.
Proposition 10.30. Projecting first onto a larger space and then onto a subspace is
equivalent to projecting directly onto the subspace
Let Y be a random variable with E(|Y|) < ∞. If 𝒜1 and 𝒜2 are sigma-fields with 𝒜1 ⊂ 𝒜2, then
E{E(Y | 𝒜2) | 𝒜1} = E(Y | 𝒜1) a.s.
Exercises
Proposition 10.31. X and Y are independent if and only if a conditional distribution of Y given
X = x does not depend on x
Proof. Suppose that X and Y are independent. We claim that the distribution function
G(y) of Y is a conditional distribution function of Y given X = x that does not depend on
x. It is clearly a distribution function in y and does not depend on x. We need only show
that G(y) satisfies the definition of a conditional expected value E{I(Y ≤ y) | X}. Let B
be any k-dimensional Borel set, where k is the dimension of X. We must show that
E{G(y) I(X ∈ B)} = E{I(Y ≤ y) I(X ∈ B)}. (10.25)
The left side of Equation (10.25) is clearly P(X ∈ B)G(y) because G(y) is a constant.
By Proposition 5.30, the right side of Equation (10.25) is E{I(Y ≤ y)}E{I(X ∈ B)} =
P(X ∈ B)G(y). Thus, Equation (10.25) holds. This completes the proof that if (X, Y)
are independent, there exists a conditional distribution of Y given X = x that does not
depend on x. The proof of the reverse direction is left as an exercise.
Another important concept is that of conditional independence given another random
vector.
Random vectors X and Y are said to be conditionally independent given Z if there are
conditional distribution functions H(x, y | z), F(x | z), and G(y | z) of (X, Y | Z = z), (X | Z
= z), and (Y | Z = z) such that H(x, y | z) = F(x | z)G(y | z).
Proposition 10.33.
Many statistical applications involve random variables Y1,..., Yn that are iid
conditional on other information. Examples include the following.
Whenever Y1, Y2,..., Yn are iid non-degenerate random variables conditional on other
information, they are unconditionally positively associated in a certain sense. To see
this, let the sigma-field 𝒜 represent the other information, and suppose that Y1,..., Yn
are conditionally iid and non-degenerate given 𝒜. Then, for i ≠ j,
cov(Yi, Yj) = E{cov(Yi, Yj | 𝒜)} + cov{E(Yi | 𝒜), E(Yj | 𝒜)} = 0 + var{E(Y1 | 𝒜)} ≥ 0.
Example 10.34 shows that random variables can be conditionally independent, but very
highly dependent unconditionally. The opposite is also true: two random variables X
and Y can be unconditionally independent but conditionally highly correlated. For
example, let Z1 and Z2 be independent normal random variables with variance σ², and
let X = Z1 + Z2 and Y = Z1 − Z2. Then (X, Y) is bivariate normal with correlation 0, so
X and Y are independent. On the other hand, conditional on Z2, X is Z1 plus the
constant Z2, and Y is Z1 minus the constant Z2. Therefore, cor(X, Y | Z2) = 1. That is,
although X and Y are unconditionally independent, they are perfectly positively
correlated given Z2. Similarly, X and Y are conditionally perfectly negatively
correlated given Z1.
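This example is easy to check by simulation; the thin-band slicing below is a crude stand-in for conditioning on Z2 = z, not part of the text's exact argument:

```python
# Sketch: X = Z1 + Z2 and Y = Z1 - Z2 are independent for iid normal
# Z1, Z2, yet nearly perfectly correlated given Z2. "Conditioning" on
# Z2 near 0.5 is approximated by slicing a thin band of Z2 values.
import numpy as np

rng = np.random.default_rng(3)
Z1 = rng.normal(size=200_000)
Z2 = rng.normal(size=200_000)
X, Y = Z1 + Z2, Z1 - Z2

r_uncond = np.corrcoef(X, Y)[0, 1]            # approximately 0

band = np.abs(Z2 - 0.5) < 0.05                # Z2 held (nearly) fixed
r_cond = np.corrcoef(X[band], Y[band])[0, 1]  # close to +1
```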
unless Z1 and Z2 are perfectly negatively correlated. Thus, it will look like the accuracy
of the new machine depends on the patient's blood pressure. We should instead plot Y =
Z1 − Z2 against X/2, the average of Z1 and Z2 (Bland and Altman, 1986).
If var(Z1) = var(Z2), X and Y are uncorrelated. The plot of Y against X should show no
discernible pattern. If the plot shows a linear relationship with positive slope, that is an
indication that var(Z1) > var(Z2), whereas a negative slope indicates that var(Z1) <
var(Z2). This observation provides the basis for Pitman's test of equality of variances
for paired observations: use the sample correlation coefficient between X = Z1 + Z2
and Y = Z1 − Z2 to test whether the population correlation coefficient is 0 (Pitman,
1939). This is equivalent to testing whether the slope of the regression of Y on X is 0.
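A sketch of Pitman's test on hypothetical paired data with unequal variances:

```python
# Sketch of Pitman's test: with paired measurements (Z1, Z2), testing
# var(Z1) = var(Z2) reduces to testing cor(Z1 + Z2, Z1 - Z2) = 0.
# The data-generating standard deviations below are hypothetical.
import numpy as np
from math import sqrt

rng = np.random.default_rng(4)
n = 2_000
Z1 = rng.normal(scale=1.5, size=n)        # larger variance
Z2 = rng.normal(scale=1.0, size=n)
X, Y = Z1 + Z2, Z1 - Z2

r = np.corrcoef(X, Y)[0, 1]               # positive when var(Z1) > var(Z2)
t = r * sqrt((n - 2) / (1 - r ** 2))      # usual t-statistic for cor = 0
```

A large positive t signals var(Z1) > var(Z2), matching the positive-slope interpretation of the plot described above.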
We have seen that random variables can be conditionally, but not unconditionally,
independent and vice versa. Whether we think conditionally or unconditionally depends
on what we want to make inferences about. For instance, consider the simple mixed
model
Yij = μ + bi + εij, (10.29)
where Yij is the jth measurement on person i, μ is the mean over all people, bi is a
random effect for person i, and εij is within-person error.
random effect for person i. Usually, we want to make inferences about the mean over
people. For instance, a pharmaceutical company or regulatory agency is interested in the
mean effect of a drug over people. In this case, multiple measurements on each person
are highly correlated. A one-patient study would never suffice to prove that the drug
was effective in the general population. An individual patient has a different focus:
"Does the drug help me?" In this case it makes sense to condition on bi. Under the
simple mixed model (10.29), the multiple observations on person i are conditionally
independent given bi. Of course (10.29) is just a model, and may or may not be
accurate. This example shows that one might view multiple measurements on the same
person as dependent or (conditionally) independent, depending on whether the focus is
on a population average or a single person.
This sounds like it means M and Ymis are conditionally independent, given Yobs, but it
does not because condition (10.30) might hold only for the value of yobs actually
observed. Rubin offers an example involving hospital survey data that includes blood
pressure. Blood pressure is missing if and only if that blood pressure is lower than the
mean blood pressure in the population, μ. Suppose all participants in the survey have
blood pressure exceeding μ. Then yobs will be the vector of all blood pressures (none
are missing), and for that value of yobs, condition (10.30) is satisfied vacuously.
However, if any of the participants in the survey had blood pressure readings below μ,
then there would have been missing data and condition (10.30) would not have held.
Therefore, the data are missing at random if and only if none are missing.
Regression to the mean can occur with discrete random variables as well, and can lead
the uninitiated observer to misinterpret changes from baseline to the end of a clinical
trial whose entry criteria require patients to have the disease at baseline. For example,
let Y0 and Y1 be the indicators that a patient is diseased at baseline and at the end of
follow-up, respectively. Imagine that there is a frailty parameter P indicating the
probability that the patient is diseased at a given time. Assume that Y0 and Y1 are
conditionally independent given P, with E(Yi |P) = P, i = 0,1. If the trial recruits only
patients with disease at baseline, then we must condition on Y0 = 1. Even if a patient
receives no treatment, Y1 will be smaller than Y0, on average.
To determine how much smaller Y1 tends to be than Y0, calculate E(Y1 | Y0 = 1) =
E{E(Y1 | P, Y0) | Y0 = 1} = E(P | Y0 = 1).
Assume that the frailty parameters vary from patient to patient according to a uniform
distribution on [0,1]. That is, the density for p is π(p) = 1 for p ∈ [0,1]. Before
conditioning on Y0 = 1, the joint density of (Y0, P) with respect to the product measure
μC × μL of counting measure and Lebesgue measure, evaluated at (1, p), is f(1, p) = P(Y0
= 1 | p)π(p) = p. The marginal density of Y0 with respect to counting measure (i.e., its
marginal probability mass function) is obtained by integrating the joint density over
p: g(1) = ∫0^1 p dp = 1/2. Therefore, the conditional expectation of P given Y0 = 1 is
E(P | Y0 = 1) = ∫0^1 p {f(1, p)/g(1)} dp = ∫0^1 2p² dp = 2/3.
Therefore, on average, E(Y1 − Y0 | Y0 = 1) = 2/3 − 1 = −1/3. In other words, patients
tend to have less disease at follow-up even in the absence of treatment. McMahon et al.
(1994) showed a similar result applying a Poisson-gamma model to data from the
Asymptomatic Cardiac Ischemia Pilot (ACIP) study.
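The 2/3 computation can be checked by simulation of the frailty model (the sample size is chosen arbitrarily):

```python
# Sketch of the frailty example: P ~ Uniform(0,1); given P, the disease
# indicators Y0 and Y1 are iid Bernoulli(P). Conditioning on Y0 = 1
# should give E(Y1 | Y0 = 1) = E(P | Y0 = 1) = 2/3.
import numpy as np

rng = np.random.default_rng(5)
reps = 1_000_000
P = rng.uniform(size=reps)
Y0 = rng.uniform(size=reps) < P
Y1 = rng.uniform(size=reps) < P

e_y1_given_y0 = Y1[Y0].mean()    # close to 2/3 despite no treatment effect
```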
We hope the examples in this section give the readers an appreciation for the wealth of
applications of conditional independence. Section 11.9 covers a useful graphical tool
for understanding conditional independence relationships between variables.
Exercises
10.6 Sufficiency
10.6.1 Sufficient and Ancillary Statistics
Suppose we want to estimate the probability p that a patient in the treatment arm of a
clinical trial has an adverse event (AE). We have a random sample of 50 patients, and
let Yi be the indicator that patient i has the adverse event; Yi are iid Bernoulli (p). If
someone supplements our data with the weights of 100 randomly selected rocks, should
we use these weights to help us estimate p? Of course not. The rock weights are
ancillary, meaning that their distribution does not depend on p.
Now imagine that the AE data are obtained by first generating the sum from its
distribution, namely binomial (50, p), and then generating the individual Yis from their
conditional distribution given S. If S = s, each outcome y1,..., y50 with sum s has the
same conditional probability, namely . Notice that this conditional probability does
not depend on p. Once we observe S = s, no additional information can be gleaned from
observations generated from a conditional distribution that does not depend on p. The
situation is completely analogous to the above example of augmenting adverse event
data with rock weights. This motivates the following definition.
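The two-step generation scheme just described (first the sum, then a uniformly chosen arrangement given the sum) is easy to carry out; a small sketch, with the sample size and p as illustrative values:

```python
import random

def generate_via_sufficient_statistic(n=50, p=0.3, rng=None):
    """Generate Y1,...,Yn iid Bernoulli(p) in two steps: first draw the
    sufficient statistic S ~ binomial(n, p), then pick an arrangement of
    S ones among the n positions uniformly at random.  Given S = s, every
    arrangement with sum s has probability 1/C(n, s), free of p."""
    rng = rng or random.Random()
    s = sum(rng.random() < p for _ in range(n))   # S ~ binomial(n, p)
    y = [1] * s + [0] * (n - s)
    rng.shuffle(y)   # uniform over the C(n, s) arrangements with sum s
    return y
```

Because the second step does not involve p, an estimator of p should depend on the data only through S.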
If S is complete and f(S) and g(S) are unbiased, Borel functions of S, then f(S) = g(S)
with probability 1 for each θ.
Proof. If not, then there is an unbiased estimator δ with smaller variance than f(S). But
then E(δ | S) is an unbiased, Borel function of S with variance no larger than that of δ, hence
smaller than that of f(S). But this is
a contradiction because, by Proposition 10.41, E(δ | S) = f(S) with probability 1.
The importance of complete, sufficient statistics has led to a thorough investigation of
settings admitting such statistics. Complete, sufficient statistics have been established
for the very large class of exponential families.
Theorem 10.43. Basu's theorem: Ancillary and complete sufficient statistics are
independent.
Proof. Let F(a1, ..., ak) = P(A1 ≤ a1, ..., Ak ≤ ak). Because A is ancillary, F(a1, ..., ak)
does not depend on θ. Let G(a1, ..., ak | S) be a regular conditional distribution function
of A given S. Then F(a1, ..., ak) = E{G(a1, ..., ak | S)}. Therefore,
There are many applications of Basu's theorem. For example, if Y1, ..., Yn are iid N(μ,
σ²), with σ² known, the sample mean Ȳ is sufficient and complete for μ. On the other
hand, the sample variance s² is ancillary for μ. This follows from the fact that adding the
same constant c to each observation does not change s², so changing μ does not change the
distribution of s². Basu's theorem implies the well-known result that Ȳ and s² are
independent for iid normal observations with finite variance. In fact, we can say more.
The set of consecutive sample variances (s2², s3², ..., sn²) with sample sizes 2, 3, ..., n is also
ancillary for the same reason. Therefore, Ȳ is independent of (s2², s3², ..., sn²). This has
ramifications for adaptive clinical trials in which design changes might be based on
interim variances. Conditioned on those variances, the distribution of Ȳ is the same as
its unconditional distribution. The next example expands on the use of Basu's theorem in
adaptive clinical trials.
Consider a clinical trial with paired data and a continuous outcome Y such as change in
cholesterol from baseline to 1 year. Let Di be the difference between treatment and
control measurements on pair i, and assume that Di ~ N(δ, σ²). We are interested in
testing the null hypothesis H0: δ = 0 versus the alternative hypothesis H1: δ > 0. Before
the trial begins, we determine the approximate number of pairs required for a one-tailed
t-test at α = 0.025 and 90% power using the formula

n0 = (zα + zβ)² σ² / δ²,    (10.32)

where zα and zβ are the standard normal quantiles corresponding to the type I error rate
and the power.
The two quantities we need for this calculation are the size of the treatment effect, δ,
and the variance σ². We can usually determine the treatment effect more easily than the
variance because we argue that if the effect is not at least a certain magnitude, the
treatment is not worth developing. The variance, on the other hand, must be estimated.
Sometimes there is good previous data on which to base σ², other times not. It would be
appealing if we could begin with a pre-trial estimate of the variance, but then modify
that estimate and the sample size after seeing data from the trial itself.
Consider the following two-stage procedure. The first stage consists of half (n1 = n0/2)
of the originally planned number of observations. From this first-stage data, we use
σ̂² = (1/n1) Σ Di² to estimate the variance σ². This is slightly different from the usual sample
variance, which subtracts D̄ from each observation and uses n1 - 1 instead of n1 in the
denominator. Nonetheless, σ̂² is actually very accurate for typical clinical trials in which
the treatment effect is not very large. We then substitute σ̂² for σ² in (10.32) and
compute the sample size N. If N ≤ n0/2, we collect no additional observations.
Otherwise, the second stage consists of the number of additional observations required,
n2 = N - n0/2. It is tempting to pretend that the sample size N had been fixed in advance,
compute the usual t-statistic, and refer it to a t-distribution with N - 1 degrees of
freedom. This is actually a very good approximation, but the resulting test statistic is not
exactly t with N - 1 degrees of freedom.
We can construct an exact test as follows. Let T1 be the t-statistic computed on the first-
stage data. Under the null hypothesis that δ = 0, the first-stage data are iid N(0, σ²), and
σ̂² is a complete, sufficient statistic for σ². On the other hand, T1 is ancillary for σ².
This follows from the fact that dividing each observation by σ does not change T1. By
Basu's theorem, T1 and σ̂² are independent. This implies that, conditional on σ̂², T1 has a
t-distribution with n1 - 1 = n0/2 - 1 degrees of freedom. If there is a second stage, then
conditional on σ̂², the t-statistic T2 using only data from stage 2 has a t-distribution with
n2 - 1 = N - n0/2 - 1 degrees of freedom and is independent of T1. Let P1 and P2 denote
the p-values corresponding to T1 and T2. Conditional on σ̂², P1 and P2 are independent
uniforms, so the inverse probability transformations Z1 = Φ⁻¹(1 - P1) and Z2 = Φ⁻¹(1 -
P2) are independent standard normals. If there is no second stage, we can generate a
superfluous standard normal deviate Z2. Then, conditional on σ̂²,

Z = (√n1 Z1 + √n2 Z2) / √(n1 + n2)

has a standard normal distribution (when there is no second stage, n2 = 0, and the
superfluous random variable Z2 receives zero weight). We reject the null hypothesis
when Z > zα, the (1 - α)th quantile of a standard normal distribution. The conditional
type I error rate given σ̂² is exactly α. Therefore, the unconditional type I error rate is
α. That is, this two-stage procedure provides an exact level α test.
This unconditional approach gives a misleading estimate of the variance for the sample
size actually used, 50. We would instead use the conditional variance given N = 50. That
is, we would condition on N = 50 because N is an ancillary statistic.
The above example may seem artificial because we usually do not flip a coin to decide
whether to double the sample size. But the same issue arises in a more realistic setting.
In a clinical trial with n patients randomly assigned to treatment or control, analyses
condition on the numbers actually assigned to treatment and control. For instance,
suppose that the sample sizes in the treatment and control arms are 22 and 18,
respectively. When we use a permutation test, we consider all (40 choose 22) different ways to
assign 22 of the 40 patients to treatment; we do not consider all 2^40 possibilities that
would result if we did not fix the sample sizes. It makes sense to condition on the
sample sizes actually used because they are ancillary. They give us no information about
the treatment effect. However, the next example is a setting in which the sample sizes
give a great deal of information about the treatment effect.
Example 10.45. Sample size is not always ancillary: ECMO
The Extracorporeal Membrane Oxygenation (ECMO) trial (Bartlett et al., 1985) was in
infants with primary pulmonary hypertension, a disease so serious that the mortality rate
using the standard treatment, placing the baby on a ventilator, was expected to be 80%.
The new treatment was extracorporeal membrane oxygenation (ECMO), an outside-the-
body heart and lung machine used to allow the baby's lungs to rest and heal. Because of
the very high mortality expected on the standard treatment, the trial used a nonstandard
urn randomization technique that can be envisioned as follows. Place one standard (S)
therapy ball and one ECMO (E) ball in an urn. For the first baby's assignment, randomly
draw one of the two balls. If the ball is ECMO and the baby survives, or standard
therapy and the baby dies, then stack the deck in favor of ECMO by replacing the ball
and then adding another ECMO ball. On the other hand, if the first baby is assigned to
ECMO and dies, or to the standard treatment and survives, replace the ball and add a
standard therapy ball. That way, the second baby has probability 2/3 of being assigned
to the therapy doing better so far. Likewise, after each new assignment, replace that ball
and add a ball of the same or opposite treatment depending on whether that baby
survives or dies. This is called a response-adaptive randomization scheme.
The actual data in order are shown in Table 10.1, where 0 and 1 denote alive and dead,
and E and S denote ECMO and standard therapy. The first baby was assigned to ECMO
and survived. The next baby was assigned to the standard therapy and died. Then the
next 10 babies were all assigned to ECMO and survived. At that point, randomization
was discontinued. Table 10.2 summarizes the outcome data. If we use Fisher's exact
test, which is equivalent to a permutation test on binary data, the one-tailed p-value is 1/12 = 0.083.
Table 10.1
Data from the ECMO trial; 0 and 1 denote survival and death, and E and S denote
ECMO and standard treatment.

Outcome       0    1    0    0    0    0    0    0    0     0     0     0
Assignment    E    S    E    E    E    E    E    E    E     E     E     E
Probability  1/2  1/3  3/4  4/5  5/6  6/7  7/8  8/9  9/10 10/11 11/12 12/13

Table 10.2
Summary of outcome data from the ECMO trial.

            Dead  Alive  Total
ECMO          0     11     11
Standard      1      0      1
Total         1     11     12
But the above calculation assumes that the 12 randomizations leading to 11 patients
assigned to ECMO and 1 to standard therapy are equally likely. This is not true if we
condition on the data and order of entry of patients shown in Table 10.1. To see this,
consider the probability of the actual treatment assignments shown in Table 10.1. The
probability that the first baby is assigned to E is 1/2. Because the first baby survived on
E, there are 2 Es and 1 S in the urn when the second baby is randomized. Therefore, the
conditional probability that the second baby is assigned to S is 1/3. Because that baby
died on S, there are 3 Es and 1 S when the third baby is assigned. The conditional
probability that the third baby is assigned to E is 3/4, etc. The probability of the
observed assignment, given the outcome vector, is (1/2)(1/3)(3/4)(4/5) ... (12/13) = 1/26. On
the other hand, the randomization assignment (S,E,E,E,E,E,E,E,E,E,E,E) has probability
1/1716 given the observed outcome vector. The probability of each of the other 10
randomization sequences leading to the marginal totals of Table 10.2 is 1/429
(exercise). Therefore, the sum of probabilities of treatment sequences leading to these
marginals is 1/26+1/1716+10(1/429) = 107/1716. To obtain the p-value conditional on
marginal totals of Table 10.2, we must sum the conditional probabilities of assignment
vectors leading to tables at least as extreme as the observed one, and divide by
107/1716. But the actual assignment vector produces the most extreme table consistent
with the given marginals, so the p-value conditional on marginal totals is p =
(1/26)/(107/1716) = 66/107 = 0.62. This hardly seems to reflect the level of evidence
in favor of ECMO treatment!
The problem with conditioning on the sample sizes is that they are not ancillary. They
are quite informative about the treatment effect. Eleven of 12 babies were assigned to
ECMO precisely because ECMO was working. Therefore, it makes no sense to
condition on information that is informative about the treatment effect. It would be like
arguing that a z-score of 3.1 is not at all unusual, conditioned on the fact that the z-score
exceeded 3; conditioning on Z > 3 makes no sense because we are conditioning away
the evidence of a treatment effect. That is why Wei (1988) did not condition on the
marginals. He summed the probabilities of the actual assignments and the probability of
(E,E,E,E,E,E,E,E,E,E,E,E), which yielded a p-value of 0.051. In the end, the trial
generated substantial controversy (see Begg, 1990 and commentary, or Section 2 of
Proschan and Nason, 2009) and did not convince the medical community. A subsequent
larger trial showed that ECMO was superior to the standard treatment. There are many
valuable lessons from the original ECMO trial, but the one we stress here is that the
sample sizes in clinical trials are not always ancillary. When sample sizes are
informative about the treatment effect, the analysis should not condition on them.
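The urn probabilities in this example can be verified by direct enumeration. The sketch below encodes the urn update rule described above and recomputes 1/26, 1/1716, 1/429, and the conditional p-value 66/107, using exact rational arithmetic:

```python
from fractions import Fraction

OUTCOMES = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 0 = survived, 1 = died (Table 10.1)

def assignment_probability(assignment, outcomes=OUTCOMES):
    """Probability of a treatment-assignment vector under the ECMO urn
    scheme, given the observed outcome vector."""
    e, s = 1, 1                                    # urn starts with one E and one S ball
    prob = Fraction(1)
    for a, o in zip(assignment, outcomes):
        prob *= Fraction(e if a == 'E' else s, e + s)
        # replace the drawn ball and add one favoring the arm doing better
        if (a == 'E' and o == 0) or (a == 'S' and o == 1):
            e += 1
        else:
            s += 1
    return prob

# the 12 assignment vectors with 11 E's and 1 S (the marginals of Table 10.2)
vectors = [tuple('S' if i == pos else 'E' for i in range(12)) for pos in range(12)]
probs = {v: assignment_probability(v) for v in vectors}
observed = ('E', 'S') + ('E',) * 10
p_value = probs[observed] / sum(probs.values())    # conditional p-value, 66/107
```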
Exercises
1. A common test statistic for the presence of an outlier among iid data from N(μ, σ²)
is the maximum normed residual

U = max over i of |Yi - Ȳ| / s.

Using the fact that the sample mean and variance (Ȳ, s²) is a complete, sufficient
statistic, prove that U is independent of (Ȳ, s²).
2. Let Y ~ N(μ, 1), and suppose that A is a set such that P(Y ∈ A) is the same for all
μ. Use Basu's theorem to prove that P(Y ∈ A) = 0 for all μ or P(Y ∈ A) = 1 for
all μ.
3. In the ECMO example, consider the set of possible treatment assignment vectors
that are consistent with the marginals of Table 10.2. Show that the probability of
each of the 10 assignment vectors other than (E,S,E,E,E,E,E,E,E,E,E,E) and
(S,E,E,E,E,E,E,E,E, E,E,E) is 1/429.
4. Let X be exponential (λ). It is known that X is a complete and sufficient statistic
for λ. Use this fact to deduce the following result on the uniqueness of the Laplace
transform of a function f(x) with domain (0, ∞). If f(x) and g(x)
are two functions whose Laplace transforms agree for all t < 0, then f(x) = g(x)
except on a set of Lebesgue measure 0.
10.7 Expect the Unexpected from Conditional Expectation
This section covers some common pitfalls and errors in reasoning with conditional
expectation. In more elementary courses that assume there is an underlying density
function, such reasoning works and is often encouraged. But our new, more general
definition of conditional expectation applies whether or not there is a density function.
With this added generality comes the opportunity for errors, as we shall see. In other
cases, errors result from a simple failure to compute the correct conditional distribution,
as in the two envelope paradox of Example 1.4.
Recall that in the two envelope paradox of Example 1.4, one envelope has twice the
amount of money as the other. The amounts in your and my envelopes are X and Y,
respectively. Simulate this experiment as follows. Generate a random variable T1 from
a continuous distribution on (0, ∞). For simplicity, let T1 be exponential with parameter
1. Generate a Bernoulli (1/2) random variable Z1 independent of T1; if Z1 = 0, set T2 =
(1/2)T1, while if Z1 = 1, set T2 = 2T1. Now generate another independent Bernoulli
Z2; if Z2 = 0, set (X, Y) = (T1,T2), while if Z2 = 1, set (X, Y) = (T2,T1). Repeat this
experiment a thousand times, recording the (X, Y) pairs for each. See which of the
following steps is the first not to hold.
Figure 10.2 shows (X,Y) pairs from a thousand simulations. The points all lie on one of
two lines, Y = (1/2)X or Y = 2X. Thus, statement 1 is true. In this simulation, 491 of the
1000 pairs have X < Y, so statement 2 is true. However, among the 27 pairs with X > 5,
none produced X < Y. It is clear that the conditional probability that Y = x/2 given that
X = x is not 1/2 for large values of x. The problem in the two envelopes paradox is that
we incorrectly conditioned on X = x. The statement that Y is equally likely to be X/2 or
2X is true unconditionally, but not conditional on X.
Figure 10.2
Plot of 1000 simulations of (X, Y) from the two envelopes paradox.
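The simulation behind Figure 10.2 is straightforward to reproduce; a sketch (sample size and seed are our own choices):

```python
import random

def simulate_envelopes(n=10_000, seed=3):
    """Simulate the two-envelope setup: T1 ~ exponential(1), T2 is T1/2 or
    2*T1 with equal probability, and (X, Y) is a random ordering of (T1, T2)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        t1 = rng.expovariate(1.0)
        t2 = 0.5 * t1 if rng.random() < 0.5 else 2.0 * t1
        x, y = (t1, t2) if rng.random() < 0.5 else (t2, t1)
        pairs.append((x, y))
    return pairs

pairs = simulate_envelopes()
frac_x_less = sum(x < y for x, y in pairs) / len(pairs)   # near 1/2 unconditionally
large = [(x, y) for x, y in pairs if x > 5]               # condition on a large X
frac_large_x_less = sum(x < y for x, y in large) / len(large)
```

Every simulated point satisfies y = 2x or x = 2y, and the unconditional fraction with X < Y is near 1/2, yet among pairs with X > 5 the fraction with X < Y falls far below 1/2, exactly as described above.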
The problem with Example 10.47 is that conditional expectation conditions on a random
variable or a sigma-field, not on a single set of probability 0. At this point the reader
may be wondering (1) whether this kind of problem could arise in practice and (2)
whether one could solve the problem by formulating it as a conditional expectation
given a random variable. The next example answers both of these questions.
Example 10.48. Conditioning on the equality of two continuous random variables: the Borel
paradox
The gene status random variable is fairly simple because it takes only two values, but
we might try to extend the same reasoning to a setting with continuous random variables.
This is exactly the setting of Example 1.3. In that example, we postulate a linear model
Y = β0 + β1X + ε relating the continuous outcome Y to the continuous covariate X. For
simplicity, we assume that X ~ N(0,1). We then attempt to obtain the relationship
between Y values of two people with the same X value. In formulation 2 of the
problem, we imagine continually observing pairs (X, Y) until we find two pairs with
the same X value. The first step toward determining the relationship between the two Y
values is determining the distribution of the common X, i.e., the distribution of X1 given
that X1 = X2. Formulating the event X1 = X2 in different ways, as X2 - X1 = 0 or X2/X1
= 1, gives different answers.
The problem is that in the plane, sets of the form x2/x1 ≤ a look quite different from sets
of the form x2 - x1 ≤ b, and this leads to different conditional distributions given X2/X1
versus given X2 - X1. The conditional distribution of X1 given X1 = X2 is not well-
defined. Again we cannot think in terms of conditioning on sets of probability 0, but
only on random variables or sigma-fields. Because we can envision sets of probability
0 in more than one way as realizations from random variables, in general, there is not a
unique way to define conditioning on an arbitrary set of probability 0.
The last step follows from the independence of X and Y. The second step replaces the
random variable X with its value, x. Such arguments abound in statistics, and they can
help us deduce results. Nonetheless care is needed in carrying out substitution in
conditional expectations. This was not the case in elementary statistics courses because
the definition of conditional expectation was much more specific, namely Expression (10.2).
But the more general Definition 10.2 admits different versions of conditional
expectation, and this can lead to confusion.
When we take the expected value of this expression over the distribution of X, we reach
the correct conclusion that P(X ≤ x0) = E{I(X ≤ x0)}.
There is potential for confusion whenever there is a random variable X and its value x
in the same expression, as in E{f(x, Y)| X = x}. The following example makes this more
clear.
Let f(x, y) = x + y, and let X and Y be iid standard normals. Consider the calculation of
E{f(X, Y) | X = 0} using the rule E{f(0, Y) | X = 0} = E(0 + Y | X = 0) = 0 + E(Y | X =
0). One version of E(Y | X) is E(Y) = 0 because X and Y are independent. Using this
version, we get E{f(0, Y) | X = 0} = 0 + 0 = 0. But we can change the definition of E(Y |
X) at the single value X = 0, and it will remain a version of E(Y | X). For instance,

E(Y | X) = 0 if X ≠ 0, and E(Y | X) = 1 if X = 0.    (10.35)

Using this version, we get E{f(0, Y) | X = 0} = 0 + 1 = 1. Of course, we could replace the
value 1 in Equation (10.35) with any other value, so E{f(0, Y) | X = 0} could be any
value whatsoever.
The same problem holds if we replace the conditioning value 0 by any value x. In this
example, E{f(x, Y) | X = x} = E(x + Y | X = x) = x + E(Y | X = x). Let g(x) be an arbitrary
Borel function. One version of E(Y | X) is

E(Y | X) = 0 if X ≠ x, and E(Y | X) = g(x) - x if X = x.

With this version, E{f(x, Y) | X = x} = x + g(x) - x = g(x). But g(x) was arbitrary, so
E{f(x, Y) | X = x} could literally be any Borel function of x.
In light of the above examples, which of the following two statements is incorrect?
The statements seem equivalent, but they are not. The first involves a specific version of
E{f(x,Y) | X = x}, whereas E{f(x, Y)| X = x} in the second statement is any version of
E{f(x,Y) | X = x}. Thus, the second statement asserts that E{f(X, Y)| X = x} = E{f(x,Y) |
X = x} for any version of E{f(x, Y)| X = x}. Example 10.50 is a counterexample.
Let X and Y be random vectors and φ(X, Y) be a function such that |φ(X, Y)| is
integrable. If G(y | x) is a regular conditional distribution function of Y given X = x, then
one version of E{φ(X, Y) | X = x} is ∫ φ(x, y) dG(y | x).
An immediate consequence of Proposition 10.51 is the following.
Suppose that X and Y are independent with Y ~ G(y). If |φ(X, Y)| is integrable, then
one version of E{φ(X, Y) | X = x} is ∫ φ(x, y) dG(y).
In a clinical trial comparing means in the treatment and control arms, the treatment effect
estimate is

δ̂ = Σ ZiYi / nT - Σ (1 - Zi)Yi / nC,

where Zi is 1 if patient i is assigned to treatment and 0 if control, and nT and nC are the
numbers of patients assigned to treatment and control, respectively. That is, δ̂ is the
treatment mean minus the control mean.
Under the null hypothesis, treatment has no effect on outcome, and the treatment vector Z
is assumed independent of the outcome vector Y. A regular conditional distribution
function F(z | Y = y) of Z given Y = y is its unconditional distribution with mass function
1/(n choose nT) for each string z of zeroes and ones with exactly nT ones. Therefore, a regular
conditional distribution function of δ̂ given Y = y is the permutation distribution putting
mass 1/(n choose nT) on the value of δ̂ computed from each such assignment z.
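A small enumeration sketch of this permutation distribution, with toy data (the outcome values and sample sizes are our own):

```python
from itertools import combinations

def permutation_distribution(y, n_t):
    """All values of the treatment-effect estimate (treatment mean minus
    control mean) as the assignment vector ranges over the C(n, n_t)
    equally likely strings with exactly n_t ones."""
    n, total = len(y), sum(y)
    effects = []
    for treated in combinations(range(n), n_t):
        sum_t = sum(y[i] for i in treated)
        effects.append(sum_t / n_t - (total - sum_t) / (n - n_t))
    return effects

y = [5.1, 4.8, 6.0, 3.9, 4.2, 4.0]   # toy outcomes; patients 0, 1, 2 were treated
effects = permutation_distribution(y, 3)
observed = effects[0]                # combinations() lists (0, 1, 2) first
p_value = sum(e >= observed for e in effects) / len(effects)
```

Referring the observed estimate to this enumerated distribution gives the permutation p-value.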
In a one-sample t-test setting with iid normal observations, the standard 95% confidence
interval for the mean is Ȳ ± t(n-1, α/2) sn/√n, where n is the sample size, sn is the sample
standard deviation, and t(n-1, α/2) is the upper α/2 point of a t-distribution with n - 1
degrees of freedom. Notice that the width of the confidence interval is random because
it depends on sn. Even though the width tends to 0 almost surely as n → ∞, even for
large sample size n, there is some small probability that sn is large enough to make the
interval wide. Stein (1945) showed how to construct a confidence interval with a fixed
and arbitrarily small width.
The first step of Stein's method is to take a subsample of size m, say m = 30. Let sm be
the sample standard deviation for this subsample. Once we observe the subsample, sm
is a fixed number. We choose the final sample size N = N(sm) as a certain Borel
function of sm to be revealed shortly. We then observe N - m additional observations.
Consider the distribution of

T = √N (ȲN - μ) / sm.

That is,

T = {√N (ȲN - μ) / σ} / (sm / σ)

is the ratio of a standard normal and the square root of an independent chi-squared (m -
1) random variable divided by its number of degrees of freedom. By definition, T has a
t-distribution with m - 1 degrees of freedom.
The above development is very helpful to deduce the distribution of T, but is somewhat
awkward as a proof. Once we know the right answer, we can circumvent conditioning
and provide a more appealing proof:
The third line follows from the independence of (Ȳm, Ym+1, Ym+2, ...) and sm and the fact that N = N(sm)
is a function of sm. Equation (10.40) shows that the joint distribution function of
√N (ȲN - μ)/σ and sm/σ
is that of two independent random variables, the first of which is standard normal. Also,
we know that (m - 1)sm²/σ² is chi-squared with m - 1 degrees of freedom. It follows that
(10.39) is the ratio of a standard normal deviate to the square root of a chi-squared (m -
1) divided by its degrees of freedom. Therefore, T has a t-distribution with m - 1
degrees of freedom.
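The distributional claim is easy to check by simulation. In the sketch below, the rule N(sm) is an arbitrary illustrative Borel function (the text's specific rule, which targets a fixed interval width, is not needed for the claim); whatever rule is used, T should behave like a t random variable with m - 1 = 29 degrees of freedom:

```python
import math
import random
import statistics

def stein_t_statistic(mu=0.0, sigma=3.0, m=30, rng=None):
    """One draw of T = sqrt(N) * (Ybar_N - mu) / s_m, where s_m is the
    first-stage sample standard deviation and N = N(s_m) is a Borel
    function of s_m (an arbitrary rule is used here for illustration)."""
    rng = rng or random.Random()
    first = [rng.gauss(mu, sigma) for _ in range(m)]
    s_m = statistics.stdev(first)                 # first-stage SD (m - 1 denominator)
    n_total = max(m, math.ceil(4 * s_m ** 2))     # illustrative N(s_m)
    data = first + [rng.gauss(mu, sigma) for _ in range(n_total - m)]
    return math.sqrt(n_total) * (statistics.fmean(data) - mu) / s_m

rng = random.Random(4)
draws = [stein_t_statistic(rng=rng) for _ in range(20_000)]
# a t with 29 degrees of freedom has mean 0 and variance 29/27
```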
Exercises
1. In Example 10.47, let Y(ω) = ω and consider two different sigma-fields. The first,
F1, is the sigma-field generated by I(ω is rational). The second, F2, is the sigma-
field generated by Y. What are E(Y | F1) and E(Y | F2)? Give regular conditional distribution
functions for Y given F1 and Y given F2.
2. Show that the (X, Y) pairs in Example 10.48 exhibit quirky behavior even when X
is a binary gene status random variable. More specifically, consider the following
two ways of simulating pairs (X, Y). Method 1: generate a gene status (present or
absent) for person 1, then assign that same value to person 2. Then generate the two
Ys using the regression equation, with ε having a N(0, σ²) distribution. Method 2:
continue generating (X, Y) pairs until you find two pairs with the same X value.
Show that the distribution of the common value of X is different using Method 1
versus Method 2.
Let (Xn, Yn) be a sequence of random variables with joint distribution function Hn(x, y)
and marginal distribution functions Fn(x) and Gn(y). Write Hn(x, y) as

Hn(x, y) = P(Xn ≤ x, Yn ≤ y) = P(Xn ≤ x) P(Yn ≤ y | Xn ≤ x) = Fn(x) Ln(y | x),

where Ln(y | x) denotes the conditional distribution function of Yn given Xn ≤ x.
The next question is whether weak convergence of the joint distribution function is
equivalent to weak convergence of the marginal distribution function plus weak
convergence of the conditional distribution function Kn(y | Xn = x). More specifically,
our question is:
This differs from the development of the preceding paragraph because we are now
considering the conditional distribution of Yn given Xn = x rather than the conditional
distribution of Yn given Xn ≤ x. Before proceeding, we note that we have already seen one
setting in which this holds, namely when (Xn,Yn) are independent.
Let Hn(x, y) and Kn(y | x) denote the joint and conditional distribution of (Xn, Yn) and
Yn given Xn = x respectively. Suppose that Fn converges weakly to a distribution
function F and Kn(y | x) converges weakly to a conditional distribution function K(y | x).
Does it follow that the joint distribution function Hn(x,y) converges weakly to F(x)K(y |
x)? Likewise, suppose that Hn(x, y) converges weakly to the joint distribution function
H(x, y). Does it follow that the conditional distribution function Kn(y | x) converges
weakly to the corresponding conditional distribution function K(y | x) of H?
Unfortunately, the answer to both of these questions is no. The following example shows
that even in a relatively simple setting in which Xn and Yn take on only two possible
values and the limiting joint and conditional distributions are point masses, those point
masses may disagree.
Let Xn take the value 0 with probability 1/n and the value 1 with probability 1 - 1/n,
and let Yn = I(Xn > 0). The conditional distribution of Yn given Xn = 0 is a point mass
at 0. Therefore, the conditional distribution of Yn given Xn = 0 converges weakly to a
point mass at 0. On the other hand, (Xn,Yn) converges in probability to (1,1). Therefore,
the joint distribution of (Xn,Yn) converges weakly to a point mass at (1,1).
If Fn converges weakly to F and Kn(y | x) converges strongly to K(y | x), then Hn(x,y)
converges weakly to H(x,y).
The first question is whether this limit always exists and is finite. Notice that both the
numerator and denominator are increasing functions of ε, so each decreases to some
limit as ε ↓ 0. If P(X = x) = p > 0, then by the continuity property of probability, the
denominator tends to p, while the numerator tends to P(X = x, Y ≤ y). Thus, the limit in
(10.42) is P(X = x, Y ≤ y)/p. On the other hand, if P(X = x) = 0, then the numerator and
denominator of Expression (10.42) both tend to 0 as ε ↓ 0. Nonetheless, the ratio
cannot blow up because the numerator can never exceed the denominator. That is, the
ratio in Expression (10.42) cannot exceed 1, so cannot have an infinite limit. Still, there
could be two or more distinct limit points as ε ↓ 0, in which case the limit would not
exist.
Example 10.58.
Let rn = 1/2^n, n = 1, 2, ..., and let sn = (rn + rn+1)/2, so that rn+1 < sn < rn, n = 1, 2, ...
(see Figure 10.3). Let X take value rn with probability 1/2^(n+1) and sn with probability
1/2^(n+1), n = 1, 2, ... Define Y to be -1 if X = rn for some n, and 1 if X = sn for some n.
Consider Expression (10.42) for x = 0, y = 0. Suppose that εn = rn. Conditioned on 0 -
εn < X < 0 + εn, X can take any of the values rn+1, rn+2, ... or sn, sn+1, ... The
conditional probability that X is one of rn+1, rn+2, ... is

2^-(n+1) / {2^-(n+1) + 2^-n} = 1/3.
Figure 10.3
Points rn (circles) and sn (squares) in Example 10.58. Points strictly to the left of the
vertical lines represent X < εn when εn is rn (solid line) or sn (dashed line).
Thus, P(Y ≤ 0 | -εn < X < εn) = 1/3 for εn = rn. If we now let n → ∞, P(Y ≤ 0 | -εn
< X < εn) → 1/3 along this sequence.
On the other hand, if εn = sn, then X can take any of the values rn+1, rn+2, ... or sn+1,
sn+2, ... Moreover, X is equally likely to be one of the ri or one of the si. Therefore, if
εn = sn, then P(Y ≤ 0 | -εn < X < εn) = 1/2. We have shown that P(Y ≤ 0 | -ε < X <
ε) tends to 1/3 if ε → 0 along the path rn, and tends to 1/2 if ε → 0 along the path sn.
Thus, the limit in Expression (10.42) does not exist at x = 0, y = 0. The fact that the limit
may not exist highlights one problem with using Expression (10.42) as the definition of
the conditional distribution of Y given X = x.
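The two limit points can be computed exactly from the geometric point masses; a small sketch in exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction

def p_y_leq_0(n, eps_type, n_terms=60):
    """P(Y <= 0 | -eps < X < eps) in Example 10.58 with eps = r_n or s_n.

    P(X = r_k) = P(X = s_k) = 2^-(k+1), and Y <= 0 exactly when X is one
    of the r's.  With eps = r_n the window keeps r_{n+1}, r_{n+2}, ... and
    s_n, s_{n+1}, ...; with eps = s_n it keeps r_{n+1}, ... and s_{n+1}, ...
    The geometric tails are truncated after n_terms terms; the truncation
    factor cancels in the ratio, so the returned value is exact."""
    mass = lambda k: Fraction(1, 2 ** (k + 1))
    r_mass = sum(mass(k) for k in range(n + 1, n + 1 + n_terms))
    s_start = n if eps_type == 'r' else n + 1
    s_mass = sum(mass(k) for k in range(s_start, s_start + n_terms))
    return r_mass / (r_mass + s_mass)
```

Along εn = rn the conditional probability is 1/3 for every n, while along εn = sn it is 1/2, confirming that the ε ↓ 0 limit does not exist at x = 0.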
Even if the limit in Expression (10.42) exists for a given x, it need not be a distribution
function in y. The following example illustrates this.
Example 10.59.
Because the limit in Expression (10.42) may either not exist or not be a distribution
function in y for some x, it is problematic to define the conditional distribution function
by Expression (10.42). Nonetheless, in Example 10.58, the set of x points such that the
limit in Expression (10.42) does not exist has probability 0. In Example 10.59, the set of
x points at which Expression (10.42) fails to converge to a distribution function has
probability 0. These are not accidents.
For any random variables (X,Y), the set N of x points such that (10.42) either fails to
exist or fails to be a distribution function in y has P(X N) = 0.
Thus, one actually could take Expression (10.42) as a definition of the conditional
distribution function, and define the conditional distribution function to be some
arbitrary F if the limit either does not exist or is not a distribution function in y for a
given x. Although intuitive, this definition is avoided because it makes proofs more
difficult.
Exercises
1. Suppose that (X, Y) has joint distribution function F(x,y), and X has marginal
distribution function G(x). Suppose that, for a given (x,y), F/x exists, and G(x)
exists and is nonzero. What is Expression (10.42)?
To apply this theorem in the setting of conditional expectation, note that if E(|Y|) < ∞,
ν(A) = E{|Y|I(A)} for A ∈ F defines a finite measure on F. Therefore, there exists a
nonnegative, F-measurable function Z(ω) such that ν(A) = E{ZI(A)}. That is, E{|Y|I(A)} =
E{ZI(A)}. The same argument can be used for E{Y⁺I(A)} and E{Y⁻I(A)}.
10.10 Summary
Let Y be a random variable with E(|Y|) < ∞, F be a sigma-field, and Z = E(Y | F).
(a) Z is F-measurable.
(b) E{ZI(A)} = E{YI(A)} for all A ∈ F.
2. Function of X: If A = σ(X), the sigma-field generated by the random vector X, then
there is an extended Borel function f such that Z = f(X) a.s.
3. Geometry and prediction: Suppose that Y ∈ L2, and let L2(A) be the space of A-
measurable random variables with finite second moment.
If E(|Y|) < ,
If ,
6. Paradoxes
Applications
This chapter takes an in-depth look at the use of probability in practical applications we
have encountered over the course of our careers. They range from questions about the
validity of permutation tests in different settings to conditional independence and path
diagrams to asymptotic arguments.
One application of F(X) ~ uniform (0,1) is in flow cytometry. Researchers can use a
technique called intracellular cytokine staining using a flow cytometry machine to find
out whether individual cells respond to pieces of a virus or other antigens by producing
regulatory proteins called cytokines. Researchers add a fluorescent dye that attaches to
the cytokine of interest, so by shining a fluorescent light and recording the light intensity,
they can measure the level of the cytokine. Each cell is classified as responding or not
on the basis of its brightness when the light is shone through it. The problem is that there
is no consensus on how bright the cell must be to be declared responsive. Researchers
point out that there can be day-to-day shifts in the fluorescence of cells. Also, cells can
sometimes respond under background conditions without being exposed to the antigen. It
is therefore essential to study a control group of cells that are not exposed to the antigen.
The goal is to compare the immune response to the antigen, as measured by the number
of responding cells, to the immune response under background conditions. See Nason
(2006) for more details and statistical issues.
One method for determining the positivity threshold for cell response is to eyeball the
responses of the control cells and pick a value that seems to separate two different
distributions of light intensities. This cutpoint determined from the control sample is
then applied to the cells exposed to the antigen, and a comparison is made to determine
whether the probability of response is different between the exposed and control cells.
How do we test whether the probability of response to the antigen is greater than the
probability of response to background conditions? Fisher's exact test does not really
apply because Fisher's exact test conditions on the total number of responding cells in
both samples. That would be appropriate if the threshold had been determined using all
cells instead of just the exposed cells. Immunologists would be extremely reluctant to
set a threshold based on the combined control and stimulated cells; this would not make
sense to them from a scientific standpoint. Given that the threshold was set using the
control sample, is there a valid way to test whether the probability of response is
greater among stimulated cells?
While there is no method that is guaranteed to be valid when the threshold is determined
in a subjective way, some methods are more valid than others. For instance, we can
imagine that the threshold corresponds to selecting a certain order statistic from the
control sample. We can then determine whether the number of exposed cells exceeding
that order statistic is very large. We next determine the distribution of the number of
exposed cells exceeding a given order statistic of the control sample.
Let X1,..., Xm and Y1,..., Yn be the fluorescence levels of control and exposed cells,
respectively. Under the null hypothesis, these are iid from some continuous distribution.
Imagine that the threshold in the control sample corresponds to r cells being at least as
large as it, which corresponds to selecting the (m − r + 1)st order statistic X(m−r+1).
What is the distribution of the number of Ys that exceed X(m−r+1)? Without loss of
generality, we can assume that the underlying distribution of the data is uniform [0,1]. To
see this, let F be the common distribution function of the Xs and Ys. Then F(Xi), i =
1,..., m and F(Y1),..., F(Yn) are iid uniform [0,1], and the number of Yi exceeding
X(m−r+1) is the same as the number of F(Yi) exceeding F(X(m−r+1)). Therefore, we can
assume, without loss of generality, that X1,..., Xm and Y1,..., Yn are iid uniform [0,1].
Conditioned on X(m−r+1) = x, the indicators I(Yi > x), i = 1,..., n, are iid Bernoulli random
variables with probability p = (1 − x). Therefore, P(S = s | X(m−r+1) = x) = C(n, s)(1 − x)^s x^(n−s),
where S is the number of Ys exceeding X(m−r+1). Now integrate over the density
{m!/((m − r)!(r − 1)!)} x^(m−r) (1 − x)^(r−1) of X(m−r+1) to deduce that

P(S = s) = C(n, s) {m!/((m − r)!(r − 1)!)} (m + n − r − s)!(r + s − 1)!/(m + n)!,  s = 0, 1,..., n.  (11.1)
We can use this distribution to compute a p-value. In flow cytometry applications, m and
n are usually very large, whereas a relatively small number of points exceed the
threshold. To approximate the above distribution, assume that as n → ∞, m = mn → ∞
such that mn/(mn + n) → p. Use the facts that, as n → ∞, C(n, s) ∼ n^s/s!,
m!/(m − r)! ∼ m^r, and (m + n − r − s)!/(m + n)! ∼ (m + n)^−(r+s). We conclude that Expression (11.1)
tends to

C(r + s − 1, s) p^r (1 − p)^s,  s = 0, 1, 2,...  (11.2)

as n → ∞. This is the negative binomial probability mass function (Johnson, Kotz, and
Kemp, 1992). We can view Expression (11.2) as the probability of exactly s failures by
Kemp, 1992). We can view Expression (11.2) as the probability of exactly s failures by
the time of the rth success in a sequence of iid Bernoulli trials with success probability
p. Equivalently, it is the probability that the total number of trials required to achieve r
successes is r + s. In summary, Expression (11.1) is a density with respect to counting
measure on the nonnegative integers, and converges to Expression (11.2), another
density with respect to counting measure on the nonnegative integers. By Scheffé's
theorem (Theorem 9.6), the probability of any set A of nonnegative integers under (11.1)
converges uniformly to the corresponding probability under (11.2). This can be used to
very closely approximate the p-value.
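The exact distribution and its negative binomial limit are easy to compare numerically. The sketch below (function names and parameter values are ours, not the text's) rewrites Expression (11.1) with binomial coefficients, using m!/((m − r)!(r − 1)!) = r·C(m, r) and (r + s − 1)!(m + n − r − s)!/(m + n)! = 1/{(r + s)·C(m + n, r + s)} to avoid huge factorials:

```python
from math import comb

def exact_pmf(s, m, n, r):
    # Expression (11.1) rewritten with binomial coefficients:
    # P(S = s) = C(n,s) * r * C(m,r) / {(r+s) * C(m+n, r+s)}.
    return comb(n, s) * r * comb(m, r) / ((r + s) * comb(m + n, r + s))

def negbin_pmf(s, r, p):
    # Expression (11.2), the negative binomial limit with p = m/(m+n).
    return comb(r + s - 1, s) * p ** r * (1 - p) ** s

# Sanity check: with m = n = 1 and r = 1, S is 0 or 1 with probability 1/2.
check = exact_pmf(0, 1, 1, 1)

# Large m and n: (11.1) sums to 1 over s = 0,..., n and is close to (11.2).
m, n, r = 2000, 1000, 5
total = sum(exact_pmf(s, m, n, r) for s in range(n + 1))
p = m / (m + n)
approx_error = max(abs(exact_pmf(s, m, n, r) - negbin_pmf(s, r, p))
                   for s in range(30))
```

A p-value for an observed count s0 is then the tail sum of either pmf over s ≥ s0.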
Exercises
1. Suppose there are n red and n blue balls in a container, and we draw k balls
without replacement. Let Yn be the number of balls in the sample that are red. For
each step of the following argument, tell whether it is true or false. If it is true,
prove it.
11.2.1 T-Test
Consider a two-sample t-test of the null hypothesis that μT − μC = 0 versus the
alternative that μT − μC > 0. Assume first that data in each treatment arm are iid normal
with common, known variance σ². The z-statistic, (ȲT − ȲC)/(2σ²/n)^1/2, exceeds zα with
probability

Φ(δ/(2σ²/n)^1/2 − zα),  (11.3)

where δ = μT − μC and n is the per-arm sample size.
If n → ∞ and everything else remains fixed, power tends to 1. For power to tend to
something other than 1 as n → ∞, the size of the treatment effect must tend to 0 as n →
∞. If δ = δn is such that δn/(2σ²/n)^1/2 → zα + zβ, then Expression (11.3) tends to
Φ(zβ). For example, if we want 90% power, set β = 0.1 and zβ = 1.28;
power tends to Φ(1.28) = 0.90. The local alternative in this two-sample t-test setting is
δn = (zα + zβ)(2σ²/n)^1/2.
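These two behaviors can be checked numerically with Python's statistics.NormalDist (the δ and σ² values below are ours, chosen for illustration): power under a fixed alternative climbs to 1, while under the local alternative it stays at Φ(zβ) = 0.90 for every n.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()
alpha, beta = 0.05, 0.10
z_alpha, z_beta = norm.inv_cdf(1 - alpha), norm.inv_cdf(1 - beta)
sigma2 = 1.0

def power(delta, n):
    # Expression (11.3): power of the one-tailed level-alpha z-test.
    return norm.cdf(delta / sqrt(2 * sigma2 / n) - z_alpha)

# Fixed effect delta = 0.5: power tends to 1 as n grows.
fixed = [power(0.5, n) for n in (10, 100, 1000)]

# Local alternative delta_n = (z_alpha + z_beta) * (2*sigma2/n)^(1/2):
# power equals Phi(z_beta) = 0.90 for every n.
local = [power((z_alpha + z_beta) * sqrt(2 * sigma2 / n), n)
         for n in (10, 100, 1000)]
```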
Readers may feel that local alternatives are very strange. After all, clinical trials are
often powered for the smallest effect thought to be clinically relevant, or an effect
similar to what was observed in another study. The idea that the alternative should
become closer to the null value as the sample size gets larger is perplexing. Actually,
local alternatives are realistic; we would not power a study at a value extremely close
to 1. Rather, we might fix power at, say, 90%, and determine the per-arm sample size n.
If n is fairly large, then the approximation obtained by assuming that δ/(2σ²/n)^1/2 is
close to a constant is likely to be quite accurate. This is reminiscent of approximating
binomial probabilities with n large and pn small by Poisson probabilities with
parameter λ = npn (see Proposition 6.24).
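The Poisson analogy can be checked directly; with n = 1000 and pn = 0.002 (values ours), binomial and Poisson(λ = npn) probabilities agree closely:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    # Binomial(n, p) probability of exactly k successes.
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    # Poisson(lam) probability of exactly k events.
    return exp(-lam) * lam ** k / factorial(k)

n, p = 1000, 0.002
lam = n * p  # lambda = n * p_n = 2
max_diff = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(20))
```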
We originally assumed that data were iid normal with common known variance, but we
now show that the power formula (11.3) holds asymptotically regardless of the
underlying (continuous) distribution of data, provided that the variance is finite. We also
relax the assumption that σ² is known. Because the local alternative changes with n, it
may appear that we must invoke the Lindeberg-Feller CLT rather than the ordinary CLT.
Nonetheless, we will show that only the ordinary CLT is required.
Let Y1,..., Y2n denote the observations, with the first n being control data. Assume that
Yi ∼ F(y) in the control arm and Yi ∼ F(y − δ) in the treatment arm, where F is a
continuous distribution function with finite variance. This is a so-called shift
alternative. Consider first the null hypothesis that μT = μC. Generate 2n iid
observations from F, and let Zn be the t-statistic

Zn = (ȲT − ȲC)/(2s²/n)^1/2,

where ȲT and ȲC are the treatment and control means and s² is the pooled variance. By
the CLT coupled with Slutsky's theorem, Zn tends to N(0,1).
Incidentally, the reason for specifying that the distribution underlying the Yi be
continuous is that we added δn to Yn+1,..., Y2n. If the Yi take only values 0 or
1, this would not make sense.
Exercises
We now generate data under a local alternative. We cannot use exactly the same
technique as for the continuous outcome case because addition of δn to each
observation leads to values outside the support of the binary variables Yi. Instead we
adopt the following approach. Examine the observations Yn+1,..., Y2n and leave any Yi
= 0 alone. If Yi = 1, then switch it to 0 with probability εn to be specified shortly. This
will create new iid observations Y′i in the treatment group with
P(Y′i = 1) = p(1 − εn). Now choose εn to make this probability p − δn. That is,
εn = δn/p. Now compute the statistic U′n, namely Un with Yi replaced by Y′i for i =
n+1,..., 2n. Then
We have shown rigorously that an asymptotically valid sample size formula to achieve
power 1 − β in a one-tailed test of proportions at level α can be obtained by solving
(11.7) for n. This yields

n = (zα + zβ)² {pC(1 − pC) + pT(1 − pT)}/(pT − pC)²

per arm.
11.2.3 Summary
We have shown rigorously that in both the continuous and binary settings, asymptotically
valid sample size formulas can be obtained by equating the expected z-score to zα + zβ
(Equations (11.5) and (11.7)) and solving for n. Conversely, for a given per-arm sample
size n, approximate power can be obtained by solving Equations (11.5) and (11.7) for 1
− β by first subtracting zα from both sides and then applying the normal distribution
function Φ to both sides.
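Both directions of this recipe can be sketched for the continuous case (argument values are ours; the closed form comes from setting δ/(2σ²/n)^1/2 = zα + zβ and solving for n):

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()

def per_arm_n(delta, sigma2, alpha=0.05, beta=0.10):
    # Equate the expected z-score delta/(2*sigma2/n)^(1/2) to
    # z_alpha + z_beta and solve for n, rounding up.
    z = norm.inv_cdf(1 - alpha) + norm.inv_cdf(1 - beta)
    return ceil(2 * sigma2 * z ** 2 / delta ** 2)

def approx_power(delta, sigma2, n, alpha=0.05):
    # The reverse direction: subtract z_alpha and apply Phi.
    return norm.cdf(delta / sqrt(2 * sigma2 / n) - norm.inv_cdf(1 - alpha))

# Example: delta = 0.5, sigma^2 = 1, one-tailed alpha = 0.05, 90% power.
n = per_arm_n(delta=0.5, sigma2=1.0)
```

With these inputs the smallest adequate per-arm n is 69; one fewer subject per arm drops power just below 90%.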
An alternative self-controlled design considers people who received the vaccine and
developed the disease within a certain period of time, say 84 days. The idea is to see
whether the disease occurred close to the time of vaccination. If the vaccine does not
increase the risk of disease, then on any given day, there is a tiny and equal probability
of disease onset. Therefore, we are interested in testing the null hypothesis that time of
disease onset of participants in the self-controlled design is uniformly distributed on [0,
84]. One alternative hypothesis is that time from vaccination to disease onset in the
general population follows a Weibull distribution function F(t) = 1 − exp(−λt^γ) with γ
< 1. Under this distribution, there is higher risk of disease soon after vaccination. Under
the null hypothesis, γ = 1, and λ is tiny because the condition is very rare; λ is also tiny
under realistic alternative values of γ close to, but smaller than, 1. Because we are
studying only people who developed the disease within 84 days of vaccination, we must
condition on T ≤ 84. The conditional distribution function for T given that T ≤ 84 is

P(T ≤ t | T ≤ 84) = {1 − exp(−λt^γ)}/{1 − exp(−λ84^γ)},  0 ≤ t ≤ 84.  (11.8)

We can approximate the distribution for tiny λ by taking the limit of (11.8) as λ → 0. It
is an exercise to prove that this limit is (t/84)^γ. With the transformed variable U = T/84,
we would like to construct a powerful test of the null hypothesis that U is uniform on
[0,1] versus the alternative that the distribution function for U is G(u) = u^γ, γ < 1.
A very powerful way to test a null hypothesis against an alternative hypothesis is based
on the likelihood ratio. The likelihood is the probability mass or density function of the
data. We take the ratio of the likelihood assuming the alternative hypothesis to the
likelihood under the null hypothesis. A large likelihood ratio means that the data are
more consistent with having arisen from the alternative, rather than the null, hypothesis.
It is usually more convenient to compute the logarithm of the likelihood ratio, rejecting
the null hypothesis for large values. The log likelihood for a sample of n observations
u1,..., un from G(u) = u^γ is n ln γ + (γ − 1) Σ ln ui. The likelihood ratio test of γ = 1 versus γ < 1
rejects the null hypothesis for large values of

Σ (− ln Ui).  (11.9)

The null distribution of Expression (11.9) is gamma with parameters (n, 1). Thus, we
reject the null hypothesis if Expression (11.9) exceeds the upper α point of a gamma (n,
1) distribution.
A practical concern is that we do not really have the precise time of onset of disease.
We have only the day of onset. Thus, we actually observe a discrete version of U. It
seems that there should not be a problem because we have seen that a discrete uniform
converges to a continuous uniform as the number of equally-spaced support points tends
to ∞ (see Example 6.22 for the setting of dyadic rationals). Using the continuous uniform
approximation works well in some circumstances. For example, try simulating 10
observations from a discrete uniform distribution on {1,2,..., 1000}. Do this thousands
of times, computing the test statistic (11.9) each time. You will find that the proportion
of times the test statistic exceeds the upper 0.05 point of a gamma (10,1) distribution is close
to 0.05 (exercise). But now modify the simulation by sampling 500 observations on {1,
2,..., 100}. You will find that the proportion of simulated experiments such that the test
statistic (11.9) exceeds the upper 0.05 point of a gamma (500,1) distribution is much less than
0.05 (exercise).
What went wrong? The test statistic (11.9) blows up at u = 0. That is, the test statistic
when U has a continuous uniform distribution can be very large at times. On the other
hand, the discrete uniform never produces an arbitrarily large value. Approximating the
discrete uniform with a continuous uniform works well if the number of participants is
much smaller than the number of days, but not the other way around. The reader is
invited to confirm this using simulation.
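The simulation just described can be sketched as follows (the seed and replication counts are ours). Here the gamma tail probability is computed on the log scale, because exp(−x) underflows when the statistic is in the hundreds:

```python
import random
from math import exp, lgamma, log

random.seed(12345)

def gamma_n1_sf(x, n):
    # P(gamma(n,1) > x) = P(Poisson(x) <= n - 1), summed on the log scale
    # so that large statistics (x in the hundreds) do not underflow.
    if x <= 0:
        return 1.0
    logs = [-x + k * log(x) - lgamma(k + 1) for k in range(n)]
    m = max(logs)
    return exp(m) * sum(exp(v - m) for v in logs)

def rejection_rate(n_obs, n_days, reps=2000):
    # Day of onset is discrete uniform on {1,..., n_days}; U = day/n_days.
    # Apply the continuous-uniform gamma test (11.9) at level 0.05.
    rejections = 0
    for _ in range(reps):
        stat = sum(-log(random.randint(1, n_days) / n_days)
                   for _ in range(n_obs))
        if gamma_n1_sf(stat, n_obs) < 0.05:
            rejections += 1
    return rejections / reps

few_subjects = rejection_rate(10, 1000)    # 10 observations, 1000 days
many_subjects = rejection_rate(500, 100)   # 500 observations, 100 days
```

Consistent with the discussion above, the first proportion should be close to 0.05 and the second well below it.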
Exercises
Suppose you begin a clinical trial comparing a control to 5 new regimens with respect
to a continuous outcome. For simplicity, assume that the common variance σ² in
different arms is known. Let Zi be the z-score comparing arm i to the
control, and assume that a large z-score indicates that arm i is effective compared to the
control. It is common practice to use two-tailed tests in clinical trials, so the ith arm is
declared significantly different from the control if |Zi| > c, where c is a critical value. If
we use the Bonferroni method to adjust for multiple comparisons, we would divide 0.05
by 5 and use a two-tailed test at level 0.01. The required critical value is c = 2.576, the
upper 0.005 point of a standard normal distribution.
See if you agree with the following improvement over the Bonferroni method. We
first determine how many of the 5 z-scores exceed 0. After all, we would not be
interested in treatments that were not even better than control. Suppose that 2 of the 5 z-
scores are positive. Now divide 0.05 by 2 instead of 5, and use a two-tailed test at
level 0.025. That is, we reject the null hypothesis if |Zi| > 2.241. The argument is as
follows. The distribution of each positive z-score is that of |Zi| given that Zi > 0. Under
the null hypothesis, before conditioning on Zi > 0, Zi is standard normal and |Zi| has the
distribution of the absolute value of a standard normal deviate. Conditioning on Zi > 0
does not change the distribution of |Zi| because the distribution of Zi is symmetric about
0. Using critical value 2.241 for the positive z-scores means that the two-tailed type I
error rate for each comparison is 0.025. By the Bonferroni inequality, the probability of
falsely rejecting at least one of the two null hypotheses is at most 2(0.05/2) = 0.05. It
does not matter whether the z-statistics are independent or dependent because the
Bonferroni inequality does not require independence.
Interestingly, simulation results seem to confirm that the type 1 error rate is nearly
controlled at level 0.05. If the total number of arms exceeds 3, the type I error rate is
controlled at level 0.05 or less. With three arms, the type I error rate is slightly elevated
(see Proschan and Dodd, 2014). However, if the above argument were really valid, then
the conditional type 1 error rate given the number of positive z-scores should also be
controlled. But the conditional type 1 error rate can be greatly inflated. Therefore, the
argument given above must be incorrect.
Did you detect the flaw in the above reasoning? We conditioned on Zi > 0 one at a time,
as if the only relevant information about |Zi| is that Zi > 0. But other Zjs also are
informative about |Zi|. For example, if several z-scores are positive and the global null
hypothesis is true, the control sample mean was probably unusually small. Therefore,
information from other Zjs is informative about the control sample mean, which is
informative about all of the |Zj|s. To control the type 1 error rate, we must account for
the screening out of Zi < 0, which is informative about all comparisons.
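The behavior of this flawed procedure can be explored by simulation (the seed, replication count, and variable names below are ours). Because all comparisons share the control mean, the z-scores have pairwise correlation 1/2; the procedure divides 0.05 by the number k of positive z-scores:

```python
import random
from statistics import NormalDist

random.seed(2024)
norm = NormalDist()
K, REPS = 5, 100_000
# Two-tailed critical value when 0.05 is divided by k positive z-scores.
crit = {k: norm.inv_cdf(1 - 0.05 / (2 * k)) for k in range(1, K + 1)}

counts = [0] * (K + 1)    # how often each number k of positive z-scores occurs
rejects = [0] * (K + 1)   # trials (by k) with at least one false rejection
for _ in range(REPS):
    u0 = random.gauss(0.0, 1.0)    # control arm's standardized mean
    z = [(random.gauss(0.0, 1.0) - u0) / 2 ** 0.5 for _ in range(K)]
    pos = [zi for zi in z if zi > 0]
    k = len(pos)
    counts[k] += 1
    if k > 0 and max(pos) > crit[k]:
        rejects[k] += 1

overall = sum(rejects) / REPS
conditional = [rejects[k] / counts[k] if counts[k] else 0.0
               for k in range(K + 1)]
```

Comparing each entry of `conditional` with 0.05 shows the phenomenon described above: the unconditional rate is nearly controlled, but the rate conditional on the number of positive z-scores need not be.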
One simple way to test whether there is large between-person variability in response to
reducing salt is as follows. Under the hypothesis that the between-person blood
pressure variability in response to sodium reduction is large, someone with a big
response on the first occasion would be expected to have a big response on the second
because that person is likely to be a hyper-responder. Under the null hypothesis,
someone's response on the second occasion would be independent of his or her
response on the first occasion. One could define a threshold such as 5 mmHg or 10
mmHg; people whose blood pressures decrease by more than that amount would be
classified as hyper-responders. We could then see if the number of people declared
hyper-responders is greater than what would be expected if responses on the two
occasions are independent. But the threshold for being a hyper-responder is arbitrary. It
seems preferable to use the median responses as cutpoints. Thus, we compute the
median response M1 over all participants on occasion 1, and the median response M2
on occasion 2. The probability of exceeding Mi on occasion i is 1/2. Under the null
hypothesis, the indicators Ii of exceeding Mi on occasion i, i = 1, 2, are independent.
The probability that I1 = 1 and I2 = 1 is (1/2)2 = 1/4. It seems that one could test the
independence assumption by referring the number of people with I1 = 1, I2 = 1 to a
binomial distribution with parameters n and 1/4, where n is the number of participants.
Table 11.1
Numbers of people with blood pressure response categorized by exceeding the median
or not exceeding the median on two separate occasions in a hypothetical group of 1000
participants.
              > Median 2   ≤ Median 2   Total
> Median 1                               500
≤ Median 1                               500
Total             500          500      1000
We now make the above argument rigorous. The first step is to show that properly
standardized versions of Xn − Yn and Xn + Yn converge in distribution to independent
standard normals. Let p ∈ (0,1) be the common Bernoulli parameter under the null
hypothesis. Let Un1 = (Xn − np)/{np(1 − p)}^1/2 and Un2 = (Yn − np)/{np(1 − p)}^1/2. Then
(Un1, Un2) →D (U1, U2), where (U1, U2) are iid standard normals. Let Zn1 = (Un1 − Un2)/√2
and Zn2 = (Un1 + Un2)/√2. By the Mann-Wald theorem, (Zn1, Zn2) →D (Z1, Z2), where
Z1 = (U1 − U2)/√2 and Z2 = (U1 + U2)/√2. But (Z1, Z2) are iid standard normals because
they are bivariate normal with correlation 0. Note also that Zn1 is the standardized
difference, (Xn − Yn)/{2np(1 − p)}^1/2, between Xn and Yn, and similarly for Zn2.
Therefore, (Zn1, Zn2) →D (Z1, Z2), with Z1 and Z2 independent standard normals.
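Because Xn and Yn are iid, Cov(Xn − Yn, Xn + Yn) = Var(Xn) − Var(Yn) = 0 exactly, and joint asymptotic normality upgrades this zero correlation to asymptotic independence. A quick simulation check (sample sizes and seed are ours):

```python
import random

random.seed(7)
n, p, reps = 100, 0.3, 10_000

diffs, sums = [], []
for _ in range(reps):
    x = sum(random.random() < p for _ in range(n))   # X_n ~ binomial(n, p)
    y = sum(random.random() < p for _ in range(n))   # Y_n ~ binomial(n, p)
    diffs.append(x - y)
    sums.append(x + y)

def corr(a, b):
    # Sample correlation coefficient.
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

r = corr(diffs, sums)  # should be near 0
```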
We would like to conclude that because (Zn1, Zn2) converges to (Z1, Z2) and Z1 and Z2
are independent, the conditional distribution of Zn1 given Zn2 converges to the
conditional distribution of Z1 given Z2, which is standard normal. But recall from
Section 10.7.3 that weak convergence of conditional distributions does not follow
automatically from weak convergence of joint distributions, so we need to prove that
P(Zn1 ≤ z1 | Zn2 = zn2) → Φ(z1). We first present a slightly flawed argument that the reader should
attempt to debunk. We will then repair the flaw.
Now take the liminf of both sides as n → ∞, and note that by the independence of Z1 and
Z2, the liminf of the right side is Φ(z1). Therefore, liminf P(Zn1 ≤ z1 | Zn2 = z2) ≥ Φ(z1).
A similar argument shows that limsup P(Zn1 ≤ z1 | Zn2 = z2) ≤ Φ(z1). It follows that
P(Zn1 ≤ z1 | Zn2 = z2) → Φ(z1).
Did you find the flaw? We cannot fix z2 and condition on Zn2 = z2 because the support of
the distribution of Zn2 changes with n.
A similar argument establishes the corresponding limsup bound (exercise). This mends the hole in the
above argument and completes the proof that P(Zn1 ≤ z1 | Zn2 = zn2) → Φ(z1) whenever zn2 is in the
support of the distribution of Zn2 and zn2 → z2.
11.5.6 Conclusion: Asymptotics of the Hypergeometric Distribution
We can recast what we have just proven as follows. Note that conditioning on Zn2 = zn2
is the same as conditioning on Sn = sn, where Sn = Xn + Yn and
(sn − 2np)/{2np(1 − p)}^1/2 = zn2. The conditional distribution of Zn1 given that Zn2 = zn2 is
therefore determined by the conditional distribution of Xn given that Xn + Yn = sn. The latter conditional
distribution is hypergeometric with parameters (n, n, sn). Therefore, we have proven that
if Xn is hypergeometric (n, n, sn), where (sn − 2np)/{2np(1 − p)}^1/2 → z2, then

{Xn − sn/2}/{sn(1/2)(1/2)(1 − sn/(2n))}^1/2 →D N(0, 1).

It turns out that this latter condition can be relaxed to sn/(2n) converging to a number in (0,1)
as n → ∞. This shows that for the data of Table 11.1, the null distribution of the number
of hyper-responders is approximately normal with mean s1000/2 = 250 and variance
500(1/2)(1/2)/2 = 62.5, as mentioned earlier.
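The quoted mean and variance can be checked against the exact hypergeometric distribution (parameter names below are ours). The exact variance is 500(1/4)(500/999) ≈ 62.56; the value 62.5 is the asymptotic version, which uses 2n in place of 2n − 1:

```python
from math import comb

def hypergeom_pmf(x, m, n, s):
    # P(X = x) when X is hypergeometric with parameters (m, n, s).
    return comb(m, x) * comb(n, s - x) / comb(m + n, s)

m = n = 500   # 500 above and 500 below the occasion-1 median
s = 500       # 500 above the occasion-2 median
support = range(max(0, s - n), min(m, s) + 1)
mean = sum(x * hypergeom_pmf(x, m, n, s) for x in support)
var = sum((x - mean) ** 2 * hypergeom_pmf(x, m, n, s) for x in support)
```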
Proposition 11.1.
Suppose that, as N → ∞, mN, nN, and sN tend to ∞ in such a way that sN/(mN + nN) → π ∈ (0,1)
and mN/(mN + nN) → λ ∈ (0,1). Then, if XN is hypergeometric with parameters (mN, nN, sN),

{XN − sNλ}/{sNλ(1 − λ)(1 − sN/(mN + nN))}^1/2 →D N(0, 1).

The result we deduced corresponds to mN = nN, and our proof is under the additional
condition that (sn − 2np)/{2np(1 − p)}^1/2 → z2. Nonetheless, the essential argument of our
proof, that Xn − Yn is asymptotically independent of Xn + Yn, is quite simple. It is an
exercise to modify our reasoning above to deduce Proposition 11.1 under the additional
condition that mN = nN.
Exercises
Because permutation tests condition on the outcome data, they can be very attractive in
clinical trials with unplanned changes. An important principle in clinical trials is that
the primary outcome variable and analysis methods should be pre-specified in the
protocol (Friedman, Furberg, and DeMets, 2010). Changes are frowned upon because
of concern that they might be driven by trends seen in the data, in which case the type 1
error rate could be inflated. But things sometimes go wrong. For instance, investigators
in a lung trial discovered that their original primary outcome, defined in terms of X-ray
findings, could not be measured. At the time they realized that they had to change the
outcome, they had not yet broken the treatment blind. That is, they did not know the
treatment assignments of patients. Would it be permissible under these circumstances to
change the primary outcome and perform an ordinary permutation test as if that outcome
had been pre-specified as primary?
Here is an argument justifying a permutation test in the above setting. Suppose we look
at data on k potential outcomes, and let Yi be the data for the n patients on outcome i, i =
1, ..., k. After looking at the data, we allow a change of the primary outcome. A
permutation test is still a valid test of the strong null hypothesis that Z is independent of
Y1, ..., Yk. This hypothesis says that treatment has no effect on any of the potential
outcomes, either alone or in concert. Under this strong null hypothesis, the conditional
distribution of Z given Yi = yi, i = 1, ..., k, is its unconditional distribution dictated by
the randomization method. Therefore, even though we changed the primary outcome
after looking at data, the type 1 error rate cannot be inflated if we use a permutation test.
If the above argument is correct, then what goes wrong in the following apparent
counterexample, given in Posch and Proschan (2012), to the proposition that the type I
error rate cannot be inflated? Suppose that three potential outcomes are being
considered: (1) coronary heart disease, (2) cardiovascular disease, and (3) level of
study drug in the blood. Never mind that no one would use the third outcome; it still
illustrates an important point. Even though we look at (Y1,Y2,Y3) and not Z, the third
outcome variable clearly gives us information about Z. Only treated patients will have
nonzero levels of the study drug. Once we observe Y3, we know the full set of treatment
assignments. Once we know the treatment assignments, we can compute the values of the
standardized test statistics (z-scores) for each of the candidate outcome variables and
pick the one with the largest z-score. This will clearly inflate the type 1 error rate if we
use an ordinary permutation test without accounting for the fact that we are selecting the
most statistically significant of the outcomes. The problem is compounded with a
greater number of potential outcomes. Westfall and Young (1993) describe a valid way
to account for picking the smallest p-value, but our question concerns the validity of an
ordinary permutation test treating the selected outcome as if it had been pre-specified.
What went wrong with the argument that the type 1 error rate could not be inflated with
a permutation test on the modified primary outcome? Surprisingly, nothing went wrong!
Remember that the permutation test is testing the strong null hypothesis that Z is
independent of (Y1,Y2,Y3), i.e., that treatment has no effect on any of the outcomes.
When we reject this null hypothesis, we are not making a type 1 error because the strong
null hypothesis is false: treatment does have an effect on outcome 3, level of study drug
in the blood. Therefore, the type I error rate is not inflated.
Do not dismiss the above example because of its extremeness. It illustrates the potential
danger when changing the primary outcome after examining data but before breaking the
treatment blind. We may unwittingly glean information about the treatment labels from
the data examined. For instance, it is tempting, when choosing among several potential
outcome variables, to select the one with less missing data. But how can we be sure that
the amount of missing data does not give us information about the treatment
assignments? It could be that patients in the treatment group are more likely than patients
in the control group to be missing data on a given outcome. If so, we could become at
least partially unblinded. Just as in the above example, this unblinding could inflate the
type I error rate for the selected outcome. It still provides a valid test of the strong null
hypothesis that treatment has no effect on any of the information examined, including the
amount of missing data. But rejecting that null hypothesis is not meaningful if the only
difference between the treatment arms is the amount of missing data. Therefore, a
permutation test may not test the scientifically relevant question.
Exercises
1. Many experiments involving non-human primates are very small because the
animals are quite expensive. Consider an experiment with 3 animals per arm, and
suppose that one of two binary outcomes is being considered. You look at outcome
data blinded to treatment assignment, and you find that 5 out of 6 animals
experienced outcome 1, whereas 3 animals experienced outcome 2. You argue as
follows. With outcome 1, it is impossible to obtain a statistically significant result
at 1-tailed α = 0.05 using Fisher's exact test. With outcome 2, if all 3 events are in
the control arm, the 1-tailed p-value using Fisher's exact test will be 0.05.
Therefore, you select outcome 2 and use Fisher's exact test (which is a permutation
test on binary data).
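The two p-values quoted in this exercise can be verified with the hypergeometric form of Fisher's exact test (the function name is ours):

```python
from math import comb

def fisher_one_tailed_p(a, total_events, per_arm):
    # One-tailed Fisher p-value: conditional on the total number of events,
    # the probability that the control arm has at least `a` of them.
    n = per_arm
    support = range(max(0, total_events - n), min(n, total_events) + 1)
    probs = {x: comb(n, x) * comb(n, total_events - x) / comb(2 * n, total_events)
             for x in support}
    return sum(pr for x, pr in probs.items() if x >= a)

# Outcome 2: 3 events total, all in the control arm of a 3 vs 3 experiment.
p2 = fisher_one_tailed_p(3, 3, 3)     # 1/C(6,3) = 0.05

# Outcome 1: 5 of 6 animals had the event; even the most extreme split
# (3 control events vs 2 treatment events) is not significant.
p1 = fisher_one_tailed_p(3, 5, 3)     # 0.5
```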
Another way to express this condition is that the conditional distribution of Y given A
and Z depends only on A. Thus, in a network of people who have sex with each other or
who have a common partner, the HIV indicators may be positively correlated, but their
joint distribution is not further affected by knowledge of Z.
This relationship is depicted in the top of the path diagram, Figure 11.1. There is an
arrow leading from A to Y, indicating that sexual sharing could affect the distribution of
the number of people infected. In other words, the conditional distribution F(y | A) of Y
given A could depend on A. Also, there is an arrow leading from Z to A, meaning that
treatment might affect the amount of sexual sharing. For instance, participants assigned
to vaccine might be more likely to have sex than participants assigned to placebo. Later,
we assume this does not happen. Notice that the only path connecting Z to Y goes
through A, meaning that the only effect of Z on Y is through the effect of Z on A. Once
we condition on A, Z and Y are independent. This is reflected by hypothesis (11.12).
Pictorially, we can envision conditioning on A as blocking the path from Z to Y; there is
no longer a way to get from Z to Y or vice versa. We will expand on path diagrams and
their role in deducing conditional independence relationships in Section 11.9.
Figure 11.1
Path diagrams illustrating two different sets of assumptions about Z (treatment
indicators), A (sexual sharing indicators) and Y (outcome). The top diagram is under
the sole assumption that Z and Y are conditionally independent given A. The bottom
diagram is under the additional assumption that Z is independent of A.
The conditional independence of Z and Y given A means that we can construct a valid
stratified permutation test. For example, suppose participants 1,5,14,21, and 100 had
sex with each other or with a common partner, whereas the other participants did not
have sex with other trial participants or with a common partner. If 3 of participants
(1, 5, 14, 21, 100) received the vaccine and simple randomization was used, we would
treat each of the C(5, 3) = 10 sets of 3 participants as equally likely to have received vaccine.
For the complementary set of n − 5 participants who did not have sex with each other or
with a common partner, we would likewise condition on the number m receiving
vaccine, and treat each of the C(n − 5, m) sets as equally likely. A major problem with this plan is
that there is only very limited information on sexual sharing among trial participants.
Only when participants contracted HIV with genetically similar viruses was sexual
sharing discovered. Without more information, a stratified permutation test is not
possible.
Given that a stratified permutation test is not feasible, we seek additional conditions
under which an ordinary (unstratified) permutation test is valid. Consider the following
assumption.
This also seems reasonable unless the vaccine somehow makes participants more
sexually attractive or empowered. This is an unlikely scenario in such a blinded trial. If
the trial were not blinded, people in the vaccine arm might feel like they are protected,
which could cause them to engage in riskier behavior. This is called a dis-inhibition
effect. But in a blinded trial, participants do not know whether they are receiving the
vaccine or placebo. This should equalize the dis-inhibition in the two arms and render
the assumption of independence of Z and A reasonable. Under assumption (11.13), there
is no arrow connecting Z to either A or Y (bottom of Figure 11.1). Because of this, Z
and (A,Y) are independent. This can be proven as follows:

P(Z = z, A = a, Y = y) = P(Y = y | A = a, Z = z) P(A = a | Z = z) P(Z = z)
                       = P(Y = y | A = a) P(A = a) P(Z = z)     (by (11.12) and (11.13))
                       = P(A = a, Y = y) P(Z = z).
We have shown that the conditional distribution of Z given (A, Y) is the unconditional
distribution of Z. That is, Z and (A, Y) are independent. This, of course, implies that Z
and Y are independent. It follows that a permutation test is valid under assumptions
(11.12) and (11.13).
The following argument tells us when conclusion (11.15) holds. Imagine that the
probability space is Ω1 × Ω2, where Ω1 is the space from which we draw the infinite
string y1, y2, ... and Ω2 is the space from which we draw the treatment indicators. For
each n, we draw Z1,2n, ..., Z2n,2n with Σ Zi,2n = n. By construction, (Z1,2n, ..., Z2n,2n),
n = 1, 2, ... is independent of Y1, Y2,.... For a given set y1,..., y2n, the same
hypergeometric distribution obtains for Σ Zi,2n yi whether the Yi were originally
independent or dependent. Also, that hypergeometric distribution depends on y1,..., y2n
only through the sample proportion of ones, ȳ2n = (2n)^−1 Σ yi. We saw in Section 11.5 that if
this quantity converges to a number, then the hypergeometric random variable Σ Zi,2n Yi
is asymptotically normal with parameters prescribed by Proposition 11.1 and
conclusion (11.15) holds.
Proposition 11.2.
Let Y1, Y2,... be (possibly correlated) Bernoulli random variables such that
Ȳ2n = (2n)^−1 Σ Yi converges almost surely to some random variable T(ω), 0 < T(ω) < 1. Then (11.15) holds
and the permutation test applied to the usual z-test of proportions is asymptotically
equivalent to referring the z-score to a standard normal distribution.
Exercises
Our assumption under the null hypothesis is that the conditional distribution of Y given
X1 = x1,..., Xk = xk, Z = z depends on x1,..., xk, but not additionally on the treatment
indicator vector z. That is, Y and Z are conditionally independent given X1,...,Xk.
Moreover, X1,..., Xk are measured at baseline, so treatment can have no effect on them.
Therefore, Z is independent of (X1,..., Xk). The situation is probabilistically identical
to the example in Section 11.7, where A represented information on sexual sharing and
Yi was the indicator that patient i had HIV. In the current example, take A to be the
matrix whose columns are X1,..., Xk. In both examples, Y and Z are conditionally
independent given A, and Z and A are independent. We showed in Section 11.6 that
these two facts imply that Z is independent of (A, Y). Moreover, the selected covariate is a Borel
function of (A, Y). Therefore, Z is independent of the selected covariate and Y. It follows that a permutation
test is valid for this adaptive regression. The key was that the covariate selection was
blinded to treatment assignment. Had we instead broken the blind and picked the
covariate that minimizes the p-value for treatment, an ordinary permutation test that did
not account for picking the smallest p-value would not have controlled the type 1 error
rate.
where M̂ is the sample median of Y1,..., Yn. It is an exercise to prove that β̂ converges
almost surely to 2E{Y I(Y > 0)} = (2/π)^1/2. Keep in mind that the covariates were
generated independently of Y, so the true βi for the univariate regression of Y on Xi is 0
for i = 1,..., k. However, we just showed that when we select the covariate with largest
sample correlation with Y, the estimated β for the regression of Y on that covariate will
be close to (2/π)^1/2 instead of 0. The resulting residual variance can be shown to be
approximately 1 − 2/π.
We now recap. Had we used a simple t-test on Y, the residual variance would have
been close to 1 because the Yi are iid standard normals. Through generating sham
covariates and adjusting for the one most correlated with Y, we reduced the residual
variance to 1 − 2/π. This suggests we are getting some benefit from generating sham
covariates! Fortunately, this impression of benefit is an illusion; it can be shown that
under the alternative hypothesis, the adjusted treatment effect estimate is attenuated
when we select the most correlated of artificially generated covariates. This attenuation
creates bias that more than offsets any gain in precision.
The preceding paragraph shows that an ordinary t-test does not work. Given the close
connection between permutation and t-tests, this suggests, but does not prove, that a
permutation test is invalid when we pick the balanced binary covariate most correlated
with Y. Let us return to the argument in Section 11.8.1 justifying the validity of a
permutation test when we did not restrict ourselves to balanced covariates. We argued
Y is conditionally independent of Z given A (recall that A is the matrix whose columns
comprise the set of all covariates under consideration). This is still correct under the
null hypothesis. What is not correct is the assumption that Z and A are independent. To
see this, let Y = (Y1, Y2, Y3, Y4). Suppose we observe the following two balanced
covariates.
It is an exercise to show that if these two columns represent balanced covariates, then
the only possible randomizations are (C,T,T,C) or (T,C,C,T). Therefore, Z and A cannot
be independent. Thus, one of the key assumptions that we used to validate a permutation
test in Sections 11.8 and 11.8.1, namely Assumption 11.13, is not satisfied.
Exercises
Medical advances often begin by noticing the relationship between a risk factor and a
clinical outcome. For example, we notice that people with high blood pressure have a
higher risk of stroke. We then develop and test a medicine to see if it reduces blood
pressure. After showing that it does, we conduct a much larger randomized trial to see if
the stroke probability is lower in the treatment group than in the placebo group. When
we see that it is, we theorize that the treatment is reducing strokes through its effect on
blood pressure. When the effect of treatment on the clinical outcome of interest is
through its effect on an intermediate outcome like blood pressure, we call the
intermediate outcome a surrogate outcome (Prentice, 1989).
We can depict this relationship graphically through the path diagram of Figure 11.2. The
nodes of the diagram are the dots representing the variables T (treatment indicator), X
(blood pressure), and Y (stroke indicator). The graph is also called a directed acyclic
graph (DAG). It is directed because the arrows lead from one node to another. Acyclic
means that there is no directed path from one node back to itself. There is an arrow
leading from T to X, indicating that the treatment has an effect on blood pressure. The
arrow leading from X to Y means that blood pressure affects stroke risk. Therefore,
treatment has an effect on stroke through its effect on blood pressure. There is no arrow
leading from T directly to Y.
Figure 11.2
Path diagram indicating that the full effect of treatment assignment T on outcome Y is
through its effect on the surrogate X.
This is because we believe that the entire effect of the treatment on reducing stroke risk
is through its effect on blood pressure. If this is correct, then two things should happen.
First, there should be a relationship between T and Y. That is, the probability of stroke
should be lower given that T = 1 than given that T = 0. Second, the probability of stroke
given T and X should depend only on X. That is, T and Y should be conditionally
independent given X (Prentice, 1989). In other words, X is a surrogate outcome for Y.
For the remainder of this section, assume that the random variables have a density with
respect to Lebesgue measure or counting measure. We continue to think informally,
deferring a rigorous presentation until later. We move away from causal inference and
think of the path diagram as simply depicting a way to factor the joint density function
f(t, x, y) of random variables (T, X, Y). We can always factor a density function f(t, x, y) into
f(t, x, y) = f1(t) f2(x | t) f3(y | t, x).
But we get a bit of simplification in the above example involving stroke and blood pressure because the conditional density function f3(y | t, x) does not depend on t. Therefore, the joint density factors as follows in this example.
f(t, x, y) = f1(t) f2(x | t) f3(y | x).
We can traverse the arrows from T to X and from X to Y. The fact that there is a path
connecting T and Y means that the conditional distribution of Y given T may depend on
T. That is, T and Y may be correlated. If they are correlated, then of course the
conditional distribution of T given Y also depends on Y. This suggests that we can
travel against the arrows as well; starting at Y, we may go against the arrow to X, and
then against the arrow to T. Even though T and Y are connected, they become
independent once we condition on X. Think of conditioning on X as applying a
roadblock at the node at X. This roadblock prevents us from traveling from T to Y or
vice versa. The nodes T and Y are now disconnected, meaning that they are
conditionally independent given X. This is what we noticed earlier in the stroke and
blood pressure example, that f3(y | t, x) = f3(y | x) for any t, x, y.
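A small simulation makes the blocking property concrete. The numbers below are hypothetical, chosen only for illustration: T affects X, and Y depends on X alone, mirroring the DAG T → X → Y. Marginally T and Y are associated, but the T coefficient vanishes once we adjust for X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
t = rng.integers(0, 2, n).astype(float)  # treatment indicator T
x = -1.0 * t + rng.standard_normal(n)    # surrogate X: treatment lowers X
y = 2.0 * x + rng.standard_normal(n)     # outcome Y depends on X only

# The path T -> X -> Y is open: T and Y are marginally associated.
print(np.corrcoef(t, y)[0, 1])           # noticeably negative

# Conditioning on X blocks the path: the T coefficient in OLS of Y on (1, T, X) is ~0.
A = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef[1])                           # near 0
```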
Although we pointed out that we can travel either in the direction of the arrow or against
the arrow, there is an exception. Suppose that X and Y both affect an outcome Z. For
instance, suppose that Z = X + Y. Suppose further that X and Y are independent. We
depict this relationship with arrows from X to Z and from Y to Z (Figure 11.3). Vertex Z
is called a collider or inverted fork (had the direction of arrows been reversed, there
would have been a fork: Z could lead to X or to Y). If we allowed travel from X to Z
and then to Y, it would imply that X and Y might be dependent. But we stated that X and
Y are independent. Thus, we disallow travel through the inverted fork unless we
condition on Z. Here, Z = X + Y, so conditioning on Z = z means X + Y = z.
Equivalently, Y = z − X. This induces a correlation between X and Y. That is, even
though X and Y are independent, they are conditionally dependent given Z. Therefore,
the rule for inverted forks is reversed: the path is blocked (indicating independence)
unless we condition on the node at the inverted fork, in which case the path becomes
open (indicating possible dependence).
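This collider behavior is easy to see numerically. In the sketch below, X and Y are independent, but within a thin slice of Z = X + Y they are almost perfectly negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z = x + y                                # collider: X -> Z <- Y

print(np.corrcoef(x, y)[0, 1])           # ~0: X and Y are independent

sl = np.abs(z - 1.0) < 0.05              # condition (approximately) on Z = 1
print(np.corrcoef(x[sl], y[sl])[0, 1])   # near -1: conditioning opens the path
```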
Figure 11.3
Vertex Z is an inverted fork, also known as a collider.
Each DAG for a set of random variables corresponds to a factoring of the joint density
function of the variables. For each node (a random variable or random vector), we
incorporate a term for the conditional density function of that variable given the
variables pointing to it. For Figure 11.4, this corresponds to
f(u, v, w, x, y) = f1(u) f2(v) f3(w | u, v) f4(x | u, w) f5(y | w).
Figure 11.4
There are no arrows pointing to U and V, so they are called root nodes. We can factor
the density function of the subset (U,V) as the product f1(u)f2(v) of marginal densities.
Therefore, U and V are independent. Any two root nodes in a DAG are independent.
The only arrows pointing to X are from U and W. This means that the conditional
distribution of X given U, V and W depends only on U and W. That is, V and X are
conditionally independent given (U,W). Because there is still another path from U to X
that does not go through W, we cannot conclude that X is independent of U, given W.
The only arrow pointing to Y is from W. If we block that path by conditioning on W,
then nodes (U, V) are separated from Y. Therefore, the set (U, V) are conditionally
independent of Y, given W.
In the figure, nodes W and X are called descendants of node U because there are directed paths from U to W and X. Likewise, we call U an ancestor of W and X. When we condition on a descendant, that gives information about its ancestor. Therefore,
suppose we condition on X in the figure. That gives information about W, so we must
treat W as also given. But W is an inverted fork, so that opens up the path from U to W
to V. Therefore, we can no longer be assured that U and V are independent once we
condition on X. They are unconditionally independent because they are root nodes, but
they are not necessarily conditionally independent given X.
We are now in a position to formalize the rules and key result about DAGs.
Notation 11.6.
Suppose you are trying to decide whether the following proposition is true:
If X and Y are conditionally independent given (U, V), then X and Y are conditionally independent given U alone. (11.19)
Figure 11.5 shows a path diagram illustrating a situation in which both paths from X to Y are blocked by U and V, so X and Y are conditionally independent given (U, V). However, it is not true that X and Y are conditionally independent given U, because conditioning on U blocks only one path from X to Y; the path X to V to Y remains open. The same reasoning shows that it is not necessarily true that X and Y are conditionally independent given V. The fact that we are able to find a graphical counterexample shows, in conjunction with Theorem 11.5, that there exist random variables (U, V, X, Y) such that (11.19) is false. In fact, the following is a counterexample. Let U and V be iid standard normals, and let X = U + V, Y = U − V. Conditioned on (U, V), X and Y are constants, hence independent. But conditioned on U alone, X and Y have correlation −1, and conditioned on V alone, X and Y have correlation +1.
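The counterexample can be confirmed numerically by conditioning on thin slices of U and V:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000_000
u = rng.standard_normal(n)
v = rng.standard_normal(n)
x = u + v
y = u - v

su = np.abs(u) < 0.01                    # condition (approximately) on U = 0
sv = np.abs(v) < 0.01                    # condition (approximately) on V = 0
print(np.corrcoef(x[su], y[su])[0, 1])   # near -1
print(np.corrcoef(x[sv], y[sv])[0, 1])   # near +1
```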
Figure 11.5
Path diagram.
Figure 11.6 shows a path diagram in which X and Y are conditionally independent
given (U,V) and conditionally independent given U because conditioning on U blocks
both paths from X to Y. However, X and Y are not conditionally independent given V
because the path X to U to Y remains open. A concrete example takes X = U and Y = U
+ V. Given (U, V), X and Y are constants, and hence conditionally independent. Also,
given U, X is constant and therefore conditionally independent of Y. On the other hand,
given V, X and Y have correlation 1.
Figure 11.6
Path diagram.
Exercises
2. Decide which of the following are true and prove the true ones.
(a) .
(b) .
(c) and .
(d) and .
3. In the path diagram of Figure 11.8, determine which of the following are true and
factor the joint density function for (X, U, V, W, Y, Z).
(a) U and V are independent.
(b) X and Y are independent.
(c) X and Y are conditionally independent given Z.
(d) X and Y are conditionally independent given (U, V).
Figure 11.7
Path diagram.
Figure 11.8
Path diagram.
Figure 11.9
The Bland-Altman plot of errors against means in the top panel shows a funnel
indicating larger variability of errors for larger values. In such cases, the relative error
may be constant (bottom panel).
To see if (11.21) has 95% coverage, simulate 500 observations from a normal distribution with mean μ = 1 and σ = 1, and calculate the confidence interval (11.21). Repeat this process 100,000 times to see that approximately 89% of the intervals contain the true effect size, E = μ/σ = 1/1 = 1. Thus, the interval does not have the correct coverage probability. Increasing the mean makes the coverage even worse; with μ = 3 and σ = 1, only about 60% of the intervals cover the true effect size 3. On the other hand, the coverage probability is 95% if μ = 0, and close to 95% if μ is very close to 0.
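Interval (11.21) itself is not reproduced here, so the sketch below assumes the naive form Ȳ/s ± 1.96/√n for the effect size E = μ/σ; that specific form is an assumption, but it is the interval one would write down ignoring the variability of s, and it reproduces the reported coverages.

```python
import numpy as np

rng = np.random.default_rng(4)

def coverage(mu, sigma=1.0, n=500, reps=20_000):
    """Empirical coverage of the assumed naive interval ybar/s +/- 1.96/sqrt(n)
    for the effect size E = mu/sigma."""
    y = rng.normal(mu, sigma, size=(reps, n))
    e_hat = y.mean(axis=1) / y.std(axis=1, ddof=1)
    half = 1.96 / np.sqrt(n)
    E = mu / sigma
    return np.mean((e_hat - half <= E) & (E <= e_hat + half))

print(coverage(0.0))   # ~0.95
print(coverage(1.0))   # ~0.89
print(coverage(3.0))   # ~0.60
```

The degradation matches the asymptotic variance 1 + E²/2 derived in the text: the nominal 1.96 cutoff corresponds to an actual standard normal cutoff of only 1.96/√(1 + E²/2).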
where Tn−1 is the usual one-sample t-statistic for testing whether the mean is μ. Asymptotically, this t-statistic is N(0,1). Also, (n − 1)s²/σ² has a chi-squared distribution with n − 1 degrees of freedom, so it may be represented as a sum of the squares of n − 1 iid N(0,1) random variables Zi. The variance of a chi-squared random variable with n − 1 degrees of freedom is 2(n − 1). Therefore, by the CLT,
Also, s converges in probability to σ. Therefore, by Slutsky's theorem, the term to the right of Tn−1 in (11.22) converges in distribution to a normal with mean 0 and variance μ²/(2σ²) = E²/2. But Tn−1 is independent of the term to its right in Expression (11.22) because s² and Ȳ are independent. Therefore, Expression (11.22) converges in distribution to a N(0,1) plus an independent N(0, E²/2); the sum is asymptotically N(0, 1 + E²/2). That is,
Using approximation (11.23), we can verify the simulation result that the coverage probability of (11.21) is only 89% and 60% when μ = 1 and μ = 3, respectively (with σ = 1). Note that we could have reached the correct asymptotic distribution (11.23) using the delta method (exercise).
Denoting the left and right sides by L and U and solving for E, we find that
where
Exercises
2. In the effect size setting, the variance of the asymptotic distribution depended on
the parameter we were trying to estimate, so we made a transformation. Find the
appropriate transformations in the following settings to eliminate the dependence
of the asymptotic distribution on the parameter we are trying to estimate.
(a) Estimation of the probability of an adverse event. We count only the first adverse event, so the observations are iid Bernoulli random variables with parameter p, and we use the sample proportion of events, X/n, where X is binomial (n, p).
(b) Estimation of the mean number of adverse events per person. We count multiple events per person. Assume that the total number of events across all people follows a Poisson distribution with parameter λ, where λ is very large.
3. Suppose that the asymptotic distribution of an estimator is N(θ, f(θ)/n) for some function f. How can we transform the estimator such that the variance of the asymptotic distribution does not depend on the parameter θ we are trying to estimate?
We could also have used the asymptotic distribution of Ymax derived in Example 6.25 concerning the Bonferroni approximation. The indicators I(Yi ≥ yn) are iid Bernoulli with probability pn = P(Y1 ≥ yn). If we choose yn such that n{1 − Φ(yn)} → λ, then by the law of small numbers (Proposition 6.24), the number of Yi at least as large as yn converges in distribution to a Poisson with parameter λ. Therefore,
But now suppose we do not know μ or σ². Substituting the sample analogs Ȳ and s for μ and σ leads to
To use U, we must determine its null distribution. Assume that the sample size is large. The distribution of U does not depend on μ or σ because U has the same value if we replace Yi by (Yi − μ)/σ. Therefore, without loss of generality, we may assume that the Yi are iid N(0,1).
We already showed that n{1 − Φ(Ymax)} converges in distribution to an exponential(1). We must now show that this continues to hold if Ymax is replaced by the studentized version (Ymax − Ȳ)/s. The first step toward that end is to prove a very useful result in its own right.
11.11.2 Inequality and Asymptotics for 1 − Φ(x)
Theorem 11.7.
The standard normal distribution function Φ(x) and density φ(x) satisfy:
from which the left side of the inequality in Equation (11.26) follows.
because .
By Slutsky's theorem,
We have established that when the sample size n is large, an approximately valid test
declares the largest observation an outlier if exceeds .
Equivalently, if the largest sample standardized residual exceeds , we
declare the observation producing that residual an outlier.
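The exponential(1) limit underlying this outlier test is easy to verify by simulation. The sketch below treats the known-parameter case, computing n{1 − Φ(Ymax)} for repeated standard normal samples and checking its mean and median against exponential(1) values.

```python
import math
import numpy as np

rng = np.random.default_rng(5)

def normal_tail(x):
    """1 - Phi(x) computed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

n, reps = 1000, 10_000
stats = np.empty(reps)
for r in range(reps):
    y = rng.standard_normal(n)
    stats[r] = n * normal_tail(y.max())    # n{1 - Phi(Y_max)}

# Exponential(1) has mean 1 and median ln 2:
print(stats.mean())                        # near 1
print(np.mean(stats <= math.log(2.0)))     # near 0.5
```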
Exercises
1. Let x1,..., xn be a sample of data. Fix n and x2,..., xn, and send x1 to ∞. Show that the one-sample t-statistic converges to 1 as x1 → ∞. What does this tell you about the performance of a one-sample t-test in the presence of an outlier?
2. Let Y1,..., Yn be iid N(μ, σ²), and let Ymax, Ȳ, and s² be their sample maximum, mean, and variance, respectively. Suppose that an is a sequence of numbers converging to ∞ such that for some non-degenerate random variable U. Prove that as well.
3. Show that if Y1,..., Yn are iid standard normals and Ymin is the smallest order statistic, then nΦ(Ymin) converges in distribution to an exponential with parameter 1.
4. Prove that if Y1,..., Yn are iid standard normals and Ymin and Ymax are the smallest and largest order statistics,
for (λ1 + λ2)/n < 1. What does this tell you about the asymptotic joint distribution of [nΦ(Ymin), n{1 − Φ(Ymax)}]?
5. Let Y1,..., Yn be iid N(μ, σ²), with μ and σ² known. Declare the smallest order statistic to be an outlier if nΦ{(Y(1) − μ)/σ} ≤ a, and the largest order statistic to be an outlier if n[1 − Φ{(Y(n) − μ)/σ}] ≤ a. Determine a such that the probability of erroneously declaring an outlier is approximately 0.05 when n is large.
The logrank statistic is commonly used to compare two survival distributions (namely one minus the distribution functions) when the treatment-to-control hazard ratio, [fT(t)/{1 − FT(t)}]/[fC(t)/{1 − FC(t)}], is assumed constant. Here, f(t) and F(t) are the density and distribution functions of time to death, and subscripts T and C denote treatment and control. The hazard function f(t)/{1 − F(t)} is the instantaneous mortality rate at time t, conditional on surviving to time t. Consider a clinical trial in which all patients start at the same time and no one is lost to follow-up. In actuality, patients arrive in staggered fashion, and some of them drop out. Our purpose is to show that even in the admittedly oversimplified case we consider, use of elementary arguments to determine the approximate distribution of the logrank statistic is problematic.
Table 11.2 shows that just prior to the ith death time, there are nTi and nCi patients alive
in the treatment and control arms. We say that these patients are at risk of dying. The
random variable Xi is the indicator that the ith death came from the treatment arm.
Under the null hypothesis, the expected value of Xi, given nTi and nCi, is the proportion
of the nTi + nCi at-risk patients who are in the treatment arm, namely, Ei = nTi/(nTi +
nCi). Likewise, the null conditional variance of the Bernoulli random variable Xi, given
nTi and nCi, is Vi = Ei(1 − Ei). The logrank z-statistic and a closely related estimator
are:
Table 11.2

            Dead      At risk
Treatment   Xi        nTi
Control     1 − Xi    nCi
Total       1         ni
where D is the total number of deaths. Readers familiar with survival methods may recognize that the latter is an estimator of the logarithm of the hazard ratio.
The logrank z-statistic and its associated estimator are analogous to the one-sample t-statistic and its associated estimator. In the one-sample t-statistic setting with iid N(μ, σ²) observations Yi, we are interested in testing whether μ = 0. The t-statistic is n^{1/2}Ȳ/s, where Ȳ and s are the sample mean and standard deviation, respectively. It is easy to show, using an approach similar to that in Section 11.2.1, that under a local alternative μn = a/n^{1/2}, the one-sample t-statistic converges in distribution to N(a/σ, 1). This implies that the estimator Ȳ has the following property. When μn = a/n^p with 0 < p < 1/2, the relative error (Ȳ − μn)/μn converges in probability to 0 as n → ∞. Equivalently, Ȳ/μn converges in probability to 1.
Our goal is to show a similar result about the estimator in the logrank setting.
Specifically, we will show that under the local alternative , where ,
We will accomplish this in several steps, the first of which is to derive the distributions
of and .
where θ is the treatment-to-control hazard ratio, which is the same for all t. Letting the time increment tend to 0, we
conclude that . Under the null hypothesis,
Some authors say that the Xi may be treated as if they were independent Bernoulli random variables with probability parameters Ei (e.g., Schoenfeld, 1980). It is true that, conditioned on D = d, the probability of any given string x1,..., xd of zeros and ones matches that of independent Bernoullis. However, for different strings x1,..., xd, the Bernoulli parameters differ. Thus, the Xi are not independent. Neither does it make sense to say that they are conditionally independent given the set of all (nTi, nCi). Once we condition on the full set of (nTi, nCi), there is no more randomness! For instance, if (nT1, nC1) = (100, 96), then the first death came from the control arm if (nT2, nC2) = (100, 95), and from the treatment arm if (nT2, nC2) = (99, 96).
Even though the Xi are not independent, the following argument shows that the Yi = Xi − Ei are uncorrelated. First note that the Yi have conditional mean 0 given (nTi, nCi), and therefore unconditional mean 0. Thus,
We will show that the first term of Expression (11.34) tends to 0 in probability, whereas
the second term tends to 1 in probability.
To show that the first term of Expression (11.34) is small, we show first that its
denominator is large. Consider
Note that, for 0 ≤ p ≤ 1, the parabola f(p) = p(1 − p) is maximized at p = 1/2 and decreases as |p − 1/2| increases from 0 to 1/2. Therefore, Ei(1 − Ei) decreases as |Ei − 1/2| increases. Suppose that the proportion of people who die tends to a limit that is less than 1/2. Then Ei(1 − Ei) is at least as large as {(n/2 − d + 1)/n}[1 − {(n/2 − d + 1)/n}] = {1/2 − (d − 1)/n}{1/2 + (d − 1)/n}, so
It follows that
Conditioned on D = d, the expression within the absolute value sign on the right of
Expression (11.36) has mean 0 and variance
In the last step we substituted the local alternative value (11.35) for θn.
That is, conditioned on the number of deaths, the first term of Expression (11.34) tends
to 0 in probability. The next step is to uncondition on the number of deaths. By
Proposition 10.10 and the bounded convergence theorem,
The left and right sides both tend to 1 as n . Therefore, the second term of
Expression (11.34) tends to 1 in probability.
The rightmost inequality is the more commonly used of the two, and is referred to as the triangle inequality. It gets its name from the fact that the vectors x and y, together with their resultant, x + y, form the three sides of a triangle. Geometrically, ‖x + y‖ ≤ ‖x‖ + ‖y‖ says that the shortest distance between two points is along a straight line (see Figure A.1). In R1, the triangle inequality says that |x + y| ≤ |x| + |y| for all x, y ∈ R.
Figure A.1
The triangle inequality illustrated in two dimensions. The length of the resultant vector x
+ y cannot be greater than the sum of the lengths of x and y.
The leftmost inequality is called the reverse triangle inequality. It also has a geometric interpretation in terms of a triangle: |‖x‖ − ‖y‖| ≤ ‖x − y‖ says that the length of any triangle side is at least as great as the absolute value of the difference in lengths of the other two sides. In R1, the reverse triangle inequality says that ||x| − |y|| ≤ |x − y| for all x, y ∈ R.
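Both inequalities are easy to spot-check numerically for random vectors (a sketch using the Euclidean norm):

```python
import numpy as np

rng = np.random.default_rng(6)
for _ in range(1000):
    x = rng.standard_normal(3)
    y = rng.standard_normal(3)
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    # triangle inequality: ||x + y|| <= ||x|| + ||y||
    assert np.linalg.norm(x + y) <= nx + ny + 1e-12
    # reverse triangle inequality: | ||x|| - ||y|| | <= ||x - y||
    assert abs(nx - ny) <= np.linalg.norm(x - y) + 1e-12
print("both inequalities held in 1000 random trials")
```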
Thus, if A1 is the set of rational numbers and A2 = [0, 1], then A1 ∩ A2 is the set of rational numbers in the interval [0, 1], whereas A1 ∪ A2 is the set of numbers that are either rational or between 0 and 1. That is, A1 ∪ A2 consists of all real numbers in [0, 1], plus the rational numbers outside [0, 1].
If An = (0, 1 + 1/n) for n = 1, 2, 3,..., then ∩n An consists of all numbers x that are in (0, 1 + 1/n) for every n = 1, 2,... That is, x must be in (0, 2) and (0, 3/2) and (0, 4/3), etc. The only numbers in all of these intervals are in (0, 1]. Therefore, ∩_{n=1}^∞ An = (0, 1]. On the other hand, ∪_{n=1}^∞ An = (0, 2).
Remark A.3.
Throughout this book we will consider subsets of a set called the sample space.
Proposition A.4.
If At ⊂ Ω for t ∈ I and B ⊂ Ω:
1. De Morgan's law: (∩t At)^C = ∪t At^C.
2. De Morgan's law: (∪t At)^C = ∩t At^C.
3. .
4. .
The first two items are known as De Morgan's laws. To prove item 1, we will show that every element of (∩t At)^C is an element of ∪t At^C, and vice versa. Let x ∈ (∩t At)^C. Then x is not in all of the At. In other words, there is at least one t such that x is outside of At, and hence x ∈ At^C for at least one t. By definition, x ∈ ∪t At^C. Therefore, (∩t At)^C ⊂ ∪t At^C.
Now suppose that x ∈ ∪t At^C. Then x is in At^C for at least one t. That is, x lies outside At for at least one t, so x cannot belong to ∩t At. In other words, x ∈ (∩t At)^C. Therefore, ∪t At^C ⊂ (∩t At)^C.
1. b is said to be the greatest lower bound (glb) of A, also called infimum and
denoted inf(A), if b is a lower bound and no number larger than b is a lower bound
of A.
2. B is said to be the least upper bound (lub) of A, also called supremum and denoted sup(A), if B is an upper bound and no number smaller than B is an upper bound of A.
It is easy to see from the definition that there can be no more than one infimum or
supremum of a set. See Figure A.2 for an illustration of lower and upper bounds and the
infimum and supremum.
Figure A.2
A bounded set A has infinitely many lower and upper bounds (hash marks), but only one
inf and one sup.
The infimum of the empty set is +∞ because every number is a lower bound of the empty set. That is, if x is any number, it is vacuously true that x ≤ y for all y ∈ ∅. Similarly, the supremum of the empty set is −∞ because every number is an upper bound of the empty set.
Axiom A.8.
Every set with a lower bound has a greatest lower bound, and every set with an upper
bound has a least upper bound.
Notice that infs or sups may or may not be in the set. In the above example with A = [0, 1), inf(A) = 0 ∈ A, whereas sup(A) = 1 ∉ A. It is easy to see that the infimum and supremum of any finite set are the minimum and maximum, respectively. If a set A has no lower bound, then its infimum is −∞, and if a set has no upper bound, its supremum is +∞. Thus, if A is the set of integers, then inf(A) = −∞ and sup(A) = +∞.
Figure A.3
The two-dimensional open ball B(c, r), whose circular boundary is not included.
The open ball B(c, r) centered at c ∈ Rk with radius r is the set of points x ∈ Rk such that ‖x − c‖ < r. The closed ball replaces < by ≤ in this definition.
A set O in Rk is said to be open if for each point x ∈ O, there is a sufficiently small ε > 0 such that B(x, ε) ⊂ O (see Figure A.4).
Figure A.4
An open ball B(c, r) is an open set. To see this, let x ∈ B(c, r). We will show that there is a sufficiently small number ε such that B(x, ε) ⊂ B(c, r). Because ‖x − c‖ < r, ‖x − c‖ = r − d for some d > 0. Set ε = d. We will prove that B(x, d) ⊂ B(c, r) by proving that if z ∈ B(x, d), then ‖z − c‖ < r (Figure A.5).
Figure A.5
Finding an ε such that B(x, ε) ⊂ B(c, r). If ‖x − c‖ = r − d, set ε = d.
There are many more open sets than just open balls. For example, the union of open
balls, even infinitely many, is also an open set. Also, the empty set is open because it
satisfies the definition vacuously.
An example of a set in R1 that is not open is the half-open interval [a, b). It is not open because there is no open ball B(a, ε), no matter how small ε is, that is entirely contained in [a, b); there will always be points in B(a, ε) that are smaller than a. Likewise, (a, b]
and [a, b] are not open. Another non-open set is a finite set of points such as A = {1, 2,
3}. Any open ball centered at one of these points will necessarily contain points not in
A, so A is not open.
Is the intersection of an infinite number of open sets necessarily open? The intersection of the open sets An = (−1/n, 1 + 1/n), n = 1, 2,..., is the closed interval [0, 1], which is not
open. Thus, the intersection of infinitely many open sets need not be open, although the
intersection of a finite number of open sets is open; see Proposition A.14.
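A quick numeric check of this example: 0 belongs to every An = (−1/n, 1 + 1/n), but points just below 0 eventually fall outside, so no open ball around 0 fits inside the intersection [0, 1].

```python
# Membership test for A_n = (-1/n, 1 + 1/n)
def in_An(x, n):
    return -1.0 / n < x < 1.0 + 1.0 / n

# 0 lies in every A_n, hence in the intersection [0, 1]:
assert all(in_An(0.0, n) for n in (1, 2, 10, 10**6))

# but any x < 0, however close to 0, leaves A_n once n > 1/|x|,
# so the intersection contains no open ball around 0 and is not open:
x = -1e-6
assert in_An(x, 10) and not in_An(x, 10**7)
print("0 is in every A_n; -1e-6 is not")
```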
Let B be a subset of Rk. Any point in Rk is either in the interior, exterior, or boundary of
B (Figure A.6), where these terms are defined as follows.
Figure A.6
For example, if B = (a, b), then each point in B is in the interior of B, a and b are boundary points of B, and each point in (−∞, a) ∪ (b, ∞) is in the exterior of B. In R2, let B be the open ball B(c, r). Then each (x, y) ∈ B is an interior point of B; each (x, y) with (x − c1)² + (y − c2)² = r² is a boundary point of B; each (x, y) with (x − c1)² + (y − c2)² > r² is in the exterior of B.
Just as we generalized the idea of an open interval to an open set, we can generalize the
idea of a closed interval [a, b]. Similarly to what we did above, we may regard this
interval as the set of points x such that |x c| r, where c = (a + b)/2 and r = (b a)/2.
The analogy in Rk is the closed ball . Notice that the complement of
this closed ball is the set of x such that x c > r, which is an open set. This tells us
how to generalize the notion of closed intervals and closed balls.
Any closed ball is closed. Moreover, any set A of Rk that contains all of its boundary
points is closed. This is because each point in AC is in the exterior of A, and therefore
can be encased in an open ball in AC.
A set can be neither open nor closed. For example, we have seen that the half-open interval [a, b) is not open, but neither is it closed because the complement of [a, b) is (−∞, a) ∪ [b, ∞), which is not open. Note that [a, b) contains only one of its two boundary points. More generally, a set that contains some, but not all, of its boundary points is neither open nor closed. Another example in R is the set Q of rational numbers. All real numbers are boundary points of Q, so Q contains some, but not all, of its boundary points. It is neither open nor closed. An example in R2 is A = {(x, y) : x² + y² ≤ 1, x ∈ [0, 1]} ∪ {(x, y) : x² + y² < 1, x ∈ [−1, 0)}. Note that A is the two-dimensional region bounded by the circle x² + y² = 1 with center (0, 0) and radius 1, and containing its right-side, but not its left-side, boundary. The set A is neither open nor closed.
There are precisely two sets in Rk that are both open and closed, namely the empty set
and Rk itself.
Notice that the union of an infinite collection of closed sets need not be closed. For instance, ∪_{n=1}^∞ [1/n, 1] = (0, 1].
A set A R1 is open if and only if it is the union of a countable (see Section 2.1 for
definition of countable) collection of disjoint open intervals.
The basic idea behind the proof of this result is illustrated in Figure A.7 for R2. We
encase A in a square and pick a point x1 A. Now divide the square into four equal
cells. At least one of the cells must contain infinitely many points of A. From a cell with
infinitely many points, pick another point x2 A. Now divide that cell into four equal
cells. One of these must contain infinitely many points of A, so pick a point x3 A, etc.
Continuing this process indefinitely, we find that the intersection of all of the cells is nonempty and contains a cluster point x.
Figure A.7
To obtain a better generalization of a closed interval, note first that every subset of Rk
has an open covering, meaning that it is contained within a union of open sets. For
instance, if A is a subset of Rk, the collection of open balls B(x, 1), x ∈ A, covers A; i.e., A ⊂ ∪_{x∈A} B(x, 1). But for every such
covering, can we find a finite subcovering? That is, can A be covered by a union of a
finite number of the Bs? To see that the answer is no for some A, note that A = (a, b) is covered by the sets Bn = (a + 1/n, b − 1/n) because (a, b) = ∪_{n=1}^∞ Bn. On the other hand, no finite subcollection of the Bn covers (a, b). But it can be shown that the closed interval
[a, b] does have the property that every open covering has a finite subcovering. This
provides us with the generalization we need of a closed interval.
Compact sets are important because some key results about functions on closed
intervals also hold more generally for functions on compact sets.
A.4 Sequences in Rk
Limits are crucial to the study of probability theory, so it is essential to have a firm
grasp of this topic. Loosely speaking, a sequence (xn) of k-dimensional vectors has limit
x Rk if the xn are sufficiently close to x for all sufficiently large n. The following
definition makes this more precise.
For instance, suppose we want to prove that (1/n) sin(nπ/2) → 0 using the definition of a limit. Given ε > 0, we must specify an N such that |(1/n) sin(nπ/2) − 0| < ε whenever n ≥ N. But |(1/n) sin(nπ/2)| ≤ 1/n because |sin(x)| ≤ 1 for any x. For the given ε, choose N large enough that 1/N < ε. Then whenever n ≥ N, |(1/n) sin(nπ/2) − 0| ≤ 1/n ≤ 1/N < ε. By Definition A.20, (1/n) sin(nπ/2) → 0.
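The ε–N bookkeeping in this argument can be mirrored directly in code: pick ε, take any N with 1/N < ε, and confirm that every later term is within ε of 0.

```python
import math

def x(n):
    # the sequence x_n = (1/n) sin(n*pi/2)
    return (1.0 / n) * math.sin(n * math.pi / 2.0)

eps = 1e-3
N = math.ceil(1.0 / eps) + 1   # any N with 1/N < eps works
assert 1.0 / N < eps
assert all(abs(x(n) - 0.0) < eps for n in range(N, N + 100_000))
print(f"|x_n| < {eps} for all n checked beyond N = {N}")
```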
Proof. Note that for any ε > 0, there exist N1 and N2 such that ‖xn − x‖ < ε/2 for n ≥ N1 and ‖xn − x′‖ < ε/2 for n ≥ N2. Thus, if N = max(N1, N2), then ‖x − x′‖ ≤ ‖x − xN‖ + ‖xN − x′‖ < ε. Because ε is arbitrary, x = x′.
xn → x if and only if for each number ε > 0, there are only finitely many n such that ‖xn − x‖ ≥ ε.
To see that this is an equivalent definition, suppose first that xn → x by Definition A.20. Then for given ε > 0, there is an N such that ‖xn − x‖ < ε for n ≥ N. That means that the only possible values of n such that ‖xn − x‖ ≥ ε are n = 1,..., N − 1. That is, there are only finitely many n such that ‖xn − x‖ ≥ ε. To prove the reverse implication, suppose for each ε there are only finitely many n such that ‖xn − x‖ ≥ ε. Then M = max{n : ‖xn − x‖ ≥ ε} is finite. Set N = M + 1. Then for the given ε and n ≥ N, ‖xn − x‖ < ε. Therefore xn → x by Definition A.20.
Proposition A.23.
In either Definition A.20 or Proposition A.22, we can replace ε by 1/k and require the stated condition to hold for each natural number k.
Proposition A.23 holds because for fixed ε > 0, there is a natural number k such that 1/k < ε, and for fixed k, there is an ε with ε < 1/k.
It is also helpful to think about the negation of xn → x, which by Proposition A.22 is the negation of the statement that for each number ε, there are only finitely many n such that ‖xn − x‖ ≥ ε. The negation of "for each ε > 0, condition C holds" is that there exists an ε > 0 such that C does not hold. Therefore, xn does not converge to x if and only if there exists an ε > 0 such that ‖xn − x‖ ≥ ε for infinitely many n. Let n1 be the smallest n such that ‖xn − x‖ ≥ ε, n2 be the second smallest such n, etc. Then for the subsequence (xni), ‖xni − x‖ ≥ ε for all i = 1, 2,...
The negation of xn → x is that there exists an ε > 0 and a subsequence (xni) such that ‖xni − x‖ ≥ ε for all i = 1, 2,...
If xn → x, then xn is bounded; i.e., there exists a number B such that ‖xn‖ ≤ B for all n.
1. xn + yn → x + y.
2. xn · yn → x · y, where · denotes the dot product.
3. If k = 1, xn/yn → x/y if y ≠ 0.
We illustrate the use of the triangle inequality by proving the second item. By item 1 and Proposition A.25, it suffices to prove the result for k = 1. Start with |xnyn − xy| = |xnyn − xyn + xyn − xy|. Application of the triangle inequality yields
|xnyn − xy| ≤ |yn||xn − x| + |x||yn − y|.
By Proposition A.26, there is a bound B such that |yn| ≤ B for all n. For ε > 0, we must find an N such that |xnyn − xy| < ε for n ≥ N. Suppose first that x ≠ 0. Because xn → x and yn → y, there exists an N1 such that |xn − x| < ε/(2B) for n ≥ N1, and an N2 such that |yn − y| < ε/(2|x|) for n ≥ N2. Take N = max(N1, N2). Then for n ≥ N, |xnyn − xy| < B·ε/(2B) + |x|·ε/(2|x|) = ε. If x = 0, the second term vanishes and we can take N = N1. This shows that |xnyn − xy| < ε for n ≥ N.
Readers unfamiliar with proving things like Proposition A.27 are encouraged to prove
the remaining parts.
In the examples above, although the sequences oscillate and do not converge, we can
find subsequences that converge. For instance, along the subsequence n1 = 1, n2 = 5, n3
= 9, ..., xn = sin(nπ/2) is (1, 1, ..., 1, ...). Therefore, xni → 1. Similarly, along the
subsequence n1 = 3, n2 = 7, n3 = 11, ..., xni → -1 as i → ∞, while for n1 = 2, n2 = 4, n3 =
6, ..., xni → 0 as i → ∞.
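This subsequence behavior is easy to verify numerically. Below is a minimal Python sketch (variable names are ours) that evaluates sin(nπ/2) along the three index progressions from the example:

```python
import math

# x_n = sin(n*pi/2) oscillates among 1, 0, -1 and has no limit,
# but subsequences along arithmetic progressions converge.
x = [math.sin(n * math.pi / 2) for n in range(1, 101)]  # x[0] holds x_1

sub_1 = [x[n - 1] for n in range(1, 101, 4)]  # n = 1, 5, 9, ...:  all 1
sub_3 = [x[n - 1] for n in range(3, 101, 4)]  # n = 3, 7, 11, ...: all -1
sub_2 = [x[n - 1] for n in range(2, 101, 2)]  # n = 2, 4, 6, ...:  all 0

# each subsequence is constant up to floating-point error
print(max(abs(v - 1) for v in sub_1))
print(max(abs(v + 1) for v in sub_3))
print(max(abs(v) for v in sub_2))
```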
x is a limit point of (xn) if there exists a subsequence n1, n2, ... such that xni → x.
The smallest and largest limit points of a sequence are defined as follows.
Figure A.8
The sequence depicted has 3 limit points. The smallest and largest of these are the
liminf and limsup, respectively.
Some properties of liminfs and limsups that readers unfamiliar with this material should
verify are as follows.
Working with liminfs and limsups means that we no longer must qualify statements like
"let x = lim(xn)" with "if the limit exists." Instead, we can work directly with liminfs and
limsups, which always exist, and then use the following result.
Proposition A.32. Limit exists if and only if the liminf and limsup agree
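For a stored finite stretch of a sequence, tail maxima and minima give a rough numerical stand-in for the limsup and liminf. A small Python sketch (helper names are ours; a finite prefix can only approximate the true tail suprema):

```python
import math

def tail_sup(x, k):
    """Sup of x_k, x_{k+1}, ... within a stored finite prefix."""
    return max(x[k:])

def tail_inf(x, k):
    """Inf of x_k, x_{k+1}, ... within a stored finite prefix."""
    return min(x[k:])

# oscillating sequence: limsup = 1 and liminf = -1 differ, so no limit
x = [math.sin(n * math.pi / 2) for n in range(1, 201)]

# convergent sequence 1/n: tail sup and tail inf squeeze toward the limit 0
y = [1.0 / n for n in range(1, 201)]
```

In agreement with Proposition A.32, the two tail quantities agree (both near 0) for the convergent sequence and disagree (1 versus -1) for the oscillating one.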
Monotone sequences (xn) (i.e., xn ≤ xn+1 for all n or xn ≥ xn+1 for all n) cannot exhibit
the oscillating behavior we observed with (-1)^n or sin(nπ/2).
Proposition A.33. Monotone sequences have a (finite or infinite) limit
This follows from the Bolzano-Weierstrass theorem because if the sequence is bounded
and contains infinitely many values, there must be a cluster point, which is the limit
(there cannot be more than one cluster point for a monotone sequence). On the other
hand, if the sequence is unbounded, then there is an infinite limit.
We conclude with a useful necessary and sufficient condition for a sequence to converge
to a finite limit.
A sequence (xn) is said to be a Cauchy sequence if for each ε > 0 there exists a natural
number N such that |xn - xm| < ε whenever m, n ≥ N.
In other words, the terms of the sequence are all arbitrarily close to (within ε of) each
other from some point onward.
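A finite prefix of a sequence cannot prove the Cauchy property, but it can illustrate it. The Python sketch below (names are ours) compares partial sums of Σ 1/n^2, which form a Cauchy sequence, with partial sums of the divergent harmonic series, whose tails stay widely spread:

```python
def tail_spread(x, N):
    """Max |x_n - x_m| over all m, n >= N within a stored finite prefix."""
    tail = x[N:]
    return max(tail) - min(tail)

# partial sums of 1/n^2 (convergent, hence Cauchy) vs 1/n (divergent)
s_conv, s_harm = [], []
a = b = 0.0
for n in range(1, 20001):
    a += 1.0 / n ** 2
    b += 1.0 / n
    s_conv.append(a)
    s_harm.append(b)

# the convergent tail is tightly clustered; the harmonic tail is not
print(tail_spread(s_conv, 10000))   # about 5e-5
print(tail_spread(s_harm, 10000))   # about ln 2 = 0.693
```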
A.5 Series
An important part of probability theory involves infinite sums of random variables or
probabilities. Therefore, we need to have a working knowledge of tools that will help
us determine whether these infinite sums, called infinite series, exist and are finite.
Two classes of infinite series are particularly useful: the geometric series, because we
can compute the sum explicitly, and the Riemann zeta series, because it includes, as a
special case, one of the best known divergent series, the harmonic series.
The case r = 1 is known as the harmonic series. To see that it is divergent, write the sum
as 1 + (1/2) + (1/3 + 1/4) + (1/5 + 1/6 + 1/7 + 1/8) + ... and note that all of the terms in
parentheses are at least 1/2 because 1/3 + 1/4 ≥ 1/4 + 1/4 = 1/2, 1/5 + 1/6 + 1/7 + 1/8 ≥
1/8 + 1/8 + 1/8 + 1/8 = 1/2, etc. Thus, Σn=1^∞ 1/n = ∞.
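The grouping argument shows divergence, but the growth is very slow, on the order of ln(n); indeed the partial sums satisfy Hn ≈ ln(n) + 0.5772 (Euler's constant). A quick Python check (checkpoint values are illustrative):

```python
import math

# partial sums H_n of the harmonic series grow without bound,
# but only at a logarithmic rate: H_n - ln(n) -> 0.5772... (Euler's constant)
H = 0.0
checkpoints = {}
for n in range(1, 100001):
    H += 1.0 / n
    if n in (10, 1000, 100000):
        checkpoints[n] = H
```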
Even though the harmonic series diverges, the alternating series 1 - 1/2 + 1/3 - 1/4 + ... converges.
This follows from Proposition A.40:
Figure A.9
The integral ∫_1^∞ f(x) dx is trapped between Σn=2^∞ f(n) (area of bars in top graph) and
Σn=1^∞ f(n) (area of bars in bottom graph). It follows that Σn=1^∞ f(n) < ∞ if and only if ∫_1^∞ f(x) dx < ∞.
This provides another way to see that the harmonic series diverges because
∫_1^∞ (1/x) dx = ln(x)|_1^∞ = ∞.
Absolute convergence is important for several reasons, one of which is the following.
Another important consequence of absolute convergence is that the sum remains the
same even if we rearrange the terms. Let i1, i2, ... be a rearrangement of (1, 2, 3, ...),
meaning that each natural number 1, 2, ... appears exactly once among i1, i2, ... If a
series Σ xn converges but not absolutely, then for any number c, we can rearrange the
terms to make Σj xij = c. A heuristic argument for this is the following. If the series
converges, but not absolutely, then the sum of the positive terms must be +∞ and the sum
of the negative terms must be -∞ (if only one of these sums were infinite, then the series
would not be convergent, whereas if neither were infinite, then the series would be
absolutely convergent). Take positive terms until the sum exceeds c, then add negative
terms until the sum is less than c, then positive terms until the sum exceeds c, etc.
Because xi → 0, this process makes Σj xij = c. Thus, we can get literally any sum by
rearranging terms.
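The greedy scheme just described can be run directly on the alternating harmonic series. The Python sketch below (function name and term budget are ours) rearranges its terms to steer the partial sums toward any target c:

```python
def rearranged_sum(c, n_terms=200000):
    """Greedy rearrangement of 1 - 1/2 + 1/3 - 1/4 + ... targeting c:
    add unused positive terms while the partial sum is <= c,
    otherwise add unused negative terms (the heuristic in the text)."""
    p, q = 1, 2   # next odd (positive) and even (negative) denominators
    s = 0.0
    for _ in range(n_terms):
        if s <= c:
            s += 1.0 / p
            p += 2
        else:
            s -= 1.0 / q
            q += 2
    return s
```

Because the unused terms shrink to 0, the overshoots shrink as well, and the partial sums settle near any target, e.g. `rearranged_sum(3.0)` and `rearranged_sum(-1.0)` land close to 3 and -1 respectively.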
If Σ xn is absolutely convergent, then Σ xin is convergent and has the same value for
any rearrangement i1, ..., in, ... of the terms.
Next we present useful criteria for determining whether a series converges absolutely.
Take the absolute value of the terms and then apply the following result.
3. Ratio test
Next we discuss some fundamental results about power series, which arise in a number
of ways in probability and statistics. For example, we often use a first order Taylor
approximation to obtain the asymptotic distribution of an estimator. This technique is
called the delta method. You may have also encountered power series in elementary
probability through generating functions.
P(x) is said to be a power series if P(x) = Σi=0^∞ ai(x - c)^i for constants c and ai, i = 0, 1, ...
There is a finite or infinite value r, called the radius of convergence of P(x), such that
P(x) converges absolutely if |x - c| < r and diverges if |x - c| > r. In fact, 1/r = limsup |an|^(1/n).
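The formula 1/r = limsup |an|^(1/n) can be illustrated numerically. In the Python sketch below (names are ours, and a single large n stands in for the limsup), the geometric series gives root 1 (so r = 1), while the coefficients 1/n! of the exponential series give roots tending to 0 (so r = ∞):

```python
import math

n = 300   # a single large index stands in for the limsup

# geometric series sum x^n has a_n = 1, so |a_n|^(1/n) -> 1 and r = 1
root_geometric = 1.0 ** (1.0 / n)

# exp(x) = sum x^n / n! has |a_n|^(1/n) = (1/n!)^(1/n) -> 0, so r = infinity;
# computed on the log scale via lgamma to avoid underflow of 1/n!
root_exponential = math.exp(-math.lgamma(n + 1) / n)
```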
A power series :
If f(x) is a function whose nth derivative at x = c (denoted f(n)(c)) exists, the nth order
Taylor polynomial for f expanded about c is defined to be T(x, n, c) = Σi=0^n f(i)(c)(x - c)^i/i!,
where f(0)(c) is defined as f(c).
Suppose that f(n)(c) exists. Then f(x) = T(x, n, c) + Rn, where the remainder term Rn
satisfies Rn/(x - c)^n → 0 as x → c. If additionally, f(n+1)(x) exists for x ∈ I = [a, b] and c ∈
[a, b], then Rn = f(n+1)(ξ)(x - c)^(n+1)/(n + 1)! for some ξ lying on the line segment
joining x and c.
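The remainder bound is easy to check for a concrete function. The sketch below (names are ours) uses f(x) = e^x expanded about c = 0, for which every derivative is e^c and the Lagrange form gives |Rn| ≤ e^x · x^(n+1)/(n+1)! when x > 0:

```python
import math

def taylor_exp(x, n, c=0.0):
    """n-th order Taylor polynomial of e^x about c (every derivative is e^c)."""
    return sum(math.exp(c) * (x - c) ** i / math.factorial(i)
               for i in range(n + 1))

x, n = 0.5, 4
actual_error = abs(math.exp(x) - taylor_exp(x, n))

# Lagrange form: R_n = e^xi * x^(n+1)/(n+1)! for some xi in (0, x),
# so |R_n| <= e^x * x^(n+1)/(n+1)!
lagrange_bound = math.exp(x) * x ** (n + 1) / math.factorial(n + 1)
```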
A.6 Functions
A.6.1 Mappings
If f(x) is a function, we use the notation f : X → Y to denote that f maps the set X into Y,
meaning that f(x) ∈ Y for each x ∈ X. This does not necessarily mean that each point
of Y is the image of some point in X. For instance, f(x) = sin(x) maps R into R, even
though sin(x) is always between -1 and 1. Therefore, we could also write f : R → [-1, 1].
Restricting Y to the range of the function (the set of image points {f(x) : x ∈ X})
means that every point y ∈ Y is the image of some point in X.
1. If every point y ∈ Y is the image of at least one point x ∈ X, we say that f maps X
onto Y.
2. If no y ∈ Y is the image of more than one x ∈ X, f is said to be 1-1.
Let f(x) be a function from a set X into another set Y. If A ⊂ X, then f(A) = {f(x) : x ∈
A}, the set of images of points in A. If B ⊂ Y, then f^(-1)(B) = {x ∈ X : f(x) ∈ B}, the set of points
that get mapped into B.
1. .
2. .
3. .
Remark A.56.
Proposition A.57. Limit exists if and only if left- and right-hand limits exist and are
equal
limx→x0 f(x) = L if and only if for each ε > 0, there exists a δ > 0 such that |f(x) - L| < ε
whenever 0 < |x - x0| < δ.
That is, f(x) can be made arbitrarily close to (within ε of) L if x is sufficiently close to
(within δ of), but not equal to, x0. Note that the required δ may depend on x0 and ε.
Proposition A.60. Continuous at a point if and only if left and right continuous
Figure A.10 illustrates the difference between pointwise and uniform continuity for f(x)
= 1/x. Continuity at x means that for any given ε > 0, we can find a horizontal interval Ih
of length 2δ centered at x such that whenever x′ ∈ Ih, f(x′) ∈ Iv, a vertical interval of
length 2ε centered at f(x). For f(x) = 1/x, the width of Ih ensuring f(x′) ∈ Iv is smaller at
x than at x′ > x. Over a restricted domain A ≤ x ≤ 1, A > 0, we can find a single δ that
works for all x because f(x) does not become arbitrarily steep as x → A. But over the
domain 0 < x ≤ 1, we can find no single δ that works for all x because f(x) is arbitrarily
steep as x → 0. Thus, f is uniformly continuous on [A, 1], but not on (0, 1].
Figure A.10
Continuity of f(x) = 1/x.
Figure A.10 shows that a function f can be continuous on an open interval, yet still be
arbitrarily steep as we approach an endpoint of the interval. This makes it impossible to
find a single δ that works for all x. This kind of anomalous behavior cannot happen on a
closed interval. More generally, it cannot happen on a compact set, as we see in
Proposition A.62 below.
Proposition A.62. Continuity on a compact set implies uniform continuity on that set
An example of a function that converges pointwise, but not uniformly, on the interval
[0, 1] is fn(x) = x^n, which converges to f(x) = 0 if 0 ≤ x < 1 and 1 if x = 1. If the
convergence were uniform, then for any given ε > 0, |fn(x) - f(x)| would have to
be smaller than ε for all n ≥ N and all x ∈ [0, 1]. But for any given n ≥ N, |fn(x) - f(x)| =
x^n approaches 1 as x approaches 1 from below. Therefore, it cannot be less than ε for all x ∈ [0, 1]. Thus, fn
converges pointwise, but not uniformly, to f(x) on [0, 1].
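The failure of uniformity can be seen numerically by comparing the gap at a fixed x with the supremum of the gap over a grid (a grid maximum only approximates the true supremum; names are ours):

```python
# pointwise: for each fixed x < 1, x^n -> 0 as n grows
fixed_x = 0.9
pointwise_gap = fixed_x ** 200            # essentially 0

# ... but the sup over x in [0, 1) of |x^n - 0| stays near 1 for every n,
# so sup_x |f_n(x) - f(x)| does not shrink: convergence is not uniform
grid = [k / 10000 for k in range(10000)]  # grid over [0, 1)
sup_gap_n100 = max(x ** 100 for x in grid)
```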
Note that in this example, even though fn(x) is continuous for each n and fn(x) → f(x),
the limit function f(x) is discontinuous. This could not have happened if the convergence
of fn to f had been uniform, as the following result shows.
Proposition A.65. Continuous functions converging uniformly implies the limit function
is continuous
Proof. Let x0 ∈ C and ε > 0 be given. We must determine a δ such that |f(x) - f(x0)| < ε
for all x ∈ C such that |x - x0| < δ. For any n,

|f(x) - f(x0)| ≤ |f(x) - fn(x)| + |fn(x) - fn(x0)| + |fn(x0) - f(x0)| (A.4)

by the triangle inequality. By uniform convergence of fn(x) to f(x), there exists an N such
that |fN(x) - f(x)| < ε/3 for all x ∈ C (including, of course, x0). Using this fact and
substituting N for n in (A.4), we see that |f(x) - f(x0)| < 2ε/3 + |fN(x) - fN(x0)|. By the continuity of
fN, there exists a δN such that |fN(x) - fN(x0)| < ε/3 whenever |x - x0| < δN. Therefore, for δ = δN,
|f(x) - f(x0)| < ε for all x ∈ C with |x - x0| < δ, proving that f is continuous at x0.
Another reason uniform convergence is so important can be seen from the following
example. Suppose that fn(x) is a real valued function converging pointwise to f(x), and
∫ fn(x) dx and ∫ f(x) dx exist; can we conclude that ∫ fn(x) dx → ∫ f(x) dx? To see that the
answer is no, let fn(x) = n if 0 < x ≤ 1/n and 0 if x > 1/n. Then fn(x) → 0 for x > 0, yet
∫ fn(x) dx = 1 for all n. Thus, fn(x) → f(x) = 0 for all x > 0, but ∫ fn(x) dx does not converge to
∫ f(x) dx = 0. On the other hand, if fn(x) converges uniformly to f on a bounded interval [a, b], then for any ε > 0 there is an N such
that |fn(x) - f(x)| < ε for all n ≥ N and all x. Therefore, |∫ fn(x) dx - ∫ f(x) dx| ≤ ∫ |fn(x) - f(x)| dx ≤ ε(b - a) for n ≥ N, so ∫ fn(x) dx → ∫ f(x) dx.
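The spike example can be checked with a crude Riemann sum (the midpoint grid and names are ours):

```python
def riemann_integral_fn(n, grid=100000):
    """Midpoint-rule integral over (0, 1] of f_n(x) = n on (0, 1/n], 0 elsewhere."""
    h = 1.0 / grid
    return sum((n if (i + 0.5) * h <= 1.0 / n else 0.0) * h
               for i in range(grid))

# f_n -> 0 pointwise on (0, 1], yet every integral equals 1, not 0:
# the limit of the integrals is not the integral of the limit
ints = [riemann_integral_fn(n) for n in (5, 50, 500)]
```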
We close this section with some results about how continuous functions map sets into
other sets.
The linear function L, represented by its k coefficients, is called the derivative or the
total differential.
Implied by this definition is that the limit in (A.6) must be 0 irrespective of the direction
of approach of x to y. If all but the first component of x equal the corresponding
components of y, then

{f(x1, y2, ..., yk) - f(y1, y2, ..., yk)}/(x1 - y1) → a1 as x1 → y1,

where a1 is the first coefficient of the derivative. But this implies that
limh→0 {f(y1 + h, y2, ..., yk) - f(y1, y2, ..., yk)}/h exists. That is, the partial derivative ∂f/∂x1 exists and equals the first coefficient a1 of the
derivative of f(x) at the point y. A similar argument shows that if f(x) is differentiable at
x = y, then all partial derivatives of f exist at y, and the coefficients of L are ai = (∂f/∂xi)(y).
That is, the linear approximation to f(x) near x = y is f(x) ≈ f(y) + Σi=1^k (∂f/∂xi)(y)(xi - yi).
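This linear approximation, which underlies the delta method mentioned earlier, can be checked numerically for a simple function (the function and evaluation points below are ours):

```python
# first-order linear approximation
# f(x) ~ f(y) + sum_i (partial f / partial x_i)(y) * (x_i - y_i),
# checked for the illustrative function f(x1, x2) = x1^2 * x2
def f(x1, x2):
    return x1 ** 2 * x2

y1, y2 = 1.0, 2.0
grad = (2.0 * y1 * y2, y1 ** 2)   # exact partial derivatives at (y1, y2)

x1, x2 = 1.01, 1.98               # a point near (y1, y2)
linear = f(y1, y2) + grad[0] * (x1 - y1) + grad[1] * (x2 - y2)
approx_error = abs(f(x1, x2) - linear)   # of smaller order than |x - y|
```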
Appendix B
(c) If Y is the number of failures before the sth success, then Y has the
negative binomial probability mass function
(a) If Y1 and Y2 are iid standard normals, Y1/Y2 has a Cauchy distribution
with location μ = 0 and scale σ = 1.
(b) If Y1, ..., Yn are iid Cauchy (μ, σ), the sample mean is also Cauchy (μ,
σ).
(a) If X1, ..., Xn are iid exponential (λ), then min(X1, ..., Xn) is exponential
(nλ). This is a special case of item 8b with b = 1.
(b) If Y1, ..., Yk are independent exponentials with parameter λ, then Σi Yi is
gamma with parameters λ and r = k.
(c) Lack of memory property of exponential If Y is exponential (λ), then the
conditional distribution of Y - t, given that Y ≥ t, is exponential (λ) for each t ≥
0.
10. Lack of memory property of geometric If Y is geometric (p), then the conditional
distribution of Y - k, given that Y ≥ k, is geometric (p) for each k = 0, 1, ...
11. Conjugate priors The following relationships are useful in Bayesian inference,
whereby a prior density π(θ) for a parameter θ is specified and then updated to a
posterior density f(θ | D) after observing data D. The prior density is called a
conjugate prior if the posterior density is in the same family.
(c) If the conditional distribution of Y given μ is N(μ, σ^2), and the (prior)
density π(μ) for μ is N(μ0, τ^2), then the conditional density of μ given Y = y (the
posterior density) is N((τ^2 y + σ^2 μ0)/(σ^2 + τ^2), σ^2 τ^2/(σ^2 + τ^2)).
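The lack of memory property in items 9c and 10 is easy to check by simulation. A Python sketch (the rate, threshold, and sample size are illustrative):

```python
import random

random.seed(1)
lam = 2.0                      # exponential rate (illustrative)
draws = [random.expovariate(lam) for _ in range(200000)]

# lack of memory: given Y >= t, the residual Y - t is again exponential(lam),
# so its mean should be close to the unconditional mean 1/lam = 0.5
t = 0.5
residuals = [y - t for y in draws if y >= t]
mean_residual = sum(residuals) / len(residuals)
```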
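The normal-normal updating rule in item 11c can be verified numerically by applying Bayes' rule on a grid. A Python sketch (all parameter values are illustrative):

```python
import math

# model: Y | mu ~ N(mu, s2); prior: mu ~ N(m0, t2)  (illustrative values)
s2, m0, t2, y = 1.0, 0.0, 4.0, 2.0

# closed-form normal-normal posterior for mu given Y = y
post_var = 1.0 / (1.0 / s2 + 1.0 / t2)
post_mean = post_var * (y / s2 + m0 / t2)

def dnorm(x, m, v):
    """Normal density with mean m and variance v."""
    return math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

# brute-force check: likelihood times prior, normalized over a fine mu grid
grid = [-10.0 + 20.0 * k / 100000 for k in range(100001)]
w = [dnorm(y, mu, s2) * dnorm(mu, m0, t2) for mu in grid]
total = sum(w)
grid_mean = sum(mu * wi for mu, wi in zip(grid, w)) / total
grid_var = sum((mu - grid_mean) ** 2 * wi for mu, wi in zip(grid, w)) / total
```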
Appendix D
Sets
∩ = intersection
∪ = union
A ⊂ B = A is a subset of B
∈ = an element of
∉ = not an element of
: = such that
AC = complement of A
A\B = {x : x ∈ A and x ∉ B}
A × B = A direct product with B
An ↑ A means ∪ An = A and An ⊂ An+1 for all n
An ↓ A means ∩ An = A and An ⊃ An+1 for all n
card(A) = cardinality of A
IA(x) = 1 if x ∈ A and IA(x) = 0 otherwise.
I(S) = 1 if statement S is true, 0 if false
σ(·) = sigma-field generated by ·
∅ = empty set
B = Borel sets
Bk = k-dimensional Borel sets
L = Lebesgue sets
Lp = random variables X with E|X|^p < ∞
R = set of real numbers
Rk = k-dimensional space of real numbers
Limits
Probability
Elementary
AN = asymptotically normal
ANCOVA = analysis of covariance
a.s. = almost surely
BCT = bounded convergence theorem
CLT = central limit theorem
ch.f. = characteristic function
cov = covariance
DCT = dominated convergence theorem
d.f. = distribution function
iid = independent and identically distributed
inf = infimum = greatest lower bound
i.o. = infinitely often
FWE = familywise error rate
max = maximum
MCT = monotone convergence theorem
m.g.f. = moment generating function
min = minimum
MLE = maximum likelihood estimate
MSE = mean squared error
SLLN = strong law of large numbers
sup = supremum = least upper bound
UI = uniformly integrable
var = variance
WLLN = weak law of large numbers