Statistical Inference For Everyone

Life's most important questions are, for the most part, nothing but probability problems. - Pierre-Simon Laplace

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write. - H.G. Wells

Statistics are the heart of democracy. - Simeon Strunsky

Brian Blais

Save The Broccoli Publishing
Copyright © 2017 Brian Blais
This book is licensed under the Creative Commons Attribution-ShareAlike license, version 4.0, http://creativecommons.org/licenses/by-sa/4.0/, except for those photographs and drawings of which I am
not the author, as listed in the photo credits. If you agree to the license, it grants you certain privileges
that you would not otherwise have, such as the right to copy the book, or download the digital version
free of charge from http://web.bryant.edu/~bblais. At your option, you may also copy this book
under the GNU Free Documentation License version 1.2, http://www.gnu.org/licenses/fdl.txt, with no
invariant sections, no front-cover texts, and no back-cover texts.
Proposal 29
1 Introduction to Probability 33
1.1 Models and Data 34
1.2 What is Probability? 35
Card Game 36
Other Observations 38
1.3 Conditional Probability 39
Probability Notation 39
1.4 Rules of Probability 40
Negation Rule 41
Product Rule 42
Independence 43
Conjunction 43
Sum Rule 45
Marginalization 46
Bayes’ Rule 47
1.5 Venn Mnemonic for the Rules of Probability 49
1.6 Lessons from Bayes’ Rule - A First Look 50
2 Applications of Probability 53
2.1 Cancer and Probability 53
2.2 Weather 55
Bibliography 221
1.1 Standard 52-card deck. 13 cards of each suit, labeled Spades, Clubs,
Diamonds, Hearts. 36
1.2 Venn diagram of a statement, A, in a Universe of all possible state-
ments. It is customary to think of the area of the Universe to be equal
to 1 so that we can treat the actual areas as fractional areas represent-
ing the probability of statements like P( A). In this image, A takes
up 1/4 of the Universe, so that P( A) = 1/4. Also shown is the nega-
tion rule. P( A) + P(not A) = 1 or “inside” of A + “outside” of A
adds up to everything. 49
1.3 Venn diagram of the sum and product. The rectangle B takes up 1/8
of the Universe, and the rectangle A takes up 1/4 of the Universe. Their
overlap here is 1/16 of the Universe, and represents P( A and B). Their
total area of 5/16 of the Universe represents P( A or B). 49
1.4 Venn diagram of conditional probabilities, P( A| B) and P( B| A). (Right)
P( A| B) is represented by the fraction of the darker area (which was
originally part of A) compared not to the Universe but to the area of
B, and thus represents P( A| B) = 1/2. In a way, it is as if the con-
ditional symbol, “|,” defines the Universe with which to make the com-
parisons. (Left) Likewise, the same darker area that was originally
part of B represents P( B| A) which makes up 1/4 of the area of A.
Thus P( B| A) = 1/4. 50
1.5 Venn diagram of mutually exclusive statements. One can see that P( A and B) =
0 (the overlap is zero) and P( A or B) = P( A) + P( B) (the total area
is just the sum of the two areas) 50
2.1 Probability for rolling various sums of two dice. Shown are the re-
sults for two 6-sided dice (left) and two 20-sided dice (right). The dashed
line is for clarity, but represents the fact that you can’t roll a fractional
sum, such as 2.5. 57
2.2 Probability of having at least two people in a group with the same
birthday depending on the number of people in the group. The 50%
mark is exceeded once the group size exceeds 23 people. 63
3.1 Probability of getting h heads in 30 flips. Clearly the most likely value
is 15, but all of the numbers from 12 up to 18 have significant prob-
ability. 81
3.2 Probability of getting h heads in 30 flips given a possible unfair coin.
One coin has p = 0.1, where the maximum is for 3 heads (or 1/10
of the 30 flips), but 2 heads is nearly as likely. Another has p = 0.5,
and is the fair coin considered earlier with a maximum at 15 heads
(or 1/2 of the 30 flips). Finally, another coin shown as p = 0.8 where
24 heads (or 8/10 of the 30 flips) is maximum. 83
4.1 High Deck - 55 Cards with ten 10’s, nine 9’s, etc... down to one Ace.
Aces are equivalent to the value 1. 96
4.2 Low Deck - 55 Cards with ten Aces, two 2’s, etc... up to one 10. Aces
are equivalent to the value 1. 96
4.3 Drawing a number of 9’s in a row, possibly from a High, Low, and
Nines deck. 105
6.5 Probability for different bent-coin models, given no data (left), the
first half of the data set (middle), and the entire data set of 9 tails and
3 heads (right). 126
6.6 Posterior probability distribution for the θ values of the bent coin -
the probability that the coin will land heads. The distribution is shown
for data 3 heads and 9 tails, with a maximum at θ = 0.25. 128
6.7 Posterior probability distribution for the θ values of the bent coin -
the probability that the coin will land heads. The distribution is shown
for data 3 heads and 9 tails. The area under the curve from θ = 0
(the “all heads” coin) to θ = 0.5 (the “fair” coin) is 0.954. 129
6.8 Posterior probability distribution for the θ values of the bent coin -
the probability that the coin will land heads. The distribution is shown
for data 3 heads and 9 tails. The area under the curve from θ = 0
(the “all heads” coin) to θ = 0.28 is 0.5 - half the area. This repre-
sents the median of the distribution. 130
6.9 Posterior probability distribution for the θ values of the bent coin -
the probability that the coin will land heads. The distribution is shown
for data 3 heads and 9 tails. The various quartiles are shown in the
plot, and summarized in the accompanying table. 131
6.10 Posterior probability distribution for the θ values of the bent coin -
the probability that the coin will land heads. The distribution is shown
for data 10 heads and 20 tails. The various quartiles are shown in the
plot, and summarized in the accompanying table. 135
9.1 Probability distributions for the subset of iris petal lengths. Each dis-
tribution follows a Student-t form. 168
9.2 Probability distributions for the difference between iris petal lengths
for the closest two iris types, Virginica and Versicolor. The distribu-
tion follows a Student-t form, and clearly shows significant proba-
bility (greater than 99%) for being greater than zero. 169
9.3 Mass of Pennies from 1960 to 1974. 174
9.4 Mass of Pennies from 1960 to 1974, with best estimates and 99% CI
(i.e. 3σ) uncertainty. 177
9.5 Mass of Pennies from 1960 to 2003, with best estimates and 99% CI
(i.e. 3σ) uncertainty. 179
9.6 Mass of Pennies from 1960 to 2003, with best estimates for the two
true values and their 99% CI (i.e. 3σ) uncertainty plotted. There is
clearly no overlap in their credible intervals, thus there is a statisti-
cally significant difference between them. 180
9.7 Difference in the estimated values of the pre- and post 1975 pennies,
µ1 − µ2 . The value zero is clearly outside of the 99% interval of the
difference, thus there is a statistically significant difference between
the two values µ1 and µ2 . 181
10.1 Heights (in inches) and shoe sizes from a subset of McLaren (2012)
data. 194
10.2 Posterior distribution for the slope for the linear model on the shoe
size data subset. 195
10.3 Posterior distribution for the intercept for the linear model on the
shoe size data subset. 196
10.4 Best linear fit for the shoe size data subset. 196
10.5 Minimizing the Mean Squared Error (MSE) results in the best lin-
ear fit for the shoe size data subset. 197
10.6 Total SAT score vs expenditure (top) and the distributions for the slope
(bottom left) and intercept (bottom right). 198
10.7 Percent of students taking the SAT vs per pupil expenditure (top)
and the distributions for the slope (bottom left) and intercept (bot-
tom right). 200
10.8 The posterior distributions for coefficients on the expenditure term,
the percent taking term, and the intercept. 201
11.1 So-called MCMC “chains” for parameter θ versus time. Observe that
the values of θ start spread evenly from 0 to 1 at the beginning and
then thin down to a range of about 0.5-0.8 with the middle around
0.7 (17/25 = 0.68). 212
11.2 Distribution of θ, and the 95% credible interval. 213
11.3 Chains for parameters a, b, and the noise σ. 213
11.4 Data (blue) and predictions (green) for the model - the width of the
predictions demonstrates the uncertainty. 214
11.5 Distributions for parameters a and b (slope and intercept). 215
11.6 Chains for parameter mu1 , the mean of the drug group. 217
11.7 Distribution for parameter mu1 , the mean of the drug group. 217
11.9 Distribution for parameter δ, the mean of the difference between the
drug group and the placebo group. 218
C.1 Discrete uniform distribution for values 1 to 6. The value for each
is p( xi ) = 1/6. 230
1.1 What is the fraction of the first card as a jack given that we know
that the first card is a face card? . . . . . . . . . . . . . . . . . 41
1.2 What is the fraction of cards that are Jacks and a heart? . . . 42
1.3 What is the probability of drawing two Kings in a row? . . 42
1.4 What is the probability of flipping two heads in a row? . . . 43
1.5 Marginalization and Card Suit . . . . . . . . . . . . . . . . . 46
1.6 What is the probability of drawing a jack, knowing that you’ve
drawn a face card? . . . . . . . . . . . . . . . . . . . . . . . . 47
2.1 What is the probability of both having cancer and getting a
positive test for it? . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2 What is the probability of both not having cancer and getting
a positive test for it? . . . . . . . . . . . . . . . . . . . . . . . 54
2.3 What is the probability of having cancer given a positive test
for it? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4 If the probability that it will rain next Saturday is 0.25 and the
probability that it will rain next Sunday is 0.25, what is the prob-
ability that it will rain during the weekend? . . . . . . . . . 55
2.5 What is the probability of the sum of two dice getting a par-
ticular value, say, 7? . . . . . . . . . . . . . . . . . . . . . . . 56
2.6 What is the probability of rolling a sum more than 7 with two
dice? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.7 What is the probability of rolling various sums with two dice
each with 20 sides? . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.8 Let’s imagine we have the case where two people meet on the
street. What is the probability that they both have April 3 as
their birthday? . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.9 Two people meet on the street, and we ask what is the prob-
ability that they both have the same birthday? . . . . . . . . 58
2.10 What is the probability that three random people have the same
birthday? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.11 What is the probability that at least two have the same birth-
day? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.12 What is the probability that at least two have the same birth-
day? A clever shortcut. . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Drawing m 9’s in a row, from either a High Deck or Low Deck. 104
6.1 Probabilities for flipping heads given a collection of bent coins 121
6.2 Probability for different bent-coin models, given the data=9 tails, 3
heads. The middle column is the non-normalized value from Bayes’
Rule, needing to be divided by K (the sum of the middle column)
to get the final column which is the actual probability. 123
8.1 Rough guide for the conversion of deviations away from zero and
the qualitative labels for probability values for being a significant de-
viation. 161
9.1 Iris petal lengths, in centimeters, for Iris type Setosa. 165
9.2 Subset of iris petal lengths, in centimeters, for iris types Virginica, Se-
tosa, and Versicolor. 166
9.3 Production lines produce ball bearings with a diameter of approximately
1 micron. Ten ball bearings were randomly picked from the production
line (i.e. the First line) at one time, and then again for a different
production line (i.e. the Second line). Romano, A. (1977)
Applied Statistics for Science and Industry. 169
9.4 Mass of Pennies from 1960 to 1974. 174
9.5 Mass of Pennies from 1989 to 2003. 176
10.1 Heights (in inches) and shoe sizes from a subset of McLaren (2012)
data. 193
Proposal
Initial Motivation
II Probability
• Confidence intervals
• Sampling distributions
• Computations involving the normal distribution, t-distribution,
and binomial distribution (for proportions)
• Hypothesis testing
IV Two-sample Statistics
like a “cookbook”: just find the right recipe for the right problem.
The fundamental understanding of statistical inference is under-
mined by this approach.
A New Approach
What I Am Proposing
This text can help solve the challenges described above, and more.
By focusing on models and data, as opposed to populations and
samples, this text can more cohesively bridge the topics described in
Parts I, II, and III above. Probability will be introduced as a natural
part of solving problems, as opposed to its standalone treatment
traditionally done in today’s texts.
In this text, I will use the Probability Theory as Logic approach
applied to the same problems that are traditionally covered. This
viewpoint can greatly enhance our understanding of statistics and
can handle topics such as confidence intervals and hypothesis testing
in a very intuitive manner. Statistical inference covered in this way
also addresses real-life questions that are not addressed by traditional
statistical methods.³

Finally, this will be a problem oriented textbook. It is imperative that the problems are cohesive with the pedagogy. I will also plan to use technology, where appropriate, to further student learning and make the textbook more interactive.

At the level targeted for this book, there is only one textbook that I know of that covers inference from the perspective proposed here, and that is Donald Berry's book Statistics: A Bayesian Perspective, 1996. It is my intention to modernize the approach, and include some topics that are not covered, specifically from the physical sciences and business.

³ One of the reasons why this approach is usually covered only in more advanced courses is the difficulty of the mathematics generally associated with it. Orthodox statistics makes heavy use of sampling, which is deemed more intuitive than probability distributions. It is my intention to start with low-dimensional cases, building to distributions, and to augment all concepts with numerical exercises.
1 Introduction to Probability
Life’s most important questions are, for the most part, nothing but probability
problems. - Laplace
When you think about probability, the first things that might come
to mind are coin flips (“there’s a 50-50 chance of landing heads”),
weather reports (“there’s a 20% chance of rain today”), and politi-
cal polls (“the incumbent candidate is leading the challenger 53% to
47%”). When we speak about probability, we speak about a percent-
age chance (0%-100%) for something to happen, although we often
write the percentage as a decimal number, between 0 and 1. If the
probability of an event is 0 then it is the same as saying that you are
certain that the event will never happen. If the probability is 1 then you
are certain that it will happen. Life is full of uncertainty, so we assign a
number somewhere between 0 and 1 to describe our state of knowl-
edge of the certainty of an event. The probability that you will get
Card Game
A simple game can be used to explore all of the facets of probability.
We use a standard set of cards (Figure 1.1) as the starting point, and
use this system to set up the intuition, as well as the mathematical
notation and structure for approaching probability problems.
We start with what I simply call the simple card game⁵, which goes like:

simple card game ≡ From a standard initially shuffled deck, we draw one card, note what card it is and set it aside. We then draw another card, note what card it is and set it aside. Continue until there are no more cards, noting each one along the way.   (1.1)

⁵ In this description of the game, we do not reshuffle after each draw. The differences between this non-reshuffled version and the one with reshuffling will be explored later, but will only change some small details in the outcomes.
P( R1 ) = P( B1 )
P( R1 ) = 1 − P( B1 )
P( R1 ) = P( B1 ) = 0.5
Mutually Exclusive If I have a list of mutually exclusive events, then that means that only one of them could possibly be true. Example events include flipping heads or tails with a coin, rolling a 1, 2, 3, 4, 5 or 6 on dice, or drawing a red or black card from a deck of cards. In terms of probability, this means that, for events A and B,

P(A and B) = 0
Non Mutually Exclusive If I have a list of events that are not mutually exclusive, then it is possible for two or more to be true. Examples include weather with rain and clouds or holding the high and the low card in a poker game.

Now, this was a long-winded way to get to the answer we knew from the start, but that is how it must begin. We start working things out where our common sense is strong, so that we know we are proceeding correctly. We can then, confidently, apply the tools in places where our common sense is not strong.
In summary, with no more information than that there are two
mutually exclusive possibilities, we assign equal probability to both.
If there are only two colors of cards in equal amounts, red and black,
then the probability of drawing a red is P( R1 ) = 0.5 and the probabil-
ity for a black is the same, P( B1 ) = 0.5.
Other Observations
If instead of just the color, we were interested in the suit (hearts,
diamonds, spades, and clubs), then there would be four equal and
mutually exclusive possibilities. We have a certain number of possi-
bilities, and our state of knowledge is exactly the same if we simply
swap around the labels on the cards. If we’re interested in the specific
card, not just the suit, the logic is the same. Thus, we have
P(♠) = P(♣) = P(♦) = P(♥) = 1/4
P(A♠) = P(2♠) = P(3♠) = · · · = P(K♥) = 1/52
Probabilities for Mutually Exclusive Events In general, for mutually exclusive events, we have

P(A) = (number of cases favorable to A) / (total number of equally possible cases)   (1.2)
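This counting rule translates directly into a few lines of code. The following is a minimal Python sketch (the function and variable names are my own, not from the text) that computes probabilities by counting favorable cases in a standard 52-card deck:

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'clubs', 'diamonds', 'hearts']
deck = [(r, s) for r in ranks for s in suits]

def probability(event):
    """Equation 1.2: (cases favorable to A) / (total equally possible cases)."""
    favorable = sum(1 for card in deck if event(card))
    return Fraction(favorable, len(deck))

print(probability(lambda c: c[1] == 'hearts'))       # P(heart) = 1/4
print(probability(lambda c: c == ('A', 'spades')))   # P(ace of spades) = 1/52
```

Exact fractions (rather than floating-point decimals) keep the results in the same form as the equations in the text.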
Probability Notation
In math, we choose to abbreviate long sentences in English, in order to use the economy of symbols. In this book we choose a middle ground between mathematical succinctness and the ease of understanding English. We start with the simple card game (Equation 1.1). We then define a new symbol, |, which should be read as "given." When there is information given we call this probability conditional on that information. When we write the following:

P(R1 | simple card game)

or

"The probability of drawing a red on the first draw, given that we have a standard initially shuffled deck and we follow the procedure where we draw one card, note what color it is and set it aside and continue drawing, noting, and setting aside until there are no more cards."
One can easily see that the mathematical notation is far more
efficient. It is important to be able to read the notation, because it
describes what we know and what we want to know.
Conditional Probability When information is given, and expressed on the right-hand side of the | sign, we say that the probability is conditional. P(I'm going to get wet today | raining outside) is an assessment of how likely it is that I will get wet given, or conditional on, the fact that it is raining outside. Clearly this number will be different if it was conditional on the fact that it is sunny outside - different states of knowledge yield different probability assignments.

When we put a comma (",") on the right side then we read this as "and we know that." For example, when we write the following:
P(red on second draw | simple card game, red on first draw)   (1.5)

or

P(R2 | simple card game, R1)   (1.6)

1.4 Rules of Probability

From the rule for mutually exclusive events (Equation 1.2), we assign the following probabilities for the first draw from this deck⁶:

• P(10) = 4/52
• P(♥) = 13/52 = 1/4
• P(10♥) = 1/52
• P(face card) = 12/52
• P(number card) = 40/52

⁶ A face card is defined to be a Jack, Queen, or King. A number card is defined to be Ace (i.e. 1) through 10.

Causation. Imagine we have a 2-card game: a small deck with one red card and one black card, and I draw a red card first. Clearly this makes the probability of drawing red as the second card equal to zero - it can't happen. We're tempted to interpret

P(R2 | R1, 2-card game) = 0

as causation. Compare this with

P(R1 | R2, 2-card game) = 0

which is, if we knew that the second card we drew was red, then it makes it impossible to have drawn a red card as the first card. This is just as true as the previous case, however, you can't interpret this as causation - the second draw didn't cause the first draw. Instead, probability statements are statements of logic, not causation. One can use probabilities to describe causation (i.e. P(rain|clouds)), but the statements of probability have no time component - later draws from the deck of cards act exactly the same as earlier ones.
Negation Rule

In this section I'll use the letter F for fraction, and we can determine the values simply by counting. The fraction of cards which are hearts (♥) is

F(♥) = 13/52 = 1/4

The fraction of cards which are not hearts (i.e. the 3 other suits) is:

F(not ♥) = (13 × 3)/52 = 3/4

These numbers add up to one: F(♥) + F(not ♥) = 1. We can do this with more complex statements,

P(A) + P(not A) = 1

or

P(A|B) + P(not A|B) = 1   (1.7)

Either-or fallacy. The negation rule should not be taken to imply that everything is "black and white," or "there are only two sides to every story." It really is just a statement of logic, should be carefully considered and has some limitations. For example, the following is true:

P(object is black) + P(object is not black) = 1

However, this does not mean the same thing as

P(object is black) + P(object is white) = 1
Product Rule
The product rule comes from looking at the combination of events:
event A and event B. As before, we’ll work on the numbers from the
fractions of the card game.
Example 1.2 What is the fraction of cards that are Jacks and a heart?
F(jack and ♥) = F(♥|jack) × F(jack) = 1/4 × 4/52 = 1/52
One can equivalently reason from the suit first: the hearts constitute
13/52 of the cards, and that of those 13, the Jacks constitute 1/13
of the cards. So, we can arrive at the fraction of J♥ by taking one
thirteenth of the fraction of ♥. Again, we have
F(jack and ♥) = F(jack|♥) × F(♥) = 1/13 × 13/52 = 1/52
In general we have
Product Rule

P(A and B) = P(A|B) P(B) = P(B|A) P(A)   (1.8)

Example 1.3 What is the probability of drawing two Kings in a row? We want P(K2 and K1), which the product rule gives as

P(K2 and K1) = P(K2|K1) P(K1)
The second part is straightforward: P(K1) = 4/52. The first part is
asking the probability of drawing a second king, knowing that we
have drawn a king on the first draw. Now, there are only 51 cards
remaining when we do the second draw, and only 3 kings. Thus, we
have P(K2 |K1 ) = 3/51 and finally
P(K2 and K1 ) = P ( K2 | K1 ) P ( K1 )
3 4 1
= × =
51 52 221
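We can check this result numerically. The sketch below (my own construction, not from the text) computes the exact answer with the product rule and confirms it with a simple Monte Carlo simulation of drawing two cards without replacement:

```python
import random
from fractions import Fraction

# Standard 52-card deck as (rank, suit) pairs.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
deck = [(r, s) for r in ranks for s in ['S', 'C', 'D', 'H']]

# Exact answer from the product rule: P(K1) * P(K2|K1)
exact = Fraction(4, 52) * Fraction(3, 51)   # = 1/221

# Monte Carlo check: draw two cards without replacement, many times.
random.seed(0)
trials = 200_000
hits = 0
for _ in range(trials):
    first, second = random.sample(deck, 2)  # sample without replacement
    if first[0] == 'K' and second[0] == 'K':
        hits += 1

print(exact)          # 1/221
print(hits / trials)  # close to 1/221, about 0.0045
```

The simulated fraction fluctuates from run to run, but with this many trials it lands very near the exact 1/221.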
Independence
As a specific case of the product rule, we can change the rule of the
card games such that we reshuffle the deck after each draw. In this
way, the result of one draw gives you no information about other
draws. In this case, the events are considered independent.
Independent Events Two events, A and B, are said to be independent if knowledge of one gives you no information on the other. Mathematically, this means

P(A|B) = P(A)

and

P(B|A) = P(B)
In this case, the product rule reduces to the simplified rule for independent events: the product of the individual event probabilities,

P(A and B) = P(A) P(B)

For example, in a full deck the rank gives no information about the suit, so

P(♥|jack) = P(♥)
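One can verify this independence by direct counting. In the sketch below (the names are my own), restricting the "Universe" to just the jacks leaves the fraction of hearts unchanged:

```python
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'clubs', 'diamonds', 'hearts']
deck = [(r, s) for r in ranks for s in suits]

# P(heart) counted over the whole deck
p_heart = Fraction(sum(1 for r, s in deck if s == 'hearts'), len(deck))

# P(heart | jack): restrict the "Universe" to jacks only
jacks = [(r, s) for r, s in deck if r == 'J']
p_heart_given_jack = Fraction(sum(1 for r, s in jacks if s == 'hearts'),
                              len(jacks))

print(p_heart, p_heart_given_jack)  # both 1/4: rank gives no suit information
```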
Conjunction

One of the consequences of combinations of events is that the probability of two events happening, A and B, has to be less than (or possibly equal to) the probability of just one of them, say A, happening. The mathematical fact is seen by looking at the magnitude of the terms in the product rule:

P(A and B) = P(B|A) × P(A) ≤ P(A)

because the first factor, P(B|A), is less than or equal to 1.
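A quick counting check of this inequality, using the card game (a sketch of my own, not from the text):

```python
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'clubs', 'diamonds', 'hearts']
deck = [(r, s) for r in ranks for s in suits]
n = len(deck)

p_jack = Fraction(sum(1 for r, s in deck if r == 'J'), n)            # 4/52
p_jack_and_heart = Fraction(
    sum(1 for r, s in deck if r == 'J' and s == 'hearts'), n)        # 1/52

# The conjunction is never more probable than either event alone.
assert p_jack_and_heart <= p_jack
print(p_jack_and_heart, '<=', p_jack)
```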
In other words, coincidences are less likely than either event hap-
pening individually. We intuitively know this, when we make com-
ments like “Wow! What are the chances of that?” referring to, say,
someone winning the lottery and then getting struck by a car the
next day. Sometimes, however, it seems as if one’s intuition does not
match the conclusions of the rules of probability. One such case is
called the conjunction fallacy.

In an interesting experiment, Tversky and Kahneman [Tversky and Kahneman, 1983] gave the following survey:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

1 Linda is a bank teller.
2 Linda is a bank teller and is active in the feminist movement.

85% chose option 2. [Tversky and Kahneman, 1974] This they attributed to the conjunction fallacy - mistaking the conjunction of two events as more probable than a single event. They went further and did a survey of medical internists with the following

Which is more likely: the victim of an embolism (clot in the lung) will experience partial paralysis or that the victim will experience both partial paralysis and shortness of breath?

Combinations of Events and the English language I believe that the issue of the conjunction fallacy is more subtle than this. In English, if I were to say "Do you want steak for dinner, or steak and potatoes?" one would immediately parse this as a choice between

1 steak with no potatoes
2 steak, definitely with potatoes

rather than the literal

1 steak, possibly with potatoes and possibly without potatoes
2 steak, definitely with potatoes

It is common in English to have the implied negative (i.e. steak with no potatoes) when given a choice where the alternative is a conjunction (i.e. steak with potatoes).
and again, 91 percent of the doctors chose that the clot was less likely to cause the rare paralysis rather than to cause the combination of the rare paralysis and the common shortness of breath.

Even when correct, the consequence for conjunctions can be misused, or at least misidentified. Returning to our example of someone winning the lottery and then getting struck by a car the next day, rare events occur frequently, as long as you have enough events. There are millions of people each day playing the lottery, and millions getting struck by cars each day. We will explore this problem later in Section 2.5, but one immediate consequence is that winning the lottery and getting struck by a car the next day probably happens somewhere fairly regularly.

Combinations of Events and the English language If we interpret the doctor's choice with this implied negative, we have:

1 clot with paralysis and no shortness of breath
2 clot with paralysis and shortness of breath

and the first one is much less likely, because it would be odd to have a clot and not have a very common symptom associated with it. The doctor's probability assessment is absolutely correct: both symptoms together are more likely than just one. The "fallacy" arises because the English language is sloppier than mathematical language.
Sum Rule
Now we consider the statements of the form A or B. For example, in the card game, what is the fraction of cards that are jacks or are hearts? By counting we get the 13 hearts and 3 more jacks that are not contained in the 13 hearts, or F(jack or ♥) = (13 + 3)/52 = 16/52. Now, if
we tried to separate the terms, and do:
F(jack) + F(♥) = 4/52 + 13/52 = 17/52
then we get a number that is too big! It is too big because we’ve
double-counted the jack of hearts. Adjusting for this, by subtracting
one copy of this fraction, we get
F(jack) + F(♥) − F(jack and ♥) = 4/52 + 13/52 − 1/52 = 16/52 = F(jack or ♥)
In general
Sum Rule

P(A or B) = P(A) + P(B) − P(A and B)   (1.10)

Sum Rule for Exclusive Events If two events are mutually exclusive the sum rule reduces to

P(A or B) = P(A) + P(B)   (1.11)
P(A or B or C) = P(A or [B or C])
              = P(A) + P(B or C) − P(A and [B or C])
              = P(A) + P(B) + P(C) − P(B and C) − P([A and B] or [A and C])
              = P(A) + P(B) + P(C) − P(B and C) − [P(A and B) + P(A and C) − P(A and B and A and C)]
              = P(A) + P(B) + P(C) − P(A and B) − P(A and C) − P(B and C) + P(A and B and C)
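The sum rule can be confirmed by counting cards, as in this Python sketch (the helper names are my own, not from the text):

```python
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'clubs', 'diamonds', 'hearts']
deck = [(r, s) for r in ranks for s in suits]
n = len(deck)

def frac(event):
    """Fraction of the deck satisfying the event."""
    return Fraction(sum(1 for card in deck if event(card)), n)

p_jack = frac(lambda c: c[0] == 'J')                        # 4/52
p_heart = frac(lambda c: c[1] == 'hearts')                  # 13/52
p_both = frac(lambda c: c[0] == 'J' and c[1] == 'hearts')   # 1/52
p_either = frac(lambda c: c[0] == 'J' or c[1] == 'hearts')  # 16/52

# Sum rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p_either == p_jack + p_heart - p_both
print(p_either)  # 4/13, i.e. 16/52
```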
Marginalization
Another consequence of the sum rule and the product rule is a pro-
cess called marginalization.
Example 1.5 Marginalization and Card Suit What is P(jack)? Summing over all possibilities for the suit,

P(jack) = P(jack|♥) × P(♥) + P(jack|♦) × P(♦) + P(jack|♠) × P(♠) + P(jack|♣) × P(♣)
        = 1/13 × 1/4 + 1/13 × 1/4 + 1/13 × 1/4 + 1/13 × 1/4
        = 4/52
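The same marginalization can be carried out by counting, as in the following sketch (my own construction, not from the text):

```python
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'clubs', 'diamonds', 'hearts']
deck = [(r, s) for r in ranks for s in suits]

# Marginalize over the suit: P(jack) = sum over suits of P(jack|suit) P(suit)
p_jack = Fraction(0)
for suit in suits:
    cards_of_suit = [(r, s) for r, s in deck if s == suit]
    p_suit = Fraction(len(cards_of_suit), len(deck))          # P(suit) = 1/4
    p_jack_given_suit = Fraction(
        sum(1 for r, s in cards_of_suit if r == 'J'),
        len(cards_of_suit))                                   # P(jack|suit) = 1/13
    p_jack += p_jack_given_suit * p_suit

print(p_jack)  # 1/13, i.e. 4/52
```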
In general, given the conditional probabilities P(A|B1), P(A|B2), P(A|B3), P(A|B4), · · · for a set of mutually exclusive events B1, B2, B3, . . ., we can sum over all possible Bs:

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3) + · · ·   (1.13)
Bayes' Rule

In the 1700's Reverend Bayes proved a special case of this rule, and it was rediscovered in the general form by Pierre-Simon Laplace. Laplace then applied the rule to a large range of problems from geology, astronomy, medicine, and jurisprudence.

One of the most consequential rules of probability is what is known as Bayes' Rule, sometimes called Bayes' Theorem. We will use this rule throughout this book, and see its many applications. It comes as a direct result of the product rule (Equation 1.8):
P( A and B) = P( A| B) P( B) = P( B| A) P( A)
Rearranging, we get

Bayes' Rule

P(A|B) = P(B|A) P(A) / P(B)   (1.14)
We can verify this again with the intuitions we have in the simple
card game.
F(jack|face) = F(face|jack) × F(jack) / F(face) = (4/4 × 4/52) / (12/52) = 4/12 = 1/3
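This verification can also be done in code. The sketch below (names my own, not from the text) computes each fraction by counting and applies Bayes' Rule:

```python
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'clubs', 'diamonds', 'hearts']
deck = [(r, s) for r in ranks for s in suits]
n = len(deck)

face = {'J', 'Q', 'K'}
p_face = Fraction(sum(1 for r, s in deck if r in face), n)   # 12/52
p_jack = Fraction(sum(1 for r, s in deck if r == 'J'), n)    # 4/52
p_face_given_jack = Fraction(1)                              # every jack is a face card

# Bayes' Rule (Equation 1.14)
p_jack_given_face = p_face_given_jack * p_jack / p_face
print(p_jack_given_face)  # 1/3
```

The answer agrees with the direct count: of the 12 face cards, 4 are jacks.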
1.5 Venn Mnemonic for the Rules of Probability

It is often useful to have a picture to represent the mathematics, so that it is easier to remember the equations and to understand their meaning. It is common to use what is called a Venn Diagram to represent probabilities in an intuitive, graphical way. The idea is that probabilities are represented as the fractional area of simple geometric shapes. We can then find a picture representation of each of the rules of probability. We start by looking at a sample Venn Diagram, in Figure 1.2.
Figure 1.2: Venn diagram of a statement, A, in a Universe of all possible statements. It is customary to think of the area of the Universe to be equal to 1 so that we can treat the actual areas as fractional areas representing the probability of statements like P(A). In this image, A takes up 1/4 of the Universe, so that P(A) = 1/4. Also shown is the negation rule, P(A) + P(not A) = 1, or: the "inside" of A plus the "outside" of A adds up to everything.

The fractional area of the rectangle A represents the probability P(A), and can be thought of as a probability of one of the statements we've explored, such as P(♥). This diagram is strictly a mnemonic, because the individual points on the diagram are not properly defined. The diagram in Figure 1.2 also represents the Negation Rule (Equation 1.7),

P(A) + P(not A) = 1
In the diagram it is easy to see that the sum of the areas inside of A (i.e. 1/4) and outside of A (i.e. 3/4) cover the entire area of the Universe of statements, and thus add up to 1.

Figure 1.3 shows the diagram which can help us remember the sum and product rules. The Sum Rule (Equation 1.10)

P(A or B) = P(A) + P(B) − P(A and B)

is represented in the total area occupied by the rectangles A and B, and makes up all of A (i.e. 1/4) and the half of B sticking out (i.e. 1/8 − 1/16 = 1/16), yielding P(A or B) = 5/16. This is also the area of each added up (1/4 + 1/8), but subtracting the intersection (1/16) because otherwise it is counted twice. Adding the areas this way directly parallels the Sum Rule.
Conditional probabilities, like those that come into the Product Rule (Equation 1.8) and Bayes' Rule (Equation 1.14), are a little more challenging to visualize. In Figure 1.4, P(A|B) is represented by the fraction of the darker area (which was originally part of A) compared not to the Universe but to the area of B, and thus represents P(A|B) = 1/2. In a way, it is as if the conditional symbol, "|," defines the Universe with which to make the comparisons. On the left of Figure 1.4, the same darker area that was originally part of B represents P(B|A), making up 1/4 of the area of A. Thus P(B|A) = 1/4. The Product Rule (Equation 1.8) then follows,

P(A and B) = P(A|B) P(B) = (1/2)(1/8) = 1/16

and equivalently

P(A and B) = P(B|A) P(A) = (1/4)(1/4) = 1/16

Figure 1.3: Venn diagram of the sum and product. The rectangle B takes up 1/8 of the Universe, and the rectangle A takes up 1/4 of the Universe. Their overlap here is 1/16 of the Universe, and represents P(A and B). Their total area of 5/16 of the Universe represents P(A or B).
We can further see the special case of mutually exclusive statements shown in Figure 1.5. The Sum Rule for Exclusive Events (Equation 1.11) is simply the sum of the two areas because there is no overlap

P(A or B) = P(A) + P(B)

Further, it is straightforward to see from this diagram the following properties for mutually exclusive events

P(A and B) = 0
P(A|B) = 0
P(B|A) = 0

Figure 1.4: Venn diagram of conditional probabilities, P(A|B) and P(B|A). (Right) P(A|B) is represented by the fraction of the darker area (which was originally part of A) compared not to the Universe but to the area of B, and thus represents P(A|B) = 1/2. In a way, it is as if the conditional symbol, "|," defines the Universe with which to make the comparisons. (Left) Likewise, the same darker area that was originally part of B represents P(B|A), which makes up 1/4 of the area of A. Thus P(B|A) = 1/4.

1.6 Lessons from Bayes' Rule - A First Look
Bayes' Rule is the gold standard for all statistical inference. It is a mathematical theorem, proven from fundamental principles. It structures all inference in a systematic fashion. However, it can be used without doing any calculations, as a guide to qualitative inference. Some of the lessons which are consequences of Bayes' Rule are listed here, and will be noted throughout this text in various examples.

• Confidence in a claim should scale with the evidence for that claim

• Ockham's razor, which is the philosophical idea that simpler theories are preferred, is a consequence of Bayes' Rule when comparing models of differing complexity.

• Simpler means fewer adjustable parameters

• Simpler also means that the predictions are both specific and not overly plastic. For example, a hypothesis which is consistent with the observed data, and would also be consistent if the data were the opposite, would be overly plastic.

Figure 1.5: Venn diagram of mutually exclusive statements. One can see that P(A and B) = 0 (the overlap is zero) and P(A or B) = P(A) + P(B) (the total area is just the sum of the two areas).
Example 2.1 What is the probability of both having cancer and getting a
positive test for it?
Example 2.2 What is the probability of both not having cancer and getting
a positive test for it?
Example 2.3 What is the probability of having cancer given a positive test
for it?
Although those with cancer nearly always test positive, out of the pool of all people who test positive - including a large number of false positives - those actually having cancer are a small minority. This is because there are many more people without cancer, so even if only a small fraction of them mistakenly test positive, they will outnumber the small fraction of people who actually have the disease. This is why we insist on second opinions and why the rarity of a disease often matters even more than the accuracy of the test.

P(positive test) = P(no cancer and positive test) + P(cancer and positive test)
                 = 0.07 + 0.008 = 0.078
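The full calculation for Example 2.3 can be sketched in a few lines of Python, using the joint probabilities above, marginalization, and Bayes' Rule:

```python
# Joint probabilities from the examples above
p_cancer_and_pos = 0.008     # P(cancer and positive test)
p_no_cancer_and_pos = 0.07   # P(no cancer and positive test)

# Marginalization: P(positive test) sums over both possibilities
p_pos = p_cancer_and_pos + p_no_cancer_and_pos

# Bayes' Rule: P(cancer | positive test)
p_cancer_given_pos = p_cancer_and_pos / p_pos

print(round(p_pos, 3))               # 0.078
print(round(p_cancer_given_pos, 2))  # 0.1
```

Even with a positive test, the probability of actually having cancer is only around 10%, exactly because the false positives dominate the pool.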
2.2 Weather
Example 2.4 If the probability that it will rain next Saturday is 0.25 and
the probability that it will rain next Sunday is 0.25, what is the probability
that it will rain during the weekend?
Notice, however, that we don't have a direct expression for P(rain Sunday) anymore. We only have the conditional or dependent forms, like P(rain Sunday|rain Saturday). We can use the marginalization procedure (Equation 1.13 on page 47), and sum over all of the conditional expressions
which makes it less likely to rain on the weekend if the Sunday rain
is correlated with the Saturday rain (Equation 2.2) than if they are
independent (Equation 2.1). Why is that?
One way to think of it is that, although the probability of rain on Sunday is increased by rain on Saturday, it is more likely that Saturday is not rainy. In those cases, which are more frequent, Sunday is less likely to be rainy as well. When the two days are independent, Sunday's rain has the same probability regardless of Saturday's weather. When they are dependent, the more frequent clear Saturday weather makes Sunday rain a little less likely, and thus lowers the chance of weekend rain by a little bit.
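We can make the comparison concrete with a small sketch. The conditional values here are invented for illustration (the text does not give them): we assume P(rain Sunday|rain Saturday) = 0.5, and choose P(rain Sunday|no rain Saturday) so that the marginal P(rain Sunday) stays at 0.25.

```python
p_sat = 0.25             # P(rain Saturday), given in Example 2.4
p_sun_given_sat = 0.5    # assumed: rain tends to persist
p_sun_given_dry = 1 / 6  # assumed, chosen so P(rain Sunday) stays 0.25

# Marginalization (Equation 1.13): sum over what Saturday did
p_sun = p_sun_given_sat * p_sat + p_sun_given_dry * (1 - p_sat)
print(round(p_sun, 2))   # 0.25

# Sum rule for the weekend: independent days versus correlated days
p_weekend_indep = p_sat + p_sun - p_sat * p_sun
p_weekend_dep = p_sat + p_sun - p_sun_given_sat * p_sat
print(round(p_weekend_indep, 4))  # 0.4375
print(round(p_weekend_dep, 4))    # 0.375
```

The correlated case comes out lower, matching the discussion above.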
Example 2.5 What is the probability of the sum of two dice getting a
particular value, say, 7?
P(8 or 9 or 10 or 11 or 12)
which are all exclusive events, so we use the Sum Rule for exclusive
events (Equation 1.11) and obtain
Example 2.7 What is the probability of rolling various sums with two dice
each with 20 sides?
Figure 2.1: Probability for rolling
various sums of two dice. Shown
are the results for two 6-sided dice
(left) and two 20-sided dice (right).
The dashed line is for clarity, but
represents the fact that you can’t
roll a fractional sum, such as 2.5.
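The distributions in Figure 2.1 can be reproduced by brute-force enumeration; a minimal sketch (the function name here is our own):

```python
from fractions import Fraction
from collections import Counter

def sum_distribution(sides):
    """Exact distribution of the sum of two fair dice, by counting
    every equally likely pair of faces."""
    counts = Counter(a + b for a in range(1, sides + 1)
                           for b in range(1, sides + 1))
    total = sides * sides
    return {s: Fraction(c, total) for s, c in counts.items()}

d6 = sum_distribution(6)
print(d6[7])           # 1/6, the most likely sum for two 6-sided dice
d20 = sum_distribution(20)
print(float(d20[21]))  # 0.05, the most likely sum for two 20-sided dice
```

Because every pair of faces is equally likely, counting pairs for each sum gives the exact probabilities plotted in the figure.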
Two People
Example 2.9 Two people meet on the street, and we ask what is the proba-
bility that they both have the same birthday?
P(C1 or C2 or ··· or C365) = (1/365) × (1/365) + (1/365) × (1/365) + ··· + (1/365) × (1/365)
                             [365 terms, one for each day]
                           = 1/365 = 0.0027
Another way to think of this is to imagine that person 1 randomly "chooses" their birthday, D1, and person 2 randomly "chooses" their birthday, D2, and then they compare to see if the days are the same, or D1 = D2. In general, we can think of the problem broken up in this way:

P(D1 = D2) = P(D1 is a specific day and D2 is the same day) × (number of possible specific days)

In this way, we get

P(D1 = D2) = (1/365) × (1/365) × 365
           = 1/365 = 0.0027

Here we find another example of the general requirement that equivalent states of knowledge give rise to equivalent probability assignments. In this case it means that if there is more than one way to arrive at a conclusion, they each must give the same answer. We can then choose the way that is easiest to calculate, simply out of convenience.
which is extremely unlikely (see Table 1.1 on page 51), but not nearly as unlikely as both of them having the same specific birthday, such as April 3.
Three People
Example 2.10 What is the probability that three random people have the
same birthday?
which is even more extremely unlikely (see Table 1.1 on page 51)
than the previous two-person example. It is interesting to note that
this is the same answer we received when we asked for the probabil-
ity of two people with a specific birthday. One can think of the three
people having the same, unspecified, birthday in the following way if
it helps. The first person’s birthday specifies the necessary birthday
for the other two, so it is the same as the case where we specify a
single birthday for two people.
Example 2.11 What is the probability that at least two have the same
birthday?
Writing the possibilities out like this is quite tedious, and can lead to errors. Directly after this calculation we find an equivalent, and much easier, way of writing the same calculation. However, it is important to note that all ways of writing the same information must lead to the same answer.

Writing this out we get (somewhat messily)

P(at least two out of three have the same birthday)
    = P(exactly 2 the same or exactly 3 the same)
    = P(exactly 2 the same) + P(exactly 3 the same) − P(exactly 2 and exactly 3 the same)

where P(exactly 3 the same) = (1/365)^3 × 365 and P(exactly 2 and exactly 3 the same) = 0, because the two statements are exclusive. Applying the product rule we get (I'm sure you're wishing for the easier way about now...it's coming in Example 2.12)
Noting that there are 3 ways of getting a specific 2 the same, we obtain for this single term

P(exactly 2 the same) = (1/365) × (364/365) × 3

These 3 ways are "person 1 and 2 match," "person 1 and 3 match," and "person 2 and 3 match."
Putting it all together we have
Example 2.12 What is the probability that at least two have the same birthday? A clever shortcut.

P(none the same in 3 people) + P(not "none the same in 3 people") = 1

P(none the same in 3 people) + P(at least 2 the same in 3 people) = 1

which leads to

P(at least 2 the same in 3 people) = 1 − P(none the same in 3 people)
                                   = 1 − (364/365) × (363/365)
                                   = 0.0082

For 30 people, the same shortcut gives

P(at least 2 the same in 30 people) = 1 − 0.29
                                    = 0.71
which is 71%! Compare this likely outcome to the extremely rare outcome of two random people having matching birthdays, from page 58. See Figure 2.2 for a plot of this unintuitive observation.
[Figure 2.2: the probability that at least two people share a birthday, plotted against the number of people; annotated at 23 people.]
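The curve in Figure 2.2 can be computed directly with the "none the same" shortcut; a sketch:

```python
def prob_shared_birthday(n):
    """P(at least two of n people share a birthday), assuming 365
    equally likely birthdays: one minus P(all n birthdays differ)."""
    p_all_differ = 1.0
    for i in range(n):
        p_all_differ *= (365 - i) / 365
    return 1 - p_all_differ

print(round(prob_shared_birthday(3), 4))   # 0.0082
print(round(prob_shared_birthday(30), 2))  # 0.71
print(prob_shared_birthday(23) > 0.5)      # True: 23 people is already enough
```

This reproduces both numbers in the text, and shows that the probability crosses one half at only 23 people.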
P(winning two tickets) = 1/(2 × 10^13) ≈ 5 · 10^−14    (2.3)
which truly is quite improbable as a single event, but is it truly an
improbable event to happen somewhere? The assumption stated in
the quote is that only two tickets were purchased. We all know that
many lottery tickets are purchased daily, which should increase the
chance that somewhere this will occur. Even this winning couple purchased tickets every day for 20 years before winning.
lows:
Example 2.14 Suppose you’re on a game show, and you’re given the choice
of three doors: behind one door is a car; behind the others, goats. You pick a
door, say No. 1 (but the door is not opened), and the host, who knows what’s
behind the doors, opens another door, say No. 3, which has a goat. He then
says to you, "Do you want to change your choice to door No. 2?" Is it to
your advantage or disadvantage to switch your choice, or does it matter
whether you switch your choice or not?
Example 2.18 Suppose you’re on a game show, and you’re given the choice
of three doors: behind one door is a car; behind the others, goats. You pick a
door, say No. 1 (but the door is not opened), and the host, who knows what’s
behind the doors, opens another door, say No. 3, which has a goat. He then
says to you, "Do you want to change your choice to door No. 2?" Is it to
your advantage or disadvantage to switch your choice, or does it matter
whether you switch your choice or not?
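A quick simulation makes the advantage of switching hard to deny; this sketch (our own, not from the text) plays the game many times under each strategy:

```python
import random

def monty_hall(switch, trials=20000, seed=0):
    """Play the three-door game `trials` times; return the win fraction."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)    # the door hiding the car
        pick = rng.randrange(3)   # the contestant's initial choice
        # the host opens a goat door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # move to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(round(monty_hall(switch=False), 2))  # close to 1/3
print(round(monty_hall(switch=True), 2))   # close to 2/3: switching wins
```

Staying wins only when the initial pick was right (probability 1/3); switching wins in every other case.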
2.7 Exercises
Exercise 2.1 What is the probability that at least 3 people have the same
birthday in a group of 50?
Exercise 2.2 Examine the case of Monty Hall with 4 doors, the host opening one door with a goat, and leaving you with a choice of 3. Should you switch? Does it matter which of the other two you choose?
Exercise 2.3 What is the probability of rolling various sums from two
9-sided dice?
Exercise 2.4 What is the probability of rolling an odd sum with two dice?
Exercise 2.5 What is the probability of rolling more than 7 from two 20-
sided dice?
Exercise 2.6 Given the table above, determine the following quantities, and
describe what they mean:
2 P (cancer|negative test)
3 P (not cancer)
Which is more likely: the victim of an embolism (clot in the lung) will
experience partial paralysis or that the victim will experience both
partial paralysis and shortness of breath?
and 91 percent of the doctors chose the clot as less likely to cause the rare paralysis alone than to cause the combination of the rare paralysis and the common shortness of breath.
This may not be a failure of reasoning, but a (correct!) failure of
the doctors to translate the English language literally into logical
language. It is likely that when doctors are asked: “Which is more
likely: that the victim of an embolism will experience partial paral-
ysis or that the victim will experience both partial paralysis and
shortness of breath?” they interpret it as:
The doctors are separating the analysis of the claim of the clot,
which is given information, from the other claims. Another way of
looking at it is to include the knowledge of the method of reporting.
Someone who is reporting information about an ailment will tend to
report all of the information accessible to them. By reporting only the
paralysis, there are two possibilities concerning the person measuring
the symptoms of the patient:
1 they had the means to measure shortness of breath in the patient, but there was none
Diverging Opinions
Is it possible for people informed by the same information, and reasoning properly, to have diverging opinions? It might seem intuitive that people given the same information, reasoning properly, would tend to come to agreement; however, this is not always the case. What is interesting is that it turns on the prior probabilities for the claims. This example comes from Jaynes, 2003⁵. We have the following piece of information:

D := "Mr N. has gone on TV with a sensational claim that a commonly used drug is unsafe"

⁵ E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, 2003. Edited by G. Larry Bretthorst.
and we have observers A, B, and C with different prior assignments
to the reliability of Mr N and of the safety of the drug. These prior
assignments may have been the result of previous inference by these
observers, in a different context, or possibly due to expert knowledge.
Observers A and C believe, before the announcement, that the drug
is reasonably safe. Observer B does not. We have the probability
assignments then:
PA (Safe) = 0.9
PB (Safe) = 0.1
PC (Safe) = 0.9
They all agree that if the drug is not safe, then Mr N would an-
nounce it, so we have
PA ( D |not Safe) = 1
PB ( D |not Safe) = 1
PC ( D |not Safe) = 1
Finally, we have the perceptions from the observers about the reli-
ability of Mr N if the drug is actually safe. In this case, observer A is
trusting of Mr N, observer C is strongly distrustful, and observer B is
mildly distrustful. By “distrustful” we are referring to the probabili-
ties that Mr N would make the announcement that the drug is unsafe
even if the drug were actually safe. So we have
PA ( D |Safe) = 0.01
PB ( D |Safe) = 0.3
PC ( D |Safe) = 0.99
PA(Safe|D) = PA(D|Safe) PA(Safe) / [PA(D|Safe) PA(Safe) + PA(D|not Safe) PA(not Safe)]
           = (0.01 · 0.9) / (0.01 · 0.9 + 1 · 0.1)
           = 0.083
Following the same calculation for the others, we get the observers
updating their probability assignments after the announcement, D, as
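The same update for all three observers can be scripted; this sketch just evaluates Bayes' Rule with the assignments above:

```python
def posterior_safe(prior_safe, p_d_given_safe, p_d_given_unsafe=1.0):
    """P(Safe | D) by Bayes' Rule, where D is Mr N's announcement."""
    numerator = p_d_given_safe * prior_safe
    evidence = numerator + p_d_given_unsafe * (1 - prior_safe)
    return numerator / evidence

# (prior P(Safe), P(D|Safe)) for each observer, from the text
observers = {"A": (0.9, 0.01), "B": (0.1, 0.3), "C": (0.9, 0.99)}
for name, (prior, p_d_safe) in observers.items():
    print(name, round(posterior_safe(prior, p_d_safe), 3))
```

Observer A's belief in the drug's safety collapses from 0.9 to about 0.083, B's drops from 0.1 to about 0.032, while C's barely moves (about 0.899): the same announcement drives A and C apart, because of their differing priors about Mr N.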
A problem of independence
As said in the beginning of Chapter 1 (Introduction to Probability),
in 1968 a jury found defendant Malcolm Ricardo Collins and his wife
He then followed with the calculation applying the product rule for
independent events (Section 1.4 on page 43), to find the probability
that all these things could have been observed:
(1/10) × (1/4) × (1/10) × (1/3) × (1/10) × (1/1000) = 1/12,000,000
The initial conviction was overturned for two primary reasons,
one legal and one mathematical. The legal argument was that the
prosecution had not established that these initial probabilities were
supported by the evidence. However, the really devastating part of
the argument was mathematical. As you may recall, the product rule
used in this way assumes the independence of the terms (Section 1.4 on page 43).
For an example, the proper product rule for two of the terms
above would look like:
What the prosecutor was assuming is that these two items were
independent, from which it would follow that
What he was doing was equating the following in the product rule (Section 1.4 on page 43):
P (second child dying of SIDS|first child dying of SIDS) = P (second child dying of SIDS)
which is equivalent to saying
Knowing that the child dies of a [not well understood] disease tells us
nothing about the probability of the second child dying of the same [not
well understood] disease.
Prosecutor’s Fallacy
Both of the cases above are examples of what is called the prosecu-
tor’s fallacy. It occurs when someone assumes that the prior prob-
ability of an event is equal to the probability that the defendant is
innocent. A simple example is that “if a perpetrator is known to have
the same blood type as a defendant and 10% of the population share
that blood type; then to argue on that basis alone that the probability
of the defendant being guilty is 90% makes the prosecutor's fallacy, in a very simple form."⁸
P (innocence|evidence)
P (evidence)
Coin Flips
from sie import *
[1 0 0 1 0 0 0 1 0 0]
Generate a slightly larger list of data...
data = randint(2, size=30)
print data
[1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 0 1 1 0 0]
data = randint(2, size=(2000, 10))
data
9988
array([1011, 1010, 1001, 1051, 1001, 1008, 962, 990, 976, 978])
h = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# or...
h = arange(0, 11)
(recall that ** is exponentiation in Python, because the caret (ˆ) was already used for a computer-
sciency role.) The spaces in the equation below are not needed, but highlight the three parts of the
binomial distribution.
p = nchoosek(10, h) * 0.5**h * 0.5**(10 - h)
hist(N, countbins(10), normed=True)
plot(h, p, '--o')
xlabel('Number of Heads, $h$')
ylabel('$p(h|N=10)$')
2 Show in a simulation that this matches these probabilities you just found.
3 Random Sequences and Visualization

Now that we understand the rules of probability, and how they are applied in a number of practical examples, we explore the use of these rules in sequences of random events. This will produce several interesting and unintuitive observations, failures of inference, and the proper ways to handle them. Finally, we examine both how to visualize data in general and what we can communicate with such visualization.
We’ll start with some simple examples of coin flipping, asking some
simple questions, and move to more complex observations and unin-
tuitive conclusions.
Example 3.1 What is the probability of flipping three heads in a row, with
a fair coin?
We can approach this problem in two different ways. The first way is a brute-force counting method with the definition of probability for exclusive events (using Equation 1.2), and the second way makes use of the other rules of probability. In the first way, we simply outline every possible combination of three flips, and see how many are "three heads in a row":

1 TTT
2 TTH
3 THT
4 THH
5 HTT
6 HTH
7 HHT
8 HHH

Because there is only one case of "H H H" in all eight, the probability of three heads in a row is

P(three heads in a row) = 1/8

which is an unlikely outcome, but not extremely so (see Table 1.1 on page 51).
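The brute-force count can also be done in a couple of lines; this sketch enumerates all eight outcomes:

```python
from itertools import product

# All 2**3 equally likely outcomes of three coin flips
outcomes = list(product("HT", repeat=3))
print(len(outcomes))  # 8

# Only one of them is "three heads in a row"
p_three_heads = sum(o == ("H", "H", "H") for o in outcomes) / len(outcomes)
print(p_three_heads)  # 0.125, i.e. 1/8
```

The same enumeration works for any small number of flips, before the counting gets too large to do by hand.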
In terms of the rules of probability, we have (Equation 1.9)
Example 3.3 What is the probability of flipping two heads in three flips,
with a fair coin?
from which we can apply the sum rule for exclusive events (Equa-
tion 1.11) and, like before, the product rule for independent events
(Equation 1.9),
2 How many ways can this type of sequence appear in the process
described in the question?
Point 1 is asking, what is the probability of this particular se-
quence:
HHHHHHHHHHTTTTTTTTTTTTTTTTTTTT
or this sequence:
TTHTTTTHHHHTTTTTHTTHTTTTTTHHTH
Although it is unintuitive, mathematically both of these specific se-
quences have exactly the same probability: each head or tail has equal
probability, is not related to the others, and there are the same num-
ber of them. So we have
P(HHHHHHHHHHTTTTTTTTTTTTTTTTTTTT) = P(TTHTTTTHHHHTTTTTHTTHTTTTTTHHTH)
                                  = (1/2)^30
                                  = 0.000000001 (one in a billion!)
Every single specific length-thirty sequence of heads and tails has the
same probability, one in a billion.
Point 2 is asking, how many sequences are there of thirty heads
and tails where ten of them are heads? Another way of phrasing it is,
given a sequence like:
HHHHHHHHHHTTTTTTTTTTTTTTTTTTTT
how many different ways can I rearrange this sequence and get a
unique sequence?
Example 3.5 How many ways can we rearrange the unique symbols A, B, C, and D?

To make this intuitive, we set up four empty boxes and we imagine placing our symbols in the boxes, one at a time. How many choices do we have? For the first box, we have four choices. For each of these choices, we've removed one of the symbols, and one of the boxes:

Choices    Remaining Symbols
A          BCD
B          ACD
C          ABD
D          ABC

Thus, we are left with three remaining symbols for each choice, and three remaining boxes. For each of the original four choices, we now have three choices for the second box. This immediately leads to 4 × 3 = 12 possibilities by the time we've filled two boxes:

Choices    Remaining Symbols
AB         CD
AC         BD
AD         BC
BA         CD
BC         AD
BD         AC
CA         BD
CB         AD
CD         AB
DA         BC
DB         AC
DC         AB

For each of these twelve possibilities, there are two symbols remaining and two boxes. Continuing this logic, we have two choices for the third box, and then only one choice for the final box. In summary, for each of the four choices for the first box we have three choices for the second, two choices for the third, and one for the final box. Thus we have

(number of rearrangements of four different symbols) = 4 × 3 × 2 × 1 = 24

In general we have

Number of Rearrangements of N Unique Symbols

C(N) = N × (N − 1) × ··· × 2 × 1 = N!    (3.1)
Example 3.6 How many ways can we rearrange the symbols A, A, A, and D?

By eye we can see that there are only four rearrangements of these symbols:

DAAA
ADAA
AADA
AAAD

How is this different from the previous question with four unique symbols? We can imagine going from the first question, with four unique symbols "A B C D," and replacing both "B" and "C" with "A" to get it. "BC" and "CB" are different sequences of unique symbols. However, if we replace "B" with an "A" and "C" with an "A", both sequences become the same sequence, namely "AA". If we try to blindly apply Equation 3.1, the one for the number of rearrangements of unique symbols, to the case where there are duplicates, we will overestimate the number of rearrangements - we are overcounting duplicate subsequences. Further, we can be specific about how much we are overcounting, and thus find a new equation which includes the possibility of duplicates.

For example, if we have three duplicates in a sequence, the number of overcountings will be the number of possible rearrangements of three unique symbols, because all of these rearrangements result in the same sequence of duplicate symbols. Thus, our procedure should be to take the number of rearrangements assuming all symbols are unique, and divide by the number of rearrangements of each set of duplicates.
Example 3.7 How many ways are there of rearranging the symbols "A A A D D"?

There are 5! ways of rearranging 5 unique symbols, 3! ways of rearranging the 3 duplicate "A"s, and 2! ways of rearranging the 2 duplicate "D"s, so

(number of rearrangements of "A A A D D") = 5! / (3! 2!)
    = (5 × 4 × 3 × 2 × 1) / ((3 × 2 × 1) × (2 × 1))
    = 120 / (6 × 2)
    = 10

All possible results of rearranging the symbols "A A A D D":

1 AADDA
2 DAADA
3 ADADA
4 DAAAD
5 DADAA
6 AADAD
7 DDAAA
8 ADDAA
9 AAADD
10 ADAAD
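The counting rule can be checked by brute force; the helper below (our own) divides N! by a factorial for each group of duplicates, and an enumeration confirms the count of 10:

```python
from math import factorial
from itertools import permutations

def rearrangements(symbols):
    """Distinct rearrangements of a sequence that may contain duplicates:
    N! divided by k! for each symbol that appears k times."""
    count = factorial(len(symbols))
    for s in set(symbols):
        count //= factorial(symbols.count(s))
    return count

print(rearrangements("ABCD"))   # 24: four unique symbols, 4!
print(rearrangements("AAAD"))   # 4
print(rearrangements("AAADD"))  # 10
# brute-force check: generate all orderings and deduplicate
print(len(set(permutations("AAADD"))))  # 10
```

The brute-force check stops being feasible quickly (30 symbols have far too many orderings to list), which is exactly why the formula matters.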
Example 3.8 What is the probability of flipping ten heads in thirty flips, with a fair coin?

P(h = 10, N = 30) = P(one sequence of 10 heads and 20 tails) × (number of rearrangements of a length-30 sequence with 10 "H" and 20 "T")

1 What is the probability of one particular sequence being considered?

P(one sequence of 10 heads and 20 tails) = (1/2)^10 × (1/2)^20
                                         = (1/2)^30
                                         = 0.00000000093 (one in a billion!)

2 How many ways can this type of sequence appear in the process described in the question?

Because we have a length-thirty sequence of "H" and "T" with 10 duplicate "H" symbols and 20 duplicate "T," we have the following number of ways that this could occur (i.e. the number of rearrangements of these sequences):

(number of rearrangements of a length-30 sequence with 10 "H" and 20 "T") = 30! / (10! 20!)
                                                                          = 30045015
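The two pieces can be combined in a short sketch:

```python
from math import comb

# Number of rearrangements of a length-30 sequence with 10 H and 20 T
n_arrangements = comb(30, 10)
print(n_arrangements)  # 30045015

# Each specific length-30 sequence has probability (1/2)**30,
# so the probability of ten heads is the product of the two
p_ten_heads = n_arrangements * 0.5**30
print(round(p_ten_heads, 3))  # 0.028
```

A one-in-a-billion sequence probability, multiplied by some thirty million rearrangements, gives a quite ordinary probability of about 3%.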
Probability of flipping h heads and t tails: Given the probability of flipping a single heads as 1/2, and the total number of flips as N = h + t, we have the following equivalent forms:

P(h, t) = (h + t)!/(h! t!) × (1/2)^h × (1/2)^t    (3.2)

P(h, N) = N!/(h! (N − h)!) × (1/2)^h × (1/2)^(N−h)

P(h, N) = (N choose h) × (1/2)^h × (1/2)^(N−h)
[Plot of P(h, N = 30) versus the number of heads.]
P(h|N, p) = N!/(h! (N − h)!) × p^h × (1 − p)^(N−h)    (3.3)
Probability of flipping h heads and t tails with an unfair coin: Given the probability of flipping a single heads as, say, p, and the total number of flips as N = h + t, we have the following equivalent forms:

P(h, t) = (h + t)!/(h! t!) × p^h × (1 − p)^t    (3.4)

P(h, N) = N!/(h! (N − h)!) × p^h × (1 − p)^(N−h)

P(h, N) = (N choose h) × p^h × (1 − p)^(N−h)
Streaks
In the previous section we looked at the probability of getting a cer-
tain number of heads in a number of flips. Look at the following two
sequences:
1 HTTHTHHTTHTHTTHHHTHHTTHHTHHTTHTHHTHHTTHTTHHHTHTHTT
2 HHTHHHTTTTTTTHTHTTHTTTHTHTHHTHTTHTTTHHTTTHHHHTHHHH
writing down a sequence that they thought would look like a random
flipping of a coin. Which one is which? While many people think
that sequence 1 looks more “random” (i.e. it seems to flip around a
lot), sequence 2 is actually the random sequence.
One of the truly unintuitive things about real random sequences,
as opposed to designed sequences, is that there are long runs or
streaks. Why is this? The general solution is beyond this book, but we can think about it this way. Although a sequence of, say, 5 heads in a row is very unlikely (P(5 heads in a row) = (1/2)^5 ≈ 0.03), there are many opportunities for such a sequence somewhere within a sequence of 50. Because of these many opportunities, the probability rises from 3% (the probability of 5 heads in a row in 5 flips) to over 55%, the probability of finding 5 heads in a row somewhere in 50
flips. Streaks of 6 heads in a row occur nearly one third of the time in
50 flips, or over half the time if you consider a run to be either heads
or tails. Even streaks of 9 heads or tails in a row, in 50 flips, are not
extremely unlikely!
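A simulation bears this out; the sketch below estimates the chance of a streak of 5 heads somewhere in 50 flips (the trial count and seed are arbitrary choices of ours):

```python
import random

def has_heads_run(flips, length):
    """True if `flips` contains at least `length` consecutive heads (1s)."""
    run = 0
    for f in flips:
        run = run + 1 if f == 1 else 0
        if run >= length:
            return True
    return False

def prob_heads_streak(length, n_flips=50, trials=40000, seed=1):
    """Monte Carlo estimate of P(a run of `length` heads in n_flips)."""
    rng = random.Random(seed)
    hits = sum(
        has_heads_run([rng.randrange(2) for _ in range(n_flips)], length)
        for _ in range(trials))
    return hits / trials

print(round(prob_heads_streak(5), 2))  # a bit over 0.55, as claimed above
```

Changing `length` and `n_flips` lets you check the other claims in this paragraph about streaks of 6 or 9.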
Gambler’s Fallacy
When we look at a sequence of real coin flips, like:
• HHTHHHTTTTTTT
and we ask about the probability of flipping heads in the next flip, it
is common to (mistakenly!) reason that, because we’ve seen 7 tails in
84 statistical inference for everyone
a row, then the next flip is more likely to be heads. However, this is
not the case for two reasons:
comparison, but roughly, one would have to look at all pairs of events
to see if one pair (say heads-tails) occurs more frequently (even if
only by a little) than another pair (say heads-heads).
In a total fit of irony, casino slot machines do not produce indepen-
dent winnings - they are programmed so that if you’ve lost many
times, then that machine is a little less likely to lose the next time. In
effect, at gambling houses they train the gamblers in the Gambler’s
Fallacy!
people, namely that they tend to ’detect’ patterns even where none
exist.”
What we have here, again, is the general perception that long se-
quences are somehow not “random,” when in fact the opposite is the
case. People have a natural tendency to see patterns in random data,
to infer order where there is none, and to ascribe importance to the
appearance of pattern. It is the role of statistical inference in gen-
eral to provide the tools to properly handle the distinction between
random effects and patterns, and to retune our intuitions.
1 Those that did the best the first time did worse the second (on
average)
2 Those that did the worst the first time did better the second (on
average)
One might be tempted (had you not known that this is artificial data,
and completely random) to interpret this as a causal pattern, e.g.
“the students that did better the first time, grew over-confident the
second time,” “the students that did worse the first time, worked
harder to improve the second time," etc... This interpretation of the results by students has been observed in the classroom.³ However, it runs into serious trouble when the data is something like the heights of children compared to their parents - the tallest parents tend to have children shorter than they are, the shortest parents tend to have children taller than they are, a pattern first quantified by Galton in 1869⁴. He noted that clearly the children are not trying to be tall, so
1 even when the process is entirely random, long streaks occur - and
are often misinterpreted as an increase in the probability of the
event.
3 when one has a particularly bad winter, it may be more likely that
the next winter won’t be quite do bad - due entirely to regression
to the mean. It may, however, be part of a larger pattern (e.g. a
large-scale climate oscillation, such as El Niño) and the probability
of another bad winter might be higher. In order to tell the differ-
ence, we need to construct reasonable models of the phenomena,
test those models with predictions, and apply those models into
the future. At each step, we need to be careful not to jump to the
conclusion of the existence of a pattern too quickly.
There are two main methods of visualizing data, and several others
that are related to these methods. In this section we introduce just
two, histograms and scatter plots, and we will use these throughout
the text.
Histograms
Histograms are a way of summarizing data, when presenting the
entire data set is impractical, or where some understanding of the
(Figure: histogram of the height data, with Height [cm] from 150 to 210 on the horizontal axis and Number of People on the vertical axis.)
1 The average value (around the middle) should be around 175 cm.
The actual value can be calculated from the data.
2 The range of the data is around 155 cm up to about 205 cm. Again
we can be more precise, and find the minimum of the data (154.94
cm) and the maximum (200 cm) but the histogram picture yields
an approximate value instantly.
3 The values are roughly symmetric about the mean (i.e. average)
value. This can give us a clue concerning how to model the data.
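These summary values need very little machinery. The following sketch uses only the standard library, with simulated heights standing in for the real data set (the Normal(175, 8) shape is an assumption chosen only to mimic the figure):

```python
import random
import statistics

# Simulated heights standing in for the real data set;
# Normal(175, 8) is an assumed shape chosen to mimic the figure.
random.seed(0)
heights = [random.gauss(175, 8) for _ in range(100)]

print("mean:", statistics.mean(heights))  # middle of the histogram
print("min: ", min(heights))              # left edge of the range
print("max: ", max(heights))              # right edge of the range

# A crude text histogram with 6 bins between 150 and 210 cm
lo, hi, nbins = 150, 210, 6
width = (hi - lo) / nbins
counts = [0] * nbins
for h in heights:
    if lo <= h < hi:
        counts[int((h - lo) // width)] += 1
for i, c in enumerate(counts):
    print(f"{lo + i * width:5.0f}-{lo + (i + 1) * width:5.0f} cm: {'*' * c}")
```

The mean, minimum, and maximum computed here are exactly the three observations made above from the picture.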
Too Few Bins Plotting the same histogram with too few bins might
look like:
(Figure: the same height data plotted with too few bins; Height [cm] on the horizontal axis and Number of People on the vertical axis.)
Too Many Bins Plotting the same histogram with too many bins
might look like:
(Figure: the same height data plotted with too many bins; Height [cm] on the horizontal axis and Number of People on the vertical axis.)
Scatter Plots
(Figure: histogram of Writing Hand Span [cm], from 16 to 24, with Number of People on the vertical axis.)
(Figure: scatter plot of Writing Hand Span [cm] versus Height [cm].)
Histograms
from sie import *
Load a sample data set, and select only the Male data...
select only the height data, and drop the missing data (na)...
hist(male_height, bins=20)
xlabel('Height [cm]')
ylabel('Number of People')
Scatter Plot
from sie import *
Load a sample data set, and select only the Male data...
random sequences and visualization 93
select only the height and the width of writing hand data, and drop the missing data (na)...
plot(height, wr_hand, 'o')
ylabel('Writing Hand Span [cm]')
xlabel('Height [cm]')
4 Introduction to Model Comparison
P(model|data) (4.1)
reasonable before we use the same math in areas where our intuition
is not as strong. Imagine we draw only one card, and it is a 9. Intu-
ition suggests that this constitutes reasonably strong evidence toward
the belief that we’re holding the High Deck. If we then (as the pro-
cedure states) place the 9 back in the deck, reshuffle and then draw
a 7 we can be more strongly convinced that we are holding the High
Deck. Repeating the reshuffle, and then drawing a 3 would make
us a little less confident in this conclusion, but still quite certain. In
this way we can sense how drawing different cards pushes our belief
around, depending on how often that card comes up in the different
decks.
P(data = 9| H )
P( H |data = 9)
P( L|data = 9)
which are related to the prior and the likelihood via Bayes’ Rule (Equa-
tion 1.14):
P( H |data = 9) = P(data = 9| H ) P( H ) / P(data = 9)
P( L|data = 9) = P(data = 9| L) P( L) / P(data = 9)
To calculate actual numbers, we apply the Bayes’ Recipe to this
problem,
P( H ) = 0.5
P( L) = 0.5
2 Write the top of Bayes’ Rule for all models being considered
P( H |data = 9) ∼ P(data = 9| H ) P( H )
P( L|data = 9) ∼ P(data = 9| L) P( L)
introduction to model comparison 99
5 Divide each of the values by this sum, K, to get the final probabili-
ties
and data is

data ≡ "We've drawn one card, and it is a 9, replaced and reshuffled, and then drawn a 7"
P( H |data = 9 then a 7)
P( L|data = 9 then a 7)
which are related to the prior and the likelihood via Bayes’ Rule (Equa-
tion 1.14):
P( H |data = 9 then a 7) = P(data = 9 then a 7| H ) P( H ) / P(data = 9 then a 7)
P( L|data = 9 then a 7) = P(data = 9 then a 7| L) P( L) / P(data = 9 then a 7)
P( H ) = 0.5
P( L) = 0.5
2 Write the top of Bayes’ Rule for all models being considered
5 Divide each of the values by this sum, K, to get the final probabili-
ties
holding the High Deck. One of the basic tenets of probability the-
ory is that if there is more than one way to arrive at an answer, one
should arrive at the same answer.5 In the above, we calculated the
probability of holding the High Deck given the observed data

data ≡ "We've drawn one card, and it is a 9, replaced and reshuffled, and then drawn a 7"

and prior information

prior ≡ "We know there are only two decks."

5 E. T. Jaynes uses the principle that "if there is more than one way to arrive at an answer, one should arrive at the same answer" to help derive the rules of probability from first principles. Failures of this principle result in paradoxes. This principle is also applied in Section 2.4 for the birthday problem.
P( H |data = 9) = 0.82
P( L|data = 9) = 0.18
2 Write the top of Bayes’ Rule for all models being considered
5 Divide each of the values by this sum, K, to get the final probabili-
ties
P( H |data = 9 then a 7) = 0.104/0.117 = 0.889
P( L|data = 9 then a 7) = 0.013/0.117 = 0.111
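These updated probabilities can be checked with a short script. The deck composition here is taken from Section 4.1: the card value n appears n times in the 55-card High Deck and 11 − n times in the Low Deck, so for example P(9| H ) = 9/55 and P(9| L) = 2/55:

```python
# Likelihood of drawing card value n from each 55-card deck
def p_card_high(n):
    return n / 55.0         # High Deck: value n appears n times

def p_card_low(n):
    return (11 - n) / 55.0  # Low Deck: value n appears 11 - n times

prior_H = prior_L = 0.5

# Independent draws (card replaced and reshuffled): multiply likelihoods
draws = [9, 7]
top_H = prior_H
top_L = prior_L
for n in draws:
    top_H *= p_card_high(n)
    top_L *= p_card_low(n)

K = top_H + top_L
post_H = top_H / K
post_L = top_L / K
print(post_H, post_L)  # roughly 0.887 and 0.113
```

The result matches the 0.889 above up to the rounding of the intermediate values (which differ from the text only by an overall constant that cancels when dividing by K).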
2 Write the top of Bayes’ Rule for all models being considered
P( H |data = 5 9’s in a row) ∼ P(data = 5 9’s in a row| H ) P( H )
P( L|data = 5 9’s in a row) ∼ P(data = 5 9’s in a row| L) P( L)
5 Divide each of the values by this sum, K, to get the final probabili-
ties
P( H |data = 5 9's in a row) = 0.0000587/0.0000587318 = 0.99946
P( L|data = 5 9's in a row) = 0.0000000318/0.0000587318 = 0.00054
Example 4.3 What is the probability that you are holding one of either the
High or the Low Deck having drawn m 9’s in a row from that deck, where m
stands for a number (m = 1, 2, 3, · · ·)?
P( H ) = 0.5
P( L) = 0.5
2 Write the top of Bayes’ Rule for all models being considered
5 Divide each of the values by this sum, K, to get the final proba-
bilities This step is easiest done in a table (Table 4.1), because the
resulting expression is pretty messy.
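The table can be generated with a one-line function, again assuming (as in the deck composition of Section 4.1) that a 9 is drawn with probability 9/55 from the High Deck and 2/55 from the Low Deck:

```python
def posterior_high_after_m_nines(m):
    """Posterior P(H | m 9's in a row), starting from equal priors."""
    top_H = 0.5 * (9 / 55) ** m  # 9 appears 9 times in the High Deck
    top_L = 0.5 * (2 / 55) ** m  # ...but only 2 times in the Low Deck
    return top_H / (top_H + top_L)

for m in range(1, 6):
    print(m, round(posterior_high_after_m_nines(m), 5))
```

At m = 5 this gives 0.99946, the value computed above; each extra 9 multiplies the odds for the High Deck by 9/2.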
It is clear from Table 4.1 that after drawing five 9’s using our pro-
cedure, it should be extraordinarily likely that we are holding the
High Deck. However, after a certain number of 9’s observed, some-
thing starts to bother us. Perhaps not after five 9’s, but what if the
procedure were repeated and we drew ten 9’s in a row? Or perhaps
twenty 9’s. At some point, we’d refuse to believe this is the High
Deck because, although it was true that there are more 9’s in the
High Deck, there are many more other cards in the High Deck that we
should see. What do we do in this case?
Example 4.4 What is the probability that you are holding one of either the
High, Low, or Nines Deck having drawn m 9’s in a row from that deck?
2 Write the top of Bayes’ Rule for all models being considered
5 Divide each of the values by this sum, K, to get the final proba-
bilities Again, this step is easiest done in a table or, even better, a
picture (Figure 4.3).
(Figure 4.3: the probability of each model versus the number of 9's drawn in a row.)
the Nines deck becomes more likely. Eventually, this new model is
the one in which we are the most confident.
Imagine further that if, after drawing ten 9’s in a row we draw a 1.
What do we do then? The likelihood for the Nines deck goes to zero
instantly - the probability of drawing a 1 from a Nines deck is zero,
P(1| N ) = 0. Are we left again with the original two models, High
and Low Deck? No! We would then introduce other models, perhaps
something like a Mostly Nines Deck, or perhaps a High Deck with a
weird shuffling procedure, or perhaps others. No matter how many
models one has, the recipe is still the same. It is important to realize
that in any model comparison case, there are always other models
that could be brought to bear on the problem, perhaps with low prior
probability. Simply showing that a model is consistent with a set of
data does not ensure against the possibility that another model could
be better, if we could only think of it.

The creative part of science is not in the calculations performed, but in the generation of new and useful models. Until we come up with a better model for our data we make do with the ones that we have, all the while being aware that a better model may come into play later. Newton's Theory of Gravity was used for over 200 years, even when there was known data that made it less likely, until it was replaced by Einstein's Theory of Gravity. Newton's Laws, however, are still used in nearly all gravitational calculations because it is "good enough" and is a lot easier to work with practically.

Exercise 4.1 Complete the example demonstrating the updated probabilities for the High and Low Deck, having drawn a 9, 7, and a 3. Compare with the case of drawing just the 9 and the 7, and discuss how it matches your intuition.

Exercise 4.2 Repeat the analysis of the sequence of 9's drawn in a row
with an added hypothesis of a deck with one hundred 9’s and one 8. Discuss
the results. Demonstrate what happens to the probabilities for all of the
hypotheses after drawing one 8, after ten 9’s in a row. Discuss.
Exercise 4.3 I tell you that I have a coin that could have both sides heads,
both sides tails, or a normal single-heads single-tails coin.
1 Before seeing the data, what would be a reasonable prior probability for
the three hypotheses H0 (no-heads), H1 (one head), and H2 (two heads)?
2 Would this have been different if you had simply been given a coin by a
friend to flip to see who has to do the dishes? Why or why not?
3 Now I flip the coin once, and get a heads. Write down the likelihood of
this data given each of the models. In other words, what are the values of:
• P (data=1 heads| H0 )
• P (data=1 heads| H1 )
• P (data=1 heads| H2 )
4 Apply Bayes’ Recipe, and determine the probability of each of these three
models given this data. In other words, what are the values of:
• P ( H0 |data=1 heads)
• P ( H1 |data=1 heads)
• P ( H2 |data=1 heads)
P (disease) = 1/1,000,000
P (no disease) = 999,999/1,000,000
2 Write the top of Bayes’ Rule for all models being considered
The top of Bayes’ Rule comes down to, given the truth of the
model (i.e. either with or without the disease), what is the proba-
bility of getting the data (i.e. the positive or negative test result).
This is measured by how good the test is.

P (positive test|disease) = 0.999

and

P (positive test|no disease) = 0.001

In many medical applications, the false positive rate (P (positive test|no disease)) is not always equal to the false negative rate (P (negative test|disease)), so to say that a test is 99.9% accurate is actually incomplete - one needs to specify both rates of effectiveness. In this case, we are assuming that they are the same.
So the top of Bayes' Rule for both models looks like:
4 Divide each of the values by this sum, K, to get the final probabili-
ties
P (disease|positive test) = 9.99 · 10−7/0.000999999 = 0.1%
P (no disease|positive test) = 99.9%
less likely) and the false positive rate (the number of healthy people
who test positive anyway). This will vary depending on the disease
and the test, but can lead to this unintuitive result, and thus can lead
one to make poor medical decisions.
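The whole calculation above fits in a few lines; a minimal sketch:

```python
# Bayes' recipe for the rare-disease test: a tiny prior, and a test
# assumed (as in the text) to have equal 0.1% false-positive and
# false-negative rates.
p_disease = 1 / 1_000_000
p_no_disease = 999_999 / 1_000_000

p_pos_given_disease = 0.999
p_pos_given_healthy = 0.001

top_disease = p_pos_given_disease * p_disease
top_healthy = p_pos_given_healthy * p_no_disease
K = top_disease + top_healthy

p_disease_given_pos = top_disease / K
print(p_disease_given_pos)  # about 0.001, i.e. one tenth of one percent
```

Changing `p_disease` or the two error rates shows directly how the unintuitive result depends on the rarity of the disease and the false positive rate.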
The question is, which source can we trust the most? Here we follow
Bayes’ recipe,
P( A) = P( B) = P(C ) = 1/3
• Write the top of Bayes’ Rule (i.e. likelihood × prior) for all models
being considered
P( A| R = 3, N = 17) ∼ (17 choose 3) 0.28^3 (1 − 0.28)^(17−3) × 1/3
P( B| R = 3, N = 17) ∼ (17 choose 3) 0.20^3 (1 − 0.20)^(17−3) × 1/3
P(C | R = 3, N = 17) ∼ (17 choose 3) 0.13^3 (1 − 0.13)^(17−3) × 1/3
P( A| R = 3, N = 17) ∼ 0.05006
P( B| R = 3, N = 17) ∼ 0.07975
P(C | R = 3, N = 17) ∼ 0.07087
K = 0.05006 + 0.07975 + 0.07087 = 0.20068
applications of model comparison 113
• Divide each of the values by this sum, K, to get the final probabili-
ties
Again, we follow the same recipe, starting with our posterior probabilities from above as our starting prior probabilities - they are prior to the new data.
• Write the top of Bayes’ Rule (i.e. likelihood × prior) for all models
being considered
P( A| G = 5, N = 16 and old data) ∼ (16 choose 5) 0.20^5 (1 − 0.20)^(16−5) × 0.250
P( B| G = 5, N = 16 and old data) ∼ (16 choose 5) 0.10^5 (1 − 0.10)^(16−5) × 0.397
P(C | G = 5, N = 16 and old data) ∼ (16 choose 5) 0.21^5 (1 − 0.21)^(16−5) × 0.353
P( A|data) ∼ 0.0300
P( B|data) ∼ 0.00544
P(C |data) ∼ 0.0471
K = 0.0300 + 0.00544 + 0.0471 = 0.08254
• Divide each of the values by this sum, K, to get the final probabili-
ties
Given this new data, we update our state of knowledge, and we’re
much more confident that Source C is the best one. It is clear that
Source B is unlikely, with a probability of only about 6.5%. We could
extend this example with more data, and more models if we’d like.
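The two-stage update can be scripted directly; the red and green rates for the three sources are the ones used in the equations above:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

rates_red = {"A": 0.28, "B": 0.20, "C": 0.13}    # each source's rate for red
rates_green = {"A": 0.20, "B": 0.10, "C": 0.21}  # ...and for green

# First data set: R = 3 red out of N = 17, equal priors of 1/3
prior = {s: 1 / 3 for s in "ABC"}
top = {s: binom_pmf(3, 17, rates_red[s]) * prior[s] for s in "ABC"}
K = sum(top.values())
posterior = {s: top[s] / K for s in "ABC"}
print(posterior)  # roughly A: 0.25, B: 0.40, C: 0.35

# Second data set: G = 5 green out of N = 16, old posterior as new prior
top = {s: binom_pmf(5, 16, rates_green[s]) * posterior[s] for s in "ABC"}
K = sum(top.values())
posterior = {s: top[s] / K for s in "ABC"}
print(posterior)  # C is now clearly the most probable source
```

Adding a third round of data, or a fourth source, is just another pass through the same loop.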
P(psychic|data)
H := {Paul is psychic}
R := {Paul is completely random, like a coin flip}
5 Divide each of the values by this sum, K, to get the final probabili-
ties
P( H |data) = 0.00257/0.00806 = 0.32
P( R|data) = 0.00549/0.00806 = 0.68
and the psychic loses! We continue this problem discussing the po-
tential anti-psychic bias in the presentation of the problem.
2 Write the top of Bayes’ Rule for all models being considered
5 Divide each of the values by this sum, K, to get the final probabili-
ties
P (car 1|you 1, host 2) = 0.5 · 0.333/0.5 = 0.333
P (car 2|you 1, host 2) = 0 · 0.333/0.5 = 0
P (car 3|you 1, host 2) = 1 · 0.333/0.5 = 0.666
Thus, in this case, given that you choose door 1 and the host
chooses 2, the probability that the car is behind door 1 (your door)
is 0.333 and the other door (door 3) is 0.666. Following the same steps
through the other cases, we get in summary
Probability of...
Your Choice Host Choice Car Behind 1 Car Behind 2 Car Behind 3
1 1 (host can’t open your door)
1 2 0.333 0 0.666
1 3 0.333 0.666 0
2 1 0 0.333 0.666
2 2 (host can’t open your door)
2 3 0.666 0.333 0
3 1 0 0.666 0.333
3 2 0.666 0 0.333
3 3 (host can’t open your door)
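The table can also be checked by simulation, under the standard assumptions that the car is placed at random and the host always opens a losing door other than yours:

```python
import random

def play(switch, rng):
    """One round of the game; returns True if you win the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    you = rng.choice(doors)
    # Host opens a door that is neither yours nor the car's
    host = rng.choice([d for d in doors if d != you and d != car])
    if switch:
        you = next(d for d in doors if d != you and d != host)
    return you == car

rng = random.Random(42)
n = 100_000
wins_switch = sum(play(True, rng) for _ in range(n))
wins_stay = sum(play(False, rng) for _ in range(n))
print(wins_switch / n, wins_stay / n)  # near 2/3 and 1/3
```

Switching wins about twice as often as staying, exactly as the table of posterior probabilities predicts.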
Table 6.1: Probabilities for flipping heads given a collection of bent coins.

Coin Number    Probability for Flipping Heads (P (heads))
0              0.0
1 0.1
2 0.2
3 0.3
4 0.4
5 0.5
6 0.6
7 0.7
8 0.8
9 0.9
10 1.0
The way we’ve set up this problem is exactly like the model com-
parison example with the High and Low Deck (Section 4.1), except in
this case we have 11 models (one for each coin). Applying the Bayes’
Recipe we have
P( M0 ) = 1/11
P( M1 ) = 1/11
...
P( M10 ) = 1/11
2 Write the top of Bayes’ Rule for all models being considered:
3 Put in the likelihood and prior values. Here we are drawing from
a binomial distribution for the likelihood:
P( M0 |data = 9T, 3H ) ∼ (12 choose 3) 0.0^3 × (1 − 0.0)^9 × 1/11
P( M1 |data = 9T, 3H ) ∼ (12 choose 3) 0.1^3 × (1 − 0.1)^9 × 1/11
...
P( M10 |data = 9T, 3H ) ∼ (12 choose 3) 1.0^3 × (1 − 1.0)^9 × 1/11
5 Divide each of the values by this sum, K, to get the final probabili-
ties: see Table 6.2.
When we are dealing with this many models, it is easier to plot the
results, shown in Figure 6.2. We are now in a position to address the
questions posed at the beginning of the section.
introduction to parameter estimation 123
Table 6.2: Probability for different bent-coin models, given the data = 9 tails, 3 heads. The middle column is the non-normalized value from Bayes' Rule, needing to be divided by K (the sum of the middle column) to get the final column, which is the actual probability.

Model    ∼ P( Mi |data = 9T, 3H )    P( Mi |data = 9T, 3H )/K
M0       0.000                       0.000
M1       0.00774                     0.110
M2       0.0214                      0.306
M3       0.0217                      0.310
M4       0.0128                      0.184
M5       0.00488                     0.0696
M6       0.00113                     0.0161
M7       0.000135                    0.00192
M8       0.00000524                  0.0000748
M9       0.0000000145                0.000000208
M10      0.000                       0.000
K = 0.0700
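Table 6.2 is reproduced by a few lines of Python:

```python
from math import comb

# Eleven bent-coin models: coin i flips heads with probability i/10
thetas = [i / 10 for i in range(11)]
prior = 1 / 11
h, n = 3, 12  # observed 3 heads in 12 flips (i.e. 9 tails)

top = [comb(n, h) * t ** h * (1 - t) ** (n - h) * prior for t in thetas]
K = sum(top)
posterior = [t_i / K for t_i in top]

for i, p in enumerate(posterior):
    print(f"M{i}: {p:.3f}")
# The most probable model is M3 (theta = 0.3), as in Table 6.2
```

Note how M0 and M10 are ruled out entirely: a coin that can never (or always) flip heads cannot produce a mixture of heads and tails.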
(Figure 6.2: the posterior probability P(model|data = {9T, 3H}) for each model number 0 through 10, peaking at model M3.)
which says that this coin is “likely” to “very likely” (Table 1.1 on
page 51) to have a probability of yielding heads less than a fair
coin, and thus yield more tails in the future.
2 Because, with distributions, areas under the curve (and not the
values of the distribution itself) are the probabilities, we can
only speak about ranges of values. For example, we can speak
meaningfully about the probability of θ between 0.3 and 0.4 (i.e.
P(0.3 < θ < 0.4)). When we write down something like P(θ ) = 1
we’re not talking about a probability of a single label but rather
the magnitude of the distribution at that label, θ.
1 Specify the prior probabilities for the models being considered:

P(θ ) = 1 .

2 Write the top of Bayes' Rule for all models being considered:
We can write one equation for all of the models labeled by θ at once as

P(θ |data = {9T, 3H}) ∼ θ^3 (1 − θ )^9 × 1

4 Find the area under this curve, and call it K.

5 Divide each of the values of the curve by this area, K, to get the final probabilities, where the area under the curve is 1.

(Margin figures: the uniform prior P(θ ) = 1; the non-normalized curve P(θ |data = {9T, 3H}); and the normalized posterior, each plotted for θ from 0 to 1 with the area under the curve marked.)

Usually these steps are done for you, for a specific data set, and you are given the final posterior distribution to use in answering questions, but it is important to know what assumptions have been made in the choice of models and prior probabilities.
Now we revisit the questions posed in Section 6.1 on page 121 about
the bent coin, this time using the distribution found above, repro-
duced here in Figure 6.6.
1 From this data, which “coin” do I most likely have? (or in this
interpretation, what is my best estimate for the probability of this
coin flipping heads, denoted by θ)
(Figure 6.6, reproduced: the posterior distribution P(θ ) for the bent coin, for θ from 0 to 1.)
(Figure: the same posterior distribution, with a shaded region of area = 0.954.)
6.5 Quartiles
(Figure: the posterior distribution with its median marked at θ = 0.28.)
it a percentile.
Percentiles The term percentile refers to the value of the parameter which results in a particular area under the curve.
For example, we can say from Figure 6.9 that the 99% percentile is
0.59. Thus, it is extremely unlikely to have the coin skewed towards
heads more than θ = 0.59 given the observation that we flipped 3
heads and 9 tails with this coin.
(Figure: the posterior P(θ ) with percentile markers. The percentiles are: θ = 0.20 at 25%, θ = 0.28 at 50%, θ = 0.44 at 90%, θ = 0.49 at 95%, and θ = 0.59 at 99%.)
The Mode Also known as the maximum a-posteriori probability (MAP) estimate, the mode is the maximum of the posterior probability. In the case of a Beta distribution with h successes in N trials, we have

θ̂mode = h/N
The Mean Also known as the expected value or average value, the mean of a distribution of a parameter θ is defined to be the sum of all of the possible values of θ times the posterior probability of θ,

θ̂mean = ∑θ θ × P(θ |data)

It is one measure of the middle of the distribution. In the special case of a Beta distribution with h successes in N trials, we have

θ̂mean = (h + 1)/(N + 2)
Intuitively this is the same as the MAP of the Beta distribution, with
one more success and one more failure than actually observed. Fur-
ther, for the Beta distribution, the mean value θ̂mean represents the
predictive probability of a successful event on the next observation.
The Median Also known as the 50%-percentile, the median represents the middle of the distribution, such that the probability of the parameter below the median is equal to the probability of the parameter above the median,

P(θ ≤ θ̂median |data) = P(θ ≥ θ̂median |data) = 0.5
"Assume 2 successes and 2 failures" median approximation For the Beta distribution there is no simple form for the median, but a decent approximation which we will use is given by5

θ̂median ≈ (h + 2)/(N + 4)

Intuitively this is the same as the MAP of the Beta distribution, with two more successes and two more failures than actually observed, and is thus referred to as the "Assume 2 successes and 2 failures" median approximation.

Although each of these has its advantages, most notably ease of computation (especially for the mode and the mean), we will typically use the median of the distribution as the best estimate for the following reasons:

1 the median is intuitive as literally the middle of the distribution

2 the median is not as sensitive to distributions that are highly asymmetric

5 Alan Agresti and Brian Caffo. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician, 54(4):280–288, 2000
Example 6.1 What is the best estimate of the probability of a bent coin
flipping heads, given the observation of 9 tails and 3 heads?
θ̂median ≈ (h + 2)/(N + 4) = 5/16 = 0.313
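For the bent-coin posterior all three point estimates are one-liners:

```python
# Point estimates for the bent-coin posterior (h = 3 heads in N = 12 flips)
h, N = 3, 12

mode = h / N                       # MAP estimate
mean = (h + 1) / (N + 2)           # also the predictive P(heads) next flip
median_approx = (h + 2) / (N + 4)  # "assume 2 successes and 2 failures"

print(mode, mean, median_approx)  # 0.25, ~0.286, 0.3125 (i.e. ~0.313)
```

The three estimates disagree slightly because the posterior is asymmetric; for a symmetric posterior they would coincide.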
Inter-Quartile Range The Inter-Quartile Range (IQR) is the range between the 25% and 75% quartiles, and represents 50% of the probability.
In Figure 6.10, the Inter-Quartile Range is [0.29, 0.40].

95% Credible Interval (CI) The 95% Credible Interval (CI) is the range between the 2.5% and 97.5% quantiles, and thus represents 95% of the probability. According to Table 1.1 on page 51, it is "very likely" that our best estimate lies in this range.
In Figure 6.10, the 95% Credible Interval is nearly [0.2, 0.5].

Standard Deviation The standard deviation is a measure of the half-width of a distribution, most commonly used specifically with reference to the particular Normal distribution. This will be defined more precisely in Section 7.2 on page 140, and will thus not be defined in general here.
An approximate value for the standard deviation for the Beta distribution is

σ ≈ √(θ̂ (1 − θ̂ )/N )

From Figure 6.10, and using the median as the best estimate, θ̂, we get

σ ≈ √(0.34(1 − 0.34)/30) = 0.09
6.8 Marginalization
6.9 Exercises
Exercise 6.1 Given the posterior shown in Figure 6.10 for 10 heads and 20
tails, answer the following:
1 The most likely estimate for the parameter θ. What does this mean?
(Figure: the posterior distribution for 10 heads and 20 tails (Figure 6.10), with percentiles θ = 0.34 at 50%, θ = 0.40 at 75%, θ = 0.45 at 90%, θ = 0.49 at 95%, and θ = 0.55 at 99%.)
3 heads and 9 tails Plot a beta distribution with 3 heads and 9 tails...
dist = beta(h=3, N=12)
distplot(dist, xlim=[0, 1], show_quartiles=False)
dist.median()
0.27527583248615201
and the 95% credible interval...

credible_interval(dist)
1 heads and 3 tails This should be about the same fraction as the previous example, but broader
dist = beta(h=1, N=4)
distplot(dist, xlim=[0, 1])
credible_interval(dist)
2 Write the top of Bayes’ Rule for all models being considered
We construct a model for how different possible values of θ influ-
ence the outcome - a model we call the likelihood. In the case of the
bent coin, the likelihood model is a binomial model, and describes
the probability of flipping heads or tails given how bent the coin is
(i.e. given θ).
5 Divide each of the values by this sum, K, to get the final probabili-
ties
Once we observe data, we can combine the prior and the model or
likelihood using the Bayes’ recipe, and obtain the posterior distribu-
tion for the unknown value, θ, giving us the probability for each
value, now updated with our new observations.
The last couple of steps of the recipe, for simple cases, is done by
the mathematicians so we don’t have to manually add and divide as
we did in the previous chapters. In the case of the coin flips we get:
Beta(θ |data) ∼ Binomial(data|θ ) × Uniform(θ )

where the left-hand side is the posterior probability, the first term on the right is the likelihood, and the second is the prior probability.
From this Beta distribution, we can get the most likely value (i.e. maximum probability value) for the unknown quantity of interest, θ, and our uncertainty in this quantity (i.e. the width of the Beta distribution), consistent with the known data. In other words, the posterior
probability summarizes all of our knowledge about the parameter of
interest given the data.
(Figure: the standard Normal distribution, p( x ) = Normal(0, 1), plotted for x from −4 to 4.)
priors, likelihoods, and posteriors 141
(Figure: Normal distributions p( x ) = Normal(µ, 1) with µ = −2, 0, and 3.)
2 the total probability between these two points is 68%. This is typically written, µ ± σ.
range of the estimated value is between 3 and 7, and 95% certain that
the range is between 1 and 9 (i.e. mean minus two deviations and
mean plus two deviations).
(Figure: Normal distributions p( x ) = Normal(0, σ) with σ = 1, 2, and 4.)
We can specify the Normal distribution with just the two parameters, µ and σ - the location and deviation parameters, respectively. However, due to its symmetry, we can summarize this distribution for all cases by looking at a single special case called the standard Normal distribution.
The Standard Normal Distribution is the Normal distribution in the special case where µ = 0 (the distribution is centered at x = 0) and σ = 1 (the distribution has a spread of 1).
is 0.95, and 3-σ is 0.99. These locations are the most prevalently used
in any kind of statistical testing, and thus we will see them many
times.
In order to use the table of percentiles for the standard Normal dis-
tribution, we need to be able to translate from the Normal to the
standard Normal and back again. Luckily, it is a simple process, and
is one of the main reasons for using the Normal distribution - other
distributions are not so easily manipulated.
(Table: areas to the left under the standard Normal curve: 2.3% at z = −2 and 97.7% at z = +2; 0.1% at z = −3 and 99.9% at z = +3.)
To use the tables in Section D.3 on page 238, we first need to trans-
late everything to the standard Normal values.
x = 170 ⇒ z = (x − 150)/30 = 0.67

From the table in Section D.3 on page 238, the area to the left of z = 0.67 is 0.7486. Because we are asked the probability greater than x = 170, we need the area to the right of the curve, or 1 − 0.7486 = 0.2514.
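The translation to z and back can also be scripted with the standard library's error function instead of the printed table, since the standard Normal CDF is 0.5 (1 + erf(z/√2)):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area to the left of z under the standard Normal curve."""
    return 0.5 * (1 + erf(z / sqrt(2)))

x, mu, sigma = 170, 150, 30
z = (x - mu) / sigma
print(round(z, 2))                     # 0.67
print(round(normal_cdf(0.67), 4))      # 0.7486, the table value
print(round(1 - normal_cdf(0.67), 4))  # 0.2514, area to the right
```

This reproduces the table lookup exactly, and works for any z, not just the tabulated values.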
1 P( x < 12)
1 Make a qualitative plot of the distribution to help you with the other parts
of the question
1 Make a qualitative plot of the distribution to help you with the other parts
of the question
Sum of two Normally distributed variables If we have two variables, x and y, which have Normal distributions

P( x ) = Normal(µ x , σx )
P(y) = Normal(µy , σy )

then their sum, x + y, has a mean the sum of the two, µ x + µy , and a deviation √(σx² + σy²),

P( x + y) = Normal(µ x + µy , √(σx² + σy²))

One way to remember this is that the new squared deviation parameter is the sum of the two old ones, σ² = σx² + σy².
Differences between two Normally distributed variables For differences, x − y, we have a new mean of µ x − µy and a deviation parameter that is again √(σx² + σy²),

P( x − y) = Normal(µ x − µy , √(σx² + σy²))

Note the "+" sign in the new σ, which keeps the new σ positive, as it must be by definition.
If we are asked for the distribution of a quantity with an added (Note the “+” sign in the new σ.)
constant, like
z = x + constant
(Figure: p( x ) = Normal(8, 2), p(y) = Normal(20, 7), and the difference p(z) = p(y − x ) = Normal(12.0, 7.3).)
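The rule for differences can be checked by sampling; a sketch drawing from the two distributions in the figure:

```python
import random

rng = random.Random(1)
n = 200_000
# Draw paired samples of y ~ Normal(20, 7) and x ~ Normal(8, 2),
# and form the difference z = y - x
z = [rng.gauss(20, 7) - rng.gauss(8, 2) for _ in range(n)]

mean_z = sum(z) / n
var_z = sum((v - mean_z) ** 2 for v in z) / (n - 1)
std_z = var_z ** 0.5

print(mean_z)  # near 20 - 8 = 12
print(std_z)   # near sqrt(7**2 + 2**2) = sqrt(53), about 7.28
```

The sample mean lands near 12 and the sample deviation near 7.28, matching Normal(12.0, 7.3) in the figure.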
Since in this case we are given σ, we wish then to estimate the parameter µ. The result will be a probability distribution over µ, with a best (i.e. most probable) value and an uncertainty in that value. The result is that the distribution of µ is also a Normal distribution,

P(µ|data, σ ) = Normal( x̄, σ/√N )

where the center value (and thus the most probable value of µ) is given by the sample mean of the data. In scientific applications, this notation is often shortened to µ = x̄ ± σ/√N, so it is clear what is the best estimate of µ (i.e. x̄ ) and what is the uncertainty in that estimate (i.e. σ/√N ).
Sample Mean The sample mean of a set of N samples, x1 , x2 , · · · , x N is given by

x̄ ≡ ( x1 + x2 + x3 + · · · + x N )/N
The uncertainty in µ is given by σ/√N. As a consequence, larger N (i.e. more data points) makes us more confident in the particular estimate for µ.
Say that we further know that the uncertainty (given this ruler) of one measurement has σ = 0.5[cm]. What is the best estimate of the length? The best estimate should be given by the sample mean of these 5 samples,

µ̂ = (x1 + x2 + · · · + xN)/N
  = (5.1[cm] + 4.9[cm] + 4.7[cm] + 4.9[cm] + 5.0[cm])/5 = 4.92[cm]

(In real measurements, there is always the problem of bias or systematic uncertainties, where the uncertainty does not follow a Normal distribution. We will not consider this issue here.)
with uncertainty related to the known uncertainty of a single measurement,

σ̂ = σ/√N = 0.5[cm]/√5 = 0.223[cm]

µ̂ = 4.92[cm] ± 0.223[cm]
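The arithmetic above can be checked with a few lines of plain Python (a sketch, not the book’s sie helpers):

```python
from math import sqrt

# the five length measurements [cm] and the known per-measurement uncertainty
x = [5.1, 4.9, 4.7, 4.9, 5.0]
sigma = 0.5

N = len(x)
mu_hat = sum(x) / N           # sample mean: best estimate of the length
sigma_hat = sigma / sqrt(N)   # uncertainty of the estimate

print(mu_hat, sigma_hat)      # 4.92 and about 0.22
```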
Sample Deviation The sample deviation of a set of N samples, x1, x2, · · · , xN is given by

S ≡ √( ((x1 − x̄)² + (x2 − x̄)² + · · · + (xN − x̄)²)/(N − 1) )
This distribution requires three numbers to specify, referred to as the mean (µ), deviation (σ) and the degrees of freedom (dof). The degrees of freedom is defined in this case to be the number of data points less one, N − 1.

Example 7.5 Estimating the True Length of an Object...Again
Say we have an object, and 5 measurements of its length from the
same ruler but from different people,
Unlike earlier, let’s say that we don’t know the uncertainty (given this ruler) of one measurement. What is the best estimate of the length?
Again, the best estimate should be given by the sample mean of these
5 samples,
µ̂ = (x1 + x2 + · · · + xN)/N
  = (5.1[cm] + 4.9[cm] + 4.7[cm] + 4.9[cm] + 5.0[cm])/5 = 4.92[cm]
with uncertainty related to the sample deviation
S² = ((x1 − x̄)² + · · · + (xN − x̄)²)/(N − 1)
   = ((5.1[cm] − 4.92[cm])² + (4.9[cm] − 4.92[cm])² + (4.7[cm] − 4.92[cm])² + (4.9[cm] − 4.92[cm])² + (5.0[cm] − 4.92[cm])²)/(5 − 1)
   = 0.022[cm]²

S = √(0.022[cm]²) = 0.148[cm]

S/√N = 0.148[cm]/√5 = 0.066[cm]
Looking at Table D.2 on page 236 with “Degrees of Freedom” equal to 4, we find that the 95% credible interval for µ (between areas 0.025 and 0.975) falls within ±2.776 · S/√N, thus we have

µ̂ = 4.92[cm] ± 2.776 · 0.066[cm] = 4.92[cm] ± 0.18[cm]
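The same numbers in plain Python (a sketch; 2.776 is the Student-t critical value for dof = 4 taken from the text’s table):

```python
from math import sqrt

x = [5.1, 4.9, 4.7, 4.9, 5.0]   # the five measurements [cm]
N = len(x)
xbar = sum(x) / N

# sample deviation S (note the N - 1 divisor)
S = sqrt(sum((xi - xbar) ** 2 for xi in x) / (N - 1))

# 95% credible interval half-width for mu, using the t critical
# value 2.776 for dof = N - 1 = 4
half_width = 2.776 * S / sqrt(N)
print(xbar, S, half_width)
```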
The Normal distribution is useful for many reasons: its simple shape,
the fact that there are only two parameters which describe it, and the
ease with which one can compare the general Normal distribution to
the single standard Normal. Further, it can be used as an approxima-
tion for several other distributions, under certain limits.
to how likely the coin flips heads. For notation, we will write the frequency of heads as

f ≡ h/N
Normal Approximation to the Beta Distribution The Normal approximation to the Beta distribution, for a large number of flips (N) of which a fraction f ≡ h/N are successful, is given by

Beta(h, N) ∼ Normal(µ = f, σ = √(f(1 − f)/N))
To see how close this approximation can be, observe the following
two cases:
[Figure: the Beta distribution and its Normal approximation plotted together for two cases, as functions of θ from 0 to 1.]
and the curves are so close as to be nearly identical! There is still a (small) probability of getting a negative θ, which is problematic in theory but not typically in practice. We can use the properties of the Normal distribution here to quantify our uncertainty about the bent coin. Given 30 heads and 90 tails, the best estimate for θ (i.e. the top of the curve) is 0.25. Our uncertainty is quantified by the width of the distribution, given by σ. Thus, we can be confident to a 95% degree for θ within 2σ, or between 0.17 and 0.33 (0.25 − 2 · 0.04 and 0.25 + 2 · 0.04, respectively).

(This is an approximation, and as such will certainly give seriously incorrect answers under certain circumstances. For example, in this case, the Normal approximation predicts that there is around a 1.8% chance that the bent coin might have a negative θ, or probability of flipping heads (look at the Normal curve to the left of θ = 0)! The Beta distribution is zero for any value below zero or over one, and thus will never lead to such absurd answers.)
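A quick check of these numbers in plain Python, using the Normal approximation with h = 30 heads in N = 120 flips:

```python
from math import sqrt

h, N = 30, 120          # 30 heads out of 30 + 90 flips
f = h / N               # frequency of heads: best estimate of theta
sigma = sqrt(f * (1 - f) / N)

# 95% (2-sigma) range for theta
lo, hi = f - 2 * sigma, f + 2 * sigma
print(f, sigma, lo, hi)
```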
152 statistical inference for everyone
[Figure: Binomial(N = 10, p = 0.25) compared with its Normal approximation, µ = 2.50, σ = 1.37, as a function of k.]

and

[Figure: Binomial(N = 100, p = 0.25) compared with its Normal approximation, µ = 25.00, σ = 4.33, as a function of k.]
For smallish data sets, 5 < N < 30, we can replace the estimate of the mean from the Student’s t distribution with a Normal distribution with an increased estimate for the deviation. It then becomes practical to use the more convenient z-score to estimate credible intervals rather than the full t tables. The approximation in this domain looks like²

σ/√N ≈ k · S/√N,   with k ≡ 1 + 20/N²

[² D. Berry. Statistics: A Bayesian Perspective. Duxbury, 1996]
7.5 Summary
It is useful to see all of these results stemming from the same Bayes’
Recipe, applied to different models of the data and (possibly) differ-
ent prior probabilities. As we have stated, many of the simple cases
have been worked out by the mathematicians, so we don’t need to
do the work of deriving them. It will be our task to understand their
properties, to be able to apply them to real problems, and to under-
stand their consequences. One of the immediate observations that
we make is the prevalence of the Normal distribution, justifying our
detailed exploration of it in this chapter.
1 Proportions

Posterior Probability:

Beta(θ|data) ∼ Binomial(data|θ) × Uniform(θ)
(posterior probability) ∼ (likelihood) × (prior probability)

Posterior Probability:

Normal(µ|data, σ) ∼ Normal(data|µ, σ) × Uniform(µ)
(posterior probability) ∼ (likelihood) × (prior probability)
Posterior Probability:

P(µ, σ|data) ∼ Normal(data|µ, σ) × Uniform(µ) · Uniform(log σ)
(posterior probability) ∼ (likelihood) × (prior probability)
from sie import *

Estimating Lengths

Known deviation, σ

x=[5.1, 4.9, 4.7, 4.9, 5.0]
sigma=0.5
mu=sample_mean(x)
N=len(x)
156 statistical inference for everyone
credible_interval(dist)

Unknown σ

mu=sample_mean(x)
s=sample_deviation(x)
print mu, s

4.92 0.148323969742

distplot(dist, xlim=[4.6, 5.4])
credible_interval(dist)
8.1 z-test
The z-test is the simplest test to use, and is perhaps the most com-
mon. It is used when we have the following assumptions:
4 If so, then the test passes, and we can be reasonably confident that
the parameter is non-zero - that the effect is real.
5 If the test fails, i.e. the credible range does include zero, then under the model the possibility of a zero-effect cannot be reasonably excluded.
There are several scenarios where we use the z-test, each with
the same procedure, differing only in the method of estimating the
“true” value µ.
µ̂ ≈ f
σ/√N ≈ √(f(1 − f)/N)
3 For smallish data sets, 5 < N < 30, where the uncertainty is not
known,
µ̂ ≈ (x1 + x2 + · · · + xN)/N
σ/√N ≈ kS/√N
common statistical significance tests 161
where we replace the known σ/√N from the previous case with an estimate using the sample standard deviation and an adjustment parameter k for small data sets,

S² = ((x1 − x̄)² + · · · + (xN − x̄)²)/(N − 1)

k ≡ 1 + 20/N²
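The small-sample adjustment above can be sketched in plain Python; the numbers are from the penny example worked later in the text (N = 15, x̄ = 3.100 g, S = 0.0278 g):

```python
from math import sqrt

# numbers from the penny example: N = 15 measurements,
# sample mean 3.100 g, sample deviation 0.0278 g
N = 15
xbar = 3.100
S = 0.0278

k = 1 + 20 / N**2             # small-data-set scale factor
uncertainty = k * S / sqrt(N)

print(k)            # about 1.0889
print(uncertainty)  # about 0.0078 g
```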
Significance

There is a term used in the literature called statistical significance.¹ Roughly it means a value that is very unlikely to be zero (see Table 1.1 on page 51), or in other words, the value of zero is not within the 95% percentile range, i.e. zero is more than 2 standard deviations from the value. So the following estimated values are not statistically significant:

• 5 ± 3 - the two-deviation range [-1, 11] contains the value 0
• 7 ± 4
• −3 ± 2

while the following are statistically significant:

• 7 ± 3
• −3 ± 1

[¹ Although the word “significant” occurs in the term “statistically significant,” it does not imply that the result itself is important - it may be a small, uninteresting effect, but credibly non-zero. Perhaps a term like “statistically detectable” would be better, but we are unfortunately bound to the historical use of the term.]
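The two-deviation check is easy to express in code (a sketch; the function name is ours, not the book’s):

```python
def statistically_significant(value, sigma):
    """True when zero lies outside the two-deviation range value +/- 2*sigma."""
    return abs(value) > 2 * sigma

# the examples from the text
for v, s in [(5, 3), (7, 4), (-3, 2)]:
    print(v, s, statistically_significant(v, s))   # all False
for v, s in [(7, 3), (-3, 1)]:
    print(v, s, statistically_significant(v, s))   # all True
```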
8.3 Student-t-test
σ̂ = S
5 If so, then the test passes, and we can be reasonably confident that
the parameter is non-zero - that the effect is real.
6 If the test fails, i.e. the credible range does include zero, then under the model the possibility of a zero-effect cannot be reasonably excluded.
from sie import *

x=x_sertosa
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_sertosa=tdist(N, mu, sigma)

new_length=1.7

distplot(t_sertosa, label='petal length', xlim=[1.37, 1.8],
    quartiles=[.01, 0.05, .5, .95, .99],
    )
ax=gca()
ax.axvline(1.7, color='r')
savefig('../../figs/z_test_iris.pdf')
9 Applications of Parameter Estimation and Inference

Table 9.1 shows data for the lengths (in centimeters) of the petals of one species of Iris flower¹. [¹ K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml] If we want to estimate the “true” length of the petal for this species, given all of these examples, we would apply the following model of the data:

data = true value + Normal(mean=0, known σ)

or equivalently

data = Normal(mean=true value, known σ)

The resulting distribution for the “true value”, µ, is also a Normal distribution (Section 7.3),

P(µ|data, σ) = Normal(x̄, σ/√N)

where the best estimate of the true value, µ, is the sample mean, x̄, and the uncertainty is related to the sample deviation (which we’re
µ̂ = 1.464[cm] ± 0.174/√50 [cm]
  = 1.464[cm] ± 0.025[cm]
with uncertainty the same as the uncertainty of the Setosa type, so the
final estimate with uncertainty is:
1.036[cm] ± 0.025[cm]
which is 1.036[cm]/0.025[cm] ≈ 41 deviations away from zero!
types Virginica and Versicolor longer than the type Setosa? Is the Vir-
ginica longer than Versicolor? For each of these, we need to specify the
model, determine the best estimate for the parameters of the model,
and then compare the distributions.
The model we will use is the simple Normal model,
which is the same as the previous example, except that the deviation,
σ, is unknown. In addition to being unknown, there are so few data
points that the deviation can’t be well approximated with the sample
deviation.
The resulting distribution for the “true value”, µ, is a Student-t
distribution (Section 7.3),
P(µ|data) = Student_{dof=N−1}(x̄, S/√N)
The best estimates for the true length-values of each type are given by their sample means,

µ̂setosa = (1.4 + 1.4 + 1.3 + 1.5 + 1.4)/5 = 1.40
µ̂virginica = (6.0 + 5.1 + 5.9 + 5.6 + 5.8)/5 = 5.68
µ̂versicolor = (4.7 + 4.5 + 4.9 + 4.0 + 4.6)/5 = 4.54

and the sample deviations for each are given by

Ssetosa = √( ((1.4 − 1.40)² + (1.4 − 1.40)² + (1.3 − 1.40)² + (1.5 − 1.40)² + (1.4 − 1.40)²)/(5 − 1) ) = 0.07
Svirginica = √( ((6.0 − 5.68)² + (5.1 − 5.68)² + (5.9 − 5.68)² + (5.6 − 5.68)² + (5.8 − 5.68)²)/(5 − 1) ) = 0.36
Sversicolor = √( ((4.7 − 4.54)² + (4.5 − 4.54)² + (4.9 − 4.54)² + (4.0 − 4.54)² + (4.6 − 4.54)²)/(5 − 1) ) = 0.34
It is clear from the picture that they are very well separated, but we
can quantify this by looking at the probability that the difference
between their means is greater than zero.
[Figure: the three posterior distributions plotted along a Petal Length [cm] axis from 0 to 8.]
The probability of their difference approximately takes the form of a Student’s t distribution, with the same center and deviation shown for the Normal in Section 7.2. Here we do the calculation between the closest two iris types, Virginica and Versicolor:

µdiff = 5.68 − 4.54 = 1.14

σdiff = √(0.36²/5 + 0.34²/5) = 0.22

The degrees of freedom used for this Student’s t distribution is approximately the smallest one from the two samples, or in this case (since both samples have the same number of data points), dof = 4.

(This approximation is called Welch’s method. The exact analysis is beyond this book, but numerically one can calculate it and it doesn’t differ from this approximate analysis in any significant way. Essentially you calculate P(µversicolor > µvirginica|data) by adding up P(µversicolor|data) × P(µvirginica|data) for all possible lengths where versicolor is longer than virginica.)
The resulting posterior probability distribution for the difference of
means is shown in Figure 9.2.
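The Welch-style difference computation in plain Python, using the sample means and deviations above (with the versicolor mean 4.54):

```python
from math import sqrt

# sample means and deviations from the text (5 flowers of each type)
mu_virginica, S_virginica = 5.68, 0.36
mu_versicolor, S_versicolor = 4.54, 0.34
N = 5

mu_diff = mu_virginica - mu_versicolor
sigma_diff = sqrt(S_virginica**2 / N + S_versicolor**2 / N)

print(mu_diff, sigma_diff)
print(mu_diff / sigma_diff)   # number of deviations from zero, over 4
```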
We observe that the difference of the means is over 4 times the
deviation away from zero, so even with 4 degrees of freedom, this is
significant at the 99% level. We can be highly certain that these two
species have different petal lengths, and that the difference observed
is not just a product of the random sample.
Here’s a data set, measuring the size of ball bearings² from two different production lines. [² David J Hand, Fergus Daly, K McConway, D Lunn, and E Ostrowski. A Handbook of Small Data Sets, volume 1. CRC Press, 2011]

We can ask questions such as:
• What is our best estimate of the size of a ball bearing, given one of
the production lines?
applications of parameter estimation and inference 169
[Figure: posterior distributions for the two production lines, with 1%, 5%, 10%, 90%, 95%, and 99% percentiles marked.]
Example 9.5 What is the best estimate (and uncertainty) for each of the
two production lines of ball bearings?
σ1 = S1/√10 = 0.092
σ2 = S2/√10 = 0.135
yielding the best estimates and uncertainties for the two production
lines
or looking at the 95% CI for each line (this is just the ±2 · σ range):
Roughly, given that these intervals overlap, there is not strong evi-
dence that there is a difference between the two lines.
δ12 = µ2 − µ1 = 0.212
The data is that 7 of the daughters had cancer and 3 did not. Is there
strong evidence of a connection?
The proper way, assuming total initial ignorance, is to use the Beta
distribution:
P (θcancer |data) = Beta(h = 7, N = 10)
which has a median of θ̂cancer = 0.68, but a 95% credible interval from θ = 0.39 up to θ = 0.89. This means there is not strong evidence of an effect.
Example 9.9 Cancer Rates - Normal Approximation

We can estimate the Beta distribution median and credible intervals with a Normal distribution, by using the “assuming 2 successes and 2 failures” method.
θ̂cancer = (h + 2)/(N + 4) = (7 + 2)/(10 + 4) = 0.643

and

σ = √(θ̂cancer(1 − θ̂cancer)/(N + 4)) = √(0.643(1 − 0.643)/(10 + 4)) = 0.128
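This “add 2 successes and 2 failures” approximation is two lines of plain Python:

```python
from math import sqrt

h, N = 7, 10   # 7 of the 10 daughters had cancer

# "assume 2 successes and 2 failures" Normal approximation to the Beta
theta = (h + 2) / (N + 4)
sigma = sqrt(theta * (1 - theta) / (N + 4))

# 2-sigma (roughly 95%) range
lo, hi = theta - 2 * sigma, theta + 2 * sigma
print(theta, sigma, lo, hi)
```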
θ̂cancer ± 2σ
which is between 0.387 and 0.899, again with the same conclusion of
no strong evidence of an effect.
θ̂rain ± 2σ
which is between 0.268 and 0.540. This is not strong evidence against
a purely fair and random “coin flip” for rain on the 4th of July.
θafter a miss = (48 + 2)/(53 + 4) = 0.877
θafter a success = (251 + 2)/(285 + 4) = 0.875

and the uncertainty,

σafter a miss = √(0.877(1 − 0.877)/(53 + 4)) = 0.044
σafter a success = √(0.875(1 − 0.875)/(285 + 4)) = 0.019
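Both rates can be computed with the same small helper (the function name is ours, for illustration):

```python
from math import sqrt

def rate_and_sigma(h, N):
    """Normal approximation with 2 added successes and 2 added failures."""
    theta = (h + 2) / (N + 4)
    return theta, sqrt(theta * (1 - theta) / (N + 4))

theta_miss, sigma_miss = rate_and_sigma(48, 53)    # shots after a miss
theta_hit, sigma_hit = rate_and_sigma(251, 285)    # shots after a success
print(theta_miss, sigma_miss)
print(theta_hit, sigma_hit)
```

The two rates are nearly identical, and well within each other's uncertainties.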
[Figure: Mass per Penny [g] versus year, 1960-1974.]
2 the best estimate for the true value, µ̂, is given by the sample mean, x̄:

x̄ = (x1 + x2 + · · · + xN)/N
  = (3.133g + 3.083g + · · · + 3.093g)/15
  = 3.100g
4 The scale factor, k, adjusts for the small number of data points -
there is more uncertainty in our estimate when there are fewer
data points:
k = 1 + 20/N² = 1 + 20/15² = 1.0889
Finally, we have the best estimate and uncertainty for the pennies
in this dataset:
µ̂ = x̄ ± k · S/√N
  = 3.100g ± 1.0889 · 0.0278g/√15
  = 3.100g ± 0.0078g
Example 9.13 Mass of the Penny, Model 1 - One True Value with More
Data
µ̂ = x̄ ± k · S/√N
2 the best estimate for the true value, µ̂, is given by the sample mean,
x̄:
x̄ = (3.133g + 3.083g + · · · + 2.520g)/30
  = 2.804g
[Figure: Mass per Penny [g] versus year with the best estimate of the “true” value, µ̂ = 3.100 ± 0.0078, and 99% CI [3.077, 3.124].]

[Figure: the posterior p(µ) versus µ [g], showing the best estimate for µ = 3.100 ± 0.0078 grams and the 99% CI for µ: [3.077, 3.124].]
4 The scale factor, k, adjusting for the small number of data points:
k = 1 + 20/30² = 1.0222
Finally, we have the best estimate and uncertainty for the pennies
in this full dataset:
µ̂ = x̄ ± k · S/√N
  = 2.804g ± 1.0222 · 0.3024g/√30
  = 2.804g ± 0.0564g
1 The scale factor, k, is less for 30 data points than for 15. This is because the adjustment for a small number of data points becomes less relevant as we obtain more data, which is what we expect.
1 Always look at your data graphically. What you might miss look-
ing at a table of numbers, you’ll catch with a picture.
[Figure: Mass per Penny [g] versus year, 1960-2005, for the full data set.]
• µ1 - before 1975
• µ2 - after 1988
where the 99% credible intervals (CI) clearly do not overlap, thus
there is a statistically significant difference between them.
[Figure: the later pennies, with best estimate µ̂2 = 2.507 ± 0.0029 and 99% CI [2.498, 2.516].]
where the sample standard deviations, S1 and S2 , and the scale fac-
tors, k1 and k2 were calculated earlier. This leads to, for this data set,
with the 99% credible interval [0.568g, 0.618g], the distribution shown
in Figure 9.7. Again, the estimated quantities are clearly different sta-
tistically: the value of zero is well outside of the 99% credible interval
for δ12 .
[Figure 9.7: the posterior p(µ1 − µ2), with best estimate for µ1 − µ2 = 0.593 ± 0.008 and 99% CI [0.568, 0.618]. The value of zero is well outside the 99% credible interval of the difference, thus there is a statistically significant difference between the two values µ1 and µ2.]
from sie import *

Iris Example

print x_sertosa[:10]   # print the first 10

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
5    1.7
6    1.4
7    1.5
8    1.4
9    1.5
Name: petal length [cm], dtype: float64
x=x_sertosa
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_sertosa=tdist(N, mu, sigma)

x=x_versicolor
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_versicolor=tdist(N, mu, sigma)

x=x_virginica
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_virginica=tdist(N, mu, sigma)

distplot2([t_sertosa, t_versicolor, t_virginica], show_quartiles=False)
distplot(t_virginica)
credible_interval(t_versicolor)
credible_interval(t_virginica)
Sunrise
dist=beta(h=365, N=365)
distplot(dist)
credible_interval(dist)

Cancer Example

dist=beta(h=7, N=10)
credible_interval(dist)

Pennies

plot(year, mass, 'o')
xlabel('year')
ylabel('Mass per Penny [g]')
x=mass
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_penny1=tdist(N, mu, sigma)

distplot(t_penny1, label='mass [g]')

plot(year, mass, 'o')
credible_interval_plot(t_penny1, percentage=99)
xlabel('year')
ylabel('Mass per Penny [g]')

Do the 2 datasets

data2=load_data('data/pennies2.csv')
print data2
year1, mass1=year, mass
year2, mass2=data2['Year'], data2['Mass [g]']

14   2003   2.520

x=mass1
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_penny1=tdist(N, mu, sigma)

x=mass2
mu=sample_mean(x)
N=len(x)
sigma=sample_deviation(x)/sqrt(N)
t_penny2=tdist(N, mu, sigma)

plot(year1, mass1, 'o')
credible_interval_plot(t_penny1, percentage=99)
plot(year2, mass2, 'ro')
credible_interval_plot(t_penny2, percentage=99, xlim=[1989, 2005])
xlabel('year')
ylabel('Mass per Penny [g]')
mu1=sample_mean(mass1)
mu2=sample_mean(mass2)
delta_12=mu1-mu2
sigma_delta12=sqrt(sigma1**2 + sigma2**2)
dist_delta=normal(delta_12, sigma_delta12)
distplot(dist_delta)

data1=[1.18, 1.42, 0.69, 0.88, 1.62, 1.09, 1.53, 1.02, 1.19, 1.32]
data2=[1.72, 1.62, 1.69, 0.79, 1.79, 0.77, 1.44, 1.29, 1.96, 0.99]
N1=len(data1)
N2=len(data2)
mu1=sample_mean(data1)
mu2=sample_mean(data2)
print mu1, mu2

1.194 1.406

S1=sample_deviation(data1)
S2=sample_deviation(data2)
print S1, S2

0.289681817786 0.428309337849

sigma1=S1/sqrt(N1)
sigma2=S2/sqrt(N2)
print sigma1, sigma2

0.091605434094 0.135443305072
10 Multi-parameter Models

Table 10.1 and Figure 10.1. By eye we can see a direct correlation - the taller the person, the larger the shoe size.
Height [inches]   Shoe Size
64.0              7
70.0              9
64.0              8
71.0              11
69.0              12
68.0              9
69.0              10
61.0              6
68.0              10
70.0              9

Table 10.1: Heights (in inches) and shoe sizes from a subset of McLaren (2012) data.
y = mx + b
[Figure: Shoe Size versus Height [inches], a scatter plot of the data in Table 10.1.]
where m is the slope and b is the intercept. Clearly this data doesn’t
form a perfect line, so there is some uncertainty in the slope, inter-
cept, and predicted y values. We assume a Normal distribution for
the uncertainties in the data, so the statistical model looks like, for
each data point,
yi = mxi + b + Normal(0, σ)
P(m, b|data)
3 Add up the values, and divide by this sum to get the final pos-
terior probabilities. This is done by the mathematicians, and we
simply summarize the results here.
we obtain the posterior distributions for the parameters m and b.
The calculations get too detailed to do by hand, but are very easy
with the computer. For the shoe size data in Table 10.1 we get the
distributions shown in Figures 10.2 and 10.3 for the slope and inter-
cept, respectively. The most probable values then lead to the best fit,
shown in Figure 10.4.
The Student-t test clearly shows that the slope is non-zero (well
over 95% of the distribution lies to the right of zero), denoting a sta-
tistically significant effect on shoe size from height. The magnitude of
the slope, slope = 0.42, can be interpreted that every inch of height
leads to a 0.42 increase in shoe size on average.
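For a rough check of the slope, here is a plain least-squares fit of the Table 10.1 data (a sketch, not the book’s Bayesian regression; it gives a slope near 0.45, close to the 0.42 most-probable value quoted from the posterior, which was fit to the subset actually sampled):

```python
# heights and shoe sizes from Table 10.1
heights = [64, 70, 64, 71, 69, 68, 69, 61, 68, 70]
sizes = [7, 9, 8, 11, 12, 9, 10, 6, 10, 9]

N = len(heights)
hbar = sum(heights) / N
sbar = sum(sizes) / N

# ordinary least-squares slope and intercept for y = m*x + b
m = sum((h - hbar) * (s - sbar) for h, s in zip(heights, sizes)) \
    / sum((h - hbar) ** 2 for h in heights)
b = sbar - m * hbar
print(m, b)
```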
[Figures 10.2 and 10.3: the posterior distributions P(Slope) and P(Intercept), with 1%, 5%, 10%, 90%, 95%, and 99% percentiles marked.]

[Figure 10.4: the best fit, y = 0.422x − 19.256, plotted over the Shoe Size versus Height [inches] data.]
multi-parameter models 197
One can intuitively think of getting the best fit as adjusting the slopes
and intercepts, calculating the MSE for each, and stopping when you
reach a minimum value. An example of this is shown in Figure 10.5.
[Figure 10.5: fits with different slopes and intercepts over the Shoe Size versus Height [inches] data, showing the minimum-MSE fit.]
An Educational Example
The following example is from a data set on school expenditures and SAT scores.² We plot the total SAT scores as a function of expenditures, perform a linear model fit, and present the best values and their uncertainties in Figure 10.6. The model is
1 The larger the expenditure per pupil the lower the SAT scores.
2 For each thousand dollars spent per pupil, the total SAT score goes
down 20 points.
[Figure 10.6: SAT Total versus Expenditure [per pupil, thousands], with the linear model fit.]
[Figures: the posterior distributions P(Slope) and P(Intercept) for the SAT fit, with 1%, 5%, 50%, 95%, and 99% percentiles marked.]
more students - both bad and good - take the SAT. Thus, even if ex-
penditure helps students, the fact that the percentage of students
taking the exam increases creates the illusion of the opposite. The
next section states how you can overcome this problem.
y = β 0 + β 1 x1 + β 2 x2 · · ·
• For each $1000 more spent per pupil the total SAT score increases
on average by 12.29.
• For each percent increase in students taking the SAT, the total SAT
score decreases on average by 2.29.
[Figure: Percentage of Students Taking the SAT versus Expenditure [per pupil, thousands], with the fit y = 11.638x − 33.485.]
[Figures: posterior distributions for the fits. For the percent-taking fit: slope median 11.64, 99% interval [6.14, 17.13]; intercept median −33.48, 99% interval [−66.77, −0.20]. For the multiple regression: βexpenditure median 12.29, 99% interval [2.11, 22.46]; βpercent taking median −2.85, 99% interval [−3.37, −2.33]; intercept median 993.83, 99% interval [941.25, 1046.41].]
from sie import *
data=load_data('data/shoesize.xls')
data.head()

import random
random.seed(102)
rows=random.sample(data.index, 10)
newdata=data.ix[rows]
data=newdata
data

result=regression('Size ~ Height', data)
h=linspace(60, 72, 10)
plot(h, result['_Predict'](Height=h), '-')
gca().set_xlim([60, 72])
gca().set_ylim([4, 14])
xlabel('Height [inches]')
ylabel('Shoe Size')
b=result.Intercept.mean()
m=result.Height.mean()
if b>0:
    text(62, 12, '$y=%.3fx + %.3f$' % (m, b), fontsize=30)
else:
    text(62, 12, '$y=%.3fx %.3f$' % (m, b), fontsize=30)
result=regression('total ~ expenditure', data)
plot(data['expenditure'], data['total'], 'o')
xlabel('Expenditure [per pupil, thousands]')
ylabel('SAT Total')
h=linspace(3, 10, 10)
plot(h, result['_Predict'](expenditure=h), '-')
b=result.Intercept.mean()
m=result.expenditure.mean()
if b>0:
    text(4.5, 1125, '$y=%.3fx + %.3f$' % (m, b), fontsize=30)
else:
    text(4.5, 1125, '$y=%.3fx %.3f$' % (m, b), fontsize=30)
result=regression('percent_taking ~ expenditure', data)
plot(data['expenditure'], data['percent_taking'], 'o')
xlabel('Expenditure [per pupil, thousands]')
ylabel('Percentage of Students Taking the SAT')
h=linspace(3, 10, 10)
plot(h, result['_Predict'](expenditure=h), '-')
b=result.Intercept.mean()
m=result.expenditure.mean()
if b>0:
    text(4.5, 85, '$y=%.3fx + %.3f$' % (m, b), fontsize=30)
else:
    text(4.5, 85, '$y=%.3fx %.3f$' % (m, b), fontsize=30)
result=regression('total ~ expenditure + percent_taking', data)
11 Introduction to MCMC
2 The random values are chosen from the prior probability of the
parameters. (e.g. uniform in this case, P(θ ) = 1)
The model we first look at is the coin flip model: given 17 heads in 25 flips, what is the probability distribution of the measure
h,N=data=17,25
def P_data(data,theta):
h,N=data
distribution=Bernoulli(h,N)
return distribution(theta)
model=MCMCModel(data,P_data,
theta=Uniform(0,1))
model.plot_distributions()
model.P(’theta>0.5’)
0.96173333333333333
introduction to mcmc 213
0.038266666666666664
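As a cross-check on the MCMC estimates above, the same posterior can be computed by brute-force grid integration in plain Python, with no MCMC library:

```python
# data: 17 heads in 25 flips, uniform prior on theta
h, N = 17, 25

# evaluate the (unnormalized) posterior theta^h * (1-theta)^(N-h)
# on a fine grid of theta values
M = 10000
thetas = [(i + 0.5) / M for i in range(M)]
post = [t**h * (1 - t)**(N - h) for t in thetas]
Z = sum(post)   # normalization constant

p_gt_half = sum(p for t, p in zip(thetas, post) if t > 0.5) / Z
print(p_gt_half)   # close to the MCMC estimate of about 0.96
```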
def linear(x,a,b):
return a*x+b
model=MCMCModel_Regression(x,y,linear,
a=Uniform(-10,10),
b=Uniform(0,100),
)
model.run_mcmc(500)
model.plot_chains()
plot(x,y,’o’)
model.plot_predictions(xfit,color=’g’)
model.plot_distributions()
model.percentiles([5,50,95])
model.P(’a>0’)
0.98936000000000002
drug = (101,100,102,104,102,97,105,105,98,
101,100,123,105,103,100,95,102,106,
109,102,82,102,100,102,102,101,102,
102,103,103,97,97,103,101,97,104,
96,103,124,101,101,100,101,101,104,
100,101)
placebo = (99,101,100,101,102,100,97,101,
104,101,102,102,100,105,88,101,100,
104,100,100,100,101,102,103,97,101,
101,100,101,99,101,100,100,
101,100,99,101,100,102,99,100,99)
model=mcmc.BESTModel(drug,placebo)
model.run_mcmc()
Running MCMC...
Done.
5.80 s
model.names
model.plot_chains(’mu1’)
model.plot_distribution(’mu1’)
model.plot_distribution(’mu2’)
model.plot_distribution(r'$\delta$=mu1-mu2')
3 Specify how likely your data would be if your model were true,
which is the likelihood part of Bayes’ rule
Topics I’d love to add, and will when I have the chance, include (in
no particular order),
• Measurement in Science
• Two-sample inferences
• Classification
• Experimental Design
Sharon McGrayne. The Theory That Would Not Die: How Bayes’
Rule Cracked the Enigma Code, Hunted Down Russian Submarines,
and Emerged Triumphant from Two Centuries of Controversy. Yale Uni-
versity Press, 2011. ISBN 0300169698.
A. Tversky and T. Gilovich. The cold facts about the “hot hand” in basketball. Anthology of Statistics in Sports, 16:169, 2005.
https://store.continuum.io/cshop/anaconda/
It is
• Free
• Easy to Use
• Easy to Extend
• Very Powerful
The accompanying software for the book can be obtained from the book website, http://web.bryant.edu/~bblais/statistical-inference-for-everyone-sie.html
Appendix B
Notation and Standards
Variables
A set of values, labeled with subscripts...
x1 = 1
x2 = 5
x3 = −3
x4 = 2
x5 = 8
referred to collectively as xi.
Sums
x1 + x2 + x3 + x4 + x5 = 1 + 5 + (−3) + 2 + 8 = 13
is equivalent to
∑_{i=1}^{5} xi = 1 + 5 + (−3) + 2 + 8 = 13
Products
x1 · x2 · x3 · x4 · x5 = 1 · 5 · (−3) · 2 · 8 = −240
is equivalent to
∏_{i=1}^{5} xi = 1 · 5 · (−3) · 2 · 8 = −240
Sample Mean
The sample mean of a set of numbers is defined as...
x̄ ≡ (x1 + x2 + · · · + xN)/N
In the example above
x̄ ≡ (x1 + x2 + x3 + x4 + x5)/5 = 13/5 = 2.6
It can also be written
x̄ ≡ (∑_{i=1}^{N} xi)/N
or
x̄ ≡ (∑_i xi)/N
notation and standards 227
Sample Variance and Standard Deviation
\[ s^2 \equiv \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
\[ s \equiv \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]
(Although the justification for the $N-1$ part is beyond this book, one easy way to remember it is that the sample deviation of a set of numbers is an estimate for the σ parameter of the normal distribution, representing the spread of the data. You can think of the $N-1$ part as a check to keep you from doing the crazy thing of estimating a spread with only 1 data point!)

Estimates
Any specific estimate of a parameter, such as θ, is denoted with a hat, such as θ̂.
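The sample variance and deviation, with the $N-1$ divisor discussed above, can be sketched the same way using the same five values:

```python
import math

# Sample variance and standard deviation with the N-1 divisor,
# using the same five values as in the notation examples.
x = [1, 5, -3, 2, 8]
N = len(x)
xbar = sum(x) / N
s2 = sum((xi - xbar) ** 2 for xi in x) / (N - 1)
s = math.sqrt(s2)
```

For these values the sum of squared deviations is 69.2, giving $s^2 = 69.2/4 = 17.3$.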
Factorials
Factorials are defined as
\[ N! = 1 \cdot 2 \cdot 3 \cdots (N-1) \cdot N \]
for example
\[ 5! = 1 \cdot 2 \cdot 3 \cdot 4 \cdot 5 = 120 \]
The N-choose-k notation is a shorthand for the factorials that arise in binomial and Beta distributions:
\[ \binom{N}{k} \equiv \frac{N!}{k!\,(N-k)!} \]
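Both definitions are available in Python's standard library, which is a convenient check on hand calculations:

```python
import math

# Factorial and N-choose-k, matching the definitions above.
assert math.factorial(5) == 120    # 5! = 1*2*3*4*5
n_choose_k = math.comb(5, 2)       # 5!/(2! * 3!)
print(n_choose_k)  # 10
```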
Term                  Probability
virtually impossible  1/1,000,000
extremely unlikely    0.01 (i.e. 1/100)
very unlikely         0.05 (i.e. 1/20)
unlikely              0.2 (i.e. 1/5)
slightly unlikely     0.4 (i.e. 2/5)
even odds             0.5 (i.e. 50-50)
slightly likely       0.6 (i.e. 3/5)
likely                0.8 (i.e. 4/5)
very likely           0.95 (i.e. 19/20)
extremely likely      0.99 (i.e. 99/100)
virtually certain     999,999/1,000,000
Appendix C
Common Distributions and Their Properties
C.2 Uniform
Discrete
Discrete uniform distribution The discrete uniform distribution is defined to be a constant value for all possibilities. Mathematically this is written
\[ p(x_i) = \frac{1}{N} \]
where $N$ is the total number of possibilities, labeled $x_1$ to $x_N$. The picture of the distribution is shown in Figure C.1.
Continuous
Continuous uniform distribution The continuous uniform distribution is defined to be a constant between a minimum and maximum value, and zero everywhere else. Mathematically this is written
\[ p(x) = \frac{1}{\max - \min} \quad \text{for } \min < x < \max \]
[Figure: the uniform distribution, P(x) vs x]
The height of the rectangle is given by the constant value of the uniform distribution, or
\[ \text{height} = \frac{1}{\max - \min} = \frac{1}{4\,\text{hr} - 0\,\text{hr}} = 0.25\,\frac{1}{\text{hr}} \]
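The definition and the height calculation can be sketched together in a few lines of Python (the function name is my own):

```python
# Continuous uniform density, per the definition above.
def uniform_pdf(x, xmin, xmax):
    """p(x) = 1/(max - min) for min < x < max, and zero elsewhere."""
    return 1.0 / (xmax - xmin) if xmin < x < xmax else 0.0

# waiting times uniform on [0 hr, 4 hr]: height = 1/(4 - 0) = 0.25 per hour
print(uniform_pdf(2.0, 0.0, 4.0))  # 0.25
```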
[Figure: the uniform distribution, P(x) vs x]
[Figure: P(time) for a uniform distribution, with a 20-minute interval marked]
C.3 Binomial
Binomial distribution The discrete binomial distribution is defined to be the probability of achieving $h$ successes in $N$ events, where each event has a given θ probability of success:
\[ P(h \mid N, \theta) = \binom{N}{h} \theta^h (1-\theta)^{N-h} \]
[Figure: the binomial distribution, probability vs number of heads]
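The binomial formula translates directly into Python using `math.comb` for the N-choose-h coefficient; the function name is mine:

```python
import math

def binomial(h, N, theta):
    """P(h | N, theta): probability of h successes in N events,
    each with success probability theta."""
    return math.comb(N, h) * theta**h * (1 - theta)**(N - h)

# e.g. the probability of 5 heads in 10 flips of a fair coin
p = binomial(5, 10, 0.5)
print(p)  # 252/1024 = 0.24609375
```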
C.4 Beta
Beta distribution The continuous Beta distribution is the posterior probability distribution for the parameter θ, where one has observed $h$ successes in $N$ events, and each event is assumed to have a θ probability of success:
\[ P(\theta \mid h, N) = (N+1) \binom{N}{h} \theta^h (1-\theta)^{N-h} \]
[Figure: the posterior distribution P(θ) for the bent coin, the probability that the coin will land heads. The distribution is shown for data of 3 heads and 9 tails. The various quartiles (1%, 5%, 25%, 50%, 75%, 95%, 99%) are shown in the plot.]
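The Beta posterior defined above can be evaluated directly; a minimal sketch (function name my own) using the figure's data of 3 heads and 9 tails, with a crude numerical check that the density integrates to one:

```python
import math

def beta_posterior(theta, h, N):
    """P(theta | h, N) = (N+1) * C(N,h) * theta^h * (1-theta)^(N-h)."""
    return (N + 1) * math.comb(N, h) * theta**h * (1 - theta)**(N - h)

# 3 heads and 9 tails: h = 3, N = 12
h, N = 3, 12

# crude Riemann-sum check that the density integrates to 1 over [0, 1]
dx = 1e-4
total = sum(beta_posterior(i * dx, h, N) * dx for i in range(10001))
```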
C.5 Normal
Normal distribution The Normal distribution is the most common distribution found in all of statistical inference. It is the best prior distribution to use when all you know is that your data has a constant true value and some constant variation around that true value. It is the posterior probability distribution for the unknown true value given $N$ samples and the known deviation, σ. It is also the approximate form for nearly every distribution when you have many samples. The mathematical form for the normal, or Gaussian, is
\[ \text{Normal}(\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2} \]
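The Gaussian formula is a one-liner in Python (the function name is my own); the peak of the standard normal is $1/\sqrt{2\pi} \approx 0.3989$, which matches the figure below:

```python
import math

def normal(x, mu, sigma):
    """Normal(mu, sigma) density evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

print(normal(0, 0, 1))  # 1/sqrt(2*pi) ~ 0.3989
```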
[Figure: p(x) = Normal(0, 1)]
[Figure: the standard normal P(z), showing the 68% credible interval at µ ± 1σ]
Credible Interval   ±z        Approximately
50.0%               0.6745σ
68.0%               0.9945σ   1σ
90.0%               1.6449σ
95.0%               1.9600σ   2σ
99.0%               2.5758σ
99.8%               3.0902σ   3σ
99.995%             4.0556σ   4σ
Example D.1 Usage of the Credible Interval Table for the Normal Distribution
√ √
uncertainty σ/ N or 0.3/ 10 = 0.095. Some of the credible intervals
for this estimate then are the following
or approximately
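The arithmetic of Example D.1 can be reproduced in a few lines, taking the z multipliers from the table above; the dictionary layout is my own:

```python
import math

# Example D.1 values: known deviation sigma = 0.3 from N = 10 samples.
sigma, N = 0.3, 10
uncertainty = sigma / math.sqrt(N)   # ~0.095, as in the example

# z multipliers from the Normal credible-interval table above
z = {"50%": 0.6745, "68%": 0.9945, "95%": 1.9600, "99%": 2.5758}
halfwidths = {ci: zi * uncertainty for ci, zi in z.items()}
```

Each credible interval is then the estimate plus or minus the corresponding halfwidth.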
Degrees of Freedom
Credible Interval   1         2        3        4        5        6        7        8
50.0% 1.000σ 0.816σ 0.765σ 0.741σ 0.727σ 0.718σ 0.711σ 0.706σ
68.0% 1.819σ 1.312σ 1.189σ 1.134σ 1.104σ 1.084σ 1.070σ 1.060σ
90.0% 6.314σ 2.920σ 2.353σ 2.132σ 2.015σ 1.943σ 1.895σ 1.860σ
95.0% 12.706σ 4.303σ 3.182σ 2.776σ 2.571σ 2.447σ 2.365σ 2.306σ
99.0% 63.657σ 9.925σ 5.841σ 4.604σ 4.032σ 3.707σ 3.499σ 3.355σ
99.8% 318.309σ 22.327σ 10.215σ 7.173σ 5.893σ 5.208σ 4.785σ 4.501σ
99.995% 12732.395σ 141.416σ 35.298σ 18.522σ 12.893σ 10.261σ 8.783σ 7.851σ
Degrees of Freedom
Credible Interval   9         10       11       12       13       14       15       16
50.0% 0.703σ 0.700σ 0.697σ 0.695σ 0.694σ 0.692σ 0.691σ 0.690σ
68.0% 1.053σ 1.046σ 1.041σ 1.037σ 1.034σ 1.031σ 1.029σ 1.026σ
90.0% 1.833σ 1.812σ 1.796σ 1.782σ 1.771σ 1.761σ 1.753σ 1.746σ
95.0% 2.262σ 2.228σ 2.201σ 2.179σ 2.160σ 2.145σ 2.131σ 2.120σ
99.0% 3.250σ 3.169σ 3.106σ 3.055σ 3.012σ 2.977σ 2.947σ 2.921σ
99.8% 4.297σ 4.144σ 4.025σ 3.930σ 3.852σ 3.787σ 3.733σ 3.686σ
99.995% 7.215σ 6.757σ 6.412σ 6.143σ 5.928σ 5.753σ 5.607σ 5.484σ
Degrees of Freedom
Credible Interval   17        18       19       20       21       22       23       24
50.0% 0.689σ 0.688σ 0.688σ 0.687σ 0.686σ 0.686σ 0.685σ 0.685σ
68.0% 1.024σ 1.023σ 1.021σ 1.020σ 1.019σ 1.017σ 1.016σ 1.015σ
90.0% 1.740σ 1.734σ 1.729σ 1.725σ 1.721σ 1.717σ 1.714σ 1.711σ
95.0% 2.110σ 2.101σ 2.093σ 2.086σ 2.080σ 2.074σ 2.069σ 2.064σ
99.0% 2.898σ 2.878σ 2.861σ 2.845σ 2.831σ 2.819σ 2.807σ 2.797σ
99.8% 3.646σ 3.610σ 3.579σ 3.552σ 3.527σ 3.505σ 3.485σ 3.467σ
99.995% 5.379σ 5.288σ 5.209σ 5.139σ 5.077σ 5.022σ 4.972σ 4.927σ
Degrees of Freedom
Credible Interval   25        26       27       28       29       30       31       32
50.0% 0.684σ 0.684σ 0.684σ 0.683σ 0.683σ 0.683σ 0.682σ 0.682σ
68.0% 1.015σ 1.014σ 1.013σ 1.012σ 1.012σ 1.011σ 1.011σ 1.010σ
90.0% 1.708σ 1.706σ 1.703σ 1.701σ 1.699σ 1.697σ 1.696σ 1.694σ
95.0% 2.060σ 2.056σ 2.052σ 2.048σ 2.045σ 2.042σ 2.040σ 2.037σ
99.0% 2.787σ 2.779σ 2.771σ 2.763σ 2.756σ 2.750σ 2.744σ 2.738σ
99.8% 3.450σ 3.435σ 3.421σ 3.408σ 3.396σ 3.385σ 3.375σ 3.365σ
99.995% 4.887σ 4.849σ 4.816σ 4.784σ 4.756σ 4.729σ 4.705σ 4.682σ
Degrees of Freedom
Credible Interval   33        34       35       36       37       38       39       40
50.0% 0.682σ 0.682σ 0.682σ 0.681σ 0.681σ 0.681σ 0.681σ 0.681σ
68.0% 1.010σ 1.009σ 1.009σ 1.008σ 1.008σ 1.008σ 1.007σ 1.007σ
90.0% 1.692σ 1.691σ 1.690σ 1.688σ 1.687σ 1.686σ 1.685σ 1.684σ
95.0% 2.035σ 2.032σ 2.030σ 2.028σ 2.026σ 2.024σ 2.023σ 2.021σ
99.0% 2.733σ 2.728σ 2.724σ 2.719σ 2.715σ 2.712σ 2.708σ 2.704σ
99.8% 3.356σ 3.348σ 3.340σ 3.333σ 3.326σ 3.319σ 3.313σ 3.307σ
99.995% 4.660σ 4.640σ 4.622σ 4.604σ 4.588σ 4.572σ 4.558σ 4.544σ
Example D.2 Usage of the Credible Interval Table for the Student’s t Distribution
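The t-table lookup works the same way as the Normal one, except the multiplier depends on the degrees of freedom, $N-1$. A minimal sketch: with $N = 10$ samples there are 9 degrees of freedom, and the 95% entry in the df = 9 column above is 2.262. (The sample deviation value 0.3 here is only an illustrative assumption, not from the text.)

```python
import math

N = 10        # samples -> N - 1 = 9 degrees of freedom
t95 = 2.262   # 95% multiplier for df = 9, read from the table above
s = 0.3       # assumed sample deviation (illustrative only)

# halfwidth of the 95% credible interval: t * s / sqrt(N)
halfwidth = t95 * s / math.sqrt(N)
```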
[Figure: the standard normal P(z); the table gives the shaded area to the left of z]
z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left
-3.70 0.0001 -3.40 0.0003 -3.10 0.0010 -2.80 0.0026 -2.50 0.0062
-3.69 0.0001 -3.39 0.0003 -3.09 0.0010 -2.79 0.0026 -2.49 0.0064
-3.68 0.0001 -3.38 0.0004 -3.08 0.0010 -2.78 0.0027 -2.48 0.0066
-3.67 0.0001 -3.37 0.0004 -3.07 0.0011 -2.77 0.0028 -2.47 0.0068
-3.66 0.0001 -3.36 0.0004 -3.06 0.0011 -2.76 0.0029 -2.46 0.0069
-3.65 0.0001 -3.35 0.0004 -3.05 0.0011 -2.75 0.0030 -2.45 0.0071
-3.64 0.0001 -3.34 0.0004 -3.04 0.0012 -2.74 0.0031 -2.44 0.0073
-3.63 0.0001 -3.33 0.0004 -3.03 0.0012 -2.73 0.0032 -2.43 0.0075
-3.62 0.0001 -3.32 0.0005 -3.02 0.0013 -2.72 0.0033 -2.42 0.0078
-3.61 0.0002 -3.31 0.0005 -3.01 0.0013 -2.71 0.0034 -2.41 0.0080
-3.60 0.0002 -3.30 0.0005 -3.00 0.0013 -2.70 0.0035 -2.40 0.0082
-3.59 0.0002 -3.29 0.0005 -2.99 0.0014 -2.69 0.0036 -2.39 0.0084
-3.58 0.0002 -3.28 0.0005 -2.98 0.0014 -2.68 0.0037 -2.38 0.0087
-3.57 0.0002 -3.27 0.0005 -2.97 0.0015 -2.67 0.0038 -2.37 0.0089
-3.56 0.0002 -3.26 0.0006 -2.96 0.0015 -2.66 0.0039 -2.36 0.0091
-3.55 0.0002 -3.25 0.0006 -2.95 0.0016 -2.65 0.0040 -2.35 0.0094
-3.54 0.0002 -3.24 0.0006 -2.94 0.0016 -2.64 0.0041 -2.34 0.0096
-3.53 0.0002 -3.23 0.0006 -2.93 0.0017 -2.63 0.0043 -2.33 0.0099
-3.52 0.0002 -3.22 0.0006 -2.92 0.0018 -2.62 0.0044 -2.32 0.0102
-3.51 0.0002 -3.21 0.0007 -2.91 0.0018 -2.61 0.0045 -2.31 0.0104
-3.50 0.0002 -3.20 0.0007 -2.90 0.0019 -2.60 0.0047 -2.30 0.0107
-3.49 0.0002 -3.19 0.0007 -2.89 0.0019 -2.59 0.0048 -2.29 0.0110
-3.48 0.0003 -3.18 0.0007 -2.88 0.0020 -2.58 0.0049 -2.28 0.0113
-3.47 0.0003 -3.17 0.0008 -2.87 0.0021 -2.57 0.0051 -2.27 0.0116
-3.46 0.0003 -3.16 0.0008 -2.86 0.0021 -2.56 0.0052 -2.26 0.0119
-3.45 0.0003 -3.15 0.0008 -2.85 0.0022 -2.55 0.0054 -2.25 0.0122
-3.44 0.0003 -3.14 0.0008 -2.84 0.0023 -2.54 0.0055 -2.24 0.0125
-3.43 0.0003 -3.13 0.0009 -2.83 0.0023 -2.53 0.0057 -2.23 0.0129
-3.42 0.0003 -3.12 0.0009 -2.82 0.0024 -2.52 0.0059 -2.22 0.0132
-3.41 0.0003 -3.11 0.0009 -2.81 0.0025 -2.51 0.0060 -2.21 0.0136
-3.40 0.0003 -3.10 0.0010 -2.80 0.0026 -2.50 0.0062 -2.20 0.0139
z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left
-2.20 0.0139 -1.90 0.0287 -1.60 0.0548 -1.30 0.0968 -1.00 0.1587
-2.19 0.0143 -1.89 0.0294 -1.59 0.0559 -1.29 0.0985 -0.99 0.1611
-2.18 0.0146 -1.88 0.0301 -1.58 0.0571 -1.28 0.1003 -0.98 0.1635
-2.17 0.0150 -1.87 0.0307 -1.57 0.0582 -1.27 0.1020 -0.97 0.1660
-2.16 0.0154 -1.86 0.0314 -1.56 0.0594 -1.26 0.1038 -0.96 0.1685
-2.15 0.0158 -1.85 0.0322 -1.55 0.0606 -1.25 0.1056 -0.95 0.1711
-2.14 0.0162 -1.84 0.0329 -1.54 0.0618 -1.24 0.1075 -0.94 0.1736
-2.13 0.0166 -1.83 0.0336 -1.53 0.0630 -1.23 0.1093 -0.93 0.1762
-2.12 0.0170 -1.82 0.0344 -1.52 0.0643 -1.22 0.1112 -0.92 0.1788
-2.11 0.0174 -1.81 0.0351 -1.51 0.0655 -1.21 0.1131 -0.91 0.1814
-2.10 0.0179 -1.80 0.0359 -1.50 0.0668 -1.20 0.1151 -0.90 0.1841
-2.09 0.0183 -1.79 0.0367 -1.49 0.0681 -1.19 0.1170 -0.89 0.1867
-2.08 0.0188 -1.78 0.0375 -1.48 0.0694 -1.18 0.1190 -0.88 0.1894
-2.07 0.0192 -1.77 0.0384 -1.47 0.0708 -1.17 0.1210 -0.87 0.1922
-2.06 0.0197 -1.76 0.0392 -1.46 0.0721 -1.16 0.1230 -0.86 0.1949
-2.05 0.0202 -1.75 0.0401 -1.45 0.0735 -1.15 0.1251 -0.85 0.1977
-2.04 0.0207 -1.74 0.0409 -1.44 0.0749 -1.14 0.1271 -0.84 0.2005
-2.03 0.0212 -1.73 0.0418 -1.43 0.0764 -1.13 0.1292 -0.83 0.2033
-2.02 0.0217 -1.72 0.0427 -1.42 0.0778 -1.12 0.1314 -0.82 0.2061
-2.01 0.0222 -1.71 0.0436 -1.41 0.0793 -1.11 0.1335 -0.81 0.2090
-2.00 0.0228 -1.70 0.0446 -1.40 0.0808 -1.10 0.1357 -0.80 0.2119
-1.99 0.0233 -1.69 0.0455 -1.39 0.0823 -1.09 0.1379 -0.79 0.2148
-1.98 0.0239 -1.68 0.0465 -1.38 0.0838 -1.08 0.1401 -0.78 0.2177
-1.97 0.0244 -1.67 0.0475 -1.37 0.0853 -1.07 0.1423 -0.77 0.2206
-1.96 0.0250 -1.66 0.0485 -1.36 0.0869 -1.06 0.1446 -0.76 0.2236
-1.95 0.0256 -1.65 0.0495 -1.35 0.0885 -1.05 0.1469 -0.75 0.2266
-1.94 0.0262 -1.64 0.0505 -1.34 0.0901 -1.04 0.1492 -0.74 0.2296
-1.93 0.0268 -1.63 0.0516 -1.33 0.0918 -1.03 0.1515 -0.73 0.2327
-1.92 0.0274 -1.62 0.0526 -1.32 0.0934 -1.02 0.1539 -0.72 0.2358
-1.91 0.0281 -1.61 0.0537 -1.31 0.0951 -1.01 0.1562 -0.71 0.2389
-1.90 0.0287 -1.60 0.0548 -1.30 0.0968 -1.00 0.1587 -0.70 0.2420
z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left
-0.70 0.2420 -0.40 0.3446 -0.10 0.4602 0.20 0.5793 0.50 0.6915
-0.69 0.2451 -0.39 0.3483 -0.09 0.4641 0.21 0.5832 0.51 0.6950
-0.68 0.2483 -0.38 0.3520 -0.08 0.4681 0.22 0.5871 0.52 0.6985
-0.67 0.2514 -0.37 0.3557 -0.07 0.4721 0.23 0.5910 0.53 0.7019
-0.66 0.2546 -0.36 0.3594 -0.06 0.4761 0.24 0.5948 0.54 0.7054
-0.65 0.2578 -0.35 0.3632 -0.05 0.4801 0.25 0.5987 0.55 0.7088
-0.64 0.2611 -0.34 0.3669 -0.04 0.4840 0.26 0.6026 0.56 0.7123
-0.63 0.2643 -0.33 0.3707 -0.03 0.4880 0.27 0.6064 0.57 0.7157
-0.62 0.2676 -0.32 0.3745 -0.02 0.4920 0.28 0.6103 0.58 0.7190
-0.61 0.2709 -0.31 0.3783 -0.01 0.4960 0.29 0.6141 0.59 0.7224
-0.60 0.2743 -0.30 0.3821 0.00 0.5000 0.30 0.6179 0.60 0.7257
-0.59 0.2776 -0.29 0.3859 0.01 0.5040 0.31 0.6217 0.61 0.7291
-0.58 0.2810 -0.28 0.3897 0.02 0.5080 0.32 0.6255 0.62 0.7324
-0.57 0.2843 -0.27 0.3936 0.03 0.5120 0.33 0.6293 0.63 0.7357
-0.56 0.2877 -0.26 0.3974 0.04 0.5160 0.34 0.6331 0.64 0.7389
-0.55 0.2912 -0.25 0.4013 0.05 0.5199 0.35 0.6368 0.65 0.7422
-0.54 0.2946 -0.24 0.4052 0.06 0.5239 0.36 0.6406 0.66 0.7454
-0.53 0.2981 -0.23 0.4090 0.07 0.5279 0.37 0.6443 0.67 0.7486
-0.52 0.3015 -0.22 0.4129 0.08 0.5319 0.38 0.6480 0.68 0.7517
-0.51 0.3050 -0.21 0.4168 0.09 0.5359 0.39 0.6517 0.69 0.7549
-0.50 0.3085 -0.20 0.4207 0.10 0.5398 0.40 0.6554 0.70 0.7580
-0.49 0.3121 -0.19 0.4247 0.11 0.5438 0.41 0.6591 0.71 0.7611
-0.48 0.3156 -0.18 0.4286 0.12 0.5478 0.42 0.6628 0.72 0.7642
-0.47 0.3192 -0.17 0.4325 0.13 0.5517 0.43 0.6664 0.73 0.7673
-0.46 0.3228 -0.16 0.4364 0.14 0.5557 0.44 0.6700 0.74 0.7704
-0.45 0.3264 -0.15 0.4404 0.15 0.5596 0.45 0.6736 0.75 0.7734
-0.44 0.3300 -0.14 0.4443 0.16 0.5636 0.46 0.6772 0.76 0.7764
-0.43 0.3336 -0.13 0.4483 0.17 0.5675 0.47 0.6808 0.77 0.7794
-0.42 0.3372 -0.12 0.4522 0.18 0.5714 0.48 0.6844 0.78 0.7823
-0.41 0.3409 -0.11 0.4562 0.19 0.5753 0.49 0.6879 0.79 0.7852
-0.40 0.3446 -0.10 0.4602 0.20 0.5793 0.50 0.6915 0.80 0.7881
z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left
0.80 0.7881 1.10 0.8643 1.40 0.9192 1.70 0.9554 2.00 0.9772
0.81 0.7910 1.11 0.8665 1.41 0.9207 1.71 0.9564 2.01 0.9778
0.82 0.7939 1.12 0.8686 1.42 0.9222 1.72 0.9573 2.02 0.9783
0.83 0.7967 1.13 0.8708 1.43 0.9236 1.73 0.9582 2.03 0.9788
0.84 0.7995 1.14 0.8729 1.44 0.9251 1.74 0.9591 2.04 0.9793
0.85 0.8023 1.15 0.8749 1.45 0.9265 1.75 0.9599 2.05 0.9798
0.86 0.8051 1.16 0.8770 1.46 0.9279 1.76 0.9608 2.06 0.9803
0.87 0.8078 1.17 0.8790 1.47 0.9292 1.77 0.9616 2.07 0.9808
0.88 0.8106 1.18 0.8810 1.48 0.9306 1.78 0.9625 2.08 0.9812
0.89 0.8133 1.19 0.8830 1.49 0.9319 1.79 0.9633 2.09 0.9817
0.90 0.8159 1.20 0.8849 1.50 0.9332 1.80 0.9641 2.10 0.9821
0.91 0.8186 1.21 0.8869 1.51 0.9345 1.81 0.9649 2.11 0.9826
0.92 0.8212 1.22 0.8888 1.52 0.9357 1.82 0.9656 2.12 0.9830
0.93 0.8238 1.23 0.8907 1.53 0.9370 1.83 0.9664 2.13 0.9834
0.94 0.8264 1.24 0.8925 1.54 0.9382 1.84 0.9671 2.14 0.9838
0.95 0.8289 1.25 0.8944 1.55 0.9394 1.85 0.9678 2.15 0.9842
0.96 0.8315 1.26 0.8962 1.56 0.9406 1.86 0.9686 2.16 0.9846
0.97 0.8340 1.27 0.8980 1.57 0.9418 1.87 0.9693 2.17 0.9850
0.98 0.8365 1.28 0.8997 1.58 0.9429 1.88 0.9699 2.18 0.9854
0.99 0.8389 1.29 0.9015 1.59 0.9441 1.89 0.9706 2.19 0.9857
1.00 0.8413 1.30 0.9032 1.60 0.9452 1.90 0.9713 2.20 0.9861
1.01 0.8438 1.31 0.9049 1.61 0.9463 1.91 0.9719 2.21 0.9864
1.02 0.8461 1.32 0.9066 1.62 0.9474 1.92 0.9726 2.22 0.9868
1.03 0.8485 1.33 0.9082 1.63 0.9484 1.93 0.9732 2.23 0.9871
1.04 0.8508 1.34 0.9099 1.64 0.9495 1.94 0.9738 2.24 0.9875
1.05 0.8531 1.35 0.9115 1.65 0.9505 1.95 0.9744 2.25 0.9878
1.06 0.8554 1.36 0.9131 1.66 0.9515 1.96 0.9750 2.26 0.9881
1.07 0.8577 1.37 0.9147 1.67 0.9525 1.97 0.9756 2.27 0.9884
1.08 0.8599 1.38 0.9162 1.68 0.9535 1.98 0.9761 2.28 0.9887
1.09 0.8621 1.39 0.9177 1.69 0.9545 1.99 0.9767 2.29 0.9890
1.10 0.8643 1.40 0.9192 1.70 0.9554 2.00 0.9772 2.30 0.9893
z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left    z     Area on Left
2.30 0.9893 2.60 0.9953 2.90 0.9981 3.20 0.9993 3.50 0.9998
2.31 0.9896 2.61 0.9955 2.91 0.9982 3.21 0.9993 3.51 0.9998
2.32 0.9898 2.62 0.9956 2.92 0.9982 3.22 0.9994 3.52 0.9998
2.33 0.9901 2.63 0.9957 2.93 0.9983 3.23 0.9994 3.53 0.9998
2.34 0.9904 2.64 0.9959 2.94 0.9984 3.24 0.9994 3.54 0.9998
2.35 0.9906 2.65 0.9960 2.95 0.9984 3.25 0.9994 3.55 0.9998
2.36 0.9909 2.66 0.9961 2.96 0.9985 3.26 0.9994 3.56 0.9998
2.37 0.9911 2.67 0.9962 2.97 0.9985 3.27 0.9995 3.57 0.9998
2.38 0.9913 2.68 0.9963 2.98 0.9986 3.28 0.9995 3.58 0.9998
2.39 0.9916 2.69 0.9964 2.99 0.9986 3.29 0.9995 3.59 0.9998
2.40 0.9918 2.70 0.9965 3.00 0.9987 3.30 0.9995 3.60 0.9998
2.41 0.9920 2.71 0.9966 3.01 0.9987 3.31 0.9995 3.61 0.9998
2.42 0.9922 2.72 0.9967 3.02 0.9987 3.32 0.9995 3.62 0.9999
2.43 0.9925 2.73 0.9968 3.03 0.9988 3.33 0.9996 3.63 0.9999
2.44 0.9927 2.74 0.9969 3.04 0.9988 3.34 0.9996 3.64 0.9999
2.45 0.9929 2.75 0.9970 3.05 0.9989 3.35 0.9996 3.65 0.9999
2.46 0.9931 2.76 0.9971 3.06 0.9989 3.36 0.9996 3.66 0.9999
2.47 0.9932 2.77 0.9972 3.07 0.9989 3.37 0.9996 3.67 0.9999
2.48 0.9934 2.78 0.9973 3.08 0.9990 3.38 0.9996 3.68 0.9999
2.49 0.9936 2.79 0.9974 3.09 0.9990 3.39 0.9997 3.69 0.9999
2.50 0.9938 2.80 0.9974 3.10 0.9990 3.40 0.9997 3.70 0.9999
2.51 0.9940 2.81 0.9975 3.11 0.9991 3.41 0.9997 3.71 0.9999
2.52 0.9941 2.82 0.9976 3.12 0.9991 3.42 0.9997 3.72 0.9999
2.53 0.9943 2.83 0.9977 3.13 0.9991 3.43 0.9997 3.73 0.9999
2.54 0.9945 2.84 0.9977 3.14 0.9992 3.44 0.9997 3.74 0.9999
2.55 0.9946 2.85 0.9978 3.15 0.9992 3.45 0.9997 3.75 0.9999
2.56 0.9948 2.86 0.9979 3.16 0.9992 3.46 0.9997 3.76 0.9999
2.57 0.9949 2.87 0.9979 3.17 0.9992 3.47 0.9997 3.77 0.9999
2.58 0.9951 2.88 0.9980 3.18 0.9993 3.48 0.9997 3.78 0.9999
2.59 0.9952 2.89 0.9981 3.19 0.9993 3.49 0.9998 3.79 0.9999
2.60 0.9953 2.90 0.9981 3.20 0.9993 3.50 0.9998 3.80 0.9999
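The whole table can be reproduced with the error function from Python's standard library; the cumulative area to the left of z for the standard normal is $\Phi(z) = \tfrac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$ (the function name here is my own):

```python
import math

def area_on_left(z):
    """Cumulative area of the standard normal to the left of z,
    the quantity tabulated above, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# agrees with the table, e.g. z = 1.00 -> 0.8413
print(round(area_on_left(1.00), 4))  # 0.8413
```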