Chapter 4

Fundamentals of Probability, Random Processes and Statistics
Example 4.2. Consider an experiment in which we flip two individual coins (e.g.,
a quarter and nickel) multiple times and record the outcome which consists of an
ordered pair. The sample space and σ-field of events are thus
$$\Omega = \{(H,H),\,(H,T),\,(T,H),\,(T,T)\}\,, \qquad \mathcal{F} = 2^{\Omega}.$$
A realization of a random variable X is the value
$$x = X(\omega).$$
We note that in the statistics literature, many authors employ the same nota-
tion for the random variable and realization and let the context dictate the meaning.
For those who are new to the field, this can obscure the meaning and, to the degree
possible, we will use different notation for random variables and their realizations.
Example 4.6. Consider the experiment of Example 4.2 in which our event ω consists of a single flip of a pair of coins. We define X(ω) to be the number of heads associated with the event so that
X(H, H) = 2
X(H, T ) = X(T, H) = 1
X(T, T ) = 0.
For x < 0, the probability of finding an event ω ∈ Ω such that X(ω) ≤ x is 0 so
F_X(x) = 0 for x < 0. Similar analysis yields the cdf relation
$$F_X(x) = \begin{cases} 0\,, & x < 0 \\ 1/4\,, & 0 \le x < 1 \\ 3/4\,, & 1 \le x < 2 \\ 1\,, & x \ge 2 \end{cases}$$
which is plotted in Figure 4.1.
(i) f_X(x) ≥ 0

(ii) $\int_{\mathbb{R}} f_X(x)\,dx = 1$

(iii) $P(x_1 \le X \le x_2) = F_X(x_2) - F_X(x_1) = \int_{x_1}^{x_2} f_X(x)\,dx$
follow immediately from the definition and (4.3). The attributes of density functions
can be further specified by designating their location or centrality, their spread or
variability, their symmetry, and the contribution of tail behavior. In general, this
information is provided by moments
$$E(X^n) = \int_{\mathbb{R}} x^n f_X(x)\,dx\,.$$
The mean µ = E(X), also termed the first moment or expected value, provides a measure of the density's
central location whereas the second central moment
$$\sigma^2 = \text{var}(X) = E[(X-\mu)^2] = \int_{\mathbb{R}} (x-\mu)^2 f_X(x)\,dx \qquad (4.4)$$
quantifies the variability about the mean.
Figure 4.2. (a) Normal density with µ = 0.5 and σ = 0.4 and areas within 1σ and 2σ of µ. (b) Uniform density on the interval [a, b].
Chi-squared distributions naturally arise when evaluating the sum of squares error
between measured data and model values when estimating model parameters.
Definition 4.12 (Student's t-Distribution). Let X ∼ N(0, 1) and Z ∼ χ²(k) be independent random variables. The random variable
$$T = \frac{X}{\sqrt{Z/k}}$$
has a Student's t-distribution with k degrees of freedom.
Figure 4.3. (a) Chi-squared density for k = 1, · · · , 5 and (b) Student's t-density with k = 1, 2, 10 compared with the normal density with µ = 0, σ = 1.
It will be shown in Section 7.2 that the t-distribution naturally arises when
estimating the mean of a population when the sample size is relatively small and
the population variance is unknown.
On a historical note, aspects of this theory were developed by William Sealy
Gosset, an employee of the Guinness brewery in Dublin, in an effort to select
optimally yielding varieties of barley based on relatively small sample sizes. To improve
perception following the recent disclosure of confidential information by another
employee, Gosset was only allowed to publish under the pseudonym "Student." The
importance of his work was advocated by both Karl Pearson and R.A. Fisher.
and the mean and variance are E(Y) = β/(α − 1) for α > 1 and var(Y) = β²/[(α − 1)²(α − 2)] for α > 2.
As noted in Definition 4.13 and illustrated in Example 4.69, the inverse-gamma
distribution is the conjugate prior for normal likelihoods that are functions
of the variance. The equivalence (4.9) can be used to generate random inverse-gamma
values using the MATLAB Statistics Toolbox command gamrnd.m. Since
x = gamrnd(α, b) is parameterized in terms of the scale parameter b, one would
employ the command y = 1./gamrnd(α, b), with b = 1/β, to generate realizations of
Y ∼ Inv-gam(α, β). A technique to construct random realizations from the inverse-gamma
distribution, if gamrnd.m is not available, is discussed at the end of this section.
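If MATLAB is unavailable, the same gamma-to-inverse-gamma equivalence can be sketched in Python; this is an illustrative alternative to the commands in the text, and the parameters α = 3, β = 2 and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def inv_gamma_rvs(alpha, beta, size, rng):
    # If X ~ Gamma(shape=alpha, scale=1/beta), then 1/X ~ Inv-gam(alpha, beta)
    x = rng.gamma(shape=alpha, scale=1.0 / beta, size=size)
    return 1.0 / x

samples = inv_gamma_rvs(3.0, 2.0, 200_000, rng)
print(samples.mean())  # E(Y) = beta/(alpha - 1) = 1 for alpha = 3, beta = 2
```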
$$f_X(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$$
for x ∈ [0, 1]. As illustrated in Example 4.68, it is the conjugate prior for the binomial
likelihood. It is observed that if α = β = 1, the beta distribution is simply the
uniform distribution which is often used to provide noninformative priors. Realizations
from the beta distribution can be generated using the MATLAB command betarnd.m.
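A quick sketch of the α = β = 1 case, using Python for illustration rather than betarnd.m:

```python
import numpy as np

rng = np.random.default_rng(1)
# With alpha = beta = 1 the beta density reduces to the uniform density on [0, 1]
u = rng.beta(1.0, 1.0, size=100_000)
print(u.mean())  # near 1/2 for U(0, 1)
```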
Figure 4.4. Q-Q plot for (a) N(3, 4) and (b) U(0, 1) data as compared with N(0, 1) data.
Figure 4.5. (a) Histogram and approximating density. (b) Kernel basis function
and kernel density estimate.
$$x = F_X^{-1}(u).$$
One typically computes F_X^{-1}(u) using numerical algorithms. Even for an arbitrarily
fine mesh, the cost of this procedure is typically low.
This technique can be used in lieu of calling gamrnd.m if the MATLAB Statis-
tics Toolbox is unavailable.
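A sketch of this inverse CDF approach for a case where F_X^{-1} has a closed form; Python is used for illustration, and the exponential distribution with rate λ = 2 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)

# Inverse CDF sampling: draw U ~ U(0,1) and set x = F^{-1}(u).
# For the exponential distribution F(x) = 1 - exp(-lam*x),
# the inverse is available in closed form.
def exp_inv_cdf(u, lam):
    return -np.log1p(-u) / lam

u = rng.uniform(size=200_000)
x = exp_inv_cdf(u, lam=2.0)
print(x.mean())  # near 1/lam = 0.5
```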
Definition 4.19 (Joint CDF). For a random vector X, the associated joint cdf
F_X : ℝⁿ → [0, 1] is defined by
$$F_X(x) = P(X_1 \le x_1, \cdots, X_n \le x_n).$$
We note that if X and Y are independent, then cov(X, Y) = ρ_XY = 0 and the
random variables are uncorrelated. The converse is not true in general since the
relation (4.15) quantifies only linear dependencies among random variables.
Returning to the case of n random variables, it is shown in [95] that
$$\text{var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2\, \text{var}(X_i) + 2 \sum_{i<j} a_i a_j\, \text{cov}(X_i, X_j) \qquad (4.16)$$
which simplifies to
$$\text{var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2\, \text{var}(X_i) \qquad (4.17)$$
when the random variables are uncorrelated.
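As a numerical sanity check of (4.16), the following Python sketch compares the sample variance of a linear combination with a'Ca, which expands to the right-hand side of (4.16); the weights a and covariance matrix C are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical weights and covariance matrix for the check
a = np.array([1.0, -2.0, 0.5])
C = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, -0.2],
              [0.0, -0.2, 0.5]])
X = rng.multivariate_normal(np.zeros(3), C, size=400_000)
lhs = np.var(X @ a)   # sample variance of the linear combination
rhs = a @ C @ a       # equals sum a_i^2 var(X_i) + 2 sum_{i<j} a_i a_j cov(X_i, X_j)
print(lhs, rhs)
```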
Like the univariate normal, the multivariate normal distribution plays a central
role in uncertainty quantification and model validation.
We use the next theorem when constructing proposal functions for the MCMC
algorithms detailed in Chapter 8.
Figure 4.6. (a) Marginal density f_{X₂}(x₂) and (b) conditional density f_{X₁|X₂}(x₁|x₂)
at x₂ = 1/2 for a normal joint density f_X(x₁, x₂) with covariance matrix V = 0.09I.

as plotted in Figure 4.6(b). We note that f_{X₁|X₂}(x₁|x₂) is a function of x₁. The
definition for f_{X₂|X₁}(x₂|x₁) is analogous. Similarly, for n jointly continuous random
variables X₁, · · · , Xₙ with joint density function f_X(x₁, · · · , xₙ) and marginal
density f_{X₁}(x₁), the conditional pdf of X₂, · · · , Xₙ given X₁ = x₁ is
$$f_{X_2,\cdots,X_n|X_1}(x_2, \cdots, x_n | x_1) = \frac{f_X(x_1, x_2, \cdots, x_n)}{f_{X_1}(x_1)}.$$
Definition 4.27 (Point and Interval Estimates). Consider a fixed but unknown
parameter q 2 Q ⇢ Rp . A point estimate is a vector in Rp that represents q. An
interval estimate provides an interval that quantifies the plausible location of com-
ponents of q. The mean, median, or mode of a sampling distribution are examples
of point estimates whereas confidence intervals are interval estimates.
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\,, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \qquad (4.21)$$
which are the sample mean and variance. We note that we employ n − 1 rather than
n in the expression for S² to ensure that it is unbiased. If we additionally assume
that Xᵢ ∼ N(µ, σ²), it is illustrated in [168] that the sampling distributions for X̄
and S² are
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\,, \qquad S^2 \sim \frac{\sigma^2}{n-1}\, \chi^2(n-1). \qquad (4.22)$$
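These sampling properties can be checked by simulation; the following Python sketch, with illustrative values µ = 2, σ = 3, and n = 10, verifies that var(X̄) ≈ σ²/n and E(S²) ≈ σ².

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative values: mu = 2, sigma = 3, samples of size n = 10
mu, sigma, n, reps = 2.0, 3.0, 10, 200_000
X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)       # the n - 1 divisor makes S^2 unbiased
print(xbar.var(), sigma**2 / n)  # var(Xbar) = sigma^2 / n
print(s2.mean(), sigma**2)       # E(S^2) = sigma^2
```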
It thus follows that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$ so that
$$P\left(-2 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 2\right) \approx 0.9545$$
since 95.45% of the area of a normal distribution lies within 2 standard deviations
of the mean. This implies that
$$P\left(\bar{X} - \frac{2\sigma}{\sqrt{n}} < \mu < \bar{X} + \frac{2\sigma}{\sqrt{n}}\right) \approx 0.9545.$$
Here $[\bar{X} - 2\sigma/\sqrt{n},\, \bar{X} + 2\sigma/\sqrt{n}]$ is an interval estimator for µ where both endpoints
are statistics since σ² is considered known. A (1 − α) × 100% confidence interval is
$[\bar{x} - 2\sigma/\sqrt{n},\, \bar{x} + 2\sigma/\sqrt{n}]$ where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the realized sample mean based on
n measurements, or realizations, xᵢ of the random variables Xᵢ.
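The frequentist interpretation, that roughly 95.45% of such intervals contain µ, can be sketched by simulation; the values µ = 0, σ = 1, n = 25 below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative values: known sigma, n = 25 observations per experiment
mu, sigma, n, reps = 0.0, 1.0, 25, 100_000
X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
half = 2.0 * sigma / np.sqrt(n)
covered = np.mean((xbar - half < mu) & (mu < xbar + half))
print(covered)  # close to 0.9545
```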
Example 4.33. We now turn to the problem of determining the confidence interval
for the mean µ of a normal distribution when the variance σ² is also unknown. To
estimate σ², we employ the statistic S² given by (4.21) which has the χ² distribution
(4.22). We thus have
$$X = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)\,, \qquad Z = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$$
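A quick Monte Carlo sketch, with illustrative values n = 8, µ = 1, σ = 2, confirms that the resulting ratio of these statistics has the heavier-tailed t(n − 1) distribution, whose variance is (n − 1)/(n − 3).

```python
import numpy as np

rng = np.random.default_rng(6)
# Illustrative values: n = 8 observations with mu = 1, sigma = 2
mu, sigma, n = 1.0, 2.0, 8
X = rng.normal(mu, sigma, size=(200_000, n))
# T = sqrt(n)(Xbar - mu)/S combines the two statistics above
T = np.sqrt(n) * (X.mean(axis=1) - mu) / X.std(axis=1, ddof=1)
print(T.var())  # var of t(k) is k/(k - 2) = 7/5 for k = n - 1 = 7
```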
are the random variable and realization in ℝᵖ that minimize the respective sum of
squares errors as illustrated in Figure 4.7(a). Details regarding the distribution of
q̂_OLS based on various assumptions regarding the distribution of the errors εᵢ are
provided in Chapter 7.
¹The use of the notation q̂_OLS to indicate the estimator is not universal and many texts
denote the least squares estimate by the hat-notation. Hence care must be taken to establish the
convention employed in the specific text.
Figure 4.7. (a) Ordinary least squares solution q_OLS to (4.25) and (b) maximum likelihood estimate q_MLE given by (4.27).
where the observed sample υ is fixed and q varies over all admissible parameter
values. The notation L_υ(q) is somewhat nonstandard but it highlights the fact that
the independent variable is q. Some authors use the notation
has the same functional form, the independent variable is now p, and υ and n are
known. Hence the likelihood function is continuous whereas the probability mass
function is discrete.
To illustrate, we consider (4.24) with the assumption that errors are iid,
unbiased, and normally distributed with true but unknown variance σ₀² so that
εᵢ ∼ N(0, σ₀²) and hence Υᵢ ∼ N(f(tᵢ, q₀), σ₀²). In this case q and σ² are both
parameters so the likelihood function is
$$L(q, \sigma | \upsilon) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-[\upsilon_i - f(t_i, q)]^2 / 2\sigma^2} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\sum_{i=1}^{n} [\upsilon_i - f(t_i, q)]^2 / 2\sigma^2} \qquad (4.27)$$
and the maximum likelihood estimate is
$$q_{MLE} = \operatorname*{argmax}_{q \in Q,\ \sigma^2 \in (0,\infty)} L(q, \sigma | \upsilon) \qquad (4.28)$$
where ∇f denotes the gradient of f with respect to q. It is observed that with the
assumption of iid, unbiased, normally distributed errors, the maximum likelihood
solution q_MLE to (4.29) is the same as the least squares estimate q_OLS specified by
(4.25). The equivalence between minimizing the sum of squares error and maximizing
the likelihood will be utilized when we construct proposal functions for the
MCMC techniques in Chapter 8.
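The equivalence can be illustrated with a grid search over a hypothetical one-parameter model f(t, q) = qt; all numerical values below are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical model f(t, q) = q*t with iid N(0, sigma0^2) errors
t = np.linspace(0.0, 1.0, 50)
q0, sigma0 = 2.0, 0.1
y = q0 * t + rng.normal(0.0, sigma0, size=t.size)

qs = np.linspace(1.5, 2.5, 2001)
sse = np.array([np.sum((y - q * t) ** 2) for q in qs])
loglik = -sse / (2.0 * sigma0**2)  # log-likelihood up to an additive constant
# Minimizing the sum of squares and maximizing the likelihood agree exactly
print(qs[np.argmin(sse)], qs[np.argmax(loglik)])
```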
In frequentist inference, the maximum likelihood estimate q_MLE is the parameter
value that makes the observed output most likely. It should not be interpreted
as the most likely parameter value resulting from the data since this would require
it to be a random variable which contradicts the tenets of frequentist analysis.
Note that X1 , X2 , · · · are typically not iid in this and following definitions. This
mode of convergence is weaker than almost sure convergence.
Examples of sequences that converge in probability but not almost surely are pro-
vided in [60]. This is sometimes referred to as convergence with probability 1.
at all points x where F_X(x) is continuous, then Xₙ is said to have a limiting random
variable X with distribution function F_X(x). In this case, Xₙ is said to converge
in distribution to X, which is often written as $X_n \xrightarrow{D} X$. Care must be taken when
using this notation since the convergence of random variables is defined in terms
The formulation of the weak Law of Large Numbers is similar except $\bar{X}_n \xrightarrow{P} \mu$.
These laws are of fundamental importance since they establish that the random
sample adequately represents the population in the sense that X̄n converges to the
mean µ.
Given the central role of the sample mean, it is natural to question the degree
to which its sampling distribution can be established. In Example 4.30, we noted
that if Xᵢ ∼ N(µ, σ²) then X̄ ∼ N(µ, σ²/n). The requirement of normally distributed
random variables is quite restrictive, however, so we relax this assumption
and pose the same question in the context of iid random variables from an arbitrary
distribution. The remarkable answer is provided by the Central Limit Theorem.
so that the limiting distribution of $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ is a normal distribution N(0, 1).
The theorem is often expressed as
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} Z$$
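A minimal Python sketch of the theorem, using iid U(0,1) draws as an arbitrary non-normal choice with µ = 1/2 and σ² = 1/12:

```python
import numpy as np

rng = np.random.default_rng(8)
# CLT sketch: standardized means of iid U(0,1) draws approach N(0,1)
n, reps = 200, 50_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)
xbar = rng.uniform(size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma
print(z.mean(), z.var())  # approximately 0 and 1
```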
which, as detailed in Example 3.5, models the flow of heat u in a structure having
uncertain di↵usivity ↵. Here ↵ is an example of a random field and the solution
T (t, x, !) is random for all pairs (t, x) of independent variables.
$$\mu(t) = E(X_t)\,, \quad t \in T$$
$$C(t, s) = \text{cov}(X_t, X_s) = E\left[(X_t - \mu(t))(X_s - \mu(s))\right]\,, \quad t, s \in T. \qquad (4.34)$$
Hence µ(t) quantifies the centrality of sample paths whereas C(t, s) quantifies their
variability about µ(t).
$$X_t \sim N(\mu(t), C(t))$$
where t = [t₁, · · · , tₙ], µ(t) = [E(X_{t₁}), · · · , E(X_{tₙ})] and [C(t)]ᵢⱼ = cov(X_{tᵢ}, X_{tⱼ}) for
all 1 ≤ i, j ≤ n. A Gaussian process is thus a probability distribution for a function.
The concept of stationarity is important in the theory of Markov chains since
it provides criteria specifying when MCMC methods can be expected to converge to
posterior distributions for parameters. We consider this in the context of a discrete
index set T but note that a similar definition holds for continuous index sets.
$$X = \{X_x\,,\ x \in \mathcal{X}\}$$
we assume that the transition probabilities are the same for all time which yields a
homogeneous Markov chain.
We let pᵢⱼ denote the probability of moving from xᵢ to xⱼ in one step so that
$$P = [p_{ij}]\,, \quad 1 \le i, j \le k.$$
The initial density, which is often termed mass when it is discrete, is given by
p0 = [p01 , · · · , p0k ]
where p0i = P (X0 = xi ). Because p0 and P contain probabilities, their entries are
nonnegative and the elements of p₀ and rows of P must sum to unity. Matrices
satisfying this property are termed row-stochastic matrices.
Given an initial distribution and transition kernel, the distribution after one step
is p₁ = p₀P and
$$p_n = p_{n-1} P = p_0 P^n$$
after n steps. We illustrate these concepts in the next example.
Example 4.51. Various studies have indicated that factors such as weather, in-
juries, and unquantifiable concepts such as hitting streaks lend a random nature to
baseball [7]. We assume that a team that won its previous game has a 70% chance
of winning their next game and 30% chance of losing whereas a losing team wins
40% and loses 60% of their next games. Hence the probability of winning or losing
the next game is conditioned on a team’s last performance.
This yields the two-state Markov chain illustrated in Figure 4.8 where
S = {win, lose}.
Figure 4.8. Markov chain quantifying the probability of winning or losing based on the last performance.
is the percentage of teams who won and lost their last games. To illustrate, we take
p₀ = [0.8, 0.2]. We assume a schedule in which teams play at different times so p₀ᵥᵥ
and p₀ℓ do not both have to be 0.5.
The percentage of teams who win/lose their next game is given by
$$p_1 = [0.8\,,\ 0.2] \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix} = [0.64\,,\ 0.36]$$
so the distribution after n games is
$$p_n = [0.8\,,\ 0.2] \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}^n.$$
The distributions for n = 0, · · · , 10 are compiled in Table 4.1. These numerical
results indicate that the distribution is converging to a stationary value.
For this example, we can explicitly compute a limiting distribution π by solving
the constrained relation
$$\pi = \pi P\,, \qquad \sum_i \pi_i = 1$$
$$\Rightarrow\ [\pi_{win}\,, \pi_{lose}] \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix} = [\pi_{win}\,, \pi_{lose}]\,, \qquad \pi_{win} + \pi_{lose} = 1$$
to obtain
$$\pi = [0.5714\,,\ 0.4286].$$
In general, however, we cannot solve explicitly for a stationary value and instead
must establish the manner in which pₙ converges to π. We next discuss the nature of
this convergence and summarize criteria that guarantee the existence of a unique
limiting value.
n pn n pn n pn
0 [0.8000 , 0.2000] 4 [0.5733 , 0.4267] 8 [0.5714 , 0.4286]
1 [0.6400 , 0.3600] 5 [0.5720 , 0.4280] 9 [0.5714 , 0.4286]
2 [0.5920 , 0.4080] 6 [0.5716 , 0.4284] 10 [0.5714 , 0.4286]
3 [0.5776 , 0.4224] 7 [0.5715 , 0.4285]
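The entries of Table 4.1 can be reproduced by iterating p_n = p_{n−1}P; Python is used here as an illustrative alternative to the text's MATLAB setting.

```python
import numpy as np

# Transition matrix and initial distribution from Example 4.51
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
p = np.array([0.8, 0.2])
for _ in range(10):
    p = p @ P   # p_n = p_{n-1} P
print(p)        # approaches the stationary value [0.5714, 0.4286]
```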
As detailed in Section 4.4, it does not make sense to directly consider limits
$\lim_{n\to\infty} X_n$ of random variables. Instead, we consider the limit
$$\lim_{n\to\infty} p_n = \pi$$
which is convergence in distribution. We note that if this limit exists, it must satisfy
$$\pi = \lim_{n\to\infty} p_0 P^n = \lim_{n\to\infty} p_0 P^{n+1} = \left(\lim_{n\to\infty} p_0 P^n\right) P = \pi P.$$
Example 4.54. Consider the Markov chain depicted in Figure 4.9(a) with the
transition matrix
$$P = \begin{bmatrix} 0 & 1/3 & 2/3 & 0 \\ 1/3 & 0 & 0 & 2/3 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
The chain is clearly reducible since p3j = 0 for j = 1, 2, 4. Furthermore, it is easy
to verify that ⇡ = [0, 0, 1, 0] and ⇡ = [0, 0, 0, 1] are both stationary distributions.
The property of irreducibility is required to guarantee that ⇡ is unique.
Figure 4.9. (a) Reducible chain for Example 4.54, and (b) periodic chain for Example 4.56.
Example 4.56. The Markov chain depicted in Figure 4.9(b) with the transition
matrix
$$P = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$
has the unique stationary distribution π = [1/6 , 1/6 , 1/3 , 1/6 , 1/6]. It is established
in Exercise 4.8 that if p₀ = [1/2, 0, 0, 1/2, 0], then p₃ = p₆ = p₉ = · · · = p₀ so
the period is k = 3. Because mass cycles through the chain at a regular interval,
it does not converge so $\lim_{n\to\infty} p_n$ does not exist. Furthermore, it is demonstrated
in Exercise 4.9 that if the limit of a periodic chain exists for one initial distribution,
other distributions can yield different limits. Hence aperiodicity is required to
guarantee that the limit exists.
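A short numerical check of these claims, with P, π, and p₀ transcribed from the example; Python is used for illustration.

```python
import numpy as np

# Transition matrix, stationary distribution, and p0 from Example 4.56
P = np.array([[0,   1, 0, 0,   0],
              [0,   0, 1, 0,   0],
              [0.5, 0, 0, 0.5, 0],
              [0,   0, 0, 0,   1],
              [0,   0, 1, 0,   0]])
pi = np.array([1/6, 1/6, 1/3, 1/6, 1/6])
p0 = np.array([0.5, 0, 0, 0.5, 0])
p3 = p0 @ np.linalg.matrix_power(P, 3)
print(np.allclose(pi @ P, pi))  # pi is stationary
print(np.allclose(p3, p0))      # mass returns after 3 steps: period k = 3
```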
For infinite chains, one must additionally include conditions regarding the
persistence or recurrence of states. However, we will focus on finite Markov chains
for which it can be shown that if the chain is irreducible, all states are positive
persistent [119].
Before providing a theorem that establishes the convergence $\lim_{n\to\infty} p_n = \pi$,
we summarize relevant results from matrix theory.
There are several statements of the Perron-Frobenius theorem, and details and
proofs can be found in [119, 128, 217].
Theorem 4.59. For all finite stochastic matrices P, the largest eigenvalue is λ₀ = 1.
Further details are provided in [119] and the theorem is illustrated in Exercise 4.10.
The following theorem establishes the convergence of the Markov chain.
Theorem 4.61. Every finite, homogeneous Markov chain that is irreducible and
aperiodic, with transition matrix P , has a unique stationary distribution ⇡. More-
over, chains converge in the sense of distributions, limn!1 pn = ⇡, for every initial
distribution p0 .
Proof. It follows from Theorems 4.58–4.60 that the largest eigenvalue of P is
λ₀ = 1, which has multiplicity 1. There is thus a unique left eigenvector π that
satisfies πP = π and Σᵢ πᵢ = 1. To establish the convergence, we first consider the
eigendecomposition
$$U P V = \Lambda = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & \lambda_k \end{bmatrix}$$
where 1 > |λ₂| ≥ · · · ≥ |λ_k| and V = U⁻¹. It follows that
$$\lim_{n\to\infty} P^n = \lim_{n\to\infty} V \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & \lambda_2^n & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & \lambda_k^n \end{bmatrix} U = V \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & 0 \end{bmatrix} U.$$
The relation U P = Λ U implies that the first row of U can be taken to be π since
$$\begin{bmatrix} \pi_1 & \cdots & \pi_k \\ \vdots & & \vdots \\ u_{k1} & \cdots & u_{kk} \end{bmatrix} P = \begin{bmatrix} 1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_k \end{bmatrix} \begin{bmatrix} \pi_1 & \cdots & \pi_k \\ \vdots & & \vdots \\ u_{k1} & \cdots & u_{kk} \end{bmatrix}$$
and V = U⁻¹ implies that
$$U V = \begin{bmatrix} \pi_1 & \cdots & \pi_k \\ \vdots & & \vdots \\ u_{k1} & \cdots & u_{kk} \end{bmatrix} \begin{bmatrix} 1 & \cdots & v_{1k} \\ \vdots & & \vdots \\ 1 & \cdots & v_{kk} \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix}$$
since Σᵢ πᵢ = 1. This establishes that the first column of V is all ones. Finally,
$$\lim_{n\to\infty} p_n = \lim_{n\to\infty} p_0 P^n = \lim_{n\to\infty}\, [p_{01}\,, \cdots, p_{0k}] \begin{bmatrix} 1 & \cdots & v_{1k} \\ \vdots & & \vdots \\ 1 & \cdots & v_{kk} \end{bmatrix} \begin{bmatrix} 1 & & & \\ & \lambda_2^n & & \\ & & \ddots & \\ & & & \lambda_k^n \end{bmatrix} \begin{bmatrix} \pi_1 & \cdots & \pi_k \\ \vdots & & \vdots \\ u_{k1} & \cdots & u_{kk} \end{bmatrix}$$
$$= [p_{01}\,, \cdots, p_{0k}] \begin{bmatrix} 1 & \cdots & v_{1k} \\ \vdots & & \vdots \\ 1 & \cdots & v_{kk} \end{bmatrix} \begin{bmatrix} 1 & & & \\ & 0 & & \\ & & \ddots & \\ & & & 0 \end{bmatrix} \begin{bmatrix} \pi_1 & \cdots & \pi_k \\ \vdots & & \vdots \\ u_{k1} & \cdots & u_{kk} \end{bmatrix} = [\pi_1\,, \cdots, \pi_k] = \pi.$$
Theorem 4.61 establishes that finite Markov chains which are irreducible and
aperiodic will converge to a stationary distribution π. However, it is often difficult
or impossible to solve for π using the relations πP = π subject to Σᵢ πᵢ = 1.
The detailed balance condition provides an alternative that is straightforward to
implement in MCMC methods where the goal is to construct Markov chains whose
stationary distribution π is the posterior distribution for parameters.
Figure 4.10. Realizations of (a) a random differential equation and (b) sample paths of a stochastic differential equation.
which is interpreted as
$$Z(t) = Z_0 - \int_0^t a Z(s)\,ds + \int_0^t b\,dW(s)$$
$$S = \{x_1, \cdots, x_n\}\,, \quad x_j \in \mathbb{R}^N$$
Figure 4.11. Interpretation of a (a) frequentist 90% confidence interval and (b)
Bayesian 90% credible interval.
$$\pi(q|\upsilon) = \frac{\pi(\upsilon|q)\,\pi_0(q)}{\pi_\Upsilon(\upsilon)} \qquad (4.40)$$
where π₀(q) and π(q|υ) respectively denote the prior and posterior densities, π(υ|q)
is a likelihood, and the marginal density π_Υ(υ) is a normalization factor. Here
q = Q(ω) denotes realizations of Q. We note that the subscripts which indicate
specific random variables are typically dropped from the prior and posterior in
Bayesian analysis.
The prior density ⇡0 (q) quantifies any prior knowledge that may be known
about the parameter before data is taken into account. For example, one might
have prior information based on similar previous models, data that is similar to
previous data, or initial parameter densities that have been determined through
other means such as related experiments.
For most model calibration, however, one does not have such prior information
so one uses instead what is termed a noninformative prior. A common choice of
noninformative prior is the uniform density, or unnormalized uniform, posed on the
parameter support. For example, one might employ
$$\pi_0(q) = 1\,, \quad q \in (0, \infty),$$
for a positive parameter. This choice is improper in the sense that the integral of
π₀(q) is unbounded. It is recommended that a noninformative prior be used unless
good previous information is known since it is shown in Example 4.66 that incorrect
prior information can degrade (4.40) far more than a noninformative prior.
In “empirical Bayes” inference, one also encounters data-dependent priors in
which priors estimated using frequentist techniques such as maximum likelihood are
employed in the Bayesian model. It is argued in [35] that this double use of data
is problematic with small sample sizes and is at odds with the tenets of Bayesian
analysis.
The term π(υ|q), which is a function of q with υ fixed, quantifies the likelihood
L(q|υ) of observing υ given parameter realizations q as detailed in Section 4.3.2.
We will illustrate various choices for the likelihood function in the examples at the
end of this section and at the beginning of Chapter 8. The joint density is given by
Example 4.66. To illustrate (4.41) in a setting where the posterior density can
be computed explicitly, we consider the results from tossing a possibly biased coin.
The random variable
$$\Upsilon_i(\omega) = \begin{cases} 0\,, & \omega = T \\ 1\,, & \omega = H \end{cases}$$
represents the result from the ith toss and the parameter q is the probability of
getting heads. We now consider the probability of obtaining N1 heads and N0 tails
in a series of N = N0 + N1 flips of the coin.
Because coin flips are independent events with only two possible outcomes,
the likelihood of observing a sequence υ = [υ₁, · · · , υ_N], given the probability q, is
$$\pi(\upsilon|q) = \prod_{i=1}^{N} q^{\upsilon_i} (1-q)^{1-\upsilon_i} = q^{\sum_i \upsilon_i} (1-q)^{N - \sum_i \upsilon_i} = q^{N_1} (1-q)^{N_0}$$
$$\pi(q|\upsilon) = \frac{q^{N_1}(1-q)^{N_0}}{\int_0^1 q^{N_1}(1-q)^{N_0}\,dq} = \frac{(N+1)!}{N_0!\, N_1!}\, q^{N_1}(1-q)^{N_0}.$$
We note that in this special case, the denominator is the integral of a beta function
which admits an analytic solution. In general, however, quadrature techniques must
be employed to approximate the integral.
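The closed-form posterior can also be checked numerically; this Python sketch, with an arbitrarily chosen grid resolution, verifies the normalization and the mode N₁/N for the 5 Heads, 9 Tails case discussed below.

```python
import numpy as np
from math import factorial

# Flat-prior coin posterior: pi(q|data) = (N+1)!/(N0! N1!) q^N1 (1-q)^N0,
# a Beta(N1 + 1, N0 + 1) density; here N1 = 5 heads and N0 = 9 tails
N1, N0 = 5, 9
N = N1 + N0
q = np.linspace(0.0, 1.0, 1001)
post = factorial(N + 1) / (factorial(N0) * factorial(N1)) * q**N1 * (1 - q)**N0
mode = q[np.argmax(post)]
mass = np.sum(post) * (q[1] - q[0])  # crude quadrature of the density
print(mode, mass)  # mode near N1/N = 5/14, total mass near 1
```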
For a fair coin with q₀ = 1/2, the posterior densities associated with various
realizations N₁ and N₀ are plotted in Figure 4.12. It is first observed that Bayesian
inference yields a posterior density with just one experiment whereas frequentist
analysis would specify a probability of either 0 or 1. It is also observed that the
variability of π(q|υ) decreases as N increases. Finally, the manner in which the data
informs the density is illustrated by comparing the results with 5 Heads, 9 Tails,
which has a mode of 0.36, to those of 49 Heads, 51 Tails, which has a mode of 0.495.
This illustrates that the method is achieving the goal of having the data inform
when there is no prior information.
We next illustrate the effect of a poor choice for the prior density. For the
same fair coin (q₀ = 1/2), we consider the choice
$$\pi_0(q) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(q-\mu)^2/2\sigma^2}$$
with µ = 0.3 and σ = 0.1. We cannot analytically evaluate the denominator in this
case so we instead employ Gaussian quadrature. As illustrated in Figure 4.13, even
for a realization of 50 Heads and 50 Tails, the mean of the posterior is still smaller
than q₀ = 1/2 but is significantly better than the result for 5 Heads and 5 Tails. This
illustrates the manner in which a poor informative prior can have negative impact
for a large number of observations. Hence if the validity of an informative prior is
in doubt, it is recommended that a noninformative prior be used instead.
Figure 4.12. Posterior densities associated with a noninformative prior for three realizations of the coin toss experiment.
Figure 4.13. Posterior densities associated with a poor informative prior for two realizations of the coin toss experiment.
Conjugate Priors
Definition 4.67 (Conjugacy). The property that the prior and posterior distributions
have the same parametric form is termed conjugacy. When this occurs, the
prior π₀(q) is termed a conjugate prior for the likelihood π(υ|q). Parameters in the
prior relation are often termed prior hyperparameters to distinguish them from the
model parameters q. The corresponding parameters in the posterior relation are
called posterior hyperparameters.
The use of conjugate priors, when possible, is advantageous since closed form
expressions for the posterior are then available. This will be used when estimating
densities for measurement errors in Chapter 8.
Consider
$$\pi(\upsilon|q) = q^{N_1}(1-q)^{N-N_1}\,, \qquad N_1 = \sum_{i=1}^{N} \upsilon_i$$
used for the likelihood in the coin toss Example 4.66. We observe that if the prior is
parameterized similarly, the product of the prior and likelihood will be in the same
family. Specifically, we take π₀(q) to be a beta density with hyperparameters α and
β so that π₀(q) ∝ q^{α−1}(1−q)^{β−1} as shown in Definition 4.16. It then follows that
the posterior density satisfies
$$\pi(q|\upsilon) \propto q^{N_1}(1-q)^{N-N_1} \cdot q^{\alpha-1}(1-q)^{\beta-1} = q^{N_1+\alpha-1}(1-q)^{N-N_1+\beta-1}\,,$$
which is again a beta density, now with posterior hyperparameters N₁ + α and N − N₁ + β.
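The resulting conjugate update reduces to incrementing the hyperparameters by the observed counts; a minimal Python sketch with a hypothetical helper name:

```python
# Beta-binomial conjugate update: a Beta(alpha, beta) prior and N1 successes
# in N trials yield a Beta(alpha + N1, beta + N - N1) posterior.
# The helper name is illustrative, not from the text.
def beta_binomial_update(alpha, beta, N1, N):
    return alpha + N1, beta + (N - N1)

# Flat prior (alpha = beta = 1) with 5 heads in 14 tosses:
print(beta_binomial_update(1, 1, 5, 14))  # (6, 10)
```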
Example 4.69. Here we consider normally distributed random variables with known
mean µ and unknown variance σ². This will illustrate techniques employed in Chapter 8
to estimate the unknown variance σ₀² of measurement errors. As detailed in
Section 4.3.2, the likelihood of observing n iid measurements υ = [υ₁, · · · , υₙ] under
these assumptions is
$$\pi(\upsilon|\sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-SS/2\sigma^2}$$
where the sum of squares error is
$$SS = \sum_{j=1}^{n} (\upsilon_j - \mu)^2.$$
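Viewed as a function of σ² with µ known, this likelihood is maximized at σ² = SS/n, which a grid search confirms; the following Python sketch uses arbitrary illustrative values for µ, σ₀, and n.

```python
import numpy as np

rng = np.random.default_rng(9)
# Illustrative synthetic data with known mean and true variance sigma0^2
mu, sigma0, n = 0.0, 0.5, 200
v = rng.normal(mu, sigma0, size=n)
ss = np.sum((v - mu) ** 2)  # sum of squares error SS

def log_likelihood(sigma2):
    # log of pi(v | sigma^2) = (2 pi sigma^2)^{-n/2} exp(-SS / 2 sigma^2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ss / (2 * sigma2)

grid = np.linspace(0.05, 1.0, 4000)
s2_hat = grid[np.argmax(log_likelihood(grid))]
print(s2_hat, ss / n)  # the maximizer agrees with SS/n to grid resolution
```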