Applied Bayesian Statistics
Sharif Mahmood
Institute of Statistical Research and Training
University of Dhaka
Rev. T. Bayes (1702-1761)
2 / 352
Syllabus
Bayes Analysis
◮ Introduction and background, Why Bayes?
◮ Bayes theorem, components of Bayes theorem: likelihood, prior and
  posterior.
◮ Probability theory and estimation.
◮ Concept of priors, informative and non-informative priors, proper
  and improper priors, discrete priors, conjugate priors,
  semi-conjugate priors, credible intervals, Bayesian hypothesis testing,
  building a predictive model.
Bayesian inference and prediction
◮ Single parameter models: binomial model, Poisson model, normal
  with known variance, normal with known mean.
◮ Multi-parameter models: concepts of nuisance parameters, normal
  model with non-informative, conjugate, and semi-conjugate priors,
  multinomial model with Dirichlet prior, multivariate normal model,
  methods of prior specification; methods of evaluating Bayes
  estimators.
3 / 352
Syllabus
Summarizing posterior distributions:
◮ Introduction, approximate methods: numerical integration methods
◮ Bayesian central limit theorem
◮ Simulation methods: direct sampling and rejection sampling,
  importance sampling, Markov chain Monte Carlo (MCMC) methods:
  Gibbs sampler, general properties of the Gibbs sampler,
  Metropolis algorithm, Metropolis-Hastings (MH) sampling,
  relationship between Gibbs and MH sampling, MCMC diagnostics,
  assessing convergence, acceptance rates of the MH algorithm.
◮ Bayesian Sample Size Computations
◮ Bayesian Inference for linear models, generalized linear models.
Statistical Analysis with Missing Data
◮ Missing data, missing data patterns, missing data mechanisms.
◮ Imputation procedures, e.g. mean imputation, hot deck imputation.
◮ Estimation of sampling variance in the presence of nonresponse.
◮ Likelihood based estimation and propensity score.
4 / 352
Introduction
Lecture 1
Sharif Mahmood
5 / 352
Basic Statistical Operations
[Diagram: basic statistical operations (Enumeration, Summary, Comparison,
Estimation, Hypothesis Testing, Inference) and how they connect.]
Principal approaches to statistical inference: Frequentist versus Bayesian.
6 / 352
Bayes Formula
Suppose that x = (x1, x2, . . . , xn) is a vector of n observations whose
probability distribution, p(x|θ), depends on the values of θ, where θ
denotes our parameters of interest.
Bayes' formula yields

    p(θ|x) = p(x|θ) p(θ) / p(x)

where p(x) = Σ_θ p(θ) p(x|θ) and the sum is over all possible values of θ.
Omitting the factor p(x), which does not depend on θ and can thus be
considered constant, yields the unnormalized posterior density

    p(θ|x) ∝ p(x|θ) p(θ)
7 / 352
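For a parameter taking only a few discrete values, this formula can be
evaluated directly. A minimal R sketch, with a hypothetical three-point
prior and a Binomial(10, θ) likelihood made up for illustration:

theta <- c(0.2, 0.5, 0.8)                    # possible values of theta
prior <- c(0.3, 0.4, 0.3)                    # p(theta), sums to 1
x <- 7                                       # 7 successes in 10 trials
lik <- dbinom(x, size = 10, prob = theta)    # p(x | theta)
posterior <- lik * prior / sum(lik * prior)  # normalized p(theta | x)
posterior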
Bayes Rule
A basic Bayesian model has two stages:
- Likelihood specification: x|θ ∼ f (x|θ) and prior specification:
θ ∼ π(θ)
- The prior distribution π(θ) is assumed known
The posterior distribution of θ is given by

    p(θ|x) = f(x|θ) π(θ) / ∫_Θ f(x|θ) π(θ) dθ

The denominator is known as the normalizing constant.
We have a posterior density, sampling density/likelihood,
prior density, and a normalizing constant (which we typically do
not need to find).
8 / 352
General Principle
Prior knowledge is combined with observed evidence, i.e. prior
knowledge about the density of the unknown parameter vector θ is
combined with sample data or experimental evidence x
Updated knowledge about θ is produced, in the form of the posterior
distribution p(θ|x) of the unknown θ.
A predictive distribution p(x̃|x) of unobserved data x̃.
Evaluation of integrals used to be difficult or impossible.
Monte Carlo computing methods now allow accurate estimation of
such integrals.
9 / 352
Example: Cancer Diagnosis
✑ A young person is diagnosed as having an extremely rare type of
   cancer. Let C stand for having cancer and + stand for a positive
   response to the test, and assume P(C) = 10^−6, P(+|C) = 0.99,
   and P(+|C^c) = 0.01.
Sol: Only one in a million of his age has the disease and the test is very
   good, giving only 1% false positives and 1% false negatives. The
   probability that he has cancer given a positive test is

   P(C|+) = P(+|C)P(C) / [P(+|C)P(C) + P(+|C^c)P(C^c)]
          = (.99)(10^−6) / [(.99)(10^−6) + (.01)(.999999)]
          = 0.00000099 / 0.01000098
          = 0.00009899

   The probability for a second case can be updated using 0.00009899
   as the prior probability of having the disease.
10 / 352
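A minimal R sketch of this calculation and of the sequential update
mentioned above (the function name bayes.update is ours, introduced here
only for illustration):

bayes.update <- function(prior, sens = 0.99, fpr = 0.01) {
  # posterior P(C | +) from Bayes' theorem
  sens * prior / (sens * prior + fpr * (1 - prior))
}
p1 <- bayes.update(1e-6)   # after the first positive test, about 0.000099
p2 <- bayes.update(p1)     # second case: reuse p1 as the prior
c(p1, p2)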
Example: Machine
✑ Three machines A, B and C produce 20%, 45% and 35% respectively of a
   factory's wheel nut output. 2%, 1% and 3% respectively of these
   machines' outputs are defective. What is the probability that a
   randomly selected wheel nut comes from machine A if it is defective?
Sol: Let D be the event that the wheel nut is defective and A, B, and
   C be the events that the selected wheel nut came from machines
   A, B and C respectively:

   P(A|D) = P(A)P(D|A) / [P(A)P(D|A) + P(B)P(D|B) + P(C)P(D|C)]
          = (0.2)(0.02) / [(0.2)(0.02) + (0.45)(0.01) + (0.35)(0.03)]
          = 0.211

   Here we divide the probability of the required path (probability
   that it came from machine A and was defective) by the probability
   of all paths.
11 / 352
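The same computation works for any number of hypotheses by vectorizing
over them; a short sketch in R:

prior <- c(A = 0.20, B = 0.45, C = 0.35)   # P(machine)
p.def <- c(0.02, 0.01, 0.03)               # P(defective | machine)
posterior <- prior * p.def / sum(prior * p.def)
posterior                                   # element A is about 0.211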
Frequentist vs Bayesian
Frequentist Approach                        Bayesian Approach
1 Parameters are fixed at their             1 Parameters are considered as
  true but unknown value                      random variables with distributions
2 Objective notion of probability           2 Subjective notion of probability
  based on repeated sampling                  (prior) combined with data
3 Large sample properties or                3 Does not require large sample
  asymptotic approximations                   approximations
4 Maximizing a likelihood                   4 Simulation-based approach
12 / 352
Criticisms: Frequentist vs Bayesian Paradigms
Frequentist
1 Failure to incorporate relevant prior information
2 Inefficiency, inflexibility and incoherency
3 Unrealistic assumptions on the data generating mechanism
4 Objectivity is often illusory
Bayesian
1 Lack of objectivity
2 Complexity in computation
3 Lack of intuitive model building capabilities
4 Need to defend priors
5 No guarantee of Markov chain convergence.
13 / 352
θ as Fixed versus as a Random Variable
Frequentist approach (θ fixed):
  Estimate θ with measures of uncertainty (SE, CIs)
  95% Confidence Interval: 95% of the time, θ is in the 95% interval
  that is estimated each time.
  ◮ P(θ ∈ 95% CI) = 0 or 1
  ◮ P(θ > 2) = 0 or 1
Bayesian approach (θ random):
  Find the posterior distribution of θ.
  Take quantities of interest from the distribution (posterior mean,
  posterior SD, posterior credible intervals).
  We can make probability statements regarding θ.
  ◮ 95% Credible Interval: P(θ ∈ 95% CI) = 0.95
  ◮ P(θ > 2) ∈ (0, 1).
14 / 352
Frequentist vs Bayesian: Confidence Interval (CI)
Frequentist Approach
1 A 95% CI calculated from a sample will contain the true parameter
  95% of the time
2 The particular confidence interval from any one sample may or may
  not contain the true parameter value
3 This is counter to the intuitive and often held belief that the CI
  from a particular sample contains the true parameter value with 95%
  certainty
Bayesian Approach
1 Uncertainties about parameters are represented as probabilities in
  the Bayesian paradigm
2 In the Bayesian framework one can calculate the probability that a
  parameter lies in a given interval (credible interval)
3 Under the Bayesian framework we call this interval (or region in the
  general case) a highest posterior density (HPD) region or credible
  set.
15 / 352
Example: Coin tossing
✑ Test H0 : θ = 1/2 versus Ha : θ > 1/2, where θ is the true probability
   of heads; in 12 independent tosses of a coin, 9 heads and 3 tails are
   observed.
Sol: Two choices for the sampling distribution are possible:
1 Binomial: n = 12 tosses is fixed, the random quantity X is the
  number of heads observed, X ∼ Bin(12, θ), and the likelihood function is

      L1(θ) = C(n, x) θ^x (1 − θ)^(n−x) = C(12, 9) θ^9 (1 − θ)^3.

2 Negative binomial: Flip the coin until the third tail appears. Here
  X is the number of heads required to complete the experiment, with
  X ∼ NegBin(r = 3, θ), and the likelihood function is

      L2(θ) = C(r + x − 1, x) θ^x (1 − θ)^r = C(11, 9) θ^9 (1 − θ)^3.

The p-value corresponding to the rejection region "Reject H0 if
X ≥ c" is computed under both distributions.
16 / 352
Example: Coin tossing
Binomial:
    α1 = P_{θ=1/2}(X ≥ 9) = Σ_{j=9}^{12} C(12, j) θ^j (1 − θ)^{12−j}
Negative binomial:
    α2 = P_{θ=1/2}(X ≥ 9) = Σ_{j=9}^{∞} C(2 + j, j) θ^j (1 − θ)^3

> pbinom(8, 12, .5, lower.tail = F)
[1] 0.07299805
> pnbinom(8, 3, .5, lower.tail = F)
[1] 0.03271484

At the usual Type I error level α = .05 the two model assumptions lead
to two different decisions.
The probability of X values "more extreme" than the actually observed 9
was used as evidence against H0, even though these values did not occur.
17 / 352
Probability Theory
Lecture 2
Sharif Mahmood
18 / 352
Three Axioms of Probability
Let S be the sample space and A be an event in S.
1 For any event A, P(A) ≥ 0
2 P(S) = 1
3 If A1, A2, . . . , An are mutually disjoint, then
  P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An)
The three axioms imply:
  P(U) = 1
  P(A^c) = 1 − P(A)
  For any event A, 0 ≤ P(A) ≤ 1.
  If A ⊂ B, then P(A) ≤ P(B).
  For any two events A and B,
  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
19 / 352
Conditional Probability
    P(A|B) = P(A ∩ B) / P(B)

Multiplicative Law of Probability:
    P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B)

Bayes Rule:
    P(A|B) = P(B|A)P(A) / P(B)

Law of Total Probability:
    P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai)
20 / 352
Independence
A and B are independent if
    P(AB) = P(A)P(B)

If A and B are independent, then
    P(A|B) = P(AB)/P(B) = P(A)P(B)/P(B) = P(A)

Conditional Independence:
    P(AB|C) = P(A|C)P(B|C)
21 / 352
Random Variables
A random variable is a function that takes a random experiment
and assigns a number to the outcome of the experiment.
Outcome values are assigned probabilities by a probability mass
function (for discrete random variable) or probability density
function (for continuous random variable).
22 / 352
Probability Functions
Probability Mass Function
    P(Y = y)
Probability Density Function
    P(Y ∈ A) = ∫_A f(y) dy

Characteristics of all PDFs and PMFs:
◮ The area under the curve must integrate to 1
◮ P(Y = y) ≥ 0 for all Y
◮ The support is all y's where P(Y = y) > 0.
23 / 352
Marginal, Conditional, and Joint Densities
    f(x)       = ∫ f(x, y) dy
    f(x, y)    = ∫ f(x, y, z) dz
    f(x|y)     = f(x, y) / f(y)
    f(x|y, z)  = f(x, y, z) / f(y, z)
    f(x, y)    = f(x|y) f(y)
               = f(y|x) f(x)
    f(x, y, z) = f(x|y, z) f(y|z) f(z)
24 / 352
Expectation
Discrete Case:
    E(X) = Σ_i x_i P(X = x_i)
where P(X = x) is the probability mass function (PMF).

Continuous Case:
    E(X) = ∫_{−∞}^{∞} x f(x) dx
where f(x) is the probability density function (PDF).

Expectation of a Function of a Random Variable:
    E[g(X)] = Σ_i g(x_i) P(X = x_i)
for discrete random variables and
    E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
for continuous random variables.
25 / 352
Variance and Covariance
The variance of X is
    var(X) = E[X − E(X)]^2
           = E[X^2 − 2X E(X) + {E(X)}^2]
           = E(X^2) − [E(X)]^2

    cov(X, Y) = E{[X − E(X)][Y − E(Y)]}
              = E(XY) − E(X)E(Y)
26 / 352
Important Distributions
Discrete Distributions                 Continuous Distributions
1 Bernoulli Distribution               1 Univariate Normal Distribution
2 Binomial Distribution                2 Multivariate Normal Distribution
3 Multinomial Distribution             3 Univariate t Distribution
4 Poisson Distribution                 4 Multivariate t Distribution
5 Geometric Distribution               5 Uniform Distribution
6 Negative Binomial Distribution       6 Beta Distribution
                                       7 Dirichlet Distribution
                                       8 Exponential Distribution
                                       9 Gamma Distribution
                                      10 Weibull Distribution
                                      11 Inverse Gamma Distribution
27 / 352
The Bernoulli Distribution
Y ∼ Bernoulli(π), y = 0, 1
probability of success: π ∈ [0, 1]
p(y|π) = π^y (1 − π)^(1−y)
E(Y) = π
Var(Y) = π(1 − π).
[Figure: Bernoulli pmf at y = 0 and y = 1 for a given π.]
28 / 352
The Binomial Distribution
Y ∼ Binomial(n, π), y = 0, 1, . . . , n
number of trials: n ∈ {1, 2, . . .}
probability of success: π ∈ [0, 1]
p(y|π) = C(n, y) π^y (1 − π)^(n−y)
E(Y) = nπ
Var(Y) = nπ(1 − π).
[Figure: binomial pmf for y = 0, . . . , n.]
29 / 352
The Multinomial Distribution
Y ∼ Multinomial(n, π1, . . . , πk), yj = 0, 1, . . . , n, Σ_{j=1}^{k} yj = n
number of trials: n ∈ {1, 2, . . .}
probability of success for category j: πj ∈ [0, 1], Σ_{j=1}^{k} πj = 1
p(y|n, π) = n!/(y1! y2! . . . yk!) π1^{y1} π2^{y2} . . . πk^{yk}
E(Yj) = nπj
Var(Yj) = nπj(1 − πj)
Cov(Yi, Yj) = −nπiπj.
30 / 352
The Poisson Distribution
Y ∼ Poisson(λ), y = 0, 1, . . .
expected number of occurrences: λ > 0
p(y|λ) = e^{−λ} λ^y / y!
E(Y) = λ
Var(Y) = λ.
[Figure: Poisson pmf for y = 0, . . . , 10.]
31 / 352
The Geometric Distribution
How many failures before the first success in Bernoulli trials?
Y ∼ Geometric(π), y = 0, 1, 2, . . .
probability of success: π ∈ [0, 1]
p(y|π) = π(1 − π)^y
E(Y) = (1 − π)/π
Var(Y) = (1 − π)/π^2.
32 / 352
The Negative Binomial Distribution
How many failures before the r-th success in Bernoulli trials?
Y ∼ Neg.Binomial(r, π), y = 0, 1, 2, . . .
probability of success: π ∈ [0, 1]
p(y|r, π) = C(y + r − 1, y) π^r (1 − π)^y
E(Y) = r(1 − π)/π
Var(Y) = r(1 − π)/π^2.
[Figure: negative binomial pmf for r = 4.]
33 / 352
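As a check, the pmf above matches R's built-in dnbinom, which uses the
same "number of failures before the r-th success" parameterization
(r = 4 as in the figure; the other values below are arbitrary):

r <- 4; p <- 0.3; y <- 0:20
manual <- choose(y + r - 1, y) * p^r * (1 - p)^y   # pmf from the slide
all.equal(manual, dnbinom(y, size = r, prob = p))  # TRUE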
The Univariate Normal Distribution
Y ∼ N(µ, σ^2), y ∈ R
mean: µ ∈ R; variance: σ^2 > 0
p(y|µ, σ^2) = 1/(σ√(2π)) exp(−(y − µ)^2 / (2σ^2))
E(Y) = µ
Var(Y) = σ^2.
[Figure 1: normal densities with µ = 0 and µ = 1.
 Figure 2: normal densities with σ = 1 and σ = 2.]
34 / 352
The Multivariate Normal Distribution
Y ∼ N(µ, Σ), y ∈ R^k
mean: µ ∈ R^k; variance: Σ a positive definite k × k matrix
p(y|µ, Σ) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(y − µ)' Σ^{−1} (y − µ))
E(Y) = µ
Var(Y) = Σ.
[Figure: bivariate normal density surface with µ1 = 0, µ2 = 0,
σ11 = 10, σ22 = 10, ρ = 0.5.]
In the bivariate case (k = 2),
    f(x) = 1/(2π √(σ11 σ22 (1 − ρ^2)))
           × exp(−1/(2(1 − ρ^2)) [(x1 − µ1)^2/σ11
               − 2ρ(x1 − µ1)(x2 − µ2)/√(σ11 σ22) + (x2 − µ2)^2/σ22])
35 / 352
The Univariate t Distribution
Y ∼ t(µ, σ^2, ν), y ∈ R
location: µ ∈ R; scale: σ^2 > 0; degrees of freedom: ν
p(y|µ, σ^2, ν) = Γ((ν + 1)/2) / (Γ(ν/2) σ √(νπ))
                 [1 + (1/ν)((y − µ)/σ)^2]^{−(ν+1)/2}
E(Y) = µ for ν > 1
Var(Y) = σ^2 ν/(ν − 2) for ν > 2.
[Figure 1: t densities with df = 2, 6, 10.
 Figure 2: t densities for other parameter settings.]
36 / 352
The Multivariate t Distribution
Y ∼ t(µ, Σ, ν), y ∈ R^k
location: µ ∈ R^k; scale: Σ a positive definite k × k matrix; degrees of freedom: ν
p(y|µ, Σ, ν) = Γ((ν + k)/2) / (Γ(ν/2) ν^{k/2} π^{k/2}) |Σ|^{−1/2}
               [1 + (1/ν)(y − µ)' Σ^{−1} (y − µ)]^{−(ν+k)/2}
E(Y) = µ for ν > 1
Var(Y) = Σ ν/(ν − 2) for ν > 2.
[Figure: contour plot of a bivariate t density.]
37 / 352
The Uniform Distribution
Y ∼ Uniform(α, β), y ∈ [α, β]
interval: [α, β]; α < β
p(y|α, β) = 1/(β − α)
E(Y) = (α + β)/2
Var(Y) = (β − α)^2/12.
[Figure: uniform density.]
38 / 352
The Beta Distribution
Y ∼ Beta(α, β), y ∈ [0, 1]
shape parameters: α > 0; β > 0
p(y|α, β) = Γ(α + β)/(Γ(α)Γ(β)) y^(α−1) (1 − y)^(β−1)
E(Y) = α/(α + β)
Var(Y) = αβ/((α + β)^2 (α + β + 1)).
39 / 352
The Dirichlet Distribution
Y ∼ Dirichlet(α1, . . . , αk), yj ∈ [0, 1], Σ_{j=1}^{k} yj = 1
parameters: αj > 0; α0 ≡ Σ_{j=1}^{k} αj
p(y|α) = Γ(α1 + . . . + αk)/(Γ(α1) . . . Γ(αk)) y1^{α1−1} . . . yk^{αk−1}
E(Yj) = αj/α0
Var(Yj) = αj(α0 − αj)/(α0^2 (α0 + 1))
Cov(Yi, Yj) = −αiαj/(α0^2 (α0 + 1)).
40 / 352
The Exponential Distribution
Y ∼ Exponential(λ), y > 0
rate parameter: λ > 0
p(y|λ) = λ e^{−λy}
E(Y) = 1/λ
Var(Y) = 1/λ^2.
[Figure: exponential density with λ = 1.]
41 / 352
The Gamma Distribution
Y ∼ Gamma(α, β), y > 0
shape parameter: α > 0
inverse scale (rate) parameter: β > 0
p(y|α, β) = β^α/Γ(α) y^(α−1) exp(−βy)
E(Y) = α/β, Var(Y) = α/β^2.
[Figure: gamma density with α = 1, β = 1.]
42 / 352
The Weibull Distribution
Y ∼ Weibull(α, β), y > 0
shape parameter: β > 0, scale parameter: α > 0
p(y|α, β) = (β/α)(y/α)^(β−1) exp(−(y/α)^β)
E(Y) = α Γ(1 + 1/β)
Var(Y) = α^2 [Γ(1 + 2/β) − Γ(1 + 1/β)^2].
[Figure: Weibull density with α = 1, β = 1.]
43 / 352
The Inverse Gamma Distribution
If X ∼ Gamma(α, β), then Y = 1/X ∼ Invgamma(α, β).
Y ∼ Invgamma(α, β), y > 0
shape parameter: α > 0
scale parameter: β > 0
p(y|α, β) = β^α/Γ(α) y^{−(α+1)} exp(−β/y)
E(Y) = β/(α − 1) for α > 1
Var(Y) = β^2/((α − 1)^2 (α − 2)) for α > 2.
44 / 352
Maximum likelihood estimation
Lecture 3
Sharif Mahmood
45 / 352
Maximum Likelihood
The classical approach to statistics involves two basic steps:
1 Model estimation
2 Inference.
1st: determining an appropriate probability distribution/model for the
data at hand
2nd: estimating its parameters.
The idea of maximum likelihood estimation is that a good choice for the
estimate of a parameter of interest is the value of the parameter
that makes the observed data most likely to have occurred.
46 / 352
Constructing a likelihood function
If x1, x2, . . . , xn are independent observations of a random
variable, x, in a data set of size n, then we know from the
multiplication rule in probability theory that the joint probability
for the vector X is

    f(X|θ) ≡ L(θ|x) = ∏_{i=1}^{n} f(xi|θ)

This equation is the likelihood function for the model.
The primary point of constructing a likelihood function is to solve
for the value of the parameter that makes the occurrence of the
data most probable, or most "likely", to have actually occurred.
From a classical standpoint, the parameter is assumed to be fixed.
47 / 352
✑ Suppose we have one observation y = 0.5 (assumed to be) drawn
   from a Normal distribution with σ^2 = 1. Estimate the mean µ of
   the distribution.
Obvious guess: µ = 0.5
[Figure: normal pdfs p(y|µ) with µ = 0.5 and µ = −0.5, each evaluated at
the observation y = 0.5; the density at y is higher when µ = 0.5.]
48 / 352
Why Does L(θ|y) = p(y|θ) work?
[Figure: left, the normal pdf p(y|µ) evaluated at y = 0.5 for µ = 0.5,
µ = −0.5 and µ = 2; right, the corresponding likelihood function L(µ|y)
traced over µ for the fixed observation y = 0.5.]
Our best estimate of θ is the value of θ that maximizes L(θ|y) (the MLE).
49 / 352
Example: Poisson Distribution
Suppose we have some count data (number of errors on a page).
Here the Poisson distribution can be used to model the data. Thus we
write Yi ∼ Poisson(λ), i.i.d.
We want to find λ, which is the mean of the Poisson distribution.
The PMF (discrete) for the data is

    p(y|λ) = ∏_{i=1}^{n} λ^{yi} e^{−λ} / yi!

Since L(θ|y) = p(y|θ), we have

    L(λ|y) = ∏_{i=1}^{n} λ^{yi} e^{−λ} / yi!
50 / 352
Example: Poisson Distribution
The log-likelihood is

    l(λ|y) = Σ_{i=1}^{n} (yi ln λ − λ − ln yi!)

We can drop all terms that don't depend on λ (because the likelihood
is a relative concept and is invariant to shifts):

    l(λ|y) = Σ_{i=1}^{n} yi ln λ − nλ

We set the derivative (known as the score function) to zero
and solve for λ:

    ∂l(λ|y)/∂λ = Σ yi / λ − n = 0
    ⇒ λ̂ = Σ yi / n
51 / 352
Maximum Likelihood In R
Write our log-likelihood function:

    l(λ|y) = Σ_{i=1}^{n} yi ln λ − nλ

> ll.poisson <- function(par, y) {
+   lambda <- exp(par)
+   out <- sum(y * log(lambda)) - length(y) * lambda
+   return(out)
+ }

Find the maximum (with sample data)

> y <- rpois(1000, 5)
> opt <- optim(par = 2, fn = ll.poisson, method = "BFGS",
+   control = list(fnscale = -1), y = y)$par
> mle <- exp(opt)   # to reparameterize the output
> mle
[1] 4.954001
52 / 352
More Complicated Likelihoods
Up to this point, our distributions have been relatively simple (i.e.
all observations are i.i.d. from a distribution with the same
parameter).
Suppose our observations come from different distributions, or
some observations are not fully observed.
How do we define the joint distribution for our data (and thus our
likelihood)?
Use an indicator variable.
Indicator Variable
Let D be an indicator variable such that di = 1 if i follows one
distribution, and di = 0 if i follows another distribution.
We can incorporate the indicator variable into our likelihood.
53 / 352
Example 1
Suppose that we have 5 observations. The first 2 are assumed to
be from distribution 1. The last 3 are assumed to be from
distribution 2.

    d = c(1, 1, 0, 0, 0)

    p(θ1, θ2|y) = ∏_{i=1}^{n} [p(yi|θ1)]^{di} [p(yi|θ2)]^{1−di}
54 / 352
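A small R sketch of such a likelihood, assuming (purely for illustration)
that both distributions are Poisson with different means:

y <- c(2, 3, 10, 12, 9)          # hypothetical data
d <- c(1, 1, 0, 0, 0)            # 1 = distribution 1, 0 = distribution 2
loglik <- function(theta1, theta2) {
  sum(d * dpois(y, theta1, log = TRUE) +
      (1 - d) * dpois(y, theta2, log = TRUE))
}
loglik(2.5, 10)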
Example 2
Suppose that we have 5 observations. We observe the first 2
observations completely, but we only observe that the last 3 are
greater than some number z.

    d = c(1, 1, 0, 0, 0)

    p(θ|y) = ∏_{i=1}^{n} [p(yi|θ)]^{di} [1 − F(z)]^{1−di}

where F(y) is the CDF for p(y|θ), so that F(z) = P(Y ≤ z).
55 / 352
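A sketch of the corresponding log-likelihood in R, assuming (for
illustration only) an exponential model and a censoring point z = 5:

y <- c(1.2, 0.7, 5, 5, 5)        # last three only known to exceed z
d <- c(1, 1, 0, 0, 0)            # 1 = fully observed, 0 = censored
z <- 5
loglik <- function(rate) {
  sum(d * dexp(y, rate, log = TRUE) +
      (1 - d) * pexp(z, rate, lower.tail = FALSE, log.p = TRUE))
}
loglik(0.5)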
Prior Distributions
Lecture 4
Sharif Mahmood
56 / 352
Introduction
A prior distribution of a parameter is the probability distribution
that represents your uncertainty about the parameter before the
current data are examined.
Multiplying the prior distribution and the likelihood function
together leads to the posterior distribution of the parameter.
The existence of a prior distribution for any problem can be
justified by axioms of decision theory; here we focus on common
prior distributions for any given application.
57 / 352
How do we choose the priors?
Quantifying expert opinion:
◮ Pick the prior mean where the expert's belief is centered.
◮ Pick the variance such that the range covers reasonable values of
  the expected mean, or consider the equivalent sample size for your
  prior; e.g. if you want your prior to be equivalent to a sample of
  size n, your prior variance should be σ^2/n.
Previous research results. You can use previously reported
research results as your prior. Bayesian inference can proceed
incrementally.
Non-informative, flat priors. If there is no reasonable way to form
a prior, one can use a flat prior. These often result in similar point
estimates as frequentist estimates.
58 / 352
Priors
There are various types of prior distributions:
1 Noninformative priors
2 Improper priors
3 Informative priors
4 Conjugate priors
5 Jeffreys priors
59 / 352
Noninformative Priors
Noninformative priors are “flat” relative to the likelihood function
A prior π(θ) is noninformative if it has minimal impact on the
posterior distribution of θ
It is dominated by the likelihood function, does not change much
over the region where the likelihood is appreciable, and does not
assume large values outside that range
Also known as reference prior, vague prior, or diffuse prior.
If 0 ≤ θ ≤ 1, then θ ∼ U(0, 1) is a noninformative prior for θ, with
π(θ) = 1, 0 ≤ θ ≤ 1.
If −∞ < θ < ∞, then taking θ ∼ N(µ0, σ0^2) and letting σ0^2 → ∞
gives a noninformative prior for θ.
60 / 352
Improper Priors
A prior is said to be improper if

    ∫_Θ π(θ) dθ = ∞

i.e. a prior is improper if its normalizing constant is infinite.
Improper priors are used in Bayesian inference to obtain
noninformative priors.
Although an improper prior may result in an improper posterior, it
can still lead to a proper posterior distribution.
If −∞ < θ < ∞ and π(θ) ∝ 1, then θ has a uniform prior distribution
on the real line. Clearly, ∫_{−∞}^{∞} π(θ) dθ = ∫_{−∞}^{∞} dθ = ∞.
61 / 352
Example: Improper Priors
✑ Suppose x1, . . . , xn are i.i.d. N(θ, 1) given θ. Here
   Θ = {θ : −∞ < θ < ∞}. Let π(θ) ∝ 1. Then,

    p(θ|x) ∝ p(x|θ) π(θ)
           = exp(−(1/2) Σ (xi − θ)^2)
           ∝ exp(−(n/2)(θ − x̄)^2)

Thus θ|x ∼ N(x̄, 1/n).
62 / 352
Informative Priors
An informative prior expresses specific, definite information about
a variable.
An informative prior is not dominated by the likelihood
It has an impact on the posterior distribution
The proper use of prior distributions illustrates the power of the
Bayesian method: information gathered from the previous study,
past experience, or expert opinion can be combined with current
information in a natural way.
63 / 352
Example: Informative Priors
✑ Suppose x1, . . . , x10 are i.i.d. N(θ, 10) given θ. Also θ ∼ N(0, 1),
   representing an informative prior for θ. Then

    p(θ|x) ∝ exp[−(1/20) Σ (xi − θ)^2] exp[−(1/2) θ^2]
           ∝ exp[−(1/20)(10θ^2 − 2θ Σ xi)] exp[−(1/2) θ^2]
           ∝ exp[−(θ^2 − θ Σ xi /10)]

Thus θ|x ∼ N(Σ xi /20, 1/2).
64 / 352
Conjugate Priors
A conjugate prior for a family of distributions is one for which
the prior and posterior distribution belong to the same family.
✑ Suppose x1, . . . , xn are i.i.d. Binomial(1, θ), and θ ∼ Beta(α, β);
   then θ|x ∼ Beta(α + Σ xi, n − Σ xi + β). Thus the beta prior is a
   conjugate prior for the binomial family.

Likelihood                            Conjugate prior        Posterior
Binomial(n, θ)                        θ ∼ Beta(α, β)         Beta
Poisson(θ)                            θ ∼ G(δ0, λ0)          Gamma
N(µ, σ^2), σ^2 known                  µ ∼ N(µ0, σ0^2)        Normal
N(µ, σ^2), µ known, τ = 1/σ^2        τ ∼ G(δ0, λ0)          Gamma
G(α, λ), α known                      λ ∼ G(δ0, λ0)          Gamma
Beta(α, λ), λ known                   α ∼ G(δ0, λ0)          Gamma
65 / 352
Jeffreys Priors
The Jeffreys prior is based on the Fisher information matrix.
It satisfies the local uniformity property of noninformative priors.
Let θ be a p × 1 vector and p(x|θ) the density of x given θ. The
Fisher information matrix is defined as

    I(θ) = −E[∂^2 log p(x|θ) / ∂θi ∂θj]_{p×p}

The Jeffreys prior is then defined as π(θ) ∝ |I(θ)|^{1/2}, where |·|
denotes the determinant.
66 / 352
Jeffreys Priors
✑ Suppose x1, . . . , xn are i.i.d. Binomial(1, θ); p(x|θ) = θ^x (1 − θ)^{1−x}

    log p(x|θ) = x log θ + (1 − x) log(1 − θ)
    ⇒ ∂/∂θ log p(x|θ) = x/θ − (1 − x)/(1 − θ)
    ⇒ ∂^2/∂θ^2 log p(x|θ) = −x/θ^2 − (1 − x)/(1 − θ)^2
    ⇒ I(θ) = −E[∂^2 log p(x|θ)/∂θ^2] = 1/(θ(1 − θ))

Thus the Jeffreys prior for θ is π(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}.
Now π(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2} = θ^{1/2 − 1}(1 − θ)^{1/2 − 1}. Thus
θ ∼ Beta(1/2, 1/2), which is proper for the binomial model.
67 / 352
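The expectation defining I(θ) can also be approximated by simulation,
which is a useful sanity check on such derivations. A minimal sketch for
the Bernoulli case (the value θ = 0.3 is arbitrary):

theta <- 0.3
x <- rbinom(1e5, 1, theta)                      # simulated Bernoulli data
d2 <- -x/theta^2 - (1 - x)/(1 - theta)^2        # second derivative above
c(simulated = -mean(d2), exact = 1/(theta * (1 - theta)))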
Example: Jeffreys Priors
✑ Suppose x1, . . . , xn are i.i.d. Poisson(θ)

    log p(x|θ) = x log θ − θ − log(x!)
    ⇒ ∂/∂θ log p(x|θ) = x/θ − 1
    ⇒ ∂^2/∂θ^2 log p(x|θ) = −x/θ^2
    ⇒ I(θ) = 1/θ

Thus the Jeffreys prior for θ is π(θ) ∝ θ^{−1/2}, which is improper.
68 / 352
Exercises
Let X1, . . . , Xn ∼ Uniform(0, θ). Let f(θ) ∝ 1/θ. Find the posterior
density.
Let X1, . . . , Xn ∼ Poisson(λ).
(a) Let λ ∼ G(α, β) be the prior. Show that the posterior is also a
    Gamma. Find the posterior mean.
(b) Find the Jeffreys' prior. Find the posterior.
Show that the Jeffreys' prior based on the binomial likelihood
f(y|θ) = C(n, y) θ^y (1 − θ)^{n−y} is given by the Beta(.5, .5) distribution.
Let θ be a univariate parameter of interest, and let γ = g(θ) be a
1-1 transformation. Show that Jeffreys' prior is invariant under
reparameterization.
69 / 352
Other prior construction methods
There are a multitude of other methods for constructing priors;
one of them uses the marginal distribution

    m(x) = ∫ f(x|θ) p(θ) dθ.

Here, we choose the prior p(θ) based on its ability to preserve
consistency with the marginal distribution of the observed data.
Unlike the previous approaches in this lecture, this approach uses
not only the form of the likelihood, but also the actual observed
data values to help determine the prior.
Because the data are used twice, the resulting inferences from this
posterior would be "overconfident". We might refer to this method as
empirical Bayes (EB) estimation of the prior. Indeed, EB methods that
ignore this fact are often referred to as naive EB methods.
70 / 352
Bayesian inference and prediction
Lecture 5
Sharif Mahmood
71 / 352
A Simple (Beta-Binomial) Model
✑ Suppose ISRT played 500 games in regular inter-department
   competitions (football, cricket, basketball, etc.) during the
   1996-2012 seasons and won 285 games. Suppose the ISRT team wins each
   game with probability π. Estimate π.
Sol We have 500 Bernoulli observations or one observation Y, where
   Y ∼ Binomial(n, π) with n = 500.
Assumptions:
   Each game is a Bernoulli trial
   The team has the same probability of winning each game
   The outcomes of the games are independent.
72 / 352
Beta-Binomial Model (Contd.)
We can use the beta distribution as a prior for π since it has
support over [0, 1].

    p(π|y) ∝ p(y|π) p(π)
           = Binomial(n, π) × Beta(α, β)
           = C(n, y) π^y (1 − π)^{n−y} × Γ(α + β)/(Γ(α)Γ(β)) π^{α−1}(1 − π)^{β−1}
           ∝ π^y (1 − π)^{n−y} × π^{α−1}(1 − π)^{β−1}
    p(π|y) ∝ π^{y+α−1}(1 − π)^{n−y+β−1}

The posterior distribution is simply a Beta(y + α, n − y + β)
distribution. Effectively, our prior is just adding α − 1 successes
and β − 1 failures to the dataset.
Bayesian priors are just adding pseudo-observations to the data.
73 / 352
Summarizing the Posterior
✑ Consider the previous example of the beta-binomial model with a
   Beta(1,1) prior (uniform) and a Beta(y + α, n − y + β) posterior,
   where n = 500, y = 285, α = 1, and β = 1.
Posterior Mean
◮ Analytically:
      (y + α) / [(y + α) + (n − y + β)] = 286/(286 + 216) ≈ 0.57
◮ Simulation:
  > a <- 1
  > b <- 1
  > posterior.unif.prior <- rbeta(10000, shape1 = 285 + a,
  +   shape2 = 500 - 285 + b)
  > mean(posterior.unif.prior)
  [1] 0.5699
74 / 352
Summarizing the Posterior
Posterior Variance
◮ Analytically:
      (y + α)(n − y + β) / {[(y + α) + (n − y + β)]^2 [(y + α) + (n − y + β) + 1]}
      ≈ 0.00049
◮ Simulation:
  > var(posterior.unif.prior)
  [1] 0.0004832
Posterior Probability 0.5 < π < 0.6
◮ Analytically:
      ∫_{0.5}^{0.6} Γ((y + α) + (n − y + β)) / [Γ(y + α)Γ(n − y + β)]
          π^{y+α−1} (1 − π)^{n−y+β−1} dπ ≈ 0.91
◮ Simulation:
  > mean(posterior.unif.prior > 0.5 & posterior.unif.prior < 0.6)
  [1] 0.9114
75 / 352
Bayesian Intervals
In Bayesian statistics, we use the terms credible sets and credible
intervals rather than confidence intervals.
Find the central interval that contains 95% of the area of the
posterior.
> qbeta(c(0.025,0.975), 285 + a, 500 - 285 + b)
[1] 0.5262 0.6127
Simulation:
> quantile(posterior.unif.prior, probs = c(0.025, 0.975))
2.5% 97.5%
0.5269 0.6125
76 / 352
Interpretation of Bayesian Intervals
Recall that the posterior distribution represents our updated
subjective probability distribution for the unknown parameter.
Thus, the interpretation of the 95% credible interval is that the
probability is .95 that the true π is in the interval.
If the Beta(1,1) had been a true representation of our prior beliefs
or knowledge about our parameter π, then after seeing our survey
data, we would believe that
    P(0.526 < π < 0.612) = 0.95
Contrast this with the interpretation of a frequentist confidence
interval.
77 / 352
Highest Posterior Density Regions
The density at any point inside the interval is greater than the
density at any point outside it.
It is the shortest possible interval trapping the desired probability.
It is preferable to equal-tail credible sets when the posterior
is highly skewed or multimodal.
It is generally difficult to compute; tables of HPDs for certain
densities are available.
78 / 352
Summary of Beta-Binomial Model
We knew that the likelihood × prior produced something that
looked like a Beta distribution up to a constant of proportionality.
Since the posterior must be a probability distribution, we know
that it is a Beta distribution and we can easily solve for the
normalizing constant (although we don't need to since we already
have the posterior).
We can summarize it analytically or via simulation with the
following quantities:
◮ posterior mean (median, mode)
◮ posterior standard deviation
◮ highest posterior density (hpd) region
◮ posterior credible intervals (credible sets).
79 / 352
Graph of Beta-Binomial Model
[Figure: prior, likelihood (MLE) and posterior densities for π under an
uninformative Beta(1,1) prior and a Beta(2,12) prior, for the original
data (n = 500) and for n = 1000.]
80 / 352
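A minimal R sketch that reproduces one panel of the figure (the
likelihood is rescaled to integrate to one so it can be drawn on the
same axes):

n <- 500; y <- 285; a <- 1; b <- 1
pi.grid <- seq(0, 1, 0.001)
prior <- dbeta(pi.grid, a, b)
likelihood <- dbeta(pi.grid, y + 1, n - y + 1)   # normalized likelihood
posterior <- dbeta(pi.grid, y + a, n - y + b)
plot(pi.grid, posterior, type = "l", xlab = expression(pi), ylab = "Density")
lines(pi.grid, prior, lty = 2)
lines(pi.grid, likelihood, lty = 3)
legend("topright", c("Posterior", "Prior", "Likelihood"), lty = 1:3)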
Prediction
In many situations, interest focuses on predicting values of a
future sample from the same population.
-i.e. on estimating values of potentially observable but not yet
observed quantities.
Example: Suppose we are interested in how the ISRT team will do in
the upcoming 10 inter-department games in the 2013 season.
So we are considering a new sample of size n⋆ and want to
estimate the probability of some particular number y⋆ of
wins/successes in this sample.
81 / 352
Prediction
If, based on the earlier survey, we actually knew the true value of
the population proportion π, we would just use the binomial
probability:

    p(y⋆|π) = C(n⋆, y⋆) π^{y⋆} (1 − π)^{n⋆−y⋆}

But of course we still have uncertainty about π even after observing
the original sample.
All of our current knowledge about π is contained in the posterior
distribution obtained using the original survey.
The posterior predictive probability of some particular value of y⋆
in a future sample of size n⋆ is

    p(y⋆|y) = ∫_0^1 p(y⋆|π) p(π|y) dπ

where y denotes the data from the original survey, and p(π|y) is
the posterior distribution based on that survey.
82 / 352
Prediction
✑ Suppose we had used the Beta(1, 1) prior, so our posterior
   distribution is Beta(286, 216). Then the posterior predictive
   probability of y⋆ wins in a future sample of size n⋆ is

    p(y⋆|y) = ∫_0^1 p(y⋆|π) Beta(π; 286, 216) dπ

This is particularly easy to compute if n⋆ = 1, in which case:

    P(y⋆ = 1|y) = ∫_0^1 P(y⋆ = 1|π) p(π|y) dπ
                = ∫_0^1 π p(π|y) dπ
                = E(π|y)
                ≈ 0.57
83 / 352
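The integral can also be approximated by simulation: draw π from its
posterior and then y⋆ from the binomial, which handles any n⋆ (here
n⋆ = 10 for the upcoming games); a short sketch:

pi.sim <- rbeta(10000, 286, 216)                  # draws from p(pi | y)
ystar <- rbinom(10000, size = 10, prob = pi.sim)  # draws from p(y* | y)
table(ystar) / 10000                              # simulated predictive pmf
mean(pi.sim)                                      # P(y* = 1 | y) for n* = 1, about 0.57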
Bayesian inference and prediction
Lecture 6
Sharif Mahmood
84 / 352
The Poisson Distribution
The Poisson distribution may be appropriate when the data are
counts of rare events:
events occurring at random at a constant rate per unit time,
distance, volume, or whatever.
We assume that the number of events that occur in any interval
is independent of the number of events occurring in a disjoint
interval.
Examples:
- the number of cases of a rare form of cancer occurring in DMC in a
  year
- the number of typos occurring in each book produced by Annona
  prokashoni
- the number of defective PCs at the ISRT computer lab.
85 / 352
The Poisson Distribution
Since the values of a random variable following a Poisson
distribution are counts, what are the possible values?
The probability mass function of a Poisson random variable is

    p(y|λ) = e^{−λ} λ^y / y!,   y = 0, 1, . . .

The count of the number of events occurring in τ time units also
follows a Poisson distribution (but with parameter τλ).
The conjugate prior distribution for the Poisson rate parameter is
the gamma family.
86 / 352
Poisson Process
✑ We wish to model the occurrence of events in time as a Poisson
   process with rate λ. (The rate is the average number per unit
   time.)
Suppose our prior probability density function for λ is proportional to

    λ^{α−1} exp(−βλ)

This means that our prior distribution for λ is a gamma
distribution. The probability density function of a G(α, β)
distribution is

    f(λ; α, β) = β^α/Γ(α) λ^{α−1} exp(−βλ),   λ ≥ 0

The parameters of the prior distribution are often referred to as
hyperparameters. The gamma distribution is indexed by two
parameters.
87 / 352
Poisson Process
We observe the process for τ time units. Events occur at times
t1, t2, . . . , tn. Writing t0 = 0, the likelihood is

    L = ∏_{i=1}^{n} λ e^{−λ(ti − ti−1)} × e^{−λ(τ − tn)}

The last term is the probability that no events occur in (tn, τ).
We can simplify L to

    L = λ^n exp(−λτ)

This is effectively the probability of n events in (0, τ), i.e.

    (λτ)^n exp(−λτ)/n! ∝ λ^n exp(−λτ)
88 / 352
Poisson Process
Hence the posterior p.d.f. is proportional to

    λ^{α+n−1} e^{−(β+τ)λ}

This is clearly a gamma distribution, so the posterior p.d.f. is

    (β + τ)^{α+n}/Γ(α + n) λ^{α+n−1} e^{−(β+τ)λ}

The posterior mean is (α + n)/(β + τ).
The posterior variance is (α + n)/(β + τ)^2.
89 / 352
Poisson Process
Suppose at the Shahabag bus stand the times of arrival of buses passing
the PG hospital, from 12.40 till 13.00 on February 3, 2013, are given in
the table (minute: seconds past the minute).

12.40:  3, 6, 9, 15, 24, 28, 30
12.41:  7, 12, 14, 16, 21, 24, 30, 50
12.42:  9, 22, 28, 46, 53
12.43:  22, 25, 35, 38, 58
12.44:  2, 5, 8, 10, 14, 17, 27, 30, 45
12.45:  3, 46
12.46:  13, 42, 51
12.47:  0, 9, 11, 18, 23, 26, 39, 51
12.48:  35, 39, 55
12.49:  8, 10, 19, 20, 33, 45, 56, 58
12.50:  4, 30, 34, 47
12.51:  0, 4, 18, 21, 52, 57, 59
12.52:  4, 9, 29, 30, 31, 38, 59
12.53:  2, 6, 31, 53
12.54:  7, 9, 13, 24, 39, 47
12.55:  28, 46, 49, 59
12.56:  6, 9, 14, 35, 41, 46
12.57:  8, 10, 15, 22, 25, 49, 52, 59
12.58:  31, 34, 53, 55, 56
12.59:  2, 31, 34, 38, 54, 59

Table: Times of arrival of buses
90 / 352
Poisson Process
These can be converted to time intervals between vehicles. These
intervals, in seconds, are given in the table (also in shabag.dat).
The first value is the time until the first arrival.

  3 19  3 13  4 21  7  3  3 13 10 12 13  3  3 19  3  6  3 44
 13  4  5  2  6 18 15  4  4 25 21  1  9  7 18 16 14 22  6  6
  4 29 43 13  3 14  5 29  2  3 27  2 31  2 22  3 37 10 29  9
  5  4  2  4  5  3  9  1  2 11  5 16  2 20  9 13  5 15  7  5
  2  4  9 12  5  8  3  5  3  2 11 20 41 24  3  3  7  2  1 18
  3  6  2  5  6  1  3  7 20  4  3 26  7 10 32

Table: Inter-arrival times of buses (115 observations)
91 / 352
Poisson Process
[Figure: histogram of the 115 inter-arrival times (seconds) in the
shabag data.]
92 / 352
Poisson Process
✑ Suppose the prior expectation for the rate of arrival was, say, 1
   bus every 5 seconds, i.e. λ0 = 0.2. Suppose we were almost certain
   that the rate would be less than 1 bus every 2 seconds. Let us say
   that Pr(λ < 0.5) = 0.99.
Suppose we use the conjugate (gamma) prior. Then α/β = 0.2.
We require

    ∫_0^0.5 f(λ) dλ = 0.99

where f(λ) is the p.d.f. of a G(α, α/0.2) distribution. We can
evaluate this integral in R.
First we set up a vector of values for α and a corresponding vector
for β. Then we evaluate the integral for each pair of values.
93 / 352
Poisson Process
> alpha <- seq(1, 10)
> beta <- alpha/0.2
> prob <- pgamma(0.5, alpha, beta)
> d <- data.frame(alpha, beta, prob)
> d
   alpha beta      prob
1      1    5 0.9179150
2      2   10 0.9595723
3      3   15 0.9797433
4      4   20 0.9896639
5      5   25 0.9946545
6      6   30 0.9972076
7      7   35 0.9985300
8      8   40 0.9992214
9      9   45 0.9995856
10    10   50 0.9997785
94 / 352
Summarizing the Posterior
We find that we get approximately what we require with
α = 4, β = 20.
Now τ = 1200 and n = 115.
Thus the posterior p.d.f. is

    1220^119/118!  λ^118 e^{−1220λ}

The prior mean was 0.2. The posterior mean is 119/1220 = 0.0975.
The prior variance was 0.01. The posterior variance is
119/1220^2 = 0.00008.
95 / 352
Summarizing the Posterior
At a guess the important part of the posterior distribution is likely
to be within ±3 standard deviations of the posterior mean.
That is roughly 0.07 to 0.13 so we create an array of λ values from
0.070 to 0.130 in steps of 0.001. We can check that this covers the
important part of the distribution by looking at the distribution
function.
> lambda <- seq(0.07, 0.13, 0.001)
> cdf <- pgamma(lambda, 119, 1220)
> pdf <- dgamma(lambda, 119, 1220)
> plot(lambda, cdf, type="l")
> plot(lambda, pdf, type="l")
96 / 352
Marginal Distributions
With conjugate families, the known form of the prior and
posterior densities can be used to find the marginal distribution,
p(y), using the formula

    p(y) = p(y|λ) p(λ) / p(λ|y)

For instance, the Poisson model for a single observation, y, has
prior predictive distribution

    p(y) = Poisson(y|λ) G(λ|α, β) / G(λ|α + 1, β + τ)
97 / 352
Probability Intervals
Probability interval: A probability interval is an interval for an
unknown quantity which contains a given amount of probability.
Symmetric probability interval: A symmetric probability interval
for an unknown quantity θ is an interval t1 < θ < t2 which has the
property that Pr(θ < t1 ) = Pr(θ > t2 ). For example, a 95%
symmetric probability interval (t1 , t2 ) for θ would have the
property that Pr(θ < t1 ) = 0.025, Pr(t1 < θ < t2 ) = 0.95,
Pr(θ > t2 ) = 0.025.
In the shabag example above, the posterior distribution for λ is
G(119, 1220). We can calculate a 95% symmetric posterior
probability interval for λ as follows, using R.
> qgamma(0.025,119,1220)
[1] 0.08080462
> qgamma(0.975,119,1220)
[1] 0.1158291
Thus our interval is 0.0808 < λ < 0.1158.
98 / 352
Probability Intervals
Highest probability density (hpd) interval: A hpd or “highest
probability density” interval for θ is a probability interval for θ
with the property that no point outside the interval has a
probability density greater than any point inside the interval. In
other words, the interval captures the “most likely” values of the
unknown.
It is also easy to see that, if the pdf of θ is f(θ) and the
100(1 − α)% hpd interval is (t1, t2), then t1, t2 satisfy the following
two properties:

    ∫_{t1}^{t2} f(θ) dθ = 1 − α
    f(t1) = f(t2)
The first of these is true for any 100(1 − α)% probability interval.
The two together make the interval a hpd interval.
If the density f (θ) is unimodal and symmetric then the symmetric
interval and the hpd interval coincide. Otherwise they do not.
99 / 352
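For the G(119, 1220) posterior of the shabag example, a hpd interval can
be found numerically by searching for the density cut-off that traps 95%
of the probability; a rough grid-based sketch:

lambda <- seq(0.06, 0.15, 1e-5)
dens <- dgamma(lambda, 119, 1220)
cut <- optimize(function(k) {
  abs(sum(dens[dens >= k]) * 1e-5 - 0.95)   # probability above the cut-off
}, interval = c(0, max(dens)))$minimum
range(lambda[dens >= cut])                  # close to the symmetric interval here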
[Figure: highest posterior density regions of varying probability content
(50%, 75% and 95% HPD) compared with a 95% quantile-based interval for a
skewed posterior p(θ|y).]
Figure: Highest posterior density regions of varying probability content.
100 / 352
Bayesian inference and prediction
Lecture 7
Sharif Mahmood
101 / 352
Normal Model
Suppose that y = (Y1, Y2, . . . , Yn) is a random sample of size n
from the normal distribution with mean µ ∈ R and variance
σ^2 ∈ (0, ∞).
Three cases can be considered for the normal model:
1 µ is unknown, σ^2 is known.
2 µ is known, σ^2 is unknown.
3 µ is unknown, σ^2 is unknown.
102 / 352
Normal Model
1  µ is unknown, σ^2 is known

    p(µ|y) ∝ ∏_{i=1}^{n} f(yi|µ, σ^2) π(µ)
           ∝ exp[−1/(2σ^2) Σ_{i=1}^{n}(yi − µ)^2] exp[−1/(2σ0^2)(µ − µ0)^2]
           ∝ exp[−1/(2σ^2)(Σ yi^2 − 2µ Σ yi + nµ^2) − 1/(2σ0^2)(µ^2 − 2µµ0 + µ0^2)]
           ∝ exp[−1/(2σ^2)(nµ^2 − 2µ Σ yi) − 1/(2σ0^2)(µ^2 − 2µµ0)]
           ∝ exp[−(1/2){µ^2 (nσ0^2 + σ^2)/(σ0^2 σ^2) − 2µ (σ0^2 Σ yi + µ0 σ^2)/(σ0^2 σ^2)}]
           ∝ exp[−(nσ0^2 + σ^2)/(2σ0^2 σ^2) {µ − (σ0^2 Σ yi + µ0 σ^2)/(nσ0^2 + σ^2)}^2].

That is, p(µ|y) ∼ N( (σ0^2 Σ yi + µ0 σ^2)/(nσ0^2 + σ^2), σ0^2 σ^2/(nσ0^2 + σ^2) ) = N(µ1, σ1^2)
103 / 352
Normal Model
1  µ is unknown, σ^2 is known
The posterior mean is a weighted average of the prior mean and the
observed data.
Weights are inversely proportional to the corresponding variances.
Large σ0^2 relative to σ^2 (vague prior information) implies a
posterior mean µ1 close to the sample mean ȳ.
Small σ0^2 relative to σ^2 (highly informative prior) implies a
posterior mean µ1 close to the prior mean µ0.
104 / 352
Posterior Predictive Distribution
The posterior predictive distribution of a future observation, y⋆,
can be calculated as

    p(y⋆|y) = ∫ p(y⋆|µ) p(µ|y) dµ

For a single future observation this is easy to compute
[Hoff, 2009, p. 72]:

    E(y⋆|y) = E(E(y⋆|µ, y)|y) = E(µ|y) = µ1

and

    Var(y⋆|y) = E(Var(y⋆|µ, y)|y) + Var(E(y⋆|µ, y)|y)
              = E(σ^2|y) + Var(µ|y)
              = σ^2 + σ1^2
105 / 352
Normal Model
2  µ is known, σ^2 is unknown
Suppose τ = 1/σ^2, known as the precision parameter, has prior
τ ∼ G(δ0/2, γ0/2). Then

    p(τ|y) ∝ ∏_{i=1}^{n} f(yi|τ, µ) π(τ)
           ∝ τ^{n/2} exp[−(τ/2) Σ_{i=1}^{n}(yi − µ)^2] τ^{δ0/2 − 1} exp(−τγ0/2)
           = τ^{(n+δ0)/2 − 1} exp[−(τ/2){Σ_{i=1}^{n}(yi − µ)^2 + γ0}]

    p(τ|y) ∼ G( (n + δ0)/2, {Σ_{i=1}^{n}(yi − µ)^2 + γ0}/2 )
106 / 352
Normal Model
3  µ and σ^2 are both unknown
Specify the joint prior p(µ, τ) = p(µ|τ) p(τ):
    µ|τ ∼ N(µ0, τ^{−1} σ0^2);   τ ∼ G(δ0/2, γ0/2)

    p(µ, τ|y) ∝ τ^{n/2} exp[−(τ/2) Σ_{i=1}^{n}(yi − µ)^2]
                × τ^{1/2} exp[−τ/(2σ0^2) (µ − µ0)^2]
                × τ^{δ0/2 − 1} exp(−τγ0/2)
              = τ^{(n+δ0+1)/2 − 1} exp[−(τ/2){Σ(yi − µ)^2 + (µ − µ0)^2/σ0^2 + γ0}]

Note that the joint posterior distribution does not have a
recognizable form.
Compute the normalizing constant by brute force.
107 / 352
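A sketch of that brute-force approach in R: evaluate the joint posterior
kernel on a (µ, τ) grid and normalize numerically (the data and
hyperparameter values below are hypothetical, chosen only for
illustration):

y <- c(5.1, 4.7, 6.2, 5.5, 4.9); n <- length(y)   # made-up data
mu0 <- 5; s0sq <- 1; delta0 <- 1; gamma0 <- 1     # made-up hyperparameters
mu <- seq(3, 8, 0.01); tau <- seq(0.01, 5, 0.01)
logpost <- matrix(NA, length(mu), length(tau))
for (i in seq_along(mu)) {
  q <- sum((y - mu[i])^2) + (mu[i] - mu0)^2 / s0sq + gamma0
  logpost[i, ] <- ((n + delta0 + 1)/2 - 1) * log(tau) - tau * q / 2
}
post <- exp(logpost - max(logpost))
post <- post / (sum(post) * 0.01 * 0.01)          # normalize over the grid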
Normal Model
    m(y) = ∫_0^∞ ∫_{−∞}^{∞} τ^{(n+δ0+1)/2 − 1}
               exp[−(τ/2){Σ_{i=1}^{n}(yi − µ)^2 + (µ − µ0)^2/σ0^2 + γ0}] dµ dτ
         = ∫_0^∞ τ^{(n+δ0+1)/2 − 1} exp[−(τ/2){γ0 + µ0^2/σ0^2 + Σ yi^2}]
           × ∫_{−∞}^{∞} exp[−(τ/2){µ^2 (n + 1/σ0^2) − 2µ (Σ yi + µ0/σ0^2)}] dµ dτ
         = (2π)^{1/2} Γ((n + δ0)/2) / (n + 1/σ0^2)^{1/2}
           × [ {γ0 + µ0^2/σ0^2 + Σ yi^2 − (Σ yi + µ0/σ0^2)^2/(n + 1/σ0^2)} / 2 ]^{−(n+δ0)/2}

    ⇒ p(µ, τ|y) = τ^{(n+δ0+1)/2 − 1}
                  exp[−(τ/2){Σ(yi − µ)^2 + (µ − µ0)^2/σ0^2 + γ0}] / m(y).
108 / 352
A Hypothetical Example for Inference
Throughout this talk we will use a hypothetical example:
We have a group of MS students at ISRT. Among other things,
we’d like to know the mean height of a student (LGM).
We managed to measure the heights of all 10 LGM we know; the data
are as follows (in centimeters):
122 122 116 134 113 114 113 110 123 130
No one knows the population mean in their planet.
Interestingly, we know that a reliable estimate of the standard
deviation (σ) of the complete LGM population is 8cm.
109 / 352
Inference for Unknown Mean: Confidence Intervals
Sample mean is, ȳ = 119.7 which is our best estimate of the
population mean.
Using our estimate, we calculate a confidence interval

    [ȳ − t × σ/√n, ȳ + t × σ/√n]

where t is the critical value of interest from the t-distribution.
Here is how to calculate the 95% confidence interval in R:

> lgm <- c(122, 122, 116, 134, 113, 114, 113, 110, 123, 130)
> mean(lgm) + qt(.025, df = 9) * (8/sqrt(10))
[1] 113.977
> mean(lgm) + qt(.975, df = 9) * (8/sqrt(10))
[1] 125.423
110 / 352
Another Look at the Confidence Intervals
Population mean is a value that is fixed and unknown.
We calculate the 95% confidence interval using the only sample
mean.
If we had many similar samples from the same population, and
calculate the same interval for each, we would have 95% of them
including the population mean.
As a result, we say that we are 95% confident that the interval we
calculated contains the population mean.
Does it mean: ‘with 0.95 probability the population mean is in the
interval we calculated’ ?
111 / 352
Inferring Mean Height of LGM: the/a Bayesian Way
Population: N(µ, σ^2)
Prior: µ ∼ N(µ0, σ0^2)
Likelihood: N(ȳ, σ^2/n)
Posterior: N(µ1, σ1^2)

We start with a Gaussian prior µ ∼ N(100, 10^2).
We observe the data. The likelihood is N(ȳ, σ^2/n) = N(119.7, 6.4).
The posterior is again normal, N(118.515, 6.015):

    µ1 = (σ0^2 Σ yi + µ0 σ^2)/(nσ0^2 + σ^2)
       = (10^2 × 1197 + 100 × 8^2)/(10 × 10^2 + 8^2) = 118.515
    σ1^2 = σ0^2 σ^2/(nσ0^2 + σ^2)
         = (10^2 × 8^2)/(10 × 10^2 + 8^2) = 6.015
112 / 352
Inferring Mean Height of LGM: Calculations
95% credible interval for our LGM example:

> qnorm(0.025, 118.515, sqrt(6.015))
[1] 113.708
> qnorm(0.975, 118.515, sqrt(6.015))
[1] 123.322

(our frequentist confidence interval was [113.977, 125.423])
Note that we can now say "with .95 probability the population mean is
in the range [113.708, 123.322]".
113 / 352
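A quick simulation check of these numbers, together with the posterior
predictive distribution of a single new height (which has variance
σ^2 + σ1^2):

mu.sim <- rnorm(100000, 118.515, sqrt(6.015))   # draws from p(mu | y)
quantile(mu.sim, c(0.025, 0.975))               # about (113.7, 123.3)
ystar <- rnorm(100000, mu.sim, 8)               # draws from p(y* | y)
quantile(ystar, c(0.025, 0.975))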
Posterior Distribution using Discrete Approximations
Lecture 8
Sharif Mahmood
114 / 352
One Dimensional Conjugate Example
✑ Suppose we are interested in the proportion of patients given a
certain drug who suffer a particular side effect. Let this proportion
be θ. Our prior distribution for θ is a Beta(1, 4) distribution.
We observe a sample of n = 30 patients, of whom x = 5 suffer the
side effect. Assume that we can regard this observation as a value
from the conditional Binomial(30, θ) distribution of x given θ.
Use R to plot a graph showing the prior density, the likelihood and
the posterior density of θ as follows.
1 Set up a grid of θ values.
  theta<-seq(0,1,0.01)
2 Calculate the prior density.
  prior<-dbeta(theta,1,4)
115 / 352
One Dimensional Conjugate Example
3 The likelihood is proportional to θ^5 (1 − θ)^25 = θ^{6−1} (1 − θ)^{26−1}.
  This is proportional to the density of a Beta(6, 26) distribution.
  Calculate this.
  likelihood<-dbeta(theta,6,26)
4 The posterior distribution is Beta(1+5, 4+25), that is, Beta(6, 29).
  Calculate the posterior density.
  posterior<-dbeta(theta,6,29)
5 Plot the graph.
  plot(theta,posterior,type="l",xlab=expression(theta),
    ylab="Density")
  lines(theta,prior,lty=2)
  lines(theta,likelihood,lty=3)
116 / 352
One Dimensional Conjugate Example
[Plot: posterior (solid), prior (dashed) and likelihood (dotted)
densities over θ.]
Figure: One Dimensional Conjugate Example
117 / 352
One Dimensional Numerical Example
✑ This is exactly like the previous example except that we use a
different prior distribution. The prior density is now proportional to

    g(θ) = 1.1 − 2θ   for 0 < θ < 0.5
         = 0.1        for 0.5 ≤ θ < 1

1 Set up a grid of θ values.
  theta<-seq(0,1,0.01)
2 Calculate the prior density.
  prior<-rep(0.1,101)
  prior<-prior+(1-2*theta)*(theta<0.5)
3 Now, to make this a proper density we need to divide by the
  integral, which is 0.1 + (0.5 × 1)/2 = 0.35.
  prior<-prior/0.35
118 / 352
One Dimensional Numerical Example
4 The likelihood is proportional to θ^5 (1 − θ)^25 = θ^{6−1} (1 − θ)^{26−1}.
  This is proportional to the density of a Beta(6, 26) distribution.
  Calculate this.
  likelihood<-dbeta(theta,6,26)
5 The posterior density is proportional to the prior density times the
  likelihood.
  posterior<-prior*likelihood
  We have to normalise this by dividing by the integral.
  posterior<-posterior/(sum(posterior*0.01))
6 Plot the graph.
  plot(theta,posterior,type="l",xlab=expression(theta),
    ylab="Density")
  lines(theta,prior,lty=2)
  lines(theta,likelihood,lty=3)
119 / 352
One Dimensional Numerical Example
[Plot: posterior (solid), prior (dashed) and likelihood (dotted)
densities over θ.]
Figure: One Dimensional Numerical Example
120 / 352
One Dimensional Numerical Example
✑ Consider the shabag data described in lecture 6. To illustrate the
principles involved when we use a non-conjugate prior, we will try
using a "triangular" prior distribution with the pdf

    p(λ) = 4(1 − 2λ)   for 0 < λ < 0.5
         = 0           otherwise

The calculations are actually quite simple in this case but we will
do them using a general method.
The likelihood is proportional to

    λ^115 exp(−1200λ)
It is usually better to work in terms of logarithms, to avoid very
large and very small numbers.
121 / 352
One Dimensional Numerical Example
The log likelihood, apart from an additive constant, is
115 ln(λ) − 1200λ
We will also take logarithms of the prior density. Note that we
could not do this where the prior density is zero but we really do
not need to. In this example the logarithm of the prior density is,
apart from an additive constant,
ln(1 − 2λ)
Having added the log prior to the log likelihood we subtract the
maximum of this so that the maximum becomes zero. We then
take exponentials and the maximum of this will be 1.0. Again, the
reason for doing this is to avoid very large or very small numbers.
We then normalise by finding the integral and dividing by this.
122 / 352
One Dimensional Numerical Example
R code

stepsize <- 0.001
lambda <- seq(0.05, 0.15, stepsize)

## Prior
prior <- 4*(1 - 2*lambda)
logprior <- log(1 - 2*lambda)

## Likelihood
loglik <- 115*log(lambda) - 1200*lambda

## Posterior
logpos <- logprior + loglik
logpos <- logpos - max(logpos)
posterior <- exp(logpos)
posterior <- posterior/(sum(posterior)*stepsize)
123 / 352
One Dimensional Numerical Example
The R commands which I used to produce this plot are as follows.
plot(lambda,posterior,type="l",xlab=expression(lambda),
ylab="Density")
like<-dgamma(lambda,116,1200)
lines(lambda,prior,lty=2)
lines(lambda,like,lty=3)
abline(h=0)
Having found the posterior density of λ we can find, the posterior
mean, variance and standard deviation, using R as follows.
postmean<-sum(posterior*lambda)*stepsize
postmean
[1] 0.09646694
postvar<-sum(posterior*(lambda^2))*stepsize-postmean^2
postvar
[1] 8.01825e-05
sqrt(postvar)
[1] 0.008954468
124 / 352
One Dimensional Numerical Example
[Plot: posterior (solid), prior (dashed) and likelihood (dotted)
densities over λ.]
Figure: One Dimensional Numerical Example
125 / 352
Two-Dimensional Numerical Example (Weibull)
✑ To illustrate the calculations in a two-parameter case analysed
numerically we will look at inference for the parameters of a
Weibull distribution using the data in the table
  67  206   32  950  178    1  313 1166  131 1615
 431 1391  165  340  463  306  630 1088  939  258
 662  627  496  247 2285   33  573  313 1859  672
 254 2093  437   57  506  858   28  815  132   50
 187  492  436  813  637  344  482   17  254  246
 545

1 Load the function into R. Type
  weibpost<-
  and then copy and paste the contents of weibpost.r. The file
  contains an R function to evaluate the posterior density in the case
  where α and β have independent gamma prior distributions.
126 / 352
Two-Dimensional Numerical Example (Weibull)
2 First we need to set up ranges of α and β values. For example:
  alphgrid<-seq(0.6,1.6,0.005)
  betagrid<-seq(0.001,0.003,0.00001)
  These numbers are more or less guesses at this stage. We may have
  to change them and try again.
3 We have
      F(t) = 1 − exp(−[βt]^α)
      ⇒ 1 − F(t) = exp(−[βt]^α)
      ⇒ ln{−ln[1 − F(t)]} = α ln β + α ln t
  If we plot ln{−ln[1 − F(t)]} against ln t, we should get a straight
  line with gradient α and intercept α ln β.
127 / 352
Two-Dimensional Numerical Example (Weibull)
4 We do not know F(t), but we can get a rough idea directly
  from the data. If the data arranged in increasing order are
  t(1), . . . , t(n), then a rough estimate of F(t(j)) is given by
  (j − 0.5)/n. We can therefore plot a graph in R as follows.
  ltsort<-log(sort(t))
  n<-length(t)
  F<-((1:n)-0.5)/n
  G<-log(-log(1-F))
  plot(ltsort,G)
  This gives us a line with approximate gradient 1.1 and
  approximate intercept -6.9. These suggest a value for α around 1.1
  and a value for β around exp(−6.9/1.1) ≈ 0.002.
128 / 352
Two-Dimensional Numerical Example (Weibull)
5 Next we specify the parameters of our prior distribution. We give
  α a gamma(1, 1) distribution and β independently a
  gamma(3, 1000) distribution as follows.
  prior<-c(1,1,3,1000)
6 Lines 10-13 in the function weibpost create two-dimensional
  arrays of α and β values so that every combination of values of α
  and β is represented by a position in the grid. The next line
  computes
      ln{α^{a_α − 1} e^{−b_α α} β^{a_β − 1} e^{−b_β β}}
  which, apart from a constant, is the logarithm of the joint prior
  density. Line 18 then adds to this, for each observation,
      ln{αβ(βti)^{α−1} exp[−(βti)^α]}
129 / 352
Two-Dimensional Numerical Example (Weibull)
7 We can now create a two-dimensional array containing values of
  the posterior density by typing
  posterior<-weibpost(alphgrid,betagrid,t,prior)
8 Make a contour plot of the posterior density
  contour(alphgrid,betagrid,posterior,
    xlab=expression(alpha),ylab=expression(beta))
9 Plot the posterior density function as a surface
  post<-posterior/1000
  persp(alphgrid,betagrid,post,xlab="Alpha",ylab="Beta",
    zlab="Density",phi=25,theta=45,ticktype="detailed")
130 / 352
Two-Dimensional Numerical Example (Weibull)
10 To calculate the marginal posterior density of α or β we can
   integrate along the rows or columns of the array containing the
   posterior density. Calculate and plot the marginal density of α as
   follows.
   apost<-rowSums(posterior)*rstep
   plot(alphgrid,apost,type="l",xlab=expression(alpha),
     ylab="Density")
   Here rstep is the step size for β, i.e. 0.00001 in the original
   calculation of posterior.
11 Use the marginal density to calculate the posterior mean of α
   pmeana<-sum(alphgrid*apost)*astep
   Here astep is the step size for α, i.e. 0.005 in the original
   calculation of posterior.
131 / 352
Two-Dimensional Numerical Example (Weibull)
12 Find a hpd region with a specified probability as follows. First
   sort the values of the posterior density into ascending order.
   postvec<-sort(c(posterior))
13 Next cumulatively sum them.
   postvecsum<-cumsum(postvec)*astep*rstep
14 Now determine the level of the posterior density which will
   determine the contour which we want. For example, if we want a
   95% region then we want the integral over the region outside the
   hpd region to be 0.05.
   crit<-max(postvec[postvecsum<0.05])
   hpd<-ifelse((posterior>crit),1,0)
   contour(alphgrid,betagrid,hpd,levels=c(0.5),xlab
     =expression(alpha),ylab=expression(beta),drawlabels=F)
132 / 352
Two-Dimensional Numerical Example (Probit)
✑ A clinical trial on the use of the drug sulfinpyrazone in patients
who had suffered myocardial infarctions (“heart attacks”). We
wanted to see whether the drug had an effect on the number dying.
Patients in one group were given the drug while patients in
another group were given a placebo, that is an inactive substitute.
The following table gives the number of all “analysable” deaths up
to 24 months.
                              Deaths    Total
Group 1 (Sulfinpyrazone)         44      560
Group 2 (Placebo)                62      540
We can represent this situation by saying that there are two
groups, containing n1 and n2 patients, and two parameters, θ1 , θ2 ,
such that, given these parameters, the distribution of the number
of deaths Xj in Group j is Binomial(nj , θj ).
133 / 352
Two-Dimensional Numerical Example (Probit)
Now we could give θj a beta prior distribution but it seems
reasonable that our prior beliefs would be such that θ1 and θ2
would not be independent.
We transform from the (0, 1) scale of θ1 , θ2 to a (−∞, ∞) scale
and then give the new parameters, η1 , η2 , a bivariate normal
distribution.
We can use a transformation where θj = F (ηj ) and F (x) is the
distribution function of a continuous distribution on (−∞, ∞),
usually one which is symmetric about x = 0.
One possibility is to use the standard normal distribution function
Φ(x) so that θj = Φ(ηj ). We write ηj = Φ−1 (θj ) where this
function, Φ−1 (x), the inverse of the standard normal distribution
function, is sometimes called the probit function.
134 / 352
Two-Dimensional Numerical Example (Probit)
1 Load the function into R. Type
  probit1<-
  and then copy and paste the contents of probit1.r.
2 Enter the data.
  n<-c(560,540)
  x<-c(44,62)
3 Specify the prior.
  prior<-c(-1.24,-1.24,0.42,0.21,0.7)
4 Set up ranges for θ1 and θ2.
  theta1<-seq(0.01,0.99,0.01)
  theta2<-theta1
5 Use the function to evaluate the posterior density.
  posterior<-probit1(theta1,theta2,n,x,prior)
6 Make a contour plot of the posterior density.
  contour(theta1,theta2,posterior$density)
135 / 352
Two-Dimensional Numerical Example (Probit)
7 We see that the posterior probability is concentrated in one region,
  so we adjust the ranges for θ1 and θ2 and try again.
  theta1<-seq(0.05,0.15,0.001)
  theta2<-seq(0.05,0.15,0.001)
  post<-probit1(theta1,theta2,n,x,prior)
  dens<-post$density
  contour(theta1,theta2,dens,xlab=expression(theta[1]),
    ylab=expression(theta[2]))
8 We can add a line showing where θ1 = θ2.
  abline(0,1,lty=2)
  We can easily see that nearly all of the posterior probability lies in
  the region where θ2 > θ1.
9 The other element of the result of the function is the posterior
  probability that θ2 > θ1. Extract this.
  posterior$prob
136 / 352
Two-Dimensional Numerical Example (Probit)
10
To make the plot of the prior and posterior densities of the log
relative risk, load the function probit2 from the file probit2.r. Type
probit2<-
and then copy and paste the contents of probit2.r.
11
Set up the data and prior. (They are the same as before so you
should already have these).
n<-c(560,540)
x<-c(44,62)
prior<-c(-1.24,-1.24,0.42,0.21,0.7)
12
Set up ranges for γ, the log relative risk, and θ2 .
gamma<-seq(-1,1,0.02)
theta2<-seq(0.05,0.15,0.001)
13
Use the function.
probit2(gamma,theta2,n,x,prior)
137 / 352
Introduction to Multiparameter Models
Lecture 9
Sharif Mahmood
138 / 352
Multiparameter Models
Real problems in statistics nearly always involve more than one
unknown and unobservable quantity. However, usually only one,
or a few, parameters or predictions are of substantive interest.
The ultimate aim of Bayesian analysis is to obtain the marginal
posterior distribution.
We first require the joint posterior distribution of all unknowns,
and then integrate this over the unknowns ("nuisance parameters")
that are not of immediate interest to obtain the marginal distribution.
Example:
1
We may be primarily interested in the population mean height µ,
but of course we don't really know the value of the population
variance σ².
2
So in a realistic model, we must also treat σ² as an unknown
parameter.
139 / 352
Multiparameter Models
In cases of this kind, the aim of Bayesian analysis is to obtain the
posterior marginal distribution of the parameter(s) of interest, e.g.
p(µ|y)
The general approach is to estimate the joint posterior distribution
of all unknown quantities in the model, and then integrate out the
one(s) we are not interested in.
Example: in the normal means example, we will find
p(µ, σ²|y)
then
p(µ|y) = ∫ p(µ, σ²|y) dσ²
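As a small numerical illustration of integrating out a nuisance parameter (my own sketch, not from the text): grid the joint posterior and sum over σ². The joint density below uses the noninformative-prior form derived on the following slides, and the data vector is purely illustrative.
# sketch: marginalize a gridded joint posterior over sigma^2
joint.post <- function(mu, sig2) {
  y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)   # illustrative data
  sig2^(-(length(y)/2 + 1)) * exp(-sum((y - mu)^2) / (2 * sig2))  # unnormalized p(mu, sig2 | y)
}
mu.grid   <- seq(1.5, 2.1, length = 200)
sig2.grid <- seq(0.001, 0.1, length = 200)
joint <- outer(mu.grid, sig2.grid, Vectorize(joint.post))
joint <- joint / sum(joint)                       # normalize over the grid
marg.mu <- rowSums(joint)                         # sum over sigma^2: approximate p(mu | y)
plot(mu.grid, marg.mu / (mu.grid[2] - mu.grid[1]), type = "l",
     xlab = expression(mu), ylab = "density")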
140 / 352
Normal Model with µ and σ 2 Both Unknown (I)
First consider the conventional noninformative prior
p(µ, σ²) ∝ 1/σ²
This arises by considering µ and σ 2 a priori independent and
taking the product of the standard noninformative priors for each.
A priori independence may be a reasonable assumption here; it
says that if we knew something about one of the unknown
parameters, that wouldn’t give us information about the
distribution of the other one.
This is not quite a conjugate prior. We will see that the posterior
distribution does not factor like this into an inverse gamma times
an independent normal.
141 / 352
Joint Posterior Distribution
The joint posterior distribution with conventional noninformative
prior is
p(µ, σ²|y) ∝ (1/σ²) × (σ²)^{-n/2} exp{ -(1/(2σ²)) Σ_{i=1}^n (yi − µ)² }
           = (σ²)^{-(n/2+1)} exp{ -(1/(2σ²)) [ Σ_{i=1}^n (yi − ȳ)² + n(ȳ − µ)² ] }
           = (σ²)^{-(n/2+1)} exp{ -(1/(2σ²)) [ (n−1)s² + n(ȳ − µ)² ] }
where s² is the sample variance of the yi s:
s² = (1/(n−1)) Σ_{i=1}^n (yi − ȳ)²
ȳ and s² are the sufficient statistics for µ and σ².
142 / 352
Steps to the marginal posterior distribution of µ
To determine p(σ²|y) we must average the joint distribution over µ, i.e.

p(σ²|y) ∝ ∫ (σ²)^{-(n/2+1)} exp{ -(1/(2σ²)) [ (n−1)s² + n(ȳ − µ)² ] } dµ.

Integrating this expression over µ requires evaluating the integral of the
factor exp(−n(ȳ − µ)²/(2σ²)), which is a simple normal integral; thus

p(σ²|y) ∝ (σ²)^{-(n/2+1)} exp{ -(n−1)s²/(2σ²) } × sqrt(2πσ²/n)
        ∝ (σ²)^{-(n/2+1)} exp{ -(n−1)s²/(2σ²) } × (σ²)^{1/2}
        ∝ (σ²)^{-(n+1)/2} exp{ -(n−1)s²/(2σ²) }

which is an inverse-gamma distribution.
143 / 352
The conditional posterior distribution of µ given σ 2
Use what we already know about the posterior mean of µ with
known variance and uniform prior on µ
p(µ|σ², y) = N(ȳ, σ²/n)

We will use these identities from conditional probability:

p(µ|y) = ∫ p(µ, σ²|y) dσ² = ∫ p(µ|σ², y) p(σ²|y) dσ²

Again it can be shown by direct integration that

p(µ|y) ∝ [ 1 + n(µ − ȳ)²/((n−1)s²) ]^{-n/2}

which is a Student's t distribution. [Gelman et al., 2004, p. 75-77]
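A quick Monte Carlo check of this result (my own sketch, with illustrative data): under the noninformative prior, σ²|y is scaled inverse-χ², so (n−1)s²/σ² ~ χ²_{n−1}, and µ|σ², y is normal as above; the composed draws of µ should behave like the t distribution.
set.seed(1)
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)   # illustrative data
n <- length(y); ybar <- mean(y); s2 <- var(y)
S <- 10000
sig2 <- (n - 1) * s2 / rchisq(S, n - 1)     # draws from p(sigma^2 | y)
mu   <- rnorm(S, ybar, sqrt(sig2 / n))      # draws from p(mu | sigma^2, y)
# standardized mu draws should look like a t distribution with n-1 df
qqplot(qt(ppoints(S), df = n - 1), (mu - ybar) / sqrt(s2 / n))
abline(0, 1)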
144 / 352
Normal Model with µ and σ 2 Both Unknown (II)
Second, consider normal data with a conjugate prior. A
convenient parameterization is given by the following specification:

p(µ|σ²) ~ N(µ0, σ²/n0)
p(σ²) ~ InvG(α/2, β/2)

which corresponds to the joint prior density

p(µ, σ²) ∝ (σ²)^{-(α/2+1)} exp{ -β/(2σ²) } × (σ²)^{-1/2} exp{ -(n0/(2σ²))(µ − µ0)² }

The joint posterior distribution p(µ, σ²|y) is obtained by multiplying the
prior by the normal likelihood, which yields

p(µ, σ²|y) ∝ (σ²)^{-(α/2+1)} (σ²)^{-1/2} exp{ -(1/(2σ²)) [β + n0(µ − µ0)²] }
             × (σ²)^{-n/2} exp{ -(1/(2σ²)) [(n−1)s² + n(ȳ − µ)²] }.
145 / 352
Normal Model with µ and σ 2 Both Unknown (III)
Third, consider prior distributions for µ and σ² specified independently:

p(µ|σ²) ~ N(µ0, σ0²)
p(σ²) ~ InvG(α/2, β/2)

i.e. the prior distribution of µ does not depend on σ². The
joint prior is the product of these two priors.
This is called a "semi-conjugate" prior.
The joint posterior density function can be written as

N(µ0, σ0²) × InvG(α/2, β/2) × ∏_{i=1}^n N(yi|µ, σ²)

In fact, the marginal posterior distributions p(σ²|y) and p(µ|y)
have no simple conjugate form.
146 / 352
Normal Model with µ and σ 2 Both Unknown (III)
If p(y|µ, σ 2 ) ∼ N (µ, σ 2 ), we showed in Lecture 7 that
p(µ|σ², y) ~ N(µ1, σ1²) with

µ1 = (nȳσ0² + µ0σ²)/(nσ0² + σ²)   and   σ1² = σ0²σ²/(nσ0² + σ²)

In the conjugate case, we showed that p(σ²|y) was an inverse
gamma distribution.
A sample of {µ, σ²} from the joint posterior distribution can be
obtained by sampling
1
a value σ²(s) from p(σ²|y), an inverse-gamma distribution, then
2
a value µ(s) from p(µ|σ²(s), y), a normal distribution.
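A minimal sketch of this two-step (composition) sampling for the conjugate case; the posterior parameters below are the standard normal–inverse-gamma updates, and the data and prior values are purely illustrative.
# composition sampling sketch: prior mu | sigma^2 ~ N(mu0, sigma^2/n0), sigma^2 ~ InvG(alpha/2, beta/2)
set.seed(1)
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)   # illustrative data
n <- length(y); ybar <- mean(y); s2 <- var(y)
mu0 <- 1.9; n0 <- 1; alpha <- 1; beta <- 0.01                   # illustrative prior values
an <- (alpha + n) / 2
bn <- (beta + (n - 1) * s2 + n * n0 * (ybar - mu0)^2 / (n + n0)) / 2
S <- 10000
sig2 <- 1 / rgamma(S, an, bn)                        # step 1: sigma^2(s) from p(sigma^2 | y)
mu   <- rnorm(S, (n0 * mu0 + n * ybar) / (n0 + n),   # step 2: mu(s) from p(mu | sigma^2(s), y)
              sqrt(sig2 / (n0 + n)))
mean(mu); quantile(mu, c(.025, .975))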
147 / 352
Normal Model with µ and σ 2 Both Unknown (III)
We first suggested that the prior mean and standard deviation of
µ should be µ0 = 1.9 and σ0 = .95, as this would put most of the
prior mass on µ > 0
For σ 2 , let’s use prior parameters of α = 1 and β = 0.01
Figure: Joint posterior distribution (post.grid evaluated over mean.grid and prec.grid).
148 / 352
Normal Model with µ and σ 2 Both Unknown (III)
R code for joint posterior distribution [Hoff, 2009, p. 91-92]
R code
mu0<-1.9 ; t20<-0.95^2 ; alpha<-1 ; beta<-.01
y<-c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)
G<-100 ; H<-100
mean.grid<-seq(1.505,2.00,length=G)
prec.grid<-seq(1.75,175,length=H)
post.grid<-matrix(nrow=G,ncol=H)
for(g in 1:G) {
  for(h in 1:H) {
    post.grid[g,h]<-
      dnorm(mean.grid[g],mu0,sqrt(t20)) *
      dgamma(prec.grid[h],alpha/2,beta/2) *
      prod(dnorm(y,mean.grid[g],1/sqrt(prec.grid[h])))
  }
}
post.grid<-post.grid/sum(post.grid)
149 / 352
The Weibull Model
The Weibull distribution is often used as a distribution for
lifetimes. The probability density function of a Weibull distribution is

f(t) = αρ(ρt)^{α−1} exp{ -(ρt)^α }.

Suppose we have n observations t1, . . . , tn. The likelihood is

L(α, ρ) = α^n ρ^{nα} ( ∏_{i=1}^n ti^{α−1} ) exp{ -ρ^α Σ_{i=1}^n ti^α }.

Suppose that our prior density for α and ρ is f^{(0)}_{α,ρ}(α, ρ). The posterior pdf is

f^{(1)}_{α,ρ}(α, ρ) ∝ h_{α,ρ}(α, ρ) ∝ f^{(0)}_{α,ρ}(α, ρ) × L(α, ρ).
150 / 352
The Weibull Model
To complete the evaluation of the posterior pdf we find

C = ∫∫ h_{α,ρ}(α, ρ) dα dρ

numerically and then

f^{(1)}_{α,ρ}(α, ρ) = h_{α,ρ}(α, ρ)/C.

Suppose, for example, that we give α and ρ independent gamma
prior distributions so that

f^{(0)}_{α,ρ}(α, ρ) ∝ α^{aα−1} exp[−bα α] × ρ^{aρ−1} exp[−bρ ρ].

Then the posterior pdf is proportional to

h_{α,ρ}(α, ρ) = α^{n+aα−1} ρ^{nα+aρ−1} ( ∏_{i=1}^n ti )^{α−1} exp[ -( bα α + bρ ρ + ρ^α Σ_{i=1}^n ti^α ) ].
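Connecting back to the earlier grid calculation, a sketch that evaluates h(α, ρ) on a grid and normalizes it numerically; the lifetimes, prior parameters and grid ranges are all illustrative.
# sketch: unnormalized Weibull posterior on a grid
t.obs <- c(12, 35, 54, 70, 89, 110, 131, 155, 180, 220)    # hypothetical lifetimes
a.alpha <- 1; b.alpha <- 1; a.rho <- 1; b.rho <- 100        # gamma prior parameters (illustrative)
log.h <- function(alpha, rho) {
  n <- length(t.obs)
  (n + a.alpha - 1) * log(alpha) + (n * alpha + a.rho - 1) * log(rho) +
    (alpha - 1) * sum(log(t.obs)) - b.alpha * alpha - b.rho * rho - rho^alpha * sum(t.obs^alpha)
}
alphgrid <- seq(0.2, 4, length = 200)
rhogrid  <- seq(0.0005, 0.03, length = 200)
lpost <- outer(alphgrid, rhogrid, Vectorize(log.h))
post  <- exp(lpost - max(lpost))                  # work on the log scale to avoid underflow
astep <- diff(alphgrid)[1]; rstep <- diff(rhogrid)[1]
C <- sum(post) * astep * rstep                    # numerical normalizing constant
post <- post / C
contour(alphgrid, rhogrid, post, xlab = expression(alpha), ylab = expression(rho))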
151 / 352
Bayesian Computational Methods
Lecture 10
Sharif Mahmood
152 / 352
Introduction
Bayesian analysis requires computation of expectations and
quantiles of probability distributions that arise as posterior
distributions.
If conjugate priors are not used, as is mostly the case these days,
posterior distributions will not be standard distributions and
hence the required Bayesian quantities (i.e., posterior quantities of
inferential interest) cannot be computed in closed form.
Thus special techniques are needed for Bayesian computations.
153 / 352
Development Over Time
1763-1960: Conjugate priors.
1960’s: Numerical quadrature (Newton-Cotes methods, Gaussian
quadrature, etc.).
1970’s: Expectation-Maximization (EM) algorithm (iterative
mode finder).
1980’s: Asymptotic methods.
1980’s: Noniterative Monte Carlo methods (direct posterior
sampling and indirect methods, e.g., importance
sampling, rejection).
1990’s: Markov chain Monte Carlo (MCMC; Gibbs sampling,
Metropolis Hastings algorithm, etc.).
154 / 352
Asymptotic Methods
Advantages:
1
Deterministic, noniterative algorithm.
2
Substitutes differentiation for integration.
3
Facilitates studies of Bayesian robustness.
Disadvantages:
1
Requires well-parametrized, unimodal posterior.
2
θ must be of at most moderate dimension.
3
n must be large, but is beyond our control.
155 / 352
Noniterative Monte Carlo Methods
Direct sampling
Indirect methods
◮
Importance sampling
◮
Rejection sampling
156 / 352
Direct Sampling
Suppose θ ~ p(θ) and γ = E[c(θ)] = ∫ c(θ)p(θ) dθ.
If θ1, . . . , θN are independent and identically distributed as p(θ),
we have

γ̂ = (1/N) Σ_{j=1}^N c(θj)

γ̂ converges to E[c(θ)] with probability 1 by the SLLN as N → ∞; N
is the Monte Carlo sample size.

Var(γ̂) = Var[c(θ)]/N

SÊ(γ̂) = sqrt( (1/(N(N−1))) Σ_{j=1}^N [c(θj) − γ̂]² )

γ̂ ± 2 SÊ(γ̂) provides an approximately 95% confidence interval for
the posterior mean γ.
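A tiny sketch of direct sampling, taking the "posterior" to be a Gamma(2, 1) distribution and c(θ) = θ² purely for illustration.
set.seed(1)
N <- 5000
theta <- rgamma(N, 2, 1)            # theta_1,...,theta_N iid from p(theta)
c.theta <- theta^2                  # c(theta) = theta^2; true E[c(theta)] is 6 here
gamma.hat <- mean(c.theta)          # Monte Carlo estimate of E[c(theta)]
se.hat <- sqrt(sum((c.theta - gamma.hat)^2) / (N * (N - 1)))
c(gamma.hat, gamma.hat - 2 * se.hat, gamma.hat + 2 * se.hat)   # estimate and ~95% interval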
157 / 352
Importance Sampling
Suppose we wish to approximate

E[h(θ)|y] = ∫ h(θ)f(y|θ)π(θ) dθ / ∫ f(y|θ)π(θ) dθ.

Suppose further we can roughly approximate the normalized
likelihood times prior, c f(y|θ)π(θ), by some density g(θ) from
which we can easily sample.
Then defining the weight function w(θ) = f(y|θ)π(θ)/g(θ),

E[h(θ)|y] = ∫ h(θ)w(θ)g(θ) dθ / ∫ w(θ)g(θ) dθ ≈ [ (1/N) Σ_{j=1}^N h(θj)w(θj) ] / [ (1/N) Σ_{j=1}^N w(θj) ]

where θj are iid draws from g(θ).
Here, g(θ) is called the importance function; a good match to
c f(y|θ)π(θ) will produce roughly equal weights.
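A sketch of importance sampling for an unnormalized target q(θ) = θ⁴ e^{−2θ} (a Gamma(5, 2) shape, so the true posterior mean is 2.5) with a Gamma(2, 0.5) importance function; both choices are purely illustrative.
set.seed(1)
q <- function(theta) theta^4 * exp(-2 * theta)   # unnormalized likelihood x prior
N <- 20000
theta <- rgamma(N, 2, 0.5)                       # theta_j ~ g(theta), the importance function
w <- q(theta) / dgamma(theta, 2, 0.5)            # weights w(theta) = q(theta)/g(theta)
sum(theta * w) / sum(w)                          # estimate of E[theta | y]; should be near 2.5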
158 / 352
Rejection Sampling
Instead of approximating the normalized posterior

p(θ|y) = p(y|θ)π(θ) / ∫ p(y|θ)π(θ) dθ

we try to "blanket" it.
Suppose there exists an identifiable constant M > 0, and a smooth
density g(θ), called the envelope function, such that

f(y|θ)π(θ) < M g(θ)
Rejection method proceeds as follows:
(i) Generate θj ∼ g(θ)
(ii) Generate U ∼ Unif(0, 1)
(iii) If M U g(θj ) < f (y|θj )π(θj ) accept θj . Otherwise reject θj .
(iv) Return to step (i) and repeat until the desired sample
{θj , j = 1, 2, . . . , N } is obtained
The final sample consists of random draws from p(θ|y).
159 / 352
Rejection Sampling
Figure: Unstandardized posterior distribution and proper rejection envelope.
Consider the θj samples in the histogram bar centered at a: the
rejection step “slices off” the top portion of the bar.
Repeat for all a: accepted θj s mimic the lower curve!
Need to choose M as small as possible (efficiency), and watch for
“envelope violations”!
160 / 352
An example
✑ Suppose our target density (the one we want to sample from) is
the triangle density:

f(y) = 8y               if 0 ≤ y < 0.25
     = 8/3 − (8/3)y     if 0.25 ≤ y ≤ 1
     = 0                otherwise
Target Density f (y) (the triangle density):
f.y <- function(y) {
if (y >= 0 && y < 0.25)
8 * y
else if (y >= 0.25 && y <= 1)
8/3 - 8 * y/3
else 0
}
For our candidate density g(y), let’s use the Unif(0,1) density:
g.y <- function(y) {
if (y >= 0 && y <= 1)
1
else 0
}
161 / 352
An example
Let’s set M = 3 because I know from guess and check that f (y) is
never greater than M g(y), which is 3 for all y ∈ [0, 1].
M <- 3
y <- seq(0, 1, by = 0.01)   # grid of y values for plotting (needed below)
c1 <- unlist(lapply(y, g.y))*M
c2 <- unlist(lapply(y, f.y))
plot(y, c1, type="l", xlab="y", ylab="Density", ylim=c(0,5))
lines(y, c2, lty=2)
legend("topright", c("f(y)", "Mg(y)"), lty=c(2,1), cex=1.3)
162 / 352
An example: Rejection Sampling
Let's do rejection sampling!
R code
m <- 10000
n.draws <- 0
draws <- c()
y.grid <- seq(0, 1, by = 0.01)
while (n.draws < m) {
  y.c <- runif(1, 0, 1)
  accept.prob <- f.y(y.c) / (M * g.y(y.c))
  u <- runif(1, 0, 1)
  if (accept.prob >= u) {
    draws <- c(draws, y.c)
    n.draws <- n.draws + 1
  }
}
plot(density(draws), xlab="y", ylab="Density", ylim=c(0,3), main="")
lines(y, c2, lty=2)   # y and c2 (target density) from the previous chunk
legend("topright", c("Rejection Sampling Draws", "Target Density"), lty=c(1,2), cex=1.3)
163 / 352
An example
We can compare the density of our draws to the target density f(y).
Figure: Triangular target density with the density of the rejection sampling draws overlaid.
164 / 352
Limitations of IS and RS
Rejection sampling is a general method for simulating from an
arbitrary posterior distribution, but it can be difficult to set up
since it requires the construction of a suitable proposal density.
Importance sampling is also a general-purpose algorithm, but it
too requires a proposal density that may be difficult to find for
high-dimensional problems.
So we require Markov chain Monte Carlo (MCMC) algorithms to
handle such situations.
165 / 352
Bayesian Computational Methods
Lecture 11
Sharif Mahmood
166 / 352
Markov Chain Monte Carlo (MCMC)
Markov Chain: a stochastic process in which future states are
independent of past states given the present state
Monte Carlo: simulation
Goals
1
to make inference about model parameters
2
to make predictions
Requires
1
integration over a possibly high-dimensional integrand
2
we may not know the integrating constant
We can compute quantities of interest of a distribution from simulated
draws from that distribution.
167 / 352
Monte Carlo Integration
✑ Suppose we have a distribution p(θ) (perhaps a posterior) that we
want to take quantities of interest from.
⇒ To derive it analytically, we need to take integrals:

I = ∫_Θ g(θ)p(θ) dθ

where g(θ) is some function of θ (e.g. g(θ) = θ for the mean and
g(θ) = [θ − E(θ)]² for the variance).
We can approximate the integrals via Monte Carlo Integration by
simulating M values from p(θ) and calculating

Î_M = (1/M) Σ_{i=1}^M g(θ^{(i)})
168 / 352
Monte Carlo Integration
For example, we can compute the expected value of the Beta(3,3)
distribution analytically:

E(θ) = ∫_Θ θ p(θ) dθ = ∫_Θ θ (Γ(6)/(Γ(3)Γ(3))) θ²(1 − θ)² dθ = 1/2
or via Monte Carlo methods:
> M <- 10000
> beta.sims <- rbeta(M, 3, 3)
> sum(beta.sims)/M
[1] 0.5013
Our Monte Carlo approximation IˆM is a simulation consistent
estimator of the true value I : IˆM → I as M → ∞. We know this
to be true from the Strong Law of Large Numbers.
169 / 352
Strong Law of Large Numbers (SLLN)
Let Y1 , Y2 , . . . be a sequence of independent and identically
distributed random variables, each having a finite mean µ = E(Yi ).
Then with probability 1,
(Y1 + Y2 + · · · + YM)/M → µ  as M → ∞
In our previous example, each simulation draw was independent
and distributed from the same Beta(3,3) distribution.
This also works with variances and other quantities of interest,
since functions of i.i.d. random variables are themselves i.i.d.
random variables.
But what if we can't generate draws that are independent?
170 / 352
Monte Carlo Integration on the Markov Chain
Once we have a Markov chain that has converged to the stationary
distribution, then the draws in our chain appear to be like draws
from p(θ|y), so it seems like we should be able to use Monte Carlo
Integration methods to find quantities of interest.
One problem: our draws are not independent, which we required
for Monte Carlo Integration to work (remember SLLN).
Luckily, we have the Ergodic Theorem.
171 / 352
Ergodic Theorem
Let θ (1) , θ (2) , . . . , θ (M ) be M values from a Markov chain that is
aperiodic, irreducible, and positive recurrent (then the chain is
ergodic), and E[g(θ)] < ∞.
Then with probability 1,
(1/M) Σ_{i=1}^M g(θi) → ∫ g(θ)π(θ) dθ
as M → ∞, where π is the stationary distribution.
This is the Markov chain analog to the SLLN, and it allows us to
ignore the dependence between draws of the Markov chain when
we calculate quantities of interest from the draws.
But what does it mean for a chain to be aperiodic, irreducible, and
positive recurrent, and therefore ergodic?
172 / 352
Aperiodicity
A Markov chain is aperiodic if the only length of time for which
the chain repeats some cycle of values is the trivial case with cycle
length equal to one.
Let A, B, and C denote the states (analogous to the possible
values of θ) in a 3-state Markov chain. The following chain is
periodic with period 3, where the period is the number of steps
that it takes to return to a certain state.
As long as the chain is not repeating an identical cycle, then the
chain is aperiodic.
173 / 352
Irreducibility
A Markov chain is irreducible if it is possible go from any state to
any other state (not necessarily in one step).
The following chain is reducible, or not irreducible.
The chain is not irreducible because we cannot get to A from B or
C regardless of the number of steps we take.
174 / 352
Positive Recurrence
A Markov chain is recurrent if, for any given state i, a chain
starting at i will eventually return to i with probability 1.
A Markov chain is positive recurrent if the expected return time to
state i is finite; otherwise it is null recurrent.
So if our Markov chain is aperiodic, irreducible, and positive
recurrent (all the ones we use in Bayesian statistics usually are),
then it is ergodic and the ergodic theorem allows us to do Monte
Carlo Integration by calculating quantities of interest from our
draws, ignoring the dependence between draws.
175 / 352
Burn-in
Since convergence usually occurs regardless of our starting point,
we can usually pick any feasible starting point (e.g., picking
starting draws that are in the parameter space). However, the
time it takes for the chain to converge varies depending on the
starting point.
As a matter of practice, most people throw out a certain number
of the first draws, known as the burn-in.
Pros: This is to make our draws closer to the stationary distribution and
less dependent on the starting point.
Cons: However, it is unclear how much we should burn-in since our
draws are all slightly dependent and we don’t know exactly when
convergence occurs.
176 / 352
Thinning the Chain
In order to break the dependence between draws in the Markov
chain, some have suggested only keeping every dth draw of the
chain.
This is known as thinning.
Pros:
Perhaps gets you a little closer to i.i.d. draws.
Saves memory since you only store a fraction of the draws.
Cons:
Unnecessary with ergodic theorem.
Shown to increase the variance of your Monte Carlo estimates.
177 / 352
So Really, What is MCMC?
MCMC is a class of methods in which we can simulate draws that
are slightly dependent and are approximately from a (posterior)
distribution.
We then take those draws and calculate quantities of interest for
the (posterior) distribution.
In Bayesian statistics, there are generally two MCMC algorithms
that we use: the Gibbs Sampler and the Metropolis-Hastings
algorithm.
178 / 352
Markov chain Monte Carlo methods
Iterative MC methods are useful when it is difficult or impossible
to find a feasible importance or envelope density.
Algorithms:
◮
Gibbs sampler
◮
Metropolis-Hastings algorithm
◮
Slice sampler
Performance evaluation:
◮
Convergence monitoring and diagnostics
◮
Variance estimation
179 / 352
Bayesian Computational Methods
Lecture 12
Sharif Mahmood
180 / 352
Gibbs Sampler
Suppose there are k parameters θ = (θ1 , . . . , θk )′ in our model
Assume that full (or complete) conditional distributions
{p(θi|θ_{j≠i}, y), i = 1, . . . , k} are available for sampling.
Direct sampling (if full conditionals have known forms) or indirect
sampling may be used.
Under mild conditions the collection of full conditional
distributions uniquely determine the joint posterior p(θ|y) and all
marginals, p(θi |y), i = 1, . . . , k
181 / 352
Gibbs Sampling Algorithm
Suppose a set of arbitrary starting values {θ2^(0), . . . , θk^(0)} are given.
The algorithm proceeds as follows:
(i) Draw θ1^(1) from p(θ1|θ2^(0), . . . , θk^(0), y)
(ii) Draw θ2^(1) from p(θ2|θ1^(1), θ3^(0), . . . , θk^(0), y)
..
.
(k) Draw θk^(1) from p(θk|θ1^(1), θ2^(1), . . . , θ_{k−1}^(1), y)
We have θ1^(1), . . . , θk^(1) after one iteration.
After t such iterations we have θ1^(t), . . . , θk^(t), which converges in
distribution to a draw from the true joint posterior p(θ1, . . . , θk|y).
182 / 352
Gibbs Sampling Algorithm
For T sufficiently large (say, bigger than t0), {θ^(t)}_{t=t0+1}^T is a
(correlated) sample from the true posterior.
We might use a sample mean to estimate the posterior mean

E(θi|y) ≈ (1/(T − t0)) Σ_{t=t0+1}^T θi^(t)

The time from t = 0 to t = t0 is commonly known as the burn-in
period.
We may also run m parallel Gibbs sampling chains and obtain

E(θi|y) ≈ (1/(m(T − t0))) Σ_{j=1}^m Σ_{t=t0+1}^T θi^(j,t)

where the index j indicates the chain.
183 / 352
Example (I): Gibbs Sampling Algorithm
✑ Suppose we have data on the number of failures (yi) for each of 10
pumps in a nuclear plant [Carlin and Louis, 2008, p.127].
We also have the times (ti) at which each pump was observed.

i    1    2    3    4     5    6    7    8    9    10
y    5    1    5    14    3    19   1    1    4    22
t    94   16   63   126   5    31   1    1    2    10

Table: Pump failure data

We want to model the number of failures with a Poisson
likelihood, where the expected number of failures λi differs for each
pump. Since the time over which we observed each pump is different,
we need to scale each λi by its observed time ti.
Our likelihood is ∏_{i=1}^{10} Poisson(λi ti).
184 / 352
Example (I): Gibbs Sampling Algorithm
Let's put a G(α, β) prior on λi with α = 1.8, so the λi s are drawn
from the same distribution.
Also, let's put a G(γ, δ) prior on β, with γ = 0.01 and δ = 1.
So our model has 11 parameters that are unknown (10 λi 's and β).
Our posterior is

p(λ, β|y, t) ∝ [ ∏_{i=1}^{10} Poisson(λi ti) × G(α, β) ] × G(γ, δ)
  = [ ∏_{i=1}^{10} ( e^{-λi ti}(λi ti)^{yi}/yi! ) × ( β^α/Γ(α) ) λi^{α−1} e^{-βλi} ] × ( δ^γ/Γ(γ) ) β^{γ−1} e^{-δβ}
  ∝ [ ∏_{i=1}^{10} e^{-λi ti}(λi ti)^{yi} × β^α λi^{α−1} e^{-βλi} ] × β^{γ−1} e^{-δβ}
  ∝ [ ∏_{i=1}^{10} λi^{yi+α−1} e^{-(ti+β)λi} ] × β^{10α+γ−1} e^{-δβ}
185 / 352
Example (I): Gibbs Sampling Algorithm
Finding the full conditionals:
p(λi|λ_{−i}, β, y, t) ∝ λi^{yi+α−1} e^{-(ti+β)λi}
p(β|λ, y, t) ∝ β^{10α+γ−1} e^{-β(δ + Σ_{i=1}^{10} λi)}

p(λi|λ_{−i}, β, y, t) is a G(yi + α, ti + β) distribution.
p(β|λ, y, t) is a G(10α + γ, δ + Σ_{i=1}^{10} λi) distribution.
186 / 352
Coding the Gibbs Sampler
1
Define starting values for β (we only need to define β here because
we will draw λ first and it only depends on β and other given
values).
beta.cur <- 1
2
Draw λ^(1) from its full conditional (we're drawing all the λi's as a
block because each depends only on β and not on the other λ's).
lambda.update <- function(alpha, beta, y, t) {
rgamma(length(y), y + alpha, t + beta)
}
3
Draw β (1) from its full conditional, using λ(1) .
beta.update <- function(alpha, gamma, delta, lambda, y) {
rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
}
187 / 352
Coding the Gibbs Sampler
4
Repeat using most updated values until we get M draws.
5
Optional burn-in and thinning.
6
Make it into a function.
188 / 352
Coding the Gibbs Sampler
R code
gibbs <- function(n.sims, beta.start, alpha, gamma, delta,
                  y, t, burnin = 0, thin = 1) {
  beta.draws <- c()
  lambda.draws <- matrix(NA, nrow = n.sims, ncol = length(y))
  beta.cur <- beta.start
  ###########
  lambda.update <- function(alpha, beta, y, t) {
    rgamma(length(y), y + alpha, t + beta)
  }
  beta.update <- function(alpha, gamma, delta, lambda, y) {
    rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
  }
  for (i in 1:n.sims) {
    lambda.cur <- lambda.update(alpha = alpha, beta = beta.cur, y = y, t = t)
    beta.cur <- beta.update(alpha = alpha, gamma = gamma, delta = delta,
                            lambda = lambda.cur, y = y)
    if (i > burnin & (i - burnin) %% thin == 0) {
      lambda.draws[(i - burnin) / thin, ] <- lambda.cur
      beta.draws[(i - burnin) / thin] <- beta.cur
    }
  }
  return(list(lambda.draws = lambda.draws, beta.draws = beta.draws))
}
189 / 352
Coding the Gibbs Sampler
7
Do Monte Carlo Integration on the resulting Markov chain, whose
draws are approximately samples from the posterior.
R code
y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
rbind(y, t)
posterior <- gibbs(n.sims = 10000, beta.start = 1,
                   alpha = 1.8, gamma = 0.01, delta = 1, y = y, t = t)
colMeans(posterior$lambda.draws)
mean(posterior$beta.draws)
apply(posterior$lambda.draws, 2, sd)
sd(posterior$beta.draws)
190 / 352
Example (II): Gibbs Sampling Algorithm
✑ Suppose y1, . . . , yn are i.i.d. N(µ, σ²), τ = 1/σ², and π(µ, τ) ∝ τ^{−1}. We
want to devise the Gibbs sampler to sample from the joint
posterior density of (µ, σ²).
Sol

p(µ, τ|y) ∝ τ^{n/2} exp[ -(τ/2) Σ(yi − µ)² ] × τ^{−1}
          = τ^{n/2−1} exp[ -(τ/2) Σ(yi − µ)² ]

As a function of µ,

p(µ|y, τ) ∝ exp[ -(τ/2)( Σyi² − 2nȳµ + nµ² ) ]
          ∝ exp[ -(nτ/2)( µ² − 2ȳµ ) ]
          ∝ exp[ -(nτ/2)( µ − ȳ )² ]

Hence µ|y, τ ~ N(ȳ, 1/(nτ)) and τ|y, µ ~ Γ( n/2, Σ(yi − µ)²/2 ).
191 / 352
Example (II): Gibbs Sampling Algorithm (contd)
To obtain joint posterior samples from [µ, τ |y] we
1
Start with arbitrary (µ(0) , τ (0) )
2
Sample µ(1) from [µ|y, τ (0) ]
3
Sample τ (1) from [τ |y, µ(1) ]
4
Repeat until we have t iterations yielding (µ(1) , τ (1) ), . . . , (µ(t) , τ (t) )
5
The t iterations produce t Gibbs samples.
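A minimal R sketch of this sampler (the data vector is simulated purely for illustration):
set.seed(1)
y <- rnorm(30, 5, 2); n <- length(y); ybar <- mean(y)
S <- 5000
mu <- numeric(S); tau <- numeric(S)
mu.cur <- ybar; tau.cur <- 1                                   # arbitrary starting values
for (s in 1:S) {
  mu.cur  <- rnorm(1, ybar, sqrt(1 / (n * tau.cur)))           # mu | y, tau
  tau.cur <- rgamma(1, n / 2, sum((y - mu.cur)^2) / 2)         # tau | y, mu
  mu[s] <- mu.cur; tau[s] <- tau.cur
}
mean(mu); mean(1 / tau)        # posterior means of mu and of sigma^2 = 1/tau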
192 / 352
Features of Gibbs Sampling
The Gibbs sampler is easy to understand and implement, but
requires the ability to readily sample from each of the full
conditional distributions, p(θi|θ_{l≠i}, y).
Unfortunately, when the prior π(θ) and the likelihood f(y|θ) are
not a conjugate pair, one or more of these full conditionals may
not be available in closed form.
Even in this setting, however, p(θi|θ_{l≠i}, y) will be available up to a
proportionality constant, as it is proportional to the portion of
f(y|θ) that involves θi.
The Metropolis-Hastings algorithm gives a solution to this problem.
193 / 352
Bayesian Computational Methods
Lecture 13
Sharif Mahmood
194 / 352
Intuition
1
The Metropolis-Hastings algorithm can draw samples from any
probability distribution p(y), provided you can compute the value
of a function f (y) which is proportional to the density of p.
2
The Metropolis-Hastings algorithm works by generating a
sequence of sample values in such a way that, as more and more
sample values are produced, the distribution of values more closely
approximates the desired distribution, p(y).
3
These sample values are produced iteratively, with the distribution
of the next sample being dependent only on the current sample
value (thus making the sequence of samples into a Markov chain.)
4
Specifically, at each iteration, the algorithm picks a candidate for
the next sample value based on the current sample value. Then,
with some probability, the candidate is either accepted or rejected.
195 / 352
Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm follows the following steps:
(i) Choose a starting value θ (0) .
(ii) At iteration t, draw a candidate θ ∗ from a jumping distribution
Jt (θ ∗ |θ (t−1) ).
(iii) Compute an acceptance ratio (probability):
r = [ p(θ*|y)/Jt(θ*|θ^(t−1)) ] / [ p(θ^(t−1)|y)/Jt(θ^(t−1)|θ*) ]
(iv) Accept θ ∗ as θ (t) with probability min(r, 1). If θ ∗ is not accepted,
then θ (t) = θ (t−1) .
(v) Repeat steps ii-iv M times to get M draws from p(θ|y), with
optional burn-in and/or thinning.
196 / 352
Step 1: Choose a starting value θ (0) .
This is equivalent to drawing from our initial stationary
distribution.
The important thing to remember is that θ (0) must have positive
probability.
p(θ (0) |y) > 0
Otherwise, we are starting with a value that cannot be drawn.
197 / 352
Step 2: Draw θ ∗ from Jt (θ ∗ |θ (t−1) ).
The jumping distribution Jt (θ ∗ |θ (t−1) ) determines where we move
to in the next iteration of the Markov chain. The support of the
jumping distribution must contain the support of the posterior.
The original Metropolis algorithm required that Jt (θ ∗ |θ (t−1) ) be a
symmetric distribution (such as the normal distribution), that is
Jt (θ ∗ |θ (t−1) ) = Jt (θ (t−1) |θ ∗ )
The Metropolis-Hastings algorithm shows that this symmetry is unnecessary.
If we have a symmetric jumping distribution that is dependent on
θ (t−1) , then we have what is known as random walk Metropolis
sampling.
198 / 352
Step 2: Draw θ ∗ from Jt (θ ∗ |θ (t−1) ).
If our jumping distribution does not depend on θ (t−1) ,
Jt (θ ∗ |θ (t−1) ) = Jt (θ ∗ )
then we have what is known as independent Metropolis-Hastings
sampling.
Basically all our candidate draws θ ∗ are drawn from the same
distribution, regardless of where the previous draw was.
This can be extremely efficient or extremely inefficient, depending
on how close the jumping distribution is to the posterior.
Generally speaking, the chain will behave well only if the jumping
distribution has heavier tails than the posterior.
199 / 352
Step 3: Compute acceptance ratio r.
r = [ p(θ*|y)/Jt(θ*|θ^(t−1)) ] / [ p(θ^(t−1)|y)/Jt(θ^(t−1)|θ*) ]
In the case where our jumping distribution is symmetric,
r = p(θ*|y) / p(θ^(t−1)|y)
If our candidate draw has higher probability than our current
draw, then our candidate is better so we definitely accept it.
Otherwise, our candidate is accepted according to the ratio of the
probabilities of the candidate and current draws.
Note that since r is a ratio, we only need p(θ|y) up to a constant
of proportionality since p(y) cancels out in both the numerator
and denominator.
200 / 352
Step 3: Compute acceptance ratio r.
In the case where our jumping distribution is not symmetric,
r = [ p(θ*|y)/Jt(θ*|θ^(t−1)) ] / [ p(θ^(t−1)|y)/Jt(θ^(t−1)|θ*) ]
We need to weight our evaluations of the draws at the posterior
densities by how likely we are to draw each draw.
For example, if we are very likely to jump to some θ ∗ , then
Jt (θ ∗ |θ (t−1) ) is likely to be high, so we should accept less of them
than some other θ ∗ that we are less likely to jump to.
In the case of independent Metropolis-Hastings sampling,
r = [ p(θ*|y)/Jt(θ*) ] / [ p(θ^(t−1)|y)/Jt(θ^(t−1)) ]
201 / 352
Step 4: Decide whether to accept θ ∗ .
Accept θ ∗ as θ (t) with probability min(r, 1). If θ ∗ is not accepted,
then θ (t) = θ (t−1) .
1
For each θ*, draw a value u from the Uniform(0,1) distribution.
2
If u ≤ r, accept θ* as θ^(t). Otherwise, use θ^(t−1) as θ^(t).
Candidate draws with higher density than the current draw are
always accepted.
Unlike in rejection sampling, each iteration always produces a
draw, either θ ∗ or θ (t−1) .
202 / 352
Acceptance Rates
It is important to monitor the acceptance rate (the fraction of
candidate draws that are accepted) of your Metropolis-Hastings
algorithm.
If your acceptance rate is too high, the chain is probably not
mixing well (not moving around the parameter space quickly
enough).
If your acceptance rate is too low, your algorithm is too inefficient
(rejecting too many candidate draws).
What is too high and too low depends on your specific algorithm,
but generally
• random walk: somewhere between 0.25 and 0.50 is recommended
• independent: something close to 1 is preferred.
203 / 352
A Simple Example
Using a random walk Metropolis algorithm to sample from a
G(1.7, 4.4) distribution with a Normal jumping distribution with
standard deviation of 2.
R code
mh.gamma <- function(n.sims, start, burnin, cand.sd, shape, rate) {
  theta.cur <- start
  draws <- c()
  theta.update <- function(theta.cur, shape, rate) {
    theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd)
    accept.prob <- dgamma(theta.can, shape = shape, rate = rate) /
                   dgamma(theta.cur, shape = shape, rate = rate)
    if (runif(1) <= accept.prob)
      theta.can
    else theta.cur
  }
  for (i in 1:n.sims) {
    draws[i] <- theta.cur <- theta.update(theta.cur, shape = shape,
                                          rate = rate)
  }
  return(draws[(burnin + 1):n.sims])
}
mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2,
                     shape = 1.7, rate = 4.4)
204 / 352
How Gibbs and MH sampling differ
The MH algorithm does not require that all parameters in a
problem be updated simultaneously, updating parameters
one-at-a-time in an MH algorithm makes MH sampling seem like
Gibbs sampling. However, the MH algorithm differs from the
Gibbs sampler in two important ways.
First, the candidate parameter is not automatically accepted,
because it comes from a proposal density and not from the
appropriate conditional distribution.
Second, the proposal density is not an enveloping function as in
the rejection sampling routine. In short, all Gibbs sampling is MH
sampling, but not all MH sampling is Gibbs sampling.
205 / 352
Relationship between Gibbs and MH sampling
First, the Gibbs sampler can be viewed as a special case of the
MH algorithm in which the proposal densities for each
parameter/random variable are the full conditionals.
In Gibbs sampling, every candidate is selected; there is no
rejection of candidates. The reason is that the ratio r is always 1.
A second link between Gibbs sampling and MH sampling is that
they can be combined to sample from multivariate densities.
206 / 352
Features of Markov Chain Monte Carlo Methods
Importance sampling and rejection sampling are non-iterative
methods.
They draw a sample of size N and stop.
For high-dimensional problems, it may be quite difficult to find an
importance density or envelope function.
To address these issues, it is now standard Bayesian practice to
turn to Markov chain Monte Carlo (MCMC) methods.
These methods operate by sequentially sampling parameter values
from a Markov chain (MC).
The stationary distribution is exactly the desired joint posterior
distribution of interest.
One needs to ensure the convergence of the MC to its stationary
distribution.
The Gibbs sampler and Metropolis-Hastings are the two most commonly
used MCMC methods in practice.
207 / 352
Bayesian Computational Methods
Lecture 14
Sharif Mahmood
208 / 352
Sequential updating
We can change our beliefs about an unknown parameter θ from a
prior distribution with density f^(0)(θ) to a posterior distribution
with density f^(1)(θ) when we observe data y, using

f^(1)(θ) ∝ f^(0)(θ) L(θ; y),
where the likelihood L(θ; y) = fY (y|θ), the probability or
probability density of y given θ.
Now suppose that we repeat the process and consider data
arriving sequentially.
We start with a prior distribution with density f (0) (θ). Then we
observe data y (1) and our beliefs change so that they are
represented by f (1) (θ).
Suppose that we now observe further data y (2) so that our density
for θ becomes f (2) (θ).
209 / 352
Sequential updating
Before we observe either y (1) or y (2) , the joint probability (density)
of θ, y (1) , y (2) is
f (0) (θ) × f1 (y (1) |θ) × f2 (y (2) |θ, y (1) ),
where f1 (y (1) |θ) is the conditional probability (density) of y (1)
given θ and f2 (y (2) |θ, y (1) ) is the conditional probability (density)
of y (2) given θ and y (1) .
Clearly
f (1) (θ) ∝ f (0) (θ)f1 (y (1) |θ) = f (0) (θ)L1 (θ; y (1) )
and
f (2) (θ) ∝ f (0) (θ)f1 (y (1) |θ)f2 (y (2) |θ, y (1) ) = f (0) (θ)L1,2 (θ; y (1) , y (2) )
∝ f (1) (θ)f2 (y (2) |θ, y (1) ) = f (1) (θ)L2 (θ; y (2) |y (1) )
210 / 352
Sequential updating
In many cases y (1) and y (2) will be independent given θ. In such a
case f2 (y (2) |θ, y (1) ) = f2 (y (2) |θ) and
L1,2 (θ; y (1) , y (2) ) = L1 (θ; y (1) )L2 (θ; y (2) ).
Our “posterior” distribution after observing y (1) is our “prior”
distribution before we observe y (2) .
The only complication when y (1) and y (2) are not independent
given θ is that the second likelihood L2 (θ; y (2) |y (1) ) must allow for
the dependence of y (2) on y (1) .
211 / 352
Sequential updating
Note that f (2) (θ) is the same whether we update our beliefs first
by y (1) and then by y (2) or first by y (2) and then by y (1) . This is
obvious when y (1) and y (2) are independent given θ.
In the case where they are not it is simply a matter of whether we
factorise the joint probability (density) of y (1) and y (2) given θ into
f1 (y (1) |θ)f2 (y (2) |θ, y (1) ), as above, or into f2∗ (y (2) |θ)f1∗ (y (1) |θ, y (2) ).
At time n we have a density for the data conditional on θ of the form

f(y^(1), . . . , y^(n)|θ) = f(y^(1)|θ) × f(y^(2)|θ, y^(1)) × · · · × f(y^(n)|θ, y^(1), . . . , y^(n−1)).
212 / 352
Example 1: Beta-Binomial
✑ Suppose the prior pdf f^(0)(θ) is proportional to

θ^{a−1}(1 − θ)^{b−1}.

We observe 20 trials, 3 of which are successes, giving a likelihood

L1(θ; y^(1)) ∝ θ³(1 − θ)^{17}.

Thus the posterior pdf at this stage, f^(1)(θ), is proportional to

θ^{a+3−1}(1 − θ)^{b+17−1},

the pdf of a Beta(a + 3, b + 17) distribution.
213 / 352
Example 1: Beta-Binomial
✑ Now suppose that we examine another 30 trials and 7 of these
are successes. These observations are reasonably supposed to
be independent of y^(1) given θ. So

L2(θ; y^(2)) ∝ θ⁷(1 − θ)^{23}

and our new pdf for θ, f^(2)(θ), is proportional to

θ^{a+10−1}(1 − θ)^{b+40−1},

the pdf of a Beta(a + 10, b + 40) distribution.
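A quick numerical check (with illustrative values of a and b) that the two-stage update gives the same posterior as a single update with the pooled data:
a <- 2; b <- 2                                 # illustrative prior parameters
post1  <- c(a + 3,  b + 17)                    # after 3 successes in 20 trials
post2  <- c(post1[1] + 7, post1[2] + 23)       # after a further 7 successes in 30 trials
pooled <- c(a + 10, b + 40)                    # single update with all 50 trials
post2; pooled                                  # identical: Beta(a + 10, b + 40)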
214 / 352
Hierarchical models
Lecture 15
Sharif Mahmood
215 / 352
Introduction to Hierarchical Models
A popular model for describing heterogeneity across several
populations is the hierarchical model.
Hierarchical models serve two purposes:
1
Methodological: when units of analysis are drawn from clusters
within a population (communities, neighborhoods, city blocks, etc.),
they can no longer be considered independent. When independence
does not hold, we cannot construct the likelihood as simply.
2
Substantive: the influence of predictors may be context-dependent,
e.g. student performance may depend on the teacher, the
environmental context of the class.
Use of hierarchical models has increased dramatically in the last
two decades.
216 / 352
Hierarchical structure of the parameters
✑ Suppose we have J observations within each of G groups:
y11, . . . , yJ1, y12, . . . , yJ2, . . . , y1G, . . . , yJG, and we assume that the
data are distributed within groups according to some distribution
Q with parameter θ, but that each group has its own parameter
(θg). Thus

yig ~ Q(θg).

Suppose we assume further that these parameters θg arise from a
common distribution W with parameter γ (this parameter is
called a "hyperparameter"). So,

θg ~ W(γ).

Finally, assume γ has some vague distribution like a uniform:

γ ~ U(−100, 100)
217 / 352
Hierarchical structure of the parameters
A posterior distribution for all unknown parameters would then be
(after substituting the densities Q and W into the conditional
structure below):

p(γ, θ|y) ∝ p(y|θ, γ) p(θ|γ) p(γ).

The product of this joint prior density for the parameters and
the sampling density for the data, given the parameters, yields a
posterior density for all of the parameters.
We might not be interested in the group level parameters θg but
rather in the posterior distribution for the hyperparameter γ that
structures the distribution of the group level parameters.
The marginal distribution for γ:

p(γ|y) ∝ ∫ p(y|θ, γ) p(θ|γ) p(γ) dθ

This integration is performed via MCMC methods as we sample
from the conditional posterior distributions for each parameter.
218 / 352
The hierarchical normal model
The hierarchical normal model describes the heterogeneity of
means across several populations in which the within- and between-group
sampling models are both normal:

within-group model:   p(y|φj) ~ N(θj, σ²),   φj = {θj, σ²}
between-group model:  p(θj|ψ) ~ N(µ, τ²),    ψ = {µ, τ²}

It might help to visualize this setup as in the Figure.
The fixed but unknown parameters in this model are µ, τ² and σ².
For convenience we will use standard semiconjugate normal and
inverse-gamma prior distributions for these parameters:

1/σ² ~ G( ν0/2, ν0σ0²/2 )
1/τ² ~ G( η0/2, η0τ0²/2 )
µ ~ N(µ0, γ0²)
219 / 352
The hierarchical normal model
Figure: A graphical representation of the basic hierarchical normal model.
220 / 352
Posterior
The unknown quantities in our system include the group-specific
means {θ1, . . . , θm}, the within-group sampling variability σ² and
the mean and variance (µ, τ²) of the population of group-specific
means.
Joint posterior inference for these parameters can be made by
constructing a Gibbs sampler which approximates the posterior
distribution

p(θ1, . . . , θm, µ, τ², σ²|y1, . . . , ym)
  ∝ p(µ, τ², σ²) × p(θ1, . . . , θm|µ, τ², σ²) × p(y1, . . . , ym|θ1, . . . , θm, µ, τ², σ²)
  = p(µ) p(τ²) p(σ²) { ∏_{j=1}^m p(θj|µ, τ²) } { ∏_{j=1}^m ∏_{i=1}^{nj} p(yi,j|θj, σ²) }

The term in the second pair of brackets is the result of an
important conditional independence feature of our model.
221 / 352
Full conditional distributions of µ and τ²
The full conditional distributions of µ and τ² are each proportional
to

p(µ) p(τ²) ∏_{j=1}^m p(θj|µ, τ²)

In particular, this must mean that

p(µ|θ1, . . . , θm, τ², σ², y1, . . . , ym) ∝ p(µ) ∏ p(θj|µ, τ²)
p(τ²|θ1, . . . , θm, µ, σ², y1, . . . , ym) ∝ p(τ²) ∏ p(θj|µ, τ²)

Previously, we saw that if y1, . . . , yn were i.i.d. N(θ, σ²) and θ had
a normal prior distribution, then the conditional distribution of θ
was also normal and the conditional distribution of 1/σ² was gamma.
Since our current situation is exactly analogous, the fact that
θ1, . . . , θm are i.i.d. N(µ, τ²) and µ has a normal prior distribution
implies that the conditional distribution of µ must be normal and
the conditional distribution of 1/τ² is gamma as well.
222 / 352
Full conditional of θj
The terms of the posterior that depend on θj show that the full
conditional distribution of θj must be proportional to

p(θj|µ, τ², σ², y1, . . . , ym) ∝ p(θj|µ, τ²) ∏_{i=1}^{nj} p(yi,j|θj, σ²)

This says that, conditional on {µ, τ², σ², yj}, θj is
conditionally independent of the other θ's as well as independent
of the data from groups other than j.
The full conditional distribution is therefore

p(θj|y1,j, . . . , y_{nj,j}, µ, τ², σ²) ~ N( (nj ȳj τ² + µσ²)/(nj τ² + σ²),  τ²σ²/(nj τ² + σ²) )
223 / 352
Full conditional of σ 2
The full conditional of σ² is similar to that in the one-sample
normal model, except now we have information about σ² from m
separate groups:

p(σ²|θ1, . . . , θm, y1, . . . , ym) ∝ p(σ²) ∏_{j=1}^m ∏_{i=1}^{nj} p(yi,j|θj, σ²)
  ∝ (σ²)^{-(ν0/2+1)} e^{-ν0σ0²/(2σ²)} × (σ²)^{-Σnj/2} e^{-ΣΣ(yi,j − θj)²/(2σ²)}

Adding the powers of σ² and collecting the terms in the exponent,
we recognize this as proportional to an inverse-gamma density,
giving

p(1/σ² | θ, y1, . . . , ym) ~ G( (ν0 + Σ_{j=1}^m nj)/2,  (ν0σ0² + Σ_{j=1}^m Σ_{i=1}^{nj} (yi,j − θj)²)/2 ).
224 / 352
Posterior approximation
Posterior approximation proceeds by iterative sampling of each
unknown quantity from its full conditional distribution.
Given a current state of the unknowns
{θ1^(s), . . . , θm^(s), µ^(s), τ^{2(s)}, σ^{2(s)}}, a new state is generated as follows:
1
sample µ^(s+1) ~ p(µ|θ1^(s), . . . , θm^(s), τ^{2(s)});
2
sample τ^{2(s+1)} ~ p(τ²|θ1^(s), . . . , θm^(s), µ^(s+1));
3
sample σ^{2(s+1)} ~ p(σ²|θ1^(s), . . . , θm^(s), y1, . . . , ym);
4
sample θj^(s+1) ~ p(θj|µ^(s+1), τ^{2(s+1)}, σ^{2(s+1)}, yj) for each
j ∈ {1, . . . , m}
The order in which the new parameters are generated does not
matter. What does matter is that each parameter is updated
conditional upon the most current value of the other parameters.
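A compact R sketch of one scan of this scheme, using the semiconjugate full conditionals given on the previous slides; the grouped data and prior values are purely illustrative.
set.seed(1)
Y <- list(rnorm(10, 48, 9), rnorm(15, 52, 9), rnorm(12, 55, 9))    # illustrative grouped data
m <- length(Y); nj <- sapply(Y, length); ybarj <- sapply(Y, mean)
mu0 <- 50; g20 <- 25; eta0 <- 1; t20 <- 100; nu0 <- 1; s20 <- 100  # illustrative prior values
theta <- ybarj; mu <- mean(ybarj); tau2 <- var(ybarj); sig2 <- mean(sapply(Y, var))
S <- 5000; out <- matrix(NA, S, 3)
for (s in 1:S) {
  # 1. mu | theta, tau2
  v  <- 1 / (m / tau2 + 1 / g20)
  mu <- rnorm(1, v * (m * mean(theta) / tau2 + mu0 / g20), sqrt(v))
  # 2. tau2 | theta, mu
  tau2 <- 1 / rgamma(1, (eta0 + m) / 2, (eta0 * t20 + sum((theta - mu)^2)) / 2)
  # 3. sig2 | theta, y
  ss <- sum(unlist(mapply(function(y, th) (y - th)^2, Y, theta)))
  sig2 <- 1 / rgamma(1, (nu0 + sum(nj)) / 2, (nu0 * s20 + ss) / 2)
  # 4. theta_j | mu, tau2, sig2, y_j
  vj <- 1 / (nj / sig2 + 1 / tau2)
  theta <- rnorm(m, vj * (nj * ybarj / sig2 + mu / tau2), sqrt(vj))
  out[s, ] <- c(mu, tau2, sig2)
}
colMeans(out)          # posterior means of mu, tau^2 and sigma^2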
225 / 352
Bayesian Hypothesis Testing
Lecture 16
Sharif Mahmood
226 / 352
The Problem
Consider the posterior probability distribution

p(θ|y) = p(y|θ)π(θ) / ∫_Θ p(y|θ)π(θ) dθ.

The normalizing constant p(y) = ∫_Θ p(y|θ)π(θ) dθ does not have a
closed form.
Thus the posterior distribution does not have a closed form.
A first order approximation is given by the Bayesian central limit
theorem.
A second order approximation is given by the Laplace method.
227 / 352
Bayesian CLT
Theorem
Suppose y1, . . . , yn are independent observations from p(y|θ). Suppose
that π(θ) is the prior for θ, which may be improper. Further suppose
that the posterior distribution is proper and its mode exists. Then as
n → ∞,

θ|y → Np( θ̂m, H^{−1}(θ̂m) )

Here θ̂m is the posterior mode of θ, obtained by solving
∂/∂θj log p*(θ|y) = 0, j = 1, . . . , p, where p*(θ|y) = p(y|θ)π(θ).

H(θ) = −∂² log p*(θ|y) / ∂θi ∂θj

is called the Hessian matrix.
The asymptotic covariance matrix of θ is the inverse of this Hessian
matrix evaluated at θ̂m, H^{−1}(θ̂m) = H^{−1}(θ)|_{θ=θ̂m}.
228 / 352
Bayesian CLT: proof sketch
Proof.
p(θ|y) ∝ p(y|θ)π(θ)
       = exp{ log( p(y|θ)π(θ) ) }
       = exp{ log p*(θ|y) }
       = exp[ I(θ) ]

exp[ I(θ) ] ≈ exp[ I(θ̂m) + (θ − θ̂m) (∂/∂θ) I(θ)|_{θ=θ̂m} + (1/2)(θ − θ̂m)² (∂²/∂θ²) I(θ)|_{θ=θ̂m} ]
            = exp[ I(θ̂m) − (1/2) H(θ̂m)(θ − θ̂m)² ]
            ≈ exp[ −(1/2) H(θ̂m)(θ − θ̂m)² ].

∴ p(θ|y) ≈ N( θ̂m, H^{−1}(θ̂m) ).
229 / 352
Bayesian CLT: Example I
✑ Suppose y1, . . . , yn are i.i.d. Binomial(1, θ). Consider a flat prior
π(θ) = 1, 0 < θ < 1, so that

p(θ|y) ∝ θ^{Σyi}(1 − θ)^{n−Σyi} ≡ θ^y(1 − θ)^{n−y}

Sol

I(θ) = Σyi log θ + (n − Σyi) log(1 − θ)
∂I(θ)/∂θ = Σyi/θ − (n − Σyi)/(1 − θ)   ⇒   θ̂m = Σyi/n = ȳ
∂²I(θ)/∂θ² |_{θ=θ̂m} = −n / ( θ̂m(1 − θ̂m) )
⇒ H^{−1}(θ̂m) = θ̂m(1 − θ̂m)/n

∴ p(θ|y) ≈ N( θ̂m, θ̂m(1 − θ̂m)/n )
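A quick visual check of this approximation against the exact Beta(Σyi + 1, n − Σyi + 1) posterior implied by the flat prior (the data below are illustrative):
n <- 40; y <- 12                                  # 12 successes out of 40, for illustration
theta.hat <- y / n
curve(dbeta(x, y + 1, n - y + 1), 0, 1, xlab = expression(theta), ylab = "density")
curve(dnorm(x, theta.hat, sqrt(theta.hat * (1 - theta.hat) / n)), add = TRUE, lty = 2)
legend("topright", c("exact Beta posterior", "normal approximation"), lty = 1:2)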
230 / 352
Bayesian CLT: Example II
✑ Suppose y1, . . . , yn are i.i.d. Poisson(θ). Consider the prior
π(θ) ∝ θ^{α−1} exp(−βθ); show that

p(θ|y) ≈ N( (α + Σyi − 1)/(n + β),  (α + Σyi − 1)/(n + β)² ).

Sol

I(θ) = (α + Σyi − 1) log θ − (n + β)θ
∂I(θ)/∂θ = (α + Σyi − 1)/θ − (n + β)   ⇒   θ̂m = (α + Σyi − 1)/(n + β)
∂²I(θ)/∂θ² |_{θ=θ̂m} = −(α + Σyi − 1)/θ̂m² = −(n + β)²/(α + Σyi − 1)
⇒ H^{−1}(θ̂m) = (α + Σyi − 1)/(n + β)²

∴ p(θ|y) ≈ N( (α + Σyi − 1)/(n + β),  (α + Σyi − 1)/(n + β)² )
231 / 352
Limitations of Bayesian CLT
If true posterior differs greatly from normality or sample size is too
small, this approximation may be poor.
Laplace's method provides a second-order approximation to posterior
expectations of functions of θ.
For approximations to be valid, the posterior distribution must be
unimodal.
Accuracy depends on large sample and parameterization (e.g. θ or
log θ).
232 / 352
Statistical Hypothesis Testing
A statistical hypothesis test is a method of making decisions using
data.
Statistical hypothesis testing is a key technique of frequentist
inference.
The p-value is the probability that observations as extreme as the
data would occur by chance under the null hypothesis.
The critical region of a hypothesis test is the set of all outcomes
which cause the null hypothesis to be rejected in favor of the
alternative hypothesis.
233 / 352
Fallacies of Statistical Hypothesis Testing
The results of statistical tests are frequently misunderstood
Failure to reject the null hypothesis leads to its acceptance.
(Wrong!)
The p value is the probability that the null hypothesis is incorrect.
(Wrong!)
α = .05 is a standard with an objective basis. (Wrong!)
Small p values indicate large effects. (Wrong!)
Data show a theory to be true or false. (Wrong!)
Statistical significance implies importance. (Wrong!)
234 / 352
Could Fisher, Jeffreys and Neyman Have Agreed on
Testing?
235 / 352
Fisher’s Significance Test
Experiment yields Y ∼ f (y|θ); test H0 : θ = θ0
Choose a test statistic T = t(y), so that larger values of T reflect
evidence against H0.
Compute the p-value for the observed data y:

p = P0( t(Y) > t(y) )

rejecting H0 if p is small (e.g., p < 0.05).
The justification is that the p-value can be viewed as an index of
the ‘strength of evidence’ against H0
236 / 352
Neyman-Pearson Hypothesis Testing
Test H0 : θ = θ0 versus alternative H1 : θ = θ1 .
Reject H0 , if T > c; c is a pre-chosen critical value.
Compute Type I and Type II error probabilities
α = P0 (rejecting H0 |H0 true) and β = P1 (accepting H0 |H0 false),
where Pi refers to probability under Hi .
The justification is the frequentist principle:
In repeated actual use of a statistical procedure, the average
actual error should not be greater than the average reported error.
237 / 352
A Comparison Between Fisherian, Frequentist
(Neyman-Pearson)
Fisher's null hypothesis testing
1
Set up a statistical null hypothesis. The null need not be a nil
hypothesis (i.e., zero difference).
2
Report the exact level of significance (e.g., p = 0.051 or p = 0.049).
Do not use a conventional 5% level, and do not talk about accepting
or rejecting hypotheses.
3
Use this procedure only to draw provisional conclusions in the
context of an attempt to understand the experimental situation.

Neyman-Pearson decision theory
1
Set up two hypotheses, H1 and H2, and decide α, β. These define a
rejection region for each hypothesis.
2
If the data falls into the rejection region of H0, accept H1;
otherwise accept H0. Note that accepting a hypothesis does not
mean that you believe in it.
3
The usefulness of the procedure is limited among others where you
have a disjunction of hypotheses (e.g., either µ1 = 8 or µ2 = 10 is
true).
238 / 352
Criticism
1
Confusion resulting (in part) from combining the methods of
Fisher and Neyman-Pearson which are conceptually distinct.
2
Emphasis on statistical significance to the exclusion of estimation
and confirmation by repeated experiments.
3
Statistical hypothesis testing is misunderstood, overused and
misused.
4
The numerous criticisms of significance testing do not lead to a
single alternative or even to a unified set of alternatives.
5
When used to detect whether a difference exists between groups, a
paradox arises.
239 / 352
Alternatives to Significance Testing
Bayesian inference is one alternative to significance testing.
For example, Bayesian parameter estimation can provide rich
information about the data from which researchers can draw
inferences, while using uncertain priors that exert only minimal
influence on the results when enough data is available.
The probability a hypothesis is true can only be derived from use
of Bayes’ Theorem, which was unsatisfactory to both the Fisher
and Neyman-Pearson camps due to the explicit use of subjectivity
in the form of the prior probability.
Bayesians test the hypothesis, given the data.
240 / 352
The Jeffreys Approach to Testing
Bayesian hypothesis testing of H0 vs H1 is based on the ‘Bayes factor’.
Specify prior probabilities for H0 and H1 as p(H0) and p(H1).
Let Y denote the data collected in the experiment.
p(H0|Y) and p(H1|Y) denote the posterior probabilities.
The Bayes factor in favor of H0 is defined as

BF = [ p(H0|Y)/p(H0) ] / [ p(H1|Y)/p(H1) ]
   = [ p(H0|Y)/p(H1|Y) ] / [ p(H0)/p(H1) ]
   = Posterior odds / Prior odds

where p(H0|Y) = p(Y|H0)p(H0) / [ p(Y|H0)p(H0) + p(Y|H0^c)p(H0^c) ],
with a similar formula for p(H1|Y).
241 / 352
Bayes Factor
The Bayes factor becomes

BF = p(Y|H0) / p(Y|H1).

Large values of the Bayes factor favor H0. Reject H0 if BF ≤ 1.
Jeffreys devised the following criteria for decision making:

1 < BF ≤ 3        Weak evidence
3 ≤ BF ≤ 12       Positive evidence
12 ≤ BF ≤ 150     Strong evidence
BF > 150          Decisive

Bayesians may use the BF to compare hypotheses.
The BF is the Bayesian analog of the likelihood ratio (LR) test, but
the BF does not require nested models as the LR test does.
242 / 352
Bayes Factor: Example I
✑ A child is given an intelligence test, with resulting score Y,

Y ~ N(θ, 100)

- where θ indicates the child's own true IQ.
- 100 is the variance if the same child is repeatedly given IQ tests of the same kind.
In the population, IQ is distributed as

θ ~ N(100, 225).

If the child scores y = 115, then the posterior distribution of θ is

p(θ|y) ~ N(110.4, 69.2).
243 / 352
Bayes Factor: Example I
We have H0: θ ≤ 100 vs H1: θ > 100.
Prior probabilities, from p(θ) = N(100, 225):
⇒ P(θ ≤ 100) = 0.5
⇒ P(θ > 100) = 0.5
and prior odds: 0.5/0.5 = 1.
Posterior probabilities:
P(θ ≤ 100|y) = 0.106
P(θ > 100|y) = 0.894
and posterior odds: 0.106/0.894 = 0.119.
The BF in favor of H0 vs H1 is 0.119/1 = 0.119.
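These numbers can be reproduced directly in R (posterior mean and variance taken from the slide):
post.mean <- 110.4; post.var <- 69.2
p0 <- pnorm(100, post.mean, sqrt(post.var))    # P(theta <= 100 | y), about 0.106
p1 <- 1 - p0                                    # P(theta  > 100 | y), about 0.894
(p0 / p1) / (0.5 / 0.5)                         # BF in favor of H0, about 0.119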
244 / 352
Bayes Factor: Example II
✑ Suppose y1, . . . , yn are observed from N(θ, 1), and we wish to test
H0: θ = 0 versus H1: θ = 1. Then

BF ≡ LR = exp( n/2 − Σyi )

✑ Simple vs composite: H0: θ = 0 versus H1: θ ≠ 0. Consider the prior
π(θ) ~ N(1, 1) under H1.

BF = p(y|θ = 0) / ∫_{ΘH1} p(y|θ)π(θ) dθ
   = (2π)^{−n/2} e^{−(1/2)Σyi²} / ∫_{−∞}^{∞} (2π)^{−n/2} e^{−(1/2)Σ(yi−θ)²} (2π)^{−1/2} e^{−(1/2)(θ−1)²} dθ
   = (n + 1)^{1/2} exp( 1/2 − (1 + Σyi)²/(2(n + 1)) )

With n = 10 and Σyi = 5, BF = 1.06, implying weak evidence in favor of
H0.
245 / 352
Bayesian Model Selection (BMS)
BMS is another way of framing the notion of hypothesis A vs.
hypothesis B to explain the data.
The Bayesian information criterion (BIC) score plays a role similar
to the χ² value in traditional hypothesis testing.
Bayesian Information Criterion (BIC)

BIC = 2 log P(Y|M, θ̂) − d log n

where
Y:  observed data,
M:  the model (the model is actually a family of models with unit
    variance and unknown mean, therefore θ̂ is needed),
θ̂:  the maximum likelihood estimator (MLE) of the parameters
    in the model,
d:  the number of free parameters, and
n:  the number of data points.
A model with a higher BIC is a better model, since if the data fit the
model well, the log likelihood will be higher.
246 / 352
Hypothesis Testing: Example
✑ Suppose a study is conducted with 16,395 individuals from the
general (not high-risk) population:
• 74 HIV cases reported in the 8198 individuals receiving placebos
• 51 HIV cases reported in the 8197 individuals receiving the
treatment
The test that was performed:
• Let p1 and p2 denote the probability of HIV in the placebo and
treatment populations, respectively.
• Test H0 : p1 = p2 versus H1 : p1 6= p2
247 / 352
Hypothesis Testing: Example
Normal approximation okay, so

z = (p̂1 − p̂2)/σ̂_{p̂1−p̂2} = (.009027 − .006222)/.001359 = 2.06
is approximately N (θ, 1), where θ = (p1 − p2 )/(.001359).
We thus test H0 : θ = 0 versus H1 : θ 6= 0, based on z.
Observed z = 2.06, so the p-value is 0.04.
If instead we test H0 : θ = 0 versus H1 : θ > 0, the p-value is 0.02.
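A sketch reproducing these numbers, assuming the usual pooled-proportion standard error (which matches the σ̂ = .001359 quoted above):
x <- c(74, 51); n <- c(8198, 8197)          # HIV cases and group sizes
p.hat <- x / n
p.pool <- sum(x) / sum(n)
se <- sqrt(p.pool * (1 - p.pool) * (1 / n[1] + 1 / n[2]))   # pooled SE (assumed convention)
z <- (p.hat[1] - p.hat[2]) / se
z                                   # about 2.06
2 * (1 - pnorm(z))                  # two-sided p-value, about 0.04
1 - pnorm(z)                        # one-sided p-value, about 0.02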
248 / 352
Bayesian Analysis:
Posterior odds of H0 to H1 = Prior odds of H0 to H1 × BF ;
where
BF = likelihood of H0 for observed data / average likelihood of H1
   = (2π)^{−1/2} e^{−(1/2)(z−0)²} / ∫ (2π)^{−1/2} e^{−(1/2)(z−θ)²} π(θ) dθ

For z = 2.06 and π(θ) = Uniform(0, 2.95), the nonincreasing prior
most favorable to H0,

BF = 0.178
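This Bayes factor can be checked numerically:
z <- 2.06
num <- dnorm(z, 0, 1)
den <- integrate(function(theta) dnorm(z, theta, 1) * dunif(theta, 0, 2.95), 0, 2.95)$value
num / den            # about 0.178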
249 / 352
The Disagreement
Fisher, Jeffreys and Neyman would report considerably different
numbers in actual practice.
Example: In testing whether a normal mean is zero or not, with
z = ȳ/(σ/√n) = 2.3 (or z = 2.9) (σ known, n = 10),
• Fisher would report p = 0.021 (or p = 0.0037).
• Neyman, with pre-specified α = 0.05, would report α = 0.05 in
either case (and a Type II error β).
• Jeffreys would report the objective posterior probability
P (H0 |y) = 0.30 (or P (H0 |y) = 0.10) (using equal prior probabilities
of the hypotheses and a conventional Cauchy(0, σ) prior on the
alternative).
250 / 352
Their Criticisms of the Other Approaches
1
Criticisms of Bayesian analysis in general. (Fisher and Neyman
rarely addressed Jeffreys' particular version of Bayesian theory.)
• Fisher and Neyman rejected pre-Jeffreys objective Bayesian analysis
as logically flawed.
• Fisher's subsequent position was that Bayesian analysis is based on
prior information that is only rarely available.
• Neyman, in addition, assumed that Bayesian analysis violated the
Frequentist Principle.
2
Criticisms of Neyman-Pearson testing
• Fisher and Jeffreys rejected N-P tests because they report fixed
(α, β) i.e., do not provide evidence or error measures that vary with
the data.
• Fisher disliked alternative hypotheses and power.
251 / 352
Their Criticisms of the Other Approaches
3
Criticisms of Fisher's significance testing
• Neyman and Jeffreys: alternative hypotheses are essential.
• Neyman: p-values violate the Frequentist Principle.
• p-values are commonly misinterpreted as error rates, resulting in a
considerable overestimation of the actual evidence against H0 .
252 / 352
Bayesian t-test Hypothesis Testing for Two Independent
Groups
R code
library(R2OpenBUGS)
# your set of values for group 1 and group 2
group1 = c(49,13,50,64,21,23,15,7,32,8,17,15,19,16,33)
group2 = c(41,20,53,41,20,44,31,24,24,44,15,35,32,25,35)
n1 = length(group1); n2 = length(group2)
# Rescale:
group2 = (group2 - mean(group1)) / sd(group1)
group1 = as.vector(scale(group1))
# Specify the file location that contains the OpenBUGS code
openbugsmodel = "Ttest2.txt"
priorforh1 = dcauchy(0)      # prior for the case of delta <> 0
priorforh2 = 2*dcauchy(0)    # prior for the case of delta < 0
priorforh3 = 2*dcauchy(0)    # prior for the case of delta > 0
# Advanced input
itterations = 10000
burnin = 1000
253 / 352
Bayesian t-test for Two Independent Groups
OpenBugs code
model
{
  for (i in 1:n1)
  {
    group1[i] ~ dnorm(muX, lambdaXY)
  }
  for (i in 1:n2)
  {
    group2[i] ~ dnorm(muY, lambdaXY)
  }
  lambdaXY    <- pow(sigmaXY, -2)
  delta       ~ dnorm(0, lambdaDelta)
  lambdaDelta ~ dchisqr(1)
  sigma       ~ dnorm(0, sigmaChi)
  sigmaChi    ~ dchisqr(1)
  sigmaXY     <- abs(sigma)
  mu          ~ dnorm(0, muChi)
  muChi       ~ dchisqr(1)
  alpha       <- delta*sigmaXY
  muX         <- mu + alpha*0.5
  muY         <- mu - alpha*0.5
}
254 / 352
Bayesian t-test for Two Independent Groups
R code
# to be passed on to WinBUGS
data = list("group1", "group2", "n1", "n2")
inits = function()
{
  list(delta = rnorm(1, 0, 1), mu = rnorm(1, 0, 1), sigma = runif(1, 0, 5))
}
# Parameters to be monitored
parameters = c("delta")
# The following command calls WinBUGS with specific options.
samples = bugs(data, inits, parameters,
               model.file = openbugsmodel,
               n.chains = 3, n.iter = iterations,
               n.burnin = burnin, n.thin = 1,
               DIC = T, codaPkg = F, debug = F)
# Now the values for the monitored parameters are in the
# "samples" object, ready for inspection.
samples$summary
255 / 352
Comparing Two Groups
The figure shows math scores from a sample of 10th grade students from two schools, namely, Group 1 and Group 2.
Figure: Math scores for Group 1 and Group 2.
256 / 352
Comparing Two Groups
Suppose we are interested in estimating θ1 , the average score we
would obtain if all 10th graders in school 1 were tested, and
possibly comparing it to θ2 , the corresponding average from
school 2.
The results from the sample data are ȳ1 = 25.47 and ȳ2 = 32.27,
suggesting that θ2 is larger than θ1 .
To assess whether or not the observed mean difference of
ȳ2 − ȳ1 = 6.8 is large compared to the sampling variability it is
standard practice to compute the t-statistic.
t = (ȳ2 − ȳ1) / (sp √(1/n2 + 1/n1)) = (32.27 − 25.47) / (14.20 √(1/14 + 1/14)) = 1.31
257 / 352
Model selection based on p-values:
If p < 0.05,
◮ reject the model that the two groups have the same distribution;
◮ conclude that θ1 ≠ θ2;
◮ use the estimates θ̂1 = ȳ1, θ̂2 = ȳ2.
If p > 0.05,
◮ accept the model that the two groups have the same distribution;
◮ conclude that θ1 = θ2;
◮ use the estimates θ̂1 = θ̂2 = (Σ yi,1 + Σ yi,2)/(n1 + n2).
For our math score data, the above procedure would take the
p-value of 0.20 and tell us to treat the population means of the
two groups as being numerically equivalent.
258 / 352
Comparing Two Groups
The p-value-based procedure described above can be re-expressed
as estimating θ1 as θ̂1 = wȳ1 + (1 − w)ȳ2 , where w = 1 if p < 0.05
and w = n1 /(n1 + n2 ) otherwise.
It might make more sense to allow w to vary continuously and
have a value that depends on such things as the relative sample
sizes n1 and n2 .
The weight should also reflect the sampling variability σ² and our prior information about the similarity of the two populations. An estimator of this form is produced by a Bayesian analysis that allows information to be shared across the groups.
259 / 352
Comparing Two Groups
Consider the following sampling model for data from the two
groups:
Yi,1 = µ − δ + ei,1
Yi,2 = µ + δ + ei,2
{ei,j } ∼ N (0, σ 2 )
where θ1 = µ − δ and θ2 = µ + δ, we see that δ represents half the
population difference in means, as (θ2 − θ1 )/2 = δ, and µ
represents the pooled average, as (θ2 + θ1 )/2 = µ.
Convenient conjugate prior distributions for the unknown parameters are
p(µ, δ, σ²) = p(µ) × p(δ) × p(σ²)
µ ∼ N(µ0, γ0²)
δ ∼ N(δ0, τ0²)
σ² ∼ IG(ν0/2, ν0σ0²/2)
260 / 352
Comparing Two Groups
The full conditional distributions of these parameters are:
µ|y, δ, σ² ∼ N(µ1, γ1²), where
  µ1 = γ1² × [ µ0/γ0² + (1/σ²) Σ_{i=1}^{n1}(yi,1 + δ) + (1/σ²) Σ_{i=1}^{n2}(yi,2 − δ) ]
  γ1² = [ 1/γ0² + (n1 + n2)/σ² ]^{−1}
δ|y, µ, σ² ∼ N(δ1, τ1²), where
  δ1 = τ1² × [ δ0/τ0² + (1/σ²) Σ(yi,2 − µ) − (1/σ²) Σ(yi,1 − µ) ]
  τ1² = [ 1/τ0² + (n1 + n2)/σ² ]^{−1}
σ²|y, µ, δ ∼ IG(ν1/2, ν1σ1²/2), where
  ν1 = ν0 + n1 + n2
  ν1σ1² = ν0σ0² + Σ(yi,1 − [µ − δ])² + Σ(yi,2 − [µ + δ])²
261 / 352
Comparing Two Groups
R code
##### data
y1 <- group1; y2 <- group2
n1 <- length(group1); n2 <- length(group2)
##### prior parameters
mu0  <- 30;  g02 <- 400
del0 <- 0;   t02 <- 400
s20  <- 100; nu0 <- 1
##### starting values
mu  <- (mean(y1) + mean(y2))/2
del <- (mean(y1) - mean(y2))/2
262 / 352
Comparing Two Groups
##### Gibbs sampler
MU <- DEL <- S2 <- NULL; Y12 <- NULL
set.seed(1)
for (s in 1:5000) {
  ## update s2
  s2 <- 1/rgamma(1, (nu0 + n1 + n2)/2,
         (nu0*s20 + sum((y1 - mu + del)^2) + sum((y2 - mu - del)^2))/2)
  ## update mu
  var.mu  <- 1/(1/g02 + (n1 + n2)/s2)
  mean.mu <- var.mu*(mu0/g02 + sum(y1 + del)/s2 + sum(y2 - del)/s2)
  mu <- rnorm(1, mean.mu, sqrt(var.mu))
  ## update del
  var.del  <- 1/(1/t02 + (n1 + n2)/s2)
  mean.del <- var.del*(del0/t02 + sum(y2 - mu)/s2 - sum(y1 - mu)/s2)
  del <- rnorm(1, mean.del, sqrt(var.del))
  ## save parameter values (each iteration)
  MU  <- c(MU, mu); DEL <- c(DEL, del); S2 <- c(S2, s2)
  Y12 <- rbind(Y12, rnorm(2, mu + c(-1, 1)*del, sqrt(s2)))
}
263 / 352
Comparing Two Groups
The figure shows the marginal posterior distributions of µ and δ,
and how they are much more concentrated than their
corresponding prior distributions.
Figure: Prior and posterior distributions for µ and δ.
264 / 352
Advantages over classical approach
In particular, a 95% quantile-based posterior interval can be obtained for 2δ, the difference in average scores between the two schools.
Additionally, the posterior probability Pr(θ2 > θ1|y) = Pr(δ > 0|y) can be computed.
From the joint posterior predictive distribution, Pr(Y2 > Y1|y) can be obtained.
Exercise: find the values of these three quantities.
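A minimal sketch of these three summaries, assuming the objects DEL and Y12 produced by the Gibbs sampler above are available in the workspace:

# 95% posterior interval for 2*delta, Pr(delta > 0 | y), and Pr(Y2 > Y1 | y)
quantile(2*DEL, c(0.025, 0.975))
mean(DEL > 0)
mean(Y12[, 2] > Y12[, 1])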
265 / 352
The Empirical Bayes Approach
Lecture 17
Sharif Mahmood
266 / 352
Introduction
Empirical Bayes methods are procedures for statistical inference in
which the prior distribution is estimated from the data. This
approach stands in contrast to standard Bayesian methods, for
which the prior distribution is fixed before any data are observed.
Empirical Bayes is also known as maximum marginal likelihood.
The priors can be parametric or nonparametric, depending on
unknown parameters that may in turn be drawn from some second
stage prior.
This sequence of parameters and priors constitutes a hierarchical
model. The hierarchy must stop at some point, with all remaining
prior parameters assumed known.
Rather than making this assumption, the empirical Bayes approach uses the observed data to estimate the final-stage parameters and then proceeds as though the prior were known.
267 / 352
Empirical Bayes methods
✑ Suppose in a two-stage hierarchical Bayes model, y = {y1, . . . , yn} are generated from p(y|θ), where θ = {θ1, . . . , θk}. Again θ can be considered samples drawn from a population characterised by hyperparameters η according to p(θ|η), where η is drawn from an unparameterized distribution.
Thus information about a particular quantity of interest θi comes
not only from data, but also from the hyperparameters.
Using Bayes' theorem,
p(θ|y) = p(y|θ) p(θ) / p(y) = [p(y|θ)/p(y)] ∫ p(θ|η) p(η) dη
In general, this integral will not be tractable analytically and must
be evaluated by numerical methods, e.g MCMC.
268 / 352
Empirical Bayes methods
Alternatively, the expression can be written as
p(θ|y) = ∫ p(θ|η, y) p(η|y) dη = ∫ [p(y|θ) p(θ|η) / p(y|η)] p(η|y) dη
and the term in the integral can in turn be expressed as
p(η|y) = ∫ p(η|θ) p(θ|y) dθ
These suggest an iterative scheme, qualitatively similar in
structure to a Gibbs sampler, to evolve successively improved
approximations to p(θ|y) and p(η|y).
269 / 352
Empirical Bayes methods
Calculation
1 Calculate an initial approximation to p(θ|y), ignoring the η dependence completely.
2 Calculate an approximation to p(η|y) based upon the initial approximate distribution of p(θ|y).
3 Use this p(η|y) to update the approximation for p(θ|y); then update p(η|y); and so on.
When the true distribution p(η|y) is sharply peaked, the integral determining p(θ|y) may be not much changed by replacing the probability distribution over η with a point estimate η* representing the distribution's peak (or, alternatively, its mean),
p(θ|y) ≃ p(y|θ) p(θ|η*) / p(y|η*)
270 / 352
Point estimation
1
Non-parametric empirical Bayes
Suppose Yi|θi ∼ Poisson(θi), where θ has a cdf G(θ). That is,
f(yi|θi) = e^{−θi} θi^{yi} / yi!,   yi = 0, 1, 2, . . .
Under squared error loss (SEL), the conditional expectation E(θi|Yi = yi) is a reasonable quantity to use for prediction. For the Poisson compound sampling model, this quantity is
E(θi|yi) = ∫ (θ^{yi+1} e^{−θ}/yi!) dG(θ) / ∫ (θ^{yi} e^{−θ}/yi!) dG(θ) = (yi + 1) pG(yi + 1) / pG(yi)
where pG is the marginal distribution obtained by integrating out θ over G.
271 / 352
Point estimation
1
Non-parametric empirical Bayes
To take advantage of this, Robbins (1956) suggested estimating the marginals with their empirical frequencies, yielding the fully non-parametric estimate
E(θi|yi) ≈ (yi + 1) × #{ys equal to yi + 1} / #{ys equal to yi}
Since the estimate of the marginal density depends on all of the yi ,
the estimate for each component θi is influenced by data from
other components.
The foregoing is one of several models for which the analyst can
make empirical Bayes inferences with no knowledge of or
assumptions concerning G.
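A minimal sketch of Robbins' estimate, assuming a hypothetical vector y of Poisson counts (simulated here purely for illustration):

# Robbins' nonparametric EB estimate: (y+1) * #(ys equal to y+1) / #(ys equal to y)
robbins <- function(y) {
  freq <- table(factor(y, levels = 0:(max(y) + 1)))   # empirical frequencies p_G(k)
  (y + 1) * as.numeric(freq[as.character(y + 1)]) / as.numeric(freq[as.character(y)])
}
set.seed(1)
y <- rpois(200, lambda = rgamma(200, shape = 2, rate = 1))  # counts from a mixed Poisson
head(cbind(y, robbins(y)))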
272 / 352
Point estimation
2
Parametric empirical Bayes
If the likelihood and its prior take on simple parametric forms,
then the empirical Bayes problem is only to estimate the marginal
m(y|η) and the hyperparameters η using the complete set of
empirical measurements.
One common approach, called parametric empirical Bayes point
estimation, is to approximate the marginal using the maximum
likelihood estimate (MLE).
This simplified marginal allows one to plug the empirical averages into a point estimate for θ.
273 / 352
Point estimation
2
Parametric empirical Bayes
There are several common parametric empirical Bayes models,
including the Poisson-Gamma model (below), the Beta-binomial
model, the Gaussian-Gaussian model, the Dirichlet-multinomial
model, as well as specific models for Bayesian linear regression and
Bayesian multivariate linear regression. More advanced approaches
include hierarchical Bayes models and Bayesian mixture models.
✑ Poisson-Gamma model
Let the likelihood be a Poisson distribution in the previous
example, and let the prior now be specified by the conjugate prior,
which is a Gamma distribution (G(α, β)) (where η = (α, β) ):
p(θ|α, β) = [β^α / Γ(α)] θ^{α−1} e^{−βθ},   for θ > 0, α > 0, β > 0.
274 / 352
Point estimation
2
Parametric empirical Bayes
✑ Poisson-Gamma model
It is straightforward to show the posterior is also a Gamma
distribution. Write
p(θ|y) ∝ p(y|θ)p(θ|α, β),
where the marginal distribution has been omitted since it does not
depend explicitly on θ.
Expanding terms which do depend on θ gives the posterior as:
p(θ|y) ∝ θy e−θ × θα−1 e−βθ
= θy+α−1 e−θ(1+β)
So the posterior density is also a Gamma distribution G(α∗ , β ∗ ),
where α∗ = y + α, and β ∗ = 1 + β.
275 / 352
Point estimation
2
Parametric empirical Bayes
✑ Poisson-Gamma model
To apply empirical Bayes, we will approximate the marginal using
the MLE. But since the posterior is a Gamma distribution, the
MLE is just the mean of the posterior, which is the point estimate
E(θ|y) we need. We have,
E(θ|y) = α*/β* = (ȳ + α)/(1 + β) = [1/(1 + β)] ȳ + [β/(1 + β)] (α/β)
The resulting point estimate E(θ|y) is therefore a weighted average of the sample mean ȳ and the prior mean α/β.
This turns out to be a general feature of empirical Bayes: the point estimates for the prior mean will look like a weighted average of the sample estimate and the prior estimate (likewise for estimates of the variance).
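A minimal numerical sketch of this weighted-average form, with hypothetical counts y and hypothetical hyperparameter values α and β:

y <- c(3, 5, 2, 4, 6)        # hypothetical counts
alpha <- 2; beta <- 0.5      # hypothetical hyperparameter estimates
ybar <- mean(y)
(ybar + alpha)/(1 + beta)                            # E(theta | y)
ybar/(1 + beta) + (beta/(1 + beta))*(alpha/beta)     # same value, weighted-average form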
276 / 352
Empirical Bayes performance
The performance of empirical Bayes estimators relative to other estimators is quite impressive [Carlin and Louis, 2008].
They are superior to frequentist estimators when there is a high degree of similarity among the θi's, and no worse otherwise.
In the last two sections, we see that procedures derived using Bayesian machinery and vague priors typically enjoy good frequentist and empirical Bayes properties.
Empirical Bayes methodology incorporates the Bayesian approach but rejects the unyielding subjectivism of the era's leading Bayesian thinkers.
277 / 352
Predictive Distributions
✑ Suppose an experiment with y ≡ (y1, . . . , yn) being observed with model p(y|θ) and prior π(θ); the posterior is p(θ|y) = p(y|θ)π(θ) / ∫ p(y|θ)π(θ)dθ.
Predict a future observation z from this population:
p(z|y) = p(z, y)/p(y)
       = ∫ p(z, y|θ) π(θ) dθ / ∫ p(y|θ) π(θ) dθ
       = ∫ p(z|θ) p(y|θ) π(θ) dθ / ∫ p(y|θ) π(θ) dθ
       = ∫ p(z|θ) p(θ|y) dθ
       = E_{θ|y}[ p(z|θ) ]
278 / 352
Predictive Distributions
✑ Suppose y1, . . . , yn are i.i.d. Binomial(1, θ) and θ ∼ Beta(α, β). Thus
p(θ|y) ∝ θ^{Σyi+α−1} (1 − θ)^{n−Σyi+β−1}.
Hence by kernel recognition, θ|y ∼ Beta(Σyi + α, n − Σyi + β). Then
p(z|y) = ∫_0^1 θ^{z+Σyi+α−1} (1 − θ)^{n+1−Σyi+β−z−1} dθ × Γ(n + α + β) / [Γ(Σyi + α) Γ(n − Σyi + β)]
       = Γ(n + α + β) / [Γ(Σyi + α) Γ(n − Σyi + β)] × Γ(z + Σyi + α) Γ(n + 1 − Σyi + β − z) / Γ(n + 1 + α + β)
p(z = 1|y) = (Σyi + α)/(n + α + β);   p(z = 0|y) = (n − Σyi + β)/(n + α + β).
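A minimal sketch checking the closed form for p(z = 1|y), with a hypothetical 0/1 vector y and a uniform Beta(1, 1) prior:

y <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)   # hypothetical Bernoulli data
alpha <- 1; beta <- 1                  # uniform prior
n <- length(y)
(sum(y) + alpha)/(n + alpha + beta)    # posterior predictive P(z = 1 | y)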
279 / 352
Bayesian Sample Size Computations
Lecture 18
Sharif Mahmood
280 / 352
Review of classical sample size calculations
Suppose we want to take a sample y1, . . . , yn ∼ N(θ, σ²), so ȳ ∼ N(θ, σ²/n). Assume σ² is known; n is not known.
Consider the classical hypothesis testing problem: H0 : θ = θ0
against the alternative H1 : θ = θ1 > θ0
Decision rule: Reject H0 if ȳ > θ0 + (σ/√n) z1−α, where Φ(zα) = α and Φ(·) is the standard normal cdf.
281 / 352
Review of classical sample size calculations
Requiring the procedure to have a power of at least 1 − β, we have:
1 − β ≤ P(Rej H0 | H1) = P( ȳ > θ0 + (σ/√n) z1−α | θ = θ1 )
      = P( (√n/σ)(ȳ − θ1) > (√n/σ)(θ0 − θ1) + z1−α )
      = P( Z > −√n Δ/σ + z1−α )
      = 1 − Φ( −√n Δ/σ + z1−α )
      = Φ( √n Δ/σ − z1−α ) = Φ( √n Δ/σ + zα )
where Δ = θ1 − θ0 and, in the last steps, we have used the facts that Φ(−x) = 1 − Φ(x) and z1−α = −zα.
282 / 352
Review of classical sample size calculations
The preceding computations lead to
√n Δ/σ + zα ≥ z1−β
⇒ √n Δ/σ ≥ z1−β − zα = −(zα + zβ)
⇒ n ≥ (zα + zβ)² (σ/Δ)²
Thus, we arrive at the ubiquitous sample size formula:
n = (zα + zβ)² (σ/Δ)²
For two-sided alternatives, one simply replaces α by α/2 in the
above expression.
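A minimal numerical sketch of this formula with illustrative values of α, β, σ and Δ, using qnorm(α) = zα under the convention Φ(zα) = α:

alpha <- 0.05; beta <- 0.20; sigma <- 1; Delta <- 0.5
n <- (qnorm(alpha) + qnorm(beta))^2 * (sigma/Delta)^2
ceiling(n)   # about 25 observations here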
283 / 352
Bayesian sample size calculations
Suppose we take the prior θ ∼ N(θ1, τ²). We let τ² = σ²/n0, where n0 (prior sample size!) reflects the precision of the prior relative to the data. This simplifies the calculations:
N(θ|θ1, σ²/n0) × N(ȳ|θ, σ²/n) = N( θ | [n0/(n + n0)] θ1 + [n/(n + n0)] ȳ, σ²/(n + n0) )
Let Aα(θ0, θ1) = { ȳ : P(θ < θ0|ȳ) < α }. Note:
P(θ < θ0|ȳ) = P( (√(n + n0)/σ)[ θ − (n0θ1 + nȳ)/(n + n0) ] < (√(n + n0)/σ)[ θ0 − (n0θ1 + nȳ)/(n + n0) ] )
            = Φ( (√(n + n0)/σ)[ θ0 − (n0θ1 + nȳ)/(n + n0) ] )
284 / 352
Bayesian sample size calculations
Thus,
Aα(θ0, θ1) = { ȳ : (√(n + n0)/σ)[ θ0 − (n0θ1 + nȳ)/(n + n0) ] < zα }
           = { ȳ : θ0 − (n0θ1 + nȳ)/(n + n0) < (σ/√(n + n0)) zα }
           = { ȳ : θ0 − [n0/(n + n0)] θ1 − [n/(n + n0)] ȳ < (σ/√(n + n0)) zα }
           = { ȳ : ȳ > θ0 − (n0/n)(θ1 − θ0) − (σ/√n) √(1 + n0/n) zα }
Note that as n0 → 0 (i.e. the prior becomes vague), Aα(θ0, θ1) becomes identical to the critical region from classical hypothesis testing.
285 / 352
Bayesian sample size calculations
The Bayesian power or Bayesian assurance δ is defined as:
δ = Pȳ( Aα(θ0, θ1) )
  = Pȳ{ ȳ : P(θ < θ0|ȳ) < α }
  = Pȳ{ ȳ : P(θ ≥ θ0|ȳ) > 1 − α }
  = Pȳ{ ȳ > θ0 − (n0/n)(θ1 − θ0) − (σ/√n) √(1 + n0/n) zα }
286 / 352
Bayesian sample size calculations
The marginal distribution of ȳ is given by:
∫ N(θ|θ1, σ²/n0) × N(ȳ|θ, σ²/n) dθ = N( ȳ | θ1, σ²/(n + n0) )
Therefore, the Bayesian power or assurance is:
Pȳ{ ȳ > θ0 − (n0/n)(θ1 − θ0) − (σ/√n) √(1 + n0/n) zα }
  = Pȳ{ ȳ − θ1 > −(1 + n0/n)(θ1 − θ0) − (σ √(n + n0)/n) zα }
  = P( Z > −√(n + n0) (1 + n0/n) (θ1 − θ0)/σ − (1 + n0/n) zα )
  = Φ( √(n + n0) (1 + n0/n) (θ1 − θ0)/σ + (1 + n0/n) zα )
287 / 352
Bayesian Assurance Curve
Rewriting the Bayesian power in terms of the relative precision
n0 /n and n, we obtain:
δ = Φ( √n (1 + n0/n)^{3/2} Δ/σ + (1 + n0/n) zα )
where Δ = θ1 − θ0. This is often called the critical difference that needs to be detected.
This leads to:
The Bayesian Power (Assurance) Curve:
δ = Φ( √n (1 + n0/n)^{3/2} Δ/σ + (1 + n0/n) zα )
Note: For study design purposes, σ² and n0 are assumed known and the Bayesian power curve is investigated as a function of Δ and n.
288 / 352
Concluding remarks
Given n0 , the Bayesian will compute the sample size needed to
detect a critical difference of ∆ with probability 1 − β as
n = arg min{n : δ(∆, n) ≥ 1 − β}
As the prior becomes vague, n0 → 0 and:
lim_{n0→0} δ(Δ, n) = Φ( √n Δ/σ + zα )
which is exactly the classical power curve. Now the Bayesian sample size formula coincides with the classical sample size formula:
Bayesian sample size with vague prior information:
n = (zα + zβ)² (σ/Δ)²
For two-sided alternatives, one simply replaces α by α/2 in the above expression.
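A minimal sketch of the assurance curve and the resulting minimum sample size, assuming the formula on the previous slides and illustrative values of Δ, σ, n0, α and 1 − β:

assurance <- function(n, n0, Delta, sigma, alpha = 0.05) {
  pnorm(sqrt(n)*(1 + n0/n)^(3/2)*Delta/sigma + (1 + n0/n)*qnorm(alpha))
}
Delta <- 0.5; sigma <- 1; n0 <- 10
n <- 1:200
delta_n <- assurance(n, n0, Delta, sigma)
min(n[delta_n >= 0.80])   # smallest n with assurance at least 1 - beta = 0.80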
289 / 352
Bayesian Methods for Linear Models
Lecture 19
Sharif Mahmood
290 / 352
The Linear Models
The linear model can be written as
Y = Xβ + e,
where Y is n × 1, X is n × p, β is p × 1 and e ∼ N (0, σ 2 I)
Let M = X(X′X)^{−}X′ and τ = 1/σ², where '−' denotes the generalized inverse.
The UMVUE of µ = E(Y ) = Xβ is M Y
Derive posterior distributions of β and τ .
291 / 352
The N IG conjugate prior family
A popular Bayesian model builds upon the linear regression of y
using conjugate priors by specifying
p(β, σ²) = p(β|σ²) p(σ²)
         = N(µβ, σ²Vβ) × IG(a, b)
         = [ b^a / ((2π)^{p/2} |Vβ|^{1/2} Γ(a)) ] (1/σ²)^{a+p/2+1} exp{ −(1/σ²)[ b + ½(β − µβ)′Vβ^{−1}(β − µβ) ] }
         ∝ (1/σ²)^{a+p/2+1} exp{ −(1/σ²)[ b + ½(β − µβ)′Vβ^{−1}(β − µβ) ] }
The N IG probability distribution is a joint probability
distribution of a vector β and a scalar σ 2 .
292 / 352
The N IG conjugate prior family
If (β, σ²) ∼ NIG(µ, V, a, b), then an interesting analytic form results from integrating out σ² from the joint density:
∫ NIG(µ, V, a, b) dσ²
  = [ b^a / ((2π)^{p/2} |V|^{1/2} Γ(a)) ] ∫ (1/σ²)^{a+p/2+1} exp{ −(1/σ²)[ b + ½(β − µ)′V^{−1}(β − µ) ] } dσ²
  = [ b^a / ((2π)^{p/2} |V|^{1/2} Γ(a)) ] × Γ(a + p/2) / [ b + ½(β − µ)′V^{−1}(β − µ) ]^{a+p/2}
  = Γ(a + p/2) / [ π^{p/2} |(2a)(b/a)V|^{1/2} Γ(a) ] × [ 1 + (β − µ)′[(b/a)V]^{−1}(β − µ)/(2a) ]^{−(2a+p)/2}
This is a multivariate t density with ν = 2a and Σ = (b/a)V.
293 / 352
The likelihood
The likelihood for the model is defined, up to proportionality, as
the joint probability of observing the data given the parameters.
Since X is fixed, the likelihood is given by
p(y|β, σ²) = N(Xβ, σ²I) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) (y − Xβ)′(y − Xβ) }
294 / 352
The posterior distribution from the N IG prior
Inference will proceed from the posterior distribution
p(β, σ²|y) = p(β, σ²) p(y|β, σ²) / p(y)
where p(y) = ∫ p(β, σ²) p(y|β, σ²) dβ dσ² is the marginal distribution of the data.
The key to deriving the joint posterior distribution is the following
easily verified multivariate completion of squares or ellipsoidal
rectification identity:
u′ Au − 2α′ u = (u − A−1 α)′ A(u − A−1 α) − α′ A−1 α
where A is a symmetric positive definite (hence invertible) matrix.
295 / 352
The posterior distribution from the N IG prior
An application of this identity immediately reveals
−(1/σ²)[ b + ½{ (β − µβ)′Vβ^{−1}(β − µβ) + (y − Xβ)′(y − Xβ) } ] = −(1/σ²)[ b* + ½(β − µ*)′V*^{−1}(β − µ*) ]
using which we can write the posterior as
p(β, σ²|y) ∝ (1/σ²)^{a+(n+p)/2+1} exp{ −(1/σ²)[ b* + ½(β − µ*)′V*^{−1}(β − µ*) ] }
where
µ* = (Vβ^{−1} + X′X)^{−1}(Vβ^{−1}µβ + X′y)
V* = (Vβ^{−1} + X′X)^{−1}
a* = a + n/2
b* = b + ½[ µβ′Vβ^{−1}µβ + y′y − µ*′V*^{−1}µ* ]
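A minimal sketch of these conjugate updates, with simulated data and illustrative hyperparameter values:

set.seed(1)
n <- 50; p <- 2
X <- cbind(1, rnorm(n))
y <- X %*% c(1, 2) + rnorm(n, sd = 2)
mu_b <- rep(0, p); V_b <- diag(100, p)   # prior beta | sigma^2 ~ N(mu_b, sigma^2 V_b)
a <- 2; b <- 2                           # prior sigma^2 ~ IG(a, b)
V_star  <- solve(solve(V_b) + crossprod(X))
mu_star <- V_star %*% (solve(V_b) %*% mu_b + crossprod(X, y))
a_star  <- a + n/2
b_star  <- b + 0.5*drop(t(mu_b) %*% solve(V_b) %*% mu_b + crossprod(y) -
                        t(mu_star) %*% solve(V_star) %*% mu_star)
mu_star; a_star; b_star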
296 / 352
Bayesian Analysis of Linear Models
Theorem
Suppose τ is known, X is of full rank p, and π(β) ∝ 1. Then
β|y ∼ Np (β̂, τ −1 (X ′ X)−1 ), where β̂ = (X ′ X)−1 X ′ Y .
Proof.
p(β|y, τ) ∝ exp{ −(τ/2)(Y − Xβ)′(Y − Xβ) }
          = exp{ −(τ/2)[ Y′(I − M)Y + (β − β̂)′X′X(β − β̂) ] }
          ∝ exp{ −(τ/2)(β − β̂)′X′X(β − β̂) }
We recognize that this is a normal kernel with mean β̂ and covariance
matrix τ −1 (X ′ X)−1 . Thus β|y, τ ∼ Np (β̂, τ −1 (X ′ X)−1 ).
297 / 352
Jeffreys’s Prior
Theorem
When τ is known, Jeffreys’s prior for β is a uniform prior, i.e.,
π(β) ∝ 1.
Proof.
We have
log[p(y|β, τ)] = −(n/2) log(2π) + (n/2) log τ − (τ/2)(Y − Xβ)′(Y − Xβ)
∂ log[p(y|β, τ)]/∂β = τ X′Y − τ(X′X)β
∂² log[p(y|β, τ)]/∂β∂β′ = −τ(X′X)
∂² log[p(y|β, τ)]/∂τ² = −n/(2τ²)
∂² log[p(y|β, τ)]/∂β∂τ = X′Y − (X′X)β
Since the information for β, I(β) = τ(X′X), does not depend on β, π(β) ∝ |I(β)|^{1/2} ∝ 1.
298 / 352
Jeffreys’s Prior
Theorem
When both β and τ are unknown, Jeffreys's joint prior for (β, τ) is π(β, τ) ∝ τ^{p/2−1}.
Proof.
−E[ ∂² log[p(y|β, τ)]/∂β∂τ ] = −E[ X′Y − (X′X)β ] = −(X′X)β + (X′X)β = 0
Thus, I(β, τ) is block-diagonal with blocks τ(X′X) and n/(2τ²).
Now |I(β, τ)| = |τ(X′X)| × n/(2τ²) = τ^p |X′X| n/(2τ²) ∝ τ^{p−2}.
Thus, π(β, τ) ∝ |I(β, τ)|^{1/2} = τ^{(p−2)/2} = τ^{p/2−1}.
Note that with p = 1, π(β, τ) ∝ τ^{−1/2}.
299 / 352
Bayes Factor computation for Linear Models
Theorem
Consider the linear model Y = Xβ + e, where e ∼ Nn(0, σ²I) with (β, τ) unknown, and suppose β|τ ∼ Np(µ0, τ^{−1}Σ0) and τ ∼ G(δ0/2, γ0/2). Then the marginal distribution of y is given by y ∼ Sn(δ0, Xµ0, (γ0/δ0)(I + XΣ0X′)) and therefore,
p(y|m) = [ Γ((n + δ0)/2) / Γ(δ0/2) ] (πδ0)^{−n/2} (γ0/δ0)^{−n/2} |I + XΣ0X′|^{−1/2}
         × [ 1 + (1/δ0)(Y − Xµ0)′(I + XΣ0X′)^{−1}(Y − Xµ0) ]^{−(n+δ0)/2}.
300 / 352
Bayes Factor computation for Linear Models
Given a model m ∈ M, we have
p(y|m) = [ Γ((n + δ0^(m))/2) / Γ(δ0^(m)/2) ] (πδ0^(m))^{−n/2} (γ0^(m)/δ0^(m))^{−n/2} |I + X^(m)Σ0^(m)X^(m)′|^{−1/2}
         × [ 1 + (1/δ0^(m))(Y − X^(m)µ0^(m))′(I + X^(m)Σ0^(m)X^(m)′)^{−1}(Y − X^(m)µ0^(m)) ]^{−(n+δ0^(m))/2}
where µ0^(m), Σ0^(m), δ0^(m), γ0^(m) are hyper-parameter values under model m. The Bayes factor for comparing model 1 (m1) to model 2 (m2) is BF = p(y|m1)/p(y|m2).
301 / 352
Test of Hypothesis using Bayes factor
Theorem
If X ∼ Sp(ν, µ, Σ), where X = (X1, . . . , Xp), then marginal distributions of X are also multivariate t. Thus write X = (X^(1), X^(2)), where X^(1) is r × 1 and X^(2) is (p − r) × 1, with µ = (µ1, µ2) and Σ = [ Σ11 Σ12 ; Σ21 Σ22 ] partitioned conformably. Then X^(1) ∼ Sr(ν, µ1, Σ11).
302 / 352
Test of Hypothesis using Bayes factor
Suppose we wish to test H0 : β1 = 0 versus H1 : β1 ≠ 0
Under H0, β0|τ ∼ N(0, 9), τ ∼ G(δ0/2, γ0/2), δ0 = γ0 = 1
Under H1, (β0, β1)′|τ ∼ N2( (0, 0)′, [ 9 .5 ; .5 9 ] ), τ ∼ G(δ0/2, γ0/2), δ0 = γ0 = 1
Let
m1 ≡ model under H0: (β0), τ
m2 ≡ model under H1: (β0, β1), τ
303 / 352
Computation in R
A common prediction problem in medical practice is prognosis
Prognosis is a forecasting of probable course and outcome of a
disease, in particular chances of recovery
Ages and survival times of 20 patients are given
The objective is to check if there is any relationship between age
and survival time
BF = 0.03963679, which favors H1, with 1/BF = 25.23. This is strong evidence against H0, i.e., evidence that β1 is not equal to zero.
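A minimal sketch of how such a Bayes factor can be computed from the Student-t marginal on the previous slides; the data objects and design matrices in the commented usage line are placeholders:

log_marg <- function(y, X, mu0, Sigma0, delta0 = 1, gamma0 = 1) {
  n <- length(y)
  V <- diag(n) + X %*% Sigma0 %*% t(X)
  r <- y - X %*% mu0
  lgamma((n + delta0)/2) - lgamma(delta0/2) - (n/2)*log(pi*delta0) -
    (n/2)*log(gamma0/delta0) - 0.5*as.numeric(determinant(V)$modulus) -
    ((n + delta0)/2)*log(1 + drop(t(r) %*% solve(V) %*% r)/delta0)
}
# e.g. BF comparing m1 (intercept only) to m2 (intercept + age):
# BF <- exp(log_marg(y, X1, 0, matrix(9, 1, 1)) -
#           log_marg(y, X2, c(0, 0), matrix(c(9, .5, .5, 9), 2, 2)))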
304 / 352
Fitting Bayesian Regression Model with WinBugs
Soft drink delivery times (Montgomery and Peck, 1992)
A quality assurance study is set up to estimate the required time
needed by each employee to refill an automatic vending machine
The response variable is the total service time (measured in
minutes) of each machine
Two predictor variables identified are: number of cases of stocked
products and the distance walked by the employee (measured in
feet)
A dataset of 25 observations was collected
305 / 352
Results
The posterior mean of R2 is equal to 0.95 indicating considerable
improvement in the prediction of delivery times with the
covariates Cases and Distance
Considering the posterior means as point estimates we have the
prediction model as, Expected time = 2.36 + 1.6 × Cases + 0.015
× Distance
Both explanatory variables are important in predicting delivery time; neither 95% CI includes 0
For each additional case stocked by the employee, the expected
delivery time is a posteriori expected to increase by 1.6 minutes
The increase in expected delivery time for each additional case lies
in (1.3 − 2.0) with probability 95%
The delivery time is a posteriori expected to increase by 1.5
minutes for every 100 feet of additional walking distance
The increase in expected delivery time for every 100 feet of
additional walking distance lies in (.7 − 2.2) with probability 95%.
306 / 352
Results(contd.)
The interpretation of β0 is not meaningful, since a zero value is nonsensical for both covariates
The delivery employee will always have to stock some cases of products in the machine and will walk at least a small distance to reach the delivery location
Consider a typical or representative delivery route
A typical delivery route will take 22.4 minutes on average and will range from 21.1 to 23.8 minutes with probability 95%.
307 / 352
Bayesian Methods for Generalized Linear Models
Lecture 20
Sharif Mahmood
308 / 352
Review the assumptions for LM and GLM
Model: Y = E(Y) + e
                        LM                                    GLM
Random component:       Yi ∼ N(µ, σ²) and iid, σ a constant   Yi ∼ exponential family and iid
Systematic component:   η = Xβ                                η = Xβ
Link function:          E(Y) ≡ µ = η                          E(Y) ≡ µ = g^{−1}(η)
where Y is an n-vector, X is an n × p matrix and β is a p-vector.
309 / 352
The exponential family of distributions
A distribution belongs to the exponential family if its pdf, or pmf,
can be written as
f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }
Here θ is the canonical parameter and φ is a scale parameter.
a(φ) is positive and continuous, b(θ) is twice differentiable with a positive second derivative, and c(y, φ) is independent of the parameter θ.
Generalized linear models are a family of probability models whose random component belongs to the exponential family of distributions.
310 / 352
Normal Distribution
f(y; µ, σ²) = (1/√(2πσ²)) exp{ −(y − µ)²/(2σ²) }
            = exp{ −½ ln(2πσ²) } exp{ −(y − µ)²/(2σ²) }
            = exp{ −(y² − 2yµ + µ²)/(2σ²) − ½ ln(2πσ²) }
            = exp{ (yµ − µ²/2)/σ² − ½[ y²/σ² + ln(2πσ²) ] }
is an exponential family distribution with θ = µ and φ = σ 2 .
Thus, θ is the ’location’ parameter (mean) and φ the ’scale’
parameter.
311 / 352
Bernoulli Distribution
f(y; π) = π^y (1 − π)^{1−y}
where π = Pr(Y = 1). Now
f(y; π) = π^y (1 − π)^{1−y}
        = exp{ y ln π + (1 − y) ln(1 − π) }
        = exp{ y ln[π/(1 − π)] + ln(1 − π) }
Thus θ = ln[π/(1 − π)] = logit(π) and φ = 1.
312 / 352
Poisson Distribution
f(y; λ) = e^{−λ} λ^y / y!
        = exp{ y ln λ − λ − ln(y!) }
Thus θ = ln(λ) and φ = 1.
313 / 352
Examples of the exponential family
                 θ                φ      a(φ)   b(θ)           c(y, φ)
Normal           µ                σ²     φ      θ²/2           −½[ y²/φ + ln(2πφ) ]
Poisson          ln(µ)            1      φ      exp(θ)         −ln(y!)
Binomial         ln(µ/(n − µ))    1      φ      ln(1 + e^θ)    ln C(n, y)
Gamma            −1/µ             1/ν    φ      −ln(−θ)        ν ln(νy) − ln y − ln Γ(ν)
Inv Gaussian     −1/(2µ²)         σ²     φ      −√(−2θ)        −½[ ln(2πφy³) + 1/(φy) ]
314 / 352
Properties of the exponential family
The representation of the exponential family distributions makes it easy to see that the log-likelihood is
log L(y; θ, φ) = [yθ − b(θ)]/a(φ) + c(y, φ)
Using calculus, various properties of the exponential family distributions can be derived. In particular, it can be shown that
E(Y) = b′(θ)   and   Var(Y) = b″(θ) a(φ)
where b′(θ) = ∂b(θ)/∂θ and b″(θ) = ∂²b(θ)/∂θ².
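As a quick worked example using the table above: for the Poisson distribution, b(θ) = exp(θ) and a(φ) = 1, so E(Y) = b′(θ) = exp(θ) = λ and Var(Y) = b″(θ) a(φ) = exp(θ) = λ, recovering the familiar Poisson mean and variance.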
315 / 352
The systematic component
Given covariates X1 , . . . , Xk , the effect of the covariates on the
expected value of Y is expressed through the linear predictor
η = β0 + Σ_{i=1}^{k} βi Xi = Xβ
The marginal expectation, E(Y ), of each response is then
modelled as a function of covariates.
316 / 352
The Link function
The link function, g(.), describes the relation between the linear
predictor, η, and the expected value of Y (denoted by µ),
g(µ) = η = β0 + Σ_{i=1}^{k} βi Xi
g(µ) is a link function if the domain of g is the parameter space of µ and the range of g is an interval in (−∞, ∞)
g(µ) is strictly monotone and is differentiable everywhere in the domain
For instance, if µ ∈ (0, ∞) then η = ln µ is a link function
If µ ∈ (0, 1) then η = ln(µ/(1 − µ)), the logit link.
317 / 352
318 / 352
Prior specification
The prior distributions on the parameters β are specified given
the dispersion parameter φ
Usually a prior distribution of the type βj|φ ∼ N(µβj, σ²βj φ) is considered
In linear model, φ = σ 2 ∼ IG(a, b) giving conjugate prior
distribution
This prior assumes prior independence between model parameters
In the case of collinear covariates a multivariate normal model
with βj |φ ∼ N (µβ , Σβ ) would be more appropriate
An extension to this prior, setting Σβ = c²[−H(β̂)]^{−1}, can be considered
Here β̂ is the maximum likelihood estimate and H(β) is the
second derivative matrix of log f (y|β, φ)
In GLM we get H(β) = X′HX, where H is an n × n diagonal matrix with elements
hi = (∂µi/∂ηi) × 1/[ai(φ) b″(θ)]
319 / 352
Posterior distribution of a GLM
The likelihood function of a GLM is given by
f(y|β, φ) = exp{ Σ_{i=1}^{n} [yi gθ^{−1}(Xiβ) − b(gθ^{−1}(Xiβ))]/a(φ) + Σ_{i=1}^{n} c(yi, φ) }
Using the multivariate normal prior, the posterior can be written as
f(β, φ|y) = exp{ Σ_{i=1}^{n} [yi gθ^{−1}(Xiβ) − b(gθ^{−1}(Xiβ))]/a(φ) + Σ_{i=1}^{n} c(yi, φ) }
            × exp{ −½ log|Σβ| − ½(β − µβ)′Σβ^{−1}(β − µβ) }
          ∝ exp{ Σ_{i=1}^{n} [yi gθ^{−1}(Xiβ) − b(gθ^{−1}(Xiβ))]/a(φ) − ½(β − µβ)′Σβ^{−1}(β − µβ) }
This does not have a closed form, so MCMC methods are used to evaluate posterior summaries
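As an illustration of such an MCMC evaluation, here is a minimal random-walk Metropolis sketch for a Bayesian logistic regression with a N(0, 100I) prior on β; the data, prior variance and proposal scale are all illustrative choices:

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))
log_post <- function(beta) {
  eta <- X %*% beta
  sum(y*eta - log(1 + exp(eta))) - sum(beta^2)/(2*100)   # log-likelihood + log-prior
}
beta <- c(0, 0); draws <- matrix(NA, 5000, 2)
for (s in 1:5000) {
  prop <- beta + rnorm(2, 0, 0.25)                        # random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(beta)) beta <- prop
  draws[s, ] <- beta
}
colMeans(draws[-(1:1000), ])   # posterior means after discarding burn-in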
320 / 352
Statistical Analysis with Missing Data
Lecture 21
Sharif Mahmood
321 / 352
Missing Data and Dropout
Missing data arise whenever one or more of the measurements are
incomplete, in the sense that some intended measurements are not
obtained.
Let Y denote the complete response vector which can be
partitioned into two sub-vectors:
1 Y(o), the measurements observed
2 Y(m), the measurements that are missing
If there were no missing data, we would have observed the
complete response vector Y .
Instead, we get to observe Y(o) .
322 / 352
Problem of Missing Data
The main problem that arises with missing data is that the
distribution of the observed data may not be the same as the
distribution of the complete data.
Consider the following simple illustration:
Suppose we intend to measure subjects at 6 months (Y1 ) and 12
months (Y2 ) post treatment. All of the subjects return for
measurement at 6 months, but many do not return at 12 months.
If subjects fail to return for measurement at 12 months because
they are not well (say, values of Y2 are low), then the distribution
of observed Y2 ’s will be positively skewed compared to the
distribution of Y2 ’s in the population of interest.
In general, the situation may often be quite complex, with some
missingness unrelated to either the observed or unobserved
response, some related to the observed, some related to the
unobserved, and some to both.
323 / 352
Reasons for Missing
A particular pattern of missingness that is common in longitudinal
studies is ‘dropout’ or ‘attrition’. This is where an individual is
observed from baseline up until a certain point in time, thereafter
no more measurements are made.
Possible reasons for dropout:
1 Recovery
2 Lack of improvement or failure
3 Undesirable side effects
4 External reasons unrelated to specific treatment or outcome
5 Death
324 / 352
Examples
In clinical trials, missing data can arise from a variety of
circumstances:
a. Late entrants: If the study has staggered entry, at any interim
analysis some individuals may have only partial response data.
Usually, this sort of missing data does not introduce any bias.
b. Dropout: Individuals may drop out of a clinical trial because of
side effects or lack of efficacy. Usually, this type of missing data is
of concern, especially if dropout is due to lack of efficacy. Dropout
due to lack of efficacy suggests that those who drop out come from
the lower end of the spectrum. Dropout due to side effects may or
may not be a problem, depending upon the relationship between
side effects and the outcome of interest.
325 / 352
Missing Data Mechanisms
In order to obtain valid inferences from incomplete data the
mechanism (probability model) producing the missing observations
must be considered.
A hierarchy of three different types of missing data mechanisms
can be distinguished:
1 Data are missing completely at random (MCAR) when the probability that an individual value will be missing is independent of Y(o) and Y(m).
2 Data are missing at random (MAR) when the probability that an individual value will be missing is independent of Y(m) (but may depend on Y(o)).
3 Missing data are nonignorable when the probability that an individual value will be missing depends on Y(m).
Note: Under assumptions (1) and (2), the missing data mechanism
is often referred to as being ‘ignorable’.
326 / 352
Methods of Handling Missing Data
1 Complete Case Methods: These methods omit all cases with missing values at any measurement occasion.
  Drawbacks:
  1 Can result in a very substantial loss of information, which has an impact on precision and power.
  2 Can give severely biased results if complete cases are not a random sample of the population of interest, i.e. complete case methods require the MCAR assumption.
2 All Available Case Methods: This is a general term for a variety of different methods that use the available information to estimate means and covariances (the latter based on all available pairs of cases).
  In general, these methods are more efficient than complete case methods (and can be fully efficient in some cases).
327 / 352
Methods of Handling Missing Data
Drawbacks:
  1 The sample base of cases changes over measurement occasions.
  2 Pairwise available case estimates of correlations can lie outside (−1, 1).
  3 Available case methods require the MCAR assumption.
3 Imputation Methods: These are methods that fill in the missing values. Once imputation is done, the analysis is straightforward.
  1 Stratification: Stratify into homogeneous subsets or classes; impute the mean value of the stratum, or randomly draw a data value from those in the stratum.
  2 Regression imputation: Estimate the response via some appropriately chosen regression model.
328 / 352
Methods of Handling Missing Data
Drawbacks:
  1 Systematically underestimate the variance and covariance.
  2 Treating imputed data as real data leads to standard errors that are too small (multiple imputation addresses this problem).
  3 Their performance can be unreliable and they usually require the MCAR assumption.
  4 Can be fairly ad hoc, e.g. LVCF.
Last Value Carried Forward: Set the response equal to the last observed value (or sometimes the worst observed value).
This method of imputation is only valid under very strong assumptions. In general, LVCF is not recommended!
329 / 352
Methods of Handling Missing Data
4 Likelihood-based Methods: At least in principle, maximum
likelihood estimation for incomplete data is the same as for
complete data and provides valid estimates and standard errors for
more general circumstances than methods (1), (2), or (3).
That is, under clearly stated assumptions likelihood-based
methods have optimal statistical properties.
If missing data are ‘non-ignorable’, likelihood-based inference must
also explicitly (and correctly) model the non-response process.
However, with ‘non-ignorable’ missingness the methods are very
sensitive to unverifiable assumptions.
330 / 352
Methods of Handling Missing Data
5 Weighting Methods: Base estimation on observed data, but
weight the data to account for missing data.
Basic idea: some sub-groups of the population are
under-represented in the observed data, therefore weight these up
to compensate for under-representation.
For example, with dropout, can estimate the weights as a function
of the individual’s covariates and responses up until the time of
dropout.
This approach is valid provided the model for dropout is correct,
i.e. provided the correct weights are available.
331 / 352
Imputation example
ID   Age   Sex   Edu   Race   SCL   ADL   Pain   Comorb
1    70    F     16    W      3.8   4     8      5
2    70    F     16    W      0.6   1     1      1
3    60    M     12    B      1.1   2     3      1
4    85    F     21    W      1.3   2     3      1
5    70    M     21    W      ⋆     2     4      3
How to impute the missing SCL for patient #5?
◮ Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
◮ By age: (3.8 + 0.6)/2 = 2.2
◮ By sex: 1.1
◮ By education: 1.3
◮ By race: (3.8 + 0.6 + 1.3)/3 = 1.9
◮ By ADL: (1.1 + 1.3)/2 = 1.2
Who is/are in the ⋆ with #5?
332 / 352
Imputation - regression
Model:
SCL = β̂0 + β̂1×Age + β̂2×Sex + β̂3×Edu + β̂4×Race + β̂5×ADL + β̂6×Pain + β̂7×Comorb
Fit the model to the observed data:
β̂0 = −0.8, β̂1 = 0.02, β̂2 = −0.5, β̂3 = 0.05, β̂4 = −0.6, β̂5 = 0.1, β̂6 = 0.1, β̂7 = 0.05
Plug in the information of #5 to derive the predicted value:
(−0.8) + (0.02)×70 + (−0.5)×0 + (0.05)×21 + (−0.6)×1 + (0.1)×2 + (0.1)×4 + (0.05)×3 = 1.8 → predicted SCL
Notes:
◮ Predicted values might be out of the natural range of the outcome (e.g. SCL > 4 or SCL < 0)
◮ For ordinal outcomes, the predicted values might not be plausible (e.g. number of people living in the house = 2.7)
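A minimal sketch of the same idea with lm(), assuming a hypothetical data frame dat with the columns above (and more rows than the five-patient illustration, so the regression is estimable); rows with missing SCL are filled in with their predicted values:

fit <- lm(SCL ~ Age + Sex + Edu + Race + ADL + Pain + Comorb,
          data = dat, subset = !is.na(SCL))
miss <- is.na(dat$SCL)
dat$SCL[miss] <- predict(fit, newdata = dat[miss, ])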
333 / 352
Propensity score
Measure the similarity by “the likelihood of being
observed/missing”
Use logistic regression models to estimate this likelihood
Dependent variable:
Z = 1 if a subject's outcome is observed, 0 if a subject's outcome is missing
Independent variables are anything that might be associated with whether the outcome is missing:
◮ Demographic information
◮ Baseline characteristics
334 / 352
Propensity score (cont.)
Model:
◮ p = prob(Z = 1)
◮ log[p/(1 − p)] = β0 + β1X1 + β2X2 + . . . + βkXk
Z has no missing values
X1, . . . , Xk all must have non-missing values
Statistical significance of the β's is not important
The predicted p's derived from the model are the propensity scores
335 / 352
Propensity score - example
Dependent variable:
Y = 12-month SCL score
Z = 1 if Y is observed, 0 if Y is missing
Independent variable:
X1 = age = Age
X2 = sex ( = 1 if male, = 0 if female) = Sex
X3 = number of chronic conditions = NumC
X4 = baseline SCL score = SCL00
Model: log(p/(1 − p)) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4
Result: β0 = 0.31, β1 = 0.003, β2 = −0.58, β3 = −0.25, β4 = 0.25
336 / 352
Propensity score - example
Model:
log[p/(1 − p)] = (0.31) + (0.003)×Age + (−0.58)×Sex + (−0.25)×NumC + (0.25)×SCL00
Derive the propensity scores for subjects A and B:
Subject A: 70-year-old male, 3 chronic conditions, SCL00 = 1.7
(0.31) + (0.003)×70 + (−0.58)×1 + (−0.25)×3 + (0.25)×1.7 = −0.385
⇒ log[p/(1 − p)] = −0.385 → p = 0.405
Subject B: 85-year-old female, 4 chronic conditions, SCL00 = 0.7
(0.31) + (0.003)×85 + (−0.58)×0 + (−0.25)×4 + (0.25)×0.7 = −0.26
⇒ log[p/(1 − p)] = −0.26 → p = 0.435
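A minimal sketch of fitting this propensity model with glm(), assuming a hypothetical data frame dat containing the observed-indicator Z and the covariates named above:

ps_fit <- glm(Z ~ Age + Sex + NumC + SCL00, data = dat, family = binomial)
dat$pscore <- predict(ps_fit, type = "response")   # the predicted p's are the propensity scores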
337 / 352
Imputation - hot-deck
ID   Age   Sex   Edu   Race   SCL   ADL   Pain   Comorb   Propensity
1    70    F     16    W      3.8   4     8      5        0.18
2    70    F     16    W      0.6   1     1      1        0.42
3    60    M     12    B      1.1   2     3      1        0.39
4    85    F     21    W      1.3   2     3      1        0.22
5    70    M     21    W      ⋆     2     4      3        0.41
How to impute the missing SCL for patient #5?
◮ 4 strata → closest to #2 → impute with 0.6
◮ 2 strata → closest to both #2 and #3 → impute with a randomly selected value from (0.6, 1.1)
◮ The method is called “Hot-Deck”; #2 and #3 are called “donors”
Common approach:
◮ Stratify the sample by the propensity scores (e.g. 5 strata)
◮ Randomly select a donor from the same stratum and impute the missing value with the donor's observed value
338 / 352
Imputation
For each missing value, impute m data points
◮ m > 1, usually m = 5
◮ For single imputation, m = 1
What's wrong with single imputation?
◮ Imputed values are derived from the observed sample, and thus the imputed sample is more homogeneous
◮ Variances are under-estimated:
  (10, 20, 30) → mean = 20, variance = 100
  (10, 20, 30, 20, 20, 20) → mean = 20, variance = 40
◮ More likely to yield biased results
Advantage of multiple imputation
◮ Adds the variation across the m data sets “back” to the estimation of variance
◮ Results are less likely to be biased
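A minimal sketch with the mice package, assuming a hypothetical data frame dat with missing values and an illustrative analysis model; pool() combines the m sets of estimates:

library(mice)
imp <- mice(dat, m = 5, seed = 1)        # impute m = 5 completed data sets
fit <- with(imp, lm(y ~ x1 + x2 + x3))   # run the analysis on each completed data set
summary(pool(fit))                       # combine the results (Rubin's rules)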
339 / 352
Assumptions for Multiple Imputation (MI)
The MI procedure assumes that the data are from a continuous
multivariate distribution and contain missing values that can
occur on any of the variables.
The data are from a multivariate normal distribution when either
the regression method or the MCMC method is used.
Suppose D is the n × p matrix of complete data, which is not fully
observed, and denote the observed part of D by Dobs and the
missing part by Dmis .
To be more precise, suppose that M is the n × p matrix of
response indicators whose elements are zero or one depending on
whether the corresponding elements of D are missing or observed.
Furthermore, the MI procedures assume that the parameters θ of
the data model and the parameters φ of the model for the missing
data indicators are distinct.
340 / 352
Missing Data
Suppose we have missing data. Define D as our data matrix and
M as our missingness matrix.
D (⋆ marks missing entries):
y    x1   x2   x3
1    2.5  432  0
5    3.2  543  1
2    1.2  219  1
⋆    1.9  ⋆    1
⋆    1.2  108  0
⋆    7.7  95   1
M (1 = observed, 0 = missing):
y  x1  x2  x3
1  1   1   1
1  1   1   1
1  1   1   1
0  1   0   1
0  1   1   1
0  1   1   1
One non-Bayesian approach to dealing with missing data is
multiple imputation.
341 / 352
Multiple Imputation
We need a method for filling in the missing cells of D as a first
step before we go to the analysis stage.
1 Assume a joint distribution for Dobs and M:
P(Dobs, M|φ, γ) = ∫ p(D|φ) p(M|Dobs, Dmis, γ) dDmis
If we assume MAR, then
P(Dobs, M|φ, γ) ∝ ∫ p(D|φ) dDmis
2 Find φ, which characterizes the full D.
L(φ|Dobs) = Π_{i=1}^{n} ∫ p(Di,obs|φ) dDmis
How do we do this integral?
342 / 352
Multiple Imputation
Assume Normality, since the marginals of a multivariate Normal are Normal:
L(µ, Σ|Dobs) = Π_{i=1}^{n} N(Di,obs|µobs, Σobs)
3 Find µ̂, Σ̂ and their distribution via the EM algorithm and bootstrapping.
4 Draw m values of µ and Σ, then use them to predict values for Dmis.
What we end up with is m datasets with missing values imputed. We then run our regular analyses on the m datasets and combine the results using Rubin's rules.
343 / 352
Bayesian Approach to Missing Data
Bayesian paradigm: Everything unobserved is a random variable.
So we can set up the missing data as a parameter that we need to
find.
p(Dmis, φ, γ|Dobs, M) ∝ p(Dobs|Dmis, φ) p(Dmis|φ) p(M|Dobs, Dmis, γ) p(γ) p(φ)
If we assume MAR:
p(Dmis, φ, γ|Dobs, M) ∝ p(Dobs|Dmis, φ) p(Dmis|φ) p(φ)
Use Gibbs Sampling or M-H to sample both Dmis and φ.
We don't have to assume normality of D to integrate over Dmis.
We can just drop the draws of Dmis.
344 / 352
Bayesian Approach to Missing Data
We can also incorporate both imputation and analyses in the same
model.
p(θ, Dmis, φ, γ|Dobs, M) ∝ p(yobs, ymis|θ) p(Dobs|Dmis, φ) p(Dmis|φ) p(φ) p(θ)
Again, find the posterior via Gibbs Sampling or M-H.
Thus, we can easily set up an application-specific Bayesian model to incorporate missing data.
345 / 352
EM algorithm
Lecture 22
Sharif Mahmood
346 / 352
EM algorithm
This is an algorithm that is used to find the MLE when the data
can be viewed as simpler if additional observations had been made.
The data (both observed and missing/imaginary) are classified as:
Y = Observed data
Z = Missing data
X = (Y,Z) = Complete data
The algorithm has two steps:
(E) Expectation step. Find the expectation of the complete data likelihood, conditional on the observed data. This involves computing the conditional expectation of the missing data. The calculation is made using the current estimates of the parameters.
(M) Maximization step. Find θ to maximize the complete data likelihood.
347 / 352
Example A.
✑ Suppose we have observations Y1, . . . , Yn from two groups with known densities f0(y) and f1(y). Let Zi = 1 if the observation is from group 1 and 0 otherwise. The binary labels Z1, . . . , Zn are missing. If the fraction from group “1” is p, then the pdf of each observation is
f(y) = Σ_z f(y|Z = z) P(Z = z) = (1 − p)f0(y) + pf1(y)
This is called a finite mixture model. If we introduce some parameters into the component densities, fj(y|θ), j = 0, 1, then it is challenging to find the MLEs for this model.
348 / 352
Example A.
Let us see how to use the EM algorithm to turn this into a very simple task.
The complete data are Xi = (Yi, Zi).
If we could observe the Zi's, the MLE for the only parameter in the model, p, is
p̂ = Σi Zi / n.
The General Formulation: Define f(y, z|θ) as the complete data likelihood. (Throughout this section we use y and z to indicate possibly vector-valued quantities; yi and zi are elements of such a vector.)
349 / 352
The EM Algorithm
(0) Pick a starting value for the parameter, θ0. Iterate the following steps for m = 1, 2, . . . until the algorithm converges.
(E) Calculate
Q(θ, θm) = E[ log{ f(Y, Z|θ) / f(Y, Z|θm) } | Y, θm ]
Here
◮ the expectation is over the missing data Z,
◮ treat Y = y as fixed, because we condition on it,
◮ use θm as the true value of the parameter in the expectation.
(M) View Q(θ, θm) as a function of θ with θm fixed. Find θm+1, the value that maximizes Q(θ, θm).
Often, with careful selection of Z, there is a closed form solution to the M-step. If there is, the EM algorithm is a very convenient tool. Alternatively, the M-step may employ a well studied and reliable maximization approach.
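A minimal sketch of the EM iterations for Example A, estimating only the mixing fraction p; the component densities f0 and f1 are taken as known and the data are simulated purely for illustration:

set.seed(1)
f0 <- function(y) dnorm(y, 0, 1)
f1 <- function(y) dnorm(y, 3, 1)
y <- c(rnorm(70, 0, 1), rnorm(30, 3, 1))   # true p = 0.3
p <- 0.5                                    # starting value
for (m in 1:100) {
  gam <- p*f1(y) / ((1 - p)*f0(y) + p*f1(y))   # E-step: E(Z_i | y, current p)
  p_new <- mean(gam)                            # M-step: value of p maximizing Q
  if (abs(p_new - p) < 1e-8) break
  p <- p_new
}
p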
350 / 352
351 / 352
References
Albert, J.
Bayesian Computation with R, 2nd edn.
New York: Springer-Verlag, 2009.
Carlin, B.P. and Louis, T.A.
Bayesian Methods for Data Analysis, 3rd edn. Texts in Statistical
Science Series.
Chapman and Hall, 2008.
Gelman, A. and Carlin, J.B. and Stern, H.S. and Rubin, D.B.
Bayesian Data Analysis, 2nd edn. Texts in Statistical Science
Series.
Chapman and Hall, 2004.
Hoff, P.D.
A First Course in Bayesian Statistical Methods.
Springer, 2009.
352 / 352