
Applied Bayesian Statistics
Sharif Mahmood
Institute of Statistical Research and Training, University of Dhaka

Rev. T. Bayes (1702–1761)

Syllabus

Bayes Analysis
◮ Introduction and background; why Bayes?
◮ Bayes' theorem and its components: likelihood, prior and posterior.
◮ Probability theory and estimation.
◮ Concept of priors: informative and non-informative priors, proper and improper priors, discrete priors, conjugate priors, semi-conjugate priors; credible intervals; Bayesian hypothesis testing; building a predictive model.

Bayesian inference and prediction
◮ Single-parameter models: binomial model, Poisson model, normal with known variance, normal with known mean.
◮ Multi-parameter models: concept of nuisance parameters; normal model with non-informative, conjugate, and semi-conjugate priors; multinomial model with Dirichlet prior; multivariate normal model; methods of prior specification; methods of evaluating Bayes estimators.

Syllabus (continued)

Summarizing posterior distributions
◮ Introduction; approximate methods: numerical integration, the Bayesian central limit theorem.
◮ Simulation methods: direct sampling, rejection sampling, importance sampling.
◮ Markov chain Monte Carlo (MCMC) methods: the Gibbs sampler and its general properties, the Metropolis algorithm, Metropolis–Hastings (MH) sampling, the relationship between Gibbs and MH sampling.
◮ MCMC diagnostics: assessing convergence, acceptance rates of the MH algorithm.

Bayesian sample size computations

Bayesian inference for linear models and generalized linear models

Statistical Analysis with Missing Data
◮ Missing data, missing-data patterns, missing-data mechanisms.
◮ Imputation procedures, e.g. mean imputation, hot-deck imputation.
◮ Estimation of sampling variance in the presence of nonresponse.
◮ Likelihood-based estimation and propensity scores.
Introduction
Lecture 1 — Sharif Mahmood

Basic Statistical Operations
[Figure: diagram relating Bayesian estimation, summary, inference, hypothesis testing, comparison, and enumeration.]
Principal approaches to statistical inference: frequentist versus Bayesian.

Bayes' Formula
Suppose that x = (x1, x2, ..., xn) is a vector of n observations whose probability distribution p(x|θ) depends on the values of θ, where θ are our parameters of interest. Bayes' formula yields

  p(θ|x) = p(x|θ) p(θ) / p(x)

where p(x) = Σ_θ p(θ) p(x|θ) and the sum is over all possible values of θ. Omitting the factor p(x), which does not depend on θ and can thus be treated as a constant, yields the unnormalized posterior density

  p(θ|x) ∝ p(x|θ) p(θ)

Bayes' Rule
A basic Bayesian model has two stages:
– Likelihood specification: x|θ ∼ f(x|θ), and prior specification: θ ∼ π(θ).
– The prior distribution π(θ) is assumed known.
The posterior distribution of θ is given by

  p(θ|x) = f(x|θ) π(θ) / ∫_Θ f(x|θ) π(θ) dθ

The denominator is known as the normalizing constant. We have a posterior density, a sampling density/likelihood, a prior density, and a normalizing constant (which we typically do not need to find).

General Principle
Prior knowledge is combined with observed evidence: prior knowledge about the density of the unknown parameter vector θ is combined with sample data or experimental evidence x. Updated knowledge about θ is produced, in the form of the posterior distribution p(θ|x) of the unknown θ, together with a predictive distribution p(x̃|x) of unobserved data x̃. Evaluation of the required integrals used to be difficult or impossible; Monte Carlo computing methods now allow accurate estimation of such integrals.

Example: Cancer Diagnosis
✑ A young person is diagnosed as having an extremely rare type of cancer. Let C stand for having cancer and + stand for a positive response to the test, and assume P(C) = 10⁻⁶, P(+|C) = 0.99, and P(+|Cᶜ) = 0.01.
Sol: Only one in a million of his age group has the disease, and the test is very good, giving only 1% false positives and 1% false negatives. The probability that he has cancer given a positive test is

  P(C|+) = P(+|C)P(C) / [P(+|C)P(C) + P(+|Cᶜ)P(Cᶜ)]
         = (0.99)(10⁻⁶) / [(0.99)(10⁻⁶) + (0.01)(0.999999)]
         = 0.00000099 / 0.01000098
         = 0.00009899

The probability of a second case can be updated using 0.00009899 as the prior probability of having the disease.

Example: Machine
✑ Three machines A, B and C produce 20%, 45% and 35% respectively of a factory's wheel-nut output. 2%, 1% and 3% respectively of these machines' outputs are defective. What is the probability that a randomly selected wheel nut comes from machine A, given that it is defective?

Sol: Let D be the event that the wheel nut is defective, and let A, B, and C be the events that the selected wheel nut came from machines A, B and C respectively:

  P(A|D) = P(A)P(D|A) / [P(A)P(D|A) + P(B)P(D|B) + P(C)P(D|C)]
         = (0.2)(0.02) / [(0.2)(0.02) + (0.45)(0.01) + (0.35)(0.03)]
         = 0.211

Here we divide the probability of the required path (that the nut came from machine A and was defective) by the probability of all paths.
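Both posterior calculations can be checked with a short Python sketch (the function name is mine, not from the slides): Bayes' rule over a finite set of hypotheses, with the law of total probability in the denominator.

```python
def bayes_posterior(prior, like):
    """Posterior P(H_i | E) over hypotheses via Bayes' rule.

    prior[i] = P(H_i), like[i] = P(E | H_i); the denominator is
    the law of total probability, P(E) = sum_i P(E | H_i) P(H_i).
    """
    joint = [p * l for p, l in zip(prior, like)]
    total = sum(joint)
    return [j / total for j in joint]

# Cancer example: hypotheses (C, C^c), evidence "+".
p_cancer = bayes_posterior([1e-6, 1 - 1e-6], [0.99, 0.01])[0]
print(round(p_cancer, 8))   # 9.899e-05

# Machine example: hypotheses (A, B, C), evidence "defective".
p_A = bayes_posterior([0.20, 0.45, 0.35], [0.02, 0.01, 0.03])[0]
print(round(p_A, 3))        # 0.211
```

The same two-line pattern covers any discrete Bayes-rule exercise: multiply prior by likelihood, then renormalize.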
Frequentist vs Bayesian

Frequentist approach:
1 Parameters are fixed at their true but unknown values.
2 Objective notion of probability based on repeated sampling.
3 Large-sample properties or asymptotic approximations.
4 Maximizing a likelihood.

Bayesian approach:
1 Parameters are considered as random variables with distributions.
2 Subjective notion of probability (prior) combined with data.
3 Does not require large-sample approximations.
4 Simulation-based approach.

Criticisms: Frequentist vs Bayesian Paradigms

Frequentist:
1 Failure to incorporate relevant prior information.
2 Inefficiency, inflexibility and incoherency.
3 Unrealistic assumptions about the data-generating mechanism.
4 Objectivity is often illusory.

Bayesian:
1 Lack of objectivity.
2 Complexity of computation.
3 Lack of intuitive model-building capabilities.
4 Priors need to be defended.
5 No guarantee of Markov chain convergence.

θ as Fixed versus as a Random Variable

Frequentist approach (θ fixed): estimate θ with measures of uncertainty (SEs, CIs).
◮ 95% confidence interval: 95% of the time, θ is in the 95% interval that is estimated each time.
◮ P(θ ∈ 95% CI) = 0 or 1; P(θ > 2) = 0 or 1.

Bayesian approach (θ random): find the posterior distribution of θ, then take quantities of interest from that distribution (posterior mean, posterior SD, posterior credible intervals). We can make probability statements about θ.
◮ 95% credible interval: P(θ ∈ 95% CI) = 0.95.
◮ P(θ > 2) can take any value in (0, 1).
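The frequentist reading of "95% of the time" can be illustrated by simulation (my own sketch, not from the slides): draw many samples from N(θ, 1) with σ known, and count how often the interval x̄ ± 1.96/√n traps the true θ. Each individual interval either contains θ or it does not; only the long-run frequency is 0.95.

```python
import random, math

random.seed(1)
theta, n, reps = 2.0, 25, 2000      # true mean, sample size, replications
hits = 0
for _ in range(reps):
    xs = [random.gauss(theta, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    half = 1.96 / math.sqrt(n)      # known sigma = 1
    if xbar - half < theta < xbar + half:
        hits += 1
coverage = hits / reps
print(coverage)                      # close to 0.95
```

This is precisely the property that a single realized interval does not inherit, which is the point of the criticism above.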
Frequentist vs Bayesian: Confidence Interval (CI)

Frequentist approach:
1 A 95% CI calculated from a sample will contain the true parameter 95% of the time.
2 The particular confidence interval from any one sample may or may not contain the true parameter value.
3 This is counter to the intuitive and often-held belief that the CI from a particular sample contains the true parameter value with 95% certainty.

Bayesian approach:
1 Uncertainties about parameters are represented as probabilities in the Bayesian paradigm.
2 One can calculate the probability that a parameter lies in a given interval (a credible interval).
3 Under the Bayesian framework we call this interval (or region, in the general case) a highest posterior density (HPD) region or credible set.

Example: Coin Tossing
✑ Test H0: θ = 1/2 versus Ha: θ > 1/2, where θ is the true probability of heads; in 12 independent tosses of a coin, 9 heads and 3 tails are observed.

Sol: Two choices for the sampling distribution are possible:
1 Binomial: n = 12 tosses is fixed, the random quantity X is the number of heads observed, X ∼ Bin(12, θ), and the likelihood function is

  L1(θ) = C(n, x) θ^x (1−θ)^(n−x) = C(12, 9) θ^9 (1−θ)^3.

2 Negative binomial: flip the coin until the third tail appears. Here X is the number of heads required to complete the experiment, X ∼ NegBin(r = 3, θ), and the likelihood function is

  L2(θ) = C(r+x−1, x) θ^x (1−θ)^r = C(11, 9) θ^9 (1−θ)^3.

The p-value corresponding to the rejection region "Reject H0 if X ≥ c" is computed under both distributions.

Example: Coin Tossing (continued)

Binomial: α1 = P_{θ=1/2}(X ≥ 9) = Σ_{j=9}^{12} C(12, j) θ^j (1−θ)^(12−j)

  > pbinom(8, 12, .5, lower.tail = F)
  [1] 0.07299805

Negative binomial: α2 = P_{θ=1/2}(X ≥ 9) = Σ_{j=9}^{∞} C(2+j, j) θ^j (1−θ)^3

  > pnbinom(8, 3, .5, lower.tail = F)
  [1] 0.03271484

At the usual Type I error level α = .05, the two model assumptions lead to two different decisions.
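The two p-values can be reproduced without R (a Python sketch; `math.comb` supplies the binomial coefficients). The infinite negative binomial tail is obtained as the complement of the finite sum up to 8.

```python
from math import comb

theta = 0.5

# Binomial model: alpha1 = P(X >= 9), X ~ Bin(12, 1/2)
alpha1 = sum(comb(12, j) * theta**j * (1 - theta)**(12 - j)
             for j in range(9, 13))

# Negative binomial model: alpha2 = P(X >= 9), X ~ NegBin(r = 3, 1/2);
# use the complement of the finite sum P(X <= 8).
alpha2 = 1 - sum(comb(2 + j, j) * theta**j * (1 - theta)**3
                 for j in range(0, 9))

print(round(alpha1, 8))   # 0.07299805
print(round(alpha2, 8))   # 0.03271484
```

Only the binomial p-value exceeds 0.05, so the two sampling models do lead to different decisions even though their likelihoods for θ are proportional.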
The probability of X values "more extreme" than 9 (the value actually observed) was used as evidence against H0, even though these values did not occur.

Probability Theory
Lecture 2 — Sharif Mahmood

Three Axioms of Probability
Let S be the sample space and A be an event in S.
1 For any event A, P(A) ≥ 0.
2 P(S) = 1.
3 If A1, A2, ..., An are mutually disjoint, then P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An).

The three axioms imply:
◮ P(∅) = 0.
◮ P(Aᶜ) = 1 − P(A).
◮ For any event A, 0 ≤ P(A) ≤ 1.
◮ If A ⊂ B, then P(A) ≤ P(B).
◮ For any two events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Conditional Probability

  P(A|B) = P(A ∩ B) / P(B)

Multiplicative law of probability: P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B)

Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)

Law of total probability: P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai)

Independence
A and B are independent if P(AB) = P(A)P(B). If A and B are independent, then

  P(A|B) = P(AB)/P(B) = P(A)P(B)/P(B) = P(A)

Conditional independence: P(AB|C) = P(A|C)P(B|C)

Random Variables
A random variable is a function that takes a random experiment and assigns a number to the outcome of the experiment. Outcome values are assigned probabilities by a probability mass function (for a discrete random variable) or a probability density function (for a continuous random variable).

Probability Functions
Probability mass function: P(Y = y). Probability density function: P(Y ∈ A) = ∫_A f(y) dy.

Characteristics of all PDFs and PMFs:
◮ The total probability must sum (or integrate) to 1.
◮ P(Y = y) ≥ 0 for all y.
◮ The support is all y's where P(Y = y) > 0.
Marginal, Conditional, and Joint Densities

  f(x) = ∫ f(x, y) dy
  f(x, y) = ∫ f(x, y, z) dz
  f(x|y) = f(x, y) / f(y)
  f(x|y, z) = f(x, y, z) / f(y, z)
  f(x, y) = f(x|y)f(y) = f(y|x)f(x)
  f(x, y, z) = f(x|y, z)f(y|z)f(z)

Expectation
Discrete case: E(X) = Σ_i x_i P(X = x_i), where P(X = x) is the probability mass function (PMF).
Continuous case: E(X) = ∫_{−∞}^{∞} x f(x) dx, where f(x) is the probability density function (PDF).
Expectation of a function of a random variable: E[g(X)] = Σ_i g(x_i) P(X = x_i) for discrete random variables and E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx for continuous random variables.

Variance and Covariance
The variance of X is

  var(X) = E[X − E(X)]² = E[X² − 2X E(X) + {E(X)}²] = E(X²) − [E(X)]²

  covar(X, Y) = E(XY) − E(X)E(Y) = E{[X − E(X)][Y − E(Y)]}

Important Distributions
Discrete distributions: Bernoulli, binomial, multinomial, Poisson, geometric, negative binomial.
Continuous distributions: univariate normal, multivariate normal, univariate t, multivariate t, uniform, beta, Dirichlet, exponential, gamma, Weibull, inverse gamma.

The Bernoulli Distribution
Y ∼ Bernoulli(π), y = 0, 1; probability of success π ∈ [0, 1].
  p(y|π) = π^y (1−π)^(1−y)
  E(Y) = π; Var(Y) = π(1−π).

The Binomial Distribution
Y ∼ Binomial(n, π), y = 0, 1, ..., n; number of trials n ∈ {1, 2, ...}; probability of success π ∈ [0, 1].
  p(y|π) = C(n, y) π^y (1−π)^(n−y)
  E(Y) = nπ; Var(Y) = nπ(1−π).

The Multinomial Distribution
Y ∼ Multinomial(n, π1, ..., πk), yj = 0, 1, ..., n with Σ_{j=1}^{k} yj = n; number of trials n ∈ {1, 2, ...}; probability of success for category j: πj ∈ [0, 1], with Σ_{j=1}^{k} πj = 1.
  p(y|n, π) = [n! / (y1! y2! ... yk!)] π1^{y1} π2^{y2} ... πk^{yk}
  E(Yj) = nπj; Var(Yj) = nπj(1−πj); Cov(Yi, Yj) = −nπiπj.

The Poisson Distribution
Y ∼ Poisson(λ), y = 0, 1, ...; expected number of occurrences λ > 0.
  p(y|λ) = e^{−λ} λ^y / y!
  E(Y) = λ; Var(Y) = λ.

The Geometric Distribution
How many failures before the first success? Y ∼ Geometric(π), y = 0, 1, 2, ...; probability of success π ∈ [0, 1].
  p(y|π) = π(1−π)^y
  E(Y) = (1−π)/π; Var(Y) = (1−π)/π².

The Negative Binomial Distribution
How many failures before the r-th success? Y ∼ Neg.Binomial(r, π), y = 0, 1, 2, ...; probability of success π ∈ [0, 1].
  p(y|r, π) = C(y+r−1, y) π^r (1−π)^y
  E(Y) = r(1−π)/π; Var(Y) = r(1−π)/π².
[Figure: negative binomial pmf with r = 4.]

The Univariate Normal Distribution
Y ∼ N(µ, σ²), y ∈ ℝ; mean µ ∈ ℝ; variance σ² > 0.
  p(y|µ, σ²) = [1/(σ√(2π))] exp(−(y−µ)²/(2σ²))
  E(Y) = µ; Var(Y) = σ².
[Figures: normal densities for µ = 0, 1 (Figure 1) and σ = 1, 2 (Figure 2).]

The Multivariate Normal Distribution
Y ∼ N(µ, Σ), y ∈ ℝ^k; mean µ ∈ ℝ^k; variance Σ a positive definite k × k matrix.
  p(y|µ, Σ) = (2π)^{−k/2} |Σ|^{−1/2} exp(−½ (y−µ)′ Σ^{−1} (y−µ))
  E(Y) = µ; Var(Y) = Σ.
[Figure: surface plot of a bivariate normal density with µ1 = µ2 = 0, σ11 = σ22 = 10, σ12 = 5, ρ = 0.5.]
In the bivariate case,

  f(x) = [1/(2π√(σ11σ22(1−ρ²)))] exp{−[ (x1−µ1)²/σ11 − 2ρ(x1−µ1)(x2−µ2)/√(σ11σ22) + (x2−µ2)²/σ22 ] / (2(1−ρ²))}

The Univariate t Distribution
Y ∼ t(µ, σ², ν), y ∈ ℝ; location µ ∈ ℝ; scale σ² > 0; degrees of freedom ν.
  p(y|µ, σ², ν) = [Γ((ν+1)/2) / (Γ(ν/2) σ√(νπ))] [1 + (1/ν)((y−µ)/σ)²]^{−(ν+1)/2}
  E(Y) = µ for ν > 1; Var(Y) = σ² ν/(ν−2) for ν > 2.
[Figures: t densities for df = 2, 6, 10.]

The Multivariate t Distribution
Y ∼ t(µ, Σ, ν), y ∈ ℝ^k; location µ ∈ ℝ^k; scale Σ a positive definite k × k matrix.
  p(y|µ, Σ, ν) = [Γ((ν+k)/2) / (Γ(ν/2) (νπ)^{k/2})] |Σ|^{−1/2} [1 + (1/ν)(y−µ)′ Σ^{−1} (y−µ)]^{−(ν+k)/2}
  E(Y) = µ for ν > 1; Var(Y) = Σ ν/(ν−2) for ν > 2.
[Figure: contour plot of a bivariate t density.]

The Uniform Distribution
Y ∼ Uniform(α, β), y ∈ [α, β]; interval [α, β] with α < β.
  p(y|α, β) = 1/(β−α)
  E(Y) = (α+β)/2; Var(Y) = (β−α)²/12.

The Beta Distribution
Y ∼ Beta(α, β), y ∈ [0, 1]; shape parameters α > 0, β > 0.
  p(y|α, β) = [Γ(α+β)/(Γ(α)Γ(β))] y^{α−1} (1−y)^{β−1}
  E(Y) = α/(α+β); Var(Y) = αβ/[(α+β)²(α+β+1)].

The Dirichlet Distribution
Y ∼ Dirichlet(α1, ..., αk), yj ∈ [0, 1] with Σ_{j=1}^{k} yj = 1; parameters αj > 0; write α0 ≡ Σ_{j=1}^{k} αj.
  p(y|α) = [Γ(α1 + ... + αk)/(Γ(α1)···Γ(αk))] y1^{α1−1} ··· yk^{αk−1}
  E(Yj) = αj/α0; Var(Yj) = αj(α0−αj)/[α0²(α0+1)]; Cov(Yi, Yj) = −αiαj/[α0²(α0+1)].

The Exponential Distribution
Y ∼ Exponential(λ), y > 0; rate parameter λ > 0.
  p(y|λ) = λe^{−λy}
  E(Y) = 1/λ; Var(Y) = 1/λ².

The Gamma Distribution
Y ∼ Gamma(α, β), y > 0; shape parameter α > 0; inverse scale (rate) parameter β > 0.
  p(y|α, β) = [β^α/Γ(α)] y^{α−1} exp(−βy)
  E(Y) = α/β; Var(Y) = α/β².

The Weibull Distribution
Y ∼ Weibull(α, β), y > 0; shape parameter β > 0; scale parameter α > 0.
  p(y|α, β) = (β/α)(y/α)^{β−1} exp(−(y/α)^β)
  E(Y) = αΓ(1 + 1/β); Var(Y) = α²[Γ(1 + 2/β) − Γ(1 + 1/β)²].

The Inverse Gamma Distribution
If X ∼ Gamma(α, β), then Y = 1/X ∼ Invgamma(α, β), y > 0; shape parameter α > 0; scale parameter β > 0.
  p(y|α, β) = [β^α/Γ(α)] y^{−(α+1)} exp(−β/y)
  E(Y) = β/(α−1) for α > 1; Var(Y) = β²/[(α−1)²(α−2)] for α > 2.

Maximum Likelihood Estimation
Lecture 3 — Sharif Mahmood

Maximum Likelihood
The classical approach to statistics involves two basic steps: model estimation and inference. The first step is determining an appropriate probability distribution/model for the data at hand; the second is estimating its parameters. The idea of maximum likelihood estimation is that a good choice for the estimate of a parameter of interest is the value of the parameter that makes the observed data most likely to have occurred.

Constructing a Likelihood Function
If x1, x2, ..., xn are independent observations of a random variable x in a data set of size n, then we know from the multiplication rule in probability theory that the joint probability for the vector X is

  f(X|θ) ≡ L(θ|x) = Π_{i=1}^{n} f(xi|θ)

This equation is the likelihood function for the model. The primary point of constructing a likelihood function is to solve for the value of the parameter that makes the occurrence of the data most probable, or most "likely" to have actually occurred. From a classical standpoint, the parameter is assumed to be fixed.

✑ Suppose we have one observation y = 0.5 (assumed to be) drawn from a normal distribution with σ² = 1. Estimate the mean µ of the distribution. Obvious guess: µ = 0.5.
[Figures: normal PDFs evaluated at y = 0.5 for µ = 0.5 and µ = −0.5.]

Why Does L(θ|y) = p(y|θ) Work?
[Figures: the normal PDF at y for several values of µ, and the corresponding likelihood function L(µ|y).]
Our best estimate of θ is the value of θ that maximizes L(θ|y) (the MLE).

Example: Poisson Distribution
Suppose we have some count data (e.g. the number of errors on a page). Here the Poisson distribution can be used to model the data.
Thus we write Yi ∼ iid Poisson(λ). We want to find λ, which is the mean of the Poisson distribution. The PMF (discrete) for the data is

  p(y|λ) = Π_{i=1}^{n} λ^{yi} e^{−λ} / yi!

Since L(θ|y) = p(y|θ), we have

  L(λ|y) = Π_{i=1}^{n} λ^{yi} e^{−λ} / yi!

Example: Poisson Distribution (continued)
The log-likelihood is

  l(λ|y) = Σ_{i=1}^{n} (yi ln λ − λ − ln yi!)

We can drop all terms that don't depend on λ (because likelihood is a relative concept and is invariant to shifts):

  l(λ|y) = Σ_{i=1}^{n} yi ln λ − nλ

We set the derivative (known as the score function) to zero and solve for λ:

  ∂l(λ|y)/∂λ = Σ yi / λ − n = 0  ⇒  λ̂ = Σ yi / n

Maximum Likelihood in R
Write our log-likelihood function:

  > ll.poisson <- function(par, y) {
  +   lambda <- exp(par)  # reparameterize so the search is unconstrained
  +   out <- sum(y * log(lambda)) - length(y) * lambda
  +   return(out)
  + }

Find the maximum (with sample data):

  > y <- rpois(1000, 5)
  > opt <- optim(par = 2, fn = ll.poisson, method = "BFGS",
  +              control = list(fnscale = -1), y = y)$par
  > mle <- exp(opt)  # transform back to the lambda scale
  > mle
  [1] 4.954001

More Complicated Likelihoods
Up to this point, our distributions have been relatively simple (i.e. all observations are i.i.d. from a distribution with the same parameter). Suppose our observations come from different distributions, or some observations are not fully observed. How do we define the joint distribution for our data (and thus our likelihood)? Use an indicator variable.

Indicator variable: let D be an indicator variable such that di = 1 if observation i follows one distribution, and di = 0 if it follows another distribution. We can incorporate the indicator variable into our likelihood.

Example 1
Suppose that we have 5 observations. The first 2 are assumed to be from distribution 1, the last 3 from distribution 2. Then d = c(1, 1, 0, 0, 0) and

  p(θ1, θ2|y) = Π_{i=1}^{n} [p(yi|θ1)]^{di} [p(yi|θ2)]^{1−di}

Example 2
Suppose that we have 5 observations.
We observe the first 2 observations completely, but we only observe that the last 3 are greater than some number z. d = c(1, 1, 0, 0, 0) n Y p(θ|y) = [p(yi |θ)]di [1 − F (z)]1−di i=1 where F(y) is the CDF for p(y|θ), where F (z) = P (Y ≤ z) 55 / 352 Prior Distributions Lecture 4 Sharif Mahmood 56 / 352 Introduction A prior distribution of a parameter is the probability distribution that represents your uncertainty about the parameter before the current data are examined. Multiplying the prior distribution and the likelihood function together leads to the posterior distribution of the parameter. The existence of a prior distribution for any problem can be justified by axioms of decision theory; here we focus on common prior distributions for any given application. 57 / 352 How do we choose the priors? Quantifying expert opinion: ◮ Pick the prior mean where expert’s belief is centered on. ◮ Pick the variance such that the range covers reasonable values of the expected mean, or consider equivalent sample size for your prior. e.g. if you want your prior to be equivalent to a sample of size n, your prior variance should be σ 2 /n. Previous research results. You can use previously reported research results as your prior. Bayesian inference can proceed incrementally. Non-informative, flat priors. If there is no reasonable way to form a prior, one can use a flat prior. These often result in similar point estimates as frequentist estimates. 
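That last point can be made concrete with the beta-binomial model used later in these notes (my own sketch): under a flat Beta(1, 1) prior, the posterior mode coincides exactly with the frequentist MLE y/n, while an informative prior pulls the estimate away from it.

```python
# Posterior under a Beta(a, b) prior with y successes in n trials is
# Beta(y + a, n - y + b); its mode is (y + a - 1) / (n + a + b - 2).
def posterior_mode(y, n, a, b):
    return (y + a - 1) / (n + a + b - 2)

y, n = 285, 500
mle = y / n                                # frequentist point estimate
flat = posterior_mode(y, n, 1, 1)          # flat Beta(1, 1) prior: same as MLE
informative = posterior_mode(y, n, 2, 12)  # informative prior shrinks the estimate
print(mle, flat, round(informative, 4))    # 0.57 0.57 0.5586
```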
Priors
There are various types of prior distributions:
1 Noninformative priors
2 Improper priors
3 Informative priors
4 Conjugate priors
5 Jeffreys priors

Noninformative Priors
Noninformative priors are "flat" relative to the likelihood function. A prior π(θ) is noninformative if it has minimal impact on the posterior distribution of θ: it is dominated by the likelihood function, does not change much over the region where the likelihood is appreciable, and does not assume large values outside that range. Also known as a reference prior, vague prior, or diffuse prior.
◮ If 0 ≤ θ ≤ 1, then θ ∼ U(0, 1) is a noninformative prior for θ, with π(θ) = 1, 0 ≤ θ ≤ 1.
◮ If −∞ < θ < ∞ and θ ∼ N(µ0, σ0²) with σ0² → ∞, then we get a noninformative prior for θ.

Improper Priors
A prior is said to be improper if

  ∫_Θ π(θ) dθ = ∞

i.e. a prior is improper if its normalizing constant is infinite. Improper priors are used in Bayesian inference to obtain noninformative priors. Although an improper prior may sometimes result in an improper posterior, it can still lead to a proper posterior distribution.
◮ If −∞ < θ < ∞ and π(θ) ∝ 1, then θ has a uniform prior distribution on the real line. Clearly, ∫_{−∞}^{∞} π(θ) dθ = ∫_{−∞}^{∞} dθ = ∞.

Example: Improper Priors
✑ Suppose x1, ..., xn are i.i.d. N(θ, 1) given θ. Here Θ = {θ : −∞ < θ < ∞}. Let π(θ) ∝ 1. Then

  p(θ|x) ∝ p(x|θ)π(θ) = exp(−½ Σ(xi − θ)²) ∝ exp(−(n/2)(θ − x̄)²)

Thus θ|x ∼ N(x̄, 1/n).

Informative Priors
An informative prior expresses specific, definite information about a variable. An informative prior is not dominated by the likelihood, and it has an impact on the posterior distribution. The proper use of prior distributions illustrates the power of the Bayesian method: information gathered from a previous study, past experience, or expert opinion can be combined with current information in a natural way.

Example: Informative Priors
✑ Suppose x1, ..., xn are i.i.d. N(θ, 10) given θ.
Also suppose θ ∼ N(0, 1), an informative prior for θ. Then

  p(θ|x) ∝ exp[−(1/20) Σ(xi − θ)²] exp[−θ²/2]
        ∝ exp[−(1/20)(nθ² − 2θ Σxi)] exp[−θ²/2]
        ∝ exp[−((n+10)/20)(θ − Σxi/(n+10))²]

Thus θ|x ∼ N(Σxi/(n+10), 10/(n+10)); for a single observation (n = 1) this is N(x/11, 10/11).

Conjugate Priors
A conjugate prior for a family of distributions is one for which the prior and posterior distributions belong to the same family.
✑ Suppose x1, ..., xn are i.i.d. Binomial(1, θ) and θ ∼ Beta(α, β); then θ|x ∼ Beta(α + Σxi, n − Σxi + β). Thus the beta prior is a conjugate prior for the binomial family.

  Likelihood                      Conjugate prior      Posterior
  Binomial(n, θ)                  θ ∼ Beta(α, β)       Beta
  Poisson(θ)                      θ ∼ G(δ0, λ0)        Gamma
  N(µ, σ²), σ² known              µ ∼ N(µ0, σ0²)       Normal
  N(µ, σ²), µ known, τ = 1/σ²     τ ∼ G(δ0, λ0)        Gamma
  G(α, λ), α known                λ ∼ G(δ0, λ0)        Gamma
  Beta(α, λ), λ known             α ∼ G(δ0, λ0)        Gamma

Jeffreys Priors
Jeffreys prior is based on the Fisher information matrix, and it satisfies the local uniformity of noninformative priors. Let θ be a p × 1 vector and p(x|θ) the density of x given θ. The Fisher information matrix is defined as

  I(θ) = −E[∂² log p(x|θ) / ∂θi ∂θj]_{p×p}

Jeffreys prior is then defined as π(θ) ∝ |I(θ)|^{1/2}, where |·| denotes the determinant.

Jeffreys Priors (continued)
✑ Suppose x1, ..., xn are i.i.d. Binomial(1, θ); p(x|θ) = θ^x (1−θ)^{1−x}.

  log p(x|θ) = x log θ + (1−x) log(1−θ)
  ⇒ ∂ log p(x|θ)/∂θ = x/θ − (1−x)/(1−θ)
  ⇒ ∂² log p(x|θ)/∂θ² = −x/θ² − (1−x)/(1−θ)²
  ⇒ I(θ) = −E[∂² log p(x|θ)/∂θ²] = 1/[θ(1−θ)]

Thus Jeffreys prior for θ is π(θ) ∝ θ^{−1/2}(1−θ)^{−1/2} = θ^{1/2−1}(1−θ)^{1/2−1}, i.e. θ ∼ Beta(1/2, 1/2), which is proper for the binomial model.

Example: Jeffreys Priors
✑ Suppose x1, ..., xn are i.i.d. Poisson(θ).

  log p(x|θ) = x log θ − θ − log(x!)
  ⇒ ∂ log p(x|θ)/∂θ = x/θ − 1
  ⇒ ∂² log p(x|θ)/∂θ² = −x/θ²
  ⇒ I(θ) = 1/θ

Thus Jeffreys prior for θ is π(θ) ∝ θ^{−1/2}, which is improper.

Exercises
1 Let X1, ..., Xn ∼ Uniform(0, θ) and let f(θ) ∝ 1/θ. Find the posterior density.
2 Let X1, ..., Xn ∼ Poisson(λ).
  (a) Let λ ∼ G(α, β) be the prior. Show that the posterior is also a gamma, and find the posterior mean.
  (b) Find the Jeffreys prior and the corresponding posterior.
3 Show that the Jeffreys prior based on the binomial likelihood f(y|θ) = C(n, y) θ^y (1−θ)^{n−y} is the Beta(.5, .5) distribution.
4 Let θ be a univariate parameter of interest, and let γ = g(θ) be a 1–1 transformation. Show that the Jeffreys prior is invariant under reparameterization.

Other Prior Construction Methods
There are a multitude of other methods for constructing priors; one of them uses the marginal distribution

  m(x) = ∫ f(x|θ) p(θ) dθ

Here, we choose the prior p(θ) based on its ability to preserve consistency with the marginal distribution of the observed data. Unlike the previous approaches in this lecture, this approach uses not only the form of the likelihood, but also the actual observed data values to help determine the prior. The resulting inferences from this posterior would thus be "overconfident". We might refer to this method as empirical estimation of the prior — empirical Bayes (EB); indeed, EB methods that ignore this fact are often referred to as naive EB methods.

Bayesian Inference and Prediction
Lecture 5 — Sharif Mahmood

A Simple (Beta-Binomial) Model
✑ Suppose ISRT played 500 games in the regular inter-department competitions (football, cricket, basketball etc.) during the 1996–2012 seasons and won 285 of them. Suppose the ISRT team wins each game with probability π. Estimate π.

Sol: We have 500 Bernoulli observations, or one observation Y, where Y ∼ Binomial(n, π) with n = 500.
Assumptions:
◮ Each game is a Bernoulli trial.
◮ The team has the same probability of winning each game.
◮ The outcomes of the games are independent.

We can use the beta distribution as a prior for π since it has support over [0, 1].
  p(π|y) ∝ p(y|π) p(π)
         = Binomial(n, π) × Beta(α, β)
         = C(n, y) π^y (1−π)^{n−y} × [Γ(α+β)/(Γ(α)Γ(β))] π^{α−1} (1−π)^{β−1}
         ∝ π^y (1−π)^{n−y} × π^{α−1} (1−π)^{β−1}
         = π^{y+α−1} (1−π)^{n−y+β−1}

The posterior distribution is simply a Beta(y + α, n − y + β) distribution. Effectively, our prior just adds α − 1 successes and β − 1 failures to the dataset: Bayesian priors act like pseudo-observations added to the data.

Summarizing the Posterior
✑ Consider the previous example of the beta-binomial model with a Beta(1, 1) (uniform) prior and a Beta(y + α, n − y + β) posterior, where n = 500, y = 285, α = 1, and β = 1.

Posterior mean
◮ Analytically:

  (y + α) / [(y + α) + (n − y + β)] = 286/(286 + 216) ≈ 0.57

◮ Simulation:

  > a <- 1
  > b <- 1
  > posterior.unif.prior <- rbeta(10000, shape1 = 285 + a,
  +                               shape2 = 500 - 285 + b)
  > mean(posterior.unif.prior)
  [1] 0.5699

Summarizing the Posterior (continued)

Posterior variance
◮ Analytically:

  (y + α)(n − y + β) / {[(y + α) + (n − y + β)]² [(y + α) + (n − y + β) + 1]} ≈ 0.00049

◮ Simulation:

  > var(posterior.unif.prior)
  [1] 0.0004832

Posterior probability 0.5 < π < 0.6
◮ Analytically:

  ∫_{0.5}^{0.6} [Γ((y+α) + (n−y+β)) / (Γ(y+α)Γ(n−y+β))] π^{y+α−1} (1−π)^{n−y+β−1} dπ ≈ 0.91

◮ Simulation:

  > mean(posterior.unif.prior > 0.5 & posterior.unif.prior < 0.6)
  [1] 0.9114

Bayesian Intervals
In Bayesian statistics, we use the terms credible sets and credible intervals rather than confidence intervals. Find the central interval that contains 95% of the area of the posterior:

  > qbeta(c(0.025, 0.975), 285 + a, 500 - 285 + b)
  [1] 0.5262 0.6127

Simulation:

  > quantile(posterior.unif.prior, probs = c(0.025, 0.975))
    2.5%  97.5%
  0.5269 0.6125

Interpretation of Bayesian Intervals
Recall that the posterior distribution represents our updated subjective probability distribution for the unknown parameter. Thus, the interpretation of the 95% credible interval is that the probability is .95 that the true π is in the interval.
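The analytic summaries above can also be reproduced without R (a Python sketch of mine; since the standard library has no beta CDF, the probability is obtained by numerically integrating the Beta(286, 216) density with Simpson's rule):

```python
from math import lgamma, exp, log

a_post, b_post = 286, 216   # Beta(y + alpha, n - y + beta) posterior

post_mean = a_post / (a_post + b_post)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

def beta_logpdf(x, a, b):
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)   # log beta function
    return (a - 1) * log(x) + (b - 1) * log(1 - x) - log_B

def prob_between(lo, hi, a, b, m=2000):
    """P(lo < X < hi) for X ~ Beta(a, b), via Simpson's rule."""
    h = (hi - lo) / m
    total = 0.0
    for i in range(m + 1):
        w = 1 if i in (0, m) else (4 if i % 2 else 2)
        total += w * exp(beta_logpdf(lo + i * h, a, b))
    return h / 3 * total

prob = prob_between(0.5, 0.6, a_post, b_post)
print(round(post_mean, 4), round(post_var, 5), round(prob, 2))  # ≈ 0.5697 0.00049 0.91
```

The small gap between these exact values and the `rbeta` simulation in the slides is just Monte Carlo error from the 10,000 draws.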
If the Beta(1, 1) prior had been a true representation of our prior beliefs or knowledge about the parameter π, then after seeing the data we would believe that P(0.526 < π < 0.612) = 0.95. Contrast this with the interpretation of a frequentist confidence interval.

Highest Posterior Density Regions
◮ The density at any point inside the interval is greater than the density at any point outside it.
◮ It is the shortest possible interval trapping the desired probability.
◮ Preferable to equal-tail-probability credible sets when the posterior is highly skewed or multimodal.
◮ Generally difficult to compute; tables of HPDs for certain densities are available.

Summary of the Beta-Binomial Model
We knew that the likelihood × prior produced something that looked like a beta distribution up to a constant of proportionality. Since the posterior must be a probability distribution, we know that it is a beta distribution, and we can easily solve for the normalizing constant (although we don't need to, since we already have the posterior). We can summarize it analytically or via simulation with the following quantities:
◮ posterior mean (median, mode);
◮ posterior standard deviation;
◮ highest posterior density (HPD) region;
◮ posterior credible intervals (credible sets).

Graph of the Beta-Binomial Model
[Figure: four panels showing prior, data (MLE) and posterior under an uninformative Beta(1,1) prior and an informative Beta(2,12) prior, for the original data and for n = 1000.]

Prediction
In many situations, interest focuses on predicting values of a future sample from the same population, i.e. on estimating values of potentially observable but not yet observed quantities.
Example: Suppose we want to know how the ISRT team will do in the upcoming 10 inter-department games of the 2013 season. So we consider a new sample of size n⋆ and want to estimate the probability of some particular number y⋆ of wins/successes in this sample.

Prediction (continued)
If, based on the earlier survey, we actually knew the true value of the population proportion π, we would just use the binomial probability:

  p(y⋆|π) = C(n⋆, y⋆) π^{y⋆} (1−π)^{n⋆−y⋆}

But of course we still have uncertainty about π even after observing the original sample. All of our current knowledge about π is contained in the posterior distribution obtained using the original survey. The posterior predictive probability of some particular value of y⋆ in a future sample of size n⋆ is

  p(y⋆|y) = ∫₀¹ p(y⋆|π) p(π|y) dπ

where y denotes the data from the original survey, and p(π|y) is the posterior distribution based on that survey.

✑ Suppose we had used the Beta(1, 1) prior, so our posterior distribution is Beta(286, 216). Then the posterior predictive probability of y⋆ wins in a future sample of size n⋆ is

  p(y⋆|y) = ∫₀¹ p(y⋆|π) Beta(π; 286, 216) dπ

This is particularly easy to compute if n⋆ = 1, in which case

  P(y⋆ = 1|y) = ∫₀¹ P(y⋆ = 1|π) p(π|y) dπ = ∫₀¹ π p(π|y) dπ = E(π|y) ≈ 0.57

Bayesian Inference and Prediction
Lecture 6 — Sharif Mahmood

The Poisson Distribution
The Poisson distribution may be appropriate when the data are counts of rare events: events occurring at random at a constant rate per unit time, distance, volume, or whatever. We assume that the number of events occurring in any interval is independent of the number of events occurring in a disjoint interval. Examples:
– the number of cases of a rare form of cancer occurring in DMC in a year;
– the number of typos in each book produced by Annona prokashoni;
– the number of defective PCs at the ISRT computer lab.
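For n⋆ > 1 the predictive integral above has a closed form, the beta-binomial distribution: p(y⋆|y) = C(n⋆, y⋆) B(y⋆ + 286, n⋆ − y⋆ + 216)/B(286, 216), where B is the beta function. A Python sketch of mine evaluating it for the 10 upcoming games:

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_pmf(y_star, n_star, a, b):
    """Posterior predictive p(y*|y) when pi|y ~ Beta(a, b)."""
    return comb(n_star, y_star) * exp(log_beta(y_star + a, n_star - y_star + b)
                                      - log_beta(a, b))

a, b = 286, 216                    # Beta(286, 216) posterior from above
pmf = [beta_binom_pmf(k, 10, a, b) for k in range(11)]
mean_pred = sum(k * p for k, p in enumerate(pmf))
print(round(sum(pmf), 10))         # a proper pmf: sums to 1
print(round(mean_pred, 3))         # 5.697, i.e. 10 * E(pi|y)
```

Note the predicted mean is just n⋆ E(π|y), consistent with the n⋆ = 1 case worked out above.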
85 / 352 The Poisson Distribution
Since the values of a random variable following a Poisson distribution are counts, what are the possible values? The probability mass function of a Poisson random variable is
p(y|λ) = e^(−λ) λ^y / y!,  y = 0, 1, . . .
The count of the number of events occurring in τ time units also follows a Poisson distribution (but with parameter τλ). The conjugate prior distribution for the Poisson rate parameter is the gamma family.
86 / 352 Poisson Process
✑ We wish to model the occurrence of events in time as a Poisson process with rate λ. (The rate is the average number of events per unit time.) Suppose our prior probability density function for λ is proportional to
λ^(α−1) exp(−βλ)
This means that our prior distribution for λ is a gamma distribution. The probability density function of a G(α, β) distribution is
f(λ; α, β) = [β^α / Γ(α)] λ^(α−1) exp(−βλ),  λ ≥ 0
The parameters of the prior distribution are often referred to as hyperparameters. The gamma distribution is indexed by two parameters.
87 / 352 Poisson Process
We observe the process for τ time units. Events occur at times t1, t2, . . . , tn. Writing t0 = 0, the likelihood is
L = [∏_{i=1}^n λ e^(−λ(ti − ti−1))] e^(−λ(τ − tn))
The last term is the probability that no events occur in (tn, τ). We can simplify L to
L ∝ λ^n exp(−λτ)
This is effectively the probability of n events in (0, τ), i.e.
(λτ)^n exp(−λτ)/n! ∝ λ^n exp(−λτ)
88 / 352 Poisson Process
Hence the posterior p.d.f. is proportional to
λ^(α+n−1) e^(−(β+τ)λ)
This is clearly a gamma distribution, so the posterior p.d.f. is
[(β + τ)^(α+n) / Γ(α + n)] λ^(α+n−1) e^(−(β+τ)λ)
The posterior mean is (α + n)/(β + τ). The posterior variance is (α + n)/(β + τ)².
89 / 352 Poisson Process
Suppose at Shahabag bus stand the times of arrival of buses passing the PG hospital, from 12.40 till 13.00 on February 3, 2013, are given in the table.
12.40: 3, 6, 9, 15, 24, 28, 30
12.41: 7, 12, 14, 16, 21, 24, 30, 50
12.42: 9, 22, 28, 46, 53
12.43: 22, 25, 35, 38, 58
12.44: 2, 5, 8, 10, 14, 17, 27, 30, 45
12.45: 3, 46
12.46: 13, 42, 51
12.47: 0, 9, 11, 18, 23, 26, 39, 51
12.48: 35, 39, 55
12.49: 8, 10, 19, 20, 33, 45, 56, 58
12.50: 4, 30, 34, 47
12.51: 0, 4, 18, 21, 52, 57, 59
12.52: 4, 9, 29, 30, 31, 38, 59
12.53: 2, 6, 31, 53
12.54: 7, 9, 13, 24, 39, 47
12.55: 28, 46, 49, 59
12.56: 6, 9, 14, 35, 41, 46
12.57: 8, 10, 15, 22, 25, 49, 52, 59
12.58: 31, 34, 53, 55, 56
12.59: 2, 31, 34, 38, 54, 59
Table: Times of arrival of buses (minute: seconds past the minute)
90 / 352 Poisson Process
These can be converted to time intervals between vehicles. These intervals, in seconds, are given in the table (also in shabag.dat). The first value is the time till the first arrival.
3 19 3 13 4 21 7 3 3 13 10 12 13 3 3 19 3 6 3 44 13 4 5 2 6 18 15 4 4 25 21 1 9 7 18 16 14 22 6 6 4 29 43 13 3 14 5 29 2 3 27 2 31 2 22 3 37 10 29 9 5 4 2 4 5 3 9 1 2 11 5 16 2 20 9 13 5 15 7 5 2 4 9 12 5 8 3 5 3 2 11 20 41 24 3 3 7 2 1 18 3 6 2 5 6 1 3 7 20 4 3 26 7 10 32
Table: Inter arrival times of buses
91 / 352 Poisson Process
Figure: Histogram of the 115 inter-arrival times in the shabag data.
92 / 352 Poisson Process
✑ Suppose the prior expectation for the rate of arrival was, say, 1 bus every 5 seconds, i.e. λ0 = 0.2. Suppose we were almost certain that the rate would be less than 1 bus every 2 seconds. Let us say that Pr(λ < 0.5) = 0.99. Suppose we use the conjugate (gamma) prior. Then α/β = 0.2. We require
∫₀^0.5 f(λ) dλ = 0.99
where f(λ) is the p.d.f. of a G(α, α/0.2) distribution. We can evaluate this integral in R. First we set up a vector of values for α and a corresponding vector for β. Then we evaluate the integral for each pair of values.
93 / 352 Poisson Process
> alpha<-seq(1,10)
> beta<-alpha/0.2
> prob<-pgamma(0.5,alpha,beta)
> d<-data.frame(alpha,beta,prob)
> d
   alpha beta      prob
1      1    5 0.9179150
2      2   10 0.9595723
3      3   15 0.9797433
4      4   20 0.9896639
5      5   25 0.9946545
6      6   30 0.9972076
7      7   35 0.9985300
8      8   40 0.9992214
9      9   45 0.9995856
10    10   50 0.9997785
94 / 352 Summarizing the Posterior
We find that we get approximately what we require with α = 4, β = 20. Now τ = 1200 and n = 115. Thus the posterior p.d.f. is
(1220^119 / 118!) λ^118 e^(−1220λ)
The prior mean was 0.2. The posterior mean is 119/1220 = 0.0975. The prior variance was 0.01. The posterior variance is 119/1220² = 0.00008.
95 / 352 Summarizing the Posterior
At a guess the important part of the posterior distribution is likely to be within ±3 standard deviations of the posterior mean. That is roughly 0.07 to 0.13, so we create an array of λ values from 0.070 to 0.130 in steps of 0.001. We can check that this covers the important part of the distribution by looking at the distribution function.
> lambda<-seq(0.07,0.13,0.001)
> cdf<-pgamma(lambda,119,1220)
> pdf<-dgamma(lambda,119,1220)
> plot(lambda,cdf,type="l")
> plot(lambda,pdf,type="l")
96 / 352 Marginal Distributions
With conjugate families, the known forms of the prior and posterior densities can be used to find the marginal distribution, p(y), using the formula
p(y) = p(y|λ) p(λ) / p(λ|y)
For instance, the Poisson model for a single observation, y, has prior predictive distribution
p(y) = Poisson(y|λ) G(λ|α, β) / G(λ|α + y, β + 1)
97 / 352 Probability Intervals
Probability interval: A probability interval is an interval for an unknown quantity which contains a given amount of probability. Symmetric probability interval: A symmetric probability interval for an unknown quantity θ is an interval t1 < θ < t2 which has the property that Pr(θ < t1) = Pr(θ > t2).
For example, a 95% symmetric probability interval (t1, t2) for θ would have the property that Pr(θ < t1) = 0.025, Pr(t1 < θ < t2) = 0.95, Pr(θ > t2) = 0.025. In the shabag example above, the posterior distribution for λ is G(119, 1220). We can calculate a 95% symmetric posterior probability interval for λ as follows, using R.
> qgamma(0.025,119,1220)
[1] 0.08080462
> qgamma(0.975,119,1220)
[1] 0.1158291
Thus our interval is 0.0808 < λ < 0.1158.
98 / 352 Probability Intervals
Highest probability density (hpd) interval: An hpd or "highest probability density" interval for θ is a probability interval for θ with the property that no point outside the interval has a probability density greater than any point inside the interval. In other words, the interval captures the "most likely" values of the unknown. It is also easy to see that, if the pdf of θ is f(θ) and the 100(1 − α)% hpd interval is (t1, t2), then t1, t2 satisfy the following two properties:
∫_{t1}^{t2} f(θ) dθ = 1 − α  and  f(t1) = f(t2)
The first of these is true for any 100(1 − α)% probability interval. The two together make the interval an hpd interval. If the density f(θ) is unimodal and symmetric then the symmetric interval and the hpd interval coincide. Otherwise they do not.
99 / 352 Probability Intervals
Figure: Highest posterior density regions of varying probability content (50%, 75%, and 95% hpd regions, together with a 95% quantile-based interval, for a posterior density p(θ|y)).
100 / 352 Bayesian inference and prediction Lecture 7 Sharif Mahmood
101 / 352 Normal Model
Suppose that y = (Y1, Y2, . . . , Yn) is a random sample of size n from the normal distribution with mean µ ∈ R and variance σ² ∈ (0, ∞). Three conditions can be considered for the normal model.
1 µ is unknown, σ² is known.
2 µ is known, σ² is unknown.
3 µ is unknown, σ² is unknown.
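Returning to the hpd interval for a moment: for the shabag posterior G(119, 1220) it can be found numerically without tables. A sketch that searches for the shortest 95% interval over the left-tail probability (one of several possible approaches):

```r
# The hpd interval is the shortest interval with the desired content:
# minimize the width qgamma(p + 0.95, ...) - qgamma(p, ...) over the
# left-tail probability p in (0, 0.05).
width <- function(p) qgamma(p + 0.95, 119, 1220) - qgamma(p, 119, 1220)
p.opt <- optimize(width, c(0, 0.05))$minimum
hpd   <- qgamma(c(p.opt, p.opt + 0.95), 119, 1220)

hpd                                  # slightly shorter than the symmetric interval
qgamma(c(0.025, 0.975), 119, 1220)   # symmetric interval: about (0.0808, 0.1158)
```

Because the gamma posterior is mildly right-skewed, the hpd interval sits a little to the left of the symmetric one; for a symmetric density the two would coincide, as noted above.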
102 / 352 Normal Model
1 µ is unknown, σ² is known. With prior µ ∼ N(µ0, σ0²),
p(µ|y) ∝ ∏_{i=1}^n f(yi|µ, σ²) π(µ)
∝ exp[−(1/(2σ²)) Σ_{i=1}^n (yi − µ)²] exp[−(1/(2σ0²)) (µ − µ0)²]
∝ exp[−(1/(2σ²)) (Σyi² − 2µ Σyi + nµ²) − (1/(2σ0²)) (µ² − 2µµ0 + µ0²)]
∝ exp[−(1/(2σ²)) (−2µ Σyi + nµ²) − (1/(2σ0²)) (µ² − 2µµ0)]
∝ exp[−(1/2) {µ² (nσ0² + σ²)/(σ0²σ²) − 2µ (σ0² Σyi + µ0σ²)/(σ0²σ²)}]
∝ exp[−((nσ0² + σ²)/(2σ0²σ²)) {µ − (σ0² Σyi + µ0σ²)/(nσ0² + σ²)}²].
That is,
p(µ|y) ∼ N( (σ0² Σyi + µ0σ²)/(nσ0² + σ²), σ0²σ²/(nσ0² + σ²) ) = N(µ1, σ1²)
103 / 352 Normal Model
1 µ is unknown, σ² is known. The posterior mean is a weighted average of the prior mean and the observed data. The weights are inversely proportional to the corresponding variances. A large σ0² relative to σ² (vague prior information) implies a posterior mean µ1 close to the data value ȳ. A small σ0² relative to σ² (highly informative prior) implies a posterior mean µ1 close to the prior mean µ0.
104 / 352 Posterior Predictive Distribution
The posterior predictive distribution of a future observation, y⋆, can be calculated as
p(y⋆|y) = ∫ p(y⋆|µ) p(µ|y) dµ
This is particularly easy to compute if n⋆ = 1 [Hoff, 2009, p.
72], in which case:
E(y⋆|y) = E( E(y⋆|µ, y) | y ) = E(µ|y) = µ1
and
Var(y⋆|y) = E( Var(y⋆|µ, y) | y ) + Var( E(y⋆|µ, y) | y ) = E(σ²|y) + Var(µ|y) = σ² + σ1²
105 / 352 Normal Model
2 µ is known, σ² is unknown. Work with the precision parameter τ = 1/σ², with prior τ ∼ G(δ0/2, γ0/2). Then
p(τ|y) ∝ ∏_{i=1}^n f(yi|τ, µ) π(τ)
∝ τ^(n/2) exp[−(τ/2) Σ_{i=1}^n (yi − µ)²] × τ^(δ0/2 − 1) exp(−τγ0/2)
= τ^((n+δ0)/2 − 1) exp[−(τ/2) {Σ_{i=1}^n (yi − µ)² + γ0}]
so that
p(τ|y) ∼ G( (n + δ0)/2, {Σ_{i=1}^n (yi − µ)² + γ0}/2 )
106 / 352 Normal Model
3 µ and σ² are both unknown. Specify the joint prior p(µ, τ) = p(µ|τ) p(τ) with µ|τ ∼ N(µ0, τ⁻¹σ0²) and τ ∼ G(δ0/2, γ0/2). Then
p(µ, τ|y) ∝ τ^(n/2) exp[−(τ/2) Σ_{i=1}^n (yi − µ)²] × τ^(1/2) exp[−(τ/(2σ0²)) (µ − µ0)²] × τ^(δ0/2 − 1) exp(−τγ0/2)
= τ^((n+δ0+1)/2 − 1) exp[−(τ/2) {Σ(yi − µ)² + (µ − µ0)²/σ0² + γ0}]
Note that the joint posterior distribution does not have a recognizable form. Compute the normalizing constant by brute force.
107 / 352 Normal Model
Writing h(µ, τ) for the unnormalized posterior above, the normalizing constant is
m(y) = ∫₀^∞ ∫_{−∞}^∞ τ^((n+δ0+1)/2 − 1) exp[−(τ/2) {Σ(yi − µ)² + (µ − µ0)²/σ0² + γ0}] dµ dτ
= ∫₀^∞ τ^((n+δ0)/2 − 1) exp[−(τ/2) (γ0 + µ0²/σ0² + Σyi²)] τ^(1/2) ∫_{−∞}^∞ exp[−(τ/2) {µ² (n + 1/σ0²) − 2µ (Σyi + µ0/σ0²)}] dµ dτ
= (2π)^(1/2) (n + 1/σ0²)^(−1/2) Γ((n+δ0)/2) [ {γ0 + µ0²/σ0² + Σyi² − (Σyi + µ0/σ0²)²/(n + 1/σ0²)}/2 ]^(−(n+δ0)/2)
⇒ p(µ, τ|y) = τ^((n+δ0+1)/2 − 1) exp[−(τ/2) {Σ(yi − µ)² + (µ − µ0)²/σ0² + γ0}] / m(y).
108 / 352 A Hypothetical Example for Inference
Throughout this talk we will use a hypothetical example: We have a group of MS students at ISRT. Among other things, we'd like to know the mean height of a student (LGM). We managed to measure the height of all 10 LGM we know; the data are as follows (in centimeters): 122 122 116 134 113 114 113 110 123 130. No one knows the population mean on their planet. Interestingly, we know that a reliable estimate of the standard deviation (σ) of the complete LGM population is 8 cm.
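The posterior predictive identities above, E(y⋆|y) = µ1 and Var(y⋆|y) = σ² + σ1², can be verified by simulation. A sketch with illustrative values (µ1 = 118.5, σ1² = 6, σ² = 64 are our stand-in numbers, roughly matching the LGM example that follows):

```r
# Draw mu from its posterior, then y* given mu, and compare the
# simulated predictive moments with mu1 and sigma^2 + sigma1^2.
set.seed(2)
S     <- 200000
mu1   <- 118.5; s1sq <- 6; ssq <- 64   # illustrative values, not estimates
mu    <- rnorm(S, mu1, sqrt(s1sq))     # mu | y
ystar <- rnorm(S, mu, sqrt(ssq))       # y* | mu, y
mean(ystar)   # approximately mu1 = 118.5
var(ystar)    # approximately ssq + s1sq = 70
```

The extra σ1² in the predictive variance is the price of not knowing µ exactly, which is the whole point of averaging over the posterior rather than plugging in a point estimate.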
109 / 352 Inference for Unknown Mean: Confidence Intervals
The sample mean is ȳ = 119.7, which is our best estimate of the population mean. Using our estimate, we calculate a confidence interval
[ȳ − t × σ/√n, ȳ + t × σ/√n]
where t is the critical value of interest from the t-distribution. Here is how to calculate a 95% confidence interval in R:
lgm<-c(122, 122, 116, 134, 113, 114, 113, 110, 123, 130)
> mean(lgm) + qt(.025, df=9) * (8/sqrt(10))
[1] 113.977
> mean(lgm) + qt(.975, df=9) * (8/sqrt(10))
[1] 125.423
110 / 352 Another Look at the Confidence Intervals
The population mean is a value that is fixed and unknown. We calculate the 95% confidence interval using only the sample mean. If we had many similar samples from the same population and calculated the same interval for each, 95% of them would include the population mean. As a result, we say that we are 95% confident that the interval we calculated contains the population mean. Does it mean: 'with 0.95 probability the population mean is in the interval we calculated'?
111 / 352 Inferring Mean Height of LGM: the/a Bayesian Way
Population: N(µ, σ²). Prior: N(µ0, σ0²). Likelihood: N(ȳ, σ²/n). Posterior: N(µ1, σ1²).
We start with a Gaussian prior µ ∼ N(100, 10²). We observe the data. The likelihood is N(ȳ, σ²/n) = N(119.7, 6.4). The posterior is again normal, N(118.515, 6.015):
µ1 = (σ0² Σyi + µ0σ²)/(nσ0² + σ²) = (10² × 1197 + 100 × 8²)/(10 × 10² + 8²) = 118.515
σ1² = σ0²σ²/(nσ0² + σ²) = (10² × 8²)/(10 × 10² + 8²) = 6.015
112 / 352 Inferring Mean Height of LGM: Calculations
The 95% credible interval for our LGM example:
> qnorm(0.025, 118.515, sqrt(6.015))
[1] 113.708
> qnorm(0.975, 118.515, sqrt(6.015))
[1] 123.322
(our frequentist confidence interval was [113.977, 125.423]) Note that we can now say 'with .95 probability the population mean is in the range [113.708, 123.322]'.
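The conjugate update used above can be wrapped in a small helper; a sketch (the function name norm.update is ours, not from the lecture notes):

```r
# Posterior mean and variance for a normal mean with known sigma^2
# and a N(mu0, tau2) prior, as given on the slides.
norm.update <- function(y, sigma2, mu0, tau2) {
  n  <- length(y)
  m1 <- (tau2 * sum(y) + mu0 * sigma2) / (n * tau2 + sigma2)
  v1 <- tau2 * sigma2 / (n * tau2 + sigma2)
  c(mean = m1, var = v1)
}

lgm <- c(122, 122, 116, 134, 113, 114, 113, 110, 123, 130)
norm.update(lgm, sigma2 = 8^2, mu0 = 100, tau2 = 10^2)
# mean 118.515, var 6.015 -- matching the values above
```

Packaging the update this way makes it easy to experiment with how the posterior moves as the prior variance tau2 is made larger (vaguer) or smaller (more informative).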
113 / 352 Posterior Distribution using Discrete Approximations Lecture 8 Sharif Mahmood
114 / 352 One Dimensional Conjugate Example
✑ Suppose we are interested in the proportion of patients given a certain drug who suffer a particular side effect. Let this proportion be θ. Our prior distribution for θ is a Beta(1, 4) distribution. We observe a sample of n = 30 patients, of whom x = 5 suffer the side effect. Assume that we can regard this observation as a value from the conditional Binomial(30, θ) distribution of x given θ. Use R to plot a graph showing the prior density, the likelihood and the posterior density of θ as follows:
1 Set up a grid of θ values.
theta<-seq(0,1,0.01)
2 Calculate the prior density.
prior<-dbeta(theta,1,4)
115 / 352 One Dimensional Conjugate Example
3 The likelihood is proportional to θ^5 (1 − θ)^25 = θ^(6−1) (1 − θ)^(26−1). This is proportional to the density of a Beta(6, 26) distribution. Calculate this.
likelihood<-dbeta(theta,6,26)
4 The posterior distribution is Beta(1+5, 4+25), that is, Beta(6, 29). Calculate the posterior density.
posterior<-dbeta(theta,6,29)
5 Plot the graph.
plot(theta,posterior,type="l",xlab=expression(theta), ylab="Density")
lines(theta,prior,lty=2)
lines(theta,likelihood,lty=3)
116 / 352 One Dimensional Conjugate Example
Figure: One Dimensional Conjugate Example (posterior, prior, and likelihood densities of θ).
117 / 352 One Dimensional Numerical Example
✑ This is exactly like the previous example except that we use a different prior distribution. The prior density is now proportional to
g(θ) = 1.1 − 2θ for 0 < θ < 0.5, and g(θ) = 0.1 for 0.5 ≤ θ < 1
1 Set up a grid of θ values.
theta<-seq(0,1,0.01)
2 Calculate the prior density.
prior<-rep(0.1,101)
prior<-prior+(1-2*theta)*(theta<0.5)
3 Now, to make this a proper density we need to divide by the integral, which is 0.1 + (0.5 × 1)/2 = 0.35.
prior<-prior/0.35
118 / 352 One Dimensional Numerical Example
4 The likelihood is proportional to θ^5 (1 − θ)^25 = θ^(6−1) (1 − θ)^(26−1).
This is proportional to the density of a Beta(6, 26) distribution. Calculate this.
likelihood<-dbeta(theta,6,26)
5 The posterior density is proportional to the prior density times the likelihood.
posterior<-prior*likelihood
We have to normalise this by dividing by the integral.
posterior<-posterior/(sum(posterior*0.01))
6 Plot the graph.
plot(theta,posterior,type="l",xlab=expression(theta), ylab="Density")
lines(theta,prior,lty=2)
lines(theta,likelihood,lty=3)
119 / 352 One Dimensional Numerical Example
Figure: One Dimensional Numerical Example (posterior, prior, and likelihood densities of θ).
120 / 352 One Dimensional Numerical Example
✑ Consider the shabag data described in lecture 6. To illustrate the principles involved when we use a non-conjugate prior, we will try using a "triangular" prior distribution with the pdf
p(λ) = 4(1 − 2λ) for 0 < λ < 0.5, and p(λ) = 0 otherwise
The calculations are actually quite simple in this case, but we will do them using a general method. The likelihood is proportional to
λ^115 exp(−1200λ)
It is usually better to work in terms of logarithms, to avoid very large and very small numbers.
121 / 352 One Dimensional Numerical Example
The log likelihood, apart from an additive constant, is
115 ln(λ) − 1200λ
We will also take logarithms of the prior density. Note that we could not do this where the prior density is zero, but we really do not need to. In this example the logarithm of the prior density is, apart from an additive constant,
ln(1 − 2λ)
Having added the log prior to the log likelihood, we subtract the maximum of this so that the maximum becomes zero. We then take exponentials, and the maximum of this will be 1.0. Again, the reason for doing this is to avoid very large or very small numbers. We then normalise by finding the integral and dividing by this.
122 / 352 One Dimensional Numerical Example
R code
stepsize<-0.001
lambda<-seq(0.05,0.15,stepsize)
## Prior
prior<-4*(1-2*lambda)
logprior<-log(1-2*lambda)
## Likelihood
loglik<-115*log(lambda)-1200*lambda
## Posterior
logpos<-logprior+loglik
logpos<-logpos-max(logpos)
posterior<-exp(logpos)
posterior<-posterior/(sum(posterior)*stepsize)
123 / 352 One Dimensional Numerical Example
The R commands which I used to produce this plot are as follows.
plot(lambda,posterior,type="l",xlab=expression(lambda), ylab="Density")
like<-dgamma(lambda,116,1200)
lines(lambda,prior,lty=2)
lines(lambda,like,lty=3)
abline(h=0)
Having found the posterior density of λ, we can find the posterior mean, variance and standard deviation, using R as follows.
postmean<-sum(posterior*lambda)*stepsize
postmean
[1] 0.09646694
postvar<-sum(posterior*(lambda^2))*stepsize-postmean^2
postvar
[1] 8.01825e-05
sqrt(postvar)
[1] 0.008954468
124 / 352 One Dimensional Numerical Example
Figure: One Dimensional Numerical Example (posterior, prior, and likelihood densities of λ).
125 / 352 Two-Dimensional Numerical Example (Weibull)
✑ To illustrate the calculations in a two-parameter case analysed numerically, we will look at inference for the parameters of a Weibull distribution using the data in the table.
67 206 32 950 178 1 313 1166 131 1615 431 1391 165 340 463 306 630 1088 939 258 662 627 496 247 2285 33 573 313 1859 672 254 2093 437 57 506 858 28 815 132 50 187 492 436 813 637 344 482 17 254 246 545
1 Load the function into R. Type weibpost<- and then copy and paste the contents of weibpost.r. The file contains an R function to evaluate the posterior density in the case where α and β have independent gamma prior distributions.
126 / 352 Two-Dimensional Numerical Example (Weibull)
2 First we need to set up ranges of α and β values.
For example:
alphgrid<-seq(0.6,1.6,0.005)
betagrid<-seq(0.001,0.003,0.00001)
These numbers are more or less guesses at this stage. We may have to change them and try again.
3 We have
F(t) = 1 − exp(−[βt]^α) ⇒ 1 − F(t) = exp(−[βt]^α) ⇒ ln{−ln[1 − F(t)]} = α ln β + α ln t
If we plot ln{−ln[1 − F(t)]} against ln t, we should get a straight line with gradient α and intercept α ln β.
127 / 352 Two-Dimensional Numerical Example (Weibull)
4 We do not know F(t), but we can get a rough idea directly from the data. If the data arranged in increasing order are t(1), . . . , t(n), then a rough estimate of F(t(j)) is given by (j − 0.5)/n. We can therefore plot a graph in R as follows.
ltsort<-log(sort(t))
n<-length(t)
F<-((1:n)-0.5)/n
G<-log(-log(1-F))
plot(ltsort,G)
This gives us a line with approximate gradient 1.1 and approximate intercept −6.9. These suggest a value for α around 1.1 and a value for β around exp(−6.9/1.1) ≈ 0.002.
128 / 352 Two-Dimensional Numerical Example (Weibull)
5 Next we specify the parameters of our prior distribution. We give α a gamma(1, 1) distribution and β independently a gamma(3, 1000) distribution as follows.
prior<-c(1,1,3,1000)
6 Lines 10-13 in the function weibpost create two-dimensional arrays of α and β values so that every combination of values of α and β is represented by a position in the grid. The next line computes
ln{α^(aα − 1) e^(−bα α) β^(aβ − 1) e^(−bβ β)}
which, apart from a constant, is the logarithm of the joint prior density.
Line 18 then adds to this, for each observation,
ln{αβ(βti)^(α−1) exp[−(βti)^α]}
129 / 352 Two-Dimensional Numerical Example (Weibull)
7 We can now create a two-dimensional array containing values of the posterior density by typing
posterior<-weibpost(alphgrid,betagrid,t,prior)
8 Make a contour plot of the posterior density.
contour(alphgrid,betagrid,posterior, xlab=expression(alpha),ylab=expression(beta))
9 Plot the posterior density function as a surface.
post<-posterior/1000
persp(alphgrid,betagrid,post,xlab="Alpha",ylab="Beta", zlab="Density",phi=25,theta=45,ticktype="detailed")
130 / 352 Two-Dimensional Numerical Example (Weibull)
10 To calculate the marginal posterior density of α or β we can integrate along the rows or columns of the array containing the posterior density. Calculate and plot the marginal density of α as follows.
apost<-rowSums(posterior)*rstep
plot(alphgrid,apost,type="l",xlab=expression(alpha), ylab="Density")
Here rstep is the step size for β, i.e. 0.00001 in the original calculation of posterior.
11 Use the marginal density to calculate the posterior mean of α.
pmeana<-sum(alphgrid*apost)*astep
Here astep is the step size for α, i.e. 0.005 in the original calculation of posterior.
131 / 352 Two-Dimensional Numerical Example (Weibull)
12 Find a hpd region with a specified probability as follows. First sort the values of the posterior density into ascending order.
postvec<-sort(c(posterior))
13 Next cumulatively sum them.
postvecsum<-cumsum(postvec)*astep*rstep
14 Now determine the level of the posterior density which will determine the contour which we want. For example, if we want a 95% region then we want the integral over the region outside the hpd region to be 0.05.
crit<-max(postvec[postvecsum<0.05])
hpd<-ifelse((posterior>crit),1,0)
contour(alphgrid,betagrid,hpd,levels=c(0.5),xlab =expression(alpha),ylab=expression(beta),drawlabels=F)
132 / 352 Two-Dimensional Numerical Example (Probit)
✑ A clinical trial on the use of the drug sulfinpyrazone in patients who had suffered myocardial infarctions ("heart attacks"). We wanted to see whether the drug had an effect on the number dying. Patients in one group were given the drug while patients in another group were given a placebo, that is, an inactive substitute. The following table gives the number of all "analysable" deaths up to 24 months.
Group 1 (Sulfinpyrazone): Deaths 44, Total 560. Group 2 (Placebo): Deaths 62, Total 540.
We can represent this situation by saying that there are two groups, containing n1 and n2 patients, and two parameters, θ1, θ2, such that, given these parameters, the distribution of the number of deaths Xj in Group j is Binomial(nj, θj).
133 / 352 Two-Dimensional Numerical Example (Probit)
Now we could give θj a beta prior distribution, but it seems reasonable that our prior beliefs would be such that θ1 and θ2 would not be independent. We transform from the (0, 1) scale of θ1, θ2 to a (−∞, ∞) scale and then give the new parameters, η1, η2, a bivariate normal distribution. We can use a transformation where θj = F(ηj) and F(x) is the distribution function of a continuous distribution on (−∞, ∞), usually one which is symmetric about x = 0. One possibility is to use the standard normal distribution function Φ(x), so that θj = Φ(ηj). We write ηj = Φ^(−1)(θj), where this function, Φ^(−1)(x), the inverse of the standard normal distribution function, is sometimes called the probit function.
134 / 352 Two-Dimensional Numerical Example (Probit)
1 Load the function into R. Type probit1<- and then copy and paste the contents of probit1.r.
2 Enter the data.
n<-c(560,540)
x<-c(44,62)
3 Specify the prior.
prior<-c(-1.24,-1.24,0.42,0.21,0.7)
4 Set up ranges for θ1 and θ2.
theta1<-seq(0.01,0.99,0.01)
theta2<-theta1
5 Use the function to evaluate the posterior density.
posterior<-probit1(theta1,theta2,n,x,prior)
6 Make a contour plot of the posterior density.
contour(theta1,theta2,posterior$density)
135 / 352 Two-Dimensional Numerical Example (Probit)
7 We see that the posterior probability is concentrated in one region, so we adjust the ranges for θ1 and θ2 and try again.
theta1<-seq(0.05,0.15,0.001)
theta2<-seq(0.05,0.15,0.001)
post<-probit1(theta1,theta2,n,x,prior)
dens<-post$density
contour(theta1,theta2,dens,xlab=expression(theta[1]), ylab=expression(theta[2]))
8 We can add a line showing where θ1 = θ2.
abline(0,1,lty=2)
We can easily see that nearly all of the posterior probability lies in the region where θ2 > θ1.
9 The other element of the result of the function is the posterior probability that θ2 > θ1. Extract this.
posterior$prob
136 / 352 Two-Dimensional Numerical Example (Probit)
10 To make the plot of the prior and posterior densities of the log relative risk, load the function probit2 in the file probit2.r.
probit2<-
11 Set up the data and prior. (They are the same as before, so you should already have these.)
n<-c(560,540)
x<-c(44,62)
prior<-c(-1.24,-1.24,0.42,0.21,0.7)
12 Set up ranges for γ, the log relative risk, and θ2.
gamma<-seq(-1,1,0.02)
theta2<-seq(0.05,0.15,0.001)
13 Use the function.
probit2(gamma,theta2,n,x,prior)
137 / 352 Introduction to Multiparameter Models Lecture 9 Sharif Mahmood
138 / 352 Multiparameter Models
Real problems in statistics nearly always involve more than one unknown and unobservable quantity. However, usually only one, or a few, parameters or predictions are of substantive interest. The ultimate aim of Bayesian analysis is to obtain the marginal posterior distribution.
We first require the joint posterior distribution of all unknowns, and then integrate this over the unknowns ("nuisance parameters") that are not of immediate interest to obtain the marginal distribution. Example:
1 We may be primarily interested in the population mean height µ,
2 but of course we don't really know the value of the population variance σ². So in a realistic model, we must also treat σ² as an unknown parameter.
139 / 352 Multiparameter Models
In cases of this kind, the aim of Bayesian analysis is to obtain the posterior marginal distribution of the parameter(s) of interest, e.g. p(µ|y). The general approach is to estimate the joint posterior distribution of all unknown quantities in the model, and then integrate out the one(s) we are not interested in. Example: in the normal means example, we will find p(µ, σ²|y) and then
p(µ|y) = ∫ p(µ, σ²|y) dσ²
140 / 352 Normal Model with µ and σ² Both Unknown (I)
First consider the conventional noninformative prior
p(µ, σ²) ∝ 1/σ²
This arises by considering µ and σ² a priori independent and taking the product of the standard noninformative priors for each. A priori independence may be a reasonable assumption here; it says that if we knew something about one of the unknown parameters, that wouldn't give us information about the distribution of the other one. This is not quite a conjugate prior. We will see that the posterior distribution does not factor into an inverse gamma times an independent normal.
141 / 352 Joint Posterior Distribution
The joint posterior distribution with the conventional noninformative prior is
p(µ, σ²|y) = (σ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (yi − µ)²] × 1/σ²
= (σ²)^(−(n/2 + 1)) exp[−(1/(2σ²)) {Σ_{i=1}^n (yi − ȳ)² + n(ȳ − µ)²}]
= (σ²)^(−(n/2 + 1)) exp[−(1/(2σ²)) {(n − 1)s² + n(ȳ − µ)²}]
where s² is the sample variance of the yi's:
s² = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)²
ȳ and s² are the sufficient statistics for µ and σ².
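Under this noninformative prior, the joint posterior can be sampled directly: p(σ²|y) is a scaled inverse-χ² distribution, i.e. (n − 1)s²/σ² | y ∼ χ²_{n−1}, and p(µ|σ², y) = N(ȳ, σ²/n), as derived on the following slides. A sketch using the LGM heights from the earlier example:

```r
set.seed(4)
lgm  <- c(122, 122, 116, 134, 113, 114, 113, 110, 123, 130)
n    <- length(lgm); ybar <- mean(lgm); s2 <- var(lgm)

S      <- 50000
sigma2 <- (n - 1) * s2 / rchisq(S, n - 1)   # draws from p(sigma^2 | y)
mu     <- rnorm(S, ybar, sqrt(sigma2 / n))  # draws from p(mu | sigma^2, y)

mean(mu)                        # approximately ybar = 119.7
quantile(mu, c(0.025, 0.975))   # a 95% credible interval for mu
```

The mu draws are a Monte Carlo version of the marginal p(µ|y), with σ² integrated out by simulation rather than analytically.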
142 / 352 Steps to the marginal posterior distribution of µ
To determine p(σ²|y) we must average the joint distribution over µ, i.e.
p(σ²|y) ∝ ∫ (σ²)^(−(n/2 + 1)) exp[−(1/(2σ²)) {(n − 1)s² + n(ȳ − µ)²}] dµ.
Integrating this expression over µ requires evaluating the integral of the factor exp(−(n/(2σ²))(ȳ − µ)²), which is a simple normal integral; thus
p(σ²|y) ∝ (σ²)^(−(n/2 + 1)) √(2πσ²/n) exp[−(1/(2σ²)) (n − 1)s²]
∝ (σ²)^(−(n/2 + 1)) (σ²)^(1/2) exp[−(1/(2σ²)) (n − 1)s²]
∝ (σ²)^(−(n+1)/2) exp[−(1/(2σ²)) (n − 1)s²]
which is an inverse gamma distribution.
143 / 352 The conditional posterior distribution of µ given σ²
Use what we already know about the posterior mean of µ with known variance and a uniform prior on µ:
p(µ|σ², y) = N(ȳ, σ²/n)
We will use these identities from conditional probability:
p(µ|y) = ∫ p(µ, σ²|y) dσ² = ∫ p(µ|σ², y) p(σ²|y) dσ²
Again it can be shown by direct integration that
p(µ|y) ∝ [1 + n(µ − ȳ)²/((n − 1)s²)]^(−n/2)
which is a Student's t distribution. [Gelman et al., 2004, p. 75-77]
144 / 352 Normal Model with µ and σ² Both Unknown (II)
Second, consider normal data with a conjugate prior. A convenient parameterization is given by the following specification:
p(µ|σ²) ∼ N(µ0, σ²/n0), p(σ²) ∼ InvG(α/2, β/2)
which corresponds to the joint prior density
p(µ, σ²) ∝ (σ²)^(−(α/2 + 1)) exp(−β/(2σ²)) × (σ²)^(−1/2) exp(−(n0/(2σ²))(µ − µ0)²)
The joint posterior distribution p(µ, σ²|y) can be obtained by multiplying the prior by the normal likelihood, which yields
p(µ, σ²|y) ∝ (σ²)^(−(α/2 + 1)) (σ²)^(−1/2) exp(−(1/(2σ²))[β + n0(µ − µ0)²]) × (σ²)^(−n/2) exp(−(1/(2σ²))[(n − 1)s² + n(ȳ − µ)²]).
145 / 352 Normal Model with µ and σ² Both Unknown (III)
Third, consider prior distributions for µ and σ² specified independently:
p(µ) ∼ N(µ0, σ0²), p(σ²) ∼ InvG(α/2, β/2)
i.e. the prior distribution of µ does not depend on σ². The joint prior is the product of these two priors. This is called a "semi-conjugate" prior.
The joint posterior density function can be written as
p(µ, σ²|y) ∝ N(µ0, σ0²) × InvG(α/2, β/2) × ∏_{i=1}^n N(yi|µ, σ²)
In fact, the marginal posterior distributions p(σ²|y) and p(µ|y) have no simple conjugate form.
146 / 352 Normal Model with µ and σ² Both Unknown (III)
If p(y|µ, σ²) ∼ N(µ, σ²), we showed in Lecture 7 that p(µ|σ², y) ∼ N(µ1, σ1²) with
µ1 = (nȳσ0² + µ0σ²)/(nσ0² + σ²) and σ1² = σ0²σ²/(nσ0² + σ²)
In the conjugate case, we showed that p(σ²|y) was an inverse gamma distribution. A sample of {µ, σ²} from the joint posterior distribution can be obtained by sampling
1 a value σ²⁽ˢ⁾ from p(σ²|y), an inverse-gamma distribution, then
2 a value µ⁽ˢ⁾ from p(µ|σ²⁽ˢ⁾, y), a normal distribution.
147 / 352 Normal Model with µ and σ² Both Unknown (III)
We first suggested that the prior mean and standard deviation of µ should be µ0 = 1.9 and σ0 = .95, as this would put most of the prior mass on µ > 0. For σ², let's use prior parameters of α = 1 and β = 0.01.
Figure: Joint posterior distribution (surface over the mean and precision grids).
148 / 352 Normal Model with µ and σ² Both Unknown (III)
R code for the joint posterior distribution [Hoff, 2009, p. 91-92]
R code
mu0<-1.9 ; t20<-0.95^2 ; alpha<-1 ; beta<-.01
y<-c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)
G<-100 ; H<-100
mean.grid<-seq(1.505,2.00,length=G)
prec.grid<-seq(1.75,175,length=H)
post.grid<-matrix(nrow=G,ncol=H)
for(g in 1:G) { for(h in 1:H) {
post.grid[g,h]<-dnorm(mean.grid[g],mu0,sqrt(t20))*
dgamma(prec.grid[h],alpha/2,beta/2)*
prod(dnorm(y,mean.grid[g],1/sqrt(prec.grid[h])))
} }
post.grid<-post.grid/sum(post.grid)
149 / 352 The Weibull Model
The Weibull distribution is often used as a distribution for lifetimes. The density function of a Weibull distribution is
f(t) = αρ(ρt)^(α−1) exp[−(ρt)^α].
Suppose we have n observations t1, . . . , tn. The likelihood is
L(α, ρ) = α^n ρ^(nα) (∏_{i=1}^n ti^(α−1)) exp(−ρ^α Σ_{i=1}^n ti^α).
Suppose that our prior density for α and ρ is f^(0)(α, ρ). The posterior pdf is
f^(1)(α, ρ) ∝ h(α, ρ) = f^(0)(α, ρ) × L(α, ρ).
150 / 352 The Weibull Model
To complete the evaluation of the posterior pdf we find
C = ∫∫ h(α, ρ) dα dρ
numerically, and then f^(1)(α, ρ) = h(α, ρ)/C. Suppose, for example, that we give α and ρ independent gamma prior distributions so that
f^(0)(α, ρ) ∝ α^(aα − 1) exp(−bα α) × ρ^(aρ − 1) exp(−bρ ρ).
Then the posterior pdf is proportional to
h(α, ρ) = α^(n + aα − 1) ρ^(nα + aρ − 1) (∏_{i=1}^n ti^(α−1)) exp[−(bα α + bρ ρ + ρ^α Σ_{i=1}^n ti^α)].
151 / 352 Bayesian Computational Methods Lecture 10 Sharif Mahmood
152 / 352 Introduction
Bayesian analysis requires computation of expectations and quantiles of probability distributions that arise as posterior distributions. If conjugate priors are not used, as is mostly the case these days, posterior distributions will not be standard distributions and hence the required Bayesian quantities (i.e., posterior quantities of inferential interest) cannot be computed in closed form. Thus special techniques are needed for Bayesian computations.
153 / 352 Development Over Time
1763-1960: Conjugate priors.
1960's: Numerical quadrature (Newton-Cotes methods, Gaussian quadrature, etc.).
1970's: Expectation-Maximization (EM) algorithm (iterative mode finder).
1980's: Asymptotic methods.
1980's: Noniterative Monte Carlo methods (direct posterior sampling and indirect methods, e.g., importance sampling, rejection).
1990's: Markov chain Monte Carlo (MCMC; Gibbs sampling, Metropolis-Hastings algorithm, etc.).
154 / 352 Asymptotic Methods
Advantages:
1 Deterministic, noniterative algorithm.
2 Substitutes differentiation for integration.
3 Facilitates studies of Bayesian robustness.

Disadvantages:
1 Requires a well-parameterized, unimodal posterior.
2 θ must be of at most moderate dimension.
3 n must be large, but is beyond our control.

155 / 352 Noniterative Monte Carlo Methods

Direct sampling
Indirect methods
◮ Importance sampling
◮ Rejection sampling

156 / 352 Direct Sampling

Suppose θ ∼ p(θ) and γ = E[c(θ)] = ∫ c(θ)p(θ) dθ.

If θ_1, ..., θ_N are independent and identically distributed as p(θ), we have

γ̂ = (1/N) Σ_{j=1}^N c(θ_j)

γ̂ converges to E[c(θ)] with probability 1 by the SLLN as N → ∞, where N is the Monte Carlo sample size.

Var(γ̂) = Var[c(θ)]/N

SÊ(γ̂) = sqrt( (1/(N(N−1))) Σ_{j=1}^N [c(θ_j) − γ̂]² )

γ̂ ± 2 SÊ(γ̂) provides an approximate 95% confidence interval for the posterior mean γ.

157 / 352 Importance Sampling

Suppose we wish to approximate

E[h(θ)|y] = ∫ h(θ)f(y|θ)π(θ) dθ / ∫ f(y|θ)π(θ) dθ.

Suppose further we can roughly approximate the normalized likelihood times prior, c f(y|θ)π(θ), by some density g(θ) from which we can easily sample. Then, defining the weight function w(θ) = f(y|θ)π(θ)/g(θ),

E[h(θ)|y] = ∫ h(θ)w(θ)g(θ) dθ / ∫ w(θ)g(θ) dθ ≈ [ (1/N) Σ_{j=1}^N h(θ_j)w(θ_j) ] / [ (1/N) Σ_{j=1}^N w(θ_j) ]

where θ_j iid ∼ g(θ). Here, g(θ) is called the importance function; a good match to c f(y|θ)π(θ) will produce roughly equal weights.

158 / 352 Rejection Sampling

Instead of approximating the normalized posterior

p(θ|y) = p(y|θ)π(θ) / ∫ p(y|θ)π(θ) dθ

we try to "blanket" it. Suppose there exists an identifiable constant M > 0 and a smooth density g(θ), called the envelope function, such that f(y|θ)π(θ) < M g(θ).

The rejection method proceeds as follows:
(i) Generate θ_j ∼ g(θ)
(ii) Generate U ∼ Unif(0, 1)
(iii) If M U g(θ_j) < f(y|θ_j)π(θ_j), accept θ_j. Otherwise reject θ_j.
(iv) Return to step (i) and repeat until the desired sample {θ_j, j = 1, 2, ..., N} is obtained.

The final sample consists of random draws from p(θ|y).
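As a concrete illustration of the importance-sampling estimator above, here is a minimal sketch. It is written in Python (the worked examples later in these slides use R), and the Beta(3,3)-shaped unnormalized target and the Unif(0,1) importance function are illustrative choices of mine, not from the slides. The ratio of weighted sums estimates E[h(θ)|y] with h(θ) = θ.

```python
import random

random.seed(1)

def unnorm_post(theta):
    # stands in for f(y|theta) * pi(theta); a Beta(3,3) shape with the
    # normalizing constant deliberately dropped
    return theta**2 * (1 - theta)**2

N = 100_000
thetas = [random.random() for _ in range(N)]   # theta_j ~ g = Unif(0,1)
w = [unnorm_post(t) / 1.0 for t in thetas]     # w(theta) = f*pi / g, with g(theta) = 1
post_mean = sum(t * wi for t, wi in zip(thetas, w)) / sum(w)
# post_mean approximates the exact posterior mean 1/2 of the Beta(3,3) target
```

Because the weights appear in both the numerator and the denominator, the unknown normalizing constant cancels, which is exactly why importance sampling only needs the unnormalized posterior.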
159 / 352 Rejection Sampling

Figure: Unstandardized posterior distribution and proper rejection envelope. Consider the θ_j samples in the histogram bar centered at a: the rejection step "slices off" the top portion of the bar. Repeat for all a: the accepted θ_j's mimic the lower curve!

Need to choose M as small as possible (for efficiency), and watch for "envelope violations"!

160 / 352 An example

✑ Suppose our target density (the one we want to sample from) is the triangle density:

f(y) = 8y           if 0 ≤ y < 0.25
       8/3 − 8y/3   if 0.25 ≤ y ≤ 1
       0            otherwise

Target density f(y) (the triangle density):

f.y <- function(y) {
  if (y >= 0 && y < 0.25) 8 * y
  else if (y >= 0.25 && y <= 1) 8/3 - 8 * y/3
  else 0
}

For our candidate density g(y), let's use the Unif(0,1) density:

g.y <- function(y) {
  if (y >= 0 && y <= 1) 1
  else 0
}

161 / 352 An example

Let's set M = 3 because I know from guess and check that f(y) is never greater than M g(y), which is 3 for all y ∈ [0, 1].

y <- seq(0, 1, by = 0.01)   # grid of y values
M <- 3
c1 <- unlist(lapply(y, g.y)) * M
c2 <- unlist(lapply(y, f.y))
plot(y, c1, type = "l", xlab = "y", ylab = "Density", ylim = c(0, 5))
lines(y, c2, lty = 2)
legend("topright", c("f(y)", "Mg(y)"), lty = c(2, 1), cex = 1.3)

[Figure: f(y) and the envelope Mg(y) over y ∈ [0, 1].]

162 / 352 An example: Rejection Sampling

Let's do rejection sampling!

R code
m <- 10000
n.draws <- 0
draws <- c()
y.grid <- seq(0, 1, by = 0.01)
while (n.draws < m) {
  y.c <- runif(1, 0, 1)
  accept.prob <- f.y(y.c) / (M * g.y(y.c))
  u <- runif(1, 0, 1)
  if (accept.prob >= u) {
    draws <- c(draws, y.c)
    n.draws <- n.draws + 1
  }
}
plot(density(draws), xlab = "y", ylab = "Density", ylim = c(0, 3), main = "")
lines(y, c2, lty = 2)
legend("topright", c("Rejection Sampling Draws", "Target Density"),
       lty = c(1, 2), cex = 1.3)

163 / 352 An example

We can compare the density of our draws to the target density f(y).

Figure: Triangular density function: density of the rejection sampling draws plotted against the target density.

164 / 352 Limitations of IS and RS

Rejection sampling is a general method for simulating from an arbitrary posterior distribution, but it can be difficult to set up since it requires the construction of a suitable proposal density.

Importance sampling is also a general-purpose algorithm, but it too requires proposal densities that may be difficult to find for high-dimensional problems.

So we require Markov chain Monte Carlo (MCMC) algorithms to handle such situations.

165 / 352 Bayesian Computational Methods Lecture 11 Sharif Mahmood

166 / 352 Markov Chain Monte Carlo (MCMC)

Markov Chain: a stochastic process in which future states are independent of past states given the present state
Monte Carlo: simulation

Goals
1 to make inference about model parameters
2 to make predictions

Requires
1 integration over a possibly high-dimensional integrand
2 we may not know the integrating constant

We can take quantities of interest of a distribution from simulated draws from the distribution.

167 / 352 Monte Carlo Integration

✑ Suppose we have a distribution p(θ) (perhaps a posterior) that we want to take quantities of interest from.

⇒ To derive it analytically, we need to take integrals:

I = ∫_Θ g(θ)p(θ) dθ

where g(θ) is some function of θ (e.g. g(θ) = θ for the mean and g(θ) = [θ − E(θ)]² for the variance).
We can approximate the integrals via Monte Carlo Integration by simulating M values from p(θ) and calculating

Î_M = (1/M) Σ_{i=1}^M g(θ^(i))

168 / 352 Monte Carlo Integration

For example, we can compute the expected value of the Beta(3,3) distribution analytically:

E(θ) = ∫_Θ θ p(θ) dθ = ∫_0^1 θ [Γ(6)/(Γ(3)Γ(3))] θ² (1 − θ)² dθ = 1/2

or via Monte Carlo methods:

> M <- 10000
> beta.sims <- rbeta(M, 3, 3)
> sum(beta.sims)/M
[1] 0.5013

Our Monte Carlo approximation Î_M is a simulation-consistent estimator of the true value I: Î_M → I as M → ∞. We know this to be true from the Strong Law of Large Numbers.

169 / 352 Strong Law of Large Numbers (SLLN)

Let Y_1, Y_2, ... be a sequence of independent and identically distributed random variables, each having a finite mean µ = E(Y_i). Then with probability 1,

(Y_1 + Y_2 + ... + Y_M)/M → µ as M → ∞

In our previous example, each simulation draw was independent and distributed from the same Beta(3,3) distribution. This also works with variances and other quantities of interest, since functions of i.i.d. random variables are also i.i.d. random variables. But what if we can't generate draws that are independent?

170 / 352 Monte Carlo Integration on the Markov Chain

Once we have a Markov chain that has converged to the stationary distribution, the draws in our chain appear to be like draws from p(θ|y), so it seems like we should be able to use Monte Carlo Integration methods to find quantities of interest.

One problem: our draws are not independent, which we required for Monte Carlo Integration to work (remember the SLLN). Luckily, we have the Ergodic Theorem.

171 / 352 Ergodic Theorem

Let θ^(1), θ^(2), ..., θ^(M) be M values from a Markov chain that is aperiodic, irreducible, and positive recurrent (then the chain is ergodic), and suppose E[g(θ)] < ∞. Then with probability 1,

(1/M) Σ_{i=1}^M g(θ^(i)) → ∫ g(θ)π(θ) dθ as M → ∞,

where π is the stationary distribution.
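To see the ergodic average at work on something tangible, here is a small sketch (Python; the 3-state transition matrix is invented for illustration and is not from the slides). The chain is aperiodic, irreducible, and positive recurrent, so the long-run average of g(θ) = θ should approach the stationary expectation despite the dependence between successive draws.

```python
import random

random.seed(0)

# Hypothetical transition matrix on states {0, 1, 2}; row s gives P(next | s).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def step(s):
    # sample the next state from row s of P by inverting the CDF
    u, cum = random.random(), 0.0
    for t, p in enumerate(P[s]):
        cum += p
        if u < cum:
            return t
    return len(P) - 1

# The stationary distribution pi solves pi = pi P; find it by repeated multiplication.
pi = [1 / 3, 1 / 3, 1 / 3]
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

M, s, total = 200_000, 0, 0
for _ in range(M):
    s = step(s)
    total += s                       # g(theta) = theta

ergodic_avg = total / M              # (1/M) * sum of g(theta_i)
stationary_mean = sum(j * pi[j] for j in range(3))
# ergodic_avg and stationary_mean agree even though the draws are dependent
```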
This is the Markov chain analog to the SLLN, and it allows us to ignore the dependence between draws of the Markov chain when we calculate quantities of interest from the draws. But what does it mean for a chain to be aperiodic, irreducible, and positive recurrent, and therefore ergodic?

172 / 352 Aperiodicity

A Markov chain is aperiodic if the only length of time for which the chain repeats some cycle of values is the trivial case with cycle length equal to one. Let A, B, and C denote the states (analogous to the possible values of θ) in a 3-state Markov chain. The following chain is periodic with period 3, where the period is the number of steps that it takes to return to a certain state. As long as the chain is not repeating an identical cycle, the chain is aperiodic.

173 / 352 Irreducibility

A Markov chain is irreducible if it is possible to go from any state to any other state (not necessarily in one step). The following chain is reducible, or not irreducible. The chain is not irreducible because we cannot get to A from B or C regardless of the number of steps we take.

174 / 352 Positive Recurrence

A Markov chain is recurrent if, for any given state i, when the chain starts at i it will eventually return to i with probability 1. A Markov chain is positive recurrent if the expected return time to state i is finite; otherwise it is null recurrent.

So if our Markov chain is aperiodic, irreducible, and positive recurrent (all the ones we use in Bayesian statistics usually are), then it is ergodic, and the ergodic theorem allows us to do Monte Carlo Integration by calculating quantities of interest from our draws, ignoring the dependence between draws.

175 / 352 Burn-in

Since convergence usually occurs regardless of our starting point, we can usually pick any feasible starting point (e.g., picking starting draws that are in the parameter space). However, the time it takes for the chain to converge varies depending on the starting point.
As a matter of practice, most people throw out a certain number of the first draws, known as the burn-in.

Pros: This makes our draws closer to the stationary distribution and less dependent on the starting point.
Cons: However, it is unclear how much we should burn in, since our draws are all slightly dependent and we don't know exactly when convergence occurs.

176 / 352 Thinning the Chain

In order to break the dependence between draws in the Markov chain, some have suggested only keeping every dth draw of the chain. This is known as thinning.

Pros: Perhaps gets you a little closer to i.i.d. draws. Saves memory since you only store a fraction of the draws.
Cons: Unnecessary given the ergodic theorem. Shown to increase the variance of your Monte Carlo estimates.

177 / 352 So Really, What is MCMC?

MCMC is a class of methods in which we can simulate draws that are slightly dependent and are approximately from a (posterior) distribution. We then take those draws and calculate quantities of interest for the (posterior) distribution. In Bayesian statistics, there are generally two MCMC algorithms that we use: the Gibbs Sampler and the Metropolis-Hastings algorithm.

178 / 352 Markov chain Monte Carlo methods

Iterative MC methods are useful when it is difficult or impossible to find a feasible importance or envelope density.

Algorithms:
◮ Gibbs sampler
◮ Metropolis-Hastings algorithm
◮ Slice sampler

Performance evaluation:
◮ Convergence monitoring and diagnostics
◮ Variance estimation

179 / 352 Bayesian Computational Methods Lecture 12 Sharif Mahmood

180 / 352 Gibbs Sampler

Suppose there are k parameters θ = (θ_1, ..., θ_k)' in our model. Assume that the full (or complete) conditional distributions {p(θ_i | θ_{j≠i}, y), i = 1, ..., k} are available for sampling. Direct sampling (if the full conditionals have known forms) or indirect sampling may be used.
Under mild conditions the collection of full conditional distributions uniquely determines the joint posterior p(θ|y) and all marginals p(θ_i|y), i = 1, ..., k.

181 / 352 Gibbs Sampling Algorithm

Suppose a set of arbitrary starting values {θ_2^(0), ..., θ_k^(0)} is given. The algorithm proceeds as follows:

(i) Draw θ_1^(1) from p(θ_1 | θ_2^(0), ..., θ_k^(0), y)
(ii) Draw θ_2^(1) from p(θ_2 | θ_1^(1), θ_3^(0), ..., θ_k^(0), y)
...
(k) Draw θ_k^(1) from p(θ_k | θ_1^(1), θ_2^(1), ..., θ_{k−1}^(1), y)

We have θ_1^(1), ..., θ_k^(1) after one iteration. After t such iterations we have θ_1^(t), ..., θ_k^(t), which converges in distribution to a draw from the true joint posterior p(θ_1, ..., θ_k | y).

182 / 352 Gibbs Sampling Algorithm

For T sufficiently large (say, bigger than t_0), {θ^(t)}_{t=t_0+1}^T is a (correlated) sample from the true posterior. We might use a sample mean to estimate the posterior mean:

E(θ_i|y) ≈ (1/(T − t_0)) Σ_{t=t_0+1}^T θ_i^(t)

The time from t = 0 to t = t_0 is commonly known as the burn-in period. We may also run m parallel Gibbs sampling chains and obtain

E(θ_i|y) ≈ (1/(m(T − t_0))) Σ_{j=1}^m Σ_{t=t_0+1}^T θ_i^(j,t)

where the index j indicates the chain number.

183 / 352 Example (I): Gibbs Sampling Algorithm

✑ Suppose we have data on the number of failures (y_i) for each of 10 pumps in a nuclear plant [Carlin and Louis, 2008, p.127]. We also have the times (t_i) for which each pump was observed.

i:  1   2   3   4    5  6   7  8  9  10
y:  5   1   5   14   3  19  1  1  4  22
t:  94  16  63  126  5  31  1  1  2  10

Table: Pump failure data

We want to model the number of failures with a Poisson likelihood, where the expected number of failures λ_i differs for each pump. Since the time for which we observed each pump is different, we need to scale each λ_i by its observed time t_i. Our likelihood is ∏_{i=1}^{10} Poisson(λ_i t_i).

184 / 352 Example (I): Gibbs Sampling Algorithm

Let's put a G(α, β) prior on each λ_i with α = 1.8, so the λ_i's are drawn from the same distribution.
Also, let's put a G(γ, δ) prior on β, with γ = 0.01 and δ = 1. So our model has 11 unknown parameters (10 λ_i's and β). Our posterior is

p(λ, β | y, t) ∝ [∏_{i=1}^{10} Poisson(λ_i t_i) × G(λ_i; α, β)] × G(β; γ, δ)
= ∏_{i=1}^{10} [ (e^{−λ_i t_i} (λ_i t_i)^{y_i} / y_i!) × (β^α/Γ(α)) λ_i^{α−1} e^{−βλ_i} ] × (δ^γ/Γ(γ)) β^{γ−1} e^{−δβ}
∝ ∏_{i=1}^{10} [ e^{−λ_i t_i} (λ_i t_i)^{y_i} × β^α λ_i^{α−1} e^{−βλ_i} ] × β^{γ−1} e^{−δβ}
= ∏_{i=1}^{10} [ λ_i^{y_i+α−1} e^{−(t_i+β)λ_i} ] × β^{10α+γ−1} e^{−δβ}

185 / 352 Example (I): Gibbs Sampling Algorithm

Finding the full conditionals:

p(λ_i | λ_{−i}, β, y, t) ∝ λ_i^{y_i+α−1} e^{−(t_i+β)λ_i}
p(β | λ, y, t) ∝ β^{10α+γ−1} e^{−β(δ + Σ_{i=1}^{10} λ_i)}

p(λ_i | λ_{−i}, β, y, t) is a G(y_i + α, t_i + β) distribution.
p(β | λ, y, t) is a G(10α + γ, δ + Σ_{i=1}^{10} λ_i) distribution.

186 / 352 Coding the Gibbs Sampler

1 Define starting values for β (we only need to define β here because we will draw λ first, and it only depends on β and other given values).

beta.cur <- 1

2 Draw λ^(1) from its full conditional (we're drawing all the λ_i's as a block because they all depend only on β and not on each other).

lambda.update <- function(alpha, beta, y, t) {
  rgamma(length(y), y + alpha, t + beta)
}

3 Draw β^(1) from its full conditional, using λ^(1).

beta.update <- function(alpha, gamma, delta, lambda, y) {
  rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
}

187 / 352 Coding the Gibbs Sampler

4 Repeat using the most updated values until we get M draws.
5 Optional burn-in and thinning.
6 Make it into a function.

188 / 352 Coding the Gibbs Sampler

R code
gibbs <- function(n.sims, beta.start, alpha, gamma, delta, y, t,
                  burnin = 0, thin = 1) {
  beta.draws <- c()
  lambda.draws <- matrix(NA, nrow = n.sims, ncol = length(y))
  beta.cur <- beta.start
  lambda.update <- function(alpha, beta, y, t) {
    rgamma(length(y), y + alpha, t + beta)
  }
  beta.update <- function(alpha, gamma, delta, lambda, y) {
    rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
  }
  for (i in 1:n.sims) {
    lambda.cur <- lambda.update(alpha = alpha, beta = beta.cur, y = y, t = t)
    beta.cur <- beta.update(alpha = alpha, gamma = gamma, delta = delta,
                            lambda = lambda.cur, y = y)
    if (i > burnin & (i - burnin) %% thin == 0) {
      lambda.draws[(i - burnin) / thin, ] <- lambda.cur
      beta.draws[(i - burnin) / thin] <- beta.cur
    }
  }
  return(list(lambda.draws = lambda.draws, beta.draws = beta.draws))
}

189 / 352 Coding the Gibbs Sampler

7 Do Monte Carlo Integration on the resulting Markov chain, whose draws are approximately samples from the posterior.

R code
y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
rbind(y, t)
posterior <- gibbs(n.sims = 10000, beta.start = 1, alpha = 1.8,
                   gamma = 0.01, delta = 1, y = y, t = t)
colMeans(posterior$lambda.draws)
mean(posterior$beta.draws)
apply(posterior$lambda.draws, 2, sd)
sd(posterior$beta.draws)

190 / 352 Example (II): Gibbs Sampling Algorithm

✑ Suppose y_1, ..., y_n are i.i.d. N(µ, σ²), τ = 1/σ², and π(µ, τ) ∝ τ^{−1}. We want to devise the Gibbs sampler to sample from the joint posterior density of (µ, σ²).
Sol

p(µ, τ | y) ∝ τ^{n/2} exp[−(τ/2) Σ (y_i − µ)²] × τ^{−1} = τ^{n/2−1} exp[−(τ/2) Σ (y_i − µ)²]

As a function of µ,

p(µ | y, τ) ∝ exp[−(τ/2)(Σ y_i² − 2nȳµ + nµ²)]
           ∝ exp[−(nτ/2)(µ² − 2ȳµ)]
           ∝ exp[−(nτ/2)(µ − ȳ)²]

so µ | y, τ ∼ N(ȳ, 1/(nτ)) and τ | y, µ ∼ Γ(n/2, Σ (y_i − µ)²/2).

192 / 352 Example (II): Gibbs Sampling Algorithm (contd)

To obtain joint posterior samples from [µ, τ | y] we
1 Start with arbitrary (µ^(0), τ^(0))
2 Sample µ^(1) from [µ | y, τ^(0)]
3 Sample τ^(1) from [τ | y, µ^(1)]
4 Repeat until we have t iterations, yielding (µ^(1), τ^(1)), ..., (µ^(t), τ^(t))
5 The t iterations produce t Gibbs samples

193 / 352 Features of Gibbs Sampling

The Gibbs sampler is easy to understand and implement, but requires the ability to readily sample from each of the full conditional distributions, p(θ_i | θ_{l≠i}, y). Unfortunately, when the prior π(θ) and the likelihood f(y|θ) are not a conjugate pair, one or more of these full conditionals may not be available in closed form. Even in this setting, however, p(θ_i | θ_{l≠i}, y) will be available up to a proportionality constant, as it is proportional to the portion of f(y|θ) that involves θ_i. The Metropolis-Hastings algorithm gives a solution to this problem.

194 / 352 Bayesian Computational Methods Lecture 13 Sharif Mahmood

195 / 352 Intuition

1 The Metropolis-Hastings algorithm can draw samples from any probability distribution p(y), provided you can compute the value of a function f(y) which is proportional to the density of p.
2 The Metropolis-Hastings algorithm works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution, p(y).
3 These sample values are produced iteratively, with the distribution of the next sample being dependent only on the current sample value (thus making the sequence of samples into a Markov chain).
4 Specifically, at each iteration the algorithm picks a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted or rejected.

196 / 352 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm follows these steps:
(i) Choose a starting value θ^(0).
(ii) At iteration t, draw a candidate θ* from a jumping distribution J_t(θ* | θ^(t−1)).
(iii) Compute an acceptance ratio (probability):

r = [ p(θ*|y) / J_t(θ* | θ^(t−1)) ] / [ p(θ^(t−1)|y) / J_t(θ^(t−1) | θ*) ]

(iv) Accept θ* as θ^(t) with probability min(r, 1). If θ* is not accepted, then θ^(t) = θ^(t−1).
(v) Repeat steps (ii)-(iv) M times to get M draws from p(θ|y), with optional burn-in and/or thinning.

197 / 352 Step 1: Choose a starting value θ^(0).

This is equivalent to drawing from our initial stationary distribution. The important thing to remember is that θ^(0) must have positive probability:

p(θ^(0) | y) > 0

Otherwise, we are starting with a value that cannot be drawn.

198 / 352 Step 2: Draw θ* from J_t(θ* | θ^(t−1)).

The jumping distribution J_t(θ* | θ^(t−1)) determines where we move to in the next iteration of the Markov chain. The support of the jumping distribution must contain the support of the posterior. The original Metropolis algorithm required that J_t(θ* | θ^(t−1)) be a symmetric distribution (such as the normal distribution), that is,

J_t(θ* | θ^(t−1)) = J_t(θ^(t−1) | θ*)

The Metropolis-Hastings algorithm shows that symmetry is unnecessary. If we have a symmetric jumping distribution that is dependent on θ^(t−1), then we have what is known as random walk Metropolis sampling.

199 / 352 Step 2: Draw θ* from J_t(θ* | θ^(t−1)).

If our jumping distribution does not depend on θ^(t−1),

J_t(θ* | θ^(t−1)) = J_t(θ*)

then we have what is known as independent Metropolis-Hastings sampling. Basically, all our candidate draws θ* are drawn from the same distribution, regardless of where the previous draw was.
This can be extremely efficient or extremely inefficient, depending on how close the jumping distribution is to the posterior. Generally speaking, the chain will behave well only if the jumping distribution has heavier tails than the posterior.

200 / 352 Step 3: Compute acceptance ratio r.

r = [ p(θ*|y) / J_t(θ* | θ^(t−1)) ] / [ p(θ^(t−1)|y) / J_t(θ^(t−1) | θ*) ]

In the case where our jumping distribution is symmetric,

r = p(θ*|y) / p(θ^(t−1)|y)

If our candidate draw has higher probability than our current draw, then our candidate is better, so we definitely accept it. Otherwise, our candidate is accepted according to the ratio of the probabilities of the candidate and current draws. Note that since r is a ratio, we only need p(θ|y) up to a constant of proportionality, since p(y) cancels out in both the numerator and denominator.

201 / 352 Step 3: Compute acceptance ratio r.

In the case where our jumping distribution is not symmetric,

r = [ p(θ*|y) / J_t(θ* | θ^(t−1)) ] / [ p(θ^(t−1)|y) / J_t(θ^(t−1) | θ*) ]

We need to weight our evaluations of the draws at the posterior densities by how likely we are to draw each draw. For example, if we are very likely to jump to some θ*, then J_t(θ* | θ^(t−1)) is likely to be high, so we should accept fewer of them than some other θ* that we are less likely to jump to. In the case of independent Metropolis-Hastings sampling,

r = [ p(θ*|y) / J_t(θ*) ] / [ p(θ^(t−1)|y) / J_t(θ^(t−1)) ]

202 / 352 Step 4: Decide whether to accept θ*.

Accept θ* as θ^(t) with probability min(r, 1). If θ* is not accepted, then θ^(t) = θ^(t−1).

1 For each θ*, draw a value u from the Uniform(0,1) distribution.
2 If u ≤ r, accept θ* as θ^(t). Otherwise, use θ^(t−1) as θ^(t).

Candidate draws with higher density than the current draw are always accepted. Unlike in rejection sampling, each iteration always produces a draw, either θ* or θ^(t−1).
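Steps (ii)-(iv) can be collected into a single update function. The sketch below is in Python (the worked example a few slides ahead is in R); the standard normal target and the jump standard deviation are illustrative assumptions of mine, not from the slides. With a symmetric jump, r reduces to the ratio of unnormalized target densities, and working on the log scale avoids numerical underflow.

```python
import math
import random

random.seed(42)

def log_target(theta):
    # log of an unnormalized N(0,1) density; the constant cancels in r
    return -0.5 * theta**2

def mh_update(theta_cur, jump_sd=1.0):
    theta_star = random.gauss(theta_cur, jump_sd)           # draw from J_t
    log_r = log_target(theta_star) - log_target(theta_cur)  # symmetric J: r is a ratio of targets
    if math.log(random.random()) <= log_r:                  # accept with prob min(r, 1)
        return theta_star, True
    return theta_cur, False                                 # rejected: current draw repeats

theta, accepts, draws = 0.0, 0, []
for _ in range(50_000):
    theta, accepted = mh_update(theta)
    accepts += accepted
    draws.append(theta)

accept_rate = accepts / len(draws)
```

The comparison `math.log(u) <= log_r` is equivalent to `u <= r` but is numerically safer far out in the tails, where the density ratio can underflow to zero.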
203 / 352 Acceptance Rates

It is important to monitor the acceptance rate (the fraction of candidate draws that are accepted) of your Metropolis-Hastings algorithm. If your acceptance rate is too high, the chain is probably not mixing well (not moving around the parameter space quickly enough). If your acceptance rate is too low, your algorithm is too inefficient (rejecting too many candidate draws). What is too high and too low depends on your specific algorithm, but generally:
• random walk: somewhere between 0.25 and 0.50 is recommended
• independent: something close to 1 is preferred

204 / 352 A Simple Example

Using a random walk Metropolis algorithm to sample from a G(1.7, 4.4) distribution, with a normal jumping distribution with standard deviation 2.

R code
mh.gamma <- function(n.sims, start, burnin, cand.sd, shape, rate) {
  theta.cur <- start
  draws <- c()
  theta.update <- function(theta.cur, shape, rate) {
    theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd)
    accept.prob <- dgamma(theta.can, shape = shape, rate = rate) /
      dgamma(theta.cur, shape = shape, rate = rate)
    if (runif(1) <= accept.prob) theta.can
    else theta.cur
  }
  for (i in 1:n.sims) {
    draws[i] <- theta.cur <- theta.update(theta.cur, shape = shape,
                                          rate = rate)
  }
  return(draws[(burnin + 1):n.sims])
}
mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2,
                     shape = 1.7, rate = 4.4)

205 / 352 How Gibbs and MH sampling differ

The MH algorithm does not require that all parameters in a problem be updated simultaneously; updating parameters one-at-a-time in an MH algorithm makes MH sampling seem like Gibbs sampling.
However, the MH algorithm differs from the Gibbs sampler in two important ways. First, the candidate parameter is not automatically accepted, because it comes from a proposal density and not from the appropriate conditional distribution. Second, the proposal density is not an enveloping function as in the rejection sampling routine. In short, all Gibbs sampling is MH sampling, but not all MH sampling is Gibbs sampling.

206 / 352 Relationship between Gibbs and MH sampling

First, the Gibbs sampler can be viewed as a special case of the MH algorithm in which the proposal densities for each parameter/random variable are the full conditionals. In Gibbs sampling, every candidate is selected; there is no rejection of candidates. The reason is that the ratio r is always 1. A second link between Gibbs sampling and MH sampling is that they can be combined to sample from multivariate densities.

207 / 352 Features of Markov Chain Monte Carlo Methods

Importance sampling and rejection sampling are noniterative methods: they draw a sample of size N and stop. For high-dimensional problems, it may be quite difficult to find an importance density or envelope function. To address these issues, it is now standard Bayesian practice to turn to Markov chain Monte Carlo (MCMC) methods. These methods operate by sequentially sampling parameter values from a Markov chain whose stationary distribution is exactly the desired joint posterior distribution of interest. One needs to ensure the convergence of the chain to its stationary distribution. The Gibbs sampler and Metropolis-Hastings are the two commonly used MCMC methods in practice.
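The combined strategy mentioned above is often called Metropolis-within-Gibbs: cycle through the parameters as in Gibbs sampling, but replace each exact full-conditional draw with a one-dimensional MH step. A minimal Python sketch (the bivariate normal-shaped target with correlation 0.6 is my own illustrative choice, not an example from the slides):

```python
import math
import random

random.seed(7)

RHO = 0.6

def log_post(t1, t2):
    # unnormalized log posterior: bivariate normal shape with correlation RHO
    return -0.5 * (t1**2 - 2 * RHO * t1 * t2 + t2**2) / (1 - RHO**2)

def mh_coord(state, which, sd=1.0):
    """One MH update of coordinate `which`, holding the other coordinate fixed."""
    cand = list(state)
    cand[which] = random.gauss(state[which], sd)
    log_r = log_post(*cand) - log_post(*state)
    return tuple(cand) if math.log(random.random()) <= log_r else state

state, draws = (0.0, 0.0), []
for _ in range(40_000):
    state = mh_coord(state, 0)   # update theta_1 | theta_2 via an MH step
    state = mh_coord(state, 1)   # update theta_2 | theta_1 via an MH step
    draws.append(state)
# the draws should reproduce unit marginal variances and correlation near RHO
```

Each coordinate move targets the correct conditional distribution, so the sweep leaves the joint posterior invariant even though no full conditional is ever sampled exactly.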
208 / 352 Bayesian Computational Methods Lecture 14 Sharif Mahmood

209 / 352 Sequential updating

We can change our beliefs about an unknown parameter θ from a prior distribution with density f^(0)(θ) to a posterior distribution with density f^(1)(θ) when we observe data y, using

f^(1)(θ) ∝ f^(0)(θ) L(θ; y),

where the likelihood L(θ; y) = f_Y(y|θ), the probability or probability density of y given θ.

Now suppose that we repeat the process and consider data arriving sequentially. We start with a prior distribution with density f^(0)(θ). Then we observe data y^(1), and our beliefs change so that they are represented by f^(1)(θ). Suppose that we now observe further data y^(2), so that our density for θ becomes f^(2)(θ).

210 / 352 Sequential updating

Before we observe either y^(1) or y^(2), the joint probability (density) of θ, y^(1), y^(2) is

f^(0)(θ) × f_1(y^(1)|θ) × f_2(y^(2)|θ, y^(1)),

where f_1(y^(1)|θ) is the conditional probability (density) of y^(1) given θ and f_2(y^(2)|θ, y^(1)) is the conditional probability (density) of y^(2) given θ and y^(1). Clearly

f^(1)(θ) ∝ f^(0)(θ) f_1(y^(1)|θ) = f^(0)(θ) L_1(θ; y^(1))

and

f^(2)(θ) ∝ f^(0)(θ) f_1(y^(1)|θ) f_2(y^(2)|θ, y^(1)) = f^(0)(θ) L_{1,2}(θ; y^(1), y^(2))
        ∝ f^(1)(θ) f_2(y^(2)|θ, y^(1)) = f^(1)(θ) L_2(θ; y^(2)|y^(1))

211 / 352 Sequential updating

In many cases y^(1) and y^(2) will be independent given θ. In such a case f_2(y^(2)|θ, y^(1)) = f_2(y^(2)|θ) and L_{1,2}(θ; y^(1), y^(2)) = L_1(θ; y^(1)) L_2(θ; y^(2)). Our "posterior" distribution after observing y^(1) is our "prior" distribution before we observe y^(2). The only complication when y^(1) and y^(2) are not independent given θ is that the second likelihood L_2(θ; y^(2)|y^(1)) must allow for the dependence of y^(2) on y^(1).

212 / 352 Sequential updating

Note that f^(2)(θ) is the same whether we update our beliefs first by y^(1) and then by y^(2), or first by y^(2) and then by y^(1).
This is obvious when y^(1) and y^(2) are independent given θ. In the case where they are not, it is simply a matter of whether we factorise the joint probability (density) of y^(1) and y^(2) given θ into f_1(y^(1)|θ) f_2(y^(2)|θ, y^(1)), as above, or into f_2*(y^(2)|θ) f_1*(y^(1)|θ, y^(2)).

At time n we have a density for the data conditional on θ as

f(y^(1), ..., y^(n)|θ) = f(y^(1)|θ) × f(y^(2)|θ, y^(1)) × ... × f(y^(n)|θ, y^(n−1))

where y^(i) = (y^(1), ..., y^(i)).

213 / 352 Example 1: Beta-Binomial

✑ Suppose the prior pdf f^(0)(θ) is proportional to θ^(a−1)(1 − θ)^(b−1). We observe 20 trials, 3 of which are successes, giving a likelihood

L_1(θ; y^(1)) ∝ θ^3 (1 − θ)^17.

Thus the posterior pdf at this stage, f^(1)(θ), is proportional to θ^(a+3−1)(1 − θ)^(b+17−1), the pdf of a Beta(a + 3, b + 17) distribution.

214 / 352 Example 1: Beta-Binomial

✑ Now suppose that we examine another 30 trials and 7 of these are successes. These observations are reasonably supposed to be independent of y^(1) given θ. So

L_2(θ; y^(2)) ∝ θ^7 (1 − θ)^23

and our new pdf for θ, f^(2)(θ), is proportional to θ^(a+10−1)(1 − θ)^(b+40−1), the pdf of a Beta(a + 10, b + 40) distribution.

215 / 352 Hierarchical models Lecture 15 Sharif Mahmood

216 / 352 Introduction to Hierarchical Models

A popular model for describing heterogeneity across several populations is the hierarchical model. Hierarchical models serve two purposes:
1 Methodological: when units of analysis are drawn from clusters within a population (communities, neighborhoods, city blocks, etc.), they can no longer be considered independent. When independence does not hold, we cannot construct the likelihood as simply.
2 Substantive: The influence of predictors may be context-dependent, e.g. student performance may depend on the teacher, the environmental context of classes.

Use of hierarchical models has increased dramatically in the last two decades.
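The two-batch updating in the Beta-Binomial example above is easy to check numerically, since the conjugate update is just bookkeeping on the Beta parameters. A short sketch (Python; the starting values a = b = 1 are an assumed prior choice, since the slides leave a and b generic):

```python
def beta_update(a, b, successes, trials):
    """Conjugate update: Beta(a, b) prior + Binomial(trials, theta) data."""
    return a + successes, b + trials - successes

a0, b0 = 1, 1                                # assumed Beta(1, 1) prior
a1, b1 = beta_update(a0, b0, 3, 20)          # batch 1 -> Beta(a0 + 3, b0 + 17)
a2, b2 = beta_update(a1, b1, 7, 30)          # batch 2 -> Beta(a0 + 10, b0 + 40)
once = beta_update(a0, b0, 3 + 7, 20 + 30)   # both batches in a single update
# sequential and all-at-once updating give the same posterior parameters
```

This is the order-invariance property of the preceding slides made concrete: the posterior depends on the data only through the running totals of successes and failures.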
217 / 352 Hierarchical structure of the parameters

✑ Suppose we have J observations within each of G groups: y_11, ..., y_J1; y_12, ..., y_J2; ...; y_1G, ..., y_JG, and we assume that the data are distributed within groups according to some distribution Q with parameter θ, but that each group has its own parameter (θ_g). Thus

y_ig ∼ Q(θ_g).

Suppose we assume further that these parameters θ_g arise from a common distribution W with parameter γ (this parameter is called a "hyperparameter"). So,

θ_g ∼ W(γ).

Finally, assume γ has some vague distribution like a uniform:

γ ∼ U(−100, 100)

218 / 352 Hierarchical structure of the parameters

A posterior distribution for all unknown parameters would then be (after substituting the densities Q and W into the conditional structure below):

p(γ, θ | y) ∝ p(y|θ, γ) p(θ|γ) p(γ).

The product of this marginal joint density for the parameters and the sampling density for the data, given the parameters, yields a posterior density for all of the parameters. We might not be interested in the group-level parameters θ_g but rather in the posterior distribution for the hyperparameter γ that structures the distribution of the group-level parameters. The marginal distribution for γ:

p(γ|y) ∝ ∫ p(y|θ, γ) p(θ|γ) p(γ) dθ

This integration is performed via MCMC methods as we sample from the conditional posterior distributions for each parameter.

219 / 352 The hierarchical normal model

The hierarchical normal model describes the heterogeneity of means across several populations in which the within- and between-group sampling models are both normal:

within-group model:  p(y|φ_j) ∼ N(θ_j, σ²), with φ_j = {θ_j, σ²}
between-group model: p(θ_j|ψ) ∼ N(µ, τ²), with ψ = {µ, τ²}

It might help to visualize this setup as in the Figure. The fixed but unknown parameters in this model are µ, τ² and σ².
For convenience we will use standard semiconjugate normal and inverse-gamma prior distributions for these parameters:

1/σ² ∼ G(ν₀/2, ν₀σ₀²/2)
1/τ² ∼ G(η₀/2, η₀τ₀²/2)
µ ∼ N(µ₀, γ₀²)

220 / 352 The hierarchical normal model

Figure: A graphical representation of the basic hierarchical normal model: (µ, τ²) → θ_1, θ_2, ..., θ_{m−1}, θ_m → y_1, y_2, ..., y_{m−1}, y_m, with σ² feeding each y_j.

221 / 352 Posterior

The unknown quantities in our system include the group-specific means {θ_1, ..., θ_m}, the within-group sampling variability σ², and the mean and variance (µ, τ²) of the population of group-specific means. Joint posterior inference for these parameters can be made by constructing a Gibbs sampler which approximates the posterior distribution

p(θ_1, ..., θ_m, µ, τ², σ² | y_1, ..., y_m)
∝ p(µ, τ², σ²) × p(θ_1, ..., θ_m | µ, τ², σ²) × p(y_1, ..., y_m | θ_1, ..., θ_m, µ, τ², σ²)
= p(µ) p(τ²) p(σ²) × [∏_{j=1}^m p(θ_j | µ, τ²)] × [∏_{j=1}^m ∏_{i=1}^{n_j} p(y_{i,j} | θ_j, σ²)]

The term in the second pair of brackets is the result of an important conditional independence feature of our model.

222 / 352 Full conditional distributions of µ and τ²

The full conditional distributions of µ and τ² are both proportional to

p(µ) p(τ²) ∏_{j=1}^m p(θ_j | µ, τ²)

In particular, this must mean that

p(µ | θ_1, ..., θ_m, τ², σ², y_1, ..., y_m) ∝ p(µ) ∏ p(θ_j | µ, τ²)
p(τ² | θ_1, ..., θ_m, µ, σ², y_1, ..., y_m) ∝ p(τ²) ∏ p(θ_j | µ, τ²)

Previously, we saw that if y_1, ..., y_n were i.i.d. N(θ, σ²) and θ had a normal prior distribution, then the conditional distribution of θ was also normal and the conditional distribution of 1/σ² was gamma. Since our current situation is exactly analogous, the fact that θ_1, ..., θ_m are i.i.d. N(µ, τ²) and µ has a normal prior distribution implies that the conditional distribution of µ must be normal and the conditional distribution of 1/τ² is gamma as well.
222 / 352
Full conditional of θj
The terms of the posterior that depend on θj show that the full conditional distribution of θj must be proportional to
p(θj | µ, τ², σ², y1, . . . , ym) ∝ p(θj | µ, τ²) Π_{i=1}^{nj} p(yi,j | θj, σ²)
This says that, conditional on {µ, τ², σ², yj}, θj must be conditionally independent of the other θ's as well as independent of the data from groups other than j. The full conditional distribution is therefore
p(θj | µ, τ², σ², y1,j, . . . , ynj,j) ∼ N( (nj ȳj τ² + µ σ²)/(nj τ² + σ²), τ²σ²/(nj τ² + σ²) )
223 / 352
Full conditional of σ²
The full conditional of σ² is similar to that in the one-sample normal model, except now we have information about σ² from m separate groups:
p(σ² | θ1, . . . , θm, y1, . . . , ym) ∝ p(σ²) Π_{j=1}^m Π_{i=1}^{nj} p(yi,j | θj, σ²)
∝ (σ²)^{−(ν0/2+1)} e^{−ν0σ0²/(2σ²)} (σ²)^{−Σnj/2} e^{−ΣΣ(yi,j − θj)²/(2σ²)}
Adding the powers of σ² and collecting the terms in the exponent, we recognize this as proportional to an inverse-gamma density, giving
p(1/σ² | θ, y1, . . . , ym) ∼ G( (ν0 + Σ_{j=1}^m nj)/2, (ν0σ0² + Σ_{j=1}^m Σ_{i=1}^{nj} (yi,j − θj)²)/2 )
224 / 352
Posterior approximation
Posterior approximation proceeds by iterative sampling of each unknown quantity from its full conditional distribution. Given a current state of the unknowns {θ1(s), . . . , θm(s), µ(s), τ2(s), σ2(s)}, a new state is generated as follows:
1. sample µ(s+1) ∼ p(µ | θ1(s), . . . , θm(s), τ2(s));
2. sample τ2(s+1) ∼ p(τ² | θ1(s), . . . , θm(s), µ(s+1));
3. sample σ2(s+1) ∼ p(σ² | θ1(s), . . . , θm(s), y1, . . . , ym);
4. sample θj(s+1) ∼ p(θj | µ(s+1), τ2(s+1), σ2(s+1), yj) for each j ∈ {1, . . . , m}.
The order in which the new parameters are generated does not matter. What does matter is that each parameter is updated conditional upon the most current value of the other parameters.
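The four-step update cycle above can be sketched in code. The following is a minimal, illustrative Python transcription (the course's own examples use R); the hyperparameter defaults are hypothetical placeholders, not values from the lecture.

```python
import random
import statistics

def gibbs_hierarchical_normal(groups, mu0=0.0, g0sq=100.0,
                              eta0=1.0, t0sq=100.0,
                              nu0=1.0, s0sq=100.0,
                              n_iter=1000, seed=1):
    """Gibbs sampler for the hierarchical normal model.

    groups : list of lists, groups[j] holds y_{1j}, ..., y_{nj j}.
    The prior settings (mu0, g0sq, eta0, t0sq, nu0, s0sq) are
    illustrative defaults only. Returns (mu, tau2, sigma2, theta) draws.
    """
    rng = random.Random(seed)
    m = len(groups)
    n = [len(g) for g in groups]
    ybar = [statistics.fmean(g) for g in groups]
    theta = ybar[:]                       # start each theta_j at its group mean
    mu, tau2, sigma2 = statistics.fmean(ybar), t0sq, s0sq
    draws = []
    for _ in range(n_iter):
        # 1) mu | theta, tau2 : normal full conditional
        v = 1.0 / (1.0 / g0sq + m / tau2)
        mu = rng.gauss(v * (mu0 / g0sq + sum(theta) / tau2), v ** 0.5)
        # 2) 1/tau2 | theta, mu : gamma full conditional (gammavariate takes scale)
        ss = sum((t - mu) ** 2 for t in theta)
        tau2 = 1.0 / rng.gammavariate((eta0 + m) / 2.0,
                                      2.0 / (eta0 * t0sq + ss))
        # 3) 1/sigma2 | theta, y : gamma full conditional
        ss = sum((y - theta[j]) ** 2 for j in range(m) for y in groups[j])
        sigma2 = 1.0 / rng.gammavariate((nu0 + sum(n)) / 2.0,
                                        2.0 / (nu0 * s0sq + ss))
        # 4) each theta_j | mu, tau2, sigma2, y_j : normal full conditional
        for j in range(m):
            vj = 1.0 / (n[j] / sigma2 + 1.0 / tau2)
            theta[j] = rng.gauss(vj * (n[j] * ybar[j] / sigma2 + mu / tau2),
                                 vj ** 0.5)
        draws.append((mu, tau2, sigma2, theta[:]))
    return draws
```

Each update conditions on the most current value of the other parameters, exactly as in steps 1-4 above.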
225 / 352
Bayesian Hypothesis Testing
Lecture 16
Sharif Mahmood
226 / 352
The Problem
Consider the posterior probability distribution,
p(θ|y) = p(y|θ)π(θ) / ∫_Θ p(y|θ)π(θ)dθ.
The normalizing constant p(y) = ∫_Θ p(y|θ)π(θ)dθ does not have a closed form. Thus the posterior distribution does not have a closed form.
A first-order approximation is given by the Bayesian central limit theorem. A second-order approximation is given by the Laplace method.
227 / 352
Bayesian CLT
Theorem. Suppose y1, . . . , yn are independent observations from p(y|θ). Suppose that π(θ) is the prior for θ, which may be improper. Further suppose that the posterior distribution is proper and its mode exists. Then as n → ∞, θ|y → Np(θ̂m, H⁻¹(θ̂m)).
Here θ̂m is the posterior mode of θ, obtained by solving (∂/∂θj) log p*(θ|y) = 0 for j = 1, . . . , p, where p*(θ|y) = p(y|θ)π(θ), and
H(θ) = −∂² log p*(θ|y) / ∂θi ∂θj
is called the Hessian matrix. The asymptotic covariance matrix of θ is the inverse of the Hessian matrix evaluated at θ̂m: H⁻¹(θ̂m) = H⁻¹(θ)|θ=θ̂m.
228 / 352
Bayesian CLT: proof sketch
Proof. Let I(θ) = log p*(θ|y). Then
p(θ|y) ∝ p(y|θ)π(θ) = exp[log(p(y|θ)π(θ))] = exp[log p*(θ|y)] = exp[I(θ)]
≈ exp[ I(θ̂m) + (θ − θ̂m)(∂/∂θ)I(θ)|θ=θ̂m + ½(θ − θ̂m)²(∂²/∂θ²)I(θ)|θ=θ̂m ]
= exp[ I(θ̂m) − ½H(θ̂m)(θ − θ̂m)² ]   (the first-derivative term vanishes at the mode)
∝ exp[ −½H(θ̂m)(θ − θ̂m)² ].
∴ p(θ|y) ≈ N(θ̂m, H⁻¹(θ̂m)).
229 / 352
Bayesian CLT: Example I
✑ Suppose y1, . . . , yn are i.i.d. Binomial(1, θ). Consider a flat prior π(θ) = 1, 0 < θ < 1, so that
p(θ|y) ∝ θ^{Σyi}(1 − θ)^{n−Σyi} ≡ θ^y(1 − θ)^{n−y}
Sol.
I(θ) = Σyi log θ + (n − Σyi) log(1 − θ)
∂I(θ)/∂θ = Σyi/θ − (n − Σyi)/(1 − θ) = 0 ⇒ θ̂m = Σyi/n
∂²I(θ)/∂θ² |θ=θ̂m = −Σyi/θ̂m² − (n − Σyi)/(1 − θ̂m)² = −n/(θ̂m(1 − θ̂m))
⇒ H⁻¹(θ̂m) = θ̂m(1 − θ̂m)/n
∴ p(θ|y) ∼ N( θ̂m, θ̂m(1 − θ̂m)/n )
230 / 352
Bayesian CLT: Example II
✑ Suppose y1, . . . , yn are i.i.d. Poisson(θ). Consider the prior π(θ) ∝ θ^{α−1} exp(−βθ). Show that
p(θ|y) ∼ N( (α + Σyi − 1)/(n + β), (α + Σyi − 1)/(n + β)² ).
Sol.
I(θ) = (α + Σyi − 1) log θ − (n + β)θ
∂I(θ)/∂θ = (α + Σyi − 1)/θ − (n + β) = 0 ⇒ θ̂m = (α + Σyi − 1)/(n + β)
∂²I(θ)/∂θ² |θ=θ̂m = −(α + Σyi − 1)/θ̂m² = −(n + β)²/(α + Σyi − 1)
⇒ H⁻¹(θ̂m) = (α + Σyi − 1)/(n + β)²
∴ p(θ|y) ∼ N( (α + Σyi − 1)/(n + β), (α + Σyi − 1)/(n + β)² )
231 / 352
Limitations of Bayesian CLT
If the true posterior differs greatly from normality or the sample size is too small, this approximation may be poor.
Laplace's method provides a second-order approximation to posterior expectations of functions of θ.
For the approximations to be valid, the posterior distribution must be unimodal.
Accuracy depends on a large sample and on the parameterization (e.g. θ or log θ).
232 / 352
Statistical Hypothesis Testing
A statistical hypothesis test is a method of making decisions using data. Statistical hypothesis testing is a key technique of frequentist inference.
The p-value is the probability that observations as extreme as the data would occur by chance under the null hypothesis.
The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis.
233 / 352
Fallacies of Statistical Hypothesis Testing
The results of statistical tests are frequently misunderstood:
Failure to reject the null hypothesis leads to its acceptance. (Wrong!)
The p-value is the probability that the null hypothesis is incorrect. (Wrong!)
α = .05 is a standard with an objective basis. (Wrong!)
Small p-values indicate large effects. (Wrong!)
Data show a theory to be true or false. (Wrong!)
Statistical significance implies importance. (Wrong!)
234 / 352
Could Fisher, Jeffreys and Neyman Have Agreed on Testing?
235 / 352
Fisher's Significance Test
Experiment yields Y ∼ f(y|θ); test H0: θ = θ0.
Choose a test statistic T = t(y), so that larger values of T reflect evidence against H0.
Compute the p-value for the observed data y,
p = P0(t(Y) > t(y)),
rejecting H0 if p is small (e.g. p < 0.05).
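Returning to the Bayesian CLT examples above: the quality of the normal approximation in Example I can be checked numerically, since the exact posterior under the flat prior is Beta(Σyi + 1, n − Σyi + 1). A small illustrative Python check (not from the slides; the counts s = 160, n = 400 are made up):

```python
def clt_approx(s, n):
    """Normal approximation from Example I: posterior mode s/n and
    variance (s/n)(1 - s/n)/n under a flat prior."""
    t = s / n
    return t, t * (1 - t) / n

def beta_mean_var(a, b):
    """Exact mean and variance of the Beta(a, b) posterior."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var
```

For s = 160 successes in n = 400 trials, the exact Beta(161, 241) posterior and its normal approximation nearly coincide, as the theorem predicts for large n.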
The justification is that the p-value can be viewed as an index of the 'strength of evidence' against H0.
236 / 352
Neyman-Pearson Hypothesis Testing
Test H0: θ = θ0 versus the alternative H1: θ = θ1.
Reject H0 if T > c, where c is a pre-chosen critical value.
Compute the Type I and Type II error probabilities
α = P0(rejecting H0 | H0 true) and β = P1(accepting H0 | H0 false),
where Pi refers to probability under Hi.
The justification is the frequentist principle: in repeated actual use of a statistical procedure, the average actual error should not be greater than the average reported error.
237 / 352
A Comparison Between Fisherian and Frequentist (Neyman-Pearson) Testing
Fisher's null hypothesis testing:
1. Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
2. Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses.
3. Use this procedure only to draw provisional conclusions in the context of an attempt to understand the experimental situation.
Neyman-Pearson decision theory:
1. Set up two hypotheses, H0 and H1, and decide α, β. These define a rejection region for each hypothesis.
2. If the data fall into the rejection region of H0, accept H1; otherwise accept H0. Note that accepting a hypothesis does not mean that you believe in it.
3. The usefulness of the procedure is limited, among other situations, where you have a disjunction of hypotheses (e.g., either µ1 = 8 or µ2 = 10 is true).
238 / 352
Criticism
1. Confusion resulting (in part) from combining the methods of Fisher and Neyman-Pearson, which are conceptually distinct.
2. Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.
3. Statistical hypothesis testing is misunderstood, overused and misused.
4. The numerous criticisms of significance testing do not lead to a single alternative or even to a unified set of alternatives.
5. When used to detect whether a difference exists between groups, a paradox arises.
239 / 352
Alternatives to Significance Testing
Bayesian inference is one alternative to significance testing. For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available.
The probability that a hypothesis is true can only be derived from use of Bayes' theorem, which was unsatisfactory to both the Fisher and Neyman-Pearson camps due to the explicit use of subjectivity in the form of the prior probability.
Bayesians test the hypothesis, given the data.
240 / 352
The Jeffreys Approach to Testing
Bayesian hypothesis testing of H0 vs H1 is based on the 'Bayes factor'.
Specify prior probabilities for H0 and H1 as p(H0) and p(H1). Let Y denote the data collected in the experiment; p(H0|Y) and p(H1|Y) denote the posterior probabilities. The Bayes factor in favor of H0 is defined as
BF = [p(H0|Y)/p(H0)] / [p(H1|Y)/p(H1)]
= [p(H0|Y)/p(H1|Y)] / [p(H0)/p(H1)]
= Posterior odds / Prior odds,
where p(H0|Y) = p(Y|H0)p(H0) / [p(Y|H0)p(H0) + p(Y|H0^c)p(H0^c)], with a similar formula for p(H1|Y).
241 / 352
Bayes Factor
The Bayes factor becomes BF = p(Y|H0)/p(Y|H1). Large values of the Bayes factor favor H0; reject H0 if BF ≤ 1. Jeffreys devised the following criteria for decision making:
1 < BF ≤ 3 : Weak evidence
3 < BF ≤ 12 : Positive evidence
12 < BF ≤ 150 : Strong evidence
BF > 150 : Decisive
Bayesians may use the BF to compare hypotheses. The BF is the Bayesian analog of the likelihood ratio (LR) test, but the BF does not require nested models as the LR test does.
242 / 352
Bayes Factor: Example I
✑ A child is given an intelligence test, with resulting score Y, where Y ∼ N(θ, 100):
- θ indicates the child's own true IQ;
- 100 is the variance if the same child repeats an IQ test of the same kind.
IQ in the population is distributed as θ ∼ N(100, 225).
If the child scores y = 115, then the posterior distribution of θ is p(θ|y) ∼ N(110.4, 69.2).
243 / 352
Bayes Factor: Example I
We have H0: θ ≤ 100 vs H1: θ > 100.
Prior probabilities: p(θ) = N(100, 225) ⇒ P(θ ≤ 100) = 0.5 and P(θ > 100) = 0.5, so the prior odds are 0.5/0.5 = 1.
Posterior probabilities: P(θ ≤ 100|y) = 0.106 and P(θ > 100|y) = 0.894, so the posterior odds are 0.106/0.894 = 0.119.
The BF in favor of H0 vs H1 is 0.119/1 = 0.119.
244 / 352
Bayes Factor: Example II
✑ Suppose y1, . . . , yn are observed from N(θ, 1), and we wish to test H0: θ = 0 versus H1: θ = 1. Then
BF ≡ LR = exp(n/2 − Σyi).
✑ Simple vs composite: H0: θ = 0 versus H1: θ ≠ 0. Consider the prior π(θ) ∼ N(1, 1) under H1. Then
BF = p(y|θ = 0) / ∫_{ΘH1} p(y|θ)π(θ)dθ
= (2π)^{−n/2} e^{−½Σyi²} / ∫_{−∞}^{∞} (2π)^{−n/2} e^{−½Σ(yi−θ)²} (2π)^{−½} e^{−½(θ−1)²} dθ
= (n + 1)^{½} e^{½ − (1+Σyi)²/(2(n+1))}.
n = 10, Σyi = 5 ⇒ BF = 1.06, implying weak evidence in favor of H0.
245 / 352
Bayesian Model Selection (BMS)
BMS is another way of framing the choice between hypothesis A and hypothesis B to explain the data. The Bayesian information criterion (BIC) score plays a role similar to the χ² value in traditional hypothesis testing.
Bayesian Information Criterion (BIC)
BIC = 2 log P(Y|M, θ̂) − d log n
where
Y: observed data,
M: model (the model is actually a family of models with unit variance and unknown mean, therefore θ̂ is needed),
θ̂: the maximum likelihood estimate (MLE) of the parameters in the model,
d: the number of free parameters, and
n: the number of data points.
Under this sign convention, a model with a higher BIC is a better model, since if the data fit the model well, the log-likelihood is higher.
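The two Bayes factors in Example II, and the evidence scale from the earlier slide, can be evaluated directly. An illustrative Python sketch (the composite-case closed form follows by completing the square over θ, as above):

```python
import math

def bf_simple(n, sum_y):
    """H0: theta = 0 vs H1: theta = 1 for i.i.d. N(theta, 1) data:
    BF = LR = exp(n/2 - sum(y))."""
    return math.exp(n / 2.0 - sum_y)

def bf_composite(n, sum_y):
    """H0: theta = 0 vs H1: theta != 0 with a N(1, 1) prior under H1:
    BF = sqrt(n + 1) * exp(1/2 - (1 + sum(y))^2 / (2(n + 1)))."""
    return math.sqrt(n + 1.0) * math.exp(0.5 - (1.0 + sum_y) ** 2
                                         / (2.0 * (n + 1.0)))

def evidence(bf):
    """Map a Bayes factor in favour of H0 to the slide's evidence scale."""
    if bf <= 1:
        return "reject H0"
    if bf <= 3:
        return "weak evidence"
    if bf <= 12:
        return "positive evidence"
    if bf <= 150:
        return "strong evidence"
    return "decisive"
```

With n = 10 and Σyi = 5 the composite Bayes factor is about 1.06, i.e. weak evidence in favor of H0, matching the slide.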
246 / 352
Hypothesis Testing: Example
✑ Suppose a study was conducted with 16,395 individuals from the general (not high-risk) population:
• 74 HIV cases reported among the 8198 individuals receiving placebo
• 51 HIV cases reported among the 8197 individuals receiving the treatment
The test that was performed:
• Let p1 and p2 denote the probability of HIV in the placebo and treatment populations, respectively.
• Test H0: p1 = p2 versus H1: p1 ≠ p2.
247 / 352
Hypothesis Testing: Example
The normal approximation is okay, so
z = (p̂1 − p̂2)/σ̂_{p̂1−p̂2} = (.009027 − .006222)/.001359 = 2.06
is approximately N(θ, 1), where θ = (p1 − p2)/(.001359). We thus test H0: θ = 0 versus H1: θ ≠ 0, based on z.
Observed z = 2.06, so the p-value is 0.04. If we test H0: θ = 0 versus H1: θ > 0, the p-value is 0.02.
248 / 352
Bayesian Analysis:
Posterior odds of H0 to H1 = Prior odds of H0 to H1 × BF, where
BF = likelihood of H0 for observed data / average likelihood of H1
= (2π)^{−½} e^{−½(z−0)²} / ∫ (2π)^{−½} e^{−½(z−θ)²} π(θ)dθ.
For z = 2.06 and π(θ) = Uniform(0, 2.95), the nonincreasing prior most favorable to H0, BF = 0.178.
249 / 352
The Disagreement
Fisher, Jeffreys and Neyman would report considerably different numbers in actual practice. Example: in testing whether a normal mean is zero or not, with z = ȳ/(σ/√n) = 2.3 (or z = 2.9) (σ known, n = 10),
• Fisher would report p = 0.021 (or p = 0.0037).
• Neyman, with pre-specified α = 0.05, would report α = 0.05 in either case (and a Type II error β).
• Jeffreys would report the objective posterior probability P(H0|y) = 0.30 (or P(H0|y) = 0.10) (using equal prior probabilities of the hypotheses and a conventional Cauchy(0, σ) prior on the alternative).
250 / 352
Their Criticisms of the Other Approaches
1. Criticisms of Bayesian analysis in general. (Fisher and Neyman rarely addressed Jeffreys's particular version of Bayesian theory.)
• Fisher and Neyman rejected pre-Jeffreys objective Bayesian analysis as logically flawed.
• Fisher's subsequent position was that Bayesian analysis is based on prior information that is only rarely available.
• Neyman, in addition, assumed that Bayesian analysis violated the Frequentist Principle.
2. Criticisms of Neyman-Pearson testing
• Fisher and Jeffreys rejected N-P tests because they report fixed (α, β), i.e., do not provide evidence or error measures that vary with the data.
• Fisher disliked alternative hypotheses and power.
251 / 352
Their Criticisms of the Other Approaches
3. Criticisms of Fisher's significance testing
• Neyman and Jeffreys: alternative hypotheses are essential.
• Neyman: p-values violate the Frequentist Principle.
• p-values are commonly misinterpreted as error rates, resulting in a considerable overestimation of the actual evidence against H0.
252 / 352
Bayesian t-test
Hypothesis Testing for Two Independent Groups
R code

library(R2OpenBUGS)
# your set of values for group 1 and group 2
group1 = c(49, 13, 50, 64, 21, 23, 15, 7, 32, 8, 17, 15, 19, 16, 33)
group2 = c(41, 20, 53, 41, 20, 44, 31, 24, 24, 44, 15, 35, 32, 25, 35)
n1 = length(group1); n2 = length(group2)
# Rescale:
group2 = (group2 - mean(group1))/sd(group1)
group1 = as.vector(scale(group1))
# Specify the file location that contains the OpenBUGS code
openbugsmodel = "Ttest2.txt"
priorforh1 = dcauchy(0)    # prior for the case of delta <> 0
priorforh2 = 2*dcauchy(0)  # prior for the case of delta < 0
priorforh3 = 2*dcauchy(0)  # prior for the case of delta > 0
# Advanced input
iterations = 10000
burnin = 1000

253 / 352
Bayesian t-test for Two Independent Groups
OpenBUGS code

model {
  for (i in 1:n1) {
    group1[i] ~ dnorm(muX, lambdaXY)
  }
  for (i in 1:n2) {
    group2[i] ~ dnorm(muY, lambdaXY)
  }
  lambdaXY <- pow(sigmaXY, -2)
  delta ~ dnorm(0, lambdaDelta)
  lambdaDelta ~ dchisqr(1)
  sigma ~ dnorm(0, sigmaChi)
  sigmaChi ~ dchisqr(1)
  sigmaXY <- abs(sigma)
  mu ~ dnorm(0, muChi)
  muChi ~ dchisqr(1)
  alpha <- delta*sigmaXY
  muX <- mu + alpha*0.5
  muY <- mu - alpha*0.5
}

254 / 352
Bayesian t-test for Two Independent Groups
R code

# to be passed on to WinBUGS
data = list("group1", "group2", "n1", "n2")
inits = function() {
  list(delta=rnorm(1, 0, 1), mu=rnorm(1, 0, 1), sigma=runif(1, 0, 5))
}
# Parameters to be monitored
parameters = c("delta")
# The following command calls WinBUGS with specific options.
samples = bugs(data, inits, parameters, model.file=openbugsmodel,
               n.chains=3, n.iter=iterations, n.burnin=burnin, n.thin=1,
               DIC=T, codaPkg=F, debug=F)
# Now the values for the monitored parameters are in the
# "samples" object, ready for inspection.
samples$summary

255 / 352
Comparing Two Groups
The figure shows math scores from a sample of 10th grade students from two schools, namely, Group 1 and Group 2.
[Figure: side-by-side boxplots of the math scores (roughly 10 to 60) for Group 1 and Group 2.]
256 / 352
Comparing Two Groups
Suppose we are interested in estimating θ1, the average score we would obtain if all 10th graders in school 1 were tested, and possibly comparing it to θ2, the corresponding average from school 2.
The results from the sample data are ȳ1 = 25.47 and ȳ2 = 32.27, suggesting that θ2 is larger than θ1. To assess whether or not the observed mean difference of ȳ2 − ȳ1 = 6.8 is large compared to the sampling variability, it is standard practice to compute the t-statistic
t = (ȳ2 − ȳ1) / (sp √(1/n1 + 1/n2)) = (32.27 − 25.47) / (14.20 √(1/15 + 1/15)) = 1.31.
257 / 352
Model selection based on p-values:
If p < 0.05,
◮ reject the model that the two groups have the same distribution;
◮ conclude that θ1 ≠ θ2;
◮ use the estimates θ̂1 = ȳ1, θ̂2 = ȳ2.
If p > 0.05,
◮ accept the model that the two groups have the same distribution;
◮ conclude that θ1 = θ2;
◮ use the estimates θ̂1 = θ̂2 = (Σyi,1 + Σyi,2)/(n1 + n2).
For our math score data, the above procedure would take the p-value of 0.20 and tell us to treat the population means of the two groups as being numerically equivalent.
258 / 352
Comparing Two Groups
The p-value-based procedure described above can be re-expressed as estimating θ1 as θ̂1 = wȳ1 + (1 − w)ȳ2, where w = 1 if p < 0.05 and w = n1/(n1 + n2) otherwise.
It might make more sense to allow w to vary continuously and have a value that depends on such things as the relative sample sizes n1 and n2. It also depends on the sampling variability σ² and our prior information about the similarities of the two populations.
An estimator similar to this is produced by a Bayesian analysis that allows for information to be shared across the groups.
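The t-statistic can be reproduced from the group data listed with the Bayesian t-test code earlier in these notes. An illustrative Python check (the course's own code is in R):

```python
import math

# Group data as listed in the Bayesian t-test R code earlier in the notes.
GROUP1 = [49, 13, 50, 64, 21, 23, 15, 7, 32, 8, 17, 15, 19, 16, 33]
GROUP2 = [41, 20, 53, 41, 20, 44, 31, 24, 24, 44, 15, 35, 32, 25, 35]

def pooled_t(y1, y2):
    """Two-sample t statistic t = (ybar2 - ybar1)/(s_p * sqrt(1/n1 + 1/n2))
    with the pooled standard deviation s_p."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss = sum((v - m1) ** 2 for v in y1) + sum((v - m2) ** 2 for v in y2)
    sp = math.sqrt(ss / (n1 + n2 - 2))
    return (m2 - m1) / (sp * math.sqrt(1 / n1 + 1 / n2))
```

Running `pooled_t(GROUP1, GROUP2)` recovers the group means 25.47 and 32.27 and the t value of 1.31 quoted above.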
259 / 352
Comparing Two Groups
Consider the following sampling model for data from the two groups:
Yi,1 = µ − δ + ei,1
Yi,2 = µ + δ + ei,2
{ei,j} ∼ N(0, σ²)
where θ1 = µ − δ and θ2 = µ + δ. We see that δ represents half the population difference in means, as (θ2 − θ1)/2 = δ, and µ represents the pooled average, as (θ2 + θ1)/2 = µ. Convenient conjugate prior distributions for the unknown parameters are
p(µ, δ, σ²) = p(µ) × p(δ) × p(σ²)
p(µ) ∼ N(µ0, γ0²)
p(δ) ∼ N(δ0, τ0²)
p(σ²) ∼ IG(ν0/2, ν0σ0²/2)
260 / 352
Comparing Two Groups
The full conditional distributions of these parameters are:
p(µ | y, δ, σ²) ∼ N(µ1, γ1²)
µ1 = γ1² × [ µ0/γ0² + (1/σ²)Σ(yi,1 + δ) + (1/σ²)Σ(yi,2 − δ) ]
γ1² = [1/γ0² + (n1 + n2)/σ²]⁻¹
p(δ | y, µ, σ²) ∼ N(δ1, τ1²)
δ1 = τ1² × [ δ0/τ0² + (1/σ²)Σ(yi,2 − µ) − (1/σ²)Σ(yi,1 − µ) ]
τ1² = [1/τ0² + (n1 + n2)/σ²]⁻¹
p(σ² | y, µ, δ) ∼ IG(ν1/2, ν1σ1²/2)
ν1 = ν0 + n1 + n2
ν1σ1² = ν0σ0² + Σ(yi,1 − [µ − δ])² + Σ(yi,2 − [µ + δ])²
261 / 352
Comparing Two Groups
R code

##### data
y1 <- group1; y2 <- group2
n1 <- length(group1); n2 <- length(group2)
##### prior parameters
mu0 <- 30; g02 <- 400
del0 <- 0; t02 <- 400
s20 <- 100; nu0 <- 1
##### starting values
mu <- (mean(y1) + mean(y2))/2
del <- (mean(y1) - mean(y2))/2

262 / 352
Comparing Two Groups

##### Gibbs sampler
MU <- DEL <- S2 <- NULL; Y12 <- NULL
set.seed(1)
for (s in 1:5000) {
  ## update s2
  s2 <- 1/rgamma(1, (nu0 + n1 + n2)/2,
                 (nu0*s20 + sum((y1 - mu + del)^2) + sum((y2 - mu - del)^2))/2)
  ## update mu
  var.mu <- 1/(1/g02 + (n1 + n2)/s2)
  mean.mu <- var.mu*(mu0/g02 + sum(y1 + del)/s2 + sum(y2 - del)/s2)
  mu <- rnorm(1, mean.mu, sqrt(var.mu))
  ## update del
  var.del <- 1/(1/t02 + (n1 + n2)/s2)
  mean.del <- var.del*(del0/t02 + sum(y2 - mu)/s2 - sum(y1 - mu)/s2)
  del <- rnorm(1, mean.del, sqrt(var.del))
  ## save parameter values
  MU <- c(MU, mu); DEL <- c(DEL, del); S2 <- c(S2, s2)
  Y12 <- rbind(Y12, c(rnorm(2, mu + c(1, -1)*del, sqrt(s2))))
}

263 / 352
Comparing Two Groups
The figure shows the marginal posterior distributions of µ and δ, and how they are much more concentrated than their corresponding prior distributions.
Figure: Prior and posterior distributions for µ and δ.
264 / 352
Advantages over classical approach
In particular, a 95% quantile-based posterior confidence interval for 2 × δ, the difference in average scores between the two schools, can be reported.
Additionally, the posterior probability Pr(θ2 > θ1|y) = Pr(δ > 0|y) can be measured.
From the joint posterior predictive distribution, Pr(Y2 > Y1|y) can be obtained.
Find the values of the above three points??
265 / 352
The Empirical Bayes Approach
Lecture 17
Sharif Mahmood
266 / 352
Introduction
Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Empirical Bayes is also known as maximum marginal likelihood.
The priors can be parametric or nonparametric, depending on unknown parameters that may in turn be drawn from some second-stage prior. This sequence of parameters and priors constitutes a hierarchical model. The hierarchy must stop at some point, with all remaining prior parameters assumed known.
Rather than making that assumption, the empirical Bayes approach uses the observed data to estimate these final-stage parameters and then proceeds as though the prior were known.
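As a cross-check of the two-group Gibbs sampler above, here is a compact, illustrative Python transcription of the same full conditionals, run on the math-score data (the course implementation is the R code above):

```python
import random

GROUP1 = [49, 13, 50, 64, 21, 23, 15, 7, 32, 8, 17, 15, 19, 16, 33]
GROUP2 = [41, 20, 53, 41, 20, 44, 31, 24, 24, 44, 15, 35, 32, 25, 35]

def gibbs_two_groups(y1, y2, mu0=30.0, g02=400.0, del0=0.0, t02=400.0,
                     s20=100.0, nu0=1.0, n_iter=1500, seed=1):
    """Gibbs sampler for Y_i1 = mu - delta + e, Y_i2 = mu + delta + e
    with the semiconjugate priors from the slides."""
    rng = random.Random(seed)
    n1, n2 = len(y1), len(y2)
    mu = (sum(y1) / n1 + sum(y2) / n2) / 2.0
    delta = (sum(y2) / n2 - sum(y1) / n1) / 2.0
    out = []
    for _ in range(n_iter):
        # sigma^2 | mu, delta, y : inverse-gamma full conditional
        ss = sum((v - mu + delta) ** 2 for v in y1) \
           + sum((v - mu - delta) ** 2 for v in y2)
        s2 = 1.0 / rng.gammavariate((nu0 + n1 + n2) / 2.0,
                                    2.0 / (nu0 * s20 + ss))
        # mu | delta, sigma^2, y : normal full conditional
        v_mu = 1.0 / (1.0 / g02 + (n1 + n2) / s2)
        m_mu = v_mu * (mu0 / g02 + sum(v + delta for v in y1) / s2
                       + sum(v - delta for v in y2) / s2)
        mu = rng.gauss(m_mu, v_mu ** 0.5)
        # delta | mu, sigma^2, y : normal full conditional
        v_d = 1.0 / (1.0 / t02 + (n1 + n2) / s2)
        m_d = v_d * (del0 / t02 + sum(v - mu for v in y2) / s2
                     - sum(v - mu for v in y1) / s2)
        delta = rng.gauss(m_d, v_d ** 0.5)
        out.append((mu, delta, s2))
    return out
```

With these priors the posterior for µ concentrates near the pooled mean and that for δ near half the observed mean difference.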
267 / 352
Empirical Bayes methods
✑ Suppose in a two-stage hierarchical Bayes model, y = {y1, . . . , yn} are generated from p(y|θ), where θ = {θ1, . . . , θk}. Again, θ can be considered samples drawn from a population characterised by hyperparameters η according to p(θ|η), where η itself is drawn from an unparameterized distribution. Thus information about a particular quantity of interest θi comes not only from the data, but also from the hyperparameters.
Using Bayes' theorem,
p(θ|y) = p(y|θ)p(θ)/p(y) = [p(y|θ)/p(y)] ∫ p(θ|η)p(η)dη.
In general, this integral will not be tractable analytically and must be evaluated by numerical methods, e.g. MCMC.
268 / 352
Empirical Bayes methods
Alternatively, the expression can be written as
p(θ|y) = ∫ p(θ|η, y)p(η|y)dη = ∫ [p(y|θ)p(θ|η)/p(y|η)] p(η|y)dη,
and the term in the integral can in turn be expressed as
p(η|y) = ∫ p(η|θ)p(θ|y)dθ.
These suggest an iterative scheme, qualitatively similar in structure to a Gibbs sampler, to evolve successively improved approximations to p(θ|y) and p(η|y).
269 / 352
Empirical Bayes methods
Calculation
1. Calculate an initial approximation to p(θ|y), ignoring the η dependence completely.
2. Calculate an approximation to p(η|y) based upon the initial approximate distribution p(θ|y).
3. Use this p(η|y) to update the approximation for p(θ|y); then update p(η|y); and so on.
When the true distribution p(η|y) is sharply peaked, the integral determining p(θ|y) may be not much changed by replacing the probability distribution over η with a point estimate η* representing the distribution's peak (or, alternatively, its mean):
p(θ|y) ≃ p(y|θ)p(θ|η*)/p(y|η*)
270 / 352
Point estimation
1. Non-parametric empirical Bayes
Suppose Yi|θi ∼ Poisson(θi), where θ has cdf G(θ). That is,
f(yi|θi) = e^{−θi} θi^{yi}/yi!, yi = 0, 1, 2, . . .
Under squared error loss (SEL), the conditional expectation E(θi|Yi = yi) is a reasonable quantity to use for prediction.
For the Poisson compound sampling model, this quantity is
E(θi|yi) = ∫ (θ^{yi+1} e^{−θ}/yi!) dG(θ) / ∫ (θ^{yi} e^{−θ}/yi!) dG(θ) = (yi + 1) pG(yi + 1)/pG(yi),
where pG is the marginal distribution obtained by integrating out θ over G.
271 / 352
Point estimation
1. Non-parametric empirical Bayes
To take advantage of this, Robbins (1956) suggested estimating the marginals with their empirical frequencies, yielding the fully non-parametric estimate:
E(θi|yi) ≈ (yi + 1) × #(ys equal to yi + 1) / #(ys equal to yi)
Since the estimate of the marginal density depends on all of the yi, the estimate for each component θi is influenced by data from the other components. The foregoing is one of several models for which the analyst can make empirical Bayes inferences with no knowledge of, or assumptions concerning, G.
272 / 352
Point estimation
2. Parametric empirical Bayes
If the likelihood and its prior take on simple parametric forms, then the empirical Bayes problem is only to estimate the marginal m(y|η) and the hyperparameters η using the complete set of empirical measurements. One common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the maximum likelihood estimate (MLE). This simplified marginal allows one to plug the empirical averages into a point estimate for the prior θ.
273 / 352
Point estimation
2. Parametric empirical Bayes
There are several common parametric empirical Bayes models, including the Poisson-Gamma model (below), the Beta-binomial model, the Gaussian-Gaussian model, and the Dirichlet-multinomial model, as well as specific models for Bayesian linear regression and Bayesian multivariate linear regression. More advanced approaches include hierarchical Bayes models and Bayesian mixture models.
✑ Poisson-Gamma model
Let the likelihood be a Poisson distribution as in the previous example, and let the prior now be specified by the conjugate prior, which is a Gamma distribution G(α, β) (where η = (α, β)):
p(θ|α, β) = (β^α/Γ(α)) θ^{α−1} e^{−βθ}, for θ > 0, α > 0, β > 0.
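Robbins' nonparametric estimate above is easy to implement. An illustrative Python sketch with a made-up vector of counts:

```python
from collections import Counter

def robbins(counts):
    """Robbins' nonparametric empirical Bayes estimates for Poisson counts:
    E(theta_i | y_i) is approximated by
    (y_i + 1) * #{y_s = y_i + 1} / #{y_s = y_i}."""
    freq = Counter(counts)
    return {y: (y + 1) * freq.get(y + 1, 0) / freq[y] for y in freq}
```

For example, for counts [0, 0, 0, 1, 1, 2] the estimate for an observation of 0 is 1 × 2/3, since two of the six observations equal 1 and three equal 0. Note that the estimate shrinks toward what the whole sample says about the marginal frequencies, illustrating how each θi borrows strength from the other components.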
✑ Poisson-Gamma model Let the likelihood be a Poisson distribution in the previous example, and let the prior now be specified by the conjugate prior, which is a Gamma distribution (G(α, β)) (where η = (α, β) ): p(θ|α, β) = β α α−1 −βθ θ e Γ(α) for θ > 0, α > 0, β > 0. 274 / 352 Point estimation 2 Parametric empirical Bayes ✑ Poisson-Gamma model It is straightforward to show the posterior is also a Gamma distribution. Write p(θ|y) ∝ p(y|θ)p(θ|α, β), where the marginal distribution has been omitted since it does not depend explicitly on θ. Expanding terms which do depend on θ gives the posterior as: p(θ|y) ∝ θy e−θ × θα−1 e−βθ = θy+α−1 e−θ(1+β) So the posterior density is also a Gamma distribution G(α∗ , β ∗ ), where α∗ = y + α, and β ∗ = 1 + β. 275 / 352 Point estimation 2 Parametric empirical Bayes ✑ Poisson-Gamma model To apply empirical Bayes, we will approximate the marginal using the MLE. But since the posterior is a Gamma distribution, the MLE is just the mean of the posterior, which is the point estimate E(θ|y) we need. We have, E(θ|y) = = α∗ ȳ + α = ∗ β 1+β β α 1 × ȳ + × 1+β 1+β β The resulting point estimate E(θ|y) is therefore like a weighted average of the sample mean ȳ and the prior mean αβ . This turns out to be a general feature of empirical Bayes; the point estimates for the prior (i.e. mean) will look like a weighted averages of the sample estimate and the prior estimate (likewise for estimates of the variance). 276 / 352 Empirical Bayes performance The empirical bayes performance over other mehods estimators is quite impressive [Carlin and Louis, 2008]. Being superior to that of the frequentist estimator when there is a high degree of similarity among the θi s and no worse otherwise. In the last two sections, we see that procedures derived using Bayesian machinery and vague priors typically enjoy good frequentist and empirical bayes properties. 
Empirical Bayes methodology incorporates the Bayesian approach but rejects the unyielding subjectivism of the era's leading Bayesian thinkers.
277 / 352
Predictive Distributions
✑ Suppose an experiment with y ≡ (y1, . . . , yn) being observed with model p(y|θ) and prior π(θ); the posterior is p(θ|y) = p(y|θ)π(θ) / ∫ p(y|θ)π(θ)dθ. Predict a future observation z from this population:
p(z|y) = p(z, y)/p(y)
= ∫ p(z, y|θ)π(θ)dθ / ∫ p(y|θ)π(θ)dθ
= ∫ p(z|θ)p(y|θ)π(θ)dθ / ∫ p(y|θ)π(θ)dθ
= ∫ p(z|θ)p(θ|y)dθ
= E_{θ|y}[p(z|θ)]
278 / 352
Predictive Distributions
✑ Suppose y1, . . . , yn are i.i.d. Binomial(1, θ) and θ ∼ Beta(α, β). Thus
p(θ|y) ∝ θ^{Σyi+α−1}(1 − θ)^{n−Σyi+β−1}.
Hence, by kernel recognition, θ|y ∼ Beta(Σyi + α, n − Σyi + β). Then
p(z|y) = [Γ(n+α+β) / (Γ(Σyi+α)Γ(n−Σyi+β))] ∫₀¹ θ^{z+Σyi+α−1}(1 − θ)^{n+1−Σyi+β−z−1} dθ
= [Γ(n+α+β) / (Γ(Σyi+α)Γ(n−Σyi+β))] × [Γ(z+Σyi+α)Γ(n+1−Σyi+β−z) / Γ(n+1+α+β)],
so p(z = 1|y) = (Σyi + α)/(n + α + β) and p(z = 0|y) = (n − Σyi + β)/(n + α + β).
279 / 352
Bayesian Sample Size Computations
Lecture 18
Sharif Mahmood
280 / 352
Review of classical sample size calculations
Suppose we want to take a sample y1, . . . , yn ∼ N(θ, σ²), so ȳ ∼ N(θ, σ²/n). Assume σ² is known; n is not known. Consider the classical hypothesis testing problem:
H0: θ = θ0 against the alternative H1: θ = θ1 > θ0.
Decision rule: Reject H0 if ȳ > θ0 + (σ/√n) z1−α, where Φ(zα) = α and Φ(·) is the standard normal cdf.
281 / 352
Review of classical sample size calculations
Requiring the procedure to have a power of at least 1 − β, we have:
1 − β ≤ P(Rej H0 | H1) = P( ȳ > θ0 + (σ/√n) z1−α | θ = θ1 )
= P( (√n/σ)(ȳ − θ1) > (√n/σ)(θ0 − θ1) + z1−α )
= P( Z > −√n(∆/σ) + z1−α )
= 1 − Φ( −√n(∆/σ) + z1−α )
= Φ( √n(∆/σ) − z1−α ) = Φ( √n(∆/σ) + zα ),
where ∆ = θ1 − θ0 and, in the last steps, we have used the facts that Φ(−x) = 1 − Φ(x) and z1−α = −zα.
282 / 352
Review of classical sample size calculations
The preceding computations lead to
√n(∆/σ) + zα ≥ z1−β
⇒ √n(∆/σ) ≥ z1−β − zα = −(zα + zβ)
⇒ n ≥ (zα + zβ)² (σ/∆)².
Thus, we arrive at the ubiquitous sample size formula:
n = (zα + zβ)² (σ/∆)².
For two-sided alternatives, one simply replaces α by α/2 in the above expression.
283 / 352
Bayesian sample size calculations
Suppose we take the prior θ ∼ N(θ1, τ²). We let τ² = σ²/n0, where n0 (a prior sample size!) reflects the precision of the prior relative to the data. This simplifies the calculations:
N(θ|θ1, σ²/n0) × N(ȳ|θ, σ²/n) ∝ N( θ | (n0θ1 + nȳ)/(n + n0), σ²/(n + n0) ).
Let Aα(θ0, θ1) = {ȳ | P(θ < θ0|ȳ) < α}. Note:
P(θ < θ0|ȳ) = P( (√(n+n0)/σ)[θ − (n0θ1 + nȳ)/(n+n0)] < (√(n+n0)/σ)[θ0 − (n0θ1 + nȳ)/(n+n0)] )
= Φ( (√(n+n0)/σ)[θ0 − (n0θ1 + nȳ)/(n+n0)] ).
284 / 352
Bayesian sample size calculations
Thus,
Aα(θ0, θ1) = { ȳ : (√(n+n0)/σ)[θ0 − (n0θ1 + nȳ)/(n+n0)] < zα }
= { ȳ : θ0 − (n0θ1 + nȳ)/(n+n0) < (σ/√(n+n0)) zα }
= { ȳ : (1 + n0/n)θ0 − (n0/n)θ1 − ȳ < (σ/√n)√(1 + n0/n) zα }
= { ȳ : ȳ > θ0 − (n0/n)(θ1 − θ0) − (σ/√n)√(1 + n0/n) zα }.
Note that as n0 → 0 (i.e. the prior becomes vague), Aα(θ0, θ1) becomes identical to the critical region from classical hypothesis testing.
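The classical formula recovered in this vague-prior limit can be evaluated with a short helper (an illustrative Python sketch; the normal quantile is computed by bisection to keep it dependency-free, and the numeric inputs are arbitrary):

```python
import math

def z_quantile(p):
    """Standard normal quantile, by bisection on Phi(x) = 0.5*erfc(-x/sqrt(2))."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 0.5 * math.erfc(-mid / math.sqrt(2)) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def classical_n(alpha, beta, sigma, delta):
    """Classical one-sided sample size n = ((z_{1-alpha} + z_{1-beta})*sigma/delta)^2,
    rounded up to the next integer."""
    return math.ceil(((z_quantile(1 - alpha) + z_quantile(1 - beta))
                      * sigma / delta) ** 2)
```

For example, detecting ∆/σ = 0.5 with α = 0.05 and power 0.8 requires n = ((1.645 + 0.842)/0.5)² ≈ 24.7, i.e. 25 observations.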
285 / 352
Bayesian sample size calculations
The Bayesian power or Bayesian assurance δ is defined as:
δ = Pȳ(Aα(θ0, θ1)) = Pȳ{ȳ : P(θ < θ0|ȳ) < α} = Pȳ{ȳ : P(θ ≥ θ0|ȳ) > 1 − α}
= Pȳ( ȳ > θ0 − (n0/n)(θ1 − θ0) − (σ/√n)√(1 + n0/n) zα ).
286 / 352
Bayesian sample size calculations
The marginal distribution of ȳ is given by:
∫ N(ȳ|θ, σ²/n) N(θ|θ1, σ²/n0) dθ = N(ȳ|θ1, σ²/(n + n0)).
Therefore, the Bayesian power or assurance is:
Pȳ( ȳ > θ0 − (n0/n)(θ1 − θ0) − (σ/√n)√(1 + n0/n) zα )
= Pȳ( ȳ − θ1 > −(1 + n0/n)(θ1 − θ0) − (√(n + n0)/n) σ zα )
= P( Z > −√(n + n0)(1 + n0/n)(θ1 − θ0)/σ − (1 + n0/n) zα )
= Φ( √(n + n0)(1 + n0/n)(θ1 − θ0)/σ + (1 + n0/n) zα ).
287 / 352
Bayesian Assurance Curve
Rewriting the Bayesian power in terms of the relative precision n0/n and n, we obtain:
δ = Φ( √n (1 + n0/n)^{3/2} (∆/σ) + (1 + n0/n) zα ),
where ∆ = θ1 − θ0. This is often called the critical difference that needs to be detected. This leads to:
The Bayesian Power (Assurance) Curve:
δ = Φ( √n (1 + n0/n)^{3/2} (∆/σ) + (1 + n0/n) zα )
Note: For study design purposes, σ² and n0 are assumed known and the Bayesian power curve is investigated as a function of ∆ and n.
288 / 352
Concluding remarks
Given n0, the Bayesian will compute the sample size needed to detect a critical difference of ∆ with probability 1 − β as
n = arg min{n : δ(∆, n) ≥ 1 − β}.
As the prior becomes vague, n0 → 0 and:
lim_{n0→0} δ(∆, n) = Φ( √n (∆/σ) + zα ),
which is exactly the classical power curve. Now the Bayesian sample size formula coincides with the classical sample size formula. Bayesian sample size with vague prior information:
n = (zα + zβ)² (σ/∆)².
For two-sided alternatives, one simply replaces α by α/2 in the above expression.
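The assurance curve and the resulting sample size rule can be sketched as follows (an illustrative Python sketch with arbitrary inputs; with n0 = 0 it reproduces the classical power curve, matching the remark above):

```python
import math

def phi(x):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def z_quantile(p):
    """Standard normal quantile, by bisection (dependency-free sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def assurance(n, n0, delta, sigma, alpha):
    """Bayesian assurance from the slide:
    Phi( sqrt(n)(1 + n0/n)^{3/2} Delta/sigma + (1 + n0/n) z_alpha )."""
    r = 1.0 + n0 / n
    return phi(math.sqrt(n) * r ** 1.5 * delta / sigma
               + r * z_quantile(alpha))

def bayes_n(n0, delta, sigma, alpha, power):
    """Smallest n whose assurance reaches the target power."""
    n = 1
    while assurance(n, n0, delta, sigma, alpha) < power:
        n += 1
    return n
```

With n0 = 0, ∆/σ = 0.5, α = 0.05 and target 0.8, `bayes_n` returns 25, agreeing with the classical formula above.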
289 / 352 Bayesian Methods for Linear Models
Lecture 19
Sharif Mahmood

290 / 352 The Linear Model

The linear model can be written as Y = Xβ + e, where Y is n × 1, X is n × p, β is p × 1 and e ∼ N(0, σ²I).
Let M = X(X′X)⁻X′ and τ = 1/σ², where '−' denotes the generalized inverse.
The UMVUE of µ = E(Y) = Xβ is MY.
Goal: derive the posterior distributions of β and τ.

291 / 352 The NIG conjugate prior family

A popular Bayesian model builds upon the linear regression of y using conjugate priors, specifying

  p(β, σ²) = p(β|σ²)·p(σ²) = N(µβ, σ²Vβ) × IG(a, b)
           = [bᵃ / ((2π)^{p/2}·|Vβ|^{1/2}·Γ(a))] · (1/σ²)^{a + p/2 + 1} · exp[ −(1/σ²){ b + ½(β − µβ)′Vβ⁻¹(β − µβ) } ]
           ∝ (1/σ²)^{a + p/2 + 1} · exp[ −(1/σ²){ b + ½(β − µβ)′Vβ⁻¹(β − µβ) } ]

The NIG probability distribution is a joint probability distribution of a vector β and a scalar σ².

292 / 352 The NIG conjugate prior family

If (β, σ²) ∼ NIG(µ, V, a, b), then an interesting analytic form results from integrating out σ² from the joint density:

  ∫ NIG(µ, V, a, b) dσ²
    = [bᵃ / ((2π)^{p/2}·|V|^{1/2}·Γ(a))] ∫ (1/σ²)^{a + p/2 + 1} exp[ −(1/σ²){ b + ½(β − µ)′V⁻¹(β − µ) } ] dσ²
    = [bᵃ / ((2π)^{p/2}·|V|^{1/2}·Γ(a))] · Γ(a + p/2) / [ b + ½(β − µ)′V⁻¹(β − µ) ]^{a + p/2}
    = [Γ(a + p/2) / (Γ(a)·π^{p/2}·|2a·(b/a)V|^{1/2})] · [ 1 + (β − µ)′[(b/a)V]⁻¹(β − µ)/(2a) ]^{−(2a+p)/2}

This is a multivariate t density with ν = 2a degrees of freedom and scale matrix Σ = (b/a)V.

293 / 352 The likelihood

The likelihood for the model is defined, up to proportionality, as the joint probability of observing the data given the parameters. Since X is fixed, the likelihood is given by

  p(y|β, σ²) = N(Xβ, σ²I) = (1/(2πσ²))^{n/2} exp{ −(1/(2σ²))·(y − Xβ)′(y − Xβ) }

294 / 352 The posterior distribution from the NIG prior

Inference will proceed from the posterior distribution

  p(β, σ²|y) = p(β, σ²)·p(y|β, σ²) / p(y),

where p(y) = ∫∫ p(β, σ²)·p(y|β, σ²) dβ dσ² is the marginal distribution of the data.
The key to deriving the joint posterior distribution is the following easily verified multivariate completion of squares, or ellipsoidal rectification, identity:

  u′Au − 2α′u = (u − A⁻¹α)′A(u − A⁻¹α) − α′A⁻¹α,

where A is a symmetric positive definite (hence invertible) matrix.

295 / 352 The posterior distribution from the NIG prior

An application of this identity immediately reveals

  −(1/σ²)[ b + ½(β − µβ)′Vβ⁻¹(β − µβ) + ½(y − Xβ)′(y − Xβ) ]
    = −(1/σ²)[ b* + ½(β − µ*)′V*⁻¹(β − µ*) ],

using which we can write the posterior as

  p(β, σ²|y) ∝ (1/σ²)^{a + (n+p)/2 + 1} · exp[ −(1/σ²){ b* + ½(β − µ*)′V*⁻¹(β − µ*) } ],

where

  µ* = (Vβ⁻¹ + X′X)⁻¹(Vβ⁻¹µβ + X′y)
  V* = (Vβ⁻¹ + X′X)⁻¹
  a* = a + n/2
  b* = b + ½( µβ′Vβ⁻¹µβ + y′y − µ*′V*⁻¹µ* )

296 / 352 Bayesian Analysis of Linear Models

Theorem. Suppose τ is known, X is of full rank p, and π(β) ∝ 1. Then β|y ∼ Np(β̂, τ⁻¹(X′X)⁻¹), where β̂ = (X′X)⁻¹X′Y.

Proof.

  p(β|y, τ) ∝ exp{ −(τ/2)(Y − Xβ)′(Y − Xβ) }
            = exp{ −(τ/2)[ Y′(I − M)Y + (β − β̂)′X′X(β − β̂) ] }
            ∝ exp{ −(τ/2)(β − β̂)′X′X(β − β̂) }

We recognize that this is a normal kernel with mean β̂ and covariance matrix τ⁻¹(X′X)⁻¹. Thus β|y, τ ∼ Np(β̂, τ⁻¹(X′X)⁻¹).

297 / 352 Jeffreys's Prior

Theorem. When τ is known, Jeffreys's prior for β is a uniform prior, i.e., π(β) ∝ 1.

Proof. We have

  log p(y|β, τ) = −(n/2)·log(2π) + (n/2)·log τ − (τ/2)(Y − Xβ)′(Y − Xβ)

  ∂ log p/∂β = τX′Y − τ(X′X)β
  ∂² log p/∂β∂β′ = −τ(X′X)
  ∂² log p/∂τ² = −n/(2τ²)
  ∂² log p/∂β∂τ = (X′Y) − (X′X)β

298 / 352 Jeffreys's Prior

Theorem. When both β and τ are unknown, Jeffreys's joint prior for (β, τ) is π(β, τ) ∝ τ^{p/2 − 1}.

Proof.

  −E( ∂² log p/∂β∂τ ) = −E(X′Y) + (X′X)β = −(X′X)β + (X′X)β = 0

Thus I(β, τ) is block diagonal with blocks τ(X′X) and n/(2τ²). Now

  |I(β, τ)| = |τ(X′X)| × n/(2τ²) = τᵖ·|X′X|·n/(2τ²) ∝ τ^{p−2}.

Thus,

  π(β, τ) ∝ |I(β, τ)|^{1/2} = τ^{(p−2)/2} = τ^{p/2 − 1}.

Note that with p = 1, π(β, τ) ∝ τ^{−1/2}.

299 / 352 Bayes Factor computation for Linear Models

Theorem. Consider the linear model Y = Xβ + e, where e ∼ Nn(0, σ²I), with (β, τ) unknown, and suppose β|τ ∼ Np(µ0, τ⁻¹Σ0) and τ ∼ G(δ0/2, γ0/2). Then the marginal distribution of y is given by

  y ∼ Sn( δ0, Xµ0, (γ0/δ0)(I + XΣ0X′) ),

and therefore

  p(y|m) = [ Γ((n + δ0)/2) / Γ(δ0/2) ] · (πδ0)^{−n/2} · (γ0/δ0)^{−n/2} · |I + XΣ0X′|^{−1/2}
           × [ 1 + (1/δ0)·(Y − Xµ0)′{(γ0/δ0)(I + XΣ0X′)}⁻¹(Y − Xµ0) ]^{−(n + δ0)/2}.

300 / 352 Bayes Factor computation for Linear Models

Given a model m ∈ M, we have

  p(y|m) = [ Γ((n + δ0^{(m)})/2) / Γ(δ0^{(m)}/2) ] · (πδ0^{(m)})^{−n/2} · (γ0^{(m)}/δ0^{(m)})^{−n/2} · |I + X^{(m)}Σ0^{(m)}X^{(m)′}|^{−1/2}
           × [ 1 + (1/δ0^{(m)})·(Y − X^{(m)}µ0^{(m)})′{(γ0^{(m)}/δ0^{(m)})(I + X^{(m)}Σ0^{(m)}X^{(m)′})}⁻¹(Y − X^{(m)}µ0^{(m)}) ]^{−(n + δ0^{(m)})/2},

where µ0^{(m)}, Σ0^{(m)}, δ0^{(m)}, γ0^{(m)} are the hyper-parameter values under model m. The Bayes factor for comparing model 1 (m1) to model 2 (m2) is BF = p(y|m1)/p(y|m2).

301 / 352 Test of Hypothesis using Bayes factor

Theorem. If X ∼ Sp(ν, µ, Σ), where X = (X1, ..., Xp), then the marginal distributions of X are also multivariate t. Thus, partition X into X^{(1)} (r × 1) and X^{(2)} ((p − r) × 1), with µ = (µ1, µ2) and Σ partitioned conformably into blocks Σ11, Σ12, Σ21, Σ22. Then X^{(1)} ∼ Sr(ν, µ1, Σ11).
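Before turning to testing, note that the NIG updating equations of slide 296 translate directly into code. A sketch with NumPy (variable names are mine, not from the lectures); with a very flat prior the posterior mean essentially reproduces the OLS fit:

```python
import numpy as np

def nig_posterior(y, X, mu0, V0, a0, b0):
    """Conjugate NIG update for the linear model (slide 296):
    mu* = (V0^-1 + X'X)^-1 (V0^-1 mu0 + X'y),   V* = (V0^-1 + X'X)^-1,
    a* = a0 + n/2,
    b* = b0 + (mu0' V0^-1 mu0 + y'y - mu*' V*^-1 mu*) / 2."""
    V0inv = np.linalg.inv(V0)
    Vstar = np.linalg.inv(V0inv + X.T @ X)
    mustar = Vstar @ (V0inv @ mu0 + X.T @ y)
    astar = a0 + len(y) / 2.0
    bstar = b0 + 0.5 * (mu0 @ V0inv @ mu0 + y @ y
                        - mustar @ np.linalg.inv(Vstar) @ mustar)
    return mustar, Vstar, astar, bstar

# Simulated data with true beta = (1, 2); an intercept plus one covariate.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)

# Nearly flat prior: the posterior mean should sit on top of OLS.
mu, V, a, b = nig_posterior(y, X, np.zeros(2), 1e6 * np.eye(2), a0=1.0, b0=1.0)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

With an informative Vβ the same function shrinks µ* toward µβ, which is the whole point of the conjugate prior.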
302 / 352 Test of Hypothesis using Bayes factor

Suppose we wish to test H0: β1 = 0 versus H1: β1 ≠ 0.

Under H0: β0|τ ∼ N(0, 9), τ ∼ G(δ0/2, γ0/2), δ0 = γ0 = 1.
Under H1: (β0, β1)′|τ ∼ N2(0, Σ0) with Σ0 = (9, .5; .5, 9), τ ∼ G(δ0/2, γ0/2), δ0 = γ0 = 1.

Let m1 ≡ the model under H0: (β0), τ; and m2 ≡ the model under H1: (β0, β1), τ.

303 / 352 Computation in R

A common prediction problem in medical practice is prognosis. Prognosis is the forecasting of the probable course and outcome of a disease, in particular the chances of recovery. Ages and survival times of 20 patients are given; the objective is to check whether there is any relationship between age and survival time.

BF = 0.0396, which favors H1, with 1/BF = 25.23. This is strong evidence against H0, i.e. against β1 = 0.

304 / 352 Fitting a Bayesian Regression Model with WinBUGS

Soft drink delivery times (Montgomery and Peck, 1992):
◮ A quality assurance study is set up to estimate the time needed by each employee to refill an automatic vending machine.
◮ The response variable is the total service time (measured in minutes) of each machine.
◮ The two predictor variables identified are the number of cases of stocked products and the distance walked by the employee (measured in feet).
◮ A dataset of 25 observations was collected.

305 / 352 Results

The posterior mean of R² is equal to 0.95, indicating considerable improvement in the prediction of delivery times with the covariates Cases and Distance. Taking the posterior means as point estimates, the prediction model is

  Expected time = 2.36 + 1.6 × Cases + 0.015 × Distance

Both explanatory variables are important in predicting delivery time: neither 95% credible interval includes 0.
◮ For each additional case stocked by the employee, the delivery time is a posteriori expected to increase by 1.6 minutes; with probability 95%, the increase lies in (1.3, 2.0).
◮ The delivery time is a posteriori expected to increase by 1.5 minutes for every 100 feet of additional walking distance; with probability 95%, the increase lies in (0.7, 2.2).

306 / 352 Results (contd.)

The interpretation of β0 is meaningless here, since a zero value is nonsensical for both covariates: the delivery employee will always have to stock some cases of products in the machine and will walk at least a small distance to reach the delivery location. Consider instead a typical or representative delivery route: such a route will take 22.4 minutes on average and, with probability 95%, between 21.1 and 23.8 minutes.

307 / 352 Bayesian Methods for Generalized Linear Models
Lecture 20
Sharif Mahmood

308 / 352 Review of the assumptions for LM and GLM

Model: Y = E(Y) + e, where Y is an n-vector, X is an n × p matrix and β is a p-vector.

  Component              LM                                 GLM
  Random component       Yi ∼ N(µ, σ²), iid, σ constant     Yi ∼ exponential family, iid
  Systematic component   η = Xβ                             η = Xβ
  Link function          E(Y) ≡ µ = η                       E(Y) ≡ µ = g⁻¹(η)

309 / 352 The exponential family of distributions

A distribution belongs to the exponential family if its pdf, or pmf, can be written as

  f(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) }

Here θ is the canonical parameter and φ is a scale parameter; a(φ) is positive and continuous, b(θ) is twice differentiable with a positive second derivative, and c(y, φ) is independent of the parameter θ. Generalized linear models are a family of probability models built on this class of exponential family distributions.

310 / 352 Normal Distribution

  f(y; µ, σ²) = (1/√(2πσ²))·exp{ −(y − µ)²/(2σ²) }
              = exp{ −½ ln(2πσ²) }·exp{ −(y − µ)²/(2σ²) }
              = exp{ −(y² − 2yµ + µ²)/(2σ²) − ½ ln(2πσ²) }
              = exp{ (yµ − µ²/2)/σ² − ½[ y²/σ² + ln(2πσ²) ] }

is an exponential family distribution with θ = µ and φ = σ². Thus, θ is the 'location' parameter (mean) and φ the 'scale' parameter.

311 / 352 Bernoulli Distribution

  f(y; π) = π^y (1 − π)^{1−y}, where π = Pr(Y = 1).
Now

  f(y; π) = π^y (1 − π)^{1−y}
          = exp{ y ln π + (1 − y) ln(1 − π) }
          = exp{ y ln[π/(1 − π)] + ln(1 − π) }

Thus θ = ln[π/(1 − π)] = logit(π), and φ = 1.

312 / 352 Poisson Distribution

  f(y; λ) = e^{−λ}λ^y / y! = exp{ y ln λ − λ − ln(y!) }

Thus θ = ln(λ) and φ = 1.

313 / 352 Examples of the exponential family

  Distribution       θ               φ     a(φ)   b(θ)           c(y, φ)
  Normal             µ               σ²    φ      θ²/2           −½[ y²/φ + ln(2πφ) ]
  Poisson            ln(µ)           1     φ      exp(θ)         −ln(y!)
  Binomial           ln(µ/(n − µ))   1     φ      n ln(1 + e^θ)  ln C(n, y)
  Gamma              −1/µ            1/ν   φ      −ln(−θ)        ν ln(νy) − ln y − ln Γ(ν)
  Inverse Gaussian   −1/(2µ²)        σ²    φ      −√(−2θ)        −½[ ln(2πφy³) + 1/(φy) ]

314 / 352 Properties of the exponential family

The representation of the exponential family distributions makes it easy to see that the log likelihood is

  log L(y; θ, φ) = [yθ − b(θ)]/a(φ) + c(y, φ)

Using calculus, various properties of the exponential family distributions can be derived. In particular, it can be shown that

  E(Y) = b′(θ)   and   Var(Y) = b″(θ)·a(φ),

where b′(θ) = ∂b(θ)/∂θ and b″(θ) = ∂²b(θ)/∂θ².

315 / 352 The systematic component

Given covariates X1, ..., Xk, the effect of the covariates on the expected value of Y is expressed through the linear predictor

  η = β0 + Σ_{i=1}^k βi·Xi = Xβ

The marginal expectation, E(Y), of each response is then modelled as a function of covariates.

316 / 352 The Link function

The link function, g(·), describes the relation between the linear predictor, η, and the expected value of Y (denoted by µ):

  g(µ) = η = β0 + Σ_{i=1}^k βi·Xi

g(µ) is a link function if the domain of g is the parameter space of µ and the range of g is an interval in (−∞, ∞); g(µ) must be strictly monotone and differentiable everywhere in the domain. For instance, if µ ∈ (0, ∞) then η = ln µ is a link function; if µ ∈ (0, 1) then η = ln[µ/(1 − µ)], the logit link.

317 / 352

318 / 352 Prior specification

The prior distributions on the parameters β are specified given the dispersion parameter φ. Usually a prior of the type βj|φ ∼ N(µβj, σ²βj·φ) is considered. In the linear model, φ = σ² ∼ IG(a, b), giving a conjugate prior distribution. This prior assumes prior independence between model parameters. In the case of collinear covariates, a multivariate normal prior β|φ ∼ N(µβ, Σβ) would be more appropriate. An extension to this prior sets Σβ = c²·[−H(β̂)]⁻¹, where β̂ is the maximum likelihood estimate and H(β) is the second-derivative matrix of log f(y|β, φ). In a GLM we get H(β) = X′HX, where H is an n × n diagonal matrix with elements

  hi = (∂µi/∂ηi)² / [ ai(φ)·b″(θi) ]

319 / 352 Posterior distribution of a GLM

The likelihood function of a GLM is given by

  f(y|β, φ) = exp{ Σ_{i=1}^n [ yi·gθ⁻¹(Xiβ) − b(gθ⁻¹(Xiβ)) ]/a(φ) + Σ_{i=1}^n c(yi, φ) }

Using the multivariate normal prior, the posterior can be written as

  f(β, φ|y) = exp{ Σ_{i=1}^n [ yi·gθ⁻¹(Xiβ) − b(gθ⁻¹(Xiβ)) ]/a(φ) + Σ_{i=1}^n c(yi, φ) }
              × exp{ −½ log|Σβ| − ½(β − µβ)′Σβ⁻¹(β − µβ) }
            ∝ exp{ Σ_{i=1}^n [ yi·gθ⁻¹(Xiβ) − b(gθ⁻¹(Xiβ)) ]/a(φ) − ½(β − µβ)′Σβ⁻¹(β − µβ) }

This does not have a closed form; MCMC methods are used to evaluate posterior summaries.

320 / 352 Statistical Analysis with Missing Data
Lecture 21
Sharif Mahmood

321 / 352 Missing Data and Dropout

Missing data arise whenever one or more of the measurements are incomplete, in the sense that some intended measurements are not obtained. Let Y denote the complete response vector, which can be partitioned into two sub-vectors:
1. Y(o), the measurements observed
2. Y(m), the measurements that are missing
If there were no missing data, we would have observed the complete response vector Y. Instead, we get to observe Y(o).
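To see why observing only Y(o) can be misleading, a small simulation helps (the distribution, cutoff, and variable names are illustrative): when low responses are the ones that go missing, the mean of the observed part Y(o) overstates the complete-data mean.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=0.0, scale=1.0, size=10_000)   # complete responses Y

# Missingness indicator: low values go missing (not missing at random in Y).
observed = y > -0.5
y_obs = y[observed]       # Y(o): what we actually get to analyse
y_mis = y[~observed]      # Y(m): never seen

bias = y_obs.mean() - y.mean()   # positive: observed mean is shifted upward
```

The same partition with an indicator independent of y would give a bias near zero, which previews the missing data mechanisms discussed next.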
322 / 352 Problem of Missing Data

The main problem that arises with missing data is that the distribution of the observed data may not be the same as the distribution of the complete data. Consider the following simple illustration.

Suppose we intend to measure subjects at 6 months (Y1) and 12 months (Y2) post treatment. All of the subjects return for measurement at 6 months, but many do not return at 12 months. If subjects fail to return for measurement at 12 months because they are not well (say, their values of Y2 are low), then the distribution of observed Y2's will be shifted toward higher values relative to the distribution of Y2's in the population of interest.

In general, the situation may often be quite complex, with some missingness unrelated to either the observed or unobserved response, some related to the observed, some related to the unobserved, and some related to both.

323 / 352 Reasons for Missing

A particular pattern of missingness that is common in longitudinal studies is 'dropout' or 'attrition': an individual is observed from baseline up until a certain point in time, thereafter no more measurements are made. Possible reasons for dropout:
1. Recovery
2. Lack of improvement or failure
3. Undesirable side effects
4. External reasons unrelated to specific treatment or outcome
5. Death

324 / 352 Examples

In clinical trials, missing data can arise from a variety of circumstances:

a. Late entrants: If the study has staggered entry, at any interim analysis some individuals may have only partial response data. Usually, this sort of missing data does not introduce any bias.

b. Dropout: Individuals may drop out of a clinical trial because of side effects or lack of efficacy. Usually, this type of missing data is of concern, especially if dropout is due to lack of efficacy, since dropout for lack of efficacy suggests that those who drop out come from the lower end of the spectrum.
Dropout due to side effects may or may not be a problem, depending upon the relationship between the side effects and the outcome of interest.

325 / 352 Missing Data Mechanisms

In order to obtain valid inferences from incomplete data, the mechanism (probability model) producing the missing observations must be considered. A hierarchy of three different types of missing data mechanisms can be distinguished:
1. Data are missing completely at random (MCAR) when the probability that an individual value will be missing is independent of Y(o) and Y(m).
2. Data are missing at random (MAR) when the probability that an individual value will be missing is independent of Y(m) (but may depend on Y(o)).
3. Missing data are nonignorable when the probability that an individual value will be missing depends on Y(m).

Note: Under mechanisms (1) and (2), the missing data mechanism is often referred to as 'ignorable'.

326 / 352 Methods of Handling Missing Data

1. Complete Case Methods: These methods omit all cases with missing values at any measurement occasion.
   Drawbacks:
   1. Can result in a very substantial loss of information, which has an impact on precision and power.
   2. Can give severely biased results if the complete cases are not a random sample of the population of interest; i.e., complete case methods require the MCAR assumption.

2. All Available Case Methods: This is a general term for a variety of different methods that use the available information to estimate means and covariances (the latter based on all available pairs of cases). In general, these methods are more efficient than complete case methods (and can be fully efficient in some cases).

327 / 352 Methods of Handling Missing Data

   Drawbacks:
   1. The sample base of cases changes over measurement occasions.
   2. Pairwise available-case estimates of correlations can lie outside (−1, 1).
   3. Available case methods require the MCAR assumption.

3. Imputation Methods: These are methods that fill in the missing values.
Once imputation is done, the analysis is straightforward.
   1. Stratification: Stratify into homogeneous subsets or classes; impute the mean value of the stratum, or randomly draw a data value from those in the stratum.
   2. Regression imputation: Estimate the response via an appropriately chosen regression model.

328 / 352 Methods of Handling Missing Data

   Drawbacks:
   1. Systematically underestimate the variances and covariances.
   2. Treating imputed data as real data leads to standard errors that are too small (multiple imputation addresses this problem).
   3. Performance can be unreliable, and these methods usually require the MCAR assumption.
   4. Can be fairly ad hoc, e.g. LVCF.

   Last Value Carried Forward (LVCF): Set the response equal to the last observed value (or sometimes the worst observed value). This method of imputation is only valid under very strong assumptions. In general, LVCF is not recommended!

329 / 352 Methods of Handling Missing Data

4. Likelihood-based Methods: At least in principle, maximum likelihood estimation for incomplete data is the same as for complete data, and it provides valid estimates and standard errors in more general circumstances than methods (1), (2), or (3). That is, under clearly stated assumptions, likelihood-based methods have optimal statistical properties. If missing data are 'non-ignorable', likelihood-based inference must also explicitly (and correctly) model the non-response process; however, with 'non-ignorable' missingness the methods are very sensitive to unverifiable assumptions.

330 / 352 Methods of Handling Missing Data

5. Weighting Methods: Base estimation on the observed data, but weight the data to account for missingness. Basic idea: some sub-groups of the population are under-represented in the observed data, therefore weight these up to compensate for the under-representation. For example, with dropout, one can estimate the weights as a function of the individual's covariates and responses up until the time of dropout.
This approach is valid provided the model for dropout is correct, i.e. provided the correct weights are available.

331 / 352 Imputation example

  ID  Age  Sex  Edu  Race  SCL  ADL  Pain  Comorb
  1   70   F    16   W     3.8  4    8     5
  2   70   F    16   W     0.6  1    1     1
  3   60   M    12   B     1.1  2    3     1
  4   85   F    21   W     1.3  2    3     1
  5   70   M    21   W     ⋆    2    4     3

How to impute the missing SCL for patient #5?
◮ Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
◮ By age: (3.8 + 0.6)/2 = 2.2
◮ By sex: 1.1
◮ By education: 1.3
◮ By race: (3.8 + 0.6 + 1.3)/3 = 1.9
◮ By ADL: (1.1 + 1.3)/2 = 1.2
Who is/are in the ⋆ with #5?

332 / 352 Imputation - regression

Model: SCL = β̂0 + β̂1×Age + β̂2×Sex + β̂3×Edu + β̂4×Race + β̂5×ADL + β̂6×Pain + β̂7×Comorb

Fit the model to the observed data: β̂0 = −0.8, β̂1 = 0.02, β̂2 = −0.5, β̂3 = 0.05, β̂4 = −0.6, β̂5 = 0.1, β̂6 = 0.1, β̂7 = 0.05.

Plug in the information for #5 to derive the predicted value:
  (−0.8) + (0.02)×70 + (−0.5)×0 + (0.05)×21 + (−0.6)×1 + (0.1)×2 + (0.1)×4 + (0.05)×3 = 1.8 → predicted SCL

Notes:
◮ Predicted values might fall outside the natural range of the outcome (e.g. SCL > 4 or SCL < 0).
◮ For ordinal outcomes, the predicted values might not be plausible (e.g. number of people living in the house = 2.7).

333 / 352 Propensity score

Measure similarity by "the likelihood of being observed/missing"; use logistic regression models to estimate this likelihood.

Dependent variable:
  Z = 1 if a subject's outcome is observed; Z = 0 if a subject's outcome is missing.

Independent variables are anything that might be associated with the outcome being missing (Z = 0):
◮ Demographic information
◮ Baseline characteristics

334 / 352 Propensity score (cont.)

Model:
◮ p = prob(Z = 1)
◮ log[p/(1 − p)] = β0 + β1X1 + β2X2 + ... + βkXk

Z has no missing values, and X1, ..., Xk all must have non-missing values. The statistical significance of the β's is not important. The predicted p's derived from the model are the propensity scores.

335 / 352 Propensity score - example

Dependent variable: Y = 12-month SCL score; Z = 1 if Y is observed, Z = 0 if Y is missing.
Independent variables: X1 = age (Age); X2 = sex (= 1 if male, = 0 if female) (Sex); X3 = number of chronic conditions (NumC); X4 = baseline SCL score (SCL00).

Model: log[p/(1 − p)] = β0 + β1X1 + β2X2 + β3X3 + β4X4
Result: β0 = 0.31, β1 = 0.003, β2 = −0.58, β3 = −0.25, β4 = 0.25

336 / 352 Propensity score - example

Model: log[p/(1 − p)] = (0.31) + (0.003)×Age + (−0.58)×Sex + (−0.25)×NumC + (0.25)×SCL00

Derive the propensity scores for subjects A and B:
Subject A: 70-year-old male, 3 chronic conditions, SCL00 = 1.7:
  (0.31) + (0.003)×70 + (−0.58)×1 + (−0.25)×3 + (0.25)×1.7 = −0.385
  ⇒ log[p/(1 − p)] = −0.385 → p = 0.405
Subject B: 85-year-old female, 4 chronic conditions, SCL00 = 0.7:
  (0.31) + (0.003)×85 + (−0.58)×0 + (−0.25)×4 + (0.25)×0.7 = −0.26
  ⇒ log[p/(1 − p)] = −0.26 → p = 0.435

337 / 352 Imputation - hot-deck

  ID  Age  Sex  Edu  Race  SCL  ADL  Pain  Comorb  Propensity
  1   70   F    16   W     3.8  4    8     5       0.18
  2   70   F    16   W     0.6  1    1     1       0.42
  3   60   M    12   B     1.1  2    3     1       0.39
  4   85   F    21   W     1.3  2    3     1       0.22
  5   70   M    21   W     ⋆    2    4     3       0.41

How to impute the missing SCL for patient #5?
◮ 4 strata → closest to #2 → impute with 0.6
◮ 2 strata → closest to both #2 and #3 → impute with a randomly selected value from (0.6, 1.1)
◮ The method is called "hot-deck"; #2 and #3 are called "donors".

Common approach:
◮ Stratify the sample by the propensity scores (e.g. 5 strata).
◮ Randomly select a donor from the same stratum and impute the missing value with the donor's observed value.

338 / 352 Imputation

For each missing value, impute m data points:
◮ m > 1, usually m = 5
◮ For single imputation, m = 1

What's wrong with single imputation?
◮ Imputed values are derived from the observed sample, and thus the imputed sample is more homogeneous.
◮ Variances are under-estimated:
    (10, 20, 30) → mean = 20, variance = 100
    (10, 20, 30, 20, 20, 20) → mean = 20, variance = 40
◮ More likely to yield a biased result.

Advantages of multiple imputation:
◮ Adds the variation across the m data sets "back" into the estimation of the variance.
◮ The result is less likely to be biased.

339 / 352 Assumptions for Multiple Imputation (MI)

The MI procedure assumes that the data are from a continuous multivariate distribution and contain missing values that can occur on any of the variables. The data are from a multivariate normal distribution when either the regression method or the MCMC method is used.

Suppose D is the n × p matrix of complete data, which is not fully observed, and denote the observed part of D by Dobs and the missing part by Dmis. To be more precise, suppose that M is the n × p matrix of response indicators whose elements are zero or one depending on whether the corresponding elements of D are missing or observed. Furthermore, the MI procedures assume that the parameters θ of the data model and the parameters φ of the model for the missing data indicators are distinct.

340 / 352 Missing Data

Suppose we have missing data. Define D as our data matrix and M as our missingness matrix:

        y    x1   x2   x3             y  x1  x2  x3
      [ 1   2.5  432   0 ]          [ 1   1   1   1 ]
      [ 5   3.2  543   1 ]          [ 1   1   1   1 ]
  D = [ 2   1.2  219   1 ]      M = [ 1   1   1   1 ]
      [ ⋆   1.9   ⋆    1 ]          [ 0   1   0   1 ]
      [ ⋆   1.2  108   0 ]          [ 0   1   1   1 ]
      [ ⋆   7.7   95   1 ]          [ 0   1   1   1 ]

One non-Bayesian approach to dealing with missing data is multiple imputation.

341 / 352 Multiple Imputation

We need a method for filling in the missing cells of D as a first step, before we go to the analysis stage.

1. Assume a joint distribution for Dobs and M:

     P(Dobs, M|φ, γ) = ∫ p(D|φ)·p(M|Dobs, Dmis, γ) dDmis

   If we assume MAR, then

     P(Dobs, M|φ, γ) ∝ ∫ p(D|φ) dDmis

2. Find φ, which characterizes the full D.
     L(φ|Dobs) = Π_{i=1}^n ∫ p(D_{i,obs}|φ) dDmis

   How do we do this integral?

342 / 352 Multiple Imputation

Assume normality, since the marginals of a multivariate normal are normal:

     L(µ, Σ|Dobs) = Π_{i=1}^n N(D_{i,obs}|µ_obs, Σ_obs)

3. Find µ̂, Σ̂ and their distribution via the EM algorithm and bootstrapping.
4. Draw m values of µ and Σ, then use them to predict values for Dmis.

What we end up with is m datasets with missing values imputed. We then run our regular analyses on the m datasets and combine the results using Rubin's rules.

343 / 352 Bayesian Approach to Missing Data

Bayesian paradigm: everything unobserved is a random variable, so we can set up the missing data as a parameter that we need to find:

     p(Dmis, φ, γ|Dobs, M) ∝ p(Dobs|Dmis, φ)·p(Dmis|φ)·p(M|Dobs, Dmis, γ)·p(γ)·p(φ)

If we assume MAR:

     p(Dmis, φ|Dobs, M) ∝ p(Dobs|Dmis, φ)·p(Dmis|φ)·p(φ)

Use Gibbs sampling or M-H to sample both Dmis and φ. We don't have to assume normality of D to integrate over Dmis; we can just drop the draws of Dmis.

344 / 352 Bayesian Approach to Missing Data

We can also incorporate both imputation and the analysis in the same model. Assuming MAR,

     p(θ, Dmis, φ|Dobs, M) ∝ p(y_obs, y_mis|θ)·p(Dobs|Dmis, φ)·p(Dmis|φ)·p(φ)·p(θ)

Again, find the posterior via Gibbs sampling or M-H. Thus, we can easily set up an application-specific Bayesian model to incorporate missing data.

345 / 352 EM algorithm
Lecture 22
Sharif Mahmood

346 / 352 EM algorithm

This is an algorithm that is used to find the MLE when the data can be viewed as simpler if additional observations had been made. The data (both observed and missing/imaginary) are classified as:
  Y = observed data
  Z = missing data
  X = (Y, Z) = complete data

The algorithm has two steps:

(E) Expectation step: Find the expectation of the complete-data likelihood, conditional on the observed data. This involves computing the conditional expectation of the missing data. The calculation is made using the current estimates of the parameters.
(M) Maximization step: Find θ to maximize the complete-data likelihood.

347 / 352 Example A

✑ Suppose we have observations Y1, ..., Yn from two groups with known densities f0(y) and f1(y). Let Zi = 1 if the observation is from group 1 and 0 otherwise. The binary labels Z1, ..., Zn are missing. If the fraction from group "1" is p, then the pdf of each observation is

  f(y) = Σ_z f(y|Z = z)·P(Z = z) = (1 − p)·f0(y) + p·f1(y)

This is called a finite mixture model. If we introduce some parameters into the component densities, fj(y|θ), j = 0, 1, then it is challenging to find the MLEs for this model.

348 / 352 Example A

Let us see how to use the EM algorithm to turn this into a very simple task. The complete data are Xi = (Yi, Zi). If we could observe the Zi's, the MLE for the only parameter in the model, p, would be

  p̂ = Σi Zi / n.

The general formulation: define f(y, z|θ) as the complete-data likelihood. (Throughout this section we use y and z to indicate possibly vector-valued quantities; yi and zi are elements of such vectors.)

349 / 352 The EM Algorithm

(0) Pick a starting value θ0 for the parameter. Iterate the following steps for m = 1, 2, ..., until the algorithm converges.

(E) Calculate

  Q(θ, θm) = E[ log{ f(Y, Z|θ) / f(Y, Z|θm) } | Y, θm ]

Here
◮ the expectation is over the missing data Z,
◮ Y = y is treated as fixed, because we condition on it,
◮ θm is used as the true value of the parameter in the expectation.

(M) View Q(θ, θm) as a function of θ with θm fixed. Find θm+1, the value that maximizes Q(θ, θm).

Often, with careful selection of Z, there is a closed-form solution to the M-step. If there is, the EM algorithm is a very convenient tool. Alternatively, the M-step may employ a well-studied and reliable maximization approach.

350 / 352

351 / 352 References

Albert, J. Bayesian Computation with R, 2nd edn. New York: Springer-Verlag, 2009.
Carlin, B.P. and Louis, T.A. Bayesian Methods for Data Analysis, 3rd edn.
Texts in Statistical Science Series. Chapman and Hall, 2008.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. Bayesian Data Analysis, 2nd edn. Texts in Statistical Science Series. Chapman and Hall, 2004.
Hoff, P.D. A First Course in Bayesian Statistical Methods. Springer, 2009.

352 / 352