Computational Bayesian Statistics
An Introduction
Maria Antónia Amaral Turkman, Carlos Daniel Paulino, Peter Müller
Editorial Board
Nancy Reid (University of Toronto)
Ramon van Handel (Princeton University)
Xuming He (University of Michigan)
Susan Holmes (Stanford University)
IMS Textbooks give introductory accounts of topics of current concern suitable for
advanced courses at master’s level, for doctoral students and for individual study. They
are typically shorter than a fully developed textbook, often arising from material
created for a topical course. Lengths of 100–290 pages are envisaged. The books
typically contain exercises.
In collaboration with the International Society for Bayesian Analysis (ISBA),
selected volumes in the IMS Textbooks series carry the “with ISBA” designation at the
recommendation of the ISBA editorial representative.
Peter Müller
University of Texas, Austin
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi – 110025, India
79 Anson Road, #06-04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781108481038
DOI: 10.1017/9781108646185
© Maria Antónia Amaral Turkman, Carlos Daniel Paulino, and Peter Müller 2019
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2019
Printed and bound in Great Britain by Clays Ltd, Elcograf S.p.A.
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-48103-8 Hardback
ISBN 978-1-108-70374-1 Paperback
Additional resources for this publication at www.cambridge.org/9781108481038
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
1 Bayesian Inference
1.1 The Classical Paradigm
1.2 The Bayesian Paradigm
1.3 Bayesian Inference
1.3.1 Parametric Inference
1.3.2 Predictive Inference
1.4 Conclusion
Problems
5 Model Assessment
5.1 Model Criticism and Adequacy
5.2 Model Selection and Comparison
5.2.1 Measures of Predictive Performance
5.2.2 Selection by Posterior Predictive Performance
5.2.3 Model Selection Using Bayes Factors
5.3 Further Notes on Simulation in Model Assessment
5.3.1 Evaluating Posterior Predictive Distributions
5.3.2 Prior Predictive Density Estimation
5.3.3 Sampling from Predictive Distributions
Problems
9 Software
9.1 Application Example
9.2 The BUGS Project: WinBUGS and OpenBUGS
9.2.1 Application Example: Using R2OpenBUGS
9.3 JAGS
9.3.1 Application Example: Using R2jags
9.4 Stan
9.4.1 Application Example: Using RStan
9.5 BayesX
9.5.1 Application Example: Using R2BayesX
9.6 Convergence Diagnostics: the Programs CODA and BOA
9.6.1 Convergence Diagnostics
9.6.2 The CODA and BOA Packages
9.6.3 Application Example: CODA and BOA
9.7 R-INLA and the Application Example
9.7.1 Application Example
Problems
Preface

This book is based on lecture notes for a short course that was given at the XXII Congresso da Sociedade Portuguesa de Estatística. In the translation from the original Portuguese text we have added material on sequential Monte Carlo, Hamiltonian Monte Carlo, transdimensional Markov chain Monte Carlo (MCMC), and variational Bayes, and we have introduced problem sets. The inclusion of problems makes
the book suitable as a textbook for a first graduate-level class in Bayesian
computation with a focus on Monte Carlo methods. The extensive discus-
sion of Bayesian software makes it useful also for researchers and graduate
students from beyond statistics.
The core of the text lies in Chapters 4, 6, and 9 on Monte Carlo methods, MCMC methods, and Bayesian software. Chapters 5, 7, and 8 include additional material on model validation and comparison, transdimensional MCMC, and conditionally Gaussian models. Chapters 1 through 3 introduce the basics of Bayesian inference and could be covered fairly quickly by way of introduction; these chapters are intended primarily for review and to introduce notation and terminology. For a more in-depth introduction we recommend the textbooks by Carlin and Louis (2009), Christensen et al. (2011), Gelman et al. (2014a), or Hoff (2009).
statistics and for the opportunity to give a minicourse at the 22nd confer-
ence of the society. We also acknowledge the institutional support from
the Universidade de Lisboa through the Centro de Estatística e Aplicações
(PEst-OE/MAT/UI0006/2014, UID/MAT/00006/2013), in the Department
of Statistics and Operations Research in the Faculdade de Ciências and
of the Department of Mathematics in the Instituto Superior Técnico. We
would like to acknowledge that the partial support by the Fundação para a
Ciência e Tecnologia through various projects over many years enabled us
to build up this expertise in Bayesian statistics.
Finally, we would like to dedicate this book to Professor Bento Murteira
to whom the development of Bayesian statistics in Portugal owes a lot. In
fact, Chapter 1 in this book reflects in many ways the flavor of his writings.
1
Bayesian Inference
under the principle of repeated sampling, that is, with respect to the per-
formance under infinitely many hypothetical repetitions of the experiment
carried out under identical conditions. One of the aspects of this principle
is the use of frequencies as a measure of uncertainties, that is, a frequentist
interpretation of probability. See Paulino et al. (2018, section 1.2) for a review of this and other interpretations of probability.
In the case of parametric inference, in answer to question (2) above, we
need to consider first the question of point estimation, which, grosso modo, is: Given a sample X = (X1, X2, . . . , Xn), how should one “guess,” estimate, or approximate the true value θ through an estimator T(X1, X2, . . . , Xn)? The estimator should have desirable properties such as unbiasedness, consistency, sufficiency, efficiency, etc.
For example, with X ≡ Rⁿ, the estimator T(X1, X2, . . . , Xn) based on a random sample is said to be centered or unbiased if

$$E\{T \mid \theta\} = \int_{\mathbb{R}^n} T(x_1, x_2, \ldots, x_n)\, \prod_{i=1}^{n} f(x_i \mid \theta)\, dx_1\, dx_2 \cdots dx_n = \theta, \quad \forall\, \theta \in \Theta.$$
interval on the real line (now with real numbers as lower and upper limits).
$$(T^{*}(x_1, x_2, \ldots, x_n),\ T^{**}(x_1, x_2, \ldots, x_n)),$$

and the probability

$$P\{T^{*}(x_1, x_2, \ldots, x_n) < \theta < T^{**}(x_1, x_2, \ldots, x_n) \mid \theta\} = 1 - \alpha, \quad 0 < \alpha < 1,$$

is no longer meaningful. In fact, since θ has an unknown but fixed value, this probability can only be 1 or 0, depending upon whether the true value of θ is or is not in the real interval $(T^{*}(x_1, x_2, \ldots, x_n),\ T^{**}(x_1, x_2, \ldots, x_n))$.
Of course, since θ is unknown, the investigator does not know which situ-
ation applies. However, a classical statistician accepts the frequentist inter-
pretation of probability and invokes the principle of repeated sampling in
the following way: If one imagines a repetition of the sampling and infer-
ence process (each sample with n observations) a large number of times,
then in (1 − α) 100% of the repetitions the numerical interval will include
the value of θ.
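This repeated-sampling reading of a confidence interval can be checked numerically. The following Python sketch uses a toy setup of our own (not an example from the text): a N(θ, 1) model with known variance and the usual interval x̄ ± 1.96/√n.

```python
import numpy as np

# Toy illustration of the repeated-sampling principle (assumed setup):
# data X_1,...,X_n ~ N(theta, 1) with known variance, and the classical
# 95% interval x_bar +/- 1.96/sqrt(n). Over many repetitions of the whole
# sampling process, about 95% of the intervals cover the fixed theta.
rng = np.random.default_rng(0)
theta, n, reps = 5.0, 25, 10_000
samples = rng.normal(theta, 1.0, size=(reps, n))
x_bar = samples.mean(axis=1)
half = 1.96 / np.sqrt(n)
coverage = np.mean((x_bar - half < theta) & (theta < x_bar + half))
```

Note that each individual realized interval either covers θ or it does not; only the long-run frequency is close to 0.95.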
Another instance of classical statistical inference is a parametric hypoth-
esis test. In the course of scientific investigation one frequently encounters,
in the context of a certain theory, the concept of a hypothesis about the
value of one (or multiple) parameter(s), for example in the symbols
H0 : θ = θ0 .
This raises the following fundamental question: Do the data (x1 , x2 , . . . , xn )
support or not support the proposed hypothesis? This hypothesis is tradi-
tionally referred to as the null hypothesis. Also here the classical solution
is again based on the principle of repeated sampling if one follows the
Neyman–Pearson theory. It aims to find a rejection region W (critical re-
gion) defined as a subset of the sample space, W ⊂ X, such that
$$(X_1, X_2, \ldots, X_n) \in W \ \Rightarrow\ \text{rejection of } H_0.$$
What does it mean that the critical region is associated with a type-I error equal to, for example, 0.05? The investigator cannot know whether a
false or true hypothesis is being rejected when a particular observation falls
into the critical region and the hypothesis is thus rejected. However, being a
classical statistician the investigator is convinced that under a large number
of repetitions and if the hypothesis were true, then only in 5% of the cases
would the observation fall into the rejection region. What does it mean that
the critical region is associated with a type-II error equal to, say 0.10? Sim-
ilarly, when a particular observation is not in the rejection region and thus
the hypothesis is not rejected, then the investigator cannot know whether
a true or false hypothesis is being accepted. Being a classical statistician,
the investigator can affirm that under a large number of repetitions of the
entire process and if the hypothesis were in fact false, only in 10% of the
cases would the observation not fall into the rejection region.
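The two error rates can be illustrated with a small simulation (again a toy N(θ, 1) setup of our own, not the book's): the rejection region W = {|x̄| > 1.96/√n} for H0: θ = 0 has type-I error near 0.05, while its type-II error depends on the true θ under the alternative.

```python
import numpy as np

# Assumed toy setup: X_i ~ N(theta, 1), H0: theta = 0, rejection region
# W = { |x_bar| > 1.96 / sqrt(n) }.
rng = np.random.default_rng(1)
n, reps = 25, 10_000
crit = 1.96 / np.sqrt(n)
# Type-I error: repetitions generated under a true H0 (theta = 0)
x_bar0 = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
type1 = np.mean(np.abs(x_bar0) > crit)
# Type-II error: repetitions generated under a false H0 (true theta = 0.7)
x_bar1 = rng.normal(0.7, 1.0, size=(reps, n)).mean(axis=1)
type2 = np.mean(np.abs(x_bar1) <= crit)
```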
In the following discussion, it is assumed that the reader is familiar with
at least the most elementary aspects of how classical inference approaches
estimation and hypothesis testing, which is therefore not discussed here in
further detail.
lead to certain loss. The strongest exponent of the concept of personal prob-
abilities is, however, de Finetti. In discussing the Bayesian paradigm and its
application to statistics, one must also cite Harold Jeffreys, who, reacting
to the predominantly classical position in the middle of the century, besides
inviting disapproval, managed to resurrect Bayesianism, giving it a logical
basis and putting forward solutions to statistical inference problems in his
time. From there the number of Bayesians grew rapidly, and it became impossible to mention all but the most influential – perhaps Good, Savage, and Lindley.
The well-known Bayes’ theorem is a proposition about conditional prob-
abilities. It is simply probability calculus and is thus not subject to any
doubts. Only the application to statistical inference problems is subject to
some controversy. It obviously plays a central role in Bayesian inference,
which is fundamentally different from classical inference. In the classical
model, the parameter θ, θ ∈ Θ, is an unknown but fixed quantity, i.e., it is a
particular value that indexes the sampling model or family of distributions
F that “appropriately” describes the process or physical system that gener-
ates the data. In the Bayesian model, the parameter θ, θ ∈ Θ, is treated as an
unobservable random variable. In the Bayesian view, any unknown quan-
tity – in this case, the parameter θ – is uncertain and all uncertainties are
described in terms of a probability model. Related to this view, Bayesians
would argue that initial information or a priori information – prior or ex-
ternal to the particular experiment, but too important to be ignored – must
be translated into a probability model for θ, say h(θ), and referred to as the
prior distribution. The elicitation and interpretation of prior distributions
are some of the most controversial aspects of Bayesian theory.
The family F is also part of the Bayesian model; that is, the sampling
model is a common part of the classical and the Bayesian paradigms, ex-
cept that in the latter the elements f (x | θ) of F are in general assumed to
also have a subjective interpretation, similar to h(θ).
The discussion of prior distributions illustrates some aspects of the disagreement between Bayesian and classical statisticians. For the former, Berger, for example, the subjective choice of the family F is often considered a more drastic use of prior information than the use of prior distributions. And some would add: In the process of modeling, a classical statistician uses prior information, albeit in a very informal manner.
Such informal use of prior information is seen critically under a Bayesian
paradigm, which would require that initial or prior information of an in-
vestigator needs to be formally stated as a probability distribution on the
random variable θ. Classical statisticians, for example, Lehmann, see an
factor that does not depend on θ. The likelihood function – it is not a prob-
ability, and therefore, for example, it is not meaningful to add likelihoods –
plays an important role in Bayes’ theorem as it is the factor through which
the data, x, updates prior knowledge about θ; that is, the likelihood can be
interpreted as quantifying the information about θ that is provided by the
data x.
In summary, for a Bayesian the posterior distribution contains, by way
of Bayes’ theorem, all available information about a parameter:
prior information + information from the sample.
It follows that all Bayesian inference is based on h(θ | x) [or h(θ | x1 , x2 ,
. . . , xn )].
When θ is a parameter vector, that is, θ = (γ, φ) ∈ Γ × Φ, it can be
the case that the desired inference is restricted to a subvector of θ, say
γ. In this case, in contrast to the classical paradigm, the elimination of the nuisance parameter φ under the Bayesian paradigm always follows the same principle, namely marginalization of the joint posterior distribution,
$$h(\gamma \mid x) = \int_{\Phi} h(\gamma, \phi \mid x)\, d\phi = \int_{\Phi} h(\gamma \mid \phi, x)\, h(\phi \mid x)\, d\phi. \quad (1.3)$$
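Computationally, this marginalization is exactly the composition-sampling identity exploited by Monte Carlo methods later in the book: drawing φ from h(φ | x) and then γ from h(γ | φ, x) yields draws from the marginal h(γ | x). A minimal sketch with a toy joint posterior of our own choosing (normal components, purely for illustration):

```python
import numpy as np

# Assumed toy joint posterior: phi | x ~ N(0, 1) and
# gamma | phi, x ~ N(0.5 * phi, 1). Composition sampling then gives
# draws from the marginal h(gamma | x), here N(0, 1 + 0.25) = N(0, 1.25).
rng = np.random.default_rng(2)
phi = rng.normal(0.0, 1.0, size=100_000)         # phi ~ h(phi | x)
gamma = rng.normal(0.5 * phi, 1.0)               # gamma ~ h(gamma | phi, x)
marg_mean, marg_var = gamma.mean(), gamma.var()  # approx. 0 and 1.25
```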
the central problem in the face of uncertainty is shifting from which inference one should report to which decision should be taken. As individual
decisions have been considered outdated by some philosophers, we have
also recently seen a resurgence of the Bayesian approach in the context of
group decisions.
Under a Bayesian approach, confidence intervals are replaced by credi-
ble intervals (or regions). Given x, and once a posterior distribution is deter-
mined, one finds a credible interval for a parameter θ (assume, for the mo-
ment, a scalar). The interval is formed by two values of θ, say $[\underline{\theta}(x), \bar{\theta}(x)]$, or simpler, $(\underline{\theta}, \bar{\theta})$, such that

$$P(\underline{\theta} < \theta < \bar{\theta} \mid x) = \int_{\underline{\theta}}^{\bar{\theta}} h(\theta \mid x)\, d\theta = 1 - \alpha, \quad (1.4)$$
The ratio P(Θ0 )/P(Θ1 ) is known as the prior odds for H0 versus H1 . Af-
ter the experiment resulting in the observations x, and after determining
h(θ | x), a Bayesian statistician calculates the corresponding posterior prob-
abilities
$$P(\Theta_0 \mid x) = \int_{\Theta_0} h(\theta \mid x)\, d\theta, \qquad P(\Theta_1 \mid x) = \int_{\Theta_1} h(\theta \mid x)\, d\theta,$$
and usually also the posterior odds for H0 versus H1 , that is, P(Θ0 | x)/P(Θ1 |
x). One can therefore say that in the Bayesian framework the inference outcome is not so much the acceptance or rejection of the hypothesis H0 –
as is the case in the Neyman–Pearson framework – but rather the updating
of the plausibility that is attributed to the competing hypotheses. Bayesian
inference can be described as a comparison of posterior odds versus prior
odds through
$$B(x) = \frac{P(\Theta_0 \mid x)/P(\Theta_1 \mid x)}{P(\Theta_0)/P(\Theta_1)}, \quad (1.6)$$
which is known as the Bayes factor in favor of H0 (or Θ0 ). The Bayes factor
quantifies the evidence in the data x in favor of H0 . Of course, the larger
the Bayes factor, the larger is the increase of the posterior odds relative
to the prior odds and thus the support that the data give to the hypothesis
H0 . In general, the Bayes factor depends on the prior distribution and can
be expressed as a ratio of likelihoods weighted by the prior distributions
conditional on the respective hypothesis on Θ0 and Θ1 (see also Paulino et
al., 2003). In this sense one cannot say that the Bayes factor is a measure
of the support for H0 based on only the data.
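As a small worked illustration (our own numbers, not an example from the text): with binomial data and a uniform prior, the Bayes factor for Θ0 = [0, 1/2] versus Θ1 = (1/2, 1] is computable in closed form, here sketched in Python using the identity P(Be(a, b) ≤ p) = P(Bin(a + b − 1, p) ≥ a) for integer a, b.

```python
from math import comb

# Assumed toy data: x = 7 successes in n = 10 trials, prior theta ~ Be(1, 1),
# so the posterior is theta | x ~ Be(1 + x, 1 + n - x) = Be(8, 4).
x, n = 7, 10
a, b = 1 + x, 1 + n - x
# P(theta <= 1/2 | x) via the binomial identity for integer (a, b)
post_p0 = sum(comb(a + b - 1, j) for j in range(a, a + b)) / 2 ** (a + b - 1)
prior_odds = 1.0                       # uniform prior gives P(Theta0) = 1/2
post_odds = post_p0 / (1 - post_p0)
bayes_factor = post_odds / prior_odds  # B(x) < 1: the data favor Theta1
```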
When the hypothesis about θ is specific to the point of being defined
as H0 : θ = θ0 , evaluation of a Bayes factor or of posterior odds requires
that the prior distribution be consistent with this conjecture in the sense
of avoiding zero probability for H0 , implying in general that the prior is a
mixture model. This implication is considered natural by Bayesians such
as Jeffreys, with the argument that a prior distribution needs to integrate
the place of the prior, as representing the information given the sample.
One can then report summaries of this predictive distribution, including
probabilities of any region in the sample space for Xn+1, or values a = a(x) and b = b(x) for any pre-specified probability

$$P(a < X_{n+1} < b \mid x) = \int_a^b f(x_{n+1} \mid x)\, dx_{n+1},$$

which then determine a prediction interval (of HPD type if desired).
1.4 Conclusion
In summary, from a Bayesian point of view:
• The classical approach to statistical inference proceeds by arguments of an inductive type, such as the notion of confidence intervals, which do not have a direct interpretation as probabilities. The difficulty or impossibility of making inference with a direct probabilistic interpretation – as the parameter θ itself is not even considered as a random variable – is strongly criticized by Jaynes (2003).
• Under a Bayesian approach, all inference can be derived from a logical
Problems
1.1 Suppose there are N cable cars in San Francisco, numbered 1 to N. You
don’t know the value of N, so this is the unknown parameter. Your prior
distribution on N is a geometric distribution with mean 100; that is,

$$h(N) = \frac{1}{100}\left(\frac{99}{100}\right)^{N-1}, \qquad N = 1, 2, \ldots$$

You see a cable car at random. It is numbered x = 203. Assume
that x =number on a randomly picked car and has the probability distribution
f (x | N) = 1/N for x = 1, . . . , N, and f (x | N) = 0 for x > N.
a. Find the posterior distribution h(N | x). Find the Bayes estimate of N, i.e., the posterior mean of N, and the posterior standard deviation of N (use results for a geometric series $\sum_{x=k}^{\infty} a^x$, or use a numerical approximation).
b. Find a 95% HPD credible interval for N (you will not be able to exactly
match the 95% – get as close as possible).
1.2 Recording the number of bacteria (yi) found in n = 6 water samples (of the same volume), we find yi = 2, 3, 8, 6, 4, and 1 (you might need $S = \sum_i y_i = 24$). It is thought reasonable to assume that yi follows a Poisson distribution with mean θ, i = 1, . . . , n.
Also suppose that the prior $h(\theta) \propto 1/\sqrt{\theta}$ is used for the parameter θ (this is a so-called improper prior – see Chapter 2 for more discussion).
a. Find the posterior distribution h(θ | y), and show how you will obtain a
95% credible interval for θ.
b. Suppose that you are now told that only non-zero outcomes were recorded
in the above experiment, and therefore the correct distribution for yi , i =
1, ..., 6, is the truncated Poisson given by
$$f(y \mid \theta) = \frac{e^{-\theta}\, \theta^{y}}{y!\,(1 - e^{-\theta})}, \qquad y = 1, 2, \ldots$$
(i) Write down the likelihood function, and the posterior (using the
same prior as before) up to a constant.
(ii) Find a 95% credible interval using numerical integration (the pre-
vious posterior is no longer in a simple form that allows analytic
evaluation of credible intervals).
1.3 Assume that the waiting time for a bus follows an exponential distribution
with parameter θ. We have a single observation, x = 3. Assume that θ can
only take one of 5 values, Θ = {1, 2, 3, 4, 5}, with prior probabilities h(θ) ∝
1/θ, θ ∈ Θ.
a. Find the posterior mode (MAP), the posterior standard deviation SD(θ |
x), and the (frequentist) standard error of the estimator, i.e., SD(MAP | θ).
b. Find a 60% HPD credible interval A = (θ0 , θ1 ), i.e., find the shortest
interval A with P(A | x) ≥ 0.6.
c. Find h(θ | x, θ ≥ 2), i.e., find the posterior distribution for θ when we
know that θ ≥ 2.
d. Find the posterior mode MAP0 conditional on θ ≥ 2, the conditional pos-
terior standard deviation SD(θ | x, θ ≥ 2), and the (frequentist) standard
error of the estimator, SD(MAP0 | θ).
Compare with the answers under item (a) and comment.
e. How would you justify the choice of the prior distribution h(θ) ∝ 1/θ?
1.4 Your friend always uses a certain coin to bet “heads or tails”4 and you have
doubts whether the coin is unbiased or not. Let θ denote the probability of a
head. You want to test H1 : θ < 0.5 versus H2 : θ = 0.5 versus H3 : θ > 0.5.
Assign a prior probability 1/2 that the coin is unbiased and equal probability
to the other two hypotheses. That is, p(H2 ) = 0.5 and p(H1 ) = p(H3 ) = 0.25.
The prior distribution for θ under H1 and H3 is uniform. That is, h(θ | H1 ) =
U(0, 0.5) and h(θ | H3 ) = U(0.5, 1). Assume you observe n = 1 coin toss.
Let x1 ∈ {0, 1} denote an indicator of “head.”
a. Find the predictive distribution (i.e., the marginal distribution) for the first
toss, under each of the three hypotheses, i.e., find p(x1 = 1 | H1 ), p(x1 =
1 | H2 ), and p(x1 = 1 | H3 ).
b. Find the predictive distribution for the first toss, p(x1 = 1).
⁴ The labels “head” and “tail” refer to the two sides of a 25 cent coin.
2
Representation of Prior Information

The mechanics of the inference process require that all basic ingredients
be properly specified. The first one is a sampling model that can explain
(with more or less accuracy) the data as they arise from some experiment
or observational process and which we wish to analyze. This model en-
compasses a set of unknown aspects about which one could have prior
information that should be included in the analysis, no matter how vague
or significant this information is, and that need to be somehow represented
and quantified.
The process of representing prior information is often complicated by
the involvement of subjective elements that need to be elicited. We address
this here for two cases:
• The first case is when no prior information is available, neither objective
nor subjective (sometimes referred to as “a priori ignorance”) or when
prior knowledge is of little significance relative to the information from
the sampling model (“vague” or “diffuse” prior information). We review
some of the principal methods to derive prior distributions that are, in some sense, minimally informative and which are commonly referred to as non-informative priors.
• The second case assumes a convenient parametric family and then chooses
a member of that family using carefully elicited summary measures for
the desired distribution. An example of this elicitation process arises in
the medical inference problem that is discussed by Paulino et al. (2003).
This is the context of so-called natural conjugate priors. Such priors
can also be used to generate improper non-informative priors. In this
sense it is closely related to the first case.
For more discussion about the problem of prior specification and other
methods to generate vague priors or to elicit subjective priors, see O’Hagan (2010), Kass and Wasserman (1996), and Paulino et al. (2018).
Jeffreys’ Prior
One of the approaches that ensure invariance under one-to-one transfor-
mations is the one proposed by Jeffreys. Jeffreys’ prior is based on Fisher
information for θ ∈ R, defined as

$$I(\theta) = E_{\theta}\!\left[\left(\frac{\partial \ln f(X \mid \theta)}{\partial \theta}\right)^{2}\right].$$

For the univariate case, Jeffreys’ prior is defined by $h(\theta) \propto [I(\theta)]^{1/2}$. The fact that for any real-valued one-to-one transformation ψ of θ ∈ R,

$$I(\psi) = I(\theta(\psi)) \left(\frac{d\theta}{d\psi}\right)^{2},$$
shows that Jeffreys’ prior for the univariate case has the desired invariance
property and thus, ensures invariance of inference under arbitrary transfor-
mations of the parameter space (that is, under reparametrization).
To understand the non-informative nature of Jeffreys’ prior, note that I(θ) measures (in expectation over the sample space) the squared rate of change of ln f (X | θ) with respect to θ. The better the model can differentiate θ from θ + dθ, the larger I(θ), that is, the larger the sample information on θ. Therefore, considering values of θ with larger (smaller) I(θ) to be a priori more (less) plausible maximally reduces the impact of prior information.
This explains the non-informative nature of Jeffreys’ prior. And it can be
claimed to be objective as it is derived in an automatic fashion from the
assumed generative model for the data.
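As a concrete case (a standard textbook example, not worked out above): for a Bernoulli model $f(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$,

```latex
\frac{\partial \ln f(X \mid \theta)}{\partial \theta}
  = \frac{X}{\theta} - \frac{1 - X}{1 - \theta}
  = \frac{X - \theta}{\theta(1 - \theta)},
\qquad
I(\theta)
  = E_{\theta}\!\left[\left(\frac{X - \theta}{\theta(1 - \theta)}\right)^{\!2}\right]
  = \frac{1}{\theta(1 - \theta)},
```

so that $h(\theta) \propto [I(\theta)]^{1/2} = \theta^{-1/2}(1-\theta)^{-1/2}$, i.e., the (proper) Be(1/2, 1/2) distribution.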
(i.e., under rescaling). Here, invariance refers to the fact that the implied
prior for any ψ = θr is of the same type, h(ψ) ∝ ψ−1 , ψ > 0.
In multiparameter models, Jeffreys’ rule is based on the square root of
the determinant of the Fisher information matrix. However, due to unde-
sirable implications on the posterior distribution, the rule tends to be re-
placed, as even suggested by Jeffreys himself, by an assumption of a priori
independence between parameters (especially when they are of a different
nature) and the use of univariate Jeffreys’ rules for the marginal prior dis-
tributions. For example, in the model {N(µ, σ2 ) : µ ∈ R, σ2 ∈ R+ }, Jeffreys’
prior is
h(µ, σ2 ) ∝ σ−m , µ ∈ R, σ2 > 0,
with m = 3 when the bivariate rule is used and m = 2 when the mentioned
factorization of the joint prior for location, µ, and scale, σ2 is used.
{1, . . . , k}, and that the median is fixed at one of the possible values, say q. Thus, we have a restriction imposed with g1(θ) being an indicator for θ ≤ q and u1 = 1/2, that is, the constraint $\sum_{i=1}^{q} h(i) = 1/2$. By the earlier expression,

$$h(i) = \begin{cases} \dfrac{e^{\lambda_1}}{e^{\lambda_1} q + (k - q)}, & i \le q, \\[2ex] \dfrac{1}{e^{\lambda_1} q + (k - q)}, & q < i \le k, \end{cases}$$
which shows how sampling information enters the easily computed posterior distribution, through the number of successes and failures (that is, the minimal sufficient statistic), additively with respect to the prior hyperparameters (a, b).
Since a, b > 0, prior information vanishes (relative to the information
in the likelihood) when a, b → 0. Thus a non-informative (or vague) prior
obtained from the natural conjugate family is the improper Haldane prior,
“Be(0, 0),” defined as h(θ) ∝ θ−1 (1 − θ)−1 , θ ∈ (0, 1), which corresponds
to the “uniform” prior for ψ = ln [θ/(1 − θ)] ∈ R. Consequently, the prior
distribution Be(a, b) can be interpreted as a posterior distribution resulting
from updating a non-informative prior based on a hypothetical sample of
size a + b with a successes.
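The beta–binomial updating just described is a one-line computation; the sketch below (with hyperparameters and counts chosen by us, purely for illustration) also verifies the weighted-average form of the posterior mean.

```python
# Conjugate beta updating as described above: for s successes in n trials,
# a Be(a, b) prior updates to a Be(a + s, b + n - s) posterior (toy numbers).
a, b = 2.0, 2.0
s, n = 13, 20
A, B = a + s, b + n - s
post_mean = A / (A + B)
# posterior mean = weighted average of prior mean and sample proportion
w = (a + b) / (a + b + n)
check = w * (a / (a + b)) + (1 - w) * (s / n)
```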
In conjugate families, the process of Bayesian updating, that is, the combination of prior and sample information by Bayes’ theorem, is carried out entirely within the family. This allows us to symbolically represent the updating by a transformation in the space A of hyperparameters, e.g.,
$$a \in \mathcal{A}\ \xrightarrow{\,F\,}\ A = a + (A - a) \in \mathcal{A}.$$
in Example 2.5. The transformation shows the relative weights of the two
types of information and highlights the interpretative and analytical sim-
plicity of the Bayesian mechanism in the context of a natural conjugate
family. In the above form, A − a represents the role of the sample infor-
mation in the updating of prior information, represented as a. This is illus-
trated in Example 2.5.
Problems
2.1 Consider a Poisson sampling model f (n | λ) = Poi(λ). Find Jeffreys’ prior
for λ.
2.2 In order to diagnose the cause of a certain symptom, a patient’s physician
analyzes the tetrahydrocortisone content (THE) in the urine of the patient
over 24 hours. The laboratory returns a result of THE = 13 mg/24 h. There
are two possible causes, adenoma (z = 1) and carcinoma (z = 2). It is known
that the distribution of y = ln(THE), i.e., the logarithm of the amount of THE in the urine, is approximately normal. We assume
M1 : y ∼ N(µ1 , σ21 )
M2 : y ∼ N(µ2 , σ22 )
for adenoma (M1 ) and carcinoma (M2 ), respectively. A data bank of THE
measures is available with information about six patients with adenoma and
five patients with carcinoma as follows (in mg/24 h):
adenoma 3.1, 3.0, 1.9, 3.8, 4.1, 1.9
carcinoma 10.2, 9.2, 9.6, 53.8, 15.8
a. First, we use the data to determine the parameters in the sampling model.
That is, use the data to estimate µ j and σ j , j = 1, 2.
Fix the parameters, and determine the posterior probability of this patient
having carcinoma, i.e., p(z = 2 | y), if the initial opinion of the physician’s team, based on the patient’s history, is that the patient has a carcinoma
with probability 0.7.
² Bayesian inference satisfies the likelihood principle since it is entirely based on the posterior distribution, which in turn depends on the data only through the likelihood f (x | θ).
³ $\propto \theta^{\,a + \sum_i x_i - 1}\, (1 - \theta)^{\,b + \sum_i (m_i - x_i) - 1}$.
That is, similar to the binomial ∧ beta case, the posterior predictive proba-
bility function can be written as a product of a homogeneous multinomial
probability function and a Poi–Ga marginal probability function. This rep-
resentation highlights the nature of the dependence in the posterior predic-
tive distribution, which reduces to a negative binomial when k = 1.
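The k = 1 statement can be checked by simulation (all numbers below are our own toy choices): mixing a Poisson over a gamma-distributed mean gives a negative binomial marginal.

```python
import numpy as np

# Assumed toy parameters: lambda ~ Ga(a, rate b), y | lambda ~ Poisson(lambda).
# Marginally y is negative binomial with success probability p = b/(b + 1),
# so in particular P(y = 0) = p^a.
rng = np.random.default_rng(3)
a, b = 3.0, 2.0
lam = rng.gamma(shape=a, scale=1.0 / b, size=200_000)
y = rng.poisson(lam)
emp_p0 = np.mean(y == 0)
exact_p0 = (b / (b + 1.0)) ** a   # (2/3)^3, about 0.296
```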
where k = n − 1 and $ks^2 = \sum_i (x_i - \bar{x})^2$. This is the kernel of a joint normal-inverse gamma distribution for (µ, σ²), defined as a normal for µ given σ² and an inverse gamma for σ². The family of normal-inverse gamma distributions is closed under products. The natural conjugate family is therefore defined by density functions of the type $h(\mu, \sigma^2 \mid a, v, c, d) = h_{N(a,\, \sigma^2/v)}(\mu \mid \sigma^2)\, h_{\mathrm{IGa}(c,d)}(\sigma^2)$.
Jeffreys’ prior under prior independence of µ and σ², h(µ, σ²) ∝ σ⁻², is a limiting case of an improper NIGa. Bayesian updating with the given likelihood function yields the posterior distribution

$$h(\mu, \sigma^2 \mid x) \propto (\sigma^2)^{-1/2}\, e^{-\frac{n}{2\sigma^2}(\mu - \bar{x})^2} \times (\sigma^2)^{-\left(\frac{n-1}{2} + 1\right)}\, e^{-\frac{k s^2}{2\sigma^2}},$$

implying $\mu \mid \sigma^2, x \sim N(\bar{x}, \sigma^2/n)$ and $\sigma^2 \mid x \sim \mathrm{IGa}\!\left(\frac{k}{2}, \frac{k s^2}{2}\right) \Leftrightarrow \frac{k s^2}{\sigma^2} \mid x \sim \chi^2_{(k)}$, as well as $\sigma^2 \mid \mu, x \sim \mathrm{IGa}\!\left(\frac{k+1}{2}, \frac{k s^2 + n(\mu - \bar{x})^2}{2}\right)$.
Inferences about the location and scale parameters in the sampling model
are easily obtained from the Student t and inverse gamma distributions (or
χ2 after appropriate transformation). In terms of prediction, consider, for
example, a future random sample of size m from the same model, and as-
sume that we wish to predict its mean Ȳ. The posterior predictive distribu-
tion for Ȳ is a mixture of Ȳ | µ, σ2 ∼ N(µ, σ2 /m) with respect to the joint
posterior h(µ, σ² | x) = h(µ | σ², x) h(σ² | x). Using an algebraic identity for the linear combination of the quadratic forms, $\frac{1}{\sigma^2}\left[m(\mu - \bar{y})^2 + n(\mu - \bar{x})^2\right]$, we see that this distribution takes the form of a Student t,

$$\bar{Y} \mid x \sim t_{(k)}\!\left(\bar{x},\ \frac{m+n}{mn}\, s^2\right),$$

from which one can easily find point and interval estimates.
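These conditional and marginal forms translate directly into a composition sampler. A short sketch with invented data (our numbers, not the book's), drawing σ² | x and then µ | σ², x:

```python
import numpy as np

# Posterior sampling under h(mu, sigma2) ∝ sigma^{-2} (toy data assumed):
# sigma2 | x ~ IGa(k/2, k s^2/2), then mu | sigma2, x ~ N(x_bar, sigma2/n).
rng = np.random.default_rng(4)
x = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])
n = len(x)
k = n - 1
x_bar, s2 = x.mean(), x.var(ddof=1)
# inverse gamma draws as reciprocals of gamma draws (rate k*s2/2)
sigma2 = 1.0 / rng.gamma(shape=k / 2.0, scale=2.0 / (k * s2), size=100_000)
mu = rng.normal(x_bar, np.sqrt(sigma2 / n))
# marginally (mu - x_bar) / (s / sqrt(n)) | x ~ t_(k), so these scaled
# draws should have variance close to k / (k - 2) = 5/3
t_draws = (mu - x_bar) / np.sqrt(s2 / n)
```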
3.5 Two Independent Normal Models ∧ Marginal Jeffreys’ Priors
Comparison of Means
Following an analogous argument as in the previous section, we easily find
that (µ1 , σ21 ) and (µ2 , σ22 ) are a posteriori again independent with univariate
marginal distributions:

$$\mu_j \mid x_j \sim t_{(k_j)}(\bar{x}_j,\ s_j^2/n_j) \ \Leftrightarrow\ \nu_j = \frac{\mu_j - \bar{x}_j}{s_j/\sqrt{n_j}} \,\Big|\, x_j \sim t_{(k_j)},$$

$$\sigma_j^2 \mid x_j \sim \mathrm{IGa}\!\left(\frac{k_j}{2}, \frac{k_j s_j^2}{2}\right),$$

where $k_j = n_j - 1$ and $k_j s_j^2 = \sum_{i=1}^{n_j} (x_{ji} - \bar{x}_j)^2$.
The standardized difference λ = µ1 − µ2, written as

$$\tau = \frac{\lambda - (\bar{x}_1 - \bar{x}_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \ \equiv\ \nu_1 \sin u + \nu_2 \cos u,$$

with

$$c_1 = \frac{k_1}{k_1 - 2}\,\sin^2 u + \frac{k_2}{k_2 - 2}\,\cos^2 u, \quad \text{and}$$

$$c_2 = \frac{k_1^2}{(k_1 - 2)^2 (k_1 - 4)}\,\sin^4 u + \frac{k_2^2}{(k_2 - 2)^2 (k_2 - 4)}\,\cos^4 u.$$
As an alternative to the use of Patil’s approximation, one could, by computer simulation, generate a Monte Carlo sample from the posterior distribution.
¹ Note that the dependence on u implies that there is no duality between the posterior and sampling distributions for τ, and thus, no numerical identity of Bayesian and classical inference on the difference of means. This is in contrast to what happens in other situations under non-informative priors.
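That Monte Carlo route is straightforward: draw µ1 and µ2 independently from their marginal Student t posteriors and summarize λ = µ1 − µ2 directly. A sketch with invented data (our own toy numbers):

```python
import numpy as np

# Monte Carlo sample for lambda = mu1 - mu2 (toy data assumed):
# mu_j | x_j ~ t_(k_j)(xbar_j, s_j^2 / n_j), independently for j = 1, 2.
rng = np.random.default_rng(5)
x1 = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])
x2 = np.array([3.1, 3.9, 2.8, 3.5])

def post_mu(x, size, rng):
    n, k = len(x), len(x) - 1
    # standard t draws rescaled and shifted to t_(k)(xbar, s^2/n)
    return x.mean() + rng.standard_t(k, size) * np.sqrt(x.var(ddof=1) / n)

mu1 = post_mu(x1, 100_000, rng)
mu2 = post_mu(x2, 100_000, rng)
lam = mu1 - mu2
ci = np.quantile(lam, [0.025, 0.975])  # equal-tail 95% credible interval
```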
Comparing Variances
Assume now that the parameter of interest is ψ = σ1²/σ2². Based on the
independent gamma posteriors for {1/σj²} we find that ψ | x1, x2 ≡d (s1²/s2²) F(k2, k1),
which allows easy implementation of inferences about ψ.
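The scaled-F result can be verified numerically by sampling the two precisions from their gamma posteriors and forming the ratio; the mean of (s1²/s2²) F(k2, k1) is (s1²/s2²)·k1/(k1 − 2). The summaries below are made up for illustration.

```python
import random, statistics

random.seed(2)

# Hypothetical summaries (illustrative): k_j = n_j - 1 and s_j^2.
k1, s1_sq = 11, 4.0
k2, s2_sq = 14, 9.0

# 1/sigma_j^2 | x_j ~ Ga(k_j/2, k_j s_j^2 / 2)  =>  psi = sigma_1^2 / sigma_2^2
psi = []
for _ in range(100_000):
    prec1 = random.gammavariate(k1 / 2, 2 / (k1 * s1_sq))
    prec2 = random.gammavariate(k2 / 2, 2 / (k2 * s2_sq))
    psi.append(prec2 / prec1)   # = sigma1^2 / sigma2^2

# psi | x is distributed as (s1^2/s2^2) F(k2, k1), with mean
# (s1^2/s2^2) * k1/(k1 - 2).
print(statistics.mean(psi), (s1_sq / s2_sq) * k1 / (k1 - 2))
```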
which is then the basic inference result for the comparison of two normal
populations under homoskedasticity.
This can be used to construct one-sided or two-sided Bayesian tests for the
comparison of proportions.
with xc+1 = N − Σᵢ₌₁ᶜ xi and θc+1 = 1 − Σᵢ₌₁ᶜ θi. The first two moments are defined
as (using, e.g., the moment-generating function):

µ = E(X | θ) = Nθ;  Σ = Var(X | θ) = N(Dθ − θθ′),
where Dθ = diag(θ1 , . . . , θc ).
Next, we consider the implied distribution on aggregate counts over sub-
sets. Let Ck = { jk−1 + 1, . . . , jk }, k = 1, . . . , s + 1 denote the subsets of a
partition of the set of indices {1, 2, . . . , c, c + 1} into s + 1 subsets with
# Ck = dk , j0 = 0, and j s+1 = c + 1. A corresponding segmentation of X is
defined as

X(k) = (Xi, i ∈ Ck);  Mk = Σᵢ∈Ck Xi,  k = 1, . . . , s + 1,

νi, i = 1, . . . , c + 1 ∼ind Ga(ai, 1).

By their definition, the mixed moments are given by E[Π_{i=1}^{c+1} θi^{ri} | a] =

πk | a, k = 1, . . . , s + 1 ∼ind D_{dk−1}(ai, i ∈ Ck).
Application to Inference
Assume x = (x1 , . . . , xc ) is a realization of a random vector X | θ ∼
Mc (N, θ). Since f (x | θ) is proportional to the kernel of a Dc (xi + 1, i =
1, . . . , c + 1) distribution that is closed under products, we conclude that
the Dirichlet family is the natural conjugate for a multinomial sampling
model. Thus, if θ | a ∼ Dc (a), then θ | a, x ∼ Dc (A), A = (Ai = ai + xi , i =
1, . . . , c + 1).
Bayesian estimates of (θi, i = 1, . . . , c + 1) can be derived in particular
from the components of the posterior mode (A − 1c+1)/(A· − (c + 1)) (if Ai > 1, ∀i),
where 1c+1 is a c + 1 vector of all 1s, or the posterior mean A/A· . Note how
this is a weighted mean of the prior mean, a/a· , and the vector of sample
proportions p = (xi/N, i = 1, . . . , c + 1).
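The conjugate update and the weighted-mean identity for the posterior mean can be written out in a few lines; the counts and prior parameters below are made up for illustration.

```python
# Minimal sketch of the conjugate Dirichlet-multinomial update.
a = [2.0, 2.0, 2.0, 2.0]          # prior Dirichlet parameters (c + 1 = 4 cells)
x = [10, 20, 30, 40]              # observed multinomial counts, N = 100
N, a_dot = sum(x), sum(a)

A = [ai + xi for ai, xi in zip(a, x)]     # posterior Dirichlet parameters
A_dot = sum(A)

post_mean = [Ai / A_dot for Ai in A]
# The posterior mean A/A. is a weighted average of the prior mean a/a. and
# the sample proportions x/N, with weights a./(a. + N) and N/(a. + N).
check = [(a_dot / A_dot) * (ai / a_dot) + (N / A_dot) * (xi / N)
         for ai, xi in zip(a, x)]
print(post_mean, check)
```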
In the analysis of contingency tables, inference of interest is often related
to independence structures (or other log-linear models) in which parametric
functions Σᵢ bᵢ ln θᵢ with Σᵢ bᵢ = 0 often play a critical role (see, e.g.,
Paulino and Singer, 2006). When the components of A are large, one can
invoke approximate normality of the posterior distribution and
use the resulting χ² distribution for appropriate quadratic forms, allowing
tests for such structures. For more detail see, e.g., Paulino et al (2018, ch.
6). In the special case of a 2 × 2 table the use of such an approach to the
test of independence, which reduces simply to a test of homogeneity of
two binomials, leads to the approach mentioned at the end of the previous
section.
Assume now the aim is to predict a vector Y with Y | m, θ ∼ Mc (m, θ).
The corresponding posterior predictive distribution is a multinomial-Dirichlet,
Y | m, x ∼ MDc (m, A), whose summary by the first two moments can be
obtained by the formulas in Appendix A.
The posterior distribution for the vector of group totals is derived from
that by translation by x, which of course implies the posterior distribution
for the vector of population proportions after division by N (Basu and Pereira, 1982).
Problems
3.1 Exponential–gamma model. Let xi ∼ Exp(θ), i = 1, . . . , n, i.i.d.
a. Assuming θ is distributed as Ga(α, β), with α and β fixed, find the poste-
rior distribution h(θ | x1 , . . . , xn ).
b. Under the same assumptions, compute the predictive distribution p(xn+1 |
x1 , . . . , xn ).
c. Consider now a slightly different situation. Imagine that the exponential
random variables represent waiting times for the recurrence of disease in
n patients surgically treated at time 0. We are at time a, and all patients are
still alive. Some of the n patients may have experienced recurrence. We
record their recurrence time xi (assume patients are not followed up after
the recurrence, i.e., we have no further data). Other patients may still be
healthy, so we don’t know exactly what the recurrence time is. We only
know xi > a. Repeat steps (a) and (b) for this case.
3.2 Binomial (unknown θ, n)–beta/Poisson model. Consider a binomial sampling
model x | θ, n ∼ Bi(n, θ).
a. For fixed n and θ ∼ Be(a0 , b0 ), find the marginal distribution p(x). This
distribution is known as beta-binomial (cf. Section 3.1).
b. Assume now that also n is unknown, h(n) ∝ 1/n2 . Plot the joint posterior
distribution h(n, θ | x) for x = 50, and (a0 , b0 ) = (1, 4).
Hint: Use a grid on 0.01 ≤ θ ≤ 0.99 and x ≤ n ≤ 500. Use the R function
lgamma(n+1) to evaluate ln(n!).
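The grid evaluation suggested in the hint can be sketched as follows. The hint refers to R's lgamma; here is a Python version using math.lgamma in the same role (the grid ranges and hyperparameters are those stated in the problem).

```python
import math

x, a0, b0 = 50, 1, 4

def log_joint(n, th):
    # log h(n) + log Bi(x | n, th) + log Be(th | a0, b0), up to additive constants
    if n < x:
        return -math.inf
    log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1))
    return (-2 * math.log(n) + log_choose
            + x * math.log(th) + (n - x) * math.log(1 - th)
            + (a0 - 1) * math.log(th) + (b0 - 1) * math.log(1 - th))

thetas = [i / 100 for i in range(1, 100)]          # 0.01, ..., 0.99
ns = range(x, 501)                                 # x <= n <= 500
grid = {(n, th): math.exp(log_joint(n, th)) for n in ns for th in thetas}
total = sum(grid.values())
post = {k: v / total for k, v in grid.items()}     # normalized over the grid
n_map, th_map = max(post, key=post.get)
print(n_map, th_map)
```

The normalized grid values can then be displayed, e.g., as a contour or image plot of h(n, θ | x).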
3.3 Hierarchical Poisson–gamma model. An animal experiment records tumor
counts for mice of two different strains, A and B. Tumor counts are
approximately Poisson distributed. The observed tumor counts for nA = 10
mice of strain A and nB = 13 mice of strain B are
yA = (12, 9, 12, 14, 13, 13, 15, 8, 15, 6); nA = 10
yB = (11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7); nB = 13.
We assume yAi ∼ Poi(θA ), i.i.d., for i = 1, . . . , nA . And independently yBi ∼
Poi(θB ), i.i.d., for i = 1, . . . , nB . In parts (a) through (c) we will consider
three alternative prior models that reflect our prior information about θA and
θB at different levels.
a. Find the posterior distributions, means, variances, and 95% credible in-
tervals for θA and θB , assuming Poisson sampling distributions for each
44 Inference by Monte Carlo Methods
Table 4.1 CRM: dose levels, prior probabilities of toxicity π0k , and
standardized doses δk .
Dose level
k 1 2 3 4 5 6
Dose (mg/m2 ) 10 20 40 60 75 90
Prob(toxicity) π0k 0.05 0.10 0.20 0.30 0.50 0.70
Standardized dose δk –1.47 –1.1 –0.69 –0.42 0 0.42
aim to establish the maximum tolerable dose (MTD) that can be adminis-
tered without excessive probability of a dose-limiting toxicity (DLT). As-
suming that higher doses are more effective for the tumor treatment, we
wish to find the highest dose possible without excessive toxicity. However,
some toxicity has to be tolerated, otherwise the treatment would be ineffec-
tive. Investigators might therefore, for example, aim to find the highest dose
with probability of DLT less than π? = 30%. The continual reassessment
method (CRM) is one of the first Bayesian model-based designs for such
trials (O’Quigley et al, 1990). In its most basic form, this method char-
acterizes the dose–toxicity relationship by simple one-parameter models,
such as the hyperbolic tangent model, the logistic model, or the power
model. Let δk , k = 1, . . . , K denote the available dose levels in the trial and
let πk be the probability of a DLT at the k-th dose. The hyperbolic tangent
model assumes
" #a
exp(δk )
πk (a) = [(tanh(δk ) + 1)/2]a = . (4.4)
exp(δk ) + exp(−δk )
To highlight the nature of πk as a function of the unknown parameter a,
we write πk (a). Let π0k denote an a priori expert judgement of the expected
toxicity at the k-th dose level. Recording doses as standardized dose levels
δk = tanh−1 (2π0k − 1), we can match the prior mean E(πk | a = 1) with the
expert prior judgment, i.e., E(πk | a = 1) = π0k .
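The standardized doses in Table 4.1 follow directly from the prior toxicity probabilities by inverting the hyperbolic tangent link, δk = tanh⁻¹(2π0k − 1). A short Python check:

```python
import math

# Reproducing the standardized doses of Table 4.1 from pi0k.
pi0 = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]
delta = [math.atanh(2 * p - 1) for p in pi0]
print([round(d, 2) for d in delta])   # [-1.47, -1.1, -0.69, -0.42, 0.0, 0.42]

# Calibration check: with a = 1, pi_k(1) = (tanh(delta_k) + 1)/2 recovers pi0k,
# so that E(pi_k | a = 1) matches the expert judgment pi0k.
back = [(math.tanh(d) + 1) / 2 for d in delta]
assert all(abs(b - p) < 1e-12 for b, p in zip(back, pi0))
```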
Suppose then that in developing a new agent, K = 6 dose levels are to
be studied. We assume a hyperbolic tangent dose–toxicity curve with prior
distribution h(a) = Exp(1). The targeted toxicity level is set at π? = 20%.
The dose levels and our prior beliefs regarding the probability of toxicity
at each dose level are given in Table 4.1. Assume, then, that the first four
patients are treated at dose levels ki = 3, 4, 4, and 5, that is, patient i = 1
was treated with dose δ3 , etc. Let yi ∈ {0, 1} denote an indicator for the i-th
patient reporting a DLT. Assume that the observed toxicity outcomes are
[Figure 4.1: marginal posterior densities h(πk | y) and posterior mean toxicity probabilities p(Y = 1) at dose levels k = 1, . . . , 6.]
which shows that the ordinates of the marginal posterior density for θ1 can
be interpreted as posterior expected values (with respect to θ, including in
particular θ2 ) of the corresponding ordinates of the conditional posterior
density for θ1 .
Generalizing this argument for θ = (θ(m), θ(−m)), with θ(−m) = (θm+1, . . . , θk),
we get

h(θ*(m) | x) = ∫Θ h(θ*(m) | θ(−m), x) h(θ | x) dθ.  (4.8)
The expression implies that the marginal posterior density for θ(m) = (θ1, . . . ,
θm) can be approximated with a Monte Carlo estimate based on a random
sample from h(θ | x), θ(i) = (θ(i)(m), θ(i)(−m)) with θ(i)(m) = (θ(i)1, . . . , θ(i)m) and
θ(i)(−m) = (θ(i)m+1, . . . , θ(i)k), i = 1, . . . , n, as

ĥ(θ*(m) | x) = (1/n) Σᵢ₌₁ⁿ h(θ*(m) | θ(i)(−m), x).  (4.9)
This estimate was proposed by Gelfand and Smith (1990; see also Gelfand
et al., 1992). It does not make use of the part θ(i)(m), i = 1, . . . , n, of the
simulated values on which the kernel density estimate is based, but instead
requires complete knowledge of the conditional posterior density of θ(m)
given θ(−m).
Assuming this condition is met, the estimate (4.9) turns out to be more
efficient than the estimate obtained by the kernel method, as was shown by
Gelfand and Smith (1990). This is because it exploits the knowledge of the
model structure as it is embodied in the required conditional distribution.
Something similar happens with the following estimate of the posterior
mean of θ(m) that, accounting for (4.9), is given by
θ̂(m) = (1/n) Σᵢ₌₁ⁿ E( θ(m) | θ(i)(−m), x ).  (4.10)
This estimator, using the availability of the indicated conditional expec-
tation, is more precise than the classical Monte Carlo estimator obtained
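The Rao-Blackwellized density estimate (4.9) can be illustrated with a toy posterior for which the marginal is known exactly. The bivariate normal "posterior" below is purely illustrative, not from the text: with correlation ρ, θ1 | θ2 ∼ N(ρθ2, 1 − ρ²), and the exact marginal of θ1 is N(0, 1).

```python
import math, random, statistics

random.seed(3)
rho = 0.8

def norm_pdf(z, m, v):
    # normal density with mean m and variance v
    return math.exp(-(z - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

theta2 = [random.gauss(0, 1) for _ in range(100_000)]   # marginal draws of theta2

# Estimate the marginal density of theta1 at 0 by averaging the known
# conditional density over the theta2 draws, as in (4.9).
h_hat = statistics.mean(norm_pdf(0.0, rho * t2, 1 - rho ** 2) for t2 in theta2)
print(h_hat, norm_pdf(0.0, 0.0, 1.0))   # both close to 0.3989
```

Unlike a kernel estimate based on the θ1 draws, this average uses the model structure through the conditional density, which is what yields the efficiency gain discussed above.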
The form of this estimator shows that importance sampling could be seen
as weighted sampling, with the weights wi for g(θi ) known as importance
sampling weights. Under appropriate assumptions, that is, in particular
that the support of p(θ) includes the support of h(θ | x) and the integral
∫ g(θ)h(θ | x) dθ exists and is finite, Geweke (1989) shows that with an
i.i.d. importance sample θi from p(θ),

(1 / Σᵢ₌₁ⁿ wᵢ) Σᵢ₌₁ⁿ wᵢ g(θᵢ) → ∫ g(θ) h(θ | x) dθ  a.s.
The result assumes a finite variance of the Monte Carlo estimator, that is,
of the posterior expected value of the product of [g(θ)]2 and the importance
weight h(θ | x)/p(θ) (this is identical to the expected value with respect to
p of the square of g(θ)h(θ | x)/p(θ)).
The rate of convergence for the importance sampling estimator depends
on the ratio between the importance distribution and the target distribution
h(θ | x). Note that the estimator gives more weight to θ values with p(θ) <
h(θ | x) and less when the opposite inequality holds. If the importance
ratio is unbounded, as happens if the tails of p are lighter than the tails of
h(· | x), then the weights can vary substantially and give great importance to
few simulated values in a range (tails) with low probability under the target
distribution. Depending on the function g, the variance of the estimator
could even be infinite, implying far worse performance than an estimator
based directly on the target distribution, if that were possible.
In summary, whether Monte Carlo with importance sampling is a promis-
ing technique for variance reduction depends on the selected importance
sampling density (for more details, see Robert and Casella, 2004). Desir-
able properties for a good importance sampling density are: (1) easy ran-
dom variate generation; (2) heavier tails than the target distribution h(· | x);
and (3) good approximation of h(· | x). Shaw (1988) developed a class of
univariate distributions that are designed to be suitable as importance func-
tions (see also Smith, 1991). In the case of multivariate parameters, often
multivariate normals or Student t distributions are used.5
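A self-normalized importance sampling estimator with a heavy-tailed Student t proposal, as recommended above, can be sketched as follows. The target here is taken to be N(2, 1), treated as known only up to a constant; all numbers are illustrative.

```python
import math, random

random.seed(4)
nu = 4   # proposal: Student t with nu degrees of freedom (heavier tails than target)

def draw_t():
    # t_nu variate via normal / chi-square construction
    z = random.gauss(0, 1)
    u = random.gammavariate(nu / 2, 2)          # chi-square(nu)
    return z * math.sqrt(nu / u)

def log_target(th):                             # unnormalized log N(2, 1)
    return -0.5 * (th - 2) ** 2

def log_proposal(th):                           # t_nu log density, up to a constant
    return -(nu + 1) / 2 * math.log(1 + th ** 2 / nu)

thetas = [draw_t() for _ in range(100_000)]
logw = [log_target(t) - log_proposal(t) for t in thetas]
mx = max(logw)                                  # stabilize before exponentiating
w = [math.exp(lw - mx) for lw in logw]
est = sum(wi * ti for wi, ti in zip(w, thetas)) / sum(w)
print(est)                                      # close to the target mean, 2
```

Because the t tails dominate the normal tails, the weights are bounded here; reversing the roles (normal proposal, t target) would give unbounded weights and the instability described above.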
⁵ The program BAYESPACK provides an R interface for Fortran subroutines that
implement numerical integration by importance sampling, based on methods described
4.2 Monte Carlo with Importance Sampling 53
Example 4.2 Recall the setup from Example 4.1. The single-parameter
model (4.4) was deliberately chosen to allow meaningful inference also
with small sample sizes in a phase I trial. Still keeping a parsimonious
model, but allowing for some more flexibility, Neuenschwander et al (2008)
propose a design based on a two-parameter probit regression,
f (yi = 1 | di = δk ) = πk with πk = 1 − Φ [−a − b ln(δk /δo )] . (4.15)
Here, a and b are probit regression coefficients and δo is a reference dose.
The model allows easy interpretation of the parameters: a determines the
prior odds at the reference dose δo, and b determines the shift in
log odds for doses away from δo. The model is completed with a bivariate
normal prior (a, b) ∼ N(µ, Σ). We set the prior mean to µ = (1.65, 1.327)
to match a prior expectation E(π1 ) = 0.01 and E(πK ) = 0.95, respectively,
and Σ = I to have wide prior support over a range of plausible dose–
response curves.
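The stated calibration of the prior mean can be checked by evaluating the dose-toxicity curve (4.15) at µ = (1.65, 1.327) with the dose grid and reference dose used below in this example. (This is a plug-in evaluation at the prior mean, not the exact prior expectation, so it should reproduce 0.01 and 0.95 only approximately.)

```python
import math

def Phi(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

a, b = 1.65, 1.327                       # prior mean mu
delta = [12.5, 25, 50, 100, 150, 200, 250]
d0 = 250.0                               # reference dose

# pi_k = 1 - Phi(-a - b ln(delta_k / d0)), model (4.15)
pi = [1 - Phi(-a - b * math.log(dk / d0)) for dk in delta]
print(round(pi[0], 3), round(pi[-1], 3))   # approximately 0.01 and 0.95
```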
Noting that it is impossible to define a single precise target toxicity π? ,
Neuenschwander et al (2008) extend the notion of a single target toxicity
level π? to an ordinal scale over four sub-intervals of toxicity probabil-
ities. Neuenschwander et al (2008) partition the range of toxicity proba-
bilities πk into four intervals: under-dosing, Iu = (0, 0.20]; targeted toxi-
city, It = (0.20, 0.35]; excessive toxicity, Ie = (0.35, 0.60]; and unaccept-
able toxicity, Ia = (0.60, 1.00]. Based on the posterior probabilities of πk ,
k = 1, . . . , K, being in these four intervals they propose a pragmatic design
that prescribes de-escalation, escalation, and continued enrollment at the
current dose, depending on these four probabilities.
Consider a study with a dose grid δ = (12.5, 25, 50, 100, 150, 200, 250)
with reference dose δ0 = 250. Assume that the first n = 7 patients were as-
signed doses ki = 1, 1, 2, 2, 3, 2, 3, with recorded outcomes yi = 0, 0, 0, 0, 1, 0, 1.
We use importance sampling to find P(πk ∈ It | Dn ) and similar for Ie and
Ia , for k = 1, . . . , K. Let θ = (a, b). We use a bivariate normal importance
function p(θ) = N(m, V), with m being the posterior mode (MAP) and V
being the negative inverse of the Hessian of the log posterior at the mode,⁶
i.e., V = −H⁻¹. Figure 4.2 summarizes posterior inference. The trial design
would then determine the dose with maximum p(πk ∈ It | Dn ), restricting
the search to all doses with p(πk ∈ Ie ∪ Ia | Dn ) < 0.25.
⁷ Additional discussion can be found in the books by Paulino et al (2018) and Chen et al
(2000).
is an (unbiased and consistent) Monte Carlo estimate of B(x). Note that this
approximation is most efficient when the tails of h1 (θ) are heavier than the
tails of h0 (θ) and less efficient when there is little overlap of h0 (θ) and h1 (θ)
(that is, small Eh1 [h0 (θ)]).
In many problems, the Bayes factor involves densities of different di-
mension and therefore requires different strategies. We will discuss some
in Chapter 7. For a more general discussion of this problem, see the books
by Chen et al (2000) and Paulino et al (2018).
where

wi = h( θ(i)(−m) | x ) / p( θ(i)(−m) ),

i = 1, . . . , n. Alternatively, with an argument as in (4.8), one could use
wi = h(θ(i) | x)/p(θ(i)), i = 1, . . . , n, with an importance sampling density
p(·) for h(θ | x).
As yet another alternative, one could generate θ(i)(m), i = 1, . . . , n, by the
composition method using p(θ(−m)) and h(θ(m) | θ(−m), x). However, since we
used p(·) instead of the marginal posterior for θ(−m), the samples θ(i)(m) are not
a random sample from h(θ(m) | x). Instead, the θ(i)(m) are a weighted sample of
where Θ−m(θ(m)) = {θ(−m) : (θ(m), θ(−m)) ∈ Θ} is the subspace of Θ with fixed
θ(m), and w(θ(m) | θ(−m)) is a completely known conditional density whose
support, Θm(θ(−m)) = {θ(m) : (θ(m), θ(−m)) ∈ Θ}, contains the support of
h(θ(m) | θ(−m), x). The criteria for selecting the function w are the
same as for selecting an importance sampling density.
The form of (4.21) highlights that the normalization constants of the joint
posterior distribution (and, a fortiori, conditional) cancel out and that the
marginal density in question can be unbiasedly estimated, based on a sample
θ(i) = (θ(i)(m), θ(i)(−m)), 1 ≤ i ≤ n, of h(θ | x), by

ĥ(θ*(m) | x) = (1/n) Σᵢ₌₁ⁿ w( θ(i)(m) | θ(i)(−m) ) h( θ*(m), θ(i)(−m) | x ) / h( θ(i)(m), θ(i)(−m) | x ).  (4.22)
4.3 Sequential Monte Carlo 59
likelihood factor for yt . This is essentially the idea of the auxiliary variable
particle filter, except that by combining steps (1) and (3), the algorithm
makes the sampling more efficient.
Similar to (4.30), we define
ht ≈ ĥt ∝ Σᵢ wt−1,i f(yt | θt) p(θt | θt−1,i)  (4.31)
without the likelihood approximation in g(θt , i). Instead, the normal as-
sumption allows us to use Bayes’ theorem and replace the last two factors
to obtain
ĥ(θt, i) ∝ wt−1,i N(θt | mi, vi) N(yt | µ(θt−1,i), σ²(θt−1,i) + 1),  (4.35)

where the underbrace in the display marks the factor λi = N(yt | µ(θt−1,i), σ²(θt−1,i) + 1).
completed with a prior on the static parameters and the initial state (φ, θ0 ) ∼
h(φ, θ0 ). For example, in the ARCH model, φ = (β0 , β1 , σ2 ).
There are several approaches to generalizing particle filter methods to
include parameter learning. Liu and West (2001) reduce static parameter
learning to the earlier problem by allowing a negligibly small but positive
evolution noise for φ, thus including it as φt in the state vector. Polson
et al (2008) use an (approximate) sufficient statistic to define the “practical
filter.” For a review and more detailed discussion see, for example, Prado
and West (2010).
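For orientation, the basic propagate-weight-resample cycle underlying all of these particle filter variants can be sketched for a toy linear state-space model (this is the plain bootstrap filter with fixed parameters and no parameter learning, not the adapted auxiliary filter above; model and numbers are made up for illustration).

```python
import math, random

random.seed(5)

# Toy model: theta_t = phi*theta_{t-1} + eta_t,  y_t = theta_t + eps_t (all normal).
phi, s_eta, s_eps, T, N = 0.9, 0.5, 1.0, 100, 2000

# Simulate states and data.
th, y = [0.0], []
for t in range(T):
    th.append(phi * th[-1] + random.gauss(0, s_eta))
    y.append(th[-1] + random.gauss(0, s_eps))

def lik(yt, theta):
    # likelihood factor f(y_t | theta_t), up to a constant
    return math.exp(-0.5 * ((yt - theta) / s_eps) ** 2)

particles = [random.gauss(0, 1) for _ in range(N)]
means = []
for yt in y:
    # (1) propagate through the transition, (2) weight by the likelihood,
    # (3) resample in proportion to the weights
    particles = [phi * p + random.gauss(0, s_eta) for p in particles]
    w = [lik(yt, p) for p in particles]
    tot = sum(w)
    means.append(sum(wi * pi for wi, pi in zip(w, particles)) / tot)
    particles = random.choices(particles, weights=w, k=N)
print(means[-1])
```

The filtered means in `means` approximate E(θt | Dt) and should track the simulated states closely.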
Problems
4.1 Simple Monte Carlo integration. Implement inference for Example 4.1.
a. Verify the posterior means πk shown in Figure 4.1, and add central 50%
credible intervals for πk , k = 1, . . . , K.
b. Assume the next patient is treated at dose k5 = 3, and we record no tox-
icity, y5 = 0. Find the updated posterior means πk = E(πk | D5 ) and
determine the treatment allocation k? for the next patient.
4.2 Importance sampling: probit regression. Refer to Example 4.2.
a. Based on Figure 4.2 and the design rule as it is summarized in the exam-
ple, which dose k? is assigned to the next, (n + 1)-st patient?
b. Let θ = (a, b). Evaluate the marginal posterior distributions h(πk | Dn )
using importance sampling, using a bivariate normal importance function
p(θ) = N(m, V), as in the example. Plot h(πk | Dn ), k = 1, . . . , K. In the
plot indicate the four intervals for under-dosing, target dose, excessive
dose, and unacceptable toxicity by vertical dotted lines on the interval
boundaries.
c. Let {θ(m) ; m = 1, . . . , M} denote the importance sample from p(θ). Figure
4.2(a) shows the importance weights w(m) ≡ h(θ(m) | y)/p(θ(m) ) for an
importance Monte Carlo sample {θ(m) ; m = 1, . . . , M} of size M = 100.
Redo the importance sampling estimate, now with M = 1000, and plot
the histogram of weights. What do you observe?
d. Now consider an alternative bivariate Student t importance function, p2(θ) =
t2(m, V; ν) with ν = 4. Plot a histogram of the importance sampling
weights w2(m) = h(θ(m) | y)/p2(θ(m)) and compare.
4.3 Importance sampling: nonlinear regression. Let x denote the number of
plants per unit area and y denote the yield per plant. To model the relation-
ship between yield and planting density, we use the sampling model
f (yi | θ) = N(µi , σ2 ) with µi = 1/(α + βxi + γxi2 ).
Let τ = ln(σ2 ) and θ = (α, β, γ, τ). In this model, 1/α = lim x→0 y is inter-
preted as “genetic potential” and 1/β can be interpreted as “environmental
potential.”
The model is completed with a non-informative prior h(θ) ∝ 1. Use the data
set onion in the R package SemiPar. The data set includes the variables
dens, yield, and location. We use data from location 0 (Purnong
Landing) only. Scale the response as y = yield/100, and standardize x as x =
(d − d̄)/sd, where d = density, and d̄, sd² are the sample mean and variance.
a. Set up a multivariate normal importance sampling density p(θ). Use a
N(m, V) with m being the posterior mode (MAP) and V = −H −1 being
the negative inverse Hessian of the log posterior at the mode.8 Carry out
importance sampling to estimate the posterior means and standard de-
viations of θ. Report numerical uncertainties and show a histogram of
importance sampling weights.
b. Repeat the same importance sampling estimation with a multivariate Stu-
dent t distribution with ν = 4 d.f. (see Appendix A). See also the hint for
Problem 4.4(b).
c. Why would one recommend the Student t importance function? Explain.
4.4 Importance sampling: hierarchical model. Carlin and Gelfand (1991) con-
sider a hierarchical event rate model for the number of pump failures (yi )
over given exposure times (ti ), i = 1, . . . , n = 10. The data are shown in the
following table.

yi | 5     1     5     14     5    19    1     1     4     22
ti | 94.32 15.72 62.88 125.76 5.24 31.44 1.048 1.048 2.096 10.48
We assume a Poisson regression,
yi ∼ Poi(λi ti ),
with a hierarchical prior for Li = ln(λi ):
Li ∼ N(η, σ2 ), and η ∼ N(µ, τ2 ), γ ≡ 1/σ2 ∼ Ga(a, b).
We fix the hyperparameters at (µ, τ2 , a, b) = (−1, 1, 1, 1).
Let g = ln(γ) and let θ = (η, g, L1, . . . , Ln) denote the complete parameter
vector. Let θ̂ and S denote the mean and covariance matrix
of a multivariate normal approximation of the joint posterior distribution
h(θ | y) (for example, posterior mode and negative inverse Hessian of the
log posterior at the mode, as in Problem 4.3). Under the multivariate N(θ̂, S)⁸
distribution for θ, let m = (θ̂1, θ̂2) and V denote the corresponding marginal
moments for ψ = (η, g).

⁸ We will discuss the use of such multivariate normal approximations in Chapter 8.
Let p1 (ψ) = t2 (m, V; ν) (use ν = 2), and p2 (Li | η, γ) ∝ N(Li | η, σ2 )Poi(yi |
ti eLi ). Here, Poi(y | µ) denotes the Poisson probability mass function with
rate µ evaluated for y and similar for N(x | µ, σ2 ). Use an importance sam-
pling density
p(θ) = p1 (η, g) Πni=1 p2 (Li | η, g). (4.38)
We can generate from p(·) by first generating ψ = (η, g) from the bivariate
t distribution p1 (ψ), and then Li from p2 using, for example, a grid-based
method using evaluation of p2 on a grid (using the R function sample();
see Appendix B).
a. Using the importance sampling function (4.38), find the expression for
the importance sampling weights wi .
Hint: Do not forget the normalization constant in p2 (Li | η, g).
b. Let y = (y1 , . . . , yn ) denote the data. Implement importance sampling
to evaluate posterior moments, λ̄i = E(λi | y) and s2i = Var(λi | y),
i = 1, . . . , n. Show a histogram of the importance sampling weights, a
table of estimated posterior moments λ̄i (si ), and corresponding numerical
uncertainties.
Hint: To generate from the bivariate t distribution tν(m, V) with V = LL′,
you could use: z ∼ N2(0, I), y = Lz, u ∼ χ²(ν), and ψ = m + √(ν/u) y. Then,

p(ψ) ∝ [1 + (1/ν)(ψ − m)′ V⁻¹ (ψ − m)]^{−(ν+2)/2} = [1 + z′z/u]^{−(ν+2)/2}.
c. Use suitable Monte Carlo methods to evaluate and plot the marginal pos-
terior distributions h(λi | y), i = 1, . . . , n.
d. The problem greatly simplifies with λi ∼ Ga(c, ηc ). Why?
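The generation scheme in the hint for part (b) can be implemented in a few lines; m and the Cholesky factor L below are made-up illustrative values.

```python
import math, random

random.seed(6)

# psi ~ t_nu(m, V) with V = L L', via z ~ N2(0, I), y = L z, u ~ chi2(nu),
# psi = m + sqrt(nu/u) * y, as in the hint.
nu = 4
m = (0.5, -1.0)
L = ((1.0, 0.0), (0.3, 0.8))        # hypothetical Cholesky factor of V

def draw_psi():
    z = (random.gauss(0, 1), random.gauss(0, 1))
    y = (L[0][0] * z[0], L[1][0] * z[0] + L[1][1] * z[1])
    u = random.gammavariate(nu / 2, 2)          # chi-square(nu)
    s = math.sqrt(nu / u)
    return (m[0] + s * y[0], m[1] + s * y[1])

draws = [draw_psi() for _ in range(100_000)]
mean = (sum(d[0] for d in draws) / len(draws),
        sum(d[1] for d in draws) / len(draws))
print(mean)    # close to m = (0.5, -1.0), since nu > 1
```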
4.5 The skewed normal distribution defines a skewed continuous random vari-
able. Let ϕm,s denote a normal p.d.f. with moments (m, s) and let Φ(z) denote
a standard normal c.d.f. The skewed normal can be defined by its p.d.f.,
x − µ
f (x | µ, σ, α) ∝ ϕµ,σ (x) Φ α .
σ
For α > 0 (α < 0), the distribution is right (left) skewed. We write X ∼
SN(µ, σ, α) for a skewed normal random variable X. For the following ques-
tions use the data set stockreturns.txt⁹ of (simulated) stock returns. We
assume the sampling model
xi ∼ SN(µ, σ, α),
⁹ Available on the book’s homepage at sites.google.com/view/computationalbayes/home
for the quadratic approximation ℓ(θt) ≈ −1/(2σt²)(θt − µt)², using a Taylor
series expansion of ℓ(θt).
b. Use this quadratic approximation to derive a variation of (4.35).
c. Using that variation of (4.35), work out appropriate variations (1′′), (2′′),
and (3′′) of the adapted particle filter (the weights in step (3′′) are no
longer constant).
4.10 Consider a stochastic volatility model,
yt = εt β exp(θt/2) and θt+1 = φθt + ηt,
with εt ∼ N(0, 1) and ηt ∼ N(0, ση²). For this problem we fix φ = 0.9702, ση =
0.178, and β = 0.5992 (Pitt and Shephard, 1999: section 4.2).
a. Simulate a data set y1 , . . . , yT , using θ0 = 0 and T = 200. State the random
variate seed that you use. For example, if you use R, use set.seed(1963).
Plot yt versus t as a time series.
b. Let ℓ(θt) = ln f(yt | θt) denote the log-likelihood factor for θt. Find (µt, σt)
for the quadratic approximation ℓ(θt) ≈ −1/(2σt²)(θt − µt)² (see the previous
problem).
c. Implement the adapted auxiliary particle filter from the previous problem.
As in Section 4.3, let Dt = (y1 , . . . , yt ) denote the data up to time t. Plot the
posterior means θt = E(θt | Dt ) against time, and add posterior medians
and 25% and 75% quantiles.
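Part (a) of Problem 4.10 can be sketched as follows. The problem suggests set.seed(1963) in R; here Python's generator is used instead, so the simulated path will differ from an R run.

```python
import math, random

random.seed(1963)

# Stochastic volatility model with the fixed parameters of the problem:
# y_t = eps_t * beta * exp(theta_t / 2),  theta_{t+1} = phi*theta_t + eta_t.
phi, s_eta, beta, T = 0.9702, 0.178, 0.5992, 200

theta = [0.0]                        # theta_0 = 0
y = []
for t in range(T):
    theta.append(phi * theta[-1] + random.gauss(0, s_eta))
    y.append(random.gauss(0, 1) * beta * math.exp(theta[-1] / 2))
print(len(y), y[:3])
```

The series y would then be plotted against t before moving on to parts (b) and (c).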
5 Model Assessment
This chapter deals with approaches for model assessment aiming to imple-
ment a selection and comparison leading to the choice of the best model(s).
We start in Section 5.1 with a description of model diagnostics such as
Bayesian p-values associated with various discrepancy measures, resid-
uals, conditional predictive ordinates, and corresponding pseudo-Bayes
factors. In Section 5.2 we then consider several measures of predictive
performance, known by their acronyms as AIC, SIC (or BIC), DIC, and
WAIC, and ways of using Bayes factors for selection and comparison tasks.
Finally, in Section 5.3, using Monte Carlo samples to represent posterior
distributions in complex models, we review ways of estimating relevant
quantities for model assessment in this context.
Bayesian P-values
Box (1980, 1983) proposed to base a critical analysis of a model on the
comparison of the data x and the marginal (or prior predictive) p(x) =
Eh [ f(x | θ)] by way of marginal p-values P[p(X) ≥ p(x)]. This approach
does not work with improper priors. In particular, it prevents the use of
hypothetical replicate data based on simulation from h(θ). Even when h(θ)
is proper, Monte Carlo methods tend to be in general unstable (Gelfand,
1996), and therefore not reliable, for the evaluation of p(x).
5.1 Model Criticism and Adequacy 71
One of the most widely used strategies for a critical model examination
is based on the posterior predictive distribution under the model,
p(y | x) = ∫Θ f(y | θ) h(θ | x) dθ.  (5.1)
Here, y are hypothetical future data (see the following for details) and x
are the observed data. Under a good (bad) fit of the model, the data y
should (should not) resemble the observed data. Any systematic
discrepancies between the two data sets, y and x, are evidence for some
flaw in the model.
The discrepancy between the observed data and data that are observable
under the model can be evaluated by summary variables V(x, θ), which
should be chosen to relate to the specific model aspects that the investigator
wishes to evaluate. Examples of such variables include standardized mean
squared deviations defined as
V(x, θ) = Σᵢ [xi − E(Xi | θ)]² / Var(Xi | θ)  (5.2)
and the actual log-likelihood, ln f (x | θ), as well as the special cases of
statistics V(x) and functions g(θ) of the parameters only.
The posterior predictive distribution is typically determined by simula-
tion from the joint distribution of (y, θ), conditional on x. Consider, then,
{(yk, θk), k = 1, . . . , m}, a set of values generated from this joint distribu-
tion. The comparison of the real data x and predictive data via the sum-
mary variable V can be shown graphically by a scatter plot of the val-
ues {(V(yk, θk), V(x, θk)), k = 1, . . . , m}, or by a histogram of {V(yk, θk) −
V(x, θk), k = 1, . . . , m}. For a model that fits the data well, the points in the
scatter plot should be symmetric with respect to the 45-degree line and the
histogram should include 0.
One of the summary measures of discrepancy between V(x, θ) and the
distribution of V(Y, θ) is the posterior predictive p-value, sometimes re-
ferred to as the Bayesian p-value, defined as
PB = P[V(Y, θ) ≥ V(x, θ) | x] = E(Y,θ) [ I{V(Y, θ) ≥ V(x, θ)} | x ].  (5.3)
Note that a very small or very large value of PB (for example, below 1% or
above 99%) implies that V(x, θ) falls into one or the other extreme tail of
the posterior distribution of V(Y, θ), given x (in both cases, being in general
implausible under the model). In this way, both cases are evidence for a
bad fit of the model to the data with respect to the characteristics that are
reflected in the chosen summary variable. For this reason, the Bayesian p-
value can, alternatively, be defined as PB = P [V(Y, θ) ≤ V(x, θ) | x], which
can be regarded as resulting from (5.3) by taking the negative of the chosen
variable. Typically, the Bayesian p-value, PB , is calculated as a proportion
of simulated values from the posterior predictive distribution of (Y, θ) given
x which satisfy V(yk , θk ) ≥ V(x, θk ). Such simulated values can be obtained
by first generating from h(θ | x) and then future data y from the sampling
model.
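The simulation just described can be sketched for a toy case: a normal sampling model with known variance 1 and flat prior, so that h(µ | x) = N(x̄, 1/n), with the discrepancy (5.2). The data below are deliberately generated overdispersed relative to the model, so a small PB is expected; everything here is illustrative, not from the text.

```python
import math, random, statistics

random.seed(8)

x = [random.gauss(0.0, 1.6) for _ in range(40)]   # overdispersed vs the model
n, xbar = len(x), statistics.mean(x)

def V(data, mu):
    # discrepancy (5.2) with Var(Xi | theta) = 1
    return sum((xi - mu) ** 2 for xi in data)

count, M = 0, 5000
for _ in range(M):
    mu = random.gauss(xbar, 1 / math.sqrt(n))     # theta^k ~ h(mu | x)
    y = [random.gauss(mu, 1.0) for _ in range(n)] # y^k ~ f(y | theta^k)
    if V(y, mu) >= V(x, mu):
        count += 1
pB = count / M
print(pB)   # a small pB flags the overdispersion built into x
```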
When the discrepancy measure is a statistic V(X), the posterior predic-
tive evaluation can alternatively be applied separately for each observation,
making use of the marginal predictive distributions p(yi | x). The corre-
sponding marginal p-values with respect to V(Xi ) are defined as PBi =
P [V(Yi ) ≥ V(xi ) | x], i = 1, . . . , n, which reduces to PBi = P(Yi ≥ xi | x) if
V is the identity. This is useful to detect outliers and to check whether there
is a global correspondence between the model and observed data. p-values
centered at extremes or too concentrated in the middle of the range of val-
ues are evidence for overdispersion and underdispersion, respectively, rel-
ative to the predictive data.
Extreme values of a measure of statistical significance, such as PB , need
not necessarily prompt one to completely abandon the model if the char-
acteristic that is expressed by the corresponding variable V is not of great
practical relevance. Even if it were, the features that are subject to criticism
resulting from this model inspection should suggest possible directions for
model revision to find models that are more adequate for the context of the
actual data.
the more adequate is a model. For other diagnostic instruments see, in par-
ticular, Gelfand (1996) and Gelman and Meng (1996).
Example 5.1 Henderson and Velleman (1981) describe a study of the per-
formance of car models in terms of fuel consumption. We use a subset of
the data from this study. The data report efficiency, E f , measured in miles
per gallon, weight in pounds (X1), power in horsepower (X4*), and number
of gears in the transmission, at levels 3, 4, or 5, jointly represented by indi-
cator variables for 4 (X2 ) and 5 (X3 ), respectively. After some preliminary
analysis they consider normal linear regression models for the transformed
response variable Y = 100/E f , gallons per 100 miles (Y here is unrelated
to the earlier notation y for the predictive data).
One of the models involves the explanatory variables X1 , (X2 , X3 ), and
X4 = X4∗ /X1 (power per unit weight) in a multiple regression model:
M1: µ ≡ E(Y) = β0 + Σⱼ₌₁⁴ βj Xj + β5 X2 X4 + β6 X3 X4,
[Figure: scatter plots of predictive versus observed discrepancies V(·, θ) for models M1 and M2, with the 45-degree line as reference.]
respect to the 45-degree line seems a little less pronounced for M2 than for
M1 . The Bayesian p-values associated with the standardized mean squared
deviations V(·, θk ) are evaluated with the observed data and the poste-
rior distribution of V(Y ∗ , θ) for the predicted data Y ∗ . The corresponding
Bayesian p-values are tabulated for the three models in Table 5.1, and
suggest that the reduced models perform better than the encompassing
model, in terms of fitting the data. The sum of the logarithmic CPOs suggests
the same, with a slight advantage for model M2. Following a proposal
by Gelfand (1996), these quantities were estimated by the harmonic mean
of the assessed sampling density with respect to a simulated sample from
the posterior of θ (see Section 5.3.1).
model dimension is larger than the corresponding term in the AIC, thus
heavily penalizing more complicated models.
To avoid simulation-based maximization, Carlin and Louis (2009) sug-
gested a modification of this criterion using the posterior mean of the log-
likelihood instead of the maximum. This modified version of the BIC is
thus defined as
BICCL = −2 Eθ|x [ln f (x | θ)] + p ln n.   (5.12)
between posterior mean deviance and deviance under the posterior mean
when θ̄ = E(θ | x). This quantity naturally depends on the data, the param-
eter focus, and the prior information.
Alternatively other Bayesian estimates of θ and D(θ) could be used, with
naturally different results than under the posterior means. One advantage
of the earlier definition is easy computation. As long as the model includes
a closed-form likelihood function, one can always use the approximation
pD ≃ (1/m) Σ_{k=1}^m D(θk ) − D( (1/m) Σ_{k=1}^m θk ),   (5.15)
where θk are a Monte Carlo sample from h(θ | x). Another advantage is the
fact that pD ≥ 0 for any model with log-concave likelihood (by Jensen’s
inequality).
In some standard models (without hierarchical structure) with a likeli-
hood dominating the prior, it can be shown that pD is approximately equal
to the actual number of parameters. The interpretation of pD as a measure
of dimensionality of the model attempts to extend the same notion to more
complex models for which it is not easy to determine the number of pa-
rameters, such as models that include latent variables. However, in some
models such as finite mixture models or some hierarchical models the im-
plied pD can be negative. See, for example, Celeux et al (2006).
Using pD as a measure of model complexity (and for the moment ignor-
ing the standardization function g), the corresponding measure of predic-
tive accuracy becomes
ÃDIC = ln f (x | θ̄) − pD . (5.16)
This leads to the DIC criterion (deviance information criterion) proposed
by Spiegelhalter et al (2002) as
DIC = D(θ̄) + 2pD = D̄ + pD = 2D̄ − D(θ̄), where D̄ = Eθ|x [D(θ)].   (5.17)
In practice, the normalizing factor in the deviance is usually omitted
(that is, g(x) = 1) when the Bayesian models under consideration are all
based on the same sampling model and differ only in the parametric struc-
ture. Otherwise, one must be careful because the DIC measure depends on
the chosen function g(·). Regarding this, it is by no means straightforward
to determine the maximum likelihood for the saturated structure in most
multiparameter models.
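As a numerical illustration of (5.15) and (5.17), a minimal Python sketch (the book's exercises use R), with made-up conjugate-normal data rather than the car data of Example 5.1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup (invented data): x_i ~ N(theta, 1) with a N(0, 1)
# prior, so theta | x ~ N(n*xbar/(n+1), 1/(n+1)).
x = rng.normal(1.0, 1.0, size=50)
n = len(x)
theta = rng.normal(n * x.mean() / (n + 1), np.sqrt(1 / (n + 1)), size=5000)

def deviance(th):
    # D(theta) = -2 ln f(x | theta) for the N(theta, 1) sampling model
    th = np.atleast_1d(th)
    ll = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - th[:, None]) ** 2
    return -2 * ll.sum(axis=1)

D_bar = deviance(theta).mean()     # posterior mean deviance
D_hat = deviance(theta.mean())[0]  # deviance at the posterior mean
p_D = D_bar - D_hat                # approximation (5.15)
DIC = D_hat + 2 * p_D              # definition (5.17)
print(round(p_D, 2), round(DIC, 1))
```

For this one-parameter model pD should come out close to 1, the actual number of parameters.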
Just like Ã(x) previously, also the complexity penalty term involves an ex-
pression in terms of the individual data. One of the proposals for pW is
similar to the one used in the DIC, expressed as
pW1 = −2 Σ_{i=1}^n { Eθ|x [ln f (xi | θ)] − ln Eθ|x [ f (xi | θ)] }
  ≃ −2 Σ_{i=1}^n { (1/m) Σ_{k=1}^m ln f (xi | θk ) − ln[ (1/m) Σ_{k=1}^m f (xi | θk ) ] },   (5.19)
with pW1 ≥ 0 by construction.
Another proposal for pW is based on the posterior variance of ln f (xi | θ)
for all data points, and defined as
pW2 = Σ_{i=1}^n Varθ|x [ln f (xi | θ)]
  ≃ Σ_{i=1}^n (1/(m − 1)) Σ_{k=1}^m [ lk (xi ) − l̄(xi ) ]²,   (5.20)
where lk (xi ) = ln f (xi | θk ) and l̄(xi ) is the average of the lk (xi ) over k.
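The approximations (5.19) and (5.20) are straightforward to compute from the matrix of pointwise log densities ln f(xi | θk). A Python sketch, again with invented normal data and a posterior sample (flat prior for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up setup: x_i ~ N(theta, 1); ll[i, k] = ln f(x_i | theta_k)
x = rng.normal(1.0, 1.0, size=50)
theta = rng.normal(x.mean(), np.sqrt(1 / len(x)), size=4000)
ll = -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - theta[None, :]) ** 2
m = ll.shape[1]

# (5.19): mean log density versus log mean density, summed over data points
p_W1 = -2 * np.sum(ll.mean(axis=1) - (np.logaddexp.reduce(ll, axis=1) - np.log(m)))
# (5.20): posterior variances of the pointwise log densities
p_W2 = np.sum(ll.var(axis=1, ddof=1))
print(p_W1, p_W2)  # both act as effective numbers of parameters (here near 1)
```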
example, using the sum of squares (or of absolute values) of these residu-
als, one should favor models that report smaller values.
Another means of evaluating comparative adequacy among several mod-
els is the sum of the log conditional predictive ordinates, with the idea of
selecting the models with the highest values. When comparing two models,
H1 versus H2 , by means of CPO one can use
PBF = Π_{i=1}^n [ p(xi | x(−i) , H1 ) / p(xi | x(−i) , H2 ) ] ≡ Π_{i=1}^n Qi .   (5.22)
This is known as the pseudo-Bayes factor (PBF). Keeping in mind that the
product of CPOs for each model is used as a substitute for the respective
marginal likelihood (recall the definition of the Bayes factor), then values
of the pseudo-Bayes factor greater (less) than 1 are evidence for model
H1 (H2 ). Besides this, since observations with Qi greater (less) than 1 are
evidence for H1 (H2 ), a plot of ln(Qi ) versus i is useful to visualize which
observations are better fit by which of the two competing models.
{p∗j } for the dimension of the parameter vectors under each model, we get
rj (BIC) ≃ −2 ln[ pj (x) / po (x) ] = −2 ln Bjo (x),   (5.23)
using the approximation pj (x) ≃ fj (x | θ̂j ) n^{−p∗j /2}.
That is, the differences r j (BIC) are related with the Bayes factor between
the two models under comparison, say H j and Ho , in the sense that for large
sample sizes they can be approximated by −2 ln B jo (x), using an argument
of Schwarz (1978).
Example 5.2 Continuing with the earlier example, consider now compar-
ative evaluation of the three multiple regression models with respect to
their predictive performance. The PBF for the pairwise comparisons are
PBF(M1 /M2 ) = 0.809; PBF(M1 /M3 ) = 0.941, and PBF(M2 /M3 ) = 1.164.
These values support the earlier result that M2 was the best and M1 the
worst of the three models, in terms of this criterion based on conditional
predictive ordinates.
A comparison with respect to the information criteria shown in Table 5.2 reports model M2 as the best in terms of the Bayesian summaries DIC and WAIC; M2 is, however, dominated by M3 under the BIC. The latter
is no surprise, knowing that BIC favors the more parsimonious models.
Table 5.2 DIC, BIC and WAIC criteria for models M1 , M2 and M3
Model DIC (pD ) BIC (p) WAIC (pW2 )
M1 48.69 (8.27) 67.36 (8) 46.77 (5.38)
M2 47.32 (6.19) 61.33 (6) 46.70 (4.78)
M3 47.58 (4.12) 56.93 (4) 47.40 (3.48)
Pooling the results here and in the previous example, the applied criteria
show that the overall best of the three models is the one with intermediate
complexity as measured by the number of parameters.
available information. But there is no concern that any model stands for the
reality under study.
In contrast, some methods correspond to a perspective (possibly con-
troversial) that the set of considered models must include a so-called true
model, in the sense of being the one that generated the observed data (or,
at least, which constitutes a sufficiently good approximation of the true
model). The method of using Bayes factors to select one of a finite number
of models fits into this perspective. This and other aspects that could be
criticized should not prevent one from using this method in some contexts,
with due caution.
Any set of Bayesian models M = {M j , j ∈ J} (with different sampling
models and prior distributions) implies a set of corresponding prior predic-
tive models
p(x | Mj ) ≡ pj (x) = ∫ fj (x | θj ) hj (θj ) dθj ,   j ∈ J.   (5.24)
Here, P(Mj ) is the prior probability of Mj being the true model, which is
updated in the face of data x to
P(M j | x) = P(M j )p(x | M j )/p(x), j∈J. (5.26)
The Bayes factor in favor of Mk versus M j is thus the ratio
Bk j (x) = [ P(Mk | x)/P(Mj | x) ] / [ P(Mk )/P(Mj ) ] = p(x | Mk ) / p(x | Mj ),   (5.27)
and the posterior odds for Mk are
P(Mk | x) / (1 − P(Mk | x)) = P(Mk ) / [ Σ_{j≠k} P(Mj ) Bjk (x) ],   (5.28)
where Bjk (x) = 1/Bk j (x) is the Bayes factor in favour of Mj versus Mk . The
formula for these odds becomes ( Σ_{j≠k} Bjk (x) )^{−1} when one uses a uniform
prior over the space of models. Simple examples of the application of this
method can be found in Paulino et al (2018).
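A sketch of (5.26)–(5.27) in Python, with hypothetical log marginal likelihoods ln p(x | Mj) for three models and a uniform prior (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical log marginal likelihoods ln p(x | M_j) for three models
# (invented numbers) and a uniform prior P(M_j) = 1/3.
log_px = np.array([-112.4, -110.1, -111.3])
log_prior = np.log(np.ones(3) / 3)

# (5.26): P(M_j | x) = P(M_j) p(x | M_j) / p(x), normalized on the log scale
log_post = log_prior + log_px
post = np.exp(log_post - np.logaddexp.reduce(log_post))

# (5.27): Bayes factor of M_2 versus M_1 (0-based indices 1 and 0)
B_21 = np.exp(log_px[1] - log_px[0])
print(post, B_21)  # B_21 = exp(2.3), about 10
```

Working on the log scale avoids underflow, since marginal likelihoods of realistic data sets are astronomically small numbers.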
In general, the evaluation of prior predictive distributions, and thus of
Bayes factors, requires simulation – see the preceding chapter and the next section.
5.3 Further Notes on Simulation in Model Assessment
Model comparison based on Bayes factors requires the use of practical rules for the choice of thresholds on the strength of evidence in favor
of a model. One such rule is proposed by Kass and Raftery (1995).
we get
"Z #−1
p(x) 1
p(xi | x(−i) ) = = h(θ | x) dθ ,
p(x(−i) ) f (xi | x(−i) , θ)
and thus, if {θ(?j) ; j = 1, . . . , m} is a sample from h(θ | x), we have
m
−1
1 X 1
p̂(xi | x(−i) ) = ?
. (5.29)
m j=1 f (xi | x(−i) , θ( j) )
This estimate of p(x) has been shown, however, to be in general very inef-
ficient, as we already mentioned in Section 4.2.2.
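The estimate (5.29) is easy to implement on the log scale; for conditionally independent data, f(xi | x(−i), θ) = f(xi | θ). A Python sketch with invented normal data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up setup: x_i ~ N(theta, 1) i.i.d. given theta, so that
# f(x_i | x_(-i), theta) = f(x_i | theta), evaluated at a posterior sample.
x = rng.normal(0.0, 1.0, size=40)
theta = rng.normal(x.mean(), np.sqrt(1 / len(x)), size=4000)
ll = -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - theta[None, :]) ** 2

# (5.29) on the log scale: ln CPO_i = ln m - logsumexp(-ll_i)
m = ll.shape[1]
log_cpo = np.log(m) - np.logaddexp.reduce(-ll, axis=1)
sum_log_cpo = log_cpo.sum()  # the summary used to compare models
print(sum_log_cpo)
```

The sum of log CPOs is the quantity used in Example 5.1 and in the pseudo-Bayes factor (5.22).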
Newton and Raftery (1994) suggested the harmonic mean of { f (x | θ(?j) )},
that is,
p̂(x) = [ (1/m) Σ_{j=1}^m 1/ f (x | θ⋆(j) ) ]^{−1}.   (5.30)
More discussion of this and other prior predictive density estimates can
be found in Kass and Raftery (1995).
Problems
5.1 Replicate Example 5.1. The data are available in R as the data frame mtcars.
Evaluate the PBF and the log pseudo-Bayes factor (5.22).
5.2 Let yi be the number of misprints on page i of the book Bayesian Theory by
Bernardo and Smith (2000). Consider two alternative models:
Model H1 : yi | λ ∼ Poi(λ · Ni ), i = 1, . . . , n
λ ∼ Ga(1, β)
Model H2 : yi | θ ∼ Bin(Ni , θ)
θ ∼ Be(1, β − 1),
where Ni is the number of characters per page, β > 1 is a fixed hyperparam-
eter, and the gamma distribution is parametrized such that E(λ) = 1/β. That
is, under both models 1/β is interpreted as the expected number of typos per
word.
a. Find the Bayes factor B = p(y | H2 )/p(y | H1 ) for choosing between
competing models H1 and H2 .
Model H1 :
yi = β1 + β2 xi + εi , i = 1, . . . , n; εi ∼ N(0, 1)
β1 ∼ N(0, 1), β2 ∼ N(1, 1),
with β1 and β2 a priori independent.
Model H2 :
yi = γ1 + γ2 xi + γ3 xi² + εi , i = 1, . . . , n; εi ∼ N(0, 1)
γ1 ∼ N(0, 1), γ2 ∼ N(1, 1), γ3 ∼ N(0, 1),
with γ1 , γ2 , γ3 a priori independent.
a. Find the marginal distributions p(y | H1 ) = ∫ f (y | β, H1 ) h(β) dβ and
p(y | H2 ) = ∫ f (y | γ, H2 ) h(γ) dγ.²
² Although the marginal distributions might be improper, i.e., meaningless.
6 Markov Chain Monte Carlo Methods
6.1 Definitions and Basic Results for Markov Chains
of this text.2 For simplicity, we use in this section (and beyond) a generic
notation for the states of a chain. In the following sections we describe
the most commonly used methods, including Metropolis–Hastings chains,
Gibbs samplers, slice sampling, and Hamiltonian Monte Carlo. The last
section is dedicated to implementation questions related to MCMC meth-
ods, including the question of one versus multiple parallel chains and con-
vergence diagnostics.
P(Un+1 = v | U0 = u0 , . . . , Un = u) =
P(Un+1 = v | Un = u) ≡ p(u, v), ∀n ≥ 0, u, v ∈ U.
In the case of a finite state space the transition probabilities p(·, ·) can be
recorded as a matrix P of probabilities for one step. When U is uncountably infinite and F(u, v) is absolutely continuous, then the transition function can be defined by a density p(u, v) = ∂F(u, v)/∂v.
For the moment we assume a discrete Markov chain. We have
P(Un+1 = v) = Σu P(Un = u) p(u, v) = Σu P(U0 = u) p^{n+1} (u, v),
irreducible and positive recurrent chain. However, with the additional condition of aperiodicity, defined as gcd{n ≥ 1 : p^n (u, u) > 0} = 1 (it suffices if there exists u such that p(u, u) > 0), such a chain is then called ergodic
and has a limiting behavior p^n (u, v) → π(v) as n → ∞, for all u, v ∈ U, thus assuring convergence of P(Un = u) to π(u) for all u.
In addition, if g(U) is a function defined on the state space of an ergodic
Markov chain with finite expectation under π, then we have
(1/n) Σ_{t=1}^n g(Ut ) → Eπ [g(U)] as n → ∞, a.s.
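This ergodic average can be checked numerically. A Python sketch (the book's exercises use R) for a two-state chain whose stationary distribution solves πP = π, here π = (0.25, 0.75):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two-state chain with one-step transition matrix P; solving pi P = pi
# gives the stationary distribution pi = (0.25, 0.75).
P = np.array([[0.7, 0.3],
              [0.1, 0.9]])
u, n = 0, 100_000
counts = np.zeros(2)
for _ in range(n):
    u = 1 if rng.random() < P[u, 1] else 0  # one transition step
    counts[u] += 1
freq = counts / n  # ergodic averages of g(U) = 1{U = v}
print(freq)  # close to (0.25, 0.75)
```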
u) dũ is the probability that the chain remains in u. The transition function
96 Markov Chain Monte Carlo Methods
one should always start with a preliminary analysis of π, such that q could
be chosen to approximate the target distribution as well as possible.
Given the generic nature of the M-H algorithm, we now describe two of
the most commonly used special cases. 7
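One such special case, a random walk Metropolis–Hastings chain, can be sketched in Python as follows (the standard normal target is a stand-in for a generic π):

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(u):
    # Standard normal target pi(u), up to a constant (illustrative choice)
    return -0.5 * u ** 2

def rw_metropolis(n_iter, step=1.0, u0=0.0):
    u, out = u0, np.empty(n_iter)
    for t in range(n_iter):
        v = u + step * rng.normal()  # symmetric proposal, q(v | u) = q(u | v)
        # accept with probability min{1, pi(v)/pi(u)}; q cancels by symmetry
        if np.log(rng.uniform()) < log_target(v) - log_target(u):
            u = v
        out[t] = u
    return out

draws = rw_metropolis(50_000)
print(draws.mean(), draws.var())  # roughly 0 and 1
```

The proposal step size trades off acceptance rate against the distance moved per iteration.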
transition of the Gibbs sampler generates iteratively the values u(t+1) for the next state
vector of the chain using the full conditional distributions
U1^{(t+1)} ∼ π(u1 | u2^{(t)} , u3^{(t)} , . . . , uk^{(t)} )
The next transition repeats the k-step cycle, now starting from u(t+1) . The
scheme is summarized in Algorithm 2, below.
each component, uj^{(t+1)} , of the next state vector for the chain using
Uj^{(t+1)} ∼ π(uj | u1^{(t+1)} , . . . , u_{j−1}^{(t+1)} , u_{j+1}^{(t)} , . . . , uk^{(t)} ),
for j = 1, 2, . . . , k.
2. At the end of k steps, take u^{(t+1)} = (u1^{(t+1)} , . . . , uk^{(t+1)} ) and repeat step 1 for t ≡ t + 1.
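This cycle can be sketched for a bivariate normal target, whose full conditionals π(u1 | u2) = N(ρu2, 1 − ρ²) are available in closed form (Python, with ρ = 0.8 chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
# Bivariate normal with correlation rho: the full conditionals are
# pi(u1 | u2) = N(rho*u2, 1 - rho^2) and symmetrically for u2.
rho, n_iter = 0.8, 50_000
sd = np.sqrt(1 - rho ** 2)
u1, u2 = 0.0, 0.0
out = np.empty((n_iter, 2))
for t in range(n_iter):
    u1 = rng.normal(rho * u2, sd)  # step j = 1 of the cycle
    u2 = rng.normal(rho * u1, sd)  # step j = 2, using the updated u1
    out[t] = (u1, u2)
print(np.corrcoef(out.T)[0, 1])  # close to rho = 0.8
```

Note that each step conditions on the most recently updated components, exactly as in step 1 of the algorithm.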
which depends on the past only through the previous value of U2 . In other
words, the definition of the marginal densities and the transition density for
U2 implies
π2 (ũ2 ) = ∫ π2 (ũ2 | w)π1 (w) dw = ∫ [ ∫ π2 (ũ2 | w)π1 (w | u2 ) dw ] π2 (u2 ) du2 = ∫ P(u2 , ũ2 )π2 (u2 ) du2 ,
π(u) remains the marginal distribution of the joint f (u, Z). 12 In some cases,
the augmented model f (u, Z) allows a much easier implementation of the
Gibbs sampler, now involving the distributions f (u | z) and f (z | u).
Example 6.2 Consider a diagnostic test for some disease with a binary
outcome (positive or negative). Taking a random sample of N patients, let
X be the number of positive results, and assume that X | φ ∼ Bi(N, φ).
Most diagnostic tests are subject to classification errors implying that the
probability of a positive can be written as φ = ασ + (1 − α)(1 − ε), where
θ = (α, σ, ε) and α is the prevalence of the disease, σ is the sensitivity of
the test (probability of a true positive), and ε is the specificity of the test
(probability of a true negative).
The vector θ is typically unknown and is the parameter vector of interest
in the inference problem. However, given the obvious overparametrization
of the sampling model, it is not possible to report inference on θ (or func-
tions of θ, except φ, for example) without additional data (e.g., from a fur-
ther test regarded as a gold standard) or a priori information related to the
type of diagnostic test and the disease under consideration. Suppose that
one only has access to such prior information, represented as independent
beta distributions for the components of θ, with fixed hyperparameters. The
posterior distribution takes the following analytically intractable form:
h(θ | x) ∝ f (x | θ) α^{ap −1} (1 − α)^{bp −1} σ^{cs −1} (1 − σ)^{ds −1} ε^{ce −1} (1 − ε)^{de −1} ,
θ ∈ (0, 1)3 . Intending to implement posterior inference using MCMC, the
Gibbs sampler is not a particularly easy approach for this posterior dis-
tribution because the conditional posterior distributions are complicated,
requiring specialized random variate generation methods. However, the
implementation of a Gibbs sampler is greatly facilitated by the following
model augmentation with latent data. Let Y = (X, Z1 , Z2 ), where Z1 (resp.
Z2 ) are unobserved (latent) data that report the number of individuals with
true positive (negative) results. A model for Y that remains consistent with
the observed data X is defined as
f (y | θ) = f (x | θ) f (z1 | x, θ) f (z2 | x, θ),
where now Z1 | x, θ ∼ Bi(x, ασ/φ) and Z2 | x, θ ∼ Bi(N −x, (1−α)ε/(1−φ)).
Note that the parameters of the conditional distributions for the latent
variables correspond to the so-called positive predicted values V+ = ασ/φ
12. The construction of f (u, Z) could also be referred to as demarginalization or augmentation of π(u).
tion for θ given the augmented data. That is, generate θ(1) as
α^{(1)} ∼ Be(Ap , Bp ), σ^{(1)} ∼ Be(Cs , Ds ), ε^{(1)} ∼ Be(Ce , De ).
Starting from θ^{(1)} , repeat the two-step cycle.
6.4 Slice Sampler
This algorithm was introduced, still without any reference to the Gibbs
sampler, by Tanner and Wong (1987) under the aforementioned title, where
they prove that h(θ | x, z1^{(t)} , z2^{(t)} ) converges as t → ∞ to h(θ | x), under quite general conditions.
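A Python sketch of this data augmentation Gibbs sampler for Example 6.2; the data (N, x) and the beta hyperparameters are invented for illustration, and the beta full conditionals follow from the complete-data counts (true positives z1, false positives x − z1, true negatives z2, false negatives N − x − z2):

```python
import numpy as np

rng = np.random.default_rng(7)
# Invented data and hyperparameters (not from the book): N tests, x positives;
# Be(9, 1) priors encode high prior sensitivity and specificity.
N, x = 200, 45
a_p, b_p, c_s, d_s, c_e, d_e = 1, 1, 9, 1, 9, 1
alpha, sigma, eps = 0.2, 0.9, 0.9  # initial values
draws = []
for t in range(5000):
    phi = alpha * sigma + (1 - alpha) * (1 - eps)
    # latent counts: true positives among the x positives,
    # true negatives among the N - x negatives
    z1 = rng.binomial(x, alpha * sigma / phi)
    z2 = rng.binomial(N - x, (1 - alpha) * eps / (1 - phi))
    # complete-data count of diseased individuals: true pos + false neg
    n_dis = z1 + (N - x - z2)
    alpha = rng.beta(a_p + n_dis, b_p + N - n_dis)
    sigma = rng.beta(c_s + z1, d_s + (N - x - z2))
    eps = rng.beta(c_e + z2, d_e + (x - z1))
    draws.append((alpha, sigma, eps))
draws = np.array(draws)[1000:]  # drop burn-in
print(draws.mean(axis=0))  # posterior means of (alpha, sigma, eps)
```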
Figure 6.1 Slice sampler for a univariate distribution π(u).
value z(t+1) is generated. The intersection of the line Z = z(t+1) with π(u)
defines the point(s) that delimit the horizontal slice S(z(t+1) ) (an interval
or union of intervals) over which u(t+1) is generated. In practice, the main
difficulty with this algorithm is the second step since the support S(z) of the
distribution on the horizontal slice could be complicated for a multimodal
π(u), which might need the use of other simulation methods in this step.
This interpretation remains essentially valid also for multivariate target
distributions, except for a naturally larger number of steps. As a conse-
quence, the convergence conditions for a slice sampler can be seen to fol-
low from those for the Gibbs sampler, based on the introduction of a vector
of auxiliary variables to deal with more complex target distributions (for a
discussion, see, for example, Robert and Casella, 2004).
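For a unimodal target the slice S(z) is a single interval, which can sometimes be found analytically. A Python sketch for a standard normal kernel (illustrative choice), where S(z) = (−b, b) with b = √(−2 ln z):

```python
import numpy as np

rng = np.random.default_rng(8)

def pi_u(u):
    # Unnormalized standard normal target (illustrative choice)
    return np.exp(-0.5 * u ** 2)

def slice_sampler(n_iter, u0=0.0):
    u, out = u0, np.empty(n_iter)
    for t in range(n_iter):
        z = rng.uniform(0, pi_u(u))  # vertical step: z | u ~ U(0, pi(u))
        b = np.sqrt(-2 * np.log(z))  # slice boundary: pi(u) > z iff |u| < b
        u = rng.uniform(-b, b)       # horizontal step: u | z ~ U on S(z)
        out[t] = u
    return out

draws = slice_sampler(50_000)
print(draws.mean(), draws.var())  # roughly 0 and 1
```

When S(z) cannot be solved for exactly, a stepping-out and shrinkage procedure is commonly used instead.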
Leapfrog Approximation
To implement Hamiltonian dynamics we use a discretization of the differ-
ential equation system (6.8) known as the leapfrog method. Starting from
(θ(t), p(t)) we generate (θ(t + ε), p(t + ε)) by using a discrete approximation
over two subintervals of length ε/2 each:
pi (t + ε/2) = pi (t) + (ε/2) ∂ ln π(θ(t))/∂θi
θi (t + ε) = θi (t) + ε pi (t + ε/2)
pi (t + ε) = pi (t + ε/2) + (ε/2) ∂ ln π(θ(t + ε))/∂θi .   (6.9)
Let Tε (θ(t), p(t)) = (θ(t + ε), p(t + ε)) denote the discrete approximation
implemented in (6.9). It is easily verified that the approximation is perfectly
reversible, that is, T−ε (θ(t + ε), p(t + ε)) = (θ(t), p(t)) again, or T−ε (θ, p) =
Tε^{−1} (θ, p). Even easier than turning back time, all we have to do to send
the ball back to where it came from is to turn it around. That is, p ≡ −p,
or Tε^{−1} (θ, p) = Tε (θ, −p). Note that the time index in θ(t) and p(t) is only
used for the implementation of T (·). After the last step in (6.9), we drop
the time index again.
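The leapfrog map (6.9) and its reversibility can be checked directly; a Python sketch for a two-dimensional standard normal target (an illustrative assumption, with ∇ ln π(θ) = −θ):

```python
import numpy as np

# Leapfrog map T_eps of (6.9) for a two-dimensional standard normal
# target (illustrative choice), where grad ln pi(theta) = -theta.
def grad_log_pi(theta):
    return -theta

def leapfrog(theta, p, eps, n_steps):
    theta, p = theta.copy(), p.copy()
    for _ in range(n_steps):
        p = p + 0.5 * eps * grad_log_pi(theta)  # half step in momentum
        theta = theta + eps * p                 # full step in position
        p = p + 0.5 * eps * grad_log_pi(theta)  # half step in momentum
    return theta, p

theta0, p0 = np.array([-2.0, -4.0]), np.array([1.0, 0.0])
theta1, p1 = leapfrog(theta0, p0, eps=0.1, n_steps=8)
# reversibility: flip the momentum, integrate again, return to the start
theta2, p2 = leapfrog(theta1, -p1, eps=0.1, n_steps=8)
print(np.allclose(theta2, theta0), np.allclose(p2, -p0))  # True True
```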
Example 6.3 Logistic regression. In a toxicity study for some new com-
pound, various dose levels of the compound are administered to batches
of animals. Let i = 1, . . . , k index the batches, let xi index the dose for the
i-th batch, let ni denote the number of animals in the i-th batch, and yi the
number of animals in the i-th batch that show a response. We assume that
yi is a binomial random variable, yi ∼ Bi(ni , πi ), where πi is the probability
of response at dose xi . We assume πi = 1/(1 + e^{α+βxi} ). This is known as a
logistic regression model. The probabilities πi define the sampling model
(likelihood).
We complete the model with a prior h(α, β) = N(0, cI), where I is a
(2 × 2) identity matrix, and c = 100, i.e., a very vague prior. We observe
with gradient
∇ ln h(θ | y) = −(α, β)′/c − Σi yi (1 − πi ) (1, xi )′ + Σi ȳi πi (1, xi )′,
where ȳi = ni − yi .
Using (6.9) with target distribution π(θ) = h(θ | y), we implement M = 8
steps of size ε = 0.1. The transitions are deterministic, and we expect to
stay on a constant level of the augmented model (6.7). The algorithm re-
quires an initial value of p. Think of θ as the location of a ball on log pos-
terior contours, and p as the momentum that you give the ball by kicking it.
Hamiltonian mechanics and the approximation in (6.9) describes the tra-
jectory of the ball over M small time intervals of length . In this analogy
you have to think of the posterior distribution as a valley, with the posterior
mode being the lowest point. This is because ln h(θ | y) appears in (6.7)
with negative sign. Figure 6.2 shows a contour plot of the log posterior
“valley,” and three possible trajectories. All start at θ = (−2, −4), kicking
the ball with p = (1, 0), (1, −3), and (1, 1.4). On the four-dimensional con-
tours of the augmented model H(θ, p) the trajectories would move along
contour lines. Next we will introduce a random transition function, build-
ing on this deterministic transition of the leapfrog approximation, which we
will then use in Problem 6.12 to implement MCMC posterior simulation for
this example.
6.5 Hamiltonian Monte Carlo
In summary, the MCMC algorithm proceeds with the following two tran-
sition probabilities.
1. Generate pi ∼ N(0, 1), i = 1, . . . , d.
2. Generate (θ̃, p̃) = Tε (θ, p) with probability 1/2, and (θ̃, p̃) = T−ε (θ, p) with probability 1/2.
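Putting the pieces together, one complete transition (momentum refreshment, a leapfrog trajectory with random direction, and an accept/reject step based on H(θ, p) = −ln π(θ) + p′p/2) can be sketched in Python, again with a standard normal target as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(9)

def log_pi(theta):
    # Standard normal target (an assumption for illustration)
    return -0.5 * theta @ theta

def grad_log_pi(theta):
    return -theta

def hmc_step(theta, eps=0.1, n_steps=8):
    p = rng.normal(size=theta.shape)             # 1. p_i ~ N(0, 1)
    sign = 1.0 if rng.uniform() < 0.5 else -1.0  # 2. T_eps or T_-eps, w. pr. 1/2
    th, q = theta.copy(), p.copy()
    for _ in range(n_steps):                     # leapfrog with step sign*eps
        q = q + 0.5 * sign * eps * grad_log_pi(th)
        th = th + sign * eps * q
        q = q + 0.5 * sign * eps * grad_log_pi(th)
    # 3. accept/reject with H(theta, p) = -ln pi(theta) + p'p/2
    H0 = -log_pi(theta) + 0.5 * p @ p
    H1 = -log_pi(th) + 0.5 * q @ q
    return th if np.log(rng.uniform()) < H0 - H1 else theta

theta = np.zeros(2)
draws = np.empty((20_000, 2))
for m in range(draws.shape[0]):
    theta = hmc_step(theta)
    draws[m] = theta
print(draws.mean(axis=0), draws.var(axis=0))  # roughly (0, 0) and (1, 1)
```

Because the leapfrog map nearly conserves H, the acceptance probability exp{H0 − H1} stays close to 1 even for long trajectories.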
the first factor being the ratio of target distributions and the second fac-
tor the ratio of proposal distributions for the proposed and the reciprocal
moves. This and other simplifications are discussed by Welling and Teh
6.6 Implementation Details
θ^{(1)} , . . . , θ^{(m)} , where θ^{(j)} ≡ θ^{(ℓ+jk∗)} , and which shall be used to evaluate the
Multiple Chains
• Generate m chains with t∗ (usually, t∗ ≪ t) iterations each, starting
with m initial values, usually generated to be distinct and well separated
across the support of the target distribution.
• Use the final iterations θ^{(t∗)} of each chain to form a MC sample θ^{(1)} , . . . , θ^{(m)}
– the choice of t∗ , beyond some initial burn-in, and the choice of m are
subject to similar considerations as before.
Both approaches have their advantages and limitations, which explains
the great variety of choices that are found in the literature. 15 The first
scheme allows us to reduce the computational cost and can lead to a chain
closer to h(θ | x) than schemes using multiple chains, using the same total
number of iterations. The second scheme makes it easier to control conver-
gence to h(θ | x) by reducing dependence on the initial states and allows a
more direct exploration of the target support. It is more likely to detect if
the apparent convergence of a chain is merely an indication that it is trapped
around some local mode, far from the target distribution. If this is the case,
one can reparametrize the model (difficult in general) or one can change
the initial values. In summary, the scheme with multiple chains seems by
and large appropriate for an exploratory evaluation before running one long
chain for definitive inference.
Convergence diagnostics
A variety of methods have been proposed to diagnose convergence. Some
are already included as default in specific or more general software for
Bayesian inference. Others can be added by writing problem-specific code
for a given inference problem. Keeping with the scope of this text, we limit
the discussion to a brief review of some of the most widely used methods.
The most widely used tools to monitor convergence to the stationary
distribution are plots of simulated values across iterations, known as tra-
jectory (or trace) plots, and the analysis of such plots across different time
windows to inspect any changes in the mixing of the chain over the sup-
port of the target posterior distribution. Another tool is the superposition of
15. The issue is discussed at length by Geyer (1992) and in the following discussion, where many pros and cons of each method are presented.
Problems
Most of the following problems require programming to implement MCMC
simulation. See Appendix B for some suggestions on how to implement
such simulation in R. Later, in Chapter 9, we will discuss public domain
software to implement MCMC for most problems. However, it is useful to
implement MCMC and understand implementation details for at least some
more stylized problems. It is therefore recommended not to use MCMC
software, such as OpenBUGS, JAGS, or Stan in the following problems.
A large number of good case studies can be found in the manuals for
BUGS and R-INLA, including some substantially more complex infer-
ence problems than the following exercises. See www.openbugs.net/
Examples/ and also the problem set in Chapter 9.
Problems 117
6.1 Metropolis–Hastings. Verify that the transition function (6.4) does indeed
satisfy the detailed balance condition (6.1).
6.2 Data augmentation. The table below shows the observed frequencies y j re-
lated to the observed phenotype defined by the blood group of an individual,
for a sample of n = 435 individuals. Here j ∈ {1, 2, 3, 4} indexes the four
blood groups O, A, B, AB.
j   Blood group   Frequency yj   Probability pj
1   O             176            r²
2   A             182            p² + 2pr
3   B             60             q² + 2qr
4   AB            17             2pq
(The answer need not be perfect – any reasonable, practical, creative suggestion is fine. See Section 9.6 for more discussion of convergence diagnostics.)
6.5 We consider a hierarchical event rate model with a Poisson sampling model
yi ∼ Poi(λi ti ), and a prior model
λi ∼ Ga(α, β) and β ∼ Ga(c, d),
where (α, c, d) are fixed hyperparameters.16
a. Find the conditional posterior distributions:
1. h(λi | β, λ j , j , i, y).
2. h(β | λ1 , . . . , λn , y).
b. Using the data from Problem 4.4, implement a Gibbs sampler to simulate
from the posterior in the above model.
c. Alternatively, marginalize with respect to λi to find the marginal likeli-
hood f (yi | β) and h(β | y).
6.6 Assume that x, y are jointly distributed random variables with support (0, B),
with conditional p.d.f.
f (x | y) ∝ e^{−xy} , 0 < x < B,   (6.12)
f (y | x) ∝ e^{−xy} , 0 < y < B.   (6.13)
a. Propose a Gibbs sampler to generate an MC sample from f (x, y) (this is
easy – there are no tricks).
b. Now consider B = ∞, that is, the support for x and y is the entire posi-
tive real line. Show that there is no (proper) joint distribution f (x, y) with
conditionals (6.12) and (6.13).
c. Justify the statement “Under the setup of (b) we can not apply the Gibbs
sampler.”
6.7 Probit regression. Consider a probit regression model for a binary response
yi on covariates xi , i = 1, . . . , n. Let Φ(·) denote the standard normal c.d.f.
P(yi = 1 | xi ) = Φ(xi′ β),   (6.14)
where β = (β0 , β1 , β2 ) and xi = (1, xi1 , xi2 ). We observe the following data.17
These are historical data – do you recognize them?
16. Compare with Problem 4.4 for a variation of the same problem.
17. The (identical) data are available as oring.dta on the book homepage sites.google.com/view/computationalbayes/home
[Figure 6.3 Galaxy data: histogram of the data points over velocity (10–35), with a kernel density estimate overlaid.]
6.10 This problem describes a real inference problem that arises in bioinformatics
research. The probability model is more complex than other examples, but
does not introduce any serious difficulties. 18
Likelihood (sampling model). Let xi , i = 1, . . . , n, denote a set of multinomial
random variables, with outcomes xi ∈ {1, . . . , N}. Let y = (y1 , . . . , yN ) denote
a contingency table summarizing the multinomial experiment, i.e., y j is the
frequency of outcome j. Let πj = P(xi = j) denote the unknown probability
of observing outcome j, Σj πj = 1.0. We write
y ∼ M_{N−1} (n; π1 , . . . , πN ).
Prior. Due to the nature of the experiment we believe a priori that some of
the π j are much larger than others. We refer to this subset of outcomes with
much larger probability as the “prevalent” outcomes A1 , and to the set of not
so likely outcomes as “rare” outcomes A0 . We will formally define A0 and
A1 in the following.
Parametrization. To define a prior probability model for θ = (π1 , . . . , πN ) it
is convenient to introduce latent indicators z j , and a change of variables for
18. The problem is a stylized form of the analysis for SAGE data discussed by Morris et al (2003).
the π j :
πj = π∗ qj if zj = 1,   and   πj = (1 − π∗ ) rj if zj = 0,   (6.18)
with Σj qj = 1 and Σj rj = 1 (let qj = 0 if zj = 0, and rj = 0 if zj = 1). In
words, the latent variable z j is an indicator for outcome j being a prevalent
outcome, i.e., A1 = { j : z j = 1} and A0 = { j : z j = 0}. The probability
of observing some prevalent outcome is π∗ , with π∗ close to 1.0; q j is the
probability of outcome j given that we observe a prevalent outcome, and r j
is the probability of j given a rare outcome. For later reference we define
M1 = #A1 and M0 = N − M1
as the number of prevalent and rare outcomes, respectively.
Prior probability model h(z, π∗ , q, r). We assume
P(z j = 1) = ρ, (6.19)
a beta prior on the total probability of prevalent outcomes:
π∗ ∼ Be(a∗ , b∗ ), (6.20)
and a Dirichlet prior for the partitioning of π∗ into the cell probabilities π j ,
j ∈ A1 . Let q̃h , h = 1, . . . , M1 , denote the nonzero weights qj :
(q̃1 , . . . , q̃_{M1} ) ∼ D_{M1 −1} (a1 , . . . , a1 ).   (6.21)
The use of equal Dirichlet parameters a1 reflects the symmetric nature of
prior beliefs about (q1 , . . . , q M1 ). Similarly for r j , j ∈ A0 :
(r̃1 , . . . , r̃_{M0} ) ∼ D_{M0 −1} (a0 , . . . , a0 ).   (6.22)
y = Xβ + ε.   (6.23)
We complete the model with a prior. Let τ2 = 1/σ2 denote the residual
precision:
h(τ² ) = Ga(ν0 /2, ν0 σ0² /2) and h(β) = N(β0 , Σ0 ),   (6.24)
independently. That is, we assume an informative prior for β (for example,
this could be based on historical data from a related earlier study).
Here, X ∼ Ga(a, b) denotes a gamma distribution with mean E(X) = a/b.
a. State the joint posterior h(β , τ2 | y) (up to a normalizing constant is okay).
No need to simplify.
b. Find the conditional distribution h(β | τ² , y). Use notation β̂ for the least
squares solution. Recognize the distribution as a well-known parametric
family.
c. Find h(τ² | β, y). You can use notation RSS(β) = (y − Xβ)′ (y − Xβ).
Recognize the distribution as a well-known parametric family.
d. Propose a Gibbs sampling posterior MCMC scheme based on the condi-
tionals from (b) to generate a Monte Carlo sample,
(β^{(m)} , τ^{2(m)} ) ∼ h(β, τ² | y), m = 1, . . . , M.
6.12 Hamiltonian Monte Carlo. Refer to Example 6.3 in Section 6.5.1.
a. Posterior. Plot the posterior distribution h(α, β | y). Use a contour plot
(or heatplot, or any other format that allows you to add the points for the
simulations from part (b)).
b. Hamiltonian Monte Carlo. Let θ = (α, β). Propose a Hamiltonian Monte
Carlo algorithm to simulate a Monte Carlo sample:
θm ∼ h(θ | y),
m = 1, . . . , M (the θm need not be independent, as in any MCMC). Use
M = 100.
1. Describe the algorithm by stating all transition probabilities.
2. Implement the algorithm.
3. Plot θm by adding them to the plot from part (a).
c. Metropolis–Hastings. Implement a Metropolis–Hastings MCMC to gen-
erate θm ∼ h(θ | y), m = 1, . . . , M.
1. Describe the algorithm by stating all transition probabilities.
2. Plot θm by adding them to the plot from part (a).
with
A(θi , θ̃) = [ h(θ̃) / h(θi ) ] · [ I{d(z̃, y) < ε} / 1 ] · [ q(θi | θ̃) / q(θ̃ | θi ) ],
where d(z, y) is some distance measure for two (hypothetical or actually ob-
served) data sets z and y, for example, d(z, y) = ||z − y||2 .
Find the stationary distribution π(θ) for this Markov chain.
Hint: Introduce a variable zi , initializing with z0 ∼ f (z0 | θ0 ), and then mod-
ify (6.25) to a proposal distribution for an augmented state vector (θi , zi ),
find the stationary distribution π(θ, z) for the Markov chain with this aug-
mented state vector. Finally, argue that the desired π(θ) is the marginal under
π(θ, z).21
6.14 Thall et al (2003) consider inference for a phase I oncology trial for the
combination of two cytotoxic agents. Here, we focus on estimating the dose–
response surface π(x; θ) = P(y = 1 | x = (x1 , x2 ), θ) for the probability of
toxicity (y = 1) as a function of the doses of the two agents (x1 and x2 ). We
assume standardized doses, 0 ≤ x j ≤ 1, and let yi ∈ {0, 1} denote an indicator
of the i-th patient recording toxicity after being treated with combination
therapy with doses xi1 and xi2 of the two agents. The response surface is
indexed by unknown parameters θ = (a1 , b1 , a2 , b2 , a3 , b3 ),
π(x, θ) = [ a1 x1^{b1} + a2 x2^{b2} + a3 (x1^{b1} x2^{b2} )^{b3} ] / [ 1 + a1 x1^{b1} + a2 x2^{b2} + a3 (x1^{b1} x2^{b2} )^{b3} ].   (6.26)
The model is chosen to allow easy incorporation of information about single-
agent toxicities. For x2 = 0 the model reduces to the single-agent dose–
toxicity curve π((x1 , 0), θ) ≡ π1 (x1 , θ) and similarly for π2 . The parameters
(a3 , b3 ) characterize the two-agent interactions.
21. This is one of the variations of ABC. See Marin et al (2012) for a review.
under this chain. Show that P(n) → pθ,y in total variation norm. That is,
lim_{n→∞} ||P^{(n)} (x0 , ·) − p_{θ,y} || = 0
in total variation norm, for all x0 . You may assume that q(θm | θm−1 , ym−1 ) >
0 for all θm ∈ Θ in the parameter space Θ (and all θm−1 , ym−1 ); and simi-
larly f (y | θ) > 0 for all y ∈ X in the sample space X and θ ∈ Θ.
Use at most three sentences.
c. Can you suggest an alternative way of generating an MC sample xn ∼
pθ,y (θ, y)? 23
Hint: There is a very easy answer.
23. Geweke (2004) uses ergodic Monte Carlo averages under (a) and under (b) to construct an interesting diagnostic.
7 Model Selection and Trans-dimensional MCMC
Earlier (Sections 4.2.2 and 4.2.3) we discussed how the evaluation of the
marginal likelihood function (or prior predictive distribution) for model se-
lection can be implemented by means of importance sampling. Since the
1990s, various methods have been proposed in the literature to implement
such schemes.1 This works for problems with moderate dimensional pa-
rameter vectors, but breaks down in higher dimensional problems. More
complex problems usually require the use of Markov chain Monte Carlo
(MCMC) methods to generate posterior Monte Carlo samples (or, to be
more precise, approximate posterior samples), as discussed in previous
chapters. Using such posterior Monte Carlo samples, one can then approx-
imately evaluate the marginal likelihood function. It is convenient to dis-
tinguish two cases. The first case is when the simulated values are gener-
ated from the parameter space for each model separately. The other case
is when the simulated values are generated from the model space, that is,
model indicators (separately or jointly with the parameter space). A popu-
lar algorithm to implement the latter is reversible jump MCMC (RJ).
In the following discussion, M j (or, for short, j, when the context clarifies the use) will denote competing models, and θ j will denote parameters that index the sampling distribution under model M j . We will use the notation f (y | θ j , M j ), or for short f j (y | θ j ), for the sampling model under model M j ; h(θ j | M j ), or h j (θ j ), for the prior under model M j ; p(x | M j ), or p j (x), for the marginal under model M j ; and h(M j ), or just h( j), for the prior model probability.
where b j and a j are fixed values of small and large magnitude, respectively. Assuming conditional independence, the implied prior distribution for β = (β j , j = 1, . . . , p) in the first level of the hierarchical model is a multivariate normal, β | υ ∼ N p (0, Bυ ), with Bυ = diag[ (1 − υ j ) b j^2 + υ j a j^2 b j^2 ] being a diagonal matrix.2
At the second level of the hierarchy, the SSVS model assumes an inverse
2 George and McCulloch (1993, 1997) consider more general multivariate normal prior distributions, including correlations in Bυ and dependence on a sampling variance σ^2 .
MC3 Method
Generating a Markov chain with a stationary distribution given by the ker-
nel h∗ (υ | y) can be done by an M-H algorithm with some proposal dis-
tribution4 q(υ̃ | υ(t) ) for any υ(t) . Using the special case of a Metropolis
algorithm with symmetric proposal q, the acceptance ratio for a proposal υ̃
generated from q simplifies to R(υ(t) , υ̃) = h∗ (υ̃ | y)/h∗ (υ(t) | y). A simple
example of this class of Metropolis algorithms uses a proposal distribution
that is only non-zero for υ̃ that differs from υ(t) in only one component,
q(υ̃ | υ(t) ) = 1/p, where p is the dimension of the least parsimonious model
corresponding to υ = 1 p with all 1s. This algorithm was proposed for mod-
els with discrete data by Madigan and York (1995) under the name MC3 ,
(Markov chain Monte Carlo model composition). Raftery et al (1997) used
MC3 Bayesian model averaging in multiple regression with a conjugate
(normal-gamma) prior on (β, σ2 ).
4 Under suitable conditions, this algorithm can be implemented very computationally efficiently, similar to what happens in a Gibbs sampler (George and McCulloch, 1997).
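The mechanics of this one-component-flip Metropolis scheme can be sketched as follows. To keep the check exact, the sketch substitutes an assumed target with independent inclusion probabilities for the actual marginal posterior h*(υ | y); the target, dimensions, and values are all illustrative (v below stands for υ):

```python
import math, random

random.seed(3)
p = 4
q_incl = [0.9, 0.1, 0.7, 0.5]     # assumed marginal inclusion probabilities

def log_target(v):                 # stands in for log h*(v | y)
    return sum(math.log(q_incl[j] if v[j] else 1.0 - q_incl[j])
               for j in range(p))

v = [0] * p
counts = [0] * p
iters = 40000
for _ in range(iters):
    j = random.randrange(p)        # pick one coordinate: proposal prob 1/p
    v_new = v[:]
    v_new[j] = 1 - v_new[j]        # flip it
    # symmetric proposal: accept with min{1, h*(v_new | y) / h*(v | y)}
    if math.log(random.random()) < log_target(v_new) - log_target(v):
        v = v_new
    for k in range(p):
        counts[k] += v[k]

freq = [c / iters for c in counts]
print([round(f, 2) for f in freq])
```

Because the assumed target factorizes, the empirical inclusion frequencies should approach q_incl, which makes the correctness of the Metropolis mechanics easy to verify.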
This is because the Dirichlet distribution with total mass pa < 1 piles up
probability mass in the corners of the simplex, leading to effective variable
selection.
as
f (y | M j ) = Eθ [ f (y | M j , θ) | M j ] = Eθ j [ f (y | M j , θ j ) | M j ],
from which one can see that for the purpose of model comparison via Bayes
factors the choice of the pseudo-priors is irrelevant.
The pseudo-prior method makes use of the complete conditional distributions for the J + 1 blocks
h(M | θ, y) = h*(M, θ | y) / Σ_{k∈J} h*(Mk , θ | y), ∀M ∈ M, (7.6)
h(θ j | θ− j , M, y) = { h(θ j | M j , y) if M = M j ; h(θ j | Mk ) if M = Mk , k ≠ j }, ∀ j ∈ J. (7.7)
The simulation based on these distributions is more efficient if, for each
model M j , one has conjugacy of the likelihood and the respective prior
h(θ j | M j ), and if one chooses pseudo-priors h(θk | M j ), k , j, that are
similar to h(θk | Mk , y). Note that under conjugacy for model M j one can
exactly evaluate the posterior distribution for the respective parameter and
the marginal likelihood f (y | M j ) and, on the basis of the latter, evaluate the
posterior probability of M j , which allows a separate and easy evaluation of
the posterior quantities.
Once a Monte Carlo sample {(M (t) , θ(t) )} from h(M, θ | y) is available,
it can be used to evaluate posterior probabilities for each model h(M j | y)
(as relative sample frequencies of M = M j ) and, therefore, Bayes factors
between any pair of models. Inference related to h(θ j | M j , y) is obtained
from the subsample of θ(t) consisting of the components θ(t) j associated with
model M j .
For details about the implementation, see Carlin and Louis (2009: section 4.5.1). Dellaportas et al (2002) propose a modification of the pseudo-prior method for the special case of variable selection.
2. Accept the proposal (β̃, α̃) with probability Aup = min{ρup , 1} and ratio
ρup (β, u, β̃, α̃) = [h( M̃) / h(M1 )] × [h(β̃, α̃ | M̃) / h(β | M1 )] × [ f (y | α̃, β̃) / f (y | β)] × [q21 / (q12 q(u))] × |∂T (β, u)/∂(β, u)|,
that is, prior ratio × likelihood ratio × proposal ratio × Jacobian.
Note how we had to carefully record all moves in steps 1 through 3 and
account for the corresponding proposal probabilities in (7.8).
Nothing much changes if the “down” move also involves an auxiliary
v ∈ Rq . It only complicates notation.
Problems
As in Chapter 6, most of the problems require some programming to im-
plement MCMC simulation. See Appendix B for some suggestions on how
to implement such simulation in R.
7.1 Normal linear regression – variable selection. Use the setup and notation of
Problem 6.11.
a. Let β̄ = E(β | τ^2 , y) and V = Var(β | τ^2 , y) denote the posterior mean and covariance matrix for β conditional on τ^2 . Let RSS(β) = Σi (yi − xi'β)^2 denote the residual sum of squares for β. Prove the following result:
f (y | τ^2 ) ∝ |V|^{1/2} h(β̄) τ^{−n} exp( −RSS(β̄)/(2τ^2 ) ), (7.9)
as a function of τ^2 . That is, the proportionality constant is a function of y.
Hint: You could use Bayes' theorem for h(β | y, τ^2 ) and substitute β = β̄.
b. Now consider variable selection. Let γ = (γ1 , . . . , γ p ) denote a vector of indicators γ j ∈ {0, 1}, let pγ = Σ j γ j , and let Xγ denote the (n × pγ ) submatrix of X with the columns selected by γ j = 1. Similarly, let βγ denote the subvector of β selected by γ, and similarly for β0γ and Σ0,γ .
We change model (6.23) to include variable selection by using
y = Xγ βγ + , (7.10)
with = (1 , . . . , n ), and a modified prior h(βγ | γ) = N(β0,γ , Σ0,γ ) and
hyperprior h(γ j = 1) = π, j = 1, . . . , p, independently across j.
Use (7.9) to find h(γ | τ2 , y).
c. Propose a posterior MCMC scheme to generate a posterior Monte Carlo
sample (γ (m) , τ2(m) ) ∼ h(γ , τ2 | y).
How would you augment the Monte Carlo sample with β(m) ∼ h(β |
γ (m) , τ2(m) , y), if desired?
7.2 Variable selection. Ročková and George (2014) propose the following im-
plementation of SSVS for a Gaussian linear model:
f (y | β, σ) = N(Xβ, σ2 I),
where y = (y1 , . . . , yn ) is a response vector, and X is an (n × p) matrix of
covariates.
We use the following SSVS prior for the regression coefficients β. Let
cov(β) = Dσ,γ = σ^2 diag(a1 , . . . , a p ) with ai = (1 − γi )ν0 + γi ν1
denote the prior covariance matrix for β. Here, γ j ∈ {0, 1}, ν0 is small such that β j ≈ √ν0 practically drops the jth covariate; and ν1 is large.
We now complete the model with a prior on J. Let Poi+ (λ) denote a Poisson
distribution restricted to positive integers. We assume
J ∼ Poi+ (λ).
Use λ = 5.
a. Joint posterior. Let µ = (µ1 , . . . , µ J ), w = (w1 , . . . , w J ), and x = (x1 , . . . , xn ),
and let θ = (µ, w, σ2 , J) denote the complete parameter vector. Find the
joint posterior distribution h(θ | x).
b. Hierarchical model. Rewrite the mixture as a hierarchical model using
indicators si ∈ {1, . . . , J}.
f (xi | si = j) = N(µ j , σ2 ) and P(si = j | w, J) = w j . (7.13)
Let s−i = (s1 , . . . , si−1 , si+1 , . . . , sn ).
Find h(µ j | s, w, x), h(si | s−i , µ, w, J, xi ), and h(w | s, J, x).
c. Reversible jump MCMC. Propose a RJ MCMC for posterior simulation
across variable J.
1. Split move. Propose a transition probability for incrementing J, i.e.,
J˜ = J + 1.
Describe the construction of a proposal (step by step), and state the
acceptance probability for the proposal.
Hint: You could: (1) select a term j to split; assume without loss of
generality j = J (rearrange indexes if needed, to simplify notation);
(2) generate a bivariate auxiliary variable (u, v); (3) define T (θ, u, v) by
µ̃ J = µ J + u, µ̃ J+1 = µ J − u, w̃ J = w J v and w̃ J+1 = w J (1 − v).
2. Merge move. Propose a transition probability for decrementing J, i.e.,
J˜ = J − 1 (“merge move”). Describe the construction of a proposal
(step by step) and state the acceptance probability for the proposal.
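For the split move suggested in the hint, the Jacobian term can be worked out in closed form: differentiating µ̃J = µJ + u, µ̃J+1 = µJ − u, w̃J = wJ v, w̃J+1 = wJ (1 − v) gives |∂T/∂(µJ, wJ, u, v)| = 2 wJ. The following Python sketch (with arbitrary state values) verifies this numerically by finite differences:

```python
def T(x):
    """Split-move transformation from the hint; x = (mu_J, w_J, u, v)."""
    mu, w, u, v = x
    return [mu + u, mu - u, w * v, w * (1.0 - v)]

def det4(m):
    """4x4 determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    d = 1.0
    for i in range(4):
        piv = max(range(i, 4), key=lambda r: abs(m[r][i]))
        if piv != i:
            m[i], m[piv] = m[piv], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, 4):
            f = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= f * m[i][c]
    return d

x0 = [0.7, 0.3, 0.2, 0.6]           # (mu_J, w_J, u, v), arbitrary values
h = 1e-6
J = [[0.0] * 4 for _ in range(4)]   # finite-difference Jacobian matrix
for j in range(4):
    xp = x0[:]; xp[j] += h
    xm = x0[:]; xm[j] -= h
    fp, fm = T(xp), T(xm)
    for i in range(4):
        J[i][j] = (fp[i] - fm[i]) / (2.0 * h)

jac = abs(det4(J))
print(round(jac, 4))                # analytic value: 2 * w_J = 0.6
```

Here the two 2×2 blocks contribute determinants −2 and −wJ, so |∂T| = 2 wJ, matching the numerical value.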
d. Implementation. Implement an RJ using the complete conditional distri-
butions from part (b) and the RJ move from (c).
Use the data set galaxy in R. You can get it, for example, in the R pack-
age ElemStatLearn. The data is shown in Figure 6.3.
• Plot J against iteration (trajectory of imputed J).
• Estimate g(x) = E[gθ (x) | x] for x on a grid x.
• Evaluate convergence diagnostics for σ^2 , g10 = gθ (10), g20 = gθ (20), and G30 = ∫_{30}^{∞} gθ (s) ds.5
7.4 Pseudo-prior. The following example is about inference in a clinical trial us-
ing a historical control. The trial is a study of a proposed treatment for uterine
papillary serous carcinoma, a very rare disease. The study drug is adminis-
tered in three doses, including 0 (control), 1, and 2. To facilitate a study in a
5 Use the R package boa or coda (or any other) for convergence diagnostics. See Section 9.6.
realistic time frame we consider pooling with historical data on control. Let
yi denote the outcome (PFS, progression free survival, in months) for the ith
patient, numbering all patients, including patients from the historical study
as i = 1, . . . , n, with n = 56, including n0 = 40 patients in the historical data
and n1 = 16 in the new study. Let Ii ∈ {0, 1} denote an indicator for a patient
to be in the historical study and zi ∈ {0, 1, 2} for the dose of the study drug.
All patients in the historical data are on control, i.e., zi = 0, i = 1, . . . , n0 .
We use a Weibull regression with (or without in part (b)) a study effect. That
is, a sampling model
yi ∼ Weib(λ, a), with ln λ = β0 + β1 zi + β2 Ii .
a. Set up a prior β j ∼ N(m j , s2j ), j = 0, 1, 2, with m2 = 0, s0 = 0.1 and
s1 = s2 = 0.5. Determine suitable values for m0 and m1 to match expert
opinion that PFS at doses z = 0, 1, 2 should be around 7, 11, and 14
months (you might not be able to exactly match these means). The data
are in the file uterine.txt which is linked on the book’s homepage.6
In addition to zi , yi , and Ii , the file reports an indicator si for observed
event time (si = 1) versus censoring time (si = 0). Carry out posterior
inference.
b. Now we change the prior on β2 , the study effect. Let δ x denote a point mass at x. We assume
β2 ∼ δ0 if M = 0, and β2 ∼ N(m2 , s2^2 ) if M = 1,
split move (and simplify the answer to the following question), we label the
current state vector ω̃ and the proposal ω:
w1 = w̃1 + w̃2 (7.15)
w1 µ1 = w̃1 µ̃1 + w̃2 µ̃2
w1 (κ1d + µ1d^2 ) = w̃1 (κ̃1d + µ̃1d^2 ) + w̃2 (κ̃2d + µ̃2d^2 ),
use indicators for Fn and Fn+1 to replace the range of integration. (4) Use a
change of variables to make both sides into integrals over (θn , u). (5) Finally,
argue that equality of the integrands is a sufficient condition for equality of
the integrals, and verify that the RJ acceptance probability satisfies the latter
condition.
8 Methods Based on Analytic Approximations
Since the 1980s, investigators have aimed to find efficient, and preferably
simple, approaches to overcome the technical problems of calculus that
arise in Bayesian inference. A variety of strategies have been proposed, in-
cluding in particular multivariate normal approximations of posterior dis-
tributions, the Laplace approach, numerical quadrature methods, classical
Monte Carlo methods, and Markov chain Monte Carlo (MCMC) methods.
Advances in data collection give rise to the need for modeling ever more
complex data structures. This includes spatio-temporal models, dynamic
linear models, generalized linear mixed models, generalized additive mod-
els, log-Gaussian Cox processes, geo-additive models, and more. All these
models are part of a much larger class of models known as latent Gaus-
sian models (LGMs). See, for example, Blangiardo and Cameletti (2015).
Theoretically, it is always possible to implement MCMC algorithms for
such LGMs. However, such implementations come with many problems in
terms of convergence and computation time.
Recently, Rue et al (2009) developed an analytical approach based on integrated nested Laplace approximations (INLA), which allows deterministic approximation of marginal posterior distributions in these models.
The method provides a particularly efficient implementation of Bayesian
inference in LGMs. The INLA approach has two major advantages over
MCMC techniques. The first is computation time. Using INLA one can get
results within seconds or minutes in models that would take hours or even
days for inference using MCMC algorithms. The second advantage is that
INLA treats LGMs in a unified way, allowing substantial automation of
inference, independent of the specific model.
To better understand the methods proposed by Rue et al (2009), this
chapter starts with a review of the main techniques for analytical approx-
imation that were developed to implement Bayesian inference, and then
presents the INLA method and its implementation in an associated R pack-
age.
and
mn = (an − 1)/(an + bn − 2), −{Ln''(mn )}^{−1} = (an − 1)(bn − 1)/(an + bn − 2)^3 . (8.4)
It is easy to verify that the regularity conditions are satisfied, and there-
fore the posterior distribution can be well approximated, for large n, by a
normal distribution, N(mn , σ2n ), with σ2n = −{Ln00 (mn )}−1 .
In the case of a uniform prior (a0 = b0 = 1), we have
mn = x/n, σn^2 = (x/n)(1 − x/n)/n,
that is, the posterior distribution for θ given X = x can be approximated by
a normal distribution, N(x/n, (x/n)(1 − x/n)/n). Note the duality of this re-
sult with what is obtained – by the central limit theorem – about the asymp-
totic distribution of X/n given θ. In fact, we know that for X ∼ Bi(n, θ) and
large n the distribution of X/n given θ is well approximated by a normal
distribution N(θ, θ(1 − θ)/n).
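The approximation can be checked numerically; the following Python sketch (with illustrative counts n and x) confirms that the uniform-prior case reduces to N(x/n, (x/n)(1 − x/n)/n):

```python
a0, b0 = 1, 1                        # uniform prior
n, x = 200, 140                      # illustrative binomial data
an, bn = a0 + x, b0 + n - x          # Beta(an, bn) posterior

m_n = (an - 1) / (an + bn - 2)       # posterior mode, as in (8.4)
var_n = (an - 1) * (bn - 1) / (an + bn - 2) ** 3

theta_hat = x / n                    # MLE x/n
print(m_n, round(var_n, 6))
```

With a0 = b0 = 1 the mode equals θ̂ = x/n and the curvature-based variance equals θ̂(1 − θ̂)/n, matching the central-limit-theorem duality noted above.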
If approximate inference on the log odds ρ is desired, one could proceed with several approaches. Start by verifying that the exact posterior distribution for ρ can be obtained by a simple transformation. In fact, note
hn (ρ | x) ∝ e^{an ρ} (1 + e^ρ )^{−(an +bn )} , ρ ∈ R. (8.5)
The posterior mean of ρ evaluated from (8.5) is ψ(an ) − ψ(bn ), where the function ψ(x) is the derivative of the logarithm of Γ(x).2 We find
ln hn (ρ | x) ∝ an ρ − (an + bn ) ln(1 + e^ρ ),
Ln'(ρ) = an − (an + bn ) e^ρ /(1 + e^ρ ), Ln''(ρ) = −(an + bn ) e^ρ /(1 + e^ρ )^2 , (8.6)
and thus
mn = ln(an /bn ) and −{Ln''(mn )}^{−1} = 1/an + 1/bn . (8.7)
In summary, the posterior distribution for ρ can be approximated by a normal distribution, N( ln(an /bn ), 1/an + 1/bn ). In the case of a vague prior (a0 = b0 = 0), we get:
mn = ln[ θ̂/(1 − θ̂) ], σn^2 = 1/( n θ̂(1 − θ̂) ), with θ̂ = x/n.
That is, the posterior for ρ given X = x can be approximated by a normal distribution N(mn , σn^2 ). Note again the duality of this result with the one obtained, by the central limit theorem, for the asymptotic distribution of ln[(X/n)/(1 − X/n)] given θ. In fact, we know that for X ∼ Bi(n, θ), the distribution of ln[(X/n)/(1 − X/n)] given θ is well approximated, for large n, by an N( ln[θ/(1 − θ)], 1/(nθ(1 − θ)) ) distribution.
2 See Gradshteyn and Ryzhik (2007: 943) for integral representations and series expansions of this function.
where
Î = √(2π) n^{−1/2} σ̂ f (θ̂) exp(−nψ(θ̂))
and σ̂ = [ψ''(θ̂)]^{−1/2} .
• In the k-dimensional case a similar argument leads to Î of the form
Î = (2π)^{k/2} n^{−k/2} det(Σ̂)^{1/2} f (θ̂) exp(−nψ(θ̂)),
where Σ̂^{−1} = ∇^2 ψ(θ̂) is the Hessian matrix of ψ at θ̂.
Of course, higher-order expansions of f and ψ would obtain better approximations, for example:
∫ f (θ) e^{−nψ(θ)} dθ = √(2π/n) σ e^{−nψ̂} { f̂ + (1/(2n)) [ σ^2 f̂'' − σ^4 f̂' ψ̂''' + (5/12) f̂ (ψ̂''')^2 σ^6 − (1/4) f̂ ψ̂^{(4)} σ^4 ] } + O(n^{−2}), (8.11)
where f̂, ψ̂, etc., are the respective functions evaluated at θ̂, which is, as already mentioned, a value where the maximum of −ψ(θ) is achieved, and σ^2 = [ψ''(θ̂)]^{−1} .
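The first-order Laplace formula can be illustrated on an integral with a known value, Γ(n + 1) = ∫ t^n e^{−t} dt: substituting t = nθ gives ∫ t^n e^{−t} dt = n^{n+1} ∫ e^{−n(θ − ln θ)} dθ, that is, f = 1 and ψ(θ) = θ − ln θ, with θ̂ = 1 and σ̂ = [ψ''(1)]^{−1/2} = 1, and the Laplace approximation recovers Stirling's formula. A Python sketch (the wrapper function is an illustrative name):

```python
import math

def laplace_log_factorial(n):
    """ln n! via the first-order Laplace approximation (Stirling)."""
    theta_hat, sigma_hat = 1.0, 1.0
    psi_hat = theta_hat - math.log(theta_hat)      # psi at the mode: = 1
    # log of n^{n+1} * sqrt(2 pi / n) * sigma_hat * exp(-n psi_hat)
    return ((n + 1) * math.log(n)
            + 0.5 * math.log(2.0 * math.pi / n)
            + math.log(sigma_hat) - n * psi_hat)

exact = math.lgamma(21)                            # ln 20!
approx = laplace_log_factorial(20)
rel_err = abs(approx - exact) / exact
print(round(rel_err, 5))
```

Already at n = 20 the relative error of ln n! is of order 10^−4, consistent with the O(n^−1) error of the first-order approximation.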
Assume, then, that one wants to evaluate the posterior expected value of
some function g(θ) of the parameters. By (8.8) we see that E[g(θ) | x] can
be obtained as a ratio of two integrals, that is,
E[g(θ) | x] = ∫ g(θ) f (x | θ) h(θ) dθ / ∫ f (x | θ) h(θ) dθ. (8.12)
The basic idea is to apply separate Laplace approximations to the numer-
ator and denominator integrals and consider the ratio of these approxima-
tions. Tierney and Kadane (1986) obtain the following two corresponding
approximations for E[g(θ) | x].
• Use the Laplace approximation for exp(−nψ(θ)) = f (x | θ)h(θ), in
numerator and denominator, with f (θ) = g(θ) in the numerator and
f (θ) = 1 in the denominator of (8.12) to get
E[g(θ) | x] = g(θ̂)[1 + O(n−1 )]. (8.13)
Note that this corresponds to approximating E[g(θ) | x] by the mode
g(θ̂), where θ̂ is the posterior mode, since θ̂ was defined as the point of
maximum of −ψ(θ).
Assuming that ψ(·), ψ*(·) are sufficiently regular functions, the Laplace approximations for the numerator and denominator integrals in (8.14) (using in both cases f (θ) = 1) are, respectively,
√(2π) σ* n^{−1/2} exp{−nψ*(θ*)} and √(2π) σ̂ n^{−1/2} exp{−nψ(θ̂)}.
From there we get the following approximation for E[g(θ) | x]:
E[g(θ) | x] ≈ (σ*/σ̂) exp{−n[ψ*(θ*) − ψ(θ̂)]}. (8.17)
The approximation errors for the two integrals are of order n^{−1}. However, the relevant terms of the two errors are identical and therefore cancel out in the ratio. Therefore the final approximation has a relative error of order n^{−2}.
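The ratio approximation (8.17) can be tested on a case with a known answer: the posterior mean of θ under a Beta(an, bn) posterior, whose exact value is an/(an + bn). The sketch below works directly with L = −nψ = log(prior × likelihood) up to a constant and L* = L + log g with g(θ) = θ, so the factors of n cancel in the ratio; the crude numerical optimizer and differentiator are illustrative choices:

```python
import math

an, bn = 15.0, 7.0                   # illustrative posterior parameters

def L(t):  return (an - 1.0) * math.log(t) + (bn - 1.0) * math.log(1.0 - t)
def Ls(t): return an * math.log(t) + (bn - 1.0) * math.log(1.0 - t)

def argmax(f):                       # ternary search for the mode on (0, 1)
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2): lo = m1
        else: hi = m2
    return (lo + hi) / 2.0

def d2(f, t, h=1e-5):                # numerical second derivative
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h ** 2

t_hat, t_star = argmax(L), argmax(Ls)
# (8.17) as a ratio of the two Laplace approximations; n cancels throughout
tk = math.sqrt(d2(L, t_hat) / d2(Ls, t_star)) * math.exp(Ls(t_star) - L(t_hat))
exact = an / (an + bn)
print(round(tk, 4), round(exact, 4))
```

Even with the modest "sample size" implied by an + bn = 22, the Tierney–Kadane estimate agrees with the exact mean to about three decimal places, illustrating the O(n^−2) relative error.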
To arrive at the previous approximation, we imposed a quite restrictive condition, namely that g be positive almost everywhere. One can proceed in various ways for approximations with a real-valued function g in more general problems. Tierney et al (1989) suggest starting with an approximation of the moment-generating function of g(θ), E[exp{s g(θ)}], using the Laplace approximation for positive functions, and then obtaining an approximation of the expected value of g(θ) as the derivative of the logarithm of the moment-generating function at s = 0. The error of this approximation is of order O(n^{−2}).
Example 8.2 Getting back to the first example, if one wishes to estimate g(θ) = ρ = ln[θ/(1 − θ)], one could use a Laplace approximation. Since g(θ) can take negative values, we either use the first approximation, which corresponds to estimating the mean by the mode, obtaining
E(ρ | x) = ln[(an − 1)/(bn − 1)],
or we use one of the alternative approaches proposed by Tierney et al (1989).
Using the approach in (8.18) with ψN = ψD = ψ, that is, −nψ = ln h(θ) + ln f (x | θ); fN (θ) = g(θ); fD (θ) = 1, we get
E(ρ | x) = ln[(an − 1)/(bn − 1)] + (1/(2n)) (an − bn )/[(an − 1)(bn − 1)]
− (1/n^2 ) [(an − 1)(bn − 1)/(an + bn − 2)] [(an − 1)^2 − (bn − 1)^2 ]. (8.19)
Compare this result, for specific values of an and bn , with the exact value.
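Such a comparison is a few lines of code. The Python sketch below evaluates the exact mean ψ(an) − ψ(bn), with the digamma function ψ computed by central differences of math.lgamma (an illustrative shortcut), against the mode approximation, for one illustrative pair (an, bn):

```python
import math

def digamma(z, h=1e-5):
    """psi(z) = d/dz ln Gamma(z), via central differences of math.lgamma."""
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2.0 * h)

an, bn = 30.0, 12.0                               # illustrative values
exact = digamma(an) - digamma(bn)                 # exact mean of rho, from (8.5)
mode_approx = math.log((an - 1.0) / (bn - 1.0))   # mean estimated by the mode
print(round(exact, 4), round(mode_approx, 4))
```

For these values the mode approximation overshoots the exact mean by a few hundredths, which is the size of error the higher-order correction terms in (8.19) are meant to remove.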
Tanner (1996) suggests obtaining an approximation of the mean of ρ based on the transformation
λ = (1/2) ln[ bn θ / (an (1 − θ)) ].
It is easily seen that the distribution of λ is the distribution of Fisher's z with probability density function
h(λ) ∝ e^{2an λ} / (2bn + 2an e^{2λ} )^{an +bn } ,
with approximate mean
(1/2) ln[ (1 − (2an )^{−1}) / (1 − (2bn )^{−1}) ].
3 For example, if one were to use ψN = ψD ; fN (θ) = g(θ); fD (θ) = 1, then (8.18) would reduce to (8.13) only. See Tierney et al (1989) or Robert (1994) for more details.
Using similar arguments, one can further show that (8.17) remains valid when θ ∈ R^k , with
σ̂ = |∇^2 ψ(θ̂)|^{−1/2} and σ* = |∇^2 ψ*(θ*)|^{−1/2} ,
where
[∇^2 ψ(θ)]_{i j} = ∂^2 ψ(θ)/∂θi ∂θ j and [∇^2 ψ*(θ)]_{i j} = ∂^2 ψ*(θ)/∂θi ∂θ j .
For more discussion of this method, see the references cited in the book by Paulino et al (2018). Despite being a powerful technique, the approach has some limitations. In the multivariate case the application can become difficult and impractical, in particular when the integrands are multimodal and/or the derivatives are difficult to obtain. In many problems, even with moderate-dimensional parameter vectors, it is convenient to consider reparametrizations for better approximations.
where ψ−(k) denotes the vector ψ without the component ψk . The evaluation of these distributions requires first estimates of h(ψ | x) and h(θi | ψ, x), to then obtain
h̃(θi | x) = ∫ h̃(ψ | x) h̃(θi | ψ, x) dψ, (8.26)
h̃(ψk | x) = ∫ h̃(ψ | x) dψ−(k) . (8.27)
Similar simplifications apply for updating q(τy ), q(τα ), q(τβ ), q(µα ), and q(µβ ). In all cases it turns out that the expectation under q−x in the equivalent of (8.34) is easy. This is no coincidence. Because the complete conditional distributions are exponential families, ln h(x | ω−x , y) always reduces to an expression of the type t(x)η x (ω−x ), and the expectation of η x (·) is usually easy since q−x includes independence across all other parameters.
In Problem 8.6 we consider a finite mixture. VB for mixture models has been generalized to infinite mixtures of normals by several recent papers, including those by Blei and Jordan (2006) and Lin (2013), the latter also introducing a sequential scheme to accommodate big data.
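As a concrete instance of such closed-form coordinate updates, the following Python sketch runs CAVI for a conjugate toy model (normal data with unknown mean and precision, mean-field q(µ) q(τ)), using the standard variational updates for this model; the data and hyperparameters are illustrative:

```python
import random

random.seed(4)
x = [random.gauss(2.0, 1.0) for _ in range(200)]   # toy data
n = len(x)
xbar = sum(x) / n
sum_x2 = sum(xi * xi for xi in x)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0             # illustrative hyperparameters

E_tau = 1.0                                        # initial guess for E_q[tau]
for _ in range(50):                                # CAVI iterations
    # update q(mu) = N(m, s2): expectation of the log joint in mu
    m = (lam0 * mu0 + n * xbar) / (lam0 + n)
    s2 = 1.0 / ((lam0 + n) * E_tau)
    # update q(tau) = Gamma(aN, bN): expectation of the log joint in tau
    aN = a0 + (n + 1) / 2.0
    E_mu2 = s2 + m * m                             # E_q[mu^2]
    exp_quad = (sum_x2 - 2.0 * m * n * xbar + n * E_mu2
                + lam0 * (E_mu2 - 2.0 * m * mu0 + mu0 * mu0))
    bN = b0 + 0.5 * exp_quad
    E_tau = aN / bN

print(round(m, 2), round(E_tau, 2))
```

Each update is the exponential-family expectation described above: the expectation of the quadratic statistic under the other factor is all that is needed, and the iteration converges in a handful of passes.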
Problems
8.1 Normal approximation. Under a particular genetic model, animals of a given
species should be one of four specific phenotypes with probabilities p1 =
(2 + θ)/4, p2 = (1 − θ)/4, p3 = (1 − θ)/4 and p4 = θ/4, respectively. Let
y1 , . . . , y4 denote the number of animals of each phenotype, and N = y1 +. . .+
y4 . Assuming multinomial sampling and a Be(a, b) prior on θ the posterior
and hyperpriors
eφ
0
and φk j .
Hint: See the discussion in Blei et al (2017).
c. Now include γ and π in the parameter vector. Extend D by defining q(γ) =
Ga(c, d) and q(π) = D(e1 , . . . , eK ). Find the updating equations for c, d,
and e.
8.7 In R, load the package MASS which provides the data set galaxies. Scale the data as x = c(scale(galaxies)). Assume a mixture of K = 5 normal models, f (xi | θ) = Σ_{k=1}^{K} πk N(µk , σ^2 ), with θ = (µ, π, σ^2 ). Complete the model with priors as in the previous problem, using τ = a = b = c = d = e = 1. Use variational inference to generate an approximate Monte Carlo posterior sample, Θ = {θm , m = 1, . . . , M} with θm ∼ h(θ | x), approximately.
a. At each iteration, evaluate the ELBO. Plot it against the iteration.
b. Use Θ to approximate the posterior predictive distribution
f̂ (x) = (1/M) Σ_m Σ_k πk^m N(µk^m , (σm )^2 ).
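The Monte Carlo average defining f̂(x) is straightforward to compute; the Python sketch below uses a tiny synthetic set Θ with K = 2 components in place of actual variational draws (all values illustrative) and checks that the estimate is a proper density:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Theta: M draws of (pi, mu, sigma) for K = 2 components (illustrative values)
Theta = [([0.6, 0.4], [-1.0, 1.5], 0.5),
         ([0.5, 0.5], [-1.1, 1.4], 0.6),
         ([0.7, 0.3], [-0.9, 1.6], 0.5)]

def f_hat(x):
    """Monte Carlo average of the mixture density over the draws in Theta."""
    M = len(Theta)
    return sum(pi_k * normal_pdf(x, mu_k, sd)
               for (pi, mu, sd) in Theta
               for pi_k, mu_k in zip(pi, mu)) / M

# the estimate is a proper density: check it integrates to ~1 on a grid
grid = [i * 0.01 - 10.0 for i in range(2001)]
total = sum(f_hat(x) * 0.01 for x in grid)
print(round(total, 3))
```

With real output from the variational fit, the same averaging over {θm} produces the predictive density estimate requested in part (b).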
9 Software
To simplify the commands we will always assume that all files that are
needed for the program are saved in the working directory of the R session
(use setwd() if needed). The steps to analyze the model in R2OpenBUGS
are then the following:
1. Write the code for the model and save it in a file with extension .txt.
In this case, we saved it in the file Cexemplo1BUGS.txt.
model{
for(i in 1:147){
X[i]~dnorm(mu[i],tau)
mu[i]<-beta0+beta[1]*z1[i]+beta[2]*z2[i]
+beta[3]*z3[i]+beta[4]*z[i]+a[ID[i]]+b[i]
b[i]~dnorm(0,tau_b)
}
for(j in 1:49){
a[j]~dnorm(0,tau_a)
}
for(k in 1:4){
beta[k]~dnorm(0,0.0001)
}
beta0~dnorm(0,0.0001)
tau~dgamma(0.05,0.05)
tau_a~dgamma(0.05,0.05)
tau_b~dgamma(0.05,0.05)
sigma<-1/sqrt(tau)
sigma_a<-1/sqrt(tau_a)
sigma_b<-1/sqrt(tau_b)
}
The first for loop corresponds to the statement of the probability model
given in item 1 of Section 9.1. In BUGS the second parameter (tau) of
the normal distribution is the precision (inverse variance). The loop also
includes the definition of the linear predictor mu and the statement of the
model for the random effects b that appear in items 2. and 3. of the same
section. The number of patients is n = 49, but since all were observed at
three distinct time points, the total number of observations is 147.
The second for loop is the model for the random effects a. Those effects
are specific to each patient, who are identified by the variable ID in the first
loop.
The third for loop defines the prior distribution for the fixed effect parameters. After this, the prior distributions for the remaining model parameters are given. All prior distributions reflect vague prior information. Finally, to allow monitoring of standard deviations, those are defined as functions of the respective precisions. Note that the sequence of these commands is completely arbitrary.
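Since BUGS' dnorm(mu, tau) is parameterized by the precision τ = 1/σ², the model above can also be simulated forward directly. The Python sketch below mirrors the three-level structure with illustrative parameter values (it is a forward simulation for intuition, not a fitter, and all numeric values are assumptions):

```python
import math, random

random.seed(5)

def rnorm_prec(mu, tau):
    """Normal draw parameterized by precision, as in BUGS dnorm(mu, tau)."""
    return random.gauss(mu, 1.0 / math.sqrt(tau))

tau, tau_a, tau_b = 0.08, 0.016, 0.1            # illustrative precisions
beta0, beta = 17.0, [4.5, 0.15, 4.2, -0.1]      # illustrative fixed effects
n_pat, n_obs = 49, 147
ID = [i % n_pat for i in range(n_obs)]           # 3 observations per patient
a = [rnorm_prec(0.0, tau_a) for _ in range(n_pat)]   # patient-level effects
z = [[random.random() for _ in range(4)] for _ in range(n_obs)]  # covariates

X = []
for i in range(n_obs):
    b_i = rnorm_prec(0.0, tau_b)                 # observation-level effect
    mu_i = (beta0 + sum(bk * z[i][k] for k, bk in enumerate(beta))
            + a[ID[i]] + b_i)
    X.append(rnorm_prec(mu_i, tau))

sigma = 1.0 / math.sqrt(tau)                     # sd implied by precision tau
print(len(X), round(sigma, 2))
```

The last line is exactly the sigma<-1/sqrt(tau) conversion in the BUGS code: a precision of 0.08 corresponds to a residual standard deviation of about 3.5.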
To verify the file that defines the model to be used by OpenBUGS, one
can use the command
file.show("Cexemplo1BUGS.txt")
> head(round(Cexemplo1,4))
If one wishes to monitor mu, or any other parameter that was defined in the
model specification, for example a, or b, those must appear in the above
list.
5. Define the initial values of parameters and hyperparameters in the
model. In this case we need to define initial values for beta0, beta, tau,
tau_a, tau_b, a, b. Those are defined in a list that needs to include initial
values for each chain.
Inits<-list(tau=1,tau_a=1,tau_b=1,beta=c(0,0,0,0), b=c(0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),a=c(0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0),beta0=0)
Using more than one chain, one needs to create a list of the same type for each chain, with different initial values, arranged as objects with names, for example, “Inits1” and “Inits2”, and then define a list of lists using
Inits<-list(Inits1,Inits2,...)
6. The program OpenBUGS can then be run through R using the func-
tion bugs(), after having verified that the R2OpenBUGS package is loaded
in the current R session.
library(R2OpenBUGS)
Cexemplo1_openBUGS.fit<- bugs(data=Cexemplo1.data, inits=list(Inits),
parameters.to.save=Cexemplo1.params,
"Cexemplo1BUGS.txt", n.chains=1, n.iter=40000,
n.burnin=20000, debug=FALSE,save.history=FALSE,DIC=TRUE)
which generates in this case the output (the final column for the 97.5%
quantile is cut to fit the page)
mean sd 2.5% 25% 50% 75%
beta0 17.4290 1.9208 13.9200 16.1000 17.2800 18.6600
beta[1] 4.2329 2.8568 -2.6160 2.5750 4.4480 6.0600
beta[2] 0.1422 0.1391 -0.1303 0.0474 0.1448 0.2352
beta[3] 4.3456 0.9455 2.3960 3.7600 4.3430 4.9100
beta[4] -0.1029 0.0366 -0.1726 -0.1281 -0.1046 -0.0798
tau 1.5607 3.6155 0.0636 0.0847 0.1521 1.1012
tau_a 0.0162 0.0038 0.0097 0.0135 0.0159 0.0185
tau_b 2.2989 5.4749 0.0638 0.0870 0.1604 1.4452
sigma_a 8.0226 0.9547 6.4140 7.3560 7.9320 8.5990
sigma_b 2.1790 1.2991 0.2279 0.8318 2.4965 3.3910
sigma 2.2725 1.2624 0.2750 0.9528 2.5640 3.4360
deviance 583.7390 241.1451 37.2990 403.3750 695.3000 782.2000
Note that only parameters that were declared in the object Cexemplo1.params
appear. To carry out residual analysis, for example, it is important to mon-
itor mu.
If the random effects b are removed (note that the file Cexemplo1BUGS.txt
with the model definition and the initial values need to be changed accord-
ingly) we get
> exemplo1_OpenBUGS1.fit<- bugs(data=Cexemplo1.data,
inits=list(Inits1), parameters.to.save=Cexemplo1.params1,
"Cexemplo1BUGS_semb.txt", n.chains=1,
n.iter=40000,n.burnin=20000, debug=FALSE,save.history=FALSE,
DIC=TRUE)
> exemplo1_OpenBUGS1.fit$summary
mean sd 2.5% 25% 50% 75% 97.5%
beta0 17.210 1.763 13.760 16.040 17.190 18.360 20.790
beta[1] 4.781 2.393 0.063 3.186 4.731 6.417 9.495
beta[2] 0.152 0.145 -0.117 0.052 0.149 0.249 0.450
beta[3] 4.146 0.923 2.384 3.527 4.121 4.751 6.053
beta[4] -0.107 0.036 -0.177 -0.131 -0.107 -0.083 -0.036
tau 0.077 0.011 0.057 0.070 0.077 0.085 0.101
tau_a 0.016 0.004 0.010 0.014 0.016 0.019 0.025
sigma_a 7.974 0.936 6.363 7.319 7.897 8.544 10.020
sigma 3.625 0.265 3.150 3.439 3.608 3.794 4.190
deviance 795.126 12.624 772.600 786.200 794.300 803.200 822.000
> exemplo1_OpenBUGS1.fit$DIC
[1] 843.2
> exemplo1_OpenBUGS1.fit$pD
[1] 48.03
> A1<-exemplo1_OpenBUGS1.fit$sims.matrix
> dim(A1)
[1] 20000 10
> head(A1)
beta0 beta[1] beta[2] beta[3] beta[4] tau
[1,] 20.62 -0.8067 0.1487 4.631 -0.1005 0.0838
[2,] 18.39 2.8580 0.0692 5.767 -0.0569 0.0826
[3,] 19.56 2.5330 0.3619 3.045 -0.1277 0.0784
> Cexemplo1_OpenBUGS1$DIC
[1] 843.1
> Cexemplo1_OpenBUGS1$pD
[1] 48.12
9.3 JAGS
In 2003 Martyn Plummer, of the International Agency for Research on Cancer, created JAGS (Just Another Gibbs Sampler) as a clone of BUGS written in C++, which fixed certain limitations of BUGS. Models written in BUGS syntax can be used in JAGS practically without any change. This has the advantage that the same code can be run on both platforms. There are two parts to a model definition in JAGS: the model description (model{}), as in BUGS, and the definition of the data (data{}). The latter can be used, for example, to define transformations of the data, define summary statistics, simulate data sets, etc. The latest version of JAGS was launched in July 2017 (JAGS 4.3.0). To understand how JAGS works, it is important to read the manual (Plummer, 2012). The R package R2jags (https://cran.r-project.org/web/packages/R2jags/R2jags.pdf) serves as an interface between R and JAGS. One of the main advantages of JAGS compared to OpenBUGS is the faster run time.
library(R2jags)
2. Reading in the data, the definition of the model variables, the decla-
ration of parameters that are to be monitored, and the definition of initial
values is done exactly as before.
Cexemplo1.data <- list("X","ID","z3","z1","z2","z")
Cexemplo1.params1 <- c("beta0","beta","tau","tau_a",
"sigma_a","sigma")
Inits1<-list("tau"=1,"tau_a"=1,"beta"=c(0,0,0,0),
"a"=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
"beta0"=0)
3. Before using R2jags for the first time, we need to set a random variate
seed, for example, by the R command
set.seed(123)
> print(exemplo1_JAGS.fit)
Inference for Bugs model at "C:/Users...model1f5469ec52a3.txt",
fit using jags,
1 chains, each with 40000 iterations (first 20000 discarded),
n.thin = 20, n.sims = 1000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
beta[1] 4.751 2.537 -0.248 3.038 4.783 6.507 9.450
beta[2] 0.160 0.143 -0.121 0.069 0.159 0.249 0.447
beta[3] 4.163 0.915 2.467 3.552 4.132 4.777 6.081
beta[4] -0.107 0.036 -0.177 -0.132 -0.108 -0.081 -0.031
beta0 17.212 1.810 13.561 16.050 17.151 18.387 20.717
sigma 3.676 0.808 3.133 3.441 3.604 3.789 4.267
sigma_a 7.920 1.159 6.289 7.335 7.921 8.529 9.778
tau 0.077 0.013 0.055 0.070 0.077 0.084 0.102
tau_a 0.699 13.406 0.010 0.014 0.016 0.019 0.025
deviance 797.267 28.396 773.207 786.080 794.391 802.998 823.233
> print(exemplo1_JAGS.fit2)
Inference for Bugs model at
"C:/Users.../model1f5477c03ee1.txt",
fit using jags,
2 chains, each with 40000 iterations (first 20000 discarded),
n.thin = 20
n.sims = 2000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
beta[1] 4.859 2.596 -0.346 3.166 4.939 6.604 9.950
beta[2] 0.157 0.139 -0.113 0.068 0.151 0.242 0.453
beta[3] 4.191 0.918 2.422 3.595 4.173 4.805 6.059
beta[4] -0.105 0.036 -0.175 -0.130 -0.106 -0.081 -0.035
beta0 17.218 1.816 13.857 15.996 17.207 18.435 20.716
sigma 3.659 0.715 3.164 3.436 3.608 3.798 4.235
sigma_a 7.943 1.049 6.363 7.317 7.886 8.587 9.873
tau 0.077 0.012 0.056 0.069 0.077 0.085 0.100
Note that there are now two extra columns, “Rhat” and “n.eff,” whose
meanings are explained in the output.
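Rhat is the Gelman–Rubin potential scale reduction factor, discussed further in Section 9.6. Its basic (unsplit) version is easy to compute by hand; a minimal R sketch follows (R2jags reports a refined variant, so values need not match exactly):

```r
# chains: matrix with one column per chain, one row per iteration
rhat <- function(chains) {
  n <- nrow(chains)
  B <- n * var(colMeans(chains))       # between-chain variance
  W <- mean(apply(chains, 2, var))     # mean within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)  # potential scale reduction factor
}
```

For chains sampling the same distribution, rhat() is close to 1; values well above 1 indicate that the chains have not yet mixed.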
A plot of simulated parameter values can be obtained by
traceplot(exemplo1_JAGS.fit2)
print(exemplo1_JAGS.fit2.upd)
Inference for Bugs model at
"C:/Users.../model1f5477c03ee1.txt",
fit using jags,
2 chains, each with 1000 iterations (first 0 discarded)
n.sims = 2000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
beta[1] 4.681 2.562 -0.227 2.972 4.702 6.345 9.832
beta[2] 0.156 0.143 -0.117 0.059 0.151 0.253 0.440
beta[3] 4.179 0.924 2.410 3.575 4.181 4.771 5.951
beta[4] -0.106 0.037 -0.178 -0.132 -0.106 -0.081 -0.036
beta0 17.288 1.777 13.735 16.117 17.333 18.443 20.676
sigma 3.631 0.265 3.139 3.444 3.611 3.801 4.210
sigma_a 8.007 0.955 6.314 7.363 7.936 8.582 10.080
tau 0.077 0.011 0.056 0.069 0.077 0.084 0.102
Note the decrease of the values for pD and DIC. Another function that can
be used for the same purpose, and which requires only one chain, is the
function update().
In Section 9.6 we will discuss how to evaluate convergence diagnostics
for this example, including also for posterior simulation using other pack-
ages.
9.4 Stan
In 2010, Andrew Gelman, Columbia University, New York, and collabora-
tors were working on a Bayesian analysis of multi-level generalized linear
models, as described by Gelman and Hill (2006). The implementation of
these models in WinBUGS or JAGS turned out to be extremely challenging
due to the complex model structure. For example, Matt Schofield found
that posterior simulation for a multi-level time series model, used in a
climate reconstruction based on tree growth-ring measurements, did not
converge even after hundreds of thousands of iterations (Schofield et al, 2016). To
resolve the problem, Gelman and collaborators developed a new Bayesian
software, which they called Stan, in honor of Stanislaw Ulam, one of the
creators of Monte Carlo methods. The first version was made available
for users in August 2012. Stan does not use Gibbs samplers, but instead
uses Hamiltonian Monte Carlo (Neal, 2011). With the no-U-turn sampler
algorithm that they developed (Hoffman and Gelman, 2014), all parame-
ters are simulated in one block. Using this strategy, convergence problems
were substantially mitigated. In contrast to the declarative BUGS language,
Stan is an imperative language.
Stan allows the use of all basic operators from C++, in addition to a large
number of other special functions, including in particular Bessel functions,
gamma and digamma functions, and a variety of inverse link functions
that are used in generalized linear models. A complete list of the
basic operators and special functions that are implemented in Stan is given
by Carpenter et al (2017). The list of implemented probability distributions,
which also appears in the same reference, is extensive. This allows great
flexibility in model construction.
A bit of history of the development of Stan, as well as implementation
details, can be found in Stan Modeling Language: User’s Guide and Ref-
erence Manual, which corresponds to version 2.6.2. An interface between
R and Stan is implemented in RStan (Stan Development Team, 2014).
}
transformed parameters {
real<lower=0> sigma_a;
real<lower=0> sigma;
sigma_a = sqrt(1/tau_a);
sigma = sqrt(1/tau);
}
model {
vector[J] mu;
for(i in 1:J){
mu[i] = beta0+beta[1]*z1[i]+beta[2]*z2[i]
+beta[3]*z3[i]+a[ID[i]]+beta[4]*z[i];
}
beta0 ~ normal(0,100);
beta ~ normal(0,100);
tau ~ gamma(0.05,0.05);
tau_a ~ gamma(0.05,0.05);
a ~ normal(0, sigma_a);
X ~ normal(mu, sigma);
}
generated quantities {
vector[J] log_lik;
vector[J] m;
for(i in 1:J) {
m[i] = beta0+beta[1]*z1[i]+beta[2]*z2[i]
+beta[3]*z3[i]+a[ID[i]]+beta[4]*z[i];
log_lik[i] = normal_log(X[i] | m[i], sigma);
}
}
2. The data are introduced as a list, as before, or if the data are read from
a file it suffices to create an object with the variable names, as follows:
Cexemplo1<-read.table("Cexemplo1.txt",header=T)
attach(Cexemplo1)
J<-nrow(Cexemplo1) #J=147
N<-length(unique(ID)) #N=49
# object to be used by Stan
Cexemplo1_data<-c("N","J","X","ID","z3","z2","z1","z")
3. Next we can call the function stan() from the package rstan to
simulate from the posterior distribution:
library(rstan)
exemplo1.fit_stan <- stan(file="exemplo1.stan.txt",
data=Cexemplo1_data, iter=40000, chains=2)
Call ?stan to see the arguments of the function stan. Particularly use-
ful are the arguments sample_file and diagnostic_file that allow
indicating the names of files where simulated samples of all model pa-
rameters and convergence diagnostics can be saved. If these names are not
given, these elements are not saved. They can, however, be extracted later.
rstan provides information about the execution time for each chain in
the following form (here only for one chain):
COMPILING THE C++ CODE FOR MODEL ’example1’ NOW.
The execution time for 40,000 iterations is much longer than the ex-
ecution time in OpenBUGS or JAGS, which for this particular problem
is practically instantaneous. However, the number of iterations needed to
achieve convergence is lower using Stan. Above we used the same number
of iterations, but this was not necessary. In fact, if the number of iterations
is not specified in iter=, its default of 2000 iterations already yields
convergence in this example.
4. To get summary statistics for the marginal posterior distributions for
the parameters of interest, use the command
print(exemplo1.fit_stan,
pars=c("beta0","beta","sigma","sigma_a","tau","tau_a","lp__"))
to get the following (a final column with the 97.5% quantiles is cut to fit
the page):
Inference for Stan model: example1.
2 chains, each with iter=40000; warmup=20000; thin=1;
post-warmup draws per chain=20000, total post-warmup draws=40000.
sigma_a 23821 1
tau 21546 1
tau_a 25151 1
lp__ 11599 1
Here, se_mean is the Monte Carlo standard error of the mean. If no param-
eters are specified then the summary statistics are given for the marginal
posterior distributions of all unknown quantities, including in particular
also a, log_lik, m.
The quantity lp__, which appears as the last element in pars= and whose
summary statistics are found in the last row of the output, is the (un-
normalized) log posterior density, which is calculated by Stan for the
implementation of Hamiltonian Monte Carlo. It can also be used for model
assessment (see, for example, Vehtari and Ojanen, 2012).
The print() function provides summary statistics based on all chains
that are generated, whereas summary() provides summary statistics for
each chain separately.
5. To obtain the simulated parameter values, use the function extract().
With the argument permuted=TRUE, a list of all simulated values for all
parameters is created (if only simulated values for some parameters are de-
sired, those need to be given in the argument pars=); using permuted=FALSE,
an array is created with the first dimension corresponding to iterations, the
second dimension corresponding to chains, and the third dimension corre-
sponding to the parameters. See, for example, the following:
#####using permuted=TRUE##############
samples_stan<-extract(exemplo1.fit_stan,
pars=c("beta0", "beta", "sigma", "sigma_a"),
permuted = TRUE, inc_warmup = FALSE, include = TRUE)
> class(samples_stan)
[1] "list"
> names(samples_stan)
[1] "beta0" "beta" "sigma" "sigma_a"
> length(samples_stan$beta0)
[1] 40000 #20000 each chain
> head(samples_stan$beta0)
[1] 16.18767 18.64417 20.43510 16.69809 14.35278 15.39996
> dim(samples_stan$beta)
[1] 40000 4
#####using permuted=FALSE##############
, , parameters = beta0
chains
iterations chain:1 chain:2
[1,] 16.29099 17.81893
[2,] 16.68243 17.31063
[3,] 16.49383 17.31063
[4,] 16.20388 16.70740
, , parameters = beta[1]
chains
iterations chain:1 chain:2
[1,] 6.530125 5.718621
[2,] 4.949012 6.479835
[3,] 6.000288 6.479835
[4,] 6.204705 7.421142
, , parameters = beta[2]
chains
iterations chain:1 chain:2
[1,] 0.1718956 0.07575568
[2,] 0.1657402 0.18286167
[3,] 0.1793824 0.18286167
[4,] 0.1347633 0.15160846
> library(loo)
> log_lik1 <- extract_log_lik(exemplo1.fit_stan)
> waic(log_lik1)
Computed from 40000 by 147 log-likelihood matrix
Estimate SE
elpd_waic -422.4 9.9
p_waic 40.5 4.7
waic 844.8 19.8
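The WAIC reported by loo is computed from the S × n matrix of pointwise log-likelihood values. The defining formulas can be sketched directly in R (this mirrors the definitions, not the loo internals, which also report standard errors):

```r
# log_lik: S x n matrix, S posterior draws, n observations
waic_sketch <- function(log_lik) {
  lppd   <- sum(log(colMeans(exp(log_lik)))) # log pointwise predictive density
  p_waic <- sum(apply(log_lik, 2, var))      # effective number of parameters
  c(elpd_waic = lppd - p_waic,
    p_waic    = p_waic,
    waic      = -2 * (lppd - p_waic))
}
```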
9.5 BayesX
BayesX is a program that was developed in the Department of Statistics
at Ludwig Maximilian Universität München by Andreas Brezger, Thomas
Kneib, and Stefan Lang, with the first versions appearing in 2002. The
software is written specifically for structured additive regression models
(Brezger et al, 2005). This family of models (STAR), which was introduced
in Chapter 8, includes various well-known and widely used regression
models, such as generalized additive models (GAM), generalized additive
2. Using this code, we get the following output of summary statistics for
the marginal distributions of the parameters:
> summary(exemplo1_BayesX)
### Chain_1
Call:
bayesx(formula = formula, data = data, weights = weights,
subset = subset, offset = offset, na.action = na.action,
contrasts = contrasts, control = control, model = model,
chains = NULL, cores = NULL)
Parametric coefficients:
Mean Sd 2.5% 50% 97.5%
(Intercept) 17.2313 1.8036 13.6702 17.2744 20.7688
z2 0.1557 0.1371 -0.1174 0.1566 0.4250
z3 4.2146 0.9665 2.3040 4.2043 6.2029
z1 4.7691 2.4910 -0.1460 4.7646 9.6957
z -0.1043 0.0366 -0.1756 -0.1055 -0.0319
Scale estimate:
Mean Sd 2.5% 50% 97.5%
Sigma2 13.2389 1.9332 9.9936 13.0804 17.373
Parametric coefficients:
Mean Sd 2.5% 50% 97.5%
(Intercept) 17.1458 1.7125 13.9773 17.0971 20.3820
z2 0.1591 0.1438 -0.1282 0.1612 0.4407
z3 4.1544 0.9413 2.3008 4.1405 6.0312
z1 4.9990 2.5100 -0.2337 5.0116 9.6973
z -0.1025 0.0351 -0.1751 -0.1016 -0.0367
Scale estimate:
Mean Sd 2.5% 50% 97.5%
Sigma2 13.2323 1.9739 9.9179 13.0191 17.518
When the argument control gives a file name for the output, folders are
automatically created in the indicated directory, one for each chain, con-
taining various files with the simulated parameter values (with file exten-
sion .raw) and text files with the summary statistics. In the current exam-
ple, two folders are created with the names Chain_1_bayes2x.estim
and Chain_2_bayes2x.estim. All these data files can later be used for
graphical summaries, to evaluate diagnostics, etc.
3. The command
getscript(exemplo1_BayesX)
generates an R script with commands to read the saved sampler output and
reproduce numerical and graphical summaries.
4. To extract the simulated values of the fixed effect parameters, apply
the function samples() to the output object:
> AA<-samples(exemplo1_BayesX,term="linear-samples")
> length(AA[[1]])
[1] 10000
> length(AA[[2]])
[1] 10000
plot(AA)
The object AA is a list that contains in AA[[1]] a sample from the poste-
rior distribution of the fixed effect parameters from chain 1, and similarly in
AA[[2]] for chain 2. Since we save 20,000 iterations and thin out to every
tenth, there are 2000 simulated values for each of the five parameters. The
order in which the simulated values appear is the same as that in which
they were introduced in the formula. Thus AA[[1]][1:2000] contains the
posterior simulated values for beta0; AA[[1]][2001:4000] contains those
for beta1, the coefficient of the covariate z2; and so on.
5. The command plot(AA) plots the superimposed series of simulated
values for each parameter and corresponding marginal posterior density
estimates.
6. To get the posterior simulated values of the variances (in this case, the
variance of a normal distribution), and the corresponding graphical sum-
maries, write
> Va<-samples(exemplo1_BayesX,term = "var-samples")
> length(Va[[1]])
[1] 2000
> plot(Va)
The information from this LaTeX document can also be obtained with
the command
bayesx_logfile(exemplo1_BayesX)
Geweke’s Diagnostic
Let θt, t = 1, . . . , N, be a sequence of states generated by an MCMC
simulation, and let g be a function of the parameter. Compute the average
ga = (1/na) Σ g(θt) using the first na iterations, as well as the average
gb = (1/nb) Σ g(θt) using the last nb iterations. If the chain is stationary,
then the mean of the first part of the chain should be similar to the mean
of the latter part of the chain. Letting N → ∞ with fixed na/N and nb/N,
one can show that
(ga − gb) / sqrt(s2a/na + s2b/nb) → N(0, 1),
where s2a and s2b are independent estimates of the asymptotic variances of
ga and gb, adjusted for the autocorrelation of the time series. Using this
statistic, we can now assess whether or not there is evidence for convergence.
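The statistic is simple to compute from a single chain; a minimal R sketch using iid variance estimates follows (CODA instead estimates s2a and s2b from the spectral density, so its Z-scores will differ for autocorrelated chains):

```r
# x: vector of MCMC draws of g(theta); defaults match CODA's window sizes
geweke_z <- function(x, first = 0.1, last = 0.5) {
  n  <- length(x)
  xa <- x[1:floor(first * n)]            # early segment
  xb <- x[(n - floor(last * n) + 1):n]   # late segment
  (mean(xa) - mean(xb)) /
    sqrt(var(xa) / length(xa) + var(xb) / length(xb))
}
```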
discarding the first 10% of iterations. Repeat the process while the null
hypothesis is rejected.
4. If we continue to reject the null hypothesis as the remaining number of
iterations reaches 50% of the initial N iterations, then the Markov chain
simulation has to continue, as the chain has not yet reached equilibrium.
In that case CODA reports the test statistic and indicates that the chain
has failed the test of stationarity.
5. Otherwise, the part of the chain that passed the stationarity test is used to
estimate the mean (m) and asymptotic standard error (s) of the average
and a half-width test is applied to that part of the chain as follows. If
1.96s < εm, for small ε (CODA uses defaults of α = 0.05 and ε = 0.1),
then the chain passes the overall test. Otherwise, the condition 1.96s ≥ εm
means that the Markov chain simulation needs to be continued.
Both packages, CODA and BOA, available from CRAN, provide the
functions codamenu() and boa.menu(), respectively, which allow using
the programs through a menu interface, for casual users with limited
knowledge of R. For example, in CODA:
> library(coda)
> codamenu()
CODA startup menu
1: Read BUGS output files
2: Use an mcmc object
3: Quit
Selection: 2
1: Output Analysis
2: Diagnostics
3: List/Change Options
4: Quit
After reading in the data, the user is presented with a list of analysis op-
tions, summary statistics, graphical representations, and convergence diag-
nostics.
The BOA menu is quite similar. See details of the boa.menu() in Smith
(2007).
Alternatively to the menu mode, it is possible to carry out an analysis
using R commands. To have an interface between CODA and R such that
functions can be used from the R command line, a new R class mcmc was
created. BOA accepts the simulated values as a vector or matrix as input
parameter, and can save them as an mcmc object.
The latest version of the manuals for CODA and BOA are available from
https://cran.r-project.org/web/packages/coda/coda.pdf
https://cran.r-project.org/web/packages/boa/boa.pdf.
> A1<-Cexemplo1_OpenBUGS.fit$sims.matrix
> dim(A1)
[1] 40000 10
> head(round(A1,4))
beta0 beta[1] beta[2] beta[3] beta[4] tau tau_a
[1,] 15.73 6.349 0.0098 3.843 -0.1046 0.0819 0.0192
[2,] 18.55 2.689 0.0214 4.742 -0.1315 0.0953 0.0195
[3,] 16.41 6.330 0.2284 4.585 -0.0643 0.0664 0.0218
[4,] 14.18 5.653 -0.1744 4.911 -0.1551 0.0793 0.0127
[5,] 19.86 2.291 0.0826 4.259 -0.1209 0.0875 0.0180
[6,] 19.00 1.449 -0.0214 5.277 -0.0964 0.0778 0.0190
sigma_a sigma deviance
[1,] 7.207 3.495 797.3
[2,] 7.168 3.240 792.8
[3,] 6.779 3.882 800.5
[4,] 8.883 3.552 781.2
[5,] 7.452 3.381 793.0
[6,] 7.247 3.585 783.9
> class(A1)
[1] "matrix"
We see that A1 has 40,000 rows (20,000 iterations for each of two chains)
and ten columns with the names of the monitored parameters. Note that the
states for the first chain are in lines 1 through 20,000, and for the second
chain in rows 20,001 through 40,000.
Thus, to use CODA with two chains, we need to define two objects of
type mcmc, one for each chain, and then combine them into a single object
using the function as.mcmc.list, as is shown in the following example.
> library(coda)
> A1_1chain<-as.mcmc(A1[1:20000,])
> A1_2chain<-as.mcmc(A1[20001:40000,])
> A1_2_mcmc<-as.mcmc.list(list(A1_1chain,A1_2chain))
We can plot superimposed trajectories for the two chains, posterior den-
sities (illustrated in Figure 9.3 for the parameters β1 and β2 ), autocorre-
lations, trajectories of quantiles, and the correlation matrix between the
components. The plots are obtained using the following commands:
plot(A1_2_mcmc)
plot(A1_2_mcmc[,2:3]) #only for cols that appear in the figure
autocorr.plot(A1_2_mcmc) #autocorrelation
cumuplot(A1_2_mcmc) #evaluate quantiles (0.025,0.5,0.975)
crosscorr.plot(A1_2_mcmc)#plot correlation matrix
The convergence diagnostics from Section 9.6.1 are easily evaluated with
the respective CODA functions.
Multivariate psrf
1
[Figure: gelman.plot output — the shrink factor (median and 97.5% quantile) for beta[1] and beta[2], approaching 1 with increasing iterations.]
2. Geweke’s Diagnostic
> geweke.diag(A1_2_mcmc)
[[1]]
[[2]]
The output of this function is the Z-scores for the tests of equal means
between the earlier and later part of the series for each variable. Since the
values for the first chain are all within the interval (−1.96, 1.96), we do
not reject the null hypothesis of equal means for all monitored parameters.
However, the same is not true for β0 and β1 in the second chain. Keep
in mind that any parameter failing the test is evidence for the chain as a
whole not having reached stationarity. This is the case since convergence
is a property of the chain, not of individual parameters.
3. Raftery and Lewis’ Diagnostic
This method was designed for short pilot sequences of Markov chains.
Thus, we will apply it for the first 4000 iterates only for each chain (iter-
ates during the initial burn-in, in this case 20,000, were not saved in the
object A1). The method requires the designation of a quantile that is to be
estimated. By default, CODA uses the 0.025 quantile. We use the method
assuming one wishes to estimate the median, that is, the 0.5 quantile.
> raftery.diag(A1_1chain[1:4000,],q=0.5,r=0.01,s=0.95,
converge.eps=0.001)
> raftery.diag(A1_2chain[1:4000,],q=0.5,r=0.01,s=0.95)
[[2]]
4. Heidelberger and Welch’s Diagnostic
With ε = 0.01, the parameter β2 passed the test of stationarity (for both
chains), but failed the half-width test, indicating the need to continue the
simulation to achieve the desired precision. If we increase ε to 0.05, all
parameters pass both the stationarity and half-width tests.
5. HPD Intervals
Finally, recall that in CODA one can obtain HPD intervals for all mon-
itored parameters, using the function HPDinterval(). This is illustrated
here with 95% HPD intervals for each of the chains.
> HPDinterval(A1_2_mcmc, prob = 0.95)
[[1]]
lower upper
beta0 13.860000 20.80000
beta[1] 0.036880 9.84600
beta[2] -0.128400 0.42490
beta[3] 2.487000 6.08700
beta[4] -0.178200 -0.03300
tau 0.056580 0.10000
tau_a 0.009279 0.02371
sigma_a 6.246000 9.88700
sigma 3.122000 4.14500
deviance 771.400000 820.30000
attr(,"Probability")
[1] 0.95
[[2]]
lower upper
beta0 13.960000 20.87000
beta[1] 0.002304 9.64300
beta[2] -0.114500 0.42770
beta[3] 2.402000 6.02500
beta[4] -0.178100 -0.03235
tau 0.055750 0.09912
tau_a 0.009449 0.02374
sigma_a 6.223000 9.83100
sigma 3.122000 4.15300
deviance 771.100000 819.80000
attr(,"Probability")
[1] 0.95
B. Using R2jags
1. The output from jags() can be converted into an mcmc object with the
command:
exemplo1_JAGS.fit2.mcmc <- as.mcmc(exemplo1_JAGS.fit2 )
2. As before, with the mcmc object one can then use a variety of com-
mands for convergence diagnostics in CODA:
library(coda)
plot(exemplo1_JAGS.fit2.mcmc)
autocorr.plot(exemplo1_JAGS.fit2.mcmc)
gelman.plot(exemplo1_JAGS.fit2.mcmc)
gelman.diag(exemplo1_JAGS.fit2.mcmc)
geweke.diag(exemplo1_JAGS.fit2.mcmc)
raftery.diag(exemplo1_JAGS.fit2.mcmc)
heidel.diag(exemplo1_JAGS.fit2.mcmc)
Thus, to use BOA, we proceed exactly as in the case of two chains with
output from R2OpenBUGS.
C. Using RStan
As mentioned before (item 5, section 9.4.1), the simulated parameter values
for a chain that is set up using the command stan() can be obtained by
applying the function extract() to the output of stan().
> samples_stan_array<-extract(exemplo1.fit_stan,
pars=c("beta0", "beta", "sigma", "sigma_a", "tau", "tau_a"),
permuted = FALSE, inc_warmup = FALSE, include = TRUE)
> class(samples_stan_array)
[1] "array"
> dim(samples_stan_array)
[1] 20000 2 9 #20000 iterations per chain, 2 chains, 9 parameters
For BOA, we need to first define dimnames for the row and column names:
samples_coda_1<-as.matrix(samples_stan_array[1:20000,1,1:9])
samples_coda_2<-as.matrix(samples_stan_array[1:20000,2,1:9])
dimnames(samples_coda_1)<- list(1:20000,
c("beta0", "beta[1]","beta[2]", "beta[3]", "beta[4]",
"sigma", "sigma_a", "tau", "tau_a"))
dimnames(samples_coda_2)<- list(1:20000,
c("beta0", "beta[1]","beta[2]", "beta[3]", "beta[4]",
"sigma", "sigma_a", "tau", "tau_a"))
D. Using R2BayesX
Recall (item 4, Subsection 9.5.1) that the function samples() applied to
the output of bayesx() returns the simulated values from the posterior
distribution, including the parameters that are indicated in the argument
term. The argument coda controls the type and class of the output object.
Thus with the following commands we get a list of type mcmc, which
can be used for convergence diagnostics in CODA:
> AA_coda<-samples(exemplo1_BayesX,
term=c("linear-samples","var-samples","sd(ID)"),coda=TRUE)
> class(AA_coda)
[1] "mcmc.list"
> names(AA_coda)
[1] "Chain_1" "Chain_2"
#illustrating
> gelman.diag(AA_coda)
Potential scale reduction factors:
Multivariate psrf
1
#illustrating
> library(boa)
> boa.geweke(AA_boa_1,p.first=0.1,p.last=0.5)
Z-Score p-value
Chain_1.Param.Intercept -0.2281478 0.8195313
Chain_1.Param.z2 1.2278951 0.2194863
Chain_1.Param.z3 -0.2358216 0.8135711
Chain_1.Param.z1 1.1215734 0.2620438
Chain_1.Param.z 0.8195813 0.4124548
Chain_1.Var 0.6110576 0.5411614
As with any other R package, R-INLA is loaded in each work session
with
library(INLA)
R-INLA offers a large number of probability distributions for the
response variable. The list can be seen with the command
> names(inla.models()$likelihood)
As before, fixed effects that appear linearly in the model have been cen-
tered. Using the function f(), which appears in the definition of the for-
mula, we define the structural effects (the various types are defined at
www.r-inla.org/models/latent-models). In the case at hand we
have only the patient-specific (variable ID) random effects which are in-
troduced in the model as a. The model iid specifies a normal distribution
with zero mean and precision τa . The prior distribution that is specified in
the argument hyper is for ln(τa ). Being a log-gamma it corresponds to a
gamma distribution for the precision τa .
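The formula object INLA_formula referenced below is not shown in this excerpt; consistent with the description above, it would have been defined along these lines (a sketch — the log-gamma prior parameters are an assumption patterned on the gamma priors used with the other packages):

```r
# response X, centered fixed effects z1,...,z, iid random effect per ID
INLA_formula <- X ~ z1 + z2 + z3 + z +
  f(ID, model = "iid",
    hyper = list(prec = list(prior = "loggamma", param = c(0.05, 0.05))))
```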
2. Next we call the inla() function to run the INLA algorithm and
obtain the results needed for the desired Bayesian inference, as follows:
> ?inla
> resultado_INLA <- inla(INLA_formula,family="normal",
control.predictor=list(compute=TRUE),
control.compute =list(waic=TRUE,dic=TRUE,cpo=TRUE),
data = Cexemplo1,
control.family=
list(hyper=list(prec=list(prior="loggamma",param=c(1,0.005)))))
The first line in the above code lists all arguments of the inla() func-
tion, of which the only required ones are the object that states the formula,
in this case INLA_formula, and the object that contains the data, in this
case data = Cexemplo1. Arguments that are not specified explicitly are
set by R-INLA to their default values.
The directive control.predictor=list(compute=TRUE) specifies
to compute the marginal distributions of the linear predictor. There are
other arguments for this function, which can be listed by the command
?control.predictor
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 17.250 1.737 13.820 17.251 20.669 17.253 0
z1 4.772 2.448 -0.050 4.770 9.599 4.766 0
z2 0.154 0.138 -0.118 0.154 0.427 0.153 0
z3 4.169 0.908 2.394 4.164 5.972 4.154 0
z -0.106 0.036 -0.176 -0.106 -0.035 -0.107 0
Random effects:
Name Model
ID IID model
Model hyperparameters:
mean sd 0.025quant 0.5quant
Precision for the Gaussian observations 0.0789 0.0113 0.0587 0.0782
Precision for ID 0.0170 0.0039 0.0106 0.0167
0.975quant mode
Compare these results with the corresponding results from simulation meth-
ods. In particular, compare the WAIC values with those obtained from
RStan.
4. Note that the results also include, for each configuration of hyperpa-
rameters, an estimate of the effective number of parameters. This estimate
essentially corresponds to the expected number of independent parameters
in the model. In our case, we have 7 + 49 = 56 parameters, but since the
random effects are correlated, the effective number of parameters is lower,
≈ 47, as can be seen. As mentioned, this is one of the strategies, proposed
in Rue et al (2009), to evaluate the accuracy of the approximation. In par-
ticular, if the effective number of parameters is low compared to the sample
size, then one expects the approximation to be good. In this case the ratio
of sample size (147) and effective number of parameters (46.38) is approxi-
mately 3.17, suggesting a reasonably good approximation. In fact, the ratio
can be interpreted as the number of “equivalent replicates” corresponding
to the number of observations for each expected number of effective pa-
rameters.
Another reported quantity is the mean Kullback–Leibler divergence (in
the column kld). This value describes the difference between the normal
approximation and the simplified Laplace approximation (recall the dis-
cussion in Chapter 8 about the various approximation strategies used in
INLA) for the marginal posterior distributions. Small values indicate that
the posterior distribution is well-approximated by a normal.
5. The default approximation strategy in inla() is the simplified Laplace
approach. Other approximation and integration methods can be defined us-
ing the argument control.inla in the function inla(). For example, if
one wanted the complete Laplace approach used, which is recommended
6. Besides the results shown earlier, R-INLA can also report two types
of goodness-of-fit measures, namely the CPO, p(yi | y−i), and the
probability integral transform (PIT), P(Yi,new ≤ yi | y−i). To add these
to the output, just add cpo=TRUE in the list of the argument control.compute
for the function inla. The values are then returned as part of the output
from inla. A list of all possible values can be obtained by the command
names(resultado_INLA). Of the 51 possible ones, we list only a few:
> names(resultado_INLA)
[1] "names.fixed"
[2] "summary.fixed"
[3] "marginals.fixed"
[4] "summary.lincomb"
[5] "marginals.lincomb"
[6] "size.lincomb"
[7] "summary.lincomb.derived"
[8] "marginals.lincomb.derived"
[9] "size.lincomb.derived"
[10] "mlik"
[11] "cpo"
[12] "po"
[13] "waic"
...
[18] "summary.linear.predictor"
[19] "marginals.linear.predictor"
[20] "summary.fitted.values"
[21] "marginals.fitted.values"
...
[27] "offset.linear.predictor"
...
[51] "model.matrix"
One can then save the values of these two measures of fit in an object, for
example CPO_PIT below, and use, for further analysis, for graphs, etc.
> CPO_PIT<-resultado_INLA$cpo
> names(CPO_PIT)
[1] "cpo" "pit" "failure"
> class(CPO_PIT)
[1] "list"
> summary(CPO_PIT$cpo)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003652 0.0486900 0.0741300 0.0673100 0.0905200 0.0924000
> summary(CPO_PIT$pit)
Min. 1st Qu. Median Mean 3rd Qu. Max.
> tau<-resultado_INLA$marginals.hyperpar$
+"Precision for the Gaussian observations"
> sigma<-inla.emarginal(function(x) 1/sqrt(x), tau)
> sigma
[1] 3.588387
> tau_a<-resultado_INLA$marginals.hyperpar$"Precision for ID"
> sigma_a<-inla.emarginal(function(x) 1/sqrt(x), tau_a)
> sigma_a
[1] 7.813781
More information about the functions that are used to operate on marginal
distributions is obtained via the help command ?inla.marginal.
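Functions such as inla.emarginal operate on a marginal stored as a grid of (x, density) pairs; conceptually, the expectation is a numerical integral. This can be sketched with the trapezoidal rule (a conceptual sketch only, not the INLA implementation, which interpolates more carefully):

```r
# marg: two-column matrix holding x values and density values
emarginal_sketch <- function(fun, marg) {
  x    <- marg[, 1]
  dens <- marg[, 2]
  fx   <- fun(x) * dens
  sum(diff(x) * (head(fx, -1) + tail(fx, -1)) / 2)  # trapezoidal rule
}
```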
10. One of these functions, inla.hpdmarginal(), allows getting HPD
credible intervals for the model parameters. To obtain these intervals for the
parameters corresponding to the fixed effects in the linear predictor, one
can either get them individually, or get all in one command, as illustrated
here for a 95% interval,
> HPD<-NULL
> for(i in 1:5){
HPD[[i]]<-inla.hpdmarginal(0.95,
resultado_INLA$marginals.fixed[[i]])}
> HPD
[[1]]
low high
level:0.95 13.82332 20.66469
[[2]]
low high
level:0.95 -0.05260815 9.58653
[[3]]
low high
level:0.95 -0.1184688 0.4263571
[[4]]
low high
level:0.95 2.384431 5.958083
[[5]]
low high
level:0.95 -0.1759522 -0.03584334
[[2]]
low high
level:0.95 0.009948865 0.02469588
As expected, the HPD intervals for the fixed effect parameters practically
coincide with the equal tail intervals in the summary results. The same is
not true for the precision parameters.
Problems
9.1 FEV Data. Rosner (1999) reports on a study of lung capacity and smoking
in n = 654 youths aged 3–19 years. Lung capacity yi is measured as forced
expiratory volume (FEV). Covariates include age (xi1) and an indicator of
smoking (xi2). The same data are also analyzed by Christensen et al. (2011:
ch. 7).
Set up a regression of yi on age (xi1), adjusting intercept and slope for
smoking (xi2), that is, a normal linear regression for yi = b0 + b1 xi1 + b2 xi2 +
b3 xi1 xi2 + εi. The data are available as the data set fev in the R package
tmle. Set up inference using OpenBUGS, JAGS, Stan, or INLA. Does FEV
differ by smoking? Does smoking affect the increase of lung capacity with
age? Is the interaction term needed?
9.2 Consider any of the examples in the OpenBUGS manual,
www.openbugs.net/w/Examples. Implement inference in JAGS, Stan, or OpenBUGS.
9.3 Consider the Epilepsy study reported in the OpenBUGS manual,
www.openbugs.net/Examples/Epil.html. Implement inference using R-INLA. Show
the marginal posterior distributions for the logistic regression coefficients
a j . Compare versus a simplified model without extra-Poisson variation (that
is, without the b jk random effect).
9.4 Consider the Salmonella dose–response study in the OpenBUGS manual,
Probability Distributions
Discrete Distributions
Binomial: x ∼ Bi(n, p), x ∈ {0, . . . , n};
f(x) = C(n, x) p^x (1 − p)^(n−x), n ∈ N, 0 < p < 1;
µ = np, σ2 = np(1 − p).
Bernoulli: Ber(p) = Bi(n = 1, p).
Beta-binomial: x ∼ BeBin(n, a, b), x ∈ {0, . . . , n};
f(x) = C(n, x) B(a + x, b + n − x)/B(a, b), a > 0, b > 0, n ∈ N;
µ = na/(a + b), σ2 = nab(a + b + n)/[(a + b)2 (a + b + 1)].
Note: A BeBin(n, a, b) distribution is a mixture of a Bi(n, θ) with respect
to θ ∼ Be(a, b).
Negative binomial: x ∼ NBin(r, θ), x ∈ N0 ;
f(x) = C(x + r − 1, x) (1 − θ)^r θ^x, 0 < θ < 1, r ∈ N;
µ = rθ/(1 − θ) and σ2 = rθ/(1 − θ)2.
Geometric: Geo(θ) = NBin(1, θ).
Hypergeometric:
f(x) = C(M, x) C(N − M, n − x) / C(N, n), N ∈ N, M ∈ {0, . . . , N}, n ∈ {1, . . . , N};
µ = nM/N and σ2 = n (N − n)/(N − 1) (M/N)(1 − M/N).
Continuous Distributions
Uniform: x ∼ U(α, β), α ≤ x ≤ β;
f (x) = 1/(β − α), α < β, α, β ∈ R;
µ = (β + α)/2 and σ2 = (β − α)2 /12.
Note: We also write U(S ) for a uniform distribution over a set S .
Beta: x ∼ Be(α, β), 0 ≤ x ≤ 1;
f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), α > 0, β > 0;
µ = α/(α + β) and σ2 = αβ/[(α + β)2 (α + β + 1)].
Normal: x ∼ N(µ, σ2), x ∈ R;
f(x) = (2πσ2)^(−1/2) e^(−(x−µ)2/(2σ2)), µ ∈ R, σ2 > 0.
F (Snedecor): x ∼ F(α, β), x > 0;
f(x) = [α^(α/2) β^(β/2) / B(α/2, β/2)] x^(α/2−1) (αx + β)^(−(α+β)/2), α > 0, β > 0;
µ = β/(β − 2), β > 2, and σ2 = 2β2 (α + β − 2)/[α(β − 2)2 (β − 4)], β > 4.
Multivariate Distributions
Dirichlet: θ ∼ D_c(a), θ ∈ S_c;
f(θ) = [Γ(a•)/Π_{i=1}^{c+1} Γ(a_i)] Π_{i=1}^{c+1} θ_i^{a_i−1}, a_i ∈ R⁺, i = 1, . . . , c + 1; θ_{c+1} = 1 − θ•;
µ_i = a_i/a•, σ²_i = µ_i(1 − µ_i)/(a• + 1) and σ_{ij} = −µ_i µ_j/(a• + 1), i ≠ j.
Multinomial: x ∼ M_c(n, θ), x = (x_1, . . . , x_{c+1}), x_i ∈ N0, Σ_{i=1}^{c} x_i ≤ n and x_{c+1} = n − Σ_{i=1}^{c} x_i;
f(x) = [n!/Π_{i=1}^{c+1} x_i!] Π_{i=1}^{c+1} θ_i^{x_i}, θ ∈ S_c, θ_{c+1} = 1 − θ•;
µ = nθ and Σ = n(diag(θ_1, . . . , θ_c) − θθ′).
Multinomial-Dirichlet: x ∼ MD_k(n, α), x = (x_1, . . . , x_k), with x_i ∈ {0, 1, 2, . . . , n}, Σ_{i=1}^{k} x_i ≤ n, x_{k+1} = n − Σ_{i=1}^{k} x_i;
f(x) = [n!/Π_{i=1}^{k+1} x_i!] B({α_i + x_i})/B(α), α = (α_1, . . . , α_k, α_{k+1}), α_i > 0, n ∈ N;
µ_i = n m_i, σ²_i = n m_i(1 − m_i)(α• + n)/(α• + 1) and σ_{ij} = −n m_i m_j (α• + n)/(α• + 1), where m_j = α_j/α•.
Note: An MDk (n, α) distribution is the mixture of an Mk (n, θ) distribu-
tion with respect to θ ∼ Dk (α).
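This mixture can also be checked by simulation. The following sketch is illustrative only (Python in place of the book's R; the helper names rdirichlet and rmultinomial are hypothetical): it draws θ ∼ D_k(α) by normalizing independent gamma variates, then x | θ ∼ M(n, θ), and compares the empirical means with µ_i = n α_i/α•.

```python
import random

def rdirichlet(alpha, rng):
    # standard construction: normalize independent Gamma(alpha_i, 1) draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [gi / s for gi in g]

def rmultinomial(n, theta, rng):
    # tally n categorical draws into counts
    counts = [0] * len(theta)
    for _ in range(n):
        u, cum = rng.random(), 0.0
        for i, t in enumerate(theta):
            cum += t
            if u < cum:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts

rng = random.Random(7)
n, alpha = 6, [1.0, 2.0, 3.0]
a_tot = sum(alpha)
draws = [rmultinomial(n, rdirichlet(alpha, rng), rng) for _ in range(20000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(len(alpha))]
# theoretical means mu_i = n * alpha_i / alpha_tot = [1, 2, 3]
```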
Multivariate normal: x ∼ N_k(m, V), x ∈ R^k;
f(x) = (2π)^{−k/2} |V|^{−1/2} e^{−(x−m)′V^{−1}(x−m)/2}, m ∈ R^k, V positive-definite.
Programming Notes
return(x)
}
#######################################################
# simulating from a bivariate distribution
MCMC Simulation
The last code fragment shows a typical Gibbs sampler implementation.
The main function gibbs() implements the iteration over the steps of
the MCMC simulation. The code implements a Gibbs sampler with two
transition probabilities, generating parameters b and s2 from the respective
complete conditional posterior distributions.
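The two-block structure described above can be sketched as follows. This is an illustrative stand-in, not the book's listing (Python rather than the book's R, and the model is an assumption): a Gibbs sampler for the conjugate normal model y_i ∼ N(b, s2) under the improper prior p(b, s2) ∝ 1/s2, whose complete conditionals are b | s2, y ∼ N(ȳ, s2/n) and s2 | b, y ∼ Inv-Gamma(n/2, Σ(y_i − b)²/2).

```python
import random
import statistics

def gibbs(y, n_iter=5000, seed=42):
    # Gibbs sampler for y_i ~ N(b, s2) with p(b, s2) proportional to 1/s2:
    # alternate draws from the two complete conditional distributions.
    rng = random.Random(seed)
    n, ybar = len(y), statistics.mean(y)
    b, s2 = ybar, statistics.variance(y)   # crude starting values
    out = []
    for _ in range(n_iter):
        # b | s2, y ~ N(ybar, s2/n)
        b = rng.gauss(ybar, (s2 / n) ** 0.5)
        # s2 | b, y ~ Inv-Gamma(n/2, rate): draw Gamma(n/2, scale 1), invert
        rate = sum((yi - b) ** 2 for yi in y) / 2.0
        s2 = rate / rng.gammavariate(n / 2.0, 1.0)
        out.append((b, s2))
    return out

# usage: simulate data, run the sampler, discard burn-in, summarize
rng = random.Random(0)
y = [rng.gauss(5.0, 2.0) for _ in range(100)]
post = gibbs(y)[500:]
b_hat = statistics.mean(b for b, _ in post)
```

Under this flat prior the posterior mean of b coincides with the sample mean ȳ, which gives a simple sanity check on the output.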
References
Brezger, A., Kneib, T., and Lang, S. 2005. BayesX: analyzing Bayesian structural
additive regression models. Journal of Statistical Software, Articles, 14(11), 1–22.
(Cited on pages 172, 192, and 193.)
Burnham, K. P., and Anderson, D. R. 2002. Model Selection and Multimodel Inference:
A Practical Information-Theoretic Approach. 2nd edn. Springer.
Carlin, B. P., and Chib, S. 1995. Bayesian model choice via Markov chain Monte Carlo.
Journal of the Royal Statistical Society, B, 57(3), 473–484. (Cited on page 136.)
Carlin, B. P., and Gelfand, A. E. 1991. An iterative Monte Carlo method for non-
conjugate Bayesian analysis. Statistics and Computing, 1(2), 119–128. (Cited on
page 65.)
Carlin, B. P., and Louis, T. A. 2009. Bayesian Methods for Data Analysis. CRC Press.
(Cited on pages viii and 78.)
Carpenter, B., Gelman, A., Hoffman, M., et al. 2017. Stan: a probabilistic programming
language. Journal of Statistical Software, Articles, 76(1), 1–32. (Cited on pages 172
and 186.)
Carvalho, C. M., Polson, N. G., and Scott, J. G. 2010. The horseshoe estimator for
sparse signals. Biometrika, 97(2), 465–480. (Cited on page 134.)
Celeux, G., Forbes, F., Robert, C. P., and Titterington, D. M. 2006. Deviance informa-
tion criteria for missing data models. Bayesian Analysis, 1(4), 651–673. (Cited on
page 79.)
Chen, M.-H. 1994. Importance-weighted marginal Bayesian posterior density esti-
mation. Journal of the American Statistical Association, 89, 818–824. (Cited on
page 58.)
Chen, M.-H., and Shao, Q. 1999. Monte Carlo estimation of Bayesian credible and
HPD intervals. Journal of Computational and Graphical Statistics, 8, 69–92. (Cited
on page 47.)
Chen, M.-H., Shao, Q., and Ibrahim, J. G. 2000. Monte Carlo Methods in Bayesian
Computation. Springer. (Cited on pages 54, 57, and 129.)
Chib, S. 1995. Marginal likelihood from the Gibbs output. Journal of the American
Statistical Association, 90(432), 1313–1321. (Cited on pages 129 and 130.)
Chib, S., and Jeliazkov, I. 2001. Marginal likelihood from the Metropolis–Hastings
output. Journal of the American Statistical Association, 96(453), 270–281. (Cited
on pages 129 and 131.)
Christensen, R., Johnson, W., Hanson, T., and Branscum, A. 2011. Bayesian Ideas and
Data Analysis: An Introduction for Scientists and Statisticians. CRC Press. (Cited
on page viii.)
Cowles, M. K. 1994. Practical issues in Gibbs sampler implementation with applica-
tion to Bayesian hierarchical modelling of clinical trial data. PhD thesis, University
of Minnesota. (Cited on page 201.)
Cowles, M. K., and Carlin, B. P. 1996. Markov chain Monte Carlo convergence diag-
nostics: a comparative review. Journal of the American Statistical Association, 91,
883–904. (Cited on pages 116 and 199.)
Damien, P., Wakefield, J., and Walker, S. 1999. Gibbs sampling for Bayesian non-
conjugate and hierarchical models by using auxiliary variables. Journal of the Royal
Statistical Society, B, 61(2), 331–344. (Cited on page 106.)
Dawid, A. P. 1985. The impossibility of inductive inference. (invited discussion of
Gelman, A., Hwang, J., and Vehtari, A. 2014b. Understanding predictive information
criterion for Bayesian models. Statistics and Computing, 24, 997–1016.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distribution and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6, 721–741. (Cited on pages 90 and 98.)
Gentle, J. E. 2004. Random Number Generation and Monte Carlo Methods. 2nd edn.
Springer. (Cited on page 43.)
Genz, A., and Kass, R. E. 1997. Subregion adaptative integration of functions having
a dominant peak. Journal of Computational and Graphical Statistics, 6, 92–111.
(Cited on page 53.)
George, E. I., and McCulloch, R. 1997. Approaches for Bayesian variable selection.
Statistica Sinica, 7, 339–373. (Cited on pages 131, 132, and 133.)
George, E. I., and McCulloch, R. E. 1993. Variable selection via Gibbs sampling. Jour-
nal of the American Statistical Association, 88(423), 881–889. (Cited on pages 131
and 132.)
Geweke, J. 1989. Bayesian inference in econometric models using Monte Carlo inte-
gration. Econometrica, 57(6), 1317–1339. (Cited on pages 52 and 68.)
Geweke, J. 1992. Evaluating the accuracy of sampling-based approaches to calculating
posterior moments. In: Bayesian Statistics 4. Clarendon Press. (Cited on page 199.)
Geweke, J. 2004. Getting it right. Journal of the American Statistical Association,
99(467), 799–804. (Cited on pages 127 and 128.)
Geyer, C. J. 1992. Practical Markov chain Monte Carlo (with discussion). Statistical
Science, 7, 473–511. (Cited on page 115.)
Gillies, D. 2001. Bayesianism and the fixity of the theoretical framework. Pages 363–
379 of: Corfield, J., and Williamson, J. (eds.), Foundations of Bayesianism. Kluwer
Academic Publishers. (Cited on page 12.)
Givens, G. H., and Hoeting, J. A. 2005. Computational Statistics. Wiley. (Cited on
page 97.)
Gradshteyn, I., and Ryzhik, I. 2007. Table of Integrals, Series, and Products, Jeffrey,
A., and Zwillinger, D. (eds.). Academic Press.
Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika, 82, 711–732. (Cited on page 138.)
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57, 97–109. (Cited on page 90.)
Heidelberger, P., and Welch, P. 1983. Simulation run length control in the presence of
an initial transient. Operations Research, 31, 1109–1144. (Cited on page 199.)
Henderson, H. V., and Velleman, P. F. 1981. Building multiple regression models inter-
actively. Biometrics, 37, 391–411. (Cited on page 73.)
Hoff, P. D. 2009. A First Course in Bayesian Statistical Methods. Springer. (Cited on
page viii.)
Hoffman, M. D., and Gelman, A. 2014. The No-U-Turn sampler: Adaptively setting
path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research,
15(1), 1593–1623. (Cited on page 186.)
Jaynes, E. T. 1968. Prior probabilities. IEEE Transactions on Systems, Science and
Cybernetics, 4, 227–241. (Cited on page 22.)
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge University
Press. (Cited on pages 13 and 21.)
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1999. An introduction
to variational methods for graphical models. Machine Learning, 37(2), 183–233.
(Cited on page 164.)
Karabatsos, G. 2015. A menu-driven software package for Bayesian regression analy-
sis. The ISBA Bulletin, 22, 13–16. (Cited on page 172.)
Kass, R. E., and Raftery, A. E. 1995. Bayes factors. Journal of the American Statistical
Association, 90, 773–795. (Cited on page 85.)
Kass, R. E., and Wasserman, L. 1996. The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91, 1343–1370. (Cited on
page 17.)
Kempthorne, O., and Folks, L. 1971. Probability, Statistics and Data Analysis. Iowa
State University Press. (Cited on page 7.)
Kneib, T., Heinzl, F., Brezger, A., Bove, D., and Klein, N. 2014. BayesX: R Utilities
Accompanying the Software Package BayesX. R package version 0.2-9. (Cited on
page 193.)
Korner-Nievergelt, F., von Felten, S., Roth, T., et al. 2015. Bayesian Data Analysis in
Ecology Using Linear Models with R, BUGS, and Stan. Academic Press. (Cited on
page 172.)
Kruschke, J. 2011. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Aca-
demic Press/Elsevier. (Cited on page 172.)
Kruschke, J. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan.
Academic Press/Elsevier. (Cited on page 172.)
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. 2017. Automatic
differentiation variational inference. Journal of Machine Learning Research, 18(1),
430–474. (Cited on page 168.)
Kuhn, T. S. 1962. The Structure of Scientific Revolutions. University of Chicago Press.
(Cited on page 5.)
Kuo, L., and Mallick, B. 1998. Variable selection for regression models. Sankhya: The
Indian Journal of Statistics, Series B, 60(1), 65–81. (Cited on page 132.)
Lauritzen, S. L., and Spiegelhalter, D. J. 1988. Local computations with probabilities
on graphical structures and their application to expert systems. Journal of the Royal
Statistical Society, B, 50(2), 157–224. (Cited on page 174.)
Lin, D. 2013. Online learning of nonparametric mixture models via sequential vari-
ational approximation. Pages 395–403 of: Proceedings of the 26th International
Conference on Neural Information Processing Systems. USA: Curran Associates
Inc. (Cited on page 168.)
Lindgren, F., and Rue, H. 2015. Bayesian spatial modelling with R-INLA. Journal of
Statistical Software, Articles, 63(19), 1–25. (Cited on page 214.)
Lindgren, F., Rue, H., and Lindstrom, J. 2011. An explicit link between Gaussian
fields and Gaussian Markov random fields: the stochastic partial differential equation
approach. Journal of the Royal Statistical Society, B, 73(4), 423–498. (Cited on
page 214.)
Lindley, D. V. 1990. The 1988 Wald memorial lectures: the present position in Bayesian
statistics. Statistical Science, 5, 44–89. (Cited on page 10.)
Liu, J., and West, M. 2001. Combined parameter and state estimation in simulation-
based filtering. Pages 197–223 of: Doucet, A., de Freitas, N., and Gordon, N. (eds.),
Sequential Monte Carlo Methods in Practice. Springer. (Cited on page 64.)
Lunn, D., Spiegelhalter, D., Thomas, A., and Best, N. 2009. The BUGS project: Evo-
lution, critique and future directions. Statistics in Medicine, 28(25), 3049–3067.
(Cited on page 174.)
MacEachern, S., and Berliner, L. 1994. Subsampling the Gibbs sampler. The American
Statistician, 48, 188–190.
Madigan, D., and York, J. 1995. Bayesian graphical models for discrete data. Interna-
tional Statistical Review, 63, 215–232. (Cited on page 133.)
Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. 2012. Approximate Bayesian
computational methods. Statistics and Computing, 22(6), 1167–1180. (Cited on
page 126.)
Mayo, D., and Kruse, M. 2001. Principles of inference and their consequences. Pages
381–403 of: Corfield, J., and Williamson, J. (eds.), Foundations of Bayesianism.
Kluwer Academic Publishers. (Cited on page 9.)
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E.
1953. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092. (Cited on pages 90 and 97.)
Morris, J. S., Baggerly, K. A., and Coombes, K. R. 2003. Bayesian shrinkage estimation
of the relative abundance of mRNA transcripts using SAGE. Biometrics, 59, 476–
486. (Cited on page 122.)
Neal, R. M. 1997. Markov Chain Monte Carlo methods based on “slicing” the density
function. Technical Report. University of Toronto. (Cited on page 106.)
Neal, R. M. 2003. Slice sampling (with discussion). Annals of Statistics, 31, 705–767.
(Cited on page 106.)
Neal, R. M. 2011. MCMC using Hamiltonian dynamics. Chap. 5 of: Brooks, S.,
Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte
Carlo. Chapman & Hall / CRC Press. (Cited on pages 107 and 185.)
Neuenschwander, B., Branson, M., and Gsponer, T. 2008. Critical aspects of the
Bayesian approach to phase I cancer trials. Statistics in Medicine, 27, 2420–2439.
(Cited on page 53.)
Newton, M. A., and Raftery, A. E. 1994. Approximate Bayesian inference by the
weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical
Society, B, 56, 1–48. (Cited on page 86.)
Ntzoufras, I. 2009. Bayesian Modeling Using WinBUGS. Wiley. (Cited on page 172.)
O’Hagan, A. 2010. Bayesian Inference, Vol. 2B. 3rd edn. Arnold. (Cited on pages 1, 9,
14, and 17.)
O’Quigley, J., Pepe, M., and Fisher, L. 1990. Continual reassessment method: A prac-
tical design for phase 1 clinical trials in cancer. Biometrics, 46(1), 33–48. (Cited on
page 45.)
Park, T., and Casella, G. 2008. The Bayesian lasso. Journal of the American Statistical
Association, 103(482), 681–686. (Cited on page 134.)
Patil, V. H. 1964. The Behrens–Fisher problem and its Bayesian solution. Journal of
the Indian Statistical Association, 2, 21. (Cited on page 33.)
Paulino, C. D., and Singer, J. M. 2006. Análise de Dados Categorizados. Editora
Edgard Blücher.
Paulino, C. D., Soares, P., and Neuhaus, J. 2003. Binomial regression with misclassifi-
cation. Biometrics, 59, 670–675. (Cited on page 17.)
Paulino, C. D., Amaral Turkman, M. A., Murteira, B., and Silva, G. 2018. Estatística
Bayesiana. 2nd edn. Fundacão Calouste Gulbenkian. (Cited on pages 17, 38, 54,
57, 84, 93, 158, and 199.)
Pitt, M. K., and Shephard, N. 1999. Filtering via simulation: Auxiliary particle fil-
ters. Journal of the American Statistical Association, 94(446), 590–599. (Cited on
pages 61, 62, and 63.)
Plummer, M. 2003. JAGS: a program for analysis of Bayesian graphical models using
Gibbs sampling. In: Hornik, K., Leisch, F., and Zeileis, A. (eds.), 3rd International
Workshop on Distributed Statistical Computing (DSC 2003). (Cited on page 172.)
Plummer, M. 2012. JAGS Version 3.3.0 User Manual. http://mcmc-jags.
sourceforge.net, accessed on July 22, 2018. (Cited on page 181.)
Plummer, M., Best, N. G., Cowles, M. K., and Vines, S. K. 2006. CODA: Conver-
gence diagnostics and output analysis for MCMC. R News, 6(1), 7–11. (Cited on
pages 116, 198, and 201.)
Polson, N. G., Stroud, J. R., and Müller, P. 2008. Practical filtering with sequential
parameter learning. Journal of the Royal Statistical Society, B, 70(2), 413–428.
(Cited on page 64.)
Prado, R., and West, M. 2010. Time Series: Modeling, Computation, and Inference.
Chapman & Hall/CRC Press. (Cited on page 64.)
Raftery, A. E., and Lewis, S. 1992. How many iterations in the Gibbs sampler? Pages
763–74 of: Bernardo, J., Berger, J., Dawid, A., and Smith, A. (eds.), Bayesian Statis-
tics IV. Oxford University Press. (Cited on page 199.)
Raftery, A. E., Madigan, D., and Hoeting, J. A. 1997. Bayesian model averaging for
linear regression models. Journal of the American Statistical Association, 92(437),
179–191. (Cited on page 133.)
Richardson, S., and Green, P. J. 1997. On Bayesian analysis of mixtures with an un-
known number of components (with discussion). Journal of the Royal Statistical
Society, B, 59(4), 731–792. (Cited on page 141.)
Rickert, J. 2018. A first look at NIMBLE. Blog: https://rviews.rstudio.com/
2018/07/05/a-first-look-at-nimble/, accessed on July 16, 2018. (Cited on
page 172.)
Ripley, B. D. 1987. Stochastic Simulation. Wiley. (Cited on pages 43 and 44.)
Robert, C. P. 1994. The Bayesian Choice. Springer. (Cited on pages 27 and 157.)
Robert, C. P., and Casella, G. 2004. Monte Carlo Statistical Methods. 2nd edn. New
York: Springer. (Cited on pages 44 and 96.)
Rosner, B. 1999. Fundamentals of Biostatistics. Duxbury. (Cited on page 222.)
Ross, S. M. 2014. Introduction to Probability Models. 11th edn. Academic Press. (Cited
on page 91.)
Rossi, P. E., Allenby, G. M., and McCulloch, R. 2005. Bayesian Statistics and Market-
ing. Wiley. (Cited on page 172.)
Ročková, V., and George, E. I. 2014. EMVS: The EM approach to Bayesian variable se-
lection. Journal of the American Statistical Association, 109(506), 828–846. (Cited
on page 143.)
Rubinstein, R. Y. 1981. Simulation and the Monte Carlo Method. 1st edn. Wiley. (Cited
on page 44.)
Rue, H., and Held, L. 2005. Gaussian Markov Random Fields: Theory and Applica-
tions. Chapman & Hall. (Cited on page 159.)
Rue, H., Martino, S., and Chopin, N. 2009. Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
Royal Statistical Society, B, 71(2), 319–392. (Cited on pages 150, 162, 163, 169,
214, and 217.)
Schofield, M. R., Barker, R. J., Gelman, A., Cook, E. R., and Briffa, K. 2016. A model-
based approach to climate reconstruction using tree-ring data. Journal of the Amer-
ican Statistical Association, 111(513), 93–106. (Cited on page 185.)
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6, 461–
466. (Cited on pages 77 and 83.)
Scott, S., Blocker, A., Bonassi, F., et al. 2016. Bayes and big data: the consensus Monte
Carlo algorithm. International Journal of Management Science and Engineering
Management, 11(2), 78–88. (Cited on page 68.)
Shaw, J. E. H. 1988. Aspects of numerical integration and summarization. Pages 625–
631 of: Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M. (eds.),
Bayesian Statistics 3. Oxford University Press. (Cited on page 52.)
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. London:
Chapman and Hall.
Smith, A. F. M. 1991. Bayesian computation methods. Phil. Trans. R. Soc. Lond. A,
337, 369–386.
Smith, A. F. M., and Gelfand, A. E. 1992. Bayesian statistics without tears. The Amer-
ican Statistician, 46, 84–88. (Cited on page 58.)
Smith, B. 2007. BOA: An R package for MCMC output convergence assessment and
posterior inference. Journal of Statistical Software, 21, 1–37. (Cited on pages 116,
198, and 202.)
Spiegelhalter, D. J. 1986. Probabilistic prediction in patient management and clinical
trials. Statistics in Medicine, 5(5), 421–433. (Cited on page 174.)
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. 2002. Bayesian
measures of model complexity and fit (with discussion). Journal of the Royal Sta-
tistical Society, B, 64, 583–639. (Cited on pages 78 and 79.)
Stan Development Team. 2014. RStan: The R Interface to Stan, Version 2.5.0. (Cited
on page 186.)
Sturtz, S., Ligges, U., and Gelman, A. 2005. R2WinBUGS: a package for running
WinBUGS from R. Journal of Statistical Software, 12(3), 1–16. (Cited on page 175.)
Tanner, M. A. 1996. Tools for Statistical Inference. 3rd edn. New York: Springer Verlag.
(Cited on page 157.)
Tanner, M. A., and Wong, W. H. 1987. The calculation of posterior distributions by data
augmentation. Journal of the American Statistical Association, 82(398), 528–540.
(Cited on page 105.)
Thall, P. F., Millikan, R. E., Müller, P., and Lee, S.-J. 2003. Dose-finding with two
agents in phase I oncology trials. Biometrics, 59(3), 487–496. (Cited on pages 126
and 127.)
Thomas, A., O’Hara, B., Ligges, U., and Sturtz, S. 2006. Making BUGS open. R News,
6(01), 12–17. (Cited on page 172.)
Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the
Royal Statistical Society, B, 58(1), 267–288. (Cited on page 134.)
Tierney, L. 1994. Markov chains for exploring posterior distributions. Annals of Statis-
tics, 22, 1701–1728. (Cited on page 96.)
Tierney, L. 1996. Introduction to general state-space Markov chain theory. Pages 61–74 of: Gilks, W., Richardson, S., and Spiegelhalter, D. (eds.), Markov Chain Monte Carlo in Practice. Chapman & Hall. (Cited on page 91.)
Tierney, L., and Kadane, J. 1986. Accurate approximations for posterior moments and
marginal densities. Journal of the American Statistical Association, 81(393), 82–86.
(Cited on pages 154 and 162.)
Tierney, L., Kass, R., and Kadane, J. 1989. Fully exponential Laplace approximations
to expectations and variances of nonpositive functions. Journal of the American
Statistical Association, 84(407), 710–716. (Cited on pages 156 and 157.)
Umlauf, N., Adler, D., Kneib, T., Lang, S., and Zeileis, A. 2015. Structured additive re-
gression models: An R interface to BayesX. Journal of Statistical Software, Articles,
63(21), 1–46. (Cited on pages 193, 194, and 195.)
Vehtari, A., and Ojanen, J. 2012. A survey of Bayesian predictive methods for model
assessment, selection and comparison. Statist. Surv., 6, 142–228.
Vehtari, A., Gelman, A., and Gabry, J. 2017. Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–
1432. (Cited on pages 188 and 191.)
Walker, A. M. 1969. On the asymptotic behaviour of posterior distributions. Journal of
the Royal Statistical Society, B, 31(1), 80–88. (Cited on page 151.)
Wasserman, L. 2004. All of Statistics. Springer-Verlag. (Cited on page 14.)
Watanabe, S. 2010. Asymptotic equivalence of Bayes cross validation and widely appli-
cable information criterion in singular learning theory. Journal of Machine Learning
Research, 11(Dec.), 3571–3594. (Cited on page 80.)
Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin
dynamics. Pages 681–688 of: Proceedings of the 28th International Conference on
International Conference on Machine Learning. Omnipress. (Cited on page 112.)
Zhang, Z., Chan, K. L., Wu, Y., and Chen, C. 2004. Learning a multivariate Gaussian
mixture model with the reversible jump MCMC algorithm. Statistics and Comput-
ing, 14(4), 343–355. (Cited on page 146.)
Index
probability
frequentist, 3
subjective, 6
probit regression, 119
pseudo-Bayes factor, 82
pseudo-prior, 136
metropolized, 137
R-INLA, 214
Rao–Blackwellization, 50
Rayleigh distribution, 41, 226
repeated sampling, 3
slice sampler, 105
Stan, 185
BOA, 211
CODA, 211
state space model, 60
stationary distribution, 92
Student t distribution, 66, 226
multivariate, 228
t distribution, see Student t distribution
transition function, 91
uniform distribution, 41, 225
variable selection
Bayesian lasso, 134
Dirichlet–Laplace, 134
horseshoe, 134
MC³, 133
SSVS, 131, 143
variational Bayes, 164
Stan, 223
variational family, 164
variational inference, 164
mean-field, 164
Weibull distribution, 226
WinBUGS, 174
BOA, 203
CODA, 203
Wishart distribution, 228