Hua He
Pan Wu
Ding-Geng (Din) Chen Editors
Statistical Causal Inferences and Their Applications in Public Health Research
ICSA Book Series in Statistics
Series Editors
Jiahua Chen
Department of Statistics
University of British Columbia
Vancouver
Canada
Editors
Hua He, Department of Epidemiology, School of Public Health & Tropical Medicine, Tulane University, New Orleans, LA, USA
Pan Wu, Value Institute, Christiana Care Health System, Newark, DE, USA
Preface

This book originated from a series of discussions among the editors when we were all at the University of Rochester, NY, before 2015. At that time, we had a research discussion group, led by Professor Xin M. Tu, that met biweekly to discuss methodological developments in statistical causal inference and their applications to public health data. Through this group, we gained a close view of the principles and methods behind statistical causal inference that need to be disseminated to aid further development in public health research, and we were convinced that this could best be accomplished by compiling a book in this area.
This book compiles and presents new developments in statistical causal inference. Data and computer programs will be publicly available so that readers can replicate the model development and data analysis presented in each chapter and readily apply these new methods in their own research. The book strives to bring together experts engaged in causal inference research to present and discuss recent issues in causal inference methodology and its applications. It is timely and has high potential to influence model development and data analysis of causal inference across a wide spectrum of analysts, as well as to foster further research in this direction.
The book consists of four parts presented in 15 chapters. Part I consists of Chap. 1, an overview of statistical causal inference. This chapter introduces the concept of potential outcomes and its application to causal inference, as well as the basic concepts, models, and assumptions in causal inference.
Part II discusses propensity score methods for causal inference and includes six chapters, Chaps. 2 to 7. Chapter 2 gives an overview of propensity score methods and the underlying assumptions for using the propensity score, and Chap. 3 addresses causal inference within Dawid's decision-theoretic framework, where studies of "sufficient covariates" and their properties are essential. In addition, this chapter investigates the augmented inverse probability weighted (AIPW) estimator, which combines a response model and a propensity model. It is found that, in linear regression with homoscedasticity, propensity variable analysis provides exactly the same estimated causal effect as that from multivariate linear regression,
for both the population and the sample. The AIPW estimator has the property of "double robustness," and it can improve precision when the propensity model is correctly specified.
As a critical component of propensity score analysis for reducing selection bias, propensity score estimation can only account for observed covariates, and its robustness to unobserved covariates is not fully understood. Chapter 4 therefore introduces a new technique to assess the robustness of propensity score estimation methods to unobserved covariates. A real dataset on substance abuse prevention for high-risk youth is used to illustrate this technique.
Chapter 5 discusses missing confounder data in propensity score methods for causal inference. Propensity score methods, including weighting, matching, and stratification, are widely used to control potential confounding effects in observational studies and non-randomized trials in order to obtain causal effects of treatment or intervention. However, few studies have investigated the missing confounder data problem in propensity score estimation, which is unique and different from most missing covariate data problems, where the goal is parameter estimation. This chapter reviews and compares existing methods for dealing with missing confounder data in propensity score methods and suggests diagnostic checking tools for selecting a suitable method in practice. In Chap. 6, the focus turns to models of propensity scores for different kinds of treatment variables. This chapter gives a thorough discussion of the methods, comparing parametric and nonparametric approaches, illustrated with a public health dataset. Chapter 7 discusses the computational barrier of propensity score methods in the era of big data, with an example in optimal pair matching, and offers a novel solution by constructing a stratification tree based on exact matching and propensity scores.
Part III is designed for causal inference in randomized clinical studies which
includes five chapters from Chaps. 8 to 12. Chapter 8 reviews important aspects
of semiparametric theory and empirical processes that arise in causal inference
problems with discussions on empirical process theory, which provides powerful
tools for understanding the asymptotic behavior of semiparametric estimators that
depend on flexible nonparametric estimators of nuisance functions. This chapter
concludes by examining related extensions and future directions for work in
semiparametric causal inference.
Chapter 9 discusses structural nested models for cluster-randomized trials in clinical and epidemiologic studies, where adherence to the assigned components is not always perfect. In this chapter, the causal effect of cluster-level adherence on an individual-level outcome is estimated with two different methodologies based on ordinary and weighted structural nested models (SNMs), which are validated by simulation studies. The methods are then applied to a school-based water, sanitation, and hygiene study to estimate the causal effect of increased adherence to intervention components on student absenteeism. In Chap. 10, causal models for randomized trials with two active treatments and continuous compliance are addressed by first proposing a structural model for the principal effects and
then specifying compliance models within each arm of the study. The proposed methodology is illustrated with an analysis of data from a smoking cessation trial.
In Chap. 11, causal ensembles for evaluating the effect of delayed switch to second-line antiretroviral regimens are proposed to deal with the challenges of delayed switch in randomized clinical trials. The method is applied to cohort studies where decisions to switch to subsequent antiretroviral regimens were left to study participants and their providers, as seen in ACTG 5095. Chapter 12 introduces a new class of structural functional response models (SFRMs) for causal inference, focusing on estimating causal treatment effects in complex intervention designs. The SFRM extends the structural mean models (SMMs) widely used in randomized controlled trials to estimate exposure-effect relationships when treatment exposure is imperfect and inconsistent across individual subjects. With a flexible model structure, the SFRM addresses the limitations of existing approaches to causal inference when the study design contains multiple or dynamic intervention layers, and it offers robust inference with a simple and straightforward algorithm.
Part IV is devoted to structural equation modeling for mediation analysis and includes three chapters, Chaps. 13 to 15. Chapter 13 explores the identification of causal mediation models with an unobserved pretreatment confounder, focusing on the identifiability of the mediation, direct, and indirect effects of treatment on the outcome. The mediation effects are represented by a causal mediation model that includes an unobserved confounder, and the direct and indirect effects are expressed in terms of the mediation effects. Simulation studies demonstrate satisfactory estimation performance compared to the standard mediation approach. Chapter 14 studies causal mediation analysis with multilevel data and interference, since this type of data poses a challenge for causal inference under the potential outcomes framework because the number of potential outcomes becomes unmanageable. The goal of this chapter is to extend recent developments in causal inference with multilevel data and violations of the interference assumption to the context of mediation. The book concludes with Chap. 15, which comprehensively examines causal mediation analysis using structural equation modeling, taking advantage of its flexibility as a powerful technique for causal mediation analysis.
As a general note, the references for each chapter are at the end of the chapter so
that the readers can readily refer to the chapter under discussion. Thus each chapter
is self-contained.
We would like to express our gratitude to many individuals. First, thanks go to Professors Xin M. Tu and Wan Tang for leading and organizing the research discussion group that led to the production of this book. Thanks go to Hannah Bracken, the associate editor in statistics at Springer; to Jeffrey Taub, project coordinator at Springer (http://link.springer.com); and to Professor Jiahua Chen, the coeditor of the Springer/ICSA Book Series in Statistics (http://www.springer.com/series/13402), for their professional support of the book. Special thanks are due to the authors of the chapters.
Contributors
Hui Guo Centre for Biostatistics, School of Health Sciences, The University of
Manchester, Manchester, UK
Hua He Department of Epidemiology, School of Public Health & Tropical
Medicine, Tulane University, New Orleans, LA, USA
Jiang He Department of Epidemiology, School of Public Health & Tropical
Medicine, Tulane University, New Orleans, LA, USA
Ping He School of Mathematical Sciences, Peking University, Beijing, China
Shanjun Helian Department of Biostatistics, University of Florida, Gainesville,
FL, USA
Jun Hu College of Basic Science and Information Engineering, Yunnan Agricultural University, Yunnan, China
Brent A. Johnson Department of Biostatistics and Computational Biology,
University of Rochester, Rochester, NY, USA
Edward H. Kennedy University of Pennsylvania, Philadelphia, PA, USA
Li Li Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN, USA
Lin (Laura) Lin Department of Statistics & Actuarial Science, University of
Waterloo, Waterloo, ON, Canada
Yan Ma Department of Epidemiology and Biostatistics, The George Washington
University, Washington, DC, USA
David P. MacKinnon Department of Psychology, Arizona State University,
Tempe, AZ, USA
Nathan Morris Department of Epidemiology and Biostatistics, Case Western
Reserve University, Cleveland, OH, USA
Wei Pan Duke University School of Nursing, Durham, NC, USA
Richard Rheingans Chair, Department of Sustainable Development, Appalachian
State University, Boone, NC, USA
Jason Roy Center for Clinical Epidemiology and Biostatistics, University of
Pennsylvania, Philadelphia, PA, USA
Li Su MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
Wan Tang Department of Biostatistics, School of Public Health & Tropical
Medicine, Tulane University, New Orleans, LA, USA
Xin M. Tu Department of Biostatistics and Computational Biology, University of
Rochester, Rochester, NY, USA
Pan Wu Value Institute, Christiana Care Health System, Newark, DE, USA
Zhenguo Wu School of Mathematical Sciences, Peking University, Beijing, China
Chapter 1
Causal Inference: A Statistical Paradigm for Inferring Causality
Pan Wu, Wan Tang, Tian Chen, Hua He, Douglas Gunzler, and Xin M. Tu
Abstract Inferring causation is one important aim of many research studies across
a wide range of disciplines. In this chapter, we will introduce the concept of potential
outcomes for its application to causal inference as well as the basic concepts,
models, and assumptions in causal inference. An overview of statistical methods
for causal inference will be discussed.
1 Introduction
Assessing causal effect is one important aim of many research studies across a
wide range of disciplines. Although many statistical models, including the popular
regression, strive to provide causal relationships among variables of interest, few
P. Wu ()
Value Institute, Christiana Care Health System, Newark, DE 19718, USA
e-mail: PWu@Christianacare.org
W. Tang
Department of Biostatistics, School of Public Health & Tropical Medicine, Tulane University,
New Orleans, LA 70112, USA
e-mail: wtang1@tulane.edu
T. Chen
Department of Mathematics and Statistics, University of Toledo, Toledo, OH 43606, USA
e-mail: tian.chen@utoledo.edu
H. He
Department of Epidemiology, School of Public Health & Tropical Medicine, Tulane University,
New Orleans, LA 70112, USA
e-mail: Hhe2@tulane.edu
D. Gunzler
Center for Health Care & Policy, MetroHealth Medical Center, Case Western Reserve University,
Cleveland, OH 44109, USA
e-mail: dgunzler@metrohealth.org
X.M. Tu ()
Department of Biostatistics and Computational Biology, University of Rochester, Rochester,
NY 14642, USA
e-mail: Xin_Tu@urmc.rochester.edu
can really offer estimates with a causal connotation. A primary reason for such
difficulties is confounding, observed or otherwise. Unless such factors, which
constitute the source of bias, are all identified and/or controlled for, the observed
association cannot be attributed to causation.
For example, if patients in one treatment have a higher rate of recovery from a
disease of interest than those in another treatment, we cannot generally conclude
that the first treatment is more effective, since the difference could simply be due to
different makeups of the groups such as differential disease severity and comorbid
conditions. Alternatively, if those in the first treatment group are in better health-
care facilities and/or have easier access to some efficacious adjunctive therapy, we
could also see a difference in recovery between the two groups.
An approach widely used to address such bias in epidemiology and clinical
trials research is to control for covariates in the analysis. Ideally, if one can find
all confounders for the relationship of interest, differences found between treatment
and control groups by correctly adjusting for such covariates do represent causal
effects. However, as variables collected and our understanding of covariates for
relationships of interest in most studies are generally limited, it is inevitable that
some residual bias remains due to exclusions of some important confounding
variables in the analysis. Without being able to assess the effect of such hidden
bias, it would still be difficult to interpret findings from such conventional methods.
A well-defined concept of causation is needed to assess hidden bias.
Although observational studies are most prone to bias, or selection bias as in
statistical lingo, randomized controlled trials (RCTs) are not completely immune
to confounders. The primary sources of confounders for RCTs are treatment
noncompliance and missing follow-ups. Although modern longitudinal models can
effectively address the latter issue, the traditional intention-to-treat (ITT) approach
based on the treatment assigned rather than eventually received generally fails to
deal with the former problem, especially when treatment compliance occurs in
multilayered intervention studies, an emerging paradigm for designing research
studies that integrate multi-level social support to increase and sustain treatment
effects [34].
Another problem of great interest in both experimental and observational studies
is the causal mechanism of treatment effect. The ITT and other methods provide only an overall view of the treatment effect, since they fail to tell us how and
why such effects occur. One mechanism of particular interest is mediation, a process
that describes the pathway from the intervention to the outcome of interest. Causal
mediation analysis allows one to ascertain causation for changes of implicated
outcomes along such a pathway. Mediation analysis is not only of significant
theoretical interest to further our understanding of causal interplays among various
outcomes of interest, but also of great practical utility to help develop alternative
and potentially more efficient and cost-effective treatment modalities.
In this chapter, we give an overview of the concept of potential outcome and
popular methods developed under this paradigm.
Let $n_1$ ($n_0$) denote the number of subjects assigned to the intervention (control) group and let $n = n_0 + n_1$. If $y_{ik}$ denotes the potential outcome of the $i$th subject for the $k$th treatment for the $n$ subjects, we observe $y_{ik}$ if the subject is assigned to the $k$th treatment condition ($k = 0, 1$). If $y_{i_1 1}$ ($y_{j_0 0}$) represents the observed outcome for the $i_1$th ($j_0$th) subject among the $n_1$ ($n_0$) subjects in the intervention (control) group, we can express the observed potential outcomes for the $n$ subjects as $y_{i1} = y_{i_1 1}$ with $i = i_1$ for $1 \le i_1 \le n_1$ ($y_{i0} = y_{j_0 0}$ with $i = j_0 + n_1$ for $1 \le j_0 \le n_0$).

The sample means for the two groups and the difference between the sample means are given by
$$
\overline{y}_k = \frac{1}{n_k}\sum_{i_k=1}^{n_k} y_{i_k k}, \quad k = 0, 1; \qquad \widehat{\Delta} = \overline{y}_1 - \overline{y}_0. \tag{1.1}
$$
In an RCT,
$$
\begin{aligned}
E\bigl(\widehat{\Delta}\bigr) &= \frac{1}{n_1}\sum_{i_1=1}^{n_1} E\left(y_{i_1 1}\right) - \frac{1}{n_0}\sum_{j_0=1}^{n_0} E\left(y_{j_0 0}\right) \\
&= E\left(y_{i_1 1}\right) - E\left(y_{j_0 0}\right) \\
&= E\left(y_{i1} \mid z_i = 1\right) - E\left(y_{i0} \mid z_i = 0\right) \\
&= E\left(y_{i1}\right) - E\left(y_{i0}\right) \\
&= \Delta.
\end{aligned} \tag{1.3}
$$
Thus, the difference between the sample means does estimate the causal treatment effect $\Delta$ in the RCT.

The above shows that standard statistical approaches such as the two-sample t-test and regression models can be applied to RCTs to infer causal treatment effects. Randomization is key to the transition from the incomputable individual-level difference, $y_{i1} - y_{i0}$, to the computable sample means in (1.1) in estimating the average treatment effect. For non-randomized trials such as most epidemiological studies, exposure to treatments or agents may depend on the values of the outcome variable, in which case the difference between the sample means in (1.1) generally does not estimate the average causal effect $\Delta = E(y_{i1} - y_{i0})$. Thus, associations found in observational studies generally do not imply causation.
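To make the contrast concrete, here is a minimal simulation sketch (not from the chapter; all variable names are illustrative) showing that the difference in sample means recovers the average causal effect $\Delta$ under randomization but not under outcome-dependent (confounded) assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 100_000, 2.0

x = rng.normal(size=n)            # a prognostic covariate
y0 = x + rng.normal(size=n)       # potential outcome under control
y1 = y0 + delta                   # potential outcome under treatment

# Randomized assignment: z is independent of (y1, y0).
z_rct = rng.binomial(1, 0.5, size=n)
y_rct = np.where(z_rct == 1, y1, y0)
est_rct = y_rct[z_rct == 1].mean() - y_rct[z_rct == 0].mean()

# Confounded assignment: subjects with larger x are more likely to be treated.
p_treat = 1.0 / (1.0 + np.exp(-2.0 * x))
z_obs = rng.binomial(1, p_treat)
y_obs = np.where(z_obs == 1, y1, y0)
est_obs = y_obs[z_obs == 1].mean() - y_obs[z_obs == 0].mean()

print(f"true Delta = {delta}; RCT estimate = {est_rct:.3f}; "
      f"confounded estimate = {est_obs:.3f}")
```

The second estimate is biased because the treated group has systematically larger $x$, exactly the kind of selection bias discussed below.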
[Path diagram of the mediation model: the treatment $z_i$ affects the mediator $m_i$ (path $\beta_{zm}$) and the outcome $y_i$ directly (path $\beta_{zy}$), and the mediator affects the outcome (path $\beta_{my}$), with error term $e_{yi}$ on the outcome.]
though those who continue with the intervention do benefit. Thus, we must address
such downward bias in ITT estimates, if we want to estimate treatment effects for
those who are either not affected by or able to tolerate the side effects.
Under the SEM framework, the parameter $\beta_{zy}$ is interpreted as the direct effect of treatment on the outcome $y_i$, while $\beta_{zm}\beta_{my}$ is interpreted as the indirect, or mediated, effect of the treatment $z_i$ on the outcome $y_i$ through $m_i$. Thus, the total effect of treatment is viewed as the combination of the direct and indirect effects, $\beta_{zy} + \beta_{zm}\beta_{my}$.
Selection bias is the most important issue for observational studies. In the presence of such bias, not only models for cross-sectional data such as linear regression, but even models for longitudinal data such as mixed-effects models and structural equation models are ill-suited for causal inference. Over the last 30 years, many methods have been proposed and a large body of literature has accumulated to address selection bias in both observational and RCT studies. The prevailing approach is to view the unobserved components of the potential outcomes as missing data and employ missing data methodology to address the associated technical problems within the context of causal inference. Thus, in principle, the goal of causal inference is to model or impute the missing values, i.e., the unobserved potential outcomes, to estimate the average causal effect $\Delta = E(y_{i1} - y_{i0})$, which is not directly estimable using standard statistical methods such as the sample mean, due to the counterfactual nature of the potential outcomes $(y_{i1}, y_{i0})$.
In practice, these issues are further compounded by missing data, especially
those that show consistent patterns such as monotone patterns resulting from study
dropouts in longitudinal studies [31]. Various approaches have been developed to
address the two types of confounders. These models are largely classified into
one of the two broad categories: (1) parametric models and (2) semi-parametric
(distribution-free) models. Since the unobserved potential outcome can be treated
as missing data, the parametric and non-parametric frameworks both seek to extend
standard statistical models for causal inference by treating the latent potential
outcome as a missing data problem and applying missing data methods.
If treatment assignment is not random, it may depend on the observed, or missing
potential outcome, or both. If the assignment mechanism is completely determined
by a set of covariates such as demographic information, medical and mental health
history, and indicators of behavioral problems, denoted collectively by a vector of
covariates, xi , then the unobserved potential outcome is independent of treatment
assignment once conditioned upon $x_i$. This assumption, also known as the missing at random (MAR) mechanism in the lingo of missing data analysis [28], allows one to estimate the average causal effect $\Delta = E(y_{i1} - y_{i0})$. Thus, by identifying
the unobserved potential outcome as a missing data problem, methods for missing
data can be applied to develop inference procedures within the current context. For
notational brevity and without the loss of generality, we continue to assume the
relatively simple setting of two treatment conditions in what follows unless stated
otherwise.
same gender, same (or similar) age, and smoking patterns. As the dimension of xi
increases, however, matching subjects with respect to a large number of covariates
can be quite difficult.
A popular approach for matching subjects is Propensity Score (PS) matching. This approach is premised upon the fact that treatment assignment dictated by $x_i$ is characterized by the probability of receiving treatment given the covariates $x_i$ [24, 25], i.e., the propensity score
$$
\pi_i = \Pr(z_i = 1 \mid x_i). \tag{1.5}
$$
If $x_i$ is a vector of covariates such that $(y_{i1}, y_{i0}) \perp z_i \mid x_i$, then we can show that [25]:
$$
\Pr(x_i \mid z_i = 1, \pi_i) = \Pr(x_i \mid z_i = 0, \pi_i).
$$
The above shows that, conditional on $\pi_i$, $x_i$ has the same distribution between the treated ($z_i = 1$) and control ($z_i = 0$) groups. Thus, we can use the one-dimensional propensity score in (1.5), rather than the multi-dimensional and mixed-type $x_i$, to match subjects.

For example, we may model $\pi_i$ using logistic regression. With an estimated $\widehat{\pi}_i$, we can partition the sample by grouping together subjects with similar estimated propensity scores to create strata and compare group differences within each stratum using standard methods. We may derive causal effects for the entire sample by weighting and averaging such differences over all strata.
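A minimal sketch of this subclassification strategy, assuming a generic dataset with outcome `y`, binary treatment `z`, and covariate matrix `x` as NumPy arrays (the helper name and the quintile choice are illustrative, not from the chapter):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def subclassification_effect(y, z, x, n_strata=5):
    """Estimate the average treatment effect by propensity score subclassification."""
    # Estimate pi_i by logistic regression of z on x.
    ps = LogisticRegression(max_iter=1000).fit(x, z).predict_proba(x)[:, 1]
    # Form strata from the quantiles of the estimated scores.
    cuts = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(cuts, ps, side="right") - 1, 0, n_strata - 1)
    effect = 0.0
    for s in range(n_strata):
        idx = strata == s
        if idx.sum() == 0 or len(np.unique(z[idx])) < 2:
            continue                                 # stratum lacks one of the groups
        diff = y[idx][z[idx] == 1].mean() - y[idx][z[idx] == 0].mean()
        effect += diff * idx.mean()                  # weight by stratum proportion
    return effect
```

With `n_strata=5`, this corresponds to the five-subclass rule of thumb discussed next.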
Although convenient to use and applicable to both parametric and semi-parametric models (e.g., generalized estimating equations), the PS approach generally lacks desirable properties of formal statistical models, such as consistency and asymptotic normality of estimates. Another major problem is that in most studies $x_i$ is only approximately balanced between the treatment groups after matching or subclassification using the estimated propensity score, especially when the observed covariates $x_i$ are not homogeneous in the treatment and control groups and/or one or more components of $x_i$ are continuous. Thus, this approach does not completely remove selection bias [10], although Rosenbaum and Rubin [26] showed through simulations that creating five propensity score subclasses removes at least 90% of the bias in the estimated treatment effect. In addition, since the choice of cutpoints for creating strata from the propensity score is subjective in subclassification methods, different analysts may partition the sample differently, such as into 5–10 strata for a moderate and 10–20 for a large sample size, yielding different estimates and even different conclusions, especially when the treatment difference straddles borderline
significance. An alternative is to simply use the estimated propensity score as a
covariate in standard regression analysis. This implementation is also popular,
since it reduces the number of covariates to a single variable, which is especially
desirable in studies with relatively small sample sizes. The approach is again ad-hoc
and, like the parametric approach discussed above, its validity depends on assumed
parametric forms of the covariate effects (typically linear).
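As a rough sketch of this covariate-adjustment variant (assuming scikit-learn and statsmodels are available; the function name is illustrative), one can regress the outcome on the treatment indicator and the estimated propensity score:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def ps_covariate_adjustment(y, z, x):
    """Regress the outcome on treatment and the estimated propensity score."""
    ps = LogisticRegression(max_iter=1000).fit(x, z).predict_proba(x)[:, 1]
    design = sm.add_constant(np.column_stack([z, ps]))
    fit = sm.OLS(y, design).fit()
    return fit.params[1]   # coefficient on the treatment indicator z
```

The validity of this adjusted coefficient rests on the assumed (here linear) form of the propensity score's effect on the outcome, as noted above.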
A popular alternative to the PS approach is the marginal structural model (MSM; [8, 21]). Like PS, the MSM uses the probability of treatment assignment to address selection bias. But, unlike PS, it uses the propensity score as a weight, rather than as a stratification variable, akin to weighting selected households sampled from a targeted region of interest in survey research [10]. By doing so, the MSM not only completely removes selection bias, but also yields estimates with nice asymptotic properties. Another nice feature of the MSM is its readiness to address missing data, a common issue in longitudinal studies [8].
Under the MSM, we model the potential outcome as
$$
E(y_{ik}) = \mu_k = \beta_0 + \beta_1 k, \quad 1 \le i \le n, \; k = 0, 1. \tag{1.6}
$$
Since only one of the potential outcomes $(y_{i1}, y_{i0})$ is observed, the above model cannot be fit directly using standard statistical methods. If treatment assignment is random, i.e., $y_{ik} \perp z_i$, then $E(y_{ik}) = E(y_{i_k k})$ and thus
$$
E(y_{i_k k}) = \beta_0 + \beta_1 k, \quad 1 \le i_k \le n_k, \; k = 0, 1. \tag{1.7}
$$
Thus, for the RCT we can estimate the parameters $\beta = (\beta_0, \beta_1)^{\top}$, including the average causal effect $\Delta = \beta_1$, for the model for the potential outcome in (1.6) by substituting the observed outcomes from the two treatment groups in (1.7). The above is the same argument as in Sect. 2.1, but from the perspective of a regression model.

For observational studies, $z_i$ is generally not independent of $y_{ik}$. If $x_i$ is a vector of covariates such that $(y_{i1}, y_{i0}) \perp z_i \mid x_i$, then we can still estimate $\beta$ by modeling the observed outcomes $y_{i_k k}$ as in (1.7), although we cannot use standard methods to estimate $\beta$ and must construct new estimates. To this end, consider the following weighted estimating equations:
$$
\sum_{i=1}^{n}
\begin{pmatrix}
\dfrac{z_i}{\pi_i}\,(y_{i1} - \mu_1) \\[1.5ex]
\dfrac{1 - z_i}{1 - \pi_i}\,(y_{i0} - \mu_0)
\end{pmatrix}
= 0, \tag{1.8}
$$
where $\pi_i$ is defined in Sect. 3.1.2. Although the above involves potential outcomes, the set of equations is well defined. If the $i$th subject is assigned to the first (second) treatment condition, then $i = i_1$ ($i = j_0 + n_1$) and $y_{i1} = y_{i_1 1}$ ($y_{i0} = y_{j_0 0}$) for $1 \le i_1 \le n_1$ ($1 \le j_0 \le n_0$). It follows that
$$
\begin{pmatrix}
\dfrac{z_i}{\pi_i}\,(y_{i1} - \mu_1) \\[1.5ex]
\dfrac{1 - z_i}{1 - \pi_i}\,(y_{i0} - \mu_0)
\end{pmatrix}
=
\begin{cases}
\begin{pmatrix}
\dfrac{1}{\pi_i}\,(y_{i_1 1} - \mu_1) \\[1ex]
0
\end{pmatrix} & \text{if } z_i = 1, \\[3ex]
\begin{pmatrix}
0 \\[1ex]
\dfrac{1}{1 - \pi_i}\,(y_{j_0 0} - \mu_0)
\end{pmatrix} & \text{if } z_i = 0.
\end{cases}
$$
Thus, the estimating equations in (1.8) are readily computed based on the observed data. Also, the set of estimating equations is unbiased, since
$$
\begin{aligned}
E\left(\frac{z_i}{\pi_i}\, y_{ik}\right)
&= E\left[E\left(\frac{z_i}{\pi_i}\, y_{ik} \;\middle|\; x_i\right)\right] \\
&= E\left[\frac{1}{\pi_i}\, E\left(z_i\, y_{ik} \mid x_i\right)\right] \\
&= E\left[\frac{1}{\pi_i}\, E\left(z_i \mid x_i\right) E\left(y_{ik} \mid x_i\right)\right] \\
&= E\left[E\left(y_{ik} \mid x_i\right)\right] \\
&= \mu_k.
\end{aligned}
$$
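A small sketch of solving the weighted estimating equations (1.8) with an estimated propensity score (an assumed helper, not part of the chapter; it presumes `y`, `z`, `x` are NumPy arrays with `z` coded 0/1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def msm_ipw_effect(y, z, x):
    """Solve the weighted estimating equations (1.8) for mu_1, mu_0 and return Delta."""
    pi = LogisticRegression(max_iter=1000).fit(x, z).predict_proba(x)[:, 1]
    w1, w0 = z / pi, (1 - z) / (1 - pi)
    mu1 = np.sum(w1 * y) / np.sum(w1)   # solves sum_i (z_i/pi_i)(y_i - mu_1) = 0
    mu0 = np.sum(w0 * y) / np.sum(w0)   # solves sum_i ((1-z_i)/(1-pi_i))(y_i - mu_0) = 0
    return mu1 - mu0                    # Delta = beta_1
```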
One way to address treatment noncompliance is to partition study subjects into dif-
ferent types based on their impacts on causal treatment effects and then characterize
the causal effect for each of the types of treatment noncompliance [1, 13]. One
approach that has been extensively discussed in the literature is a partition of the
study sample into four types in terms of their compliance behavior:
1. Complier (CP): subjects compliant with assigned treatment (control or interven-
tion);
2. Never-taker (NT): subjects who would take the control treatment regardless of
what they are assigned;
3. Always-taker (AT): subjects who would take the intervention regardless of what
they are assigned;
4. Defiers (DF): subjects who would take the opposite treatment to their assignment.
For the compliers, the treatment effect of interest is
$$
\Delta_{CP} = E(y_{i1} - y_{i0} \mid C_i = 1),
$$
where $C_i = 1$ indicates that the $i$th subject is a complier. The above is called the Complier Average Causal Effect (CACE). In contrast, the ITT effect is given by $\Delta_{\mathrm{ITT}} = E(y_{i1} - y_{i0})$.

If $C_i$ is observed for each subject, then we have
$$
\Delta_{CP} = E(y_{i_1 1} \mid C_{i_1} = 1) - E(y_{i_0 0} \mid C_{i_0} = 1),
$$
where $C_{i_k}$ denotes the complier status for the $i_k$th subject in the $k$th treatment group ($k = 0, 1$). We can then estimate $E(y_{i_k k} \mid C_{i_k} = 1)$ based on the complier subsample within the $k$th treatment condition using standard methods such as the sample mean.
In practice, we can only observe the compliance status $D_{i_k}$ for the assigned treatment condition. Although similar, $D_{i_k}$ is generally different from $C_{i_k}$. For example, $D_{i_1} = 1$ includes both the CP and AT subsamples within the treated condition, while $D_{i_0} = 1$ includes the CP and NT subsamples within the control condition. By conditioning on $D_{i_k}$, we can estimate $E(y_{i_k k} \mid D_{i_k} = 1)$ for each treatment condition.
In other words, we can estimate the CACE by modifying the ITT estimate:
$$
\widehat{\Delta}_{CP} = \frac{\widehat{\Delta}_{\mathrm{ITT}}}{\widehat{p}_1 - \widehat{p}_0}
= \frac{\overline{y}_1 - \overline{y}_0}{\widehat{p}_1 - \widehat{p}_0}, \qquad
\widehat{p}_k = \frac{1}{n_k}\sum_{i_k=1}^{n_k} D_{i_k}, \quad k = 0, 1.
$$
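A minimal sketch of this modified ITT estimator (an illustrative helper; `d` is the observed compliance indicator and `z` the 0/1 assignment):

```python
import numpy as np

def cace_estimate(y, z, d):
    """ITT effect on the outcome divided by the compliance-rate difference."""
    itt = y[z == 1].mean() - y[z == 0].mean()   # Delta_ITT = ybar_1 - ybar_0
    p1 = d[z == 1].mean()                       # phat_1: observed compliance, treated arm
    p0 = d[z == 0].mean()                       # phat_0: observed compliance, control arm
    return itt / (p1 - p0)
```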
$$
p_k = E(D_{i_k}) = \Pr(D_i = 1 \mid z_i = k), \quad k = 0, 1.
$$
Let
$$
s_{ik} =
\begin{cases}
1 & \text{if compliant with the } k\text{th treatment condition,} \\
0 & \text{if noncompliant.}
\end{cases}
$$
For each subject, the potential outcome of compliance status $s_i = (s_{i1}, s_{i0})^{\top}$ has four patterns, which constitute the basic principal stratification:
$$
P_0 = \{(1, 1),\, (0, 0),\, (1, 0),\, (0, 1)\}.
$$
The four distinct patterns represent the CP $(1, 1)$, the DF $(0, 0)$, the AT $(1, 0)$, and the NT $(0, 1)$ subsamples under the IV classification of treatment noncompliance. By combining some of the patterns in the basic principal stratification $P_0$, we can
form coarser principal strata $s_l$ ($1 \le l \le L$) and define the stratum-specific treatment effects
$$
\Delta_l = E(y_{i1} - y_{i0} \mid s_l), \quad 1 \le l \le L.
$$
The goal is to estimate $\Delta_l$ for each $l$th stratum. We may also create weighted averages to obtain overall treatment effects of interest. Inference about $\Delta = \{\Delta_l;\ 1 \le l \le L\}$ can be based on maximum likelihood or Bayesian methods [4].
In the special case of the IV categorization, the PST provides more information about the relationship between noncompliance and treatment effects than the IV approach. In addition to the CP group, the PST also provides treatment effects for the AT, NT, or even the DF group.
where $g(s_{i1}, \beta)$ is some continuous function of $s_{i1}$ and $\beta$. Since $(y_{i1}, y_{i0}) \perp z_i$ for randomized studies, it follows that the treatment effect given compliance, $\Delta_i(s_{i1})$, can be expressed in terms of the observed outcomes $y_{i_1 1}$, where $i_1$ again indexes the subjects assigned to the treatment group and $y_{i_1 1}$ is the observed outcome of the subject in the treatment group. The model in (1.12) is the Structural Mean Model (SMM) [20].

To estimate $\Delta_i(s_{i1})$, we must evaluate $E(y_{i0} \mid s_{i1}, z_i = 0)$ so that it can be estimated with observed data. If $s_{i1}$ is independent of $y_{i0}$, then $E(y_{i0} \mid s_{i1}, z_i = 0) = E(y_{i0} \mid z_i = 0)$. This compliance non-selective assumption is reasonable if, for example, $s_{i1}$ does not correlate with disease severity. In this case, (1.13) reduces to an expression that involves only observed data. Given a specific form of $g(s_{i1}, \beta)$, the SMM in (1.16) allows one to model and estimate treatment effects for continuous dose variables.

For example, if $g(s_{i1}, \beta) = s_{i1}\beta_1$, the SMM posits a treatment effect proportional to the dose, $E(y_{i1} - y_{i0} \mid s_{i1}) = s_{i1}\beta_1$, or equivalently, a model for the observed outcome involving the product $s_{i1} z_i$. Note that although $s_{i1}$ is missing for the control group, the latter is still well defined, since $s_{i1} z_{i_0} \equiv 0$ for all $1 \le i_0 \le n_0$.
Under this compliance explainable condition [5, 32, 34], the SMM can be expressed in a form that involves only the observed data.

In medication vs. placebo studies, if treatment compliance is also tracked for the placebo group, then it is reasonable to assume that the variable of placebo use, $d_{i0}$, explains treatment compliance if the subject is assigned to the medication group. This is because under randomization subjects cannot distinguish between medication and placebo. Thus, we may let $x_{i0} = d_{i0}$ in the model above. As before, the resulting model is still well defined, even if $s_{i1}$ is missing for the control group.
In psychosocial intervention studies, the control condition offers either nothing
or sessions that provide information unrelated to the intervention, such as attention
or information control. In the latter case, compliance (with respect to the attention
or information control) may also be tracked. However, such a dose variable, $d_{i0}$, generally does not explain treatment compliance if the subject is assigned to the
intervention group, since the information disseminated through the control condition
may have nothing to do with the information provided by the intervention condition.
For example, in an HIV prevention intervention study for teenage girls at high risk for HIV infection, the intervention condition contains information on HIV infection, condom use, and safe sex, while the control condition contains nutritional and dietary information. Thus, subjects with high compliance in the intervention may
be quite different from their counterparts in the control group. This may happen if a
majority of girls with high attendance in the intervention group are sexually active,
while those with high attendance in the control group are more interested in the
information on weight loss and healthy diets. Thus xi in this study should contain
variables that help explain behaviors of compliance for the intervention such as risks
for unsafe sex, alcohol and drug use, and HIV knowledge.
In recent years, there has been heightened activity in developing models for causal mediation effects under the counterfactual outcome framework (e.g., [11, 12, 19, 22, 23, 33]). We give a brief review of relevant methods, focusing on the identifiability assumptions and the definitions of the indirect, or mediated, effect.
Let $m_{ik}$ denote the potential outcome of a mediator, $m_i$, for the $i$th subject corresponding to the $k$th treatment. The potential outcome of the primary variable of interest is more complex, to allow one to tease out the direct and mediation causal effects of the intervention or exposure on this variable (see the definitions of direct and mediation causal effects below). Let $y_i(k, m_{ik'})$ denote the potential outcome of the variable of interest $y_i$ corresponding to the $k$th treatment condition and mediator $m_{ik'}$ ($k, k' = 0, 1$). Note that in practice we can only observe $m_{ik}$ and $y_i(k, m_{ik})$ ($m_{ik'}$ and $y_i(k', m_{ik'})$) if the $i$th subject is assigned to the $k$th ($k'$th) treatment ($k, k' = 0, 1$). But, in order to tease out the direct and mediation effects, we must consider $y_i(k, m_{ik'})$, which is not observed if $k \ne k'$ [7, 19].

The direct effect of treatment is the effect of treatment obtained while holding the mediator at its potential value under a fixed treatment assignment, i.e.,
$$
\zeta_i(k) = y_i(1, m_{ik}) - y_i(0, m_{ik}), \quad k = 0, 1.
$$
This quantity $\zeta_i(k)$ is also called the natural direct effect (e.g., [19]) or the pure (total) direct effect (e.g., [22]) corresponding to $k = 0$ ($1$). In addition, there is also the so-called controlled direct effect, $y_i(1, m) - y_i(0, m)$, which may be viewed as the treatment effect that would have been realized had the mediator $m_{ik}$ been controlled at level $m$ uniformly in the population [19, 22, 23]. Note that $\zeta_i(1)$ is generally not the same as $\zeta_i(0)$, and the difference represents interaction between treatment assignment and the mediator.
The causal mediation, or indirect effect, or natural indirect effect, is the difference between the two potential outcomes, $y_i(k, m_{i1})$ and $y_i(k, m_{i0})$, of the variable of interest resulting from the two potential outcomes of the mediator, $m_{i1}$ and $m_{i0}$, corresponding to the two treatment conditions $k = 1$ and $k = 0$, i.e.,
$$
\delta_i(k) = y_i(k, m_{i1}) - y_i(k, m_{i0}), \quad k = 0, 1.
$$
If the treatment has no effect on the mediator, that is, $m_{i1} - m_{i0} = 0$, then the causal mediation effect is zero. The quantity $\delta_i(0)$ ($\delta_i(1)$) is also referred to as the pure indirect effect (total indirect effect) [22]. As in the case of the direct effect, $\delta_i(1)$ is generally different from $\delta_i(0)$.

The total effect of treatment is the sum of the direct and mediation effects:
$$
\tau_i = y_i(1, m_{i1}) - y_i(0, m_{i0}) = \zeta_i(k) + \delta_i(1 - k), \quad k = 0, 1.
$$
As noted in Sect. 2.4, the independence between the error terms in the SEM (1.4) plays a critical role in the causal interpretation of the mediation model. This pseudo-isolation condition is critical for the identifiability of the parameters of the SEM in (1.4). The issue of identifiability has also been discussed under the potential-outcome-based inference paradigm [11, 22]. For example, Imai et al. [11] have shown that if $x_i$ is a vector of pre-treatment covariates for the $i$th subject, then the direct and indirect effects are identified under the conditions
$$
\{y_i(k', m),\, m_{ik}\} \perp z_i \mid x_i = x, \qquad
y_i(k', m) \perp m_{ik} \mid z_i = k,\, x_i = x. \tag{1.19}
$$
The above is called sequential ignorability (SI) because the first condition indicates that $z_i$ is ignorable given the pre-treatment covariates $x_i$, while the second states that the mediator $m_{ik}$ is ignorable given $x_i$ and the observed treatment assignment $z_i$. Although the first condition is satisfied by all randomized trials, the second is not. In fact, the second condition of the SI cannot be directly tested from observed data [18]. Thus, sensitivity analysis is usually carried out to examine the robustness of findings under violations of the second ignorability assumption [11].
Other assumptions have also been proposed. For example, Robins [22] proposed a condition for the identification of the controlled direct effect under which $y_i(1, m) - y_i(0, m) = B_i$, a quantity that does not depend on $m$.

Under the SI in (1.19), it can be shown that the average causal mediation effect (ACME) can be nonparametrically identified for $k = 0, 1$ [11]. Since the conditions in the SI imply $y_i(k', m) \perp z_i \mid m_{ik} = m', x_i = x$, it follows that for any $k$ and $k'$:
$$
E\bigl(y_i(k, m_{ik'}) \mid x_i\bigr) = \int E\bigl(y_i \mid z_i = k,\, m,\, x_i\bigr)\, dF_{m_i \mid z_i = k',\, x_i}(m), \tag{1.21}
$$
where $F_T(\cdot)$ ($F_{T \mid W}(\cdot)$) denotes the (conditional) cumulative distribution function (CDF) of a random variable $T$ ($T$ given $W$). We may further integrate out $x_i$ to obtain the unconditional mean:
$$
E\bigl[y_i(k, m_{ik'})\bigr] = \int E\bigl(y_i(k, m_{ik'}) \mid x_i\bigr)\, dF_{x_i}(x).
$$
By using (1.21), we can derive the direct, indirect, and total effects for the Linear SEM (LSEM) in (1.4) as well as for Generalized Linear Structural Equation Models (GLSEM), where $m_i$ or $y_i$ or both may be non-continuous variables. For example, by expressing the LSEM in (1.4) in terms of the potential outcomes, we obtain expressions for the indirect (mediation), direct, and total causal effects; to indicate the dependence of the potential outcome of the mediator $m_i$ on treatment assignment, we use $m_i(z_i)$, rather than $m_{ik}$, in the LSEM. These effects are again consistent with those derived from the classic LSEM approach [15].
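As an illustration of the classic LSEM product-of-coefficients decomposition referenced here (a sketch under the SI assumptions, omitting pretreatment covariates; names are illustrative, not from the chapter):

```python
import numpy as np
import statsmodels.api as sm

def lsem_mediation(y, m, z):
    """Product-of-coefficients mediation decomposition from two linear regressions."""
    med_fit = sm.OLS(m, sm.add_constant(z)).fit()                         # m on z
    out_fit = sm.OLS(y, sm.add_constant(np.column_stack([z, m]))).fit()   # y on z and m
    beta_zm = med_fit.params[1]     # effect of z on the mediator
    beta_zy = out_fit.params[1]     # direct effect of z on y
    beta_my = out_fit.params[2]     # effect of the mediator on y
    indirect = beta_zm * beta_my    # mediated (indirect) effect
    return beta_zy, indirect, beta_zy + indirect   # direct, indirect, total
```

This reproduces the decomposition $\beta_{zy} + \beta_{zm}\beta_{my}$ of the total effect described in Sect. 2.4.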
Note that others have considered mediation analyses without using the SEM
paradigm. For example, Rubin [29, 30] and Jo et al. [14] considered methods to
estimate the causal effect of treatment in the face of an intermediate confounding
variable (mediator) based on the framework of Principal Stratification. These
methods are limited in their ability to accommodate continuous mediating and
outcome variables and are less popular than their SEM-based counterparts.
4 Discussion
References
1. Angrist, J., Imbens, G.W., Rubin, D.B.: Identification of causal effects using instrumental
variables (with discussion). J. Am. Stat. Assoc. 91, 444–472 (1996)
2. Baron, R.M., Kenny, D.A.: The moderator–mediator variable distinction in social psycholog-
ical research: conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol. 51,
1173–1182 (1986)
3. Bollen, K.: Total, direct and indirect effects in structural equation models. In: Clogg, C. (ed.)
Sociological Methodology, pp. 37–69. American Sociological Association, Washington, D.C
(1987)
4. Frangakis, C.E., Rubin, D.B.: Principal stratification in causal inference. Biometrics 58, 21–29
(2002)
5. Goetghebeur, E., Lapp, K.: The effect of treatment compliance in a placebo-controlled trial:
regression with unpaired data. J. R. Stat. Soc. Ser. C 46, 351–364 (1997)
6. Gunzler, D., Tang, W., Lu, N., Wu, P., Tu, X.M.: A class of distribution-free models for
longitudinal mediation analysis. Psychometrika 79(4), 543–568 (2014)
7. Hafeman, D.M., Schwartz, S.: Opening the black box: a motivation for the assessment of
mediation. Int. J. Epidemiol. 38, 838–845 (2009)
8. Hernan, M.A., Brumback, B., Robins, J.M.: Estimating the causal effect of zidovudine on CD4
count with a marginal structural model for repeated measures. Stat. Med. 21, 1689–1709 (2002)
9. Holland, P.: Statistics and causal inference. J. Am. Stat. Assoc. 81, 945–970 (1986)
10. Horvitz, D.G., Thompson, D.J.: A Generalization of sampling without replacement from a
finite universe. J. Am. Stat. Assoc. 47, 663–685 (1952)
11. Imai, K., Keele, L., Yamamoto, T.: Identification, inference, and sensitivity analysis for causal
mediation effects. Stat. Sci. 25, 51–71 (2010)
12. Imai, K., Keele, L., Tingley, D.: A general approach to causal mediation analysis. Psychol. Methods 15(4), 309–344 (2010)
13. Imbens, G.W., Rubin, D.B.: Bayesian inference for causal effects in randomized experiments
with noncompliance. Ann. Stat. 25, 305–327 (1997)
14. Jo, B., Stuart, E.A., MacKinnon, D.P., Vinokur, A.D.: The use of propensity scores in mediation
analysis. Multivar. Behav. Res. 46(3), 425–452 (2011)
15. Judd, C., Kenny, D.: Process analysis: estimating mediation in treatment evaluations. Eval. Rev.
5, 602–619 (1981)
16. Kowalski, J., Tu, X.M.: Modern Applied U Statistics. Wiley, New York (2007)
17. MacKinnon, D., Dwyer, J.: Estimating mediating effects in prevention studies. Eval. Rev. 17,
144–158 (1993)
18. Manski, C.F.: Identification for Prediction and Decision. Harvard University Press, Cambridge,
MA (2007)
19. Pearl, J.: Direct and indirect effects. In: Breese, J., Koller, D. (eds.) Proceedings of the 17th
Conference on Uncertainty in Artificial Intelligence, pp. 411–420. Morgan Kaufmann, San
Francisco, CA (2001)
20. Robins, J.M.: Correcting for non-compliance in randomized trials using structural nested mean
models. Commun. Stat. 23, 2379–2412 (1994)
21. Robins, J.M.: Marginal structural models versus structural nested models as tools for causal
inference. In: Halloran, M.E., Berry, D. (eds.) Statistical Models in Epidemiology: The
Environment and Clinical Trials, pp. 95–134. Springer, New York (1999)
22. Robins, J.M.: Semantics of causal DAG models and the identification of direct and indirect
effects. In: Green, P.J., Hjort, N.L., Richardson, S. (eds.) Highly Structured Stochastic Systems,
pp. 70–81. Oxford University Press, New York, NY (2003)
23. Robins, J.M., Greenland, S.: Identifiability and exchangeability for direct and indirect effects.
Epidemiology 3(2), 143–155 (1992)
24. Rosenbaum, P.R.: The consequences of adjustment for a concomitant variable that has been
affected by the treatment. J. R. Stat. Soc. Ser. A 147, 656–666 (1984)
25. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies
for causal effects. Biometrika 70, 41–55 (1983)
26. Rosenbaum, P.R., Rubin, D.B.: Constructing a control group using multivariate matched
sampling methods that incorporate the propensity score. Am. Stat. 39, 33–38 (1985)
27. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies.
J. Educ. Psychol. 66, 688–701 (1974)
28. Rubin, D.B.: Inference and missing data (with discussion). Biometrika 63, 581–592 (1976)
29. Rubin, D.B.: Direct and indirect causal effects via potential outcomes. Scandinavian J. Stat.
31, 161–170 (2004)
30. Rubin, D.B.: Causal inference using potential outcomes: design, modeling, decisions. J. Am.
Stat. Assoc. 100, 322–331 (2005)
31. Tang, W., He, H., Tu, X.M.: Applied Categorical Data Analysis. Chapman & Hall/CRC, Boca
Raton, FL (2012)
32. Vansteelandt, S., Goetghebeur, E.: Causal inference with generalized structural mean models.
J. R. Stat. Soc. Ser. B 65, 817–835 (2003)
33. Wu, P.: A new class of structural functional response models for causal inference and mediation
analysis. Ph.D. Thesis, Department of Biostatistics and Computational Biology, University of
Rochester, Rochester, New York (2013)
34. Wu, P., Gunzler, D., Lu, N., Chen, T., Wyman, P., Tu, X.M.: Causal inference for community
based multi-layered intervention study. Stat. Med. 33(22), 3905–3918 (2014)
Part II
Propensity Score Method for Causal Inference
Chapter 2
Overview of Propensity Score Methods
Abstract Propensity score methods are widely used to adjust for confounding effects in observational studies when comparing treatment effects. The propensity score is defined as the probability of treatment assignment conditional on observed baseline characteristics, and it provides a balancing score for the treatment conditions: conditional on the propensity score, the treatment groups are comparable in terms of the baseline covariates. In this chapter, we first provide an overview of the propensity score and the underlying assumptions for using it; we then discuss four methods based on the propensity score: matching on the propensity score, stratification on the propensity score, inverse probability of treatment weighting using the propensity score, and covariate adjustment using the propensity score, as well as the differences among the four methods.
1 Introduction
H. He () • J. He
Department of Epidemiology, School of Public Health & Tropical Medicine, Tulane University,
New Orleans, LA 70112, USA
e-mail: hhe2@tulane.edu; jhe@tulane.edu
J. Hu
College of Basic Science and Information Engineering, Yunnan Agricultural University,
Yunnan, 650201, China
e-mail: hududu@ynau.edu.cn
This shows that missing values in the counterfactual outcomes $y_{i,j}$ are missing completely at random (MCAR) and can thus be completely ignored. It follows that $E(y_{i,j})$ can be estimated based on the observed component of each subject's counterfactual outcomes corresponding to the assigned treatment. It is for this reason that simple randomized controlled trials (RCTs) are generally considered the gold standard for drawing causal conclusions about treatment effects.

However, simple randomization may not always be feasible. In clinical trials, it may be preferable to adopt other randomization procedures for cost, ethical, and scientific reasons. For example, in some studies we often need to oversample underrepresented subjects to achieve the required accuracy of estimation. In such cases, it is important to deal with treatment selection bias, and the propensity score is a very powerful tool for this task.
To address the selection bias arising in the above more complex randomization schemes or in non-randomized observational studies, assume that the treatment assignments are based on $x_i$, a vector of covariates that is always observed. In such cases, the missing mechanism for the unobserved outcome no longer follows MCAR, but rather follows missing at random (MAR), as defined by
$$
(y_{i,1}, y_{i,2}) \perp z_i \mid x_i. \tag{2.2}
$$
So, within each pattern of the covariate $x_i$, the treatment effect can be estimated simply from the subjects receiving the two treatments.

Within the context of causal inference, the MAR condition in (2.2) is known as the strongly ignorable treatment assignment assumption [38]. Although the treatment assignments for the whole study do not follow simple randomization, the ones within each of the strata defined by the distinct values of $x_i$ do. Thus, if there is a sufficient number of subjects within each of the strata defined by the unique values of $x_i$, then $E(y_{i,1} \mid x_i)$ and $E(y_{i,2} \mid x_i)$ can be estimated by the corresponding sample means within each stratum. The overall treatment effect can then be estimated by a weighted average of these means, with the weights assigned based on the distribution of $x_i$. This approach may not produce reliable estimates, or may simply fail, if some groups have few or even no subjects for one or both treatment conditions. This can occur if the overall sample size is relatively small and/or the number of distinct values of $x_i$ is large, such as when $x_i$ contains continuous components and/or $x_i$ has a high dimension. However, the propensity score can help facilitate the dimension reduction.
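A small sketch of this covariate-stratification estimator for a low-dimensional discrete $x_i$ (an illustrative helper, with treatment coded 1 and control 0 for simplicity):

```python
import numpy as np
import pandas as pd

def stratified_effect_by_x(y, z, x_labels):
    """Average within-stratum group differences, weighted by the distribution of x."""
    df = pd.DataFrame({"y": y, "z": z, "x": x_labels})
    effect = 0.0
    for _, g in df.groupby("x"):
        if g["z"].nunique() < 2:                   # stratum missing one treatment arm
            continue
        diff = g.loc[g["z"] == 1, "y"].mean() - g.loc[g["z"] == 0, "y"].mean()
        effect += diff * len(g) / len(df)          # weight by the proportion of x in this stratum
    return effect
```

With continuous or high-dimensional $x_i$, many strata would be empty or contain only one arm, which is exactly the problem the propensity score is meant to solve.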
The propensity score (PS) is defined as the probability of receiving the treatment given the observed covariates,
$$
e_i = e(x_i) = \Pr(z_i = 1 \mid x_i). \tag{2.3}
$$
Conditional on the propensity score, the treatment assignment is independent of the potential outcomes, i.e.,
$$
(y_{i,1}, y_{i,2}) \perp z_i \mid e_i. \tag{2.4}
$$
This follows directly from (2.2), using the iterated conditional expectation argument (see [37–39]).

From (2.4), the treatment effect for subjects with a given propensity score can be estimated from the subjects actually receiving the two treatments. Thus, using the propensity score we can reduce the dimension of the covariates from $\dim(x_i)$ to 1. However, if there are continuous covariates, and hence $e_i$ is also continuous, (2.4) is still not directly applicable. Methods of propensity score matching, stratification, weighting, and covariate adjustment have been developed to facilitate causal inference using propensity scores [15, 38, 39, 43].
In observational studies, it is not uncommon that there are only a limited number
of subjects in the treatment group, but a much larger number of subjects in the
control group. An example is that physicians have data available from hospital
records for patients treated for a disease, but there is no data for subjects who don’t
have the disease (control). In such cases, they often seek large survey data to find
controls. For example, in the study of metabolic syndrome among patients receiving clozapine by Lamberti et al. [25], 93 outpatients with schizophrenia and schizoaffective disorder were treated with clozapine. For treatment comparison purposes, a control group of more than 2700 subjects was obtained by matching the subjects in the treatment group to the National Health and Nutrition Examination Survey.
When there is a very large pool of control subjects, we can match each subject in the treatment group exactly on all the key covariates. However, if the pool of control subjects is not so large and/or there are many covariates to control, then propensity score matching is a useful tool because of the reduced dimensionality. The matching can be performed as 1:1 matching or, more generally, 1:$n$ matching.
Different matching methods have been proposed. First, we can simply match subjects based on the (estimated) propensity scores. When there are continuous or high-dimensional covariates, we may not always be able to find subjects with exactly the same propensity score to match. In this case, we can match the subject with the closest propensity score. It is recommended to select the subjects based on the logit scale (the logit of the propensity score), rather than the propensity score itself. This approach is simple and easy to implement; however, it may be important to control (match) on some key covariates as well. Mahalanobis metric matching selects the control subject with the minimum distance based on the Mahalanobis metric of some key covariates and the logit of the propensity score. For a subject with $u$ for the key covariates and $v$ for the logit of the propensity score, write $w = (u^{\top}, v)^{\top}$; the Mahalanobis distance between a treated subject $w_t$ and a candidate control subject $w_c$ is defined as
$$
d(w_t, w_c) = (w_t - w_c)^{\top} C^{-1} (w_t - w_c),
$$
where $C$ is the sample covariance matrix of these variables for the full set of control subjects.
To give the propensity score a higher priority, one may combine the two matching
methods. We can first select a subgroup of the control subjects based on the logit of
propensity scores (caliper), and then select the control subjects from this subgroup
based on the Mahalanobis metric. This approach is in general preferred over the
above two methods [5, 11, 38, 40, 41].
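A sketch of the caliper-plus-Mahalanobis selection for a single treated subject (an assumed helper, not from the chapter; `u_*` are key-covariate arrays, `v_*` logit propensity scores, and the caliper width is illustrative):

```python
import numpy as np

def mahalanobis_within_caliper(u_treated, v_treated, u_controls, v_controls, caliper=0.2):
    """Pick the control closest in Mahalanobis distance among those inside the caliper."""
    w_controls = np.column_stack([u_controls, v_controls])     # (key covariates, logit PS)
    c_inv = np.linalg.inv(np.cov(w_controls, rowvar=False))    # C from the full control pool
    candidates = np.where(np.abs(v_controls - v_treated) <= caliper)[0]
    if candidates.size == 0:
        return None                                            # no control inside the caliper
    w_t = np.append(u_treated, v_treated)
    diffs = w_controls[candidates] - w_t
    d2 = np.einsum("ij,jk,ik->i", diffs, c_inv, diffs)         # squared Mahalanobis distances
    return candidates[np.argmin(d2)]                           # index into the control pool
```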
Based on the selection criteria, propensity score matching can proceed as follows. For the first subject in the treatment group, select the matching control subject(s) and move the pair to a new data set; repeat the process for the second subject, and so on, until all the subjects in the treatment group have been moved to the new data set. Ultimately, we have a new data set of matched subjects under the treatment and control conditions. In this procedure, once a control is selected, it cannot be selected again to match another treated subject; this is called a greedy algorithm. If the pool of control subjects is not large, one can consider reusing the matched control subjects, i.e., putting the matched subjects back for matching again.

We may then check that covariates are balanced across the treatment and control groups, and the analysis can be performed based on the new sample [2]. Note that the sample no longer satisfies the common i.i.d. assumption because of the matching,
hence common methods for cross-sectional data do not apply. A paired t-test may be applied for simple group comparisons if the matching is 1 to 1. For 1 to $n$ matching, methods for dependent outcomes, such as generalized estimating equations, can be applied to assess the treatment effects, which have already been adjusted for the covariates.
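A compact sketch of the greedy 1:1 procedure and the paired analysis described above (assuming scikit-learn and SciPy; it presumes at least as many controls as treated subjects and codes treatment as 1, control as 0; names are illustrative):

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.linear_model import LogisticRegression

def greedy_match_and_test(y, z, x):
    """Greedy 1:1 matching on the logit of the propensity score, then a paired t-test."""
    ps = LogisticRegression(max_iter=1000).fit(x, z).predict_proba(x)[:, 1]
    logit = np.log(ps / (1 - ps))
    available = set(np.where(z == 0)[0])            # unused controls
    pairs = []
    for t in np.where(z == 1)[0]:
        ctrl = min(available, key=lambda c: abs(logit[c] - logit[t]))
        available.remove(ctrl)                      # greedy: each control used at most once
        pairs.append((t, ctrl))
    y_t = np.array([y[t] for t, _ in pairs])
    y_c = np.array([y[c] for _, c in pairs])
    return y_t.mean() - y_c.mean(), ttest_rel(y_t, y_c).pvalue
```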
The propensity score matching approach is not only very popular in practice, but
also an active methodological research topic. Applications of the propensity score
matching for different scenarios, variations of the matching procedures, and new
methods of inferences have been proposed, see, for example, [1–3, 5, 6, 9, 10, 12,
21, 27–29, 33, 48].
One disadvantage of the propensity score matching approach is that some treated subjects may not find a matched subject in the control group. For example, if the treatment and control groups have comparable sample sizes, it is very likely that there will be more subjects with high propensity scores in the treatment group than in the control group and, similarly, more subjects with low propensity scores in the control group than in the treatment group. This makes matching more difficult, i.e., more subjects are left without matches. This not only causes information loss, but also raises the question of what the matched sample represents, and hence may introduce another source of selection bias. Thus, the propensity score matching method is preferred when the control group is large, so that every subject in the treatment group can find a matching subject.
When the control group is much larger than the treatment group, the propensity score matching approach usually selects only a small portion of the subjects in the control group, even though more subjects with good matches on the propensity score and key covariates may be available. In this case, the propensity score matching approach suffers from low power. To make use of all the subjects in the control group, another common approach, called stratification or subclassification, can be applied. Instead of matching each individual, the propensity score stratification approach divides subjects into subgroups according to the propensity scores. More precisely, let $0 = c_0 < c_1 < c_2 < \cdots < c_m = 1$; then we can separate the sample into $m$ groups, where the $k$th group consists of subjects with propensity scores falling within $I_k = (c_{k-1}, c_k]$. Under the regularity assumption that the treatment effect is a continuous function of the propensity score, i.e., that $E(y_{i,1} - y_{i,2} \mid e_i = e)$ is continuous in $e$, subjects with comparable propensity scores should show similar treatment effects, i.e.,
$$
E(y_{i,j} \mid e_i \in I_k) \approx E(y_{i,j} \mid z_i = j,\, e_i \in I_k), \quad k = 1, 2, \ldots, m, \; j = 1, 2.
$$
Hence, within each subgroup, we can estimate the mean outcome for each treatment condition by the observed outcomes for that subgroup, i.e.,
$$
\widehat{E}(y_{i,1} \mid e_i \in I_k) = \frac{\sum_{i:\, e_i \in I_k,\, z_i = 1} y_{i,1}}{n_{k1}}, \qquad
\widehat{E}(y_{i,2} \mid e_i \in I_k) = \frac{\sum_{i:\, e_i \in I_k,\, z_i = 2} y_{i,2}}{n_{k2}},
$$
where $n_{k1}$ and $n_{k2}$ are the numbers of subjects in the $k$th subgroup for the treatment and control groups, respectively. So the treatment effect for the $k$th subgroup can be estimated by
$$
\widehat{E}(y_{i,1} \mid e_i \in I_k) - \widehat{E}(y_{i,2} \mid e_i \in I_k).
$$
Based on the estimated treatment effect for each subgroup, we can estimate the treatment effect for the whole sample. Note that the overall treatment effect for the whole sample can be expressed as
$$
\int \bigl[E(y_{i,1} \mid e_i = e) - E(y_{i,2} \mid e_i = e)\bigr]\, f(e)\, de, \tag{2.5}
$$
which can be approximated by
$$
\sum_{k=1}^{m} \bigl[E(y_{i,1} \mid e_i \in I_k) - E(y_{i,2} \mid e_i \in I_k)\bigr]\, \Pr(e_i \in I_k),
$$
a weighted average of the treatment effects across the subgroups. $\Pr(e_i \in I_k)$ can be estimated by the sample proportion
$$
\widehat{\Pr}(e_i \in I_k) = \frac{n_{k1} + n_{k2}}{n},
$$
where $n$ is the total sample size.
This approach can be viewed as a numerical approximation to the overall treatment
effect (2.5). Since the overall treatment effect is an integral over the propensity score
$e_i$, which is a scalar-valued function of $x_i$ regardless of the dimensionality of
$x_i$, we can estimate the integral (2.5) by a Riemann sum.
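For illustration, the following R sketch computes this stratified (Riemann-sum) estimate
from a hypothetical data frame with an observed outcome y, a treatment indicator z
(1 = treatment, 0 = control) and estimated propensity scores e; the variable names and
the use of quantile-based cut points are assumptions made only for this example.

```r
## Stratified propensity score estimate of the treatment effect (sketch).
## Assumes a data frame `dat` with columns y (outcome), z (1 = treatment,
## 0 = control) and e (estimated propensity score); names are hypothetical.
strat_effect <- function(dat, m = 5) {
  # cut points at the quantiles of the combined propensity scores
  cuts <- quantile(dat$e, probs = seq(0, 1, length.out = m + 1))
  k <- cut(dat$e, breaks = cuts, include.lowest = TRUE, labels = FALSE)
  n <- nrow(dat)
  est <- 0
  for (j in seq_len(m)) {
    trt <- dat$y[k == j & dat$z == 1]
    ctl <- dat$y[k == j & dat$z == 0]
    # within-stratum effect, weighted by the stratum's sample proportion
    est <- est + (mean(trt) - mean(ctl)) * sum(k == j) / n
  }
  est
}
```

With quantile-based cut points each stratum has roughly equal size, in line with the
recommendation of 5–10 subgroups discussed next.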
Under the propensity score stratification approach, we need to choose the cut
points for the classification. In general, we can divide the subjects into subgroups
of comparable size, e.g., based on the quantiles of the estimated propensity scores for
the combined groups. Typically 5–10 groups are sufficient, and simulation studies show
that such a partition removes about 90% of the bias [39]. When the treatment group is
small, such a division may result in subgroups with few treated subjects and hence
produce unstable inference. In such cases, one may instead choose the cut points based
on the quantiles of the estimated propensity scores in the treatment group only, in
order to obtain subgroups with comparable numbers of subjects receiving the
treatment [42, 44].
Instead of comparing the treatment and control groups at each propensity score, or
within a small interval of propensity scores, we can also correct the selection bias
by the propensity score weighting approach. Note that the propensity score is the
probability of a subject being assigned to the treatment group. Thus, a subject in the
treatment group with propensity score $e = 0.1$ can be thought of as representing a
total of $1/e = 10$ subjects with similar characteristics, so in the analysis we would
assign a weight of $1/e = 10$ to that subject when estimating the treatment effect.
Similarly, since a subject in the control group with propensity score $e = 0.1$ has
probability $1 - e = 0.9$ of being assigned to the control group, it can be thought of
as representing a total of $1/(1 - e) \approx 1.1$ subjects in the control group with
similar characteristics, so in the analysis we would assign a weight of
$1/(1 - e) \approx 1.1$ to that subject in estimating the treatment effect.
probability weighting (IPW) approach, which has a long history in the analysis of
sample survey data [22].
The mathematical justification of the propensity score weighting is the fact that
$$E\left(\frac{z_i}{e_i}\, y_{i,1}\right) = E(y_{i,1}) \quad \text{and} \quad
E\left(\frac{1 - z_i}{1 - e_i}\, y_{i,2}\right) = E(y_{i,2}). \tag{2.6}$$
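As a concrete illustration of (2.6), the following R sketch computes the IPW estimate of
the treatment effect from a hypothetical data frame with outcome y, treatment indicator
z, and estimated propensity score e; the names are assumptions for this example only.

```r
## Inverse probability weighted (IPW) estimate of the treatment effect (sketch).
## Assumes a data frame `dat` with columns y, z (1 = treatment, 0 = control)
## and e (estimated propensity score); the column names are hypothetical.
ipw_effect <- function(dat) {
  mu1 <- with(dat, mean(z * y / e))               # weighted mean under treatment
  mu0 <- with(dat, mean((1 - z) * y / (1 - e)))   # weighted mean under control
  mu1 - mu0
}
```

These are Horvitz–Thompson-type weighted means; a weighted regression of y on z with
weights z/e + (1 - z)/(1 - e) gives the normalised-weight version of the same estimator.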
This weighting approach can also be applied to regression analysis. For example,
suppose that there is no interaction between the treatment and the covariates, so that
the regression models (2.7) for the potential outcomes $y_{i,j}$ can be expressed as a
single model for the observed outcome $y_i$. The resulting inverse probability weighted
estimating equation (EE)
$$\frac{1}{n}\sum_{i=1}^{n} \frac{z_i}{e_i\,\operatorname{Var}(y_i \mid x_i)}
\left[ y_i - (\alpha z_i + \beta x_i) \right] = 0 \tag{2.9}$$
is unbiased. To account for the variation associated with estimating the propensity
score, we can combine this EE in (2.9) with estimating equations for the propensity
score. Note that even when $e_i$ is known, the estimated propensity score is often
preferred over the true $e_i$ because it may fit the observed data better [20].
For the propensity score weighting approach to provide valid inference, we need
$0 < e_i < 1$, so that each subject has a positive probability of being assigned to
either the treatment or the control group. In other words, each subgroup of subjects
must have representatives observed in both groups. For subjects in the treatment group
with extremely small $e_i$, the inverse $1/e_i$ can become quite large, yielding highly
volatile estimates. Similarly, for subjects in the control group with $e_i$ close to 1,
the weights $1/(1 - e_i)$ can also become quite large and cause the estimates to be
highly volatile. So, to ensure good behavior of the estimates, we need to assume
$$c \le e_i \le 1 - c,$$
where $c$ is some positive constant. This assumption is similar to the
bounded-away-from-zero assumption for regular inverse probability weighting approaches
for missing values.
To reduce bias and improve the stability of the propensity score weighting approach,
some modified propensity score methods, including the doubly robust estimator, have
been developed and discussed; see [7, 13, 16, 17, 24, 26, 27, 30, 38, 45, 47].
to assess the causal effect. Without any further assumptions, we can apply
nonparametric curve regression methods such as local polynomial regression to the two
groups separately to estimate the two curves [8, 14]. The treatment effect may then be
assessed by comparing these two estimated curves.
If we assume that the treatment effect is homogeneous across all propensity scores,
then $f_1(e) - f_2(e)$ is a constant, and $\alpha = f_1(e) - f_2(e)$ is the treatment
effect. Then (2.10) can be written compactly as (2.11), where $\alpha$ is the treatment
effect. If the function $f(e)$ is further assumed linear in $e$, we obtain (2.12).
Conditioning on the propensity score, since the mean of the potential outcome equals
the mean of the observed outcome, the two regression equations in (2.12) for the two
groups can be written as a single regression model (2.13), and again the parameter
$\alpha$ carries the information about the treatment effect.
In the arguments above, the assumption of a homogeneous treatment effect (2.11) is
important for valid inference. It has been proved that under a homogeneous treatment
effect, the regression model (2.13) provides robust inference about the treatment
effect even when the parametric assumption, i.e., the functional form of $f(e)$
in (2.11), is not correctly specified [11, 36]. One may check the homogeneity
assumption (2.11) by testing whether the interaction between the treatment and the
propensity score is significant, as illustrated in the sketch below. Using propensity
score stratification, we can also compare the estimated treatment effects across the
subgroups and test whether they are the same.
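A minimal R sketch of this adjustment and interaction check, assuming a data frame dat
with outcome y, treatment indicator z and estimated propensity score e (all hypothetical
names), is:

```r
## Propensity score covariate adjustment and homogeneity check (sketch).
## `dat` with columns y, z (0/1 treatment indicator) and e (estimated
## propensity score) is assumed for illustration only.
fit_main <- lm(y ~ z + e, data = dat)   # (2.13)-type covariate adjustment
fit_int  <- lm(y ~ z * e, data = dat)   # adds the treatment-by-PS interaction
anova(fit_main, fit_int)                # significant interaction -> effect not homogeneous
coef(fit_main)["z"]                     # adjusted treatment effect estimate
```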
Note that propensity score covariate regression adjustment is similar to regular
covariate adjustment in regression analysis. In fact, Rosenbaum and Rubin showed that
the point estimate of the treatment effect is the same if the same $x_i$ is used in
estimating both the propensity score and the treatment effect and the propensity score
is a linear function of $x_i$ (this can only be approximately true, since the logistic
function is not linear). The two-step procedure of propensity score covariate
adjustment has the advantage that one can apply a very complicated propensity score
model without worrying about over-parameterizing the outcome model [11].
Covariate adjustment by the propensity score is commonly used in practice, and the
methods have been generalized to different scenarios [23, 46]. However, the adjustment
should be performed with caution [11, 19]. Standard linear regression models assume
homoscedasticity, so there may be a problem if the variances in the treatment and
control groups are very different. The above arguments are also based on linear models
for continuous outcomes, and their application to nonlinear cases is questionable. For
example, for nonlinear regression models such as logistic regression, Austin et al.
found considerable bias in the treatment effect estimate when the propensity score is
used as a covariate for adjustment [4]. Even for linear models, Hade and Lu
investigated the size of the bias and recommended adjusting for the propensity score
through stratification or matching followed by regression, or using splines [19].
the participants with a physical activity score less than 51.1 were considered as
controls. We expect that participants in the treatment group would have lower blood
pressure than participants in the control group.
Covariates In addition to the demographic information such as age, gender, marital
status, education level, employment status, baseline BMI, smoking and drinking
status, we also considered personal medical history such as stroke, hypertension,
and high cholesterol and blood chemistry results such as glucose, creatinine, total
cholesterol, HDL cholesterol, LDL cholesterol, and triglycerides. All the covariates
were compared between the treatment and control groups by chi-square tests
for categorical variables and Wilcoxon Rank-sum tests for continuous variables.
Most of the variables are significantly different between the two groups. We also
compared BP between the two groups; the sample difference in MAP is 4.16 mm Hg,
with the control group having the higher MAP.
Next, we will apply propensity score methods to examine the effects of physical
activity on BP.
All covariates above that were identified as potential confounders were included in
the selection model to estimate the propensity scores. Forward model selection was
applied to select potential interactions. The selected final model for estimating the
propensity score is summarized in Table 2.1.
The Hosmer and Lemeshow goodness-of-fit test was performed to check whether the model
fits the data well. The p-value for the Hosmer and Lemeshow test is 0.4632, indicating
that the model for estimating the propensity scores fits the data reasonably well.
Based on the estimated propensity scores, we can match the subjects in the treatment
group with subjects in the control group. In this example, we match subjects with
more activity with subjects with less activity. We use the SAS macro function
provided in [32] to obtain 818 pairs of matched subjects. We checked the balance
of the matched groups in terms of covariates, and the propensity score matching
succeeded in reducing the selection bias between the two groups. Summarized in
Table 2.2 are the p-values of comparisons of covariates between the two groups
mentioned above, before and after the matching.
While most of the variables showed significant differences before the propensity score
matching, there were no significant differences in the matched sample.
A paired t-test was then applied to assess the effect of physical activity on blood
pressure based on the matched sample. After adjusting for the confounders, the
treatment group with more physical activity has a MAP 1.6598 mm Hg lower than the
control group with less activity. The standard error is 0.5994, and the corresponding
p-value for the treatment effect is 0.0058. The adjusted effect is smaller than the
unadjusted difference of 4.16 mm Hg.
In the above propensity score matching approach, only slightly more than half of the
subjects were matched. Unmatched subjects were used in the estimation of the propensity
score, but their information was otherwise ignored in assessing the treatment effect.
To utilize all the information, we then use the propensity score stratification
approach to estimate the treatment effect. We divide the whole sample into 5 subgroups
according to the propensity scores. The propensity scores range from 0.0260582 to
0.2436369, 0.2437835 to 0.3666789, 0.3668133 to 0.5341626, 0.5342977 to 0.7668451 and
0.7670692 to 0.9999613 for the five subgroups, respectively. Summarized in Table 2.3
are the sample sizes of the two treatment groups within each subgroup, their means and
standard deviations of blood pressure, and the mean difference between the two groups.
Included in the last column are the differences in mean blood pressure; these are the
estimates of the treatment effects for the subgroups. It is clear that the treatment
effects are not homogeneous across the propensity score levels.
We can also use the propensity score weighting approach to correct the selection bias.
Using the blood pressure measure as the response and the treatment as the only
predictor, and weighting each subject by the inverse of the propensity score of being
assigned to the treatment group, the estimated treatment effect was $-2.38$ with
standard error 0.45185; that is, the more-active group had a MAP 2.38 mm Hg lower than
the less-active group. The p-value was less than .0001, indicating that the more-active
group had a significantly lower MAP than the less-active group. Note that there are
subjects with propensity scores as small as 0.0260582 and as large as 0.9999613, so we
need to be cautious about subjects with potentially high influence. In fact, there are
5 subjects with weights larger than 20, the highest being 47.0519.
If the subject with the highest weight is removed from the data, the estimated
treatment effect changes to $-2.4238$. Moreover, this observation is not the only one
with a large impact on the estimated treatment effect. Thus, in situations where some
subjects have large weights, the propensity score weighting approach should be used
with caution.
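A minimal R sketch of this kind of weight diagnostic and sensitivity check, again
assuming a data frame with columns y, z and e (hypothetical names), is:

```r
## Diagnostics for extreme IPW weights and a simple sensitivity check (sketch).
## `dat` with columns y, z (0/1) and e (estimated propensity score) is assumed.
dat$w <- with(dat, z / e + (1 - z) / (1 - e))    # weight for the assigned group
summary(dat$w)                                   # look for very large weights
length(which(dat$w > 20))                        # number of subjects with weight > 20
fit_all  <- lm(y ~ z, data = dat, weights = w)
fit_trim <- lm(y ~ z, data = dat[-which.max(dat$w), ], weights = w)
coef(fit_all)["z"]; coef(fit_trim)["z"]          # with and without the largest weight
```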
In the above analysis using the propensity score weighting approach, the estimated
propensity scores were used. For rigorous statistical inference, we should take into
account the variation associated with estimating the propensity score. Unfortunately,
many inverse weighting procedures treat the weights as fixed and cannot account for
this variation. In our example, however, this may not be a major concern since the
p-value is very small.
Based on the analysis using the propensity score stratification approach, the treatment
effects across the propensity scores did not seem to be homogeneous in this example.
We can formally test this by testing the interaction between the treatment and the
propensity score. The p-value for testing the interaction was <.0001, indicating a
significant interaction between the treatment and the propensity score. We can also
compare the 5 subgroups to test the null hypothesis of no treatment effect differences
among the 5 subgroups. The p-value for the test was 0.0005. This further confirmed
that the treatment effects were significantly different across the propensity score
levels.
The significant interaction between the treatment and the propensity score implies
that a simple covariate adjustment is not appropriate in this case. However, for
illustrative purposes, we still applied the propensity score covariate adjustment
approach. We fit a linear regression model with the blood pressure measure as the
response and the treatment and the propensity score as predictors to assess the
treatment effect. The estimated treatment effect was $-1.86$ with an SE of 0.53354 and
a p-value of 0.0005. Instead of using the exact propensity score, we also used the
stratum ranks as the covariate; the estimated treatment effect was $-2.05$ with an SE
of 0.5290 and a p-value of 0.0001.
So far, we have illustrated all the propensity score approaches using the GenSalt study
as an example. Based on the results obtained from the different propensity score
adjustments, the estimated treatment effects range from 1.86 to 2.37 mm Hg in
magnitude, which is smaller than the unadjusted difference of 4.16 mm Hg in MAP. All
the results show that more activity is beneficial for the blood pressure outcome.
5 Discussion
All the analyses for the examples in Sect. 4 were performed using SAS. The SAS
program code is included here for readers who are interested in applying the
methods in their own data analyses.
• Logistic regression model for estimation of the propensity scores.
• The fitted values are saved in variable prob in data set preds.
proc logistic data=path.comb;
  class High_Cholesterol Stroke Drinking Gender High_Education
        Field_Center Marital Employment;
  model act_b50 = Age BMI Gender High_Education Field_Center Marital
        Employment Drinking High_Cholesterol Stroke Creatinine GFR
        HDL_Cholesterol LDL_Cholesterol Age*Gender Age*High_Education
        BMI*Drinking Drinking*Gender High_Cholesterol*Gender
        Creatinine*Field_Center GFR*High_Education Age*Marital
        BMI*Marital Field_Center*Marital Age*Employment
        Field_Center*Employment / lackfit;
  output out=preds pred=prob;
run;
• Propensity score matching with the macro %OneToManyMTCH, which can be copied
from [32].
%OneToManyMTCH(work, preds, act_b50, hid, pid, Matches, 1);
• Paired t-test for matched subjects.
* first generate paired variables;
proc sort data=Matches;
  by match_1 act_b50;
run;
data paired;
  set Matches;
  control=B_MAP;
  treated=lag(B_MAP);
  if mod(_n_,2)=0 then output;
run;
* paired t-test;
proc ttest data=paired;
  paired treated*control;
run;
References
1. Austin, P.C.: A critical appraisal of propensity-score matching in the medical literature between
1996 and 2003. Stat. Med. 27(12), 2037–2049 (2008)
2. Austin, P.C.: Some methods of propensity-score matching had superior performance to others:
results of an empirical investigation and Monte Carlo simulations. Biom. J. 51(1), 171–184
(2009)
3. Austin, P.C.: A comparison of 12 algorithms for matching on the propensity score. Stat. Med.
33(6), 1057–1069 (2014)
4. Austin, P.C., Grootendorst, P., Anderson, G.M.: A comparison of the ability of different
propensity score models to balance measured variables between treated and untreated subjects:
a monte carlo study. Stat. Med. 26(4), 734–753 (2007)
5. Austin, P.C., Small, D.S.: The use of bootstrapping when using propensity-score matching
without replacement: a simulation study. Stat. Med. 33(24), 4306–4319 (2014)
6. Baycan, I.O.: The effects of exchange rate regimes on economic growth: evidence from
propensity score matching estimates. J. Appl. Stat. 43(5), 914–924 (2016)
7. Berk, R.A., Freedman, D.A.: Weighting regressions by propensity scores. Eval. Rev. 32,
392–400 (2008); Berk, R.A., Freedman, D.A.: Statistical Models and Causal Inference,
pp.279–294. Cambridge University Press, Cambridge (2010)
8. Cleveland, W.S.: Lowess: a program for smoothing scatterplots by robust locally weighted
regression. Am. Stat. 35(1), 54 (1981)
9. Cottone, F., Efficace, F., Apolone, G., Collins, G.S.: The added value of propensity score
matching when using health-related quality of life reference data. Stat. Med. 32(29),
5119–5132 (2013)
10. Cuong, N.V.: Which covariates should be controlled in propensity score matching? Evidence
from a simulation study. Stat. Neerlandica 67(2), 169–180 (2013)
11. d’Agostino, R.B.: Tutorial in biostatistics: propensity score methods for bias reduction in the
comparison of a treatment to a non-randomized control group. Stat. Med. 17(19), 2265–2281
(1998)
12. Dehejia, R.H., Wahba, S.: Propensity score-matching methods for nonexperimental causal
studies. Rev. Econ. Stat. 84(1), 151–161 (2002)
13. Ertefaie, A., Stephens, D.A.: Comparing approaches to causal inference for longitudinal data:
inverse probability weighting versus propensity scores. Int. J. Biostat. 6(2), Art. 14, 24 (2010)
14. Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Chapman and Hall,
London (1996)
15. Frölich, M.: A note on the role of the propensity score for estimating average treatment effects.
A note on “On the role of the propensity score in efficient semiparametric estimation of average
treatment effects” [Econometria 66(2), 315–331 (1998); mr1612242] by J. Hahn. Econ. Rev.
23(2), 167–174 (2004)
16. Fujii, Y., Henmi, M., Fujita, T.: Evaluating the interaction between the therapy and the
treatment in clinical trials by the propensity score weighting method. Stat. Med. 31(3),
235–252 (2012)
17. Funk, M.J., Westreich, D., Wiesen, C., Stürmer, T., Brookhart, M.A., Davidian, M.: Doubly
robust estimation of causal effects. Am. J. Epidemiol. 173(7), 761–767 (2011)
18. GenSalt Collaborative Research Group, et al.: Genetic epidemiology network of salt
sensitivity (GenSalt): rationale, design, methods, and baseline characteristics of
study participants. J. Hum. Hypertens. 21, 639 (2007)
19. Hade, E.M., Lu, B.: Bias associated with using the estimated propensity score as a regression
covariate. Stat. Med. 33(1), 74–87 (2014)
20. He, H., McDermott, M.: A robust method for correcting verification bias for binary tests.
Biostatistics 13(1), 32–47 (2012)
21. Heckman, J.J., Todd, P.E.: A note on adapting propensity score matching and selection models
to choice based samples. Econ. J. 12, S1, S230–S234 (2009)
22. Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite
universe. J. Am. Stat. Assoc. 47, 663–685 (1952)
23. Jiang, D., Zhao, P., Tang, N.: A propensity score adjustment method for regression models with
nonignorable missing covariates. Comput. Stat. Data Anal. 94, 98–119 (2016)
24. Kim, J.K., Im, J.: Propensity score adjustment with several follow-ups. Biometrika 101(2),
439–448 (2014)
25. Lamberti, J., Olson, D., Crilly, J., Olivares, T., Williams, G., Tu, X., Tang, W., Wiener,
K., Dvorin, S., Dietz, M.: Prevalence of the metabolic syndrome among patients receiving
clozapine. Am. J. Psychiatry 163(7), 1273–1276 (2006)
26. Lee, B.K., Lessler, J., Stuart, E.A.: Improving propensity score weighting using machine
learning. Stat. Med. 29(3) , 337–346 (2010)
27. Li, F., Zaslavsky, A.M., Landrum, M.B.: Propensity score weighting with multilevel data. Stat.
Med. 32(19), 3373–3387 (2013)
28. Loux, T.M.: Randomization, matching, and propensity scores in the design and analysis of
experimental studies with measured baseline covariates. Stat. Med. 34(4), 558–570 (2015)
29. Lu, B., Qian, Z., Cunningham, A., Li, C.-L.: Estimating the effect of premarital cohabitation on
timing of marital disruption: using propensity score matching in event history analysis. Sociol.
Methods Res. 41(3) , 440–466 (2012)
30. Lunceford, J.K., Davidian, M.: Stratification and weighting via the propensity score in
estimation of causal treatment effects: a comparative study. Stat. Med. 23(19), 2937–2960
(2004)
31. Paffenbarger, R., Blair, S., Lee, I., et al.: Measurement of physical activity to assess health
effects in free-living populations. Med. Sci. Sports Exerc. 25(1), 60–70 (1993)
32. Parsons, L.: Performing a 1:N case-control match on propensity score. In: Proceedings
of the 29th Annual SAS Users Group International Conference, Paper 165-29 (2004)
33. Peikes, D.N., Moreno, L., Orzol, S.M.: Propensity score matching: a note of caution for
evaluators of social programs. Am. Stat. 62(3), 222–231 (2008)
34. Perloff, D., Grim, C., Flack, J., Frohlich, E., Hill, M., McDonald, M., et al.: Human blood
pressure determination by sphygmomanometer. Circulation 88(5), 2460–2470 (1993)
35. Rebholz, C.M., Gu, D., Chen, J., Huang, J.-F., Cao, J., Chen, J.-C., Li, J., Lu, F.,
Mu, J., Ma, J., Hu, D., Ji, X., Bazzano, L.A., Liu, D., He, J., for the GenSalt
Collaborative Research Group: Physical activity reduces salt sensitivity of blood
pressure. Am. J. Epidemiol. 176(7), 106–113 (2012)
36. Robins, J.M., Mark, S.D., Newey, W.K.: Estimating exposure effects by modelling the
expectation of exposure conditional on confounders. Biometrics 48, 479–495 (1992)
37. Rosenbaum, P.R.: Observational Studies. Springer, New York (2002)
38. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies
for causal effects. Biometrika 70, 41–55 (1983)
39. Rosenbaum, P.R., Rubin, D.B.: Reducing bias in observational studies using
subclassification on the propensity score. J. Am. Stat. Assoc. 79, 516–524 (1984)
40. Rubin, D.B.: Using multivariate matched sampling and regression adjustment to control bias
in observational studies. J. Am. Stat. Assoc. 74(366a), 318–328 (1979)
41. Rubin, D.B.: Bias reduction using mahalanobis-metric matching. Biometrics 36(2), 293–298
(1980)
42. Senn, S., Graf, E., Caputo, A.: Stratification for the propensity score compared with linear
regression techniques to assess the effect of treatment or exposure. Stat. Med. 26(30),
5529–5544 (2007)
43. Sobel, M.E.: Causal inference in the social sciences. J. Am. Stat. Assoc. 95(450), 647–651
(2000)
44. Stampf, S., Graf, E., Schmoor, C., Schumacher, M.: Estimators and confidence intervals for
the marginal odds ratio using logistic regression and propensity score stratification. Stat. Med.
29(7–8), 760–769 (2010)
45. Ukoumunne, O.C., Williamson, E., Forbes, A.B., Gulliford, M.C., Carlin, J.B.: Confounder-
adjusted estimates of the risk difference using propensity score-based weighting. Stat. Med.
29(30), 3126–3136 (2010)
46. Vansteelandt, S., Daniel, R.M.: On regression adjustment for the propensity score. Stat. Med.
33(23), 4053–4072 (2014)
47. Williamson, E.J., Forbes, A., White, I.R.: Variance reduction in randomised trials by inverse
probability weighting using the propensity score. Stat. Med. 33(5), 721–737 (2014)
48. Xu, Z., Kalbfleisch, J.D.: Propensity score matching in randomized clinical trials. Biometrics
66(3), 813–823 (2010)
Chapter 3
Sufficient Covariate, Propensity Variable
and Doubly Robust Estimation
1 Introduction
H. Guo ()
Centre for Biostatistics, School of Health Sciences, The University of Manchester,
Jean McFarlane Building, Oxford Road, Manchester M13 9PL, UK
e-mail: hui.guo@manchester.ac.uk
P. Dawid
Statistical Laboratory, University of Cambridge, Wilberforce Road, Cambridge CB3 0WB, UK
e-mail: apd25@cam.ac.uk
G. Berzuini
Department of Brain and Behavioural Sciences, University of Pavia, Pavia, Italy
e-mail: giomanuel_b@hotmail.com
causal effect (ACE) can simply be estimated as the outcome difference of the
two groups from the observed data. However, randomised experiments, although
ideal and to be conducted whenever possible, are not always feasible. For instance,
to investigate whether smoking causes lung cancer, we cannot randomly force
a group of subjects to take cigarettes. Moreover, it may take years or longer
for development of this disease. Instead, a retrospective case–control study may
have to be considered. The task of drawing causal conclusions, however, becomes
problematic, since similarity of subjects in the two groups will rarely hold, e.g.,
lifestyles of smokers might be different from those of non-smokers. Thus, we
are unable to ‘compare like with like’ — the classic problem of confounding in
observational studies, which may require adjusting for a suitable set of variables
(such as age, sex, health status, diet). Otherwise, the relationship between treatment
and response will be distorted, leading to biased inferences. In general, linear
regression, matching, or subclassification is used for adjustment purposes. If there
are multiple confounders, especially for matching and subclassification, identifying
two individuals with very similar values of all confounders simultaneously would be
cumbersome or impossible. Thus, it is sensible to replace all the confounders by a
scalar variable. The propensity score [19] is a popular dimension-reduction approach
in a variety of research fields.
2 Framework
The aim of statistical causal inference is to understand and estimate a ‘causal effect’,
and to identify scientific and in principle testable conditions under which the causal
effect can be identified from observational studies. The philosophical nature of
‘causality’ is reflected in the diversity of its statistical formalisations, as exemplified
by three frameworks:
1. Rubin’s potential response framework [21–23] (also known as Rubin’s causal
model) based on counterfactual theory;
2. Pearl’s causal framework [16, 17] richly developed from graphical models;
3. Dawid’s decision-theoretic framework [6, 7] based on decision theory and
probabilistic conditional independence.
In Dawid’s framework, causal relations are modelled entirely by conditional
probability distributions. We adopt it throughout this chapter to address causal
inference; the assumptions required are, at least in principle, testable.
Let X, T and Y denote, respectively, a (typically multivariate) confounder,
treatment, and response (or outcome). For simplicity, Y is a scalar and X a
multi-dimensional variable. We assume that T is binary: 1 (treatment arm) and
0 (control arm). Within Dawid’s framework, a non-stochastic regime indicator
variable FT , taking values ;, 0 and 1, is introduced to denote the treatment
assignment mechanism operating. This divides the world into three distinct regimes,
as follows:
1. FT D ;: the observational (idle) regime. In this regime, the value of the treatment
is passively observed and treatment assignment is determined by Nature.
read as 'Y is independent of $F_T$ given $T$'. However, this condition is most likely
inappropriate in observational studies, where randomisation is absent.
The causal effect is defined as the difference in response produced by manipulating the
treatment, which involves only the interventional regimes. In particular, the
population-based average causal effect (ACE) of the treatment is defined as
$$\mathrm{ACE} := E(Y \mid F_T = 1) - E(Y \mid F_T = 0),$$
or alternatively,
$$\mathrm{ACE} := E_1(Y) - E_0(Y).$$
Without further assumptions, ACE is by definition not identifiable from the
observational regime.
1
For convenience, the values of the regime indicator FT are presented as subscripts.
3 Identification of ACE
It would hardly ever be true that FACE = ACE, as we would not expect the conditional
distribution of $Y$ given $T = t$ to be the same in all regimes. In fact, identification
of ACE from observational studies requires, on the one hand, adjusting for confounders
and, on the other hand, interplay of distributional information between different
regimes. One can make no further progress unless some properties are satisfied.
$$X \perp\!\!\!\perp F_T.$$
Property 2 requires that the distribution of Y, given X and T, is the same in all
regimes. It can also be described as ‘strongly ignorable treatment assignment, given
X’ [19]. We assume that readers are familiar with the concept and properties of
$$P_f(A \mid X, T) = w(X, T)$$
almost surely (a.s.) in each regime $f = 0, 1, \emptyset$. Let $P_\emptyset(A) = 0$. Then a.s. $[P_\emptyset]$,
By Property 3, for $t = 0, 1$,
$$w(X, t) = 0 \tag{3.6}$$
a.s. $[P_\emptyset]$. As $w(X, t)$ is a function of $X$, it follows that (3.6) holds a.s. $[P_t]$ by
Property 1. Consequently,
since a.s. $[P_t]$, $T = t$ and $w(X, T) = w(X, t)$ for any bounded function $w$. Then
by (3.7),
Proof. Let $j(X, T)$ be an arbitrary but fixed version of $E_t(Z \mid X, T)$. Then
$j(X, T) = j(X, t)$ a.s. $[P_t]$, and $j(X, t)$ serves as a version of $E_t(Z \mid X, T)$
under $[P_t]$. So
and consequently
2
The symbol is interpreted as ‘a function of’.
for $f = 0, 1, \emptyset$. We also have that $g(X, T) = h(X, T)$ a.s. $[P_\emptyset]$, and so, by Lemma 1,
a.s. $[P_t]$. Then $g(X, t) = h(X, t)$ a.s. $[P_t]$, where $g(X, t)$ and $h(X, t)$ are both functions
of $X$. By Property 1,
Let X be a covariate.
Definition 5. The specific causal effect of T on Y, relative to X, is
$$\mathrm{SCE} := E_1(Y \mid X) - E_0(Y \mid X),$$
which is the definition of ‘individual causal effect’, ICE, in Rubin’s causal model.
Thus, although the formalisations of causality are different, SCE in Dawid’s
decision-theoretic framework can be regarded as a generalisation of ICE in Rubin’s
causal model.
We can easily prove that, for any covariate $X$, $\mathrm{ACE} = E(\mathrm{SCE}_X)$, where the
expectation may be taken in any regime. Since by Property 1,
is identifiable from the observational joint distribution of $(X, T, Y)$. Formula (3.12)
is Pearl's 'back-door formula' [17] because, by the property of modularity, $P(X)$ is
the same with or without intervention on $T$ and thus can be taken as the distribution
of $X$ in the observational regime.
Since two arrows initiate from X in Fig. 3.1, possible reductions may be
naturally considered, on the pathways from X to T, and from X to Y. Indeed, the
following theorem gives two alternative sufficient conditions for (3.13) to hold.
However, (3.13) can still hold without these conditions.
Theorem 2. Suppose $X$ is a strongly sufficient covariate and $V$ is a function of $X$.
Then $V$ is a strongly sufficient covariate if either of the following conditions is
satisfied:
(a) Response-sufficient reduction:
or
That is, in the observational regime, the treatment does not depend on $X$ conditional
on the information in $V$.
Both of the reductions in Theorem 2 were proved in [9]. An alternative proof of (b)
can be carried out graphically [9], resulting in a DAG as in Fig. 3.2, from which
(3.16) and (3.13) can be read directly.
A graphical approach to (a) does not work, since Property 3 is required. However,
while not serving as a proof, Fig. 3.3 conveniently embodies the conditional
independence Properties 1 and 2 and the trivial property
$V \perp\!\!\!\perp T \mid (X, F_T)$, as well as (3.13).
4 Propensity Analysis
Here we further discuss the treatment-sufficient reduction, which does not involve
the response. This brings in the concept of propensity variable: a minimal treatment-
sufficient covariate, for which we investigate the unbiasedness and precision of
the estimator of ACE. Also, the asymptotic precision of the estimated ACE, as well as
the variation of the estimate in actual data, will be analysed. In a simple normal
linear model applied for covariate adjustment, two cases are considered:
homoscedasticity and heteroscedasticity. A non-parametric approach, subclassification,
will also be applied, for different covariance matrices of $X$ in the two treatment
arms. The estimated ACE obtained by adjusting for the multivariate $X$ and by adjusting
for a scalar propensity variable will then be compared, both theoretically and through
simulations [9].
3
The hollow arrow head, pointing from X to V, is used to emphasise that V is a function of X.
The propensity score (PS), first introduced by Rosenbaum and Rubin, is a balancing
score [19]. Regarded as a useful tool to reduce bias and increase precision, it is a
very popular approach to causal effect estimation. PS matching (or subclassification)
method, widely used in various research fields, exploits the property of conditional
(within-stratum) exchangeability, whereby individuals with the same value of PS
(or belonging to a group with similar values of PS) are taken as comparable or
exchangeable. We will, however, mainly focus on the application of PS within a
linear regression. The definitions of the balancing score and PS given below are
borrowed from [19].
Definition 6. A balancing score $b(X)$ is a function of $X$ such that, in the
observational regime,⁴ the conditional distribution of $X$ given $b(X)$ is the same for
both treatment groups. That is,
$$X \perp\!\!\!\perp T \mid (b(X), F_T = \emptyset).$$
It has been shown that adjusting for a balancing score rather than $X$ results in an
unbiased estimate of ACE, under the assumption of strongly ignorable treatment
assignment [19]. One can trivially choose $b(X) = X$, but it is more constructive to
find a balancing score that is a many-to-one function.
Definition 7. The propensity score, denoted by $\Pi$, is the probability of being
assigned to the treatment group given $X$ in the observational regime:
$$\Pi := P_\emptyset(T = 1 \mid X).$$
4
Rosenbaum and Rubin do not define the balancing score and the PS explicitly for observational
studies, although they do aim to apply the PS approach in such studies.
which states that the observational distribution of X given V is the same for both
treatment arms. That is to say, V is a balancing score for X.
The treatment-sufficient condition (b) can be equivalently interpreted as follows.
Consider the family $\mathcal{Q} = \{Q_0, Q_1\}$ consisting of the observational
distributions of $X$ for the two groups $T = 0$ and $T = 1$. Then Eq. (3.16),
re-expressed as (3.17), says that $V$ is a sufficient statistic (in the usual Fisherian
sense [8]) for this family. In particular, a minimal treatment-sufficient reduction is
obtained as a minimal sufficient statistic for $\mathcal{Q}$: i.e., any variable almost
surely equal to a one-one function of the likelihood ratio statistic
$\Lambda := q_1(X)/q_0(X)$, where $q_i(\cdot)$ is a version of the density of $Q_i$.
Definition 8. A propensity variable is a minimal treatment-sufficient covariate, or a
one-one function of the likelihood ratio statistic $\Lambda$.
The concept of a propensity variable is derived from the PS, which is related to
$\Lambda$ in the following way:
$$\Pi = P_\emptyset(T = 1 \mid X) = \Lambda/(1 + \Lambda), \tag{3.18}$$
where the symbol $\sim$ stands for 'is distributed as' and $N$ stands for the normal
distribution, with parameters $d$ and $\delta$ (scalars), $b$ ($p \times 1$) and $\phi$
(scalar). Note that here and in the following models we assume no interactions between
the variables in $X$, although interactions can be formally dealt with via dummy
variables. Suppose $X$ is a strongly sufficient covariate; then the coefficient
$\delta$ of $T$ in (3.19) is the average causal effect ACE, which can be easily proved
as follows:
It is readily seen that the specific causal effect $\mathrm{SCE}_X$ is a constant and
equals $\delta$.
From (3.19), the linear predictor $LP := b'X$ satisfies the conditional independence
properties in Condition (a) of Theorem 2. Thus, $LP$ is a response-sufficient reduction
of $X$, and $E(Y \mid LP, T) = d + \delta T + LP$, with the coefficient $\delta$ of $T$
not depending on the regime by virtue of the sufficiency condition.
Now assume that our model for the observational distribution of $(T, X)$ is as follows:
$$P_\emptyset(T = 1) = \pi \tag{3.20}$$
$$X \mid (T, F_T = \emptyset) \sim N(\mu_T, \Sigma) \tag{3.21}$$
in the observational regime and, because we have assumed Property 1 to hold, also in
the interventional regimes. The observational distribution of $T$ given $X$ is given
by (3.18), with
with
$$\alpha := \Sigma^{-1}(\mu_1 - \mu_0). \tag{3.25}$$
One might intuitively think that the precision of the estimated ACE would be improved
if we were to adjust for a scalar variable, the sample-based propensity variable
$LD^*$, rather than the $p$-dimensional variable $X$. However, Corollary 1 tells us
that adjusting for $LD^*$ does not increase the precision of our estimator. In fact,
whether one adjusts for $LD^*$ or for all the $p$ predictors makes absolutely no
difference to our estimate, and thus to its precision. Similar conclusions have been
drawn in [10, 28, 30]. Our intuition is that the increased precision obtained by
regressing on $V$ is offset by the overfitting error involved in selecting $V$.
Previous evidence [11, 18, 25] supports the claim that the estimated propensity
variable outperforms the true propensity variable, that is, adjusting for the former
yields higher precision of the estimated ACE than adjusting for the latter. These two
types of adjustment correspond to regressing $Y$ on $(T, LD)$ and on $(T, LD^*)$ in our
model, and both provide an unbiased estimator of ACE. The claim obviously cannot always
be valid, as can be seen from the special case $LD = LP$: by Corollary 1, regressing on
$LD^*$ is the same as adjusting for $LP^*$, which by the Gauss–Markov theorem will be
less precise than regressing on the true linear predictor $LP$ (or, equivalently,
$LD$). Nevertheless, the claim is likely to hold when $LD$ is not highly correlated
with $LP$, because $LD$ is then a less precise response predictor.
To gain closer insight into the variance of the estimated ACE obtained by adjusting for
the true propensity variable PV (if known) and for the estimated propensity variable
EPV, we consider a toy example in which the parameters in (3.19)–(3.21) are set as
follows:
$$X_1 \perp\!\!\!\perp X_2 \mid T. \tag{3.27}$$
of the observed $X$, which is equivalent to adjusting for $LD^*$ (or EPV) in the linear
regression approach by Corollary 1. In particular, two linear regressions are
considered:
M0: $Y$ on $(T, X)$,
M1: $Y$ on $(T, X_1)$.
Then the design vector is $(1, T, X_1, X_2)'$ for M0 and $(1, T, X_1)'$ for M1. Let
$\hat\beta_{M_0}$ and $\hat\beta_{M_1}$, respectively, be the least squares estimators
of the parameters in M0 and M1. The asymptotic variance of $\hat\beta_{M_0}$ for sample
size $n$ is then given by
$$\mathrm{Var}_{\mathrm{asy}}(\hat\beta_{M_0}) = \frac{A^{-1}\,\mathrm{Var}(Y \mid T, X)}{n} = \frac{A^{-1}\phi}{n},$$
where
$$A = \begin{pmatrix}
1 & E(T) & E(X_1) & E(X_2) \\
E(T) & E(T) & E(TX_1) & E(TX_2) \\
E(X_1) & E(TX_1) & E(X_1^2) & E(X_1 X_2) \\
E(X_2) & E(TX_2) & E(X_1 X_2) & E(X_2^2)
\end{pmatrix}.$$
Solving for $A^{-1}$ and extracting its (2, 2)th element, which is the variance
multiplier for the coefficient of $T$, we have that
where
and
By (3.26),
and
where, by (3.27),
Hence,
$$\mathrm{Var}_{\mathrm{asy}}(\hat\delta_{M_0}) =
\frac{\phi\,\mathrm{Var}(X_1)/[n\,\pi(1-\pi)]}
{E(X_1^2) - \pi[E(X_1 \mid T=1)]^2 - (1-\pi)[E(X_1 \mid T=0)]^2}. \tag{3.28}$$
For M1, by (3.27),
$$\begin{aligned}
\mathrm{Var}_{\mathrm{asy}}(\hat\delta_{M_1})
&= \frac{W_{X_1 X_1}}{n\,\pi(1-\pi)\,V_{X_1 X_1}}\,\mathrm{Var}(Y \mid T, X_1) \\
&= \frac{W_{X_1 X_1}}{n\,\pi(1-\pi)\,V_{X_1 X_1}}\,\{\phi + b_2^2\,\mathrm{Var}(X_2 \mid T, X_1)\} \\
&= \frac{\{\phi + b_2^2\,\mathrm{Var}(X_2 \mid T, X_1)\}\,\mathrm{Var}(X_1)/[n\,\pi(1-\pi)]}
{E(X_1^2) - \pi[E(X_1 \mid T=1)]^2 - (1-\pi)[E(X_1 \mid T=0)]^2}. \tag{3.29}
\end{aligned}$$
4.2.4 Simulations
Simulations are carried out for numerical illustration. Suppose we have the following
true values for the parameters in (3.19)–(3.21): $p = 2$, $d = 0$, $\delta = 0.5$,
$b = (0, 1)'$, $\phi = 1$, $\pi = 0.5$, $\mu_1 = (1, 0)'$, $\mu_0 = (0, 0)'$,
$\Sigma = I_2$.
Then the population linear predictor is $LP = X_2$, with
$$Y \mid (X, T, F_T) \sim N\!\left(\tfrac{1}{2}T + X_2,\; 1\right),$$
while the population linear discriminant is $LD = X_1$, which is not predictive of $Y$.
Since for any regime $f = 0, 1, \emptyset$,
$$E_f(Y \mid X_1, T) = E_f\{E_f(Y \mid X, T) \mid X_1, T\} = \tfrac{1}{2}T$$
and
the conditional distribution of $Y$ given $(X_1, T)$, for any regime, is then given by
$$Y \mid (X_1, T, F_T) \sim N\!\left(\tfrac{1}{2}T,\; 2\right).$$
To investigate the performance of the population-based as well as the sample-based LP
and PV, we now consider four linear regression models:
M0: $Y$ on $T$ and $X$ ($X = (X_1, X_2)$),
M1: $Y$ on $T$ and $X_1$,
M2: $Y$ on $T$ and $X_2$,
M3: $Y$ on $T$ and $LD^*$,
where M0 is the full model with all parameters unknown. In M1, obtained by setting
$b_2 = 0$, the true linear discriminant $LD = X_1$ is fitted, while fitting the true
linear predictor $LP = X_2$, equivalent to setting $b_1 = 0$, gives M2. Note that all
these models are 'true': for M1 the true value of $b_1$ is 0, and the true residual
variance is 2, as against 1 for M0 and M2. Finally, for any dataset, with no knowledge
of the parameters, we construct the estimated propensity variable $LD^*$ and then fit
model M3.
In each model M$k$, for $k = 0, 1, 2, 3$, the least-squares estimator $\hat\delta_k$ is
unbiased for $\delta = 0.5$. By the Gauss–Markov theorem and Corollary 1,
For the sample analysis, 200 simulated datasets are generated, each of size $n = 20$.
Shown in Fig. 3.5 are the empirical distributions of $\hat\delta_k$ for all four
models. Unsurprisingly, in terms of precision (from high to low), first comes the LP;
next is the estimated propensity variable $LD^*$ (or the estimated linear predictor
$LP^*$), or equivalently $X$ (= $(X_1, X_2)$); and last comes the true propensity
variable $LD = X_1$.
$$X \mid (T, F_T = \emptyset) \sim N(\mu_T, \Sigma_T)$$
$$X \mid F_T \sim (1 - \pi)\, N(\mu_0, \Sigma_0) + \pi\, N(\mu_1, \Sigma_1).$$
Accordingly,
$$\log \Lambda = c + QD,$$
where
$$c = -\frac{1}{2}\left\{\log(\det \Sigma_1) - \log(\det \Sigma_0)
+ \mu_1' \Sigma_1^{-1} \mu_1 - \mu_0' \Sigma_0^{-1} \mu_0\right\}$$
and
$$QD := \left(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0\right)' X
- \frac{1}{2}\, X'\left(\Sigma_1^{-1} - \Sigma_0^{-1}\right) X. \tag{3.30}$$
QD is the quadratic discriminant, including a linear term and a quadratic term in $X$,
which distinguishes the observational distributions of $X$ given $T = 0, 1$. We see
that QD is a minimal treatment-sufficient covariate, and thus a PV, but no longer a
linear function of $X$.
Because of the balancing property of the PS (or PV), it now follows that
$\mathrm{ACE} = E(\mathrm{SCE}_{QD})$, with
$$LD = (\mu_1 - \mu_0)'\,\Sigma^{-1} X,$$
4.3.1 Simulations
Simulated data is based on the above model, with the parameters: p D 20, d D 0,
ı D 0:5, D 0:5, b D .0; 1; : : : ; 0/0 ; 0 D .0; : : : ; 0/0 and 1 D .0:5; 0; 0; : : : ; 0/0 :
Also, ˙0 is set, diagonally, to 0:8 for the first ten entries and to 1:3 for the remaining
entries, and ˙1 the identity matrix.
We then have, for the population, that
5
LD D X1 ;
9
10 20
1 1X 2 3 X 2
PV D QD D X1 C Xi X ;
2 8 iD1 26 jD11 j
[Histogram panels of Fig. 3.6: 'Regression on QD' (mean = 0.4963, sd = 0.1483,
mse = 0.022) and 'Subclassification on QD' (mean = 0.4951, sd = 0.1498, mse = 0.0225).]
Fig. 3.6 Estimates of ACE by four different methods (clockwise): 1 Regression on
population linear predictor $LP = X_2$. 2 Regression on population linear discriminant
$LD = \frac{5}{9} X_1$. 3 Regression on population quadratic discriminant (propensity
variable) QD. 4 Subclassification on QD
Fig. 3.7 Estimates of ACE by four different methods (clockwise): 1 Regression on
sufficient covariate $X$. 2 Regression on sample linear discriminant $LD^*$.
3 Regression on sample quadratic discriminant (propensity variable) $QD^*$.
4 Subclassification on $QD^*$
rather than on $X$, has absolutely no effect on the estimated ACE. $LD^*$ outperforms
$LD$ because the latter does not contain the response predictor. Regressions on $LD$,
QD, and $QD^*$ are roughly equal because, apart from $X_1$, the distributions of the
remaining 19 variables are identical, with rather small multipliers; thus, the two
quadratic terms in QD roughly offset each other, and QD $\approx \frac{1}{2} X_1$
behaves approximately as a function of the single variable $X_1$. Last comes
subclassification on the quadratic PV, particularly when it is estimated.
For simplicity, suppose that $Y$, $T$ ($1 \times 1$) and $X$ ($p \times 1$) are all
binary and the components of $X$ are mutually independent. The joint distribution of
$(F_T, X, T, Y)$ is constructed as follows:
$$X \mid F_T \sim \mathrm{Ber}(\theta) \tag{3.31}$$
$$\mathrm{logit}\{P_\emptyset(T \mid X)\} = c + a'X \tag{3.32}$$
$$\mathrm{logit}\{P_f(Y \mid T, X)\} = d + \delta T + b'X, \tag{3.33}$$
and
$$P(Y = 1 \mid T, X_1, X_2) = P(Y = 1 \mid T, X_2)
= E\left\{\frac{1}{1 + e^{-(d + \delta T + b_2 X_2 + b_3 X_3)}} \,\Big|\, T, X_2\right\}
= \frac{\theta_3}{1 + e^{-(d + \delta T + b_2 X_2 + b_3)}}
+ \frac{1 - \theta_3}{1 + e^{-(d + \delta T + b_2 X_2)}}$$
We illustrate the method with the aid of a study involving 511 subjects sentenced
to prison in 1980 by the California Superior Court, and 511 offenders sentenced
to probation following conviction for certain felonies [2]. These probationers were
matched to the prisoners on county of conviction, condition offence type and risk
of imprisonment quantitative index, so as to bring into the final sample the most
serious offenders on probation and the least serious offenders sentenced to prison.
The structure of this study corresponds to the (partially matched) case–control
design. In fact, this is analogous to the regression discontinuity designs where only
observations near the cut-off of the risk score are included for causal effect analysis
[13]. We were to compare the average causal effect of judicial sanction (probation
or prison) on the probability of re-offence. We specify variables as follows.
• Treatment T: taking values 0 (probation) and 1 (prison);
• Response Y: occurrence of recidivism (re-offence);
• Pre-treatment variable X: including 17 carefully selected non-collinear variables
that we can reasonably assume to make X a strongly sufficient covariate.
Simple random multiple imputation by bootstrapping (R package: mi) was
applied to deal with missing data. We then considered two logistic regressions for
the imputed data:
1. Y on .T; X/, where X includes all the 17 variables.
2. Y on $(T, \mathrm{EPS})$, where EPS is the propensity score estimated from the
logistic regression of $T$ on all 17 variables. In selecting these variables, we took
Fig. 3.9 Distribution density comparison of the estimated propensity score: prison vs. probation
5 Double Robustness
Proof.
$$E_\emptyset\!\left(\frac{T}{\pi(X)}\, Y\right)
= E_\emptyset\!\left\{E_\emptyset\!\left(\frac{T}{\pi(X)}\, Y \,\Big|\, X\right)\right\}
= E_\emptyset\!\left\{\frac{1}{\pi(X)}\, E_\emptyset(TY \mid X)\right\}
= E_\emptyset\!\left\{\frac{1}{\pi(X)}\, E_\emptyset(Y \mid X, T = 1)\, P_\emptyset(T = 1 \mid X)\right\}
= E_\emptyset\{E_\emptyset(Y \mid X, T = 1)\} = \mu_1 \quad \text{by (3.36)}.$$
It automatically follows that
$E_\emptyset\!\left\{\frac{1 - T}{1 - \pi(X)}\, Y\right\} = \mu_0$. By Lemma 4, we see
that, under the observational regime, $\frac{T}{\pi(X)}\, Y$ and
$\frac{1 - T}{1 - \pi(X)}\, Y$ are unbiased estimators of $\mu_1$ and $\mu_0$,
respectively.
One may have noticed that the two terms for ACE in (3.37) are similar to the
Horvitz–Thompson (HT) estimator for sample surveys [12]. They are, however, different
in various respects. The aim of the HT estimator is to estimate the mean of a finite
population $Y_1, \ldots, Y_N$, denoted by $\mu = N^{-1}\sum_{i=1}^{N} Y_i$, from a
stratified sample of size $n$ drawn without replacement. For $i = 1, \ldots, N$, let
$\xi_i$ be the binary sampling indicator ($\xi_i = 1$: unit $i$ is in the sample; 0:
unit $i$ is not in the sample), and let $\pi_i$ be the probability that unit $i$ is
drawn into the sample. Then the HT estimator is given by
$$\hat\mu_{HT} = N^{-1} \sum_{i=1}^{N} \frac{\xi_i}{\pi_i}\, Y_i, \tag{3.38}$$
where $\pi_i$ is pre-specified, and thus known, in a sample survey design. But the
propensity model $\pi(X)$ in (3.37) is normally unknown. Moreover, the HT estimator is
applied to estimate the mean of a finite population, while $\widehat{\mathrm{ACE}}$ is
used to estimate the mean of a superpopulation.⁵ The HT estimator depends on a
pre-specified sampling scheme, but the observations involved in
$\widehat{\mathrm{ACE}}$ are generated from, and thus are dependent on, the joint
distribution of $(X, T, Y)$ in the observational regime. Nevertheless, both the HT
estimator and $\widehat{\mathrm{ACE}}$ are formed by means of the inverse probability
weights $1/\pi_i$ or $1/\pi(X)$. In fact, the HT estimator is also termed the inverse
probability weighted (IPW) estimator.
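As a toy numerical illustration of (3.38), the following R sketch computes the HT
estimate of a finite-population mean from hypothetical vectors y (population values),
xi (0/1 sampling indicators) and prob (inclusion probabilities); all names are
assumptions for this example.

```r
## Horvitz-Thompson estimate of a finite-population mean (sketch).
## y, xi (0/1 sampling indicator) and prob (inclusion probabilities) are
## hypothetical vectors of length N; only sampled units (xi == 1) contribute,
## so in practice Y_i is needed only where xi == 1.
ht_mean <- function(y, xi, prob) {
  N <- length(y)
  sum(xi * y / prob) / N
}
```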
Sample surveys are closely related to missing data, because the information is missing
for those not sampled. So the IPW estimator is frequently used in missing data models
in the presence of a partially observed response [1, 3, 14]. As counterfactuals are
also regarded as missing data, the IPW estimator can be used in the potential response
framework, with half of the information observed, to make causal inference about the
treatment effect under the assumptions of 'strongly ignorable treatment assignment',
$(Y(0), Y(1)) \perp\!\!\!\perp T \mid X$, and 'no unobserved confounders' [1, 29].
From the above discussion, there exists an unbiased estimator of ACE if either the RRM
or the PM is correct. However, since the true RRM and PM are unknown, it is impossible
to decide whether they are correctly specified. Nevertheless, the augmented inverse
probability weighted (AIPW) estimator can be constructed by combining the two models,
in the following alternative forms:
$$\hat\mu_{1,\mathrm{AIPW}} = m(X) + \frac{T}{\pi(X)}\,(Y - m(X))
= \frac{T}{\pi(X)}\, Y + \left(1 - \frac{T}{\pi(X)}\right) m(X), \tag{3.39}$$
and similarly,
$$\hat\mu_{0,\mathrm{AIPW}} = m(X) + \frac{1 - T}{1 - \pi(X)}\,(Y - m(X))
= \frac{1 - T}{1 - \pi(X)}\, Y + \left(1 - \frac{1 - T}{1 - \pi(X)}\right) m(X), \tag{3.40}$$
where $m(\cdot)$ and $\pi(\cdot)$ are arbitrary functions of $X$. As its name indicates,
$\hat\mu_{t,\mathrm{AIPW}}$ is the sum of the IPW estimator and an augmentation term.
5
In a causal system, the finite number of individuals in a study is called the
'population', which can be regarded as a sample from a larger 'superpopulation' of
interest.
$$\widehat{\mathrm{ACE}}_{\mathrm{AIPW}} = \hat\mu_{1,\mathrm{AIPW}} - \hat\mu_{0,\mathrm{AIPW}}.$$
Theorem 4. Suppose that $X$ is a strongly sufficient covariate. Then the AIPW estimator
$\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$ is doubly robust.
To prove Theorem 4, we simply use the fact that both $\hat\mu_{1,\mathrm{AIPW}}$ and
$\hat\mu_{0,\mathrm{AIPW}}$ are doubly robust, and hence so is their difference.
Suppose that we specify two parametric working models: the propensity working model
$\pi(X; \alpha)$ and the response regression working model $m(T, X; \beta)$. Then, by
(3.39) and (3.40), we have, for the estimated $E_1(Y)$ and $E_0(Y)$,
$$\hat\mu_{1,\mathrm{AIPW}} = n^{-1} \sum_{i=1}^{n} \left\{ \frac{T_i}{\pi(X_i; \hat\alpha)}\, Y_i
+ \left( 1 - \frac{T_i}{\pi(X_i; \hat\alpha)} \right) m(1, X_i; \hat\beta) \right\} \tag{3.41}$$
and
$$\hat\mu_{0,\mathrm{AIPW}} = n^{-1} \sum_{i=1}^{n} \left\{ \frac{1 - T_i}{1 - \pi(X_i; \hat\alpha)}\, Y_i
+ \left( 1 - \frac{1 - T_i}{1 - \pi(X_i; \hat\alpha)} \right) m(0, X_i; \hat\beta) \right\}. \tag{3.42}$$
Hence
$$\widehat{\mathrm{ACE}}_{\mathrm{AIPW}} = \hat\mu_{1,\mathrm{AIPW}} - \hat\mu_{0,\mathrm{AIPW}}
= n^{-1} \sum_{i=1}^{n} \left\{ \left( \frac{T_i}{\pi(X_i; \hat\alpha)}
- \frac{1 - T_i}{1 - \pi(X_i; \hat\alpha)} \right)
\big( Y_i - m(T_i, X_i; \hat\beta) \big) \right\}, \tag{3.43}$$
which is doubly robust, i.e., $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$ is a consistent
and asymptotically normal estimator of ACE if either of the working models is correctly
specified.
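A minimal R sketch of an AIPW estimate in the spirit of (3.41)–(3.43), assuming a data
frame with columns y, t (0/1 treatment) and covariates x1, x2 (all hypothetical names),
a logistic propensity working model and a linear response working model, is:

```r
## Doubly robust (AIPW) estimate of ACE (sketch under the stated assumptions).
## `dat` with columns y, t (0/1 treatment) and covariates x1, x2 is hypothetical.
aipw_ace <- function(dat) {
  pm  <- glm(t ~ x1 + x2, family = binomial, data = dat)  # propensity working model
  rrm <- lm(y ~ t + x1 + x2, data = dat)                  # response regression working model
  ps  <- fitted(pm)
  m1  <- predict(rrm, newdata = transform(dat, t = 1))    # m(1, X; beta-hat)
  m0  <- predict(rrm, newdata = transform(dat, t = 0))    # m(0, X; beta-hat)
  mu1 <- mean(dat$t * dat$y / ps + (1 - dat$t / ps) * m1)                          # (3.41)
  mu0 <- mean((1 - dat$t) * dat$y / (1 - ps) + (1 - (1 - dat$t) / (1 - ps)) * m0)  # (3.42)
  mu1 - mu0
}
```

The point estimate remains consistent if either the logistic propensity model or the
linear response model in this sketch is correctly specified.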
5.2.1 Discussion
Kang and Schafer [14] state that there are various ways to construct a doubly robust
estimator. In our view, they are essentially the same, i.e., each must take the same
(or a similar) form as the AIPW estimator constructed by combining the RRM and the PM.
Other constructions proposed in [14] are just variations of the AIPW estimator. For
example, in (3.38), instead of using $N$ as the denominator, one can use the normalised
weights with denominator $\sum_{i=1}^{N} \xi_i/\pi_i$. Such normalised weights are
especially useful for improving precision when subjects with very small probabilities
of being sampled are actually drawn into the sample: if $N$ is used as the denominator,
these subjects will influence the estimated average response enormously and,
consequently, result in poor precision.
Kang and Schafer [14] have also investigated the precision of a doubly robust estimator
when both $\pi(X)$ and $m(X)$ are moderately misspecified. They state that 'in at least
some settings, two wrong models are not better than one'. This seems obvious, because
the performance of the estimator depends on the degree of misspecification of both
models. This can be analysed easily in theory, but it is far more complicated in
practice, as one cannot have good control over the specification of $\pi(X)$ and
$m(X)$ based on limited observed data and previous experience (if any). Therefore, it
would be difficult to measure to what extent the specified models differ from the true
ones.
We have already seen that $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$ is an unbiased and
doubly robust estimator of ACE. How, then, can we choose the arbitrary function
$m(X_i)$ to minimise the variance of
$\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$, given a correct PM? Suppose that, in an
experiment, we know $\pi(X_i) = P(T_i = 1 \mid X_i)$. Then, in terms of the variance,
we have
$$\begin{aligned}
\mathrm{Var}(\widehat{\mathrm{ACE}}_{\mathrm{AIPW}})
&= \mathrm{Var}\left\{ n^{-1} \sum_{i=1}^{n} \left( \frac{T_i}{\pi(X_i)}
- \frac{1-T_i}{1-\pi(X_i)} \right) (Y_i - m(X_i)) \right\} \\
&= n^{-2} \left\{ \mathrm{Var}\left[ \sum_{i=1}^{n} \left( \frac{T_i}{\pi(X_i)}
- \frac{1-T_i}{1-\pi(X_i)} \right) Y_i \right]
+ \mathrm{Var}\left[ \sum_{i=1}^{n} \left( \frac{T_i}{\pi(X_i)}
- \frac{1-T_i}{1-\pi(X_i)} \right) m(X_i) \right] \right. \\
&\qquad \left. {} - 2\,\mathrm{Cov}\left[ \sum_{i=1}^{n} \left( \frac{T_i}{\pi(X_i)}
- \frac{1-T_i}{1-\pi(X_i)} \right) Y_i,\;
\sum_{i=1}^{n} \left( \frac{T_i}{\pi(X_i)} - \frac{1-T_i}{1-\pi(X_i)} \right) m(X_i) \right] \right\} \\
&= \mathrm{Var}(\widehat{\mathrm{ACE}}_{\mathrm{HT}})
+ n^{-2}\, E\left[ \sum_{i=1}^{n} \frac{m^2(X_i)}{\pi(X_i)(1-\pi(X_i))}
- 2 \sum_{i=1}^{n} m(X_i)\left( \frac{\mu_{1i}}{\pi(X_i)(1-\pi(X_i))}
- \frac{\mu_{1i} - \mu_i}{(1-\pi(X_i))^2} \right) \right],
\end{aligned}$$
which minimises the variance of $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$ among all
functions of $X_i$. In fact, if either $\pi(X_i) = P_\emptyset(T_i = 1 \mid X_i)$ or
(3.44) holds, $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$ is unbiased, and thus doubly
robust.
Let $m_1(X_i)$ and $m_0(X_i)$ denote the regressions of $Y$ on $X_i$ for the two
treatment groups in the observational regime. It is not necessary that
$m_1(X_i) = E_\emptyset(Y_i \mid X_i, T_i = 1)$ and
$m_0(X_i) = E_\emptyset(Y_i \mid X_i, T_i = 0)$. As long as $m(X_i)$ is specified as
the sum of the weighted expectations in the form of (3.44), $m(X_i)$ minimises the
variance of the estimated ACE.
$$\begin{aligned}
m(X_i) &= \frac{1 - \pi(X_i)}{\pi(X_i)}\, E_\emptyset(Y_i \mid X_i, T_i = 1)\, P(T_i = 1 \mid X_i)
+ \frac{\pi(X_i)}{1 - \pi(X_i)}\, E_\emptyset(Y_i \mid X_i, T_i = 0)\, P(T_i = 0 \mid X_i) \\
&= \frac{1 - \pi(X_i)}{\pi(X_i)}\, E_\emptyset(T_i Y_i \mid X_i)
+ \frac{\pi(X_i)}{1 - \pi(X_i)}\, E_\emptyset[(1 - T_i) Y_i \mid X_i] \\
&= E_\emptyset\!\left\{ \left[ \frac{1 - \pi(X_i)}{\pi(X_i)}\, T_i
+ \frac{\pi(X_i)}{1 - \pi(X_i)}\, (1 - T_i) \right] Y_i \,\Big|\, X_i \right\} \\
&= E_\emptyset\!\left\{ \left[ \left( \frac{1}{\pi(X_i)} - 1 \right) T_i
+ \left( \frac{1}{1 - \pi(X_i)} - 1 \right) (1 - T_i) \right] Y_i \,\Big|\, X_i \right\} \\
&= E_\emptyset(\widetilde{Y}_i \mid X_i),
\end{aligned}$$
To show the difference between these approaches, we have implemented Monte Carlo
computations for four estimators of $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$:
1. by (3.44), with $E_\emptyset(Y_i \mid X_i, T_i = 1)$ and
$E_\emptyset(Y_i \mid X_i, T_i = 0)$ estimated by regressing $Y_i$ on $(X_i, T_i)$;
2. by (3.44), with $E_\emptyset(Y_i \mid X_i, T_i = 1)$ and
$E_\emptyset(Y_i \mid X_i, T_i = 0)$ estimated by regressing $Y_i$ on $X_i$ for the
treatment group and the control group separately;
3. by the Horvitz–Thompson approach, i.e., without covariate adjustment;
4. by regression of $\widetilde{Y}_i$ on $X_i$.
The results for 100 simulated datasets are shown in Fig. 3.10. The first two approaches
give similar results; that is, we can estimate $E_\emptyset(Y_i \mid X_i, T_i = 1)$ and
$E_\emptyset(Y_i \mid X_i, T_i = 0)$ either simultaneously from the response regression
on the treatment and $X$, or separately from the response regression on $X$ alone for
each of the two groups. As expected, the last approach generates several extreme
estimates
relative to the others, which makes its variance even much larger than that of the HT
estimator.
[Fig. 3.10: estimated ACE for the 100 simulated datasets under the four estimators:
(1), (2) AIPW with models for $E_\emptyset(Y_i \mid X_i)$ fitted jointly or separately
for the two groups; (3) Horvitz–Thompson estimator; (4) regression of $\widetilde{Y}_i$
on $X_i$.]
Suppose that $E_\emptyset(Y_i \mid X_i, T_i = 1)$ and $E_\emptyset(Y_i \mid X_i, T_i = 0)$
are both known, but the PM is not. Then the AIPW estimator can be constructed as
$$\widehat{\mathrm{ACE}}_{\mathrm{AIPW}} = n^{-1} \sum_{i=1}^{n} \left\{
\left( \frac{T_i}{g(X_i)} - \frac{1 - T_i}{1 - g(X_i)} \right) (Y_i - m(X_i)) \right\},$$
where
$$\begin{aligned}
\mathrm{Var}(\widehat{\mathrm{ACE}}_{\mathrm{AIPW}})
&= \mathrm{Var}\left\{ n^{-1} \sum_{i=1}^{n} \left( \frac{T_i}{g(X_i)}
- \frac{1 - T_i}{1 - g(X_i)} \right) (Y_i - m(X_i)) \right\} \\
&= n^{-2}\, \mathrm{Var}\left\{ \sum_{i=1}^{n} \left( \frac{T_i}{g(X_i)}
- \frac{1 - T_i}{1 - g(X_i)} \right)
\big( Y_i - [(1 - g(X_i))\mu_{1i} + g(X_i)\mu_{0i}] \big) \right\} \\
&= n^{-2}\, \mathrm{Var}\left\{ \sum_{i=1}^{n} \left[ (\mu_{1i} - \mu_{0i})
+ \frac{T_i}{g(X_i)} (Y_i - \mu_{1i})
- \frac{1 - T_i}{1 - g(X_i)} (Y_i - \mu_{0i}) \right] \right\} \\
&= n^{-2}\, \mathrm{Var}\left\{ \sum_{i=1}^{n} (\mu_{1i} - \mu_{0i}) \right\}
+ n^{-2}\, E\left\{ \sum_{i=1}^{n} \mathrm{Var}\left[ \frac{T_i}{g(X_i)} (Y_i - \mu_{1i})
- \frac{1 - T_i}{1 - g(X_i)} (Y_i - \mu_{0i}) \,\Big|\, X_i \right] \right\} \\
&> n^{-2}\, \mathrm{Var}\left\{ \sum_{i=1}^{n} (\mu_{1i} - \mu_{0i}) \right\}
= \mathrm{Var}(\widehat{\mathrm{ACE}}_{\mathrm{RRM}}).
\end{aligned}$$
Hence, we conclude that if the conditional expectations of the response given $X_i$ for
both groups are known or correctly specified, then $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$
will be less precise than the ACE estimated from the response regressions.
5.3.3 Discussion
If the PM is known, then the variance of $\widehat{\mathrm{ACE}}_{\mathrm{AIPW}}$ is
minimised when $m(X_i)$ is specified as in (3.44), where separate specification of
$m_1(X_i)$ and $m_0(X_i)$ is not necessary. Rubin and van der Laan [26] have introduced
a weighted response as an alternative, but we have shown by simulation that it can
result in a large variance of the estimated ACE, possibly even larger than that of the
HT estimator. In the case that the RRM is correctly specified, i.e.,
$m_1(X_i) = E_\emptyset(Y_i \mid X_i, T_i = 1)$ and
$m_0(X_i) = E_\emptyset(Y_i \mid X_i, T_i = 0)$, these two models, rather than the AIPW
estimator, should be used to estimate ACE for higher precision.
6 Summary
################################################################
Figure 5: Linear regression (homoscedasticity)
----------------------------------------------------------------
1. Y on X;
2. Y on population linear discriminant / propensity variable LD;
3. Y on sample linear discriminant / propensity variable LD*;
4. Y on population linear predictor LP.
################################################################
## set parameters
p     <- 2      # number of covariates
delta <- 0.5    # true ACE
phi   <- 1      # error standard deviation
n     <- 20     # sample size per simulated dataset
## NOTE: 'sigma' (covariance of the covariate noise), 'alpha' (covariate mean shift between
## arms), 'main' (the four panel titles) and the response-generating step are defined in the
## full appendix; they are not shown in this excerpt.
ps <- function(r) {
  set.seed(r)
  .Random.seed
  t  <- rbinom(n, 1, 0.5)                   # randomized binary treatment
  require(MASS)
  m  <- rep(0, p)
  ex <- mvrnorm(n, mu = m, Sigma = sigma)   # covariate noise
  x  <- t %*% alpha + ex                    # covariates, shifted by treatment
  d1 <- data.frame(x, t)
  c  <- coef(lda(t ~ ., d1))                # sample linear discriminant coefficients
  ld <- x %*% c                             # sample linear discriminant / propensity variable LD*
  ## ... the four regression estimates of the ACE (Y on X, LD, LD*, LP) are computed
  ## and returned here in the full appendix ...
}
g <- rep(0, 4)
for (r in 31:230) {                         # 200 simulated datasets
  g <- rbind(g, ps(r))
}
g <- g[-1, ]
d.mean <- 0
d.sd   <- 0
mse    <- 0
for (i in 1:4) {
  d.mean[i] <- round(mean(g[, i]), 4)
  d.sd[i]   <- round(sd(g[, i]), 4)
  mse[i]    <- round((d.sd[i])^2 + (d.mean[i] - delta)^2, 4)   # MSE about the true ACE
}
## generate Figure 5
for (i in 1:4) {
  hist(g[, i], br = seq(-2.5, 2.5, 0.5), xlim = c(-2.5, 2.5), ylim = c(0, 80),
       main = main[i], col.lab = "blue", xlab = "", ylab = "", col = "magenta")
  legend(-2.5, 85, c(paste("mean = ", d.mean[i]), paste("sd = ", d.sd[i]),
                     paste("mse = ", mse[i])), cex = 0.85, bty = "n")
}
mtext(side = 3, cex = 1.2, line = -1.1, outer = T, col = "blue",
      text = "Linear regression (homoscedasticity) [200 datasets]")
###########################################################################
Linear regression and subclassification (heteroscedasticity)
---------------------------------------------------------------------------
Figure 6:
1. Regression on population linear predictor LP;
2. Regression on population linear discriminant LD;
3. Regression on population quadratic discriminant / propensity variable QD;
4. Subclassification on QD.
Figure 7:
1. Regression on sample linear predictor LP*;
2. Regression on sample linear discriminant LD*;
3. Regression on sample quadratic discriminant / propensity variable QD*;
4. Subclassification on QD*.
###########################################################################
## set parameters
p     <- 20     # number of covariates
d     <- 0
delta <- 0.5    # true ACE
phi   <- 1
n     <- 500    # sample size per simulated dataset
## NOTE: 'sigma0' and 'sigma1' (group-specific covariance matrices), 'alpha', the construction
## of the covariate matrix 'x' and of the response, the sample moments 'm0', 'm1', 'v0', 'v1',
## the data frame 'd2' and the containers 'tm1', 'tm0', 'te.qd' are defined in the full
## appendix; they are not shown in this excerpt.
ps <- function(r) {
  set.seed(r)
  .Random.seed
  pi <- 0.5
  t  <- rbinom(n, 1, pi)                    # randomized binary treatment
  n0 <- 0
  for (i in 1:n) {                          # number of control units
    if (t[i] == 0)
      n0 <- n0 + 1
  }
  require(MASS)
  m   <- rep(0, p)
  ex0 <- mvrnorm(n0, mu = m, Sigma = sigma0)        # covariate noise, control group
  ex1 <- mvrnorm((n - n0), mu = m, Sigma = sigma1)  # covariate noise, treatment group
  ld  <- x %*% solve(pi*sigma1 + pi*sigma0) %*% t(alpha)  # population linear discriminant LD
  d1  <- data.frame(x, t)
  c   <- coef(lda(t ~ ., d1))
  ld.s <- x %*% c                           # sample linear discriminant LD*
  c1   <- solve(v1) %*% t(m1) - solve(v0) %*% t(m0)
  z1.s <- x %*% c1
  c2   <- solve(v1) - solve(v0)
  z2.s <- 0
  for (i in 1:n) {
    z2.s[i] <- -1/2 * matrix(x[i,], nrow=1) %*% c2 %*% t(matrix(x[i,], nrow=1))
  }
  qd.s <- z1.s + z2.s                       # sample quadratic discriminant / propensity variable QD*
  ## subclassification: order by the discriminant, form five strata of 100 units each,
  ## take within-stratum treated/control means and average the differences
  for (k in 1:2) {
    d3 <- d2[, c(k, 3, 4)]
    d3 <- split(d3[order(d3[,1]), ], rep(1:5, each = 100))
    tm <- vector("list", 5)
    for (j in 1:5) {
      tm[[j]] <- aggregate(d3[[j]], list(Stratum = d3[[j]]$t), FUN = mean)
      tm1[[k]][j] <- tm[[j]][2,3]           # stratum mean, treated
      tm0[[k]][j] <- tm[[j]][1,3]           # stratum mean, control
    }
    te.qd[k] <- sum(tm1[[k]] - tm0[[k]])/5  # subclassification estimate of the ACE
  }
  ## ... the eight estimates (four population-based, four sample-based) are returned here
  ## in the full appendix ...
}
g <- rep(0, 8)
for (r in 31:230) {                         # 200 simulated datasets
  g <- rbind(g, ps(r))
}
g <- g[-1, ]
d.mean <- 0
d.sd   <- 0
d.mse  <- 0
for (i in 1:8) {
  d.mean[i] <- round(mean(g[,i]), 4)
  d.sd[i]   <- round(sd(g[,i]), 4)
  d.mse[i]  <- round((d.sd[i])^2 + (d.mean[i] - delta)^2, 4)   # MSE about the true ACE
}
## generate Figure 6 (the corresponding plotting block, using g[, 1:4], is not shown here)
## generate Figure 7
main <- c("Regression on X", "Subclassification on QD*",
          "Regression on LD*", "Regression on QD*")
for (i in 1:4) {
  hist(g[, i+4], br = seq(-0.1, 1.1, 0.1), xlim = c(-0.1, 1.1), ylim = c(0, 80),
       main = main[i], col.lab = "blue", xlab = "", ylab = "", col = "magenta")
  legend(-0.2, 85, c(paste("mean = ", d.mean[i+4]), paste("sd = ", d.sd[i+4]),
                     paste("mse = ", d.mse[i+4])), cex = 0.85, bty = "n")
}
mtext(side = 3, cex = 1.2, line = -1.1, outer = T, col = "blue",
      text = "Linear regression and subclassification
      (heteroscedasticity, sample) [200 datasets]")
######################################################################
Figure 9 and Table 1: Propensity analysis of custodial sanctions study
----------------------------------------------------------------------
1. Y on all 17 variables X;
2. Y on estimated propensity score EPS.
######################################################################
set.seed(100)
.Random.seed
library(mi)
data.imp <- random.imp(dAll)   # simple random imputation of the missing covariate values
## propensity score model: probability of a prison sentence given the 17 covariates
glm.ps<-glm(Sentenced_to_prison~
Age_at_1st_yuvenile_incarceration_y +
N_prior_adult_convictions +
Type_of_defense_counsel +
Guilty_plea_with_negotiated_disposition +
N_jail_sentences_gr_90days +
N_juvenile_incarcerations +
Monthly_income_level +
Total_counts_convicted_for_current_sentence +
Conviction_offense_type +
Recent_release_from_incarceration_m +
N_prior_adult_StateFederal_prison_terms +
Offender_race +
Offender_released_during_proceed +
Separated_or_divorced_at_time_of_sentence +
Living_situation_at_time_of_offence +
Status_at_time_of_offense +
Any_victims_female,
data = data.imp, family=binomial)
summary(glm.ps)
eps <- predict(glm.ps, data = data.imp[, -1], type='response')   # fitted propensity scores
d.eps <- data.frame(data.imp, Est.ps = eps)
library(ggplot2)
## outcome model: Recidivism on the treatment indicator and all 17 covariates
glm.y.allx<-glm(Recidivism~
Sentenced_to_prison +
Age_at_1st_yuvenile_incarceration_y +
N_prior_adult_convictions +
Type_of_defense_counsel +
Guilty_plea_with_negotiated_disposition +
N_jail_sentences_gr_90days +
N_juvenile_incarcerations +
Monthly_income_level +
Total_counts_convicted_for_current_sentence +
Conviction_offense_type +
Recent_release_from_incarceration_m +
N_prior_adult_StateFederal_prison_terms +
Offender_race +
Offender_released_during_proceed +
Separated_or_divorced_at_time_of_sentence +
Living_situation_at_time_of_offence +
Status_at_time_of_offense +
Any_victims_female,
data = d.eps, family=binomial)
summary(glm.y.allx)
References
1. Bang, H., Robins, J.M.: Doubly robust estimation in missing data and causal inference models.
Biometrics 61, 962–972 (2005)
2. Berzuini, G.: Causal inference methods for criminal justice data, and an application to the study
of the criminogenic effect of custodial sanctions. MSc Thesis in Applied Statistics, Birkbeck
College, University of London (2013)
3. Carpenter, J.R., Kenward, M.G., Vansteelandt, S.: A comparison of multiple imputation and
doubly robust estimation for analyses with missing data. J. R. Stat. Soc. Ser. A 169, 571–584
(2006)
4. Dawid, A.P.: Conditional independence in statistical theory (with discussion). J. R. Stat. Soc.
Ser. B 41, 1–31 (1979)
5. Dawid, A.P.: Conditional independence for statistical operations. Ann. Stat. 8, 598–617 (1980)
6. Dawid, A.P.: Causal inference without counterfactuals. J. Am. Stat. Assoc. 95, 407–424 (2000)
7. Dawid, A.P.: Influence diagrams for causal modelling and inference. Int. Stat. Rev. 70, 161–189
(2002)
8. Fisher, R.A.: Theory of statistical estimation. Proc. Camb. Philos. Soc. 22, 700–725 (1925)
9. Guo, H., Dawid, A.P.: Sufficient covariates and linear propensity analysis. In: Teh, Y.W.,
Titterington, D.M. (eds.) Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS), Chia Laguna, Sardinia, Italy, 13–15 May 2010. Journal
of Machine Learning Research Workshop and Conference Proceedings, vol. 9, pp. 281–288
(2010)
10. Hahn, J.: On the role of the propensity score in efficient semiparametric estimation of average
treatment effects. Econometrica 66, 315–331 (1998)
11. Hirano, K., Imbens, G.W., Ridder, G.: Efficient estimation of average treatment effects using
the estimated propensity score. Econometrica 71, 1161–1189 (2003)
12. Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite
universe. J. Am. Stat. Assoc. 47, 663–685 (1952)
13. Imbens, G.W., Lemieux, T.: Regression discontinuity designs: a guide to practice. J. Econ. 142,
615–635 (2007)
14. Kang, J.D.Y., Schafer, J.L.: Demystifying double robustness: a comparison of alternative
strategies for estimating a population mean from incomplete data. Stat. Sci. 22, 523–539 (2007)
15. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic, New York (1979)
16. Pearl, J.: Causal diagrams for empirical research (with discussion). Biometrika 82, 669–710
(1995)
17. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cam-
bridge (2000)
18. Robins, J.M., Mark, S.D., Newey, W.K.: Estimating exposure effects by modelling the
expectation of exposure conditional on confounders. Biometrics 48, 479–495 (1992)
19. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies
for causal effects. Biometrika 70, 44–55 (1983)
20. Rosenbaum, P.R., Rubin, D.B.: Reducing bias in observational studies using subclassification
on the propensity score. J. Am. Stat. Assoc. 79, 516–524 (1984)
21. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies.
J. Educ. Psychol. 66, 688–701 (1974)
22. Rubin, D.B.: Assignment to treatment group on the basis of a covariate. J. Educ. Stat. 2, 1–26
(1977)
23. Rubin, D.B.: Bayesian inference for causal effects: the role of randomization. Ann. Stat. 6,
34–68 (1978)
24. Rubin, D.B.: Matched Sampling for Causal Effects. Cambridge University Press, Cambridge
(2006)
25. Rubin, D.B., Thomas, N.: Characterizing the effect of matching using linear propensity score
methods with normal distributions. Biometrika 79, 797–809 (1992)
26. Rubin, D.B., van der Laan, M.J.: Covariate adjustment for the intention-to-treat parameter with empirical efficiency maximization. U.C. Berkeley Division of Biostatistics Working Paper 229 (2008)
27. Sekhon, J.: Multivariate and propensity score matching software with automated balance
optimization: the matching package for R. J. Stat. Softw. 42, 1–52 (2011)
28. Senn, S., Graf, E., Caputo, A.: Stratification for the propensity score compared with linear
regression techniques to assess the effect of treatment or exposure. Stat. Med. 26, 5529–5544
(2007)
29. Tang, Z.: Understanding OR, PS, and DR, Comment on “Demystifying double robustness: a
comparison of alternative strategies for estimating a population mean from incomplete data”
by Kang and Schafer. Stat. Sci. 22, 560–568 (2007)
30. Winkelmayer, W.C., Kurth, T.: Propensity scores: help or hype? Nephrol. Dial. Transplant. 19,
1671–1673 (2004)
Chapter 4
A Robustness Index of Propensity Score
Estimation to Uncontrolled Confounders
1 Introduction
W. Pan ()
Duke University School of Nursing, 307 Trent Dr., DUMC Box 3322, Durham, NC 27710, USA
e-mail: wei.pan@duke.edu
H. Bai
Department of Educational & Human Sciences, University of Central Florida, PO Box 161250,
Orlando, FL 32816, USA
e-mail: haiyan.bai@ucf.edu
Over the past three decades, propensity score analysis has become increasingly
popular in social, behavioral, and health research for making causal inferences based
on non-RCTs or observational studies [2, 22].
Propensity score methods start with estimating propensity scores. Denote $z$ as a treatment condition. Suppose one has $N$ units (e.g., subjects). For each unit $i$ ($i = 1, \ldots, N$), $z_i = 1$ indicates that unit $i$ is in the treatment group and $z_i = 0$ indicates that unit $i$ is in the comparison group. Suppose each unit $i$ also has a covariate value vector $X_i = (X_{i1}, \ldots, X_{iK})'$, where $K$ is the number of covariates. Rosenbaum and Rubin [30] defined a propensity score for unit $i$ as the probability of the unit being assigned to the treatment group, conditional on the covariate vector $X_i$, that is, $e(X_i) = \Pr(z_i = 1 \mid X_i)$. They also recommended using the logit of the propensity score, $\ln\frac{e(X_i)}{1 - e(X_i)}$, to achieve normality.
by unobserved covariates that are not included in the propensity score model and
leaves little room for confounding by unobserved covariates that are correlated to
the observed covariates [34, 39].
All the aforementioned approaches to sensitivity analysis of uncontrolled con-
founders, however, fell short of assessing the robustness of propensity score
estimation to the impact of uncontrolled confounders. Robustness is about how
sensitive is too sensitive. In other words, if adding an unobserved covariate in the
propensity score model changes the propensity score estimates significantly, the
propensity score model is too sensitive or not robust to the impact of the additional
unobserved covariate. Following the Pan and Frank [24] approach to assessing the robustness of confounders in linear models, the present study presents a new technique to assess not only the sensitivity but also the robustness of propensity score estimation to uncontrolled confounders, by borrowing information from observed covariates.
where $X_i^{(j)} = \left(X_{i1}, \ldots, X_{i(j-1)}, X_{i(j+1)}, \ldots, X_{iK}\right)'$, $i = 1, \ldots, N$, for each covariate $j$ ($j = 1, \ldots, K$).
For each unit i (i D 1, : : : , N), if the robustness index Ri is larger than .95, which
is analogous to a significance level of .05, one could claim that the propensity
score estimation is robust to the impact of uncontrolled confounders for that unit.
Then, for all the N units, if a majority of the Ri ’s (e.g., more than 80 %) are
larger than .95, one could claim that the propensity score estimation is robust to
uncontrolled confounders for the entire sample. If 50–80 % of the Ri ’s are larger
than .95, one could caution that the propensity score estimation may be sensitive
to uncontrolled confounders for the entire sample. If less than 50 % of the Ri ’s are
larger than .95, one could conclude that the propensity score estimation is not robust
to uncontrolled confounders for the entire sample. It is worth noting that the cut-off percentages (i.e., 50 % and 80 %) are arbitrary, and researchers can adopt more appropriate cut-off percentages for their own specific research areas.
Now the problem is how to obtain the robustness index $R_i$ without knowing the behavior of the sampling distribution, or reference distribution, of the sensitivity indices $\hat\Delta_{ij}$. Following the Pan and Frank [24, 25] approach, one could first approximate the shape of the distribution of the sensitivity indices $\hat\Delta_{ij}$ by one of the Pearson distributions [26], based on the first four moments of $\hat\Delta_{ij}$:
$$\hat\mu'_{i1} = \frac{1}{K}\sum_{j=1}^{K}\hat\Delta_{ij} \quad\text{and}\quad \hat\mu_{im} = \frac{1}{K}\sum_{j=1}^{K}\left(\hat\Delta_{ij} - \hat\mu'_{i1}\right)^{m}, \quad m = 2, 3, 4. \qquad (4.4)$$
Once the approximating Pearson density $f_i(t)$ is obtained, the robustness index is
$$\hat R_i = \int_{\min_{1 \le j \le K}\hat\Delta_{ij}}^{\max_{1 \le j \le K}\hat\Delta_{ij}} f_i(t)\, dt. \qquad (4.5)$$
The integration in Eq. 4.5 can be numerically evaluated using a SAS macro compiled
by Pan and Boling [23].
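For readers working in R, a minimal sketch of the same calculation is given below. It assumes the CRAN package PearsonDS and a numeric vector delta holding the $K$ sensitivity indices of one unit; both the package choice and the names are assumptions of this sketch and are not part of the original SAS implementation.

## Illustrative R alternative to the SAS macro (a sketch, not the authors' implementation):
## approximate the reference distribution of one unit's sensitivity indices by a Pearson
## distribution matched to the first four moments, then integrate it between the extremes.
library(PearsonDS)
robustness_index <- function(delta) {
  fit <- pearsonFitM(moments = empMoments(delta))  # Pearson family matched to the 4 moments
  ppearson(max(delta), params = fit) -             # integral of f_i(t) from min to max,
    ppearson(min(delta), params = fit)             # i.e. the robustness index of Eq. (4.5)
}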
4 An Empirical Example
The data for this empirical example were selected from a national database of
10,500 at-risk youth in a national evaluation of the High-Risk Youth Demonstration
Grant Programs sponsored by the Substance Abuse and Mental Health Services
Administration [32] focusing on prevention of substance use. The evaluation
compared participants and non-participants of funded prevention programs over
18 months with respect to socio-demographic risk and protective factors. For
demonstration purposes only, we sampled 547 youth who initiated substance use
prior to entry to the national evaluation. Among the 547 youth in the sample,
nT D 213 were in the prevention (or treatment) group, and nC D 334 were in the
comparison group.
The outcome measure was a composite score of 30-Day substance use, including
tobacco, alcohol, marijuana, and illicit drugs. There were 22 covariates in the
study, including age, gender, race/ethnicity, family composition, family supervision,
school prevention, community protection, neighborhood risk, family bonding,
school bonding, self-efficacy, belief in self, self-control, social confidence, parental
use attitudes, peer use attitudes, and peer use. Due to the focus on the method-
ological nature of this chapter, we would reference the detailed information about
covariate selection to other resources such as SAMHSA [32].
To obtain the sensitivity indices $\hat\Delta_{ij}$, each one of the 22 covariates was in turn treated as an unobserved covariate in a propensity score model (e.g., a logistic regression model) to calculate $e(X_i)$ and $e\!\left(X_i^{(j)}\right)$, $i = 1, 2, \ldots, 547$; $j = 1, 2, \ldots, 22$. Figure 4.1 shows the empirical distribution of the sensitivity indices $\hat\Delta_{311031,j}$ for Subject 311031, $j = 1, \ldots, 22$. Then, the smallest and largest sensitivity indices were identified for each unit. For instance, those two specific sensitivity indices for Subject 311031 were $\min_{1 \le j \le 22}\hat\Delta_{311031,j} = -.58$ and $\max_{1 \le j \le 22}\hat\Delta_{311031,j} = .30$.
The next step was to calculate the first four moments using Eq. (4.4); for example, the four moments for Subject 311031 are $\hat\mu'_{311031,1} = 0.017$, $\hat\mu_{311031,2} = 0.046$, $\hat\mu_{311031,3} = 0.010$, and $\hat\mu_{311031,4} = 3.003$, respectively. Then, for each subject $i$ ($i = 1, \ldots, N$), using Pan and Boling's [23] SAS macro, the reference distribution of the sensitivity indices was approximated by one of the Pearson distributions using the first four moments. For Subject 311031, the reference distribution $f_{311031}(t)$ was approximated as a Type IV Pearson distribution (see Fig. 4.2), and the same SAS macro also computed the probability value for the robustness index $\hat R_i$ (Eq. (4.5)). For Subject 311031, the robustness index is
$$\hat R_{311031} = \int_{-.58}^{.30} f_{311031}(t)\, dt = .952,$$
suggesting that the propensity score estimation is robust to uncontrolled confounders for Subject 311031.
Figure 4.3 displays the empirical distribution of all the 547 robustness indices $\hat R_i$, $i = 1, 2, \ldots, 547$. Among the 547 robustness indices, 75 % are larger than .95, indicating that the propensity score estimation may be sensitive to uncontrolled confounders for the entire sample. This suggests that more covariates should have been observed and controlled in the propensity score model.

Fig. 4.2 The approximated Type IV Pearson distribution of the sensitivity indices for Subject 311031, with $\Pr(-.58 \le t \le .30) = .952$

Fig. 4.3 The empirical distribution of the robustness indices for the entire sample
5 Conclusion
References
1. Arah, O.A., Chiba, Y., Greenland, S.: Bias formulas for external adjustment and sen-
sitivity analysis of unmeasured confounders. Ann. Epidemiol. 18(8), 637–646 (2008).
doi:10.1016/j.annepidem.2008.04.003
2. Bai, H.: A comparison of propensity score matching methods for reducing selection bias. Int.
J. Res. Method Educ. 34(1), 81–107 (2011). doi:10.1080/1743727X.2011.552338
3. Brumback, B.A., Hernán, M.A., Haneuse, S.J.P.A., Robins, J.M.: Sensitivity analyses for
unmeasured confounding assuming a marginal structural model for repeated measures. Stat.
Med. 23(5), 749–767 (2004). doi:10.1002/sim.1657
4. Cole, S.R., Hernán, M.A., Margolick, J.B., Cohen, M.H., Robins, J.M.: Marginal structural
models for estimating the effect of highly active antiretroviral therapy initiation on CD4 cell
count. Am. J. Epidemiol. 162(5), 471–478 (2005). doi:10.1093/aje/kwi216
5. Cook, T.D., Campbell, D.T.: Quasi-experimentation: Design & Analysis Issues for Field
Settings. Rand McNally, Chicago (1979)
6. Cornfield, J., Haenszel, W., Hammond, E.C., Lilienfeld, A.M., Shimkin, M.B., Wynder, E.L.:
Smoking and lung cancer: recent evidence and a discussion of some questions. J. Natl. Cancer
Inst. 22, 173–203 (1959)
7. Cornfield, J., Haenszel, W., Hammond, E.C., Lilienfeld, A.M., Shimkin, M.B., Wynder,
E.L.: Smoking and lung cancer: recent evidence and a discussion of some questions. Int. J.
Epidemiol. 38(5), 1175–1191 (2009). doi:10.1093/ije/dyp289
8. Greenland, S.: Multiple-bias modelling for analysis of observational data. J. R. Stat. Soc. A.
Stat. Soc. 168(2), 267–306 (2005). doi:10.1111/j.1467-985X.2004.00349.x
9. Groenwold, R.H.H., Hak, E., Hoes, A.W.: Quantitative assessment of unobserved confounding
is mandatory in nonrandomized intervention studies. J. Clin. Epidemiol. 62(1), 22–28 (2009).
doi:10.1016/j.jclinepi.2008.02.011
10. Groenwold, R.H.H., Hoes, A.W., Nichol, K.L., Hak, E.: Quantifying the potential role of
unmeasured confounders: the example of influenza vaccination. Int. J. Epidemiol. 37(6), 1422–
1429 (2008). doi:10.1093/ije/dyn173
11. Groenwold, R.H.H., Nelson, D.B., Nichol, K.L., Hoes, A.W., Hak, E.: Sensitivity analyses to
estimate the potential impact of unmeasured confounding in causal research. Int. J. Epidemiol.
39(1), 107–117 (2010). doi:10.1093/ije/dyp332
12. Hsu, J.Y., Small, D.S.: Calibrating sensitivity analyses to observed covariates in observational
studies. Biometrics 69(4), 803–811 (2013). doi:10.1111/biom.12101
13. Huesch, M.D.: External adjustment sensitivity analysis for unmeasured confounding: an
application to coronary stent outcomes, Pennsylvania 2004–2008. Health Serv. Res. 48(3),
1191–1214 (2013). doi:10.1111/1475-6773.12013
14. Ko, H., Hogan, J.W., Mayer, K.H.: Estimating causal treatment effects from longitudinal HIV
natural history studies using marginal structural models. Biometrics 59(1), 152–162 (2003).
doi:10.1111/1541-0420.00018
15. Kuroki, M., Cai, Z.: Formulating tightest bounds on causal effects in studies with unmeasured
confounders. Stat. Med. 27(30), 6597–6611 (2008). doi:10.1002/sim.3430
16. Li, L., Shen, C., Wu, A.C., Li, X.: Propensity score-based sensitivity analysis
method for uncontrolled confounding. Am. J. Epidemiol. 174(3), 345–353 (2011).
doi:10.1093/aje/kwr096
17. Lin, D.Y., Psaty, B.M., Kronmal, R.A.: Assessing the sensitivity of regression results
to unmeasured confounders in observational studies. Biometrics 54(3), 948–963 (1998).
doi:10.2307/2533848
18. Lunt, M., Glynn, R.J., Rothman, K.J., Avorn, J., Stürmer, T.: Propensity score calibration in the
absence of surrogacy. Am. J. Epidemiol. 175(12), 1294–1302 (2012). doi:10.1093/aje/kwr463
19. MacLehose, R.F., Kaufman, S., Kaufman, J.S., Poole, C.: Bounding causal effects under
uncontrolled confounding using counterfactuals. Epidemiology 16(4), 548–555 (2005).
doi:10.2307/20486093
20. McCandless, L.C., Gustafson, P., Levy, A.: Bayesian sensitivity analysis for unmea-
sured confounding in observational studies. Stat. Med. 26(11), 2331–2347 (2007).
doi:10.1002/sim.2711
21. McCandless, L.C., Gustafson, P., Levy, A.: A sensitivity analysis using information about mea-
sured confounders yielded improved uncertainty assessments for unmeasured confounding. J.
Clin. Epidemiol. 61(3), 247–255 (2008). doi:10.1016/j.jclinepi.2007.05.006
22. Pan, W., Bai, H. (eds.): Propensity Score Analysis: Fundamentals and Developments. The
Guilford Press, New York (2015)
23. Pan, W., Boling, J.: Computing and graphing probability Values of Pearson distributions: a
SAS/IML macro. Paper presented at the 2013 Joint Statistical Meetings, Montreal, Canada,
August 2013
24. Pan, W., Frank, K.A.: A probability index of the robustness of a causal inference. J. Educ.
Behav. Stat. 28(4), 315–337 (2003). doi:10.3102/10769986028004315
25. Pan, W., Frank, K.A.: An approximation to the distribution of the product of
two dependent correlation coefficients. J. Stat. Comput. Sim. 74(6), 419–443 (2004).
doi:10.1080/00949650310001596822
26. Pearson, K.: Contributions to the mathematical theory of evolution. II. Skew variation in homo-
geneous material. Philos. Trans. R. Soc. Lond. A 186, 343–414 (1895). doi:10.2307/90649
27. Robins, J.M.: Association, causation, and marginal structural models. Synthese 121(1/2), 151–
179 (1999). doi:10.2307/20118224
28. Robins, J.M., Rotnitzky, A., Scharfstein, D.O.: Sensitivity analysis for selection bias and
unmeasured confounding in missing data and causal inference models. In: Halloran, M.E.,
Berry, D. (eds.) Statistical Models in Epidemiology, the Environment, and Clinical Trials, vol.
116. The IMA Volumes in Mathematics and its Applications, pp. 1–94. Springer, New York
(2000). doi:10.1007/978-1-4612-1284-3_1
29. Rosenbaum, P.R., Rubin, D.B.: Assessing sensitivity to an unobserved binary covariate in an
observational study with binary outcome. J. R. Stat. Soc. Ser. B (Methodol.) 45(2), 212–218
(1983). doi:10.2307/2345524
30. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies
for causal effects. Biometrika 70(1), 41–55 (1983). doi:10.1093/biomet/70.1.41
31. Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M.,
Tarantola, S.: Global Sensitivity Analysis: The Primer. Wiley, West Sussex (2008)
32. SAMHSA: The National Cross-Site Evaluation of High-Risk Youth Programs. Substance
Abuse and Mental Health Services Administration, U.S. Department of Health and Human
Services, Rockville (2002)
33. Schneeweiss, S.: Sensitivity analysis and external adjustment for unmeasured confounders in
epidemiologic database studies of therapeutics. Pharmacoepidemiol. Drug Saf. 15(5), 291–303
(2006). doi:10.1002/pds.1200
34. Schneeweiss, S., Rassen, J.A., Glynn, R.J., Avorn, J., Mogun, H., Brookhart, M.A.: High-
dimensional propensity score adjustment in studies of treatment effects using health care claims
data. Epidemiology 20(4), 512–522 (2009). doi:10.1097/EDE.0b013e3181a663cc
35. Shadish, W.R., Cook, T.D., Campbell, D.T.: Experimental and Quasi-experimental Designs for
Generalized Causal Inference. Houghton Mifflin, Boston (2002)
36. Shen, C., Li, X., Li, L., Were, M.C.: Sensitivity analysis for causal inference using inverse
probability weighting. Biom. J. 53(5), 822–837 (2011). doi:10.1002/bimj.201100042
37. Stürmer, T., Schneeweiss, S., Avorn, J., Glynn, R.J.: Adjusting effect estimates for unmeasured
confounding with validation data using propensity score calibration. Am. J. Epidemiol. 162(3),
279–289 (2005). doi:10.1093/aje/kwi192
38. Stürmer, T., Schneeweiss, S., Rothman, K.J., Avorn, J., Glynn, R.J.: Performance of propen-
sity score calibration—a simulation study. Am. J. Epidemiol. 165(10), 1110–1118 (2007).
doi:10.1093/aje/kwm074
39. Toh, S., García Rodríguez, L.A., Hernán, M.A.: Confounding adjustment via a semi-automated
high-dimensional propensity score algorithm: an application to electronic medical records.
Pharmacoepidemiol. Drug Saf. 20(8), 849–857 (2011). doi:10.1002/pds.2152
Chapter 5
Missing Confounder Data in Propensity Score
Methods for Causal Inference
Bo Fu and Li Su
1 Introduction
In public health research, randomized clinical trials are often infeasible because of their size, time, budget, and ethical constraints, and observational studies therefore play an important role in evaluating treatment effects on long-term outcomes [9]. Because of the absence of randomization and the time-varying nature of medication initiation in such observational cohorts, it is crucial to adequately control potential confounding from various factors (both time-invariant and time-varying) in order to obtain causal effects of treatments and interventions. Overall, the ultimate goal in the design and analysis of observational studies is to mimic a randomized controlled trial. There has been a rich literature on how to control potential confounding from baseline characteristics between treated and untreated patients, for example, using propensity
B. Fu ()
Administrative Data Research Centre for England & Institute of Child Health,
University College London, London, UK
e-mail: b.fu@ucl.ac.uk
L. Su
MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK
e-mail: li.su@mrc-bsu.cam.ac.uk
area and the challenges in analyzing real observational data, and to provide useful
information and references for medical researchers and suggest important topics of
future methodological research. We mainly focus on propensity score approaches, as they are among the most commonly used causal inference methods in the medical literature and are experiencing a tremendous increase of interest in medical research and many other scientific areas (e.g., [1, 30, 34]).
2 Examples
In addition, about 1/3 of patients starting a biologics drug will stop it at a later observed time because of mild side effects, comorbidities, or inefficacy [13]. After changing from 'on exposure' to 'off exposure', the disease activity is less well controlled, and quite a number of patients will re-start the biologics treatment at a later observed time or switch to a second-line biologics drug to control the disease progression. It is not clear how to handle this complex time-varying treatment process so as to adequately control for time-varying confounders (e.g., disease activity affected by prior exposure) in order to obtain causal effects of biologics on the observed outcomes.
Propensity score methods have become the standard techniques for the estimation
of causal treatment effects from observational data. The propensity score is defined
as the probability of receiving treatment conditional on measured confounders.
Conditional on propensity score, treated and untreated patients have a similar
distribution of measured confounders. Thus within similar levels of propensity
score, a “virtual randomization” can be achieved to compare patients between
treatment groups. Different methods of using estimated propensity score have been
described in the literature, including stratification [26], matching [26], covariate
adjustment [26], and weighting [25], and their performance has been compared by
simulation studies in estimating odds ratio [7], risk difference [3], and hazard ratio
for time-to-event outcomes [4], and by an empirical study in balancing confounders
by checking residual confounding [19]. Marginal structural models have also been
developed as an extension of the propensity score weighting method to tackle the
time-varying confounding problem [23].
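As a concrete illustration (not taken from the examples in this chapter), a minimal R sketch of propensity score weighting for a single time point is given below; the data frame df and the variable names trt, y, age and sex are hypothetical.

## Minimal sketch of inverse probability of treatment weighting (IPTW) at a single time point.
ps_fit <- glm(trt ~ age + sex, family = binomial, data = df)  # propensity score model
ps     <- fitted(ps_fit)                                      # estimated propensity scores
w      <- ifelse(df$trt == 1, 1 / ps, 1 / (1 - ps))           # ATE weights
ate    <- weighted.mean(df$y[df$trt == 1], w[df$trt == 1]) -
          weighted.mean(df$y[df$trt == 0], w[df$trt == 0])    # weighted outcome contrast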
References
1. Ali, M., Groenwold, R., Klungel, O.: Covariate selection and assessment of balance in
propensity score analysis in the medical literature: a systematic review. J. Clin. Epidemiol.
68(2), 112–121 (2015)
2. Angrist, J. D., Imbens, G.W., Rubin, D. B.: Identification of causal effects using instrumental
variables (with discussion). J. Am. Stat. Assoc. 91, 444–472 (1996)
3. Austin, P.C.: The performance of different propensity score methods for estimating difference
in proportions (risk differences or absolute risk reductions) in observational studies. Stat. Med.
29, 2137–2148 (2010)
4. Austin, P.C.: The performance of different propensity score methods for estimating marginal
hazard ratios. Stat. in Med. 32(16), 2837–2849 (2013)
5. Austin, P.C.: The relative ability of different propensity score methods to balance measured
covariates between treated and untreated subjects in observational studies. Med. Decis. Mak.
29, 661–677 (2009)
6. Austin, P.C.: Balance diagnostics for comparing the distribution of baseline covariates between
treatment groups in propensity-score matched samples. Stat. Med. 28, 3083–3107 (2009)
7. Austin, P.C., Grootendorst, P., Anderson, G.M.: A comparison of the ability of different
propensity score models to balance measured variables between treated and untreated subjects:
a Monte Carlo study. Stat. Med. 26(4), 734–753 (2007)
8. Belitser, S.V., Martens, E.P., Pestman, W.R., Groenwold, R.H.H., Boer, A., Klungel, O.H.:
Measuring balance and model selection in propensity score methods. Pharmacoepidemiol.
Drug Saf. 20, 1115–1129 (2011)
9. Concato, J., et al.: Randomized, controlled trials, observational studies, and the hierarchy of
research designs. N. Engl. J. Med. 342(25), 1887–1892 (2000)
10. D’Agostino, R., et al.: Examining the impact of missing data on propensity score estimation
in determining the effectiveness of SMBG. Health Serv. Outcome Res. Methodol. 2, 291–315
(2011)
11. D’Agostino, R.B., Rubin, D.B.: Estimating and using propensity scores with partially missing
data. J. Am. Stat. Assoc. 95(451), 749–59 (2000)
12. Dixon, W., Watson, K.D., Lunt, M., Hyrich, K.L., British Society for Rheumatology Biologics
Register Control Centre Consortium, Silman, A.J., Symmons, D.P., on behalf of the British
Society for Rheumatology Biologics Register: Serious infection following anti-tumor necrosis
factor alpha therapy in patients with rheumatoid arthritis: lessons from interpreting data from
observational studies. Arthritis Rheum. 56, 2896–2904 (2007)
13. Fu, B., Lunt, M., et al.: A threshold hazard model for estimating serious infection risk following
anti-tumor necrosis factor therapy in rheumatoid arthritis patients. J. Biopharm. Stat. 23(2),
461–476 (2013)
14. Gran, J.M., Roysland, K., Wolbers, M., Didelez, V., Sterne, J., Ledergerber, B., Furrer, H., von
Wyl, V., Aalen, O.: A sequential Cox approach for estimating the causal effect of treatment in
the presence of time-dependent confounding applied to data from the Swiss HIV cohort study.
Stat. Med. 29, 2757–68 (2010)
15. Groenwold, R.H., White, I.R., Donders, A.R.T., Carpenter, J.R., Altman, D.G., Moons, K.G.:
Missing covariate data in clinical research: when and when not to use the missing-indicator
method for analysis. Can. Med. Assoc. J. 184(11), 1265–1269 (2012)
16. Gu, X.S., Rosenbaum, P.R.: Comparison of multivariate matching methods: structures, dis-
tances, and algorithms. J. Comput. Graph. Stat. 2, 405–420 (1993)
17. Hirano, K., Imbens, G.W., Ridder, G.: Efficient estimation of average treatment effects using
the estimated propensity score. Econometrica. 71, 1161–1189 (2003)
18. Iacus, S.M., King, G., Porro, G.: Multivariate matching methods that are monotonic imbalance
bounding. J. Am. Stat. Assoc. 106, 345–361 (2011)
19. Lunt, M., et al.: Different methods of balancing covariates leading to different effect estimates
in the presence of effect modification. Am. J. Epidemiol. 169(7), 909–917 (2009)
20. Mitra, R., Reiter, J.P.: A comparison of two methods of estimating propensity scores after
multiple imputation. Stat. Methods Med. Res. 25(1), 188–204 (2016)
21. Moodie, E., Delaney, J., Lefebvre, G., Platt, R.: Missing confounding data in marginal structure
models: a comparison of inverse probability weighting and multiple imputation. Int. J. Biostat.
4, 1557–4679 (2008)
22. Qu, Y., Lipkovich, I.: Propensity score estimation with missing values using a multiple
imputation missingness pattern (MIMP) approach. Stat. Med. 28, 1402–414 (2009)
23. Robins, J.M., Hernán, M.A., Brumback, B.: Marginal structural models and causal inference
in epidemiology. Epidemiology. 11, 550–60 (2000)
24. Rosenbaum, P.R.: Observational Studies. Springer, New York (2002)
25. Rosenbaum, P.R.: Model-based direct adjustment. J. Am. Stat. Assoc. 82, 387–94 (1987)
26. Rosenbaum, P.R., Rubin, D.B.: Assessing sensitivity to an unobserved binary covariate in an
observational study with binary outcome. J. R. Stat. Soc. Ser. B 45, 212–218 (1983)
27. Rosenbaum, P., Rubin, D.: The central role of the propensity score in observational studies for
causal effect. Biometrika 70, 41–55 (1983)
28. Rosenbaum, P.R., Rubin, D.B.: Reducing bias in observational studies using subclassification
on the propensity score. J. Am. Stat. Assoc. 79, 516–524 (1984)
29. Stuart, E.A.: Matching methods for causal inference. Stat. Sci. 25(1), 1–21 (2010)
30. Stürmer, T., Joshi, M., Glynn, R.J., Avorn, J., Rothman, K.J., Schneeweiss, S.: A review of the
application of propensity score methods yielded increasing use, advantages in specific settings,
but not substantially different estimates compared with conventional multivariable methods.
J. Clin. Epidemiol. 59, 431–437 (2006)
31. Stürmer, T., Schneeweiss, S., Avorn, J., et al.: Adjusting effect estimates for unmeasured
confounding with validation data using propensity score calibration. Am. J. Epidemiol. 162(3),
279–289 (2005)
32. VanderWeele, T.J., Arah, O.A.: Bias formulas for sensitivity analysis of unmeasured confound-
ing for general outcomes, treatments, and confounders. Epidemiology. 22(1), 42–52 (2011)
33. VanderWeele, T.J.: Unmeasured confounding and hazard scales: sensitivity analysis for total,
direct, and indirect effects. Eur. J. Epidemiol. 28(2), 113–117 (2013)
34. Weitzen, S., et al.: Principles for modelling propensity scores in medical research: a systematic
literature review. Pharmacoepidemiol. Drug Saf. 13(12), 841–853 (2004)
35. Williamson, E., Morley, R., Lucas, A., Carpenter, J.: Propensity scores: from naive enthusiasm
to intuitive understanding. Stat. Methods Med. Res. 21(3), 273–93 (2012)
36. Williamson, E.J., Forbes, A., Wolfe, R.: Doubly robust estimators of causal exposure effects
with missing data in the outcome, exposure or a confounder. Stat. Med. 31(30), 4382–400
(2012)
Chapter 6
Propensity Score Modeling and Evaluation
Abstract In causal inference for binary treatments, the propensity score is defined
as the probability of receiving the treatment given covariates. Under the ignorability
assumption, causal treatment effects can be estimated by conditioning on/adjusting
for the propensity scores. However, in observational studies, propensity scores are
unknown and need to be estimated from the observed data. Estimation of propensity
scores is essential in making reliable causal inference. In this chapter, we first
briefly discuss the modeling of propensity scores for a binary treatment; then we
will focus on the estimation of the generalized propensity scores for categorical
treatment variables with more than two levels and continuous treatment variables.
We will review both parametric and nonparametric approaches for estimating
the generalized propensity scores. In the end, we discuss how to evaluate the
performance of different propensity score models and how to choose an optimal
one among several candidate models.
The potential outcomes framework [23] has been a popular framework for estimat-
ing causal treatment effects. An important quantity to facilitate causal inference has
been the propensity score [22], defined as the probability of receiving the treatment
given a set of measured covariates. In observational studies, propensity scores are
unknown and need to be estimated from the observed data. Consistent estimation of
propensity scores is essential in making reliable causal inference. In this section, we
briefly review the modeling of propensity scores for a binary treatment variable.
We first define some notation. Let $Y$ denote the response of interest, $T$ the treatment variable, and $X$ a $p$-dimensional vector of baseline covariates. The data can be represented as $(Y_i, T_i, X_i)$, $i = 1, \ldots, n$, a random sample from $(Y, T, X)$. In addition to the observed quantities, we further define $Y_i(t)$ as the potential outcome if unit $i$ were assigned treatment level $t$.
In the causal inference literature, the propensity score for a binary treatment variable is usually estimated by logistic regression. Using logistic regression to estimate propensity scores can be easily implemented in R. However, logistic regression is not without drawbacks. First of all, a parametric form of $r(X)$ needs to be specified. Consistent estimation of the ATE and ATT relies on a correct logistic regression model. In many cases, including only main effects in the model is not adequate, yet it is hard to determine which interaction terms should be included, especially when the vector of covariates is high-dimensional. In addition, logistic regression is not resistant to outliers [11, 18]. In particular, Kang and Schafer [11] show that when the logistic regression model is mildly misspecified, propensity score-based approaches can lead to large bias and variance of the estimated treatment effects.
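For concreteness, a minimal sketch of such a logistic propensity model in R is shown below; the data frame dat, the treatment trt, the covariates x1–x3 and the single hand-picked interaction are illustrative only, and choosing which interactions to include is exactly the specification problem just discussed.

## Minimal sketch: logistic-regression propensity scores with one hand-picked interaction.
ps_fit <- glm(trt ~ x1 + x2 + x3 + x1:x2, family = binomial, data = dat)
r_hat  <- fitted(ps_fit)   # estimated propensity scores r(X_i)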
Other parametric approaches for estimating propensity scores include Probit
regression modeling and linear discriminant analysis, both of which assume normal-
ity. However, through a simulation study, Zhu et al. [31] found that these parametric
models give very similar treatment effect estimates.
include support vector machines (SVM) and K-nearest neighbors (KNN), etc. R
packages are readily available, such as rpart for CART; randomForest for RF,
twang or gbm package for boosting models, and e1071 for SVM. A detailed review
of each approach for estimating propensity scores can be found in [31]. In a
simulation study, Zhu et al. found there is a trade-off between bias and variance
among parametric and nonparametric approaches. More specifically, parametric
methods tend to yield lower bias but higher variance than nonparametric methods
for estimating ATE and ATT.
The advantage of this approach is that, by achieving better balance in the covariates,
it is less susceptible to model misspecification of the propensity scores, compared
to logistic regression.
A related issue is whether we should achieve balance in all the measured
covariates in a study or a subset of the available covariates. This is a variable
selection issue. Zhu et al. [32] have shown through a simulation study that one
should aim to achieve balance in the real confounders, i.e. covariates related to both
the treatment variable and the outcome variable, as well as the covariates related
$$r(t \mid X)_{\mathrm{MLR}} = \frac{1}{1 + \sum_{s=2}^{M} e^{\beta_s' X}} \quad \text{for } t = 1$$
and
$$r(t \mid X)_{\mathrm{MLR}} = \frac{e^{\beta_t' X}}{1 + \sum_{s=2}^{M} e^{\beta_s' X}} \quad \text{for } t = 2, \ldots, M.$$
2. We maximize the multinomial likelihood function with respect to all the $\beta$'s:
$$L(\beta) = \prod_{i=1}^{n} \prod_{t=1}^{M} r_i(t \mid X)^{A_i(t)},$$
or, equivalently, the log-likelihood
$$l(\beta) = \sum_{i=1}^{n} \sum_{t=1}^{M} A_i(t) \log\left(r_i(t \mid X)\right).$$
3. The solution $\hat\beta_s$ for $s = 2, \ldots, M$ is substituted into the model to obtain the estimates of the generalized propensity scores.

While MLR is a seemingly simple way to estimate the generalized propensity score, there remains the question of variable selection and of which interactions should be included. In addition, Tchernis et al. [28] pointed out that MLR does not take into account the correlation among treatments, in the sense that for two treatment levels $t \ne s$ we have
$$\frac{r(t \mid X)_{\mathrm{MLR}}}{r(s \mid X)_{\mathrm{MLR}}} = e^{(\beta_t - \beta_s)' X},$$
which does not depend on the information of other treatment levels. This assumption could be violated in real applications, which makes an MLR model unsuitable for estimating the generalized propensity scores.
In R, to fit an MLR model, we can use the package nnet [29].
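A minimal sketch of this is given below; the data frame dat and the covariate names are illustrative, and trt is assumed to be a factor with M treatment levels.

## Minimal sketch: generalized propensity scores from a multinomial logistic model via nnet.
library(nnet)
mlr_fit <- multinom(trt ~ x1 + x2, data = dat)   # trt: factor with M levels
gps_mlr <- predict(mlr_fit, type = "probs")      # n x M matrix of estimated r(t | X_i)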
In this section, we are going to introduce two machine learning approaches for the
modeling of generalized propensity scores: generalized boosted model (GBM) and
random forests (RF).
GBM uses an iterative procedure that adds together many simple regression trees
to approximate the propensity score function. A regression tree algorithm divides
the dataset into two non-overlapping regions based on one of the covariates. Then,
it recursively divides each of those regions into two smaller regions, where each split
is based on one of the covariates [2]. Note that the splits may occur on a different
covariate or the same covariate each time. The splits are chosen so that the prediction
error is minimized. After the allowed number of splits have occurred, for each region
of the dataset, the estimated response value equals the average response values of
the data points within the region.
Now we describe the GBM method for binary treatments, then we extend the
procedure to multi-level treatments. McCaffrey et al. [16] provides a detailed algorithm for estimating propensity scores using GBM. In the binary case, let $g(X) = \log[r(X)/(1 - r(X))]$; the log-likelihood can then be rewritten as
$$l(g) = \sum_{i=1}^{n} \left\{ T_i\, g(X_i) - \log\left[1 + \exp\left(g(X_i)\right)\right] \right\}. \qquad (6.4)$$
To maximize $l(g)$ in (6.4), $g(X)$ is updated at each iteration to $g(X) + h(X)$, where $h(X)$ is the fitted value from a regression tree modelling the residuals $\epsilon_i = T_i - 1/\{1 + \exp[-g(X_i)]\}$, the direction of the largest increase in (6.4). To avoid overfitting, a shrinkage parameter $\alpha$ is introduced so that the update is $g(X) + \alpha h(X)$, where $\alpha$ is usually a small value, such as 0.0001.
be tuned to yield propensity scores that achieve optimal balance in covariate
distribution between the treatment and control groups. The key is to stop the
algorithm at the optimal number of trees when a certain balance statistic (e.g.,
average standardized absolute mean difference in the covariates) is minimized.
Interactions are automatically included when multi-level splits are allowed in
regression trees and since splits are automatically determined by the algorithm
based on a criterion, variable selection is automatically done [16].
McCaffrey et al. [17] extended this algorithm to the multi-level treatment case.
We first note that while estimating the generalized propensity score for a particular
treatment level t, we are interested in the probability that each subject is assigned
to a particular treatment t as opposed to any other treatment. So essentially we
have two groups: those assigned to treatment t (equivalent to the treatment group
in the binary case), and those that were not assigned to treatment t (equivalent to the
control group in the binary case). Then we can fit a GBM that balances the covariates
between the treatment t group and the entire sample [17]. We do this for each of the
M treatments to obtain the generalized propensity scores rO .tjX/. The estimation of
the generalized propensity scores for multi-level treatment can be realized in the R
package twang [19].
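A minimal sketch of this with twang is shown below; the variable names, the stopping rule and the number of trees are illustrative choices for this sketch, not prescriptions from the chapter.

## Minimal sketch: GBM-based generalized propensity scores for a multi-level treatment.
library(twang)
mnps_fit <- mnps(trt ~ x1 + x2, data = dat,      # trt: factor with M treatment levels
                 estimand = "ATE", stop.method = "es.mean",
                 n.trees = 3000, verbose = FALSE)
bal.table(mnps_fit)                              # covariate balance diagnostics
w <- get.weights(mnps_fit, stop.method = "es.mean")  # weights implied by the fitted scores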
The downside to this method is that by fitting separate GBMs for all M treatment
groups, it is not guaranteed that the generalized propensity scores for each treatment
group will add up to 1. McCaffrey et al. [17] justified that estimating the ATE only
requires the propensity scores for the particular treatment groups involved, so as
long as the estimated generalized propensity scores are not biased, they do not need
to add up to 1.
Next, we are going to introduce RF model for estimating the generalized propen-
sity scores. An RF model [1] is built on a collection of classification trees, fitted
on bootstrap samples of the original dataset. Classification trees are different from
regression trees in that classification trees predict the class label for each input vector
of covariates and use nonparametric information criteria, such as Entropy, misclas-
sification rate, or Gini Index, for splitting at each node. The random forest classifica-
tion tree finds the best split from only a random subsample of the covariates at each
node. Then the estimated generalized propensity score for treatment t is the fraction
of votes for t from the collection of the random forest classification trees. The
specific random forest algorithm for estimating the generalized propensity score is
1. Draw a random sample with replacement of size n (size of dataset), called a
bootstrap sample, from the dataset.
2. Fit a random forest classification tree to the bootstrap sample.
3. Repeat steps 1 and 2 a large number, B, times and obtain a collection of B
classification trees (usually, B D 500).
4. For a given vector of covariates X, predict the class label from each fitted tree.
The estimated generalized propensity score is then
$$\hat r(t \mid X) = \frac{1}{B} \sum_{b=1}^{B} I\{\hat C_b(X) = t\},$$
where $\hat C_b(X)$ denotes the treatment class predicted for $X$ by the $b$th classification tree.
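A minimal R sketch of this vote-counting construction (illustrative names; the randomForest package is assumed) is:

## Minimal sketch: generalized propensity scores as random-forest vote fractions.
library(randomForest)
rf_fit <- randomForest(trt ~ x1 + x2, data = dat, ntree = 500)  # trt: factor with M levels
gps_rf <- predict(rf_fit, type = "prob")   # (out-of-bag) vote fractions for each treatment level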
As explained by Zhu et al. [31], the intuition of this approach comes from the fact that there is a trade-off in bias and variance between parametric and nonparametric approaches. By combining the two, both the bias and the variance of the estimated causal effects can be reduced. The choice of the weight in (6.6) gives more weight to the estimate that is closer to the observed value of $A(t)$, so it trims extreme weights to more reasonable values without ad hoc adjustment. In addition, it would not attain 0 or 1 as a possible value, owing to the MLR component.
Finally, we focus on the case where the treatment variable is continuous. In this case, we are interested in estimating the so-called dose–response function $\mu(t) = \mathrm{E}[Y_i(t)]$. We assume $Y_i(t)$ is well defined for $t \in \mathcal{T}$, where $\mathcal{T} = [t_0, t_1]$. To draw causal inference, we assume the ignorability assumption
$$f\left(t \mid Y_i(t), X_i\right) = f\left(t \mid X_i\right) \quad \text{for all } t \in \mathcal{T},$$
where $f(t \mid \cdot)$ refers to the conditional density. In other words, we assume that the vector of covariates $X$ includes all the real confounders that may jointly affect the treatment and the potential outcomes.
In the continuous treatment case, the generalized propensity score is defined as $r(t \mid X) \equiv f_{T \mid X}(t \mid X)$, the conditional density of the treatment at level $t$ given the covariates [10]. Under the ignorability assumption, causal effects can be estimated by weighting each observation by
$$w_i = \frac{r(T_i)}{r(T_i \mid X_i)} \quad \text{for } i = 1, \ldots, n, \qquad (6.7)$$
where $r(T_i)$ denotes the marginal density of the treatment evaluated at $T_i$.
Robins et al. [21] proposed a two-step approach to estimate $r(T_i \mid X_i)$. The treatment variable $T$ is assumed to follow a parametric model:
$$T = X'\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2). \qquad (6.8)$$
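A minimal sketch of this two-step weight construction under the normal model (6.8) is given below; the data frame dat and the names trt, x1, x2 are illustrative, and the marginal density of the treatment is approximated here by a normal fit, which is one simple choice among several.

## Minimal sketch of the weights in (6.7) under the normal treatment model (6.8).
fit    <- lm(trt ~ x1 + x2, data = dat)           # step 1: model T given X
r_cond <- dnorm(dat$trt, mean = fitted(fit),      # estimated conditional density r(T_i | X_i)
                sd = summary(fit)$sigma)
r_marg <- dnorm(dat$trt, mean = mean(dat$trt),    # estimated marginal density r(T_i)
                sd = sd(dat$trt))                 # (normal approximation, for stabilization)
w      <- r_marg / r_cond                         # weights as in (6.7)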
$$m(X) = \sum_{m=1}^{M} \sum_{j=1}^{K_m} c_{mj}\, I\{X \in R_{mj}\}, \qquad (6.11)$$
where $M$ is the total number of trees, $K_m$ is the number of terminal nodes of the $m$th tree, $R_{mj}$ is the $j$th rectangular region in the feature space spanned by $X$, and $c_{mj}$ is the predicted constant in region $R_{mj}$. $K_m$ and $R_{mj}$ are determined by optimizing some nonparametric information criterion, such as entropy, misclassification rate, or the Gini index. $c_{mj}$ is simply the average value of $T_i$ among the training data falling in region $R_{mj}$. Details about how to construct a classification/regression tree can be found in [2].
$$w_i = \frac{\hat r(T_i)}{\hat r(T_i \mid X_i)} \quad \text{for } i = 1, \ldots, n.$$
scores using simulations. However, Hirano et al. [7] and Lunceford and Davidian
[15] showed that conditioning on the estimated propensity score rather than the true
propensity score can yield smaller variance of the estimated causal effects. That is,
even when the propensity score is estimated more accurately, it does not necessarily
yield better causal inference estimates.
One commonly accepted practice is to check balance after the propensity scores
are estimated. The underlying idea is that if the propensity score is correctly
estimated, the covariates should be distributed almost the same among different
treatment groups. There are many ways to evaluate balance in the covariates and it
also depends on the particular approach employed to estimate the causal treatment
effect. For example, in inverse probability weighting, we may look at the absolute
standardized mean difference (ASMD) in the covariates. For a single covariate X,
the standardized mean difference is defined as
XN treated
w
XN control
w
dD q ; (6.12)
.s2treated C s2control /=2
where streated is the standard deviation of X in the treatment group and scontrol is
the standard deviation of X in the control (untreated) group; XN treated
w
is the weighted
average of X in the treatment group and XN control is the weighted average of X in the
w
and
Pn
iD1 Xi .1 Ti /O
ri =.1 rOi /
XN control
w
D P n :
iD1 .1 Ti /O
ri =.1 rOi /
122 Y. Zhu and L. (Laura) Lin
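A small helper for (6.12) is sketched below, assuming x is a single covariate, t the 0/1 treatment indicator, and w the weights; these names are illustrative, not the chapter's code.

## Minimal sketch: weighted absolute standardized mean difference for one covariate, cf. (6.12).
asmd <- function(x, t, w) {
  m1 <- weighted.mean(x[t == 1], w[t == 1])   # weighted mean in the treatment group
  m0 <- weighted.mean(x[t == 0], w[t == 0])   # weighted mean in the control group
  abs(m1 - m0) / sqrt((var(x[t == 1]) + var(x[t == 0])) / 2)
}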
$$C_v(k) = \frac{1}{V} \sum_{v=1}^{V} \left( \hat\Delta_k(X_{v0}) - \hat\Delta_0(X_{v1}) \right)^2.$$
The optimal model for estimating propensity scores is then chosen to be the one which leads to the smallest $C_v$ among the $K$ candidate models. Brookhart and van der Laan [3] proved that the optimal model selected by this Monte Carlo cross-validation criterion leads to the smallest mean squared error of the parameter of interest. This approach has been adopted to compare different propensity score models in [33], in which an over-fitted logistic regression model using all the available covariates is treated as the reference propensity score model to obtain $\hat\Delta_0(X)$.
References
20. Robins, J.M.: Association, causation, and marginal structural models. Synthese 121(1), 151–
179 (1999)
21. Robins, J.M., Hernán, M.Á., Brumback, B.: Marginal structural models and causal inference
in epidemiology. Epidemiology. 11(5), 550–560 (2000)
22. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies
for causal effects. Biometrika 70(1), 41–55 (1983)
23. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized
studies. J. Educ. Psychol. 66(5), 688–701 (1974)
24. Setoguchi, S., Schneeweiss, S., Brookhart, M.A., Glynn, R.J., Cook, E.F.: Evaluating uses of
data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol.
Drug Saf. 17(6), 546–555 (2008)
25. Stuart, E.A., Lee, B.K., Leacy, F.P.: Prognostic score–based balance measures can be a
useful diagnostic for propensity score methods in comparative effectiveness research. J. Clin.
Epidemiol. 66(8), S84–S90 (2013)
26. Székely, G.J., Rizzo, M.L.: Brownian distance covariance. Ann. Appl. Stat. 3(4), 1236–1265 (2009)
27. Székely, G.J., Rizzo, M.L., Bakirov, N.K.: Measuring and testing dependence by correlation
of distances. Ann. Stat. 35(6), 2769–2794 (2007)
28. Tchernis, R., Horvitz-Lennon, M., Normand, S.L.T.: On the use of discrete choice models for
causal inference. Stat. Med. 24(14), 2197–2212 (2005)
29. Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York
(2002). ISBN 0-387-95457-0
30. Zhu, Y., Coffman, D.L., Ghosh, D.: A boosting algorithm for estimating generalized propensity
scores with continuous treatments. J. Causal Inference 3(1), 25–40 (2015)
31. Zhu, Y., Ghosh, D., Mitra, N., Mukherjee, B.: A data-adaptive strategy for inverse weighted
estimation of causal effect. Health Serv. Outcome Res. Methodol. 14(3), 69–91 (2014)
32. Zhu, Y., Schonbach, M., Coffman, D.L., Williams, J.S.: Variable selection for propensity score
estimation via balancing covariates. Epidemiology 26(2), e14–e15 (2015)
33. Zhu, Y., Ghosh, D., Coffman, D.L., Savage, J.S.: Estimating controlled direct effects of
restrictive feeding practices in the ‘early dieting in girls’ study. J. R. Stat. Soc.: Ser. C: Appl.
Stat. 65(1), 115–130 (2016)
Chapter 7
Overcoming the Computing Barriers
in Statistical Causal Inference
Abstract The massive development of statistical causal inference in the era of big data, commonly seen in public health applications, can be hindered by computational barriers. In this chapter we discuss a practical concern about computing barriers in statistical causal inference, with an example in optimal pair matching, and consequently offer a novel solution by constructing a stratification tree based on exact matching and propensity scores. We demonstrate the implementation of this novel method with a large observational study of the Philadelphia obstetric unit closures from 1995 to 2003, with 59 observed covariates in each of the 132,786 birth deliveries and 5,998,111 potential controls. Algorithms and R program code are also provided for interested readers.
K. Zhang ()
Department of Statistics and Operations Research, University of North Carolina,
Chapel Hill, NC, USA
e-mail: zhangk@email.unc.edu
D.-G. Chen
School of Social Work & Department of Biostatistics, Gilling School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA
e-mail: dinchen@email.unc.edu
One way to overcome the computing barrier for optimal matching problems is to
take advantage of the structure of the data. In particular, there are two general
observations:
1. Individuals with similar propensity scores are more likely to have close covari-
ates, and in general the individuals in the treatment group have higher propensity
scores than the ones in the control group.
2. Some covariates are of more importance than others for the field of research.
Guided by these considerations, one can stratify the data into small subclasses
with a tree structure and match within each subclass.
In what follows, we describe the construction of the stratification tree. In general,
decision on whether or not the stratification is needed at each node is based on
several scientific, statistical, and computational criteria, while the stratification
process can be done by estimated propensity scores and by exact matching of
important variables.
At the root of the stratification tree, the entire data is regarded as a stratum. The
algorithm then runs through the following steps.
1. Checking statistical criteria. We first check if the stratum makes statistical
sense. For example, we ask if there is any treated observation in the stratum.
If yes, we shall proceed with the stratification and matching steps. Otherwise, we
shall ignore the stratum.
2. Checking matching feasibility. In this step we check whether the stratum is feasible for the OPM algorithm to produce matched pairs. For example, we check whether the size of the distance matrix is below a preset tolerance, for example $9 \times 10^6$. If yes, then we can proceed with matching within the stratum. Otherwise, we shall further split the stratum. There are two methods of stratification: propensity score stratification (PSS) and important variables stratification (IVS).
3. Propensity score stratification (PSS). In this stratification process, we first fit a logistic regression to obtain propensity scores for each individual. We then rank the propensity scores from high to low and start stratification from the top. A subclass keeps recruiting individuals until both (1) there are more control units than treated ones and (2) the size of the distance matrix reaches the preset tolerance in Step 2. The key idea behind PSS is based on [4]: stratification on estimated propensity scores can effectively reduce bias and imbalance. If the logistic regression encounters difficulty for some reason, such as when the stratum is too large for the logistic regression, the stratum will be split by the important variables stratification (IVS) described below.
4. Checking the number of strata after PSS. In [4], the authors recommend five subclasses from PSS. Indeed, since individuals with similar propensity scores may have very different covariates, and since a large number of strata may lead to difficult interpretations, discretion on the number of strata from PSS is needed. In the algorithm, we check whether the number of strata from PSS is
below a preset tolerance bound, which should be a compromise between exact
matching and propensity score matching and should be advised by field experts.
If the number of strata is small, we proceed with matching within each subclass.
Otherwise, we disregard the PSS and consider the following important variable
stratification (IVS).
5. Important Variables Stratification (IVS). To stratify the data by important variables, we first set a list of priorities and a set of ranges for interval splitting. This order of importance and these intervals should be specified with the advice of field experts before the study. For example, in [5], the variable "Mom's age" has three stratification intervals, (0, 18], (18, 34], and (34, ∞), based on medical considerations. Thus, a stratum reaching the IVS step based on "Mom's age" will be split into the three intervals above so that treated and control units are exactly matched in each of the three strata from IVS. The algorithm then repeats from Step 1 for each of the three strata. The key idea behind this process is the exact matching idea described in Chap. 9 of [3]. If there are no more variables for IVS but matching is not feasible in the current data, the algorithm stops and reports an error.
When all strata are of a size that is feasible for matching, the stratification
process is complete, and matched pairs are formed within each subclass. The
aggregated pairs from all subclasses then form the matched pairs of treated and
control individuals for the entire study. The flowchart in Fig. 7.1 describes the
complete algorithm.
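To make the recursion concrete, the following minimal R sketch mirrors Steps 1-5 above. It is illustrative only: the argument names (treated, iv_list, sizemax, max_sub) are hypothetical, and a crude quantile-based propensity score splitter stands in for the opt_pstrat routine given in the Appendix.

stratify_tree <- function(dat, treated, iv_list, sizemax = 9e6, max_sub = 5) {
  # Step 1: statistical criteria -- ignore strata with no treated units
  if (sum(dat[[treated]]) == 0) return(list())
  n_t <- sum(dat[[treated]]); n_c <- sum(1 - dat[[treated]])
  # Step 2: matching feasibility -- is the distance matrix small enough?
  if (n_t * n_c <= sizemax) return(list(dat))
  # Step 3: propensity score stratification (PSS); a crude stand-in that
  # groups units with similar estimated propensity scores
  ps <- tryCatch(fitted(glm(dat[[treated]] ~ ., family = binomial,
                            data = dat[setdiff(names(dat), treated)])),
                 error = function(e) NULL)
  if (!is.null(ps)) {
    k <- ceiling(n_t * n_c / sizemax)   # rough number of subclasses needed
    grp <- cut(rank(-ps, ties.method = "first"), breaks = k, labels = FALSE)
    # Step 4: accept the PSS split only if the number of subclasses is small
    if (k <= max_sub) return(split(dat, grp))
  }
  # Step 5: important variables stratification (IVS) -- split on the highest
  # priority variable and recurse on each child stratum (back to Step 1)
  if (length(iv_list) == 0) stop("No more variables for IVS; matching infeasible")
  iv <- iv_list[[1]]
  children <- split(dat, cut(dat[[iv$var]], breaks = iv$breaks))
  unlist(lapply(children, stratify_tree, treated = treated, iv_list = iv_list[-1],
                sizemax = sizemax, max_sub = max_sub), recursive = FALSE)
}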
[Fig. 7.1 appears here: a flowchart of the complete algorithm, with nodes for inputting a stratum, checking the statistical and matching-feasibility criteria, checking whether PSS yields too many subclasses, and reporting success.]
We take the 1995 data in [5] as an example to illustrate the algorithm. There are 14,768 treated units and 681,743 control units in this dataset, and the data size was beyond the capacity of version 2.10.0 of the software R (in 2009) for logistic regression and OPM. To apply the algorithm described in Sect. 2, we set the following arguments as input: The tolerance size of the distance matrix within each subclass is set to 9 × 10^6. The tolerance number of subclasses is set to 5. The stratification variables and stratification intervals suggested by the doctors, listed by priority from high to low, are
1. "Gestation Age" with intervals (0, 33], (33, 36], (36, 38], (38, 40], and (40, ∞).
2. "Mom's Age" with intervals (0, 18], (18, 34], and (34, ∞).
3. "Mom's Education" with categories "Less than High School", "High School Degree", "College Degree", and "More than College".
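These intervals can be encoded directly in base R with cut(), which produces the right-closed intervals (a, b] used above; the following small illustration uses hypothetical variable names and toy data.

births <- data.frame(gest_age = c(32, 35, 39, 41), mom_age = c(17, 25, 30, 36))
births$gest_cat   <- cut(births$gest_age, breaks = c(0, 33, 36, 38, 40, Inf))
births$momage_cat <- cut(births$mom_age,  breaks = c(0, 18, 34, Inf))
table(births$gest_cat, births$momage_cat)   # exact-matching strata used by IVS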
We shall only explain Step 3 of PSS here since the other steps are straightforward. For this step, we use the delivery records with gestational ages of more than 40 weeks as an example. There are 1791 treated units and 137,463 control units. To subclassify this stratum, we fit the logistic regression with all covariates but "Gestation Age" to obtain estimated propensity scores for each individual. We then sort the estimated propensity scores from high to low and present a few of them in Table 7.1. We form the first subclass by searching in the treated group for the lowest estimated propensity score p_1, above which (1) the number of treated propensity scores is less than the number of control propensity scores—so there are enough control units to pair with treated ones—and (2) the product of the numbers of treated and control units—which will be the size of the distance matrix in OPM—is less than the threshold required by the software. The observations with an estimated propensity score higher than p_1 are collected to form the first subclass. The detailed R function for this PSS step is provided in the Appendix.
In this stratum, for the first 312 treated units there are not enough control units above their propensity scores, so pair matching cannot be done. From the 313th treated unit on, the pool of control units is large enough to match each treated unit. However, from the 1045th treated unit on, the product of the sizes of the treated and control pools exceeds the 9 × 10^6 tolerance bound. Therefore, p_1 = 0.0388, and the first subclass consists of 1044 treated units and 8614 control units.
In summary, by going through the process described in Sect. 2, the resulting tree of strata for the 1995 data is shown in Fig. 7.2. At the beginning step, the stratum is so large that even the logistic regression cannot be fit. After the split based on "Gestation Age", four of the five strata can be divided into a few subclasses for which matching is feasible. However, the (38, 40] stratum is still too large. Its PSS would result in 12 subclasses, which is above the limit of 5. Thus, this stratum is further split by "Mom's Age". We further stratify the "Mom's Age" group of (18, 34] by "Mom's Education" since it is also too large. The resulting tree has 10 ending nodes of strata, with 2, 2, 5, 2, 2, 5, 4, 1, 2, and 4 subclasses, respectively.
After the stratification process is complete, OPM is performed within each of the 29 subclasses. Table 7.2 from [5] shows the covariate balance in terms of standardized differences before and after matching. It can be seen that before matching, the distributions of covariates are quite different between the Philadelphia group and the control group. Many covariates have a standardized difference greater than 0.2. For example, on average Philadelphia mothers were younger and
their prenatal care started later, Philadelphia babies were lighter in weight, and
Table 7.1 Illustration of PSS in [5]

Rank of treated     Estimated treated    Number of control   Enough     Size of        Exceeding     Matching
propensity score    propensity score     units above         control?   dist. matrix   OPM limit?    feasible?
1                   0.973                0                   No         0              No            No
...                 ...                  ...                 ...        ...            ...           ...
312                 0.314                308                 No         96,096         No            No
313                 0.312                315                 Yes        98,595         No            Yes
...                 ...                  ...                 ...        ...            ...           ...
1043                0.0389               8594                Yes        8,963,542      No            Yes
1044                0.0388               8614                Yes        8,993,016      No            Yes
1045                0.0387               8629                Yes        9,017,305      Yes           No
...                 ...                  ...                 ...        ...            ...           ...

The first subclass is formed by the top 1044 treated units and 8614 control units
[Fig. 7.2 appears here; its root node is the 1995 stratum, with T = 14,768 treated and C = 681,743 control units, for which the logistic regression could not be fit.]
Fig. 7.2 The stratification tree from the algorithm. Each node represents a resulting stratum from IVS. Within each node, we list the number of treated units, the number of control units, and the number of subclasses from PSS if feasible
Philadelphia families had less income. Such discrepancies made causal conclusions from direct comparisons difficult. However, after matching, the covariates were well balanced: the standardized differences were all below 0.2, and the average of each covariate was close between the two groups.
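For readers who wish to reproduce this kind of balance check, a standardized difference can be computed with a few lines of R; the sketch below uses simulated data and hypothetical names rather than the study variables.

std_diff <- function(x, t) {
  (mean(x[t == 1]) - mean(x[t == 0])) /
    sqrt((var(x[t == 1]) + var(x[t == 0])) / 2)
}
set.seed(6)
t <- rbinom(1000, 1, 0.3)
x <- rnorm(1000, mean = 0.5 * t)   # treated units shifted upward by 0.5
std_diff(x, t)                     # roughly 0.5, i.e., well above the 0.2 benchmark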
4 Summary
As shown in this chapter, the optimal pair matching method can achieve good balance among covariates for causal conclusions in large observational studies. However, it has to be used with care to avoid the high computational cost it can incur. In the era of Big Data, such large studies will arise more and more often. Thus, it is important to develop and consider efficient ways to perform OPM to facilitate statistical analysis in observational studies.
In the Philadelphia obstetric unit closure study [5], a stratification tree method
was applied to split the data into small subclasses where matching is computa-
tionally feasible. The construction process of the tree structure was based on an
integration of propensity score stratification and important variable stratification.
The underlying ideas are that propensity scores [4] and exact matching [3] are
important ways to balance covariates. In the Philadelphia obstetric unit closure study, the differences in covariates between the two groups are substantially reduced after the matching based on the stratification tree, which in turn facilitates the establishment of causal statements.
In general, stratification provides an efficient way to perform OPM for large datasets. Since the overall goal is to achieve balance in covariates between matched pairs for causal conclusions, the stratification process, and especially important quantities such as the points of split, should be carefully designed with advice from field experts. The resulting stratification strategy should be a good compromise between covariate balance and computational cost.
The R code for the tree stratification algorithm described in this chapter is
available upon request for interested readers. An R package of this algorithm is
also under development.
Appendix
The function opt_pstrat carries out the propensity score stratification described in Step 3 of Sect. 2. It takes the following arguments:
1. indicator: This argument takes a binary vector which takes value 1 for treated units and 0 for control ones.
2. pscore: This argument takes a vector of propensity scores of each unit.
3. sizemax: This argument takes a preset tolerance level on the size of the distance
matrix. The default value is 9,000,000.
The function opt_pstrat returns with the following values:
1. flag: This output returns 1 for successful stratification and 2 otherwise.
2. cutoffs: This output returns the cutoff points where the subclasses are split.
3. t.pstrata: This output returns a vector listing the number of treated units in
each subclass formed.
4. c.pstrata: This output returns a vector listing the number of control units in
each subclass formed.
5. prodsize.pstrata: This output returns a vector listing the size of distance
matrix in each subclass formed.
# The listing below reconstructs opt_pstrat following the PSS description in
# Sect. 2; parts of the original listing were damaged, so the loop structure
# and the cutoff rule are filled in as assumptions to keep the function
# self-contained and runnable.
opt_pstrat <- function(indicator, pscore, sizemax = 9000000) {
  cutoffs <- max(pscore)     # upper boundary of the first subclass
  t.strata <- NULL           # treated units in each subclass formed
  c.strata <- NULL           # control units in each subclass formed
  size.strata <- NULL        # size of the distance matrix in each subclass
  indicator_iter <- indicator
  pscore_iter <- pscore
  num_strata_formed <- 0

  while (sum(indicator_iter) > 0 && sum(!indicator_iter) > 0) {
    n_treated <- sum(indicator_iter)
    t_ind <- which(indicator_iter == 1)
    treated_pscore_iter <- pscore_iter[t_ind]

    # For each treated unit, the number of treated units with a propensity
    # score at or above its own ...
    t_geq_t <- n_treated + 1 - rank(treated_pscore_iter, ties.method = "min")
    # ... and the number of control units with a propensity score at or above it
    c_geq_t <- sapply(treated_pscore_iter,
                      function(p) sum(pscore_iter >= p & indicator_iter == 0))

    # A candidate cutoff is matchable if there are at least as many controls
    # as treated units above it, and feasible if the distance matrix also
    # stays within the tolerance
    matchable <- c_geq_t >= t_geq_t
    size.dist <- t_geq_t * c_geq_t
    feasible <- matchable & (size.dist <= sizemax)
    if (sum(feasible) == 0) {
      print("Stratification Failed: No Feasible Cutoff.")
      return(list(flag = 2))
    }

    # Take the lowest feasible treated propensity score as the cutoff
    cutoff.size.ind <- which(feasible)[which.min(treated_pscore_iter[feasible])]
    cutoff <- pscore_iter[t_ind[cutoff.size.ind]]
    cutoffs <- c(cutoffs, cutoff)
    num_strata_formed <- num_strata_formed + 1
    print(num_strata_formed)
    t.strata <- c(t.strata, t_geq_t[cutoff.size.ind])
    c.strata <- c(c.strata, c_geq_t[cutoff.size.ind])
    size.strata <- c(size.strata, size.dist[cutoff.size.ind])

    # Remove the units of the newly formed subclass and iterate
    indicator_iter <- indicator_iter[pscore_iter < cutoff]
    pscore_iter <- pscore_iter[pscore_iter < cutoff]
  }

  if (sum(indicator_iter) == 0) {
    print("Stratification Finished: Treated Units Used Up.")
  } else {
    print("Stratification Finished: Control Units Used Up.")
  }
  return(list(flag = 1, num_pstrata = num_strata_formed,
              cutoffs = rev(cutoffs), t.pstrata = rev(t.strata),
              c.pstrata = rev(c.strata), prodsize.pstrata = rev(size.strata)))
}
As described in the main text, the function opt_pstrat is applied when each
stratum goes through Step 3. The outputs of this function provide useful information
on whether to further split or match within the subclasses.
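As a hypothetical usage example (with simulated data rather than the birth registry), the function can be called as follows; the covariate, sample sizes, and tolerance are illustrative choices only.

set.seed(1)
ind <- c(rep(1, 200), rep(0, 5000))                    # 200 treated, 5000 controls
x   <- c(rnorm(200, mean = 1), rnorm(5000, mean = 0))  # a single covariate
ps  <- fitted(glm(ind ~ x, family = binomial))         # estimated propensity scores
res <- opt_pstrat(ind, ps, sizemax = 50000)
res$t.pstrata          # treated units in each subclass formed
res$c.pstrata          # control units in each subclass formed
res$prodsize.pstrata   # distance-matrix size in each subclass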
References
1. Hansen, B.B., Klopfer, S.O.: Optimal full matching and related designs via network flows.
J. Comput. Graph. Stat. 15, 609–627 (2006)
2. Rosenbaum, P.R.: Observational Studies. Springer Series in Statistics. Springer, New York
(2002)
3. Rosenbaum, P.R.: Design of Observational Studies. Springer, New York (2010)
4. Rosenbaum, P.R., Rubin, D.B.: Reducing bias in observational studies using subclassification
on the propensity score. J. Am. Stat. Assoc. 79, 516–524 (1984)
5. Zhang, K., Small, D.S., Lorch, S., Srinivas, S., Rosenbaum, P.R.: Using split samples and
evidence factors in an observational study of neonatal outcomes. J. Am. Stat. Assoc. 106,
511–524 (2011)
Part III
Causal Inference in Randomized
Clinical Studies
Chapter 8
Semiparametric Theory and Empirical
Processes in Causal Inference
Edward H. Kennedy
1 Introduction
Causality and counterfactual questions lie at the heart of many if not most scientific
endeavors. Counterfactual questions are about what would have happened in
some system had it undergone a particular change. For example: How would the
distribution of patient outcomes differ had everyone versus no one received some
medical treatment? Which rule for treatment assignment would maximize outcomes
if it were implemented in the population?
In fact many scientific questions are causal even if they are not framed using
explicitly causal language and notation. For example, standard regression analyses
are often explained in implicitly causal terms, e.g., when regression coefficients
are portrayed as representing the expected difference in outcome if all covariates
were held constant, except for one covariate whose value was increased by one.
In contrast, without causal assumptions, these coefficients can only represent the
expected difference in outcome for two units who happen to have the same covariate
values, except for one covariate whose values happen to differ by one; manipulation
of the covariate cannot be allowed without invoking causal assumptions.
In this chapter we give a review of semiparametric theory and empirical
processes as they arise in causal inference problems. These include very powerful
methodological tools that can be especially useful in causal settings.
In Sect. 2 we give an introduction to causal inference, following Robins [37, 41,
59], van der Laan [57, 59, 60], and others. In order to answer causal questions with
observed data, we need causal assumptions. Sometimes these causal assumptions
can hold by virtue of the study design (e.g., in randomized trials), while at other
times the assumptions we need are untestable and need to be justified based on
subject matter expertise (e.g., in standard observational studies). In either case, as we
discuss in detail in Sect. 2.1, it is important to have a clearly defined study question
(with a corresponding causal parameter of interest). It is similarly important to be
precise about the assumptions that are required to estimate the causal parameter of
interest with observed data. This is the enterprise of identification, which we discuss
briefly in Sect. 2.2.
After a causal parameter of interest has been precisely defined and identified
(i.e., expressed in terms of observed data), then estimation and inference for that
parameter is essentially a purely statistical problem. Classical maximum likelihood
approaches can in theory be used to estimate such identified causal parameters, but
typically require unrealistic parametric assumptions about the entire data-generating
process. In contrast, semiparametric methods allow parts of the data-generating
process to be completely unrestricted, e.g., if they are unknown or involve nui-
sance functions that are not of particular interest to the study question. Thus, if
investigators have a good understanding of the treatment assignment process, for
example, this information can be incorporated into a semiparametric analysis, and
no assumptions might be needed about the outcome process. This is particularly
useful in causal inference settings since the outcome process is often complex
and difficult to model, while investigators may have some information about the
treatment mechanism (e.g., by surveying doctors about how they prescribe some
treatment).
Alternatively, in many cases investigators may not have much information
available about any part of the data-generating process. Then it will often be most
reasonable to use a nonparametric model, which does not make any parametric
assumptions at all about the data-generating process. A nonparametric model can be
viewed as a special case of a semiparametric model, so the theory reviewed in this
chapter covers these settings as well as those where treatment is assigned according
to some known process.
2 Setup
In this section we briefly introduce the basic setup of a typical causal inference
problem. We focus on two essential components of causal inference: first, for-
mulating a clearly defined parameter of interest, and second, exploring how and
whether this target parameter is identified with observed data. These issues are very
important and provide a crucial foundation for semiparametric causal inference;
however, we give only a brief treatment since the main goal of this chapter is to
discuss semiparametric theory and empirical processes. Much of the discussion here
is inspired by pioneering work by Robins [37, 41, 59], van der Laan [57, 59, 60],
and colleagues.
An important first step in any scientific pursuit is to have a clearly defined goal. In
a statistical analysis, this includes giving a precise expression for a parameter of
interest, which we will refer to as the target parameter.
The target parameter is the main feature of interest in the analysis, and ideally is
decided upon based on collaborative discussion between scientific investigators and
the statistician or analyst. In practice, however, the target parameter is sometimes
defined only in vague terms, or is chosen based on convenience rather than scientific
interest. In causal inference problems, the target parameter is typically formulated
in terms of hypothetical interventions and corresponding counterfactual data, which
represent the data that would have been observed under some intervention. In this
chapter we mostly rely on the potential outcome framework, due to Neyman [28]
and Rubin [46, 47], but note that alternative frameworks based on structural equation
models and graphs [30, 31], or decision theory [10] can also be useful.
For example, in some population of units (e.g., patients), let $Y \in \mathbb{R}$ denote a random variable representing an outcome of interest (e.g., blood pressure, or an indicator for whether a heart attack occurred), and let $A \in \{0, 1\}$ denote a binary treatment (e.g., receipt of a statin), whose effect is in question. Then it may be of interest to estimate the average causal effect, i.e., how the expected outcome would have differed had everyone in the population taken treatment versus if no one in the population had taken treatment. This quantity can be represented notationally as follows. Let $Y^a$ denote the potential outcome that would have been observed (for a particular unit in the population) had that unit taken treatment level $A = a$. For a binary treatment, for example, this notation gives rise to two potential outcomes, $Y^1$ and $Y^0$, which are the outcomes that would have been observed for a particular unit under treatment ($A = 1$) and control ($A = 0$), respectively. Then the average causal effect in the population can be defined as
$$\psi = E(Y^1 - Y^0). \qquad (8.1)$$
2.2 Identification
$$A = a \implies Y = Y^a. \qquad \text{(C1)}$$
Condition (C1) is called “consistency” [68] and holds if potential outcomes are
defined uniquely by a unit’s own treatment and not others’ (i.e., no interference),
and also not by the way treatment is administered (i.e., no different versions of
treatment). Also suppose that there exists some set of observed covariates L that
render treatment independent of potential outcomes when conditioned upon, i.e.,
$$A \perp\!\!\!\perp Y^a \mid L. \qquad \text{(C2)}$$
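Under (C1), (C2), and the additional positivity condition $P(A = a \mid L) > 0$, a standard identification argument gives
$$\psi = E(Y^1 - Y^0) = E\{ E(Y \mid L, A = 1) - E(Y \mid L, A = 0) \} = \int\!\!\int \{ y \, dP(y \mid l, a = 1) - y \, dP(y \mid l, a = 0) \} \, dP(l),$$
so that the average causal effect is expressed entirely in terms of the observed data distribution.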
3 Semiparametric Theory
Suppose we observe an independent and identically distributed sample $(Z_1, \ldots, Z_n)$ from a probability distribution $P$ with density $p$ with respect to some dominating measure. In general we write $p(X = t)$ for the density of $X$ at $t$, but when there is no ambiguity we let $p(x) = p(X = x)$. A statistical model $\mathcal{P}$ is a set of possible probability distributions, which is assumed to contain the observed data distribution $P_0$. In a parametric model, $\mathcal{P}$ is assumed to be indexed by a finite-dimensional real-valued parameter $\theta \in \mathbb{R}^q$.
For example, one might assume $\mu(v) = \mu(v; \beta)$ for $\beta \in \mathbb{R}^p$. Such assumptions are not always easily encoded directly in the distribution $p(z)$, but can still be employed in conjunction with parametric assumptions about the treatment mechanism, for example, or in otherwise nonparametric models. An alternative approach is to use nonparametric working models [25], where instead of assuming $\mu(v) = \mu(v; \beta)$ we define our target parameter as a projection of $\mu(v)$ onto the model $\mu(v; \beta)$ (using, for example, a weighted least squares projection).
An estimator $\hat\psi$ is said to be asymptotically linear with influence function $\varphi$ if
$$\hat\psi - \psi_0 = P_n\{ \varphi(Z) \} + o_p(1/\sqrt{n}),$$
where $\varphi$ has mean zero and finite variance (i.e., $E\{\varphi(Z)\} = 0$ and $E\{\varphi(Z)^{\otimes 2}\} < \infty$). Here $o_p(1/\sqrt{n})$ employs the usual stochastic order notation, so that $X_n = o_p(1/r_n)$ means $r_n X_n \overset{p}{\to} 0$, where $\overset{p}{\to}$ denotes convergence in probability.
Importantly, by the classical central limit theorem, an estimator $\hat\psi$ with influence function $\varphi$ is asymptotically normal with
$$\sqrt{n}\,(\hat\psi - \psi_0) \rightsquigarrow N\big( 0, \, E\{ \varphi(Z)^{\otimes 2} \} \big), \qquad (8.7)$$
For example, suppose the propensity score $\pi(L) = P(A = 1 \mid L)$ is known, and consider the inverse-probability-weighted estimator
$$\hat\psi_{ipw} = P_n\left\{ \frac{AY}{\pi(L)} - \frac{(1 - A)Y}{1 - \pi(L)} \right\}. \qquad (8.8)$$
(Note that $E(\hat\psi_{ipw}) = \psi_0$ by iterated expectation.) The influence function for the estimator $\hat\psi_{ipw}$ is clearly given by
$$\varphi_{ipw}(Z) = \frac{AY}{\pi(L)} - \frac{(1 - A)Y}{1 - \pi(L)} - \psi_0, \qquad (8.9)$$
since $\hat\psi_{ipw} - \psi_0 = P_n\{ \varphi_{ipw}(Z) \}$ exactly, without any $o_p(1/\sqrt{n})$ approximation error.
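As a concrete numerical illustration (a minimal R sketch with simulated data and an assumed known propensity score; all variable names are hypothetical), the exact identity above together with (8.7) suggests the following influence-function-based confidence interval:

set.seed(2)
n <- 2000
L <- rnorm(n)
pi_L <- plogis(0.3 + 0.5 * L)                   # known propensity score
A <- rbinom(n, 1, pi_L)
Y <- 1 + L + A + rnorm(n)                       # true average effect equals 1
psi_hat <- mean(A * Y / pi_L - (1 - A) * Y / (1 - pi_L))    # IPW estimator (8.8)
phi <- A * Y / pi_L - (1 - A) * Y / (1 - pi_L) - psi_hat    # influence function (8.9)
se  <- sqrt(mean(phi^2) / n)                    # plug-in standard error from (8.7)
c(estimate = psi_hat, lower = psi_hat - 1.96 * se, upper = psi_hat + 1.96 * se)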
Now suppose we are in an observational study setting where the propensity score $\pi(l)$ needs to be estimated, and suppose we do so with a correctly specified parametric model $\pi(l; \alpha)$, with $\alpha \in \mathbb{R}^q$, so that the estimator $\hat\alpha$ solves some estimating equation $P_n\{ S(Z; \hat\alpha) \} = 0$. Then the inverse-probability-weighted estimator $\hat\psi_{ipw}$ is given by (8.8) above, except with the estimated propensity score $\pi(L; \hat\alpha)$ replacing the true propensity score $\pi(L)$. We can find the corresponding influence function by standard estimating equation techniques [49]. Specifically, we have that $\hat\theta = (\hat\psi_{ipw}, \hat\alpha^T)^T$ solves $P_n\{ m(Z; \hat\theta) \} = 0$, where $m(z; \theta) = \{ \varphi_{ipw}(z; \psi, \alpha), S(z; \alpha)^T \}^T$ are the stacked estimating equations for $\psi$ and $\alpha$, with the influence function for known propensity score given by $\varphi_{ipw}(Z; \psi, \alpha) = AY/\pi(L; \alpha) - (1 - A)Y/\{1 - \pi(L; \alpha)\} - \psi$. Then under standard regularity conditions [27, 53, 64] we have
$$\hat\theta - \theta_0 = P_n\left[ E\left\{ -\frac{\partial m(Z; \theta_0)}{\partial \theta^T} \right\}^{-1} m(Z; \theta_0) \right] + o_p(1/\sqrt{n}), \qquad (8.10)$$
which, after evaluating and rearranging, yields the influence function for $\hat\psi_{ipw}$ based on the estimated propensity score.
Surprisingly, even if the propensity score is known, it can be shown [53] that the inverse-probability-weighted estimator $\hat\psi_{ipw}$ based on an estimated propensity score is at least as efficient as the inverse-probability-weighted estimator that uses the known propensity score. In other words, the variance of the influence function for the estimated-propensity-score estimator is less than or equal to the variance of the influence function $\varphi_{ipw}(Z)$ for the known propensity score. Thus the propensity score should be estimated from the data (according to a correct model, of course) even when it is known; discarding information can actually yield better efficiency.
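This somewhat counterintuitive phenomenon can be checked numerically. The following small simulation (a sketch under an assumed data-generating process, with hypothetical names) compares the two inverse-probability-weighted estimators over repeated samples:

set.seed(3)
sim_once <- function(n = 1000) {
  L <- rnorm(n)
  pi_L <- plogis(0.5 * L)                          # true propensity score
  A <- rbinom(n, 1, pi_L)
  Y <- 2 * L + A + rnorm(n)
  ipw <- function(p) mean(A * Y / p - (1 - A) * Y / (1 - p))
  p_hat <- fitted(glm(A ~ L, family = binomial))   # estimated propensity score
  c(known = ipw(pi_L), estimated = ipw(p_hat))
}
res <- replicate(500, sim_once())
apply(res, 1, var)   # the estimated-score version typically shows the smaller variance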
So far we have seen that, given an estimator $\hat\psi$, we can learn about its asymptotic behavior by considering its influence function $\varphi(Z)$. But we can also use influence functions to find or construct estimators with good properties.
In a parametric model with $\theta = (\psi, \eta^T)^T$, the tangent space decomposes as $T = T_\psi \oplus \Lambda$, with $T_\psi = \{ b \, S_\psi(Z; \theta_0) : b \in \mathbb{R} \}$ and $\Lambda = \{ b^T S_\eta(Z; \theta_0) : b \in \mathbb{R}^q \}$, where $S_\psi(Z; \theta_0) = \partial \log p(z; \theta)/\partial\psi \mid_{\theta = \theta_0}$ is the score function for the target parameter, and similarly $S_\eta(Z; \theta_0) = \partial \log p(z; \theta)/\partial\eta \mid_{\theta = \theta_0}$ is the score for the nuisance parameter ($A \oplus B$ denotes the direct sum $A \oplus B = \{ a + b : a \in A, b \in B \}$). In the above formulation, the space $\Lambda$ is called the nuisance tangent space. Influence functions for $\psi$ reside in the orthogonal complement of the nuisance tangent space, denoted by $\Lambda^\perp = \{ g \in L_2(P) : P(gh) = 0 \text{ for any } h \in \Lambda \}$. In such parametric settings, this orthogonal space $\Lambda^\perp$ admits an explicit characterization.
In semiparametric models, a central technical tool is the parametric submodel; a convenient class of parametric submodels is given by
$$p_\epsilon(z) = p(z)\{ 1 + \epsilon g(z) \}, \qquad (8.14)$$
where $E\{ g(Z) \} = 0$ and we have $\sup_z |g(z)| < M$ and $|\epsilon| < 1/M$ so that $p_\epsilon(z) \ge 0$. We will often index the parametric submodel by the function $g$, and so let $P_\epsilon = P_{\epsilon, g}$. Note again that parametric submodels like the one above are a technical device for constructing tangent spaces and analyzing semiparametric models, rather than a usual model whose parameters we want to estimate from data (since $P_\epsilon$ depends on the true distribution $P_0$, it cannot be used as a model in the usual sense) [53].
One intuition behind parametric submodels can be expressed in terms of efficiency bounds as follows [64]. First note that it is an easier problem to estimate $\psi$ under the parametric submodel $P_\epsilon \in \mathcal{P}$ than it is to estimate $\psi$ under the entire (larger) semiparametric model $\mathcal{P}$. Therefore the efficiency bound under the larger model $\mathcal{P}$ must be larger than the efficiency bound under any parametric submodel. In fact we can define the efficiency bound for semiparametric models as the supremum of all such parametric submodel efficiency bounds.
Now that we have defined parametric submodels, how can they be used to construct tangent spaces? Just as the tangent space is defined as the linear span of the score vector in parametric models, in semiparametric models the tangent space $T$ is defined as the (closure of the) linear span of scores of the parametric submodels. In other words, we first define scores on the parametric submodels $P_\epsilon$ with $S_\epsilon(z) = \partial \log p_\epsilon(z)/\partial\epsilon \mid_{\epsilon = 0}$, and then construct parametric submodel tangent spaces as described earlier for standard parametric models, i.e., $T_\epsilon = \{ b \, S_\epsilon(Z) : b \in \mathbb{R} \}$. Note that for parametric submodels like the one defined in (8.14) we have
$$S_\epsilon(z) = \frac{\partial \log p_\epsilon(z)}{\partial \epsilon}\Big|_{\epsilon = 0} = g(z),$$
so that the functions $g$ indexing the parametric submodels are set up to equal the parametric submodel scores. The closure $T$ of the parametric submodel tangent spaces $T_\epsilon$ is the minimal closed set that contains them; roughly speaking, $T$ is the union of all the spaces $T_\epsilon$ along with their limit points. Similarly, the nuisance tangent space $\Lambda$ for a semiparametric model is the set of scores in $T$ that do not vary the target parameter $\psi$.
Importantly, in nonparametric models the tangent space is the whole Hilbert space of
mean zero functions. For more restrictive semiparametric models the tangent space
will be a proper subspace.
Now that we are equipped with definitions of tangent spaces and nuisance tangent
spaces in semiparametric models, we can define influence functions, efficient
influence functions, and efficient scores in much the same way we did before with
parametric models.
Specifically, the subspace of influence functions is the set of elements $\varphi \in \Lambda^\perp$ that satisfy $P(\varphi S_{\mathrm{eff}}) = 1$. The efficient influence function is the influence function with the smallest covariance, $P(\varphi_{\mathrm{eff}}^2) \le P(\varphi^2)$ for all $\varphi$; it is given by $\varphi_{\mathrm{eff}} = P(S_{\mathrm{eff}}^2)^{-1} S_{\mathrm{eff}}$, where $S_{\mathrm{eff}}$ is the efficient score, defined as the projection of the score onto the orthogonal complement of the nuisance tangent space, i.e., $S_{\mathrm{eff}} = \Pi(S \mid \Lambda^\perp) = S - \Pi(S \mid \Lambda)$ as before. The efficient influence function can also be defined as the projection of any influence function $\varphi$ onto the tangent space, $\varphi_{\mathrm{eff}} = \Pi(\varphi \mid T)$ for any influence function $\varphi$, and it is also a pathwise derivative of the target parameter in the sense that $P(\varphi S_\epsilon) = \partial \psi(P_\epsilon)/\partial\epsilon \mid_{\epsilon = 0}$.
First consider the term $\psi'(0) = \partial \psi(P_\epsilon)/\partial\epsilon \mid_{\epsilon = 0}$. By definition we have
$$\psi = \int\!\!\int \{ y \, dP(y \mid l, a = 1) - y \, dP(y \mid l, a = 0) \} \, dP(l),$$
and $\psi(P_\epsilon)$ is obtained by replacing $P$ with $P_\epsilon$ throughout, so that, writing $\ell'_\epsilon$ for the scores of the corresponding submodel densities,
$$\psi'(\epsilon) = \int\!\!\int \{ y \, \ell'_\epsilon(y \mid l, a = 1; \epsilon) \, dP_\epsilon(y \mid l, a = 1) - y \, \ell'_\epsilon(y \mid l, a = 0; \epsilon) \, dP_\epsilon(y \mid l, a = 0) \} \, dP_\epsilon(l) + \int \{ \mu_\epsilon(l, 1) - \mu_\epsilon(l, 0) \} \, \ell'_\epsilon(l; \epsilon) \, dP_\epsilon(l). \qquad (8.19)$$
A parallel computation expands the covariance $P(\varphi S_\epsilon)$ as
$$P(\varphi S_\epsilon) = E\Big[ E\{ Y \ell'_0(Y \mid L, A = 1; 0) \mid L, A = 1 \} - E\{ Y \ell'_0(Y \mid L, A = 0; 0) \mid L, A = 0 \} + \{ \mu(L, 1) - \mu(L, 0) \} \ell'_0(L; 0) \Big] = \int\!\!\int \{ y \, \ell'_0(y \mid l, a = 1; 0) \, dP(y \mid l, a = 1) - y \, \ell'_0(y \mid l, a = 0; 0) \, dP(y \mid l, a = 0) \} \, dP(l) + \int \{ \mu(l, 1) - \mu(l, 0) \} \, \ell'_0(l; 0) \, dP(l). \qquad (8.20)$$
These equalities follow from iterated expectation and the fact that, by usual properties of score functions, $E\{ \ell'(V \mid W; 0) \mid W \} = 0$. Since the last expression for the covariance $P(\varphi S_\epsilon)$ in Eq. (8.20) equals the expression for $\psi'(\epsilon)$ from Eq. (8.19) when evaluated at $\epsilon = 0$, we have shown that $\varphi$ is in fact the efficient influence function.
So far we have introduced the notion of a tangent space and discussed how influence
functions $\varphi$ for regular asymptotically linear estimators can be viewed as elements of a subspace of the Hilbert space $L_2(P)$, namely the orthogonal complement of the nuisance tangent space, i.e., $\varphi \in \Lambda^\perp$. We also illustrated how to check that a proposed influence function is the efficient influence function. But how does one find the space $\Lambda^\perp$ in a given problem? In many cases this is a bit of an art: one conjectures the form of $\Lambda^\perp$ and then checks that the conjectured space satisfies
the required properties. For nonparametric models, one can sometimes deduce the
form of the efficient influence function from the nonparametric maximum likelihood
estimator, assuming discrete data [60]. However, in some settings it can be useful to
characterize influence functions with hypothetical “full data” (i.e., had we observed
all counterfactuals), and then map these to observed data influence functions [59].
To characterize full-data influence functions in causal inference problems we
need to start by presenting causal inference as a missing data problem [53, 59]. Thus
far we have supposed that we observe an independent and identically distributed
sample of observations Z P. In general missing data problems, we conceive
of hypothetical full data Z,Q of which the observed data Z is a coarsened version.
The problem is that we want to learn about the distribution PQ of the full data Z, Q
Q
but we only get to observe the coarsened version Z of the full data Z. In general
coarsened data problems, Z D ˆ.Z; Q C/ is a known many-to-one function ˆ./ of
both Z and a coarsening variable C that indicates what portion of ZQ is observed.
Q
In causal inference settings, the coarsening variable generally equals the treatment
process so that C D A, and
$$\tilde{Z} = \{ Z^a : a \in \mathcal{A} \}. \qquad (8.21)$$
Thus the full data $\tilde{Z}$ are the potential outcomes under different levels $a \in \mathcal{A}$ of a general treatment process $A$ (here $A$ could be multivariate, e.g., a treatment sequence over multiple timepoints). For a given unit we only get to observe $Z = \Phi(\tilde{Z}, A) = Z^A$, i.e., the potential outcome under the observed treatment process. For instance, in our running example where $Z = (L, A, Y)$ with binary treatment so that $\mathcal{A} = \{0, 1\}$, the full data for a given unit could be represented as
$$\tilde{Z} = \{ (L^a, Y^a) : a \in \{0, 1\} \} = (L, Y^0, Y^1).$$
Note that the last equality follows since $L^a = L$ if we make the usual assumption that events in the past cannot be affected by the future. In some cases we might also want to include the observed treatment process in the full data, so that in the above example we would have $\tilde{Z} = (L, A, Y^0, Y^1)$. In a longitudinal setting where covariates and a binary treatment are updated at timepoints $t = 1, \ldots, K$ and an outcome is measured at the end of follow-up, the full data would comprise the potential covariate and outcome values under every treatment sequence $\bar{a}_K$, where $\bar{a}_t = (a_1, \ldots, a_t)$ denotes the past history of a variable through time $t$. The observed data in this case would be $Z = (L_1, A_1, \ldots, L_t, A_t, \ldots, L_K, A_K, Y)$ for a given unit. Not every causal inference problem fits in the above framework, but when the framework applies it can often be very useful.
Now that we have defined the full data $\tilde{Z}$ and given some examples, we can also define corresponding tangent spaces, influence functions, and parametric submodels, using semiparametric models for the full data just as we did for the
observed data previously. The advantage is that it is often more straightforward to
derive tangent spaces and influence functions for full data problems (or else results
may already be known for common models), and then translate them to observed
data, rather than working with observed data directly and using the results from
previous subsections. Of course, in order to translate full data influence functions to
observed data influence functions, we need identifying assumptions.
Under a coarsening at random assumption [14], results for mapping full data to observed data tangent spaces are given, for example, in [59] and [53]. In general, coarsening at random means $P(Z = z \mid \tilde{Z} = \tilde{z}_1) = P(Z = z \mid \tilde{Z} = \tilde{z}_2)$ whenever $z = \Phi(\tilde{z}_1, a) = \Phi(\tilde{z}_2, a)$ for some $a \in \mathcal{A}$. In many problems [40], this can be equivalently expressed by saying that $P(A = a \mid \tilde{Z} = \tilde{z}_1) = P(A = a \mid \tilde{Z} = \tilde{z}_2)$ whenever $z = \Phi(\tilde{z}_1, a) = \Phi(\tilde{z}_2, a)$, i.e., the treatment process depends on the full data only through the observed data $z$. Under some conditions, coarsening at random also reduces to a randomization assumption, which says treatment is independent of potential outcomes given the observed past, e.g., $A \perp\!\!\!\perp Y^a \mid L$ in our running example, or $A_t \perp\!\!\!\perp Y^{\bar{a}_K} \mid \bar{L}_t, \bar{A}_{t-1}$ in the above longitudinal example. More details on these issues are given in [40, 59]. Again we point out that this framework does not always apply: sometimes coarsening at random is not equivalent to treatment randomization, or is not the identifying assumption we wish to utilize.
Here we will be content giving a simple example of how to map a full data influence function to the observed data, rather than discussing details in full generality; see [59] and [53] for more general results. Assume coarsening at random holds, and that the treatment assignment process is known. Further suppose the observed data is $Z = (L, A, Y)$ with $A \in \{0, 1\}$ and our goal is to estimate $E(Y^1 \mid V) = \mu(V; \beta)$, where $V \subseteq L$ is a subset of the covariates. The full data orthogonal complement of the nuisance tangent space includes functions of the form $\{ Y^1 - \mu(V; \beta) \} h(V)$ for arbitrary functions $h$, and these can be mapped to observed-data estimating functions of the form
$$\frac{A}{\pi(L)} \{ Y^1 - \mu(V; \beta) \} h(V)$$
(the simplest estimator would use the above directly as an estimating function, for a fixed choice of $h$). Note that functions of the above form only depend on observed data since $Y^1 = Y$ when $A = 1$. This represents an inverse-probability-weighting approach for mapping full data spaces to observed data spaces.
4 Empirical Processes
$$P_n\{ \varphi(Z; \psi, \hat\eta) \} = 0 \qquad (8.26)$$
Empirical process theory has been developed by many authors, including van der Vaart [64, 65, 67] and Wellner [48, 67], among many others [21, 60]. The field of empirical process theory is vast; we limit our discussion to tools for handling nuisance estimation.
To motivate our study of empirical processes, consider our running example where the goal is to estimate the average treatment effect $\psi = E(Y^1 - Y^0)$. Specifically, consider the doubly robust estimator for $\psi$ that solves an estimated version of the efficient influence function presented in Sect. 3.4, i.e., the estimator given by $\hat\psi = P_n\{ m_1(Z; \hat\eta) - m_0(Z; \hat\eta) \}$, where
$$m_a(Z; \eta) = \frac{ I(A = a)\{ Y - \mu(L, a) \} }{ a \pi(L) + (1 - a)\{ 1 - \pi(L) \} } + \mu(L, a). \qquad (8.27)$$
Note that in this case the nuisance function is given by $\eta = (\pi, \mu)$.
studies the covariates L are often high-dimensional, and little might be known about
the propensity score and outcome regression functions and , in which case
it makes sense to use flexible, nonparametric, data-adaptive methods to estimate
them. Of course then the asymptotic analysis presented in Sect. 3.2 does not apply,
since the estimators used to construct $\hat\eta = (\hat\pi, \hat\mu)$ will not be described by a single
finite-dimensional parameter. Nonetheless under some conditions we can still learn
about the asymptotics of $\hat\psi$ and obtain valid confidence intervals, using tools from
empirical process theory.
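To fix ideas, here is a minimal R sketch of the doubly robust estimator $\hat\psi = P_n\{ m_1(Z; \hat\eta) - m_0(Z; \hat\eta) \}$ on simulated data. Simple parametric fits are used for $\hat\pi$ and $\hat\mu$ below, but flexible data-adaptive estimators could be substituted for either; all variable names are hypothetical.

set.seed(4)
n <- 5000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * L))
Y <- 1 + L + 0.5 * A + rnorm(n)                              # true effect is 0.5
dat <- data.frame(L = L, A = A, Y = Y)
pi_hat <- fitted(glm(A ~ L, family = binomial, data = dat))  # propensity model
mu_fit <- glm(Y ~ L + A, data = dat)                         # outcome model
mu1 <- predict(mu_fit, newdata = transform(dat, A = 1))
mu0 <- predict(mu_fit, newdata = transform(dat, A = 0))
m1 <- A * (Y - mu1) / pi_hat + mu1                           # m_1(Z; eta-hat) as in (8.27)
m0 <- (1 - A) * (Y - mu0) / (1 - pi_hat) + mu0               # m_0(Z; eta-hat) as in (8.27)
psi_hat <- mean(m1 - m0)
se <- sd(m1 - m0) / sqrt(n)                                  # influence-function-based SE
c(estimate = psi_hat, se = se)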
Before going further, we need to introduce some notation. Throughout this section we will use $P\{ f(Z) \} = \int f(z) \, dP(z)$ to denote expectations of $f(Z)$ for a new observation $Z$ (treating the function $f$ as fixed); thus, $P\{ \hat{f}(Z) \}$ is random when $\hat{f}$ is random (e.g., estimated from the sample). Contrast this with the fixed non-random quantity $E\{ \hat{f}(Z) \}$, which averages over randomness in both $Z$ and $\hat{f}$ and thus will not equal $P\{ \hat{f}(Z) \}$ except when $\hat{f} = f$ is fixed and non-random.
Suppose for simplicity that $\hat\psi = P_n\{ m(Z; \hat\eta) \}$ for some $m$, as in the above example. If we only have $P_n\{ \varphi(Z; \hat\psi, \hat\eta) \} = 0$, then we can proceed similarly, with an extra step requiring differentiability of $P\{ \varphi(Z; \psi, \eta) \}$ in $\psi$ at $\psi_0$, in a neighborhood of $\eta_0$ [64]. Also suppose that $P\{ m(Z; \eta_0) \} = \psi_0$ (alternatively we can define $\psi_0$ so that this holds by definition). For instance, it is straightforward to check for the doubly robust estimator described above that $P\{ m(Z; \pi_0, \mu) \} = P\{ m(Z; \pi, \mu_0) \} = \psi_0$, where $m = m_1 - m_0$. Then consider the decomposition
$$\begin{aligned} \hat\psi - \psi_0 &= P_n\{ m(Z; \hat\eta) \} - P\{ m(Z; \eta_0) \} \\ &= (P_n - P)\{ m(Z; \hat\eta) \} + P\{ m(Z; \hat\eta) \} - P\{ m(Z; \eta_0) \}, \end{aligned} \qquad (8.28)$$
where the first line is true by definition, and the second follows by simply adding and subtracting $P\{ m(Z; \hat\eta) \}$.
We will show that the first term $(P_n - P)\{ m(Z; \hat\eta) \}$ above can be handled under general conditions with empirical process theory. Specifically, we will discuss conditions under which
$$(P_n - P)\{ m(Z; \hat\eta) \} = (P_n - P)\{ m(Z; \eta_0) \} + o_p(1/\sqrt{n}), \qquad (8.29)$$
and thus, combined with a suitable condition on the second term in (8.28), $\hat\psi$ is regular and asymptotically linear. From an empirical process perspective, a primary way to control how close the term $(P_n - P)\{ m(Z; \hat\eta) \}$ is to its limiting version $(P_n - P)\{ m(Z; \eta_0) \}$ (in large samples) is to restrict the complexity of the nuisance function $\eta_0$ and its estimator $\hat\eta$. If these functions are not too complex, then the terms will not differ by more than $o_p(1/\sqrt{n})$.
In this subsection we will discuss characterizing complexity with Donsker classes.
We will start by giving the main result in the context of our example, and will
then describe the conditions in detail. Suppose our nuisance estimator $\hat\eta$ converges to some limit $\eta_0$ in the sense that
$$\| m(\cdot\,; \hat\eta) - m(\cdot\,; \eta_0) \|^2 = \int \{ m(z; \hat\eta) - m(z; \eta_0) \}^2 \, dP(z) = o_p(1), \qquad (8.31)$$
and that $m(\cdot\,; \hat\eta)$ and $m(\cdot\,; \eta_0)$ are contained in a Donsker class of functions (defined below); then the result (8.29) holds.
Thus, asymptotically, nuisance estimation only affects the second term in (8.28).
In order to define a Donsker class, we need to introduce a few concepts first. Throughout this section we use $\mathbb{G}_n = \sqrt{n}(P_n - P)$ for ease of notation. Let $\mathcal{F}$ denote a class of functions $f : \mathcal{Z} \to \mathbb{R}$, and consider the empirical process
$$\{ \mathbb{G}_n f : f \in \mathcal{F} \}. \qquad (8.33)$$
$$\mathbb{G}_n \hat{f} = \mathbb{G}_n f_0 + o_p(1). \qquad (8.35)$$
Suppose the classes $\mathcal{F}$ and $\mathcal{F}_j$ are Donsker; then, as discussed in Sect. 2.10 of [67] and in [1, 64], the following transformations of $\mathcal{F}$ and $\mathcal{F}_j$ are also Donsker:
1. Subsets: $\mathcal{G} \subseteq \mathcal{F}$
2. Unions: $\mathcal{G} = \mathcal{F}_1 \cup \mathcal{F}_2$
3. Closures: $\mathcal{G} = \{ g : f_m \to g \text{ pointwise and in } L_2, \text{ for } f_m \in \mathcal{F} \}$
4. Convex combinations: $\mathcal{G} = \{ g : g = \sum_i w_i f_i \text{ for } f_i \in \mathcal{F}, \ \sum_i |w_i| \le 1 \}$
5. Lipschitz transformations: $\mathcal{G} = \{ g : g = \phi(f_1, \ldots, f_k) \text{ for } f_j \in \mathcal{F}_j \}$ if $\phi$ satisfies $|\phi(f_1, \ldots, f_k)(x) - \phi(f_1', \ldots, f_k')(x)|^2 \le \sum_j \{ (f_j - f_j')(x) \}^2$ for all $f_j$, $f_j'$, and $x$, and if $\sup_{f \in \mathcal{F}_j} |Pf| < \infty$ and $\int \phi(f_1, \ldots, f_k)(x)^2 \, dx < \infty$.
The convex combination result suggests using ensemble methods that use
weighted combinations of estimators, e.g., Super Learner [58, 60, 62]. The Lipschitz
transformation result given above is particularly useful. It means, for example, that
the following function classes are Donsker [1, 64, 67]:
1. Minimums: $\mathcal{G} = \{ g : g = \min(f_1, f_2) \text{ for } f_j \in \mathcal{F}_j \}$
2. Maximums: $\mathcal{G} = \{ g : g = \max(f_1, f_2) \text{ for } f_j \in \mathcal{F}_j \}$
3. Sums: $\mathcal{G} = \{ g : g = f_1 + f_2 \text{ for } f_j \in \mathcal{F}_j \}$
4. Products: $\mathcal{G} = \{ g : g = f_1 f_2 \text{ for } f_j \in \mathcal{F}_j \}$ if the $\mathcal{F}_j$ are uniformly bounded
5. Ratios: $\mathcal{G} = \{ g : g = 1/f \text{ for } f \in \mathcal{F} \}$ if $f \ge \delta > 0$ for all $f \in \mathcal{F}$
Repeated use of stability results like those above often allows one to conclude Donsker properties for the class $\mathcal{M} = \{ m(\cdot\,; \eta) : \eta \in \mathcal{H} \}$ based on Donsker assumptions about the class $\mathcal{H}$.
For example, consider the doubly robust estimator $\hat\psi = P_n\{ m_1(Z; \hat\eta) - m_0(Z; \hat\eta) \}$ given in (8.27). If $\hat\pi$ and $\hat\mu$ take values in Donsker classes $\mathcal{F}_\pi$ and $\mathcal{F}_\mu$, respectively, then $m_a(Z; \hat\eta)$ does as well (provided that $\pi$ is bounded away from zero and one for all $\pi \in \mathcal{F}_\pi$). This follows from Lipschitz results 3 and 5 for sums and ratios above.
To this point we have seen that, if we assume the estimated nuisance functions $\hat\eta$ are contained in Donsker function classes, we can use a standard central limit theorem to analyze $(P_n - P)\, m(Z; \hat\eta)$, since it is asymptotically equivalent to $(P_n - P)\, m(Z; \eta_0)$ up to order $o_p(1/\sqrt{n})$. We have defined Donsker classes and shown how they can be
combined and modified to produce new Donsker classes, but we have yet to give any
specific examples of such classes. For the prior results to be useful over and above
more standard parametric techniques, we need Donsker classes to be able to capture
sufficiently flexible functions. Luckily, this is in fact the case, as we will discuss in
this subsection using specific examples.
First we will simply provide a short list of function classes that are Donsker,
and then we will briefly discuss how one typically shows that a particular class
is Donsker (using bracketing and covering numbers). Results showing that certain
classes are Donsker are somewhat scattered across the literature, but examples and
nice overviews are given by [64, 67], for example. Among many other kinds of
classes, the following simple classes of functions are Donsker classes [13, 64, 67]:
1. Indicator functions: $\mathcal{F} = \{ f : f(x) = I(x < t), \ t \in \mathbb{R} \}$
2. Vapnik–Cervonenkis (VC) classes
3. Bounded monotone functions
4. Lipschitz parametric functions: $\mathcal{F} = \{ f : f(x) = f(x; \theta), \ \theta \in \Theta \subseteq \mathbb{R}^q \}$ with $|f(x; \theta_1) - f(x; \theta_2)| \le b(x) \| \theta_1 - \theta_2 \|$ for some $b$ with $\int |b(x)| \, dP(x) < \infty$
5. Smooth functions: $\mathcal{F} = \{ f : \sup_x | \partial^{\alpha} f(x_1, \ldots, x_q) / \partial x_1^{\alpha_1} \cdots \partial x_q^{\alpha_q} | \le B < \infty \text{ for all } \alpha_1 + \cdots + \alpha_q \le \alpha \}$, with $\alpha > q/2$
6. Sobolev classes: $\{ f : \sup_x |f(x)| \le 1, \ f^{(k-1)} \text{ absolutely continuous}, \ \int |f^{(k)}(x)|^2 \, dx \le 1 \}$
7. Uniform sectional variation: $\{ f : \sup_{x_1} \| f(x_1, \cdot) \|_{tv} \le B_1, \ \sup_{x_2} \| f(\cdot, x_2) \|_{tv} \le B_2 \}$, where $B_1, B_2 < \infty$ and $\| \cdot \|_{tv}$ denotes the total variation norm.
Thus we see that Donsker classes include usual parametric classes, but many
other classes as well, including infinite-dimensional classes that only require certain
smoothness or boundedness. Many other function classes can also be shown to be
Donsker. For example, any appropriate combination or transformation of the above
classes as discussed in the previous subsection will also be Donsker.
Showing that a function class is Donsker is often accomplished using bracketing or covering numbers [64, 67], which are measures of the size of a class $\mathcal{F}$. These measures also provide simple sufficient conditions for a function class being Donsker. An $\epsilon$-bracket (in $L_2(P)$) is defined as all functions $f$ bracketed by functions $[l, u]$ (i.e., $l \le f \le u$) satisfying $\int \{ u(z) - l(z) \}^2 \, dP(z) < \epsilon^2$. The bracketing number of a class $\mathcal{F}$ is the smallest number of $\epsilon$-brackets needed to cover $\mathcal{F}$, and is denoted by $N_B(\epsilon, \mathcal{F})$. Similarly, the covering number of a class $\mathcal{F}$ (with envelope $F$, i.e., $\sup_{\mathcal{F}} |f| \le F$) is the smallest number of $L_2(Q)$ balls of radius $\epsilon$ needed to cover $\mathcal{F}$, and is denoted by $N_C(\epsilon, \mathcal{F})$. Then the class $\mathcal{F}$ is Donsker if either
$$\int_0^1 \sqrt{ \log N_B(\epsilon, \mathcal{F}) } \, d\epsilon < \infty, \quad \text{or} \quad \int_0^1 \sup_Q \sqrt{ \log N_C(\epsilon \| F \|_{Q,2}, \mathcal{F}) } \, d\epsilon < \infty. \qquad (8.36)$$
Now we return to analyze the asymptotic behavior of the doubly robust estimator of the average treatment effect $\psi = E(Y^1 - Y^0)$ from Sect. 3.4, which is given by $\hat\psi = P_n\{ m(Z; \hat\eta) \} = P_n\{ m_1(Z; \hat\eta) - m_0(Z; \hat\eta) \}$, with $m_a(Z; \eta)$ as in (8.27). As discussed in Sect. 4.2, if the estimators $\hat\pi$ and $\hat\mu$ take values in Donsker classes, then $m_a(Z; \hat\eta)$ does as well (as long as functions in the class containing $\hat\pi$ are uniformly bounded away from zero and one). Therefore the result in (8.29) applies, and we have
$$\hat\psi - \psi_0 = (P_n - P)\, m(Z; \eta_0) + P\{ m(Z; \hat\eta) - m(Z; \eta_0) \} + o_p(1/\sqrt{n}). \qquad (8.39)$$
A direct calculation shows that the second term in (8.39) equals
$$\sum_{a \in \{0, 1\}} P\left[ \frac{ \pi_0(L) - \hat\pi(L) }{ a \hat\pi(L) + (1 - a)\{ 1 - \hat\pi(L) \} } \{ \mu_0(L, a) - \hat\mu(L, a) \} \right]. \qquad (8.40)$$
Therefore, by the fact that $\hat\pi$ is bounded away from zero and one, along with the Cauchy–Schwarz inequality ($P(fg) \le \| f \| \, \| g \|$), we have that (up to a multiplicative constant) $|P\{ m(Z; \hat\eta) - m(Z; \eta_0) \}|$ is bounded above by
$$\sum_{a \in \{0, 1\}} \| \pi_0(L) - \hat\pi(L) \| \, \| \mu_0(L, a) - \hat\mu(L, a) \|. \qquad (8.41)$$
Thus if both $\| \pi_0 - \hat\pi \|$ and $\| \mu_0(\cdot, a) - \hat\mu(\cdot, a) \|$ converge at faster than $n^{1/4}$ rates, the product in (8.41) is $o_p(1/\sqrt{n})$ and the remainder term in (8.39) is asymptotically negligible. Conditions ensuring given convergence rates for kernel estimators are described, for example, in [27]. Thus some modeling is in general required to attain $n^{1/4}$ rates, but luckily numerous semiparametric models yield estimators that can satisfy this condition. In particular, faster than $n^{1/4}$ rates are possible with single index models, generalized additive models, and partially linear models (see, for example, [17] for a review of such models, which typically yield estimators with $n^{2/5}$ rates), as well as regularized estimators such as the Lasso [5, 6]. Cross-validation-based weighted combinations of such estimators (e.g., Super Learner) can also satisfy this rate condition if one of the candidate estimators does [58].
Inference after nonparametric estimation of in truly doubly robust settings
where one arbitrary nuisance estimator can be misspecified is more complicated.
If one of the estimators $\hat\pi$ or $\hat\mu$ is misspecified, so that either $\| \hat\pi - \pi_0 \| = O_p(1)$ or $\| \hat\mu - \mu_0 \| = O_p(1)$, then obtaining root-n rate inference for standard estimators will
typically require knowledge of which estimator is correctly specified, as well as that
the correctly specified estimator is based on a parametric model. More sophisticated
estimators that weaken this requirement are discussed in the next section (e.g., [56]).
Higher-order influence function methods [42, 44, 66] have also been developed for use in settings where root-n rates of convergence are not possible. Further, Donsker-type regularity conditions (though not rate conditions) can be weakened via cross-validation approaches, proposed, for example, by Zheng and van der Laan [72]. We also supposed in this review that our target parameter was a low-dimensional Euclidean parameter $\psi \in \mathbb{R}^p$ that admitted regular asymptotically linear estimators.
However, in some settings these conditions fail to hold. As mentioned above,
Robins et al. [42, 44, 66] considered semiparametric minimax estimation in settings
where the parameter of interest is Euclidean, but root-n rates of convergence cannot
be attained due to high-dimensional covariates. Estimation of functional effect
parameters was considered by Diaz and van der Laan [11], Kennedy et al. [20]
in the context of continuous treatment effects; in such settings the target parameter
is a non-pathwise differentiable curve, and root-n rates of convergence are again
not possible. Inference for a non-regular parameter in an optimal treatment regime
setting was considered by Luedtke and van der Laan [22]; in this case, non-regularity
does not preclude the existence of root-n rate inference.
Numerous other authors have also made important contributions extending
semiparametric causal inference to novel settings; unfortunately, we cannot list all
of them here. In addition, much important work is left to be done, both in the areas
mentioned above and in many other interesting settings.
References
12. Diaz, I., Carone, M., van der Laan, M.J.: Second order inference for the mean of a variable
missing at random. U.C. Berkeley Division of Biostatistics Working Paper Series, vol. 337,
pp. 1–22 (2015)
13. Gill, R.D., van der Laan, M.J., Wellner, J.A.: Inefficient estimators of the bivariate survival
function for three models. Ann. Inst. Henri Poincare. 31, 545–597 (1995)
14. Gill, R.D., van der Laan, M.J., Robins, J.M.: Coarsening at random: characterizations,
conjectures, counter-examples. In: Proceedings of the First Seattle Symposium in Biostatistics,
pp. 255–294. Springer, New York (1997)
15. Hahn, J.: On the role of the propensity score in efficient semiparametric estimation of average
treatment effects. Econometrica 66, 315–333 (1998)
16. Hernan, M.A., Robins, J.M.: Instruments for causal inference: an epidemiologist’s dream?
Epidemiology 17, 360–372 (2006)
17. Horowitz, J.L.: Semiparametric and Nonparametric Methods in Econometrics. Springer,
New York (2009)
18. Hudgens, M.G., Halloran, M.E.: Toward causal inference with interference. J. Am. Stat. Assoc. 103, 832–842 (2008)
19. Kennedy, E.H., Sjolander, A., Small, D.S.: Semiparametric causal inference in matched cohort
studies. Biometrika 102, 739–746 (2015)
20. Kennedy, E.H., Ma, Z., McHugh, M.D., Small, D.S.: Nonparametric methods for doubly robust
estimation of continuous treatment effects. arXiv preprint, arXiv:1507.00747 (2015)
21. Kosorok, M.R.: Introduction to Empirical Processes and Semiparametric Inference. Springer,
New York (2007)
22. Luedtke, A.R., van der Laan, M.J.: Statistical inference for the mean outcome under a possibly
non-unique optimal treatment strategy. U.C. Berkeley Division of Biostatistics Working Paper
Series, vol. 332, pp. 1–37 (2014)
23. Manski, C.F.: Partial Identification of Probability Distributions. Springer, New York (2003)
24. Murphy, S.A.: Optimal dynamic treatment regimes. J. R. Stat. Soc. B 65, 331–355 (2003)
25. Neugebauer, R., van der Laan, M.J.: Nonparametric causal effects based on marginal structural
models. J. Stat. Plan. Infer. 137, 419–434 (2007)
26. Newey, W.K.: The asymptotic variance of semiparametric estimators. Econometrica 62,
1349–1382 (1994)
27. Newey, W.K., McFadden, D.: Large sample estimation and hypothesis testing. Handb. Econ.
4, 2111–2245 (1994)
28. Neyman, J.: On the application of probability theory to agricultural experiments: essay on
principles. Excerpts reprinted (1990) in English (D. Dabrowska and T. Speed, trans.) Stat. Sci.
5, 463–472 (1923)
29. Ogburn, E.L., VanderWeele, T.J.: Causal diagrams for interference. Stat. Sci. 29, 559–578
(2014)
30. Pearl, J.: Causal diagrams for empirical research. Biometrika 82, 669–688 (1995)
31. Pearl, J.: Causality. Cambridge University Press, Cambridge (2009)
32. Petersen, M.L., Porter, K.E., Gruber, S., Wang, Y., van der Laan, M.J.: Diagnosing and
responding to violations in the positivity assumption. Stat. Methods Med. Res. 21, 31–54
(2010)
33. Pfanzagl, J.: Contributions to a General Asymptotic Statistical Theory. Springer, New York
(1982)
34. Pfanzagl, J.: Estimation in Semiparametric Models. Springer, New York (1990)
35. Pollard, D.: Convergence of stochastic processes. Springer, New York (1984)
36. Pollard, D.: Empirical processes: theory and applications. In: NSF-CBMS Regional Confer-
ence Series in Probability and Statistics. Institute of Mathematical Statistics and the American
Statistical Association (1990)
37. Robins, J.M.: A new approach to causal inference in mortality studies with a sustained exposure
period - application to control of the healthy worker survivor effect. Math. Mod. 7, 1393–1512
(1986)
38. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some
regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866 (1994)
39. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Analysis of semiparametric regression models for
repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 90, 106–121 (1995)
40. Robins, J.M., Rotnitzky, A., Scharfstein, D.O.: Sensitivity analysis for selection bias and
unmeasured confounding in missing data and causal inference models. In: Statistical Models
in Epidemiology, the Environment, and Clinical Trials, pp. 1–94. Springer, New York (1999)
41. Robins, J.M., Hernan, M.A., Brumback, B.: Marginal structural models and causal inference
in epidemiology. Epidemiology 11, 550–560 (2000)
42. Robins, J.M., Li, L., Tchetgen, E., van der Vaart, A.W.: Higher order influence functions and
minimax estimation of nonlinear functionals. In: Probability and Statistics: Essays in Honor of
David A. Freedman, pp. 335–421. Beachwood, Ohio, USA, Institute of Mathematical Statistics
(2008)
43. Robins, J.M., Hernan, M.A.: Estimation of the causal effects of time-varying exposures.
In: Fitzmaurice, G., Davidian, M., Verbeke, G., Molenberghs, G. (eds.) Longitudinal Data
Analysis, pp. 553–600. Chapman & Hall, London (2009)
44. Robins, J.M., Li, L., Tchetgen, E., van der Vaart, A.W.: Quadratic semiparametric von mises
calculus. Metrika 69, 227–247 (2009)
45. Rose, S., van der Laan, M.J.: A double robust approach to causal effects in case-control studies.
Am. J. Epid. 179, 662–669 (2014)
46. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies.
J. Educ. Psychol. 66, 688–701 (1974)
47. Rubin, D.B.: Bayesian inference for causal effects: the role of randomization. Ann. Stat. 6,
34–58 (1978)
48. Shorack, G.R., Wellner, J.A.: Empirical Processes with Applications to Statistics. Wiley,
New York (1986)
49. Stefanski, L.A., Boos, D.D.: The calculus of M-estimation. Am. Stat. 56, 29–38 (2002)
50. Tchetgen, E., Rotnitzky, A.: Double-robust estimation of an exposure-outcome odds ratio
adjusting for confounding in cohort and case-control studies. Stat. Med. 30, 335–347 (2011)
51. Tchetgen, E., Shpitser, I.: Semiparametric theory for causal mediation analysis: efficiency
bounds, multiple robustness and sensitivity analysis. Ann. Stat. 40, 1816–1845 (2012)
52. Tchetgen, E., VanderWeele, T.J.: On causal inference in the presence of interference. Stat.
Methods Med. Res. 21, 55–75 (2012)
53. Tsiatis, A.A.: Semiparametric Theory and Missing Data. Springer, New York (2006)
54. van der Laan, M.J.: Estimation based on case-control designs with known prevalence probabil-
ity. Int. J. Biostat. 4 (2008). Article 17
55. van der Laan, M.J.: Causal inference for a population of causally connected units. J. Causal
Inf. 2, 13–74 (2014)
56. van der Laan, M.J.: Targeted estimation of nuisance parameters to obtain valid statistical
inference. Int. J. Biostat. 10, 29–57 (2014)
57. van der Laan, M.J.: Targeted learning: From MLE to TMLE. In: Lin, X., Genest, C., Banks,
D.L., et al. (eds.) Past, Present, and Future of Statistical Science, pp. 465–480. Chapman &
Hall, London (2014)
58. van der Laan, M.J., Dudoit, S.: Unified cross-validation methodology for selection among
estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle
inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series.
vol. 130, pp. 1–103 (2003)
59. van der Laan, M.J., Robins, J.M.: Unified Methods for Censored Longitudinal Data and
Causality. Springer, New York (2003)
60. van der Laan, M.J., Rose, S.: Targeted Learning: Causal Inference for Observational and
Experimental Data. Springer, New York (2011)
61. van der Laan, M.J., Rubin, D.: Targeted maximum likelihood learning. Int. J. Biostat. 2, 1–38
(2006)
62. van der Laan, M.J., Polley, E.C., Hubbard, A.E.: Super learner. Stat. Appl. Genet. Mol. 6, 1–21
(2007)
63. van der Laan, M.J., Petersen, M., Zheng, W.: Estimating the effect of a community-based
intervention with two communities. J. Causal Inf. 1, 83–106 (2013)
64. van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (2000)
65. van der Vaart, A.W.: Part III: Semiparametric Statistics. In: Bernard, P. (ed.) Lectures on
Probability Theory and Statistics, pp. 331–457. Springer, New York (2002)
66. van der Vaart, A.W.: Higher order tangent spaces and influence functions. Stat. Sci. 29,
679–686 (2014)
67. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer,
New York (1996)
68. VanderWeele, T.J.: Concerning the consistency assumption in causal inference. Epidemiology
20, 880–883 (2009)
69. VanderWeele, T.J.: Explanation in Causal Inference: Methods for Mediation and Interaction.
Oxford University Press, Oxford (2015)
70. VanderWeele, T.J., Vansteelandt, S.: A weighting approach to causal effects and additive
interaction in case-control studies: marginal structural linear odds models. Am. J. Epidemiol.
174, 1197–1203 (2011)
71. VanderWeele, T.J., Vansteelandt, S.: Invited commentary: some advantages of the relative
excess risk due to interaction (RERI) - towards better estimators of additive interaction. Am. J.
Epidemiol. 179, 670–671 (2014)
72. Zheng, W., van der Laan, M.J.: Asymptotic theory for cross-validated targeted maximum
likelihood estimation. U.C. Berkeley Division of Biostatistics Working Paper Series, vol. 273,
pp. 1–58 (2010)
Chapter 9
Structural Nested Models
for Cluster-Randomized Trials
1 Introduction
2 Motivating Example
3 Estimands
In this section, we introduce the concept of potential outcomes and the notation used in this chapter. We choose the causal relative risk as the estimand of interest in the study. Somewhat different assumptions are required under the different approaches. The causal relative risks are defined for ordinary and weighted structural nested models, respectively.
3.2 Estimands
For the ordinary SNM approach, the estimand of interest is the causal relative risk, as a function of $a$,
$$RR(a) = \frac{ E[ Y_{ij}(a) \mid A_i = a ] }{ E[ Y_{ij}(0) \mid A_i = a ] }, \qquad (9.1)$$
where $Y_{ij}(a)$ is the potential outcome for individual $j$ in cluster $i$ who had adherence $A_i = a$, and $W_1$ is the weight adjustment. Define $P_p(V)$ as the probability that $V$ equals its observed value based on the distribution of the population data. Let $W_{ij1} = P_p(Z_i) / P_p(Z_i \mid X_{ij})$. We define $P_{W1}(Y_{ij}(0), Z_i, X_{ij}) \equiv P_p(Y_{ij}(0), Z_i, X_{ij}) \, W_{ij1}$.
3.3 Assumptions
For the ordinary SNM approach, we first posit a model representing the effect of the individual-level covariates X_ij on Y_ij(0). For the weighted SNM approach, by Assumption 3, P_W1(Y_ij(0), Z_i, X_ij) = P_p(Y_ij(0), X_ij) P_p(Z_i). This weighted distribution reflects the distribution of the population data we would have observed if we could have randomized schools so that the distribution of X_ij were the same at each level of Z_i (e.g., by paired matching or frequency matching [25] of schools); note that for this distribution, Y_ij(0) ⊥ Z_i. Thus, Assumption 3 implies that E_W1(Y_ij(0) | Z_i) = E_W1(Y_ij(0)) = E_p(Y_ij(0)), where E_W1(V | C) is the conditional expectation of V given C with respect to the weighted distribution P_W1(V | C), and E_p(V) is the expectation of V with respect to the population distribution P_p(V). We further assume

Assumption 4: h{E_W1(Y_ij(a) | A_i = a, Z_i)} = h{E_W1(Y_ij(0) | A_i = a, Z_i)} + a^v α,

where h(·) is a canonical link corresponding to a generalized linear model, such as h(p) = p, h(p) = log(p), or h(p) = log(p/(1 − p)).
4 Estimation
In this section, the estimation methodology based on ordinary structural nested mod-
els with different link functions (linear SNM, loglinear SNM, and logistic SNM)
is provided, and the weighted structural nested models developed by Brumback
et al. [4] are reviewed. Different approaches to constructing confidence intervals
are compared and discussed. Computing and programming schemes are provided
for both ordinary and weighted structural nested models.
For the ordinary SNM approach, suppose that for each level of X_ij we could have randomized all clusters in the population and observed both cluster-level adherence and individual-level outcomes, so that X_ij, Z_i, A_i, and the potential outcomes Y_ij(a) for all a at all levels of x are defined for each individual in the population. Based on Assumption 1, conditional on X_ij, Y_ij(0) does not depend on Z_i. We further let f(X_ij; ψ) = X^v_ij ψ_1 + ψ_0, where X^v_ij is defined as a vector-valued function of X_ij (perhaps denoting dummy variables, e.g., when X_ij is a multinomial random variable).
Let W_ij2 be the inverse of the probability that individual j from cluster i was selected into the study. Let μ(A_i, Z_i, X_ij; η) be a parametric model for E(Y_ij | A_i, Z_i, X_ij) with parameter η. When A_i and X_ij are multinomial random variables, one could use the model μ(A_i, Z_i, X_ij; η) = g(A^v_i η_1 + Z^v_i η_2 + A_i Z_i η_3 + X^v_ij η_4), where A^v_i and Z^v_i are defined as vector functions of A_i and Z_i (perhaps denoting dummy variables, e.g., when A_i and Z_i are multinomial random variables), and A_i Z_i represents a multidimensional interaction. Under Assumption 2, letting f(a^v, x; γ) = a^v γ x and assuming that μ(A_i, Z_i, X_ij; η) is correctly specified, we can consistently estimate the parameters by solving the estimating equations at (9.4).
If we use a generalized linear model for μ(A_i, Z_i, X_ij; η) with a canonical link function g^{-1}(·), the first estimating equation at (9.4) can be solved by using weighted GLM software (e.g., PROC GLM in SAS). If we furthermore let h(·) = g^{-1}(·) and D_i = (A^v_i, Z^v_i, A_i Z_i, X^v_ij), then substituting η̂ for η in the second estimating equation at (9.4), it reduces to

    Σ_i Σ_j W_ij2 (Z^v_i, X^v_ij)^T [ g(D_i η̂ − A^v_i γ X_ij) − X^v_ij ψ_1 − ψ_0 ] = 0.   (9.6)

For the loglinear SNM, g(x) = exp(x), and (9.6) can be solved iteratively using Newton's method. Writing f(γ) = g(D_i η̂ − A^v_i γ X_ij), we have

    f'(γ) = −A^v_i X_ij g'(D_i η̂ − A^v_i γ X_ij) = −A^v_i X_ij g(D_i η̂ − A^v_i γ X_ij),   (9.8)

and a first-order expansion about a current estimate γ_t gives

    f(γ) − f(γ_t) ≈ (γ − γ_t) f'(γ_t) = −(γ − γ_t) A^v_i X_ij f(γ_t).   (9.9)

Let A*^v_i = A^v_i g(D_i η̂ − A^v_i γ_t X_ij); then f(γ) − f(γ_t) ≈ −A*^v_i X_ij (γ − γ_t), where f(γ_t) = g(D_i η̂ − A^v_i γ_t X_ij). Also, letting Y*_ij = f(γ_t) + A*^v_i X_ij γ_t, the second estimating equation at (9.4) simplifies to

    Σ_i Σ_j W_ij2 (Z^v_i, X^v_ij)^T ( Y*_ij − A*^v_i X_ij γ − X^v_ij ψ_1 − ψ_0 ) = 0,   (9.10)

which can be solved iteratively using weighted instrumental-variables software (e.g., PROC SYSLIN in SAS) with Y*_ij as the outcome, A*^v_i X_ij and X^v_ij as the endogenous regressors, and Z^v_i and X^v_ij as the instrumental variables.
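To make the iteration concrete, here is a minimal R sketch of one linearization pass for the loglinear SNM under the parameterization above, using ivreg from the AER package in place of SAS PROC SYSLIN. The data frame dat and its columns (A, Z, X, W2), the function name update_gamma, and the treatment of A, Z, and X as scalars are all illustrative simplifications rather than the authors' implementation.

    # One Newton/linearization pass for the loglinear ordinary SNM (illustrative only)
    # dat: hypothetical data frame with A (cluster adherence), Z (randomized arm),
    #      X (individual covariate), W2 (inverse sampling probability)
    library(AER)                                             # ivreg() for weighted two-stage least squares
    update_gamma <- function(dat, eta.hat, gamma.t) {
      lp  <- as.numeric(with(dat, cbind(A, Z, A * Z, X) %*% eta.hat))  # D_i eta.hat
      f.t <- exp(lp - dat$A * gamma.t * dat$X)               # f(gamma_t) = g(D_i eta.hat - A gamma_t X)
      dat$Astar.X <- dat$A * f.t * dat$X                     # endogenous regressor A*_i X_ij
      dat$Ystar   <- f.t + dat$Astar.X * gamma.t             # working response Y*_ij
      fit <- ivreg(Ystar ~ Astar.X + X | Z + X, data = dat, weights = W2)
      coef(fit)["Astar.X"]                                   # updated value of gamma
    }
    # Iterate update_gamma() until successive values of gamma converge;
    # the coefficients on X and the intercept play the role of psi_1 and psi_0.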
The linearization can also be applied to the logit link function. We have

    g(x) = exp(x) / {1 + exp(x)},
    g'(x) = exp(x) / {1 + exp(x)}² = g(x){1 − g(x)}.   (9.11)

Let A*^v_i = A^v_i g(D_i η̂ − A^v_i γ_t X_ij){1 − g(D_i η̂ − A^v_i γ_t X_ij)} and Y*_ij = f(γ_t) + A*^v_i X_ij γ_t. Again, the second estimating equation at (9.4) simplifies to

    Σ_i Σ_j W_ij2 (Z^v_i, X^v_ij)^T ( Y*_ij − A*^v_i X_ij γ − X^v_ij ψ_1 − ψ_0 ) = 0,   (9.13)

which can be solved iteratively using weighted instrumental-variables software (e.g., PROC SYSLIN in SAS) with Y*_ij as the outcome, A*^v_i X_ij and X^v_ij as the endogenous regressors, and Z^v_i and X^v_ij as the instruments.
If we let h(p) = p in Assumption 2, then the estimating equations at (9.4) can be solved by using weighted linear regression, and the second estimating equation at (9.4) becomes

    Σ_i Σ_j W_ij2 (Z^v_i, X^v_ij)^T ( D_i η̂ − A^v_i γ X_ij − X^v_ij ψ_1 − ψ_0 ) = 0.   (9.14)
Assumption 2 states that the distribution of the potential outcomes Y_ij(a) in the population satisfies an ordinary generalized structural nested model. Then we can estimate E[Y_ij(0) | A_i = a] and E[Y_ij(a) | A_i = a] via

    Ê[Y_ij(0) | A_i = a] = Σ_i Σ_j W_ij2 [ g(D_i η̂ − a γ̂ X_ij) ] I(A_i = a),
    Ê[Y_ij(a) | A_i = a] = Σ_i Σ_j W_ij2 [ g(D_i η̂) ] I(A_i = a),   (9.15)

where γ̂ is the estimator obtained by solving the estimating equations at (9.4), and I(A_i = a) is an indicator function taking the value 1 when A_i = a and equaling 0 otherwise. The causal relative risk is then estimated as

    R̂R(a) = Ê[Y_ij(a) | A_i = a] / Ê[Y_ij(0) | A_i = a].   (9.16)
for (η, α). The first estimating equation at (9.17) is unbiased conditional on A_i and Z_i provided μ(A_i, Z_i; η) is correctly specified; if one uses a saturated model, this holds automatically. The second estimating equation at (9.17) is also unbiased, and it can be solved iteratively using Newton's method by linearizing g(D_i η̂ − A^v_i α) about a current estimate of α, then solving the second estimating equation at (9.17) using weighted instrumental-variables software (e.g., PROC SYSLIN in SAS).
As with the linearization procedure in the ordinary SNM approach, the second estimating equation at (9.17) can be simplified to

    Σ_i Σ_j W_ij Z^{vT}_i ( Y*_ij − A*^v_i α ) = 0,   (9.20)

where Y*_ij is the outcome, A*^v_i is the endogenous regressor, and Z^v_i is the instrumental variable.
Assumption 2 states that the distribution of the potential outcomes Y_ij(a) in the population satisfies a weighted generalized structural nested mean model. Then we can estimate

    Ê_W1[Y_ij(0) | A_i = a] = Σ_i Σ_j W_ij g(D_i η̂ − a α̂) I(A_i = a),
    Ê_W1[Y_ij(a) | A_i = a] = Σ_i Σ_j W_ij g(D_i η̂) I(A_i = a),   (9.21)

and the weighted causal relative risk is estimated as

    R̂R_W1(a) = Ê_W1[Y_ij(a) | A_i = a] / Ê_W1[Y_ij(0) | A_i = a].   (9.22)
Let θ denote the full vector of parameters, let h = 1, ..., H index strata and c = 1, ..., C_h index clusters within stratum h, and let U_hc(θ) denote the corresponding contribution to the estimating equations. The sandwich estimator of the variance is

    V̂(θ̂) = Σ_{h=1}^{H} {C_h/(C_h − 1)} Σ_{c=1}^{C_h} {U_hc(θ̂) − Ū_h(θ̂)}{U_hc(θ̂) − Ū_h(θ̂)}^T,   (9.24)

where Ū_h(θ̂) = (1/C_h) Σ_{c=1}^{C_h} U_hc(θ̂). By the law of large numbers and the central limit theorem, θ̂ approximately follows a multivariate normal distribution with mean θ and variance v̂ar(θ̂).
However, the sandwich estimator of the variance is difficult to program. An easier way to estimate v̂ar(θ̂) is to use the bootstrap or the jackknife for complex survey data. Let θ̂_b be the estimate of θ based on the data from the bth bootstrap sample; then the bootstrap estimator of the variance is

    v̂ar_B(θ̂) = {1/(B − 1)} Σ_{b=1}^{B} [ θ̂_b − (1/B) Σ_{b=1}^{B} θ̂_b ]²,   (9.25)

and the jackknife estimator of the variance is

    v̂ar_J(θ̂) = Σ_{h=1}^{H} {(C_h − 1)/C_h} Σ_{c=1}^{C_h} ( θ̂_hc − θ̂ )²,   (9.26)

where θ̂_hc is the estimate computed after deleting cluster c of stratum h. For estimating confidence intervals for functions φ(θ) of θ, such as relative risks, we use the normal approximation to the log of φ(θ).
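As an illustration of the delete-one-cluster jackknife in (9.26), the following R sketch computes the variance estimate given any user-supplied fitting routine; the function name fit_snm and the columns stratum and cluster of the data frame are hypothetical.

    # Delete-one-cluster jackknife variance along the lines of Eq. (9.26)
    jackknife_var <- function(dat, fit_snm) {
      theta.hat <- fit_snm(dat)                      # estimates from the full data
      v <- 0
      for (h in unique(dat$stratum)) {
        clusters <- unique(dat$cluster[dat$stratum == h])
        Ch <- length(clusters)
        for (cl in clusters) {
          drop <- dat$stratum == h & dat$cluster == cl
          theta.hc <- fit_snm(dat[!drop, ])          # re-estimate with one cluster deleted
          v <- v + ((Ch - 1) / Ch) * (theta.hc - theta.hat)^2
        }
      }
      v                                              # element-wise variance for each parameter
    }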
5 Simulation Study
For the weighted SNM approach, Brumback et al. [4] provide a simulation study. For the ordinary SNM approach, we conducted three additional sets of simulations: the first based on a loglinear SNM, with h(p) = log(p) in Assumption 2; the second based on a logistic SNM, with h(p) = log(p/(1 − p)); and the third based on a linear SNM, with h(p) = p. From Eq. (9.3) with f(X_ij; ψ) = X^v_ij ψ_1 + ψ_0 and Assumption 1, we obtain the model for Y_ij(0) given in (9.28).
The distribution of A_i given Z_i and X_ij is shown in Table 9.4, where P_1(A_i = a | X_ij, Z_i) and P_1(Y_ij(0) = 1 | Z_i, X_ij) correspond to the loglinear SNM, P_2(A_i = a | X_ij, Z_i) and P_2(Y_ij(0) = 1 | Z_i, X_ij) correspond to the logistic SNM, and P_3(A_i = a | X_ij, Z_i) and P_3(Y_ij(0) = 1 | Z_i, X_ij) correspond to the linear SNM.
Based on the above distributions, we calculate P(A_i = a | Z_i = z) = Σ_{c=1}^{2} P(A_i = a | Z_i = z, X_ij = c) P(X_ij = c | Z_i = z) and P(A_i = a) = Σ_{z=0}^{1} P(A_i = a | Z_i = z) P(Z_i = z). The distribution of A_i given Z_i is listed in Table 9.5, where P_1(A_i = a | Z_i) is for the loglinear SNM, P_2(A_i = a | Z_i) is for the logistic SNM, and P_3(A_i = a | Z_i) is for the linear SNM.
In the loglinear simulated model, we would have P(A_i = 0) = Σ_{z=0}^{1} P(A_i = 0 | Z_i = z) P(Z_i = z) = 0.6273 and P(A_i = 1) = Σ_{z=0}^{1} P(A_i = 1 | Z_i = z) P(Z_i = z) = 0.3727. Similarly, in the logistic simulated model, we would have P(A_i = 0) = 0.6365 and P(A_i = 1) = 0.3635; and in the linear simulated model, we would have P(A_i = 0) = 0.4363 and P(A_i = 1) = 0.5637. Then we can calculate the joint distribution of X_ij and Z_i given A_i as listed in Table 9.6, where P_1(X_ij = c, Z_i | A_i = a) is for the loglinear SNM, P_2(X_ij = c, Z_i | A_i = a) is for the logistic SNM, and P_3(X_ij = c, Z_i | A_i = a) is for the linear SNM.
To check our results, we simulated a data set with 10,000 observations for 100, 500, and 1000 repetitions. We assume that all parameter estimates (e.g., ψ̂ and γ̂) and log(R̂R(1)) approximately follow normal distributions, so a one-sample t-test can be used to check for bias. The simulation results for all three SNMs are listed in Tables 9.7, 9.8, and 9.9. From the results we can see that, generally, our estimating procedure performs well.
To study the performance of the jackknife method, we simulated 500 data
sets with 500 observations for each SNM approach and computed the confidence
intervals with jackknife variance estimators. The coverage of 95 % confidence
intervals for ψ_0, ψ_1, γ, and the causal relative risk RR(1) for the logistic SNM, loglinear
SNM, and linear SNM are listed in Table 9.10. From the results, we can conclude
that the jackknife performs well.
for A = 1 was 0.51, whereas that for A = 2 was 0.27. Therefore, more reduction in the risk of absenteeism was possible in the schools with A_i = 1. However, when using Newton's method to solve the estimating equations at (9.4) with the loglinear or logistic ordinary generalized SNM approaches, our initial values for ψ and γ have to be close enough to the true ones; otherwise, the iterative algorithm used to solve the estimating equations at (9.4) may fail to converge for some jackknife samples.
7 Discussion
In this book chapter, we presented two methods based on structural nested models
for the analysis of multi-armed cluster-randomized trials with unequal probabilities
of sampling individuals. With the weighted structural nested model, we used
individual-level covariates Xij to construct weights in order to adjust for individual-
level confounding. With the ordinary structural nested model, we included the
individual covariates into our structural nested model. Software and programming
schemes were provided for both weighted and ordinary structural nested models
assuming different link functions (linear SNM, loglinear SNM, and logistic SNM).
We also applied our methods to analyze the effect of adherence in the school-
based WASH study. The computation is straightforward with the application of
instrumental variables software (e.g., SAS PROC SYSLIN). With nonlinear link
functions (e.g., loglinear SNM and logistic SNM), we can solve the estimating
equations at (9.4) and (9.17) iteratively using Newton’s method to linearize the
canonical link functions. However, with the ordinary structural nested models, when Newton's method is used to solve the estimating equations, the starting values have to be close to the true values; otherwise, convergence problems may arise.
To construct confidence intervals, we discussed three different methodologies,
sandwich estimator, bootstrap, and jackknife. However, the sandwich estimator of
variance is difficult to program, and the bootstrap may fail to generate solutions of
the estimating equations. Thus, we turned to the jackknife variance estimator.
To verify our methodologies, we conducted a simulation study for the ordinary
structural nested model. A simulation study for the weighted structural nested model
can be found in Brumback et al. [4]. Generally, the simulation results supported
our methods, and jackknife variance estimations performed well. However, when
using the ordinary SNM with nonlinear link functions (e.g., loglinear and logistic)
to analyze data sets, the initial values have to be close enough to the real ones in
order to avoid convergence issues.
References
1. Albert, J.M.: Estimating efficacy in clinical trials with clustered binary responses. Stat. Med.
21, 649–661 (2002)
2. Angrist, J.D., Imbens, G.W., Rubin, D.B.: Identification of causal effects using instrumental
variables. J. Am. Stat. Assoc. 91, 444–455 (1996)
3. Bhattacharya, J., Goldman, D., McCaffrey, D.: Estimating probit models with self-selected
treatments. Stat. Med. 25, 389–413 (2006)
4. Brumback, B.A., He, Z.L., Prasad, M., Freeman, M.C., Rheingans, R.: Using structural-nested
models to estimate the effect of cluster-level adherence on individual-level outcomes with a
three-armed cluster-randomized trial. Stat. Med. 33, 1490–1502 (2014)
5. Burgess, S., Collaboration, C.C.G.: Identifying the odds ratio estimated by a two-stage
instrumental variable analysis with a logistic regression model. Stat. Med. 32, 4726–4747
(2013)
6. Burgess, S., Thompson, S.G.: Improving bias and coverage in instrumental variable analysis
with weak instruments for continuous and binary outcomes. Stat. Med. 31, 1582–1600 (2012)
7. Cai, B., Small, D.S., Ten Have, T.R.: Two-stage instrumental variable methods for estimating
the causal odds ratio: analysis of bias. Stat. Med. 30, 1809–1824 (2011)
8. Cheng, J., Small, D.S.: Bounds on causal effects in three-arm trials with non-compliance. J. R.
Stat. Soc. Ser. B Stat Methodol. 68, 815–836 (2006)
9. Frangakis, C.E., Rubin, D.B.: Principal stratification in causal inference. Biometrics 58, 21–29
(2002)
10. Freeman, M.C., Greene, L.E., Dreibelbis, R., Saboori, S., Muga, R., Brumback, B., Rheingans,
R.: Assessing the impact of a school-based water treatment, hygiene and sanitation program
on pupil absence in Nyanza province, Kenya: a cluster-randomized trial. Tropical Med. Int.
Health 17, 380–391 (2012)
11. Greenland, S.: An introduction to instrumental variables for epidemiologists. Int. J. Epidemiol.
29, 1102–1102 (2000)
12. Hernan, M.A., Robins, J.M.: Instruments for causal inference - an epidemiologist’s dream?
Epidemiology 17, 360–372 (2006)
13. Jo, B., Asparouhov, T., Muthen, B.O.: Intention-to-treat analysis in cluster randomized trials
with noncompliance. Stat. Med. 27, 5565–5577 (2008)
14. Jo, B., Asparouhov, T., Muthen, B.O., Ialongo, N.S., Brown, C.H.: Cluster randomized trials
with treatment noncompliance. Psychol. Methods 13, 1–18 (2008)
15. Jo, B., Stuart, E.A.: On the use of propensity scores in principal causal effect estimation. Stat.
Med. 28, 2857–2875 (2009)
16. Joffe, M.M., Brensinger, C.: Weighting in instrumental variables and G-estimation. Stat. Med.
22, 1285–1303 (2003)
17. Johnston, K.M., Gustafson, P., Levy, A.R., Grootendorst, P.: Use of instrumental variables in
the analysis of generalized linear models in the presence of unmeasured confounding with
applications to epidemiological research. Stat. Med. 27, 1539–1556 (2008)
18. Korhonen, P.A., Laird, N.M., Palmgren, J.: Correcting for non-compliance in randomized
trials: an application to the ATBC study. Stat. Med. 18, 2879–2897 (1999)
19. Long, Q., Little, R.J.A., Lin, X.H.: Estimating causal effects in trials involving multitreatment
arms subject to non-compliance: a Bayesian framework. J. R. Stat. Soc.: Ser. C: Appl. Stat. 59,
513–531 (2010)
20. Ma, Y., Roy, J., Marcus, B.: Causal models for randomized trials with two active treatments
and continuous compliance. Stat. Med. 30, 2349–2362 (2011)
21. Nagelkerke, N., Fidler, V., Bernsen, R., Borgdorff, M.: Estimating treatment effects in
randomized clinical trials in the presence of non-compliance. Stat. Med. 19, 1849–1864 (2000)
22. Rassen, J.A., Schneeweiss, S., Glynn, R.J., Mittleman, M.A., Brookhart, M.A.: Instrumental
variable analysis for estimation of treatment effects with dichotomous outcomes. Am. J.
Epidemiol. 169, 273–284 (2009)
23. Robins, J.M.: Correcting for noncompliance in randomized trials using structural nested mean
models. Commun. Stat. Theory Methods 23, 2379–2412 (1994)
24. Robins, J.M.: Correction for non-compliance in equivalence trials. Stat. Med. 17, 269–302
(1998)
25. Rothman, K.J., Greenland, S., Lash, T.L.: Modern Epidemiology, 3rd edn. Wolters Kluwer
Health/Lippincott Williams and Wilkins, Philadelphia (2008)
26. Small, D.S., Ten Have, T.R., Joffe, M.M., Cheng, J.: Random effects logistic models for
analyzing efficacy of a longitudinal randomized treatment with non-adherence. Stat. Med. 25,
1981–2007 (2006)
27. Ten Have, T.R., Joffe, M., Cary, M.: Causal logistic models for non-compliance under
randomized treatment with univariate binary response. Stat. Med. 22, 1255–1283 (2003)
28. Vansteelandt, S., Goetghebeur, E.: Causal inference with generalized structural mean models.
J. R. Stat. Soc. Ser. B Stat. Methodol. 65, 817–835 (2003)
29. Vansteelandt, S., Goetghebeur, E.: Sense and sensitivity when correcting for observed expo-
sures in randomized clinical trials. Stat. Med. 24, 191–210 (2005)
30. Vansteelandt, S., Bowden, J., Babanezhad, M., Goetghebeur, E.: On instrumental variables
estimation of causal odds ratios. Stat. Med. 26, 403–422 (2011)
31. Wooldridge, J.M.: Econometric Analysis of Cross Section and Panel Data. MIT Press,
Cambridge, MA (2002)
Chapter 10
Causal Models for Randomized Trials
with Continuous Compliance
We consider the situation where subjects were randomized to one of two active
treatments, and compliance with each treatment was measured on a continuous
scale. Examples of continuous measures of compliance include the duration of
compliance and the proportion of assigned treatment actually received.
The motivating example for this research was a smoking cessation clinical trial.
The Commit to Quit (CTQ) trials [1–3] comprise two longitudinal follow-up studies
of supervised exercise to promote smoking cessation. One arm included cognitive-
behavioral smoking cessation therapy (CBT) augmented by an individualized,
supervised exercise program. In the control arm, CBT was augmented by a wellness
education program that included lectures, films, handouts, and discussions covering
issues such as healthy eating and prevention of cardiovascular disease.
For the CTQ trial, we measure compliance by the proportion of assigned classes
that were actually attended. We then consider the effect of treatment assignment
among interesting subpopulations. Examples could include: the subpopulation that
would be perfectly compliant with either treatment; the subpopulation that would
be highly compliant with either treatment; the subpopulation that would be highly
compliant with wellness but not exercise.
The remainder of the chapter is organized as follows. In Sect. 2 we introduce the
notation, assumptions, and models. In Sect. 3 we describe our estimation procedure.
The example is presented in Sect. 4. Finally, we conclude with a discussion
in Sect. 5.
We consider experimental trials with two active treatments. Let R ∈ {0, 1} denote a randomization indicator, where R = 1 indicates randomization to the new treatment (e.g., supervised exercise plus CBT), and R = 0 indicates randomization to standard therapy (e.g., wellness sessions plus CBT). Let A_r denote compliance with assigned treatment under assignment r. We assume A_r is continuous and possibly bounded (e.g., the proportion of assigned treatment actually taken). Similarly, define Y_r to be the outcome under assignment r. Each person has two potential compliance levels, A_0 and A_1, that characterize compliance under either treatment assignment; however, only A = R A_1 + (1 − R) A_0 is observed. Similarly, each subject has two potential outcomes, Y_0 and Y_1, with Y = R Y_1 + (1 − R) Y_0 observed.
We make three standard assumptions for the development of analytic methods: (1) the stable unit treatment value assumption (SUTVA), which is the assumption that the value of the potential outcomes and potential compliance variables for subject i only depend on the treatment assigned to subject i, not on the treatment assigned to other subjects; (2) randomization (R ⊥⊥ {Y_0, Y_1, A_0, A_1}); and (3) the exclusion restriction. SUTVA essentially states that there is no interference between subjects. Randomization requires that treatment assignment was unrelated to potential outcomes. The exclusion restriction is the assumption that treatment assignment affects the outcome entirely through its effect on treatment received; knowledge of treatment assignment alone will not affect the outcome. These assumptions have been described in detail elsewhere [15]. We make one additional assumption that subjects in group R = r do not have access to the treatment assigned in arm R = 1 − r, for r = 0, 1. This assumption has been referred to as the "treatment access restriction" [16] or "strong treatment access monotonicity" [9]. Without the treatment access restriction assumption, we would have four potential compliance levels rather than two: compliance with treatment j if assigned to treatment k, for (j, k) ∈ {0, 1}². The assumption holds for the CTQ study, as subjects were not allowed to attend sessions to which they were not assigned.
Our interest is in the intention-to-treat (ITT) effects in subpopulations that have similar compliance behavior. Let ω(r, a_0, a_1) = E(Y_r | A_0 = a_0, A_1 = a_1) = E(Y | R = r, A_0 = a_0, A_1 = a_1). We are interested in comparing the quantities ω(1, a_0, a_1) and ω(0, a_0, a_1). Such comparisons have been called principal effects [7]. For example, if compliance were the proportion of assigned dose actually received, investigators might be interested in the comparison of ω(1, 1, 1) and ω(0, 1, 1), which is the causal effect of assignment to the new treatment compared with the standard treatment, among perfect compliers.
Even if both A_0 and A_1 were observed for all subjects, we would not be able to identify ω(r, a_0, a_1) non-parametrically. Notice that this differs from the binary (yes/no) compliance case, where the principal effects could be identified if the potential compliance variables were known. With continuous measures of compliance, additional structure is needed to identify ω(r, a_0, a_1). One possibility would be to estimate the causal effects only within regions of the compliance space. Alternatively, a fully parametric (e.g., linear) or semi-parametric (e.g., smoothing spline) model could be specified.
In order to identify the causal parameters, we propose the use of a structural model for the principal effects. We assume the data Y_i given A_i0 and A_i1 are from an exponential family distribution, where g(·) denotes the link function and η the linear predictor. For example, if Y is binary, one might specify a logistic model ω(r, a_0, a_1; β) = logit^{-1}(β_0 + β_1 a_0 + β_2 a_1 + β_3 r a_0 + β_4 r a_1). By the exclusion restriction, the model should be specified so that ω(1, 0, 0) = ω(0, 0, 0).
The joint distribution of the potential compliance variables is specified through a Gaussian copula:

    F_{A_0,A_1}(a_0, a_1) = Φ_2[ Φ_1^{-1}{F_{A_0}(a_0)}, Φ_1^{-1}{F_{A_1}(a_1)} ],

where Φ_1 is the univariate standard normal CDF and Φ_2 is the bivariate normal CDF with mean (0, 0)^T, variance (1, 1)^T, and correlation ρ. Essentially, this implies that the joint distribution f(z_0, z_1) is bivariate normal with correlation ρ, where z_0 = Φ_1^{-1}{F_{A_0}(a_0)} and z_1 = Φ_1^{-1}{F_{A_1}(a_1)}. Therefore, the joint density f(a_0, a_1) follows from the change of variables z_r = Φ_1^{-1}{F_{A_r}(a_r)}, r = 0, 1.
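To fix ideas, the following R sketch draws (a_0, a_1) pairs from this Gaussian copula with beta marginals; the shape parameters and sample size are placeholders rather than estimates from the CTQ data.

    # Simulate (a0, a1) from a Gaussian copula with beta marginal distributions
    set.seed(1)
    n   <- 500
    rho <- 0.5
    z   <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
    a0  <- qbeta(pnorm(z[, 1]), shape1 = 0.8, shape2 = 0.6)   # F_{A0}^{-1}{Phi_1(z0)}
    a1  <- qbeta(pnorm(z[, 2]), shape1 = 1.5, shape2 = 0.7)   # F_{A1}^{-1}{Phi_1(z1)}
    plot(a0, a1)                                              # compare with Fig. 10.2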
3.1 Likelihood
Without loss of generality, suppose the first n_0 subjects are in group R = 0 and the next n_1 subjects are in group R = 1 (n = n_0 + n_1). The likelihood function involves integrating out missing data from the complete data likelihood. The likelihood contribution for a subject in group R = 0 can be written

    L_i0(β, μ_0, μ_1, α_0, α_1; ρ) = ∫ f(y_i0 | a_i0, a_i1; β) f(a_i0, a_i1; μ_0, μ_1, α_0, α_1, ρ) da_i1,   (10.2)

where the distributions f(y_i0 | a_i0, a_i1; β) and f(a_i0, a_i1; μ_0, μ_1, α_0, α_1, ρ) were defined previously. Define L_i1(β, μ_0, μ_1, α_0, α_1; ρ) similarly for subjects in arm R = 1, except that there A_i0 is integrated out of the likelihood. The loglikelihood is therefore

    log L(β, μ_0, μ_1, α_0, α_1; ρ) = Σ_{i=1}^{n_0} log L_i0(β, μ_0, μ_1, α_0, α_1; ρ) + Σ_{i=n_0+1}^{n} log L_i1(β, μ_0, μ_1, α_0, α_1; ρ).
Recall that [Φ_1^{-1}{F_{A_i0}(a_i0)}, Φ_1^{-1}{F_{A_i1}(a_i1)}] follows a bivariate normal distribution. The right-hand side of (10.2) can therefore be written as

    ∫_{-∞}^{∞} f( y_i0 | a_i0, a_i1 = F_{A_i1}^{-1}{Φ_1(z_i1)}; β ) f(a_i0) f(z_i1 | z_i0; ρ) dz_i1,   (10.3)

where z_i0 = Φ_1^{-1}{F_{A_i0}(a_i0)}. The distribution of [z_i0] is standard normal and [z_i1 | z_i0; ρ] ~ N(ρ z_i0, 1 − ρ²). We can then approximate (10.3) using Gauss–Hermite quadrature as follows:

    L̃_i0(β, μ̂_0, μ̂_1, α̂_0, α̂_1; ρ) = Σ_{j=1}^{J} f( y_i0 | a_i0, a_i1 = F_{A_i1}^{-1}{Φ_1(z_i1^j)}; β ) f(a_i0) w_j,

where (z_i1^j − ρ z_i0)/√(1 − ρ²) is the jth of J nodes from a standard normal distribution and w_j is the corresponding weight. Typically, J = 10 points provides sufficient accuracy. We define L̃_i1(β, μ̂_0, μ̂_1, α̂_0, α̂_1; ρ) analogously. Our approximation of the loglikelihood is

    l̃(β, μ̂_0, μ̂_1, α̂_0, α̂_1; ρ) = Σ_{i=1}^{n_0} log L̃_i0(β, μ̂_0, μ̂_1, α̂_0, α̂_1; ρ) + Σ_{i=n_0+1}^{n} log L̃_i1(β, μ̂_0, μ̂_1, α̂_0, α̂_1; ρ).
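As a small illustration of this quadrature step, the R sketch below approximates the likelihood contribution in (10.2)-(10.3) for a single control-arm subject with a binary outcome. The logistic mean model, the beta shape parameters, and the argument values are placeholders, and the standard-normal nodes and weights are assumed to come from the statmod package.

    # Gauss-Hermite approximation of one control-arm likelihood contribution, cf. Eq. (10.3)
    library(statmod)
    gq <- gauss.quad.prob(10, dist = "normal")        # nodes/weights for a standard normal
    lik_i0 <- function(y0, a0, beta, sh1, sh2, rho) {
      z0 <- qnorm(pbeta(a0, sh1[1], sh2[1]))          # z_i0 = Phi_1^{-1}{F_{A0}(a0)}
      z1 <- rho * z0 + sqrt(1 - rho^2) * gq$nodes     # nodes for [z_i1 | z_i0] ~ N(rho z0, 1 - rho^2)
      a1 <- qbeta(pnorm(z1), sh1[2], sh2[2])          # back-transform to the a1 scale
      p  <- plogis(beta[1] + beta[2] * a0 + beta[3] * a1)   # placeholder outcome mean model
      sum(dbinom(y0, 1, p) * gq$weights) * dbeta(a0, sh1[1], sh2[1])
    }
    lik_i0(y0 = 1, a0 = 0.8, beta = c(-1, 1, 1), sh1 = c(0.8, 1.5), sh2 = c(0.6, 0.7), rho = 0.5)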
The ITT effect for a particular combination of the compliance variables might
be of limited interest, primarily because very few subjects would have poten-
tial compliance equal to those two values. Therefore, researchers might also be
interested in causal effects within certain regions defined by a range of values of
the two compliance variables. Consider the situation where A_0 and A_1 represent the proportion of assigned treatment actually taken (so that 0 represents non-compliance and 1 represents perfect compliance). Suppose we would like to estimate E(Y_1 − Y_0 | A_0, A_1 ∈ Ω), where Ω is some region of [0, 1]². Consider the following examples. The region Ω = {A_0, A_1 ∈ [0.7, 1], |A_1 − A_0| < 0.2} includes subjects who would be at least 70 % compliant with either treatment and whose compliance level would not differ by more than 20 % between the two arms. This region would be of interest if investigators wanted to know the ITT effect among highly compliant subjects. The region Ω = {A_0 > 0.7, A_1 < 0.3} would include subjects who would be highly compliant with treatment R = 0 but poorly compliant with treatment R = 1. In the CTQ example, this would include subjects who seemed to prefer wellness to exercise. Finally, one could consider Ω = {A_1 > 0.8}. In
our example, this would include subjects who would be at least 80 % compliant
with exercise, regardless of how compliant they would be with wellness. One could
imagine many regions that might be of interest. Once the model parameters are
estimated, causal effects within a region can then be estimated in a separate step.
We next provide the computational details for a specific example.
In general, the causal effect is

    E(Y_1 − Y_0 | a_0, a_1 ∈ Ω) = lim_{n→∞} (1/n) Σ_i E(Y_i1 − Y_i0 | a_i0, a_i1 ∈ Ω)
                                ≈ (1/n) Σ_i E(Y_i1 − Y_i0 | a_i0, a_i1 ∈ Ω),

and, for the region Ω = {a_0, a_1 ∈ [0.7, 1], |a_1 − a_0| < 0.2}, the upper integration limit on the transformed scale is

    u_0(z_i1) = min[ Φ_1^{-1}{F_{A_i0}(1)}, Φ_1^{-1}( F_{A_i0}{ F_{A_i1}^{-1}(Φ_1(z_i1)) + 0.2 } ) ].

In the above expressions, parameters are replaced by their MLEs, e.g., F_{A_i0}(0.7) = F_{A_i0}(0.7; μ̂_0, α̂_0).
The joint distribution f(z_i0, z_i1) is bivariate normal with mean 0, variance 1, and correlation ρ. This joint distribution can also be written as f(z_i0 | z_i1) f(z_i1), where [z_i0 | z_i1] ~ N(ρ z_i1, 1 − ρ²) and [z_i1] ~ N(0, 1). Thus, we can apply Gaussian quadrature to approximate the integral in (10.4). In particular, (10.4) can be approximated by

    Σ_{k=1}^{K} Σ_{j=1}^{J} Δ[ F_{A_i0}^{-1}{Φ_1(z_i0^j)}, F_{A_i1}^{-1}{Φ_1(z_i1^k)}; β̂ ] I[ z_i0^j ∈ (l_0(z_i1^k), u_0(z_i1^k)), z_i1^k ∈ (l_1, u_1) ] w_0^j w_1^k,

where Δ(a_0, a_1; β̂) = ω(1, a_0, a_1; β̂) − ω(0, a_0, a_1; β̂), I(·) is the indicator function, z_i1^k and (z_i0^j − ρ z_i1^k)/√(1 − ρ²) are nodes from a standard normal distribution, w_1^k and w_0^j are the corresponding weights, and l_0(·), l_1, and u_1 are the remaining integration limits obtained by transforming the boundaries of Ω. The totals J and K are selected to ensure an adequate (e.g., 10) number of valid nodes.
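A simple Monte Carlo alternative to the quadrature above is to draw many (a_0, a_1) pairs from the fitted copula model and average the fitted contrast over the region of interest; the sketch below reuses the placeholder parameters from the earlier snippets and is not the implementation used in the chapter.

    # Monte Carlo approximation of E(Y1 - Y0 | (a0, a1) in Omega) under placeholder parameters
    set.seed(2)
    rho <- 0.5
    z   <- MASS::mvrnorm(1e5, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
    a0  <- qbeta(pnorm(z[, 1]), 0.8, 0.6)
    a1  <- qbeta(pnorm(z[, 2]), 1.5, 0.7)
    omega.fun <- function(r, a0, a1) plogis(-1 + a0 + a1 + 0.5 * r * a0 + 0.5 * r * a1)  # placeholder model
    in.region <- a0 >= 0.7 & a1 >= 0.7 & abs(a1 - a0) < 0.2   # Omega = {a0, a1 in [0.7, 1], |a1 - a0| < 0.2}
    mean((omega.fun(1, a0, a1) - omega.fun(0, a0, a1))[in.region])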
4.1 Data
The CTQ study [1] was a randomized controlled trial designed to assess the efficacy
of supervised vigorous exercise as an adjuvant to cognitive behavioral therapy
(CBT) for promotion of smoking cessation among women. The study enrolled and
assigned 134 women to receive CBT plus vigorous exercise (the new treatment)
and 147 to receive CBT plus a wellness education program (the control treatment).
CBT represents the standard of care for smoking cessation; the wellness education
was added to the control arm to equalize staff contact time between the two arms.
The CBT program was administered to all women in group format weekly over the
course of 12 weeks. The exercise program was supervised, and individually tailored
to each woman based on achieving a target heart rate. Women in the control arm
participated in a program of supervised lectures, films, and discussions. Both the
wellness and exercise interventions were held three times per week. None of the
women in the control arm had access to the supervised exercise program, and none
in the exercise group had access to wellness classes.
We next describe the specific models that were fitted to the CTQ data.
Compliance Model Because A_r was a proportion (approximately continuous, bounded between 0 and 1), we specified beta distributions for the compliance variables A_r. To assess the fit of the models, we plotted the empirical probability density function (PDF) and the model-based PDF. Specifically, the empirical PDF was obtained as a histogram of A_0 and A_1, using ten bins of size 0.1 each. The model-based PDF was based on the assumption that A_r follows a beta distribution with parameters μ_r and α_r estimated by maximizing the likelihood. This model-based estimated PDF was smoothed over the distribution of a_r. The empirical and model-based PDFs are plotted in Fig. 10.1 for ρ = 0.1 (the figure looks the same for other values of ρ, as ρ affects the joint distribution but not the marginals). The marginal PDFs estimated from the model appear to capture the key features of the data. For example, in the wellness arm, the density appears to decrease initially, then increase as A_0 approaches 1. The density of A_1 appears to be an increasing function.
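As a sketch of this marginal model check, one can fit a beta distribution to the compliance values in one arm by maximum likelihood and overlay the fitted density on the histogram; the snippet below uses the standard (shape1, shape2) parameterization of dbeta and simulated stand-in data rather than the CTQ compliance values.

    # ML fit of a beta distribution to compliance and overlay on the empirical histogram
    a.obs  <- rbeta(150, 1.5, 0.7)                        # stand-in for observed compliance in one arm
    negll  <- function(par) -sum(dbeta(a.obs, exp(par[1]), exp(par[2]), log = TRUE))
    fit    <- optim(c(0, 0), negll)                       # log-scale parameters keep the shapes positive
    shapes <- exp(fit$par)
    hist(a.obs, breaks = seq(0, 1, by = 0.1), freq = FALSE, xlab = "a", main = "")
    curve(dbeta(x, shapes[1], shapes[2]), add = TRUE, lty = 2)   # compare with Fig. 10.1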
Principal Effects To model the causal effects ω(r, a_0, a_1) in Sect. 2.2, we assumed the distribution f(Y_r | A_0, A_1; β) was Bernoulli with mean ω(r, A_0, A_1; β) specified by a logistic model, for r = 0, 1. Recall that Y_r is the indicator that a subject assigned treatment r would abstain from smoking during the final 8 weeks of the trial. The exclusion restriction implies that randomization should have no impact among non-compliers (those with A_0 = A_1 = 0), and thus β_00 = β_01. We believe this assumption is plausible for the CTQ study. However, as part of a sensitivity analysis we allow the two intercepts to vary. We define δ as the ratio of risks among subjects who would be non-compliant with either intervention, i.e.,
    δ = P(Y_1 = 1 | A_0 = 0, A_1 = 0) / P(Y_0 = 1 | A_0 = 0, A_1 = 0).

[Fig. 10.1 Empirical PDF (histogram) and model-based estimated PDF (dashed line) of compliance (a_0 and a_1) from the Commit to Quit trial]
We can write β_01 as a function of β_00 and δ: β_01 = log[ δ e^{β_00} / (1 + e^{β_00} − δ e^{β_00}) ]. We propose to estimate β_00 and fix the value of δ, which determines the value of β_01. The exclusion restriction implies δ = 1. We also consider δ equal to 1.2 and 1/1.2. For example, δ = 1.2 implies that, among subjects who would not comply with either intervention, the risk of the outcome is 20 % greater if randomized to R = 1 (exercise).
The model also relies on the assumption that the effects of A0 and A1 are
linear and additive on the logit scale. Non-linear terms or interactions could also
be specified, but we found that inference was relatively unaffected by these added
complexities.
We fitted the models at values of ρ equal to 0.1, 0.5, and 0.9. These represent three values within the range of plausible values of ρ. Recall that ρ is the correlation between transformed values of A_0 and A_1. Independence between A_0 and A_1 would occur if ρ = 0. We believe it is unlikely that A_0 and A_1 are independent, as, for example, a subject might miss a visit for personal reasons that are unrelated to the treatment itself. We also believe that negative correlation is unlikely. We therefore focus on positive values of ρ, while acknowledging that we cannot rule out zero or negative values.
4.3 Results
[Fig. 10.2 Plots of 500 simulated values of a_0 and a_1 from a copula model with beta marginal distributions estimated from the Commit to Quit trial and correlation (a) ρ = 0.1 and (b) ρ = 0.9]
When ρ is large, there are regions in the graph with little to no data. For example, inference about the population of people who would be highly compliant with exercise and poorly compliant if assigned to wellness would be based on extrapolation if ρ is 0.9.
Causal Effects We next consider parameters from the causal model. First, we focus on principal effects in subpopulations that would have the same compliance under either treatment, i.e., Δ_1 = ω(1, a, a) − ω(0, a, a), where a_0 = a_1 = a. We estimated the effects by plugging the MLEs of β into the formula

    logit^{-1}{β_01 + (β_11 + β_12) a} − logit^{-1}{β_00 + (β_01 + β_02) a}.

In Table 10.2 we present the estimated principal effects Δ̂_1 and their standard errors at compliance levels of 0.7 and above when ρ = 0.1, 0.5, and 0.9 and δ = 0.83, 1, and 1.2. For all values of ρ, the causal effects increased as the compliance level increased. There were no prominent causal effects (effects less than 0.01) for compliance levels below 0.6 (results not displayed). Estimated causal effects were greater than 0.1 for compliance levels of 0.9 or above. While the point estimates suggest a benefit from exercise when compliance is high, the evidence was only strong (estimate about twice as large as its SE) when ρ = 0.9. Causal effect estimation at the compliance levels that we focused on was insensitive to variations in δ within the range of values that we considered (0.83–1.2).
Causal Effects in Compliance Regions Finally, we consider estimation of Δ_2 = E(Y_1 − Y_0 | a_0, a_1 ∈ Ω), where Ω = {a_0, a_1 ∈ [0.7, 1], |a_1 − a_0| < 0.2}, as discussed in Sect. 3.3. This is the estimated effect of being randomized to exercise compared with wellness, among the group of people who would attend a similar number of classes in either arm (differing by no more than 20 %) and would attend most (at least 70 %) of their assigned classes. The estimates and SEs are given in Table 10.2. For this group of subjects, those in the exercise arm appeared more likely to quit than those in the wellness arm. For ρ = 0.9, the estimated difference in quit rates between treatment arms was about 0.20, with p-values of about 0.05. Thus, for the group that would be highly compliant with either arm, there was moderate evidence of a benefit of exercise.
5 Discussion
We have developed methods that are designed to estimate the principal effects
in clinical trials, such as smoking cessation trials, in which subjects have access
to only one of two active treatments and the compliance variable is continuous
(and possibly bounded). The joint distribution of the observed and counterfactual
compliance is specified by linking the two marginal compliance distributions
utilizing a Gaussian copula with a sensitivity correlation parameter. At each of
the value of the correlation parameter, we obtain the MLEs of all parameters and
estimate the causal effects at a particular combination of the compliance variables
or within certain compliance regions in subpopulations that have similar compliance
behavior. In the smoking cessation analysis, the exercise arm appeared to have lower
quit rates among subjects that would be highly compliant with either intervention.
The two-stage ML approach is relatively easy to implement. However, we found
that for small sample sizes the optimization algorithm can be unstable. A fully
Bayesian approach is a viable alternative that could potentially resolve some of the
convergence problems by helping to identify parameters using informative priors.
Our approach relies heavily on the structural form of the outcome model. In particular, we specify a parametric model for ω(r, a_0, a_1), the mean of the potential outcome given the potential compliance variables. In principle, it would not be difficult to extend our approach to the semiparametric setting, where, for example, ω(r, a_0, a_1) could be modeled using bivariate smoothing via penalized splines.
However, because one of the compliance variables is always missing, too much
flexibility in the model can lead to identifiability problems.
Another related issue is that when ρ is close to 1, a_0 and a_1 contain about the same
information. In that case, it might be sufficient to just include one of the compliance
variables in the model, which should improve computational stability. This suggests
that the form of .r; a0 ; a1 / could depend on . These issues, among others, are in
need of additional research.
References
1. Marcus, B.H., Albrecht, A.E., King, T.K., Parisi, A.F., Pinto, B.M., Roberts, M., Niaura, R.,
Abrams, D.B.: The efficacy of exercise as an aid for smoking cessation in women. Arch. Intern.
Med. 159, 1229–1234 (1999)
2. Marcus, B.H., Lewis, B.A., King, T.K., Albrecht, A.E., Hogan, J., Bock, B., Parisi, A.F.,
Abrams, D.B.: Rationale, design and baseline data for commit to quit II: an evaluation of the
efficacy of moderate-intensity physical activity as an aid to smoking cessation in women. Prev.
Med. 36, 479–492 (2003)
3. Marcus, B.H., Lewis, B.A., Hogan, J., King, T.K., Albrecht, A.E., Bock, B., Parisi, A.F.,
Niaura, R., Abrams, D.B.: The efficacy of moderate-intensity exercise as an aid for smoking
cessation in women: a randomized controlled trial. Nicotine Tob. Res. 7, 871–80 (2005)
4. Robins, J.M.: Correcting for non-compliance in randomized trials using structural nested mean
models. Commun. Stat. Theory Methods 23, 2379–2412 (1994)
5. Robins, J.M., Rotnitzky, A.: Estimation of treatment effects in randomised trials with non-
compliance and a dichotomous outcome using structural mean models. Biometrika 91,
763–783 (2005)
6. Vansteelandt, S., Goetghebeur, E.: Causal inference with generalized structural mean models.
J. R. Stat. Soc. Ser. B. 65, 817–835 (2003)
7. Frangakis, C., Rubin, D.: Principal stratification in causal inference. Biometrics 58, 21–29
(2002)
8. Efron, B., Feldman, D.: Compliance as an explanatory variable in clinical trials. J. Am. Stat.
Assoc. 86, 9–17 (1991)
9. Jin, H., Rubin, D.B.: Principal stratification for causal inference with extended partial
compliance. J. Am. Stat. Assoc. 103, 101–111 (2008)
10. Frees, E.W., Valdez, E.A.: Understanding relationships using copulas. N. Am. Actuarial J. 2,
1–25 (1998)
11. Bartolucci, F., Grilli, L.: Modeling partial compliance through copulas in a principal stratifica-
tion framework. J. Am. Stat. Assoc. 106, 469–479 (2011)
12. Plackett, R.L.: A class of bivariate distributions. J. Am. Stat. Assoc. 60, 516–522 (1965)
13. Ma, Y., Roy, J., Marcus, B.: Causal models for randomized trials with two active treatments
and continuous compliance. Stat. Med. 30, 2349–2362 (2011)
14. Dominici, F., Zeger, S.L., Parmigiani, G., Katz, J., Christian, P.: Estimating percentile-specific
effects in counterfactual models: a case study of micronutrient supplementation, birth weight,
and infant mortality. J. R. Stat. Soc. C 55, 261–280 (2006)
15. Angrist, J., Imbens, G., Rubin, D.: Identification of causal effects using instrumental variables.
J. Am. Stat. Assoc. 91, 444–455 (1996)
16. Roy, J., Hogan, J.W., Marcus, B.W.: Principal stratification with predictors of compliance for
randomized trials with 2 active treatments. Biostatistics 9, 277–289 (2008)
17. Cheng, J., Small, D.S.: Bounds on causal effects in three-arm trials with non-compliance. J. R.
Stat. Soc. B 68, 815–836 (2006)
18. Ferrari, S.L.P., Cribari-Neto, F.: Beta regression for modeling rates and proportions. J. Appl.
Stat. 31, 799–815 (2004)
Chapter 11
Causal Ensembles for Evaluating the Effect
of Delayed Switch to Second-Line
Antiretroviral Regimens
1 Introduction
In current clinical practice, HIV-1 infected patients are treated through a sequence
of combined antiretroviral therapies (cART). Although HIV-1 is a viral agent and
causes acquired immunodeficiency syndrome (AIDS), modern treatment successes
and the lack of a cure suggest similarities in treatment of HIV-1 infection and a
chronic disease. The primary goal of cART is to reduce viremia below a limit of
detection, but providers also make treatment decisions to help patients manage
adverse side effects and opportunistic infections. For a variety of reasons, including
co-morbidities, genetic mutations, and poor adherence, patients eventually fail their
current cART and move to the next-in-line cART. Then, similar to individuals who live with chronic diseases, HIV-1 infected individuals transition from treatment regimen to regimen, as necessary, until all treatment options have been exhausted or until death.
Despite all that the scientific community has learned about treating HIV-1
infection and AIDS over the last four decades, there is still much that is unknown.
In particular, there are many open questions about the timing of treatment decisions that lead to better patient outcomes given a patient's medical and treatment
history. Collecting scientific evidence to identify better treatment decisions is
difficult because of patient heterogeneity and because designing and enrolling
patients in randomized controlled trials to investigate these scientific questions
is challenging [25]. In the absence of controlled clinical trials, investigators may
conduct secondary analyses of existing databases in an attempt to address the same
questions. However, in secondary analyses of observational data, there is often
confounding between treatment and outcome and this issue must be addressed
statistically. For HIV-1 infected patients who move from failing cARTs to new cARTs,
the reasons for switching cARTs can be extremely important and are expected to be
related to clinical outcome. In addition to controlling for confounding, there may
be other features of the database that are tangential to the scientific question of
interest but must nevertheless be addressed by the data analyst. Therefore, a fair and
objective evaluation of treatment decisions in a sequence of cARTs is challenging
for several reasons but important for public health and the infected individual’s
quality of life.
Our methods are motivated by data from the AIDS Clinical Trials Group
(ACTG) Study A5095, a controlled clinical trial designed to compare two efavirenz-
containing regimens and a triple nucleoside regimen [15]. After patients failed
their initial cARTs, patients were allowed to switch to second-line cARTs and then
followed per study protocol. We are interested in assessing whether it is clinically
beneficial to delay switching to second-line cART post-virologic failure or whether
patients ought to switch as soon as possible. Because switching from a failing initial
cART may depend on the initial cART, Li et al. [20] argued that it may be prudent to
analyze the ACTG A5095 data as a two-stage design problem [21, 31], where “two-
stage” refers to treatment assignment/decision at two points in time. Here, in our
early versus late two-stage framework, patients are randomly assigned to one of two
treatment arms at the first stage. Then, if a patient fails the initial cART, the patient
decides to switch cART immediately or delay switch at the second stage. The two
levels for each of initial cART and delay versus immediate switch comprise the four
treatment combinations. A key feature of the statistical framework for two-stage methods is the introduction of a Bernoulli indicator for eligibility for second-stage randomization [21], which makes evident the connection to intent-to-treat inference in a randomized controlled trial.
2 Methods
Because of the close connection between missing data problems and causal
inference [28], it will be instructive to develop methods in both contexts. Methods
for estimating E(Y) when some outcomes Y are missing are described in Sect. 2.1,
while methods for two-stage designs are described in Sect. 2.2.
Without loss of generality, assume that the full data are {(X_1, Y_1), ..., (X_n, Y_n)}, i.i.d. pairs from the distribution of (X_1, Y_1), and that the scientific interest is to estimate μ = E(Y_1). However, some outcomes Y are missing, and the missingness mechanism depends on X but not on Y. The observed data are {(X_i, δ_i, δ_i Y_i), i = 1, ..., n}, where δ_i = 1 if Y_i is observed and δ_i = 0 otherwise. Two standard estimators of μ are the inverse probability weighted (IPW) and outcome regression (OR) estimators,

    μ̂_IPW = (1/n) Σ_{i=1}^{n} δ_i Y_i / π̂(X_i),    μ̂_OR = (1/n) Σ_{i=1}^{n} f̂(X_i),

where π̂(·) and f̂(·) are consistent estimators of π(·) = P(δ = 1 | X = ·) and f(·) = E(Y | X = ·). In many applications, π(X) and f(X) are assumed to be simple parametric functions of the covariates X; for example, π(X) and f(X) are fitted quantities from a logistic regression model for
π(X) and a linear model for f(X). Then, one can show that μ̂_IPW and μ̂_OR are unbiased for μ if π(X) and f(X) are correctly specified. Using semiparametric theory, Robins et al. [26] proposed a doubly robust estimator which is consistent for μ if either π(X) or f(X) is correctly specified. Also, the doubly robust estimator is semiparametric efficient if both π(X) and f(X) are correctly specified. However, assuming that π(X) is correctly specified, authors subsequently showed that a consistent, doubly robust estimator can be rather imprecise when f(X) is misspecified, even though it is semiparametric efficient when f(X) is correctly modeled. To overcome this shortcoming, Robins et al. [27] proposed an estimator that aims to minimize the variance of the doubly robust estimating function when f(X) is misspecified; however, the resulting estimator is no longer doubly robust. Tan [32] proposed a constrained maximum likelihood estimator that is doubly robust, semiparametric efficient when π(X) and f(X) are correctly modeled, and minimum variance when π(X) is correctly modeled but f(X) may be misspecified. All three of these estimators may be expressed in the form

    μ̂ = μ̂_IPW − (1/n) Σ_{i=1}^{n} [ {δ_i − π̂(X_i)} / π̂(X_i) ] f̂_k(X_i),   (11.1)

where f̂_k(·) denotes the fitted regression function used by estimator k. The outcome regression estimator can also be written as

    μ̂_OR = (1/n) Σ_{i=1}^{n} { δ_i f̂(X_i) + (1 − δ_i) f̂(X_i) } = (1/n) Σ_{i=1}^{n} { δ_i Y_i + (1 − δ_i) f̂(X_i) }.   (11.2)
So, a completely non-parametric estimator for μ can be defined through (11.2) with a non-parametric regression estimator f̂(·) for f(·). When there is only one covariate X, one can simply use the Nadaraya–Watson [23] estimator [30]; for example, see [6]. However, as the dimension of X increases, the curse of dimensionality precludes any simple extension of kernel regression, and one must typically impose more structure on the data to propose practical, interpretable solutions. Some nonparametric regression methods include local polynomial regression [8], generalized additive models [1, 17], and smoothing splines [14]. See [30] for a review of these methods. Alternatively, one can use statistical methods that have little or no interpretability, and this is the approach we take here.
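For concreteness, a minimal R sketch of μ̂_IPW, μ̂_OR, and a doubly robust estimator of the form (11.1), using a logistic working model for π(X) and a linear working model for f(X) on simulated data (not the ACTG data), is given below.

    # IPW, OR, and augmented (doubly robust) estimators of mu = E(Y) with Y missing at random
    set.seed(3)
    n      <- 1000
    X      <- rnorm(n)
    Y      <- 1 + 2 * X + rnorm(n)
    delta  <- rbinom(n, 1, plogis(0.5 + X))                   # delta = 1 if Y is observed
    pi.hat <- fitted(glm(delta ~ X, family = binomial))       # working model for pi(X)
    f.hat  <- predict(lm(Y ~ X, subset = delta == 1), newdata = data.frame(X = X))  # working model for f(X)
    mu.ipw <- mean(delta * Y / pi.hat)
    mu.or  <- mean(f.hat)
    mu.dr  <- mu.ipw - mean((delta - pi.hat) / pi.hat * f.hat)  # augmented form of Eq. (11.1)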
In this note, we use blackbox statistical learners to construct a prediction model f̂(·) and subsequently define our estimator μ̂_OR in (11.2). "Blackbox" is a generic umbrella term used to describe classification algorithms in artificial intelligence, computer science, engineering, machine learning, mathematics, and statistics that aim to separate a vector of class labels (n × 1) using a data input matrix X (n × p). Authors have noted that, in most if not all cases, the same algorithms can be used to construct predictions for a continuous response Y by simply modifying the loss function. There are many overlapping names and methods associated with these algorithms, including ensemble methods, aggregation, bagging, and boosting.
There exists a massive literature on these topics and we refer the interested reader
elsewhere for a review (e.g., [18]).
Our estimator of μ is based on boosted regression trees (e.g., [4]), one of many blackbox statistical learners that scale up easily to handle large-dimensional covariates X under mild restrictions on the data [9, 10]; the algorithm is tersely outlined in the Appendix. There is some empirical evidence to suggest that boosting offers improvements in misclassification rates compared to bagging [2], but such a comparison is beyond the scope of this note. In short, our estimator is constructed in the following three steps. First, the ensemble prediction f̂(·) is built with the complete data {(X_i, δ_i Y_i, δ_i) : δ_i = 1}. Second, predictions are computed for the observations with missing outcomes, where δ_i = 0. Finally, the estimator is defined as

    μ̂_New = (1/n) Σ_{i=1}^{n} { δ_i Y_i + (1 − δ_i) f̂(X_i) }.
We use the blackboost function from the mboost package in R, fivefold cross-validation to determine the stopping iteration, and all other default settings. We adopt the nonparametric bootstrap to estimate var(μ̂_New), as outlined in Sect. 2.3.
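A minimal sketch of these three steps with blackboost on simulated data (illustrative variable names, default tree settings) might look as follows.

    # Boosted-tree imputation estimator mu.hat_New of Sect. 2.1 (simulated illustration)
    library(mboost)
    set.seed(4)
    n     <- 500
    X     <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    Y     <- 1 + X$x1 - X$x2 + rnorm(n)
    delta <- rbinom(n, 1, plogis(X$x1))                        # missingness depends on X only
    dc    <- data.frame(Y = Y, X)[delta == 1, ]                # complete cases
    fit   <- blackboost(Y ~ x1 + x2, data = dc)                # step 1: build the ensemble
    cvr   <- cvrisk(fit, folds = cv(model.weights(fit), type = "kfold", B = 5))
    fit   <- fit[mstop(cvr)]                                   # fivefold CV stopping iteration
    f.hat <- as.numeric(predict(fit, newdata = X))             # step 2: predictions for all subjects
    mu.new <- mean(delta * Y + (1 - delta) * f.hat)            # step 3: the estimator mu.hat_New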
Following the framework in [21], let Y_ab be the potential outcome if a patient followed a treatment policy (A = a, B = b), where A and B are the first- and second-stage treatment random variables, respectively. In our problem, A is a binary random variable and represents the initial cART in ACTG A5095: A = 0 denotes the triple nucleoside regimen and A = 1 represents the combined efavirenz-containing regimens. The second-stage treatment B is also binary and denotes switching early or late to a second-line regimen after confirmed virologic failure on the initial regimen. Based on discussions with our collaborators, we define an early switch to a second-line regimen as less than 8 weeks after confirmed virologic failure. Hence, the four potential outcomes for this simple design are (Y_ab, a, b = 0, 1), and the goal is to estimate E(Y_ab). If we can derive a consistent estimator for E(Y_ab), then we can extend those statistics to draw inference on the expected difference in potential outcomes E(Y_a1 − Y_a0) to assess the value of switching to a second-line regimen within 8 weeks of virologic failure on the initial regimen. This is a summary of the introductory arguments given in [20].
If the observed data consisted of i.i.d. copies of (Y, A, B, X), where treatment assignment to (a, b) depended on X only, then, under suitable regularity conditions, unbiased estimators for E(Y_ab) could be derived using the usual arguments from causal inference. However, the observed data are more complex. A challenge in the analysis of data from two-stage designs is the possibility that not all patients fail their first-line regimen and, hence, do not switch to a second-line regimen. Lunceford et al. [21] refer to this random variable as an indicator of eligibility for second-stage treatment assignment and include it as part of the definition of the treatment policy; see also [20, p. 543]. Define Γ as the binary random variable indicating whether patients fail the initial regimen and are therefore eligible to switch to the second-line regimen. Scientifically, we have no a priori interest in Γ. In an intent-to-treat analysis of data from a randomized clinical trial where we knew (A, B) at baseline, the indicator Γ would be ignored. In observational data, we do not know (A, B) for a randomly selected patient from the population. Instead, we observe (A, B) when Γ = 1 and only observe A when Γ = 0. Nevertheless, the intent-to-treat estimand is still the parameter of interest. This makes the analysis of data from two-stage designs different and interesting.
In our framework, the observed data are more complex than in Sect. 2.1; Li et al. [20] describe the data structure and an unbiased inverse probability weighted estimator for E(Y_ab). Along the lines described in Sect. 2.1, Li et al. [20] showed that a general family of augmented inverse probability weighted estimators for E(Y_a1) in a two-stage design is

    μ̂_a1 = μ̂_a1,IPW − (1/n) Σ_{i=1}^{n} Γ_i [ {B_i − π_a(X_i)} / π_a(X_i) ] f̂_a,k(X_i),   (11.3)

where π_a(X) is a model for the probability of an early switch given failure on initial regimen a and covariates X. We now follow the outcome regression approach of Sect. 2.1 and consider the role of the eligibility random variable Γ. Using standard arguments, one can show that the outcome regression estimator for E(Y_a1) is

    μ̂_a1,New = (1/n) Σ_{i=1}^{n} I(A_i = a) [ (1 − Γ_i) Y_i + Γ_i { B_i Y_i + (1 − B_i) f̂_a1(X_i) } ].   (11.4)

The corresponding estimator for E(Y_a0) is μ̂_a0,New, given by the expression in (11.4) except that B_i and (1 − B_i) in the curly brackets are reversed and f_a0(X_i) = E(Y | A_i = a, Γ_i = 1, B_i = 0, X_i) replaces f_a1(X_i). As in Sect. 2.1, the estimator μ̂_a1,New is computed in three steps: build the blackbox f̂_a1(·) using the complete data, calculate the predictions for subjects where B_i = 0, and compute the estimate. The stopping iteration in the blackbox algorithm is chosen via fivefold cross-validation, minimizing an L2 loss function.
To estimate var(μ̂_New) for missing data or var(μ̂_ab,New) in the two-stage design, we used a nonparametric bootstrap procedure; more details can be found in [7]. Because missing data are imputed in our estimators, we follow the recommendation of Shao and Sitter [29], who argue that ". . . the bootstrap data set should also be imputed in the same way as the original data set was imputed." This is in contrast to some authors who have suggested that the imputed data should be regarded as truth, which clearly underestimates the sampling variance. Therefore, in the case of var(μ̂_New), we (1) draw a simple random sample of size n with replacement from the original data set, then (2) compute the bootstrap estimate μ̂_New,b using the bth resampled data set. We repeat the process B times and then take the sample variance of {μ̂_New,1, ..., μ̂_New,B}. We estimate var(μ̂_ab,New) in the two-stage design using an identical resampling plan.
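Continuing in the same spirit, the bootstrap described here (re-imputing within every resample) can be sketched in R as follows, with mu_new_fun standing for any routine that refits the chosen estimator from scratch on a resampled data set.

    # Nonparametric bootstrap of var(mu.hat_New): re-impute within each resampled data set
    boot_var <- function(dat, mu_new_fun, B = 200) {
      est <- replicate(B, {
        idx <- sample(nrow(dat), replace = TRUE)   # (1) resample subjects with replacement
        mu_new_fun(dat[idx, ])                     # (2) refit the ensemble and recompute the estimate
      })
      var(est)                                     # sample variance over the B bootstrap estimates
    }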
The goal of our analysis is to compare clinical endpoints for patients who switch
early versus late from a failing efavirenz-containing combined antiretroviral therapy
(cART). The question of when to switch from a failing cART has been discussed and
debated in the HIV & AIDS literature for more than 15 years although some recent
research has suggested a preference for switching “early” from a failing regimen
(e.g., [19, 20, 24]). The data analysis below expands on an earlier analysis performed
by our research team [20]. Here, we present new comparisons of the mean clinical
endpoints using non-parametric ensemble methods discussed in Sect. 2.2 that make
weaker modeling assumptions than the semi-parametric methods in [20].
The data are taken from the AIDS Clinical Trials Group (ACTG) Study A5095.
Briefly, ACTG A5095 was a randomized, multi-center clinical trial designed to
Table 11.1 Estimates of mean potential outcomes E(Y_ab) for n = 744 patients switching less than (early) or greater than (late) 8 weeks after confirmed virologic failure on an efavirenz-containing regimen

                   HIV-1 RNA(a)             Detection limit(b)        CD4(c)
Method   Switch    Est. (SE)         T      Est. (SE)         T       Est. (SE)         T
Naive    Early     2.600 (0.181)     0.513  0.592 (0.054)     0.800   2.436 (0.055)     0.458
         Late      2.685 (0.068)            0.546 (0.023)             2.466 (0.026)
IPW      Early     1.835 (0.041)     4.970  0.837 (0.030)     2.720   2.21 (0.093)      0.369
         Late      1.914 (0.032)            0.787 (0.011)             2.564 (0.015)
AIPW     Early     1.848 (0.048)     2.325  0.829 (0.033)     1.614   2.593 (0.035)     0.764
         Late      1.915 (0.033)            0.787 (0.011)             2.563 (0.015)
RRZ      Early     1.833 (0.043)     4.218  0.828 (0.01)      19.860  2.600 (0.015)     9.800
         Late      1.914 (0.033)            0.787 (0.011)             2.561 (0.015)
Tan      Early     1.835 (0.040)     4.948  0.830 (0.011)     21.235  2.599 (0.014)     18.326
         Late      1.914 (0.033)            0.788 (0.011)             2.563 (0.015)
New      Early     1.849 (0.048)     1.192  0.808 (0.012)     7.087   2.593 (0.017)     5.364
         Late      1.899 (0.030)            0.788 (0.010)             2.567 (0.015)

Standard errors are reported in parentheses; T is the Wald test statistic for a nominal test of the null hypothesis that the average causal effect is zero.
(a) Length-adjusted AUC of HIV-1 RNA level, logarithm scale
(b) Proportion of time spent with HIV-1 RNA below the limit of detection
(c) Length-adjusted AUC of CD4 T-cell counts, logarithm scale
4 Simulation Studies
Our simulation results are summarized in Table 11.3. Table entries are the Monte Carlo bias and standard deviation from 200 Monte Carlo datasets. In the first scenario, where both the PS and OR models are correctly specified, the non-parametric estimator had bias similar to the semi-parametric estimators but was less precise than the AIPW and Tan estimators. In the second and third scenarios, where the OR models were incorrectly specified, IPW, AIPW, and Tan had small Monte Carlo bias, while the non-parametric estimator had bias similar to the RRZ estimator. The variance of the non-parametric estimator was smaller than that of any semi-parametric estimator in scenarios 2–3. In the last scenario, where the PS and OR models depend on 50 covariates, the semi-parametric estimators failed, while the non-parametric causal ensemble had small bias and variance.
5 Discussion
We have proposed a non-parametric estimator for missing data in Sect. 2.1 and
for a two-stage causal estimand in Sect. 2.2. The latter two-stage estimator is
a non-parametric alternative to the semi-parametric estimators proposed in our
earlier work [20] for comparing early versus late switch from a failing combined
antiretroviral therapy. In general, doubly robust semi-parametric estimators of the
causal estimand in [20] require that either one or both of the propensity score or
outcome regression models are correctly modeled. The non-parametric estimator
proposed here is based on blackbox boosted ensemble methods, does not model the
propensity score, and places minimal assumptions on the outcome regression model.
Our estimator uses readily available software via the mboost package in R.
Our simulation studies suggest that the non-parametric estimator performs similarly to the semi-parametric methods under linear regression with normal errors, but
better than semi-parametric methods when the number of potential confounders is
large. When the errors are heterogeneous or are otherwise non-normal, we found
that doubly robust semi-parametric methods have the potential to offer smaller bias
and variance when at least one of the treatment or outcome models is correctly
specified. In the data analysis, non-parametric estimates of the mean potential
outcome for the switch-early group were better than those for the delayed-switch
group. This conclusion is similar to what we found using semi-parametric methods
in Table 11.1 and reported in our earlier report [20].
Appendix

The boosting blackbox is built around a loss functional $E[\rho\{Y, f(X)\}]$, which is minimized iteratively by functional gradient descent [4], starting from the constant fit
$$
\hat f^{[0]}(\cdot) = \operatorname*{argmin}_{c} \frac{1}{n}\sum_{i=1}^{n} \rho(Y_i, c),
$$
and, at each iteration $m$, applying a base procedure to the current negative-gradient residuals $U_i$:
$$
(X_i, U_i)_{i=1}^{n} \;\xrightarrow{\ \text{base procedure}\ }\; \hat g(\cdot).
$$
References
1. Binder, H., Tutz, G.: A comparison of methods for the fitting of generalized additive models.
Stat. Comput. 18, 87–99 (2008)
2. Borra, S., Ciaccio, A.: Improving nonparametric regression methods by bagging and boosting.
Comput. Stat. Data Anal. 38, 407–420 (2002). doi:10.1016/S0167-9473(01)00068-8
3. Breiman, L.: Prediction Games and Arcing Algorithms. Technical Report 504. Statis-
tics Department, University of California, Berkeley (1997/1998), revised. http://stat-www.
berkeley.edu/tech-reports/index.html
4. Bühlmann, P., Hothorn, T.: Boosting algorithms: regularization, prediction and model fitting.
Stat. Sci. 22, 477–505 (2007). doi:10.1214/07-STS242
5. Cao, W., Tsiatis, A.A., Davidian, M.: Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734 (2009)
6. Cheng, P.E.: Nonparametric estimation of mean functionals with data missing at random. J.
Am. Stat. Assoc. 89, 81–87 (1994)
7. Efron, B., Tibshirani, R.: Bootstrap methods for standard errors, confidence intervals, and other
measures of statistical accuracy. Stat. Sci. 1, 54–75 (1986). doi:10.1214/ss/1177013815
8. Fan, J., Gijbels, I.: Local polynomial fitting. In: Smoothing and Regression. Approaches,
Computation and Application (M.G. Schimek), pp. 228–275. Wiley, New York (2000)
9. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an
application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997)
10. Freund, Y., Schapire, R.E.: A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 14, 771–
780 (1999)
11. Friedman, J.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29,
1189–1232 (2001)
12. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. Stat. 28, 337–374 (2000)
13. Friedman, J., Hastie, T., Tibshirani, R.: Rejoinder for additive logistic regression: a statistical view of boosting. Ann. Stat. 28, 400–407 (2000)
14. Gu, C.: Smoothing Spline ANOVA Models. Springer, New York (2002)
15. Gulick, R.M., Ribaudo, H.J., Lustgarten, S., Squires, K.E., Meyer, W.A., Acosta, E.P.,
Schackman, B.R., Pilcher, C.D., Murphy, R.L., Maher, W.L., Witt, M.D., Reichman, R.C.,
Snyder, S., Klingman, K.L., Kuritzkes, D.R.: Triple-nucleoside regimens versus efavirenz-
containing regimens for the initial treatment of HIV-1 infection. N. Engl. J. Med. 350,
1850–1861 (2004)
16. Gulick, R.M., Ribaudo, H.J., Shikuma, C.M., Lalama, C., Schackman, B.R., Meyer, W.A.
3rd., Acosta, E.P., Schouten, J., Squires, K.E., Pilcher, C.D., Murphy, R.L., Koletar, S.L.,
Carlson, M., Reichman, R.C., Bastow, B., Klingman, K.L., Kuritzkes, D.R., AIDS Clinical
Trials Group (ACTG) A5095 Study Team: Three- vs four-drug antiretroviral regimens for the
initial treatment of HIV-1 infection: a randomized controlled trial. J. Am. Med. Assoc. 296(7),
768–781 (2006)
17. Hastie, T., Tibshirani, R.: Generalized Additive Models, 1st edn. Monographs on Statistics and
Applied Probability. Chapman and Hall/CRC, Boca Raton (1990)
18. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd edn. Springer, New York (2001)
19. Johnson, B.A., Ribaudo, H., Gulick, R.M., Eron, J.J.: Modeling clinical endpoints as a function
of time of switch to second-line ART with incomplete data on switching times. Biometrics 69,
732–740 (2013)
20. Li, L., Eron, J., Ribaudo, H., Gulick, R.M., Johnson, B.A.: Evaluating the effect of early versus
late ARV regimen change after failure on the initial regimen: results from the AIDS clinical
trials group study A5095. J. Am. Stat. Assoc. 107, 542–554 (2012)
21. Lunceford, J., Davidian, M., Tsiatis, A.: Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics 58, 48–57 (2002)
22. McCullagh, P., Nelder, J.A. :Generalized Linear Models, 1st edn. Chapman and Hall, London
(1983)
23. Nadaraya, E.A.: On estimating regression. Theory Probab. Appl. 9(1), 141–142 (1964).
doi:10.1137/1109020
24. Petersen, M.L., van der Laan, M.J., Napravnik, S., Eron, J., Moore, R., Deeks, S.: Long term
consequences of the delay between virologic failure of highly active antiretroviral therapy and
regimen modification: a prospective cohort study. AIDS 22, 2097–106 (2008)
25. Riddler, S., Jiang, H., Tenorio, A., Huang, H., Kuritzkes, D., Acosta, E., Landay, A., Bastow,
B., Haas, D., Tashima, K., Jain, M., Deeks, S., Bartlett, J.: A randomized study of antiviral
medication switch at lower- versus higher-switch thresholds: AIDS clinical trials group study
A5115. Antivir. Ther. 12, 531–541 (2007)
26. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some
regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866 (1994)
27. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Analysis of semiparametric regression models for
repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 90, 106–121 (1995)
28. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies
for causal effects. Biometrika 70, 41–55 (1983)
29. Shao, J., Sitter, R.R.: Bootstrap for imputed survey data. J. Am. Stat. Assoc. 91, 1278–1288
(1996)
30. Simonoff, J.: Smoothing Methods in Statistics. Springer Science and Business Media, New
York (1996)
31. Stone, R.M., Berg, D.T., George, S.L., Dodge, R.K., Paciucci, P.A., Schulman, P., Lee, E.J.,
Moore, J.O., Powell, B.L., Schiker, C.A.: Granulocyte- macrophage colony-stimulating factor
after initial chemotherapy for elderly patients with primary acute myelogenous leukemia. N.
Engl. J. Med. 322, 1671–1677 (1995)
32. Tan, Z.: A distributional approach for causal inference using propensity scores. J. Am. Stat.
Assoc. 101, 1619–1637 (2006)
33. Tan, Z.: Understanding OR, PS and DR. Stat. Sci. 22, 560–568 (2007)
34. Watson, G.S.: Smooth regression analysis. Sankhyā Indian J. Stat. Ser. A 26(4), 359–372 (1964) [JSTOR 25049340]
Chapter 12
Structural Functional Response Models
for Complex Intervention Trials
P. Wu ()
Value Institute, Christiana Care Health System, 4755 Ogletown-Stanton Road, Newark,
DE 19718, USA
e-mail: PWu@Christianacare.org
X.M. Tu
Department of Biostatistics and Computational Biology, University of Rochester,
265 Crittenden Boulevard, Rochester, NY 14642, USA
e-mail: Xin_Tu@urmc.rochester.edu
1 Introduction
The randomized controlled trial (RCT) has been treated as the gold standard in causal inference, since randomization ensures that no pre-treatment variables can confound both treatment assignment and the outcomes of interest. This effort is rewarded by a simple design with robust results that is easily understood and implemented. RCTs, however, do not always guarantee a causal interpretation of the treatment effect on the outcome of interest when the post-randomization treatment suffers from imperfect compliance or non-compliance in practice, such as inconsistent exposure to the intervention across individual subjects in the active treatment arms, or limited control over other variables (mediators) related to both treatment and outcomes. The traditional intention-to-treat (ITT) approach is recommended for RCTs because of its simplicity in study design, control, and implementation, but it is not capable of addressing post-treatment confounding and may lead to biased inference and analytic results without a causal interpretation.
In the past decades, the problem of estimating the causal effect of compliance with active treatment in randomized trials has received much attention in the statistical literature. Efron and Feldman [3] introduced a one-to-one monotone mapping between compliance and treatment and implemented it with a fully parametric model. Angrist et al. [1] used the instrumental variables approach to calculate the complier average treatment effect in placebo-controlled trials and generalized it further to binary compliance with a binary outcome. Frangakis and Rubin [7] developed the principal stratification (PS) method to adjust for post-treatment compliance within stratified covariate groups and estimate the causal effect within each stratum using a Bayesian approach. Robins [23] proposed the structural nested mean models (SMM) to estimate the causal parameters within a robust semiparametric framework for (repeated) continuous outcomes. Goetghebeur and Lapp [8] and Vansteelandt and Goetghebeur [29] applied this theory to address the confounding of treatment compliance in placebo-controlled trials and then extended it to generalized SMM accommodating binary and count data. The SMM gives causal treatment effects a precise, if subtle, meaning and allows their efficient estimation; the model can be viewed as a robust regression with unpaired data.
Although RCTs remain the benchmark for clinical research and practice, observational studies with self-selected treatment and semi-RCTs (trials that initiate treatment dynamically when needed) have become more popular, especially in behavioral and social science studies, epidemiological studies, and healthcare research, because of the large amount of data generated by new web technologies and social media. Even within standard RCTs, single-intervention designs are becoming less attractive in empirical research because of their simplistic treatment structure. More and more studies prefer complex designs, such as multi-level, multi-layered, or multi-modal dynamic interventions, to take advantage of both static (e.g., genetic traits) and dynamic (e.g., treatment response) information during treatment.
the true causal effects of parent involvement in the program, especially if the effect
of treatment on child outcomes is achieved in part through parental participation.
As we introduced, a number of approaches for addressing treatment noncompli-
ance in RCTs have been developed based on the counterfactual outcome framework.
Unfortunately, none of the available methods is able to address treatment noncom-
pliance in multi-layered intervention studies. The new approach we have developed
is to extend the principles in these approaches to this new setting with treatment
noncompliance from multiple layers of the intervention. In Sect. 2, we briefly
review the counterfactual outcome based causal framework and introduce a class of
SFRM to address both pre- and post-treatment confounding. In Sect. 3, the SFRM is
extended to address treatment noncompliance in multi-layered interventions within
a longitudinal study setting. Simulation studies are presented in Sect. 4 to evaluate
the performance of the proposed SFRM. In Sect. 5, we apply the approach to address
the variability in parent participation in the two-layered CRP study. We conclude
with a discussion in Sect. 6.
$$
\hat\Delta = \bar y_1 - \bar y_0, \qquad
\bar y_1 = \frac{1}{n_1}\sum_{i_1=1}^{n_1} y_{i_1 1}, \qquad
\bar y_0 = \frac{1}{n_0}\sum_{i_0=1}^{n_0} y_{i_0 0}, \tag{12.1}
$$
where $n_k$ denotes the number of subjects assigned to the $k$th treatment group, so that $n = n_1 + n_0$, and $i_k$ denotes the $i$th subject within the $k$th treatment group. Note that $y_{i_k k}$ refers to the observed outcome for the $i_k$th subject under the assigned $k$th treatment, while $y_{ik}$ denotes the potential outcome corresponding to the $k$th treatment.
The above shows that standard statistical models such as linear regression and mixed-effects models can be applied to RCTs to infer causal treatment effects. Randomization is the key to the transition from the unobserved individual-level difference, $y_{i1} - y_{i0}$, to the estimable average treatment effect given by the computable sample means in (12.1). For non-randomized trials, such as most epidemiological studies, exposure to the treatment or agent is non-random, in which case (12.1) generally does not estimate the average causal effect $\Delta = E(y_{i1} - y_{i0})$. Thus, associations found in observational studies generally do not imply causation.
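The following toy simulation, with hypothetical potential outcomes, illustrates why (12.1) recovers the average causal effect under randomization.

```r
set.seed(1)
n  <- 10000
y1 <- rnorm(n, mean = 2)            # potential outcome under treatment
y0 <- rnorm(n, mean = 1)            # potential outcome under control
z  <- rbinom(n, 1, 0.5)             # randomized assignment, independent of (y1, y0)
y  <- z * y1 + (1 - z) * y0         # observed outcome
mean(y[z == 1]) - mean(y[z == 0])   # approximately Delta = E(y1 - y0) = 1
```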
Since only one of the potential outcomes $y_{ik}$ is observable, we cannot model the $y_{ik}$'s directly using conventional regression models. One way around this is to model the observed outcomes $y_{i_k k}$, as in the preceding section. Alternatively, we can circumvent this difficulty by constructing an observable response based on the unobserved $y_{ik}$ and relating it to the mean of $y_{ik}$ as follows:
$$
E\!\left(\frac{z_i^k (1-z_i)^{1-k} y_{ik}}{\pi^k (1-\pi)^{1-k}}\right) = \mu_k, \qquad E(z_i) = \pi, \qquad z_i = 0, 1, \quad 1 \le i \le n, \quad k = 0, 1, \tag{12.2}
$$
where $\mu_k = E(y_{ik})$ is the mean of the potential outcome $y_{ik}$, since it is readily checked that
$$
E\!\left(\frac{z_i^k (1-z_i)^{1-k} y_{ik}}{\pi^k (1-\pi)^{1-k}}\right)
= \frac{1}{\pi^k (1-\pi)^{1-k}}\, E\!\left[z_i^k (1-z_i)^{1-k} y_{ik}\right] = \mu_k.
$$
Although $y_{i0}$ and $y_{i1}$ are not both observed, the functional response $f(y_{i0}, y_{i1}, z_i) = \dfrac{z_i^k (1-z_i)^{1-k} y_{ik}}{\pi^k (1-\pi)^{1-k}}$ in (12.2) is still well defined. If $\pi$ is known, as in most RCTs, it is unnecessary to model $z_i$, and (12.2) reduces to the first equation.
The model in (12.2) is not a conventional regression model such as a generalized linear or non-linear model, since $f(y_{i0}, y_{i1}, z_i)$ is not a single linear response such as $y_{ik}$ or $z_i$. Rather, this model is a member of the following class of functional response models (FRM):
$$
E\!\left[f\!\left(y_{i_1}, \ldots, y_{i_q}; \theta\right) \mid x_{i_1}, \ldots, x_{i_q}\right]
= h\!\left(x_{i_1}, \ldots, x_{i_q}; \theta\right), \qquad (i_1, \ldots, i_q) \in C_q^n, \tag{12.3}
$$
where $f(\cdot)$ is some function, $h(\cdot)$ is some smooth function (e.g., with continuous second-order derivatives), $y_i$ and $x_i$ denote response and explanatory variables, $C_q^n$ denotes the set of $\binom{n}{q}$ combinations of $q$ distinct elements $(i_1, \ldots, i_q)$ from the integer set $\{1, \ldots, n\}$, and $\theta$ is a vector of parameters. The response $f(y_{i_1}, \ldots, y_{i_q}; \theta)$ in (12.3) for the general FRM can be quite a complex function of multiple outcomes [e.g., $y_{ik}$ and $z_i$ in (12.2)] from different subjects, as well as unknown parameters [e.g., $\pi$ in (12.2)]. By generalizing the response variable in this fashion, (12.3) provides a general framework for modeling a broad set of problems involving higher-order moments and between-subject attributes. The FRM has been applied to a range of methodological issues involving multi-subject responses, such as extensions of the Mann-Whitney-Wilcoxon rank sum test to longitudinal and causal inference settings [2, 31], social network analysis [4, 14, 32], gene expression analysis [11], reliability coefficients [10, 12, 15–18, 27], and complex response functions such as models for population mixtures [33] and structural equation models [9].
Because of its relationship to (12.3), the model in (12.2) will be referred to as the structural FRM (SFRM):
$$
E\!\left[f_{ik}(y_{i0}, y_{i1}, z_i)\right] = h_{ik}(\theta), \qquad
f_{i1} = \frac{z_i y_{i1}}{\pi}, \quad f_{i2} = \frac{(1-z_i) y_{i0}}{1-\pi}, \quad f_{i3} = z_i, \tag{12.4}
$$
$$
h_{i1}(\theta) = \mu_1, \qquad h_{i2}(\theta) = \mu_0, \qquad h_{i3}(\theta) = \pi,
$$
where $\theta = (\mu_1, \mu_0, \pi)^{\top}$ denotes the collection of parameters of this SFRM. Before adding more complexity to this SFRM to address treatment noncompliance within our context, let us first extend it to address selection bias in observational studies.
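A small sketch of the functional responses in (12.4) with a known $\pi$, continuing the toy RCT data above, is:

```r
pi0 <- 0.5                        # known assignment probability in the RCT
f1  <- z * y / pi0                # functional response with mean mu_1
f2  <- (1 - z) * y / (1 - pi0)    # functional response with mean mu_0
c(mu1_hat = mean(f1), mu0_hat = mean(f2), pi_hat = mean(z))
```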
If subjects are not randomized with respect to the treatment condition (or exposure), as in observational studies (e.g., surveys, epidemiologic studies), $y_{ik} \perp z_i$ generally does not hold. In the presence of such selection bias, if $w_i$ is a vector of covariates containing all sources of confounding such that the ignorability condition [26], $y_{ik} \perp z_i \mid w_i$, holds, then we have
$$
E\!\left(\frac{z_i^k (1-z_i)^{1-k} y_{ik}}{\pi(w_i)^k (1-\pi(w_i))^{1-k}}\right)
= E\!\left[E\!\left(\frac{z_i^k (1-z_i)^{1-k} y_{ik}}{\pi(w_i)^k (1-\pi(w_i))^{1-k}} \,\Bigm|\, w_i\right)\right] = \mu_k, \tag{12.5}
$$
where $\pi(w_i) = E(z_i \mid w_i)$. We may model $z_i$ using a generalized linear model such as logistic regression:
$$
\pi(w_i; \eta) = \operatorname{logit}^{-1}\!\left(\eta^{\top} w_i\right). \tag{12.6}
$$
By combining (12.5) and (12.6), we have the following SFRM to provide valid inference about $\theta = (\mu_1, \mu_0, \eta^{\top})^{\top}$ under selection bias:
$$
f_{i1} = \frac{z_i y_{i1}}{\pi(w_i; \eta)}, \quad f_{i2} = \frac{(1-z_i) y_{i0}}{1 - \pi(w_i; \eta)}, \quad f_{i3} = z_i, \qquad 1 \le i \le n, \tag{12.7}
$$
$$
h_{i1}(\theta) = \mu_1, \quad h_{i2}(\theta) = \mu_0, \quad h_{i3}(w_i; \theta) = \pi(w_i; \eta),
$$
with conditioning sets $x_{i1} = x_{i2} = \{\emptyset\}$ and $x_{i3} = \{w_i\}$.
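When $\pi$ must be estimated, as in (12.6)–(12.7), the logistic propensity model can be fitted with glm(); `w1` and `w2` are hypothetical confounders in a data frame `obs`, used only for illustration.

```r
ps_fit  <- glm(z ~ w1 + w2, family = binomial, data = obs)  # pi(w; eta)
pi_w    <- fitted(ps_fit)
mu1_hat <- mean(obs$z * obs$y / pi_w)                       # E(f_i1) = mu_1
mu0_hat <- mean((1 - obs$z) * obs$y / (1 - pi_w))           # E(f_i2) = mu_0
mu1_hat - mu0_hat                                           # weighted causal effect estimate
```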
In many RCTs, even well-planned and executed ones, the treatment effect may be significantly modified by the level of exposure to the intervention (e.g., compliance or dosage) due to treatment noncompliance. One popular approach for addressing this primary post-treatment confounder is the structural mean model (SMM) [8, 23, 29].
Other competing approaches also address treatment noncompliance such as the
instrumental variable [1] and principal stratification methods [7]. However, only
SMM models treatment compliance on a continuous scale, which is more appro-
priate for session attendance within our context. We first frame this model within
the FRM framework and then discuss its extensions to accommodate complex
intervention designs, such as multi-layered treatments and missing data, in Sect. 3.
Consider a randomized medication vs. placebo study and let $d_{i1}$ denote a continuous potential outcome of medication use if the $i$th subject is assigned to the medication condition. The SMM models the dose effect on the treatment difference through a function $g(d_{i1}, x_i; \gamma)$ that is known up to the parameters $\gamma$ (i.e., only the functional form of $g(d_{i1}, x_i)$ is specified), where $x_i$ denotes the baseline covariates; see (12.8). However, this model cannot be fit directly using conventional statistical methods, since only one of the potential outcomes $(y_{i1}, y_{i0})$ is observed. For RCTs, we have $(y_{i1}, y_{i0}) \perp z_i$, and it follows from Eq. (12.8) that
$$
f_{i1} = \frac{z_i y_{i1}}{\pi}, \quad f_{i2} = \frac{(1-z_i) y_{i0}}{1-\pi}, \quad f_{i3} = z_i, \qquad 1 \le i \le n, \tag{12.12}
$$
$$
h_{i1}(x_i, \beta) = h(x_i; \beta), \quad h_{i2}(x_i, d_{i1}, \gamma) = g(d_{i1}, x_i; \gamma) + h(x_i; \beta), \quad \theta = (\beta^{\top}, \gamma^{\top})^{\top},
$$
and, when the treatment is not randomized, with $\pi$ replaced by the propensity model as in (12.7):
$$
f_{i1} = \frac{z_i y_{i1}}{\pi(w_i; \eta)}, \quad f_{i2} = \frac{(1-z_i) y_{i0}}{1 - \pi(w_i; \eta)}, \quad f_{i3} = z_i, \qquad 1 \le i \le n, \tag{12.13}
$$
$$
h_{i1} = h_{i1}(x_i, \beta), \quad h_{i2}(x_i, d_{i1}, \beta, \gamma) = g(d_{i1}, x_i; \gamma) + h_{i1}(x_i, \beta),
$$
$$
h_{i3} = \pi(w_i; \eta), \quad \operatorname{logit}\!\left(\pi(w_i; \eta)\right) = \eta^{\top} w_i, \quad \theta = (\beta^{\top}, \gamma^{\top}, \eta^{\top})^{\top}.
$$
We can model $h(x_i; \beta)$ and $g(d_{i1}, x_i; \gamma)$ in various ways. For example, we may simply model both as linear functions: $h_1(x_i, \beta) = x_i^{\top}\beta$ and $g(d_{i1}, x_i; \gamma) = \gamma d_{i1}$. By specifying an appropriate form for $g(d_{i1}, x_i; \gamma)$, we may also extend (12.12) to non-continuous dose variables, such as categorical variables. Further, by appropriately specifying $h_1(x_i, \beta)$ and $h_2(x_i, d_{i1}, \beta)$, we can also generalize (12.12) to non-continuous responses. For example, for a binary $y_i$, we may specify $h_1(x_i, \beta)$ and $h_2(x_i, d_{i1}, \beta)$ as
$$
h_1(x_i, \beta) = \operatorname{logit}^{-1}\!\left(x_i^{\top}\beta\right), \qquad
h_2(x_i, d_{i1}, \beta) = \operatorname{logit}^{-1}\!\left(g(d_{i1}, x_i; \gamma) + h_1(x_i, \beta)\right).
$$
$$
f(y_i; z_i) = (f_{i1}, f_{i2}, f_{i3})^{\top}, \qquad h_i(\theta) = (h_{i1}, h_{i2}, h_{i3})^{\top}, \qquad 1 \le i \le n,
$$
where $f_{ik}$ and $h_{ik}$ are defined in (12.13). Then, consistent estimates of $\theta$ are readily obtained by using the generalized estimating equations (GEE) for FRM [9, 12, 33]:
$$
U(\theta) = \sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \qquad S_i = f_i - h_i, \qquad D_i = \frac{\partial}{\partial\theta} h_i, \tag{12.14}
$$
$$
V_i = A_i^{1/2} R(\alpha) A_i^{1/2}, \qquad A_i = \operatorname{diag}_t\!\left(\operatorname{Var}(f_{it} \mid x_{it})\right),
$$
We first extend the SFRM in Sect. 2 to longitudinal data and then to multi-layered intervention studies.
Let $y_{it} = (y_{it1}, y_{it0})^{\top}$ ($x_{it}$) denote the potential outcomes of $y_{it}$ (a vector of explanatory variables) of interest, with $i$ ($t$) indexing the subject (assessment time) for $1 \le i \le n$ and $1 \le t \le T$. By applying (12.13) to each time point, we obtain a longitudinal version of the SFRM:
$$
f_i = \left(f_{i1}^{\top}, \ldots, f_{iT}^{\top}, z_i\right)^{\top}, \qquad
h_i = \left(h_{i1}^{\top}, \ldots, h_{iT}^{\top}, \pi_i\right)^{\top}, \qquad
E(f_i \mid x_i) = h(x_i, \theta), \tag{12.15}
$$
$$
f_{it} = (f_{it1}, f_{it2})^{\top}, \qquad f_{it1} = \frac{z_i}{\pi_i}\, y_{it1}, \qquad f_{it2} = \frac{1-z_i}{1-\pi_i}\, y_{it0}, \qquad 1 \le i \le n,
$$
$$
h_{it} = (h_{it1}, h_{it2})^{\top}, \qquad h_{it1} = h_1(x_{it}, \beta), \qquad h_{it2} = g_t(d_{i1}, \gamma) + h_{it1},
$$
$$
\pi_i = \operatorname{logit}^{-1}\!\left(\eta^{\top} w_i\right), \qquad \theta = \left(\beta^{\top}, \gamma^{\top}, \eta^{\top}\right)^{\top}.
$$
Inference for the FRM above is based on the following GEE for FRM [9, 12, 33]:
$$
U(\theta) = \sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \qquad S_i = f_i - h_i, \qquad D_i = \frac{\partial}{\partial\theta} h_i, \tag{12.16}
$$
$$
V_i = A_i^{1/2} R(\alpha) A_i^{1/2}, \qquad A_i = \operatorname{diag}_t\!\left(\operatorname{Var}(f_{it} \mid x_{it})\right),
$$
where $D_i$ and $V_i$ are readily computed given (12.15) and $R(\alpha)$ denotes a choice of working correlation matrix.
Missing data is a common issue in longitudinal studies. The GEE in (12.16)
generally yields biased estimates under the missing at random (MAR) mecha-
nism [13, 24, 30]. The weighted generalized estimating equations (WGEE), a
common approach for addressing this issue, has been extended to the FRM [9, 33].
We adapt this approach to the current context, with an alternative implementation to
simplify the inference procedure. As in the literature, we assume monotone missing
data patterns (MMDP) to facilitate inference [9, 13, 24, 30, 33].
Let $y_{it}$ denote the observed potential outcome, i.e., $y_{it} = y_{itk}$ if the subject is assigned the $k$th treatment, and let $r_{it}$ indicate whether the subject is observed at time $t$ ($r_{i1} \equiv 1$). Let
$$
\overline{y}_{it} = \left(y_{i1}, \ldots, y_{i(t-1)}\right)^{\top}, \qquad
\overline{x}_{it} = \left(x_{i1}^{\top}, \ldots, x_{i(t-1)}^{\top}\right)^{\top}, \qquad 1 \le t \le T,
$$
denote all individual responses ($y_{it}$) and explanatory variables ($x_{it}$) observed prior to time $t$. Let
$$
p_{it} =
\begin{cases}
1 & \text{if } t = 1, \\
E\!\left(r_{it} = 1 \mid r_{i(t-1)} = 1, \overline{x}_{it}, \overline{y}_{it}\right) & \text{if } t > 1,
\end{cases}
\qquad
p_{it} = \operatorname{logit}^{-1}\!\left(\xi_{0t} + \xi_{xt}^{\top} \overline{x}_{it} + \xi_{yt}^{\top} \overline{y}_{it}\right), \tag{12.17}
$$
$$
\Delta_{it} = \left(\prod_{s=1}^{t} p_{is}\right)^{-1} r_{it}\, I_2, \qquad
\Delta_i(\xi) = \operatorname{diag}_t\!\left(\Delta_{it}\right), \qquad
\xi_t = \left(\xi_{0t}, \xi_{xt}^{\top}, \xi_{yt}^{\top}\right)^{\top}, \qquad
\xi = \left(\xi_2^{\top}, \ldots, \xi_T^{\top}\right)^{\top}.
$$
The WGEE is then given by
$$
U(\theta, \xi) = \sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i = 0. \tag{12.18}
$$
where $f_{it}$, $h_{it}$, and $\pi_i$ are defined in (12.15), and $r_{it}$ and $p_{it}$ are defined in (12.17).
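A sketch of computing the dropout probabilities $p_{it}$ and the resulting inverse-probability weights under MAR is shown below; the observation-indicator matrix `r` and the history summaries `ylag`, `xlag` are illustration-only names, and each time-specific model is fitted by ordinary logistic regression.

```r
Tm <- ncol(r)                                # number of assessment times
p  <- matrix(1, nrow(r), Tm)                 # p_i1 = 1 by definition
for (t in 2:Tm) {
  at_risk <- r[, t - 1] == 1                 # still observed at t - 1 (monotone dropout)
  fit <- glm(r[at_risk, t] ~ ylag[at_risk, t] + xlag[at_risk, t],
             family = binomial)
  p[at_risk, t] <- fitted(fit)               # p_it = P(r_it = 1 | observed history)
}
w <- r / t(apply(p, 1, cumprod))             # weight r_it / prod_{s <= t} p_is
```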
Consider the WGEE in (12.18), but with $D_i$, $V_i$, and $\Delta_i$ redefined as follows to provide estimates for both $\theta$ and $\xi$:
$$
D_i = \frac{\partial}{\partial\theta} h_i, \qquad
V_i = \begin{pmatrix} V_{i11} & 0 & 0 \\ 0 & V_{i22} & 0 \\ 0 & 0 & V_{i33} \end{pmatrix}, \qquad
V_{i11} = A_i^{1/2} R(\alpha) A_i^{1/2}, \qquad V_{i22} = \pi_i (1 - \pi_i),
$$
$$
V_{i33} = \begin{pmatrix} p_{i2}(1-p_{i2}) & & \\ & \ddots & \\ & & p_{iT}(1-p_{iT}) \end{pmatrix}, \qquad
\Delta_i = \begin{pmatrix} \Delta_{i11} & 0 & 0 \\ 0 & \Delta_{i22} & 0 \\ 0 & 0 & \Delta_{i33} \end{pmatrix}, \tag{12.20}
$$
$$
\Delta_{i11} = \operatorname{diag}_t\!\left(\Delta_{it}\right), \qquad
\Delta_{it} = r_{it}\left(\prod_{s=1}^{t} p_{is}\right)^{-1} I_2, \qquad
\Delta_{i22} = 1, \qquad
\Delta_{i33} = \begin{pmatrix} r_{i1} & & \\ & \ddots & \\ & & r_{i(T-1)} \end{pmatrix},
$$
where $A_i$ is defined in (12.17). Unlike (12.18), the WGEE in (12.19) makes joint inference about $\theta$ and $\xi$. Thus, no adjustment to the asymptotic variance of the WGEE estimate of $\theta$ is necessary to account for the sampling variability of $\hat\xi$, as in the standard approach above.
Consider a two-layered intervention study and let $u_{i1}$ denote a (continuous) treatment compliance measure for the second layer. By taking into account both compliance measures $d_{i1}$ and $u_{i1}$, we obtain from (12.11) a dose–response relationship of the same form, in which $g(d_{i1}, u_{i1})$ does not depend on $x_i$ for model simplicity. We assume that the covariates $x_i$ sufficiently explain the treatment compliance patterns for both the primary and secondary layers of the multi-layered intervention, i.e., compliance is ignorable given $x_i$: $d_{i1} \perp y_{i0} \mid x_i$ and $u_{i1} \perp y_{i0} \mid x_i$. In some studies, treatment noncompliance may be limited to some intervention layers, in which case $x_i$ is only required to explain the affected layers. For example, in the CRP, noncompliance is a major issue only for the second, parent-support layer, and the ignorability condition only needs to be assumed for parent participation.
By formulating (12.21) as an FRM, as in the case of a single-layered intervention study, we obtain the following SFRM for modeling the effect of treatment noncompliance on the outcome in a two-layered intervention study:
$$
f_{i1} = \frac{z_i y_{i1}}{\pi(w_i; \eta)}, \quad f_{i2} = \frac{(1-z_i) y_{i0}}{1 - \pi(w_i; \eta)}, \quad f_{i3} = z_i, \qquad 1 \le i \le n, \tag{12.22}
$$
$$
h_{i1} = h(x_i; \beta), \qquad h_{i2}(x_i, d_{i1}, u_{i1}) = g(d_{i1}, u_{i1}; \gamma) + h(x_i; \beta),
$$
$$
h_{i3} = \pi_i = \pi(w_i; \eta) = E(z_i \mid x_i, d_{i1}, u_{i1}), \qquad \theta = (\beta^{\top}, \gamma^{\top}, \eta^{\top})^{\top}.
$$
The above has essentially the same form as the single-layered SFRM, except that the treatment effect $g(d_{i1}, u_{i1}; \gamma)$ is a function of compliance from both the primary and secondary intervention layers. Note that (12.22) applies to observational studies as well, in which case $w_i$ is assumed to account for all sources of selection bias.
We can model the treatment effect $g(d_{i1}, u_{i1}; \gamma)$ to reflect treatment compliance in both layers. For example, we may specify an additive effect function, $g(d_{i1}, u_{i1}; \gamma) = \gamma_1 d_{i1} + \gamma_2 u_{i1}$, or we may also include a between-layer compliance interaction $d_{i1} u_{i1}$. If the treatment effect is moderated by some covariate $x_i$, we may include the moderating effect by setting $g(d_{i1}, u_{i1}, x_i; \gamma) = x_i(\gamma_1 d_{i1} + \gamma_2 u_{i1})$. If the moderating effect occurs in only one of the intervention layers, we may model $g(d_{i1}, u_{i1}, x_i; \gamma)$ as $\gamma_1 x_i d_{i1} + \gamma_2 u_{i1}$ or $\gamma_1 d_{i1} + \gamma_2 x_i u_{i1}$, depending on whether the moderating effect operates at the primary or the secondary layer of the intervention.
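In code, these choices of $g(d_{i1}, u_{i1}; \gamma)$ are simply alternative functions of the two compliance measures; the short sketch below is purely illustrative.

```r
g_additive    <- function(d, u, gam) gam[1] * d + gam[2] * u                 # additive layers
g_interaction <- function(d, u, gam) gam[1] * d + gam[2] * u + gam[3] * d * u
g_moderated2  <- function(d, u, x, gam) gam[1] * d + gam[2] * x * u          # moderation in layer 2
```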
As in the case of a single-layered intervention study, the cross-sectional SFRM in (12.22) is readily extended to longitudinal studies. For example, by replacing the treatment effect function $g_t(d_{i1}; \gamma)$ in (12.15) with $g_t(d_{i1}, u_{i1}; \gamma)$ from (12.22), the SFRM in (12.15) can be applied to model the effect of treatment compliance in two-layered observational studies. As well, by modeling the missing data under MAR
using (12.17), we can make joint inference about $\theta$ in (12.22) and $\xi$ for the missing data model using a WGEE akin to (12.18), but with $D_i$, $V_i$, $\Delta_i$, and $S_i$ in (12.20) redefined based on (12.22).
In the above, we have assumed that both di1 and ui1 are continuous. The models
are easily extended to non-continuous compliance variables, if either di1 or ui1 or
both are non-continuous.
4 Simulation Studies
Table 12.2 Parameter estimates and standard errors for Model I with a cross-sectional continuous response, under treatment effect functions $g_1$/$g_2$

                    n = 50                                        n = 200
Parameter           Est.            Mod. S.E.      Emp. S.E.      Est.            Mod. S.E.      Emp. S.E.
γ0 = 0.5            0.475/0.514     0.629/0.566    0.744/0.758    0.496/0.495     0.318/0.274    0.326/0.289
γ1 = 0.5            0.459/0.535     0.655/0.553    0.808/0.760    0.505/0.515     0.315/0.271    0.323/0.300
γ2 = 0.4            0.428/0.377     0.369/0.313    0.453/0.438    0.402/0.395     0.179/0.163    0.196/0.175
β0 = 5              5.107/4.981     0.289/0.304    0.338/0.394    4.998/5.012     0.158/0.157    0.175/0.172
β1 = 2              1.995/2.013     0.341/0.348    0.402/0.509    2.002/1.976     0.189/0.189    0.201/0.195
η0 = 0              0.033/−0.003    0.325/0.326    0.329/0.343    0.001/−0.003    0.150/0.157    0.158/0.150
η1 = −1             −1.086/−1.107   0.394/0.395    0.422/0.461    −1.017/−1.016   0.189/0.189    0.188/0.199

Entries before/after the slash correspond to the two treatment effect functions $g_1$ and $g_2$
Table 12.3 Parameter estimates and standard errors for Model II with a longitudinal binary response

                 n = 100                               n = 400
Parameter        Est.      Mod. S.E.   Emp. S.E.       Est.      Mod. S.E.   Emp. S.E.
γ0 = 1           0.873     0.497       0.508           1.047     0.290       0.321
γ1 = 1           0.964     0.512       0.586           1.070     0.284       0.317
β0 = 1           1.067     0.468       0.491           1.022     0.216       0.213
β1 = 1           1.128     0.461       0.525           1.025     0.205       0.217
β2 = 1           1.089     0.589       0.605           1.016     0.264       0.279
β3 = 1           1.182     0.583       0.690           1.044     0.261       0.275
η0 = 0           0.021     0.209       0.224           0.001     0.108       0.109
η1 = 1           1.058     0.291       0.303           1.008     0.144       0.148
ξ0 = 1           1.022     0.273       0.292           1.010     0.131       0.135
ξ1 = 1           1.088     0.689       0.702           1.033     0.325       0.346
$\xi = (\xi_0, \xi_1)^{\top}$ for the missing data model. As in the case of cross-sectional data, both the parameter estimates and the model-based standard errors were quite good when compared with their true values and empirical counterparts.
Baseline denotes the baseline value of the subscale of the Dominic Interactive self-
report, assessing symptoms of three externalizing (oppositional defiant, conduct
problems, and ADHD) problems [28]. The results from the regression show that
session participation was significantly different across the different schools and
children with different PNC and DomEX baseline values. In addition, parent age
also significantly predicted the session attendance.
For our illustrations of the model, we focused on two primary behavior outcomes
of the study, the Teacher ratings of aggressive behavior (AthAcc) and Parent
rating of internalizing behavior problem (PIntD). For both outcomes, higher values
indicate fewer problems. For each of these behavior outcomes yit , let yit1 and yit0
denote the potential outcomes of yit at baseline .t D 1/ and each of the three follow-
ups .2 t 4/. We modeled the causal treatment effect as a function of treatment
compliance from the parent layer using an SFRM as follows:
$$
E\!\left(\frac{1-z_i}{1-\pi}\, y_{it0} \,\Bigm|\, u_i\right) = \mu_{it}, \qquad
E\!\left(\frac{z_i}{\pi}\, y_{it1} \,\Bigm|\, u_i\right) = g_{it} + \mu_{it}, \qquad
E(z_i) = \pi, \tag{12.25}
$$
$$
\mu_{it} = \beta_0 + \beta_1 t + x_{i1}\beta_2 + \beta_3 x_{i1} t + \beta_4 x_{i2} + \beta_5 x_{i3} + \beta_6 x_{i4}
+ \beta_7 x_{i5} + \beta_8 x_{i6} + \beta_9 x_{i7} + \beta_{10} x_{i8},
$$
$$
g_{it} = \gamma u_i t, \qquad 1 \le t \le 4.
$$
improved the child's behaviors and reduced the risk for future mental disorder and substance abuse. With the SFRM in (12.25), the causal treatment effect is given by $\gamma u_i t$. For example, if the parent of the child attended all of the planned 15 sessions, then $u_i = 15$ and the causal effect is $\hat\gamma u_i = 0.24$ points per month on the scale of the AthAcc outcome. Thus, over 18 months post-baseline, for instance, the intervention will on average increase the child's AthAcc outcome by $0.24 \times 18 = 4.32$ points.
For comparison purposes, we also performed the intent-to-treat (ITT) analysis for the two behavior outcomes by setting $u_i \equiv 1$ in $g_{it}$ of the SFRM in (12.25). The estimated $\gamma$, standard errors (S.E.), and p-values are shown in Table 12.5 under the column "ITT Effect." As seen, $\gamma$ was not significant for either outcome under the ITT analysis. Thus, parent support played a significant role in improving the two child behavior outcomes in this two-layered intervention study.
6 Discussion
multiple intervention layers as well as missing data under MAR. Our simulation studies show that the proposed approach performs quite well even for a sample size as small as 50 (for the combined intervention and control groups). As well, applications of the proposed model to the Rochester Resilience Project demonstrate the importance of considering treatment noncompliance from the supportive parent layer in this two-layered intervention study.
References
1. Angrist, J., Imbens, G.W., Rubin, D.B.: Identification of causal effects using instrumental
variables (with discussion). J. Am. Stat. Assoc. 91, 444–472 (1996)
2. Chen, R., Chen, T., Lu, N., Zhang, H., Wu, P., Feng, C., Tu, X.M.: Extending the Mann-
Whitney-Wilcoxon rank sum test to longitudinal data analysis with covariates. J. Appl. Stat.
41(12), 2659–2675 (2014)
3. Efron, B., Feldman, D.: Compliance as an explanatory variable in clinical trials. J. Am. Stat. Assoc. 86, 9–26 (1991)
4. El-Sayed, A.M., Scarborough, P., Seemann, L., Galea, S.: Social network analysis and agent
based modeling in social epidemiology. Epidemiol. Perspect. Innov. 9, 1–9 (2012)
5. Fischer, K., Goetghebeur, E.: Structural mean effects of noncompliance. J. Am. Stat. Assoc.
99(468), 918–928 (2004)
6. Fitzmaurice, G.M.: A caveat concerning independence estimating equations with multiple
multivariate binary data. Biometrics 51, 309–317 (1995)
7. Frangakis, C.E., Rubin, D.B.: Principal stratification in causal inference. Biometrics 58, 21–29
(2002)
8. Goetghebeur, E., Lapp, K.: The effect of treatment compliance in a placebo-controlled trial: regression with unpaired data. J. R. Stat. Soc. Ser. C Appl. Stat. 46, 351–364 (1997)
9. Gunzler, D., Tang, W., Lu, N., Wu, P., Tu, X.M.: A class of distribution-free models for
longitudinal mediation analysis. Psychometrika 79(4), 543–568 (2013)
10. King, T.S., Chinchilli, V.M.: A generalized concordance correlation coefficient for continuous
and categorical data. Stat. Med. 20, 2131–47 (2001)
11. Kowalski, J., Powell, J.: Nonparametric inference for stochastic linear hypotheses: application
to high-dimensional data. Biometrika 91(2), 393–408 (2004)
12. Kowalski, J., Tu, X.M.:Modern Applied U Statistics. Wiley, New York (2007)
13. Lu, N., Tang, W., He, H., Yu, Q., Crits-Christoph, P., Zhang, H., Tu, X.M.: On the impact
of parametric assumptions and robust alternatives for longitudinal data analysis. Biom. J. 51,
627–643 (2009)
14. Lu, N., White, A.M., Wu, P., He, H., Hu, J., Feng, C., Tu, X.M.: Social network endogeneity
and its implications for statistical and causal inferences. In: Lu, N., White, A.M., Tu, X.M.
(eds.) Social Networking: Recent Trends, Emerging Issues and Future Outlook. Nova Science,
New York (2013)
15. Lu, N., Chen, T., Wu, P., Gunzler, D., Zhang, H., He, H., Tu, X.M.: Functional response models
for intraclass correlation coefficients. J. Appl. Stat. 41(11), 2539–2556 (2014)
16. Ma, Y., Tang, W., Feng, C., Tu, X.M.: Inference for Kappas for longitudinal study data:
applications to sexual health research. Biometrics 64, 781–789 (2008)
17. Ma, Y., Tang, W., Yu, Q., Tu, X.M.: Modeling concordance correlation coefficient for
longitudinal study data. Psychometrika 75, 99–119 (2010)
18. Ma, Y., Alejandro, G.D., Hui, Z., Tu, X.M.: A U-statistics based approach for modeling
Cronbach Coefficient Alpha within a longitudinal data setting. Stat. Med. 29(6), 659–670
(2011)
19. Meadows, G., Burgess, P., Fossey, E., Harvey, C.: Perceived need for mental health care,
findings from the Australian National Survey of Mental Health and Wellbeing. Psychol. Med.
30, 645–656 (2000)
20. Nelsen, R.B.: An Introduction to Copulas. Springer, New York (2006)
21. Pepe, M.S., Anderson, G.L.: A cautionary note on inference for marginal regression models
with longitudinal data and general correlated response data. Commun. Stat. Simul. 23, 939–
951 (1994)
22. R Development Core Team: R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna (2010). ISBN 3-900051-07-0. http://www.R-
project.org
23. Robins, J.M.: Correcting for noncompliance in randomized trials using structural nested mean
models. Commun. Stat. 23, 2379–2412 (1994)
24. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Analysis of semiparametric regression models for
repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 90, 106–121 (1995)
25. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies.
J. Educ. Psychol. 66, 688–701 (1974)
26. Rubin, D.B.: Bayesian inference for causal effects: the role of randomization. Ann. Stat. 6,
34–58 (1978)
27. Tu, X.M., Feng, C., Kowalski, J., Tang, W., Wang, H., Wan, C., Ma, Y.: Correlation analysis
for longitudinal data: applications to HIV and psychosocial research. Stat. Med. 26, 4116–4138
(2007)
28. Valla, J.P., Bergeron, L., Smolla, N.: The Dominic-R: a pictorial interview for 6- to11-year old
children. J. Am. Acad. Child Adolesc. Psychiatry 39, 85–93 (2000)
29. Vansteelandt, S., Goetghebeur, E.: Causal inference with generalized structural mean models.
J. R. Stat. Soc. Ser. B 65, 817–835 (2003)
30. Wu, P., Tu, X.M., Kowalski, J.: On assessing model fit for distribution-free longitudinal models
under missing data. Stat. Med. 33(1), 143–157 (2014)
31. Wu, P., Han, Y., Chen, T., Tu, X.M.: Causal inference for Mann-Whitney-Wilcoxon rank sum
and other nonparametric statistics. Stat. Med. 33(8), 1261–1271 (2014)
32. Yu, Q., Tang, W., Kowalski, J., Tu, X.M.: Multivariate U-Statistics: a tutorial with applications.
Wiley Interdiscip. Rev. Comput. Stat. 3, 457–471 (2011)
33. Yu, Q., Chen, R., Tang, W., He, H., Gallop, R., Crits-Christoph, P., Hu, J., Tu, X.M.:
Distribution-free models for longitudinal count responses with over-dispersion and structural
zeros. Stat. Med. 32, 2390–2405 (2013)
34. Zhang, H., Lu, N., Feng, C., Thurston, S.W., Xia, Y., Tu, X.M.: On fitting generalized linear
mixed-effects models for binary responses using different statistical packages. Stat. Med. 30,
2562–2572 (2011)
Part IV
Structural Equation Models
for Mediation Analysis
Chapter 13
Identification of Causal Mediation Models
with an Unobserved Pre-treatment Confounder
Ping He, Zhenguo Wu, Xiaohua Douglas Zhang, and Zhi Geng
1 Introduction
P. He • Z. Wu • Z. Geng
School of Mathematical Sciences, Peking University, Beijing 100871, China
e-mail: sunhp@pku.edu.cn; wuzhenguo@gmail.com; zhigeng@pku.edu.cn
X.D. Zhang ()
Faculty of Health Sciences, University of Macau, Macau, China
e-mail: douglaszhang@umac.mo
presented an approach for a binary mediator. All of these approaches use three
linear models which include three variables: a treatment, a mediator, and an
outcome variable, but do not include an unobserved confounder which affects
both the mediator and the outcome variables (i.e., a strong form of sequential
ignorability). Jo [6] compared two different mediation analysis approaches: the
structural equation modeling approach and the principal stratification model. The
former assumes that there is no confounder which affects both the mediator and
the dependent variable, that is, the ignorability assumption of the mediator status.
The latter assumes that the effect of treatment on the outcome is completely
mediated through the mediator, that is, no direct effect of treatment on the outcome,
also called the exclusion restriction assumption. VanderWeele [15] discussed the
estimation of direct and indirect effects under the assumptions of no unobserved
variable which confounds the treatment–outcome relationship or the mediator–
outcome relationship. Imai et al. [5] discussed the identification of causal mediation
effects under the sequential ignorability assumption, which is different from the no-unobserved-confounder assumption. Sobel [13] discussed the identification and
estimation of causal effects using an instrumental variable (IV) which satisfies
the exclusion restriction assumption. However the exclusion restriction assumption
means no direct treatment effect on the outcome variable which requires that all
treatment effects on the outcome variable are blocked by the mediator, and it
may be too strong in many real applications. For these approaches, the required
assumptions are untestable from observed data and may be very restrictive or
impractical in observational studies and even in experimental studies where only
the treatment assignments can be manipulated. Herting [4] and Kaufman et al. [7]
pointed out, respectively, that the models with and without direct effect of treatment
on the outcome are statistically indistinguishable and that the parameters are not
identifiable when there exists an unobserved confounder between the mediator
and the outcome. For models with unobserved confounders, Ten Have et al. [14]
presented an approach for estimating direct and indirect effects via G-estimation
equations which requires an additional covariate satisfying some conditions.
In this paper, we describe models of the outcome and the mediator which include
an unobserved pre-treatment confounder (i.e., a common cause of the mediator and
the outcome). For an experimental study of randomized treatment assignment or an
observational study where the assignment of treatment is ignorable conditionally
on observed covariates, we propose an approach for identifying parameters in the
models. Without requiring the sequential ignorability assumption or the exclusion
restriction assumption, we require that the degree of equation nonlinearity for the
treatment on the mediator is higher than that for the treatment on the outcome.
For example, the mediator model is nonlinear with respect to treatment, and the
outcome model is linear with respect to treatment. Especially when the mediator
is a binary variable and it has a logistic regression model, then the nonlinearity
condition may be generally satisfied. As an example, let a binary variable indicating whether an irregular heartbeat has been corrected serve as the mediator between a treatment variable and the outcome of survival time. The nonlinearity requirement can be considered a parametric, functional assumption on the model of the treatment effect on the mediator.
$$
Y = b_0 + b_1 M + b_2 X + \varepsilon_Y, \qquad M = \phi(X, \varepsilon_M), \tag{13.1}
$$
where $\phi(\cdot)$ is an arbitrary function (usually a linear function), and $\varepsilon_Y$ and $\varepsilon_M$ are two mutually independent random errors with means 0 and variances $\sigma_Y^2$ and $\sigma_M^2$, respectively [8, 9]. In the linear structural model, the model for $M$ is
$$
M = a_0 + a_1 X + \varepsilon_M.
$$
When an unobserved pre-treatment confounder $U$ (a common cause of the mediator and the outcome) is present, we consider instead the models
$$
Y = b_0 + b_1 M + b_2 X + \lambda(U, \varepsilon_Y), \tag{13.7}
$$
$$
M = \psi(X, U, \varepsilon_M), \tag{13.8}
$$
where $\lambda(\cdot)$ and $\psi(\cdot)$ are arbitrary functions, and $\varepsilon_Y$ and $\varepsilon_M$, with means 0, are independent of $(X, M, U)$ and $(X, U)$, respectively.
With the definitions and notation of Pearl [11], the average total effect $\tau(x, x')$ of treatment $X$ on outcome $Y$, the average controlled direct effect $\zeta_C(x, x'; m)$ of $X$ on $Y$ when controlling $M$, and the average natural direct effect $\zeta_N(x, x')$ of $X$ on $Y$ are defined for treatment levels $x$ versus $x'$, respectively, as
$$
\tau(x, x') = E\!\left(Y_x - Y_{x'}\right), \qquad
\zeta_C(x, x'; m) = E\!\left(Y_{xm} - Y_{x'm}\right), \qquad
\zeta_N(x, x') = E\!\left(Y_{x M_{x'}} - Y_{x' M_{x'}}\right),
$$
where $M_{x'}$ denotes the value of $M$ if $X$ were set to $x'$. The average controlled direct effect $\zeta_C(x, x'; m)$ is the average effect of $x$ versus $x'$ on $Y$ when the mediator $M$ is fixed at a value $m$; the average natural direct effect $\zeta_N(x, x')$ is the average effect of $x$ versus $x'$ on $Y$ when the mediator $M$ is fixed at the value $M_{x'}$ that would have been realized naturally under $X = x'$ [11, 16]. For model (13.1), the controlled direct effect $\zeta_C$ does not depend on $m$, and the controlled direct effect equals the natural direct effect, that is, $\zeta_C(x, x'; m) = \zeta_N(x, x')$, hereafter denoted $\zeta(x, x')$.
Pearl [11] defines the average natural indirect effect as
$$
\delta(x, x') = E\!\left(Y_{x M_x} - Y_{x M_{x'}}\right).
$$
It represents the average difference between the potential outcome $Y_x = Y_{x M_x}$ that would result under treatment status $x$ and the potential outcome $Y_{x M_{x'}}$ that would occur if the treatment status were the same and yet the mediator took the value $M_{x'}$ that would result under the other treatment status $x'$; this is also called the average causal mediation effect [5].
Since model (13.1) is linear, the average natural indirect effect $\delta(x, x')$ of $X$ on $Y$ equals the difference between the average total effect and the average direct effect.
Parameters $b_1$ and $b_2$ are called mediation effects in Jo [6]. For (13.7), the OLS estimates of the parameters $b_0$, $b_1$, and $b_2$ are inconsistent because $U$ is correlated with $M$. It will be shown in Sect. 3 that the parameters in (13.7) are not identifiable if the function $\psi(\cdot)$ in (13.8) is linear with respect to $X$, as is assumed in the traditional IV method.
To avoid mathematical complexity, we first consider model (13.7) and then extend the result to more general models, such as moderated-mediation and nonlinear direct effect models, in Sect. 6 without any essential difficulty. Suppose that treatment $X$ is a continuous variable or an ordinal discrete variable. Without loss of generality, assume that $E[\lambda(U, \varepsilon_Y)] = 0$. By randomization of $X$, we have that $X$ is independent of $(U, \varepsilon_Y)$, and then $E[\lambda(U, \varepsilon_Y) \mid X = x] = 0$. Hereafter let $E(\cdot \mid x)$ denote $E(\cdot \mid X = x)$ for simplicity. Thus, from model (13.7) we have $E(Y \mid x) = b_0 + b_1 E(M \mid x) + b_2 x$, and taking differences across treatment levels $x_2, \ldots, x_K$ versus $x_1$ gives
$$
\begin{bmatrix}
E(M \mid x_2) - E(M \mid x_1) & x_2 - x_1 \\
\vdots & \vdots \\
E(M \mid x_K) - E(M \mid x_1) & x_K - x_1
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
=
\begin{bmatrix}
E(Y \mid x_2) - E(Y \mid x_1) \\
\vdots \\
E(Y \mid x_K) - E(Y \mid x_1)
\end{bmatrix}. \tag{13.10}
$$
Let $A$ denote the $(K-1) \times 2$ matrix on the left-hand side. According to this equation, parameters $b_1$ and $b_2$ are identifiable if there exist $K$ ($K \ge 3$) different levels of treatment $X$ such that the matrix $A$ has full column rank.
If $M$ has a linear model $M = a_0 + a_1 X + a_2 U + \varepsilon_M$, as usually assumed in simultaneous equation models, or $X$ is a binary treatment, then the matrix $A$ is not of full column rank, and thus parameters $b_1$ and $b_2$ are not identifiable, although $a_0$ and $a_1$ can be identified via an ordinary method by treating $a_2 U + \varepsilon_M$ as an error term, since $X$ and $(U, \varepsilon_M)$ are independent. Alternatively, applying the IV method to model (13.7) yields only a single moment equation, so parameters $b_1$ and $b_2$ are again not identifiable: there are two parameters but only one equation.
Below we discuss the necessary and sufficient condition for identifiability of the parameters in model (13.7), and we show that the parameters are identifiable if and only if the conditional expectation $E(M \mid X)$ of $M$ given $X$ is not linear with respect to $X$. Trivially, $E(M \mid X)$ is linear with respect to $X$ when $X$ is binary.
Theorem 1. Assume that treatment $X$ is randomly assigned. Parameters $b_1$ and $b_2$ in model (13.7) are identifiable if and only if the conditional expectation of $M$ given $X$ is not linear with respect to $X$, that is, $|\rho(E(M \mid X), X)| < 1$, where $\rho(E(M \mid X), X)$ is the correlation coefficient of $E(M \mid X)$ and $X$.
The proof of Theorem 1 is given in Appendix 1. From Theorem 1, we immediately have the following corollary for a continuous or discrete $X$.
Corollary 1. Assume that treatment $X$ is randomly assigned. Parameters $b_1$ and $b_2$ in model (13.7) are identifiable if
1. for a continuous $X$, $\partial^2 E(M \mid x)/\partial x^2 \big|_{x = x_0} \ne 0$ for some $x_0$ in the support of the distribution of $X$, or
2. for a discrete $X$, there exist three levels $x_1$, $x_2$, and $x_3$ such that
$$
\frac{E(M \mid x_1) - E(M \mid x_2)}{x_1 - x_2} \ne \frac{E(M \mid x_2) - E(M \mid x_3)}{x_2 - x_3}.
$$
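The discrete-$X$ condition in Corollary 1 is checkable from data: the sketch below (with hypothetical observed vectors `M` and `X`) compares the slopes of the estimated $E(M \mid x)$ between adjacent treatment levels.

```r
m_bar  <- tapply(M, X, mean)           # estimates of E(M | X = x) at each level
x_lev  <- as.numeric(names(m_bar))
slopes <- diff(m_bar) / diff(x_lev)    # slopes between adjacent levels
slopes                                 # unequal slopes indicate the needed nonlinearity
```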
In the following two subsections, we discuss the identifiability via models of the
mediator M for the cases of a discrete or continuous M separately.
$$
\log \frac{P(M = 1 \mid x, u)}{1 - P(M = 1 \mid x, u)} = \alpha_0 + \alpha_1 x + \alpha_2 u.
$$
Then parameters $a_1$ and $a_2$ can be estimated without bias via an ordinary method by treating the term involving $(U, \varepsilon_M)$ as an error term, since $X$ is independent of $(U, \varepsilon_M)$.
First we use a simple example to show how the covariate $Z$ can be used to identify the parameters in (13.12). Suppose that the model of $M$ has an interaction of $X$ and $Z$, for example
$$
E(M \mid X, Z) = a_0 + a_1 X + a_2 Z + a_3 X Z,
$$
and then the nonlinearity condition required for identifiability in Theorem 1 does not hold. Thus the parameters in (13.12) cannot be identified. Below we show how to use the covariate $Z$ to identify the parameters. If treatment $X$ is randomly assigned, conditionally on $Z$ or not, then we have $X \perp (U, \varepsilon_Y) \mid Z$, and we obtain from (13.12)
$$
\begin{bmatrix}
E(M \mid x_1, z_1) - E(M \mid x_1', z_1) & x_1 - x_1' \\
E(M \mid x_2, z_2) - E(M \mid x_2', z_2) & x_2 - x_2' \\
\vdots & \vdots \\
E(M \mid x_K, z_K) - E(M \mid x_K', z_K) & x_K - x_K'
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
=
\begin{bmatrix}
E(Y \mid x_1, z_1) - E(Y \mid x_1', z_1) \\
E(Y \mid x_2, z_2) - E(Y \mid x_2', z_2) \\
\vdots \\
E(Y \mid x_K, z_K) - E(Y \mid x_K', z_K)
\end{bmatrix}. \tag{13.14}
$$
4 Estimation of Parameters
The identifiability discussed in the previous section requires that the distribution of the observed variables carries sufficient information about the parameters. After confirming identifiability, we can use various estimation approaches, such as moment estimation, or maximum likelihood estimation if we can assume parametric models for $\lambda(U, \varepsilon_Y)$ and $\psi(X, U, \varepsilon_M)$ and the distributions of the random errors $\varepsilon_Y$ and $\varepsilon_M$. In this section, we seek an efficient estimator of the parameters in the semi-parametric model (13.7).
In our estimation approach, the pivotal condition is the independence between the randomized treatment $X$ and $(U, \varepsilon_M, \varepsilon_Y)$, which implies the following equation:
$$
E\!\left[(Y - b_0 - b_1 M - b_2 X)\, \mathbf{f}(X)\right] = \mathbf{0}, \tag{13.15}
$$
where $\mathbf{f}(\cdot) = (f_1(\cdot), \ldots, f_K(\cdot))'$ is an arbitrary vector function and $\mathbf{0}$ is a $K \times 1$ zero vector.
In the following two subsections, we first present a simple but efficient estimator for the case of a three-value treatment $X$, and then we describe a GMM estimator with the efficient instrument (a function of $X$) for the case of a general treatment $X$, proposed by Newey and McFadden [10], which has the minimum variance among all estimators satisfying Eq. (13.15).
Define $\beta = (b_0, b_1, b_2)'$,
$$
G = \begin{bmatrix}
E[\delta(X=1)] & E[M\,\delta(X=1)] & E[X\,\delta(X=1)] \\
E[\delta(X=2)] & E[M\,\delta(X=2)] & E[X\,\delta(X=2)] \\
E[\delta(X=3)] & E[M\,\delta(X=3)] & E[X\,\delta(X=3)]
\end{bmatrix}, \qquad
H = \begin{bmatrix}
E[Y\,\delta(X=1)] \\
E[Y\,\delta(X=2)] \\
E[Y\,\delta(X=3)]
\end{bmatrix}.
$$
The estimator is
$$
\hat\beta = \widehat G^{-1} \widehat H,
$$
where the elements of $\widehat G$ and $\widehat H$ are the sample means of the corresponding elements of $G$ and $H$. Thus $\hat\beta$ is a valid estimator only when $G$ has full rank. Now we show that the nonlinearity of $E(M \mid X)$ with respect to $X$ ensures that $G$ has full rank: the determinant of the matrix $G$ is nonzero precisely when $E(M \mid X)$ is not linear in $X$, and the resulting estimator attains the same asymptotic variance as that obtained by the efficient instrument $\mathbf{f}(X)$. Thus, for a three-value treatment, our estimator is efficient, and it is not necessary to choose a complicated $\mathbf{f}(\cdot)$ to improve the efficiency.
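A direct sample-analogue implementation of $\hat\beta = \widehat G^{-1}\widehat H$ for a three-valued treatment is sketched below; `Y`, `M`, and `X` denote the observed vectors.

```r
est_three_level <- function(Y, M, X) {
  lev <- sort(unique(X))                     # assumes X takes exactly three values
  G_hat <- t(sapply(lev, function(l) {
    d <- as.numeric(X == l)                  # indicator delta(X = l)
    c(mean(d), mean(M * d), mean(X * d))
  }))
  H_hat <- sapply(lev, function(l) mean(Y * (X == l)))
  solve(G_hat, H_hat)                        # beta_hat = (b0, b1, b2)
}
```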
Different from the case of a three-value treatment, for a general treatment $X$ with more values, different choices of $\mathbf{f}(\cdot)$ in (13.15) lead to different estimators. In this section, we derive a GMM estimator with the efficient instrument proposed in [10]. Equation (13.15) can be rewritten as
$$
\begin{bmatrix}
E[f_1(X)] & E[M f_1(X)] & E[X f_1(X)] \\
\vdots & \vdots & \vdots \\
E[f_K(X)] & E[M f_K(X)] & E[X f_K(X)]
\end{bmatrix} \beta
=
\begin{bmatrix}
E[Y f_1(X)] \\
\vdots \\
E[Y f_K(X)]
\end{bmatrix}. \tag{13.17}
$$
Let $G$ denote the $K \times 3$ matrix on the left-hand side of Eq. (13.17) and $H$ the vector on the right-hand side, so that Eq. (13.17) can be written as $G\beta = H$.
Define $m(\beta) = E[(Y - b_0 - b_1 M - b_2 X)\,\mathbf{f}(X)]$. Note that $\beta$ can be identified only when $r(G) = 3$, where $r(\cdot)$ denotes the rank of a matrix. Then, for any $\mathbf{f}(\cdot)$ that makes $r(G) = 3$, a GMM estimate of $\beta$ is
$$
\hat\beta = \operatorname*{arg\,min}_{\beta}\left\{\hat m(\beta)'\, \widehat W\, \hat m(\beta)\right\}
= \left(\widehat G{}' \widehat W \widehat G\right)^{-1} \widehat G{}' \widehat W \widehat H, \tag{13.18}
$$
where $\widehat W$ is a weighting matrix and $\hat m(\cdot)$, $\widehat G$, and $\widehat H$ are the sample analogues of $m(\cdot)$, $G$, and $H$. With the efficient instrument $\mathbf{f}_{\mathrm{eff}}(X)$ and the corresponding $\widehat G_{\mathrm{eff}}$ and $\widehat H_{\mathrm{eff}}$,
$$
\hat\beta_{\mathrm{eff}} = \left[(\widehat G_{\mathrm{eff}})' \widehat W \widehat G_{\mathrm{eff}}\right]^{-1} (\widehat G_{\mathrm{eff}})' \widehat W \widehat H_{\mathrm{eff}}
= (\widehat G_{\mathrm{eff}})^{-1} \widehat W^{-1} \left[(\widehat G_{\mathrm{eff}})'\right]^{-1} (\widehat G_{\mathrm{eff}})' \widehat W \widehat H_{\mathrm{eff}}
= (\widehat G_{\mathrm{eff}})^{-1} \widehat H_{\mathrm{eff}}, \tag{13.19}
$$
with asymptotic variance
$$
V(\hat\beta_{\mathrm{eff}}) = (G_{\mathrm{eff}})^{-1}\, E\!\left[\mathbf{f}_{\mathrm{eff}}(X)\,\mathbf{f}_{\mathrm{eff}}(X)'\right] \left[(G_{\mathrm{eff}})'\right]^{-1} \sigma_Y^2.
$$
An estimate of $V(\hat\beta_{\mathrm{eff}})$ is obtained by plugging in $\widehat G_{\mathrm{eff}}$ and $\hat\sigma^2_{\mathrm{res}}$, the sample variance of the residuals of the linear model (13.7).
For the case of a three-value treatment, as shown in the previous subsection, we have $\hat\beta = \hat\beta_{\mathrm{eff}}$. From property 2 of the GMM estimator, the variance of $\hat\beta$ can be estimated by
$$
\widehat V(\hat\beta) = \widehat G^{-1}\, \widehat E\!\left[\mathbf{f}(X)\,\mathbf{f}(X)'\right] \left(\widehat G{}'\right)^{-1} \hat\sigma^2_{\mathrm{res}},
$$
where
$$
\widehat E\!\left[\mathbf{f}(X)\,\mathbf{f}(X)'\right] =
\begin{bmatrix}
\widehat P(X=1) & 0 & 0 \\
0 & \widehat P(X=2) & 0 \\
0 & 0 & \widehat P(X=3)
\end{bmatrix}
$$
and $\widehat P(X = i)$ is estimated by the observed frequency of $X = i$.
5 Simulation Study
In this section, we compare our estimates with the OLS estimates via simulations. In our simulations, data are generated from the causal diagram depicted in Fig. 13.1, and the underlying model is
$$
Y = b_1 M + b_2 X + dU + \varepsilon_Y, \qquad
\operatorname{logit} P(M = 1 \mid X, U) = 1 + 3X + cU,
$$
that is, with $b_1 = 1$ and $b_2 = 0.6$,
$$
Y = M + 0.6X + dU + \varepsilon_Y.
$$
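A sketch of one replicate of such a simulation is given below. The distributions of $X$, $U$, and $\varepsilon_Y$ are not fully reported in the text; here $X$ is taken uniform on $\{-1, 0, 1\}$ and $U, \varepsilon_Y \sim N(0, 1)$ purely for illustration, and `est_three_level()` is the moment estimator sketched in Sect. 4.1.

```r
set.seed(1)
n <- 600; c_par <- 2; d_par <- 2
X <- sample(c(-1, 0, 1), n, replace = TRUE)        # assumed three-level treatment
U <- rnorm(n)                                      # unobserved confounder
M <- rbinom(n, 1, plogis(1 + 3 * X + c_par * U))   # logistic mediator model
Y <- M + 0.6 * X + d_par * U + rnorm(n)            # b1 = 1, b2 = 0.6
coef(lm(Y ~ M + X))                                # OLS: biased when c_par * d_par != 0
est_three_level(Y, M, X)                           # moment estimator of (b0, b1, b2)
```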
Table 13.1 Means of estimates over 1000 simulations (true values $b_1 = 1.0$ and $b_2 = 0.6$); $c = 0$ or $d = 0$ means that $U$ is not a confounder

                   N = 300                                  N = 600
              OLS estimates     Our estimates          OLS estimates     Our estimates
              b̃1      b̃2        b̂1      b̂2             b̃1      b̃2        b̂1      b̂2
c = 0  d = −2  0.999   0.601     0.994   0.602          0.995   0.602     0.996   0.602
       d = −1  0.999   0.601     0.996   0.602          0.996   0.601     0.996   0.602
       d = 0   1.000   0.600     0.998   0.601          0.997   0.601     0.994   0.602
       d = 1   1.001   0.600     1.001   0.600          0.998   0.600     0.993   0.602
       d = 2   1.001   0.599     1.003   0.600          0.998   0.599     0.992   0.601
c = 2  d = −2  0.700   0.693     1.008   0.598          0.698   0.694     1.005   0.599
       d = −1  0.849   0.647     1.003   0.600          0.848   0.647     1.000   0.600
       d = 0   0.999   0.600     0.997   0.601          0.998   0.600     0.994   0.602
       d = 1   1.148   0.554     0.991   0.602          1.148   0.554     0.989   0.603
       d = 2   1.297   0.508     0.985   0.604          1.299   0.507     0.983   0.604
c = 4  d = −2  0.443   0.767     1.072   0.582          0.443   0.768     1.017   0.596
       d = −1  0.721   0.684     1.035   0.591          0.720   0.684     1.006   0.599
       d = 0   0.998   0.601     0.997   0.601          0.997   0.601     0.993   0.602
       d = 1   1.276   0.517     0.959   0.610          1.274   0.517     0.979   0.605
       d = 2   1.553   0.434     0.921   0.620          1.551   0.433     0.967   0.608
Comparing the results for two different sample sizes, we can see for the larger size
N D 600 that our estimates are closer to the true values and have smaller standard
errors but that the OLS estimates become even worse, have lower coverage rates,
and do not reduce the biases. We also did simulations for other sample sizes and obtained similar results.
6 Extension
In the previous sections, we discussed the model (13.7) of Y which is linear with
respect to M and X. These results can be extended to more general models, such
as the presence of an interaction term or a nonlinear direct effect of treatment on
outcome in (13.1). First we consider the model of moderated-mediation analysis
which has an interaction of X and M on Y as an example to illustrate the extension.
Consider the following moderated-mediation model:
Table 13.2 Coverage rates of 95 % confidence intervals (estimated standard deviations in brackets) for 1000 simulations

                   N = 300                                      N = 600
              OLS estimates       Our estimates            OLS estimates       Our estimates
              b̃1       b̃2         b̂1       b̂2              b̃1       b̃2         b̂1       b̂2
c = 0  d = −2  0.955    0.943      0.985    0.966           0.957    0.950      0.966    0.967
              (0.135)  (0.067)    (0.378)  (0.128)         (0.095)  (0.047)    (0.255)  (0.088)
       d = −1  0.945    0.938      0.982    0.969           0.956    0.959      0.968    0.959
              (0.085)  (0.042)    (0.240)  (0.081)         (0.060)  (0.030)    (0.161)  (0.055)
       d = 0   0.957    0.953      0.984    0.974           0.957    0.950      0.967    0.956
              (0.060)  (0.030)    (0.170)  (0.058)         (0.043)  (0.021)    (0.114)  (0.039)
       d = 1   0.952    0.943      0.979    0.963           0.953    0.952      0.976    0.962
              (0.085)  (0.042)    (0.240)  (0.082)         (0.060)  (0.030)    (0.161)  (0.056)
       d = 2   0.950    0.950      0.979    0.971           0.957    0.954      0.976    0.962
              (0.134)  (0.067)    (0.379)  (0.129)         (0.095)  (0.047)    (0.255)  (0.088)
c = 2  d = −2  0.368    0.701      0.984    0.970           0.088    0.459      0.970    0.970
              (0.131)  (0.065)    (0.426)  (0.141)         (0.092)  (0.046)    (0.283)  (0.095)
       d = −1  0.548    0.776      0.983    0.974           0.265    0.638      0.972    0.961
              (0.083)  (0.042)    (0.270)  (0.089)         (0.059)  (0.029)    (0.179)  (0.060)
       d = 0   0.952    0.945      0.988    0.979           0.949    0.950      0.971    0.962
              (0.059)  (0.030)    (0.192)  (0.063)         (0.042)  (0.021)    (0.126)  (0.042)
       d = 1   0.565    0.802      0.982    0.974           0.269    0.626      0.977    0.967
              (0.083)  (0.042)    (0.271)  (0.089)         (0.059)  (0.029)    (0.179)  (0.060)
       d = 2   0.388    0.708      0.978    0.972           0.092    0.485      0.982    0.967
              (0.131)  (0.065)    (0.428)  (0.141)         (0.092)  (0.046)    (0.284)  (0.095)
c = 4  d = −2  0.005    0.227      0.981    0.968           0        0.031      0.971    0.973
              (0.121)  (0.062)    (0.758)  (0.224)         (0.086)  (0.044)    (0.373)  (0.118)
       d = −1  0.054    0.437      0.987    0.973           0.002    0.149      0.975    0.970
              (0.078)  (0.040)    (0.469)  (0.139)         (0.055)  (0.028)    (0.235)  (0.074)
       d = 0   0.952    0.941      0.995    0.988           0.941    0.947      0.982    0.970
              (0.056)  (0.029)    (0.305)  (0.092)         (0.040)  (0.020)    (0.165)  (0.052)
       d = 1   0.060    0.451      0.989    0.982           0        0.169      0.985    0.970
              (0.078)  (0.040)    (0.434)  (0.132)         (0.055)  (0.028)    (0.236)  (0.074)
       d = 2   0.003    0.239      0.982    0.980           0        0.033      0.984    0.973
              (0.121)  (0.062)    (0.724)  (0.217)         (0.086)  (0.044)    (0.374)  (0.118)
Taking expectations conditional on $X = x_k$ for $k = 1, \ldots, K$ yields
$$
\begin{bmatrix}
1 & E(M \mid x_1) & x_1 & x_1 E(M \mid x_1) \\
\vdots & \vdots & \vdots & \vdots \\
1 & E(M \mid x_K) & x_K & x_K E(M \mid x_K)
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix}
E(Y \mid x_1) \\
\vdots \\
E(Y \mid x_K)
\end{bmatrix}.
$$
Both $E(M \mid x_i)$ and $E(Y \mid x_i)$ can be estimated from the data by a parametric or nonparametric approach. The parameters $b_i$ are identifiable if $K \ge 4$ and the $K \times 4$ matrix on the left-hand side has full column rank. It can be shown that the matrix has full column rank if and only if $E(M \mid x)$ is not a linear function of $x$, which is the same condition as in Theorem 1. Notice that this identifiability condition can also be checked from the observed data. It is obvious that, under the commonly used assumption of a linear regression of $M$ on $X$ in the simultaneous equation model, we cannot identify these parameters in mediation models.
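Given estimates of $E(M \mid x_k)$ and $E(Y \mid x_k)$ at $K \ge 4$ levels, the parameters of the moderated-mediation model can be recovered by a least-squares solve of the system above; the sketch below uses level-specific sample means.

```r
x_lev <- sort(unique(X))
EM <- tapply(M, X, mean)                 # estimated E(M | x_k)
EY <- tapply(Y, X, mean)                 # estimated E(Y | x_k)
A  <- cbind(1, EM, x_lev, x_lev * EM)    # K x 4 matrix from the display above
qr.solve(A, EY)                          # least-squares estimates of (b0, b1, b2, b3)
```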
Next we consider the case of a nonlinear direct effect of treatment on the outcome. For example, consider a quadratic function of $X$:
$$
Y = b_0 + b_1 M + b_2 X + b_3 X^2 + \lambda(U, \varepsilon_Y).
$$
In this case, the model of $M$ must have a higher degree of nonlinearity in $X$, for example
$$
M = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \psi(X, U, \varepsilon_M),
$$
and we need to manipulate the treatment $X$ at $K$ ($\ge 4$) levels. The higher-degree nonlinearity of the treatment effect on the mediator $M$ can be used to distinguish the indirect effect of treatment on the outcome from the lower-degree direct effect of treatment on the outcome. This essentially means that the rates of change of the outcome through the direct arrow $X \to Y$ and through the path $X \to M \to Y$ are different, and thus we can separate the direct effect from the indirect effect.
Similarly, for a more general model of $Y$, identification of the parameters requires that the treatment $X$ have a number $K$ of levels at least as large as the number of parameters in the model of $Y$, and that the degree of nonlinearity of the treatment effect on the mediator be higher than that of the direct effect of treatment on the outcome, so that the equations for the expectations of $Y$ conditional on these levels have a unique solution for the parameters.
7 Discussions
Existing approaches, such as the instrumental variable approach, require the exclusion restriction assumption, and the experimental approach requires that the mediator be manipulable. When a mediation model has a single mediator, it is difficult to satisfy the exclusion restriction assumption. The sequential ignorability assumption is hardly ever satisfied even if the mediator could be manipulated, and the manipulation experiment of the mediator may not be practical in many applications.
Removing these untestable assumptions, the approach proposed in this paper instead requires that the regression of the mediator on the treatment variable be nonlinear; otherwise a covariate is necessary for identifiability. For the case of a binary mediator, a logistic regression equation is commonly used, and the nonlinearity requirement is then generally satisfied. The important difference between our nonlinearity requirement and the assumptions of other approaches is that the nonlinearity of the mediator M with respect to the treatment X is testable from the observed data on M and X, while the assumptions required by other approaches are untestable from observed data. This testability is an advantage of our approach.
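Because the nonlinearity requirement involves only the observable pair (M, X), it can be probed with a standard specification check. The sketch below (simulated data with made-up effect sizes; one possible check, not a prescribed procedure) compares a linear regression of M on X with a saturated model that fits a separate mean at each treatment level.

```r
set.seed(2)
x <- sample(0:4, 2000, replace = TRUE)            # treatment with several observed levels
m <- 0.5 + 0.8 * x + 0.3 * x^2 + rnorm(2000)      # mediator with a nonlinear dose-response in x

fit_lin <- lm(m ~ x)                              # linear regression of M on X
fit_sat <- lm(m ~ factor(x))                      # saturated model: one mean per treatment level
anova(fit_lin, fit_sat)                           # a small p-value suggests E(M | x) is not linear in x
```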
When the nonlinearity required in Theorem 1 is not satisfied, we may try to find a covariate Z such that the slope of the regression of M on X depends on Z; see Theorem 2. This covariate is essentially used to induce a nonlinearity between the treatment and the mediator, so that the effects of treatment on the mediator differ across levels of the covariate. In a sense, the covariate Z requires a model assumption much like an instrumental variable, so that we can remove the confounding bias generated by an unobserved confounder. Our approach may be more realistic for observational studies and for experimental studies in which we cannot manipulate the mediator. For a purely observational study in which the treatment X cannot be manipulated, we also need the assumptions commonly used for causal inference, such as the ignorability assumption for the treatment assignment.
We separately show the necessity and sufficiency of the condition for the identifiability of the parameters in model (13.7). For necessity, suppose that the nonlinearity condition does not hold, that is, |ρ(E(M|X), X)| = 1. This implies that there exist some a0 and a1 satisfying E(M|X) = a1 X + a0 almost everywhere. Then from model (13.7) we have

E(Y|X) = b0 + b1 E(M|X) + b2 X = (b0 + a0 b1) + (a1 b1 + b2) X.
The above equation implies that Y is marginally linear with respect to X. For this linear model, only the intercept (b0 + a0 b1) and the slope (a1 b1 + b2) are identifiable as a whole, while the parameters b0, b1, and b2 cannot be distinguished from each other. For sufficiency, if M is not marginally linear with respect to X, then we can find three levels x1, x2, and x3 which satisfy [E(M|x1) − E(M|x2)]/(x1 − x2) ≠ [E(M|x2) − E(M|x3)]/(x2 − x3). Hence the matrix in (13.10) has full rank. Thus the parameters b1 and b2 can be identified, and then the parameter b0 can be identified from b0 = E(Y|xi) − b1 E(M|xi) − b2 xi.
For sufficiency, when E(M|X, Z) ≠ cX + g(Z), there are two situations: (i) E(M|X, Z) = Ψ(X) + g(Z), where Ψ(·) is a nonlinear function of X; (ii) E(M|X, Z) is not additive with respect to X and Z.
For situation (i), since Ψ(·) is not a linear function, we can choose three levels of X (say x1, x2, x3) and some z satisfying [E(M|x1, z) − E(M|x2, z)]/(x1 − x2) ≠ [E(M|x2, z) − E(M|x3, z)]/(x2 − x3). Then the following equation from model (13.12) has a unique solution because the coefficient matrix has full rank:

\[
\begin{pmatrix}
E(M\mid x_1, z) - E(M\mid x_2, z) & x_1 - x_2 \\
E(M\mid x_2, z) - E(M\mid x_3, z) & x_2 - x_3
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix}
E(Y\mid x_1, z) - E(Y\mid x_2, z) \\
E(Y\mid x_2, z) - E(Y\mid x_3, z)
\end{pmatrix}.
\]
where Φ(Z) = b1 g(Z) + E[h(U, Z, εY) | Z]. We can easily see that only c, b1 c + b2, and Φ(Z) can be identified given observed data on (Z, X, M, Y); b1 and b2 cannot be identified because (1) E(M|X, Z) is linear with respect to X, and (2) E[h(U, Z, εY) | Z] cannot be identified since U and εY are never observed. Thus the parameters in model (13.12) are identifiable only if E(M|X, Z) ≠ cX + g(Z).
We want to show that an arbitrary vector function f(·) that identifies β via Eq. (13.15) leads to the same estimator as that based on the function f*(·). An arbitrary vector function f(·) = (f1(·), f2(·), ..., fK(·))′ (K > 2) can be written as

\[
f(X) =
\begin{pmatrix}
f_1(1) & f_1(2) & f_1(3) \\
\vdots & \vdots & \vdots \\
f_K(1) & f_K(2) & f_K(3)
\end{pmatrix}
\begin{pmatrix}
\delta(X=1) \\ \delta(X=2) \\ \delta(X=3)
\end{pmatrix}.
\tag{13.21}
\]

Let Q denote the K × 3 matrix on the right-hand side. Equation (13.15) can be rewritten as Gβ = H, where G = E[f(X), Mf(X), Xf(X)] and H = E[Yf(X)]. Then the estimating equation for β is Ĝβ̂ = Ĥ. From (13.21), we have

Ĝ = Ê[f(X), Mf(X), Xf(X)] = Ê[Qf*(X), MQf*(X), XQf*(X)] = Q Ê[f*(X), Mf*(X), Xf*(X)] = Q Ĝ*,

where Ê(·) denotes the sample mean of the corresponding variable. Similarly, we have Ĥ = Q Ĥ*. Hence

Ĝβ̂ − Ĥ = Q(Ĝ*β̂ − Ĥ*) = 0.
(ii) We prove that Geff has full rank when the nonlinearity condition in Theorem 1 holds. To prove that Geff has full rank, we only need to show that det(Geff) ≠ 0 when |ρ(X, E(M|X))| < 1. We have

\[
\det(G_{\mathrm{eff}}) =
\begin{vmatrix}
1 & E(M) & E(X) \\
E(M) & E[E(M\mid X)^2] & E(XM) \\
E(X) & E(XM) & E(X^2)
\end{vmatrix}
=
\begin{vmatrix}
1 & E(M) & E(X) \\
0 & E[E(M\mid X)^2] - [E(M)]^2 & E(XM) - E(X)E(M) \\
0 & E(XM) - E(X)E(M) & E(X^2) - [E(X)]^2
\end{vmatrix}.
\]

Since E[E(M|X)²] − [E(M)]² = var[E(M|X)] and E(XM) − E(X)E(M) = E[X E(M|X)] − E(X)E[E(M|X)] = cov(X, E(M|X)), we have

\[
\det(G_{\mathrm{eff}}) =
\begin{vmatrix}
1 & E(M) & E(X) \\
0 & \operatorname{var}[E(M\mid X)] & \operatorname{cov}(X, E(M\mid X)) \\
0 & \operatorname{cov}(X, E(M\mid X)) & \operatorname{var}(X)
\end{vmatrix}
= \operatorname{var}[E(M\mid X)]\operatorname{var}(X) - [\operatorname{cov}(X, E(M\mid X))]^2,
\]

which is positive when |ρ(X, E(M|X))| < 1.
Chapter 14
A Comparison of Potential Outcome Approaches
for Assessing Causal Mediation
Electronic supplementary material The online version of this chapter (doi: 10.1007/
978-3-319-41259-7_14) contains supplementary material, which is available to authorized users.
Authors’ note: Preparation of this article was supported by NIDA Center Grant P50 DA100075-
15 and NIDA R01 DA09757. The content is solely the responsibility of the authors and does not
necessarily represent the official views of the National Institute on Drug Abuse (NIDA) or the
National Institutes of Health (NIH).
D.L. Coffman ()
The Methodology Center, Pennsylvania State University, 404 Health and Human Development
Building, University Park, PA 16802, USA
e-mail: dlc30@psu.edu
D.P. MacKinnon
Department of Psychology, Arizona State University, Tempe, AZ 85281, USA
Y. Zhu
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
N2L 3G1
D. Ghosh
Department of Biostatistics and Informatics, University of Colorado, Aurora, CO 80045, USA
In the potential outcomes framework (see [8–10]), each individual has a potential
outcome for each possible treatment condition, namely the value of the outcome
that would have occurred had the individual received the given treatment condition.
For simplicity, consider a binary treatment indicator, Ti, where Ti = 1 denotes the intervention condition and Ti = 0 denotes the control condition for participant i, i = 1, ..., n. The potential outcome if the individual receives the intervention is
denoted Yi (1), and the potential outcome if the individual is in the control condition
is denoted Yi (0). The individual causal effect is the difference between these two
potential outcomes. Because each participant is observed in only one condition, only
one of these potential outcomes is observed; the other is missing and, therefore,
the individual causal effect cannot be computed. However, strategies have been
implemented to estimate the causal effect averaged over participants in the study.
This average causal effect (ACE) is defined as E[Yi(1) − Yi(0)]; that is, the expected
(or average) difference between the two potential outcomes. Information on the
potential outcomes framework outside of the context of mediation is provided by
Little and Rubin [11], Schafer and Kang [12], and Winship and Morgan [13].
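A small simulation makes these definitions concrete. In the R sketch below (all values hypothetical), both potential outcomes are generated for every individual, so the true ACE can be computed directly and compared with the simple difference in observed group means under randomization.

```r
set.seed(3)
n    <- 10000
y0   <- rnorm(n, mean = 1)                   # potential outcome under control, Y(0)
y1   <- y0 + 0.5 + rnorm(n, sd = 0.5)        # potential outcome under intervention, Y(1)
t    <- rbinom(n, 1, 0.5)                    # randomized treatment assignment
yobs <- ifelse(t == 1, y1, y0)               # only one potential outcome is observed per person

mean(y1 - y0)                                # true ACE, knowable only because both are simulated
mean(yobs[t == 1]) - mean(yobs[t == 0])      # estimate of the ACE from the observed data
```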
Extending the potential outcomes framework to mediation is more complicated
because a mediator is an outcome of the intervention and, therefore, there are also
potential values for the mediator under each treatment condition for each individual.
The potential mediator under the intervention condition is denoted Mi (1), and the
potential mediator under the control condition is denoted Mi (0). The notation for
the potential outcomes is then expanded to include the potential mediators; this
notation is referred to as nested potential outcomes. Thus, Yi (1,Mi (1)) is the potential
outcome if individual i receives the intervention and the potential mediator takes
on the value that would have been obtained had they received the intervention;
and Yi (0,Mi (0)) is the potential outcome if individual i is in the control condition
and the potential mediator takes on the value that would have been obtained had
they been in the control condition. There are two other potential outcomes that
can never be realized in practice and illustrate the challenge of identifying causal
mediation effects. These two potential outcomes are needed to define the natural
effects and correspond to Yi (1,Mi (0)), the potential outcome if individual i receives
the intervention and has the potential value of the mediator that would have been
obtained had they been in the control condition, and Yi (0,Mi (1)), the potential
outcome if individual i is in the control condition and has the potential value of
the mediator that would have been obtained had they received the intervention. The
impossibility of ever observing these two potential outcomes is one of the reasons
that causal mediation analysis is controversial.
Throughout the article, we use Yi to denote the observed value of the outcome, Mi
to denote the observed value for the mediator, and Yi (t,Mi (t)) to denote the potential
outcomes where t is one of the levels of treatment. We use X0 to denote mea-
sured baseline (i.e., pre-treatment) confounders. We assume throughout that if an
individual receives the intervention, then Yi = Yi(1) = Yi(1,Mi(1)) and Mi = Mi(1).
There are several different definitions of mediation within the potential outcomes
framework: natural effects, controlled effects, and principal strata effects. Before
defining the effects using the potential outcomes framework, we define the effects
as they have been traditionally defined in the social science literature. Briefly, in
the social science literature, mediation has traditionally been assessed by fitting two
linear regression models: one for the mediator,
E[M | T = t] = β0M + β1 t,    (14.1)

and one for the outcome,

E[Y | T = t, M = m] = β0Y + β2 t + β3 m.    (14.2)

The direct effect is defined as β2, and the indirect effect is defined as the product of β1 and β3. Note that these definitions do not involve counterfactuals, as the models presented above are models for the observed mediator and outcome. These effects may be interpreted as causal effects only under certain assumptions, to be discussed in the Identification section below.
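For reference, a minimal R sketch of this traditional two-regression approach is given below; the data-generating values are hypothetical, and the product-of-coefficients quantity it computes can be interpreted causally only under the assumptions discussed later.

```r
set.seed(4)
n <- 1000
t <- rbinom(n, 1, 0.5)
m <- 0.4 * t + rnorm(n)                  # mediator model, beta1 = 0.4
y <- 0.3 * t + 0.5 * m + rnorm(n)        # outcome model, beta2 = 0.3, beta3 = 0.5

fit_m <- lm(m ~ t)                       # Eq. (14.1)
fit_y <- lm(y ~ t + m)                   # Eq. (14.2)

c(direct   = unname(coef(fit_y)["t"]),                     # estimate of beta2
  indirect = unname(coef(fit_m)["t"] * coef(fit_y)["m"]))  # product beta1 * beta3
```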
Principal Strata Effects Principal stratification [16–18] was initially developed to
handle non-compliance in intervention studies; recognizing that actual receipt of
Note that the principal strata effects do not rely on nested potential outcomes
of the form, Yi (t,Mi (t)). Principal strata effects rely only on the potential outcomes,
Yi (0), Yi (1), Mi (1), and Mi (0). Thus, principal strata effects do not rely on Yi (1,Mi (0))
or Yi (0,Mi (1)), which cannot be realized for any individual. This focus on only
possible potential outcomes is both a strength and a limitation of this approach;
we return to this point later.
Natural Effects Natural direct effects (NDEs) are defined by setting the mediator
to one of its potential values and changing the intervention status. One NDE of
interest, E[Yi(1,Mi(0)) − Yi(0,Mi(0))], often called the pure NDE (e.g., [19]), defines
a causal effect of the intervention on the outcome when the mediator is held to the
value that would have been obtained had the individual not received the intervention
(i.e., the effect of the intervention on the outcome if the intervention did not cause
a change in the mediator or if the effect of the intervention on the mediator was
in some way blocked). Additionally, E[Yi(1,Mi(1)) − Yi(0,Mi(1))], sometimes called
the total NDE, defines a causal effect of the intervention on the outcome when
the mediator is held to the value that would have been obtained had the individual
received the intervention (i.e., the effect of the intervention on the outcome if
absence of the intervention did not prevent a change in the mediator). Note that since
each individual’s set of potential mediators may be unique, setting the mediator to
one of the potential mediators (i.e., Mi (0) or Mi (1)) is not equivalent to setting the
mediator to a given value of the mediator m. In other words, the value at which the
mediator is set can be different for every individual. We will denote pure NDE and
the total NDE as NDEM(0) and NDEM(1) , respectively, where the subscript indicates
the potential value the mediator is set to.
Natural indirect effects (NIEs) are defined by setting the intervention condition
and changing the values of the potential mediator, E[Yi(1,Mi(1)) − Yi(1,Mi(0))] or E[Yi(0,Mi(1)) − Yi(0,Mi(0))]. The former, sometimes referred to as the total NIE,
defines the causal effect of receiving the intervention and having the value on the
mediator that would be obtained under the intervention versus having the value on
the mediator that would be obtained under the control condition; in other words, the
effect of the intervention due to intervention-induced changes in the mediator. The
latter, sometimes referred to as the pure NIE, defines the causal effect of receiving
the control condition and having the value on the mediator that would be obtained
under the intervention condition versus having the value on the mediator that
would be obtained under the control condition. Note that again, these effects are
defined with respect to potential mediators rather than a specific observed value
of the mediator. Therefore, the value of the potential mediators may differ across
individuals. We will denote the two NIEs as NIE1 and NIE0 , where the subscript
denotes the value to which the intervention status is set.
Note that for NDEs and NIEs, there is an effect for each level of the inter-
vention. For example, in the case of a binary treatment, there are two NDEs and
two NIEs. The TE, defined as E[Y(1) − Y(0)] = E[Yi(1,Mi(1)) − Yi(0,Mi(0))], can be decomposed into E[Yi(1,Mi(1)) − Yi(1,Mi(0))] + E[Yi(1,Mi(0)) − Yi(0,Mi(0))] or E[Yi(0,Mi(1)) − Yi(0,Mi(0))] + E[Yi(1,Mi(1)) − Yi(0,Mi(1))]. That is, the TE is the
sum of the total NIE, NIE1 , and the pure NDE, NDEM(0) ; or of the pure NIE, NIE0 ,
and the total NDE, NDEM(1) . The terms pure and total refer to whether interaction
effects are included with the direct or indirect effect. Specifically, pure means that
the interaction effects are not included and total means that they are. Therefore, the
TE must include a total and a pure effect.
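These decompositions can be checked directly whenever all four nested potential outcomes are available, as they are in a simulation. The following R sketch, using a hypothetical outcome model, computes the two NDEs, the two NIEs, and the TE from simulated potential outcomes and verifies both decompositions.

```r
set.seed(5)
n  <- 100000
m0 <- rbinom(n, 1, 0.3)                          # potential mediator under control, M(0)
m1 <- rbinom(n, 1, 0.6)                          # potential mediator under intervention, M(1)
e  <- rnorm(n)                                   # individual error shared by all potential outcomes
pot_y <- function(t, m) 0.4 * t + 0.8 * m + e    # hypothetical model for Y(t, m)

y00 <- pot_y(0, m0); y01 <- pot_y(0, m1)         # Y(0, M(0)) and Y(0, M(1))
y10 <- pot_y(1, m0); y11 <- pot_y(1, m1)         # Y(1, M(0)) and Y(1, M(1))

NIE1 <- mean(y11 - y10); NIE0 <- mean(y01 - y00) # total and pure natural indirect effects
NDE0 <- mean(y10 - y00); NDE1 <- mean(y11 - y01) # pure and total natural direct effects
TE   <- mean(y11 - y00)
all.equal(TE, NIE1 + NDE0)                       # TE = total NIE + pure NDE
all.equal(TE, NIE0 + NDE1)                       # TE = pure NIE + total NDE
```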
Controlled Effects The controlled direct effect (CDE; [20]) is the causal effect of
the intervention on the outcome when setting the mediator to a specific value, m, for
the entire population. That is, E[Yi(1,m) − Yi(0,m)], where Yi(t,m) is the potential outcome when T = t and M = m. We will denote the controlled direct effect as
CDEm , where the subscript m denotes the particular value to which m is held or
set. Note the difference between the CDE and the NDE. For the CDE, the value at
which the mediator is set (i.e., held constant) is the same for every individual. Also,
for a binary treatment, there are two NDEs, but there are as many CDEs as there are
possible values of the mediator. We have continued to use the i subscript through
this section to emphasize that the CDE sets the value of the mediator to be the same
for all individuals, whereas the NDE allows the value at which the mediator is set to
vary across individuals.
There is not a controlled indirect effect that is comparable to the NIE without fur-
ther assumptions, which will be discussed below. To illustrate, consider defining the
effect E[Yi(1,m) − Yi(1,m′)] for two different values, for example m = 0 and m′ = 1. We will denote this effect as M|t=1 and the corresponding E[Yi(0,m) − Yi(0,m′)] as M|t=0. The former is the effect of, for example, a one-unit change in the mediator on the outcome when Ti = 1. This effect does not tell us how the one-unit difference between m and m′ has come about: it could have happened through the treatment
intervention or through some other mechanism. On the other hand, consider the
NIE, E[Yi(1,Mi(1)) − Yi(1,Mi(0))], the effect of the intervention due to intervention-induced changes in the mediator. This effect, unlike E[Yi(1,m) − Yi(1,m′)], does
indicate that the intervention caused the difference in Mi (1) and Mi (0) because these
are potential outcomes under two different levels of the intervention. This distinction
may seem subtle but it is extremely important. The NIE is what behavioral scientists
typically think of as the mediation effect, commonly denoted ab in the behavioral
science literature, whereas E[Yi(1,m) − Yi(1,m′)] is the causal effect of the mediator on the outcome, holding constant the intervention status, and is commonly denoted as b in the behavioral science literature. The effects M|t=1 and M|t=0 also imply
that it is possible to set the mediator to the same value for all individuals as
mentioned above. For elaboration of these conceptual issues, see VanderWeele and
Vansteelandt [21].
It has been shown that under certain assumptions, the various definitions given
above for the direct and indirect effects are equivalent (e.g., [5, 22, 23]). We will
return to this point after discussing identification assumptions. These assumptions
are summarized in Table 14.1.
3 Identification
The causal effects defined above are written in terms of potential outcomes, not all
of which can be observed. If all the potential outcomes were observed, then all of the
above effects could be easily estimated. In order to estimate causal effects based on
the observed data, assumptions must be made in order to identify the causal effects.
Principal Strata Effects Generally, principal strata effects are identified by assum-
ing that there is no one for whom the intervention has an iatrogenic (i.e., undesirable)
effect (e.g., P[M(0) = 1, M(1) = 0] = 0), which is typically referred to as the
monotonicity assumption. Note that the CACE is the causal effect of interest under
the hypothesis that the intervention will increase the value of the mediator (i.e.,
increasing values of the mediator are desirable). If the hypothesis happens to be
that the intervention decreases the value of the mediator (i.e., decreasing values
of the mediator are desirable), the monotonicity assumption is that P[M(0) = 0, M(1) = 1] = 0, and thus, scientific interest lies in the DACE. That is, the DACE would
be the causal effect of interest.
Additionally, it is assumed that the only way in which the intervention can affect
the outcome is through the mediator. This is known as the exclusion restriction
and implies that E[Yi(1) − Yi(0) | Mi(1) = Mi(0)] = 0. That is, among those for whom
there is no causal effect of the intervention on the mediator, there is no causal effect
of the intervention on the outcome. However, the exclusion restriction also means
that there is no direct effect of the intervention on the outcome, among those for
whom there is a causal effect of the intervention on the mediator (i.e., those in either
stratum 3 or 4; the compliers or defiers). In fact, the only way that the principal
strata effects for stratum 3 or 4 can be interpreted as an indirect effect is if the
exclusion restriction holds. Otherwise, the causal effect estimated is the total effect
of the intervention on the outcome among those for whom the intervention had a
causal effect on the mediator. The exclusion restriction is particularly difficult to
rationalize given that most interventions are designed to affect multiple mediators
that are hypothesized to affect the outcome. In addition, an interaction between T
and M is a violation of the exclusion restriction [5, 24].
Finally, it is assumed that there are no unmeasured confounders of T and Y (e.g.,
there is random assignment to T), which can be stated formally as T ⊥ Y(0), Y(1) | X0.
This assumption allows T to be used as an instrumental variable (IV) in the two-
stage least-squares (TSLS) estimation to be described below. Note that unlike other
causal mediation methods, the principal strata approach does not require a no-
unmeasured-confounding assumption for M and Y (given the other assumptions
stated above).
Note that the assumptions stated above are not the only set that could be used
for identification. Gallop et al. [25] proposed alternative identification assumptions.
They do not require the exclusion restriction or monotonicity assumption. Instead,
baseline covariates, which predict the principal strata, are used to identify the
stratum-specific ACE. In addition, they assume that there are no interactions
between these baseline covariates and T within each principal stratum and that there
are no unmeasured confounders of T and Y.
Natural Effects To identify the natural effects, it is usually assumed (e.g., [22, 26])
that (a) there are no unmeasured confounders of the intervention and the mediator,
T ⊥ M(0), M(1) | X0; (b) there are no unmeasured confounders of the intervention
and the outcome; (c) there are no unmeasured confounders of the mediator and
the outcome; and that (d) there are no measured or unmeasured confounders of
the mediator and outcome that have themselves been influenced by the intervention
(i.e., no post-treatment confounders, denoted X1 ). Note that the set of variables in
X0 do not need to be the same for (a) and (b) and that if X1 is not affected by
the intervention, then it does not violate (d) [27]. If individuals are randomized to
the intervention, then (a) and (b) will typically hold as long as the randomization
does not fail (e.g., individuals comply with the assigned intervention and there is
no selective attrition). However, unless individuals are also randomized to levels of
the mediator, which is typically impossible in practice, (c) is not guaranteed to hold.
These are obviously very strong assumptions that cannot be tested in any empirical
application. Nevertheless, if the researcher has given careful thought to all potential
confounders, measured them, and properly adjusted for them, assumptions (a)–(c)
are plausible. Furthermore, sensitivity analyses have been developed and conducted
to assess the impact of violations of these assumptions (e.g., [4, 28, 29]).
Assumption (d) of no post-treatment confounders of the mediator and outcome
is more difficult to rationalize. Note that confounders of the mediator and outcome
that have been influenced by the intervention are essentially mediators themselves,
although they may not be of scientific interest (i.e., the investigator is not inter-
ested in their effects and simply wishes to control for them). Assumption (d) is
problematic given that most interventions target multiple mediators and because the
4 Estimation
For each of the definitions, different estimators have been proposed using different
sets of identifying assumptions described above. We will consider only a few
estimators for each definition. For principal strata effects, we will consider a TSLS
IV estimator [36] and a Bayesian estimator [25]. For natural effects, we will consider
the estimator proposed by Imai et al. [4]. For controlled effects, we will consider the
G-estimator proposed by Ten Have et al. [34] and an inverse propensity weighted
(IPW) estimator [3, 26].
Principal Strata Effects Given the monotonicity and exclusion restriction identi-
fying assumptions, the TSLS IV estimator [36], in which intervention assignment is
the instrument, is typically used to estimate the principal strata effects. In order for
the intervention assignment to be considered an instrumental variable, individuals
should be randomly assigned to intervention conditions such that assumptions
(a) and (b) hold. Further, for all practical purposes, the principal stratification
framework requires a binary mediator.1 Even for a mediator that takes on, say, 5
values, the number of latent principal strata grows tremendously. Specifically, for a
mediator that takes on 5 possible values, there would be 25 latent strata or subgroups
of individuals and thus it would be difficult to identify and estimate principal
strata effects. Given a binary mediator, monotonicity, the exclusion restriction, and
random assignment to the intervention (i.e., no unmeasured confounders of T and
M or T and Y), the latent subgroups of individuals are no longer latent because all
but one stratum is eliminated.
In the recent statistical literature, there have been attempts to use different
identifying assumptions and Bayesian estimation procedures (e.g., [25, 37]) in order
to relax the exclusion restriction. The Elliott et al. estimator is limited to both binary
mediators and outcomes. We use the Bayesian estimator proposed by Gallop et al.
¹ Gallop [59] proposed Bayesian estimation of direct effects when the mediator is continuous.
to estimate the principal strata effect. This approach was developed to estimate
the direct effect, although it estimates all four principal strata effects. Because the
authors were not interested in an unbiased causal estimate of the indirect effect, they
did not need an assumption of no unmeasured confounders of M and Y. However,
if interest lies in a causal estimate of the indirect effect, then this assumption is
required. In addition, both the TSLS IV and Bayesian estimators require assumption
(d). Although not explicitly stated in the previous literature, a post-T confounder
violates the exclusion restriction because there is a pathway from T to Y that does not
go through M.
Natural Effects Several estimators have now been proposed for estimating natural
effects (e.g., [38, 39]) but we will focus on the estimator proposed by Imai and
colleagues [4, 22] and implemented in the R package mediation [40], which uses
identifying assumptions (a)–(d). This estimator involves generating bootstrapped
samples and fitting models, which may be parametric or non-parametric, for the
observed outcome and observed mediator. From these models, potential values of
the mediator are simulated and then potential values of the outcome are simulated
given the simulated values of the mediator. Once all of the potential values for the
mediator and outcome have been simulated, the natural effects can be computed as
defined previously.
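A typical call to this estimator, assuming the mediation package is installed, might look like the following sketch; the simulated data set, variable names (y, m, t, x0), and effect sizes are placeholders rather than the data analyzed in this chapter.

```r
library(mediation)                                 # R package implementing the Imai et al. estimator
set.seed(6)
n   <- 500
x0  <- rnorm(n)                                    # baseline covariate
t   <- rbinom(n, 1, 0.5)                           # randomized intervention
m   <- 0.4 * t + 0.3 * x0 + rnorm(n)               # observed mediator
y   <- 0.3 * t + 0.5 * m + 0.3 * x0 + rnorm(n)     # observed outcome
dat <- data.frame(y, m, t, x0)

fit_m <- lm(m ~ t + x0, data = dat)                # model for the observed mediator
fit_y <- lm(y ~ t + m + x0, data = dat)            # model for the observed outcome
med   <- mediate(fit_m, fit_y, treat = "t", mediator = "m", boot = TRUE, sims = 500)
summary(med)                                       # reports ACME (an NIE), ADE (an NDE), and total effect
```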
Controlled Effects VanderWeele [26] proposed using a marginal structural model
(MSM; [41]) with an IPW estimator for defining and estimating the controlled direct
effect in the mediation context. MSMs are models for the potential outcomes and are
used to define causal effects. For example, for a continuous outcome, the MSMs may
be given as E[M(t)] = β0M + β1 t and E[Y(t,m)] = β0Y + β2 t + β3 m, where β2 = E[Y(1,m) − Y(0,m)] = (β0Y + β2 + β3 m) − (β0Y + β3 m) is the CDE defined above, β1 = E[M(1) − M(0)] = (β0M + β1) − β0M is the effect of the intervention on the mediator, and β3 = E[Y(t,m) − Y(t,m′)] (for values of m and m′ one unit apart) is the effect of the mediator on the outcome for T = t. A T × M interaction term can also be included in the MSM.
MSMs are fit by choosing an appropriate model for the observed outcome (e.g.,
linear regression, logistic regression, survival model), but using the IPW estimator
instead of the usual ordinary least squares or maximum likelihood estimator. As
long as assumption (e) holds, an estimate of the indirect effect may be obtained by
subtracting the CDE from the TE.
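A minimal sketch of the IPW idea for a binary mediator is shown below. It is illustrative only and not the exact implementation used in the simulation study: unstabilized weights combine the inverse probability of the observed treatment with the inverse probability of the observed mediator given treatment and baseline covariates, and a weighted regression of Y on T and M then estimates the MSM parameters.

```r
set.seed(7)
n  <- 5000
x0 <- rnorm(n)                                         # baseline confounder of M and Y
t  <- rbinom(n, 1, 0.5)                                # randomized intervention
m  <- rbinom(n, 1, plogis(-0.5 + 0.8 * t + 0.6 * x0))  # binary mediator
y  <- 0.3 * t + 0.5 * m + 0.4 * x0 + rnorm(n)          # outcome; true CDE = 0.3 at every m

p_t <- ifelse(t == 1, mean(t), 1 - mean(t))            # P(T = t); T is randomized here
fit_m <- glm(m ~ t + x0, family = binomial)            # model for P(M = 1 | T, X0)
p_m <- ifelse(m == 1, fitted(fit_m), 1 - fitted(fit_m))
w   <- 1 / (p_t * p_m)                                 # inverse probability weights

msm <- lm(y ~ t + m, weights = w)                      # weighted fit of the MSM for E[Y(t, m)]
coef(msm)[c("t", "m")]                                 # estimates of beta2 (CDE) and beta3
```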
For controlled effects, we will also examine the modified G-estimator for the
rank preserving model (RPM) described in Ten Have et al. [34]. This estimator does
not require assumption (c); however, it does require that individuals are randomized
to the intervention (i.e., assumptions (a) and (b)). It also assumes that there are no
interaction effects between baseline covariates, X0 , and the mediator and between
baseline covariates, X0 , and intervention assignment on the potential outcomes.
However, there should be strong interaction effects between the baseline covariates,
X0 , and intervention assignment on the mediator. Essentially, this estimator is using
the interactions between baseline covariates, X0 , and intervention assignment as
instrumental variables. The G-estimator also requires assumption (e). Thus, in
The simulation study crosses four assumption violation conditions with four
confounding conditions. The first confounding scenario (A) does not involve
any confounders. The second confounding scenario (B) involves a pre-treatment
confounder, X0 , of M and Y that has not been influenced by T. The third confounding
scenario (C) involves a post-treatment but pre-mediator confounder, X1 , of M and Y
that has been influenced by T. The fourth confounding scenario (D) involves a pre-
treatment confounder of T, M, and Y, such that there is not random assignment to T.
These confounding conditions are crossed with two sample size conditions, N = 100 and N = 500, and three other conditions that systematically violate the assumptions
of the different approaches; specifically, monotonicity, the exclusion restriction, and
the no-interaction between T and M assumption. A fourth condition in which none
of these assumptions are violated is also included. To summarize, for each sample
size, there are 16 simulation conditions as follows: no confounders/no violations, no
confounders/exclusion restriction violated, no confounders/monotonicity violated,
no confounders/no-interaction violated, unmeasured pre-T confounder of M and
Y/no violations, unmeasured pre-T confounder of M and Y/exclusion restriction
violated, unmeasured pre-T confounder of M and Y/monotonicity violated, unmea-
sured pre-T confounder of M and Y/no-interaction violated, post-T confounder
of M and Y/no violations, post-T confounder of M and Y/exclusion restriction
violated, post-T confounder of M and Y/monotonicity violated, post-T confounder
of M and Y/no-interaction violated, unmeasured pre-T confounder of T, M, and
Y/no violations, unmeasured pre-T confounder of T, M, and Y/exclusion restriction
violated, unmeasured pre-T confounder of T, M, and Y/monotonicity violated,
unmeasured pre-T confounder of T, M, and Y/no-interaction violated.
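The crossing just described can be laid out explicitly; the short R sketch below merely enumerates it (the labels are ours, for bookkeeping only).

```r
conditions <- expand.grid(
  confounding = c("A: none",
                  "B: unmeasured pre-T confounder of M and Y",
                  "C: post-T confounder of M and Y",
                  "D: unmeasured pre-T confounder of T, M, and Y"),
  violation   = c("none", "exclusion restriction", "monotonicity", "no T-by-M interaction"),
  n           = c(100, 500)
)
nrow(conditions)   # 4 confounding scenarios x 4 violation conditions x 2 sample sizes = 32 cells
```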
In each of the simulation conditions, we generated 1000 data sets and estimated
the following causal effects: principal strata effects with TSLS IV estimator,
principal strata effects with Bayesian estimator, controlled effects using the IPW
estimator, controlled effects using the RPM G-estimator, and natural effects using
the Imai et al. [4] estimator.
The goal is for the data generation to be general enough that it does not favor one
approach over another. However, we also need to know the population values for
each of the effects. Therefore, we generated all of the potential outcomes for each
individual, including the ones that would never be observed for any individual—Y(1,
M(0)) and Y(0, M(1))—so that the causal effects defined previously may be directly
computed for each individual. By generating data for all potential outcomes, the true
values in all conditions are known.
Each of the simulation study conditions described above dictates the specific
values of population parameters (given in Table 14.2), but here we describe the
data generation generally. M is binary so that the comparison between principal
stratification and the other approaches is more straightforward. However, note that a
binary M is not necessary for estimating the controlled or natural effects. T is binary
and is generated from a binomial distribution with probability of 0.5 in confounding
scenarios A, B, and C. In confounding scenario D, T was generated from a binomial
distribution with a probability dependent on X0 . In other words, T is randomized
in confounding scenarios A, B, and C but not in D. Y is a continuous, normally
distributed variable.
The potential outcomes for M were generated according to a multinomial distribution over the four possible pairs of potential mediator values:

[M(0), M(1)] = (1, 1) with probability p11, (0, 0) with probability p00, (1, 0) with probability p10, and (0, 1) with probability p01,

where the four probabilities sum to one. Thus, p00 = P[M(0) = 0, M(1) = 0], p11 = P[M(0) = 1, M(1) = 1], p10 = P[M(0) = 1, M(1) = 0], and p01 = P[M(0) = 0, M(1) = 1]. For confounding scenario A, the
multinomial probabilities were set to particular values depending on whether or not
the monotonicity assumption was violated.
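As an illustration, the R sketch below draws the pair [M(0), M(1)] from the four principal strata using the scenario A probabilities under which monotonicity holds; the stratum labels are ours.

```r
set.seed(8)
n      <- 500
probs  <- c(p00 = 1/3, p11 = 1/3, p10 = 0, p01 = 1/3)          # scenario A, monotonicity holds
strata <- sample(names(probs), n, replace = TRUE, prob = probs)

m0 <- as.integer(strata %in% c("p11", "p10"))   # M(0) = 1 in strata (1,1) and (1,0)
m1 <- as.integer(strata %in% c("p11", "p01"))   # M(1) = 1 in strata (1,1) and (0,1)
table(m0, m1) / n                               # empirical joint distribution of [M(0), M(1)]
```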
The potential outcomes for Y were generated according to a multivariate normal
distribution with mean,
The population values for each condition of the simulation study are given in
Table 14.2. For confounding scenario A in the conditions in which the monotonicity
assumption holds, p10 = 0 and p00 = p11 = p01 = 1/3. The proportions for confounding scenario A when monotonicity was violated were set to p00 = 0.2, p11 = 0.2, p10 = 0.1, and p01 = 0.5. For confounding scenario A only, we also studied a
condition in which the monotonicity assumption was violated and all proportions
were set to 0.25. The purpose of this condition was to examine what happens as the
proportion of defiers increases. In addition, because the proportions are equal, the
indirect effect is zero because for 25 % of the sample the indirect effect is positive
and for another 25 % of the sample, the indirect effect is equally negative. Thus, the
effects cancel out. Although it is unlikely that the stratum proportions would ever
be exactly equal or that the proportion of defiers would ever be as large as 0.25, this
condition provides some idea of how extreme the bias may become. In the mediation
context, the proportion of defiers represents the proportion of individuals for whom
the intervention has an iatrogenic effect on the mediator.
For confounding scenario D, the parameter settings were the same as confound-
ing scenario B. However, in confounding scenario D, T was generated from a
Bernoulli distribution with p = 1/(1 + exp(0.2 · X0)) so that the pre-T confounder had an effect on intervention assignment. For only the N = 500 sample size condition, we examined a large effect size condition in which we replaced 0.39 with 0.59 for β1, β2, and β3 in Table 14.2.
The true values for each of the effects were computed according to the definitions
presented previously using the potential outcomes. For estimation of the effects, we
used only the data that would be available to an investigator (e.g., Mi (1), Yi (1,Mi (1))
if Ti D 1). We computed the Monte Carlo (MC) mean and standard deviation (SD)
across the 1000 replications. We computed the bias as the difference between the
MC mean and the true value, the mean squared error (MSE) as the squared bias plus
the squared MC SD, and the 95 % coverage as the number of times the confidence
interval (CI) included the true value divided by 1000 and multiplied by 100. The
results of the simulations, along with the true values, are given in Tables 14.3,
Table 14.3 Confounding scenario A (no unmeasured confounders) results (N = 500) for medium
effect size
NIE NDE TE IPWCDE IPW M RPM M RPMCDE TSLSIV Bayesian
No violations
TRUE 0.13 0 0.13 0 0.39 0.39 0 0.39 0.39
MEAN 0.131 0.002 0.129 0.002 0.391 0.391 0.002 0.383 0.390
BIAS 0.001 0.002 0.001 0.002 0.001 0.001 0.002 0.007 0.000
SD 0.037 0.094 0.090 0.095 0.095 0.095 0.095 0.273 0.146
MSE 0.001 0.009 0.008 0.009 0.009 0.009 0.009 0.072 0.021
Coverage 94.4 % 94.9 % 94.8 % 94.0 % 93.9 % 93.9 % 94.0 % 95.9 % 99.8 %
Exclusion restriction violated
TRUE 0.13 0.39 0.52 0.39 0.39 0.39 0.39 0.78 0.78
MEAN 0.130 0.391 0.521 0.391 0.392 0.392 0.391 1.591 0.783
BIAS 0.000 0.001 0.001 0.001 0.002 0.002 0.001 0.811 0.003
SD 0.034 0.096 0.091 0.095 0.095 0.095 0.095 0.318 0.151
MSE 0.001 0.009 0.0083 0.009 0.008 0.008 0.009 0.771 0.023
Coverage 96.2 % 95.3 % 95.2 % 95.6 % 95.7 % 95.7 % 95.6 % 23.2 % 99.9 %
Monotonicity violated (all proportions equal)
TRUE 0 0 0 0 0.39 0.39 0 0.39 0.39
MEAN 0.000 0.003 0.003 0.003 0.388 0.388 0.003 1.444 0.365
BIAS 0.000 0.003 0.003 0.003 0.002 0.002 0.003 1.054 0.025
SD 0.018 0.094 0.097 0.090 0.090 0.090 0.090 43448.8 0.285
MSE 0.000 0.009 0.009 0.009 0.008 0.008 0.009 2236.32 0.082
Coverage 95.1 % 92.9 % 92.7 % 93.4 % 95.7 % 95.7 % 93.4 % 99.7 % 99.3 %
Monotonicity violated (p10 = 0.1, p01 = 0.5, p00 = p11 = 0.2)
TRUE 0.156 0 0.156 0 0.39 0.39 0 0.39 0.39
MEAN 0.154 0.001 0.154 0.001 0.387 0.387 0.001 0.387 0.390
BIAS 0.002 0.001 0.002 0.001 0.003 0.003 0.001 0.003 0.001
SD 0.042 0.100 0.093 0.098 0.098 0.098 0.098 0.227 0.159
MSE 0.002 0.010 0.009 0.010 0.010 0.010 0.010 0.054 0.025
Coverage 96.3 % 94.7 % 94.7 % 94.8 % 95.3 % 95.3 % 94.8 % 94.8 % 97.8 %
14.4, 14.5, 14.6, and 14.7 for the N = 500 sample size condition. The results for N = 100 were similar; therefore, they are not presented here but are available
as supplementary online materials. Likewise, the results for the large effect size
condition were similar and are not presented here but are available as supplementary
online materials.
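For readers replicating these summaries, a small helper in the spirit of the measures just described might look as follows; it assumes Wald-type 95 % confidence intervals of the form estimate ± 1.96 × SE, a simplification relative to the estimator-specific intervals used in the study.

```r
# est: estimates across replications; se: their standard errors; truth: the true effect value
mc_summary <- function(est, se, truth) {
  bias    <- mean(est) - truth
  mc_sd   <- sd(est)
  mse     <- bias^2 + mc_sd^2                                   # squared bias plus squared MC SD
  covered <- (est - 1.96 * se <= truth) & (truth <= est + 1.96 * se)
  c(bias = bias, sd = mc_sd, mse = mse, coverage = 100 * mean(covered))
}

# example use with made-up replication results
mc_summary(est = rnorm(1000, 0.13, 0.04), se = rep(0.04, 1000), truth = 0.13)
```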
6.1 No Confounders
TRUE 0.13 0.26 0.52 0.65 0.78 0.39 0.78 0.39 0.78 1.17 1.17
MEAN 0.129 0.260 0.513 0.645 0.773 0.380 0.778 0.386 0.785 2.356 1.166
BIAS 0.001 0.000 0.007 0.005 0.007 0.010 0.002 0.004 0.005 1.186 0.004
SD 0.049 0.056 0.101 0.098 0.091 0.134 0.136 0.135 0.135 0.363 0.147
MSE 0.002 0.003 0.010 0.010 0.008 0.019 0.018 0.018 0.019 1.543 0.021
Coverage 94.7 % 96.1 % 94.6 % 95.8 % 95.9 % 95.4 % 95.4 % 95.0 % 95.3 % 2.2 % 99.8 %
Scenario B (unmeasured pre-T confounder of M and Y)
TRUE 0.132 0.263 0.515 0.647 0.779 0.39 0.78 0.39 0.78 1.17 1.17
MEAN 0.135 0.295 0.480 0.640 0.775 0.327 0.804 0.405 0.882 2.339 1.245
BIAS 0.004 0.031 0.035 0.007 0.004 0.063 0.024 0.015 0.102 1.169 0.075
SD 0.051 0.059 0.103 0.106 0.097 0.135 0.140 0.138 0.137 0.357 0.149
MSE 0.003 0.004 0.012 0.011 0.009 0.022 0.021 0.021 0.029 1.494 0.028
Coverage 94.8 % 92.6 % 94.1 % 93.9 % 94.4 % 92.5 % 93.2 % 94.6 % 89.5 % 3.2 % 99.7 %
Scenario C (post-T confounder of M and Y)
TRUE 0.154 0.307 0.711 0.864 1.018 0.59 0.98 0.39 0.78 1.37 1.37
MEAN 0.176 0.345 0.509 0.677 0.854 0.475 1.048 0.316 0.889 2.102 1.165
BIAS 0.023 0.038 0.202 0.187 0.164 0.115 0.068 0.074 0.109 0.732 0.205
SD 0.061 0.076 0.119 0.120 0.105 0.159 0.144 0.146 0.157 0.318 0.234
MSE 0.004 0.007 0.055 0.049 0.038 0.039 0.025 0.025 0.037 0.652 0.097
Coverage 95.3 % 93.6 % 58.7 % 63.8 % 65.2 % 88.0 % 92.1 % 93.4 % 89.2 % 33.5 % 98.1 %
Scenario D (unmeasured pre-T confounder of T, M, and Y)
TRUE 0.132 0.264 0.515 0.647 0.779 0.39 0.78 0.39 0.78 1.17 1.17
MEAN 0.140 0.306 0.522 0.688 0.828 0.367 0.846 0.402 0.881 2.403 1.24
BIAS 0.008 0.042 0.007 0.041 0.049 0.023 0.066 0.012 0.101 1.233 0.07
SD 0.052 0.057 0.107 0.108 0.099 0.137 0.139 0.138 0.138 0.347 0.15
MSE 0.003 0.005 0.012 0.013 0.012 0.020 0.025 0.020 0.028 1.644 0.027
Coverage 93.5 % 90.5 % 94.2 % 92.3 % 90.6 % 95.0 % 91.5 % 94.0 % 89.6 % 1.4 % 99.8 %
Note: Boldface type indicates more severe bias and very poor coverage rates. Italics indicates moderate bias and poor coverage rates.
Underlining indicates slight bias
Table 14.6 Confounding scenario C (post-T confounder of M and Y) results (N = 500) for
medium effect size
NIE NDE TE IPWCDE IPW M RPM M RPMCDE TSLSIV Bayesian
No violations
TRUE 0.153 0.2 0.353 0.2 0.39 0.39 0.2 0.59 0.59
MEAN 0.173 0.003 0.176 0.200 0.392 0.401 0.000 0.398 0.389
BIAS 0.020 0.197 0.177 0.000 0.002 0.011 0.200 0.192 0.201
SD 0.047 0.109 0.103 0.107 0.107 0.966 0.397 0.260 0.235
MSE 0.003 0.051 0.042 0.012 0.010 0.932 0.197 0.107 0.096
Coverage 94.4 % 55.4 % 59.6 % 94.9 % 96.3 % 96.5 % 94.9 % 88.5 % 97.8 %
Exclusion restriction violated
TRUE 0.153 0.590 0.743 0.59 0.39 0.39 0.59 0.98 0.98
MEAN 0.171 0.388 0.559 0.583 0.388 0.387 0.389 1.390 0.772
BIAS 0.018 0.203 0.185 0.007 0.002 0.003 0.201 0.410 0.208
SD 0.050 0.111 0.104 0.107 0.107 0.925 0.379 0.287 0.228
MSE 0.003 0.053 0.045 0.012 0.011 0.854 0.184 0.258 0.095
Coverage 94.4 % 52.2 % 55.8 % 93.2 % 95.8 % 95.9 % 93.0 % 74.1 % 99.6 %
Monotonicity violated
TRUE 0.040 0.2 0.240 0.2 0.39 0.39 0.2 0.59 0.59
MEAN 0.055 0.004 0.058 0.199 0.389 0.412 0.001 1.278 0.362
BIAS 0.014 0.196 0.182 0.001 0.001 0.022 0.199 1.868 0.228
SD 0.023 0.100 0.101 0.095 0.096 0.438 0.110 3539.23 0.344
MSE 0.001 0.048 0.043 0.009 0.009 0.193 0.359 5859.25 0.170
Coverage 96.0 % 51.9 % 58.0 % 95.4 % 94.9 % 94.2 % 56.0 % 99.3 % 95.6 %
extreme case, as it is unlikely that the proportions in each stratum would be equal or
that the intervention would have an iatrogenic effect on this many individuals. Also
note that in this case, because the proportions for all strata were equal, the NIE true
value is zero because there is an equal proportion with a positive indirect effect and
a negative indirect effect and they cancel out. Natural and controlled effect estimates
are all unbiased regardless of the proportion of defiers.
For the condition in which the no-interaction between T and M assumption is
violated, TSLS IV estimates are biased with 2 % coverage. As mentioned previously
when discussing principal strata effects, this condition is also a violation of the
exclusion restriction. All other effect estimates were unbiased (see Table 14.4)
including the Bayesian principal strata estimate.
The models fitted to the simulated data in this confounding scenario did not adjust
for the pre-T confounder. Thus, this set of conditions represents a violation of
the no unmeasured confounding assumption. Results are reported in Table 14.5
In this confounding scenario, the exclusion restriction is violated in all the con-
ditions due to the post-T confounder. For this confounding scenario, the models
fit to the simulated data included both the pre- and post-T confounders. Thus, the
no-unmeasured confounding assumptions are not violated. Results are reported in
Table 14.6 for the post-T confounder of M and Y/no violations, post-T confounder of M and Y/exclusion restriction violated, and post-T confounder of M and Y/monotonicity violated conditions. Results for the post-T confounder of M and Y/no-interaction violated condition are reported in the third panel of Table 14.4. For the post-T confounder of M and Y/no violations condition (however, the
exclusion restriction is violated due to the post-T confounder although there is
not otherwise a direct effect of T on Y), the TSLS IV and Bayesian principal
strata estimates, the NDE, and the CDE estimated via the RPM are biased to
approximately the same degree. The CDE estimated via IPW, the NIE, and M
estimated via either IPW or the RPM are unbiased although the MSE of M for
the RPM is much larger than the MSE for the IPW estimates. The 95 % coverage
for the NDE and TSLS IV estimates is unacceptable.
For the condition in which the exclusion restriction is violated (i.e., there is
a direct effect of T on Y in addition to the effect through the post-intervention
confounder), the TSLS IV and Bayesian principal strata estimates, the NDE, and
the CDE estimated via the RPM are biased. The CDE estimated via IPW, the NIE,
and M estimated via either IPW or the RPM are unbiased although the MSE of M
for the RPM is much larger than the MSE for the IPW estimates. The 95 % coverage
for the NDE and TSLS IV estimates is unacceptable.
For the condition in which monotonicity is violated, the TSLS IV and Bayesian
principal strata estimates, the NDE, and the CDE estimated via the RPM are biased.
The CDE estimated via IPW, the NIE, and M estimated via either IPW or the RPM
are unbiased although the MSE of M for the RPM is much larger than the MSE
for the IPW estimate. The 95 % coverage for the NDE and TSLS IV estimates is
unacceptable.
For the condition in which the no-interaction between T and M assumption is
violated, all effects are biased to some degree. The TSLS IV estimate is the most
severely biased with unacceptable 95 % coverage (33.5 %, see Table 14.4). The
NDEM(0), NDEM(1), CDE0, the IPW M|t=1, and the Bayesian principal strata effect estimates were moderately biased. Coverage for these effects was also unacceptable. The NIE1, NIE0, CDE1, and IPW M|t=0 estimates were slightly biased.
The models fitted to the simulated data in this confounding scenario did not adjust
for the pre-T confounder. Thus, these conditions represent a violation of the no-
unmeasured-confounders of T and Y, T and M, and M and Y assumptions. Results
are reported in Table 14.7 for the unmeasured pre-T confounder of T, M, and
Y/no violations, unmeasured pre-T confounder of T, M, and Y/exclusion restriction
violated, and unmeasured pre-T confounder of T, M, and Y/monotonicity violated
conditions. Results for the unmeasured pre-T confounder of T, M, and Y/no-
interaction violated condition are reported in the fourth panel of Table 14.4. There
is no post-T confounder in any of the conditions for this confounding scenario
(Table 14.8).
For the unmeasured pre-T confounder of T, M, and Y/no violations condition, the
TSLS IV estimate is biased. In this confounding scenario, the use of T as an IV is not
justified for the TSLS IV estimator. All other estimates are slightly biased. The bias
is most notable when compared to the corresponding bias in Tables 14.3 and 14.5.
For example, in Table 14.3, the no-unmeasured-confounding assumption holds and
there is no bias. In Table 14.5, the no-unmeasured-confounding assumption holds
with regard to T but not M. Bias for the NIE and M estimates using either IPW
or the RPM are essentially the same between Tables 14.5 and 14.7. However, the
bias for the NDE and CDE estimated using either IPW or the RPM are larger in
Table 14.7 than in Table 14.5 because the no-unmeasured-confounding assumption
for T is also violated in Table 14.7. Finally, the bias for the Bayesian principal strata
estimates increased in Table 14.7 compared to Table 14.5.
For the condition in which the exclusion restriction is violated, the results follow
the exact same pattern except that now the TSLS IV estimate is more severely
biased due to violation of the exclusion restriction. In addition, the 95 % coverage
for the TSLS IV estimate is unacceptably low (13.9 %). For the condition in which
monotonicity is violated, the results again follow the same pattern except that, in
addition, the MC SD for the TSLS IV estimate, and therefore the MSE, is extremely
large (Table 14.9).
For the condition in which the no-interaction between T and M assumption
is violated, the NIE0, NDEM(0), and M|t=0 estimates are unbiased. The TSLS
IV estimates were again severely biased with unacceptable coverage (1.4 %, see
Table 14.9 Results of empirical data analysis for controlled effects using inverse propensity weighted estimator

                 Without interaction                      With interaction
                 Estimate   SE      95 % CI               Estimate   SE      95 % CI
CDE0             2.879      1.063   (0.796, 4.963)        3.824      1.076   (1.714, 5.933)
CDE1             2.879      1.063   (0.796, 4.963)        0.972      2.228   (−3.394, 5.338)
M|t=0            2.194      1.183   (−0.125, 4.513)       3.847      2.073   (−0.216, 7.911)
M|t=1            2.194      1.183   (−0.125, 4.513)       0.996      1.350   (−1.650, 3.641)
Table 14.4). The M|t=1 estimate was moderately biased. The NIE1, NDEM(1),
CDE0 , CDE1 , and Bayesian principal strata estimates were all slightly biased
(Table 14.10).
The MC SD for the Bayesian estimates was generally larger than the MC SD for the
other methods. The MC SD for the TSLS IV estimates were much larger than that
for the other methods when there was an interaction between T and M. Coverage
for the Bayesian principal strata estimates was over 99 % in almost all simulation
conditions. The results were similar for the N D 100 sample size condition, which
are included in supplementary online materials. We also examined a large effect
size, 0.59 (see [42]), and obtained similar results. That is, all 0.39 values in
Table 14.2 were replaced with 0.59. These results are included in supplementary
online materials.
The IPW CDE and M estimates were unbiased when the no-interaction between
T and M assumption is violated (see top panel of Table 14.4). These estimates were
also unbiased when there was a post-T confounder of M and Y (see Table 14.6).
However, when both of these assumptions were violated, these estimates were
biased (see third panel of Table 14.4). We examined this situation further by
generating 1000 replications for a sample size N D 10,000 and estimating the IPW
CDE and M effects. Although the MC SD decreased as would be expected due to
the increased sample size, the bias did not decrease. In fact, it remained consistent
with the bias reported in the third panel of Table 14.4. Thus, IPW CDE and
M estimates are not robust for the post-T confounder of M and Y/no-interaction
violation condition.
7 Discussion
The simulation study results illustrate that if the identifying assumptions used by
an estimator hold, then the estimator performs well in terms of bias, and if they
do not hold, then the estimator does not perform well in terms of bias. In addition,
some estimators seem to be more robust than others when assumptions are violated.
Specifically, the simulation study illustrates that the TSLS IV estimator of the
principal strata effects and the RPM G-estimator, which relies on interaction terms
that act as instrumental variables, require that the instrumental variable assumptions
hold and if they do not, these methods are just as biased as those that rely on
sequential ignorability. This problem has been known for quite some time when
attempting to estimate the causal effect of an endogenous variable on an outcome
[43] and it carries over to mediation analysis as well. Unfortunately, many of the
assumptions cannot be verified in empirical data, leaving the researcher to attempt
to justify the assumptions based on rational argument. However, we suggest that
researchers who use instrumental variable methods, such as the RPM, report the
strength of the interaction term on the mediator, as well as the strength of the
interaction term on the outcome. Note that the lack of an effect of the interaction
term on the outcome does not guarantee that the exclusion restriction holds and that
violation of the exclusion restriction cannot be verified or refuted from the observed
data [44]. Furthermore, weak instruments may actually amplify bias in comparison
with an unadjusted estimate (see e.g., [43–45]). In other words, using no instrument
can be better than using a weak instrument. We propose that researchers take the
following steps: define the causal estimand, justify the identification assumptions,
and try several estimators.
If the researcher is interested in the effect of the mediator on the outcome, then the controlled effects are
of interest, because the natural effects do not define this effect separately from the
indirect effect. If the researcher is interested in the causal effect of the intervention
on the outcome that is due to the mediator, then the NIEs are of interest.
For different empirical data sets, certain assumptions are more likely to hold than
others. For example, in some studies the exclusion restriction may be plausible, and
in other studies no post-intervention confounders may be more plausible. Thus, one
consideration in choosing an approach is the plausibility of the various assumptions
for a particular data set. For an extensively studied research area, scientists may have
knowledge about the validity of model assumptions, but such knowledge is unlikely in relatively new research areas.
The assumption of no post-treatment confounders (assumption (d)), in which
there might be multiple mediators or confounders of the mediator and outcome that
have been influenced by the intervention, is likely to be violated in many studies.
Suppose a researcher is interested in the NIE, but assumption (d) is not plausible.
If instead assumption (e), no T × M interaction, is plausible, then an estimate of
the CDE can be obtained and subtracted from the TE to obtain an estimate of the
indirect effect. Another alternative is to include measures of the additional mediators
of the intervention in the statistical analysis, known as a multiple mediator model.
Accurate estimation of causal effects in this model is an active research area in the
field of causal inference (e.g., [46, 47]).
If a researcher is not able to justify any of the identifying assumptions, or
is particularly interested in a specific estimand and cannot justify the identifying
assumptions for that estimand, then it is important to find ways to assess the sensi-
tivity of the estimates to violations of the assumptions. In some cases, sensitivity
analysis has been developed. For instance, Imai et al. [4] proposed sensitivity
analysis to the no-unmeasured-confounding assumptions used in identifying natural
effects and implemented it in the R mediation package. VanderWeele [28] has
proposed a sensitivity analysis for the no-unmeasured-confounding assumptions
used in identifying controlled effects. Sensitivity analysis for the presence of a post-
treatment confounder for natural effect estimates has recently been developed [46].
In addition, one informal check that researchers could try is to use several
different estimators that rely on different identifying assumptions for the particular
definition of interest, as sketched below. If the results generally agree, it seems safe to conclude
either that the assumptions are not violated or that the estimates are not sensitive to
violations of them. Of course, if the results do not agree, the researcher does
not know which are correct. In any case, identifying causal effects will require
assumptions; thus, it seems development of sensitivity analysis is an important
direction for future research. Another alternative is to design future research studies
in order to reduce or eliminate the violation of assumptions.
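To make the recommendation of trying several estimators concrete, the sketch below (ours, not from the chapter) fits two estimators that rest on different identifying assumptions and runs the sensitivity analysis of Imai et al. [4] from the R mediation package; the data frame dat and the variable names tx (treatment), m (mediator), y (outcome), and x (baseline covariate) are hypothetical.

library(mediation)  # natural effect estimation and sensitivity analysis (Imai et al. [4])
library(AER)        # ivreg() for a two-stage least squares (instrumental variable) fit

# Estimator 1: natural direct/indirect effects under sequential ignorability
fit.m <- lm(m ~ tx + x, data = dat)          # mediator model
fit.y <- lm(y ~ tx + m + x, data = dat)      # outcome model
med.out <- mediate(fit.m, fit.y, treat = "tx", mediator = "m", sims = 1000)
summary(med.out)
summary(medsens(med.out))  # sensitivity to unmeasured mediator-outcome confounding

# Estimator 2: effect of the mediator using the randomized treatment as an instrument,
# which relies on the exclusion restriction instead of sequential ignorability
iv.fit <- ivreg(y ~ m + x | tx + x, data = dat)
summary(iv.fit)

Agreement between the two sets of estimates would be reassuring in the sense described above; disagreement flags sensitivity to at least one of the identifying assumptions.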
In this study, we only considered estimation—we did not consider hypothesis testing
and power. This and sensitivity analysis are directions for future work. We also did
not vary the strength of the confounding because the size of the simulation study
was already large. We would expect that as the effect of the unmeasured pre-T
confounder of M and Y (confounding scenario B), or of T, M, and Y (confounding
scenario D) increases, the bias resulting from not accounting for the confounder
would also increase. We also did not vary the strength of post-T confounder of M
and Y or of the interaction between T and M; rather we examined only the presence
or absence of violations of these assumptions.
There are other estimators for each approach that we did not consider here.
For the principal stratification approach, we did not implement the Jo et al. [49]
estimator, which uses reference stratification and propensity scores. Elliott et al. [37]
proposed a Bayesian estimator for principal strata effects, although this estimator
is only applicable when there is a binary mediator and a binary outcome. For
estimating the natural effects, Hogan [39] proposed an imputation-based estimator,
Daniels et al. [38] proposed a Bayesian estimator, Vansteelandt et al. [50] proposed
an imputation-based estimator, and VanderWeele and Vansteelandt [51] and Valeri
and VanderWeele [52] proposed an estimator for dichotomous outcomes based on
the mediation formula [6, 53]. Several other estimators for the CDEs have been
proposed, including a sequential G-estimation approach proposed by Vansteelandt
[54] and an estimator proposed by Emsley et al. [55] that is very similar to the RPM
G-estimator. Albert [56] proposed a TSLS estimator that is similar to those proposed
by Dunn and Bentall [57] and Joffe and Greene [58].
8 Conclusions
References
1. MacKinnon, D.P.: Introduction to Statistical Mediation Analysis. LEA, New York (2008)
2. Coffman, D.L.: Estimating causal effects in mediation analysis using propensity scores. Struct.
Equ. Model. 18, 357–369 (2011)
3. Coffman, D.L., Zhong, W.: Assessing mediation using marginal structural models in the
presence of confounding and moderation. Psychol. Methods (2012). doi:10.1037/a0029311
4. Imai, K., Keele, L., Tingley, D.: A general approach to causal mediation analysis. Psychol.
Methods 15, 309–334 (2010)
5. Jo, B.: Causal inference in randomized experiments with mediational processes. Psychol.
Methods 13, 314–336 (2008)
6. Pearl, J.: The causal mediation formula – a guide to the assessment of pathways and
mechanisms. Prev. Sci. 13, 426–436 (2012)
7. Holland, P.W.: Causal inference, path analysis, and recursive structural equations models.
Sociol. Methodol. 18, 449–484 (1988)
8. Holland, P.W.: Statistics and causal inference. J. Am. Stat. Assoc. 81, 945–970 (1986)
9. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies.
J. Educ. Psychol. 66, 688–701 (1974)
10. Rubin, D.B.: Causal inference using potential outcomes: design, modeling, decisions. J. Am.
Stat. Assoc. 100, 322–331 (2005)
11. Little, R.J.A., Rubin, D.B.: Causal effects in clinical and epidemiological studies via potential
outcomes: concepts and analytical approaches. Annu. Rev. Public Health 21, 121–145 (2000)
12. Schafer, J.L., Kang, J.D.Y.: Average causal effects from non-randomized studies: a practical
guide and simulated example. Psychol. Methods 13, 279–313 (2008)
13. Winship, C., Morgan, S.L.: The estimation of causal effects from observational data. Annu.
Rev. Sociol. 25, 659–706 (1999)
14. VanderWeele, T.J.: Concerning the consistency assumption in causal inference. Epidemiology
20(6), 880–883 (2009)
15. Westreich, D., Cole, S.R.: Invited commentary: positivity in practice. Am. J. Epidemiol. 171,
674–677 (2010)
16. Frangakis, C.E.: Principal stratification. In: Gelman, A., Meng, X.L. (eds.) Applied Bayesian
Modeling and Causal Inference from Incomplete Data Perspectives, pp. 97–108. Wiley, New
York (2004)
17. Frangakis, C.E., Rubin, D.B.: Principal stratification in causal inference. Biometrics 58, 21–29
(2002)
18. Rubin, D.B.: Direct and indirect causal effects via potential outcomes. Scand. J. Stat. 31, 161–
170 (2004)
19. Pearl, J.: Direct and indirect effects. In: Besnard, P., Hanks, S. (eds.) Proceedings of the
Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufman, San
Francisco (2001)
20. Robins, J.M., Greenland, S.: Identifiability and exchangeability for direct and indirect effects.
Epidemiology 3, 143–155 (1992)
21. VanderWeele, T.J., Vansteelandt, S.: Conceptual issues concerning mediation, interventions
and composition. Stat. Interface 2, 457–468 (2009)
22. Imai, K., Keele, L., Yamamoto, T.: Identification, inference, and sensitivity analysis for causal
mediation effects. Stat. Sci. 25, 51–71 (2010)
23. VanderWeele, T.J.: Simple relations between principal stratification and direct and indirect
effects. Stat. Probab. Lett. 78, 2957–2962 (2008)
24. Sobel, M.E.: Identification of causal parameters in randomized studies with mediating
variables. J. Educ. Behav. Stat. 33, 230–251 (2008)
25. Gallop, R., Small, D.S., Lin, J.Y., Elliott, M.R., Joffe, M.M., Ten Have, T.R.: Mediation
analysis with principal stratification. Stat. Med. 28, 1108–1130 (2009)
26. VanderWeele, T.J.: Marginal structural models for the estimation of direct and indirect effects.
Epidemiology 20, 18–26 (2009)
27. Pearl, J.: Interpretation and identification of causal mediation. Psychol. Methods 19(4), 459–481
(2014)
28. VanderWeele, T.J.: Bias formulas for sensitivity analysis for direct and indirect effects.
Epidemiology 21, 1–12 (2010)
29. Baron, R.M., Kenny, D.A.: The moderator–mediator variable distinction in social psycholog-
ical research: conceptual, strategic and statistical considerations. J. Person. Soc. Psychol. 51,
1173–1182 (1986)
30. Avin, C., Shpitser, I., Pearl, J.: Identifiability of path-specific effects. In: Proceedings of
the International Joint Conferences on Artificial Intelligence, pp. 357–363. Department of
Statistics, UCLA, Los Angeles (2005)
31. Hafeman, D.M., VanderWeele, T.J.: Alternative assumptions for identification of direct and
indirect effects. Epidemiology 22, 753–764 (2011). doi:10.1097/EDE.0b013e3181c311b2
32. Vansteelandt, S., VanderWeele, T.J.: Natural direct and indirect effects on the exposed: effect
decomposition under weaker assumptions. Biometrics 68(4), 1019–1027 (2012)
33. Ten Have, T.R., Joffe, M.M.: A review of causal estimation of effects in mediation analysis.
Stat. Meth. Med. Res. 21, 77–107 (2012)
34. Ten Have, T.R., Joffe, M.M., Lynch, K.G., Brown, G.K., Maisto, S.A., Beck, A.T.: Causal
mediation analyses with rank preserving models. Biometrics 63, 926–934 (2007)
35. Lynch, K.G., Kerry, M., Gallop, R., Ten Have, T.R.: Causal mediation analyses for randomized
trials. Health Serv. Outcome Res. Methodol. 8, 57–76 (2008)
36. Angrist, J.D., Imbens, G.W., Rubin, D.B.: Identification of causal effects using instrumental
variables. J. Am. Stat. Assoc. 91, 444–472 (1996)
37. Elliott, M.R., Raghunathan, T.E., Li, Y.: Bayesian inference for causal mediation effects using
principal stratification with dichotomous mediators and outcomes. Biostatistics 11, 353–372
(2010)
38. Daniels, M.J., Roy, J., Kim, C., Hogan, J.W., Perri, M.: Bayesian inference for the causal effect
of mediation. Biometrics 68(4), 1028–1036 (2012)
39. Hogan, J.W.: Imputation-based inference for natural direct and indirect effects. Presented at
the Workshop on Causal Inference in Health Research, Montreal, Canada, May 2011
40. Keele, L., Tingley, D., Yamamoto, T., Imai, K.: Mediation: R package for
causal mediation analysis [Computer software manual] (2009). Available from
http://CRAN.R-project.org/package=mediation (R package version 2.1)
41. Robins, J.M., Hernan, M.A., Brumback, B.A.: Marginal structural models and causal inference
in epidemiology. Epidemiology 11, 550–560 (2000)
42. MacKinnon, D.P., Lockwood, C.M., Hoffman, J.M., West, S.G., Sheets, V.: A comparison of
methods to test mediation and other intervening variable effects. Psychol. Methods 7, 83–104
(2002)
43. Bound, J., Jaeger, D.A., Baker, R.M.: Problems with instrumental variables estimation when
the correlation between the instruments and the endogenous explanatory variable is weak.
J. Am. Stat. Assoc. 90, 443–450 (1995)
44. Hernan, M.A., Robins, J.M.: Instruments for causal inference: an epidemiologist’s dream?
Epidemiology 17(4), 360–371 (2006)
45. Pearl, J.: On a class of bias-amplifying covariates that endanger effect estimates. UCLA
Cognitive Systems Laboratory, Technical Report (R-356). In: Grunwald, P., Spirtes, P.
(eds.) Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence,
pp. 417–424. Corvallis, OR (2010)
46. Imai, K., Yamamoto, T.: Identification and sensitivity analysis for multiple causal mech-
anisms: revisiting evidence from framing experiments. Polit. Anal. 1, 1–31 (2013).
doi:10.1093/pan/mps040
47. Wang, W., Nelson, S., Albert, J.M.: Estimation of causal mediation effects for a dichotomous
outcome in multiple-mediator models using the mediation formula. Stat. Med. 32(24), 4211–
4228 (2013)
48. Lange, T., Vansteelandt, S., Bekaert, M.: A simple unified approach for estimating natural
direct and indirect effects. Am. J. Epidemiol. 176, 190–195 (2012)
49. Jo, B., Stuart, E.A., MacKinnon, D.P., Vinokur, A.D.: The use of propensity scores in mediation
analysis. Multivar. Behav. Res. 46, 1–28 (2011). doi:10.1080/00273171.2011.576624
50. Vansteelandt, S., Bekaert, M., Lange, T.: Imputation strategies for the estimation of natural
direct and indirect effects. Epidemiol. Methods 1, 131–158 (2012)
51. VanderWeele, T.J., Vansteelandt, S.: Odds ratios for mediation analysis for a dichotomous
outcome. Am. J. Epidemiol. 172, 1339–1348 (2010)
52. Valeri, L., VanderWeele, T.J.: Mediation analysis allowing for exposure-mediator interactions
and causal interpretation: theoretical assumptions and implementation with SAS and SPSS
macros. Psychol. Methods (2013)
53. Pearl, J.: Interpretable conditions for identifying direct and indirect effects. UCLA Cognitive
Systems Laboratory Technical Report (R-389) (2012)
54. Vansteelandt, S.: Estimating direct effects in cohort and case-control studies. Epidemiology
20(6), 851–860 (2009)
55. Emsley, R., Dunn, G., White, I.R.: Mediation and moderation of treatment effects in ran-
domised controlled trials of complex treatments. Stat. Methods Med. Res. 19(3), 237–270
(2010)
56. Albert, J.M.: Mediation analysis via potential outcomes models. Stat. Med. 27, 1282–1304
(2008)
57. Dunn, G., Bentall, R.: Modelling treatment-effect heterogeneity in randomized controlled trials
of complex interventions (psychological treatments). Stat. Med. 26, 4719–4745 (2007)
58. Joffe, M.M., Greene, T.: Related causal frameworks for surrogate outcomes. Biometrics 65,
530–538 (2009)
59. Gallop, R.: Principal stratification for assessing mediation with a continuous mediator. Paper
presented at the Eastern North American Region of the International Biometric Society,
Washington, April 2012
60. Cole, S.R., Frangakis, C.: The consistency statement in causal inference: a definition or an
assumption. Epidemiology 20(1), 3–5 (2009)
61. MacCallum, R.C., Zhang, S., Preacher, K.J., Rucker, D.D.: On the practice of
dichotomization of quantitative variables. Psychol. Methods 7(1), 19–40 (2002).
doi:10.1037/1082-989X.7.1.19
62. Rosenbaum, P.R.: The consequences of adjustment for a concomitant variable that has been
affected by the treatment. J. R. Stat. Soc. Ser. A (General) 147, 656–666 (1984)
63. West, S.G., Biesanz, J.C., Pitts, S.C.: Causal inference and generalization in field settings:
experimental and quasi-experimental designs. In: Reis, H.T.J., Judd, C. (eds.) Handbook of
Research Methods in Social and Personality Psychology, pp. 40–84. Cambridge University
Press, New York (2000)
64. Cox, M.G., Kisbu-Sakarya, Y., Miočević, M., MacKinnon, D.P.: Sensitivity plots for con-
founder bias in the single mediator model. Eval. Rev. 37(5), 405–431 (2014)
Chapter 15
Causal Mediation Analysis Using Structural Equation Models
D. Gunzler ()
Center for Health Care Research & Policy, MetroHealth Medical Center, Case Western
Reserve University, 2500 MetroHealth Drive, Cleveland, OH 44109, USA
e-mail: dgunzler@metrohealth.org
N. Morris
Department of Epidemiology and Biostatistics, Case Western Reserve University,
10900 Euclid Ave, Cleveland, OH 44106, USA
X.M. Tu
Department of Biostatistics and Computational Biology, University of Rochester,
Rochester, NY 14642, USA
In the current context, the FRM-based SEM provides valid inference for longitudinal
mediation analysis under the two most popular missing data mechanisms: missing
completely at random (MCAR) and missing at random (MAR). We illustrate the
SEM approaches discussed in this chapter with real data.
There are many advantages to using the structural equation modeling (SEM)
framework in the context of mediation analysis [1–3]:
• SEM allows for the inclusion of latent variables such as happiness and quality of
life.
• SEM allows for the joint estimation of all parameters of a mediation model in a
single analysis.
• SEM allows for the extension of the mediation process to include multiple
independent variables, mediators, or outcomes in a single model.
• Many techniques are available (e.g., full information maximum likelihood) for
handling missing data under various assumptions for a structural equation model
in a single analysis.
• The SEM approach provides model fit information about the consistency of the
hypothesized mediational model with the data.
• SEM implies a functional relationship among variables via a conceptual model,
path diagram, and mathematical equations thus giving a rich, natural language
for expressing causal relationships.
Causal inference methods can be directly applied in the SEM framework for
causal mediation [4–6]. Thus, these approaches address the issues of potential
confounders of the mediator–outcome relationship, potential interaction between
the mediator and treatment, as well as provide definitions for deriving effects for
analyses involving mediators and outcomes that are not on an interval scale (e.g.,
count data, categorical data), all within the SEM framework. These approaches can
be readily implemented in Mplus (Muthén, 2011) [5]. Mplus is more generally a
program for latent variable modeling, of which classical SEM is a special case [7].
SEM allows for ease of extension to longitudinal data within a single framework,
corresponding with a study’s conceptual framework for clear hypothesis articulation
[8]. Latent growth modeling (LGM) is an SEM extension for longitudinal data, and
shows great flexibility in evaluating mediating relationships between multiple time-
varying measures [8]. For example, the parallel process LGM framework can be
used to evaluate how growth in the mediator influences growth in the outcome
[9]. This LGM framework does not assume a strong temporal relationship between the
mediator and the outcome themselves, only between growth in the mediator and growth in the
outcome. Autoregressive and latent difference score models have also been used for
longitudinal mediation analyses with SEM given a temporal relationship between
the mediator and outcome. For more information on the topic of SEM extensions
for longitudinal data in the context of mediation, see MacKinnon [10].
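As a concrete illustration of the parallel process idea, the following lavaan sketch (ours, not the chapter's, with hypothetical variable names tx, m1–m3, y1–y3 in a data frame dat.lgm) lets treatment affect growth in the mediator, which in turn affects growth in the outcome.

library(lavaan)

lgm.model <- '
  # growth factors for the mediator measured at three occasions
  im =~ 1*m1 + 1*m2 + 1*m3
  sm =~ 0*m1 + 1*m2 + 2*m3
  # growth factors for the outcome measured at three occasions
  iy =~ 1*y1 + 1*y2 + 1*y3
  sy =~ 0*y1 + 1*y2 + 2*y3
  # treatment -> growth in mediator -> growth in outcome
  sm ~ a*tx
  sy ~ b*sm + cp*tx
  ab := a*b   # mediated effect through growth in the mediator
'
fit.lgm <- growth(lgm.model, data = dat.lgm)
summary(fit.lgm, fit.measures = TRUE)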
While we use straightforward models in this chapter to showcase the advantages of SEM in this context, we refer
an interested reader to other resources that discuss model modification in SEM, such
as Kline [1] or Gunzler and Morris [11].
3 Path Diagrams
A path diagram for a mediation model will consist of nodes representing the vari-
ables, and arrows showing relations among them. In a path diagram, latent variables
(e.g., depression or stress) are distinguished from their observed counterparts by
convention, using a circle or ellipse rather than the rectangular or square box
used for the observed variables. Error terms are generally denoted by a letter or
symbol (e.g., e or ε) not enclosed in a shape. Arrows are generally used to represent
relationships among the variables. A single straight arrow indicates a causal relation
from the base of the arrow to the head of the arrow. Two straight single-headed
arrows in opposing directions connecting two variables may be used to indicate a
feedback loop. A curved two-headed arrow indicates there may be some association
between the two variables.
Path diagrams can be understood as implying certain conditional independence
relations among variables. Such conditional independence relations can be extracted
from the path diagram using the “d-separation” rule. D-separation is a criterion for
determining, from a given diagram, whether a set X of variables is independent of
another set Y, given a third set Z [4]. If the particular variables x and y are not d-
separated by z (i.e., z does not block the causal path between them), then they are
said to be d-connected by z. Note that d-connected is another way of describing
mediation [13]. See Bollen [2] and Pearl [4] for a more complete explanation of
these rules and for details about modeling complex relationships involving latent
constructs using path diagrams and SEM.
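For readers who want to check d-separation statements programmatically, the R package dagitty can read a diagram and report the conditional independencies it implies. The sketch below is ours (not from the chapter) and uses generic node names x, z, y for a full-mediation version of the diagram.

library(dagitty)

# x -> z -> y: x affects y only through the mediator z (full mediation)
g <- dagitty("dag { x -> z ; z -> y }")

dseparated(g, "x", "y", list())       # FALSE: x and y are d-connected through z
dseparated(g, "x", "y", "z")          # TRUE: conditioning on z blocks the path
impliedConditionalIndependencies(g)   # lists x _||_ y | z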
As an example of a path diagram for a hypothesized mediation, Fig. 15.1 presents
the path diagram for the causal path from time since symptom onset to depression,
mediated by cognitive decline (z_i); the path from symptom onset to cognitive decline
is labeled β_xz and the path from cognitive decline to depression is labeled γ_zy.

Fig. 15.1 Path diagram for the hypothesized mediation model for the causal path from time since
symptom onset to depression
\[
y = \tau_y + \Lambda_y \eta + \varepsilon, \qquad
x = \tau_x + \Lambda_x \xi + \delta
\tag{15.1}
\]
\[
\eta = \alpha + B\eta + \Gamma\xi + \zeta
\tag{15.2}
\]
\[
y = \alpha + B y + \Gamma x + \zeta
\tag{15.3}
\]
\[
\eta = \alpha + B\eta + \Gamma\xi + \zeta
\;\Longrightarrow\;
(I - B)\,\eta = \alpha + \Gamma\xi + \zeta
\;\Longrightarrow\;
\eta = (I - B)^{-1}\alpha + (I - B)^{-1}\Gamma\xi + (I - B)^{-1}\zeta
     = (I - B)^{-1}(\alpha + \Gamma\xi + \zeta)
\tag{15.4}
\]
The SEM for the typical mediation process with a single independent variable,
mediator, and outcome as depicted in Fig. 15.1 can be expressed by the following
structural equations:
\[
y_i = \gamma_0 + \gamma_{zy} z_i + \gamma_{xy} x_i + \varepsilon_{yi}, \qquad
z_i = \beta_0 + \beta_{xz} x_i + \varepsilon_{zi}
\tag{15.6}
\]
Note that the two structural equations are linked together and inference about them
is simultaneous, based on a joint distribution, unlike two, independent standard
regression equations.
Here, given that multivariate normality is an appropriate assumption, we might assume
\[
\begin{pmatrix} \varepsilon_{yi} \\ \varepsilon_{zi} \end{pmatrix}
\sim N(0, \Psi), \qquad
\Psi = \begin{pmatrix} \sigma_y^2 & 0 \\ 0 & \sigma_z^2 \end{pmatrix}.
\tag{15.7}
\]
Note that we are assuming Cov(ε_yi, ε_zi) = 0, which corresponds to assuming no mediator–outcome
confounding.
This mediation model is not linear (i.e., it is curvilinear) in terms of the
parameters [15]. To see this, we can express these equations in the form of (15.3):
\[
\begin{pmatrix} y_i \\ z_i \end{pmatrix}
= \begin{pmatrix} \gamma_0 \\ \beta_0 \end{pmatrix}
+ \begin{pmatrix} 0 & \gamma_{zy} \\ 0 & 0 \end{pmatrix}
  \begin{pmatrix} y_i \\ z_i \end{pmatrix}
+ \begin{pmatrix} \gamma_{xy} \\ \beta_{xz} \end{pmatrix} x_i
+ \begin{pmatrix} \varepsilon_{yi} \\ \varepsilon_{zi} \end{pmatrix}
\tag{15.8}
\]
Solving for the reduced form gives
\[
\begin{pmatrix} y_i \\ z_i \end{pmatrix}
= \begin{pmatrix} \gamma_0 + \gamma_{zy}\beta_0 \\ \beta_0 \end{pmatrix}
+ \begin{pmatrix} \gamma_{xy} + \gamma_{zy}\beta_{xz} \\ \beta_{xz} \end{pmatrix} x_i
+ \begin{pmatrix} 1 & \gamma_{zy} \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} \varepsilon_{yi} \\ \varepsilon_{zi} \end{pmatrix}.
\tag{15.9}
\]
The above SEM is clearly not linear in the parameters because of the terms γ_zy β_0
and γ_zy β_xz in the first row of the matrices in (15.9).
The direct effect is the pathway from the exogenous variable to the outcome while
controlling for the mediator. Therefore, in our path diagram in Fig. 15.1, γ_xy is the
direct effect. The indirect effect describes the pathway from the exogenous variable
to the outcome through the mediator. This path is represented through the product
of β_xz and γ_zy. Finally, the total effect is the sum of the direct and indirect effects of
the exogenous variable on the outcome, γ_xy + β_xz γ_zy.
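These quantities are easy to obtain in standard SEM software. The following lavaan sketch (ours, with hypothetical variable names x, z, y in a data frame dat) fits the two equations in (15.6) jointly and defines the direct, indirect, and total effects using labeled parameters.

library(lavaan)

med.model <- '
  y ~ gxy*x + gzy*z        # outcome equation of (15.6)
  z ~ bxz*x                # mediator equation of (15.6)
  direct   := gxy
  indirect := bxz*gzy
  total    := gxy + bxz*gzy
'
fit <- sem(med.model, data = dat)
summary(fit)   # both equations are estimated jointly in a single analysis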
The primary hypothesis of interest in a mediation analysis is to see whether the
effect of the independent variable or intervention on the outcome can be mediated
by a change in the mediating variable. In a full mediation process, the effect
is 100 % mediated by the mediator that is, in the presence of the mediator, the
pathway connecting the intervention to the outcome is completely broken so that
the intervention has no direct effect on the outcome. In most applications, however,
partial mediation is more common, in which case the mediator only mediates part
of the effect of the intervention on the outcome, that is, the intervention has some
residual direct effect even after the mediator is introduced into the model.
Inference (standard errors and p-values) for testing mediation effects in the SEM
framework is easily performed using the Delta method (e.g., Sobel [16]; Clogg
et al. [17]). Currently, a popular approach to assessing mediation is to bootstrap
confidence intervals (percentile, bias-corrected, and bias-corrected and accelerated)
for total and specific indirect effects [18].
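Continuing the lavaan sketch above, bootstrap confidence intervals for the defined effects can be requested as follows (a sketch; med.model and dat are the hypothetical objects defined earlier).

fit.boot <- sem(med.model, data = dat, se = "bootstrap", bootstrap = 1000)
parameterEstimates(fit.boot, boot.ci.type = "perc")        # percentile intervals
parameterEstimates(fit.boot, boot.ci.type = "bca.simple")  # bias-corrected type intervals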
The sample size or power for mediation analysis might be derived using
simulation techniques under full mediation, where the direct effect is equal to zero,
vs. a suitable alternative effect size to be considered for the direct effect.
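A rough simulation sketch of this idea (ours; the population values below are purely illustrative) generates data under an assumed alternative for the direct effect and records how often it is detected.

library(lavaan)

pop.model <- '
  y ~ 0.20*x + 0.30*z   # assumed population values, including a nonzero direct effect
  z ~ 0.40*x
'
ana.model <- '
  y ~ gxy*x + gzy*z
  z ~ bxz*x
'
power.direct <- mean(replicate(500, {
  d  <- simulateData(pop.model, sample.nobs = 200)
  f  <- sem(ana.model, data = d)
  pe <- parameterEstimates(f)
  pe$pvalue[pe$label == "gxy"] < 0.05   # was the direct effect detected?
}))
power.direct   # Monte Carlo estimate of power for the direct effect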
Significant advances have been made over the past few decades in the theory and
applications as well as software development for fitting SEM models that can be
used in the context of mediation analysis. For example, in addition to specialized
packages such as LISREL [14], MPlus [19], EQS [20], and Amos [21], procedures
for fitting SEM are also available from general-purpose statistical packages such
as R, SAS, STATA, and Statistica. These packages provide inference based on
maximum likelihood, generalized least squares, and weighted least squares.
Typically, robust maximum likelihood approaches are used for SEM analysis. For
example, in MPlus, the MLR approach uses ML to estimate the parameters, but uses
a robust sandwich-type estimator (the Huber–White sandwich estimator) to calculate
standard errors that are robust to violations of model assumptions such as multivariate normality
[22]. Bootstrapping is a similar but more computationally intensive approach to
creating robust standard errors [23].
Both ML and MLR provide a method for dealing with missing data under the
missing at random (MAR) assumption. For example, a slight modification of ML,
full information ML (FIML) is one such approach to handle missing data under
MAR assumption as implemented in MPLUS [19]. In this approach, all parameters
and standard errors are derived from the joint distribution of the endogenous
and exogenous variables, given assumptions such as multivariate normality and
conditional independence. Under these assumptions, the marginal likelihood after
integrating out the missing values can be maximized. Individual level data is needed
for FIML.
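In lavaan, for example, robust (sandwich) standard errors and FIML handling of MAR missingness can be requested together; the sketch below reuses the hypothetical med.model and dat objects from above.

fit.mlr <- sem(med.model, data = dat,
               estimator = "MLR",    # ML estimates with robust (sandwich) standard errors
               missing   = "fiml",   # full information ML under MAR
               fixed.x   = FALSE)    # model the exogenous variable so its missing values are handled too
summary(fit.mlr, fit.measures = TRUE)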
6 Model Fit
Model fit indices are measures of the discrepancy between the model and data. In
SEM analyses we evaluate a collective group of model fit measures, each of which
represents a different aspect of model fit. We provide a brief introduction here to
some of the statistics and indices that will be useful for mediation analyses.
For starters, an asymptotically chi-squared distributed test statistic (or robust
corrected statistic) provides a basis for assessing model fit, and in itself tests overall
model fit. The null hypothesis is that there is no difference between the proposed
model and the data structure, while the alternative hypothesis is that there is a
difference between the proposed model and the data structure. Thus, a large chi-
squared test with a corresponding small p-value indicates that the model does not fit
the data. In practice, however, studies will commonly reject the null, as the chi-
squared statistic is affected by nonnormality, correlation size, low power, and sample size
(either too small or too large).
A commonly used index, Root Mean Square Error of Approximation (RMSEA)
[24], is a point estimate that builds on this chi-squared statistic but is parsimony
and sample size corrected. Confidence intervals can be constructed around the point
estimate. A close fit hypothesis can be tested for the model using RMSEA. There
are several limitations to the fit index, namely, RMSEA may not exactly follow an
assumed non-central chi-square distribution, may be sensitive to nonnormality, and
may favor larger models.
Another commonly reported fit index, the Comparative Fit Index (CFI) [25], is an
incremental fit measure comparing the fit of the model to a baseline model (typically
the model for the data of interest with no covariance) on a zero to one continuous
scale. The closer the CFI index is to one, the better the model fit.
The Tucker–Lewis Index (TLI) [26] is another commonly reported incremental
fit measure with a higher penalty for adding parameters than CFI, and without
the zero to one range restriction. A commonly used absolute fit index, based
on standardized difference between the observed correlation and the predicted
correlation, is the Standardized Root Mean Square Residual (SRMR) [27].
Some general rule of thumb guidelines in the SEM literature are that RMSEA ≤ 0.05
indicates an excellent fit while ≤ 0.08 is acceptable; CFI and TLI ≥ 0.90 are
acceptable and ≥ 0.95 indicate excellent fit. In addition, all three indices should reach
acceptable (preferably excellent) levels before designating a model as good fitting.
An SRMR value ≤ 0.08 represents a good fit of the model.
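For a fitted lavaan model, the statistics and indices discussed above can be extracted in one call; a brief sketch (using the hypothetical fitted object fit.mlr from earlier, though any fitted lavaan model works):

fitMeasures(fit.mlr, c("chisq", "df", "pvalue",
                       "rmsea", "rmsea.ci.lower", "rmsea.ci.upper",
                       "cfi", "tli", "srmr"))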
Fig. 15.2 Path diagram for the hypothesized multiple mediator multiple outcome model for the
causal path from time since symptom onset to depression and fatigue. For visual ease we leave
out of this diagram all error terms for the endogenous variables and the correlations (all significant
at p < 0.001 and of positive magnitude) among the three mediators (cognitive decline, hand function,
and mobility) and among the two outcomes (depression and fatigue)
Given all the additional causal paths, the model did not show a good model fit
according to multiple SEM fit statistics and indices in comparison with our rule of
thumb guidelines: χ²(92) = 3722.673, p < 0.001; RMSEA (90 % confidence interval) = 0.106
(0.103, 0.109); CFI = 0.845; TLI = 0.781; SRMR = 0.055. Therefore,
a researcher may potentially want to modify the model (based on both clinical theory
and empirical criteria) before reporting these findings. For more information on how
to perform model modification using modification indices, see Kline [1] or Gunzler
and Morris [11].
In Table 15.1 we show all the model-derived specific and total indirect effects.
We can assess these specific and total indirect effects while accounting for other
model relationships. For example, given this more complex model, controlling
for other relationships, and a latent construct for depression, the mediated effect
is still significant, but of a lower magnitude, from symptom onset → cognitive
decline → depression (see Table 15.3) compared to the simpler model corresponding
to Fig. 15.1. The total indirect effect from symptom onset to depression, while
adjusting for all other model relationships, is the sum of the three individual indirect
effects from symptom onset to depression (0.025 + 0.015 + 0.035 = 0.075).
Since the error terms are t-distributed, the joint normal distribution assumption is not met in
the presence of missing data following MAR. For more technical details about the
simulated model, see Gunzler et al. [38].
As shown in Fig. 15.3, ML-based methods will show bias in estimating the
primary parameters of interest of the mediation model at all sample sizes (small
to large), while the robust Functional Response Modeling (FRM)-based approach
[15] will exhibit little bias that decreases as the sample size increases [38].
We now provide details about the FRM-based approach. Consider the mediation
model in Eq. (15.6). We can replace our outcome, mediator, and independent
variable (y_i, z_i, x_i) with appropriate time-varying versions (y_i3, z_i2, x_i1), given data
collected at three repeated measurements at t = 1, 2, 3 and temporality among the
measures for assessing longitudinal mediation (x_i1 → z_i2 → y_i3).
Fig. 15.3 Simulation results: mean parameter estimates (± standard errors) of γ_0, γ_zy, γ_xy, β_0, and
β_xz show the bias in ML, while FRM performs well with missing data. Adapted from Gunzler, D., Lu, N.,
Tang, W., Wu, P., Tu, X.M.: A class of distribution-free models for longitudinal mediation analysis.
Psychometrika 79(4), 543–568 (2014)
where
\[
\begin{aligned}
E\!\left(z_{i2}^2 \mid x_i\right) &= \sigma_{\varepsilon z}^2 + (\beta_0 + \beta_{xz} x_{i1})^2, \\
E\!\left(y_{i3} z_{i2} \mid x_i\right) &= \gamma_{zy}\,\sigma_{\varepsilon z}^2
  + (\beta_0 + \beta_{xz} x_{i1})\bigl(\gamma_0 + \gamma_{zy}\beta_0 + (\gamma_{xy} + \gamma_{zy}\beta_{xz})\, x_{i1}\bigr), \\
E\!\left(y_{i3}^2 \mid x_i\right) &= \gamma_{zy}^2\,\sigma_{\varepsilon z}^2 + \sigma_{\varepsilon y}^2
  + \bigl(\gamma_0 + \gamma_{zy}\beta_0 + (\gamma_{xy} + \gamma_{zy}\beta_{xz})\, x_{i1}\bigr)^2.
\end{aligned}
\tag{15.14}
\]
\[
S_i = f_i - h_i(\theta), \qquad D_i = \frac{\partial}{\partial \theta}\, h_i(\theta)
\tag{15.16}
\]
The following estimating equations are well defined and readily evaluated in closed
form:
\[
w_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} w_{ni}
            = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1} S_i = 0
\tag{15.17}
\]
\[
\sqrt{n}\,\bigl(\widehat{\theta} - \theta\bigr) \xrightarrow{d} N\!\left(0, \Sigma_\theta\right), \qquad
\Sigma_\theta = B^{-1}\, E\!\left(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}\right) B^{-\top}, \qquad
B = E\!\left(D_i V_i^{-1} D_i^{\top}\right)
\tag{15.18}
\]
Both Wald and Score Tests have been developed to test the true value of parameters
of a mediation model based on the sample estimates using the FRM-based approach
[38].
While these estimating equations provide valid inference under complete data
and the MCAR assumption, weighted estimating equations are necessary for valid
inference when the missing data follows the MAR assumption. Using Inverse
Probability Weighting (IPW) we can develop a set of weighted estimating equations
for inference about θ. We provide a sketch here. Assume no missing data at baseline
(t = 1) and monotone missing data for t = 2 and 3. Then, let r_it indicate whether
subject i is observed at time t, and let π_it denote the corresponding probability of
being observed at time t. Now let
\[
\Delta_i = \operatorname{diag}\!\left(
\frac{r_{i3}}{\pi_{i3}},\;
\frac{r_{i2}}{\pi_{i2}},\;
\frac{r_{i3}}{\pi_{i3}},\;
\frac{r_{i3}}{\pi_{i3}},\;
\frac{r_{i2}}{\pi_{i2}}
\right)
\tag{15.21}
\]
\[
w_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} w_{ni}
            = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i = 0
\tag{15.22}
\]
For details about solving these weighted estimating equations and the asymptotic
properties, see Gunzler et al. [38].
The distribution-free FRM-based approach is straightforward to extend to non-
continuous mediators and outcomes (e.g., count, categorical). For example, if y_it is a
binary outcome, the revised model is
\[
z_{i2} = \beta_0 + \beta_{xz} x_{i1} + \varepsilon_{zi}, \qquad
y_{i3} \mid x_{i1}, z_{i2} \sim \mathrm{Binomial}(\mu_i, 1),
\]
\[
\mu_i = E\!\left(y_{i3} \mid x_{i1}, z_{i2}\right), \qquad
\operatorname{logit}(\mu_i) = \gamma_0 + \gamma_{xy} x_{i1} + \gamma_{zy} z_{i2},
\tag{15.23}
\]
\[
\varepsilon_{zi} \sim N\!\left(0, \sigma_z^2\right), \qquad x_{i1} \perp \varepsilon_{zi}.
\]
To illustrate the approach with real study data, we applied the FRM to a longitudinal
study known as the Child Resilience Project [39]. Data were collected for this study
from 2006 to 2011. The analysis included 401 students in first through third grade
in five Rochester City School District elementary schools. The study examines how
children at higher risk of developing behavioral problems who were paired with a
mentor improve socially over periods of 6 and 18 months, compared with control
and lower-risk children.
We examined what role a potential mediator, self-reported verbal, declarative
knowledge of the skills the child is learning in the Resilience Project at 6 months,
plays in a cause and effect relationship between the treatment at baseline and the
child's self-initiated demonstration of skills at 18 months (Fig. 15.4). Thus, we have
longitudinal data with three assessment times (baseline, 6 months, and 18 months),
and temporally the mediator is hypothesized to occur before the outcome.
The treatment is a binary indicator as children either had a mentor or no mentor.
In the hypothesis of interest, the treatment would be expected to predict a higher
demonstration of skills, which would indicate that the children receiving a mentor
improved their social skills over time. The distributions of both the mediator and
outcome were skewed as shown in Fig. 15.5.
We had full information on whether each child received the treatment at baseline.
However, there was a high percentage of missing observations for both the mediator
(37 %) and the outcome (59 %). We modeled this missing data using logistic
regression (15.24), regressing the missingness indicator at 6 months (t = 2) on the
baseline treatment and the missingness indicator at 18 months (t = 3) on the 6-month
mediator.
This is a simplified special case of a missing data model for applying IPW
in which we are building our missing data models with only observed data at
the previous time point (without using any other information). We estimated the
parameters in R using the glm function. Since we modeled our missing data
at t = 2 based on the treatment information at baseline, we used all 401 observations.
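A sketch of how such missingness models and the resulting inverse probability weights might be computed in R (ours; the data frame crp and the column names r2, r3, tx, and know6 are hypothetical, and the weight construction assumes monotone missingness as in the text):

# r2, r3: indicators of being observed at 6 and 18 months; tx: baseline treatment;
# know6: verbal, declarative knowledge of skills at 6 months (observed only when r2 == 1)
fit.r2 <- glm(r2 ~ tx,    family = binomial, data = crp)                   # all 401 children
fit.r3 <- glm(r3 ~ know6, family = binomial, data = subset(crp, r2 == 1))  # observed at t = 2

crp$pi2 <- predict(fit.r2, newdata = crp, type = "response")
crp$pi3 <- crp$pi2 * predict(fit.r3, newdata = crp, type = "response")  # monotone missingness
# The ratios r2/pi2 and r3/pi3 then form the diagonal weight matrix in (15.21)-(15.22);
# predictions are only needed for children whose data enter the weighted equations.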
Fig. 15.4 Path diagram for the mediation model for the Child Resilience Study with MAR data
Fig. 15.5 Histograms of verbal, declarative knowledge of skills and demonstration of skills for
the Child Resilience Study
Table 15.2 Parameter estimates, standard errors, and asymptotic p-values for the missing data
model (15.24) for the Child Resilience Study (sample size = 401)

Parameter                  Estimate   Standard error   Asymptotic p-value
Intercept (t = 2 model)    0.546      0.147            <0.001
x_i1 (t = 2 model)         0.019      0.207            0.926
Intercept (t = 3 model)    0.250      0.201            0.214
z_i2 (t = 3 model)         0.067      0.029            0.022
Shown in Table 15.2 are the estimates for the missing data model in (15.24).
The p-value for the coefficient of z_i2 was significant, indicating a MAR mechanism
for the missing data at time 3. Since the p-value for the coefficient of x_i1 was not
significant, the missing data at time 2 were MCAR, and we would expect no bias for
the estimates of the time-2 parameters β = (β_0, β_xz)^T in ML. However, we expect to
see bias for the estimates of the time-3 parameters γ = (γ_0, γ_xy, γ_zy)^T in ML.
Shown in Table 15.3 are the estimates of the main parameters θ = (γ_0, γ_zy, γ_xy, β_0, β_xz)^T,
with associated standard errors and type I error rates, for this mediation model
obtained from the alternative FRM and ML approaches. From the table, we see
that the estimates from FRM and ML were practically the same for the β parameters
but differed for the γ parameters. For the β parameter estimates, FRM had a smaller
standard error than ML. We saw from the simulation for longitudinal missing data
in Fig. 15.3 that ML would produce a value of γ_xy biased toward a smaller magnitude
than the true value. This appeared true again, as the FRM estimate was larger in
magnitude, confirming that the treatment predicted a higher demonstration of skills
at 18 months. The parameter γ_zy was not significant for either FRM or ML in this
model (p > 0.421 for the Wald test in both FRM and ML), implying a non-significant
indirect effect in this mediation analysis.

Table 15.3 Parameter estimates, standard errors, and type I error rates for the mediation model
for the Child Resilience Study with missing data (37 %/59 %); sample size = 401

Parameter   Estimate (FRM)   Estimate (ML)   Standard error (FRM)   Standard error (ML)
γ_0         1.812            1.810           0.278                  0.352
γ_zy        0.042            0.039           0.053                  0.050
γ_xy        2.330            2.283           0.503                  0.480
β_0         3.429            3.429           0.370                  0.374
β_xz        4.390            4.390           0.528                  0.529

Type I α for H0: γ_xy = 0 — Wald: <0.001 (FRM), <0.001 (ML); Score: <0.001 (FRM)
10 Chapter Conclusion
Structural equation modeling provides a very general, powerful framework for per-
forming causal mediation analysis. By taking advantage of the functional response
models (FRM), we have developed a robust approach to systematically address the
limitations of SEM as it applies to mediation analysis. This class of FRM-based
SEM requires no parametric models for the data distribution and provides valid
inference for longitudinal mediation hypotheses under the two most popular missing
data mechanisms, missing completely at random (MCAR) and missing at random
(MAR). The approach can be extended for noncontinuous mediators and outcomes.
References
1. Kline, R.B.: Principles and Practice of Structural Equation Modeling. Guilford Press (2011)
2. Bollen, K.: Structural Equations with Latent Variables. Wiley, New York (1989)
3. Gunzler, D., et al.: Introduction to mediation analysis with structural equation modeling.
Shanghai Arch. Psychiatry 25(6), 390–394 (2013)
4. Pearl, J.: Causality: Models, Reasoning and Inference, 2nd edn. Cambridge Univ Press (2009)
5. Muthén, B.: Applications of causally defined direct and indirect effects in mediation anal-
ysis using SEM in Mplus. Download at www.statmodel.com/download/causalmediation.pdf
(2011)
6. Imai, K., Keele, L., Tingley, D.: A general approach to causal mediation analysis. Psychol.
Methods 15(4), 309 (2010)
7. Muthén, B.O.: Beyond SEM: general latent variable modeling. Behaviormetrika 29(1; ISSU
51), 81–118 (2002)
8. Preacher, K.J.: Latent Growth Curve Modeling. Sage (2008)
9. Cheong, J., MacKinnon, D.P., Khoo, S.T.: Investigation of mediational processes using parallel
process latent growth curve modeling. Struct. Equ. Model. 10(2), 238–262 (2003)
10. MacKinnon, D.P.: Introduction to Statistical Mediation Analysis. Routledge (2008)
11. Gunzler, D.D., Morris, N.: A tutorial on structural equation modeling for analysis of overlap-
ping symptoms in co-occurring conditions using MPlus. Stat. Med. 34(24), 3246–3280 (2015)
12. Kenny, D.A.: Terminology and Basics of SEM. (2011). Available from: http://davidakenny.net/
cm/basics.htm
13. Baron, R.M., Kenny, D.A.: The moderator–mediator variable distinction in social psychologi-
cal research: conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol. 51(6),
1173 (1986)
14. Joreskog, K., Sorbom, D.: LISREL 8 User’s Reference Guide. Scientific Software Chicago
(1996)
15. Kowalski, J., Tu, X.M.: Modern Applied U-Statistics, vol. 714. Wiley (2008)
16. Sobel, M.E.: Asymptotic confidence intervals for indirect effects in structural equation models.
Sociol. Methodol. 13(1982), 290–312 (1982)
17. Clogg, C.C., Petkova, E., Shihadeh, E.S.: Statistical methods for analyzing collapsibility in
regression models. J. Educ. Behav. Stat. 17(1), 51–74 (1992)
18. Preacher, K.J., Hayes, A.F.: Asymptotic and resampling strategies for assessing and comparing
indirect effects in multiple mediator models. Behav. Res. Methods 40(3), 879–891 (2008)
19. Muthén, L.K., Muthén, B.O.: Mplus. The Comprehensive Modelling Program for Applied
Researchers: User’s Guide, vol. 5 (2012)
20. Bentler, P.M.: EQS Structural Equations Program Manual, p. 254. BMDP Statistical Software
(1989)
21. Arbuckle, J.: Amos 6.0 User’s Guide. Marketing Department, SPSS Incorporated (2005)
22. Huber, P.J.: The behavior of maximum likelihood estimates under nonstandard conditions.
In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability
(1967)
23. Nevitt, J., Hancock, G.R.: Performance of bootstrapping approaches to model test statistics and
parameter standard error estimation in structural equation modeling. Struct. Equ. Model. 8(3),
353–377 (2001)
24. Browne, M.W., et al.: Alternative ways of assessing model fit. Sage Focus Editions 154, 136
(1993)
25. Bentler, P.M.: Comparative fit indexes in structural models. Psychol. Bull. 107(2), 238 (1990)
26. Tucker, L.R., Lewis, C.: A reliability coefficient for maximum likelihood factor analysis.
Psychometrika 38(1), 1–10 (1973)
27. Hu, L.-t., Bentler, P.M.: Fit indices in covariance structure modeling: sensitivity to underpa-
rameterized model misspecification. Psychol. Methods 3(4), 424 (1998)
28. Katzan, I., et al.: The Knowledge Program: an innovative, comprehensive electronic data
capture system and warehouse. In: AMIA Annual Symposium Proceedings, pp. 683–692
(2011)
29. Mellen Center for Multiple Sclerosis Treatment and Research, C.C., Neurological Institute
(2013). Available from: http://my.clevelandclinic.org/neurological_institute/
mellen-center-multiple-sclerosis/default.aspx
30. Blacker, D.: Psychiatric rating scales. In: Sadock, B.J., Sadock, V.A. (eds.) Kaplan and
Sadock’s Comprehensive Textbook of Psychiatry, 8th edn, pp. 929–955. Lippincott Williams
& Wilkins, Philadelphia (2005)
31. Schwartz, C.E., Vollmer, T., Lee, H.: Reliability and validity of two self-report measures of
impairment and disability for MS. Neurology 52(1), 63–70 (1999)
32. Chamot, E., Kister, I., Cutter, G.R.: Item response theory-based measure of global disability in
multiple sclerosis derived from the Performance Scales and related items. BMC Neurol. 14(1),
192 (2014)
33. Marrie, R.A., Goldman, M.: Validity of performance scales for disability assessment in
multiple sclerosis. Mult. Scler. 13(9), 1176–1182 (2007)
34. Gunzler, D., et al.: Disentangling multiple sclerosis & depression: an adjusted depression
screening score for patient-centered care. J. Behav. Med. 38(2), 237–250 (2015)
35. Beal, C.C., Stuifbergen, A.K., Brown, A.: Depression in multiple sclerosis: a longitudinal
analysis. Arch. Psychiatr. Nurs. 21(4), 181–191 (2007)
36. Brown, R., et al.: Longitudinal assessment of anxiety, depression, and fatigue in people with
multiple sclerosis. Psychol. Psychother. Theory Res. Pract. 82(1), 41–56 (2009)
37. Krupp, L.B.: Fatigue in Multiple Sclerosis: A Guide to Diagnosis and Management. Demos
Medical Publishing (2004)
38. Gunzler, D., et al.: A class of distribution-free models for longitudinal mediation analysis.
Psychometrika 79(4), 543–568 (2014)
39. Wyman, P.A., et al.: Intervention to strengthen emotional self-regulation in children with
emerging mental health problems: proximal impact on school behavior. J. Abnorm. Child
Psychol. 38(5), 707–720 (2010)
Index
A
Absolute standardized mean difference (ASMD), 121–122
ACEAIPW precision
  known propensity score model
    arbitrary function, 77
    disadvantage, 79
    Monte Carlo computations, 79
    quadratic function minimization, 78
    simulated 100 datasets, 79, 80
    variance, 78
    weighted mean squared error, 79
  known response regression model, 80–81
Acquired immunodeficiency syndrome (AIDS), 203
AdaBoost, 213
AIDS Clinical Trials Group (ACTG) Study A5095, 204
ASMD. See Absolute standardized mean difference (ASMD)
ATE. See Average treatment effect (ATE)
ATT. See Average treatment effect among the treated (ATT)
Augmented inverse probability weighted (AIPW) estimator
  construction, 75
  correct PM, 74
  correct RRM, 74
  double robustness property, 76
  HT estimator, 74, 75
Average causal mediation effect (ACME), 20
Average treatment effect (ATE), 112, 121
Average treatment effect among the treated (ATT), 112, 121

B
Bayesian approach, 218
BlackBoost, 214
Boosting model, 119–120

C
cART. See Combined antiretroviral therapies (cART)
Causal inference
  counterfactual outcome
    mediation, treatment effect, 8, 9
    post-treatment confounders, RCT, 7–8
    potential outcome, 5–6
    randomization, 5
    selection bias, observational studies, 7
  epidemiology and clinical trials, 4
  ITT, 4
  statistical models
    case-control designs, 10
    causal mediation, 19–20
    MAR mechanism, 9
    matching and propensity score matching, 10–11
    missing data, 9
    MSMs, 12–13
    post-treatment confounders, RCT, 13–19
    sequential ignorability (SI) and model identification, 20, 21
Causal mediation models
  ACME, 20, 23
  causal diagram, 244, 254
  CDF, random variable, 21
  direct effect/natural direct effect, 19
  equivalence of different choices, 260
  estimation of parameters
    general treatment X, 252–253
    maximum likelihood estimation, 250
    moment estimation, 250
    three-value treatment, 251–252
  GMM, 243
  identifiability of parameters
    continuous mediator M, 248–249
    discrete variable M, 248
    general conditions, 246–248
    linear model of M, 249–250
  indirect and direct effects, 241
  logistic regression equation, 258
  LSEM, 21, 22
  matrix Geff, 261
  means of estimates, 255
  mediator–outcome relationship, 242
  moderated-mediation model, 255–257
  necessity and sufficiency theorem, 258–259
  notation and definitions, 243–246
  OLS estimates, 254
  OLS regression, 243
  pure indirect effect, 20
  three linear models, 242
  total effect of treatment, 20
  treatment–outcome relationship, 242
  unobserved pre-treatment confounder, 242
Causal models
  CBT, 187
  continuous measures, 187
  CTQ study
    compliance model, 196–197
    compliance regions, 200
    data, 195–196
    estimated causal effects, 199
    maximum likelihood estimates, 198
    principal effects, 196–197
    two-stage ML approach, 200
  likelihood and inference methods
    compliance regions, 193–195
    contribution, 192
    two-stage approach, 192–193
  placebo-controlled trials, 188
  principal stratification approach, 188
  structural principal effects model
    compliance distributions, 191–192
    ITT effects, 190–191
    notation and assumptions, 189–190
Causal relative risk, 176, 177
CBT. See Cognitive behavioral smoking cessation therapy (CBT)
CD4 cell, 210–211
CDE. See Controlled direct effect (CDE)
CFI. See Comparative fit index (CFI)
Child Resilience Project (CRP), 219, 234–236
Chronic thromboembolic pulmonary hypertension (CTEPH), 104
Cognitive behavioral smoking cessation therapy (CBT), 187
Combined antiretroviral therapies (cART)
  ACTG A5095, 204, 209–211
  AdaBoost, 213
  BlackBoost, 214
  HIV-1 infected patients, 203
  methods
    missing data, 205–207
    two-stage designs, 207–209
    variance estimate, 209
  non-parametric estimator, 212
  simulation studies, 211–212
The Commit to Quit (CTQ) study
  compliance model, 196–197
  compliance regions, 200
  data, 195–196
  estimated causal effects, 199
  maximum likelihood estimates, 198
  principal effects, 196–197
  two-stage ML approach, 200
The Commit to Quit (CTQ) trials, 187
Comparative fit index (CFI), 303
Compliance behavior, 13
Complier average causal effect (CACE), 14
Controlled direct effect (CDE), 19, 269
Cox regression, 104
CRP. See Child Resilience Project (CRP)
CTQ trials. See The Commit to Quit (CTQ) trials
Cumulative distribution function (CDF), 21

D
Data-adaptive matching score, 117–118
DomEXT Baseline, 234–235
Donsker classes, 158–161

E
Empirical processes
  average treatment effect, 161–163
  Donsker classes, 158–161
  estimating equation, 157
  motivation and setup, 157–158
Estimated propensity variable (EPV), 62, 63
Exposure to agents, 7
F
Face-value average causal effect (FACE), 52
Fisher's linear discriminant (LD), 61
Functional response models (FRM), 221

G
Generalized boosted model (GBM), 116–117
Generalized Linear Structural Equation Models (GLSEM), 21
Generalized method of moments (GMM), 243
Genetic Epidemiology Network of Salt Sensitivity (GenSalt) Study
  covariate adjustment, 43–44
  covariates, 40
  outcomes, 39
  parameter estimations, 40, 41
  pre vs. post score matching, 41–43
  propensity score weighting approach, 43
  treatment conditions, 39–40
GMM. See Generalized method of moments (GMM)
Greedy algorithm, 33

H
Heteroscedasticity
  balancing property of PS/PV, 67
  covariance matrices, 66
  linear discriminant, 67
  QD, 67
  simulations, 68–70
High-Risk Youth Demonstration Grant Programs, 95
Homoscedasticity
  asymptotic variance analysis
    EPV, 62, 63
    propensity variable, 62, 63
    sample size, 63
    variance multiplier of coefficient, 63–64
  model construction, 60–61
  precision, propensity analysis, 62
  simulations, 65, 66
Horvitz-Thompson (HT) estimator, 74, 75

I
Important variables stratification (IVS), 128
Intention to treat (ITT) approach, 4, 218, 219
Inverse probability of treatment weights (IPTW), 102
Inverse probability weighting (IPW), 36, 106, 273, 309
ITT approach. See Intention to treat (ITT) approach

J
Jackknife method, 182–183

L
Latent growth modeling (LGM), 296
Likelihood and inference methods
  compliance regions, 193–195
  contribution, 192
  two-stage approach, 192–193
Linear predictor (LP), 60
Linear SEM (LSEM), 21, 22
LISREL formulation, 299–300
Logistic regression
  model construction, 70–71
  propensity analysis, custodial sanctions study, 71–73

M
Mahalanobis distance, 33
Mahalanobis metric matching, 33
Mann-Whitney-Wilcoxon rank sum test, 222
MAR. See Missing at random (MAR)
Marginal structural models (MSMs), 12–13, 274
mboost package, 207, 212
MCAR. See Missing Complete at Random (MCAR)
Mean squared error (MSE), 278
Missing at random (MAR), 9, 31, 302
Missing Complete at Random (MCAR), 235, 296, 308
MMDP. See Monotone missing data patterns (MMDP)
Moderated-mediation model, 255–257
Monotone missing data patterns (MMDP), 227
Monte Carlo (MC) cross-validation criteria, 122–123
Monte Carlo (MC) mean, 278
Monte Carlo (MC) replications, 212
MSE. See Mean squared error (MSE)
MSMs. See Marginal structural models (MSMs)
Multinomial logistic regression (MLR), 115

N
Natural direct effects (NDEs), 268–269
Natural indirect effects (NIEs), 268–269
Newton's method, 170
Nonparametric black-box algorithms, 112
Nonparametric curve regression methods, 37
Nonparametric density estimation, 119