Paul Roback and Julie Legler, Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, 2020.
Paul Roback and Julie Legler

First edition published 2021 by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Contents (excerpt)

Preface

3 Distribution Theory
3.1 Learning Objectives
3.2 Introduction
3.3 Discrete Random Variables
3.3.1 Binary Random Variable
3.3.2 Binomial Random Variable
3.3.3 Geometric Random Variable
3.3.4 Negative Binomial Random Variable
3.3.5 Hypergeometric Random Variable
3.3.6 Poisson Random Variable
3.4 Continuous Random Variables
3.4.1 Exponential Random Variable
3.4.2 Gamma Random Variable
3.4.3 Normal (Gaussian) Random Variable
3.4.4 Beta Random Variable

4 Poisson Regression
4.1 Learning Objectives
4.2 Introduction to Poisson Regression
4.2.1 Poisson Regression Assumptions
4.2.2 A Graphical Look at Poisson Regression
4.3 Case Studies Overview
4.4 Case Study: Household Size in the Philippines
4.4.1 Data Organization
4.4.2 Exploratory Data Analyses
4.4.3 Estimation and Inference
4.4.4 Using Deviances to Compare Models
4.4.5 Using Likelihoods to Fit Models (optional)
4.4.6 Second Order Model
4.4.7 Adding a Covariate
4.4.8 Residuals for Poisson Models (optional)
4.4.9 Goodness-of-Fit
4.5 Linear Least Squares vs. Poisson Regression
4.6 Case Study: Campus Crime
4.6.1 Data Organization
4.6.2 Exploratory Data Analysis
4.6.3 Accounting for Enrollment
4.7 Modeling Assumptions
4.8 Initial Models
4.8.1 Tukey’s Honestly Significant Differences
4.9 Overdispersion
4.9.1 Dispersion Parameter Adjustment
4.9.2 No Dispersion vs. Overdispersion
4.9.3 Negative Binomial Modeling
4.10 Case Study: Weekend Drinking
4.10.1 Research Question
4.10.2 Data Organization
4.10.3 Exploratory Data Analysis
4.10.4 Modeling
4.10.5 Fitting a ZIP Model
4.10.6 The Vuong Test (optional)

Bibliography

Index
Preface
Chapters 8 and 9 are extended to a three-level case study. New ideas include
boundary constraints and exploding numbers of variance components and
fixed effects.
• Chapter 11: Multilevel Generalized Linear Models. This chapter brings
everything together, combining multilevel data with non-normal responses.
Crossed random effects and random effects estimates are both introduced
here.
Three types of exercises are available for each chapter. Conceptual Exercises
ask about key ideas in the contexts of case studies from the chapter and
additional research articles where those ideas appear. Guided Exercises
provide real data sets with background descriptions and lead students step-by-
step through a set of questions to explore the data, build and interpret models,
and address key research questions. Finally, Open-Ended Exercises provide
real data sets with contextual descriptions and ask students to explore key
questions without prescribing specific steps. A solutions manual with solutions
to all exercises will be available to qualified instructors at our book’s website (www.routledge.com).
This work is licensed under a Creative Commons Attribution-NonCommercial-
ShareAlike 4.0 International License.
Acknowledgments. We would like to thank students of Stat 316 at St. Olaf
College since 2010 for their patience as this book has taken shape with their
feedback. We would especially like to thank these St. Olaf students for their
summer research efforts which significantly improved aspects of this book:
Cecilia Noecker, Anna Johanson, Nicole Bettes, Kiegan Rice, Anna Wall, Jack
Wolf, Josh Pelayo, Spencer Eanes, and Emily Patterson. Early editions of this
book also benefitted greatly from feedback from instructors who used these
materials in their classes, including Matt Beckman, Laura Boehm Vock, Beth
Chance, Laura Chihara, Mine Dogucu, and Katie Ziegler-Graham. Finally,
we have appreciated the support of two NSF grants (#DMS-1045015 and
#DMS-0354308) and of our colleagues in the Department of Mathematics,
Statistics, and Computer Science at St. Olaf. We are also thankful to Samantha
Roback for developing the cover image.
1 Review of Multiple Linear Regression
The responses should be normally distributed at each level of the predictors, and the standard deviation of the responses at
each level of the predictors should be approximately equal. After examining
circumstances where inference with LLSR is appropriate, we will look for
violations of these assumptions in other sets of circumstances. These are
settings where we may be able to use the methods of this text. We’ve kept
the examples in the exposition simple to fix ideas. There are exercises which
describe more realistic and complex studies.
1.3 Assumptions for Linear Least Squares Regression

Recall that making inferences or predictions with models fit using linear least
squares regression requires that the following assumptions be tenable. The
acronym LINE can be used to recall the assumptions required for making
inferences and predictions with models based on LLSR. If we consider a simple
linear regression with just a single predictor X, then:
• L: There is a linear relationship between the mean response (Y) and the
explanatory variable (X),
• I: The errors are independent—there’s no connection between how far any
two points lie from the regression line,
• N: The responses are normally distributed at each level of X, and
• E: The variance (or, equivalently, the standard deviation) of the responses is equal for all levels of X.
• L: The mean value for Y at each level of X falls on the regression line.
• I: We’ll need to check the design of the study to determine if the errors
(vertical distances from the line) are independent of one another.
• N: At each level of X, the values for Y are normally distributed.
• E: The spread in the Y’s for each level of X is the same.
It can be argued that the following studies do not violate assumptions for
inference in linear least squares regression. We begin by identifying the response and the explanatory variables, followed by a description of each of the LINE
assumptions in the context of the study, commenting on possible problems
with the assumptions.
There are potential problems with the linearity and equal standard
deviation assumptions. For example, if there is a threshold for the
volume of music where the effect on reaction times remains the
same, mean reaction times would not be a linear function of music volume.
1.4 Case Study: Kentucky Derby

Before diving into generalized linear models and multilevel modeling, we review
key ideas from multiple linear regression using an example from horse racing.
The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill
Downs race track in Louisville, Kentucky. Our data set derbyplus.csv con-
tains the year of the race, the winning horse (winner), the condition of the
track, the average speed (in feet per second) of the winner, and the number of
starters (field size, or horses who raced) for the years 1896-2017 [Wikipedia
contributors, 2018]. The track condition has been grouped into three cate-
gories: fast, good (which includes the official designations “good” and “dusty”),
and slow (which includes the designations “slow”, “heavy”, “muddy”, and
“sloppy”). We would like to use least squares linear regression techniques to
model the speed of the winning horse as a function of track condition, field
size, and trends over time.
The first five and last five rows from our data set are illustrated in Table 1.1.
Note that, in certain cases, we created new variables from existing ones:
• fast is an indicator variable, taking the value 1 for races run on fast
tracks, and 0 for races run under other conditions,
• good is another indicator variable, taking the value 1 for races run under
good conditions, and 0 for races run under other conditions,
• yearnew is a centered variable, where we measure the number of years
since 1896, and
• fastfactor replaces fast = 0 with the description “not fast”, and fast =
1 with the description “fast”. Changing a numeric categorical variable to
descriptive phrases can make plot legends more meaningful.
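The book’s code for building these variables is not shown in this excerpt, but a minimal R sketch might look like the following (the data frame name derby is an assumption; the column names year, condition, speed, and starters are taken from the text and figures):

library(dplyr)

# read the raw data and construct the derived variables described above
derby <- read.csv("derbyplus.csv") %>%
  mutate(fast = ifelse(condition == "fast", 1, 0),
         good = ifelse(condition == "good", 1, 0),
         yearnew = year - 1896,
         fastfactor = ifelse(fast == 0, "not fast", "fast"))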
1.5 Initial Exploratory Analyses
TABLE 1.1: The first five and the last five observations from the Kentucky
Derby case study.
With any statistical analysis, our first task is to explore the data, examining
distributions of individual responses and predictors using graphical and nu-
merical summaries, and beginning to discover relationships between variables.
This should always be done before any model fitting! We must understand our
data thoroughly before doing anything else.
First, we will examine the response variable and each potential covariate
individually. Continuous variables can be summarized using histograms and
statistics indicating center and spread; categorical variables can be summarized
with tables and possibly bar charts.
In Figure 1.2(a), we see that the primary response, winning speed, follows a
distribution with a slight left skew, with a large number of horses winning
with speeds between 53 and 55 feet per second. Plot (b) shows that the number of
starters is mainly distributed between 5 and 20, with the largest number of
races having between 15 and 20 starters.
The primary categorical explanatory variable is track condition, where 88
(72%) of the 122 races were run under fast conditions, 10 (8%) under good
conditions, and 24 (20%) under slow conditions.
FIGURE 1.2: Histograms of key continuous variables. Plot (a) shows winning speeds (ft/s), while plot (b) shows the number of starters.
Examining pairwise relationships in Figure 1.3, we see that higher winning speeds are associated with more recent years,
while the relationship between winning speed and number of starters is less
clear cut. We also see a somewhat strong correlation between year and number
of starters—we should be aware of highly correlated explanatory variables
whose contributions might overlap too much.
Relationships between categorical variables like track condition and continuous
variables can be illustrated with side-by-side boxplots as in the top row, or
with stacked histograms as in the first column. As expected, we see evidence
of higher speeds on fast tracks and also a tendency for recent years to have
more fast conditions. These observed trends can be supported with summary
statistics generated by subgroup. For instance, the mean speed under fast
conditions is 53.6 feet per second, compared to 52.7 ft/s under good conditions
and 51.7 ft/s under slow conditions. Variability in winning speeds, however, is
greatest under slow conditions (SD = 1.36 ft/s) and least under fast conditions
(0.94 ft/s).
Finally, notice that the diagonal illustrates the distribution of individual
variables, using density curves for continuous variables and a bar chart for
categorical variables. Trends observed in the last two diagonal entries match
trends observed in Figure 1.2.
FIGURE 1.3: Scatterplot matrix of track condition, year, number of starters, and winning speed, with density curves and a bar chart along the diagonal and pairwise correlations (year and starters: 0.650; year and speed: 0.717; starters and speed: 0.423).

By using shape or color or other attributes, we can incorporate the effect of a third or even fourth variable into the scatterplots of Figure 1.3. For example,
in the coded scatterplot of Figure 1.4 we see that speeds are generally faster
under fast conditions, but the rate of increasing speed over time is greater
under good or slow conditions.
Of course, any graphical analysis is exploratory, and any notable trends at this stage should be checked through formal modeling. At this point, a statistician begins to ask familiar questions about whether the observed trends are statistically significant and how much variability surrounds our estimates. As you might expect, answers to these questions will arise from proper consideration of variability and properly identified statistical models.
FIGURE 1.4: Linear trends in winning speeds over time, presented separately for fast conditions vs. good or slow conditions.
1.6 Multiple Linear Regression Modeling

We will begin by modeling the winning speed as a function of time; for example, have winning speeds increased at a constant rate since 1896? For this initial model, let Yi be the speed of the winning horse in year i. Then, we might consider Model 1:

Yi = β0 + β1 Yeari + εi, where εi ∼ N(0, σ²).

The parameters β0 and β1 can be estimated through ordinary least squares methods; we will use hats to denote estimates of popu-
lation parameters based on empirical data. Values for β̂0 and β̂1 are selected
to minimize the sum of squared residuals, where a residual is simply the
observed prediction error—the actual winning speed for a given year minus
the winning speed predicted by the model. In the notation of this section,
• Predicted speed: Ŷi = β̂0 + β̂1 Yeari
• Residual (estimated error): ε̂i = Yi − Ŷi
• Estimated variance of points around the line: σ̂² = Σε̂i²/(n − 2)
Using Kentucky Derby data, we estimate β̂0 = 2.05, β̂1 = 0.026, and σ̂ = 0.90.
Thus, according to our simple linear regression model, winning horses of the
Kentucky Derby have an estimated winning speed of 2.05 ft/s in Year 0 (more
than 2000 years ago!), and the winning speed improves by an estimated 0.026
ft/s every year. With an R2 of 0.513, the regression model explains a moderate
amount (51.3%) of the year-to-year variability in winning speeds, and the
trend toward a linear rate of improvement each year is statistically significant
at the 0.05 level (t(120) = 11.251, p < .001).
Note that the only thing that changes from Model 1 to Model 2, which uses the centered yearnew = year − 1896 in place of year, is the estimated intercept; β̂1 , R², and σ̂ all remain exactly the same. Now β̂0 tells us that the
estimated winning speed in 1896 is 51.59 ft/s, but estimates of the linear rate
of improvement or the variability explained by the model remain the same. As
Figure 1.5 shows, centering year has the effect of shifting the y-axis from year
0 to year 1896, but nothing else changes.
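The fitting code is not shown in this excerpt; a sketch consistent with the reported estimates, continuing with the hypothetical derby data frame from above, is:

# Model 1 uses the raw year; Model 2 uses the centered yearnew
model1 <- lm(speed ~ year, data = derby)
model2 <- lm(speed ~ yearnew, data = derby)
coef(model1)  # intercept near 2.05, slope near 0.026
coef(model2)  # intercept near 51.59, identical slope
cat("R squared =", round(summary(model2)$r.squared, 4), "\n")
cat("Residual standard error =", round(summary(model2)$sigma, 4), "\n")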
## R squared = 0.5134
## Residual standard error = 0.9032
FIGURE 1.5: Fitted regression of winning speed on year; centering year shifts the y-axis from year 0 to year 1896.
We should also attempt to verify that our LINE linear regression model as-
sumptions fit for Model 2 if we want to make inferential statements (hypothesis
tests or confidence intervals) about parameters or predictions. Most of these
assumptions can be checked graphically using a set of residual plots as in
Figure 1.6:
• The upper left plot, Residuals vs. Fitted, can be used to check the Linearity
assumption. Residuals should be patternless around Y = 0; if not, there is a
pattern in the data that is currently unaccounted for.
• The upper right plot, Normal Q-Q, can be used to check the Normality
assumption. Deviations from a straight line indicate that the distribution of
residuals does not conform to a theoretical normal curve.
• The lower left plot, Scale-Location, can be used to check the Equal Variance
assumption. Positive or negative trends across the fitted values indicate
variability that is not constant.
• The lower right plot, Residuals vs. Leverage, can be used to check for
influential points. Points with high leverage (having unusual values of the
predictors) and/or high absolute residuals can have an undue influence on
estimates of model parameters.
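In base R, all four panels of Figure 1.6 come directly from the fitted model object; a minimal sketch, continuing with model2 from above:

# arrange the four diagnostic plots in a 2-by-2 grid
par(mfrow = c(2, 2))
plot(model2)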
FIGURE 1.6: Residual plots for Model 2: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage.
In this case, the Residuals vs. Fitted plot indicates that a quadratic fit might
be better than the linear fit of Model 2; other assumptions look reasonable.
Influential points would be denoted by high values of Cook’s Distance; they
would fall outside cutoff lines in the northeast or southeast section of the
Residuals vs. Leverage plot. Since no cutoff lines are even noticeable, there are
no potential influential points of concern.
We recommend relying on graphical evidence for identifying regression model
assumption violations, looking for highly obvious violations of assumptions
before trying corrective actions. While some numerical tests have been devised
for issues such as normality and influence, most of these tests are not very
reliable, being highly influenced by sample size and other factors. There is typically
no residual plot, however, to evaluate the Independence assumption; evidence
for lack of independence comes from knowing about the study design and
methods of data collection. In this case, with a new field of horses each year,
the assumption of independence is pretty reasonable.
Adding a quadratic term in yearnew gives the quadratic model Yi = β0 + β1 Yearnewi + β2 Yearnew²i + εi. This model could suggest, for example, that the rate of increase in winning speeds is slowing down over time. In fact, there is evidence that the quadratic
model improves upon the linear model (see Figure 1.7). R2 , the proportion of
year-to-year variability in winning speeds explained by the model, has increased
from 51.3% to 64.1%, and the pattern in the Residuals vs. Fitted plot of Figure
1.6 has disappeared in Figure 1.8, although normality is a little sketchier in
the left tail, and the larger mass of points with fitted values near 54 appears
to have slightly lower variability. The significantly negative coefficient for β2
suggests that the rate of increase is indeed slowing in more recent years.
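A sketch of this quadratic fit under the same assumptions (model2q is a hypothetical object name):

# add a quadratic term in centered year
model2q <- lm(speed ~ yearnew + I(yearnew^2), data = derby)
summary(model2q)$r.squared  # about 0.641, up from 0.513 for the linear fit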
FIGURE 1.7: Winning speed by year, with the fitted quadratic model overlaid.

FIGURE 1.8: Residual plots for the quadratic model: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage.
Here, it’s easy to see the meaning of our slope and intercept by writing out
separate equations for the two conditions:
• Good or slow conditions (fast = 0): Yi = β0 + εi
• Fast conditions (fast = 1): Yi = (β0 + β1) + εi
β0 is the expected winning speed under good or slow conditions, while β1 is the
difference between expected winning speeds under fast conditions vs. non-fast
conditions. According to our fitted Model 3, the estimated winning speed
under non-fast conditions is 52.0 ft/s, while mean winning speeds under fast
conditions are estimated to be 1.6 ft/s higher.
## R squared = 0.3236
## Residual standard error = 1.065
The beauty of the linear regression framework is that we can add explanatory
variables in order to explain more variability in our response, obtain better and
more precise predictions, and control for certain covariates while evaluating
the effect of others. For example, we could consider adding yearnew to Model
3, which has the indicator variable fast as its only predictor. In this way, we
would estimate the difference between winning speeds under fast and non-fast
conditions after accounting for the effect of time. As we observed in Figure 1.3,
recent years have tended to have more races under fast conditions, so Model 3
might overstate the effect of fast conditions because winning speeds have also
increased over time. A model with terms for both year and track condition
will estimate the difference between winning speeds under fast and non-fast
conditions for a fixed year; for example, if it had rained in 2016 and turned
the track muddy, how much would we have expected the winning speed to
decrease?
Model 4 adds the centered year term to Model 3:

Yi = β0 + β1 Yearnewi + β2 Fasti + εi, where εi ∼ N(0, σ²),

and linear least squares regression (LLSR) provides the following parameter estimates: β̂0 = 50.918, β̂1 = 0.0226, and β̂2 = 1.227.
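A sketch of Model 4 in R, under the same assumptions as the earlier code sketches:

# winning speed as a function of centered year and the fast indicator
model4 <- lm(speed ~ yearnew + fast, data = derby)
coef(summary(model4))  # estimates, standard errors, t statistics, p-values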
So far we have been using linear regression for descriptive purposes, which is
an important task. We are often interested in issues of statistical inference as
well—determining if effects are statistically significant, quantifying uncertainty
in effect size estimates with confidence intervals, and quantifying uncertainty
in model predictions with prediction intervals. Under LINE assumptions, all
of these inferential tasks can be completed with the help of the t-distribution
and estimated standard errors.
Here are examples of inferential statements based on Model 4:
• We can be 95% confident that average winning speeds under fast conditions
are between 0.93 and 1.53 ft/s higher than under non-fast conditions, after
accounting for the effect of year.
• Fast conditions lead to significantly faster winning speeds than non-fast
conditions (t = 8.14 on 119 df, p < .001), holding year constant.
• Based on our model, we can be 95% confident that the winning speed in 2017
under fast conditions will be between 53.4 and 56.3 ft/s. Note that Always
Dreaming’s actual winning speed barely fit within this interval—the 2017
winning speed was a borderline outlier on the slow side.
confint(model4)
2.5 % 97.5 %
(Intercept) 50.61169 51.22395
yearnew 0.01878 0.02638
fast 0.92840 1.52529
Remember that you must check LINE assumptions using the same residual plots
as in Figure 1.6 to ensure that the inferential statements in the previous section
are valid. In cases when model assumptions are shaky, one alternative approach
to statistical inference is bootstrapping; in fact, bootstrapping is a robust
approach to statistical inference that we will use frequently throughout this
book because of its power and flexibility. In bootstrapping, we use only the data
we’ve collected and computing power to estimate the uncertainty surrounding
our parameter estimates. Our primary assumption is that our original sample
represents the larger population, and then we can learn about uncertainty in
our parameter estimates through repeated samples (with replacement) from
our original sample.
• take a bootstrap sample of 122 races, sampled with replacement from our original sample of 122 races.
• fit Model 4 to the bootstrap sample, saving β̂0 , β̂1 , and β̂2 .
• repeat the two steps above a large number of times (say 1000).
• the 1000 bootstrap estimates for each parameter can be plotted to show the
bootstrap distribution (see Figure 1.9).
• a 95% confidence interval for each parameter can be found by taking the
middle 95% of each bootstrap distribution—i.e., by picking off the 2.5 and
97.5 percentiles. This is called the percentile method.
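The book’s own bootstrap code, which produced the tibble shown below, is not included in this excerpt; a minimal base-R sketch of case resampling with percentile intervals might look like this (the seed and B = 1000 are arbitrary choices):

set.seed(413)  # arbitrary seed for reproducibility
B <- 1000
boot_est <- replicate(B, {
  # resample the 122 races (cases) with replacement and refit Model 4
  idx <- sample(nrow(derby), replace = TRUE)
  coef(lm(speed ~ yearnew + fast, data = derby[idx, ]))
})
# percentile method: middle 95% of each bootstrap distribution
apply(boot_est, 1, quantile, probs = c(0.025, 0.975))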
# A tibble: 3 x 3
term low high
<chr> <dbl> <dbl>
1 (Intercept) 50.6 51.3
2 fast 0.909 1.57
3 yearnew 0.0182 0.0265
In this case, we see that 95% bootstrap confidence intervals for β0 , β1 , and
β2 are very similar to the normal-theory confidence intervals we found earlier.
For example, the normal-theory confidence interval for the effect of fast tracks
is 0.93 to 1.53 ft/s, while the analogous bootstrap confidence interval is 0.91
to 1.57 ft/s.
There are many variations on this bootstrap procedure. For example, you could
sample residuals rather than cases, or you could conduct a parametric bootstrap
in which error terms are randomly chosen from a normal distribution. In
addition, researchers have devised other ways of calculating confidence intervals
besides the percentile method, including normality, studentized, and bias-
corrected and accelerated methods (Hesterberg [2015]; Efron and Tibshirani
[1993]; Davison and Hinkley [1997]). We will focus on case resampling and
percentile confidence intervals for now for their understandability and wide
applicability.
FIGURE 1.9: Bootstrap distributions of the estimates for (Intercept), fast, and yearnew from Model 4, based on 1000 bootstrap samples.
Yi = β0 + β1 Yearnewi + β2 Fasti + β3 Yearnewi × Fasti + εi, where εi ∼ N(0, σ²)
Interpretations of model coefficients are most easily seen by writing out separate
equations for fast and non-fast track conditions:
Fast = 0: Ŷi = 50.53 + 0.031 Yearnewi
Fast = 1: Ŷi = (50.53 + 1.83) + (0.031 − 0.011) Yearnewi
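A sketch of the interaction model in R; the four coefficients correspond to the estimates above (50.53, 0.031, 1.83, and −0.011):

# separate intercepts and slopes for fast vs. non-fast conditions
model5 <- lm(speed ~ yearnew + fast + yearnew:fast, data = derby)
coef(model5)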
We now begin iterating toward a “final model” for these data, on which we will base conclusions. A typical “final multiple linear regression model” balances parsimony against quality of fit, reasonably satisfies the underlying assumptions, and addresses the primary research questions.
Although the process of reporting and writing up research results often de-
mands the selection of a sensible final model, it’s important to realize that (a)
statisticians typically will examine and consider an entire taxonomy of models
when formulating conclusions, and (b) different statisticians sometimes select
different models as their “final model” for the same set of data. Choice of a
“final model” depends on many factors, such as primary research questions,
purpose of modeling, tradeoff between parsimony and quality of fitted model,
underlying assumptions, etc. Modeling decisions should never be automated
or made completely on the basis of statistical tests; subject area knowledge
should always play a role in the modeling process. You should be able to
defend any final model you select, but you should not feel pressured to find
the one and only “correct model”, although most good models will lead to
similar conclusions.
Several tests and measures of model performance can be used when comparing
different models for model building:
• extra sum of squares F test. This is a generalization of the t-test for individual
model coefficients which can be used to perform significance tests on nested
models, where one model is a reduced version of the other. For example, we
could test whether our final model (below) really needs to adjust for track
condition, which is comprised of indicators for both fast condition and good
condition (leaving slow condition as the reference level). Our null hypothesis
is then β3 = β4 = 0. We have statistically significant evidence (F = 57.2 on 2
and 116 df, p < .001) that track condition is associated with winning speeds,
after accounting for quadratic time trends and number of starters.
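In R, a nested-model test like this can be run with anova(); a sketch under the same assumptions as the earlier code, with starters as the assumed column name for field size:

# reduced model drops the two track-condition indicators
reduced <- lm(speed ~ yearnew + I(yearnew^2) + starters, data = derby)
full <- lm(speed ~ yearnew + I(yearnew^2) + starters + fast + good,
           data = derby)
anova(reduced, full)  # F = 57.2 on 2 and 116 df, per the text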
One potential final model for predicting winning speeds of Kentucky Derby races is:

Yi = β0 + β1 Yearnewi + β2 Yearnew²i + β3 Fasti + β4 Goodi + β5 Startersi + εi, where εi ∼ N(0, σ²).

In this model, after adjusting for time trends and track condition, a larger field is associated with slower winning times (unlike the positive relationship we saw between speed and number of starters in our exploratory analyses).
The model explains 82.7% of the year-to-year variability in winning speeds,
and residual plots show no serious issues with LINE assumptions. We tested
interaction terms for different effects of time or number of starters based on
track condition, but we found no significant evidence of interactions.
1.7.1 Soccer
Roskes et al. [2011] The right side? Under time pressure, approach motivation
leads to right-oriented bias. Psychological Science [Online] 22(11):1403-7. DOI:
10.1177/0956797611418677, October 2011.
The response for this analysis is the direction of the goalkeeper dive, a binary
variable. For example, you could let Y=1 if the dive is to the right and Y=0
if the dive is to the left. This response is clearly not normally distributed.
One approach to the analysis is logistic regression as described in Chapter
6. A binomial random variable could also be created for this application by
summing the binary variables for each game so that Y= the number of dives
right out of the number of dives the goalkeeper makes during a game. [Thought
question: Do you believe the last line of the abstract?]
Poole [1989] Mate guarding, reproductive success and female choice in African
elephants. Animal Behavior 37:842-49.
Poole and her colleagues recorded, for each male elephant, his age (in years)
and the number of matings for a given year. The researchers were interested
in how age affects the males’ mating patterns. Specifically, questions concern
whether there is a steady increase in mating success as an elephant ages or if
there is an optimal age after which the number of matings declines. Because
the responses of interest are counts (number of matings for each elephant for a
given year), we will consider a Poisson regression (see Chapter 4). The general
form for Poisson responses is the number of events for a specified time, volume,
or space.
The response for this study is a gang activity measure which ranges from 1
to 100. While it may be reasonable to assume this measure is approximately
normal, the structure of this data implies that it is not a simple regression
problem. Individual students have measurements made at 8 different points in
time. We cannot assume that we have 2400 independent observations because
repeated measurements on one individual are likely to be more similar to one another than to measurements on other students. Multilevel modeling as discussed in Chapter
9 can often be used in these situations.
1.7.4 Crime
1.8 Exercises
A student collected data from the restaurant where she was a waitress [Dahlquist and Dong, 2011]. The student was
interested in learning under what conditions a waitress can expect
the largest tips—for example: At dinner time or late at night? From
younger or older patrons? From patrons receiving free meals? From
patrons drinking alcohol? From patrons tipping with cash or credit?
And should tip amount be measured as total dollar amount or as
a percentage? Data can be found in TipData.csv. Here is a quick
description of the variables collected:
• Day = day of the week
• Meal = time of day (Lunch, Dinner, Late Night)
• Payment = how bill was paid (Credit, Cash, Credit with Cash tip)
• Party = number of people in the party
• Age = age category of person paying the bill (Yadult, Middle, SenCit)
• GiftCard = was gift card used?
• Comps = was part of the meal complimentary?
• Alcohol = was alcohol purchased?
• Bday = was a free birthday meal or treat given?
• Bill = total size of the bill
• W.tip = total amount paid (bill plus tip)
• Tip = amount of the tip
• Tip.Percentage = proportion of the bill represented by the tip
2 Beyond Least Squares: Using Likelihoods
In this instance we’ll use logistic regression instead of linear least squares
regression. Fitting a logistic regression requires the use of likelihood meth-
ods. Another setting where likelihood methods come into play is when data
2.2 Case Study: Does Sex Run in Families?

Doesn’t it seem that some families tend to have lots of boys, while others have
more than their fair share of girls? Is it really the case that each child a human couple produces is equally likely to be male or female? Or does sex run in
families? It can be argued that these kinds of questions have implications for
population demographics and sibling harmony. For example, a 2009 study at
the University of Ulster in Northern Ireland found that growing up with sisters,
as compared to brothers, can enhance the quality of life of an adult [BBC
News, 2009].
Sibling harmony aside, why do people care about gender imbalance? Compar-
isons of sex ratios between countries illustrate some compelling reasons. Some
think that genetic or biological influences within families, such as “sex running
in families,” can affect sex ratios. Mating behavior such as waiting until the
family includes a boy or both sexes affects sex ratios. Some believe that sex
ratios point to the practice of sex selection in a country accomplished through
abortion or infanticide. Furthermore, there is speculation that an excess of
men could lead to unrest among young males unable to find marriage partners
or start families.
In 1930, statistician R.A. Fisher posited a 50:50 equilibrium theory regarding
sex ratios in terms of parental expenditure. Most often, in practice, sex ratios
differ from what Fisher predicted. From 1970 to 2002, the sex ratio at birth in
the US among white non-Hispanics was 105 boys to 100 girls, but only 103 boys
to 100 girls among African Americans and Native Americans [Mathews and
Hamilton, 2005]. A 1997 study in Nature reports evidence which suggests that
the human sex ratio may be currently shifting in the United States toward more
female babies, closer to Fisher’s prediction! [Komdeur et al., 1997] Sex ratio
comparisons between countries are also intriguing. For example, Switzerland
has a sex ratio of 106 boys to 100 girls, whereas there are 112 boys to every 100
girls in China according to The World Factbook [Central Intelligence Agency,
2013]. In the next section, we bring the notion of gender imbalance closer to
home by focusing on families instead of countries or sub-populations.
To investigate this question and others, we look at the gender composition of
5,626 families collected by the National Longitudinal Survey of Youth [Bureau
of Labor Statistics, 1997]. We fit models to explore whether there is evidence
sex runs in families, a model we refer to as a Sex Conditional Model. We also
consider a separate but related question about whether couples are “waiting
for a boy” [Rodgers and Doughty, 2001].
We specify several models related to gender balance in families. Our models liken
having babies to flipping a coin (heads=boy, tails=girl), of course, recognizing
that in truth there is a little more to having babies. The baseline model (Model
0) assumes that the probability of a boy is the same as the probability of a
girl. The first model (Model 1) considers the situation that the coin is loaded
and the probability of heads (a boy) is different than the probability of tails
(a girl). Next, we consider a model (Model 2) that conditions on the previous
number of boys or girls in a family to get at the question of whether sex runs
42 2 Beyond Least Squares: Using Likelihoods
in families. This data is also used for a different set of models that relate to
couples’ behavior. Specifically, we look to see if there is evidence that couples
are waiting for a boy. Searching for evidence of waiting for a girl, or waiting
for both a boy and a girl, is left to the exercises.
Models 0 and 1 assume that having children is like flipping a coin. The gender
of each child is independent of the gender of other children and the probability
of a boy is the same for each new child. Let pB be the probability a child is a
boy.
For the Sex Unconditional models, having children is modeled using coin flips.
With a coin flip model, the result of each flip is independent of results of other
flips. With this version of the Sex Unconditional Model, the chance that a
baby is a boy is specified to be pB = 0.5. It makes no difference whether the first and third children are boys: the probability that the second child is a boy is 0.5;
that is, the results for each child are independent of the others. Under this
model you expect to see equal numbers of boys and girls.
2.4 Model 1: Sex Unconditional, Unequal Probabilities

You may want your model to allow for the probability of a boy, pB , to be
something different than 0.5. With this version of the Sex Unconditional model,
pB > 0.5 or pB < 0.5 or pB = 0.5, in which case you expect to see more
boys than girls or fewer boys than girls or equal numbers of boys and girls,
respectively. We would retain the assumption of independence; that is, the
probability of a boy, pB , is the same for each child. Seeing a boy for the first
child will not lead you to change the probability that the second child is a boy;
this would not imply that “sex runs in families.”
As is often the case in statistics, our objective is to find an estimate for a model
parameter using our data; here, the parameter to estimate is the probability of
a boy, pB , and the data is the gender composition for each family. One way in
which to interpret probability is to imagine repeatedly producing children. The
probability of a boy will be the overall proportion of boys as the number of
children increases. With likelihood methods, conceptually we consider different
possible values for our parameter(s), pB , and determine how likely we would be
to see our observed data in each case, Lik(pB ). We’ll select as our estimate the
value of pB for which our data is most likely. A likelihood is a function that
tells us how likely we are to observe our data for a given parameter value, pB .
For a single family which has a girl followed by two boys, GBB, the likelihood function looks like:

Lik(pB) = (1 − pB)pB²
FIGURE 2.2: Likelihood function Lik(pB) for the family GBB across possible values of pB, with likelihood values of 0.063 at pB = 0.3 and 0.144 at pB = 0.6 marked.
From the likelihood in Figure 2.2, when pB = 0.3 we see a family of a girl followed by two boys 6.3% (0.7 · 0.3²) of the time. However, it indicates that we are much more likely to see our data if pB = 0.6, where the likelihood of GBB is 0.4 · 0.6² or 14.4%.
If the choice was between 0.3 and 0.6 for an estimate of pB , we’d choose 0.6.
The “best” estimate of pB would be the value where we are most likely to
see our data from all possible values between 0 and 1, which we refer to as
the maximum likelihood estimate or MLE. We can approximate an MLE
using graphical or numerical approaches. Graphically, here it looks like the
MLE is just above 0.6. In many, but not all, circumstances, we can obtain an
MLE exactly using calculus. In this simple example, the MLE is 2/3. This is
consistent with our intuition since 2 out of the 3 children are boys.
Suppose another family consisting of three girls is added to our data set.
We’ve already seen that the Sex Unconditional Model multiplies probabilities
to construct a likelihood because children are independent of one another.
Extending this idea, families can be assumed to be independent of one another
so that the likelihood for both families can be obtained by multiplication. With two families (GBB and GGG) our likelihood is now:

Lik(pB) = [(1 − pB)pB²] · (1 − pB)³ = pB²(1 − pB)⁴
A plot of this likelihood appears in Figure 2.3. It is right skewed with an MLE
at approximately 0.3. Using calculus, we can show that the MLE is precisely
1/3 which is consistent with intuition given the 2 boys and 4 girls in our data.
FIGURE 2.3: Likelihood function for the data of 2 families (GBB and GGG).
The solid line is at the MLE, pB = 1/3.
Turning now to our hypothetical data with 30 families who have a total of
50 children, we can create the likelihood contribution for each of the family
compositions.
The likelihood function for the hypothetical data set can be found by taking
the product of the entries in the last column of Table 2.2 and simplifying.
It should be obvious that the likelihood for this Sex Unconditional Model (the
coin flipping model) has the simple form:
Lik(pB) = pB^nBoys · (1 − pB)^nGirls

TABLE 2.2: The likelihood factors for the hypothetical data set of n = 50 children.
Figure 2.4(a) is the likelihood for the data set of 50 children. The height of
each point is the likelihood and the possible values for pB appear across the
horizontal axis. It appears that our data is most likely when pB = 0.6 as we
would expect. Note that the log of the likelihood function in Figure 2.4(b) is
maximized at the same spot: pB = 0.6; we will see advantages of using log
likelihoods a bit later. Figures 2.4(c) and (d) are also maximized at pB = 0.6,
but they illustrate less variability and a sharper peak since there is more data
(although the same proportions of boys and girls).
Here a grid search is used with the software package R to find maximum
likelihood estimates, something that can be done with most software. A grid
search specifies a set of finite possible values for pB and then the likelihood,
Lik(pB ), is computed for each of the possible values. First, we define a relatively
coarse grid by specifying 50 values for pB and then computing how likely we
would see our data for each of these possible values. The second example uses
a finer grid, 1,000 values for pB , which allows us to determine a better (more
precise) approximation of the MLE. In addition, most packages, like R, have built-in functions for numerically maximizing a function such as the log-likelihood, as illustrated below.
FIGURE 2.4: (a) Likelihood and (b) log-likelihood for 30 boys and 20 girls; (c) likelihood and (d) log-likelihood for 600 boys and 400 girls, across possible values of pB.
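The definition of Lik.f is not included in this excerpt; the sketch below is one reconstruction that reproduces the reported values, assuming an evenly spaced grid on [0, 1] and four significant digits of printed output:

options(digits = 4)
Lik.f <- function(nBoys, nGirls, nGrid) {
  # grid of candidate values for p_B
  pGrid <- seq(from = 0, to = 1, length.out = nGrid)
  # likelihood of nBoys boys and nGirls girls at each candidate value
  lik <- pGrid^nBoys * (1 - pGrid)^nGirls
  # return the candidate value where the likelihood is largest
  pGrid[which.max(lik)]
}
# coarse grid: 50 candidate values for p_B
Lik.f(nBoys = 30, nGirls = 20, nGrid = 50)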
## [1] 0.5918
# more precise MLE for p_B based on finer grid (more points)
Lik.f(nBoys = 30, nGirls = 20, nGrid = 1000)
## [1] 0.5996
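The call producing the output below is also missing from this excerpt; it is consistent with applying R’s built-in optimize() to the log-likelihood, as in this sketch:

# maximize the log-likelihood for 30 boys and 20 girls over (0, 1)
optimize(function(p) 30 * log(p) + 20 * log(1 - p),
         interval = c(0, 1), maximum = TRUE)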
## $maximum
## [1] 0.6
##
## $objective
## [1] -33.65
Calculus may provide another way to determine an MLE. Here, we can ascertain
the value of pB where the likelihood is a maximum by using the first derivative
of the likelihood with respect to pB . We obtain the first derivative using the
Product Rule, set it to 0, solve for pB , and verify that a maximum occurs
there.
Lik(pB) = pB³⁰(1 − pB)²⁰

d/dpB [pB³⁰(1 − pB)²⁰] = 30pB²⁹(1 − pB)²⁰ − 20pB³⁰(1 − pB)¹⁹ = 0

Solving gives p̂B = 30/50 = 0.6, matching our intuition and the numerical results above.
2.4.3 Summary

2.5 Model 2: Sex Conditional

Under the Sex Conditional Model, the probability that a child is a boy depends on the gender composition of the existing family. Let pB|N represent the probability the next child is a boy if the family is sex-neutral; i.e., there are equal numbers of boys and girls in the existing family. Let pB|B Bias represent the probability the next child is a boy if the family is boy-biased; i.e., there are more boys than girls prior to this child. Similarly, let pB|G Bias represent the probability the next child is a boy if the family is girl-biased; i.e., there are more girls than boys prior to this child.
Before we are mired in notation and calculus, let’s think about how these
conditional probabilities can be used to describe sex running in families. While
we only had one parameter, pB , to estimate in the Sex Unconditional Model,
here we have three parameters: pB|N , pB|B Bias , and pB|G Bias . Clearly if all
three of these probabilities are equal, the probability a child is a boy does
not depend upon the existing gender composition of the family and there
is no evidence of sex running in families. A conditional probability pB|B Bias
that is larger than pB|N suggests families with more boys are more likely
to produce additional boys in contrast to families with equal boys and girls.
This finding would support the theory of “boys run in families.” An analogous
argument holds for girls. In addition, comparisons of pB|B Bias and pB|G Bias
to the parameter estimate pB from the Sex Unconditional Model may be
interesting and can be performed using likelihoods.
While it may seem that including families with a single child (singletons) would
not be helpful for assessing whether there is a preponderance of one sex or
another in families, in fact singleton families would be helpful in estimating
pB|N because singletons join “neutral families.”
Using the family composition data for 50 children in the 30 families that appear in Table 2.3, we construct a likelihood. The six singleton families with only one boy contribute pB|N⁶ to the likelihood and the seven families with only one girl contribute pG|N⁷ or (1 − pB|N)⁷. [Why do we use 1 − pB|N instead of pG|N?] There are five families with two boys, each with probability (pB|N)(pB|B Bias), contributing:

[(pB|N)(pB|B Bias)]⁵
We construct the likelihood using data from all 30 families assuming families
are independent to get:
Lik(pB|N , pB|B Bias , pB|G Bias ) = (pB|N)¹⁷(1 − pB|N)¹⁵(pB|B Bias)⁵(1 − pB|B Bias)⁴(pB|G Bias)⁸(1 − pB|G Bias)¹   (2.2)
A couple of points are worth noting. First, there are 50 factors in the likelihood
corresponding to the 50 children in these 30 families. Second, in the Sex
Unconditional example we only had one parameter, pB ; here we have three
parameters. This likelihood does not simplify like the Sex Unconditional Model
to one that is a product of only two powers: one of pB and the other of 1 − pB .
Yet, the basic idea we discussed regarding using a likelihood to find parameter
estimates is the same. To obtain the MLEs, we need to find the combination
of values for our three parameters where the data is most likely to be observed.
Conceptually, we are trying different combinations of possible values for these
three parameters, one after another, until we find the combination where the
likelihood is a maximum. It will not be as easy to graph this likelihood and we
will need multivariable calculus to locate the optimal combination of parameter
values where the likelihood is a maximum. In this text, we do not assume
you know multivariable calculus, but we do want you to retain the concepts
associated with maximum likelihood estimates. In practice, we use software to
obtain MLEs.
With calculus, we can take partial derivatives of the likelihood with respect to
each parameter assuming the other parameters are fixed. As we saw in Section
2.4.2.3, differentiating the log of the likelihood often makes things easier. This
same approach is recommended here. Set each partial derivative to 0 and solve
for all parameters simultaneously.
Knowing that it is easier to work with log-likelihoods, let’s take the log of the likelihood we constructed in Equation (2.2):

log(Lik) = 17 log(pB|N) + 15 log(1 − pB|N) + 5 log(pB|B Bias) + 4 log(1 − pB|B Bias) + 8 log(pB|G Bias) + log(1 − pB|G Bias)

Setting the partial derivative with respect to pB|N equal to zero gives

17/pB|N − 15/(1 − pB|N) = 0

so that

p̂B|N = 17/32 = 0.53
This estimate follows naturally. First consider all of the children who enter
into a family with an equal number of boys and girls. From Table 2.3, we
can see there are 32 such children (30 are first kids and 2 are third kids in
families with 1 boy and 1 girl). Of those children, 17 are boys. So, given that
a child joins a sex-neutral family, the chance they are a boy is 17/32. Similar
calculations for pB|B Bias and pB|G Bias yield p̂B|B Bias = 5/9 = 0.56 and p̂B|G Bias = 8/9 = 0.89.
If we anticipate any “sex running in families” effect, we would expect pB|B Bias
to be larger than the probability of a boy in the neutral setting, pB|N . In our
small hypothetical example, p̂B|B Bias is slightly greater than 0.53, providing
light support for the “sex runs in families” theory when it comes to boys. What
about girls? Do families with more girls than boys tend to have a greater
probability of having a girl? We found that the MLE for the probability of
a girl in a girl-biased setting is 1 − 0.89 = 0.11.¹ This data does not provide evidence that girls run in families since p̂G|G Bias = 0.11 < p̂G|N = 0.47; there
is a markedly lower probability of a girl if the family is already girl biased.
This data is, however, hypothetical. Let’s take a look at some real data and
see what we find.
You should now have a feel for using the Likelihood Principle to obtain estimates of parameters using family gender composition data. Next, these ideas are applied to real family composition data from the NLSY.
¹Note: A nice property of MLEs is demonstrated here. We have the MLE for pB|G Bias , and we want the MLE of pG|G Bias = 1 − pB|G Bias . We can get it by replacing pB|G Bias with its MLE; i.e., p̂G|G Bias = 1 − p̂B|G Bias . In mathematical terms, you can get the MLE of a function by applying the function to the original MLE.
2.6 Case Study: Analysis of the NLSY Data

TABLE 2.4: Number of families and children in families with given composition in NLSY data. Sex ratio and proportion males are given by family size.
Table 2.4 displays family composition data for the 5,626 families with one, two,
or three children in the NLSY data set. This data set includes 10,672 children.
Because our interest centers on the proportion of males, let’s calculate sex
ratios and proportions of males for each family size. For one-child families the
male to female ratio is less than one (97 males:100 females), whereas both
two- and three-child families have ratios of 104 boys to 100 girls, what we may
expect in a population which favors males. While our research questions do
not specifically call for these measures stratified by family size, they still provide us with an idea of gender imbalance in the data.
Table 2.5 provides insight into whether sex runs in families if the probability of a boy is 0.5.
TABLE 2.5: Proportion of families in NLSY data with all the same sex by
number of children in the family. Note that 1-child families are all homogeneous
with respect to sex, so we look at 2- and 3-child families.
Number of children Number of families Number with all same sex Percent with same sex
Two Children 2444 1112 45%
Three Children 1301 345 27%
TABLE 2.6: Proportion of families in NLSY data with only one boy who is
born last.
Number of children Number of families Number with one boy last Percent with boy last
One Child 1881 930 49.4%
Two Children 2444 666 27.2%
Three Children 1301 125 8.6%
Simple probability suggests that the percentage of 2-child families with all the same sex would be 50% (BB or GG vs. BG or GB) but in our data
we see only 45%. For 3-child families, we have 8 possible orderings of boys and
girls and so we would expect 2 out of the 8 orderings (25%) to be of the same
sex (BBB or GGG), but in fact 27% of the 3-child families have all the same sex. These results do not provide overwhelming evidence of sex running in
families. There are some potentially complicating factors: the probability of a
boy may not be 0.5 or couples may be waiting for a boy or a girl or both.
Table 2.6 contains the number of families by size and the percentage of those
which are families with one boy who is last. Some of these families may have
“waited” for a boy and then quit childbearing after a boy was born. We see
the proportion of one-child families with a boy is slightly less than the 50%
expected. We’d expect one out of four, or 25%, of 2-child family configurations
to have one boy last, and we see 27.2% in our dataset. Only 8.6% of 3-child
families have one boy last, but in theory we would expect one out of eight or
12.5% of 3-child families to have one boy last. So if, in fact, the probability of
a boy is 50%, there does not appear to be evidence supporting the notion that
families wait for a boy.
There are many other ways to formulate and explore the idea that sex runs in
families or that couples wait for a boy (or a girl). See Rodgers and Doughty
[2001] for other examples.
We construct a likelihood for the Sex Unconditional Model for the one-, two-
and three-child families from the NLSY. See Table 2.4 for the frequencies of
each gender composition.
TABLE 2.7: Contributions to the likelihood function for the Sex Uncondi-
tional Model for a sample of family compositions from the NLSY data.
Now we create the entire likelihood for our data under the Sex Unconditional Model:

Lik(pB) = pB⁵⁴¹⁶(1 − pB)⁵²⁵⁶
This very simple likelihood implies that each child contributes a factor of
the form pB or 1 − pB . Given that there are 10,672 children, what would be
your best guess of the estimated probability of a boy for this model? We can
determine the MLE for pB using our previous work.
p̂B = nBoys/(nBoys + nGirls) = 5416/(5416 + 5256) = 0.507
The contribution to a Sex Conditional Model likelihood for the same family
compositions we considered in the previous section appear in Table 2.8.
The products of the last three columns of Table 2.9 provide the likelihood
contributions for the Sex Conditional Model for all of the one-, two- and
three-child NLSY families. We write the likelihood as a function of the three
parameters pB|N , pB|B Bias , and pB|G Bias .
TABLE 2.8: Contributions to the likelihood function for the Sex Conditional
Model for a sample of family compositions from the NLSY data.
To find the MLE for the probability of a boy entering a sex-neutral family (a family with equal boys and girls), pB|N , we begin with the logarithm of the likelihood in equation (2.3). Differentiating the log-likelihood with respect to pB|N , holding all other parameters constant, yields an intuitive estimate:

p̂B|N = 3161/(3161 + 3119) = 0.5033
There are 6,280 times when a child is joining a neutral family and, of those
times, 3,161 are boys. Thus the MLE of the probability of a boy joining a
family where the numbers of boys and girls are equal (including when there
are no children) is 0.5033.
Similarly, MLEs for pB|B Bias and pB|G Bias can be obtained:

p̂B|B Bias = 1131/(1131 + 1164) = 0.4928

p̂B|G Bias = 1124/(1124 + 973) = 0.5360
Are these results consistent with the notion that boys or girls run in families?
We consider the Sex Conditional Model because we hypothesized there would
be a higher probability of boys among children born into families with a
boy bias. However, we found that, if there is a boy bias, the probability of a
subsequent boy was estimated to be actually less (0.493) than the probability
of a subsequent girl. Similarly, girls join families with more girls than boys
approximately 46.4% of the time so that there is little support for the idea
that either “girls or boys run in families.”
Even though initial estimates don’t support the idea, let’s formally take a look
as to whether prior gender composition affects the probability of a boy. To do
so, we’ll see if the Sex Conditional Model is statistically significantly better
than the Sex Unconditional Model.
Likelihoods are not only useful for fitting models, but they are also useful
when comparing models. If the parameters for a reduced model are a subset of
parameters for a larger model, we say the models are nested, and the difference in their fits can be formally tested with a likelihood ratio test (LRT).
If the parameters are not nested, comparing models with the likelihood can still
be useful but will take a different form. We’ll see that the Akaike Information
Criterion (AIC) and Bayesian Information Criterion (BIC) are functions of the
log-likelihood that can be used to compare models even when the models are
not nested. Either way we see that this notion of likelihood is pretty useful.
Hypotheses

H0 : pB|N = pB|B Bias = pB|G Bias . (Sex Unconditional Model) The probability of a boy does not depend on the prior family composition.

HA : At least one parameter from pB|N , pB|B Bias , pB|G Bias differs from the others. (Sex Conditional Model) The probability of a boy does depend on the prior family composition.
We start with the idea of comparing the likelihoods or, equivalently, the log-
likelihoods of each model at their maxima. To do so, we use the log-likelihoods
to determine the MLEs, and then replace the parameters in the log-likelihood
with their MLEs, thereby finding the maximum value for the log-likelihood of
each model. Here we will refer to the first model, the Sex Unconditional Model,
as the reduced model, noting that it has only a single parameter, pB . The
more complex model, the Sex Conditional Model, has three parameters and
is referred to here as the larger (full) model. We’ll use the MLEs derived
earlier in Section 2.6.4.
The maximum of the log-likelihood for the reduced model can be found by
replacing pB in the log-likelihood with the MLE of pB , 0.5075.
The maximum of the log-likelihood for the larger model can be found by
replacing pB|N , pB|B Bias , pB|G Bias in the log-likelihood with 0.5033, 0.4928,
and 0.5360, respectively.
Intuitively, when the likelihood for the larger model is much greater than it
is for the reduced model, we have evidence that the larger model is more
closely aligned with the observed data. This isn’t really a fair comparison on
the face of it. We need to account for the fact that more parameters were
estimated and used for the larger model. That is accomplished by taking into
account the degrees of freedom for the χ2 distribution. The expected value of
the χ2 distribution is its degrees of freedom. Thus when the difference in the
number of parameters is large, the test statistic will need to be much larger to
convince us that it is not simply chance variation with two identical models.
Here, under the reduced model we’d expect our test statistic to be 2, when in
fact it is over 9. The evidence favors our larger model. More precisely, the test
statistic is 2(−7391.448 + 7396.073) = 9.238 (p = .0099), where the p-value is
the probability of obtaining a value above 9.238 from a χ2 distribution with 2
degrees of freedom.
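As a minimal R sketch (not from the text), the test statistic and p-value can be reproduced from the two maximized log-likelihoods quoted above:

# Maximized log-likelihoods quoted in the text
loglik_reduced <- -7396.073   # Sex Unconditional Model (1 parameter)
loglik_full    <- -7391.448   # Sex Conditional Model (3 parameters)
lrt <- 2 * (loglik_full - loglik_reduced)      # test statistic, approx. 9.25
pchisq(lrt, df = 3 - 1, lower.tail = FALSE)    # p-value, approx. 0.0099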
We have convincing evidence that the Sex Conditional Model provides a
significant improvement over the Sex Unconditional Model. However, keep in
mind that our point estimates for a probability of a boy were not what we had
expected for “sex runs in families.” It may be that this discrepancy stems from
how couples decide when to stop having children, an idea we take up next.

2.7 Model 3: Stopping Rule Model (waiting for a boy)
Rodgers and Doughty [2001] offer one reason to explain the contradictory
results: waiting for a male child. It has been noted by demographers that some
parents are only interested in producing a male heir so that the appearance
of a boy leads more often to the family ending childbearing. Stopping models
investigate questions like: Are couples more likely to stop childbearing once
they have a boy? Or are some parents waiting for a girl? Others might wish
to have at least one boy and girl. The exploratory data analysis results in
Table 2.6 provide some insight but cannot definitively settle the question about
couples’ stopping once they have a boy.
For stopping models, two probabilities are recorded for each child: the proba-
bility of the sex and the conditional probability of stopping after that child. As
we have done in previous models, let pB = probability the child is a boy. When
conditioning, every possible condition must have a probability associated with
it. Here the stopping conditions for Model 3 are: stop on first boy (S|B1) or
stopping on a child who is not the first boy (S|N ).
Additional parameters for the First Boy Stopping Model
• pS|B1 = probability of stopping after the first boy
• 1 − pS|B1 = probability of not stopping after the first boy
• pS|N = probability of stopping after a child who is not the first boy
• 1 − pS|N = probability of not stopping after a child who is not the first boy
With these additional parameters, likelihood contributions of the NLSY families
are listed in Table 2.10. Our interest centers on whether the probability of
stopping after the first boy, pS|B1, is greater than the probability of
stopping on a child who is not the first boy, pS|N.
Using calculus, the MLEs are derived to be p̂B = 0.507, p̂S|B1 = 0.432, and
p̂S|N = 0.584. These are consistent with intuition. The estimated proportion
of boys for this model is the same as the estimate for the Sex Unconditional
Model (Model 1). The estimates of the stopping parameters are consistent
TABLE 2.11: Stopping behavior of NLSY families by the sex composition of each child's family at the time of the child's birth.

Child is...                                        total children   prop of all children   n.stops (n.families)   prop stopped after these children
a boy who is the only boy in the
  family up to that point                               3,986             37.4%                  1,721                   43.2%
not an only boy in the family up to that point          6,686             62.2%                  3,905                   58.4%
a girl who is the only girl in the
  family up to that point                               3,928             36.8%                  1,794                   45.7%
not an only girl in the family up to that point         6,744             63.2%                  3,832                   56.8%
Total                                                  10,672                                    5,626
with the fact that of the 3,986 first boys, parents stop 43.2% of the time and
of the 6,686 children who are not first boys, childbearing stopped 58.4% of the
time. See Table 2.11.
These results do, in fact, suggest that the probability a couple stops childbearing
on the first boy is different than the probability of stopping at a child who is
not the first boy, but the direction of the difference does not imply that couples
“wait for a boy;” rather it appears that they are less likely to stop childbearing
after the first boy in comparison to children who are not the first-born male.
Similarly, for girls, the MLEs are p̂S|G1 = 0.457 and p̂S|N = 0.568. Once again,
the estimates do not provide evidence of waiting for a girl.
How does the waiting for a boy model compare to the waiting for a girl model?
Thus far we’ve seen how nested models can be compared. But these two models
are not nested since one is not simply a reduced version of the other. Two
measures referred to as information criteria, AIC and BIC, are useful when
comparing non-nested models. Each measure can be calculated for a model
using a function of the model’s maximum log-likelihood. You can find the
log-likelihood in the output from most modeling software packages.
• AIC = −2(maximum log-likelihood ) + 2p, where p represents the number of
parameters in the fitted model. AIC stands for Akaike Information Criterion.
Because smaller AICs imply better models, we can think of the second term
as a penalty for model complexity—the more variables we use, the larger the
AIC.
• BIC = −2(maximum log-likelihood ) + p log(n), where p is the number of
parameters and n is the number of observations. BIC stands for Bayesian
Information Criterion, also known as Schwarz’s Bayesian criterion (SBC).
Here we see that the penalty for the BIC differs from the AIC, where the
log of the number of observations places a greater penalty on each extra
predictor, especially for large data sets.
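As a small illustration (a sketch, not from the text), both criteria can be computed directly from a model's maximized log-likelihood:

# AIC and BIC from a maximized log-likelihood, with p parameters and n observations
info_criteria <- function(loglik, p, n) {
  c(AIC = -2 * loglik + 2 * p,
    BIC = -2 * loglik + p * log(n))
}

For fitted R model objects, the built-in AIC() and BIC() functions return the same quantities.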
So which explanation of the data seems more plausible—waiting for a boy or
waiting for a girl? These models are not nested (i.e., one is not a simplified
version of the other), so it is not correct to perform a Likelihood Ratio Test,
but we can legitimately compare these models using information criteria (Table
2.12).
Smaller AIC and BIC are preferred, so here the Waiting for a Boy Model
is judged superior to the Waiting for a Girl Model, suggesting that couples
waiting for a boy is a better explanation of the data than waiting for a girl.
However, for either boys or girls, couples do not stop more frequently after
the first occurrence.
TABLE 2.12: Measures of model performance with NLSY data: Waiting for
a Boy vs. Waiting for a Girl Model.
Other stopping rule models are possible. Another model could be that couples
wait to stop until they have both a boy and a girl. We leave the consideration
of this balance-preference model as an exercise.
Using a Likelihood Ratio Test, we found statistical evidence that the Sex
Conditional Model (Sex Bias) is preferred to the Sex Unconditional Models.
However, the parameter estimates were not what we expected if we believe
that sex runs in families. Quite to the contrary, the results suggested that if
there were more of one sex in a family, the next child is likely to be of the
other sex. The results may support the idea that gender composition tends to
“even out” over time.
Using AICs and BICs to compare the non-nested models of waiting for a boy
or waiting for a girl, we found that the model specifying stopping for a first
boy was superior to the model for stopping for the first girl. Again, neither
model suggested that couples were more likely to stop after the first male or
female; rather, it appeared just the opposite: couples were less likely to
stop childbearing after the first boy or first girl.
These results may need to be considered conditional on the size of a family,
in which case a look at the exploratory data analysis results may be informative.
The reported percentages in Table 2.5 could be compared to the percentages
expected if the sex of the baby occurs randomly, P(all one sex|2-child family)
= 1/2, and we observed 45%. For three-child families, P(all one sex|3-child
family) = 1/4, and we observed 27%. There is very slight evidence for sex
running in families for three-child families and none for two-child families.
Under a random model that assumes the probability of a boy is 50%, the
percentage of one-, two- and three-child families with the first boy showing
up last in the family is 50%, 25%, and 12.5%, respectively. Comparing these
probabilities to what was observed in the data in Table 2.6, we find little
support for the idea that couples are waiting for a boy.

2.10 Likelihood-Based Methods
Models that in the past you would fit using ordinary least squares can also be
fit using the principle of maximum likelihood. It is pleasing to discover that
under the right assumptions the maximum likelihood estimates (MLEs) for the
intercept and slope in a linear regression are identical to ordinary least squares
estimators (OLS) despite the fact that they are obtained in quite different
ways.
Beyond the intuitively appealing aspects of MLEs, they also have some very
desirable statistical properties. You learn more about these features in a
statistical theory course. Here we briefly summarize the highlights in non-
technical terms. MLEs are consistent; i.e., MLEs converge in probability
to the true value of the parameter as the sample size increases. MLEs are
asymptotically normal; as the sample size increases, the distribution of MLEs is
closer to normal. MLEs are efficient because no consistent estimator has a lower
mean squared error. Of all the estimators that produce unbiased estimates
of the true parameter value, no estimator will have a smaller mean square
error than the MLE. While likelihoods are powerful and flexible, there are
times when likelihood-based methods fail: either MLEs do not exist, likelihoods
cannot be written down, or MLEs cannot be written explicitly. It is also worth
noting that other approaches to the likelihood, such as bootstrapping, can be
employed.
Many factors have been identified that can potentially affect the
human sex ratio at birth. A 1972 paper by Michael Teitelbaum
accounted for around 30 such influences, including drinking
water, coital rates, parental age, parental socioeconomic status,
birth order, and even some societal-level influences like wars and
environmental pathogens.
This chapter on likelihood ignored these complicating factors and was intention-
ally kept simple to impress you with the fact that likelihoods are conceptually
straightforward. Likelihoods answer the sensible question of how likely you are
to see your data in different settings. When the likelihood is simple as in this
chapter, you can roughly determine an MLE by looking at a graph or you can
be a little more precise by using calculus or, most conveniently, software. As
we progress throughout the course, the likelihoods will become more complex
and numerical methods may be required to obtain MLEs, yet the concept of an
MLE will remain the same. Likelihoods will show up in parameter estimation,
model performance assessment, and model comparisons.
One of the reasons many of the likelihoods will become complex is because of
covariates. Here we estimated probabilities of having a boy in different settings,
but we did not use any specific information about families other than sex
composition. The problems in the remainder of the book will typically employ
covariates. For example, suppose we had information on paternal age for each
family. Under the Sex Unconditional Model, the probability of a boy could then
be allowed to depend on paternal age. Likelihoods also become more complex when
models must reflect structure in the data. For example, models with conditional
probabilities do not conform to the independence assumption; the Sex Conditional
Model is an example of such a model. We'll see that likelihoods remain useful
when the data have structure, such as multilevel grouping, that induces
correlation. A good portion of the book addresses this.
When the responses are not normal such as in generalized linear models, where
we see binary responses and responses which are counts, we’ll find it difficult
to use the linear least squares regression models of the past and we’ll find the
flexibility of likelihood methods to be extremely useful. Likelihood methods
will enable us to move beyond multiple linear regression!
2.11 Exercises
1. Write out the likelihood for a model which assumes the probability of
a girl equals the probability of a boy. Carry out a LRT to determine
whether there is evidence that the two probabilities are not equal.
Comment on the practical significance of this finding (there is not
necessarily one correct answer).
2. Case 3 In Case 1 we used hypothetical data with 30 boys and 20
girls. Case 2 was a much larger study with 600 boys and 400 girls.
Consider Case 3, a hypothetical data set with 6000 boys and 4000
girls.
•Use the methods for Case 1 and Case 2 and determine the MLE
for pB for the independence model. Compare your result to the
MLEs for Cases 1 and 2.
•Describe how the graph of the log-likelihood for Case 3 would
compare to the log-likelihood graphs for Cases 1 and 2.
•Compute the log-likelihood for Case 3. Why is it incorrect to
perform an LRT comparing Cases 1, 2, and 3?
3. Write out an expression for the likelihood of seeing our NLSY data
(5,416 boys and 5,256 girls) if the true probability of a boy is:
(a) pB = 0.5
(b) pB = 0.45
(c) pB = 0.55
(d) pB = 0.5075
3.2 Introduction
Consider the event of flipping a (possibly unfair) coin. If the coin lands heads,
let’s consider this a success and record Y = 1. A series of these events is a
Bernoulli process, independent trials that take on one of two values (e.g.,
0 or 1). These values are often referred to as a failure and a success, and the
probability of success is identical for each trial. Suppose we only flip the coin
once, so we only have one parameter, the probability of flipping heads, p. If
we know this value, we can express P (Y = 1) = p and P (Y = 0) = 1 − p. In
general, if we have a Bernoulli process with only one trial, we have a binary
distribution (also called a Bernoulli distribution) where

$$P(Y = y) = p^y (1-p)^{1-y} \quad \text{for } y = 0, 1. \tag{3.1}$$

Assuming all songs have equal odds of playing, we can calculate
p = (200 − 5)/200 = 0.975, so there is a 97.5% chance of a song you tolerate
playing, since P(Y = 1) = 0.975¹(1 − 0.975)⁰ = 0.975.
More generally, if Y counts the number of successes in n independent Bernoulli
trials, each with probability of success p, then Y follows a binomial
distribution with

$$P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y} \quad \text{for } y = 0, 1, \ldots, n. \tag{3.2}$$
If Y ∼ Binomial(n, p), then E(Y) = np and SD(Y) = √(np(1 − p)). Typical
shapes of a binomial distribution are found in Figure 3.1. On the left side n
remains constant. We see that as p increases, the center of the distribution
(E(Y ) = np) shifts right. On the right, p is held constant. As n increases, the
distribution becomes less skewed.
FIGURE 3.1: Binomial distributions with (n = 10, p = 0.25), (n = 20, p = 0.2), (n = 10, p = 0.5), and (n = 50, p = 0.2); each panel plots probability against the number of successes.
Note that if n = 1,

$$P(Y = y) = \binom{1}{y} p^y (1-p)^{1-y} = p^y (1-p)^{1-y} \quad \text{for } y = 0, 1,$$

which is the binary distribution from above.
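The worked binomial example here did not survive extraction, but the printed output below is consistent with computing P(Y = 2) for n = 10 trials with p = 0.25 (e.g., guessing on ten four-option multiple-choice questions, an assumed setup); a call that reproduces it:

# P(exactly 2 successes in 10 trials with p = 0.25)
dbinom(2, size = 10, prob = 0.25)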
## [1] 0.2816
Therefore, there is a 28% chance of exactly 2 correct answers out of 10.
We can think about this function as modeling the probability of y failures
followed by 1 success. In this case, Y follows a geometric distribution with
E(Y) = (1 − p)/p and SD(Y) = √((1 − p)/p²).
Typical shapes of geometric distributions are shown in Figure 3.2. Notice that
as p increases, the range of plausible values decreases and means shift towards
0.
Once again, we can use R to aid our calculations. The function dgeom(y,
p) will output the probability of y failures before the first success where
Y ∼ Geometric(p).
FIGURE 3.2: Geometric distributions with p = 0.3, 0.5, and 0.7; each panel plots probability against the number of failures before the first success.
Example 3: Consider rolling a fair, six-sided die until a five appears. What is
the probability of rolling the first five on the third roll?
First note that p = 1/6. We are then interested in P (Y = 2), as we would want
2 failures before our success. We know that P (Y = 2) = (5/6)2 (1/6) = 0.116.
Verifying through R:
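A call reproducing the output below:

# P(2 failures before the first success) with p = 1/6
dgeom(2, prob = 1/6)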
## [1] 0.1157
Thus, there is a 12% chance of rolling the first five on the third roll.
FIGURE 3.3: Negative binomial distributions; the recoverable panel labels are (p = 0.35, r = 5) and (p = 0.7, r = 5), with probability plotted against the number of failures.
One important property of the gamma function is that for any integer n,
Γ(n) = (n − 1)!. Applying this, we can generalize the pmf of a negative
binomial variable such that
$$P(Y = y) = \binom{y+r-1}{r-1}(1-p)^y p^r = \frac{(y+r-1)!}{(r-1)!\,y!}(1-p)^y p^r = \frac{\Gamma(y+r)}{\Gamma(r)\,y!}(1-p)^y p^r \quad \text{for } y = 0, 1, \ldots, \infty.$$
$$\begin{aligned}
P(Y < 3) &= P(Y=0) + P(Y=1) + P(Y=2)\\
&= \binom{9}{9}(1-0.9)^0(0.9)^{10} + \binom{10}{9}(1-0.9)^1(0.9)^{10} + \binom{11}{9}(1-0.9)^2(0.9)^{10}\\
&= 0.89
\end{aligned}$$
Using R:
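A call consistent with the calculation above:

# P(fewer than 3 failures before the 10th success) with p = 0.9
pnbinom(2, size = 10, prob = 0.9)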
## [1] 0.8891
Thus, there is a 89% chance that she gets 10 correct responses before missing
3.
$$P(Y = y) = \frac{\binom{m}{y}\binom{N-m}{n-y}}{\binom{N}{n}} \quad \text{for } y = 0, 1, \ldots, \min(m, n). \tag{3.6}$$
If Y follows a hypergeometric distribution and we define p = m/N, then
E(Y) = np and SD(Y) = √(np(1 − p)·(N − n)/(N − 1)). Figure 3.4 displays several
hypergeometric distributions. On the left, N and n are held constant. As
m → N/2, the distribution becomes more and more symmetric. On the right, m and
N are held constant. Both distributions are displayed on the same scale. We can
see that as n → N (or n → 0), the distribution becomes less variable.
FIGURE 3.4: Hypergeometric distributions; each panel plots probability against the number of successes.
Let Y be a hypergeometric random variable where n = 10, m = 4 (the queens), and
N = 52. Then

$$P(Y = 4) = \frac{\binom{4}{4}\binom{48}{6}}{\binom{52}{10}} = 0.0008.$$

We can avoid this calculation through R, of course:
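# P(all 4 queens among the first 10 cards): x = 4 successes out of
# m = 4 queens and n = 48 non-queens, drawing k = 10 cards
dhyper(4, m = 4, n = 48, k = 10)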
## [1] 0.0007757
So, there is a 0.08% chance of all 4 queens being within the first 10 cards of a
randomly shuffled deck of cards.
$$P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!} \quad \text{for } y = 0, 1, \ldots, \infty, \tag{3.7}$$

where λ is the mean or expected count in the unit of time or space of interest.
This probability mass function has E(Y) = λ and SD(Y) = √λ. Three Poisson
distributions are displayed in Figure 3.5. Notice how distributions become
more symmetric as λ increases.
FIGURE 3.5: Poisson distributions; the recoverable panel labels are λ = 1 and λ = 5, with probability plotted against the number of events.
$$\begin{aligned}
P(Y \le 3) &= P(Y=0) + P(Y=1) + P(Y=2) + P(Y=3)\\
&= \frac{e^{-5}5^0}{0!} + \frac{e^{-5}5^1}{1!} + \frac{e^{-5}5^2}{2!} + \frac{e^{-5}5^3}{3!}\\
&= 0.27.
\end{aligned}$$
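A call reproducing the output below:

# P(3 or fewer events) for a Poisson with mean 5
ppois(3, lambda = 5)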
## [1] 0.265
Therefore, there is a 27% chance of 3 or fewer tickets being issued within one
month.
Suppose we have a Poisson process with rate λ, and we wish to model the
wait time Y until the first event. We could model Y using an exponential
distribution, where

$$f(y) = \lambda e^{-\lambda y} \quad \text{for } y > 0.$$

FIGURE 3.6: Exponential distributions with rates λ = 0.5, 1, and 5.
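The output below is consistent with the earlier parking-ticket example at 5 tickets per month, i.e., a rate of 1/6 per day (an assumption recovered from the printed value):

# P(waiting fewer than 10 days between tickets), rate = 1/6 per day
pexp(10, rate = 1/6)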
## [1] 0.8111
Hence, there is a 81% chance of waiting fewer than 10 days between tickets.
FIGURE 3.7: Gamma distributions with (r = 2, λ = 1), (r = 1, λ = 1), (r = 5, λ = 5), and (r = 5, λ = 7).
Note that when r = 1, the gamma density reduces to

$$f(y) = \frac{\lambda^{1}}{\Gamma(1)} y^{1-1} e^{-\lambda y} = \lambda e^{-\lambda y} \quad \text{for } y > 0,$$

which is the exponential distribution.
Using R:
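A call consistent with the printed output, assuming the fishing example uses r = 5 fish and a rate of λ = 2 fish per hour (parameters recovered from the value 0.7149):

# P(waiting fewer than 3 hours to catch 5 fish)
pgamma(3, shape = 5, rate = 2)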
## [1] 0.7149
There is a 71.5% chance of catching 5 fish within the first 3 hours.
You have already at least informally seen normal random variables when
evaluating LLSR assumptions. To recall, we required responses to be normally
distributed at each level of X. Like any continuous random variable, normal
(also called Gaussian) random variables have their own pdf, dependent on µ,
the population mean of the variable of interest, and σ, the population standard
deviation. We find that
$$f(y) = \frac{e^{-(y-\mu)^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}} \quad \text{for } -\infty < y < \infty. \tag{3.10}$$
FIGURE 3.8: Normal distributions N(10, 5), N(0, 3), N(0, 1), and N(−5, 2), with means and standard deviations as labeled.
Suppose the weight of a box of cereal is normally distributed with a mean of
15 ounces and a standard deviation of 0.5 ounces. What is the probability that
the weight of a randomly selected box is more than 15.5 ounces?

Using a normal distribution,
$$P(Y > 15.5) = \int_{15.5}^{\infty} \frac{e^{-(y-15)^2/(2 \cdot 0.5^2)}}{\sqrt{2\pi \cdot 0.5^2}}\, dy = 0.159$$
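A call reproducing the output below:

# P(weight > 15.5) for a normal with mean 15 and sd 0.5
pnorm(15.5, mean = 15, sd = 0.5, lower.tail = FALSE)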
## [1] 0.1587
There is a 16% chance of a randomly selected box weighing more than 15.5
ounces.
So far, all of our continuous variables have had no upper bound. If we want to
limit our possible values to a smaller interval, we may turn to a beta random
variable. In fact, we often use beta random variables to model distributions of
probabilities—bounded below by 0 and above by 1. The pdf is parameterized
by two values, α and β (α, β > 0). We can describe a beta random variable by
the following pdf:
Γ(α + β) α−1
f (y) = y (1 − y)β−1 for 0 < y < 1. (3.11)
Γ(α)Γ(β)
If Y ∼ Beta(α, β), then E(Y) = α/(α + β) and

$$SD(Y) = \sqrt{\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}}.$$

Figure 3.9 displays several beta distributions. Note
that when α = β, distributions are symmetric. The distribution is left-skewed
when α > β and right-skewed when β > α.
If α = β = 1, then

$$f(y) = \frac{\Gamma(2)}{\Gamma(1)\Gamma(1)} y^{0} (1-y)^{0} = 1 \quad \text{for } 0 < y < 1,$$

which is the uniform distribution on (0, 1).
FIGURE 3.9: Beta distributions Beta(0.5, 0.5), Beta(4, 1), Beta(2, 2), and Beta(2, 5).
Alternatively, in R:
## [1] 0.0593
Hence, there is a 6% chance that a randomly selected student has a probability
of accepting an admission decision above 80%.
3.5.1 χ2 Distribution
You have probably already encountered χ2 tests before. For example, χ2 tests
are used with two-way contingency tables to investigate the association between
row and column variables. χ2 tests are also used in goodness-of-fit testing
such as comparing counts expected according to Mendelian ratios to observed
data. In those situations, χ2 tests compare observed counts to what would be
expected under the null hypotheses and reject the null when these observed
discrepancies are too large.
FIGURE 3.10: Chi-squared distributions with 1, 3, and 7 degrees of freedom.
FIGURE 3.11: t-distributions with 1, 2, 10, and Inf degrees of freedom.
3.5.3 F -Distribution
F -distributions are also used when performing statistical tests. Like the χ2
distribution, the values from an F -distribution are non-negative and the
distribution is right skewed; in fact, an F -distribution can be derived as the
ratio of two χ2 random variables. R.A. Fisher (for whom the test is named)
devised this test statistic to compare two different estimates of the same
variance parameter, and it has a prominent role in Analysis of Variance
(ANOVA). Model comparisons are often based on the comparison of variance
estimates, e.g., the extra sums-of-squares F test. F -distributions are indexed
by two degrees-of-freedom values, one for the numerator (k1 ) and one for the
denominator (k2). The expected value for an F-distribution with k1, k2 degrees
of freedom under the null hypothesis is k2/(k2 − 2), which approaches 1 as
k2 → ∞.
FIGURE 3.12: F-distributions F(1, 1), F(4, 2), and F(5, 10).
Table 3.1 briefly details most of the random variables discussed in this chapter.
3.7 Exercises
[Figure for exercise: histogram of wait times, in minutes (roughly 40 to 100), with relative frequency on the vertical axis.]
• Describe why simple linear regression is not ideal for Poisson data.
• Write out a Poisson regression model and identify the assumptions for
inference.
• Write out the likelihood for a Poisson regression and describe how it could
be used to estimate coefficients for a model.
• Interpret estimated coefficients from a Poisson regression and construct
confidence intervals for them.
• Use deviances for Poisson regression models to compare and assess models.
• Use an offset to account for varying effort in data collection.
• Fit and use a zero-inflated Poisson (ZIP) model.
log(λi ) = β0 + β1 xi
where the observed values Yi ∼ Poisson with λ = λi for a given xi . For example,
each state i can potentially have a different λ depending on its value of xi ,
where xi could represent presence or absence of a particular helmet law. Note
that the Poisson regression model contains no separate error term like the ε
we see in linear regression, because λ determines both the mean and the
variance of a Poisson random variable.
Much like linear least squares regression (LLSR), using Poisson regression to
make inferences requires model assumptions.
FIGURE 4.1: Regression models: Linear regression (left) and Poisson regres-
sion (right).
Figure 4.1 illustrates a comparison of the LLSR model for inference to Poisson
regression using a log function of λ.
1. The graphic displaying the LLSR inferential model appears in the left
panel of Figure 4.1. It shows that, for each level of X, the responses
are approximately normal. The panel on the right side of Figure
4.1 depicts what a Poisson regression model looks like. For each
level of X, the responses follow a Poisson distribution (Assumption
1). For Poisson regression, small values of λ are associated with
a distribution that is noticeably skewed with lots of small values
and only a few larger ones. As λ increases the distribution of the
responses begins to look more and more like a normal distribution.
2. In the LLSR model, the variation in Y at each level of X, σ 2 , is
the same. For Poisson regression the responses at each level of X
become more variable with increasing means, where variance=mean
(Assumption 3).
3. In the case of LLSR, the mean responses for each level of X, µY |X ,
fall on a line. In the case of the Poisson model, the mean values of
Y at each level of X, λY |X , fall on a curve, not a line, although the
logs of the means should follow a line (Assumption 4).
At what age are heads of households in the Philippines most likely to find
the largest number of people in their household? Is this association similar for
poorer households (measured by the presence of a roof made from predominantly
light/salvaged material)?
TABLE 4.1: The first five observations from the Philippines Household case
study.
The first five rows from our data set fHH1.csv are illustrated in Table 4.1.
Each line of the data file refers to a household at the time of the survey:
• location = where the house is located (Central Luzon, Davao Region, Ilocos
Region, Metro Manila, or Visayas)
• age = the age of the head of household
• total = the number of people in the household other than the head
• numLT5 = the number in the household under 5 years of age
• roof = the type of roof in the household (either Predominantly
Light/Salvaged Material, or Predominantly Strong Material, where stronger
material can sometimes be used as a proxy for greater wealth)
For the rest of this case study, we will refer to the number of people in a
household as the total number of people in that specific household besides the
head of household.
FIGURE 4.3: Distribution of household size (number in the house excluding the head of household).
Figure 4.3 reveals a fair amount of variability in the number in each house;
responses range from 0 to 16 with many of the respondents reporting between 1
and 5 people in the house. Like many Poisson distributions, this graph is right
skewed. It clearly does not suggest that the number of people in a household
is a normally distributed response.
FIGURE 4.4: Distribution of household size by age group of the head of household, with histograms faceted by 5-year age bins such as (55,60], (60,65], and (65,70].
TABLE 4.2: Compare mean and variance of household size within each age
group.
Figure 4.4 further shows that responses can be reasonably modeled with a
Poisson distribution when grouped by a key explanatory variable: age of the
household head. These last two plots together suggest that Assumption 1
(Poisson Response) is satisfactory in this case study.
For Poisson random variables, the variance of Y (i.e., the square of the standard
deviation of Y ), is equal to its mean, where Y represents the size of an individual
household. As the mean increases, the variance increases. So, if the response is
a count and the mean and variance are approximately equal for each group of
X, a Poisson regression model may be a good choice. In Table 4.2 we display
age groups by 5-year increments, to check to see if the empirical means and
variances of the number in the house are approximately equal for each age
group. This provides us one way in which to check the Poisson Assumption 3
(mean = variance).
If there is a problem with this assumption, most often we see variances much
larger than means. Here, as expected, we see more variability as age increases.
However, it appears that the variance is smaller than the mean for lower ages,
while the variance is greater than the mean for higher ages. Thus, there is
some evidence of a violation of the mean=variance assumption (Assumption
3), although any violations are modest.
The Poisson regression model also implies that log(λi ), not the mean household
size λi , is a linear function of age; i.e., log(λi ) = β0 + β1 agei . Therefore, to
check the linearity assumption (Assumption 4) for Poisson regression, we would
like to plot log(λi ) by age. Unfortunately, λi is unknown. Our best guess of
λi is the observed mean number in the household for each age (level of X).
Because these means are computed for observed data, they are referred to as
empirical means. Taking the logs of the empirical means and plotting by age
provides a way to assess the linearity assumption. The smoothed curve added
to Figure 4.5 suggests that there is a curvilinear relationship between age and
the log of the mean household size, implying that adding a quadratic term
should be considered. This finding is consistent with the researchers’ hypothesis
that there is an age at which a maximum household size occurs. It is worth
noting that we are not modeling the log of the empirical means, rather it is
the log of the true rate that is modeled. Looking at empirical means, however,
does provide an idea of the form of the relationship between log(λ) and xi .
FIGURE 4.5: The log of the mean household sizes, besides the head of
household, by age of the head of household, with loess smoother.
We can extend Figure 4.5 by fitting separate curves for each region (see Figure
4.6). This allows us to see if the relationship between mean household size and
age is consistent across region. In this case, the relationships are pretty similar;
if they weren’t, we could consider adding an age-by-region interaction to our
eventual Poisson regression model.
FIGURE 4.6: Empirical log of the mean household sizes vs. age of the head
of household, with loess smoother by region.
We first consider a model for which log(λ) is linear in age. We then will
determine whether a model with a quadratic term in age provides a significant
improvement based on trends we observed in the exploratory data analysis.
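A sketch of the fitting call (assuming the data frame is named fHH1, matching the file fHH1.csv; the model object modela is referenced below):

# Poisson regression of household size on age of the head of household
modela <- glm(total ~ age, family = poisson, data = fHH1)
summary(modela)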
R reports an estimated regression equation for the linear Poisson model as

$$\log(\hat{\lambda}) = 1.55 - 0.0047\,\text{age}$$

(the slope matching the estimate β̂1 = −0.0047 discussed below). To interpret
the coefficient on age, consider how the log mean changes when X increases by
one:
$$\begin{aligned}
\log(\lambda_X) &= \beta_0 + \beta_1 X\\
\log(\lambda_{X+1}) &= \beta_0 + \beta_1 (X+1)\\
\log(\lambda_{X+1}) - \log(\lambda_X) &= \beta_1\\
\log\!\left(\frac{\lambda_{X+1}}{\lambda_X}\right) &= \beta_1 \qquad (4.1)\\
\frac{\lambda_{X+1}}{\lambda_X} &= e^{\beta_1}
\end{aligned}$$
confint(modela)

## 2.5 % 97.5 %
## (Intercept) 1.451170 1.648249
## age -0.006543 -0.002873

exp(confint(modela))

## 2.5 % 97.5 %
## (Intercept) 4.2681 5.1979
## age 0.9935 0.9971
Another way to test the significance of the age term is to calculate a Wald-
type statistic. A Wald-type test statistic is the estimated coefficient divided
by its standard error. When the true coefficient is 0, this test statistic fol-
lows a standard normal distribution for sufficiently large n. The estimated
coefficient associated with the linear term in age is β̂1 = −0.0047 with stan-
dard error SE(β̂1 ) = 0.00094. The value for the Wald test statistic is then
Z = β̂1 /SE(β̂1 ) = −5.026, where Z follows a standard normal distribution
if β1 = 0. In this case, the two-sided p-value based on the standard normal
distribution for testing H0 : β1 = 0 is almost 0 (p = 0.000000501). Therefore,
we have statistically significant evidence (Z = -5.026, p < .001) that average
household size decreases as age of the head of household increases.
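A quick sketch of that arithmetic in R (using the rounded values quoted above):

# Wald-type test for the age coefficient
z <- -0.0047 / 0.00094
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value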
There is another way in which to assess how useful age is in our model. A
deviance is a way in which to measure how the observed data deviates from
the model predictions; it will be defined more precisely in Section 4.4.8, but it
is similar to the sum of squared errors (unexplained variability in the
response) in LLSR models.
In order to use the drop-in-deviance test, the models being compared must
be nested; e.g., all the terms in the smaller model must appear in the larger
model. Here the smaller model is the null model with the single term β0 and
the larger model has β0 and β1 , so the two models are indeed nested. For
nested models, we can compare the models’ residual deviances to determine
whether the larger model provides a significant improvement.
• When the reduced model is true, the drop-in-deviance ∼ χ2d where d= the
difference in the degrees of freedom associated with the two models (that is,
the difference in the number of terms/coefficients).
• A large drop-in-deviance favors the larger model.
Wald test for a single coefficient
• Wald-type statistic = estimated coefficient / standard error
• When the true coefficient is 0, for sufficiently large n, the test statistic ∼
N(0,1).
• If the magnitude of the test statistic is large, there is evidence that the true
coefficient is not 0.
The drop-in-deviance and the Wald-type tests usually provide consistent results;
however, if there is a discrepancy, the drop-in-deviance is preferred. Not only
does the drop-in-deviance test perform better in more cases, but it’s also
more flexible. If two models differ by one term, then the drop-in-deviance test
essentially tests if a single coefficient is 0 like the Wald test does, while if two
models differ by more than one term, the Wald test is no longer appropriate.
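As a sketch (not from the text), the drop-in-deviance test comparing the null model to the model with age can be run with anova():

# Drop-in-deviance test: null model vs. model with a linear age term
model0 <- glm(total ~ 1, family = poisson, data = fHH1)
anova(model0, modela, test = "Chisq")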
Before continuing with model building, we take a short detour to see how
coefficient estimates are determined in a Poisson regression model. The least
squares approach requires a linear relationship between the parameter, λi (the
expected or mean response for observation i), and xi (the age for observation
i). However, it is log(λi ), not λi , that is linearly related to X with the Poisson
model. The assumptions of equal variance and normality also do not hold
for Poisson regression. Thus, the method of least squares will not be helpful
for inference in Poisson Regression. Instead of least squares, we employ the
likelihood principle to find estimates of our model coefficients. We look for
those coefficient estimates for which the likelihood of our data is maximized;
these are the maximum likelihood estimates.
The likelihood for n independent observations is the product of the probabilities.
For example, if we observe five households with household sizes of 4, 2, 8, 6,
and 1 person beyond the head, the likelihood is:
$$P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$$

for y = 0, 1, 2, .... So, the likelihood can be written as

$$\text{Likelihood} = \frac{e^{-\lambda_1}\lambda_1^4}{4!} \cdot \frac{e^{-\lambda_2}\lambda_2^2}{2!} \cdot \frac{e^{-\lambda_3}\lambda_3^8}{8!} \cdot \frac{e^{-\lambda_4}\lambda_4^6}{6!} \cdot \frac{e^{-\lambda_5}\lambda_5^1}{1!}$$
where each λi can differ for each household depending on a particular xi. As
in Chapter 2, it will be easier to find a maximum if we take the log of the
likelihood and ignore the constant term resulting from the sum of the factorials:

$$\log(\text{Likelihood}) = -(\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5) + 4\log(\lambda_1) + 2\log(\lambda_2) + 8\log(\lambda_3) + 6\log(\lambda_4) + \log(\lambda_5) \qquad (4.2)$$
Now if we had the age of the head of the household for each house (xi ), we
consider the Poisson regression model:
log(λi ) = β0 + β1 xi
This implies that λ differs for each age and can be determined using
λi = eβ0 +β1 xi .
If the ages are X = c(32, 21, 55, 44, 28) years, our loglikelihood can be written:

$$\begin{aligned}
\log(\text{Likelihood}) = &-(e^{\beta_0+32\beta_1}+e^{\beta_0+21\beta_1}+e^{\beta_0+55\beta_1}+e^{\beta_0+44\beta_1}+e^{\beta_0+28\beta_1})\\
&+ 4(\beta_0+32\beta_1) + 2(\beta_0+21\beta_1) + 8(\beta_0+55\beta_1) + 6(\beta_0+44\beta_1) + (\beta_0+28\beta_1) \qquad (4.3)
\end{aligned}$$
To see this, match the terms in Equation (4.2) with those in Equation (4.3),
noting that λi has been replaced with eβ0 +β1 xi . It is Equation (4.3) that will
be used to estimate the coefficients β0 and β1 . Although this looks a little more
complicated than the loglikelihoods we saw in Chapter 2, the fundamental
ideas are the same. In theory, we try out different possible values of β0 and
β1 until we find the two for which the loglikelihood is largest. Most statistical
software packages have automated search algorithms to find those values for
β0 and β1 that maximize the loglikelihood.
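A minimal sketch of that search in R for the five hypothetical households (illustration only, not the book's code):

# Numerically maximize the log-likelihood in equation (4.3)
y <- c(4, 2, 8, 6, 1)        # household sizes beyond the head
x <- c(32, 21, 55, 44, 28)   # ages of the heads of household
negloglik <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * x)
  -sum(-lambda + y * log(lambda))  # negative log-likelihood, dropping factorials
}
optim(c(0, 0), negloglik)$par      # estimates of (beta0, beta1)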
In Section 4.4.4, the Wald-type test and drop-in-deviance test both suggest
that a linear term in age is useful. But our exploratory data analysis in Section
4.4.2 suggests that a quadratic model might be more appropriate. A quadratic
model would allow us to see if there exists an age where the number in the
house is, on average, a maximum. The output for a quadratic model appears
below.
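The output itself did not survive extraction; a sketch of the call that produces it (again assuming the data frame fHH1):

# Add a quadratic term in age
modela2 <- glm(total ~ age + I(age^2), family = poisson, data = fHH1)
summary(modela2)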
We can assess the importance of the quadratic term in two ways. First, the
p-value for the Wald-type statistic for age2 is statistically significant (Z =
-11.058, p < 0.001). Another approach is to perform a drop-in-deviance test.
The first order model has a residual deviance of 2337.1 with 1498 df and the
second order model, the quadratic model, has a residual deviance of 2200.9
with 1497 df. The drop-in-deviance by adding the quadratic term to the linear
model is 2337.1 - 2200.9 = 136.2 which can be compared to a χ2 distribution
with one degree of freedom. The p-value is essentially 0, so the observed drop
of 136.2 again provides significant support for including the quadratic term.
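That drop-in-deviance computation can be reproduced with anova():

anova(modela, modela2, test = "Chisq")   # drop-in-deviance = 2337.1 - 2200.9 = 136.2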
We now have an equation in age which yields the estimated log(mean number
in the house).
As shown in the following, with calculus we can determine that the maximum
estimated additional number in the house is e1.441 = 4.225 when the head of
the household is 50.04 years old.
## locationMetroManila 2.484e-01
## locationVisayas 7.247e-03
Notice that because there are 5 different locations, we must represent the effects
of different locations through 4 indicator variables. For example, β̂6 = −0.0194
indicates that, after controlling for the age of the head of household, the log
mean household size is 0.0194 lower for households in the Davao Region than
for households in the reference location of Central Luzon. In more interpretable
terms, mean household size is e−0.0194 = 0.98 times “higher” (i.e., 2% lower)
in the Davao Region than in Central Luzon, when holding age constant.
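A sketch of the model with location added (the indicator names in the output fragment above are those R creates from location, with Central Luzon as the reference level):

# Quadratic age model with location indicators
modela2L <- glm(total ~ age + I(age^2) + location,
                family = poisson, data = fHH1)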
Residual plots may provide some insight into Poisson regression models, es-
pecially linearity and outliers, although the plots are not quite as useful here
as they are for linear least squares regression. There are a few options for
computing residuals and predicted values. Residuals may have the form of
residuals for LLSR models or the form of deviance residuals which, when
squared, sum to the total deviance for the model. Predicted values can be
estimates of the counts, eβ0 +β1 X , or log counts, β0 + β1 X. We will typically
use the deviance residuals and predicted counts.
The residuals for linear least squares regression have the form:

$$\text{residual}_i = Y_i - \hat{\mu}_i \qquad (4.4)$$

The residual sum of squares (RSS) is formed by squaring and adding these
residuals, and we generally seek to minimize RSS in model building. We have
several options for creating residuals for Poisson regression models. One is to
create residuals in much the same way as we do in LLSR. For Poisson residuals,
the predicted values are denoted by λ̂i (in place of μ̂i in Equation (4.4));
they are then standardized by dividing by the standard error, √λ̂i. These kinds
of residuals are referred to as Pearson residuals:

$$\text{Pearson residual}_i = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$
Pearson residuals have the advantage that you are probably familiar with
their meaning and the kinds of values you would expect. For example, after
standardizing we expect most Pearson residuals to fall between -2 and 2.
However, deviance residuals have some useful properties that make them a
better choice for Poisson regression.
First, we define a deviance residual for an observation from a Poisson
regression:
$$\text{deviance residual}_i = \text{sign}(Y_i - \hat{\lambda}_i)\sqrt{2\left[Y_i\log\!\left(\frac{Y_i}{\hat{\lambda}_i}\right) - (Y_i - \hat{\lambda}_i)\right]}$$

where sign(x) is defined such that:

$$\text{sign}(x) = \begin{cases} 1 & \text{if } x > 0\\ -1 & \text{if } x < 0\\ 0 & \text{if } x = 0 \end{cases}$$
As its name implies, a deviance residual describes how the observed data
deviates from the fitted model. Squaring and summing the deviance residuals for
all observations produces the residual deviance = Σ (deviance residual)²ᵢ.
Relatively speaking, observations for good fitting models will have small
deviances; that is, the predicted values will deviate little from the observed.
However, you can see that the deviance for an observation does not easily
translate to a difference in observed and predicted responses as is the case
with LLSR models.
A careful inspection of the deviance formula reveals several places where the
deviance compares Y to λ̂: the sign of the deviance is based on the difference
between Y and λ̂, and under the radical sign we see the ratio Y /λ̂ and the
difference Y − λ̂. When Y = λ̂, that is, when the model fits perfectly, the
difference will be 0 and the ratio will be 1 (so that its log will be 0). So like
the residuals in LLSR, an observation that fits perfectly will not contribute to
the sum of the squared deviances. This definition of a deviance depends on
the likelihood for Poisson models. Other models will have different forms for
the deviance depending on their likelihood.
FIGURE 4.7: Residual plot for the Poisson model of household size by age
of the household head.
A plot (Figure 4.7) of the deviance residuals versus predicted responses for the
first order model exhibits curvature, supporting the idea that the model may be
improved by adding a quadratic term. Other details related to residual plots
can be found in a variety of sources including McCullagh and Nelder [1989].
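A sketch producing such a plot for the first order model:

# Deviance residuals vs. fitted counts
plot(fitted(modela), residuals(modela, type = "deviance"),
     xlab = "Fitted values", ylab = "Deviance residuals")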
4.4.9 Goodness-of-Fit
The model residual deviance can be used to assess the degree to which the
predicted values differ from the observed. When a model is true, we can expect
the residual deviance to be distributed as a χ2 random variable with degrees
of freedom equal to the model’s residual degrees of freedom. Our model thus
far, the quadratic terms for age plus the indicators for location, has a residual
deviance of 2187.8 with 1493 df. The probability of observing a deviance this
large if the model fits is essentially 0, indicating significant evidence of
lack-of-fit.
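The call producing the output below (mirroring the goodness-of-fit computation used later in Section 4.10):

# p-value for the goodness-of-fit test
1 - pchisq(2187.8, df = 1493)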
[1] 0
There are several reasons why lack-of-fit may be observed. (1) We may be
missing important covariates or interactions; a more comprehensive data set
may be needed. (2) There may be extreme observations that may cause the
deviance to be larger than expected; however, our residual plots did not reveal
any unusual points. (3) Lastly, there may be a problem with the Poisson
model. In particular, the Poisson model has only a single parameter, λ, for
each combination of the levels of the predictors which must describe both the
mean and the variance. This limitation can become manifest when the variance
appears to be larger than the corresponding means. In that case, the response
is more variable than the Poisson model would imply, and the response is
considered to be overdispersed.
Response
LLSR: Normal
Poisson Regression: Counts

Variance
LLSR: Equal for each level of X
Poisson Regression: Equal to the mean for each level of X

Model Fitting
LLSR: µ = β0 + β1x using Least Squares
Poisson Regression: log(λ) = β0 + β1x using Maximum Likelihood

EDA
LLSR: Plot X vs. Y; add line
Poisson Regression: Find log(ȳ) for several subgroups; plot vs. X

Comparing Models
LLSR: Extra sum of squares F-tests; AIC/BIC
Poisson Regression: Drop-in-deviance tests; AIC/BIC

Interpreting Coefficients
LLSR: β1 = change in µY for unit change in X
Poisson Regression: eβ1 = multiplicative (percent) change in λ for unit change in X
# A tibble: 10 x 6
Enrollment type nv nvrate enroll1000 region
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 5590 U 30 5.37 5.59 SE
2 540 C 0 0 0.54 SE
3 35747 U 23 0.643 35.7 W
4 28176 C 1 0.0355 28.2 W
5 10568 U 1 0.0946 10.6 SW
6 3127 U 0 0 3.13 SW
7 20675 U 7 0.339 20.7 W
8 12548 C 0 0 12.5 W
9 30063 U 19 0.632 30.1 C
10 4429 C 4 0.903 4.43 C
FIGURE 4.8: Histogram of the number of violent crimes per school.
A graph of the number of violent crimes, Figure 4.8, reveals the pattern often
found with distributions of counts of rare events. Many schools reported no
violent crimes or very few crimes. A few schools have a large number of crimes
making for a distribution that appears to be far from normal. Therefore,
Poisson regression should be used to model our data; Poisson random variables
are often used to represent counts (e.g., number of violent crimes) per unit of
time or space (e.g., one year).
Let’s take a look at two covariates of interest for these schools: type of institution
and region. In our data, the majority of institutions are universities (65% of
the 81 schools) and only 35% are colleges. Interest centers on whether the
different regions tend to have different crime rates. Table 4.3 contains the
name of each region, and each column represents the percentage of schools in
that region which are colleges or universities. The proportion of colleges
varies from a low of 20% in the Southwest (SW) to a high of 50% in the West (W).

TABLE 4.3: Percentage of colleges (C) and universities (U) within each region.

     C      MW     NE     SE    SW    W
C  0.294   0.3   0.381   0.4   0.2   0.5
U  0.706   0.7   0.619   0.6   0.8   0.5

TABLE 4.4: The mean and variance of the violent crime rate by region and
type of institution.
While a Poisson regression model is a good first choice because the responses
are counts per year, it is important to note that the counts are not directly
comparable because they come from different size schools. This issue sometimes
is referred to as the need to account for sampling effort; in other words, we
expect schools with more students to have more reports of violent crime since
there are more students who could be affected. We cannot directly compare
the 30 violent crimes from the first school in the data set to no violent crimes
for the second school when their enrollments are vastly different: 5,590 for
school 1 versus 540 for school 2. We can take the differences in enrollments
into account by including an offset in our model, which we will discuss in the
next section. For the remainder of the EDA, we examine the violent crime
counts in terms of the rate per 1,000 enrolled, i.e.,
(number of violent crimes / number enrolled) · 1000.
Note that there is a noticeable outlier for a Southeastern school (5.4 violent
crimes per 1000 students), and there is an observed rate of 0 for the South-
western colleges which can lead to some computational issues. We therefore
FIGURE 4.9: Boxplot of violent crime rate by region and type of institution
(colleges (C) on the left, and universities (U) on the right).
combined the SW and SE to form a single category of the South, and we also
removed the extreme observation from the data set.
Table 4.4 and Figure 4.9 display mean violent crime rates that are generally
lower at the colleges within a region (with the exception of the Northeast). In
addition, the regional pattern of rates at universities appears to differ from
that of the colleges.
Although working with the observed rates (per 1000 students) is useful during
the exploratory data analysis, we do not use these rates explicitly in the model.
The counts (per year) are the Poisson responses when modeling, so we must
take into account the enrollment in a different way. Our approach is to include
a term on the right side of the model called an offset, which is the log of the
enrollment, in thousands. There is an intuitive heuristic for the form of the
offset. If we think of λ as the mean number of violent crimes per year, then
λ/enroll1000 represents the number per 1000 students, so that the yearly count
is adjusted to be comparable across schools of different sizes. Adjusting the
yearly count by enrollment is equivalent to adding log(enroll1000) to the right-
hand side of the Poisson regression equation—essentially adding a predictor
with a fixed coefficient of 1:
$$\begin{aligned}
\log\!\left(\frac{\lambda}{\text{enroll1000}}\right) &= \beta_0 + \beta_1(\text{type})\\
\log(\lambda) - \log(\text{enroll1000}) &= \beta_0 + \beta_1(\text{type})\\
\log(\lambda) &= \beta_0 + \beta_1(\text{type}) + \log(\text{enroll1000})
\end{aligned}$$
While this heuristic is helpful, it is important to note that it is not λ/enroll1000
that we are modeling. We are still modeling log(λ), but we’re adding an offset
to adjust for differing enrollments, where the offset has the unusual feature
that the coefficient is fixed at 1.0. As a result, no estimated coefficient for
enroll1000 or log(enroll1000) will appear in the output. As this heuristic
illustrates, modeling log(λ) and adding an offset is equivalent to modeling
rates, and coefficients can be interpreted that way.
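A sketch of the fitting call (the data frame name c.data is an assumption; the variable names match the data shown earlier):

# Poisson regression with an offset for enrollment (in thousands)
modeltr <- glm(nv ~ type + region, family = poisson,
               offset = log(enroll1000), data = c.data)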
## # A tibble: 10 x 5
## comparison estimate SE z_value p_value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MW - C 0.0991 0.178 0.558 0.980
## 2 NE - C 0.778 0.153 5.08 0.00000349
## 3 S - C 0.582 0.149 3.91 0.000828
## 4 W - C 0.263 0.188 1.40 0.621
## 5 NE - MW 0.679 0.155 4.37 0.000109
## 6 S - MW 0.483 0.151 3.19 0.0121
## 7 W - MW 0.164 0.189 0.864 0.908
## 8 S - NE -0.196 0.122 -1.61 0.486
## 9 W - NE -0.515 0.166 -3.11 0.0157
## 10 W - S -0.320 0.163 -1.96 0.280
In our case, Tukey’s Honestly Significant Differences simultaneously evaluates
all 10 mean differences between pairs of regions. We find that the Northeast
has significantly higher rates of violent crimes than the Central, Midwest, and
Western regions, while the South has significantly higher rates of violent crimes
than the Central and the Midwest, controlling for the type of institution. In the
primary model, the University indicator is significant and, after exponentiating
the coefficient (e^0.280 = 1.32), can be interpreted as an approximately 32%
higher violent crime rate at universities than at colleges after controlling
for region.
4.9 Overdispersion
TABLE 4.5: Comparison of Poisson and quasi-Poisson inference.

                         Poisson                                       quasi-Poisson
Estimate                 β̂                                             β̂
Std error                SE(β̂)                                         SE_Q(β̂) = √φ̂ · SE(β̂)
Wald-type test stat      Z = β̂/SE(β̂)                                   t = β̂/SE_Q(β̂)
Confidence interval      β̂ ± z′ SE(β̂)                                  β̂ ± t′ SE_Q(β̂)
Drop-in-deviance test    χ² = resid dev(reduced) − resid dev(full)     F = (χ²/difference in df)/φ̂
model sufficient). The output below tests for an interaction between region
and type of institution after adjusting for overdispersion (extra variance):
Table 4.5 summarizes the comparison between Poisson inference (tests and
confidence intervals assuming no overdispersion) and quasi-Poisson inference
(tests and confidence intervals after accounting for overdispersion).
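A sketch of the quasi-Poisson refit (same assumed names as the earlier sketch):

# Same mean model, but with an estimated dispersion parameter
modeltr.quasi <- glm(nv ~ type + region, family = quasipoisson,
                     offset = log(enroll1000), data = c.data)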
universities, and the pattern of those differences depends upon the region.
However, this model exhibited significant lack-of-fit which remained after the
removal of an extreme observation. In the absence of additional covariates,
we accounted for the lack-of-fit by using a quasilikelihood approach and a
negative binomial regression, which provided slightly different conclusions. We
may want to look for additional covariates and/or more data.
head(zip.data[2:5])
As always we take stock of the amount of data; here there are 77 observations.
Large sample sizes are preferred for the type of model we will consider, and
n=77 is on the small side. We proceed with that in mind.
A premise of this analysis is that we believe that those responding zero drinks
are coming from a mixture of non-drinkers and drinkers who abstained the
weekend of the survey.
• Non-drinkers: respondents who never drink and would always reply with
zero.
• Drinkers: obviously this includes those responding with one or more drinks,
but it also includes people who are drinkers but did not happen to imbibe the
past weekend. These people reply zero but are not considered non-drinkers.
Beginning the EDA with the response, number of drinks, we find that over
46% of the students reported no drinks during the past weekend. Figure 4.10a
portrays the observed number of drinks reported by the students. The mean
number of drinks reported the past weekend is 2.013. Our sample consists of
74% females and 26% males, only 9% of whom live off campus.
FIGURE 4.10: (a) Observed proportions of the number of drinks reported; (b) modeled probabilities from a Poisson distribution with λ = 2.013.
Recall that a Poisson distribution has a single parameter, λ, for its mean and
variance. Here we will include an additional parameter, α. We define α to be
the true proportion of non-drinkers in the population.
The next step in the EDA is especially helpful if you suspect your data contains
excess zeros. Figure 4.10b is what we might expect to see under a Poisson
model. Bars represent the probabilities for a Poisson distribution (using the
Poisson probability formula) with λ equal to the mean observed number of
drinks, 2.013 drinks per weekend. Comparing this Poisson distribution to what
we observed (Figure 4.10a), it is clear that many more zeros have been reported
by the students than you would expect to see if the survey observations were
coming from a Poisson distribution. This doesn’t surprise us because we had
expected a subset of the survey respondents to be non-drinkers; i.e., they would
not be included in this Poisson process. This circumstance actually arises in
many Poisson regression settings. We will define λ to be the mean number of
drinks among those who drink, and α to be the proportion of non-drinkers
(“true zeros”). Then, we will attempt to model λ and α (or functions of λ and
α) simultaneously using covariates like sex, first-year status, and off-campus
residence. This type of model is referred to as a zero-inflated Poisson model
or ZIP model.
4.10.4 Modeling
We first fit a simple Poisson model with the covariates off.campus and sex.
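A sketch of that fit (the response name drinks is an assumption; off.campus and sex appear in the output below):

pois.m1 <- glm(drinks ~ off.campus + sex, family = poisson, data = zip.data)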
# Exponentiated coefficients
exp(coef(pois.m1))
# Goodness-of-fit test
gof.pvalue = 1 - pchisq(pois.m1$deviance, pois.m1$df.residual)
gof.pvalue
## [1] 0
• One part models the association, among drinkers, between number of drinks
and the predictors of sex and off-campus residence.
• The other part uses a predictor for first-year status to obtain an estimate of
the proportion of non-drinkers based on the reported zeros.
The form for each part of the model follows. The first part looks like an
ordinary Poisson regression model:

$$\log(\lambda) = \beta_0 + \beta_1\,\text{off.campus} + \beta_2\,\text{sex}$$

where λ is the mean number of drinks in a weekend among those who drink.
The second part has the form
The second part has the form
logit(α) = β0 + β1 firstYear
where α is the probability of being in the non-drinkers group and logit(α) =
log(α/(1 − α)). We’ll provide more detail on the logit in Chapter 6. There are
many ways in which to structure this model; here we use different predictors
in the two pieces, although it would have been perfectly fine to use the same
predictors for both pieces, or even no predictors for one of the pieces.
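A sketch of the fit using zeroinfl() from the pscl package (the response name drinks is an assumption; the formula lists the count-model predictors before "|" and the zero-model predictor after it):

library(pscl)
zip.m2 <- zeroinfl(drinks ~ off.campus + sex | firstYear, data = zip.data)
summary(zip.m2)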
## $count
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7543 0.1440 5.238 1.624e-07
## count_(Intercept) count_off.campus
## 2.1261 1.5158
## count_sexm zero_(Intercept)
## 2.7757 0.5468
## zero_firstYearTRUE
## 3.1155
We’ll first consider the “Count model coefficients,” which provide information
on how the sex and off-campus status of a student who is a drinker are related
to the number of drinks reported by that student over a weekend. As we have
done with previous Poisson regression models, we exponentiate each coefficient
for ease of interpretation. Thus, for those who drink, the average number of
drinks for males is e1.0209 or 2.76 times the number for females (Z = 5.827, p
< 0.001) given that you are comparing people who live in comparable settings,
i.e., either both on or both off campus. Among drinkers, the mean number of
drinks for students living off campus is e0.4159 = 1.52 times that of students
living on campus for those of the same sex (Z = 2.021, p = 0.0433).
We have

$$\log\!\left(\frac{\alpha}{1-\alpha}\right) = -0.6036 + 1.1364\,\text{firstYear}$$

so that

$$\hat{\alpha} = \frac{e^{-0.6036+1.1364(\text{firstYear})}}{1+e^{-0.6036+1.1364(\text{firstYear})}}.$$

For first-year students, the estimated probability of being a non-drinker is

$$\hat{\alpha} = \frac{e^{0.533}}{1+e^{0.533}} = 0.630$$
or 63.0%, while for non-first-year students, the estimated probability of being
a non-drinker is 0.354. If you have seen logistic regression, you’ll recognize that
this transformation is what is used to estimate a probability. More on this in
Chapter 6.
vuong(pois.m1, zip.m2)
p-value
Raw 0.0036
AIC-corrected 0.0056
BIC-corrected 0.0093
Here, we have structured the Vuong Test to compare Model 1: Ordinary
Poisson Model to Model 2: Zero-inflation Model. If the two models do not
differ, the test statistic for Vuong would be asymptotically standard Normal
and the p-value would be relatively large. Here the first line of the output table
indicates that the zero-inflation model is better (Z = −2.69, p = .0036). Note
that the test depends upon a sufficiently large n for the Normal approximation; since our sample size (n = 77) is somewhat small, we need to interpret this
result with caution. More research is underway to address statistical issues
related to these comparisons.
Fitted values (ŷ) and residuals (y− ŷ) can be computed for zero-inflation models
and plotted. Figure 4.11 reveals that one observation appears to be extreme
(Y=22 drinks during the past weekend). Is this a legitimate observation or
was there a transcribing error? Without the original respondents, we cannot
settle this question. It might be worthwhile to get a sense of how influential
this extreme observation is by removing Y=22 and refitting the model.
FIGURE 4.11: Fitted values and residuals from the ZIP model; the extreme observation (Y = 22) is labeled.
4.10.8 Limitations
Given that you have progressed this far in your statistical education, the
weekend drinking survey question should raise some red flags. What time period constitutes the “weekend”?
4.11 Exercises
16. Poisson approximation: rare events. For rare diseases, the prob-
ability of a case occurring, p, in a very large population, n, is small.
With a small p and large n, the random variable Y = the number of
cases out of n people can be approximated using a Poisson random
variable with λ = np. If the count of those with the disease is ob-
served in several different populations independently of one another,
then Yi represents the number of cases in the ith population and can be modeled as a Poisson random variable with parameter λi = ni p.
TABLE 4.7: Data from Scotto et al. (1974) on the number of cases of non-
melanoma skin cancer for women by age group in two metropolitan areas
(Minneapolis-St. Paul and Dallas-Ft. Worth); the year is unknown.
5 Generalized Linear Models: A Unifying Theory
$$P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!} \quad \textrm{where } y = 0, 1, 2, \ldots, \infty$$
and consider the following useful identities for establishing exponential form:
$$a = e^{\log(a)}$$
$$a^x = e^{x\log(a)}$$
$$\log(ab) = \log(a) + \log(b)$$
$$\log\left(\frac{a}{b}\right) = \log(a) - \log(b)$$
Determining whether the Poisson model is a member of the one-parameter
exponential family is a matter of writing the Poisson pmf in the form of
Equation (5.1) and checking that the support does not depend upon λ. First,
consider the condition concerning the support of the distribution. The set of
possible values for any Poisson random variable is y = 0, 1, 2 . . . ∞ which does
not depend on λ. The support condition is met. Now we see if we can rewrite
the probability mass function in one-parameter exponential family form:

$$P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!} = e^{y\log(\lambda) - \lambda - \log(y!)}$$

The first term in the exponent for Equation (5.1) must be the product of two factors, one solely a function of y, a(y), and another, b(λ), a function of λ only. The middle term in the exponent must be a function of λ only; no y's should appear. The last term has only y's and no λ. Since this is the case here, we can identify the different functions in this form:

$$a(y) = y$$
$$b(\lambda) = \log(\lambda)$$
$$c(\lambda) = -\lambda$$
$$d(y) = -\log(y!)$$
$$E(Y) = -\frac{-1}{1/\lambda} = \lambda \quad \textrm{and} \quad Var(Y) = \frac{1/\lambda^2}{1/\lambda^3} = \lambda$$
We’ll find that other distributions are members of the one-parameter exponen-
tial family by writing their pdf or pmf in this manner and verifying the support
condition. For example, we’ll see that the binomial distribution meets these
conditions, so it is also a member of the one-parameter exponential family. The
normal distribution is a special case where we have two parameters, a mean µ
and standard deviation σ. If we assume, however, that one of the parameters
is known, then we can show that a normal random variable is also from a
one-parameter exponential family.
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-\mu)^2/(2\sigma^2)}$$
Even writing $1/\sqrt{2\pi\sigma^2}$ as $e^{-\log\sigma - \log(2\pi)/2}$, we still do not have the pdf written in one-parameter exponential family form. We will first need to expand the exponent so that we have

$$f(y) = e^{[-\log\sigma - \log(2\pi)/2]}\, e^{[-(y^2 - 2y\mu + \mu^2)/(2\sigma^2)]}$$
Without loss of generality, we can assume σ = 1, so that

$$f(y) \propto e^{y\mu - \frac{1}{2}\mu^2 - \frac{1}{2}y^2}$$

and $a(y) = y$, $b(\mu) = \mu$, $c(\mu) = -\frac{1}{2}\mu^2$, and $d(y) = -\frac{1}{2}y^2$.
From this result, we can see that the canonical link for a normal response is µ
which is consistent with what we’ve been doing with LLSR, since the simple
linear regression model has the form:
µY |X = β0 + β1 X.
5.4 Exercises
For each of the following distributions, determine whether it belongs to the one-parameter exponential family by writing its pdf or pmf in the form of Equation (5.1) where possible.

a) Binary:
$$P(Y = y; p) = p^y(1-p)^{(1-y)}$$

b) Binomial:
$$P(Y = y; p) = \binom{n}{y}p^y(1-p)^{(n-y)}$$

c) Poisson:
$$P(Y = y; \lambda) = \frac{e^{-\lambda}\lambda^y}{y!}$$

d) Normal (for fixed σ):
$$f(y; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\mu)^2/(2\sigma^2)}$$

e) Normal (for fixed µ):
$$f(y; \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\mu)^2/(2\sigma^2)}$$

f) Exponential:
$$f(y; \lambda) = \lambda e^{-\lambda y}$$

g) Gamma (for fixed r): Y = time spent waiting for the rth event in a Poisson process with an average rate of λ events per unit of time
$$f(y; \lambda) = \frac{\lambda^r}{\Gamma(r)}\,y^{r-1}e^{-\lambda y}$$

h) Geometric:
$$P(Y = y; p) = (1-p)^y p$$

i) Pareto:
$$f(y; \theta) = \frac{\theta k^\theta}{y^{(\theta+1)}} \quad \textrm{for } y \geq k;\ \theta \geq 1$$
6 Logistic Regression

6.1 Learning Objectives

• Identify a binomial random variable and assess the validity of the binomial assumptions.
• Write a generalized linear model for binomial responses in two forms, one as
a function of the logit and one as a function of p.
• Explain how fitting a logistic regression differs from fitting a linear least
squares regression (LLSR) model.
• Interpret estimated coefficients in logistic regression.
• Differentiate between logistic regression models with binary and binomial
responses.
• Use the residual deviance to compare models, to test for lack-of-fit when
appropriate, and to check for unusual observations or needed transformations.
Binary Responses: Recall from Section 3.3.1 that binary responses take on
only two values: success (Y=1) or failure (Y=0), Yes (Y=1) or No (Y=0),
etc. Binary responses are ubiquitous; they are one of the most common types
of data that statisticians encounter. We are often interested in modeling the
probability of success p based on a set of covariates, although sometimes we
wish to use those covariates to classify a future observation as a success or a
failure.
Examples (a) and (b) above would be considered to have binary responses
(Does a student binge drink? Was a patient diagnosed with cancer?), assuming
that we have a unique set of covariates for each individual student or patient.
Binomial Responses: Also recall from Section 3.3.2 that binomial responses
are the number of successes in n identical, independent trials with constant
probability p of success. A sequence of independent trials like this with the
same probability of success is called a Bernoulli process. As with binary
responses, our objective in modeling binomial responses is to quantify how the
probability of success, p, is associated with relevant covariates.
Example (c) above would be considered to have a binomial response, assuming
we have vote totals at the congressional district level rather than information
on individual voters.
Much like ordinary least squares (OLS), using logistic regression to make
inferences requires model assumptions.
FIGURE 6.1: Linear vs. logistic regression models for binary response data.
Figure 6.1 illustrates a data set with a binary (0 or 1) response (Y) and a
single continuous predictor (X). The solid line is a linear regression fit with
least squares to model the probability of a success (Y=1) for a given value of
X. With a binary response, the line doesn’t fit the data well, and it produces
predicted probabilities below 0 and above 1. On the other hand, the logistic
regression fit (dashed curve) with its typical “S” shape follows the data closely
and always produces predicted probabilities between 0 and 1. For these and
several other reasons detailed in this chapter, we will focus on the following
model for logistic regression with binary or binomial responses:
$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i$$
where the observed values Yi ∼ binomial with p = pi for a given xi and n = 1
for binary responses.
TABLE 6.1: Soccer goalkeepers’ penalty kick saves when their team is and
is not behind.
The soccer goalkeeper data can be written in the form of a 2 × 2 table. This
example is used to describe some of the underlying theory for logistic regression.
We demonstrate how binomial probability mass functions (pmfs) can be written
in one-parameter exponential family form, from which we can identify the
canonical link as in Chapter 5. Using the canonical link, we write a Generalized
Linear Model for binomial counts and determine corresponding MLEs for model
coefficients. Interpretation of the estimated parameters involves a fundamental
concept, the odds ratio.
The last case study addresses why teens try to lose weight. Here the response is
a binary variable which allows us to analyze individual level data. The analysis
builds on concepts from the previous sections in the context of a random
sample from CDC’s Youth Risk Behavior Survey (YRBS).
Does the probability of a save in a soccer match depend upon whether the
goalkeeper’s team is behind or not? Roskes et al. [2011] looked at penalty kicks
in the men’s World Cup soccer championships from 1982 to 2010, and they
assembled data on 204 penalty kicks during shootouts. The data for this study
is summarized in Table 6.1.
Odds are one way to quantify a goalkeeper's performance. Here the odds that a goalkeeper makes a save when his team is behind are 2 to 22 or 0.09 to 1. Or, equivalently, the odds that a goal is scored on a penalty kick are 22 to 2 or 11 to 1. Odds of 11 to 1 tell you that a shooter whose team is ahead will score 11 times for every 1 shot that the goalkeeper saves. When the goalkeeper's team is not behind, the odds a goal is scored are 141 to 39 or 3.61 to 1. We see that
the odds of a goal scored on a penalty kick are better when the goalkeeper’s
team is behind than when it is not behind (i.e., better odds of scoring for the
shooter when the shooter’s team is ahead). We can compare these odds by
calculating the odds ratio (OR), 11/3.61 or 3.05, which tells us that the odds
of a successful penalty kick are 3.05 times higher when the shooter’s team is
leading.
In our example, it is also possible to estimate the probability of a goal, p, for
either circumstance. When the goalkeeper’s team is behind, the probability
of a successful penalty kick is p = 22/24 or 0.833. We can see that the
ratio of the probability of a goal scored divided by the probability of no
goal is (22/24)/(2/24) = 22/2 or 11, the odds we had calculated above. The
same calculation can be made when the goalkeeper’s team is not behind. In
general, we now have several ways of finding the odds of success under certain
circumstances:
$$\textrm{Odds} = \frac{\#\textrm{successes}}{\#\textrm{failures}} = \frac{\#\textrm{successes}/n}{\#\textrm{failures}/n} = \frac{p}{1-p}.$$
We would like to model the odds of success; however, odds are strictly positive.
Therefore, similar to modeling log(λ) in Poisson regression, which allowed the
response to take on values from −∞ to ∞, we will model the log(odds), the
logit, in logistic regression. Logits will be suitable for modeling with a linear
function of the predictors:
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X$$
Models of this form are referred to as binomial regression models, or more
generally as logistic regression models. Here we provide intuition for using
and interpreting logistic regression models, and then in the short optional
section that follows, we present rationale for these models using GLM theory.
In our example we could define X = 0 for not behind and X = 1 for behind
and fit the model:
$$\log\left(\frac{p_X}{1-p_X}\right) = \beta_0 + \beta_1 X \quad (6.1)$$
where pX is the probability of a successful penalty kick given X.
So, based on this model, the log odds of a successful penalty kick when the
goalkeeper’s team is not behind is:
$$\log\left(\frac{p_0}{1-p_0}\right) = \beta_0,$$
and the log odds when the team is behind is:
$$\log\left(\frac{p_1}{1-p_1}\right) = \beta_0 + \beta_1.$$
We can see that β1 is the difference between the log odds of a successful penalty
kick between games when the goalkeeper’s team is behind and games when
the team is not behind. Using rules of logs:
$$\beta_1 = (\beta_0 + \beta_1) - \beta_0 = \log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_0}{1-p_0}\right) = \log\left(\frac{p_1/(1-p_1)}{p_0/(1-p_0)}\right).$$
Thus $e^{\beta_1}$ is the ratio of the odds of scoring when the goalkeeper's team is behind to the odds of scoring when the team is not behind. In general, exponentiated
coefficients in logistic regression are odds ratios (OR). A general interpretation
of an OR is the odds of success for group A compared to the odds of success
for group B—how many times greater the odds of success are in group A
compared to group B.
The logit model (Equation (6.1)) can also be re-written in a probability
form:
$$p_X = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
which can be re-written for games when the goalkeeper’s team is behind as:
$$p_1 = \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}} \quad (6.2)$$
and for games when the goalkeeper’s team is not behind as:
$$p_0 = \frac{e^{\beta_0}}{1 + e^{\beta_0}} \quad (6.3)$$
$$\textrm{Lik}(\beta_0, \beta_1) \propto \left(\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}\right)^{22}\left(1 - \frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}\right)^{2}\left(\frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{141}\left(1 - \frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{39}$$
Now what? Fitting the model means finding estimates of β0 and β1 , but
familiar methods from calculus for maximizing the likelihood don’t work here.
Instead, we consider all possible combinations of β0 and β1 and pick the pair that yields the largest likelihood for our data. Trial and error to find the best pair is tedious at best, but efficient numerical methods are available. The MLEs for the coefficients in the soccer goalkeeper study are β̂0 = 1.2852 and β̂1 = 1.1127.
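Although numerical maximization sounds daunting, glm() carries it out for us. A sketch, using the counts reported above from Table 6.1 (22 goals and 2 saves when behind; 141 goals and 39 saves when not behind):

goals  <- c(22, 141)   # penalty kicks scored (behind, not behind)
saves  <- c(2, 39)     # penalty kicks saved (behind, not behind)
behind <- c(1, 0)
fit <- glm(cbind(goals, saves) ~ behind, family = binomial)
coef(fit)        # beta0-hat = 1.2852, beta1-hat = 1.1127
exp(coef(fit))   # e^{beta1-hat} = 3.04, the estimated odds ratio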
Exponentiating βˆ1 provides an estimate of the odds ratio (the odds of scoring
when the goalkeeper’s team is behind, compared to the odds of scoring when
the team is not behind) of 3.04, which is consistent with our calculations using
the 2 × 2 table. We estimate that the odds of scoring when the goalkeeper’s
team is behind is over 3 times that of when the team is not behind or, in other
words, the odds a shooter is successful in a penalty kick shootout are 3.04
times higher when his team is leading.
Recall from Chapter 5 that generalized linear models (GLMs) are a way in
which to model a variety of different types of responses. In this chapter, we
apply the general results of GLMs to the specific application of binomial
responses. Let Y = the number scored out of n penalty kicks. The parameter,
p, is the probability of a score on a single penalty kick. Recall that the theory
of GLMs is based on the unifying notion of the one-parameter exponential
family form:
$$f(y; \theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]}$$
To see that we can apply the general approach of GLMs to binomial responses,
we first write an expression for the probability of a binomial response and
then use a little algebra to rewrite it until we can demonstrate that it, too,
can be written in one-parameter exponential family form with θ = p. This
will provide a way in which to specify the canonical link and the form for the
model. Additional theory allows us to deduce the mean, standard deviation,
and more from this form.
If Y follows a binomial distribution with n trials and probability of success p,
we can write:
$$P(Y = y) = \binom{n}{y} p^y (1-p)^{(n-y)} = e^{y\log(p) + (n-y)\log(1-p) + \log\binom{n}{y}}$$
Since $y\log(p) + (n-y)\log(1-p) = y\log\left(\frac{p}{1-p}\right) + n\log(1-p)$, we can see that $b(p) = \log\left(\frac{p}{1-p}\right)$; that is, the canonical link for a binomial response is the logit. Thus a model using the logit, the log odds of a score, as a linear function of covariates is a reasonable approach.
The unit of observation for this data is a community in Hale County. We will
focus on the following variables from RR_Data_Hale.csv collected for each
community (see Table 6.2):
• distance = the distance, in miles, the proposed railroad is from the com-
munity
• YesVotes = the number of “Yes” votes in favor of the proposed railroad line
(our primary response variable)
TABLE 6.2: Sample of the data for the Hale County, Alabama, railroad
subsidy vote.
We first look at a coded scatterplot to see our data. Figure 6.2 portrays the
relationship between distance and pctBlack coded by the InFavor status
(whether a community supported the referendum with over 50% Yes votes).
From this scatterplot, we can see that all of the communities in favor of the
railroad referendum are over 55% black, and all of those opposed are 7 miles
or farther from the proposed line. The overall percentage of voters in Hale
County in favor of the railroad is 87.9%.
FIGURE 6.2: Scatterplot of distance from a proposed rail line and percent
black in the community coded by whether the community was in favor of the
referendum or not.
Here, “empirical” means “based on sample data.” Empirical logits are computed for each community by taking $\log\left(\frac{\#\textrm{successes}}{\#\textrm{failures}}\right)$. In Figure 6.3, we see that the plot of
empirical logits versus distance produces a plot that looks linear, as needed for
the logistic regression assumption. In contrast, the empirical logits by percent
black reveal that Greensboro deviates quite a bit from the otherwise linear
pattern; this suggests that Greensboro is an outlier and possibly an influential
point. Greensboro has 99.2% voting yes, with only 59.4% black.
-2
0 5 10 15
distance
2.5
0.0
-2.5
25 50 75 100
percent black
FIGURE 6.3: Empirical logit plots for the Railroad Referendum data.
Since the Wald statistic follows a normal distribution with n large, we could generate a Wald-type (normal-based) confidence interval for β2 using $e^{\hat{\beta}_2 \pm 1.96\,SE(\hat{\beta}_2)}$. Profile likelihood intervals can be obtained directly in R:

exp(confint(model.HaleBD))
                2.5 %   97.5 %
(Intercept)   38.2285 122.6116
distance       0.7276   0.7660
pctBlack       0.9794   0.9945
In the model with distance and pctBlack, the profile likelihood 95% confidence interval for $e^{\beta_2}$ is (.979, .994), which is approximately equal to the Wald-based interval despite the small sample size. We can also confirm the statistically significant association between percent black and odds of voting Yes (after controlling for distance), because 1 is not a plausible value of $e^{\beta_2}$
(where an odds ratio of 1 would imply that the odds of voting Yes do not
change with percent black).
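The lack-of-fit test whose output appears below is not shown in this excerpt; a minimal sketch, mirroring the goodness-of-fit calculation used for Poisson regression and assuming the fitted object model.HaleBD from above:

1 - pchisq(model.HaleBD$deviance, model.HaleBD$df.residual)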
[1] 0
The model with pctBlack and distance has statistically significant evidence
of lack-of-fit (p < .001).
Similar to the Poisson regression models, this lack-of-fit could result from (a)
missing covariates, (b) outliers, or (c) overdispersion. We will first attempt
to address (a) by fitting a model with an interaction between distance and
percent black, to determine whether the effect of racial composition differs
based on how far a community is from the proposed railroad.
With LLSR, residuals were used to assess model assumptions and identify
outliers. For binomial regression, two different types of residuals are typically
used. One residual, the Pearson residual, has a form similar to that used
with LLSR. Specifically, the Pearson residual is calculated using:

$$\textrm{Pearson residual}_i = \frac{Y_i - m_i\hat{p}_i}{\sqrt{m_i\hat{p}_i(1-\hat{p}_i)}}$$

where $m_i$ is the number of trials for the ith observation and $\hat{p}_i$ is the estimated probability of success for that same observation.
A deviance residual is an alternative residual for binomial regression based
on the discrepancy between the observed values and those estimated using the
likelihood. A deviance residual can be calculated for each observation using:
$$d_i = \textrm{sign}(Y_i - m_i\hat{p}_i)\sqrt{2\left[Y_i\log\left(\frac{Y_i}{m_i\hat{p}_i}\right) + (m_i - Y_i)\log\left(\frac{m_i - Y_i}{m_i - m_i\hat{p}_i}\right)\right]}$$
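In R, both types of residuals can be extracted from a fitted binomial glm; a brief sketch, again assuming the fitted object model.HaleBD:

residuals(model.HaleBD, type = "pearson")
residuals(model.HaleBD, type = "deviance")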
When the number of trials is large for all of the observations and the models are
appropriate, both sets of residuals should follow a standard normal distribution.
The sum of the squared deviance residuals is referred to as the deviance
or residual deviance. The residual deviance is used to assess the model. As
the name suggests, a model with a small deviance is preferred. In the case of
binomial regression, when the denominators, mi , are large and a model fits,
the residual deviance follows a χ2 distribution with n − p degrees of freedom
(the residual degrees of freedom). Thus for a good fitting model the residual
deviance should be approximately equal to its corresponding degrees of freedom.
When binomial data meets these conditions, the deviance can be used for a
goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from
a χ2n−p distribution that are greater than the observed residual deviance.
We begin a residual analysis of our interaction model by plotting the residuals
against the fitted values in Figure 6.4. This kind of plot for binomial regression
would produce two linear trends with similar negative slopes if there were
equal sample sizes mi for each observation.
FIGURE 6.4: Fitted values by residuals for the interaction model for the
Railroad Referendum data.
From this residual plot, Greensboro does not stand out as an outlier. If it did,
we could remove Greensboro and refit our interaction model, checking to see if
model coefficients changed in a noticeable way. Instead, we will continue to
include Greensboro in our modeling efforts. Because the large residual deviance
cannot be explained by outliers, and given we have included all of the covariates
at hand as well as an interaction term, the observed binomial counts are likely
overdispersed. This means that they exhibit more variation than the model
would suggest, and we must consider ways to handle this overdispersion.
6.5.8 Overdispersion

One way to adjust for overdispersion is to estimate a dispersion parameter, φ, and inflate the standard errors of the coefficients by a factor of √φ̂. When overdispersion is adjusted for in this way, we can no longer use maximum likelihood to fit our regression model; instead we use a quasilikelihood approach.
Quasilikelihood is similar to likelihood-based inference, but because the model
uses the dispersion parameter, it is not a binomial model with a true likelihood
(we call it quasibinomial). R offers quasilikelihood as an option when model
fitting. The quasilikelihood approach will yield the same coefficient point
estimates as maximum likelihood; however, the variances will be larger in the
presence of overdispersion (assuming φ > 1). We will see other ways in which
to deal with overdispersion and clusters in the remaining chapters in the book,
but the following describes how overdispersion is accounted for using φ̂:
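A sketch of the quasibinomial fit that produced the p-values below, assuming hypothetical names for the data frame (rrHale.df) and the total vote count (NumVotes), neither of which appears in this excerpt:

# quasibinomial family: same point estimates, SEs inflated by sqrt(phi-hat)
model.HaleBDiq <- glm(cbind(YesVotes, NumVotes - YesVotes) ~
                        distance * pctBlack,
                      family = quasibinomial, data = rrHale.df)
summary(model.HaleBDiq)$coefficients[, 4]   # t-based p-values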
## (Intercept) 0.1436
## distance 0.1799
## pctBlack 0.3586
## distance:pctBlack 0.4331
We therefore remove the interaction term and refit the model, adjusting for
the extra-binomial variation that still exists.
exp(confint(model.HaleBDq))
2.5 % 97.5 %
(Intercept) 1.3609 5006.722
distance 0.6091 0.871
pctBlack 0.9366 1.044
While we previously found a 95% confidence interval for the odds ratio associ-
ated with distance of (.728, .766), our confidence interval is now much wider:
(.609, .871). Appropriately accounting for overdispersion has changed both the
significance of certain terms and the precision of our coefficient estimates.
6.5.9 Summary
Response
LLSR : normal
Binomial Regression : number of successes in n trials
Variance
LLSR : equal for each level of X
Binomial Regression : np(1 − p) for each level of X
Model Fitting
LLSR : µ = β0 + β1 x using Least Squares
Binomial Regression : $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$ using Maximum Likelihood
EDA
LLSR : plot X vs. Y ; add line
Binomial Regression : find log(odds) for several subgroups; plot vs. X
Comparing Models
LLSR : extra sum of squares F-tests; AIC/BIC
Binomial Regression : drop-in-deviance tests; AIC/BIC
Interpreting Coefficients
LLSR : β1 = change in mean response for unit change in X
Binomial Regression : $e^{\beta_1}$ = multiplicative change in odds for unit change in X
The final case study uses individual-specific information so that our response,
rather than the number of successes out of some number of trials, is simply a
binary variable taking on values of 0 or 1 (for failure/success, no/yes, etc.).
This type of problem—binary logistic regression—is exceedingly common
in practice. Here we examine characteristics of young people who are trying
to lose weight. The prevalence of obesity among U.S. youth suggests that
wanting to lose weight is sensible and desirable for some young people such as
those with a high body mass index (BMI). On the flip side, there are young
people who do not need to lose weight but make ill-advised attempts to do so
nonetheless. A multitude of studies on weight loss focus specifically on youth
and propose a variety of motivations for the young wanting to lose weight;
athletics and the media are two commonly cited sources of motivation for
losing weight for young people.
Sports have been implicated as a reason for young people wanting to shed
pounds, but not all studies are consistent with this idea. For example, a study
by Martinsen et al. [2009] reported that, despite preconceptions to the contrary,
there was a higher rate of self-reported eating disorders among controls (non-
elite athletes) as opposed to elite athletes. Interestingly, the kind of sport
was not found to be a factor, as participants in leanness sports (for example,
distance running, swimming, gymnastics, dance, and diving) did not differ in
the proportion with eating disorders when compared to those in non-leanness
sports. So, in our analysis, we will not make a distinction between different
sports.
Other studies suggest that mass media is the culprit. They argue that students’
exposure to unrealistically thin celebrities may provide unhealthy motivation
for some, particularly young women, to try to slim down. An examination and
analysis of a large number of related studies (referred to as a meta-analysis)
We are interested in the following questions: Are the odds that young females
report trying to lose weight greater than the odds that males do? Is increasing
BMI associated with an interest in losing weight, regardless of sex? Does sports
participation increase the desire to lose weight? Is media exposure associated
with more interest in losing weight?
We have a sample of 500 teens from data collected in 2009 through the U.S.
Youth Risk Behavior Surveillance System (YRBSS) [Centers for Disease Control
and Prevention, 2009]. The YRBSS is an annual national school-based survey
conducted by the Centers for Disease Control and Prevention (CDC) and state,
territorial, and local education and health agencies and tribal governments.
More information on this survey can be found here1 .
Here are the three questions from the YRBSS we use for our investigation:
Q66. Which of the following are you trying to do about your weight?
• A. Lose weight
• B. Gain weight
• C. Stay the same weight
• D. I am not trying to do anything about my weight
Q81. On an average school day, how many hours do you watch TV?
Q84. During the past 12 months, on how many sports teams did you play?
(Include any teams run by your school or community groups.)
• A. 0 teams
1 http://www.cdc.gov/HealthyYouth/yrbs/index.htm
• B. 1 team
• C. 2 teams
• D. 3 or more teams
Answers to Q66 are used to define our response variable: Y = 1 corresponds
to “(A) trying to lose weight”, while Y = 0 corresponds to the other non-
missing values. Q84 provides information on students’ sports participation
and is treated as numerical, 0 through 3, with 3 representing 3 or more. As a
proxy for media exposure, we use answers to Q81 as numerical values 0, 0.5,
1, 2, 3, 4, and 5, with 5 representing 5 or more. Media exposure and sports
participation are also considered as categorical factors, that is, as variables
with distinct levels which can be denoted by indicator variables as opposed to
their numerical values.
BMI is included in this study as the percentile for a given BMI for members
of the same sex. This facilitates comparisons when modeling with males and
females. We will use the terms BMI and BMI percentile interchangeably with
the understanding that we are always referring to the percentile.
With our sample, we use only the cases that include all of the data for these
four questions. This is referred to as a complete case analysis. That brings
our sample of 500 to 445. There are limitations of complete case analyses that
we address in the Discussion.
Nearly half (44.7%) of our sample of 445 youths report that they are trying to
lose weight, 48.1% of the sample are females, and 59.3% play on one or more
sports teams. Also, 8.8% report that they do not watch any TV on school days,
whereas another 13.0% watched 5 or more hours each day. Interestingly, the
median BMI percentile for our 445 youths is 68. The most dramatic difference
in the proportions of those who are trying to lose weight is by sex; 58% of the
females want to lose weight in contrast to only 32% of the males (see Figure
6.5). This provides strong support for the inclusion of a sex term in every
model considered.
FIGURE 6.5: Proportion of respondents trying to lose weight, by sex.

TABLE 6.3: Mean BMI percentile by sex and desire to lose weight.
Table 6.3 displays the mean BMI of those wanting and not wanting to lose
weight for males and females. The mean BMI is greater for those trying to lose
weight compared to those not trying to lose weight, regardless of sex. The size
of the difference is remarkably similar for the two sexes.
If we consider including a BMI term in our model(s), the logit should be
linearly related to BMI. We can investigate this assumption by constructing an
empirical logit plot. In order to calculate empirical logits, we first divide our
data by sex. Within each sex, we generate 10 groups of equal sizes, the first
holding the bottom 10% in BMI percentile for that sex, the second holding the
next lowest 10%, etc. Within each group, we calculate the proportion, p̂, that reported wanting to lose weight, and then the empirical log odds, $\log\left(\frac{\hat{p}}{1-\hat{p}}\right)$, that a young person in that group wants to lose weight.
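A sketch of this computation, assuming a hypothetical data frame risk2009 with variables sex, bmipct, and a 0/1 indicator lose.wt.01 for trying to lose weight:

library(dplyr)
risk2009 %>%
  group_by(sex, decile = ntile(bmipct, 10)) %>%   # 10 equal-size BMI groups per sex
  summarise(phat = mean(lose.wt.01),
            elogit = log(phat / (1 - phat)))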
FIGURE 6.6: Empirical logits of trying to lose weight by BMI and sex.
Figure 6.6 presents the empirical logits for the BMI intervals by sex. Both males
and females exhibit an increasing linear trend on the logit scale indicating
that increasing BMI is associated with a greater desire to lose weight and that
modeling log odds as a linear function of BMI is reasonable. The slope for
the females appears to be similar to the slope for males, so we do not need to
consider an interaction term between BMI and sex in the model.
FIGURE 6.7: Weight loss plans vs. sex and sports participation.
Out of those who play sports, 44% want to lose weight, whereas 46% want
to lose weight among those who do not play sports. Figure 6.7 compares the
proportion of respondents who want to lose weight by their sex and sport
participation. The data suggest that sports participation is associated with
the same or even a slightly lower desire to lose weight, contrary to what had
originally been hypothesized. While the overall levels of those wanting to lose
weight differ considerably between the sexes, the differences between those
in and out of sports within sex appear to be very small. A term for sports
participation or number of teams will be considered, but there is not compelling
evidence that an interaction term will be needed.
FIGURE 6.8: Weight loss plans vs. daily hours of TV and sex.
FIGURE 6.9: Empirical logits for the odds of trying to lose weight by TV
watching and sex.
Our strategy for modeling is to use our questions of interest and what we
have learned in the exploratory data analysis. For each model we interpret the
coefficient of interest, look at the corresponding Wald test and, as a final step,
compare the deviances for the different models we considered.
We first use a model where sex is our only predictor.
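A sketch of this first model, using the same hypothetical data frame and response as in the empirical logit sketch above:

model1 <- glm(lose.wt.01 ~ sex, family = binomial, data = risk2009)
summary(model1)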
We next add sport to our model. Sports participation was considered for
inclusion in the model in three ways: an indicator of sports participation (0 =
no teams, 1 = one or more teams), treating the number of teams (0, 1, 2, or 3)
as numeric, and treating the number of teams as a factor. The models below
treat sports participation using an indicator variable, but all three models
produced similar results.
Sports teams were not significant in any of these models, nor were interaction
terms (sex by sports and bmipct by sports). As a result, sports participation
was no longer considered for inclusion in the model.
df AIC
model1 2 584.3
model2 3 469.0
model3 4 470.6
model4 4 469.1
We found that the odds of wanting to lose weight are considerably greater
for females compared to males. In addition, respondents with greater BMI
percentiles express a greater desire to lose weight for members of the same sex.
Regardless of sex or BMI percentile, sports participation and TV watching are
not associated with different odds for wanting to lose weight.
A limitation of this analysis is that we used complete cases in place of a method
of imputing responses or modeling missingness. This reduced our sample from
500 to 445, and it may have introduced bias. For example, if respondents who
watch a lot of TV were unwilling to reveal as much, and if they differed with
respect to their desire to lose weight from those respondents who reported
watching little TV, our inferences regarding the relationship between lots of
TV and desire to lose weight may be biased.
Other limitations may result from definitions. Trying to lose weight is self-
reported and may not correlate with any action undertaken to do so. The
number of sports teams may not accurately reflect sports-related pressures to
lose weight. For example, elite athletes may focus on a single sport and be
subject to greater pressures, whereas athletes who casually participate in three
sports may not feel any pressure to lose weight. Hours spent watching TV are
not likely to encompass the totality of media exposure, particularly because
exposure to celebrities occurs often online. Furthermore, this analysis does not
explore in any detail maladaptations, i.e., inappropriate motivations for wanting to
lose weight. For example, we did not focus our study on subsets of respondents
with low BMI who are attempting to lose weight.
6.8 Exercises
Turbine group 1 2 3 4
Humidity Low Low High High
n = number of turbine wheels 3 3 3 3
y = number of fissures 1 2 1 0
# Expand the grouped counts into one row per moth: removed = 1 for each
# REMOVED moth and 0 otherwise, carrying along dark and DISTANCE
mtemp1 <- rep(moth$dark[1], moth$REMOVED[1])
dtemp1 <- rep(moth$DISTANCE[1], moth$REMOVED[1])
rtemp1 <- rep(1, moth$REMOVED[1])
mtemp1 <- c(mtemp1, rep(moth$dark[1], moth$PLACED[1] - moth$REMOVED[1]))
dtemp1 <- c(dtemp1, rep(moth$DISTANCE[1], moth$PLACED[1] - moth$REMOVED[1]))
rtemp1 <- c(rtemp1, rep(0, moth$PLACED[1] - moth$REMOVED[1]))
for (i in 2:14) {
  mtemp1 <- c(mtemp1, rep(moth$dark[i], moth$REMOVED[i]))
  dtemp1 <- c(dtemp1, rep(moth$DISTANCE[i], moth$REMOVED[i]))
  rtemp1 <- c(rtemp1, rep(1, moth$REMOVED[i]))
  mtemp1 <- c(mtemp1, rep(moth$dark[i], moth$PLACED[i] - moth$REMOVED[i]))
  dtemp1 <- c(dtemp1, rep(moth$DISTANCE[i], moth$PLACED[i] - moth$REMOVED[i]))
  rtemp1 <- c(rtemp1, rep(0, moth$PLACED[i] - moth$REMOVED[i]))
}
newdata <- data.frame(removed = rtemp1, dark = mtemp1, dist = dtemp1)
newdata[1:25, ]
cdplot(as.factor(rtemp1) ~ dtemp1)
are listed below. In addition, R code at the end of the problem can
be used to input the data and create additional useful variables.
•female = sex (1 = Female, 0 = Male)
•age = age, in years
•highstatus = socioeconomic status (1 = High, 0 = Low),
determined by the occupation of the household’s primary wage
earner
•yrsmoke = years of smoking prior to diagnosis or examination
•cigsday = average rate of smoking, in cigarettes per day
•bird = indicator of birdkeeping (1 = Yes, 0 = No), determined
by whether or not there were caged birds in the home for more
than 6 consecutive months from 5 to 14 years before diagnosis
(cases) or examination (controls)
•cancer = indicator of lung cancer diagnosis (1 = Cancer, 0 =
No Cancer)
a. Perform an exploratory data analysis to see how each explana-
tory variable is related to the response (cancer). Summarize
each relationship in one sentence.
•For quantitative explanatory variables (age, yrsmoke, cigsday),
produce a cdplot, a boxplot, and summary statistics by cancer
diagnosis.
•For categorical explanatory variables (female or sex,
highstatus or socioecon_status, bird or keep_bird),
produce a segmented bar chart and an appropriate table of
proportions showing the relationship with cancer diagnosis.
b. In (a), you should have found no relationship between whether
or not a patient develops lung cancer and either their age or
sex. Why might this be? What implications will this have on
your modeling?
c. Based on a two-way table with keeping birds and developing
lung cancer from (a), find an unadjusted odds ratio comparing
birdkeepers to non-birdkeepers and interpret this odds ratio
in context. (Note: an unadjusted odds ratio is found by not
controlling for any other variables.) Also, find an analogous
relative risk and interpret it in context as well.
d. Are the elogits reasonably linear relating number of years smoked
to the estimated log odds of developing lung cancer? Demon-
strate with an appropriate plot.
e. Does there appear to be an interaction between number of years
smoked and whether the subject keeps a bird? Demonstrate
with an interaction plot and a coded scatterplot with empirical
logits on the y-axis.
you have other concerns with this study design or the analysis
you carried out?
n. Read the article that appeared in the British Medical Journal.
What similarities and differences do you see between their anal-
yses and yours? What are a couple of things you learned from
the article that weren’t apparent in the short summary at the
beginning of the assignment?
7.2 Introduction
Introductory statistics courses typically require responses which are approxi-
mately normal and independent of one another. We saw from the first chapters
in this book that there are models for non-normal responses, so we have already
TABLE 7.1: Summary of simulations for Dams and Pups case study.

Scenario | Model | Model Name | β0 | SE β0 | t | p value | φ | Est prob | CI prob | Mean count | SD count | GOF p value
1a | Binomial | fit_1a_binom
1a | Quasibinomial | fit_1a_quasi | X X X
1b | Binomial | fit_1b_binom
1b | Quasibinomial | fit_1b_quasi | X X X

Scenario | Model | Model Name | β1 | SE β1 | t | p value | φ | Est odds ratio | CI odds ratio | Mean count Dose=1 | SD count Dose=1 | GOF p value
2a | Binomial | fit_2a_binom
2a | Quasibinomial | fit_2a_quasi | X X X
2b | Binomial | fit_2b_binom
2b | Quasibinomial | fit_2b_quasi | X X X
the same household are usually more similar than opinions of members from
other randomly selected households. The structure of these data sets suggest
inherent patterns of similarities or correlation among outcomes. This kind of
correlation specifically concerns correlation of observations within the same
teacher or patient or household and is referred to as intraclass correlation.
Correlated data often takes on a multilevel structure. That is, population
elements are grouped into aggregates, and we often have information on both
the individual elements and the aggregated groups. For instance, students are
grouped by teacher, weekly depression measures are grouped by patient, and
survey respondents are grouped by household. In these cases, we refer to levels
of measurement and observational units at each level. For example, students
might represent level-one observational units, while teachers represent
level-two observational units, where level one is the most basic level
of observation, and level-one observations are aggregated to form level-two
observations. Then, if we are modeling a response such as test score, we may
want to examine the effects of student characteristics such as sex and ethnicity,
and teacher characteristics such as years of experience. Student characteristics
would be considered level-one covariates, while teacher characteristics would
be level-two covariates.
pups might differ from dam to dam, and it is helpful to explicitly identify these
reasons in order to determine how the dose levels affect the pups while also
accommodating correlation.
Dose effect The dams and pups experiment is being carried out to determine
whether different dose levels affect the development of defects differently. Of
particular interest is determining whether a dose-response effect is present. A
dose-response effect is evident when dams receiving higher dose levels produce
higher proportions of pups with defects. Knowing defect rates at specific dose
levels is typically of interest within this experiment and beyond. Publishing
the defect rates for each dose level in a journal paper, for example, would be
of interest to other teratologists. For that reason, we refer to dose level effects
as fixed effects.
Dams (litter) effect In many settings like this, there is a litter effect as well.
For example, some dams may exhibit a propensity to produce pups with defects,
while others rarely produce litters with defective pups. That is, observations
on pups within the same litter are likely to be similar or correlated. Unlike
the dose effect, teratologists reading experiment results are not interested in
the estimated probability of defect for each dam in the study, and we would
not report these estimated probabilities in a paper. However, there may be
interest in the variability in litter-specific defect probabilities; accounting for
dam-to-dam variability reduces the amount of unexplained variability and
leads to more precise estimates of fixed effects like dose. Often this kind of
effect is modeled using the idea that randomly selected dams produce random
effects. This provides one way in which to model correlated data, in this case
the correlation between pups from the same dam. We elaborate on this idea
throughout the remainder of the text.
Pup-to-pup variability The within litter pup-to-pup differences reflect ran-
dom, unexplained variation in the model.
In Scenario 1, we will ignore dose and assume that dose has no effect on the
probability of a deformed pup. You can follow along with the simulation (and
modify it as desired) in the Rmd file for this chapter (note that some lines of
code are run but not printed in the chapter text), and fill out Table 7.1.
First, we will consider Scenario 1a where each dam’s probability of producing
a deformed pup is p = 0.5; thus, each dam’s log odds are 0. We then will
compare this to Scenario 1b, where dams’ probabilities of producing a deformed
pup follow a beta distribution where α = β = 0.5 (which has expected value
0.5). So, in Scenario 1b, each dam has a different probability of producing a
deformed pup, but the probabilities average out to 0.5, whereas in Scenario
1a each dam has an identical probability of producing a deformed pup (0.5).
Figure 7.1 illustrates these two scenarios; every dam in Scenario 1a has a
probability of 0.5 (the thick vertical line), while probabilities in Scenario 1b
are selected from the black dashed distribution, so that probabilities near 0
and 1 are more likely than probabilities near 0.5. The histogram shows 24
randomly generated probabilities under one run of Scenario 1b.
The R code below produces a simulated number of deformed pups for each
of the 24 dams under Scenario 1a and Scenario 1b, and Figure 7.2 displays
distributions of counts of deformed pups per dam for the two simulations.
In Scenario 1a, the mean number of deformed pups is 5.17 with standard
deviation 1.49. In Scenario 1b, the mean number of deformed pups is 5.67 with
standard deviation 4.10.
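A minimal version of that simulation, assuming 10 pups per dam as in the Scenario 2b code later in this chapter (the seed is arbitrary, so exact counts will differ from the text's run):

set.seed(2)                            # arbitrary seed
pi_1a <- rep(0.5, 24)                  # Scenario 1a: identical probabilities
count_1a <- rbinom(24, 10, pi_1a)
pi_1b <- rbeta(24, 0.5, 0.5)           # Scenario 1b: Beta(0.5, 0.5) probabilities
count_1b <- rbinom(24, 10, pi_1b)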
FIGURE 7.1: Probabilities of producing a deformed pup under Scenario 1a (all dams at 0.5) and Scenario 1b (probabilities drawn from a Beta(0.5, 0.5) distribution).
Thought Questions
FIGURE 7.2: Counts of deformed pups per dam under Scenarios 1a and 1b.
If we were to model the number of deformed pups per dam in Scenario 1a, we
could ignore the potential of a dam effect (since all dams behave the same)
and proceed with regular binomial regression as in Chapter 6. Since we have
no predictors, we would start with the model:
$$\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = \hat{\beta}_0, \textrm{ where } \hat{\beta}_0 = 0.067$$

which produces an estimated odds of deformity $\hat{p}/(1-\hat{p}) = e^{0.067} = 1.069$ and estimated probability $\hat{p} = 0.517$. Creating 95% confidence intervals using a
profile likelihood approach, we get:
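The fit and intervals can be obtained as follows (a sketch, assuming count_1a from the simulation above and the model name fit_1a_binom from Table 7.1):

fit_1a_binom <- glm(cbind(count_1a, 10 - count_1a) ~ 1, family = binomial)
coef(fit_1a_binom)          # beta0-hat (0.067 in the text's run)
exp(confint(fit_1a_binom))  # profile likelihood CI for the odds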
Turning to Scenario 1b, where each dam has a unique probability of producing
a pup with a deformity based on a beta distribution, we can fit binomial and
quasibinomial models as well.
A binomial model gives regression equation
$$\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 0.268,$$
with associated profile likelihood 95% confidence intervals:
Thought Questions
That is, we assume that the log odds of a deformity is linearly related to dose through the equation $\log\left(\frac{p}{1-p}\right) = -2 + 1.33\,\textrm{dose}$, and the odds of a deformity are 3.79 times greater ($e^{1.33}$) for each 1-mg increase in dose.
In Scenario 2b, each dam who received a dose of x has probability of deformity
randomly chosen from a beta distribution where α = 2p/(1 − p) and β = 2.
These beta distribution parameters ensure that, on average, dams with a
dose x in Scenario 2b have the same probability of a deformed pup as dams
with dose x in Scenario 2a. For example, dams receiving the 1-mg dosage
under Scenario 2b would have probabilities following a beta distribution with α = 2(0.34)/(1 − 0.34) = 1.03 and β = 2, which has mean $\frac{\alpha}{\alpha+\beta} = 0.34$. The
big difference is that all dams receiving the 1-mg dosage in Scenario 2a have
probability 0.34 of a deformed pup, whereas dams receiving the 1-mg dosage
in Scenario 2b each have a unique probability of a deformed pup, but those
probabilities average out to 0.34.
Figure 7.3 displays histograms for each dosage group of each dam’s proba-
bility of producing deformed pups under Scenario 2b as well as theoretical
distributions of probabilities. A vertical line is displayed at each hypothetical
distribution’s mean; the vertical line represents the fixed probability of a
deformed pup for all dams under Scenario 2a.
# pi_2a (defined in code run but not printed in this chapter) holds each dam's
# Scenario 2a probability of a deformed pup: 0.119, 0.339, 0.661, or 0.881 by dose
set.seed(1)
b <- 2
a <- b * pi_2a / (1 - pi_2a)        # beta parameters with mean pi_2a
pi_2b <- rbeta(24, a, b)            # dam-specific probabilities under Scenario 2b
count_2b <- rbinom(24, 10, pi_2b)   # deformed pups out of 10 per dam
FIGURE 7.3: Histograms of each dam's probability of producing deformed pups under Scenario 2b, by dosage group (0, 1, 2, and 3 mg), with theoretical beta densities overlaid.
                    Scenario 2a                          Scenario 2b
Dosage   Mean p   SD p   Mean Count   SD Count   Mean p   SD p   Mean Count   SD Count
0        0.119    0      1.333        1.366      0.061    0.069  0.500        0.837
1        0.339    0      3.167        1.835      0.239    0.208  3.500        2.881
2        0.661    0      5.833        1.472      0.615    0.195  5.833        1.941
3        0.881    0      8.833        1.169      0.872    0.079  8.833        1.169
Thought Questions
the R markdown file for this chapter to completely fill out the table.
Do confidence intervals contain the true model parameters?
10. Why are differences between quasibinomial and binomial models of
Scenario 2a less noticeable than the differences in Scenario 2b?
11. Why does Scenario 2b contain correlated data that we must account
for, while Scenario 2a does not?
Tubes were placed on trees in some locations or transects but not in others.
One research question is whether tree growth in the first year is affected by
the presence of tubes. This analysis has a structure similar to that of the dams and pups study; the two study designs are depicted in Figure 7.4.
Some transects were assigned to have tubes on all of their trees, and other
transects had tubes on none of their trees, just as every dam assigned to a
certain group received the same dose. Within a transect, each tree’s first year
of growth was measured, much like the presence or absence of a defect was
noted for every pup within a dam. Although the response in the tree tube
study is continuous (and somewhat normally distributed) rather than binary
as in the dams and pups study, we can use methods to account for correlation
of trees within a transect, just as we accounted for correlation of pups within
dams.
FIGURE 7.4: Data structures in the Dams and Pups (left) and Tree Growth
(right) case studies.
We will consider a subset of the full data set in treetube.csv for illustration
purposes here: the 382 trees with heights recorded in both 1990 and 1991.
Thus, we will consider the following variables:
• id = a unique identifier for each tree
• transect = a unique identifier for each transect containing several trees
• species = tree species
• tubes = an indicator variable for the presence or absence of tubes for a
given transect
• height91 = first year height for each tree in meters
• height90 = baseline height for each tree in meters
• growth_yr1 = height91 - height90, in meters
A sample of 10 observations is displayed in Table 7.3.
This portion of the data indicates that the four trees in transect 18 have
tubes, while the other 6 trees listed do not. The concern with this kind of
data structure is that trees from the same transect may be more similar or
correlated with one another, in contrast to trees from other transects. This
could be true for a variety of reasons: some transects may receive less sun than
others, or irrigation of the soil may differ from transect to transect. These
unmeasured but possibly influential factors may imply a correlation among
trees within transects. In that case, we would not have independent pieces of
information, so that the number of trees within a transect would overstate
the amount of independent information. To prepare for an analysis of this
potentially correlated data, we examine the sources of variability in first-year
tree growth.
TABLE 7.3: A sample of 10 trees and their growth from 1990 to 1991.
Transect effects For some of the factors previously mentioned such as sun
exposure or water availability, first-year growth may vary by transect. Knowing
which specific transects produce greater growth is not of interest and would not
appear in a publication of this study. These random effects are analogous
to dam effects which were not of inherent interest, but which we nevertheless
wished to account for.
Data sets with this kind of structure are often referred to as multilevel
data, and the remaining chapters delve into models for multilevel data in
gory detail. With a continuous response variable, we will actually add random
effects for transects to a more traditional linear least squares regression model
rather than estimate an overdispersion parameter as with a binary response.
Either way, if observations are really correlated, proper accounting will lead to
larger standard errors for model coefficients and larger (but more appropriate)
p-values for testing the significance of those coefficients.
Attempting to model the effects of tubes on tree growth, we could use LLSR
which yields the model:
$$\widehat{\textrm{Growth}} = 0.106 - 0.040\,\textrm{Tube}$$
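A sketch of this fit, assuming the subset described above has been read into a hypothetical data frame tree.growth:

tree.growth <- read.csv("treetube.csv")   # the 382 trees with heights in both years
lm.tube <- lm(growth_yr1 ~ tubes, data = tree.growth)
coef(lm.tube)   # intercept 0.106, tubes coefficient -0.040 in the text's fit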
7.9 Summary
The most important idea from this chapter is that structures of data sets
may imply that outcomes are correlated. Correlated outcomes provide less
information than independent outcomes, resulting in effective sample sizes
that are less than the total number of observations. Neglecting to take into
account correlation may lead to underestimating standard errors of coefficients,
overstating significance and precision. Correlation is likely and should be
accounted for if basic observational units (e.g., pups, trees) are aggregated in
ways that would lead us to expect units within groups to be similar.
We have mentioned two ways to account for correlation: incorporate a dispersion
parameter or include random effects. In the following chapters, we will primarily
focus on models with random effects. In fact, there are even more ways to
account for correlation, including inflating the variance using Huber-White
estimators (aka Sandwich estimators), and producing corrected variances using
bootstrapping. These are beyond the scope of this text.
7.10 Exercises
2. More dams and pups. Describe how to generalize the pup and
dam example by allowing for different size litters.
2. Dams and pups (continued). Modify the dams and pups simula-
tion in the following ways. In each case, produce plots and describe
the results of your modified simulation.
a. Pick a different beta distribution for Scenario 1b.
b. Center the beta distributions in Scenarios 1a and 1b somewhere
other than 0.5.
c. Repeat Scenario 2a with 3 doses and an underlying logistic
model of your own choosing. Then create beta distributions as
in Scenario 2b to match your 3 doses.
The correlated binomial counts simulated in the Dams and Pups Case Study
are in fact beta-binomial random variables like those simulated in the Guided
Exercises from Chapter 3. Indeed, we could use the form of a beta-binomial
pdf to model overdispersed binomial variables. Unlike the more generic form of
accounting for correlation using dispersion parameter estimates, beta-binomial
models are more specific and highly parameterized. This approach involves more
assumptions but may also yield more information than the quasi-likelihood
approach. If the beta-binomial model is incorrect, however, our results may
be misleading. That said, the beta-binomial structure is quite flexible and
conforms to many situations.
8 Introduction to Multilevel Models
TABLE 8.1: A snapshot of selected variables from the first three and the
last three observations in the Music Performance Anxiety case study.
Our examination of the data from Sadler and Miller [2010] in musicdata.csv
will focus on the following key variables:
• id = unique musician identification number
• diary = cumulative total of diaries filled out by musician
• perf_type = type of performance (Solo, Large Ensemble, or Small Ensemble)
• audience = who attended (Instructor, Public, Students, or Juried)
• memory = performed from Memory, using Score, or Unspecified
• na = negative affect score from PANAS
• gender = musician gender
• instrument = Voice, Orchestral, or Piano
• mpqab = absorption subscale from MPQ
• mpqpem = positive emotionality (PEM) composite scale from MPQ
• mpqnem = negative emotionality (NEM) composite scale from MPQ
Sample rows containing selected variables from our data set are illustrated in
Table 8.1; note that each subject (id) has one row for each unique diary entry.
As with any statistical analysis, our first task is to explore the data, examining
distributions of individual responses and predictors using graphical and nu-
merical summaries, and beginning to discover relationships between variables.
With multilevel models, exploratory analyses must eventually account for
the level at which each variable is measured. In a two-level study such as
this one, Level One will refer to variables measured at the most frequently
occurring observational unit, while Level Two will refer to variables measured
on larger observational units. For example, in our study on music performance
anxiety, many variables are measured at every performance. These “Level One”
variables include:
FIGURE 8.1: Histograms of (a) negative affect scores across all performances and (b) mean negative affect by subject.
We can also summarize categorical Level One covariates across all (possibly
correlated) observations to get a rough relative comparison of trends. A total
of 56.1% of the 497 performances in our data set were solos, while 27.3%
were large ensembles and 16.5% were small ensembles. The most common
audience type was a public performance (41.0%), followed by instructors
(30.0%), students (20.1%), and finally juried recitals (8.9%). In 30.0% of
performances, the musician played by memory, while 55.1% used the score and
14.9% of performances were unspecified.
To generate an initial examination of Level Two covariates, we consider a data
set with just one observation per subject, since Level Two variables are constant
over all performances from the same subject. Then, we can proceed as we did
with Level One covariates—using histograms to illustrate the distributions
of continuous covariates (see Figure 8.2) and tables to summarize categorical
covariates. For example, we learn that the majority of subjects have positive
emotionality scores between 50 and 60, but that several subjects fall into a
lengthy lower tail with scores between 20 and 50. A summary of categorical
Level Two covariates reveals that among the 37 subjects (26 female and 11
male), 17 play an orchestral instrument, 15 are vocal performers, and 5 play a
keyboard instrument.
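A sketch of this Level Two exploration (object name ours):

# Level Two data set: one row per subject, since Level Two covariates
# are constant across a musician's performances
music.lev2 <- music %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()

# Histograms for continuous Level Two covariates, counts for categorical ones
ggplot(music.lev2, aes(x = mpqpem)) +
  geom_histogram(binwidth = 5) +
  labs(x = "PEM", y = "Frequency")
music.lev2 %>% count(gender)
music.lev2 %>% count(instrument)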
[Figure 8.2: Histograms of the continuous Level Two covariates NEM, PEM, and
Absorption, with one observation per subject.]
To avoid the issue of dependent observations in our three plots from Figure
8.3, we could generate separate plots for each subject and examine trends
within and across subjects. These “lattice plots” are illustrated in Figures
8.4, 8.5, and 8.6; we discuss such plots more thoroughly in Chapter 9. While
general trends are difficult to discern from these lattice plots, we can see the
variety across subjects in sample sizes and in overall levels of performance
anxiety. In particular, in Figure 8.6, we notice that linear fits for many subjects
illustrate the same slight downward trend displayed in the overall scatterplot
in Figure 8.3, although some subjects experience increasing anxiety and others
exhibit non-linear trends. Having an idea of the range of individual trends will
be important when we begin to draw overall conclusions from this study.
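Lattice plots like Figure 8.6 can be sketched with ggplot2 faceting; the subset
of subjects below is arbitrary:

# Negative affect vs. previous performances, one panel per subject,
# with a least squares line per panel (first 9 subjects shown)
music %>%
  filter(id %in% head(unique(id), 9)) %>%
  ggplot(aes(x = diary - 1, y = na)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ id) +
  labs(x = "Previous Performances", y = "Negative Affect")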
[Figure 8.3: Negative affect summaries using all 497 performances: (a, b) boxplots
by audience type and by performance type; (c) scatterplot of negative affect vs.
number of previous performances.]
In Figure 8.7, we use boxplots to examine the relationship between our primary
categorical Level Two covariate (instrument) and our continuous model re-
sponse. Plot (a) uses all 497 performances, while plot (b) uses one observation
per subject (the mean performance anxiety across all performances) regardless
of how many performances that subject had. Naturally, plot (b) has a more
condensed range of values, but both plots seem to support the notion that
performance anxiety is slightly lower for vocalists and maybe a bit higher for
keyboardists.
In Figure 8.8, we use scatterplots to examine the relationships between con-
tinuous Level Two covariates and our model response. Performance anxiety
appears to vary little with a subject’s positive emotionality, but there is some
evidence to suggest that performance anxiety increases with increasing nega-
tive emotionality and absorption level. Plots based on mean negative affect,
with one observation per subject, support conclusions based on plots with all
observations from all subjects; indeed, the overall relationships are in the same
direction and of the same magnitude.

[Figure 8.4: Lattice plot of performance type vs. negative affect, with separate
dotplots by subject.]

[Figure 8.5: Lattice plot of audience type vs. negative affect, with separate
dotplots by subject.]

[Figure 8.6: Lattice plot of previous performances vs. negative affect, with
separate scatterplots with fitted lines by subject.]

[Figure 8.7: Boxplots of negative affect by instrument: (a) using all performances;
(b) using mean negative affect by subject.]
Of course, any graphical analysis is exploratory, and any notable trends at this
stage should be checked through formal modeling. At this point, a statistician
begins to ask familiar questions such as:
• which characteristics of individual performances are most associated with
performance anxiety?
As you might expect, answers to these questions will arise from proper consid-
eration of variability and properly identified statistical models.
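Such a baseline can come from an ordinary least squares fit that ignores the
grouping of performances within musicians. A sketch in R (indicator definitions
and object names are ours; the exact predictor set behind the excerpted output
below is not shown in this excerpt):

# Indicator variables (level labels assumed from the codebook above;
# they may differ in the raw data)
music <- music %>%
  mutate(orch  = ifelse(instrument == "Orchestral", 1, 0),
         large = ifelse(perf_type == "Large Ensemble", 1, 0))

# A naive LLSR fit treating all 497 performances as independent
model.llsr <- lm(na ~ orch + large + orch:large, data = music)
summary(model.llsr)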
## R squared = 0.02782
## Residual standard error = 5.179
Other than somewhat skewed residuals, residual plots (not shown) do not
indicate any major problems with the LLSR model. However, another key
assumption in these models is the independence of all observations. While we
might reasonably conclude that responses from different study participants are
independent (although possibly not if they are members of the same ensemble
group), it is not likely that the 15 or so observations taken over multiple
performances from a single subject are similarly independent. If a subject
begins with a relatively high level of anxiety (compared to other subjects)
before their first performance, chances are good that they will have relatively
high anxiety levels before subsequent performances. Thus, multiple linear least
squares regression using all 497 observations is not advisable for this study (or
multilevel data sets in general).
One alternative is a two-stage approach: first, fit a separate Level One model
for each subject. For Musician #22, for example,

Y22j = a22 + b22 LargeEns22j + ε22j , where ε22j ∼ N(0, σ²) and   (8.1)

LargeEns22j = 1 if perf_type = Large Ensemble, and
LargeEns22j = 0 if perf_type = Solo or Small Ensemble.
The parameters in this model (a22 , b22 , and σ 2 ) can be estimated through
least squares methods. a22 represents the true intercept for Musician #22—the
expected anxiety score for Musician #22 when performance type is a Solo or
Small Ensemble (LargeEns = 0), or the true average anxiety for Musician #22
over all Solo or Small Ensemble performances he may conceivably give. b22
represents the true slope for Musician #22—the expected increase in perfor-
mance anxiety for Musician #22 when performing as part of a Large Ensemble
rather than in a Small Ensemble or as a Solo, or the true average difference
in anxiety scores for Musician #22 between Large Ensemble performances
and other types. Finally, the ε22j terms represent the deviations of Musician
#22’s actual performance anxiety scores from the expected scores under this
model—the part of Musician #22’s anxiety before performance j that is not
explained by performance type. The variability in these deviations from the
regression model is denoted by σ 2 .
For Subject 22, we estimate â22 = 24.5, b̂22 = −7.8, and σ̂ = 4.8. Thus,
according to our simple linear regression model, Subject 22 had an estimated
anxiety score of 24.5 before Solo and Small Ensemble performances, and 16.7
(7.8 points lower) before Large Ensemble performances. With an R2 of 0.425,
the regression model explains a moderate amount (42.5%) of the performance-
to-performance variability in anxiety scores for Subject 22, and the trend
toward lower scores for large ensemble performances is statistically significant
at the 0.05 level (t(13)=-3.10, p=.009).
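A sketch of the Subject 22 fit (object names ours):

# Level One model (8.1) for Musician #22 alone
music.22 <- filter(music, id == 22)
fit.22 <- lm(na ~ large, data = music.22)
summary(fit.22)  # text reports a-hat = 24.5, b-hat = -7.8, sigma-hat = 4.8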
After fitting similar Level One models for each of the 37 subjects, we can model
the fitted intercepts and slopes at Level Two using subject-level covariates such
as the instrument a musician plays:

ai = α0 + α1 Orchi + ui   (8.2)
bi = β0 + β1 Orchi + vi   (8.3)
FIGURE 8.9: Histograms of intercepts and slopes from fitting simple regres-
sion models by subject, where each model contained a single binary predictor
indicating if a performance was part of a large ensemble.
In Equations (8.2) and (8.3), the responses are not observed data points but
rather the fitted regression coefficients from the Level One models fit to each
subject. (Well, in our theoretical model, the responses are actually the true
intercepts and slopes from Level One models for each subject, but in reality,
we have to use our estimated slopes and intercepts.)
Exploratory data analysis (see boxplots by instrument in Figure 8.10) suggests
that subjects playing orchestral instruments have higher intercepts than vocal-
ists or keyboardists, and that orchestral instruments are associated with slightly
lower (more negative) slopes, although with less variability than the slopes of
vocalists and keyboardists. These trends are borne out in regression modeling.
If we fit Equations (8.2) and (8.3) using fitted intercepts and slopes as our
response variables, we obtain the following estimated parameters: α̂0 = 16.3,
α̂1 = 1.4, β̂0 = −0.8, and β̂1 = −1.4. Thus, the intercept (ai ) and slope (bi )
for Subject i can be modeled as:

âi = 16.3 + 1.4 Orchi
b̂i = −0.8 − 1.4 Orchi ,
where ai is the true mean negative affect when Subject i is playing solos
or small ensembles, and bi is the true mean difference in negative affect for
Subject i between large ensembles and other performance types. Based on
these models, average performance anxiety before solos and small ensembles is
16.3 for vocalists and keyboardists, but 17.7 (1.4 points higher) for orchestral
instrumentalists. Before playing in large ensembles, vocalists and keyboardists
have performance anxiety (15.5) which is 0.8 points lower, on average,
than before solos and small ensembles, while subjects playing orchestral in-
struments experience an average difference of 2.2 points, producing an average
performance anxiety of 15.5 before playing in large ensembles just like sub-
jects playing other instruments. However, the difference between orchestral
instruments and others does not appear to be statistically significant for either
intercepts (t=1.424, p=0.163) or slopes (t=-1.168, p=0.253).
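To make the two-stage procedure concrete, here is a sketch using the music data
frame from earlier (object names ours); subjects who never performed in a large
ensemble end up with missing slopes, one of the drawbacks noted below:

# Stage 1: fit a separate Level One model for each subject; subjects with
# no large ensemble performances get no slope estimate (NA below)
library(broom)
lev1.fits <- music %>%
  group_by(id, orch) %>%
  group_modify(~ tidy(lm(na ~ large, data = .x))) %>%
  ungroup() %>%
  select(id, orch, term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate) %>%
  rename(int = `(Intercept)`, slope = large)

# Stage 2: model the fitted intercepts and slopes, as in (8.2) and (8.3)
summary(lm(int ~ orch, data = lev1.fits))
summary(lm(slope ~ orch, data = lev1.fits))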
FIGURE 8.10: Boxplots of fitted intercepts, plot (a), and slopes, plot (b),
by orchestral instrument (1) vs. keyboard or vocalist (0).
This two-stage modeling process does have some drawbacks. Among other
things, (1) it weights every subject the same regardless of the number of diary
entries we have, (2) it responds to missing individual slopes (from 7 subjects
who never performed in a large ensemble) by simply dropping those subjects,
and (3) it does not share strength effectively across individuals. These issues
can be better handled through a unified multilevel modeling framework which
we will develop in the next section.
For the unified approach, we will still envision two levels of models as in Section
8.4.2, but we will use likelihood-based methods for parameter estimation rather
than ordinary least squares to address the drawbacks associated with the two-
stage approach. To illustrate the unified approach, we will first generalize the
models presented in Section 8.4.2. Let Yij be the performance anxiety score of
the ith subject before performance j. If we are initially interested in examining
the effects of playing in a large ensemble and playing an orchestral instrument,
then we can model the performance anxiety for Subject i in performance j
with the following system of equations:
• Level One:
Yij = ai + bi LargeEnsij + εij
• Level Two:
ai = α0 + α1 Orchi + ui
bi = β0 + β1 Orchi + vi ,
In this system, there are 4 key fixed effects to estimate: α0 , α1 , β0 and β1 .
Fixed effects are the fixed but unknown population effects associated with
certain covariates. The intercepts and slopes for each subject from Level One,
ai and bi , don’t need to be formally estimated as we did in Section 8.4.2;
they serve to conceptually connect Level One with Level Two. In fact, by
substituting the two Level Two equations into the Level One equation, we can
view this two-level system of models as a single Composite Model without
ai and bi :

Yij = [α0 + α1 Orchi + β0 LargeEnsij + β1 Orchi LargeEnsij ]
      + [ui + vi LargeEnsij + εij ]
From this point forward, when building multilevel models, we will use Greek
letters (such as α0 ) to denote final fixed effects model parameters to be
estimated empirically, and Roman letters (such as a0 ) to denote preliminary
fixed effects parameters at lower levels. Variance components that will be
estimated empirically will be denoted with σ or ρ, while terms such as εij and
ui represent error terms. In our framework, we can estimate final parameters
directly without first estimating preliminary parameters, which can be seen
with the Composite Model formulation (although we can obtain estimates of
preliminary parameters in those occasional cases when they are of interest to
us). Note that when we model a slope term like bi from Level One using Level
Two covariates like Orchi , the resulting Composite Model contains a cross-
level interaction term, denoting that the effect of LargeEnsij depends on
the instrument played.
Furthermore, with a binary predictor at Level Two such as instrument, we can
write out what our Level Two model looks like for those who play keyboard or
are vocalists (Orchi = 0) and those who play orchestral instruments (Orchi =
1):
• Keyboardists and Vocalists (Orchi = 0)
ai = α0 + ui
bi = β0 + vi
• Orchestral Instrumentalists (Orchi = 1)
ai = (α0 + α1 ) + ui
bi = (β0 + β1 ) + vi
8.5 Two-Level Modeling: A Unified Approach 227
Writing the Level Two model in this manner helps us interpret the model
parameters from our two-level model. In this case, even the Level One covariate
is binary, so that we can write out expressions for mean performance anxiety
based on our model for four different combinations of instrument played and
performance type:
• Keyboardists or vocalists playing solos or small ensembles: α0
• Keyboardists or vocalists playing large ensembles: α0 + β0
• Orchestral instrumentalists playing solos or small ensembles: α0 + α1
• Orchestral instrumentalists playing large ensembles: α0 + α1 + β0 + β1
We assume that each random effect (error term) in the two-level model follows
a normal distribution with mean 0 and a variance parameter which must be
estimated from the data. For example, at Level One, we will assume that
the errors associated with each performance of a particular musician can be
described as: εij ∼ N(0, σ²). At Level Two, we have one error term
(ui ) associated with subject-to-subject differences in intercepts, and one error
term (vi ) associated with subject-to-subject differences in slopes. That is,
ui represents the deviation of Subject i from the mean performance anxiety
before solos and small ensembles after accounting for their instrument, and vi
represents the deviation of Subject i from the mean difference in performance
anxiety between large ensembles and other performance types after accounting
for their instrument.
In modeling the random behavior of ui and vi , we must also account for the
possibility that random effects at the same level might be correlated. Subjects
with higher baseline performance anxiety have a greater capacity for showing
decreased anxiety in large ensembles as compared to solos and small ensembles,
so we might expect that subjects with larger intercepts (performance anxiety
before solos and small ensembles) would have smaller slopes (indicating greater
decreases in anxiety before large ensembles). In fact, our fitted Level One
intercepts and slopes in this example actually show evidence of a fairly strong
negative correlation (r = −0.525, see Figure 8.11).
FIGURE 8.11: Scatterplot with fitted regression line for estimated intercepts
and slopes (one point per subject).
To allow for this correlation, the error terms at Level Two can be assumed to
follow a multivariate normal distribution in our unified multilevel model.
Mathematically, we can express this as:

\[
\begin{bmatrix} u_i \\ v_i \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} \sigma_u^2 & \rho_{uv}\sigma_u\sigma_v \\ \rho_{uv}\sigma_u\sigma_v & \sigma_v^2 \end{bmatrix} \right)
\]
where σu2 is the variance of the ui terms, σv2 is the variance of the vi terms,
8.5 Two-Level Modeling: A Unified Approach 229
and σuv = ρuv σu σv is the covariance between the ui and the vi terms (i.e.,
how those two terms vary together).
Note that the correlation ρuv between the error terms is simply the covariance
σuv = ρuv σu σv converted to a [−1, 1] scale through the relationship:
\[ \rho_{uv} = \frac{\sigma_{uv}}{\sigma_u \sigma_v} \]
With this expression, we are allowing each error term to have its own variance
(around a mean of 0) and each pair of error terms to have its own covariance
(or correlation). Thus, if there are n equations at Level Two, we can have n
variance terms and n(n − 1)/2 covariance terms for a total of n + n(n − 1)/2
variance components. These variance components are organized in matrix
form, with variance terms along the diagonal and covariance terms in the
off-diagonal. In our small example, we have n = 2 equations at Level Two, so
we have 3 variance components to estimate—2 variance terms (σu2 and σv2 ) and
1 correlation (ρuv ).
The multivariate normal distribution with n = 2 is illustrated in Figure 8.12
for two cases: (a) the error terms are uncorrelated (σuv = ρuv = 0), and (b)
the error terms are positively correlated (σuv > 0 and ρuv > 0). In general, if
the errors in intercepts (ui ) are placed on the x-axis and the errors in slopes
(vi ) are placed on the y-axis, then σu2 measures spread in the x-direction and
σv2 measures spread in the y-direction, while σuv measures tilt. Positive tilt
(σuv > 0) indicates a tendency for errors from the same subject to both be
positive or both be negative, while negative tilt (σuv < 0) indicates a tendency
for one error from a subject to be positive and the other to be negative. In
Figure 8.12, σu2 = 4 and σv2 = 1, so both contour plots show a greater range of
errors in the x-direction than the y-direction. Ellipses near the center of the
contour plot indicate pairs of ui and vi that are more likely. In Figure 8.12
(a) σuv = ρuv = 0, so the axes of the contour plot correspond to the x- and
y-axes, but in Figure 8.12 (b) σuv = 1.5, so the contour plot tilts up, reflecting
a tendency for high values of ui to be associated with high values of vi .
Now, our relatively simple two-level model has 8 parameters that need to be
estimated: 4 fixed effects (α0 , α1 , β0 , and β1 ), and 4 variance components
(σ 2 , σu2 , σv2 , and σuv ). Note that we use the term variance components
to signify model parameters that describe the behavior of random effects.
We can use statistical software, such as the lmer() function from the lme4
package in R, to obtain parameter estimates using our 497 observations. The
most common methods for estimating model parameters—both fixed effects
and variance components—are maximum likelihood (ML) and restricted
maximum likelihood (REML).

[Figure 8.12: Contour plots of the bivariate normal distribution of (ui , vi ) with
σu² = 4 and σv² = 1: (a) uncorrelated error terms (σuv = ρuv = 0); (b) positively
correlated error terms (σuv = 1.5).]

The method of ML was introduced in Chapter 2, where parameter estimates are
chosen to maximize the value of the
likelihood function based on observed data. REML is conditional on the fixed
effects, so that the part of the data used for estimating variance components
is separated from that used for estimating fixed effects. Thus REML, by
accounting for the loss in degrees of freedom from estimating the fixed effects,
provides an unbiased estimate of variance components, while ML estimators
for variance components are biased under assumptions of normality, since they
use estimated fixed effects rather than the true values. REML is preferable
when the number of parameters is large or the primary interest is obtaining
estimates of model parameters, either fixed effects or variance components
associated with random effects. ML should be used if nested fixed effects
models are being compared using a likelihood ratio test, although REML is
fine for nested models of random effects (with the same fixed effects model).
In this text, we will typically report REML estimates unless we are specifically
comparing nested models with the same random effects. In most case studies
and most models we consider, there is very little difference between ML and
REML parameter estimates. Additional details are beyond the scope of this
book [Singer and Willett, 2003].
Note that the multilevel output shown beginning in the next section contains
no p-values for performing hypothesis tests. This is primarily because the exact
distribution of the test statistics under the null hypothesis (no fixed effect) is
unknown, primarily because the exact degrees of freedom is not known [Bates
et al., 2015]. Finding good approximate distributions for test statistics (and thus
good approximate p-values) in multilevel models is an area of active research. In
most cases, we can simply conclude that t-values (ratios of parameter estimates
to estimated standard errors) with absolute value above 2 indicate statistically
significant fixed effects.
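A sketch of the lmer() call that generates the type of output shown below (the
object name model.c anticipates the label used in Section 8.7):

# Initial two-level model: performance type at Level One, instrument at
# Level Two, with a random intercept and random large slope by musician
library(lme4)
model.c <- lmer(na ~ orch + large + orch:large + (large | id), data = music)
summary(model.c)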
Random effects:
Groups Name Variance Std.Dev. Corr
C) id (Intercept) 5.655 2.378
D) large 0.452 0.672 -0.63
E) Residual 21.807 4.670
F) Number of obs: 497, groups: id, 37
Fixed effects:
Estimate Std. Error t value
G) (Intercept) 15.930 0.641 24.83
H) orch 1.693 0.945 1.79
I) large -0.911 0.845 -1.08
J) orch:large -1.424 1.099 -1.30
This output (except for the capital letters along the left column) was specifically
generated by the lmer() function in R; multilevel modeling results from other
packages will contain similar elements. Because we will use lmer() output to
summarize analyses of case studies in this and following sections, we will spend
a little time now orienting ourselves to the most important features in this
output.
• Fixed effects:
– α̂0 = 15.9. The estimated mean performance anxiety for solos and small
ensembles (Large=0) for keyboard players and vocalists (Orch=0) is
15.9.
– α̂1 = 1.7. Orchestral instrumentalists have an estimated mean perfor-
mance anxiety for solos and small ensembles which is 1.7 points higher
than keyboard players and vocalists.
– β̂0 = −0.9. Keyboard players and vocalists have an estimated mean
decrease in performance anxiety of 0.9 points when playing in large
ensembles instead of solos or small ensembles.
– β̂1 = −1.4. Orchestral instrumentalists have an estimated mean decrease
in performance anxiety of 2.3 points when playing in large ensembles
instead of solos or small ensembles, 1.4 points greater than the mean
decrease among keyboard players and vocalists.
• Variance components
– σ̂u = 2.4. The estimated standard deviation of performance anxiety
levels for solos and small ensembles is 2.4 points, after controlling for
instrument played.
– σ̂v = 0.7. The estimated standard deviation of differences in performance
anxiety levels between large ensembles and other performance types is
0.7 points, after controlling for instrument played.
– ρ̂uv = −0.63. The estimated correlation between performance anxiety
scores for solos and small ensembles and increases in performance anxiety
for large ensembles is -0.63, after controlling for instrument played. Those
subjects with higher performance anxiety scores for solos and small
ensembles tend to have greater decreases in performance anxiety for
large ensemble performances.
– σ̂ = 4.7. The estimated standard deviation in residuals for the individual
regression models is 4.7 points.
Two-level modeling as done with the music performance anxiety data usually
involves fitting a number of models. Subsequent sections will describe a process
of starting with the simplest two-level models and building toward a final
model which addresses the research questions of interest.
The first model fit in almost any multilevel context should be the uncon-
ditional means model, also called a random intercepts model. In this
model, there are no predictors at either level; rather, the purpose of the un-
conditional means model is to assess the amount of variation at each level—to
compare variability within subject to variability between subjects. Expanded
models will then attempt to explain sources of between and within subject
variability.
The unconditional means (random intercepts) model, which we will denote as
Model A, can be specified either using formulations at both levels:
• Level One:
Yij = ai + εij where εij ∼ N(0, σ²)
• Level Two:
ai = α0 + ui where ui ∼ N (0, σu2 )
or as a composite model:
Yij = α0 + ui + εij
In this model, the performance anxiety scores of subject i are not a function of
performance type or any other Level One covariate, so that ai is the true mean
response of all observations for subject i. On the other hand, α0 is the grand
mean – the true mean of all observations across the entire population. Our
primary interest in the unconditional means model is the variance components
– σ 2 is the within-person variability, while σu2 is the between-person variability.
The name random intercepts model then arises from the Level Two equation
for ai : each subject’s intercept is assumed to be a random value from a normal
distribution centered at α0 with variance σu2 .
Using the composite model specification, the unconditional means model can
be fit to the music performance anxiety data using statistical software (a sketch
of the call follows the estimates below):
• α̂0 = 16.2 = the estimated mean performance anxiety score across all
performances and all subjects.
• σ̂ 2 = 22.5 = the estimated variance in within-person deviations.
• σ̂u2 = 5.0 = the estimated variance in between-person deviations.
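A minimal sketch of the Model A fit (object name ours):

# Model A: unconditional means (random intercepts) model
model.a <- lmer(na ~ 1 + (1 | id), data = music)
summary(model.a)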
The next step in model fitting is to build a good model for predicting per-
formance anxiety scores at Level One (within subject). We will add poten-
tially meaningful Level One covariates—those that vary from performance-
to-performance for each individual. In this case, mirroring our model from
Section 8.4, we will include a binary covariate for performance type:

LargeEnsij = 1 if perf_type = Large Ensemble, and
LargeEnsij = 0 if perf_type = Solo or Small Ensemble,
and no other Level One covariates (for now). (Note that we may later also want
to include an indicator variable for “Small Ensemble” to separate the effects of
Solo performances and Small Ensemble performances.) The resulting model,
which we will denote as Model B, can be specified either using formulations at
both levels:
• Level One:
Yij = ai + bi LargeEnsij + εij
• Level Two:
ai = α0 + ui
bi = β0 + vi
or as a composite model:

Yij = [α0 + β0 LargeEnsij ] + [ui + vi LargeEnsij + εij ],

where εij ∼ N(0, σ²) and

\[
\begin{bmatrix} u_i \\ v_i \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} \sigma_u^2 & \rho_{uv}\sigma_u\sigma_v \\ \rho_{uv}\sigma_u\sigma_v & \sigma_v^2 \end{bmatrix} \right).
\]
In this model, performance anxiety scores for subject i are assumed to differ
(on average) for Large Ensemble performances as compared with Solos and
Small Ensemble performances; the εij terms capture the deviation between the
true performance anxiety levels for subjects (based on performance type) and
their observed anxiety levels. α0 is then the true mean performance anxiety
level for Solos and Small Ensembles, and β0 is the true mean difference in
performance anxiety for Large Ensembles compared to other performance types.
As before, σ 2 quantifies the within-person variability (the scatter of points
around individuals’ means by performance type), while now the between-person
variability is partitioned into variability in Solo and Small Ensemble scores
(σu2 ) and variability in differences with Large Ensembles (σv2 ).
Using the composite model specification, Model B can be fit to the music
performance anxiety data (a sketch of the call is below), producing the following
parameter estimates:
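A minimal sketch of the Model B fit (object name ours):

# Model B: random intercepts and slopes, no Level Two covariates
model.b <- lmer(na ~ large + (large | id), data = music)
summary(model.b)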
From this output, we obtain estimates of our six model parameters (2 fixed
effects and 4 variance components):
• α̂0 = 16.7 = the mean performance anxiety level before solos and small
ensemble performances.
• β̂0 = −1.7 = the mean decrease in performance anxiety before large ensemble
performances.
• σ̂ 2 = 21.8 = the variance in within-person deviations.
• σ̂u2 = 6.3 = the variance in between-person deviations in performance anxiety
scores before solos and small ensembles.
• σ̂v2 = 0.7 = the variance in between-person deviations in increases (or
decreases) in performance anxiety scores before large ensembles.
• ρ̂uv = −0.76 = the correlation in subjects’ anxiety before solos and small
ensembles and their differences in anxiety between large ensembles and other
performance types.
We see that, on average, subjects had a performance anxiety level of 16.7 before
solos and small ensembles, and their anxiety levels were 1.7 points lower, on
average, before large ensembles, producing an average performance anxiety
level before large ensembles of 15.0. According to the t-value listed in R, the
difference between large ensembles and other performance types is statistically
significant (t=-3.09).
This random slopes and intercepts model is illustrated in Figure 8.13. The
thicker black line shows the overall trends given by our estimated fixed effects:
an intercept of 16.7 and a slope of -1.7. Then, each subject is represented by a
gray line. Not only do the subjects’ intercepts differ (with variance 6.3), but
their slopes differ as well (with variance 0.7). Additionally, subjects’ slopes and
intercepts are negatively associated (with correlation -0.76), so that subjects
with greater intercepts tend to have steeper negative slopes. We can compare
this model with the random intercepts model from Section 8.6.2, pictured in
Figure 8.14. With no effect of large ensembles, each subject is represented by
a gray line with identical slopes (0) but varying intercepts (with variance 5.0).
FIGURE 8.13: The random slopes and intercepts model fitted to the music
performance anxiety data. Each gray line represents one subject, and the
thicker black line represents the trend across all subjects.
FIGURE 8.14: The random intercepts model fitted to the music performance
anxiety data. Each gray line represents one subject, and the thicker black line
represents the trend across all subjects.
Figures 8.13 and 8.14 use empirical Bayes estimates for the intercepts (ai )
and slopes (bi ) of individual subjects. Empirical Bayes estimates are sometimes
called “shrinkage estimates” since they combine individual-specific information
with information from all subjects, thus “shrinking” the individual estimates
toward the group averages. Empirical Bayes estimates are often used when a
term such as ai involves both fixed and random components; further detail
can be found in Raudenbush and Bryk [2002] and Singer and Willett [2003].
The estimated negative correlation between intercepts and slopes makes sense:
those with higher levels of performance anxiety before solos and small ensembles
have more opportunity for decreases before large ensembles.
Pseudo R-squared values are not universally reliable as measures of model
performance. Because of the complexity of estimating fixed effects and variance
components at various levels of a multilevel model, it is not unusual to encounter
situations in which covariates in a Level Two equation for, say, the intercept
remain constant (while other aspects of the model change), yet the associated
pseudo R-squared values differ or are negative. For this reason, pseudo R-
squared values in multilevel models should be interpreted cautiously.
The initial two-level model described in Section 8.5.5 essentially expands upon
the random slopes and intercepts model by adding a binary covariate for
instrument played at Level Two. We will denote this as Model C:
• Level One:
Yij = ai + bi LargeEnsij + εij
• Level Two:
ai = α0 + α1 Orchi + ui
bi = β0 + β1 Orchi + vi ,
\[
\begin{bmatrix} u_i \\ v_i \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} \sigma_u^2 & \rho_{uv}\sigma_u\sigma_v \\ \rho_{uv}\sigma_u\sigma_v & \sigma_v^2 \end{bmatrix} \right).
\]
We found that there are no highly significant fixed effects in Model C (other
than the intercept). In particular, we have no significant evidence that musicians
playing orchestral instruments reported different performance anxiety scores,
on average, for solos and small ensembles than keyboardists and vocalists,
no evidence of a difference in performance anxiety by performance type for
keyboard players and vocalists, and no evidence of an instrument effect in
difference between large ensembles and other types.
Since no terms were added at Level One when expanding from the random
slopes and intercepts model (Model B), no discernible changes should occur
in explained within-person variability (although small changes could occur
due to numerical estimation procedures used in likelihood-based parameter
estimates). However, Model C expanded Model B by using the instrument
which a subject plays to model both intercepts and slopes at Level Two. We
can use pseudo R-squared values for both intercepts and slopes to evaluate
the impact on between-person variability of adding instrument to Model B.
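Under the standard definition (the proportional reduction in a variance
component), a sketch of this computation using the model objects from the
earlier sketches:

# Rows of as.data.frame(VarCorr(.)) for these fits: intercept variance,
# slope variance, intercept-slope covariance, then residual variance
vc.b <- as.data.frame(VarCorr(model.b))
vc.c <- as.data.frame(VarCorr(model.c))
(vc.b$vcov[1] - vc.c$vcov[1]) / vc.b$vcov[1]   # pseudo R-squared, intercepts
(vc.b$vcov[2] - vc.c$vcov[2]) / vc.b$vcov[2]   # pseudo R-squared, slopes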
One potential simplification of Model C is to remove the error term vi from
the Level Two equation for the slope, so that large ensemble effects are assumed
fixed across subjects after accounting for instrument. This model (call it Model
C2) keeps Level One unchanged but uses:

• Level Two:
ai = α0 + α1 Orchi + ui
bi = β0 + β1 Orchi ,
df AIC
model.c 8 3003
model.c2 6 2999
df BIC
model.c 8 3037
model.c2 6 3025
Note that parameter estimates for the remaining 6 fixed effects and variance
components closely mirror the corresponding parameter estimates from Model
C. In fact, removing the error term on the slope has improved (reduced) both
the AIC and BIC measures of overall model performance. Instead of assuming
that the large ensemble effects, after accounting for instrument played, vary by
individual, we are assuming that the large ensemble effect is fixed across subjects.
It is not unusual to run a two-level model like this, with an error term on the
intercept equation to account for subject-to-subject differences, but with no
error terms on other Level Two equations unless there is an a priori reason to
allow effects to vary by subject or if the model performs better after building
in those additional error terms.
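A sketch of Model C2 and the AIC/BIC comparison shown above (object names
match the table):

# Model C2: drop the random slope vi, leaving only a random intercept
model.c2 <- lmer(na ~ orch + large + orch:large + (1 | id), data = music)
AIC(model.c, model.c2)
BIC(model.c, model.c2)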
Recall that we are particularly interested in this study in Level Two covariates—
those subject-specific variables that provide insight into why individuals react
differently in anxiety-inducing situations. In Section 8.3, we saw evidence that
subjects with higher baseline levels of negative emotionality tend to have higher
performance anxiety levels prior to performances. Thus, in our next step in
model building, we will add negative emotionality as a Level Two predictor to
Model C. With this addition, our new model can be expressed as a system of
Level One and Level Two models:
• Level One:
Yij = ai + bi LargeEnsij + εij
• Level Two:
ai = α0 + α1 Orchi + α2 MPQnemi + ui
bi = β0 + β1 Orchi + β2 MPQnemi + vi ,
or as a composite model:

Yij = [α0 + α1 Orchi + α2 MPQnemi + β0 LargeEnsij + β1 Orchi LargeEnsij
      + β2 MPQnemi LargeEnsij ] + [ui + vi LargeEnsij + εij ].
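Model D can be fit with a call along these lines (a sketch; the formula follows
from the Level Two equations above):

# Model D: add baseline negative emotionality (mpqnem) at Level Two,
# modeling both the intercept and the large ensemble effect
model.d <- lmer(na ~ orch + mpqnem + large + orch:large + mpqnem:large +
                  (large | id), data = music)
summary(model.d)

Fitting Model D yields the following estimates: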
• α̂0 = 11.57. The estimated mean performance anxiety for solos and small
ensembles (large=0) is 11.57 for keyboard players and vocalists (orch=0)
with negative emotionality of 0 at baseline (mpqnem=0). Since the minimum
negative emotionality score in this study was 11, this interpretation, while
technically correct, is not practically meaningful.
• α̂1 = 1.00. Orchestral instrument players have an estimated mean anxiety
level before solos and small ensembles which is 1.00 point higher than
keyboardists and vocalists, controlling for the effects of baseline negative
emotionality.
• α̂2 = 0.15. A one point increase in baseline negative emotionality is associated
with an estimated 0.15 mean increase in anxiety levels before solos and small
ensembles, after controlling for instrument.
• β̂0 = −0.28. Keyboard players and vocalists (orch=0) with baseline negative
emotionality levels of 0 (mpqnem=0) have an estimated mean decrease in
anxiety level of 0.28 points before large ensemble performances compared to
other performance types.
• β̂1 = −0.95. After accounting for baseline negative emotionality, orchestral
instrument players have an estimated mean anxiety level before solos and
small ensembles which is 1.00 point higher than keyboardists and vocalists,
while the mean anxiety of orchestral players is only .05 points higher before
large ensembles (a difference of .95 points).
• β̂2 = −0.03. After accounting for instrument, a one-point increase in baseline
negative emotionality is associated with an estimated 0.15 mean increase in
anxiety levels before solos and small ensembles, but only an estimated 0.12
increase before large ensembles (a difference of .03 points).
Some of the detail in these parameter interpretations can be tricky—describing
interaction terms, deciding if a covariate must be fixed at 0 or merely held
constant, etc. Often it helps to write out models for special cases to isolate
the effects of specific fixed effects. We will consider a few parameter estimates
from above and see why the interpretations are written as they are.
• α̂1 . For solos and small ensembles (LargeEns=0), the following equations
describe the fixed effects portion of the composite model for negative affect
score for vocalists and keyboardists (Orch=0) and orchestral instrumentalists
(Orch=1):
Orch = 0 : Yij = α0 + α2 MPQnemi
Orch = 1 : Yij = (α0 + α1 ) + α2 MPQnemi
Thus, α̂1 describes the estimated difference in mean negative affect between
orchestral instrumentalists and others (holding MPQnem constant), but only for
solos and small ensembles. For large ensembles, the difference between those
playing orchestral instruments and others is actually given by α̂1 + β̂1 , holding
MPQnem constant (Show!).
• β̂0 . Because LargeEns interacts with both Orch and MPQnem in Model D, β̂0
only describes the estimated difference between large ensembles and other
performance types when both Orch=0 and MPQnem=0, thus removing the
effects of the interaction terms. If, for instance, Orch=1 and MPQnem=20, then
the difference between large ensembles and other performance types is given
by β̂0 + β̂1 + 20β̂2 .
• β̂1 . As with α̂1 , we consider equations describing the fixed effects portion of
the composite model for negative affect score for vocalists and keyboardists
(Orch=0) and orchestral instrumentalists (Orch=1), except here we leave
LargeEns as an unknown rather than restricting the model to solos and small
ensembles:
Orch = 0 : Yij = α0 + α2 MPQnemi + β0 LargeEnsij + β2 MPQnemi LargeEnsij
Orch = 1 : Yij = (α0 + α1 ) + α2 MPQnemi + (β0 + β1 )LargeEnsij
           + β2 MPQnemi LargeEnsij
As long as baseline negative emotionality is held constant (at any level, not
just 0), then β̂1 represents the estimated difference in the large ensemble effect
between those playing orchestral instruments and others.
At this point, we might ask: do the two extra fixed effects terms in Model D
provide a significant improvement over Model C? Nested models such as these
can be tested using a likelihood ratio test (drop in deviance test), as we’ve
used in Sections 4.4.4 and 6.5.4 with certain generalized linear models. Since
we are comparing models nested in their fixed effects, we use full maximum
likelihood methods to estimate model parameters, as discussed in Section 8.5.4.
As expected, the likelihood is larger (and the log-likelihood is less negative)
under the larger model (Model D); our test statistic (14.734) is then -2 times
the difference in log-likelihood between Models C and D. Comparing the test
statistic to a chi-square distribution with 2 degrees of freedom (signifying the
number of additional terms in Model D), we obtain a p-value of .0006. Thus,
Model D significantly outperforms Model C.
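A sketch of this comparison in R (anova() refits merMod objects with full ML
before comparing nested fixed effects):

# Drop-in-deviance (likelihood ratio) test of Model D vs. Model C
anova(model.c, model.d)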
Two models, whether they are nested or not, can be compared using AIC and
BIC measures, which were first seen in Chapter 1 and later used in evaluating
generalized linear models. In this case, the AIC clearly favors Model D (2996.7)
over Model C (3007.3), whereas the BIC favors Model D (3038.8) only slightly
over Model C (3041.0) since the BIC imposes a stiffer penalty on additional
terms and additional model complexity. However, the likelihood ratio test is a
more reliable method for comparing nested models.
Centering a covariate (subtracting a fixed anchor value from every observation)
can make parameter interpretations more meaningful. Often, when there's no
pre-defined anchor value, the mean is used to represent a typical case. With this
in mind, we can create a new variable, centered baseline NEM:

cmpqnem = mpqnem − mean(mpqnem) = mpqnem − 31.63
and replace baseline NEM in Model D with its centered version to create Model
E:
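A sketch of the centering step and the Model E fit (object names ours):

# Model E: Model D with mpqnem centered at its sample mean
music <- music %>% mutate(cmpqnem = mpqnem - mean(mpqnem))
model.e <- lmer(na ~ orch + cmpqnem + large + orch:large + cmpqnem:large +
                  (large | id), data = music)
summary(model.e)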
As you compare Model D to Model E, you should notice that only two
things change – α̂0 and β̂0 . All other parameter estimates—both fixed effects
and variance components—remain identical; the basic model is essentially
unchanged as well as the amount of variability in anxiety levels explained by the
model. α̂0 and β̂0 are the only two parameter estimates whose interpretations
in Model D refer to a specific level of baseline NEM. In fact, the interpretations
that held true where NEM=0 (which isn’t possible) now hold true for cmpqnem=0
or when NEM is at its average value of 31.63, which is possible and quite
meaningful. Now, parameter estimates using centered baseline NEM in Model
E change in value from Model D and produce more useful interpretations:
• α̂0 = 16.26. The estimated mean performance anxiety for solos and small
ensembles (large=0) is 16.26 for keyboard players and vocalists (orch=0)
with an average level of negative emotionality at baseline (mpqnem=31.63).
• β̂0 = −1.23. Keyboard players and vocalists (orch=0) with an average level of
baseline negative emotionality levels (mpqnem=31.63) have an estimated mean
decrease in anxiety level of 1.23 points before large ensemble performances
compared to other performance types.
We now begin iterating toward a “final model” for these data, on which we will
base conclusions. Typical features of a “final multilevel model” include fixed
effects that address the primary research questions while controlling for
important covariates, along with variance components that capture the
important sources of variability.
Although the process of reporting and writing up research results often de-
mands the selection of a sensible final model, it’s important to realize that (a)
statisticians typically will examine and consider an entire taxonomy of models
when formulating conclusions, and (b) different statisticians sometimes select
different models as their “final model” for the same set of data. Choice of a
“final model” depends on many factors, such as primary research questions,
purpose of modeling, tradeoff between parsimony and quality of fitted model,
underlying assumptions, etc. So you should be able to defend any final model
you select, but you should not feel pressured to find the one and only “correct
model”, although most good models will lead to similar conclusions.
As we’ve done in previous sections, we can use (a) t-statistics for individual
fixed effects when considering adding a single term to an existing model, (b)
likelihood ratio tests for comparing nested models which differ by more than
one parameter, and (c) model performance measures such as AIC and BIC to
compare non-nested models. Below we offer one possible final model for these
data—Model F:

• Level One:

Yij = ai + bi previousij + ci studentsij + di juriedij + ei publicij
      + fi soloij + εij

• Level Two:
where previous is the number of previous diary entries filled out by that
individual (diary-1); students, juried, and public are indicator variables
created from the audience categorical variable (so that “Instructor” is the
reference level in this model); and, solo is 1 if the performance was a solo and
0 if the performance was either a small or large ensemble.
In addition, we assume the following variance-covariance structure at Level
Two:

\[
\begin{bmatrix} u_i \\ v_i \\ w_i \\ x_i \\ y_i \\ z_i \end{bmatrix} \sim
N \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix}
\sigma_u^2 & & & & & \\
\sigma_{uv} & \sigma_v^2 & & & & \\
\sigma_{uw} & \sigma_{vw} & \sigma_w^2 & & & \\
\sigma_{ux} & \sigma_{vx} & \sigma_{wx} & \sigma_x^2 & & \\
\sigma_{uy} & \sigma_{vy} & \sigma_{wy} & \sigma_{xy} & \sigma_y^2 & \\
\sigma_{uz} & \sigma_{vz} & \sigma_{wz} & \sigma_{xz} & \sigma_{yz} & \sigma_z^2
\end{bmatrix} \right).
\]
Fixed effects estimates from Model F suggest that performance anxiety increases
with baseline negative emotionality, and that this association is even more
pronounced when musicians are performing solos rather than as part of an
ensemble group. Here is how a couple of key fixed effects would be interpreted
in this final model:

8.11 Modeling Multilevel Structure: Is It Necessary?

[Figure 8.15: Hypothetical data from four subjects: negative affect vs. number
of previous performances, with separate fitted lines by subject (MLM) and a
single naive fit to all observations pooled together (LLSR).]

Within each of the four subjects, negative affect declines as the number of
previous performances grows; naively fitting a single line to the pooled data,
however, produces an overall relationship (the solid black line) that is strongly
positive. In this case, by naively fitting the 40 observations as if they were all
independent and ignoring subject effects, the LLSR analysis has gotten the
estimated slope of the overall relationship backwards, producing a continuous
data version of Simpson’s Paradox.
Our second example is based upon Model C from Section 8.7.3, with single
binary predictors at both Level One and Level Two. Using the estimated fixed
effects coefficients and variance components from random effects produced
in Model C, we generated 1000 sets of simulated data. Each set of simulated
data contained 497 observations from 37 subjects just like the original data,
with relationships between negative affect and large ensembles and orchestral
instruments (along with associated variability) governed by the estimated
parameters from Model C. Each set of simulated data was used to fit both a
multilevel model and a linear least squares regression model, and the estimated
fixed effects (α̂0 , α̂1 , β̂0 , and β̂1 ) and their standard errors were saved. Figure
8.16 shows density plots comparing the 1000 estimated values for each fixed
effect from the two modeling approaches; in general, estimates from multilevel
modeling and LLSR tend to agree pretty well, without noticeable bias. Based
on coefficient estimates alone, there appears to be no reason to favor multilevel
modeling over LLSR in this example, but Figure 8.17 tells a different story.
Figure 8.17 shows density plots comparing the 1000 estimated standard errors
associated with each fixed effect from the two modeling approaches; in general,
standard errors are markedly larger with multilevel modeling than LLSR. This
is not unusual, since LLSR assumes all 497 observations are independent, while
multilevel modeling accounts for the correlation among observations from the
same subject; treating correlated observations as independent overstates the
effective sample size and understates standard errors.
FIGURE 8.16: Density plots of parameter estimates for the four fixed effects
of Model C under both a multilevel model and linear least squares regression.
1000 sets of simulated data for the 37 subjects in our study were produced using
estimated fixed and random effects from Model C. For each set of simulated
data, estimates of (a) α0 , (b) α1 , (c) β0 , and (d) β1 were obtained using both
a multilevel and an LLSR model. Each plot then shows a density plot for the
1000 estimates of the corresponding fixed effect using multilevel modeling vs. a
similar density plot for the 1000 estimates using LLSR.
[Figure 8.17: Density plots of estimated standard errors for each of the four
fixed effects of Model C, under both a multilevel model and LLSR, from the
same 1000 simulated data sets.]
Initial examination of the data for Case Study 8.2 shows a couple of features
that must be noted. First, there are 37 unique study participants, but they
are not numbered successively from 1 to 43. The majority of participants filled
out 15 diaries, but several filled out fewer (with a minimum of 2); as with
participant IDs, diary numbers within participant are not always successively
numbered. Finally, missing data is not an issue in this data set, since researchers
had already removed participants with only 1 diary entry and performances
for which the type was not recorded (of which there were 11).
The R code below runs the initial multilevel model in Section 8.5.5. Multilevel
model notation in R is based on the composite model formulation. Here, the
response variable is na, while orch, large, and orch:large represent the
fixed effects α1 , β0 , and β1 , along with the intercept α0 which is included
automatically. Note that a colon is used to denote an interaction between two
variables. Error terms and their associated variance components are specified
in (large|id), which is equivalent to (1+large|id). This specifies two error
terms at Level Two (the id level): one corresponding to the intercept (ui ) and
one corresponding to the large ensemble effect (vi ); the multilevel model will
then automatically include a variance for each error term in addition to the
covariance between the two error terms. A variance associated with a Level
One error term is also automatically included in the multilevel model. Note
that there are ways to override the automatic inclusion of certain variance
components; for example, (0+large|id) would not include an error term for
the intercept (and therefore no covariance term at Level Two either).
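A sketch of that call (the object name model0 is ours):

# Initial two-level model of Section 8.5.5: fixed effects for orch, large,
# and their interaction; random intercept and random large slope by id
model0 <- lmer(na ~ orch + large + orch:large + (large | id), data = music)
summary(model0)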
8.13 Exercises

9 Two-Level Longitudinal Data
library(broom)
library(tidyverse)
TABLE 9.1: The first six observations in the wide data set for the Charter
Schools case study.

schoolid | schoolName | urban | charter | schPctnonw | schPctsped | schPctfree | MathAvgScore.0 | MathAvgScore.1 | MathAvgScore.2
Dtype 1 Dnum 1 Snum 2 | RIPPLESIDE ELEMENTARY | 0 | 0 | 0.0000 | 0.1176 | 0.3627 | 652.8 | 656.6 | 652.6
Dtype 1 Dnum 100 Snum 1 | WRENSHALL ELEMENTARY | 0 | 0 | 0.0303 | 0.1515 | 0.4242 | 646.9 | 645.3 | 651.9
Dtype 1 Dnum 108 Snum 30 | CENTRAL MIDDLE | 0 | 0 | 0.0769 | 0.1231 | 0.2615 | 654.7 | 658.5 | 659.7
Dtype 1 Dnum 11 Snum 121 | SANDBURG MIDDLE | 1 | 0 | 0.0977 | 0.0827 | 0.2481 | 656.4 | 656.8 | 659.9
Dtype 1 Dnum 11 Snum 193 | OAK VIEW MIDDLE | 1 | 0 | 0.0538 | 0.0954 | 0.1418 | 657.7 | 658.2 | 659.8
Dtype 1 Dnum 11 Snum 195 | ROOSEVELT MIDDLE | 1 | 0 | 0.1234 | 0.0886 | 0.2405 | 655.9 | 659.1 | 660.3
• MathAvgScore.0 = average MCA-II math score for all sixth grade students
in a school in 2008
• MathAvgScore.1 = average MCA-II math score for all sixth grade students
in a school in 2009
• MathAvgScore.2 = average MCA-II math score for all sixth grade students
in a school in 2010
This data is stored in WIDE format, with one row per school, as illustrated in
Table 9.1.
In this case, before we convert our data to LONG form, we should first
address problems with missing data. Missing data is a common phenomenon
in longitudinal studies. For instance, it could arise if a new school was started
during the observation period, a school was shut down during the observation
period, or no results were reported in a given year. Dealing with missing data
in a statistical analysis is not trivial, but fortunately many multilevel packages
(including the lme4 package in R) are adept at handling missing data.
First, we must understand the extent and nature of missing data in our study.
Table 9.2 is a frequency table of missing data patterns, where 1 indicates
presence of a variable and 0 indicates a missing value for a particular variable;
this table is a helpful starting point. Among our 618 schools, 540 had complete
data (all covariates and math scores for all three years), 25 were missing a
math score for 2008, 35 were missing math scores in both 2008 and 2009, etc.
The number of schools with a particular missing data pattern are listed in the
left column; the remaining columns of 0’s and 1’s describe the missing data
pattern, with 0 indicating a missing value. Some covariates that are present
for every school are not listed. The bottom row gives the number of schools
with missing values for specific variables; the last entry indicates that 121 total
observations were missing.
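A table in the spirit of Table 9.2 can be produced with the mice package, one
option among several (a sketch; the file name is an assumption):

# Tabulate missing-data patterns: 1 = present, 0 = missing, with pattern
# counts on the left and per-variable missing counts in the bottom row
library(tidyverse)
library(mice)
chart.wide <- read_csv("chart_wide_condense.csv")
md.pattern(chart.wide, plot = FALSE)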
Statisticians have devised different strategies for handling missing data; a few
common approaches are described briefly here:
• Include only schools with complete data. This is the cleanest approach
analytically; however, ignoring data from 12.6% of the study’s schools (since
78 of the 618 schools had incomplete data) means that a large amount
of potentially useful data is being thrown away. In addition, this approach
creates potential issues with informative missingness. Informative missingness
occurs when a school’s lack of scores is not a random phenomenon but provides
information about the effectiveness of the school type (e.g., a school closes
because of low test scores).
• Last observation carried forward. Each school’s last math score is analyzed
as a univariate response, whether the last measurement was taken in 2008,
2009, or 2010. With this approach, data from all schools can be used, and
analyses can be conducted with traditional methods assuming independent
responses. This approach is sometimes used in clinical trials because it tends
to be conservative, setting a higher bar for showing that a new therapy is
significantly better than a traditional therapy. Of course, we must assume
that a school’s 2008 score is representative of its 2010 score. In addition,
information about trajectories over time is thrown away.
• Imputation of missing observations. Many methods have been developed for
sensibly “filling in” missing observations, using imputation models which
base imputed data on subjects with similar covariate profiles and on typical
observed time trends. Once an imputed data set is created (or several imputed
data sets), analyses can proceed with complete data methods that are easier
to apply. Risks with the imputation approach include misrepresenting missing
observations and overstating precision in final results.
• Apply multilevel methods, which use available data to estimate patterns over
time by school and then combine those school estimates in a way that recog-
nizes that time trends for schools with complete data are more precise than
time trends for schools with fewer measurements. Laird [1988] demonstrates
that multilevel models are valid under the fairly unrestrictive condition that
the probability of missingness cannot depend on any unobserved predictors
or the response. This is the approach we will follow in the remainder of the
text.
Now, we are ready to create our LONG data set. Fortunately, many packages
(including R) have built-in functions for easing this conversion, and the func-
tions are improving constantly. The resulting LONG data set is shown in Table
9.3, where year08 measures the number of years since 2008.

TABLE 9.3: The first six observations in the long data set for the Charter
Schools case study; these lines correspond to the first two observations from
the wide data set illustrated in Table 9.1.
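The conversion itself might look like this (a sketch; column names from Table
9.1, and chart.wide from the earlier sketch):

# WIDE to LONG: one row per school per year
chart.long <- chart.wide %>%
  pivot_longer(cols = starts_with("MathAvgScore."),
               names_to = "year08",
               names_prefix = "MathAvgScore.",
               values_to = "MathAvgScore") %>%
  mutate(year08 = as.numeric(year08))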
At Level One (within school), we will consider no covariates other than time
at this point (potential covariates at this level may have included measures of
the number of students tested or funds available per student). We will, however,
consider the Level Two variables of charter or non-charter, urban
or rural, percent free and reduced lunch, percent special education, and percent
non-white. Although covariates such as percent free and reduced lunch may
vary slightly from year to year within a school, the larger and more important
differences tend to occur between schools, so we used percent free and reduced
lunch for a school in 2010 as a Level Two variable.
As in Chapter 8, we can conduct initial investigations of relationships between
Level Two covariates and test scores in two ways. First, we can use all 1733
observations to investigate relationships of Level Two covariates with test
scores. Although these plots will contain dependent points, since each school is
represented by up to three years of test score data, general patterns exhibited
in these plots tend to be real. Second, we can calculate mean scores across all
years for each of the 618 schools. While we lose some information with this
approach, we can more easily consider each plotted point to be independent.
Typically, both types of exploratory plots illustrate similar relationships, and
in this case, both approaches are so similar that we will only show plots using
the second approach, with one observation per school.
Figure 9.1 shows the distribution of MCA math test scores as somewhat
left-skewed. MCA test scores for sixth graders are scaled to fall between 600
and 700, where scores above 650 for individual students indicate “meeting
standards”. Thus, schools with averages below 650 will often have increased
incentive to improve their scores the following year. When we refer to the
“math score” for a particular school in a particular year, we will assume that
score represents the average for all sixth graders at that school. In Figure 9.2,
we see that test scores are generally higher both for schools in rural areas
and for public non-charter schools. Note that in this data set there are 237
schools in rural areas and 381 schools in urban areas, as well as 545 public
non-charter schools and 73 charter schools. In addition, we can see in Figure
9.3 that schools tend to have lower math scores if they have higher percentages
of students with free and reduced lunch, with special education needs, or who
are non-white.
FIGURE 9.1: Histogram of mean sixth grade MCA math test scores over
the years 2008-2010 for 618 Minnesota schools.
FIGURE 9.2: Boxplots of categorical Level Two covariates vs. average MCA
math scores. Plot (a) shows charter vs. public non-charter schools, while plot
(b) shows urban vs. rural schools.
Schools differ in initial status (average math scores in 2008), in rate of change
(change in test scores over the three-year period), and in the form of the
relationship.
These differences among schools are nicely illustrated in so-called spaghetti
plots such as Figure 9.5, which overlays the individual schools’ time trends
(for the math test scores) from Figure 9.4 on a single set of axes. In order to
illustrate the overall time trend without making global assumptions about the
form of the relationship, we overlaid in bold a non-parametric fitted curve
through a loess smoother. LOESS stands for “locally estimated scatterplot smoothing,” in which a low-degree polynomial is fit in the neighborhood of each data point using weighted regression, with nearby points receiving greater weight.
LOESS is a computationally intensive method which performs especially well
with larger sets of data, although ideally there would be a greater diversity of
FIGURE 9.3: Scatterplots of average MCA math scores by (a) percent free
and reduced lunch, (b) percent special education, and (c) percent non-white
in a school.
x-values than the three time points we have. In this case, the loess smoother
follows very closely to a linear trend, indicating that assuming a linear increase
in test scores over the three-year period is probably a reasonable simplifying
assumption. To further examine the hypothesis that linearity would provide a
reasonable approximation to the form of the individual time trends in most
cases, Figure 9.6 shows a lattice plot containing linear fits through ordinary
least squares rather than connected time points as in Figure 9.4.
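A spaghetti plot with an overlaid loess smoother can be produced with ggplot2; the following is a minimal sketch, with illustrative data frame and variable names:

# Sketch: spaghetti plot with overall loess smoother
# (chart.long, MathAvgScore, year08, schoolid are illustrative).
library(ggplot2)
ggplot(chart.long, aes(x = year08, y = MathAvgScore)) +
  geom_line(aes(group = schoolid), color = "grey60") +
  geom_smooth(method = "loess", se = FALSE, color = "black", size = 1.2) +
  labs(x = "Years since 2008", y = "Math Score")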
FIGURE 9.4: Lattice plot by school of math scores over time for the first 24
schools in the data set.
FIGURE 9.5: Spaghetti plot of math scores over time by school, for all the
charter schools and a random sample of public non-charter schools, with overall
fit using loess (bold).
FIGURE 9.6: Lattice plot by school of math scores over time with linear fit
for the first 24 schools in the data set.
FIGURE 9.7: Spaghetti plots showing time trends for each school by school
type, for a random sample of charter schools (left) and public non-charter
schools (right), with overall fits using loess (bold).
FIGURE 9.8: Spaghetti plots showing time trends for each school by quartiles
of percent free and reduced lunch, with loess fits.
Even though we know that every school’s math test scores were not strictly
linearly increasing or decreasing over the observation period, a linear model
for individual time trends is often a simple but reasonable way to model data.
One advantage of using a linear model within school is that each school’s
data points can be summarized with two summary statistics—an intercept
and a slope (obviously, this is an even bigger advantage when there are more
observations over time per school). For instance, we see in Figure 9.6 that sixth
graders from the school depicted in the top right slot slowly increased math
scores over the three-year observation period, while students from the school
depicted in the fourth column of the top row generally experienced decreasing
math scores over the same period. As a whole, the linear model fits individual
trends pretty well, and many schools appear to have slowly increasing math
scores over time, as researchers in this study may have hypothesized.
Another advantage of assuming a linear trend at Level One (within schools)
is that we can examine summary statistics across schools. Both the intercept
and slope are meaningful for each school: the intercept conveys the school’s
math score in 2008, while the slope conveys the school’s average yearly increase
or decrease in math scores over the three-year period. Figure 9.9 shows that
point estimates and uncertainty surrounding individual estimates of intercepts
and slopes vary considerably. In addition, we can generate summary statistics
and histograms for the 618 intercepts and slopes produced by fitting linear
regression models at Level One, in addition to R-squared values which describe
the strength of fit of the linear model for each school (Figure 9.10). For our
618 schools, the mean math score for 2008 was 651.4 (SD=7.28), and the
mean yearly rate of change in math scores over the three-year period was
1.30 (SD=2.51). We can further examine the relationship between schools’
intercepts and slopes. Figure 9.11 shows a general decreasing trend, suggesting
that schools with lower 2008 test scores tend to have greater growth in scores
between 2008 and 2010 (potentially because those schools have more room
for improvement); this trend is supported with a correlation coefficient of
−0.32 between fitted intercepts and slopes. Note that, with only 3 or fewer
observations for each school, extreme or intractable values for the slope and
R-squared are possible. For example, slopes cannot be estimated for those
schools with just a single test score, R-squared values cannot be calculated for
those schools with no variability in test scores between 2008 and 2010, and
R-squared values must be 1 for those schools with only two test scores.
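A minimal sketch of this two-stage summary, with illustrative object and variable names, computes an intercept, slope, and R-squared value for each school with at least two test scores:

# Sketch: fit an OLS line by school and collect summaries
# (names illustrative).
library(dplyr)
school.fits <- chart.long %>%
  filter(!is.na(MathAvgScore)) %>%
  group_by(schoolid) %>%
  filter(n() >= 2) %>%   # need 2+ scores to estimate a slope
  summarize(int = coef(lm(MathAvgScore ~ year08))[1],
            slope = coef(lm(MathAvgScore ~ year08))[2],
            rsq = summary(lm(MathAvgScore ~ year08))$r.squared)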
Summarizing trends over time within schools is typically only a start, however.
Most of the primary research questions from this study involve comparisons
among schools, such as: (a) are there significant differences between charter
schools and public non-charter schools, and (b) do any differences between
charter schools and public schools change with percent free and reduced lunch,
percent special education, or location? These are Level Two questions, and we
can begin to explore these questions by graphically examining the effects of
school-level variables on schools’ linear time trends. By school-level variables,
we are referring to those covariates that differ by school but are not dependent
on time. For example, school type (charter or public non-charter), urban or
rural location, percent non-white, percent special education, and percent free
FIGURE 9.9: Point estimates and 95% confidence intervals for (a) intercepts
and (b) slopes by school, for the first 24 schools in the data set.
FIGURE 9.10: Histograms for (a) intercepts, (b) slopes, and (c) R-squared
values from fitted regression lines by school.
FIGURE 9.11: Scatterplot of fitted slopes vs. fitted intercepts from regression lines by school.
and reduced lunch are all variables which differ by school but which don’t
change over time, at least as they were assessed in this study. Variables which
would be time-dependent include quantities such as per pupil funding and
reading scores.
Figure 9.12 shows differences in the average time trends by school type, using
estimated intercepts and slopes to support observations from the spaghetti
plots in Figure 9.7. Based on intercepts, charter schools have lower math scores,
on average, in 2008 than public non-charter schools. Based on slopes, however,
charter schools tend to improve their math scores at a slightly faster rate than
public schools, especially at the 75th percentile and above. By the end of the
three-year observation period, we would nevertheless expect charter schools to
have lower average math scores than public schools. For another exploratory
perspective on school type comparisons, we can examine differences between
school types with respect to math scores in 2008 and math scores in 2010. As
expected, boxplots by school type (Figure 9.13) show clearly lower math scores
for charter schools in 2008, but differences are slightly less dramatic in 2010.
FIGURE 9.12: Boxplots of (a) intercepts and (b) slopes by school type
(charter vs. public non-charter).
FIGURE 9.13: Boxplots of (a) 2008 and (b) 2010 math scores by school
type (charter (1) vs. public non-charter (0)).
of free and reduced lunch. We will investigate these trends more thoroughly
with statistical modeling.
FIGURE 9.14: (a) Boxplot of percent free and reduced lunch by school type
(charter vs. public non-charter), along with scatterplots of (b) intercepts and
(c) slopes from fitted regression lines by school vs. percent free and reduced
lunch.
FIGURE 9.15: Boxplots of (a) intercepts and (b) slopes from fitted regression
lines by school vs. school type (charter vs. public non-charter), separated by
high and low levels of percent free and reduced lunch.
To evaluate the within-school correlation structure, we can first fit a simple linear regression to all observations, ignoring school: Ŷij = β̂0 + β̂1 Year08ij, where Ŷij is the predicted math score of the ith school at time j, and time j is the number of years since 2008. In this model, the predicted math score will
be identical for all schools at a given time point j. Residuals Yij − Ŷij are then
calculated for each observation, measuring the difference between actual math
score and the average overall time trend. Figure 9.16 then combines three pieces
of information: the upper right triangle contains correlation coefficients for
residuals between pairs of years, the diagonal contains histograms of residuals
at each time point, and the lower left triangle contains scatterplots of residuals
from two different years. In our case, we see that correlation between residuals
from adjacent years is strongly positive (0.81-0.83) and does not drop off
greatly as the time interval between years increases.
FIGURE 9.16: Correlation structure within school. The upper right contains
correlation coefficients between residuals at pairs of time points, the lower left
contains scatterplots of the residuals at time point pairs, and the diagonal
contains histograms of residuals at each of the three time points.
As you might expect, answers to these questions will arise from proper consid-
eration of variability and properly identified statistical models. As in Chapter
8, we will begin model fitting with some simple, preliminary models, in part to
establish a baseline for evaluating larger models. Then, we can build toward a
final model for inference by attempting to add important covariates, centering
certain variables, and checking assumptions.
From the unconditional means model, with no predictors at either level, we obtain:

• α̂0 = 652.7 = the mean math score across all schools and all years

The intraclass correlation coefficient is then

ρ̂ = σ̂u² / (σ̂u² + σ̂²) = 41.869 / (41.869 + 10.571) = 0.798
At the lowest level, we can consider building individual growth models over
time for each of the 618 schools in our study. First, we must decide upon
a form for each of our 618 growth curves. Based on our initial exploratory
analyses, assuming that an individual school’s MCA-II math scores follow a
linear trend seems like a reasonable starting point. Under the assumption of
linearity, we must estimate an intercept and a slope for each school, based
on their 1-3 test scores over a period of three years. Compared to time series
analyses of economic data, most longitudinal data analyses have relatively few
time periods for each subject (or school), and the basic patterns within subject
are often reasonably described by simpler functional forms.
Let Yij be the math score of the ith school in year j. Then we can model the linear change in math test scores over time for School i according to Model B:

Yij = ai + bi Year08ij + εij, where εij ∼ N(0, σ²).

The parameters in this model (ai, bi, and σ²) can be estimated through LLSR
methods. ai represents the true intercept for School i—i.e., the expected test
score level for School i when time is zero (2008)—while bi represents the true
slope for School i—i.e., the expected yearly rate of change in math score for
School i over the three-year observation period. Here we use Roman letters
rather than Greek for model parameters since models by school will eventually
be a conceptual first step in a multilevel model. The εij terms represent the
deviation of School i’s actual test scores from the expected results under linear
growth—the part of school i’s test score at time j that is not explained by
linear changes over time. The variability in these deviations from the linear
model is given by σ 2 . In Figure 9.17, which illustrates a linear growth model
for Norwood Central Middle School, ai is estimated by the y-intercept of the
fitted regression line, bi is estimated by the slope of the fitted regression line,
and σ 2 is estimated by the variability in the vertical distances between each
point (the actual math score in year j) and the line (the predicted math score
in year j).
FIGURE 9.17: Linear growth model for Norwood Central Middle School.
In a multilevel model, we let intercepts (ai ) and slopes (bi ) vary by school
and build models for these intercepts and slopes using school-level variables
at Level Two. An unconditional growth model features no predictors at Level
Two and can be specified either using formulations at both levels:
• Level One:
Yij = ai + bi Year08ij + εij
• Level Two:
ai = α0 + ui
bi = β0 + vi
or as a composite model:

Yij = [α0 + β0 Year08ij] + [ui + vi Year08ij + εij]
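This unconditional growth model can be fit with lmer() from the lme4 package; the call below is a sketch with illustrative object and variable names:

# Sketch: unconditional growth model with random intercepts
# and slopes by school (names illustrative).
library(lme4)
model.ug <- lmer(MathAvgScore ~ year08 + (year08 | schoolid),
                 data = chart.long)
summary(model.ug)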
While modeling linear trends over time is often a good approximation of reality,
it is by no means the only way to model the effect of time. One alternative is to
model the quadratic effect of time, which implies adding terms for both time
and the square of time. To reduce the correlation between the linear and quadratic components of the time effect, the time variable is typically centered first; we have already “centered” on 2008. Modifying Model B to produce an
unconditional quadratic growth model would take the following form:
• Level One:
Yij = ai + bi Year08ij + ci Year08²ij + εij
• Level Two:
ai = α0 + ui
bi = β0 + vi
ci = γ0 + wi
With the extra term at Level One for the quadratic effect, we now have
3 equations at Level Two, and 6 variance components at Level Two (3
variance terms and 3 covariance terms). However, with only a maximum of 3
observations per school, we lack the data for fitting 3 equations with error
terms at Level Two. Instead, we could model the quadratic time effect with
fewer variance components—for instance, by only using an error term on the
intercept at Level Two:
ai = α0 + ui
bi = β0
ci = γ0
where ui ∼ N(0, σu²). Models like this are frequently used in practice—they
allow for a separate overall effect on test scores for each school, while minimizing
parameters that must be estimated. The tradeoff is that this model does not
allow linear and quadratic effects to differ by school, but we have little choice
here without more observations per school. Thus, using the composite model
specification, the unconditional quadratic growth model with random intercept
for each school can be fit to the MCA-II test data:
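The fit itself (output not reproduced here) could be obtained with a call like the following sketch, again with illustrative names:

# Sketch: unconditional quadratic growth with a random
# intercept only at Level Two.
library(lme4)
model.q <- lmer(MathAvgScore ~ year08 + I(year08^2) + (1 | schoolid),
                data = chart.long)
summary(model.q)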
From this output, we see that the quadratic effect is positive and significant
(t=7.1), in this case indicating that increases in test scores are greater between
2009 and 2010 than between 2008 and 2009. Based on AIC and BIC values,
the quadratic growth model outperforms the linear growth model with random
intercepts only at Level Two (AIC: 10308 vs. 10354; BIC: 10335 vs. 10375).
Another frequently used approach to modeling time effects is the piecewise
linear model. In this model, the complete time span of the study is divided
into two or more segments, with a separate slope relating time to the response
in each segment. In our case study there is only one piecewise option—fitting
separate slopes in 2008-09 and 2009-10. With only 3 time points, creating a
piecewise linear model is a bit simplified, but this idea can be generalized to
segments with more than two years each.
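As a sketch (names illustrative), the two piecewise time variables and a corresponding random-intercept model could be constructed as follows, where year0809 carries the 2008-09 slope and year0910 the 2009-10 slope:

# Sketch: piecewise linear time variables and model fit.
library(dplyr)
library(lme4)
chart.long <- chart.long %>%
  mutate(year0809 = pmin(year08, 1),      # 0, 1, 1 for year08 = 0, 1, 2
         year0910 = pmax(year08 - 1, 0))  # 0, 0, 1 for year08 = 0, 1, 2
model.pw <- lmer(MathAvgScore ~ year0809 + year0910 + (1 | schoolid),
                 data = chart.long)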
The performance of this model is very similar to the quadratic growth model
by AIC and BIC measures, and the story told by fixed effects estimates is also
very similar. While the mean yearly increase in math scores was 0.2 points
between 2008 and 2009, it was 2.3 points between 2009 and 2010.
Despite the good performances of the quadratic growth and piecewise linear
models on our three-year window of data, we will continue to use linear growth
assumptions in the remainder of this chapter. Not only is a linear model easier
to interpret and explain, but it’s probably a more reasonable assumption in
years beyond 2010. Predicting future performance by assuming that a steep one-year rise or a non-linear trend will continue is riskier than predicting from the average increase over the two-year period.
ai = α0 + α1 Charteri + ui
bi = β0 + β1 Charteri + vi
With a binary predictor at Level Two such as school type, we can write out
what our Level Two model looks like for public non-charter schools and charter
schools.
• Public schools
ai = α0 + ui
bi = β0 + vi
• Charter schools
ai = (α0 + α1 ) + ui
bi = (β0 + β1 ) + vi
Writing the Level Two model in this manner helps us interpret the model
parameters from our two-level model. We can use statistical software (such
as the lmer() function from the lme4 package in R) to obtain parameter
estimates using our 1733 observations, after first converting our Level One and Level Two models into a composite model (Model C) with fixed effects and random effects separated:

Yij = [α0 + α1 Charteri + β0 Year08ij + β1 Charteri Year08ij] + [ui + vi Year08ij + εij]
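A sketch of the corresponding lmer() call (with illustrative names, where charter is a 0/1 indicator for charter schools):

# Sketch: Model C fit with lmer().
library(lme4)
model.c <- lmer(MathAvgScore ~ charter + year08 + charter:year08 +
                  (year08 | schoolid), data = chart.long)
summary(model.c)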
• Fixed effects:
– α̂0 = 652.1. The estimated mean test score for 2008 for non-charter
public schools is 652.1.
– α̂1 = −6.02. Charter schools have an estimated test score in 2008 which
is 6.02 points lower than public non-charter schools.
– β̂0 = 1.20. Public non-charter schools have an estimated mean increase
in test scores of 1.20 points per year.
– β̂1 = 0.86. Charter schools have an estimated mean increase in test
scores of 2.06 points per year over the three-year observation period, 0.86
points higher than the mean yearly increase among public non-charter
schools.
• Variance components:
– σ̂u = 5.99. The estimated standard deviation of 2008 test scores is 5.99
points, after controlling for school type.
– σ̂v = 0.36. The estimated standard deviation of yearly changes in test
scores during the three-year observation period is 0.36 points, after
controlling for school type.
– ρ̂uv = 0.88. The estimated correlation between 2008 test scores and
yearly changes in test scores is 0.88, after controlling for school type.
– σ̂ = 2.96. The estimated standard deviation in residuals for the individual
growth curves is 2.96 points.
• Level Two:
ai = α0 + α1 Charteri + α2 schpctfreei + ui
bi = β0 + β1 Charteri + β2 schpctfreei + vi
We now begin iterating toward a “final model” for these data, on which we
will base conclusions. Being cognizant of typical features of a “final model” as
outlined in Chapter 8, we offer one possible final model for this data—Model
F:
• Level One:

Yij = ai + bi Year08ij + εij

• Level Two:

ai = α0 + α1 Charteri + α2 urbani + α3 SchPctSpedi + α4 SchPctFreei + ui
bi = β0 + β1 Charteri + β2 urbani + β3 SchPctSpedi + vi

where we find the effect of charter schools on 2008 test scores after adjusting
for urban or rural location, percentage of special education students, and
percentage of students that receive free or reduced lunch, and the effect of
charter schools on yearly change between 2008 and 2010 after adjusting for
urban or rural location and percentage of special education students. We can
use AIC and BIC criteria to compare Model F with Model D, since the two
models are not nested. By both criteria, Model F is significantly better than
Model D: AIC of 9885 vs. 9988, and BIC of 9956 vs. 10043. Based on the R
output below, we offer interpretations for estimates of model fixed effects:
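The output itself is not reproduced here; a sketch of a call that could produce it (with illustrative names, where urban is a 0/1 indicator for urban location):

# Sketch: Model F fit with lmer().
library(lme4)
model.f <- lmer(MathAvgScore ~ charter + urban + SchPctSped + SchPctFree +
                  year08 + charter:year08 + urban:year08 +
                  SchPctSped:year08 + (year08 | schoolid),
                data = chart.long)
summary(model.f)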
• α̂0 = 661.0. The estimated mean math test score for 2008 is 661.0 for public
schools in rural areas with no students qualifying for special education or
free and reduced lunch.
• α̂1 = −3.22. Charter schools have an estimated mean math test score in 2008
which is 3.22 points lower than non-charter public schools, after controlling
for urban or rural location, percent special education, and percent free and
reduced lunch.
• α̂2 = −1.11. Schools in urban areas have an estimated mean math score in
2008 which is 1.11 points lower than schools in rural areas, after controlling
for school type, percent special education, and percent free and reduced
lunch.
• α̂3 = −0.118. A 10% increase in special education students at a school is
associated with a 1.18 point decrease in estimated mean math score for 2008,
after controlling for school type, urban or rural location, and percent free
and reduced lunch.
• α̂4 = −0.153. A 10% increase in free and reduced lunch students at a school
is associated with a 1.53 point decrease in estimated mean math score for
2008, after controlling for school type, urban or rural location, and percent
special education.
• β̂0 = 2.14. Public non-charter schools in rural areas with no students qual-
ifying for special education have an estimated increase in mean math test
score of 2.14 points per year over the three-year observation period, after
controlling for percent of students receiving free and reduced lunch.
• β̂1 = 1.03. Charter schools have an estimated mean annual increase in math
score that is 1.03 points higher than public non-charter schools over the
three-year observation period, after controlling for urban or rural location,
percent special education, and percent free and reduced lunch.
• β̂2 = −0.53. Schools in urban areas have an estimated mean annual increase
in math score that is 0.53 points lower than schools from rural areas over
the three-year observation period, after controlling for school type, percent
special education, and percent free and reduced lunch.
• β̂3 = −0.047. A 10% increase in special education students at a school is
associated with an estimated mean annual increase in math score that is
0.47 points lower over the three-year observation period, after controlling for
school type, urban or rural location, and percent free and reduced lunch.
From this model, we again see that 2008 sixth grade math test scores from
charter schools were significantly lower than similar scores from public non-
charter schools, after controlling for school location and demographics. However,
charter schools showed significantly greater improvement between 2008 and
2010 compared to public non-charter schools, although charter school test
scores were still lower than public school scores in 2010, on average. We also
tested several interactions between Level Two covariates and charter schools
and found none to be significant, indicating that the 2008 gap between charter
schools and public non-charter schools was consistent across demographic
subgroups. The faster improvement between 2008 and 2010 for charter schools
was also consistent across demographic subgroups (found by testing three-
way interactions). Controlling for school location and demographic variables
provided more reliable and nuanced estimates of the effects of charter schools,
while also providing interesting insights. For example, schools in rural areas
not only had higher test scores than schools in urban areas in 2008, but the
gap grew larger over the study period given fixed levels of percent special
education, percent free and reduced lunch, and school type. In addition, schools
with higher levels of poverty lagged behind other schools and showed no signs
of closing the gap, and schools with higher levels of special education students
had both lower test scores in 2008 and slower rates of improvement during the
study period, again given fixed levels of other covariates.
As we demonstrated in this case study, applying multilevel methods to two-level
longitudinal data yields valuable insights about our original research questions
while properly accounting for the structure of the data.
When testing random effects at the boundary (such as σv² = 0) or those with restricted ranges (such as ρuv = 0), using a chi-square distribution to conduct a likelihood ratio test is not appropriate. In fact, this will produce a conservative test, with p-values that are too large, so that null hypotheses are not rejected as often as they should be (Raudenbush and Bryk [2002], Singer and Willett [2003], Faraway [2005]). For example, we
should suspect that the p-value (.8447) produced by the likelihood ratio test
comparing Models F and F0 is too large, that the real probability of getting
a likelihood ratio test statistic of 0.3376 or greater when Model F0 is true is
smaller than .8447.
• Fit Model F0 (the null model) to obtain estimated fixed effects and variance
components (this is the “parametric” part.)
• Use the estimated fixed effects and variance components from the null model
to regenerate a new set of math test scores with the same sample size
(n = 1733) and associated covariates for each observation as the original data
(this is the “bootstrap” part.)
• Fit both Model F0 (the reduced model) and Model F (the full model) to the
new data
• Compute a likelihood ratio statistic comparing Models F0 and F
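A sketch of this procedure, assuming model.f and model.f0 are fitted lmer objects for the full and reduced models (names illustrative):

# Sketch: parametric bootstrap of the likelihood ratio statistic.
library(lme4)
set.seed(33)
nrep <- 1000
lrt.stat <- numeric(nrep)
for (b in 1:nrep) {
  y.star <- simulate(model.f0)[[1]]    # regenerate responses under F0
  f0.star <- refit(model.f0, y.star)   # refit reduced model
  f.star <- refit(model.f, y.star)     # refit full model
  lrt.stat[b] <- as.numeric(2 * (logLik(f.star) - logLik(f0.star)))
}
# approximate p-value: proportion of bootstrap statistics at least
# as large as the observed likelihood ratio statistic of 0.3376
mean(lrt.stat >= 0.3376)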
Let’s see how new test scores are generated under the parametric bootstrap.
Consider, for instance, i = 1 and j = 1, 2, 3; that is, consider test scores for
School #1 (Rippleside Elementary) across all three years (2008, 2009, and
2010). Table 9.4 shows the original data for Rippleside Elementary.
Level Two
One way to see the data generation process under the null model (Model F0)
is to start with Level Two and work backwards to Level One. Recall that our
Level Two models for ai and bi , the true intercept and slope for school i, in
Model F0 are:

ai = α0 + α1 Charteri + α2 urbani + α3 SchPctSpedi + α4 SchPctFreei + ui
bi = β0 + β1 Charteri + β2 urbani + β3 SchPctSpedi

That is, Model F0 is Model F with the error term on the slope, vi, removed.
All the α and β terms will be fixed at their estimated values, so the one term
that will change for each bootstrapped data set is ui . As we obtain a numeric
value for ui for each school, we will fix the subscript. For example, if ui is set
to -5.92 for School #1, then we would denote this by u1 = −5.92. Similarly, in
the context of Model F0, a1 represents the 2008 math test score for School
#1, where u1 quantifies how School #1’s 2008 score differs from the average
2008 score across all schools with the same attributes: charter status, urban or
rural location, percent of special education students, and percent of free and
reduced lunch students.
According to Model F0, each ui is sampled from a normal distribution with
mean 0 and standard deviation 4.18. That is, a random component to the
intercept for School #1 (u1 ) would be sampled from a normal distribution
with mean 0 and SD 4.18; say, for instance, u1 = −5.92. We would sample
u2 , ..., u618 in a similar manner for all 618 schools. Then we can produce a
model-based intercept and slope for School #1:

a1 = 661.03 − 0.12(11.8) − 0.15(36.3) + u1 = 654.2 − 5.92 = 648.3
b1 = 2.14 − 0.046(11.8) = 1.60
Notice a couple of features of the above derivations. First, all of the coefficients
from the above equations (α0 = 661.03, α1 = −3.22, etc.) come from the
estimated fixed effects from Model F0. Second, “public non-charter” is the
reference level for charter and “rural” is the reference level for urban, so
both of those predictors are 0 for Rippleside Elementary. Third, the mean
intercept (2008 test scores) for schools like Rippleside that are rural and public
non-charter, with 11.8% special education students and 36.3% free and reduced
lunch students, is 661.03 - 0.12(11.8) - 0.15(36.3) = 654.2. The mean yearly
improvement in test scores for rural, public non-charter schools with 11.8%
special education students is then 1.60 points per year (2.14 - .046*11.8). School
#1 (Rippleside) therefore has a 2008 test score that is 5.92 points below the
mean for all similar schools, but every such school is assumed to have the
same improvement rate in test scores of 1.60 points per year because of our
assumption that there is no school-to-school variability in yearly rate of change
(i.e., vi = 0).
Level One
We next proceed to Level One, where the scores from Rippleside are modeled as a linear function of year (a1 + b1 Year08ij = 648.3 + 1.60 Year08ij) with a normally distributed residual ε1k at each time point k. Three residuals (one for each year) are sampled
independently from a normal distribution with mean 0 and standard deviation
2.97 – the standard deviation again coming from parameter estimates from
fitting Model F0 to the actual data. Suppose we obtain residuals of ε11 = −3.11, ε12 = 1.19, and ε13 = 2.41. In that case, our parametrically generated data for Rippleside Elementary (School #1) would look like:

Y11 = 648.3 + 1.60(0) − 3.11 = 645.2
Y12 = 648.3 + 1.60(1) + 1.19 = 651.1
Y13 = 648.3 + 1.60(2) + 2.41 = 653.9
Once an entire set of simulated scores for every school and year has been generated based on Model F0, two models are fit to this data:
• Model F0 – the correct (null) model that was actually used to generate the
responses
• Model F – the incorrect (full) model that contains two extra variance components – σv² and σuv – that were not actually used when generating the responses
However, we are really only interested in saving the likelihood ratio test statistic from this bootstrapped sample: 2 ∗ [(−4932) − (−4935)] = 4.581, where the log-likelihoods shown are rounded but the statistic is computed from unrounded values.
By generating (“bootstrapping”) many sets of responses based on estimated
parameters from Model F0 and calculating many likelihood ratio test statistics,
we can observe how this test statistic behaves under the null hypothesis of
σv² = σuv = 0, rather than making the (dubious) assumption that its behavior
is described by a chi-square distribution with 2 degrees of freedom. Figure 9.20
illustrates the null distribution of the likelihood ratio test statistic derived by
the parametric bootstrap procedure as compared to a chi-square distribution.
A p-value for comparing our full and reduced models can be approximated
by finding the proportion of likelihood ratio test statistics generated under
the null model which exceed our observed likelihood ratio test (0.3376). The
parametric bootstrap provides a more reliable p-value in this case (.578 from
table below); a chi-square distribution puts too much mass in the tail and not
enough near 0, leading to overestimation of the p-value. Based on this test, we
would still choose our simpler Model F0.
FIGURE 9.20: Null distribution of the likelihood ratio test statistic derived by the parametric bootstrap, compared to a chi-square distribution with 2 degrees of freedom.
Another way to examine whether we should retain the reduced model or reject it in favor of the larger model is to generate parametric bootstrap samples and then use those samples to produce 95% confidence intervals for both ρuv and σv.
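A sketch of a call that could produce intervals like those shown below (model.f is an illustrative name for the fitted full model):

# Sketch: parametric bootstrap confidence intervals via lme4.
confint(model.f, method = "boot", oldNames = FALSE)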
## 2.5 % 97.5 %
## sd_(Intercept)|schoolid 3.801826 4.50866
## cor_year08.(Intercept)|schoolid -1.000000 1.00000
## sd_year08|schoolid 0.009203 0.91060
## sigma 2.779393 3.07776
## (Intercept) 660.071996 662.04728
## charter -4.588611 -2.07372
## urban -2.031600 -0.29152
## SchPctFree -0.169426 -0.13738
## SchPctSped -0.156065 -0.07548
## year08 1.722458 2.56106
## charter:year08 0.449139 1.65928
## urban:year08 -0.905941 -0.17156
## SchPctSped:year08 -0.066985 -0.02617
From the output above, the 95% bootstrapped confidence interval for ρuv
(-1, 1) contains 0, and the interval for σv (0.0092, 0.9106) nearly contains 0,
providing further evidence that the larger model is not needed.
Part of our motivation for framing our model for multilevel data was to account
for the correlation among observations made on the same school (the Level
Two observational unit). Our two-level model, through error terms on both
Level One and Level Two variables, actually implies a specific within-school
covariance structure among observations, yet we have not (until now) focused
on this imposed structure. For example:
• What does our two-level model say about the relative variability of 2008 and
2010 scores from the same school?
• What does it say about the correlation between 2008 and 2009 scores from
the same school?
In this section, we examine the covariance structure implied by our standard two-level model and evaluate whether it is satisfactory for the data at hand. Then, in the succeeding optional section,
we provide derivations of the imposed within-school covariance structure for
our standard two-level model using results from probability theory.
For School i, the covariance structure for the three time points has general
form:
            [ Var(Yi1)        Cov(Yi1, Yi2)   Cov(Yi1, Yi3) ]
Cov(Yi) =   [ Cov(Yi1, Yi2)   Var(Yi2)        Cov(Yi2, Yi3) ]
            [ Cov(Yi1, Yi3)   Cov(Yi2, Yi3)   Var(Yi3)      ]
where, for instance, V ar(Yi1 ) is the variability in 2008 test scores (time j = 1),
Cov(Yi1 , Yi2 ) is the covariance between 2008 and 2009 test scores (times j = 1
and j = 2), etc. Since covariance measures the tendency of two variables to
move together, we expect positive values for all three covariance terms in
Cov(Yi ), since schools with relatively high test scores in 2008 are likely to
also have relatively high test scores in 2009 or 2010. The correlation between
two variables then scales covariance terms to values between -1 and 1, so by
the same rationale, we expect correlation coefficients between two years to
be near 1. If observations within school were independent—that is, knowing
a school had relatively high scores in 2008 tells nothing about whether that
school will have relatively high scores in 2009 or 2010—then we would expect
covariance and correlation values near 0.
It is important to notice that the error structure at Level Two is not the same
as the within-school covariance structure among observations. That is, the
relationship between ui and vi from the Level Two equations is not the same
as the relationship between test scores from different years at the same school
(e.g., the relationship between Yi1 and Yi2 ). In other words,
            [ ui ]        ( [ 0 ]   [ σu²   σuv ] )
Cov(Yi) ≠   [ vi ]  ∼  N  ( [ 0 ] , [ σuv   σv² ] ).
Yet, the error structure and the covariance structure are connected to each
other, as we will now explore.
Using results from probability theory (see Section 9.7.5), we can show that:

Var(Yij) = σu² + tij² σv² + 2 tij σuv + σ²
Cov(Yij, Yik) = σu² + tij tik σv² + (tij + tik) σuv

for all i, where our time variable (year08) has values ti1 = 0, ti2 = 1, and
ti3 = 2 for every School i. Intuitively, these formulas are sensible. For instance, Var(Yi1), the uncertainty (variability) around a school's score in 2008, increases as the uncertainty in intercepts and slopes increases, as the uncertainty around that school's linear time trend increases, and as the covariance between intercept and slope residuals increases (since if one is off, the other is likely off as well). Also, Cov(Yi1, Yi2), the covariance between 2008 and 2009 scores, does not depend on Level One error. Thus, in the 3-by-3 within-school covariance structure of the charter schools case study, our standard two-level model determines all 6 covariance matrix elements through the estimation of four parameters (σu², σuv, σv², σ²) and the imposition of a specific structure related
to time.
To obtain estimated variances for individual observations and covariances
between two time points from the same school, we can simply plug estimated
variance components from our two-level model along with time points from
our data collection into the equations above. For instance, in Section 9.6.1, we obtained the following estimates of variance components: σ̂² = 8.784, σ̂u² = 35.832, σ̂v² = 0.131, and σ̂uv = ρ̂ σ̂u σ̂v = 1.907. Therefore, our estimated within-school variances for the three time points would be:

V̂ar(Yi1) = 35.832 + (0)²(0.131) + 2(0)(1.907) + 8.784 = 44.62
V̂ar(Yi2) = 35.832 + (1)²(0.131) + 2(1)(1.907) + 8.784 = 48.56
V̂ar(Yi3) = 35.832 + (2)²(0.131) + 2(2)(1.907) + 8.784 = 52.77

and our estimated within-school covariances would be:

Ĉov(Yi1, Yi2) = 35.832 + (0)(1)(0.131) + (0 + 1)(1.907) = 37.74
Ĉov(Yi1, Yi3) = 35.832 + (0)(2)(0.131) + (0 + 2)(1.907) = 39.65
Ĉov(Yi2, Yi3) = 35.832 + (1)(2)(0.131) + (1 + 2)(1.907) = 41.81
In fact, these values will be identical for every School i, since scores were
assessed at the same three time points. Thus, we will drop the subscript i
moving forward.
Written in matrix form, our two-level model implicitly imposes this estimated
covariance structure on within-school observations for any specific School i:
            [ 44.62                 ]
Ĉov(Y) =    [ 37.74   48.56         ]
            [ 39.65   41.81   52.77 ]

and this estimated covariance matrix can be converted into an estimated within-school correlation matrix using the identity Corr(Y1, Y2) = Cov(Y1, Y2) / √(Var(Y1) Var(Y2)):

             [ 1                ]
Ĉorr(Y) =    [ .811   1         ]
             [ .817   .826   1  ]
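These matrices can be verified directly from the estimated variance components; the sketch below uses only base R:

# Sketch: implied within-school covariance and correlation
# matrices from the Section 9.6.1 variance component estimates.
sig2  <- 8.784    # sigma-squared (Level One variance)
sig2u <- 35.832   # variance of intercept residuals u_i
sig2v <- 0.131    # variance of slope residuals v_i
siguv <- 1.907    # covariance of u_i and v_i
t <- c(0, 1, 2)   # year08 values
V <- sig2u + outer(t, t) * sig2v + outer(t, t, "+") * siguv +
  diag(sig2, 3)
V           # estimated covariance matrix
cov2cor(V)  # estimated correlation matrix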
A couple of features of these two matrices can be highlighted that offer insights
into implications of our standard two-level model on the covariance structure
among observations at Level One from the same school:
• Many longitudinal data sets show higher correlation for observations that are
closer in time. In this case, we see that correlation is very consistent between
all pairs of observations from the same school; the correlation between
test scores separated by two years (.817) is approximately the same as the
correlation between test scores separated by a single year (.811 for 2008 and
2009 scores; .826 for 2009 and 2010 scores).
• Many longitudinal data sets show similar variability at all time points. In this
case, the variability in 2010 (52.77) is about 18% greater than the variability
in 2008 (44.62), while the variability in 2009 is in between (48.56).
• Our two-level model actually imposes a quadratic structure on the relationship between variance and time; note that the equation for Var(Yj) contains both tj² and tj. The variance is therefore minimized at t = −σuv/σv². With the charter school data, this minimum occurs at t = −1.907/0.131 = −14.6, i.e., roughly 15 years before 2008, so the estimated within-school variance increases across all three observed time points.
Careful modeling and estimation of the Level One covariance matrix is especially
important and valuable for longitudinal data (with time at Level One) and as
we’ve seen, our standard two-level model has several nice properties for this
purpose. The standard model is also often appropriate for non-longitudinal
multilevel models as discussed in Chapter 8, although we must remain aware
of the covariance structure implicitly imposed. In other words, the ideas in
this section generalize even if time isn’t a Level One covariate.
As an example, in Case Study 8.2 where Level One observational units are
musical performances rather than time points, the standard model implies an analogous within-musician covariance structure for Musician i in Model C, which uses an indicator for large ensembles as a Level One predictor.
Note that, in the Music Performance Anxiety case study, each subject will
have a unique Level One variance-covariance structure, since each subject has
a different number of performances and a different mix of large ensemble and
small ensemble or solo performances.
In the charter school example, as is often true in multilevel models, the choice of
covariance matrix does not greatly affect estimates of fixed effects. The choice
of covariance structure could potentially impact the standard errors of fixed
effects, and thus the associated test statistics, but the impact appears minimal
in this particular case study. In fact, the standard model typically works very
well. So is it worth the time and effort to accurately model the covariance
structure? If primary interest is in inference regarding fixed effects, and if
the standard errors for the fixed effects appear robust to choice of covariance
structure, then extensive time spent modeling the covariance structure is not
advised. However, if researchers are interested in predicted random effects
and estimated variance components in addition to estimated fixed effects,
then choice of covariance structure can make a big difference. For instance,
if researchers are interested in drawing conclusions about particular schools
rather than charter schools in general, they may more carefully model the
covariance structure in this study.
Applying these identities to Model C, we first see that we can ignore all fixed effects, since they do not contribute to the variability. Thus,

Var(Yij) = Var(ui + vi tij + εij)
         = σu² + tij² σv² + 2 tij σuv + σ²
         = σu² + tj² σv² + 2 tj σuv + σ²,

where the last line reflects the fact that observations were taken at the same time points for all schools. We can derive the covariance terms in a similar fashion:

Cov(Yij, Yik) = Cov(ui + vi tij + εij, ui + vi tik + εik)
             = σu² + tij tik σv² + (tij + tik) σuv
9.9 Exercises
2. Describe the difference between the wide and long formats for longi-
tudinal data in this study.
3. Describe scenarios or research questions in which a lattice plot would
be more informative than a spaghetti plot, and other scenarios or
research questions in which a spaghetti plot would be preferable to
a lattice plot.
4. Walker-Barnes and Mason summarize their analytic approach in the
following way, where HLM = hierarchical linear models, a synonym
for multilevel models:
The first series [of analyses] tested whether there was overall change
and/or significant individual variability in gang [activity] over time,
regardless of parenting behavior, peer behavior, or ethnic and cultural
heritage. Second, given the well documented relation between peer
and adolescent behavior . . . HLM analyses were conducted examining
the effect of peer gang [activity] on [initial gang activity and] changes
in gang [activity] over time. Finally, four pairs of analyses were
conducted examining the role of each of the four parenting variables
on [initial gang activity and] changes in gang [activity].
The last series of analyses controlled for peer gang activity and
ethnic and cultural heritage, in addition to examining interactions
between parenting and ethnic and cultural heritage.
Although the authors examined four parenting behaviors—
behavioral control, lax control, psychological control, and parental
warmth—they did so one at a time, using four separate multilevel
models. Based on their description, write out a sample model from
each of the three steps in the series. For each model, (a) write
out the two-level model for predicting gang activity, (b) write out
the corresponding composite model, and (c) determine how many
model parameters (fixed effects and variance components) must be
estimated.
5. Table 9.5 shows a portion of Table 2: Results of Hierarchical Linear
Modeling Analyses Modeling Gang Involvement from Walker-Barnes
and Mason [2001]. Provide interpretations of significant coefficients
in context.
Predictor                                          Coefficient   SE
Intercept (initial status)
  Base (intercept for predicting intercept term)   -.219         .160
  Peer behavior                                     .252**       .026
  Black ethnicity                                   .671*        .289
  White/Other ethnicity                             .149         .252
  Parenting                                         .076         .050
  Black ethnicity X parenting                      -.161+        .088
  White/Other ethnicity X parenting                -.026         .082
Slope (change)
  Base (intercept for predicting slope term)        .028         .030
  Peer behavior                                    -.011*        .005
  Black ethnicity                                  -.132*        .054
  White/Other ethnicity                            -.059         .046
  Parenting                                        -.015+        .009
  Black ethnicity X parenting                      -.048**       .017
  White/Other ethnicity X parenting                 .016         .015

These columns focus on the parenting behavior of psychological control. The table reports values for coefficients in the final model with all variables entered. * p<.05; ** p<.01; + p<.10
12. In Section 9.5.2, why don’t we examine the pseudo R-squared value
for Level Two?
13. If we have test score data from 2001-2010, explain how we’d create
new variables to fit a piecewise model.
14. In Section 9.6.2, could we have used percent free and reduced lunch
as a Level One covariate rather than 2010 percent free and reduced
lunch as a Level Two covariate? If so, explain how interpretations
would have changed. What if we had used average percent free and
reduced lunch over all three years or 2008 percent free and reduced lunch instead of 2010 percent free and reduced lunch? How would
this have changed the interpretation of this term?
15. In Section 9.6.2, why do we look at a 10% increase in the percentage
of students receiving free and reduced lunch when interpreting α̂2 ?
16. In Section 9.6.3, if the gap in 2008 math scores between charter and
non-charter schools differed for schools of different poverty levels (as
measured by percent free and reduced lunch), how would the final
model have differed?
17. Explain in your own words why “the error structure at Level Two
is not the same as the within-school covariance structure among
observations”.
18. Here is the estimated unstructured covariance matrix for Model C:
            [ 41.87                 ]
Cov(Yi) =   [ 36.46   48.18         ]
            [ 35.20   39.84   45.77 ]
Explain why this matrix cannot represent an estimated covariance
matrix with a compound symmetry, autoregressive, or Toeplitz
structure. Also explain why it cannot represent our standard two-
level model.
Data from the BtheB study can be found in BtheB.csv; it is also part
of the HSAUR package [Everitt and Hothorn, 2006] in R. Examination
of the data reveals the following variables:
• Extend the standard multilevel model to cases with more than two levels.
• Apply exploratory data analysis techniques specific to data from more than
two levels.
• Formulate multilevel models including the variance-covariance structure.
• Build and understand a taxonomy of models for data with more than two
levels.
• Interpret parameters in models with more than two levels.
• Develop strategies for handling an exploding number of parameters in multi-
level models.
• Recognize when a fitted model has encountered boundary constraints and
understand strategies for moving forward.
• Apply a parametric bootstrap test of significance to appropriate situations
with more than two levels.
The data we’ll examine was collected through an experiment run using a
3x2x2 factorial design, with 3 levels of soil type (remnant, cultivated, and
restored), 2 levels of sterilization (yes or no), and 2 levels of species (leadplant
and coneflower). Each of the 12 treatments (unique combinations of factor
levels) was replicated in 6 pots, for a total of 72 pots. Six seeds were planted
in each pot (although a few pots had 7 or 8 seeds), and initially student
researchers recorded days to germination (defined as when two leaves are
visible), if germination occurred. In addition, the height of each germinated
plant (in mm) was measured at 13, 18, 23, and 28 days after planting. The
study design is illustrated in Figure 10.1.
Data for Case Study 10.2 in seeds2.csv contains the following variables (see Table 10.1): pot and plant identifiers, the pot-level factors soil, sterile, and species, a germination indicator germin, and plant heights in mm recorded 13, 18, 23, and 28 days after planting (hgt13, hgt18, hgt23, hgt28).
TABLE 10.1: A snapshot of data (Plants 231-246) from the Seed Germination
case study in wide format.
pot plant soil sterile species germin hgt13 hgt18 hgt23 hgt28
135 23 231 CULT N C Y 1.1 1.4 1.6 1.7
136 23 232 CULT N C Y 1.3 2.2 2.5 2.7
137 23 233 CULT N C Y 0.5 1.4 2.0 2.3
138 23 234 CULT N C Y 0.3 0.4 1.2 1.7
139 23 235 CULT N C Y 0.5 0.5 0.8 2.0
140 23 236 CULT N C Y 0.1 NA NA NA
141 24 241 STP Y L Y 1.8 2.6 3.9 4.2
142 24 242 STP Y L Y 1.3 1.7 2.8 3.7
143 24 243 STP Y L Y 1.5 1.6 3.9 3.9
144 24 244 STP Y L Y NA 1.0 2.3 3.8
145 24 245 STP Y L N NA NA NA NA
146 24 246 STP Y L N NA NA NA NA
This data is stored in wide format, with one row per plant (see 12 sample
plants in Table 10.1). As we have done in previous multilevel analyses, we
will convert to long format (one observation per plant-time combination) after
examining the missing data pattern and removing any plants with no growth
data. In this case, we are almost assuredly losing information by removing
plants with no height data at all four time points, since these plants did not
germinate, and there may well be differences between species, soil type, and
sterilization with respect to germination rates. We will handle this possibility
by analyzing germination rates separately (see Chapter 11); the analysis in
this chapter will focus on effects of species, soil type, and sterilization on initial
growth and growth rate among plants that germinate.
Although the experimental design called for 72 ∗ 6 = 432 plants, the wide data
set has 437 plants because a few pots had more than six plants (likely because
two of the microscopically small seeds stuck together when planted). Of those
TABLE 10.2: A snapshot of data (Plants 236-242) from the Seed Germination
case study in long format.
437 plants, 154 had no height data (did not germinate by the 28th day) and
were removed from analysis (for example, see rows 145-146 in Table 10.1). A
total of 248 plants had complete height data (e.g., rows 135-139 and 141-143),
13 germinated later than the 13th day but had complete heights once they
germinated (e.g., row 144), and 22 germinated and had measurable height on
the 13th day but died before the 28th day (e.g., row 140). Ultimately, the long
data set contains 1132 unique observations where plant heights were recorded;
representation of plants 236-242 in the long data set can be seen in Table 10.2.
Notice the three-level structure of this data. Treatments (levels of the three
experimental factors) were assigned at the pot level, then multiple plants were
grown in each pot, and multiple measurements were taken over time for each
plant. Our multilevel analysis must therefore account for pot-to-pot variability
in height measurements (which could result from factor effects), plant-to-plant
variability in height within a single pot, and variability over time in height for
individual plants. In order to fit such a three-level model, we must extend the
two-level model which we have used thus far.
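As a preview, a minimal sketch of an unconditional three-level growth model in lmer() syntax might look like the following (data frame and variable names are illustrative):

# Sketch: three-level growth model, plants nested within pots.
library(lme4)
model.3lvl <- lmer(hgt ~ time13 + (time13 | pot/plant),
                   data = leaddata)
summary(model.3lvl)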
We start by taking an initial look at the effect of Level Three covariates (factors
applied at the pot level: species, soil type, and sterilization) on plant height,
pooling observations across pot, across plant, and across time of measurement
within plant. First, we observe that the initial balance which existed after
randomization of pot to treatment no longer holds. After removing plants
that did not germinate (and therefore had no height data), more height
measurements exist for coneflowers (n=704, compared to 428 for leadplants),
soil from restored prairies (n=524, compared to 288 for cultivated land and
320 for remnant prairies), and unsterilized soil (n=612, compared to 520 for
sterilized soil). This imbalance indicates possible factor effects on germination
rate; we will take up those hypotheses in Chapter 11. In this chapter, we
will focus on the effects of species, soil type, and sterilization on the growth
patterns of plants that germinate.
Because we suspect that height measurements over time for a single plant
are highly correlated, while height measurements from different plants from
the same pot are less correlated, we calculate mean height per plant (over
all available time points) before generating exploratory plots investigating
Level Three factors. Figure 10.2 then examines the effects of soil type and
sterilization separately by species. Sterilization seems to have a bigger benefit
for coneflowers, while soil from remnant prairies seems to lead to smaller
leadplants and taller coneflowers.
FIGURE 10.2: Plant height comparisons of (a) soil type and (b) steriliza-
tion within species. Each plant is represented by the mean height over all
measurements at all time points for that plant.
We also use spaghetti plots to examine time trends within species to see (a)
if it is reasonable to assume linear growth between Day 13 and Day 28 after
planting, and (b) if initial height and rate of growth is similar in the two
species. Figure 10.3 illustrates differences between species. While both species
have similar average heights 13 days after planting, coneflowers appear to
have faster early growth which slows later, while leadplants have a more linear
growth rate which culminates in greater average heights 28 days after planting.
Coneflowers also appear to have greater variability in initial height and growth
rate, although there are more coneflowers with height data.
Exploratory analyses such as these confirm the suspicions of biology researchers
that leadplants and coneflowers should be analyzed separately. Because of
FIGURE 10.3: Spaghetti plot by species with loess fit. Each line represents
one plant.
biological differences, it is expected that these two species will show different
growth patterns and respond differently to treatments such as fertilization.
Coneflowers are members of the aster family, growing up to 4 feet tall with their
distinctive gray seed heads and drooping yellow petals. Leadplants, on the other
hand, are members of the bean family, with purple flowers, a height of 1 to 3
feet, and compound grayish green leaves which look to be dusted with white
lead. Leadplants have deep root systems and are symbiotic N-fixers, which
means they might experience stifled growth in sterilized soil compared with
other species. For the remainder of this chapter, we will focus on leadplants
and how their growth patterns are affected by soil type and sterilization. You
will have a chance to analyze coneflower data later in the Exercises section.
Lattice plots, illustrating several observational units simultaneously, each with
fitted lines where appropriate, are also valuable to examine during the ex-
ploratory analysis phase. Figure 10.4 shows height over time for 24 randomly
selected leadplants that germinated in this study, with a fitted linear regression line. Linearity appears reasonable in most cases, although there is some
variability in the intercepts and a good deal of variability in the slopes of the
fitted lines. These intercepts and slopes by plant, of course, will be potential
parameters in a multilevel model which we will fit to this data. Given the
three-level nature of this data, it is also useful to examine a spaghetti plot by
pot (Figure 10.5). While linearity appears to reasonably model the average
trend over time within pot, we see differences in the plant-to-plant variability
within pot, but some consistency in intercept and slope from pot to pot.
Spaghetti plots can also be an effective tool for examining the potential effects
of soil type and sterilization on growth patterns of leadplants. Figure 10.6
and Figure 10.7 illustrate how the growth patterns of leadplants depend on
soil type and sterilization. In general, we observe slower growth in soil from
remnant prairies and soil that has not been sterilized.
We can further explore the variability in linear growth among plants and
among pots by fitting regression lines and examining the estimated intercepts
FIGURE 10.4: Lattice plot of height over time for 24 randomly selected germinated leadplants, with fitted regression lines.
FIGURE 10.5: Spaghetti plot for leadplants by pot with loess fit.
FIGURE 10.6: Spaghetti plot for leadplants by soil type with loess fit.
FIGURE 10.7: Spaghetti plot for leadplants by sterilization with loess fit.
and slopes, as well as the corresponding R-squared values. Figures 10.8 and 10.9
provide just such an analysis, where Figure 10.8 shows results of fitting lines
by plant, and Figure 10.9 shows results of fitting lines by pot. Certain caveats
accompany these summaries. In the case of fitted lines by plant, each plant is
given equal weight regardless of the number of observations (2-4) for a given
plant, and in the case of fitted lines by pot, a line is estimated by simply
pooling all observations from a given pot, ignoring the plant from which the
observations came, and equally weighting pots regardless of how many plants
germinated and survived to Day 28. Nevertheless, the summaries of fitted lines
provide useful information. When fitting regression lines by plant, we see a
mean intercept of 1.52 (SD=0.66), indicating an estimated average height at
13 days of 1.5 mm, and a mean slope of 0.114 mm per day of growth from Days
13 to 28 (SD=0.059). Most R-squared values were strong (e.g., 84% were above
0.8). Summaries of fitted regression lines by pot show similar mean intercepts
(1.50) and slopes (0.107), but somewhat less variability pot-to-pot than we
observed plant-to-plant (SD=0.46 for intercepts and SD=0.050 for slopes).
Another way to examine variability due to plant vs. variability due to pot is
through summary statistics. Plant-to-plant variability can be estimated by
averaging standard deviations from each pot (.489 for intercepts and .039 for
slopes), while pot-to-pot variability can be estimated by finding the standard
deviation of average intercept (.478) or slope (.051) within pot. Based on these
rough measurements, variability due to plants and pots is comparable.
Fitted lines by plant and pot are modeled using a centered time variable
(time13), adjusted so that the first day of height measurements (13 days
after planting) corresponds to time13=0. This centering has two primary
advantages. First, the estimated intercept becomes more interpretable. Rather
than representing height on the day of planting (which should be 0 mm, but
which represents a hefty extrapolation from our observed range of Days 13 to 28), the intercept now represents height on Day 13. Second, the intercept and slope are much less correlated (r = -0.16) than when uncentered time is used, which improves the stability of future models.

FIGURE 10.8: Histograms of (a) intercepts, (b) slopes, and (c) R-squared values for linear fits across all leadplants.

FIGURE 10.9: Histograms of (a) intercepts, (b) slopes, and (c) R-squared values for linear fits across all pots with leadplants.
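The centering and plant-by-plant fits described above can be sketched as follows, again assuming a long-format leaddata with columns hgt, time, and plant (names hypothetical):

library(dplyr)
# Center time so that the first measurement day (Day 13) is time13 = 0
leaddata <- mutate(leaddata, time13 = time - 13)

# Fit a separate regression line to each plant and collect intercepts,
# slopes, and R-squared values
by_plant <- leaddata %>%
  group_by(plant) %>%
  group_modify(~ {
    fit <- lm(hgt ~ time13, data = .x)
    tibble(int = coef(fit)[1], slope = coef(fit)[2],
           rsq = summary(fit)$r.squared)
  }) %>%
  ungroup()

# Means and SDs of intercepts and slopes, as summarized in the text
summarize(by_plant, mean(int), sd(int), mean(slope), sd(slope))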
Fitted intercepts and slopes by plant can be used for an additional exploratory
examination of factor effects to complement those from the earlier spaghetti
plots. Figure 10.10 complements Figure 10.3, again showing differences be-
tween species—coneflowers tend to start smaller and have slower growth rates,
although they have much more variability in growth patterns than leadplants.
Returning to our focus on leadplants, Figure 10.11 shows that plants grown
in soil from cultivated fields tend to be taller at Day 13, and plants grown
in soil from remnant prairies tend to grow more slowly than plants grown in
other soil types. Figure 10.12 shows the strong tendency for plants grown in
sterilized soil to grow faster than plants grown in non-sterilized soil. We will
soon see if our fitted multilevel models support these observed trends.
FIGURE 10.10: Boxplots of (a) intercepts and (b) slopes for all plants by
species, based on a linear fit to height data from each plant.
FIGURE 10.11: Boxplots of (a) intercepts and (b) slopes for all leadplants
by soil type, based on a linear fit to height data from each plant.
Since we have time at Level One, any exploratory analysis of Case Study 10.2
should contain an investigation of the variance-covariance structure within
plant. Figure 10.13 shows the potential for an autocorrelation structure in
which the correlation between observations from the same plant diminishes
as the time between measurements increases. Residuals five days apart have correlations ranging from .77 to .91, residuals ten days apart have correlations of .62 and .70, and residuals fifteen days apart have a correlation of .58.
FIGURE 10.12: Boxplots of (a) intercepts and (b) slopes for all leadplants
by sterilization, based on a linear fit to height data from each plant.
FIGURE 10.13: Correlation structure within plant. The upper right contains
correlation coefficients between residuals at pairs of time points, the lower left
contains scatterplots of the residuals at time point pairs, and the diagonal
contains histograms of residuals at each of the four time points.
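A plot in the style of Figure 10.13 can be produced with GGally::ggpairs once residuals are reshaped so that each time point is a column. A sketch, using residuals from a preliminary overall linear fit (data frame and column names assumed as in earlier sketches):

library(dplyr)
library(tidyr)
library(GGally)
# Residuals from an overall fit of height on centered time
plotdata <- filter(leaddata, !is.na(hgt))
plotdata$resid <- resid(lm(hgt ~ time13, data = plotdata))
# One column of residuals per time point, then pairwise plots
resid_wide <- plotdata %>%
  select(plant, time13, resid) %>%
  pivot_wider(names_from = time13, values_from = resid,
              names_prefix = "time13_")
ggpairs(resid_wide[, -1])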
The structure and notation for three-level models will closely resemble the structure and notation for two-level models, just with extra subscripts. Therein lies some of the power of multilevel models—extensions are relatively easy and allow you to control for many sources of variability, obtaining more precise estimates of important parameters. However, the number of variance component parameters to estimate can quickly mushroom as covariates are added at lower levels, as we will see in Section 10.7.

10.4 Initial Models
We once again begin with the unconditional means model, in which there
are no predictors at any level, in order to assess the amount of variation at
each level. Here, Level Three is pot, Level Two is plant within pot, and Level
One is time within plant. Using model formulations at each of the three levels,
the unconditional means three-level model can be expressed as:

• Level One: $Y_{ijk} = a_{ij} + \epsilon_{ijk}$, where $\epsilon_{ijk} \sim N(0, \sigma^2)$
• Level Two: $a_{ij} = a_i + u_{ij}$, where $u_{ij} \sim N(0, \sigma_u^2)$
• Level Three: $a_i = \alpha_0 + \tilde{u}_i$, where $\tilde{u}_i \sim N(0, \sigma_{\tilde{u}}^2)$

Here the heights of plants from different pots are considered independent, but plants from the same pot are correlated, as are measurements at different times from the same plant.
Keeping track of all the model terms, especially with three subscripts, is not a
trivial task, but it’s worth spending time thinking it through. Here is a quick
guide to the meaning of terms found in our three-level model:
• $\epsilon_{ijk}$ describes how far an observed height $Y_{ijk}$ is from the mean height for plant $j$ from pot $i$.
• $u_{ij}$ describes how far the mean height of plant $j$ from pot $i$ is from the mean height of all plants from pot $i$.
• $\tilde{u}_i$ describes how far the mean height of all observations from pot $i$ is from the overall mean height across all pots, plants, and time points. None of the error terms ($\epsilon$, $u$, $\tilde{u}$) are considered model parameters; they simply account for differences between the observed data and expected values under our model.
• $\sigma^2$ is a variance component (random effects model parameter) that describes within-plant variability over time.
• $\sigma_u^2$ is the variance component describing plant-to-plant variability within pot.
• $\sigma_{\tilde{u}}^2$ is the variance component describing pot-to-pot variability.
Fitting this unconditional means model to the seed germination data yields:

• $\hat{\alpha}_0 = 2.39$ = the mean height (in mm) across all time points, plants, and pots.
• $\hat{\sigma}^2 = 0.728$ = the variance over time within plants.
• $\hat{\sigma}_u^2 = 0.278$ = the variance between plants from the same pot.
• $\hat{\sigma}_{\tilde{u}}^2 = 0.049$ = the variance between pots.
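These estimates can be obtained by fitting the unconditional means model with lmer() from lme4. A minimal sketch, assuming the long-format leaddata from earlier sketches with unique plant IDs:

library(lme4)
# Model A: random intercepts for plant and pot
# (use (1 | pot/plant) instead if plant IDs repeat across pots)
model.a <- lmer(hgt ~ 1 + (1 | plant) + (1 | pot), data = leaddata)
summary(model.a)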
The unconditional growth model (Model B) introduces linear growth at Level One, with intercepts and slopes that vary by plant and by pot:

• Level One: $Y_{ijk} = a_{ij} + b_{ij}\,\textrm{time}_{ijk} + \epsilon_{ijk}$
• Level Two:
$$a_{ij} = a_i + u_{ij}$$
$$b_{ij} = b_i + v_{ij}$$
• Level Three:
$$a_i = \alpha_0 + \tilde{u}_i$$
$$b_i = \beta_0 + \tilde{v}_i$$

or as a composite model:

$$Y_{ijk} = [\alpha_0 + \beta_0\,\textrm{time}_{ijk}] + [\tilde{u}_i + u_{ij} + \epsilon_{ijk} + (\tilde{v}_i + v_{ij})\,\textrm{time}_{ijk}]$$
In this model, at Level One the trajectory for plant j from pot i is assumed to
be linear, with intercept aij (height on Day 13) and slope bij (daily growth
rate between Days 13 and 28); the ijk terms capture the deviation between
the true growth trajectory of plant j from pot i and its observed heights. At
Level Two, ai represents the true mean intercept and bi represents the true
mean slope for all plants from pot i, while uij and vij capture the deviation
between plant j’s true growth trajectory and the mean intercept and slope
for pot i. The deviations in intercept and slope at Level Two are allowed to
be correlated through the covariance parameter σuv . Finally, α0 is the true
mean intercept and β0 is the true mean daily growth rate over the entire
population of leadplants, while ũi and ṽi capture the deviation between pot
i’s true overall growth trajectory and the population mean intercept and slope.
Note that between-plant and between-pot variability are both partitioned now
into variability in initial status (σu2 and σũ2 ) and variability in rates of change
(σv2 and σṽ2 ).
Using the composite model specification, the unconditional growth model can
be fit to the seed germination data:
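A sketch of the corresponding lmer() call (this mirrors the model description above; names are the assumed ones from earlier sketches):

# Model B: unconditional growth, with random intercepts and slopes for
# time13 at both the plant and pot levels (9 parameters total)
model.b <- lmer(hgt ~ time13 + (time13 | plant) + (time13 | pot),
                data = leaddata)
summary(model.b)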
From this output, we obtain estimates of our nine model parameters (two fixed
effects and seven variance components):
Typically, with models consisting of three or more levels, the next step after
adding covariates at Level One (such as time) is considering covariates at
Level Two. In the seed germination experiment, however, there are no Level
Two covariates of interest, and the treatments being studied were applied to
pots (Level Three). We are primarily interested in the effects of soil type and sterilization, so Model C adds these Level Three covariates, along with their interactions with time:

$$a_i = \alpha_0 + \alpha_1\,\textrm{strl}_i + \alpha_2\,\textrm{cult}_i + \alpha_3\,\textrm{rem}_i + \tilde{u}_i$$
$$b_i = \beta_0 + \beta_1\,\textrm{strl}_i + \beta_2\,\textrm{cult}_i + \beta_3\,\textrm{rem}_i + \tilde{v}_i$$

where the error terms at Level Three follow the same multivariate normal distribution as in Model B. In our case, after combining fixed effects and random effects, the composite model can be written as:

$$Y_{ijk} = [\alpha_0 + \alpha_1\,\textrm{strl}_i + \alpha_2\,\textrm{cult}_i + \alpha_3\,\textrm{rem}_i + \beta_0\,\textrm{time}_{ijk} + \beta_1\,\textrm{strl}_i\,\textrm{time}_{ijk} + \beta_2\,\textrm{cult}_i\,\textrm{time}_{ijk} + \beta_3\,\textrm{rem}_i\,\textrm{time}_{ijk}] + [\tilde{u}_i + u_{ij} + \epsilon_{ijk} + (\tilde{v}_i + v_{ij})\,\textrm{time}_{ijk}]$$
From the output below, the addition of Level Three covariates in Model C
(cult, rem, strl, and their interactions with time) appears to provide a
significant improvement (likelihood ratio test statistic = 32.2 on 6 df, p < .001)
to the unconditional growth model (Model B).
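A sketch of Model C and its comparison with Model B; the covariate names strl, cult, and rem follow the text, but the exact call is an assumption:

# Model C: add sterilization, soil type, and their interactions with time
model.c <- lmer(hgt ~ time13 + strl + cult + rem + time13:strl +
                  time13:cult + time13:rem +
                  (time13 | plant) + (time13 | pot), data = leaddata)
anova(model.b, model.c)   # LRT on 6 df; anova() refits both with full ML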
Note that the estimated variance components are all very similar to Model C, and the estimated fixed effects and their associated t-statistics are also very similar. To obtain a model free of the boundary constraints encountered when fitting Model C, we can remove the pot-level error term for the growth rate, producing Model C.1. The Level Two equations are unchanged:

$$a_{ij} = a_i + u_{ij}$$
$$b_{ij} = b_i + v_{ij}$$

and the Level Three equations become:

$$a_i = \alpha_0 + \alpha_1\,\textrm{strl}_i + \alpha_2\,\textrm{cult}_i + \alpha_3\,\textrm{rem}_i + \tilde{u}_i$$
$$b_i = \beta_0 + \beta_1\,\textrm{strl}_i + \beta_2\,\textrm{cult}_i + \beta_3\,\textrm{rem}_i$$
Note that there is no longer an error term associated with the model for mean
growth rate bi at the pot level. The growth rate for pot i is assumed to be
fixed, after accounting for soil type and sterilization; all pots with the same
soil type and sterilization are assumed to have the same growth rate. As a
result, our error assumption at Level Three is no longer bivariate normal, but
rather univariate normal: ũi ∼ N (0, σũ2 ). By removing one of our two Level
Three error terms (ṽi ), we effectively removed two parameters: the variance
for ṽi and the correlation between ũi and ṽi . Fixed effects remain similar, as
can be seen in the output below:
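A sketch of Model C.1, with the same fixed effects as Model C but only a random intercept at the pot level:

# Model C.1: pot-level random slope (and its covariance with the
# pot-level intercept) removed
model.c1 <- lmer(hgt ~ time13 + strl + cult + rem + time13:strl +
                   time13:cult + time13:rem +
                   (time13 | plant) + (1 | pot), data = leaddata)
summary(model.c1)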
We now have a more stable model, free of boundary constraints. In fact, we can attempt to determine whether removing the two variance component parameters to form Model C.1 produces a significant reduction in performance.
Based on a likelihood ratio test (see below), we do not have significant evidence
(chi-square test statistic=2.089 on 2 df, p=0.3519) that σṽ2 or σũṽ is non-zero,
so it is advisable to use the simpler Model C.1. However, Section 10.6 describes
why this test may be misleading and prescribes a potentially better approach.
TABLE 10.3: Data from Pot #1 for Plants #11 and #12 (sterilized soil from a restored prairie).

pot  plant  soil  sterile  species  germin  hgt13  hgt18  hgt23  hgt28
  1     11   STP        Y        L       Y    2.3    2.9    4.5    5.1
  1     12   STP        Y        L       Y    1.9    2.0    2.6    3.5
Under the parametric bootstrap, we must simulate data under the null hypoth-
esis many times. Here are the basic steps for running a parametric bootstrap
procedure to compare Model C.1 with Model C (a code sketch implementing these steps follows the list):
• Fit Model C.1 (the null model) to obtain estimated fixed effects and variance
components (this is the “parametric” part).
• Use the estimated fixed effects and variance components from the null model
to regenerate a new set of plant heights with the same sample size (n = 413)
and associated covariates for each observation as the original data (this is
the “bootstrap” part).
• Fit both Model C.1 (the reduced model) and Model C (the full model) to
the new data.
• Compute a likelihood ratio statistic comparing Models C.1 and C.
• Repeat the previous 3 steps many times (e.g., 1000).
• Produce a histogram of likelihood ratio statistics to illustrate its behavior
when the null hypothesis is true.
• Calculate a p-value by finding the proportion of times the bootstrapped test
statistic is greater than our observed test statistic.
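These steps can be implemented with simulate() and refit() from lme4. A sketch, assuming model.c1 and model.c are the fitted models from the earlier sketches (fit with REML = FALSE so that log-likelihoods are comparable):

set.seed(2020)   # arbitrary seed, for reproducibility only
boot_lrt <- replicate(1000, {
  y_star <- simulate(model.c1)[[1]]     # new heights generated under the null
  null_fit <- refit(model.c1, y_star)   # reduced model fit to simulated data
  full_fit <- refit(model.c, y_star)    # full model fit to simulated data
  as.numeric(2 * (logLik(full_fit) - logLik(null_fit)))
})
hist(boot_lrt)            # null distribution of the LRT statistic
mean(boot_lrt > 2.089)    # proportion exceeding the observed statistic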
Let’s see how new plant heights are generated under the parametric bootstrap.
Consider, for instance, i = 1 and j = 1, 2. That is, consider Plants #11 and
#12 as shown in Table 10.3. These plants are found in Pot #1, which was
randomly assigned to contain sterilized soil from a restored prairie (STP):
Level Three
One way to see the data generation process under the null model (Model C.1)
is to start with Level Three and work backwards to Level One. Recall that
our Level Three models for ai and bi , the true intercept and slope from Pot i,
in Model C.1 are:

$$a_i = \alpha_0 + \alpha_1\,\textrm{strl}_i + \alpha_2\,\textrm{cult}_i + \alpha_3\,\textrm{rem}_i + \tilde{u}_i$$
$$b_i = \beta_0 + \beta_1\,\textrm{strl}_i + \beta_2\,\textrm{cult}_i + \beta_3\,\textrm{rem}_i$$
All the α and β terms will be fixed at their estimated values, so the one term
that will change for each bootstrapped data set is ũi . As we obtain a numeric
value for ũi for each pot, we will fix the subscript. For example, if ũi is set
to -.192 for Pot #1, then we would denote this by ũ1 = −.192. Similarly, in
the context of Model C.1, a1 represents the mean height at Day 13 across all
plants in Pot #1, where ũ1 quantifies how Pot #1’s Day 13 height relates to
other pots with the same sterilization and soil type.
According to Model C.1, each $\tilde{u}_i$ is sampled from a normal distribution with mean 0 and standard deviation .240 (note that the standard deviation $\sigma_{\tilde{u}}$ is also fixed at its estimated value from Model C.1, given in Section 10.5). That is,
a random component to the intercept for Pot #1 (ũ1 ) would be sampled from
a normal distribution with mean 0 and SD .240; say, for instance, ũ1 = −.192.
We would sample ũ2 , ..., ũ72 in a similar manner. Then we can produce a
model-based intercept and slope for Pot #1:

$$a_1 = \alpha_0 + \alpha_1 + \tilde{u}_1 = 1.512 - 0.088 - 0.192 = 1.232$$
$$b_1 = \beta_0 + \beta_1 = 0.160$$
Notice a couple of features of the above derivations. First, all of the coefficients
from the above equations (α0 = 1.512, α1 = −.088, etc.) come from the
estimated fixed effects from Model C.1 reported in Section 10.5. Second,
“restored prairie” is the reference level for soil type, so that indicators for
“cultivated land” and “remnant prairie” are both 0. Third, the mean intercept
(Day 13 height) for observations from sterilized restored prairie soil is 1.512 -
0.088 = 1.424 mm across all pots, while the mean daily growth is .160 mm.
Pot #1 therefore has mean Day 13 height that is .192 mm below the mean for
all pots with sterilized restored prairie soil, but every such pot is assumed to
have the same growth rate of .160 mm/day because of our assumption that
there is no pot-to-pot variability in growth rate (i.e., ṽi = 0).
Level Two
We next proceed to Level Two, where our equations for Model C.1 are:
aij = ai + uij
bij = bi + vij
We will initially focus on Plant #11 from Pot #1. Notice that the intercept
(Day 13 height = a11 ) for Plant #11 has two components: the mean Day 13
height for Pot #1 (a1 ) which we specified at Level Three, and an error term
(u11 ) which indicates how the Day 13 height for Plant #11 differs from the
overall average for all plants from Pot #1. The slope (daily growth rate =
b11 ) for Plant #11 similarly has two components. Since both a1 and b1 were
determined at Level Three, at this point we need to find the two error terms
for Plant #11: u11 and v11 . According to our multilevel model, we can sample
u11 and v11 from a bivariate normal distribution with means both equal to 0,
standard deviation for the intercept of .543, standard deviation for the slope
of .036, and correlation between the intercept and slope of .194.
For instance, suppose we sample $u_{11} = .336$ and $v_{11} = .029$. Then we can produce a model-based intercept and slope for Plant #11:

$$a_{11} = a_1 + u_{11} = 1.232 + 0.336 = 1.568$$
$$b_{11} = b_1 + v_{11} = 0.160 + 0.029 = 0.189$$
Although plants from Pot #1 have a mean Day 13 height of 1.232 mm, Plant
#11’s mean Day 13 height is .336 mm above that. Similarly, although plants
from Pot #1 have a mean growth rate of .160 mm/day (just like every other pot
with sterilized restored prairie soil), Plant #11’s growth rate is .029 mm/day
faster.
Level One
Finally we proceed to Level One, where the height of Plant #11 is modeled
as a linear function of time (1.568 + .189time11k ) with a normally distributed
residual 11k at each time point k. Four residuals (one for each time point) are
sampled independently from a normal distribution with mean 0 and standard
deviation .287 – the standard deviation again coming from parameter estimates
from fitting Model C.1 to the actual data as reported in Section 10.5. Suppose
we obtain residuals of $\epsilon_{111} = -.311$, $\epsilon_{112} = .119$, $\epsilon_{113} = .241$, and $\epsilon_{114} = -.066$. In that case, our parametrically generated data for Plant #11 from Pot #1 would look like:

time13:  0      5      10     15
height:  1.257  2.632  3.699  4.337

For example, at time13 = 0 the generated height is 1.568 + .189(0) - .311 = 1.257 mm.
We would next turn to Plant #12 from Pot #1 (i = 1 and j = 2). Fixed
effects would remain the same, as would coefficients for Pot #1, a1 = 1.232
and b1 = .160, at Level Three. We would, however, sample new residuals u12
and v12 at Level Two, producing a different intercept a12 and slope b12 than
those observed for Plant #11. Four new independent residuals 12k would also
be selected at Level One, from the same normal distribution as before with
mean 0 and standard deviation .287.
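The whole generation process for a single plant can also be written directly in R, plugging in the estimated parameters quoted above (the values sampled below will, of course, differ from the worked example):

library(MASS)
# Level Three: pot effect for a pot with sterilized restored prairie soil
u_pot <- rnorm(1, mean = 0, sd = 0.240)
a_pot <- 1.512 - 0.088 + u_pot   # fixed intercept pieces plus pot effect
b_pot <- 0.160                   # growth rate assumed common to such pots
# Level Two: correlated plant-level deviations in intercept and slope
Sigma <- matrix(c(0.543^2, 0.194 * 0.543 * 0.036,
                  0.194 * 0.543 * 0.036, 0.036^2), nrow = 2)
uv <- mvrnorm(1, mu = c(0, 0), Sigma = Sigma)
a_plant <- a_pot + uv[1]
b_plant <- b_pot + uv[2]
# Level One: four heights with independent normal residuals
time13 <- c(0, 5, 10, 15)
a_plant + b_plant * time13 + rnorm(4, mean = 0, sd = 0.287)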
Once an entire set of simulated heights for every pot, plant, and time point
have been generated based on Model C.1, two models are fit to this data:
• Model C.1 – the correct (null) model that was actually used to generate the
responses
• Model C – the incorrect (full) model that contains two extra variance compo-
nents, σṽ2 and σũṽ , that were not actually used when generating the responses
However, we are really only interested in saving the likelihood ratio test statistic from this bootstrapped sample ($2 \times (-280.54 - (-281.24)) = 1.40$).
By generating (“bootstrapping”) many sets of responses based on estimated
parameters from Model C.1 and calculating many likelihood ratio test statistics,
we can observe how this test statistic behaves under the null hypothesis of
σṽ2 = σũṽ = 0, rather than making the (dubious) assumption that its behavior
is described by a chi-square distribution with 2 degrees of freedom. Figure 10.15
illustrates the null distribution of the likelihood ratio test statistic derived by
the parametric bootstrap procedure as compared to a chi-square distribution.
A p-value for comparing our full and reduced models can be approximated
by finding the proportion of likelihood ratio test statistics generated under
the null model which exceed our observed likelihood ratio test (2.089). The
parametric bootstrap provides a more reliable p-value in this case (.088 from
table below); a chi-square distribution puts too much mass in the tail and not
enough near 0, leading to overestimation of the p-value. Based on this test, we
would still choose our simpler Model C.1, but we nearly had enough evidence
to favor the more complex model.
FIGURE 10.15: Null distribution of the likelihood ratio test statistic derived by the parametric bootstrap procedure, compared to a chi-square distribution with 2 degrees of freedom.
Another way of testing whether or not we should stick with the reduced model
or reject it in favor of the larger model is by generating parametric bootstrap
samples, and then using those samples to produce 95% confidence intervals for
both ρũṽ and σṽ . From the output below, the 95% bootstrapped confidence
interval for ρũṽ (-1, 1) contains 0, and the interval for σṽ (.00050, .0253) nearly
contains 0, providing further evidence that the larger model is not needed.
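Intervals like these can be generated with confint() using its bootstrap method; a sketch, with model.c as the assumed name of the fitted full model:

# Parametric bootstrap confidence intervals for all model parameters;
# oldNames = FALSE produces the labeled rows shown below
confint(model.c, method = "boot", nsim = 1000, oldNames = FALSE)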
## 2.5 % 97.5 %
## sd_(Intercept)|plant 0.4546698 0.643747
## cor_time13.(Intercept)|plant -0.0227948 0.624198
## sd_time13|plant 0.0250352 0.042925
## sd_(Intercept)|pot 0.0000000 0.402128
## cor_time13.(Intercept)|pot -1.0000000 1.000000
## sd_time13|pot 0.0004998 0.025339
## sigma 0.2550641 0.311050
## (Intercept) 1.2413229 1.776788
## time13 0.0827687 0.117109
10.7 Exploding Variance Components
Our modeling task in Section 10.5 was simplified by the absence of covariates
at Level Two. As multilevel models grow to include three or more levels, the
addition of just a few covariates at lower levels can lead to a huge increase in
the number of parameters (fixed effects and variance components) that must
be estimated throughout the model. In this section, we will examine when
and why the number of model parameters might explode, and we will consider
strategies for dealing with these potentially complex models.
The simplified random intercepts model can be expressed level-by-level as:

• Level One: $Y_{ijk} = a_{ij} + b_{ij}\,\textrm{time}_{ijk} + \epsilon_{ijk}$
• Level Two: $a_{ij} = a_i + c_i\,\textrm{seedsize}_{ij} + u_{ij}$ and $b_{ij} = b_i$
• Level Three: $a_i = \alpha_0 + \alpha_1\,\textrm{strl}_i + \alpha_2\,\textrm{cult}_i + \alpha_3\,\textrm{rem}_i + \tilde{u}_i$, with $b_i = \beta_0$ and $c_i = \gamma_0$,

where $\epsilon_{ijk} \sim N(0, \sigma^2)$, $u_{ij} \sim N(0, \sigma_u^2)$, and $\tilde{u}_i \sim N(0, \sigma_{\tilde{u}}^2)$. Or, in terms of a composite model:

$$Y_{ijk} = [\alpha_0 + \alpha_1\,\textrm{strl}_i + \alpha_2\,\textrm{cult}_i + \alpha_3\,\textrm{rem}_i + \gamma_0\,\textrm{seedsize}_{ij} + \beta_0\,\textrm{time}_{ijk}] + [\tilde{u}_i + u_{ij} + \epsilon_{ijk}]$$
According to the second option, we have built a random intercepts model with
error terms only at the first (intercept) equation at each level. Not only does
this eliminate variance terms associated with the missing error terms, but it
also eliminates correlation terms between errors (as suggested by Option 1)
since there are no pairs of error terms that can be formed at any level. In
addition, as suggested by Option 3, we have eliminated predictors (and their
fixed effects coefficients) at every equation other than the intercept at each
level.
The simplified 9-parameter model essentially includes a random effect for pot
(σũ2 ) after controlling for sterilization and soil type, a random effect for plant
within pot (σu2 ) after controlling for seed size, and a random effect for error
about the time trend for individual plants (σ 2 ). We must assume that the
effect of time is the same for all plants and all pots, and it does not depend on
seed size, sterilization, or soil type. Similarly, we must assume that the effect
of seed size is the same for each pot and does not depend on sterilization or
soil type. While somewhat restrictive, a random intercepts model such as this can be a sensible starting point, since the simple act of accounting for variability of observational units at Levels Two and Three can produce better estimates of fixed effects of interest and their standard errors.
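A sketch of this 9-parameter random intercepts model, where seedsize is a hypothetical name for the Level Two seed size covariate:

# Random intercepts at the plant and pot levels; all covariate effects
# fixed (6 fixed effects + 3 variance components = 9 parameters)
model.ri <- lmer(hgt ~ time13 + seedsize + strl + cult + rem +
                   (1 | plant) + (1 | pot), data = leaddata)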
Our final model (Model F), with its constraints on Level Three error terms, can be expressed level-by-level as:

• Level One: $Y_{ijk} = a_{ij} + b_{ij}\,\textrm{time}_{ijk} + \epsilon_{ijk}$
• Level Two:
$$a_{ij} = a_i + u_{ij}$$
$$b_{ij} = b_i + v_{ij}$$
• Level Three:
$$a_i = \alpha_0 + \tilde{u}_i$$
$$b_i = \beta_0 + \beta_1\,\textrm{strl}_i + \beta_2\,\textrm{rem}_i + \beta_3\,\textrm{strl}_i\,\textrm{rem}_i$$
[Figure: density of likelihood ratio test statistics from the null distribution generated by the parametric bootstrap.]
The effects of remnant prairie soil and the interaction between remnant soil
and sterilization appear to have marginal benefit in Model F, so we remove
those two terms to create Model E. A likelihood ratio test comparing Models
E and F, however, shows that Model F significantly outperforms Model E
(chi-square test statistic = 9.40 on 2 df, p=.0090). Thus, we will use Model F
as our “Final Model” for generating inference.
mm/day). Note that the difference between .056 and .017 is our three-way
interaction coefficient. Through this three-way interaction term, we also
see that leadplants grown in sterilized soil from remnant prairies have an
estimated daily increase in height of 0.095 mm.
Based on t-values produced by Model F, sterilization has the most significant
effect on leadplant growth, while there is some evidence that growth rate is
somewhat slower in remnant prairies, and that the effect of sterilization is
also somewhat muted in remnant prairies. Sterilization leads to an estimated
66% increase in growth rate of leadplants from Days 13 to 28 in soil from
reconstructed prairies and cultivated lands, and an estimated 28% increase in
soil from remnant prairies. In unsterilized soil, plants from remnant prairies
grow an estimated 19% slower than plants from other soil types.
10.9 Covariance Structure (optional)

and

$$\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{\tilde{u}}^2 & \sigma_{\tilde{u}\tilde{v}} \\ \sigma_{\tilde{u}\tilde{v}} & \sigma_{\tilde{v}}^2 \end{bmatrix}\right).$$
In order to assess the implied covariance structure from our standard model,
we must first derive variance and covariance terms for related observations (i.e.,
same timepoint and same plant, different timepoints but same plant, different
plants but same pot). Each derivation will rely on the random effects portion of
the composite model, since there is no variability associated with fixed effects.
For ease of notation, we will let $t_k = \textrm{time}_{ijk}$, since all plants were planned to be observed on the same 4 days.
The variance for an individual observation can be expressed as:

$$\mathrm{Var}(Y_{ijk}) = (\sigma^2 + \sigma_u^2 + \sigma_{\tilde{u}}^2) + 2(\sigma_{uv} + \sigma_{\tilde{u}\tilde{v}})t_k + (\sigma_v^2 + \sigma_{\tilde{v}}^2)t_k^2, \tag{10.1}$$

while the covariance between observations taken at different timepoints from the same plant is:

$$\mathrm{Cov}(Y_{ijk}, Y_{ijk'}) = (\sigma_u^2 + \sigma_{\tilde{u}}^2) + (\sigma_{uv} + \sigma_{\tilde{u}\tilde{v}})(t_k + t_{k'}) + (\sigma_v^2 + \sigma_{\tilde{v}}^2)t_k t_{k'}, \tag{10.2}$$
Based on these variances and covariances, the covariance matrix for observations
over time from the same plant (j) from pot i can be expressed as the following
4×4 matrix:

$$\mathrm{Cov}(\mathbf{Y}_{ij}) = \begin{bmatrix} \tau_1^2 & & & \\ \tau_{12} & \tau_2^2 & & \\ \tau_{13} & \tau_{23} & \tau_3^2 & \\ \tau_{14} & \tau_{24} & \tau_{34} & \tau_4^2 \end{bmatrix},$$

where $\tau_k^2 = \mathrm{Var}(Y_{ijk})$ and $\tau_{kk'} = \mathrm{Cov}(Y_{ijk}, Y_{ijk'})$. Note that $\tau_k^2$ and $\tau_{kk'}$ are
both independent of i and j so that Cov(Yij ) will be constant for all plants
from all pots. That is, every plant from every pot will have the same set of
variances over the four timepoints and the same correlations between heights at
different timepoints. But, the variances and correlations can change depending
on the timepoint under consideration as suggested by the presence of tk terms
in Equations (10.1) through (10.3).
Similarly, the covariance matrix between observations from plants $j$ and $j'$ from pot $i$ can be expressed as this 4×4 matrix:

$$\mathrm{Cov}(\mathbf{Y}_{ij}, \mathbf{Y}_{ij'}) = \begin{bmatrix} \tilde{\tau}_{11} & & & \\ \tilde{\tau}_{12} & \tilde{\tau}_{22} & & \\ \tilde{\tau}_{13} & \tilde{\tau}_{23} & \tilde{\tau}_{33} & \\ \tilde{\tau}_{14} & \tilde{\tau}_{24} & \tilde{\tau}_{34} & \tilde{\tau}_{44} \end{bmatrix},$$

where $\tilde{\tau}_{kk} = \mathrm{Cov}(Y_{ijk}, Y_{ij'k}) = \sigma_{\tilde{u}}^2 + 2\sigma_{\tilde{u}\tilde{v}}t_k + \sigma_{\tilde{v}}^2 t_k^2$ and $\tilde{\tau}_{kk'} = \mathrm{Cov}(Y_{ijk}, Y_{ij'k'})$
as derived above. As we saw with Cov(Yij ), τ̃kk and τ̃kk0 are both independent
of i and j so that Cov(Yij , Yij 0 ) will be constant for all pairs of plants from
all pots. That is, any pair of plants from the same pot will have the same
correlations between heights at any two timepoints. As with any covariance
matrix, we can convert Cov(Yij , Yij 0 ) into a correlation matrix if desired.
Now that we have the general covariance structure implied by the standard
multilevel model in place, we can examine the specific structure suggested
by the estimates of variance components in Model B. Restricted maximum
likelihood (REML) estimation in Section 10.4 produced the following estimates for the variance components: $\hat{\sigma}^2 = .0822$, $\hat{\sigma}_u^2 = .299$, $\hat{\sigma}_v^2 = .00119$, $\hat{\sigma}_{uv} = \hat{\rho}_{uv}\sqrt{\hat{\sigma}_u^2\,\hat{\sigma}_v^2} = .00528$, $\hat{\sigma}_{\tilde{u}}^2 = .0442$, $\hat{\sigma}_{\tilde{v}}^2 = .00126$, and $\hat{\sigma}_{\tilde{u}\tilde{v}} = \hat{\rho}_{\tilde{u}\tilde{v}}\sqrt{\hat{\sigma}_{\tilde{u}}^2\,\hat{\sigma}_{\tilde{v}}^2} = -.00455$. Based on these estimates and the derivations above, the within-plant correlation structure over time is estimated to be:

$$\mathrm{Corr}(\mathbf{Y}_{ij}) = \begin{bmatrix} 1 & & & \\ .76 & 1 & & \\ .65 & .82 & 1 & \\ .54 & .77 & .88 & 1 \end{bmatrix}$$
for all plants $j$ and all pots $i$, and the correlation structure between different plants from the same pot is estimated to be:

$$\mathrm{Corr}(\mathbf{Y}_{ij}, \mathbf{Y}_{ij'}) = \begin{bmatrix} .104 & & & \\ .047 & .061 & & \\ -.002 & .067 & .116 & \\ -.037 & .068 & .144 & .191 \end{bmatrix}.$$
for all plants $j$ and all pots $i$, and the correlation structure between different plants from the same pot is estimated to be:

$$\mathrm{Corr}(\mathbf{Y}_{ij}, \mathbf{Y}_{ij'}) = \begin{bmatrix} .104 & & & \\ .095 & .087 & & \\ .084 & .077 & .068 & \\ .073 & .067 & .059 & .052 \end{bmatrix}.$$
The covariance between observations taken from different plants from the same
pot is:
Based on these variances and covariances and the expressions for Cov(Yij )
and Cov(Yij , Yij 0 ) in Section 10.9, the complete covariance matrix for obser-
vations from pot i can be expressed as the following 24x24 matrix (assuming 4
observations over time for each of 6 plants):
$$\mathrm{Cov}(\mathbf{Y}_i) = \begin{bmatrix} \mathrm{Cov}(\mathbf{Y}_{i1}) & & & & & \\ \mathrm{Cov}(\mathbf{Y}_{i1},\mathbf{Y}_{i2}) & \mathrm{Cov}(\mathbf{Y}_{i2}) & & & & \\ \mathrm{Cov}(\mathbf{Y}_{i1},\mathbf{Y}_{i3}) & \mathrm{Cov}(\mathbf{Y}_{i2},\mathbf{Y}_{i3}) & \mathrm{Cov}(\mathbf{Y}_{i3}) & & & \\ \mathrm{Cov}(\mathbf{Y}_{i1},\mathbf{Y}_{i4}) & \mathrm{Cov}(\mathbf{Y}_{i2},\mathbf{Y}_{i4}) & \mathrm{Cov}(\mathbf{Y}_{i3},\mathbf{Y}_{i4}) & \mathrm{Cov}(\mathbf{Y}_{i4}) & & \\ \mathrm{Cov}(\mathbf{Y}_{i1},\mathbf{Y}_{i5}) & \mathrm{Cov}(\mathbf{Y}_{i2},\mathbf{Y}_{i5}) & \mathrm{Cov}(\mathbf{Y}_{i3},\mathbf{Y}_{i5}) & \mathrm{Cov}(\mathbf{Y}_{i4},\mathbf{Y}_{i5}) & \mathrm{Cov}(\mathbf{Y}_{i5}) & \\ \mathrm{Cov}(\mathbf{Y}_{i1},\mathbf{Y}_{i6}) & \mathrm{Cov}(\mathbf{Y}_{i2},\mathbf{Y}_{i6}) & \mathrm{Cov}(\mathbf{Y}_{i3},\mathbf{Y}_{i6}) & \mathrm{Cov}(\mathbf{Y}_{i4},\mathbf{Y}_{i6}) & \mathrm{Cov}(\mathbf{Y}_{i5},\mathbf{Y}_{i6}) & \mathrm{Cov}(\mathbf{Y}_{i6}) \end{bmatrix}$$
A covariance matrix for our entire data set, therefore, would be block diagonal,
with Cov(Yi ) matrices along the diagonal reflecting within pot correlation
and 0’s off-diagonal reflecting the assumed independence of observations from
plants from different pots. As with any covariance matrix, we can convert the Cov($\mathbf{Y}_{ij}$, $\mathbf{Y}_{ij'}$) blocks for two different plants from the same pot into correlation matrices.

10.10 Notes on Using R (optional)

Recall that in Section 10.6 we tested whether the pot-level variance components $\sigma_{\tilde{v}}^2$ and $\sigma_{\tilde{u}\tilde{v}}$ could be assumed to be 0. Our initial attempt used the anova() function in R, which created two
problems: (a) the anova() function uses full maximum likelihood estimates
rather than REML estimates of model parameters and performance, which
is fine when two models differ in fixed effects but not, as in this case, when
two models differ only in random effects; and, (b) the likelihood ratio test
statistic is often not well approximated by a chi-square distribution. Therefore,
we implemented the parametric bootstrap method to simulate the distribution
of the likelihood ratio test statistic and obtain a more reliable p-value, also
illustrating that the chi-square distribution would produce an artificially large
p-value.
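For reference, a sketch of the two calls (model names as in earlier sketches): by default anova() refits merMod objects with full maximum likelihood before testing, while refit = FALSE compares the REML fits directly.

anova(model.c1, model.c)                  # refits both models with ML
anova(model.c1, model.c, refit = FALSE)   # compares REML criteria instead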
10.11 Exercises
better to find the mean height for each pot and just use those 72
values to examine the effects of experimental factors?
8. Explain why a likelihood ratio test is appropriate for comparing
Models B and C.
9. Should we be concerned that σ̂u2 increased from Model A to B? Why
or why not?
10. Explain the idea of boundary constraints in your own words. Why
can it be a problem in multilevel models?
11. In Model C, we initially addressed boundary constraints by removing
the Level Three correlation between error terms from our multilevel
model. What other model adjustments might we have considered?
12. How does Figure 10.15 show that a likelihood ratio test using a
chi-square distribution would be biased?
13. In Section 10.7, a model with 52 parameters is described: (a) il-
lustrate that the model does indeed contain 52 parameters; (b)
explain how to minimize the total number of parameters using ideas
from Section 10.7; (c) what assumptions have you made in your
simplification in (b)?
14. In Section 10.8, Model F (the null model) is compared to Model
D using a parametric bootstrap test. As in Section 10.6, show in
detail how bootstrapped data would be generated under Model F
for, say, Plant #1 from Pot #1. For the random parts, tell what
distribution the random pieces are coming from and then select
a random value from that distribution. Finally, explain how the
parametric bootstrap test would be carried out.
15. Section 10.8 contains an interpretation for the coefficient of a three-
way interaction term, β̂3 . Provide an alternative interpretation for β̂3
by focusing on how the sterilization-by-soil type interaction differs
over time.
                            Model 1:                     Model 2:
                            Social composition           Social comp and collective efficacy
Variable                    Coefficient   SE      t      Coefficient   SE      t
Concentrated disadvantage   0.277         0.021   13.30  0.171         0.024   7.24
Immigrant concentration     0.041         0.017   2.44   0.018         0.016   1.12
Residential stability       -0.102        0.015   -6.95  -0.056        0.016   -3.49
Collective efficacy                                      -0.618        0.104   -5.95
Examination of data for Case Study 11.2 reveals the following key variables in
basketball0910.csv:
TABLE 11.1: Key variables from the first 10 rows of data from the College
Basketball Referees Case Study. Each row represents a different foul called;
we see all 8 first-half fouls from Game 1 followed by the first 2 fouls called in
Game 2.
game visitor hometeam foul.num foul.home foul.diff score.diff lead.home foul.type time
1 IA MN 1 0 0 7 1 Personal 14.167
1 IA MN 2 1 -1 10 1 Personal 11.433
1 IA MN 3 1 0 11 1 Personal 10.233
1 IA MN 4 0 1 11 1 Personal 9.733
1 IA MN 5 0 0 14 1 Shooting 7.767
1 IA MN 6 0 -1 22 1 Shooting 5.567
1 IA MN 7 1 -2 25 1 Shooting 2.433
1 IA MN 8 1 -1 23 1 Offensive 1.000
2 MI MIST 1 0 0 2 1 Shooting 18.983
2 MI MIST 2 1 -1 2 1 Personal 17.200
Data was collected for 4972 fouls over 340 games from the Big Ten, ACC, and
Big East conference seasons during 2009-2010. We focus on fouls called during
the first half to avoid the issue of intentional fouls by the trailing team at the
end of games. Table 11.1 illustrates key variables from the first 10 rows of the
data set.
For our initial analysis, our primary response variable is foul.home, and our
primary hypothesis concerns evening out foul calls. We hypothesize that the
probability a foul is called on the home team is inversely related to the foul
differential; that is, if more fouls have been called on the home team than the
visiting team, the next foul is less likely to be on the home team.
The structure of this data suggests a couple of familiar attributes combined in
an unfamiliar way. With a binary response variable, a generalized linear model
is typically applied, especially one with a logit link function (indicating logistic
regression). But, with covariates at multiple levels—some at the individual
foul level and others at the game level—a multilevel model would also be
sensible. So what we need is a multilevel model with a non-normal response; in
other words, a multilevel generalized linear model (multilevel GLM).
We will investigate what such a model might look like in the next section, but
we will still begin by exploring the data with initial graphical and numerical
summaries.
As with other multilevel situations, we will begin with broad summaries across
all 4972 foul calls from all 340 games. Most of the variables we have collected
can vary with each foul called; these Level One variables include:
• whether or not the foul was called on the home team (our response variable),
• the game situation at the time the foul was called (the time remaining in
the first half, who is leading and by how many points, the foul differential
between the home and visiting team, and who the previous foul was called
on), and
• the type of foul called (offensive, personal, or shooting).
Level Two variables, those that remain unchanged for a particular game,
then include only the home and visiting teams, although we might consider
attributes such as attendance, team rankings, etc.
In Figure 11.1, we see histograms for the continuous Level One covariates (time
remaining, foul differential, and score differential). These plots treat each foul
within a game as independent even though we expect them to be correlated,
but they provide a sense for the overall patterns. We see that time remaining is
reasonably uniform. Score differential and foul differential are both bell-shaped,
with a mean slightly favoring the home team in both cases – on average, the
home team leads by 2.04 points (SD 7.24) and has 0.36 fewer previous fouls
(SD 2.05) at the time a foul is called.
Summaries of the categorical response (whether the foul was called on the
home team) and categorical Level One covariates (whether the home team has
the lead and what type of foul was called) can be provided through tables of
FIGURE 11.1: Histograms of the continuous Level One covariates: (a) time left in first half, (b) score difference (home-visitor), and (c) foul difference (home-visitor).
proportions. More fouls are called on visiting teams (52.1%) than home teams,
the home team is more likely to hold a lead (57.1%), and personal fouls are
most likely to be called (51.6%), followed by shooting fouls (38.7%) and then
offensive fouls (9.7%).
For an initial examination of Level Two covariates (the home and visiting
teams), we can take the number of times, for instance, Minnesota (MN) appears
in the long data set (with one row per foul called as illustrated in Table 11.1) as
the home team and divide by the number of unique games in which Minnesota
is the home team. This ratio (12.1), found in Table 11.2, shows that Minnesota
is among the bottom three teams in the average total number of fouls in the
first halves of games in which it is the home team. That is, games at Minnesota
have few total fouls relative to games played elsewhere. Accounting for the
effect of home and visiting team will likely be an important part of our model,
since some teams tend to play in games with twice as many fouls called as
others, and other teams see a noticeable disparity in the total number of fouls
depending on if they are home or away.
Next, we inspect numerical and graphical summaries of relationships between
Level One model covariates and our binary model response. As with other
multilevel analyses, we will begin by observing broad trends involving all 4972
fouls called, even though fouls from the same game may be correlated. The
conditional density plots in the first row of Figure 11.2 examine continuous
Level One covariates. Figure 11.2a provides support for our primary hypothesis
about evening out foul calls, indicating a very strong trend for fouls to be
more often called on the home team at points in the game when more fouls
had previously been called on the visiting team. Figures 11.2b and 11.2c
then show that fouls were somewhat more likely to be called on the home
team when the home team’s lead was greater and (very slightly) later in the
TABLE 11.2: Average total number of fouls in the first half over all games
in which a particular team is home or visitor. The left columns show the top 3
and bottom 3 teams according to total number of fouls (on both teams) in
first halves of games in which they are the home team. The middle columns
correspond to games in which the listed teams are the visitors, and the right
columns show the largest differences (in both directions) between total fouls
in games in which a team is home or visitor.
half. Conclusions from the conditional density plots in Figures 11.2a-c are
supported with associated empirical logit plots in Figures 11.2d-f. If a logistic
link function is appropriate, these plots should be linear, and the stronger the
linear association, the more promising the predictor. We see in Figure 11.2d
further confirmation of our primary hypothesis, with lower log-odds of a foul
called on the home team associated with a greater number of previous fouls
the home team had accumulated compared to the visiting team. Figure 11.2e
shows that game score may play a role in foul trends, as the log-odds of a foul
on the home team grows as the home team accumulates a bigger lead on the
scoreboard, and Figure 11.2f shows a very slight tendency for greater log-odds
of a foul called on the home team as the half proceeds (since points on the
right are closer to the beginning of the game).
FIGURE 11.2: Conditional density and empirical logit plots of the binary
model response (foul called on home or visitor) vs. the three continuous Level
One covariates (foul differential, score differential, and time remaining). The
dark shading in a conditional density plot shows the proportion of fouls called
on the home team for a fixed value of (a) foul differential, (b) score differential,
and (c) time remaining. In empirical logit plots, estimated log odds of a home
team foul are calculated for each distinct foul (d) and score (e) differential,
except for differentials at the high and low extremes with insufficient data; for
time (f), estimated log odds are calculated for two-minute time intervals and
plotted against the midpoints of those intervals.
The mosaic plots in Figure 11.3 examine categorical Level One covariates,
indicating that fouls were more likely to be called on the home team when the
home team was leading, when the previous foul was on the visiting team, and
when the foul was a personal foul rather than a shooting foul or an offensive
foul. A total of 51.8% of calls go against the home team when it is leading
the game, compared to only 42.9% of calls when it is behind; 51.3% of calls
go against the home team when the previous foul went against the visitors,
compared to only 43.8% of calls when the previous foul went against the home
team; and, 49.2% of personal fouls are called against the home team, compared
to only 46.9% of shooting fouls and 45.7% of offensive fouls. Eventually we
will want to examine the relationship between foul type (personal, shooting,
or offensive) and foul differential, examining our hypothesis that the tendency
to even out calls will be even stronger for calls over which the referees have
greater control (personal fouls and especially offensive fouls).
The exploratory analyses presented above are an essential first step in under-
standing our data, seeing univariate trends, and noting bivariate relationships
between variable pairs. However, our important research questions (a) involve
the effect of foul differential after adjusting for other significant predictors of
which team is called for a foul, (b) account for potential correlation between
foul calls within a game (or within a particular home or visiting team), and (c)
FIGURE 11.3: Mosaic plots of the binary model response (foul called on
home or visitor) vs. the three categorical Level One covariates (foul type (a),
team in the lead (b), and team called for the previous foul (c)). Each bar shows
the percentage of fouls called on the home team vs. the percentage of fouls
called on the visiting team for a particular category of the covariate. The bar
width shows the proportion of fouls at each of the covariate levels.
One quick and dirty approach to analysis might be to run a multiple logistic
regression model on the entire long data set of 4972 fouls. In fact, Anderson
and Pierce ran such a model in their 2009 paper, using the results of their
multiple logistic regression model to support their primary conclusions, while
justifying their approach by confirming a low level of correlation within games
and the minimal impact on fixed effect estimates that accounting for clustering
would have. Output from one potential multiple logistic regression model is
shown below; this initial modeling attempt shows significant evidence that
referees tend to even out calls (i.e., that the probability of a foul called on
the home team decreases as total home fouls increase compared to total
visiting team fouls—that is, as foul.diff increases) after accounting for score
differential and time remaining (Z=-3.078, p=.002). The extent of the effect of
foul differential also appears to grow (in a negative direction) as the first half
goes on, based on an interaction between time remaining and foul differential
(Z=-2.485, p=.013). We will compare this model with others that formally
account for clustering and correlation patterns in our data.
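A sketch of such a single-level model using glm() and the variable names from Table 11.1 (the exact covariate set in the fitted model is an assumption based on the description above):

# Naive logistic regression that ignores clustering of fouls within games
glm.naive <- glm(foul.home ~ foul.diff + score.diff + time +
                   foul.diff:time, family = binomial, data = refdata)
summary(glm.naive)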
TABLE 11.3: Key variables from the March 3, 2010, game featuring Virginia
at Boston College (Game 110).
we could model the probability of a foul on the home team in Game 110 with
the model:
$$\log\left(\frac{p_{110j}}{1-p_{110j}}\right) = a_{110} + b_{110}\,\textrm{foul.diff}_{110j} \tag{11.1}$$
where $i$ is fixed at 110. Note that there is no separate error term or variance parameter, since the variance of a Bernoulli random variable is a function of $p_{ij}$.
Maximum likelihood estimators for the parameters in this model ($a_{110}$ and $b_{110}$) can be obtained through statistical software. $e^{a_{110}}$ represents the odds that a foul is called on the home team when the foul totals in Game 110 are even, and $e^{b_{110}}$ represents the multiplicative change in the odds that a foul is called on the home team for each additional foul for the home team relative to the visiting team during the first half of Game 110.
For Game 110, we estimate $\hat{a}_{110} = -5.67$ and $\hat{b}_{110} = -2.11$ (see output below). Thus, according to our simple logistic regression model, the odds that a foul is called on the home team when both teams have an equal number of fouls in Game 110 is $e^{-5.67} = 0.0035$; that is, the probability that a foul is called on the visiting team (0.9966) is $1/0.0035 = 289$ times higher than the probability a foul is called on the home team (0.0034) in that situation. While these parameter estimates seem quite extreme, reliable estimates are difficult to obtain with 14 observations and a binary response variable, especially in a case like this where the fouls were only even at the start of the game. Also, as the gap between home and visiting fouls increases by 1, the odds that a foul is called on the home team are multiplied by $e^{-2.11} = 0.12$.
FIGURE 11.4: Histograms of (a) intercepts and (b) slopes from fitting simple
logistic regression models by game. Several extreme outliers have been cut off
in these plots for illustration purposes.
At this point, you might imagine expanding model building efforts in a couple
of directions: (a) continue to improve the Level One model in Equation (11.1)
Then we include no fixed covariates at Level Two, but we include error terms
to allow the intercept and slope from Level One to vary by game, and we allow
these errors to be correlated:
ai = α0 + ui
bi = β0 + vi ,
where the error terms at Level Two can be assumed to follow a multivariate
normal distribution:
$$\begin{bmatrix} u_i \\ v_i \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_u^2 & \sigma_{uv} \\ \sigma_{uv} & \sigma_v^2 \end{bmatrix}\right)$$
Again, we can use statistical software to obtain parameter estimates for this
unified multilevel model using all 4972 fouls recorded from the 340 games.
For example, the glmer() function from the lme4 package in R extends the
lmer() function to handle generalized responses and to account for the fact
that fouls are not independent within games. Results are given below for the
two-level model with foul differential as the sole covariate and Game as the
Level Two observational unit.
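A sketch of that call, with correlated random intercepts and slopes by game (the object name is an assumption):

# Two-level model: foul differential as the sole covariate
model.a1 <- glmer(foul.home ~ foul.diff + (foul.diff | game),
                  family = binomial, data = refdata)
summary(model.a1)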
When parameter estimates from the multilevel model above are compared
with those from the naive logistic regression model assuming independence of
all observations (below), there are noticeable differences. For instance, each
additional foul for the visiting team is associated with a 33% increase (1/e−.285 )
in the odds of a foul called on the home team under the multilevel model,
but the single level model estimates the same increase as only 14% (1/e−.130 ).
Also, estimated standard errors for fixed effects are greater under multilevel generalized linear modeling; this is not unusual, since accounting for correlated observations effectively reduces the sample size.
In this case, $e^{a_h}$ represents the odds that a foul is called on the home team when total fouls are equal between both teams in a game involving Home Team $h$, and $e^{b_h}$ represents the multiplicative change in the odds that a foul is called on the home team for every extra foul on the home team compared to the visitors in a game involving Home Team $h$. After fitting logistic regression
the visitors in a game involving Home Team h. After fitting logistic regression
models for each of the 39 teams in our data set, we see in Figure 11.5 variability
in fitted intercepts (mean=-0.15, sd=0.33) and slopes (mean=-0.22, sd=0.12)
among the 39 teams, although much less variability than we observed from
game-to-game. Of course, each logistic regression model for a home team
was based on about 10 times more foul calls than each model for a game, so
observing less variability from team-to-team was not unexpected.
FIGURE 11.5: Histograms of (a) intercepts and (b) slopes from fitting simple
logistic regression models by home team.
Team g against Home Team h. Square brackets are introduced since g and
h are essentially at the same level as i, whereas we have assumed (without
stating so) throughout this book that subscripting without square brackets
implies a movement to lower levels as the subscripts move left to right (e.g.,
ij indicates i units are at Level Two, while j units are at Level One, nested
inside Level Two units). We can then consider Yi[gh]j to be a Bernoulli random
variable with parameter pi[gh]j , where pi[gh]j is the true probability that the
j th foul from Game i was called on Home Team h rather than Visiting Team
g. We will include the crossed subscripting only where necessary.
Typically, with the addition of crossed effects, the Level One model will remain
familiar and changes will be seen at Level Two, especially in the equation for
the intercept term. In the model formulation below we allow, as before, the
slope and intercept to potentially vary by game:
• Level One:
$$\log\left(\frac{p_{i[gh]j}}{1-p_{i[gh]j}}\right) = a_i + b_i\,\textrm{foul.diff}_{ij} \tag{11.2}$$
• Level Two:
$$a_i = \alpha_0 + u_i + v_h + w_g$$
$$b_i = \beta_0,$$
Therefore, at Level Two, we assume that ai , the log odds of a foul on the home
team when the home and visiting teams in Game i have an equal number of
fouls, depends on four components:
• α0 is the population average log odds across all games and fouls (fixed)
• ui is the effect of Game i (random)
• vh is the effect of Home Team h (random)
• wg is the effect of Visiting Team g (random)
where error terms (random effects) at Level Two can be assumed to follow
independent normal distributions:
$$u_i \sim N(0, \sigma_u^2), \quad v_h \sim N(0, \sigma_v^2), \quad w_g \sim N(0, \sigma_w^2).$$
We could include terms that vary by home or visiting team in other Level
Two equations, but often adjusting for these random effects on the intercept is
sufficient. The advantages to including additional random effects are three-fold.
First, by accounting for additional sources of variability, we should obtain
more precise estimates of other model parameters, including key fixed effects.
Second, we obtain estimates of variance components, allowing us to compare
the relative sizes of game-to-game and team-to-team variability. Third, as
outlined in Section 11.8, we can obtain estimated random effects which allow
us to compare the effects on the log-odds of a home foul of individual home
and visiting teams.
$$\log\left(\frac{p_{i[gh]j}}{1-p_{i[gh]j}}\right) = [\alpha_0 + \beta_0\,\textrm{foul.diff}_{ij}] + [u_i + v_h + w_g].$$
We will refer to this as Model A3, where we look at the effect of foul differential
on the odds a foul is called on the home team, while accounting for three
crossed random effects at Level Two (game, home team, and visiting team).
Parameter estimates for Model A3 are given below:
# Model A3
model.a3 <- glmer(foul.home ~ foul.diff + (1|game) +
                    (1|hometeam) + (1|visitor),
                  family = binomial, data = refdata)
• $\hat{\alpha}_0 = -0.188$ = the mean log odds of a home foul at the point where total fouls are equal between teams. In other words, when fouls are balanced between teams, the probability that a foul is called on the visiting team (.547) is 20.7% ($1/e^{-.188} = 1.207$) higher than the probability a foul is called on the home team (.453).
• $\hat{\beta}_0 = -0.264$ = the decrease in mean log odds of a home foul for each 1 foul increase in the foul differential. More specifically, the odds the next foul is called on the visiting team rather than the home team increases by 30.2% with each additional foul called on the home team ($1/e^{-.264} = 1.302$).
• $\hat{\sigma}_u^2 = 0.172$ = the variance in intercepts from game-to-game.
• $\hat{\sigma}_v^2 = 0.068$ = the variance in intercepts among different home teams.
• $\hat{\sigma}_w^2 = 0.023$ = the variance in intercepts among different visiting teams.
Based on the t-value (-6.80) and p-value (p < .001) associated with foul
differential in this model, we have significant evidence of a negative association
between foul differential and the odds of a home team foul. That is, we have
significant evidence that the odds that a foul is called on the home team
shrinks as the home team has more total fouls compared with the visiting
team. Thus, there seems to be preliminary evidence in the 2009-2010 data that
college basketball referees tend to even out foul calls over the course of the
first half. Of course, we have yet to adjust for other significant covariates.
To test whether the crossed random effects for home team and visiting team are needed, we can compare Model A3 with a reduced model containing only a random effect by game:

$$a_i = \alpha_0 + u_i$$
$$b_i = \beta_0,$$
The likelihood ratio test (see below) provides significant evidence (LRT=16.074,
df=2, p=.0003) that accounting for variability among home teams and among
visiting teams improves our model.
Figure 11.6 illustrates the null distribution of the likelihood ratio test statistic
derived by the parametric bootstrap procedure with 100 samples as compared
to a chi-square distribution. As we observed in Section 9.6.4, the parametric
bootstrap provides a more reliable p-value in this case (p < .001 from output
below) because a chi-square distribution puts too much mass in the tail and not
enough near 0. However, the parametric bootstrap is computationally intensive,
and it can take a long time to run even with moderately complex models. With
this data, we would select our full Model A3 based on a parametric bootstrap
test.
FIGURE 11.6: Null distribution of the likelihood ratio test statistic derived by the parametric bootstrap procedure, compared to a chi-square distribution.

We might also reasonably ask: is it helpful to allow slopes (coefficients for foul differential) to vary by game, home team, and visiting team as well? Again,
since we are comparing models that differ in random effects, and since the
null hypothesis involves setting random effects at their boundaries, we use the
parametric bootstrap. Formally, we are comparing Model A3 to Model B3,
which has the same Level One equation as Model A3:
$$\log\left(\frac{p_{i[gh]j}}{1-p_{i[gh]j}}\right) = a_i + b_i\,\textrm{foul.diff}_{ij}$$
but 6 variance components to estimate at Level Two:
$$a_i = \alpha_0 + u_i + v_h + w_g$$
$$b_i = \beta_0 + z_i + r_h + s_g,$$
where error terms (random effects) at Level Two can be assumed to follow
independent normal distributions:
$$u_i \sim N(0, \sigma_u^2), \quad z_i \sim N(0, \sigma_z^2), \quad v_h \sim N(0, \sigma_v^2), \quad r_h \sim N(0, \sigma_r^2), \quad w_g \sim N(0, \sigma_w^2), \quad s_g \sim N(0, \sigma_s^2).$$
Thus our null hypothesis for comparing Model A3 vs. Model B3 is $H_0: \sigma_z^2 = \sigma_r^2 = \sigma_s^2 = 0$. We do not have significant evidence (LRT=0.349, df=3, p=.46
by parametric bootstrap) of variability among slopes, so we will only include
random effects for game, home team, and visiting team for the intercept going
forward. Figure 11.7 illustrates the null distribution of the likelihood ratio
test statistic derived by the parametric bootstrap procedure as compared to
a chi-square distribution, again showing that the tails are too heavy in the
chi-square distribution.
FIGURE 11.7: Null distribution of the likelihood ratio test statistic comparing Model A3 to Model B3, derived by the parametric bootstrap procedure and compared to a chi-square distribution.
Note that we could have also allowed for a correlation between the error terms
for the intercept and slope by game, home team, or visiting team – i.e., assume,
for example:
$$\begin{bmatrix} u_i \\ z_i \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_u^2 & \sigma_{uz} \\ \sigma_{uz} & \sigma_z^2 \end{bmatrix}\right)$$
while error terms by game, home team, or visiting team are still independent.
Here, the new model would have 6 additional parameters when compared
• Level Two:
$$a_i = \alpha_0 + u_i + v_h + w_g$$
$$b_i = \beta_0, \quad c_i = \gamma_0, \quad d_i = \delta_0, \quad f_i = \phi_0, \quad k_i = \kappa_0,$$
$$l_i = \lambda_0, \quad m_i = \mu_0, \quad n_i = \nu_0, \quad o_i = \omega_0, \quad q_i = \xi_0,$$
where error terms at Level Two can be assumed to follow independent normal
distributions:
$$u_i \sim N(0, \sigma_u^2), \quad v_h \sim N(0, \sigma_v^2), \quad w_g \sim N(0, \sigma_w^2).$$
Using the composite form of this multilevel generalized linear model, the
parameter estimates for our 11 fixed effects and 3 variance components are
given in the output below:
# Model F (the first lines of this call were lost in extraction; the fixed
# effects shown here are a reconstruction inferred from the parameter
# interpretations below, so treat the exact formula as an assumption)
model.f <- glmer(foul.home ~ foul.diff + score.diff + lead.home + time +
                   offensive + personal + foul.diff:offensive +
                   foul.diff:personal + foul.diff:time + score.diff:time +
                   (1|game) + (1|hometeam) + (1|visitor),
                 family = binomial, data = refdata)
stronger earlier in the half, and when offensive and personal fouls are called
instead of shooting fouls. The effect of foul type supports the hypothesis that
if referees are consciously or subconsciously evening out foul calls, the behavior
will be more noticeable for calls over which they have more control, especially
offensive fouls (which are notorious judgment calls) and then personal fouls
(which don’t affect a player’s shot, and thus a referee can choose to let them go
uncalled). Evidence like this can be considered dose response, since higher
“doses” of referee control are associated with a greater effect of foul differential
on their calls. A dose response effect provides even stronger indication of referee
bias.
Analyses of data from 2004-2005 [Noecker and Roback, 2012] showed that the
tendency to even out foul calls was stronger when one team had a large lead,
but we found no evidence of a foul differential by score differential interaction
in the 2009-2010 data, although home team fouls are more likely when the
home team has a large lead, regardless of the foul differential.
Here are specific interpretations of key model parameters:
• exp(α̂0 ) = exp(−0.247) = 0.781. The odds of a foul on the home team is
0.781 at the end of the first half when the score is tied, the fouls are even,
and the referee has just called a shooting foul. In other words, only 43.9% of
shooting fouls in those situations will be called on the home team.
• exp(β̂0) = exp(−0.172) = 0.842. Also, 1/0.842 = 1.188. As the foul differen-
tial decreases by 1 (the visiting team accumulates another foul relative to the
home team), the odds of a home foul increase by 18.8%. This interpretation
applies to shooting fouls at the end of the half, after controlling for the effects
of score differential and whether the home team has the lead.
• exp(γ̂0 ) = exp(0.034) = 1.034. As the score differential increases by 1 (the
home team accumulates another point relative to the visiting team), the odds
of a home foul increase by 3.4%, after controlling for foul differential, type of
foul, whether or not the home team has the lead, and time remaining in the
half. Referees are more likely to call fouls on the home team when the home
team is leading, and vice versa. Note that a change in the score differential
could result in the home team gaining the lead, so that the effect of score
differential experiences a non-linear “bump” at 0, where the size of the bump
depends on the time remaining (this would involve the interpretation for ξ̂0).
• exp(µ̂0) = exp(−0.103) = 0.902, and 1/0.902 = 1.109. The effect of foul
differential increases by 10.9% if a foul is an offensive foul rather than
a shooting foul, after controlling for score differential, whether the home
team has the lead, and time remaining. As hypothesized, the effect of foul
differential is greater for offensive fouls, over which referees have more control
when compared with shooting fouls. For example, midway through the half
(time=10), the odds that a shooting foul is on the home team increase by
29.6% for each extra foul on the visiting team, while the odds that an offensive
foul is on the home team increase by 43.6%.
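The time = 10 figures combine several coefficients. A minimal sketch reproducing them is below; the foul.diff:time estimate (kappa0) is not reported in this excerpt, so the value used here is backed out from the 29.6% figure and serves for illustration only.

beta0  <- -0.172    # foul.diff estimate
mu0    <- -0.103    # foul.diff:offensive estimate
kappa0 <- -0.0087   # foul.diff:time estimate (assumed for illustration)
1 / exp(beta0 + 10 * kappa0)         # ~1.296: shooting fouls with time = 10
1 / exp(beta0 + 10 * kappa0 + mu0)   # ~1.436: offensive fouls with time = 10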
Although there is a great deal of uncertainty surrounding each estimate, we see that, for instance, DePaul and Seton Hall have higher baseline odds of home fouls than Purdue or Syracuse. Similar histograms and prediction interval plots can be generated for random effects due to visiting teams and specific games.
[Figure: histogram (Frequency) of estimated random effects, and a caterpillar plot of estimated random effects with prediction intervals for each home team, ordered from DePaul, Seton Hall, and St. John's at the top to Minnesota, Syracuse, and Purdue at the bottom; x-axis: Estimated Random Effects.]
If you want to fit a model with an error term for each equation, but you also want to assume that the two error terms are independent, the error terms must be requested separately. For example, (1 | hometeam) allows the Level One intercept to vary by home team, while (0 + foul.diff | hometeam) allows the Level One effect of foul.diff to vary by home team. Under this formulation, the correlation between those two error terms is assumed to be 0; a non-zero correlation could have been specified with (1 + foul.diff | hometeam).
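A minimal sketch of the two formulations, shown with foul.diff as the only fixed effect for brevity (the full model contains all of the terms above):

# Intercepts and foul.diff slopes vary by home team, independently:
m.indep <- glmer(foul.home ~ foul.diff +
                   (1 | hometeam) + (0 + foul.diff | hometeam),
                 family = binomial, data = refdata)
# The same two error terms, now allowed to be correlated:
m.corr <- glmer(foul.home ~ foul.diff + (1 + foul.diff | hometeam),
                family = binomial, data = refdata)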
The R code below shows how fixef() can be used to extract the estimated fixed effects from a multilevel model. It also shows how ranef() can be used to illustrate estimated random effects by game, home team, and visiting team, along with prediction intervals for those random effects. These estimated random effects are sometimes called Empirical Bayes estimators. In this case, random effects are placed only on the [["(Intercept)"]] term; “Intercept” could be replaced with other Level One covariates whose values are allowed to vary by game, home team, or visiting team in our model.
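A sketch of that code, assuming the fitted model object is named model.f and that prediction intervals are drawn with the lattice dotplot() method for ranef objects:

library(lme4)
library(lattice)
fixef(model.f)                         # estimated fixed effects
re <- ranef(model.f, condVar = TRUE)   # Empirical Bayes estimates
re[["hometeam"]][["(Intercept)"]]      # random intercepts by home team
dotplot(re)[["hometeam"]]              # intervals for home team effects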
11.10 Exercises
TABLE 11.4: Adjusted rate ratios for individual-level variables from the multilevel Poisson regression model with random intercept for area, from Table 2 in Randall et al. [2014].

                  RR       95% CI          p-value
Aboriginal                                 <0.01
  No (ref)        1.00
  Yes             2.10     1.98-2.23
Age Group                                  <0.01
  25-34 (ref)     1.00
  35-44           6.01     5.44-6.64
  45-54           19.36    17.58-21.31
  55-64           40.29    36.67-44.26
  65-74           79.92    72.74-87.80
  75-84           178.75   162.70-196.39
Sex                                        <0.01
  Male (ref)      1.00
  Female          0.45     0.44-0.45
Year                                       <0.01
  2002 (ref)      1.00
  2003            1.00     0.98-1.03
  2004            0.97     0.95-0.99
  2005            0.91     0.89-0.94
  2006            0.88     0.86-0.91
  2007            0.88     0.86-0.91
20. Table 11.4 shows Table 2 from Randall et al. [2014]. Let Yij be the
number of acute myocardial infarctions in subgroup j from SLA i;
write out the multilevel model that likely produced Table 11.4. How
many fixed effects and variance components must be estimated?
21. Provide interpretations in context for the following rate ratios, con-
fidence intervals, and p-values in Table 11.4: RR of 2.1 and CI of
1.98 - 2.23 for Aboriginal = Yes; p-value of <.01 for Age Group; CI
of 5.44 - 6.64 for Age Group = 35-44; RR of 0.45 for Sex = Female;
CI of 0.86 - 0.91 for Year = 2007.
22. Given the rate ratio and 95% confidence interval reported for Abo-
riginal Australians in Table 11.4, find the estimated model fixed
effect for Aboriginal Australians from the multilevel model along
with its standard error.
23. How might the p-value for Age Group have been produced?
TABLE 11.5: Adjusted rate ratios for area-level variables from the multilevel Poisson regression model with random intercept for area, from Table 3 in Randall et al. [2014]. Because the area-level factors were highly associated with one another, each was added one at a time to the fully adjusted individual-level model (adjusted for Aboriginal status, age, sex, and year).

                               RR      95% CI       p-value
Remoteness of Residence                             <0.01
  Major City (ref)             1.00
  Inner Regional               1.16    1.04-1.28
  Outer Regional               1.11    1.01-1.23
  Remote/very remote           1.22    1.02-1.45
SES quintile                                        <0.01
  1 least disadvantaged (ref)  1.00
  2                            1.26    1.11-1.43
  3                            1.40    1.24-1.58
  4                            1.46    1.30-1.64
  5 most disadvantaged         1.70    1.52-1.91
24. Randall et al. [2014] report that, “we identified a significant inter-
action between Aboriginal status and age group (p < 0.01) and
Aboriginal status and sex (p < 0.01), but there was no significant in-
teraction between Aboriginal status and year (p=0.94).” How would
the multilevel model associated with Table 11.4 need to have been
adjusted to allow these interactions to be tested?
25. Table 11.5 shows Table 3 from Randall et al. [2014]. Describe the
changes to the multilevel model for Table 11.4 that likely produced
this new table.
26. Provide interpretations in context for the following rate ratios, con-
fidence intervals, and p-values in Table 11.5: p-value of <.01 for
Remoteness of Residence; RR of 1.22 for Remote/very remote; CI
of 1.52 - 1.91 for SES quintile = 5.
27. Randall et al. [2014] also report the results of a single-level Poisson
regression model: “After adjusting for age, sex and year of event,
the rate of AMI events in Aboriginal people was 2.3 times higher
than in non-Aboriginal people (95% CI: 2.17-2.44).” Compare this
to the results of the multilevel Poisson model; what might explain
any observed differences?
28. Randall et al. [2014] claim that, “our application of multilevel mod-
elling techniques allowed us to account for clustering by area of residence.”
Bibliography
Trump’s voters in the 2016 Election. St. Olaf College. Statistics 316 Project,
2018.
H. Jane Brockmann. Satellite male groups in horseshoe crabs, limulus
polyphemus. Ethology, 102(1):1–21, 1996. URL http://dx.doi.org/doi:
10.1111/j.1439-0310.1996.tb01099.x.
Kenneth Brown and Bulent Uyar. A hierarchical linear model approach
for assessing the effects of house and neighborhood characteristics on
housing prices. Journal of Real Estate Practice and Education, 7(1):15–
24, 2004. URL http://aresjournals.org/doi/abs/10.5555/repe.7.1.
f687057161743261.
Richard Buddin and Ron Zimmer. Student achievement in charter schools:
A complex picture. Journal of Policy Analysis and Management, 24(2):
351–371, 2005. URL http://dx.doi.org/10.1002/pam.20093.
Bureau of Labor Statistics. National Longitudinal Surveys, 1997. URL https:
//www.bls.gov/nls/nlsy97.htm.
A.C. Cameron and P.K. Trivedi. Econometric models based on count data:
Comparisons and applications of some estimators and tests. Journal of
Applied Econometrics, 1:29–53, 1986.
Philip Camill, Mark J. McKone, Sean T. Sturges, William J. Severud, Erin
Ellis, Jacob Limmer, Christopher B. Martin, Ryan T. Navratil, Amy J.
Purdie, Brody S. Sandel, Shano Talukder, and Andrew Trout. Community-
and ecosystem-level changes in a species-rich tallgrass prairie restoration.
Ecological Applications, 14(6):1680–1694, 2004. doi: 10.1890/03-5273.
Ann Cannon, George Cobb, Brad Hartlaub, Julie Legler, Robin Lock, Tom
Moore, Allan Rossman, and Jeff Witmer. Stat2: Modeling with Regression
and ANOVA. Macmillan, 2019.
Centers for Disease Control and Prevention. Youth Risk Behavior Survey data,
2009. URL http://www.cdc.gov/HealthyYouth/yrbs/index.htm.
Central Intelligence Agency. The World Factbook 2013, 2013. URL
https://www.cia.gov/library/publications/download/download-
2013/index.html.
Christopher Chapp, Paul Roback, Kendra Jo Johnson-Tesch, Adrian Rossing,
and Jack Werner. Going vague: Ambiguity and avoidance in online political
messaging. Social Science Computer Review, Aug 2018. URL https:
//doi.org/10.1177/0894439318791168.
Patrick J. Curran, Eric Stice, and Laurie Chassin. The relation between ado-
lescent alcohol use and peer alcohol use: A longitudinal random coefficients
model. Journal of Consulting and Clinical Psychology, 65(1):130–140, 1997.
URL http://dx.doi.org/10.1037/0022-006X.65.1.130.
Samantha Dahlquist and Jin Dong. The effects of credit cards on tipping. St.
Olaf College. Statistics 272 Project, 2011.
A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application.
Cambridge University Press, 1997.
P. J. Diggle, P. Heagerty, K.-Y. Liang, and S. L. Zeger. Analysis of Longitudinal
Data. Oxford University Press, 2002.
Bradley Efron. Bayesian inference and the parametric bootstrap. Ann. Appl.
Stat., 6(4):1971–1997, Dec 2012. URL https://doi.org/10.1214/12-
AOAS571.
Bradley Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman
& Hall/CRC, Boca Raton, FL, 1993.
Robert Eisinger, Amanda Elling, and J.R. Stamp. Tree growth rates and
mortality. In Proceedings of the National Conference on Undergraduate
Research (NCUR), Ithaca College, New York, 2011.
Brian S. Everitt and Torsten Hothorn. A Handbook of Statistical Analyses
using R. Chapman & Hall/ CRC, Boca Raton, FL, 2006.
Julian Faraway. Extending the Linear Model With R: Generalized Linear,
Mixed Effects and Nonparametric Regression Models. Chapman & Hall/
CRC, Boca Raton, FL, 2005.
Anthony Farrar and Thomas H. Bruggink. A new test of the moneyball
hypothesis. The Sport Journal, May 2011. URL http://thesportjournal.
org/article/a-new-test-of-the-moneyball-hypothesis/.
Shannon Fast and Thomas Hegland. Book challenges: A statistical examination.
St. Olaf College. Statistics 316 Project, 2011.
Kara Finnigan, Nancy Adelman, Lee Anderson, Lynyonne Cotton, Mary Beth
Donnelly, and Tiffany Price. Evaluation of Public Charter Schools Program:
Final Evaluation Report. U.S. Department of Education, Washington, D.C.,
2004.
Lisa Fisher, Katie Murney, and Tyler Radtke. Emergency department over-
crowding and factors that contribute to ambulance diversion. St. Olaf
College. Statistics 316 Project, 2019.
Richard A. Friedman. Standing up at your desk could make you smarter. The
New York Times, Apr 20 2018.
Andrew Gelman, Jeffrey Fagan, and Alex Kiss. An analysis of the NYPD’s
stop-and-frisk policy in the context of claims of racial bias. Journal of The
American Statistical Association, 102:813–823, Sept 2007.
Thomas Gilovich, Robert Vallone, and Amos Tversky. The hot hand in basketball:
On the misperception of random sequences. Cognitive Psychology, 17(3):295–314, 1985.
Jan Komdeur, Serge Daan, Joost Tinbergen, and Christa Mateman. Extreme
adaptive modification in sex ratio of the Seychelles warbler’s eggs. Nature,
385:522–525, Feb 1997. URL http://dx.doi.org/10.1038/385522a0.
Nan M. Laird. Missing data in longitudinal studies. Statistics in Medicine, 7
(1-2):305–315, 1988. URL http://dx.doi.org/10.1002/sim.4780070131.
Michael M. Lewis. Moneyball: The Art of Winning an Unfair Game. W. W.
Norton & Company, 2003.
M Martinsen, S Bratland-Sanda, A K Eriksson, and J Sundgot-Borgen. Dieting
to win or to be thin? A study of dieting and disordered eating among
adolescent elite athletes and non-athlete controls. British Journal of Sports
Medicine, 44(1):70–76, 2009. URL http://bjsm.bmj.com/content/44/1/
70.
T.J. Mathews and Brady E. Hamilton. Trend analysis of the sex ratio at birth
in the United States. National Vital Statistics Reports, 53(20):1–20, 06 2005.
URL https://www.cdc.gov/nchs/data/nvsr/nvsr53/nvsr53_20.pdf.
Peter McCullagh and John Ashworth Nelder. Generalized Linear Models.
Chapman & Hall/ CRC, Boca Raton, Florida, 2nd edition, 1989.
Minnesota Department of Education. Minnesota Department of Education
data center, 2018. URL https://education.mn.gov/MDE/Data/.
Tobias J. Moskowitz and L. Jon Wertheim. Scorecasting: The Hidden Influences
Behind How Sports Are Played and Games Are Won. Crown Archetype,
New York, 2011.
Per Nafstad, Jorgen A. Hagen, Leif Oie, Per Magnus, and Jouni J. K. Jaakkola.
Day care centers and respiratory health. Pediatrics, 103(4):753–758, 1999.
URL http://pediatrics.aappublications.org/content/103/4/753.
National Center for Education Statistics. The Integrated Postsecondary Edu-
cation Data System, 2018. URL https://nces.ed.gov/ipeds/.
John Ashworth Nelder and Robert William Maclagan Wedderburn. Generalized
linear models. Journal of the Royal Statistical Society. Series A (General),
135(3):370–384, 1972. URL http://www.jstor.org/stable/2344614.
Cecilia A. Noecker and Paul Roback. New insights on the tendency of NCAA
basketball officials to even out foul calls. Journal of Quantitative Analysis
in Sports, 8(3):1–23, Oct 2012.
Philippine Statistics Authority. Family income and expenditure sur-
vey, 2015. URL https://www.kaggle.com/grosvenpaul/family-income-
and-expenditure.
Joyce H. Poole. Mate guarding, reproductive success and female choice in
African elephants. Animal Behaviour, 37:842–849, 1989. URL http://www.
sciencedirect.com/science/article/pii/0003347289900687.
Ian Pray. Effects of rainfall and sun exposure on leaf characteristics. St. Olaf
College. Bio in South India Project, 2009.
J Proudfoot, D Goldberg, A Mann, B Everitt, I Marks, and J A Gray. Com-
puterized, interactive, multimedia cognitive-behavioural program for anxiety
and depression in general practice. Psychological Medicine, 33(2):217–27,
Feb 2003. doi: 10.1017/s0033291702007225.
R Core Team. R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, 2020. URL
https://www.R-project.org.
Fred Ramsey and Daniel Schafer. The Statistical Sleuth: A course in methods
of data analysis. Brooks/Cole Cengage, Boston, Massachusetts, 2nd edition,
2002.
D.A. Randall, L.R. Jorm, S. Lujic, S.J. Eades, T.R. Churches, A.J. O’Loughlin,
and A.H. Leyland. Exploring disparities in acute myocardial infarction events
between Aboriginal and non-Aboriginal Australians: Roles of age, gender,
geography and area-level disadvantage. Health & Place, 28:58–66, 2014.
ISSN 1353-8292. doi: https://doi.org/10.1016/j.healthplace.2014.03.009.
Stephen W. Raudenbush and Anthony S. Bryk. Hierarchical Linear Models:
Applications and Data Analysis Methods. SAGE Publications, Inc., Thousand
Oaks, CA, 2nd edition, 2002.
Joseph Lee Rodgers and Debby Doughty. Does having boys or girls run in the
family? CHANCE, 14(4):8–13, 2001. URL http://dx.doi.org/10.1080/
09332480.2001.10542293.
Marieke Roskes, Daniel Sligte, Shaul Shalvi, and Carsten K. W. De Dreu.
The right side? Under time pressure, approach motivation leads to right-
oriented bias. Psychological Science, 22(11):1403–1407, 2011. URL https:
//doi.org/10.1177/0956797611418677.
Michael E. Sadler and Christopher J. Miller. Performance anxiety: A lon-
gitudinal study of the roles of personality and experience in musicians.
Social Psychological and Personality Science, 1(3):280–287, 2010. URL
http://dx.doi.org/10.1177/1948550610370492.
Robert J. Sampson, Stephen W. Raudenbush, and Felton Earls. Neighborhoods
and violent crime: A multilevel study of collective efficacy. Science, 277
(5328):918–924, 1997. ISSN 0036-8075. doi: 10.1126/science.277.5328.918.
Joseph Scotto, Alfred W. Kopf, and Fredrick Urbach. Non-melanoma skin can-
cer among Caucasians in four areas of the United States. Cancer, 34(4):1333–
1338, Oct 1974. URL https://doi.org/10.1002/1097-0142(197410)34:
4<1333::AID-CNCR2820340447>3.0.CO;2-A.
Prabha Siddarth, Alison C. Burggren, Harris A. Eyre, Gary W. Small, and
David A. Merrill. Sedentary behavior associated with reduced medial temporal
lobe thickness in middle-aged and older adults. PLoS ONE, 13(4), 2018.
Index
quasi-Poisson, 121
quasibinomial, 168, 198
quasilikelihood, 121, 168
R-squared, 16, 24
random effects, 196, 227
random intercepts model, 234, 351
random slopes and intercepts model,
236
relative risk (rate ratio), 103
residual, 13
residual deviance, 112, 166
restricted maximum likelihood
(REML), 229
t-distribution, 87