Regression
Modeling
Strategies
With Applications to Linear Models,
Logistic and Ordinal Regression,
and Survival Analysis
Second Edition
Springer Series in Statistics
Advisors:
P. Bickel, P. Diggle, S.E. Fienberg, U. Gather,
I. Olkin, S. Zeger
Frank E. Harrell, Jr.
Department of Biostatistics
School of Medicine
Vanderbilt University
Nashville, TN, USA
There are many books that are excellent sources of knowledge about
individual statistical tools (survival models, general linear models, etc.), but
the art of data analysis is about choosing and using multiple tools. In the
words of Chatfield [100, p. 420] “. . . students typically know the technical de-
tails of regression for example, but not necessarily when and how to apply it.
This argues the need for a better balance in the literature and in statistical
teaching between techniques and problem solving strategies.” Whether ana-
lyzing risk factors, adjusting for biases in observational studies, or developing
predictive models, there are common problems that few regression texts ad-
dress. For example, there are missing data in the majority of datasets one is
likely to encounter (other than those used in textbooks!) but most regression
texts do not include methods for dealing with such data effectively, and most
texts on missing data do not cover regression modeling.
This book links standard regression modeling approaches with
• methods for relaxing linearity assumptions that still allow one to easily
obtain predictions and confidence limits for future observations, and to do
formal hypothesis tests,
• non-additive modeling approaches not requiring the assumption that
interactions are always linear × linear,
• methods for imputing missing data and for penalizing variances for incom-
plete data,
• methods for handling large numbers of predictors without resorting to
problematic stepwise variable selection techniques,
• data reduction methods (unsupervised learning methods, some of which
are based on multivariate psychometric techniques too seldom used in
statistics) that help with the problem of “too many variables to analyze and
not enough observations” as well as making the model more interpretable
when there are predictor variables containing overlapping information,
• methods for quantifying predictive accuracy of a fitted model,
• powerful model validation techniques based on the bootstrap that allow the
analyst to estimate predictive accuracy nearly unbiasedly without holding
back data from the model development process, and
• graphical methods for understanding complex models.
On the last point, this text has special emphasis on what could be called
“presentation graphics for fitted models” to help make regression analyses
more palatable to non-statisticians. For example, nomograms have long been
used to make equations portable, but they are not drawn routinely because
doing so is very labor-intensive. An R function called nomogram in the package
described below draws nomograms from a regression fit, and these diagrams
can be used to communicate modeling results as well as to obtain predicted
values manually even in the presence of complex variable transformations.
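To make this concrete, the following is a minimal sketch, using simulated data and hypothetical variable names, of drawing a nomogram from a binary logistic fit with rms; fun = plogis converts the linear predictor axis to a probability scale.

```r
library(rms)

# Simulated data purely for illustration
set.seed(1)
d <- data.frame(age = rnorm(200, 50, 10),
                sex = factor(sample(c("female", "male"), 200, replace = TRUE)))
d$y <- rbinom(200, 1, plogis(-4 + 0.07 * d$age))

dd <- datadist(d); options(datadist = "dd")  # rms keeps predictor ranges here
f <- lrm(y ~ rcs(age, 4) + sex, data = d)    # spline in age (see Chapter 2)
plot(nomogram(f, fun = plogis, funlabel = "Probability"))
```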
Most of the methods in this text apply to all regression models, but special
emphasis is given to some of the most popular ones: multiple regression using
least squares and its generalized least squares extension for serial (repeated
measurement) data, the binary logistic model, models for ordinal responses,
parametric survival regression models, and the Cox semiparametric survival
model. There is also a chapter on nonparametric transform-both-sides regres-
sion. Emphasis is given to detailed case studies for these methods as well as
for data reduction, imputation, model simplification, and other tasks. Ex-
cept for the case study on survival of Titanic passengers, all examples are
from biomedical research. However, the methods presented here have broad
application to other areas including economics, epidemiology, sociology, psy-
chology, engineering, and predicting consumer behavior and other business
outcomes.
This text is intended for Masters or PhD level graduate students who
have had a general introductory probability and statistics course and who
are well versed in ordinary multiple regression and intermediate algebra. The
book is also intended to serve as a reference for data analysts and statistical
methodologists. Readers without a strong background in applied statistics
may wish to first study one of the many introductory applied statistics and
regression texts that are available. The author’s course notes Biostatistics
for Biomedical Research on the text’s web site cover basic regression and
many other topics. The paper by Nick and Hardin [476] also provides a good
introduction to multivariable modeling and interpretation. There are many
excellent intermediate level texts on regression analysis. One of them is by
Fox, who also has a companion software-based text [200, 201]. For readers
interested in medical or epidemiologic research, Steyerberg’s excellent text
Clinical Prediction Models [586] is an ideal companion for Regression Modeling
Strategies. Steyerberg’s book provides further explanations, examples, and
simulations of many of the methods presented here. And no text on regression
modeling should fail to mention the seminal work of John Nelder [450].
The overall philosophy of this book is summarized by the following state-
ments.
concrete. At the very least, the code demonstrates that all of the methods
presented in the text are feasible.
This text does not teach analysts how to use R. For that, the reader may
wish to see reading recommendations on www.r-project.org as well as Venables
and Ripley [635] (which is also an excellent companion to this text) and the
many other excellent texts on R. See the Appendix for more information.
In addition to powerful features that are built into R, this text uses a
package of freely available R functions called rms written by the author. rms
tracks modeling details related to the expanded X or design matrix. It is a
series of over 200 functions for model fitting, testing, estimation, validation,
graphics, prediction, and typesetting by storing enhanced model design at-
tributes in the fit. rms includes functions for least squares and penalized least
squares multiple regression modeling in addition to functions for binary and
ordinal regression, generalized least squares for analyzing serial data, quan-
tile regression, and survival analysis that are emphasized in this text. Other
freely available miscellaneous R functions used in the text are found in the
Hmisc package also written by the author. Functions in Hmisc include facilities
for data reduction, imputation, power and sample size calculation, advanced
table making, recoding variables, importing and inspecting data, and general
graphics. Consult the Appendix for information on obtaining Hmisc and rms.
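Both packages are available from CRAN. A minimal sketch of obtaining and loading them (the commented calls illustrate two of the Hmisc facilities just listed, applied to a hypothetical data frame d):

```r
install.packages(c("Hmisc", "rms"))  # rms loads Hmisc automatically
library(rms)

# describe(d)                   # inspect variables and missingness
# d$x <- impute(d$x, "random")  # simple single imputation of d$x
```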
The author and his colleagues have written SAS macros for fitting re-
stricted cubic splines and for other basic operations. See the Appendix for
more information. It is unfair not to mention some excellent capabilities of
other statistical packages such as Stata (which has also been extended to
provide regression splines and other modeling tools), but the extendability
and graphics of R makes it especially attractive for all aspects of the compre-
hensive modeling strategy presented in this book.
Portions of Chapters 4 and 20 were published as reference [269]. Some of
Chapter 13 was published as reference [272].
The author may be contacted by electronic mail at f.harrell@vanderbilt.edu
and would appreciate being informed of unclear points, errors, and omissions
in this book. Suggestions for improvements and for future
topics are also welcome. As described in the Web site, instructors may con-
tact the author to obtain copies of quizzes and extra assignments (both with
answers) related to much of the material in the earlier chapters, and to obtain
full solutions (with graphical output) to the majority of assignments in the
text.
Major changes since the first edition include the following:
1. Creation of a now mature R package, rms, that replaces and greatly ex-
tends the Design library used in the first edition
2. Conversion of all of the book’s code to R
3. Conversion of the book source into knitr [677] reproducible documents
4. All code from the text is executable and is on the web site
5. Use of color graphics and use of the ggplot2 graphics package [667]
6. Scanned images were re-drawn
Acknowledgments
A good deal of the writing of the first edition of this book was done during
my 17 years on the faculty of Duke University. I wish to thank my close col-
league Kerry Lee for providing many valuable ideas, fruitful collaborations,
and well-organized lecture notes from which I have greatly benefited over the
past years. Terry Therneau of Mayo Clinic has given me many of his wonderful
ideas for many years, and has written state-of-the-art R software for survival
analysis that forms the core of survival analysis software in my rms package.
Michael Symons of the Department of Biostatistics of the University of North
Carolina at Chapel Hill and Timothy Morgan of the Division of Public Health
Sciences at Wake Forest University School of Medicine also provided course
materials, some of which motivated portions of this text. My former clini-
cal colleagues in the Cardiology Division at Duke University, Robert Califf,
Phillip Harris, Mark Hlatky, Dan Mark, David Pryor, and Robert Rosati,
for many years provided valuable motivation, feedback, and ideas through
our interaction on clinical problems. Besides Kerry Lee, statistical colleagues
L. Richard Smith, Lawrence Muhlbaier, and Elizabeth DeLong clarified my
thinking and gave me new ideas on numerous occasions. Charlotte Nelson
and Carlos Alzola frequently helped me debug S routines when they thought
they were just analyzing data.
Former students Bercedis Peterson, James Herndon, Robert McMahon,
and Yuan-Li Shen have provided many insights into logistic and survival mod-
eling. Associations with Doug Wagner and William Knaus of the University
of Virginia, Ken Offord of Mayo Clinic, David Naftel of the University of Al-
abama in Birmingham, Phil Miller of Washington University, and Phil Good-
man of the University of Nevada Reno have provided many valuable ideas and
motivations for this work, as have Michael Schemper of Vienna University,
Janez Stare of Ljubljana University, Slovenia, Ewout Steyerberg of Erasmus
University, Rotterdam, Karel Moons of Utrecht University, and Drew Levy of
Genentech. Richard Goldstein, along with several anonymous reviewers, pro-
vided many helpful criticisms of a previous version of this manuscript that
resulted in significant improvements, and critical reading by Bob Edson (VA
Cooperative Studies Program, Palo Alto) resulted in many error corrections.
Thanks to Brian Ripley of the University of Oxford for providing many help-
ful software tools and statistical insights that greatly aided in the production
of this book, and to Bill Venables of CSIRO Australia for wisdom, both sta-
tistical and otherwise. This work would also not have been possible without
the S environment developed by Rick Becker, John Chambers, Allan Wilks,
and the R language developed by Ross Ihaka and Robert Gentleman.
Work for the second edition was done in the excellent academic environ-
ment of Vanderbilt University, where biostatistical and biomedical colleagues
and graduate students provided new insights and stimulating discussions.
Thanks to Nick Cox, Durham University, UK, who provided from his careful
reading of the first edition a very large number of improvements and correc-
tions that were incorporated into the second. Four anonymous reviewers of
the second edition also made numerous suggestions that improved the text.
Contents

1 Introduction
  1.1 Hypothesis Testing, Estimation, and Prediction
  1.2 Examples of Uses of Predictive Multivariable Modeling
  1.3 Prediction vs. Classification
  1.4 Planning for Modeling
    1.4.1 Emphasizing Continuous Variables
  1.5 Choice of the Model
  1.6 Further Reading

3 Missing Data
  3.1 Types of Missing Data
  3.2 Prelude to Modeling
  3.3 Missing Values for Different Types of Response Variables
  3.4 Problems with Simple Alternatives to Imputation
  3.5 Strategies for Developing an Imputation Model
  3.6 Single Conditional Mean Imputation
  3.7 Predictive Mean Matching
  3.8 Multiple Imputation
    3.8.1 The aregImpute and Other Chained Equations Approaches
  3.9 Diagnostics
  3.10 Summary and Rough Guidelines
  3.11 Further Reading
  3.12 Problems

6 R Software
  6.1 The R Modeling Language
  6.2 User-Contributed Functions
  6.3 The rms Package
  6.4 Other Functions
  6.5 Further Reading

References

Index
Typographical Conventions
R code blocks are set off with a shadowbox, and R output that is not directly
using LaTeX appears in a box that is framed on three sides.
Chapter 1
Introduction
model [664]. The analysis of variance (multiple group) test and the Kruskal–
Wallis test can easily be obtained from these two regression models by
using more than one dummy predictor variable.
Even without complexities such as repeated measurements, problems can
arise when many hypotheses are to be tested. Testing too many hypotheses
is related to fitting too many predictors in a regression model. One commonly
hears the statement that “the dataset was too small to allow modeling, so we
just did hypothesis tests.” It is unlikely that the resulting inferences would be
reliable. If the sample size is insufficient for modeling it is often insufficient
for tests or estimation. This is especially true when one desires to publish
an estimate of the effect corresponding to the hypothesis yielding the small-
est P -value. Ordinary point estimates are known to be badly biased when
the quantity to be estimated was determined by “data dredging.” This can
be remedied by the same kind of shrinkage used in multivariable modeling
(Section 9.10).
Statistical estimation is usually model-based. For example, one might use a
survival regression model to estimate the relative effect of increasing choles-
terol from 200 to 250 mg/dl on the hazard of death. Variables other than
cholesterol may also be in the regression model, to allow estimation of the
effect of increasing cholesterol, holding other risk factors constant. But ac-
curate estimation of the cholesterol effect will depend on how cholesterol as
well as each of the adjustment variables is assumed to relate to the hazard
of death. If linear relationships are incorrectly assumed, estimates will be
inaccurate. Accurate estimation also depends on avoiding overfitting the ad-
justment variables. If the dataset contains 200 subjects, 30 of whom died, and
if one adjusted for 15 “confounding” variables, the estimates would be “over-
adjusted” for the effects of the 15 variables, as some of their apparent effects
would actually result from spurious associations with the response variable
(time until death). The overadjustment would reduce the cholesterol effect.
The resulting unreliability of estimates equals the degree to which the overall
model fails to validate on an independent sample.
It is often useful to think of effect estimates as differences between two
predicted values from a model. This way, one can account for nonlinearities
and interactions. For example, if cholesterol is represented nonlinearly in a
logistic regression model, predicted values on the “linear combination of X’s
scale” are predicted log odds of an event. The increase in log odds from raising
cholesterol from 200 to 250 mg/dl is the difference in predicted values, where
cholesterol is set to 250 and then to 200, and all other variables are held
constant. The point estimate of the 250:200 mg/dl odds ratio is the anti-log
of this difference. If cholesterol is represented nonlinearly in the model, it
does not matter how many terms in the model involve cholesterol as long as
the overall predicted values are obtained.
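As a hedged sketch (the fit f, and the predictors chol and age, are hypothetical), the 250:200 mg/dl odds ratio is obtained from two predicted log odds no matter how cholesterol is represented in the model:

```r
library(rms)
# f: hypothetical lrm fit containing cholesterol (chol) and, say, age
lp250 <- predict(f, data.frame(chol = 250, age = 60), type = "lp")
lp200 <- predict(f, data.frame(chol = 200, age = 60), type = "lp")
exp(lp250 - lp200)   # point estimate of the 250:200 mg/dl odds ratio

# rms automates the same difference and adds confidence limits:
# contrast(f, list(chol = 250), list(chol = 200))
```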
predict the life expectancy for this patient. By making the same prediction
for a patient on treatment B with the same characteristics, one can estimate
the absolute difference in probabilities or life expectancies. This approach
recognizes that low-risk patients must have less absolute benefit of treatment
(lower change in outcome probability) than high-risk patients [351], a fact
has been ignored in many clinical trials. Another reason for multivariable
modeling in randomized clinical trials is that when the basic response model
is nonlinear (e.g., logistic, Cox, parametric survival models), the unadjusted
estimate of the treatment effect is not correct if there is moderate heterogene-
ity of subjects, even with perfect balance of baseline characteristics across
the treatment groups [9, 24, 198, 588].^a So even when investigators are interested
in simple comparisons of two groups’ responses, multivariable modeling can
be advantageous and sometimes mandatory.
Cost-effectiveness analysis is becoming increasingly used in health care re-
search, and the “effectiveness” (denominator of the cost-effectiveness ratio)
is always a measure of absolute effectiveness. As absolute effectiveness varies
dramatically with the risk profiles of subjects, it must be estimated for indi-
vidual subjects using a multivariable model [90, 344].
^a For example, unadjusted odds ratios from 2 × 2 tables are different from adjusted
odds ratios when there is variation in subjects’ risk factors within each treatment
group, even when the distribution of the risk factors is identical between the two
groups.
(the median) is used. Dichotomizing the outcome at a different point may re-
quire a many-fold increase in sample size to make up for the lost informa-
tion [187]. In the area of medical diagnosis, it is often the case that the disease
is really on a continuum, and predicting the severity of disease (rather than
just its presence or absence) will greatly increase power and precision, not to
mention making the result less arbitrary.
It is important to note that two-group classification represents an artificial
forced choice. It is not often the case that the user of the classifier needs to
be limited to two possible actions. The best option for many subjects may
be to refuse to make a decision or to obtain more data (e.g., order another
medical diagnostic test). A gray zone can be helpful, and predictions include
gray zones automatically.
Unlike prediction (e.g., of absolute risk), classification implicitly uses util-
ity functions (also called loss or cost functions, e.g., cost of a false positive
classification). Implicit utility functions are highly problematic. First, it is
well known that the utility function depends on variables that are not pre-
dictive of outcome and are not collected (e.g., subjects’ preferences), which
are available only at the decision point. Second, the approach assumes every
subject has the same utility function^b. Third, the analyst presumptuously
assumes that the subject’s utility coincides with his own.
Formal decision analysis uses subject-specific utilities and optimum predic-
tions based on all available data [62, 74, 183, 210, 219, 642].^c It follows that receiver
^b Simple examples to the contrary are the less weight given to a false negative diagno-
sis of cancer in the elderly and the aversion of some subjects to surgery or chemother-
apy.
^c To make an optimal decision you need to know all relevant data about an individual
(used to estimate the probability of an outcome), and the utility (cost, loss function)
of making each decision. Sensitivity and specificity do not provide this information.
For example, if one estimated that the probability of a disease given age, sex, and
symptoms is 0.1 and the “cost” of a false positive equaled the “cost” of a false negative,
one would act as if the person does not have the disease. Given other utilities, one
would make different decisions. If the utilities are unknown, one gives the best estimate
of the probability of the outcome to the decision maker and lets her incorporate her
own unspoken utilities in making an optimum decision for her.
Besides the fact that cutoffs that are not individualized do not apply to individuals,
only to groups, individual decision making does not utilize sensitivity and specificity.
For an individual we can compute Prob(Y = 1|X = x); we don’t care about Prob(Y =
1|X > c), and an individual having X = x would be quite puzzled if she were given
Prob(X > c|future unknown Y) when she already knows X = x so X is no longer a
random variable.
Even when group decision making is needed, sensitivity and specificity can be
bypassed. For mass marketing, for example, one can rank order individuals by the
estimated probability of buying the product, to create a lift curve. This is then used
to target the k most likely buyers where k is chosen to meet total program cost
constraints.
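A minimal sketch of this targeting, where p is a hypothetical vector of estimated purchase probabilities and k is set by the program budget:

```r
# p: hypothetical vector of estimated probabilities of buying, one per person
k <- 1000                                   # budget allows contacting 1000
target <- order(p, decreasing = TRUE)[1:k]  # the k most likely buyers
```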
instruments containing the necessary variables, and all terms can be given
standard definitions (for both descriptive and response variables) for use at
all data collection sites. Also, steps can be taken to minimize the amount of
missing data.
In the context of describing and modeling health outcomes, Iezzoni [317] has
an excellent discussion of the dimensions of risk that should be captured by
variables included in the model. She lists these general areas that should be
quantified by predictor variables:
1. age,
2. sex,
3. acute clinical stability,
4. principal diagnosis,
5. severity of principal diagnosis,
6. extent and severity of comorbidities,
7. physical functional status,
8. psychological, cognitive, and psychosocial functioning,
9. cultural, ethnic, and socioeconomic attributes and behaviors,
10. health status and quality of life, and
11. patient attitudes and preferences for outcomes.
Given that the choice of the model has been made (e.g., a log-normal
model), penalized maximum likelihood estimation has major advantages in
the battle between making the model fit adequately and avoiding overfitting
(Sections 9.10 and 13.4.7). Penalization lessens the need for model selection.
1 Briggs and Zaretzki [74] eloquently state the problem with ROC curves and the
areas under them (AUC):
Statistics such as the AUC are not especially relevant to someone who
must make a decision about a particular xc . . . . ROC curves lack or ob-
scure several quantities that are necessary for evaluating the operational
effectiveness of diagnostic tests. . . . ROC curves were first used to check
how radio receivers (like radar receivers) operated over a range of fre-
quencies. . . . This is not how most ROC curves are used now, particularly
in medicine. The receiver of a diagnostic measurement . . . wants to make
a decision based on some xc , and is not especially interested in how well
he would have done had he used some different cutoff.
Because it equals the concordance probability in the binary Y case, the AUC
is still often useful as a predictive discrimination measure.
2 More severe problems caused by dichotomizing continuous variables are dis-
cussed in [13, 17, 45, 82, 185, 294, 379, 521, 597].
3 See the excellent editorial by Mallows [434] for more about model choice. See
Breiman and discussants [67] for an interesting debate about the use of data
models vs. algorithms. This material also covers interpretability vs. predictive
accuracy and several other topics.
4 See [15, 80, 100, 163, 186, 415] for information about accounting for model selec-
tion in making final inferences. Faraway [186] demonstrated that the bootstrap
has good potential in related although somewhat simpler settings, and Buck-
land et al. [80] developed a promising bootstrap weighting method for accounting
for model uncertainty.
5 Tibshirani and Knight [611] developed another approach to estimating the gener-
alized degrees of freedom. Luo et al. [430] developed a way to add noise of known
variance to the response variable to tune the stopping rule used for variable
selection. Zou et al. [689] showed that the lasso, an approach that simultaneously
selects variables and shrinks coefficients, has a nice property. Since it uses pe-
nalization (shrinkage), an unbiased estimate of its effective number of degrees
of freedom is the number of nonzero regression coefficients in the final model.
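As an illustration of that property, here is a sketch using the glmnet package (an assumed choice; it is not a package used elsewhere in this text), counting nonzero slopes at a given penalty:

```r
library(glmnet)

# x: hypothetical numeric predictor matrix; y: hypothetical response vector
fit <- glmnet(x, y)          # lasso path over a sequence of penalties
b   <- coef(fit, s = 0.1)    # coefficients at penalty lambda = 0.1
sum(b[-1] != 0)              # estimated effective d.f.: nonzero slopes
```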
Chapter 2
General Aspects of Fitting
Regression Models
The ordinary multiple linear regression model is frequently used and has
parameters that are easily interpreted. In this chapter we study a general
class of regression models, those stated in terms of a weighted sum of a set
of independent or predictor variables. It is shown that after linearizing the
model with respect to the predictor variables, the parameters in such re-
gression models are also readily interpreted. Also, all the designs used in
ordinary linear regression can be used in this general setting. These designs
include analysis of variance (ANOVA) setups, interaction effects, and nonlin-
ear effects. Besides describing and interpreting general regression models, this
chapter also describes, in general terms, how the three types of assumptions
of regression models can be examined.
First we introduce notation for regression models. Let Y denote the re-
sponse (dependent) variable, and let X = X1 , X2 , . . . , Xp denote a list or
vector of predictor variables (also called covariables or independent, descrip-
tor, or concomitant variables). These predictor variables are assumed to be
constants for a given individual or subject from the population of interest.
Let β = β0 , β1 , . . . , βp denote the list of regression coefficients (parameters).
β0 is an optional intercept parameter, and β1 , . . . , βp are weights or regression
coefficients corresponding to X1 , . . . , Xp . We use matrix or vector notation
to describe a weighted sum of the Xs:
Xβ = β0 + β1X1 + · · · + βpXp. (2.1)
We use C(Y|X) to denote a property of the distribution of Y given X. For
example, C(Y|X) could be E(Y|X), the expected value or average of Y given
X, or C(Y|X) could be the probability that Y = 1 given X (where Y = 0 or 1).
The ordinary linear regression model is C(Y|X) = E(Y|X) = Xβ,
and given X, Y has a normal distribution with mean Xβ and constant vari-
ance σ². The binary logistic regression model [129, 647] is
Prob{Y = 1|X} = (1 + exp(−Xβ))⁻¹,
where Y can take on the values 0 and 1. In general the model, when
stated in terms of the property C(Y|X), may not be linear in Xβ; that
is, C(Y|X) = g(Xβ), where g(u) is nonlinear in u. For example, a regression
model could be E(Y|X) = (Xβ)^0.5. The model may be made linear in the
unknown parameters by a transformation in the property C(Y|X):
C′(Y|X) = g⁻¹(C(Y|X)) = Xβ.
In the original model, C(Y |X) specifies the way in which X affects a property
of Y . Except in the ordinary linear regression model, it is difficult to interpret
the individual parameters if the model is stated in terms of C(Y |X). In the
model C ′ (Y |X) = Xβ = β0 + β1 X1 + . . . + βp Xp , the regression parameter
βj is interpreted as the change in the property C ′ of Y per unit change in
the descriptor variable Xj, all other descriptors remaining constant^a:
βj = C′(Y|X1, X2, . . . , Xj + 1, . . . , Xp) − C′(Y|X1, X2, . . . , Xj, . . . , Xp). (2.10)
In the ordinary linear regression model, for example, βj is the change in
expected value of Y per unit change in Xj . In the logistic regression model
βj is the change in log odds that Y = 1 per unit change in Xj . When a
non-interacting Xj is a dichotomous variable or a continuous one that is
linearly related to C ′ , Xj is represented by a single term in the model and
its contribution is described fully by βj .
In all that follows, we drop the ′ from C ′ and assume that C(Y |X) is the
property of Y that is linearly related to the weighted sum of the Xs.
^a Note that it is not necessary to “hold constant” all other variables to be able to
interpret the effect of one predictor. It is sufficient to hold constant the weighted sum
of all the variables other than Xj. And in many cases it is not physically possible to
hold other variables constant while varying one, e.g., when a model contains X and
X² (David Hoaglin, personal communication).
Suppose that we wish to model the effect of two or more treatments and be
able to test for differences between the treatments in some property of Y .
A nominal or polytomous factor such as treatment group having k levels, in
which there is no definite ordering of categories, is fully described by a series of
k −1 binary indicator variables (sometimes called dummy variables). Suppose
that there are four treatments, J, K, L, and M , and the treatment factor is
denoted by T . The model can be written as
C(Y |T = J) = β0
C(Y |T = K) = β0 + β1 (2.11)
C(Y |T = L) = β0 + β2
C(Y |T = M ) = β0 + β3 .
The four treatments are thus completely specified by three regression param-
eters and one intercept that we are using to denote treatment J, the reference
treatment. This model can be written in the previous notation as
C(Y |T ) = Xβ = β0 + β1 X1 + β2 X2 + β3 X3 , (2.12)
where
X1 = 1 if T = K, 0 otherwise
X2 = 1 if T = L, 0 otherwise (2.13)
X3 = 1 if T = M, 0 otherwise.
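R constructs these k − 1 indicators automatically when a factor enters a model formula; a small sketch (the variable treat plays the role of T):

```r
treat <- factor(c("J", "K", "L", "M"))  # four treatments; J is the reference
model.matrix(~ treat)
# columns (Intercept), treatK, treatL, treatM correspond to the intercept
# and X1, X2, X3 of Equations 2.12-2.13
```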
2.3.2 Interactions
Suppose that a model has descriptor variables X1 and X2 and that the effect
of the two Xs cannot be separated; that is the effect of X1 on Y depends on
the level of X2 and vice versa. One simple way to describe this interaction is
to add the constructed variable X3 = X1X2 to the model:
C(Y|X) = β0 + β1X1 + β2X2 + β3X3.
For example, with age and sex as predictors one might fit the model
C(Y) = β0 + β1 × age + β2[sex = f] + β3 × age × [sex = f],
where [sex = f] is a 0–1 indicator variable for sex = female; the reference cell
is sex = male corresponding to a zero value of the indicator variable. This is
a model that assumes
1. age is linearly related to C(Y ) for males,
2. age is linearly related to C(Y ) for females, and
3. whatever distribution, variance, and independence assumptions are appro-
priate for the model being considered.
We are thus assuming that the interaction between age and sex is simple;
that is it only alters the slope of the age effect. The parameters in the model
have interpretations shown in Table 2.1. β3 is the difference in slopes (female
– male).
There are many useful hypotheses that can be tested for this model. First
let’s consider two hypotheses that are seldom appropriate although they are
routinely tested.
1. H0 : β1 = 0: This tests whether age is associated with Y for males.
2. H0 : β2 = 0: This tests whether sex is associated with Y for zero-year olds.
Now consider more useful hypotheses. For each hypothesis we should write
what is being tested, translate this to tests in terms of parameters, write the
alternative hypothesis, and describe what the test has maximum power to
detect. The latter component of a hypothesis test needs to be emphasized, as
almost every statistical test is focused on one specific pattern to detect. For
example, a test of association against an alternative hypothesis that a slope
is nonzero will have maximum power when the true association is linear.
If the true regression model is exponential in X, a linear regression test
will have some power to detect “non-flatness” but it will not be as powerful
as the test from a well-specified exponential regression effect. If the true
effect is U-shaped, a test of association based on a linear model will have
almost no power to detect association. If one tests for association against
a quadratic (parabolic) alternative, the test will have some power to detect
a logarithmic shape but it will have very little power to detect a cyclical
trend having multiple “humps.” In a quadratic regression model, a test of
linearity against a quadratic alternative hypothesis will have reasonable power
to detect a quadratic nonlinear effect but very limited power to detect a
multiphase cyclical trend. Therefore in the tests in Table 2.2 keep in mind
that power is maximal when linearity of the age relationship holds for both
sexes. In fact it may be useful to write alternative hypotheses as, for example,
“Ha : age is associated with C(Y ), powered to detect a linear relationship.”
Note that if there is an interaction effect, we know that there is both an
age and a sex effect. However, there can also be age or sex effects when the
lines are parallel. That’s why the tests of total association have 2 d.f.
Table 2.2 Most Useful Tests for Linear Age × Sex Model

Null or Alternative Hypothesis                              Mathematical Statement
Effect of age is independent of sex, or                     H0: β3 = 0
  effect of sex is independent of age, or
  age and sex are additive, or
  age effects are parallel
Age interacts with sex;                                     Ha: β3 ≠ 0
  age modifies effect of sex;
  sex modifies effect of age;
  sex and age are non-additive (synergistic)
Age is not associated with Y                                H0: β1 = β3 = 0
Age is associated with Y;                                   Ha: β1 ≠ 0 or β3 ≠ 0
  age is associated with Y for either females or males
Sex is not associated with Y                                H0: β2 = β3 = 0
Sex is associated with Y;                                   Ha: β2 ≠ 0 or β3 ≠ 0
  sex is associated with Y for some value of age
Neither age nor sex is associated with Y                    H0: β1 = β2 = β3 = 0
Either age or sex is associated with Y                      Ha: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0
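A sketch of fitting this model and obtaining the tests of Table 2.2 with rms, assuming a hypothetical data frame d containing y, age, and sex; anova pools the interaction d.f. into the tests of total association:

```r
library(rms)
dd <- datadist(d); options(datadist = "dd")

f <- ols(y ~ age * sex, data = d)  # least squares fit of the age x sex model
anova(f)  # age and sex rows each have 2 d.f.; age * sex has 1 d.f. (H0: beta3 = 0)
```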
pressure ≤ 160 mmHg. The interpretation of the resulting odds ratio will depend
on the exact distribution of blood pressures in the sample (the proportion of
subjects > 170, > 180, etc.). On the other hand, if blood pressure is modeled as
a continuous variable (e.g., using a regression spline, quadratic, or linear effect)
one can estimate the ratio of odds for exact settings of the predictor, e.g., the
odds ratio for 200 mmHg compared with 120 mmHg.
6. Categorization does not condition on full information. When, for example, the
risk of stroke is being assessed for a new subject with a known blood pressure
(say 162 mmHg), the subject does not report to her physician “my blood pressure
exceeds 160” but rather reports 162 mmHg. The risk for this subject will be much
lower than that of a subject with a blood pressure of 200 mmHg.
7. If cutpoints are determined in a way that is not blinded to the response vari-
able, calculation of P -values and confidence intervals requires special simulation
techniques; ordinary inferential methods are completely invalid. For example, if
cutpoints are chosen by trial and error in a way that utilizes the response, even
informally, ordinary P -values will be too small and confidence intervals will not
have the claimed coverage probabilities. The correct Monte-Carlo simulations
must take into account both multiplicities and uncertainty in the choice of cut-
points. For example, if a cutpoint is chosen that minimizes the P -value and the
resulting P-value is 0.05, the true type I error can easily be above 0.5 [300].
8. Likewise, categorization that is not blinded to the response variable results in
biased effect estimates [17, 559].
9. “Optimal” cutpoints do not replicate over studies. Hollander et al. [300] state that
“. . . the optimal cutpoint approach has disadvantages. One of these is that in al-
most every study where this method is applied, another cutpoint will emerge.
This makes comparisons across studies extremely difficult or even impossible.
Altman et al. point out this problem for studies of the prognostic relevance of the
S-phase fraction in breast cancer published in the literature. They identified 19
different cutpoints used in the literature; some of them were solely used because
they emerged as the ‘optimal’ cutpoint in a specific data set. In a meta-analysis on
the relationship between cathepsin-D content and disease-free survival in node-
negative breast cancer patients, 12 studies were included with 12 different
cutpoints . . . Interestingly, neither cathepsin-D nor the S-phase fraction are rec-
ommended to be used as prognostic markers in breast cancer in the recent update
of the American Society of Clinical Oncology.” Giannoni et al. [215] demonstrated
that many claimed “optimal cutpoints” are just the observed median values in the
sample, which happen to optimize statistical power for detecting a separation in
outcomes and have nothing to do with true outcome thresholds. Disagreements
in cutpoints (which are bound to happen whenever one searches for things that
do not exist) cause severe interpretation problems. One study may provide an
odds ratio for comparing body mass index (BMI) > 30 with BMI ≤ 30, another
for comparing BMI > 28 with BMI ≤ 28. Neither of these odds ratios has a good
definition and the two estimates are not comparable.
10. Cutpoints are arbitrary and manipulatable; cutpoints can be found that can result
in both positive and negative associations [646].
11. If a confounder is adjusted for by categorization, there will be residual confound-
ing that can be explained away by inclusion of the continuous form of the predictor
in the model in addition to the categories.
When cutpoints are chosen using Y , categorization represents one of those
few times in statistics where both type I and type II errors are elevated.
A scientific quantity is a quantity which can be defined outside of the
specifics of the current experiment. The kind of high:low estimates that re-
sult from categorizing a continuous variable are not scientific quantities; their
interpretation depends on the entire sample distribution of continuous mea-
surements within the chosen intervals.
A linear spline function with knots at a, b, c is
f(X) = β0 + β1X + β2(X − a)+ + β3(X − b)+ + β4(X − c)+, (2.17)
where
(u)+ = u, u > 0,
     = 0, u ≤ 0. (2.18)
The number of knots can vary depending on the amount of available data for
fitting the function. The linear spline function can be rewritten as
f(X) = β0 + β1X, X ≤ a
     = β0 + β1X + β2(X − a), a < X ≤ b (2.19)
     = β0 + β1X + β2(X − a) + β3(X − b), b < X ≤ c
     = β0 + β1X + β2(X − a) + β3(X − b) + β4(X − c), c < X.
The overall model can be written as a linear model f(X) = Xβ, (2.20)
where Xβ = β0 + β1X1 + β2X2 + β3X3 + β4X4, and
X1 = X, X2 = (X − a)+,
X3 = (X − b)+, X4 = (X − c)+. (2.21)
Fig. 2.1 A linear spline function with knots at a = 1, b = 3, c = 5 (y-axis: f(X);
x-axis: X).
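In rms the lsp function constructs exactly these (X − a)+ terms; a sketch with the knots of Figure 2.1 and hypothetical y and x:

```r
library(rms)
f <- ols(y ~ lsp(x, c(1, 3, 5)))  # linear spline, knots at a = 1, b = 3, c = 5
```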
Although the linear spline is simple and can approximate many common
relationships, it is not smooth and will not fit highly curved functions well.
These problems can be overcome by using piecewise polynomials of order
higher than linear. Cubic polynomials have been found to have nice properties
with good ability to fit sharply curving shapes. Cubic splines can be made to
be smooth at the join points (knots) by forcing the first and second derivatives
of the function to agree at the knots. Such a smooth cubic spline function
with three knots (a, b, c) is given by
f(X) = β0 + β1X + β2X² + β3X³
      + β4(X − a)³+ + β5(X − b)³+ + β6(X − c)³+ (2.22)
     = Xβ,
where
X1 = X, X2 = X²,
X3 = X³, X4 = (X − a)³+, (2.23)
X5 = (X − b)³+, X6 = (X − c)³+.
If the cubic spline function has k knots, the function will require estimat-
ing k + 3 regression coefficients besides the intercept. See Section 2.4.6 for
information on choosing the number and location of knots.
There are more numerically stable ways to form a design matrix for cubic
spline functions that are based on B-splines instead of the truncated power
basis [152, 575] used here. However, B-splines are more complex and do not allow
for extrapolation beyond the outer knots, and the truncated power basis
seldom presents estimation problems (see Section 4.6) when modern methods
such as the Q–R decomposition are used for matrix inversion.
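A sketch of building the truncated power basis of Equations 2.22–2.23 directly; pos is a small helper defined here, not a base R function:

```r
pos <- function(u) pmax(u, 0)  # the (u)+ function of Equation 2.18

cubic_spline_basis <- function(X, knots) {
  # columns: X, X^2, X^3, and (X - t)+^3 for each knot t
  cbind(X, X^2, X^3, sapply(knots, function(t) pos(X - t)^3))
}

x <- seq(0, 10, length.out = 101)
B <- cubic_spline_basis(x, knots = c(2, 5, 8))  # k = 3 knots -> k + 3 columns
```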
Stone and Koo [595] have found that cubic spline functions do have a drawback
in that they can be poorly behaved in the tails, that is before the first knot and
after the last knot. They cite advantages of constraining the function to be
linear in the tails. Their restricted cubic spline function (also called natural
splines) has the additional advantage that only k − 1 parameters must be
estimated (besides the intercept) as opposed to k + 3 parameters with the
unrestricted cubic spline. The restricted spline function with k knots t1, . . . , tk
is given by [156]
f(X) = β0 + β1X1 + β2X2 + . . . + βk−1Xk−1, (2.24)
where X1 = X and, for j = 1, . . . , k − 2,
Xj+1 = (X − tj)³+ − (X − tk−1)³+(tk − tj)/(tk − tk−1)
       + (X − tk)³+(tk−1 − tj)/(tk − tk−1). (2.25)
The nonlinear terms are best normalized by dividing them by
τ = (tk − t1)². (2.26)
Figure 2.3 displays some typical shapes of restricted cubic spline functions
with k = 3, 4, 5, and 6. These functions were generated using random β.
Fig. 2.2 Restricted cubic spline component variables for k = 5 and knots at X =
.05, .275, .5, .725, and .95. Nonlinear basis functions are scaled by τ . The left panel
is a y–magnification of the right panel. Fitted functions such as those in Figure 2.3
will be linear combinations of these basis functions as long as knots are at the same
locations used here.
Once β0, . . . , βk−1 are estimated, the restricted cubic spline can be restated
in the unrestricted form of Equation 2.22, with the coefficients of the last two
truncated cubes determined by the tail-linearity constraints. A test of linearity
in X is obtained by testing
H0 : β2 = β3 = . . . = βk−1 = 0. (2.29)
The truncated power basis for restricted cubic splines does allow for
rational (i.e., linear) extrapolation beyond the outer knots. However, when
the outer knots are in the tails of the data, extrapolation can still be danger-
ous.
When nonlinear terms in Equation 2.25 are normalized, for example, by
dividing them by the square of the difference in the outer knots to make all
terms have units of X, the ordinary truncated power basis has no numerical
difficulties when modern matrix algebra software is used.
We have assumed that the locations of the knots are specified in advance;
that is, the knot locations are not treated as free parameters to be estimated.
If knots were free parameters, the fitted function would have more flexibility
but at the cost of instability of estimates, statistical inference problems, and
inability to use standard regression modeling software for estimating regres-
sion parameters.
How then does the analyst pre-assign knot locations? If the regression
relationship were described by prior experience, pre-specification of knot lo-
cations would be easy. For example, if a function were known to change
curvature at X = a, a knot could be placed at a. However, in most situations
there is no way to pre-specify knots. Fortunately, Stone [593] has found that
the location of knots in a restricted cubic spline model is not very crucial in
most situations; the fit depends much more on the choice of k, the number of
knots. Placing knots at fixed quantiles (percentiles) of a predictor’s marginal
distribution is a good approach in most datasets. This ensures that enough
points are available in each interval, and also guards against letting outliers
overly influence knot placement. Recommended equally spaced quantiles are
shown in Table 2.3.
Fig. 2.3 Some typical restricted cubic spline functions for k = 3, 4, 5, 6 (one panel
per number of knots; x-axis: X). The y–axis is Xβ. Arrows indicate knots. These
curves were derived by randomly choosing values of β subject to standard deviations
of fitted functions being normalized.
Table 2.3 Recommended default quantiles of X for knot placement

k   Quantiles
3   .10   .5    .90
4   .05   .35   .65   .95
5   .05   .275  .5    .725  .95
6   .05   .23   .41   .59   .77   .95
7   .025  .1833 .3417 .5    .6583 .8167 .975
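The Hmisc package will report these default locations for a given predictor; a sketch with a hypothetical x:

```r
library(Hmisc)
rcspline.eval(x, nk = 5, knots.only = TRUE)
# returns the knots placed at the .05, .275, .5, .725, .95 quantiles of x
```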
The principal reason for using less extreme default quantiles for k = 3 and
more extreme ones for k = 7 is that one usually uses k = 3 for small sample
sizes and k = 7 for large samples. When the sample size is less than 100, the
outer quantiles should be replaced by the fifth smallest and fifth largest data
points, respectively [595]. What about the choice of k? The flexibility of possible
fits must be tempered by the sample size available to estimate the unknown
parameters. Stone [593] has found that more than 5 knots are seldom required
in a restricted cubic spline model. The principal decision then is between
k = 3, 4, or 5. For many datasets, k = 4 offers an adequate fit of the model
and is a good compromise between flexibility and loss of precision caused
by overfitting a small sample. When the sample size is large (e.g., n ≥ 100
with a continuous uncensored response variable), k = 5 is a good choice.
Small samples (< 30, say) may require the use of k = 3. Akaike’s information
criterion (AIC, Section 9.8.1) can be used for a data-based choice of k. The
value of k maximizing the model likelihood ratio χ² − 2k would be the best
“for the money” using AIC.
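A sketch of this data-based choice of k for a hypothetical continuous response y and predictor x; smaller AIC corresponds to larger likelihood ratio χ² − 2k:

```r
library(rms)
for (k in 3:6) {
  f <- ols(y ~ rcs(x, k))               # restricted cubic spline with k knots
  cat(k, "knots: AIC =", AIC(f), "\n")  # choose the k giving the smallest AIC
}
```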
The analyst may wish to devote more knots to variables that are thought
to be more important, and risk lack of fit for less important variables. In this
way the total number of estimated parameters can be controlled (Section 4.1).
A simple moving average has a problem in estimating E(Y) at the outer values
of X. The estimates are quite sensitive to the choice of the number of points
(or interval width) to use in “binning” the data.
A moving least squares linear regression smoother is far superior to a
moving flat line smoother (moving average). Cleveland’s [111] moving linear
regression smoother loess has become the most popular smoother. To obtain
the smoothed value of Y at X = x, we take all the data having X values
within a suitable interval about x. Then a linear regression is fitted to all
of these points, and the predicted value from this regression at X = x is
taken as the estimate of E(Y |X = x). Actually, loess uses weighted least
squares estimates, which is why it is called a locally weighted least squares
method. The weights are chosen so that points near X = x are given the
most weight^b in the calculation of the slope and intercept. Surprisingly, a
good default choice for the interval about x is an interval containing 2/3 of
the data points! The weighting function is devised so that points near the
extremes of this interval receive almost no weight in the calculation of the
slope and intercept.
Because loess uses a moving straight line rather than a moving flat one,
it provides much better behavior at the extremes of the Xs. For example,
one can fit a straight line to the first three data points and then obtain the
predicted value at the lowest X, which takes into account that this X is not
the middle of the three Xs.
loess obtains smoothed values for E(Y ) at each observed value of X.
Estimates for other Xs are obtained by linear interpolation.
The loess algorithm has another component. After making an initial es-
timate of the trend line, loess can look for outliers off this trend. It can
then delete or down-weight those apparent outliers to obtain a more robust
trend estimate. Now, different points will appear to be outliers with respect
to this second trend estimate. The new set of outliers is taken into account
and another trend line is derived. By default, the process stops after these
three iterations. loess works exceptionally well for binary Y as long as the
iterations that look for outliers are not done, that is only one iteration is
performed.
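Base R's lowess exposes the number of robustness iterations directly; a sketch with hypothetical x and binary y:

```r
s <- lowess(x, y, iter = 0)  # iter = 0: no outlier re-weighting passes
plot(x, y)
lines(s)                     # smoothed estimate of Prob(Y = 1 | X = x)
```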
For a single X, Friedman’s “super smoother” [207] is another efficient and flex-
ible nonparametric trend estimator. For both loess and the super smoother
the amount of smoothing can be controlled by the analyst. Hastie and
Tibshirani [275] provided an excellent description of smoothing methods and
developed a generalized additive model for multiple Xs, in which each
continuous predictor is fitted with a nonparametric smoother (see Chap-
ter 16). Interactions are not allowed. Cleveland et al. [96] have extended two-
dimensional smoothers to multiple dimensions without assuming additivity.
Their local regression model is feasible for up to four or so predictors. Local
regression models are extremely flexible, allowing parts of the model to be
^b This weight is not to be confused with the regression coefficient; rather, the
weights are w1, w2, . . . , wn and the fitting criterion is the weighted sum of squares
Σ wᵢ(Yᵢ − Ŷᵢ)² over i = 1, . . . , n.
When a factor is a linear or binary term in the regression model, the test
of association for that factor with the response involves testing only a single
regression parameter. Nominal factors and predictors that are represented as
a quadratic or spline function require multiple regression parameters to be
d.f. even when testing at the second stage. This is because in the original
2 d.f. test of association, the 1 d.f. corresponding to the nonlinear effect is
deleted if the nonlinear effect is very small; that is one is retaining the most
significant part of the 2 d.f. F statistic.
If we use a 2 d.f. F critical value to assess the X effect even when X² is not
in the model, it is clear that the two-stage approach can only lose power and
hence it has no advantage whatsoever. That is because the sum of squares
due to regression from the quadratic model is greater than the sum of squares
computed from the linear model.
In this section, the regression part of the model is isolated, and methods are
described for validating the regression assumptions or modifying the model
to meet the assumptions. The general linear regression model is
C(Y|X) = Xβ = β0 + β1X1 + β2X2,
where X1 is binary and X2 is continuous. One needs to verify that the prop-
erty of the response C(Y) is related to X1 and X2 according to Figure 2.4.
Fig. 2.4 Regression assumptions for one binary and one continuous predictor: two
parallel lines showing C(Y) against X2, one line for X1 = 0 and one for X1 = 1.
There are several methods for checking the fit of this model. The first
method below is based on critiquing the simple model, and the other methods
directly “estimate” the model.
1. Fit the simple linear additive model and critically examine residual plots
for evidence of systematic patterns. For least squares fits one can compute
estimated residuals e = Y − X β̂ and box plots of e stratified by X1 and
scatterplots of e versus X1 and Ŷ with trend curves. If one is assuming
constant conditional variance of Y , the spread of the residual distribution
against each of the variables can be checked at the same time. If the nor-
mality assumption is needed (i.e., if significance tests or confidence limits
are used), the distribution of e can be compared with a normal distribu-
tion with mean zero. Advantage: Simplicity. Disadvantages: Standard
residuals can only be computed for continuous uncensored response vari-
ables. The judgment of non-randomness is largely subjective, it is difficult
to detect interaction, and if interaction is present it is difficult to check
any of the other assumptions. Unless trend lines are added to plots, pat-
terns may be difficult to discern if the sample size is very large. Detecting
patterns in residuals does not always inform the analyst of what corrective
action to take, although partial residual plots can be used to estimate the
needed transformations if interaction is absent.
2. Make a scatterplot of Y versus X2 using different symbols according to
values of X1 . Advantages: Simplicity, and one can sometimes see all re-
gression patterns including interaction. Disadvantages: Scatterplots can-
not be drawn for binary, categorical, or censored Y . Patterns are difficult
to see if relationships are weak or if the sample size is very large.
3. Stratify the sample by X1 and quantile groups (e.g., deciles) of X2 . Within
each X1 × X2 stratum an estimate of C(Y |X1 , X2 ) is computed. If X1 is
continuous, the same method can be used after grouping X1 into quantile
groups. Advantages: Simplicity, ability to see interaction patterns, can
handle censored Y if care is taken. Disadvantages: Subgrouping requires
relatively large sample sizes and does not use continuous factors effectively
as it does not attempt any interpolation. The ordering of quantile groups is
not utilized by the procedure. Subgroup estimates have low precision (see
p. 488 for an example). Each stratum must contain enough information
to allow trends to be apparent above noise in the data. The method of
grouping chosen (e.g., deciles vs. quintiles vs. rounding) can alter the shape
of the plot.
4. Fit a nonparametric smoother separately for levels of X1 (Section 2.4.7)
relating X2 to Y . Advantages: All regression aspects of the model can
be summarized efficiently with minimal assumptions. Disadvantages:
Does not easily apply to censored Y , and does not easily handle multiple
predictors.
5. Fit a flexible parametric model that allows for most of the departures from
the linear additive model that you wish to entertain. Advantages: One
framework is used for examining the model assumptions, fitting the model,
and drawing formal inference. Degrees of freedom are well defined and
all aspects of statistical inference “work as advertised.” Disadvantages:
Complexity, and it is generally difficult to allow for interactions when
assessing patterns of effects.
The first four methods each have the disadvantage that if confidence limits
or formal inferences are desired it is difficult to know how many degrees of
freedom were effectively used so that, for example, confidence limits will have
the stated coverage probability. For method five, the restricted cubic spline
function is an excellent tool for estimating the true relationship between X2
and C(Y ) for continuous variables without assuming linearity. By fitting a
model containing X2 expanded into k − 1 terms, where k is the number of
knots, one can obtain an estimate of the function of X2 that could be used
linearly in the model:
C(Y|X) = β̂0 + β̂1X1 + fˆ(X2), (2.34)
where
fˆ(X2) = β̂2X2 + β̂3X2′ + β̂4X2′′, (2.35)
and X2′ and X2′′ are constructed spline variables (when k = 4) as described
previously. We call fˆ(X2 ) the spline-estimated transformation of X2 . Plotting
the estimated spline function fˆ(X2 ) versus X2 will generally shed light on
how the effect of X2 should be modeled. If the sample is sufficiently large,
the spline function can be fitted separately for X1 = 0 and X1 = 1, allowing
detection of even unusual interaction patterns. A formal test of linearity in
X2 is obtained by testing H0 : β3 = β4 = 0, using a computationally efficient
score test, for example (Section 9.2.3).
If the model is nonlinear in X2 , either a transformation suggested by the
spline function plot (e.g., log(X2 )) or the spline function itself (by placing
X2 , X2′ , and X2′′ simultaneously in any model fitted) may be used to describe
X2 in the model. If a tentative transformation of X2 is specified, say g(X2 ),
the adequacy of this transformation can be tested by expanding g(X2 ) in a
spline function and testing for linearity. If one is concerned only with predic-
tion and not with statistical inference, one can attempt to find a simplifying
transformation for a predictor by plotting g(X2 ) against fˆ(X2 ) (the estimated
spline transformation) for a variety of g, seeking a linearizing transformation
of X2 . When there are nominal or binary predictors in the model in addi-
tion to the continuous predictors, it should be noted that there are no shape
assumptions to verify for the binary/nominal predictors. One need only test
for interactions between these predictors and the others.
If the model contains more than one continuous predictor, all may be ex-
panded with spline functions in order to test linearity or to describe nonlinear
relationships. If one did desire to assess simultaneously, for example, the lin-
earity of predictors X2 and X3 in the presence of a linear or binary predictor
X1 , the model could be specified as

C(Y |X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′ + β5 X3 + β6 X3′ + β7 X3′′ ,

where X2′ , X2′′ , X3′ , and X3′′ represent components of four-knot restricted cubic
spline functions.
The test of linearity for X2 (with 2 d.f.) is H0 : β3 = β4 = 0. The overall
test of linearity for X2 and X3 is H0 : β3 = β4 = β6 = β7 = 0, with 4 d.f.
But as described further in Section 4.1, even though there are many reasons
for allowing relationships to be nonlinear, there are reasons for not testing
the nonlinear components for significance, as this might tempt the analyst to
simplify the model, thus distorting inference.234 Testing for linearity is usually
best done to justify to non-statisticians the need for complexity to explain or
predict outcomes.
Note that interpretation of tests of “main effects” depends
on whether interaction terms are adjusted for when estimating them.
A good compromise method involves fitting interactions of the form X1 f (X2 )
and X2 g(X1 ).
For some problems a reasonable approach is, for each predictor separately,
to test simultaneously the joint importance of all interactions involving that
predictor. For p predictors this results in p tests each with p − 1 degrees
of freedom. The multiple comparison problem would then be reduced from
p(p − 1)/2 tests (if all two-way interactions were tested individually) to p
tests.
In the fields of biostatistics and epidemiology, some types of interactions
that have consistently been found to be important in predicting outcomes
and thus may be pre-specified are the following.
1. Interactions between treatment and the severity of disease being treated.
Patients with little disease can receive little benefit.
2. Interactions involving age and risk factors. Older subjects are generally
less affected by risk factors. They had to have been robust to survive to
their current age with risk factors present.
3. Interactions involving age and type of disease. Some diseases are incurable
and have the same prognosis regardless of age. Others are treatable or
have less effect on younger patients.
4. Interactions between a measurement and the state of a subject during a
measurement. Respiration rate measured during sleep may have greater
predictive value and thus have a steeper slope versus outcome than res-
piration rate measured during activity.
5. Interaction between menopausal status and treatment or risk factors.
6. Interactions between race and disease.
7. Interactions between calendar time and treatment. Some treatments have
learning curves causing secular trends in the associations.
8. Interactions between month of the year and other predictors, due to sea-
sonal effects.
9. Interaction between the quality and quantity of a symptom, for example,
daily frequency of chest pain × severity of a typical pain episode.
10. Interactions between study center and treatment.
For the case of an ordinal predictor, spline functions are not useful unless
there are so many categories that in essence the variable is continuous. When
the number of categories k is small (three to five, say), the variable is usu-
ally modeled as a polytomous factor using indicator variables or equivalently
as one linear term and k − 2 indicators. The latter coding facilitates testing
for linearity. For more categories, it may be reasonable to stratify the data
by levels of the variable and to compute summary statistics (e.g., logit pro-
portions for a logistic model) or to examine regression coefficients associated
with indicator variables over categories. Then one can attempt to summarize
the pattern with a linear or some other simple trend. Later hypothesis tests
must take into account this data-driven scoring (by using > 1 d.f., for exam-
ple), but the scoring can save degrees of freedom when testing for interaction
with other factors. In one dataset, the number of comorbid diseases was used
to summarize the risk of a set of diseases that was too large to model. By
plotting the logit of the proportion of deaths versus the number of diseases,
it was clear that the square of the number of diseases would properly score
the variable.
Sometimes it is useful to code an ordinal predictor with k − 1 indicator
variables of the form [X ≥ vj ], where j = 2, . . . , k and [h] is 1 if h is true,
0 otherwise.648 Although a test of linearity does not arise immediately from
this coding, the regression coefficients are interpreted as amounts of change
from the previous category. A test of whether the last m categories can be
combined with the category k − m does follow easily from this coding.
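A small sketch of this coding in R, with a hypothetical four-level ordinal predictor:

x <- sample(1:4, 100, replace = TRUE)    # hypothetical ordinal predictor
v <- sort(unique(x))
Z <- sapply(v[-1], function(vj) as.integer(x >= vj))
colnames(Z) <- paste0('x>=', v[-1])      # [X >= v_j] indicators, j = 2, ..., k
head(Z)  # each column's coefficient estimates the change from the previous category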
The general linear regression model is stated as C(Y |X) = Xβ to highlight its
regression assumptions. For logistic regression models for binary or nominal
responses, there is no distributional assumption if simple random sampling
is used and subjects’ responses are independent. That is, the binary logistic
model and all of its assumptions are contained in the expression logit{Y =
1|X} = Xβ. For ordinary multiple regression with constant variance σ 2 , we
usually assume that Y −Xβ is normally distributed with mean 0 and variance
σ 2 . This assumption can be checked by estimating β with β̂ and plotting the
overall distribution of the residuals Y − X β̂, the residuals against Ŷ , and the
residuals against each X. For the latter two, the residuals should be normally
distributed within each neighborhood of Ŷ or X. A weaker requirement is that
the overall distribution of residuals is normal; this will be satisfied if all of the
stratified residual distributions are normal. Note a hidden assumption in both
models, namely, that there are no omitted predictors. Other models, such as
the Weibull survival model or the Cox132 proportional hazards model, also
have distributional assumptions that are not fully specified by C(Y |X) = Xβ.
However, regression and distributional assumptions of some of these models
are encapsulated by
XΓ = Γ1 X1 + Γ2 X2 + . . . + Γk Xk (2.41)
2.8 Further Reading

1 References [152, 575, 578] have more information about cubic splines.
2 See Smith578 for a good overview of spline functions.
3 More material about natural splines may be found in de Boor152 . McNeil et al.451
discuss the overall smoothness of natural splines in terms of the integral of the
square of the second derivative of the regression function, over the range of
the data. Govindarajulu et al.230 compared restricted cubic splines, penalized
splines, and fractional polynomial532 fits and found that the first two methods
agreed with each other more than with estimated fractional polynomials.
4 A tutorial on restricted cubic splines is in [271].
5 Durrleman and Simon168 provide examples in which knots are allowed to be
estimated as free parameters, jointly with the regression coefficients. They found
that even though the “optimal” knots were often far from a priori knot locations,
the model fits were virtually identical.
2.9 Problems
d Note: To pre-specify knots for restricted cubic spline functions, use something like
rcs(predictor, c(t1,t2,t3,t4)), where the knot locations are t1, t2, t3, t4.
e Note that anova in rms computes all needed test statistics from a single model fit
object.
Chapter 3
Missing Data
There are missing data in the majority of datasets one is likely to encounter.
Before discussing some of the problems of analyzing data in which some
variables are missing for some subjects, we define some nomenclature.1
Missing completely at random (MCAR). Data are missing for reasons that are unrelated to any characteristics or re-
sponses for the subject, including the value of the missing value, were it to
be known. Examples include missing laboratory measurements because of a
dropped test tube (if it was not dropped because of knowledge of any mea-
surements), a study that ran out of funds before some subjects could return
for follow-up visits, and a survey in which a subject omitted her response to
a question for reasons unrelated to the response she would have made or to
any other of her characteristics.
Missing at random (MAR). Data are not missing completely at random, but the probability that a value is missing
depends on values of variables that were actually measured. As an example,
consider a survey in which females are less likely to provide their personal
income in general (but the likelihood of responding is independent of their
actual income). If we know the sex of every subject and have income levels
for some of the females, unbiased sex-specific income estimates can be made.
That is because the incomes we do have for some of the females are a random
sample of all females’ incomes. Another way of saying that a variable is MAR
is that given the values of other available variables, subjects having missing
values are only randomly different from other subjects.535 Or to paraphrase
Greenland and Finkle,242 for MAR the missingness of a covariable cannot
depend on unobserved covariable values; for example whether a predictor is
observed cannot depend on another predictor when the latter is missing but
it can depend on the latter when it is observed. MAR and MCAR data are
also called ignorable non-responses.
Informative missing (IM). The tendency for a variable to be missing is a function of data that are not
available, including the case when data tend to be missing if their true values
are systematically higher or lower. An example is when subjects with lower
income levels or very high incomes are less likely to provide their personal in-
come in an interview. IM is also called nonignorable non-response and missing
not at random (MNAR).
IM is the most difficult type of missing data to handle. In many cases, there
is no fix for IM nor is there a way to use the data to test for the existence of
IM. External considerations must dictate the choice of missing data models,
and there are few clues for specifying a model under IM. MCAR is the easiest
case to handle. Our ability to correctly analyze MAR data depends on the
availability of other variables (the sex of the subject in the example above).
Most of the methods available for dealing with missing data assume the data
are MAR. Fortunately, even though the MAR assumption is not testable, it
may hold approximately if enough variables are included in the imputation
models.256
The functions naclus and naplot in the Hmisc package (see p. 142) can help detect
how variables are simultaneously missing. Recursive partitioning (regression
tree) algorithms (see Section 2.5) are invaluable for describing which kinds of
subjects are missing on a variable. Logistic regression is also an excellent tool
for this purpose. A later example (p. 302) demonstrates these procedures.
It can also be helpful to explore the distribution of non-missing Y by the
number of missing variables in X (including zero, i.e., complete cases on X).
When the response variable Y is collected serially but some subjects drop out
of the study before completion, there are many ways of dealing with partial
information42, 412, 480 including multiple imputation in phases,381 or efficiently
analyzing all available serial data using a full likelihood model. When Y is the
time until an event, there are actually no missing values of Y but follow-up
will be curtailed for some subjects. That leaves the case where the response
is completely measured once.
It is common practice to discard subjects having missing Y . Before doing
so, at minimum an analysis should be done to characterize the tendency
for Y to be missing, as just described. For example, logistic regression or
recursive partitioning can be used to predict whether Y is missing and to
test for systematic tendencies as opposed to Y being missing completely at
random. In many models, though, more efficient and less biased estimates of
regression coefficients can be made by also utilizing observations missing on
Y that are non-missing on X. Hence there is a definite place for imputation
of Y . von Hippel645 found advantages of using all variables to impute all
others, and once imputation is finished, discarding those observations having
missing Y . However if missing Y values are MCAR, up-front deletion of cases
having missing Y may sometimes be preferred, as imputation requires correct
specification of the imputation model. 2
Suppose that the predictors are age, sex, and blood pressure, and that age and sex were
recorded for every subject. Suppose that blood pressure was not measured
for a fraction of 0.10 of the subjects, and the most common reason for not
obtaining a blood pressure was that the subject was about to die. Deletion
of these very sick patients will cause a major bias (downward) in the model’s
intercept parameter.3 In general, casewise deletion will bias the estimate of
the model’s intercept parameter (as well as others) when the probability of
a case being incomplete is related to Y and not just to X [422, Example
3.3]. van der Heijden et al.628 discuss how complete case analysis (casewise
deletion) usually assumes MCAR.
Now consider an example in which casewise deletion of incomplete records
is inefficient. The inefficiency comes from the reduction of sample size, which
causes standard errors to increase,162 confidence intervals to widen, and power
of tests of association and tests of lack of fit to decrease. Suppose that the
response is the presence of coronary artery disease and the predictors are
age, sex, LDL cholesterol, HDL cholesterol, blood pressure, triglyceride, and
smoking status. Suppose that age, sex, and smoking are recorded for all sub-
jects, but that LDL is missing in 0.18 of the subjects, HDL is missing in 0.20,
and triglyceride is missing in 0.21. Assume that all missing data are MCAR
and that all of the subjects missing LDL are also missing HDL and that
overall 0.28 of the subjects have one or more predictors missing and hence
would be excluded from the analysis. If total cholesterol were known on every
subject, even though it does not appear in the model, it (along perhaps with
age and sex) can be used to estimate (impute) LDL and HDL cholesterol and
triglyceride, perhaps using regression equations from other studies. Doing the
analysis on a “filled in” dataset will result in more precise estimates because
the sample size would then include the other 0.28 of the subjects.
In general, observations should only be discarded if the MCAR assump-
tion is justified, there is a rarely missing predictor of overriding importance
that cannot be reliably imputed from other information, or if the fraction of
observations excluded is very small and the original sample size is large. Even
then, there is no advantage of such deletion other than saving analyst time.
If a predictor is MAR but its missingness depends on Y , casewise deletion is
biased.
The first blood pressure example points out why it can be dangerous to
handle missing values by adding a dummy variable to the model. Many ana-
lysts would set missing blood pressures to a constant (it doesn’t matter which
constant) and add a variable to the model such as is.na(blood.pressure) in
R notation. The coefficient for the latter dummy variable will be quite large
in the earlier example, and the model will appear to have great ability to
predict death. This is because some of the left-hand side of the model con-
taminates the right-hand side; that is, is.na(blood.pressure) is correlated
with death.4 For categorical variables, another common practice is to add a
new category to denote missing, adding one more degree of freedom to the
predictor and changing its meaning.a Jones,326 Allison [12, pp. 9–11], Donders
et al.,161 Knol et al.,353 and van der Heijden et al.628 describe why both
of these missing-indicator methods are invalid even when MCAR holds.5
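A hedged simulation sketch of the mechanism described in note 5 below; the sample size, correlation, and missingness fraction here are arbitrary illustrative choices:

set.seed(6)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)      # correlated covariates; true beta1 = 1, beta2 = 0
y  <- x1 + rnorm(n)
miss <- runif(n) < 0.5         # X1 missing completely at random
x1f  <- ifelse(miss, 0, x1)    # missing X1 filled in with a constant
coef(lm(y ~ x1f + x2 + miss))  # beta1 attenuated; beta2 pulled away from 0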
Except in special circumstances that usually involve only very simple models,
the primary alternative to deleting incomplete observations is imputation of
the missing values. Many non-statisticians find the notion of estimating data
distasteful, but the way to think about imputation of missing values is that
“making up” data is better than discarding valuable data. It is especially dis-
tressing to have to delete subjects who are missing on an adjustment variable
when a major variable of interest is not missing. So one goal of imputation
is to use as much information as possible for examining any one predictor’s
adjusted association with Y . The overall goal of imputation is to preserve the
information and meaning of the non-missing data.
At this point the analyst must make some decisions about the information
to use in computing predicted values for missing values.
1. Imputation of missing values for one of the variables can ignore all other
information. Missing values can be filled in by sampling non-missing values
of the variable, or by using a constant such as the median or mean non-
missing value.
2. Imputation algorithms can be based only on external information not oth-
erwise used in the model for Y in addition to variables included in later
modeling. For example, family income can be imputed on the basis of loca-
tion of residence when such information is to remain confidential for other
aspects of the analysis or when such information would require too many
degrees of freedom to be spent in the ultimate response model.
3. Imputations can be derived by only analyzing interrelationships among
the Xs.
4. Imputations can use relationships among the Xs and between X and Y .
5. Imputations can use X, Y , and auxiliary variables not in the model
predicting Y .
6. Imputations can take into account the reason for non-response if known.
The model to estimate the missing values in a sometimes-missing (target)
variable should include all variables that are either
a This may work if values are “missing” because of “not applicable”, e.g., one has a
measure of marital happiness, dichotomized as high or low, but the sample contains
some unmarried people. One could have a 3-category variable with values high, low,
and unmarried (Paul Allison, IMPUTE e-mail list, 4Jul09).
In predictive mean matching 422 (PMM), one replaces a missing (NA) value
for the target variable being imputed with the actual value from a donor
observation. Donors are identified by matching in only one dimension, namely
the predicted value (e.g., predicted mean) of the target. Key considerations
are how to
b Predictors of the target variable include all the other Xs along with auxiliary
variables that are not included in the final outcome model, as long as they precede
the variable being imputed in the causal chain (unlike with multiple imputation).
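A minimal sketch of the PMM idea for a single continuous target with complete predictors. Real implementations (e.g., aregImpute in Hmisc or the mice package) add proper multiple-imputation machinery such as random draws of coefficients and random donor selection among close matches:

pmm_impute <- function(y, X) {
  miss <- is.na(y)
  dat  <- cbind(y = y, X)
  fit  <- lm(y ~ ., data = dat, subset = !miss)  # model for the target
  pred <- predict(fit, newdata = dat)            # predicted means, all rows
  donors <- which(!miss)
  for (i in which(miss)) {
    j <- donors[which.min(abs(pred[donors] - pred[i]))]
    y[i] <- y[j]     # donor's observed value, matched on predicted mean only
  }
  y
}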
Imputing missing values and then doing an ordinary analysis as if the imputed
values were real measurements is usually better than excluding subjects with
incomplete data. However, ordinary formulas for standard errors and other
statistics are invalid unless imputation is taken into account.651 Methods for
properly accounting for having incomplete data can be complex. The boot-
strap (described later) is an easy method to implement, but the computations
can be slow.e
c Thus when modeling binary or categorical targets one can frequently take least
squares shortcuts in place of maximum likelihood for binary, ordinal, or multinomial
logistic models.
d White et al.662 discuss an alternative method based on choosing a donor observation
at random from the q closest matches (q = 3, for example).
e To use the bootstrap to correctly estimate variances of regression coefficients, one
must repeat the imputation process and the model fitting perhaps 100 times using a
3.9 Diagnostics
1 These types of missing data are well described in an excellent review article
on missing data by Schafer and Graham542 . A good introductory article on
missing data and imputation is by Donders et al.161 and a good overview of
multiple imputation is by White et al.662 and Harel and Zhou256 . Paul Allison’s
booklet12 and van Buuren’s book85 are also excellent practical treatments.
2 Crawford et al.138 give an example where responses are not MCAR for which
deleting subjects with missing responses resulted in a biased estimate of the
response distribution. They found that multiple imputation of the response re-
sulted in much improved estimates. Wood et al.673 have a good review of how
missing response data are typically handled in randomized trial reports, with
recommendations for improvements. Barnes et al.42 have a good overview of
imputation methods and a comparison of bias and confidence interval cover-
age for the methods when applied to longitudinal data with a small number
of subjects. Twist et al.617 found instability in using multiple imputation of
longitudinal data, and advantages of using instead full likelihood models.
3 See van Buuren et al.626 for an example in which subjects having missing base-
line blood pressure had shorter survival time. Joseph et al.327 provide examples
demonstrating difficulties with casewise deletion and single imputation, and
comment on the robustness of multiple imputation methods to violations of
assumptions.
4 Another problem with the missingness indicator approach arises when more
than one predictor is missing and these predictors are missing on almost the
same subjects. The missingness indicator variables will be collinear and thus
impossible to disentangle.326
5 See [623, pp. 2645–2646] for several problems with the “missing category” ap-
proach. A clear example is in Donders et al.161 where covariates X1 , X2 have true β1 = 1, β2 =
0 and X1 is MCAR. Adding a missingness indicator for X1 as a covariate re-
sulted in βˆ1 = 0.55, βˆ2 = 0.51 because in the missing observations the constant
X1 was uncorrelated with X2 . D’Agostino and Rubin146 developed methods for
propensity score modeling that allow for missing data. They mentioned that ex-
tra categories may be added to allow for missing data in propensity models and
that adding indicator variables describing patterns of missingness will also allow
the analyst to match on missingness patterns when comparing non-randomly
assigned treatments.
6 Harel and Zhou256 and Siddique569 discuss the approximate Bayesian bootstrap
further.
7 Kalton and Kasprzyk332 proposed a hybrid approach to imputation in which
missing values are imputed with the predicted value for the subject plus the
residual from the subject having the closest predicted value to the subject being
imputed.
8 Miller et al.458 studied the effect of ignoring imputation when conditional mean
fill-in methods are used, and showed how to formalize such methods using linear
models.
9 Meng455 argues against always separating imputation from final analysis, and
in favor of sometimes incorporating weights into the process.
10 van Buuren et al.626 presented an excellent case study in multiple imputation
in the context of survival analysis. Barzi and Woodward43 present a nice review
of multiple imputation with detailed comparison of results (point estimates and
confidence limits for the effect of the sometimes-missing predictor) for various
imputation methods. Barnard and Rubin41 derived an estimate of the d.f. asso-
ciated with the imputation-adjusted variance matrix for use in a t-distribution
3.12 Problems
f The dataset is on the book’s dataset wiki and may be automatically fetched over
the internet and loaded using the Hmisc package’s command getHdata(support).
a. The cost estimates are not available on 105 patients. Total hospital
charges (bills) are available on all but 25 patients. Relate these two
variables to each other with an eye toward using charges to predict
totcst when totcst is missing. Make graphs that will tell whether lin-
ear regression or linear regression after taking logs of both variables is
better.
b. Impute missing total hospital costs in SUPPORT based on a regression
model relating charges to costs, when charges are available. You may
want to use a statement like the following in R:
support ← transform ( support ,
totcst = ifelse ( is.na ( totcst ) ,
( e x p r e s s i o n _ i n _ c h a r g e s ) , totcst ))
If in the previous problem you felt that the relationship between costs
and charges should be based on taking logs of both variables, the “ex-
pression in charges” above may look something like exp(intercept +
slope * log(charges)), where constants are inserted for intercept and
slope.
c. Compute the likely error in approximating total cost using charges by
computing the median absolute difference between predicted and ob-
served total costs in the patients having both variables available. If you
used a log transformation, also compute the median absolute percent
error in imputing total costs by anti-logging the absolute difference in
predicted logs.
3. State briefly why single conditional median imputationh is OK here.
4. Use transcan to develop single imputations for total cost, commenting on
the strength of the model fitted by transcan as well as how strongly each
variable can be predicted from all the others.
5. Use predictive mean matching to multiply impute cost 10 times per missing
observation. Describe graphically the distributions of imputed values and
briefly compare these to distributions of non-imputed values. State in a
g You can use the R command subset(support, is.na(totcst) | totcst > 0). The
is.na condition tells R that it is permissible to include observations having missing
totcst without setting all columns of such observations to NA.
h We are anti-logging predicted log costs, and we assume log cost has a symmetric
distribution.
simple way what the sample variance of multiple imputations for a single
observation of a continuous predictor is approximating.
6. Using the multiple imputed values, develop an overall least squares model
for total cost (using the log transformation) making optimal use of partial
information, with variances computed so as to take imputation (except for
cost) into account. The model should use the predictors in Problem 1 and
should not assume linearity in any predictor but should assume additivity.
Interpret one of the resulting ratios of imputation-corrected variance to
apparent variance and explain why ratios greater than one do not mean
that imputation is inefficient.
Chapter 4
Multivariable Modeling Strategies
ones that are most different from one another. So if using partial tests or
rank correlation to reduce degrees of freedom can harm the model, one might
argue that it is fair to allow this strategy to also benefit the analysis.
When collinearities or confounding are not problematic, a quicker approach
based on pairwise measures of association can be useful. This approach will
not have numerical problems (e.g., singular covariance matrix). When Y is
binary or continuous (but not censored), a good general-purpose measure of
association that is useful in making decisions about the number of parameters
to devote to a predictor is an extension of Spearman’s ρ rank correlation.
This is the ordinary R2 from predicting the rank of Y based on the rank of
X and the square of the rank of X.2 This ρ2 will detect not only nonlinear
relationships (as will ordinary Spearman ρ) but some non-monotonic ones
as well. It is important that the ordinary Spearman ρ not be computed, as
this would tempt the analyst to simplify the regression function (towards
monotonicity) if the generalized ρ2 does not significantly exceed the square
of the ordinary Spearman ρ. For categorical predictors, ranks are not squared
but instead the predictor is represented by a series of dummy variables. The
resulting ρ2 is related to the Kruskal–Wallis test. See p. 460 for an example.
Note that bivariable correlations can be misleading if marginal relationships
vary greatly from ones obtained after adjusting for other predictors.3
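A sketch of the generalized ρ2 just described (the Hmisc function spearman2 implements this idea, including the dummy-variable version for categorical predictors):

rho2 <- function(x, y) {
  rx <- rank(x); ry <- rank(y)
  summary(lm(ry ~ rx + I(rx^2)))$r.squared  # rank of Y on rank and rank^2 of X
}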
Once one expands a predictor into linear and nonlinear terms and esti-
mates the coefficients, the best way to understand the relationship between
predictors and response is to graph this estimated relationship.c If the plot
appears almost linear or the test of nonlinearity is very insignificant there
is a temptation to simplify the model. The Grambsch and O’Brien result
described in Section 2.6 demonstrates why this is a bad idea.
From the above discussion a general principle emerges. Whenever the re-
sponse variable is informally or formally linked, in an unmasked fashion, to
particular parameters that may be deleted from the model, special adjust-
ments must be made in P -values, standard errors, test statistics, and confi-
dence limits, in order for these statistics to have the correct interpretation.
Examples of strategies that are improper without special adjustments (e.g.,
using the bootstrap) include examining a frequency table or scatterplot to
decide that an association is too weak for the predictor to be included in
the model at all or to decide that the relationship appears so linear that all
nonlinear terms should be omitted. It is also valuable to consider the reverse
situation; that is, one posits a simple model and then additional analysis or
outside subject matter information makes the analyst want to generalize the
model. Once the model is generalized (e.g., nonlinear terms are added), the
test of association can be recomputed using multiple d.f. So another general
principle is that when one makes the model more complex, the d.f. prop-
erly increases and the new test statistics for association have the claimed
c One can also perform a joint test of all parameters associated with nonlinear effects.
This can be useful in demonstrating to the reader that some complexity was actually
needed.
Before developing a multivariable model one must decide whether the as-
sumptions of each continuous predictor can be verified by ignoring the effects
of all other potential predictors. In some cases, the shape of the relation-
ship between a predictor and the property of response will be different if an
adjustment is made for other correlated factors when deriving regression esti-
mates. Also, failure to adjust for an important factor can frequently alter the
nature of the distribution of Y . Occasionally, however, it is unwieldy to deal
simultaneously with all predictors at each stage in the analysis, and instead
the regression function shapes are assessed separately for each continuous
predictor.
The material covered to this point dealt with a prespecified list of variables
to be included in the regression model. For reasons of developing a concise
model or because of a fear of collinearity or of a false belief that it is not
legitimate to include “insignificant” regression coefficients when presenting
results to the intended audience, stepwise variable selection is very commonly
employed. Variable selection is used when the analyst is faced with a series of
potential predictors but does not have (or use) the necessary subject matter
knowledge to enable her to prespecify the “important” variables to include
in the model. But using Y to compute P -values to decide which variables
to include is similar to using Y to decide how to pool treatments in a five–
treatment randomized trial, and then testing for global treatment differences
using fewer than four degrees of freedom.
Stepwise variable selection has been a very popular technique for many
years, but if this procedure had just been proposed as a statistical method, it
would most likely be rejected because it violates every principle of statistical
estimation and hypothesis testing. Here is a summary of the problems with
this method.
d Lockhart et al.425 provide an example with n = 100 and 10 orthogonal predictors
where all true βs are zero. The test statistic for the first variable to enter has type I
error of 0.39 when the nominal α is set to 0.05, in line with what one would expect
with multiple testing using 1 − 0.95^10 = 0.40.
They found that variables selected for the final model represented noise 0.20
to 0.74 of the time and that the final model usually contained less than half
of the actual number of authentic predictors. Hence there are many reasons
for using methods such as full-model fits or data reduction, instead of using
any stepwise variable selection algorithm.
If stepwise selection must be used, a global test of no regression should
be made before proceeding, simultaneously testing all candidate predictors
and having degrees of freedom equal to the number of candidate variables
(plus any nonlinear or interaction terms). If this global test is not significant,
selection of individually significant predictors is usually not warranted.
The method generally used for such variable selection is forward selection
of the most significant candidate or backward elimination of the least sig-
nificant predictor in the model. One of the recommended stopping rules is
based on the “residual χ2 ” with degrees of freedom equal to the number of
candidate variables remaining at the current step. The residual χ2 can be
tested for significance (if one is able to forget that because of variable selec-
tion this statistic does not have a χ2 distribution), or the stopping rule can
be based on Akaike’s information criterion (AIC33), here residual χ2 − 2 × d.f.257
Of course, use of more insight from knowledge of the subject matter
will generally improve the modeling process substantially. It must be remem-
bered that no currently available stopping rule was developed for data-driven
variable selection. Stopping rules such as AIC or Mallows’ Cp are intended
for comparing a limited number of prespecified models [66, Section 1.3].347,e 4
If the analyst insists on basing the stopping rule on P -values, the optimum
(in terms of predictive accuracy) α to use in deciding which variables to
include in the model is α = 1.0 unless there are a few powerful variables
and several completely irrelevant variables. A reasonable α that does allow
for deletion of some variables is α = 0.5.589 These values are far from the
traditional choices of α = 0.05 or 0.10. 5
e AIC works successfully when the models being entertained are on a progression
defined by a single parameter, e.g. a common shrinkage coefficient or the single num-
ber of knots to be used by all continuous predictors. AIC can also work when the
model that is best by AIC is much better than the runner-up so that if the process
were bootstrapped the same model would almost always be found. When used for
one-variable-at-a-time variable selection, AIC is just a restatement of the P -value,
and as such, doesn’t solve the severe problems with stepwise variable selection other
than forcing us to use slightly more sensible α values. Burnham and Anderson84 rec-
ommend selection based on AIC for a limited number of theoretically well-founded
models. Some statisticians try to deal with multiplicity problems caused by stepwise
variable selection by making α smaller than 0.05. This increases bias by giving vari-
ables whose effects are estimated with error a greater relative chance of being selected.
Variable selection does not compete well with shrinkage methods that simultaneously
model all potential predictors.
4. By using the bootstrap to choose variables, one must use the double boot-
strap to resample the entire modeling process in order to validate the model
and to derive reliable confidence intervals. This may be computationally
prohibitive.
5. The bootstrap did not improve upon traditional backward stepdown vari-
able selection. Both methods fail at identifying the “correct” variables.
For some applications the list of variables selected may be stabilized by
grouping variables according to subject matter considerations or empirical
correlations and testing each related group with a multiple degree of freedom
test. Then the entire group may be kept or deleted and, if desired, groups that
are retained can be summarized into a single variable or the most accurately
measured variable within the group can replace the group. See Section 4.7
for more on this.
Kass and Raftery337 showed that Bayes factors have several advantages in
variable selection, including the selection of less complex models that may
agree better with subject matter knowledge. However, as in the case with
more traditional stopping rules, the final model may still have regression
coefficients that are too large. This problem is solved by Tibshirani’s lasso
method,608, 609 which is a penalized estimation technique in which the esti-
mated regression coefficients are constrained so that the sum of their scaled
absolute values falls below some constant k chosen by cross-validation. This
kind of constraint forces some regression coefficient estimates to be exactly
zero, thus achieving variable selection while shrinking the remaining coef-
ficients toward zero to reflect the overfitting caused by data-based model
selection.
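A hedged sketch of lasso fitting with the constraint chosen by cross-validation, using the glmnet package (which is not part of this book’s software framework); the simulated data are purely illustrative:

library(glmnet)
set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)   # only two truly active predictors
cvfit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1: lasso penalty, CV-chosen
coef(cvfit, s = 'lambda.min')         # several coefficients set exactly to 0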
A final problem with variable selection is illustrated by comparing this
approach with the sensible way many economists develop regression mod-
els. Economists frequently use the strategy of deleting only those variables
that are “insignificant” and whose regression coefficients have a nonsensible
direction. Standard variable selection on the other hand yields biologically
implausible findings in many cases by setting certain regression coefficients
exactly to zero. In a study of survival time for patients with heart failure,
for example, it would be implausible that patients having a specific symptom
live exactly as long as those without the symptom just because the symp-
tom’s regression coefficient was “insignificant.” The lasso method shares this
difficulty with ordinary variable selection methods and with any method that
in the Bayesian context places nonzero prior probability on β being exactly
zero. 9
Many papers claim that there were insufficient data to allow for multivari-
able modeling, so they did “univariable screening” wherein only “significant”
variables (i.e., those that are separately significantly associated with Y ) were
entered into the model.f This is just a forward stepwise variable selection in
f This is akin to doing a t-test to compare the two treatments (out of 10, say) that
are apparently most different from each other.
which insignificant variables from the first step are not reanalyzed in later
steps. Univariable screening is thus even worse than stepwise modeling as
it can miss important variables that are only important after adjusting for
other variables.598 Overall, neither univariable screening nor stepwise vari-
able selection in any way solves the problem of “too many variables, too few
subjects,” and they cause severe biases in the resulting multivariable model
fits while losing valuable predictive information from deleting marginally
significant variables.10
The online course notes contain a simple simulation study of stepwise
selection using R.
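A sketch of the kind of simulation referred to there, using only base R: with every true coefficient zero, AIC-based stepwise selection still retains apparently “significant” noise variables.

set.seed(2)
n <- 100; p <- 10
X <- data.frame(matrix(rnorm(n * p), n, p))
X$y <- rnorm(n)                 # Y is pure noise; no true associations
sel <- step(lm(y ~ ., data = X), trace = 0)   # AIC-based stepwise selection
names(coef(sel))                # noise variables are typically retained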
4.4 Sample Size, Overfitting, and Limits on Number of Predictors

When a model is fitted that is too complex, that is, has too many free pa-
rameters to estimate for the amount of information in the data, the worth
of the model (e.g., R2 ) will be exaggerated and future observed values will
not agree with predicted values.11 In this situation, overfitting is said to be
present, and some of the findings of the analysis come from fitting noise and
not just signal, or finding spurious associations between X and Y . In this sec-
tion general guidelines for preventing overfitting are given. Here we concern
ourselves with the reliability or calibration of a model, meaning the ability of
the model to predict future observations as well as it appeared to predict the
responses at hand. For now we avoid judging whether the model is adequate
for the task, but restrict our attention to the likelihood that the model has
significantly overfitted the data.
In typical low signal-to-noise ratio situations,g model validations on in-
dependent datasets have found the minimum training sample size for which
the fitted model has an independently validated predictive discrimination
that equals the apparent discrimination seen in the training sample. Similar
validation experiments have considered the margin of error in estimating an
absolute quantity such as event probability. Studies such as268, 270, 577 have
shown that in many situations a fitted regression model is likely to be reli-
able when the number of predictors (or candidate predictors if using variable
selection) p is less than m/10 or m/20, where m is the “limiting sample size”
given in Table 4.1.12 A good average requirement is p < m/15. For example,
Smith et al.577 found in one series of simulations that the expected error in
Cox model predicted five-year survival probabilities was below 0.05 when
p < m/20 for “average” subjects and below 0.10 when p < m/20 for “sick”
g These are situations where the true R2 is low, unlike tightly controlled experiments
and mechanistic models where signal:noise ratios can be quite high. In those
situations, many parameters can be estimated from small samples, and the m/15
rule of thumb can be significantly relaxed.
subjects, where m is the number of deaths. For “average” subjects, m/10 was
adequate for preventing expected errors > 0.1. Note: The number of non-
intercept parameters in the model (p) is usually greater than the number of
predictors. Narrowly distributed predictor variables (e.g., if all subjects’ ages
are between 30 and 45 or only 5% of subjects are female) will require even
higher sample sizes. Note that the number of candidate variables must include
all variables screened for association with the response, including nonlinear
terms and interactions. Instead of relying on the rules of thumb in the table,
the shrinkage factor estimate presented in the next section can be used to
guide the analyst in determining how many d.f. to model (see p. 87).
Rules of thumb such as the 15:1 rule do not consider that a certain min-
imum sample size is needed just to estimate basic parameters such as an
intercept or residual variance. This is dealt with in upcoming topics about
specific models. For the case of ordinary linear regression, estimation of the
residual variance is central. All standard errors, P -values, confidence inter-
vals, and R2 depend on having a precise estimate of σ 2 . The one-sample
problem of estimating a mean, which is equivalent to a linear model contain-
ing only an intercept, is the easiest case when estimating σ 2 . When a sample
of size n is drawn from a normal distribution, a 1 − α two-sided confidence
interval for the unknown population variance σ 2 is given by
(n − 1)s2 / χ2_{1−α/2, n−1} < σ2 < (n − 1)s2 / χ2_{α/2, n−1} ,    (4.1)
h See [487]. If one considers the power of a two-sample binomial test compared
with a Wilcoxon test if the response could be made continuous and the proportional
odds assumption holds, the effective sample size for a binary response is
3n1 n2 /n ≈ 3 min(n1 , n2 ) if n1 /n is near 0 or 1 [664, Eq. 10, 15]. Here n1 and n2
are the marginal frequencies of the two response levels.
i Based on the power of a proportional odds model two-sample test when the marginal
cell sizes for the response are n1 , . . . , nk , compared with all cell sizes equal to unity
(response is continuous) [664, Eq. 3]. If all cell sizes are equal, the relative efficiency
of having k response categories compared with a continuous response is 1 − 1/k2 [664,
Eq. 14]; for example, a five-level response is almost as efficient as a continuous one if
proportional odds holds across category cutoffs.
j This is approximate, as the effective sample size may sometimes be boosted somewhat
by censored observations, especially for non-proportional hazards methods such
as Wilcoxon-type tests.49
where s2 is the sample variance and χ2_{α, n−1} is the α critical value of the
χ2 distribution with n − 1 degrees of freedom. We take the fold-change or
multiplicative margin of error (MMOE) for estimating σ to be
max{ √( χ2_{1−α/2, n−1} / (n − 1) ), √( (n − 1) / χ2_{α/2, n−1} ) }    (4.2)
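A sketch of computing this MMOE in R (assuming the square-root form above, since the margin of error is for σ rather than σ2):

mmoe_sigma <- function(n, alpha = 0.05) {
  max(sqrt(qchisq(1 - alpha / 2, n - 1) / (n - 1)),
      sqrt((n - 1) / qchisq(alpha / 2, n - 1)))
}
mmoe_sigma(20)   # about 1.46: sigma is known only to within a factor of ~1.5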
Fig. 4.1 Multiple of p that n must be to achieve a relative drop from R2 to R2adj by
the indicated relative factor (left panel, 3 factors) or absolute difference (right panel,
6 decrements)
4.5 Shrinkage
The term shrinkage is used in regression modeling to denote two ideas. The
first meaning relates to the slope of a calibration plot, which is a plot of
observed responses against predicted responses.k When a dataset is used to
fit the model parameters as well as to obtain the calibration plot, the usual
estimation process will force the slope of observed versus predicted values to
be one. When, however, parameter estimates are derived from one dataset
and then applied to predict outcomes on an independent dataset, overfitting
will cause the slope of the calibration plot (i.e., the shrinkage factor ) to be less
than one, a result of regression to the mean. Typically, low predictions will be
too low and high predictions too high. Predictions near the mean predicted
value will usually be quite accurate. The second meaning of shrinkage is a
statistical estimation method that preshrinks regression coefficients towards
zero so that the calibration plot for new data will not need shrinkage as its
calibration slope will be one.
We turn first to shrinkage as an adverse result of traditional modeling.
In ordinary linear regression, we know that all of the coefficient estimates
are exactly unbiased estimates of the true effect when the model fits. Isn’t
the existence of shrinkage and overfitting implying that there is some kind
of bias in the parameter estimates? The answer is no because each separate
coefficient has the desired expectation. The problem lies in how we use the
coefficients. We tend not to pick out coefficients at random for interpretation
but we tend to highlight very small and very large coefficients.
A simple example may suffice. Consider a clinical trial with 10 randomly
assigned treatments such that the patient responses for each treatment are
normally distributed. We can do an ANOVA by fitting a multiple regres-
sion model with an intercept and nine dummy variables. The intercept is an
unbiased estimate of the mean response for patients on the first treatment,
and each of the other coefficients is an unbiased estimate of the difference
in mean response between the treatment in question and the first treatment.
βˆ0 + βˆ1 is an unbiased estimate of the mean response for patients on the
second treatment. But if we plotted the predicted mean response for patients
against the observed responses from new data, the slope of this calibration
plot would typically be smaller than one. This is because in making this plot
we are not picking coefficients at random but we are sorting the coefficients
into ascending order. The treatment group having the lowest sample mean
response will usually have a higher mean in the future, and the treatment
group having the highest sample mean response will typically have a lower
mean in the future. The sample mean of the group having the highest sample
mean is not an unbiased estimate of its population mean.
k An even more stringent assessment is obtained by stratifying calibration curves by
predictor settings.
Fig. 4.2 Sorted means from 20 samples of size 50 from a uniform [0, 1] distribution.
The reference line at 0.5 depicts the true population value of all of the means.
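The simulation behind Figure 4.2 is easy to reproduce; a sketch:

set.seed(3)
means <- sort(replicate(20, mean(runif(50))))   # 20 groups of 50 uniforms
plot(means, xlab = 'Group (sorted by sample mean)', ylab = 'Group Mean')
abline(h = 0.5)   # the true mean of every group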
γ̂ = (model χ2 − p) / model χ2 ,    (4.3)
where p is the total degrees of freedom for the predictors and model χ2 is
the likelihood ratio χ2 statistic for testing the joint influence of all predictors
simultaneously (see Section 9.3.1).13 For ordinary linear models, van Houwelingen
and le Cessie proposed a shrinkage factor γ̂ that can be shown to equal

[(n − p − 1)/(n − 1)] × (R2_adj / R2 ),

where the adjusted R2 is given by14

R2_adj = 1 − (1 − R2 )(n − 1)/(n − p − 1).    (4.4)
For such linear models with an intercept β0 , the shrunken estimate of β is

β̂0s = (1 − γ̂)Ȳ + γ̂ β̂0 ,    β̂js = γ̂ β̂j for j = 1, . . . , p,

where Ȳ is the mean of the response vector. Again, when stepwise fitting is
used, the p in these equations is much closer to the number of candidate
degrees of freedom rather than the number in the “final” model.15 See Section 5.3
for methods of estimating γ using the bootstrap (p. 115) or cross-validation.
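A one-line sketch of the heuristic shrinkage estimate (4.3):

gamma.hat <- function(lr.chisq, p) (lr.chisq - p) / lr.chisq
gamma.hat(50, 10)  # 0.8: predictions expected to be ~20% too extreme on new data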
Now turn to the second usage of the term shrinkage. Just as clothing is
sometimes preshrunk so that it will not shrink further once it is purchased,
better calibrated predictions result when shrinkage is built into the estima-
tion process in the first place. The object of shrinking regression coefficient
estimates is to obtain a shrinkage coefficient of γ = 1 on new data. Thus by
somewhat discounting β̂ we make the model underfitted on the data at hand
(i.e., apparent γ < 1) so that on new data extremely low or high predictions
are correct.
Ridge regression388, 633 is one technique for placing restrictions on the pa-
rameter estimates that results in shrinkage. A ridge parameter must be chosen
to control the amount of shrinkage. Penalized maximum likelihood estima-
tion,237, 272, 388, 639 a generalization of ridge regression, is a general shrinkage
4.6 Collinearity
When at least one of the predictors can be predicted well from the other
predictors, the standard errors of the regression coefficient estimates can be
inflated and corresponding tests have reduced power.217 In stepwise variable
selection, collinearity can cause predictors to compete and make the selection
of “important” variables arbitrary. Collinearity makes it difficult to estimate
and interpret a particular regression coefficient because the data have little
information about the effect of changing one variable while holding another
(highly correlated) variable constant [101, Chap. 9]. However, collinearity
does not affect the joint influence of highly correlated variables when tested
simultaneously. Therefore, once groups of highly correlated predictors are
identified, the problem can be rectified by testing the contribution of an
entire set with a multiple d.f. test rather than attempting to interpret the
coefficient or one d.f. test for a single predictor.
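A small simulation sketch of this point: two nearly collinear predictors have inflated individual standard errors, yet the joint 2 d.f. test of the pair is unaffected.

set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
y  <- x1 + x2 + rnorm(n)
f  <- lm(y ~ x1 + x2)
summary(f)$coefficients          # large SEs for each coefficient separately
anova(lm(y ~ 1), f)              # joint test of x1 and x2: highly significant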
Collinearity does not affect predictions made on the same dataset used to
estimate the model parameters or on new data that have the same degree
of collinearity as the original data [470, pp. 379–381] as long as extreme
extrapolation is not attempted. Consider as two predictors the total and LDL
cholesterols that are highly correlated. If predictions are made at the same
combinations of total and LDL cholesterol that occurred in the training data,
no problem will arise. However, if one makes a prediction at an inconsistent
combination of these two variables, the predictions may be inaccurate and
have high standard errors.
When the ordinary truncated power basis is used to derive component
variables for fitting linear and cubic splines, as was described earlier, the
component variables can be very collinear. It is very unlikely that this will
result in any problems, however, as the component variables are connected
algebraically. Thus it is not possible for a combination of, for example, x and
max(x − 10, 0) to be inconsistent with each other. Collinearity problems are
then more likely to result from partially redundant subsets of predictors as
in the cholesterol example above.
4.7 Data Reduction
The sample size need not be as large as shown in Table 4.1 if the model
is to be validated independently and if you don’t care that the model may
fail to validate. However, it is likely that the model will be overfitted and
will not validate if the sample size does not meet the guidelines. Use of data
reduction methods before model development is strongly recommended if the
conditions in Table 4.1 are not satisfied, and if shrinkage is not incorporated
into parameter estimation. Methods such as shrinkage and data reduction
reduce the effective d.f. of the model, making it more likely for the model
to validate on future data. Data reduction is aimed at reducing the number
of parameters to estimate in the model, without distorting statistical infer-
ence for the parameters. This is accomplished by ignoring Y during data
reduction. Manipulations of X in unsupervised learning may result in a loss
of information for predicting Y , but when the information loss is small, the
gain in power and reduction of overfitting more than offset the loss.
Some available data reduction methods are given below.
1. Use the literature to eliminate unimportant variables.
2. Eliminate variables whose distributions are too narrow.
3. Eliminate candidate predictors that are missing in a large number of sub-
jects, especially if those same predictors are likely to be missing for future
applications of the model.
4. Use a statistical data reduction method such as incomplete principal com-
ponent regression, nonlinear generalizations of principal components such
There are many approaches to data reduction. One rigorous approach involves
removing predictors that are easily predicted from other predictors, using
flexible parametric additive regression models. This approach is unlikely to
result in a major reduction in the number of regression coefficients to estimate
against Y , but will usually provide insights useful for later data reduction
over and above the insights given by methods based on pairwise correlations
instead of multiple R2 .
The Hmisc redun function implements the following redundancy checking
algorithm.
• Expand each continuous predictor into restricted cubic spline basis func-
tions. Expand categorical predictors into dummy variables.
• Use OLS to predict each predictor with all component terms of all remain-
ing predictors (similar to what the Hmisc transcan function does). When the
predictor is expanded into multiple terms, use the first canonical variate.m
• Remove the predictor that can be predicted from the remaining set with
the highest adjusted or regular R2 .
• Predict all remaining predictors from their complement.
• Continue in like fashion until no variable still in the list of predictors can
be predicted with an R2 or adjusted R2 greater than a specified threshold
or until dropping the variable with the highest R2 (adjusted or ordinary)
would cause a variable that was dropped earlier to no longer be predicted
at the threshold from the now smaller list of predictors.
Special consideration must be given to categorical predictors. One way to
consider a categorical variable redundant is if a linear combination of dummy
variables representing it can be predicted from a linear combination of other
variables. For example, if there were 4 cities in the data and each city’s rainfall
was also present as a variable, with virtually the same rainfall reported for
all observations for a city, city would be redundant given rainfall (or vice-
versa). If two cities had the same rainfall, ‘city’ might be declared redundant
even though tied cities might be deemed non-redundant in another setting. A
second, more stringent way to check for redundancy of a categorical predictor
is to ascertain whether all dummy variables created from the predictor are
individually redundant. The redun function implements both approaches.
Examples of use of redun are given in two case studies.19
m There is an option to force continuous variables to be linear when they are being
predicted.
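A hedged usage sketch of redun; the data frame and R2 threshold below are hypothetical:

library(Hmisc)
set.seed(5)
d <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)         # nearly redundant with x1
redun(~ x1 + x2 + x3, data = d, r2 = 0.9)   # flags x1 or x2 as redundant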
Scaling techniques often allow the analyst to reduce the number of parameters
to fit by estimating transformations for each predictor using only information
about associations with other predictors. It may be advisable to cluster vari-
ables before scaling so that patterns are derived only from variables that are
related. For purely categorical predictors, methods such as correspondence
analysis (see, for example, [108, 139, 239, 391, 456]) can be useful for data reduc-
tion. Often one can use these techniques to scale multiple dummy variables
into a few dimensions. For mixtures of categorical and continuous predictors,
qualitative principal component analysis such as the maximum total variance
(MTV) method of Young et al.456, 680 is useful. For the special case of repre-
senting a series of variables with one PC, the MTV method is quite easy to
implement.
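A rough sketch of the MTV idea for a single PC follows, assuming the alternating steps are: compute PC1 of the current transformed variables, then re-estimate each variable's transformation by regressing PC1 on a spline in that variable. The use of rcspline.eval, the number of knots, and the fixed iteration count are all illustrative assumptions, not a definitive implementation.

require(Hmisc)
set.seed(2)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
X  <- scale(cbind(x1, x2, x3))          # start with standardized raw variables
for(iter in 1:10) {                     # fixed iteration count for simplicity
  pc1 <- prcomp(X)$x[, 1]               # first principal component
  # re-transform each variable by predicting PC1 from a spline in it
  X[, 1] <- fitted(lm(pc1 ~ rcspline.eval(x1, nk=4, inclx=TRUE)))
  X[, 2] <- fitted(lm(pc1 ~ rcspline.eval(x2, nk=4, inclx=TRUE)))
  X[, 3] <- fitted(lm(pc1 ~ rcspline.eval(x3, nk=4, inclx=TRUE)))
  X <- scale(X)                         # keep variables standardized
}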
[transcan output: convergence criterion 2.901, 0.035, 0.007; convergence in 4 iterations; the table of R2 and adjusted R2 achieved in predicting each variable is not recovered here]
^n If one were to estimate transformations without removing observations that had these constants inserted for the current Y-variable, the resulting transformations would likely have a spike at Y = imputation constant.
^o Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments.
w$imputed$blood.pressure
     400      401
132.4057 109.7741

t ← w$transformed
spe ← round(c(spearman(heart.rate, blood.pressure),
              spearman(t[,'heart.rate'], t[,'blood.pressure'])), 2)
Fig. 4.3 Transformations fitted using transcan. Tick marks indicate the two imputed
values for blood pressure.
Spearman’s rank correlation ρ between pairs of heart rate and blood pressure was −0.02, because these variables each require U-shaped transformations. Using restricted cubic splines with five knots placed at default quantiles, transcan provided the transformations shown in Figure 4.3. The correlation between the transformed variables is ρ = −0.13. The fitted transformations are similar to those obtained from relating these two variables to time until death.
Fig. 4.4 The lower left plot contains raw data (Spearman ρ = −0.02); the lower right
is a scatterplot of the corresponding transformed values (ρ = −0.13). Data courtesy
of the SUPPORT study352.
One approach to summarizing a series of related dichotomous predictors is to construct new predictors representing whether any of the factors is positive and a count of the number of positive factors. For the ordinal count of the number of positive factors, score the summary variable to satisfy linearity assumptions as discussed previously. For the more powerful predictor of the two summary measures, test the adequacy of scoring by using all of the dichotomous variables as candidate predictors after adjusting for the new summary variable. A residual χ2 statistic can be used to test whether the summary variable adequately captures the predictive information of the series of binary predictors.^p This
statistic will have degrees of freedom equal to one less than the number of
binary predictors when testing for adequacy of the summary count (and hence
will have low power when there are many predictors). Stratification by the
summary score and examination of responses over cells can be used to suggest
a transformation on the score.
Another approach to scoring a series of related dichotomous predictors is to
have “experts” assign severity points to each condition and then to either sum
these points or use a hierarchical rule that scores according to the condition
with the highest points (see Section 14.3 for an example). The latter has the
advantage of being easy to implement for field use. The adequacy of either
type of scoring can be checked using tests of linearity in a regression model.^q
^p Whether this statistic should be used to change the model is problematic in view of model uncertainty.
^q The R function score.binary in the Hmisc package (see Section 6.2) assists in computing a summary variable from the series of binary conditions.
Suppose the full model yields a likelihood ratio χ2 statistic of LR on p degrees of freedom, and that data reduction results in a model with q degrees of freedom; if the omitted variables carry only chance-level information, the χ2 expected to be lost would then be p − q. The shrinkage factor for the reduced model is then
on average [LR − (p − q) − q]/[LR − (p − q)]. Setting this ratio to be ≥ 0.9
and solving for q gives q ≤ (LR− p)/9. Therefore, reduction of dimensionality
down to q degrees of freedom would be expected to achieve < 10% shrinkage.
With these assumptions, there is no hope that a reduced model would have
acceptable calibration unless LR > p + 9. If the information explained by the
omitted variables is less than one would expect by chance (e.g., their total
χ2 is extremely small), a reduced model could still be beneficial, as long as
the conservative bound (LR − q)/LR ≥ 0.9 or q ≤ LR/10 were achieved. This
conservative bound assumes that no χ2 is lost by the reduction, that is, that the final model χ2 ≈ LR. This is unlikely in practice. Had the p − q omitted variables instead carried a larger χ2 of 2(p − q) (the break-even point for AIC), q would have to be ≤ (LR − 2p)/8.
As an example, suppose that a binary logistic model is being developed
from a sample containing 45 events on 150 subjects. The 10:1 rule suggests
we can analyze 4.5 degrees of freedom. The analyst wishes to analyze age, sex, and 10 other variables. It is not known whether an interaction between age and sex exists, or whether age operates linearly. A restricted cubic spline with four knots is fitted for age, and a linear interaction is allowed between age and sex. These two variables then require 3 + 1 + 1 = 5 degrees of freedom. The other 10 variables are assumed to operate linearly and not to interact with one another or with age and sex. There is a total of 15 d.f. The full model with 15 d.f. has
LR = 50. Expected shrinkage from this model is (50 − 15)/50 = 0.7. Since
LR > 15 + 9 = 24, some reduction might yield a better validating model.
Reduction to q = (50 − 15)/9 ≈ 4 d.f. would be necessary, assuming the
reduced LR is about 50 − (15 − 4) = 39. In this case the 10:1 rule yields
about the same value for q. The analyst may be forced to assume that age is
linear, modeling 3 d.f. for age and sex. The other 10 variables would have to
be reduced to a single variable using principal components or another scaling
technique. The AIC-based calculation yields a maximum of 2.5 d.f.
If the goal of the analysis is to make a series of hypothesis tests (adjusting
P -values for multiple comparisons) instead of to predict future responses, the
full model would have to be used.
A summary of the various data reduction methods is given in Figure 4.5.
When principal component analysis or related methods are used for data
reduction, the model may be harder to describe since internal coefficients are
“hidden.” R code on p. 141 shows how an ordinary linear model fit can be used in conjunction with a logistic model fit based on principal components to draw a nomogram with axes for all predictors.
Fig. 4.5 Summary of data reduction methods.

Variable clustering
• Goal: group predictors so that each group represents a single dimension that can be summarized with a single score
• Reasons: ↓ d.f. arising from multiple predictors; make PC1 a more reasonable summary
• Methods: subject matter knowledge; grouping predictors to maximize the proportion of variance explained by PC1 of each group; hierarchical clustering using a matrix of similarity measures between predictors

Principal components
• Goal: multiple dimensional scoring of all predictors
• Reason: ↓ d.f. for all predictors combined
• Method: PCs 1, 2, . . . , k (k < p) computed from all transformed predictors
the other (X, Y ) pairs caused a χ2 of 36 for testing nonlinearity of the pre-
dictor. Remember that an imperfectly fitting model is a fact of life, and discarding the offending observations can falsely inflate the model’s apparent predictive accuracy. On rare occasions, such lack of fit may lead the analyst to make changes in the model’s structure, but ordinarily this is best done from the “ground up” using formal tests of lack of fit (e.g., a test of linearity or interaction).
Influential observations of the second and third kinds can often be detected
by careful quality control of the data. Statistical measures can also be helpful.
The most common measures that apply to a variety of regression models are
leverage, DFBETAS, DFFIT, and DFFITS.
Leverage measures the capacity of an observation to be influential due
to having extreme predictor values. Such an observation is not necessarily
influential. To compute leverage in ordinary least squares, we define the hat
matrix H given by
H = X(X′X)^−1 X′.   (4.6)
H is the matrix that when multiplied by the response vector gives the pre-
dicted values, so it measures how an observation estimates its own predicted
response. The diagonals hii of H are the leverage measures and they are not
influenced by Y. It has been suggested47 that hii > 2(p + 1)/n signals a high
leverage point, where p is the number of columns in the design matrix X
aside from the intercept and n is the number of observations. Some believe
that the distribution of hii should be examined for values that are higher
than typical.
DFBETAS is the change in the vector of regression coefficient estimates
upon deletion of each observation in turn, scaled by their standard errors.47
Since DFBETAS encompasses an effect for each predictor’s coefficient, DF-
BETAS allows the analyst to isolate the problem better than some of the
other measures. DFFIT is the change in the predicted Xβ when the observa-
tion is dropped, and DFFITS is DFFIT standardized by the standard error
of the estimate of Xβ. In both cases, the standard error used for normal-
ization is recomputed each time an observation is omitted.
Some classify an observation as overly influential when |DFFITS| > 2√[(p + 1)/(n − p − 1)], while others prefer to examine the entire distribution of DFFITS to identify “outliers.”47
Section 10.7 discusses influence measures for the logistic model, which
requires maximum likelihood estimation. These measures require the use of
special residuals and information matrices (in place of X ′ X).
If truly influential observations are identified using these indexes, careful
thought is needed to decide how (or whether) to deal with them. Most im-
portant, there is no substitute for careful examination of the dataset before
doing any analyses.99 Spence and Garrison [581, p. 16] feel that
Although the identification of aberrations receives considerable attention in
most modern statistical courses, the emphasis sometimes seems to be on dis-
posing of embarrassing data by searching for sources of technical error or …
4.10 Comparing Two Models

Frequently one wants to choose between two competing models on the basis of a common set of observations. The methods that follow assume that
the performance of the models is evaluated on a sample not used to develop
either one. In this case, predicted values from the model can usually be con-
sidered as a single new variable for comparison with responses in the new
dataset. The methods listed below will also work if the models are com-
pared using the same set of data used to fit each one, as long as both models
have the same effective number of (candidate or actual) parameters. This
requirement prevents us from rewarding a model just because it overfits the
training sample (see Section 9.8.1 for a method comparing two models of dif-
fering complexity). The methods can also be enhanced using bootstrapping
or cross-validation on a single sample to get a fair comparison when the play-
ing field is not level, for example, when one model had more opportunity for
fitting or overfitting the responses.
Some of the criteria for choosing one model over the other are
1. calibration (e.g., one model is well-calibrated and the other is not),
2. discrimination,
3. face validity,
4. measurement errors in required predictors,
5. use of continuous predictors (which are usually better defined than cate-
gorical ones),
6. omission of “insignificant” variables that nonetheless make sense as risk
factors,
7. simplicity (although this is less important with the availability of comput-
ers), and
8. lack of fit for specific types of subjects.
Items 3 through 7 require subjective judgment, so we focus on the other as-
pects. If the purpose of the models is only to rank-order subjects, calibration
is not an issue. Otherwise, a model having poor calibration can be dismissed
outright. Given that the two models have similar calibration, discrimination
should be examined critically. Various statistical indexes can quantify dis-
crimination ability (e.g., R2 , model χ2 , Somers’ Dxy , Spearman’s ρ, area un-
der ROC curve—see Section 10.8). Rank measures (Dxy , ρ, ROC area) only
measure how well predicted values can rank-order responses. For example,
predicted probabilities of 0.01 and 0.99 for a pair of subjects are no better
than probabilities of 0.2 and 0.8 using rank measures, if the first subject had
a lower response value than the second. Therefore, rank measures such as
ROC area (c index), although fine for describing a given model, may not be
very sensitive in choosing between two models118, 488, 493 . This is especially
true when the models are strong, as it is easier to move a rank correlation
from 0.6 to 0.7 than it is to move it from 0.9 to 1.0. Measures such as R2 and
the model χ2 statistic (calculated from the predicted and observed responses)
are more sensitive. Still, one may not know how to interpret the added utility
of a model that boosts the R2 from 0.80 to 0.81.
Again given that both models are equally well calibrated, discrimination
can be studied more simply by examining the distribution of predicted values
Ŷ . Suppose that the predicted value is the probability that a subject dies.
Then high-resolution histograms of the predicted risk distributions for the
two models can be very revealing. If one model assigns 0.02 of the sample to
a risk of dying above 0.9 while the other model assigns 0.08 of the sample to
the high risk group, the second model is more discriminating. The worth of a
model can be judged by how far it goes out on a limb while still maintaining
good calibration.25
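A sketch of this comparison, assuming f1 and f2 are two previously fitted binary logistic models from lrm (both names hypothetical):

p1 <- predict(f1, type='fitted')   # predicted probabilities from model 1
p2 <- predict(f2, type='fitted')
hist(p1, breaks=100); hist(p2, breaks=100)   # high-resolution risk histograms
c(mean(p1 > 0.9), mean(p2 > 0.9))  # fraction of sample assigned risk above 0.9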
Frequently, one model will have a similar discrimination index to another
model, but the likelihood ratio χ2 statistic is meaningfully greater for one. As-
suming corrections have been made for complexity, the model with the higher
χ2 usually has a better fit for some subjects, although not necessarily for the
average subject. A crude plot of predictions from the first model against
predictions from the second, possibly stratified by Y , can help describe the
differences in the models. More specific analyses will determine the charac-
teristics of subjects where the differences are greatest. Large differences may
be caused by an omitted, underweighted, or improperly transformed predic-
tor, among other reasons. In one example, two models for predicting hospital
mortality in critically ill patients had the same discrimination index (to two
decimal places). For the relatively small subset of patients with extremely low
white blood counts or serum albumin, the model that treated these factors
as continuous variables provided predictions that were very much different
from a model that did not.
When comparing predictions from two models that may not be calibrated (e.g., because of overfitting), the two sets of predictions may be shrunk so as not to give credit for overfitting (see Equation 4.3).
Sometimes one wishes to compare two models that used the response vari-
able differently, a much more difficult problem. For example, an investigator
may want to choose between a survival model that used time as a continuous
variable, and a binary logistic model for dead/alive at six months. Here, other
considerations are also important (see Section 17.1). A model that predicts
dead/alive at six months does not use the response variable effectively, and
it provides no information on the chance of dying within three months.
When one or both of the models is fitted using least squares, it is useful
to compare them using an error measure that was not used as the optimization criterion, such as the mean absolute error or median absolute error. Mean and median absolute errors are excellent measures for judging the value of a model developed without transforming the response against a model fitted after transforming Y and then back-transforming to get predictions.
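A small simulated sketch of such a comparison (the data-generating model and the log transformation are arbitrary illustrative choices):

set.seed(4)
n <- 200
x <- runif(n)
y <- exp(1 + 2*x + rnorm(n)/2)       # positively skewed response
f.raw <- lm(y ~ x)                   # fit on the original scale
f.log <- lm(log(y) ~ x)              # fit after transforming Y
pr <- predict(f.raw)
pl <- exp(predict(f.log))            # back-transform to the original scale
c(mean(abs(y - pr)),   mean(abs(y - pl)))    # mean absolute errors
c(median(abs(y - pr)), median(abs(y - pl)))  # median absolute errors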
Several modeling philosophies are possible:
• Develop a black box model that performs well and is difficult to interpret
• Develop interpretable approximations to the black box
• Develop an interpretable model (e.g., giving priority to additive effects) that performs well and is likely to perform equally well on future data from the same stream.
As stated in the Preface, the strategy emphasized in this text, stemming
from the last philosophy, is to decide how many degrees of freedom can be
“spent,” where they should be spent, and then to spend them. If statistical
tests or confidence limits are required, later reconsideration of how d.f. are
spent is not usually recommended. In what follows some default strategies
are elaborated. These strategies are far from failsafe, but they should allow
the reader to develop a strategy that is tailored to a particular problem. At
the least these default strategies are concrete enough to be criticized so that
statisticians can devise better ones.
4. The analyst need not be very concerned about conserving degrees of free-
dom devoted to the predictor of interest. The complexity allowed for this
variable is usually determined by prior beliefs, with compromises that con-
sider the bias-variance trade-off.
5. If penalized estimation is used, the analyst may wish to not shrink param-
eter estimates for the predictor of interest.
6. Model validation is not necessary unless the analyst wishes to use it to
quantify the degree of overfitting.
4.13 Further Reading

1 Some good general references that address modeling strategies are [216, 269, 476, 590].
2 Even though they used a generalized correlation index for screening variables
and not for transforming them, Hall and Miller249 present a related idea, com-
puting the ordinary R2 against a cubic spline transformation of each potential
predictor.
3 Simulation studies are needed to determine the effects of modifying the model
based on assessments of “predictor promise.” Although it is unlikely that this
strategy will result in regression coefficients that are biased high in absolute
value, it may on some occasions result in somewhat optimistic standard errors
and a slight elevation in type I error probability. Some simulation results may
be found on the Web site. Initial promising findings for least squares models
for two uncorrelated predictors indicate that the procedure is conservative in
its estimation of σ 2 and in preserving type I error.
4 Verweij and van Houwelingen640 and Shao565 describe how cross-validation can
be used in formulating a stopping rule. Luo et al.430 developed an approach to
tuning forward selection by adding noise to Y .
5 Roecker528 compared forward variable selection (FS) and all possible subsets
selection (APS) with full model fits in ordinary least squares. APS had a greater
tendency to select smaller, less accurate models than FS. Neither selection tech-
nique was as accurate as the full model fit unless more than half of the candidate
variables were redundant or unnecessary.
6 Wiegand668 showed that it is not very fruitful to try different stepwise algo-
rithms and then to be comforted by agreements in some of the variables selected.
It is easy for different stepwise methods to agree on the wrong set of variables.
7 Other results on how variable selection affects inference may be found in Hurvich
and Tsai316 and Breiman [66, Section 8.1].
8 Goring et al.227 presented an interesting analysis of the huge bias caused by
conditioning analyses on statistical significance in a high-dimensional genetics
context.
9 Steyerberg et al.589 have comparisons of smoothly penalized estimators with
the lasso and with several stepwise variable selection algorithms.
10 See Weiss,656 Faraway,186 and Chatfield100 for more discussions of the effect of
not prespecifying models, for example, dependence of point estimates of effects
on the variables used for adjustment.
11 Greenland241 provides an example in which overfitting a logistic model resulted
in far too many predictors with P < 0.05.
12 See Peduzzi et al.486, 487 for studies of the relationship between “events per
variable” and types I and II error, accuracy of variance estimates, and accuracy
of normal approximations for regression coefficient estimators. Their findings
are consistent with those given in the text (but644 has a slightly different take).
van der Ploeg et al.629 did extensive simulations to determine the events per
variable ratio needed to avoid a drop-off (in an independent test sample) in more
than 0.01 in the c-index, for a variety of predictive methods. They concluded
that support vector machines, neural networks, and random forests needed far
more events per variable to achieve freedom from overfitting than does logistic
regression, and that recursive partitioning was not competitive. Logistic regres-
sion required between 20 and 50 events per variable to avoid overfitting. Differ-
ent results might have been obtained had the authors used a proper accuracy
score.
13 Copas [122, Eq. 8.5] adds 2 to the numerator of Equation 4.3 (see also [504,631]).
4.14 Problems
^b There are decompositions of the Brier score into discrimination and calibration components.
5.2 The Bootstrap

When one assumes that a random variable Y has a certain population dis-
tribution, one can use simulation or analytic derivations to study how a sta-
tistical estimator computed from samples from this distribution behaves. For
example, when Y has a log-normal distribution, the variance of the sample
median for a sample of size n from that distribution can be derived analyt-
ically. Alternatively, one can simulate 500 samples of size n from the log-
normal distribution, compute the sample median for each sample, and then
compute the sample variance of the 500 sample medians. Either case requires
knowledge of the population distribution function.
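The simulation route is only a few lines in R; for instance, for the log-normal case just described:

set.seed(5)
n <- 50
medians <- replicate(500, median(rlnorm(n)))  # 500 samples of size n
var(medians)    # simulation estimate of the variance of the sample median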
Efron’s bootstrap 150, 177, 178 is a general-purpose technique for obtaining es-
timates of the properties of statistical estimators without making assumptions
about the distribution giving rise to the data. Suppose that a random variable
Y comes from a cumulative distribution function F (y) = Prob{Y ≤ y} and
that we have a sample of size n from this unknown distribution, Y1 , Y2 , . . . , Yn .
The basic idea is to repeatedly simulate a sample of size n from F , computing
the statistic of interest, and assessing how the statistic behaves over B rep-
etitions. Not having F at our disposal, we can estimate F by the empirical
cumulative distribution function

Fn(y) = (1/n) Σi=1..n [Yi ≤ y],   (5.1)

where [Yi ≤ y] is 1 if Yi ≤ y and 0 otherwise.
[Figure: empirical cumulative distribution function estimate; vertical axis is Prob[X ≤ x], ranging from 0.0 to 1.0]
is only used to estimate bias in the sample median). The first 20 bootstrap
samples (after sorting data values) and the corresponding sample medians
are shown in Table 5.1.
For a given number B of bootstrap samples, our estimates are simply
the sample 0.1 and 0.9 quantiles of the sample medians, and the mean of
the sample medians. Not knowing how large B should be, we could let B
range from, say, 50 to 1000, stopping when we are sure the estimates have
converged. In the left plot of Figure 5.2, B varies from 1 to 400 for the mean
(10 to 400 for the quantiles). It can be seen that the bootstrap estimate of the
population mean of the sample median can be estimated satisfactorily when
B > 50. For the lower and upper limits of the 0.8 confidence interval for the
population median Y , B must be at least 200. For more extreme confidence
limits, B must be higher still.
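A compact sketch of this computation (data simulated here; in practice y would be the observed sample and F would be unknown):

set.seed(6)
y <- rlnorm(100)
B <- 400
meds <- replicate(B, median(sample(y, replace=TRUE)))   # bootstrap medians
mean(meds)                    # bootstrap estimate of the mean of the median
quantile(meds, c(0.1, 0.9))   # 0.8 confidence interval as described above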
For the final set of 400 sample medians, a histogram (right plot in Fig-
ure 5.2) can be used to assess the form of the sampling distribution of the
sample median. Here, the distribution is almost normal, although there is a
slightly heavy left tail that comes from the data themselves having a heavy left
tail. For large samples, sample medians are normally distributed for a wide
variety of population distributions. Therefore we could use bootstrapping to
estimate the variance of the sample median and then take ±1.28 standard
errors as a 0.80 confidence interval. In other cases (e.g., regression coefficient
estimates for certain models), estimates are asymmetrically distributed, and
the bootstrap quantiles are better estimates than confidence intervals that
are based on a normality assumption. Note that because sample quantiles
are more or less restricted to equal one of the values in the sample, the bootstrap distribution of the sample median is discrete and can take on relatively few distinct values.
[Figure 5.2: left panel shows the mean and the 0.1, 0.9 quantiles of the bootstrap sample medians as the number of bootstrap samples used grows from 0 to 400; right panel shows a frequency histogram of the 400 bootstrap sample medians]
5.3 Model Validation

5.3.1 Introduction
The surest method to have a model fit the data at hand is to discard much
of the data. A p-variable fit to p + 1 observations will perfectly predict Y as
long as no two observations have the same Y . Such a model will, however,
yield predictions that appear almost random with respect to responses on
a different dataset. Therefore, unbiased estimates of predictive accuracy are
essential.
Model validation is done to ascertain whether predicted values from the
model are likely to accurately predict responses on future subjects or sub-
jects not used to develop our model. Three major causes of failure of the model to validate are overfitting, changes in measurement methods or in the definitions of categorical variables, and major changes in subject inclusion criteria.
There are two major modes of model validation, external and internal. The
most stringent external validation involves testing a final model developed in
one country or setting on subjects in another country or setting at another
time. This validation would test whether the data collection instrument was
translated into another language properly, whether cultural differences make
earlier findings nonapplicable, and whether secular trends have changed as-
sociations or base rates. Testing a finished model on new subjects from the same geographic area but from a different institution than that of the subjects used to fit
the model is a less stringent form of external validation. The least stringent
form of external validation involves using the first m of n observations for
model training and using the remaining n − m observations as a test sample.
This is very similar to data-splitting (Section 5.3.3). For details about meth-
ods for external validation see the R val.prob and val.surv functions in the
rms package.
Even though external validation is frequently favored by non-statisticians,
it is often problematic. Holding back data from the model-fitting phase results in lower precision and power, and one can increase precision and learn
more about geographic or time differences by fitting a unified model to the
entire subject series including, for example, country or calendar time as a
main effect and/or as an interacting effect. Indeed one could use the follow-
ing working definition of external validation: validation of a prediction tool
using data that were not available when the tool needed to be completed. An
alternate definition could be taken as the validation of a prediction tool by
an independent research team.
One suggested hierarchy of the quality of various validation methods is as
follows, ordered from worst to best.
1. Attempting several validations (internal or external) and reporting only
the one that “worked”
2. Reporting apparent performance on the training dataset (no validation)
3. Reporting predictive accuracy on an undersized independent test sample
4. Internal validation using data-splitting where at least one of the training
and test samples is not huge and the investigator is not aware of the
arbitrariness of variable selection done on a single sample
5. Strong internal validation using 100 repeats of 10-fold cross-validation or several hundred bootstrap resamples, repeating all analysis steps involving Y afresh at each resample, and reporting the arbitrariness of the “important variables” selected (if variable selection is used)
6. External validation on a large test sample, done by the original research
team
7. Re-analysis by an independent research team using strong internal valida-
tion of the original dataset
8. External validation using new test data, done by an independent research
team
9. External validation using new test data generated using different instru-
ments/technology, done by an independent research team
Internal validation involves fitting and validating the model by carefully
using one series of subjects. One uses the combined dataset in this way to
estimate the likely performance of the final model on new subjects, which
after all is often of most interest. Most of the remainder of Section 5.3 deals
with internal validation.
the R2 achieved on new subjects will be close to zero. In this case, dramatic overfitting has occurred.
The adjusted R2 (Equation 4.4) solves this problem, at least when the model has been completely prespecified and no variables or parameters have been “screened” out of the final model fit. That is, the adjusted R2 is only valid when p in its formula is honest: it must include all parameters ever examined (formally or informally, e.g., using graphs or tables), whether or not these parameters are in the final model.
Quite often we need to validate indexes other than R2 for which adjust-
ments for p have not been created.^c We also need to validate models contain-
ing “phantom degrees of freedom” that were screened out earlier, formally
or informally. For these purposes, we obtain nearly unbiased estimates of R2
or other indexes using data splitting, cross-validation, or the bootstrap. The
bootstrap provides the most precise estimates.
The g-index is another discrimination measure to validate. But g and R2 measure only one aspect of predictive ability. In general, there are two major
aspects of predictive accuracy that need to be assessed. As discussed in Sec-
tion 4.5, calibration or reliability is the ability of the model to make unbiased
estimates of outcome. Discrimination is the model’s ability to separate sub-
jects’ outcomes. Validation of the model is recommended even when a data
reduction technique is used. This is a way to ensure that the model was not
overfitted or is otherwise inaccurate.
5.3.3 Data-Splitting
^c For example, in the binary logistic model, there is a generalization of R2 available, but no adjusted version. For logistic models we often validate other indexes such as the ROC area or the rank correlation between predicted probabilities and observed outcomes. We also validate the calibration accuracy of Ŷ in predicting Y.
of squares, and SSE can be greater than SST (when predictions are worse than the constant predictor Ȳ).
To be able to validate predictions from the model over an entire test sam-
ple (without validating it separately in particular subsets such as in males
and females), the test sample must be large enough to precisely fit a model
containing one predictor. For a study with a continuous uncensored response
variable, the test sample size should ordinarily be ≥ 100 at a bare minimum.
For survival time studies, the test sample should at least be large enough
to contain a minimum of 100 outcome events. For binary outcomes, the test
sample should contain a bare minimum of 100 subjects in the least frequent
outcome category. Once the size of the test sample is determined, the remain-
ing portion of the original sample can be used as a training sample. Even with
these test sample sizes, validation of extreme predictions is difficult.
Data-splitting has the advantage of allowing hypothesis tests to be con-
firmed in the test sample. However, it has the following disadvantages.
1. Data-splitting greatly reduces the sample size for both model development
and model testing. Because of this, Roecker528 found this method “appears
to be a costly approach, both in terms of predictive accuracy of the fitted
model and the precision of our estimate of the accuracy.” Breiman [66,
Section 1.3] found that bootstrap validation on the original sample was as
efficient as having a separate test sample twice as large.36
2. It requires a larger sample to be held out than cross-validation (see be-
low) to be able to obtain the same precision of the estimate of predictive
accuracy.
3. The split may be fortuitous; if the process were repeated with a different
split, different assessments of predictive accuracy may be obtained.
4. Data-splitting does not validate the final model, but rather a model devel-
oped on only a subset of the data. The training and test sets are recombined
for fitting the final model, which is not validated.
5. Data-splitting requires the split before the first analysis of the data. With
other methods, analyses can proceed in the usual way on the complete
dataset. Then, after a “final” model is specified, the modeling process is
rerun on multiple resamples from the original data to mimic the process
that produced the “final” model.
Efron,172, 173 Efron and Gong,175 Gong,224 Efron and Tibshirani,177, 178 Linnet,416 and Breiman66 describe several bootstrapping procedures for obtain-
ing nearly unbiased estimates of future model performance without holding
back data when making the final estimates of model parameters. With the
“simple bootstrap” [178, p. 247], one repeatedly fits the model in a bootstrap
sample and evaluates the performance of the model on the original sample.
The estimate of the likely performance of the final model on future data
is estimated by the average of all of the indexes computed on the original
sample.
Efron showed that an enhanced bootstrap estimates future model per-
formance more accurately than the simple bootstrap. Instead of estimating
an accuracy index directly from averaging indexes computed on the original
sample, the enhanced bootstrap uses a slightly more indirect approach by
estimating the bias due to overfitting or the “optimism” in the final model
fit. After the optimism is estimated, it can be subtracted from the index
of accuracy derived from the original sample to obtain a bias-corrected or
overfitting-corrected estimate of predictive accuracy. The bootstrap method
is as follows. From the original X and Y in the sample of size n, draw a
sample with replacement also of size n. Derive a model in the bootstrap sam-
ple and apply it without change to the original sample. The accuracy index
from the bootstrap sample minus the index computed on the original sample
is an estimate of optimism. This process is repeated for 100 or so bootstrap
replications to obtain an average optimism, which is subtracted from the final
model fit’s apparent accuracy to obtain the overfitting-corrected estimate.
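In practice this algorithm is automated; a minimal hedged sketch using the rms validate function on simulated data (sample size, B, and the model are arbitrary):

require(rms)
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- ifelse(runif(n) < plogis(x1), 1, 0)
f  <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)   # store data needed for resampling
validate(f, B=150)   # apparent index, optimism, and corrected index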
Note that bootstrapping validates the process that was used to fit the orig-
inal model (as does cross-validation). It provides an estimate of the expected
value of the optimism, which when subtracted from the original index, pro-
vides an estimate of the expected bias-corrected index. If stepwise variable
selection is part of the bootstrap process (as it must be if the final model
is developed that way), and not all resamples (samples with replacement or
training samples in cross-validation) resulted in the same model (which is
almost always the case), this internal validation process actually provides an
unbiased estimate of the future performance of the process used to identify
markers and scoring systems; it does not validate a single final model. But
resampling does tend to provide good estimates of the future performance of
the final model that was selected using the same procedure repeated in the
resamples.
Although cross-validation is less biased than the bootstrap, Efron172 showed that it has
much higher variance in estimating overfitting-corrected predictive accuracy
than bootstrapping. In other words, cross-validation, like data-splitting, can
yield significantly different estimates when the entire validation process is
repeated.
It is frequently very informative to estimate a measure of predictive accu-
racy forcing all candidate factors into the fit and then to separately estimate
accuracy allowing stepwise variable selection, possibly with different stop-
ping rules. Consistent with Spiegelhalter’s proposal to use all factors and
then to shrink the coefficients to adjust for overfitting,582 the full model fit
will outperform the stepwise model more often than not. Even though step-
wise modeling has slightly less optimism in predictive discrimination, this
improvement is not enough to offset the loss of information from deleting
even marginally important variables. Table 5.2 shows a typical scenario. In
this example, stepwise modeling lost a possible 0.50 − 0.47 = 0.03 predictive
discrimination. The full model fit will especially be an improvement when
1. the stepwise selection deletes several variables that are almost significant;
2. these marginal variables have some real predictive value, even if it’s slight;
and
3. there is no small set of extremely dominant variables that would be easily
found by stepwise selection.
Faraway186 has a fascinating study showing how resampling methods can
be used to estimate the distributions of predicted values and of effects of a
predictor, adjusting for an automated multistep modeling process. Bootstrap-
ping can be used, for example, to penalize the variance in predicted values for
choosing a transformation for Y and for outlier and influential observation
deletion, in addition to variable selection. Estimation of the transformation of
Y greatly increased the variance in Faraway’s examples. Brownstone [77, p.
74] states that “In spite of considerable efforts, theoretical statisticians have
been unable to analyze the sampling properties of [usual multistep modeling
strategies] under realistic conditions” and concludes that the modeling strat-
egy must be completely specified and then bootstrapped to get consistent
estimates of variances and other sampling properties.
5.4 Bootstrapping Ranks of Predictors
When the order of importance of predictors is not pre-specified but the re-
searcher attempts to determine that order by assessing multiple associations
with Y , the process of selecting “winners” and “losers” is unreliable. The
bootstrap can be used to demonstrate the difficulty of this task, by estimat-
ing confidence intervals for the ranks of all the predictors. Even though the
bootstrap intervals are wide, they actually underestimate the true widths.250 The following example uses simulated data with known ranks of impor-
tance of 12 predictors, using an ordinary linear model. The importance metric
is the partial χ2 minus its degrees of freedom, while the true metric is the
partial β, as all covariates have U (0, 1) distributions.
# Use the plot method for anova, with pl=FALSE to suppress actual
# plotting of chi-square - d.f. for each bootstrap repetition.
# Rank the chi-square - d.f. statistics so that the most important
# predictor receives the highest rank.  It is important to tell
# plot.anova.rms not to sort the results, or every bootstrap
# replication would have ranks of 1,2,3,... for the partial test
# statistics.
require(rms)
n ← 300
set.seed(1)
d ← data.frame(x1=runif(n), x2=runif(n), x3=runif(n),
               x4=runif(n), x5=runif(n), x6=runif(n), x7=runif(n),
               x8=runif(n), x9=runif(n), x10=runif(n), x11=runif(n),
               x12=runif(n))
d$y ← with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 +
              7*x7 + 8*x8 + 9*x9 + 10*x10 + 11*x11 +
              12*x12 + 9*rnorm(n))
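The extract above stops after simulating the data. A sketch of the remaining steps, reconstructed to be consistent with the comments above rather than copied from the original program, might be:

f <- ols(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12, data=d)
rankvars <- function(fit) rank(plot(anova(fit), sort='none', pl=FALSE))
Rank <- rankvars(f)                # point estimates: ranks in the original fit
B <- 1000
ranks <- matrix(NA, nrow=B, ncol=12)
for(i in 1:B) {
  j <- sample(1:n, n, replace=TRUE)            # draw a bootstrap sample
  ranks[i, ] <- rankvars(update(f, subset=j))  # re-fit and re-rank
}
lim <- t(apply(ranks, 2, quantile, probs=c(0.025, 0.975)))  # 0.95 limits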
Fig. 5.3 Bootstrap percentile 0.95 confidence limits for ranks of predictors in an OLS
model. Ranking is on the basis of partial χ2 minus d.f. Point estimates are original
ranks.
Figure 5.3 shows bootstrap percentile 0.95 confidence limits for the ranks of the predictors. Confidence intervals are wide; for example, the 0.95 confidence
interval for the rank of x7 (which has a true rank of 7) is [1, 8], so we are
only confident that x7 is not one of the 4 most influential predictors. The
confidence intervals do include the true ranks in each case (Figure 5.3).
5.5 Simplifying the Final Model by Approximating It

A model that contains all prespecified terms will usually be the one that pre-
dicts the most accurately on new data. It is also a model for which confidence
limits and statistical tests have the claimed properties. Often, however, this
model will not be very parsimonious. The full model may require more pre-
dictors than the researchers care to collect in future samples. It also requires
predicted values to be conditional on all of the predictors, which can increase
the variance of the predictions.
As an example suppose that least squares has been used to fit a model
containing several variables including race (with four categories). Race may
be an insignificant predictor and may explain a tiny fraction of the observed
variation in Y . Yet when predictions are requested, a value for race must be
inserted. If the subject is of the majority race, and this race has a prevalence of, say, 0.75, the variance of the predicted value will not be appreciably greater than the variance of a predicted value from a model that excluded race from its list of predictors. If, however, the subject is of a minority race (say
“other” with a prevalence of 0.01), the predicted value will have much higher
variance. One approach to this problem, that does not require development
of a second model, is to ignore the subject’s race and to get a weighted
average prediction. That is, we obtain predictions for each of the four races
and weight these predictions by the relative frequencies of the four races.^d
This weighted average estimates the expected value of Y unconditional on
race. It has the advantage of having exactly correct confidence limits when
model assumptions are satisfied, because the correct “error term” is being
used (one that deducts 3 d.f. for having ever estimated the race effect). In
regression models having nonlinear link functions, this process does not yield
such a simple interpretation.
When predictors are collinear, their competition results in larger P -values
when predictors are (often inappropriately) tested individually. Likewise, con-
fidence intervals for individual effects will be wide and uninterpretable (can
other variables really be held constant when one is changed?).
When the full model contains several predictors that do not appreciably af-
fect the predictions, the above process of “unconditioning” is unwieldy. In the
search for a simple solution, the most commonly used procedure for making
the model parsimonious is to remove variables on the basis of P -values, but
this results in a variety of problems as we have seen. Our approach instead
is to consider the full model fit as the “gold standard” model, especially the
model from which formal inferences are made. We then proceed to approxi-
mate this full model to any desired degree of accuracy. For any approximate
model we calculate the accuracy with which it approximates the best model.
One goal this process accomplishes is that it provides different degrees of
parsimony to different audiences, based on their needs. One investigator may
be able to collect only three variables, another one seven. Each investigator
will know how much she is giving up by using a subset of the predictors.
In approximating the gold standard model it is very important to note that
there is nothing gained in removing certain nonlinear terms; gains in parsi-
mony come only from removing entire predictors. Another accomplishment
of model approximation is that when the full model has been fitted using
^d Using the rms package described in Chapter 6, such estimates and their confidence limits can easily be obtained, using for example contrast(fit, list(age=50, disease='hypertension', race=levels(race)), type='average', weights=table(race)).
W V W′,   (5.2)
where V is the variance matrix for β̂. See Section 19.5 for an example. Ambler
et al.21 studied model simplification using simulation studies based on several
clinical datasets, and compared it with ordinary backward stepdown variable
selection and with shrinkage methods such as the lasso (see Section 4.3). They
found that ordinary backwards variable selection can be competitive when
there is a large fraction of truly irrelevant predictors (something that can be
difficult to know in advance). Paul et al.485 found advantages to modeling
the response with a complex but reliable approach, and then developing a parsimonious model using the lasso or stepwise variable selection against Ŷ.
See Section 11.7 for a case study in model approximation.
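A small sketch of the approximation process for least squares follows; the data are simulated and the choice of which predictor to drop is purely illustrative:

require(rms)
set.seed(8)
n <- 300
d <- data.frame(x1=rnorm(n), x2=rnorm(n), x3=rnorm(n))
d$y <- with(d, x1 + 0.5*x2 + 0.2*x3 + rnorm(n))
full <- ols(y ~ rcs(x1,4) + x2 + x3, data=d)
lp   <- predict(full)                 # "gold standard" predicted values
app  <- ols(lp ~ rcs(x1,4) + x2, data=d, sigma=1)  # approximate lp without x3
app$stats['R2']   # accuracy with which the smaller model approximates the full one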
where Ti is the accuracy estimate derived from fitting a model on the ith boot-
strap sample and evaluating it on the observations omitted from that bootstrap
sample, and wi are weights derived for the B bootstrap samples:
wi = (1/n) Σj=1..n [bootstrap sample i omits observation j] / (number of bootstrap samples omitting observation j).   (5.4)
5.7 Problem
require(rms)
set.seed(1)   # so can reproduce results
6 R Software

The methods described in this book are useful in any regression model that involves a linear combination of regression parameters. The software described below is useful in the same situations. Functions in R520 allow interaction spline functions as well as a wide variety of predictor parameterizations for any regression function, and facilitate model validation by resampling.1
R is the most comprehensive tool for general regression models for the
following reasons.
1. It is very easy to write R functions for new models, so R has implemented
a wide variety of modern regression models.
2. Designs can be generated for any model. There is no need to find out
whether the particular modeling function handles what SAS calls “class”
variables—dummy variables are generated automatically when an R cate-
gory, factor, ordered, or character variable is analyzed.
3. A single R object can contain all information needed to test hypotheses
and to obtain predicted values for new data.
4. R has superior graphics.
5. Classes in R make possible the use of generic function names (e.g., predict,
summary, anova) to examine fits from a large set of specific model–fitting
functions.
R44, 601, 635 is a high-level object-oriented language for statistical anal-
ysis with over six thousand packages and tens of thousands of functions
available. The R system318, 520 is the basis for R software used in this
text, centered around the Regression Modeling Strategies (rms) package261.
See the Appendix and the Web site for more information about software
implementations.
^a lrm and rcs are in the rms package.
6.2 User-Contributed Functions
In addition to the many functions that are packaged with R, a wide variety
of user-contributed functions is available on the Internet (see the Appendix
or Web site for addresses). Two packages of functions used extensively in
this text are Hmisc20 and rms written by the author. The Hmisc package con-
tains miscellaneous functions such as varclus, spearman2, transcan, hoeffd,
rcspline.eval, impute, cut2, describe, sas.get, latex, and several power and
sample size calculation functions. The varclus function uses the R hclust hi-
erarchical clustering function to do variable clustering, and the R plclust
function to draw dendrograms depicting the clusters. varclus offers a choice
of three similarity measures (Pearson r2 , Spearman ρ2 , and Hoeffding D)
and uses pairwise deletion of missing values. varclus automatically generates
a series of dummy variables for categorical factors. The Hmisc hoeffd function
computes a matrix of Hoeffding Ds for a series of variables. The spearman2
function will do Wilcoxon, Spearman, and Kruskal–Wallis tests and general-
izes Spearman’s ρ to detect non-monotonic relationships.
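A minimal sketch of varclus on simulated data containing two obvious clusters (variable names and similarity choice are illustrative):

require(Hmisc)
set.seed(9)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd=0.3)   # first cluster
x3 <- rnorm(n); x4 <- x3 + rnorm(n, sd=0.3)   # second cluster
v <- varclus(~ x1 + x2 + x3 + x4, similarity='spearman')
plot(v)   # dendrogram showing the two clusters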
Hmisc’s transcan function (see Section 4.7) performs a similar function to
PROC PRINQUAL in SAS—it uses restricted splines, dummy variables, and canon-
ical variates to transform each of a series of variables while imputing missing
values. An option to shrink regression coefficients for the imputation models
avoids overfitting for small samples or a large number of predictors. transcan
can also do multiple imputation and adjust variance–covariance matrices for
imputation. See Chapter 8 for an example of using these functions for data
reduction.
See the Web site for a list of R functions for correspondence analysis,
principal component analysis, and missing data imputation available from
other users. Venables and Ripley [635, Chapter 11] provide a nice description
of the multivariate methods that are available in R, and they provide several
new multivariate analysis functions.
A basic function in Hmisc is the rcspline.eval function, which creates a
design matrix for a restricted (natural) cubic spline using the truncated power
basis. Knot locations are optionally estimated using methods described in
Section 2.4.6, and two types of normalizations to reduce numerical problems
are supported. You can optionally obtain the design matrix for the anti-
derivative of the spline function. The rcspline.restate function computes
the coefficients (after un-normalizing if needed) that translate the restricted
cubic spline function to unrestricted form (Equation 2.27). rcspline.restate
also outputs LATEX and R representations of spline functions in simplified
form.
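For example, a four-knot basis with knot locations chosen by rcspline.eval's defaults (a hedged illustration):

require(Hmisc)
x <- seq(0, 10, length.out=50)
X <- rcspline.eval(x, nk=4, inclx=TRUE)   # x plus nk - 2 = 2 nonlinear terms
attr(X, 'knots')                          # default knot locations used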
6.3 The rms Package

rms has special fitting functions that facilitate many of the procedures described in this book, shown in Table 6.1.
Glm is a slight modification of the built-in R glm function so that rms methods can be run on the resulting fit object. glm fits generalized linear models under a wide variety of distributions of Y. Gls is a modification of the gls function
from the nlme package of Pinheiro and Bates509 , for repeated measures (longi-
tudinal) and spatially correlated data. The Rq function is a modification of the
quantreg package’s rq function356, 357 . Functions related to survival analysis
make heavy use of Therneau’s survival package482 .
You may want to specify to the fitting functions an option for how missing
values (NAs) are handled. The method for handling missing data in R is to
specify an na.action function. Some possible na.actions are given in Table 6.2.
The default na.action is na.delete when you use rms’s fitting functions. An
easy way to specify a new default na.action is, for example,
options(na.action="na.omit")   # don't report frequency of NAs
before using a fitting function. If you use na.delete you can also use the system
option na.detail.response that makes model fits store information about Y
stratified by whether each X is missing. The default descriptive statistics for
Y are the sample size and mean. For a survival time response object the
sample size and proportion of events are used. Other summary functions can
be specified using the na.fun.response option.
These functions can be used with any function of a predictor. For example,
to obtain a four-knot cubic spline expansion of the cube root of x, specify rcs(x^(1/3), 4).
When the transformation functions are called, they are usually given one
or two arguments, such as rcs(x,5). The first argument is the predictor vari-
able or some function of it. The second argument is an optional vector of
parameters describing a transformation, for example location or number of
knots. Other arguments may be provided.
The Hmisc package’s cut2 function is sometimes used to create a categorical
variable from a continuous variable x. You can specify the actual interval
endpoints (cuts), the number of observations to have in each interval on
the average (m), or the number of quantile groups (g). Use, for example,
cuts=c(0,1,2) to cut into the intervals [0, 1), [1, 2].
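For example (data simulated; all calls illustrative):

require(Hmisc)
set.seed(10)
x <- runif(30)
cut2(x, g=3)                 # three quantile groups
cut2(x, m=10)                # about 10 observations per interval
cut2(x, cuts=c(0, 0.5, 1))   # explicit interval endpoints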
A key concept in fitting models in R is that the fitting function returns an
object that is an R list. This object contains basic information about the fit
(e.g., regression coefficient estimates and covariance matrix, model χ2 ) as well
as information about how each parameter of the model relates to each factor
in the model. Components of the fit object are addressed by, for example,
fit$coef, fit$var, fit$loglik. rms causes the following information to also
be retained in the fit object: the limits for plotting and estimating effects
for each factor (if options(datadist="name") was in effect), the label for each
factor, and a vector of values indicating which parameters associated with a
factor are nonlinear (if any). Thus the “fit object” contains all the information
needed to get predicted values, plots, odds or hazard ratios, and hypothesis
tests, and to do “smart” variable selection that keeps parameters together
when they are all associated with the same predictor.
R uses the notion of the class of an object. The object-oriented class idea
allows one to write a few generic functions that decide which specific func-
tions to call based on the class of the object passed to the generic function.
An example is the function for printing the main results of a logistic model.
The lrm function returns a fit object of class "lrm". If you specify the R com-
mand print(fit) (or just fit if using R interactively—this invokes print), the
print function invokes the print.lrm function to do the actual printing specific
to logistic models. To find out which particular methods are implemented for
a given generic function, type methods(generic.name).
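A brief hedged illustration of this dispatch, on simulated data:

require(rms)
set.seed(11)
x <- rnorm(100)
y <- ifelse(runif(100) < plogis(x), 1, 0)
f <- lrm(y ~ x)
class(f)         # classes of the fit object, led by "lrm"
f                # invokes print, which dispatches to print.lrm
methods(print)   # lists the available print methods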
Generic functions that are used in this book include those in Table 6.4.
The first argument of the majority of functions is the object returned from
the model fitting function. When used with ols, lrm, orm, psm, cph, Glm, Gls, Rq,
bj, these functions do the following. specs prints the design specifications, for
example, number of parameters for each factor, levels of categorical factors,
knot locations in splines, and so on. vcov returns the variance-covariance
matrix for the model. logLik retrieves the maximized log-likelihood, whereas
AIC computes the Akaike Information Criterion for the model on the minus
twice log-likelihood scale (with an option to compute it on the χ2 scale if you
specify type=’chisq’). lrtest, when given two fit objects from nested models,
computes the likelihood ratio test for the extra variables. univarLR computes
all univariable likelihood ratio χ2 statistics, one predictor at a time.
The robcov function computes the Huber robust covariance matrix esti-
mate. bootcov uses the bootstrap to estimate the covariance matrix of pa-
rameter estimates. Both robcov and bootcov assume that the design matrix
and response variable were stored with the fit. They have options to adjust
for cluster sampling. Both replace the original variance–covariance matrix
with robust estimates and return a new fit object that can be passed to any
of the other functions. In that way, robust Wald tests, variable selection, con-
fidence limits, and many other quantities may be computed automatically.
The functions do save the old covariance estimates in component orig.var
of the new fit object. bootcov also optionally returns the matrix of param-
eter estimates over the bootstrap simulations. These estimates can be used
to derive bootstrap confidence intervals that don’t assume normality or sym-
metry. Associated with bootcov are plotting functions for drawing histogram
and smooth density estimates for bootstrap distributions. bootcov also has
a feature for deriving approximate nonparametric simultaneous confidence
sets. For example, the function can get a simultaneous 0.90 confidence region
for the regression effect of age over its entire range.
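A minimal sketch of both functions (simulated data; the model and B are arbitrary):

require(rms)
set.seed(12)
x <- rnorm(150)
y <- ifelse(runif(150) < plogis(x), 1, 0)
f  <- lrm(y ~ x, x=TRUE, y=TRUE)  # design matrix and response must be stored
fr <- robcov(f)                   # Huber robust covariance estimate
fb <- bootcov(f, B=200)           # bootstrap covariance estimate
anova(fr)                         # Wald tests now use the robust matrix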
The pentrace function assists in selection of penalty factors for fitting re-
gression models using penalized maximum likelihood estimation (see Sec-
tion 9.10). Different types of model terms can be penalized by different
amounts. For example, one can penalize interaction terms more than main
effects. The effective.df function prints details about the effective degrees
of freedom devoted to each type of model term in a penalized fit.
summary prints a summary of the effects of each factor. When summary is
used to estimate effects (e.g., odds or hazard ratios) for continuous variables,
it allows the levels of interacting factors to be easily set, as well as allowing
the user to choose the interval for the effect. This method of estimating effects
allows for nonlinearity in the predictor. By default, interquartile range effects
(differences in X β̂, odds ratios, hazards ratios, etc.) are printed for continuous
factors, and all comparisons with the reference level are made for categorical
factors. See the example at the end of the summary documentation for a method
of quickly computing pairwise treatment effects and confidence intervals for
a large series of values of factors that interact with the treatment variable.
Saying plot(summary(fit)) will depict the effects graphically, with bars for a
list of confidence levels.
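For example, with hypothetical predictors age and treat in the fit (a sketch):
summary(fit, age=c(30, 50))             # effect of increasing age from 30 to 50
summary(fit, age=c(30, 50), treat='b')  # same, with the interacting factor set
plot(summary(fit))                      # graphical display with confidence bars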
The anova function automatically tests most meaningful hypotheses in a
design. For example, suppose that age and cholesterol are predictors, and
that a general interaction is modeled using a restricted spline surface. anova
prints Wald statistics for testing linearity of age, linearity of cholesterol, age
effect (age + age × cholesterol interaction), cholesterol effect (cholesterol +
age × cholesterol interaction), linearity of the age × cholesterol interaction
(i.e., adequacy of the simple age × cholesterol 1 d.f. product), linearity of the
interaction in age alone, and linearity of the interaction in cholesterol alone.
Joint tests of all interaction terms in the model and all nonlinear terms in the
model are also performed. The plot.anova function draws a dot chart showing
the relative contribution (χ2 , χ2 minus d.f., AIC, partial R2 , P -value, etc.)
of each factor in the model.
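For example (a sketch):
an ← anova(fit)                 # Wald tests: effects, linearity, interactions
print(an)
plot(an, what='chisqminusdf')   # dot chart ranking predictors by chi-square - d.f.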
The contrast function is used to obtain general contrasts and correspond-
ing confidence limits and test statistics. This is most useful for testing effects
in the presence of interactions (e.g., type II and type III contrasts). See the
help file for contrast for several examples of how to obtain joint tests of mul-
tiple contrasts (see Section 9.3.2) as well as double differences (interaction
contrasts).
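A sketch, for a hypothetical model in which treat interacts with cholesterol:
# treatment effect at cholesterol=200
contrast(fit, list(treat='b', cholesterol=200),
              list(treat='a', cholesterol=200))
# double difference: treatment effect at high vs. low cholesterol
contrast(fit, list(treat='b', cholesterol=250),
              list(treat='a', cholesterol=250),
              list(treat='b', cholesterol=150),
              list(treat='a', cholesterol=150))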
The predict function is used to obtain a variety of values or predicted
values from either the data used to fit the model or a new dataset. The
Predict function is easier to use for most purposes, and has a special plot
method. The gendata function makes it easy to obtain a data frame containing
predictor combinations for obtaining selected predicted values.
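For example (a sketch with a hypothetical age predictor):
p ← Predict(fit, age=seq(20, 80, length=60))  # predictions over the age range
plot(p)                        # special plot method for Predict objects
newdat ← gendata(fit, age=50)  # other predictors set to reference values
predict(fit, newdat)           # predicted values for these settings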
fastbw deletes factors, not columns of the design matrix. Factors requiring
multiple d.f. will be retained or dropped as a group. The function prints the
deletion statistics for each variable in turn, and prints approximate parameter
estimates for the model after deleting variables. The approximation is better
when the number of factors deleted is not large. For ols, the approximation
is exact.
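For example (a sketch):
fastbw(fit, rule='aic')          # backward deletion using AIC
fastbw(fit, rule='p', sls=0.1)   # or a P-value rule with significance level 0.1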
The which.influence function creates a list with a component for each
factor in the model. The names of the components are the factor names.
Each component contains the observation identifiers of all observations that
are “overly influential” with respect to that factor, meaning that |dfbetas| > u
for at least one βi associated with that factor, for a given u. The default u
is .2. You must have specified x=TRUE, y=TRUE in the fitting function to use
which.influence. The first argument is the fit object, and the second argument
is the cutoff u.
The following R program will print the set of predictor values that were
very influential for each factor. It assumes that the data frame containing the
data used in the fit is called df.
f ← lrm(y ∼ x1 + x2 + ..., data=df, x=TRUE, y=TRUE)
w ← which.influence(f, .4)
nam ← names(w)
for(i in 1:length(nam)) {
  cat("Influential observations for effect of", nam[i], "\n")
  print(df[w[[i]], ])
}
On the other hand, the print method with optional argument latex=TRUE is used
to output LATEX code representing the model results in tabular form to the
console. This is intended for use with knitr [677] or Sweave [399].
The Function function composes an R function that you can use to evaluate
X β̂ analytically from a fitted regression model. The documentation for Function
also shows how to use a subsidiary function sascode that will (almost)
translate such an R function into SAS code for evaluating predicted values in
new subjects. Neither Function nor latex handles third-order interactions.
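For example (a sketch):
g ← Function(fit)            # g is an ordinary R function of the predictors
g(age=50, cholesterol=200)   # evaluates X beta-hat at these settings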
The nomogram function draws a partial nomogram for obtaining predictions
from the fitted model manually. It constructs different scales when interac-
tions (up to third-order) are present. The constructed nomogram is not com-
plete, in that point scores are obtained for each predictor and the user must
add the point scores manually before reading predicted values on the final
axis of the nomogram. The constructed nomogram is useful for interpreting
the model fit, especially for non-monotonically transformed predictors (their
scales wrap around an axis automatically).
The vif function computes variance inflation factors from the covariance
matrix of a fitted model, using [147, 654].
The impute function is another generic function. It does simple imputation
by default. It can also work with the transcan function to multiply or singly
impute missing values using a flexible additive model.
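For example, where ptrans is a transcan fit and mydata the original data frame (a sketch):
cholesterol ← impute(cholesterol, median)   # simple median imputation
# or use the transcan fit for flexible imputation, e.g.
#   cholesterol ← impute(ptrans, cholesterol, data=mydata)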
As an example of using many of the functions, suppose that a categorical
variable treat has values "a", "b", and "c", an ordinal variable num.diseases
has values 0,1,2,3,4, and that there are two continuous variables, age and
cholesterol. age is fitted with a restricted cubic spline, while cholesterol
is transformed using the transformation log(cholesterol+10). Cholesterol is
missing on three subjects, and we impute these using the overall median
cholesterol. We wish to allow for interaction between treat and cholesterol.
The following R program will fit a logistic model, test all effects in the design,
estimate effects, and plot estimated transformations. The fit for num.diseases
really considers the variable to be a five-level categorical variable. The only
difference is that a 3 d.f. test of linearity is done to assess whether the variable
can be remodeled “asis”. Here we also show statements to attach the rms
package and store predictor characteristics from datadist.
require(rms)                   # make new functions available
ddist ← datadist(cholesterol, treat, num.diseases, age)
# Could have used ddist ← datadist(data.frame.name)
options(datadist='ddist')      # defines data dist. to rms
cholesterol ← impute(cholesterol)
fit ← lrm(y ∼ treat + scored(num.diseases) + rcs(age) +
          log(cholesterol+10) +
          treat:log(cholesterol+10))
describe(y ∼ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in
# fit.  describe function (in Hmisc) gets simple statistics
# on variables
# fit ← robcov(fit)   # Would make all statistics that follow
#                     # use a robust covariance matrix
ag ← 10:80
logit ← predict(fit, expand.grid(treat="a", num.diseases=0,
                age=ag,
                cholesterol=median(cholesterol)),
                type="terms")[, "age"]
# Note: if age interacted with anything, this would be the
# age 'main effect' ignoring interaction terms
# Could also use logit ← Predict(fit, age=ag, ...)$yhat,
# which allows evaluation of the shape for any level of
# interacting factors.  When age does not interact with
# anything, the result from predict(fit, ..., type="terms")
# would equal the result from Predict if all other terms
# were ignored
To examine interactions in a simpler way, you may want to group age into
tertiles:
age.tertile ← cut2(age, g=3)
# For auto ranges later, specify age.tertile to datadist
fit ← lrm(y ∼ age.tertile * rcs(cholesterol))
require(rms)
dd ← datadist(my.data)
options(datadist='dd')
pcfit ← princomp(∼ pain.symptom1 + pain.symptom2 + sign1 +
                 sign2 + sign3 + smoking)
pc2 ← pcfit$scores[, 1:2]   # first 2 PCs as matrix
logistic.fit ← lrm(death ∼ rcs(age,4) + pc2)
predicted.logit ← predict(logistic.fit)
linear.mod ← ols(predicted.logit ∼ rcs(age,4) +
                 pain.symptom1 + pain.symptom2 +
                 sign1 + sign2 + sign3 + smoking)
# This model will have R-squared=1
nom ← nomogram(linear.mod, fun=function(x) 1/(1 + exp(-x)),
               funlabel="Probability of Death")
# can use fun=plogis
plot(nom)
# 7 axes showing effects of all predictors, plus a reading
# axis converting to predicted probability scale
For principal component analysis, R has the princomp and prcomp functions.
Canonical correlations and canonical variates can be easily computed us-
ing the cancor function. There are many other R functions for examining
associations and for fitting models. The supsmu function implements Friedman's
"super smoother" [207]. The lowess function implements Cleveland's
two-dimensional smoother [111]. The glm function will fit general linear models under
a wide variety of distributions of Y . There are functions to fit Hastie and
Tibshirani's [275] generalized additive model for a variety of distributions. More is
said about parametric and nonparametric additive multiple regression func-
tions in Chapter 16. The loess function fits a multidimensional scatterplot
smoother (the local regression model of Cleveland et al. [96]). loess provides
approximate test statistics for normal or symmetrically distributed Y :
f ← loess(y ∼ age * pressure)
plot(f)   # cross-sectional plots
ages ← seq(20, 70, length=40)
pressures ← seq(80, 200, length=40)
pred ← predict(f,
               expand.grid(age=ages, pressure=pressures))
persp(ages, pressures, pred)   # 3-D plot
The Hmisc rcorr.cens function can compute Somers’ Dxy rank correla-
tion coefficient and its standard error, for binary or continuous (and possibly
right-censored) responses. A simple transformation of Dxy yields the c index
(generalized ROC area). The Hmisc improveProb function is useful for compar-
ing two probability models using the methods of Pencina et al. [490, 492, 493] in an
external validation setting. See also the rcorrp.cens function in this context.
a A case study in OLS—Chapter 7 from the first edition—may be found on the text's web site.
Longitudinal data can be used to estimate overall means or the mean at the
last scheduled follow-up, making maximum use of incomplete records. But the
real value of longitudinal data comes from modeling the entire time course.
Estimating the time course leads to understanding slopes, shapes, overall
trajectories, and periods of treatment effectiveness. With continuous Y one
typically specifies the time course by a mean time-response profile. Common
representations for such profiles include
• k dummy variables for k + 1 unique times (assumes no functional form for
time but assumes discrete measurement times and may spend many d.f.)
• k = 1 for linear time trend, g1 (t) = t
• k–order polynomial in t
• k + 1–knot restricted cubic spline (one linear term, k − 1 nonlinear terms)
Suppose the time trend is modeled with k parameters so that the time
effect has k d.f. Let the basis functions modeling the time effect be g1 (t),
g2 (t), . . . , gk (t) to allow it to be nonlinear. A model for the time profile with-
out interactions between time and any X is given by E[Y |X, t] = Xβ + γ1 g1 (t) + · · · + γk gk (t).
Some methods allow any correlation pattern while others assume a restrictive form for the pattern. The following
table is an attempt to briefly survey available longitudinal analysis meth-
ods. LOCF and the summary statistic method are not modeling methods. 2
LOCF is an ad hoc attempt to account for longitudinal dropouts, and sum-
mary statistics can convert multivariate responses to univariate ones with few
assumptions (other than minimal dropouts), with some information loss.
a Thanks to Charles Berry, Brian Cade, Peter Flom, Bert Gunter, and Leena Choi for valuable input.
b GEE: generalized estimating equations; GLS: generalized least squares; LOCF: last observation carried forward.
c E.g., compute within-subject slope, mean, or area under the curve over time. Assumes that the summary measure is an adequate summary of the time profile and assesses the relevant treatment effect.
The most prevalent full modeling approach is mixed effects models in which
baseline predictors are fixed effects, and random effects are used to describe
subject differences and to induce within-subject correlation. Some disadvan-
tages of mixed effects models are
• The induced correlation structure for Y may be unrealistic if care is not
taken in specifying the model.
• Random effects require complex approximations for distributions of test
statistics.
• The most commonly used models assume that random effects follow a
normal distribution. This assumption may not hold.
It could be argued that an extended linear model (with no random effects)
is a logical extension of the univariate OLS model^b. This model, called the
generalized least squares or growth curve model [221, 509, 510], was developed long
before mixed effect models became popular.
We will assume that Yit |Xi has a multivariate normal distribution with
mean given above and with variance-covariance matrix Vi , an ni × ni matrix
that is a function of ti1 , . . . , tini . We further assume that the diagonals of Vi
are all equal^b. This extended linear model has the following assumptions:
• all the assumptions of OLS at a single time point including correct mod-
eling of predictor effects and univariate normality of responses conditional
on X
d Unless one uses the Huynh-Feldt or Greenhouse-Geisser correction.
e For full efficiency, if using the working independence model.
f Or requires the user to specify one.
g For full efficiency of regression coefficient estimates.
h Unless the last observation is missing.
i The cluster sandwich variance estimator used to estimate SEs in GEE does not perform well in this situation, and neither does the working independence model because it does not weight subjects properly.
j Unless one knows how to properly do a weighted analysis.
k Or uses population averages.
l Unlike GLS, does not use standard maximum likelihood methods yielding simple likelihood ratio χ2 statistics. Requires high-dimensional integration to marginalize random effects, using complex approximations, and if using SAS, unintuitive d.f. for the various tests.
m Because there is no correct formula for SE of effects; ordinary SEs are not penalized for imputation and are too small.
n If correction not applied.
o E.g., a model with a predictor that is a lagged value of the response variable.
b E.g., few statisticians use subject random effects for univariate Y. Pinheiro and Bates [509, Section 5.1.2] state that "in some applications, one may wish to avoid incorporating random effects in the model to account for dependence among observations, choosing to use the within-group component Λi to directly model variance-covariance structure of the response."
b This procedure can be generalized to allow for heteroscedasticity over time or with respect to X, e.g., males may be allowed to have a different variance than females.
• the distribution of two responses at two different times for the same sub-
ject, conditional on X, is bivariate normal with a specified correlation
coefficient
• the joint distribution of all ni responses for the ith subject is multivariate
normal with the given correlation pattern (which implies the previous two
distributional assumptions)
• responses from two different subjects are uncorrelated.
Generalized least squares is like weighted least squares but uses a covariance
matrix that is not diagonal. Each subject can have her own shape of Vi due
to each subject being measured at a different set of times. This is a maximum
likelihood procedure. Newton-Raphson or other trial-and-error methods are
used for estimating parameters. For a small number of subjects, there are ad-
vantages in using REML (restricted maximum likelihood) instead of ordinary
MLE [159, Section 5.3], [509, Chapter 5], [221], especially to get a less biased
estimate of the covariance matrix.
When imbalances of measurement times are not severe, OLS fitted ignoring
subject identifiers may be efficient for estimating β. But OLS standard errors
will be too small as they don’t take intra-cluster correlation into account.
This may be rectified by substituting a covariance matrix estimated using
the Huber-White cluster sandwich estimator or from the cluster bootstrap.
When imbalances are severe and intra-subject correlations are strong, OLS
(or GEE using a working independence model) is not expected to be efficient
because it gives equal weight to each observation; a subject contributing two
distant observations receives 1/5 the weight of a subject having 10 tightly-
spaced observations.
Common correlation structures that allow measurement times to be continuous
and unequally spaced [509, Section 5.3.3] are shown below, along with the
correlation function names from the R nlme package.
Compound symmetry: h = ρ if t1 ≠ t2 , 1 if t1 = t2 (nlme corCompSymm); essentially what two-way ANOVA assumes
Autoregressive-moving average lag 1: h = ρ^|t1 −t2 | = ρ^s, where s = |t1 − t2 | (corCAR1)
Exponential: h = exp(−s/ρ) (corExp)
Gaussian: h = exp[−(s/ρ)^2] (corGaus)
Linear: h = (1 − s/ρ)[s < ρ] (corLin)
Rational quadratic: h = 1 − (s/ρ)^2/[1 + (s/ρ)^2] (corRatio)
Spherical: h = [1 − 1.5(s/ρ) + 0.5(s/ρ)^3][s < ρ] (corSpher)
Linear exponent AR(1): h = ρ^{dmin + δ(s − dmin)/(dmax − dmin)}, 1 if t1 = t2 [572]
Structures 3–7 use ρ as a scaling parameter, not as a quantity restricted to be in [0, 1].
Section 4.4 provided some guidance about sample sizes needed for OLS.
A good way to think about sample size adequacy for generalized least squares
is to determine the effective number of independent observations that a given
configuration of repeated measurements has. For example, if the standard er-
ror of an estimate from three measurements on each of 20 subjects is the same
as the standard error from 27 subjects measured once, we say that the 20×3
study has an effective sample size of 27, and we equate the power of a uni-
variate analysis of n subjects measured once with that of 20n/27 subjects measured three
times. Faes et al. [181] have a nice approach to effective sample sizes with a
variety of correlation patterns in longitudinal data. For an AR(1) correlation
structure with n equally spaced measurement times on each of N subjects,
d Variograms can be unstable.
with the correlation between two consecutive times being ρ, the effective
sample size is [n − (n − 2)ρ]N/(1 + ρ). Under compound symmetry, the
effective size is nN/[1 + ρ(n − 1)].
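These formulas are simple to compute directly; a minimal sketch (the function names are ours):
# effective sample size: N subjects, n equally spaced times, correlation rho
ess.ar1 ← function(n, N, rho) N * (n - (n - 2) * rho) / (1 + rho)
ess.cs  ← function(n, N, rho) n * N / (1 + rho * (n - 1))
ess.ar1(3, 20, 0.5)   # e.g., 3 measurements on each of 20 subjects, AR(1)
ess.cs(3, 20, 0.5)    # same configuration under compound symmetry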
7.8 R Software
The nonlinear mixed effects model package nlme of Pinheiro & Bates in R
provides many useful functions. For fitting linear models, fitting functions
are lme for mixed effects models and gls for generalized least squares without
random effects. The rms package has a front-end function Gls so that many
features of rms can be used:
anova: all partial Wald tests, test of linearity, pooled tests
summary: effect estimates (differences in Ŷ ) and confidence limits
Predict and plot: partial effect plots
nomogram: nomogram
Function: generate R function code for the fitted model
latex: LATEX representation of the fitted model.
In addition, Gls has a cluster bootstrap option (hence you do not use rms's
bootcov for Gls fits). When B is provided to Gls( ), bootstrapped regression
coefficients and correlation estimates are saved, the former setting up for
bootstrap percentile confidence limits^e. The nlme package has many graphics
and fit-checking functions. Several functions will be demonstrated in the case
study.
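A hedged sketch of such a fit, for a hypothetical data frame d with response y, measurement time time, and subject identifier id:
f ← Gls(y ∼ rcs(time, 3), data=d,
        correlation=corCAR1(form=∼ time | id), B=500)
# with B given, bootstrapped coefficients are stored, setting up
# percentile confidence limits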
Consider the dataset in Table 6.9 of Davis [148, pp. 161–163] from a multi-
center, randomized controlled trial of botulinum toxin type B (BotB) in pa-
tients with cervical dystonia from nine U.S. sites. Patients were randomized
to placebo (N = 36), 5000 units of BotB (N = 36), or 10,000 units of BotB
(N = 37). The response variable is the total score on the Toronto Western
Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and
disability of cervical dystonia (high scores mean more impairment). TWSTRS
is measured at baseline (week 0) and weeks 2, 4, 8, 12, 16 after treatment
began. The dataset name on the dataset wiki page is cdystonia.
e To access regular gls functions named anova (for likelihood ratio tests, AIC, etc.) or summary, use anova.gls or summary.gls.
getHdata ( cdystonia )
attach ( cdystonia )
# Show quartiles
ggplot(cdystonia, aes(x=week, y=twstrs)) + xl + yl +
  ylim(0, 70) + stat_summary(fun.data="median_hilow",
  conf.int=0.5, geom='smooth') +
  facet_wrap(∼ treat, nrow=2)   # Fig. 7.2
dd ← datadist(both)
options(datadist='dd')
Fig. 7.1 Time profiles for individual subjects, stratified by study site and dose
Fig. 7.2 Quartiles of TWSTRS stratified by dose
AIC computed above is set up so that smaller values are best. From this
the continuous-time AR1 and exponential structures are tied for the best.
For the remainder of the analysis we use corCAR1, using Gls.
a ← Gls(twstrs ∼ treat * rcs(week, 3) + rcs(twstrs0, 3) +
        rcs(age, 4) * sex, data=both,
        correlation=corCAR1(form=∼ week | uid))
[Figure: semivariogram of GLS residuals versus distance between measurement times]
latex ( summary (a ) , file = ' ' , table.env = FALSE ) # Shows for week 8
Fig. 7.4 Three residual plots to check for absence of trends in central tendency
and in variability. Upper right panel shows the baseline score on the x-axis. Bottom
left panel shows the mean ±2×SD. Bottom right panel is the QQ plot for checking
normality of residuals from the GLS fit.
[Figure: anova dot chart of Wald χ2 contributions for sex, age × sex, age, treat × week, treat, week, and twstrs0]
Fig. 7.6 Estimated effects of time, baseline TWSTRS, age, and sex
Redundant contrasts are denoted by *
Confidence intervals are 0.95 individual intervals
k1 ← as.data.frame(k1[c('week', 'Contrast', 'Lower', 'Upper')])
p1 ← ggplot(k1, aes(x=week, y=Contrast)) + geom_point() +
  geom_line() + ylab('Low Dose - Placebo') +
  geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
k2 ← as.data.frame(k2[c('week', 'Contrast', 'Lower', 'Upper')])
p2 ← ggplot(k2, aes(x=week, y=Contrast)) + geom_point() +
  geom_line() + ylab('High Dose - Placebo') +
  geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
gridExtra::grid.arrange(p1, p2, ncol=2)   # Figure 7.7
Fig. 7.7 Contrasts and 0.95 confidence limits from GLS fit
Fig. 7.8 Nomogram from GLS fit. Second axis is the baseline score.
1 Jim Rochon (Rho, Inc., Chapel Hill NC) has the following comments about
using the baseline measurement of Y as the first longitudinal response.
For RCTs [randomized clinical trials], I draw a sharp line at the point
when the intervention begins. The LHS [left hand side of the model equa-
tion] is reserved for something that is a response to treatment. Anything
before this point can potentially be included as a covariate in the regres-
sion model. This includes the “baseline” value of the outcome variable.
Indeed, the best predictor of the outcome at the end of the study is typ-
ically where the patient began at the beginning. It drinks up a lot of
variability in the outcome; and, the effect of other covariates is typically
mediated through this variable.
I treat anything after the intervention begins as an outcome. In the west-
ern scientific method, an “effect” must follow the “cause” even if by a split
second.
Note that an RCT is different than a cohort study. In a cohort study,
“Time 0” is not terribly meaningful. If we want to model, say, the trend
over time, it would be legitimate, in my view, to include the “baseline”
value on the LHS of that regression model.
7.10 Further Reading
. . . one should focus clearly on ‘outcomes’ as being the only values that
can be influenced by treatment and examine critically any schemes that
assume that these are linked in some rigid and deterministic view to
‘baseline’ values. An alternative tradition sees a baseline as being merely
one of a number of measurements capable of improving predictions of
outcomes and models it in this way.
The final reason that baseline cannot be modeled as the response at time zero is
that many studies have inclusion/exclusion criteria that include cutoffs on the
baseline variable yielding a truncated distribution. In general it is not appropri-
ate to model the baseline with the same distributional shape as the follow-up
measurements.
f In addition to this, one of the paper's conclusions, that analysis of covariance is not appropriate if the population means of the baseline variable are not identical in the treatment groups, is arguable [563]. See [346] for a discussion of [423].
Chapter 8
Case Study in Data Reduction
Recall that the aim of data reduction is to reduce (without using the outcome)
the number of parameters needed in the outcome model. The following case
study illustrates these techniques:
1. redundancy analysis;
2. variable clustering;
3. data reduction using principal component analysis (PCA), sparse PCA,
and pretransformations;
4. restricted cubic spline fitting using ordinary least squares, in the context
of scaling; and
5. scaling/variable transformations using canonical variates and nonparamet-
ric additive regression.
8.1 Data
Consider the 506-patient prostate cancer dataset from Byar and Green [87]. The
data are listed in [28, Table 46] and are available in ASCII form from StatLib
(lib.stat.cmu.edu) in the Datasets area and from this book's web site. These
data were from a randomized trial comparing four treatments for stage 3
and 4 prostate cancer, with almost equal numbers of patients on placebo and
each of three doses of estrogen. Four patients had missing values on all of the
following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients
were also missing sz. These patients are excluded from consideration. The
ultimate goal of an analysis of the dataset might be to discover patterns in
survival or to do an analysis of covariance to assess the effect of treatment
while adjusting for patient heterogeneity. See Chapter 21 for such analyses.
The data reductions developed here are general and can be used for a variety
of dependent variables.
The variable names, labels, and a summary of the data are printed below.
require ( Hmisc )
prostate[2:17]
16 Variables 502 Observations
stage : Stage
n missing unique Info Mean
502 0 2 0.73 3.424
3 (289, 58%), 4 (213, 42%)
rx
n missing unique
502 0 4
placebo (127, 25%), 0.2 mg estrogen (124, 25%)
1.0 mg estrogen (126, 25%), 5.0 mg estrogen (125, 25%)
status
n missing unique
502 0 10
pf
n missing unique
502 0 4
normal activity (450, 90%), in bed < 50% daytime (37, 7%)
in bed > 50% daytime (13, 3%), confined to bed (2, 0%)
ekg
n missing unique
494 8 7
normal (168, 34%), benign (23, 5%)
rhythmic disturb & electrolyte ch (51, 10%)
heart block or conduction def (26, 5%), heart strain (150, 30%)
old MI (75, 15%), recent MI (1, 0%)
bm : Bone Metastases
n missing unique Info Sum Mean
502 0 2 0.41 82 0.1633
stage is defined by ap as well as X-ray results. Of the patients in stage 3,
0.92 have ap ≤ 0.8. Of those in stage 4, 0.93 have ap > 0.8. Since stage can
be predicted almost certainly from ap, we do not consider stage in some of
the analyses.
There are 354 deaths among the 502 patients. If predicting survival time were
of major interest, we could develop a reliable model if no more than about
354/15 = 24 parameters were examined against Y in unpenalized modeling.
Suppose that a full model with no interactions is fitted and that linearity is
not assumed for any continuous predictors. Assuming age is almost linear,
we could fit a restricted cubic spline function with three knots. For the other
continuous variables, let us use five knots. For categorical predictors, the
maximum number of degrees of freedom needed would be one fewer than
the number of categories. For pf we could lump the last two categories since
the last category has only 2 patients. Likewise, we could combine the last
two levels of ekg. Table 8.1 lists the candidate predictors with the maximum
number of parameters we consider for each.
Redundancy Analysis
n: 483   p: 14   nk: 3
Number of NAs: 19
Frequencies of Missing Values Due to Each Variable
   stage  I(rxn)  age  wt  I(pfn)  hx  sbp  dbp
       0       0    1   2       0   0    0    0
ekg.norm      hg   sz  sg      ap  bm
       0       0    5  11       0   0
Redundant variables:
stage sbp bm sg
From Table 8.1, the total number of parameters is 42, so some data reduction
should be considered. We resist the temptation to take the “easy way out” us-
ing stepwise variable selection so that we can achieve a more stable modeling
process and obtain unbiased standard errors. Before using a variable cluster-
ing procedure, note that ap is extremely skewed. To handle skewness, we use
Spearman rank correlations for continuous variables (later we transform each
variable using transcan, which will allow ordinary correlation coefficients to
be used). After classifying ekg as “normal/benign” versus everything else, the
Spearman correlations are plotted below.
x ← with(prostate,
         cbind(stage, rx, age, wt, pf, hx, sbp, dbp,
               ekg.norm, hg, sz, sg, ap, bm))
# If no missing data, could use cor(apply(x, 2, rank))
r ← rcorr(x, type="spearman")$r   # rcorr in Hmisc
maxabsr ← max(abs(r[row(r) != col(r)]))
p ← nrow(r)
plot(c(-.35, p + .5), c(.5, p + .25), type='n', axes=FALSE,
     xlab='', ylab='')   # Figure 8.1
v ← dimnames(r)[[1]]
text(rep(.5, p), 1:p, v, adj=1)
for(i in 1:(p-1)) {
  for(j in (i+1):p) {
    lines(c(i, i), c(j, j + r[i, j] / maxabsr / 2),
          lwd=3, lend='butt')
    lines(c(i-.2, i+.2), c(j, j), lwd=1, col=gray(.7))
  }
  text(i, i, v[i], srt=-45, adj=0)
}
Fig. 8.1 Matrix of Spearman ρ rank correlation coefficients between predictors. Horizontal gray scale lines correspond to ρ = 0. The tallest bar corresponds to |ρ| = 0.78.
We combine sbp and dbp, and tentatively combine ap, sg, sz, and bm.
Now we turn to the scoring of the predictors to potentially reduce the number
of regression parameters that are needed later by doing away with the need for
nonlinear terms and multiple dummy variables. The R Hmisc package transcan
function defaults to using a maximum generalized variance method [368] that
incorporates canonical variates to optimally transform both sides of a mul-
tiple regression model. Each predictor is treated in turn as a variable being
predicted, and all variables are expanded into restricted cubic splines (for
continuous variables) or dummy variables (for categorical ones).
# Combine 2 levels of ekg (one had freq. 1)
levels(prostate$ekg)[levels(prostate$ekg) %in%
                     c('old MI', 'recent MI')] ← 'MI'
Fig. 8.2 Hierarchical clustering using Hoeffding’s D as a similarity measure. Dummy
variables were used for the categorical variable ekg. Some of the dummy variables
cluster together since they are by definition negatively correlated.
ptrans ←
  transcan(∼ sz + sg + ap + sbp + dbp +
             age + wt + hg + ekg + pf + bm + hx, imputed=TRUE,
           transformed=TRUE, trantab=TRUE, pl=FALSE,
           show.na=TRUE, data=prostate, frac=.1, pr=FALSE)
summary(ptrans, digits=4)
Iterations: 8
[transcan summary output: adjusted R2 for predicting each variable from the others, coefficients of the first canonical variates, and summaries of imputed values for sz (5 imputed, mean 12.86), age (1 imputed, 71.65), and wt (2 imputed, 97.77)]
The plotted output is shown in Figure 8.3. Note that at face value the trans-
formation of ap was derived in a circular manner, since the combined index
of stage and histologic grade, sg, uses in its stage component a cutoff on ap.
However, if sg is omitted from consideration, the resulting transformation for
ap does not change appreciably. Note that bm and hx are represented as binary
variables, so their coefficients in the table of canonical variable coefficients
are on a different scale. For the variables that were actually transformed, the
coefficients are for standardized transformed variables (mean 0, variance 1).
From examining the R2 s, age, wt, ekg, pf, and hx are not strongly related
to the other variables. Imputations for age, wt, and ekg thus rely more on the
median or modal values from the marginal distributions. From the coefficients
of the first (standardized) canonical variates, sbp is predicted almost solely from
dbp, and bm is predicted mainly from ap, hg, and pf.
Fig. 8.3 Simultaneous transformation and single imputation of all candidate predic-
tors using transcan. Imputed values are shown as red plus signs. Transformed values
are arbitrarily scaled to [0, 1].
The first PC, PC1 , is the linear combination of standardized variables having
maximum variance. PC2 is the linear combination of predictors having the
second largest variance such that PC2 is orthogonal to (uncorrelated with)
PC1 . If there are p raw variables, the first k PCs, where k < p, will explain
only part of the variation in the whole system of p variables unless one or
more of the original variables is exactly a linear combination of the remaining
variables. Note that it is common to scale and center variables to have mean
zero and variance 1 before computing PCs.
The response variable (here, time until death due to any cause) is not
examined during data reduction, so that if PCs are selected by variance ex-
plained in the X-space and not by variation explained in Y , one needn’t
correct for model uncertainty or multiple comparisons.
PCA results in data reduction when the analyst uses only a subset of the
p possible PCs in predicting Y . This is called incomplete principal component
regression. When one sequentially enters PCs into a predictive model in a
strict pre-specified order (i.e., by descending amounts of variance explained),
the choice of the number of components to retain is the only modeling decision
that involves Y .
plot(prin.raw, type='lines', main='', ylim=c(0,3))   # Figure 8.4
# Add cumulative fraction of variance explained
addscree ← function(x, npcs=min(10, length(x$sdev)),
                    plotv=FALSE,
                    col=1, offset=.8, adj=0, pr=FALSE) {
  vars ← x$sdev^2
  cumv ← cumsum(vars) / sum(vars)
  if(pr) print(cumv)
  text(1:npcs, vars[1:npcs] + offset * par('cxy')[2],
       as.character(round(cumv[1:npcs], 2)),
       srt=45, adj=adj, cex=.65, xpd=NA, col=col)
  if(plotv) lines(1:npcs, vars[1:npcs], type='b', col=col)
}
addscree(prin.raw)
prin.trans ← princomp(ptrans$transformed, cor=TRUE)
addscree(prin.trans, npcs=10, plotv=TRUE, col='red',
         offset=-.8, adj=1)
Fig. 8.4 Variance of the system of raw predictors (black) explained by individual
principal components (lines) along with cumulative proportion of variance explained
(text), and variance explained by components computed on transcan-transformed
variables (red)
The resulting plot shown in Figure 8.4 is called a “scree” plot [325, pp. 96–99,
104, 106]. It shows the variation explained by the first k principal components
as k increases all the way to 16 parameters (no data reduction). It requires
10 of the 16 possible components to explain > 0.8 of the variance, and the
first 5 components explain 0.49 of the variance of the system. Two of the 16
dimensions are almost totally redundant.
After repeating this process when transforming all predictors via transcan,
we have only 12 degrees of freedom for the 12 predictors. The variance ex-
plained is depicted in Figure 8.4 in red. It requires at least 9 of the 12 possible
components to explain ≥ 0.9 of the variance, and the first 5 components ex-
plain 0.66 of the variance as opposed to 0.49 for untransformed variables.
Let us see how the PCs "explain" the times until death using the Cox
regression [132] function from rms, cph, described in Chapter 20. In what follows
we vary the number of components used in the Cox models from 1 to all 16,
computing the AIC for each model. AIC is related to model log likelihood
penalized for number of parameters estimated, and lower is better. For refer-
ence, the AIC of the model using all of the original predictors, and the AIC
of a full additive spline model are shown as horizontal lines.
require(rms)
# pcs: matrix of the 16 principal component scores; S: the Surv object
aic ← numeric(16)
for(i in 1:16) {
  ps ← pcs[, 1:i]
  aic[i] ← AIC(cph(S ∼ ps))
}   # Figure 8.5
plot(1:16, aic, xlab='Number of Components Used',
     ylab='AIC', type='l', ylim=c(3950, 4000))
f ← cph(S ∼ sz + sg + log(ap) + sbp + dbp + age + wt + hg +
        ekg + pf + bm + hx, data=imputed)
abline(h=AIC(f), col='blue')
f ← cph(S ∼ rcs(sz,5) + rcs(sg,5) + rcs(log(ap),5) +
        rcs(sbp,5) + rcs(dbp,5) + rcs(age,3) + rcs(wt,5) +
        rcs(hg,5) + ekg + pf + bm + hx,
        tol=1e-14, data=imputed)
For the money, the first 5 components adequately summarize all variables,
if linearly transformed, and the full linear model is no better than this. The
model allowing all continuous predictors to be nonlinear is not worth its
added degrees of freedom.
Next check the performance of a model derived from cluster scores of
transformed variables.
# Compute PC1 on a subset of transcan-transformed predictors
pco ← function(v) {
  f ← princomp(ptrans$transformed[, v], cor=TRUE)
  vars ← f$sdev^2
  cat('Fraction of variance explained by PC1:',
      round(vars[1] / sum(vars), 2), '\n')
  f$scores[, 1]
}
tumor ← pco ( c ( ' sz ' , ' sg ' , ' ap ' , ' bm ' ))
Fig. 8.5 AIC of Cox models fitted with progressively more principal components.
The solid blue line depicts the AIC of the model with all original covariates. The
dotted blue line is positioned at the AIC of the full spline model.
[1] 3954.393
Loadings:
    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
sz   0.248  0.950
sg   0.620  0.522
ap   0.634 −0.305
sbp −0.707
dbp  0.707
age  1.000
wt   1.000
hg   1.000
ekg  1.000
pf   1.000
bm  −0.391  0.852
hx   1.000
Only nonzero loadings are shown. The first sparse PC is the tumor cluster
used above, and the second is the blood pressure cluster. Let us see how well
incomplete sparse principal component regression predicts time until death.
a The spca package is a new sparse PC package that should also be considered.
Fig. 8.6 Variance explained by individual sparse principal components (lines) along
with cumulative proportion of variance explained (text)
More components are required to optimize AIC than were seen in Figure 8.5,
but a model built from 6–8 sparse PCs performed as well as the other models.
Fig. 8.7 Performance of sparse principal components in Cox models
x ← with(imputed,
         cbind(sz, sg, ap, sbp, dbp, age, wt, hg, ekg, pf,
               bm, hx))
monotonic ← c("sz","sg","ap","sbp","dbp","age","pf")
transace(x, monotonic,   # Figure 8.8
         categorical="ekg", binary=c("bm","hx"))
       sz        sg        ap       sbp       dbp       age        wt
0.2265824 0.5762743 0.5717747 0.4823852 0.4580924 0.1514527 0.1732244
       hg       ekg        pf        bm        hx
0.2001008 0.1110709 0.1778705        NA        NA
Except for ekg and age, and apart from arbitrary sign reversals, the transformations in
Figure 8.8 determined using transace were similar to those in Figure 8.3. The
transcan transformation for ekg makes more sense.
[Fig. 8.8: transace transformations of sz, sg, ap, sbp, dbp, age, wt, hg, ekg, and pf]
8.9 Problems
c. Considering the PCs are fixed, use the bootstrap to estimate the 0.95
confidence interval of the inter-quartile-range age effect on the original
scale, and the same type of confidence interval for the coefficient of PC1 .
d. Now accounting for uncertainty in the PCs, compute the same two
confidence intervals. Compare and interpret the two sets. Take into
account the fact that PCs are not unique to within a sign change.
R programming hints for this exercise are found on the course web site.
Chapter 9
Overview of Maximum Likelihood
Estimation
Let us now study the situation of independent binary trials in general. Let
the sample size be n and the observed responses be Y1 , Y2 , . . . , Yn . The joint
probability of observing the data is given by
L = ∏_{i=1}^{n} P^{Y_i} (1 − P)^{1−Y_i}.   (9.1)
Now let s denote the sum of the Y s or the number of times that the event
occurred (Yi = 1), that is the number of “successes.” The number of non-
occurrences (“failures”) is n − s. The likelihood of the data can be simplified
to
L = P^s (1 − P)^{n−s}.   (9.2)
It is easier to work with the log likelihood function, which also has desirable
statistical properties. For the one-sample binary response problem, the log
likelihood is
log L = s log(P ) + (n − s) log(1 − P ). (9.3)
The MLE of P is that value of P that maximizes L or log L. Since log L
is a smooth function of P , its maximum value can be found by finding the
point at which log L has a slope of 0. The slope or first derivative of log L,
with respect to P , is
U (P ) = ∂ log L/∂P = s/P − (n − s)/(1 − P ).
The first derivative of the log likelihood function with respect to the parame-
ter(s), here U (P ), is called the score function. Equating this function to zero
ter(s), here U (P ), is called the score function. Equating this function to zero
requires that s/P = (n − s)/(1 − P ). Multiplying both sides of the equation
by P (1 − P ) yields s(1 − P ) = (n − s)P or that s = (n − s)P + sP = nP .
Thus the MLE of P is p = s/n.
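This can be verified numerically; the following sketch maximizes the log likelihood for, say, s = 12 successes in n = 20 trials:
s ← 12; n ← 20
logL ← function(P) s * log(P) + (n - s) * log(1 - P)
optimize(logL, interval=c(.001, .999), maximum=TRUE)$maximum
# result is approximately 0.6 = s/n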
Another important function is called the Fisher information about the
unknown parameters. The information function is the expected value of the
negative of the curvature in log L, which is the negative of the slope of the
slope as a function of the parameter, or the negative of the second derivative
of log L. Motivation for consideration of the Fisher information is as follows.
If the log likelihood function has a distinct peak, the sample provides infor-
mation that allows one to readily discriminate between a good parameter
estimate (the location of the obvious peak) and a bad one. In such a case the
MLE will have good precision or small variance. If on the other hand the like-
lihood function is relatively flat, almost any estimate will do and the chosen
estimate will have poor precision or large variance. The degree of peakedness
of a function at a given point is the speed with which the slope is changing at
that point, that is, the slope of the slope or second derivative of the function
at that point.
Fig. 9.1 log likelihood functions for three one-sample binomial problems
In each case it can be seen that the value of P that makes the data most
likely to have occurred (the value that maximizes L or log L) is p given
above. Also, the score function (slope of log L) is zero at P = p. Note that
the information function I(P ) is highest for P approaching 0 or 1 and is
lowest for P near .5, where there is maximum uncertainty about P . Note
also that while log L has the same shape for the s = 60 and s = 12 curves
in Figure 9.1, the range of log L is much greater for the larger sample size.
Figures 9.2 and 9.3 show that the larger sample size produces a sharper
likelihood. In other words, with larger n, one can zero in on the true value of
P with more precision.
[Figs. 9.2 and 9.3: score functions U (P ) and information functions I(P ) for the three binomial samples: s=50 n=100, s=60 n=100, s=12 n=20]
Now let us turn to hypothesis tests about the unknown population parameter
P — H0 : P = P0 . There are three kinds of statistical tests that arise from
likelihood theory.
This test statistic is the ratio of the likelihood at the hypothesized parameter
values to the likelihood of the data at the maximum (i.e., at parameter values
= MLEs). It turns out that −2× the log of this likelihood ratio has desirable
statistical properties. The likelihood ratio test statistic is given by
LR = −2 log(L at H0 /L at MLEs)
= −2(log L at H0 ) − [−2(log L at MLEs)]. (9.7)
Therefore, even though the best estimate of P is the same for these two cases,
the test statistic is more impressive when the sample size is five times larger.
a In linear regression, a t distribution is used to penalize for the fact that the variance of Y |X is estimated. In models such as the logistic model, there is no separate variance parameter to estimate. Gould has done simulations that show that the normal distribution provides more accurate P -values than the t for binary logistic regression.
which formally does not involve the MLE, p. The statistic can be simplified
as follows.
Note that the numerator of S involves s − nP0 , the difference between the
observed number of successes and the number of successes expected under H0 .
As with the other two test statistics, S = 0 for the first sample. For the
last two samples, S is, respectively, 4 and 4/5 = 0.8.
The likelihood of observing the observed sample values is the joint density
of the Y s. The log likelihood function here is a function of two unknowns, μ
and σ 2 .
log L = −.5 n log(2πσ^2) − .5 ∑_{i=1}^{n} (Y_i − μ)^2/σ^2.   (9.18)
It can be shown that the value of μ that maximizes log L is the value that min-
imizes the sum of squared deviations about μ, which is the sample mean Y .
The MLE of σ^2 is
s^2 = ∑_{i=1}^{n} (Y_i − Ȳ)^2 / n.   (9.19)
The MLE of B is that value of the vector B that maximizes log L(B) as
a function of B. In general, the solution for B requires iterative trial-and-
error methods as outlined later. Denote the MLE of B as b = {b1 , . . . , bp }.
The score vector is the vector of first derivatives of log L(B) with respect to
B1 , . . . , Bp :
U (B) = ∂ log L(B)/∂B.
The Fisher information matrix is the p × p matrix whose elements are the
negative of the expectation of all second partial derivatives of log L(B):
I ∗ (B) = −E[∂^2 log L(B)/∂B ∂B ′ ].
The observed information matrix I(B) is I ∗ (B) without taking the expecta-
tion. In other words, observed values remain in the second derivatives:
I(B) = −∂^2 log L(B)/∂B ∂B ′ .
This information matrix is often estimated from the sample using the es-
timated observed information I(b), by inserting b, the MLE of B, into the
formula for I(B).
Under suitable conditions, which are satisfied for most situations likely
to be encountered, the MLE b for large samples is an optimal estimator
(has as great a chance of being close to the true parameter as all other
types of estimators) and has an approximate multivariate normal distribution
with mean vector B and variance–covariance matrix I ∗−1 (B), where C −1
denotes the inverse of the matrix C. (C −1 is the matrix such that C −1 C is
the identity matrix, a matrix with ones on the diagonal and zeros elsewhere.
If C is a 1 × 1 matrix, C −1 = 1/C.) A consistent estimator of the variance–
covariance matrix is given by the matrix V , obtained by inserting b for B in
I(B) : V = I −1 (b) .
LR = −2 log(L at H0 /L at MLEs)
= −2[log L(B 0 ) − log L(b)]. (9.25)
The corresponding Wald test statistic, using the estimated observed informa-
tion matrix, is
W = (b − B 0 )′ I(b)(b − B 0 ) = (b − B 0 )′ V −1 (b − B 0 ). (9.26)
S = U ′ (B 0 )I −1 (B 0 )U (B 0 ). (9.27)
Note that as before, S does not require solving for the MLE. For large samples,
LR, W , and S have a χ2 distribution with p d.f. under suitable conditions.
Subtracting the LR χ2 for the smaller model from that of the larger model yields
the LR χ2 for testing the extra parameters in the larger model.
V = ( V11  V12
      V12′ V22 ).   (9.33)
The Wald statistic for H0 : CB = 0 is
W = (Cb)′ (CV C ′ )^{−1} (Cb),
where C is a contrast matrix that "picks off" the proper elements of B. The
contrasts can be much more general by allowing elements of C to be other
than zero and one. For the normal linear model, W is converted to an F -
statistic by dividing by the rank r of C (normally the number of rows in
C), yielding a statistic with an F -distribution with r numerator degrees of
freedom.
Many interesting contrasts are tested by forming differences in predicted
values. By forming more contrasts than are really needed, one can develop
a surprisingly flexible approach to hypothesis testing using predicted values.
This has the major advantage of not requiring the analyst to account for how
the predictors are coded. Suppose that one wanted to assess the difference
in two vectors of predicted values, X1 b − X2 b = (X1 − X2 )b = Δb to test
H0 : ΔB = 0, where Δ = X1 − X2 . The covariance matrix for Δb is given by
var(Δb) = ΔV Δ′ . (9.37)
b For example, in a 3-treatment comparison one could examine contrasts between treatments A and B, A and C, and B and C by obtaining predicted values for those treatments, even though only two differences are required.
c The rms command could be contrast(fit, list(sex='male',age=30), list(sex='female',age=40)) where all other predictors are set to medians or modes.
Suppose that there are three treatment groups (A, B, C) interacting with a
cubic spline function of X. If one wants to test the multiple degree of freedom
hypothesis that the profile for X is the same for treatment A and B vs. the
alternative hypothesis that there is a difference between A and B for at least
one value of X, one can compare predicted values at treatment A and a vector
of X values against predicted values at treatment B and the same vector of
X values. If the X relationship is linear, any two X values will suffice, and
if X is quadratic, any three points will suffice. It would be difficult to test
complex hypotheses involving only 2 of 3 treatments using other methods.
The contrast function in rms can estimate a wide variety of contrasts and
make joint tests involving them, automatically computing the number of non-
linearly-dependent contrasts as the test’s degrees of freedom. See its help file
for several examples.
At this point, one may ask why three types of test statistics are needed. The
answer lies in the statistical properties of the three tests as well as in com-
putational expense in different situations. From the standpoint of statistical
properties, LR is the best statistic, followed by S and W . The major sta-
tistical problem with W is that it is sensitive to problems in the estimated
variance–covariance matrix in the full model. For some models, most notably
the logistic regression model [278], the variance–covariance estimates can be too
large as the effects in the model become very strong, resulting in values of
W that are too small (or significance levels that are too large). W is also
sensitive to the way the parameter appears in the model. For example, a test
of H0 : log odds ratio = 0 will yield a different value of W than will H0 :
odds ratio = 1.
Relative computational efficiency of the three types of tests is also an issue.
Computation of LR and W requires estimating all p unknown parameters,
and in addition LR requires re-estimating the last q parameters under the
restriction that the first k parameters = B10 . Therefore, when one is contem-
plating whether a set of parameters should be added to a model, the score
test is the easiest test to carry out. For example, if one were interested in
testing all two-way interactions among 4 predictors, the score test statistic
for H0 : “no interactions present” could be computed without estimating the
4 × 3/2 = 6 interaction effects. S would also be appealing for testing linearity
of effects in a model—the nonlinear spline terms could be tested for signifi-
cance after adjusting for the linear effects (with estimation of only the linear
effects). Only parameters for linear effects must be estimated to compute
S, resulting in fewer numerical problems such as lack of convergence of the
Newton–Raphson algorithm.
The Wald tests are very easy to make after all the parameters in a model
have been estimated. Wald tests are thus appealing in a multiple regression
setup when one wants to test whether a given predictor or set of predic-
tors is “significant.” A score test would require re-estimating the regression
coefficients under the restriction that the parameters of interest equal zero.
Likelihood ratio tests are used often for testing the global hypothesis that
no effects are significant, as the log likelihood evaluated at the MLEs is al-
ready available from fitting the model and the log likelihood evaluated at
a “null model” (e.g., a model containing only an intercept) is often easy to
compute. Likelihood ratio tests should also be used when the validity of a
Wald test is in question as in the example cited above.
Table 9.2 summarizes recommendations for choice of test statistics for
various situations.
the respective observed number of “successes” in the two samples. Let us test
the null hypothesis H0 : P1 = P2 based on the two samples.
The likelihood function is
L = ∏_{i=1}^{2} ∏_{j=1}^{n_i} P_i^{Y_ij} (1 − P_i)^{1−Y_ij}
  = ∏_{i=1}^{2} P_i^{s_i} (1 − P_i)^{n_i − s_i}   (9.39)
and the log likelihood is
log L = ∑_{i=1}^{2} { s_i log(P_i) + (n_i − s_i) log(1 − P_i) }.   (9.40)
Under H0 , P1 = P2 = P , so
This statistic for large enough n1 and n2 has a χ2 distribution with 1 d.f.
since the null hypothesis involves the estimation of one fewer parameter than
does the unrestricted case. This LR statistic is the likelihood ratio χ2 statistic
for a 2 × 2 contingency table. It can be shown that the corresponding score
statistic is equivalent to the Pearson χ2 statistic. The better LR statistic can
be used routinely over the Pearson χ2 for testing hypotheses in contingency
tables.
In most cases, one cannot explicitly solve for MLEs but must use trial-and-
error numerical methods to solve for parameter values B that maximize
log L(B) or yield a score vector U (B) = 0. One of the fastest and most ap-
plicable methods for maximizing a function is the Newton–Raphson method,
which is based on approximating U (B) by a linear function of B in a small
The process is continued in like fashion. At the ith step the next estimate is obtained from the previous estimate using the formula

b_{i+1} = b_i + I(b_i)^{-1} U(b_i).
If the log likelihood actually worsened at b_{i+1}, "step halving" is used: b_{i+1} is replaced with (b_i + b_{i+1})/2. Further step halving is done if the log likelihood is still worse than the log likelihood at b_i, after which the original iterative strategy is resumed. The Newton–Raphson iterations continue until the −2 log likelihood changes by only a small amount over the previous iteration (say .025). The reasoning behind this stopping rule is that estimates of B that change the −2 log likelihood by less than this amount do not affect statistical inference, since −2 log likelihood is on the χ2 scale.3
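The following is a minimal sketch of this algorithm for binary logistic regression, using the stopping rule just described; the function and variable names are hypothetical and not from any package.

newton.logistic <- function(x, y, eps=.025) {
  X <- cbind(1, x)                      # design matrix with intercept
  dev <- function(b) {                  # -2 log likelihood
    p <- plogis(drop(X %*% b))
    -2 * sum(y * log(p) + (1 - y) * log(1 - p))
  }
  b <- rep(0, ncol(X))
  repeat {
    p <- plogis(drop(X %*% b))
    U <- drop(t(X) %*% (y - p))         # score vector U(b)
    I <- t(X) %*% (X * (p * (1 - p)))   # information matrix I(b)
    bnew <- drop(b + solve(I, U))       # Newton-Raphson step
    while (dev(bnew) > dev(b))          # step halving if -2 log L worsened
      bnew <- (b + bnew) / 2
    if (dev(b) - dev(bnew) < eps) return(bnew)
    b <- bnew
  }
}
# e.g., x <- rnorm(100); y <- rbinom(100, 1, plogis(x))
# rbind(newton.logistic(x, y), coef(glm(y ~ x, family=binomial)))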
9.5 Robust Estimation of the Covariance Matrix

The estimator for the covariance matrix of b found in Section 9.3 assumes that
the model is correctly specified in terms of distribution, regression assump-
tions, and independence assumptions. The model may be incorrect in a va-
riety of ways such as non-independence (e.g., repeated measurements within
subjects), lack of fit (e.g., omitted covariable, incorrect covariable transfor-
mation, omitted interaction), and distributional (e.g., Y has a Γ distribution
instead of a normal distribution). Variances and covariances, and hence con-
fidence intervals and Wald tests, will be incorrect when these assumptions
are violated.
For the case in which the observations are independent and identically
distributed but other assumptions are possibly violated, Huber312 provided
a covariance matrix estimator that is consistent. His “sandwich” estimator is
given by
H = I^{-1}(b) \left[ \sum_{i=1}^{n} U_i U_i' \right] I^{-1}(b),   (9.48)
where I(b) is the observed information matrix (Equation 9.24) and Ui is the
vector of derivatives, with respect to all parameters, of the log likelihood
component for the ith observation (assuming the log likelihood can be par-
titioned into per-observation contributions). For the normal multiple linear
regression case, H was derived by White:659
(X'X)^{-1} \left[ \sum_{i=1}^{n} (Y_i - X_i b)^2 X_i X_i' \right] (X'X)^{-1}.   (9.49)
The rms package’s robcov function computes the Huber robust covariance
matrix estimator, and the bootcov function computes the bootstrap covariance
estimator. Both of these functions allow for clustering.
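A minimal usage sketch with simulated clustered data (all variable names hypothetical):

require(rms)
set.seed(1)
id <- rep(1:50, each=4)               # 50 subjects, 4 repeated measurements each
x  <- rnorm(200)
y  <- x + rnorm(50)[id] + rnorm(200)  # within-subject correlation
f  <- ols(y ~ x, x=TRUE, y=TRUE)      # x=TRUE, y=TRUE are needed by both functions
fr <- robcov(f, cluster=id)           # Huber sandwich estimate, cluster-adjusted
fb <- bootcov(f, cluster=id, B=500)   # cluster bootstrap covariance estimate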
The estimated variance of a predicted value Xb is

v̂ar(Xb) = X V X',   (9.51)

where V is the estimated variance–covariance matrix of b.
This is the basis for confidence limits computed by the R rms package’s Predict,
summary, and contrast functions. When the robcov function has been used to replace
the information-matrix-based covariance matrix with a Huber robust covariance esti-
mate with an optional cluster sampling correction, the functions are using a “robust”
Wald statistic basis. When the bootcov function has been used to replace the model fit's covariance matrix with a bootstrap unconditional covariance matrix estimate, these functions compute confidence limits based on a normal distribution but using more nonparametric covariance estimates.
Suppose the population model is

Prob(Y = 1|X = x) = \frac{1}{1 + \exp(-x/2)}.   (9.52)
Not knowing the true model, a quadratic logistic model is fitted. The R code
needed to generate the data and fit the model is given below.
require(rms)
n <- 200
set.seed(15)
x1 <- rnorm(n)
logit <- x1 / 2
y <- ifelse(runif(n) <= plogis(logit), 1, 0)
dd <- datadist(x1); options(datadist='dd')
f <- lrm(y ~ pol(x1, 2), x=TRUE, y=TRUE)
print(f, latex=TRUE)
            χ2      d.f.   P
x1          13.99   2      0.0009
 Nonlinear   1.88   1      0.1708
TOTAL       13.99   2      0.0009
We summarize the x-effect by computing the effect (on the logit scale) of increasing x from 1 to 5. We first compute bootstrap nonparametric percentile confidence intervals the long way. The 1000 bootstrap estimates of the log odds ratio are computed easily using a single matrix multiplication with the difference-in-predictions approach, multiplying the bootstrap coefficients by the difference between two design matrices, and we obtain the bootstrap estimate of the standard error of the log odds ratio by computing the sample standard deviation of the 1000 values.^e Bootstrap percentile confidence limits are just sample quantiles of the bootstrapped log odds ratios.
# Get 2-row design matrix for obtaining predicted values
# for x = 1 and 5
X <- cbind(Intercept=1,
           predict(f, data.frame(x1=c(1, 5)), type='x'))
Xdif <- X[2, , drop=FALSE] - X[1, , drop=FALSE]
Xdif
b <- bootcov(f, B=1000)
boot.log.odds.ratio <- b$boot.Coef %*% t(Xdif)
sd(boot.log.odds.ratio)

[1] 2.752103
[1] 2.752103
[1] 2.988373

        2.5%         97.5%
2.795032e+00  2.067146e+05

# Automatic:
summary(b, x1=c(1, 5))['Odds Ratio', ]
^e As indicated below, this standard deviation can also be obtained by using the summary function on the object returned by bootcov, as bootcov returns a fit object like one from lrm except with the bootstrap covariance matrix substituted for the information-based one.
Confidence intervals are 0.95 bootstrap nonparametric percentile intervals.
# Figure 9.4
hist(boot.log.odds.ratio, nclass=100, xlab='log(OR)', main='')
Fig. 9.4 Distribution of 1000 bootstrap x=1:5 log odds ratios
z$class <- ifelse(z$.set. %in% c('Boot percentile', 'Boot bca', 'Boot basic'),
                  'Other', 'Wald')
ggplot(z, groups=c('.set.', 'class'),
       conf='line', ylim=c(-1, 9), legend.label=FALSE)
Suppose that from a single sample two competing models were developed. Let
the respective −2 log likelihoods for these models be denoted by L1 and L2 ,
and let p1 and p2 denote the number of parameters estimated in each model.
Suppose that L1 < L2 . It may be tempting to rate model one as the “best”
fitting or “best” predicting model. That model may provide a better fit for
the data at hand, but if it required many more parameters to be estimated,
it may not be better “for the money.” If both models were applied to a new
sample, model one’s overfitting of the original dataset may actually result in
a worse fit on the new dataset.
^f Limited simulations using the conditional bootstrap and Firth's penalized likelihood281 did not show significant improvement in confidence interval coverage.
Fig. 9.5 Predicted log odds and confidence bands for seven types of confidence intervals (bootstrap percentile, robust sandwich, bootstrap BCa, bootstrap covariance + Wald, Wald, bootstrap basic, and simultaneous). The seven categories are ordered top to bottom corresponding to the order of the lower confidence bands at x1 = 5. Dotted lines are for Wald-type methods that yield symmetric confidence intervals and assume normality of point estimators.
Akaike’s information criterion (AIC33, 359, 633 ) provides a method for pe-
nalizing the log likelihood achieved by a given model for its complexity to
obtain a more unbiased assessment of the model’s worth. The penalty is
to subtract the number of parameters estimated from the log likelihood, or
equivalently to add twice the number of parameters to the −2 log likelihood.
The penalized log likelihood is analogous to Mallows’ Cp in ordinary multiple
regression. AIC would choose the model by comparing L1 + 2p1 to L2 + 2p2 and picking the model with the lower value.10 We often use AIC in "adjusted χ2" form:

AIC = LR χ2 − 2p.   (9.53)

Breiman [66, Section 1.3] and Chatfield [100, Section 4] discuss the fallacy of AIC and Cp for selecting from a series of non-prespecified models.11
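As a sketch of an AIC comparison on the adjusted χ2 scale of Equation 9.53 (the fits, variables, and data frame d here are hypothetical):

f1 <- lrm(y ~ rcs(age, 4), data=d)
f2 <- lrm(y ~ rcs(age, 4) * sex, data=d)
aic <- function(f) f$stats['Model L.R.'] - 2 * f$stats['d.f.']
c(aic(f1), aic(f2))   # on this scale the larger value is preferred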
One way to test whether one model (A) is better than another (B) is to embed both models in a more general model (A + B). Then an LR χ2 test of A + B versus A assesses whether B adds predictive information to A, and the test of A + B versus B assesses whether A adds to B.
The global likelihood ratio test for regression is useful for determining whether
any predictor is associated with the response. If the sample is large enough,
even weak associations can be “statistically significant.” Even though a like-
lihood ratio test does not shed light on a model’s predictive strength, the log
likelihood (L.L.) can still be useful here. Consider the following L.L.s:
Best (lowest) possible −2 L.L.:
L∗ = −2 L.L. for a hypothetical model that perfectly predicts the outcome.
−2 L.L. achieved:
L = −2 L.L. for the fitted model.
Worst −2 L.L.:
L0 = −2 L.L. for a model that has no predictive information.
The last −2 L.L., for a “no information” model, is the −2 L.L. under the null
hypothesis that all regression coefficients except for intercepts are zero. A "no information" model often contains only an intercept and some distributional parameters (a variance, for example).13
The quantity L0 − L is LR, the log likelihood ratio statistic for testing
the global null hypothesis that no predictors are related to the response. It
is also the −2 log likelihood “explained” by the model. The best (lowest) −2
L.L. is L∗ , so the amount of L.L. that is capable of being explained by the
model is L0 − L∗ . The fraction of −2 L.L. explained that was capable of being
explained is
(L0 − L)/(L0 − L∗ ) = LR/(L0 − L∗ ). (9.54)
Penalizing the log likelihood ratio for the number of parameters estimated gives R2 = (LR − 2p)/(L0 − L∗) (Equation 9.55), and the R index is its square root. The R index can be used to assess how well the model compares with a "perfect" model, as well as to judge whether a more complex model has predictive strength that justifies its additional parameters. Had p been used in Equation 9.55 rather than 2p, R2 would be negative exactly when the log likelihood explained is less than what one would expect by chance. R will be the square root of 1 − 2p/(L0 − L∗) if the model perfectly predicts the response. This upper limit will be near one if the sample size is large.
Partial R indexes can also be defined by substituting the −2 L.L. explained
for a given factor in place of that for the entire model, LR. The “penalty
factor" p becomes one. This index, Rpartial, is defined by

R2_partial = (LR_partial − 2)/(L0 − L∗),   (9.56)
where LR_partial is the −2 log likelihood explained by the factor. Magee432 used other analogies to derive other R2 measures that may have better properties. For a sample of size n and a Wald statistic W for testing overall association, he defined

R2_W = W/(n + W)
R2_LR = 1 − exp(−LR/n) = 1 − λ^{2/n},   (9.57)

where λ is the null model likelihood divided by the fitted model likelihood. In the case of ordinary least squares with normality, both of the above indexes are equal to the traditional R2. R2_LR is equivalent to Maddala's index [431, Eq. 2.44]. Cragg and Uhler137 and Nagelkerke471 suggested dividing R2_LR by its maximum attainable value, 1 − exp(−L0/n) (see Equation 10.34).
Log likelihoods are also useful for quantifying the predictive information con-
tained in a subset of the predictors compared with the information contained
in the entire set of predictors.264 Let LR again denote the −2 log likelihood
ratio statistic for testing the joint significance of the full set of predictors. Let
LRs denote the −2 log likelihood ratio statistic for testing the importance of
the subset of predictors of interest, excluding the other predictors from the
model. A measure of adequacy of the subset for predicting the response is
given by
A = LRs /LR. (9.59)
A is then the proportion of log likelihood explained by the subset with refer-
ence to the log likelihood explained by the entire set. When A = 1, the subset
contains all the predictive information found in the whole set of predictors;
that is, the subset is adequate by itself and the additional predictors contain
no independent information. When A = 0, the subset contains no predictive
information by itself.
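A minimal sketch of computing A from Equation 9.59, assuming two hypothetical lrm fits on a data frame d:

full <- lrm(y ~ x1 + x2 + x3, data=d)   # entire set of predictors
sub  <- lrm(y ~ x1, data=d)             # subset of interest alone
A <- sub$stats['Model L.R.'] / full$stats['Model L.R.']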
Califf et al.89 used the A index to quantify the adequacy (with respect to
prognosis) of two competing sets of predictors that each describe the extent of
coronary artery disease. The response variable was time until cardiovascular
death and the statistical model used was the Cox132 proportional hazards
model. Some of their results are reproduced in Table 9.3.15 A chance-corrected adequacy measure could be derived by squaring the ratio of the R-index for the subset to the R-index for the whole set. A formal test of superiority of X1 = maximum % stenosis over X2 = jeopardy score can be obtained by testing whether X1 adds to X2 (LR χ2 = 57.5 − 42.6 = 14.9) and whether X2 adds to X1 (LR χ2 = 57.5 − 51.8 = 5.7). X1 adds more to X2 (14.9) than X2 adds to X1 (5.7). The difference 14.9 − 5.7 = 9.2 equals the difference in the single-factor χ2 values (51.8 − 42.6).665
Maximizing the log likelihood provides the best fit to the dataset at hand,
but this can also result in fitting noise in the data. For example, a categor-
ical predictor with 20 levels can produce extreme estimates for some of the
19 regression parameters, especially for the small cells (see Section 4.5). A
shrinkage approach will often result in regression coefficient estimates that
while biased are lower in mean squared error and hence are more likely to be
close to the true unknown parameter values. Ridge regression is one approach
to shrinkage, but a more general and better developed approach is penalized maximum likelihood estimation,237, 388, 639, 641 which is really a special case of Bayesian modeling with a Gaussian prior.17 Letting L denote the usual like-
lihood function and λ be a penalty factor, we maximize the penalized log
likelihood given by
\log L - \frac{1}{2} \lambda \sum_{i=1}^{p} (s_i \beta_i)^2,   (9.61)
where s1 , s2 , . . . , sp are scale factors chosen to make si βi unitless. Most au-
thors standardize the data first and do not have scale factors in the equation,
but Equation 9.61 has the advantage of allowing estimation of β on the orig-
inal scale of the data. The usual methods (e.g., Newton–Raphson) are used
to maximize Equation 9.61.
The choice of the scaling constants has received far too little attention in the ridge regression and penalized MLE literature.18 It is common to use the standard deviation of each column of the design matrix to scale the corresponding parameter. For models containing nothing but continuous variables that enter the regression linearly, this is usually a reasonable approach. For continuous variables represented with multiple terms (one of which is linear), it is not always reasonable to scale each nonlinear term with its own standard deviation. For dummy variables, scaling using the standard deviation (√(d(1 − d)), where d is the mean of the dummy variable, i.e., the fraction of observations in that cell) is problematic, since this will result in high-prevalence cells receiving more shrinkage than low-prevalence ones, because the high-prevalence cells will dominate the penalty function.
An advantage of the formulation in Equation 9.61 is that one can assign
scale constants of zero for parameters for which no shrinkage is desired.237, 639
For example, one may have prior beliefs that a linear additive model will fit
the data. In that case, nonlinear and non-additive terms may be penalized.
AIC on the model χ2 scale is LR χ2 − 2 × effective d.f., where LR χ2 is the likelihood ratio χ2 for the penalized model, but ignoring the penalty function. If a variety of λ are tried and one plots the (λ, AIC) pairs, the λ that maximizes AIC will often be a good choice; that is, it is likely to be near the value of λ that maximizes predictive accuracy on a future dataset.^g
Note that if one does penalized maximum likelihood estimation where a set of variables being penalized has a negative value for its unpenalized χ2 minus twice its d.f., the value of λ that will optimize the overall model AIC will be ∞.
As an example, consider some simulated data (n = 100) with one predictor in which the true model is Y = X1 + ε, where ε has a standard normal distribution and so does X1. We use a series of penalties (found by trial and
error) that give rise to sensible effective d.f., and fit penalized restricted cubic
spline functions with five knots. We penalize two ways: all terms in the model
including the coefficient of X1 , which in reality needs no penalty; and only
the nonlinear terms. The following R program, in conjunction with the rms
package, does the job.
^g Several examples from simulated datasets have shown that using BIC to choose a penalty results in far too much shrinkage.
set.seed(191)
x1 <- rnorm(100)
y  <- x1 + rnorm(100)
pens <- df <- aic <- c(0, .07, .5, 2, 6, 15, 60)
all <- nl <- list()
# ... loop fitting the penalized models and filling the all and nl lists
#     with Predict() results (omitted in this excerpt; see the sketch below) ...
all <- do.call('rbind', all); all$type <- 'Penalize All'
nl  <- do.call('rbind', nl);  nl$type  <- 'Penalize Nonlinear'
both <- as.data.frame(rbind.data.frame(all, nl))
both$Penalty <- both$.set.
ggplot(both, aes(x=x1, y=yhat, color=Penalty)) + geom_line() +
  geom_abline(col=gray(.7)) + facet_grid(~ type)
# Figure 9.6
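The loop referred to above is a hypothetical reconstruction; what follows is a sketch of how it could be written using the penalty argument of rms fitting functions, not the author's original code:

xs <- seq(-2.5, 2.5, length=150)
for(pen in pens) {
  f <- ols(y ~ rcs(x1, 5), penalty=pen)            # penalize all terms but intercept
  all[[format(pen)]] <- Predict(f, x1=xs)
  g <- ols(y ~ rcs(x1, 5),
           penalty=list(simple=0, nonlinear=pen))  # penalize nonlinear terms only
  nl[[format(pen)]] <- Predict(g, x1=xs)
}

With this structure, do.call('rbind', ...) uses the list names to create the .set. variable referenced above.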
Fig. 9.6 Penalized least squares estimates for an unnecessary five-knot restricted cubic spline function, with curves labeled by the penalties 0, 0.07, 0.5, 2, 6, 15, and 60. In the left graph all parameters (except the intercept) are penalized. The effective d.f. are 4, 3.21, 2.71, 2.30, 2.03, 1.82, and 1.51. In the right graph, only parameters associated with nonlinear functions of X1 are penalized. The effective d.f. are 4, 3.22, 2.73, 2.34, 2.11, 1.96, and 1.68.
1 Boos60 has some nice generalizations of the score test. Morgan et al.464 show how score test χ2 statistics may be negative unless the expected information matrix is used.
2 See Marubini and Valsecchi [444, pp. 164–169] for an excellent description of
the relationship between the three types of test statistics.
3 References [115,507] have good descriptions of methods used to maximize log L.
4 As Long and Ervin426 argue, for small sample sizes the usual Huber–White covariance estimator should not be used, because the residuals do not have constant variance even under homoscedasticity. They showed that a simple correction due to Efron and others can result in substantially better estimates.
Lin and Wei,410 Binder,55 and Lin407 have applied the Huber estimator to the
Cox132 survival model. Freedman206 questioned the use of sandwich estima-
tors because they are often used to obtain the right variances on the wrong
parameters when the model doesn’t fit. He also has some excellent background
information.
5 Feng et al.188 showed that in the case of cluster correlations arising from re-
peated measurement data with Gaussian errors, the cluster bootstrap performs
excellently even when the number of observations per cluster is large and the
number of subjects is small. Xiao and Abrahamowicz676 compared the cluster
bootstrap with a two-stage cluster bootstrap in the context of the Cox model.
In [314] they contrast asymptotically efficient model selection via AIC, when the true model has infinitely many parameters, with improvements from other indexes such as AICC when the model is finite.
One difficulty in applying the Schwarz, AICC , and related criteria is that with
censored or binary responses it is not clear that the actual sample size n should
be used in the formula.
12 Goldstein,222 Willan et al.,669 and Royston and Thompson534 have nice dis-
cussions on comparing non-nested regression models. Schemper’s method549 is
useful for testing whether a set of variables provides significantly greater infor-
mation (using an R2 measure) than another set of variables.
13 van Houwelingen and le Cessie [633, Eq. 22] recommended using L/2 (also called
the Kullback–Leibler error rate) as a quality index.
14 Schemper549 provides a bootstrap technique for testing for significant differ-
ences between correlated R2 measures. Mittlböck and Schemper,461 Schemper
and Stare,554 Korn and Simon,365, 366 Menard,454 and Zheng and Agresti684
have excellent discussions about the pros and cons of various indexes of the
predictive value of a model.
15 Al-Radi et al.10 presented another analysis comparing competing predictors
using the adequacy index and a receiver operating characteristic curve area
approach based on a test for whether one predictor has a higher probability of
being “more concordant” than another.
16 [55, 97, 409] provide good variance–covariance estimators from a weighted max-
imum likelihood analysis.
17 Huang and Harrington310 developed penalized partial likelihood estimates for
Cox models and provided useful background information and theoretical results
about improvements in mean squared errors of regression estimates. They used
a bootstrap error estimate for selection of the penalty parameter.
18 Sardy538 proposes that the square roots of the diagonals of the inverse of the
covariance matrix for the predictors be used for scaling rather than the standard
deviations.
19 Park and Hastie483 and articles referenced therein describe how quadratic pe-
nalized logistic regression automatically sets coefficient estimates for empty cells
to zero and forces the sum of k coefficients for a k-level categorical predictor to
equal zero.
20 Greenland241 has a nice discussion of the relationship between penalized max-
imum likelihood estimation and mixed effects models. He cautions against esti-
mating the shrinkage parameter.
21 See310 for a bootstrap approach to selection of λ.
22 Verweij and van Houwelingen [639, Eq. 4] derived another expression for d.f., but
it requires more computation and did not perform any better than Equation 9.63
in choosing λ in several examples tested.
23 See van Houwelingen and Thorogood631 for an approximate empirical Bayes
approach to shrinkage. See Tibshirani608 for the use of a non-smooth penalty
function that results in variable selection as well as shrinkage (see Section 4.3).
Verweij and van Houwelingen640 used a “cross-validated likelihood” based on
leave-out-one estimates to penalize for overfitting. Wang and Taylor652 pre-
sented some methods for carrying out hypothesis tests and computing con-
fidence limits under penalization. Moons et al.462 presented a case study of
penalized estimation and discussed the advantages of penalization.
9.12 Problems
1. A sample of size 100 from a normal distribution with unknown mean and
standard deviation (μ and σ) yielded the following log likelihood values
when computed at two values of μ.
10.1 Model
[Figure: P plotted as a function of X for X from −4 to 4, with P ranging from 0 to 1]
Other methods that have been used to analyze binary response data in-
clude the probit model, which writes P in terms of the cumulative normal
distribution, and discriminant analysis. Probit regression, although assuming
a similar shape to the logistic function for the regression relationship be-
tween Xβ and Prob{Y = 1}, involves more cumbersome calculations, and
there is no natural interpretation of its regression parameters. In the past,
discriminant analysis has been the predominant method since it is the sim-
plest computationally. However, it makes more assumptions than logistic regression.3 The model used in discriminant analysis is stated in terms of the
distribution of X given the outcome group Y , even though one is seldom in-
terested in the distribution of the predictors per se. The discriminant model
has to be inverted using Bayes’ rule to derive the quantity of primary in-
terest, Prob{Y = 1}. By contrast, the logistic model is a direct probability
model since it is stated in terms of Prob{Y = 1|X}. Since the distribution
of a binary random variable Y is completely defined by the true probability
that Y = 1 and since the model makes no assumption about the distribu-
tion of the predictors, the logistic model makes no distributional assumptions
whatsoever.
Since the logistic model is a direct probability model, its only assumptions
relate to the form of the regression equation. Regression assumptions are
verifiable, unlike the assumption of multivariate normality made by discrimi-
nant analysis. The logistic model assumptions are most easily understood by
transforming Prob{Y = 1} to make a model that is linear in Xβ:
logit{Y = 1|X} = β0 + β1 X1 + . . . + βj Xj + . . . + βk Xk
= βj Xj + C, (10.7)
The parameter βj is then the change in the log odds per unit change in
Xj if Xj represents a single factor that is linear and does not interact with
other factors and if all other factors are held constant. Instead of writing this
relationship in terms of log odds, it could just as easily be written in terms
of the odds that Y = 1: odds{Y = 1|X} = exp(βjXj + C).
The regression parameters can also be written in terms of odds ratios. The
odds that Y = 1 when Xj is increased by d, divided by the odds at Xj is
odds{Y = 1|X1, X2, …, Xj + d, …, Xk} / odds{Y = 1|X1, X2, …, Xj, …, Xk}
   = {exp[βj(Xj + d)] exp(C)} / {exp(βjXj) exp(C)}   (10.11)
   = exp[βjXj + βjd − βjXj] = exp(βjd).
Now consider some special cases of the logistic multiple regression model.
If there is only one predictor X and that predictor is binary, the model can
be written
logit{Y = 1|X = 0} = β0
logit{Y = 1|X = 1} = β0 + β1 . (10.13)
Since there are only two quantities to model and two free parameters, there is no way that this two-sample model can fail to fit; the model in this case is essentially fitting two cell proportions. Similarly, if there are g − 1 dummy
indicator Xs representing g groups, the ANOVA-type logistic model must
always fit.
If there is one continuous predictor X, the model is logit{Y = 1|X} = β0 + β1X,
and without further modification (e.g., taking log transformation of the pre-
dictor), the model assumes a straight line in the log odds, or that an increase
in X by one unit increases the odds by a factor of exp(β1 ).
Now consider the simplest analysis of covariance model in which there are
two treatments (indicated by X1 = 0 or 1) and one continuous covariable
(X2 ). The simplest logistic model for this setup is
logit{Y = 1|X1 = 0, X2} = β0 + β2X2
logit{Y = 1|X1 = 1, X2} = β0 + β1 + β2X2.   (10.17)

Allowing the treatment effect to depend on X2 (an X1 × X2 interaction) generalizes this to

logit{Y = 1|X1 = 0, X2} = β0 + β2X2
logit{Y = 1|X1 = 1, X2} = β0 + β1 + β2X2 + β3X2   (10.19)
                        = β0′ + β2′X2,

where β0′ = β0 + β1 and β2′ = β2 + β3.
Fig. 10.2 Absolute benefit as a function of risk of the event in a control subject and the relative effect (odds ratio) of the risk factor. The odds ratios are given for each curve (1.1, 1.25, 1.5, 1.75, 2, 3, 4, 5, and 10); the vertical axis is the increase in risk.
1/(1 + exp{−[β̂0 + β̂1 + β̂2X2 + … + β̂pXp]}) − 1/(1 + exp{−[β̂0 + β̂2X2 + … + β̂pXp]})   (10.20)
   = 1/{1 + [(1 − R̂)/R̂] exp(−β̂1)} − R̂,
where R̂ is the estimate of the baseline risk, Prob{Y = 1|X1 = 0}. The risk difference estimate can be plotted against R̂ or against levels of variables in A to display absolute risk increase against overall risk (Figure 10.2) or against specific subject characteristics.4
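The computation behind Figure 10.2 can be sketched as follows (function name hypothetical); each curve fixes the odds ratio exp(β̂1) and varies the baseline risk R̂:

risk.increase <- function(R, or) 1 / (1 + ((1 - R) / R) / or) - R
R <- seq(.01, .99, length=200)
plot(R, risk.increase(R, 10), type='l',
     xlab='Risk for Subject Without Risk Factor', ylab='Increase in Risk')
for(or in c(1.1, 1.25, 1.5, 1.75, 2, 3, 4, 5))
  lines(R, risk.increase(R, or))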
Consider the data in Table 10.2. A graph of the data, along with a fitted
logistic model (described later), appears in Figure 10.3. The graph also dis-
plays proportions of responses obtained by stratifying the data by sex and
age group (< 45, 45 − 54, ≥ 55). The age points on the abscissa for these
groups are the overall mean ages in the three age intervals (40.2, 49.1, and
61.1, respectively).
require(rms)
getHdata(sex.age.response)
d <- sex.age.response
dd <- datadist(d); options(datadist='dd')
f <- lrm(response ~ sex + age, data=d)
fasr <- f   # Save for later
w <- function(...)
  with(d, {
    m <- sex == 'male'
    f <- sex == 'female'
    lpoints(age[f], response[f], pch=1)
    lpoints(age[m], response[m], pch=2)
    af <- cut2(age, c(45, 55), levels.mean=TRUE)
    prop <- tapply(response, list(af, sex), mean, na.rm=TRUE)
    agem <- as.numeric(row.names(prop))
    lpoints(agem, prop[, 'female'], pch=4, cex=1.3, col='green')
    lpoints(agem, prop[, 'male'],   pch=5, cex=1.3, col='green')
    x <- rep(62, 4); y <- seq(.25, .1, length=4)
    lpoints(x, y, pch=c(1, 2, 4, 5),
            col=rep(c('blue', 'green'), each=2))
    ltext(x + 5, y,
          c('F Observed', 'M Observed',
            'F Proportion', 'M Proportion'), cex=.8)
  })   # Figure 10.3
plot(Predict(f, age=seq(34, 70, length=200), sex, fun=plogis),
     ylab='Pr[response]', ylim=c(-.02, 1.02), addpanel=w)
ltx <- function(fit) latex(fit, inline=TRUE, columns=54,
                           file='', after='$.', digits=3,
                           size='Ssize', before='$X\\hat{\\beta}=')
ltx(f)
Fig. 10.3 Data, subgroup proportions, and fitted logistic model, with 0.95 pointwise confidence bands
sex    response = 0   response = 1   Total   Odds (log odds)
F      14 (70.00%)     6 (30.00%)    20      6/14 = .429 (−.847)
M       6 (30.00%)    14 (70.00%)    20      14/6 = 2.33 (.847)
Total  20             20             40

Statistic              d.f.   Value   P
χ2                     1      6.400   0.011
Likelihood Ratio χ2    1      6.583   0.010

Parameter   Estimate   Std Err   Wald χ2   P
Note that the estimate of β0 , β̂0 is the log odds for females and that β̂1 is the
log odds (M:F) ratio. β̂0 + β̂1 = .847, the log odds for males. The likelihood
ratio test for H0 : no effect of sex on probability of response is obtained as
follows.
Log likelihood (β1 = 0) : −27.727
Log likelihood (max) : −24.435
LR χ2 (H0 : β1 = 0) : −2(−27.727 − −24.435) = 6.584.
(Note the agreement of the LR χ2 with the contingency table likelihood ratio
χ2 , and compare 6.584 with the Wald statistic 6.03.)
Next, consider the relationship between age and response, ignoring sex.
age     response = 0   response = 1   Total   Odds (log odds)
<45      8 (61.5%)      5 (38.4%)     13      5/8 = .625 (−.47)
45–54    6 (50.0%)      6 (50.0%)     12      6/6 = 1 (0)
55+      6 (40.0%)      9 (60.0%)     15      9/6 = 1.5 (.405)
Total   20             20             40

sex=F
age     response = 0   response = 1   Total
<45      4 (100.0%)     0 (0.0%)       4
45–54    4 (80.0%)      1 (20.0%)      5
55+      6 (54.6%)      5 (45.4%)     11
Total   14              6             20

sex=M
age     response = 0   response = 1   Total
<45      4 (44.4%)      5 (55.6%)      9
45–54    2 (28.6%)      5 (71.4%)      7
55+      0 (0.0%)       4 (100.0%)     4
Total    6             14             20
The fitted logistic model is logit{Y = 1} = −9.84 + 3.49 sex + .158 age, where as before sex = 0 for females, 1 for males. For example, for a 40-year-
old female, the predicted logit is −9.84 + .158(40) = −3.52. The predicted
probability of a response is 1/[1 + exp(3.52)] = .029. For a 40-year-old male,
the predicted logit is −9.84 + 3.49 + .158(40) = −.03, with a probability
of .492.
The logistic multiple regression model can incorporate the same designs as
can ordinary linear regression. An analysis of variance (ANOVA) model for
a treatment with k levels can be formulated with k − 1 dummy variables.
This logistic model is equivalent to a 2 × k contingency table. An analysis
of covariance logistic model is simply an ANOVA model augmented with
covariables used for adjustment.
One unique design that is interesting to consider in the context of logistic
models is a simultaneous comparison of multiple factors between two groups.
Suppose, for example, that in a randomized trial with two treatments one
wished to test whether any of 10 baseline characteristics are mal-distributed
between the two groups. If the 10 factors are continuous, one could perform a
two-sample Wilcoxon–Mann–Whitney test or a t-test for each factor (if each
is normally distributed). However, this procedure would result in multiple
comparison problems and would also not be able to detect the combined ef-
fect of small differences across all the factors. A better procedure would be a
multivariate test. The Hotelling T 2 test is designed for just this situation. It
is a k-variable extension of the one-variable unpaired t-test. The T 2 test, like
discriminant analysis, assumes multivariate normality of the k factors. This
assumption is especially tenuous when some of the factors are polytomous. A
better alternative is the global test of no regression from the logistic model.
This test is valid because it can be shown that H0 : mean X is the same for
both groups (= H0 : mean X does not depend on group = H0 : mean X|
group = constant) is true if and only if H0 : Prob{group|X} = constant. Thus
k factors can be tested simultaneously for differences between the two groups
using the binary logistic model, which has far fewer assumptions than does the
Hotelling T 2 test. The logistic global test of no regression (with k d.f.) would
be expected to have greater power if there is non-normality. Since the logistic
model makes no assumption regarding the distribution of the descriptor vari-
ables, it can easily test for simultaneous group differences involving a mixture
of continuous, binary, and nominal variables. In observational studies, such
10.2 Estimation
The parameters in the logistic regression model are estimated using the maxi-
mum likelihood (ML) method. The method is based on the same principles as
the one-sample proportion example described in Section 9.1. The difference
is that the general logistic model is not a single sample or a two-sample prob-
lem. The probability of response for the ith subject depends on a particular
set of predictors Xi , and in fact the list of predictors may not be the same
for any two subjects. Denoting the response and probability of response of
the ith subject by Yi and Pi, respectively, the model states that Prob{Yi = 1|Xi} = Pi = [1 + exp(−Xiβ)]^{−1}. The log likelihood is maximized numerically (e.g., by Newton–Raphson as in Section 9.4) to obtain the MLE β̂. The inverse of the estimated observed information matrix is taken as the
estimate of the variance–covariance matrix of β̂.
Under H0 : β1 = β2 = . . . = βk = 0, the intercept parameter β0 can be
estimated explicitly and the log likelihood under this global null hypothesis
can be computed explicitly. Under the global null hypothesis, Pi = P =
[1 + exp(−β0 )]−1 and the MLE of P is P̂ = s/n where s is the number of
responses and n is the sample size. The MLE of β0 is β̂0 = logit(P̂).6 The log likelihood under this null hypothesis is
s log(P̂) + (n − s) log(1 − P̂)
   = s log(s/n) + (n − s) log[(n − s)/n]   (10.24)
   = s log s + (n − s) log(n − s) − n log n.
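As a quick check, Equation 10.24 requires only s and n; with the s = 20 responses out of n = 40 from the sex example above, it reproduces the null log likelihood −27.727 quoted there:

s <- 20; n <- 40
s*log(s/n) + (n - s)*log(1 - s/n)   # -27.726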
Once β is estimated, one can estimate any log odds, odds, or odds ratios.
The MLE of the Xj + 1 : Xj log odds ratio is β̂j , and the estimate of the
Xj + d : Xj log odds ratio is β̂j d, all other predictors remaining constant
(assuming the absence of interactions and nonlinearities involving Xj ). For
large enough samples, the MLEs are normally distributed with variances that
are consistently estimated from the estimated variance–covariance matrix.
Letting z denote the 1−α/2 critical value of the standard normal distribution,
a two-sided 1 − α confidence interval for the log odds ratio for a one-unit
increase in Xj is [β̂j − zs, β̂j + zs], where s is the estimated standard error
of β̂j . (Note that for α = .05, i.e., for a 95% confidence interval, z = 1.96.)
A theorem in statistics states that the MLE of a function of a parameter
is that same function of the MLE of the parameter. Thus the MLE of the
Xj + 1 : Xj odds ratio is exp(β̂j ). Also, if a 1 − α confidence interval of a
parameter β is [c, d] and f (u) is a one-to-one function, a 1 − α confidence
interval of f (β) is [f (c), f (d)]. Thus a 1−α confidence interval for the Xj +1 :
Xj odds ratio is exp[β̂j ± zs]. Note that while the confidence interval for βj is
symmetric about β̂j , the confidence interval for exp(βj ) is not. By the same
theorem just used, the MLE of Pi = Prob{Yi = 1|Xi} is P̂i = [1 + exp(−Xiβ̂)]^{−1}.
The estimated variance of the linear predictor Xβ̂ is XV X′, where V is the estimated variance–covariance matrix of β̂ (see Equation 9.51). This variance is the sum of all variances and covariances of β̂ weighted by squares and products of the predictors. The estimated standard error of Xβ̂, s, is the square root of this variance estimate. A 1 − α confidence interval for Pi is then [1 + exp{−(Xβ̂ ± zs)}]^{−1}.7
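A minimal sketch of this computation, using the fit fasr saved earlier (the covariate values, and the assumption that the coefficient order is intercept, sex=male, age, are hypothetical):

X  <- c(1, 1, 55)                             # intercept, sex=male, age=55
se <- sqrt(drop(t(X) %*% vcov(fasr) %*% X))   # estimated s.e. of X beta-hat
lp <- drop(X %*% coef(fasr))                  # linear predictor X beta-hat
plogis(c(Lower=lp - 1.96*se, Point=lp, Upper=lp + 1.96*se))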
Suppose there were no covariates, so that the only parameter in the model is
the intercept. What is the sample size required to allow the estimate of the
intercept to be precise enough so that the predicted probability is within 0.1
of the true probability with 0.95 confidence, when the true intercept is in the
neighborhood of zero? The answer is n = 96. What if there were one covariate, and it was binary with a prevalence of 1/2? One would need 96 subjects with X = 0 and 96 with X = 1 to have an upper bound on the margin of error for estimating Prob{Y = 1|X = x} not exceed 0.1 for either value of x.^a
Now consider a very simple single continuous predictor case in which X has a normal distribution with mean zero and standard deviation σ, with the true Prob{Y = 1|X = x} = [1 + exp(−x)]^{−1}. The expected number of events is n/2.^b The following simulation answers the question "What should n be so that the expected maximum absolute error (over x ∈ [−1.5, 1.5]) in P̂ is less than ε?"
sigmas <- c(.5, .75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4)
ns     <- seq(25, 300, by=25)
nsim   <- 1000
xs <- seq(-1.5, 1.5, length=200)
pactual <- plogis(xs)
maxerr <- N1 <- matrix(NA, length(sigmas), length(ns))  # result matrices
i <- 0
for(s in sigmas) {
  i <- i + 1
  j <- 0
  for(n in ns) {
    j <- j + 1
    n1 <- maxe <- 0
    for(k in 1:nsim) {
      x <- rnorm(n, 0, s)
      P <- plogis(x)
      y <- ifelse(runif(n) <= P, 1, 0)
      n1 <- n1 + sum(y)
      beta <- lrm.fit(x, y)$coefficients
      phat <- plogis(beta[1] + beta[2] * xs)
      maxe <- maxe + max(abs(phat - pactual))
    }
    n1   <- n1 / nsim
    maxe <- maxe / nsim
    maxerr[i, j] <- maxe
    N1[i, j] <- n1
  }
}
xrange <- range(xs)
simerr <- llist(N1, maxerr, sigmas, ns, nsim, xrange)

^a The general formula for the sample size required to achieve a margin of error of δ in estimating a true probability of θ at the 0.95 confidence level is n = (1.96/δ)^2 × θ(1 − θ). Set θ = 1/2 (intercept = 0) for the worst case.
^b The R code can easily be modified for other event frequencies, or the minimum of the number of events and non-events for a dataset at hand can be compared with n/2 in this simulation. An average maximum absolute error of 0.05 corresponds roughly to a half-width of the 0.95 confidence interval of 0.1.
10.3 Test Statistics

The likelihood ratio, score, and Wald statistics discussed earlier can be used
to test any hypothesis in the logistic model. The likelihood ratio test is gen-
erally preferred. When true parameters are near the null values all three
statistics usually agree. The Wald test has a significant drawback when the true parameter value is very far from the null value: in such a case the standard error estimate becomes too large. As β̂j increases from 0, the Wald test
statistic for H0 : βj = 0 becomes larger, but after a certain point it becomes
smaller. The statistic will eventually drop to zero if β̂j becomes infinite.278
Infinite estimates can occur in the logistic model especially when there is a
binary predictor whose mean is near 0 or 1. Wald statistics are especially
problematic in this case. For example, if 10 out of 20 males had a disease and
5 out of 5 females had the disease, the female : male odds ratio is infinite and
so is the logistic regression coefficient for sex. If such a situation occurs, the
likelihood ratio or score statistic should be used instead of the Wald statistic.
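A small demonstration of this phenomenon, using the hypothetical 10/20 males, 5/5 females example and base R's glm:

sex <- factor(c(rep('M', 20), rep('F', 5)))
y   <- c(rep(1, 10), rep(0, 10), rep(1, 5))   # all 5 females have the disease
f <- glm(y ~ sex, family=binomial)            # quasi-complete separation
summary(f)$coefficients   # enormous std. error for sex; Wald P near 1
anova(f, test='LRT')      # the likelihood ratio test still detects the effect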
Fig. 10.4 Simulated expected maximum error in estimating probabilities for x ∈ [−1.5, 1.5] with a single normally distributed X with mean zero. Curves are labeled by σ (0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4); the vertical axis is the average maximum |P̂ − P| and the horizontal axis is n.
10.4 Residuals
Several types of residuals can be computed for binary logistic model fits. Many of these residuals are used to examine the influence of individual observations on the fit. The partial residual can be used for directly assessing how each predictor should be transformed.8 For the ith observation, the partial residual
for the jth element of X is defined by
r_{ij} = β̂_j X_{ij} + (Y_i − P̂_i)/[P̂_i(1 − P̂_i)],   (10.27)
where Xij is the value of the jth variable in the ith observation, Yi is the
corresponding value of the response, and P̂i is the predicted probability that
Yi = 1. A smooth plot (using, e.g., loess) of Xij against rij will provide an
estimate of how Xj should be transformed, adjusting for the other Xs (using
their current transformations). Typically one tentatively models Xj linearly
and checks the smoothed plot for linearity. A U-shaped relationship in this plot, for example, indicates that a squared term or spline function needs to be added for Xj.9 This approach does assume additivity of predictors.
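A minimal simulated sketch (all names hypothetical): when the true effect of a predictor is quadratic but it is modeled linearly, the smoothed partial residual plot shows the U shape just described.

set.seed(1)
n <- 500
x <- rnorm(n); z <- rnorm(n)
y <- rbinom(n, 1, plogis(x^2 + z))     # true effect of x is quadratic
f <- glm(y ~ x + z, family=binomial)   # x (incorrectly) modeled linearly
phat <- fitted(f)
r <- coef(f)['x']*x + (y - phat)/(phat*(1 - phat))    # Equation 10.27
plot(x, r, col=gray(.6)); lines(lowess(x, r), lwd=2)  # U shape flags the needed term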
good as any. They used a normal Z-test for the sum of squared errors (n × B,
where B is the Brier index in Equation 10.35). This test takes into account the
fact that one cannot obtain a χ2 distribution for the sum of squares. It also
takes into account the estimation of β. It is not yet clear for which types of
lack of fit this test has reasonable power. Returning to the external validation
case where uncertainty of β does not need to be accounted for, Stallard584 has
further documented the lack of power of the original Hosmer-Lemeshow test
and found more power with a logarithmic scoring rule (deviance test) and a
χ2 test that, unlike the simple unweighted sum of squares test, weights each
squared error by dividing it by P̂i (1 − P̂i ). A scaled χ2 distribution seemed to
provide the best approximation to the null distribution of the test statistics.
More power for detecting lack of fit is expected to be obtained from testing
specific alternatives to the model. In the model logit{Y = 1|X} = β0 + β1X1 + β2X2, where X1 is binary and X2 is continuous, one needs to verify that the log odds is related to X1 and X2 according to Figure 10.5.
Fig. 10.5 Logistic regression assumptions for one binary and one continuous predictor: parallel lines of logit{Y = 1} plotted against X2, one for X1 = 1 and one for X1 = 0.
The simplest method for validating that the data are consistent with the
no-interaction linear model involves stratifying the sample by X1 and quan-
tile groups (e.g., deciles) of X2 .265 Within each stratum the proportion of
responses P̂ is computed and the log odds calculated from log[P̂ /(1 − P̂ )].
The number of quantile groups should be such that there are at least 20 (and
perhaps many more) subjects in each X1 ×X2 group. Otherwise, probabilities
cannot be estimated precisely enough to allow trends to be seen above “noise”
in the data. Since at least 3 X2 groups must be formed to allow assessment
of linearity, the total sample size must be at least 2 × 3 × 20 = 120 for this
method to work at all.
Figure 10.6 demonstrates this method for a large sample size of 3504 sub-
jects stratified by sex and deciles of age. Linearity is apparent for males while
there is evidence for slight interaction between age and sex since the age trend
for females appears curved.
getHdata(acath)
acath$sex <- factor(acath$sex, 0:1, c('male', 'female'))
dd <- datadist(acath); options(datadist='dd')
f <- lrm(sigdz ~ rcs(age, 4) * sex, data=acath)
w <- function(...)
  with(acath, {
    plsmo(age, sigdz, group=sex, fun=qlogis, lty='dotted',
          add=TRUE, grid=TRUE)
    af <- cut2(age, g=10, levels.mean=TRUE)
    prop <- qlogis(tapply(sigdz, list(af, sex), mean, na.rm=TRUE))
    agem <- as.numeric(row.names(prop))
    lpoints(agem, prop[, 'female'], pch=4, col='green')
    lpoints(agem, prop[, 'male'],   pch=2, col='green')
  })   # Figure 10.6
plot(Predict(f, age, sex), ylim=c(-2, 4), addpanel=w,
     label.curve=list(offset=unit(0.5, 'cm')))
The subgrouping method requires relatively large sample sizes and does
not use continuous factors effectively. The ordering of values is not used at all
between intervals, and the estimate of the relationship for a continuous vari-
able has little resolution. Also, the method of grouping chosen (e.g., deciles
vs. quintiles vs. rounding) can alter the shape of the plot.
In this dataset with only two variables, it is efficient to use a nonpara-
metric smoother for age, separately for males and females. Nonparametric
smoothers, such as loess111 used here, work well for binary response variables (see Section 2.4.7); the logit transformation is made on the smoothed probability estimates.10 The smoothed estimates are shown in Figure 10.6.
When there are several predictors, the restricted cubic spline function is
better for estimating the true relationship between X2 and logit{Y = 1} for
continuous variables without assuming linearity. By fitting a model containing
X2 expanded into k − 1 terms, where k is the number of knots, one can obtain
an estimate of the transformation of X2 as discussed in Section 2.4:

logit{Y = 1} = β̂0 + β̂1X1 + β̂2X2 + β̂3X2′ + β̂4X2″,

where X2′ and X2″ are constructed spline variables (when k = 4). Plotting
the estimated spline function f (X2 ) versus X2 will estimate how the effect of
X2 should be modeled. If the sample is sufficiently large, the spline function
can be fitted separately for X1 = 0 and X1 = 1, allowing detection of even
unusual interaction patterns. A formal test of linearity in X2 is obtained by
testing H0 : β3 = β4 = 0.
Fig. 10.6 Logit proportions of significant coronary artery disease by sex and deciles of age for n = 3504 patients, with spline fits (smooth curves). Spline fits are for k = 4 knots at age = 36, 48, 56, and 68 years, and interaction between age and sex is allowed. Shaded bands are pointwise 0.95 confidence limits for predicted log odds. Smooth nonparametric estimates are shown as dotted curves. Data courtesy of the Duke Cardiovascular Disease Databank.
Table 10.4

k   Model χ2   AIC
0    99.23     97.23
3   112.69    108.69
4   121.30    115.30
5   123.51    115.51
6   124.41    114.41
This analysis confirms the first impression from the graph, namely, that
age × sex interaction is present but it is not of the form of a simple product
between age and sex (change in slope). In the context of a linear age effect,
there is no significant product interaction effect (P = .14). Without allowing
for interaction, there is no significant nonlinear effect of age (P = .18). How-
ever, the general test of lack of fit with 5 d.f. indicates a significant departure
from the linear additive model (P = .005).
In Figure 10.7, data from 2332 patients who underwent cardiac catheteri-
zation at Duke University Medical Center and were found to have significant
(≥ 75%) diameter narrowing of at least one major coronary artery were ana-
lyzed (the dataset is available from the Web site). The relationship between
the time from the onset of symptoms of coronary artery disease (e.g., angina,
myocardial infarction) to the probability that the patient has severe (three-
vessel disease or left main disease—tvdlm) coronary disease was of interest.
There were 1129 patients with tvdlm. A logistic model was used with the
duration of symptoms appearing as a restricted cubic spline function with
k = 3, 4, 5, and 6 equally spaced knots in terms of quantiles between .05 and
.95. The best fit for the number of parameters was chosen using Akaike’s
information criterion (AIC), computed in Table 10.4 as the model likelihood
ratio χ2 minus twice the number of parameters in the model aside from the
intercept. The linear model is denoted k = 0.
dz <- subset(acath, sigdz == 1)
dd <- datadist(dz)
[Fig. 10.7: restricted cubic spline fit (k = 5) of the log odds of severe disease against months from onset of symptoms, with subgroup estimates plotted as triangles]
Figure 10.7 displays the spline fit for k = 5. The triangles represent sub-
group estimates obtained by dividing the sample into groups of 150 patients.
For example, the leftmost triangle represents the logit of the proportion
of tvdlm in the 150 patients with the shortest duration of symptoms, ver-
sus the mean duration in that group. A Wald test of linearity, with 3 d.f.,
showed highly significant nonlinearity (χ2 = 23.92 with 3 d.f.). The plot of the
spline transformation suggests a log transformation, and when log (duration
of symptoms in months + 1) was fitted in a logistic model, the log likelihood
of the model (119.33 with 1 d.f.) was virtually as good as the spline model
(123.51 with 4 d.f.); the corresponding Akaike information criteria (on the χ2
scale) are 117.33 and 115.51. To check for adequacy in the log transformation,
a five-knot restricted cubic spline function was fitted to log10 (months + 1),
as displayed in Figure 10.8. There is some evidence for lack of fit on the right,
but the Wald χ2 for testing linearity yields P = .27.
f <- lrm(tvdlm ~ log10(cad.dur + 1), data=dz)
w <- function(...)
  with(dz, {
    x <- cut2(cad.dur, m=150, levels.mean=TRUE)
    prop <- tapply(tvdlm, x, mean, na.rm=TRUE)
    xm <- as.numeric(names(prop))
    lpoints(xm, prop, pch=2, col='green')
  })
# Figure 10.8
plot(Predict(f, cad.dur, fun=plogis), ylab='P',
     ylim=c(.2, .8), addpanel=w)
Fig. 10.8 Fitted linear logistic model in log10(duration + 1), with subgroup estimates using groups of 150 patients. Fitted equation is logit(tvdlm) = −.9809 + .7122 log10(months + 1).
If the model contains two continuous predictors, they may both be ex-
panded with spline functions in order to test linearity or to describe nonlinear
relationships. Testing interaction is more difficult here. If X1 is continuous,
one might temporarily group X1 into quantile groups. Consider the subset
of 2258 (1490 with disease) of the 3504 patients used in Figure 10.6 who
have serum cholesterol measured. A logistic model for predicting significant
coronary disease was fitted with age in tertiles (modeled with two dummy
variables), sex, age × sex interaction, four-knot restricted cubic spline in
cholesterol, and age tertile × cholesterol interaction. Except for the sex ad-
justment this model is equivalent to fitting three separate spline functions in
cholesterol, one for each age tertile. The fitted model is shown in Figure 10.9
for cholesterol and age tertile against logit of significant disease. Significant
age × cholesterol interaction is apparent from the figure and is suggested by
the Wald χ2 statistic (10.03) that follows. Note that the test for linearity of
the interaction with respect to cholesterol is far from significant (χ2 = 2.40 on
4 d.f.), but we retain it for now. The fitted function is
acath <- transform(acath,
                   cholesterol = choleste,
                   age.tertile = cut2(age, g=3),
                   sx = as.integer(acath$sex) - 1)
# sx for loess; need to code as numeric
dd <- datadist(acath); options(datadist='dd')
ltx(f)
yl <- c(-1, 5)
plot(Predict(f, cholesterol, age.tertile),
     adj.subtitle=FALSE, ylim=yl)   # Figure 10.9
Fig. 10.9 Log odds of significant coronary artery disease modeling age with two dummy variables. Curves are labeled by age tertile: [17,49), [49,58), and [58,82].
Before fitting a parametric model that allows interaction between age and
cholesterol, let us use the local regression model of Cleveland et al.96 dis-
cussed in Section 2.4.7. This nonparametric smoothing method is not meant
to handle binary Y , but it can still provide useful graphical displays in the
binary case. Figure 10.10 depicts the fit from a local regression model predict-
ing Y = 1 = significant coronary artery disease. Predictors are sex (modeled
parametrically with a dummy variable), age, and cholesterol, the last two
fitted nonparametrically. The effect of not explicitly modeling a probability
is seen in the figure, as some predicted probabilities exceeded 1. Because of this, predicted values are truncated to [0.001, 0.999] before the logit transformation is taken. The overall shape, however, is in agreement with the parametric fits that follow.
# Re-do model with continuous age
f <- loess(sigdz ~ age * (sx + cholesterol), data=acath,
           parametric="sx", drop.square="sx")
ages  <- seq(25, 75, length=40)
chols <- seq(100, 400, length=40)
g <- expand.grid(cholesterol=chols, age=ages, sx=0)
# drop sex dimension of grid since held to 1 value
p <- drop(predict(f, g))
p[p < 0.001] <- 0.001
p[p > 0.999] <- 0.999
zl <- c(-3, 6)   # Figure 10.10
wireframe(qlogis(p) ~ cholesterol * age,
          xlab=list(rot=30), ylab=list(rot=-40),
          zlab=list(label='log odds', rot=90), zlim=zl,
          scales=list(arrows=FALSE), data=g)
Fig. 10.10 Local regression fit for the logit of the probability of significant coronary disease vs. age and cholesterol for males, based on the loess function.
^c In the wireframe plots that follow, predictions for cholesterol–age combinations for which fewer than 5 exterior points exist are not shown, so as not to extrapolate to regions not supported by at least five points beyond the data perimeter.
latex(anova(f), caption='Linear spline surface', file='',
      size='smaller', label='tab:anova-lsp')   # Table 10.6
The general model allowing nonlinearity in and interaction between X1 and X2 is

logit{Y = 1} = β0 + β1X1 + β2X1′ + β3X1″ + β4X2 + β5X2′ + β6X2″
             + β7X1X2 + β8X1X2′ + β9X1X2″ + β10X1′X2 + β11X1′X2′
             + β12X1′X2″ + β13X1″X2 + β14X1″X2′ + β15X1″X2″,

where X1′, X1″, X2′, and X2″ are restricted cubic spline component variables for X1 and X2 for k = 4. A general test of interaction with 9 d.f. is H0 : β7 = … = β15 = 0. A test of adequacy of a simple product form interaction is H0 : β8 = … = β15 = 0 with 8 d.f. A 13 d.f. test of linearity and additivity is H0 : β2 = β3 = β5 = β6 = β7 = β8 = β9 = β10 = β11 = β12 = β13 = β14 = β15 = 0.
Figure 10.12 depicts the fit of this model. There is excellent agreement with
Figures 10.9 and 10.11, including an increased (but probably insignificant)
risk with low cholesterol for age ≥ 57.
f <- lrm(sigdz ~ rcs(age, 4) * (sex + rcs(cholesterol, 4)),
         data=acath, tol=1e-11)
ltx(f)
Fig. 10.11 Linear spline surface for males, with knots for age at 46, 52, 59 and knots for cholesterol at 196, 224, and 259 (quartiles).
# Figure 10.12:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
Statistics for testing age × cholesterol components of this fit are above.
None of the nonlinear interaction components is significant, but we again
retain them.
The general interaction model can be restricted to be of the form
by removing the parameters β11 , β12 , β14 , and β15 from the model. The previ-
ous table of Wald statistics included a test of adequacy of this reduced form
(χ2 = 5.41 on 4 d.f., P = .248). The resulting fit is in Figure 10.13.
f <- lrm(sigdz ~ sex * rcs(age, 4) + rcs(cholesterol, 4) +
           rcs(age, 4) %ia% rcs(cholesterol, 4), data=acath)
latex(anova(f), file='', size='smaller',
      caption='Singly nonlinear cubic spline surface',
      label='tab:anova-ria')   # Table 10.8
Fig. 10.12 Restricted cubic spline surface in two variables, each with k = 4 knots
# Figure 10.13:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
ltx(f)
X β̂ = −7.2 + 2.96[female] + 0.164 age + 7.23×10^{−5}(age − 36)_+^3 − 0.000106(age − 48)_+^3 − 1.63×10^{−5}(age − 56)_+^3 + 4.99×10^{−5}(age − 68)_+^3 + 0.0148 cholesterol + 1.21×10^{−6}(cholesterol − 160)_+^3 − 5.5×10^{−6}(cholesterol − 208)_+^3 + 5.5×10^{−6}(cholesterol − 243)_+^3 − 1.21×10^{−6}(cholesterol − 319)_+^3 + age[−0.00029 cholesterol + 9.28×10^{−9}(cholesterol − 160)_+^3 + 1.7×10^{−8}(cholesterol − 208)_+^3 − 4.43×10^{−8}(cholesterol − 243)_+^3 + 1.79×10^{−8}(cholesterol − 319)_+^3] + cholesterol[2.3×10^{−7}(age − 36)_+^3 + 4.21×10^{−7}(age − 48)_+^3 − 1.31×10^{−6}(age − 56)_+^3 + 6.64×10^{−7}(age − 68)_+^3] + [female][−0.111 age + 8.03×10^{−5}(age − 36)_+^3 + 0.000135(age − 48)_+^3 − 0.00044(age − 56)_+^3 + 0.000224(age − 68)_+^3], where (x)_+ = x if x > 0 and 0 otherwise.
The fit is similar to the former one except that the climb in risk for low-
cholesterol older subjects is less pronounced. The test for nonlinear interac-
tion is now more concentrated (P = .54 with 4 d.f.). Figure 10.14 accordingly
depicts a fit that allows age and cholesterol to have nonlinear main effects,
but restricts the interaction to be a product between (untransformed) age
and cholesterol. The function agrees substantially with the previous fit.
f <- lrm(sigdz ~ rcs(age, 4) * sex + rcs(cholesterol, 4) +
           age %ia% cholesterol, data=acath)
latex(anova(f), caption='Linear interaction surface', file='',
      size='smaller', label='tab:anova-lia')   # Table 10.9
# Figure 10.14:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
f.linia <- f   # save linear interaction fit for later
ltx(f)
Fig. 10.13 Restricted cubic spline fit with age × spline(cholesterol) and cholesterol × spline(age)
Fig. 10.14 Spline fit with nonlinear effects of cholesterol and age and a simple product interaction
Fig. 10.15 Predictions from the linear interaction model, with mean age within tertiles (41.74, 53.06, and 63.73 years) indicated.
The partial residuals discussed in Section 10.4 can be used to check lo-
gistic model fit (although it may be difficult to deal with interactions). As
an example, reconsider the “duration of symptoms” fit in Figure 10.7. Fig-
ure 10.16 displays “loess smoothed” and raw partial residuals for the original
and log-transformed variable. The latter provides a more linear relationship,
especially where the data are most dense.
Fig. 10.16 Partial residuals for duration and log10 (duration+1). Data density shown
at top of each plot.
10.6 Collinearity
The variance inflation factors (VIFs) discussed in Section 4.6 can apply to
any regression fit.147, 654 These VIFs allow the analyst to isolate which vari-
able(s) are responsible for highly correlated parameter estimates. Recall that,
in general, collinearity is not a large problem compared with nonlinearity and
overfitting.
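As a brief illustration, VIFs may be computed directly from a fitted rms model object; the call below is a sketch assuming a previously fitted binary logistic fit f.

vif(f)   # one variance inflation factor per non-intercept coefficient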
For a population model that is truly linear in age and additive in age and sex, the apparent influential observations are just random occurrences. It is unwise to assume that in real data all points will agree with overall trends. Removal of such points would bias the results, making the model apparently more predictive than it will be prospectively.11 See Table 10.11.
f ← update(fasr, x=TRUE, y=TRUE)
which.influence(f, .4)   # Table 10.11
10.8 Quantifying Predictive Ability

The test statistics discussed above allow one to test whether a factor or set of
factors is related to the response. If the sample is sufficiently large, a factor
that grades risk from .01 to .02 may be a significant risk factor. However, that
factor is not very useful in predicting the response for an individual subject.
There is controversy regarding the appropriateness of R² from ordinary least squares in this setting.136, 424 12 The generalized R²N index of Nagelkerke,471 Cragg and Uhler,137 Maddala,431 and Magee432 described in Section 9.8.3 can be useful for quantifying the predictive strength of a model:
R²N = [1 − exp(−LR/n)] / [1 − exp(−L0/n)],   (10.34)
where LR is the global log likelihood ratio statistic for testing the importance
of all p predictors in the model and L0 is the −2 log likelihood for the null
model. 13
Tjur613 coined the term “coefficient of discrimination” D, defined as the
average P̂ when Y = 1 minus the average P̂ when Y = 0, and showed how it
ties in with sum of squares–based R2 measures. D has many advantages as
an index of predictive power.d
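Tjur's D is simple to compute from a fitted binary logistic model. A minimal sketch, assuming a binary lrm fit f that was created with y=TRUE so the responses are stored in the fit:

phat ← predict(f, type='fitted')   # predicted probabilities
y    ← f$y                          # observed binary responses
D    ← mean(phat[y == 1]) - mean(phat[y == 0])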
Linnet416 advocates quadratic and logarithmic probability scoring rules
for measuring predictive performance for probability models. Linnet shows
how to bootstrap such measures to get bias-corrected estimates and how to
use bootstrapping to compare two correlated scores. The quadratic scoring
rule is Brier’s score, frequently used in judging meteorologic forecasts30, 73:
B = (1/n) Σⁿᵢ₌₁ (P̂i − Yi)²,   (10.35)
where P̂i is the predicted probability and Yi the corresponding observed re-
sponse for the ith observation. 14
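Brier's score is then just the mean squared error of the predicted probabilities; a one-line sketch reusing the hypothetical phat and y defined above:

B ← mean((phat - y)^2)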
A unitless index of the strength of the rank correlation between predicted
probability of response and actual response is a more interpretable measure of
the fitted model’s predictive discrimination. One such index is the probability
of concordance, c, between predicted probability and response. The c index,
which is derived from the Wilcoxon–Mann–Whitney two-sample rank test,
is computed by taking all possible pairs of subjects such that one subject
responded and the other did not. The index is the proportion of such pairs
with the responder having a higher predicted probability of response than
the nonresponder.
Bamber39 and Hanley and McNeil255 have shown that c is identical to a
widely used measure of diagnostic discrimination, the area under a “receiver
operating characteristic” (ROC) curve. A value of c of .5 indicates random pre-
dictions, and a value of 1 indicates perfect prediction (i.e., perfect separation
of responders and nonresponders). A model having c greater than roughly
.8 has some utility in predicting the responses of individual subjects. The
concordance index is also related to another widely used index, Somers’ Dxy
rank correlation579 between predicted probabilities and observed responses,
by the identity
Dxy = 2(c − .5). (10.36)
Dxy is the difference between concordance and discordance probabilities. When Dxy = 0, the model is making random predictions. When Dxy = 1, the model's predictions are perfectly discriminating.

d Note that D and B (below) and other indexes not related to c do not work well in case-control studies because of their reliance on absolute probability estimates.
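There is no need to enumerate pairs to compute c and Dxy; the Hmisc somers2 function does the rank-based calculation directly. A sketch, again assuming predicted probabilities phat and binary responses y as above:

require(Hmisc)
somers2(phat, y)   # returns C (the c index), Dxy, n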
10.9 Validating the Fitted Model

The major cause of unreliable models is overfitting the data. The methods
described in Section 5.3 can be used to assess the accuracy of models fairly.
If a sample has been held out and never used to study associations with the
response, indexes of predictive accuracy can now be estimated using that
sample. More efficient is cross-validation, and bootstrapping is the most ef-
ficient validation procedure. As discussed earlier, bootstrapping does not re-
quire holding out any data, since all aspects of model development (stepwise
variable selection, tests of linearity, estimation of coefficients, etc.) are re-
validated on samples taken with replacement from the whole sample.
Cox130 proposed and Harrell and Lee267 and Miller et al.457 further de-
veloped the idea of fitting a new binary logistic model to a new sample to
estimate the relationship between the predicted probability and the observed
outcome in that sample. This fit provides a simple calibration equation that
can be used to quantify unreliability (lack of calibration) and to calibrate
the predictions for future use. This logistic calibration also leads to indexes
of unreliability (U ), discrimination (D), and overall quality (Q = D − U )
which are derived from likelihood ratio tests267 . Q is a logarithmic scoring
rule, which can be compared with Brier’s index (Equation 10.35). See [633]
for many more ideas.
With bootstrapping we do not have a separate validation sample for as-
sessing calibration, but we can estimate the overoptimism in assuming that
the final model needs no calibration, that is, it has overall intercept=0 and
slope=1. As discussed in Section 5.3, refitting the model

Pc = [1 + exp(−(γ0 + γ1 logit(P̂)))]⁻¹

(where Pc denotes the calibrated probability and the original predicted probability is P̂ = [1 + exp(−X β̂)]⁻¹) in the original sample will always result in
γ = (γ0 , γ1 ) = (0, 1), since a logistic model will always “fit” the training sam-
ple when assessed overall. We thus estimate γ by using Efron’s172 method to
estimate the overoptimism in (0, 1) to obtain bias-corrected estimates of the
true calibration. Simulations have shown this method produces an efficient
estimate of γ.259
More stringent calibration checks can be made by running separate calibra-
tions for different covariate levels. Smooth nonparametric curves described in
Section 10.11 are more flexible than the linear-logit calibration method just
described.
A good set of indexes to estimate for summarizing a model validation is the
c or Dxy indexes and measures of calibration. In addition, the overoptimism
in the indexes may be reported to quantify the amount of overfitting present.
The estimate of γ can be used to draw a calibration curve by plotting P̂
on the x-axis and P̂c = [1 + exp −(γ0 + γ1 L)]−1 on the y-axis, where L =
logit(P̂).130, 267 An easily interpreted index of unreliability, Emax, follows immediately from this calibration model:

Emax = max |P̂ − P̂c|,

the maximum absolute difference between predicted and calibrated probabilities over the range of predictions.
latex(v1, caption='Bootstrap Validation, 2 Predictors Without Stepdown',
      digits=2, size='Ssize', file='')
latex(v2, caption='Bootstrap Validation, 2 Predictors with Stepdown',
      digits=2, B=15, file='', size='Ssize')
[Table: factors selected (•) in the first 15 bootstrap resamples; sex and age were each selected in nearly every resample.]
The apparent Somers’ Dxy is 0.7 for the original stepwise model (which ac-
tually retained both age and sex), and the bias-corrected Dxy is 0.63, slightly
worse than the more correct model which forced in both variables. The cal-
ibration was also slightly worse as reflected in the slope correction factor
estimate of 0.84 versus 0.91.
Next, five additional candidate variables are considered. These variables
are random uniform variables, x1, . . . , x5 on the [0, 1] interval, and have no
association with the response.
set.seed (133)
n ← nrow ( d )
x1 ← runif ( n )
x2 ← runif ( n )
x3 ← runif ( n )
x4 ← runif ( n )
x5 ← runif ( n )
f ← lrm ( response ∼ age + sex + x1 + x2 + x3 + x4 + x5 ,
data =d , x = TRUE , y = TRUE )
v3 ← validate (f , B =150 , bw = TRUE ,
rule = ' p ' , sls = .1 , type = ' individual ' )
latex ( v3 ,
caption = ' Bootstrap Validation with 5 Noise Variables and
Stepdown ' , digits =2 , B =15 , size = ' Ssize ' , file = ' ' )
[Table: factors selected (•) in the first bootstrap resamples with the five noise variables as candidates; the selected subsets vary considerably from resample to resample.]
10.10 Describing the Fitted Model

Once the proper variables have been modeled and all model assumptions have
been met, the analyst needs to present and interpret the fitted model. There
are at least three ways to proceed. The coefficients in the model may be
interpreted. For each variable, the change in log odds for a sensible change in
the variable value (e.g., interquartile range) may be computed. Also, the odds ratio or factor by which the odds increases for a certain change in a predictor, holding all other predictors constant, may be displayed. Table 10.12 contains such summary statistics for the linear age × cholesterol interaction surface fit described in Section 10.5.

[Figure: calibration curves for the three validated models (age, sex; age, sex with stepdown; age, sex, x1–x5), along with the ideal line.]
s ← summary(f.linia)   # Table 10.12
latex(s, file='', size='Ssize',
      label='tab:lrm-cholxage-confbar')
The outer quartiles of age are 46 and 59 years, so the “half-sample” odds
ratio for age is 2.47, with 0.95 confidence interval [1.63, 3.74] when sex is male
and cholesterol is set to its median. The effect of increasing cholesterol from
196 (its lower quartile) to 259 (its upper quartile) is to increase the log odds
by 0.79 or to increase the odds by a factor of 2.21. Since there are interactions
allowed between age and sex and between age and cholesterol, each odds ratio
in the above table depends on the setting of at least one other factor. The
[Figure 10.18: odds ratios and multi-level confidence bars for age (59:46), cholesterol (259:196), and sex (female:male).]
results are shown graphically in Figure 10.18. The shaded confidence bars
show various levels of confidence and do not pin the analyst down to, say, the
0.95 level.
For those used to thinking in terms of odds or log odds, the preceding
description may be sufficient. Many prefer, however, to interpret the model in terms of predicted probabilities rather than odds. If the model contains only
a single predictor (even if several spline terms are required to represent that
predictor), one may simply plot the predictor against the predicted response.
Such a plot is shown in Figure 10.19 which depicts the fitted relationship
between age of diagnosis and the probability of acute bacterial meningitis
(ABM) as opposed to acute viral meningitis (AVM), based on an analysis of
422 cases from Duke University Medical Center.580 The data may be found
on the web site. A linear spline function with knots at 1, 2, and 22 years was
used to model this relationship.
When the model contains more than one predictor, one may graph the pre-
dictor against log odds, and barring interactions, the shape of this relationship
will be independent of the level of the other predictors. When displaying the
model on what is usually a more interpretable scale, the probability scale, a
difficulty arises in that unlike log odds the relationship between one predictor
and the probability of response depends on the levels of all other factors. For
example, in the model
Fig. 10.19 Linear spline fit for probability of bacterial versus viral meningitis as a
function of age at onset580 . Points are simple proportions by age quantile groups.
When three factors are present, one could draw a separate graph for each
level of X3 , a separate curve on each graph for each level of X1 , and vary X2
on the x-axis. Instead of this, or if more than three factors are present, a good
way to display the results may be to plot “adjusted probability estimates” as
a function of one predictor, adjusting all other factors to constants such as
the mean. For example, one could display a graph relating serum cholesterol
to probability of myocardial infarction or death, holding age constant at 55,
sex at 1 (male), and systolic blood pressure at 120 mmHg.
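With rms, such adjusted displays require no manual coding of the adjustments. A minimal sketch, assuming a fitted lrm object f containing cholesterol, age, sex, and a blood pressure variable sysbp (the variable names here are hypothetical):

ggplot(Predict(f, cholesterol, age=55, sex='male', sysbp=120,
               fun=plogis))   # display on the probability scale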
The final method for displaying the relationship between several predictors
and probability of response is to construct a nomogram.40, 254 A nomogram
not only sheds light on how the effect of one predictor on the probability of
response depends on the levels of other factors, but it allows one to quickly
estimate the probability of response for individual subjects. The nomogram
in Figure 10.20 allows one to predict the probability of acute bacterial menin-
gitis (given the patient has either viral or bacterial meningitis) using the same
sample as in Figure 10.19. Here there are four continuous predictor values,
none of which are linearly related to log odds of bacterial meningitis: age
at admission (expressed as a linear spline function), month of admission (ex-
pressed as |month−8|), cerebrospinal fluid glucose/blood glucose ratio (linear
effect truncated at .6; that is, the effect is the glucose ratio if it is ≤ .6, and .6
if it exceeded .6), and the cube root of the total number of polymorphonuclear
leukocytes in the cerebrospinal fluid. 17
The model associated with Figure 10.14 is depicted in what could be called
a “precision nomogram” in Figure 10.21. Discrete cholesterol levels were re-
quired because of the interaction between two continuous variables.
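A nomogram such as Figure 10.21 is drawn from a fit in two steps. A minimal sketch using the saved linear interaction fit f.linia; the exact settings used for the figure (e.g., the separate axes for discrete age levels and the confidence bars) are not reproduced here:

nom ← nomogram(f.linia, fun=plogis,
               funlabel='Probability of CAD')
plot(nom)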
Fig. 10.20 Nomogram for estimating probability of bacterial (ABM) versus viral
(AVM) meningitis. Step 1, place ruler on reading lines for patient’s age and month
of presentation and mark intersection with line A; step 2, place ruler on values for
glucose ratio and total polymorphonuclear leukocyte (PMN) count in cerebrospinal
fluid and mark intersection with line B; step 3, use ruler to join marks on lines A and
B, then read off the probability of ABM versus AVM.580
Fig. 10.21 Nomogram relating age, sex, and cholesterol to the log odds and to
the probability of significant coronary artery disease. Select one axis corresponding
to sex and to age ∈ {30, 40, 50, 60, 70}. There is linear interaction between age and
sex and between age and cholesterol. 0.70 and 0.90 confidence intervals are shown
(0.90 in gray). Note that for the “Linear Predictor” scale there are various lengths
of confidence intervals near the same value of X β̂, demonstrating that the standard
error of X β̂ depends on the individual X values. Also note that confidence intervals
corresponding to smaller patient groups (e.g., females) are wider.
[Figure: calibration of predicted probabilities, showing the ideal line, logistic calibration, nonparametric estimate, and grouped observations; intercept −0.371, slope 0.544, Emax 0.211, Spiegelhalter z = 2.351, p = 0.019.]
one of the predictors is a discrete variable such as sex group. When discrimi-
nant analysis assumptions are violated, logistic regression yields more accurate
estimates.251, 514 Even when discriminant analysis is optimal (i.e., when all
its assumptions are satisfied) logistic regression is virtually as accurate as the
discriminant model.264
4 See [573] for a review of measures of effect for binary outcomes.
5 Cepeda et al.95 found that propensity adjustment is better than covariate ad-
justment with logistic models when the number of events per variable is less
than 8.
6 Pregibon512 developed a modification of the log likelihood function that when
maximized results in a fit that is resistant to overly influential and outlying
observations.
7 See Hosmer and Lemeshow306 for methods of testing for a difference in the
observed event proportion and the predicted event probability (average of pre-
dicted probabilities) for a group of heterogeneous subjects.
8 See Hosmer and Lemeshow,305 Kay and Little,341 and Collett [115, Chap. 5].
Landwehr et al.373 proposed the partial residual (see also Fowlkes199 ).
9 See Berk and Booth51 for other partial-like residuals.
10 See [341] for an example comparing a smoothing method with a parametric
logistic model fit.
11 See Collett [115, Chap. 5] and Pregibon512 for more information about influence
statistics. Pregibon’s resistant estimator of β handles overly influential groups
of observations and allows one to estimate the weight that an observation con-
tributed to the fit after making the fit robust. Observations receiving low weight
are partially ignored but are not deleted.
12 Buyse86 showed that in the case of a single categorical predictor, the ordi-
nary R2 has a ready interpretation in terms of variance explained for binary
responses. Menard454 studied various indexes for binary logistic regression. He
criticized R2N for being too dependent on the proportion of observations with
Y = 1. Hu et al.309 further studied the properties of variance-based R2 mea-
sures for binary responses. Tjur613 has a nice discussion of discrimination graphics.
10.13 Problems
d. Plot the estimated logit response as a function of age and sex, with and
without fitting an interaction term.
e. Perform a likelihood ratio test of H0 : the model containing only age
and sex is adequate versus Ha : model is inadequate. Here, “inadequate”
may mean nonlinearity (quadratic) in age or presence of an interaction.
f. Assuming no interaction is present, test H0 : model is linear in age versus
Ha : model is nonlinear in age. Allow “nonlinear” to be more general
than quadratic. (Hint: use a restricted cubic spline function with knots
at age=39, 45, 55, 64 years.)
g. Plot age against the estimated spline transformation of age (the trans-
formation that would make age fit linearly). You can set the sex and
intercept terms to anything you choose. Also plot Prob{response = 1 |
age, sex} from this fitted restricted cubic spline logistic model.
2. Consider a binary logistic regression model using the following predictors:
age (years), sex, race (white, African-American, Hispanic, Oriental, other),
blood pressure (mmHg). The fitted model is given by
logit Prob[Y = 1|X] = X β̂ = −1.36 + .03(race = African-American)
− .04(race = Hispanic) + .05(race = Oriental) − .06(race = other)
+ .07|blood pressure − 110| + .3(sex = male) − .1 age + .002 age²
+ (sex = male)[.05 age − .003 age²].
a. Compute the predicted logit (log odds) that Y = 1 for a 50-year-old
female Hispanic with a blood pressure of 90 mmHg. Also compute the
odds that Y = 1 (Prob[Y = 1]/Prob[Y = 0]) and the estimated proba-
bility that Y = 1.
b. Estimate odds ratios for each nonwhite race compared with the ref-
erence group (white), holding all other predictors constant. Why can
you estimate the relative effect of race for all types of subjects without
specifying their characteristics?
c. Compute the odds ratio for a blood pressure of 120 mmHg compared
with a blood pressure of 105, holding age first to 30 years and then to
40 years.
d. Compute the odds ratio for a blood pressure of 120 mmHg compared
with a blood pressure of 105, all other variables held to unspecified
constants. Why is this relative effect meaningful without knowing the
subject’s age, race, or sex?
e. Compute the estimated risk difference in changing blood pressure from
105 mmHg to 120 mmHg, first for age = 30 then for age = 40, for a
white female. Why does the risk difference depend on age?
f. Compute the relative odds for males compared with females, for age = 50
and other variables held constant.
g. Same as the previous question but for females : males instead of males
: females.
h. Compute the odds ratio resulting from increasing age from 50 to 55
for males, and then for females, other variables held constant. What is
wrong with the following question: What is the relative effect of chang-
ing age by one year?
Chapter 11
Case Study in Binary Logistic Regression,
Model Selection and Approximation:
Predicting Cause of Death
11.1 Overview
11.2 Background
Kay340 did a formal analysis of the competing risks for cancer, cardiovascular,
and other deaths. It can also be quite informative to study how treatment
and baseline variables relate to the cause of death for those patients who
died.376 We subset the original dataset to those patients dying from prostate
cancer (n = 130), heart or vascular disease (n = 96), or cerebrovascular
disease (n = 31). Our goal is to predict cardiovascular–cerebrovascular death
(cvd, n = 127) given the patient died from either cvd or prostate cancer. Of
interest is whether the time to death has an effect on the cause of death, and
whether the importance of certain variables depends on the time of death.
In R, first obtain the desired subset of the data and do some preliminary
calculations such as combining an infrequent category with the next category,
and dichotomizing ekg for use in ordinary principal components (PCs).
require(rms)
getHdata(prostate)
prostate ←
  within(prostate, {
    levels(ekg)[levels(ekg) %in%
                c('old MI', 'recent MI')] ← 'MI'
    ekg.norm ← 1 * (ekg %in% c('normal', 'benign'))
    levels(ekg) ← abbreviate(levels(ekg))
    pfn ← as.numeric(pf)
    levels(pf) ← levels(pf)[c(1, 2, 3, 3)]
    cvd ← status %in% c("dead - heart or vascular",
                        "dead - cerebrovascular")
    rxn ← as.numeric(rx) })
# Use transcan to compute optimal pre-transformations
ptrans ←   # See Figure 8.3
  transcan(∼ sz + sg + ap + sbp + dbp +
             age + wt + hg + ekg + pf + bm + hx + dtime + rx,
           imputed=TRUE, transformed=TRUE,
           data=prostate, pl=FALSE, pr=FALSE)
# Use transcan single imputations
imp ← impute ( ptrans , data = prostate , list.out = TRUE )
sz sg age wt ekg
5 11 1 2 8
The five approaches to modeling the outcome are compared using AIC (where
smaller is better).
c ( f8 = AIC ( f8 ), f8t = AIC ( f8t ), f = AIC ( f ), g = AIC ( g ), h = AIC ( h ))
f8 f8t f g h
257.6573 254.5172 255.8545 263.8413 254.5317
Based on AIC, the more traditional model fitted to the raw data and as-
suming linearity for all the continuous predictors has only a slight chance
of producing worse cross-validated predictive accuracy than other methods.
The chances are also good that effect estimates from this simple model will
have competitive mean squared errors.
Here we describe the simple all-linear full model. Summary statistics and a
Wald-ANOVA table are below, followed by partial effects plots with pointwise
confidence bands, and odds ratios over default ranges of predictors.
print (f , latex = TRUE )
an ← anova ( f )
latex ( an , file = ' ' , table.env = FALSE )
χ2 d.f. P
sz 14.42 1 0.0001
sg 6.67 1 0.0098
ap 7.74 1 0.0054
sbp 0.41 1 0.5201
dbp 5.78 1 0.0162
age 10.45 1 0.0012
wt 1.59 1 0.2069
hg 0.86 1 0.3524
ekg 6.76 5 0.2391
pf 5.52 2 0.0632
bm 0.08 1 0.7738
hx 8.33 1 0.0039
rx 5.72 3 0.1260
dtime 1.61 1 0.2040
TOTAL 66.87 21 < 0.0001
[Figure 11.1: dot chart of Wald χ² minus d.f. for each predictor; sz, age, hx, and ap rank highest, bm and sbp lowest.]
Fig. 11.1 Ranking of apparent importance of predictors of cause of death
Now use fast backward step-down (with total residual AIC as the stopping
rule) to identify the variables that explain the bulk of the cause of death.
Later validation will take this screening of variables into account. The greatly
reduced model results in a simple nomogram.
fastbw(f)

          Coef     S.E.    Wald Z  P
Intercept -3.74986 1.82887 -2.050  0.0403286
sz        -0.04862 0.01532 -3.174  0.0015013
ap        -0.40694 0.11117 -3.660  0.0002518
age        0.06000 0.02562  2.342  0.0191701
hx         0.86969 0.34339  2.533  0.0113198

[1] sz ap age hx
Fig. 11.2 Partial effects (log odds scale) in full model for cause of death, along with
vertical line segments showing the raw data distribution of predictors
Prob{cvd} = 1 / [1 + exp(−X β̂)], where

X β̂ = −5.009276 − 0.05510121 sz − 0.509185 log(ap) + 0.0788052 age + 1.070601 hx
Fig. 11.3 Interquartile-range odds ratios for continuous predictors and simple odds
ratios for categorical predictors. Numbers at left are upper quartile : lower quartile or
current group : reference group. The bars represent 0.9, 0.95, 0.99 confidence limits.
The intervals are drawn on the log odds ratio scale and labeled on the odds ratio
scale. Ranges are on the original scale.
It is readily seen from this model that patients with a history of heart
disease, and patients with less extensive prostate cancer are those more likely
to die from cvd rather than from cancer. But beware that it is easy to over-
interpret findings when using unpenalized estimation, and confidence inter-
vals are too narrow. Let us use the bootstrap to study the uncertainty in
the selection of variables and to penalize for this uncertainty when estimat-
ing predictive performance of the model. The variables selected in the first 20
bootstrap resamples are shown, making it obvious that the set of “significant”
variables, i.e., the final model, is somewhat arbitrary.
f ← update (f , x = TRUE , y = TRUE )
v ← validate (f , B =200 , bw = TRUE )
Fig. 11.4 Nomogram calculating X β̂ and P̂ for cvd as the cause of death, using
the step-down model. For each predictor, read the points assigned on the 0–100 scale
and add these points. Read the result on the Total Points scale and then read the
corresponding predictions below it.
The slope shrinkage (γ̂) is a bit lower than was estimated above. There is
drop-off in all indexes. The estimated likely future predictive discrimination
of the model as measured by Somers’ Dxy fell from 0.682 to 0.611. The
latter estimate is the one that should be claimed when describing model
performance.
A nearly unbiased estimate of future calibration of the stepwise-derived
model is given below.
cal ← calibrate (f , B =200 , bw = TRUE )
plot ( cal ) # Figure 11.5
The amount of overfitting seen in Figure 11.5 is consistent with the indexes
produced by the validate function.
For comparison, consider a bootstrap validation of the full model without
using variable selection.
vfull ← validate (f , B =200)
latex ( vfull , digits =3)
[Figure 11.5: bootstrap overfitting-corrected calibration curve for the step-down model, with apparent, bias-corrected, and ideal lines.]
The performance statistics are midway between the full model and the
smaller stepwise model.
11.7 Model Approximation
After 6 deletions, slightly more than 0.05 of both the LR χ2 and the approx-
imation R2 are lost (see Figure 11.6). Therefore we take as our approximate
model the one that removed 6 predictors. The equation for this model is
below, and its nomogram is in Figure 11.7.
fapprox ← ols(lp ∼ sz + sg + log(ap) + age + ekg + pf + hx +
              rx, data=psub)
fapprox$stats['R2']   # as a check

       R2
0.9453396

latex(fapprox, file='')
Fig. 11.6 Fraction of explainable variation (full model LR χ2 ) in cvd that was
explained by approximate models, along with approximation accuracy (x–axis)
X β̂ =
−2.868303 − 0.06233241 sz − 0.3157901 sg − 0.3834479 log(ap) + 0.09089393 age
+1.396922[bngn] + 0.06275034[rd&ec] − 1.24892[hbocd] + 0.6511938[hrts]
+0.3236771[MI]
+1.116028[in bed < 50% daytime] − 2.436734[in bed > 50% daytime]
+1.05316 hx
−0.3888534[0.2 mg estrogen] + 0.6920495[1.0 mg estrogen]
+0.7834498[5.0 mg estrogen]
Fig. 11.7 Nomogram for predicting the probability of cvd based on the approximate
model
Chapter 12
Logistic Model Case Study 2: Survival
of Titanic Passengers
t3
6 Variables 1309 Observations
pclass
n missing unique
1309 0 3
1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%)
survived : Survived
n missing unique Info Sum Mean
1309 0 2 0.71 500 0.382
age : Age [years]
n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
1046 263 98 1 29.88 5 14 21 28 39 50 57
lowest : 0.1667 0.3333 0.4167 0.6667 0.7500
highest: 70.5000 71.0000 74.0000 76.0000 80.0000
sex
n missing unique
1309 0 2
female (466, 36%), male (843, 64%)
Next, we obtain access to the needed variables and observations, and save data
distribution characteristics for plotting and for computing predictor effects.
There are not many passengers having more than 3 siblings or spouses or
more than 3 children, so we truncate two variables at 3 for the purpose of
estimating stratified survival probabilities.
dd ← datadist ( t3 )
# describe distributions of variables to rms
options ( datadist = ' dd ' )
s ← summary ( survived ∼ age + sex + pclass +
cut2 ( sibsp ,0:3) + cut2 ( parch ,0:3) , data = t3 )
plot (s , main = ' ' , subtitles = FALSE ) # Figure 12.1
Note the large number of missing ages. Also note the strong effects of sex and
passenger class on the probability of surviving. The age effect does not appear
to be very strong, because as we show later, much of the effect is restricted to
age < 21 years for one of the sexes. The effects of the last two variables are
unclear as the estimated proportions are not monotonic in the values of these
descriptors. Although some of the cell sizes are small, we can show four-way
empirical relationships with the fraction of surviving passengers by creating
four cells for sibsp × parch combinations and by creating two age groups. We
suppress proportions based on fewer than 25 passengers in a cell. Results are
shown in Figure 12.2.
tn ← transform ( t3 ,
agec = ifelse ( age < 21 , ' child ' , ' adult ' ),
sibsp = ifelse ( sibsp == 0, ' no sib / sp ' , ' sib / sp ' ),
parch = ifelse ( parch == 0, ' no par / child ' , ' par / child ' ))
[Figure 12.2: proportion surviving by passenger class and sex, for adults and children, within sibsp × parch combinations.]
Note that none of the effects of sibsp or parch for common passenger groups
appear strong on an absolute risk scale.
[Figure 12.3: nonparametric estimates of the probability of survival as a function of age, stratified by sex and by passenger class.]
Figure 12.3 shows much of the story of passenger survival patterns. “Women
and children first” seems to be true except for women in third class. It is
interesting that there is no real cutoff for who is considered a child. For men,
the younger the greater chance of surviving. The interpretation of the effects
of the “number of relatives”-type variables will be more difficult, as their
definitions are a function of age. Figure 12.4 shows these relationships.
# Figure 12.4
top ← theme(legend.position='top')
p1 ← ggplot(t3, aes(x=age, y=survived, color=cut2(sibsp, 0:2))) +
     stat_plsmo() + b + ylim(0,1) + yl + top +
     scale_color_discrete(name='siblings/spouses')
Fig. 12.4 Relationship between age and survival stratified by the number of siblings
or spouses on board (left panel) or by the number of parents or children of the
passenger on board (right panel).
is fitted below. The ^2 in the model formula means to expand the terms in
parentheses to include all main effects and second-order interactions.
f ← lrm(survived ∼ (sex + pclass + rcs(age,5))^2 +
        rcs(age,5) * sibsp, data=t3)
print(f, latex=TRUE)
latex(anova(f), file='', label='titanic-anova2', size='small')
# Table 12.2
This is a very powerful model (ROC area = c = 0.88); the survival patterns
are easy to detect. The Wald ANOVA in Table 12.2 indicates especially strong
sex and pclass effects (χ2 = 199 and 109, respectively). There is a very strong
Note the agreement between the lower right-hand panel of Figure 12.3 with
Figure 12.5. This results from our use of similar flexibility in the parametric
and nonparametric approaches (and similar effective degrees of freedom). The
estimated effect of sibsp as a function of age is shown in Figure 12.6.
ggplot(Predict(f, sibsp, age=c(10,15,20,50), conf.int=FALSE))
# Figure 12.6
Note that children having many siblings apparently had lower survival. Mar-
ried adults had slightly higher survival than unmarried ones.
There will never be another Titanic, so we do not need to validate the
model for prospective use. But we use the bootstrap to validate the model
anyway, in an effort to detect whether it is overfitting the data. We do not
penalize the calculations that follow for having examined the effect of parch or
[Figure 12.5: fitted probability of survival versus age, by sex and passenger class.]
Fig. 12.6 Effect of number of siblings and spouses on the log odds of surviving, for
third class males
for testing three-way interactions, in the belief that these tests would replicate
well.
f ← update (f , x = TRUE , y = TRUE )
# x = TRUE , y = TRUE adds raw data to fit object so can bootstrap
set.seed (131) # so can replicate re-samples
latex ( validate (f , B =200) , digits =2 , size = ' Ssize ' )
[Figure: bootstrap overfitting-corrected calibration curve for the casewise-deletion model, with apparent, bias-corrected, and ideal lines.]
The first step to dealing with missing data is understanding the patterns
of missing values. To do this we use the Hmisc library’s naclus and naplot
functions, and the recursive partitioning library of Atkinson and Therneau.
Below naclus tells us which variables tend to be missing on the same persons,
and it computes the proportion of missing values for each variable. The rpart
function derives a tree to predict which types of passengers tended to have
age missing.
na.patterns ← naclus ( titanic3 )
require ( rpart ) # Recursive partitioning package
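The tree itself can be grown with a call along the following lines (a sketch; the exact predictors and minimum node size used for the figure are assumptions):

who.na ← rpart(is.na(age) ∼ sex + pclass + survived +
               sibsp + parch, data=t3, minbucket=15)
plot(who.na); text(who.na)   # draw and label the tree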
We see in Figure 12.8 that age tends to be missing on the same passengers
as the body bag identifier, and that it is missing in only 0.09 of first or sec-
ond class passengers. The category of passengers having the highest fraction
of missing ages is third class passengers having no parents or children on
board. Below we use Hmisc’s summary.formula function to plot simple descrip-
tive statistics on the fraction of missing ages, stratified by other variables. We
see that without adjusting for other variables, age is slightly more missing on
nonsurviving passengers.
plot ( summary ( is.na ( age ) ∼ sex + pclass + survived +
sibsp + parch , data = t3 )) # Figure 12.9
Fig. 12.8 Patterns of missing data. Upper left panel shows the fraction of observa-
tions missing on each predictor. Lower panel depicts a hierarchical cluster analysis of
missingness combinations. The similarity measure shown on the Y -axis is the frac-
tion of observations for which both variables are missing. Right panel shows the result
of recursive partitioning for predicting is.na(age). The rpart function found only
strong patterns according to passenger class.
Fortunately, after controlling for other variables, Table 12.3 provides evi-
dence that nonsurviving passengers are no more likely to have age missing.
The only important predictors of missingness are pclass and parch (the more
parents or children the passenger has on board, the less likely age was to be
missing).
χ2 d.f. P
sex (Factor+Higher Order Factors) 5.61 3 0.1324
All Interactions 5.58 2 0.0614
pclass (Factor+Higher Order Factors) 68.43 4 < 0.0001
All Interactions 5.58 2 0.0614
survived 0.98 1 0.3232
sibsp 0.35 1 0.5548
parch 7.92 1 0.0049
sex × pclass (Factor+Higher Order Factors) 5.58 2 0.0614
TOTAL 82.90 8 < 0.0001
mi

n: 1309   p: 6   Imputations: 20   nk: 4

Number of NAs:
     age      sex   pclass    sibsp    parch survived
     263        0        0        0        0        0

         type d.f.
age         s    1
sex         c    1
pclass      c    2
sibsp       s    2
parch       s    2
survived    l    1
[ ,1] [ ,2] [ ,3] [ ,4] [ ,5] [ ,6] [ ,7] [ ,8] [ ,9] [ ,10]
16 40 49 24 29 60.0 58 64 36 50 61
38 33 45 40 49 80.0 2 38 38 36 53
41 29 24 19 31 40.0 60 64 42 30 65
47 40 42 29 48 36.0 46 64 30 38 42
60 52 40 22 31 38.0 22 19 24 40 33
70 16 14 23 23 18.0 24 19 27 59 23
71 30 62 57 30 42.0 31 64 40 40 63
75 43 23 36 61 45.5 58 64 27 24 50
81 44 57 47 31 45.0 30 64 62 39 67
107 52 18 24 62 32.5 38 64 47 19 23
plot ( mi )
Ecdf ( t3 $ age , add = TRUE , col = ' gray ' , lwd =2 ,
subtitles = FALSE ) # Fig. 12.10
Fig. 12.10 Distributions of imputed and actual ages for the Titanic dataset. Imputed
values are in black and actual ages in gray.
We now fit logistic models for the 20 completed datasets. The fit.mult.impute function fits the 20 models and examines the within- and between-imputation variances to compute an imputation-corrected variance-covariance matrix that is stored in the fit object f.mi. fit.mult.impute will also average the 20 β̂ vectors, storing the result in f.mi$coefficients. The function also prints the ratio of imputation-corrected variances to average ordinary variances.
f.mi ← fit.mult.impute(
         survived ∼ (sex + pclass + rcs(age,5))^2 +
           rcs(age,5) * sibsp,
         lrm, mi, data=t3)
The Wald χ2 for age is reduced by accounting for imputation but is in-
creased (by a lesser amount) by using patterns of association with survival
status to impute missing age. The Wald tests are all adjusted for multiple im-
putation. Now examine the fitted age relationship using multiple imputation
vs. casewise deletion.
p1 ← Predict (f , age , pclass , sex , sibsp =0 , fun = plogis )
p2 ← Predict ( f.mi , age , pclass , sex , sibsp =0 , fun = plogis )
p ← rbind ( ' Casewise Deletion ' = p1 , ' Multiple Imputation ' = p2 )
ggplot (p , groups = ' sex ' , ylab = ' Probability of Surviving ' )
# Figure 12.11
12.6 Summarizing the Fitted Model

In this section we depict the model fitted using multiple imputation, by computing odds ratios and by showing various predicted values. For age, the odds
ratio for an increase from 1 year old to 30 years old is computed, instead of
the default odds ratio based on outer quartiles of age. The estimated odds
Fig. 12.11 Predicted probability of survival for males from fit using casewise deletion
again (top) and multiple random draw imputation (bottom). Both sets of predictions
are for sibsp=0.
ratios are very dependent on the levels of interacting factors, so Figure 12.12
depicts only one of many patterns.
# Get predicted values for certain types of passengers
s ← summary ( f.mi , age = c (1 ,30) , sibsp =0:1)
# override default ranges for 3 variables
plot (s , log = TRUE , main = ' ' ) # Figure 12.12
[Figure 12.12: odds ratios with confidence bars for age (30:1), sibsp (1:0), sex (female:male), pclass (1st:3rd), and pclass (2nd:3rd).]
We can also get predicted values by creating an R function that will evaluate
the model on demand.
pred.logit ← function(sex = "male", pclass = "3rd", age = 28, sibsp = 0)
{
  3.2427671 - 0.95431809 * (sex == "male") +
    5.4086505 * (pclass == "2nd") - 1.3378623 * (pclass == "3rd") +
    0.091162649 * age - 0.00031204327 * pmax(age - 6, 0)^3 +
    0.0021750413 * pmax(age - 21, 0)^3 -
    0.0027627032 * pmax(age - 27, 0)^3 +
    0.0009805137 * pmax(age - 36, 0)^3 -
    8.0808484e-05 * pmax(age - 55.8, 0)^3 - 1.1567976 * sibsp +
    (sex == "male") * (-0.46061284 * (pclass == "2nd") +
                        2.0406523 * (pclass == "3rd")) +
    (sex == "male") * (-0.22469066 * age +
      0.00043708296 * pmax(age - 6, 0)^3 -
      0.0026505136 * pmax(age - 21, 0)^3 +
      0.0031201404 * pmax(age - 27, 0)^3 -
      0.00097923749 * pmax(age - 36, 0)^3 +
      7.2527708e-05 * pmax(age - 55.8, 0)^3) +
    (pclass == "2nd") * (-0.46144083 * age +
      0.00070194849 * pmax(age - 6, 0)^3 -
      0.0034726662 * pmax(age - 21, 0)^3 +
      0.0035255387 * pmax(age - 27, 0)^3 -
      0.0007900891 * pmax(age - 36, 0)^3 +
      3.5268151e-05 * pmax(age - 55.8, 0)^3) +
    (pclass == "3rd") * (-0.17513289 * age +
      0.00035283358 * pmax(age - 6, 0)^3 -
      0.0023049372 * pmax(age - 21, 0)^3 +
      0.0028978962 * pmax(age - 27, 0)^3 -
      0.00105145 * pmax(age - 36, 0)^3 +
      0.00010565735 * pmax(age - 55.8, 0)^3) +
    sibsp * (0.040830773 * age -
      1.5627772e-05 * pmax(age - 6, 0)^3 +
      0.00012790256 * pmax(age - 21, 0)^3 -
      0.00025039385 * pmax(age - 27, 0)^3 +
      0.00017871701 * pmax(age - 36, 0)^3 -
      4.0597949e-05 * pmax(age - 55.8, 0)^3)
}
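The generated function returns the linear predictor (log odds), so predicted probabilities are obtained by applying the inverse logit transformation; for instance, for a hypothetical two-year-old first-class female passenger:

plogis(pred.logit(sex='female', pclass='1st', age=2, sibsp=0))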
Chapter 13
Ordinal Logistic Regression

13.1 Background
A basic assumption of all commonly used ordinal regression models is that the
response variable behaves in an ordinal fashion with respect to each predictor.
Assuming that a predictor X is linearly related to the log odds of some
appropriate event, a simple way to check for ordinality is to plot the mean
of X stratified by levels of Y . These means should be in a consistent order.
If for many of the Xs, two adjacent categories of Y do not distinguish the
means, that is evidence that those levels of Y should be pooled.
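This check is automated by the rms plot.xmean.ordinaly function described in Section 13.3.9. A minimal sketch, with a hypothetical ordinal response y and predictors x1 and x2 in a data frame d:

plot.xmean.ordinaly(y ∼ x1 + x2, data=d)
# plots simple Y-stratified means of each predictor,
# overlaid with model-based estimates of E(X|Y = j)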
One can also estimate the mean or expected value of X|Y = j (E(X|Y =
j)) given that the ordinal model assumptions hold. This is a useful tool for
checking those assumptions, at least in an unadjusted fashion. For simplicity,
assume that X is discrete, and let Pjx = Pr(Y = j|X = x) be the probability
that Y = j given X = x that is dictated from the model being fitted, with
X being the only predictor in the model. Then

Ê(X|Y = j) = Σx x P̂jx fx / gj,

where P̂jx denotes the estimate of Pjx from the fitted one-predictor model (for inner values of Y in the PO models, these probabilities are differences between terms given by Equation 13.4 below), fx is the frequency of X = x in the sample of size n, and gj is the frequency of Y = j in the sample. This estimate can be computed conveniently without grouping the data by X. For n subjects let the n values of X be x1, x2, . . . , xn. Then

Ê(X|Y = j) = (1/gj) Σⁿᵢ₌₁ xᵢ P̂jxᵢ.   (13.3)
13.3 Proportional Odds Model

13.3.1 Model
The most commonly used ordinal logistic model was described in Walker
and Duncan647 and later called the proportional odds (PO) model by Mc-
Cullagh.449 The PO model is best stated as follows, for a response variable
having levels 0, 1, 2, . . . , k:
Pr[Y ≥ j|X] = 1 / (1 + exp[−(αj + Xβ)]),   (13.4)
13.3.3 Estimation
13.3.4 Residuals
where Xi stands for a vector of predictors for subject i. Then, after forming
an indicator variable for the event currently being predicted ([Yi ≥ j]), one
computes the score (first derivative) components Uim from an ordinary binary
logistic model:
Uim = Xim ([Yi ≥ j] − P̂ij ), (13.6)
for the subject i and predictor m. Then, for each column of U , plot the mean
Ū·m and confidence limits, with Y (i.e., j) on the x-axis. For each predictor
the trend against j should be flat if PO holds.a In binary logistic regression,
partial residuals are very useful as they allow the analyst to fit linear effects
a If β̂ were derived from separate binary fits, all Ū·m ≡ 0.
for all the predictors but then to nonparametrically estimate the true trans-
formation that each predictor requires (Section 10.4). The partial residual is
defined as follows, for the ith subject and mth predictor variable.115, 373
rim = β̂m Xim + (Yi − P̂i) / [P̂i(1 − P̂i)],   (13.7)

where

P̂i = 1 / (1 + exp[−(α + Xi β̂)]).   (13.8)
A smoothed plot (e.g., using the moving linear regression algorithm in
loess111 ) of Xim against rim provides a nonparametric estimate of how Xm
relates to the log relative odds that Y = 1|Xm . For ordinal Y , we just need
to compute binary model partial residuals for all cutoffs j:
rim = β̂m Xim + ([Yi ≥ j] − P̂ij) / [P̂ij(1 − P̂ij)],   (13.9)
then to make a plot for each m showing smoothed partial residual curves for
all j, looking for similar shapes and slopes for a given predictor for all j. Each
curve provides an estimate of how Xm relates to the relative log odds that
Y ≥ j. Since partial residuals allow examination of predictor transformations
(linearity) while simultaneously allowing examination of PO (parallelism),
partial residual plots are generally preferred over score residual plots for or-
dinal models.
Li and Shepherd402 have a residual for ordinal models that serves for the
entire range of Y without the need to consider cutoffs. Their residual is use-
ful for checking functional form of predictors but not the proportional odds
assumption.
Peterson and Harrell502 developed score and likelihood ratio tests for testing
the PO assumption. The score test is used in the SAS PROC LOGISTIC,540
but its extreme anti-conservatism in many cases can make it unreliable.502 4
For determining whether the PO assumption is likely to be satisfied for
each predictor separately, there are several graphics that are useful. One is the
graph comparing means of X|Y with and without assuming PO, as described
in Section 13.2 (see Figure 14.2 for an example). Another is the simple method
of stratifying on each predictor and computing the logits of all proportions of
the form Y ≥ j, j = 1, 2, . . . , k. When proportional odds holds, the differences
in logits between different values of j should be the same at all levels of X,
getHdata(support)
sfdm ← as.integer(support$sfdm2) - 1
sf ← function(y)
  c('Y ≥ 1' = qlogis(mean(y ≥ 1)),
    'Y ≥ 2' = qlogis(mean(y ≥ 2)),
    'Y ≥ 3' = qlogis(mean(y ≥ 3)))
s ← summary(sfdm ∼ adlsc + sex + age + meanbp, fun=sf,
            data=support)
plot(s, which=1:3, pch=1:3, xlab='logit', vnames='names',
     main='', width.factor=1.5)   # Figure 13.1
Fig. 13.1 Checking PO assumption separately for a series of predictors. The circle,
triangle, and plus sign correspond to Y ≥ 1, 2, 3, respectively. PO is checked by
examining the vertical constancy of distances between any two of these three symbols.
Response variable is the severe functional disability scale sfdm2 from the 1000-patient
SUPPORT dataset, with the last two categories combined because of low frequency
of coma/intubation.
[Figure 13.2: logit(Fn(x)) (left panel) and Φ⁻¹(Fn(x)) (right panel) for three groups: small, medium, large.]
One could conclude the right panel of Figure 13.2 displays more parallelism
than the left panel displays linearity, so the assumptions of the PO model are
better satisfied than the assumptions of the ordinary linear model.
Chapter 14 has many examples of graphics for assessing fit of PO models.
Regarding assessment of linearity and additivity assumptions, splines, partial
residual plots, and interaction tests are among the best tools. Fagerland and
Hosmer182 have a good review of goodness-of-fit tests for the PO model.
The R²N coefficient is really computed from the model LR χ² (χ² added to a model containing only the k intercept parameters) to describe the model's
predictive power. The Somers’ Dxy rank correlation between X β̂ and Y is
an easily interpreted measure of predictive discrimination. Since it is a rank
measure, it does not matter which intercept α is used in the calculation.
The probability of concordance, c, is also a useful measure. Here one takes all
possible pairs of subjects having differing Y values and computes the fraction
of such pairs for which the values of X β̂ are in the same direction as the two
Y values. c could be called a generalized ROC area in this setting. As before,
Dxy = 2(c − 0.5). Note that Dxy , c, and the Brier score B can easily be
computed for various dichotomizations of Y , to investigate predictive ability
in more detail.
The PO model is validated much the same way as the binary logistic model
(see Section 10.9). For estimating an overfitting-corrected calibration curve
(Section 10.11) one estimates Pr(Y ≥ j|X) using one j at a time.
b If Y does not have very many levels, the median will be a discontinuous function of X and may not be satisfactory.
13.3.9 R Functions
The rms package’s lrm and orm functions fit the PO model directly, assuming
that the levels of the response variable (e.g., the levels of a factor variable)
are listed in the proper order. lrm is intended to be used for the case where the number of unique values of Y is less than a few dozen, whereas orm handles
the continuous Y case efficiently, as well as allowing for links other than the
logit. See Chapter 15 for more information.
If the response is numeric, lrm assumes the numeric codes properly order
the responses. If it is a character vector and is not a factor, lrm assumes the
correct ordering is alphabetic. Of course ordered variables in R are appropriate
response variables for ordinal regression. The predict function (predict.lrm)
can compute all the quantities listed in Section 13.3.7 except for quantiles.
The R functions popower and posamsize (in the Hmisc package) compute
power and sample size estimates for ordinal responses using the proportional
odds model.
The function plot.xmean.ordinaly in rms computes and graphs the quanti-
ties described in Section 13.2. It plots simple Y -stratified means overlaid with
Ê(X|Y = j), with j on the x-axis. The Ês are computed for both PO and con-
tinuation ratio ordinal logistic models. The Hmisc package’s summary.formula
function is also useful for assessing the PO assumption (Figure 13.1). Generic
rms functions such as validate, calibrate, and nomogram work with PO model
fits from lrm as long as the analyst specifies which intercept(s) to use. rms has
a special function generator Mean for constructing an easy-to-use function for
getting the predicted mean Y from a PO model. This is handy with plot and
nomogram. If the fit has been run through the bootcov function, it is easy to
use the Predict function to estimate bootstrap confidence limits for predicted
means.
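A minimal sketch of the Mean generator, assuming a fitted PO model f for a numeric ordinal response and a hypothetical predictor x1:

M ← Mean(f)            # builds a function computing E(Y|X) from X β̂
Predict(f, x1, fun=M)  # predicted mean Y as x1 varies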
13.4 Continuation Ratio Model

13.4.1 Model

The CR model is

Pr(Y = j|Y ≥ j, X) = 1 / (1 + exp[−(θj + Xγ)]),

that is,

logit(Y = 0|Y ≥ 0, X) = logit(Y = 0|X) = θ0 + Xγ   (13.10)
logit(Y = 1|Y ≥ 1, X) = θ1 + Xγ
. . .
logit(Y = k − 1|Y ≥ k − 1, X) = θk−1 + Xγ.
The CR model has been said to be likely to fit ordinal responses when subjects
have to “pass through” one category to get to the next. The CR model is a
discrete version of the Cox proportional hazards model. The discrete hazard
function is defined as Pr(Y = j|Y ≥ j).
13.4.3 Estimation
Armstrong and Sloan31 and Berridge and Whitehead52 showed how the CR
model can be fitted using an ordinary binary logistic model likelihood func-
tion, after certain rows of the X matrix are duplicated and a new binary Y
vector is constructed. For each subject, one constructs separate records by
considering successive conditions Y ≥ 0, Y ≥ 1, . . . , Y ≥ k − 1 for a response
variable with values 0, 1, . . . , k. The binary response for each applicable con-
dition or “cohort” is set to 1 if the subject failed at the current “cohort” or
“risk set,” that is, if Y = j where the cohort being considered is Y ≥ j. The
constructed cohort variable is carried along with the new X and Y . This vari-
able is considered to be categorical and its coefficients are fitted by adding
k − 1 dummy variables to the binary logistic model. For ease of computation,
the CR model is restated as follows, with the first cohort used as the reference
cell.
Pr(Y = j|Y ≥ j, X) = 1 / (1 + exp[−(α + θj + Xγ)]).   (13.11)
Here α is an overall intercept, θ0 ≡ 0, and θ1 , . . . , θk−1 are increments from α.
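The rms cr.setup function carries out this data expansion. A minimal sketch, with a hypothetical ordinal response Y and predictors x1 and x2 in a data frame d:

u  ← cr.setup(Y)       # expanded binary response and cohort indicators
d2 ← d[u$subs, ]       # duplicate predictor rows as needed
d2$y      ← u$y        # constructed binary response
d2$cohort ← u$cohort   # 'risk set' for each expanded record
g ← lrm(y ∼ cohort + x1 + x2, data=d2)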
13.4.4 Residuals
The partial residual plots just described are very useful for checking the
constant slope assumption of the CR model. The next section shows how to
test this assumption formally. Linearity can be assessed visually using the
smoothed partial residual plot, and interactions between predictors can be
tested as usual.
The PO model has been extended by Peterson and Harrell502 to allow for
unequal slopes for some or all of the Xs for some or all levels of Y . This partial
PO model requires specialized software. The CR model can be extended more
easily. In R notation, the ordinary CR model is specified as 5
y ∼ cohort + X1 + X2 + X3 + ...
The extended CR model is a discrete version of the Cox survival model with
time-dependent covariables.
There is nothing about the CR model that makes it fit a given dataset
better than other ordinal models such as the PO model. The real benefit of
the CR model is that using standard binary logistic model software one can
flexibly specify how the equal-slopes assumption can be relaxed.
the binary logistic likelihood trick. That is, sampling should be done with re-
placement from subjects rather than records. Second, the analyst must isolate
which event to predict. This is because when observations are expanded in
order to use a binary logistic likelihood function to fit the CR model, several
different events are being predicted simultaneously. Somers’ Dxy could be
computed by relating X γ̂ (ignoring intercepts) to the ordinal Y , but other
indexes are not defined so easily. The simplest approach here would be to
validate a single prediction for Pr(Y = j|Y ≥ j, X), for example. The sim-
plest event to predict is Pr(Y = 0|X), as this would just require subsetting
on all observations in the first cohort level in the validation sample. It would
also be easy to validate any one of the later conditional probabilities. The
validation functions described in the next section allow for such subsetting,
as well as handling the cluster sampling. Specialized calculations would be
needed to validate an unconditional probability such as Pr(Y ≥ 2|X).
13.4.9 R Functions
Since the lrm and pentrace functions have the capability to penalize dif-
ferent parts of the model by different amounts, they are valuable for fitting
extended CR models in which the cohort × predictor interactions are allowed
to be only as important as the information content in the data will support.
Simple main effects can be unpenalized or slightly penalized as desired.
The validate and calibrate functions for lrm allow specification of sub-
ject identifiers when using the bootstrap, so the samples can be constructed
with replacement from the original subjects. In other words, cluster sam-
pling is done from the expanded records. This is handled internally by the
predab.resample function. These functions also allow one to specify a subset of
the records to use in the validation, which makes it especially easy to validate
the part of the model used to predict Pr(Y = 0|X).
The plot.xmean.ordinaly function is useful for checking the CR assumption for single predictors, as described earlier.6
1 See 5, 25, 26, 31, 32, 52, 63, 64, 113, 126, 240, 245, 276, 354, 449, 502, 561, 664, 679 for some excellent background references, applications, and extensions to the ordinal models. References 663 and 428 demonstrate how to model ordinal outcomes with repeated measurements within subject using random effects in Bayesian models. The first to develop an ordinal regression model were Aitchison and Silvey.8
2 Some analysts feel that combining categories improves the performance of test
statistics when fitting PO models when sample sizes are small and cells are
sparse. Murad et al.469 demonstrated that this causes more problems, because
it results in overly conservative Wald tests.
3 Anderson and Philips [26, p. 29] proposed methods for constructing properly
spaced response values given a fitted PO model.
4 The simplest demonstration of this is to consider a model in which there is a
single predictor that is totally independent of a nine-level response Y , so PO
must hold. A PO model is fitted in SAS using:
DATA test;
  DO i=1 TO 50;
    y=FLOOR(RANUNI(151)*9);
    x=RANNOR(5);
    OUTPUT;
  END;
PROC LOGISTIC; MODEL y=x;
The score test for PO was χ2 = 56 on 7 d.f., P < 0.0001. This problem results
from some small cell sizes in the distribution of Y .502 The P -value for testing
the regression effect for X was 0.76.
5 The R glmnetcr package by Kellie Archer provides a different way to fit con-
tinuation ratio models.
6 Bender and Benner48 have some examples using the precursor of the rms package
for fitting and assessing the goodness of fit of ordinal logistic regression models.
13.6 Problems
Test for the association between disease group and total hospital cost in
SUPPORT, without imputing any missing costs (exclude the one patient
having zero cost).
1. Use the Kruskal–Wallis rank test.
2. Use the proportional odds ordinal logistic model generalization of the Wilcoxon–Mann–Whitney–Kruskal–Wallis–Spearman test. Group total cost into 20 quantile groups so that only 19 intercepts will need to be in the model, rather than one less than the number of subjects (which would make the model far too slow to fit). Use the likelihood ratio χ2 for this and later steps.
3. Use a binary logistic model to test for association between disease group
and whether total cost exceeds the median of total cost. In other words,
group total cost into two quantile groups and use this binary variable as
the response. What is wrong with this approach?
4. Instead of using only two cost groups, group cost into 3, 4, 5, 6, 8, 10,
and 12 quantile groups. Describe the relationship between the number of
intervals used to approximate the continuous response variable and the
efficiency of the analysis. How many intervals of total cost, assuming that
the ordering of the different intervals is used in the analysis, are required
to avoid losing significant information in this continuous variable?
5. If you were selecting one of the rank-based tests for testing the association between disease and cost, which of the tests considered would you choose?
6. Why do all of the tests you did have the same number of degrees of freedom
for the hypothesis of no association between dzgroup and totcst?
7. What is the advantage of a rank-based test over a parametric test based
on log(cost)?
8. Show that for a two-sample problem, the numerator of the score test for
comparing the two groups using a proportional odds model is exactly the
numerator of the Wilcoxon-Mann-Whitney two-sample rank-sum test.
Chapter 14
Case Study in Ordinal Regression,
Data Reduction, and Penalization
This case study is taken from Harrell et al.272 which described a World Health
Organization study439 in which vital signs and a large number of clinical
signs and symptoms were used to develop a predictive model for an ordinal
response. This response consists of laboratory assessments of diagnosis and
severity of illness related to pneumonia, meningitis, and sepsis. Much of the
modeling strategy given in Chapter 4 was used to develop the model, with ad-
ditional emphasis on penalized maximum likelihood estimation (Section 9.10).
The following laboratory data are used in the response: cerebrospinal fluid
(CSF) culture from a lumbar puncture (LP), blood culture (BC), arterial
oxygen saturation (SaO2 , a measure of lung dysfunction), and chest X-ray
(CXR). The sample consisted of 4552 infants aged 90 days or less.
This case study covers these topics:
1. definition of the ordinal response (Section 14.1);
2. scoring and clustering of clinical signs (Section 14.2);
3. testing adequacy of weights specified by subject-matter specialists and
assessing the utility of various scoring schemes using a tentative ordinal
logistic model (Section 14.3);
4. assessing the basic ordinality assumptions and examining the propor-
tional odds and continuation ratio (PO and CR) assumptions separately
for each predictor (Section 14.4);
5. deriving a tentative PO model using cluster scores and regression splines
(Section 14.5);
6. using residual plots to check PO, CR, and linearity assumptions (Sec-
tion 14.6);
7. examining the fit of a CR model (Section 14.7);
8. utilizing an extended CR model to allow some or all of the regression
coefficients to vary with cutoffs of the response level as well as to provide
formal tests of constant slopes (Section 14.8);
[Table 14.1, defining the ordinal response, appears here. Its footnotes: a SaO2 was measured but CXR was not done. b Assumed zero since neither BC nor LP was done.]
Forty-seven clinical signs were collected for each infant. Most questionnaire
items were scored as a single variable using equally spaced codes, with 0 to
3 representing, for example, sign not present, mild, moderate, severe. The
resulting list of clinical signs with their abbreviations is given in Table 14.2.
The signs are organized into clusters as discussed later.
[Figure 14.1 appears here: hierarchical clustering of the 47 clinical signs, with similarity measured by the squared Spearman rank correlation ρ².]
Here, hx stands for history, ausc for auscultation, and hxprob for history of
problems. Two signs (qcr, hcm) were listed twice since they were later placed
into two clusters each.
Next, hierarchical clustering was done using the matrix of squared Spear-
man rank correlation coefficients as the similarity matrix. The varclus R
function was used as follows.
require ( rms )
The output appears in Figure 14.1. This output served as a starting point
for clinicians to use in constructing more meaningful clinical clusters. The
clusters in Table 14.2 were the consensus of the clinicians who were the in-
vestigators in the WHO study. Prior subject matter knowledge plays a key
role at this stage in the analysis.
The clusters listed in Table 14.2 were first scored by the first principal component of transcan-transformed signs, denoted by PC1. Knowing that the
resulting weights may be too complex for clinical use, the primary reasons
for analyzing the principal components were to see if some of the clusters
could be removed from consideration so that the clinicians would not spend
time developing scoring rules for them. Let us “peek” at Y to assist in scoring
clusters at this point, but to do so in a very structured way that does not
involve the examination of a large number of individual coefficients.
To judge any cluster scoring scheme, we must pick a tentative outcome model. For this purpose we chose the PO model. By using the 14 PC1s corresponding to the 14 clusters, the fitted PO model had a likelihood ratio (LR) χ2 of 1155 with 14 d.f., and the predictive discrimination of the clusters was quantified by a Somers' Dxy rank correlation between X β̂ and Y of 0.596. The following clusters were not statistically important predictors, and we assumed that the lack of importance of their PC1s in predicting Y (adjusted for the other PC1s) justified a conclusion that no sign within those clusters was clinically important in predicting Y : hydration, hxprob, pustular, crying, fever.ill, stop.breath, labor. This list was identified using a backward step-down procedure on the full model. The total Wald χ2 for these seven PC1s was 22.4 (P = 0.002). The reduced model had LR χ2 = 1133 with 7 d.f., Dxy = 0.591. The bootstrap validation in Section 14.11 penalizes for examining all candidate predictors.
The clinicians were asked to rank the clinical severity of signs within each
potentially important cluster. During this step, the clinicians also ranked
severity levels of some of the component signs, and some cluster scores were
simplified, especially when the signs within a cluster occurred infrequently.
The clinicians also assessed whether the severity points or weights should be
equally spaced, assigning unequally spaced weights for one cluster (agitated).
The resulting rankings and sign combinations are shown in Table 14.3. The
signs or sign combinations separated by a comma are treated as separate
categories, whereas some signs were unioned (“or”–ed) when the clinicians
deemed them equally important. As an example, if an additive cluster score
was to be used for drowsy, the scorings would be 0 = none present, 1 = hcl,
2 = qcr>0, 3 = csd>0 or slpm or wake, 4 = aro>0, 5 = mvm>0 and the scores
would be added.
This table reflects some data reduction already (unioning some signs and
selection of levels of ordinal signs) but more reduction is needed. Even after
signs are ranked within a cluster, there are various ways of assigning the clus-
ter scores. We investigated seven methods. We started with the purely statistical approach of using PC1 to summarize each cluster. Second, all sign combina-
tions within a cluster were unioned to represent a 0/1 cluster score. Third,
only sign combinations thought by the clinicians to be severe were unioned,
resulting in drowsy=aro>0 or mvm>0, agitated=csa or con=2, reffort=lcw>1 or
gru>0 or ccy, ausc=crs>0, and feeding=absu>0 or afe>0. For clusters that are
not scored 0/1 in Table 14.3, the fourth summarization method was a hi-
erarchical one that used the weight of the worst applicable category as the
cluster score. For example, if aro=1 but mvm=0, drowsy would be scored as 4.
The fifth method counted the number of positive signs in the cluster. The
sixth method summed the weights of all signs or sign combinations present.
Finally, the worst sign combination present was again used as in the fourth method, but the points assigned to the category were data-driven ones
obtained by using extra dummy variables. This provided an assessment of
the adequacy of the clinician-specified weights. By comparing rows 4 and 7
in Table 14.4 we see that response data-driven sign weights have a slightly
worse AIC, indicating that the number of extra β parameters estimated was
not justified by the improvement in χ2 . The hierarchical method, using the
clinicians’ weights, performed quite well. The only cluster with inadequate
clinician weights was ausc—see below. The PC1 method, without any guidance, performed well, as in reference 268. The only reasons not to use it are that it requires a coefficient for every sign in the cluster and that the coefficients are not translatable into simple scores such as 0, 1, . . . . A sketch of the hierarchical scoring method appears below.
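To illustrate the fourth (hierarchical, worst-applicable-category) scoring method, here is a minimal sketch assuming hypothetical 0/1 or ordinal codings for the drowsy signs in a data frame d:

drowsy ← with ( d , pmax (1 * ( hcl > 0) ,
                          2 * ( qcr > 0) ,
                          3 * ( csd > 0 | slpm > 0 | wake > 0) ,
                          4 * ( aro > 0) ,
                          5 * ( mvm > 0)))
# e.g., aro=1 with mvm=0 yields max(..., 4, 0) = 4, matching the text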
Representation of clusters by a simple union of selected signs or of all signs
is inadequate, but otherwise the choice of methods is not very important in
terms of explaining variation in Y . We chose the fourth method, a hierar-
chical severity point assignment (using weights that were prespecified by the
clinicians), for its ease of use and of handling missing component variables
(in most cases) and potential for speeding up the clinical exam (examining
to detect more important signs first). Because of what was learned regard-
ing the relationship between ausc and Y , we modified the ausc cluster score
by redefining it as ausc=crs>0 (crepitations present). Note that neither the
“tweaking” of ausc nor the examination of the seven scoring methods dis-
played in Table 14.4 is taken into account in the model validation.
Section 13.2 described a graphical method for assessing the ordinality as-
sumption for Y separately with respect to each X, and for assessing PO and
CR assumptions individually. Figure 14.2 is an example of such displays. For
this dataset we expect strongly nonlinear effects for temp, rr, and hrat, so for
those predictors we plot the mean absolute differences from suitable “normal”
values as an approximate solution.
Sc ← transform ( Sc ,
                 ausc      = 1 * ( ausc == 3) ,
                 bul.conv  = 1 * ( bul.conv == 'TRUE' ) ,
                 abdominal = 1 * ( abdominal == 'TRUE' ))
plot.xmean.ordinaly ( Y ∼ age + abs ( temp - 37 ) + abs ( rr - 60 ) +
                      abs ( hrat - 125 ) + waz + bul.conv + drowsy +
                      agitated + reffort + ausc + feeding +
                      abdominal , data = Sc , cr = TRUE ,
                      subn = FALSE , cex.points = .65 )   # Figure 14.2
The plot is shown in Figure 14.2. Y does not seem to operate in an ordinal fashion with respect to age, |rr − 60|, or ausc. For the other variables, ordinality holds and PO holds reasonably well. For heart rate, the PO assumption appears to be satisfied perfectly. CR model assumptions appear to be more tenuous than PO assumptions when one variable at a time is fitted.
Based on what was determined in Section 14.3, the original list of 47 signs
was reduced to seven predictors: two unions of signs (bul.conv, abdominal),
one single sign (ausc), and four “worst category” point assignments (drowsy,
agitated, reffort, feeding). Seven clusters were dropped for the time being
because of weak associations with Y . Such a limited use of variable selection
reduces the severe problems inherent with that technique.
Fig. 14.2 Examination of the ordinality of Y for each predictor by assessing how varying levels of Y relate to the mean X, and whether the trend is monotonic. Solid lines connect the simple stratified means, and dashed lines connect the estimated expected value of X|Y = j given that PO holds. Estimated expected values from the CR model are marked with Cs.
At this point in model development we add to the model age and the vital signs temp (temperature), rr (respiratory rate), hrat (heart rate), and waz, the weight-for-age Z-score. Since age was expected to modify the interpretation of temp, rr, and hrat, and interactions between continuous variables would be difficult to use in the field, we categorized age into three intervals: 0–6 days (n = 302), 7–59 days (n = 3042), and 60–90 days (n = 1208).a
Sc$ageg ← cut2 ( Sc$age , c (7 , 60))
The new variables temp, rr, hrat, waz were missing in, respectively, n =
13, 11, 147, and 20 infants. Since the three vital sign variables are somewhat
correlated with each other, customized single imputation models were de-
veloped to impute all the missing values without assuming linearity or even
monotonicity of any of the regressions.
a
These age intervals were also found to adequately capture most of the interaction
effects.
R2:
 temp  hrat    rr
0.168 0.160 0.066

Adjusted R2:
 temp  hrat    rr
0.167 0.159 0.064
Sc ← transform ( Sc ,
                 temp = impute ( vsign.trans , temp ) ,
                 hrat = impute ( vsign.trans , hrat ) ,
                 rr   = impute ( vsign.trans , rr ))
The bottom four lines of the table are the most important. First, there is
strong evidence that some associations with Y exist (45 d.f. test) and very
strong evidence of nonlinearity in one of the vital signs or in waz (26 d.f. test).
There is moderately strong evidence for an interaction effect somewhere in the
model (22 d.f. test). We see that the grouped age variable ageg is predictive
of Y , but mainly as an effect modifier for rr and hrat. temp is extremely nonlinear, and rr is moderately so. hrat, a difficult variable to measure reliably in young infants, is perhaps not important enough (χ2 = 19 on 9 d.f.) to keep in the final model.
Section 13.3.4 defined binary logistic score residuals for isolating the PO
assumption in an ordinal model. For the tentative PO model, score residuals
for four of the variables were plotted using
resid ( f1 , 'score.binary' , pl = TRUE , which = c (17 , 18 , 20 , 21))   # Figure 14.3
The result is shown in Figure 14.3. We see strong evidence of non-PO for
ausc and moderate evidence for drowsy and bul.conv, in agreement with
Figure 14.2.
Partial residuals computed separately for each Y -cutoff (Section 13.3.4) are
the most useful residuals for ordinal models as they simultaneously check lin-
earity, find needed transformations, and check PO. In Figure 14.4, smoothed
partial residual plots were obtained for all predictors, after first fitting a sim-
ple model in which every predictor was assumed to operate linearly. Inter-
actions were temporarily ignored and age was used as a continuous variable.
Fig. 14.3 Binary logistic model score residuals for binary events derived from two
cutoffs of the ordinal response Y . Note that the mean residuals, marked with closed
circles, correspond closely to differences between solid and dashed lines at Y = 1, 2
in Figure 14.2. Score residual assessments for spline-expanded variables such as rr
would have required one plot per d.f.
In order to take a first look at the fit of a CR model, let us consider the
two binary events that need to be predicted, and assess linearity and paral-
lelism over Y -cutoffs. Here we fit a sequence of binary fits and then use the
plot.lrm.partial function, which assembles partial residuals for a sequence
of fits and constructs one graph per predictor.
cr0 ← lrm ( Y == 0 ∼ age + temp + rr + hrat + waz +
            bul.conv + drowsy + agitated + reffort + ausc +
            feeding + abdominal , data = Sc , x = TRUE , y = TRUE )
# Use the update function to avoid repeating the model right-
# hand side. An indicator variable for Y=1 is the
# response variable below
cr1 ← update ( cr0 , Y == 1 ∼ . , subset = Y ≥ 1)
plot.lrm.partial ( cr0 , cr1 , center = TRUE )   # Figure 14.5
The output is in Figure 14.5. There is not much more parallelism here than
in Figure 14.4. For the two most important predictors, ausc and rr, there are
strongly differing effects for the different events being predicted (e.g., Y = 0
or Y = 1|Y ≥ 1). As is often the case, there is no one constant-β model that satisfies the assumptions with respect to all predictors simultaneously, especially when there is evidence of non-ordinality, as for ausc in Figure 14.2.
Fig. 14.4 Smoothed partial residuals corresponding to two cutoffs of Y , from a model
in which all predictors were assumed to operate linearly and additively. The smoothed
curves estimate the actual predictor transformations needed, and parallelism relates
to the PO assumption. Solid lines denote Y ≥ 1 while dashed lines denote Y ≥ 2.
Fig. 14.5 loess smoothed partial residual plots for binary models that are compo-
nents of an ordinal continuation ratio model. Solid lines correspond to a model for
Y = 0, and dotted lines correspond to a model for Y = 1|Y ≥ 1.
The CR model will need to be generalized to adequately fit this dataset.
14.8 Extended Continuation Ratio Model
The CR model in its ordinary form has no advantage over the PO model for
this dataset. But Section 13.4.6 discussed how the CR model can easily be
extended to relax any of its assumptions. First we use the cr.setup function
to set up the data for fitting a CR model using the binary logistic trick.
u ← cr.setup ( Y )
Sc.expanded ← Sc [ u$subs , ]
y      ← u$y
cohort ← u$cohort
Here the cohort variable has values 'all' and 'Y>=1', corresponding to the conditioning events in Equation 13.10. Once the data frame is expanded to include the different risk cohorts, vectors such as age are lengthened (to 5553 records).
Now we fit a fully extended CR model that makes no equal-slopes assumption; that is, every predictor is allowed to have a different effect for each conditioning event (cohort). At this point, we omit hrat but add back all variables that were deleted after examining their association with Y . Recall that most of these seven cluster scores were summarized using PC1. Adding back "insignificant" variables will allow us to validate the model fairly using the bootstrap, as well as to obtain confidence intervals that are not falsely narrow.16
full ←
  lrm ( y ∼ cohort * ( ageg * ( rcs ( temp , 5) + rcs ( rr , 5)) +
        rcs ( waz , 4) + bul.conv + drowsy + agitated + reffort +
        ausc + feeding + abdominal + hydration + hxprob +
        pustular + crying + fever.ill + stop.breath + labor ) ,
        data = Sc.expanded , x = TRUE , y = TRUE )
# x=TRUE, y=TRUE are for pentrace, validate, calibrate below
perf ← function ( fit ) {   # model performance for Y=0
  pr ← predict ( fit , type = 'fitted' )[ cohort == 'all' ]
  s  ← round ( somers2 ( pr , y [ cohort == 'all' ]) , 3)
  pr ← 1 - pr   # Predict Prob[Y > 0] instead of Prob[Y = 0]
  f  ← round ( c ( mean ( pr < .05 ) , mean ( pr > .25 ) ,
                   mean ( pr > .5 )) , 2)
  f  ← paste ( f [1] , ', ' , f [2] , ', and ' , f [3] , '.' , sep = '' )
  list ( somers = s , fractions = f )
}
perf.unpen ← perf ( full )
print ( full , latex = TRUE , coefs = 5)
This model has LR χ2 = 1824 with 87 d.f. Wald statistics are in Tables 14.6
and 14.7. The global test of the constant slopes assumption in the CR model
(test of all interactions involving cohort) has Wald χ2 = 172 with 43 d.f.,
P < 0.0001. Consistent with Figure 14.5, the formal tests indicate that ausc
is the biggest violator, followed by waz and rr.
14.9 Penalized Estimation
We know that the CR model must be extended to fit these data adequately. If
the model is fully extended to allow for all cohort × predictor interactions, we
have not gained any precision or power in using an ordinal model over using a
polytomous logistic model. Therefore we seek some restrictions on the model’s
parameters. The lrm and pentrace functions allow for differing λ for shrinking
different types of terms in the model. Here we do a grid search to determine
the optimum penalty for simple main effect (non-interaction) terms and the
penalty for interaction terms, most of which are terms interacting with cohort
Table 14.7 Wald statistics for the continuation ratio model. Interactions with
cohort assess non-proportional hazards
χ2 d.f. P
cohort 199.47 44 < 0.0001
ageg 48.89 36 0.0742
temp 59.37 24 0.0001
rr 93.77 24 < 0.0001
waz 39.69 6 < 0.0001
bul.conv 10.80 2 0.0045
drowsy 15.19 2 0.0005
agitated 13.55 2 0.0011
reffort 51.85 2 < 0.0001
ausc 109.80 2 < 0.0001
feeding 27.47 2 < 0.0001
abdominal 1.78 2 0.4106
hydration 4.47 2 0.1069
hxprob 6.62 2 0.0364
pustular 3.03 2 0.2194
crying 1.55 2 0.4604
fever.ill 3.63 2 0.1630
stop.breath 5.34 2 0.0693
labor 5.35 2 0.0690
ageg × temp 8.18 16 0.9432
ageg × rr 38.11 16 0.0015
cohort × ageg 14.88 18 0.6701
cohort × temp 8.77 12 0.7225
cohort × rr 19.67 12 0.0736
cohort × waz 9.04 3 0.0288
cohort × bul.conv 0.33 1 0.5658
cohort × drowsy 0.57 1 0.4489
cohort × agitated 0.55 1 0.4593
cohort × reffort 2.29 1 0.1298
cohort × ausc 38.11 1 < 0.0001
cohort × feeding 2.48 1 0.1152
cohort × abdominal 0.09 1 0.7696
cohort × hydration 0.53 1 0.4682
cohort × hxprob 2.54 1 0.1109
cohort × pustular 2.40 1 0.1210
cohort × crying 0.39 1 0.5310
cohort × fever.ill 3.17 1 0.0749
cohort × stop.breath 2.99 1 0.0839
cohort × labor 0.05 1 0.8309
cohort × ageg × temp 2.22 8 0.9736
cohort × ageg × rr 10.22 8 0.2500
TOTAL NONLINEAR 93.36 40 < 0.0001
TOTAL INTERACTION 203.10 59 < 0.0001
TOTAL NONLINEAR + INTERACTION 257.70 67 < 0.0001
TOTAL 1211.73 87 < 0.0001
to allow for unequal slopes. The following code uses pentrace on the full
extended CR model fit to find the optimum penalty factors. All combinations
of the simple and interaction λs for which the interaction penalty ≥ the
penalty for the simple parameters are examined.
d ← options ( digits = 4)
pentrace ( full ,
           list ( simple      = c (0 , .025 , .05 , .075 , .1 ) ,
                  interaction = c (0 , 10 , 50 , 100 , 125 , 150)))

Best penalty:

 simple interaction    df
   0.05         125 49.75

options ( d )
We see that shrinkage from 87 d.f. down to 49.75 effective d.f. results in an
improvement in χ2 –scaled AIC of 23. The optimum penalty factors were 0.05
for simple terms and 125 for interaction terms.
Let us now store a penalized version of the full fit, find where the effective d.f. were reduced, and compute χ2 for each factor in the model. We take the effective d.f. for a collection of model parameters to be the sum of the corresponding diagonal elements of the matrix used to define the total effective d.f. in the penalized maximum likelihood framework of Section 9.10.
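The statement storing the penalized fit is not shown here; a sketch assuming the usual rms idiom of updating the unpenalized fit with the optimum penalties found above, then printing the effective d.f.:

full.pen ← update ( full , penalty = list ( simple = .05 , interaction = 125))
effective.df ( full.pen )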
Penalty factors

                           Original Penalized
All                              87     49.75
Simple Terms                     20     19.98
Interaction or Nonlinear         67     29.77
Nonlinear                        40     16.82
Interaction                      59     22.57
Nonlinear Interaction            32      9.62
[Figure 14.6 appears here: dot chart of Wald χ2 for each predictor, in increasing order: ageg, fever.ill, crying, pustular, abdominal, hydration, stop.breath, labor, hxprob, bul.conv, agitated, drowsy, temp, feeding, waz, rr, reffort, ausc.]
This will be the final model except for the model used in Section 14.10.
The model has LR χ2 = 1772. The output of effective.df shows that non-
interaction terms have barely been penalized, and coefficients of interaction
terms have been shrunken from 59 d.f. to effectively 22.6 d.f. Predictive dis-
crimination was assessed by computing the Somers’ Dxy rank correlation
between X β̂ and whether Y = 0, in the subset of records for which Y = 0 is
what was being predicted. Here Dxy = 0.672, and the ROC area is 0.838 (the
unpenalized model had an apparent Dxy = 0.676). To summarize the effectiveness of this model in screening infants for risks of any abnormality in another way: the fractions of infants with predicted probabilities of Y > 0 that are < 0.05, > 0.25, and > 0.5 are, respectively, 0.1, 0.28, and 0.14. anova output is plotted in Figure 14.6 to give a snapshot of the importance of the various predictors. The Wald statistics used here are computed on a variance–covariance matrix that is adjusted for penalization (using Equation 2.6 of Gray237, before it was determined that the sandwich covariance estimator performs less well than the inverse of the penalized information matrix—see p. 211).
The full equation for the fitted model is below. Only the part of the equa-
tion used for predicting Pr(Y = 0) is shown, other than an intercept for
Y ≥ 1 that does not apply when Y = 0.
latex ( full.pen , which = 1:21 , file = '' )
X β̂ =
−1.337435 [Y ≥ 1]
+ 0.1074525 [ageg ∈ [7, 60)] + 0.1971287 [ageg ∈ [60, 90]]
+ 0.1978706 temp + 0.1091831 (temp − 36.19998)^3_+ − 2.833442 (temp − 37)^3_+
+ 5.07114 (temp − 37.29999)^3_+ − 2.507527 (temp − 37.69998)^3_+
+ 0.1606456 (temp − 39)^3_+
+ 0.02090741 rr − 6.336873×10^{−5} (rr − 32)^3_+ + 8.405441×10^{−5} (rr − 42)^3_+
+ 6.152416×10^{−5} (rr − 49)^3_+ − 0.0001018105 (rr − 59)^3_+ + 1.960063×10^{−5} (rr − 76)^3_+
− 0.07589699 waz + 0.02508918 (waz + 2.9)^3_+ − 0.1185068 (waz + 0.75)^3_+
+ 0.1225752 (waz − 0.28)^3_+ − 0.02915754 (waz − 1.73)^3_+ − 0.4418073 bul.conv
− 0.08185088 drowsy − 0.05327209 agitated − 0.2304409 reffort
− 1.158604 ausc − 0.1599588 feeding − 0.1608684 abdominal
− 0.05409718 hydration + 0.08086387 hxprob + 0.007519746 pustular
+ 0.04712091 crying + 0.004298725 fever.ill − 0.3519033 stop.breath
+ 0.06863879 labor
+ [ageg ∈ [7, 60)] [6.499592×10^{−5} temp − 0.00279976 (temp − 36.19998)^3_+
  − 0.008691166 (temp − 37)^3_+ − 0.004987871 (temp − 37.29999)^3_+
  + 0.0259236 (temp − 37.69998)^3_+ − 0.009444801 (temp − 39)^3_+]
+ [ageg ∈ [60, 90]] [0.0001320368 temp − 0.00182639 (temp − 36.19998)^3_+
  − 0.01640406 (temp − 37)^3_+ − 0.0476041 (temp − 37.29999)^3_+
  + 0.09142148 (temp − 37.69998)^3_+ − 0.02558693 (temp − 39)^3_+]
+ [ageg ∈ [7, 60)] [−0.0009437598 rr − 1.044673×10^{−6} (rr − 32)^3_+
  − 1.670499×10^{−6} (rr − 42)^3_+ − 5.189082×10^{−6} (rr − 49)^3_+ + 1.428634×10^{−5} (rr − 59)^3_+
  − 6.382087×10^{−6} (rr − 76)^3_+]
+ [ageg ∈ [60, 90]] [−0.001920811 rr − 5.52134×10^{−6} (rr − 32)^3_+
  − 8.628392×10^{−6} (rr − 42)^3_+ − 4.147347×10^{−6} (rr − 49)^3_+ + 3.813427×10^{−5} (rr − 59)^3_+
  − 1.98372×10^{−5} (rr − 76)^3_+]
dd$limits [ 'Adjust to' , 'cohort' ] ← 'all'   # original default
all ← Predict ( full.pen , name = v , ref.zero = TRUE )
ggplot ( all , ylim = yl , sepdiscrete = 'vertical' )   # Figure 14.9
The accuracy of the approximation can be judged by, for example, the mean absolute error in predicting L̂. Another advantage of full model approximation is that shrinkage used in computing L̂ is inherited by any model that predicts L̂. In contrast, the usual stepwise methods result in β̂ that are too large, since the final coefficients are estimated as if the model structure were prespecified.2
CART would be particularly useful as a model approximator as it would
result in a prediction tree that would be easy for health workers to use.
Fig. 14.7 Centered effects of predictors on the log odds, showing the effects of two
predictors with interaction effects for the age intervals noted. The title all refers
to the prediction of Y = 0|Y ≥ 0, that is, Y = 0. Y>=1 refers to predicting the
probability of Y = 1|Y ≥ 1.
Fig. 14.8 Centered effects of predictors on the log odds, for predicting Y = 1|Y ≥ 1
Fig. 14.9 Centered effects of predictors on the log odds, for predicting Y ≥ 1. No plot was made for the fever.ill, stop.breath, or labor cluster scores.
     Coef  S.E.    Wald Z P
[1,] 1.617 0.01482 109.1  0

Factors in Final Model

None
options ( w )
# 1e10 causes all variables to eventually be
# deleted so we can see the most important ones in order
The approximate model had R2 against the full penalized model of 0.972, and
the mean absolute error in predicting L̂ was 0.17. The Dxy rank correlation
between the approximate model’s predicted logit and the binary event Y = 0
is 0.665 as compared with the full model’s Dxy = 0.672. See Section 19.5 for
an example of computing correct estimates of variance of the parameters in
an approximate model.
Next we turn to diagramming this model approximation so that all predicted
values can be computed without the use of a computer. We draw a type of
nomogram that converts each effect in the model to a 0 to 100 scale which is
just proportional to the log odds. These points are added across predictors
to derive the “Total Points,” which are converted to L̂ and then to predicted
probabilities. For the interaction between rr and ageg, rms’s nomogram func-
tion automatically constructs three rr axes—only one is added into the total
point score for a given subject. Here we draw a nomogram for predicting the
probability that Y > 0, which is 1 − Pr(Y = 0). This probability is derived
by negating β̂ and X β̂ in the model derived to predict Pr(Y = 0).
f ← full.approx
f$coefficients      ← - f$coefficients
f$linear.predictors ← - f$linear.predictors
n ← nomogram ( f ,
               temp = 32:41 , rr = seq (20 , 120 , by = 10) ,
               waz = seq ( -1.5 , 2 , by = .5 ) ,
               fun = plogis , funlabel = 'Pr(Y>0)' ,
               fun.at = c ( .02 , .05 , seq ( .1 , .9 , by = .1 ) , .95 , .98 ))
# Print n to see point tables
plot ( n , lmgp = .2 , cex.axis = .6 )   # Figure 14.10
newsubject ←
  data.frame ( ageg = '[ 0, 7)' , rr = 30 , temp = 39 , waz = 0 , drowsy = 5 ,
               reffort = 2 , bul.conv = 0 , ausc = 0 , feeding = 0)
xb ← predict ( f , newsubject )
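Because the coefficients of f were negated, the inverse logit of xb gives this infant's predicted probability that Y > 0; a one-line sketch, consistent with the values −0.68 and 0.34 quoted in note 3 of the chapter's end-notes:

plogis ( xb )   # Pr(Y > 0) for the new subject; plogis is the inverse logit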
For the full CR model that was fitted using penalized maximum likelihood
estimation (PMLE), we used 200 bootstrap replications to estimate and then
to correct for optimism in various statistical indexes: Dxy , generalized R2 ,
intercept and slope of a linear re-calibration equation for X β̂, the maximum
calibration error for Pr(Y = 0) based on the linear-logistic re-calibration
(Emax), and the Brier quadratic probability score B. PMLE is used at each
of the 200 resamples. During the bootstrap simulations, we sample with
Fig. 14.10 Nomogram for predicting Pr(Y > 0) from the penalized extended CR
model, using an approximate model fitted using ordinary least squares (R2 = 0.972
against the full model’s predicted logits).
replacement from the patients and not from the 5553 expanded records, hence
the specification cluster=u$subs, where u$subs is the vector of sequential pa-
tient numbers computed from cr.setup above. To be able to assess predictive
accuracy of a single predicted probability, the subset parameter is specified
so that Pr(Y = 0) is being assessed even though 5553 observations are used
to develop each of the 200 models.
set.seed (1)   # so can reproduce results
v ← validate ( full.pen , B = 200 , cluster = u$subs ,
               subset = cohort == 'all' )
latex ( v , file = '' , digits = 2 , size = 'smaller' )
We see that the apparent Dxy = 0.672 and that the optimism from overfitting was estimated to be 0.011 for the PMLE model, so the bias-corrected estimate of predictive discrimination is 0.661. The intercept and slope needed to re-calibrate X β̂ to a 45◦ line are very near (0, 1). The estimate of the maximum calibration error in predicting Pr(Y = 0) is 0.001, which is quite satisfactory. The corrected Brier score is 0.122.
The simple calibration statistics just listed do not address the issue of
whether predicted values from the model are miscalibrated in a nonlinear
way, so now we estimate an overfitting-corrected calibration curve nonpara-
metrically.
cal ← calibrate ( full.pen , B = 200 , cluster = u$subs ,
                  subset = cohort == 'all' )
err ← plot ( cal )   # Figure 14.11
The results are shown in Figure 14.11. One can see a slightly nonlinear cali-
bration function estimate, but the overfitting-corrected calibration is excellent
everywhere, being only slightly worse than the apparent calibration. The esti-
mated maximum calibration error is 0.044. The excellent validation for both predictive discrimination and calibration is a result of the large sample size, frequency distribution of Y , initial data reduction, and PMLE.
14.12 Summary
Fig. 14.11 Bootstrap calibration curve for the full penalized extended CR model.
200 bootstrap repetitions were used in conjunction with the loess smoother.111 Also
shown is a “rug plot” to demonstrate how effective this model is in discriminating
patients into low- and high-risk groups for Pr(Y = 0) (which corresponds with the
derived variable value y = 1 when cohort=’all’).
The clinical signs and vital signs, some interacting with age, are powerful predictors of the ordinal response. Graphical methods
are effective for detecting lack of fit in the PO and CR models and for dia-
gramming the final model. Model approximation allowed development of par-
simonious clinical prediction tools. Approximate models inherit the shrinkage
from the full model. For the ordinal model developed here, substantial shrink-
age of the full model was needed.
1 See Moons et al.462 for another case study in penalized maximum likelihood
estimation.
2 The lasso method of Tibshirani608, 609 also incorporates shrinkage into variable
selection.
3 To see how this compares with predictions using the full model, the extra clinical
signs in that model that are not in the approximate model were predicted
individually on the basis of X β̂ from the reduced model along with the signs
that are in that model, using ordinary linear regression. The signs not specified
when evaluating the approximate model were then set to predicted values based
on the values given for the 6-day-old infant above. The resulting X β̂ for the full
model is −0.81 and the predicted probability is 0.31, as compared with −0.68 and 0.34 quoted above.
14.14 Problems
3. Use an R function such as the following to compute the logits of the cu-
mulative proportions.
sf ← function ( y )
  c ( 'Y ≥ 1' = qlogis ( mean ( y ≥ 1)) ,
      'Y ≥ 2' = qlogis ( mean ( y ≥ 2)) ,
      'Y ≥ 3' = qlogis ( mean ( y ≥ 3)) ,
      'Y ≥ 4' = qlogis ( mean ( y ≥ 4)))
Comment on the linearity of the age effect (which of the two plots do
you use?) and on the proportional odds assumption for age, by assessing
parallelism in the second plot.
5. Impute race using the most frequent category and pafi and alb using
“normal” values.
6. Fit a model to predict the ordinal response using all predictors. For con-
tinuous ones assume a smooth relationship but allow it to be nonlinear.
Quantify the ability of the model to discriminate patients in the five out-
comes. Do an overall likelihood ratio test for whether any variables are
associated with the level of functional disability.
7. Compute partial tests of association for each predictor and a test of nonlin-
earity for continuous ones. Compute a global test of nonlinearity. Graphi-
cally display the ranking of importance of the predictors.
8. Display the shape of how each predictor relates to the log odds of exceeding
any level of sfdm2 you choose, setting other predictors to typical values
(one value per predictor). By default, Predict will make predictions for
the second response category, which is a satisfactory choice here.
9. Use resampling to validate the Somers’ Dxy rank correlation between pre-
dicted logit and the ordinal outcome. Also validate the generalized R2 ,
and slope shrinkage coefficient, all using a single R command. Comment
on the quality (potential “export-ability”) of the model.
Chapter 15
Regression Models for Continuous Y
and Case Study in Ordinal Regression
For the OLS fully parametric case, the model may be restated
Prob[Y ≥ y | X] = Prob[(Y − Xβ)/σ ≥ (y − Xβ)/σ]                  (15.3)
                = 1 − Φ((y − Xβ)/σ) = Φ(−y/σ + Xβ/σ)             (15.4)
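For orientation in the footnotes that follow, the cumulative probability family being contrasted with OLS can be sketched (an assumption based on the intercepts α̂y and the Prob[Y ≥ y|X] convention discussed below) as

Prob[Y ≥ y | X] = F(αy + Xβ),

where F is the inverse of the link function; F(u) = 1/[1 + exp(−u)] gives the proportional odds model and F = Φ gives the probit model.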
σ σ σ
b
Quantile regression allows the estimated value of the 0.5 quantile to be higher than
the estimated value of the 0.6 quantile for some values of X. Composite quantile
regression690 removes this possibility by forcing all the X coefficients to be the same
across multiple quantiles, a restriction not unlike what cumulative probability ordinal
models make.
c
For symmetric distributions applying a decreasing transformation will negate the
coefficients. For asymmetric distributions (e.g., Gumbel), reversing the order of Y
will do more than change signs.
d
Only an estimate of mean Y from these β̂s is non-robust.
e
It is more traditional to state the model in terms of Prob[Y ≤ y|X] but we use
Prob[Y ≥ y|X] so that higher predicted values are associated with higher Y .
f
α̂y are unchanged if a constant is added to all y.
g
The intercepts have to be shifted to the left one position in solving this equation
because the quantile is such that Prob[Y ≤ y] = q whereas the model is stated in
terms of Prob[Y ≥ y].
The orm function in the rms package takes advantage of the information
matrix being of a sparse tri-band diagonal form for the intercept parameters.
This makes the computations efficient even for hundreds of intercepts (i.e.,
unique values of Y ). orm is made to handle continuous Y .
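As a brief sketch of orm on a continuous response (the data here are simulated for illustration), the fit yields functions for estimating the conditional mean and quantiles from the linear predictor:

require ( rms )
set.seed (1)
x1 ← runif (200);  x2 ← runif (200)
y  ← x1 + 2 * x2 + rexp (200)         # continuous, skewed response
f  ← orm ( y ∼ x1 + x2 )              # logistic (PO) family by default
M  ← Mean ( f );  qu ← Quantile ( f )
lp ← predict ( f , data.frame ( x1 = .5 , x2 = .5 ))   # linear predictor
M ( lp )        # estimated conditional mean
qu ( .5 , lp )  # estimated conditional median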
Ordinal regression has nice properties in addition to those listed above,
allowing for
• estimation of quantiles as efficiently as quantile regression if the parallel
slopes assumptions hold
• efficient estimation of mean Y
• direct estimation of Prob[Y ≥ y|X]
• arbitrary clumping of values of Y , while still estimating β and mean Y
efficientlyh
• solutions for β̂ using ordinary Newton-Raphson or other popular optimiza-
tion techniques
• being based on a standard likelihood function, penalized estimation can
be straightforward
• Wald, score, and likelihood ratio χ2 tests that are more powerful than tests
from quantile regression.
On the last point, if there is a single predictor in the model and it is binary,
the score test from the proportional odds model is essentially the Wilcoxon
test, and the score test from the Gumbel log-log cumulative probability
model is essentially the log-rank test.
h
But it is not sensible to estimate quantiles of Y when there are heavy ties in Y in
the area containing the quantile.
Fig. 15.1 Assumptions of the linear model (left panel) and semiparametric ordi-
nal probit or logit (proportional odds) models (right panel). Ordinal models do not
assume any shape for the distribution of Y for a given X; they only assume paral-
lelism. The linear model can relax the parallelism assumption if σ is allowed to vary,
but in practice it is difficult to know how to vary it except for the unequal variance
two-sample t-test.
getHdata ( nhgh )
w ← subset ( nhgh , age ≥ 21 & dx == 0 & tx == 0 , select = -c ( dx , tx ))
latex ( describe ( w ) , file = '' )
w
18 Variables 4629 Observations
sex
n missing unique
4629 0 2
male (2259, 49%), female (2370, 51%)
re : Race/Ethnicity
n missing unique
4629 0 5
Mexican American (832, 18%), Other Hispanic (474, 10%)
Non-Hispanic White (2318, 50%), Non-Hispanic Black (756, 16%)
Other Race Including Multi-Racial (249, 5%)
wt : Weight [kg]
n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
4629 0 890 1 80.49 52.44 57.18 66.10 77.70 91.40 106.52 118.00
lowest : 33.2 36.1 37.9 38.5 38.7
highest: 184.3 186.9 195.3 196.6 203.0
gh : Glycohemoglobin [%]
n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
4629 0 63 0.99 5.533 4.8 5.0 5.2 5.5 5.8 6.0 6.3
lowest : 4.0 4.1 4.2 4.3 4.4, highest: 11.9 12.0 12.1 12.3 14.5
First let’s see if gh would make a Gaussian residuals model fit. Use ordinary
regression on four key variables to collapse these into one variable (predicted
mean from the OLS model). Stratify the predicted means into six quantile
groups. Apply the normal inverse cumulative distribution function Φ−1 to the
empirical cumulative distribution functions (ECDF) of gh using these strata,
and check for normality and constant σ 2 . The ECDF estimates Prob[Y ≤
y|X] but for ordinal modeling we want to state models in terms of Prob[Y ≥
y|X] so take one minus the ECDF before inverse transforming.
f ← ols ( gh ∼ rcs ( age , 5) + sex + re + rcs ( bmi , 3) , data = w )
pgh ← fitted ( f )
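The stratified inverse-transformed ECDF plot itself can be sketched as follows (an assumption: the grouping variable name and the use of Hmisc's Ecdf with its fun argument for transforming the cumulative proportions):

pghg ← cut2 ( pgh , g = 6)    # six quantile groups of predicted means
Ecdf (∼ gh , groups = pghg , data = w ,
      fun = function ( F ) qnorm (1 - pmin ( F , .9999 )) ,  # Φ⁻¹ of Prob[Y ≥ y]
      xlab = 'Glycohemoglobin, %' )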
[Figure 15.2 appears here: panels plot log(F/(1 − F)), qnorm(F), log(−log(1 − F)), and −log(−log(F)) against glycohemoglobin (%), by strata of OLS-predicted mean gh.]
Fig. 15.2 Examination of normality and constant variance assumption, and assump-
tions for various ordinal models
The upper right curves in Figure 15.2 are not linear, implying that a normal conditional distribution cannot work for gh.i There is non-parallelism for the logit model. The other graphs will be used to guide selection of an ordinal model below.
i They are not parallel either.
In the upper left panel of Figure 15.2, the logit inverse curves are not parallel, so the proportional odds assumption does not hold when predicting HbA1c. The log-log link yields the highest degree of parallelism and the most constant regression coefficients across cutoffs of gh, so we use this link in an ordinal regression model (linearity of the curves is not required).
Another way to examine model fit is to flexibly fit the single most important
predictor (age) using a variety of methods, and compare predictions to sample
quantiles and means based on subsets on age. We use overlapping subsets
to gain resolution, with each subset composed of those subjects having age
within five years of the point being predicted by the models. Here we predict
the 0.5, 0.75, and 0.9 quantiles and the mean. For quantiles we can compare
to quantile regression (discussed below) and for means we compare to OLS.
ag  ← 25:75
lag ← length ( ag )
q2 ← q3 ← p90 ← means ← numeric ( lag )
for ( i in 1:lag ) {
  s ← which ( abs ( w$age - ag [ i ]) < 5)
  y ← w$gh [ s ]
  a ← quantile ( y , probs = c ( .5 , .75 , .9 ))
  q2 [ i ]    ← a [1]
  q3 [ i ]    ← a [2]
  p90 [ i ]   ← a [3]
  means [ i ] ← mean ( y )
}
fams ← c ( 'logistic' , 'probit' , 'loglog' , 'cloglog' )
fe   ← function ( pred , target ) mean ( abs ( pred$yhat - target ))
mod  ← gh ∼ rcs ( age , 6)
P ← Er ← list ()
for ( est in c ( 'q2' , 'q3' , 'p90' , 'mean' )) {
  meth ← if ( est == 'mean' ) 'ols' else 'QR'
  p  ← list ()
  er ← rep ( NA , 5)
  names ( er ) ← c ( fams , meth )
  for ( family in fams ) {
    h   ← orm ( mod , family = family , data = w )
    fun ← if ( est == 'mean' ) Mean ( h )
    else {
      qu ← Quantile ( h )
      # ...
  Er [[ est ]] ← er
  pr ← do.call ( 'rbind' , p )
  pr$est ← est
  P ← rbind.data.frame ( P , pr )
}
xyplot ( yhat ∼ age | est , groups = .set. , data = P , type = 'l' ,   # Figure 15.3
         auto.key = list ( x = .75 , y = .2 , points = FALSE , lines = TRUE ) ,
         panel = function ( ... , subscripts ) {
           panel.xyplot ( ... , subscripts = subscripts )
           est ← P$est [ subscripts [1]]
           lpoints ( ag , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) ,
                     col = gray ( .7 ))
           er ← format ( round ( Er [[ est ]] , 3) , nsmall = 3)
           ltext (26 , 6.15 , paste ( names ( er ) , collapse = '\n' ) ,
                  cex = .7 , adj = 0)
           ltext (40 , 6.15 , paste ( er , collapse = '\n' ) ,
                  cex = .7 , adj = 1)})
[Figure 15.3 appears here. Mean absolute errors shown in its panels: q2 — logistic 0.048, probit 0.052, loglog 0.058, cloglog 0.072, QR 0.024; q3 — logistic 0.050, probit 0.045, loglog 0.037, cloglog 0.077, QR 0.027; mean — logistic 0.021, probit 0.025, loglog 0.026, cloglog 0.033, ols 0.013; p90 — logistic 0.053, probit 0.047, loglog 0.041, cloglog 0.101, QR 0.030.]
Fig. 15.3 Three estimated quantiles and estimated mean using 6 methods, compared
against caliper-matched sample quantiles/means (circles). Numbers are mean abso-
lute differences between predicted and sample quantities using overlapping intervals
of age and caliper matching. QR: quantile regression.
Figure 15.5 depicts a significant departure from the linear form implied by
Gaussian residuals (Eq. 15.4).
[Figure 15.4 appears here; the y-axis is Prob(Y ≥ y) and the quantile group intervals of predicted mean HbA1c are [4.88,5.29), [5.29,5.44), [5.44,5.56), [5.56,5.66), [5.66,5.76), [5.76,6.48].]
Fig. 15.4 Observed (dashed lines, open circles) and predicted (solid lines, closed cir-
cles) exceedance probability distributions from a model using 6-tiles of OLS-predicted
HbA1c . Key shows quantile group intervals of predicted mean HbA1c .
[Figure 15.5 appears here: the estimated intercepts αy (ranging from about −1 down to −5) plotted against y.]
aic ← NULL
for ( mod in list ( gh ∼ rcs ( age , 5) + rcs ( log ( bmi ) , 5) ,
                    gh ∼ rcs ( age , 5) + rcs ( log ( ht ) , 5) + rcs ( log ( wt ) , 5) ,
                    gh ∼ rcs ( age , 5) + rcs ( log ( ht ) , 4) * rcs ( log ( wt ) , 4)))
  aic ← c ( aic , AIC ( orm ( mod , family = loglog , data = w )))
print ( aic )
The ratio of the coefficient of log height to the coefficient of log weight is −2.4, which is between the −2 implied by BMI (weight/height²) and the more dimensionally reasonable weight/height³. By AIC, a spline interaction surface between height and weight does slightly better than BMI in predicting HbA1c, but a nonlinear function of BMI is barely worse. It will require other body size measures to displace BMI as a predictor.
As an aside, compare this model fit to that from the Cox proportional hazards model. The Cox model uses a conditioning argument to obtain a partial likelihood free of the intercepts α (and requires a second step to estimate these log discrete hazard components), whereas we are using a full marginal likelihood of the ranks of Y.330
print ( cph ( Surv ( gh ) ∼ rcs ( age , 5) + log ( ht ) + log ( wt ) , data = w ) ,
        latex = TRUE )
Next we examine all body size measures, and check their redundancies.
v ← varclus (∼ wt + ht + bmi + leg + arml + armc + waist +
             tri + sub + age + sex + re , data = w )
plot ( v )
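The call that produced the redundancy output below is not shown; a sketch assuming Hmisc's redun function and this variable list:

redun (∼ wt + ht + bmi + leg + arml + armc + waist + tri + sub ,
       data = w , nk = 3)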
Redundancy Analysis

n: 3853   p: 8   nk: 3

Redundant variables:

bmi ht

Predicted from variables:
Six size measures adequately capture the entire set. Height and BMI are
removed (Figure 15.6). An advantage of removing height is that it is age-
dependent due to vertebral compression in the elderly:
f ← orm ( ht ∼ rcs ( age , 4) * sex , data = w )   # Prop. odds model
qu ← Quantile ( f );  med ← function ( x ) qu ( .5 , x )
ggplot ( Predict ( f , age , sex , fun = med , conf.int = FALSE ) ,
         ylab = 'Predicted Median Height, cm' )
However, upper leg length has the same declining trend, implying a survival
bias or birth year effect.
In preparing to create a multivariable model, degrees of freedom are allo-
cated according to the generalized Spearman ρ2 (Figure 15.7)j.
s ← spearman2 ( gh ∼ age + sex + re + wt + leg + arml + armc +
                waist + tri + sub , data = w , p = 2)
plot ( s )
[Figure 15.6 appears here: hierarchical variable clustering (similarity: Spearman ρ²) of the body size measures, age, sex, and race/ethnicity dummies.]
Fig. 15.7 Estimated median height as a smooth function of age, allowing age to
interact with sex, from a proportional odds model
Because there are many competing body measures, we use backwards step-
down to arrive at a set of predictors. The bootstrap will be used to penal-
ize predictive ability for variable selection. First the full model is fit using
casewise deletion, then we do a composite test to assess whether any of the
frequently–missing predictors is important.
f ← orm ( gh ∼ rcs ( age , 5) + sex + re + rcs ( wt , 3) + rcs ( leg , 3) + arml +
          rcs ( armc , 3) + rcs ( waist , 4) + tri + rcs ( sub , 3) ,
          family = 'loglog' , data = w , x = TRUE , y = TRUE )
print ( f , latex = TRUE , coefs = FALSE )
Spearman ρ²    Response: gh

           N  df
age     4629   2
waist   4465   2
leg     4474   2
sub     3974   2
armc    4499   2
wt      4629   2
re      4629   4
tri     4295   2
arml    4502   2
sex     4629   1
χ2 d.f. P
leg 8.30 2 0.0158
Nonlinear 3.32 1 0.0685
arml 0.16 1 0.6924
armc 6.66 2 0.0358
Nonlinear 3.29 1 0.0695
waist 29.40 3 < 0.0001
Nonlinear 4.29 2 0.1171
tri 16.62 1 < 0.0001
sub 40.75 2 < 0.0001
Nonlinear 4.50 1 0.0340
TOTAL NONLINEAR 14.95 5 0.0106
TOTAL 128.29 11 < 0.0001
The model achieves Spearman ρ = 0.52, the rank correlation between
predicted and observed HbA1c .
We show the predicted mean and median HbA1c as a function of age,
adjusting other variables to their median or mode (Figure 15.9). Compare the
estimate of the median and 90th percentile with that from quantile regression.
M    ← Mean ( f )
qu   ← Quantile ( f )
med  ← function ( x ) qu ( .5 , x )
p90  ← function ( x ) qu ( .9 , x )
fq   ← Rq ( formula ( f ) , data = w )
fq90 ← Rq ( formula ( f ) , data = w , tau = .9 )
Fig. 15.9 Estimated mean and 0.5 and 0.9 quantiles from the log-log ordinal model
using casewise deletion, along with predictions of 0.5 and 0.9 quantiles from quantile
regression (QR). Age is varied and other predictors are held constant to medians/modes.
Next we fit the reduced model, using multiple imputation to impute miss-
ing predictors (Figure 15.10).
a ← aregImpute (∼ gh + wt + ht + bmi + leg + arml + armc + waist +
                tri + sub + age + re , data = w , n.impute = 5 , pr = FALSE )
g ← fit.mult.impute ( gh ∼ rcs ( age , 5) + re + rcs ( leg , 3) +
                      rcs ( waist , 4) + tri + rcs ( sub , 4) ,
                      orm , a , family = loglog , data = w , pr = FALSE )
an ← anova ( g )
lan ( an )
χ2 d.f. P
age 692.50 4 < 0.0001
Nonlinear 28.47 3 < 0.0001
re 168.91 4 < 0.0001
leg 24.37 2 < 0.0001
Nonlinear 2.14 1 0.1434
waist 128.31 3 < 0.0001
Nonlinear 4.05 2 0.1318
tri 39.44 1 < 0.0001
sub 39.30 3 < 0.0001
Nonlinear 6.63 2 0.0363
TOTAL NONLINEAR 46.80 8 < 0.0001
TOTAL 1464.24 17 < 0.0001
b ← anova ( g , leg , waist , tri , sub )
# Add new lines to the plot with combined effect of 4 size var.
s ← rbind ( an , size = b [ 'TOTAL' , ])
class ( s ) ← 'anova.rms'
plot ( s )
[Dot chart of Wald χ² appears here, in increasing order: leg, sub, tri, waist, re, size (the combined size measures), age.]
Compare the estimated age partial effects and confidence intervals with
those from a model using casewise deletion, and with bootstrap nonparamet-
ric confidence intervals (also with casewise deletion).
Fig. 15.11 Partial effects (log hazard or log-log cumulative probability scale) of all
predictors in reduced model, after multiple imputation
Fig. 15.12 Partial effect for age from multiple imputation (center red line) and
casewise deletion (center blue line) with symmetric Wald 0.95 confidence bands using
casewise deletion (gray shaded area), basic bootstrap confidence bands using casewise
deletion (blue lines), percentile bootstrap confidence bands using casewise deletion
(dashed blue lines), and symmetric Wald confidence bands accounting for multiple
imputation (red lines).
Fig. 15.13 Predicted mean HbA1c vs. predicted median and 0.9 quantile along with
their marginal distributions
Finally, let us draw a nomogram that shows the full power of ordinal
models, by predicting five quantities of interest.
g ← Newlevels ( g , list ( re = abbreviate ( levels ( w$re ))))
exprob ← ExProb ( g )
nom ←
  nomogram ( g , fun = list ( Mean = M ,
                              'Median Glycohemoglobin' = med ,
                              '0.9 Quantile' = q90 ,
                              'Prob(HbA1c ≥ 6.5)' =
                                function ( x ) exprob ( x , y = 6.5 ) ,
                              'Prob(HbA1c ≥ 7.0)' =
                                function ( x ) exprob ( x , y = 7) ,
                              'Prob(HbA1c ≥ 7.5)' =
                                function ( x ) exprob ( x , y = 7.5 )) ,
             fun.at = list ( seq (5 , 8 , by = .5 ) ,
                             c (5 , 5.25 , 5.5 , 5.75 , 6 , 6.25 ) ,
                             c (5.5 , 6 , 6.5 , 7 , 8 , 10 , 12 , 14) ,
                             c ( .01 , .05 , .1 , .2 , .3 , .4 ) ,
                             c ( .01 , .05 , .1 , .2 , .3 , .4 ) ,
                             c ( .01 , .05 , .1 , .2 , .3 , .4 )))
plot ( nom , lmgp = .28 )   # Figure 15.14
[Figure 15.14 nomogram scales: Points; Age; Race/Ethnicity; Upper Leg Length; Waist Circumference; Triceps Skinfold; Subscapular Skinfold; Total Points; Linear Predictor; Mean; Median Glycohemoglobin; 0.9 Quantile; Prob(HbA1c ≥ 6.5); Prob(HbA1c ≥ 7.0); Prob(HbA1c ≥ 7.5)]
Fig. 15.14 Nomogram for predicting median, mean, and 0.9 quantile of glycohe-
moglobin, along with the estimated probability that HbA1c ≥ 6.5, 7, or 7.5, all from
the log-log ordinal model
Chapter 16
Transform-Both-Sides Regression
16.1 Background
Fitting multiple regression models by the method of least squares is one of the
most commonly used methods in statistics. There are a number of challenges
to the use of least squares, even when it is only used for estimation and not
inference, including the following.
1. How should continuous predictors be transformed so as to get a good fit?
2. Is it better to transform the response variable? How does one find a good
transformation that simplifies the right-hand side of the equation?
3. What if Y needs to be transformed non-monotonically (e.g., |Y − 100|)
before it will have any correlation with X?
When one is trying to draw an inference about population effects using con-
fidence limits or hypothesis tests, the most common approach is to assume
that the residuals have a normal distribution. This is equivalent to assuming
that the conditional distribution of the response Y given the set of predictors
X is normal with mean depending on X and variance that is (one hopes)
a constant independent of X. The need for a distributional assumption to
enable us to draw inferences creates a number of other challenges such as the
following.
1. If for the untransformed original scale of the response Y the distribution of
the residuals is not normal with constant spread, ordinary methods will not
yield correct inferences (e.g., confidence intervals will not have the desired
coverage probability and the intervals will need to be asymmetric).
2. Quite often there is a transformation of Y that will yield well-behaving
residuals. How do you find this transformation? Can you find a transfor-
mation for the Xs at the same time?
3. All classical statistical inferential methods assume that the full model was
pre-specified, that is, the model was not modified after examining the data.
How does one correct confidence limits, for example, for data-based model
and transformation selection?
When the model’s left-hand side also needs transformation, either to im-
prove R2 or to achieve constant variance of the residuals (which increases the
chances of satisfying a normality assumption), there are a few approaches
available. One approach is Breiman and Friedman’s alternating conditional
expectation (ACE) method.68 ACE simultaneously transforms both Y and
each of the Xs so as to maximize the multiple R2 between the transformed
Y and the transformed Xs. The model is given by

g(Y) = f1(X1) + f2(X2) + · · · + fp(Xp) + ε,

where g and the fj are transformations estimated by the algorithm.
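To make this concrete, here is a minimal sketch of estimating such transformations with the ace function in the acepack package (described in the next section); the simulated data and variable names below are purely illustrative, not from the text.

require(acepack)
set.seed(1)
x ← cbind(x1=runif(200), x2=runif(200), x3=runif(200))
y ← exp(x[,'x1']) + abs(x[,'x2'] - .5) + .1*rnorm(200)
f ← ace(x, y)    # simultaneously transform y and each column of x
f$rsq            # R² between transformed y and sum of transformed x's
plot(y, f$ty)    # estimated transformation of y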
16.5 R Functions
The R acepack package’s ace function implements all the features of the ACE
algorithm, and its avas function does likewise for AVAS. The bootstrap and
smearing capabilities mentioned above are offered for these estimation func-
tions by the areg.boot (“additive regression using the bootstrap”) function
in the Hmisc package. Unlike the ace and avas functions, areg.boot uses the
R modeling language, making it easier for the analyst to specify the predic-
tor variables and what is assumed about their relationships with the trans-
formed Y . areg.boot also implements a parametric transform-both-sides ap-
proach using restricted cubic splines and canonical variates, and offers various
estimation options with and without smearing. It can estimate the effect of
changing one predictor, holding others constant, using the ordinary bootstrap
to estimate the standard deviation of difference in two possibly transformed
estimates (for two values of X), assuming normality of such differences. Nor-
mality is assumed to avoid generating a large number of bootstrap replica-
tions of time-consuming model fits. It would not be very difficult to add non-
parametric bootstrap confidence limit capabilities to the software. areg.boot
re-samples every aspect of the modeling process it uses, just as Faraway186
did for parametric least squares modeling.
areg.boot implements a variety of methods as shown in the simple exam-
ple below. The monotone function restricts a variable’s transformation to be
monotonic, while the I function restricts it to be linear.
f ← areg.boot(Y ∼ monotone(age) + sex + weight + I(blood.pressure))
set.seed(7)
n ← 400
x1 ← runif(n)
x2 ← runif(n)
x3 ← factor(sample(c('a','b','c'), n, TRUE))
y ← exp(x1 + 2*abs(x2 - .5) + .3*(x3 == 'b') + .5*(x3 == 'c') +
        .5*rnorm(n))
Predictor Types

   type
x1    s
x2    s
x3    c

y type: s
n = 400   p = 3

Coefficients of standardized transformations:

    Intercept             x1             x2             x3
-3.443111e-16   9.702960e-01   1.224320e+00   9.881150e-01

Residuals on transformed scale:
Note that the coefficients above do not mean very much, as the scale of the transformations is arbitrary. We see that the model was very slightly overfitted (R² dropped from 0.44 to 0.42), and the R² values are in agreement with the OLS model fit above.
Next we plot the transformations, 0.95 confidence bands, and a sample of
the bootstrap estimates.
plot(f, boot=20)   # Figure 16.1
[Figure 16.1: estimated transformations of y, x1, x2, and x3, with 0.95 confidence bands and 20 bootstrap replications]
Estimates based on 300 resamples

        y       x1       x2       x3
 3.728843 0.500000 0.300000 2.000000
b Beware of using a data-derived transformation in an ordinary model, as this will result in standard errors that are too small. This is because model selection is not taken into account.186
Predictor: x1
Predictor: x2
Predictor: x3
panel.xYplot(...)
dat ← subset(newdat,
             statistic == c('median','mean')[current.column()])
for (w in c('a','b','c'))
  with(subset(dat, x3 == w),
       llines(x2, ytrue, col=gray(.7), lwd=1.5))
}
)
[Figure: predicted median and mean of y (ŷ) as a function of x2, with separate curves for x3 = a, b, c in each panel]
Chapter 17
Introduction to Survival Analysis

17.1 Background
Suppose that one wished to study the occurrence of some event in a popu-
lation of subjects. If the time until the occurrence of the event were unim-
portant, the event could be analyzed as a binary outcome using the logistic
regression model. For example, in analyzing mortality associated with open
heart surgery, it may not matter whether a patient dies during the procedure or after being in a coma for two months. For other outcomes,
especially those concerned with chronic conditions, the time until the event
is important. In a study of emphysema, death at eight years after onset of
symptoms is different from death at six months. An analysis that simply
counted the number of deaths would be discarding valuable information and
sacrificing statistical power.
Survival analysis is used to analyze data in which the time until the event is of interest.1 The response variable is the time until that event and is often called a failure time, survival time, or event time. Examples of responses
of interest include the time until cardiovascular death, time until death or
myocardial infarction, time until failure of a light bulb, time until pregnancy,
or time until occurrence of an ECG abnormality during exercise. Bull and
Spiegelhalter83 have an excellent overview of survival analysis.
The response, event time, is usually continuous, but survival analysis al-
lows the response to be incompletely determined for some subjects. For exam-
ple, suppose that after a five-year follow-up study of survival after myocardial
infarction a patient is still alive. That patient’s survival time is censored on
the right at five years; that is, her survival time is known only to exceed five
years. The response value to be used in the analysis is 5+. Censoring can also
occur when a subject is lost to follow-up.2
If no responses are censored, standard regression models for continuous
responses could be used to analyze the failure times by writing the ex-
pected failure time as a function of one or more predictors, assuming that
the distribution of failure time is properly specified. However, there are still
several reasons for studying failure time using the specialized methods of
survival analysis.
1. Time to failure can have an unusual distribution. Failure time is restricted
to be positive so it has a skewed distribution and will never be normally
distributed.
2. The probability of surviving past a certain time is often more relevant than
the expected survival time (and expected survival time may be difficult to
estimate if the amount of censoring is large).
3. A function used in survival analysis, the hazard function, helps one to
understand the mechanism of failure.308
Survival analysis is used often in industrial life-testing experiments, and
it is heavily used in clinical and epidemiologic follow-up studies. Examples
include a randomized trial comparing a new drug with placebo for its ability
to maintain remission in patients with leukemia, and an observational study
of prognostic factors in coronary heart disease. In the latter example subjects
may well be followed for varying lengths of time, as they may enter the study
over a period of many years.
When regression models are used for survival analysis, all the advantages
of these models can be brought to bear in analyzing failure times. Multiple,
independent prognostic factors can be analyzed simultaneously and treatment
differences can be assessed while adjusting for heterogeneity and imbalances
in baseline characteristics. Also, patterns in outcome over time can be pre-
dicted for individual subjects.
Even in a simple well-designed experiment, survival modeling can allow
one to do the following in addition to making simple comparisons.
1. Test for and describe interactions with treatment. Subgroup analyses can
easily generate spurious results and they do not consider interacting fac-
tors in a dose-response manner. Once interactions are modeled, relative
treatment benefits can be estimated (e.g., hazard ratios), and analyses
can be done to determine if some patients are too sick or too well to have
even a relative benefit.
2. Understand prognostic factors (strength and shape).
3. Model absolute effect of treatment. First, a model for the probability of
surviving past time t is developed. Then differences in survival probabilities
for patients on treatments A and B can be estimated. The differences will
be due primarily to sickness (overall risk) of the patient and to treatment
interactions.
4. Understand time course of treatment effect. The period of maximum effect
or period of any substantial effect can be estimated from a plot of relative
effects of treatment over time.
5. Gain power for testing treatment effects.
6. Adjust for imbalances in treatment allocation in non-randomized studies.
fall below a fixed value C. In type II censoring, a study ends when there is
a pre-specified number of events. If, for example, 100 mice are followed until
50 die, the censoring time is not known in advance.
We are concerned primarily with random type I right-censoring in which
each subject’s event time is observed only if the event occurs before a certain
time, but the censoring time can vary between subjects. Whatever the cause
of censoring, we assume that the censoring is non-informative about the event;
that is, the censoring is caused by something that is independent of the im-
pending failure. Censoring is non-informative when it is caused by planned
termination of follow-up or by a subject moving out of town for reasons unre-
lated to the risk of the event. If subjects are removed from follow-up because
of a worsening condition, the informative censoring will result in biased esti-
mates and inaccurate statistical inference about the survival experience. For
example, if a patient’s response is censored because of an adverse effect of
a drug or noncompliance to the drug, a serious bias can result if patients
with adverse experiences or noncompliance are also at higher risk of suffering
the outcome. In such studies, efficacy can only be assessed fairly using the
intention to treat principle: all events should be attributed to the treatment assigned even if the subject is later removed from that treatment.4
The survival function is defined as S(t) = Prob{T > t} = 1 − F(t), where F(t) is the cumulative distribution function for T. If the event is death,
S(t) is the probability that death occurs after time t, that is, the probability
that the subject will survive at least until time t. S(t) is always 1 at t = 0;
all subjects survive at least to time zero. The survival function must be
non-increasing as t increases. An example of a survival function is shown in
Figure 17.1. In that example subjects are at very high risk of the event in the
early period, so S(t) drops sharply. The risk is low for 0.1 ≤ t ≤ 0.6, so
S(t) is somewhat flat. After t = .6 the risk again increases, so S(t) drops more
quickly. Figure 17.2 depicts the cumulative hazard function corresponding
to the survival function in Figure 17.1. This function is denoted by Λ(t).
It describes the accumulated risk up until time t, and as is shown later,
is the negative of the log of the survival function. Λ(t) is non-decreasing
as t increases; that is, the accumulated risk increases or remains the same.
Another important function is the hazard function, λ(t), also called the force
[Figures 17.1–17.3: an example survival function S(t), its cumulative hazard function Λ(t), and its hazard function λ(t), each plotted against t]

Fig. 17.3 Hazard function
By examining the hazard at each point in time, one can learn about the mechanisms and forces of risk over time.
Figure 17.3 depicts the hazard function corresponding to S(t) in Figure 17.1
and to Λ(t) in Figure 17.2. Notice that the hazard function allows one to
more easily determine the phases of increased risk than looking for sudden
drops in S(t) or Λ(t).
The hazard function is defined formally by

λ(t) = −∂ log S(t)/∂t,    (17.5)

the negative of the slope of the log of the survival function. Working backwards, the integral of λ(t) is

∫₀ᵗ λ(v) dv = − log S(t).    (17.6)
The integral or area under λ(t) is defined to be Λ(t), the cumulative hazard
function. Therefore
Λ(t) = − log S(t), (17.7)
or
S(t) = exp[−Λ(t)]. (17.8)
So knowing any one of the functions S(t), Λ(t), or λ(t) allows one to derive
the other two functions. The three functions are different ways of describing
the same distribution.
One property of Λ(t) is that the expected value of Λ(T) is unity, since if T ∼ S(t), the density of T is λ(t)S(t) and

E[Λ(T)] = ∫₀^∞ Λ(t)λ(t) exp(−Λ(t)) dt
        = ∫₀^∞ u exp(−u) du    (17.9)
        = 1.
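A quick simulation (a sketch, not from the text; the rate 0.3 is arbitrary) confirms this property for the exponential distribution, where Λ(t) = λt:

set.seed(2)
ft ← rexp(100000, rate=0.3)
mean(0.3 * ft)    # sample mean of Λ(T); approximately 1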
The qth quantile of the survival time T, Tq, is the time by which a fraction q of subjects have failed, obtained by setting S(t) = 1 − q:

Tq = S⁻¹(1 − q).    (17.10)

The median life length is the time by which half the subjects will fail, obtained by setting S(t) = 0.5:

T0.5 = S⁻¹(0.5).    (17.11)

The qth quantile of T can also be computed by setting exp[−Λ(t)] = 1 − q, giving

Tq = Λ⁻¹[− log(1 − q)].    (17.12)
The mean or expected value of T (the expected failure time) is the area under the survival function for t ranging from 0 to ∞:

μ = ∫₀^∞ S(v) dv.    (17.13)
Irwin has defined mean restricted life (see [334, 335]), which is the area under
S(t) up to a fixed time (usually chosen to be a point at which there is still
adequate follow-up information).
The random variable T denotes a random failure time from the survival
distribution S(t). We need additional notation for the response and censoring
information for the ith subject. Let Ti denote the response for the ith subject.
This response is the time until the event of interest, and it may be censored
if the subject is not followed long enough for the event to be observed. Let Ci denote the censoring time for the ith subject, define the event indicator as

ei = 1 if Ti ≤ Ci and 0 otherwise,    (17.14)

and define the observed response as

Yi = min(Ti, Ci),    (17.15)

which is the time that occurred first: the failure time or the censoring time.
The pair of values (Yi , ei ) contains all the response information for most
purposes (i.e., the potential censoring time Ci is not usually of interest if the
event occurred before Ci ).
Figure 17.4 demonstrates this notation. The line segments start at study
entry (survival time t = 0).
A useful property of the cumulative hazard function can be derived as fol-
lows. Let z be any cutoff time and consider the expected value of Λ evaluated
at the earlier of the cutoff time or the actual failure time.
Ti    Ci    Yi    ei
75    81    75    1
 7    76     7    1
68+   68    68    0
52+   52    52    0
20    56    20    1

Fig. 17.4 Some censored data. Circles denote events; follow-up ends at termination of study.

E[Λ(min(T, z))] involves the integral

∫₀^Λ(z) u exp(−u) du = −[u exp(−u) + exp(−u)] │₀^Λ(z)
                     = 1 − S(z)[Λ(z) + 1].
In this section we assume that each subject in the sample has the same dis-
tribution of the random variable T that represents the time until the event.
In particular, there are no covariables that describe differences between sub-
jects in the distribution of T . As before we use S(t), λ(t), and Λ(t) to denote,
respectively, the survival, hazard, and cumulative hazard functions.
The form of the true population survival distribution function S(t) is al-
most always unknown, and many distributional forms have been used for
describing failure time data. We consider first the two most popular para-
metric survival distributions: the exponential and Weibull distributions. The
exponential distribution is a very simple one in which the hazard function is
constant; that is, λ(t) = λ. The cumulative hazard and survival functions are then

Λ(t) = λt and
S(t) = exp(−Λ(t)) = exp(−λt).    (17.19)

The time by which half of the subjects will have failed, log(2)/λ, is proportional to the reciprocal of the constant hazard rate λ. This is true also of the expected or mean life length, which is 1/λ.
The exponential distribution is one of the few distributions for which a
closed-form solution exists for the estimator of its parameter when censoring
is present. This estimator is a function of the number of events and the total
person-years of exposure. Methods based on person-years in fact implicitly
assume an exponential distribution. The exponential distribution is often used
to model events that occur “at random in time.”323 It has the memoryless property: the future lifetime distribution of a subject is the same no matter how “old” the subject is. This “ageless” property also makes the exponential distribution a poor choice for modeling human survival except over short time periods.
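A minimal sketch of the person-years estimator applied to simulated right-censored data (the true rate, censoring mechanism, and variable names are ours, for illustration only):

set.seed(1)
ft ← rexp(200, rate=0.1)     # true failure times
ct ← runif(200, 0, 15)       # random censoring times
y ← pmin(ft, ct)             # observed follow-up time
e ← as.numeric(ft <= ct)     # event indicator
lam ← sum(e) / sum(y)        # events per person-year of exposure
c(lambda=lam, median=log(2)/lam, mean=1/lam)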
The Weibull distribution is a generalization of the exponential distribution.
Its hazard, cumulative hazard, and survival functions are given by

λ(t) = αγt^(γ−1)
Λ(t) = αt^γ    (17.22)
S(t) = exp(−αt^γ).
[Figure: Weibull hazard functions λ(t) for shape parameters γ = .5, 1, 2, 4]
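The hazard shapes in the figure can be reproduced in a few lines of R (a sketch assuming α = 1; the axis limits are arbitrary):

t ← seq(.01, 2, length=200)
plot(0, 0, type='n', xlim=c(0, 2), ylim=c(0, 6),
     xlab='t', ylab='lambda(t)')
for (gam in c(.5, 1, 2, 4))
  lines(t, gam * t^(gam - 1))   # alpha*gamma*t^(gamma-1) with alpha = 1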
λ(t) = a + bt + γ1 w1(t) + · · · + γk−2 wk−2(t), where the wj(t) are the restricted cubic spline terms of Equation 2.25. These terms are cubic in t. A set of knots v1, . . . , vk is selected from the quantiles of the uncensored failure times (see Section 2.4.5 and [286]).
The cumulative hazard function for this model is

Λ(t) = at + ½bt² + ¼ × (quartic terms in t).    (17.25)
Standard maximum likelihood theory is used to obtain estimates of the k
unknown parameters to derive, for example, smooth estimates of λ(t) with
confidence bands. The flexible estimates of S(t) using this method are as
efficient as Kaplan–Meier estimates, but they are smooth and can be used as a
basis for modeling predictor variables. The spline hazard model is particularly
useful for fitting steeply falling and gently rising hazard functions that are
characteristic of high-risk medical procedures.
When censoring is present, S(t) can be estimated (at least for t up until
the end of follow-up) by the Kaplan–Meier333 product-limit estimator. This
method is based on conditional probabilities. For example, suppose that ev-
ery subject has been followed for 39 days or has died within 39 days so that
the proportion of subjects surviving at least 39 days can be computed. After
39 days, some subjects may be lost to follow-up besides those removed from
follow-up because of death within 39 days. The proportion of those still fol-
lowed 39 days who survive day 40 is computed. The probability of surviving
40 days from study entry equals the probability of surviving day 40 after
living 39 days, multiplied by the chance of surviving 39 days.
The life table in Table 17.1 demonstrates the method in more detail. We
suppose that 100 subjects enter the study and none die or are lost before
day 12.
Times in a life table should be measured as precisely as possible. If the
event being analyzed is death, the failure time should usually be specified
to the nearest day. We assume that deaths occur on the day indicated and
that being censored on a certain day implies the subject survived through the
end of that day. The data used in computing Kaplan–Meier estimates consist
of (Yi , ei ), i = 1, 2, . . . , n using notation defined previously. Primary data
collected to derive (Yi , ei ) usually consist of entry date, event date (if subject
failed), and censoring date (if subject did not fail). Instead, the entry date,
date of event/censoring, and event/censoring indicator ei may be specified.
The Kaplan–Meier estimator is called the product-limit estimator because
it is the limiting case of actuarial survival estimates as the time periods
shrink so that an entry is made for each failure time. An entry need not
be in the table for censoring times (when no failures occur at that time) as
long as the number of subjects censored is subtracted from the next number at risk.
The Kaplan–Meier estimator of Λ(t) is ΛKM(t) = − log SKM(t). An estimate of quantile q of failure time is SKM⁻¹(1 − q), if follow-up is long enough so that
SKM (t) drops as low as 1 − q. If the last subject followed failed so that SKM (t)
drops to zero, the expected failure time can be estimated by computing the
area under the Kaplan–Meier curve.
To demonstrate computation of SKM (t), imagine a sample of failure times
given by
1 3 3 6+ 8+ 9 10+ ,
where + denotes a censored time. The quantities needed to compute SKM are in Table 17.2.

Table 17.2
  ti    ni    di    (ni − di)/ni
   1     7     1        6/7
   3     6     2        4/6
   9     2     1        1/2

Thus SKM(t) = 6/7 = 0.857 for 1 ≤ t < 3, (6/7)(4/6) = 0.571 for 3 ≤ t < 9, and (6/7)(4/6)(1/2) = 0.286 for 9 ≤ t ≤ 10.
Note that the estimate of S(t) is undefined for t > 10 since not all subjects
have failed by t = 10 but no follow-up extends beyond t = 10. A graph of the
Kaplan–Meier estimate is found in Figure 17.6.
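These values can be verified by a direct hand computation (a small sketch; cumprod forms the successive products of conditional survival probabilities):

ti ← c(1, 3, 9);  ni ← c(7, 6, 2);  di ← c(1, 2, 1)
round(cumprod((ni - di) / ni), 3)   # 0.857 0.571 0.286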
require(rms)
tt ← c(1,3,3,6,8,9,10)
stat ← c(1,1,1,0,0,1,0)
S ← Surv(tt, stat)
survplot(npsurv(S ∼ 1), conf="bands", n.risk=TRUE,
         xlab=expression(t))
survplot(npsurv(S ∼ 1, type="fleming-harrington",
                conf.int=FALSE), add=TRUE, lty=3)
[Figure 17.6: step-function survival estimates vs. t; numbers at risk at t = 0, 1, …, 10: 7 7 6 6 4 4 4 3 3 2 1]
Fig. 17.6 Kaplan–Meier product–limit estimator with 0.95 confidence bands. The
Altschuler–Nelson–Aalen–Fleming–Harrington estimator is depicted with the dotted
lines.
The variance of SKM (t) can be estimated using Greenwood’s formula [331,
p. 14], and using normality of SKM (t) in large samples this variance can
be used to derive a confidence interval for S(t). A better method is to de-
rive an asymmetric confidence interval for S(t) based on a symmetric in-
terval for log Λ(t). This latter method ensures that a confidence limit does
not exceed one or fall below zero, and is more accurate since log ΛKM (t) is
more normally distributed than SKM(t). Once a confidence interval, say [a, b], is determined for log Λ(t), the confidence interval for S(t) is computed by
[exp{− exp(b)}, exp{− exp(a)}]. The formula for an estimate of the variance
of interest is [331, p. 15]:
Var{log ΛKM(t)} = [Σ_{i: ti ≤ t} di/{ni(ni − di)}] / {Σ_{i: ti ≤ t} log[(ni − di)/ni]}².    (17.29)
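This is the interval that survfit in the survival package computes when conf.type='log-log' is requested (a brief sketch, using the example data tt and stat defined above):

require(survival)
f ← survfit(Surv(tt, stat) ∼ 1, conf.type="log-log")
summary(f)   # survival estimates with 0.95 log-log confidence limits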
death) may be analyzed with any of the tools of survival analysis because only
the usual censoring mechanism is used. Sometimes the various endpoints may
need separate study, however, because they may have different risk factors.
When the multiple endpoints represent multiple causes of a terminating
event (e.g., death), Prentice et al. have developed standard methods for an-
alyzing cause-specific hazards513 [331, pp. 163–178]. Their methods allow
each cause of failure to be analyzed separately, censoring on the other causes.
They do not assume any mechanism for cause removal nor make any assump-
tions regarding the interrelation among causes of failure. However, analyses
of competing events using data where some causes of failure are removed in
a different way from the original dataset will give rise to different inferences.
When the multiple endpoints represent a mixture of fatal and nonfatal
outcomes, the analysis may be more complex. The same is true when one
wishes to jointly study an event-time endpoint and a repeated measurement.9
When events are independent, each event may also be analyzed separately by
censoring on all other events as well as censoring on loss to follow-up. This will
yield an unbiased estimate of an easily interpreted cause-specific λ(t) or S(t)
because censoring is non-informative [331, pp. 168–169]. One minus SKM (t)
computed in this manner will correctly estimate the probability of failing from
the event in the absence of other events. Even when the competing events are
not independent, the cause-specific hazard model may lead to valid results,
but the resulting model does not allow one to estimate risks conditional on
removal of one or more causes of the event. See Kay340 for a nice example of competing risks analysis when a treatment reduces the risk of death from one cause but increases the risk of death from another cause.10
Larson and Dinse376 have an interesting approach that jointly models the
time until (any) failure and the failure type. For r failure types, they use
an r-category polytomous logistic model to predict the probability of failing
from each cause. They assume that censoring is unrelated to cause of event.
The cause-specific hazard function for cause m is defined as

λm(t) = lim_{u→0} Pr{fail from cause m in [t, t + u) | alive at t} / u.    (17.34)
The cumulative incidence function or probability of failure from cause m by time t is given by

Fm(t) = ∫₀ᵗ λm(u)S(u) du,    (17.35)

which can be estimated by F̂m(t) = Σ_{i: ti ≤ t} (dmi/ni) SKM(ti−),
where dmi is the number of failures of type m at time ti and ni is the number
of subjects at risk of failure at ti .
Pepe and others494, 496, 497 showed how to use a combination of Kaplan–
Meier estimators to derive an estimator of the probability of being free of
event 1 by time t given event 2 has not occurred by time t (see also [349]).
Let T1 and T2 denote, respectively, the times until events 1 and 2. Let S1 (t)
and S2 (t) denote, respectively, the two survival functions. Let us suppose
that event 1 is not a terminating event (e.g., is not death) and that even
after event 1 subjects are followed to ascertain occurrences of event 2. The
probability that T1 > t given T2 > t is

Pr{T1 > t | T2 > t} = S12(t)/S2(t),

where S12(t) is the survival function for min(T1, T2), the earlier of the two
events. Since S12 (t) does not involve any informative censoring (assuming as
always that loss to follow-up is non-informative), S12 may be estimated by
the Kaplan–Meier estimator SKM12 (or by SΛ ). For the type of event 1 we
have discussed above, S2 can also be estimated without bias by SKM2 . Thus
we estimate, for example, the probability that a subject still alive at time t
will be free of myocardial infarction as of time t by SKM12 /SKM2 .
Another quantity that can easily be computed from ordinary survival es-
timates is S2 (t) − S12 (t) = [1 − S12 (t)] − [1 − S2 (t)], which is the probability
that event 1 occurs by time t and that event 2 has not occurred by time t.
The ratio estimate above is used to estimate the survival function for one
event given that another has not occurred. Another function of interest is
the crude survival function which is a marginal distribution; that is, it is the
probability that T1 > t whether or not event 2 occurs:362
Sc(t) = 1 − F1(t),  F1(t) = Prob{T1 ≤ t},    (17.38)
where F1(t) is the crude incidence function defined previously. Note that the event T1 ≤ t implies that the occurrence of event 1 is part of the probability being
computed. If event 2 is a terminating event so that some subjects can never
suffer event 1, the crude survival function for T1 will never drop to zero. The
crude survival function can be interpreted as the survival distribution of W, where W = T1 if T1 < T2 and W = ∞ otherwise.362 11
In many studies there is one final, absorbing state (death, all causes) and mul-
tiple live states. The live states may represent different health states or phases
of a disease. For example, subjects may be completely free of cancer, have an
isolated tumor, metastasize to a distant organ, and die. Unlike this example,
the live states need not have a definite ordering. One may be interested in es-
timating transition probabilities, for example, the probability πij (t1 , t2 ) that
an individual in state i at time t1 is in state j after an additional time t2 .
Strauss and Shavelle596 have developed an extended Kaplan–Meier estimator for this situation. Let S^i_KM(t|t1) denote the ordinary Kaplan–Meier estimate of the probability of not dying before time t (ignoring distinctions between multiple live states) for a cohort of subjects beginning follow-up at time t1 in state i. This is an estimate of the probability of surviving an additional t time units (in any live state) given that the subject was alive and in state i at time t1. Strauss and Shavelle’s estimator is given by
π̂ij(t1, t2) = [nij(t1, t2)/ni(t1, t2)] S^i_KM(t2|t1),    (17.39)
where ni(t1, t2) is the number of subjects in live state i at time t1 who are alive and uncensored t2 time units later, and nij(t1, t2) is the number of such subjects in state j t2 time units beyond t1.12
where λ(t) is the hazard for the failure time and πj (t) is the probability of an
individual having event severity j given she fails at time t. Note that a shift
in the distribution of response severity is allowed as the time until the event increases.13
consistent estimates of variances and covariances even if the events are dependent. Lin407 also discussed how this method can easily be used to model multiple events of differing types.14
17.7 R Functions
The event.chart function of Lee et al.394 will draw a variety of charts for dis-
playing raw survival time data, for both single and multiple events per sub-
ject. Relationships with covariables can also be displayed. The event.history
function of Dubin et al.166 draws an event history graph for right-censored
survival data, including time-dependent covariate status. These functions are
in the Hmisc package.
The analyses described in this chapter can be viewed as special cases of the
Cox proportional hazards model.132 The programs for Cox model analyses
described in Section 20.13 can be used to obtain the results described here, as
long as there is at least one stratification factor in the model. There are, how-
ever, several R functions that are pertinent to the homogeneous or stratified
case. The R function survfit, and its particular renditions of the print, plot,
lines, and points generic functions (all part of the survival package written
by Terry Therneau), will compute, print, and plot Kaplan–Meier and Nelson
survival estimates. Confidence intervals for S(t) may be based on S, Λ, or
log Λ. The rms package’s front-end to the survival package’s survfit function
is npsurv for “nonparametric survival”. It and other functions described in
later chapters use Therneau’s Surv function to combine the response variable
and event indicator into a single R “survival time” object. In its simplest form,
use Surv(y, event), where y is the failure/right-censoring time and event is the event/censoring indicator, usually coded TRUE/FALSE, 0 = censored and 1 = event, or 1 = censored and 2 = event. If the event status variable has other coding (e.g., 3
means death), use Surv(y, s==3). To handle interval time-dependent covari-
ables, or to use Andersen and Gill’s counting process formulation of the Cox
model,23 use the notation Surv(tstart, tstop, status). The counting process
notation allows subjects to enter and leave risk sets at random. For each
time interval for each subject, the interval is made up of tstart < t ≤ tstop.
For time-dependent stratification, there is an optional origin argument to
Surv that indicates the hazard shape time origin at the time of crossover
to a new stratum. A type argument is used to handle left– and interval–
censoring, especially for parametric survival models. Possible values of type
are "right","left","interval","counting","interval2","mstate".
The Surv expression will usually be used inside another function, but it is
fine to save the result of Surv in another object and to use this object in the
particular fitting function.
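A few of these forms in action (a minimal sketch; the data values are made up for illustration):

require(survival)
y ← c(5, 75, 7, 68, 52, 20)
e ← c(1, 1, 1, 0, 0, 1)           # 1 = event, 0 = right-censored
S1 ← Surv(y, e)                   # simplest right-censored form
s ← c(3, 1, 3, 2, 2, 3)           # multi-level status; 3 means death
S2 ← Surv(y, s == 3)              # other levels treated as censored
S3 ← Surv(c(0,0,0,5,5,5), y, e)   # counting process (tstart, tstop]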
npsurv is invoked by the following, with default parameter settings indi-
cated.
require(rms)
units(y) ← "Month"   # default is "Day" - used for axis labels, etc.
npsurv(Surv(y, event) ∼ svar1 + svar2 + ..., data, subset,
       type=c("kaplan-meier", "fleming-harrington", "fh2"),
       error=c("greenwood", "tsiatis"), se.fit=TRUE,
       conf.int=.95,
       conf.type=c("log", "log-log", "plain", "none"), ...)
The groupkm function in rms computes and optionally plots SKM(u) or log ΛKM(u) (if loglog=TRUE) for fixed u, with automatic stratification on a continuous predictor x. As in cut2 (Section 6.2) you can specify the number of subjects per interval (default is m=50), the number of quantile groups (g), or the actual cutpoints (cuts). groupkm plots the survival or log–log survival estimate against the mean x in each x interval.
The bootkm function in the Hmisc package bootstraps Kaplan–Meier sur-
vival estimates or Kaplan–Meier estimates of quantiles of the survival time
distribution. It is easy to use bootkm to compute, for example, a nonparametric
confidence interval for the ratio of median survival times for two groups.
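A sketch of that computation on simulated data (the rates, number of bootstrap replications, and variable names are arbitrary choices, not from the text):

require(Hmisc); require(survival)
set.seed(3)
y1 ← pmin(rexp(100, .10), 10); e1 ← as.numeric(y1 < 10)
y2 ← pmin(rexp(100, .15), 10); e2 ← as.numeric(y2 < 10)
m1 ← bootkm(Surv(y1, e1), q=.5, B=500)      # bootstrapped medians, group 1
m2 ← bootkm(Surv(y2, e2), q=.5, B=500)      # bootstrapped medians, group 2
quantile(m1/m2, c(.025, .975), na.rm=TRUE)  # 0.95 CI for ratio of medians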
See the Web site for a list of functions from other users for nonparametric
estimation of S(t) with left–, right–, and interval–censored data. The adaptive
linear spline log-hazard fitting function heft361 is freely available.
1 Some excellent general references for survival analysis are [57, 83, 114, 133, 154,
197, 282, 308, 331, 350, 382, 392, 444, 484, 574, 604]. Govindarajulu et al.229 have
a nice review of frailty models in survival analysis, for handling clustered time-
to-event data.
2 See Goldman,220 Bull and Spiegelhalter,83 Lee et al.,394 and Dubin et al.166 for ways to construct descriptive graphs depicting right-censored data.
3 Some useful references for left–truncation are [83,112,244,524]. Mandel435 care-
fully described the difference between censoring and truncation.
4 See [384, p. 164] for some ideas for detecting informative censoring. Bilker and
Wang54 discuss right–truncation and contrast it with right–censoring.
5 Arjas29 has applications based on properties of the cumulative hazard function.
6 Kooperberg et al.361, 594 have an adaptive method for fitting hazard functions
using linear splines in the log hazard. Binquet et al.56 studied a related approach
using quadratic splines. Mudholkar et al.466 presented a generalized Weibull
model allowing for a variety of hazard shapes.
7 Hollander et al.299 provide a nonparametric simultaneous confidence band for
S(t), surprisingly using likelihood ratio methods. Miller459 showed that if the
parametric form of S(t) is known to be Weibull with known shape parameter (an
unlikely scenario), the Kaplan–Meier estimator is very inefficient (i.e., has high
variance) when compared with the parametric maximum likelihood estimator.
See [666] for a discussion of how the efficiency of Kaplan–Meier estimators can
be improved by interpolation as opposed to piecewise flat step functions. That
paper also discusses a variety of other estimators, some of which are significantly
more efficient than Kaplan–Meier.
8 See [112, 244, 438, 570, 614, 619] for methods of estimating S or Λ in the presence
of left–truncation. See Turnbull616 for nonparametric estimation of S(t) with
left–, right–, and interval–censoring, and Kooperberg and Clarkson360 for a
flexible parametric approach to modeling that allows for interval–censoring.
Lindsey and Ryan413 have a nice tutorial on the analysis of interval–censored
data.
9 Hogan and Laird297, 298 developed methods for dealing with mixtures of fa-
tal and nonfatal outcomes, including some ideas for handling outcome-related
dropouts on the repeated measurements. See also Finkelstein and Schoenfeld.193
The 30 April 1997 issue of Statistics in Medicine (Vol. 16) is devoted to methods
for analyzing multiple endpoints as well as designing multiple endpoint stud-
ies. The papers in that issue are invaluable, as is Therneau and Hamilton606
and Therneau and Grambsch.604 Huang and Wang311 presented a joint model
for recurrent events and a terminating event, addressing such issues as the fre-
quency of recurrent events by the time of the terminating event.
10 See Lunn and McNeil429 and Marubini and Valsecchi [444, Chapter 10] for
practical approaches to analyzing competing risks using ordinary Cox propor-
tional hazards models. A nice overview of competing risks with comparisons of
various approaches is found in Tai et al.,599 Geskus,214 and Koller et al.358
Bryant and Dignam78 developed a semiparametric procedure in which com-
peting risks are adjusted for nonparametrically while a parametric cumulative
incidence function is used for the event of interest, to gain precision. Fine and
Gray192 developed methods for analyzing competing risks by estimating sub-
distribution functions. Nishikawa et al.478 developed some novel approaches to
competing risk analysis involving time to adverse drug events competing with
time to withdrawal from therapy. They also dealt with different severities of
events in an interesting way. Putter et al.517 have a nice tutorial on competing risks, multi-state models, and associated R software. Fiocco et al.194 developed
17.9 Problems
1. Make a rough drawing of a hazard function from birth for a man who de-
velops significant coronary artery disease at age 50 and undergoes coronary
artery bypass surgery at age 55.
2. Define in words the relationship between the hazard function and the sur-
vival function.
3. In a study of the life expectancy of light bulbs as a function of the bulb’s
wattage, 100 bulbs of various wattage ratings were tested until each had
failed. What is wrong with using the product-moment linear correlation test to assess whether wattage is associated with life length, with respect to (a) distributional assumptions and (b) other assumptions?
4. A placebo-controlled study is undertaken to ascertain whether a new drug
decreases mortality. During the study, some subjects are withdrawn be-
cause of moderate to severe side effects. Assessment of side effects and
withdrawal of patients is done on a blinded basis. What statistical tech-
nique can be used to obtain an unbiased treatment comparison of survival
times? State at least one efficacy endpoint that can be analyzed unbiasedly.
5. Consider long-term follow-up of patients in the support dataset. What pro-
portion of the patients have censored survival times? Does this imply that
one cannot make accurate estimates of chances of survival? Make a his-
togram or empirical distribution function estimate of the censored follow-
up times. What is the typical follow-up duration for a patient in the study
who has survived so far? What is the typical survival time for patients who
have died? Taking censoring into account, what is the median survival time
from the Kaplan–Meier estimate of the overall survival function? Estimate
the median graphically or using any other sensible method.
6. Plot Kaplan–Meier survival function estimates stratified by dzclass. Esti-
mate the median survival time and the first quartile of time until death
for each of the four disease classes.
7. Repeat Problem 6 except for tertiles of meanbp.
8. The commonly used log-rank test for comparing survival times between
groups of patients is a special case of the test of association between the
grouping variable and survival time in a Cox proportional hazards regres-
sion model. Depending on how one handles tied failure times, the log-rank
χ2 statistic exactly equals the score χ2 statistic from the Cox model, and
the likelihood ratio and Wald χ2 test statistics are also appropriate. To
obtain global score or LR χ² tests and P-values you can use a statement such as the following, where cph is in the rms package. It is similar to the survival
package’s coxph function.
cph(Survobject ∼ predictor)
Here Survobject is a survival time object created by the Surv function. Ob-
tain the log-rank (score) χ2 statistic, degrees of freedom, and P -value for
testing for differences in survival time between levels of dzclass. Interpret
this test, referring to the graph you produced in Problem 6 if needed.
9. Do preliminary analyses of survival time using the Mayo Clinic primary bil-
iary cirrhosis dataset described in Section 8.9. Make graphs of Altschuler–
Nelson or Kaplan–Meier survival estimates stratified separately by a few
categorical predictors and by categorized versions of one or two continuous
predictors. Estimate median failure time for the various strata. You may
want to suppress confidence bands when showing multiple strata on one
graph. See [361] for parametric fits to the survival and hazard function for
this dataset.
Chapter 18
Parametric Survival Models
18.1.2 Estimation
All observations then contribute an amount to the log likelihood equal to the
negative of the cumulative hazard evaluated at the failure/censoring time.
In addition, uncensored observations contribute an amount equal to the log
of the hazard function evaluated at the time of failure. Once L or log L
is specified, the general ML methods outlined earlier can be used without modification. For the exponential distribution, the MLE of λ is

λ̂ = nu/w,    (18.7)

where nu is the number of uncensored failure times and w is the sum of all failure and censoring times (the total exposure time). Consider again the sample of failure times
1 3 3 6+ 8+ 9 10+ .
For these data nu = 4 and w = 40, so λ̂ = 0.1 and the estimated median is log(2)/λ̂ = 6.93. Fitting a Weibull distribution instead yields

α̂ = 0.0728
γ̂ = 1.164
Ŝ(t) = exp(−0.0728 t^1.164)    (18.9)
Ŝ⁻¹(0.5) = [(log 2)/α̂]^(1/γ̂) = 6.935 (estimated median).
This fit is very close to the exponential fit since γ̂ is near 1.0. Note that the
two medians are almost equal. The predicted survival probabilities for the
Weibull model for t = 1, 3, 9, 10 are, respectively, 0.93, 0.77, 0.39, 0.35.
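These probabilities follow directly from Equation 18.9; a one-line check:

round(exp(-0.0728 * c(1, 3, 9, 10)^1.164), 2)   # 0.93 0.77 0.39 0.35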
Sometimes a formal test can be made to assess the fit of the proposed
parametric survival distribution. For the data just analyzed, a formal test of
exponentiality versus a Weibull alternative is obtained by testing H0 : γ = 1
in the Weibull model. A score test yielded χ2 = 0.14 with 1 d.f., p = 0.7,
showing little evidence for non-exponentiality (note that the sample size is
too small for this test to have any power).
The fit of the hypothesized survival distribution can often be checked eas-
ily using graphical methods. Nonparametric estimates of S(t) and Λ(t)
are primary tools for this purpose. For example, the Weibull distribution
S(t) = exp(−αt^γ) can be rewritten by taking logarithms twice:

log Λ(t) = log[− log S(t)] = log α + γ log t.
The fit of a Weibull model can be assessed by plotting log Λ̂(t) versus log t
and checking whether the curve is approximately linear. Also, the plotted
curve provides approximate estimates of α (the antilog of the intercept) and
γ (the slope). Since an exponential distribution is a special case of a Weibull
distribution when γ = 1, exponentially distributed data will tend to have a
graph that is linear with a slope of 1.
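A quick illustration (a sketch, not from the text, reusing the example data tt and stat from Section 17.5):

require(survival)
f ← survfit(Surv(tt, stat) ∼ 1)
H ← cumsum(f$n.event / f$n.risk)   # Nelson-Aalen estimate of Λ(t)
plot(log(f$time), log(H), xlab='log t',
     ylab='log cumulative hazard')  # rough linearity supports a Weibull form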
For any assumed distribution S(t), a graphical assessment of goodness of fit can be made by plotting S⁻¹[SΛ(t)] or S⁻¹[SKM(t)] against t and checking for linearity. For log distributions, S specifies the distribution of log(T), so we plot against log t. For a log-normal distribution we thus plot Φ⁻¹[SΛ(t)] against log t, where Φ⁻¹ is the inverse of the standard normal cumulative distribution function. For a log-logistic distribution we plot logit[SΛ(t)] versus log t. For an extreme value distribution we use log–log plots as with the Weibull distribution. Parametric model fits can also be checked by plotting the fitted Ŝ(t) and SΛ(t) against t on the same graph.
18.2.1 Model
The most widely used survival regression specification is to allow the hazard
function λ(t) to be multiplied by exp(Xβ). The survival model is thus gener-
alized from a hazard function λ(t) for the failure time T to a hazard function
λ(t) exp(Xβ) for the failure time given the predictors X:

λ(t|X) = λ(t) exp(Xβ).
In the general regression notation of Section 2.2, the log hazard or log cumu-
lative hazard can be used as the property of the response T evaluated at time
t that allows distributional and regression parts to be isolated and checked.
The PH model can be linearized with respect to Xβ using the following identities:

log λ(t|X) = log λ(t) + Xβ
log Λ(t|X) = log Λ(t) + Xβ.

No matter which of the three model statements is used, there are certain
assumptions in a parametric PH survival model. These assumptions are listed
below.
1. The true form of the underlying functions (λ, Λ, and S) should be specified
correctly.
18.2 Parametric Proportional Hazards Models 429
2. The relationship between the predictors and log hazard or log cumulative
hazard should be linear in its simplest form. In the absence of interaction
terms, the predictors should also operate additively.
3. The way in which the predictors affect the distribution of the response
should be by multiplying the hazard or cumulative hazard by exp(Xβ)
or equivalently by adding Xβ to the log hazard or log cumulative hazard
at each t. The effect of the predictors is assumed to be the same at all
values of t since log λ(t) can be separated from Xβ. In other words, the
PH assumption implies no t by predictor interaction.
The regression coefficient for Xj, βj, is the increase in log hazard or log cumulative hazard at any fixed point in time if Xj is increased by one unit and all other predictors are held constant. This can be written formally as

βj = log λ(t|X1, . . . , Xj + 1, . . . , Xp) − log λ(t|X1, . . . , Xj, . . . , Xp).
If there is only one predictor X1 and that predictor is binary, the PH model
can be written
λ(t|X1 = 0) = λ(t)
λ(t|X1 = 1) = λ(t) exp(β1 ). (18.17)
fitting two separate models. This is because even though males and females are allowed to have unequal age slopes, both sexes are assumed to share the same underlying hazard shape λ(t).
Table 18.1 Mortality differences and ratios when hazard ratio is 0.5

Subject   5-Year Survival   Difference   Mortality Ratio (T/C)
             C       T
   1       0.98    0.99        0.01       0.01/0.02 = 0.5
   2       0.80    0.89        0.09       0.11/0.2  = 0.55
   3       0.25    0.50        0.25       0.5/0.75  = 0.67
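Table 18.1 is easy to reproduce (a sketch): under proportional hazards with hazard ratio h = 0.5, treated survival is control survival raised to the power h.

Sc ← c(.98, .80, .25)              # 5-year control survival
St ← round(Sc ^ 0.5, 2)            # treated survival under h = 0.5
data.frame(C=Sc, T=St, Difference=St - Sc,
           Mortality.Ratio=round((1 - St) / (1 - Sc), 2))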
Relative effects (hazard ratios) also relate more closely to statistical power. For example,
even if the effect of a treatment is to halve the hazard rate, a population
where the control survival is 0.99 will require a much larger sample than will
a population where the control survival is 0.3.
Figure 18.1 depicts the relationship between the survival S(t) of a control subject at any time t, the relative reduction in hazard (h), and the difference in survival S(t)^h − S(t). This figure demonstrates that absolute clinical benefit is greatest for control survival probabilities in the middle of the range.

[Figure 18.1: improvement in survival, S(t)^h − S(t), plotted against the control subject's S(t) for hazard ratios h = 0.1, 0.2, …, 0.9]
λ(t|X) = λ exp(Xβ)
S(t|X) = exp[−λt exp(Xβ)] = exp(−λt)^exp(Xβ).    (18.19)
The exponential regression model can be written in another form that is more
numerically stable by replacing the λ parameter with an intercept term in
Xβ, specifically λ = exp(β0 ). After redefining Xβ to include β0 , λ can be
dropped in all the above formulas.
The Weibull regression model is defined by one of the following functions (assuming that Xβ does not contain an intercept):

λ(t|X) = αγt^(γ−1) exp(Xβ)
Λ(t|X) = αt^γ exp(Xβ)
S(t|X) = exp[−αt^γ exp(Xβ)].    (18.24)

Note that the parameter α in the homogeneous Weibull model has been replaced with α exp(Xβ). The median survival time is given by

T0.5|X = {(log 2)/[α exp(Xβ)]}^(1/γ).
18.2.5 Estimation
Once β̂, the MLE of β, is computed along with the large-sample standard
error estimates, hazard ratio estimates and their confidence intervals can
readily be computed. Letting s denote the estimated standard error of β̂j ,
a 1 − α confidence interval for the Xj + 1 : Xj hazard ratio is given by
exp[β̂j ± zs], where z is the 1 − α/2 critical value for the standard normal
distribution.
Once the parameters of the underlying hazard function are estimated, the MLE of λ(t), λ̂(t), can be derived. The MLE of λ(t|X), the hazard as a function of t and X, is given by

λ̂(t|X) = λ̂(t) exp(Xβ̂).

The MLE of Λ(t), Λ̂(t), can be derived from the integral of λ̂(t) with respect to t. Then the MLE of S(t|X) can be derived:

Ŝ(t|X) = exp[−Λ̂(t) exp(Xβ̂)].
For the Weibull model, we denote the MLEs of the hazard parameters α and γ by α̂ and γ̂. The MLEs of λ(t|X), Λ(t|X), and S(t|X) for this model are

λ̂(t|X) = α̂γ̂t^(γ̂−1) exp(Xβ̂)
Λ̂(t|X) = α̂t^γ̂ exp(Xβ̂)
Ŝ(t|X) = exp[−α̂t^γ̂ exp(Xβ̂)].
Confidence intervals for S(t|X) are best derived using general matrix notation
to obtain an estimate s of the standard error of log[λ̂(t|X)] from the estimated
information matrix of all hazard and regression parameters. A confidence
interval for Ŝ will be of the form
Ŝ(t|X)^exp(±zs).    (18.29)
The MLEs of β and of the hazard shape parameters lead directly to MLEs of the expected and median life length. For the Weibull model the MLE of the median life length given X is

T̂0.5|X = {(log 2)/[α̂ exp(Xβ̂)]}^(1/γ̂).

For the exponential model, the MLE of the expected life length for a subject having predictor values X is given by 1/[λ̂ exp(Xβ̂)].
Fig. 18.2 PH model with one binary predictor. Y -axis is log λ(t) or log Λ(t). For
log Λ(t), the curves must be non-decreasing. For log λ(t), they may be any shape.
Fig. 18.3 PH model with one continuous predictor. Y -axis is log λ(t) or log Λ(t); for
log Λ(t), drawn for t2 > t1 . The slope of each line is β1 .
Fig. 18.4 PH model with one continuous predictor. Y -axis is log λ(t) or log Λ(t). For
log λ, the functions need not be monotonic.
Figure 18.5 must hold at each time t if linearity is assumed for X2 and there
is no interaction between X1 and X2 . Methods for verifying the regression
assumptions (e.g., splines and residuals) and the PH assumption are covered
in detail under the Cox PH model in Chapter 20.
The method for verifying the assumed shape of S(t) in Section 18.1.3 is also
useful when there are a limited number of categorical predictors. To validate
a Weibull PH model one can stratify on X and plot log ΛKM (t|X stratum)
against log t. This graph simultaneously assesses PH in addition to shape
assumptions—all curves should be parallel as well as straight. Straight but
nonparallel (non-PH) curves indicate that a series of Weibull models with
differing γ parameters will fit.
436 18 Parametric Survival Models
Fig. 18.5 Regression assumptions, linear additive PH or AFT model with two pre-
dictors. For PH, Y -axis is log λ(t) or log Λ(t) for a fixed t. For AFT, Y -axis is log(T ).
18.3.1 Model
The log λ or log Λ transformation of the PH model has the following equivalent for AFT models:

S(t|X) = ψ[(log(t) − Xβ)/σ],    (18.32)

where ψ is a survival-type distribution function (one minus a cumulative distribution function) and σ is a scale parameter. Common choices for the distribution function ψ in Equation 18.32 are the extreme value distribution ψ(u) = exp(− exp(u)), the logistic distribution ψ(u) = [1 + exp(u)]⁻¹, and the normal distribution ψ(u) = 1 − Φ(u). The AFT model equivalent of the Weibull model is obtained by using the extreme value distribution, negating β, and replacing γ with 1/σ in Equation 18.24:

S(t|X) = exp[− exp((log(t) − Xβ)/σ)] = exp[−t^(1/σ) exp(−Xβ/σ)].
The t distribution allows for more flexibility by varying the degrees of free-
dom. Figure 18.6 depicts possible hazard functions for the log t distribution
for varying σ and degrees of freedom. However, this distribution does not
have a late increasing hazard phase typical of human survival.
require(rms)
haz ← survreg.auxinfo$t$hazard
times ← c(seq(0, .25, length=100), seq(.26, 2, length=150))
high ← c(6, 1.5, 1.5, 1.75)
low ← c(0, 0, 0, .25)
dfs ← c(1, 2, 3, 5, 7, 15, 500)
cols ← rep(1, 7)
ltys ← 1:7
i ← 0
for (scale in c(.25, .6, 1, 2)) {
  i ← i + 1
  plot(0, 0, xlim=c(0,2), ylim=c(low[i], high[i]),
       xlab=expression(t), ylab=expression(lambda(t)), type="n")
  j ← 0
  for (df in dfs) {
    j ← j + 1
    ## Divide by t to get hazard for log t distribution
    lines(times,
          haz(log(times), 0, c(log(scale), df)) / times,
          col=cols[j], lty=ltys[j])
    if (i == 1) text(1.7, .23 + haz(log(1.7), 0,
                     c(log(scale), df)) / 1.7, format(df))
  }
  title(paste("Scale:", format(scale)))
}   # Figure 18.6
All three of these parametric survival models have median survival time
T0.5 |X = exp(Xβ).
18.3.4 Estimation
[Figure 18.6: four panels of hazard functions λ(t) vs. t, one for each value of the scale parameter (Scale: 0.25, 0.6, 1, 2), with curves for the degrees of freedom listed in the caption]
Fig. 18.6 log(T ) distribution for σ = 0.25, 0.6, 1, 2 and for degrees of freedom
1, 2, 3, 5, 7, 15, 500 (almost log-normal). The top left plot has degrees of freedom writ-
ten in the plot.
For normal and logistic distributions, T̂0.5 |X = exp(X β̂). The MLE of the
effect on log(T) of increasing Xj by d units is β̂j d if Xj is linear and additive.
The delta (statistical differential) method can be used to compute an esti-
mate of the variance of f = [log(t) − X β̂]/σ̂. Let (β̂, δ̂) denote the estimated
parameters, and let V̂ denote the estimated covariance matrix for these pa-
rameter estimates. Let F denote the vector of derivatives of f with respect to
(β0 , β1 , . . . , βp , δ); that is, F = [−1, −X1 , −X2 , . . . , −Xp , −(log(t) − X β̂)]/σ̂.
The variance of f is then approximately
Var(f ) = F V̂ F ′ . (18.40)
440 18 Parametric Survival Models
Letting s be the square root of the variance estimate and z1−α/2 be the normal critical value, a 1 − α confidence interval for S(t|X) is ψ[f ± z1−α/2 s].
18.3.5 Residuals
For a single binary predictor, all assumptions of the AFT model are depicted
in Figure 18.7. That figure also shows the assumptions for any two values of
a single continuous predictor that behaves linearly. For a single continuous
predictor, the relationships in Figure 18.8 must hold for any two follow-up
times. The regression assumptions are isolated in Figure 18.5.
To verify the fit of a log-logistic model with age as the only predictor, one
could stratify by quartiles of age and check for linearity and parallelism of the
four logit SΛ (t) or SKM (t) curves over increasing t as in Figure 18.7, which
stresses the distributional assumption (no T by X interaction and linearity vs.
log(t)). To stress the linear regression assumption while checking for absence
of time interactions (part of the distributional assumptions), one could make
18.3 Accelerated Failure Time Models 441
Fig. 18.7 AFT model with one predictor. Y -axis is ψ −1 [S(t|X)] = (log(t) − Xβ)/σ.
Drawn for d > c. The slope of the lines is σ −1 .
Fig. 18.8 AFT model with one continuous predictor. Y -axis is ψ −1 [S(t|X)] =
(log(t) − Xβ)/σ. Drawn for t2 > t1 . The slope of each line is β1 /σ and the difference
between the lines is log(t2 /t1 )/σ.
a plot like Figure 18.8. For each decile of age, the logit transformation of the
1-, 3-, and 5-year survival estimates for that decile would be plotted against
the mean age in the decile. This checks for linearity and constancy of the
age effect over time. Regression splines will be a more effective method for
checking linearity and determining transformations. This is demonstrated in
Chapter 20 with the Cox model, but identical methods apply here.
As an example, consider data from Kalbfleisch and Prentice [331, pp. 1–2],
who present data from Pike508 on the time from exposure to the carcinogen
DMBA to mortality from vaginal cancer in rats. The rats are divided into
two groups on the basis of a pre-treatment regime. Survival times in days
(with censored times marked + ) are found in Table 18.2.
getHdata(kprats)
kprats$group ← factor(kprats$group, 0:1, c('Group 1','Group 2'))
dd ← datadist(kprats);  options(datadist="dd")
The top left plot in Figure 18.9 displays nonparametric survival estimates for
the two groups, with the number of rats “at risk” at each 30-day mark written
above the x-axis. The remaining three plots are for checking assumptions of
three models. None of the parametric models presented will completely allow
for such a long period with no deaths. Neither will any allow for the early
crossing of survival curves. Log-normal and log-logistic models yield very sim-
ilar results due to the similarity in shapes between Φ(z) and [1 + exp(−z)]−1
for non-extreme z. All three transformations show good parallelism after the
early crossing. The log-logistic and log-normal transformations are slightly
more linear. The fitted models are:
fw ← psm(S ∼ group, data=kprats, dist='weibull')
fl ← psm(S ∼ group, data=kprats, dist='loglogistic', y=TRUE)
fn ← psm(S ∼ group, data=kprats, dist='lognormal')
latex(fw, fi='')
[Figure 18.9 here: four panels. Top left: nonparametric survival estimates for Groups 1 and 2, with the number of rats at risk printed above the x-axis. Remaining panels: transformations of the survival estimates (including logit S(t) for the log-logistic check and the probit transformation for the log-normal check) plotted against log survival time, for assessing the three distributional assumptions.]
Prob{T ≥ t} = exp[−exp((log(t) − Xβ)/0.1832976)], where
  Xβ̂ = 5.450859 + 0.131983[Group 2]

Prob{T ≥ t} = [1 + exp((log(t) − Xβ)/0.1159753)]−1, where
  Xβ̂ = 5.375675 + 0.1051005[Group 2]

Prob{T ≥ t} = 1 − Φ((log(t) − Xβ)/0.2100184), where
  Xβ̂ = 5.375328 + 0.0930606[Group 2]
The Weibull PH form of the fitted extreme value model, using Equation 18.24, is
  Prob{T ≥ t} = exp[−t^(1/σ̂) exp(Xβ̂)],  1/σ̂ = 5.456, where
  Xβ̂ = −29.74 − 0.72[Group 2]

[Figure 18.10 here: fitted and nonparametric survival curves for Groups 1 and 2.]
[Figure 18.11 here: Kaplan–Meier estimates of the distributions of the standardized censored residuals, plotted against the residual value, with the assumed log-logistic distribution overlaid.]
Fig. 18.11 Kaplan–Meier estimates of distribution of standardized censored residuals from the log-logistic model, along with the assumed standard log-logistic distribution (dashed curve). The step function in red is the estimated distribution of all residuals, and the step functions in black are the estimated distributions of residuals stratified by group, as indicated. The blue curve is the assumed log-logistic distribution.
AFT models may be validated for both calibration and discrimination accu-
racy using the same methods that are presented for the Cox model in Sec-
tion 20.11. The methods discussed there for checking calibration are based on
choosing a single follow-up time. Checking the distributional assumptions of
the parametric model is also a check of calibration accuracy in a sense. An-
other indirect calibration assessment may be obtained from a set of Cox–Snell
residuals (Section 18.3.5) or by using ordinary residuals as just described. A
higher resolution indirect calibration assessment based on plotting individual
uncensored failure times is available when the theoretical censoring times for
those observations are known. Let C denote a subject’s censoring time and F
the cumulative distribution of a failure time T . The expected value of F (T |X)
is 0.5 when T is an actual failure time random variable. The expected value
for an event time that is observed because it is uncensored is the expected
value of F (T |T ≤ C, X) = 0.5F (C|X). A smooth plot (using, say, loess) of
F (T |X) − 0.5F (C|X) against X β̂ should be a flat line through y = 0 if the
model is well calibrated. A smooth plot of 2F (T |X)/F (C|X) against X β̂ (or
anything else) should be a flat line through y = 1. This method assumes that
the model is calibrated well enough that we can substitute 1 − Ŝ(C|X) for
F (C|X).
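A rough sketch of this check, assuming a log-normal AFT so that F(t|X) = Φ((log(t) − Xβ̂)/σ̂); the objects lp (linear predictor), sigma, Tobs (uncensored failure times), and Ct (their known censoring times) are illustrative assumptions:

Fhat  ← function(t, lp, sigma) pnorm((log(t) - lp) / sigma)
ratio ← 2 * Fhat(Tobs, lp, sigma) / Fhat(Ct, lp, sigma)
plot(lowess(lp, ratio), type='l')   # should be flat around y = 1
abline(h=1, lty=2)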
18.9 R Functions
Therneau's survreg function (part of his survival package) can fit regression models in the AFT family with left-, right-, or interval-censoring. The time variable can be untransformed or log-transformed (the default). Distributions supported are extreme value (Weibull and exponential), normal, logistic, and Student-t. The version of survreg in rms that fits parametric survival models
in the same framework as lrm, ols, and cph is called psm. psm works with
print, coef, formula, specs, summary, anova, predict, Predict, fastbw, latex,
nomogram, validate, calibrate, survest, and survplot functions for obtaining
and plotting predicted survival probabilities. The dist argument to psm can be
"exponential", "extreme", "gaussian", "logistic", "loglogistic", "lognormal",
"t", or "weibull". To fit a model with no covariables, use the command
psm(Surv(d.time, event) ∼ 1)
Special functions work with objects created by psm to create S functions that
contain the analytic form for predicted survival probabilities (Survival), haz-
ard functions (Hazard), quantiles of survival time (Quantile), and mean or
expected survival time (Mean). Once the S functions are constructed, they can
be used in a variety of contexts. The survplot and survest functions have
a special argument for psm fits: what. The default is what="survival" to esti-
mate or plot survival probabilities. Specifying what="hazard" will plot hazard
functions. Predict also has a special argument for psm fits: time. Specifying a
single value for time results in survival probability for that time being plotted
instead of X β̂. Examples of many of the functions appear below, with the
output of the survplot command shown in Figure 18.12.
med   ← Quantile(fl)
meant ← Mean(fl)
haz   ← Hazard(fl)
surv  ← Survival(fl)
latex(surv, file='', type='Sinput')
        1         2
 216.0857  240.0328
        1         2
0.5612718 0.7599776
[Figure 18.12 here: estimated hazard functions for the two groups from the log-logistic fit, with medians indicated (Group 1 median: 216; Group 2 median: 240).]
Kooperberg et al.’s adaptive linear spline log-hazard model360, 361, 594 has
been implemented in the S function hare. Their procedure searches for second-
order interactions involving predictors (and linear splines of them) and linear
splines in follow-up time (allowing for non-proportional hazards). hare is also
used to estimate calibration curves for parametric survival models (rms func-
tion calibrate) as it is for Cox models.
18.11 Problems
Chapter 19
Parametric Survival Modeling and Model Approximation

Consider the random sample of 1000 patients from the SUPPORT study,352
described in Section 3.12. In this case study we develop a parametric sur-
vival time model (accelerated failure time model) for time until death for the
acute disease subset of SUPPORT (acute respiratory failure, multiple organ
system failure, coma). We eliminate the chronic disease categories because
the shapes of the survival curves are different between acute and chronic dis-
ease categories. To fit both acute and chronic disease classes would require a
log-normal model with σ parameter that is disease-specific.
Patients had to survive until day 3 of the study to qualify. The baseline
physiologic variables were measured during day 3.
First we create a variable acute to flag the categories of interest, and print
univariable descriptive statistics for the data subset.
require(rms)
getHdata(support)   # fetch SUPPORT dataset from web site
acute ← support$dzclass %in% c('ARF/MOSF', 'Coma')   # flag acute categories
describe(support[acute, ])
35 Variables 537 Observations
age : Age
n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
537 0 529 1 60.7 28.49 35.22 47.93 63.67 74.49 81.54 85.56
lowest : 18.04 18.41 19.76 20.30 20.31
highest: 91.62 91.82 91.93 92.74 95.51
dzgroup
n missing unique
537 0 3
ARF/MOSF w/Sepsis (391, 73%), Coma (60, 11%), MOSF w/Malig (86, 16%)
dzclass
n missing unique
537 0 2
ARF/MOSF (477, 89%), Coma (60, 11%)
income
n missing unique
335 202 4
under $11k (158, 47%), $11-$25k (79, 24%), $25-$50k (63, 19%)
>$50k (35, 10%)
race
n missing unique
535 2 5
white black asian other hispanic
Frequency 417 84 4 8 22
% 78 16 1 1 4
sfdm2
n missing unique
468 69 5
no(M2 and SIP pres) (134, 29%), adl>=4 (>=5 if sur) (78, 17%)
SIP>=30 (30, 6%), Coma or Intub (5, 1%), <2 mo. follow-up (221, 47%)
The Hmisc varclus function is used to quantify and depict associations between
predictors, allowing for general nonmonotonic relationships. This is done by
using Hoeffding’s D as a similarity measure for all possible pairs of predictors
instead of the default similarity, Spearman’s ρ.
ac ← support[acute, ]
ac$dzgroup ← ac$dzgroup[drop=TRUE]   # Remove unused levels
label(ac$dzgroup) ← 'Disease Group'
attach(ac)
vc ← varclus(∼ age + sex + dzgroup + num.co + edu + income +
             scoma + race + meanbp + wblc + hrt + resp +
             temp + pafi + alb + bili + crea + sod + ph +
             glucose + bun + urine + adlsc, sim='hoeffding')
plot(vc)   # Figure 19.2
19.2 Checking Adequacy of Log-Normal Model

Let us check whether a parametric survival time model will fit the data, with
respect to the key prognostic factors. First, Kaplan–Meier estimates stratified
by disease group are computed, and plotted after inverse normal transforma-
tion, against log t. Parallelism and linearity indicate goodness of fit to the
log normal distribution for disease group. Then a more stringent assessment
is made by fitting an initial model and computing right-censored residuals.
These residuals, after dividing by σ̂, should all have a normal distribution
if the model holds. We compute Kaplan–Meier estimates of the distribution
of the residuals and overlay the estimated survival distribution with the theoretical Gaussian one. This is done overall, and then to get more stringent
assessments of fit, residuals are stratified by key predictors and plots are
produced that contain multiple Kaplan–Meier curves along with a single the-
oretical normal curve. All curves should hover about the normal distribution.
To gauge the natural variability of stratified residual distribution estimates,
the residuals are also stratified by a random number that has no bearing on
the goodness of fit.
dd ← datadist(ac)        # describe distributions of variables to rms
options(datadist='dd')
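The residual checks just described can be sketched as follows, assuming a preliminary log-normal fit f (the exact model formula here is illustrative, not the one used for the case study):

f ← psm(Surv(d.time, death) ∼ dzgroup + rcs(age, 5) + rcs(meanbp, 5),
        dist='lognormal', data=ac, y=TRUE)
r ← resid(f, type='censored.normalized')
survplot(r)            # all residuals, with the theoretical normal overlaid
survplot(r, dzgroup)   # stratified by a key predictor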
19.2 Checking Adequacy of Log-Normal Model 459
[Figure 19.1 here: variable clustering of missingness indicators; the fraction-missing axis runs to about 0.5, with adlp, urine, glucose, and bun among the most frequently missing.]
Fig. 19.1 Cluster analysis showing which predictors tend to be missing on the same patients

[Figure 19.2 here: hierarchical variable clustering of the predictors, using 30 × Hoeffding D as the similarity measure.]
[Figure 19.3 here: Φ−1(SKM(t)) for the three disease groups plotted against log survival time in years.]
Fig. 19.3 Φ−1(SKM(t)) stratified by dzgroup. Linearity and semi-parallelism indicate a reasonable fit to the log-normal accelerated failure time model with respect to one predictor.

Now remove from consideration predictors that are missing in more than 0.2 of patients. Many of these were collected only for the second half of SUPPORT. Of those variables to be included in the model, find which ones have enough potential predictive power to justify allowing for nonlinear relationships or multiple categories, which spend more d.f. For each variable compute Spearman ρ2 based on multiple linear regression of rank(x), rank(x)2, and the survival time, truncating survival time at the shortest follow-up for survivors (356 days; see Section 4.1).
shortest.follow.up ← min(d.time[death == 0], na.rm=TRUE)
d.timet ← pmin(d.time, shortest.follow.up)
A better approach is to use the complete information in the failure and censor-
ing times by computing Somers’ Dxy rank correlation allowing for censoring.
Remaining missing values are imputed using the “most normal” values, a
procedure found to work adequately for this particular study. Race is imputed
using the modal category.
# Compute number of missing values per variable
sapply(llist(age, num.co, scoma, meanbp, hrt, resp, temp, crea, sod,
             adlsc, wblc, pafi, ph), function(x) sum(is.na(x)))
age num . co scoma meanbp hrt resp temp crea sod adlsc
0 0 0 0 0 0 0 0 0 0
wblc pafi ph
5 37 37
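The variables wblc.i, pafi.i, and ph.i and the collapsed race2 used below are the imputed versions created at this point; a sketch of how they might be constructed with Hmisc's impute (the specific "most normal" constants shown are illustrative assumptions, not values from the source):

wblc.i ← impute(wblc, 9)         # a normal white blood count
pafi.i ← impute(pafi, 333.3)     # a normal PaO2/FiO2 ratio
ph.i   ← impute(ph, 7.4)         # a normal arterial pH
race   ← impute(race, 'white')   # modal category
race2  ← factor(ifelse(race == 'white', 'white', 'other'))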
[Figure 19.4 here: four panels of Kaplan–Meier estimates of the distributions of the normalized right-censored residuals (overall and stratified), each overlaid with the theoretical standard Gaussian distribution; x-axis: residual.]
[Figure 19.5 here: dot chart of adjusted ρ2 for each predictor, with N and d.f. listed in the margin:]

          N  df
scoma   537   2
meanbp  537   2
dzgroup 537   2
crea    537   2
pafi    500   2
ph      500   2
sod     537   2
hrt     537   2
adlsc   537   2
temp    537   2
wblc    532   2
num.co  537   2
age     537   2
resp    537   2
race    535   4

Fig. 19.5 Generalized Spearman ρ2 rank correlation between predictors and truncated survival time
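Such a chart can be produced with Hmisc's spearman2; a sketch (the variable list is assumed from context):

w ← spearman2(d.timet ∼ age + num.co + scoma + adlsc + race + meanbp +
              hrt + resp + temp + sod + wblc + pafi + ph + crea + dzgroup,
              p=2)
plot(w)   # Figure 19.5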
[Figure 19.6 here: dot chart of Somers' Dxy for each predictor, with N listed in the margin: meanbp 537, crea 537, dzgroup 537, scoma 537, pafi 500, ph 500, adlsc 537, age 537, num.co 537, hrt 537, resp 537, race 535, sod 537, wblc 532, temp 537.]
Fig. 19.6 Somers' Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second most frequent category.
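These censoring-aware Dxy estimates can be computed with Hmisc's rcorrcens; a sketch (again, the variable list is assumed):

w ← rcorrcens(Surv(d.time, death) ∼ age + num.co + scoma + adlsc + race +
              meanbp + hrt + resp + temp + sod + wblc + pafi + ph + crea +
              dzgroup)
plot(w)   # Figure 19.6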
Now that missing values have been imputed, a formal multivariable redun-
dancy analysis can be undertaken. The Hmisc package’s redun function goes
farther than the varclus pairwise correlation approach and allows for non-
monotonic transformations in predicting each predictor from all the others.
redun(∼ crea + age + sex + dzgroup + num.co + scoma + adlsc +
      race2 + meanbp + hrt + resp + temp + sod + wblc.i +
      pafi.i + ph.i, nk=4)
Redundancy Analysis

n: 537   p: 16   nk: 4
Number of NAs: 0

No redundant variables
Now turn to a more efficient approach for gauging the potential of each predictor, one that makes maximal use of the failure and censoring times: allow all continuous variables to have a maximum number of knots in a log-normal survival model. This approach must use imputation to have an adequate sample size. A semi-saturated main effects additive log-normal model is fitted. It is necessary to limit restricted cubic splines to 4 knots, to force scoma to be linear, and to omit ph.i in order to avoid a singular covariance matrix in the fit.
k ← 4
f ← psm(S ∼ rcs(age, k) + sex + dzgroup + pol(num.co, 2) + scoma +
        pol(adlsc, 2) + race + rcs(meanbp, k) + rcs(hrt, k) +
Figure 19.7 properly blinds the analyst to the form of effects (tests of lin-
earity). Next fit a log-normal survival model with number of parameters
corresponding to nonlinear effects determined from the partial χ2 tests in
Figure 19.7. For the most promising predictors, five knots can be allocated,
as there are fewer singularity problems once less promising predictors are
simplified.
[Figure 19.7 here: dot chart of partial χ2 minus d.f. for each predictor, ordered from sex (weakest) to dzgroup (strongest); x-axis: χ2 − df, running from 0 to about 30.]
Fig. 19.7 Partial χ2 statistics for association of each predictor with response from saturated main effects model, penalized for d.f.
a ← anova(f)
First let’s plot the shape of the effect of each predictor on log survival time.
All effects are centered so that they can be placed on a common scale. This
allows the relative strength of various predictors to be judged. Then Wald
χ2 statistics, penalized for d.f., are plotted in descending order. Next, rela-
tive effects of varying predictors over reasonable ranges (survival time ratios
varying continuous predictors from the first to the third quartile) are charted.
latex(a, file='', label='tab:support-anovat')   # Table 19.1
Let us decide whether there was significant overfitting during the development
of this model, using the bootstrap.
# First add data to model fit so bootstrap can re-sample from the data
g ← update(f, x=TRUE, y=TRUE)
set.seed(717)
latex(validate(g, B=300), digits=2, size='Ssize')
[Figure 19.8 here: centered effect of each predictor on log survival time, one panel per predictor with pointwise 0.95 confidence bands; recoverable panel statistics include scoma χ2(1) = 10.6, sod χ2(1) = 0.1, temp χ2(1) = 0.4, and wblc.i χ2(2) = 5.5.]
Fig. 19.8 Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor reference values are zero. Pointwise 0.95 confidence bands are also shown. As all y-axes have the same scale, it is easy to see which predictors are strongest.
[Figure 19.9 here: dot chart of Wald χ2 − df for each predictor, ordered from sod (weakest) to dzgroup (strongest).]
Fig. 19.9 Contribution of variables in predicting survival time in log-normal model
[Figure 19.10 here: chart of survival time ratios for the predictor contrasts age 74.5:47.9, num.co 2:1, scoma 37:0, adlsc 3.38:0, meanbp 111:59, hrt 126:75, resp 32:12, temp 38.5:36.4, crea 2.6:0.9, sod 142:134, wblc.i 18.2:8.1, pafi.i 323:142, sex female:male, dzgroup Coma:ARF/MOSF w/Sepsis and MOSF w/Malig:ARF/MOSF w/Sepsis, and race2 other:white.]
Fig. 19.10 Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate different confidence levels (0.9, 0.95, 0.99).
set.seed(717)
cal ← calibrate(g, u=1, B=300)
plot(cal, subtitles=FALSE)
cal ← calibrate(g, cmethod='KM', u=1, m=60, B=120, pr=FALSE)
plot(cal, add=TRUE)   # Figure 19.11
[Figure 19.11 here: calibration of the log-normal model for 1-year survival, showing smooth and Kaplan–Meier group estimates.]

19.5 Approximating the Full Model
The fitted log-normal model is perhaps too complex for routine use and for
routine data collection. Let us develop a simplified model that can predict
the predicted values of the full model with high accuracy (R2 = 0.967). The
simplification is done using a fast backward step-down against the full model
predicted values.
Z ← predict(f)   # X * beta hat
a ← ols(Z ∼ rcs(age, 5) + sex + dzgroup + num.co +
        scoma + pol(adlsc, 2) + race2 +
        rcs(meanbp, 5) + rcs(hrt, 3) + rcs(resp, 3) +
        temp + rcs(crea, 4) + sod + rcs(wblc.i, 3) +
        rcs(pafi.i, 4), sigma=1)
# sigma=1 is used to prevent sigma hat from being zero when
# R2 = 1.0, since we start out by approximating Z with all
# component variables
fastbw(a, aics=10000)   # fast backward stepdown
        Coef     S.E.   Wald Z  P
[1,] -0.5928  0.04315   -13.74  0

None
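The reduced fit f.approx and the covariance matrix v referenced below are constructed from the full fit; a sketch of that computation under the model-approximation idea (the retained-variable formula, the object g for the full fit, and the intercept handling are all assumptions):

f.approx ← ols(Z ∼ dzgroup + rcs(meanbp, 5) + rcs(age, 5) + rcs(crea, 4) +
               rcs(hrt, 3) + scoma + rcs(pafi.i, 4) + pol(adlsc, 2) +
               rcs(resp, 3), x=TRUE)
# b = W beta.full with W = (X'X)^{-1} X' Xfull, so Var(b) = W V W'
X     ← cbind(Intercept=1, f.approx$x)   # reduced design matrix
Xfull ← cbind(Intercept=1, g$x)          # full-model design matrix
W     ← solve(t(X) %*% X, t(X)) %*% Xfull
V     ← vcov(g)[1:ncol(Xfull), 1:ncol(Xfull)]   # drop the log(scale) term
v     ← W %*% V %*% t(W)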
The estimated variances from the reduced model are actually slightly smaller
than those that would have been obtained from stepwise variable selection
in this case, had variable selection used a stopping rule that resulted in the
same set of variables being selected. Now let us compute Wald statistics for
the reduced model.
f.approx$var ← v
latex(anova(f.approx, test='Chisq', ss=FALSE), file='',
      label='tab:support-anovaa')
The results are shown in Table 19.2. Note the similarity of the statistics
to those found in the table for the full model. This would not be the case had
deleted variables been very collinear with retained variables.
The equation for the simplified model follows. The model is also depicted
graphically in Figure 19.12. The nomogram allows one to calculate mean and
median survival time. Survival probabilities could have easily been added as
additional axes.
# Typeset mathematical form of approximate model
latex(f.approx, file='')
Xβ̂ =
−2.51
−1.94[Coma] − 1.75[MOSF w/Malig]
+0.068 meanbp − 3.08×10⁻⁵(meanbp − 41.8)³₊ + 7.9×10⁻⁵(meanbp − 61)³₊ − 4.91×10⁻⁵(meanbp − 73)³₊ + 2.61×10⁻⁶(meanbp − 109)³₊ − 1.7×10⁻⁶(meanbp − 135)³₊
−0.553 crea − 0.229(crea − 0.6)³₊ + 0.45(crea − 1.1)³₊ − 0.233(crea − 1.94)³₊ + 0.0131(crea − 7.32)³₊
−0.0165 age − 1.13×10⁻⁵(age − 28.5)³₊ + 4.05×10⁻⁵(age − 49.5)³₊ − 2.15×10⁻⁵(age − 63.7)³₊ − 2.68×10⁻⁵(age − 72.7)³₊ + 1.9×10⁻⁵(age − 85.6)³₊
−0.0136 hrt + 6.09×10⁻⁷(hrt − 60)³₊ − 1.68×10⁻⁶(hrt − 111)³₊ + 1.07×10⁻⁶(hrt − 140)³₊
−0.0135 scoma
+0.0161 pafi.i − 4.77×10⁻⁷(pafi.i − 88)³₊ + 9.11×10⁻⁷(pafi.i − 167)³₊ − 5.02×10⁻⁷(pafi.i − 276)³₊ + 6.76×10⁻⁸(pafi.i − 426)³₊
−0.369 adlsc + 0.0409 adlsc²
+0.0394 resp − 9.11×10⁻⁵(resp − 10)³₊ + 0.000176(resp − 24)³₊ − 8.5×10⁻⁵(resp − 39)³₊
where (x)₊ = x if x > 0, 0 otherwise.
The generated quantile function for the log-normal model computes quantiles of survival time as exp(Xβ̂ + σ̂ Φ−1(q)); its body (with parms holding log σ̂) is

{
  names(parms) ← NULL
  f ← function(lp, q, parms) lp + exp(parms) * qnorm(q)
  names(q) ← format(q)
  drop(exp(outer(lp, q, FUN=f, parms=parms)))
}
[Figure 19.12 nomogram here, with axes: Points; Disease Group; Mean Arterial BP; Creatinine; Age; Heart Rate; SUPPORT Coma Score; PaO2/(.01*FiO2); ADL; Resp. Rate; Total Points; Linear Predictor.]
Fig. 19.12 Nomogram for predicting median and mean survival time, based on approximation of full model

19.6 Problems
Chapter 20
Cox Proportional Hazards Regression Model

20.1 Model
20.1.1 Preliminaries
The Cox proportional hazards model132 is the most popular model for the
analysis of survival data. It is a semiparametric model; it makes a parametric
assumption concerning the effect of the predictors on the hazard function,
but makes no assumption regarding the nature of the hazard function λ(t)
itself. The Cox PH model assumes that predictors act multiplicatively on the
hazard function but does not assume that the hazard function is constant (i.e.,
exponential model), Weibull, or any other particular form. The regression
portion of the model is fully parametric; that is, the regressors are linearly
related to log hazard or log cumulative hazard. In many situations, either
the form of the true hazard function is unknown or it is complex, so the
Cox model has definite advantages. Also, one is usually more interested in
the effects of the predictors than in the shape of λ(t), and the Cox approach
allows the analyst to essentially ignore λ(t), which is often not of primary
interest.
The Cox PH model uses only the rank ordering of the failure and censoring
times and thus is less affected by outliers in the failure times than fully
parametric methods. The model contains as a special case the popular log-
rank test for comparing survival of two groups. For estimating and testing
regression coefficients, the Cox model is as efficient as parametric models
(e.g., Weibull model with PH) even when all assumptions of the parametric
model are satisfied.171
When a parametric model’s assumptions are not true (e.g., when a Weibull
model is used and the population is not from a Weibull survival distribution
so that the choice of model is incorrect), the Cox analysis is more efficient
than the parametric analysis. As shown below, diagnostics for checking Cox
model assumptions are very well developed.
The Cox PH model is most often stated in terms of the hazard function:
λ(t|X) = λ(t) exp(Xβ).
20.1.3 Estimation of β
Cox and others have shown that this partial log likelihood can be treated as
an ordinary log likelihood to derive valid (partial) MLEs of β. Note that this
log likelihood is unaffected by the addition of a constant to any or all of the
Xs. This is consistent with the fact that an intercept term is unnecessary and
cannot be estimated since the Cox model is a model for the relative hazard
and does not directly estimate the underlying hazard λ(t).
When there are tied failure times in the sample, the true partial log likeli-
hood function involves permutations so it can be time-consuming to compute.
When the number of ties is not large, Breslow70 has derived a satisfactory
approximate log likelihood function. The formula given above, when applied
without modification to samples containing ties, actually uses Breslow’s ap-
proximation. If there are ties so that k < n and t1, . . . , tk denote the unique failure times as we originally intended, Breslow's approximation is written as

log L(β) = Σ{i=1..k} { Si β − di log[ Σ{j: Yj ≥ ti} exp(Xj β) ] },   (20.6)

where Si = Σ{j ∈ Di} Xj, Di is the set of indexes j for subjects failing at time ti, and di is the number of failures at ti.
Efron171 derived another approximation to the true likelihood that is sig-
nificantly more accurate than the Breslow approximation and often yields
estimates that are very close to those from the more cumbersome permuta-
tion likelihood:288
log L(β) = Σ{i=1..k} { Si β − Σ{j=1..di} log[ Σ{l: Yl ≥ ti} exp(Xl β) − ((j−1)/di) Σ{l ∈ Di} exp(Xl β) ] }.   (20.7)
In the special case when all tied failure times are from subjects with iden-
tical Xi β, the Efron approximation yields the exact (permutation) marginal
likelihood (Therneau, personal communication, 1993).
Kalbfleisch and Prentice330 showed that Cox’s partial likelihood, in the
absence of predictors that are functions of time, is a marginal distribution of
the ranks of the failure/censoring times.
See Therneau and Grambsch604 and Huang and Harrington310 for descrip-
tions of penalized partial likelihood estimation methods for improving mean
squared error of estimates of β in a similar fashion to what was discussed in
Section 9.10.
The Cox PH regression model has the same assumptions as the parametric
PH model except that no assumption is made regarding the shape of the
underlying hazard or survival functions λ(t) and S(t). The Cox PH model
assumes, in its most basic form, linearity and additivity of the predictors
with respect to log hazard or log cumulative hazard. It also assumes the PH
assumption of no time by predictor interactions; that is, the predictors have
the same effect on the hazard function at all values of t. The relative hazard
function exp(Xβ) is constant through time and the survival functions for
subjects with different values of X are powers of each other. If, for example,
the hazard of death at time t for treated patients is half that of control
patients at time t, this same hazard ratio is in effect at any other time point.
In other words, treated patients have a consistently lower hazard of death over all follow-up time.
The regression parameters are interpreted the same as in the parametric
PH model. The only difference is the absence of hazard shape parameters
in the model, since the hazard shape is not estimated in the Cox partial
likelihood procedure.
20.1.5 Example
Consider again the rat vaginal cancer data from Section 18.3.6. Figure 20.1
displays the nonparametric survival estimates for the two groups along with
estimates derived from the Cox model (by a method discussed later).
require ( rms )
group ← c(rep('Group 1', 19), rep('Group 2', 21))
group ← factor(group)
dd ← datadist(group);  options(datadist='dd')
days ← c(143,164,188,188,190,192,206,209,213,216,220,227,230,
         234,246,265,304,216,244,142,156,163,198,205,232,232,
         233,233,233,233,239,240,261,280,280,296,296,323,204,344)
death ← rep(1, 40)
death[c(18,19,39,40)] ← 0
units(days) ← 'Day'
df ← data.frame(days, death, group)
S ← Surv(days, death)
co ← gray(c(0, .8))
# f and g (their definitions were not shown) are assumed to be the
# nonparametric and Cox fits, e.g.
#   f ← npsurv(S ∼ group)
#   g ← cph(S ∼ group, surv=TRUE)
survplot(f, lty=c(1, 1), lwd=c(1, 3), col=co,
         label.curves=FALSE, conf='none')
survplot(g, lty=c(3, 3), lwd=c(1, 3), col=co,   # Efron approx.
         add=TRUE, label.curves=FALSE, conf.type='none')
legend(c(2, 160), c(.38, .54),
       c('Nonparametric Estimates', 'Cox-Breslow Estimates'),
       lty=c(1, 3), cex=.8, bty='n')
legend(c(2, 160), c(.18, .34), cex=.8,
       c('Group 1', 'Group 2'), lwd=c(1, 3), col=co, bty='n')
The predicted survival curves from the fitted Cox model are in good agree-
ment with the nonparametric estimates, again verifying the PH assumption
for these data. The estimates of the group effect from a Cox model (using the
exact likelihood since there are ties, along with both Efron’s and Breslow’s
approximations) as well as from a Weibull model and an exponential model
are shown in Table 20.1. The exponential model, with its constant hazard,
cannot accommodate the long early period with no failures. The group pre-
dictor was coded as X1 = 0 and X1 = 1 for Groups 1 and 2, respectively. For
this example, the Breslow likelihood approximation resulted in β̂ closer to
that from maximizing the exact likelihood. Note how the group effect (47%
reduction in hazard of death by the exact Cox model) is underestimated by
the exponential model (9% reduction in hazard). The hazard ratio from the
Weibull fit agrees with the Cox fit.
[Figure 20.1 here: nonparametric (Kaplan–Meier) and Cox–Breslow survival estimates for Groups 1 and 2.]
Table 20.1 Group effects using three versions of the partial likelihood and three
parametric models
Designs are no different for the Cox PH model than for other models except
for one minor distinction. Since the Cox model does not have an intercept
parameter, the group omitted from X in an ANOVA model will go into the
underlying hazard function. As an example, consider a three-group model for
treatments A, B, and C. We use the two dummy variables
A unique feature of the Cox PH model is its ability to adjust for factors that
are not modeled. Such factors usually take the form of polytomous stratifi-
cation factors that either are too difficult to model or do not satisfy the PH
assumption. For example, a subject’s occupation or clinical study site may
take on dozens of levels and the sample size may not be large enough to
model this nominal variable with dozens of dummy variables. Also, one may
know that a certain predictor (either a polytomous one or a continuous one
that is grouped) may not satisfy PH and it may be too complex to model the
hazard ratio for that predictor as a function of time.
The idea behind the stratified Cox PH model is to allow the form of the
underlying hazard function to vary across levels of the stratification factors.
A stratified Cox analysis ranks the failure times separately within strata.
Suppose that there are b strata indexed by j = 1, 2, . . . , b. Let C denote the
stratum identification. For example, C = 1 or 2 may stand for the female and
male strata, respectively. The stratified PH model is
λ(t|X, C = j) = λj(t) exp(Xβ).
Here λj (t) and Sj (t) are, respectively, the underlying hazard and survival
functions for the jth stratum. The model does not assume any connection
between the shapes of these functions for different strata.
In this stratified analysis, the data are stratified by C but, by default, a
common vector of regression coefficients is fitted across strata. These common
regression coefficients can be thought of as “pooled” estimates. For example,
a Cox model with age as a (modeled) predictor and sex as a stratification
variable essentially estimates the common slope of age by pooling information
about the age effect over the two sexes. The effect of age is adjusted by sex
differences, but no assumption is made about how sex affects survival. There
is no PH assumption for sex. Levels of the stratification factor C can represent
multiple stratification factors that are cross-classified. Since these factors are
not modeled, no assumption is made regarding interactions among them.
S(ti) = Π{j=0..i−1} αj,  i = 1, 2, . . . , k  (α0 = 1),   (20.11)

and the estimate of the probability of survival past time t for a subject with predictor values X is

Ŝ(t|X) = Ŝ(t)^exp(Xβ̂).   (20.14)
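In rms this estimate is available directly from a cph fit stored with surv=TRUE; a minimal sketch (data and variable names are illustrative):

f ← cph(Surv(d.time, death) ∼ age + sex, surv=TRUE, x=TRUE, y=TRUE)
survest(f, data.frame(age=50, sex='Male'), times=2)   # Ŝ(2 | X)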
When the model is stratified, estimation of the αj and S is carried out sep-
arately within each stratum once β̂ is obtained by pooling over strata. The
stratified survival function estimates can be thought of as stratified Kaplan–
Meier estimates adjusted for X, with the adjustment made by assuming PH
and linearity. As mentioned previously, these stratified adjusted survival es-
timates are useful for checking model assumptions and for providing a simple
way to incorporate factors that violate PH.
The stratified estimates are also useful in themselves as descriptive statis-
tics without making assumptions about a major factor. For example, in a
study from Califf et al.88 to compare medical therapy with coronary artery
bypass grafting (CABG), the model was stratified by treatment but adjusted
for a variety of baseline characteristics by modeling. These adjusted survival
estimates do not assume a form for the effect of surgery. Figure 20.2 displays
unadjusted (Kaplan–Meier) and adjusted survival curves, with baseline pre-
dictors adjusted to their mean levels in the combined sample. Notice that
valid adjusted survival estimates are obtained even though the curves cross
(i.e., PH is violated for the treatment variable). These curves are essentially
product limit estimates with respect to treatment and Cox PH estimates with
respect to the baseline descriptor variables.
The Kalbfleisch–Prentice discrete underlying hazard model estimates of
the αj are one minus estimates of the hazard function at the discrete failure
times. However, these estimated hazard functions are usually too “noisy” to
be useful unless the sample size is very large or the failure times have been
grouped (say by rounding).
[Figure 20.2 here: unadjusted (Kaplan–Meier) and covariable-adjusted survival curves for surgical and medical treatment over years of follow-up; panels labeled Unadjusted and Adjusted.]
More asymptotic theory has been derived for the Breslow estimator than for the Kalbfleisch–Prentice estimator. Another advantage of the Breslow estimator is that it does not require iterative computations for di > 1. Lawless [382, p. 362] states that the two survival function estimators differ little except in the right-hand tail when all di s are unity. Like the Kalbfleisch–Prentice estimator, the Breslow estimator is invariant under different choices of "standard subjects" for the underlying survival S(t).
Somewhat complex formulas are available for computing confidence limits of Ŝ(t|X).615
One way of estimating the minimum sample size for a Cox model analy-
sis aimed at estimating survival probabilities is to consider the simplest case
where there are no covariates. Thus the problem reduces to using the Kaplan-
Meier estimate to estimate S(t). Let’s further simplify things to assume there
is no censoring. Then the Kaplan-Meier estimate is just one minus the em-
pirical cumulative distribution function. By the Dvoretzky–Kiefer–Wolfowitz inequality, the maximum absolute error in an empirical distribution function estimate of the true continuous distribution function is less than or equal to ε with probability of at least 1 − 2 exp(−2nε²). For this probability to be at least 0.95 when ε = 0.1, n = 184 (since 2 exp(−2 × 184 × 0.01) ≈ 0.05). Thus in the case of no censoring, one needs 184 subjects to estimate the survival curve to within a margin of error of 0.1 everywhere.
To estimate the subject-specific survival curves (S(t|X)) will require greater
sample sizes, as will having censored data. It is a fair approximation to think
of 184 as the needed number of subjects suffering the event or being censored
“late.”
Turning to estimation of a hazard ratio for a single binary predictor X that has equal numbers of X = 0 and X = 1, if the total sample size is n and the numbers of events in the two categories are respectively e0 and e1, the variance of the log hazard ratio is approximately v = 1/e0 + 1/e1. Letting z denote the 1 − α/2 standard normal critical value, the multiplicative margin of error (MMOE) with confidence 1 − α is given by exp(z√v). To achieve a MMOE of 1.2 in estimating e^β̂ with equal numbers of events in the two groups and α = 0.05 requires a total of 462 events.
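A two-line check of this calculation (the function name mmoe is illustrative):

mmoe ← function(e0, e1, alpha=0.05)
  exp(qnorm(1 - alpha/2) * sqrt(1/e0 + 1/e1))
mmoe(231, 231)   # a total of 462 events gives MMOE ≈ 1.2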
Wald, score, and likelihood ratio statistics are useful and valid for drawing
inferences about β in the Cox model. The score test deserves special mention
here. If there is a single binary predictor in the model that describes two
groups, the score test for assessing the importance of the binary predictor
is virtually identical to the Mantel–Haenszel log-rank test for comparing the
two groups. If the analysis is stratified for other (nonmodeled) factors, the
score test from a stratified Cox model is equivalent to the corresponding
stratified log-rank test. Of course, the likelihood ratio or Wald tests could
also be used in this situation, and in fact the likelihood ratio test may be
better than the score test (i.e., type I errors by treating the likelihood ratio
test statistic as having a χ2 distribution may be more accurate than using
the log-rank statistic).
The Cox model can be thought of as a generalization of the log-rank pro-
cedure since it allows one to test continuous predictors, perform simultaneous
tests of various predictors, and adjust for other continuous factors without
grouping them. Although a stratified log-rank test does not make assump-
tions regarding the effect of the adjustment (stratifying) factors, it makes the
same assumption (i.e., PH) as the Cox model regarding the treatment effect
for the statistical test of no difference in survival between groups.
20.5 Residuals
Therneau et al.605 discussed four types of residuals from the Cox model: martingale, score, Schoenfeld, and deviance. The first three have been proven to be very useful, as indicated in Table 20.2.

Table 20.2
Residual     Purposes
Martingale   Assessing adequacy of a hypothesized predictor transformation;
             graphing an estimate of a predictor transformation (Section 20.6.1)
Score        Detecting overly influential observations (Section 20.9);
             robust estimate of covariance matrix of β̂ (Section 9.5)410
Schoenfeld   Testing PH assumption (Section 20.6.2);
             graphing estimate of hazard ratio function (Section 20.6.2)
20.6 Assessment of Model Fit

As stated before, the Cox model makes the same assumptions as the para-
metric PH model except that it does not assume a given shape for λ(t) or
S(t). Because the Cox PH model is so widely used, methods of assessing its fit
are dealt with in more detail than was done with the parametric PH models.
    e
    0     1
 1611   389

[Figure 20.3 here: log[−log Ŝ(3)] plotted against age, by sex.]
Fig. 20.3 Kaplan–Meier log Λ estimates by sex and deciles of age, with 0.95 confidence limits. Solid line is for males, dashed line for females.
As with the logistic model and other regression models, the restricted cubic
spline function is an excellent tool for modeling the regression relationship
with very few assumptions. A four-knot spline Cox PH model in two variables
(X1, X2) that assumes linearity in X1 and no X1 × X2 interaction is given by
λ(t|X) = λ(t) exp(β1X1 + β2X2 + β3X2′ + β4X2″),
where X2′ and X2″ are spline component variables as described earlier and f(X2) = β2X2 + β3X2′ + β4X2″ is the spline function or spline transformation of X2.
χ2 d.f. P
age 72.33 3 < 0.0001
Nonlinear 0.69 2 0.7067
TOTAL 72.33 3 < 0.0001
[Figure 20.4 here: spline-estimated log[−log Ŝ(3)] vs. age, stratified on sex.]
Fig. 20.4 Cox PH model stratified on sex, using spline function for age, no interaction. 0.95 confidence limits also shown. Solid line is for males, dashed line is for females.
This model allows the relationship between X2 and log hazard to be a smooth
nonlinear function and the shape of the X2 effect to be completely different
for each level of X1 if X1 is dichotomous. Figure 20.5 displays a fit of this
model at t = 3 years for the simulated dataset.
f.ia ← cph(Srv ∼ rcs(age, 4) * strat(sex), x=TRUE, y=TRUE, surv=TRUE)
w ← latex(f.ia, inline=TRUE, digits=3)
latex(anova(f.ia), table.env=FALSE, file='')
χ2 d.f. P
age (Factor+Higher Order Factors) 72.82 6 < 0.0001
All Interactions 1.05 3 0.7886
Nonlinear (Factor+Higher Order Factors) 1.80 4 0.7728
age × sex (Factor+Higher Order Factors) 1.05 3 0.7886
Nonlinear 1.05 2 0.5911
Nonlinear Interaction : f(A,B) vs. AB 1.05 2 0.5911
TOTAL NONLINEAR 1.80 4 0.7728
TOTAL NONLINEAR + INTERACTION 1.80 5 0.8763
TOTAL 72.82 6 < 0.0001
[Figure 20.5 here: spline-estimated log[−log Ŝ(3)] vs. age by sex, with age × sex interaction.]
Fig. 20.5 Cox PH model stratified on sex, with interaction between age spline and sex. 0.95 confidence limits are also shown. Solid line is for males, dashed line for females.
[Figure 20.6 here: restricted cubic spline estimate of relative log hazard vs. LVEF; inset statistics: Cox Regression Model, n = 979, events = 198; Model L.R. χ2 = 129.92, 2 d.f., AIC = 125.92; Association Wald χ2 = 157.45, 2 d.f., p = 0.000; Linearity Wald χ2 = 9.59, 1 d.f., p = 0.002.]
Fig. 20.6 Restricted cubic spline estimate of relationship between LVEF and relative log hazard from a sample of 979 patients and 198 cardiovascular deaths. Data from the Duke Cardiovascular Disease Databank.
Other methods based on smoothed residual plots are also valuable tools
for selecting predictor transformations. Therneau et al.605 describe residuals
based on martingale theory that can estimate transformations of any number
of predictors omitted from a Cox model fit, after adjusting for other vari-
ables included in the fit. Figure 20.7 used various smoothing methods on the
points (LVEF, residual). First, the R loess function96 was used to obtain a
smoothed scatterplot fit and approximate 0.95 confidence bars. Second, an
ordinary least squares model, representing LVEF as a restricted cubic spline
with five default knots, was fitted. Ideally, both fits should have used weighted
regression as the residuals do not have equal variance. Predicted values from
this fit along with 0.95 confidence limits are shown. The loess and spline regression fits agree extremely well. Third, Cleveland's lowess scatterplot
smoother111 was used on the martingale residuals against LVEF. The sug-
gested transformation from all three is very similar to that of Figure 20.6. For
smaller sample sizes, the raw residuals should also be displayed. There is one
vector of martingale residuals that is plotted against all of the predictors.
When correlations among predictors are mild, plots of estimated predictor
transformations without adjustment for other predictors (i.e., marginal trans-
formations) may be useful. Martingale residuals may be obtained quickly by
fixing β̂ = 0 for all predictors. Then smoothed plots of predictor against
residual may be made for all predictors.

[Figure 20.7 here: martingale residuals vs. LVEF, with loess smooth and 0.95 confidence bars, an ols five-knot spline fit with 0.95 confidence limits, and the lowess smoother.]

Table 20.3 summarizes some of the ways martingale residuals may be used. See Section 10.5 for more information
on checking the regression assumptions. The methods for examining interac-
tion surfaces described there apply without modification to the Cox model
(except that the nonparametric regression surface does not apply because of
censoring).
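A sketch of the quick marginal version of this check, using a null Cox model so that β̂ = 0 (variable names are illustrative):

require(survival)
f ← coxph(Surv(d.time, death) ∼ 1)   # no predictors: beta fixed at 0
r ← resid(f, type='martingale')
plot(lvef, r)
lines(lowess(lvef, r))   # the smooth suggests a transformation for lvef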
Even though assessment of fit of the regression part of the Cox PH model
corresponds with other regression models such as the logistic model, the Cox
model has its own distributional assumption in need of validation. Here, of
course, the distributional assumption is not as stringent as with other survival
Table 20.4 Interval-specific group effects from rat data by artificial censoring

Table 20.5 Interval-specific effects of squamous cell cancer in VA lung cancer data

Time Interval  Observations  Deaths  Log Hazard Ratio  Standard Error
[0, 21)                 110      26             −0.46            0.47
[21, 52)                 84      26             −0.90            0.50
[52, 118)                59      26             −1.35            0.50
118+                     28      26             −1.04            0.45
getHdata(valung)
with(valung, {
  hazard.ratio.plot(1*(cell == 'Squamous'), Surv(t, dead),
                    e=25, subset=cell != 'Large',
                    pr=TRUE, pl=FALSE)
  hazard.ratio.plot(1*kps, Surv(t, dead), e=25,
                    pr=TRUE, pl=FALSE) })
[Figure 20.9 here: subset estimates with 0.95 C.L. and smoothed log hazard ratio plotted over time t; predictor: Pain/Ischemia Index; event: cdeathmi.]
Fig. 20.9 Stratified hazard ratios for pain/ischemia index over time. Data from the Duke Cardiovascular Disease Databank.
Smoothed plots of the scaled Schoenfeld residuals

β̂ + dRV̂,   (20.23)

where R is the matrix of Schoenfeld residuals and d the number of events, estimate how the regression coefficients vary with time.

[Figure 20.10 here: smoothed weighted Schoenfeld residuals plotted against t.]
Fig. 20.10 Smoothed weighted233 Schoenfeld557 residuals for the same data in Figure 20.9. Test for PH based on the correlation (ρ) between the individual weighted Schoenfeld residuals and the rank of failure time yielded ρ = −0.23, z = −6.73, P = 2 × 10−11.
Table 20.7 Time-specific hazard ratio estimates of squamous cell cancer effect in VA lung cancer data, by fitting two Weibull distributions with unequal shape parameters

    t    log Hazard Ratio
   10               −0.36
   36               −0.64
 83.5               −0.83
  200               −1.02
• Categorical X: check parallelism of stratified log[−log S(t)] plots as t ↑
• Muenz467 cumulative hazard ratio plots
• Arjas29 cumulative hazard plots

Interaction Between X and T (proportional hazards: the effect of X does not depend on T; e.g., treatment effect is constant over time):
• Check agreement of stratified and modeled estimates
• Hazard ratio plots
• Smoothed Schoenfeld residual plots and correlation test (time vs. residual)
• Test a time-dependent covariable such as X × log(t + 1)
• Ratio of parametrically estimated λ(t)
Table 20.9 Comparison of methods for checking the proportional hazards assumption and for allowing for non-proportional hazards. Columns: requires grouping X; requires grouping t; computational efficiency; yields formal test; yields estimate of λ2(t)/λ1(t); requires fitting 2 models; must choose smoothing parameter. (The assignment of the x and ? marks to individual columns could not be recovered; the marks are listed in row order.)

Method                                   Marks
log[−log], Muenz, Arjas plots            x x x
Dabrowska log Λ̂ difference plots         x x x x
Stratified vs. modeled estimates         x x x
Hazard ratio plot                        x ? x x ?
Schoenfeld residual plot                 x x x
Schoenfeld residual correlation test     x x
Fit time-dependent covariables           x x
Ratio of parametric estimates of λ(t)    x x x x x
Parametric models that assume an effect other than PH, for example, the log-logistic model,226 can be used to allow a predictor to have a constantly increasing or decreasing effect over time. If one predictor satisfies PH but another does not, this approach will not work.
20.8 Collinearity
See Section 4.6 for the general approach using variance inflation factors.
Therneau et al.605 describe the use of score residuals for assessing influence in Cox and related regression models. They show that the infinitesimal jackknife estimate of the influence of observation i on β equals V s′, where V is the estimated variance–covariance matrix of the p regression estimates b and s = (si1, si2, . . . , sip) is the vector of score residuals for the p regression coefficients for the ith observation. Let Sn×p denote the matrix of score residuals over all observations. Then an approximation to the unstandardized change in b (DFBETA) is SV. Standardizing by the standard errors of b found from the diagonals of V, e = (V11, V22, . . . , Vpp)1/2, yields DFBETAS, the matrix SV with each column divided by the corresponding element of e.
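These quantities are available directly from the survival package; a minimal sketch (variable names are illustrative):

require(survival)
f    ← coxph(Surv(d.time, death) ∼ age + sex)
infl ← resid(f, type='dfbetas')    # n × p standardized changes in b
apply(abs(infl), 2, which.max)     # most influential observation per coefficient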
Schemper546 and Korn and Simon365 have reported that R2 is too sen-
sitive to the distribution of censoring times and have suggested alterna-
tives based on the distance between estimated Cox survival probabilities
(using predictors) and Kaplan–Meier estimates (ignoring predictors). Kent
and O’Quigley345 also report problems with R2 and suggest a more complex
measure. Schemper548 investigated the Maddala–Magee431, 432 index R²LR described in Section 9.8.3, applied to Cox regression:

R²LR = 1 − exp(−LR/n) = 1 − ω^(2/n),   (20.31)

where ω is the null model likelihood divided by the fitted model likelihood. For many situations, R²LR performed as well as Schemper's more complex measure546, 549 and hence it is preferred because of its ease of calculation (assuming that PH holds). Ironically, Schemper548 demonstrated that the n in the formula for this index is the total number of observations, not the number of events (but see O'Quigley, Xu, and Stare481). To make the R² index have a maximum value of 1.0, we use the Nagelkerke471 R²N discussed in Section 9.8.3.
An easily interpretable index of discrimination for survival models is de-
rived from Kendall’s τ and Somers’ Dxy rank correlation,579 the Gehan–
Wilcoxon statistic for comparing two samples for survival differences, and
the Brown–Hollander–Korwar nonparametric test of association for censored
data.76, 170, 262, 268 This index, c, is a generalization of the area under the ROC
curve discussed under the logistic model, in that it applies to a continuous
response variable that can be censored. The c index is the proportion of all
pairs of subjects whose survival time can be ordered such that the subject
with the higher predicted survival is the one who survived longer. Two sub-
jects’ survival times cannot be ordered if both subjects are censored or if one
has failed and the follow-up time of the other is less than the failure time
of the first. The c index is a probability of concordance between predicted
and observed survival, with c = 0.5 for random predictions and c = 1 for a
perfectly discriminating model. The c index is mildly affected by the amount
of censoring. Dxy is obtained from 2(c − 0.5). While c (and Dxy) is a good measure of pure discrimination ability of a single model, it is not sensitive enough to allow multiple models to be compared.447
Since high hazard means short survival time, when the linear predictor
X β̂ from a Cox model is compared with observed survival time, Dxy will be
negative. Some analysts may want to negate reported values of Dxy .
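A sketch of computing c and Dxy with Hmisc's rcorr.cens, negating the linear predictor so that larger values correspond to longer survival (object names are illustrative):

require(Hmisc)
w ← rcorr.cens(-f$linear.predictors, Surv(d.time, death))
w[c('C Index', 'Dxy')]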
uniform variable that is independent of all the predictors, and censor half of the failure times at random. Due to fitting 20 predictors to 100 events, there will apparently be fair agreement between predicted and observed survival over all strata (smooth black curve from hazard regression in Figure 20.11). However, the bias-corrected calibration (blue curve from hazard regression) gives a more truthful answer: examining the Xs across levels of predicted survival demonstrates that predicted and observed survival are weakly related, in more agreement with how the data were generated. For the more arbitrary Kaplan-Meier approach, we divide the observations into quintiles of predicted 0.5-year survival, so that there are 40 observations per stratum.
n ← 200
p ← 20
set.seed(6)
xx ← matrix(rnorm(n*p), nrow=n, ncol=p)
y ← runif(n)
units(y) ← "Year"
e ← c(rep(0, n/2), rep(1, n/2))
f ← cph(Surv(y, e) ∼ xx, x=TRUE, y=TRUE,
        time.inc=.5, surv=TRUE)
cal ← calibrate(f, u=.5, B=200)
Here bootstrapping and cross-validation are used as for logistic models (Section 10.9). We can obtain bootstrap bias-corrected estimates of c or equivalently Dxy. To instead obtain a measure of relative calibration or slope shrinkage, we can bootstrap the apparent estimate of γ = 1 in the model
λ(t|X) = λ(t) exp(γXβ̂).
[Figure 20.11 here: apparent and bias-corrected calibration curves for 0.5-year survival from the overfitted 20-predictor model, with Kaplan–Meier quintile estimates.]
U = [LR(γ̂Xb) − LR(Xb)] / L0,   (20.33)
where L0 is the −2 log likelihood for the null model (Section 9.8.3). Similarly,
a discrimination index D267 can be derived from the −2 log likelihood at the
shrunken linear predictor, penalized for estimating one parameter (γ) (see
also [633, p. 1318] and [123]):
D = [LR(γ̂Xb) − 1] / L0.   (20.34)
D is the same as R2 discussed above when p = 1 (indicating only one reesti-
mated parameter, γ), the penalized proportion of explainable log likelihood
that was explained by the model. Because of the remark of Schemper,546 all
of these indexes may unfortunately be functions of the censoring pattern.
An index of overall quality that penalizes discrimination for unreliability is
Q = D − U = [LR(Xb) − 1] / L0.   (20.35)
It can be seen that the apparent correlation (Dxy = −0.21) does not hold
up after correcting for overfitting (Dxy = −0.02). Also, the slope shrinkage
(0.39) indicates extreme overfitting.
See [633, Section 6] and [640] and Section 18.3.7 for still more useful meth-
ods for validating the Cox model.
20.12 Describing the Fitted Model

As with logistic modeling, once a Cox PH model has been fitted and all
its assumptions verified, the final model needs to be presented and inter-
preted. The fastest way to describe the model is to interpret each effect in
it. For each predictor the change in log hazard per desired units of change
in the predictor value may be computed, or the antilog of this quantity,
exp(βj × change in Xj ), may be used to estimate the hazard ratio holding
all other factors constant. When Xj is a nonlinear factor, changes in predicted
Xβ for sensible values of Xj such as quartiles can be used as described in
Section 10.10. Of course for nonmodeled stratification factors, this method is
of no help. Figure 20.12 depicts a way to display estimated surgical : medical
hazard ratios in the presence of a significant treatment by disease severity
interaction and a secular trend in the benefit of surgical therapy (treatment
by year of diagnosis interaction).
Often, the use of predicted survival probabilities may make the model
more interpretable. If the effect of only one factor is being displayed and
[Figure 20.12 here: point estimates and 0.95 confidence bars of surgical:medical hazard ratios for 1-, 2-, and 3-vessel disease and 75% and 95% left main disease, shown separately for 1970, 1977, and 1984.]
Fig. 20.12 A display of an interaction between treatment and extent of disease, and between treatment and calendar year of start of treatment. Comparison of medical and surgical average hazard ratios for patients treated in 1970, 1977, and 1984 according to coronary disease severity. Circles represent point estimates; bars represent 0.95 confidence limits of hazard ratios. Ratios less than 1.0 indicate that coronary bypass surgery is more effective.88
that factor is polytomous or predictions are made for specific levels, survival
curves (with or without adjustment for other factors not shown) can be drawn
for each level of the predictor of interest, with follow-up time on the x-axis.
Figure 20.2 demonstrated this for a factor which was a stratification factor.
Figure 20.13 extends this by displaying survival estimates stratified by treat-
ment but adjusted to various levels of two modeled factors, one of which, year
of diagnosis, interacted with treatment.
When a continuous predictor is of interest, it is usually more informative
to display that factor on the x-axis with estimated survival at one or more
time points on the y-axis. When the model contains only one predictor, even
if that predictor is represented by multiple terms such as a spline expansion,
one may simply plot that factor against the predicted survival. Figure 20.14
depicts the relationship between treadmill exercise score, which is a weighted
linear combination of several predictors in a Cox model, and the probability
of surviving five years.
When displaying the effect of a single factor after adjusting for multiple
predictors which are not displayed, care only need be taken for the values
to which the predictors are adjusted (e.g., grand means). When instead the
desire is to display the effect of multiple predictors simultaneously, an im-
portant continuous predictor can be displayed on the x-axis while separate
20.12 Describing the Fitted Model 511
curves or graphs are made for levels of other factors. Figure 20.15, which
corresponds to the log Λ plots in Figure 20.5, displays the joint effects of age
and sex on the three-year survival probability. Age is modeled with a cubic
spline function, and the model includes terms for an age × sex interaction.
p ← Predict(f.ia, age, sex, time=3)
ggplot(p)
[Figure 20.13 here: survival curves stratified by treatment (surgical, medical), adjusted to LVEF = 0.4 and LVEF = 0.6 and to levels of year of diagnosis, plotted over years of follow-up.]
[Figure 20.14 here: five-year survival probability vs. treadmill score.]
Fig. 20.14 Cox model predictions with respect to a continuous variable. X-axis shows the range of the treadmill score seen in clinical practice and Y-axis shows the corresponding five-year survival probability predicted by the Cox regression model for the 2842 study patients.440
[Figure 20.15 here: three-year survival probability vs. age, by sex.]
Fig. 20.15 Survival estimates for model stratified on sex, with interaction.
so that the desired survival curve could be read directly, at least to the nearest tabulated Xβ̂. For estimating survival at a fixed time, say two years, one need only provide the constant Ŝ(t). The nomogram could even be adapted to include a nonlinear scale Ŝ(2)^exp(Xβ̂) to allow direct computation of two-year survival.
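A sketch of adding such a scale with rms, assuming a cph fit f stored with surv=TRUE (names are illustrative):

surv  ← Survival(f)               # analytic survival function from the fit
surv2 ← function(lp) surv(2, lp)  # two-year survival as a function of Xβ̂
plot(nomogram(f, fun=surv2, funlabel='2-Year Survival Probability'))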
20.13 R Functions
Harrell’s cpower, spower, and ciapower (in the Hmisc package) perform power
calculations for Cox tests in follow-up studies. cpower computes power for
a two-sample Cox (log-rank) test with random patient entry over a fixed
duration and a given length of minimum follow-up. The expected number of
events in each group is estimated by assuming exponential survival. cpower
uses a slight modification of the method of Schoenfeld558 (see [501]). Separate
specification of noncompliance in the active treatment arm and “drop-in” from
the control arm into the active arm is allowed, using the method of Lachin
and Foulkes.370 The ciapower function computes power of the Cox interaction
test in a 2 × 2 setup using the method of Peterson and George.501 It does
not take noncompliance into account. The spower function simulates power
for two-sample tests (the log-rank test by default) allowing for very complex
conditions such as continuously varying treatment effect and noncompliance
probabilities.
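As an illustration, the following hedged example (all design values hypothetical) computes the power of a two-sample log-rank test with a 5-year control-arm mortality of 0.30, a 25% relative reduction from the intervention, 1.5 years of accrual, and 5 years of minimum follow-up:

require(Hmisc)
# noncomp.c, noncomp.i: percent drop-in from control and noncompliance
# in the intervention arm (illustrative values)
cpower(tref=5, n=1000, mc=0.30, r=25, accrual=1.5, tmin=5,
       noncomp.c=10, noncomp.i=15)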
The rms package's cph function is a slight modification of the coxph func-
tion written by Terry Therneau (in his survival package) to work in the rms
framework. cph computes MLEs of Cox and stratified Cox PH models, overall
score and likelihood ratio χ2 statistics for the model, martingale residuals, the
linear predictor (X β̂ centered to have mean 0), and collinearity diagnostics.
Efron, Breslow, and exact partial likelihoods are supported (although the
exact likelihood is very computationally intensive if ties are frequent). The
function also fits the Andersen–Gill23 generalization of the Cox PH model.
This model allows for predictor values to change over time in the form of step
functions as well as allowing time-dependent stratification (subjects can jump
to different hazard function shapes). The Andersen–Gill formulation allows
multiple events per subject and permits subjects to move in and out of risk at
any desired time points. The latter feature allows time zero to have a more
general definition. (See Section 9.5 for methods of adjusting the variance–
covariance matrix of β̂ for dependence in the events per subject.) The print-
ing function corresponding to cph prints the Nagelkerke index R²N described
in Section 20.10, and has a latex option for better output. cph works in con-
junction with the generic functions such as specs, predict, summary, anova,
fastbw, which.influence, latex, residuals, coef, nomogram, and Predict de-
scribed in Section 20.13, the same as the logistic regression function lrm does.
For the purpose of plotting predicted survival at a single time, Predict has an
additional argument time for plotting cph fits. It also has an argument loglog
which, if TRUE, causes log–log survival to be plotted on the y-axis instead. cph
has all the arguments described in Section 20.13 and some that are specific
to it.
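A typical call, sketched here for a hypothetical data frame d containing futime, death, age, and sex:

require(rms)
dd ← datadist(d); options(datadist='dd')
# store design matrix, response, and underlying survival estimates
f ← cph(Surv(futime, death) ∼ rcs(age, 4) * sex, data=d,
        x=TRUE, y=TRUE, surv=TRUE)
anova(f)                              # tests of association, nonlinearity, interaction
ggplot(Predict(f, age, sex, time=3))  # survival probability at t = 3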
Similar to functions for psm, there are Survival, Quantile, and Mean functions
which create other R functions to evaluate survival probabilities and perform
other calculations, based on a cph fit with surv=TRUE. These functions, un-
like all the others, allow polygon (linear interpolation) estimation of survival
X β̂ = −1.8
 + 0.0493 age − 2.15×10⁻⁶(age − 30.3)³₊ − 2.82×10⁻⁵(age − 45.1)³₊
 + 5.18×10⁻⁵(age − 54.6)³₊ − 2.15×10⁻⁵(age − 69.6)³₊
 + [Female][−0.0366 age + 4.29×10⁻⁵(age − 30.3)³₊ − 0.00011(age − 45.1)³₊
 + 6.74×10⁻⁵(age − 54.6)³₊ − 2.32×10⁻⁷(age − 69.6)³₊],
where (u)₊ = u if u > 0, 0 otherwise.
[Fig. 20.16 axes: Points; age (sex=Male); age (sex=Female); Total Points; Linear Predictor; S(3 | Male); S(3 | Female); Median (Male); Median (Female).]
Fig. 20.16 Nomogram from a fitted stratified Cox model that allowed for interaction
between age and sex, and nonlinearity in age. The axis for median survival time is
truncated on the left where the median is beyond the last follow-up time.
The corresponding smoothed martingale residual plot for LVEF in Figure 20.7
was created with a command along these lines (a minimal sketch, assuming f
is the cph fit and lvef holds the ejection fractions):
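r ← resid(f, type='martingale')   # sketch; assumes f was fit with x=TRUE, y=TRUE
plot(lvef, r, xlab='LVEF', ylab='Martingale Residual')
lines(lowess(lvef, r, iter=0))    # smooth trend in the residuals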
The computation and plotting of scaled Schoenfeld residuals could have been
done automatically in this case by using the single command plot(cox.zph(cox)),
although cox.zph defaults to plotting against the Kaplan–Meier transformation
of follow-up time.
If you have multiple degree of freedom factors, you may want to score them
into linear predictors before using hazard.ratio.plot. The predict function
with argument type="terms" will produce a matrix with one column per factor
to do this (Section 20.13).
Therneau’s cox.zph function implements Harrell’s Schoenfeld residual cor-
relation test for PH. This function also stores results that can easily be passed
to a plotting method for cox.zph to automatically plot smoothed residuals
that estimate the effect of each predictor over time.
Therneau has also written an R function survdiff that compares two or
more survival curves using the Gρ family of rank tests (Harrington and
Fleming273).
The rcorr.cens function in the Hmisc library computes the c index and the
corresponding generalization of Somers’ Dxy rank correlation for a censored
response variable. rcorr.cens also works for uncensored and binary responses
(see ROC area in Section 10.8), although its use of all possible pairings makes
it slow for this purpose. The survival package's survConcordance has an ex-
tremely fast algorithm for the c index and a fairly accurate estimator of its
standard error.
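A sketch of its use with a fitted Cox model f and the Surv object S used in the fit; the linear predictor is negated so that larger values correspond to longer survival, as rcorr.cens expects:

rcorr.cens(-f$linear.predictors, S)   # returns C Index, Dxy, S.D., n, ...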
The calibrate function for cph constructs a bootstrap or cross-validation
optimism-corrected calibration curve for a single time point by resampling
the differences between average Cox predicted survival and Kaplan–Meier es-
timates (see Section 20.11.1). But more precise is calibrate’s default method
based on adaptive semiparametric regression discussed in the same section.
Figure 20.11 is an example.
The validate function for cph fits validates several statistics describing Cox
model fits: slope shrinkage, R²N, D, U, Q, and Dxy. The val.surv function
can also be of use in externally validating a Cox model using the methods
presented in Section 18.3.7.
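A sketch of both steps for a fit f made with x=TRUE, y=TRUE, surv=TRUE, and time.inc=3 (for cph fits the u given to calibrate must match time.inc; values here are illustrative):

validate(f, B=300)                 # optimism-corrected indexes incl. Dxy
cal ← calibrate(f, B=300, u=3)     # calibration of survival estimates at t = 3
plot(cal)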
20.14 Further Reading

1 Good general texts for the Cox PH model include Cox and Oakes,133 Kalbfleisch
and Prentice,331 Lawless,382 Collett,114 Marubini and Valsecchi,444 and Klein
and Moeschberger.350 Therneau and Grambsch604 describe the many ways the
standard Cox model may be extended.
2 Cupples et al.141 and Marubini and Valsecchi [444, pp. 201–206] present good
descriptions of various methods of computing “adjusted survival curves.”
3 See Altman and Andersen15 for simpler approximate formulas. Cheng et al.103
derived methods for obtaining pointwise and simultaneous confidence bands for
getHdata(prostate)
levels(prostate$ekg)[levels(prostate$ekg) %in%
                     c('old MI', 'recent MI')] ← 'MI'
# combines last 2 levels and uses a new name, MI
attach(prostate)
sz  ← impute(w, sz,  data=prostate)
sg  ← impute(w, sg,  data=prostate)
age ← impute(w, age, data=prostate)
wt  ← impute(w, wt,  data=prostate)
ekg ← impute(w, ekg, data=prostate)
The likelihood ratio χ2 statistic is 136.2 with 36 d.f. This test is highly
significant so some modeling is warranted. The AIC value (on the χ2 scale) is
136.2−2×36 = 64.2. The rough shrinkage estimate is 0.74 (100.2/136.2) so we
estimate that 0.26 of the model fitting will be noise, especially with regard to
calibration accuracy. The approach of Spiegelhalter582 is to fit this full model
and to shrink predicted values. We instead try to do data reduction (blinded
to individual χ2 statistics from the above model fit) to see if a reliable model
can be obtained without shrinkage. A good approach at this point might
be to do a variable clustering analysis followed by single degree of freedom
scoring for individual predictors or for clusters of predictors. Instead we do
an informal data reduction. The strategy is described in Table 21.2. For ap,
more exploration is desired to be able to model the shape of effect with such a
highly skewed distribution. Since we expect the tumor variables to be strong
prognostic factors we retain them as separate variables. No assumption is
made for the dose-response shape for estrogen, as there is reason to expect a
non-monotonic effect due to competing risks for cardiovascular death.
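As a quick check, the AIC and shrinkage arithmetic quoted above is simply:

lr ← 136.2; p ← 36
lr - 2*p       # AIC on the chi-square scale: 64.2
(lr - p)/lr    # rough heuristic shrinkage estimate: 100.2/136.2 = 0.74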
heart ← hx + ekg %nin% c('normal', 'benign')
label(heart) ← 'Heart Disease Code'
map ← (2*dbp + sbp)/3
label(map) ← 'Mean Arterial Pressure/10'
dd ← datadist(dd, heart, map)
The total savings is thus 12 d.f. The likelihood ratio χ2 is 118 with 24 d.f.,
with a slightly improved AIC of 70. The rough shrinkage estimate is slightly
better at 0.80, but still worrisome. A further data reduction could be done,
such as using the transcan transformations determined from self-consistency
of predictors, but we stop here and use this model.
From Table 21.3 there are 11 parameters associated with nonlinear effects,
and the overall test of linearity indicates the strong presence of nonlinearity
for at least one of the variables age, wt, map, hg, sz, sg, ap. There is no strong
evidence for a difference in survival time between doses of estrogen.
Now that we have a tentative model, let us examine the model's distributional
assumptions using smoothed scaled Schoenfeld residuals. A messy detail is
how to handle multiple regression coefficients per predictor. Here we do an
approximate analysis, scoring each predictor down to a single d.f. using its
fitted contribution to X β̂ (the matrix z referred to below).
The fit f.short based on the matrix of single d.f. predictors z has the
same LR χ2 of 118 as the fit f, but with a falsely low 11 d.f. All regression
coefficients are unity.
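A sketch of this scoring step (assumes f is the 24 d.f. fit and S its Surv response; the book's exact code is not reproduced here):

z ← predict(f, type='terms')     # matrix with one single-d.f. column per predictor
f.short ← cph(S ∼ z, x=TRUE, y=TRUE)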
Now we compute scaled Schoenfeld residuals separately for each predictor
and test the PH assumption using the “correlation with time” test. Also plot
smoothed trends in the residuals. The plot method for cox.zph objects uses
cubic splines to smooth the relationship.
phtest ← cox.zph(f.short, transform='identity')
phtest
rho chisq p
rx 0.10232 4.00823 0.0453
age -0.05483 1.05850 0.3036
wt 0.01838 0.11632 0.7331
pf . coded -0.03429 0.41884 0.5175
heart 0.02650 0.30052 0.5836
map 0.02055 0.14135 0.7069
hg -0.00362 0.00511 0.9430
sg -0.05137 0.94589 0.3308
sz -0.01554 0.08330 0.7729
ap 0.01720 0.11858 0.7306
bm 0.04957 0.95354 0.3288
GLOBAL NA 7.18985 0.7835
Perhaps only the drug effect significantly changes over time (P = 0.05 for
testing the correlation rho between the scaled Schoenfeld residual and time),
but when a global test of PH is done penalizing for 11 d.f., the P value is
0.78. A graphical examination of the trends doesn’t find anything interesting
for the last 10 variables. A residual plot is drawn for rx alone and is shown in
Figure 21.1. We ignore the possible increase in effect of estrogen over time. If
this non-PH is real, a more accurate model might be obtained by stratifying
on rx or by using a time × rx interaction as a time-dependent covariable.
Fig. 21.1 Raw and spline-smoothed scaled Schoenfeld residuals for dose of estrogen,
nonlinearly coded from the Cox model fit, with ± 2 standard errors.
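Were the non-PH for rx taken seriously, either remedy could be sketched as follows (hypothetical code; rx is the original treatment factor and z.other the matrix of remaining scores):

f1 ← cph(S ∼ strat(rx) + z.other)   # stratify on rx: no PH assumption for rx
# Alternatively, add a time x rx interaction as a time-dependent covariable,
# e.g., after splitting follow-up into intervals with survival::survSplit()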
Note that the model has several insignificant predictors. These are not
deleted, as that would not improve predictive accuracy and it would make
accurate confidence intervals hard to obtain. At this point it would be rea-
sonable to test prespecified interactions. Here we test all interactions with
dose. Since the multiple terms for many of the predictors (and for rx) make
for a great number of d.f. for testing interaction (and a loss of power), we do
approximate tests on the data-driven coding of predictors. P-values for these
tests are likely to be somewhat anti-conservative.
z.dose ← z[, 'rx']    # same as saying z[, 1] - get first column
z.other ← z[, -1]     # all but the first column of z
f.ia ← cph(S ∼ z.dose * z.other)   # Table 21.4:
latex(anova(f.ia), file='', label='tab:coxcase-anova2')
The global test of additivity in Table 21.4 has P = 0.27, so we ignore the
interactions (and also forget to penalize for having looked for them below!).
Let us plot how each predictor is related to the log hazard of death, including
0.95 confidence bands. Note in Figure 21.2 that due to a peculiarity of the
Cox model the standard error of the predicted X β̂ is zero at the reference
values (medians here, for continuous predictors).
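A command along the following lines produces such a display (a sketch, assuming f is the current fit and datadist has been set):

ggplot(Predict(f), sepdiscrete='vertical')   # Figure 21.2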
Fig. 21.2 Shape of the effect of each predictor on the log hazard of death. Y-axis
shows X β̂, with the predictors not plotted set to reference values. Note the highly
non-monotonic relationship with ap, and the increased slope after age 70 which occurs
in outcome models for various diseases.
We first validate this model for Somers’ Dxy rank correlation between pre-
dicted log hazard and observed survival time, and for slope shrinkage. The
bootstrap is used (with 300 resamples) to penalize for possible overfitting, as
discussed in Section 5.3.
set.seed(1)   # so can reproduce results
v ← validate(f, B=300)
[Fig. 21.3: bootstrap-estimated calibration curves; y-axis Fraction Surviving 60 Months, 0.0–0.7.]
The estimated calibration curves are shown in Figure 21.3, similar to what
was done in Figure 19.11. Bootstrap calibration demonstrates some overfit-
ting, consistent with regression to the mean. The absolute error is appreciable
for 5-year survival predicted to be very low or high.
[Fig. 21.4 comparisons: age 76:70; wt 107:90; pf.coded 4:1; heart 2:0; map 11:9.33; hg 14.7:12.3; sg 11:9; sz 21:5; ap 20:1; bm 1:0; rx 0.2, 1.0, and 5.0 mg estrogen each vs. placebo.]
Fig. 21.4 Hazard ratios and multi-level confidence bars for effects of predictors in
the model, using default ranges except for ap.
The ultimate graphical display for this model will be a nomogram relating
the predictors to X β̂, estimated three- and five-year survival probabilities,
and median survival time. It is easy to add as many “output” axes as desired
to a nomogram.
surv  ← Survival(f)
surv3 ← function(x) surv(3*12, lp=x)
surv5 ← function(x) surv(5*12, lp=x)
quan  ← Quantile(f)
med   ← function(x) quan(lp=x)/12
ss    ← c(.05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95)
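These functions then feed the nomogram. A hedged sketch of the call being built toward (axis settings are illustrative, not the book's exact ones):

nom ← nomogram(f, fun=list(surv3, surv5, med),
               funlabel=c('3-year Survival', '5-year Survival',
                          'Median Survival Time (years)'),
               fun.at=list(ss, ss, c(.5, 1:6)))
plot(nom)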
21.7 Problems
Perform Cox regression analyses of survival time using the Mayo Clinic PBC
dataset described in Section 8.9. Provide model descriptions, parameter esti-
mates, and conclusions.
1. Assess the nature of the association of several predictors of your choice.
For polytomous predictors, perform a log-rank-type score test (or k-sample
ANOVA extension if there are more than two levels). For continuous pre-
dictors, plot a smooth curve that estimates the relationship between the
predictor and the log hazard or log–log survival. Use both parametric
and nonparametric (using martingale residuals) approaches. Make a test
of H0: predictor is not associated with outcome versus Ha: predictor
is associated (by a smooth function). The test should have more than 1
d.f. If there is no evidence that the predictor is associated with the
outcome, drop it.

[Figure: nomogram for the prostate Cox model. Axes: Points; rx (5.0, 1.0,
0.2 mg estrogen, placebo); Age in Years; Weight Index = wt(kg) − ht(cm) + 200;
pf.coded; Heart Disease Code; Mean Arterial Pressure/10; Serum Hemoglobin
(g/100ml); Combined Index of Stage and Hist. Grade; Total Points; Linear
Predictor; 3-year Survival; 5-year Survival.]
Make a formal test of linearity of each remaining continuous predictor.
Use restricted cubic spline functions with four knots. If you feel that you
can’t narrow down the number of candidate predictors without examining
the outcomes, and the number is too great to be able to derive a reliable
model, use a data reduction technique and combine many of the variables
into a summary index.
2. For factors that remain, assess the PH assumption using at least two meth-
ods, after ensuring that continuous predictors are transformed to be as
linear as possible. In addition, for polytomous predictors, derive log cu-
mulative hazard estimates adjusted for continuous predictors that do not
assume anything about the relationship between the polytomous factor
and survival.
3. Derive a final Cox PH model. Stratify on polytomous factors that do not
satisfy the PH assumption. Decide whether to categorize and stratify on
continuous factors that may strongly violate PH. Remember that in this
case you can still model the continuous factor to account for any residual
regression after adjusting for strata intervals. Include an interaction be-
tween two predictors of your choosing. Interpret the parameters in the final
model. Also interpret the final model by providing some predicted survival
curves in which an important continuous predictor is on the x-axis, pre-
dicted survival is on the y-axis, separate curves are drawn for levels of
another factor, and any other factors in the model are adjusted to speci-
fied constants or to the grand mean. The estimated survival probabilities
should be computed at t = 730 days.
4. Verify, in an unbiased fashion, your “final” model, for either calibration or
discrimination. Validate intermediate steps, not just the final parameter
estimates.
Appendix A
Datasets, R Packages, and Internet
Resources
R Packages
The rms package written by the author maintains detailed information about
a model’s design matrix so that many analyses using the model fit are au-
tomated. rms is a large package of R functions. Most of the functions in rms
analyze model fits, validate them, or make presentation graphics from them,
but the package also contains special model-fitting functions for binary and
ordinal logistic regression (optionally using penalized maximum likelihood),
unpenalized ordinal regression with a variety of link functions, penalized and
unpenalized least squares, and parametric and semiparametric survival mod-
els. In addition, rms handles quantile regression and longitudinal analysis
using generalized least squares. The rms package pays special attention to
computing predicted values in that design matrix attributes (e.g., knots for
splines, categories for categorical predictors) are “remembered” so that pre-
dictors are properly transformed while predictions are being generated. The
functions make extensive use of a wealth of survival analysis software writ-
ten by Terry Therneau of the Mayo Foundation. This survival package is a
standard part of R.
The author’s Hmisc package contains other miscellaneous functions used
in the text. These are functions that do not operate on model fits that used
the enhanced design attributes stored by the rms package. Functions in Hmisc
include facilities for data reduction, imputation, power and sample size calcu-
lation, advanced table making, recoding variables, translating SAS datasets
into R data frames while preserving all data attributes (including variable
and value labels and special missing values), drawing and annotating plots,
and converting certain R objects to LaTeX371 typeset form. The latter capa-
bility, provided by a family of latex functions, completes the conversion to
LaTeX of many of the objects created by rms. The packages contain several
LaTeX methods that create LaTeX code for typesetting model fits in algebraic
notation, for printing ANOVA and regression effect (e.g., odds ratio) tables,
and other applications. The LaTeX methods were used extensively in the text,
especially for writing restricted cubic spline function fits in simplest notation.
The latest version of the rms package is available from CRAN (see below).
It is necessary to install the Hmisc package in order to use the rms package. The
Web site also contains more in-depth overviews of the packages, which run on
UNIX, Linux, Mac, and Microsoft Windows systems. The packages may be
automatically downloaded and installed using R’s install.packages function
or using menus under R graphical user interfaces.
To subscribe to the highly informative and helpful R-help e-mail group, see the
Web site. R-help is appropriate for asking general questions about R including
those about finding or writing functions to do specific analyses (for questions
specific to a package, contact the author of that package). Another resource
is the CRAN repository at www.r-project.org. Another excellent resource
for asking questions about R is stackoverflow.com/questions/tagged/r.
There is a Google group regmod devoted to the book and courses.
Multiple Imputation
Bibliography
SAS
SAS macros for fitting restricted cubic splines and for other basic operations
are freely available from the main Web site. The Web site also has notes on
SAS usage for some of the methods presented in the text.
References
35. P. C. Austin. Bootstrap model selection had similar performance for select-
ing authentic and noise variables compared to backward variable elimination: a
simulation study. J Clin Epi, 61:1009–1017, 2008. ⋄70
36. P. C. Austin and E. W. Steyerberg. Events per variable (EPV) and the relative
performance of different strategies for estimating the out-of-sample validity of
logistic regression models. Statistical methods in medical research, Nov. 2014. ⋄
112
37. P. C. Austin and E. W. Steyerberg. Graphical assessment of internal and exter-
nal calibration of logistic regression models by using loess smoothers. Stat Med,
33(3):517–535, Feb. 2014. ⋄105
38. P. C. Austin, J. V. Tu, P. A. Daly, and D. A. Alter. Tutorial in Biostatistics: The
use of quantile regression in health care research: a case study examining gender
differences in the timeliness of thrombolytic therapy. Stat Med, 24:791–816,
2005. ⋄392
39. D. Bamber. The area above the ordinal dominance graph and the area below
the receiver operating characteristic graph. J Mathe Psych, 12:387–415, 1975.
⋄257
40. J. Banks. Nomograms. In S. Kotz and N. L. Johnson, editors, Encyclopedia of
Stat Scis, volume 6. Wiley, New York, 1985. ⋄104, 267
41. J. Barnard and D. B. Rubin. Small-sample degrees of freedom with multiple
imputation. Biometrika, 86:948–955, 1999. ⋄58
42. S. A. Barnes, S. R. Lindborg, and J. W. Seaman. Multiple imputation techniques
in small sample clinical trials. Stat Med, 25:233–245, 2006. ⋄47, 58
43. F. Barzi and M. Woodward. Imputations of missing values in practice: Results
from imputations of serum cholesterol in 28 cohort studies. Am J Epi, 160:34–45,
2004. ⋄50, 58
44. R. A. Becker, J. M. Chambers, and A. R. Wilks. The New S Language.
Wadsworth and Brooks/Cole, Pacific Grove, CA, 1988. ⋄127
45. H. Belcher. The concept of residual confounding in regression models and some
applications. Stat Med, 11:1747–1758, 1992. ⋄11, 19
46. D. A. Belsley. Conditioning Diagnostics: Collinearity and Weak Data in Re-
gression. Wiley, New York, 1991. ⋄101
47. D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity. Wiley, New York, 1980. ⋄91
48. R. Bender and A. Benner. Calculating ordinal regression models in SAS and
S-Plus. Biometrical J, 42:677–699, 2000. ⋄324
49. J. K. Benedetti, P. Liu, H. N. Sather, J. Seinfeld, and M. A. Epton. Effective
sample size for tests of censored survival data. Biometrika, 69:343–349, 1982. ⋄
73
50. K. Berhane, M. Hauptmann, and B. Langholz. Using tensor product splines
in modeling exposure–time–response relationships: Application to the Colorado
Plateau Uranium Miners cohort. Stat Med, 27:5484–5496, 2008. ⋄37
51. K. N. Berk and D. E. Booth. Seeing a curve in multiple regression. Technomet-
rics, 37:385–398, 1995. ⋄272
52. D. M. Berridge and J. Whitehead. Analysis of failure time data with ordinal
categories of response. Stat Med, 10:1703–1710, 1991. ⋄319, 320, 324, 417
53. C. Berzuini and D. Clayton. Bayesian analysis of survival on multiple time
scales. Stat Med, 13:823–838, 1994. ⋄401
54. W. B. Bilker and M. Wang. A semiparametric extension of the Mann-Whitney
test for randomly truncated data. Biometrics, 52:10–20, 1996. ⋄420
55. D. A. Binder. Fitting Cox’s proportional hazards models from survey data.
Biometrika, 79:139–147, 1992. ⋄213, 215
56. C. Binquet, M. Abrahamowicz, A. Mahboubi, V. Jooste, J. Faivre, C. Bonithon-
Kopp, and C. Quantin. Empirical study of the dependence of the results of
multivariable flexible survival analyses on model selection strategy. Stat Med,
27:6470–6488, 2008. ⋄420
100. C. Chatfield. Model uncertainty, data mining and statistical inference (with
discussion). J Roy Stat Soc A, 158:419–466, 1995. ⋄vii, 9, 10, 11, 68, 100, 123,
204
101. S. Chatterjee and A. S. Hadi. Regression Analysis by Example. Wiley, New
York, fifth edition, 2012. ⋄78, 101
102. S. C. Cheng, J. P. Fine, and L. J. Wei. Prediction of cumulative incidence
function under the proportional hazards model. Biometrics, 54:219–228, 1998.
⋄415
103. S. C. Cheng, L. J. Wei, and Z. Ying. Predicting survival probabilities with
semiparametric transformation models. JASA, 92(437):227–235, Mar. 1997. ⋄
517
104. F. Chiaromonte, R. D. Cook, and B. Li. Sufficient dimension reduction in
regressions with categorical predictors. Appl Stat, 30:475–497, 2002. ⋄101
105. B. Choodari-Oskooei, P. Royston, and M. K. B. Parmar. A simulation study
of predictive ability measures in a survival model II: explained randomness and
predictive accuracy. Stat Med, 31(23):2644–2659, 2012. ⋄518
106. B. Choodari-Oskooei, P. Royston, and M. K. B. Parmar. A simulation study of
predictive ability measures in a survival model I: Explained variation measures.
Stat Med, 31(23):2627–2643, 2012. ⋄518
107. A. Ciampi, A. Negassa, and Z. Lou. Tree-structured prediction for censored
survival data and the Cox model. J Clin Epi, 48:675–689, 1995. ⋄41
108. A. Ciampi, J. Thiffault, J. P. Nakache, and B. Asselain. Stratification by stepwise
regression, correspondence analysis and recursive partition. Comp Stat Data
Analysis, 1986:185–204, 1986. ⋄41, 81
109. L. A. Clark and D. Pregibon. Tree-Based Models. In J. M. Chambers and T. J.
Hastie, editors, Statistical Models in S, chapter 9, pages 377–419. Wadsworth
and Brooks/Cole, Pacific Grove, CA, 1992. ⋄41
110. T. G. Clark and D. G. Altman. Developing a prognostic model in the presence
of missing data: an ovarian cancer case study. J Clin Epi, 56:28–37, 2003. ⋄57
111. W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots.
J Am Stat Assoc, 74:829–836, 1979. ⋄29, 141, 238, 315, 356, 493
112. A. Cnaan and L. Ryan. Survival analysis in natural history studies of disease.
Stat Med, 8:1255–1268, 1989. ⋄401, 420
113. T. J. Cole, C. J. Morley, A. J. Thornton, M. A. Fowler, and P. H. Hewson. A
scoring system to quantify illness in babies under 6 months of age. J Roy Stat
Soc A, 154:287–304, 1991. ⋄324
114. D. Collett. Modelling Survival Data in Medical Research. Chapman and Hall,
London, 1994. ⋄420, 517
115. D. Collett. Modelling Binary Data. Chapman and Hall, London, second edition,
2002. ⋄213, 272, 315
116. A. F. Connors, T. Speroff, N. V. Dawson, C. Thomas, F. E. Harrell, D. Wagner,
N. Desbiens, L. Goldman, A. W. Wu, R. M. Califf, W. J. Fulkerson, H. Vidaillet,
S. Broste, P. Bellamy, J. Lynn, W. A. Knaus, and T. S. Investigators. The effec-
tiveness of right heart catheterization in the initial care of critically ill patients.
JAMA, 276:889–897, 1996. ⋄3
117. E. F. Cook and L. Goldman. Asymmetric stratification: An outline for an effi-
cient method for controlling confounding in cohort studies. Am J Epi, 127:626–
639, 1988. ⋄31, 231
118. N. R. Cook. Use and misuse of the receiver operating characteristic curve in
risk prediction. Circulation, 115:928–935, 2007. ⋄93, 101, 273
119. R. D. Cook. Fisher Lecture: Dimension reduction in regression. Statistical Sci,
22:1–26, 2007. ⋄101
120. R. D. Cook and L. Forzani. Principal fitted components for dimension reduction
in regression. Statistical Sci, 23(4):485–501, 2008. ⋄101
121. J. Copas. The effectiveness of risk scores: The logit rank plot. Appl Stat, 48:165–
183, 1999. ⋄273
122. J. B. Copas. Regression, prediction and shrinkage (with discussion). J Roy Stat
Soc B, 45:311–354, 1983. ⋄100, 101
123. J. B. Copas. Cross-validation shrinkage of regression predictors. J Roy Stat Soc
B, 49:175–183, 1987. ⋄115, 123, 273, 508
124. J. B. Copas. Unweighted sum of squares tests for proportions. Appl Stat, 38:71–
80, 1989. ⋄236
125. J. B. Copas and T. Long. Estimating the residual variance in orthogonal regres-
sion with variable selection. The Statistician, 40:51–59, 1991. ⋄68
126. C. Cox. Location-scale cumulative odds models for ordinal data: A generalized
non-linear model approach. Stat Med, 14:1191–1203, 1995. ⋄324
127. C. Cox. The generalized f distribution: An umbrella for parametric survival
analysis. Stat Med, 27:4301–4313, 2008. ⋄424
128. C. Cox, H. Chu, M. F. Schneider, and A. Muñoz. Parametric survival analysis
and taxonomy of hazard functions for the generalized gamma distribution. Stat
Med, 26:4352–4374, 2007. ⋄424
129. D. R. Cox. The regression analysis of binary sequences (with discussion). J Roy
Stat Soc B, 20:215–242, 1958. ⋄14, 220
130. D. R. Cox. Two further applications of a model for binary regression.
Biometrika, 45(3/4):562–565, 1958. ⋄259
131. D. R. Cox. Further results on tests of separate families of hypotheses. J Roy
Stat Soc B, 24:406–424, 1962. ⋄205
132. D. R. Cox. Regression models and life-tables (with discussion). J Roy Stat Soc
B, 34:187–220, 1972. ⋄39, 41, 172, 207, 213, 314, 418, 428, 475, 476
133. D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman and Hall, London,
1984. ⋄401, 420, 517
134. D. R. Cox and E. J. Snell. A general definition of residuals (with discussion). J
Roy Stat Soc B, 30:248–275, 1968. ⋄440
135. D. R. Cox and E. J. Snell. The Analysis of Binary Data. Chapman and Hall,
London, second edition, 1989. ⋄206
136. D. R. Cox and N. Wermuth. A comment on the coefficient of determination for
binary responses. Am Statistician, 46:1–4, 1992. ⋄206, 256
137. J. G. Cragg and R. Uhler. The demand for automobiles. Canadian Journal of
Economics, 3:386–406, 1970. ⋄206, 256
138. S. L. Crawford, S. L. Tennstedt, and J. B. McKinlay. A comparison of analytic
methods for non-random missingness of outcome data. J Clin Epi, 48:209–219,
1995. ⋄58
139. N. J. Crichton and J. P. Hinde. Correspondence analysis as a screening method
for indicants for clinical diagnosis. Stat Med, 8:1351–1362, 1989. ⋄81
140. N. J. Crichton, J. P. Hinde, and J. Marchini. Models for diagnosing chest pain:
Is CART useful? Stat Med, 16:717–727, 1997. ⋄41
141. L. A. Cupples, D. R. Gagnon, R. Ramaswamy, and R. B. D’Agostino. Age-
adjusted survival curves with application in the Framingham Study. Stat Med,
14:1731–1744, 1995. ⋄517
142. E. E. Cureton and R. B. D’Agostino. Factor Analysis, An Applied Approach.
Erlbaum, Hillsdale, NJ, 1983. ⋄81, 87, 101
143. D. M. Dabrowska, K. A. Doksum, N. J. Feduska, R. Husing, and P. Neville.
Methods for comparing cumulative hazard functions in a semi-proportional haz-
ard model. Stat Med, 11:1465–1476, 1992. ⋄482, 495, 502
144. R. B. D’Agostino, A. J. Belanger, E. W. Markson, M. Kelly-Hayes, and P. A.
Wolf. Development of health risk appraisal functions in the presence of multiple
indicators: The Framingham Study nursing home institutionalization model.
Stat Med, 14:1757–1770, 1995. ⋄81, 101
166. J. A. Dubin, H. Müller, and J. Wang. Event history graphs for censored data.
Stat Med, 20:2951–2964, 2001. ⋄418, 420
167. R. Dudley, F. E. Harrell, L. Smith, D. B. Mark, R. M. Califf, D. B. Pryor,
D. Glower, J. Lipscomb, and M. Hlatky. Comparison of analytic models for
estimating the effect of clinical factors on the cost of coronary artery bypass
graft surgery. J Clin Epi, 46:261–271, 1993. ⋄x
168. S. Durrleman and R. Simon. Flexible regression models with cubic splines. Stat
Med, 8:551–561, 1989. ⋄40
169. J. P. Eaton and C. A. Haas. Titanic: Triumph and Tragedy. W. W. Norton,
New York, second edition, 1995. ⋄291
170. B. Efron. The two sample problem with censored data. In Proceedings of the
Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 4,
pages 831–853. 1967. ⋄505
171. B. Efron. The efficiency of Cox’s likelihood function for censored data. J Am
Stat Assoc, 72:557–565, 1977. ⋄475, 477
172. B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-
validation. J Am Stat Assoc, 78:316–331, 1983. ⋄70, 113, 114, 115, 116, 123,
259
173. B. Efron. How biased is the apparent error rate of a prediction rule? J Am Stat
Assoc, 81:461–470, 1986. ⋄101, 114
174. B. Efron. Missing data, imputation, and the bootstrap (with discussion). J Am
Stat Assoc, 89:463–479, 1994. ⋄52, 54
175. B. Efron and G. Gong. A leisurely look at the bootstrap, the jackknife, and
cross-validation. Am Statistician, 37:36–48, 1983. ⋄114
176. B. Efron and C. Morris. Stein’s paradox in statistics. Sci Am, 236(5):119–127,
1977. ⋄77
177. B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence
intervals, and other measures of statistical accuracy. Statistical Sci, 1:54–77,
1986. ⋄70, 106, 114, 197
178. B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and
Hall, New York, 1993. ⋄70, 106, 114, 115, 122, 197, 199
179. B. Efron and R. Tibshirani. Improvements on cross-validation: The .632+ boot-
strap method. J Am Stat Assoc, 92:548–560, 1997. ⋄123, 124
180. G. E. Eide, E. Omenaas, and A. Gulsvik. The semi-proportional hazards model
revisited: Practical reparameterizations. Stat Med, 15:1771–1777, 1996. ⋄482
181. C. Faes, G. Molenberghs, M. Aerts, G. Verbeke, and M. G. Kenward. The
effective sample size and an alternative small-sample degrees-of-freedom method.
Am Statistician, 63(4):389–399, 2009. ⋄148
182. M. W. Fagerland and D. W. Hosmer. A goodness-of-fit test for the proportional
odds regression model. Stat Med, 32(13):2235–2249, 2013. ⋄317
183. J. Fan and R. A. Levine. To amnio or not to amnio: That is the decision for
Bayes. Chance, 20(3):26–32, 2007. ⋄5
184. D. Faraggi, M. LeBlanc, and J. Crowley. Understanding neural networks using
regression trees: an application to multiple myeloma survival data. Stat Med,
20:2965–2976, 2001. ⋄120
185. D. Faraggi and R. Simon. A simulation study of cross-validation for selecting an
optimal cutpoint in univariate survival analysis. Stat Med, 15:2203–2213, 1996.
⋄11, 19
186. J. J. Faraway. The cost of data analysis. J Comp Graph Stat, 1:213–229, 1992.
⋄10, 11, 97, 100, 115, 116, 322, 393, 396
187. V. Fedorov, F. Mannino, and R. Zhang. Consequences of dichotomization.
Pharm Stat, 8:50–61, 2009. ⋄5, 19
188. Z. Feng, D. McLerran, and J. Grizzle. A comparison of statistical methods for
clustered data analysis with Gaussian error. Stat Med, 15:1793–1806, 1996. ⋄
197, 213
189. L. Ferré. Determining the dimension in sliced inverse regression and related
methods. J Am Stat Assoc, 93:132–149, 1998. ⋄101
190. S. E. Fienberg. The Analysis of Cross-Classified Categorical Data. Springer,
New York, second edition, 2007. ⋄311, 319
191. P. Filzmoser, H. Fritz, and K. Kalcher. pcaPP: Robust PCA by Projection Pur-
suit, 2012. R package version 1.9–48. ⋄175
192. J. P. Fine and R. J. Gray. A proportional hazards model for the subdistribution
of a competing risk. J Am Stat Assoc, 94:496–509, 1999. ⋄420
193. D. M. Finkelstein and D. A. Schoenfeld. Combining mortality and longitudinal
measures in clinical trials. Stat Med, 18:1341–1354, 1999. ⋄420
194. M. Fiocco, H. Putter, and H. C. van Houwelingen. Reduced-rank proportional
hazards regression and simulation-based prediction for multi-state models. Stat
Med, 27:4340–4358, 2008. ⋄420
195. G. M. Fitzmaurice. A caveat concerning independence estimating equations
with multivariate binary data. Biometrics, 51:309–317, 1995. ⋄214
196. T. R. Fleming and D. P. Harrington. Nonparametric estimation of the survival
distribution in censored data. Comm Stat Th Meth, 13(20):2469–2486, 1984. ⋄
413
197. T. R. Fleming and D. P. Harrington. Counting Processes & Survival Analysis.
Wiley, New York, 1991. ⋄178, 420
198. I. Ford, J. Norrie, and S. Ahmadi. Model inconsistency, illustrated by the Cox
proportional hazards model. Stat Med, 14:735–746, 1995. ⋄4
199. E. B. Fowlkes. Some diagnostics for binary logistic regression via smoothing.
Biometrika, 74:503–515, 1987. ⋄272
200. J. Fox. Applied Regression Analysis, Linear Models, and Related Methods.
SAGE Publications, Thousand Oaks, CA, 1997. ⋄viii
201. J. Fox. An R and S-PLUS Companion to Applied Regression. SAGE Publica-
tions, Thousand Oaks, CA, 2002. ⋄viii
202. J. Fox. Applied Regression Analysis and Generalized Linear Models. SAGE
Publications, Thousand Oaks, CA, second edition, 2008. ⋄121
203. Fox, John. Bootstrapping Regression Models: An Appendix to An R and S-
PLUS Companion to Applied Regression, 2002. ⋄202
204. B. Francis and M. Fuller. Visualization of event histories. J Roy Stat Soc A,
159:301–308, 1996. ⋄421
205. D. Freedman, W. Navidi, and S. Peters. On the Impact of Variable Selection
in Fitting Regression Equations, pages 1–16. Lecture Notes in Economics and
Mathematical Systems. Springer-Verlag, New York, 1988. ⋄115
206. D. A. Freedman. On the so-called “Huber sandwich estimator” and “robust
standard errors”. Am Statistician, 60:299–302, 2006. ⋄213
207. J. H. Friedman. A variable span smoother. Technical Report 5, Laboratory for
Computational Statistics, Department of Statistics, Stanford University, 1984.
⋄29, 82, 141, 210, 273, 498
208. L. Friedman and M. Wall. Graphical views of suppression and multicollinearity
in multiple linear regression. Am Statistician, 59:127–136, 2005. ⋄101
209. M. H. Gail. Does cardiac transplantation prolong life? A reassessment. Ann Int
Med, 76:815–817, 1972. ⋄401
210. M. H. Gail and R. M. Pfeiffer. On criteria for evaluating models of absolute
risk. Biostatistics, 6(2):227–239, 2005. ⋄5
211. J. C. Gardiner, Z. Luo, and L. A. Roman. Fixed effects, random effects and
GEE: What are the differences? Stat Med, 28:221–239, 2009. ⋄160
212. J. J. Gaynor, E. J. Feuer, C. C. Tan, D. H. Wu, C. R. Little, D. J. Straus,
D. D. Clarkson, and M. F. Brennan. On the use of cause-specific failure and
conditional failure probabilities: Examples from clinical oncology data. J Am
Stat Assoc, 88:400–409, 1993. ⋄414, 415
273. D. P. Harrington and T. R. Fleming. A class of rank test procedures for censored
survival data. Biometrika, 69:553–566, 1982. ⋄517
274. T. Hastie. Discussion of “The use of polynomial splines and their tensor products
in multivariate function estimation” by C. J. Stone. Appl Stat, 22:177–179, 1994.
⋄37
275. T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall,
London, 1990. ⋄29, 41, 142, 390
276. T. J. Hastie, J. L. Botha, and C. M. Schnitzler. Regression with an ordered
categorical response. Stat Med, 8:785–794, 1989. ⋄324
277. T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman &
Hall/CRC, Boca Raton, FL, 1990. ISBN 9780412343902. ⋄90, 359
278. W. W. Hauck and A. Donner. Wald’s test as applied to hypotheses in logit
analysis. J Am Stat Assoc, 72:851–863, 1977. ⋄193, 234
279. X. He and L. Shen. Linear regression after spline transformation. Biometrika,
84:474–481, 1997. ⋄82
280. Y. He and A. M. Zaslavsky. Diagnosing imputation models by applying target
analyses to posterior replicates of completed data. Stat Med, 31(1):1–18, 2012.
⋄59
281. G. Heinze and M. Schemper. A solution to the problem of separation in logistic
regression. Stat Med, 21(16):2409–2419, 2002. ⋄203
282. R. Henderson. Problems and prediction in survival-data analysis. Stat Med,
14:161–184, 1995. ⋄420, 518, 519
283. R. Henderson, M. Jones, and J. Stare. Accuracy of point predictions in survival
analysis. Stat Med, 20:3083–3096, 2001. ⋄519
284. A. V. Hernández, M. J. Eijkemans, and E. W. Steyerberg. Randomized con-
trolled trials with time-to-event outcomes: how much does prespecified covariate
adjustment increase power? Annals of epidemiology, 16(1):41–48, Jan. 2006. ⋄
231
285. A. V. Hernández, E. W. Steyerberg, and J. D. F. Habbema. Covariate ad-
justment in randomized controlled trials with dichotomous outcomes increases
statistical power and reduces sample size requirements. J Clin Epi, 57:454–460,
2004. ⋄231
286. J. E. Herndon and F. E. Harrell. The restricted cubic spline hazard model.
Comm Stat Th Meth, 19:639–663, 1990. ⋄408, 409, 424
287. J. E. Herndon and F. E. Harrell. The restricted cubic spline as baseline hazard in
the proportional hazards model with step function time-dependent covariables.
Stat Med, 14:2119–2129, 1995. ⋄408, 424, 501, 518
288. I. Hertz-Picciotto and B. Rockhill. Validity and efficiency of approximation
methods for tied survival times in Cox regression. Biometrics, 53:1151–1156,
1997. ⋄477
289. K. R. Hess. Assessing time-by-covariate interactions in proportional hazards
regression models using cubic spline functions. Stat Med, 13:1045–1062, 1994. ⋄
501
290. K. R. Hess. Graphical methods for assessing violations of the proportional
hazards assumption in Cox regression. Stat Med, 14:1707–1723, 1995. ⋄518
291. T. Hielscher, M. Zucknick, W. Werft, and A. Benner. On the prognostic value
of survival models with application to gene expression signatures. Stat Med,
29:818–829, 2010. ⋄518, 519
292. J. Hilden and T. A. Gerds. A note on the evaluation of novel biomarkers: do not
rely on integrated discrimination improvement and net reclassification index.
Statist. Med., 33(19):3405–3414, Aug. 2014. ⋄101
293. S. L. Hillis. Residual plots for the censored data linear regression model. Stat
Med, 14:2023–2036, 1995. ⋄450
294. S. G. Hilsenbeck and G. M. Clark. Practical p-value adjustment for optimally
selected cutpoints. Stat Med, 15:103–112, 1996. ⋄11, 19
382. J. F. Lawless. Statistical Models and Methods for Lifetime Data. Wiley, New
York, 1982. ⋄420, 450, 485, 517
383. J. F. Lawless. The analysis of recurrent events for multiple subjects. Appl Stat,
44:487–498, 1995. ⋄421
384. J. F. Lawless and C. Nadeau. Some simple robust methods for the analysis of
recurrent events. Technometrics, 37:158–168, 1995. ⋄420, 421
385. J. F. Lawless and K. Singhal. Efficient screening of nonnormal regression models.
Biometrics, 34:318–327, 1978. ⋄70, 137
386. J. F. Lawless and Y. Yuan. Estimation of prediction error for survival models.
Stat Med, 29:262–274, 2010. ⋄519
387. S. le Cessie and J. C. van Houwelingen. A goodness-of-fit test for binary regres-
sion models, based on smoothing methods. Biometrics, 47:1267–1282, 1991. ⋄
236
388. S. le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression.
Appl Stat, 41:191–201, 1992. ⋄77, 209
389. M. LeBlanc and J. Crowley. Survival trees by goodness of fit. J Am Stat Assoc,
88:457–467, 1993. ⋄41
390. M. LeBlanc and R. Tibshirani. Adaptive principal surfaces. J Am Stat Assoc,
89:53–64, 1994. ⋄101
391. A. Leclerc, D. Luce, F. Lert, J. F. Chastang, and P. Logeay. Correspondence
analysis and logistic modelling: Complementary use in the analysis of a health
survey among nurses. Stat Med, 7:983–995, 1988. ⋄81
392. E. T. Lee. Statistical Methods for Survival Data Analysis. Lifetime Learning
Publications, Belmont, CA, second edition, 1980. ⋄420
393. E. W. Lee, L. J. Wei, and D. A. Amato. Cox-type regression analysis for large
numbers of small groups of correlated failure time observations. In J. P. Klein
and P. K. Goel, editors, Survival Analysis: State of the Art, NATO ASI, pages
237–247. Kluwer Academic, Boston, 1992. ⋄197
394. J. J. Lee, K. R. Hess, and J. A. Dubin. Extensions and applications of event
charts. Am Statistician, 54:63–70, 2000. ⋄418, 420
395. K. L. Lee, D. B. Pryor, F. E. Harrell, R. M. Califf, V. S. Behar, W. L. Floyd, J. J.
Morris, R. A. Waugh, R. E. Whalen, and R. A. Rosati. Predicting outcome in
coronary disease: Statistical models versus expert clinicians. Am J Med, 80:553–
560, 1986. ⋄205
396. S. Lee, J. Z. Huang, and J. Hu. Sparse logistic principal components analysis
for binary data. Ann Appl Stat, 4(3):1579–1601, 2010. ⋄101
397. E. L. Lehmann. Model specification: The views of Fisher and Neyman and later
developments. Statistical Sci, 5:160–168, 1990. ⋄8, 10
398. S. Lehr and M. Schemper. Parsimonious analysis of time-dependent effects in
the Cox model. Stat Med, 26:2686–2698, 2007. ⋄501
399. F. Leisch. Sweave: Dynamic Generation of Statistical Reports Using Literate
Data Analysis. In W. Härdle and B. Rönz, editors, Compstat 2002 — Proceed-
ings in Computational Statistics, pages 575–580. Physica Verlag, Heidelberg,
2002. ISBN 3-7908-1517-9. ⋄138
400. L. F. León and C. Tsai. Functional form diagnostics for Cox’s proportional
hazards model. Biometrics, 60:75–84, 2004. ⋄518
401. M. A. H. Levine, A. I. El-Nahas, and B. Asa. Relative risk and odds ratio data
are still portrayed with inappropriate scales in the medical literature. J Clin
Epi, 63:1045–1047, 2010. ⋄122
402. C. Li and B. E. Shepherd. A new residual for ordinal outcomes. Biometrika,
99(2):473–480, 2012. ⋄315
403. K. Li, J. Wang, and C. Chen. Dimension reduction for censored regression data.
Ann Stat, 27:1–23, 1999. ⋄101
404. K. C. Li. Sliced inverse regression for dimension reduction. J Am Stat Assoc,
86:316–327, 1991. ⋄101
405. K.-Y. Liang and S. L. Zeger. Longitudinal data analysis of continuous and
discrete responses for pre-post designs. Sankhyā, 62:134–148, 2000. ⋄160
406. J. G. Liao and D. McGee. Adjusted coefficients of determination for logistic
regression. Am Statistician, 57:161–165, 2003. ⋄273
407. D. Y. Lin. Cox regression analysis of multivariate failure time data: The marginal
approach. Stat Med, 13:2233–2247, 1994. ⋄197, 213, 417, 418
408. D. Y. Lin. Non-parametric inference for cumulative incidence functions in com-
peting risks studies. Stat Med, 16:901–910, 1997. ⋄415
409. D. Y. Lin. On fitting Cox’s proportional hazards models to survey data.
Biometrika, 87:37–47, 2000. ⋄215
410. D. Y. Lin and L. J. Wei. The robust inference for the Cox proportional hazards
model. J Am Stat Assoc, 84:1074–1078, 1989. ⋄197, 213, 487
411. D. Y. Lin, L. J. Wei, and Z. Ying. Checking the Cox model with cumulative
sums of martingale-based residuals. Biometrika, 80:557–572, 1993. ⋄518
412. D. Y. Lin and Z. Ying. Semiparametric regression analysis of longitudinal data
with informative drop-outs. Biostatistics, 4:385–398, 2003. ⋄47
413. J. C. Lindsey and L. M. Ryan. Tutorial in biostatistics: Methods for interval-
censored data. Stat Med, 17:219–238, 1998. ⋄420
414. J. K. Lindsey. Models for Repeated Measurements. Clarendon Press, 1997. ⋄143
415. J. K. Lindsey and B. Jones. Choosing among generalized linear models applied
to medical data. Stat Med, 17:59–68, 1998. ⋄11
416. K. Linnet. Assessing diagnostic tests by a strictly proper scoring rule. Stat Med,
8:609–618, 1989. ⋄114, 123, 257, 258
417. S. R. Lipsitz, L. P. Zhao, and G. Molenberghs. A semiparametric method of
multiple imputation. J Roy Stat Soc B, 60:127–144, 1998. ⋄54
418. R. Little and H. An. Robust likelihood-based analysis of multivariate data with
missing values. Statistica Sinica, 14:949–968, 2004. ⋄57, 59
419. R. J. Little. Missing Data. In Ency of Biostatistics, pages 2622–2635. Wiley,
New York, 1998. ⋄59
420. R. J. A. Little. Missing-data adjustments in large surveys. J Bus Econ Stat,
6:287–296, 1988. ⋄51
421. R. J. A. Little. Regression with missing X’s: A review. J Am Stat Assoc,
87:1227–1237, 1992. ⋄50, 51, 54
422. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley,
New York, second edition, 2002. ⋄48, 52, 54, 59
423. G. F. Liu, K. Lu, R. Mogg, M. Mallick, and D. V. Mehrotra. Should baseline be
a covariate or dependent variable in analyses of change from baseline in clinical
trials? Stat Med, 28:2509–2530, 2009. ⋄160
424. K. Liu and A. R. Dyer. A rank statistic for assessing the amount of variation
explained by risk factors in epidemiologic studies. Am J Epi, 109:597–606, 1979.
⋄206, 256
425. R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test
for the lasso. Technical report, arXiv, 2013. ⋄68
426. J. S. Long and L. H. Ervin. Using heteroscedasticity consistent standard errors
in the linear regression model. Am Statistician, 54:217–224, 2000. ⋄213
427. J. Lubsen, J. Pool, and E. van der Does. A practical device for the application
of a diagnostic or prognostic function. Meth Info Med, 17:127–129, 1978. ⋄104
428. D. J. Lunn, J. Wakefield, and A. Racine-Poon. Cumulative logit models for
ordinal data: a case study involving allergic rhinitis severity scores. Stat Med,
20:2261–2285, 2001. ⋄324
429. M. Lunn and D. McNeil. Applying Cox regression to competing risks. Biomet-
rics, 51:524–532, 1995. ⋄420
430. X. Luo, L. A. Stefanski, and D. D. Boos. Tuning variable selection procedures
by adding noise. Technometrics, 48:165–175, 2006. ⋄11, 100
540. SAS Institute, Inc. SAS/STAT User’s Guide, volume 2. SAS Institute, Inc.,
Cary NC, fourth edition, 1990. ⋄315
541. W. Sauerbrei and M. Schumacher. A bootstrap resampling procedure for model
building: Application to the Cox regression model. Stat Med, 11:2093–2109,
1992. ⋄70, 113, 177
542. J. L. Schafer and J. W. Graham. Missing data: Our view of the state of the art.
Psych Meth, 7:147–177, 2002. ⋄58
543. D. E. Schaubel, R. A. Wolfe, and R. M. Merion. Estimating the effect of a
time-dependent treatment by levels of an internal time-dependent covariate:
Application to the contrast between liver wait-list and posttransplant mortality.
J Am Stat Assoc, 104(485):49–59, 2009. ⋄518
544. M. Schemper. Analyses of associations with censored data by generalized Mantel
and Breslow tests and generalized Kendall correlation. Biometrical J, 26:309–
318, 1984. ⋄518
545. M. Schemper. Non-parametric analysis of treatment-covariate interaction in the
presence of censoring. Stat Med, 7:1257–1266, 1988. ⋄41
546. M. Schemper. The explained variation in proportional hazards regression
(correction in 81:631, 1994). Biometrika, 77:216–218, 1990. ⋄505, 508
547. M. Schemper. Cox analysis of survival data with non-proportional hazard func-
tions. The Statistician, 41:445–455, 1992. ⋄518
548. M. Schemper. Further results on the explained variation in proportional hazards
regression. Biometrika, 79:202–204, 1992. ⋄505
549. M. Schemper. The relative importance of prognostic factors in studies of sur-
vival. Stat Med, 12:2377–2382, 1993. ⋄215, 505
550. M. Schemper. Predictive accuracy and explained variation. Stat Med, 22:2299–
2308, 2003. ⋄519
551. M. Schemper and G. Heinze. Probability imputation revisited for prognostic
factor studies. Stat Med, 16:73–80, 1997. ⋄52, 177
552. M. Schemper and R. Henderson. Predictive accuracy and explained variation in
Cox regression. Biometrics, 56:249–255, 2000. ⋄518
553. M. Schemper and T. L. Smith. Efficient evaluation of treatment effects in the
presence of missing covariate values. Stat Med, 9:777–784, 1990. ⋄52
554. M. Schemper and J. Stare. Explained variation in survival analysis. Stat Med,
15:1999–2012, 1996. ⋄215, 519
555. M. Schmid and S. Potapov. A comparison of estimators to evaluate the dis-
criminatory power of time-to-event models. Stat Med, 31(23):2588–2609, 2012.
⋄519
556. C. Schmoor, K. Ulm, and M. Schumacher. Comparison of the Cox model and
the regression tree procedure in analysing a randomized clinical trial. Stat Med,
12:2351–2366, 1993. ⋄41
557. D. Schoenfeld. Partial residuals for the proportional hazards regression model.
Biometrika, 69:239–241, 1982. ⋄314, 498, 499, 516
558. D. A. Schoenfeld. Sample size formulae for the proportional hazards regression
model. Biometrics, 39:499–503, 1983. ⋄513
559. G. Schulgen, B. Lausen, J. Olsen, and M. Schumacher. Outcome-oriented cut-
points in quantitative exposure. Am J Epi, 120:172–184, 1994. ⋄19, 20
560. G. Schwarz. Estimating the dimension of a model. Ann Stat, 6:461–464, 1978.
⋄214
561. S. C. Scott, M. S. Goldberg, and N. E. Mayo. Statistical assessment of ordinal
outcomes in comparative studies. J Clin Epi, 50:45–55, 1997. ⋄324
562. M. R. Segal. Regression trees for censored data. Biometrics, 44:35–47, 1988. ⋄
41
563. S. Senn. Change from baseline and analysis of covariance revisited. Stat Med,
25:4334–4344, 2006. ⋄159, 160
564. S. Senn and S. Julious. Measurement in clinical trials: A neglected issue for
statisticians? (with discussion). Stat Med, 28:3189–3225, 2009. ⋄313
565. J. Shao. Linear model selection by cross-validation. J Am Stat Assoc, 88:486–
494, 1993. ⋄100, 113, 122
566. J. Shao and R. R. Sitter. Bootstrap for imputed survey data. J Am Stat Assoc,
91:1278–1288, 1996. ⋄54
567. X. Shen, H. Huang, and J. Ye. Inference after model selection. J Am Stat Assoc,
99:751–762, 2004. ⋄102
568. Y. Shen and P. F. Thall. Parametric likelihoods for multiple non-fatal competing
risks and death. Stat Med, 17:999–1015, 1998. ⋄421
569. J. Siddique. Multiple imputation using an iterative hot-deck with distance-based
donor selection. Stat Med, 27:83–102, 2008. ⋄58
570. R. Simon and R. W. Makuch. A non-parametric graphical representation of
the relationship between survival and the occurrence of an event: Application
to responder versus non-responder bias. Stat Med, 3:35–44, 1984. ⋄401, 420
571. J. S. Simonoff. The “Unusual Episode” and a second statistics course. J Stat
Edu, 5(1), 1997. Online journal at www.amstat.org/publications/jse/v5n1/simonoff.html. ⋄291
572. S. L. Simpson, L. J. Edwards, K. E. Muller, P. K. Sen, and M. A. Styner. A
linear exponent AR(1) family of correlation structures. Stat Med, 29:1825–1838,
2010. ⋄148
573. J. C. Sinclair and M. B. Bracken. Clinically useful measures of effect in binary
analyses of randomized trials. J Clin Epi, 47:881–889, 1994. ⋄272
574. J. D. Singer and J. B. Willett. Modeling the days of our lives: Using survival
analysis when designing and analyzing longitudinal studies of duration and the
timing of events. Psych Bull, 110:268–290, 1991. ⋄420
575. L. A. Sleeper and D. P. Harrington. Regression splines in the Cox model with
application to covariate effects in liver disease. J Am Stat Assoc, 85:941–949,
1990. ⋄23, 40
576. A. F. M. Smith and D. J. Spiegelhalter. Bayes factors and choice criteria for
linear models. J Roy Stat Soc B, 42:213–220, 1980. ⋄214
577. L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. Problems and potentials
in modeling survival. In M. L. Grady and H. A. Schwartz, editors, Medical
Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No.
92-0056, pages 151–159. US Dept. of Health and Human Services, Agency for
Health Care Policy and Research, Rockville, MD, 1992. ⋄72
578. P. L. Smith. Splines as a useful and convenient statistical tool. Am Statistician,
33:57–62, 1979. ⋄40
579. R. H. Somers. A new asymmetric measure of association for ordinal variables.
Am Soc Rev, 27:799–811, 1962. ⋄257, 505
580. A. Spanos, F. E. Harrell, and D. T. Durack. Differential diagnosis of acute
meningitis: An analysis of the predictive value of initial observations. JAMA,
262:2700–2707, 1989. ⋄266, 267, 268
581. I. Spence and R. F. Garrison. A remarkable scatterplot. Am Statistician, 47:12–
19, 1993. ⋄91
582. D. J. Spiegelhalter. Probabilistic prediction in patient management and clinical
trials. Stat Med, 5:421–433, 1986. ⋄97, 101, 115, 116, 523
583. D. M. Stablein, W. H. Carter, and J. W. Novak. Analysis of survival data with
nonproportional hazard functions. Controlled Clin Trials, 2:149–159, 1981. ⋄
500
584. N. Stallard. Simple tests for the external validation of mortality prediction
scores. Stat Med, 28:377–388, 2009. ⋄237
585. J. Stare, F. E. Harrell, and H. Heinzl. BJ: An S-Plus program to fit linear
regression models to censored data using the Buckley and James method. Comp
Meth Prog Biomed, 64:45–52, 2001. ⋄447
586. E. W. Steyerberg. Clinical Prediction Models. Springer, New York, 2009. ⋄viii
587. E. W. Steyerberg, S. E. Bleeker, H. A. Moll, D. E. Grobbee, and K. G. M. Moons.
Internal and external validation of predictive models: A simulation study of bias
and precision in small samples. J Clin Epi, 56(5):441–447, 2003. ⋄123
588. E. W. Steyerberg, P. M. M. Bossuyt, and K. L. Lee. Clinical trials in acute
myocardial infarction: Should we adjust for baseline characteristics? Am Heart
J, 139:745–751, 2000. Editorial, pp. 761–763. ⋄4, 231
589. E. W. Steyerberg, M. J. C. Eijkemans, F. E. Harrell, and J. D. F. Habbema.
Prognostic modelling with logistic regression analysis: A comparison of selection
and estimation methods in small data sets. Stat Med, 19:1059–1079, 2000. ⋄69,
100, 286
590. E. W. Steyerberg, M. J. C. Eijkemans, F. E. Harrell, and J. D. F. Habbema.
Prognostic modeling with logistic regression analysis: In search of a sensible
strategy in small data sets. Med Decis Mak, 21:45–56, 2001. ⋄100, 271
591. E. W. Steyerberg, F. E. Harrell, G. J. J. M. Borsboom, M. J. C. Eijkemans,
Y. Vergouwe, and J. D. F. Habbema. Internal validation of predictive models:
Efficiency of some procedures for logistic regression analysis. J Clin Epi, 54:774–
781, 2001. ⋄115
592. E. W. Steyerberg, A. J. Vickers, N. R. Cook, T. Gerds, M. Gonen, N. Obuchowski,
M. J. Pencina, and M. W. Kattan. Assessing the performance of prediction
models: A framework for traditional and novel measures. Epidemiology,
21(1):128–138, 2010. ⋄101
593. C. J. Stone. Comment: Generalized additive models. Statistical Sci, 1:312–314,
1986. ⋄26, 28
594. C. J. Stone, M. H. Hansen, C. Kooperberg, and Y. K. Truong. Polynomial
splines and their tensor products in extended linear modeling (with discussion).
Ann Stat, 25:1371–1470, 1997. ⋄420, 450
595. C. J. Stone and C. Y. Koo. Additive splines in statistics. In Proceedings of the
Statistical Computing Section ASA, pages 45–48, Washington, DC, 1985. ⋄24,
28, 41
596. D. Strauss and R. Shavelle. An extended Kaplan–Meier estimator and its ap-
plications. Stat Med, 17:971–982, 1998. ⋄416
597. S. Suissa and L. Blais. Binary regression with continuous outcomes. Stat Med,
14:247–255, 1995. ⋄11, 19
598. G. Sun, T. L. Shook, and G. L. Kay. Inappropriate use of bivariable analysis
to screen risk factors for use in multivariable analysis. J Clin Epi, 49:907–916,
1996. ⋄72
599. B. Tai, D. Machin, I. White, and V. Gebski. Competing risks analysis of patients
with osteosarcoma: a comparison of four different approaches. Stat Med, 20:661–
684, 2001. ⋄420
600. J. M. G. Taylor, A. L. Siqueira, and R. E. Weiss. The cost of adding parameters
to a model. J Roy Stat Soc B, 58:593–607, 1996. ⋄101
601. R Core Team. R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria, 2015. ISBN 3-900051-
07-0. ⋄127
602. H. T. Thaler. Nonparametric estimation of the hazard ratio. J Am Stat Assoc,
79:290–293, 1984. ⋄518
603. P. F. Thall and J. M. Lachin. Assessment of stratum-covariate interactions in
Cox’s proportional hazards regression model. Stat Med, 5:73–83, 1986. ⋄482
604. T. Therneau and P. Grambsch. Modeling Survival Data: Extending the Cox
Model. Springer-Verlag, New York, 2000. ⋄420, 447, 478, 517
605. T. M. Therneau, P. M. Grambsch, and T. R. Fleming. Martingale-based residu-
als for survival models. Biometrika, 77:216–218, 1990. ⋄197, 413, 487, 493, 494,
504
652. Y. Wang and J. M. G. Taylor. Inference for smooth curves in longitudinal data
with application to an AIDS clinical trial. Stat Med, 14:1205–1218, 1995. ⋄215
653. Y. Wang, G. Wahba, C. Gu, R. Klein, and B. Klein. Using smoothing spline
ANOVA to examine the relation of risk factors to the incidence and progression
of diabetic retinopathy. Stat Med, 16:1357–1376, 1997. ⋄41
654. Y. Wax. Collinearity diagnosis for a relative risk regression analysis: An appli-
cation to assessment of diet-cancer relationship in epidemiological studies. Stat
Med, 11:1273–1287, 1992. ⋄79, 138, 255
655. L. J. Wei, D. Y. Lin, and L. Weissfeld. Regression analysis of multivariate
incomplete failure time data by modeling marginal distributions. J Am Stat
Assoc, 84:1065–1073, 1989. ⋄417
656. R. E. Weiss. The influence of variable selection: A Bayesian diagnostic perspec-
tive. J Am Stat Assoc, 90:619–625, 1995. ⋄100
657. S. Wellek. A log-rank test for equivalence of two survivor functions. Biometrics,
49:877–881, 1993. ⋄450
658. T. L. Wenger, F. E. Harrell, K. K. Brown, S. Lederman, and H. C. Strauss.
Ventricular fibrillation following canine coronary reperfusion: Different outcomes
with pentobarbital and α-chloralose. Can J Phys Pharm, 62:224–228, 1984. ⋄
266
659. H. White. A heteroskedasticity-consistent covariance matrix estimator and a
direct test for heteroskedasticity. Econometrica, 48:817–838, 1980. ⋄196
660. I. R. White and J. B. Carlin. Bias and efficiency of multiple imputation
compared with complete-case analysis for missing covariate values. Stat Med,
29:2920–2931, 2010. ⋄59
661. I. R. White and P. Royston. Imputing missing covariate values for the Cox
model. Stat Med, 28:1982–1998, 2009. ⋄54
662. I. R. White, P. Royston, and A. M. Wood. Multiple imputation using chained
equations: Issues and guidance for practice. Stat Med, 30(4):377–399, 2011. ⋄
53, 54, 58
663. A. Whitehead, R. Z. Omar, J. P. T. Higgins, E. Savaluny, R. M. Turner, and
S. G. Thompson. Meta-analysis of ordinal outcomes using individual patient
data. Stat Med, 20:2243–2260, 2001. ⋄324
664. J. Whitehead. Sample size calculations for ordered categorical data. Stat Med,
12:2257–2271, 1993. See letter to the editor in SM 15:1065–1066 for the binary
case; see errata in SM 13:871, 1994; see kol95com, jul96sam. ⋄2, 73, 313, 324
665. J. Whittaker. Model interpretation from the additive elements of the likelihood
function. Appl Stat, 33:52–64, 1984. ⋄205, 207
666. A. S. Whittemore and J. B. Keller. Survival estimation using splines. Biometrics,
42:495–506, 1986. ⋄420
667. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer, New York,
2009. ⋄xi
668. R. E. Wiegand. Performance of using multiple stepwise algorithms for variable
selection. Stat Med, 29:1647–1659, 2010. ⋄100
669. A. R. Willan, W. Ross, and T. A. MacKenzie. Comparing in-patient classifica-
tion systems: A problem of non-nested regression models. Stat Med, 11:1321–
1331, 1992. ⋄205, 215
670. A. Winnett and P. Sasieni. A note on scaled Schoenfeld residuals for the pro-
portional hazards model. Biometrika, 88:565–571, 2001. ⋄518
671. A. Winnett and P. Sasieni. Iterated residuals and time-varying covariate effects
in Cox regression. J Roy Stat Soc B, 65:473–488, 2003. ⋄518
672. D. M. Witten and R. Tibshirani. Testing significance of features by lassoed
principal components. Ann Appl Stat, 2(3):986–1012, 2008. ⋄175
673. A. M. Wood, I. R. White, and S. G. Thompson. Are missing outcome data
adequately handled? A review of published randomized controlled trials in major
medical journals. Clin Trials, 1:368–376, 2004. ⋄58
Index

  Titanic, 291
degrees of freedom, 193
  effective, 30, 41, 77, 96, 136, 210, 269
  generalized, 10
  phantom, 35, 111
delayed entry, 401
delta method, 439
describe, 129, 291, 453
deviance, 236, 449, 487, 516
DFBETA, 91
DFBETAS, 91
DFFIT, 91
DFFITS, 91
diabetes, see datasets, 365
difference in predictions, 192, 201
dimensionality, 88
discriminant analysis, 220, 230, 272
discrimination, see accuracy
distribution, 317
  t, 186
  binomial, 73, 181, 194, 235
  Cauchy, 362
  exponential, 142, 407, 408, 425, 427, 451
  extreme value, 362, 363, 427, 437
  Gumbel, 362, 363
  log-logistic, 9, 423, 427, 440, 442, 503
  log-normal, 9, 106, 391, 423, 427, 442, 463, 464
  normal, 187
  Weibull, 39, 408, 420, 426, 432–437, 444, 448
dose-response, 523
doubly nonlinear, 131
drop-in, 513
dropouts, 143
dummy variable, 1, see indicator variable, 75, 129, 130, 209, 210

E
economists, 71
effective.df, 134, 136, 345, 346
Emax, 353
epidemiology, 38
estimation, 2, 98, 104
estimator
  Buckley–James, 447, 449
  maximum likelihood, 181
  mean, 362
  penalized, see maximum likelihood, 175
  quantile, 362
  self-consistent, 525
  smearing, 392, 393
explained variation, 273
exponential distribution, see distribution
ExProb, 135
external validation, see validation of model

F
failure time, 399
fastbw, 133, 134, 137, 280, 286, 351, 469
feature selection, 94
financial data, 3
fit.mult.impute, 54, 306
Fleming–Harrington survival function estimator, see survival function
formula, 134
fractional polynomial, 40
Function, 134, 135, 138, 149, 310, 395
functions, generating R code, 395

G
GAM, see generalized additive model
gam package, 390
GDF, see degrees of freedom
GEE, 147
  predictive mean matching, 51, 52, 55
  single, 52, 56, 57, 138, 171, 275, 276, 334
impute, 129, 135, 138, 171, 276, 277, 334, 461
incidence
  crude, 416
  cumulative, 415
incomplete principal component regression, 170, 275
indicator variable, 16, 17, 38, 39
infinite regression coefficient, 234
influential observations, 90–92, 116, 255, 256, 269, 504
information function, 182, 183
information matrix, 79, 188, 189, 191, 196, 208, 211, 232, 346
informative missing, see missing data
interaction, 16, 36, 375
interquartile-range effect, 104, 136
intracluster correlation, 135, 141, 197, 417
isotropic correlation structure, see correlation structures

J
jackknife, 113, 504

K
Kalbfleisch–Prentice estimator, see survival function
Kaplan–Meier estimator, see survival function
knots, 22
Kullback–Leibler information, 215

L
landmark survival time analysis, 447
lasso, 71, 100, 121, 175, 356
LaTeX, 129, 536
latex, 129, 134, 135, 137, 138, 149, 246, 282, 292, 336, 342, 346, 453, 466, 470, 536
lattice package, 134
least squares
  censored, 447
leave-out-one, see validation of model
left truncation, 401, 420
life expectancy, 4, 408, 472
lift curve, 5
likelihood function, 182, 187, 188, 190, 194, 195, 424, 425, 476
  partial, 477
likelihood ratio test, 185–186, 189–191, 193–195, 198, 204, 205, 207, 228, 240
linear model, 73, 74, 143, 311, 359, 361, 362, 364, 368, 370, 372
  case study, 143
linear spline, see spline function
link function, 15
  Cauchy, 362
  complementary log-log, 362
  log-log, 362
  probit, 362
lm, 131
lme, 149
local regression, see nonparametric
loess, see nonparametric
loess, 29, 142, 493
log-rank, see hypothesis test
LOGISTIC, 315
logistic model
  binary, 219–231
    case study 1, 275–288
    case study 2, 291–310
  conditional, 483
  continuation ratio, 319–323
    case study, 338–340
  extended continuation ratio, 321–322
    case study, 340–355
N
na.action, 131
na.delete, 131, 132
na.detail.response, 131
na.fail, 132
na.fun.response, 131
na.omit, 132
naclus, 47, 142, 302, 458, 461
naplot, 47, 302, 461
naprint, 135
naresid, 132, 135
natural spline, see restricted cubic spline
nearest neighbor, 51
Nelson estimator, see survival function, 422
Newlabels, 473
Newton–Raphson algorithm, 193, 195, 196, 209, 231, 426
NHANES, 365
nlme package, 131, 148, 149
noise, 34, 68, 69, 72, 209, 488, 523
nomogram, 104, 268, 310, 318, 353, 514, 531
nomogram, 135, 138, 149, 282, 319, 353, 473, 514
non-proportional hazards, 73, 450, 506

O
object-oriented program, x, 127, 133
observational study, 3, 58, 230, 400
odds ratio, 222, 224, 318
OLS, see linear model
ols, 131, 135, 137, 350, 351, 448, 469, 470
optimism, 109, 111, 114, 391
ordered, 133
ordinal model, 311, 359, 361–363, 370, 371
  case study, 327–356, 359–387
  probit, 364
ordinal response, see response
ordinality, see assumptions
orm, 131, 135, 319, 362, 363
outlier, 116, 294
overadjustment, 2
overfitting, 72, 109–110

P
parsimony, 87, 97, 119
partial effect plot, 104, 318
partial residual, see residual
partial test, see hypothesis test
PC, see principal component, 170, 172, 175, 275