
ST 745

Analysis of Survival Data


Lecture Notes
(Modified from Dr. A. Tsiatis' Lecture Notes)
Daowen Zhang
Department of Statistics
North Carolina State University
© 2005 by Anastasios Tsiatis and Daowen Zhang
ST 745-001: Analysis of Survival Data
Spring, 2005
Textbook 1. Survival Analysis: Techniques for Censored and Truncated Data (2nd
edition) by John P. Klein and Melvin L. Moeschberger (the website
http://www.biostat.mcw.edu/homepgs/klein/book.html contains
some data sets and SAS macros used in the book)
2. Survival Analysis Using The SAS System: A Practical Guide
by Paul D. Allison (data sets and macros used in the book can be found from
http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=55233
Lecture notes Lecture notes can be downloaded from the class web site:
http://www4.stat.ncsu.edu/~dzhang2/st745/index.html
Class Hours Tu & Th 2:35 - 3:50
Location Room 00110 Polk Hall
Instructor Dr. Daowen Zhang
203 B Patterson
515-1933
Office Hours 3:00 - 4:00 MWF or by appointment
Teaching Assistant TBA
Office Hours TBA
Evaluation Homework (will be given about every 10 days): 15%
Mid-term (in-class, open-book & open-note, to be given around Spring break): 35%
Final (comprehensive, in-class, open-book & open-note, 1:00 - 4:00, 5/3/2005, Tuesday): 50%
References 1. J.F. Lawless, Statistical Models and Methods for Lifetime Data,
Wiley, 1982. (this is a more technical reference)
2. Therneau, T.M. and Grambsch, P.M., Modeling Survival Data:
Extending the Cox Model (the website
http://www.mayo.edu/hsr/people/therneau/book/book.html contains
additional materials such as data sets and software)
3. W.N. Venables and B.D. Ripley, Modern Applied Statistics with S-
PLUS, 2nd edition, Springer 1997.
Software SAS: Procs Lifetest, Lifereg, Phreg.
R: lifetab(), survfit(), survdiff(), survreg(), coxph().
1 Survival Analysis
In many biomedical applications the primary endpoint of interest is time to a certain event.
Examples are
time to death;
time it takes for a patient to respond to a therapy;
time from response until disease relapse (i.e., disease returns); etc.
We may be interested in characterizing the distribution of time to event for a given population, as well as comparing this time to event among different groups (e.g., treatment vs. control in a clinical trial or an observational study), or modeling the relationship of time to event to other covariates (sometimes called prognostic factors or predictors). Typically, in biomedical applications the data are collected over a finite period of time, and consequently the time to event may not be observed for all the individuals in our study population (sample). This results in what is called censored data. That is, the time to event for those individuals who have not experienced the event under study is censored (by the end of study). It is also common for the amount of follow-up of the individuals in a sample to vary from subject to subject. The combination of censoring and differential follow-up creates some unusual difficulties in the analysis of such data that cannot be handled properly by standard statistical methods. Because of this, a new research area in statistics has emerged, called Survival Analysis or Censored Survival Analysis.
To study such data, we must introduce some notation and concepts for describing the distribution of time to event for a population of individuals. Let the random variable T denote the time to the event of our interest. Of course, T is a positive random variable which has to be unambiguously defined; that is, we must be very specific about the start and end points, with the length of the time period in between corresponding to T.
Some examples:
Survival time (in general): measured from birth to death for an individual. This is the survival time we need to investigate in a life expectancy study.
Survival time of a treatment for a population with a certain disease: measured from the time of treatment initiation until death.
Survival time due to heart disease (the event is death from heart disease): measured from birth (or another time point such as treatment initiation for heart disease patients) to death caused by heart disease. (This may be a bit tricky if individuals die from other causes. This is a competing risk problem. That is, other risks are competing with heart disease to produce the event death.)
The time of interest may be time to something good happening. For example, we may be interested in how long it takes to eradicate an infection after treatment with antibiotics.
Describing the Distribution of Time to an Event
In routine data analysis, we may first present some summary statistics such as the mean, the standard error of the mean, etc. In analyzing survival data, however, because of possible censoring, these summary statistics may not have the desired statistical properties, such as unbiasedness. For example, the sample mean is no longer an unbiased estimator of the population mean (of survival time). So we need other methods to present our data. One way is to estimate the underlying true distribution. When this distribution is estimated (either parametrically or nonparametrically), we can then estimate other quantities of interest, such as the mean, median, etc., of the survival time.
The distribution of the random variable T can be described in a number of equivalent ways. There is of course the usual (cumulative) distribution function

F(t) = P[T ≤ t],  t ≥ 0,   (1.1)
which is right continuous, i.e., lim_{u→t+} F(u) = F(t). When T is a survival time, F(t) is the probability that a randomly selected subject from the population will die before time t.
If T is a continuous random variable, then it has a density function f(t), which is related to F(t) through the following equations:

f(t) = dF(t)/dt,  F(t) = ∫_0^t f(u) du.   (1.2)
In biomedical applications, it is common to use the survival function

S(t) = P[T ≥ t] = 1 − F(t−),   (1.3)

where F(t−) = lim_{u→t−} F(u). When T is a survival time, S(t) is the probability that a randomly selected individual will survive to time t or beyond. (Hence the name survival function.)
Note: Some authors use the following definition of a survival function:

S(t) = P[T > t] = 1 − F(t).

This definition is identical to the above one if T is a continuous random variable, which is the case we will focus on in this course.
The survival function S(t) is a non-increasing function over time taking the value 1 at t = 0, i.e., S(0) = 1. For a proper random variable T, S(∞) = 0, which means that everyone will eventually experience the event. However, we will also allow the possibility that S(∞) > 0. This corresponds to a situation where there is a positive probability of not dying or not experiencing the event. For example, if the event of interest is the time from response until disease relapse and the disease has a cure for some proportion of individuals in the population, then we have S(∞) > 0, where S(∞) corresponds to the proportion of cured individuals.
Obviously, if T is a continuous r.v., we have

S(t) = ∫_t^∞ f(u) du,  f(t) = −dS(t)/dt.   (1.4)
That is, there is a one-to-one correspondence between f(t) and S(t).
Mean Survival Time: μ = E(T). Due to censoring, the sample mean of observed survival times is no longer an unbiased estimate of μ = E(T). If we can estimate S(t) well, then we can estimate μ = E(T) using the following fact:

E(T) = ∫_0^∞ S(t) dt.   (1.5)

Median Survival Time: The median survival time m is defined as the quantity satisfying S(m) = 0.5, sometimes denoted by t_{0.5}. If S(t) is not strictly decreasing, m is the smallest value such that S(m) ≤ 0.5.
pth quantile of Survival Time (100pth percentile): t_p such that S(t_p) = 1 − p (0 < p < 1). If S(t) is not strictly decreasing, t_p is the smallest value such that S(t_p) ≤ 1 − p.
Mean Residual Life Time (mrl):

mrl(t_0) = E[T − t_0 | T ≥ t_0],   (1.6)

i.e., mrl(t_0) is the average remaining survival time given that the population has survived beyond t_0. It can be shown (by writing mrl(t_0) = ∫_{t_0}^∞ (t − t_0) f(t) dt / S(t_0) and integrating by parts) that

mrl(t_0) = ∫_{t_0}^∞ S(t) dt / S(t_0).   (1.7)
For example, in the hypothetical population shown in Figure 1.1, 70% of the individuals will survive 2 years (i.e., t_{0.3} = 2) and the median survival time is 2.8 years (i.e., 50% of the population will survive at least 2.8 years).
We say that the survival distribution for group 1 is stochastically larger than the survival distribution for group 2 if S_1(t) ≥ S_2(t) for all t ≥ 0, where S_i(t) is the survival function for group i. If T_i is the corresponding survival time for group i, we also say that T_1 is stochastically (not deterministically) larger than T_2. Note that T_1 being stochastically larger than T_2 does NOT necessarily imply that T_1 ≥ T_2. The situation is illustrated in Figure 1.2.
Figure 1.1: The survival function for a hypothetical population. [Plot of survival probability against time in years; the points (2, 0.7) and (2.8, 0.5) on the curve are marked.]
Note: At any time point a greater proportion of group 1 will survive as compared to group
2.
Figure 1.2: Illustration that T_1 is stochastically larger than T_2. [Plot of S_1(t) and S_2(t) against time in years, with S_1(t) lying above S_2(t) for all t.]
Hazard Rate
The hazard rate is a useful way of describing the distribution of time to event because it has a natural interpretation that relates to the aging of a population. This terminology is very popular in the biomedical community. We motivate the definition of the hazard rate by first defining the mortality rate, which is a discrete version of the hazard rate.
The mortality rate at time t, where t is generally taken to be an integer in terms of some unit of time (e.g., years, months, days, etc.), is the proportion of the population who fail (die) between times t and t + 1 among individuals alive at time t, i.e.,

m(t) = P[t ≤ T < t + 1 | T ≥ t].   (1.8)
In a human population, the mortality rate has the typical pattern shown in Figure 1.3.
Figure 1.3: A typical mortality pattern for humans. [Plot of m(t) against age in years, 0 to 100.]
The hazard rate λ(t) is the limit of the mortality rate as the interval of time is taken to be small (rather than one unit). The hazard rate is the instantaneous rate of failure (experiencing the event) at time t, given that the individual is alive at time t.
Specifically, the hazard rate λ(t) is defined by the following equation:

λ(t) = lim_{h→0} P[t ≤ T < t + h | T ≥ t] / h.   (1.9)

Therefore, if h is very small, we have

P[t ≤ T < t + h | T ≥ t] ≈ λ(t)h.   (1.10)
The definition of the hazard function implies that

λ(t) = lim_{h→0} {P[t ≤ T < t + h]/h} / P[T ≥ t] = f(t)/S(t)   (1.11)
     = −S′(t)/S(t) = −d log{S(t)}/dt.   (1.12)

From this, we can integrate both sides to get

Λ(t) = ∫_0^t λ(u) du = −log{S(t)},   (1.13)

where Λ(t) is referred to as the cumulative hazard function. Here we used the fact that S(0) = 1. Hence,

S(t) = e^{−Λ(t)} = e^{−∫_0^t λ(u) du}.   (1.14)
Figure 1.4: Three hazard patterns. [Plot of hazard rate against age in years, showing an increasing hazard, a decreasing hazard, and a constant hazard.]
Note:
1. There is a one-to-one relationship between the hazard rate λ(t), t ≥ 0, and the survival function S(t), namely

S(t) = e^{−∫_0^t λ(u) du}  and  λ(t) = −d log{S(t)}/dt.   (1.15)

2. The hazard rate is NOT a probability; it is a probability rate. Therefore it is possible for a hazard rate to exceed one, in the same fashion as a density function f(t) may exceed one.
Common Parametric Models:

Distribution           λ(t)          S(t)           Density f(t)                E(T)
Exponential (λ > 0)    λ             e^{−λt}        λe^{−λt}                    1/λ
Weibull (α, λ > 0)     αλt^{α−1}     e^{−λt^α}      αλt^{α−1} e^{−λt^α}         Γ(1 + 1/α)/λ^{1/α}
Gamma (α, λ > 0)       f(t)/S(t)     1 − I(λt, α)   λ^α t^{α−1} e^{−λt}/Γ(α)    α/λ

Here I(t, α) = ∫_0^t u^{α−1} e^{−u}/Γ(α) du is the incomplete gamma function.
See page 38 of Klein and Moeschberger and Chapter 5 of the lecture notes for more distributions.
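To connect the table with the hazard shapes in Figure 1.4, here is a small R sketch (base graphics; the value λ = 0.2 and the three choices of α are arbitrary, for illustration only):

# hazard of a Weibull(alpha, lambda) model: alpha * lambda * t^(alpha - 1);
# alpha = 1 reduces to the constant exponential hazard lambda
weib.haz <- function(t, alpha, lambda) alpha * lambda * t^(alpha - 1)
t <- seq(0.01, 10, by = 0.01)
plot(t, weib.haz(t, 1.5, 0.2), type = "l", ylim = c(0, 1),
     xlab = "t", ylab = "hazard rate")      # increasing hazard (alpha > 1)
lines(t, weib.haz(t, 0.5, 0.2), lty = 2)    # decreasing hazard (alpha < 1)
lines(t, weib.haz(t, 1.0, 0.2), lty = 3)    # constant hazard (alpha = 1)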
Exponential distribution: λ(t) = λ, S(t) = e^{−λt} and f(t) = λe^{−λt}. So the mean survival time is

μ = E(T) = ∫_0^∞ t f(t) dt = ∫_0^∞ S(t) dt = ∫_0^∞ e^{−λt} dt = 1/λ.

Setting S(t_{0.5}) = e^{−λt_{0.5}} = 0.5, the median survival time is t_{0.5} = log 2/λ.
The mean residual life time after t_0 is

mrl(t_0) = ∫_{t_0}^∞ S(t) dt / S(t_0) = ∫_{t_0}^∞ e^{−λt} dt / e^{−λt_0} = 1/λ = E(T),

reflecting the memoryless property of the exponential distribution.
Sometimes it is useful to plot the survival distribution on a log scale. By doing so, we can identify the hazard rate as minus the derivative of this function. In particular, on a log scale the exponential distribution is a straight line. This is because S(t) = e^{−λt} for the exponential distribution, so

log[S(t)] = −λt.

The above equation gives us a way to check whether the underlying true distribution of the survival time is exponential, given a data set. Suppose we have an estimate Ŝ(t) of S(t) obtained without assuming any distribution for the survival time (the Kaplan-Meier estimate to be discussed in Chapter 2 is such an estimate). Then we can plot log[Ŝ(t)] vs. t to see whether it is approximately a straight line. An (approximately) straight line indicates that the exponential distribution may be a reasonable choice for the data.
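As a quick sketch of this check in R (using the survival package and, for concreteness, the small hypothetical data set that will appear in Figure 2.3 of Chapter 2; current versions of the package use the formula interface Surv(...) ~ 1):

library(survival)
# hypothetical observed times and event indicators (1 = death, 0 = censored)
survtime <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
status   <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)
fit <- survfit(Surv(survtime, status) ~ 1)   # Kaplan-Meier estimate
# an (approximately) straight line through the origin supports an exponential model
plot(fit$time, log(fit$surv), xlab = "t", ylab = "log of KM estimate")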
An alternative is to assume the exponential distribution for the data and obtain an estimate of S(t) = e^{−λt} (we only need to estimate λ; this kind of estimation will be discussed in Chapter 3). Denote this estimate by Ŝ_1(t) and the Kaplan-Meier estimate by Ŝ_KM(t). If the exponential distribution assumption is correct, both estimates will be good estimates of the same survival function S(t) = e^{−λt}. Therefore, Ŝ_1(t) and Ŝ_KM(t) should be close to each other, and hence the plot of Ŝ_1(t) vs. Ŝ_KM(t) should be approximately a straight line. A non-straight line indicates that the exponential distributional assumption is not appropriate.
Figure 1.5: The survival function of an exponential distribution on two scales. [Left panel: survival function on the original scale; right panel: survival function on a log scale, where it is a straight line.]
Weibull distribution: λ(t) = αλt^{α−1}, S(t) = e^{−λt^α}. Note this model allows

constant hazard:   α = 1;
increasing hazard: α > 1;
decreasing hazard: α < 1;

and has the hazard patterns shown in Figure 1.4. The mean survival time is

μ = E(T) = ∫_0^∞ S(t) dt = ∫_0^∞ e^{−λt^α} dt = Γ(1 + 1/α)/λ^{1/α}.
The median survival time t_{0.5} satisfies e^{−λ t_{0.5}^α} = 0.5, so t_{0.5} = (log 2/λ)^{1/α}.
Since log S(t) = −λt^α, we have

log{−log S(t)} = log λ + α log t.

A straight line in the plot of log{−log Ŝ(t)} vs. log t indicates a Weibull model. We can use the above equation to check whether the Weibull model is a reasonable choice for the survival time given a data set. Alternatively, we can assume a Weibull model for the survival time, use the data to estimate S(t), and plot this estimate against the Kaplan-Meier estimate as we proposed for the exponential distribution. An (approximately) straight line indicates that the Weibull model is a reasonable choice for the data.
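The corresponding Weibull diagnostic, as a sketch in R (same hypothetical data as the exponential check above; times where the estimate is 0 or 1 are dropped so that both logarithms are defined):

library(survival)
survtime <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
status   <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)
fit <- survfit(Surv(survtime, status) ~ 1)
ok <- fit$surv > 0 & fit$surv < 1
# under a Weibull model the points fall near a line with slope alpha
# and intercept log(lambda)
plot(log(fit$time[ok]), log(-log(fit$surv[ok])),
     xlab = "log t", ylab = "log(-log KM estimate)")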
Question: How do we check a Gamma model?
2 Right Censoring and Kaplan-Meier Estimator
In biomedical applications, especially in clinical trials, two important issues arise when studying time to event data (we will assume the event to be death; it can be any event of interest):
1. Some individuals are still alive at the end of the study or analysis, so the event of interest, namely death, has not occurred. Therefore we have right censored data.
2. The length of follow-up varies due to staggered entry, so we cannot observe the event for those individuals with insufficient follow-up time.
Note: It is important to distinguish calendar time and patient time.
Figure 2.1: Illustration of censored data. [Left panel: subjects plotted on calendar time, from study start to study end, with staggered entry; x marks a death, o a censored observation. Right panel: the same subjects on patient time, measured from entry into the study.]
In addition to censoring because of insufficient follow-up (i.e., end-of-study censoring due to staggered entry), other reasons for censoring include:
loss to follow-up: patients stop coming to the clinic or move away;
death from other causes: competing risks.
Censoring from these types of causes may be inherently different from censoring due to staggered entry. We will discuss this in more detail later.
Censoring and differential follow-up create certain difficulties in the analysis of such data, as illustrated by the following example taken from a clinical trial of 146 patients treated after they had a myocardial infarction (MI).
The data have been grouped into one-year intervals, and all time is measured in terms of patient time.
Table 2.1: Data from a clinical trial on myocardial infarction (MI)

Year since         Number alive and under observation   Number dying      Number censored
entry into study   at beginning of interval             during interval   or withdrawn
[0, 1)             146                                  27                 3
[1, 2)             116                                  18                10
[2, 3)              88                                  21                10
[3, 4)              57                                   9                 3
[4, 5)              45                                   1                 3
[5, 6)              41                                   2                11
[6, 7)              28                                   3                 5
[7, 8)              20                                   1                 8
[8, 9)              11                                   2                 1
[9, 10)              8                                   2                 6
Question: Estimate the 5-year survival rate, i.e., S(5) = P[T ≥ 5].
Two naive and incorrect answers are given by

1. F̂(5) = P̂[T < 5] = (76 deaths in 5 years)/(146 individuals) = 52.1%,  Ŝ(5) = 1 − F̂(5) = 47.9%.

2. F̂(5) = P̂[T < 5] = (76 deaths in 5 years)/(146 − 29 withdrawn in 5 years) = 65%,  Ŝ(5) = 1 − F̂(5) = 35%.
Obviously, we can observe the following
1. The first estimate would be correct if all censoring occurred after 5 years. Of course, this was not the case, leading to an overly optimistic estimate (i.e., it overestimates S(5)).
2. The second estimate would be correct if all individuals censored in the 5 years were censored immediately upon entering the study. This was not the case either, leading to an overly pessimistic estimate (i.e., it underestimates S(5)).
Our clinical colleagues have suggested eliminating all individuals who are censored and using the remaining complete data. This would lead to the following estimate:

F̂(5) = P̂[T < 5] = (76 deaths in 5 years)/(146 − 60 censored) = 88.4%,  Ŝ(5) = 1 − F̂(5) = 11.6%.

This is even more pessimistic than the estimate given by (2).
Life-table Estimate
More appropriate methods use the life-table or actuarial method. The problem with the above two estimates is that they both ignore the fact that each one-year interval experienced censoring (or withdrawal). Obviously we need to take this information into account in order to reduce bias. If we can express S(5) as a function of quantities related to each interval and get a very good estimate of each quantity, then intuitively we will get a very good estimate of S(5). By the definition of S(5), we have:

S(5) = P[T ≥ 5] = P[(T ≥ 5) ∩ (T ≥ 4)] = P[T ≥ 4] P[T ≥ 5 | T ≥ 4]
     = P[T ≥ 4] {1 − P[4 ≤ T < 5 | T ≥ 4]} = P[T ≥ 4] q_5
     = P[T ≥ 3] P[T ≥ 4 | T ≥ 3] q_5 = P[T ≥ 3] {1 − P[3 ≤ T < 4 | T ≥ 3]} q_5 = P[T ≥ 3] q_4 q_5
     = ··· = q_1 q_2 q_3 q_4 q_5,

where q_i = 1 − P[i − 1 ≤ T < i | T ≥ i − 1], i = 1, 2, ..., 5. So if we can estimate each q_i well, then we will get a very good estimate of S(5). Note that 1 − q_i is the mortality rate m(x) at year x = i − 1 by our definition.
Table 2.2: Life-table estimate of S(5) assuming censoring occurred at the end of the interval

Duration [t_{i−1}, t_i)   n(x)   d(x)   w(x)   m̂(x) = d(x)/n(x)   1 − m̂(x)   Ŝ_R(t_i) = ∏(1 − m̂(x))
[0, 1)                    146    27      3     0.185              0.815      0.815
[1, 2)                    116    18     10     0.155              0.845      0.689
[2, 3)                     88    21     10     0.239              0.761      0.524
[3, 4)                     57     9      3     0.158              0.842      0.441
[4, 5)                     45     1      3     0.022              0.978      0.432
Case 1: Let us first assume that anyone censored in an interval of time is censored at the end of that interval. Then we can estimate each q_i = 1 − m(i − 1) in the following way:

d(0) ~ Bin(n(0), m(0))  ⟹  m̂(0) = d(0)/n(0) = 27/146 = 0.185,  q̂_1 = 1 − m̂(0) = 0.815
d(1)|H ~ Bin(n(1), m(1))  ⟹  m̂(1) = d(1)/n(1) = 18/116 = 0.155,  q̂_2 = 1 − m̂(1) = 0.845
...

where H denotes the data history (i.e., the data before the second interval).
The life-table estimate would be computed as shown in Table 2.2, so the 5-year survival probability estimate is Ŝ_R(5) = 0.432. (If the assumption that anyone censored in an interval of time is censored at the end of that interval is true, then the estimator Ŝ_R(5) is approximately unbiased for S(5).)
Of course, this estimate Ŝ_R(5) will have variation since it was calculated from a sample. We need to estimate its variation in order to make inference on S(5) (for example, to construct a 95% CI for S(5)).
However, Ŝ_R(5) is a product of 5 estimates (q̂_1 ··· q̂_5), whose variance is not easy to find. But we have

log(Ŝ_R(5)) = log(q̂_1) + log(q̂_2) + log(q̂_3) + log(q̂_4) + log(q̂_5).

So if we can find the variance of each log(q̂_i), we might be able to find the variance of log(Ŝ_R(5)) and hence the variance of Ŝ_R(5).
For this purpose, let us first introduce a very popular method in statistics, the delta method:
Delta Method: If

θ̂ ~ N(θ, σ²)  (asymptotically),

then

f(θ̂) ~ N(f(θ), [f′(θ)]² σ²)  (asymptotically).

Proof of the delta method: If σ² is small, θ̂ will be close to θ with high probability. We hence can expand f(θ̂) about θ using a Taylor expansion:

f(θ̂) ≈ f(θ) + f′(θ)(θ̂ − θ).

We immediately get the (asymptotic) distribution of f(θ̂) from this expansion.
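A minimal simulation sketch of the delta method, for f(p) = log(p) with p̂ a binomial proportion (so σ² = p(1 − p)/n and [f′(p)]² = 1/p²; the sample size, replication count and seed are arbitrary choices):

set.seed(745)
n <- 200; p <- 0.3
phat <- rbinom(10000, size = n, prob = p) / n   # 10000 replications of p-hat
var(log(phat))                # empirical variance of log(p-hat)
(1/p)^2 * p * (1 - p) / n     # delta-method approximation, about 0.0117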
Returning to our problem, let φ̂_i = log(q̂_i). Using the delta method, the variance of φ̂_i is approximately

var(φ̂_i) = (1/q_i)² var(q̂_i).

Therefore we need to find and estimate var(q̂_i). Of course, we also need to find the covariances among the φ̂_i and φ̂_j (i ≠ j). For this purpose, we need the following theorem.
Double expectation theorem (law of iterated conditional expectation and variance): If X and Y are any two random variables (or vectors), then

E(X) = E[E(X|Y)],
Var(X) = Var[E(X|Y)] + E[Var(X|Y)].

Since q̂_i = 1 − m̂(i − 1), we have

var(q̂_i) = var(m̂(i − 1))
         = E[var(m̂(i − 1)|H)] + var[E(m̂(i − 1)|H)]
         = E{ m(i − 1)[1 − m(i − 1)]/n(i − 1) } + var[m(i − 1)]
         = m(i − 1)[1 − m(i − 1)] E[1/n(i − 1)]

(the term var[m(i − 1)] is zero since m(i − 1) is a constant), which can be estimated by

m̂(i − 1)[1 − m̂(i − 1)] / n(i − 1).

Hence the variance of φ̂_i = log(q̂_i) can be approximately estimated by

(1/q̂_i)² · m̂(i − 1)[1 − m̂(i − 1)]/n(i − 1) = m̂(i − 1) / {[1 − m̂(i − 1)] n(i − 1)} = d/[(n − d)n],

writing d = d(i − 1) and n = n(i − 1).
Now let us look at the covariances among the φ̂_i and φ̂_j (i ≠ j). It is very amazing that they are all approximately equal to zero!
For example, let us consider the covariance between φ̂_1 and φ̂_2. Since φ̂_1 = log(q̂_1) and φ̂_2 = log(q̂_2), using the same argument as for the delta method, we only need to find the covariance between q̂_1 and q̂_2, or equivalently, the covariance between m̂(0) and m̂(1). This can be seen from the following:

E[m̂(0)m̂(1)] = E{ E[m̂(0)m̂(1) | n(0), d(0), w(0)] }
            = E{ m̂(0) E[m̂(1) | n(0), d(0), w(0)] }
            = E[m̂(0)m(1)]
            = m(1) E[m̂(0)]
            = m(1)m(0) = E[m̂(0)] E[m̂(1)].

Therefore, the covariance between m̂(0) and m̂(1) is zero. Similarly, we can show that the other covariances are zero. Hence,
var(log(Ŝ_R(5))) = var(φ̂_1) + var(φ̂_2) + var(φ̂_3) + var(φ̂_4) + var(φ̂_5).
Let φ̂ = log(Ŝ_R(5)). Then Ŝ_R(5) = e^{φ̂}. So

var(Ŝ_R(5)) = (e^{φ})² var(log(Ŝ_R(5))) = (S(5))² [var(φ̂_1) + var(φ̂_2) + var(φ̂_3) + var(φ̂_4) + var(φ̂_5)],
which can be estimated by

V̂ar(Ŝ_R(5)) = (Ŝ_R(5))² { d(0)/[(n(0) − d(0))n(0)] + d(1)/[(n(1) − d(1))n(1)] + d(2)/[(n(2) − d(2))n(2)]
              + d(3)/[(n(3) − d(3))n(3)] + d(4)/[(n(4) − d(4))n(4)] }
            = (Ŝ_R(5))² ∑_{i=0}^{4} d(i)/{[n(i) − d(i)] n(i)}.   (2.1)
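A short R sketch reproducing Ŝ_R(5) and the variance estimate (2.1) from the counts in Table 2.2:

n <- c(146, 116, 88, 57, 45)   # n(0), ..., n(4)
d <- c(27, 18, 21, 9, 1)       # d(0), ..., d(4)
S5 <- prod(1 - d/n)                    # 0.432, as in Table 2.2
varS5 <- S5^2 * sum(d/((n - d) * n))   # equation (2.1)
c(S5, sqrt(varS5))                     # estimate and its standard error (about 0.044)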
Case 2: Let us assume that anyone censored in an interval of time is censored right at the beginning of that interval. Then the life-table estimate would be computed as shown in Table 2.3, so the 5-year survival probability estimate is Ŝ_L(5) = 0.400. (In this case, the estimator Ŝ_L(5) is approximately unbiased for S(5).)
The variance estimate of Ŝ_L(5) is similar to that of Ŝ_R(5) except that we need to change the sample size for each mortality estimate to n − w in equation (2.1).
Table 2.3: Life-table estimate of S(5) assuming censoring occurred at the beginning of the interval

Duration [t_{i−1}, t_i)   n(x)   d(x)   w(x)   m̂(x) = d(x)/[n(x) − w(x)]   1 − m̂(x)   Ŝ_L(t_i) = ∏(1 − m̂(x))
[0, 1)                    146    27      3     0.189                       0.811      0.811
[1, 2)                    116    18     10     0.170                       0.830      0.673
[2, 3)                     88    21     10     0.269                       0.731      0.492
[3, 4)                     57     9      3     0.167                       0.833      0.410
[4, 5)                     45     1      3     0.024                       0.976      0.400
The naive estimates range from 35% to 47.9% for the five-year survival probability, with the complete-case (i.e., eliminating anyone censored) estimator giving an estimate of 11.6%. The life-table estimates range from 40% to 43.2% depending on whether we assume censoring occurred at the left (i.e., beginning) or right (i.e., end) of each interval.
More than likely, censoring occurs during the interval. Thus Ŝ_L and Ŝ_R are not correct. A compromise is to use the following modification:
Table 2.4: Life-table estimate of S(5) assuming censoring occurred during the interval

Duration [t_{i−1}, t_i)   n(x)   d(x)   w(x)   m̂(x) = d(x)/[n(x) − w(x)/2]   1 − m̂(x)   Ŝ_LT(t_i) = ∏(1 − m̂(x))
[0, 1)                    146    27      3     0.187                         0.813      0.813
[1, 2)                    116    18     10     0.162                         0.838      0.681
[2, 3)                     88    21     10     0.253                         0.747      0.509
[3, 4)                     57     9      3     0.162                         0.838      0.426
[4, 5)                     45     1      3     0.023                         0.977      0.417
That is, when calculating the mortality estimate in each interval, we use n(x) − w(x)/2 as the sample size. This number is often referred to as the effective sample size.
So the 5-year survival probability estimate is Ŝ_LT(5) = 0.417, which is between Ŝ_L(5) = 0.400 and Ŝ_R(5) = 0.432.
Figure 2.2: Life-table estimate of the survival probability for the MI data. [Plot of survival probability against time in years, 0 to 10.]
Figure 2.2 shows the life-table estimate of the survival probability assuming censoring occurred during the interval. Here the estimates were connected using straight lines; no special significance should be given to this. From this figure, the median survival time is estimated to be about 3 years.
The variance estimate of the life-table estimate Ŝ_LT(5) is similar to equation (2.1) except that the sample size n(i) is changed to n(i) − w(i)/2. That is,

V̂ar(Ŝ_LT(5)) = (Ŝ_LT(5))² ∑_{i=0}^{4} d(i) / {[n(i) − w(i)/2 − d(i)][n(i) − w(i)/2]}.   (2.2)
Of course, we can also use the above formula to calculate the variance of Ŝ_LT(t) at other time points. For example:

V̂ar(Ŝ_LT(1)) = (Ŝ_LT(1))² { d(0) / ([n(0) − w(0)/2 − d(0)][n(0) − w(0)/2]) }
             = 0.813² × 27/[(146 − 3/2 − 27)(146 − 3/2)]
             = 0.813² × 0.001590223 = 0.001051088.

Therefore SE(Ŝ_LT(1)) = √0.001051088 = 0.0324.
The calculation presented in Table 2.4 can be implemented using Proc Lifetest in SAS:
options ls=72 ps=60;
Data mi;
input survtime number status;
cards;
0 27 1
0 3 0
1 18 1
1 10 0
2 21 1
2 10 0
3 9 1
3 3 0
4 1 1
4 3 0
5 2 1
5 11 0
6 3 1
6 5 0
7 1 1
7 8 0
8 2 1
8 1 0
9 2 1
9 6 0
;
proc lifetest method=life intervals=(0 to 10 by 1);
time survtime*status(0);
freq number;
run;
Note that the number of observed events and withdrawals in [t_{i−1}, t_i) were entered after t_{i−1} instead of t_i. Part of the output of the above SAS program is
The LIFETEST Procedure
Life Table Survival Estimates
Effective Conditional
Interval Number Number Sample Probability
[Lower, Upper) Failed Censored Size of Failure
0 1 27 3 144.5 0.1869
1 2 18 10 111.0 0.1622
2 3 21 10 83.0 0.2530
3 4 9 3 55.5 0.1622
4 5 1 3 43.5 0.0230
5 6 2 11 35.5 0.0563
6 7 3 5 25.5 0.1176
7 8 1 8 16.0 0.0625
8 9 2 1 10.5 0.1905
9 10 2 6 5.0 0.4000
Conditional
Probability Survival Median
Interval Standard Standard Residual
[Lower, Upper) Error Survival Failure Error Lifetime
0 1 0.0324 1.0000 0 0 3.1080
1 2 0.0350 0.8131 0.1869 0.0324 4.4265
2 3 0.0477 0.6813 0.3187 0.0393 5.2870
3 4 0.0495 0.5089 0.4911 0.0438 .
4 5 0.0227 0.4264 0.5736 0.0445 .
5 6 0.0387 0.4166 0.5834 0.0446 .
6 7 0.0638 0.3931 0.6069 0.0450 .
7 8 0.0605 0.3469 0.6531 0.0470 .
8 9 0.1212 0.3252 0.6748 0.0488 .
9 10 0.2191 0.2632 0.7368 0.0558 .
Here the numbers in the column Conditional Probability of Failure are the estimated mortalities m̂(x) = d(x)/(n(x) − w(x)/2).
The above life-table estimation can also be implemented using R (the lifetab() function is available, e.g., from the KMsurv package). Here is the R code:
> library(KMsurv)
> tis <- 0:10
> ninit <- 146
> nlost <- c(3,10,10,3,3,11,5,8,1,6)
> nevent <- c(27,18,21,9,1,2,3,1,2,2)
> lifetab(tis, ninit, nlost, nevent)
The output from the above R function is
nsubs nlost nrisk nevent surv pdf hazard se.surv
0-1 146 3 144.5 27 1.0000000 0.186851211 0.20610687 0.00000000
1-2 116 10 111.0 18 0.8131488 0.131861966 0.17647059 0.03242642
2-3 88 10 83.0 21 0.6812868 0.172373775 0.28965517 0.03933747
3-4 57 3 55.5 9 0.5089130 0.082526440 0.17647059 0.04382194
4-5 45 3 43.5 1 0.4263866 0.009801991 0.02325581 0.04452036
5-6 41 11 35.5 2 0.4165846 0.023469556 0.05797101 0.04456288
6-7 28 5 25.5 3 0.3931151 0.046248831 0.12500000 0.04503654
7-8 20 8 16.0 1 0.3468662 0.021679139 0.06451613 0.04699173
8-9 11 1 10.5 2 0.3251871 0.061940398 0.21052632 0.04879991
9-10 8 6 5.0 2 0.2632467 NA NA 0.05579906
se.pdf se.hazard
0-1 0.032426423 0.03945410
1-2 0.028930638 0.04143228
2-3 0.033999501 0.06254153
3-4 0.026163333 0.05859410
4-5 0.009742575 0.02325424
5-6 0.016315545 0.04097447
6-7 0.025635472 0.07202769
7-8 0.021195209 0.06448255
8-9 0.040488466 0.14803755
9-10 NA NA
Note: Here the numbers in the column hazard are the estimated hazard rates at the midpoint of each interval, obtained by assuming that the true survival function S(t) is a straight line in each interval. You can find an explicit expression for this estimator using the relation

λ(t) = f(t)/S(t)

and the assumption that the true survival function S(t) is a straight line on [t_{i−1}, t_i):

S(t) = S(t_{i−1}) + {[S(t_i) − S(t_{i−1})]/(t_i − t_{i−1})} (t − t_{i−1}),  for t ∈ [t_{i−1}, t_i).
These estimates are very close to the mortality estimates we obtained before (the column under Conditional Probability of Failure in the SAS output).
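Carrying this calculation out (a sketch of the answer): under the linear assumption the density is constant on the interval, f(t) = [Ŝ(t_{i−1}) − Ŝ(t_i)]/(t_i − t_{i−1}), and evaluating f(t)/Ŝ(t) at the midpoint t_m = (t_{i−1} + t_i)/2 gives

λ̂(t_m) = d(x) / {(t_i − t_{i−1}) [n(x) − w(x)/2 − d(x)/2]}.

For the first interval this is 27/(144.5 − 13.5) = 0.2061, matching the hazard column of the R output above.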
Kaplan-Meier Estimator
The Kaplan-Meier or product-limit estimator is the limit of the life-table estimator when intervals are taken so small that at most one distinct observation occurs within an interval. Kaplan and Meier demonstrated in a paper in JASA (1958) that this estimator is the maximum likelihood estimate.
Figure 2.3: An illustrative example of the Kaplan-Meier estimator. [Ten patients on the patient-time axis; deaths (x) at 4.5, 7.5, 11.5, 15.5, 16.5 and 19.5 years, censorings (o) at 8.5, 13.5, 17.5 and 21.5 years. At the successive death times the factors 1 − m̂(x) are 9/10, 8/9, 6/7, 4/5, 3/4 and 1/2, giving Ŝ(t) = 9/10, 8/10, 48/70, 192/350, 144/350 and 144/700.]
We will illustrate through the simple example shown in Figure 2.3 how the Kaplan-Meier estimator is constructed.
By convention, the Kaplan-Meier estimate is a right continuous step function which takes jumps only at the death times.
The calculation of the above KM estimate can be implemented using Proc Lifetest in SAS as follows:
Data example;
input survtime censcode;
cards;
4.5 1
7.5 1
8.5 0
11.5 1
13.5 0
15.5 1
16.5 1
17.5 0
19.5 1
21.5 0
;
Proc lifetest;
time survtime*censcode(0);
run;
And part of the output from the above program is
The LIFETEST Procedure
Product-Limit Survival Estimates
Survival
Standard Number Number
SURVTIME Survival Failure Error Failed Left
0.0000 1.0000 0 0 0 10
4.5000 0.9000 0.1000 0.0949 1 9
7.5000 0.8000 0.2000 0.1265 2 8
8.5000* . . . 2 7
11.5000 0.6857 0.3143 0.1515 3 6
13.5000* . . . 3 5
15.5000 0.5486 0.4514 0.1724 4 4
16.5000 0.4114 0.5886 0.1756 5 3
17.5000* . . . 5 2
19.5000 0.2057 0.7943 0.1699 6 1
21.5000* . . . 6 0
* Censored Observation
The above Kaplan-Meier estimate can also be obtained using R function survfit(). The
code is given in the following:
> survtime <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
> status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)
> fit <- survfit(Surv(survtime, status), conf.type=c("plain"))
Then we can use R function summary() to see the output:
> summary(fit)
Call: survfit(formula = Surv(survtime, status), conf.type = c("plain"))
time n.risk n.event survival std.err lower 95% CI upper 95% CI
4.5 10 1 0.900 0.0949 0.7141 1.000
7.5 9 1 0.800 0.1265 0.5521 1.000
11.5 7 1 0.686 0.1515 0.3888 0.983
15.5 5 1 0.549 0.1724 0.2106 0.887
16.5 4 1 0.411 0.1756 0.0673 0.756
19.5 2 1 0.206 0.1699 0.0000 0.539
Let d(x) denote the number of deaths at time x. Generally d(x) is either zero or one, but we allow the possibility of tied survival times, in which case d(x) may be greater than one. Let n(x) denote the number of individuals at risk just prior to time x, i.e., the number of individuals in the sample who neither died nor were censored prior to time x. Then the Kaplan-Meier estimate can be expressed as

KM(t) = ∏_{x ≤ t} [1 − d(x)/n(x)].

Note: In the notation above, the product changes only at times x where d(x) ≥ 1, i.e., only at times where we observed deaths.
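As a sketch, the product formula can be evaluated by hand in R for the Figure 2.3 data (compare with the Proc Lifetest and survfit() output above):

dtimes <- c(4.5, 7.5, 11.5, 15.5, 16.5, 19.5)  # distinct death times x
n.risk <- c(10, 9, 7, 5, 4, 2)                 # n(x): number at risk just prior to x
d      <- c(1, 1, 1, 1, 1, 1)                  # d(x): deaths at x
KM <- cumprod(1 - d/n.risk)
data.frame(time = dtimes, KM = round(KM, 4))   # 0.9, 0.8, 0.6857, 0.5486, 0.4114, 0.2057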
Non-informative Censoring
In order for the life-table estimates to give unbiased results, there is an important assumption: individuals who are censored are at the same risk of subsequent failure as those who are still alive and uncensored. The risk set at any time point (the individuals still alive and uncensored) should be representative of the entire population alive at the same time. If this is the case, the censoring process is called non-informative. Statistically, if the censoring process is independent of the survival time, then we automatically have non-informative censoring. Actually, we almost always mean independent censoring by non-informative censoring.
If censoring occurs only because of staggered entry, then the assumption of non-informative censoring seems plausible. However, when censoring results from loss to follow-up or death from a competing risk, this assumption is more suspect. If at all possible, censoring from these latter situations should be kept to a minimum.
Greenwood's Formula for the Variance of the Life-table Estimator
The derivation given below is heuristic in nature, but it tries to capture some of the salient features of the more rigorous treatments given in the theoretical literature on survival analysis. For this reason, we will use some of the notation that is associated with the counting process approach to survival analysis. In fact, we have already seen it when we discussed the life-table estimator.
It is useful when considering the product-limit estimator to partition time into many small intervals, say with interval length equal to Δx, where Δx is small.
Figure 2.4: Partition of the time axis. [Patient time divided into a grid of intervals of length Δx.]
Let x denote some arbitrary time point on the grid above and define

Y(x) = number of individuals at risk (i.e., alive and uncensored) at time point x;
dN(x) = number of observed deaths occurring in [x, x + Δx).

Recall: Previously, Y(x) was denoted by n(x) and dN(x) was denoted by d(x).
It should be straightforward to see that w(x), the number of censored individuals in [x, x + Δx), is equal to [Y(x) − Y(x + Δx)] − dN(x).
Note: In theory, we should be able to choose Δx small enough that dN(x) > 0 and w(x) > 0 never occur in the same interval. In practice, however, data may not be collected in that fashion, in which case approximations such as those given with the life-table estimators may be necessary.
With these definitions, the Kaplan-Meier estimator can be written as

KM(t) = ∏_{x: x+Δx ≤ t} [1 − dN(x)/Y(x)],  as Δx → 0,

which can be modified, if Δx is not chosen small enough, to

LT(t) = ∏_{x: x+Δx ≤ t} [1 − dN(x)/(Y(x) − w(x)/2)],

where LT(t) denotes the life-table estimator.
If the sample size is large and Δx is small, then dN(x)/Y(x) is a small number (i.e., close to zero), as long as x is not close to the right-hand tail of the survival distribution (where Y(x) may be very small). If this is the case, then

exp{−dN(x)/Y(x)} ≈ 1 − dN(x)/Y(x).
Here we used the approximation e^x ≈ 1 + x when x is close to zero. This approximation is exact when dN(x)/Y(x) = 0.
Therefore, the Kaplan-Meier estimator can be approximated by

KM(t) ≈ ∏_{x: x+Δx ≤ t} exp{−dN(x)/Y(x)} = exp{−∑_{x<t} dN(x)/Y(x)};

here and hereafter, {x < t} means {all grid points x such that x + Δx ≤ t}.
If Δx is taken small enough that all distinct times (either death times or withdrawal times) are represented at most once in any time interval, then the estimator ∑_{x<t} dN(x)/Y(x) will be uniquely defined and will not be altered by choosing a finer partition of the grid of time points. In such a case the quantity ∑_{x<t} dN(x)/Y(x) is sometimes represented as ∫_0^t dN(x)/Y(x).
1. Basically, this estimator takes the sum, over all the distinct death times before time t, of the number of deaths divided by the number at risk at each of those distinct death times.
2. The estimator ∑_{x<t} dN(x)/Y(x) is referred to as the Nelson-Aalen estimator of the cumulative hazard function Λ(t) = ∫_0^t λ(x) dx. That is,

Λ̂(t) = ∑_{x<t} dN(x)/Y(x).

Recall that S(t) = exp(−Λ(t)).
By the definition of an integral,

Λ(t) = ∫_0^t λ(x) dx ≈ ∑_{x: x+Δx ≤ t} λ(x)Δx.
By the definition of a hazard function,

λ(x)Δx ≈ P[x ≤ T < x + Δx | T ≥ x].

With independent censoring, it would seem reasonable to estimate λ(x)Δx, i.e., the conditional probability of dying in [x, x + Δx) given being alive at time x, by dN(x)/Y(x). Therefore we obtain the Nelson-Aalen estimator

Λ̂(t) = ∑_{x<t} dN(x)/Y(x).
We will now show how to estimate the variance of the Nelson-Aalen estimator and then show how this will be used to estimate the variance of the Kaplan-Meier estimator.
For a grid point x, let H(x) denote the history of all deaths and censoring occurring up to time x:

H(x) = {dN(u), w(u) : for all values u on our grid of points with u < x}.
Note the following:
1. Conditional on H(x), we would know the value of Y(x) (i.e., the number at risk at time x), and dN(x) would follow a binomial distribution, denoted

dN(x) | H(x) ~ Bin(Y(x), π(x)),

where π(x) is the conditional probability of an individual dying in [x, x + Δx) given that the individual was at risk at time x (i.e., π(x) = P[x ≤ T < x + Δx | T ≥ x]). Recall that this probability can be approximated by π(x) ≈ λ(x)Δx.
2. The following are standard results for a binomially distributed random variable:

(a) E[dN(x) | H(x)] = Y(x)π(x),
(b) Var[dN(x) | H(x)] = Y(x)π(x)[1 − π(x)],
(c) E[dN(x)/Y(x) | H(x)] = π(x),
(d) E{ [Y(x)/(Y(x) − 1)] [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] | H(x) } = π(x)[1 − π(x)].
Consider the Nelson-Aalen estimator Λ̂(t) = ∑_{x<t} dN(x)/Y(x). We have

E[Λ̂(t)] = E[ ∑_{x<t} dN(x)/Y(x) ] = ∑_{x<t} E[dN(x)/Y(x)]
        = ∑_{x<t} E{ E[dN(x)/Y(x) | H(x)] }
        = ∑_{x<t} π(x) ≈ ∑_{x<t} λ(x)Δx ≈ ∫_0^t λ(x) dx = Λ(t).

Hence E[Λ̂(t)] = ∑_{x<t} π(x). If we take Δx smaller and smaller, then in the limit ∑_{x<t} π(x) goes to Λ(t). Namely, Λ̂(t) is nearly unbiased for Λ(t).
How to Estimate the Variance of Λ̂(t)
The definition of variance gives

Var(Λ̂(t)) = E[Λ̂(t) − E(Λ̂(t))]²
          = E[ ∑_{x<t} dN(x)/Y(x) − ∑_{x<t} π(x) ]²
          = E{ ∑_{x<t} [dN(x)/Y(x) − π(x)] }².
Note: The square of a sum of terms is equal to the sum of the squares plus the sum of all cross-product terms. So the above expectation is equal to

E{ ∑_{x<t} [dN(x)/Y(x) − π(x)]² + ∑_{x≠x′<t} [dN(x)/Y(x) − π(x)][dN(x′)/Y(x′) − π(x′)] }
= ∑_{x<t} E[dN(x)/Y(x) − π(x)]² + ∑_{x≠x′<t} E{ [dN(x)/Y(x) − π(x)][dN(x′)/Y(x′) − π(x′)] }.
We will first demonstrate that the cross-product terms have expectation equal to zero. Let us take one such term and say, without loss of generality, that x < x′. Then

E{ [dN(x)/Y(x) − π(x)][dN(x′)/Y(x′) − π(x′)] }
= E( E{ [dN(x)/Y(x) − π(x)][dN(x′)/Y(x′) − π(x′)] | H(x′) } ).

Note: Conditional on H(x′), the quantities dN(x), Y(x) and π(x) are constants, since x < x′. Therefore the above expectation is equal to

E( [dN(x)/Y(x) − π(x)] · E{ [dN(x′)/Y(x′) − π(x′)] | H(x′) } ).

The inner conditional expectation is zero since

E[dN(x′)/Y(x′) | H(x′)] = π(x′)

by (2.c). Therefore we have shown that

E{ [dN(x)/Y(x) − π(x)][dN(x′)/Y(x′) − π(x′)] } = 0.
Since the cross-product terms have expectation equal to zero, this implies that

Var(Λ̂(t)) = ∑_{x<t} E[dN(x)/Y(x) − π(x)]².

Using the double expectation theorem again, we get

E[dN(x)/Y(x) − π(x)]² = E( E{ [dN(x)/Y(x) − π(x)]² | H(x) } )
                     = E{ Var[dN(x)/Y(x) | H(x)] }
                     = E{ π(x)[1 − π(x)]/Y(x) }.

Therefore, we have that

Var(Λ̂(t)) = ∑_{x<t} E{ π(x)[1 − π(x)]/Y(x) }.
If we wanted to estimate π(x)[1 − π(x)]/Y(x), then using (2.d) we might think that

[dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1]

may be reasonable. We would then use as an estimate for Var(Λ̂(t)) the following estimator, summing the above over all grid points x such that x + Δx ≤ t:

V̂ar(Λ̂(t)) = ∑_{x<t} { [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] }.
In fact, the above variance estimator is unbiased for Var(Λ̂(t)), which can be seen from the following argument:

E{ ∑_{x<t} [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] }
= ∑_{x<t} E{ [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] }
= ∑_{x<t} E( E{ [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] | H(x) } )   (double expectation again)
= ∑_{x<t} E{ π(x)[1 − π(x)]/Y(x) }   (by (2.d))
= Var[Λ̂(t)].

What this last argument shows is that an unbiased estimator for Var[Λ̂(t)] is given by

∑_{x<t} { [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] }.
Note: If the survival data are continuous (i.e., no ties) and Δx is taken small enough, then dN(x) takes on the values 0 or 1 only. In this case

[dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] = dN(x)/Y²(x),

and

V̂ar(Λ̂(t)) = ∑_{x<t} dN(x)/Y²(x),

which is also written as ∫_0^t dN(x)/Y²(x).
Remark:
We proved that the Nelson-Aalen estimator ∑_{x<t} dN(x)/Y(x) is an unbiased estimator of ∑_{x<t} π(x). We argued before that in the limit as Δx goes to zero, ∑_{x<t} dN(x)/Y(x) becomes ∫_0^t dN(x)/Y(x). We also argued that π(x) ≈ λ(x)Δx, hence as Δx goes to zero, ∑_{x<t} π(x) goes to ∫_0^t λ(x) dx. These two arguments taken together imply that ∫_0^t dN(x)/Y(x) is an unbiased estimator of the cumulative hazard function Λ(t) = ∫_0^t λ(x) dx, namely,

E[ ∫_0^t dN(x)/Y(x) ] = Λ(t).

Since Λ̂(t) = ∑_{x<t} dN(x)/Y(x) is made up of a sum of random variables that are conditionally uncorrelated, it has a martingale structure, for which there exists a body of theory that enables us to show that
Λ̂(t) is asymptotically normal with mean Λ(t) and variance Var[Λ̂(t)], which can be estimated unbiasedly by

V̂ar(Λ̂(t)) = ∑_{x<t} { [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] },

and, in the case of no ties, by

V̂ar(Λ̂(t)) = ∑_{x<t} dN(x)/Y²(x).

Let us refer to the estimated standard error of Λ̂(t) as

se[Λ̂(t)] = { ∑_{x<t} [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] }^{1/2}.
The unbiasedness and asymptotic normality of Λ̂(t) about Λ(t) allow us to form confidence intervals for Λ(t) (at time t). Specifically, the (1 − α)th confidence interval for Λ(t) is given by

Λ̂(t) ± z_{α/2} se(Λ̂(t)),

where z_{α/2} is the (1 − α/2)th quantile of the standard normal distribution. That is, the random interval

[Λ̂(t) − z_{α/2} se(Λ̂(t)), Λ̂(t) + z_{α/2} se(Λ̂(t))]

covers the true value Λ(t) with probability 1 − α.
This result can also be used to construct confidence intervals for the survival function S(t). This is seen by realizing that S(t) = e^{−Λ(t)}, in which case the confidence interval is given by

[e^{−Λ̂(t) − z_{α/2} se(Λ̂(t))}, e^{−Λ̂(t) + z_{α/2} se(Λ̂(t))}],

meaning that this random interval will cover the true value S(t) with probability 1 − α.
An example: We will use the hypothetical data shown in Figure 2.3 to illustrate the calculation of Λ̂(t), V̂ar(Λ̂(t)), and confidence intervals for Λ(t) and S(t). For illustration, let us take t = 17. Note that there are no ties in this example. So

Λ̂(t) = ∑_{x<t} dN(x)/Y(x) = ∫_0^t dN(x)/Y(x) = 1/10 + 1/9 + 1/7 + 1/5 + 1/4 = 0.804,

V̂ar[Λ̂(t)] = ∑_{x<t} dN(x)/Y²(x) = ∫_0^t dN(x)/Y²(x) = 1/10² + 1/9² + 1/7² + 1/5² + 1/4² = 0.145,

se[Λ̂(t)] = √0.145 = 0.381.

So the 95% confidence interval for Λ(t) is

0.804 ± 1.96 × 0.381 = [0.0572, 1.551],

and the Nelson-Aalen estimate of S(t) is

Ŝ(t) = e^{−Λ̂(t)} = e^{−0.804} = 0.448.

The 95% confidence interval for S(t) is

[e^{−1.551}, e^{−0.0572}] = [0.212, 0.944].
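A sketch of the same calculation in R (the at-risk counts are read off Figure 2.3; the CI for Λ(t) is reversed in the last line because S(t) = e^{−Λ(t)} is decreasing):

n.risk <- c(10, 9, 7, 5, 4)    # Y(x) at the death times before t = 17
Lambda <- sum(1/n.risk)        # Nelson-Aalen estimate, 0.804
varL   <- sum(1/n.risk^2)      # no-ties variance estimate, 0.145
ci.Lambda <- Lambda + c(-1, 1) * 1.96 * sqrt(varL)   # [0.0572, 1.551]
exp(-rev(ci.Lambda))           # 95% CI for S(t): [0.212, 0.944]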
Note: The above Nelson-Aalen estimate Ŝ(t) = 0.448 is different from (but close to) the Kaplan-Meier estimate KM(t) = 0.411. It should also be noted that the above confidence interval for the survival probability S(t) is not symmetric about the estimator Ŝ(t). Another way of getting approximate confidence intervals for S(t) = e^{−Λ(t)} is by using the delta method. This method guarantees symmetric confidence intervals.
By the delta method, a (1 − α)th confidence interval for f(θ) is given by

f(θ̂) ± z_{α/2} |f′(θ̂)| σ̂.
In our case, Λ(t) takes on the role of θ, Λ̂(t) takes on the role of θ̂, and f(θ) = e^{−θ}, so that S(t) = f{Λ(t)}. Since

|f′(θ)| = |−e^{−θ}| = e^{−θ}  and  Ŝ(t) = e^{−Λ̂(t)},

using the delta method we get

Ŝ(t) ~ N(S(t), [S(t)]² Var[Λ̂(t)])  (asymptotically),

and a (1 − α)th confidence interval for S(t) is given by

Ŝ(t) ± z_{α/2} { Ŝ(t) · se[Λ̂(t)] }.
Remark: Note that [Ŝ(t)]² V̂ar[Λ̂(t)] is an estimate of Var[Ŝ(t)], where Ŝ(t) = exp[−Λ̂(t)].
Previously, we showed that the Kaplan-Meier estimator

KM(t) = ∏_{x<t} [1 − dN(x)/Y(x)]

is well approximated by Ŝ(t) = exp[−Λ̂(t)]. Thus a reasonable estimator of Var(KM(t)) would be the estimator of Var[exp(−Λ̂(t))], or (by using the delta method)

[Ŝ(t)]² V̂ar[Λ̂(t)] = [Ŝ(t)]² ∑_{x<t} dN(x)/Y²(x).

This is very close to (asymptotically the same as) the estimator of the variance of the Kaplan-Meier estimator given by Greenwood, namely

V̂ar{KM(t)} = {KM(t)}² ∑_{x<t} dN(x) / {[Y(x) − w(x)/2][Y(x) − dN(x) − w(x)/2]}.
Note: SAS uses the above formula to calculate the estimated variance of the life-table estimate of the survival function, replacing KM(t) on both sides by LT(t).
Note: The summation in the above equation can be viewed as the variance estimate for the cumulative hazard estimator defined by Λ̂_KM(t) = −log[KM(t)]. Namely,

V̂ar{Λ̂_KM(t)} = ∑_{x<t} dN(x) / {[Y(x) − w(x)/2][Y(x) − dN(x) − w(x)/2]}.
In the example shown in Figure 2.3, using the delta-method approximation for getting a confidence interval with the Nelson-Aalen estimator, we get that a 95% CI for S(t) (where t = 17) is

e^{−Λ̂(t)} ± 1.96 e^{−Λ̂(t)} se[Λ̂(t)] = e^{−0.804} ± 1.96 × e^{−0.804} × 0.381 = [0.113, 0.782].

The estimated se[Ŝ(t)] = 0.171.
If we use the Kaplan-Meier estimator, together with Greenwood's formula for estimating the variance, to construct a 95% confidence interval for S(t), we get

KM(t) = (1 − 1/10)(1 − 1/9)(1 − 1/7)(1 − 1/5)(1 − 1/4) = 0.411,

V̂ar[KM(t)] = 0.411² × [1/(10 × 9) + 1/(9 × 8) + 1/(7 × 6) + 1/(5 × 4) + 1/(4 × 3)] = 0.03077,

se[KM(t)] = √0.03077 = 0.175,

V̂ar[Λ̂_KM(t)] = 1/(10 × 9) + 1/(9 × 8) + 1/(7 × 6) + 1/(5 × 4) + 1/(4 × 3) = 0.182,

se[Λ̂_KM(t)] = 0.427.
Thus a 95% confidence interval for S(t) is given by

KM(t) ± 1.96 se[KM(t)] = 0.411 ± 1.96 × 0.175 = [0.068, 0.754],

which is close to the confidence interval using the delta method, considering the sample size is only 10. In fact, the estimated standard errors for Ŝ(t) and KM(t) using the delta method and Greenwood's formula are 0.171 and 0.175 respectively, which agree with each other very well.
Note: If we want to use the R function survfit() to construct a confidence interval for S(t) of the form KM(t) ± z_{α/2} se[KM(t)], we have to specify the argument conf.type=c("plain") in survfit(). The default constructs the confidence interval for S(t) by exponentiating the confidence interval for the cumulative hazard based on the Kaplan-Meier estimator. For example, a 95% CI for S(t) is

KM(t) × [e^{−1.96 se[Λ̂_KM(t)]}, e^{1.96 se[Λ̂_KM(t)]}] = 0.411 × [e^{−1.96 × 0.427}, e^{1.96 × 0.427}] = [0.178, 0.949].
Comparison of confidence intervals for S(t):
1. Exponentiating the 95% CI for the cumulative hazard using the Nelson-Aalen estimator: [0.212, 0.944].
2. Delta method using the Nelson-Aalen estimator: [0.113, 0.782].
3. Exponentiating the 95% CI for the cumulative hazard using the Kaplan-Meier estimator: [0.178, 0.949].
4. Kaplan-Meier estimator together with Greenwood's formula for the variance: [0.068, 0.754].
These are relatively close, and the approximations become better with larger sample sizes. Of the different methods for constructing confidence intervals, usually the most accurate is based on exponentiating the confidence intervals for the cumulative hazard function based on the Nelson-Aalen estimator. We don't feel that symmetry is necessarily an important feature for a confidence interval to have.
Summary
1. We first estimate S(t) by KM(t) = ∏_{x<t} [1 − d(x)/n(x)], then estimate Λ(t) by Λ̂_KM(t) = −log[KM(t)]. Their variance estimates are

V̂ar{Λ̂_KM(t)} = ∑_{x<t} dN(x) / {[Y(x) − w(x)/2][Y(x) − dN(x) − w(x)/2]},
V̂ar{KM(t)} = {KM(t)}² V̂ar{Λ̂_KM(t)}.

The confidence intervals for S(t) can be constructed in two ways:

KM(t) ± z_{α/2} se[KM(t)],  or  e^{−Λ̂_KM(t) ± z_{α/2} se[Λ̂_KM(t)]} = KM(t) × e^{± z_{α/2} se[Λ̂_KM(t)]}.

2. We first estimate Λ(t) by the Nelson-Aalen estimator Λ̂(t) = ∑_{x<t} dN(x)/Y(x), then estimate S(t) by Ŝ(t) = e^{−Λ̂(t)}. Their variance estimates are given by

V̂ar{Λ̂(t)} = ∑_{x<t} { [dN(x)/Y(x)] [(Y(x) − dN(x))/Y(x)] / [Y(x) − 1] },
V̂ar{Ŝ(t)} = {Ŝ(t)}² V̂ar{Λ̂(t)}.

The confidence intervals for S(t) can also be constructed in two ways:

Ŝ(t) ± z_{α/2} se[Ŝ(t)],  or  e^{−Λ̂(t) ± z_{α/2} se[Λ̂(t)]} = Ŝ(t) × e^{± z_{α/2} se[Λ̂(t)]}.
Estimators of quantiles (such as the median and the first and third quartiles) of a distribution can be obtained by inverse relationships. This is most easily illustrated through an example.
Suppose we want to estimate the median S^{−1}(0.5), or any other quantile θ = S^{−1}(ξ), 0 < ξ < 1. Then the point estimate of θ is obtained (using the Kaplan-Meier estimator of S(t)) as

θ̂ = KM^{−1}(ξ),  i.e.,  KM(θ̂) = ξ.

An approximate (1 − α)th confidence interval for θ is given by [θ̂_L, θ̂_U], where θ̂_L satisfies

KM(θ̂_L) − z_{α/2} se[KM(θ̂_L)] = ξ,

and θ̂_U satisfies

KM(θ̂_U) + z_{α/2} se[KM(θ̂_U)] = ξ.
Proof: We prove this argument for a general estimator Ŝ(t). If we use the Kaplan-Meier estimator, then Ŝ(t) is KM(t); it can also be the Nelson-Aalen based estimator. Then

P[θ̂_L < θ < θ̂_U] = P[S(θ̂_U) < ξ < S(θ̂_L)]   (note that S(t) is decreasing and S(θ) = ξ)
               = 1 − (P[S(θ̂_U) > ξ] + P[S(θ̂_L) < ξ]).

Denote by θ*_U the solution to the equation

S(θ*_U) + z_{α/2} se[Ŝ(θ*_U)] = ξ.

Then θ*_U will be close to θ̂_U. Therefore,

P[S(θ̂_U) > ξ] = P[ S(θ̂_U) > Ŝ(θ̂_U) + z_{α/2} se[Ŝ(θ̂_U)] ]
            = P[ {Ŝ(θ̂_U) − S(θ̂_U)}/se[Ŝ(θ̂_U)] < −z_{α/2} ]
            ≈ P[ {Ŝ(θ*_U) − S(θ*_U)}/se[Ŝ(θ*_U)] < −z_{α/2} ]
            ≈ P[Z < −z_{α/2}]   (Z ~ N(0, 1))
            = α/2.

Similarly, we can show that

P[S(θ̂_L) < ξ] ≈ α/2.

Therefore,

P[θ̂_L < θ < θ̂_U] ≈ 1 − (α/2 + α/2) = 1 − α.
We illustrate this practice using a simulated data set generated with the following R commands:
> survtime <- rexp(50, 0.2)
> censtime <- rexp(50, 0.1)
> status <- (survtime <= censtime)
> obstime <- survtime*status + censtime*(1-status)
> fit <- survfit(Surv(obstime, status))
> summary(fit)
Call: survfit(formula = Surv(obstime, status))
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.0747 50 1 0.980 0.0198 0.9420 1.000
0.0908 49 1 0.960 0.0277 0.9072 1.000
0.4332 46 1 0.939 0.0341 0.8747 1.000
0.4420 45 1 0.918 0.0392 0.8446 0.998
0.5454 44 1 0.897 0.0435 0.8161 0.987
0.6126 43 1 0.877 0.0472 0.7887 0.974
0.7238 42 1 0.856 0.0505 0.7622 0.961
1.1662 40 1 0.834 0.0536 0.7356 0.946
1.2901 39 1 0.813 0.0563 0.7097 0.931
1.3516 38 1 0.791 0.0588 0.6843 0.915
1.4490 37 1 0.770 0.0609 0.6594 0.899
1.6287 35 1 0.748 0.0630 0.6342 0.882
1.8344 34 1 0.726 0.0649 0.6094 0.865
1.9828 33 1 0.704 0.0666 0.5850 0.847
2.1467 32 1 0.682 0.0680 0.5610 0.829
2.3481 31 1 0.660 0.0693 0.5373 0.811
2.4668 30 1 0.638 0.0704 0.5140 0.792
2.5135 29 1 0.616 0.0713 0.4910 0.773
2.5999 28 1 0.594 0.0721 0.4683 0.754
2.9147 27 1 0.572 0.0727 0.4459 0.734
2.9351 25 1 0.549 0.0733 0.4228 0.713
3.2168 24 1 0.526 0.0737 0.3999 0.693
3.4501 22 1 0.502 0.0742 0.3762 0.671
3.5620 21 1 0.478 0.0744 0.3528 0.649
3.6795 20 1 0.455 0.0744 0.3298 0.627
3.8475 18 1 0.429 0.0744 0.3056 0.603
4.8888 16 1 0.402 0.0745 0.2800 0.578
5.3910 15 1 0.376 0.0742 0.2551 0.553
6.1186 14 1 0.349 0.0736 0.2307 0.527
6.1812 13 1 0.322 0.0726 0.2069 0.501
6.1957 12 1 0.295 0.0714 0.1837 0.474
6.2686 10 1 0.266 0.0701 0.1584 0.445
6.3252 9 1 0.236 0.0682 0.1340 0.416
6.5206 7 1 0.202 0.0663 0.1065 0.385
7.1127 6 1 0.169 0.0632 0.0809 0.352
9.3017 3 1 0.112 0.0623 0.0379 0.333
11.1589 1 1 0.000 NA NA NA
The true survival time has an exponential distribution with λ = 0.2/year (so the true mean is 5 years and the median is 5 log(2) ≈ 3.5 years). The (potential) censoring time is independent of the survival time and has an exponential distribution with λ = 0.1/year (so it is stochastically larger than the survival time). The Kaplan-Meier estimate (solid line) and its 95% confidence intervals (dotted lines) are shown in Figure 2.5, which is generated using the R function plot(fit, xlab="Patient time (years)", ylab="survival probability"). Note that these CIs are constructed by exponentiating the CIs for Λ(t). From this figure, the median survival time is estimated to be 3.56 years, with 95% confidence interval [2.51, 6.20].
Figure 2.5: Illustration of constructing a 95% CI for the median survival time. [Kaplan-Meier estimate of survival probability against patient time in years, with 95% confidence limits; the median estimate 3.56 and its confidence limits 2.51 and 6.20 are marked on the time axis.]
If we use symmetric confidence intervals for S(t) to construct the confidence interval for the median of the true survival time, then we need to specify conf.type=c("plain") in survfit(), as follows:
> fit <- survfit(Surv(obstime, status), conf.type=c("plain"))
We get the following output using summary()
> summary(fit)
Call: survfit(formula = Surv(obstime, status), conf.type = c("plain"))
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.0747 50 1 0.980 0.0198 0.9412 1.000
0.0908 49 1 0.960 0.0277 0.9057 1.000
0.4332 46 1 0.939 0.0341 0.8723 1.000
0.4420 45 1 0.918 0.0392 0.8414 0.995
0.5454 44 1 0.897 0.0435 0.8121 0.983
0.6126 43 1 0.877 0.0472 0.7839 0.969
0.7238 42 1 0.856 0.0505 0.7567 0.955
1.1662 40 1 0.834 0.0536 0.7292 0.939
1.2901 39 1 0.813 0.0563 0.7025 0.923
1.3516 38 1 0.791 0.0588 0.6763 0.907
1.4490 37 1 0.770 0.0609 0.6506 0.890
1.6287 35 1 0.748 0.0630 0.6245 0.872
1.8344 34 1 0.726 0.0649 0.5988 0.853
1.9828 33 1 0.704 0.0666 0.5736 0.835
2.1467 32 1 0.682 0.0680 0.5487 0.815
2.3481 31 1 0.660 0.0693 0.5242 0.796
2.4668 30 1 0.638 0.0704 0.5001 0.776
2.5135 29 1 0.616 0.0713 0.4763 0.756
2.5999 28 1 0.594 0.0721 0.4528 0.735
2.9147 27 1 0.572 0.0727 0.4296 0.715
2.9351 25 1 0.549 0.0733 0.4055 0.693
3.2168 24 1 0.526 0.0737 0.3818 0.671
3.4501 22 1 0.502 0.0742 0.3570 0.648
3.5620 21 1 0.478 0.0744 0.3326 0.624
3.6795 20 1 0.455 0.0744 0.3087 0.600
3.8475 18 1 0.429 0.0744 0.2834 0.575
4.8888 16 1 0.402 0.0745 0.2565 0.548
5.3910 15 1 0.376 0.0742 0.2302 0.521
6.1186 14 1 0.349 0.0736 0.2046 0.493
6.1812 13 1 0.322 0.0726 0.1796 0.464
6.1957 12 1 0.295 0.0714 0.1552 0.435
6.2686 10 1 0.266 0.0701 0.1283 0.403
6.3252 9 1 0.236 0.0682 0.1024 0.370
6.5206 7 1 0.202 0.0663 0.0724 0.332
7.1127 6 1 0.169 0.0632 0.0447 0.293
9.3017 3 1 0.112 0.0623 0.0000 0.235
11.1589 1 1 0.000 NA NA NA
The Kaplan-Meier estimate (solid line) and its symmetric 95% confidence intervals (dotted lines) are shown in Figure 2.6. Note that the Kaplan-Meier estimate is the same as before. From this figure, the median survival time is estimated to be 3.56 years, with 95% confidence interval [2.51, 6.12].
Note: If we treat the censored data obstime as uncensored and fit an exponential model to it, then the best estimate of the median survival time is 2.5 years, with 95% confidence interval [1.8, 3.2] (using the methodology to be presented in the next chapter). These estimates severely underestimate the true median survival time of 3.5 years.
Figure 2.6: Illustration of constructing a 95% CI for the median survival time using symmetric CIs of S(t). [Kaplan-Meier estimate of survival probability against patient time in years, with symmetric 95% confidence limits; the median estimate 3.56 and its confidence limits 2.51 and 6.12 are marked on the time axis.]
Note:
If we want a CI for a quantile such as the median survival time with a different confidence level, say 90%, then we need to construct 90% confidence intervals for S(t). This can be done by specifying conf.int=0.9 in the R function survfit().
If we use Proc Lifetest in SAS to compute the Kaplan-Meier estimate, it will produce 95% confidence intervals for the 25%, 50% (median) and 75% quantiles of the true survival time.
Other types of censoring and truncation:
Left censoring: This kind of censoring occurs when the event of interest is only known to happen before a specific time point. For example, in a study of time to first marijuana use (example 1.17, page 17 of Klein & Moeschberger), 191 high school boys were asked "when did you first use marijuana?". Some answers were "I have used it but cannot recall when the first time was". For these boys, their time to first marijuana use is left censored at their current age. For the boys who never used marijuana, their time to first marijuana use is right censored at their current age. Of course, we got the exact time to first marijuana use for those boys who remembered when they first used it.
Interval censoring occurs when the event of interest is only known to take place in an
interval. For example, in a study to compare time to cosmetic deterioration of breasts
for breast cancer patients treated with radiotherapy and radiotherapy + chemotherapy,
patients were examined at each clinical visit for breast retraction and the breast retraction
is only known to take place between two clinical visits or right censored at the end of the
study. See example 1.18 on page 18 of Klein & Moeschberger.
Left truncation occurs when the time to event of interest in the study sample is greater than a (left) truncation variable. For example, in a study of life expectancy (survival time measured from birth to death) using elderly residents in a retirement community (example 1.16, page 15 of Klein & Moeschberger), the individuals must survive to a sufficient age to enter the retirement community. Therefore, their survival time is left truncated by their age entering the community. Ignoring the truncation will lead to a biased sample, and the survival time from the sample will overestimate the underlying life expectancy.
Right truncation occurs when the time to event of interest in the study sample is less
than a (right) truncation variable. A special case is when the study sample consists of
only those individuals who have already experienced the event. For example, to study the
induction period (also called latency period or incubation period) between infection with
AIDS virus and the onset of clinical AIDS, the ideal approach will be to collect a sample
of patients infected with AIDS virus and then follow them for some period of time until
some of them develop clinical AIDS. However, this approach may be too lengthy and costly.
An alternative approach is to study those patients who were infected with AIDS from a
contaminated blood transfusion and later developed clinical AIDS. In this case, the total
number of patients infected with AIDS is unknown. A similar approach can be used to
study the induction time for pediatric AIDS. Children were infected with AIDS in utero or
at birth and later developed clinical AIDS. But the study sample consists of children only
known to develop AIDS. This sampling scheme is similar to the case-control design. See
example 1.19 on page 19 of Klein & Moeschberger for more description and the data.
Note: The K-M survival estimation approach cannot be directly applied to data with the above censoring and truncation schemes. A modified K-M approach or other methods have to be used. Similar to the right censoring case, the censoring time and truncation time are often assumed to be independent of the time to event of interest (survival time). Since right censoring is the most common censoring scheme, we will focus on this special case most of the time in this course. Nonparametric estimation of the survival function (or the cumulative distribution function) for data with other censoring or truncation schemes can be found in Chapters 4 and 5 of Klein & Moeschberger.
3 Likelihood and Censored (or Truncated) Survival Data
Review of Parametric Likelihood Inference
Suppose we have a random sample (i.i.d.) $X_1, X_2, \ldots, X_n$ from a distribution $f(x; \theta)$ (here $f(x; \theta)$ is either the density function if the random variable $X$ is continuous or the probability mass function if $X$ is discrete; $\theta$ can be a scalar parameter or a vector of parameters). The distribution $f(x; \theta)$ is totally determined by the parameter $\theta$. For example, if $X_i$ is known to be from a log-normal distribution, then

$$f(x; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-(\log x - \mu)^2/(2\sigma^2)}, \qquad (3.1)$$

and $\theta = (\mu, \sigma)$ are the parameters of interest. Any quantity w.r.t. $X$ can be determined by $\theta$. For example, $E(X) = e^{\mu + \sigma^2/2}$. The likelihood function of $\theta$ (given data $X$) is

$$L(\theta; X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma X_i}\, e^{-(\log X_i - \mu)^2/(2\sigma^2)} \qquad (3.2)$$

$$= (\sqrt{2\pi}\,\sigma)^{-n} \prod_{i=1}^{n} \frac{e^{-(\log X_i - \mu)^2/(2\sigma^2)}}{X_i}. \qquad (3.3)$$

In general, the likelihood function of $\theta$ (given data $X$) is given by

$$L(\theta; X) = \prod_{i=1}^{n} f(X_i; \theta) \qquad (3.4)$$

and the log-likelihood function is

$$\ell(\theta; X) = \log\{L(\theta; X)\} = \sum_{i=1}^{n} \log\{f(X_i; \theta)\}. \qquad (3.5)$$

Note that the (log-)likelihood function of $\theta$ is viewed more as a function of $\theta$ than of the data $X$. We are interested in making inference on $\theta$: estimating $\theta$, constructing confidence intervals (regions) for $\theta$, and performing hypothesis tests for (part of) $\theta$.
In the likelihood inference for a regression problem, the function $f(\cdot)$ in the above likelihood function is the conditional density of $X_i$ given covariates. For example, suppose $X_i$ is from the following model

$$\log X_i = \beta_0 + z_i \beta_1 + \epsilon_i, \qquad (3.6)$$
where $\beta_0$ is an intercept and the regression coefficient $\beta_1$ characterizes the effect of $z$ on $X$. If we assume $\epsilon_i \sim N(0, \sigma^2)$, then the likelihood of $\theta = (\beta_0, \beta_1, \sigma^2)$ is

$$L(\theta; X) = \prod_{i=1}^{n} f(X_i | z_i; \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma X_i}\, e^{-(\log X_i - \beta_0 - z_i\beta_1)^2/(2\sigma^2)}. \qquad (3.7)$$

The maximum likelihood estimate (MLE) $\hat{\theta}$ of $\theta$ is defined as the maximizer of $\ell(\theta; X)$, which can usually be obtained by solving the following likelihood equation (or score equation)

$$U(\theta; X) = \frac{\partial \ell(\theta; X)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \log\{f(X_i; \theta)\}}{\partial \theta} = 0,$$

where $U(\theta; X)$ is often referred to as the score. Usually $\hat{\theta}$ does not have a closed form, in which case an iterative algorithm such as the Newton-Raphson algorithm can be used to find $\hat{\theta}$.
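As a small illustration (not from the notes), the following R sketch maximizes the log-normal log-likelihood (3.2) numerically; the names negloglik and xdat are made up for this example, and optim() stands in for an explicit Newton-Raphson loop:

## A minimal sketch: numerical MLE for the log-normal likelihood (3.2).
negloglik <- function(par, x) {
  mu <- par[1]
  sigma <- exp(par[2])              # optimize on the log scale so sigma > 0
  -sum(dlnorm(x, meanlog = mu, sdlog = sigma, log = TRUE))
}
set.seed(1)
xdat <- rlnorm(100, meanlog = 1, sdlog = 0.5)   # simulated data
fit <- optim(c(0, 0), negloglik, x = xdat, hessian = TRUE)
c(mu = fit$par[1], sigma = exp(fit$par[2]))     # MLEs

The inverse of the numerical Hessian returned here (fit$hessian) estimates the variance matrix $C$ discussed next.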
Obviously, the MLE of $\theta$, denoted by $\hat{\theta}$, is a function of the data $X = (X_1, X_2, \ldots, X_n)$, and hence a statistic that has a sampling distribution. Asymptotically (i.e., for large sample size $n$), $\hat{\theta}$ will have the following distribution

$$\hat{\theta} \stackrel{a}{\sim} N(\theta, C),$$

where $C = J^{-1}$ or $C = J_0^{-1}$, and

$$J = E\left[-\frac{\partial^2 \ell(\theta; X)}{\partial\theta\, \partial\theta^T}\right] = \sum_{i=1}^{n} E\left[-\frac{\partial^2 \log\{f(X_i; \theta)\}}{\partial\theta\, \partial\theta^T}\right]$$

is often referred to as the Fisher information matrix, and

$$J_0 = -\frac{\partial^2 \ell(\theta; X)}{\partial\theta\, \partial\theta^T} = \sum_{i=1}^{n} \left[-\frac{\partial^2 \log\{f(X_i; \theta)\}}{\partial\theta\, \partial\theta^T}\right]$$

is often referred to as the observed information matrix. Asymptotically $J$ and $J_0$ are the same, so we usually just use $J$ to mean either information matrix. These results can be used to construct confidence intervals (regions) for $\theta$.
Suppose $\theta = (\theta_1, \theta_2)$ and we are interested in testing $H_0: \theta_1 = \theta_{10}$ v.s. $H_A: \theta_1 \neq \theta_{10}$. Under mild conditions, the following test procedures can be used to test $H_0$.
Wald test: Suppose the corresponding decompositions of $\hat{\theta}$ and $C$ are

$$\hat{\theta} = \begin{pmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \end{pmatrix} \quad \mbox{and} \quad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$

Then under $H_0$,

$$\chi^2_{obs} = (\hat{\theta}_1 - \theta_{10})^T C_{11}^{-1} (\hat{\theta}_1 - \theta_{10}) \stackrel{a}{\sim} \chi^2_k,$$

where $k$ is the dimension of $\theta_1$. Therefore, we reject $H_0$ if $\chi^2_{obs} > \chi^2_{k,1-\alpha}$, where $\chi^2_{k,1-\alpha}$ is the $(1-\alpha)$th percentile of $\chi^2_k$.
Score test: The score test is based on the fact that the score $U(\theta; X)$ has the following asymptotic distribution

$$U(\theta; X) \stackrel{a}{\sim} N(0, J).$$

Decompose $U(\theta; X)$ as $U(\theta; X) = (U_1(\theta; X), U_2(\theta; X))$ and let $\tilde{\theta}_2$ be the MLE of $\theta_2$ under $H_0: \theta_1 = \theta_{10}$, i.e., $\tilde{\theta}_2$ maximizes $\ell(\theta_{10}, \theta_2; X)$. Then under $H_0: \theta_1 = \theta_{10}$,

$$\chi^2_{obs} = U_1^T\, C_{11}\, U_1 \stackrel{a}{\sim} \chi^2_k,$$

where $U_1(\theta; X)$ and $C_{11}$ are evaluated under $H_0$. We reject $H_0$ if $\chi^2_{obs} > \chi^2_{k,1-\alpha}$.
Likelihood ratio test: Under $H_0: \theta_1 = \theta_{10}$,

$$\chi^2_{obs} = -2\left(\ell(\theta_{10}, \tilde{\theta}_2; X) - \ell(\hat{\theta}; X)\right) \stackrel{a}{\sim} \chi^2_k.$$

Therefore, we reject $H_0$ if $\chi^2_{obs} > \chi^2_{k,1-\alpha}$.
An example of score tests: Suppose the sample $x_1, x_2, \ldots, x_n$ is from a Weibull distribution with survival function $S(x) = e^{-\lambda x^{\alpha}}$. We want to construct a score test for testing $H_0: \alpha = 1$, i.e., that the data are from an exponential distribution.

The likelihood function of $(\lambda, \alpha)$ is

$$L(\lambda, \alpha; x) = \prod_{i=1}^{n} \left[\lambda\alpha x_i^{\alpha-1} e^{-\lambda x_i^{\alpha}}\right] = \lambda^n \alpha^n\, e^{-\lambda \sum_{i=1}^{n} x_i^{\alpha} + (\alpha-1)\sum_{i=1}^{n} \log(x_i)}.$$

Therefore, the log-likelihood function of $(\lambda, \alpha)$ is

$$\ell(\lambda, \alpha; x) = n\log(\lambda) + n\log(\alpha) - \lambda\sum_{i=1}^{n} x_i^{\alpha} + (\alpha-1)\sum_{i=1}^{n} \log(x_i).$$
So the components of the score are:

$$\frac{\partial \ell(\lambda, \alpha; x)}{\partial \alpha} = \frac{n}{\alpha} - \lambda\sum_{i=1}^{n} x_i^{\alpha}\log(x_i) + \sum_{i=1}^{n} \log(x_i),$$

$$\frac{\partial \ell(\lambda, \alpha; x)}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i^{\alpha},$$

and the components of the information matrix are

$$-\frac{\partial^2 \ell(\lambda, \alpha; x)}{\partial \alpha^2} = \frac{n}{\alpha^2} + \lambda\sum_{i=1}^{n} x_i^{\alpha}(\log(x_i))^2,$$

$$-\frac{\partial^2 \ell(\lambda, \alpha; x)}{\partial \alpha\, \partial \lambda} = \sum_{i=1}^{n} x_i^{\alpha}\log(x_i),$$

$$-\frac{\partial^2 \ell(\lambda, \alpha; x)}{\partial \lambda^2} = \frac{n}{\lambda^2}.$$
For a given data set, we calculate the above quantities under $H_0: \alpha = 1$ and construct a score test. For example, for the complete data in homework 2, we have $n = 25$, $\sum x_i = 6940$, $\sum \log(x_i) = 132.24836$, $\sum x_i\log(x_i) = 40870.268$, $\sum x_i(\log(x_i))^2 = 243502.91$, and $\hat{\lambda} = 1/\bar{x} = 0.0036023$, so the score is $U = 25 - 0.0036023 \times 40870.268 + 132.24836 = 10$. The information matrix and its inverse are

$$J = \begin{pmatrix} 902.17186 & 40870.268 \\ 40870.268 & 1926544 \end{pmatrix}, \quad C = J^{-1} = \begin{pmatrix} 0.0284592 & -0.000604 \\ -0.000604 & 0.0000133 \end{pmatrix},$$

so the score statistic is $\chi^2 = 10 \times 0.0284592 \times 10 = 2.8$ and the p-value is 0.09.
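These hand calculations can be checked in R; here is a minimal sketch using the summary statistics quoted above (the object names are illustrative):

## Hand computation of the score test, using the reported sufficient statistics.
n <- 25; sxlx <- 40870.268; slx <- 132.24836; sxlx2 <- 243502.91
lam <- 0.0036023                       # lambda-hat = 1/xbar under H0: alpha = 1
U <- n - lam * sxlx + slx              # score for alpha, approx 10
J <- matrix(c(n + lam * sxlx2, sxlx,
              sxlx,            n / lam^2), nrow = 2)
C <- solve(J)                          # C[1,1] approx 0.0285
chi2 <- U * C[1, 1] * U                # approx 2.8
1 - pchisq(chi2, df = 1)               # p-value approx 0.09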
Back to Censored Data
Suppose we have a random sample of individuals of size $n$ from a specific population whose true survival times are $T_1, T_2, \ldots, T_n$. However, due to right censoring such as staggered entry, loss to follow-up, competing risks (death from other causes) or any combination of these, we don't always have the opportunity of observing these survival times. Denote by $C$ the censoring process and by $C_1, C_2, \ldots, C_n$ the (potential) censoring times. Thus if a subject is not censored we have observed his/her survival time (in this case, we may not observe the censoring time for this individual); otherwise we have observed his/her censoring time (the survival time is larger than the censoring time). In other words, the observed data are the minimum of the survival time and censoring time for each subject in the sample and the indication of whether or not the subject is censored. Statistically, we have observed data $(X_i, \delta_i)$, $i = 1, 2, \ldots, n$, where

$$X_i = \min(T_i, C_i), \qquad \delta_i = I(T_i \le C_i) = \begin{cases} 1 & \mbox{if } T_i \le C_i \ \mbox{(observed failure)} \\ 0 & \mbox{if } T_i > C_i \ \mbox{(observed censoring)} \end{cases}$$

Namely, the potential data are $\{(T_1, C_1), (T_2, C_2), \ldots, (T_n, C_n)\}$, but the actual observed data are $\{(X_1, \delta_1), (X_2, \delta_2), \ldots, (X_n, \delta_n)\}$.
Of course we are interested in making inference on the random variable $T$, i.e., on any one of the following functions:

$f(t)$ = density function
$F(t)$ = distribution function
$S(t)$ = survival function
$\lambda(t)$ = hazard function

Since we need to work with our data $\{(X_1, \delta_1), (X_2, \delta_2), \ldots, (X_n, \delta_n)\}$, we define the following corresponding functions for the censoring time $C$:

$g(t)$ = density function
$G(t)$ = distribution function = $P[C \le t]$
$H(t)$ = survival function = $P[C \ge t] = 1 - G(t)$
$\mu(t)$ = hazard function = $g(t)/H(t)$
Usually, the density function $f(t)$ of $T$ may be governed by some parameters $\theta$ and $g(t)$ by some other parameters $\psi$. In these cases, we are interested in making inference on $\theta$.

In order to derive the density of $(X, \delta)$, we assume independent censoring, i.e., the random variables $T$ and $C$ are independent. The density function of $(X, \delta)$ is defined as

$$f(x, \delta) = \lim_{h\to 0}\frac{P[x \le X < x+h, \Delta = \delta]}{h}, \qquad x \ge 0,\ \delta \in \{0, 1\}.$$

Note: Do not mix up the density $f(t)$ of $T$ and $f(x, \delta)$ of $(X, \delta)$. If we want to be more specific, we will use $f_T(t)$ for $T$ and $f_{X,\delta}(x, \delta)$ for $(X, \delta)$. But when there is no ambiguity, we will suppress the subscripts.

1. Case 1: $\delta = 1$, i.e., $T \le C$ and $X = \min(T, C) = T$. We have

$$P[x \le X < x+h, \delta = 1] = P[x \le T < x+h, C \ge T]$$
$$\approx P[x \le T < x+h, C \ge x] \quad \mbox{(Note: $x$ is a fixed number)}$$
$$= P[x \le T < x+h]\, P[C \ge x] \quad \mbox{(by independence of $T$ and $C$)}$$
$$= f_T(\xi)h \cdot H_C(x), \quad \xi \in [x, x+h) \quad \mbox{(Note: $H_C(x)$ is the survival function of $C$)}.$$

Therefore

$$f(x, \delta = 1) = \lim_{h\to 0}\frac{P[x \le X < x+h, \delta = 1]}{h} = \lim_{h\to 0}\frac{f_T(\xi)h\, H_C(x)}{h} = f_T(x)H_C(x).$$
2. Case 2: $\delta = 0$, i.e., $T > C$ and $X = \min(T, C) = C$. We have

$$P[x \le X < x+h, \delta = 0] = P[x \le C < x+h, T > C]$$
$$\approx P[x \le C < x+h, T \ge x] = P[x \le C < x+h]\, P[T \ge x] \quad \mbox{(by independence of $T$ and $C$)}$$
$$= g_C(\xi)h \cdot S(x), \quad \xi \in [x, x+h).$$

Therefore

$$f(x, \delta = 0) = \lim_{h\to 0}\frac{P[x \le X < x+h, \delta = 0]}{h} = \lim_{h\to 0}\frac{g_C(\xi)h\, S(x)}{h} = g_C(x)S(x).$$
Combining these two cases, we have the density function of $(X, \delta)$:

$$f(x, \delta) = [f_T(x)H_C(x)]^{\delta}\,[g_C(x)S(x)]^{1-\delta} = \{[f_T(x)]^{\delta}[S(x)]^{1-\delta}\}\{[g_C(x)]^{1-\delta}[H_C(x)]^{\delta}\}.$$

Sometimes it may be useful to use hazard functions. Recalling that the hazard function

$$\lambda_T(x) = \frac{f_T(x)}{S_T(x)}, \quad \mbox{or} \quad f_T(x) = \lambda_T(x)\, S_T(x),$$

we can write $[f_T(x)]^{\delta}[S(x)]^{1-\delta}$ as

$$[f_T(x)]^{\delta}[S(x)]^{1-\delta} = [\lambda_T(x)\, S_T(x)]^{\delta}[S(x)]^{1-\delta} = [\lambda_T(x)]^{\delta}\,[S(x)].$$
Another useful way of defining the distribution of the random variable $(X, \delta)$ is through the cause-specific hazard function.

Definition: The cause-specific hazard function is defined as

$$\lambda(x, \delta) = \lim_{h\to 0}\frac{P[x \le X < x+h, \Delta = \delta \mid X \ge x]}{h}.$$

For example, $\lambda(x, \delta = 1)$ corresponds to the probability rate of observing a failure at time $x$ given that an individual is at risk at time $x$ (i.e., has neither failed nor been censored prior to time $x$).

If $T$ and $C$ are statistically independent, then through the following calculations, we obtain

$$P[x \le X < x+h, \Delta = \delta \mid X \ge x] = \frac{P[(x \le X < x+h, \Delta = \delta) \cap (X \ge x)]}{P[X \ge x]} = \frac{P[x \le X < x+h, \Delta = \delta]}{P[X \ge x]}.$$
Hence

$$\lambda(x, \delta = 1) = \frac{\lim_{h\to 0} P[x \le X < x+h, \Delta = 1]/h}{P[X \ge x]} = \frac{f(x, \delta = 1)}{P[X \ge x]}.$$

Since $f(x, \delta = 1) = f_T(x)H_C(x)$ and

$$P[X \ge x] = P[\min(T, C) \ge x] = P[(T \ge x) \cap (C \ge x)] = P[T \ge x]\, P[C \ge x] = S_T(x)H_C(x)$$

(by independence of $T$ and $C$), therefore

$$\lambda(x, \delta = 1) = \frac{f_T(x)H_C(x)}{S_T(x)H_C(x)} = \frac{f_T(x)}{S_T(x)} = \lambda_T(x).$$
Remark:

1. This last statement is very important. It says that if $T$ and $C$ are independent, then the cause-specific hazard for failing (of the observed data) is the same as the underlying hazard of failing for the variable $T$ we are interested in. This result was used implicitly when constructing the life-table, Kaplan-Meier and Nelson-Aalen estimators.

2. If the cause-specific hazard of failing is equal to the hazard of the underlying failure time, the censoring process is said to be non-informative. Except for some pathological examples, non-informative censoring is equivalent to independent censoring.

3. We assumed independent censoring when we derived the density function for $(X, \delta)$ and the cause-specific hazard. All results depend on this assumption. If this assumption is violated, all the inferential methods will yield biased results.

4. To make matters more complex, we cannot tell whether or not $T$ and $C$ are independent based on the observed data $(X_i, \delta_i)$, $i = 1, 2, \ldots, n$. This is an inherent non-identifiability problem; see Tsiatis (1975) in Proceedings of the National Academy of Sciences.

5. To complete the picture: if $T$ and $C$ are independent, then $\lambda(x, \delta = 0) = \mu_C(x)$.
Now we are in a position to write down the likelihood function for a parametric model given our observed data $(x_i, \delta_i)$, $i = 1, 2, \ldots, n$ (under independence of $T$ and $C$):

$$L(\theta, \psi; x, \delta) = \prod_{i=1}^{n} \{[f(x_i; \theta)]^{\delta_i}[S(x_i; \theta)]^{1-\delta_i}\}\{[g(x_i; \psi)]^{1-\delta_i}[H(x_i; \psi)]^{\delta_i}\}.$$
Keep in mind that we are mainly interested in making inference on the parameters $\theta$ characterizing the distribution of $T$. So if $\theta$ and $\psi$ have no common parameters, we can use the following likelihood function to make inference on $\theta$:

$$L(\theta; x, \delta) = \prod_{i=1}^{n} [f(x_i; \theta)]^{\delta_i}[S(x_i; \theta)]^{1-\delta_i}. \qquad (3.8)$$

Or equivalently,

$$L(\theta; x, \delta) = \prod_{i=1}^{n} [\lambda(x_i; \theta)]^{\delta_i}[S(x_i; \theta)]. \qquad (3.9)$$

Note: Even if $\theta$ and $\psi$ have common parameters, we can still use (3.8) or (3.9) to draw valid inference on $\theta$. Of course, we may lose some efficiency in this case.
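As a sketch of how (3.8) is used in practice, the following R function evaluates the right-censored log-likelihood for the Weibull model of the earlier score-test example. The name loglik.weib and the log-scale parameterization are illustrative choices; the mapping to R's dweibull()/pweibull() (shape $= \alpha$, scale $= \lambda^{-1/\alpha}$) matches $S(t) = e^{-\lambda t^{\alpha}}$:

## Right-censored Weibull log-likelihood, a sketch of (3.8).
loglik.weib <- function(par, x, delta) {
  alpha  <- exp(par[1])                # log scale keeps both parameters positive
  lambda <- exp(par[2])
  scale  <- lambda^(-1/alpha)          # R's scale parameter
  sum(delta * dweibull(x, shape = alpha, scale = scale, log = TRUE) +
      (1 - delta) * pweibull(x, shape = alpha, scale = scale,
                             lower.tail = FALSE, log.p = TRUE))
}

Such a function can be handed to optim() to obtain the MLE and the observed information.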
Likelihood for general censoring case

The likelihood function (3.8) has the following form

$$L(\theta; x, \delta) = \prod_{d\in D} f(x_d) \prod_{r\in R} S(x_r), \qquad (3.10)$$

where $D$ is the set of death times and $R$ is the set of right censored times. For a death time $x_d$, $f(x_d)$ is proportional to the probability of observing a death at time $x_d$. For a right censored observation $x_r$, the only thing we know is that the real survival time $T_r$ is greater than $x_r$. Hence we have $P[T_r > x_r] = S(x_r)$, the probability that the real survival time $T_r$ is greater than $x_r$, for a right censored observation.
The above likelihood can be generalized to the case where there might be any kind of censoring:

$$L(\theta; x, \delta) = \prod_{d\in D} f(x_d) \prod_{r\in R} S(x_r) \prod_{l\in L} [1 - S(x_l)] \prod_{i\in I} [S(U_i) - S(V_i)], \qquad (3.11)$$

where $L$ is the set of left censored observations and $I$ is the set of interval censored observations with the only knowledge that the real survival time $T_i$ is in the interval $[U_i, V_i]$. Note that $S(U_i) - S(V_i) = P[U_i \le T_i \le V_i]$ is the probability that the real survival time $T_i$ is in $[U_i, V_i]$.
Likelihood for left truncated observations

Suppose now that the real survival time $T_i$ is left truncated at $Y_i$. Then we have to consider the conditional distribution of $T_i$ given that $T_i \ge Y_i$:

$$g(t \mid T_i \ge Y_i) = \frac{f(t)}{P[T_i \ge Y_i]} = \frac{f(t)}{S(Y_i)}. \qquad (3.12)$$

Therefore, the probability of observing a death at $x_d$ is proportional to

$$g(x_d \mid T_d \ge Y_d) = f(x_d)/S(Y_d).$$

The probability that the real survival time $T_r$ is right censored at $x_r$ is

$$P[T_r \ge x_r \mid T_r \ge Y_r] = S(x_r)/S(Y_r).$$
The probability that the real survival time $T_l$ is left censored at $x_l$ is

$$P[T_l \le x_l \mid T_l \ge Y_l] = [S(Y_l) - S(x_l)]/S(Y_l).$$

And the probability that the real survival time $T_i$ is in $[U_i, V_i]$ ($U_i \ge Y_i$) is

$$P(U_i \le T_i \le V_i \mid T_i \ge Y_i) = P(T_i \ge U_i \mid T_i \ge Y_i) - P(T_i \ge V_i \mid T_i \ge Y_i) = [S(U_i) - S(V_i)]/S(Y_i).$$

In this case, the likelihood function is given by

$$L(\theta; x, \delta) = \prod_{d\in D} \frac{f(x_d)}{S(Y_d)} \prod_{r\in R} \frac{S(x_r)}{S(Y_r)} \prod_{l\in L} \frac{S(Y_l) - S(x_l)}{S(Y_l)} \prod_{i\in I} \frac{S(U_i) - S(V_i)}{S(Y_i)} \qquad (3.13)$$

$$= \left[\prod_{d\in D} f(x_d) \prod_{r\in R} S(x_r) \prod_{l\in L} (S(Y_l) - S(x_l)) \prod_{i\in I} (S(U_i) - S(V_i))\right] \bigg/ \prod_{i=1}^{n} S(Y_i). \qquad (3.14)$$
Likelihood for right truncated observations

We consider the special case of right truncation where only deaths are observed. In this case the probability of observing a death at $Y_i$, conditional on the survival time $T_i$ being less than or equal to $Y_i$, is proportional to

$$f(Y_i)/(1 - S(Y_i)).$$

So the likelihood function is

$$L(\theta; x, \delta) = \prod_{i=1}^{n} \frac{f(Y_i)}{1 - S(Y_i)}. \qquad (3.15)$$
An Example of right censored data: Suppose the underlying survival time T is from an
exponential distribution with parameter (here the parameter is ) and we have observed
data: (x
i
,
i
), i = 1, 2, ..., n. Since (t; ) = and S(t; ) = e
t
, we get the likelihood function
of :
L(; x, ) =
n

i=1

i
e
x
i
=

n
i=1

i
e

n
i=1
x
i
.
So the log-likelihood of is
(; x, ) = log()
n

i=1

i=1
x
i
.
Obviously, the likelihood equation is
U(; x, ) =
d(; X, )
d
=

n
i=1

i=1
x
i
= 0.
So the MLE of $\lambda$ is given by

$$\hat{\lambda} = \frac{\sum_{i=1}^{n}\delta_i}{\sum_{i=1}^{n} x_i} = \frac{\#\ \mbox{of failures}}{\mbox{person time at risk}} = \frac{D}{PT},$$

where $D$ is the number of observed deaths and $PT$ is the total patient time. Since

$$\frac{d^2\ell(\lambda; x, \delta)}{d\lambda^2} = -\frac{\sum_{i=1}^{n}\delta_i}{\lambda^2},$$

the estimated variance for $\hat{\lambda}$ is

$$\widehat{\mbox{Var}}(\hat{\lambda}) = \left[-\frac{d^2\ell(\lambda; x, \delta)}{d\lambda^2}\bigg|_{\lambda=\hat{\lambda}}\right]^{-1} = \frac{\sum_{i=1}^{n}\delta_i}{\left[\sum_{i=1}^{n} x_i\right]^2} = \frac{\hat{\lambda}^2}{D},$$

and asymptotically, we have

$$\hat{\lambda} \stackrel{a}{\sim} N\left(\lambda,\ \frac{\sum_{i=1}^{n}\delta_i}{\left[\sum_{i=1}^{n} x_i\right]^2}\right) = N\left(\lambda,\ \frac{\hat{\lambda}^2}{D}\right).$$
This result can be used to construct confidence intervals for $\lambda$ or perform hypothesis tests on $\lambda$. For example, a $(1-\alpha)$ confidence interval for $\lambda$ is given by

$$\hat{\lambda} \pm z_{\alpha/2}\,\frac{\hat{\lambda}}{\sqrt{D}}.$$
Note:

1. Sometimes the exponential distribution is parameterized in terms of the mean parameter $\mu = 1/\lambda$. In this case the MLE of $\mu$ is given by

$$\hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n}\delta_i} = \frac{\mbox{total person time at risk}}{\#\ \mbox{of failures}} = \frac{PT}{D},$$

and asymptotically,

$$\hat{\mu} \stackrel{a}{\sim} N\left(\mu,\ \frac{\hat{\mu}^2}{D}\right).$$

(The estimated variance of $\hat{\mu}$ can be obtained by inverting the observed information or by using the delta-method.)
2. If we ignored censoring and treated the data $x_1, x_2, \ldots, x_n$ as a complete sample from the exponential distribution, then the MLE of $\mu$ would be

$$\hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n},$$

which, depending on the percentage of censoring, would severely underestimate the true mean $\mu$ (note that the sample size $n$ is always larger than $D$, the number of deaths).
A Data Example: The data below show survival times (in months) of patients with a certain disease:

3, 5, 6*, 8, 10*, 11*, 15, 20*, 22, 23, 27*, 29, 32, 35, 40, 26, 28, 33*, 21, 24*,

where * indicates a right censored observation. If we fit an exponential model to this data set, we have $D = 13$ and $PT = \sum x_i = 418$, so

$$\hat{\lambda} = \frac{D}{PT} = \frac{13}{418} = 0.0311/\mbox{month},$$

and the estimated standard error of $\hat{\lambda}$ is

$$se(\hat{\lambda}) = \frac{\hat{\lambda}}{\sqrt{D}} = \frac{0.0311}{\sqrt{13}} = 0.0086,$$

and a 95% confidence interval for $\lambda$ is

$$\hat{\lambda} \pm z_{0.025}\, se(\hat{\lambda}) = 0.0311 \pm 1.96 \times 0.0086 = [0.0142, 0.0480].$$
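A quick R check of these hand calculations (values keyed in from the listing above):

## Hand check of the exponential fit.
survtime <- c(3, 5, 6, 8, 10, 11, 15, 20, 22, 23,
              27, 29, 32, 35, 40, 26, 28, 33, 21, 24)
status   <- c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1,
              0, 1, 1, 1, 1, 1, 1, 0, 1, 0)
D  <- sum(status)                      # 13 observed deaths
PT <- sum(survtime)                    # 418 months of person-time
lambda.hat <- D / PT                   # 0.0311 per month
lambda.hat + c(-1, 1) * 1.96 * lambda.hat / sqrt(D)   # 95% CI [0.0142, 0.0480]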
To see how well the exponential model fits the data, the fitted exponential survival function is superimposed on the Kaplan-Meier estimate, as shown in Figure 3.1, produced using the following R functions:
Figure 3.1: Three fits to the survival data
[Figure: Kaplan-Meier estimate, exponential fit and Weibull fit; x-axis: patient time (months); y-axis: survival probability; legend: KM estimate, Exponential fit, Weibull fit.]
> example <- read.table(file="tempsurv.dat", header=T)
> fit <- survfit(Surv(survtime, status), conf.type=c("plain"), example)
> plot(0,0, xlim=c(0,40), ylim=c(0,1),
xlab="Patient time (months)", ylab="survival probability", pch=" ")
> lines(fit, lty=1)
> x <- seq(0,40, by=0.5)
> sx <- exp(-0.0311*x)
> lines(x, sx, lty=2)
where the data file tempsurv.dat looks like the following
survtime status
3 1
5 1
6 0
8 1
10 0
11 0
15 1
20 0
22 1
23 1
27 0
29 1
32 1
35 1
40 1
26 1
28 1
33 0
21 1
24 0
Obviously, the exponential distribution is a poor fit. In this case, we can choose one of the following options:

1. Choose a more flexible model, such as the Weibull model.

2. Be content with the Kaplan-Meier estimator, which makes no assumption regarding the shape of the distribution. In most biomedical applications, the default is to go with the Kaplan-Meier estimator.
To complete the picture, we fit a Weibull model to the data set. Recall that the Weibull model has the following survival function

$$S(t) = e^{-\lambda t^{\alpha}}$$

and the following hazard function

$$\lambda(t) = \lambda\alpha t^{\alpha-1}.$$

So the likelihood function of $\theta = (\lambda, \alpha)$ is given by

$$L(\lambda, \alpha; x, \delta) = \prod_{i=1}^{n} \left[\lambda\alpha x_i^{\alpha-1}\right]^{\delta_i} e^{-\lambda x_i^{\alpha}}.$$
However, there is no closed form for the MLE of $\theta = (\lambda, \alpha)$. So we used Proc Lifereg in SAS to fit the Weibull model, implemented using the following SAS program:
options ls=80 ps=200;
Data tempsurv;
infile "tempsurv.dat" firstobs=2;
input survtime status;
run;
Proc lifereg data=tempsurv;
model survtime*status(0)= / dist=weibull;
run;
The above program produced the following output:
10:41 Friday, January 28, 2005
The LIFEREG Procedure
Model Information
Data Set WORK.TEMPSURV
Dependent Variable Log(survtime)
Censoring Variable status
Censoring Value(s) 0
Number of Observations 20
Noncensored Values 13
Right Censored Values 7
Left Censored Values 0
Interval Censored Values 0
Name of Distribution Weibull
Log Likelihood -16.67769141
Algorithm converged.
Analysis of Parameter Estimates
Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 3.3672 0.1291 3.1141 3.6203 679.81 <.0001
Scale 1 0.4653 0.1087 0.2943 0.7355
Weibull Scale 1 28.9964 3.7447 22.5121 37.3483
Weibull Shape 1 2.1494 0.5023 1.3596 3.3979
This SAS program fits a Weibull model with two parameters: an intercept $\beta_0$ and a scale parameter $\sigma$. The two parameters we use, $\lambda$ and $\alpha$, are related to $\beta_0$ and $\sigma$ by (the detail will be discussed in Chapter 5)

$$\lambda = e^{-\beta_0/\sigma}, \qquad \alpha = \frac{1}{\sigma}.$$

Since the MLEs of $\beta_0$ and $\sigma$ are $\hat{\beta}_0 = 3.36717004$ and $\hat{\sigma} = 0.46525153$, the MLEs $\hat{\lambda}$ and $\hat{\alpha}$ are

$$\hat{\lambda} = e^{-\hat{\beta}_0/\hat{\sigma}} = e^{-3.36717004/0.46525153} = 0.00072, \qquad \hat{\alpha} = \frac{1}{\hat{\sigma}} = \frac{1}{0.46525153} = 2.149.$$
So $\hat{\alpha}$ is the Weibull Shape parameter in the SAS output. However, SAS uses the parameterization $S(t) = e^{-(t/\beta)^{\alpha}}$ for the Weibull distribution, so that $\beta$ is the Weibull scale parameter. Comparing this to our parameterization, we see that

$$\left(\frac{1}{\beta}\right)^{\alpha} = \lambda, \qquad \mbox{i.e.,} \qquad \beta = \left(\frac{1}{\lambda}\right)^{1/\alpha}.$$

The estimate of this Weibull scale parameter is

$$\hat{\beta} = \left(\frac{1}{0.00072}\right)^{1/2.149} = 28.99.$$
The fitted Weibull survival function was superimposed on the Kaplan-Meier estimator in Figure 3.1 using the following R functions:
> alpha <- 1/0.46525153
> lambda <- exp(-3.36717004/0.46525153)
> sx <- exp(-lambda * x^alpha)
# the object "x" was created before
> lines(x, sx, lty=4)
> legend(25,1, c("KM estimate", "Exponential fit", "Weibull fit"),
lty=c(1,2,4), cex=0.8)
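For completeness, the same Weibull fit can be obtained in R with survreg(), which parallels Proc Lifereg. This is a sketch assuming the data frame example created earlier and the survival library; the transformations follow the relations above:

## Weibull fit in R; intercept and scale match the SAS output.
wfit <- survreg(Surv(survtime, status) ~ 1, data = example, dist = "weibull")
b0  <- as.numeric(coef(wfit))          # intercept, approx 3.367
sig <- wfit$scale                      # scale, approx 0.465
alpha  <- 1 / sig                      # Weibull shape, approx 2.149
lambda <- exp(-b0 / sig)               # our rate parameter, approx 0.00072
(1 / lambda)^(1 / alpha)               # SAS "Weibull Scale", approx 29.0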
Compared to the exponential fit, the Weibull model fits the data much better (its estimated survival function tracks the Kaplan-Meier estimator much better than the estimated exponential survival function does). In fact, since the exponential model is a special case of the Weibull model (when $\alpha = 1$), we can test $H_0: \alpha = 1$ using the Weibull fit. Note that $H_0: \alpha = 1$ is equivalent to $H_0: \sigma = 1$. Since

$$\left[\frac{\hat{\sigma} - 1}{se(\hat{\sigma})}\right]^2 = \left[\frac{0.46525153 - 1}{0.108717}\right]^2 = 24.194,$$

and $P[\chi_1^2 > 24.194] = 0.0000$, we reject $H_0: \sigma = 1$, i.e., we reject the exponential model. Note also that $\hat{\alpha} = 2.149 > 1$, so the estimated Weibull model has an increasing hazard function.

The inadequacy of the exponential fit is also demonstrated in the first plot of Figure 3.2. If the exponential model were a good fit to the data, we would see a straight line. On the other hand, plot 2 in Figure 3.2 shows the adequacy of the Weibull model, since a straight line in the plot of $\log\{-\log(\hat{S}(t))\}$ vs. $\log(t)$ indicates a Weibull model. Here $\hat{S}(t)$ is the KM estimate. This graph was plotted using the following R code:
Figure 3.2: Two empirical plots
[Figure: two panels — left: $-\log(\hat{S}(t))$ vs. patient time (months); right: log of cumulative hazard vs. log of patient time (months).]
postscript(file="fig4.2.ps", horizontal = F,
height=6, width=8.5, font=3, pointsize=14)
par(mfrow=c(1,2), pty="s")
example <- read.table(file="tempsurv.dat", header=T)
fit <- survfit(Surv(survtime, status), conf.type=c("plain"), example)
plot(fit$time, -log(fit$surv), type="s", xlab=c("Patient time (months)"),
ylab=c("-Log(S(t))"))
plot(log(fit$time), log(-log(fit$surv)), type="s", ylim=c(-4,1),
xlab=c("Log of patient time (months)"),
ylab=c("Log of cumulative hazard"))
dev.off()
4 Two (K) Sample Problems
In many biomedical experiments we are interested in comparing the survival distributions
between two or more groups. For example, in phase III clinical trials we may be interested
in comparing the survival distributions between two or more competing treatments on patients
with a particular disease. For the time being, we will consider two sample comparisons and later
extend to k > 2 sample comparisons.
The problem of comparing two treatments with respect to a time to event endpoint can be
posed as a hypothesis testing problem. Let Z denote the treatment indicator. That is, Z = 1
for treatment 1 and Z = 0 for treatment 0.
In general, we will use treatment 0 (or group 0) to mean the standard treatment or placebo
comparator, and treatment 1 (or group 1) to denote the new treatment that is to be compared
to the standard or the placebo. This of course does not have to be the case; for example, we
may be comparing two new promising treatments to each other in a disease for which there is
no agreement about what treatment is standard.
The null hypothesis is generally that of no treatment (group) difference; that is, the distribution of the time to event is the same for both treatments. If we denote by $S_0(t)$ and $S_1(t)$ the survival functions for treatments 0 and 1 respectively, then the null hypothesis can be expressed as

$$H_0: S_0(t) = S_1(t), \quad \mbox{for } t \ge 0,$$

or equivalently,

$$H_0: \lambda_0(t) = \lambda_1(t), \quad \mbox{for } t \ge 0,$$

where $\lambda_0(t)$ and $\lambda_1(t)$ are the hazard functions for treatments 0 and 1 respectively. Recall that

$$\lambda_j(t) = -\frac{d\log\{S_j(t)\}}{dt}, \quad j = 0, 1.$$
The alternative hypothesis we are most interested in is that the survival time for one treatment is stochastically larger or smaller than the survival time for the other treatment. For example, we may be interested in the alternative that the new treatment is better than the standard one:

$$H_a: S_1(t) \ge S_0(t), \quad \mbox{for } t \ge 0, \mbox{ with strict inequality for some } t.$$

This is an example of a one-sided alternative. Most often, we are interested in declaring a difference from the null hypothesis if either treatment is better than the other. If this is the case, we use a two-sided alternative:

$$H_a: \mbox{either } S_1(t) \ge S_0(t) \mbox{ or } S_0(t) \ge S_1(t), \mbox{ with strict inequality for some } t.$$

In biomedical applications, it has become common practice to use nonparametric tests; that is, test statistics whose distribution under the null hypothesis does not depend on specific parametric assumptions about the shape of the probability distribution. With censored survival data, the class of weighted logrank tests is mostly used to test the null hypothesis of treatment equality, with the logrank test being the most commonly used.
Censored survival data for comparing two groups are given as a sample of triplets $(X_i, \delta_i, Z_i)$, $i = 1, 2, \ldots, n$, where

$$X_i = \min(T_i, C_i), \quad T_i = \mbox{latent failure time}, \quad C_i = \mbox{latent censoring time}, \quad \delta_i = I(T_i \le C_i),$$

$$Z_i = \begin{cases} 1 & \mbox{new treatment} \\ 0 & \mbox{standard treatment} \end{cases}$$
We now define the following notation:

$n_1$ = number of individuals in group 1, and $n_0$ = number of individuals in group 0. Obviously,

$$n_j = \sum_{i=1}^{n} I(Z_i = j), \quad j = 0, 1, \qquad n = n_0 + n_1.$$
The numbers of individuals at risk at time $x$ from treatments 0 and 1 are denoted by $Y_0(x)$ and $Y_1(x)$ respectively, where

$$Y_0(x) = \sum_{i=1}^{n} I(X_i \ge x, Z_i = 0), \qquad Y_1(x) = \sum_{i=1}^{n} I(X_i \ge x, Z_i = 1),$$

and the total number at risk at time $x$ is denoted by

$$Y(x) = Y_0(x) + Y_1(x).$$
Similarly, let $dN_0(x)$ and $dN_1(x)$ denote the numbers of deaths observed at time $x$ from treatments 0 and 1 respectively,

$$dN_0(x) = \sum_{i=1}^{n} I(X_i = x, \delta_i = 1, Z_i = 0), \qquad dN_1(x) = \sum_{i=1}^{n} I(X_i = x, \delta_i = 1, Z_i = 1),$$

and

$$dN(x) = \sum_{i=1}^{n} I(X_i = x, \delta_i = 1) = dN_0(x) + dN_1(x).$$
Note: In some applications, $dN(x)$ will actually correspond to the observed number of deaths in a time window $[x, x+\Delta x)$ for some partition of the time axis into intervals of length $\Delta x$. If the partition is sufficiently fine, then thinking of the number of deaths occurring exactly at $x$ or in $[x, x+\Delta x)$ makes little difference, and in the limit makes no difference at all. When we are dealing with data, we can view $dN(x)$ as the number of deaths observed at time $x$. In theory the probability of observing a death exactly at time $x$ is always zero, so we understand $dN(x)$ as the number of deaths observed in $[x, x+\Delta x)$ in our theoretical arguments later.
The weighted logrank test statistic is given by

$$T(w) = \frac{U(w)}{\widehat{se}(U(w))},$$

where

$$U(w) = \sum_{x} w(x)\left[dN_1(x) - \frac{Y_1(x)\, dN(x)}{Y(x)}\right],$$

and $\widehat{se}(U(w))$ will be given later. The null hypothesis of treatment equality will be rejected if $T(w)$ is sufficiently different from zero.

Note: At any time $x$ for which there is no observed death,

$$dN_1(x) - \frac{Y_1(x)\, dN(x)}{Y(x)} = 0.$$

This means that the sum above is effectively only over the distinct failure times.
Motivation of the test

These two-sample tests can be viewed as a weighted sum, over the distinct failure times, of the observed number of deaths from treatment 1 minus the expected number of deaths from treatment 1 if the null hypothesis were true.

Figure 4.1: A slice of time
[Figure: the interval $[x, x+\Delta x)$ on the time axis.]

Take a slice of time as shown in Figure 4.1 and consider the resulting $2\times 2$ table (Table 4.1). If $H_0$ is true, then conditional on $Y_1(x)$, $Y(x)$ and $dN(x)$,

$$dN_1(x) \mid (Y_1(x), Y(x), dN(x)) \sim \mbox{Hypergeometric}(Y_1(x), dN(x), Y(x)).$$
Table 4.1: $2\times 2$ table from $[x, x+\Delta x)$

                    Treatment 0             Treatment 1             Total
# of deaths         $dN_0(x)$               $dN_1(x)$               $dN(x)$
# not dying         $Y_0(x) - dN_0(x)$      $Y_1(x) - dN_1(x)$      $Y(x) - dN(x)$
# at risk           $Y_0(x)$                $Y_1(x)$                $Y(x)$
This is analogous to assuming that there are $dN(x)$ black balls and $Y(x) - dN(x)$ white balls, and randomly drawing $Y_1(x)$ balls from these $Y(x)$ balls by sampling without replacement. Then the number of black balls $dN_1(x)$ in the sample has the above hypergeometric distribution. Obviously,

$$E[dN_1(x) \mid (Y_1(x), Y(x), dN(x))] = \frac{dN(x)}{Y(x)}\, Y_1(x).$$

Since the observed number of deaths at $x$ from treatment 1 is $dN_1(x)$, the observed minus expected is equal to

$$\left[dN_1(x) - \frac{dN(x)\, Y_1(x)}{Y(x)}\right].$$
From this point of view, the censored survival data can be viewed as $k$ such $2\times 2$ tables, where $k$ corresponds to the total number of distinct failure times from the two groups combined. If the null hypothesis were true, we would expect

$$\left[dN_1(x) - \frac{dN(x)\, Y_1(x)}{Y(x)}\right]$$

to be equal to zero on average, and hence so should the sum over all $x$. If, however, the hazard rate for treatment 1 were lower than that for treatment 0 consistently over $x$, then on average we would expect this quantity to be negative. The opposite should be true if the hazard rate for treatment 1 were consistently higher than that for treatment 0.

This suggests that we should reject the null hypothesis if our test statistic $T(w)$ is sufficiently far from zero: positive, negative, or in absolute value, depending on the alternative hypothesis.
In order to measure the strength of the evidence against the null hypothesis, we must be able to evaluate the distribution of the test statistic (at least approximately) under the null hypothesis. Specifically, the weighted logrank test statistic is given by

$$T(w) = \frac{\sum_x w(x)\left[dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right]}{\left\{\sum_x w^2(x)\left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x)-dN(x)]}{Y^2(x)[Y(x)-1]}\right]\right\}^{1/2}}.$$

Under the null hypothesis, this test statistic is approximately distributed as a standard normal:

$$T(w) \stackrel{a}{\sim} N(0, 1).$$
Therefore, a level $\alpha$ test (two-sided) will reject $H_0: S_0(t) = S_1(t)$ whenever

$$|T(w)| \ge z_{\alpha/2},$$

where $z_{\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard normal distribution.
A heuristic justification for this result will be given shortly. We want to mention, however, that the most commonly used test statistic is the logrank test, where $w(x) = 1$ for all $x$:

$$\mbox{logrank test statistic} = \frac{\sum_x \left[dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right]}{\left\{\sum_x \left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x)-dN(x)]}{Y^2(x)[Y(x)-1]}\right]\right\}^{1/2}}.$$
Remark: The statistic in the numerator is a weighted sum of observed minus expected over the $k$ $2\times 2$ tables, where $k$ is the number of distinct failure times.
The weight function $w(x)$ can be used to emphasize differences in the hazard rates over time according to their relative values. For example, if the weight is larger early in time and becomes smaller later, then such a test statistic emphasizes early differences in the survival curves. The weights to be chosen depend on the type of alternative we wish to detect.

Note: If the weights $w(x)$ are stochastic (functions of the data), then they need to be a function of the censoring and survival information prior to time $x$.

The most commonly used test is the logrank test, where $w(x) = 1$ for all $x$. Other tests given in the literature are:
1. Gehan's generalization of the Wilcoxon test, which uses $w(x) = Y(x)$.

2. Peto-Prentice's generalization of the Wilcoxon test, which uses $w(x) = \widehat{KM}(x)$, where $\widehat{KM}(x)$ is the Kaplan-Meier estimator from the combined sample; i.e.,

$$w(x) = \prod_{u \le x}\left[1 - \frac{dN(u)}{Y(u)}\right].$$

Since both $Y(x)$ and $\widehat{KM}(x)$ are non-increasing functions of $x$, both tests emphasize differences early in the survival curves.
Heuristic proof of the statistical properties of the weighted logrank test

As you will soon see, the proofs are similar to those used to find the variance of the Nelson-Aalen estimator, and will rely heavily on the double expectation theorem (or iterated expectation (variance) theorem).

Toward that end, we define the set of random variables

$$\mathcal{F}(x) = \{dN_0(u), dN_1(u), Y_0(u), Y_1(u), w_0(u), w_1(u), dN(x) \mbox{ for all grid points } u < x\}.$$
That is, when we define $\mathcal{F}(x)$, we know all the failure and censoring information that has occurred prior to time $x$ from either treatment, the number of individuals at risk at time $x$, as well as the total number of deaths ($dN(x)$) that occur in $[x, x+\Delta x)$. What we don't know is the number of deaths from each treatment group in $[x, x+\Delta x)$.
Let us consider the $2\times 2$ table (Table 4.1) that is created using the slice of time $[x, x+\Delta x)$. We have already argued that given that an individual is at risk at time $x$ and is in treatment group 1, then (assuming independent censoring) the probability of dying in $[x, x+\Delta x)$ is equal to $\lambda_1(x)\Delta x$, where $\lambda_1(x)$ is the hazard function for treatment group 1. Similarly, the probability is equal to $\lambda_0(x)\Delta x$ for an individual at risk at time $x$ from treatment group 0. Under the null hypothesis

$$H_0: \lambda_1(x) = \lambda_0(x),$$

the conditional probability of dying in $[x, x+\Delta x)$, given being at risk at time $x$, is the same for both treatment groups.
Assume the null hypothesis is true. Knowing $\mathcal{F}(x)$ would imply (with respect to the $2\times 2$ table) that we know $Y_1(x)$ and $Y_0(x)$ (i.e., the number at risk at time $x$ from either treatment group) and, in addition, we know $dN(x)$ (i.e., the total number of deaths from both treatment groups occurring in $[x, x+\Delta x)$). The only thing we don't know about the $2\times 2$ table is $dN_1(x)$ (note: knowing this would complete the knowledge of the counts in the $2\times 2$ table).

Fact: In a $2\times 2$ table, under the assumption of independence, the count in one cell of the table, conditional on the marginal counts, follows a hypergeometric distribution. (This is the basis of Fisher's exact test for independence in a $2\times 2$ table.)

Conditional on $\mathcal{F}(x)$, we have a $2\times 2$ table which under the null hypothesis satisfies independence, and we have knowledge of the marginal counts of the table (i.e., the marginal counts are fixed conditional on $\mathcal{F}(x)$).
Therefore, the conditional distribution of one of the counts, say $dN_1(x)$, in a cell of the table, given $\mathcal{F}(x)$, follows a hypergeometric distribution.

This is equivalent to imagining that there are $Y(x) = dN(x) + (Y(x) - dN(x))$ balls in an urn, of which $dN(x)$ are black and $Y(x) - dN(x)$ are white. We then randomly draw $Y_1(x)$ balls from the urn without replacement. Let $dN_1(x)$ be the number of black balls in this sample. Then $dN_1(x)$ has a hypergeometric distribution, i.e.,

$$P[dN_1(x) = c \mid Y_1(x), Y_0(x), dN(x)] = \frac{\binom{dN(x)}{c}\binom{Y(x)-dN(x)}{Y_1(x)-c}}{\binom{Y(x)}{Y_1(x)}}.$$
From the properties of a hypergeometric distribution, we know that conditional on $\mathcal{F}(x)$, $dN_1(x)$ has the following mean and variance:

$$E[dN_1(x) \mid \mathcal{F}(x)] = \frac{dN(x)\, Y_1(x)}{Y(x)},$$

$$\mbox{Var}[dN_1(x) \mid \mathcal{F}(x)] = \frac{dN(x)\, Y_1(x)\, Y_0(x)\, [Y(x) - dN(x)]}{Y^2(x)[Y(x) - 1]}.$$

Note that $Y_1(x)$ is the sample size, and

$$\frac{dN(x)}{Y(x)} = \mbox{proportion of black balls},$$
$$\frac{Y(x) - dN(x)}{Y(x)} = \mbox{proportion of white balls},$$
$$\frac{Y(x) - Y_1(x)\ (= Y_0(x))}{Y(x) - 1} = \mbox{variance correction factor}.$$
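These moment formulas can be verified numerically against R's hypergeometric functions; the margins below ($Y_1 = 10$, $Y_0 = 15$, $dN = 3$) are illustrative, not from any data set:

## Numeric check of the conditional mean and variance at one time point.
Y1 <- 10; Y0 <- 15; dN <- 3; Y <- Y1 + Y0
d1 <- 0:dN
p  <- dhyper(d1, m = dN, n = Y - dN, k = Y1)     # P[dN1(x) = d1 | margins]
sum(d1 * p)                                      # equals dN * Y1 / Y
sum((d1 - dN * Y1 / Y)^2 * p)                    # matches the variance formula:
dN * Y1 * Y0 * (Y - dN) / (Y^2 * (Y - 1))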
Now that we have taken care of some preliminaries, let us go back to our weighted logrank test. The first thing we want to demonstrate is that, under the null hypothesis, the numerator of the weighted logrank test statistic has mean zero. The numerator is

$$U(w) = \sum_x w(x)\left[dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right],$$
which has the expectation

$$E[U(w)] = \sum_x E\left\{w(x)\left[dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right]\right\} = \sum_x E\left\{E\left[w(x)\left(dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right) \bigg|\, \mathcal{F}(x)\right]\right\}.$$

By the assumption we made about $w(x)$, we know that $w(x)$ is a function of the data prior to $x$; that is to say, conditional on $\mathcal{F}(x)$, $w(x)$ is a known value. Again, given $\mathcal{F}(x)$, $Y_1(x)$, $dN(x)$ and $Y(x)$ are known. Therefore, the inner expectation in the above sum can be written as

$$E\left[w(x)\left(dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right) \bigg|\, \mathcal{F}(x)\right] = w(x)\left\{E[dN_1(x)|\mathcal{F}(x)] - \frac{dN(x)Y_1(x)}{Y(x)}\right\} = 0.$$
So the inner expectation is equal to zero, and consequently so is the total expectation, being the sum of the inner expectations. Therefore, under the null hypothesis, we have

$$E[U(w)] = 0.$$

Finding an unbiased estimator for the variance of $U(w)$

For ease of notation, let us define

$$U(w) = \sum_x A(x), \quad \mbox{where} \quad A(x) = w(x)\left[dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right].$$
The variance of $U(w)$ is

$$\mbox{Var}[U(w)] = \mbox{Var}\left[\sum_x A(x)\right] = \sum_x \mbox{Var}(A(x)) + \sum_{x\neq y} \mbox{Cov}(A(x), A(y)).$$
By using a conditioning argument, we will now show that each covariance term (i.e., each cross-product term) is equal to zero. Let us take one arbitrary covariance term $\mbox{Cov}(A(x), A(y))$ for $y < x$, where $x$ and $y$ are grid points of the partition of the time axis. Remember that we have already shown that

$$E(A(x)) = 0 \quad \mbox{and} \quad E(A(y)) = 0.$$

Therefore,

$$\mbox{Cov}(A(x), A(y)) = E[A(x)A(y)].$$

By the double expectation theorem, we have

$$\mbox{Cov}(A(x), A(y)) = E[A(x)A(y)] = E\left[E[A(x)A(y) \mid \mathcal{F}(x)]\right].$$

Since $y < x$, all the elements which make up $A(y)$ (i.e., $dN_1(y)$, $Y_1(y)$, $dN(y)$, $Y(y)$) are known when we condition on $\mathcal{F}(x)$. Hence the inner expectation is

$$E[A(x)A(y) \mid \mathcal{F}(x)] = A(y)\, E[A(x) \mid \mathcal{F}(x)] = 0.$$

Therefore we have shown that $\mbox{Cov}(A(x), A(y)) = 0$. Since $y < x$ are arbitrary, all the covariance terms are equal to zero. Therefore,

$$\mbox{Var}[U(w)] = \sum_x \mbox{Var}(A(x)) = \sum_x E[A^2(x)] = \sum_x E\left[E[A^2(x) \mid \mathcal{F}(x)]\right].$$
Let us examine the inner expectation more closely:

$$E[A^2(x) \mid \mathcal{F}(x)] = E\left[w^2(x)\left(dN_1(x) - \frac{Y_1(x)dN(x)}{Y(x)}\right)^2 \bigg|\, \mathcal{F}(x)\right]$$
$$= w^2(x)\, E[\{dN_1(x) - E[dN_1(x)|\mathcal{F}(x)]\}^2 \mid \mathcal{F}(x)] = w^2(x)\, \mbox{Var}[dN_1(x) \mid \mathcal{F}(x)]$$
$$= w^2(x)\left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x) - dN(x)]}{Y^2(x)[Y(x) - 1]}\right].$$
Therefore, we have shown that

$$\mbox{Var}[U(w)] = \sum_x E\left[E[A^2(x) \mid \mathcal{F}(x)]\right] = \sum_x E\left\{w^2(x)\left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x) - dN(x)]}{Y^2(x)[Y(x) - 1]}\right]\right\},$$

which means that the statistic

$$\sum_x w^2(x)\left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x) - dN(x)]}{Y^2(x)[Y(x) - 1]}\right]$$

is an unbiased estimator of $\mbox{Var}[U(w)]$.
Recapping: Under the null hypothesis $H_0: S_0(t) = S_1(t)$,

1. The statistic $U(w) = \sum_x A(x)$ has expectation equal to zero: $E(U(w)) = 0$.

2. $U(w) = \sum_x A(x)$ is made up of a sum of conditionally uncorrelated terms, each with mean zero. By the central limit theory for such martingale structures, $U(w)$, properly normalized, will be approximately a standard normal random variable. That is,

$$\frac{U(w)}{\widehat{se}(U(w))} \stackrel{a}{\sim} N(0, 1) \quad \mbox{under } H_0.$$

3. We showed that an unbiased estimator of the variance of $U(w)$ is given by

$$\sum_x w^2(x)\left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x) - dN(x)]}{Y^2(x)[Y(x) - 1]}\right].$$
Therefore

$$T(w) = \frac{U(w)}{\widehat{se}(U(w))} = \frac{\sum_x w(x)\left[dN_1(x) - \frac{dN(x)Y_1(x)}{Y(x)}\right]}{\left\{\sum_x w^2(x)\left[\frac{Y_1(x)Y_0(x)dN(x)[Y(x)-dN(x)]}{Y^2(x)[Y(x)-1]}\right]\right\}^{1/2}} \stackrel{a}{\sim} N(0, 1).$$

This ends the heuristic proof.
We will illustrate the tests (the logrank test, Gehan's Wilcoxon test and Peto-Prentice's Wilcoxon test) using the following data set (data file = myel.dat) taken from Paul Allison's book. The data give the survival times for 25 myelomatosis patients randomized to two treatments (1 or 2):
dur status trt renal
8 1 1 1
180 1 2 0
632 1 2 0
852 0 1 0
52 1 1 1
2240 0 2 0
220 1 1 0
63 1 1 1
195 1 2 0
76 1 2 0
70 1 2 0
8 1 1 0
13 1 2 1
1990 0 2 0
1976 0 1 0
18 1 2 1
700 1 2 0
1296 0 1 0
1460 0 1 0
210 1 2 0
63 1 1 1
1328 0 1 0
1296 1 2 0
365 0 1 0
23 1 2 1
where dur is the patient's survival or censored time, status is the censoring indicator, trt is the treatment indicator and renal is the indicator of impaired renal function (0 = normal; 1 = impaired). To test the null hypothesis that the treatment trt has no effect (i.e., $H_0: S_0(t) = S_1(t)$), we used the following SAS program to perform the logrank and Gehan's Wilcoxon tests:
options ls=78 ps=60;
data myel;
infile "myel.dat" firstobs=2;
input dur status trt renal;
run;
proc lifetest data=myel;
time dur*status(0);
strata trt;
run;
Part of the output from this program gives the logrank and Gehan's Wilcoxon tests:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
09:21 Tuesday, January 11, 2000 3
The LIFETEST Procedure
Testing Homogeneity of Survival Curves over Strata
Time Variable DUR
Rank Statistics
TRT Log-Rank Wilcoxon
1 -2.3376 -18.000
2 2.3376 18.000
Covariance Matrix for the Log-Rank Statistics
TRT 1 2
1 4.16301 -4.16301
2 -4.16301 4.16301
Covariance Matrix for the Wilcoxon Statistics
TRT 1 2
1 1301.00 -1301.00
2 -1301.00 1301.00
Test of Equality over Strata
Pr >
Test Chi-Square DF Chi-Square
Log-Rank 1.3126 1 0.2519
Wilcoxon 0.2490 1 0.6178
-2Log(LR) 1.5240 1 0.2170
This output gives the numerators of the logrank and Gehan's Wilcoxon tests and their estimated variances for each group. So, for example, the numerator of the logrank test for treatment 1 is $-2.3376$ with estimated variance 4.16301 (a negative value suggests that on average treatment 1 is better than treatment 2, but we have to judge this statement using the p-value from our test). So $(-2.3376)^2/4.16301 = 1.3126$ and the p-value of this test is $P[\chi_1^2 > 1.3126] = 0.2519$. Similarly, the p-value for Gehan's Wilcoxon test is 0.6178. Note that the numerator of Gehan's Wilcoxon test is much larger in absolute value than that of the logrank test, since Gehan's Wilcoxon test uses the number at risk as the weight while the logrank test uses the identity weight. (The last test is a likelihood ratio test based on the exponential model; see the next chapter for more detail.)
In this example, the logrank test gives a more significant result than Gehan's Wilcoxon test (although neither provides strong evidence against the null hypothesis). Why is that? Recall that Gehan's Wilcoxon test puts more weight on early times than on later times (the number at risk is a decreasing function of time), while the logrank test puts equal weight on all times. So if the true survival distributions for treatments 1 and 2 differ less early on than later (of course, there will eventually be no difference in their survival functions when the time is sufficiently large), then the logrank test is more powerful (more sensitive) than Gehan's Wilcoxon test. This is the case if $\lambda_1(t)$ and $\lambda_0(t)$ are related by

$$\lambda_1(t) = \theta\lambda_0(t), \quad \mbox{for all } t \ge 0,$$

where $\theta > 0$ is a constant. This means that the hazard for group 1 is proportional to that for group 0. This is the proportional hazards model proposed by D.R. Cox. We will discuss this more when we talk about the power of these tests.
The treatment-specific Kaplan-Meier survival estimates were generated using the following S-PLUS functions and are presented in Figure 4.2:
postscript(file="fig4.2.ps", horizontal = F,
height=6, width=8.5, pointsize=14)
# par(mfrow=c(1,2))
example <- read.table(file="myel.dat", header=T)
Figure 4.2: Kaplan-Meier estimates for two treatments
[Figure: Kaplan-Meier curves by treatment; x-axis: patient time (months); y-axis: survival probability; legend: trt = 1, trt = 2.]
fit <- survfit(Surv(dur, status) ~ trt, example)
plot(fit, xlab="Patient time (months)", ylab="survival probability",
lty=c(1,2))
legend(1000,1, c("trt = 1", "trt = 2"),
lty=c(1,2), cex=0.8)
dev.off()
Figure 4.2 shows that there is less difference in the estimated survival functions early on than later, so the logrank test gives a more significant result.

We can also use the following S-PLUS functions to do the logrank test:
> survdiff(Surv(dur, status) ~ trt, example)
survdiff(Surv(dur, status) ~ trt, example)
Call:
survdiff(formula = Surv(dur, status) ~ trt, data = example)
N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 12 6 8.34 0.655 1.31
trt=2 13 11 8.66 0.631 1.31
Chisq= 1.3 on 1 degrees of freedom, p= 0.252
If we want to perform Peto-Prentice's Wilcoxon test, we need to specify rho=1 in the above S-PLUS functions:
> survdiff(Surv(dur, status) ~ trt, rho=1, example)
survdiff(Surv(dur, status) ~ trt, rho=1, example)
Call:
survdiff(formula = Surv(dur, status) ~ trt, data = example, rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 12 4.80 5.60 0.115 0.304
trt=2 13 6.83 6.03 0.106 0.304
Chisq= 0.3 on 1 degrees of freedom, p= 0.581
This test has a p-value similar to that of Gehan's Wilcoxon test since both tests put more weight on earlier times than on later times.
Power and Sample Size

Focus so far has been on the null hypothesis. We showed that weighted logrank test statistics (properly normalized) are asymptotically distributed as a standard normal if the null hypothesis is true, enabling us to use these test statistics to compute a p-value to assess the strength of evidence against the null hypothesis (in favor of a treatment difference).

In considering the sensitivity of the tests, we must also assess the power, i.e., the probability of rejecting the null when in truth we have departures from the null. Describing departures from the null hypothesis that we feel are important to detect is complicated. That is because a survival curve is infinite dimensional and departures from the null have to be described as differences at every point in time over the survival curve. Clearly, some simplifying conditions must be given. In clinical trials, proportional hazards alternatives have become very popular. That is,

$$\frac{\lambda_1(t)}{\lambda_0(t)} = \exp(\beta), \quad \mbox{for all } t \ge 0.$$

We use $\exp(\beta)$ since, by necessity, hazard ratios have to be positive, and $\beta = 0$ then corresponds to no treatment difference.
Note:

1. $\beta > 0$: individuals on treatment 1 have worse survival (i.e., die faster).

2. $\beta = 0$: no treatment difference (the null is true).

3. $\beta < 0$: individuals on treatment 1 have better survival (i.e., live longer).
Other ways of representing proportional hazards follow from the relationship

$$\frac{\lambda_1(t)}{\lambda_0(t)} = \exp(\beta) \iff -\frac{d\log\{S_1(t)\}}{dt} = -\frac{d\log\{S_0(t)\}}{dt}\,\exp(\beta)$$
$$\iff \log\{S_1(t)\} = \log\{S_0(t)\}\exp(\beta) + C, \qquad (4.1)$$

where $C$ is a constant to be determined. Taking $t = 0$ in the above identity gives $C = 0$. Therefore we get

$$\log\{S_1(t)\} = \log\{S_0(t)\}\exp(\beta), \qquad (4.2)$$

i.e., $S_1(t) = S_0^{\gamma}(t)$, where $\gamma = \exp(\beta)$.
If we multiply both sides of (4.2) by $-1$ and then take logs, we have:

$$\log[-\log\{S_1(t)\}] = \log[-\log\{S_0(t)\}] + \beta.$$

(Since $0 \le S_j(t) \le 1$, $\log\{S_j(t)\} \le 0$, so we need to multiply $\log\{S_j(t)\}$ by $-1$ before we can take logs.)
This last relationship is very useful to help us identify situations where we may have proportional hazards. By plotting estimated survival curves (say, Kaplan-Meier estimates) for two treatments (groups) on the log[-log] scale, we would see a constant vertical shift between the two curves if the hazards are proportional; the situation is illustrated in Figure 4.3. In this case, we say the two curves are parallel. Do not be misled by the visual impression of the curves near the origin.
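In R or S-PLUS, such a plot can be produced directly from a survfit object; here is a sketch reusing the treatment-specific fit "fit" from the myel data above:

## Visual check of proportional hazards on the log[-log] scale.
plot(fit, fun = "cloglog", lty = c(1, 2),
     xlab = "Patient time (log scale)", ylab = "log(-log(S(t)))")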
Figure 4.3: Two survival functions with proportional hazards on the log[-log] scale
[Figure: two parallel curves of $\log(-\log(S(t)))$ vs. patient time.]

For the specific case where the survival curves for the two groups are exponentially distributed (i.e., constant hazard), we automatically have proportional hazards, since

$$\frac{\lambda_1(t)}{\lambda_0(t)} = \frac{\lambda_1}{\lambda_0}, \quad \mbox{for all } t \ge 0.$$
The median survival time $m$ for an exponentially distributed random variable is given by

$$S(m) = e^{-\lambda m} = 0.5, \quad \mbox{or} \quad m = \log(2)/\lambda.$$

The ratio of the median survival times for two groups having exponential distributions is

$$\frac{m_1}{m_0} = \frac{\log(2)/\lambda_1}{\log(2)/\lambda_0} = \frac{\lambda_0}{\lambda_1}, \qquad (4.3)$$

i.e., the ratio of median survival times is inversely proportional to the ratio of hazard rates. This result may be useful when trying to elicit clinically important differences from your collaborators. If survival times are exponentially distributed (or approximately so), then a desired increase in median survival time can be easily translated into a desired difference in hazard ratio.

Note: If the survival times have exponential distributions, then the ratio of the mean survival times is also inversely proportional to the ratio of hazard rates. Therefore, the clinically important difference in survival times can also be specified via the ratio of mean survival times.
The logrank test is the most powerful test among the weighted logrank tests for detecting proportional hazards alternatives. In fact, it is the most powerful test among all nonparametric tests for detecting proportional hazards alternatives. Therefore, the proportional hazards alternative has not only a nice interpretation but also nice statistical properties. These features lead to the natural use of the (unweighted) logrank test.

In order to compute the power to detect the difference of interest and the corresponding sample sizes, we must be able to derive the distribution of our test statistic under the alternative hypothesis (here the proportional hazards alternative). When censoring does not depend on treatment (e.g., in randomized experiments), the distribution of the logrank test statistic under the alternative

$$H_A: \frac{\lambda_1(t)}{\lambda_0(t)} = \exp(\beta_A), \quad \beta_A \neq 0, \qquad (4.4)$$

is approximated by

$$T_n \stackrel{a}{\sim} N\left(\beta_A\sqrt{d\,\pi(1-\pi)},\ 1\right),$$

where $d$ is the total number of deaths (events), $\pi$ is the proportion of subjects in group 1, and $\beta_A$ is the log hazard ratio under the alternative. That is, under the proportional hazards alternative, the logrank test statistic is distributed approximately as a normal with mean $\beta_A\sqrt{d\,\pi(1-\pi)}$ and variance 1.

For the common case where $\pi = 1/2$ (randomization with equal probability), the mean equals

$$\frac{\beta_A}{2}\sqrt{d}.$$
Such a result will also be useful in determining the sample size during the design stage of an experiment. It is fairly easy to show that in order for a level $\alpha$ test (say, two-sided) to have power $1-\beta$ to detect the alternative of interest, the mean of the test statistic (under the alternative) must be equal to $(z_{\alpha/2} + z_{\beta})$.

Remark: We are using $\beta$ here to describe the type II error probability, since we already used $\beta$ to describe the log hazard ratio. $\beta_A$ is used to denote the log hazard ratio that is felt to be clinically important to detect.
Let $\theta = \beta_A\sqrt{d\,\pi(1-\pi)}$, the mean of the logrank test statistic $T_n$ under the alternative (4.4). Recall our test procedure:

reject $H_0$ when $|T_n| \ge z_{\alpha/2}$,

and

$$T_n \stackrel{a}{\sim} N(0, 1) \mbox{ under } H_0 \quad \mbox{and} \quad T_n \stackrel{a}{\sim} N(\theta, 1) \mbox{ under } H_A.$$
By the definition of power, we have

$$P[|T_n| \ge z_{\alpha/2} \mid H_A] = 1 - \beta \iff P[T_n \ge z_{\alpha/2} \mid H_A] + P[T_n \le -z_{\alpha/2} \mid H_A] = 1 - \beta.$$

Assume $\beta_A > 0$ for the moment, so that $\theta > 0$. In this case,

$$P[T_n \le -z_{\alpha/2} \mid H_A] = P[T_n - \theta \le -z_{\alpha/2} - \theta \mid H_A] = P[Z \le -z_{\alpha/2} - \theta] \quad (Z \sim N(0, 1))$$
$$= P[Z \ge z_{\alpha/2} + \theta] \approx 0 \quad (\mbox{at least less than } \alpha/2, \mbox{ since } P[Z \ge z_{\alpha/2}] = \alpha/2),$$

and

$$P[T_n \ge z_{\alpha/2} \mid H_A] = P[T_n - \theta \ge z_{\alpha/2} - \theta \mid H_A] = P[Z \ge z_{\alpha/2} - \theta] \quad (Z \sim N(0, 1)).$$

Therefore,

$$P[Z \ge z_{\alpha/2} - \theta] \approx 1 - \beta \iff P[Z < z_{\alpha/2} - \theta] \approx \beta \iff P[Z > \theta - z_{\alpha/2}] \approx \beta$$
$$\iff \theta - z_{\alpha/2} = z_{\beta} \quad (\mbox{since } P[Z > z_{\beta}] = \beta \mbox{ by definition}) \iff \theta = z_{\alpha/2} + z_{\beta}.$$
Consequently,

$$\sqrt{d}\,\beta_A\sqrt{\pi(1-\pi)} = z_{\alpha/2} + z_{\beta} \iff d = \frac{(z_{\alpha/2} + z_{\beta})^2}{(\beta_A)^2\,\pi(1-\pi)}.$$

Exactly the same formula for $d$ can be derived if $\beta_A < 0$. This is the number of events $d$ we have to observe in order for our level $\alpha$ logrank test to have power $1-\beta$. In this sense, $d$ acts as the sample size.

For the case where $\pi = 1/2$, we have

$$d = \frac{4(z_{\alpha/2} + z_{\beta})^2}{(\beta_A)^2}.$$
An Example

Take a two-sided logrank test with level $\alpha = 0.05$, power $1-\beta = 0.90$, and $\pi = 1/2$. Then

$$d = \frac{4(1.96 + 1.28)^2}{(\beta_A)^2}.$$

The following table gives the required number of events for several hazard ratios $\exp(\beta_A)$:

Hazard ratio $\exp(\beta_A)$        $d$
2.00                                 88
1.50                                256
1.25                                844
1.10                               4623

Therefore one has to ensure during the design stage that a sufficient number of patients are entered and followed long enough so that the required number of events is attained.
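The table is easy to reproduce in R; here is a sketch using the same z values (1.96 and 1.28) as the notes (the function name is illustrative):

## Required number of events for a two-sided 0.05-level logrank test
## with 90% power and equal allocation (pi = 1/2).
events.needed <- function(hr) ceiling(4 * (1.96 + 1.28)^2 / log(hr)^2)
sapply(c(2, 1.5, 1.25, 1.1), events.needed)      # 88 256 844 4623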
Sample size (number of patients) calculations
One simple approach is to continue the experiment (i.e., keep recruiting patients) until the
required number of failures is obtained.
Example: Suppose patients with advanced lung cancer have a median survival time of 6 months. We have a new treatment which we hope will increase the median survival time to 9 months. If the survival times follow exponential distributions, then this difference would correspond to a hazard ratio of

$$\exp(\beta_A) = \frac{\lambda_1(t)}{\lambda_0(t)} = \frac{\lambda_1}{\lambda_0} = \frac{m_0}{m_1} = \frac{6}{9} = \frac{2}{3}.$$

Then the log hazard ratio is $\beta_A = \log(2/3)$.

Suppose we were asked to help design a clinical trial where these two treatments were going to be compared in a randomized experiment on newly diagnosed lung cancer patients. If patients were randomized with equal probability to the two treatments, and we desired 90% power to detect the above difference using a level $\alpha = 0.05$ two-sided logrank test, then the number of failures (deaths) necessary is given by

$$d = \frac{4(1.96 + 1.28)^2}{(\log(2/3))^2} = 256 \quad \mbox{(always rounding up)}.$$

One strategy is to enter some larger number of patients, say 350 (about 175 on each treatment arm), and then continue following until we have 256 deaths.
Design Specification
More often in survival studies we need to be able to specify to the investigators the following:
1. number of patients;
2. accrual period;
3. follow-up time.
It was shown by Schoenfeld that reasonable approximations for obtaining the desired power can be made by ensuring that the total expected number of deaths (events) from both groups, computed under the alternative, equals (assuming equal probability of treatment assignment)

$$E(d) = \frac{4(z_{\alpha/2} + z_{\beta})^2}{(\beta_A)^2}.$$

Namely, we compute the expected number of deaths for groups 0 and 1 separately under the alternative hypothesis; the sum of these should equal the above formula.
Computing expected number of deaths when we have censoring
We only need to consider the one-sample problem here since expected number of deaths
needs to be computed separately for each treatment group.
Suppose $(X_i, \delta_i)$, $i = 1, 2, \ldots, n$, represents a sample of possibly censored survival data, with the usual kind of assumption we have been making, i.e.,
$$X_i = \min(T_i, C_i), \qquad \delta_i = I(T_i \le C_i).$$
T is the underlying survival time having density f(t), distribution function F(t), survival function S(t) and hazard function $\lambda(t)$. (We may want to subscript by T to denote that these functions refer to the survival time T, such as $\lambda_T(t)$.) C is the underlying censoring time having density g(t), distribution function G(t), survival function H(t) and hazard function $\mu(t)$.
The expected number of deaths is equal to $n \cdot P[\delta = 1]$.

From the derivation in Chapter 3, we know the density for the pair of random variables $(X, \delta)$:
$$f(x, \delta) = [f(x)]^{\delta}\,[S(x)]^{1-\delta}\,[g(x)]^{1-\delta}\,[H(x)]^{\delta}.$$
So
$$f(x, \delta = 1) = f(x)H(x) = \lambda(x)\,e^{-\Lambda(x)}\,e^{-M(x)},$$
where $\Lambda(t) = \int_0^t \lambda(u)\,du$ is the cumulative hazard for the survival time T and $M(t) = \int_0^t \mu(u)\,du$ is the cumulative hazard for the censoring time C. Therefore,
$$P[\delta = 1] = \int_0^{\infty} f(x, \delta = 1)\,dx = \int_0^{\infty} f(x)H(x)\,dx,$$
or integrating any of the above equivalent relationships.
Alternatively, the probability $P[\delta = 1]$ can be calculated in another way:
$$P[\delta = 1] = P[T \le C] = \iint_D f(t, c)\,dt\,dc \qquad (\text{here } D = \{(t, c) : t \le c\})$$
$$= \iint_D f(t)g(c)\,dt\,dc = \int_0^{\infty}\left[\int_t^{\infty} g(c)\,dc\right]f(t)\,dt = \int_0^{\infty} f(t)H(t)\,dt.$$
Example: Suppose T is exponential with hazard $\lambda$ and C is exponential with hazard $\mu$. Then
$$P[\delta = 1] = \int_0^{\infty} f(x)H(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,e^{-\mu x}\,dx = \lambda\int_0^{\infty} e^{-(\lambda + \mu)x}\,dx = \frac{\lambda}{\lambda + \mu}.$$
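As a numerical sanity check of this identity, one can approximate the integral directly; a sketch in a SAS data step (the hazard values are hypothetical):

data check;
   lambda = 0.2;  mu = 0.1;     /* hypothetical hazards for T and C */
   dx = 0.01;  p = 0;
   do x = dx/2 to 100 by dx;    /* midpoint rule on [0, 100]        */
      p + lambda*exp(-(lambda + mu)*x)*dx;
   end;
   exact = lambda/(lambda + mu);
   put p= 8.4 exact= 8.4;       /* both are approximately 0.6667    */
run;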
How to use these results for designing survival experiments
End of study censoring due to staggered entry: Suppose the only censoring we expect to see
in a clinical trial is due to incomplete follow-up resulting at the time of analysis, as illustrated
by Figure 4.4.
n patients enter the study at times $E_1, E_2, \ldots, E_n$, assumed to be independent and identically distributed (i.i.d.) with distribution function $Q_E(u) = P[E \le u]$. The censoring random variable, if there was no other loss to follow-up or competing risk, would be $C = L - E$, where L is the length of the study. Hence,
$$H_C(u) = P[L - E \ge u] = P[E \le L - u] = Q_E(L - u), \qquad u \in [0, L].$$

Figure 4.4: Censoring due to staggered entry (patient i enters at time $E_i$ on the time axis from 0 to L)
Therefore, for such an experiment, the expected number of deaths in a sample of size n would be equal to
$$nP[\delta = 1] = n\int_0^L \lambda_T(u)\,S_T(u)\,Q_E(L - u)\,du.$$
Example
Suppose the underlying survival of a population follows an exponential distribution. A study will accrue patients for A years, uniformly during that time, and then the analysis will be conducted after an additional F years of follow-up. What is the expected number of deaths for a sample of n patients?

Figure 4.5: Illustration of accrual and follow-up (accrual during [0, A]; follow-up until L = A + F)

The entry time follows a uniform distribution on [0, A]. That is,
$$Q_E(u) = P[E \le u] = \begin{cases} 0 & \text{if } u \le 0 \\ u/A & \text{if } 0 < u \le A \\ 1 & \text{if } u > A \end{cases}$$
Consequently,
$$H_C(u) = Q_E(L - u) = \begin{cases} 1 & \text{if } u \le L - A \\ \dfrac{L - u}{A} & \text{if } L - A < u \le L \\ 0 & \text{if } u > L \end{cases}$$
Hence,
$$P[\delta = 1] = \int_0^L \lambda_T(u)\,S_T(u)\,H_C(u)\,du$$
$$= \int_0^{L-A} \lambda e^{-\lambda u}\,du + \int_{L-A}^{L} \lambda e^{-\lambda u}\,\frac{L - u}{A}\,du$$
$$= \int_0^{L-A} \lambda e^{-\lambda u}\,du + \frac{L}{A}\int_{L-A}^{L} \lambda e^{-\lambda u}\,du - \frac{1}{A}\int_{L-A}^{L} \lambda u e^{-\lambda u}\,du.$$
After some straightforward algebra, we get
$$P[\delta = 1] = 1 - \frac{e^{-\lambda L}\left(e^{\lambda A} - 1\right)}{\lambda A}.$$
Therefore, if we accrue n patients uniformly over A years, who fail according to an exponential distribution with hazard $\lambda$, and follow them for an additional F years, then the expected number of deaths in the sample is
$$n\left[1 - \frac{e^{-\lambda L}\left(e^{\lambda A} - 1\right)}{\lambda A}\right].$$
Example of a designed experiment
The survival time for treatment 0 is assumed to follow an exponential distribution with median survival time equal to $m_0 = 4$ years (so the hazard rate is $\lambda_0 = \log(2)/m_0 = 0.173$). We hope the new treatment 1 will increase the median survival time to $m_1 = 6$ years (assuming an exponential distribution, $\lambda_1 = \log(2)/m_1 = 0.116$), which we want to have 90% power to detect using a (two-sided) logrank test at the 0.05 level of significance. The hazard ratio is 2/3 and $\beta_A = \log(2/3)$. The total number of deaths from both treatments must be equal to
$$d = \frac{4(1.96 + 1.28)^2}{(\log(2/3))^2} = 256.$$
Suppose we decide to accrue patients for A = 5 years and then follow them for an additional
F = 3 years, so L = A +F = 8 years. How large a sample size is necessary?
In a randomized trial where we randomize the patients to the two treatments with equal probability, the expected number of deaths would be equal to $D_1 + D_0$, where
$$D_j = \frac{n}{2}\left[1 - \frac{e^{-\lambda_j L}\left(e^{\lambda_j A} - 1\right)}{\lambda_j A}\right], \qquad j = 0, 1.$$
For our problem, the expected number of deaths is
$$D_0 + D_1 = \frac{n}{2}\left[1 - \frac{e^{-0.173 \times 8}\left(e^{0.173 \times 5} - 1\right)}{0.173 \times 5}\right] + \frac{n}{2}\left[1 - \frac{e^{-0.116 \times 8}\left(e^{0.116 \times 5} - 1\right)}{0.116 \times 5}\right]$$
$$= \frac{n}{2}\times 0.6017 + \frac{n}{2}\times 0.4642 = \frac{n}{2}\times 1.0659.$$
Thus, if we want the expected number of deaths to equal 256, then
$$\frac{n}{2}\times 1.0659 = 256 \quad \Longrightarrow \quad n = 480.$$
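This calculation is easy to script. The following SAS data step (names are ours) evaluates $P[\delta = 1]$ for each arm under uniform accrual and then solves for n:

data design;
   A = 5;  F = 3;  L = A + F;     /* accrual and follow-up (years)      */
   d = 256;                       /* required total number of deaths    */
   p0 = 1 - exp(-0.173*L)*(exp(0.173*A) - 1)/(0.173*A);  /* treatment 0 */
   p1 = 1 - exp(-0.116*L)*(exp(0.116*A) - 1)/(0.116*A);  /* treatment 1 */
   n  = round(2*d/(p0 + p1));     /* since E(deaths) = (n/2)(p0 + p1)   */
   put p0= 8.4 p1= 8.4 n=;        /* p0 = 0.6017, p1 = 0.4642, n = 480  */
run;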
We can, of course, experiment with different combinations of sample sizes, accrual periods and follow-up periods to find one that gives us the desired answer and best suits the needs of the experiment being conducted.
The above calculation for the sample size requires that we are able to get n = 480 patients within A = 5 years. If this is not the case, we will be underpowered to detect the difference of interest. In fact, the sample size n and the accrual period A are tied together by the accrual rate R (number of patients available per year) through
$$n = AR.$$
If we have information on R, the above calculation has to be modified accordingly.
Other issues that affect power and may have to be considered are
1. loss to follow-up
2. competing risks
3. non-compliance.
Remark: Originally, we introduced a class of weighted logrank tests to test $H_0: S_1(t) = S_0(t)$ for $t \ge 0$. The weighted logrank test with weight function w(t) is optimal to detect the following alternative hypothesis:
$$\lambda_1(t) = \lambda_0(t)\,e^{\beta_A w(t)},$$
or
$$\log\left[\frac{\lambda_1(t)}{\lambda_0(t)}\right] = \beta_A\,w(t); \qquad \beta_A \ne 0.$$
Hence, for proportional hazards, i.e., w(t) = 1, the logrank test is the most powerful.
K-sample weighted logrank test
Suppose we are interested in testing the null hypothesis that the survival distributions are the
same for K > 2 groups. For example, we may be evaluating K > 2 treatments in a randomized
clinical trial.
With right censoring, the data from such a clinical trial can be represented as $(X_i, \delta_i, Z_i)$, $i = 1, 2, \ldots, n$, where for the ith individual
$$X_i = \min(T_i, C_i), \qquad \delta_i = I(T_i \le C_i),$$
and $Z_i \in \{1, 2, \ldots, K\}$ corresponds to group membership in one of the K groups.
Denote by $S_j(t) = P[T_j \ge t]$ the survival distribution for the jth group, where $T_j$ is the survival time for this group. The null hypothesis of no treatment difference can be represented as
$$H_0: S_1(t) = S_2(t) = \cdots = S_K(t); \quad t \ge 0,$$
or equivalently,
$$H_0: \lambda_1(t) = \lambda_2(t) = \cdots = \lambda_K(t); \quad t \ge 0,$$
where $\lambda_j(t)$ is the hazard function for group j.
No assumptions will be made regarding the distribution of the censoring time C or its relation-
ship to Z. However, we will need to assume that conditional on Z, censoring is non-informative;
i.e., that T and C are conditionally independent given Z.
The test of the null hypothesis will be a direct generalization of the two-sample weighted logrank tests. Towards that end, we define the following quantities for a grid point x:

$dN_j(x)$ = number of observed deaths at time x (i.e., in $[x, x + \Delta x)$) from group $j = 1, 2, \ldots, K$;

$Y_j(x)$ = number at risk at time x from group j;

$dN(x) = \sum_{j=1}^{K} dN_j(x)$, the total number of observed deaths at time x;

$Y(x) = \sum_{j=1}^{K} Y_j(x)$, the total number at risk at time x;

$\mathcal{F}(x) = \{dN_j(u), Y_j(u);\ j = 1, 2, \ldots, K, \text{ for all grid points } u < x, \text{ and } dN(x)\}.$

That is, $\mathcal{F}(x)$ is the information available at time x.
At a slice of time $[x, x + \Delta x)$, the data can be viewed as the $K \times 2$ contingency table shown in Table 4.2.

Table 4.2: $K \times 2$ table from $[x, x + \Delta x)$

                                 Treatments
               1                    2            ...   K                    total
# of deaths    $dN_1(x)$            $dN_2(x)$    ...   $dN_K(x)$            $dN(x)$
# alive        $Y_1(x) - dN_1(x)$   $Y_2(x) - dN_2(x)$ ... $Y_K(x) - dN_K(x)$   $Y(x) - dN(x)$
# at risk      $Y_1(x)$             $Y_2(x)$     ...   $Y_K(x)$             $Y(x)$
We now consider a vector of the observed number of deaths minus the expected number of deaths under the null hypothesis for each treatment group $j = 1, 2, \ldots, K$:
$$\begin{pmatrix} dN_1(x) - \dfrac{Y_1(x)\,dN(x)}{Y(x)} \\ dN_2(x) - \dfrac{Y_2(x)\,dN(x)}{Y(x)} \\ \vdots \\ dN_K(x) - \dfrac{Y_K(x)\,dN(x)}{Y(x)} \end{pmatrix}_{K \times 1}.$$
Note: The sum of the elements in this vector is equal to zero, which means one element is
redundant.
If we condition on $\mathcal{F}(x)$, then we know the marginal counts of this $K \times 2$ table, in which case the vector $(dN_1(x), dN_2(x), \ldots, dN_K(x))^T$ is distributed as a multivariate version of a hypergeometric distribution.
In particular, conditional on $\mathcal{F}(x)$, we know the following conditional means, variances and covariances:
$$E[dN_j(x) \mid \mathcal{F}(x)] = \frac{Y_j(x)\,dN(x)}{Y(x)}, \quad j = 1, 2, \ldots, K,$$
$$\mathrm{Var}[dN_j(x) \mid \mathcal{F}(x)] = \frac{dN(x)\,[Y(x) - dN(x)]\,Y_j(x)\,[Y(x) - Y_j(x)]}{Y^2(x)\,[Y(x) - 1]},$$
$$\mathrm{Cov}[dN_j(x), dN_{j'}(x) \mid \mathcal{F}(x)] = -\frac{dN(x)\,[Y(x) - dN(x)]\,Y_j(x)\,Y_{j'}(x)}{Y^2(x)\,[Y(x) - 1]}, \quad j \ne j'.$$
Consider the $(K-1)$-dimensional vector $U(w)$, made up of the weighted sums of observed minus expected deaths in groups $j = 1, 2, \ldots, K-1$, summed over time x:
$$U(w) = \begin{pmatrix} \sum_x w(x)\left[dN_1(x) - \frac{Y_1(x)\,dN(x)}{Y(x)}\right] \\ \sum_x w(x)\left[dN_2(x) - \frac{Y_2(x)\,dN(x)}{Y(x)}\right] \\ \vdots \\ \sum_x w(x)\left[dN_{K-1}(x) - \frac{Y_{K-1}(x)\,dN(x)}{Y(x)}\right] \end{pmatrix}.$$
Note: We take the (K 1) dimensional vector since the sum of all K elements is equal to
zero and hence we have redundancy. If we included all K elements then the resulting vector
would have a singular variance matrix.
Using arguments similar to those for the two-sample test, we can show that the vectors of observed minus expected counts computed at different times x and $x'$ are uncorrelated.
Consequently, the corresponding $(K-1) \times (K-1)$ covariance matrix of the vector $U(w)$ is given by
$$V = [V_{jj'}], \quad j, j' = 1, 2, \ldots, K-1,$$
where
$$V_{jj} = \sum_x w^2(x)\left\{\frac{dN(x)\,[Y(x) - dN(x)]\,Y_j(x)\,[Y(x) - Y_j(x)]}{Y^2(x)\,[Y(x) - 1]}\right\},$$
and
$$V_{jj'} = -\sum_x w^2(x)\left\{\frac{dN(x)\,[Y(x) - dN(x)]\,Y_j(x)\,Y_{j'}(x)}{Y^2(x)\,[Y(x) - 1]}\right\} \quad \text{for } j \ne j' = 1, 2, \ldots, K-1.$$
The test statistic used to test the null hypothesis is given by the quadratic form
$$T(w) = [U(w)]^T\,V^{-1}\,U(w).$$
Note: This statistic is numerically identical regardless of which $K-1$ of the K groups are included to avoid the redundancy.
Under $H_0$, this statistic is distributed asymptotically as a $\chi^2$ distribution with $(K-1)$ degrees of freedom. Hence, a level $\alpha$ test rejects the null hypothesis whenever
$$T(w) = [U(w)]^T\,V^{-1}\,U(w) \ge \chi^2_{\alpha;\,K-1},$$
where $\chi^2_{\alpha;\,K-1}$ is the value satisfying $P[\chi^2_{K-1} \ge \chi^2_{\alpha;\,K-1}] = \alpha$.
Remark: As with the two-sample tests, if the weight function w(x) is stochastic, then it must be a function of the survival and censoring data prior to time x.

The most popular test uses the weight $w(x) \equiv 1$ and is referred to as the K-sample logrank test. These tests are available in most major software packages such as SAS, S-PLUS, etc. For example, the SAS code is exactly the same as that for the two-sample tests, as the sketch below shows.
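For instance, assuming a data set threearm with survival time time, censoring indicator status (0 = censored) and a group variable trt with K = 3 levels (all names are ours), the K-sample logrank and Wilcoxon tests are requested exactly as in the two-sample case:

proc lifetest data=threearm;
   time time*status(0);
   strata trt;   /* trt has K = 3 levels; the rank tests have K - 1 = 2 df */
run;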
Stratified logrank test
When comparing survival distributions among groups, especially in non-randomized studies, we may be concerned about the confounding effects that other factors may have on the interpretation of the relationship between survival and groups. For example, suppose we extract hospital records to obtain information on patients who were treated after a myocardial infarction (heart attack) with either bypass surgery or angioplasty. We wish to study subsequent survival and test whether or not there is a difference in the survival distributions between these treatments.

If we believe that these two groups of patients are comparable, we might test treatment equality using a logrank test or weighted logrank test. However, since this study was not randomized, there is no guarantee that the patients being compared are prognostically similar. For example, it may be that the group of patients receiving angioplasty is younger on average or prognostically better in other ways.

If this were the case, then we wouldn't know whether significant differences between treatment groups, if they occurred, were due to treatment or to other prognostic factors. Alternatively, the treatments may truly have different effects, but the difference was masked by other factors that were distributed unevenly between the treatment groups.
In such cases, we may want to adjust for the effect of these prognostic factors either through stratification or through regression modeling. Regression modeling will be discussed later in much greater detail. To adjust by stratification, we define strata of our population according to combinations of factors which make individuals within each stratum more prognostically similar. Comparisons of survival distributions between groups are made within each stratum, and then these results are combined across the strata.
In clinical trials, the use of stratified tests may also be important even though balance of prognostic factors across comparison groups is obtained by randomization. Use of permuted block randomization, as well as other treatment allocation schemes which balance treatment groups within strata more than would be expected by chance alone, may affect the statistical properties of the usual two- and K-sample tests.

If the strata are prognostic, this enforced balance may cause the treatment (group) difference to be less variable than would be expected by chance, since the groups are more alike than would have been obtained by chance alone. Less variability is a desirable property if we can take advantage of it. The statistical tests developed so far (i.e., two-sample and K-sample weighted logrank tests) have distributional theory developed under the assumption of simple randomness, and consequently may lead to inference that is conservative when applied to clinical trials which used treatment balancing methods within strata. A simple remedy to this problem is to use stratified tests.
Think of the population being sampled as consisting of p strata. The strata, for example, could be those used in balanced randomization of a clinical trial, or combinations of factors that make individuals within each stratum prognostically similar. For example, consider the four strata created by the combination of sex categories (M, F) and age categories ($\le 50$, $> 50$): [(M, $\le 50$), (M, $> 50$), (F, $\le 50$), (F, $> 50$)].
Consider two-sample comparisons, say, treatments 0 and 1, and let j index the strata, $j = 1, 2, \ldots, p$. The null hypothesis being tested in a stratified test is
$$H_0: S_{1j}(t) = S_{0j}(t), \quad t \ge 0,\ j = 1, 2, \ldots, p.$$
That is, the survival distributions from the two treatments are the same within each of the strata. The stratified logrank test consists of computing the two-sample test statistic within each stratum and then combining these results across strata. For example,
$$T(w) = \frac{\displaystyle\sum_{j=1}^{p}\left\{\sum_x w_j(x)\left[dN_{1j}(x) - \frac{dN_j(x)\,Y_{1j}(x)}{Y_j(x)}\right]\right\}}{\left(\displaystyle\sum_{j=1}^{p}\left\{\sum_x w_j^2(x)\,\frac{Y_{1j}(x)\,Y_{0j}(x)\,dN_j(x)\,[Y_j(x) - dN_j(x)]}{Y_j^2(x)\,[Y_j(x) - 1]}\right\}\right)^{1/2}}.$$
Note: Here j indexes the strata. In the previous section of the notes, j indexed treatment
for more than two treatments.
Since within each of the strata there was no additional balance being forced between the two groups beyond chance (or if we believe the two groups are prognostically similar within each stratum, other than the treatment group being compared), the means and variances of the test statistics computed within strata under the null hypothesis are correct. Combining the statistics and their variances over independent strata is then also correct. The resulting stratified logrank test has a standard normal distribution (asymptotically) under the null hypothesis, i.e.,
$$T(w) \stackrel{a}{\sim} N(0, 1), \quad \text{or} \quad [T(w)]^2 \stackrel{a}{\sim} \chi^2_1.$$
Remark:

1. Stratified tests can be constructed for K samples as well. You just add the vectors of test statistics over strata, as well as the covariance matrices, before you compute the quadratic form leading to the $\chi^2$ statistic with $(K-1)$ degrees of freedom.
2. Sample size considerations are similar to those for the unstratified tests. Power depends on the number of observed deaths and the hazard ratio between groups within strata. For example, the stratified logrank test with $w_j(x) \equiv 1$ for all x and j is most powerful for detecting proportional hazards alternatives within strata, where the hazard ratio is also assumed constant across strata. Namely,
$$H_A: \lambda_{1j}(x) = \lambda_{0j}(x)\exp(\beta_A).$$
The total number of deaths in the study necessary to obtain power $(1-\beta)$ for detecting a difference corresponding to $\beta_A$ above, using a stratified logrank test at the $\alpha$ level of significance (two-sided), is equal to
$$d = \frac{4(z_{\alpha/2} + z_{\beta})^2}{\beta_A^2}.$$
This assumes equal randomization to the two treatments and is the same value as that obtained for unstratified tests. To compute the expected number of deaths at the design stage, we must compute it separately over treatments and strata, and these pieces should add up to the desired number above.
Myelomatosis data revisited: When we analyzed the myelomatosis data on page 54, we found that the two treatments do not differ in prolonging patients' survival time. One may argue that we did not see a treatment effect because the patients assigned to the different treatment arms do not have the same renal condition (on average), so we perform stratified tests using the following SAS program:
proc lifetest data=myel;
time dur*status(0);
strata renal;
test trt;
run;
Part of the output from this program is as follows
4
16:15 Wednesday, February 9, 2000
The LIFETEST Procedure
Rank Tests for the Association of DUR with Covariates
Pooled over Strata
Univariate Chi-Squares for the WILCOXON Test
Test Standard Pr >
Variable Statistic Deviation Chi-Square Chi-Square
TRT -2.6352 1.2963 4.1324 0.0421
Covariance Matrix for the WILCOXON Statistics
Variable TRT
TRT 1.68039
Forward Stepwise Sequence of Chi-Squares for the WILCOXON Test
Pr > Chi-Square Pr >
Variable DF Chi-Square Chi-Square Increment Increment
TRT 1 4.1324 0.0421 4.1324 0.0421
Univariate Chi-Squares for the LOG RANK Test
Test Standard Pr >
Variable Statistic Deviation Chi-Square Chi-Square
TRT -4.4306 1.8412 5.7908 0.0161
Covariance Matrix for the LOG RANK Statistics
Variable TRT
TRT 3.38990
Forward Stepwise Sequence of Chi-Squares for the LOG RANK Test
Pr > Chi-Square Pr >
Variable DF Chi-Square Chi-Square Increment Increment
TRT 1 5.7908 0.0161 5.7908 0.0161
This result tells us that after adjusting for the renal effect, treatment 1 is (statistically) significantly better than treatment 2 by either the logrank test or the Wilcoxon test. But we have to be very cautious in interpreting this result. If the patients were stratified into two different blocks based on their renal function before randomization and then treatments were randomly assigned to patients within each block, then we should adjust for any possible renal effect in identifying the treatment effect. If this was not the case, then it is hard for people to accept the renal-adjusted treatment effect. Also, due to the small sample size, a small imbalance in renal function (treatment 1 has 4 out of 12 patients with impaired renal function, while treatment 2 only has 3 out of 13 patients with impaired renal function) may have a significant impact on the final result. But this secondary analysis may give us some insight about the true treatment effect.
Note: If the number of treatments in a stratified test is greater than 2, we need to define indicator variables and put them in the test statement in Proc Lifetest, as in the sketch below.
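For example, with a hypothetical three-treatment version of the myelomatosis data (the data set and variable names are ours), one possibility is:

data myel3b;
   set myel3;            /* hypothetical data with trt = 1, 2, 3      */
   trt1 = (trt = 1);     /* indicator for treatment 1                 */
   trt2 = (trt = 2);     /* indicator for treatment 2; 3 = reference  */
run;

proc lifetest data=myel3b;
   time dur*status(0);
   strata renal;
   test trt1 trt2;       /* stratified K-sample (K = 3) rank tests    */
run;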
5 Modeling Survival Data with Parametric Regression
Models
5.1 The Accelerated Failure Time Model
Before talking about parametric regression models for survival data, let us introduce the accelerated failure time (AFT) model. Denote by $S_1(t)$ and $S_2(t)$ the survival functions of two populations. The AFT model says that there is a constant $c > 0$ such that
$$S_1(t) = S_2(ct) \quad \text{for all } t \ge 0. \qquad (5.1)$$
This model implies that the aging rate of population 1 is c times that of population 2. (For example, if $S_1(t)$ is the survival function for the dog population and $S_2(t)$ is the survival function for the human population, then the conventional wisdom that a year for a dog is equivalent to 7 years for a human implies $c = 7$, and $S_1(t) = S_2(7t)$. So the probability that a dog survives 10 years or beyond is the same as the probability that a human subject survives 70 years or beyond.)
Let $\mu_i$ be the mean survival time for population i, and let $\xi_i$ be the population quantiles such that $S_i(\xi_i) = \theta$ for some $\theta \in (0, 1)$. Then
$$\mu_2 = \int_0^{\infty} S_2(t)\,dt = c\int_0^{\infty} S_2(cu)\,du \quad (t = cu) = c\int_0^{\infty} S_1(u)\,du = c\,\mu_1,$$
and
$$S_2(\xi_2) = \theta = S_1(\xi_1) = S_2(c\,\xi_1).$$
Assume that $S_2(t)$ is a strictly decreasing function. Then we have
$$\xi_2 = c\,\xi_1.$$
This simple argument tells us that under the accelerated failure time model (5.1), the expected survival time and the median (or any quantile of the) survival time of population 2 are all c times those of population 1.
Suppose we have a sample of size n from a target population. For subject i ($i = 1, 2, \ldots, n$), we have observed values of covariates $z_{i1}, z_{i2}, \ldots, z_{ip}$ and a possibly censored survival time $T_i$. The procedure Proc Lifereg in SAS fits models to data specified by the following equation:
$$\log(T_i) = \beta_0 + \beta_1 z_{i1} + \cdots + \beta_p z_{ip} + \sigma\varepsilon_i, \qquad (5.2)$$
where $\beta_0, \ldots, \beta_p$ are the regression coefficients of interest, $\sigma$ is a scale parameter and the $\varepsilon_i$ are random disturbance terms, usually assumed to be independent and identically distributed with some density function $f(\varepsilon)$. The reason why we take the logarithm of $T_i$ is obvious considering the fact that the survival times are always positive (with probability 1).
Equation (5.2) is very similar to a linear regression model for the log-transformed response variable $Y_i = \log(T_i)$. In a linear regression, the random error term $e_i$ is usually assumed to be i.i.d. from $N(0, \sigma^2)$, so that $e_i$ can be written as $e_i = \sigma\varepsilon_i$, where the $\varepsilon_i$ are i.i.d. from $N(0, 1)$ in this case.
At this moment, let us see how the regression coefficients in model (5.2) can be interpreted in general. We will investigate their interpretation more closely later when we consider more specific models (i.e., with different distributional assumptions for $\varepsilon_i$). For this purpose, let us consider $\beta_k$ ($k = 1, \ldots, p$). Holding the other covariate values fixed, let us increase the covariate $z_k$ by one unit from $z_k$ to $z_k + 1$, and denote by $T_1$ and $T_2$ the corresponding survival times for the two populations with covariate values $z_k$ and $z_k + 1$ (other covariate values being fixed). Then $T_1$ and $T_2$ can be expressed as
$$T_1 = e^{\beta_0 + \beta_1 z_1 + \cdots + \beta_k z_k + \cdots + \beta_p z_p}\,e^{\sigma\varepsilon_1} = c_1 e^{\sigma\varepsilon_1},$$
$$T_2 = e^{\beta_0 + \beta_1 z_1 + \cdots + \beta_k (z_k + 1) + \cdots + \beta_p z_p}\,e^{\sigma\varepsilon_2} = c_2 e^{\sigma\varepsilon_2},$$
where $c_1$ and $c_2$ are two constants related by $c_2 = c_1 e^{\beta_k}$. The corresponding survival functions are
$$S_1(t) = P[T_1 \ge t] = P[c_1 e^{\sigma\varepsilon_1} \ge t] = P[e^{\sigma\varepsilon_1} \ge c_1^{-1}t],$$
$$S_2(t) = P[T_2 \ge t] = P[c_2 e^{\sigma\varepsilon_2} \ge t] = P[e^{\sigma\varepsilon_2} \ge c_2^{-1}t].$$
Since $\varepsilon_1$ and $\varepsilon_2$ have the same distribution, and $c_2 = c_1 e^{\beta_k}$, we have
$$S_2(e^{\beta_k}t) = P[e^{\sigma\varepsilon_2} \ge c_2^{-1}e^{\beta_k}t] = P[e^{\sigma\varepsilon_2} \ge c_1^{-1}t] = P[e^{\sigma\varepsilon_1} \ge c_1^{-1}t] = S_1(t).$$
Therefore, we have an accelerated failure time model between population 1 (covariate value $z_k$) and population 2 (covariate value $z_k + 1$) with $c = e^{\beta_k}$. So if we increase the covariate value of $z_k$ by one unit while holding the other covariate values unchanged, the corresponding average survival times $\mu_2$ and $\mu_1$ will be related by
$$\mu_2 = e^{\beta_k}\mu_1.$$
If $\beta_k$ is small, then
$$\frac{\mu_2}{\mu_1} = e^{\beta_k} \approx 1 + \beta_k.$$
Similarly, for the population quantiles $\xi_i$ we have
$$\frac{\xi_2}{\xi_1} = e^{\beta_k} \approx 1 + \beta_k.$$
Therefore, when $\beta_k$ is small, it can be interpreted as the percentage increase (if $\beta_k > 0$) or percentage decrease (if $\beta_k < 0$) in the average survival time and/or median survival time when we increase the covariate value of $z_k$ by one unit. Thus a greater value of a covariate with positive $\beta_k$ is more beneficial in improving survival time for the target population. This interpretation of $\beta_k$ is very similar to that in a linear regression model.
5.2 Some Popular AFT Models
We can assume different distributions for the disturbance term $\varepsilon_i$ in model (5.2). For example, we can assume $\varepsilon_i \stackrel{\text{i.i.d.}}{\sim} N(0, 1)$. This assumption is equivalent to assuming that $T_i$ has a log-normal distribution (of course, conditional on the covariates z's). In this section, we will introduce some popular parametric models for $T_i$ (equivalently, for $\varepsilon_i$). The following table gives some of these distributions:
Distribution of $\varepsilon$       Distribution of T     Syntax in Proc Lifereg
extreme value (2 par.)              Weibull               dist = weibull
extreme value (1 par.)              exponential           dist = exponential
log-gamma                           gamma                 dist = gamma
logistic                            log-logistic          dist = llogistic
normal                              log-normal            dist = lnormal
In Proc Lifereg of SAS, all models are named for the distribution of T rather than the distribution of $\varepsilon$. Although the models fitted by Proc Lifereg are all AFT models (so the regression coefficients have a unified interpretation), different distributions assume different shapes for the hazard function.
The exponential model
The simplest model is the exponential model, where T at $z = 0$ (usually referred to as the baseline) has an exponential distribution with constant hazard $\exp(-\beta_0)$. This is equivalent to assuming that $\sigma = 1$ and $\varepsilon$ has the standard extreme value distribution
$$f(\varepsilon) = e^{\varepsilon - e^{\varepsilon}},$$
which has the density function shown in Figure 5.1. (So $e^{\varepsilon}$ has the standard exponential distribution with constant hazard 1.)
From this specification, it is easy to see that the distribution of T at any covariate vector z is exponential with constant hazard (independent of t)
$$\lambda(t \mid z) = e^{-\beta_0 - \beta_1 z_1 - \cdots - \beta_p z_p}.$$
Figure 5.1: The density function of the standard extreme value distribution
So automatically, we get a proportional hazards model. For a given set of covariates $(z_1, z_2, \ldots, z_p)$, the corresponding survival function is
$$S(t \mid z) = e^{-\lambda(t \mid z)\,t},$$
where $\lambda(t \mid z) = e^{-\beta_0 - \beta_1 z_1 - \cdots - \beta_p z_p}$. Let $\beta_j^* = -\beta_j$. Then, equivalently,
$$\lambda(t \mid z) = e^{\beta_0^* + \beta_1^* z_1 + \cdots + \beta_p^* z_p}.$$
Therefore, if we increase the value of covariate $z_k$ ($k = 1, \ldots, p$) by one unit from $z_k$ to $z_k + 1$ while holding the other covariate values fixed, then the ratio of the corresponding hazards is equal to
$$\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)} = e^{\beta_k^*}.$$
Thus $e^{\beta_k^*}$ can be interpreted as the hazard ratio corresponding to a one-unit increase in the covariate $z_k$; equivalently, $\beta_k^*$ can be interpreted as the increase in log-hazard as the value of covariate $z_k$ increases by one unit (while the other covariate values are held constant).
Note: Another SAS procedure, Proc Phreg, fits a proportional hazards model to the data and outputs the regression coefficient estimates in log-hazard form (i.e., in $\beta_k^*$). Therefore, if an exponential model fits the data well (Proc Phreg will also fit the data well in this case), then the regression coefficient estimates in the outputs from Proc Lifereg (using dist=exponential) and Proc Phreg should be just opposite to each other (opposite sign but almost the same absolute value). We should be able to shift back and forth between these two models.
Example (Autologous and Allogeneic Bone Marrow Transplants for Hodgkin's and Non-Hodgkin's Lymphoma, pages 11-12 of the textbook): Data on 43 bone marrow transplant patients were collected. Patients had either Hodgkin's disease or Non-Hodgkin's Lymphoma, and were given either an allogeneic (Allo) transplant (from an HLA-matched sibling donor) or an autologous (Auto) transplant (their own marrow was cleansed and returned to them after a high dose of chemotherapy). Other covariates are the Karnofsky score (a subjective measure of how well the patient is doing, ranging from 0-100) and the waiting time (in months) from diagnosis to transplant. It is of substantial interest to see the difference in leukemia-free survival (in days) between those patients given an Allo or Auto transplant, after adjusting for the patient's disease status, Karnofsky score and waiting time. The data were given in Table 1.5 of the textbook. We used the following SAS program to fit an exponential model to the data:
title "Exponential fit";
proc lifereg data=bone;
model time*status(0) = allo hodgkins kscore wtime / dist=exponential;
run;
where allo=1 for allogeneic transplant and allo=0 for autologous transplant, hodgkins=1 for
Hodgkins disease and hodgkins=0 for Non-Hodgkins Lymphoma, kscore is the Karnofsky score
and wtime is the waiting time. We got the following output:
Exponential fit 1
17:35 Tuesday, March 1, 2005
The LIFEREG Procedure
Model Information
Data Set WORK.BONE
Dependent Variable Log(time)
Censoring Variable status
Censoring Value(s) 0
Number of Observations 43
Noncensored Values 26
Right Censored Values 17
Left Censored Values 0
Interval Censored Values 0
Name of Distribution Exponential
Log Likelihood -62.49090652
Algorithm converged.
Type III Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
allo 1 0.0837 0.7723
hodgkins 1 6.2467 0.0124
kscore 1 64.8976 <.0001
wtime 1 1.6610 0.1975
Analysis of Parameter Estimates
Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 0.6834 0.6408 -0.5726 1.9394 1.14 0.2862
allo 1 0.1333 0.4607 -0.7697 1.0363 0.08 0.7723
hodgkins 1 -1.3185 0.5275 -2.3524 -0.2845 6.25 0.0124
kscore 1 0.0758 0.0094 0.0574 0.0942 64.90 <.0001
wtime 1 0.0093 0.0072 -0.0049 0.0235 1.66 0.1975
Scale 0 1.0000 0.0000 1.0000 1.0000
Weibull Shape 0 1.0000 0.0000 1.0000 1.0000
Lagrange Multiplier Statistics
Parameter Chi-Square Pr > ChiSq
Scale 1.9089 0.1671
According to this model, the allogeneic transplant is slightly better than the autologous transplant after adjusting for disease status, Karnofsky score and waiting time. Hodgkin's patients did worse than Non-Hodgkin's patients (the average disease-free survival time for Hodgkin's patients is only exp(-1.3185) = 0.27 times that of the Non-Hodgkin's patients). The patients with higher Karnofsky scores have better survival (with a one-point increase in Karnofsky score, the patient's average survival time will increase by about 7%). Waiting time has no significant effect on the disease-free survival.
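To see the sign-flip relationship with Proc Phreg noted above, we can fit the corresponding proportional hazards model to the same data; a minimal sketch:

proc phreg data=bone;
   model time*status(0) = allo hodgkins kscore wtime;
run;

If the exponential model fits well, the coefficient for kscore from Proc Phreg should be close to -0.0758, the negative of the Lifereg estimate above.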
The Weibull model
The only difference between the Weibull model and the exponential model is that the scale parameter $\sigma$ is estimated rather than being set to one. In this case, the distribution of $\sigma\varepsilon$ is an extreme value distribution with scale parameter $\sigma$. The survival function of T at covariate value $z = (1, z_1, \ldots, z_p)^T$ can be shown to be
$$S(t \mid z) = \exp\left[-\left(te^{-z^T\beta}\right)^{1/\sigma}\right],$$
where $\beta = (\beta_0, \ldots, \beta_p)^T$ is the vector of regression coefficients. Equivalently, in terms of the (log-)hazard function,
$$\log\lambda(t \mid z) = \left(\frac{1}{\sigma} - 1\right)\log t - \log\sigma - z^T(\beta/\sigma).$$
Let $\alpha = 1/\sigma$, $\beta_0^* = -\log\sigma - \beta_0/\sigma$, and $\beta_j^* = -\beta_j/\sigma$ for $j = 1, \ldots, p$ (again, pay close attention to the negative sign). Then we have
$$\log\lambda(t \mid z) = (\alpha - 1)\log t + \beta_0^* + z_1\beta_1^* + \cdots + z_p\beta_p^*.$$
Thus we also get a proportional hazards model, and the coefficient $\beta_k^*$ ($k = 1, \ldots, p$) again has the interpretation that it is the increase in log-hazard when the value of covariate $z_k$ increases by one unit while the other covariate values are held unchanged. The function
$$\lambda_0(t) = t^{\alpha - 1}e^{\beta_0^*} = \alpha t^{\alpha - 1}e^{-\beta_0/\sigma}$$
is the baseline hazard (i.e., the hazard when $z = 0$).
Note: If the Weibull model is a reasonable model for your data and you use Proc Lifereg and Proc Phreg to fit the data, then the regression coefficient estimates not only have opposite signs (except possibly for the intercept) but also have different magnitudes (depending on whether $\sigma > 1$ or $\sigma < 1$).
Since $\beta_k^* = -\beta_k/\sigma$ for $k = 1, \ldots, p$, testing $H_0: \beta_k^* = 0$ is equivalent to testing $H_0: \beta_k = 0$. If we are interested in calculating the standard error of the estimate of $\beta_k^*$ and constructing a confidence interval for $\beta_k^*$, we can use the delta method for this purpose.
Example (Bone marrow transplant data revisited) We used the following SAS program to fit a Weibull model to the bone marrow transplant data:
title "Weibull fit";
proc lifereg data=bone;
model time*status(0) = allo hodgkins kscore wtime / dist=weibull;
run;
and got the following output:
Weibull fit 2
17:35 Tuesday, March 1, 2005
The LIFEREG Procedure
Model Information
Data Set WORK.BONE
Dependent Variable Log(time)
Censoring Variable status
Censoring Value(s) 0
Number of Observations 43
Noncensored Values 26
Right Censored Values 17
Left Censored Values 0
Interval Censored Values 0
Name of Distribution Weibull
Log Likelihood -61.21034611
Algorithm converged.
Type III Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
allo 1 0.1351 0.7132
hodgkins 1 4.5212 0.0335
kscore 1 43.2179 <.0001
wtime 1 1.2210 0.2692
Analysis of Parameter Estimates
Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 0.4258 0.8463 -1.2329 2.0845 0.25 0.6148
allo 1 0.2080 0.5659 -0.9012 1.3172 0.14 0.7132
hodgkins 1 -1.3746 0.6465 -2.6417 -0.1075 4.52 0.0335
kscore 1 0.0793 0.0121 0.0557 0.1029 43.22 <.0001
wtime 1 0.0104 0.0094 -0.0081 0.0289 1.22 0.2692
Scale 1 1.2733 0.2044 0.9297 1.7440
Weibull Shape 1 0.7854 0.1260 0.5734 1.0757
If we think that the Weibull model is a reasonable one, then the likelihood ratio test statistic is 2*(-61.21034611 - (-62.49090652)) = 2.56 with p-value = 0.1096, which is not strong evidence against the exponential model. From this model, we see similar results for the transplant methods.
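To put the Weibull fit on the proportional hazards (log-hazard) scale via $\hat\beta_k^* = -\hat\beta_k/\hat\sigma$, a small data step suffices (the point estimates are copied from the output above; standard errors would additionally require the delta method with the fitted covariance matrix):

data phscale;
   sigma = 1.2733;              /* Weibull scale estimate from the output */
   array b{4}  b1-b4 (0.2080 -1.3746 0.0793 0.0104);  /* allo, hodgkins,
                                                         kscore, wtime    */
   array bs{4} bs1-bs4;
   do k = 1 to 4;
      bs{k} = -b{k}/sigma;      /* log hazard ratio scale                 */
   end;
   put bs1= 8.4 bs2= 8.4 bs3= 8.4 bs4= 8.4;   /* e.g., bs3 = -0.0623      */
run;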
The log-normal model
The log-normal model simply assumes that $\varepsilon \sim N(0, 1)$. Let $\lambda_0(t)$ be the hazard function of T when $\beta = 0$ ($\beta_0 = \beta_1 = \cdots = \beta_p = 0$). Then it can be shown that $\lambda_0(t)$ has the following functional form:
$$\lambda_0(t) = \frac{\phi\left(\frac{\log t}{\sigma}\right)}{\left[1 - \Phi\left(\frac{\log t}{\sigma}\right)\right]\sigma t},$$
where $\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ is the probability density function and $\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du$ is the cumulative distribution function of the standard normal distribution. Then the log-hazard function of T at any covariate value z can be expressed as
$$\log\lambda(t \mid z) = \log\lambda_0\left(te^{-z^T\beta}\right) - z^T\beta. \qquad (5.3)$$
Obviously we no longer have a proportional hazards model.
Note: The function $\lambda_0(t)$ is not the baseline hazard function. If such a function is desired, it can be obtained from equation (5.3) by setting $z = 0$.
Some typical patterns that the hazard function $\lambda_0(t)$ assumes are presented in Figure 5.2. The inverted U-shape of the log-normal hazard is often appropriate for repeated events such as a residential move (i.e., when the interest is the time to the next move). Immediately after a move, the hazard for another move is likely to be low; it then increases with time, and eventually begins to decline since people tend not to move as they get older.
The survival function $S(t \mid z)$ at any covariate value z can be expressed as
$$\Phi^{-1}[S(t \mid z)] = \beta_0^* + \beta_1^* z_1 + \cdots + \beta_p^* z_p - \alpha\log(t), \qquad (5.4)$$
Figure 5.2: Typical hazard functions for a log-normal model ($\sigma$ = 0.5, 1, 1.5)
or equivalently,
$$S(t \mid z) = \Phi\left[\beta_0^* + \beta_1^* z_1 + \cdots + \beta_p^* z_p - \alpha\log(t)\right],$$
where $\alpha = 1/\sigma$ and $\beta_j^* = \beta_j/\sigma$ for $j = 0, 1, \ldots, p$. This is a probit regression model with an intercept that depends on t.
Note: Equation (5.4) indicates that the coefficients $\beta_j^*$ can be estimated using Proc Logistic or Proc Genmod by specifying the probit link function. Specifically, pick a time point of interest, say $t_0$, and then dichotomize each subject based on his/her survival status at $t_0$ (in this case the term $-\alpha\log(t_0)$ is absorbed into the intercept). Of course, there are some limitations to this approach. First, there should not be censoring prior to time $t_0$. Second, the scale parameter is not estimable. Third, we will lose efficiency since we do not use all the information on the exact timing of the events. Since the normal distribution and the logistic distribution we will introduce soon behave similarly to each other, the parameters $\beta_k^*$ here have a similar interpretation to the parameters in the log-logistic model.
Example (Bone marrow transplant data revisited) If we want to fit a log-normal model to the bone marrow transplant data, we use the following SAS program:
title "Log-normal fit";
proc lifereg data=bone;
model time*status(0) = allo hodgkins kscore wtime / dist=lnormal;
run;
The following output is from the above program:
Log-normal fit 3
17:35 Tuesday, March 1, 2005
The LIFEREG Procedure
Model Information
Data Set WORK.BONE
Dependent Variable Log(time)
Censoring Variable status
Censoring Value(s) 0
Number of Observations 43
Noncensored Values 26
Right Censored Values 17
Left Censored Values 0
Interval Censored Values 0
Name of Distribution Lognormal
Log Likelihood -60.93151115
Algorithm converged.
Type III Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
allo 1 0.3556 0.5509
hodgkins 1 3.6277 0.0568
kscore 1 21.5358 <.0001
wtime 1 1.5309 0.2160
Analysis of Parameter Estimates
Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 0.5064 1.1563 -1.7600 2.7727 0.19 0.6614
allo 1 0.3520 0.5902 -0.8048 1.5088 0.36 0.5509
hodgkins 1 -1.3137 0.6898 -2.6656 0.0382 3.63 0.0568
kscore 1 0.0663 0.0143 0.0383 0.0944 21.54 <.0001
wtime 1 0.0142 0.0115 -0.0083 0.0367 1.53 0.2160
Scale 1 1.6296 0.2441 1.2149 2.1857
The results from this model are quite similar to the results from other models.
The Log-Logistic Model
The log-logistic model assumes that the disturbance term $\varepsilon$ has a standard logistic distribution
$$f(\varepsilon) = \frac{e^{\varepsilon}}{(1 + e^{\varepsilon})^2}.$$
The density is plotted in Figure 5.3. Graphically, it looks like the standard normal density except
that the standard normal density puts more mass around its mean value (0).
The hazard function of T at any covariate value z has a closed form:
$$\lambda(t \mid z) = \frac{\alpha t^{\alpha - 1}e^{-z^T\beta/\sigma}}{1 + t^{\alpha}e^{-z^T\beta/\sigma}},$$
where $\alpha = 1/\sigma$.
Figure 5.3: Density function of the standard logistic distribution
Since the logistic distribution looks similar to the normal distribution, it is expected that the hazard function of the log-logistic distribution would also look like that of the log-normal distribution, i.e., would have an inverted U-shaped hazard. However, this is the case only for $\sigma < 1$. The hazard function of T when $\beta = 0$ is presented in Figure 5.4 for some values of $\sigma$.
Figure 5.4: Hazard functions for the log-logistic distribution ($\sigma$ = 0.5, 1, 1.5)
The random variable T has a very simple survival function at covariate value z:
$$S(t \mid z) = \frac{1}{1 + \left(te^{-z^T\beta}\right)^{1/\sigma}}.$$
Some simple algebra then shows that
$$\log\left[\frac{S(t \mid z)}{1 - S(t \mid z)}\right] = \beta_0^* + \beta_1^* z_1 + \cdots + \beta_p^* z_p - \alpha\log(t), \qquad (5.5)$$
where $\beta_j^* = \beta_j/\sigma$ for $j = 0, 1, \ldots, p$. This is nothing but a logistic regression model with an intercept depending on t. Since $S(t \mid z)$ is the probability of surviving to time t, the ratio $S(t \mid z)/[1 - S(t \mid z)]$ is often called the odds of surviving to time t. Therefore, with a one-unit increase in $z_k$ while the other covariates are held fixed, the odds ratio is given by
$$\frac{S(t \mid z_k + 1)/[1 - S(t \mid z_k + 1)]}{S(t \mid z_k)/[1 - S(t \mid z_k)]} = e^{\beta_k^*} \quad \text{for all } t \ge 0,$$
which is constant over time. Therefore, we have a proportional odds model. Hence $\beta_k^*$ can be interpreted as the log odds ratio (for surviving) with a one-unit increase in $z_k$, and $-\beta_k^*$ is the log odds ratio of dying before time t with a one-unit increase in $z_k$. At times when the event of failure is rare (such as the early phase of a study), $-\beta_k^*$ can also be approximately interpreted as the log relative risk of dying. The log-logistic model is the only one that is both an AFT model and a proportional odds model.
Obviously, $\varepsilon$ has the following cumulative distribution function:
$$F(u) = \frac{e^u}{1 + e^u}, \quad u \in (-\infty, \infty),$$
whose inverse function
$$\mathrm{logit}(\theta) = \log\left(\frac{\theta}{1 - \theta}\right), \quad \theta \in (0, 1),$$
is often called the logit function.
Note: As in the case of the log-normal distribution, equation (5.5) indicates that the regression coefficients $\beta_k^*$ can be estimated using Proc Logistic or Proc Genmod. See the notes for the log-normal model for the procedure and the limitations of such an approach.
Example (Bone marrow transplant data revisited) We fit a log-logistic model to the data using the following SAS program:
title "Log-logistic fit";
proc lifereg data=bone;
model time*status(0) = allo hodgkins kscore wtime / dist=llogistic;
run;
and got the following output:
Log-logistic fit 4
17:35 Tuesday, March 1, 2005
The LIFEREG Procedure
Model Information
Data Set WORK.BONE
Dependent Variable Log(time)
Censoring Variable status
Censoring Value(s) 0
Number of Observations 43
Noncensored Values 26
Right Censored Values 17
Left Censored Values 0
Interval Censored Values 0
Name of Distribution LLogistic
Log Likelihood -61.15252729
Algorithm converged.
Type III Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
allo 1 0.3518 0.5531
hodgkins 1 4.6895 0.0303
kscore 1 24.0413 <.0001
wtime 1 1.4635 0.2264
Analysis of Parameter Estimates
Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 0.5715 1.0620 -1.5101 2.6530 0.29 0.5905
allo 1 0.3419 0.5764 -0.7878 1.4715 0.35 0.5531
hodgkins 1 -1.4494 0.6693 -2.7613 -0.1376 4.69 0.0303
kscore 1 0.0669 0.0137 0.0402 0.0937 24.04 <.0001
wtime 1 0.0131 0.0108 -0.0081 0.0343 1.46 0.2264
Scale 1 0.9382 0.1567 0.6763 1.3016
We reach the same conclusion that the two transplants are not significantly different after adjusting for the other covariates. The effects of the other covariates are similar too.
The gamma model
The procedure Proc Lifereg in SAS actually fits a generalized gamma model (not a standard gamma model) to the data, by assuming $T_0 = e^{\varepsilon}$ to have the following density function:
$$f(t) = \frac{|\delta|\left(t^{\delta}/\delta^2\right)^{1/\delta^2}\exp\left(-t^{\delta}/\delta^2\right)}{t\,\Gamma(1/\delta^2)},$$
where $\delta$ is called the shape parameter, labeled "Gamma shape parameter" in the output of Proc Lifereg when we specify dist=gamma. The hazard function of this gamma distribution does not have a closed form; it is presented graphically in Figure 5.5 for several values of $\delta$. Note: these are not the hazard functions of the standard gamma distribution.
Clearly from this plot, the hazard function is an inverted U-shaped function of time when $\delta < 1$ and takes a U-shape when $\delta > 1$. This feature makes the gamma model very appropriate for modeling survival times, especially for humans. In practice, the hazard function is determined jointly by the scale parameter $\sigma$ and the shape parameter $\delta$, and we need to examine the resulting hazard function case by case.

Figure 5.5: Hazard functions for the gamma distribution ($\delta$ = 0.5, 1.5, 2)
For a given set of covariates $(z_1, z_2, \ldots, z_p)$, let $c = e^{\beta_0 + z_1\beta_1 + \cdots + z_p\beta_p} = e^{z^T\beta}$. Then $\log(T) = z^T\beta + \sigma\varepsilon$ implies $T = e^{z^T\beta}\left(e^{\varepsilon}\right)^{\sigma} = cT_0^{\sigma}$. Hence the survival function for this population is
$$S(t \mid z) = P[T \ge t] = P[cT_0^{\sigma} \ge t] = P\left[T_0 \ge (c^{-1}t)^{1/\sigma}\right] = P[T_0 \ge b(t)]$$
$$= \int_{b(t)}^{\infty}\frac{|\delta|\left(x^{\delta}/\delta^2\right)^{1/\delta^2}\exp\left(-x^{\delta}/\delta^2\right)}{x\,\Gamma(1/\delta^2)}\,dx.$$
With the substitution $y = x^{\delta}/\delta^2$, this becomes
$$S(t \mid z) = \begin{cases} \displaystyle\int_{b(t)^{\delta}/\delta^2}^{\infty}\frac{y^{K-1}e^{-y}}{\Gamma(K)}\,dy & \text{if } \delta > 0, \\[2ex] \displaystyle\int_{0}^{b(t)^{\delta}/\delta^2}\frac{y^{K-1}e^{-y}}{\Gamma(K)}\,dy & \text{if } \delta < 0, \end{cases}$$
where $K = 1/\delta^2$. This final integral can be calculated using built-in functions in any popular software package.
Example (Bone marrow transplant data revisited) We fit a gamma model to the bone marrow transplant data using the following SAS program:
title "Gamma fit";
proc lifereg data=bone;
model time*status(0) = allo hodgkins kscore wtime / dist=gamma;
run;
and got the following results:
Gamma fit 5
17:35 Tuesday, March 1, 2005
The LIFEREG Procedure
Model Information
Data Set WORK.BONE
Dependent Variable Log(time)
Censoring Variable status
Censoring Value(s) 0
Number of Observations 43
Noncensored Values 26
Right Censored Values 17
Left Censored Values 0
Interval Censored Values 0
Name of Distribution Gamma
Log Likelihood -60.91369551
Algorithm converged.
Type III Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
allo 1 0.4046 0.5247
hodgkins 1 3.7275 0.0535
kscore 1 11.3847 0.0007
wtime 1 1.2282 0.2678
Analysis of Parameter Estimates
Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 0.4299 1.1407 -1.8057 2.6656 0.14 0.7062
allo 1 0.3734 0.5870 -0.7772 1.5240 0.40 0.5247
hodgkins 1 -1.3098 0.6784 -2.6394 0.0199 3.73 0.0535
kscore 1 0.0696 0.0206 0.0292 0.1100 11.38 0.0007
wtime 1 0.0133 0.0120 -0.0102 0.0367 1.23 0.2678
Scale 1 1.5771 0.3805 0.9829 2.5307
Shape 1 0.2047 1.0278 -1.8098 2.2192
Again, there is no significant difference between the transplant methods. Some special cases:

1. $\delta = 1$: $T \mid z$ has the Weibull distribution.

2. $\delta = 0$: $T \mid z$ has the log-normal distribution. We need the following approximation in order to show this:
$$\Gamma(x) \approx \sqrt{2\pi}\,x^{x - \frac{1}{2}}e^{-x} \quad \text{as } x \to \infty.$$

3. $\delta = \sigma$: $T \mid z$ has the standard gamma distribution with the following density:
$$f(t \mid z) = \frac{t^{K-1}e^{-t/\lambda}}{\lambda^K\,\Gamma(K)},$$
where $K = 1/\delta^2$ is the shape parameter and $\lambda = \delta^2 e^{z^T\beta}$ is the scale parameter of the standard gamma distribution. However, Proc Lifereg will not fit a standard gamma distribution to data; we have to use a grid search. Specifically, use the output from a generalized gamma fit to get an idea about the true value of $\delta = \sigma$. Then form a grid, and for each grid point of $\delta$ (or $\sigma$), say $\delta = 1.2$, fit a gamma distribution with the specification dist=gamma noshape1 shape1=1.2 noscale scale=1.2;. Select the model that gives the largest log-likelihood.
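One way to organize this grid search is with a small macro; a sketch (the grid values are ours):

%macro stdgamma(d);
   title "Standard gamma fit with sigma = delta = &d";
   proc lifereg data=bone;
      model time*status(0) = allo hodgkins kscore wtime
            / dist=gamma noshape1 shape1=&d noscale scale=&d;
   run;
%mend stdgamma;

/* compare the reported log-likelihoods and keep the largest */
%stdgamma(0.8)  %stdgamma(1.0)  %stdgamma(1.2)  %stdgamma(1.4)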
Categorical Variables and Class statement in Proc Lifereg
If a covariate $z_k$ is categorical, then we can use the class statement in Proc Lifereg to tell the procedure that $z_k$ is categorical and just enter $z_k$ in the model statement in the usual way. The output of Proc Lifereg will then provide a $\chi^2$ statistic and a p-value to test the null hypothesis that the survival time is not associated with $z_k$. It also reports the estimates, standard errors, and p-values, etc., for the contrasts between each level and the highest level of the category. However, we need to create appropriate variables for interactions ourselves.
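For instance, assuming a hypothetical categorical covariate cellt (cell type) in the bone marrow data set, the syntax would look like:

proc lifereg data=bone;
   class cellt;     /* cellt is treated as categorical                 */
   model time*status(0) = allo kscore cellt / dist=weibull;
   /* the output gives an overall chi-square for cellt plus contrasts  */
   /* between each level and the highest level of cellt                */
run;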
5.3 Goodness-of-fit Using Likelihood Ratio Test
The following facts about the models described in this chapter allow us to perform likelihood ratio tests in order to pick a right model:

1. generalized gamma $(\sigma, \delta)$ $\supset$ standard gamma $(\sigma = \delta)$ $\supset$ exponential distribution $(\sigma = \delta = 1)$.

2. generalized gamma $(\sigma, \delta)$ $\supset$ Weibull $(\delta = 1, \sigma)$ $\supset$ exponential distribution $(\sigma = \delta = 1)$.

3. generalized gamma $(\sigma, \delta)$ $\supset$ log-normal $(\delta = 0, \sigma)$.
We can use the above nested models to conduct likelihood ratio tests for the bone marrow transplant data:

Maximized log-likelihood values for different models

Model           Maximized log-likelihood
Gamma           -60.91
Log-logistic    -61.15
Log-normal      -60.93
Weibull         -61.21
Exponential     -62.49

Assuming the gamma model is a reasonable model for the data, the LRT indicates that the log-normal model is equally good and the Weibull model is also acceptable. Since the log-normal model and the Weibull model have the same number of parameters, we might take the log-normal model as the final model based on its larger maximized log-likelihood.
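The nested-model tests are one-line computations; a sketch with the log-likelihood values copied from the fits above:

data lrt;
   ll_gamma = -60.91369551;   ll_wei = -61.21034611;
   ll_lnorm = -60.93151115;   ll_exp = -62.49090652;
   p_wei   = 1 - probchi(2*(ll_gamma - ll_wei),   1);  /* delta = 1         */
   p_lnorm = 1 - probchi(2*(ll_gamma - ll_lnorm), 1);  /* delta = 0         */
   p_exp   = 1 - probchi(2*(ll_gamma - ll_exp),   2);  /* sigma = delta = 1 */
   put p_wei= 6.4 p_lnorm= 6.4 p_exp= 6.4;   /* roughly 0.44, 0.85, 0.21    */
run;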
6 Modeling Survival Data with Cox Regression Models
6.1 The Proportional Hazards Model
A proportional hazards model, proposed by D. R. Cox (1972), assumes that
$$\lambda(t \mid z) = \lambda_0(t)e^{z_1\beta_1 + \cdots + z_p\beta_p} = \lambda_0(t)e^{z^T\beta}, \qquad (6.1)$$
where z is a $p \times 1$ vector of covariates such as treatment indicators, prognostic factors, etc., and $\beta$ is a $p \times 1$ vector of regression coefficients. Note that there is no intercept $\beta_0$ in model (6.1).
Obviously,
$$\lambda(t \mid z = 0) = \lambda_0(t).$$
So $\lambda_0(t)$ is often called the baseline hazard function. It can be interpreted as the hazard function for the population of subjects with $z = 0$.
The baseline hazard function $\lambda_0(t)$ in model (6.1) can take any shape as a function of t. The only requirement is that $\lambda_0(t) > 0$. This is the nonparametric part of the model, and $z^T\beta$ is the parametric part. So Cox's proportional hazards model is a semiparametric model.
Interpretation of a proportional hazards model
1. It is easy to show that under model (6.1),
$$S(t \mid z) = [S_0(t)]^{\exp(z^T\beta)},$$
where $S(t \mid z)$ is the survival function of the subpopulation with covariate z and $S_0(t)$ is the survival function of the baseline population ($z = 0$); that is,
$$S_0(t) = e^{-\int_0^t \lambda_0(u)\,du}.$$
2. For any two sets of covariates $z_0$ and $z_1$,
$$\frac{\lambda(t \mid z_1)}{\lambda(t \mid z_0)} = \frac{\lambda_0(t)e^{z_1^T\beta}}{\lambda_0(t)e^{z_0^T\beta}} = e^{(z_1 - z_0)^T\beta}, \quad \text{for all } t \ge 0,$$
which is constant over time (hence the name proportional hazards model). Equivalently,
$$\log\left[\frac{\lambda(t \mid z_1)}{\lambda(t \mid z_0)}\right] = (z_1 - z_0)^T\beta, \quad \text{for all } t \ge 0.$$
3. With a one-unit increase in $z_k$ while the other covariate values are held fixed,
$$\log\left[\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)}\right] = \log\lambda(t \mid z_k + 1) - \log\lambda(t \mid z_k) = \beta_k.$$
Therefore, $\beta_k$ is the increase in log hazard (i.e., the log hazard ratio) at any time associated with a one-unit increase in the kth covariate $z_k$. Equivalently,
$$\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)} = e^{\beta_k}, \quad \text{for all } t \ge 0,$$
so $\exp(\beta_k)$ is the hazard ratio associated with a one-unit increase in $z_k$. Furthermore, since $P[t \le T < t + \Delta t \mid T \ge t, z] \approx \lambda(t \mid z)\Delta t$, we have
$$\frac{P[t \le T < t + \Delta t \mid T \ge t, z_k + 1]}{P[t \le T < t + \Delta t \mid T \ge t, z_k]} \approx e^{\beta_k}, \quad \text{for all } t \ge 0,$$
so $\exp(\beta_k)$ can be loosely interpreted as the ratio of two conditional probabilities of dying in the near future given that a subject is alive at any time t. Since
$$\frac{\lambda(t \mid z_k + 1) - \lambda(t \mid z_k)}{\lambda(t \mid z_k)} = e^{\beta_k} - 1,$$
$e^{\beta_k} - 1$ can be interpreted as the percentage change (increase or decrease) in hazard with a one-unit increase in $z_k$ while adjusting for the other covariates.
Inferential Problems
From the interpretation of the model, it is obvious that $\beta$ characterizes the effect of z. So $\beta$ should be the focus of our inference, while $\lambda_0(t)$ is a nuisance parameter. Given a sample of censored survival data, our inferential problems include:

1. Estimating $\beta$ and deriving the statistical properties of the estimator.
2. Testing the hypothesis $H_0: \beta = 0$, or hypotheses about part of $\beta$.

3. Diagnostics.
Estimation
Since the baseline hazard $\lambda_0(t)$ is left completely unspecified (infinite dimensional), ordinary likelihood methods can't be used to estimate $\beta$. Cox conceived of the idea of a partial likelihood to remove the nuisance parameter $\lambda_0(t)$ from the proposed estimating equation.
Historical Note: Cox described the proportional hazards model in JRSSB (1972), in what is now one of the most quoted statistical papers in history. He also outlined in this paper the method of estimation, which he referred to as using conditional likelihood. It was pointed out to him in the literature that what he proposed was not a conditional likelihood and that there might be some flaws in his logic. Cox (1975) was able to recast his method of estimation through what he called partial likelihood and published this in Biometrika. This approach seemed to be based on sound inferential principles. Rigorous proofs showing consistency and asymptotic normality were not published until 1981, when Tsiatis (Annals of Statistics) demonstrated these large sample properties. In 1982, Andersen and Gill (Annals of Statistics) simplified and generalized these results through the use of counting processes.
6.2 Estimation Using Partial Likelihood
Data and Model
1. Data: $(X_i, \delta_i, z_i)$, $i = 1, \ldots, n$, where for the ith individual
$$X_i = \min(T_i, C_i), \qquad \delta_i = I(T_i \le C_i),$$
and $z_i = (z_{i1}, z_{i2}, \ldots, z_{ip})^T$ is a vector of covariates.
2. Model: the proportional hazards model
$$\lambda(t \mid z_i) = \lambda_0(t)e^{z_i^T\beta},$$
where
$$\lambda(t \mid z_i) = \lim_{h \to 0^+}\left\{\frac{P[t \le T_i < t + h \mid T_i \ge t, z_i]}{h}\right\}.$$
Assume that $C_i$ and $T_i$ are conditionally independent given $z_i$. Then the cause-specific hazard can be used to represent the hazard of interest. That is (in terms of conditional probabilities),
$$P[x \le X_i < x + \Delta x,\ \delta_i = 1 \mid X_i \ge x, z_i] = P[x \le T_i < x + \Delta x \mid T_i \ge x, z_i] \approx \lambda_{T_i}(x \mid z_i)\,\Delta x.$$
Similar to the case of the logrank test, we need to define some notation. Let us break the time axis (patient time) into a grid of points. Assume the survival time is continuous. We hence can take the grid points dense enough that at most one death can occur within any interval.

Let $dN_i(u)$ denote the indicator for the ith individual being observed to die in $[u, u + \Delta u)$; namely,
$$dN_i(u) = I(X_i \in [u, u + \Delta u),\ \delta_i = 1).$$

Let $Y_i(u)$ denote the indicator of whether or not the ith individual is at risk at time u; namely,
$$Y_i(u) = I(X_i \ge u).$$
Let $dN(u) = \sum_{i=1}^{n} dN_i(u)$ denote the number of deaths for the whole sample occurring in $[u, u + \Delta u)$. Since we are assuming $\Delta u$ is sufficiently small, $dN(u)$ is either 1 or 0 at any time u.

Let $Y(u) = \sum_{i=1}^{n} Y_i(u)$ be the total number from the entire sample who are at risk at time u.
Let $\mathcal{F}(x)$ denote the information up to time x (one of the grid points):
$$\mathcal{F}(x) = \{(dN_i(u), Y_i(u), z_i),\ i = 1, \ldots, n;\ \text{for grid points } u < x,\ \text{and } dN(x)\}.$$
Note: Conditional on $\mathcal{F}(x)$, we know who died or was censored prior to x, when they died or were censored, together with their covariate values. We know the individuals at risk at time x and their corresponding covariate values. In addition, we also know whether a death occurs in the interval $[x, x + \Delta x)$.

What we don't know is which individual was observed to die among those at risk at time x when $dN(x) = 1$.
Let I(x) denote the individual in the sample who died at time x if someone died; if no one dies at time x, then $I(x) = 0$. For example, if $I(x) = j$, then the jth individual in the sample, with covariate vector $z_j$, died in $[x, x + \Delta x)$.
Let $\mathcal{F}(\infty)$ denote all the data in the sample; namely,
$$\mathcal{F}(\infty) = \{(X_i, \delta_i, z_i),\ i = 1, \ldots, n\}.$$
If we let $u_1 < u_2 < \cdots$ denote the values of the grid points along the time axis, then the data (with redundancy) can be expressed as
$$\left(\mathcal{F}(u_1), I(u_1), \mathcal{F}(u_2), I(u_2), \ldots, \mathcal{F}(\infty)\right).$$
Denote the observed values of the above random variables by lower case letters. Then the likelihood of the parameters $\lambda_0(t)$ and $\beta$ can be written as
$$P[\mathcal{F}(u_1) = f(u_1);\ \lambda_0(\cdot), \beta] \times P[I(u_1) = i(u_1) \mid \mathcal{F}(u_1) = f(u_1);\ \lambda_0(\cdot), \beta]$$
$$\times\ P[\mathcal{F}(u_2) = f(u_2) \mid \mathcal{F}(u_1) = f(u_1), I(u_1) = i(u_1);\ \lambda_0(\cdot), \beta]$$
$$\times\ P[I(u_2) = i(u_2) \mid \mathcal{F}(u_1) = f(u_1), I(u_1) = i(u_1), \mathcal{F}(u_2) = f(u_2);\ \lambda_0(\cdot), \beta] \times \cdots,$$
and the last term displayed can be simplified as
$$P[I(u_2) = i(u_2) \mid \mathcal{F}(u_1) = f(u_1), I(u_1) = i(u_1), \mathcal{F}(u_2) = f(u_2);\ \lambda_0(\cdot), \beta] = P[I(u_2) = i(u_2) \mid \mathcal{F}(u_2) = f(u_2);\ \lambda_0(\cdot), \beta].$$
That is, the full likelihood can be written as the product of a series of conditional likelihoods.
The partial likelihood (as defined by D.R. Cox) consists of the product of every other conditional probability in the above representation. That is,
$$PL(\beta) = \prod_{\{\text{all grid pts } u\}} P[I(u) = i(u) \mid \mathcal{F}(u) = f(u);\ \lambda_0(\cdot), \beta].$$
Suppose we have the following small data set; let us try to find out this partial likelihood:

Patient ID    x    $\delta$    z
    1         2       1        2
    2         2       0        2
    3         3       1        1
    4         4       1        3
It turns out that the partial likelihood is
$$PL(\beta) = \frac{e^{2\beta}}{e^{2\beta} + e^{2\beta} + e^{\beta} + e^{3\beta}} \times \frac{e^{\beta}}{e^{\beta} + e^{3\beta}} \times \frac{e^{3\beta}}{e^{3\beta}}. \qquad (6.2)$$
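We can check numerically that maximizing (6.2) is exactly what Proc Phreg does; a sketch with this toy data set (names are ours):

data toy;
   input id x delta z;
   datalines;
1 2 1 2
2 2 0 2
3 3 1 1
4 4 1 3
;
run;

proc phreg data=toy;
   model x*delta(0) = z;   /* the MPLE maximizes the partial likelihood (6.2) */
run;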
In general, we have to consider two cases in calculating the above partial likelihood.

Case 1: Suppose that conditional on $\mathcal{F}(u)$ we have $dN(u) = 0$; that is, no death is observed at time u. In this case, $I(u) = 0$ with probability 1. Hence for any grid point u where $dN(u) = 0$, we have
$$P[I(u) = 0 \mid \mathcal{F}(u) = f(u)] = 1.$$
Therefore, the partial likelihood is not affected by any point u such that $dN(u) = 0$.
Case 2: $dN(u) = 1$. Conditional on $\mathcal{F}(u)$, if we know that one individual dies at time u, then it must be one of the individuals still at risk (alive and not censored) at time u, i.e., among the individuals $\{i : Y_i(u) = 1\}$. Also conditional on $\mathcal{F}(u)$, we know the covariate vector $z_i$ associated with each individual i such that $Y_i(u) = 1$. Therefore, we ask the following question:

Among the $Y(u) = \sum_{i=1}^{n} Y_i(u)$ individuals at risk, what is the probability that the observed death happened to the ith subject (who is actually observed to die at u) rather than to the other patients?
Unlike the null hypothesis case for the two-sample problem, the probabilities of choosing these subjects are not equally likely; rather, they are proportional to their cause-specific hazards of dying at time u, which can be derived as follows.

Let $A_i$ be the event that subject i is going to die in $[u, u + \Delta u)$, given that he/she is still alive at u. If a patient is not at risk at u (i.e., $Y_i(u) = 0$), then $A_i = \emptyset$. Since we chose $\Delta u$ to be so small that there is at most one death in $[u, u + \Delta u)$, we know that $A_1, A_2, \ldots, A_n$ are mutually exclusive.

Because of the independence of the survival and censoring times, those Y(u) patients who are at risk at u (not censored and still alive at u) make up a random sample of the subpopulation consisting of the patients who will survive up to u (and with the same covariate value). Under the independent censoring assumption, we already showed in Chapter 3 that the cause-specific hazard is the same as the hazard of interest; i.e.,
$$\lambda(u, \delta_i = 1 \mid z_i) = \lambda(u \mid z_i).$$
Since $\Delta u$ is chosen to be very small,
$$P[A_i] \approx Y_i(u)\,\lambda(u, \delta_i = 1 \mid z_i)\,\Delta u = Y_i(u)\,\lambda(u \mid z_i)\,\Delta u = Y_i(u)\,\lambda_0(u)\exp(z_i^T\beta)\,\Delta u,$$
where the last equation is due to the assumption of the Cox model. Therefore,
$$P[I(u) = i(u) \mid \mathcal{F}(u) = f(u);\ \lambda_0(\cdot), \beta] = P[A_{i(u)} \mid A_1 \cup \cdots \cup A_n] = \frac{P[A_{i(u)}]}{\sum_{l=1}^{n} P[A_l]}$$
$$\approx \frac{\lambda_0(u)\exp(z_{i(u)}^T\beta)\,\Delta u}{\sum_{l=1}^{n}\lambda_0(u)\exp(z_l^T\beta)Y_l(u)\,\Delta u} = \frac{\exp(z_{i(u)}^T\beta)}{\sum_{l=1}^{n}\exp(z_l^T\beta)Y_l(u)}.$$
Here $Y_{i(u)}(u) = 1$ since this patient had to be at risk at u (we know that this patient died in $[u, u + \Delta u)$).

Combining these cases, the partial likelihood can be written as
$$PL(\beta) = \prod_{\text{all grid pts } u} \left[\frac{\exp(z_{i(u)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)}\right]^{dN(u)}.$$
Remark: To be formal, we need to define $z_0$ even though it is never used. We can, for example, take $z_0 = 0$.

Other equivalent ways of writing the partial likelihood include the following. Let $t_1, \ldots, t_d$ denote the distinct death times; then
$$PL(\beta) = \prod_{j=1}^{d} \frac{\exp(z_{i(t_j)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(t_j)};$$
$$PL(\beta) = \prod_{i=1}^{n} \prod_{\text{all grid pts } u} \left[\frac{\exp(z_i^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)}\right]^{dN_i(u)};$$
$$PL(\beta) = \prod_{i=1}^{n} \left[\frac{\exp(z_i^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(x_i)}\right]^{\delta_i}.$$
Remark: Stare at these different representations for a while and you will convince yourself that they are all equivalent.
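As a quick numerical check (an illustrative sketch of ours, with hypothetical variable names), the last representation can be coded directly for the small data set above; it gives the same value as (6.2) at any beta:

x     <- c(2, 2, 3, 4)      # observed times
delta <- c(1, 0, 1, 1)      # death indicators
z     <- c(2, 2, 1, 3)      # covariates
pl3 <- function(beta) {
  prod(sapply(which(delta == 1), function(i) {
    risk <- x >= x[i]       # risk set at x_i
    exp(z[i]*beta) / sum(exp(z[risk]*beta))
  }))
}
pl3(0.5)   # equals PL(0.5) computed from (6.2)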
The importance of using the partial likelihood is that this function depends only on $\beta$, the parameter of interest, and is free of the baseline hazard $\lambda_0(t)$, which is an infinite-dimensional nuisance function.

Cox suggested treating PL as a regular likelihood function and making inference on $\beta$ accordingly. For example, we maximize the partial likelihood to get the estimate of $\beta$, often called the MPLE (maximum partial likelihood estimate), and use the negative of the second derivative of the log partial likelihood as the information matrix, etc.
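In R, for example, this maximization is what coxph() in the survival package carries out; a sketch on the small data set above (the data-frame name toy is ours):

library(survival)
toy <- data.frame(x = c(2, 2, 3, 4), delta = c(1, 0, 1, 1), z = c(2, 2, 1, 3))
fit <- coxph(Surv(x, delta) ~ z, data = toy)
coef(fit)   # the MPLE of beta; matches direct maximization of (6.2)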
Properties of the score of the partial likelihood

For ease of presentation, let us focus on the one-covariate case; the extension is straightforward. Obviously, the log partial likelihood function of $\beta$ is
$$\ell(\beta) = \sum_{\text{all grid pts } u} dN(u)\left[z_{I(u)}\beta - \log\left(\sum_{l=1}^n \exp(z_l\beta)Y_l(u)\right)\right].$$
The score function is
$$U(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{\text{all grid pts } u} dN(u)\left[z_{I(u)} - \frac{\sum_{l=1}^n z_l \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)}\right],$$
and the second derivative is
$$\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = -\sum_{u} dN(u)\left[\frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)} - \left(\frac{\sum_{l=1}^n z_l \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)}\right)^2\right].$$
Define
$$\bar z(u, \beta) = \frac{\sum_{l=1}^n z_l \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)} = \sum_{l=1}^n z_l w_l,$$
where
$$w_l = \frac{\exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)}$$
is a weight proportional to the hazard of the individual failing. So $\bar z(u, \beta)$ can be interpreted as the weighted average of the covariate z among those individuals still at risk at time u, with weights $w_l$.
Define
$$V_z(u, \beta) = \frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)} - \left(\frac{\sum_{l=1}^n z_l \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)}\right)^2 = \sum_{l=1}^n z_l^2 w_l - (\bar z(u, \beta))^2.$$
This can be shown to be equal to
$$V_z(u, \beta) = \frac{\sum_{l=1}^n (z_l - \bar z(u, \beta))^2 \exp(z_l\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)Y_l(u)} = \sum_{l=1}^n (z_l - \bar z(u, \beta))^2 w_l.$$
This last representation says that $V_z(u, \beta)$ can be interpreted as the weighted variance of the covariate among those individuals still at risk at u, and hence $V_z(u, \beta) > 0$. Consequently,
$$\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = -\sum_{u} dN(u)V_z(u, \beta) < 0.$$
The above property can also be displayed graphically. For example, the partial likelihood function (6.2) looks like:

[Figure 6.1: The partial likelihood (6.2), plotted against beta over (-4, 4); vertical axis: partial likelihood (0.00 to 0.15). The curve has a single interior maximum.]
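A plot like Figure 6.1 can be reproduced with a few lines of R, reusing the pl() function sketched earlier (our own illustration):

beta <- seq(-4, 4, by = 0.01)
plot(beta, sapply(beta, pl), type = "l",
     xlab = "beta", ylab = "partial likelihood")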
Therefore $\ell(\beta)$ has a unique maximizer, which can be obtained by solving the following partial likelihood equation:
$$U(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{\text{all grid pts } u} dN(u)\left[z_{I(u)} - \bar z(u, \beta)\right] = 0.$$
This maximizer $\hat\beta$ defines the MPLE of $\beta$.
Terminology: The quantity
$$-\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = \sum_{u} dN(u)V_z(u, \beta)$$
is defined as the partial likelihood observed information and is denoted by $J(\beta)$.
Ultimately, we want to show that the MPLE $\hat\beta$ has nice statistical properties. These include:

Consistency: $\hat\beta$ will converge to the true value of $\beta$ which generated the data as the sample size gets larger. We call this true value $\beta_0$.

Asymptotic Normality: $\hat\beta$ will be approximately normally distributed with mean $\beta_0$ and a variance which can be estimated from the data. The approximation will improve as the sample size gets larger. This result is useful in making inference about the true $\beta$.

Efficiency: Among all competing estimators for $\beta$, the MPLE has the smallest variance, at least as the sample size gets larger.
In order to show the properties of $\hat\beta$, we expand $U(\hat\beta)$ at the true value $\beta_0$ using a Taylor expansion:
$$0 = U(\hat\beta) \approx U(\beta_0) + \frac{\partial U(\beta_0)}{\partial \beta}(\hat\beta - \beta_0).$$
Since
$$\frac{\partial U(\beta_0)}{\partial \beta} = \frac{\partial^2 \ell(\beta_0)}{\partial \beta^2} = -J(\beta_0),$$
therefore
$$(\hat\beta - \beta_0) \approx [J(\beta_0)]^{-1} U(\beta_0).$$
This expression indicates that we need to investigate the properties of the score function $U(\beta_0)$:
$$U(\beta_0) = \sum_{u} dN(u)\left[z_{I(u)} - \bar z(u, \beta_0)\right].$$
Properties of the score:

(1) $E[U(\beta_0)] = 0$.

Since
$$E[U(\beta_0)] = E\left[\sum_u dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right] = \sum_u E\left[dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right],$$
and
$$E\left[dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right] = E\left[E\left[dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right) \,\middle|\, F(u)\right]\right].$$
Conditional on F(u), dN(u) and $\bar z(u, \beta_0)$ are both known. Consequently the inner expectation can be written as
$$dN(u)\left[E[z_{I(u)} \mid F(u)] - \bar z(u, \beta_0)\right].$$
Remember that I(u) is the patient identifier for the individual that dies at time u, and it is set to zero if no one dies at u. If no one dies at u, then dN(u) = 0, and hence the above quantity is zero. If someone dies at u, then dN(u) = 1, and conditional on F(u), we know it has to be one of the Y(u) people at risk at time u; i.e., I(u) must be one of the values $\{i : Y_i(u) = 1\}$.

The conditional distribution of $z_{I(u)}$ given F(u) can be derived through the conditional distribution of I(u) given F(u), as shown in Table 6.1. Therefore
$$E[z_{I(u)} \mid F(u)] = \sum_{l=1}^n z_l w_l = \frac{\sum_{l=1}^n z_l \exp(z_l\beta_0)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta_0)Y_l(u)} = \bar z(u, \beta_0).$$
From this, we immediately get
$$E[U(\beta_0)] = 0.$$
Table 6.1: Conditional distribution of $z_{I(u)}$ given F(u)

Value of I(u)    Value of $z_{I(u)}$    Probability
1                $z_1$                  $\exp(z_1\beta_0)Y_1(u)/\sum_{l=1}^n \exp(z_l\beta_0)Y_l(u) = w_1$
2                $z_2$                  $\exp(z_2\beta_0)Y_2(u)/\sum_{l=1}^n \exp(z_l\beta_0)Y_l(u) = w_2$
...              ...                    ...
n                $z_n$                  $\exp(z_n\beta_0)Y_n(u)/\sum_{l=1}^n \exp(z_l\beta_0)Y_l(u) = w_n$
Note: From the conditional distribution of $z_{I(u)}$ given F(u), it is easy to see that the conditional variance of $z_{I(u)}$ is
$$\mathrm{Var}[z_{I(u)} \mid F(u)] = \sum_{l=1}^n \left(z_l - E[z_{I(u)} \mid F(u)]\right)^2 w_l = \frac{\sum_{l=1}^n (z_l - \bar z(u, \beta_0))^2 \exp(z_l\beta_0)Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta_0)Y_l(u)} = V_z(u, \beta_0).$$
(2) Finding an unbiased estimate for the variance of $U(\beta_0)$.

Since $E[U(\beta_0)] = 0$,
$$\mathrm{Var}[U(\beta_0)] = E\left[U^2(\beta_0)\right] = E\left[\sum_u dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right]^2$$
$$= E\left[\sum_u \left\{dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right\}^2\right] + E\left[\sum_{u \ne u'} \left\{dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right\}\left\{dN(u')\left(z_{I(u')} - \bar z(u', \beta_0)\right)\right\}\right].$$
As usual, we will take an arbitrary cross-product and show that it has zero expectation. Assume $u' > u$ and denote
$$A(u) = dN(u)\left[z_{I(u)} - \bar z(u, \beta_0)\right], \qquad A(u') = dN(u')\left[z_{I(u')} - \bar z(u', \beta_0)\right].$$
Then the expectation of the cross-product is
$$E[A(u)A(u')] = E\left[E\left[A(u)A(u') \mid F(u')\right]\right].$$
Since $u' > u$, conditional on $F(u')$, A(u) is known. So
$$E\left[A(u)A(u') \mid F(u')\right] = A(u)E\left[A(u') \mid F(u')\right] = 0.$$
Therefore
$$\mathrm{Var}[U(\beta_0)] = E\left[\sum_u A^2(u)\right] = \sum_u E\left[A^2(u)\right] = \sum_u E\left[E\left[A^2(u) \mid F(u)\right]\right].$$
The inner conditional expectation is
$$E\left[A^2(u) \mid F(u)\right] = E\left[\left\{dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)\right\}^2 \,\middle|\, F(u)\right].$$
Since we pick the grid points in our partition of time fine enough that dN(u) is either 0 or 1, we have $dN^2(u) = dN(u)$. Hence
$$E\left[A^2(u) \mid F(u)\right] = E\left[dN(u)\left(z_{I(u)} - \bar z(u, \beta_0)\right)^2 \,\middle|\, F(u)\right].$$
Conditional on F(u), dN(u) is known, $\bar z(u, \beta_0)$ is also known, and from Table 6.1,
$$\bar z(u, \beta_0) = E[z_{I(u)} \mid F(u)].$$
Therefore
$$E\left[A^2(u) \mid F(u)\right] = dN(u)E\left[\left(z_{I(u)} - \bar z(u, \beta_0)\right)^2 \,\middle|\, F(u)\right] = dN(u)\mathrm{Var}[z_{I(u)} \mid F(u)] = dN(u)V_z(u, \beta_0).$$
Consequently,
$$\mathrm{Var}[U(\beta_0)] = \sum_u E[dN(u)V_z(u, \beta_0)] = E\left[\sum_u dN(u)V_z(u, \beta_0)\right].$$
Note that the quantity $\sum_u dN(u)V_z(u, \beta_0)$ is a statistic (it can be calculated from the observed data), so $\sum_u dN(u)V_z(u, \beta_0)$ is an unbiased estimate of $\mathrm{Var}[U(\beta_0)]$. In fact, $\sum_u dN(u)V_z(u, \beta_0)$ is the partial likelihood observed information $J(\beta_0)$ we defined before.
Conclusion

The score $U(\beta_0) = \sum_u A(u)$ is a sum of conditionally uncorrelated mean-zero random variables, and its variance can be unbiasedly estimated by
$$J(\beta_0) = \sum_u dN(u)V_z(u, \beta_0).$$
By the martingale CLT, we have
$$U(\beta_0) \sim_a N(0, J(\beta_0)).$$
Previously, we have shown that
$$(\hat\beta - \beta_0) \approx [J(\beta_0)]^{-1} U(\beta_0).$$
Treating $J(\beta_0)$ as a constant, we get the approximate distribution of $(\hat\beta - \beta_0)$:
$$(\hat\beta - \beta_0) \sim_a N(0, J^{-1}(\beta_0)).$$
Of course, in practice, $\beta_0$ is unknown. But we can substitute $\hat\beta$ for $\beta_0$ and use $J^{-1}(\hat\beta)$ as the estimated variance of $\hat\beta$. That is, we use the following approximate distribution for $(\hat\beta - \beta_0)$:
$$(\hat\beta - \beta_0) \sim_a N(0, J^{-1}(\hat\beta)),$$
where
$$J(\hat\beta) = \sum_u dN(u)V_z(u, \hat\beta),$$
and $\hat\beta$ is the MPLE of $\beta$ solving the following equation:
$$U(\hat\beta) = \sum_u dN(u)\left[z_{I(u)} - \bar z(u, \hat\beta)\right] = 0.$$
Inference with a Single Covariate

Assume a proportional hazards model with a single covariate z:
$$\lambda(t) = \lambda_0(t)e^{z\beta}.$$
After we get our data $(x_i, \delta_i, z_i)$, we can obtain the MPLE $\hat\beta$ by solving the partial likelihood equation, i.e., setting the partial score to zero. Then asymptotically,
$$\hat\beta \sim_a N(\beta_0, J^{-1}(\hat\beta)).$$
We can use this fact to construct a confidence interval for $\beta$ and to test the hypothesis $H_0\colon \beta = \beta_0$, etc. For example, a $(1-\alpha)$ CI for $\beta$ is
$$\hat\beta \pm z_{\alpha/2}\left[J^{-1}(\hat\beta)\right]^{1/2}.$$
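In R, this Wald interval can be read off a fitted coxph object (an illustrative sketch; any fitted Cox model will do, here the toy fit from before):

fit <- coxph(Surv(x, delta) ~ z, data = toy)
summary(fit)       # beta-hat, its se, and the Wald/score/LR tests
confint(fit)       # Wald CI for beta
exp(confint(fit))  # CI for the hazard ratio exp(beta)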
Myelomatosis data revisited: We analyzed the myelomatosis data and did not find a statistically significant difference between treatments 1 and 2. We want to quantify the difference by assuming the hazards of these two treatments are proportional to each other. Define a treatment indicator trt1 which takes value 0 for treatment 1 and value 1 for treatment 2. Then we can use Proc Phreg for this purpose.
proc phreg data=myel;
model dur*status(0)=trt1;
run;
Part of the output is given as follows:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
16:43 Thursday, March 2, 2000 15
The PHREG Procedure
Data Set: WORK.MYEL
Dependent Variable: DUR
Censoring Variable: STATUS
Censoring Value(s): 0
Ties Handling: BRESLOW
Summary of the Number of
Event and Censored Values
Percent
Total Event Censored Censored
25 17 8 32.00
Testing Global Null Hypothesis: BETA=0
Without With
Criterion Covariates Covariates Model Chi-Square
-2 LOG L 94.084 92.765 1.319 with 1 DF (p=0.2508)
Score . . 1.297 with 1 DF (p=0.2547)
Wald . . 1.263 with 1 DF (p=0.2610)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Risk
Variable DF Estimate Error Chi-Square Chi-Square Ratio
TRT1 1 0.572807 0.50960 1.26344 0.2610 1.773
So $\hat\beta = 0.5728$ with standard error 0.5096. This means that, compared to treatment 1, treatment 2 increases the hazard of dying at any time by 77% (hazard ratio $\exp(\hat\beta) \approx 1.77$). A 95% CI for $\beta$ is
$$\hat\beta \pm 1.96\,\mathrm{se}[\hat\beta] = 0.5728 \pm 1.96 \times 0.5096 = [-0.426, 1.572].$$
And a 95% CI for the hazard ratio $\exp(\beta)$ is
$$[e^{-0.426}, e^{1.572}] = [0.653, 4.816].$$
Note: The output also gives three tests for $H_0\colon \beta = 0$: the likelihood ratio, score, and Wald tests.
Comparison of the score test and the two-sample logrank test

Assume z is the dichotomous indicator for treatment; i.e.,
$$z = \begin{cases} 1 & \text{for treatment 1} \\ 0 & \text{for treatment 0,} \end{cases}$$
and assume the proportional hazards model
$$\lambda(t) = \lambda_0(t)e^{z\beta}.$$
Score test: Under $H_0\colon \beta = 0$, the score U(0) (evaluated under $H_0$) has the distribution
$$U(0) \sim_a N(0, J(0)),$$
or equivalently,
$$\left[\frac{U(0)}{J^{1/2}(0)}\right]^2 \sim_a \chi^2_1.$$
The score U(0) has the expression
$$U(0) = \sum_u dN(u)\left[z_{I(u)} - \bar z(u, 0)\right].$$
Then:
1. If a death occurs at time u, then dN(u) = 1, in which case there is a contribution to U(0) of $[z_{I(u)} - \bar z(u, 0)]$. Otherwise there is no contribution.

2. Since z = 1 for treatment 1 and z = 0 for treatment 0, $z_{I(u)}$ is then the number of deaths at time u from treatment 1.

3. Under $H_0\colon \beta = 0$, $\bar z(u, 0)$ simplifies to
$$\bar z(u, 0) = \frac{\sum_{l=1}^n z_l Y_l(u)}{\sum_{l=1}^n Y_l(u)},$$
which is the proportion of individuals in group 1 among those at risk at time u. Since we assume only one death at time u, this proportion is the expected number of deaths for treatment 1 among those at risk at time u, under the null hypothesis of no treatment difference.

4. Therefore, U(0) is the sum over the death times of the observed number of deaths from treatment 1 minus the expected number of deaths under the null hypothesis. This is the numerator of the two-sample logrank test:
$$\sum_u \left[dN_1(u) - \frac{Y_1(u)}{Y(u)}dN(u)\right],$$
where $dN_1(u)$ = # of observed deaths from treatment 1, $Y_1(u)$ = # at risk at time u from treatment 1, Y(u) = total # at risk at time u from the 2 treatments, and dN(u) = total # of deaths from the 2 treatments.
5. The denominator of the score test is computed as
$$J^{1/2}(0) = \left[\sum_u dN(u)V_z(u, 0)\right]^{1/2},$$
where
$$V_z(u, 0) = \frac{\sum_l [z_l - \bar z(u, 0)]^2 Y_l(u)}{\sum_l Y_l(u)}.$$

Note: Among the Y(u) individuals at risk at time u, there are $Y_1(u)$ individuals with $z_l = 1$ and $Y_0(u)$ individuals with $z_l = 0$. We already argued that
$$\bar z(u, 0) = \frac{Y_1(u)}{Y(u)}.$$
Therefore,
$$V_z(u, 0) = \frac{\sum_l [z_l - \bar z(u, 0)]^2 Y_l(u)}{\sum_l Y_l(u)} = \frac{\left(1 - \frac{Y_1(u)}{Y(u)}\right)^2 Y_1(u) + \left(0 - \frac{Y_1(u)}{Y(u)}\right)^2 Y_0(u)}{Y(u)} \quad (z_l \text{ takes 1 or 0})$$
$$= \frac{\frac{Y_0^2(u)Y_1(u)}{Y^2(u)} + \frac{Y_1^2(u)Y_0(u)}{Y^2(u)}}{Y(u)} = \frac{Y_0(u)Y_1(u)Y(u)}{Y^3(u)} = \frac{Y_0(u)Y_1(u)}{Y^2(u)} \quad (Y_1(u) + Y_0(u) = Y(u)).$$
Therefore,
$$J(0) = \sum_u dN(u)\frac{Y_0(u)Y_1(u)}{Y^2(u)}.$$
Let us contrast this with the variance used to compute the logrank test statistic:
$$\sum_u \left\{\frac{Y_1(u)Y_0(u)dN(u)[Y(u) - dN(u)]}{Y^2(u)[Y(u) - 1]}\right\}.$$
Note: In the special case where dN(u) can only be one or zero, the above expression reduces to
$$\sum_u \left\{\frac{Y_1(u)Y_0(u)dN(u)[Y(u) - 1]}{Y^2(u)[Y(u) - 1]}\right\} = \sum_u \left\{\frac{Y_1(u)Y_0(u)dN(u)}{Y^2(u)}\right\},$$
which is exactly equal to J(0).
Therefore, we have demonstrated that, for continuous survival time data with no ties, the score test of the hypothesis $H_0\colon \beta = 0$ in the proportional hazards model is exactly the same as the logrank test for a dichotomous covariate z.
The score test
$$\left[\frac{U(0)}{J^{1/2}(0)}\right]^2$$
can be used to test the hypothesis $H_0\colon \beta = 0$ for the model
$$\lambda(t \mid z) = \lambda_0(t)e^{z\beta}$$
for any covariate z, whether z is discrete or continuous. The null hypothesis $H_0\colon \beta = 0$ implies that the hazard rate at any time t is unaffected by the covariate z. This also implies that the survival distribution does not depend on z. The alternative hypothesis $H_A\colon \beta \ne 0$ implies that the hazard rate increases or decreases (depending on the sign of $\beta$) as z increases, throughout all time. Therefore, belief in this alternative hypothesis would mean that individuals with a higher value of z have a stochastically larger (or smaller, depending on the sign of $\beta$) survival distribution than those individuals with smaller values of z. The test command in Proc Lifetest computes the score test of the hypothesis $H_0\colon \beta = 0$ for the proportional hazards model. Consequently, when using the test command, the covariate z is not limited to being dichotomous, nor discrete.
For example, we can test the treatment difference between treatments 1 and 2 for the myelomatosis data using the following SAS command:
proc lifetest data=myel;
time dur*status(0);
test trt;
run;
and part of the output is presented in the following:
Univariate Chi-Squares for the LOG RANK Test
Test Standard Pr >
Variable Statistic Deviation Chi-Square Chi-Square
TRT -2.3376 2.0522 1.2975 0.2547
Covariance Matrix for the LOG RANK Statistics
Variable TRT
TRT 4.21151
Forward Stepwise Sequence of Chi-Squares for the LOG RANK Test
Pr > Chi-Square Pr >
Variable DF Chi-Square Chi-Square Increment Increment
TRT 1 1.2975 0.2547 1.2975 0.2547
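The same equivalence can be checked in R, assuming the myelomatosis data are in a data frame myel with variables dur, status (1 = death) and trt1 (an illustrative sketch):

library(survival)
survdiff(Surv(dur, status) ~ trt1, data = myel)   # two-sample logrank test
summary(coxph(Surv(dur, status) ~ trt1, data = myel))$sctest
# with no tied death times, the score chi-square equals the logrank chi-square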
Likelihood Ratio Test

As in ordinary likelihood theory, the (partial) likelihood ratio test can also be used to test the null hypothesis
$$H_0\colon \beta = \beta_0.$$
Recall that $\ell(\beta)$ is the log partial likelihood. Intuitively, if $H_0$ is true, then $\hat\beta$, the MPLE of $\beta$, should be close to $\beta_0$, and hence $\ell(\hat\beta)$ should be close to $\ell(\beta_0)$. Since $\ell(\hat\beta) - \ell(\beta_0)$ is always non-negative, we should reject $H_0$ when this difference is large.

The likelihood ratio test uses the fact that
$$2\left[\ell(\hat\beta) - \ell(\beta_0)\right] \sim_a \chi^2_1 \quad \text{under } H_0\colon \beta = \beta_0.$$
Therefore, for a given level of significance $\alpha$, we reject $H_0\colon \beta = \beta_0$ if
$$2\left[\ell(\hat\beta) - \ell(\beta_0)\right] \ge \chi^2_{1,\alpha},$$
where $\chi^2_{1,\alpha}$ is the value such that $P[\chi^2_1 > \chi^2_{1,\alpha}] = \alpha$.

Expanding $\ell(\beta_0)$ at the MPLE $\hat\beta$, we get
$$\ell(\beta_0) \approx \ell(\hat\beta) + \frac{d\ell(\hat\beta)}{d\beta}(\beta_0 - \hat\beta) + \frac{1}{2!}\frac{d^2\ell(\hat\beta)}{d\beta^2}(\beta_0 - \hat\beta)^2.$$
Since the MPLE $\hat\beta$ maximizes $\ell(\beta)$, i.e.,
$$U(\hat\beta) = \frac{d\ell(\hat\beta)}{d\beta} = 0,$$
and
$$\frac{d^2\ell(\hat\beta)}{d\beta^2} = -J(\hat\beta),$$
we have
$$2\left[\ell(\hat\beta) - \ell(\beta_0)\right] \approx J(\hat\beta)(\hat\beta - \beta_0)^2.$$
We already derived that
$$(\hat\beta - \beta_0) \sim_a N(0, J^{-1}(\hat\beta)).$$
Therefore,
$$2\left[\ell(\hat\beta) - \ell(\beta_0)\right] \approx J(\hat\beta)(\hat\beta - \beta_0)^2 = \left[\frac{\hat\beta - \beta_0}{J^{-1/2}(\hat\beta)}\right]^2 \sim_a \chi^2_1 \quad \text{under } H_0\colon \beta = \beta_0.$$
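In R, the likelihood ratio statistic for $H_0\colon \beta = 0$ can be computed from the fitted model's log partial likelihoods (a sketch, reusing the hypothetical myel data frame):

fit <- coxph(Surv(dur, status) ~ trt1, data = myel)
# fit$loglik = (log PL at beta = 0, log PL at the MPLE)
lrt <- 2 * (fit$loglik[2] - fit$loglik[1])
pchisq(lrt, df = 1, lower.tail = FALSE)   # p-value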
Note: The SAS procedure Phreg can ONLY handle right censored data.
7 Cox Proportional Hazards Regression Models (cont'd)

7.1 Handling Tied Data in Proportional Hazards Models
So far we have assumed that there are no tied observed survival times in our data when we construct the partial likelihood function for the proportional hazards model. However, in practice, it is quite common for our data to contain tied survival times, for obvious reasons. Therefore, we need a different technique to construct the partial likelihood in the presence of tied data. Throughout this subsection, we will work with the following super simple example:

Patient    x       $\delta$    z
1          $x_1$      1        $z_1$
2          $x_2$      1        $z_2$
3          $x_3$      0        $z_3$
4          $x_4$      1        $z_4$
5          $x_5$      1        $z_5$

where $x_1 = x_2 < x_3 < x_4 < x_5$. So the first two patients have tied survival times. We assume the following proportional hazards model:
the following proportional hazards model
(t|z
i
) =
0
(t)exp(z
i
)
Since there are 3 distinct survival times (i.e, x
1
, x
4
, x
5
) in this data set, intuitively, the partial
likelihood function of will take the following form
L() = L
1
()L
2
()L
3
(),
where L
j
() is the component in the partial likelihood corresponding to the jth distinct survival
time. Since the second and third survival times x
4
and x
5
are distinct, L
2
() and L
3
() can be
constructed in the usual way. So we will focus on the construction of L
1
(). In fact,
L
2
() =
e
z
4

e
z
4

+ e
z
5

, and L
3
() = 1.
We will discuss 4 methods that are implemented in SAS.
1. The Exact Method: This method assumes that the survival time has a continuous distribution and that the true survival times of patients 1 and 2 are different. These two patients have the same survival times in our data because our measurement does not have enough accuracy, or the original data were rounded for convenience and this information got lost, etc.
Without any knowledge of the true ordering of the survival times of patients 1 and 2, we have to consider all possible orderings. There are 2! = 2 possible orderings. Let $A_1$ denote the event that patient 1 died before patient 2, and $A_2$ the event that patient 2 died before patient 1. Then, by the law of total probability, we have
$$L_1(\beta) = P[\text{observe two deaths at } x_1] = P[A_1 \cup A_2] = P[A_1] + P[A_2],$$
and $P[A_1]$, $P[A_2]$ are given in the usual way:
$$P[A_1] = \frac{e^{z_1\beta}}{e^{z_1\beta} + e^{z_2\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}} \times \frac{e^{z_2\beta}}{e^{z_2\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}}$$
$$P[A_2] = \frac{e^{z_2\beta}}{e^{z_1\beta} + e^{z_2\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}} \times \frac{e^{z_1\beta}}{e^{z_1\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}}.$$
After the partial likelihood $L(\beta)$ is constructed, the inference on $\beta$ is exactly the same as in the case where there are no tied survival times (a tie between a survival time and a censoring time has no effect on the partial likelihood construction). Specifically, we maximize the new partial likelihood $L(\beta)$ to obtain the MPLE of $\beta$, and use the inverse of the negative second derivative of the log partial likelihood to estimate the variability of the MPLE of $\beta$. We can also perform the score test and the likelihood ratio test.

The exact method is implemented in Proc Phreg in SAS. Suppose in our data set mydata we use time to denote the (censored) survival times, with cens the censoring indicator and z the covariate; then the PH model can be fit with the exact method using the following SAS code:
Proc Phreg data=mydata;
model time*cens(0) = z / ties=exact;
run;
Of course, the exact method yields the optimal estimate of $\beta$. However, this method can be computationally intensive. For example, suppose there are $d_j$ tied survival times at the jth distinct death time; then $d_j!$ different orderings have to be considered, and $L_j(\beta)$ is the sum of $d_j!$ different terms, each of which is a product of $d_j$ conditional probabilities. This number could be very large: for example, when $d_j = 5$, then $d_j \times d_j! = 5 \times 5! = 600$ different terms have to be calculated to get $L_j(\beta)$. Because of these computational difficulties, two methods have been proposed to approximate the exact partial likelihood.
2. Breslow's Approximation (default in Proc Phreg): Obviously, we can make the following approximations for our example:
$$\frac{e^{z_2\beta}}{e^{z_2\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}} \approx \frac{e^{z_2\beta}}{e^{z_1\beta} + e^{z_2\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}}$$
$$\frac{e^{z_1\beta}}{e^{z_1\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}} \approx \frac{e^{z_1\beta}}{e^{z_1\beta} + e^{z_2\beta} + e^{z_3\beta} + e^{z_4\beta} + e^{z_5\beta}}.$$
Therefore both $P[A_1]$ and $P[A_2]$, and hence $L_1(\beta)$, can be approximated (up to the constant factor 2, which is free of $\beta$ and can be dropped) by
$$\frac{e^{z_1\beta}}{\sum_{l=1}^5 e^{z_l\beta}} \times \frac{e^{z_2\beta}}{\sum_{l=1}^5 e^{z_l\beta}} = \frac{e^{(z_1+z_2)\beta}}{\left[\sum_{l=1}^5 e^{z_l\beta}\right]^2}.$$
In general, if there are $d_j$ tied survival times at the jth distinct death time, then $L_j(\beta)$ is approximated by
$$L_j(\beta) \approx \frac{\exp\left(\sum_{l \in D_j} z_l\beta\right)}{\left[\sum_{l \in R_j} \exp(z_l\beta)\right]^{d_j}},$$
where $R_j$ is the risk set at the jth distinct death time and $D_j$ is the event (death) set at the jth distinct death time. So the partial likelihood of $\beta$ is
$$L(\beta) = \prod_{j=1}^{D} L_j(\beta) \approx \prod_{j=1}^{D} \frac{\exp\left(\sum_{l \in D_j} z_l\beta\right)}{\left[\sum_{l \in R_j} \exp(z_l\beta)\right]^{d_j}},$$
where D is the total number of distinct death times. This approximation was proposed by Breslow (1974) and is the default in Proc Phreg of SAS.

Obviously, if at each distinct death time the number of events (failures) $d_j$ is small and/or the number of patients at risk $n_j$ is large (so that the ratio $d_j/n_j$ is small), then Breslow's approximation should work well (the approximate partial likelihood should be very close to the exact partial likelihood). However, if these conditions are not satisfied, the approximation can be poor. Therefore Efron (1977) suggested another approximation.
3. Efron's Approximation: For our example, write $a = \sum_{l=1}^5 e^{z_l\beta}$, $b = e^{z_1\beta}$ and $c = e^{z_2\beta}$. Then $L_1(\beta)$ from the exact method can be written as
$$L_1(\beta) = \frac{bc}{a(a - b)} + \frac{bc}{a(a - c)},$$
which can be approximated by
$$L_1(\beta) = \frac{2bc}{a\left(a - (b + c)/2\right)}.$$
This motivates the general approximation:
$$L_1(\beta) = \frac{e^{\sum_{l \in D_1} z_l\beta}}{\prod_{j=1}^{d_1}\left[\sum_{l \in R_1} e^{z_l\beta} - \frac{j-1}{d_1}\sum_{l \in D_1} e^{z_l\beta}\right]}.$$
We can specify the option ties=efron in Proc Phreg for this approximation.
4. Discrete Method: This method does not assume that there is an underlying ordering of the tied survival times. Instead, time is assumed to be discrete, which may arise in some applications. For example, suppose we are interested in studying the number of times we drop a dish before it breaks. In this case, we consider the following model: for any death time t, let
$$\pi_{it} = P[\text{subject } i \text{ will die at } t \mid \text{subject } i \text{ survives up to } t],$$
and assume the following proportional odds model (a logistic regression with time-varying intercepts)
$$\log\left(\frac{\pi_{it}}{1 - \pi_{it}}\right) = \alpha_t + z_i\beta,$$
where the $\alpha_t$'s are nuisance parameters and $\beta$ is the parameter of interest (a treatment effect, for example). In this case, $L_1(\beta)$ can be interpreted as
$$L_1(\beta) = P[\text{deaths occurred to subjects 1 and 2} \mid \text{there are 2 deaths out of 5 subjects}].$$
It can be shown that the above probability is equal to
$$L_1(\beta) = \frac{e^{(z_1+z_2)\beta}}{\sum_{\text{all } D_j} e^{s_j\beta}},$$
where the $D_j$'s are the $\binom{5}{2} = 10$ possible combinations of 2 subjects out of the 5, and $s_j$ is the sum of the covariate values of the 2 subjects in combination $D_j$.

Obviously, the model considered here is not a proportional hazards model. However, when there are no tied observations in the data set, the resulting likelihood is exactly the same as the Cox partial likelihood. This is the main reason that the discrete method is included in Proc Phreg.

Note that the conditional logistic model is a special case of this model, so Proc Phreg can be used to fit conditional logistic models. Also note that this method can be even more computationally intensive than, say, the exact method.
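For reference, the corresponding choices in R's coxph() are given by its ties argument; note, as a caution, that R's ties = "exact" fits the discrete (conditional logistic) likelihood of method 4, not the exact continuous-time method of SAS's TIES=EXACT (a hedged sketch with hypothetical data set and variable names):

coxph(Surv(time, cens) ~ z, data = mydata, ties = "breslow")
coxph(Surv(time, cens) ~ z, data = mydata, ties = "efron")   # R's default
coxph(Surv(time, cens) ~ z, data = mydata, ties = "exact")   # discrete method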
7.2 Multiple Covariates

The real strength of the proportional hazards model is that it allows us to model the relationship of survival time, through its hazard function, to many covariates simultaneously:
$$\lambda(t \mid z) = \lambda_0(t)e^{z_1\beta_1 + \cdots + z_p\beta_p} = \lambda_0(t)e^{z^T\beta},$$
where z is a $(p \times 1)$ vector of covariates and $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $(p \times 1)$ vector of regression coefficients.
Estimation of $\beta$ is exactly parallel to the case of one covariate. The partial likelihood of $\beta$ is given by
$$PL(\beta) = \prod_{\text{all grid pts } u} \left[\frac{\exp(z_{i(u)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)}\right]^{dN(u)},$$
and the log partial likelihood of $\beta$ is
$$\ell(\beta) = \sum_{\text{all grid pts } u} dN(u)\left[z_{I(u)}^T\beta - \log\left(\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)\right)\right].$$
Note: $z_l$ is the covariate vector for the lth individual; i.e., $z_l = (z_{l1}, \ldots, z_{lp})^T$.
The maximum partial likelihood estimate $\hat\beta$ (MPLE) of $\beta$ is obtained by maximizing $\ell(\beta)$, i.e., by setting the score vector to zero:
$$U(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = 0,$$
where
$$\frac{\partial \ell(\beta)}{\partial \beta} = \left[\frac{\partial \ell(\beta)}{\partial \beta_1}, \ldots, \frac{\partial \ell(\beta)}{\partial \beta_p}\right]^T.$$
Similar to the previous chapter, we have
$$\frac{\partial \ell(\beta)}{\partial \beta_j} = \sum_u dN(u)\left[z_{I(u)j} - \bar z_j(u, \beta)\right],$$
where $z_{I(u)j}$ denotes the jth element of the covariate vector for the individual I(u) who died at time u, and
$$\bar z_j(u, \beta) = \frac{\sum_{l=1}^n z_{lj}\exp(z_l^T\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)} = \sum_{l=1}^n z_{lj}w_l, \qquad w_l = \frac{\exp(z_l^T\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)},$$
is the weighted average of the jth element of the covariate vector among the individuals at risk at time u.

If we denote
$$Z_{I(u)}^{p \times 1} = \left[\begin{array}{c} z_{I(u)1} \\ \vdots \\ z_{I(u)p} \end{array}\right], \qquad \bar Z^{p \times 1}(u, \beta) = \left[\begin{array}{c} \bar z_1(u, \beta) \\ \vdots \\ \bar z_p(u, \beta) \end{array}\right],$$
then the partial likelihood equation can be expressed as
$$U(\beta) = \sum_u dN(u)\left[Z_{I(u)}^{p \times 1} - \bar Z^{p \times 1}(u, \beta)\right] = 0^{p \times 1}.$$
In order for the partial likelihood equation to have a unique solution, it is sufficient that the Hessian matrix H be negative definite:
$$a^T H a < 0 \quad \text{for all } a^{p \times 1} \ne 0,$$
where
$$H = \frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T} = \left[\frac{\partial^2 \ell(\beta)}{\partial \beta_j \partial \beta_{j'}}\right]_{p \times p}.$$
Equivalently,
$$J(\beta) = -\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T}$$
is positive definite.
It can be easily shown that the $(j, j')$th element of $J(\beta)$ is
$$J_{j,j'} = \sum_u dN(u)\left[\frac{\sum_{l=1}^n z_{lj}z_{lj'}\exp(z_l^T\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)} - \bar z_j(u, \beta)\,\bar z_{j'}(u, \beta)\right]$$
$$= \sum_u dN(u)\left[\frac{\sum_{l=1}^n (z_{lj} - \bar z_j(u, \beta))(z_{lj'} - \bar z_{j'}(u, \beta))\exp(z_l^T\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)}\right] = \sum_u dN(u)V_{j,j'}(u, \beta),$$
where $V_{j,j'}(u, \beta)$ is the weighted sample covariance between the jth and j'th elements of the covariate vector among individuals at risk at time u, with weights
$$w_l = \frac{\exp(z_l^T\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)}.$$
If we denote the weighted $p \times p$ covariance matrix of the covariate vector among individuals at risk at time u as
$$V(u, \beta) = \left[\begin{array}{ccc} V_{11}(u, \beta) & \cdots & V_{1p}(u, \beta) \\ \vdots & \ddots & \vdots \\ V_{p1}(u, \beta) & \cdots & V_{pp}(u, \beta) \end{array}\right],$$
then the information matrix is
$$J^{p \times p}(\beta) = \sum_u dN(u)V(u, \beta).$$
Note: In matrix notation, $V(u, \beta)$ can be expressed as
$$V(u, \beta) = \frac{\sum_{l=1}^n (z_l - \bar z(u, \beta))(z_l - \bar z(u, \beta))^T \exp(z_l^T\beta)Y_l(u)}{\sum_{l=1}^n \exp(z_l^T\beta)Y_l(u)} = \sum_{l=1}^n w_l (z_l - \bar z(u, \beta))(z_l - \bar z(u, \beta))^T,$$
which is a weighted variance matrix of the covariate vectors among the individuals at risk at time u. Thus $V(u, \beta)$ is positive definite, and therefore the information matrix
$$J^{p \times p}(\beta) = \sum_u dN(u)V(u, \beta)$$
is also a positive definite matrix. So the Hessian matrix $H = -J^{p \times p}(\beta)$ is negative definite. This implies that the log partial likelihood is a concave function of $\beta$ and hence has a unique maximum, which can be obtained by setting the first derivative of the log partial likelihood, i.e., the score $U(\beta)$, to zero.
Statistical properties associated with the partial likelihood, the score vector, and the MPLE for multi-parameter problems (i.e., a vector of covariates) can also be generalized from the one-parameter case.

Namely, the score vector $U(\beta_0)$ evaluated at the true value of $\beta$ will be asymptotically distributed as a multivariate normal with mean vector zero and a covariance matrix which can be estimated unbiasedly by $J(\beta_0)$. Write this fact as
$$U(\beta_0) \sim_a N(0, J(\beta_0)).$$
The MPLE $\hat\beta$ will also be asymptotically normal:
$$\hat\beta \sim_a N(\beta_0, J^{-1}(\beta_0)),$$
where $J^{-1}(\beta_0)$ is the inverse of $J(\beta_0)$. Since $J(\beta_0)$ is positive definite, its unique inverse exists and is also positive definite.
When we use a model with a vector of parameters, we are often interested in making inferential statements about the entire vector simultaneously, or about part of the vector. Towards this end, let us partition the parameter vector into two parts: $\beta = (\theta^T, \eta^T)^T$, where $\theta$ is a $g\ (\le p)$ dimensional vector.

We shall refer to $\theta$ as the parameter of interest and call $\eta$ the nuisance parameter. Of course, the parameter of interest can be the entire parameter vector $\beta$.

Correspondingly, the score vector is partitioned as
$$U(\theta, \eta) = \left[\begin{array}{c} U_\theta(\theta, \eta) \\ U_\eta(\theta, \eta) \end{array}\right],$$
where
$$U_\theta(\theta, \eta) = \frac{\partial \ell(\theta, \eta)}{\partial \theta}, \qquad U_\eta(\theta, \eta) = \frac{\partial \ell(\theta, \eta)}{\partial \eta}.$$
The partial likelihood information matrix can also be partitioned into
$$J(\beta) = \left[\begin{array}{cc} J_{\theta\theta}(\theta, \eta) & J_{\theta\eta}(\theta, \eta) \\ J_{\eta\theta}(\theta, \eta) & J_{\eta\eta}(\theta, \eta) \end{array}\right]$$
and its inverse into
$$J^{-1}(\beta) = \left[\begin{array}{cc} J^{\theta\theta}(\theta, \eta) & J^{\theta\eta}(\theta, \eta) \\ J^{\eta\theta}(\theta, \eta) & J^{\eta\eta}(\theta, \eta) \end{array}\right].$$
Note: Here we use superscript notation to index the partitions of an inverse matrix and subscript notation to index the original matrix.

With this notation, the distributional statement
$$\hat\beta \sim_a N(\beta_0, J^{-1}(\hat\beta))$$
is equivalent to
$$\left[\begin{array}{c} \hat\theta \\ \hat\eta \end{array}\right] \sim_a N\left(\left[\begin{array}{c} \theta_0 \\ \eta_0 \end{array}\right], \left[\begin{array}{cc} J^{\theta\theta}(\hat\theta, \hat\eta) & J^{\theta\eta}(\hat\theta, \hat\eta) \\ J^{\eta\theta}(\hat\theta, \hat\eta) & J^{\eta\eta}(\hat\theta, \hat\eta) \end{array}\right]\right).$$
Therefore, $\hat\theta$ has the asymptotic distribution
$$\hat\theta \sim_a N(\theta_0, J^{\theta\theta}(\hat\theta, \hat\eta)).$$
If $\theta$, say $\theta = \beta_j$, is one-dimensional, then $J^{\theta\theta}(\hat\theta, \hat\eta)$ is also one-dimensional. In this case, $\hat\theta = \hat\beta_j$, $\theta_0 = \beta_{j0}$, and
$$J^{\theta\theta}(\hat\theta, \hat\eta) = \left[\mathrm{se}(\hat\beta_j)\right]^2.$$
Using this notation, we can find a confidence region for the parameter of interest $\theta$.

Since
$$\hat\theta \sim_a N(\theta_0, J^{\theta\theta}(\hat\theta, \hat\eta)),$$
which is equivalent to
$$(\hat\theta - \theta_0) \sim_a N(0, J^{\theta\theta}(\hat\theta, \hat\eta)),$$
we have
$$(\hat\theta - \theta_0)^T \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1} (\hat\theta - \theta_0) \sim_a \chi^2_g;$$
i.e., the quadratic form is distributed as a $\chi^2$ with g degrees of freedom.
Note: $\left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1}$ is the inverse of the corresponding partition of the inverse of the information matrix. In general,
$$\left[J^{\theta\theta}(\theta, \eta)\right]^{-1} \ne J_{\theta\theta}(\theta, \eta).$$
Let $\chi^2_{\alpha;g}$ be the $(1-\alpha)$ quantile of a $\chi^2$ with g degrees of freedom, i.e.,
$$P[\chi^2_g \ge \chi^2_{\alpha,g}] = \alpha.$$
Then
$$P\left[(\hat\theta - \theta)^T \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1} (\hat\theta - \theta) \ge \chi^2_{\alpha,g}\right] = \alpha,$$
or equivalently,
$$P\left[(\hat\theta - \theta)^T \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1} (\hat\theta - \theta) \le \chi^2_{\alpha,g}\right] = 1 - \alpha.$$
For a given data set, the inequality
$$(\hat\theta - \theta)^T \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1} (\hat\theta - \theta) \le \chi^2_{\alpha,g}$$
describes a g-dimensional ellipsoid centered at $\hat\theta$ whose orientation is dictated by the eigenvalues and eigenvectors of $\left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1}$. The interior of such an ellipsoid is the $(1-\alpha)$th confidence region for $\theta$.
Note: If $\theta$ is one-dimensional, then this confidence region simplifies to an interval. In fact, if $\theta = \beta_j$ (one element of $\beta$), then the $(1-\alpha)$th confidence interval for $\theta$, or $\beta_j$, would be
$$\hat\beta_j \pm z_{\alpha/2}\,\mathrm{se}(\hat\beta_j),$$
where
$$\mathrm{se}(\hat\beta_j) = \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{1/2}.$$
Generalization of Wald, Score and Likelihood ratio tests
Wald Test: We are interested in testing the null hypothesis
$$H_0\colon \theta = \theta_0.$$
Under $H_0$, we have
$$(\hat\theta - \theta_0)^T \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1} (\hat\theta - \theta_0) \sim_a \chi^2_g.$$
If the null hypothesis $H_0$ were not true, we would expect the above quadratic form to get larger, since $\hat\theta$ would not be close to $\theta_0$. This suggests that we reject $H_0\colon \theta = \theta_0$ at the $\alpha$ level of significance if
$$(\hat\theta - \theta_0)^T \left[J^{\theta\theta}(\hat\theta, \hat\eta)\right]^{-1} (\hat\theta - \theta_0) \ge \chi^2_{\alpha;g}.$$
This is the Wald test.
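A sketch of this Wald test in R for a subset of coefficients of a fitted coxph model (the function and the index vector idx are our own notation):

wald_subset <- function(fit, idx) {
  theta <- coef(fit)[idx]                  # theta-hat
  V <- vcov(fit)[idx, idx, drop = FALSE]   # estimate of J^{theta theta}
  stat <- as.numeric(t(theta) %*% solve(V) %*% theta)
  c(chisq = stat, df = length(idx),
    p.value = pchisq(stat, length(idx), lower.tail = FALSE))
}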
Score Test: Before we can describe the score test and the likelihood ratio test for the hypothesis $H_0\colon \theta = \theta_0$, we must first define the notion of a restricted maximum partial likelihood estimator. Since interest is focused on the parameter of interest $\theta$, our null hypothesis specifies a specific value of $\theta$ that we wish to entertain. Nothing, however, is assumed about the nuisance parameter $\eta$. Therefore, even under the null hypothesis, an estimate of $\eta$ will be necessary in order to derive tests as a function of the data.

An obvious estimator for $\eta$, if we assume the null hypothesis to be true, is obtained by maximizing the log partial likelihood as a function of $\eta$, keeping $\theta$ fixed at the hypothesized value $\theta_0$. This is referred to as a restricted MPLE and will be denoted by $\hat\eta(\theta_0)$. That is, $\hat\eta(\theta_0)$ is the value of $\eta$ which maximizes the function $\ell(\theta_0, \eta)$. This restricted MPLE can be obtained by solving the $(p - g)$ equations in $(p - g)$ unknowns
$$U_\eta(\theta_0, \hat\eta(\theta_0)) = 0,$$
using the $(p - g)$-dimensional subset of the score vector corresponding to the partial derivatives of the log partial likelihood with respect to the nuisance parameters.

The score test of the hypothesis $H_0\colon \theta = \theta_0$ is based on the score vector
$$U_\theta(\theta_0, \hat\eta(\theta_0)).$$
It can be shown (the proof is omitted here) that if $H_0\colon \theta = \theta_0$ is true, then this score vector with respect to the parameters of interest is asymptotically multivariate normal with mean zero and a covariance matrix that can be estimated by $\left[J^{\theta\theta}(\theta_0, \hat\eta(\theta_0))\right]^{-1}$. That is,
$$U_\theta(\theta_0, \hat\eta(\theta_0)) \sim_a N\left(0, \left[J^{\theta\theta}(\theta_0, \hat\eta(\theta_0))\right]^{-1}\right).$$
If the null hypothesis were not true, we would expect the score vector above (evaluated at $\theta_0$) to have mean different from zero. This suggests rejecting $H_0$ whenever the quadratic form
$$\left[U_\theta(\theta_0, \hat\eta(\theta_0))\right]^T \left[J^{\theta\theta}(\theta_0, \hat\eta(\theta_0))\right] \left[U_\theta(\theta_0, \hat\eta(\theta_0))\right]$$
is sufficiently large.

This quadratic form is computed with respect to the inverse of the covariance matrix. Therefore, under $H_0$, the distribution of the quadratic form is chi-square with g degrees of freedom.

Thus a level $\alpha$ score test of the hypothesis $H_0\colon \theta = \theta_0$ rejects $H_0$ whenever
$$\left[U_\theta(\theta_0, \hat\eta(\theta_0))\right]^T \left[J^{\theta\theta}(\theta_0, \hat\eta(\theta_0))\right] \left[U_\theta(\theta_0, \hat\eta(\theta_0))\right] \ge \chi^2_{\alpha;g}.$$
Likelihood ratio test: We define the MPLE for $\beta$, or equivalently $(\theta, \eta)$, as the value of $(\theta, \eta)$ that maximizes the log partial likelihood $\ell(\theta, \eta)$. We denote this estimate by $\hat\beta$, or $(\hat\theta, \hat\eta)$. We also defined the restricted MPLE $\hat\eta(\theta_0)$ as the value of $\eta$ that maximizes the function $\ell(\theta_0, \eta)$.

It must be the case that, for any set of data, $\ell(\hat\theta, \hat\eta)$ is greater than or equal to $\ell(\theta_0, \hat\eta(\theta_0))$, since $\ell(\hat\theta, \hat\eta)$ is maximized over a larger parameter space. We would expect, however, that if $H_0$ were true, $\hat\theta$ would be close to $\theta_0$ and consequently $\ell(\hat\theta, \hat\eta)$ would be close to $\ell(\theta_0, \hat\eta(\theta_0))$. It is therefore reasonable to suspect that $H_0$ is not true if the difference
$$\ell(\hat\theta, \hat\eta) - \ell(\theta_0, \hat\eta(\theta_0))$$
is sufficiently large.

Under $H_0$,
$$2\left[\ell(\hat\theta, \hat\eta) - \ell(\theta_0, \hat\eta(\theta_0))\right] \sim_a \chi^2_g.$$
Therefore, the likelihood ratio test rejects $H_0$ at level $\alpha$ whenever
$$2\left[\ell(\hat\theta, \hat\eta) - \ell(\theta_0, \hat\eta(\theta_0))\right] \ge \chi^2_{\alpha;g}.$$
Models with multiple covariates: When studying the relationship of survival to a potential factor, we may wish to adjust for the effect of other variables. For example, if we wish to study the relationship of alcohol drinking to survival in an observational study, we may be concerned that alcohol drinking is correlated with smoking. Thus, if we don't adjust for the effect of smoking, then what may seem an apparent relationship between survival and drinking may really be an artifact of the effect of smoking on survival, which is being confounded with drinking.

In epidemiology, if our interest is the relationship of survival to drinking, we would say that smoking is a confounding variable. That is, smoking is a prognostic factor (i.e., is related to survival), and smoking is correlated with drinking.

Even in controlled studies, i.e., randomized clinical trials, we may wish to adjust for other variables. Such adjusted analyses often lead to more precise estimates of the effect of interest and greater power to detect differences.

In some cases, enforced balance of certain prognostic factors by treatment necessitates adjusted analyses.

The proportional hazards model with multiple covariates is ideal for such purposes. By including both the variable of interest and other variables (which may be confounders, or other variables we wish to adjust for), we obtain the relationship of the variable of interest to survival while adjusting for the effect of the other covariates.
Cautionary Remark: All of the above statements are based on the premise that the models being considered are adequate representations of the distribution of the data. So, for example, if proportional hazards is not a good model for the relationship of survival to the covariates, the results derived from such a model may be misleading.
Example: Let S denote the smoking indicator (1 = smoker, 0 = nonsmoker) and D denote the drinking indicator (1 = drinker, 0 = nondrinker). If we were to study the effect of drinking on survival, we may identify a cohort of individuals, say, individuals enrolling into a health insurance program or HMO. At the time of enrollment certain information may be gathered, including age, sex, smoking and drinking status, for example. Using either information from the insurance company or a death register, we identify who has died, and when, as well as who is currently alive. That is, we obtain censored survival data.

Suppose we use the following proportional hazards model:
$$\lambda(t \mid D) = \lambda_0(t)\exp(\beta D).$$
As we know, the parameter $\beta$ is interpreted as the log hazard ratio between drinkers and nondrinkers (assumed constant over time t), and $\exp(\beta)$ as the hazard ratio.

Although this interpretation is correct, it may be causally misleading, as it does not adjust for potential confounding factors. Consequently, we may use the following proportional hazards model with multiple covariates:
$$\lambda(t \mid \cdot) = \lambda_0(t)\exp(\beta D + \eta_1 S + \eta_2 A + \eta_3 Sx),$$
where S = smoking status, A = age, and Sx = sex.
Here the parameter $\beta$ corresponds to the log hazard ratio for a drinker compared to a nondrinker with the same smoking, age and sex values; i.e., adjusted for smoking, age and sex. And $\exp(\beta)$ is the adjusted hazard ratio.

Note: Here $\beta$ is the parameter of interest and $\eta = (\eta_1, \eta_2, \eta_3)$ is the nuisance parameter.

Reminder: The hazard ratio above is
$$\frac{\lambda(t \mid D = 1, S = s, A = a, Sx = sx)}{\lambda(t \mid D = 0, S = s, A = a, Sx = sx)} = \frac{\lambda_0(t)\exp(\beta + s\eta_1 + a\eta_2 + sx\eta_3)}{\lambda_0(t)\exp(0 + s\eta_1 + a\eta_2 + sx\eta_3)} = \exp(\beta).$$
The data necessary to fit this model would be of the form
$$(x_i, \delta_i, d_i, s_i, a_i, sx_i), \quad i = 1, 2, \ldots, n.$$
The proportional hazards model
$$\lambda(t \mid \cdot) = \lambda_0(t)\exp(\beta D + \eta_1 S + \eta_2 A + \eta_3 Sx)$$
would be fit using Proc Phreg in SAS, using partial likelihood methods.
The output would yield the MPLE $(\hat\beta, \hat\eta_1, \hat\eta_2, \hat\eta_3)$ as well as the estimated standard errors. From these we would construct a $(1-\alpha)$ confidence interval for $\beta$:
$$\hat\beta \pm z_{\alpha/2}\,\mathrm{se}(\hat\beta).$$
We could also test the null hypothesis $H_0\colon \beta = 0$ using a Wald test, score test, or partial likelihood ratio test, with $\eta$ corresponding to the nuisance parameters.
A Real Example: We will discuss a dataset on breast cancer (CALGB 8082). The data set has the following variables:

Menopausal status (0 = pre menopausal, 1 = post menopausal)
Tumor size (largest dimension of tumor in cm)
Number of positive nodes
Estrogen receptor status (0 = negative, 1 = positive)

The primary purpose of this study is to evaluate a certain treatment for breast cancer, adjusting for the above prognostic factors.

Note: After adjusting for the other covariates, the estimate of the treatment effect yielded a parameter estimate of 0.021 with an estimated standard error of 0.101.

Let Rx denote treatment, MS menopausal status, TS tumor size, NN number of positive nodes, and ER estrogen receptor status. If our interest is the effect of treatment on survival adjusting for the other covariates, we write our model as
$$\lambda(t \mid \cdot) = \lambda_0(t)\exp(\beta Rx + \eta_1 MS + \eta_2 TS + \eta_3 NN + \eta_4 ER),$$
$$\hat\beta = 0.021, \qquad \mathrm{se}(\hat\beta) = 0.101,$$
and a 95% confidence interval for $\beta$ is
$$\hat\beta \pm 1.96\,\mathrm{se}(\hat\beta) = 0.021 \pm 1.96 \times 0.101 = [-0.177, 0.219].$$
The estimate of the adjusted treatment hazard ratio is
$$\exp(\hat\beta) = \exp(0.021) = 1.021,$$
with a 95% CI of
$$[\exp(-0.177), \exp(0.219)] = [0.838, 1.245].$$
If we want to test the hypothesis $H_0\colon \beta = 0$, i.e., no treatment effect adjusting for the other covariates, we can use:

1. The Wald test:
$$\left[\frac{\hat\beta}{\mathrm{se}(\hat\beta)}\right]^2 = \left[\frac{0.021}{0.101}\right]^2 = 0.042,$$
with p-value = 0.838.

2. The likelihood ratio test:
$$2\left[\ell(\hat\beta, \hat\eta) - \ell(0, \hat\eta(0))\right] = 4739.727 - 4739.685 = 0.042,$$
with p-value = 0.838 (the two numbers are the $-2\log L$ values of the reduced and full models).

3. The score test: Proc Phreg will not automatically calculate the score test for $H_0\colon \beta = 0$ in the presence of nuisance parameters; see the program in the Appendix for the score test. The observed $\chi^2 = 0.042$, yielding the same p-value as the other two tests.
Now that we feel fairly confident that there is no treatment effect, suppose we decide to use these data to study the relationship of tumor size to survival. With respect to this question, these data can be viewed as an observational dataset. Let us consider the model
$$\lambda(t \mid \cdot) = \lambda_0(t)\exp(\beta\, TS).$$
The result of this model gives an estimate $\hat\beta = 0.042$, $\mathrm{se}(\hat\beta) = 0.019$. The Wald test for $H_0\colon \beta = 0$ is
$$\left[\frac{0.042}{0.019}\right]^2 = 4.75, \quad \text{p-value} = 0.029$$
(computed from the unrounded SAS estimates). The likelihood ratio test and score test yield similar conclusions; namely, there may be some prognostic effect of tumor size on survival.
Remark: A typical larger tumor size is about 7cm (roughly 2 standard deviations above the mean for this sample of patients). A typical smaller tumor size is about 1cm (the smallest tumor size is 0.1cm). Hence the relative risk (or hazard ratio) for a woman with tumor size 7cm as compared to a woman with tumor size 1cm is
$$\frac{\lambda_0(t)\exp(7\beta)}{\lambda_0(t)\exp(\beta)} = \exp(6\beta),$$
which is estimated to be
$$\exp(6\hat\beta) = \exp(6 \times 0.042) = 1.28.$$
A 95% CI for $\beta$ is
$$\hat\beta \pm 1.96\,\mathrm{se}(\hat\beta) = 0.042 \pm 1.96 \times 0.019 = [0.0048, 0.079].$$
Consequently, a 95% CI for the relative risk $\exp(6\beta)$ is
$$[\exp(6 \times 0.0048), \exp(6 \times 0.079)] = [1.029, 1.606].$$
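These numbers are easily reproduced; a small R check using the rounded values above:

beta.hat <- 0.042; se <- 0.019
ci <- beta.hat + c(-1, 1) * 1.96 * se   # [0.0048, 0.079]
exp(6 * beta.hat)                       # about 1.28
exp(6 * ci)                             # about [1.03, 1.61]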
It may be, however, that the effect of tumor size is confounded with other covariates. To study this, we consider the model
$$\lambda(t \mid \cdot) = \lambda_0(t)\exp(\beta\, TS + \eta_1 MS + \eta_2 NN + \eta_3 ER).$$
From this model, we get $\hat\beta = 0.02$ and $\mathrm{se}(\hat\beta) = 0.019$. The corresponding estimate of the relative risk $\exp(6\beta)$ is now
$$\widehat{RR} = \exp(6\hat\beta) = 1.128,$$
and its 95% CI (adjusted for the other covariates) is
$$[0.902, 1.41].$$
Summary results for $\exp(6\beta)$:

                 Unadjusted (all available data)   Adjusted (all available data)
# of patients    n = 817                           n = 723
RR               1.28                              1.13
95% CI           [1.029, 1.606]                    [0.902, 1.41]
Wald test        4.75 (p-val = 0.03)               1.14 (p-val = 0.29)
LR test          4.02                              1.03
Score test       4.65                              1.14
Remark: Unfortunately, in many clinical trials, not all the data are collected on all the individuals. Consequently, one or more variables may be missing per individual. In SAS the default code for missing data is a '.'. The way that SAS handles missing data is to delete an entire record if any of the variables being considered for a particular analysis is missing. Therefore, we must be careful when we consider analyses with different sub-models. For example, fewer records may be deleted when we consider one covariate as opposed to a model with that covariate and additional covariates.

This is especially the case when we consider the likelihood ratio test for nested models. We must make sure that the nested models being compared are fit on the same set of individuals. This might necessitate running a model on a subset of the data, where the subset corresponds to all data records with complete covariate information for the larger model (i.e., the model with the most covariates), as sketched below.
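A minimal R sketch of this complete-case restriction, using the variable names of the CALGB example below (the subsetting itself is our own illustration):

vars <- c("meno", "tsize", "nodes", "er")
bcancer1 <- bcancer[complete.cases(bcancer[, vars]), ]   # the 723 complete records
fit.full    <- coxph(Surv(days, cens) ~ tsize + meno + nodes + er, data = bcancer1)
fit.reduced <- coxph(Surv(days, cens) ~ meno + nodes + er, data = bcancer1)
2 * (fit.full$loglik[2] - fit.reduced$loglik[2])   # LR statistic, about 1.03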
The impact that missing data may have on the results of a study can be very complicated and has only recently been studied seriously. The strategy of eliminating an entire record if any of the data are missing is very crude and can give biased results, depending on the reasons for missingness.

It may be useful to conduct some sensitivity analyses on different sets of data corresponding to different levels of missingness. For example, in our analysis of CALGB 8082, we note that nobody had missing treatment information. Therefore, the effect of treatment could be analyzed using all 905 women randomized to this study. However, only 723 women had all the covariate information we ultimately considered. We therefore also looked at the effect of treatment (unadjusted) within this subset of 723 patients to see whether the results were comparable to the full data.
          All patients     Patients with complete covariates
          n = 905          n = 723
RR        1.061            1.075
95% CI    [0.890, 1.265]   [0.882, 1.331]
Similarly, when we consider the (unadjusted) effect of tumor size on survival, we used the 817 women for whom tumor size was collected. However, for the adjusted analysis we could only use the 723 women with complete data on all covariates.

Previously, we contrasted the relationship of tumor size to survival, unadjusted versus adjusted. However, this was done on different data sets: one with the 817 women having tumor size information and the other with the 723 women with all covariates. In order to make sure that the differences seen between these two analyses are not due to the different datasets being considered, we also look at the unadjusted effect of tumor size on survival using the data set with 723 women. The estimate of the relative risk (hazard ratio between tumor sizes of 7cm vs. 1cm) and its 95% CI are
$$n = 723, \quad \widehat{RR} = 1.307, \quad 95\%\ \mathrm{CI} = [1.036, 1.649].$$
These results are similar to the unadjusted results obtained on the 817 patients.
In order to compute the likelihood ratio test for $H_0\colon \beta = 0$ (no effect of tumor size on survival) adjusted for the other covariates, we need to compute
$$2\left[\ell(\hat\beta, \hat\eta) - \ell(0, \hat\eta(0))\right]$$
or
$$\left[-2\ell(0, \hat\eta(0))\right] - \left[-2\ell(\hat\beta, \hat\eta)\right].$$
In order to compute $\ell(0, \hat\eta(0))$, we must consider the model with $\beta = 0$; i.e.,
$$\lambda(t \mid \cdot) = \lambda_0(t)\exp(0 \cdot TS + \eta_1 MS + \eta_2 NN + \eta_3 ER),$$
and find the maximized log likelihood for this sub-model. We must make sure, however, that this sub-model is run on the same set of data as the full model, i.e., on the 723 women.

This is how we get the value for the likelihood ratio test:
$$4740.759 - 4739.727 = 1.032.$$
Remark on confounding: Previously, we noted that the unadjusted effect of tumor size on survival was significant (p-value = 0.03, Wald test), whereas the adjusted effect was not significant (p-value = 0.29, Wald test). This suggests that at least one of the variables we adjusted for confounds the relationship of tumor size to survival.

A serious study of this issue, assuming we felt it was important, would take some work. However, at first glance, we note that the number of positive nodes is a highly significant prognostic factor (Wald chi-square > 65, adjusted or unadjusted) and that there is a substantial and significant correlation between the number of nodes and tumor size. I suspect that this is the primary confounding relationship that weakened the effect of tumor size as an independent prognostic factor for survival.
Appendix: SAS Program and output

The following are the program and output related to the breast cancer data set from CALGB 8082:
options ps=62 ls=72;
data bcancer;
infile "cal8082.dat";
input days cens trt meno tsize nodes er;
trt1 = trt - 1;
label days="(censored) survival time in days"
cens="censoring indicator"
trt="treatment"
meno="menopausal status"
tsize="size of largest tumor in cm"
nodes="number of positive nodes"
er="estrogen receptor status"
trt1="treatment indicator";
run;
data bcancer1; set bcancer;
if meno = . or tsize = . or nodes = . or er = . then delete;
run;
title "Univariate analysis of treatment effect";
proc phreg data=bcancer;
model days*cens(0) = trt1;
run;
The output of the above univariate program is
Univariate analysis of treatment effect 1
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
905 497 408 45.08
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6362.858 6362.421
AIC 6362.858 6364.421
SBC 6362.858 6368.629
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 0.4375 1 0.5083
Score 0.4375 1 0.5083
Wald 0.4374 1 0.5084
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
trt1 1 0.05935 0.08973 0.4374 0.5084
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
trt1 1.061 treatment indicator
Program 2: adjusting for meno tsize nodes er:
title "Analysis of treatment effect adjusting for meno tsize nodes er";
proc phreg data=bcancer;
model days*cens(0) = trt1 meno tsize nodes er;
run;
The output of program 2:
Analysis of treatment effect adjusting for meno tsize nodes er 2
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4739.685
AIC 4833.945 4749.685
SBC 4833.945 4769.528
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 94.2607 5 <.0001
Score 113.4441 5 <.0001
Wald 111.1227 5 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
trt1 1 0.02080 0.10147 0.0420 0.8376
meno 1 0.39108 0.10797 13.1198 0.0003
tsize 1 0.01992 0.01875 1.1289 0.2880
nodes 1 0.05252 0.00652 64.8325 <.0001
er 1 -0.52723 0.10485 25.2862 <.0001
Analysis of treatment effect adjusting for meno tsize nodes er 3
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
trt1 1.021 treatment indicator
meno 1.479 menopausal status
tsize 1.020 size of largest tumor in cm
nodes 1.054 number of positive nodes
er 0.590 estrogen receptor status
Program 3: a model without treatment indicator:
title "Model without treatment";
proc phreg data=bcancer;
model days*cens(0) = meno tsize nodes er;
run;
Output of program 3:
Model without treatment 4
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4739.727
AIC 4833.945 4747.727
SBC 4833.945 4763.601
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 94.2187 4 <.0001
Score 113.4346 4 <.0001
Wald 111.2321 4 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.39180 0.10791 13.1828 0.0003
tsize 1 0.02006 0.01876 1.1426 0.2851
nodes 1 0.05257 0.00651 65.1841 <.0001
er 1 -0.52691 0.10483 25.2652 <.0001
Model without treatment 5
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
meno 1.480 menopausal status
tsize 1.020 size of largest tumor in cm
nodes 1.054 number of positive nodes
er 0.590 estrogen receptor status
Program 4: Univariate analysis of treatment eect using the subsample:
title "Univariate analysis of treatment effect using subsample";
proc phreg data=bcancer1;
model days*cens(0) = trt1;
run;
Output of program 4:
Univariate analysis of treatment effect using subsample 6
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4833.430
AIC 4833.945 4835.430
SBC 4833.945 4839.398
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 0.5156 1 0.4727
Score 0.5155 1 0.4728
Wald 0.5149 1 0.4730
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
trt1 1 0.07263 0.10121 0.5149 0.4730
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
trt1 1.075 treatment indicator
Program 5: Univariate analysis of tumor size eect using the whole sample.
title "Univariate analysis of tumor size effect using whole sample";
proc phreg data=bcancer;
model days*cens(0) = tsize;
run;
Output of program 5:
Univariate analysis of tumor size effect using whole sample 7
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
817 451 366 44.80
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 5681.392 5677.370
AIC 5681.392 5679.370
SBC 5681.392 5683.481
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 4.0225 1 0.0449
Score 4.6533 1 0.0310
Wald 4.7476 1 0.0293
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
tsize 1 0.04153 0.01906 4.7476 0.0293
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
tsize 1.042 size of largest tumor in cm
Program 6: Univariate analysis of tumor size eect using the subsample:
title "Univariate analysis of tumor size effect using subsample";
proc phreg data=bcancer1;
model days*cens(0) = tsize;
run;
Output of program 6:
Univariate analysis of tumor size effect using subsample 8
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4829.744
AIC 4833.945 4831.744
SBC 4833.945 4835.712
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 4.2016 1 0.0404
Score 5.0066 1 0.0253
Wald 5.1128 1 0.0238
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
tsize 1 0.04465 0.01975 5.1128 0.0238
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
tsize 1.046 size of largest tumor in cm
Program 7: Reduced model with meno nodes er:
title "Reduced model with meno nodes er";
proc phreg data=bcancer1;
model days*cens(0) = meno nodes er;
run;
Output of program 7:
Reduced model with meno nodes er 9
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4740.759
AIC 4833.945 4746.759
SBC 4833.945 4758.666
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 93.1858 3 <.0001
Score 112.3495 3 <.0001
Wald 110.3494 3 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.38742 0.10786 12.9016 0.0003
nodes 1 0.05379 0.00636 71.5972 <.0001
er 1 -0.51916 0.10452 24.6744 <.0001
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
meno 1.473 menopausal status
nodes 1.055 number of positive nodes
Reduced model with meno nodes er 10
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
er 0.595 estrogen receptor status
Program 8: look at the correlation among covariates in the whole sample and the subsample:
title "Correlation of covariates using whole sample";
proc corr data=bcancer;
var meno tsize nodes er;
run;
title "Correlation of covariates using subsample";
proc corr data=bcancer1;
var meno tsize nodes er;
run;
Output of program 8:
Correlation of covariates using whole sample 11
09:37 Tuesday, April 2, 2002
The CORR Procedure
4 Variables: meno tsize nodes er
Simple Statistics
Variable N Mean Std Dev Sum
meno 891 0.58810 0.49245 524.00000
tsize 817 3.21603 1.98253 2627
nodes 896 6.53125 6.65252 5852
er 791 0.64855 0.47773 513.00000
Simple Statistics
Variable Minimum Maximum Label
meno 0 1.00000 menopausal status
tsize 0.10000 30.00000 size of largest tumor in cm
nodes 0 57.00000 number of positive nodes
er 0 1.00000 estrogen receptor status
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
meno tsize nodes er
meno 1.00000 -0.05815 0.05115 0.10469
menopausal status 0.0973 0.1275 0.0033
891 814 889 786
tsize -0.05815 1.00000 0.16787 -0.02528
size of largest tumor in cm 0.0973 <.0001 0.4967
814 817 817 725
nodes 0.05115 0.16787 1.00000 -0.09113
number of positive nodes 0.1275 <.0001 0.0106
889 817 896 786
er 0.10469 -0.02528 -0.09113 1.00000
estrogen receptor status 0.0033 0.4967 0.0106
786 725 786 791
Correlation of covariates using subsample 12
09:37 Tuesday, April 2, 2002
The CORR Procedure
4 Variables: meno tsize nodes er
Simple Statistics
Variable N Mean Std Dev Sum
meno 723 0.59474 0.49128 430.00000
tsize 723 3.21646 1.97440 2325
nodes 723 6.38036 6.48484 4613
er 723 0.65560 0.47550 474.00000
Simple Statistics
Variable Minimum Maximum Label
meno 0 1.00000 menopausal status
tsize 0.10000 30.00000 size of largest tumor in cm
nodes 1.00000 43.00000 number of positive nodes
er 0 1.00000 estrogen receptor status
Pearson Correlation Coefficients, N = 723
Prob > |r| under H0: Rho=0
meno tsize nodes er
meno 1.00000 -0.07193 0.02758 0.10133
menopausal status 0.0532 0.4590 0.0064
tsize -0.07193 1.00000 0.18031 -0.02508
size of largest tumor in cm 0.0532 <.0001 0.5007
nodes 0.02758 0.18031 1.00000 -0.08592
number of positive nodes 0.4590 <.0001 0.0209
er 0.10133 -0.02508 -0.08592 1.00000
estrogen receptor status 0.0064 0.5007 0.0209
Program 9: Score test for treatment effect adjusting for other covariates:
title "Score test for treatment effect adjusting for other covariates";
proc phreg data=bcancer1;
model days*cens(0) = tsize meno nodes er trt1
/ selection=forward include=4 details slentry=1.0;
run;
Output of program 9:
Score test for treatment effect adjusting for other covariates 13
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
The following variable(s) will be included in each model:
tsize meno nodes er
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4739.727
AIC 4833.945 4747.727
SBC 4833.945 4763.601
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 94.2187 4 <.0001
Score 113.4346 4 <.0001
Wald 111.2321 4 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
tsize 1 0.02006 0.01876 1.1426 0.2851
meno 1 0.39180 0.10791 13.1828 0.0003
nodes 1 0.05257 0.00651 65.1841 <.0001
er 1 -0.52691 0.10483 25.2652 <.0001
Score test for treatment effect adjusting for other covariates 14
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
tsize 1.020 size of largest tumor in cm
meno 1.480 menopausal status
nodes 1.054 number of positive nodes
er 0.590 estrogen receptor status
Analysis of Variables Not in the Model
Score
Variable Chi-Square Pr > ChiSq Label
trt1 0.0420 0.8376 treatment indicator
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
0.0420 1 0.8376
Step 1. Variable trt1 is entered. The model contains the following
explanatory variables:
tsize meno nodes er trt1
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4739.685
AIC 4833.945 4749.685
SBC 4833.945 4769.528
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 94.2607 5 <.0001
Score 113.4441 5 <.0001
Wald 111.1227 5 <.0001
Score test for treatment effect adjusting for other covariates 15
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
tsize 1 0.01992 0.01875 1.1289 0.2880
meno 1 0.39108 0.10797 13.1198 0.0003
nodes 1 0.05252 0.00652 64.8325 <.0001
er 1 -0.52723 0.10485 25.2862 <.0001
trt1 1 0.02080 0.10147 0.0420 0.8376
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
tsize 1.020 size of largest tumor in cm
meno 1.479 menopausal status
nodes 1.054 number of positive nodes
er 0.590 estrogen receptor status
trt1 1.021 treatment indicator
NOTE: All variables have been entered into the model.
Summary of Forward Selection
Variable Number Score Variable
Step Entered In Chi-Square Pr > ChiSq Label
1 trt1 5 0.0420 0.8376 treatment indicator
Program 10: Score test of tumor size effect adjusting for other covariates:
title "Score test of tumor size effect adjusting for other covariates";
proc phreg data=bcancer1;
model days*cens(0) = meno nodes er tsize
/ selection=forward include=3 details slentry=1.0;
run;
Output of program 10:
Score test of tumor size effect adjusting for other covariates 16
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
The following variable(s) will be included in each model:
meno nodes er
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4740.759
AIC 4833.945 4746.759
SBC 4833.945 4758.666
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 93.1858 3 <.0001
Score 112.3495 3 <.0001
Wald 110.3494 3 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.38742 0.10786 12.9016 0.0003
nodes 1 0.05379 0.00636 71.5972 <.0001
er 1 -0.51916 0.10452 24.6744 <.0001
Score test of tumor size effect adjusting for other covariates 17
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
meno 1.473 menopausal status
nodes 1.055 number of positive nodes
er 0.595 estrogen receptor status
Analysis of Variables Not in the Model
Score
Variable Chi-Square Pr > ChiSq Label
tsize 1.1448 0.2846 size of largest tumor in cm
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
1.1448 1 0.2846
Step 1. Variable tsize is entered. The model contains the following
explanatory variables:
meno nodes er tsize
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4739.727
AIC 4833.945 4747.727
SBC 4833.945 4763.601
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 94.2187 4 <.0001
Score 113.4346 4 <.0001
Wald 111.2321 4 <.0001
Score test of tumor size effect adjusting for other covariates 18
09:37 Tuesday, April 2, 2002
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.39180 0.10791 13.1828 0.0003
nodes 1 0.05257 0.00651 65.1841 <.0001
er 1 -0.52691 0.10483 25.2652 <.0001
tsize 1 0.02006 0.01876 1.1426 0.2851
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
meno 1.480 menopausal status
nodes 1.054 number of positive nodes
er 0.590 estrogen receptor status
tsize 1.020 size of largest tumor in cm
NOTE: All variables have been entered into the model.
Summary of Forward Selection
Variable Number Score Variable
Step Entered In Chi-Square Pr > ChiSq Label
1 tsize 4 1.1448 0.2846 size of largest tumor in cm
8 Modeling Survival Data with Categorical Covariates
We shall first consider the case where there is no a priori ordering expected between the
categories and the outcome of interest (survival in this case). For example: geographical region,
day of week, color of eyes, etc.

In regression modeling, including proportional hazards regression, a useful way of modeling
such categorical covariates and their effect on outcome is by the use of dummy variables.
Specifically, if there are k categories, we would define k dummy variables, D_1, ..., D_k, where

    D_j = { 1  if the individual falls into the jth category,
          { 0  otherwise,

for j = 1, ..., k.
In a proportional hazards model, if we were interested in modeling the effect of such a
categorical covariate on the hazard function, we may consider the following model:

    λ(t|·) = λ_0(t) exp(D_1 β_1 + ··· + D_{k-1} β_{k-1} + z_1 γ_1 + ··· + z_q γ_q).

Note: There are only (k-1) of the dummy variables in the model, to avoid overparametrization.
The category that is left out (category k) is called the reference category. At most one of
D_1, ..., D_{k-1} may be equal to one, and all are equal to zero when an individual falls into
the reference category (i.e., the kth category).
    Category    D_1    D_2    ...    D_{k-1}
       1         1      0     ...      0
       2         0      1     ...      0
      ...       ...    ...    ...     ...
      k-1        0      0     ...      1
       k         0      0     ...      0
The parameters β_1, ..., β_{k-1} are used to measure the degree of effect that the categorical
covariate has on the hazard rate. We may want to include other covariates (z_1, ..., z_q) in the
model to adjust for their effects.

The interpretation of β_j is the log hazard ratio between an individual in category j and an
individual in the reference category (the kth category), assuming all other covariates are the
same.
This is easily seen by noting that

    λ(t|cat = j, z) / λ(t|cat = k, z) = [λ_0(t) exp(β_j + z^T γ)] / [λ_0(t) exp(0 + z^T γ)] = exp(β_j).
If we want the hazard ratio between category j and category j' (1 ≤ j, j' ≤ k-1), then we use
the following:

    λ(t|cat = j, z) / λ(t|cat = j', z) = [λ_0(t) exp(β_j + z^T γ)] / [λ_0(t) exp(β_{j'} + z^T γ)] = exp(β_j − β_{j'}).
The hypothesis corresponding to no effect of the categorical variable on survival is given by

    H_0: β_1 = β_2 = ··· = β_{k-1} = 0.

Under this null hypothesis, the hazard function is the same regardless of what category an
individual is in.

The null hypothesis can be tested using the Wald test, score test, or likelihood ratio test.
Since our null hypothesis considers fixed values (i.e., 0) for (k-1) of the parameters in the
model, the distribution of each of the tests above would be chi-square with (k-1) degrees of
freedom if the null hypothesis were true. P-values can be computed by evaluating the probability
that a χ²_{k-1} random variable exceeds the observed value of the test statistic.
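To make this concrete, here is a minimal R sketch (survival package) of fitting a proportional
hazards model with a categorical covariate via dummy variables and reading off the three global
tests. The data frame bc and its variables are simulated, hypothetical stand-ins for a dataset
like cal8082.dat.

library(survival)
set.seed(745)
# Simulated stand-in data; in practice these would come from cal8082.dat
bc <- data.frame(days = rexp(500, 1/1000),
                 cens = rbinom(500, 1, 0.6),
                 nodecat = factor(sample(c("1","2","3","4","5-10","11-15",">15"),
                                         500, replace = TRUE),
                                  levels = c(">15","1","2","3","4","5-10","11-15")))
# Listing ">15" as the first factor level makes it the reference category,
# so coxph() internally creates the six dummy variables D_1, ..., D_6
fit <- coxph(Surv(days, cens) ~ nodecat, data = bc)
summary(fit)           # Wald test for each log hazard ratio vs. ">15"
summary(fit)$logtest   # global likelihood ratio test of H_0 (6 d.f.)
summary(fit)$sctest    # global score test
summary(fit)$waldtest  # global Wald test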
Note: If we are testing the null hypothesis of no effect of a categorical variable with k
categories, using a proportional hazards model with (k-1) dummy variables and not adjusting
for additional covariates, then the score test derived from this partial likelihood will be identical
to the k-sample log rank test if there are no ties in the survival data. This extends the results
we noted for the two-sample log rank test.

Let us illustrate the use of dummy variables for coding categorical variables in our dataset
CAL8082.dat of breast cancer patients. We shall focus on the effect that the number of nodes
involved at randomization has on survival.

Since the number of nodes ranges from 1 to 57, we broke it down into 7 categories (1, 2, 3, 4,
5-10, 11-15, > 15). We created dummy variables for the first six categories, leaving the category
(> 15) as the reference category.
The first model considered:

    λ(t|·) = λ_0(t) exp(DN_1 β_1 + DN_2 β_2 + DN_3 β_3 + DN_4 β_4 + DN_{5-10} β_5 + DN_{11-15} β_6).
The corresponding Wald test, score test and likelihood ratio test of the null hypothesis

    H_0: β_1 = β_2 = β_3 = β_4 = β_5 = β_6 = 0,

or no effect of these categories of the nodes on survival, were equal to

    Wald = 100.4,   score = 108.5,   LR = 96.03,

respectively.

All of these, compared to a chi-square with 6 degrees of freedom, yielded highly significant
results.
More interesting is the ability to assess the degree of effect. For example, β_1 corresponds to
the log hazard ratio for patients with one node affected vs. patients with > 15 nodes (the
reference category).

In this example, the estimate of β_1 and its standard error are

    β̂_1 = −1.283  (e^{β̂_1} = 0.28),   se(β̂_1) = 0.174,

so a 95% CI for β_1 is

    β̂_1 ± 1.96 se(β̂_1) = −1.283 ± 1.96 × 0.174 = [−1.624, −0.942].

The corresponding 95% CI for the hazard ratio is

    [exp(−1.624), exp(−0.942)] = [0.197, 0.390].
Suppose we want to estimate the hazard ratio between the categories (nodes=1) vs. (nodes=3).
We compute this hazard ratio to be

    exp(β_1 − β_3).

The estimate of β_1 − β_3 is equal to

    β̂_1 − β̂_3 = −1.283 − (−1.213) = −0.070.

Therefore, the corresponding hazard ratio estimate is

    exp(−0.070) = 0.932.

To find the confidence interval for β_1 − β_3, we need to compute se(β̂_1 − β̂_3):

    Var(β̂_1 − β̂_3) = Var(β̂_1) + Var(β̂_3) − 2 Cov(β̂_1, β̂_3)
                    = 0.03037 + 0.04446 − 2 × 0.01588 = 0.04307.

So

    se(β̂_1 − β̂_3) = sqrt(0.04307) = 0.2075.

Note: We don't need to do the above calculation to get the standard error of β̂_1 − β̂_3. We just
need to rerun the model using category 3 as the reference category. That is, we use all dummy
variables except the dummy for category 3. Then the parameter estimate corresponding to
category 1 is β̂_1 − β̂_3, with its standard error being se(β̂_1 − β̂_3).
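The same contrast can be computed in R from the estimated covariance matrix, mirroring the
hand calculation above. This is a sketch only, continuing with the simulated fit from the earlier
sketch; the coefficient names shown are whatever coxph() assigns to the dummy variables.

b <- coef(fit)
V <- vcov(fit)
est <- b["nodecat1"] - b["nodecat3"]
se  <- sqrt(V["nodecat1","nodecat1"] + V["nodecat3","nodecat3"]
            - 2 * V["nodecat1","nodecat3"])    # Var + Var - 2 Cov
exp(est + c(-1.96, 1.96) * se)                 # 95% CI for the hazard ratio
# Equivalent shortcut: refit with category 3 as the reference
fit3 <- coxph(Surv(days, cens) ~ relevel(nodecat, ref = "3"), data = bc)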
To get a better understanding of the relationship of the various categories to survival, it is
useful to plot the log hazard ratio and hazard ratio as a function of the categories. For example,
Figure 8.1: Log hazard ratio as a function of category (x-axis: number of nodes; y-axis: log hazard ratio)
Figure 8.1 presents the relationship of the log hazard ratio to the categories, and Figure 8.2
presents the relationship of the hazard ratio to the categories.
We also included a model in which we adjust for the effects of menopausal status, tumor size,
and estrogen receptor status. The adjusted effects of the number of nodes changed very little.

For the model

    λ(t|·) = λ_0(t) exp(DN_1 β_1 + DN_2 β_2 + DN_3 β_3 + DN_4 β_4 + DN_{5-10} β_5 + DN_{11-15} β_6
                        + MN γ_1 + TS γ_2 + ER γ_3),

we can construct a likelihood ratio test for the null hypothesis

    H_0: β_1 = β_2 = β_3 = β_4 = β_5 = β_6 = 0.

We compute

    [−2 log L of the reduced model (β = 0)] − [−2 log L of the full model]

and compare this to a chi-square with 6 degrees of freedom.

Using the output, we get:

    LR = 4791.872 − 4728.493 = 63.38,   Score = 71.668.
Figure 8.2: Hazard ratio as a function of category (x-axis: number of nodes; y-axis: hazard ratio)
These give strong evidence against H_0 (we can also calculate the Wald statistic to be 67.12).
Ordered Categorical Covariates and Trend Tests
When we model the effect of a categorical covariate using dummy variables in a proportional
hazards model, we are assuming no implicit ordering of the categories in their effect on survival.
For example, in the model

    λ(t|·) = λ_0(t) exp(D_1 β_1 + ··· + D_{k-1} β_{k-1}),

the hazard ratio between the jth and j'th categories is equal to

    exp(β_j − β_{j'})   if j, j' ≠ k,
    exp(β_j)            if j' = k.

Since β_j and β_{j'} are not restricted, this hazard ratio can vary from 0 to infinity regardless
of j and j'.
In some cases, however, we might expect the effect of category on survival to follow some
natural ordering. In our breast cancer example, we might expect the hazard rate to increase as
the number of nodes defining the categories gets larger.

For ordered categorical covariates, it may be easier if we label the k categories as categories
0, 1, ..., k-1, and let category 0 be the reference category. In this case we consider the model

    λ(t|·) = λ_0(t) exp(D_1 β_1 + ··· + D_{k-1} β_{k-1}).

If there is an ordered effect on survival, we might expect that

    0 < β_1 < β_2 < ··· < β_{k-1},

or

    0 > β_1 > β_2 > ··· > β_{k-1}.
However, the model above puts no restrictions on the values β_1, ..., β_{k-1}. Consequently, the
multiparameter tests of

    H_0: β_1 = β_2 = ··· = β_{k-1} = 0

we have discussed so far (all of which have a chi-square distribution with (k-1) degrees of
freedom) are considering omnibus alternatives, that is, any deviation from the null hypothesis.
Because of this, these tests are not especially powerful in detecting alternatives which have an
implied natural ordering.

For such situations, we may prefer to use a trend test.

In a trend test, we assign a score to the ordered categories. For example, we may use
1, 2, ..., k-1 for the k-1 ordered categories. In the breast cancer example, the score is the average
number of nodes for each of the categories, i.e.,

    1 → 1,  2 → 2,  3 → 3,  4 → 4,  (5-10) → 7.5,  (11-15) → 13,  (> 15) → 20 (approximately).
We then consider the model

    λ(t|·) = λ_0(t) exp(γ Sc),

where Sc corresponds to the ordered score, and test the hypothesis

    H_0: γ = 0  vs.  H_A: γ ≠ 0.

Under this alternative, the hazard increases or decreases as the score of the category increases,
depending on whether γ > 0 or γ < 0.
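A sketch of the trend test in R, reusing the simulated bc data from the earlier sketch; the
score assignment mirrors the dnscore coding in the SAS program in the appendix.

score.map <- c("1" = 1, "2" = 2, "3" = 3, "4" = 4,
               "5-10" = 7.5, "11-15" = 13, ">15" = 20)
bc$dnscore <- score.map[as.character(bc$nodecat)]
trend <- coxph(Surv(days, cens) ~ dnscore, data = bc)
summary(trend)$sctest   # 1-d.f. score test of H_0: gamma = 0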
Remark:

- The null hypothesis for the trend test is the same null hypothesis as for the omnibus test;
  that is, the hazard function does not depend on category.

- The trend test is distributed as a chi-square with one degree of freedom under H_0, whereas
  the omnibus test is distributed as a chi-square with (k-1) degrees of freedom under H_0.

- In general, the trend test has greater power to detect differences in categories which are
  ordered and have an ordered effect. However, the trend test may have less power to find
  deviations from the null hypothesis that are not ordered, compared to the omnibus test.

- Any of the large sample tests (Wald, score, likelihood ratio) may be used to test H_0.

- We can also adjust for other covariates that may be potential confounders.
For the CALGB 8082 example, the trend test yields

    Wald test : 102.3
    score test: 106.9
    LR test   : 89.6

All of these are to be compared to a chi-square with one d.f.

We can contrast these values with the results from the omnibus test:

    Wald test : 100.4
    score test: 108.5
    LR test   : 96.0
These numbers are similar to the numbers from the trend test, but they are to be compared to
a chi-square distribution with 6 d.f., yielding weaker evidence against H_0 (although the evidence
is still strong in this case).

When we adjust for menopausal status, estrogen receptor status and tumor size, we get for
the trend test:

    Wald test : 67.97
    LR test   : 4791.87 − 4732.07 = 59.80
    score test: 70.51

to be compared to a chi-square with one d.f.
The Philosophy of Model Building

When trying to build models and understand the relationships that these models imply, it
is useful to work up hierarchically, considering increasingly more complex structures of nested
models. The likelihood ratio test is preferred for deciding which variables (or structures) are
or are not important (the LR test is usually more stable and easily constructed).

We should strive to find parsimonious models, i.e., models that adequately explain the
structure of the data with as simple a structure as possible. It is especially helpful to get
feedback from a subject matter scientist.
Modeling Continuous Covariates

Suppose we have a covariate Z which is continuous and we want to relate the hazard function
to Z using a proportional hazards model. The simplest model we could consider is

    λ(t|Z) = λ_0(t) exp(βZ).

This model specifies a very specific structure on the relationship of the hazard to the covariate
Z. Namely,

    λ(t|Z = z + 1) / λ(t|Z = z) = [λ_0(t) exp(β(z + 1))] / [λ_0(t) exp(βz)] = exp(β),

regardless of z. That is, a unit increase in the covariate Z will yield a proportional increase in
the hazard of exp(β).

If this relationship is an adequate representation of the truth, then the interpretation that we
can give to the parameter β is easy to understand. Of course, this assumption may or may not
be adequate.
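For instance, in the Chapter 7 output for tumor size we had β̂ = 0.04153, so under this model
each additional centimeter of tumor size multiplies the hazard by exp(0.04153) = 1.042 (the
hazard ratio SAS reports), and a 5 cm difference corresponds to a hazard ratio of
exp(5 × 0.04153) ≈ 1.23.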
Checking Adequacy of the Covariate Relationship in the Proportional Hazards Model

Using the above model building philosophy, we shall assess whether a particular covariate
relationship is reasonable by embedding the proposed model into a more complex model and
then testing whether the more complex structure gives a sufficiently better fit.

There are two ways that we suggest for considering more complex structures for modeling a
continuous variable.

1. Assume the relationship follows a higher order polynomial: For example, we may consider
   the model

       λ(t|Z) = λ_0(t) exp(β_1 Z + β_2 Z²).

   A test of the hypothesis H_0: β_2 = 0 may be used to assess the adequacy of the model
   λ(t|Z) = λ_0(t) exp(βZ). (A code sketch of this check follows Figure 8.3.)

   Example: In CALGB 8082, nodal status seemed to be an important prognostic factor. Since
   the number of nodes varies from 1 to 57, it may be reasonable to think of this variable as
   approximately continuous and try to find the approximate relationship of this variable to
   the hazard function.

   Consider the SAS output as we examine a linear and quadratic relationship.

2. Discretizing (or categorizing) a continuous covariate to assess models: The values of the
   parameters in a higher order polynomial are difficult to interpret. It may be easier to break
   up the continuous covariate into several categories and then use the methods we developed
   for categorical covariates. Plots of the parameter estimates for the effects of different
   categories versus the mid-value defining the categories may be helpful to assess fit or suggest
   different models. Let us illustrate through an example. Here we will discretize the number
   of nodes into intervals of length 5 (except the last interval, which is > 25) and use 1-5 as
   the reference category. The plot is presented in Figure 8.3.
Figure 8.3: Log-hazard ratio as a function of category midpoint (x-axis: number of nodes)
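Here is the code sketch of the polynomial check in item 1, reusing the simulated bc data and
the dnscore variable from the earlier sketches as a stand-in continuous covariate.

bc$nodes <- bc$dnscore   # stand-in continuous version of number of nodes
fit.lin  <- coxph(Surv(days, cens) ~ nodes, data = bc)
fit.quad <- coxph(Surv(days, cens) ~ nodes + I(nodes^2), data = bc)
anova(fit.lin, fit.quad) # 1-d.f. LR test of H_0: beta_2 = 0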
Interaction (Effect Modification)

When studying the effect of a variable on survival, we showed how to control for the possible
confounding effects of other prognostic factors by including these in the proportional hazards
model as well.

For example, in Chapter 6 we discussed the relationship of drinking to survival, controlling
for smoking, age and sex, by looking at the model:

    λ(t|·) = λ_0(t) exp(βD + γ_1 S + γ_2 A + γ_3 Sx),

where D is the drinking indicator, S is the smoking indicator, A is age and Sx is the sex indicator.
This model assumes that the hazard ratio for a drinker compared to a non-drinker is exp(β)
regardless of their smoking status, age and sex. Therefore, if the effect of drinking on survival
is measured through the hazard ratio, the above model does not allow for effect modification,
i.e., where the effect of drinking on survival might change or vary across different smoking, age
or sex categories.

Effect modification may be accommodated in a proportional hazards model by including
interaction terms, i.e., a product of the variables that are thought to be effect modifiers.

Remark: Effect modification is a term used in epidemiology. In statistics, we use the term
interaction to denote the same concept.

For example, suppose we suspected that smoking was an effect modifier for the relationship
of drinking to survival; then we may consider the following model

    λ(t|·) = λ_0(t) exp(βD + γ_1 S + γ_2 A + γ_3 Sx + η(D × S)),

where D × S is the interaction term and its coefficient η measures the degree of effect
modification. For such a model, the hazard ratio of a drinker (D = 1) compared to a non-drinker
(D = 0) for a given smoking status, sex and age is given by

    λ(t|D = 1, ·) / λ(t|D = 0, ·) = [λ_0(t) exp(β + γ_1 S + γ_2 A + γ_3 Sx + ηS)] / [λ_0(t) exp(γ_1 S + γ_2 A + γ_3 Sx)]
                                  = exp(β + ηS).

This would imply that the hazard ratio for a drinker to a non-drinker is exp(β + η) among
smokers and exp(β) among non-smokers.

One could test for effect modification of smoking on the relationship of drinking to survival
by testing the null hypothesis

    H_0: η = 0

for this multiparameter proportional hazards model.

Of course, we could also consider age or sex as effect modifiers for drinking by including the
terms D × A and D × Sx in the proportional hazards model.
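A minimal R sketch of fitting and testing such an interaction; the data and the variable
names drink, smoke, age and sex are simulated, hypothetical stand-ins.

library(survival)
set.seed(1)
dat <- data.frame(time = rexp(400), status = rbinom(400, 1, 0.7),
                  drink = rbinom(400, 1, 0.4), smoke = rbinom(400, 1, 0.3),
                  age = rnorm(400, 50, 8), sex = rbinom(400, 1, 0.5))
main <- coxph(Surv(time, status) ~ drink + smoke + age + sex, data = dat)
emod <- coxph(Surv(time, status) ~ drink * smoke + age + sex, data = dat)
anova(main, emod)   # 1-d.f. LR test of H_0: eta = 0 (no effect modification)
# Drinker vs. non-drinker hazard ratio: exp of the "drink" coefficient among
# non-smokers; exp of ("drink" + "drink:smoke" coefficients) among smokers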
Let us go back to our CALGB 8082 data set and consider interaction terms:

    Model                                               −2 log L    d.f.
    All main effects                                     4739.69      5
    All main effects + all interactions                  4716.67     15
    All main effects + trt × er                          4734.56      6
    All main effects + trt × er + tumor size × er        4721.30      7

Note: Two potentially important interactions, between treatment and ER status and between
tumor size and ER, surfaced and may warrant further investigation.

From the model with all main effects + trt × er + tumor size × er, we get

    λ(t|Rx = 1, ·) / λ(t|Rx = 0, ·) = [λ_0(t) exp(0.288 − 0.449 ER + ···)] / [λ_0(t) exp(0 + ···)]
                                    = exp(0.288 − 0.449 ER).

Thus for ER positive patients (ER = 1), the hazard ratio for trt1=1 vs. trt1=0 is
exp(0.288 − 0.449) = exp(−0.161) = 0.85, while for ER negative patients (ER = 0), the hazard
ratio for trt1=1 vs. trt1=0 is exp(0.288) = 1.33.

Neither of these estimates is highly significant, and given the fact that this relationship was
discovered among many possible relationships considered in a post-hoc analysis, one must be
cautious of the problem of multiple comparisons. Nonetheless, it may be worth investigating this
issue further and bringing this finding to the attention of the collaborators.
Appendix: SAS Program and output
The following is the SAS program for the analyses on pages 153-156.
options ps=72 ls=72;
data bcancer;
infile "cal8082.dat";
input days cens trt meno tsize nodes er;
trt1 = trt - 1;
if nodes=0 or nodes=. then delete;
dn1 = (nodes=1);
dn2 = (nodes=2);
dn3 = (nodes=3);
dn4 = (nodes=4);
dn510 = (4.5<nodes<10.5);
dn1015 = (10.5<nodes<15.5);
dn15 = (nodes>15.5);
dnscore = nodes;
if dn510=1 then
dnscore=7.5;
else if dn1015=1 then
dnscore=13;
else if dn15=1 then
dnscore=20;
label days="(censored) survival time in days"
cens="censoring indicator"
trt="treatment"
meno="menopausal status"
tsize="size of largest tumor in cm"
nodes="number of positive nodes"
er="estrogen receptor status"
trt1="treatment indicator";
run;
data bcancer1; set bcancer;
if meno = . or tsize = . or nodes = . or er = . then delete;
run;
proc freq data=bcancer;
tables nodes;
run;
title "Unadjusted analysis of nodes effect using whole sample";
proc phreg data=bcancer;
model days*cens(0) = dn1 dn2 dn3 dn4 dn510 dn1015 / covb;
run;
title "Unadjusted analysis of nodes effect using whole sample";
proc phreg data=bcancer;
model days*cens(0) = dn1 dn2 dn4 dn510 dn1015 dn15;
run;
title "Unadjusted analysis of nodes effect using subsample";
proc phreg data=bcancer1;
model days*cens(0) = dn1 dn2 dn3 dn4 dn510 dn1015;
run;
title "Analysis of adjusted nodes effect using subsample";
proc phreg data=bcancer1;
model days*cens(0) = dn1 dn2 dn3 dn4 dn510 dn1015 meno tsize er /covb;
run;
title "Model with only meno tsize er";
proc phreg data=bcancer1;
model days*cens(0) = meno tsize er;
run;
title "Score test for nodes effect adjusting for other covariates";
proc phreg data=bcancer1;
model days*cens(0) = meno tsize er dn1 dn2 dn3 dn4 dn510 dn1015
/ selection=forward detail include=3 slentry=0;
run;
title1 "Trend test for number of nodes";
title2 "Unadjusted analysis of nodes effect using whole sample";
proc phreg data=bcancer;
model days*cens(0) = dnscore;
run;
title2 "Analysis of adjusted nodes effect using subsample";
proc phreg data=bcancer1;
model days*cens(0) = dnscore meno tsize er;
run;
title2 "Score test for nodes effect adjusting for other covariates";
proc phreg data=bcancer1;
model days*cens(0) = meno tsize er dnscore
/ selection=forward detail include=3 slentry=0;
run;
The following is the corresponding output:
The SAS System 1
16:16 Monday, April 7, 2003
The FREQ Procedure
number of positive nodes
Cumulative Cumulative
nodes Frequency Percent Frequency Percent
----------------------------------------------------------
1 174 19.44 174 19.44
2 140 15.64 314 35.08
3 78 8.72 392 43.80
4 74 8.27 466 52.07
5 58 6.48 524 58.55
6 53 5.92 577 64.47
7 42 4.69 619 69.16
8 37 4.13 656 73.30
9 34 3.80 690 77.09
10 26 2.91 716 80.00
11 21 2.35 737 82.35
12 20 2.23 757 84.58
13 20 2.23 777 86.82
14 16 1.79 793 88.60
15 20 2.23 813 90.84
16 7 0.78 820 91.62
17 11 1.23 831 92.85
18 8 0.89 839 93.74
19 8 0.89 847 94.64
20 6 0.67 853 95.31
21 5 0.56 858 95.87
22 6 0.67 864 96.54
23 6 0.67 870 97.21
24 1 0.11 871 97.32
25 6 0.67 877 97.99
26 3 0.34 880 98.32
27 4 0.45 884 98.77
28 2 0.22 886 98.99
29 1 0.11 887 99.11
31 1 0.11 888 99.22
33 1 0.11 889 99.33
34 1 0.11 890 99.44
35 1 0.11 891 99.55
38 1 0.11 892 99.66
43 2 0.22 894 99.89
57 1 0.11 895 100.00
Unadjusted analysis of nodes effect using whole sample 2
16:16 Monday, April 7, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
895 489 406 45.36
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6251.265 6155.232
AIC 6251.265 6167.232
SBC 6251.265 6192.386
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 96.0334 6 <.0001
Score 108.5044 6 <.0001
Wald 100.4176 6 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard Hazard
Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio
dn1 1 -1.28437 0.17426 54.3251 <.0001 0.277
dn2 1 -1.25842 0.18123 48.2173 <.0001 0.284
dn3 1 -1.21370 0.21085 33.1338 <.0001 0.297
dn4 1 -1.08482 0.21264 26.0262 <.0001 0.338
dn510 1 -0.62394 0.14893 17.5510 <.0001 0.536
dn1015 1 -0.26508 0.17134 2.3933 0.1219 0.767
Estimated Covariance Matrix
Variable dn1 dn2 dn3
dn1 0.0303656699 0.0158794030 0.0158743801
dn2 0.0158794030 0.0328433110 0.0158873234
dn3 0.0158743801 0.0158873234 0.0444580750
dn4 0.0158321124 0.0158403749 0.0158365471
dn510 0.0157822736 0.0157910332 0.0157887547
dn1015 0.0157102878 0.0157179826 0.0157169949
Unadjusted analysis of nodes effect using whole sample 3
16:16 Monday, April 7, 2003
The PHREG Procedure
Estimated Covariance Matrix
Variable dn4 dn510 dn1015
dn1 0.0158321124 0.0157822736 0.0157102878
dn2 0.0158403749 0.0157910332 0.0157179826
dn3 0.0158365471 0.0157887547 0.0157169949
dn4 0.0452171683 0.0157584925 0.0156986869
dn510 0.0157584925 0.0221809786 0.0156817871
dn1015 0.0156986869 0.0156817871 0.0293588001
Unadjusted analysis of nodes effect using whole sample 4
16:16 Monday, April 7, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
895 489 406 45.36
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6251.265 6155.232
AIC 6251.265 6167.232
SBC 6251.265 6192.386
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 96.0334 6 <.0001
Score 108.5044 6 <.0001
Wald 100.4176 6 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard Hazard
Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio
dn1 1 -0.07068 0.20755 0.1160 0.7335 0.932
dn2 1 -0.04472 0.21337 0.0439 0.8340 0.956
dn4 1 0.12888 0.24084 0.2864 0.5926 1.138
dn510 1 0.58976 0.18725 9.9202 0.0016 1.804
dn1015 1 0.94862 0.20587 21.2322 <.0001 2.582
dn15 1 1.21370 0.21085 33.1338 <.0001 3.366
Unadjusted analysis of nodes effect using subsample 5
16:16 Monday, April 7, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4764.954
AIC 4833.945 4776.954
SBC 4833.945 4800.766
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 68.9916 6 <.0001
Score 77.8626 6 <.0001
Wald 72.4992 6 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard Hazard
Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio
dn1 1 -1.21094 0.19315 39.3037 <.0001 0.298
dn2 1 -1.26069 0.20165 39.0842 <.0001 0.283
dn3 1 -1.17723 0.23237 25.6654 <.0001 0.308
dn4 1 -1.00578 0.24002 17.5597 <.0001 0.366
dn510 1 -0.60345 0.17061 12.5111 0.0004 0.547
dn1015 1 -0.33276 0.19337 2.9614 0.0853 0.717
Analysis of adjusted nodes effect using subsample 6
16:16 Monday, April 7, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4728.493
AIC 4833.945 4746.493
SBC 4833.945 4782.212
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 105.4518 9 <.0001
Score 115.3524 9 <.0001
Wald 109.3568 9 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
dn1 1 -1.19408 0.19574 37.2135 <.0001
dn2 1 -1.20719 0.20415 34.9666 <.0001
dn3 1 -1.16259 0.23449 24.5813 <.0001
dn4 1 -1.03819 0.24114 18.5357 <.0001
dn510 1 -0.60950 0.17210 12.5431 0.0004
dn1015 1 -0.32581 0.19445 2.8074 0.0938
meno 1 0.40551 0.10820 14.0459 0.0002
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
dn1 0.303
dn2 0.299
dn3 0.313
dn4 0.354
dn510 0.544
dn1015 0.722
meno 1.500 menopausal status
Analysis of adjusted nodes effect using subsample 7
16:16 Monday, April 7, 2003
The PHREG Procedure
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
tsize 1 0.02298 0.01945 1.3963 0.2373
er 1 -0.54446 0.10475 27.0157 <.0001
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
tsize 1.023 size of largest tumor in cm
er 0.580 estrogen receptor status
Estimated Covariance Matrix
Variable dn1 dn2
dn1 0.0383145537 0.0216371781
dn2 0.0216371781 0.0416773678
dn3 0.0215804368 0.0215943509
dn4 0.0212539145 0.0212273511
dn510 0.0212225418 0.0212180900
dn1015 0.0210776688 0.0210965424
meno menopausal status -.0001772870 0.0000821671
tsize size of largest tumor in cm 0.0006074631 0.0006114021
er estrogen receptor status -.0000041378 -.0006272171
Estimated Covariance Matrix
Variable dn3 dn4
dn1 0.0215804368 0.0212539145
dn2 0.0215943509 0.0212273511
dn3 0.0549853099 0.0213140081
dn4 0.0213140081 0.0581490154
dn510 0.0212533597 0.0210335278
dn1015 0.0211331931 0.0209323786
meno menopausal status -.0011395951 -.0012567326
tsize size of largest tumor in cm 0.0005472880 0.0003777127
er estrogen receptor status -.0004634995 0.0001294100
Estimated Covariance Matrix
Variable dn510 dn1015
dn1 0.0212225418 0.0210776688
dn2 0.0212180900 0.0210965424
dn3 0.0212533597 0.0211331931
dn4 0.0210335278 0.0209323786
dn510 0.0296173567 0.0209108008
dn1015 0.0209108008 0.0378115889
meno menopausal status -.0008123275 -.0007730645
tsize size of largest tumor in cm 0.0004003289 0.0003517206
er estrogen receptor status -.0001460574 -.0004473206
Estimated Covariance Matrix
Variable meno tsize
dn1 -.0001772870 0.0006074631
dn2 0.0000821671 0.0006114021
dn3 -.0011395951 0.0005472880
dn4 -.0012567326 0.0003777127
Analysis of adjusted nodes effect using subsample 8
16:16 Monday, April 7, 2003
The PHREG Procedure
Estimated Covariance Matrix
Variable meno tsize
dn510 -.0008123275 0.0004003289
dn1015 -.0007730645 0.0003517206
meno menopausal status 0.0117072729 0.0000769842
tsize size of largest tumor in cm 0.0000769842 0.0003781367
er estrogen receptor status -.0014632745 -.0001378757
Estimated Covariance Matrix
Variable er
dn1 -.0000041378
dn2 -.0006272171
dn3 -.0004634995
dn4 0.0001294100
dn510 -.0001460574
dn1015 -.0004473206
meno menopausal status -.0014632745
tsize size of largest tumor in cm -.0001378757
er estrogen receptor status 0.0109726802
Model with only meno tsize er 9
16:16 Monday, April 7, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4791.872
AIC 4833.945 4797.872
SBC 4833.945 4809.779
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 42.0728 3 <.0001
Score 44.3354 3 <.0001
Wald 43.9297 3 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.41662 0.10758 14.9962 0.0001
tsize 1 0.05245 0.01914 7.5127 0.0061
er 1 -0.54977 0.10446 27.6995 <.0001
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
meno 1.517 menopausal status
tsize 1.054 size of largest tumor in cm
er 0.577 estrogen receptor status
Score test for nodes effect adjusting for other covariates 10
16:16 Monday, April 7, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
The following variable(s) will be included in each model:
meno tsize er
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4791.872
AIC 4833.945 4797.872
SBC 4833.945 4809.779
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 42.0728 3 <.0001
Score 44.3354 3 <.0001
Wald 43.9297 3 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.41662 0.10758 14.9962 0.0001
tsize 1 0.05245 0.01914 7.5127 0.0061
er 1 -0.54977 0.10446 27.6995 <.0001
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
meno 1.517 menopausal status
tsize 1.054 size of largest tumor in cm
er 0.577 estrogen receptor status
Score test for nodes effect adjusting for other covariates 11
16:16 Monday, April 7, 2003
The PHREG Procedure
Analysis of Variables Not in the Model
Score
Variable Chi-Square Pr > ChiSq Label
dn1 10.5788 0.0011
dn2 9.0532 0.0026
dn3 3.9337 0.0473
dn4 1.6003 0.2059
dn510 5.9467 0.0147
dn1015 14.8804 0.0001
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
71.6681 6 <.0001
NOTE: No (additional) variables met the 0 level for entry into the
model.
Trend test for number of nodes 12
Unadjusted analysis of nodes effect using whole sample
08:41 Tuesday, April 8, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
895 489 406 45.36
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6251.265 6161.650
AIC 6251.265 6163.650
SBC 6251.265 6167.843
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 89.6150 1 <.0001
Score 106.9318 1 <.0001
Wald 102.3116 1 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard Hazard
Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio
dnscore 1 0.07245 0.00716 102.3116 <.0001 1.075
Trend test for number of nodes 13
Analysis of adjusted nodes effect using subsample
08:41 Tuesday, April 8, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4732.068
AIC 4833.945 4740.068
SBC 4833.945 4755.943
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 101.8775 4 <.0001
Score 114.2136 4 <.0001
Wald 110.7178 4 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
dnscore 1 0.06774 0.00822 67.9694 <.0001
meno 1 0.41281 0.10780 14.6631 0.0001
tsize 1 0.02268 0.01885 1.4467 0.2291
er 1 -0.54589 0.10464 27.2166 <.0001
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
dnscore 1.070
meno 1.511 menopausal status
tsize 1.023 size of largest tumor in cm
er 0.579 estrogen receptor status
Trend test for number of nodes 14
Score test for nodes effect adjusting for other covariates
08:41 Tuesday, April 8, 2003
The PHREG Procedure
Model Information
Data Set WORK.BCANCER1
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
723 391 332 45.92
The following variable(s) will be included in each model:
meno tsize er
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4833.945 4791.872
AIC 4833.945 4797.872
SBC 4833.945 4809.779
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 42.0728 3 <.0001
Score 44.3354 3 <.0001
Wald 43.9297 3 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
meno 1 0.41662 0.10758 14.9962 0.0001
tsize 1 0.05245 0.01914 7.5127 0.0061
er 1 -0.54977 0.10446 27.6995 <.0001
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
9 Estimating the Underlying Survival Distribution for a
Proportional Hazards Model
So far the focus has been on the regression parameters in the proportional hazards model.
These parameters describe the strength of the relationship of the covariates to survival.
Since a proportional hazards model is semiparametric in the sense that the underlying baseline
hazard function is left totally unspecified, these parameters do not suffice to describe the
survival distribution. However, we may be interested in estimating the survival distribution for
individuals with a certain combination of covariates.

One strategy would be to create a subset or stratum using individuals with a particular set
of covariates, or at least in a range of values if we are considering continuous covariates, and
then estimate the survival distribution for this particular stratum by using a Kaplan-Meier
estimate. If the number of variables is large or we choose a narrow range, then the number of
individuals in any subset could be so small as to make the Kaplan-Meier estimator useless; i.e.,
if a subset contained just a few censored survival times, then the corresponding Kaplan-Meier
estimator would be very imprecise.
In a proportional hazards model, we assume a certain structure in the relationship of the
covariates to the hazard rate. Assuming the model is an adequate representation of the true
relationship, we can take advantage of this structure to obtain a better estimate of the survival
distribution as a function of the covariates.

The proportional hazards model assumes that the hazard rate at time t given the covariate
Z, where Z is a q-dimensional vector of covariates (Z_1, ..., Z_q)^T, is given by

    λ(t|Z) = λ_0(t) exp(β^T Z).
If the model is correct, then the hazard at time t for an individual whose covariate vector is
Z = z* = (z*_1, ..., z*_q)^T is

    λ(t|Z = z*) = λ_0(t) exp(β^T z*).

Note: We use z* here to emphasize that we are particularly interested in the survival function
for a randomly sampled subject with this particular covariate value. It should not be confused
with the covariate values for the subjects in the study sample, where we will use Z_i to indicate
the covariate values for subject i.
Because of the relationship of hazard to survival, this would imply that the survival
distribution given Z = z* is

    S(t|Z = z*) = e^{−Λ(t|Z = z*)},

where Λ(t|Z = z*) is the cumulative hazard function given Z = z*, i.e.,

    Λ(t|Z = z*) = ∫_0^t λ(u|Z = z*) du.
For the proportional hazards model,

    Λ(t|Z = z*) = ∫_0^t λ(u|Z = z*) du
                = ∫_0^t λ_0(u) exp(β^T z*) du
                = exp(β^T z*) ∫_0^t λ_0(u) du
                = exp(β^T z*) Λ_0(t),

where Λ_0(t) is the cumulative baseline hazard function; i.e.,

    Λ_0(t) = ∫_0^t λ_0(u) du.
Consequently, the survival function given Z = z* is

    S(t|Z = z*) = e^{−exp(β^T z*) Λ_0(t)}.

This means that in order to estimate S(t|Z = z*), we only need to estimate β and Λ_0(t).
The parameter β can be estimated by the MPLE β̂ from the partial likelihood. So we only need
to get an estimate Λ̂_0(t) of Λ_0(t). Then the estimate of S(t|Z = z*) would be given by

    Ŝ(t|Z = z*) = e^{−exp(β̂^T z*) Λ̂_0(t)}.
Note: We could choose any combination of the covariates z* and find the corresponding
estimate of the survival distribution for such a z*.

Caution: Of course, all of this is predicated on the assumption that the proportional hazards
model is an adequate representation of the data structure. We would not try to extrapolate
these results to combinations of the covariates outside the range of the data, even if the
proportional hazards model was a reasonable fit to the observed data.
We are left with the task of finding a reasonable estimate for the cumulative baseline hazard
function

    Λ_0(t) = ∫_0^t λ_0(u) du.

The logic for finding an estimate of the cumulative hazard function Λ_0(t) in a proportional
hazards model will be similar to that which was used in Chapter 2 to derive the Nelson-Aalen
estimate of the cumulative hazard function in the one-sample problem.

Recall that we divided the time axis into a grid of points using an increasingly fine partition:

Figure 9.1: Partition of the time axis (patient time)
In the one-sample problem, all individuals in the sample have the same hazard of failing,
implying the same cause-specific hazard (non-informative censoring). An estimate of λ(x)Δx
was obtained by using

    dN(x) / Y(x) = (# of individuals in the sample observed to die in [x, x + Δx))
                   / (# of individuals in the sample at risk at time x).

Since

    Λ(t) ≈ Σ_{x<t} λ(x)Δx,

this led us to the Nelson-Aalen estimate of Λ(t):

    Λ̂(t) = Σ_{x<t} dN(x) / Y(x).
In a proportional hazards model, the individuals in the sample do not have the same hazard
of failing at time x, but rather have a hazard which depends on their covariate values. That is,
the ith individual, with covariate values Z_i = (z_{i1}, ..., z_{iq})^T, has hazard

    λ_i(t) = λ_0(t) exp(β^T Z_i).

Consequently, if we define the past history of failures, censoring, and covariates before time
x by F(x), then

    E[dN_i(x) | F(x)] = Y_i(x) λ_i(x) Δx
                      = λ_0(x) exp(β^T Z_i) Y_i(x) Δx.
Now dN(x) = Σ_{i=1}^n dN_i(x) is the number of deaths in [x, x + Δx) for our sample, and

    E[dN(x) | F(x)] = E[ Σ_{i=1}^n dN_i(x) | F(x) ]
                    = Σ_{i=1}^n E[dN_i(x) | F(x)]
                    = Σ_{i=1}^n λ_i(x) Y_i(x) Δx
                    = λ_0(x) Δx Σ_{i=1}^n exp(β^T Z_i) Y_i(x).
Therefore it seems reasonable to estimate λ_0(x)Δx by using

    dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x).
Hence if we wanted to estimate

    Λ_0(t) ≈ Σ_{x<t} λ_0(x)Δx,

we would use

    Σ_{x<t} [ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) ].
Note: If all the β's were equal to zero (i.e., no relationship of hazard to the covariates), then
the previous formula would reduce to

    Σ_{x<t} [ dN(x) / Y(x) ],

giving us back the Nelson-Aalen estimator.
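This reduction can be checked numerically. The sketch below, assuming R's survival package
and using simulated data, compares the estimator from a Cox model with no covariates to the
Nelson-Aalen estimator computed by hand.

library(survival)
set.seed(4)
d <- data.frame(time = rexp(50), status = rbinom(50, 1, 0.7))
fit0 <- coxph(Surv(time, status) ~ 1, data = d)  # no covariates: beta = 0
H.breslow <- basehaz(fit0)                       # estimated cumulative hazard
sf <- survfit(Surv(time, status) ~ 1, data = d)
H.na <- cumsum(sf$n.event / sf$n.risk)           # Nelson-Aalen by hand
# H.breslow$hazard agrees with H.na at the event times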
The property that the estimate

    Σ_{x<t} [ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) ]

is approximately unbiased for Λ_0(t) follows from logic similar to that used for the Nelson-Aalen
estimator:

    E{ Σ_{x<t} [ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) ] }
        = Σ_{x<t} E[ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) ]
        = Σ_{x<t} E( E[ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) | F(x) ] ).
In the inner expectation, the denominator

    Σ_{i=1}^n exp(β^T Z_i) Y_i(x)

is fixed conditional on F(x); therefore, the inner expectation is equal to

    E[ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) | F(x) ]
        = E[dN(x) | F(x)] / Σ_{i=1}^n exp(β^T Z_i) Y_i(x)
        = E[ Σ_i dN_i(x) | F(x) ] / Σ_{i=1}^n exp(β^T Z_i) Y_i(x)
        = λ_0(x)Δx Σ_{i=1}^n exp(β^T Z_i) Y_i(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x)
        = λ_0(x)Δx.
Since λ_0(x)Δx is not a random variable, the outer expectation is also λ_0(x)Δx. Consequently,
the total expectation is

    E{ Σ_{x<t} [ dN(x) / Σ_{i=1}^n exp(β^T Z_i) Y_i(x) ] } = Σ_{x<t} λ_0(x)Δx ≈ Λ_0(t).
The formula above involves the parameter vector β, which also needs to be estimated from
the data. Substituting the MPLE β̂ yields an estimator of the cumulative baseline hazard
function given by

    Λ̂_0(t) = Σ_{x<t} [ dN(x) / Σ_{i=1}^n exp(β̂^T Z_i) Y_i(x) ],

which is referred to as the Breslow estimator (Breslow, 1972).
Therefore, if we wanted to estimate the survival function for an individual with covariate
vector z* = (z*_1, ..., z*_q)^T, we could use

    Ŝ(t|z*) = exp{ −Λ̂_0(t) exp(β̂^T z*) }.
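In R this construction is available through basehaz() and survfit() applied to a coxph fit,
which use a Breslow-type estimator of the baseline cumulative hazard. A sketch with simulated
stand-in data, using the same four covariate combinations as the SAS program below:

library(survival)
set.seed(2)
d <- data.frame(days = rexp(300, 1/1000), cens = rbinom(300, 1, 0.6),
                nodes = rpois(300, 6) + 1, er = rbinom(300, 1, 0.6))
fit <- coxph(Surv(days, cens) ~ nodes + er, data = d)
H0  <- basehaz(fit, centered = FALSE)   # estimated baseline cumulative hazard
# Estimated survival curves for the four covariate combinations used below
newd <- data.frame(nodes = c(1, 1, 10, 10), er = c(0, 1, 0, 1))
sf <- survfit(fit, newdata = newd)
plot(sf, lty = 1:4, xlab = "Survival time in days",
     ylab = "Survival probabilities")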
Standard errors for Ŝ(t|z*) and confidence intervals for S(t|z*) can also be obtained. These
formulas are a bit complex and will be derived in the Advanced Survival Analysis class. The
large sample properties of Ŝ(t|z*) were derived by Tsiatis (1981) and, by the use of counting
processes, by Andersen and Gill (1982).

Survival estimates for the proportional hazards model are given in SAS. Enclosed is an
example where we consider nodal status and ER status for the CALGB 8082 data. This is only
for illustrative purposes; a more complete analysis should include all important prognostic
factors.

Appendix: SAS program

The following is a SAS program to estimate survival functions for some combinations of
covariates:
options ps=72 ls=72;
data bcancer;
infile "cal8082.dat";
input days cens trt meno tsize nodes er;
trt = trt - 1;
label days="(censored) survival time in days"
cens="censoring indicator"
trt="treatment"
meno="menopausal status"
tsize="size of largest tumor in cm"
nodes="number of positive nodes"
er="estrogen receptor status"
trt="treatment indicator";
run;
data bcancer1; set bcancer;
if nodes = 1 or nodes = 10;
if nodes=1 and er=0 then cat=1;
if nodes=1 and er=1 then cat=2;
if nodes=10 and er=0 then cat=3;
if nodes=10 and er=1 then cat=4;
if nodes=. then delete;
run;
data covars;
input nodes er;
cards;
1 0
1 1
10 0
10 1
;
title "Get survival estimate for each combination of nodes and er";
proc phreg data=bcancer;
model days*cens(0) = nodes er;
baseline out=a covariates=covars survival=s/nomean;
run;
data a1; set a;
if nodes=1 and er=0 then cat=1;
if nodes=1 and er=1 then cat=2;
if nodes=10 and er=0 then cat=3;
if nodes=10 and er=1 then cat=4;
run;
title "KM estimates for each category";
proc lifetest plots=(s) graphics notable data=bcancer1;
time days*cens(0);
strata nodes er;
symbol1 v=none color=black line=1;
symbol2 v=none color=black line=2;
symbol3 v=none color=black line=3;
symbol4 v=none color=black line=4;
run;
title "Survival estimates for each category";
proc gplot data=a1;
plot s*days=cat;
symbol1 interpol=join color=black line=1;
symbol2 interpol=join color=black line=2;
symbol3 interpol=join color=black line=3;
symbol4 interpol=join color=black line=4;
run;
data _null_; set bcancer1;
file "cat.dat";
put days cens cat;
run;
data _null_; set a1;
file "estsurv.dat";
put days s cat;
run;
The following two graphs are generated using the following R functions:
postscript(file="estsurv1.ps", horizontal = F, height=6, width=8.5)
par(mfrow=c(1,2))
# Left panel: Kaplan-Meier estimates by category
dat <- read.table(file="cat.dat", col.names=c("days", "cens", "cat"))
fit <- survfit(Surv(days, cens) ~ cat, dat)
plot(fit, xlab="Survival time in days", ylab="Survival probabilities",
     lty=c(1,2,3,4))
# Right panel: model-based survival estimates exported from proc phreg
dat <- read.table(file="estsurv.dat", col.names=c("days", "sprob", "cat"))
plot(0,0, xlab="Survival time in days", ylab="Survival probabilities",
     pch=" ", xlim=c(0, max(dat$days)), ylim=c(0,1))
for (i in 1:4){
  lines(dat$days[dat$cat==i], dat$sprob[dat$cat==i], lty=i)}
legend(10, 0.2, c("cat 1", "cat 2", "cat 3", "cat 4"), lty=1:4, cex=0.8)
dev.off()
Figure 9.2: KM estimate (left) and estimated survival curve (right) for each category (x-axis: survival time in days; y-axis: survival probabilities; legend: cat 1 - cat 4)
10 Time Dependent Covariates
Since survival data occur over time, important covariates we wish to consider may also change
over time. We refer to these as time-dependent covariates. Examples of such covariates are:

- cumulative exposure to some risk factor,
- smoking status,
- heart (kidney) transplant status: 0 prior to heart (kidney) transplant, 1 after heart (kidney)
  transplant,
- blood pressure.

We may have a vector of such covariates, which for the ith individual in our sample we denote
by Z_i(t) = (Z_{i1}(t), ..., Z_{iq}(t))^T, corresponding to the value of these covariates at time t.
This notation allows us to use time-independent covariates as well. For example, if the jth
covariate is time-independent, then Z_{ij}(t) is constant over time.
Modeling the hazard rate is a natural way of thinking about time-dependent covariates. If
we let Z^H_i(t) denote the history of the vector of time-dependent covariates up to time t, i.e.,
Z^H_i(t) = {Z_i(u), 0 ≤ u ≤ t}, then we can define the hazard rate at time t conditional on this
history by

    λ(t|Z^H_i(t)) = lim_{h→0} P[t ≤ T_i < t + h | T_i ≥ t, Z^H_i(t)] / h.

This is the instantaneous rate of failure at time t, given the individual was at risk at time t
with history Z^H_i(t). For such a conditional hazard rate, we may consider a proportional
hazards model

    λ(t|Z^H_i(t)) = λ_0(t) exp(β^T g(Z^H_i(t))),

where g(Z^H_i(t)) is a vector of functions of the history of the covariates that we feel may affect
the hazard.
For example, one choice is to use

    g(Z^H_i(t)) = Z_i(t).

If we assume that

    λ(t|Z^H_i(t)) = λ_0(t) exp(β^T Z_i(t)),

then implicitly we would be assuming that the hazard rate at time t, given the entire history of
the covariates up to time t, is affected only by the current values of the covariates at time t.
This, of course, may or may not be true. Some thought should be given when entertaining the
use of these models.
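In R's survival package, a time-dependent covariate of this current-value type is typically
handled through the counting-process form Surv(start, stop, event), with one row per interval
over which the covariate is constant. A simulated sketch, with all names hypothetical:

library(survival)
set.seed(3)
n <- 100
X  <- 100 + rexp(n, 1/200)      # observed times, all beyond day 100
dlt <- rbinom(n, 1, 0.6)        # failure indicators
z1 <- runif(n); z2 <- runif(n)  # covariate value before/after day 100
# Two rows per subject: (0, 100] with z1, then (100, X] with z2;
# the event indicator is attached to the last interval only
tdcov <- data.frame(id    = rep(1:n, each = 2),
                    start = rep(c(0, 100), n),
                    stop  = as.vector(rbind(100, X)),
                    event = as.vector(rbind(0, dlt)),
                    z     = as.vector(rbind(z1, z2)))
fit <- coxph(Surv(start, stop, event) ~ z, data = tdcov)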
For example, suppose we want to consider the effect of exposure to asbestos over time on
mortality. A sample of workers in a factory where asbestos is made were monitored for a period
of time, and data were collected on survival and asbestos exposure. For the ith individual in
the sample, the data can be summarized as

    (X_i, Δ_i, Z^H_i(X_i)),

where

- X_i = min(T_i, C_i) is the observed survival time or censoring time,
- Δ_i = I(T_i ≤ C_i) is the failure indicator,
- Z^H_i(X_i) is the history of asbestos exposure up to time X_i. This may, for example, be daily
  exposure collected on that individual every six months. These data may be collected up to
  the point that the patient dies, or until he/she is censored, or until he/she stops working at
  the factory.
PAGE 217
CHAPTER 10 ST 745, Daowen Zhang
Suppose we wish to consider the following proportional hazards model with time-dependent covariates:
$$\lambda(t|Z_i^H(t)) = \lambda_0(t)\exp\left(\beta^T g(Z_i^H(t))\right).$$
What should we use for the function $g(Z_i^H(t))$?
1. We may use cumulative exposure (which needs extrapolation), i.e.,
$$g(Z_i^H(t)) = \sum_j Z_i(u_{ij})(u_{ij} - u_{i(j-1)}),$$
where the $u_{ij}$ are the days at which measurements were made prior to day $t$.

2. We may use average exposure up to time $t$,
$$g(Z_i^H(t)) = \frac{\sum_{u_{ij} < t} Z_i(u_{ij})}{\#\text{ of measurements up to } t}.$$

3. We may use maximum exposure up to time $t$,
$$g(Z_i^H(t)) = \max\{Z_i(u_{ij}) : u_{ij} < t\}.$$
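To make these summaries concrete, here is a minimal R sketch for a single subject; the measurement times u (in days) and exposure levels z below are hypothetical, and the forward-extrapolation convention used in the cumulative sum is only one of several reasonable choices:

# Hypothetical measurement times (days) and exposure levels for one subject.
u <- c(0, 180, 360, 540)
z <- c(1.2, 0.8, 1.5, 1.1)

# Returns the three history summaries g(Z^H(t)) discussed above.
g.exposure <- function(t, u, z) {
  keep <- u < t                      # measurements made prior to day t
  uu <- u[keep]; zz <- z[keep]
  gaps <- diff(c(uu, t))             # time spent at each recorded level
  c(cumulative = sum(zz * gaps),     # choice 1 (extrapolated step function)
    average    = mean(zz),           # choice 2
    maximum    = max(zz))            # choice 3
}
g.exposure(400, u, z)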
We may also want to consider models such as
$$\lambda(t|Z_i^H(t)) = \lambda_0(t)\exp\left(\beta_1 g_1(Z_i^H(t)) + \beta_2 g_2(Z_i^H(t))\right),$$
where $g_1(Z_i^H(t))$ = cumulative exposure up to time $t$ and $g_2(Z_i^H(t))$ = maximum exposure up to time $t$.

This model may be used if we think that both of these components of the asbestos history may have an effect on survival. It also allows us to test whether these different components of history are important for survival by testing whether the parameters $\beta_1$ or $\beta_2$ are significantly different from zero.

A cautionary note must be made when interpreting hazard rates with time-dependent covariates: the hazard function with time-dependent covariates may NOT necessarily be used to construct survival distributions.
For example, if we have a time-independent covariate $Z$, then the conditional survival distribution
$$S(t|Z) = P[T \ge t|Z] = e^{-\int_0^t \lambda(u|Z)\,du}$$
is well defined and meaningful. But the distribution
$$S(t|Z^H(t)) = P[T \ge t|Z^H(t)]$$
may not make any sense, by the very fact that $Z^H(t)$ can only be measured while an individual is alive at time $t$.
It is useful to dierentiate between internal and external time-dependent covariates for this
purpose.
1. An internal time-dependent covariate is one where the change of the covariate over time is
related to the behavior of the individual. For example, blood pressure, disease complica-
tions, etc.
2. An external or ancillary time-dependent covariate is one whose path is generated externally.
For example, levels of air pollution.
For an external time-dependent covariate, we can imagine a process which generates the time-dependent covariate over time. Therefore, for a particular realization of the process, $Z^H(\cdot)$, we can imagine that the following quantity exists:
$$\lambda(t|Z_i^H(\cdot)) = \lim_{h \to 0} \frac{P[t \le T_i < t+h \mid T_i \ge t,\, Z_i^H(\cdot)]}{h},$$
and we may be willing to assume that
$$\lambda(t|Z_i^H(\cdot)) = \lambda(t|Z_i^H(t)), \quad \text{for any } t > 0.$$
Therefore, if we have an external time-dependent covariate, we can ask what the survival distribution at time $t$ is, given the external process which generated $Z^H(\cdot)$:
$$S(t|Z_i^H(\cdot)) = \exp\left[-\int_0^t \lambda(u|Z_i^H(\cdot))\,du\right] = \exp\left[-\int_0^t \lambda(u|Z_i^H(u))\,du\right].$$
For internal time-dependent covariates, this conceptualization would not make sense, although the relationship of the history of the covariate process to the hazard rate does have a useful interpretation.

Once we decide on a proportional hazards model with time-dependent covariates, the estimation of the regression parameters in the model, as well as of the underlying cumulative hazard function (for an external time-dependent covariate), creates no additional difficulties. That is, we can use the theory developed so far for time-independent covariates with only slight modification.
For example, if we consider the model
$$\lambda(t|Z^H(t)) = \lambda_0(t)\exp\left(\beta^T Z(t)\right),$$
then the partial likelihood function of $\beta$ for this model is given by
$$PL(\beta) = \prod_u \left[\frac{\exp\left(\beta^T Z_{I(u)}(u)\right)}{\sum_{l=1}^n \exp\left(\beta^T Z_l(u)\right) Y_l(u)}\right]^{dN(u)},$$
where $I(u)$ is the indicator variable that identifies the individual label $\in \{1, 2, \ldots, n\}$ of the individual who dies at time $u$.

This formula for the partial likelihood looks almost identical to the one derived for time-independent covariates.

The only difference is that at time $u$, the values of the time-dependent covariates at time $u$ are used, both for the individual who dies at that time and for the individuals who are in the risk set at that time. Therefore, the same individual appearing in different risk sets would use the possibly different values of their covariates at those risk sets.
Estimates, standard errors, tests and all other statistical properties would then follow exactly as they did before. That is, we would compute the MPLE by maximizing the log partial likelihood given above. The score vector and the information matrix can be obtained as the first derivative and minus the second derivative of the log partial likelihood. Wald, score and likelihood ratio tests can be computed analogously.

The major difficulty with time-dependent covariates in a proportional hazards model is computing and storage. Theoretically, at each death time we need to know the exact value of the covariate at that death time for ALL individuals at risk. The management, collection and storage of such data can create some difficulties, whereas the theory is no more difficult than with time-independent covariates.

SAS has some very nice software for handling time-dependent covariates.
Example 1: Time-varying Smoking Data

Suppose we have a small data set as follows:
ID time status z1 z2 z3 z4
1 2 1 1 . . .
2 4 1 1 1 . .
3 5 1 0 1 0 .
4 7 0 1 0 1 .
5 8 1 1 0 0 1
and we assume a proportional hazards model with time-varying smoking status:
$$\lambda(t|z_i(t)) = \lambda_0(t)e^{\beta z_i(t)},$$
where $z_i(t)$ is the smoking status for subject $i$ at time $t$. Then the partial likelihood function of $\beta$ using the above data is
$$L(\beta; x, \delta, z(t)) = \frac{e^\beta}{1+4e^\beta} \cdot \frac{e^\beta}{2+2e^\beta} \cdot \frac{1}{2+e^\beta}.$$
Figure 10.1: Log partial likelihood function of $\beta$. [Plot: log partial likelihood, ranging from about -4.8 to -4.0, vs. $\beta$ in $[-1, 1]$.]
The log partial likelihood function of $\beta$ shown in Figure 10.1 was generated using the following R functions:
postscript(file="tvlik.ps", horizontal = F,
height=6, width=8.5)
# par(mfrow=c(1,2))
x <- seq(-1, 1, length=100)
y <- exp(x)
y <- 2*x - log(1+4*y)-log(2+2*y)-log(2+y)
plot(x, y, type="l", ylab="Log partial likelihood")
box()
dev.off()
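As a small addition (not part of the original analysis), the same log partial likelihood can also be maximized numerically in R; the bracketing interval below is just an illustrative choice:

# Maximize the log partial likelihood derived above for the smoking example.
logpl <- function(b) 2*b - log(1 + 4*exp(b)) - log(2 + 2*exp(b)) - log(2 + exp(b))
opt <- optimize(logpl, interval = c(-5, 5), maximum = TRUE)
opt$maximum   # the MPLE of beta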
The above model can be fit using the following SAS program:
options ls=72 ps=72;
data smoking;
input time status z1-z4;
cards;
2 1 1 . . .
4 1 1 1 . .
5 1 0 1 0 .
7 0 1 0 1 .
8 1 1 0 0 1
;
proc phreg;
model time*status(0) = smoke;
array tt{*} t1-t4;
array zz{*} z1-z4;
t1 = 2;
t2 = 4;
t3 = 5;
t4 = 8;
do i=1 to 4;
if time=tt[i] then smoke=zz[i];
end;
run;
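For comparison, here is a hedged R sketch of the same fit using the counting-process (start, stop] formulation of coxph; the long-format rows below are a hypothetical re-expression of the table above, splitting each subject at the event times 2, 4, 5 and 8:

library(survival)

# One row per (start, stop] interval over which the smoking status is constant.
smoke <- data.frame(
  id     = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5),
  start  = c(0, 0, 2, 0, 2, 4, 0, 2, 4, 0, 2, 4, 5),
  stop   = c(2, 2, 4, 2, 4, 5, 2, 4, 7, 2, 4, 5, 8),
  status = c(1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1),
  z      = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1)
)

fit <- coxph(Surv(start, stop, status) ~ z, data = smoke)
summary(fit)   # should agree with maximizing the partial likelihood above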
Example 2: Heart-Transplant Data
Problem: We want to evaluate whether patients receiving heart transplants benefit through increased survival.
Experiment: A group of patients who are eligible for heart transplants is recruited. However, a heart has to become available, and then the patient with the closest match receives this heart.
Question: How do we evaluate the effectiveness of the heart transplant?
Here are some early attempts at answering this question in the medical literature.
1. Identify the patients that received a heart transplant and those that did not; measure
their survival times from the time they entered the study and compare the survival times
between these two groups using, say, a log rank test.
Comment: Patients that died early will not have the chance to receive a heart transplant.
Thus the two groups being compared are selectively biased favoring the heart transplant
patients.
2. Identify patients that received a heart transplant and those that did not; measure survival
times for heart transplant patients as time from transplant to death, and for the patients
that did not receive a heart transplant measure survival times as time from entry to the
study until death; compare two groups.
Comment: Now the bias may go in the other direction.
Preferred Answer
Let $Z_i(t)$ denote the heart transplant indicator for patient $i$ at time $t$. That is,
$$Z_i(t) = \begin{cases} 1 & \text{if patient } i \text{ received a heart transplant by time } t \\ 0 & \text{otherwise,} \end{cases}$$
where time $t$ is measured from entry into the study.

Then consider the proportional hazards model with time-dependent covariate $Z_i(t)$:
$$\lambda(t|Z_i^H(t)) = \lambda_0(t)\exp(\beta Z_i(t)).$$
This model assumes that the hazard at time $t$ changes by the factor $\exp(\beta)$ after a heart transplant compared to before. Therefore,

- $\beta = 0$ implies no effect on survival due to the heart transplant.
- $\beta < 0$ implies the heart transplant is beneficial (hazard decreases).
- $\beta > 0$ implies the heart transplant is detrimental (hazard increases).

The situation is illustrated by Figure 10.2.
Figure 10.2: Illustration of the effect of heart transplant. [Plot: log hazard ratio vs. days since entry to study (100 to 400), with a jump in the log hazard at the time of heart transplant.]
If we define a variable wait in our data set as the time, say, in days from entry into the study until receipt of a heart transplant (with wait = . if there is no heart transplant), then we can use Proc Phreg in SAS to fit the above model. Specifically,
Proc Phreg data=mydata;
model days*cens(0) = plant;
if wait>days or wait=. then
plant = 0;
else
plant = 1;
run;
Notice that the covariate plant represents the time-dependent covariate we defined in the above model and is defined after the model statement in Proc Phreg. The variable days in the model statement is a running variable in SAS used to define the risk sets over time, making the variable plant a time-dependent covariate. Therefore, we cannot use the same if-then-else statement in a Data step to define this time-dependent covariate.
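A hedged R analogue (a sketch, under the assumption that mydata contains days, cens and wait, with wait missing when no transplant occurred) splits each transplanted patient into a pre-transplant and a post-transplant record:

library(survival)

# Pre-transplant experience: plant = 0 from entry until transplant (or until
# death/censoring for patients never transplanted; ties wait == days are
# treated as pre-transplant here).
pre <- transform(mydata,
                 tstart = 0,
                 tstop  = ifelse(!is.na(wait) & wait < days, wait, days),
                 status = ifelse(!is.na(wait) & wait < days, 0, cens),
                 plant  = 0)

# Post-transplant experience: plant = 1 from transplant until death/censoring.
post <- subset(transform(mydata,
                         tstart = wait, tstop = days,
                         status = cens, plant = 1),
               !is.na(wait) & wait < days)

long <- rbind(pre, post)
fit <- coxph(Surv(tstart, tstop, status) ~ plant, data = long)
summary(fit)

The survival package's tmerge() automates this kind of data splitting; the manual version above just makes the construction explicit.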
Note: The model we described assumes that the benefit (if $\beta < 0$) or harm (if $\beta > 0$) of a heart transplant takes effect immediately after the transplant. This assumption may not be reasonable in practice. In fact, the hazard may increase right after a heart transplant because of complications due to the transplant and then begin to decrease steadily.
The use of time-dependent covariates allows us to relax the proportional hazards assumption, as well as giving us a framework for testing the adequacy of that assumption.
For example, suppose we have a covariate $Z$ (say it is time-independent) and we entertain the proportional hazards model
$$\lambda(t|Z) = \lambda_0(t)\exp(\beta Z).$$
As we know, this assumption implies that
$$\frac{\lambda(t|Z_1)}{\lambda(t|Z_0)} = \exp\left(\beta(Z_1 - Z_0)\right).$$
Suppose we wanted to test whether the hazard ratio changes over time. Consider the following model:
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta Z + \theta Z g(t)\right),$$
where $g(t)$ is some specified function of time chosen by the data analyst. For example, we may choose $g(t) = \log(t)$.

Note: We must not include a main effect of $g(t)$, since such a main effect would be absorbed into $\lambda_0(t)$, making it unidentifiable.
The term $Zg(t)$ is an interaction term between the covariate $Z$ and some function $g(t)$ of time. For such a model the log hazard ratio is
$$\log\left[\frac{\lambda(t|Z_1)}{\lambda(t|Z_0)}\right] = \log\left[\frac{\lambda_0(t)\exp(\beta Z_1 + \theta Z_1 g(t))}{\lambda_0(t)\exp(\beta Z_0 + \theta Z_0 g(t))}\right] = (Z_1 - Z_0)\left(\beta + \theta g(t)\right).$$
This model allows the hazard ratio to change over time, giving us greater flexibility than the proportional hazards assumption. In addition, testing whether or not $\theta$ is significantly different from zero gives us the opportunity to evaluate the proportional hazards assumption.
The model
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta Z + \theta Z g(t)\right)$$
can be viewed as a proportional hazards model with two covariates:

1. the time-independent covariate $Z$,
2. the time-dependent covariate $g(t)Z$.

The term $g(t)Z$ is a simple example of an external or ancillary time-dependent covariate defined by the data analyst.
In SAS such a model is easy to implement. For example, suppose days, cens and a time-independent covariate z are defined in a data set; then we can use the following SAS code:

Proc Phreg data=mydata;
model days*cens(0) = z zlogt;
zlogt = z*log(days);
run;
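In R, one way to fit the same interaction (a sketch, assuming a data frame mydata with days, cens and z) is the tt argument of coxph, which evaluates the time-dependent term at each event time:

library(survival)

# z + z*log(t): the tt() term plays the role of the zlogt covariate above.
fit <- coxph(Surv(days, cens) ~ z + tt(z), data = mydata,
             tt = function(x, t, ...) x * log(t))
summary(fit)   # the tt(z) coefficient is the interaction parameter theta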
In CALGB 8082, we found that nodes was the most significant prognostic factor for survival. Let us check whether the proportional hazards assumption is a reasonable representation of this relationship using $g(t) = \log(t)$. Of course, we might use other functions $g(t)$, such as $g(t) = t$, $e^t$, etc.
title "Test PH for nodes effect using g(t)=log(t)";
proc phreg data=bcancer;
model days*cens(0) = nodes nodelogt;
nodelogt = nodes*log(days+1);
run;
********************************************************************************
Test PH for nodes effect using g(t)=log(t)
09:14 Sunday, April 17, 2005
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
905 490 415 45.86
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6264.861 6180.005
AIC 6264.861 6184.005
SBC 6264.861 6192.394
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 84.8564 2 <.0001
Score 116.8580 2 <.0001
Wald 114.4077 2 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
nodes 1 0.08246 0.03630 5.1613 0.0231
nodelogt 1 -0.00365 0.00521 0.4911 0.4834
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
nodes 1.086 number of positive nodes
nodelogt 0.996
From the SAS output, there does not seem to be any problem with the proportional hazards
assumption.
The model we used assumes a specific departure away from the proportional hazards assumption. That is,
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta Z + \theta Z\log(t)\right).$$
If proportional hazards fails in a fashion different from that assumed in the above model, then the test we proposed may not detect such a deviation. Another approach is to use a more omnibus type of alternative away from the hypothesis of proportional hazards. This can be accomplished by the use of indicator functions over intervals of time.

We first partition our time axis into $K$ intervals by choosing $(K-1)$ time points $0 < \tau_1 < \cdots < \tau_{K-1}$, and then define the following indicator functions:
$$I_1(t) = I[t \in [0, \tau_1)],$$
$$I_2(t) = I[t \in [\tau_1, \tau_2)],$$
$$\vdots$$
$$I_{K-1}(t) = I[t \in [\tau_{K-2}, \tau_{K-1})],$$
$$I_K(t) = I[t \in [\tau_{K-1}, \infty)].$$
We then define the model
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta Z + \gamma_1 Z I_1(t) + \cdots + \gamma_{K-1} Z I_{K-1}(t)\right).$$

Note: We include $K-1$ interaction terms between the covariate $Z$ and the indicator functions of the time intervals. We must exclude one indicator function to avoid overparametrization.

For such a model, we have
$$\log\left[\frac{\lambda(t|Z_1)}{\lambda(t|Z_0)}\right] = (Z_1 - Z_0)\left[\beta + \gamma_1 I_1(t) + \cdots + \gamma_{K-1} I_{K-1}(t)\right].$$
Thus the log hazard ratio in each interval would be
$$(Z_1 - Z_0)\beta \quad \text{if } t \ge \tau_{K-1} \text{ (reference time interval)},$$
$$(Z_1 - Z_0)(\beta + \gamma_1) \quad \text{if } 0 \le t < \tau_1,$$
$$\vdots$$
$$(Z_1 - Z_0)(\beta + \gamma_{K-1}) \quad \text{if } \tau_{K-2} \le t < \tau_{K-1}.$$
An omnibus test of the proportional hazards assumption can be obtained by testing the hypothesis
$$H_0: \gamma_1 = \cdots = \gamma_{K-1} = 0.$$
This yields a chi-square test with $(K-1)$ degrees of freedom. As always, we can use the score test, Wald test, or likelihood ratio test to test $H_0$.

Note: If we find significant deviation from proportional hazards, then plotting $\hat\gamma_1, \ldots, \hat\gamma_{K-1}$ vs. their respective time intervals may suggest the functional form of the deviation over time.
For example, for the breast cancer data, if we consider the following proportional hazards model
$$\lambda(t|z) = \lambda_0(t)e^{\beta_1\,\mathrm{trt} + \beta_2\,\mathrm{nn}},$$
we can test the PH assumption both for treatment and for the number of nodes using the above idea:
title "Test PH for treatment effect using dummy";
proc phreg data=bcancer;
model days*cens(0) = nodes trt1 d1 d2 d3 d4/selection=forward
include=2 detail sle=0;
if days<1000 then do;
d1=trt1; d2=0; d3=0; d4=0;
end;
else if days<2000 then do;
d1=0; d2=trt1; d3=0; d4=0;
end;
else if days<3000 then do;
d1=0; d2=0; d3=trt1; d4=0;
end;
else if days<4000 then do;
d1=0; d2=0; d3=0; d4=trt1;
end;
else do;
d1=0; d2=0; d3=0; d4=0;
end;
run;
title "Test PH for nodes effect using dummy";
proc phreg data=bcancer;
model days*cens(0) = nodes trt1 d1 d2 d3 d4/selection=forward
include=2 detail sle=0;
if days<1000 then do;
d1=nodes; d2=0; d3=0; d4=0;
end;
else if days<2000 then do;
d1=0; d2=nodes; d3=0; d4=0;
end;
else if days<3000 then do;
d1=0; d2=0; d3=nodes; d4=0;
end;
else if days<4000 then do;
d1=0; d2=0; d3=0; d4=nodes;
end;
else do;
d1=0; d2=0; d3=0; d4=0;
end;
run;
********************************************************************************
Test PH for treatment effect using dummy
09:14 Sunday, April 17, 2005
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
905 490 415 45.86
The following variable(s) will be included in each model:
nodes trt1
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6264.861 6180.264
AIC 6264.861 6184.264
SBC 6264.861 6192.653
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 84.5972 2 <.0001
Score 113.1765 2 <.0001
Wald 111.8468 2 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
nodes 1 0.05707 0.00542 110.9645 <.0001
trt1 1 0.04168 0.09048 0.2122 0.6450
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
nodes 1.059 number of positive nodes
trt1 1.043 treatment indicator
Analysis of Variables Not in the Model
Score
Variable Chi-Square Pr > ChiSq Label
d1 0.0058 0.9394
d2 0.0222 0.8817
d3 0.1955 0.6584
d4 0.1396 0.7087
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
0.2985 4 0.9899
Test PH for nodes effect using dummy
09:14 Sunday, April 17, 2005
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
905 490 415 45.86
The following variable(s) will be included in each model:
nodes trt1
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6264.861 6180.264
AIC 6264.861 6184.264
SBC 6264.861 6192.653
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 84.5972 2 <.0001
Score 113.1765 2 <.0001
Wald 111.8468 2 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
nodes 1 0.05707 0.00542 110.9645 <.0001
trt1 1 0.04168 0.09048 0.2122 0.6450
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
nodes 1.059 number of positive nodes
trt1 1.043 treatment indicator
Analysis of Variables Not in the Model
Score
Variable Chi-Square Pr > ChiSq Label
d1 1.7974 0.1800
d2 0.4795 0.4887
d3 0.4462 0.5041
d4 0.7319 0.3923
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
6.8443 4 0.1443
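A hedged R sketch of the same idea, assuming the bcancer data frame from the SAS steps above: survSplit() cuts each subject's follow-up at the chosen time points, after which the K-1 dummies can be built by hand (the last interval, days >= 4000, is the reference, as in the SAS code):

library(survival)

sp <- survSplit(Surv(days, cens) ~ ., data = bcancer,
                cut = c(1000, 2000, 3000, 4000), episode = "interval")

# trt1 x time-interval dummies; interval 5 (days >= 4000) is the reference.
for (k in 1:4) sp[[paste0("d", k)]] <- sp$trt1 * (sp$interval == k)

fit <- coxph(Surv(tstart, days, cens) ~ nodes + trt1 + d1 + d2 + d3 + d4,
             data = sp)
summary(fit)   # Wald tests of d1-d4 assess PH for the treatment effect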
Score Test for Proportional Hazards
Let us consider the model with indicators for time intervals
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta Z + \gamma_1 Z I_1(t) + \cdots + \gamma_{K-1} Z I_{K-1}(t)\right).$$
We shall now derive the score test of the null hypothesis
$$H_0: \gamma_1 = \cdots = \gamma_{K-1} = 0,$$
i.e., the hypothesis of proportional hazards.

We use one of the representations of the partial likelihood of $(\beta, \gamma_1, \ldots, \gamma_{K-1})^T$:
$$PL(\beta, \gamma) = \prod_{i=1}^n \left[\frac{\exp\left(\beta Z_i + \sum_{j=1}^{K-1}\gamma_j Z_i I_j(X_i)\right)}{\sum_{l=1}^n \exp\left(\beta Z_l + \sum_{j=1}^{K-1}\gamma_j Z_l I_j(X_i)\right) Y_l(X_i)}\right]^{\Delta_i}.$$
The log partial likelihood is equal to
$$\ell(\beta, \gamma) = \sum_{i=1}^n \Delta_i\left[\beta Z_i + \sum_{j=1}^{K-1}\gamma_j Z_i I_j(X_i)\right] - \sum_{i=1}^n \Delta_i \log\left\{\sum_{l=1}^n \exp\left(\beta Z_l + \sum_{j=1}^{K-1}\gamma_j Z_l I_j(X_i)\right) Y_l(X_i)\right\}.$$
To evaluate the score test of the hypothesis
$$H_0: \gamma_1 = \cdots = \gamma_{K-1} = 0,$$
we need to compute the score vector with respect to the parameters $\gamma_1, \ldots, \gamma_{K-1}$ and evaluate it at the restricted MPLE under $H_0$:
$$\left.\frac{\partial\ell(\beta, \gamma)}{\partial\gamma_j}\right|_{(\hat\beta,\,\gamma_1=\cdots=\gamma_{K-1}=0)} = \sum_{i=1}^n \Delta_i\left[Z_i I_j(X_i) - \frac{\sum_{l=1}^n Z_l I_j(X_i)\exp(\hat\beta Z_l)Y_l(X_i)}{\sum_{l=1}^n \exp(\hat\beta Z_l)Y_l(X_i)}\right], \quad j = 1, \ldots, K-1.$$
Note: The restricted MPLE $\hat\beta(\gamma_1 = \cdots = \gamma_{K-1} = 0)$ is just the MPLE for the original proportional hazards model.
The score $\partial\ell(\beta, \gamma)/\partial\gamma_j$ can be rewritten as
$$\frac{\partial\ell(\beta, \gamma)}{\partial\gamma_j} = \sum_{i=1}^n I_j(X_i)\,\Delta_i\left[Z_i - \frac{\sum_{l=1}^n Z_l \exp(\hat\beta Z_l)Y_l(X_i)}{\sum_{l=1}^n \exp(\hat\beta Z_l)Y_l(X_i)}\right] = \sum_i \Delta_i\left[Z_i - \bar Z(X_i, \hat\beta)\right] I_j(X_i),$$
where the summation is over all individuals whose value $X_i$ is in the $j$th time interval, and $\bar Z(X_i, \hat\beta)$ is the weighted average of the covariate over the individuals at risk at time $X_i$.

The value $\Delta_i\left[Z_i - \bar Z(X_i, \hat\beta)\right]$ for individual $i$ is referred to as the Schoenfeld residual or score residual (note: this is not the score residual used by SAS); see Schoenfeld (1982), Biometrika. The keyword in SAS for the Schoenfeld residual is ressch, and SAS only calculates Schoenfeld residuals for individuals with $\Delta_i = 1$.
If we denote the Schoenfeld residual by $Sh_i$ for the $i$th individual, then the score can be written as
$$\frac{\partial\ell(\beta, \gamma)}{\partial\gamma_j} = \sum_i Sh_i\, I\left(X_i \in [\tau_{j-1}, \tau_j)\right), \quad j = 1, \ldots, K-1.$$
The basis for the score test is that under the null hypothesis we expect the score vector
$$\left(\frac{\partial\ell}{\partial\gamma_1}, \ldots, \frac{\partial\ell}{\partial\gamma_{K-1}}\right)^T$$
to have mean zero, and we reject $H_0$ when the score is not close to zero, i.e., we reject $H_0$ when the quadratic form based on the score vector is sufficiently large.
Therefore, Schoenfeld suggested that these residuals be plotted as a function of time; that is, $Sh_i$ vs. $X_i$.

If the proportional hazards assumption is adequate, then on average these residuals should be zero. A noticeable trend away from zero may be indicative of a lack of proportional hazards. The test we suggested earlier for $H_0: \gamma_1 = \cdots = \gamma_{K-1} = 0$ can be used as a formal goodness of fit test for proportional hazards.
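In R, cox.zph() packages this idea (a sketch, assuming the bcancer data frame): it computes a score-type test for proportional hazards based on scaled Schoenfeld residuals, with the time transform playing the role of $g(t)$:

library(survival)

fit <- coxph(Surv(days, cens) ~ trt1 + nodes, data = bcancer)
zp <- cox.zph(fit, transform = "log")   # g(t) = log(t)
print(zp)   # per-covariate and global tests of proportional hazards
plot(zp)    # smoothed scaled Schoenfeld residuals vs. time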
If we are using a model with many covariates,
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta_1 Z_1 + \cdots + \beta_q Z_q\right),$$
then we can compute Schoenfeld residuals for each of the covariates; i.e.,
$$Sh_{ij} = \Delta_i\left[Z_{ij} - \frac{\sum_{l=1}^n Z_{lj}\exp(\hat\beta^T Z_l)Y_l(X_i)}{\sum_{l=1}^n \exp(\hat\beta^T Z_l)Y_l(X_i)}\right], \quad j = 1, \ldots, q.$$
For each covariate $Z_j$, we can then plot
$$Sh_{ij} \text{ vs. } X_i, \quad j = 1, \ldots, q,$$
producing $q$ such plots.

The corresponding formal test for the $j$th covariate can be obtained by considering the model
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta^T Z + \gamma_1 Z_j I_1(t) + \cdots + \gamma_{K-1} Z_j I_{K-1}(t)\right),$$
and testing
$$H_0: \gamma_1 = \cdots = \gamma_{K-1} = 0.$$
Example: Breast cancer data revisited

Let us consider the following proportional hazards model:
$$\lambda(t|z) = \lambda_0(t)e^{\beta_1\,\mathrm{trt} + \beta_2\,\mathrm{nn}}.$$
Then we can use the following SAS program to output the Schoenfeld residual for each covariate:
title "residual analysis for treatment and nodes";
proc phreg data=bcancer;
model days*cens(0) = trt1 nodes;
output out=residout RESSCH=trtresid nodresid;
run;
proc gplot data=residout;
plot trtresid * days / vref=0;
symbol1 value=circle;
run;
proc gplot data=residout;
plot nodresid * days / vref=0;
symbol1 value=circle;
run;
Figure 10.3: Schoenfeld residuals for treatment (left) and nodes (right). [Two scatter plots of Schoenfeld residuals vs. survival time in days (0 to 5000): treatment residuals range roughly from -0.4 to 0.4, nodes residuals roughly from -10 to 40.]
Score Test of the Functional Form of the Covariate in a Proportional Hazards Model and Martingale Residuals

Previously we discussed an omnibus alternative model that allowed deviation from proportional hazards. This model tests for proportional hazards as well as giving us the motivation for considering Schoenfeld residuals.

We also consider ways of checking the adequacy of the covariate relationship in a proportional hazards model. This includes using higher order polynomials or dummy variables after discretizing the continuous covariate. Using the idea of discretization, we can formally develop a hierarchical model for testing the adequacy of the functional relationship to the covariate.

Suppose we entertain the model
$$\lambda(t|Z) = \lambda_0(t)\exp(\beta Z),$$
and we want to consider whether the relationship $\exp(\beta Z)$ is suitable. We then partition the covariate values of $Z$ into $K$ intervals by defining values $\zeta_1, \ldots, \zeta_{K-1}$ along the range of possible values of the covariate $Z$, and define the indicator functions
$$I_1(Z) = I[Z < \zeta_1],$$
$$I_2(Z) = I[\zeta_1 \le Z < \zeta_2],$$
$$\vdots$$
$$I_{K-1}(Z) = I[\zeta_{K-2} \le Z < \zeta_{K-1}],$$
$$I_K(Z) = I[Z \ge \zeta_{K-1}].$$
We then consider the model
$$\lambda(t|Z) = \lambda_0(t)\exp\left(\beta Z + \sum_{j=1}^{K-1}\gamma_j I_j(Z)\right).$$

Remark: We use $(K-1)$ indicator variables to avoid overparametrization.

This model is an omnibus alternative away from the null model
$$\lambda(t|Z) = \lambda_0(t)\exp(\beta Z).$$
A formal goodness of fit test can be derived to test the adequacy of the above null model by testing the hypothesis
$$H_0: \gamma_1 = \cdots = \gamma_{K-1} = 0$$
using the Wald, score or likelihood ratio test. These tests are asymptotically chi-square with $(K-1)$ d.f. if the null hypothesis is true.
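A hedged R sketch of this hierarchical comparison for the breast cancer data (the cut points below are illustrative, not the ones used in the SAS run that follows):

library(survival)

# Discretize nodes and test whether the indicator terms improve on the
# linear term; anova() gives the likelihood ratio test with K-1 d.f.
bcancer$ngrp <- cut(bcancer$nodes, breaks = c(-Inf, 1, 3, 6, 10, Inf))
fit0 <- coxph(Surv(days, cens) ~ trt1 + nodes, data = bcancer)
fit1 <- coxph(Surv(days, cens) ~ trt1 + nodes + ngrp, data = bcancer)
anova(fit0, fit1)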
The following program tests the linear nodes effect for the breast cancer data (dn1-dn4 are dummy variables for categories of nodes, defined in an earlier data step):
title "Test linear nodes effect using dummy";
proc phreg data=bcancer;
model days*cens(0) = nodes trt1 dn1 dn2 dn3 dn4/selection=forward
include=2 detail sle=0;
run;
********************************************************************************
Test linear nodes effect using dummy
09:35 Sunday, April 17, 2005
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
896 490 406 45.31
The following variable(s) will be included in each model:
nodes trt1
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6264.861 6180.264
AIC 6264.861 6184.264
SBC 6264.861 6192.653
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 84.5972 2 <.0001
Score 113.1765 2 <.0001
Wald 111.8468 2 <.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
nodes 1 0.05707 0.00542 110.9645 <.0001
trt1 1 0.04168 0.09048 0.2122 0.6450
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
nodes 1.059 number of positive nodes
trt1 1.043 treatment indicator
Analysis of Variables Not in the Model
Score
Variable Chi-Square Pr > ChiSq Label
dn1 5.2307 0.0222
dn2 0.9936 0.3189
dn3 0.2617 0.6089
dn4 4.7095 0.0300
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
9.6325 4 0.0471
The above output indicates that it may not be appropriate to assume a linear effect for the number of positive nodes.
Score Test for Covariate Relationship

Similar to the computation made before when we discussed the Schoenfeld residual, the log partial likelihood can be derived as
$$\ell(\beta, \gamma_1, \ldots, \gamma_{K-1}) = \sum_{i=1}^n \Delta_i\left[\beta Z_i + \sum_{j=1}^{K-1}\gamma_j I_j(Z_i)\right] - \sum_{i=1}^n \Delta_i \log\left\{\sum_{l=1}^n \exp\left(\beta Z_l + \sum_{j=1}^{K-1}\gamma_j I_j(Z_l)\right) Y_l(X_i)\right\}.$$
The $j$th element of the score vector is given by
$$\left.\frac{\partial\ell}{\partial\gamma_j}\right|_{(\hat\beta,\,\gamma_1=\cdots=\gamma_{K-1}=0)} = \sum_{i=1}^n \Delta_i\left[I_j(Z_i) - \frac{\sum_{l=1}^n I_j(Z_l)\exp(\hat\beta Z_l)Y_l(X_i)}{\sum_{l=1}^n \exp(\hat\beta Z_l)Y_l(X_i)}\right], \quad j = 1, \ldots, K-1.$$
This can be written as
$$\sum_{i=1}^n \Delta_i I_j(Z_i) - \sum_{i=1}^n\left[\frac{\sum_{l=1}^n \Delta_i I_j(Z_l)\exp(\hat\beta Z_l)Y_l(X_i)}{\sum_{r=1}^n \exp(\hat\beta Z_r)Y_r(X_i)}\right].$$
If we interchange the sums in the second term, it becomes
$$\sum_{l=1}^n I_j(Z_l)\exp(\hat\beta Z_l) \sum_{i=1}^n\left[\frac{\Delta_i I(X_i \le X_l)}{\sum_{r=1}^n \exp(\hat\beta Z_r)Y_r(X_i)}\right].$$
Note: Here we used the fact $Y_l(X_i) = I(X_i \le X_l)$, i.e., whether or not subject $l$ is still at risk at time $X_i$.

By reversing the index $l$ with the index $i$ we get
$$\sum_{i=1}^n I_j(Z_i)\exp(\hat\beta Z_i) \sum_{l=1}^n\left[\frac{\Delta_l I(X_l \le X_i)}{\sum_{r=1}^n \exp(\hat\beta Z_r)Y_r(X_l)}\right].$$
Obviously, the quantity
$$\sum_{l=1}^n\left[\frac{\Delta_l I(X_l \le X_i)}{\sum_{r=1}^n \exp(\hat\beta Z_r)Y_r(X_l)}\right]$$
is nothing but the Breslow estimator of the cumulative baseline hazard function evaluated at time $X_i$, i.e., $\hat\Lambda_0(X_i)$ (since $\Delta_l I(X_l \le X_i)$ indicates whether or not subject $l$ was observed to die at or before time $X_i$).
Therefore, the $j$th element of the score vector is equal to
$$\sum_{i=1}^n I_j(Z_i)\left[\Delta_i - \hat\Lambda_0(X_i)\exp(\hat\beta Z_i)\right].$$
The quantity
$$MR_i = \Delta_i - \hat\Lambda_0(X_i)\exp(\hat\beta Z_i)$$
for the $i$th individual is referred to as the martingale residual. Therefore, the score vector has as its $j$th element
$$\frac{\partial\ell}{\partial\gamma_j} = \sum_{i=1}^n I_j(Z_i)\, MR_i, \quad j = 1, \ldots, K-1.$$
Under
$$H_0: \lambda(t|Z) = \lambda_0(t)\exp(\beta Z),$$
the score vector has mean zero. So a sufficiently large quadratic form of the score vector is an indication that $H_0$ may not be true. Therefore we reject $H_0$ when this quadratic form of the score vector is large.
It is recommended that the martingale residuals be plotted against the covariate $Z$; i.e., we plot $MR_i$ vs. $Z_i$. When there are multiple covariates in the model, we recommend plotting $MR_i$ vs. $Z_i^T\hat\beta$ (called xbeta in SAS).

Note: The sum of the martingale residuals in the $j$th interval of the covariate range makes up the $j$th element of the score vector, $\partial\ell/\partial\gamma_j$.

If our model is correct, these residuals should have mean zero and be uncorrelated with the covariate (since the score vector has mean zero conditional on the covariate). So a noticeable trend away from zero may be indicative of covariate model misspecification.

Example: Breast cancer data revisited

Let us consider the model on page 198 and plot the martingale residuals using the following program:
proc phreg data=bcancer;
model days*cens(0) = trt1 nodes;
output out=residout resmart=mart resdev=dev xbeta=xb;
run;
proc gplot data=residout;
plot mart * xb / vref=0;
symbol1 value=circle;
run;
proc gplot data=residout;
plot dev * xb / vref=0;
symbol1 value=circle;
run;
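An R analogue of the plots above (a sketch, assuming the bcancer data frame):

library(survival)

fit <- coxph(Surv(days, cens) ~ trt1 + nodes, data = bcancer)
mr <- residuals(fit, type = "martingale")
dr <- residuals(fit, type = "deviance")
xb <- predict(fit, type = "lp")        # the linear predictor (xbeta in SAS)

par(mfrow = c(1, 2))
plot(xb, mr, xlab = "Linear predictor", ylab = "Martingale residuals")
abline(h = 0, lty = 2)
plot(xb, dr, xlab = "Linear predictor", ylab = "Deviance residuals")
abline(h = 0, lty = 2)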
Let us return to CAL 8082 and consider the relationship to nodes once again. The following
table summarizes the results:
Model                        -2 Log L    d.f.
nodes                        5189.80     3
nodes + dummy                5174.23     8
nodes + nodes^2              5183.30     4
nodes + nodes^2 + dummy      5173.90     9
Figure 10.4: Martingale (left) and deviance (right) residuals for the model. [Two scatter plots vs. the linear predictor (0.0 to 3.0): martingale residuals range roughly from -1 to 1, deviance residuals roughly from -1 to 3.]
Since $\chi^2_{0.05;5} = 11.07$ and $\chi^2_{0.01;5} = 15.09$, these results suggest that putting in a quadratic term when modeling nodes gives an adequate fit.
What to do if you find substantial deviation from proportional hazards

The proportional hazards model is the most popular model for censored survival data. The parameters of the model have a nice interpretation, the theoretical properties have been studied extensively, software is readily available, and the likelihood surface is easy to work with.

There will be situations, however, when the proportional hazards assumption is not an adequate fit to the data. What can we do in such cases?

By hierarchical model building, we can identify covariates for which the proportional hazards assumption is not appropriate and, by including interaction terms between functions of time and covariates, obtain a more suitable model. However, this model building results in a loss of parsimony, with results that may be difficult to interpret and difficult to explain to your collaborators.
Another alternative is to use a stratified proportional hazards model. When we are considering many covariates in a model, we may find that most of the covariates follow a proportional hazards relationship and only a few of the covariates do not. If this is the case, we may stratify our study population into categories obtained by different combinations of those covariates and then use a stratified proportional hazards model.

If we denote the number of strata by $K$ and let $l$ index the strata, $l = 1, \ldots, K$, then the stratified proportional hazards model is given by
$$\lambda_l(t|Z) = \lambda_{0l}(t)\exp(\beta^T Z),$$
where $Z = (Z_1, \ldots, Z_q)^T$ is a $q$-dimensional vector of covariates that satisfy proportional hazards.

In this model, there are $K$ unspecified baseline hazard functions, one for each stratum, i.e.,
$$\lambda_{0l}(t), \quad l = 1, \ldots, K;\ t \ge 0,$$
and within each stratum the covariates $Z$ satisfy the proportional hazards assumption, with the effects of the covariates $Z$ the same across the $K$ strata.

The interpretation of $\beta = (\beta_1, \ldots, \beta_q)^T$ is exactly the same as in an unstratified proportional hazards model. Namely, if we consider the hazard ratio resulting from an increase of one unit in the covariate $Z_j$, keeping all other covariates fixed (including those used to construct the strata), we get
$$\frac{\lambda(t|Z_j = z_j + 1)}{\lambda(t|Z_j = z_j)} = \exp(\beta_j),$$
independent of time $t$. However, the hazard ratio between strata, fixing the values of the other covariates, is
$$\frac{\lambda_{0l}(t)}{\lambda_{0l'}(t)}, \quad \text{comparing stratum } l \text{ to stratum } l'.$$
Since these functions are unrestricted, any relationship of this hazard ratio over time is possible.

To obtain estimates for $\beta$, we only need a slight modification to the partial likelihood.
For stratum $l$, denote the data within that stratum by
$$(X_{li}, \Delta_{li}, Z_{li}), \quad i = 1, \ldots, n_l,\ l = 1, \ldots, K.$$
The total sample size is $n = \sum_{l=1}^K n_l$.

The modified partial likelihood of $\beta$ is given by
$$PL(\beta) = \prod_{l=1}^K PL_l(\beta),$$
where $PL_l(\beta)$ is the partial likelihood of $\beta$ contributed by the data from the $l$th stratum:
$$PL_l(\beta) = \prod_u \left[\frac{\exp\left(\beta^T Z_{l[i(u)]}\right)}{\sum_{i=1}^{n_l} \exp\left(\beta^T Z_{li}\right)Y_{li}(u)}\right]^{dN_l(u)},$$
where $dN_l(u)$ is the number of deaths observed in the time interval $[u, u+\Delta u)$ in the $l$th stratum and $Y_{li}(u) = I(X_{li} \ge u)$ indicates whether or not subject $i$ in stratum $l$ is at risk at time $u$.

All inferential methods derived previously for the unstratified partial likelihood can be used with the stratified partial likelihood above, such as the MPLE, score test, Wald test, likelihood ratio test, etc.

The Breslow estimator can also be used for the cumulative baseline hazard function of the $l$th stratum; i.e.,
$$\hat\Lambda_{0l}(t) = \sum_{u \le t}\left[\frac{dN_l(u)}{\sum_{i=1}^{n_l} \exp(\hat\beta^T Z_{li})Y_{li}(u)}\right], \quad l = 1, \ldots, K.$$
For example, in the breast cancer data, if we suspect the proportional hazards assumption
for er, then we can stratify on this covariate. The following is the SAS program and output:
options ps=200 ls=80;
data bcancer;
infile "cal8082.dat";
input days cens trt meno tsize nodes er;
trt1 = trt - 1;
label days="(censored) survival time in days"
cens="censoring indicator"
trt="treatment"
meno="menopausal status"
tsize="size of largest tumor in cm"
nodes="number of positive nodes"
er="estrogen receptor status"
trt1="treatment indicator";
run;
title "Model 1: Univariate analysis of treatment";
proc phreg;
model days*cens(0) = trt1;
run;
title "Model 2: Univariate analysis of treatment stratified on ER";
proc phreg;
model days*cens(0) = trt1;
strata er;
run;
title "Model 3: Log-rank test of treatment effect stratified on ER";
proc lifetest notable;
time days*cens(0);
strata er;
test trt1;
run;
********************************************************************************
Model 1: Univariate analysis of treatment 1
12:05 Saturday, April 16, 2005
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
905 497 408 45.08
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 6362.858 6362.421
AIC 6362.858 6364.421
SBC 6362.858 6368.629
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 0.4375 1 0.5083
Score 0.4375 1 0.5083
Wald 0.4374 1 0.5084
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
trt1 1 0.05935 0.08973 0.4374 0.5084
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
trt1 1.061 treatment indicator
Model 2: Univariate analysis of treatment stratified on ER 2
12:05 Saturday, April 16, 2005
The PHREG Procedure
Model Information
Data Set WORK.BCANCER
Dependent Variable days (censored) survival time in days
Censoring Variable cens censoring indicator
Censoring Value(s) 0
Ties Handling BRESLOW
Summary of the Number of Event and Censored Values
Percent
Stratum er Total Event Censored Censored
1 0 278 170 108 38.85
2 1 513 258 255 49.71
-------------------------------------------------------------------
Total 791 428 363 45.89
Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Without With
Criterion Covariates Covariates
-2 LOG L 4781.604 4780.894
AIC 4781.604 4782.894
SBC 4781.604 4786.953
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 0.7100 1 0.3994
Score 0.7099 1 0.3995
Wald 0.7089 1 0.3998
Analysis of Maximum Likelihood Estimates
Parameter Standard
Variable DF Estimate Error Chi-Square Pr > ChiSq
trt1 1 0.08150 0.09680 0.7089 0.3998
Analysis of Maximum Likelihood Estimates
Hazard
Variable Ratio Variable Label
trt1 1.085 treatment indicator
Model 3: Log-rank test of treatment effect stratified on ER 3
12:05 Saturday, April 16, 2005
The LIFETEST Procedure
Summary of the Number of Censored and Uncensored Values
Percent
Stratum er Total Failed Censored Censored
1 0 278 170 108 38.85
2 1 513 258 255 49.71
-------------------------------------------------------------------
Total 791 428 363 45.89
NOTE: There were 114 observations with missing values, negative time values or
frequency values less than 1.
Model 3: Log-rank test of treatment effect stratified on ER 4
12:05 Saturday, April 16, 2005
The LIFETEST Procedure
Testing Homogeneity of Survival Curves for days over Strata
Rank Statistics
er Log-Rank Wilcoxon
0 44.090 33623
1 -44.090 -33623
Covariance Matrix for the Log-Rank Statistics
er 0 1
0 88.6179 -88.6179
1 -88.6179 88.6179
Covariance Matrix for the Wilcoxon Statistics
er 0 1
0 30007365 -3.001E7
1 -3.001E7 30007365
Test of Equality over Strata
Pr >
Test Chi-Square DF Chi-Square
Log-Rank 21.9360 1 <.0001
Wilcoxon 37.6743 1 <.0001
-2Log(LR) 19.6431 1 <.0001
Rank Tests for the Association of days with Covariates Pooled over Strata
Univariate Chi-Squares for the Log-Rank Test
Test Standard Pr >
Variable Statistic Deviation Chi-Square Chi-Square Label
trt1 -8.7104 10.3381 0.7099 0.3995 treatment indicator
Covariance Matrix for the Log-Rank Statistics
Variable trt1
trt1 106.877
Forward Stepwise Sequence of Chi-Squares for the Log-Rank Test
Pr > Chi-Square Pr >
Variable DF Chi-Square Chi-Square Increment Increment Label
trt1 1 0.7099 0.3995 0.7099 0.3995 treatment indicator
Note: From the above output, the score test of the treatment effect stratified on er is the same as the stratified log-rank test of treatment stratified on er.
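For reference, a hedged R sketch of the stratified fit above (assuming the bcancer data frame): strata(er) gives each ER level its own baseline hazard while the treatment effect is shared across strata:

library(survival)

fit <- coxph(Surv(days, cens) ~ trt1 + strata(er), data = bcancer)
summary(fit)                      # same beta interpretation as unstratified
basehaz(fit, centered = FALSE)    # Breslow-type cumulative hazard per stratum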
ST745, Spring 2005
Homework 1, due: Thursday, 1/20/2005
1. (10pts) Assume the random survival time (in years) $T$ has survival function $S(t) = 64/(t+8)^2$. Do the following:

(a) Find the mean and median survival times.

(b) Find the hazard function of $T$.

(c) Find the mortality rate $m(t)$ at time $t$.

(d) Find the average remaining survival time after $t_0 = 8$. How does it compare to the mean survival time you got in (a)? How do you explain this?
2. (10 pts) In class we claimed that the sample mean of censored survival times is no longer an unbiased estimator of the population mean. Let us prove this claim for a special case. Suppose the lifetime $T$ (in days) of some light bulbs has an exponential distribution with constant hazard $\lambda$, and we are interested in estimating the mean lifetime $\mu_T = 1/\lambda$. We picked a random sample of the light bulbs and plan to test them for $L$ (e.g., 30) days. So for those light bulbs which break before $L$ days, we will observe their actual lifetimes. For those bulbs which are still working after $L$ days, we don't know their lifetimes; we only know that their lifetimes are greater than $L$. That is, we will have a random sample of the random variable $X = \min(T, L)$. Do the following:

(a) Find the survival function of $X$. That is, find $P[X \ge t]$ for any $t$ (Hint: consider two cases: $t \ge L$ and $t < L$).

(b) Find $E(X)$ using the survival function of $X$ you got in (a) and the formula $E(X) = \int_0^\infty S_X(t)\,dt$, where $S_X(t)$ is the survival function you got in (a).

(c) What is your conclusion on using the sample mean to estimate the population mean when censoring is present, based on $E(X)$ and $E(T)$?
3. (10 pts) The time in days to development of a tumor for rats exposed to a carcinogen follows a Weibull distribution with $\alpha = 2$ and $\lambda = 0.002$.

(a) Find the probabilities that a (random) rat will be tumor free at 10 days, 20 days and 30 days.

(b) What is the average time to tumor development? (Hint: $\Gamma(0.5) = \sqrt{\pi}$, where $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}\,dt$.)

(c) Find the hazard rate of the time to tumor development at 10 days, 20 days and 30 days.

(d) Find the median time to tumor development.
4. (10 pts) Suppose the hazard function of a random survival time $T$ is given by
$$\lambda(t) = \begin{cases} \lambda_1 & t_0 = 0 \le t < t_1 \\ \lambda_2 & t_1 \le t < t_2 \\ \lambda_3 & t_2 \le t < \infty \end{cases}$$

(a) Find the survival function for this model.

(b) Using your favorite software, plot the survival function on $[0, 100]$ for the special case where $t_1 = 10$, $t_2 = 30$, $t_3 = \infty$ and $\lambda_1 = 0.01$, $\lambda_2 = 0.03$, $\lambda_3 = 0.02$. Find the average survival time and median survival time for this model. Assume the time unit is year. (Hint: An R example is available on the class website.)
ST745, Spring 2005
Homework 2, due: Thursday, 2/3/2005
1. (15pts) The following table shows data on time to HIV development for a sample of individuals
with STD but free of HIV at time 0:
Year intervals # of HIV positive # lost to follow-up
0-2 2 3
2-4 1 2
4-6 4 8
6-8 3 10
8-10 2 21
10-12 2 21
12-14 3 21
Use the data in this table to do the following:
(a) Find the life-table estimate of the survival function of the time to HIV at years 2, 4, 6, 8, 10,
12 and 14 for the individuals with STD.
(b) Find the variance of the estimate you got in (a) at years 2, 4, 6, 8, 10, 12 and 14.
(c) Repeat the above using SAS and R.
2. (10 pts) For the following small data set of survival time: 3, 4, 5+, 6, 6+, 8+, 11, 14, 15, 16+,
where + means a right censored survival time, do the following:
(a) Find the Kaplan-Meier estimate of the survival function and its variance.
(b) Use the above Kaplan-Meier estimate to get an estimate and its variance of the cumulative
hazard function.
(c) Find the Nelson-Aalen estimate of the cumulative hazard function and its variance.
(d) Find an estimate and its variance of the survival function using the Nelson-Aalen estimate
you got in (c).
3. (10 pts) Using the lung cancer data (http://www.biostat.mcw.edu/homepgs/klein/4.7.4.html) in problem 4.3 of the textbook, do the following by using statistical software (such as SAS or R):

(a) Find and plot the Kaplan-Meier estimate and the 90% pointwise confidence interval of the survival function using the data available on 3/31/1980.

(b) Find the estimate and its 90% CI of the median survival time from the above plot using the method described in class. Compare your result to the output from SAS.
4. (10 pts) For the life-table estimates in Table 2.2 on page 14 of the lecture notes, we assume that the withdrawals in each interval occurred at the end of the interval and that the true survival function is a straight line in the interval.

(a) Find an estimate of $S(2.5)$ under the above assumption.

(b) Express your estimate in (a) as a linear function of $\hat S_R(2)$ and $\hat S_R(3)$. Find the variance of your estimate and hence construct a 95% CI for $S(2.5)$.
ST745, Spring 2005
Homework 3, due: Tuesday, 2/15/2005
Happy Valentine's Day
1. (5 pts) Suppose we have a small data set with different kinds of censoring: 2, 3, 4+, 5-, 6, 7+, [8,9], where + (-) means right (left) censored observations and $[a, b]$ means an interval censored observation. Suppose the distribution of the underlying survival time is an exponential distribution with a constant hazard $\lambda$. Write down the likelihood function of $\lambda$ for this given data set.
2. (10 pts) Fit a Weibull model to the censored survival data in problem 3 of HW2. Let $\theta$ be the population median such that $S(\theta) = 1/2$.

(a) Estimate $\theta$.

(b) Define $\eta = \log(\theta)$. Show that $\eta$ can be written as $\eta = \sigma\log(\log 2) + \beta_0$ using the parametrization of Proc Lifereg of SAS.

(c) Estimate the variance of $\hat\eta = \log(\hat\theta)$ and hence construct a 95% confidence interval for $\eta$ (you need to obtain the estimate of the variance matrix of $(\hat\beta_0, \hat\sigma)$).

(d) Use the confidence interval obtained in (c) for $\eta$ to construct a 95% confidence interval for the true median $\theta$.
3. (10 pts) In class, we considered a score test for testing whether or not the survival times are from an exponential distribution, under the assumption that they are from a Weibull distribution, and used the complete data in HW2 as an illustration. In this problem, you are asked to work with the censored survival data in problem 3 of HW2.

(a) Write down the log-likelihood function of the model parameters assuming the survival data are from a Weibull distribution. Assume the observed data are $(x_i, \delta_i)$, $i = 1, 2, \ldots, n$.

(b) Find the score and information matrix from this model and then evaluate them under the hypothesis that the data are from an exponential distribution.

(c) Using the censored survival data in HW2, perform the score test to test whether or not the survival times are from an exponential distribution.
4. (10 pts) Using proc lifereg in SAS with the censored survival data in problem 3 of HW2, do the
following:
(a) Perform Wald test to test whether or not the survival times are from an exponential distri-
bution.
(b) Perform likelihood ratio test to test whether or not the survival times are from an exponential
distribution.
5. (5 pts) The above problems all are under the assumption that the data are from a Weibull model.
Suggest ways to check this model assumption and conduct the diagnostics.
ST745, Spring 2005
Homework 4, due: Thursday, 2/24/2005
1. (10 pts) You are given a small data set on survival times of subjects in two groups, group 1: 1, 1+, 2, 2+ and group 2: 2, 3, 3+, 4, where + means a censored observation. Conduct the standard log-rank test (weight function = 1) by hand to compare the difference in the survival distributions. Which group has better survival?
2. (10 pts) Using all the data in problem 7.7 on page 240 of the textbook (the data can also be downloaded from http://www.biostat.mcw.edu/homepgs/klein/7.8.7.html), do the following (you can do it using SAS):

(a) Compare the survival curves for the three groups using the logrank test.

(b) Perform pairwise (logrank) tests to determine if there is any difference in survival between pairs of groups.
3. (10 pts) The website http://www.biostat.mcw.edu/homepgs/klein/7.8.13.html contains data on time to tumor development for some litters of rats treated with drug or placebo. Test the hypothesis that there is no difference in the times to tumor between the treated and control rats using a log-rank statistic stratified on litter.
4. (10 pts) An investigator asked you to help design a clinical trial for comparing a new treatment to the standard treatment for patients with some kind of cancer. Suppose the mean survival time for the standard treatment is 2 years and the new treatment is expected to extend the mean survival time to 3 years. For design purposes, let us assume the survival times for each treatment have exponential distributions. We would like to use the log-rank test for testing the survival difference at level $\alpha = 0.05$, and the investigator wants 90% power to detect the above difference. Assume an equal number of patients will be allocated to each treatment. Do the following:

(a) What is the expected total number of deaths we have to observe in order for the log-rank test to have the desired power to detect the difference we expect?

(b) Suppose the study length is $L$ (years) and the investigator wants to let patients enter the study throughout the whole study. What relationship do the total sample size and the study length $L$ have to satisfy? Assume patients enter the study randomly.

(c) If on average there are 100 patients available each year, find the study length $L$ so that we have the above design characteristics.
ST745, Spring 2005
Homework 5, due: Thursday, 3/24/2005
1. The website http://www.biostat.mcw.edu/homepgs/klein/larynx.html contains survival data from 90 patients with larynx cancer. Treating disease stage as a categorical variable (so you need to either define 3 dummy variables or declare it as a class variable), fit an exponential AFT model to the data with disease stage as covariates in the model. Answer the following questions:

(a) Compare the mean survival times for patients with different disease stages.

(b) Since the exponential AFT model is also a proportional hazards model, find the estimates and 95% CIs for the hazard ratios comparing patients with different disease stages.

2. For the data in (1), fit a Weibull AFT model to the data with disease stage as covariates in the model. Answer the following questions:

(a) Compare the mean survival times for patients with different disease stages.

(b) Since the Weibull AFT model is also a proportional hazards model, find the estimates and 95% CIs for the hazard ratios comparing patients with different disease stages.

3. For the data in (1), fit a log-logistic model to the data and answer the following questions:

(a) Compare the mean survival times for patients with different disease stages based on this model.

(b) Find the estimates and 95% CIs of the odds ratios comparing patients with different disease stages.

4. Assume the Gamma AFT model fits the data well. Conduct likelihood ratio tests to see if a Weibull model, exponential model, or log-logistic model is reasonable.
ST745, Spring 2005
Homework 6, due: Thursday, 4/7/2005
1. The following table gives a small data set of survival times and a covariate $z$:

patient id   survival time (in years)   z
1            8                          3
2            7                          4
3            9+                         5
4            10                         6

where + means a right censored observation. Assuming a proportional hazards model
$$\lambda(t|z) = \lambda_0(t)e^{\beta z},$$
do the following:

(a) Write down the partial likelihood of $\beta$.

(b) Plot the log partial likelihood of $\beta$ on $[-8, 3]$, and convince yourself that this function is concave.

(c) Find the $\hat\beta$ that maximizes this log partial likelihood, and calculate the second derivative of the log partial likelihood function at $\hat\beta$.

(d) Use Phreg in SAS to fit the above proportional hazards model to the data. How do your results compare to those from the SAS output?
2. We showed in class that the score test for comparing two treatments and the two-sample log rank test are equivalent when there are no ties in the censored survival times. This equivalence is also true for the situation where there are more than two treatments. In this problem, you are asked to show part of this when there are three treatments. Namely, suppose we have the following proportional hazards model
$$\lambda(t|\cdot) = \lambda_0(t)e^{Z_1\beta_1 + Z_2\beta_2},$$
where $Z_1$ and $Z_2$ are two dummy variables created for the 3 treatments:
$$Z_1 = \begin{cases} 1 & \text{if treatment} = 1 \\ 0 & \text{otherwise} \end{cases} \qquad Z_2 = \begin{cases} 1 & \text{if treatment} = 2 \\ 0 & \text{otherwise} \end{cases}$$
Given data $(x_i, \delta_i, z_{1i}, z_{2i})$ for $i = 1, 2, \ldots, n$ where there are NO ties in the censored survival times, show that the score vector $(\partial\ell/\partial\beta_1, \partial\ell/\partial\beta_2)^T$ for testing
$$H_0: \beta_1 = \beta_2 = 0$$
is identical to the vector of the 3-sample log rank test (with $w(x) = 1$) given on page 92 of the lecture notes.
3. Do problem 8.4 on page 288 of the textbook. The data set can be downloaded from
http://www.biostat.mcw.edu/homepgs/klein/7.8.7.html.
4. Do problem 8.10 on page 291 of the textbook. The data set can be downloaded from the website
http://www.biostat.mcw.edu/homepgs/klein/bmt.html.
ST745, Spring 2005
HW 7, due: Tuesday, 4/28/2005
1. The following small data set contains the survival information from 4 patients and smoking status $z(1)$, $z(2)$, $z(3)$ and $z(4)$ at each death time:

x (month)   delta   z(1)   z(2)   z(3)   z(4)
3           1       1      0      0      .
2           0       1      0      .      .
1           1       1      .      .      .
4           1       0      0      1      0

where x = time to failure or censoring (you may sort the data by x); delta = failure indicator: 1 = failure, 0 = censored; z = 1 for smoking and z = 0 for nonsmoking. Assume a proportional hazards model with time-dependent covariate $z(t)$:
$$\lambda(t|z(t)) = \lambda_0(t)e^{\beta z(t)}.$$

(a) Construct the partial likelihood of $\beta$ using this data set.

(b) Plot the log partial likelihood of $\beta$ in the range $[-4, 4]$.

(c) Find the $\hat\beta$ that maximizes the log partial likelihood function and hence calculate the standard error of your estimate.

(d) Repeat part (c) using Proc Phreg in SAS.
2. The following small data set contains the survival and covariate information from 4 patients:

x   delta   z
3   1       7
2   0       6
1   1       4
4   1       5

where x = time to failure or censoring (you may sort the data by x); delta = failure indicator: 1 = failure, 0 = censored; z = observed value of the covariate. Assume a proportional hazards model
$$\lambda(t|z) = \lambda_0(t)\exp(\beta z).$$

(a) Using this data set, compute the Breslow estimator (BY HAND) of the cumulative baseline hazard function $\Lambda_0(t)$. Plot this as a function of time.

Remark: After the last observed failure in the data set, the Breslow estimator remains constant.

(b) Assuming the proportional hazards model is correct, plot the estimated survival curves $\hat S(t|z = 4)$ and $\hat S(t|z = 7)$ as functions of time $t$ on the same graph. Again, please do it by hand.
3. A data set cal7581.dat, similar to the one we discussed in class, was available in the directory:
/afs/unity.ncsu.edu/users/d/dzhang2/www/st745 (you need to use gunzip to unzip it). These
data are from a randomized study of three treatments for women with breast cancer. There are
seven variables:
survival time or censoring time in days;
failure indicator (1=failure, 0=censoring);
treatment indicator (coded as 1, 2, 3);
menopausal status (0=premenopausal, 1=postmenopausal)
tumor size
number of nodes
estrogen receptor status (0=negative, 1=positive)
Fit a proportional hazards model with only main effects for treatments and number of nodes affected. Estimate the survival function for the three treatments with number of nodes = 1 using Breslow's estimate for the cumulative hazard function. Contrast these three estimates with the Kaplan-Meier estimates using the data only from the corresponding stratum, and comment on the similarity and difference between these estimates.
4. In the dataset CAL 7581.dat you analyzed in (3), I want you to consider only the relationship between treatments 1 and 2. If we denote by $R^*$ the treatment number, I want you to consider a model where the log hazard ratio between treatments may be a quadratic function of time. That is,
$$\frac{\lambda(t|R^* = 1)}{\lambda(t|R^* = 2)} = \exp(\alpha + \beta t + \gamma t^2).$$

(a) Using Phreg in SAS, construct such a model and find the estimates for $\alpha$, $\beta$ and $\gamma$.

(b) With this model, plot the log hazard ratio between treatments 1 and 2 as a function of time.

(c) Use this model to test the proportional hazards assumption.

(d) What are your conclusions?

Remark: You may want to change days to years to stabilize your answers.