Multiple Regression - Estimation
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n,$$
where $E(\varepsilon) = \mathbf{0}$ and $\operatorname{cov}(\varepsilon) = \sigma^2 I$.
Here, $\beta_0, \beta_1, \dots, \beta_k$ are fixed but unknown model parameters representing the (partial) regression coefficients.
In this chapter, we assume the $x_{ij}$'s are not random.
In matrix form, the model is
$$y = X\beta + \varepsilon$$
where
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The least squares estimate of $\beta$ minimizes the least squares function
$$Q(b) = (y - Xb)^\top (y - Xb) = \sum_{i=1}^n (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_k x_{ik})^2.$$
Theorem 7.3.1: If $n \ge p$ and $X$ is an $n \times p$ full rank matrix, then $X^\top X$ is a $p \times p$ full rank matrix.
Proof: Since $X$ is full rank, its columns are linearly independent (that is, $Xu = \mathbf{0}$ implies $u = \mathbf{0}$). Suppose $X^\top X u = \mathbf{0}$ for some vector $u$, and let $v = Xu = (v_1, \dots, v_n)^\top$. Then
$$v^\top v = u^\top X^\top X u = u^\top \mathbf{0} = 0.$$
Then $v = \mathbf{0}$, or equivalently, $Xu = \mathbf{0}$. Since $X$ is full rank, this implies that $u = \mathbf{0}$. Thus, $X^\top X u = \mathbf{0}$ implies $u = \mathbf{0}$, so $X^\top X$ is full rank.
Recall the following rules for differentiating with respect to a vector $x$:
$$\frac{\partial\,[c^\top x]}{\partial x} = \frac{\partial\,[x^\top c]}{\partial x} = c, \qquad
\frac{\partial\,[x^\top C x]}{\partial x} = 2Cx \ \text{ if } C \text{ is symmetric.}$$
Theorem 7.3.2 (p.142): Suppose we observe data $(x_{i1}, \dots, x_{ik}, y_i)$ for $i = 1, \dots, n$ from a multiple linear regression model where $n > k$. If $X$ has rank $k + 1$ (it is full rank), then the least squares estimate of $\beta$ is
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
Proof: Since
$$\begin{aligned}
Q(b) &= (y - Xb)^\top (y - Xb) \\
&= y^\top y - b^\top X^\top y - y^\top X b + b^\top X^\top X b \\
&= y^\top y - y^\top X b - y^\top X b + b^\top X^\top X b \\
&= y^\top y - 2 y^\top X b + b^\top X^\top X b,
\end{aligned}$$
we have
$$\frac{\partial Q}{\partial b} = 0 - 2(y^\top X)^\top + 2 X^\top X b = -2 X^\top y + 2 X^\top X b.$$
Setting $\frac{\partial Q}{\partial b} = 0$ and denoting the solution as $\hat{\beta}$, we write
$$\begin{aligned}
-2 X^\top y + 2 X^\top X \hat{\beta} &= 0 \\
2 X^\top X \hat{\beta} &= 2 X^\top y \\
X^\top X \hat{\beta} &= X^\top y \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y \quad (X^\top X \text{ is invertible since } X \text{ is full rank}).
\end{aligned}$$
To show that $\hat{\beta}$ minimizes $Q$, consider
$$\begin{aligned}
Q(b) &= (y - X\hat{\beta} + X\hat{\beta} - Xb)^\top (y - X\hat{\beta} + X\hat{\beta} - Xb) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 (y - X\hat{\beta})^\top (X\hat{\beta} - Xb) + (X\hat{\beta} - Xb)^\top (X\hat{\beta} - Xb) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 \big(X(\hat{\beta} - b)\big)^\top (y - X\hat{\beta}) + \big(X(\hat{\beta} - b)\big)^\top \big(X(\hat{\beta} - b)\big) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 (\hat{\beta} - b)^\top X^\top (y - X\hat{\beta}) + (\hat{\beta} - b)^\top X^\top X (\hat{\beta} - b) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 (\hat{\beta} - b)^\top (X^\top y - X^\top X \hat{\beta}) + (\hat{\beta} - b)^\top X^\top X (\hat{\beta} - b).
\end{aligned}$$
Since $X^\top y - X^\top X \hat{\beta} = 0$ by the normal equations above,
$$Q(b) = (y - X\hat{\beta})^\top (y - X\hat{\beta}) + v^\top v$$
where $v = X(\hat{\beta} - b)$. Now, $v^\top v \ge 0$ with equality if and only if $v = X(\hat{\beta} - b) = 0$. This occurs if and only if $b = \hat{\beta}$ since $X$ is full rank. Thus, $Q(b) \ge Q(\hat{\beta})$ with equality if and only if $b = \hat{\beta}$.
Theorem 7.3.3 (p.145): If $X$ is full rank, then $E(\hat{\beta}) = \beta$ and $\operatorname{cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$.
Proof:
$$\begin{aligned}
E(\hat{\beta}) &= E\big[(X^\top X)^{-1} X^\top y\big] \\
&= (X^\top X)^{-1} X^\top E(y) \\
&= (X^\top X)^{-1} X^\top X \beta = \beta
\end{aligned}$$
and
$$\begin{aligned}
\operatorname{cov}(\hat{\beta}) &= \operatorname{cov}\big[(X^\top X)^{-1} X^\top y\big] \\
&= (X^\top X)^{-1} X^\top \operatorname{cov}(y)\big((X^\top X)^{-1} X^\top\big)^\top \\
&= (X^\top X)^{-1} X^\top \operatorname{cov}(y) X \big((X^\top X)^{-1}\big)^\top \\
&= (X^\top X)^{-1} X^\top \operatorname{cov}(y) X (X^\top X)^{-1} \\
&= (X^\top X)^{-1} X^\top (\sigma^2 I) X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}.
\end{aligned}$$
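As a quick numerical check of Theorem 7.3.3, the following simulation sketch (with arbitrary illustrative choices of $X$, $\beta$, and $\sigma$) compares the empirical mean and covariance of $\hat{\beta}$ across repeated samples with $\beta$ and $\sigma^2 (X^\top X)^{-1}$.

# Simulation sketch with arbitrary illustrative values of X, beta, and sigma.
set.seed(1)
n=20; k=2; sigma=1.5
X=cbind(1,matrix(rnorm(n*k),n,k))   # fixed full-rank design with an intercept column
beta=c(2,-1,0.5)
XtXinv=solve(t(X)%*%X)
B=replicate(5000,{
y=X%*%beta+rnorm(n,sd=sigma)
drop(XtXinv%*%t(X)%*%y)             # beta.hat for this simulated sample
})
rowMeans(B)                         # close to beta
cov(t(B))                           # close to sigma^2*(X'X)^{-1}
sigma^2*XtXinv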
Theorem 7.3.4 (p.150): Let $s^2 = \frac{1}{n - k - 1} Q(\hat{\beta})$. If $X$ is full rank, then $E(s^2) = \sigma^2$ for the multiple linear regression model.
Proof: Since
$$\begin{aligned}
Q(\hat{\beta}) &= (y - X\hat{\beta})^\top (y - X\hat{\beta}) \\
&= y^\top y - 2 \hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top X \hat{\beta} \\
&= y^\top y - 2 \hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top y \\
&= y^\top y - \hat{\beta}^\top X^\top y \\
&= y^\top y - y^\top X (X^\top X)^{-1} X^\top y \\
&= y^\top y - y^\top H y \\
&= y^\top (I - H) y
\end{aligned}$$
where $H = X (X^\top X)^{-1} X^\top$, Theorem 5.2.1 implies that
$$\begin{aligned}
E[Q(\hat{\beta})] &= \operatorname{tr}\big((I_n - H)(\sigma^2 I)\big) + (X\beta)^\top (I - H)(X\beta) \\
&= \sigma^2 \operatorname{tr}(I_n - H) + \beta^\top X^\top (I - H) X \beta \\
&= \sigma^2 \operatorname{tr}(I_n - H) + \beta^\top \big(X^\top X - X^\top X (X^\top X)^{-1} X^\top X\big) \beta \\
&= \sigma^2 \operatorname{tr}(I_n - H) + \beta^\top (X^\top X - X^\top X) \beta \\
&= \sigma^2 \operatorname{tr}(I_n - H) \\
&= \sigma^2 \{\operatorname{tr}(I_n) - \operatorname{tr}(H)\}.
\end{aligned}$$
Since $H^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = H$ so that $H$ is idempotent, Theorem 2.11.1 implies that
$$\begin{aligned}
\operatorname{tr}(H) &= \operatorname{tr}\big(X (X^\top X)^{-1} X^\top\big) \\
&= \operatorname{tr}\big(X^\top X (X^\top X)^{-1}\big) \\
&= \operatorname{tr}(I_{k+1}) \\
&= k + 1.
\end{aligned}$$
So, we have
$$E[Q(\hat{\beta})] = \sigma^2 \{n - (k + 1)\} = \sigma^2 (n - k - 1),$$
and it follows that $E(s^2) = \frac{1}{n - k - 1} E[Q(\hat{\beta})] = \sigma^2$.
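As a sketch with simulated illustrative data, $s^2$ computed directly from the formula above matches the squared residual standard error reported by lm().

# Sketch: compute s^2 = Q(beta.hat)/(n-k-1) directly and compare with lm()
# (simulated illustrative data).
set.seed(2)
n=30; k=2
x1=rnorm(n); x2=rnorm(n)
X=cbind(1,x1,x2)
y=drop(X%*%c(1,2,-1)+rnorm(n,sd=2))
beta.hat=solve(t(X)%*%X)%*%t(X)%*%y
s2=sum((y-X%*%beta.hat)^2)/(n-k-1)
c(s2,summary(lm(y~x1+x2))$sigma^2)   # the two values agree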
The next result is known as the Gauss-Markov Theorem.
Theorem 7.3.5 (p.148): For the multiple linear regression model, the least squares estimate is the best linear unbiased estimate (BLUE). That is, suppose we are interested in estimating $a^\top \beta$ where $a$ is a $(k+1)$-dimensional vector of constants, and we consider estimates of the form $c^\top y$. If $E(c^\top y) = a^\top \beta$ for all $\beta$, then $\operatorname{var}(c^\top y) \ge \operatorname{var}(a^\top \hat{\beta})$.
Proof: Let $d = X^\top c$. Since $a^\top \beta = E(c^\top y) = c^\top E(y) = c^\top X \beta = d^\top \beta$ for all $\beta$, it follows that $a = d$ (this can be shown since $\beta = (1, 0, \dots, 0)^\top$ implies $a_1 = d_1$, $\dots$, $\beta = (0, \dots, 0, 1)^\top$ implies $a_{k+1} = d_{k+1}$).
Also,
$$\begin{aligned}
c^\top (I - H) c &= c^\top (I - H)(I - H) c \\
&= c^\top (I - H)^\top (I - H) c \\
&= \big((I - H) c\big)^\top (I - H) c \ge 0.
\end{aligned}$$
Since $a = d = X^\top c$, $c^\top (I - H) c \ge 0$ implies that
$$\begin{aligned}
\operatorname{var}(a^\top \hat{\beta}) &= a^\top \operatorname{cov}(\hat{\beta}) a \\
&= a^\top \sigma^2 (X^\top X)^{-1} a \\
&= \sigma^2 c^\top X (X^\top X)^{-1} X^\top c \\
&= \sigma^2 c^\top H c \\
&\le \sigma^2 c^\top H c + \sigma^2 c^\top (I - H) c \\
&= \sigma^2 c^\top I c \\
&= c^\top (\sigma^2 I) c \\
&= c^\top \operatorname{cov}(y) c \\
&= \operatorname{var}(c^\top y).
\end{aligned}$$
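To see the Gauss-Markov result numerically, the following simulation sketch (with arbitrary illustrative values) compares the least squares slope with one simple alternative linear unbiased estimator of the slope, the two-point slope based on the design points $x = -1$ and $x = 4$; both are unbiased, but the least squares slope has the smaller variance.

# Sketch: both estimators are linear and unbiased for the slope beta1,
# but the least squares slope has smaller variance (illustrative values).
set.seed(3)
x=c(3,-1,1,4,0,2)                    # fixed design points
beta0=1.5; beta1=0.5; sigma=1
sims=replicate(10000,{
y=beta0+beta1*x+rnorm(length(x),sd=sigma)
c(ls=unname(coef(lm(y~x))[2]),
two.point=(y[4]-y[2])/(x[4]-x[2]))   # slope from the points with x = 4 and x = -1
})
rowMeans(sims)                       # both are close to beta1 = 0.5
apply(sims,1,var)                    # the least squares variance is smaller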
R Example 7.4.1: First, let's find the least squares estimate based on the formula in Theorem 7.3.2. To do this, we start by creating the design matrix and the response vector.
x=c(3,-1,1)
X=cbind(1,x);X
## x
## [1,] 1 3
## [2,] 1 -1
## [3,] 1 1
y=c(4,2,0);y
## [1] 4 2 0
beta.hat=solve(t(X)%*%X)%*%t(X)%*%y; beta.hat
## [,1]
## 1.5
## x 0.5
So, $\hat{\beta} = \begin{pmatrix} 1.5 \\ 0.5 \end{pmatrix}$ and the regression line for modeling $y$ based on $x$ is
$$\hat{y} = 1.5 + 0.5x.$$
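As a quick check, R's built-in lm() function reproduces the same coefficients.

coef(lm(y~x))   # (Intercept) 1.5, x 0.5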
Graphically, if we look at the rows of X and y, the least squares line minimizes the sum of squares of vertical distances of the data
points to the line as illustrated below.
plot(x,y,pch=19)
abline(beta.hat,col="blue")
y.hat=X%*%beta.hat
for (i in 1:3)
points(c(x[i],x[i]),c(y[i],y.hat[i]),type="l",col="red")
There is another geometrical interpretation if we look at the columns of $X$ and $y$. First, we plot the 3-dimensional vectors from the columns of $X$ in red and find the subspace of $\mathbb{R}^3$ spanned by these two vectors.
require(rgl)
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col="#AAAAAA")
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col="#AAAAAA")
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,x[1]),c(0,x[2]),c(0,x[3]),col="#FF0000",lwd=5)
Next, we add the response vector $y = (4, 2, 0)^\top$ in blue.
segments3d(c(0,y[1]),c(0,y[2]),c(0,y[3]),col="#0000FF",lwd=5)
Now, we project the response vector onto the subspace spanned by the columns of the design matrix. The fitted vector $\hat{y} = X\hat{\beta} = \begin{pmatrix} 3 \\ 1 \\ 2 \end{pmatrix}$ and the residual vector $\hat{\varepsilon} = y - \hat{y} = \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix}$ are shown below.
segments3d(c(0,y.hat[1]),c(0,y.hat[2]),c(0,y.hat[3]),col="#FF00FF",lwd=5)
segments3d(c(y[1],y.hat[1]),c(y[2],y.hat[2]),c(y[3],y.hat[3]),col="#00FF00",lwd=5)
For the normal multiple linear regression model, we additionally assume that $\varepsilon \sim N_n(\mathbf{0}, \sigma^2 I)$, so
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n,$$
with independent normal errors. The log-likelihood function is
$$\ell(\beta, \sigma^2) = \ln L(\beta, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} Q(\beta).$$
Theorem 7.6.1 (p.158): For the normal multiple linear regression model where $n > k$ and $X$ has rank $k + 1$ (it is full rank), the maximum likelihood estimator of $(\beta, \sigma^2)$ is $(\hat{\beta}, \hat{\sigma}^2)$ where
$$\hat{\beta} = (X^\top X)^{-1} X^\top y \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} Q(\hat{\beta}) = \frac{1}{n} (y - X\hat{\beta})^\top (y - X\hat{\beta}).$$
Proof: The partial derivatives of the log-likelihood are
$$\frac{\partial \ell}{\partial \beta} = -\frac{1}{2\sigma^2} \frac{\partial Q}{\partial \beta}
\quad \text{and} \quad
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} Q(\beta).$$
Setting the first equation equal to 0 gives
$$-\frac{1}{2\sigma^2} \frac{\partial Q}{\partial \beta} = 0 \quad \Longrightarrow \quad \frac{\partial Q}{\partial \beta} = 0 \quad \Longrightarrow \quad \hat{\beta} = (X^\top X)^{-1} X^\top y$$
as in the least squares derivation.
Substituting $\beta = \hat{\beta}$, setting the second equation equal to 0, and solving, we obtain
$$\begin{aligned}
-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2} Q(\hat{\beta}) &= 0 \\
\frac{n}{2\hat{\sigma}^2} &= \frac{1}{2(\hat{\sigma}^2)^2} Q(\hat{\beta}) \\
\hat{\sigma}^2 &= \frac{Q(\hat{\beta})}{n}.
\end{aligned}$$
To show that $(\hat{\beta}, \hat{\sigma}^2)$ maximizes $\ell$, let $\ell^*$ be the profile likelihood defined by
$$\ell^*(\sigma^2) = \ell(\hat{\beta}, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} Q(\hat{\beta}).$$
Since
$$\frac{d\ell^*}{d(\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} Q(\hat{\beta}) = \frac{n}{2(\sigma^2)^2} (\hat{\sigma}^2 - \sigma^2)$$
is positive when $\sigma^2 < \hat{\sigma}^2$ and negative when $\sigma^2 > \hat{\sigma}^2$, $\ell^*(\sigma^2)$ is maximized at $\sigma^2 = \hat{\sigma}^2$. Finally, we see that
$$\ell(\hat{\beta}, \hat{\sigma}^2) = \ell^*(\hat{\sigma}^2) \ge \ell^*(\sigma^2) = \ell(\hat{\beta}, \sigma^2) \ge \ell(\beta, \sigma^2)$$
for all $\beta$ and $\sigma^2 > 0$.
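A short sketch with simulated illustrative data shows the difference between the maximum likelihood estimate $\hat{\sigma}^2 = SSE/n$ and the unbiased estimate $s^2 = SSE/(n - k - 1)$.

# Sketch: the MLE of sigma^2 divides SSE by n; the unbiased estimate divides
# by n-k-1 (simulated illustrative data).
set.seed(4)
n=25; k=2
X=cbind(1,matrix(rnorm(n*k),n,k))
y=drop(X%*%c(1,0.5,-0.5)+rnorm(n))
beta.hat=solve(t(X)%*%X,t(X)%*%y)
SSE=sum((y-X%*%beta.hat)^2)
c(mle=SSE/n,unbiased=SSE/(n-k-1))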
Theorem 7.6.2: For the normal multiple linear regression model with $X$ full rank, (a) $\hat{\beta} \sim N_{k+1}\big(\beta, \sigma^2 (X^\top X)^{-1}\big)$, (b) $n\hat{\sigma}^2/\sigma^2 \sim \chi^2(n - k - 1)$, and (c) $\hat{\beta}$ and $\hat{\sigma}^2$ are independent.
Proof: (a) Theorem 4.4.1 implies that $\hat{\beta} = (X^\top X)^{-1} X^\top y$ follows a $(k+1)$-dimensional Normal distribution with mean vector
$$\begin{aligned}
E\big[(X^\top X)^{-1} X^\top y\big] &= (X^\top X)^{-1} X^\top E(y) \\
&= (X^\top X)^{-1} X^\top X \beta \\
&= \beta
\end{aligned}$$
and covariance matrix
$$\begin{aligned}
\operatorname{cov}\big[(X^\top X)^{-1} X^\top y\big] &= (X^\top X)^{-1} X^\top \operatorname{cov}(y) X (X^\top X)^{-1} \\
&= (X^\top X)^{-1} X^\top (\sigma^2 I) X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}.
\end{aligned}$$
(b) We have
$$\frac{n\hat{\sigma}^2}{\sigma^2} = \frac{(y - X\hat{\beta})^\top (y - X\hat{\beta})}{\sigma^2} = \frac{y^\top (I - H)(I - H) y}{\sigma^2} = \frac{1}{\sigma^2} y^\top (I - H) y \sim \chi^2(n - k - 1)$$
since $\left(\frac{1}{\sigma^2}(I - H)\right)(\sigma^2 I) = I - H$ is idempotent with rank $n - k - 1$.
(c) By Theorem 5.6.1, $\hat{\beta} = (X^\top X)^{-1} X^\top y$ and $\frac{n\hat{\sigma}^2}{\sigma^2} = \frac{1}{\sigma^2} y^\top (I - H) y$ are independent since
$$\begin{aligned}
(X^\top X)^{-1} X^\top (\sigma^2 I) \frac{1}{\sigma^2}(I - H) &= (X^\top X)^{-1} X^\top (I - H) \\
&= (X^\top X)^{-1} X^\top - (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top \\
&= (X^\top X)^{-1} X^\top - (X^\top X)^{-1} X^\top \\
&= O_{k+1,n}.
\end{aligned}$$
Then $\hat{\beta}$ and $\hat{\sigma}^2 = \frac{\sigma^2}{n}\left(\frac{n\hat{\sigma}^2}{\sigma^2}\right)$ are independent (see Theorem L3.5 from MATH 667).
Theorem 7.6.3 (p.160): For the normal multiple linear regression model, $\begin{pmatrix} \hat{\beta} \\ s^2 \end{pmatrix}$ is the uniform minimum variance unbiased estimator (UMVUE) of $\begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix}$.
The multiple regression model can also be written in centered form:
$$\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i \\
&= \alpha + \beta_1 (x_{i1} - \bar{x}_1) + \beta_2 (x_{i2} - \bar{x}_2) + \dots + \beta_k (x_{ik} - \bar{x}_k) + \varepsilon_i
\end{aligned}$$
where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij}$ for $j = 1, \dots, k$ and
$$\alpha = \beta_0 + \beta_1 \bar{x}_1 + \dots + \beta_k \bar{x}_k.$$
In matrix form,
$$y = (\mathbf{j}, X_c) \begin{pmatrix} \alpha \\ \beta_1 \end{pmatrix} + \varepsilon$$
where
$$\beta_1 = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$
$$X_1 = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}$$
is the design matrix with the first column of ones removed, and
$$X_c = \left(I - \frac{1}{n} J\right) X_1 = \begin{pmatrix} x_{11} - \bar{x}_1 & \cdots & x_{1k} - \bar{x}_k \\ x_{21} - \bar{x}_1 & \cdots & x_{2k} - \bar{x}_k \\ \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & \cdots & x_{nk} - \bar{x}_k \end{pmatrix}$$
is the centered matrix of predictors.
The least squares estimate of $\beta_1$ is
$$\hat{\beta}_1 = (X_c^\top X_c)^{-1} X_c^\top y.$$
Written in terms of sample variances and covariances,
$$\hat{\beta}_1 = S_{xx}^{-1} s_{yx} \quad \text{and} \quad \hat{\beta}_0 = \hat{\alpha} - \hat{\beta}_1^\top \bar{x} = \bar{y} - s_{yx}^\top S_{xx}^{-1} \bar{x}$$
where
$$S_{xx} = \begin{pmatrix} s_1^2 & s_{12} & \cdots & s_{1k} \\ s_{21} & s_2^2 & \cdots & s_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{k1} & s_{k2} & \cdots & s_k^2 \end{pmatrix} = \frac{1}{n-1} X_c^\top X_c, \qquad
s_{yx} = \begin{pmatrix} s_{y1} \\ s_{y2} \\ \vdots \\ s_{yk} \end{pmatrix} = \frac{1}{n-1} X_c^\top y, \qquad
\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_k \end{pmatrix}.$$
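A sketch of the centered-form computation in R (with simulated illustrative data): the sample covariance matrix and covariance vector from cov() give the same estimates as the usual least squares formulas.

# Sketch: beta.hat.1 = Sxx^{-1} syx and beta.hat.0 = ybar - syx' Sxx^{-1} xbar
# (simulated illustrative data).
set.seed(5)
n=40
X1=cbind(x1=rnorm(n),x2=runif(n))   # predictors without the column of ones
y=drop(2+3*X1[,1]-X1[,2]+rnorm(n))
Sxx=cov(X1)                         # (1/(n-1)) Xc'Xc
syx=cov(X1,y)                       # (1/(n-1)) Xc'y
beta1.hat=solve(Sxx,syx)
beta0.hat=mean(y)-drop(crossprod(beta1.hat,colMeans(X1)))
c(beta0.hat,beta1.hat)
coef(lm(y~X1))                      # same values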
Evaluating the least squares function at $\hat{\beta}$, we obtain the residual sum of squares (or error sum of squares) denoted by
$$SSE = Q(\hat{\beta}) = \hat{\varepsilon}^\top \hat{\varepsilon} = (y - X\hat{\beta})^\top (y - X\hat{\beta}) = y^\top y - \hat{\beta}^\top X^\top y.$$
The total sum of squares $SST = \sum_{i=1}^n (y_i - \bar{y})^2$ can be partitioned as $SST = SSE + SSR$, where
$$SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \hat{\beta}^\top X^\top y - n\bar{y}^2 = \hat{\beta}_1^\top X_c^\top y$$
is the regression sum of squares.
The coefficient of determination $R^2 = \frac{SSR}{SST}$ is the proportion of variation explained by the regression model compared with that explained only using the mean. This gives a measure of how well the model fits the data.
$R$ is sometimes referred to as the multiple correlation coefficient.
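A short sketch (with simulated illustrative data) computes $SSE$, $SSR$, $SST$, and $R^2$ from the formulas above and compares $R^2$ with the value reported by summary(lm()).

# Sketch: SSE = y'y - beta.hat'X'y, SST = sum((y-ybar)^2), SSR = SST - SSE
# (simulated illustrative data).
set.seed(6)
n=30
x1=rnorm(n); x2=rnorm(n)
y=1+2*x1-x2+rnorm(n)
X=cbind(1,x1,x2)
beta.hat=solve(t(X)%*%X,t(X)%*%y)
SSE=drop(sum(y^2)-t(beta.hat)%*%t(X)%*%y)
SST=sum((y-mean(y))^2)
SSR=SST-SSE
c(R2=SSR/SST,lm=summary(lm(y~x1+x2))$r.squared)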
If $\operatorname{cov}(\varepsilon) = \sigma^2 V$ for a known positive definite matrix $V$, the generalized least squares estimate of $\beta$ minimizes the generalized least squares function
$$Q_V(b) = (y - Xb)^\top V^{-1} (y - Xb) = \sum_{i=1}^n \sum_{j=1}^n v^{ij} (y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik})(y_j - b_0 - b_1 x_{j1} - \dots - b_k x_{jk})$$
where $v^{ij}$ denotes the $(i,j)$ element of $V^{-1}$.
Theorem 7.8.1 (p.165): Given observed data $(x_{i1}, \dots, x_{ik}, y_i)$ for $i = 1, \dots, n$ from a multiple linear regression model where $n > k$ and $X$ has rank $k + 1$ (it is full rank), the generalized least squares estimate of $\beta$ is
$$\hat{\beta} = (X^\top V^{-1} X)^{-1} X^\top V^{-1} y.$$
Furthermore, $s^2 = \frac{1}{n - k - 1} Q_V(\hat{\beta})$ is an unbiased estimate of $\sigma^2$.
Proof: The results follow from the results in Section 7.3 with y replaced by V −1/2 y and X replaced by V −1/2 X.
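A minimal sketch of the generalized least squares computation (with a hypothetical AR(1)-type covariance matrix $V$ chosen only for illustration):

# Sketch: beta.hat = (X'V^{-1}X)^{-1} X'V^{-1} y with an assumed V (illustration only).
set.seed(7)
n=15
x=rnorm(n)
X=cbind(1,x)
V=0.5^abs(outer(1:n,1:n,"-"))          # assumed AR(1)-type correlation structure
y=drop(X%*%c(1,2)+t(chol(V))%*%rnorm(n))
VinvX=solve(V,X)                       # V^{-1}X
beta.gls=solve(t(X)%*%VinvX,t(VinvX)%*%y)
r=y-X%*%beta.gls
s2=drop(t(r)%*%solve(V,r))/(n-1-1)     # Q_V(beta.hat)/(n-k-1) with k = 1
beta.gls; s2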
Now suppose the correct model is
$$y = X\beta + \varepsilon = (X_1, X_2) \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \varepsilon = X_1 \beta_1 + X_2 \beta_2 + \varepsilon$$
but we incorrectly leave out the variables in the columns of $X_2$ and use the model
$$y = X_1 \beta_1^* + \varepsilon^*.$$
If $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$ is the least squares estimator from the misspecified model, then
(a) $E(\hat{\beta}_1^*) = \beta_1 + A\beta_2$ where $A = (X_1^\top X_1)^{-1} X_1^\top X_2$ is the alias matrix;
(b) $\operatorname{cov}(\hat{\beta}_1^*) = \sigma^2 (X_1^\top X_1)^{-1}$.
Proof: For (a), we compute
$$\begin{aligned}
E(\hat{\beta}_1^*) &= (X_1^\top X_1)^{-1} X_1^\top E(y) \\
&= (X_1^\top X_1)^{-1} X_1^\top (X_1 \beta_1 + X_2 \beta_2) \\
&= (X_1^\top X_1)^{-1} X_1^\top X_1 \beta_1 + (X_1^\top X_1)^{-1} X_1^\top X_2 \beta_2 \\
&= \beta_1 + (X_1^\top X_1)^{-1} X_1^\top X_2 \beta_2.
\end{aligned}$$
For (b),
$$\begin{aligned}
\operatorname{cov}(\hat{\beta}_1^*) &= (X_1^\top X_1)^{-1} X_1^\top \operatorname{cov}(y) X_1 (X_1^\top X_1)^{-1} \\
&= (X_1^\top X_1)^{-1} X_1^\top (\sigma^2 I) X_1 (X_1^\top X_1)^{-1} \\
&= \sigma^2 (X_1^\top X_1)^{-1} X_1^\top X_1 (X_1^\top X_1)^{-1} \\
&= \sigma^2 (X_1^\top X_1)^{-1}.
\end{aligned}$$
Let
$$s_1^2 = \frac{(y - X_1 \hat{\beta}_1^*)^\top (y - X_1 \hat{\beta}_1^*)}{n - p - 1}$$
be the estimator of $\sigma^2$ based only on $X_1$, where $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$ and $X_1$ has $p + 1$ columns. If the true model is $y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon$, then
$$E(s_1^2) = \sigma^2 + \frac{\beta_2^\top X_2^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2 \beta_2}{n - p - 1}.$$
Proof: We compute
$$\begin{aligned}
E\big[(y - X_1 \hat{\beta}_1^*)^\top (y - X_1 \hat{\beta}_1^*)\big]
&= E\big[(y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)^\top (y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)\big] \\
&= E\big[\big((I - X_1 (X_1^\top X_1)^{-1} X_1^\top) y\big)^\top (y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)\big] \\
&= E\big[y^\top (I - X_1 (X_1^\top X_1)^{-1} X_1^\top)(y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)\big] \\
&= E\big[y^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) y\big] \\
&= \operatorname{tr}\big((I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top)(\sigma^2 I)\big) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 \big(\operatorname{tr}(I_n) - \operatorname{tr}(X_1 (X_1^\top X_1)^{-1} X_1^\top)\big) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 \big(\operatorname{tr}(I_n) - \operatorname{tr}(X_1^\top X_1 (X_1^\top X_1)^{-1})\big) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 (n - p - 1) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 (n - p - 1) + (\beta_1^\top X_1^\top + \beta_2^\top X_2^\top)(I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top)(X_1 \beta_1 + X_2 \beta_2) \\
&= \sigma^2 (n - p - 1) + \beta_2^\top X_2^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X_2 \beta_2,
\end{aligned}$$
and dividing by $n - p - 1$ gives the result.
For example, suppose $n = 4$ and
$$X = (X_1, X_2) \text{ with } X_1 = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \text{ and } X_2 = \begin{pmatrix} 1 \\ 4 \\ 9 \\ 16 \end{pmatrix},$$
so
$$X_1^\top X_1 = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}, \qquad X_2^\top X_2 = 354, \qquad X_1^\top X_2 = \begin{pmatrix} 30 \\ 100 \end{pmatrix}.$$
Then
$$(X_1^\top X_1)^{-1} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix} = \begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix}$$
so
$$X_2^\top X_1 (X_1^\top X_1)^{-1} X_1^\top X_2 = (30, 100)\begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix}\begin{pmatrix} 30 \\ 100 \end{pmatrix} = 350$$
and it follows that
$$E(s_1^2) - \sigma^2 = \frac{\beta_2^2 (354 - 350)}{4 - 1 - 1} = 2\beta_2^2.$$
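These numbers can be verified directly in R with a quick sketch.

# Sketch: verify X2'X2 = 354, X2'X1(X1'X1)^{-1}X1'X2 = 350, and the bias factor 2.
X1=cbind(1,1:4)
X2=(1:4)^2
A=solve(t(X1)%*%X1)%*%t(X1)%*%X2            # alias matrix
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
sum(X2^2)                                   # 354
drop(t(X2)%*%H1%*%X2)                       # 350
drop(t(X2)%*%(diag(4)-H1)%*%X2)/(4-1-1)     # 2, so E(s1^2) - sigma^2 = 2*beta2^2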
Recall that if $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ where $A_{11}$ and $B = A_{22} - A_{21} A_{11}^{-1} A_{12}$ are invertible, then
$$A^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} B^{-1} A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} B^{-1} \\ -B^{-1} A_{21} A_{11}^{-1} & B^{-1} \end{pmatrix}.$$
Theorem 7.9.3 (p.171): Suppose $X_1^\top X_1$ is invertible. If $\hat{\beta} = (X^\top X)^{-1} X^\top y$ where $X = (X_1, X_2)$, $\hat{\beta} = \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}$, and $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$, then
$$\operatorname{cov}(\hat{\beta}_1) - \operatorname{cov}(\hat{\beta}_1^*) = \sigma^2 A B^{-1} A^\top$$
where $A = (X_1^\top X_1)^{-1} X_1^\top X_2$ and $B = X_2^\top X_2 - X_2^\top X_1 A$. Furthermore, $A B^{-1} A^\top$ is positive definite or positive semidefinite, so $\operatorname{var}(\hat{\beta}_j) \ge \operatorname{var}(\hat{\beta}_j^*)$ for all $j$.
Proof: Since
$$\begin{aligned}
\operatorname{cov}(\hat{\beta}) &= \sigma^2 (X^\top X)^{-1} \\
&= \sigma^2 \begin{pmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{pmatrix}^{-1} \\
&= \sigma^2 \begin{pmatrix} (X_1^\top X_1)^{-1} + (X_1^\top X_1)^{-1} X_1^\top X_2 B^{-1} X_2^\top X_1 (X_1^\top X_1)^{-1} & -(X_1^\top X_1)^{-1} X_1^\top X_2 B^{-1} \\ -B^{-1} X_2^\top X_1 (X_1^\top X_1)^{-1} & B^{-1} \end{pmatrix} \\
&= \sigma^2 \begin{pmatrix} (X_1^\top X_1)^{-1} + A B^{-1} A^\top & -A B^{-1} \\ -B^{-1} A^\top & B^{-1} \end{pmatrix}
\end{aligned}$$
where $B = X_2^\top X_2 - X_2^\top X_1 (X_1^\top X_1)^{-1} X_1^\top X_2 = X_2^\top X_2 - X_2^\top X_1 A$, we have
$$\operatorname{cov}(\hat{\beta}_1) = \sigma^2 \big((X_1^\top X_1)^{-1} + A B^{-1} A^\top\big).$$
Thus, $\operatorname{cov}(\hat{\beta}_1) - \operatorname{cov}(\hat{\beta}_1^*) = \sigma^2 A B^{-1} A^\top$. This matrix is positive definite or positive semidefinite: since
$$B = X_2^\top X_2 - X_2^\top X_1 A = X_2^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2,$$
for any $u$ with the dimension of $\beta_2$,
$$u^\top B u = u^\top X_2^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big)^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2 u = w^\top w \ge 0$$
where $w = \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2 u$, so $B$ is positive semidefinite (positive definite when it is invertible), and hence $B^{-1}$ is positive definite. Then, for any $v$ with the dimension of $\beta_1$, $v^\top A B^{-1} A^\top v = (A^\top v)^\top B^{-1} (A^\top v) \ge 0$.
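For the small four-observation design from the earlier example, the variance comparison in Theorem 7.9.3 can be seen numerically with a short sketch (variances are shown up to the factor $\sigma^2$).

# Sketch: diagonal of (X'X)^{-1} (full model) versus (X1'X1)^{-1} (reduced model);
# the reduced-model variances are smaller, as Theorem 7.9.3 implies.
X1=cbind(1,1:4)
X2=(1:4)^2
X=cbind(X1,X2)
diag(solve(t(X)%*%X))[1:2]     # full-model variances / sigma^2
diag(solve(t(X1)%*%X1))        # reduced-model variances / sigma^2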
7.10 Orthogonalization
Notice that if the columns of $X_1$ are orthogonal to the columns of $X_2$ (that is, $X_1^\top X_2 = O$), then the least squares estimates of $\beta_1$ and $\beta_2$ based on fitting models separately with only $X_1$ and $X_2$, respectively, are the same as those from the full model based on $X = (X_1, X_2)$.
Note that $X_1^\top X_2 = O$ implies $A = O$ and $B = X_2^\top X_2$. Then we see that
$$\begin{aligned}
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
&= (X^\top X)^{-1} X^\top y \\
&= \begin{pmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{pmatrix}^{-1} \begin{pmatrix} X_1^\top \\ X_2^\top \end{pmatrix} y \\
&= \begin{pmatrix} (X_1^\top X_1)^{-1} + A B^{-1} A^\top & -A B^{-1} \\ -B^{-1} A^\top & B^{-1} \end{pmatrix} \begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix} \\
&= \begin{pmatrix} (X_1^\top X_1)^{-1} & O \\ O & (X_2^\top X_2)^{-1} \end{pmatrix} \begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix} \\
&= \begin{pmatrix} (X_1^\top X_1)^{-1} X_1^\top y \\ (X_2^\top X_2)^{-1} X_2^\top y \end{pmatrix} \\
&= \begin{pmatrix} \hat{\beta}_1^* \\ \hat{\beta}_2^* \end{pmatrix}.
\end{aligned}$$
Orthogonalization also provides a method for updating coefficient estimates from a regression model of $y$ based only on $X_1$ to obtain coefficient estimates for the full model of $y$ based on $X = (X_1, X_2)$, even when the columns of $X_2$ are not orthogonal to the columns of $X_1$.
Here are the steps:
Step 1: Regress $y$ on $X_1$ to obtain $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$ and residuals $\hat{\varepsilon}_1^* = y - \hat{y}_1$ where $\hat{y}_1 = X_1 \hat{\beta}_1^*$.
Step 2: Simultaneously regress each column of $X_2$ on $X_1$, where each column of $(X_1^\top X_1)^{-1} X_1^\top X_2 = A$ gives the coefficients for the corresponding column. Then the residuals are $X_{2 \cdot 1} = X_2 - X_1 A$.
Step 3: Regress $y - \hat{y}_1$ on $X_{2 \cdot 1}$ to obtain $\hat{\beta}_2$. This gives
$$\begin{aligned}
\hat{y} - \hat{y}_1 &= X_{2 \cdot 1} \hat{\beta}_2 \\
\hat{y} - \hat{y}_1 &= (X_2 - X_1 A) \hat{\beta}_2 \\
\hat{y} &= \hat{y}_1 - X_1 A \hat{\beta}_2 + X_2 \hat{\beta}_2 \\
\hat{y} &= X_1 \hat{\beta}_1^* - X_1 A \hat{\beta}_2 + X_2 \hat{\beta}_2 \\
\hat{y} &= X_1 (\hat{\beta}_1^* - A \hat{\beta}_2) + X_2 \hat{\beta}_2
\end{aligned}$$
so $\hat{\beta}_1 = \hat{\beta}_1^* - A \hat{\beta}_2$.
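The following sketch (with simulated illustrative data) carries out the three steps and checks that the updated coefficients agree with fitting the full model directly.

# Sketch of the three-step update (simulated illustrative data).
set.seed(8)
n=50
X1=cbind(1,rnorm(n))                  # initial model: intercept and one predictor
X2=cbind(rnorm(n),rnorm(n))           # variables added later
y=drop(X1%*%c(1,2)+X2%*%c(-1,0.5)+rnorm(n))
b1.star=solve(t(X1)%*%X1,t(X1)%*%y)   # Step 1
A=solve(t(X1)%*%X1,t(X1)%*%X2)        # Step 2: alias matrix
X2.1=X2-X1%*%A                        #         residual columns
b2=solve(t(X2.1)%*%X2.1,t(X2.1)%*%(y-X1%*%b1.star))   # Step 3
b1=b1.star-A%*%b2
Xfull=cbind(X1,X2)
cbind(update=c(b1,b2),direct=drop(solve(t(Xfull)%*%Xfull,t(Xfull)%*%y)))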
R Example 7.10.1: Let's see how the orthogonalization method works geometrically for the data in R Example 7.4.1.
First, we regress $y = (4, 2, 0)^\top$ on $\mathbf{j} = (1, 1, 1)^\top$:
$$\hat{\beta}_1^* = \frac{\mathbf{j}^\top y}{\mathbf{j}^\top \mathbf{j}} = \frac{6}{3} = 2, \qquad \hat{y}_1 = 2\mathbf{j}, \qquad y - \hat{y}_1 = \begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix}$$
as illustrated below.
x=c(3,-1,1)
y=c(4,2,0)
j=rep(1,3)
y.hat.1=mean(y)*j
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,y[1]),c(0,y[2]),c(0,y[3]),col="#0000FF",lwd=5)
segments3d(c(y[1],y.hat.1[1]),c(y[2],y.hat.1[2]),c(y[3],y.hat.1[3]),col="#FF8800",lwd=5)
Then we find a vector in the subspace spanned by $x = (3, -1, 1)^\top$ and $\mathbf{j} = (1, 1, 1)^\top$ that is orthogonal to $\mathbf{j}$ by computing the residual after regressing $x$ on $\mathbf{j}$:
$$A = \frac{\mathbf{j}^\top x}{\mathbf{j}^\top \mathbf{j}} = \frac{3}{3} = 1, \qquad \hat{x} = 1\,\mathbf{j} = \mathbf{j}, \qquad X_{2 \cdot 1} = x - \hat{x} = \begin{pmatrix} 2 \\ -2 \\ 0 \end{pmatrix}$$
as illustrated below.
x.hat=mean(x)*j
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,x[1]),c(0,x[2]),c(0,x[3]),col="#FF0000",lwd=5)
segments3d(c(x[1],x.hat[1]),c(x[2],x.hat[2]),c(x[3],x.hat[3]),col="#00FFFF",lwd=5)
Finally, we regress $y - \hat{y}_1 = (2, 0, -2)^\top$ on $x - \hat{x} = (2, -2, 0)^\top$:
$$\hat{\beta}_2 = \frac{(y - \hat{y}_1)^\top (x - \hat{x})}{(x - \hat{x})^\top (x - \hat{x})} = \frac{4 + 0 + 0}{4 + 4 + 0} = \frac{1}{2}.$$
The projection of $y - \hat{y}_1$ onto $x - \hat{x}$ is
$$\hat{y} - \hat{y}_1 = \frac{1}{2}(x - \hat{x}).$$
The residual
$$(y - \hat{y}_1) - (\hat{y} - \hat{y}_1) = y - \hat{y} = \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix}$$
is illustrated below.
beta.hat.2=sum((y-y.hat.1)*(x-x.hat))/sum((x-x.hat)^2)
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(y[1],y.hat.1[1]),c(y[2],y.hat.1[2]),c(y[3],y.hat.1[3]),col="#FF8800",lwd=5)
segments3d(c(mean(x)+x[1],mean(x)+x.hat[1]),c(mean(x)+x[2],mean(x)+x.hat[2]),c(mean(x)+x[3],mean(x)+x.hat[3]),col="#00FFFF",lwd=5)
segments3d(c(mean(y)+beta.hat.2*(x[1]-mean(x)),y[1]),c(mean(y)+beta.hat.2*(x[2]-mean(x)),y[2]),c(mean(y)+beta.hat.2*(x[3]-mean(x)),y[3]),col="#00FF00",lwd=5)
Putting the pieces together,
$$\begin{aligned}
\hat{y} - \hat{y}_1 &= \frac{1}{2}(x - \hat{x}) \\
\hat{y} &= \hat{y}_1 + \frac{1}{2}(x - \hat{x}) \\
\hat{y} &= 2\mathbf{j} + \frac{1}{2}(x - \mathbf{j}) \\
\hat{y} &= 2\mathbf{j} + \frac{1}{2}x - \frac{1}{2}\mathbf{j} \\
\hat{y} &= \frac{3}{2}\mathbf{j} + \frac{1}{2}x,
\end{aligned}$$
so the fitted vector is $\hat{y} = (3, 1, 2)^\top$ and the coefficients $\frac{3}{2}$ and $\frac{1}{2}$ agree with the least squares estimates found in R Example 7.4.1.
R Example 7.10.2: The following code shows the computation time for various methods used to compute the estimates of
the regression coefficients in R Example 7.10.1.
x=c(3,-1,1)
X=cbind(1,x)
y=c(4,2,0)
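As a sketch of such a comparison (on a larger simulated problem, since the three-point data above are too small to time meaningfully), one might compare several equivalent ways of computing the least squares estimates with system.time():

# Sketch of a timing comparison on a larger simulated problem (illustrative only).
set.seed(9)
nb=5000; kb=20
Xb=cbind(1,matrix(rnorm(nb*kb),nb,kb))
yb=drop(Xb%*%rnorm(kb+1)+rnorm(nb))
system.time(for(i in 1:50) solve(t(Xb)%*%Xb)%*%t(Xb)%*%yb)        # textbook formula
system.time(for(i in 1:50) solve(crossprod(Xb),crossprod(Xb,yb))) # crossprod + solve
system.time(for(i in 1:50) qr.coef(qr(Xb),yb))                    # QR decomposition
system.time(for(i in 1:50) lm.fit(Xb,yb)$coefficients)            # lm.fit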
In R, we can also use C functions to perform computations more efficiently. Here is a method to perform linear regression based on least squares using the LAPACK routine dgels.
#include <R.h>
/* declaration of the LAPACK least squares solver dgels (QR-based) */
extern int dgels_(char *trans, int *m, int *n, int *nrhs, double *a, int *lda, double *b, int *ldb, double *work, int *lwork, int *info);
dyn.load("ls.dll")   # load the compiled C code
ourls = function(X,y){
n=as.integer(nrow(X))
p=as.integer(ncol(X))
out=.C("ls",as.double(X),beta=as.double(y),n,p)   # the copy of y named beta is overwritten with the solution
out$beta[1:p]   # the first p entries are the coefficient estimates
}
ourls(X,y)