Multiple Regression - Estimation
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n,$$
where $E(\varepsilon) = \mathbf{0}$ and $\operatorname{cov}(\varepsilon) = \sigma^2 I$.
Here, $\beta_0, \beta_1, \dots, \beta_k$ are fixed but unknown model parameters representing the (partial) regression coefficients.
In this chapter, we assume the $x_{ij}$'s are not random.
In matrix form, the model is
$$y = X\beta + \varepsilon$$
where
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The least squares estimate of $\beta$ minimizes the least squares function
$$Q(b) = (y - Xb)^\top (y - Xb) = \sum_{i=1}^n (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_k x_{ik})^2.$$
Theorem 7.3.1: If $n \ge p$ and $X$ is an $n \times p$ full rank matrix, then $X^\top X$ is a $p \times p$ full rank matrix.
Proof: Since $X$ is full rank, its columns are linearly independent (that is, $Xu = \mathbf{0}$ implies $u = \mathbf{0}$). Suppose $X^\top X u = \mathbf{0}$ for some vector $u$, and let $v = Xu = (v_1, \dots, v_n)^\top$. Then
$$v^\top v = u^\top X^\top X u = u^\top \mathbf{0} = 0.$$
Then $v = \mathbf{0}$, or equivalently, $Xu = \mathbf{0}$. Since $X$ is full rank, this implies that $u = \mathbf{0}$. Thus, $X^\top X u = \mathbf{0}$ implies $u = \mathbf{0}$, so $X^\top X$ is full rank.
Recall the following rules for differentiating with respect to a vector $x$:
$$\frac{\partial\,[c^\top x]}{\partial x} = \frac{\partial\,[x^\top c]}{\partial x} = c, \qquad
\frac{\partial\,[x^\top C x]}{\partial x} = 2Cx \ \text{ if } C \text{ is symmetric.}$$
Theorem 7.3.2 (p.142): Suppose we observe data $(x_{i1}, \dots, x_{ik}, y_i)$ for $i = 1, \dots, n$ from a multiple linear regression model where $n > k$. If $X$ has rank $k + 1$ (it is full rank), then the least squares estimate of $\beta$ is
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
Proof: Since
$$\begin{aligned}
Q(b) &= (y - Xb)^\top (y - Xb) \\
&= y^\top y - b^\top X^\top y - y^\top X b + b^\top X^\top X b \\
&= y^\top y - y^\top X b - y^\top X b + b^\top X^\top X b \\
&= y^\top y - 2 y^\top X b + b^\top X^\top X b,
\end{aligned}$$
we have
$$\frac{\partial Q}{\partial b} = 0 - 2(y^\top X)^\top + 2 X^\top X b = -2 X^\top y + 2 X^\top X b.$$
Setting $\frac{\partial Q}{\partial b} = 0$ and denoting the solution as $\hat{\beta}$, we write
$$\begin{aligned}
-2 X^\top y + 2 X^\top X \hat{\beta} &= 0 \\
2 X^\top X \hat{\beta} &= 2 X^\top y \\
X^\top X \hat{\beta} &= X^\top y \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y \quad (X^\top X \text{ is invertible since } X \text{ is full rank}).
\end{aligned}$$
To show that $\hat{\beta}$ minimizes $Q$, consider
$$\begin{aligned}
Q(b) &= (y - X\hat{\beta} + X\hat{\beta} - Xb)^\top (y - X\hat{\beta} + X\hat{\beta} - Xb) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 (y - X\hat{\beta})^\top (X\hat{\beta} - Xb) + (X\hat{\beta} - Xb)^\top (X\hat{\beta} - Xb) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 \big(X(\hat{\beta} - b)\big)^\top (y - X\hat{\beta}) + \big(X(\hat{\beta} - b)\big)^\top \big(X(\hat{\beta} - b)\big) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 (\hat{\beta} - b)^\top X^\top (y - X\hat{\beta}) + (\hat{\beta} - b)^\top X^\top X (\hat{\beta} - b) \\
&= (y - X\hat{\beta})^\top (y - X\hat{\beta}) + 2 (\hat{\beta} - b)^\top (X^\top y - X^\top X \hat{\beta}) + (\hat{\beta} - b)^\top X^\top X (\hat{\beta} - b).
\end{aligned}$$
Since $X^\top y - X^\top X \hat{\beta} = 0$ by the normal equations above,
$$Q(b) = (y - X\hat{\beta})^\top (y - X\hat{\beta}) + v^\top v$$
where $v = X(\hat{\beta} - b)$. Now, $v^\top v \ge 0$ with equality if and only if $v = X(\hat{\beta} - b) = 0$. This occurs if and only if $b = \hat{\beta}$ since $X$ is full rank. Thus, $Q(b) \ge Q(\hat{\beta})$ with equality if and only if $b = \hat{\beta}$.
Theorem 7.3.3 (p.145): If $X$ is full rank, then $E(\hat{\beta}) = \beta$ and $\operatorname{cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$.
Proof:
$$\begin{aligned}
E(\hat{\beta}) &= E\big[(X^\top X)^{-1} X^\top y\big] \\
&= (X^\top X)^{-1} X^\top E(y) \\
&= (X^\top X)^{-1} X^\top X \beta = \beta
\end{aligned}$$
and
$$\begin{aligned}
\operatorname{cov}(\hat{\beta}) &= \operatorname{cov}\big[(X^\top X)^{-1} X^\top y\big] \\
&= (X^\top X)^{-1} X^\top \operatorname{cov}(y)\big((X^\top X)^{-1} X^\top\big)^\top \\
&= (X^\top X)^{-1} X^\top \operatorname{cov}(y) X \big((X^\top X)^{-1}\big)^\top \\
&= (X^\top X)^{-1} X^\top \operatorname{cov}(y) X (X^\top X)^{-1} \\
&= (X^\top X)^{-1} X^\top (\sigma^2 I) X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}.
\end{aligned}$$
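As a quick numerical check of Theorem 7.3.3, the following simulation sketch (with arbitrary illustrative choices of $X$, $\beta$, and $\sigma$) compares the empirical mean and covariance of $\hat{\beta}$ across repeated samples with $\beta$ and $\sigma^2 (X^\top X)^{-1}$.

# Simulation sketch with arbitrary illustrative values of X, beta, and sigma.
set.seed(1)
n=20; k=2; sigma=1.5
X=cbind(1,matrix(rnorm(n*k),n,k))   # fixed full-rank design with an intercept column
beta=c(2,-1,0.5)
XtXinv=solve(t(X)%*%X)
B=replicate(5000,{
y=X%*%beta+rnorm(n,sd=sigma)
drop(XtXinv%*%t(X)%*%y)             # beta.hat for this simulated sample
})
rowMeans(B)                         # close to beta
cov(t(B))                           # close to sigma^2*(X'X)^{-1}
sigma^2*XtXinv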
Theorem 7.3.4 (p.150): Let $s^2 = \frac{1}{n - k - 1} Q(\hat{\beta})$. If $X$ is full rank, then $E(s^2) = \sigma^2$ for the multiple linear regression model.
Proof: Since
$$\begin{aligned}
Q(\hat{\beta}) &= (y - X\hat{\beta})^\top (y - X\hat{\beta}) \\
&= y^\top y - 2 \hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top X \hat{\beta} \\
&= y^\top y - 2 \hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top y \\
&= y^\top y - \hat{\beta}^\top X^\top y \\
&= y^\top y - y^\top X (X^\top X)^{-1} X^\top y \\
&= y^\top y - y^\top H y \\
&= y^\top (I - H) y
\end{aligned}$$
where $H = X (X^\top X)^{-1} X^\top$, Theorem 5.2.1 implies that
$$\begin{aligned}
E[Q(\hat{\beta})] &= \operatorname{tr}\big((I_n - H)(\sigma^2 I)\big) + (X\beta)^\top (I - H)(X\beta) \\
&= \sigma^2 \operatorname{tr}(I_n - H) + \beta^\top X^\top (I - H) X \beta \\
&= \sigma^2 \operatorname{tr}(I_n - H) + \beta^\top \big(X^\top X - X^\top X (X^\top X)^{-1} X^\top X\big) \beta \\
&= \sigma^2 \operatorname{tr}(I_n - H) + \beta^\top (X^\top X - X^\top X) \beta \\
&= \sigma^2 \operatorname{tr}(I_n - H) \\
&= \sigma^2 \{\operatorname{tr}(I_n) - \operatorname{tr}(H)\}.
\end{aligned}$$
Since $H^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = H$ so that $H$ is idempotent, Theorem 2.11.1 implies that
$$\begin{aligned}
\operatorname{tr}(H) &= \operatorname{tr}\big(X (X^\top X)^{-1} X^\top\big) \\
&= \operatorname{tr}\big(X^\top X (X^\top X)^{-1}\big) \\
&= \operatorname{tr}(I_{k+1}) \\
&= k + 1.
\end{aligned}$$
So, we have
$$E[Q(\hat{\beta})] = \sigma^2 \{n - (k + 1)\} = \sigma^2 (n - k - 1),$$
and it follows that $E(s^2) = \frac{1}{n - k - 1} E[Q(\hat{\beta})] = \sigma^2$.
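As a sketch with simulated illustrative data, $s^2$ computed directly from the formula above matches the squared residual standard error reported by lm().

# Sketch: compute s^2 = Q(beta.hat)/(n-k-1) directly and compare with lm()
# (simulated illustrative data).
set.seed(2)
n=30; k=2
x1=rnorm(n); x2=rnorm(n)
X=cbind(1,x1,x2)
y=drop(X%*%c(1,2,-1)+rnorm(n,sd=2))
beta.hat=solve(t(X)%*%X)%*%t(X)%*%y
s2=sum((y-X%*%beta.hat)^2)/(n-k-1)
c(s2,summary(lm(y~x1+x2))$sigma^2)   # the two values agree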
The next result is known as the Gauss-Markov Theorem.
Theorem 7.3.5 (p.148): For the multiple linear regression model, the least squares estimate is the best linear unbiased estimate (BLUE). That is, suppose we are interested in estimating $a^\top \beta$ where $a$ is a $(k+1)$-dimensional vector of constants, and we consider estimates of the form $c^\top y$. If $E(c^\top y) = a^\top \beta$ for all $\beta$, then $\operatorname{var}(c^\top y) \ge \operatorname{var}(a^\top \hat{\beta})$.
Proof: Let $d = X^\top c$. Since $a^\top \beta = E(c^\top y) = c^\top E(y) = c^\top X \beta = d^\top \beta$ for all $\beta$, it follows that $a = d$ (this can be shown since $\beta = (1, 0, \dots, 0)^\top$ implies $a_1 = d_1$, $\dots$, $\beta = (0, \dots, 0, 1)^\top$ implies $a_{k+1} = d_{k+1}$).
Also,
$$\begin{aligned}
c^\top (I - H) c &= c^\top (I - H)(I - H) c \\
&= c^\top (I - H)^\top (I - H) c \\
&= \big((I - H) c\big)^\top (I - H) c \ge 0.
\end{aligned}$$
Since $a = d = X^\top c$, $c^\top (I - H) c \ge 0$ implies that
$$\begin{aligned}
\operatorname{var}(a^\top \hat{\beta}) &= a^\top \operatorname{cov}(\hat{\beta}) a \\
&= a^\top \sigma^2 (X^\top X)^{-1} a \\
&= \sigma^2 c^\top X (X^\top X)^{-1} X^\top c \\
&= \sigma^2 c^\top H c \\
&\le \sigma^2 c^\top H c + \sigma^2 c^\top (I - H) c \\
&= \sigma^2 c^\top I c \\
&= c^\top (\sigma^2 I) c \\
&= c^\top \operatorname{cov}(y) c \\
&= \operatorname{var}(c^\top y).
\end{aligned}$$
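To see the Gauss-Markov result numerically, the following simulation sketch (with arbitrary illustrative values) compares the least squares slope with one simple alternative linear unbiased estimator of the slope, the two-point slope based on the design points $x = -1$ and $x = 4$; both are unbiased, but the least squares slope has the smaller variance.

# Sketch: both estimators are linear and unbiased for the slope beta1,
# but the least squares slope has smaller variance (illustrative values).
set.seed(3)
x=c(3,-1,1,4,0,2)                    # fixed design points
beta0=1.5; beta1=0.5; sigma=1
sims=replicate(10000,{
y=beta0+beta1*x+rnorm(length(x),sd=sigma)
c(ls=unname(coef(lm(y~x))[2]),
two.point=(y[4]-y[2])/(x[4]-x[2]))   # slope from the points with x = 4 and x = -1
})
rowMeans(sims)                       # both are close to beta1 = 0.5
apply(sims,1,var)                    # the least squares variance is smaller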
R Example 7.4.1: First, let's find the least squares estimate based on the formula in Theorem 7.3.2. To do this, we start by creating the design matrix and the response vector.
x=c(3,-1,1)
X=cbind(1,x);X
## x
## [1,] 1 3
## [2,] 1 -1
## [3,] 1 1
y=c(4,2,0);y
## [1] 4 2 0
beta.hat=solve(t(X)%*%X)%*%t(X)%*%y; beta.hat
## [,1]
## 1.5
## x 0.5
So, $\hat{\beta} = \begin{pmatrix} 1.5 \\ 0.5 \end{pmatrix}$ and the regression line for modeling $y$ based on $x$ is
$$\hat{y} = 1.5 + 0.5x.$$
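As a quick check, R's built-in lm() function reproduces the same coefficients.

coef(lm(y~x))   # (Intercept) 1.5, x 0.5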
Graphically, if we look at the rows of X and y, the least squares line minimizes the sum of squares of vertical distances of the data
points to the line as illustrated below.
plot(x,y,pch=19)
abline(beta.hat,col="blue")
y.hat=X%*%beta.hat
for (i in 1:3)
points(c(x[i],x[i]),c(y[i],y.hat[i]),type="l",col="red")
There is another geometrical interpretation if we look at the columns of $X$ and $y$. First, we plot the 3-dimensional vectors from the columns of $X$ in red and find the subspace of $\mathbb{R}^3$ spanned by these two vectors.
require(rgl)
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col="#AAAAAA")
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col="#AAAAAA")
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,x[1]),c(0,x[2]),c(0,x[3]),col="#FF0000",lwd=5)
Next, we add the response vector $y = (4, 2, 0)^\top$ in blue.
segments3d(c(0,y[1]),c(0,y[2]),c(0,y[3]),col="#0000FF",lwd=5)
Now, we project the response vector onto the subspace spanned by the columns of the design matrix. The fitted vector $\hat{y} = X\hat{\beta} = \begin{pmatrix} 3 \\ 1 \\ 2 \end{pmatrix}$ and the residual vector $\hat{\varepsilon} = y - \hat{y} = \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix}$ are shown below.
segments3d(c(0,y.hat[1]),c(0,y.hat[2]),c(0,y.hat[3]),col="#FF00FF",lwd=5)
segments3d(c(y[1],y.hat[1]),c(y[2],y.hat[2]),c(y[3],y.hat[3]),col="#00FF00",lwd=5)
For the normal multiple linear regression model, we additionally assume that $\varepsilon \sim N_n(\mathbf{0}, \sigma^2 I)$, so
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n,$$
with independent normal errors. The log-likelihood function is
$$\ell(\beta, \sigma^2) = \ln L(\beta, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} Q(\beta).$$
Theorem 7.6.1 (p.158): For the normal multiple linear regression model where $n > k$ and $X$ has rank $k + 1$ (it is full rank), the maximum likelihood estimator of $(\beta, \sigma^2)$ is $(\hat{\beta}, \hat{\sigma}^2)$ where
$$\hat{\beta} = (X^\top X)^{-1} X^\top y \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} Q(\hat{\beta}) = \frac{1}{n} (y - X\hat{\beta})^\top (y - X\hat{\beta}).$$
Proof: The partial derivatives of the log-likelihood are
$$\frac{\partial \ell}{\partial \beta} = -\frac{1}{2\sigma^2} \frac{\partial Q}{\partial \beta}
\quad \text{and} \quad
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} Q(\beta).$$
Setting the first equation equal to 0 gives
$$-\frac{1}{2\sigma^2} \frac{\partial Q}{\partial \beta} = 0 \quad \Longrightarrow \quad \frac{\partial Q}{\partial \beta} = 0 \quad \Longrightarrow \quad \hat{\beta} = (X^\top X)^{-1} X^\top y$$
as in the least squares derivation.
Substituting $\beta = \hat{\beta}$, setting the second equation equal to 0, and solving, we obtain
$$\begin{aligned}
-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2} Q(\hat{\beta}) &= 0 \\
\frac{n}{2\hat{\sigma}^2} &= \frac{1}{2(\hat{\sigma}^2)^2} Q(\hat{\beta}) \\
\hat{\sigma}^2 &= \frac{Q(\hat{\beta})}{n}.
\end{aligned}$$
To show that $(\hat{\beta}, \hat{\sigma}^2)$ maximizes $\ell$, let $\ell^*$ be the profile likelihood defined by
$$\ell^*(\sigma^2) = \ell(\hat{\beta}, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} Q(\hat{\beta}).$$
Since
$$\frac{d\ell^*}{d(\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} Q(\hat{\beta}) = \frac{n}{2(\sigma^2)^2} (\hat{\sigma}^2 - \sigma^2)$$
is positive when $\sigma^2 < \hat{\sigma}^2$ and negative when $\sigma^2 > \hat{\sigma}^2$, $\ell^*(\sigma^2)$ is maximized at $\sigma^2 = \hat{\sigma}^2$. Finally, we see that
$$\ell(\hat{\beta}, \hat{\sigma}^2) = \ell^*(\hat{\sigma}^2) \ge \ell^*(\sigma^2) = \ell(\hat{\beta}, \sigma^2) \ge \ell(\beta, \sigma^2)$$
for all $\beta$ and $\sigma^2 > 0$.
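A short sketch with simulated illustrative data shows the difference between the maximum likelihood estimate $\hat{\sigma}^2 = SSE/n$ and the unbiased estimate $s^2 = SSE/(n - k - 1)$.

# Sketch: the MLE of sigma^2 divides SSE by n; the unbiased estimate divides
# by n-k-1 (simulated illustrative data).
set.seed(4)
n=25; k=2
X=cbind(1,matrix(rnorm(n*k),n,k))
y=drop(X%*%c(1,0.5,-0.5)+rnorm(n))
beta.hat=solve(t(X)%*%X,t(X)%*%y)
SSE=sum((y-X%*%beta.hat)^2)
c(mle=SSE/n,unbiased=SSE/(n-k-1))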
Theorem 7.6.2: For the normal multiple linear regression model with $X$ full rank, (a) $\hat{\beta} \sim N_{k+1}\big(\beta, \sigma^2 (X^\top X)^{-1}\big)$, (b) $n\hat{\sigma}^2/\sigma^2 \sim \chi^2(n - k - 1)$, and (c) $\hat{\beta}$ and $\hat{\sigma}^2$ are independent.
Proof: (a) Theorem 4.4.1 implies that $\hat{\beta} = (X^\top X)^{-1} X^\top y$ follows a $(k+1)$-dimensional Normal distribution with mean vector
$$\begin{aligned}
E\big[(X^\top X)^{-1} X^\top y\big] &= (X^\top X)^{-1} X^\top E(y) \\
&= (X^\top X)^{-1} X^\top X \beta \\
&= \beta
\end{aligned}$$
and covariance matrix
$$\begin{aligned}
\operatorname{cov}\big[(X^\top X)^{-1} X^\top y\big] &= (X^\top X)^{-1} X^\top \operatorname{cov}(y) X (X^\top X)^{-1} \\
&= (X^\top X)^{-1} X^\top (\sigma^2 I) X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}.
\end{aligned}$$
(b) We have
$$\frac{n\hat{\sigma}^2}{\sigma^2} = \frac{(y - X\hat{\beta})^\top (y - X\hat{\beta})}{\sigma^2} = \frac{y^\top (I - H)(I - H) y}{\sigma^2} = \frac{1}{\sigma^2} y^\top (I - H) y \sim \chi^2(n - k - 1)$$
since $\left(\frac{1}{\sigma^2}(I - H)\right)(\sigma^2 I) = I - H$ is idempotent with rank $n - k - 1$.
(c) By Theorem 5.6.1, $\hat{\beta} = (X^\top X)^{-1} X^\top y$ and $\frac{n\hat{\sigma}^2}{\sigma^2} = \frac{1}{\sigma^2} y^\top (I - H) y$ are independent since
$$\begin{aligned}
(X^\top X)^{-1} X^\top (\sigma^2 I) \frac{1}{\sigma^2}(I - H) &= (X^\top X)^{-1} X^\top (I - H) \\
&= (X^\top X)^{-1} X^\top - (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top \\
&= (X^\top X)^{-1} X^\top - (X^\top X)^{-1} X^\top \\
&= O_{k+1,n}.
\end{aligned}$$
Then $\hat{\beta}$ and $\hat{\sigma}^2 = \frac{\sigma^2}{n}\left(\frac{n\hat{\sigma}^2}{\sigma^2}\right)$ are independent (see Theorem L3.5 from MATH 667).
Theorem 7.6.3 (p.160): For the normal multiple linear regression model, $\begin{pmatrix} \hat{\beta} \\ s^2 \end{pmatrix}$ is the uniform minimum variance unbiased estimator (UMVUE) of $\begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix}$.
The multiple regression model can also be written in centered form:
$$\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i \\
&= \alpha + \beta_1 (x_{i1} - \bar{x}_1) + \beta_2 (x_{i2} - \bar{x}_2) + \dots + \beta_k (x_{ik} - \bar{x}_k) + \varepsilon_i
\end{aligned}$$
where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij}$ for $j = 1, \dots, k$ and
$$\alpha = \beta_0 + \beta_1 \bar{x}_1 + \dots + \beta_k \bar{x}_k.$$
In matrix form,
$$y = (\mathbf{j}, X_c) \begin{pmatrix} \alpha \\ \beta_1 \end{pmatrix} + \varepsilon$$
where
$$\beta_1 = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$
$$X_1 = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}$$
is the design matrix with the first column of ones removed, and
$$X_c = \left(I - \frac{1}{n} J\right) X_1 = \begin{pmatrix} x_{11} - \bar{x}_1 & \cdots & x_{1k} - \bar{x}_k \\ x_{21} - \bar{x}_1 & \cdots & x_{2k} - \bar{x}_k \\ \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & \cdots & x_{nk} - \bar{x}_k \end{pmatrix}$$
is the centered matrix of predictors.
The least squares estimate of $\beta_1$ is
$$\hat{\beta}_1 = (X_c^\top X_c)^{-1} X_c^\top y.$$
Written in terms of sample variances and covariances,
$$\hat{\beta}_1 = S_{xx}^{-1} s_{yx} \quad \text{and} \quad \hat{\beta}_0 = \hat{\alpha} - \hat{\beta}_1^\top \bar{x} = \bar{y} - s_{yx}^\top S_{xx}^{-1} \bar{x}$$
where
$$S_{xx} = \begin{pmatrix} s_1^2 & s_{12} & \cdots & s_{1k} \\ s_{21} & s_2^2 & \cdots & s_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{k1} & s_{k2} & \cdots & s_k^2 \end{pmatrix} = \frac{1}{n-1} X_c^\top X_c, \qquad
s_{yx} = \begin{pmatrix} s_{y1} \\ s_{y2} \\ \vdots \\ s_{yk} \end{pmatrix} = \frac{1}{n-1} X_c^\top y, \qquad
\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_k \end{pmatrix}.$$
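A sketch of the centered-form computation in R (with simulated illustrative data): the sample covariance matrix and covariance vector from cov() give the same estimates as the usual least squares formulas.

# Sketch: beta.hat.1 = Sxx^{-1} syx and beta.hat.0 = ybar - syx' Sxx^{-1} xbar
# (simulated illustrative data).
set.seed(5)
n=40
X1=cbind(x1=rnorm(n),x2=runif(n))   # predictors without the column of ones
y=drop(2+3*X1[,1]-X1[,2]+rnorm(n))
Sxx=cov(X1)                         # (1/(n-1)) Xc'Xc
syx=cov(X1,y)                       # (1/(n-1)) Xc'y
beta1.hat=solve(Sxx,syx)
beta0.hat=mean(y)-drop(crossprod(beta1.hat,colMeans(X1)))
c(beta0.hat,beta1.hat)
coef(lm(y~X1))                      # same values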
Evaluating the least squares function at $\hat{\beta}$, we obtain the residual sum of squares (or error sum of squares) denoted by
$$SSE = Q(\hat{\beta}) = \hat{\varepsilon}^\top \hat{\varepsilon} = (y - X\hat{\beta})^\top (y - X\hat{\beta}) = y^\top y - \hat{\beta}^\top X^\top y.$$
The total sum of squares $SST = \sum_{i=1}^n (y_i - \bar{y})^2$ can be partitioned as $SST = SSE + SSR$, where
$$SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \hat{\beta}^\top X^\top y - n\bar{y}^2 = \hat{\beta}_1^\top X_c^\top y$$
is the regression sum of squares.
The coefficient of determination $R^2 = \frac{SSR}{SST}$ is the proportion of variation explained by the regression model compared with that explained only using the mean. This gives a measure of how well the model fits the data.
$R$ is sometimes referred to as the multiple correlation coefficient.
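A short sketch (with simulated illustrative data) computes $SSE$, $SSR$, $SST$, and $R^2$ from the formulas above and compares $R^2$ with the value reported by summary(lm()).

# Sketch: SSE = y'y - beta.hat'X'y, SST = sum((y-ybar)^2), SSR = SST - SSE
# (simulated illustrative data).
set.seed(6)
n=30
x1=rnorm(n); x2=rnorm(n)
y=1+2*x1-x2+rnorm(n)
X=cbind(1,x1,x2)
beta.hat=solve(t(X)%*%X,t(X)%*%y)
SSE=drop(sum(y^2)-t(beta.hat)%*%t(X)%*%y)
SST=sum((y-mean(y))^2)
SSR=SST-SSE
c(R2=SSR/SST,lm=summary(lm(y~x1+x2))$r.squared)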
If $\operatorname{cov}(\varepsilon) = \sigma^2 V$ for a known positive definite matrix $V$, the generalized least squares estimate of $\beta$ minimizes the generalized least squares function
$$Q_V(b) = (y - Xb)^\top V^{-1} (y - Xb) = \sum_{i=1}^n \sum_{j=1}^n v^{ij} (y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik})(y_j - b_0 - b_1 x_{j1} - \dots - b_k x_{jk})$$
where $v^{ij}$ denotes the $(i,j)$ element of $V^{-1}$.
Theorem 7.8.1 (p.165): Given observed data $(x_{i1}, \dots, x_{ik}, y_i)$ for $i = 1, \dots, n$ from a multiple linear regression model where $n > k$ and $X$ has rank $k + 1$ (it is full rank), the generalized least squares estimate of $\beta$ is
$$\hat{\beta} = (X^\top V^{-1} X)^{-1} X^\top V^{-1} y.$$
Furthermore, $s^2 = \frac{1}{n - k - 1} Q_V(\hat{\beta})$ is an unbiased estimate of $\sigma^2$.
Proof: The results follow from the results in Section 7.3 with y replaced by V −1/2 y and X replaced by V −1/2 X.
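A minimal sketch of the generalized least squares computation (with a hypothetical AR(1)-type covariance matrix $V$ chosen only for illustration):

# Sketch: beta.hat = (X'V^{-1}X)^{-1} X'V^{-1} y with an assumed V (illustration only).
set.seed(7)
n=15
x=rnorm(n)
X=cbind(1,x)
V=0.5^abs(outer(1:n,1:n,"-"))          # assumed AR(1)-type correlation structure
y=drop(X%*%c(1,2)+t(chol(V))%*%rnorm(n))
VinvX=solve(V,X)                       # V^{-1}X
beta.gls=solve(t(X)%*%VinvX,t(VinvX)%*%y)
r=y-X%*%beta.gls
s2=drop(t(r)%*%solve(V,r))/(n-1-1)     # Q_V(beta.hat)/(n-k-1) with k = 1
beta.gls; s2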
Now suppose the correct model is
$$y = X\beta + \varepsilon = (X_1, X_2) \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \varepsilon = X_1 \beta_1 + X_2 \beta_2 + \varepsilon$$
but we incorrectly leave out the variables in the columns of $X_2$ and use the model
$$y = X_1 \beta_1^* + \varepsilon^*.$$
If $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$ is the least squares estimator from the misspecified model, then
(a) $E(\hat{\beta}_1^*) = \beta_1 + A\beta_2$ where $A = (X_1^\top X_1)^{-1} X_1^\top X_2$ is the alias matrix;
(b) $\operatorname{cov}(\hat{\beta}_1^*) = \sigma^2 (X_1^\top X_1)^{-1}$.
Proof: For (a), we compute
$$\begin{aligned}
E(\hat{\beta}_1^*) &= (X_1^\top X_1)^{-1} X_1^\top E(y) \\
&= (X_1^\top X_1)^{-1} X_1^\top (X_1 \beta_1 + X_2 \beta_2) \\
&= (X_1^\top X_1)^{-1} X_1^\top X_1 \beta_1 + (X_1^\top X_1)^{-1} X_1^\top X_2 \beta_2 \\
&= \beta_1 + (X_1^\top X_1)^{-1} X_1^\top X_2 \beta_2.
\end{aligned}$$
For (b),
$$\begin{aligned}
\operatorname{cov}(\hat{\beta}_1^*) &= (X_1^\top X_1)^{-1} X_1^\top \operatorname{cov}(y) X_1 (X_1^\top X_1)^{-1} \\
&= (X_1^\top X_1)^{-1} X_1^\top (\sigma^2 I) X_1 (X_1^\top X_1)^{-1} \\
&= \sigma^2 (X_1^\top X_1)^{-1} X_1^\top X_1 (X_1^\top X_1)^{-1} \\
&= \sigma^2 (X_1^\top X_1)^{-1}.
\end{aligned}$$
Let
$$s_1^2 = \frac{(y - X_1 \hat{\beta}_1^*)^\top (y - X_1 \hat{\beta}_1^*)}{n - p - 1}$$
be the estimator of $\sigma^2$ based only on $X_1$, where $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$ and $X_1$ has $p + 1$ columns. If the true model is $y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon$, then
$$E(s_1^2) = \sigma^2 + \frac{\beta_2^\top X_2^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2 \beta_2}{n - p - 1}.$$
Proof: We compute
$$\begin{aligned}
E\big[(y - X_1 \hat{\beta}_1^*)^\top (y - X_1 \hat{\beta}_1^*)\big]
&= E\big[(y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)^\top (y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)\big] \\
&= E\big[\big((I - X_1 (X_1^\top X_1)^{-1} X_1^\top) y\big)^\top (y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)\big] \\
&= E\big[y^\top (I - X_1 (X_1^\top X_1)^{-1} X_1^\top)(y - X_1 (X_1^\top X_1)^{-1} X_1^\top y)\big] \\
&= E\big[y^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) y\big] \\
&= \operatorname{tr}\big((I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top)(\sigma^2 I)\big) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 \big(\operatorname{tr}(I_n) - \operatorname{tr}(X_1 (X_1^\top X_1)^{-1} X_1^\top)\big) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 \big(\operatorname{tr}(I_n) - \operatorname{tr}(X_1^\top X_1 (X_1^\top X_1)^{-1})\big) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 (n - p - 1) + \beta^\top X^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X \beta \\
&= \sigma^2 (n - p - 1) + (\beta_1^\top X_1^\top + \beta_2^\top X_2^\top)(I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top)(X_1 \beta_1 + X_2 \beta_2) \\
&= \sigma^2 (n - p - 1) + \beta_2^\top X_2^\top (I_n - X_1 (X_1^\top X_1)^{-1} X_1^\top) X_2 \beta_2,
\end{aligned}$$
and dividing by $n - p - 1$ gives the result.
For example, suppose $n = 4$ and
$$X = (X_1, X_2) \text{ with } X_1 = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \text{ and } X_2 = \begin{pmatrix} 1 \\ 4 \\ 9 \\ 16 \end{pmatrix},$$
so
$$X_1^\top X_1 = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}, \qquad X_2^\top X_2 = 354, \qquad X_1^\top X_2 = \begin{pmatrix} 30 \\ 100 \end{pmatrix}.$$
Then
$$(X_1^\top X_1)^{-1} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix} = \begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix}$$
so
$$X_2^\top X_1 (X_1^\top X_1)^{-1} X_1^\top X_2 = (30, 100)\begin{pmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{pmatrix}\begin{pmatrix} 30 \\ 100 \end{pmatrix} = 350$$
and it follows that
$$E(s_1^2) - \sigma^2 = \frac{\beta_2^2 (354 - 350)}{4 - 1 - 1} = 2\beta_2^2.$$
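These numbers can be verified directly in R with a quick sketch.

# Sketch: verify X2'X2 = 354, X2'X1(X1'X1)^{-1}X1'X2 = 350, and the bias factor 2.
X1=cbind(1,1:4)
X2=(1:4)^2
A=solve(t(X1)%*%X1)%*%t(X1)%*%X2            # alias matrix
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
sum(X2^2)                                   # 354
drop(t(X2)%*%H1%*%X2)                       # 350
drop(t(X2)%*%(diag(4)-H1)%*%X2)/(4-1-1)     # 2, so E(s1^2) - sigma^2 = 2*beta2^2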
Recall that if $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ where $A_{11}$ and $B = A_{22} - A_{21} A_{11}^{-1} A_{12}$ are invertible, then
$$A^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} B^{-1} A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} B^{-1} \\ -B^{-1} A_{21} A_{11}^{-1} & B^{-1} \end{pmatrix}.$$
Theorem 7.9.3 (p.171): Suppose $X_1^\top X_1$ is invertible. If $\hat{\beta} = (X^\top X)^{-1} X^\top y$ where $X = (X_1, X_2)$, $\hat{\beta} = \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}$, and $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$, then
$$\operatorname{cov}(\hat{\beta}_1) - \operatorname{cov}(\hat{\beta}_1^*) = \sigma^2 A B^{-1} A^\top$$
where $A = (X_1^\top X_1)^{-1} X_1^\top X_2$ and $B = X_2^\top X_2 - X_2^\top X_1 A$. Furthermore, $A B^{-1} A^\top$ is positive definite or positive semidefinite, so $\operatorname{var}(\hat{\beta}_j) \ge \operatorname{var}(\hat{\beta}_j^*)$ for all $j$.
Proof: Since
$$\begin{aligned}
\operatorname{cov}(\hat{\beta}) &= \sigma^2 (X^\top X)^{-1} \\
&= \sigma^2 \begin{pmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{pmatrix}^{-1} \\
&= \sigma^2 \begin{pmatrix} (X_1^\top X_1)^{-1} + (X_1^\top X_1)^{-1} X_1^\top X_2 B^{-1} X_2^\top X_1 (X_1^\top X_1)^{-1} & -(X_1^\top X_1)^{-1} X_1^\top X_2 B^{-1} \\ -B^{-1} X_2^\top X_1 (X_1^\top X_1)^{-1} & B^{-1} \end{pmatrix} \\
&= \sigma^2 \begin{pmatrix} (X_1^\top X_1)^{-1} + A B^{-1} A^\top & -A B^{-1} \\ -B^{-1} A^\top & B^{-1} \end{pmatrix}
\end{aligned}$$
where $B = X_2^\top X_2 - X_2^\top X_1 (X_1^\top X_1)^{-1} X_1^\top X_2 = X_2^\top X_2 - X_2^\top X_1 A$, we have
$$\operatorname{cov}(\hat{\beta}_1) = \sigma^2 \big((X_1^\top X_1)^{-1} + A B^{-1} A^\top\big).$$
Thus, $\operatorname{cov}(\hat{\beta}_1) - \operatorname{cov}(\hat{\beta}_1^*) = \sigma^2 A B^{-1} A^\top$. This matrix is positive definite or positive semidefinite: since
$$B = X_2^\top X_2 - X_2^\top X_1 A = X_2^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2,$$
for any $u$ with the dimension of $\beta_2$,
$$u^\top B u = u^\top X_2^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big)^\top \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2 u = w^\top w \ge 0$$
where $w = \big(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\big) X_2 u$, so $B$ is positive semidefinite (positive definite when it is invertible), and hence $B^{-1}$ is positive definite. Then, for any $v$ with the dimension of $\beta_1$, $v^\top A B^{-1} A^\top v = (A^\top v)^\top B^{-1} (A^\top v) \ge 0$.
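For the small four-observation design from the earlier example, the variance comparison in Theorem 7.9.3 can be seen numerically with a short sketch (variances are shown up to the factor $\sigma^2$).

# Sketch: diagonal of (X'X)^{-1} (full model) versus (X1'X1)^{-1} (reduced model);
# the reduced-model variances are smaller, as Theorem 7.9.3 implies.
X1=cbind(1,1:4)
X2=(1:4)^2
X=cbind(X1,X2)
diag(solve(t(X)%*%X))[1:2]     # full-model variances / sigma^2
diag(solve(t(X1)%*%X1))        # reduced-model variances / sigma^2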
7.10 Orthogonalization
Notice that if the columns of $X_1$ are orthogonal to the columns of $X_2$ (that is, $X_1^\top X_2 = O$), then the least squares estimates of $\beta_1$ and $\beta_2$ based on fitting models separately with only $X_1$ and $X_2$, respectively, are the same as those from the full model based on $X = (X_1, X_2)$.
Note that $X_1^\top X_2 = O$ implies $A = O$ and $B = X_2^\top X_2$. Then we see that
$$\begin{aligned}
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
&= (X^\top X)^{-1} X^\top y \\
&= \begin{pmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{pmatrix}^{-1} \begin{pmatrix} X_1^\top \\ X_2^\top \end{pmatrix} y \\
&= \begin{pmatrix} (X_1^\top X_1)^{-1} + A B^{-1} A^\top & -A B^{-1} \\ -B^{-1} A^\top & B^{-1} \end{pmatrix} \begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix} \\
&= \begin{pmatrix} (X_1^\top X_1)^{-1} & O \\ O & (X_2^\top X_2)^{-1} \end{pmatrix} \begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix} \\
&= \begin{pmatrix} (X_1^\top X_1)^{-1} X_1^\top y \\ (X_2^\top X_2)^{-1} X_2^\top y \end{pmatrix} \\
&= \begin{pmatrix} \hat{\beta}_1^* \\ \hat{\beta}_2^* \end{pmatrix}.
\end{aligned}$$
Orthogonalization also provides a method for updating coefficient estimates from a regression model of $y$ based only on $X_1$ to obtain coefficient estimates for the full model of $y$ based on $X = (X_1, X_2)$, even when the columns of $X_2$ are not orthogonal to the columns of $X_1$.
Here are the steps:
Step 1: Regress $y$ on $X_1$ to obtain $\hat{\beta}_1^* = (X_1^\top X_1)^{-1} X_1^\top y$ and residuals $\hat{\varepsilon}_1^* = y - \hat{y}_1$ where $\hat{y}_1 = X_1 \hat{\beta}_1^*$.
Step 2: Simultaneously regress each column of $X_2$ on $X_1$, where each column of $(X_1^\top X_1)^{-1} X_1^\top X_2 = A$ gives the coefficients for the corresponding column. Then the residuals are $X_{2 \cdot 1} = X_2 - X_1 A$.
Step 3: Regress $y - \hat{y}_1$ on $X_{2 \cdot 1}$ to obtain $\hat{\beta}_2$. This gives
$$\begin{aligned}
\hat{y} - \hat{y}_1 &= X_{2 \cdot 1} \hat{\beta}_2 \\
\hat{y} - \hat{y}_1 &= (X_2 - X_1 A) \hat{\beta}_2 \\
\hat{y} &= \hat{y}_1 - X_1 A \hat{\beta}_2 + X_2 \hat{\beta}_2 \\
\hat{y} &= X_1 \hat{\beta}_1^* - X_1 A \hat{\beta}_2 + X_2 \hat{\beta}_2 \\
\hat{y} &= X_1 (\hat{\beta}_1^* - A \hat{\beta}_2) + X_2 \hat{\beta}_2
\end{aligned}$$
so $\hat{\beta}_1 = \hat{\beta}_1^* - A \hat{\beta}_2$.
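The following sketch (with simulated illustrative data) carries out the three steps and checks that the updated coefficients agree with fitting the full model directly.

# Sketch of the three-step update (simulated illustrative data).
set.seed(8)
n=50
X1=cbind(1,rnorm(n))                  # initial model: intercept and one predictor
X2=cbind(rnorm(n),rnorm(n))           # variables added later
y=drop(X1%*%c(1,2)+X2%*%c(-1,0.5)+rnorm(n))
b1.star=solve(t(X1)%*%X1,t(X1)%*%y)   # Step 1
A=solve(t(X1)%*%X1,t(X1)%*%X2)        # Step 2: alias matrix
X2.1=X2-X1%*%A                        #         residual columns
b2=solve(t(X2.1)%*%X2.1,t(X2.1)%*%(y-X1%*%b1.star))   # Step 3
b1=b1.star-A%*%b2
Xfull=cbind(X1,X2)
cbind(update=c(b1,b2),direct=drop(solve(t(Xfull)%*%Xfull,t(Xfull)%*%y)))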
R Example 7.10.1: Let's see how the orthogonalization method works geometrically for the data in R Example 7.4.1.
First, we regress $y = (4, 2, 0)^\top$ on $\mathbf{j} = (1, 1, 1)^\top$:
$$\hat{\beta}_1^* = \frac{\mathbf{j}^\top y}{\mathbf{j}^\top \mathbf{j}} = \frac{6}{3} = 2, \qquad \hat{y}_1 = 2\mathbf{j}, \qquad y - \hat{y}_1 = \begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix}$$
as illustrated below.
x=c(3,-1,1)
y=c(4,2,0)
j=rep(1,3)
y.hat.1=mean(y)*j
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,y[1]),c(0,y[2]),c(0,y[3]),col="#0000FF",lwd=5)
segments3d(c(y[1],y.hat.1[1]),c(y[2],y.hat.1[2]),c(y[3],y.hat.1[3]),col="#FF8800",lwd=5)
Then we find a vector in the subspace spanned by $x = (3, -1, 1)^\top$ and $\mathbf{j} = (1, 1, 1)^\top$ that is orthogonal to $\mathbf{j}$ by computing the residual after regressing $x$ on $\mathbf{j}$:
$$A = \frac{\mathbf{j}^\top x}{\mathbf{j}^\top \mathbf{j}} = \frac{3}{3} = 1, \qquad \hat{x} = 1\,\mathbf{j} = \mathbf{j}, \qquad X_{2 \cdot 1} = x - \hat{x} = \begin{pmatrix} 2 \\ -2 \\ 0 \end{pmatrix}$$
as illustrated below.
x.hat=mean(x)*j
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,x[1]),c(0,x[2]),c(0,x[3]),col="#FF0000",lwd=5)
segments3d(c(x[1],x.hat[1]),c(x[2],x.hat[2]),c(x[3],x.hat[3]),col="#00FFFF",lwd=5)
Finally, we regress $y - \hat{y}_1 = (2, 0, -2)^\top$ on $x - \hat{x} = (2, -2, 0)^\top$:
$$\hat{\beta}_2 = \frac{(y - \hat{y}_1)^\top (x - \hat{x})}{(x - \hat{x})^\top (x - \hat{x})} = \frac{4 + 0 + 0}{4 + 4 + 0} = \frac{1}{2}.$$
The projection of $y - \hat{y}_1$ onto $x - \hat{x}$ is
$$\hat{y} - \hat{y}_1 = \frac{1}{2}(x - \hat{x}).$$
The residual
$$(y - \hat{y}_1) - (\hat{y} - \hat{y}_1) = y - \hat{y} = \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix}$$
is illustrated below.
beta.hat.2=sum((y-y.hat.1)*(x-x.hat))/sum((x-x.hat)^2)
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(y[1],y.hat.1[1]),c(y[2],y.hat.1[2]),c(y[3],y.hat.1[3]),col="#FF8800",lwd=5)
segments3d(c(mean(x)+x[1],mean(x)+x.hat[1]),c(mean(x)+x[2],mean(x)+x.hat[2]),c(mean(x)+x[3],mean(x)+x.hat[3]),col="#00FFFF",lwd=5)
segments3d(c(mean(y)+beta.hat.2*(x[1]-mean(x)),y[1]),c(mean(y)+beta.hat.2*(x[2]-mean(x)),y[2]),c(mean(y)+beta.hat.2*(x[3]-mean(x)),y[3]),col="#00FF00",lwd=5)
Putting the pieces together,
$$\begin{aligned}
\hat{y} - \hat{y}_1 &= \frac{1}{2}(x - \hat{x}) \\
\hat{y} &= \hat{y}_1 + \frac{1}{2}(x - \hat{x}) \\
\hat{y} &= 2\mathbf{j} + \frac{1}{2}(x - \mathbf{j}) \\
\hat{y} &= 2\mathbf{j} + \frac{1}{2}x - \frac{1}{2}\mathbf{j} \\
\hat{y} &= \frac{3}{2}\mathbf{j} + \frac{1}{2}x,
\end{aligned}$$
so the fitted vector is $\hat{y} = (3, 1, 2)^\top$ and the coefficients $\frac{3}{2}$ and $\frac{1}{2}$ agree with the least squares estimates found in R Example 7.4.1.
R Example 7.10.2: The following code shows the computation time for various methods used to compute the estimates of
the regression coefficients in R Example 7.10.1.
x=c(3,-1,1)
X=cbind(1,x)
y=c(4,2,0)
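As a sketch of such a comparison (on a larger simulated problem, since the three-point data above are too small to time meaningfully), one might compare several equivalent ways of computing the least squares estimates with system.time():

# Sketch of a timing comparison on a larger simulated problem (illustrative only).
set.seed(9)
nb=5000; kb=20
Xb=cbind(1,matrix(rnorm(nb*kb),nb,kb))
yb=drop(Xb%*%rnorm(kb+1)+rnorm(nb))
system.time(for(i in 1:50) solve(t(Xb)%*%Xb)%*%t(Xb)%*%yb)        # textbook formula
system.time(for(i in 1:50) solve(crossprod(Xb),crossprod(Xb,yb))) # crossprod + solve
system.time(for(i in 1:50) qr.coef(qr(Xb),yb))                    # QR decomposition
system.time(for(i in 1:50) lm.fit(Xb,yb)$coefficients)            # lm.fit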
In R, we can also use C functions to perform computations more efficiently. Here is a method to perform linear regression based on least squares using the LAPACK routine dgels.
#include <R.h>
/* declaration of the LAPACK least squares solver dgels (QR-based) */
extern int dgels_(char *trans, int *m, int *n, int *nrhs, double *a, int *lda, double *b, int *ldb, double *work, int *lwork, int *info);
dyn.load("ls.dll")   # load the compiled C code
ourls = function(X,y){
n=as.integer(nrow(X))
p=as.integer(ncol(X))
out=.C("ls",as.double(X),beta=as.double(y),n,p)   # the copy of y named beta is overwritten with the solution
out$beta[1:p]   # the first p entries are the coefficient estimates
}
ourls(X,y)