Classical Least Squares Theory

CHUNG-MING KUAN
Department of Finance & CRETA

National Taiwan University
October 18, 2014
C.-M. Kuan (Finance & CRETA, NTU) Classical Least Squares Theory October 18, 2014 1 / 100
Lecture Outline
1 The Method of Ordinary Least Squares (OLS)
Simple Linear Regression
Multiple Linear Regression
Geometric Interpretations
Measures of Goodness of Fit
Example: Analysis of Suicide Rate
2 Statistical Properties of the OLS Estimator
Classical Conditions
Without the Normality Condition
With the Normality Condition
3 Hypothesis Testing
Tests for Linear Hypotheses
Power of the Tests
Alternative Interpretation of the F Test
Confidence Regions
Lecture Outline (cont’d)
4 Multicollinearity
Near Multicollinearity
Regression with Dummy Variables
5 Limitation of the Classical Conditions
6 The Method of Generalized Least Squares (GLS)

The GLS Estimator
Stochastic Properties of the GLS Estimator
The Feasible GLS Estimator
Heteroskedasticity
Serial Correlation
Application: Linear Probability Model
Application: Seemingly Unrelated Regressions
Simple Linear Regression
Given the variable of interest y, we are interested in finding a function of
another variable x that can characterize the systematic behavior of y.
y: Dependent variable or regressand
x: Explanatory variable or regressor
Specifying a linear function of x: α + βx with unknown parameters α
and β
The non-systematic part is the error: y − (α + βx)
Together we write:
$$y = \underbrace{\alpha + \beta x}_{\text{linear function}} + \underbrace{e(\alpha, \beta)}_{\text{error}}.$$
For the specification α + βx, the objective is to find the “best” fit of the
data $(y_t, x_t)$, $t = 1, \ldots, T$.
1 Minimizing a least-squares (LS) criterion function wrt α and β:
$$Q_T(\alpha, \beta) := \frac{1}{T} \sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2.$$
2 Minimizing a least-absolute-deviation (LAD) criterion wrt α and β:
$$\frac{1}{T} \sum_{t=1}^{T} |y_t - \alpha - \beta x_t|.$$
3 Minimizing asymmetrically weighted absolute deviations:
$$\frac{1}{T} \left[ \theta \sum_{t:\, y_t > \alpha + \beta x_t} |y_t - \alpha - \beta x_t| + (1 - \theta) \sum_{t:\, y_t < \alpha + \beta x_t} |y_t - \alpha - \beta x_t| \right],$$
with $0 < \theta < 1$.
The OLS Estimators
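The first order conditions (FOCs) of LS minimization are:
$$\frac{\partial Q_T(\alpha, \beta)}{\partial \alpha} = -\frac{2}{T} \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) = 0,$$
$$\frac{\partial Q_T(\alpha, \beta)}{\partial \beta} = -\frac{2}{T} \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) x_t = 0.$$
The solutions are known as the ordinary least squares (OLS) estimators:
$$\hat\beta_T = \frac{\sum_{t=1}^{T} (y_t - \bar y)(x_t - \bar x)}{\sum_{t=1}^{T} (x_t - \bar x)^2}, \qquad \hat\alpha_T = \bar y - \hat\beta_T \bar x.$$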
The estimated regression line is ŷ = α̂T + β̂T x, which is the linear
function evaluated at α̂T and β̂T, and ê = y − ŷ is the error
evaluated at α̂T and β̂T, also known as the residual.
The t-th fitted value of the regression line is ŷt = α̂T + β̂T xt.
The t-th residual is êt = yt − ŷt = et(α̂T, β̂T).
No other linear function of the form a + bx can provide a better fit of
the data in terms of the sum of squared errors.
β̂T characterizes the predicted change of y given a one-unit change of x,
whereas α̂T is the predicted y without (the information of) x.
Note that the OLS method requires no assumption on the data,
except that xt cannot be a constant.
Algebraic Properties
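Substituting α̂T and β̂T into the FOCs:
$$\frac{1}{T} \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) = 0, \qquad \frac{1}{T} \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) x_t = 0,$$
we have the following algebraic results:
$\sum_{t=1}^{T} \hat e_t = 0$.
$\sum_{t=1}^{T} \hat e_t x_t = 0$.
$\sum_{t=1}^{T} y_t = \sum_{t=1}^{T} \hat y_t$, so that $\bar y = \bar{\hat y}$.
$\bar y = \hat\alpha_T + \hat\beta_T \bar x$; that is, the estimated regression line must pass
through the point $(\bar x, \bar y)$.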
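The closed-form estimators and the algebraic results above can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `simple_ols` and the five data points are hypothetical and serve only as an illustration.

```python
import numpy as np

def simple_ols(x, y):
    """Return (alpha_hat, beta_hat) from the closed-form OLS formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    if sxx == 0.0:
        # OLS is not defined when x_t is a constant.
        raise ValueError("x_t must not be constant")
    beta_hat = np.sum((y - y_bar) * (x - x_bar)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

alpha_hat, beta_hat = simple_ols(x, y)
y_fit = alpha_hat + beta_hat * x   # fitted values
resid = y - y_fit                  # residuals

print("sum of residuals:      ", resid.sum())
print("sum of residuals * x_t:", (resid * x).sum())
print("mean of y vs. mean of fitted values:", y.mean(), y_fit.mean())
```

Both sums should be zero up to floating-point error, and the two means should coincide, which is the statement that the fitted line passes through the point of sample means.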
Suppose we want to know how the suicide rate (s) in Taiwan can be
explained by the unemployment rate (u), the GDP growth rate (g), or time
(t). The suicide rate is measured per 100,000 persons.
Data (1981–2013): s̄ = 12.05 with s.d. 3.91; ḡ = 5.64 with s.d. 3.16;
ū = 3.09 with s.d. 1.33.
Estimation results:
ŝt = 15.11 − 0.54 gt, R̄² = 0.17;
ŝt = 15.56 − 0.61 gt−1, R̄² = 0.21;
ŝt = 4.40 + 2.48 ut, R̄² = 0.70;
ŝt = 4.84 + 2.40 ut−1, R̄² = 0.68;
ŝt = 7.40 + 0.27 t, R̄² = 0.44.
Figure: Suicide Rates 1981–2013. (a) Suicide & GDP growth rates (series: Suicide Rate, Real GDP Growth Rate); (b) Suicide and unemployment rates (series: Suicide Rate, Unemployment Rate). Horizontal axis: years 1981–2013.
Multiple Linear Regression
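With k regressors $x_1, \ldots, x_k$ ($x_1$ is usually the constant one):
$$y = \beta_1 x_1 + \cdots + \beta_k x_k + e(\beta_1, \ldots, \beta_k).$$
With data $(y_t, x_{t1}, \ldots, x_{tk})$, $t = 1, \ldots, T$, we can write
$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{e}(\boldsymbol\beta), \tag{1}$$
where $\boldsymbol\beta = (\beta_1\ \beta_2\ \cdots\ \beta_k)'$,
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{Tk} \end{bmatrix}, \quad
\mathbf{e}(\boldsymbol\beta) = \begin{bmatrix} e_1(\boldsymbol\beta) \\ e_2(\boldsymbol\beta) \\ \vdots \\ e_T(\boldsymbol\beta) \end{bmatrix}.$$
Least-squares criterion function:
$$Q_T(\boldsymbol\beta) := \frac{1}{T}\, \mathbf{e}(\boldsymbol\beta)' \mathbf{e}(\boldsymbol\beta) = \frac{1}{T} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)' (\mathbf{y} - \mathbf{X}\boldsymbol\beta). \tag{2}$$
The FOCs of minimizing $Q_T(\boldsymbol\beta)$ are $-2\mathbf{X}'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)/T = \mathbf{0}$, leading
to the normal equations:
$$\mathbf{X}'\mathbf{X}\boldsymbol\beta = \mathbf{X}'\mathbf{y}.$$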
Identification Requirement [ID-1]: X is of full column rank k.
No column of X is a linear combination of the other columns.
Intuition: X does not contain redundant information.
When X is not of full column rank, we say there exists exact
multicollinearity among regressors.
Given [ID-1], $\mathbf{X}'\mathbf{X}$ is positive definite and hence invertible. The unique
solution to the normal equations is known as the OLS estimator of β:
$$\hat{\boldsymbol\beta}_T = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}. \tag{3}$$
Under [ID-1], we have the second order condition:
$\nabla^2_{\boldsymbol\beta}\, Q_T(\boldsymbol\beta) = 2(\mathbf{X}'\mathbf{X})/T$ is p.d.
The result below requires only the identification requirement and does
not depend on the statistical properties of y and X.
Theorem 3.1
Given specification (1), suppose [ID-1] holds. Then, the OLS estimator
$\hat{\boldsymbol\beta}_T = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$ uniquely minimizes the criterion function (2).
The magnitude of β̂ T is affected by the measurement units of the
dependent and explanatory variables.
A larger coefficient does not imply that the associated regressor is more
important.
The so-called “beta coefficients” (see homework) do not depend on the
measurement units, and hence their magnitudes are comparable.
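Given $\hat{\boldsymbol\beta}_T$, the vector of the OLS fitted values is $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}_T$, and the
vector of the OLS residuals is $\hat{\mathbf{e}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{e}(\hat{\boldsymbol\beta}_T)$.
Plugging $\hat{\boldsymbol\beta}_T$ into the FOCs, $\mathbf{X}'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{0}$, we have:
$\mathbf{X}'\hat{\mathbf{e}} = \mathbf{0}$.
When X contains a vector of ones, $\sum_{t=1}^{T} \hat e_t = 0$.
$\hat{\mathbf{y}}'\hat{\mathbf{e}} = \hat{\boldsymbol\beta}_T' \mathbf{X}'\hat{\mathbf{e}} = 0$.
These are all algebraic results under the OLS method.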
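The normal equations and the orthogonality results above can be illustrated with a short NumPy sketch. The helper name `ols`, the simulated design matrix, and the coefficient values are assumptions made for this illustration; they are not part of the lecture.

```python
import numpy as np

def ols(X, y):
    """Solve the normal equations X'X b = X'y; requires [ID-1] (full column rank)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    T, k = X.shape
    if np.linalg.matrix_rank(X) < k:
        raise ValueError("[ID-1] fails: X is not of full column rank")
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_fit = X @ beta_hat
    resid = y - y_fit
    return beta_hat, y_fit, resid

# Simulated data (hypothetical), with a column of ones as the first regressor.
rng = np.random.default_rng(0)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

beta_hat, y_fit, resid = ols(X, y)
print("X'e_hat:    ", X.T @ resid)     # zero vector up to rounding
print("sum e_hat:  ", resid.sum())     # zero because X contains a constant
print("y_hat'e_hat:", y_fit @ resid)   # zero
```

In practice a least-squares routine such as `np.linalg.lstsq` is numerically preferable to forming X'X explicitly; the explicit solve is used here only because it mirrors equation (3).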
Geometric Interpretations
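Recall that $\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is the orthogonal projection matrix that
projects vectors onto span(X), and $\mathbf{I}_T - \mathbf{P}$ is the orthogonal projection
matrix that projects vectors onto span(X)⊥, the orthogonal complement of
span(X). Thus, $\mathbf{P}\mathbf{X} = \mathbf{X}$ and $(\mathbf{I}_T - \mathbf{P})\mathbf{X} = \mathbf{0}$.
The vector of fitted values, $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}_T = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{P}\mathbf{y}$, is the
orthogonal projection of y onto span(X).
The residual vector, $\hat{\mathbf{e}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_T - \mathbf{P})\mathbf{y}$, is the orthogonal
projection of y onto span(X)⊥.
$\hat{\mathbf{e}}$ is orthogonal to X, i.e., $\mathbf{X}'\hat{\mathbf{e}} = \mathbf{0}$, and it is also orthogonal to $\hat{\mathbf{y}}$
because $\hat{\mathbf{y}}$ is in span(X), i.e., $\hat{\mathbf{y}}'\hat{\mathbf{e}} = 0$.
Figure: The orthogonal projection of y onto span(x1, x2), showing $\mathbf{P}\mathbf{y} = \mathbf{x}_1\hat\beta_1 + \mathbf{x}_2\hat\beta_2$ and $\hat{\mathbf{e}} = (\mathbf{I} - \mathbf{P})\mathbf{y}$.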
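Theorem 3.3 (Frisch-Waugh-Lovell)
Given $\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \mathbf{e}$, the OLS estimators of $\boldsymbol\beta_1$ and $\boldsymbol\beta_2$ are
$$\hat{\boldsymbol\beta}_{1,T} = [\mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1]^{-1} \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{y},$$
$$\hat{\boldsymbol\beta}_{2,T} = [\mathbf{X}_2'(\mathbf{I} - \mathbf{P}_1)\mathbf{X}_2]^{-1} \mathbf{X}_2'(\mathbf{I} - \mathbf{P}_1)\mathbf{y},$$
where $\mathbf{P}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$ and $\mathbf{P}_2 = \mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'$.
$\hat{\boldsymbol\beta}_{1,T}$ can also be computed from regressing $(\mathbf{I} - \mathbf{P}_2)\mathbf{y}$ on $(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1$,
where $(\mathbf{I} - \mathbf{P}_2)\mathbf{y}$ and $(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1$ are the residual vectors of y on $\mathbf{X}_2$
and $\mathbf{X}_1$ on $\mathbf{X}_2$, respectively. That is, $\hat{\boldsymbol\beta}_{1,T}$ characterizes the marginal
effect of $\mathbf{X}_1$ on y, after the effect of $\mathbf{X}_2$ is purged away.
Similarly, regressing $(\mathbf{I} - \mathbf{P}_1)\mathbf{y}$ on $(\mathbf{I} - \mathbf{P}_1)\mathbf{X}_2$ yields $\hat{\boldsymbol\beta}_{2,T}$. That
is, $\hat{\boldsymbol\beta}_{2,T}$ characterizes the marginal effect of $\mathbf{X}_2$ on y, after the effect
of $\mathbf{X}_1$ is purged away.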
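Proof: Writing $\mathbf{y} = \mathbf{X}_1\hat{\boldsymbol\beta}_{1,T} + \mathbf{X}_2\hat{\boldsymbol\beta}_{2,T} + (\mathbf{I} - \mathbf{P})\mathbf{y}$, where $\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$
with $\mathbf{X} = [\mathbf{X}_1\ \mathbf{X}_2]$, we have
$$\begin{aligned}
\mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{y}
&= \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1\hat{\boldsymbol\beta}_{1,T} + \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_2\hat{\boldsymbol\beta}_{2,T} + \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)(\mathbf{I} - \mathbf{P})\mathbf{y} \\
&= \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1\hat{\boldsymbol\beta}_{1,T} + \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)(\mathbf{I} - \mathbf{P})\mathbf{y},
\end{aligned}$$
where the second equality uses $(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_2 = \mathbf{0}$.
We know span($\mathbf{X}_2$) ⊆ span(X), so that span(X)⊥ ⊆ span($\mathbf{X}_2$)⊥. Hence,
$(\mathbf{I} - \mathbf{P}_2)(\mathbf{I} - \mathbf{P}) = \mathbf{I} - \mathbf{P}$, and
$$\begin{aligned}
\mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{y} &= \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1\hat{\boldsymbol\beta}_{1,T} + \mathbf{X}_1'(\mathbf{I} - \mathbf{P})\mathbf{y} \\
&= \mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1\hat{\boldsymbol\beta}_{1,T},
\end{aligned}$$
from which we obtain the expression for $\hat{\boldsymbol\beta}_{1,T}$.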
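The theorem can be verified numerically. The sketch below, in Python with NumPy, compares the coefficient on X1 from the full regression with the one obtained after partialling out X2. The helper `annihilator` and all simulated data are hypothetical choices for illustration.

```python
import numpy as np

def annihilator(Z):
    """Return I - P_Z, where P_Z = Z (Z'Z)^{-1} Z' projects onto span(Z)."""
    T = Z.shape[0]
    return np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Simulated data (hypothetical); X1 is deliberately correlated with X2.
rng = np.random.default_rng(1)
T = 40
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
X1 = (0.5 * X2[:, 1] + rng.normal(size=T)).reshape(-1, 1)
y = 1.0 + 2.0 * X1[:, 0] - 1.0 * X2[:, 1] + rng.normal(size=T)

# Full regression of y on X = [X1 X2].
X = np.column_stack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# FWL: regress (I - P2) y on (I - P2) X1.
M2 = annihilator(X2)
beta1_fwl = np.linalg.lstsq(M2 @ X1, M2 @ y, rcond=None)[0]

# Short regression of y on X1 only, which ignores X2.
beta1_short = np.linalg.lstsq(X1, y, rcond=None)[0]

resid_full = y - X @ beta_full
resid_fwl = M2 @ y - (M2 @ X1) @ beta1_fwl

print("full regression, coefficient on X1:", beta_full[0])
print("FWL two-step coefficient:          ", beta1_fwl[0])
print("short regression coefficient:      ", beta1_short[0])
```

The first two numbers agree, and the two residual vectors are identical, whereas the short regression generally gives a different coefficient because X1 and X2 are not orthogonal in this design.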
Some Implications of the FWL Theorem
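The OLS estimator of regressing y on $\mathbf{X}_1$ is not the same as $\hat{\boldsymbol\beta}_{1,T}$,
unless $\mathbf{X}_1$ and $\mathbf{X}_2$ are orthogonal to each other.
Regressing y on $\mathbf{X}_1$ and $\mathbf{X}_2$ enables us to assess the marginal effect of
$\mathbf{X}_1$ on y while “controlling” the effect of $\mathbf{X}_2$ and also the marginal
effect of $\mathbf{X}_2$ on y while “controlling” the effect of $\mathbf{X}_1$.
When one regresses y on $\mathbf{X}_1$ only, the estimated marginal effect of $\mathbf{X}_1$
is affected by other relevant variables not included in the regression.
Observe that $(\mathbf{I} - \mathbf{P}_1)\mathbf{y} = (\mathbf{I} - \mathbf{P}_1)\mathbf{X}_2\hat{\boldsymbol\beta}_{2,T} + (\mathbf{I} - \mathbf{P}_1)(\mathbf{I} - \mathbf{P})\mathbf{y}$.
$(\mathbf{I} - \mathbf{P}_1)(\mathbf{I} - \mathbf{P}) = \mathbf{I} - \mathbf{P}$, so that the residual vector of regressing
$(\mathbf{I} - \mathbf{P}_1)\mathbf{y}$ on $(\mathbf{I} - \mathbf{P}_1)\mathbf{X}_2$ is identical to the residual vector of
regressing y on $\mathbf{X} = [\mathbf{X}_1\ \mathbf{X}_2]$:
$$(\mathbf{I} - \mathbf{P}_1)\mathbf{y} = (\mathbf{I} - \mathbf{P}_1)\mathbf{X}_2\hat{\boldsymbol\beta}_{2,T} + (\mathbf{I} - \mathbf{P})\mathbf{y}.$$
$\mathbf{P}_1 = \mathbf{P}_1\mathbf{P}$, so that the orthogonal projection of y directly on
span($\mathbf{X}_1$) (i.e., $\mathbf{P}_1\mathbf{y}$) is equivalent to iterated projections of y on
span(X) and then on span($\mathbf{X}_1$) (i.e., $\mathbf{P}_1\mathbf{P}\mathbf{y}$). Hence,
$$(\mathbf{I} - \mathbf{P}_1)\mathbf{X}_2\hat{\boldsymbol\beta}_{2,T} = (\mathbf{I} - \mathbf{P}_1)\mathbf{P}\mathbf{y} = (\mathbf{P} - \mathbf{P}_1)\mathbf{y}.$$
Figure: An illustration of the Frisch-Waugh-Lovell Theorem, showing $\mathbf{P}\mathbf{y}$, $\mathbf{P}_1\mathbf{y}$, $(\mathbf{I} - \mathbf{P}_1)\mathbf{y}$, $(\mathbf{P} - \mathbf{P}_1)\mathbf{y}$ and $\hat{\mathbf{e}} = (\mathbf{I} - \mathbf{P})\mathbf{y}$.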
Measures of Goodness of Fit
Given $\hat{\mathbf{y}}'\hat{\mathbf{e}} = 0$, we have $\mathbf{y}'\mathbf{y} = \hat{\mathbf{y}}'\hat{\mathbf{y}} + \hat{\mathbf{e}}'\hat{\mathbf{e}}$, where $\mathbf{y}'\mathbf{y}$ is known as TSS
(total sum of squares), $\hat{\mathbf{y}}'\hat{\mathbf{y}}$ is RSS (regression sum of squares), and
$\hat{\mathbf{e}}'\hat{\mathbf{e}}$ is ESS (error sum of squares).
The non-centered coefficient of determination (or non-centered $R^2$),
$$R^2 = \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\text{ESS}}{\text{TSS}}, \tag{4}$$
measures the proportion of the total variation of $y_t$ that can be
explained by the model.
It is invariant wrt measurement units of the dependent variable but not
invariant wrt constant addition.
It is a relative measure such that $0 \le R^2 \le 1$.
It is nondecreasing in the number of regressors. (Why?)
Centered R 2
When the specification contains a constant term,
$$\underbrace{\sum_{t=1}^{T} (y_t - \bar y)^2}_{\text{centered TSS}} = \underbrace{\sum_{t=1}^{T} (\hat y_t - \bar{\hat y})^2}_{\text{centered RSS}} + \underbrace{\sum_{t=1}^{T} \hat e_t^2}_{\text{ESS}}.$$
The centered coefficient of determination (or centered $R^2$),
$$R^2 = \frac{\sum_{t=1}^{T} (\hat y_t - \bar y)^2}{\sum_{t=1}^{T} (y_t - \bar y)^2} = \frac{\text{Centered RSS}}{\text{Centered TSS}} = 1 - \frac{\text{ESS}}{\text{Centered TSS}},$$
measures the proportion of the total variation of $y_t$ that can be
explained by the model, excluding the effect of the constant term.
It is invariant wrt constant addition.
$0 \le R^2 \le 1$, and it is non-decreasing in the number of regressors.
It may be negative when the model does not contain a constant term.
Centered R 2 : Alternative Interpretation
When the specification contains a constant term,
$$\sum_{t=1}^{T} (y_t - \bar y)(\hat y_t - \bar y) = \sum_{t=1}^{T} (\hat y_t - \bar y + \hat e_t)(\hat y_t - \bar y) = \sum_{t=1}^{T} (\hat y_t - \bar y)^2,$$
because $\sum_{t=1}^{T} \hat y_t \hat e_t = \sum_{t=1}^{T} \hat e_t = 0$.
Centered $R^2$ can also be expressed as
$$R^2 = \frac{\sum_{t=1}^{T} (\hat y_t - \bar y)^2}{\sum_{t=1}^{T} (y_t - \bar y)^2} = \frac{\left[\sum_{t=1}^{T} (y_t - \bar y)(\hat y_t - \bar y)\right]^2}{\left[\sum_{t=1}^{T} (y_t - \bar y)^2\right]\left[\sum_{t=1}^{T} (\hat y_t - \bar y)^2\right]},$$
which is the squared sample correlation coefficient of $y_t$ and $\hat y_t$,
also known as the squared multiple correlation coefficient.
Models for different dependent variables are not comparable in terms of $R^2$.
Adjusted R 2
Adjusted $R^2$ is the centered $R^2$ adjusted for the degrees of freedom:
$$\bar R^2 = 1 - \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}/(T - k)}{(\mathbf{y}'\mathbf{y} - T\bar y^2)/(T - 1)}.$$
$\bar R^2$ adds a penalty term to $R^2$:
$$\bar R^2 = 1 - \frac{T - 1}{T - k}(1 - R^2) = R^2 - \frac{k - 1}{T - k}(1 - R^2),$$
where the penalty term depends on the trade-off between model
complexity and model explanatory ability.
$\bar R^2$ may be negative and need not be non-decreasing in k.
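The three measures of fit can be computed side by side. The following NumPy sketch assumes a design matrix whose first column is the constant; the function name `fit_measures` and the simulated data are hypothetical.

```python
import numpy as np

def fit_measures(X, y):
    """Return (non-centered R^2, centered R^2, adjusted R^2, fitted values)."""
    T, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_fit = X @ beta_hat
    resid = y - y_fit
    ess = resid @ resid                          # error sum of squares
    tss_centered = np.sum((y - y.mean()) ** 2)   # equals y'y - T * ybar^2
    r2_noncentered = 1.0 - ess / (y @ y)
    r2_centered = 1.0 - ess / tss_centered
    r2_adjusted = 1.0 - (ess / (T - k)) / (tss_centered / (T - 1))
    return r2_noncentered, r2_centered, r2_adjusted, y_fit

# Simulated data (hypothetical); the model contains a constant term.
rng = np.random.default_rng(2)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = 3.0 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(size=T)

r2_nc, r2_c, r2_adj, y_fit = fit_measures(X, y)
corr_sq = np.corrcoef(y, y_fit)[0, 1] ** 2   # squared multiple correlation

print("non-centered R^2:", r2_nc)
print("centered R^2:    ", r2_c)
print("adjusted R^2:    ", r2_adj)
print("corr(y, y_hat)^2:", corr_sq)
```

With a constant in the model, the centered R² coincides with the squared sample correlation between y and the fitted values, and the adjusted R² equals the centered R² minus the penalty (k − 1)(1 − R²)/(T − k).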
Q: How can the suicide rate (s) be explained by the unemployment rate
(u), the GDP growth rate (g), and time (t) during 1981–2010?
Estimation results with gt and ut :
ŝt = 15.11 − 0.54 gt , R̄ 2 = 0.17;

ŝt = 4.40 + 2.48 ut , R̄ 2 = 0.70;
ŝt = 4.20 + 2.50 ut + 0.02 gt , R̄ 2 = 0.69.
Estimation results with gt−1 and ut−1 :
ŝt = 15.56 − 0.61 gt−1 , R̄ 2 = 0.21;

ŝt = 4.84 + 2.40 ut−1 , R̄ 2 = 0.68;
ŝt = 5.56 + 2.31 ut−1 − 0.08 gt−1 , R̄ 2 = 0.67.
Estimation results with t but without g :
ŝt = 4.40 + 2.48 ut , R̄ 2 = 0.70;

ŝt = 4.41 + 2.41 ut + 0.01 t, R̄ 2 = 0.69;
ŝt = 4.84 + 2.40 ut−1 , R̄ 2 = 0.68;
ŝt = 4.84 + 2.35 ut−1 + 0.01 t, R̄ 2 = 0.67.
Estimation results with t and g :
ŝt = 4.20 + 2.50 ut + 0.02 gt , R̄ 2 = 0.69;

ŝt = 4.16 + 2.43 ut + 0.03 gt + 0.01 t, R̄ 2 = 0.68;
ŝt = 5.56 + 2.31 ut−1 − 0.08 gt−1 , R̄ 2 = 0.67;
ŝt = 5.55 + 2.29 ut−1 − 0.08 gt−1 + 0.004 t, R̄ 2 = 0.66.
Estimation results with t and t²:
ŝt = 7.40 + 0.27 t, R̄² = 0.44;
ŝt = 11.60 − 0.45 t + 0.02 t², R̄² = 0.63;
ŝt = 8.05 + 1.93 ut − 0.46 t + 0.02 t², R̄² = 0.78;
ŝt = 12.27 − 0.09 gt − 0.45 t + 0.02 t², R̄² = 0.62;
ŝt = 7.68 + 1.96 ut + 0.04 gt − 0.46 t + 0.02 t², R̄² = 0.77;
ŝt = 8.37 + 1.80 ut−1 − 0.43 t + 0.01 t², R̄² = 0.75;
ŝt = 13.02 − 0.18 gt−1 − 0.45 t + 0.02 t², R̄² = 0.64;
ŝt = 9.03 + 1.74 ut−1 − 0.07 gt−1 − 0.43 t + 0.01 t², R̄² = 0.74.
As far as R̄² is concerned, a specification with t, t², and u seems to
provide a good fit of the data and a reasonable interpretation.
Q: Is there any other way to determine if a specification is “good”?
Classical Conditions
To derive the statistical properties of the OLS estimator, we assume:
[A1] X is non-stochastic.
[A2] y is a random vector such that
(i) $\mathbb{E}(\mathbf{y}) = \mathbf{X}\boldsymbol\beta_o$ for some $\boldsymbol\beta_o$;
(ii) $\mathrm{var}(\mathbf{y}) = \sigma_o^2 \mathbf{I}_T$ for some $\sigma_o^2 > 0$.
[A3] y is a random vector s.t. $\mathbf{y} \sim N(\mathbf{X}\boldsymbol\beta_o, \sigma_o^2 \mathbf{I}_T)$ for some $\boldsymbol\beta_o$ and $\sigma_o^2 > 0$.
The specification (1) with [A1] and [A2] is known as the classical
linear model, whereas (1) with [A1] and [A3] is the classical normal
linear model.
When $\mathrm{var}(\mathbf{y}) = \sigma_o^2 \mathbf{I}_T$, the elements of y are homoskedastic and
(serially) uncorrelated.
Without Normality
The OLS estimator of the parameter $\sigma_o^2$ is an average of squared residuals:
$$\hat\sigma_T^2 = \frac{1}{T - k} \sum_{t=1}^{T} \hat e_t^2.$$
Theorem 3.4
Consider the linear specification (1).
(a) Given [A1] and [A2](i), $\hat{\boldsymbol\beta}_T$ is unbiased for $\boldsymbol\beta_o$.
(b) Given [A1] and [A2], $\hat\sigma_T^2$ is unbiased for $\sigma_o^2$.
(c) Given [A1] and [A2], $\mathrm{var}(\hat{\boldsymbol\beta}_T) = \sigma_o^2 (\mathbf{X}'\mathbf{X})^{-1}$.
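Proof: By [A1], $\mathbb{E}(\hat{\boldsymbol\beta}_T) = \mathbb{E}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbb{E}(\mathbf{y})$. [A2](i)
gives $\mathbb{E}(\mathbf{y}) = \mathbf{X}\boldsymbol\beta_o$, so that
$$\mathbb{E}(\hat{\boldsymbol\beta}_T) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol\beta_o = \boldsymbol\beta_o,$$
proving unbiasedness. Given $\hat{\mathbf{e}} = (\mathbf{I}_T - \mathbf{P})\mathbf{y} = (\mathbf{I}_T - \mathbf{P})(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)$,
$$\begin{aligned}
\mathbb{E}(\hat{\mathbf{e}}'\hat{\mathbf{e}}) &= \mathbb{E}\big[\mathrm{trace}\big((\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)'(\mathbf{I}_T - \mathbf{P})(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)\big)\big] \\
&= \mathbb{E}\big[\mathrm{trace}\big((\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)'(\mathbf{I}_T - \mathbf{P})\big)\big] \\
&= \mathrm{trace}\big(\mathbb{E}[(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)'](\mathbf{I}_T - \mathbf{P})\big) \\
&= \mathrm{trace}\big(\sigma_o^2 \mathbf{I}_T (\mathbf{I}_T - \mathbf{P})\big) \\
&= \sigma_o^2\, \mathrm{trace}(\mathbf{I}_T - \mathbf{P}),
\end{aligned}$$
where the 4-th equality follows from [A2](ii) that $\mathrm{var}(\mathbf{y}) = \sigma_o^2 \mathbf{I}_T$.
Proof (cont’d): As $\mathrm{trace}(\mathbf{I}_T - \mathbf{P}) = \mathrm{rank}(\mathbf{I}_T - \mathbf{P}) = T - k$, we have
$\mathbb{E}(\hat{\mathbf{e}}'\hat{\mathbf{e}}) = \sigma_o^2 (T - k)$ and
$$\mathbb{E}(\hat\sigma_T^2) = \mathbb{E}(\hat{\mathbf{e}}'\hat{\mathbf{e}})/(T - k) = \sigma_o^2,$$
proving (b). By [A1] and [A2](ii),
$$\begin{aligned}
\mathrm{var}(\hat{\boldsymbol\beta}_T) &= \mathrm{var}\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\big) \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathrm{var}(\mathbf{y})]\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma_o^2 (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{I}_T\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma_o^2 (\mathbf{X}'\mathbf{X})^{-1}.
\end{aligned}$$
This establishes the assertion of (c).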
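Theorem 3.4 can be illustrated by simulation. The sketch below holds X fixed across replications (as in [A1]) and draws uniform errors, chosen deliberately to be non-normal since the theorem needs only [A2]. The sample size, parameter values, seed, and number of replications are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)
T, k = 30, 3
sigma_o = 2.0
beta_o = np.array([1.0, -2.0, 0.5])

# [A1]: X is drawn once and then held fixed across replications.
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
print("trace(I - P) =", np.trace(np.eye(T) - P), " T - k =", T - k)

# [A2]: mean X beta_o and variance sigma_o^2 I_T; uniform on
# (-sqrt(3), sqrt(3)) has mean 0 and variance 1, and is not normal.
n_rep = 20000
E = sigma_o * rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n_rep, T))
Y = X @ beta_o + E                      # one simulated sample per row

B = Y @ X @ XtX_inv                     # row i holds beta_hat' for sample i
resid = Y - B @ X.T
s2 = np.sum(resid ** 2, axis=1) / (T - k)

mean_beta = B.mean(axis=0)
mean_s2 = s2.mean()
cov_beta = np.cov(B, rowvar=False)

print("average beta_hat:", mean_beta, " (beta_o =", beta_o, ")")
print("average sigma2_hat:", mean_s2, " (sigma_o^2 =", sigma_o ** 2, ")")
print("max |cov(beta_hat) - sigma_o^2 (X'X)^{-1}|:",
      np.abs(cov_beta - sigma_o ** 2 * XtX_inv).max())
```

The Monte Carlo averages should be close to the true values, with discrepancies that shrink as the number of replications grows.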
Theorem 3.4 establishes unbiasedness of the OLS estimators β̂ T and
σ̂T2 but does not address the issue of efficiency.
By Theorem 3.4(c), the elements of $\hat{\boldsymbol\beta}_T$ can be more precisely
estimated (i.e., with a smaller variance) when X has larger variation.
To see this, consider the simple linear regression $y = \alpha + \beta x + e$; it
can be verified that
$$\mathrm{var}(\hat\beta_T) = \sigma_o^2\, \frac{1}{\sum_{t=1}^{T} (x_t - \bar x)^2}.$$
Thus, the larger the (squared) variation of $x_t$ (i.e., $\sum_{t=1}^{T} (x_t - \bar x)^2$),
the smaller is the variance of $\hat\beta_T$.
The result below establishes efficiency of β̂ T among all unbiased
estimators of β o that are linear in y.
Theorem 3.5 (Gauss-Markov)

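Given linear specification (1), suppose that [A1] and [A2] hold. Then the
OLS estimator $\hat{\boldsymbol\beta}_T$ is the best linear unbiased estimator (BLUE) for $\boldsymbol\beta_o$.
Proof: Consider an arbitrary linear estimator $\check{\boldsymbol\beta}_T = \mathbf{A}\mathbf{y}$, where A is a
non-stochastic matrix, say, $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{C}$. Then, $\check{\boldsymbol\beta}_T = \hat{\boldsymbol\beta}_T + \mathbf{C}\mathbf{y}$,
such that
$$\mathrm{var}(\check{\boldsymbol\beta}_T) = \mathrm{var}(\hat{\boldsymbol\beta}_T) + \mathrm{var}(\mathbf{C}\mathbf{y}) + 2\,\mathrm{cov}(\hat{\boldsymbol\beta}_T, \mathbf{C}\mathbf{y}).$$
By [A1] and [A2](i), $\mathbb{E}(\check{\boldsymbol\beta}_T) = \boldsymbol\beta_o + \mathbf{C}\mathbf{X}\boldsymbol\beta_o$, which is unbiased iff $\mathbf{C}\mathbf{X} = \mathbf{0}$.
Proof (cont’d): The condition $\mathbf{C}\mathbf{X} = \mathbf{0}$ implies $\mathrm{cov}(\hat{\boldsymbol\beta}_T, \mathbf{C}\mathbf{y}) = \mathbf{0}$. Thus,
$$\mathrm{var}(\check{\boldsymbol\beta}_T) = \mathrm{var}(\hat{\boldsymbol\beta}_T) + \mathrm{var}(\mathbf{C}\mathbf{y}) = \mathrm{var}(\hat{\boldsymbol\beta}_T) + \sigma_o^2 \mathbf{C}\mathbf{C}'.$$
This shows that $\mathrm{var}(\check{\boldsymbol\beta}_T) - \mathrm{var}(\hat{\boldsymbol\beta}_T)$ is the p.s.d. matrix $\sigma_o^2 \mathbf{C}\mathbf{C}'$, so that $\hat{\boldsymbol\beta}_T$
is more efficient than any linear unbiased estimator $\check{\boldsymbol\beta}_T$.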
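Example: Suppose $\mathbb{E}(\mathbf{y}) = \mathbf{X}_1\mathbf{b}_1$ and $\mathrm{var}(\mathbf{y}) = \sigma_o^2 \mathbf{I}_T$. Consider two specifications:
$$\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{e},$$
with the OLS estimator $\hat{\mathbf{b}}_{1,T}$, and
$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{e} = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \mathbf{e},$$
with the OLS estimator $\hat{\boldsymbol\beta}_T = (\hat{\boldsymbol\beta}_{1,T}'\ \hat{\boldsymbol\beta}_{2,T}')'$. Clearly, $\hat{\mathbf{b}}_{1,T}$ is the BLUE of
$\mathbf{b}_1$ with $\mathrm{var}(\hat{\mathbf{b}}_{1,T}) = \sigma_o^2 (\mathbf{X}_1'\mathbf{X}_1)^{-1}$. By the Frisch-Waugh-Lovell Theorem,
$$\mathbb{E}(\hat{\boldsymbol\beta}_{1,T}) = \mathbb{E}\big\{[\mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{X}_1]^{-1}\mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{y}\big\} = \mathbf{b}_1,$$
$$\mathbb{E}(\hat{\boldsymbol\beta}_{2,T}) = \mathbb{E}\big\{[\mathbf{X}_2'(\mathbf{I}_T - \mathbf{P}_1)\mathbf{X}_2]^{-1}\mathbf{X}_2'(\mathbf{I}_T - \mathbf{P}_1)\mathbf{y}\big\} = \mathbf{0}.$$
That is, $\hat{\boldsymbol\beta}_T$ is unbiased for $(\mathbf{b}_1'\ \mathbf{0}')'$.
Example (cont’d):
$$\mathrm{var}(\hat{\boldsymbol\beta}_{1,T}) = \mathrm{var}\big([\mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{X}_1]^{-1}\mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{y}\big) = \sigma_o^2 [\mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{X}_1]^{-1}.$$
As $\mathbf{X}_1'\mathbf{X}_1 - \mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{X}_1 = \mathbf{X}_1'\mathbf{P}_2\mathbf{X}_1$ is p.s.d., it follows that
$$[\mathbf{X}_1'(\mathbf{I}_T - \mathbf{P}_2)\mathbf{X}_1]^{-1} - (\mathbf{X}_1'\mathbf{X}_1)^{-1}$$
is p.s.d. This shows that $\hat{\mathbf{b}}_{1,T}$ is more efficient than $\hat{\boldsymbol\beta}_{1,T}$, as it ought to
be.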
With Normality
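Under [A3] that $\mathbf{y} \sim N(\mathbf{X}\boldsymbol\beta_o, \sigma_o^2 \mathbf{I}_T)$, the log-likelihood function of y is
$$\log L(\boldsymbol\beta, \sigma^2) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta).$$
The score vector is
$$\mathbf{s}(\boldsymbol\beta, \sigma^2) = \begin{bmatrix} \dfrac{1}{\sigma^2}\mathbf{X}'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) \\[2mm] -\dfrac{T}{2\sigma^2} + \dfrac{1}{2\sigma^4}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) \end{bmatrix}.$$
Solutions to $\mathbf{s}(\boldsymbol\beta, \sigma^2) = \mathbf{0}$ are the (quasi) maximum likelihood
estimators (MLEs). Clearly, the MLE of β is the OLS estimator, and
the MLE of $\sigma^2$ is
$$\tilde\sigma_T^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}_T)'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}_T)}{T} = \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{T} \ne \hat\sigma_T^2.$$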
With the normality condition on y, a lot more can be said about the OLS
estimators.
Theorem 3.7
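Given the linear specification (1), suppose that [A1] and [A3] hold.
(a) $\hat{\boldsymbol\beta}_T \sim N\big(\boldsymbol\beta_o, \sigma_o^2 (\mathbf{X}'\mathbf{X})^{-1}\big)$.
(b) $(T - k)\hat\sigma_T^2/\sigma_o^2 \sim \chi^2(T - k)$.
(c) $\hat\sigma_T^2$ has mean $\sigma_o^2$ and variance $2\sigma_o^4/(T - k)$.
Proof: For (a), we note that $\hat{\boldsymbol\beta}_T$ is a linear transformation of
$\mathbf{y} \sim N(\mathbf{X}\boldsymbol\beta_o, \sigma_o^2 \mathbf{I}_T)$ and hence also a normal random vector. As for (b),
writing $\hat{\mathbf{e}} = (\mathbf{I}_T - \mathbf{P})(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)$, we have
$$(T - k)\hat\sigma_T^2/\sigma_o^2 = \hat{\mathbf{e}}'\hat{\mathbf{e}}/\sigma_o^2 = \mathbf{y}^{*\prime}(\mathbf{I}_T - \mathbf{P})\mathbf{y}^*,$$
where $\mathbf{y}^* = (\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)/\sigma_o \sim N(\mathbf{0}, \mathbf{I}_T)$ by [A3].
Proof (cont’d): Let C orthogonally diagonalize $\mathbf{I}_T - \mathbf{P}$ such that
$\mathbf{C}'(\mathbf{I}_T - \mathbf{P})\mathbf{C} = \boldsymbol\Lambda$. Since $\mathrm{rank}(\mathbf{I}_T - \mathbf{P}) = T - k$, Λ contains $T - k$
eigenvalues equal to one and k eigenvalues equal to zero. Then,
$$\mathbf{y}^{*\prime}(\mathbf{I}_T - \mathbf{P})\mathbf{y}^* = \mathbf{y}^{*\prime}\mathbf{C}[\mathbf{C}'(\mathbf{I}_T - \mathbf{P})\mathbf{C}]\mathbf{C}'\mathbf{y}^* = \boldsymbol\eta' \begin{bmatrix} \mathbf{I}_{T-k} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \boldsymbol\eta,$$
where $\boldsymbol\eta = \mathbf{C}'\mathbf{y}^*$. As $\boldsymbol\eta \sim N(\mathbf{0}, \mathbf{I}_T)$, the $\eta_i$ are independent, standard normal
random variables. It follows that
$$\mathbf{y}^{*\prime}(\mathbf{I}_T - \mathbf{P})\mathbf{y}^* = \sum_{i=1}^{T-k} \eta_i^2 \sim \chi^2(T - k),$$
proving (b). (c) is a direct consequence of (b) and the facts that
$\chi^2(T - k)$ has mean $T - k$ and variance $2(T - k)$.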
Theorem 3.8
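Given the linear specification (1), suppose that [A1] and [A3] hold. Then
the OLS estimators $\hat{\boldsymbol\beta}_T$ and $\hat\sigma_T^2$ are the best unbiased estimators (BUE)
for $\boldsymbol\beta_o$ and $\sigma_o^2$, respectively.
Proof: The Hessian matrix of the log-likelihood function is
$$\mathbf{H}(\boldsymbol\beta, \sigma^2) = \begin{bmatrix} -\dfrac{1}{\sigma^2}\mathbf{X}'\mathbf{X} & -\dfrac{1}{\sigma^4}\mathbf{X}'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) \\[2mm] -\dfrac{1}{\sigma^4}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'\mathbf{X} & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) \end{bmatrix}.$$
Under [A3], $\mathbb{E}[\mathbf{s}(\boldsymbol\beta_o, \sigma_o^2)] = \mathbf{0}$ and
$$\mathbb{E}[\mathbf{H}(\boldsymbol\beta_o, \sigma_o^2)] = \begin{bmatrix} -\dfrac{1}{\sigma_o^2}\mathbf{X}'\mathbf{X} & \mathbf{0} \\[2mm] \mathbf{0} & -\dfrac{T}{2\sigma_o^4} \end{bmatrix}.$$
Proof (cont’d):
By the information matrix equality, $-\mathbb{E}[\mathbf{H}(\boldsymbol\beta_o, \sigma_o^2)]$ is the information
matrix. Then, its inverse,
$$\big\{-\mathbb{E}[\mathbf{H}(\boldsymbol\beta_o, \sigma_o^2)]\big\}^{-1} = \begin{bmatrix} \sigma_o^2 (\mathbf{X}'\mathbf{X})^{-1} & \mathbf{0} \\[1mm] \mathbf{0} & \dfrac{2\sigma_o^4}{T} \end{bmatrix},$$
is the Cramér-Rao lower bound.
$\mathrm{var}(\hat{\boldsymbol\beta}_T)$ achieves this lower bound (the upper-left block) so that $\hat{\boldsymbol\beta}_T$
is the best unbiased estimator for $\boldsymbol\beta_o$. This conclusion is much
stronger than the Gauss-Markov Theorem.
Although $\mathrm{var}(\hat\sigma_T^2) = 2\sigma_o^4/(T - k)$ is greater than the lower bound
(lower-right element), it can be shown that $\hat\sigma_T^2$ is still the best
unbiased estimator for $\sigma_o^2$; see Rao (1973, p. 319) for a proof.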
Tests for Linear Hypotheses
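Linear hypothesis: $\mathbf{R}\boldsymbol\beta_o = \mathbf{r}$, where R is q × k with full row rank q
and q < k, and r is a vector of hypothetical values.
A natural way to construct a test statistic is to compare $\mathbf{R}\hat{\boldsymbol\beta}_T$ and r;
we reject the null if their difference is too “large.”
Given [A1] and [A3], Theorem 3.7(a) states:
$$\hat{\boldsymbol\beta}_T \sim N\big(\boldsymbol\beta_o, \sigma_o^2 (\mathbf{X}'\mathbf{X})^{-1}\big),$$
so that
$$\mathbf{R}\hat{\boldsymbol\beta}_T \sim N\big(\mathbf{R}\boldsymbol\beta_o, \sigma_o^2 [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']\big).$$
The comparison between $\mathbf{R}\hat{\boldsymbol\beta}_T$ and r is based on this distribution.
When q = 1, $\mathbf{R}\hat{\boldsymbol\beta}_T$ and $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ are scalars. Under the null hypothesis,
$$\frac{\mathbf{R}\hat{\boldsymbol\beta}_T - r}{\sigma_o [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{1/2}} = \frac{\mathbf{R}(\hat{\boldsymbol\beta}_T - \boldsymbol\beta_o)}{\sigma_o [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{1/2}} \sim N(0, 1).$$
As $\sigma_o^2$ is unknown, the statistic above is not operational. By replacing $\sigma_o$
with $\hat\sigma_T$, we obtain the so-called t statistic:
$$\tau = \frac{\mathbf{R}\hat{\boldsymbol\beta}_T - r}{\hat\sigma_T [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{1/2}}.$$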
Theorem 3.9
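Given the linear specification (1), suppose that [A1] and [A3] hold. When
R is 1 × k, $\tau \sim t(T - k)$ under the null hypothesis.
Note: The normality condition [A3] is crucial for this t distribution result.
Proof: We write the statistic τ as
$$\tau = \frac{\mathbf{R}\hat{\boldsymbol\beta}_T - r}{\sigma_o [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{1/2}} \Bigg/ \sqrt{\frac{(T - k)\hat\sigma_T^2/\sigma_o^2}{T - k}},$$
where the numerator is N(0, 1) and $(T - k)\hat\sigma_T^2/\sigma_o^2$ is $\chi^2(T - k)$ by
Theorem 3.7(b). The assertion follows when the numerator and
denominator are independent. This is indeed the case, because $\hat{\boldsymbol\beta}_T$ and $\hat{\mathbf{e}}$
are jointly normally distributed with
$$\begin{aligned}
\mathrm{cov}(\hat{\mathbf{e}}, \hat{\boldsymbol\beta}_T) &= \mathbb{E}[(\mathbf{I}_T - \mathbf{P})(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}] \\
&= (\mathbf{I}_T - \mathbf{P})\,\mathbb{E}[(\mathbf{y} - \mathbf{X}\boldsymbol\beta_o)\mathbf{y}']\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma_o^2 (\mathbf{I}_T - \mathbf{P})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \mathbf{0}.
\end{aligned}$$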
Examples
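To test $\beta_i = c$, let $\mathbf{R} = [0 \cdots 0\ 1\ 0 \cdots 0]$ (with the one in the i-th position) and let $m^{ij}$ be the (i, j)-th element
of $\mathbf{M}^{-1} = (\mathbf{X}'\mathbf{X})^{-1}$. Then, the resulting t statistic is:
$$\tau = \frac{\hat\beta_{i,T} - c}{\hat\sigma_T \sqrt{m^{ii}}} \sim t(T - k),$$
where $m^{ii} = \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$. For testing $\beta_i = 0$, τ is also referred to as the t
ratio.
It is straightforward to verify that to test $a\beta_i + b\beta_j = c$, with a, b, c given
constants, the corresponding t statistic reads:
$$\tau = \frac{a\hat\beta_{i,T} + b\hat\beta_{j,T} - c}{\hat\sigma_T \sqrt{a^2 m^{ii} + b^2 m^{jj} + 2ab\, m^{ij}}} \sim t(T - k).$$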
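Both t statistics above are special cases of the general formula with a 1 × k restriction vector R. The following NumPy sketch computes τ for a single restriction and checks it against the element-wise expressions; the function name `t_statistic`, the simulated data, and the constants a, b, c are hypothetical.

```python
import numpy as np

def t_statistic(X, y, R, r):
    """t statistic for one linear restriction R beta = r (R has length k)."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (T - k))   # uses T - k, not T
    R = np.asarray(R, dtype=float)
    tau = (R @ beta_hat - r) / (sigma_hat * np.sqrt(R @ XtX_inv @ R))
    return tau, beta_hat, sigma_hat, XtX_inv

# Simulated data (hypothetical).
rng = np.random.default_rng(5)
T = 60
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=T)

# t ratio for the second coefficient (index 1 in zero-based indexing).
tau_single, beta_hat, sigma_hat, M_inv = t_statistic(X, y, [0.0, 1.0, 0.0], 0.0)

# Test a*beta_i + b*beta_j = c with hypothetical constants.
a, b, c = 1.0, 2.0, 0.8
tau_combo, *_ = t_statistic(X, y, [0.0, a, b], c)

print("t ratio:", tau_single)
print("t statistic for the linear combination:", tau_combo)
```

Under [A1] and [A3] and the null hypothesis, each statistic would be compared with critical values of t(T − k), here t(57).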
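When R is a q × k matrix with full row rank q (q > 1), we have under the
null hypothesis: $[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1/2}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})/\sigma_o \sim N(\mathbf{0}, \mathbf{I}_q)$. Hence,
$$(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})/\sigma_o^2 \sim \chi^2(q).$$
The so-called F test statistic is:
$$\varphi = \frac{(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})}{\hat\sigma_T^2\, q}.$$
It is easy to verify that
$$\varphi = \frac{(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})/(\sigma_o^2 q)}{(T - k)\hat\sigma_T^2/[\sigma_o^2 (T - k)]},$$
where both the numerator and denominator are $\chi^2$ variables divided by
their corresponding degrees of freedom.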
Theorem 3.10
Given the linear specification (1), suppose that [A1] and [A3] hold. When
R is q × k with full row rank, $\varphi \sim F(q, T - k)$ under the null hypothesis.
Notes:
1 When q = 1, $\varphi \sim F(1, T - k)$, and this distribution is the same as
that of $\tau^2$.
2 The t distribution is symmetric about zero, so one may consider a
one- or two-sided t test. The F distribution, on the other hand, is
non-negative and asymmetric, so it is more typical to consider only a
one-sided F test.
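Example: $H_o$: $\beta_1 = b_1$ and $\beta_2 = b_2$. The F statistic,
$$\varphi = \frac{1}{2\hat\sigma_T^2} \begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix}' \begin{bmatrix} m^{11} & m^{12} \\ m^{21} & m^{22} \end{bmatrix}^{-1} \begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix},$$
is distributed as $F(2, T - k)$.
Example: $H_o$: $\beta_2 = 0$, $\beta_3 = 0$, ..., and $\beta_k = 0$,
$$\varphi = \frac{1}{(k - 1)\hat\sigma_T^2} \begin{pmatrix} \hat\beta_{2,T} \\ \hat\beta_{3,T} \\ \vdots \\ \hat\beta_{k,T} \end{pmatrix}' \begin{bmatrix} m^{22} & m^{23} & \cdots & m^{2k} \\ m^{32} & m^{33} & \cdots & m^{3k} \\ \vdots & \vdots & & \vdots \\ m^{k2} & m^{k3} & \cdots & m^{kk} \end{bmatrix}^{-1} \begin{pmatrix} \hat\beta_{2,T} \\ \hat\beta_{3,T} \\ \vdots \\ \hat\beta_{k,T} \end{pmatrix},$$
is distributed as $F(k - 1, T - k)$ and known as the regression F test.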
Test Power
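To examine the power of the F test, we evaluate the distribution of φ
under the alternative hypothesis: $\mathbf{R}\boldsymbol\beta_o = \mathbf{r} + \boldsymbol\delta$, where R is a q × k matrix
with rank q < k and $\boldsymbol\delta \ne \mathbf{0}$.
Theorem 3.11
Given the linear specification (1), suppose that [A1] and [A3] hold. When $\mathbf{R}\boldsymbol\beta_o = \mathbf{r} + \boldsymbol\delta$,
$$\varphi \sim F(q, T - k;\ \boldsymbol\delta'\mathbf{D}^{-1}\boldsymbol\delta, 0),$$
where $\mathbf{D} = \sigma_o^2 [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']$, and $\boldsymbol\delta'\mathbf{D}^{-1}\boldsymbol\delta$ is the non-centrality parameter
of the numerator of φ.
Proof: When $\mathbf{R}\boldsymbol\beta_o = \mathbf{r} + \boldsymbol\delta$,
$$[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1/2}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})/\sigma_o = \mathbf{D}^{-1/2}[\mathbf{R}(\hat{\boldsymbol\beta}_T - \boldsymbol\beta_o) + \boldsymbol\delta],$$
which is distributed as $N(\mathbf{0}, \mathbf{I}_q) + \mathbf{D}^{-1/2}\boldsymbol\delta$. Then,
$$(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})/\sigma_o^2 \sim \chi^2(q;\ \boldsymbol\delta'\mathbf{D}^{-1}\boldsymbol\delta),$$
a non-central $\chi^2$ distribution with the non-centrality parameter $\boldsymbol\delta'\mathbf{D}^{-1}\boldsymbol\delta$. It
is also readily seen that $(T - k)\hat\sigma_T^2/\sigma_o^2$ is still distributed as $\chi^2(T - k)$.
Similar to the argument before, these two terms are independent, so that
φ has a non-central F distribution.
Test power is determined by the non-centrality parameter $\boldsymbol\delta'\mathbf{D}^{-1}\boldsymbol\delta$,
where δ signifies the deviation from the null. When $\mathbf{R}\boldsymbol\beta_o$ deviates
farther from the hypothetical value r (i.e., δ is “large”), the
non-centrality parameter $\boldsymbol\delta'\mathbf{D}^{-1}\boldsymbol\delta$ increases, and so does the power.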
Example: The null distribution is F (2, 20), and its critical value at 5%
level is 3.49. Then for F (2, 20; ν1 , 0) with the non-centrality
parameter ν1 = 1, 3, 5, the probabilities that ϕ exceeds 3.49 are
approximately 12.1%, 28.2%, and 44.3%, respectively.
Example: The null distribution is F (5, 60), and its critical value at 5%
level is 2.37. Then for F (5, 60; ν1 , 0) with ν1 = 1, 3, 5, the
probabilities that ϕ exceeds 2.37 are approximately 9.4%, 20.5%, and
33.2%, respectively.
Alternative Interpretation
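Constrained OLS: Finding the saddle point of the Lagrangian:
$$\min_{\boldsymbol\beta, \boldsymbol\lambda}\ \frac{1}{T}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) + (\mathbf{R}\boldsymbol\beta - \mathbf{r})'\boldsymbol\lambda,$$
where λ is the q × 1 vector of Lagrangian multipliers, we have
$$\ddot{\boldsymbol\lambda}_T = 2[\mathbf{R}(\mathbf{X}'\mathbf{X}/T)^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r}),$$
$$\ddot{\boldsymbol\beta}_T = \hat{\boldsymbol\beta}_T - (\mathbf{X}'\mathbf{X}/T)^{-1}\mathbf{R}'\ddot{\boldsymbol\lambda}_T/2.$$
The constrained OLS residuals are
$$\ddot{\mathbf{e}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}_T + \mathbf{X}(\hat{\boldsymbol\beta}_T - \ddot{\boldsymbol\beta}_T) = \hat{\mathbf{e}} + \mathbf{X}(\hat{\boldsymbol\beta}_T - \ddot{\boldsymbol\beta}_T),$$
with $\hat{\boldsymbol\beta}_T - \ddot{\boldsymbol\beta}_T = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})$.
The sum of squared constrained OLS residuals is:
$$\begin{aligned}
\ddot{\mathbf{e}}'\ddot{\mathbf{e}} &= \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\hat{\boldsymbol\beta}_T - \ddot{\boldsymbol\beta}_T)'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol\beta}_T - \ddot{\boldsymbol\beta}_T) \\
&= \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_T - \mathbf{r}),
\end{aligned}$$
where the cross term vanishes because $\mathbf{X}'\hat{\mathbf{e}} = \mathbf{0}$, and the 2nd term on the RHS is the numerator of the F statistic.
Letting $\text{ESS}_c = \ddot{\mathbf{e}}'\ddot{\mathbf{e}}$ and $\text{ESS}_u = \hat{\mathbf{e}}'\hat{\mathbf{e}}$, we have
$$\varphi = \frac{\ddot{\mathbf{e}}'\ddot{\mathbf{e}} - \hat{\mathbf{e}}'\hat{\mathbf{e}}}{q\,\hat\sigma_T^2} = \frac{(\text{ESS}_c - \text{ESS}_u)/q}{\text{ESS}_u/(T - k)},$$
suggesting that the F test in effect compares the constrained and
unconstrained models based on their lack of fit.
The regression F test is thus
$$\varphi = \frac{(R_u^2 - R_c^2)/q}{(1 - R_u^2)/(T - k)},$$
which compares the fit of the full model with that of the model with only a constant term.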
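The equivalence between the quadratic-form version of the F statistic and the version based on constrained and unconstrained error sums of squares can be checked numerically. The sketch below uses NumPy; the function name `f_test_two_ways` and the simulated data are hypothetical. For the regression F test the constrained model has only a constant, so its centered R² is zero and q = k − 1.

```python
import numpy as np

def f_test_two_ways(X, y, R, r):
    """Return the F statistic in quadratic form and in ESS form, plus the
    constrained OLS estimator, for the hypothesis R beta = r."""
    T, k = X.shape
    R = np.atleast_2d(np.asarray(R, dtype=float))
    r = np.atleast_1d(np.asarray(r, dtype=float))
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    ess_u = np.sum((y - X @ beta_hat) ** 2)      # unconstrained ESS
    sigma2_hat = ess_u / (T - k)
    gap = R @ beta_hat - r
    A = R @ XtX_inv @ R.T
    phi_quad = gap @ np.linalg.solve(A, gap) / (q * sigma2_hat)
    # Constrained OLS estimator from the Lagrangian solution.
    beta_con = beta_hat - XtX_inv @ R.T @ np.linalg.solve(A, gap)
    ess_c = np.sum((y - X @ beta_con) ** 2)      # constrained ESS
    phi_ess = ((ess_c - ess_u) / q) / (ess_u / (T - k))
    return phi_quad, phi_ess, beta_con

# Simulated data (hypothetical).
rng = np.random.default_rng(11)
T, k = 80, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)

# Regression F test: both slope coefficients are zero (q = k - 1 = 2).
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
phi_quad, phi_ess, beta_con = f_test_two_ways(X, y, R, r)

# Same statistic from the centered R^2 of the unconstrained model.
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
r2_u = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
phi_r2 = (r2_u / (k - 1)) / ((1.0 - r2_u) / (T - k))

print("quadratic form:", phi_quad)
print("ESS form:      ", phi_ess)
print("R^2 form:      ", phi_r2)
```

All three numbers coincide up to rounding, and the constrained estimator reduces to the sample mean of y in the intercept position with zero slopes.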
Confidence Regions
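A confidence interval for $\beta_{i,o}$ is the interval $(\underline{g}_\alpha, \overline{g}_\alpha)$ such that
$$\mathbb{P}\{\underline{g}_\alpha \le \beta_{i,o} \le \overline{g}_\alpha\} = 1 - \alpha,$$
where $(1 - \alpha)$ is known as the confidence coefficient.
Letting $c_{\alpha/2}$ be the critical value of $t(T - k)$ with tail prob. α/2,
$$\begin{aligned}
&\mathbb{P}\left\{ \big|\hat\beta_{i,T} - \beta_{i,o}\big| \big/ \big(\hat\sigma_T \sqrt{m^{ii}}\big) \le c_{\alpha/2} \right\} \\
&\quad = \mathbb{P}\left\{ \hat\beta_{i,T} - c_{\alpha/2}\,\hat\sigma_T \sqrt{m^{ii}} \le \beta_{i,o} \le \hat\beta_{i,T} + c_{\alpha/2}\,\hat\sigma_T \sqrt{m^{ii}} \right\} \\
&\quad = 1 - \alpha.
\end{aligned}$$
The confidence region for a vector of parameters can be constructed
by resorting to the F statistic.
For $(\beta_{1,o} = b_1, \beta_{2,o} = b_2)'$, suppose $T - k = 30$ and α = 0.05. Then,
$F_{0.05}(2, 30) = 3.32$, and
$$\mathbb{P}\left\{ \frac{1}{2\hat\sigma_T^2} \begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix}' \begin{bmatrix} m^{11} & m^{12} \\ m^{21} & m^{22} \end{bmatrix}^{-1} \begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix} \le 3.32 \right\}$$
is 1 − α, which results in an ellipse with the center $(\hat\beta_{1,T}, \hat\beta_{2,T})$.
Note: It is possible that $(\beta_1, \beta_2)$ is outside the confidence box formed
by individual confidence intervals but inside the joint confidence
ellipse. That is, while a t ratio may indicate statistical significance of a
coefficient, the F test may suggest the opposite based on the
confidence region.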
Part I: Estimation results with t
| const | ut | ut−1 | t | R̄² | Reg F |
|---|---|---|---|---|---|
| 4.40 | 2.48 | | | 0.70 | 74.51∗∗ |
| (4.57∗∗) | (8.63∗∗) | | | | |
| 4.41 | 2.41 | | 0.01 | 0.69 | 36.11∗∗ |
| (4.50∗∗) | (5.04∗∗) | | (0.18) | | |
| 4.84 | | 2.40 | | 0.68 | 68.92∗∗ |
| (5.09∗∗) | | (8.30∗∗) | | | |
| 4.84 | | 2.35 | 0.01 | 0.67 | 33.37∗∗ |
| (5.01∗∗) | | (4.74∗∗) | (0.13) | | |
Note: The numbers in parentheses are t-ratios; ∗∗ and ∗ stand for
significance of a two-sided test at 1% and 5% levels, respectively.
Part II: Estimation results with t and g
| const | ut | ut−1 | gt | gt−1 | t | R̄² | Reg F |
|---|---|---|---|---|---|---|---|
| 4.20 | 2.50 | | 0.02 | | | 0.69 | 36.09∗∗ |
| (2.47∗) | (7.25∗∗) | | (0.14) | | | | |
| 4.16 | 2.43 | | 0.03 | | 0.01 | 0.68 | 23.31∗∗ |
| (2.39∗) | (4.85∗∗) | | (0.18) | | (0.21) | | |
| 5.56 | | 2.31 | | −0.08 | | 0.67 | 33.77∗∗ |
| (3.25∗∗) | | (6.70∗∗) | | (−0.51) | | | |
| 5.55 | | 2.29 | | −0.08 | 0.004 | 0.66 | 21.77∗∗ |
| (3.18∗∗) | | (4.40∗∗) | | (−0.49) | (0.06) | | |
F tests for the joint significance of the coefficients of g and t: 0.03 (Model
2) and 0.13 (Model 4), which are insignificant even at the 10% level.
Part III: Estimation results with t and t²
| const | ut | ut−1 | gt | gt−1 | t | t² | R̄² | Reg F |
|---|---|---|---|---|---|---|---|---|
| 11.60 | | | | | −0.45 | 0.02 | 0.63 | 28.61∗∗ |
| (8.82∗∗) | | | | | (−2.50∗) | (4.16∗∗) | | |
| 8.05 | 1.93 | | | | −0.46 | 0.02 | 0.78 | 39.10∗∗ |
| (6.32∗∗) | (4.62∗∗) | | | | (−3.34∗∗) | (3.73∗∗) | | |
| 7.68 | 1.96 | | 0.04 | | −0.46 | 0.02 | 0.77 | 28.44∗∗ |
| (4.42∗∗) | (4.50∗∗) | | (0.32) | | (−3.27∗∗) | (3.68∗∗) | | |
| 8.37 | | 1.80 | | | −0.43 | 0.01 | 0.75 | 32.97∗∗ |
| (6.11∗∗) | | (3.87∗∗) | | | (−2.92∗∗) | (3.27∗∗) | | |
| 9.03 | | 1.74 | | −0.07 | −0.43 | 0.01 | 0.74 | 24.18∗∗ |
| (4.84∗∗) | | (3.60∗∗) | | (−0.53) | (−2.90∗∗) | (3.22∗∗) | | |
F tests for the joint significance of the coefficients of g and t: 5.45∗ (Model
3) and 4.29∗ (Model 5), which are significant at the 5% level.
Selected estimation results (with more precise estimates):
ŝt = 8.05 + 1.93 ut − 0.460 t + 0.0154 t², R̄² = 0.78;
ŝt = 8.37 + 1.80 ut−1 − 0.430 t + 0.0147 t², R̄² = 0.75.
For the second regression, the marginal effect of ut−1 on st is 1.80,
approx. 410 persons. That is, the suicide rate would increase by 1.8
when last year’s unemployment rate increases by one percentage point.
The time effect of the second regression is −0.43 + 0.0294 t, which
changes with t. In 2013, the time effect is 0.54, approx. 120 persons.
Since 1994 (about 14.6 years after 1980), there has been a natural
increase of the suicide rate in Taiwan; lowering the unemployment rate
would help cancel out the time effect to some extent.
The predicted and actual suicide rates in 2013 are, respectively, 17.86
and 15.3; the difference between them is 2.56, approx. 590 persons.
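The time effect quoted above follows from differentiating the fitted quadratic trend of the second regression. The short calculation below assumes that the trend variable is coded as t = 1 in 1981, so that t = 33 in 2013; this coding is an inference from the phrase "14.6 years after 1980" and is not stated explicitly on the slide.

```latex
\frac{\partial \hat{s}_t}{\partial t}
  = -0.430 + 2 \times 0.0147\, t
  = -0.430 + 0.0294\, t,
\qquad
\left.\frac{\partial \hat{s}_t}{\partial t}\right|_{t=33}
  = -0.430 + 0.0294 \times 33 \approx 0.54,
\qquad
\frac{\partial \hat{s}_t}{\partial t} = 0
  \iff t = \frac{0.430}{0.0294} \approx 14.6 .
```

The last expression gives the turning point of the trend: under the assumed coding, the time effect becomes positive roughly 14.6 years after 1980.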
Near Multicollinearity
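It is more common to have near multicollinearity: $\mathbf{X}\mathbf{a} \approx \mathbf{0}$ for some $\mathbf{a} \ne \mathbf{0}$.
Writing $\mathbf{X} = [\mathbf{x}_i\ \mathbf{X}_i]$, we have from the FWL Theorem that
$$\mathrm{var}(\hat\beta_{i,T}) = \sigma_o^2 [\mathbf{x}_i'(\mathbf{I} - \mathbf{P}_i)\mathbf{x}_i]^{-1} = \frac{\sigma_o^2}{\sum_{t=1}^{T} (x_{ti} - \bar x_i)^2\, (1 - R^2(i))},$$
where $\mathbf{P}_i = \mathbf{X}_i(\mathbf{X}_i'\mathbf{X}_i)^{-1}\mathbf{X}_i'$, and $R^2(i)$ is the centered $R^2$ from
regressing $\mathbf{x}_i$ on $\mathbf{X}_i$.
Consequence of near multicollinearity:
$R^2(i)$ is high, so that $\mathrm{var}(\hat\beta_{i,T})$ tends to be large and $\hat\beta_{i,T}$ is
sensitive to data changes.
Large $\mathrm{var}(\hat\beta_{i,T})$ leads to small (insignificant) t ratios. Yet, the regression F
test may suggest that the model (as a whole) is useful.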
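The variance formula above can be checked against the corresponding diagonal element of (X'X)⁻¹. The NumPy sketch below builds two nearly collinear regressors; the function name `var_factor`, the noise scale 0.05, and the simulated data are hypothetical, and the check assumes that the other regressors include a constant.

```python
import numpy as np

def var_factor(X, i):
    """Return 1 / [sum_t (x_ti - xbar_i)^2 * (1 - R^2(i))] and R^2(i), where
    R^2(i) is the centered R^2 from regressing column i on the other columns."""
    xi = X[:, i]
    Xi = np.delete(X, i, axis=1)
    fit = Xi @ np.linalg.lstsq(Xi, xi, rcond=None)[0]
    ess = np.sum((xi - fit) ** 2)
    tss = np.sum((xi - xi.mean()) ** 2)
    r2_i = 1.0 - ess / tss
    return 1.0 / (tss * (1.0 - r2_i)), r2_i

# Simulated regressors (hypothetical): x3 is nearly a copy of x2.
rng = np.random.default_rng(3)
T = 200
x2 = rng.normal(size=T)
x3 = x2 + 0.05 * rng.normal(size=T)
X = np.column_stack([np.ones(T), x2, x3])

XtX_inv = np.linalg.inv(X.T @ X)
factor, r2_i = var_factor(X, 2)

print("R^2(i) for x3:", r2_i)
print("formula:", factor, " diagonal of (X'X)^{-1}:", XtX_inv[2, 2])

# For comparison: the same factor if x3 were the only non-constant regressor.
print("without x2:", 1.0 / np.sum((x3 - x3.mean()) ** 2))
```

Multiplying either factor by σo² gives the variance of the coefficient estimate; the ratio of the two factors, 1/(1 − R²(i)), is often called the variance inflation factor.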
How do we circumvent the problems from near multicollinearity?
Try to break the approximate linear relation.
Adding more data if possible.
Dropping some regressors.
Statistical approaches:
Ridge regression: For some $\lambda \ne 0$,
$$\hat{\mathbf{b}}_{\text{ridge}} = (\mathbf{X}'\mathbf{X} + \lambda \mathbf{I}_k)^{-1}\mathbf{X}'\mathbf{y}.$$
Principal component regression:
Note: Multicollinearity vs. “micronumerosity” (Goldberger)
Digression: Regression with Dummy Variables
Example: Let yt denote the wage of the t th individual and xt the working
experience (in years). Consider the following specification:
yt = α0 + α1 Dt + β0 xt + et ,
where Dt is a dummy variable such that Dt = 1 if t is a male and Dt = 0

otherwise. This specification puts together two regressions:
Regression for female: Dt = 0, and intercept is α0 .
Regression for male: Dt = 1, and intercept os α0 + α1 .
These two regressions coincide if α1 = 0. Testing no wage discrimination
against female amounts to testing the hypothesis of α1 = 0.
We may also consider the specification with a dummy variable and its
interaction with a regressor:
yt = α0 + α1 Dt + β0 xt + β1 (xt Dt ) + et .
Then, the slopes of the regressions for female and male are, respectively,
β0 and β0 + β1 . These two regressions coincide if α1 = 0 and β1 = 0. In
this case, testing no wage discrimination against female amounts to
testing the joint hypothesis of α1 = 0 and β1 = 0.
Example: Consider two dummy variables:
$D_{1,t} = 1$ if high school is $t$'s highest degree and $D_{1,t} = 0$ otherwise;
$D_{2,t} = 1$ if college or graduate school is $t$'s highest degree and $D_{2,t} = 0$ otherwise.
The specification below in effect puts together 3 regressions:
\[ y_t = \alpha_0 + \alpha_1 D_{1,t} + \alpha_2 D_{2,t} + \beta x_t + e_t, \]
where the below-high-school regression has intercept $\alpha_0$, the high-school
regression has intercept $\alpha_0 + \alpha_1$, and the college regression has intercept $\alpha_0 + \alpha_2$.
Similar to the previous example, we may also consider a more general
specification in which x interacts with D1 and D2 .
Dummy variable trap: To avoid exact multicollinearity, the number of
dummy variables in a model (with the constant term) should be one less
than the number of groups.
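The dummy variable trap can be seen directly from the rank of the data matrix. This is a small sketch with three hypothetical education groups and simulated data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 12
group = rng.integers(0, 3, size=T)   # 0: below high school, 1: high school, 2: college
group[:3] = [0, 1, 2]                # make sure every group appears at least once
x = rng.normal(size=T)               # a continuous regressor (e.g., experience)

const = np.ones(T)
D0, D1, D2 = [(group == g).astype(float) for g in range(3)]

# One dummy per group plus the constant: D0 + D1 + D2 equals the constant column
X_trap = np.column_stack([const, D0, D1, D2, x])
# One dummy fewer than the number of groups: full column rank
X_ok = np.column_stack([const, D1, D2, x])

rank_trap = np.linalg.matrix_rank(X_trap)   # 4, although X_trap has 5 columns
rank_ok = np.linalg.matrix_rank(X_ok)       # 4, equal to the number of columns
```

With all three dummies and the constant, the identification requirement fails and $X'X$ cannot be inverted; dropping one dummy (here the below-high-school group, which becomes the baseline) restores full column rank.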
Let $D_t = 1$ for $t = T^* + 1, \ldots, T$ and $D_t = 0$ otherwise, where $T^*$ is the
year of structural change. Consider the specification:
\[ s_t = \alpha_0 + \delta D_t + \beta_0 u_{t-1} + \gamma u_{t-1} D_t + e_t. \]
The “before-change” regression has intercept $\alpha_0$ and slope $\beta_0$, and the
“after-change” regression has intercept $\alpha_0 + \delta$ and slope $\beta_0 + \gamma$.
Testing a structural change at $T^*$ amounts to testing $\delta = 0$ and $\gamma = 0$
(Chow test).
Alternatively, we can estimate the specification:
\[ s_t = \alpha_0 (1 - D_t) + \alpha_1 D_t + \beta_0 u_{t-1} (1 - D_t) + \beta_1 u_{t-1} D_t + e_t, \]
and test a structural change at $T^*$ by checking if $\alpha_0 = \alpha_1$ and $\beta_0 = \beta_1$.
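A Chow test of this kind can be sketched as an F test comparing the constrained regression (no dummy terms) with the unconstrained one. The example below uses simulated data with a known break; the function names and the data-generating values are assumptions made for illustration, not the suicide-rate data of the lecture.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return float(e @ e)

def chow_f(y, x, D):
    """F statistic for H0: delta = 0 and gamma = 0 in
    y = a0 + delta*D + b0*x + gamma*x*D + e."""
    T = len(y)
    X_r = np.column_stack([np.ones(T), x])             # constrained: no change
    X_u = np.column_stack([np.ones(T), D, x, x * D])   # unconstrained
    ssr_r, ssr_u = ssr(X_r, y), ssr(X_u, y)
    q, k = 2, X_u.shape[1]                             # 2 restrictions, k regressors
    return ((ssr_r - ssr_u) / q) / (ssr_u / (T - k))

rng = np.random.default_rng(2)
T, T_star = 60, 30
x = rng.normal(size=T)
D = (np.arange(1, T + 1) > T_star).astype(float)       # D_t = 1 for t = T*+1, ..., T
y_break = 1.0 + 2.0 * D + 0.5 * x + 1.5 * x * D + 0.5 * rng.normal(size=T)
f_break = chow_f(y_break, x, D)

# The alternative parameterization spans the same column space, so its SSR is identical
X_alt = np.column_stack([1 - D, D, x * (1 - D), x * D])
ssr_alt = ssr(X_alt, y_break)
ssr_unc = ssr(np.column_stack([np.ones(T), D, x, x * D]), y_break)
```

Under the classical conditions the statistic is compared with critical values of $F(2, T - 4)$. The two parameterizations give the same fit; they differ only in how the hypothesis is written.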
Part I: Estimation results with a known change: Without t

T*     const      D_t       u_{t-1}   u_{t-1} D_t   R̄² / Reg F   Chow
1992   6.94       −2.82     1.42      1.17          0.67          0.50
       (2.85**)   (−1.00)   (1.20)    (0.94)        22.56**
1993   6.07       −1.39     1.75      0.71          0.66          0.17
       (2.58*)    (−0.49)   (1.51)    (0.57)        21.86**
1994   5.58       −0.35     1.94      0.39          0.66          0.14
       (2.48*)    (−0.12)   (1.72)    (0.32)        21.80**
1995   5.36       0.51      2.02      0.16          0.66          0.26
       (2.46*)    (0.17)    (1.83)    (0.13)        22.05**

Chow test is the F test of the coefficients of D_t and u_{t-1} D_t being
zero.
Part II: Estimation results with a known change: With t

T*     const      D_t        u_{t-1}   t          t D_t      R̄² / Reg F
1992   11.11      −10.41     1.08      −0.54      0.92       0.83
       (8.30**)   (−5.09**)  (2.52*)   (−4.00**)  (5.45**)   40.21**
1993   11.11      −9.97      1.05      −0.52      0.90       0.83
       (8.47**)   (−4.79**)  (2.47*)   (−4.42**)  (5.63**)   41.09**
1994   11.05      −9.46      0.97      −0.48      0.85       0.83
       (8.42**)   (−4.41**)  (2.25*)   (−4.48**)  (5.53**)   40.60**
1995   10.92      −8.78      0.89      −0.43      0.79       0.83
       (8.19**)   (−3.92**)  (1.98)    (−4.32**)  (5.26**)   38.98**

F test of the coefficients of D_t and t D_t being zero: 15.28** ('92); 15.83**
('93); 15.52** ('94); 14.51** ('95)
We do not know T ∗ , the year of change, and hence tried estimating with
different T ∗ :
T ∗ = 1992 : ŝt = 11.11 − 10.41 Dt + 1.08 ut−1 − 0.54 t + 0.92 tDt ;

T ∗ = 1993 : ŝt = 11.11 − 9.97 Dt + 1.05 ut−1 − 0.52 t + 0.90 tDt ;
T ∗ = 1994 : ŝt = 11.05 − 9.46 Dt + 0.97 ut−1 − 0.48 t + 0.85 tDt ;
T ∗ = 1995 : ŝt = 10.92 − 8.78 Dt + 0.89 ut−1 − 0.43 t + 0.79 tDt .
Consider the regression with T ∗ = 1994. The “before-change” and

“after-change” slopes are −0.48 (a decrease over time) and 0.37 (an
increase over time), respectively. The marginal effect of ut−1 on st is
significant at 5% level.
The regression line with T ∗ = 1994 predicts the suicide rate in 2013
as 17.8; the difference between the predicted and actual suicide rates
is 2.5, approximately 570 persons.
Limitation of the Classical Conditions
[A1] X is non-stochastic: Economic variables can not be regarded as

non-stochastic; also, lagged dependent variables may be used as
regressors.
[A2](i) IE(y) = Xβ o : IE(y) may be a linear function with more
regressors or a nonlinear function of regressors.
[A2](ii) var(y) = σo2 IT : The elements of y may be correlated (serial
correlation, spatial correlation) and/or may have unequal variances.
[A3] Normality: y may have a non-normal distribution.
The OLS estimator loses the properties derived before when some of
the classical conditions fail to hold.
When $\mathrm{var}(y) \neq \sigma_o^2 I_T$
Given the linear specification $y = X\beta + e$, suppose, in addition to [A1] and
[A2](i), $\mathrm{var}(y) = \Sigma_o \neq \sigma_o^2 I_T$, where $\Sigma_o$ is p.d. That is, the elements of $y$
may be correlated and have unequal variances.
The OLS estimator $\hat\beta_T$ remains unbiased with
\[ \mathrm{var}(\hat\beta_T) = \mathrm{var}((X'X)^{-1}X'y) = (X'X)^{-1}X'\Sigma_o X(X'X)^{-1}. \]
$\hat\beta_T$ is not the BLUE for $\beta_o$, and it is not the BUE for $\beta_o$ under
normality.
The estimator $\widehat{\mathrm{var}}(\hat\beta_T) = \hat\sigma_T^2 (X'X)^{-1}$ is a biased estimator for
$\mathrm{var}(\hat\beta_T)$. Consequently, the t and F tests do not have t and F
distributions, even when $y$ is normally distributed.
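The bias of the conventional variance estimator can be illustrated without simulation, because $\mathrm{IE}(\hat{e}'\hat{e}) = \mathrm{trace}((I - P)\Sigma_o)$ when $\mathrm{IE}(y) = X\beta_o$. The sketch below assumes a hypothetical diagonal $\Sigma_o$ with $\mathrm{var}(y_t) = x_t^2$; that form is chosen only for illustration.

```python
import numpy as np

T = 100
x = np.linspace(1.0, 5.0, T)
X = np.column_stack([np.ones(T), x])
Sigma = np.diag(x ** 2)                   # assumed form: var(y_t) = x_t^2

XtX_inv = np.linalg.inv(X.T @ X)
# True covariance matrix of the OLS estimator: (X'X)^{-1} X' Sigma X (X'X)^{-1}
var_true = XtX_inv @ X.T @ Sigma @ X @ XtX_inv

# Expected value of sigma_hat^2 = e'e / (T - k), using IE(e'e) = trace((I - P) Sigma)
P = X @ XtX_inv @ X.T
k = X.shape[1]
exp_sigma2_hat = np.trace((np.eye(T) - P) @ Sigma) / (T - k)
# Expected value of the conventional estimator sigma_hat^2 (X'X)^{-1}
var_naive_expected = exp_sigma2_hat * XtX_inv
```

In this design the conventional estimator understates the variance of the slope estimate on average, so t ratios computed from it would tend to be too large.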
The GLS Estimator
Consider the specification: $Gy = GX\beta + Ge$, where $G$ is nonsingular and
non-stochastic.
$\mathrm{IE}(Gy) = GX\beta_o$ and $\mathrm{var}(Gy) = G\Sigma_o G'$.
$GX$ has full column rank so that the OLS estimator can be computed:
\[ b(G) = (X'G'GX)^{-1}X'G'Gy, \]
which is still linear and unbiased. It would be the BLUE provided that
$G$ is chosen such that $G\Sigma_o G' = \sigma_o^2 I_T$.
Setting $G = \Sigma_o^{-1/2}$, where $\Sigma_o^{-1/2} = C\Lambda^{-1/2}C'$ and $C$ orthogonally
diagonalizes $\Sigma_o$: $C'\Sigma_o C = \Lambda$, we have $\Sigma_o^{-1/2}\Sigma_o\Sigma_o^{-1/2\prime} = I_T$.
With $y^* = \Sigma_o^{-1/2}y$ and $X^* = \Sigma_o^{-1/2}X$, we have the GLS estimator:
\[ \hat\beta_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'\Sigma_o^{-1}X)^{-1}(X'\Sigma_o^{-1}y). \tag{5} \]
The $\hat\beta_{GLS}$ is a minimizer of the weighted sum of squared errors:
\[ Q(\beta; \Sigma_o) = \frac{1}{T}(y^* - X^*\beta)'(y^* - X^*\beta) = \frac{1}{T}(y - X\beta)'\Sigma_o^{-1}(y - X\beta). \]
The vector of GLS fitted values, $\hat{y}_{GLS} = X(X'\Sigma_o^{-1}X)^{-1}(X'\Sigma_o^{-1}y)$, is
an oblique projection of $y$ onto $\mathrm{span}(X)$, because
$X(X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}$ is idempotent but asymmetric. The GLS
residual vector is $\hat{e}_{GLS} = y - \hat{y}_{GLS}$.
The sum of squared OLS residuals is less than the sum of squared
GLS residuals. (Why?)
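The construction of $\Sigma_o^{-1/2}$ and the equivalence of the two expressions in (5) can be sketched numerically as follows. The covariance matrix here is an arbitrary positive definite matrix generated for illustration, and the function names are hypothetical.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Sigma^{-1/2} = C Lambda^{-1/2} C' from the eigendecomposition of a p.d. Sigma."""
    lam, C = np.linalg.eigh(Sigma)
    return C @ np.diag(lam ** -0.5) @ C.T

def gls(X, y, Sigma):
    """GLS computed as OLS on the transformed data y* and X*."""
    G = inv_sqrt(Sigma)
    X_star, y_star = G @ X, G @ y
    return np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

rng = np.random.default_rng(4)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=T)])
A = rng.normal(size=(T, T))
Sigma = A @ A.T / T + np.eye(T)           # an arbitrary p.d. covariance matrix
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(Sigma) @ rng.normal(size=T)

b_gls = gls(X, y, Sigma)
Si = np.linalg.inv(Sigma)
b_direct = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)   # second expression in (5)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

G = inv_sqrt(Sigma)
whitened_cov = G @ Sigma @ G.T            # should be the identity matrix
ssr_ols = float(np.sum((y - X @ b_ols) ** 2))
ssr_gls = float(np.sum((y - X @ b_gls) ** 2))
```

Both routes give the same estimate, and comparing `ssr_ols` with `ssr_gls` gives a numerical check of the claim about the unweighted residual sums of squares.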
Stochastic Properties of the GLS Estimator
Theorem 4.1 (Aitken)
Given linear specification (1), suppose that [A1] and [A2](i) hold and that
$\mathrm{var}(y) = \Sigma_o$ is positive definite. Then, $\hat\beta_{GLS}$ is the BLUE for $\beta_o$.
Given [A3'] $y \sim N(X\beta_o, \Sigma_o)$,
\[ \hat\beta_{GLS} \sim N\big(\beta_o, (X'\Sigma_o^{-1}X)^{-1}\big). \]
Under [A3'], the log likelihood function is
\[ \log L(\beta; \Sigma_o) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log(\det(\Sigma_o)) - \frac{1}{2}(y - X\beta)'\Sigma_o^{-1}(y - X\beta), \]
with the FOC: $X'\Sigma_o^{-1}(y - X\beta) = 0$. Thus, the GLS estimator is also
the MLE under normality.
Under normality, the information matrix is
\[ \mathrm{IE}\big[X'\Sigma_o^{-1}(y - X\beta)(y - X\beta)'\Sigma_o^{-1}X\big]\Big|_{\beta = \beta_o} = X'\Sigma_o^{-1}X. \]
Thus, the GLS estimator is the BUE for $\beta_o$, because its covariance
matrix reaches the Cramér-Rao lower bound.
Under the null hypothesis $R\beta_o = r$, we have
\[ (R\hat\beta_{GLS} - r)'[R(X'\Sigma_o^{-1}X)^{-1}R']^{-1}(R\hat\beta_{GLS} - r) \sim \chi^2(q). \]
A major difficulty: How should the GLS estimator be computed when
$\Sigma_o$ is unknown?
The Feasible GLS (FGLS) estimator is
\[ \hat\beta_{FGLS} = (X'\hat\Sigma_T^{-1}X)^{-1}X'\hat\Sigma_T^{-1}y, \]
where $\hat\Sigma_T$ is an estimator of $\Sigma_o$.
Further difficulties in FGLS estimation:
The number of parameters in $\Sigma_o$ is $T(T + 1)/2$. Estimating $\Sigma_o$
without some prior restrictions on $\Sigma_o$ is practically infeasible.
Even when an estimator $\hat\Sigma_T$ is available under certain assumptions,
$\hat\beta_{FGLS}$ is a complex function of the data $y$ and $X$. As such, the
finite-sample properties of the FGLS estimator are typically difficult to
derive.

Some Remarks on the Feasible GLS Estimation
In the classical literature, the feasible GLS estimation is based on

some assumptions on Σo . These assumptions are needed to reduce
the number of parameters and hence render the estimation of Σo
feasible.
The subsections below deal with the estimation under some special
assumptions, which are imposed on either the diagonal or off-diagonal
terms of Σo but not both.
Although the assumptions on Σo are weaker than the condition of
Σo = σo2 IT , they are very ad hoc and can not be easily generalized.
Tests for Heteroskedasticity
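A simple form of $\Sigma_o$ is
\[
\Sigma_o = \begin{bmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{bmatrix},
\]
with $T = T_1 + T_2$; this is known as groupwise heteroskedasticity.
The null hypothesis of homoskedasticity: $\sigma_1^2 = \sigma_2^2 = \sigma_o^2$.
Perform separate OLS regressions using the data in each group and
obtain the variance estimates $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$.
Under [A1] and [A3'], the F test is:
\[
\varphi := \frac{\hat\sigma_{T_1}^2}{\hat\sigma_{T_2}^2}
= \frac{(T_1 - k)\hat\sigma_{T_1}^2}{\sigma_o^2 (T_1 - k)} \bigg/ \frac{(T_2 - k)\hat\sigma_{T_2}^2}{\sigma_o^2 (T_2 - k)}
\sim F(T_1 - k, T_2 - k).
\]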
More generally, for some constants $c_0, c_1 > 0$, $\sigma_t^2 = c_0 + c_1 x_{tj}^2$.
The Goldfeld-Quandt test:
(1) Rearrange obs. according to the values of $x_j$ in a descending order.
(2) Divide the rearranged data set into three groups with $T_1$, $T_m$, and $T_2$
observations, respectively.
(3) Drop the $T_m$ observations in the middle group and perform separate
OLS regressions using the data in the first and third groups.
(4) The statistic is the ratio of the variance estimates:
\[ \hat\sigma_{T_1}^2 / \hat\sigma_{T_2}^2 \sim F(T_1 - k, T_2 - k). \]
Some questions:
Can we estimate the model with all observations and then compute $\hat\sigma_{T_1}^2$
and $\hat\sigma_{T_2}^2$ based on $T_1$ and $T_2$ residuals?
If $\Sigma_o$ is not diagonal, does the F test above still work?
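Steps (1) to (4) of the Goldfeld-Quandt test can be sketched as follows. The data are simulated with an error standard deviation proportional to the regressor, and the group sizes $T_1 = T_2 = 45$ are an arbitrary choice; the function names are illustrative.

```python
import numpy as np

def sigma2_hat(X, y):
    """OLS variance estimate e'e / (T - k)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return float(e @ e) / (X.shape[0] - X.shape[1])

def goldfeld_quandt(X, y, j, T1, T2):
    """Ratio of variance estimates after sorting by column j in descending order."""
    order = np.argsort(-X[:, j])             # step (1): descending in x_j
    Xs, ys = X[order], y[order]
    s1 = sigma2_hat(Xs[:T1], ys[:T1])        # first group (largest x_j)
    s2 = sigma2_hat(Xs[-T2:], ys[-T2:])      # third group; the middle group is dropped
    k = X.shape[1]
    return s1 / s2, (T1 - k, T2 - k)         # statistic and its F degrees of freedom

rng = np.random.default_rng(5)
T = 120
x = rng.uniform(1.0, 5.0, size=T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 0.5 * x + x * rng.normal(size=T)   # error sd proportional to x
gq_stat, dof = goldfeld_quandt(X, y, j=1, T1=45, T2=45)
```

The statistic is then compared with a critical value of the $F$ distribution with the returned degrees of freedom; a value well above one points to a larger error variance in the group with large $x_j$.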
GLS and FGLS Estimation
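Under groupwise heteroskedasticity,
\[
\Sigma_o^{-1/2} = \begin{bmatrix} \sigma_1^{-1} I_{T_1} & 0 \\ 0 & \sigma_2^{-1} I_{T_2} \end{bmatrix},
\]
so that the transformed specification is
\[
\begin{bmatrix} y_1/\sigma_1 \\ y_2/\sigma_2 \end{bmatrix}
= \begin{bmatrix} X_1/\sigma_1 \\ X_2/\sigma_2 \end{bmatrix} \beta
+ \begin{bmatrix} e_1/\sigma_1 \\ e_2/\sigma_2 \end{bmatrix}.
\]
Clearly, $\mathrm{var}(\Sigma_o^{-1/2}y) = I_T$. The GLS estimator is:
\[
\hat\beta_{GLS} = \left[ \frac{X_1'X_1}{\sigma_1^2} + \frac{X_2'X_2}{\sigma_2^2} \right]^{-1}
\left[ \frac{X_1'y_1}{\sigma_1^2} + \frac{X_2'y_2}{\sigma_2^2} \right].
\]
With $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$ from separate regressions, an estimator of $\Sigma_o$ is
\[
\hat\Sigma = \begin{bmatrix} \hat\sigma_{T_1}^2 I_{T_1} & 0 \\ 0 & \hat\sigma_{T_2}^2 I_{T_2} \end{bmatrix}.
\]
The FGLS estimator is:
\[
\hat\beta_{FGLS} = \left[ \frac{X_1'X_1}{\hat\sigma_{T_1}^2} + \frac{X_2'X_2}{\hat\sigma_{T_2}^2} \right]^{-1}
\left[ \frac{X_1'y_1}{\hat\sigma_{T_1}^2} + \frac{X_2'y_2}{\hat\sigma_{T_2}^2} \right].
\]
Note: If $\sigma_t^2 = c\, x_{tj}^2$, a transformed specification is
\[
\frac{y_t}{x_{tj}} = \beta_j + \beta_1 \frac{1}{x_{tj}} + \cdots + \beta_{j-1} \frac{x_{t,j-1}}{x_{tj}}
+ \beta_{j+1} \frac{x_{t,j+1}}{x_{tj}} + \cdots + \beta_k \frac{x_{tk}}{x_{tj}} + \frac{e_t}{x_{tj}},
\]
where $\mathrm{var}(y_t/x_{tj}) = c := \sigma_o^2$. Here, the GLS estimator is readily computed
as the OLS estimator for the transformed specification.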
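A minimal sketch of FGLS under groupwise heteroskedasticity, assuming two simulated groups with error standard deviations 0.5 and 3.0 (values chosen for illustration). It computes the estimator both from the closed-form expression and as OLS on the rescaled data.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and the variance estimate e'e / (T - k)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return b, float(e @ e) / (X.shape[0] - X.shape[1])

def fgls_groupwise(X1, y1, X2, y2):
    """FGLS with the two group variances estimated from separate OLS regressions."""
    _, s1 = ols(X1, y1)
    _, s2 = ols(X2, y2)
    A = X1.T @ X1 / s1 + X2.T @ X2 / s2
    c = X1.T @ y1 / s1 + X2.T @ y2 / s2
    return np.linalg.solve(A, c), (s1, s2)

rng = np.random.default_rng(6)
T1, T2 = 80, 80
beta = np.array([1.0, 2.0])
X1 = np.column_stack([np.ones(T1), rng.normal(size=T1)])
X2 = np.column_stack([np.ones(T2), rng.normal(size=T2)])
y1 = X1 @ beta + 0.5 * rng.normal(size=T1)    # low-variance group
y2 = X2 @ beta + 3.0 * rng.normal(size=T2)    # high-variance group

b_fgls, (s1, s2) = fgls_groupwise(X1, y1, X2, y2)

# Same estimate via OLS on the transformed (rescaled) data
Xs = np.vstack([X1 / np.sqrt(s1), X2 / np.sqrt(s2)])
ys = np.concatenate([y1 / np.sqrt(s1), y2 / np.sqrt(s2)])
b_transformed = np.linalg.lstsq(Xs, ys, rcond=None)[0]
```

Dividing each group by its estimated standard deviation down-weights the noisier group, which is what the weighted sums in the FGLS formula do.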
Discussion and Remarks
How do we determine the “groups” for groupwise heteroskedasticity?

What if the diagonal elements of Σo take multiple values (so that
there are more than 2 groups)?
A general form of heteroskedasticity: $\sigma_t^2 = h(\alpha_0 + z_t'\alpha_1)$, with $h$
unknown, $z_t$ a $p \times 1$ vector and $p$ a fixed number less than $T$.
When the F test rejects the null of homoskedasticity, groupwise
heteroskedasticity need not be a correct description of Σo .
When the form of heteroskedasticity is incorrectly specified, the
resulting FGLS estimator may be less efficient than the OLS estimator.
The finite-sample properties of FGLS estimators and hence the exact
tests are typically unknown.
Serial Correlation
When time series data yt are correlated over time, they are said to
exhibit serial correlation. For cross-section data, the correlations of yt
are known as spatial correlation.
A general form of Σo is that its diagonal elements (variances of yt )
are a constant σo2 , and the off-diagonal elements (cov(yt , yt−i )) are
non-zero.
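In the time series context, $\mathrm{cov}(y_t, y_{t-i})$ are known as the
autocovariances of $y_t$, and the autocorrelations of $y_t$ are
\[
\mathrm{corr}(y_t, y_{t-i}) = \frac{\mathrm{cov}(y_t, y_{t-i})}{\sqrt{\mathrm{var}(y_t)}\sqrt{\mathrm{var}(y_{t-i})}}
= \frac{\mathrm{cov}(y_t, y_{t-i})}{\sigma_o^2}.
\]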
Simple Model: AR(1) Disturbances
A time series yt is said to be weakly (covariance) stationary if its

mean, variance, and autocovariances are all independent of t.
i.i.d. random variables
White noise: A time series with zero mean, a constant variance, and
zero autocovariances.
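Disturbance: $\epsilon := y - X\beta_o$ so that $\mathrm{var}(y) = \mathrm{var}(\epsilon) = \mathrm{IE}(\epsilon\epsilon')$.
Suppose that $\epsilon_t$ follows a weakly stationary AR(1) (autoregressive of
order 1) process:
\[ \epsilon_t = \psi_1 \epsilon_{t-1} + u_t, \quad |\psi_1| < 1, \]
where $\{u_t\}$ is a white noise with $\mathrm{IE}(u_t) = 0$, $\mathrm{IE}(u_t^2) = \sigma_u^2$, and
$\mathrm{IE}(u_t u_\tau) = 0$ for $t \neq \tau$.
By recursive substitution,
\[ \epsilon_t = \sum_{i=0}^{\infty} \psi_1^i u_{t-i}, \]
a weighted sum of current and previous “innovations” (shocks). This is a
stationary process because:
$\mathrm{IE}(\epsilon_t) = 0$, $\mathrm{var}(\epsilon_t) = \sum_{i=0}^{\infty} \psi_1^{2i}\sigma_u^2 = \sigma_u^2/(1 - \psi_1^2)$, and
\[ \mathrm{cov}(\epsilon_t, \epsilon_{t-1}) = \psi_1 \mathrm{IE}(\epsilon_{t-1}^2) = \psi_1\sigma_u^2/(1 - \psi_1^2), \]
so that $\mathrm{corr}(\epsilon_t, \epsilon_{t-1}) = \psi_1$.
$\mathrm{cov}(\epsilon_t, \epsilon_{t-2}) = \psi_1 \mathrm{cov}(\epsilon_{t-1}, \epsilon_{t-2})$ so that $\mathrm{corr}(\epsilon_t, \epsilon_{t-2}) = \psi_1^2$. Thus,
\[ \mathrm{corr}(\epsilon_t, \epsilon_{t-i}) = \psi_1 \mathrm{corr}(\epsilon_{t-1}, \epsilon_{t-i}) = \psi_1^i, \]
which depend only on $i$, but not on $t$.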
The variance-covariance matrix $\mathrm{var}(y)$ is thus
\[
\Sigma_o = \sigma_o^2
\begin{bmatrix}
1 & \psi_1 & \psi_1^2 & \cdots & \psi_1^{T-1} \\
\psi_1 & 1 & \psi_1 & \cdots & \psi_1^{T-2} \\
\psi_1^2 & \psi_1 & 1 & \cdots & \psi_1^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\psi_1^{T-1} & \psi_1^{T-2} & \psi_1^{T-3} & \cdots & 1
\end{bmatrix},
\]
with $\sigma_o^2 = \sigma_u^2/(1 - \psi_1^2)$. Note that all off-diagonal elements of this matrix
are non-zero, but there are only two unknown parameters.
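A transformation matrix for GLS estimation is the following $\Sigma_o^{-1/2}$:
\[
\Sigma_o^{-1/2} = \frac{1}{\sigma_o}
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
-\frac{\psi_1}{\sqrt{1-\psi_1^2}} & \frac{1}{\sqrt{1-\psi_1^2}} & 0 & \cdots & 0 & 0 \\
0 & -\frac{\psi_1}{\sqrt{1-\psi_1^2}} & \frac{1}{\sqrt{1-\psi_1^2}} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \frac{1}{\sqrt{1-\psi_1^2}} & 0 \\
0 & 0 & 0 & \cdots & -\frac{\psi_1}{\sqrt{1-\psi_1^2}} & \frac{1}{\sqrt{1-\psi_1^2}}
\end{bmatrix}.
\]
Any matrix that is proportional to $\Sigma_o^{-1/2}$ can also serve as a
legitimate transformation matrix for GLS estimation.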
The Cochrane-Orcutt Transformation is based on:
\[
V_o^{-1/2} = \sigma_o \sqrt{1 - \psi_1^2}\; \Sigma_o^{-1/2} =
\begin{bmatrix}
\sqrt{1 - \psi_1^2} & 0 & 0 & \cdots & 0 & 0 \\
-\psi_1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -\psi_1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 \\
0 & 0 & 0 & \cdots & -\psi_1 & 1
\end{bmatrix},
\]
which depends only on the single parameter $\psi_1$. The resulting transformed
data are: $y^* = V_o^{-1/2}y$ and $X^* = V_o^{-1/2}X$ with
\[ y_1^* = (1 - \psi_1^2)^{1/2} y_1, \quad x_1^* = (1 - \psi_1^2)^{1/2} x_1, \]
\[ y_t^* = y_t - \psi_1 y_{t-1}, \quad x_t^* = x_t - \psi_1 x_{t-1}, \quad t = 2, \ldots, T, \]
where $x_t$ is the $t$-th column of $X'$.
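The AR(1) covariance matrix and the transformation matrix can be built and checked in a few lines. This sketch uses small illustrative values ($T = 6$, $\psi_1 = 0.7$, $\sigma_u^2 = 2$); the function names are hypothetical.

```python
import numpy as np

def ar1_cov(T, psi, sigma2_u=1.0):
    """Sigma_o with (t, s) element sigma_o^2 * psi^{|t-s|}, sigma_o^2 = sigma_u^2 / (1 - psi^2)."""
    idx = np.arange(T)
    return sigma2_u / (1.0 - psi ** 2) * psi ** np.abs(idx[:, None] - idx[None, :])

def co_matrix(T, psi):
    """V^{-1/2}: sqrt(1 - psi^2) in the (1,1) cell, ones on the rest of the
    diagonal, and -psi on the first sub-diagonal."""
    V = np.eye(T) - psi * np.eye(T, k=-1)
    V[0, 0] = np.sqrt(1.0 - psi ** 2)
    return V

def co_transform(Z, psi):
    """Apply the same transformation row by row without forming the T x T matrix."""
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)
    out[0] = np.sqrt(1.0 - psi ** 2) * Z[0]
    out[1:] = Z[1:] - psi * Z[:-1]
    return out

T, psi, sigma2_u = 6, 0.7, 2.0
Sigma = ar1_cov(T, psi, sigma2_u)
V = co_matrix(T, psi)
whitened = V @ Sigma @ V.T                   # equals sigma_u^2 * I_T

Z = np.arange(12, dtype=float).reshape(T, 2) # any T x k data matrix
same = np.allclose(V @ Z, co_transform(Z, psi))
```

The transformed covariance matrix is $\sigma_u^2 I_T$, a scalar multiple of the identity, which is all that is needed for OLS on the transformed data to coincide with GLS.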
Model Extensions
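Extension to AR(p) process:
\[ \epsilon_t = \psi_1 \epsilon_{t-1} + \cdots + \psi_p \epsilon_{t-p} + u_t, \]
where $\psi_1, \ldots, \psi_p$ must be restricted to ensure weak stationarity.
MA(1) (moving average of order 1) process:
\[ \epsilon_t = u_t - \pi_1 u_{t-1}, \quad |\pi_1| < 1, \]
where $\{u_t\}$ is a white noise.
$\mathrm{IE}(\epsilon_t) = 0$, $\mathrm{var}(\epsilon_t) = (1 + \pi_1^2)\sigma_u^2$.
$\mathrm{cov}(\epsilon_t, \epsilon_{t-1}) = -\pi_1\sigma_u^2$, and $\mathrm{cov}(\epsilon_t, \epsilon_{t-i}) = 0$ for $i \geq 2$.
MA(q) Process: $\epsilon_t = u_t - \pi_1 u_{t-1} - \cdots - \pi_q u_{t-q}$.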
Tests for AR(1) Disturbances
Under AR(1), the null hypothesis is $\psi_1 = 0$. A natural estimator of $\psi_1$ is
the OLS estimator of regressing $\hat{e}_t$ on $\hat{e}_{t-1}$:
\[ \hat\psi_T = \frac{\sum_{t=2}^T \hat{e}_t \hat{e}_{t-1}}{\sum_{t=2}^T \hat{e}_{t-1}^2}. \]
The Durbin-Watson statistic is
\[ d = \frac{\sum_{t=2}^T (\hat{e}_t - \hat{e}_{t-1})^2}{\sum_{t=1}^T \hat{e}_t^2}. \]
When the sample size $T$ is large, it can be seen that
\[
d = 2 - 2\hat\psi_T \frac{\sum_{t=2}^T \hat{e}_{t-1}^2}{\sum_{t=1}^T \hat{e}_t^2}
- \frac{\hat{e}_1^2 + \hat{e}_T^2}{\sum_{t=1}^T \hat{e}_t^2} \approx 2(1 - \hat\psi_T).
\]
For $0 < \hat\psi_T \leq 1$ ($-1 \leq \hat\psi_T < 0$), $0 \leq d < 2$ ($2 < d \leq 4$), there may
be positive (negative) serial correlation. Hence, $d$ essentially checks
whether $\hat\psi_T$ is “close” to zero (i.e., $d$ is “close” to 2).
Difficulty: The exact null distribution of $d$ holds only under the
classical conditions [A1] and [A3] and depends on the data matrix $X$.
Thus, the critical values for $d$ can not be tabulated, and this test is
not pivotal.
The null distribution of $d$ lies between a lower bound ($d_L$) and an
upper bound ($d_U$):
\[ d_{L,\alpha}^* < d_\alpha^* < d_{U,\alpha}^*. \]
The distributions of $d_L$ and $d_U$ are not data dependent, so that their
critical values $d_{L,\alpha}^*$ and $d_{U,\alpha}^*$ can be tabulated.
Durbin-Watson test:
(1) Reject the null if $d < d_{L,\alpha}^*$ ($d > 4 - d_{L,\alpha}^*$).
(2) Do not reject the null if $d > d_{U,\alpha}^*$ ($d < 4 - d_{U,\alpha}^*$).
(3) Test is inconclusive if $d_{L,\alpha}^* < d < d_{U,\alpha}^*$ ($4 - d_{L,\alpha}^* > d > 4 - d_{U,\alpha}^*$).
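The statistics $\hat\psi_T$ and $d$, and the three-way decision rule, can be sketched as follows. The data are simulated with AR(1) disturbances ($\psi_1 = 0.6$); the bounds `d_L` and `d_U` are inputs that must be looked up in a Durbin-Watson table, and the function names are illustrative.

```python
import numpy as np

def psi_hat(e):
    """OLS coefficient from regressing e_t on e_{t-1} (no intercept)."""
    return float(e[1:] @ e[:-1]) / float(e[:-1] @ e[:-1])

def durbin_watson(e):
    """d = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2."""
    return float(np.sum(np.diff(e) ** 2)) / float(e @ e)

def dw_decision(d, d_L, d_U):
    """Decision against positive serial correlation, given tabulated bounds d_L < d_U."""
    if d < d_L:
        return "reject"
    if d > d_U:
        return "do not reject"
    return "inconclusive"

rng = np.random.default_rng(7)
T, psi1 = 500, 0.6
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0] / np.sqrt(1.0 - psi1 ** 2)     # start from the stationary distribution
for t in range(1, T):
    eps[t] = psi1 * eps[t - 1] + u[t]

X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + eps
e_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # OLS residuals

d = durbin_watson(e_hat)
p = psi_hat(e_hat)
```

With positively autocorrelated disturbances, `d` falls well below 2 and is close to `2 * (1 - p)`, as the large-sample approximation suggests.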
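For the specification $y_t = \beta_1 + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + \gamma y_{t-1} + e_t$,
Durbin's h statistic is
\[ h = \hat\gamma_T \sqrt{\frac{T}{1 - T\,\widehat{\mathrm{var}}(\hat\gamma_T)}} \approx N(0, 1), \]
where $\hat\gamma_T$ is the OLS estimate of $\gamma$ with $\widehat{\mathrm{var}}(\hat\gamma_T)$ the OLS estimate of
$\mathrm{var}(\hat\gamma_T)$.
Note: $\widehat{\mathrm{var}}(\hat\gamma_T)$ cannot be greater than $1/T$. (Why?)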
FGLS Estimation
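Notations: Write $\Sigma(\sigma^2, \psi)$ and $V(\psi)$, so that $\Sigma_o = \Sigma(\sigma_o^2, \psi_1)$ and
$V_o = V(\psi_1)$. Based on $V(\psi)^{-1/2}$, we have
\[ y_1(\psi) = (1 - \psi^2)^{1/2} y_1, \quad x_1(\psi) = (1 - \psi^2)^{1/2} x_1, \]
\[ y_t(\psi) = y_t - \psi y_{t-1}, \quad x_t(\psi) = x_t - \psi x_{t-1}, \quad t = 2, \ldots, T. \]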
Iterative FGLS Estimation:
(1) Perform OLS estimation and compute $\hat\psi_T$ using the OLS residuals $\hat{e}_t$.
(2) Perform the Cochrane-Orcutt transformation based on $\hat\psi_T$ and compute
the resulting FGLS estimate $\hat\beta_{FGLS}$ by regressing $y_t(\hat\psi_T)$ on $x_t(\hat\psi_T)$.
(3) Compute a new $\hat\psi_T$ with $\hat{e}_t$ replaced by $\hat{e}_{t,FGLS} = y_t - x_t'\hat\beta_{FGLS}$.
(4) Repeat steps (2) and (3) until $\hat\psi_T$ converges numerically.
Steps (1) and (2) suffice for FGLS estimation; more iterations may
improve the performance in finite samples.
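Steps (1) to (4) can be sketched as a short loop. This is a minimal illustration on simulated data, assuming the estimated $\hat\psi_T$ stays inside $(-1, 1)$ so that the first observation can be rescaled; the tolerance, iteration cap, and function names are choices made here, not part of the lecture.

```python
import numpy as np

def co_transform(Z, psi):
    """y_1(psi) = sqrt(1 - psi^2) y_1 and y_t(psi) = y_t - psi y_{t-1}; same for rows of X."""
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)
    out[0] = np.sqrt(1.0 - psi ** 2) * Z[0]   # requires |psi| < 1
    out[1:] = Z[1:] - psi * Z[:-1]
    return out

def psi_hat(e):
    """OLS coefficient from regressing e_t on e_{t-1}."""
    return float(e[1:] @ e[:-1]) / float(e[:-1] @ e[:-1])

def iterative_fgls(X, y, tol=1e-8, max_iter=100):
    b = np.linalg.lstsq(X, y, rcond=None)[0]              # step (1): OLS
    psi = psi_hat(y - X @ b)
    for n_iter in range(1, max_iter + 1):
        Xs, ys = co_transform(X, psi), co_transform(y, psi)
        b = np.linalg.lstsq(Xs, ys, rcond=None)[0]        # step (2): FGLS given psi
        psi_new = psi_hat(y - X @ b)                      # step (3): residuals in original data
        if abs(psi_new - psi) < tol:                      # step (4): numerical convergence
            return b, psi_new, n_iter
        psi = psi_new
    return b, psi, max_iter

rng = np.random.default_rng(8)
T, psi1 = 400, 0.6
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0] / np.sqrt(1.0 - psi1 ** 2)
for t in range(1, T):
    eps[t] = psi1 * eps[t - 1] + u[t]
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + eps

b_fgls, psi_est, n_iter = iterative_fgls(X, y)
```

Stopping after the first pass through the loop corresponds to using steps (1) and (2) only.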
Instead of estimating $\hat\psi_T$ based on OLS residuals, the Hildreth-Lu
procedure adopts grid search to find a suitable $\psi \in (-1, 1)$.
For a $\psi$ in $(-1, 1)$, conduct the Cochrane-Orcutt transformation and
compute the resulting FGLS estimate (by regressing $y_t(\psi)$ on $x_t(\psi)$)
and the ESS based on the FGLS residuals.
Try every $\psi$ on the grid; a $\psi$ is chosen if the corresponding ESS is the
smallest.
The results depend on the grid.
Note: This method is computationally intensive and difficult to apply
when $\epsilon_t$ follows an AR(p) process with $p > 2$.
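The grid search can be sketched as below. One assumption is made explicit here: the ESS is computed from the residuals of the transformed regression of $y_t(\psi)$ on $x_t(\psi)$. The grid spacing of 0.05 and the simulated data are illustrative choices.

```python
import numpy as np

def co_transform(Z, psi):
    """y_1(psi) = sqrt(1 - psi^2) y_1 and y_t(psi) = y_t - psi y_{t-1}; same for rows of X."""
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)
    out[0] = np.sqrt(1.0 - psi ** 2) * Z[0]
    out[1:] = Z[1:] - psi * Z[:-1]
    return out

def hildreth_lu(X, y, grid):
    """Return (ESS, psi, beta) for the grid point with the smallest ESS."""
    best = None
    for psi in grid:
        Xs, ys = co_transform(X, psi), co_transform(y, psi)
        b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        ess = float(np.sum((ys - Xs @ b) ** 2))   # ESS of the transformed regression (assumption)
        if best is None or ess < best[0]:
            best = (ess, float(psi), b)
    return best

rng = np.random.default_rng(9)
T, psi1 = 400, 0.6
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0] / np.sqrt(1.0 - psi1 ** 2)
for t in range(1, T):
    eps[t] = psi1 * eps[t - 1] + u[t]
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + eps

grid = np.round(np.linspace(-0.9, 0.9, 37), 2)    # step 0.05 inside (-1, 1)
ess_best, psi_best, b_best = hildreth_lu(X, y, grid)
```

A finer grid gives a more precise $\psi$ at a proportionally higher cost, since one regression is run per grid point; with several AR parameters the number of grid points grows multiplicatively.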
Application: Linear Probability Model
Consider binary $y$ with $y = 1$ or $0$.
Under [A1] and [A2](i), $\mathrm{IE}(y_t) = \mathrm{IP}(y_t = 1) = x_t'\beta_o$; this is known as
the linear probability model.
Problems with the linear probability model:
Under [A1] and [A2](i), there is heteroskedasticity:
\[ \mathrm{var}(y_t) = x_t'\beta_o (1 - x_t'\beta_o), \]
and hence the OLS estimator is not the BLUE for $\beta_o$.
The OLS fitted values $x_t'\hat\beta_T$ need not be bounded between 0 and 1.
An FGLS estimator may be obtained using
\[
\hat\Sigma_T^{-1/2} = \mathrm{diag}\Big[ [x_1'\hat\beta_T (1 - x_1'\hat\beta_T)]^{-1/2}, \ldots,
[x_T'\hat\beta_T (1 - x_T'\hat\beta_T)]^{-1/2} \Big].
\]
Problems with FGLS estimation:
$\hat\Sigma_T^{-1/2}$ can not be computed if $x_t'\hat\beta_T$ is not bounded between 0 and 1.
Even when $\hat\Sigma_T^{-1/2}$ is available, there is no guarantee that the FGLS
fitted values are bounded between 0 and 1.
The finite-sample properties of the FGLS estimator are unknown.
A key issue: A linear model here fails to take into account data
characteristics.
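The FGLS weights and the first problem listed above can be illustrated with a short sketch. The helper `lpm_fgls` is hypothetical; it returns `None` for the FGLS estimate whenever an OLS fitted value falls outside $(0, 1)$, in which case the weights are undefined.

```python
import numpy as np

def lpm_fgls(X, y):
    """OLS and FGLS for the linear probability model; FGLS is None if weights are undefined."""
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    p = X @ b_ols                                  # OLS fitted values
    if np.any(p <= 0.0) or np.any(p >= 1.0):
        return b_ols, None                         # p(1 - p) is not positive for some t
    w = (p * (1.0 - p)) ** -0.5                    # diagonal of Sigma_hat^{-1/2}
    b_fgls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
    return b_ols, b_fgls

rng = np.random.default_rng(10)
T = 500
x = rng.uniform(0.0, 1.0, size=T)
X = np.column_stack([np.ones(T), x])
prob = 0.3 + 0.4 * x                               # true IP(y_t = 1), inside (0.3, 0.7)
y = (rng.uniform(size=T) < prob).astype(float)
b_ols, b_fgls = lpm_fgls(X, y)

# A small design where an OLS fitted value is negative, so the weights cannot be formed
x_bad = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X_bad = np.column_stack([np.ones(6), x_bad])
y_bad = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
_, b_bad = lpm_fgls(X_bad, y_bad)
```

In the second design the fitted value at $x = 0$ is negative, so the estimated variance $x_t'\hat\beta_T(1 - x_t'\hat\beta_T)$ is negative and its inverse square root does not exist.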
Application: Seemingly Unrelated Regressions
To study the joint behavior of several dependent variables, consider a
system of $N$ equations, each with $k_i$ explanatory variables and $T$ obs:
\[ y_i = X_i\beta_i + e_i, \quad i = 1, 2, \ldots, N. \]
Stacking these equations yields Seemingly unrelated regressions (SUR):
\[
\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{y}
=
\underbrace{\begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_N \end{bmatrix}}_{X}
\underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_N \end{bmatrix}}_{\beta}
+
\underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{e},
\]
where $y$ is $TN \times 1$, $X$ is $TN \times \sum_{i=1}^N k_i$, and $\beta$ is $\sum_{i=1}^N k_i \times 1$.
Suppose $y_{it}$ and $y_{jt}$ are contemporaneously correlated, but $y_{it}$ and $y_{j\tau}$
are serially uncorrelated, i.e., $\mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T$.
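For this system, $\Sigma_o = S_o \otimes I_T$ with
\[
S_o = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1N} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{N1} & \sigma_{N2} & \cdots & \sigma_N^2
\end{bmatrix};
\]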
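that is, the SUR system exhibits heteroskedasticity across equations (the
variances $\sigma_i^2$ may differ) and contemporaneous (spatial) correlation between
equations, but no serial correlation.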
As $\Sigma_o^{-1} = S_o^{-1} \otimes I_T$, then
\[ \hat\beta_{GLS} = [X'(S_o^{-1} \otimes I_T)X]^{-1}X'(S_o^{-1} \otimes I_T)y, \]
and its covariance matrix is $[X'(S_o^{-1} \otimes I_T)X]^{-1}$.
Remarks:
When $\sigma_{ij} = 0$ for $i \neq j$, $S_o$ is diagonal, and so is $\Sigma_o$. Then, the GLS
estimator for each $\beta_i$ reduces to the corresponding OLS estimator, so
that joint estimation of $N$ equations is not necessary.
If all equations in the system have the same regressors, i.e., Xi = X0
(say) and X = IN ⊗ X0 , the GLS estimator is also the same as the OLS
estimator.
More generally, there would not be much efficiency gain for GLS
estimation if yi and yj are less correlated and/or Xi and Xj are highly
correlated.
The FGLS estimator can be computed as
\[ \hat\beta_{FGLS} = [X'(\hat{S}_{TN}^{-1} \otimes I_T)X]^{-1}X'(\hat{S}_{TN}^{-1} \otimes I_T)y. \]
$\hat{S}_{TN}$ is an $N \times N$ matrix:
\[
\hat{S}_{TN} = \frac{1}{T}
\begin{bmatrix} \hat{e}_1' \\ \hat{e}_2' \\ \vdots \\ \hat{e}_N' \end{bmatrix}
\begin{bmatrix} \hat{e}_1 & \hat{e}_2 & \cdots & \hat{e}_N \end{bmatrix},
\]
where $\hat{e}_i$ is the OLS residual vector of the $i$-th equation.
The estimator $\hat{S}_{TN}$ is valid provided that $\mathrm{var}(y_i) = \sigma_i^2 I_T$ and
$\mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T$. Without these assumptions, FGLS estimation
would be more complicated.
Again, the finite-sample properties of the FGLS estimator are
unknown.
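The SUR FGLS estimator can be sketched with a Kronecker product. The example below assumes a simulated two-equation system with correlated errors; `block_diag` and `sur_fgls` are hypothetical helper names, and forming the full $TN \times TN$ weight matrix is done for clarity rather than efficiency.

```python
import numpy as np

def block_diag(mats):
    """Block-diagonal matrix with the given blocks (the stacked X of the SUR system)."""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

def sur_fgls(X_list, y_list):
    T = y_list[0].shape[0]
    # Equation-by-equation OLS residuals, stacked as columns of a T x N matrix
    E = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                         for X, y in zip(X_list, y_list)])
    S_hat = E.T @ E / T                                   # N x N estimate of S_o
    X = block_diag(X_list)
    y = np.concatenate(y_list)
    W = np.kron(np.linalg.inv(S_hat), np.eye(T))          # S_hat^{-1} (x) I_T
    b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return b, S_hat

rng = np.random.default_rng(11)
T = 100
S_true = np.array([[1.0, 0.8], [0.8, 2.0]])
E = rng.normal(size=(T, 2)) @ np.linalg.cholesky(S_true).T   # rows have covariance S_true
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
y1 = X1 @ np.array([1.0, 2.0]) + E[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + E[:, 1]
b_sur, S_hat = sur_fgls([X1, X2], [y1, y2])

# With identical regressors in every equation, FGLS reproduces equation-by-equation OLS
b_same, _ = sur_fgls([X1, X1], [y1, y2])
b_ols_same = np.concatenate([np.linalg.lstsq(X1, y, rcond=None)[0] for y in (y1, y2)])
```

The last comparison illustrates the remark about identical regressors: when $X_i = X_0$ for all $i$, the weighting by $\hat{S}_{TN}^{-1}$ cancels and joint estimation yields the same coefficients as separate OLS regressions.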
