Bruno Lecture Notes PDF
Econometrics
Giovanni Bruno
Chapter 1. Introduction
1.1. Introduction
1.2. The linear population model
Chapter 5. The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity
5.1. Introduction
5.2. Embedding the Oaxaca model into a pooled regression framework
5.3. The OLS estimator in the Oaxaca model is BLUE
5.4. Zyskind's condition is verified in the Oaxaca model
Bibliography
Part 1
Linear Models
CHAPTER 1
Introduction
1.1. Introduction
Indeed, causation is not the same as correlation. Econometrics uses economic theory, mathematics and statistics to quantify economic structural relationships, often in search of causal links among the variables of interest.
Although rather schematic, the following discussion should convey the basic intuition of
how this process works.
Economic theory provides the econometrician with an economic structural model,
(1.1.1) y = f(x, ε),
from the economic factors of interest, x, to the economic response or dependent variable y. Often in applications q = 1, which means that ε is treated as a catch-all random scalar.
For example, f(x, ε) may be the expenditure function in a population of (possibly) heterogeneous consumers, with preferences ε and facing income and prices x; or it may be the Marshallian demand function for some good in the same population, with x denoting prices and total consumption expenditure; or it may be the demand function for some input of a population of (possibly) heterogeneous firms facing input and output prices x, with ε comprising technological latent heterogeneity, and so on.¹
The individual f(x, ε), with its gradient vector of marginal effects, ∇x f(x, ε), and Hessian matrix, Dxx f(x, ε), are typically the structural objects of interest, but sometimes attention is centered upon aggregate structural objects, such as the population-averaged structural function,
∫ f(x, ε) dF(ε),
¹Wooldridge (2010) prefers to think of f(x, ε) as a structural conditional expectation: E(y|x, ε) = f(x, ε). There is nothing in the present analysis that prevents such an interpretation.
The key question is under what conditions these estimable statistical objects are informative on f(x, ε). Evidently, to establish a mapping between the structural economic object of interest and the foregoing statistical objects, the econometrician needs to model the relationship between observables and unobservables in f(x, ε) and do so in a plausible way. The restrictions used to this purpose are called identification restrictions. The next sections describe the simplest probabilistic model for equation (1.1.1), the linear population model.
1.2. The linear population model
Equation (1.1.1) is a linear model of the population if the following assumptions hold.
(1.2.1) y = x′β + ε,
The following proves the equivalence of P.2 and P.2b (not crucial and rather technical; it can be skipped when preparing for the exam). I exploit the properties of the expectation and rank operators. Assume P.2 and Pr(a′x = 0) = 1 for some conformable constant vector a. Then E(a′xx′a) = 0, and so a′E(xx′)a = 0, which implies a = 0 by P.2, proving P.2b. Now, assume P.2b and pick any a ≠ 0. Then Pr(a′x = 0) ≠ 1 and so Pr(a′x ≠ 0) > 0. But since a′x ≠ 0 is equivalent to a′xx′a > 0, then Pr(a′xx′a > 0) = Pr(a′x ≠ 0) > 0. So, since Pr(a′xx′a ≥ 0) = 1, E(a′xx′a) > 0, which in turn implies a′E(xx′)a > 0. Therefore, E(xx′) is positive definite and so non-singular, that is P.2.
Solution:
E(xx′) = [ 1, E(x1); E(x1), E(x1²) ]
and so det E(xx′) = E(x1²) − E²(x1) = Var(x1), and the claim is proved by noting that the determinant, and hence E(xx′), is non-singular if and only if Var(x1) > 0.
∇x f(x, ε) = β.
By assumption P.3 and the law of iterated expectations, E(xε) = 0. Since ε = y − x′β, then we have the system of k moment conditions
(1.2.2) E(xy − xx′β) = 0
or E(xy) = E(xx′)β. Assumption P.2, then, ensures that the foregoing system can be solved for β to have
(1.2.3) β = E(xx′)⁻¹ E(xy)
At this point the linear probabilistic model establishes a precise mapping between, on the one hand, the structural objects of interest, f(x, ε) and ∇x f(x, ε), and on the other the observable or estimable objects y, x, E(xx′) and E(xy). Indeed, f(x, ε) and ∇x f(x, ε) are equal to unique known transformations of y, x, E(xx′) and E(xy). This means that f(x, ε) and ∇x f(x, ε) can be estimated using estimators for E(xx′) and E(xy), whose choice depends on the underlying sampling mechanism. The most basic strategy is to carry out estimation within the linear regression model and its variants. In essence, the linear regression model is the linear probabilistic model supplemented by a random sampling assumption. This ensures optimal properties of the ordinary least squares (OLS) estimator and its various generalizations.
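Although the empirical work in these notes is carried out in Stata, the sample-analogue logic behind (1.2.3) is easy to check in any matrix language. The following is a minimal Python/NumPy sketch on simulated data; the coefficient values and sample size are purely illustrative assumptions.

```python
import numpy as np

# Simulate the linear population model y = x'beta + eps with E(x eps) = 0.
rng = np.random.default_rng(0)
n = 100_000
beta = np.array([1.0, 2.0, -0.5])                  # illustrative "true" coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta + rng.normal(size=n)

# Sample analogues of the population moments E(xx') and E(xy) ...
Sxx = X.T @ X / n
Sxy = X.T @ y / n
# ... and the solved moment conditions, beta = E(xx')^{-1} E(xy) (eq. 1.2.3)
beta_hat = np.linalg.solve(Sxx, Sxy)
```

By the law of large numbers the sample moments converge to their population counterparts, so beta_hat approaches beta as n grows; this is exactly the OLS estimator introduced in the next chapter.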
A more restrictive specification of the linear model maintains the assumptions of conditional homoskedasticity and normality
P.4: Var(ε|x) = σ².
P.5: ε|x ∼ N(0, σ²).
A more general variant of the linear model, instead, replaces assumption P.3 with
P.3b: E(xε) = 0.
Under P.3b it is still true that β = E(xx′)⁻¹E(xy) and ∇x f(x, ε) = β, with the virtue that the conditional expectation E(y|x) is left unrestricted. Therefore, with P.3b replacing P.3, the model is more general.
The function x′β, with β = E(xx′)⁻¹E(xy), is relevant in either version of the linear model and is called the linear projection of y onto x.
2.1. From the linear population model to the linear regression model
Consider the linear model of the previous chapter: the population equation (1.1.1)
y = f(x, ε)
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi, xi1, xi2, ..., xik), i = 1, ..., n} are independently and identically distributed (i.i.d.) random vectors.
Given P.1-P.3 and RS, we have the linear regression model (LRM)
(2.1.1) yi = xi′β + εi
with xi′ = (xi1, xi2, ..., xik), i = 1, ..., n, and {εi = yi − xi′β, i = 1, ..., n} is a sequence of unobserved i.i.d. error terms.
2.2. The properties of the LRM
(2.2.1) y = Xβ + ε
where y is the (n×1) vector with i-th element yi, X is the (n×k) matrix with i-th row xi′, and ε is the (n×1) vector with i-th element εi.
It is not hard to see that model (2.2.1), given P.1-P.3 and RS, satisfies the following properties.
LRM.1 is obvious. LRM.2 requires that no column of X can be obtained as a linear combination of other columns of X or, equivalently, that a = 0 if Xa = 0, or also equivalently that for any a ≠ 0 there exists at least one observation i = 1, ..., n such that xi′a ≠ 0. P.2 ensures that this occurs with non-zero probability, which approaches unity as n → ∞. LRM.3, instead, is a consequence of P.3 and RS. This is proved as follows. By P.3, E(εi|xi′) = 0, i = 1, ..., n, or E(yi|xi′) − xi′β = 0, i = 1, ..., n. Since
E(εi|x1′, ..., xi′, ..., xn′) = E(yi|x1′, ..., xi′, ..., xn′) − xi′β
and by RS, E(yi|xi′) = E(yi|x1′, x2′, ..., xn′), then
If, in addition, P.4 (conditional homoskedasticity) and P.5 (conditional normality) hold for the population model, then one can verify that
LRM.4: Var(ε|X) = σ²In.
LRM.5: ε|X ∼ N(0, σ²In)
While LRM.1-LRM.5 are less restrictive than P.1-P.5 and RS and, in most cases, sufficient for accurate and precise inference, they are still strong assumptions to maintain. Finally, if P.3 is replaced by P.3b, E(xε) = 0, then LRM.3 gets replaced by
E(X′ε) = 0.
2.3. Difficulties and extensions
Some or all of LRM.1-LRM.5 may not be verified if the population model assumptions and/or the RS mechanism are not verified in reality. Here is a list of the most important population issues.
Non-linearities (P.1 fails): the model is non-linear in the parameters. This leads LRM.1 to fail.
Perfect multicollinearity (P.2 fails): some variables in x are indeed linear combinations of the others. LRM.2 fails, but in general this is not a serious problem; it simply indicates that the model has not been parametrized correctly to begin with. A different parametrization will restore identification in most cases.
Endogeneity (P.3 fails): some variables in x are related to ε. LRM.3 fails.
Conditional heteroskedasticity (P.4 fails): the conditional variance depends on x. LRM.4 fails.
Omitted variables: some of the variables in x are not sampled. This implies that the missing variables cannot enter the conditioning set and have to be treated as unobserved errors, along with ε, which could make LRM.3-LRM.5 fail.
Measurement error: some of the variables in x are measured with error. We have the wrong variables in the conditioning set. As in the case of omitted variables, LRM.3-LRM.5 may fail.
Endogenous selection: some units in the sample are missing due to events related to ε. Also in this case, LRM.3-LRM.5 are likely to fail.
Notice that often problems in the RS mechanism have their roots in the population model. For example, the presence of non-random variables in x is not in general compatible with an identically distributed sample and, in consequence, with RS. It is easy to verify, though, that non-random x along with a weaker sampling mechanism requiring only independent sampling is compatible with LRM.1-LRM.5. Also, the presence of variables in x at different levels of aggregation may not be compatible with independent sampling, as observed by Moulton (1990). In this case, the sampling mechanism can be relaxed by maintaining independence only across groups of observations and not across observations themselves. See for example the sampling mechanism described in Section 7.6 for panel data models, in which the sample is neither identically distributed nor independent across observations.
Finally, it is important to emphasize that even if all the population assumptions and the RS mechanism are valid, data problems may arise in the form of multicollinearity among regressors. As we will see in Chapter 4, although multicollinearity does not affect the statistical properties of the estimators in finite samples, it can severely affect the precision of the coefficient estimates in terms of large standard errors.
CHAPTER 3
We do not agree with Larry (the adult croc), do we? Algebra may be boring, but only if its purpose is left obscure. Algebra in econometrics provides the bricks to construct estimators and tests. The fact that most estimators and tests are automatically implemented by statistical packages is no excuse to neglect the underlying algebra. First, because most does not mean all, and it may be the case that for our research work we have to build the technique ourselves. This is especially true for the most recent techniques. A robust Hausman test for panel data models and multiway cluster-robust standard errors are just a few examples of techniques that are not yet coded in the popular statistical packages. Second, even if the technique is available as a built-in procedure in our preferred statistical package, to use it correctly we have to know how it is made, which boils down to understanding its algebra. Finally, interpretation of results often requires that we are aware of the algebraic properties of estimators and tests. So the material here may seem rather intricate at times, but it is certainly of practical use.
This chapter is based on my lecture notes in matrix algebra as well as on Greene (2008), Searle (1982) and Rao (1973). Throughout, I denotes a conformable identity matrix; 0 denotes a conformable null matrix, vector or scalar, with the appropriate meaning being clear from the context; y is a real (n×1) vector containing the observations of the dependent variable; X is a real (n×k) regressor matrix of full column rank.
The do-file algebra_OLS.do demonstrates the results of this chapter using the Stata data set US_gasoline.dta.
Given the real (n×k) matrix A, the columns of A are said to be linearly dependent if there exists some non-zero (k×1) vector b such that Ab = 0.
Given the real (n×k) matrix A, the columns of A are said to be linearly independent if Ab = 0 only if b = 0.
Two real non-zero (n×1) vectors a and b are said to be orthogonal if a′b = 0. Given two real non-zero matrices A and B with the same number of rows, if each column of A is orthogonal to all columns of B, so that A′B = 0, then A and B are said to be orthogonal.
3.3. OLS: definition and properties
We do not have any model in mind here, just data for the response variable,
y = (y1, ..., yi, ..., yn)′.
where
S(bo) = (y − Xbo)′(y − Xbo)
= y′y − 2y′Xbo + bo′X′Xbo,
where the second equation follows from the fact that all of the right hand side components are 1×1 matrices and as such necessarily symmetric, so that bo′X′y = y′Xbo. Then, remembering the derivation rules for matrices (∂(Az)/∂z = A′ for any matrix A, ∂(z′Bz)/∂z = 2Bz for any symmetric matrix B),
∂S(bo)/∂bo = −2X′y + 2X′Xbo,
so that the first order conditions (OLS normal equations) of the minimization problem are
(3.3.1) −X′y + X′Xb = 0,
(3.3.2) b = (X′X)⁻¹X′y
Notice that
∂²S(bo)/∂bo∂bo′ = 2X′X
is a positive definite matrix (i.e. S(bo) is globally convex in bo). The latter is easily proved as follows. A matrix A is said to be positive definite if the quadratic form c′Ac > 0 for any conformable vector c ≠ 0. By the full column rank assumption, z = Xc ≠ 0 for any c ≠ 0; therefore c′X′Xc = z′z = Σᵢ₌₁ⁿ zᵢ² > 0 for any c ≠ 0.
(3.3.3) e = y − Xb
(3.3.4) X′(y − Xb) = 0.
Therefore, if X contains a column of all unity elements, say 1, three important implications follow.
(1) The sample mean of e is zero: 1′e = Σᵢ₌₁ⁿ eᵢ = 0 and consequently ē = (1/n) Σᵢ₌₁ⁿ eᵢ = 0.
(2) The OLS regression line passes through the point of sample means (ȳ, x̄), that is ȳ = x̄′b, where ȳ = (Σᵢ₌₁ⁿ yᵢ)/n and
x̄′ = (n⁻¹ Σᵢ₌₁ⁿ x1i, ..., n⁻¹ Σᵢ₌₁ⁿ xki)
(3) The fitted values
(3.3.5) ŷ = Xb
have the same sample mean as y: the sample mean of ŷ equals ȳ.
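The normal equations and the three implications above can be verified numerically. Here is a small Python/NumPy sketch on simulated data (all values are illustrative; the quantities below are numerically zero up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes the unity column 1
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS, eq. (3.3.2)
e = y - X @ b                           # residuals, eq. (3.3.3)
y_hat = X @ b                           # fitted values, eq. (3.3.5)

check_normal_eq = X.T @ e               # eq. (3.3.4): X'e = 0
check_mean_e = e.mean()                 # implication (1): zero-mean residuals
check_through_means = y.mean() - X.mean(axis=0) @ b   # implication (2): ybar = xbar'b
check_mean_fit = y_hat.mean() - y.mean()              # implication (3): mean(yhat) = ybar
```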
3.3.1. Stata implementation: get your Stata data file with use. All Stata data
files can be recognized by their filetype dta. Suppose you have y and X within a Stata data
file called, say, mydata.dta, stored in your Stata working directory and that you have just
launched Stata on your laptop. To get your data into memory, from the Stata command line
execute use followed by the name of the data file (specifying the filetype dta is not necessary
since use only supports dta files):
use mydata
If mydata.dta is not in your Stata working directory but somewhere else on your laptop, then you must specify the path of the dta file. For example, if you have a Mac and your data file is in the folder /Users/giovanni you will write
use /Users/giovanni/mydata
while under Windows you will write something like
use c:\giovanni\mydata
If the path involves folders with names that include blanks, then enclose the whole path in double quotes. For example:
use "c:\my folder\mydata"
3.3.2. Stata implementation: the help command. To know syntax, options, usage
and examples for any Stata command, write help from the command line followed by the
name of the command for which you want help. For example,
help use
Title
Syntax
Menu
Description
In the second syntax for use, a subset of the data may be read.
Options
clear specifies that it is okay to replace the data in memory, even though
the current data have not been saved to disk.
nolabel prevents value labels in the saved data from being loaded. It is
unlikely that you will ever want to specify this option.
Examples
. use http://www.stata-press.com/data/r11/auto
. replace rep78 = 3 in 12
Also see
Help: [D] compress, [D] datasignature, [D] fdasave, [D] haver, [D]
infile (free format), [D] infile (fixed format), [D] infix, [D]
insheet, [D] odbc, [D] save, [D] sysuse, [D] webuse
.
3.3.3. Stata implementation: OLS estimates with regress. Now that you have
loaded your data into memory, Stata can work with them. Suppose your dependent variable
y is called depvar and that X contains two variables, x1 and x2. To run the OLS regression of
depvar on x1 and x2 with the constant term included, you write regress followed by depvar
and, then, the names of the regressors:
regress depvar x1 x2
The following example shows the regression in example 1.2 of Greene (2008) with annual
values of US aggregate consumption (c) used as the dependent variable and regressed on
annual values of US personal income (y) for the period 1970-1979.
regress includes the constant term (the unity vector) by default and always with the name _cons. If you don't want it, just add the regress option noconstant:
regress depvar x1 x2, noconstant
Notice that, according to a general rule of the Stata syntax, the options of any Stata command always follow the comma symbol. This means that if you wish to specify options you have to write the comma symbol after the last argument of the command, so that everything to the right of the comma symbol is held by Stata as an option. There can be more than one option. Of course, if you do not wish to include options, don't write the comma symbol.
After execution, regress leaves behind a number of objects in memory, mainly scalars
and matrices, that will stay there, available for use, until execution of the next estimation
command. To know what these objects are, consult the section Saved results in the help
of regress, where you will find the following description.
Saved results
Scalars
e(N) number of observations
e(mss) model sum of squares
e(df_m) model degrees of freedom
e(rss) residual sum of squares
e(df_r) residual degrees of freedom
e(r2) R-squared
e(r2_a) adjusted R-squared
e(F) F statistic
e(rmse) root mean squared error
e(ll) log likelihood under additional assumption of i.i.d.
normal errors
e(ll_0) log likelihood, constant-only model
e(N_clust) number of clusters
e(rank) rank of e(V)
Macros
e(cmd) regress
e(cmdline) command as typed
e(depvar) name of dependent variable
e(model) ols or iv
e(wtype) weight type
e(wexp) weight expression
e(title) title in estimation output when vce() is not ols
e(clustvar) name of cluster variable
e(vce) vcetype specified in vce()
e(vcetype) title used to label Std. Err.
e(properties) b V
e(estat_cmd) program used to implement estat
e(predict) program used to implement predict
e(marginsok) predictions allowed by margins
e(asbalanced) factor variables fvset as asbalanced
e(asobserved) factor variables fvset as asobserved
Matrices
e(b) coefficient vector
e(V) variance-covariance matrix of the estimators
e(V_modelbased) model-based variance
Functions
e(sample) marks estimation sample
You should already be familiar with some of the e() objects in the Scalars and Matrices parts. At the end of the course you will be able to understand most of them. Don't worry about the Macros and Functions parts; they are rather technical and, in any case, not relevant for our purposes.
To know the values taken on by the e() objects, execute the command ereturn list just
after the regress instruction. In our regression example we have:
. ereturn list
scalars:
e(N) = 10
e(df_m) = 1
e(df_r) = 8
e(F) = 959.919036180133
e(r2) = .9917348458900325
e(rmse) = 8.193020017500434
e(mss) = 64435.11918375102
e(rss) = 537.0046160573024
e(r2_a) = .9907017016262866
e(ll) = -34.10649331948547
e(ll_0) = -58.08502782843004
e(rank) = 2
macros:
e(cmdline) : "regress c y"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "c"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 2
e(V) : 2 x 2
functions:
e(sample)
3.4. Spanning sets and orthogonal projections
Consider the n-dimensional Euclidean space Rⁿ and the (n×k) real matrix A. Then, each column of A belongs to Rⁿ and the set of all linear combinations of the columns of A is called the space spanned by the columns of A (or also the range of A), denoted by R(A).
R(A) can easily be proved to be a subspace of Rⁿ (it is obvious that R(A) ⊆ Rⁿ; R(A) is a vector space since, given any two vectors a1 and a2 belonging to R(A), then a1 + a2 ∈ R(A) and ca1 ∈ R(A) for any real scalar c). Since each element of R(A) is a vector of n components, R(A) is said to be a vector space of order n. The dimension of R(A), denoted
by dim[R(A)], is the maximum number of linearly independent vectors in R(A). Therefore, dim[R(A)] = rank(A) and if A is of full column rank, then dim[R(A)] = k.
The set of all vectors in Rⁿ that are orthogonal to the vectors of R(A) is denoted by A⊥.
I now prove that A⊥ is a subspace of Rⁿ. A⊥ ⊆ Rⁿ by definition. Given any two vectors b1 and b2 belonging to A⊥ and for any a ∈ R(A), b1′a = 0 and b2′a = 0, but then also (b1 + b2)′a = 0 and, for any scalar c, (cb1)′a = 0, which completes the proof.
Importantly, it is possible to prove, although not pursued here, that
(3.4.1) dim A⊥ = n − rank(A).
A⊥ is commonly referred to as the space orthogonal to R(A), or also the orthogonal complement of R(A).
For simplicity, assume A of full column rank and define the operator P[A] as
P[A] = A(A′A)⁻¹A′.
As an exercise you can verify that P[A] is a symmetric (P[A]′ = P[A]) and idempotent (P[A]P[A] = P[A]) matrix. With these two properties, P[A] is said to be an orthogonal projector. In geometrical terms, P[A] projects vectors onto R(A) along a direction that is parallel to the space orthogonal to R(A), A⊥. Symmetrically,
M[A] = I − P[A]
is the orthogonal projector that projects vectors onto A⊥ along a direction that is parallel to the space orthogonal to A⊥, R(A).
Exercise 4. Prove that M[A] is an orthogonal projector (hint: just verify that M[A] is symmetric and idempotent).
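The defining properties of P[A] and M[A] (symmetry, idempotency, mutual orthogonality, and their action on vectors lying in R(A)) can also be checked numerically. A Python/NumPy sketch with an arbitrary full-column-rank A (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 3
A = rng.normal(size=(n, k))               # full column rank with probability one

P = A @ np.linalg.inv(A.T @ A) @ A.T      # P[A] = A(A'A)^{-1}A'
M = np.eye(n) - P                         # M[A] = I - P[A]

sym_idem_P = np.allclose(P, P.T) and np.allclose(P @ P, P)
sym_idem_M = np.allclose(M, M.T) and np.allclose(M @ M, M)
orthogonal = np.allclose(P @ M, 0)        # P[A] and M[A] are orthogonal

v = A @ rng.normal(size=k)                # any vector lying in R(A)
projects = np.allclose(P @ v, v) and np.allclose(M @ v, 0)
```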
The properties of orthogonal projectors, established by the following exercises, are readily understood once one grasps the geometrical meaning of orthogonal projectors. They can also be demonstrated algebraically, which is what the exercises require.
Exercise 5. Given two (n×k) real matrices A and B, both of full column rank, prove that if A and B span the same space then P[A] = P[B] (hint: prove that A can always be expressed as A = BK, where K is a non-singular (k×k) matrix).
Solution: If R(A) coincides with R(B), then every column of A belongs to R(B), and as such every column of A can be expressed as a linear combination of the columns of B, A = BK, where K is (k×k). Therefore, P[A] = BK(K′B′BK)⁻¹K′B′. An important result of linear algebra states that given two conformable matrices C and D, then rank(CD) ≤ min[rank(C), rank(D)] (see Greene (2008), p. 957, (A-44)). Since both A and B have rank equal to k, in the light of the foregoing inequality, k ≤ min[k, rank(K)], which implies that rank(K) ≥ k, and since rank(K) > k is not possible, then rank(K) = k and K is non-singular. Finally, by the property of the inverse of the product of square matrices (see Greene (2008), p. 963, (A-64))
P[A] = BK(K′B′BK)⁻¹K′B′
= BKK⁻¹(B′B)⁻¹(K′)⁻¹K′B′
= P[B].
Exercise 6. Prove that P[A] and M[A] are orthogonal, that is, P[A]M[A] = 0.
The foregoing results are useful to properly understand the properties of OLS. But before going on, do the following exercise.
3.5. OLS residuals and fitted values
Exercise 7. Given any (n×1) real vector v lying in R(A), prove that P[A]v = v and M[A]v = 0 (hint: express v as v = Ac, where c is a real (k×1) vector).
(3.5.1) P[A]A = A
and
(3.5.2) M[A]A = 0.
(3.5.3) e = M[X]y,
where
M[X] = I − X(X′X)⁻¹X′.
Therefore, the OLS residual vector, e, is the orthogonal projection of y onto the space orthogonal to that spanned by the regressors, X⊥. For this reason M[X] is called the residual maker. From (3.3.2) and (3.3.5) it follows that
ŷ = P[X]y
and
(3.5.4) y = ŷ + e.
The geometry of OLS can be grasped within the framework of an n = 2, k = 1 regression model.
predict y_hat
You have thereby created a new variable with name y_hat that contains the ŷ values. Fitted values are the default calculation of predict; if you want residuals just add the res option:
predict resid, res
and you have got a new variable in your data called resid that contains the e values.
It is important to stress that predict supports any estimation command, not only regress. So, it can be implemented, for example, after xtreg in the context of panel data.
3.6. Partitioned regression
It is often the case in econometrics that interest centers on a subset of explanatory variables and the corresponding marginal effects, and that consequently we are interested in the specific formula for the corresponding OLS subvector.
Partition X as X = (X1 X2) and, accordingly,
b = (b1′, b2′)′,
where b = (X′X)⁻¹X′y. The following important theorem provides the answer, yielding the OLS formulas for the b1 and b2 subvectors.
Theorem 8. Assume X has full column rank; then there is a linear relationship between b1 and b2,
(3.6.1) bi = (Xi′Xi)⁻¹Xi′(y − Xj bj), i, j = 1, 2 and i ≠ j.
or
(3.6.5) b1 = (X1′X1)⁻¹X1′(y − X2b2),
and by symmetry b2 = (X2′X2)⁻¹X2′(y − X1b1), which shows the first part of the theorem. To prove the second part, replace the right hand side of equation (3.6.5) into the second system in (3.6.4) to obtain
X2′y − X2′X1(X1′X1)⁻¹X1′(y − X2b2) − X2′X2b2 = 0
or equivalently, using the orthogonal projector notation P[X1] for X1(X1′X1)⁻¹X1′,
X2′M[X1]y − X2′M[X1]X2b2 = 0,
so that eventually
b2 = (X2′M[X1]X2)⁻¹X2′M[X1]y.
Finally, by symmetry,
b1 = (X1′M[X2]X1)⁻¹X1′M[X2]y.
Theorem 8 is an important result and is known in the econometric literature as the Partitioned Regression Theorem or also the Frisch-Waugh-Lovell Theorem, after the names of the economists who independently proved and re-proved it over time. Since the partitioning considered by Theorem 8 is completely arbitrary, the Theorem's formulas (3.6.2) permit computing the OLS estimator for any specific subvector of variables in the regression. For example, the formula for a single coefficient bi is readily obtained by specifying the partitioning as X1 = xi, X2 = X−i, where xi is the i-th column in X and X−i is the matrix X with column xi removed. So, applying the Theorem's formulas (3.6.2) to such a partitioning we find that
(3.6.6) bi = (xi′M[X−i]xi)⁻¹xi′M[X−i]y = xi′M[X−i]y / (xi′M[X−i]xi).
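Theorem 8 can be checked numerically: the coefficients on X2 from the full regression coincide with those from the reduced regression of M[X1]y on M[X1]X2. A Python/NumPy sketch on simulated data (dimensions and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # first block (with constant)
X2 = rng.normal(size=(n, 2))                                  # second block
X = np.hstack([X1, X2])
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)                         # full regression

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # M[X1]
Z = M1 @ X2                                                   # X2 partialled on X1
b2 = np.linalg.solve(Z.T @ Z, Z.T @ (M1 @ y))                 # (X2'M[X1]X2)^{-1} X2'M[X1]y
```

Here b2 reproduces the last two elements of b, exactly as the theorem asserts.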
Theorem 8 is also useful in empirical work to reduce the dimension of the regression model without affecting the OLS estimates. Indeed, b1 can be obtained by a reduced regression where the dependent variable is the residual vector obtained by regressing y on X2, that is M[X2]y, and the regressors are the residuals obtained from the regressions of each column of X1 on X2, that is M[X2]X1. The number of regressors in the reduced regression is equal to the number of columns in X1. For example, bi can be obtained by regressing the dependent variable M[X−i]y on the single regressor M[X−i]xi and as such could be represented graphically as the slope of the resulting fitted regression line in a two-way graph in Stata. The simple reduced regression of y on xi will not do, as it does not account for the variation in X−i. An empirical demonstration of the foregoing utilization of Theorem 8 is given by the following portion of the Stata dofile algebra_OLS.do, estimating an aggregate gasoline expenditure function for the US economy.
The dofile concludes with the graphical representation of the estimated coefficient on Gasp (−0.070) as the slope of the estimated regression line, with the scatterplot of ln_G_e and ln_Gasp_e in the background.
bi = (Xi′Xi)⁻¹Xi′y, i = 1, 2.
Corollary 9 is proved at once by substituting Xi′Xj = 0 into the first equation of Theorem 8. It ensures that in the presence of orthogonal regressors, b1 (b2) can be obtained by the reduced OLS regression of y on X1 (X2). From an empirical point of view orthogonal regressors are a narrow case. Yet, Corollary 9 will often be of use in derivations.
Exercise 10. Partitioning X = (X1 1), where 1 is the (n×1) vector of all unity elements, prove that M[1] = I − 1(1′1)⁻¹1′ transforms all variables into deviations from their sample means, and so that the OLS estimator b1 can be obtained by regressing y demeaned on X1 demeaned.
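Exercise 10's claim that M[1] demeans is immediate to verify numerically (a Python/NumPy sketch; the vector length is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
ones = np.ones((n, 1))
M1 = np.eye(n) - ones @ ones.T / n    # M[1] = I - 1(1'1)^{-1}1'

x = rng.normal(size=n)
demeaned = M1 @ x                     # equals x minus its sample mean
```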
Remark 11. Since b exists, so do its components and the linear relationships among them, which proves at once that all of the inverse matrices in the formulas of Theorem 8 exist. In other words, X1 and X2 are both of full column rank and X1′M[X2]X1 and X2′M[X1]X2 are non-singular when X is of full column rank. This result can also be verified by direct inspection, as suggested by the following exercises.
Exercise 12. Prove that if X is of full column rank, so are X1 and X2 (easy!).
Exercise 13. Prove that if X is of full column rank, then M[X1 ] X2 is of f.c.r.
Solution: I prove the result by contradiction and assume that P[X1]X2b = X2b for some vector b ≠ 0. Therefore, X1a = X2b, where a = (X1′X1)⁻¹X1′X2b, or equivalently Xc = 0, where
c = (a′, −b′)′,
which leads to a contradiction since c ≠ 0 and X is of f.c.r.
Exercise 14. Prove that X2′M[X1]X2 is positive definite if X is of f.c.r. (hint: use Exercise 13 to prove that M[X1]X2 is of full column rank and then use the fact that M[X1] is symmetric and idempotent).
Exercise 15.
The following result on the decomposition of orthogonal projectors into orthogonal components will be useful on a number of occasions later on.
Proof. Equation (3.6.7) can be proved directly using the formula for the inverse of a 2×2 partitioned matrix. An indirect but easier proof is based on Theorem 8 and goes as follows. Notice that, for any y and X = (X1 X2) of f.c.r., P[X]y = X1b1 + X2b2, where by Theorem 8 b1 = (X1′X1)⁻¹X1′(y − X2b2) and b2 = (X2′M[X1]X2)⁻¹X2′M[X1]y. So, plugging the right hand side of b1 into the right hand side of P[X]y = X1b1 + X2b2 yields
P[X]y = X1(X1′X1)⁻¹X1′(y − X2b2) + X2b2
= P[X1]y − P[X1]X2b2 + X2b2
= P[X1]y + M[X1]X2b2
Then, plugging the right hand side of b2 into the foregoing equation yields
P[X]y = P[X1]y + M[X1]X2(X2′M[X1]X2)⁻¹X2′M[X1]y
or P[X]y = (P[X1] + P[M[X1]X2])y, which must hold for any y, so that P[X] = P[X1] + P[M[X1]X2].
Remark 17. Clearly, for one partitioning there are always two representations of P[X], since the subscripts in Equation (3.6.7) can be interchanged and so we may also write P[X] = P[X2] + P[M[X2]X1].
Exercise 18. Verify that the two matrices in the right hand side of (3.6.7), P[X1] and P[M[X1]X2], are orthogonal.
3.6.1. Residuals from the partitioned OLS regressions. Consider the OLS regression of M[X2]y on the columns in M[X2]X1 as regressors and the corresponding residual vector
ePR = M[X2]y − M[X2]X1b1.
The next exercise asks to prove that ePR equals the OLS residual vector e = M[X]y.
Solution. By Theorem 8, b1 = (X1′M[X2]X1)⁻¹X1′M[X2]y; replacing it into the right hand side of the ePR equation yields
ePR = M[X2]y − M[X2]X1(X1′M[X2]X1)⁻¹X1′M[X2]y
= M[X2]y − P[M[X2]X1]y.
Then, since M[X2] = M[X] + P[M[X2]X1] (apply the decomposition of Remark 17 to M[X] = I − P[X]),
ePR = (M[X] + P[M[X2]X1])y − P[M[X2]X1]y
= M[X]y
= e.
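The equality ePR = e can also be checked numerically; a Python/NumPy sketch on simulated data (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X = np.hstack([X1, X2])
y = rng.normal(size=n)

def resid_maker(A):
    # M[A] = I - A(A'A)^{-1}A'
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ A) @ A.T

e = resid_maker(X) @ y                             # full-regression residuals, M[X]y
Z = resid_maker(X2) @ X1                           # M[X2]X1
b1 = np.linalg.solve(Z.T @ Z, Z.T @ (resid_maker(X2) @ y))
e_pr = resid_maker(X2) @ y - Z @ b1                # partitioned-regression residuals
```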
Consider the initial regressor matrix X and the additional regressors Z, so that there is a larger regressor matrix, W, partitioned as W = (X Z).
Theorem 21. Let e be the residual vector from the OLS regression of y on X and u the residual vector from the OLS regression of y on W, so that e = M[X]y and u = M[W]y. Then, u′u ≤ e′e.
or
u = e − P[M[X]Z]y
(3.6.9) u′u = e′e − y′P[M[X]Z]y
and, since P[M[X]Z] is positive semi-definite (being symmetric and idempotent), y′P[M[X]Z]y ≥ 0, so that
u′u ≤ e′e.
Exercise 22. How does the formula for u′u in Equation (3.6.9) simplify if the new regressors Z are orthogonal to X, that is if X′Z is a null matrix?
3.7. Goodness of fit and the analysis of variance
Assume that the unity vector, 1, is part of the regressor matrix X. Total variation in y can be expressed by the following sum of squares, referred to as the total sum of squared deviations
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
or, equivalently,
TSS = y′M[1]y
Notice that TSS is the sample variance of y, TSS/(n − 1), times the appropriate degrees-of-freedom correction n − 1. Incidentally, the degrees-of-freedom correction in the sample variance is just n − 1 and not n, since M[1]y are the residuals from the regression of y on 1 (see Exercise 10) and so there can be no more than n − 1 linearly independent vectors in the space to which M[1]y belongs, 1⊥. In fact, since rank(1) = 1, then given equation (3.4.1), dim 1⊥ = n − 1.
Write

M[1]y = M[1]ŷ + M[1]e.

But since e and X are orthogonal and X contains 1, it follows that 1'e = 0, thereby

(3.7.1) M[1]e = e

and

M[1]y = M[1]ŷ + e.

Then,

TSS = ŷ'M[1]ŷ + 2e'M[1]ŷ + e'e.
3.7. GOODNESS OF FIT AND THE ANALYSIS OF VARIANCE 43
Since e'M[1]ŷ = e'ŷ = 0,

TSS = ŷ'M[1]ŷ + e'e.

Throughout, I refer to ŷ'M[1]ŷ as ESS (explained sum of squares) and e'e as RSS (residual sum of squares).
As it happens for TSS, RSS is the sample variance of the residuals times the appropriate degrees-of-freedom correction, n − k. Again, the degrees-of-freedom correction in the sample variance is just n − k and not n, since in the residual space, X⊥, there can be no more than n − k linearly independent vectors. This follows from the assumption that X is of full column rank, thereby rank(X) = k; then, given equation (3.4.1), dim X⊥ = n − k.
(3.7.2) R² = ESS/TSS = ŷ'M[1]ŷ / (y'M[1]y)

and since ŷ'M[1]ŷ = y'M[1]y − e'e,

R² = 1 − e'e / (y'M[1]y).
Therefore, if the constant term is included in the regression it holds that 0 ≤ R² ≤ 1, and R² measures the portion of total variation in y explained by the OLS regression; in this sense R² is a measure of goodness of fit¹. There are two interesting extreme cases. If all regressors, apart from 1, are null vectors, then ŷ lies in the space spanned by 1 and M[1]ŷ = 0, so that eventually R² = 0. Only the constant term has explanatory power in this case, and the regression is a horizontal line with intercept equal to the sample mean of y. If y lies already in R(X), then ŷ = y (and also e'e = 0) and R² = 1, a perfect (but useless) fit².

¹If the constant term is not included in the regression, then (3.7.1) does not hold and R² may be negative.
²I'm maintaining throughout the obvious assumption that in any case y ∉ R(1). Why?
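As a numerical illustration (simulated data of my choosing, not part of the original text), the two equivalent expressions for R² in (3.7.2) can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])   # constant term included
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
yc = y - y.mean()                      # M[1]y

TSS = yc @ yc
# ESS/TSS: since 1 is in X, the mean of the fitted values equals the mean of y
R2_ratio = ((X @ b - y.mean()) ** 2).sum() / TSS
R2_resid = 1 - (e @ e) / TSS           # 1 - RSS/TSS

assert np.isclose(R2_ratio, R2_resid)
assert 0 <= R2_resid <= 1
```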
3.8. CENTERED AND UNCENTERED GOODNESS-OF-FIT MEASURES 44
A problem with the R² measure is that it never decreases when a regressor is added to X (this is a straightforward implication of Theorem 21), and in principle one can obtain an artificially high R² by inflating the model with regressors (the extreme case of R² = 1 is attained if n = k, since in this case y ends up lying in R(X)). This problem may be obviated by using the corrected R², R̄², defined by including in the formula of R² the appropriate degrees-of-freedom corrections:

R̄² = 1 − [RSS/(n − k)] / [TSS/(n − 1)].

In fact, R̄² does not necessarily increase when one more regressor is added.
Exercise 23. Let W = (X Z), R²_W be the R-squared from the regression of y on W and R² the R-squared from the regression of y on X. Prove that R²_W is no smaller than R². Given W, u and e defined as in Section 3.6.2, the coefficient of determination resulting from the regression of y on W is

R²_W = R² + (1 − R²) r²_yz.

Notice also that R̄² can be written in terms of R² as

R̄² = 1 − [(n − 1)/(n − k)] (1 − R²).
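A small simulation (illustrative only; data and seed are mine) makes the contrast concrete: plain R² never falls when a useless regressor is added, while R̄² adjusts by the degrees of freedom. The identity R̄² = 1 − [(n − 1)/(n − k)](1 − R²) is also checked:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.standard_normal(n)
y = 1.0 + x + rng.standard_normal(n)

def r2_pair(X, y):
    # Returns (R2, adjusted R2) for a regression with a constant in X
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    yc = y - y.mean()
    nn, kk = X.shape
    R2 = 1 - (e @ e) / (yc @ yc)
    R2adj = 1 - ((e @ e) / (nn - kk)) / ((yc @ yc) / (nn - 1))
    return R2, R2adj

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.standard_normal(n)])  # add a useless regressor

R2_s, adj_s = r2_pair(X_small, y)
R2_b, adj_b = r2_pair(X_big, y)

assert R2_b >= R2_s  # plain R2 never decreases
# the adjusted measure may decrease: compare adj_b with adj_s
assert np.isclose(adj_s, 1 - (n - 1) / (n - 2) * (1 - R2_s))  # the identity, with k = 2
```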
Consider the OLS regression of y on the sample regressor matrix X and let b denote the OLS vector. The centered and uncentered R-squared measures (see Hayashi (2000), p. 20) for this regression are defined as

(3.8.1) R² = ŷ'M[1]ŷ / (y'M[1]y) = b'X'M[1]Xb / (y'M[1]y) = y'P[X]M[1]P[X]y / (y'M[1]y)
and

(3.8.2) R²_u = ŷ'ŷ / (y'y) = b'X'Xb / (y'y) = y'P[X]y / (y'y),

so that

R²_u = 1 − e'e / (y'y)

whether or not the unity vector 1 is included in X. In fact, since y = Xb + e and X'e = 0, y'y = ŷ'ŷ + e'e. The same is not true for the centered measure. Indeed, 0 ≤ R² ≤ 1 and

(3.8.3) R² = 1 − e'e / (y'M[1]y)

if and only if a) the constant is included, or b) all of the variables (y, X) have zero sample mean, that is M[1]y = y and M[1]X = X. Clearly, in the latter case, R² = R²_u.
A convenient property of the centered R-squared, when 1 is included in X, is that it coincides with the squared simple correlation between y and ŷ, r²_{y,ŷ}, that is

(3.8.4) R² = (ŷ'M[1]y)² / [(y'M[1]y)(ŷ'M[1]ŷ)].

Exercise 25. Prove equation (3.8.4) (hint: given the definition of R² in (3.7.2), this boils down to proving that ŷ'M[1]y = ŷ'M[1]ŷ).
Solution. Since y = ŷ + e, then

ŷ'M[1]y = ŷ'M[1](ŷ + e)
= ŷ'M[1]ŷ + ŷ'M[1]e
= ŷ'M[1]ŷ + ŷ'e
= ŷ'M[1]ŷ,

where the third equality follows from M[1]e = e, since the constant is included, and the last from the orthogonality of ŷ and the OLS residuals.

This property is not shared by the uncentered R-squared, unless variables have zero sample means.
Note that, when 1 is included in X, P[X]M[1]P[X] = P[M[1]X], so that eventually

(3.8.5) R² = y'P[M[1]X]y / (y'M[1]y),

which proves at once that the R² defined in (3.8.1) can also be obtained as the uncentered R-squared from the OLS regression of M[1]y on M[1]X, namely the OLS regression of y in mean-residuals on X in mean-residuals.
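Equation (3.8.5) can be verified numerically: the centered R² from the regression of y on X (with constant) equals the uncentered R² from the demeaned regression. A sketch with simulated data of my choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.standard_normal(n)

def proj(A):
    # Orthogonal projector onto the column space of A
    return A @ np.linalg.solve(A.T @ A, A.T)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
yc = y - y.mean()                       # M[1]y
R2_centered = 1 - (e @ e) / (yc @ yc)

# demean the nonconstant columns (the constant is annihilated by M[1])
Xc = X[:, 1:] - X[:, 1:].mean(axis=0)
R2_unc_demeaned = (yc @ proj(Xc) @ yc) / (yc @ yc)

assert np.isclose(R2_centered, R2_unc_demeaned)
```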
CHAPTER 4
4.1. Introduction
This chapter is on the finite-sample statistical properties of OLS applied to the LRM. Finite-sample means that we focus on a fixed sample size n, as opposed to n → ∞, a case that will be covered in Chapter 6. We will learn under what assumptions on the LRM and in which sense the estimator is optimal. We will also learn how to test linear restrictions on the model parameters. Finally, we will study an important case of inaccuracy for the OLS, which is the omitted-variables problem.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using
the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2010).
4.2. Unbiasedness

(4.2.1) b = β + (X'X)^{-1} X'ε.

E(b|X) = β + (X'X)^{-1} X' E(ε|X)
= β.
4.3. THE GAUSS-MARKOV THEOREM 48
E(b) = E_X[E(b|X)]
= E_X[β]
= β.
Notice that unbiasedness does not follow if we replace LRM.3 with the weaker LRM.3b.

Let's work out the conditional and unconditional covariance matrices for OLS under LRM.1-LRM.4. I get started with Var(b|X). Since

Var(b|X) = E[(b − β)(b − β)'|X],
I define the following partial order in the space of the l × l symmetric matrices:

Definition 26. Given any two l × l symmetric matrices A and B, A is said to be no smaller than B if and only if A − B is non-negative definite (n.n.d.).
Next I prove that the OLS estimator has the smallest covariance matrix in the class of linear unbiased estimators, an important optimality result for OLS that is universally known as the Gauss-Markov Theorem.

Theorem 27. Assume LRM.1-LRM.4; then the OLS estimator has the smallest covariance matrix, in the sense of Definition 26, in the class of linear unbiased estimators.
Proof. Define the generic member of the class of linear unbiased estimators as

b_o = Cy,

where C is a generic k × n matrix that depends only on the sample information in X and, given unbiasedness, such that CX = I_k. b is of course a member of the class, with its own C equal to C_OLS = (X'X)^{-1}X'. It is not hard to prove that Var(b_o|X) = σ²CC'. Define now D = C − C_OLS; then DX = 0 and so

Var(b_o|X) = σ² [D + (X'X)^{-1}X'][D + (X'X)^{-1}X']'
= σ²(X'X)^{-1} + σ²DD'.

Since σ²DD' is n.n.d., according to Definition 26 the OLS conditional covariance matrix, Var(b|X), is no greater than the variance of any linear unbiased estimator. The same result extends to the unconditional variances. First, notice that, b_o being unbiased, we have for the unconditional covariance matrix of b_o a result that parallels Equation (4.3.1).
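The structure of the proof — any linear unbiased estimator is C_OLS plus a D with DX = 0, and Var(b_o|X) − Var(b|X) = σ²DD' — can be checked numerically. A sketch (the simulated X and the particular construction of D are mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 25, 3
X = rng.standard_normal((n, k))
sigma2 = 2.0

C_ols = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X'
# another linear unbiased estimator: C = C_ols + D with DX = 0
A = rng.standard_normal((k, n))
D = A @ (np.eye(n) - X @ C_ols)         # rows of D annihilate X
C = C_ols + D

assert np.allclose(C @ X, np.eye(k))    # unbiasedness: CX = I_k

V_ols = sigma2 * np.linalg.inv(X.T @ X)
V_alt = sigma2 * C @ C.T

# Var(b_o|X) - Var(b|X) = sigma^2 DD', which is n.n.d. (all eigenvalues >= 0)
assert np.allclose(V_alt - V_ols, sigma2 * D @ D.T)
assert np.all(np.linalg.eigvalsh(V_alt - V_ols) > -1e-8)
```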
The natural question arises of whether the partial order of Definition 26 is of any relevance in real-world applications. It is, since it readily translates into the total order of the real numbers, which is the domain of the variances of random scalars. Indeed, if A is no smaller than B, then r'(A − B)r ≥ 0 for any conformable r. But then, according to the Gauss-Markov Theorem, we can say that any linear combination of the components of b, r'b, has no larger variance than r'b_o. Formally, the theorem implies that r'[Var(b_o) − Var(b)]r ≥ 0. Then, Var(r'b) = r'Var(b)r and Var(r'b_o) = r'Var(b_o)r, and hence Var(r'b_o) ≥ Var(r'b).

The importance of this hinges upon the fact that in empirical applications we are interested in linear combinations of the population coefficients, as in the following example, where it is shown that the estimates of individual coefficients can always be expressed as specific linear combinations of the k components of the estimators.

Example 28. On noticing that b_i = r_i'b and b_oi = r_i'b_o, i = 1, ..., k, where r_i is the k × 1 vector with all zero elements except the i-th entry, which equals unity, and given the Gauss-Markov Theorem, we conclude that Var(b_oi) ≥ Var(b_i), i = 1, ..., k.
4.4. ESTIMATING THE COVARIANCE MATRIX OF OLS 51
In general, the OLS estimator of any linear combination r'β is given by r'b and, as the foregoing discussion demonstrates, under LRM.1-LRM.4 r'b is BLUE (you can easily verify that E(r'b) = r'β).
(4.4.1) E(s²|X) = [1/(n − k)] E(ε'M[X]ε|X).

Since ε'M[X]ε is a scalar, ε'M[X]ε = tr(ε'M[X]ε) and so, by the permutation rule of the trace of a matrix product, ε'M[X]ε = tr(ε'M[X]ε) = tr(M[X]εε'). Replacing the right hand side of the foregoing equation into equation (4.4.1) yields

E(s²|X) = [1/(n − k)] E[tr(M[X]εε')|X].

Then, exploiting the fact that both trace and expectation are linear operators,

E(s²|X) = [1/(n − k)] tr[E(M[X]εε'|X)]
= [1/(n − k)] tr[M[X] E(εε'|X)]
(4.4.2) = [σ²/(n − k)] tr(M[X]),
where the last equality follows from LRM.3 and LRM.4. Now, focus on tr(M[X]):

tr(M[X]) = tr[I_n − X(X'X)^{-1}X']
= tr(I_n) − tr[(X'X)^{-1}X'X]
= n − k.
With s² at hand we can get an unbiased estimator for Var(b). It is obtained by replacing σ² with s² in the OLS conditional variance formula to get

V̂ar(b) = s²(X'X)^{-1}.

Theorem 30. Assume LRM.1-LRM.4; then E[V̂ar(b)] = Var(b).
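Both tr(M[X]) = n − k and the unbiasedness of s² lend themselves to a quick numerical check. A Monte Carlo sketch (normal errors, arbitrary seed and sizes of my choosing):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 20, 4
X = rng.standard_normal((n, k))
MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# tr(M[X]) = n - k
assert np.isclose(np.trace(MX), n - k)

# Monte Carlo check that E(s^2) = sigma^2, holding X fixed across replications
sigma = 1.5
reps = 10000
beta = np.ones(k)
s2 = np.empty(reps)
for r in range(reps):
    eps = sigma * rng.standard_normal(n)
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2[r] = (e @ e) / (n - k)

# the simulation average should be close to sigma^2 = 2.25
assert abs(s2.mean() - sigma**2) < 0.05
```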
4.5. EXACT TESTS OF SIGNIFICANCE WITH NORMALLY DISTRIBUTED ERRORS 53
b|X ~ N(β, σ²(X'X)^{-1}).

Also, since e = M[X]ε, e|X ~ N(0, σ²M[X]). Using a result in Rao (1973), it is also possible to prove at once that b and e are jointly normal with zero covariances, conditional on X. Specifically, since

(b; e) = (β; 0) + ((X'X)^{-1}X'; M[X]) ε

(stacking b over e), we have

(b; e)|X ~ N( (β; 0), blockdiag[σ²(X'X)^{-1}, σ²M[X]] ).
Exercise 32. Verify, by direct computation of Cov(b, e|X), that Cov(b, e|X) = 0_{k×n}.¹

¹In general, the matrix of conditional covariances between two random vectors x and y, conditional on z, is E{[x − E(x|z)][y − E(y|z)]'|z}.
Solution.

Cov(b, e|X) = E[(X'X)^{-1}X' εε' M[X]|X]
= (X'X)^{-1}X' E(εε'|X) M[X]
= σ²(X'X)^{-1}X'M[X]
= 0_{k×n}.
Exercise 34. Is Var[(b; e)|X] non-singular? Why or why not?
and, given the properties of the normal distribution, b_i can be standardized to have

(4.5.1) (b_i − β_i) / √[σ²(X'X)^{-1}_ii] |X ~ N(0, 1),

i = 1, ..., k. Were σ² known, then the above statistics could be used to test hypotheses on β_i, H_o: β_i = β̄_i, by replacing the unknown β_i with β̄_i, where β̄_i is a value of interest fixed by the researcher. For example, to test H_o: β_i = 0 one would use

b_i / √[σ²(X'X)^{-1}_ii] ~ N(0, 1).
The problem is, of course, that σ² is generally unknown and so the foregoing approach cannot be used as it is. With some adjustment we can make it operational, though. Just replace σ² with s² to form

(4.5.2) t_i = (b_i − β̄_i) / √[s²(X'X)^{-1}_ii]

and then prove that t_i has a t distribution with n − k degrees of freedom when β_i = β̄_i. The denominator term in expression (4.5.2) is the standard error estimate for coefficient b_i.
First, notice that since s² = e'e/(n − k) = ε'M[X]ε/(n − k),

(4.5.3) (n − k) s²/σ² = ε'M[X]ε/σ²

and

(n − k) s²/σ² ~ χ²(n − k).

Further,

t_i = { (b_i − β̄_i) / √[σ²(X'X)^{-1}_ii] } / √{ [(n − k)s²/σ²] / (n − k) },

and recall that if z ~ N(0, 1), x ~ χ²(p) and z and x are independent, then z/√(x/p) has a t distribution with p degrees of freedom.
Finally, since the t distribution does not depend on the sample information and, specifically, on X, then t_i and any component of X are statistically independent, so that the above holds also unconditionally, that is t_i ~ t(n − k), i = 1, ..., k.

Often we wish to test hypotheses involving linear combinations of β, r'β, where r is a k × 1 vector of known constants. Under the null H0: r'β = q,

(r'b − q) / √[s² r'(X'X)^{-1}r] ~ t(n − k).
. regress c y
The OLS coefficient estimates, b, are displayed in the first column (labeled Coef.). Then, the second column reports the standard error estimates peculiar to each OLS coefficient,

se_i = √[s²(X'X)^{-1}_ii], i = 1, ..., k.

The third column reports the values of the t statistics for H0: β_i = 0,

t_i = b_i / √[s²(X'X)^{-1}_ii].
The test is two-sided in that the alternative is H1: β_i > 0 or β_i < 0. The fourth column reports the so-called p-value for the two-sided t-test. It is defined as the probability that a t-distributed random variable is more extreme than the outcome of t_i in absolute value: Pr[(t < −|t_i|) ∪ (t > |t_i|)], or more compactly Pr(|t| > |t_i|). Clearly, if the p-value is smaller than the chosen size of the test (the Type I error), then t_i falls for sure into the critical region and we reject the null at the chosen size. In other words, the p-value indicates the lowest size of the critical region (the lowest Type I error) we could have fixed to reject the null, given the test outcome. In this sense, the p-value is more informative than critical values. In the regress example, if we choose a critical region of 5% size, given that Pr(|t| > 2.42) = 0.042 < 0.05, we can reject at 5% the hypothesis that the constant term is equal to zero, knowing that we could also have rejected at, say, 4.5%, but not at 1%. A 1% size is smaller than the test p-value, which is the minimum size allowing rejection, and for this reason we can't reject at 1%. This is a clear case of borderline significance, one which we could not have identified with such precision by simply looking at the 5% critical values. On the other hand, the p-value for the coefficient on y is virtually zero (as low as 0.000). This therefore indicates that, no matter how conservative we are towards the null, we can reject it at any conventional level of significance (conventional sizes, with an increasing degree of conservativeness, are 10%, 5%, 1%) and also at a less conventional 0.1% (since 0.001 > 0.000).
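The mechanics of the t statistic and its two-sided p-value can be sketched as follows (simulated data of my choosing; the p-value is computed by Monte Carlo draws from z/√(χ²_{n−k}/(n−k)) rather than from a t-table):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 40, 2
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.0 * x + rng.standard_normal(n)   # the true slope is zero

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - k)
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(s2 * np.diag(XtX_inv))
t_slope = b[1] / se[1]                       # t statistic for H0: slope = 0

# empirical two-sided p-value: Pr(|t| > |t_slope|), t = z / sqrt(chi2_{n-k}/(n-k))
reps = 200000
z = rng.standard_normal(reps)
chi2 = rng.chisquare(n - k, reps)
t_draws = z / np.sqrt(chi2 / (n - k))
p_value = np.mean(np.abs(t_draws) > abs(t_slope))

assert 0.0 <= p_value <= 1.0
```

Since the null is true in this design, the p-value is a draw from a (roughly) uniform distribution on [0, 1], so no particular value should be expected.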
4.5.2. From tests to confidence intervals. Let us fix the α·100% critical region for our two-sided t test for the null H0: β_i = β̄_i against the alternative H1: β_i ≠ β̄_i, and let ±t_{α/2} be the corresponding critical values: Pr[(t < −t_{α/2}) ∪ (t > t_{α/2})] = α. Then, the probability of not rejecting the null when it is true is (1 − α). Formally,

Pr[|b_i − β̄_i|/se_i < t_{α/2}] = Pr[−t_{α/2} < (b_i − β̄_i)/se_i < t_{α/2}]
= Pr[−se_i t_{α/2} < b_i − β̄_i < se_i t_{α/2}]
= (1 − α).

But (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) is a (1 − α)·100% confidence interval for β_i. This proves that the (1 − α)·100% confidence interval (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) contains all of the null hypotheses β_i = β̄_i that we cannot reject at α·100%. So, while a given t test is informative only for the specific null it is testing, the confidence interval conveys to the researcher much more information. The last column of the regress output reports the 95% confidence intervals for each OLS coefficient.
Exercise 36. Your regression output for a given coefficient β_i yields b_i = 9.320 and se_i = 1.760. 1) Compute the t-statistic for the null H0: β_i = 0. 2) In your regression n − k = 334; this implies that t_{0.025} = 1.967, where t_{0.025}: Pr(t > t_{0.025}) = 0.025. Will you reject or not H0: β_i = 0 against H1: β_i ≠ 0 at a significance level of 5%? Why? 3) Given your answer to Question 2, will you expect that 0 belongs to the 95% confidence interval for β_i? 4) Compute the 95% confidence interval for β_i. On the basis of the information from the confidence interval alone, do you reject H0: β_i = 6 against H1: β_i ≠ 6 at 5%? Why? 5) Using only your answers to Question 4, can you assert that the p-value of that test is greater than 0.05? Also, do you expect the absolute value of the t statistic for H0: β_i = 6 to be greater or smaller than 1.967? Why? Verify your answer by directly computing the value of the t statistic for H0: β_i = 6. 6) Consider now the test of H0: β_i ≤ 0 against H1: β_i > 0 with a 5% significance level. Is the critical level for this test equal to, smaller or greater than 1.967?
Exercise 37. Your regression output for a given coefficient β_i yields b_i = 6.668 with se_i = 3.577. The outcome of the t-test for H0: β_i = 0 against H1: β_i ≠ 0 shows p-value = 0.07. Can you reject the null at 10%? Can you at 5%?
4.5.3. Testing joint linear restrictions. We want to test jointly J linear restrictions: H0: Rβ − q = 0, where R and q are, respectively, a J × k matrix and a J × 1 vector of fixed known constants, and such that no row of R can be obtained as a linear combination of the others, that is, R is of full row rank J.

Under the null, Rb − q = R(b − β), and it has

W = (Rb − q)'[R(X'X)^{-1}R']^{-1}(Rb − q) / σ², with W|X ~ χ²(J).
Again, σ² is not known and so W is unfeasible as a test for H0. We can go about as in the previous section and replace σ² with s². In addition, then, divide the result by J to get the statistic

F = (Rb − q)'[R(X'X)^{-1}R']^{-1}(Rb − q) / (Js²).

Now consider another distributional result:

Given two independent random scalars x1 ~ χ²(p1) and x2 ~ χ²(p2), then (x1/p1)/(x2/p2) ~ F(p1, p2).
It is not hard to see that the above result can be applied to F, since it can be reformulated as the ratio of two conditionally independent chi-squared random variables, each corrected by its own degrees of freedom. In fact, at the numerator we have

(Rb − q)'[R(X'X)^{-1}R']^{-1}(Rb − q) / (Jσ²)

and at the denominator s²/σ². Conditional on X, the former is a function of b alone, while the latter is a function of e alone. Therefore, in the light of Theorem 31, the two are conditionally independent and so we can invoke the foregoing distributional result to establish F|X ~ F(J, n − k).

As with the t statistic, since the F distribution does not depend on the sample information, we have that the above holds unconditionally: F ~ F(J, n − k).
When H0 is a set of J exclusion restrictions, then q = 0 and each row of R has all zero elements except unity in the entry corresponding to the parameter that is set to zero. For example, with three parameters β' = (β1 β2 β3) and two exclusion restrictions β1 = 0 and β3 = 0, then J = 2, q' = (0 0) and

R = (1 0 0)
    (0 0 1).
The F-test can always be rewritten as a function of the residual sum of squares under the unrestricted model, e'e, and the residual sum of squares under the model with the restrictions imposed, say e*'e*:

F = [(e*'e* − e'e)/J] / [e'e/(n − k)].

This is proved for the case of exclusion restrictions by using Lemma 16.
Partition the sample regressor matrix as X = (X1 X2) and consider the F test for the set of exclusion restrictions H0: β2 = 0:

F = b2'[(X2'M[X1]X2)^{-1}]^{-1} b2 / (k2 s²)
= b2'(X2'M[X1]X2) b2 / (k2 s²).

The numerator of the right hand side of the foregoing equation can be written more compactly as y'P[M[X1]X2]y. Hence, by Lemma 16,

F = y'(P[X] − P[X1])y / (k2 s²)
= y'(M[X1] − M[X])y / (k2 s²)
= [(e*'e* − e'e)/k2] / s².
It is not hard to prove that, if the constant term is kept in both models, then

F = [(R² − R*²)/J] / [(1 − R²)/(n − k)],

where R² is the R-squared from the unrestricted model and R*² is the R-squared from the restricted model.
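The equality between the Wald form of F and its RSS-comparison form can be verified numerically (simulated data with H0: β2 = 0 true; the helper rss and the design are mine):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k1, k2 = 50, 2, 2
k = k1 + k2
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, k2))
X = np.hstack([X1, X2])
y = X1 @ np.array([1.0, 0.5]) + rng.standard_normal(n)   # H0: beta2 = 0 holds

def rss(X, y):
    # OLS residual sum of squares and coefficient vector
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e, b

ee_u, b = rss(X, y)       # unrestricted model
ee_r, _ = rss(X1, y)      # restricted model (beta2 = 0 imposed)
s2 = ee_u / (n - k)

# Wald form for H0: R beta = 0, with R selecting the last k2 coefficients
R = np.hstack([np.zeros((k2, k1)), np.eye(k2)])
XtX_inv = np.linalg.inv(X.T @ X)
Rb = R @ b
F_wald = (Rb @ np.linalg.solve(R @ XtX_inv @ R.T, Rb)) / (k2 * s2)

# RSS-comparison form
F_rss = ((ee_r - ee_u) / k2) / s2

assert np.isclose(F_wald, F_rss)
```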
The general form of the law of iterated expectations (LIE) can be stated as in Wooldridge
(2010), pp. 19-20.
LIE(scalar|vector): Given the random variable y and the random vectors w and x,
where x = f (w), then E (y|x) = E [E (y|w) |x].
Since the above result holds for any function f (), x can just be any subvector of w, as the
following example shows.
LIE(vector|vector): Given the random vector y and the random vectors w and x, where x = f(w), then E(y|x) = E[E(y|w)|x], where

E(y|x) = (E(y1|x), ..., E(yn|x))' and E(y|w) = (E(y1|w), ..., E(yn|w))'.
Remark 39. Notice that in the formulation of conditional expectations the way the con-
ditioning set is represented is just a matter of notational convenience. What matters are the
random scalars that enter the conditioning set and not the way they are organized therein.
For example, E(y|w1, w2, w3, w4) can equivalently be expressed as E(y|w') or E(y|w), where w = (w1 w2 w3 w4)', or E(y|W), where

W = (w1 w3)
    (w2 w4).
Given Remark 39 the general LIE can be formulated with conditional expectations having
the conditioning set organized in the form of random matrices rather than random vectors, as
follows.
LIE(vector|matrix): Given the random vector y and the random matrices W and X,
where X = f (W ), then E (y|X) = E [E (y|W ) |X].
Paralleling the consideration made above, since f(·) is a generic function, from LIE(vector|matrix) follows a special LIE for the case in which X is a submatrix of W. Therefore, given W = (W1 W2),

E(y|W1) = E[E(y|W)|W1]

and

E(y|W2) = E[E(y|W)|W2].

4.7. THE OMITTED VARIABLE BIAS
If explanatory variables that are relevant in the population model are, for some reason, not included in the statistical model - they may be intrinsically latent, such as individual skills, or the specific data-set in use does not report them, or also, although observed and available, the researcher failed to account for them in the model specification - then our OLS estimator may undergo what is known in the econometric literature as an omitted variable bias. Let's see when and why.

Assume that the population model is

y = x'β + ε,

with x and β both k × 1 vectors and P.1-P.4 satisfied, and consider the RS mechanism

RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi xi1 xi2 ... xik), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors.
So far we are in the classical regression framework, but now let x' = (x1' x2'), with x1 being a k1 × 1 vector, x2 a k2 × 1 vector and k = k1 + k2, and maintain that x2 is latent or, in any case, not included in the statistical model, and let's explore the implications for the statistical model. P.1 implies that

(4.7.1) y = X1β1 + (X2β2 + ε).
The OLS estimator

b1 = (X1'X1)^{-1}X1'y

is likely to be biased. Indeed, the bias can be easily derived as follows. Replacing the right hand side of equation (4.7.1) into the OLS formula yields

b1 = β1 + (X1'X1)^{-1}X1'(X2β2 + ε)
= β1 + (X1'X1)^{-1}X1'X2β2 + (X1'X1)^{-1}X1'ε.

Hence

E(b1|X) = β1 + (X1'X1)^{-1}X1'X2β2

and, by the law of iterated expectations, we have the unconditional bias

(4.7.2) E(b1) − β1 = E[(X1'X1)^{-1}X1'X2β2].
There are two specific instances, however, in which the bias is zero.

The first instance is that analyzed in Greene (2008), when X1'X2 = 0_{k1×k2}. In this case (X1'X1)^{-1}X1'X2β2 = 0 and so the bias in equation (4.7.2) becomes zero.

The second instance occurs if in the population E(x2'β2|x1) = 0, as I now show. Since in the population E(ε|x) = 0, then by the general law of iterated expectations also E(ε|x1) = 0. Hence, E(x2'β2 + ε|x1) = 0, which along with RS yields E(X2β2 + ε|X1) = 0. Therefore, the composite vector X2β2 + ε in model (4.7.1) behaves like a conventional error term that satisfies LRM.3. The upshot is that b1 is unbiased.
The two situations are not related. Clearly, E(X2β2|X1) = 0 does not imply X1'X2 = 0_{k1×k2}. But also the converse is not true, and X1'X2 = 0_{k1×k2} may hold even if E(X2β2|X1) ≠ 0, as shown by the following example.
Be that as it may, the foregoing two instances of unbiasedness constitute a narrow case,
and in general omitted variables will bring about bias and inconsistency in the coefficient
estimates. Solutions are typically given by proxy variables, panel data estimators and instru-
mental variables estimators. The first method is briefly described below, the classical panel
data estimators are pursued in Chapter 7, while IV methods are described in Chapter 9.
To conclude, observe that if relevant variables are omitted, LRM.4 does not generally hold, unless Var(x2'β2 + ε|x1) = σ̃² < +∞, a constant.
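A short Monte Carlo (illustrative design and numbers of my choosing) shows the omitted variable bias at work: with Cov(x1, x2) ≠ 0, the short-regression slope is centered on β1 + β2·Cov(x1, x2)/Var(x1) rather than on β1:

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 200, 2000
beta1, beta2 = 1.0, 2.0
b1_short = np.empty(reps)

for r in range(reps):
    x2 = rng.standard_normal(n)
    x1 = 0.8 * x2 + rng.standard_normal(n)     # x1 correlated with the omitted x2
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)
    X1 = np.column_stack([np.ones(n), x1])     # short regression omits x2
    b = np.linalg.solve(X1.T @ X1, X1.T @ y)
    b1_short[r] = b[1]

# expected slope: beta1 + beta2 * Cov(x1, x2)/Var(x1) = 1 + 2 * 0.8/1.64 ≈ 1.976
bias = b1_short.mean() - beta1
assert abs(bias - beta2 * 0.8 / 1.64) < 0.05
```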
Lemma 41. Given any two non-singular square matrices of the same dimension, A and B, if A − B is n.n.d. then B^{-1} − A^{-1} is n.n.d.
The foregoing lemma signifies that in the space of non-singular square matrices of a given
dimension if A is no smaller than B, then A1 is no greater than B 1 . It is useful in
situations in which the difference of inverse matrices is more easily worked out than that of
the original matrices.
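Lemma 41 can be illustrated numerically for positive definite matrices (the construction A = B + vv' is just an arbitrary way of getting A − B n.n.d.):

```python
import numpy as np

rng = np.random.default_rng(10)
l = 4
Q = rng.standard_normal((l, l))
B = Q @ Q.T + np.eye(l)          # positive definite
v = rng.standard_normal((l, 1))
A = B + v @ v.T                  # A - B = vv' is n.n.d.

# B^{-1} - A^{-1} should be n.n.d.: all eigenvalues >= 0 (up to rounding)
diff = np.linalg.inv(B) - np.linalg.inv(A)
assert np.all(np.linalg.eigvalsh(diff) > -1e-10)
```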
The following exercise asks you to think through the consequences of overfitting, namely
applying OLS to a statistical model with variables that are redundant in the population model.
Exercise 42. Assume the population model

y = x'β + ε,

with x and β both k × 1 vectors and P.1-P.4 satisfied. Assume also that the l × 1 vector z of observable variables is available, such that rank[E(ww')] = k + l, where w' = (x' z'). Also, assume E(ε|x, z) = 0 and Var(ε|x, z) = σ², i.e. z is redundant in the population equation. Finally, assume there is a sample of size n from the population, such that the elements of the sequence {(yi xi' zi'), i = 1, ..., n} are i.i.d. 1 × (1 + k + l) random vectors. Applying the usual notation for the sample variables - with y (n × 1) stacking the yi, X (n × k) stacking the rows xi', Z (n × l) stacking the rows zi' and ε (n × 1) stacking the εi - the statistical model is

y = Xβ + ε
1) Verify that the OLS estimator

(4.7.3) b = (X'X)^{-1}X'y

is BLUE. 2) Prove that the overfitting strategy of regressing y on X and Z yields an unbiased estimator for β, and call it b_ofit. 3) Derive the covariance matrix of b_ofit. 4) Use Lemma 41 and verify that, indeed, the conditional covariance matrix of b_ofit is no smaller than that of b in (4.7.3). 5) A byproduct of the overfitting strategy is the l × 1 vector of OLS coefficients for the variables in Z. Let's call it c. Express c using the first equation of Theorem 8 as

c = (Z'Z)^{-1}Z'(y − X b_ofit)

and prove that the overfitting residual vector e_ofit ≡ y − X b_ofit − Zc equals M[M[Z]X] M[Z]y.
Answer: 1) Obvious, since in the population and the sampling mechanism we have all we need for the statistical properties LRM.1-LRM.4 to be true. 2) This is proved at once by noting that, from RS and E(ε|x, z) = 0, E(ε|X, Z) = 0. 3) Prove that Var(ε|X, Z) = σ²I and then prove that

Var[(X'M[Z]X)^{-1}X'M[Z]y | X, Z] = σ²(X'M[Z]X)^{-1}.

4) Write X'M[Z]X as X'M[Z]X = X'X − X'P[Z]X and then verify you have all that is needed to invoke the lemma. 5) Easy, it's just algebra: replace b_ofit and c into e_ofit ≡ y − X b_ofit − Zc and rearrange. 6) First, use the formula of the overfitting residual vector derived in the previous question, M[M[Z]X] M[Z]y, to set up the estimator
4.7.1. The proxy variables solution. Assume for simplicity that there is only one omitted variable x2 from the population equation

(4.7.4) y = x1'β1 + x2β2 + ε.

(1) The z variables are redundant in the population equation, that is E(y|x, z) = x'β.
(2) Once conditioning on z, the omitted variable x2 and the included explanatory variables, x1, are independent in conditional mean: E(x2|x1, z) = E(x2|z). Also, E(x2|z) = z'δ.
(3) rank E[(x1' z')'(x1' z')] = k1 + l. This is analogous to property P.2 in Chapter 1 and permits identification of the coefficients in the proxy variable regression, as we will see below.

Writing the projection error as r ≡ x2 − z'δ, so that

(4.7.5) x2 = z'δ + r,

and replacing into (4.7.4) yields

(4.7.6) y = x1'β1 + z'(δβ2) + (rβ2 + ε).

It follows that E(rβ2 + ε|x1, z) = 0 and so, along with P.1 and P.2 (given Assumption 3), also P.3 is satisfied for equation (4.7.6). With the following RS mechanism

RS(x1 z): There is a sample of size n from the population, such that the elements of the sequence {(yi xi1 ... xik1 zi1 ... zil), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors,

the resulting statistical model will satisfy LRM.1-LRM.3 and so yield unbiased OLS estimates.
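A Monte Carlo sketch of the proxy-variable logic (all design choices are mine): the short regression of y on x1 is biased, while adding the proxy z restores unbiasedness for β1, because conditional on z the omitted x2 is mean-independent of x1:

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 300, 1000
beta1, beta2, delta = 1.0, 2.0, 0.7
b_short = np.empty(reps)
b_proxy = np.empty(reps)

for r in range(reps):
    z = rng.standard_normal(n)
    x2 = delta * z + rng.standard_normal(n)    # x2 = z*delta + r, with E(x2|z) linear in z
    x1 = 0.5 * z + rng.standard_normal(n)      # x1 related to x2 only through z
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

    X_s = np.column_stack([np.ones(n), x1])            # omits x2, no proxy
    b_short[r] = np.linalg.solve(X_s.T @ X_s, X_s.T @ y)[1]

    X_p = np.column_stack([np.ones(n), x1, z])         # z used as a proxy for x2
    b_proxy[r] = np.linalg.solve(X_p.T @ X_p, X_p.T @ y)[1]

assert abs(b_proxy.mean() - beta1) < 0.03   # proxy regression: centered on beta1
assert abs(b_short.mean() - beta1) > 0.2    # short regression: clearly biased
```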
Suppose that attention is centered on a given explanatory variable whose observations are collected into the (n × 1) column vector x_i, and that there are k − 1 control variables collected into the n × (k − 1) matrix X_i. Without loss of generality, partition the (n × k) regressor matrix as X = (X_i x_i) and, correspondingly, the (k × 1) OLS vector as

b = (b_{-i}', b_i)',

so that

Var(b_i|X) = σ²(X'X)^{-1}_ii,

where (X'X)^{-1}_ii indicates the last entry on the main diagonal of (X'X)^{-1}.
Thus, given

y = X_i β_{-i} + x_i β_i + ε,

it has

b_i = (x_i'M[X_i]x_i)^{-1} x_i'M[X_i] (X_i β_{-i} + x_i β_i + ε)
4.8. THE VARIANCE OF AN OLS INDIVIDUAL COEFFICIENT 71
and consequently

b_i = β_i + (x_i'M[X_i]x_i)^{-1} x_i'M[X_i]ε.

Finally,

Var(b_i|X) = E[(x_i'M[X_i]x_i)^{-1} x_i'M[X_i] εε' M[X_i]x_i (x_i'M[X_i]x_i)^{-1} | X]
= (x_i'M[X_i]x_i)^{-1} x_i'M[X_i] E(εε'|X) M[X_i]x_i (x_i'M[X_i]x_i)^{-1}
= σ²(x_i'M[X_i]x_i)^{-1}
(4.8.1) = σ² / (x_i'M[X_i]x_i),

so that

(4.8.2) (X'X)^{-1}_ii = 1 / (x_i'M[X_i]x_i).
Remark 43. Equation (4.8.2) is a general algebraic result providing the formula for the generic i-th main diagonal element of the inverse of any non-singular cross-product matrix X'X. I have proved it in quite a peculiar way, using a well-known and easy-to-remember econometric result! Above all, I could get away without referring to the hard-to-remember result on the inverse of the (2 × 2) partitioned matrix, which is instead the route followed by Greene (Theorem 3.4 in Greene (2008), p. 30).
4.8.1. The three determinants of Var(b_i|X) when 1 is a regressor. Now I get back to Var(b_i|X) in equation (4.8.1),

Var(b_i|X) = σ²(x_i'M[X_i]x_i)^{-1},

and assume that X_i contains the n × 1 unity vector 1, or X_i = (X̃_i 1). Notice, now, that M[X_i]x_i is the residual vector from the OLS regression of x_i on X_i, and so x_i'M[X_i]x_i is the residual sum of squares for this regression. Since the unity vector is a column of X_i, the centered R-squared of this regression is

R_i² = 1 − x_i'M[X_i]x_i / (x_i'M[1]x_i),

and eventually²

Var(b_i|X) = σ² / [(1 − R_i²) x_i'M[1]x_i].
Also, it has

x_i'M[1]x_i = Σ_{j=1}^{n} (x_{ji} − x̄_i)²,

that is, x_i'M[1]x_i is the total variation in x_i around its sample mean, x̄_i. Therefore,

(4.8.3) Var(b_i|X) = σ² / [(1 − R_i²) Σ_{j=1}^{n} (x_{ji} − x̄_i)²].
²An alternative proof is the following. Given Lemma 16, M[X_i] = I − P[1] − P[M[1]X̃_i] and so

Var(b_i|X) = σ² [x_i'(M[1] − P[M[1]X̃_i])x_i]^{-1}
= σ² [x_i'M[1]x_i − x_i'P[M[1]X̃_i]x_i]^{-1}
= σ² {x_i'M[1]x_i [1 − x_i'P[M[1]X̃_i]x_i / (x_i'M[1]x_i)]}^{-1}
= σ² [x_i'M[1]x_i (1 − R_i²)]^{-1},

where

R_i² ≡ x_i'P[M[1]X̃_i]x_i / (x_i'M[1]x_i).

Given (3.8.5), R_i² is the centered R-squared for the regression of x_i on X_i (or, equivalently, the uncentered R-squared from the regression of M[1]x_i on M[1]X̃_i).
Equation (4.8.3) shows that Var(b_i|X) increases when:
- other things constant, R_i² increases, in words the correlation between x_i and the other regressors increases (this is the multicollinearity effect on the variance of the OLS individual coefficient);
- other things constant, the total variation in x_i, Σ_{j=1}^{n} (x_{ji} − x̄_i)², decreases;
- other things constant, the regression variance σ² increases.

Multicollinearity is perfect when x_i belongs to R(X̃_i). In this case R_i² = 1 (see Section 3.7) and the variance of b_i diverges to infinity. Coefficient β_i cannot be estimated given the available data (X is not of f.c.r. in this case).
Remark 44. Multicollinearity, when it does not degenerate into perfect multicollinearity,
i.e. det (X 0 X) = 0, does not affect the finite sample properties of OLS. Nonetheless, it may
severely reduce the precision of our estimates, in terms of larger standard errors and confidence
intervals.
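The decomposition (4.8.3) - and hence the multicollinearity effect - can be checked numerically: the direct formula σ²(X'X)^{-1}_ii and the decomposition in terms of R_i² and the total variation of x_i coincide. A sketch (simulated data; the focal variable is placed last):

```python
import numpy as np

rng = np.random.default_rng(12)
n = 80
sigma2 = 1.0

z = rng.standard_normal(n)
x = 0.6 * z + rng.standard_normal(n)            # x correlated with the control z
X = np.column_stack([np.ones(n), z, x])         # focal variable in the last column
XtX_inv = np.linalg.inv(X.T @ X)
var_direct = sigma2 * XtX_inv[-1, -1]           # sigma^2 (X'X)^{-1}_ii

# decomposition (4.8.3): sigma^2 / [(1 - R_i^2) * total variation of x]
Xi = X[:, :2]                                   # the other regressors (constant and z)
Mi = np.eye(n) - Xi @ np.linalg.solve(Xi.T @ Xi, Xi.T)
xc = x - x.mean()
Ri2 = 1 - (x @ Mi @ x) / (xc @ xc)              # R^2 of x on the other regressors
var_decomp = sigma2 / ((1 - Ri2) * (xc @ xc))

assert np.isclose(var_direct, var_decomp)
```

Raising the coefficient on z in the construction of x drives R_i² toward one and inflates both expressions in lockstep, which is exactly the multicollinearity effect described above.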
Exercise 45. Partition X as X = (X̃ 1) and, accordingly, the OLS (k × 1) vector as b = (b̃' b0)', where b̃ is of dimension (k − 1) × 1 and b0 is the OLS estimator of the constant term. Prove that

b0 = ȳ − x̄'b̃,

where ȳ is the sample mean of y and x̄ is the (k − 1) × 1 vector of sample means for the X̃ regressors (hint: just use the first equation of Theorem 8 to show that b1 = (X1'X1)^{-1}X1'(y − X2b2)).

Exercise 46. Use 1) Var(ȳ|X) = E(ε̄²|X) = σ²/n,
2) Var(x̄'b̃|X) = x̄'Var(b̃|X)x̄, and

3) cov(ȳ, x̄'b̃|X) = E{[ȳ − E(ȳ|X)] x̄'[b̃ − E(b̃|X)] | X}
= x̄' (X̃'M[1]X̃)^{-1} X̃'M[1] E(εε'|X) 1 (1/n)
= (σ²/n) x̄'(X̃'M[1]X̃)^{-1} X̃'M[1] 1 = 0,

since M[1]1 = 0, to prove that

Var(b0|X) = x̄'Var(b̃|X)x̄ + σ²/n.
4.9. A general result

Zyskind (1967) provides a general necessary and sufficient condition for the OLS estimator to be BLUE.

Theorem 47. Given the regressor matrix, $X$, and the conditional covariance matrix $\Omega$, $Var(\varepsilon|X) = \Omega$, the OLS estimator, $b = (X'X)^{-1}X'y$, is BLUE if and only if $P_{[X]}\Omega = \Omega P_{[X]}$.
As a final remark, recall that the Zyskind condition ensures only that the OLS coefficients are BLUE; it says nothing about the properties of the OLS standard error estimates, and indeed we have seen in the previous sections that these may be biased even if $b$ is BLUE. The following exercise on partitioning provides another instance of such an occurrence.
Exercise 48. Consider the partitioned regression model
$$(4.9.1)\qquad y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$
maintaining LRM.1-LRM.4. 1) Verify that premultiplying both sides of the foregoing equation by $M_{[X_2]}$ boils down to the reduced regression model
$$(4.9.2)\qquad \tilde y = \tilde X_1\beta_1 + \tilde\varepsilon,$$
where $\tilde y = M_{[X_2]}y$, $\tilde X_1 = M_{[X_2]}X_1$ and $\tilde\varepsilon = M_{[X_2]}\varepsilon$.
2) How can you interpret the variables in model (4.9.2)? 3) As far as $\beta_1$ is concerned, does OLS applied to model (4.9.2) yield the same estimator as OLS applied to model (4.9.1)? Why or why not? 4) Does the reduced model (4.9.2) satisfy LRM.1-LRM.4? Which ones, if any, are not satisfied? 5) The degrees of freedom of the reduced regression are $n - k_1$. Do you think that the resulting OLS estimate for $\sigma^2$ would be unbiased? 6) Verify that the reduced model (4.9.2) satisfies the Zyskind condition.
5) It is biased, since we know that the unbiased OLS estimator uses $n - k$ degrees of freedom to correct the OLS residual sum of squares (which is nonetheless the same for both models (4.9.1) and (4.9.2), as seen in Subsection 3.6.1). 6) You have just to verify that
$$P_{[M_{[X_2]}X_1]}M_{[X_2]} = M_{[X_2]}P_{[M_{[X_2]}X_1]},$$
which is readily done by noting that $M_{[X_2]}$, symmetric and idempotent, is both the first and the last factor in $P_{[M_{[X_2]}X_1]}$.
The within regression examined in Chapter 7 (equation (7.2.7)) is a special case of model
(4.9.2) in exercise 48.
4.10. Estimation of treatment effects

4.10.1. Introduction. Estimation of average treatment effects (ATE) and of ATE on the treated (ATET) is at the heart of the econometric evaluation of socio-economic policies.
As an example, we consider the experimental data jtrain2.dta from Wooldridge (2010) to
estimate the causal effect of job training on earnings
. use jtrain2, clear
. tabulate train
(output omitted: tabulation of train, "=1 if assigned to job training", with Freq., Percent and Cum. columns)
. describe train age educ black hisp married re78
(output omitted)
As a first attempt at evaluation of the training program we may apply a difference-of-means
estimator evaluating the difference in earning means between treated and untreated
(regression output omitted: re78 Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval])
We observe a significantly positive increase of $1,800 in average earnings for workers who got job training. The question arises whether the difference-of-means estimator provides an accurate estimate of the causal effect of training on labour earnings. Rubin's potential-outcome framework is the best way to answer this kind of question, as explained next.
4.10.2. Potential outcomes. Let us consider a binary treatment variable $w \in \{0, 1\}$, training in our example, and assume that $Var(w) \neq 0$ in the population. Given the structural equation of interest $y = g(w, x, \varepsilon)$, define $y^1 = g(1, x, \varepsilon)$ and $y^0 = g(0, x, \varepsilon)$, the pair of potential outcomes for an individual in the population with observed and latent characteristics $x$ and $\varepsilon$.
For each individual we observe either $y^0$ or $y^1$. Formally, the observed dependent variable $y = g(w, x, \varepsilon)$ can be expressed in terms of potential outcomes as
$$y = \begin{cases} y^0 & \text{if } w = 0 \\ y^1 & \text{if } w = 1. \end{cases}$$
So, in the training example, $y^1$ is the post-training earning, had the individual been assigned to training, and $y^0$ is what the same individual would earn without training.
Our objects of interest are the Average Treatment Effect,
$$ATE = E\left(y^1 - y^0\right),$$
and the Average Treatment Effect on the Treated,
$$ATET = E\left(y^1 - y^0\,|w = 1\right).$$
Under certain conditions, the two effects can be estimated as the constant parameters of a simple linear regression model. In fact,
$$(4.10.1)\qquad y = y^0 + w\left(y^1 - y^0\right).$$
Assume mean independence of the potential outcomes from the treatment, IM: $E(y^j|w) = E(y^j)$, $j = 0, 1$. Then $ATE$ is identified, that is, it can be recovered from moments of the distribution of the observables $(y, w)$:
$$ATE = E(y^1) - E(y^0) = E(y^1|w = 1) - E(y^0|w = 0) = E(y|w = 1) - E(y|w = 0).$$
Hence, using data from random sampling, ATE can be estimated consistently by the difference-of-means estimator: the average of $y$ over the treated individuals minus the average of $y$ over the untreated individuals.
Under IM, $ATE = ATET$, since
$$ATET = E(y^1|w = 1) - E(y^0|w = 1) = E(y^1|w = 1) - E(y^0|w = 0) = E(y|w = 1) - E(y|w = 0).$$
From the above it is also clear that a weaker version of IM, say IM$'$: $E(y^0|w) = E(y^0)$, suffices for the identification of $ATET$.
Given
$$(4.10.2)\qquad y = y^0 + w\left(y^1 - y^0\right),$$
assume conditional mean independence, ICM: $E(y^j|w, x) = E(y^j|x)$, $j = 0, 1$. Then the conditional ATE is identified:
$$ATE(x) = E(y^1|x) - E(y^0|x) = E(y^1|w = 1, x) - E(y^0|w = 0, x) = E(y|w = 1, x) - E(y|w = 0, x).$$
Also, $ATE(x) = ATET(x)$. It is also clear that, for identification of $ATET(x)$ only, the weaker ICM$'$ assumption, $E(y^0|w, x) = E(y^0|x)$, suffices:
$$ATET(x) = E(y^1|w = 1, x) - E(y^0|w = 1, x) = E(y^1|w = 1, x) - E(y^0|w = 0, x) = E(y|w = 1, x) - E(y|w = 0, x).$$
Averaging over the distribution of $x$ recovers the unconditional effect:
$$ATE = E_x\big[E(y^1|x) - E(y^0|x)\big] = E_x\big[E(y^1|w = 1, x) - E(y^0|w = 0, x)\big].$$
Overlap: Let $\mathcal X$ denote the support of $x$; then $0 < Pr(w = 1|x) < 1$ over $\mathcal X$.
The overlap assumption assures that in the population, for any given pattern of covariates, both the treatment and the control groups are sizable.
For identification of $ATET$, we restrict to the treated subsample $w = 1$ and so we average over $x|w = 1$:
$$ATET = E_{x|w=1}\big[E(y^1|w = 1, x) - E(y^0|w = 1, x)\,\big|w = 1\big] = E(y^1|w = 1) - E_{x|w=1}\big[E(y^0|w = 0, x)\,\big|w = 1\big].$$
From the above it is clear that in the case of $ATET$ we need to estimate only $m_0(x) \equiv E(y|w = 0, x)$, and so a weaker overlap assumption may be applied, one that requires that just the control group be sizable for all covariate patterns: $Pr(w = 1|x) < 1$ over $\mathcal X$.
In general, for the ICM assumption to hold it is necessary that there is no bad control in the $x$ vector. A covariate $x$ is said to be a bad control if, like $y$, it is affected by the treatment variable $w$, so that $x = x^0 + w(x^1 - x^0)$ (with $x^0$ and $x^1$ differently distributed, of course). Assume random sampling:
RS: The elements of the sequence $\{(y_i\; w_i\; x_{i1}\; x_{i2}\; ...\; x_{ik}),\; i = 1, ..., n\}$ are i.i.d. random vectors.
4.10.3. Pooled regression. Under the assumption of an additive and constant causal effect of $w$, potential outcomes differ only by a constant element and estimation can be implemented through a simple multiple regression framework using the treatment variable as a covariate.
Let $E(y^0|w, x) = x'\beta$, so write $y^0 = x'\beta + \varepsilon$, where $\varepsilon = y^0 - E(y^0|w, x)$; then assume a constant treatment effect $y^1 - y^0 = \tau$. On plugging the right-hand sides of the foregoing two equations into $y = y^0 + w(y^1 - y^0)$ we get
$$(4.10.3)\qquad y = x'\beta + \tau w + \varepsilon,$$
with $E(\varepsilon|w, x) = 0$ satisfied. Then, maintaining $\det E\left[\binom{w}{x}\,(w \;\; x')\right] \neq 0$ assures that Equation (4.10.3) can be estimated through a LRM. The OLS estimate of $\tau$ is the ATE estimate (and also the ATET estimate). Wooldridge refers to this estimator as pooled regression adjustment.
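As a cross-check of the regression-adjustment logic, here is a minimal simulation sketch in Python/numpy (the data-generating process and all names are mine, not the jtrain2 data; the effect size is chosen only to echo the $1,800 estimate): with randomized treatment and a constant effect, the OLS coefficient on $w$ in the regression of $y$ on $(1, x, w)$ recovers $\tau$.

```python
import numpy as np

# Sketch of pooled regression adjustment (Eq. 4.10.3) on simulated data.
rng = np.random.default_rng(1)
n, tau = 5000, 1800.0
x = rng.normal(size=n)
w = (rng.uniform(size=n) < 0.5).astype(float)    # randomized treatment
y = 2.0 * x + tau * w + rng.normal(scale=500.0, size=n)

Z = np.column_stack([np.ones(n), x, w])          # (1, x, w)
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
tau_hat = coef[2]                                # OLS coefficient on w
print(tau_hat)
```

With $n = 5000$ the sampling error of `tau_hat` is of the order of a few tens of dollars, so the estimate lands close to 1800.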
. regress re78 i.train age educ black hisp married re74 re75, vce(robust)
(regression output omitted: re78 Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval])
. sum durat ldurat afchnge highearn ky mi male married indust injtype, sep(0)
(output omitted)
The research question is the following: Does the more generous cap increase the log of
out-of-work duration (ldurat) of injured high earners (highearn)?
We can start by restricting the analysis to high earners in Kentucky and work out an ATE based on averages of high earners before and after the policy change. As seen at the beginning of our discussion on ATE, this estimator would be consistent under the IM assumption $E(y^0_{it}|i = h, w_t) = E(y^0_{it}|i = h)$ and $E(y^1_{it}|i = h, w_t) = E(y^1_{it}|i = h)$, where the $y^j_{it}$, $j = 0, 1$, are the potential outcomes of $y_{it}$ (ldurat in the data), $i = h, l$ indicates whether the worker is a high ($h$) or a low earner ($l$), $t = a, b$ indicates whether the worker is observed after ($a$) or before ($b$) the policy change, and $w_t = 1(t = a)$ is our treatment variable in this case, the time dummy indicating the period after the policy change (afchnge in the data).
The difference-of-means estimator is readily obtained from the regression of ldurat on 1 and afchnge, restricted to high earners in Kentucky.
(regression output omitted: ldurat Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval])
The ATE estimate of 0.198 is bound to be inconsistent, since it is very likely that the $y^j_{it}$ are affected by a confounding time trend.
The DD estimator accommodates the trend by extending the sample to low earners and assuming that the trend is constant across high and low earners. So, letting $w_i = 1(i = h)$ be the dummy indicating high earners, assume
$$E(y^0_{it}|i, t, w_{it}) = \gamma_i + \lambda_t,$$
where $w_{it} = w_i w_t$ is our new treatment variable indicating high earners in the after-policy period. Assume also a constant ATE, $\delta$:
$$E(y^1_{it}|i, t, w_{it}) = E(y^0_{it}|i, t, w_{it}) + \delta.$$
Then,
$$E(y_{it}|i = l, t = b, w_{it}) = \gamma_l + \lambda_b$$
$$E(y_{it}|i = l, t = a, w_{it}) = \gamma_l + \lambda_a$$
$$E(y_{it}|i = h, t = b, w_{it}) = \gamma_h + \lambda_b$$
$$E(y_{it}|i = h, t = a, w_{it}) = \gamma_h + \lambda_a + \delta.$$
So
$$E(y_{it}|i = h, t = a, w_{it}) - E(y_{it}|i = h, t = b, w_{it}) = \delta + \lambda_a - \lambda_b$$
and since
$$E(y_{it}|i = l, t = a, w_{it}) - E(y_{it}|i = l, t = b, w_{it}) = \lambda_a - \lambda_b,$$
then $\delta$ is identified:
$$(4.10.4)\qquad \delta = \big[E(y_{it}|i = h, t = a) - E(y_{it}|i = h, t = b)\big] - \big[E(y_{it}|i = l, t = a) - E(y_{it}|i = l, t = b)\big].$$
The DD estimator simply replaces the theoretical means in Equation (4.10.4) with the sample analogs.
. qui sum ldurat if highearn&ky&afchnge
. sca y_hKa=r(mean)
. qui sum ldurat if highearn&ky&!afchnge
. sca y_hKb=r(mean)
. qui sum ldurat if !highearn&ky&afchnge
. sca y_lKa=r(mean)
. qui sum ldurat if !highearn&ky&!afchnge
. sca y_lKb=r(mean)
. sca DD=y_hKa-y_hKb-(y_lKa-y_lKb)
. sca list DD
DD = .1906012
It is easy to prove that the DD estimator is implemented at once by running the OLS regression of $y_{it}$ on 1, $w_t$, $w_i$ and $w_{it}$. The model is fully saturated, since it has enough parameters to fit the four conditional means
$$E(y_{it}|i = l, t = b, w_{it}) = \gamma_l + \lambda_b$$
$$E(y_{it}|i = l, t = a, w_{it}) = \gamma_l + \lambda_a$$
$$E(y_{it}|i = h, t = b, w_{it}) = \gamma_h + \lambda_b$$
$$E(y_{it}|i = h, t = a, w_{it}) = \gamma_h + \lambda_a + \delta$$
without constraints. The constant term estimates $E(y_{it}|i = l, t = b, w_{it})$, the coefficient on $w_t$ estimates $\lambda_a - \lambda_b$, the coefficient on $w_i$ estimates $\gamma_h - \gamma_l$ and the coefficient on $w_{it}$ estimates our ATE, $\delta$:
. regress ldurat i.highearn i.afchnge i.afchnge#i.highearn if ky, vce(robust)
(output omitted except the interaction row)
afchnge#highearn 1 1: Coef. .1906012, Robust Std. Err. .068982, t = 2.76, P>|t| = 0.006, [95% Conf. Interval] .0553699 .3258325
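The equivalence between the four-means formula and the saturated regression can be sketched in Python/numpy as follows (simulated data; all names and the effect size 0.19 are mine, chosen to mimic the example):

```python
import numpy as np

# DD estimator two ways: (i) the four-means formula (4.10.4);
# (ii) OLS of y on (1, w_t, w_i, w_i*w_t) -- the interaction coefficient.
rng = np.random.default_rng(2)
n = 4000
gi = (rng.uniform(size=n) < 0.5).astype(float)   # w_i: high-earner dummy
gt = (rng.uniform(size=n) < 0.5).astype(float)   # w_t: after-policy dummy
delta = 0.19
y = 1.0 + 0.3 * gi + 0.1 * gt + delta * gi * gt + rng.normal(size=n)

m = lambda a, b: y[(gi == a) & (gt == b)].mean() # cell mean
dd_means = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

Z = np.column_stack([np.ones(n), gt, gi, gi * gt])
dd_ols = np.linalg.lstsq(Z, y, rcond=None)[0][3] # coefficient on w_i*w_t
print(dd_means, dd_ols)
```

Because the regression is fully saturated, the two numbers coincide exactly (up to floating point), not just asymptotically.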
The DD estimator can also be implemented by the user-written Stata command diff (Villa, 2016):
Baseline: Control 1.126; Treated 1.382; Diff (T-C) 0.256, Std. Err. 0.047, t = 5.41, p = 0.000***
Follow-up: Control 1.133; Treated 1.580; Diff (T-C) 0.447, Std. Err. 0.049, t = 9.05, p = 0.000***
R-square: 0.02
* Means and Standard Errors are estimated by linear regression
** Inference: *** p<0.01; ** p<0.05; * p<0.1
CHAPTER 5

Oaxaca's model: OLS, optimal weighted least squares and group-wise heteroskedasticity

5.1. Introduction

Oaxaca's model is a good way to check your comprehension of things so far. The treatment is more complete than Greene (2008)'s. Importantly, it serves as a motivation for the Zyskind condition, introduced in Section 4.9. It may also serve as an introduction to a number of topics that will be covered later on: in particular, dummy variables; heteroskedasticity; generalized least squares estimation.
5.2. Embedding Oaxaca's model into a pooled regression framework

Consider the two separate regressions, one for male and one for female workers,
$$y_m = X_m\beta_m + \varepsilon_m$$
$$y_f = X_f\beta_f + \varepsilon_f,$$
where the two error variances, $\sigma_m^2$ and $\sigma_f^2$, are not necessarily equal (group-wise heteroskedasticity). Hence, the resulting OLS estimators from the two separate regressions are $b_m = (X_m'X_m)^{-1}X_m'y_m$ and $b_f = (X_f'X_f)^{-1}X_f'y_f$.
Let $1$ denote the $(n\times 1)$ vector of all unity elements and construct the $(n\times 1)$ vector $d$, such that its first $n_m$ entries are all unity elements and the last $n_f$ all zero elements.
Variables like d are usually referred to as dummy variables or indicator variables, since
they indicate whether any observation in the sample belongs or not to a given group. In
this particular case, d is the male dummy variable indicating whether any observation in the
sample is specific to the male group. Since the two groups are mutually exclusive, the female
dummy variable can be constructed as the complementary vector 1 d. By construction, d
and 1 d are orthogonal, that is d0 (1 d) = 0.
Let $x_i'$ be the $(1\times k)$ row vector indicating the $i$-th row of the regressor matrix, and let $y_i$, $\varepsilon_i$ and $d_i$ be scalars indicating the $i$-th components of $y$, $\varepsilon$ and $d$, respectively.
With this in hand, the model for the generic worker $i = 1, ..., n$ is
$$(5.2.1)\qquad y_i = d_i x_i'\beta_m + (1 - d_i)x_i'\beta_f + \varepsilon_i.$$
Letting
$$X = \begin{pmatrix} X_m & 0_{(n_m\times k)} \\ 0_{(n_f\times k)} & X_f \end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_m \\ \beta_f\end{pmatrix},$$
where $0_{(s\times t)}$ indicates a $(s\times t)$ matrix of all zero elements, model (5.2.1) can be reformulated in matrix form as
$$(5.2.2)\qquad y = X\beta + \varepsilon.$$
Exercise 49. Prove that X has f.c.r. if and only if both Xm and Xf have f.c.r.
Summing up, we have two equivalent representations of the same model: 1) that in Greene (2008), with the two separate regressions; 2) that presented here, with a single regression model, represented by (5.2.2). The two frameworks are equivalent as far as efficient estimation of the population coefficients is concerned. Indeed, as I prove next, the OLS estimator, $b$, from model (5.2.2) is numerically identical to the OLS estimators from the two separate regressions as presented in Greene (2008), i.e. $b = (b_m'\;\; b_f')'$. Let
$$b = (X'X)^{-1}X'y.$$
By construction,
$$X'y = \begin{pmatrix} X_m'y_m \\ X_f'y_f \end{pmatrix}$$
and
$$X'X = \begin{pmatrix} X_m'X_m & 0_{(k\times k)} \\ 0_{(k\times k)} & X_f'X_f \end{pmatrix}.$$
Then, by a well-known property of the inverse of a block-diagonal matrix (see (A-73) in Greene (2008)),
$$(X'X)^{-1} = \begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix}.$$
Hence,
$$b = \begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix}\begin{pmatrix} X_m'y_m \\ X_f'y_f \end{pmatrix} = \begin{pmatrix} (X_m'X_m)^{-1}X_m'y_m \\ (X_f'X_f)^{-1}X_f'y_f \end{pmatrix} = \begin{pmatrix} b_m \\ b_f \end{pmatrix}.$$
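This numerical identity is easy to confirm with a small numpy sketch (random data; all names are mine):

```python
import numpy as np

# Check that OLS on the pooled block-diagonal model (5.2.2) returns
# the stacked group-wise OLS estimators b = (b_m', b_f')'.
rng = np.random.default_rng(3)
nm, nf, k = 60, 40, 3
Xm, Xf = rng.normal(size=(nm, k)), rng.normal(size=(nf, k))
ym, yf = rng.normal(size=nm), rng.normal(size=nf)

bm = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)       # separate male regression
bf = np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)       # separate female regression

X = np.block([[Xm, np.zeros((nm, k))],           # pooled regressor matrix
              [np.zeros((nf, k)), Xf]])
y = np.concatenate([ym, yf])
b = np.linalg.solve(X.T @ X, X.T @ y)            # pooled OLS
print(np.max(np.abs(b - np.concatenate([bm, bf]))))
```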
It must be pointed out that model (5.2.2) does not satisfy assumption LRM.4. The disturbances $\varepsilon$, although independently distributed, suffer from what is usually referred to as group-wise heteroskedasticity, as the model does not maintain $\sigma_m^2 = \sigma_f^2$. Indeed, the covariance matrix is
$$Var(\varepsilon|X) = \Omega = \begin{pmatrix} \sigma_m^2 I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 I_{n_f} \end{pmatrix}.$$
In this sense, model (5.2.2) is not a classical regression model. Does this mean that $b$ is not BLUE? No, and for an important reason. Assumptions LRM.1-LRM.4 are sufficient for the OLS estimator to be BLUE, as proved in Section 4.3, but not necessary. In specific circumstances, even if LRM.4 is not met, the OLS estimator is still BLUE, and Oaxaca's model is one such case. This is verified in the next two sections.
5.3. The OLS estimator in Oaxaca's model is BLUE

Model (5.2.2) can be transformed into a classical regression model by using a standard procedure in econometrics and statistics: weighting. Let
$$H = \begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^{-1} I_{n_f} \end{pmatrix}.$$
As stated by the exercise below, the matrix $H$, when premultiplied to any conformable vector, transforms the vector so that its first $n_m$ elements get divided by $\sigma_m$ and the remaining by $\sigma_f$. This is what we refer to as weighting.
Exercise 51. Verify by direct inspection that, given any $(n_m\times 1)$ vector $x_m$, any $(n_f\times 1)$ vector $x_f$ and
$$x = \begin{pmatrix} x_m \\ x_f \end{pmatrix},$$
then
$$Hx = \begin{pmatrix} \sigma_m^{-1}x_m \\ \sigma_f^{-1}x_f \end{pmatrix}.$$
Premultiplying model (5.2.2) by $H$ gives
$$Hy = HX\beta + H\varepsilon,$$
or
$$(5.3.1)\qquad \tilde y = \tilde X\beta + \tilde\varepsilon,$$
where the tilde indicates weighted variables. Two important facts are worth observing at this point. First, the population parameter vector, $\beta$, in the weighted model is the same as in model (5.2.2). Second, the weighted errors satisfy LRM.4 with covariance matrix equal to $I_n$ (so, if LRM.5 holds, they are independent standard normal variables), since
$$Var(\tilde\varepsilon|\tilde X) = H\Omega H' = H\Omega H = \begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0 \\ 0 & \sigma_f^{-1} I_{n_f} \end{pmatrix}\begin{pmatrix} \sigma_m^{2} I_{n_m} & 0 \\ 0 & \sigma_f^{2} I_{n_f} \end{pmatrix}\begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0 \\ 0 & \sigma_f^{-1} I_{n_f} \end{pmatrix}$$
$$= \begin{pmatrix} \sigma_m I_{n_m} & 0 \\ 0 & \sigma_f I_{n_f} \end{pmatrix}\begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0 \\ 0 & \sigma_f^{-1} I_{n_f} \end{pmatrix} = I_n.$$
Therefore, the weighted model is a classical regression model that identifies the parameters of interest, and hence, by the Gauss-Markov Theorem, the OLS estimator applied to the weighted model (5.3.1), referred to as the weighted least squares (WLS) estimator, $b_w$, is BLUE for $\beta$. Let us work out its formula, using exercise 51:
$$b_w = \begin{pmatrix} \sigma_m^{-2}X_m'X_m & 0_{(k\times k)} \\ 0_{(k\times k)} & \sigma_f^{-2}X_f'X_f \end{pmatrix}^{-1}\begin{pmatrix} \sigma_m^{-2}X_m'y_m \\ \sigma_f^{-2}X_f'y_f \end{pmatrix} = \begin{pmatrix} \sigma_m^{2}(X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & \sigma_f^{2}(X_f'X_f)^{-1} \end{pmatrix}\begin{pmatrix} \sigma_m^{-2}X_m'y_m \\ \sigma_f^{-2}X_f'y_f \end{pmatrix}$$
$$= \begin{pmatrix} (X_m'X_m)^{-1}X_m'y_m \\ (X_f'X_f)^{-1}X_f'y_f \end{pmatrix},$$
which proves that $b = b_w$, namely that in Oaxaca's model the OLS estimator coincides with the optimal WLS estimator.
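The invariance of the estimator to the group weights can also be verified numerically; the following numpy sketch (random data, names mine) weights the pooled model by $H$ and checks that $b_w = b$:

```python
import numpy as np

# Check that weighting by H = diag(1/sigma_m, ..., 1/sigma_f) leaves
# the pooled OLS estimator unchanged in Oaxaca's model (b_w = b).
rng = np.random.default_rng(4)
nm, nf, k = 50, 30, 2
Xm, Xf = rng.normal(size=(nm, k)), rng.normal(size=(nf, k))
ym, yf = rng.normal(size=nm), rng.normal(size=nf)
sm, sf = 2.0, 0.5                                 # group standard deviations

X = np.block([[Xm, np.zeros((nm, k))], [np.zeros((nf, k)), Xf]])
y = np.concatenate([ym, yf])
h = np.concatenate([np.full(nm, 1 / sm), np.full(nf, 1 / sf)])  # diag of H

b = np.linalg.solve(X.T @ X, X.T @ y)             # pooled OLS
Xw, yw = h[:, None] * X, h * y                    # weighted data
bw = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)        # WLS
print(np.max(np.abs(b - bw)))
```

The key design fact is that each weight multiplies an entire block that involves only one group's coefficients, so it cancels inside each block.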
Does this imply that we can do inference in Oaxaca's model feeding the Stata regress command with the variables of model (5.2.2) without further caution? Not quite. Although the single OLS regression provides the BLUE estimator for the population coefficients $\beta$, the OLS estimate of $Var(b|X)$ that would be computed by regress,
$$\widehat{Var}(b|X) = s^2\begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix},$$
with $s^2$ obtained from the sum of squares of the pooled residuals, is biased. The reason is that $\widehat{Var}(b|X)$ forces the regression variance estimate to be constant across the two samples. Luckily, the same is not true for the separate regressions on the two subsamples, which provide us with the unbiased estimators of the model coefficients, $b_m$ and $b_f$, and the unbiased estimator of the covariance matrix
$$\widehat{Var}(b|X) = \begin{pmatrix} s_m^2(X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & s_f^2(X_f'X_f)^{-1} \end{pmatrix},$$
where $s_m^2 = \frac{1}{n_m - k}\sum_{i=1}^{n_m}e_i^2$ and $s_f^2 = \frac{1}{n_f - k}\sum_{i=n_m+1}^{n}e_i^2$. Alternatively, one can implement a feasible version of the weighted regression explained above, using $s_m$ and $s_f$ as weights. But this is clearly more computationally cumbersome than carrying out the two separate regressions.
5.4. The Zyskind condition is verified in Oaxaca's model

By construction,
$$P_{[X]} = X(X'X)^{-1}X' = \begin{pmatrix} X_m & 0_{(n_m\times k)} \\ 0_{(n_f\times k)} & X_f \end{pmatrix}\begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix}\begin{pmatrix} X_m' & 0_{(k\times n_f)} \\ 0_{(k\times n_m)} & X_f' \end{pmatrix}$$
$$= \begin{pmatrix} X_m(X_m'X_m)^{-1}X_m' & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & X_f(X_f'X_f)^{-1}X_f' \end{pmatrix} = \begin{pmatrix} P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & P_{[X_f]} \end{pmatrix}.$$
Therefore,
$$\Omega P_{[X]} = \begin{pmatrix} \sigma_m^2 I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 I_{n_f} \end{pmatrix}\begin{pmatrix} P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & P_{[X_f]} \end{pmatrix} = \begin{pmatrix} \sigma_m^2 P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 P_{[X_f]} \end{pmatrix}$$
$$= \begin{pmatrix} P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & P_{[X_f]} \end{pmatrix}\begin{pmatrix} \sigma_m^2 I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 I_{n_f} \end{pmatrix} = P_{[X]}\Omega.$$
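The commutation of $\Omega$ and $P_{[X]}$ can be checked directly on a random block-diagonal design (a numpy sketch; names mine):

```python
import numpy as np

# Numerical check of the Zyskind condition P_[X] Omega = Omega P_[X]
# for a block-diagonal Oaxaca-type design.
rng = np.random.default_rng(5)
nm, nf, k = 20, 15, 2
Xm, Xf = rng.normal(size=(nm, k)), rng.normal(size=(nf, k))
X = np.block([[Xm, np.zeros((nm, k))], [np.zeros((nf, k)), Xf]])

P = X @ np.linalg.solve(X.T @ X, X.T)            # projector P_[X]
omega = np.diag(np.concatenate([np.full(nm, 4.0),   # sigma_m^2 = 4
                                np.full(nf, 0.25)]))  # sigma_f^2 = 0.25
lhs, rhs = P @ omega, omega @ P
print(np.max(np.abs(lhs - rhs)))
```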
CHAPTER 6
6.1. Introduction
The linear regression model may present departures from LRM.4, such as heteroskedastic-
ity and/or cluster correlation. In this chapter we study common econometric techniques that
accommodate these issues, for both estimation and inference: primarily, the Generalized LS
(GLS) estimator for the regression coefficients and robust covariance estimators.
All the statistical properties are derived for $n \to \infty$, and so the techniques we consider in this chapter work well in large samples.
I spell out the assumptions needed for consistency and asymptotic normality of OLS and
GLS estimators, providing the derivation of the large-sample properties.
Strict exogeneity is maintained throughout:
SE: E (|X) = 0.
A weaker version of the random sampling assumption, one which does not maintain identical distributions across records, is invoked when proving asymptotic normality and consistency of the variance estimators:
RS$'$: There is a sample of size $n$ such that the elements of the sequence $\{(y_i\; x_i'),\; i = 1, ..., n\}$ are independent (NB: not necessarily identically distributed) random vectors.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using the
data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2010).
6.2. OLS with non-spherical error covariance matrix

6.2.1. Consistency. Write
$$\mathrm{plim}(b) = \beta + \mathrm{plim}\left[\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}\right] = \beta + Q^{-1}\,\mathrm{plim}\left(\frac{X'\varepsilon}{n}\right).$$
By strict exogeneity,
$$E\left(\frac{X'\varepsilon}{n}\right) = 0;$$
then
$$Var\left(\frac{X'\varepsilon}{n}\,\Big|X\right) = E\left(\frac{X'\varepsilon}{n}\frac{\varepsilon'X}{n}\,\Big|X\right) = \frac{1}{n}\frac{X'\Omega X}{n},$$
and so
$$Var\left(\frac{X'\varepsilon}{n}\right) = \frac{1}{n}\,E\left(\frac{X'\Omega X}{n}\right),$$
which goes to zero as $n \to \infty$ by assumption OLS.1. Hence $X'\varepsilon/n$ converges in squared mean, and consequently in probability, to zero.
Clearly, the above implies that OLS is consistent in the classical case of LRM.4.
6.2. OLS WITH NON-SPHERICAL ERROR COVARIANCE MATRIX 99
is p.d. and arbitrary, in the sense that all i2 , i = 1, ..., n, are unconstrained and may depend
on X. LRM.4H permits arbitrary heteroskedasticity but not correlation. Partition X row-wise
0
x1
x0 2
X = . .
..
xn0
V ar (xi i ) = E E 2i xi x0 i |xi = E i2 xi x0 i
and
n n
1X 1X
E i2 xi x0i
V ar (xi i ) =
n n
i=1 i=1
0
X X
= E .
n
Therefore,
n
X 0 X
1X
lim V ar (xi i ) = lim E ,
n n n n
i=1
6.2. OLS WITH NON-SPHERICAL ERROR COVARIANCE MATRIX 100
which is a finite matrix by assumption and so, by the (multivariate) Lindeberg-Feller theorem,
X n
X 0
0
n X X
xi i N 0, plim .
n n d n
i=1
Eventually, given the rules for limiting distributions (Theorem D.16 in Greene (2008)),
1
X 0X X 0 X 0
n (b ) Q1 ,
n n d n
and so
0
X X
n (b ) N 0, Q1 plim Q1 .
d n
X 0 X
1
Avar (b) = Q1 plim Q1 ,
n n
1 1
(6.2.1) \
Avar (b) = X 0 X X 0 X
X 0X ,
where
e21 0 0
.. ..
0 e22 . .
=
.
.. .. ..
. . . 0
0 0 e2n
An equivalent way to express $\widehat\Omega$, one that will be used intensively in Chapters 7 and 8, is the following:
$$\widehat\Omega = ee' \odot I_n,$$
where the symbol $\odot$ stands for the element-by-element matrix product (also known as the Hadamard product). Similarly, $\Omega = E(\varepsilon\varepsilon' \odot I_n\,|X)$.
Econometric software packages routinely compute robust OLS standard errors: these are just the square roots of the main diagonal elements of $\widehat{Avar}(b)$ in (6.2.1). In Stata this is done through the regress option vce(robust) (or, equivalently, simply robust).
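For intuition, the sandwich formula (6.2.1) and the $ee'\odot I_n$ form of $\widehat\Omega$ can be sketched in a few lines of numpy (simulated heteroskedastic data; names mine):

```python
import numpy as np

# White estimator (6.2.1): (X'X)^{-1} X' Omega_hat X (X'X)^{-1},
# with Omega_hat = diag(e_1^2, ..., e_n^2) = ee' (Hadamard) I_n.
rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1.0 + np.abs(x))  # heteroskedastic

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (e[:, None] ** 2 * X)                 # X' diag(e^2) X
meat_alt = X.T @ (np.outer(e, e) * np.eye(n)) @ X  # X' (ee' ⊙ I_n) X
V_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))             # robust standard errors
print(se_robust)
```

The two "meat" computations are identical; the first avoids forming the $n\times n$ matrix and is how software actually implements it.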
6.2.4. White's heteroskedasticity test. The White estimator remains consistent under homoskedasticity, therefore one can test for heteroskedasticity by assessing the statistical discrepancy between $s^2(X'X)^{-1}$ and $(X'X)^{-1}X'\widehat\Omega X(X'X)^{-1}$. Under the null hypothesis of homoskedasticity, the discrepancy will be small. This is the essence of White's heteroskedasticity test. The statistic measuring such discrepancy can be implemented through an auxiliary regression of the squared OLS residuals on the regressors, their squares and their cross-products, including the constant term.
We may implement the White test manually, saving the OLS residuals through predict and then generating squares and interactions as appropriate, or more easily by giving the following post-estimation command after regress: imtest, white.
6.2.5. Cluster correlation. Data are often naturally grouped into clusters: firms operate in different sectors, workers live in different provinces, immigrants in a given country belong to different ethnic groups, and so on.
Clustering cannot be neglected in empirical work. In the case of firm data, for example, it is likely that there is correlation across the productivity shocks hitting firms in the same sectoral cluster, with a resulting bias in the standard error estimates, even if White robust.
The White estimator can be made robust to cluster correlation quite easily. I explain
this in terms of the firm data example. Assume that we have cross-sectional data of n firms,
indexed by $i = 1, ..., n$. There are $G$ sectors, indexed by $g = 1, ..., G$, and we know which sector each firm belongs to. This information is contained in the $(n\times G)$ matrix $D$ of sectoral indicators: the element of $D$ in row $i$ and column $j$, say $d(i, j)$, is unity if firm $i$ belongs to sector $j$ and zero if not. The clustering assumption below generalizes both LRM.4 and LRM.4H:
$$Var(\varepsilon|X) = \Omega = E(\varepsilon\varepsilon' \odot DD'\,|X).$$
The cluster-correlation and heteroskedasticity consistent estimator for the asymptotic covariance matrix of $b$ is then assembled by simply replacing $\widehat\Omega$ in Equation (6.2.1) with
$$\widehat\Omega_c = ee' \odot DD'.$$
Stata does this through the regress option vce(cluster clustervar ), where clustervar is
the name of the cluster identifier in the data set.
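The construction $\widehat\Omega_c = ee' \odot DD'$ can be sketched directly in numpy (simulated firm/sector data with a common within-cluster shock; all names are mine):

```python
import numpy as np

# Cluster-robust covariance via Omega_hat_c = ee' (Hadamard) DD'.
rng = np.random.default_rng(7)
n, G = 200, 10
g = rng.integers(0, G, size=n)                    # cluster (sector) ids
D = (g[:, None] == np.arange(G)[None, :]).astype(float)  # n x G indicators
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=G)                            # common cluster shocks
y = 1.0 + 2.0 * x + u[g] + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
omega_c = np.outer(e, e) * (D @ D.T)              # zero outside clusters
XtX_inv = np.linalg.inv(X.T @ X)
V_cluster = XtX_inv @ (X.T @ omega_c @ X) @ XtX_inv
print(np.sqrt(np.diag(V_cluster)))
```

Note that $DD'$ has unity in cell $(i, j)$ exactly when firms $i$ and $j$ share a cluster, so the Hadamard product keeps only within-cluster residual products, i.e. the same "meat" as summing $X_g'e_g e_g'X_g$ over clusters.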
Chapter 8 will cover cases of multi-clustering, that is data that are grouped along more
than one dimension.
6.2.6. Average variance estimate (skip it). I prove now that a consistent estimate of the average variance
$$\bar\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2$$
is given by
$$\bar s_n^2 = \frac{1}{n}\sum_{i=1}^{n}e_i^2.$$
Since $e = M_{[X]}\varepsilon$,
$$\mathrm{plim}\,\bar s_n^2 = \mathrm{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right) - 0'Q^{-1}0 = \mathrm{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right).$$
By the RS$'$ assumption the squared errors, $\varepsilon_i^2$, are all independently distributed with means $E(\varepsilon_i^2) = \sigma_i^2$, and, given that
$$\frac{\varepsilon'\varepsilon}{n} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2,$$
I can apply Markov's strong law of large numbers to have
$$\mathrm{plim}\left[\frac{\varepsilon'\varepsilon}{n} - \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\right] = 0.$$
6.3. GLS

The estimation strategy described in the previous sections is based on OLS estimates of the regression coefficients with standard error estimates corrected for heteroskedasticity and/or cluster correlation. The drawback of this approach is a loss in efficiency if the departures from LRM.4 are of a known form. We will see that in this case the BLUE can always be found.
To formalize the new set-up, let $Var(\varepsilon|X) = \sigma^2\Omega$, where $\Omega$ is a known symmetric, positive definite (p.d.) $(n\times n)$ matrix and $\sigma^2$ is an unknown strictly positive scalar (that is, $Var(\varepsilon|X)$ is known up to a strictly positive multiplicative scalar).
6.3. GLS 104
(6.3.1) 1 = C1 C 0
and
(6.3.3) y = X + ,
Exercise 52. Verify by direct inspection that indeed 1 = 1 = and 1/2 1/2 =
1 .
1 = C1 C 0 CC 0 = C1 IC 0 = C1 C 0 = CC 0 = I
and
1 = CC 0 C1 C 0 = CI1 C 0 = C1 C 0 = CC 0 = I.
6.3. GLS 105
The rest is proved similarly on considering that 1/2 is diagonal and so 1/2 1/2 = 1 .
Exercise 53. Use (6.3.1) and (6.3.2) to prove: 1) $X^{*\prime}X^* = X'\Omega^{-1}X$; 2) $X^{*\prime}\varepsilon^* = X'\Omega^{-1}\varepsilon$; and 3) $Var(\varepsilon^*|X) = \sigma^2 I_n$; then use the general law of iterated expectations to prove that also $Var(\varepsilon^*|X^*) = \sigma^2 I_n$.
Given the results of the foregoing exercise, OLS applied to the transformed model (6.3.3) is the Gauss-Markov estimator for $\beta$ and has the formula
$$b_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*$$
$$(6.3.4)\qquad\;\; = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y,$$
with $Var(b_{GLS}|X) = \sigma^2(X'\Omega^{-1}X)^{-1}$.
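The equivalence between the transformed-model OLS and the closed-form (6.3.4) is immediate to verify in numpy (simulated data with a known diagonal $\Omega$; names mine):

```python
import numpy as np

# GLS two ways: the formula (6.3.4) and OLS on data premultiplied by
# Omega^{-1/2}; both give the same b_GLS.
rng = np.random.default_rng(8)
n, k = 80, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, -0.5, 2.0])
d = rng.uniform(0.5, 3.0, size=n)                 # Omega = diag(d), known
y = X @ beta + rng.normal(size=n) * np.sqrt(d)    # heteroskedastic errors

omega_inv = np.diag(1.0 / d)
b_gls = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)

w = 1.0 / np.sqrt(d)                              # Omega^{-1/2} weights
Xs, ys = w[:, None] * X, w * y                    # transformed data
b_trans = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)   # OLS on transformed model
print(np.max(np.abs(b_gls - b_trans)))
```

For diagonal $\Omega$, premultiplying by $\Omega^{-1/2}$ is just per-observation weighting, which is why GLS under pure heteroskedasticity reduces to WLS.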
The transformation based on $\Omega^{-1/2}$,
$$(6.3.5)\qquad \Omega^{-1/2}y = \Omega^{-1/2}X\beta + \Omega^{-1/2}\varepsilon,$$
is also a GLS transformation, that is, OLS applied to model (6.3.5) yields $b_{GLS}$.
Solution: By exercise 52,
$$\big(X'\Omega^{-1/2}\Omega^{-1/2}X\big)^{-1}X'\Omega^{-1/2}\Omega^{-1/2}y = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
The estimator $b_{GLS}$ is OLS applied to a classical regression model and as such it is BLUE. The following exercise asks you to verify by direct inspection that GLS is better than OLS in terms of covariance.
Exercise 55. Prove that
$$\sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1} - \sigma^2(X'\Omega^{-1}X)^{-1}$$
is a n.n.d. matrix.
Solution: Let
$$D \equiv (X'X)^{-1}X' - (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}.$$
Therefore,
$$(X'X)^{-1}X' = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + D,$$
with $DX = 0$, and so
$$(X'X)^{-1}X'\Omega X(X'X)^{-1} = \big[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + D\big]\,\Omega\,\big[\Omega^{-1}X(X'\Omega^{-1}X)^{-1} + D'\big] = (X'\Omega^{-1}X)^{-1} + D\Omega D'.$$
Since $\Omega$ is p.d., for any $(n\times 1)$ vector $z$, $z'\Omega z \geq 0$, being equal to zero if and only if $z = 0$. But then $z'\Omega z \geq 0$ when, in particular, $z = D'w$ for any $(k\times 1)$ vector $w$, which is equivalent to saying that $w'D\Omega D'w \geq 0$ for any $(k\times 1)$ vector $w$, or that $D\Omega D'$ is n.n.d., proving the result.
Exercise 56. Under GLS.1, SE and RS$'$, prove that
$$\mathrm{plim}(b_{GLS}) = \beta.$$
Hint: write
$$b_{GLS} = \beta + \left(\frac{X^{*\prime}X^*}{n}\right)^{-1}\frac{X^{*\prime}\varepsilon^*}{n},$$
then consider that $Var(\varepsilon^*|X^*) = \sigma^2 I_n$ (see Exercise 53) and, finally, follow the same steps as in Section 6.2.1.
6.3.2. Asymptotic normality. I prove asymptotic normality of $b_{GLS}$ under GLS.1, SE and RS$'$ (again, remember that $Var(\varepsilon^*|X^*) = \sigma^2 I_n$).
By strict exogeneity $E(x_i^*\varepsilon_i^*) = 0$ and hence
$$Var(x_i^*\varepsilon_i^*) = \sigma^2 E(x_i^* x_i^{*\prime})$$
and
$$\frac{1}{n}\sum_{i=1}^{n}Var(x_i^*\varepsilon_i^*) = \frac{\sigma^2}{n}\sum_{i=1}^{n}E(x_i^* x_i^{*\prime}) = \sigma^2 E\left(\frac{X'\Omega^{-1}X}{n}\right).$$
Therefore,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}Var(x_i^*\varepsilon_i^*) = \sigma^2\lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right),$$
which is a finite matrix by assumption. By the Lindeberg-Feller central limit theorem,
$$\frac{\sqrt n}{n}\sum_{i=1}^{n}x_i^*\varepsilon_i^* = \sqrt n\,\frac{X^{*\prime}\varepsilon^*}{n} \;\xrightarrow{d}\; N\left(0,\; \sigma^2 Q^*\right),$$
where $Q^* \equiv \mathrm{plim}\,X'\Omega^{-1}X/n$, and since
$$\sqrt n\,(b_{GLS} - \beta) = \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\sqrt n\,\frac{X'\Omega^{-1}\varepsilon}{n} \;\approx\; Q^{*-1}\sqrt n\,\frac{X'\Omega^{-1}\varepsilon}{n},$$
then
$$\sqrt n\,(b_{GLS} - \beta) \;\xrightarrow{d}\; N\left(0,\; \sigma^2 Q^{*-1}\right).$$
The asymptotic covariance matrix is
$$Avar(b_{GLS}) = \frac{\sigma^2}{n}\,Q^{*-1},$$
and is estimated by
$$\widehat{Avar}(b_{GLS}) = s^2_{GLS}\,(X'\Omega^{-1}X)^{-1},$$
where
$$s^2_{GLS} = \frac{(y^* - X^*b_{GLS})'(y^* - X^*b_{GLS})}{n - k} = \frac{(y - Xb_{GLS})'\Omega^{-1}(y - Xb_{GLS})}{n - k}.$$
Exercise 57. (This may be skipped) Under GLS.1, SE and RS$'$, prove that $\mathrm{plim}\, s^2_{GLS} = \sigma^2$.
6.3.3. Feasible GLS. In general situations we may know the form of $\Omega$ but not the values taken on by its elements. Therefore, to make GLS operational we need an estimate of $\Omega$, say $\widehat\Omega$. Replacing $\Omega$ by $\widehat\Omega$ in (6.3.4) delivers the feasible GLS, henceforth FGLS, estimator:
$$b_{FGLS} = \big(X'\widehat\Omega^{-1}X\big)^{-1}X'\widehat\Omega^{-1}y.$$
Since GLS is consistent, knowing that $b_{GLS}$ and $b_{FGLS}$ are asymptotically equivalent, i.e. $\mathrm{plim}(b_{FGLS} - b_{GLS}) = 0$, is enough to ensure that $b_{FGLS}$ is consistent, but not that
$$\sqrt n\,(b_{FGLS} - \beta) \;\xrightarrow{d}\; N\left(0,\; \sigma^2 Q^{*-1}\right).$$
For this we need the stronger condition that $\sqrt n\,(b_{FGLS} - \beta)$ and $\sqrt n\,(b_{GLS} - \beta)$ be asymptotically equivalent, or
$$(6.3.6)\qquad \sqrt n\,(b_{FGLS} - b_{GLS}) \;\xrightarrow{p}\; 0.$$
Exercise 58. Assume
$$(6.3.7)\qquad \mathrm{plim}\left(\frac{X'\widehat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0$$
and
$$(6.3.8)\qquad \mathrm{plim}\left(\frac{X'\widehat\Omega^{-1}\varepsilon}{\sqrt n} - \frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = 0,$$
and prove (6.3.6).
Solution: Given
$$\mathrm{plim}\,\frac{X'\Omega^{-1}X}{n} = Q^*$$
and (6.3.7), then
$$\mathrm{plim}\left[\frac{X'\Omega^{-1}X}{n} + \left(\frac{X'\widehat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right)\right] = Q^*,$$
and so
$$(6.3.9)\qquad \mathrm{plim}\left(\frac{X'\widehat\Omega^{-1}X}{n}\right)^{-1} = Q^{*-1}.$$
Moreover,
$$\sqrt n\,(b_{FGLS} - \beta) = \left(\frac{X'\widehat\Omega^{-1}X}{n}\right)^{-1}\frac{X'\widehat\Omega^{-1}\varepsilon}{\sqrt n}$$
and
$$\sqrt n\,(b_{GLS} - \beta) = \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}.$$
The last two equalities, along with the maintained conditions (6.3.7) and (6.3.8), the asymptotic results (6.3.9) and (6.3.10) and the Slutsky Theorem, prove that $\sqrt n\,(b_{GLS} - \beta)$ and $\sqrt n\,(b_{FGLS} - \beta)$ are asymptotically equivalent, both being asymptotically equal to $Q^{*-1}X'\Omega^{-1}\varepsilon/\sqrt n$.
Conditions (6.3.7) and (6.3.8) must be verified on a case-by-case basis. Importantly, they may hold even in cases in which $\widehat\Omega$ is not consistent for $\Omega$, as shown in the context of FGLS panel data estimators.

6.4. Large sample tests

6.4.1. Introduction. This section covers large sample tests in more detail than Greene (2008). For the exam you can skip the derivations of the asymptotic results.
Assume that the following results hold:
(1) $\sqrt n\,(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$;
(2) $\mathrm{plim}\,\dfrac{X'X}{n} = Q$;
(3) $\mathrm{plim}\, s^2 = \sigma^2$;
and consider the following lemma, referred to as the product rule. For more on this see
and consider the following lemma, referred to as the product rule. For more on this see
White (2001) p. 67 (notice that the product rule is not mentioned in Greene (2008), although
implicitly used for proving the asymptotic distributions of the tests).
Lemma 59 (The product rule). Let $A_n$ be a sequence of random $(l\times k)$ matrices and $b_n$ a sequence of random $(k\times 1)$ vectors such that $\mathrm{plim}(A_n) = 0$ and $b_n \xrightarrow{d} z$. Then, $\mathrm{plim}(A_n b_n) = 0$.
6.4.2. The t-ratio test (skip derivations). We wish to derive the asymptotic distribution of the t-ratio test for the null hypothesis $H_o: \beta_k = \beta_k^o$. We begin by noting that under $H_o$
$$(6.4.1)\qquad \frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{\sigma^2 Q^{-1}_{kk}}} \;\xrightarrow{d}\; N(0, 1).$$
The t-ratio is
$$t = \frac{b_k - \beta_k^o}{\sqrt{s^2\,(X'X)^{-1}_{kk}}},$$
where $(X'X)^{-1}_{kk} = \big(x_k' M_{[X_{(k)}]} x_k\big)^{-1}$ and $X = (X_{(k)}\;\; x_k)$ (see Section 4.8). Since $t$ can be reformulated as
$$t = \frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}},$$
then
$$(6.4.2)\qquad \mathrm{plim}\left[\frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right] = \mathrm{plim}\left[\left(\frac{1}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{1}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right)\sqrt n\,(b_k - \beta_k^o)\right] = 0,$$
where the second equality follows from the product rule, given that, by results 2-3 and the Slutsky Theorem (Theorem D.12 in Greene (2008), p. 1045), the first factor in the second plim converges in probability to zero and, by result 1, the second factor converges in distribution to a normal random scalar. Hence, the two sequences in the plim of equation (6.4.2) are asymptotically equivalent and by Theorem D.16(3) have the same limiting distribution. Given (6.4.1), this proves that
$$\frac{b_k - \beta_k^o}{\sqrt{s^2\,(X'X)^{-1}_{kk}}} \;\xrightarrow{d}\; N(0, 1).$$
More generally, for a non-random $(k\times 1)$ vector $r$, under $H_o: r'\beta = r'\beta^o$,
$$(6.4.3)\qquad \frac{r'(b - \beta)}{\sqrt{s^2\, r'(X'X)^{-1}r}} \;\xrightarrow{d}\; N(0, 1).$$
Exercise 60. (skip) Prove (6.4.3). Hint: by the Slutsky Theorem, $\mathrm{plim}\; r'\left(\dfrac{X'X}{n}\right)^{-1}r = r'Q^{-1}r$.
6.4.3. The Chi-squared test (skip derivations). We wish to test the null hypothesis $H_o: R\beta - q = 0$, where $R$ is a non-random $(J\times k)$ matrix of full row rank and $q$ is a $(J\times 1)$ vector of constants.
Now let $A \equiv \sigma^2 RQ^{-1}R'$. Since $A$ is p.d. ($R$ is f.r.r.), there exists a p.d. matrix $A^{1/2}$ such that $A^{1/2}A^{1/2} = A$ and $(A^{1/2})^{-1} = A^{-1/2}$. Then, by result 1 and the Slutsky Theorem,
$$(6.4.4)\qquad A^{-1/2}R\sqrt n\,(b - \beta) \;\xrightarrow{d}\; N(0, I_J).$$
Similarly, let $\widehat A \equiv s^2 R\left(\frac{X'X}{n}\right)^{-1}R'$. Since $\widehat A$ is p.d., there exists a p.d. matrix $\widehat A^{1/2}$ such that $\widehat A^{1/2}\widehat A^{1/2} = \widehat A$ and $(\widehat A^{1/2})^{-1} = \widehat A^{-1/2}$. Then
$$(6.4.5)\qquad \mathrm{plim}\left[\widehat A^{-1/2}R\sqrt n\,(b - \beta) - A^{-1/2}R\sqrt n\,(b - \beta)\right] = \mathrm{plim}\left[\left(\widehat A^{-1/2} - A^{-1/2}\right)R\sqrt n\,(b - \beta)\right] = 0,$$
where the second equality follows from the product rule, given that
$$\mathrm{plim}\,\widehat A^{-1/2} = A^{-1/2}$$
and
$$R\sqrt n\,(b - \beta) \;\xrightarrow{d}\; N(0, A),$$
by result 1 and Theorem D.16(2) in Greene (2008). Hence, by Theorem D.16(3) the two sequences in the left-hand-side plim of equation (6.4.5) have the same limiting distribution and, given (6.4.4), this proves that
$$\widehat A^{-1/2}R\sqrt n\,(b - \beta) \;\xrightarrow{d}\; N(0, I_J).$$
Let $w \equiv \widehat A^{-1/2}R\sqrt n\,(b - \beta)$; then, by Theorem D.16(2),
$$(6.4.6)\qquad w'w \;\xrightarrow{d}\; \chi^2(J).$$
Under $H_o$, $R\beta = q$ and
$$w'w = \sqrt n\,(b - \beta)'R'\,\widehat A^{-1/2}\widehat A^{-1/2}R\sqrt n\,(b - \beta) = \sqrt n\,(b - \beta)'R'\left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b - \beta) = JF,$$
and so, by (6.4.6),
$$JF \;\xrightarrow{d}\; \chi^2(J).$$
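The Wald form of the statistic is straightforward to compute; the following numpy sketch (simulated data under the null; names mine) builds $JF$ for the joint hypothesis that two slope coefficients are zero:

```python
import numpy as np

# Wald statistic JF = [R(b-beta)]' [s^2 R (X'X)^{-1} R']^{-1} [R(b-beta)],
# asymptotically chi-squared(J) under Ho; here Ho: beta_2 = beta_3 = 0.
rng = np.random.default_rng(9)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.0, 0.0])                  # Ho holds in the DGP
y = X @ beta + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - k)                              # regression variance
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])  # J = 2 restrictions
q = np.zeros(2)
A_hat = s2 * R @ np.linalg.inv(X.T @ X) @ R.T
wald = (R @ b - q) @ np.linalg.solve(A_hat, R @ b - q)
print(wald)  # compare with the chi2(2) 5% critical value, 5.99
```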
CHAPTER 7
7.1. Introduction
This chapter covers the two most important panel data models: the fixed effect and the
random effect models.
For simplicity we start directly from the statistical models. The sampling mechanism will
be introduced when proving asymptotic normality.
Results in this chapter are demonstrated through the do-file paneldata.do using the data-set airlines.dta, a panel data-set that I have extracted from costfn.dta (Baltagi et al. 1998).
7.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)
Consider the following panel data regression model expressed at the observation level, that is, for individuals $i = 1, ..., N$ and times $t = 1, ..., T$:
$$(7.2.1)\qquad y_{it} = x_{it}'\beta + \gamma_i + \varepsilon_{it}.$$
Define $d_{it}^j$ as the value taken on by the dummy variable indicating individual $j = 1, ..., N$ at observation $(i, t)$, that is,
$$d_{it}^j = \begin{cases} 1 & \text{if } i = j, \text{ any } t = 1, ..., T \\ 0 & \text{if } i \neq j, \text{ any } t = 1, ..., T. \end{cases}$$
Stacking the T observations of individual i yields the group-level representation

(7.2.2)  y_i = X_i β + α_i 1_T + ε_i,

where

y_i = (y_i1, ..., y_it, ..., y_iT)'  (T × 1),
X_i = (x_i1, ..., x_it, ..., x_iT)'  (T × k),
ε_i = (ε_i1, ..., ε_it, ..., ε_iT)'  (T × 1),

and

d_ji = 1_T if i = j,  d_ji = 0_T if i ≠ j,

where 1_T indicates the (T × 1) vector of all unity elements and 0_T the (T × 1) vector of all zero elements.
Stacking data by individuals, an even more compact representation of the regression model (7.2.2), at the level of the whole data-set, is

(7.2.3)  y = Xβ + Dα + ε,
7.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL) 117
where

y = (y_1', ..., y_i', ..., y_N')'  (NT × 1),
X = (X_1', ..., X_i', ..., X_N')'  (NT × k),
ε = (ε_1', ..., ε_i', ..., ε_N')'  (NT × 1),
α = (α_1, ..., α_i, ..., α_N)'  (N × 1),

and D is the (NT × N) matrix of individual dummies, D = I_N ⊗ 1_T, or equivalently D = (d_1 d_2 ... d_N). Under the following assumptions model (7.2.3) is a classical regression model that includes individual dummies:
FE.1: The extended regressor matrix (X D) has f.c.r. Therefore, not only is X of f.c.r.,
but also none of its columns can be expressed as a linear combination of the dummy
variables, which boils down to saying that no column of X can be time-constant,
which in turn implies that X does not include the unity vector (indeed, there is a
constant term in model (7.2.3), but one that jumps across individuals).
FE.2: E(ε|X) = 0. Hence, the variables in X are strictly exogenous with respect to ε, but the statistical relationship between α and X is left completely unrestricted. Model (7.2.3), therefore, automatically accommodates any form of omitted-variable bias due to the omission of time-constant regressors. Notice that D is taken as a non-random matrix, therefore conditioning on (X D) or simply X is exactly the same.
FE.3: Var(ε|X) = σ_ε² I_NT.
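A small simulation can make the assumptions concrete. The sketch below (my own illustration, with numpy standing in for the notes' Stata) builds the dummy matrix D = I_N ⊗ 1_T, generates regressors deliberately correlated with the individual effects (allowed under FE.2), and runs OLS on the extended regressor matrix (X D):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k = 50, 6, 2
alpha = rng.standard_normal(N)                       # individual effects
# regressors correlated with the effects: the FE model allows this
X = rng.standard_normal((N * T, k)) + np.repeat(alpha, T)[:, None]
beta = np.array([1.0, -2.0])
D = np.kron(np.eye(N), np.ones((T, 1)))              # NT x N dummy matrix
y = X @ beta + D @ alpha + 0.1 * rng.standard_normal(N * T)

Z = np.hstack([X, D])                                # extended regressor matrix (X D)
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
b_lsdv, a_lsdv = coef[:k], coef[k:]                  # slope and effect estimates
```

Despite the correlation between X and α, the dummy regression recovers β; a pooled regression omitting D would not.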
Exercise 61. Prove that the following model with the constant term is an equivalent reparametrization of Model (7.2.3):

(7.2.4)  y = 1_NT γ_0 + Xβ + D_1 γ_1 + ε.

Solution. Partition D = (d_1 D_1) and α = (α_1, α'_{(1)})', where α_{(1)} = (α_2, ..., α_N)', so that model (7.2.3) can be written as

(7.2.5)  y = d_1 α_1 + Xβ + D_1 α_{(1)} + ε,

with d_1 + D_1 1_{N−1} = 1_NT. Then, adding and subtracting 1_NT α_1,

y = 1_NT α_1 + d_1 α_1 + Xβ + D_1 α_{(1)} − (d_1 + D_1 1_{N−1}) α_1 + ε
  = 1_NT α_1 + Xβ + D_1 α_{(1)} − D_1 1_{N−1} α_1 + ε
  = 1_NT γ_0 + Xβ + D_1 γ_1 + ε,

where γ_0 ≡ α_1 and γ_1 ≡ α_{(1)} − 1_{N−1} α_1.
Remark 62. Exercise 61 demonstrates that after the reparametrization the interpretation of the β coefficients is unchanged, the constant term is α_1 and the coefficients on the remaining individual dummies are no longer the individual effects of the remaining individuals, α_i, i = 2, ..., N, but rather the contrasts of α_i with respect to α_1, i = 2, ..., N. Of course, the reference individual need not be the first one in the data-set and can be freely chosen among the N individuals by the researcher at her/his own convenience. In Stata this is implemented by using regress followed by the dependent variable, the X regressors and N − 1 dummy variables (see paneldata.do).
Remark 63. The interpretation of the constant in Exercise 61 is different from that in
the Stata transformation (see 10/04/12 Exercises) of Model (7.2.3). In the former case the
constant term is the effect of the individual whose dummy is removed from the regression, in
the latter it is the average of the N individual effects.
The LSDV estimator is just the OLS estimator applied to model (7.2.3) and, given FE.1-3, it is the BLUE. The separate formulas of LSDV for β and α are obtained by applying the partitioned-regression Theorem to (7.2.3). So,

b_LSDV = (X'M_[D]X)^{-1} X'M_[D]y

is the LSDV estimator for β and

a_LSDV = (D'D)^{-1} D'(y − X b_LSDV)

is the LSDV estimator for α. As already mentioned, both are BLUEs, but while b_LSDV converges in probability to β when N → ∞ or T → ∞ or both, a_LSDV converges in probability to α only when T → ∞. This discrepant large-sample behavior of b_LSDV and a_LSDV is due to the fact that the dimension of α increases as N increases, whereas that of β is kept fixed at k.
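The partitioned-regression formulas can be verified directly. In the sketch below (my own numerical check, not from the notes), the two-step formulas for b_LSDV and a_LSDV reproduce exactly the coefficients of one long OLS regression on (X D):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, k = 30, 5, 3
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
y = X @ np.array([0.5, 1.0, -1.0]) + D @ rng.standard_normal(N) \
    + rng.standard_normal(N * T)

P_D = D @ np.linalg.inv(D.T @ D) @ D.T       # projection on the dummies
M_D = np.eye(N * T) - P_D                    # within (annihilator) projector

b_lsdv = np.linalg.solve(X.T @ M_D @ X, X.T @ M_D @ y)
a_lsdv = np.linalg.inv(D.T @ D) @ D.T @ (y - X @ b_lsdv)

# the same numbers from one long OLS regression on (X D)
coef, *_ = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)
```

The agreement is exact (up to floating-point error), not merely asymptotic.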
The individual effect estimates can equivalently be written as

a_LSDV,i = ȳ_i − x̄'_i b_LSDV,

where x̄'_i = (x̄_1i ... x̄_ki) is the (1 × k) vector of group means for individual i. It is also clear that for any (NT × 1) vector z

P_[D] z = D (D'D)^{-1} D' z = (z̄_1 1_T', ..., z̄_i 1_T', ..., z̄_N 1_T')',

the (NT × 1) vector stacking the group means z̄_1, ..., z̄_N, each repeated T times.
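The group-mean property of P_[D] is easy to see numerically (a minimal check of my own, not from the notes):

```python
import numpy as np

N, T = 4, 3
D = np.kron(np.eye(N), np.ones((T, 1)))
z = np.arange(N * T, dtype=float)

# P_[D] z replaces each observation with its group mean
P_D_z = D @ np.linalg.inv(D.T @ D) @ D.T @ z
group_means_repeated = np.repeat(z.reshape(N, T).mean(axis=1), T)
```

Correspondingly, M_[D] z = z − P_[D] z takes each observation in deviation from its own group mean.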
The conditional variance-covariance matrix of b_LSDV is Var(b_LSDV|X) = σ_ε² (X'M_[D]X)^{-1}. It is estimated by replacing σ_ε² with the Anova estimator s²_LSDV, based on the LSDV residuals e_LSDV = M_[D]y − M_[D]X b_LSDV:

(7.2.6)  s²_LSDV = e'_LSDV e_LSDV / (NT − N − k).
Exercise 66. Prove that E(s²_LSDV) = σ_ε². This is a long one, but when done you can tell yourself BRAVO! I just give you a few hints. First, on noting that y is determined by the right hand side of (7.2.3), prove that e_LSDV = M_[M_[D]X] M_[D] ε; then elaborate the conditional mean of ε'M_[D] M_[M_[D]X] M_[D] ε using the trace operator as we did for s²; finally apply the law of iterated expectations.
It is not hard to verify (do it) that b_LSDV can be obtained from the OLS regression of model (7.2.3) transformed in group-mean deviations (this transformation is referred to in the panel-data literature as the within transformation):

(7.2.7)  M_[D] y = M_[D] X β + M_[D] ε.
The intuition is simple: since the group mean of any time-constant element, such as α_i, coincides with the element itself, all time-constant elements in model (7.2.3) are wiped out; this also explains why X cannot contain time-constant variables. So, in a sense, the within transformation controls out the whole time-constant heterogeneity, latent or not, in model (7.2.3), making it look almost like a classical LRM. In particular, it can be proved easily that LRM.1-LRM.3 hold. Notice, however, that the errors in the transformed model, M_[D]ε, have a non-diagonal conditional covariance matrix (it is, indeed, block-diagonal and singular, can you derive it?). Specifically, the vector M_[D]ε presents within-group serial correlation, since for each individual group there are only T − 1 linearly independent demeaned errors. As a consequence, LRM.4 does not apply to model (7.2.7). All the same, b_LSDV is BLUE. This is true because the condition of Theorem 47 in Section 4.9 is met (if you have answered the previous question on the covariance matrix of M_[D]ε, you should be able to verify this claim too).
One should not conclude from the foregoing discussion that OLS on the within-transformed model (7.2.7) is a safe strategy. As in the Oaxacas pooled model of Section 5.2, the fact that the error covariance matrix is not spherical, presenting in this specific case within-group serial correlation, has bad consequences as far as standard error estimates are concerned. Indeed, should we leave the econometric software free to treat model (7.2.7) as a classical LRM, and so regress M_[D]y on M_[D]X, it would compute coefficient estimates just fine. Nonetheless, it would estimate Var(b_LSDV|X) by s² (X'M_[D]X)^{-1}, with s² = e'_LSDV e_LSDV / (NT − k) ≠ s²_LSDV, which is biased since it uses a wrong degrees-of-freedom correction. The econometric software is not aware that for each individual in the sample there are only T − 1 linearly independent demeaned errors and so, rather than dividing the residual sum of squares by N(T − 1) − k, it divides it by NT − k. The upshot is that standard errors estimated in this way need rectifying by multiplying each of them by the correction factor √((NT − k)/(NT − N − k)).
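The degrees-of-freedom correction can be illustrated with a short simulation (my own sketch, numpy in place of the notes' Stata):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, k = 40, 5, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
y = X @ np.array([1.0, 2.0]) + D @ rng.standard_normal(N) \
    + rng.standard_normal(N * T)

M_D = np.eye(N * T) - D @ np.linalg.inv(D.T @ D) @ D.T
Xw, yw = M_D @ X, M_D @ y                     # within-transformed data
b = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
e = yw - Xw @ b

s2_naive = e @ e / (N * T - k)                # what a classical-LRM routine uses
s2_lsdv = e @ e / (N * T - N - k)             # correct LSDV degrees of freedom
XtXinv = np.linalg.inv(Xw.T @ Xw)
se_naive = np.sqrt(s2_naive * np.diag(XtXinv))
se_lsdv = np.sqrt(s2_lsdv * np.diag(XtXinv))
factor = np.sqrt((N * T - k) / (N * T - N - k))
```

Multiplying the naive standard errors by the correction factor reproduces the correct LSDV standard errors exactly.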
An interesting assumption to test is that of the absence of individual heterogeneity, H_0: α_1 = α_2 = ... = α_N. Under the restriction implied by H_0, model (7.2.3) pools together all data with no attention to the individual clustering and can be written as

(7.2.8)  y = X* δ + ε,

where

X* = (1_NT X),  δ = (α_0, β')'.

Hence, under H_0, the pooled OLS (POLS) estimator

(7.2.9)  b_POLS = (X*'X*)^{-1} X*'y

is the BLUE, with residual vector

(7.2.10)  e_POLS = y − X* b_POLS.

Comparing the restricted (POLS) and unrestricted (LSDV) residual sums of squares yields the F statistic

(7.2.11)  F = [(e'_POLS e_POLS − e'_LSDV e_LSDV)/(N − 1)] / [e'_LSDV e_LSDV/(NT − N − k)].

If F does not reject H_0, POLS is a legitimate, more efficient than LSDV, estimation procedure. If F rejects H_0, then POLS is biased and LSDV should be adopted.
Exercise 67. On reparametrizing the LSDV model as in Exercise 61, the hypothesis of no individual heterogeneity becomes H_0: γ_1 = 0. Prove that the resulting F-test is numerically identical to F in Equation (7.2.11).

Solution. Easy. Since models (7.2.3) and (7.2.4) are indeed the same model, the resulting F-test is numerically identical to the F-test in Equation (7.2.11). This is demonstrated empirically in the paneldata.do Stata do-file.
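A sketch of the pooling test (my own numerical illustration; the design deliberately builds in strong heterogeneity so that the F statistic comparing pooled and LSDV residual sums of squares rejects):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, k = 30, 6, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
alpha = 2.0 * rng.standard_normal(N)          # strong heterogeneity: H0 is false
y = X @ np.array([1.0, -1.0]) + D @ alpha + rng.standard_normal(N * T)

# unrestricted (LSDV) residual sum of squares
Z = np.hstack([X, D])
e_u = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
# restricted (pooled OLS with a common constant) residual sum of squares
Xs = np.column_stack([np.ones(N * T), X])
e_r = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]

F = ((e_r @ e_r - e_u @ e_u) / (N - 1)) / (e_u @ e_u / (N * T - N - k))
```

With genuinely heterogenous effects, F is far above any conventional F(N − 1, NT − N − k) critical value.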
The random effect model has the same algebraic structure of model (7.2.1). At the observation level, i = 1, ..., N and t = 1, ..., T, we have

(7.3.1)  y_it = α_0 + x'_it β + u_i + ε_it,

or, at the level of the whole data-set,

(7.3.2)  y = X* δ + w,

where

X* = (1_NT X),  δ = (α_0, β')'  and  w = ε + Du.

The following is maintained.

RE.1: X* has f.c.r.
RE.2: E(ε|X*) = 0 and E(u|X*) = 0.
RE.3: Var(ε|X*) = σ_ε² I_NT, Var(u|X*) = σ_u² I_N and Cov(ε, u|X*) = 0.

Hence,

Ω ≡ Var(w|X*) = Var(ε|X*) + Var(Du|X*)
  = σ_ε² I_NT + σ_u² DD'.

This means that under RE.1-3 the covariance matrix of w, although homoskedastic, is non-diagonal and the POLS estimator in (7.2.9) is unbiased (verify this) but not BLUE (unless σ_u² = 0). The BLUE estimator for δ is therefore the GLS Random effect estimator

b_GLSRE = (X*'Ω^{-1}X*)^{-1} X*'Ω^{-1}y.
7.3. THE RANDOM EFFECT MODEL 126
Using DD' = T P_[D],

Ω = σ_ε² I_NT + T σ_u² P_[D]
  = σ_ε² M_[D] + σ_1² P_[D],

where σ_1² = σ_ε² + T σ_u². Therefore,

(7.3.3)  Ω^{-1} = (1/σ_ε²) M_[D] + (1/σ_1²) P_[D]

and

b_GLSRE = [X*'((1/σ_ε²)M_[D] + (1/σ_1²)P_[D])X*]^{-1} X*'((1/σ_ε²)M_[D] + (1/σ_1²)P_[D]) y.
Exercise 69. 1) Verify that (1/σ_ε²)M_[D] + (1/σ_1²)P_[D] is indeed the inverse of Ω, that is

[(1/σ_ε²)M_[D] + (1/σ_1²)P_[D]] [σ_ε² M_[D] + σ_1² P_[D]] = I_NT.

2) Verify that premultiplying all variables of model (7.3.2) by M_[D] + (σ_ε/σ_1)P_[D] transforms it into a classical regression model, so that b_GLSRE can be obtained at once by applying OLS to the transformed model. 3) Verify that the operator M_[D] + (σ_ε/σ_1)P_[D] can also be written as

(7.3.5)  M_[D] + (σ_ε/σ_1)P_[D] = I_NT − (1 − σ_ε/σ_1)P_[D].
The operator in (7.3.5), M_[D] + (σ_ε/σ_1)P_[D], transforms any conformable variable that pre-multiplies it into quasi-mean deviations, or partial deviations, in the sense that it only removes a portion of the group mean from the variable. For this reason, the coefficients on time-constant variables are identified in the RE model: time-constant variables, when premultiplied by M_[D] + (σ_ε/σ_1)P_[D], are not wiped out, but rescaled by a factor σ_ε/σ_1. The RE model under the GLS transformation is therefore

[M_[D] + (σ_ε/σ_1)P_[D]] y = [M_[D] + (σ_ε/σ_1)P_[D]] X* δ + [M_[D] + (σ_ε/σ_1)P_[D]] w.
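The algebra of Ω, its inverse and the quasi-demeaning operator can be verified mechanically (a small check of my own, not from the notes):

```python
import numpy as np

N, T = 3, 4
s2_eps, s2_u = 1.5, 0.7
s2_1 = s2_eps + T * s2_u                        # sigma_1^2

D = np.kron(np.eye(N), np.ones((T, 1)))
P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P

Omega = s2_eps * M + s2_1 * P                   # = s2_eps*I_NT + T*s2_u*P
Omega_inv = M / s2_eps + P / s2_1               # formula (7.3.3)
theta = np.sqrt(s2_eps / s2_1)
A = M + theta * P                               # the quasi-demeaning operator
```

The assertions below confirm the two representations of Ω, that (7.3.3) is indeed its inverse, and the identity (7.3.5).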
7.3.1. The Feasible GLS. The feasible version of b_GLSRE, say b_FGLSRE, the one that is actually implemented in econometric software, can be obtained through the method by Swamy and Arora (1972). The estimator for σ_ε² is simply s²_LSDV in (7.2.6) and that for σ_1² is obtained as follows.

Define the Between residual vector e_B as

(7.3.7)  e_B = P_[D] y − P_[D] X* b_B,

where b_B = (X*'P_[D]X*)^{-1} X*'P_[D] y is the Between estimator,¹ and define

s²_B = e'_B e_B / (N − k − 1).

Exercise. Prove that E(s²_B|X*) = σ_1². Same hint as for Exercise 66: first, on noting that y is determined by the right hand side of (7.3.2), prove that e_B = M_[P_[D]X*] P_[D] w; then elaborate the conditional mean of w'P_[D] M_[P_[D]X*] P_[D] w using the trace operator as we did for s²; finally apply the law of iterated expectations.

Solution. Replacing the formula of b_B into the right hand side of equation (7.3.7) gives

e_B = [I − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D]] P_[D] y
    = [I − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D]] (P_[D]X* δ + P_[D] w)
    = [I − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D]] P_[D] w
    = M_[P_[D]X*] P_[D] w.

Therefore,

e'_B e_B = w'P_[D] M_[P_[D]X*] M_[P_[D]X*] P_[D] w = w'M_[P_[D]X*] P_[D] w,

where the first equality follows from the idempotence of M_[P_[D]X*] and the second from the fact that P_[D] and M_[P_[D]X*] commute, and so

E(e'_B e_B|X*) = E[tr(M_[P_[D]X*] P_[D] ww')|X*]
               = tr(M_[P_[D]X*] P_[D] E(ww'|X*))
               = tr(M_[P_[D]X*] P_[D] Ω).

Since Ω = σ_ε² M_[D] + σ_1² P_[D], we get P_[D] Ω = σ_1² P_[D], given that P_[D] is idempotent and P_[D] M_[D] = 0_(NT×NT). Hence

E(e'_B e_B|X*) = σ_1² tr(M_[P_[D]X*] P_[D]).

Given

M_[P_[D]X*] P_[D] = P_[D] − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D],

we have

tr(M_[P_[D]X*] P_[D]) = tr(P_[D]) − tr(P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D])
                      = tr(I_N) − tr(I_{k+1})
                      = N − k − 1.

¹Technical note: I maintain that no column of X is either time-constant or already in group-mean deviations, so that both b_LSDV and b_B are uniquely defined (in fact, with such an assumption X*'P_[D]X* and X'M_[D]X are both non-singular). Indeed, this is only made for simplicity, since it is possible to prove that s²_B and s²_LSDV are uniquely defined even if b_LSDV and b_B are not. The proof requires that all inverse matrices in the residual formulas are replaced with generalized inverse matrices. But don't worry, I won't pursue it further.
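The Swamy-Arora variance components can be recovered in a simulation (my own sketch; numpy in place of the notes' Stata, with the within and between regressions computed by brute force):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, k = 200, 5, 2
s2_eps, s2_u = 1.0, 0.5
X = rng.standard_normal((N * T, k))
Xs = np.column_stack([np.ones(N * T), X])
D = np.kron(np.eye(N), np.ones((T, 1)))
y = Xs @ np.array([0.5, 1.0, -1.0]) + D @ (np.sqrt(s2_u) * rng.standard_normal(N)) \
    + np.sqrt(s2_eps) * rng.standard_normal(N * T)

P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P

# within residuals -> estimate of sigma_eps^2
bw = np.linalg.lstsq(M @ X, M @ y, rcond=None)[0]
ew = M @ y - M @ X @ bw
s2_lsdv = ew @ ew / (N * T - N - k)

# between residuals -> estimate of sigma_1^2 = sigma_eps^2 + T*sigma_u^2
bb = np.linalg.lstsq(P @ Xs, P @ y, rcond=None)[0]
eb = P @ y - P @ Xs @ bb
s2_B = eb @ eb / (N - k - 1)
```

The two estimates can then be plugged into Ω^{-1} (or into the quasi-demeaning factor σ_ε/σ_1) to obtain b_FGLSRE.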
7.4. STATA IMPLEMENTATION OF STANDARD PANEL DATA ESTIMATORS 130
Exercise. Prove that in the RE model

E[e'_LSDV e_LSDV / (NT − N − k) | X*] = σ_ε²

(hint: follow the same steps as above, noticing that M_[D]w = M_[D]ε.)
Exercise 73. Prove that the formula for the subvector of b_GLSRE estimating the β vector is

(7.3.8)  b_GLSRE,β = [X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT]))X]^{-1} X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT])) y.

Hint: show that

[M_[D] + (σ_ε/σ_1)P_[D]] M_[1_NT] [M_[D] + (σ_ε/σ_1)P_[D]] = M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT])

and eventually

b_GLSRE,β = [X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT]))X]^{-1} X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT])) y.
Both fixed effects and random effects estimators are implemented through the Stata com-
mand xtreg, with the usual Stata syntax for regression commands: the command is followed
by the name of the dependent variable and then the list of regressors. The noconstant option
is not admitted in this case.
As a preliminary step, however, a panel data declaration is needed to make Stata aware
of which variables in our data identify time and individuals. Suppose that in our data the
individual variable is named id and the time variable time, then the panel data declaration is
carried out by the instruction
xtset id time
The random effect estimator is the default of xtreg, while the fixed effects (LSDV) esti-
mator requires the option fe.
Sometimes, you may find it convenient to implement the FE and RE estimators by hand, using regress rather than xtreg. The greater computational effort may pay for the simple reason that regress, being the most popular estimation command in Stata, is updated more frequently to accommodate the most recent developments in statistics and econometrics, and so typically has more options than any other estimation command in Stata. To implement b_LSDV and a_LSDV at once you may just apply regress to the LSDV model (7.2.3). This requires generating a full set of individual dummies from the individual identifier id in your panel. This is done through the tabulate command with the generate() option, as follows

tabulate id, generate(id_)

where id_ is a name of choice. If N equals, say, 100, tabulate will add the full set of 100 individual dummies to your data, with names id_1, id_2, ..., id_100, and you can just treat them as regressors in a regress instruction to get b_LSDV as the coefficient estimates for the X variables and a_LSDV as the coefficient estimates for the id_1-id_100 variables. Degrees of freedom are correctly calculated as NT − N − k and so no correction of standard errors is needed. Notice that if you include all 100 dummies, then the constant term should be removed by the noconstant option. Alternatively, you can leave it there and include N − 1 dummies. While the b_LSDV estimates remain unchanged, the coefficient estimates on the included dummies do not. The latter must now be thought of as contrasts with respect to
7.5. TESTING FIXED EFFECTS AGAINST RANDOM EFFECTS MODELS 132
the constant estimate, which turns out to equal the individual effect estimate peculiar to
the individual excluded from the regression, who is therefore treated as the base individual.
Nothing is lost by choosing either identification strategy.
When N is large the foregoing regress strategy is not practical. The b_LSDV estimator can, then, be manually implemented by applying the within transformation, carrying out OLS on the transformed model and then correcting standard errors appropriately. Implementation of b_FGLSRE by hand is trickier and goes through the following steps: 1) get the two variance-components estimates from the within and between regressions; 2) transform the variables (including the constant) in partial deviations; and 3) apply OLS to the transformed variables. Details can be found in a Stata do-file available on the learning space.

I recommend always using the official xtreg command to implement the standard panel data estimators in empirical applications, unless it is strictly necessary to do otherwise (for example, if I explicitly ask you to!).
As Hausman (1978) and Mundlak (1978) independently found (in two papers that appeared in the same Econometrica issue!), the RE model is a special case of the FE model. In fact, while in the former model assumption RE.2 restricts the relationship between the random individual components, u, and X (E(u|X) = 0), the latter leaves it completely unrestricted. In consequence, the RE model is nested into the FE model, so that a test discriminating between them can be easily implemented with E(u|X) = 0 as the null hypothesis.
I present here two popular tests that, moving from the foregoing consideration, can provide
some guidance in the choice between RE and FE models.
7.5.1. The Hausman test. Under H_o: E(u|X) = 0, both the LSDV and FGLS-RE estimators are consistent for N → ∞, but the LSDV estimator is inefficient: redundant individual effects are included in the regression when they could have rather been regarded as random disturbances, saving on degrees of freedom. On the other hand, if H_o is not true, the LSDV estimator remains consistent, but FGLS does not, undergoing an omitted-variable bias. The basic idea of the Hausman test (Hausman, 1978), therefore, is that under H_o the difference between the two estimators should not be significantly different from zero in large samples.
Hausman proves that, under RE.1-RE.3, such difference can be measured by the statistic

H = (b_LSDV − b_FGLSRE)' [Âvar(b_LSDV − b_FGLSRE)]^{-1} (b_LSDV − b_FGLSRE).

Hausman also provides a useful computational result. He shows that, since b_FGLSRE is asymptotically efficient and b_LSDV is inefficient under the null,

Acov(b_LSDV, b_FGLSRE) = Avar(b_FGLSRE),

so

Acov(b_LSDV − b_FGLSRE, b_FGLSRE) = Acov(b_LSDV, b_FGLSRE) − Avar(b_FGLSRE) = 0

and

Avar(b_LSDV − b_FGLSRE) = Avar(b_LSDV) − Avar(b_FGLSRE).

Hence,

H = (b_LSDV − b_FGLSRE)' [Âvar(b_LSDV) − Âvar(b_FGLSRE)]^{-1} (b_LSDV − b_FGLSRE).
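A numerical sketch of the H statistic (my own illustration, not from the notes; for brevity the true variance components are treated as known, so the RE estimator is the infeasible GLS rather than FGLS):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T, k = 100, 5, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
u = 0.8 * rng.standard_normal(N)              # E(u|X) = 0: the RE null holds
y = X @ np.array([1.0, -0.5]) + D @ u + rng.standard_normal(N * T)

P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P
P1 = np.full((N * T, N * T), 1.0 / (N * T))   # projection on the unit vector

s2_eps, s2_1 = 1.0, 1.0 + T * 0.8 ** 2        # true variance components, taken as known
A = M + (s2_eps / s2_1) * (P - P1)            # operator from (7.3.8)

b_fe = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
b_re = np.linalg.solve(X.T @ A @ X, X.T @ A @ y)

V_fe = s2_eps * np.linalg.inv(X.T @ M @ X)
V_re = s2_eps * np.linalg.inv(X.T @ A @ X)
d = b_fe - b_re
H = d @ np.linalg.solve(V_fe - V_re, d)       # ~ chi2(k) under the null
```

Note that V_fe − V_re is positive definite here, so no generalized inverse is needed; with aggregate or time-constant regressors this can fail, as discussed below.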
Wooldridge (2010) (pp. 328-334) points out two difficulties with the Hausman test.
First, Avar(b_LSDV) − Avar(b_FGLSRE) is singular if X includes aggregate variables, such as time dummies. Therefore, along with the coefficients on time-constant variables, also those on aggregate variables must be excluded from the Hausman statistic.

Second, and more importantly, if RE.3 fails, then, on the one hand, the asymptotic distribution of H is not standard even if RE.2 holds, so that H would be of little guidance in detecting violations of RE.2, with an actual size that may be significantly different from the nominal size. On the other hand, H is designed to detect violations of RE.2 and not of RE.3. In fact, if RE.2 holds, both LSDV and FGLS-RE are consistent, regardless of RE.3, and H converges in distribution rather than diverging, which means that the probability of rejecting RE.3 when it is false does not tend to unity as N → ∞, making H inconsistent. The solution is thus to consider H as a test of RE.2 only, but in a version that is robust to violations of RE.3.
The approach I describe next is well suited to solve both difficulties at once.
Exercise 74. Given two random vectors w, z of the same dimension and two constants a and b, prove that

(1) Var(aw + bz) = a² Var(w) + b² Var(z) + ab [Cov(w, z) + Cov(w, z)']

and

(2) Cov(aw + bz, z) = a Cov(w, z) + b Var(z).

Solution. (1) By definition,

Var(aw + bz) = E[(aw + bz)(aw + bz)'] − E(aw + bz) E(aw + bz)'.

Expanding the two terms and collecting gives

a² [E(ww') − E(w)E(w)'] + b² [E(zz') − E(z)E(z)'] + ab [E(wz') − E(w)E(z)'] + ab [E(zw') − E(z)E(w)'],

which is the stated result. (2) By definition,

Cov(aw + bz, z) = E[(aw + bz)z'] − E(aw + bz)E(z)' = a[E(wz') − E(w)E(z)'] + b[E(zz') − E(z)E(z)'] = a Cov(w, z) + b Var(z).
We now verify that, under RE.1-RE.3, Cov(b_LSDV, b_GLSRE|X*) = Var(b_GLSRE|X*), so that

Var(b_LSDV − b_GLSRE|X*) = Var(b_LSDV|X*) − Var(b_GLSRE|X*),

where b_GLSRE denotes, as in (7.3.8), the subvector estimating β. First, using M_[D]D = 0,

b_LSDV − β = (X'M_[D]X)^{-1} X'M_[D] w = (X'M_[D]X)^{-1} X'M_[D] ε.

To save on notation, let A ≡ M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT]), so that, by (7.3.8), b_GLSRE − β = (X'AX)^{-1} X'A w. Therefore,

Cov(b_LSDV, b_GLSRE|X*) = E[(X'M_[D]X)^{-1} X'M_[D] ε w' A X (X'AX)^{-1} |X*]
= (X'M_[D]X)^{-1} X'M_[D] E(εw'|X*) A X (X'AX)^{-1}
= σ_ε² (X'M_[D]X)^{-1} X'M_[D] A X (X'AX)^{-1}
= σ_ε² (X'M_[D]X)^{-1} X'M_[D] X (X'AX)^{-1}
= σ_ε² (X'AX)^{-1} = Var(b_GLSRE|X*),

where the third equality uses E(εw'|X*) = E(εε'|X*) + E(εu'|X*)D' = σ_ε² I_NT and the fourth uses M_[D]A = M_[D] (since M_[D]P_[D] = 0 and M_[D]P_[1_NT] = 0). Finally, by Exercise 74(1) with a = 1 and b = −1,

Var(b_LSDV − b_GLSRE|X*) = Var(b_LSDV|X*) + Var(b_GLSRE|X*) − Cov(b_LSDV, b_GLSRE|X*) − Cov(b_LSDV, b_GLSRE|X*)'.

Since RE.1-RE.3 hold, Cov(b_LSDV, b_GLSRE|X*) = Var(b_GLSRE|X*), which completes the proof.
7.5.2. The Mundlak test. Mundlak (1978) asks the following question. Is it possible to find an estimator that is more efficient than LSDV within a framework that allows correlation between the individual effects, taken as random variables, and X? To provide an answer, he starts from model (7.2.3) and supposes that the individual effects are linearly related to the group means of the regressors:

α = 1_N α_0 + (D'D)^{-1}D'X γ + u,

with E(α|X) = 1_N α_0 + (D'D)^{-1}D'X γ, and so E(u|X) = 0. Pre-multiplying both sides of the foregoing equation by D and then replacing the right-hand side of the resulting equation into (7.2.3) yields

(7.5.1)  y = 1_NT α_0 + Xβ + P_[D]X γ + Du + ε,

which is evidently a RE model extended to the inclusion of the P_[D]X regressors. Model (7.5.1) arises from (7.2.3) through a restriction and hence seems promising for more efficient estimates. But this is not the case. Mundlak proves, in fact, that FGLS-RE applied to equation (7.5.1) returns the LSDV estimator, b_LSDV, for the β coefficients, b_B − b_LSDV for the γ coefficients and b_0B for the constant term α_0, where b_0B and b_B are the components of the Between estimator presented in Section 7.3.1.
To summarize Mundlak's results:
The standard LSDV estimator for β in the FE model (equation (7.2.3)) is the FGLS-RE estimator for β in the general RE model (7.5.1).
The standard FGLS-RE estimator in the RE model (equation (7.3.2)) can be equivalently obtained as a constrained FGLS estimator applied to the general RE model (7.5.1) with constraints γ = 0.
Therefore, the validity of the RE model can be tested by applying a standard Wald test of joint significance for the null hypothesis that γ = 0 in the context of Mundlak's equation (7.5.1):

M = (b_LSDV − b_B)' [Âvar(b_LSDV − b_B)]^{-1} (b_LSDV − b_B).

Under H_0: γ = 0, M →_d χ²(k).
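Mundlak's regression-based logic can be illustrated with OLS (my own sketch, not from the notes; OLS rather than FGLS is used, which suffices to show the exact Frisch-Waugh coincidence of the slope on X with b_LSDV):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T, k = 80, 4, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P

# effects correlated with group means of X: the RE null (gamma = 0) is false
gamma = np.array([1.5, -1.0])
y = X @ np.array([1.0, 0.5]) + P @ X @ gamma \
    + D @ (0.3 * rng.standard_normal(N)) + rng.standard_normal(N * T)

b_lsdv = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)

# OLS on Mundlak's augmented equation: y on constant, X, and P_[D]X
W = np.column_stack([np.ones(N * T), X, P @ X])
coef, *_ = np.linalg.lstsq(W, y, rcond=None)
b_aug, g_aug = coef[1:k + 1], coef[k + 1:]
```

The coefficient on X in the augmented regression equals b_LSDV exactly, since partialling out the constant and P_[D]X from X leaves precisely M_[D]X; a joint significance test on g_aug then implements the Mundlak test.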
7.6. LARGE-SAMPLE RESULTS FOR THE LSDV ESTIMATOR 138
Hausman and Taylor (1981) prove that the statistics H and M are numerically identical (for a simple proof see also Baltagi (2008)). Wooldridge (2010), p. 334, nonetheless, recommends using the regression-based version of the test because it can be made fully robust to violations of RE.3 (for example, heteroskedasticity and/or arbitrary within-group serial correlation) using the standard robustness options available for the regression commands in most econometric packages. In addition, it is relatively easy to detect and solve singularity problems in the context of regression-based tests.
7.5.3. Stata implementation. The Stata implementation of most results in this section
is demonstrated through a Stata do file available on the course learning space.
7.6.1. Introduction. This section proves consistency and asymptotic normality of the
LSDV estimator, then describes the heteroskedasticity and within-group serial correlation
consistent covariance estimator and finally provides a remark for practitioners.
Notation is standard. X denotes the (NT × k) regressors matrix (of all time-varying regressors) and is partitioned by stacking individuals

(7.6.1)  X = (X_1', ..., X_i', ..., X_N')',

and M_[1_T] denotes the (T × T) demeaning matrix

M_[1_T] = I_T − 1_T 1_T' / T.
So,

(7.6.2)  M_[D] = diag(M_[1_T], ..., M_[1_T]),

an (NT × NT) block-diagonal matrix with N diagonal blocks equal to M_[1_T], and

b_LSDV = (X'M_[D]X)^{-1} X'M_[D]y.
SE: E(ε|X) = 0.
The following random sampling assumption is invoked for the asymptotic normality of
bLSDV and the consistency of the bLSDV asymptotic covariance estimator:
RS: There is a sample of size n = N T , such that the elements of the sequence
{(yi Xi ) , i = 1, ..., N } are independent (NB not necessarily identically distributed)
random matrices.
Exercise 77. (This has been done in class) Prove that under LSDV.1 and LSDV.2, plim_{N→∞} b_LSDV = β.
7.6.4. Asymptotic normality. Assumptions LSDV.1 and LSDV.2 hold along with RS and the following: Ω ≡ E(εε'|X) = diag(Ω_1, ..., Ω_N) is a block-diagonal (NT × NT) positive definite matrix. Notice that the blocks of Ω are arbitrary and heterogenous, so that both arbitrary correlation across the time observations of the same individual (referred to as within-group serial correlation) and heteroskedasticity across individuals and over time are permitted. What is not permitted by the block-diagonal structure is correlation of the ε realizations across different individuals.

Now focus on the generic individual i = 1, ..., N and notice that, given the block-diagonal form of M_[D] as in (7.6.2),
M_[D]X = diag(M_[1_T], ..., M_[1_T]) (X_1', ..., X_i', ..., X_N')' = ((M_[1_T]X_1)', ..., (M_[1_T]X_i)', ..., (M_[1_T]X_N)')'.
The proof of asymptotic normality for b_LSDV parallels that in 6.2.2, with the only difference that the random objects from which we start are not (k × 1) vectors at the observation level but (k × 1) vectors at the individual level, X_i'M_[1_T]ε_i, i = 1, ..., N.
Note that E(X_i'M_[1_T]ε_i) = 0 given SE, so that

Var(X_i'M_[1_T]ε_i) = E(X_i'M_[1_T] ε_i ε_i' M_[1_T] X_i).

Therefore,

lim_{N→∞} (1/N) Σ_{i=1}^N Var(X_i'M_[1_T]ε_i) = lim_{N→∞} E(X'M_[D] Ω M_[D] X / N) ≡ Q*,

which is a finite matrix by assumption LSDV.1, so that the Lindeberg-Feller theorem applies to yield

√N (1/N) Σ_{i=1}^N X_i'M_[1_T]ε_i = X'M_[D]ε / √N →_d N(0, Q*).
Finally, since

√N (b_LSDV − β) = (X'M_[D]X / N)^{-1} (X'M_[D]ε / √N) →_d Q^{-1} N(0, Q*),

we have

√N (b_LSDV − β) →_d N(0, Q^{-1} Q* Q^{-1})

and

(7.6.3)  Avar(b_LSDV) = (1/N) Q^{-1} Q* Q^{-1}.
7.7. A ROBUST COVARIANCE ESTIMATOR 142
Letting Ω̂ denote the block-diagonal matrix with diagonal blocks e_LSDV,i e'_LSDV,i, i = 1, ..., N, where e_LSDV,i is the (T × 1) vector of LSDV residuals for individual i, a consistent estimator for the asymptotic covariance matrix of b_LSDV in equation (7.6.3) is given by the White estimator:

(7.7.1)  Âvar(b_LSDV) = (X'M_[D]X)^{-1} X'M_[D] Ω̂ M_[D]X (X'M_[D]X)^{-1}.
Remark 78. The estimator in (7.7.1) is robust to arbitrary heteroskedasticity and within-group serial correlation. Stock and Watson (2008) prove that in the LSDV model the White estimator correcting for heteroskedasticity only, where Ω̂ is a diagonal matrix with generic element e²_LSDV,it (see the first formula of section 9.6.1 in Greene (2008)), is inconsistent for N → ∞. The crux of Stock and Watson's argument is essentially algebraic, in that demeaned residuals are correlated over time by construction and this correlation does not vanish for N → ∞. The recommendation for practitioners is then to correct for both heteroskedasticity and within-group serial correlation using the estimator (7.7.1), which is not affected by the Stock and Watson critique.
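A sketch of the cluster-robust estimator (7.7.1) (my own numpy illustration; the Hadamard form with the indicator matrix DD' anticipates the notation of Chapter 8 and is checked against the explicit cluster-by-cluster sum):

```python
import numpy as np

rng = np.random.default_rng(8)
N, T, k = 60, 5, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
y = X @ np.array([1.0, 1.0]) + D @ rng.standard_normal(N) \
    + rng.standard_normal(N * T)

M = np.eye(N * T) - D @ np.linalg.inv(D.T @ D) @ D.T
Xw = M @ X
b = np.linalg.solve(Xw.T @ Xw, Xw.T @ M @ y)
e = M @ y - Xw @ b                            # LSDV residuals

# "meat": X'M_[D] Omega_hat M_[D] X with blocks e_i e_i'
S = D @ D.T                                   # entry 1 iff same individual
meat = Xw.T @ (np.outer(e, e) * S) @ Xw
bread = np.linalg.inv(Xw.T @ Xw)
V_cluster = bread @ meat @ bread

# equivalent cluster-by-cluster accumulation
meat_loop = np.zeros((k, k))
for i in range(N):
    idx = slice(i * T, (i + 1) * T)
    s_i = Xw[idx].T @ e[idx]
    meat_loop += np.outer(s_i, s_i)
```

The Hadamard-product form and the loop deliver the same matrix, which is symmetric positive semi-definite by construction.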
Remark 79. In Stata the robust covariance matrix of LSDV is computed easily by using
the xtreg command with the options fe and vce(cluster id), where id is the name of the
individual categorical variable in your Stata data set.
A similar correction can be carried out for POLS and FGLS-RE. For POLS we have

Âvar(b_POLS) = (X*'X*)^{-1} X*'Ω̂X* (X*'X*)^{-1},

where

Ω̂ = e_POLS e'_POLS ⊙ DD',

with ⊙ the element-by-element (Hadamard) matrix product and the POLS residual vector defined as in equation (7.2.10), whereas for FGLS-RE we have

Âvar(b_FGLSRE) = (X*'Ω̂_RE^{-1}X*)^{-1} X*'Ω̂_RE^{-1} Λ̂ Ω̂_RE^{-1} X* (X*'Ω̂_RE^{-1}X*)^{-1},

where Ω̂_RE is the estimated RE covariance matrix of Section 7.3.1,

Λ̂ = e_FGLSRE e'_FGLSRE ⊙ DD',

and

e_FGLSRE = y − X* b_FGLSRE.
Remark 80. In Stata the robust asymptotic covariance matrices of POLS and FGLS-RE are estimated by using, respectively, the regress and the xtreg, re commands, both with the option vce(cluster id), as in the LSDV case.
All of the methods so far have been described with a balanced panel data set in mind, but
nothing prevents applying the same methods to unbalanced panels (different numbers of time
observations across individuals).
Unbalanced panels only require a slight change in notation. As always we index individuals
by i = 1, ..., N , but now the size of each individual cluster, or group, of observations varies
across individuals and so the time index is t = 1, ..., Ti . This implies the following three facts.
(1) As in balanced panels, each observation in the data is uniquely identified by the two indexes: the pair (i, t) identifies the t-th observation of the i-th individual.
(2) Differently from balanced panels, the group size, T_i, is no longer constant across clusters.
(3) Differently from balanced panels, the sample size is n = Σ_{i=1}^N T_i.

The LSDV estimator is implemented without any problem either creating individual dummies or taking variables in group-mean deviations, where group means are at the individual level. The random effect estimator requires only some algebraic modifications in the formulas allowing for unbalancedness. The Arellano estimator also requires simple modifications in notation to accommodate unbalancedness: there is now a (T_i × 1) LSDV residual vector given by

e_LSDV,i = M_[1_{T_i}] (y_i − X_i b_LSDV).
8.1. Introduction
The panel-data sets considered in these notes, with a large individual dimension and a small time dimension, are an example of one-way clustering. If the data-set is balanced, there are n = NT observations clustered into N individual groups, each comprising T observations. If the data-set is unbalanced, as is often the case with real-world panels, there are n = Σ_{i=1}^N T_i observations clustered into N individual groups of varying sizes.
8.2. TWO-WAY CLUSTERING 146
Clustering can occur along more than one dimension. In a student survey, for example, there could be an additional level of clustering given by teachers, or classes, within schools. Similarly, patients can be clustered along the two dimensions, not necessarily nested, of doctors and hospitals. In a cross-sectional data-set of bilateral trade flows, the cross-sectional units are the pairs of countries and these are naturally clustered along two dimensions: the first and the second country in the pair (Cameron et al., 2011). In matched employer-employee data there is the worker dimension, the firm dimension and the time dimension (Abowd et al., 1999).
Is it possible to do inference that is robust to multi-way clustering as we do inference that is robust to one-way clustering? A recent paper by Cameron et al. (2011) offers a computationally simple solution extending the White estimator to multi-way contexts. In essence, their method boils down to computing a number of one-way robust covariance estimators, which are then combined linearly to yield the multi-way robust covariance estimator. It is, therefore, crucial for the accuracy of the multi-way estimator that the one-way estimators be also accurate, and so that the data-set have dimensions with a large number of clusters. Such an asymptotic requirement makes the analysis in Cameron et al. (2011) not well suited for dealing with both individual- and time-clustering in the typical micro-econometric panel data set, where T is fixed. Indeed, their Monte Carlo experiments show that the robust covariance estimator has good finite-sample properties in data-sets with dimensions of 100 clusters.

To illustrate the method I focus on two-way clustering, using a notation that is close to that in Cameron et al. (2011).
Notation is general enough to embrace cases in which cluster affiliations are not sufficient to uniquely identify an observation. There is a data-set with n observations indexed by i ∈ {1, ..., n}. Observations are clustered into two dimensions, g ∈ {1, ..., G} and h ∈ {1, ..., H}. Asymptotics is for both G → ∞ and H → ∞. The data-sets that I have in mind are, for example,

surveys of students with, at least, moderately large numbers of teachers and schools;
surveys of patients with, at least, moderately large numbers of doctors and hospitals;
bilateral trade-flows data with, at least, a moderately large number of countries;
matched employer-employee data with, at least, moderately large numbers of firms and workers.

For each dimension, it is known to which cluster a given observation i = 1, ..., n belongs. This information is contained in the mappings g: {1, ..., n} → {1, ..., G} and h: {1, ..., n} → {1, ..., H}. From the mappings g and h we can also construct the (n × G) dummy variables matrix D_G and the (n × H) dummy variables matrix D_H, as the following definitions indicate:

d_ig = 1 if g(i) = g and d_ig = 0 otherwise, i ∈ {1, ..., n}, g ∈ {1, ..., G};
d_ih = 1 if h(i) = h and d_ih = 0 otherwise, i ∈ {1, ..., n}, h ∈ {1, ..., H}.

Then, D_G and D_H are the (n × G) and (n × H) matrices with (i, g) element d_ig and (i, h) element d_ih, respectively.
Given g and h, we can define an intersection dimension, say GH, such that each cluster in
GH contains only observations that belong to one unique cluster in {1, ..., G} and one unique
cluster in {1, ..., H}. This yields the matrix of dummy variables D_GH. By construction, each
row of D_GH has a single unit entry. For example, if n = 6 and the intersection dimension
comprises four clusters, with observations 1 and 2 in the first cluster, observation 3 in the
second, observations 4 and 5 in the third and observation 6 in the fourth, then

         1 0 0 0
         1 0 0 0
         0 1 0 0
D_GH =   0 0 1 0
         0 0 1 0
         0 0 0 1
This framework allows that in a survey of patients, for example, there could be more than
one patient admitted to the same hospital and under the assistance of the same doctor. Or,
similarly, that in a panel data matching workers with firms the same worker can move across
firms over time or that, conversely, the same firm may employ different workers over time.
Then, define three n × n indicator matrices: S_G = D_G D_G', S_H = D_H D_H' and S_GH = D_GH D_GH'.
It is easy to verify that:
- S_G has ij-th entry equal to one if observations i and j share a cluster g in {1, ..., G}; zero otherwise.
- S_H has ij-th entry equal to one if observations i and j share a cluster h in {1, ..., H}; zero otherwise.
- S_GH has ij-th entry equal to one if observations i and j share both a cluster g in {1, ..., G} and a cluster h in {1, ..., H}; zero otherwise.
Also, the iith entries in S G , S H and S GH equal one for all i = 1, ..., n, so the three indicator
matrices have main diagonals with all unity elements.
Consider now a linear regression model allowing for two-way clustering

y_i = x_i'β + ε_i, i = 1, ..., n,

with conditional error covariance matrix

(8.2.1)   E(εε'|X) = E(εε' ⊙ S_G | X) + E(εε' ⊙ S_H | X) - E(εε' ⊙ S_GH | X),

where the symbol ⊙ stands for the element-by-element matrix product (also known as the Hadamard
product) between matrices of equal dimension (verify the equivalence of LRM.4b and (8.2.1)).
As we know, OLS, in this case, is consistent and unbiased but not efficient. More importantly,
OLS standard errors are biased, and so we need a two-way robust covariance estimator
for inference. The covariance estimator devised by Cameron et al. (2011) is the combination
of three one-way covariance estimators à la White. It is constructed along the following steps.
Carry out OLS and obtain the OLS residuals e. The first one-way covariance estimator is

Avar^G(b) = (X'X)^{-1} X' Ω^G X (X'X)^{-1}, with Ω^G = ee' ⊙ S_G,

where Avar^G(b) is a White estimator that is robust to clustering only along the G dimension.
The second one-way covariance estimator is

Avar^H(b) = (X'X)^{-1} X' Ω^H X (X'X)^{-1}, with Ω^H = ee' ⊙ S_H,

where Avar^H(b) is a White estimator that is robust to clustering only along the H dimension.
The third one-way covariance estimator is

Avar^GH(b) = (X'X)^{-1} X' Ω^GH X (X'X)^{-1},
8.3. STATA IMPLEMENTATION 151
with Ω^GH = ee' ⊙ S_GH, where Avar^GH(b) is a White estimator that is robust to clustering only
along the GH dimension.
Finally, the two-way robust covariance estimator is

(8.2.2)   Avar(b) = Avar^G(b) + Avar^H(b) - Avar^GH(b).

Avar(b) is robust to clustering along both the G and H dimensions and is the estimator that
is used to construct our robust tests.
Remark 82. Writing Avar(b) as

Avar(b) = (X'X)^{-1} X' (Ω^G + Ω^H - Ω^GH) X (X'X)^{-1}

and then considering equation (8.2.1) uncovers the analogy principle on which the two-way
robust covariance estimator rests.
Remark 83. Cameron et al. (2011) also present a general multi-way version of Avar(b),
which is derived from a simple extension of the foregoing analysis. The additional cost is only
in terms of a more cumbersome notation. For the formulas I refer you to that paper.
\
While there is no official command for the two-way Avar (b) in Stata, it can be simply
implemented by means of three one-way OLS regressions. Suppose that in our data-set the
two categorical variables for dimensions G and H are called doctor and hospital. You can
\
assembleAvar (b) along the following steps.
(1) Create the categorical variable for the intersection dimension, G ∩ H, through the
following instruction: egen doc_hosp = group(doctor hospital)
(2) Implement the first regress instruction with the option vce(cluster doctor) and
then save the covariance matrix estimate through the command: matrix V_d=e(V)
(V_d is a name of choice).
(3) Implement the second regress instruction with the option vce(cluster hospital)
and then save the covariance matrix estimate with: matrix V_h=e(V) (V_h is a
name of choice).
(4) Implement the last regress instruction with the option vce(cluster doc_hosp) and
then save the covariance matrix estimate with: matrix V_dh=e(V) (V_dh is a name
of choice).1
(5) Finally, work out the two-way robust covariance estimator by executing: matrix
V_robust=V_d+V_h-V_dh (V_robust is a name of choice). To see the content of
V_robust do: matrix list V_robust. The robust standard errors are just the
square roots of the main diagonal elements in V_robust.
1It may happen that clusters in the intersection dimension are all singletons (i.e. each cluster has only one
observation). In this case Stata will refuse to work with the option vce(cluster doc_hosp). This is no
problem, though, since correcting standard errors when clusters are singletons is clearly equivalent to correcting
for heteroskedasticity. Therefore, instead of vce(cluster doc_hosp), simply write vce(robust).
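The algebra of (8.2.2) and the five steps above can also be checked against a direct matrix computation. The following is a minimal NumPy sketch on invented data (the cluster counts, the design and the helper name one_way are purely illustrative): it forms each one-way estimator as (X'X)^{-1} X'(ee' ⊙ S)X (X'X)^{-1} and combines the three exactly as matrix V_robust=V_d+V_h-V_dh does in Stata.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
G, H = 10, 8                       # cluster counts along the two dimensions
g = rng.integers(0, G, n)          # cluster id along dimension G (e.g. doctor)
h = rng.integers(0, H, n)          # cluster id along dimension H (e.g. hospital)
gh = g * H + h                     # intersection dimension G ∩ H
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)      # OLS coefficients
e = y - X @ b                              # OLS residuals

def one_way(ids):
    """White estimator robust to clustering along one dimension:
    (X'X)^{-1} X' (ee' ⊙ S) X (X'X)^{-1}, with S the cluster indicator matrix."""
    S = (ids[:, None] == ids[None, :]).astype(float)
    Omega = np.outer(e, e) * S             # ee' ⊙ S (Hadamard product)
    XX_inv = np.linalg.inv(X.T @ X)
    return XX_inv @ X.T @ Omega @ X @ XX_inv

# Two-way robust covariance estimator, equation (8.2.2)
V_twoway = one_way(g) + one_way(h) - one_way(gh)
se = np.sqrt(np.diag(V_twoway))            # two-way robust standard errors
```

Canned routines avoid the n × n indicator matrices by summing within-cluster score outer products, which gives the same X'(ee' ⊙ S)X term cluster by cluster.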
CHAPTER 9
9.1. Introduction
two variables is likely to be unrelated with the individual education levels. Such information,
so close to the latent variable, is often unavailable, though.
If the latent variables are invariant across individuals and/or over time and a panel data
set is available, the endogeneity problem is solved by applying the panel-data methods introduced
in Chapter 7. But panel data are not always available and, even when they are, the disturbing
omitted factors may not meet the time-constancy requirement. For example, idiosyncratic
productivity shocks may well be related to input factors in the estimation of a production
function.
Neither proxy variables nor panel-data methods are generally usable when endogeneity
springs from reverse causality. In the strip, Wally questions the exogeneity of the exercise
variable as a determinant of individual health, hinting at an endogeneity bias due to reverse
causality. If the exercise activity is indeed affected by the health status, exercise would depend
on the observable and unobservable determinants of health, and so cannot be exogenous.
Instrumental variables (IV) and Generalized Method of Moments (GMM) estimators offer
a general solution to the endogeneity problem. Roughly speaking, they solve the endogeneity
problem in two stages. The first stage attempts to identify the exogenous-variation components
of x through a set of exogenous variables, some of which are external to the model,
called instrumental variables. The second stage applies regression analysis using only the first-stage
exogenous components as explanatory variables. IV and GMM methods are preferred
tools of econometric analysis, compared to alternative techniques, since often the first stage
can be justified on the grounds of economic theory.
There are various IV-GMM applications showing the methods of this chapter: IV_GMM.do
using mus06data.dta, IV_GMM_panel.do using costfn.dta, and IV_GMM_DPD.do and abest.do,
both using abdata.dta. There is also a Monte Carlo application implemented by bias_in_AR1_LSDV.do.
9.2. THE METHOD OF MOMENTS 155
9.2.1. The linear regression model. Consider the linear model of Chapter 1 and the
system of moment conditions (1.2.3)

E(xy) = E(xx')β.

So, the true coefficient vector, β, solves the population moment conditions and is equal to
β = [E(xx')]^{-1} E(xy). By the analogy principle a consistent estimator for β, b, will satisfy
the system of k analog sample moment conditions:

(1/n) Σ_{i=1}^n x_i (y_i - x_i'b) = 0.

Hence,

b = (Σ_{i=1}^n x_i x_i')^{-1} Σ_{i=1}^n x_i y_i = (X'X)^{-1} X'y,

which is exactly the OLS estimator.
9.2.2. The Instrumental Variable (IV) regression model in the just-identified
case. Consider the linear model of Chapter 1 but without assumption P.3, E(ε|x) = 0, or
even the weaker P.3b, E(xε) = 0. This means that some of the variables in x are potentially
endogenous, that is, related in some way to ε. Assume, instead, conditional mean independence
for an L × 1 vector of variables z, that is E(ε|z) = 0, with L = k. The vector z is generally
different from x; if it is not, then we are back to the classical regression model and there is no
endogeneity problem. Replace assumption P.2 with det E(zx') ≠ 0. Then, as before, using the LIE,

E(zε) = E_z[E(zε|z)] = E_z[z E(ε|z)] = 0.

It follows that

E[z(y - x'β)] = 0,

or equivalently

E(zy) = E(zx')β.
So, the true coefficient vector, β, solves the population moment conditions and is equal to
β = [E(zx')]^{-1} E(zy). By the analogy principle a consistent estimator for β, b, will satisfy
the system of k analog sample moment conditions:

(1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = 0.

Hence,

b = (Σ_{i=1}^n z_i x_i')^{-1} Σ_{i=1}^n z_i y_i = (Z'X)^{-1} Z'y,

which is the classical IV estimator.
The intuition is straightforward: since the true coefficients solve the population moment
conditions, if the sample moments provide good estimates for the population moments, then
one might expect that the estimator solving the sample moment conditions will provide good
estimates of the true coefficients.
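This intuition can be checked in a small simulation (the data-generating process below is invented for illustration): OLS is biased because x is correlated with the error through a common shock, while the sample-moment solution b = (Z'X)^{-1} Z'y recovers the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
z = rng.normal(size=n)                  # instrument: independent of the error
u = rng.normal(size=n)                  # common shock driving endogeneity
x = 0.8 * z + u + rng.normal(size=n)    # regressor correlated with the error via u
eps = u + rng.normal(size=n)            # error: correlated with x, not with z
beta = 2.0
y = beta * x + eps

Z = np.column_stack([np.ones(n), z])    # the constant instruments itself
X = np.column_stack([np.ones(n), x])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # inconsistent: ignores endogeneity
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # classical IV: (Z'X)^{-1} Z'y
```

With this design the OLS slope converges to beta plus cov(x, ε)/var(x) ≈ 0.38, while the IV slope converges to beta.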
What if there are more moment conditions than unknown parameters, that is if L > k?
Then we turn to GMM estimation.
9.2.3. The GMM estimator. Collect the L population moments into the vector

m(θ) ≡ E[f(w, θ)],

with the true parameter vector θ solving

m(θ) = 0.

The sample analog is

m̂(θ̂) ≡ (1/n) Σ_{i=1}^n f(w_i, θ̂);

hence there are L equations and k unknowns, so that for L > k no estimator θ̂ can solve the system of
sample moment conditions exactly. Instead, there exists a θ̂ that can make m̂(θ̂) as close to zero as
possible:

(9.2.1)   θ̂_GMM = arg min_{θ̂} Q̂(θ̂),

where Q̂(θ̂) ≡ m̂(θ̂)' A m̂(θ̂) is a quadratic criterion function of the sample moments and
A is a positive definite matrix weighting the squares and the cross-products of the sample
moments in Q̂(θ̂).
Note that Q̂(θ̂) ≥ 0 and, since A is positive definite, Q̂(θ̂) = 0 only if m̂(θ̂) = 0.
Thus, in the just-identified linear model, to minimize Q̂(θ̂) is to make it exactly zero, and the
GMM estimator coincides with the IV estimator of Subsection 9.2.2.
9.2.4. The optimal weighting matrix. The weighting matrix A can be chosen so that
the GMM estimator is asymptotically efficient. Hansen (1982) proves that the weighting
matrix in the optimal two-step GMM estimator is

(9.2.2)   A = (Z' Ω̂ Z / n)^{-1},

a consistent estimator of the inverse of the asymptotic covariance matrix of the sample moments,
Var(n^{-1/2} Z'ε). The matrix Ω̂ in Equation (9.2.2) depends on the properties of ε.
It may be a known matrix, in which case the resulting GMM estimator is said to be a one-step
estimator. More often Ω is unknown and Ω̂ is generally computed through the residuals from
a consistent one-step GMM estimator. In this case the resulting GMM estimator is said to be a
two-step estimator. For more on the choice of the weighting matrix see Section 9.4.
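The one-step/two-step recipe can be sketched in NumPy as follows. The design (instrument strength, heteroskedasticity pattern, the helper name gmm_linear) is invented for illustration; with the homoskedastic choice A = (Z'Z/n)^{-1} the first step reproduces TSLS, and the second step re-weights with Ω̂ = diag(e²) built from the one-step residuals.

```python
import numpy as np

def gmm_linear(y, X, Z, A):
    """Linear GMM for a given weighting matrix A: b = (X'Z A Z'X)^{-1} X'Z A Z'y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ A @ XZ.T, XZ @ A @ (Z.T @ y))

rng = np.random.default_rng(1)
n = 5_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z1 + 0.5 * z2 + u                              # endogenous regressor
eps = (u + rng.normal(size=n)) * (1 + 0.5 * np.abs(z1))  # heteroskedastic error
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])                # L = 3 > k = 2: overidentified

# One step: homoskedastic choice, A = (Z'Z/n)^{-1}; this is TSLS
A1 = np.linalg.inv(Z.T @ Z / n)
b1 = gmm_linear(y, X, Z, A1)

# Two step: Omega-hat = diag(e^2) from one-step residuals, A = (Z' Omega-hat Z / n)^{-1}
e1 = y - X @ b1
A2 = np.linalg.inv((Z * (e1**2)[:, None]).T @ Z / n)     # Z' diag(e^2) Z / n
b2 = gmm_linear(y, X, Z, A2)
```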
9.2.5. The TSLS estimator. The well-known Two-Stage Least Squares (TSLS) estimator
belongs to the class of one-step GMM estimators. Consider the linear regression
model of Section 9.2.2 with L ≥ k instruments. Then, there are the following population
moments

m(β) ≡ E[z(y - x'β)],

with

E[z(y - x'β)] = 0.

Consider a p.d. weighting matrix A, which we leave generic for the moment. Then, since
m(β)'Am(β) = 0 and m(β_o)'Am(β_o) ≥ 0 for any β_o,

β = argmin_{β_o} m(β_o)' A m(β_o)

or

β = argmin_{β_o} [E(yz') A E(zy) - 2 E(yz') A E(zx') β_o + β_o' E(xz') A E(zx') β_o]
  = [E(xz') A E(zx')]^{-1} E(xz') A E(zy).
The L sample moments are collected into the (L × 1) vector m̂(b):

m̂(b) ≡ (1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = (1/n) Z'(y - Xb).

TSLS sets

A = [(1/n) Σ_{i=1}^n z_i z_i']^{-1} = n (Z'Z)^{-1}.

Then

Q̂(b) = (1/n) (y - Xb)' Z (Z'Z)^{-1} Z' (y - Xb),

with the following normal equations for the minimization problem:

∂Q̂(b)/∂b = -(2/n) X'Z (Z'Z)^{-1} Z'(y - Xb) = 0,

whence

b_TSLS = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y,

or more compactly

b_TSLS = (X' P_[Z] X)^{-1} X' P_[Z] y.
9.3. STATA IMPLEMENTATION OF THE TSLS ESTIMATOR 160
The estimator's name derives from the fact that it is computed in two stages:
(1) Regress each column of X on Z using OLS to obtain the OLS fitted values of X:
Z(Z'Z)^{-1}Z'X = P_[Z]X. Thus, X = P_[Z]X + M_[Z]X, where P_[Z]X is an approximately
exogenous component, whose covariance with ε goes to zero as n → ∞, and
M_[Z]X is a residual, potentially endogenous, component. Only P_[Z]X is used in the
second stage.
(2) Regress y on the fitted values, P_[Z]X, to obtain b_TSLS.
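The two stages can be verified numerically to reproduce the closed-form b_TSLS; the data-generating process below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # L = 3 instruments
u = rng.normal(size=n)
x = Z @ np.array([0.2, 0.7, -0.4]) + u        # endogenous regressor (u also in the error)
X = np.column_stack([np.ones(n), x])          # k = 2 < L = 3: overidentified
y = X @ np.array([1.0, 2.0]) + u + rng.normal(size=n)

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)        # projection matrix P_[Z]

# Closed form: b_TSLS = (X' P_[Z] X)^{-1} X' P_[Z] y
b_formula = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)

# Stage 1: OLS fitted values X_hat = P_[Z] X; Stage 2: OLS of y on X_hat
X_hat = PZ @ X
b_twostage = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```

The two routes agree exactly because P_[Z] is symmetric and idempotent, so X_hat'X_hat = X'P_[Z]X and X_hat'y = X'P_[Z]y.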
If the population moment conditions are true, then Q̂(b_TSLS) should not be significantly
different from zero. This provides a test for the validity of the L - k over-identifying moment
conditions based on the following statistic (Hansen-Sargan test):

S ≡ n Q̂(b_TSLS) ~ χ²(L - k).
Solution: with L = k, Z'X is invertible, so

b_TSLS = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y
       = (Z'X)^{-1} (Z'Z) (X'Z)^{-1} X'Z (Z'Z)^{-1} Z'y
       = (Z'X)^{-1} Z'y.
The notation so far was general enough to accommodate models with and without endogenous
variables. Let's adopt a less generic notation to suit real-world empirical problems, where we
know that there are potentially endogenous variables and which they are.
As before, z contains all available exogenous variables. The list of explanatory variables x is
partitioned into exogenous, x_1, and potentially endogenous, x_2, with k_1 and k_2 indicating
their respective numbers.
First-stage regressions
Instrumented: hi_empunion
Instruments: totchr age female blhisp linc ssiratio
The two-step GMM estimator is implemented by ivregress gmm followed by the names of
the dependent variable, the included exogenous variables and, within parentheses, all the right-hand-side
endogenous variables and the external instruments, as follows
9.4.1. Choosing the weighting matrix. As discussed in Subsection 9.2.4, the weighting
matrix in the optimal two-step GMM estimator is A = (Z' Ω̂ Z / n)^{-1}, a consistent estimator
of the inverse of Var(n^{-1/2} Z'ε), with choices of Ω̂ depending on the properties of ε, as
discussed next.
If ε is homoskedastic and independent, Ω̂ = I, A = n(Z'Z)^{-1} and the resulting GMM estimator
is TSLS. It's implemented through the ivregress gmm option: wmatrix(unadjusted).
If ε is heteroskedastic and independent, Ω̂ is diagonal:

Ω̂ = diag(e_1², e_2², ..., e_n²),
9.4. STATA IMPLEMENTATION OF THE TWO-STEP GMM ESTIMATOR 163
with e_i the residuals from a consistent one-step estimator, i = 1, ..., n. It's implemented through the
ivregress gmm option, wmatrix(robust), which is the default option.
If errors are clustered, then Ω̂ is a block-diagonal matrix with generic block equal to the
outer product of the residuals peculiar to the corresponding cluster:

Ω̂ = diag(Ω̂_1, Ω̂_2, ..., Ω̂_N),

with Ω̂_i = e_i e_i' and e_i = y_i - X_i b_TSLS the vector of TSLS residuals (or those of any consistent
one-step estimator) peculiar to cluster i = 1, ..., N. It's implemented through the
ivregress gmm option: wmatrix(cluster cluster_var). This option is recommended when
ivregress gmm is applied to panel data: with cluster_var equal to the panel identifier in the
data, it calculates a GMM estimator that is optimal under heteroskedasticity and arbitrary
correlation within individual clusters.
In time-series data, when ε is heteroskedastic and serially correlated, the optimal weighting
matrix A may be assembled by using the Newey-West heteroskedasticity- and autocorrelation-consistent
(HAC) estimator. This is implemented by specifying wmatrix(hac kernel #),
which requests a weighting matrix using the specified kernel (see below) with # lags. The
bandwidth of a kernel is equal to the number of lags plus one. Specifying wmatrix(hac
kernel opt) requests an HAC weighting matrix using the specified kernel, with the lag order
selected using Newey and West's (1994) optimal lag-selection algorithm. Specifying
wmatrix(hac kernel) requests an HAC weighting matrix using the specified kernel and n - 2
lags, where n is the sample size. There are three kernels available for HAC weighting matrices:
bartlett (or nwest) requests the Bartlett (Newey-West) kernel; parzen (or gallant)
9.5. ROBUST VARIANCE ESTIMATORS 164
requests the Parzen (Gallant 1987) kernel; quadraticspectral (or andrews) requests the
quadratic spectral (Andrews 1991) kernel.
The less efficient, but computationally simpler and still consistent, TSLS estimator is
often used in estimation. Its robust variance-covariance matrix Var(b_TSLS) is consistently
estimated as

Var(b_TSLS) = (X' P_[Z] X)^{-1} X' P_[Z] Ω̂ P_[Z] X (X' P_[Z] X)^{-1},
9.7. DURBIN-WU-HAUSMAN EXOGENEITY TEST 165
As already noted, if the population moment conditions are true, the minimized GMM
criterion function Q̂(b_TSLS) should not be significantly different from zero. This provides
a test for the validity of the L - k over-identifying moment conditions based on the Hansen-Sargan
statistic

HS = n Q̂(b_TSLS) ~ χ²(L - k).

. estat overid
[output omitted]
In this case the HS test supports the specification.
A conventional Hausman test can always be implemented, based on the Hausman statistic
measuring the statistical difference between the IV and OLS estimates. It is not robust to
heteroskedastic and clustered errors, though. Wu suggests an alternative. But first do this
exercise, which will prove useful in the derivations below.
b_2,TSLS = (X_2' P_[Z] M_[X_1] P_[Z] X_2)^{-1} X_2' P_[Z] M_[X_1] P_[Z] y.

But then P_[Z] = P_[X_1] + P_[M_[X_1] Z_1] also assures that P_[Z] M_[X_1] = P_[M_[X_1] Z_1], proving the result.
The DWH test provides a robust version of the H test. It maintains instrument validity,
E(ε|Z) = 0, and is based on the so-called control-function approach, which recasts the
endogeneity problem as a misspecification problem affecting the structural equation

(9.7.1)   y = Xβ + V_2 ρ + u,

where E(u|X, V_2) = 0 and V_2 is the n × k_2 matrix of the errors in the first-stage equations
of the variables X_2. As such, V_2 is responsible for the endogeneity of X_2.
Replacing V_2 in (9.7.1) with the residuals from the first-stage regressions, V̂_2 = M_[Z] X_2,
makes the DWH test operational as a simple test of joint significance for ρ in the auxiliary
OLS regression

(9.7.2)   y = Xβ + M_[Z] X_2 ρ + u*.
The test works well since, under the alternative of ρ ≠ 0, OLS estimation of the auxiliary
regression yields the TSLS estimators. This is proved as follows. Decomposing X = P_[Z]X + M_[Z]X
in (9.7.2),

y = P_[Z]Xβ + M_[Z]Xβ + M_[Z]X_2 ρ + u*

and so, since M_[Z]X_1 = 0,

y = P_[Z]Xβ + M_[Z]X_2 β_2 + M_[Z]X_2 ρ + u*
y = P_[Z]Xβ + M_[Z]X_2 (β_2 + ρ) + u*
and since P_[Z]X and M_[Z]X_2 are orthogonal, Theorem 8 assures that the OLS estimator
for β is

b_TSLS = (X' P_[Z] X)^{-1} X' P_[Z] y

and also

(β_2 + ρ)^ = (X_2' M_[Z] X_2)^{-1} X_2' M_[Z] y.
Rearrange (β_2 + ρ)^ as follows:

(β_2 + ρ)^ = (X_2' M_[Z] X_2)^{-1} X_2' M_[Z] y
= (X_2' M_[Z] X_2)^{-1} [X_2' M_[X_1] y - X_2' P_[M_[X_1] Z_1] y]
= (X_2' M_[Z] X_2)^{-1} [X_2' M_[X_1] X_2 (X_2' M_[X_1] X_2)^{-1} X_2' M_[X_1] y
  - X_2' P_[M_[X_1] Z_1] X_2 (X_2' P_[M_[X_1] Z_1] X_2)^{-1} X_2' P_[M_[X_1] Z_1] y]
= (X_2' M_[Z] X_2)^{-1} [X_2' M_[X_1] X_2 b_2,OLS - X_2' P_[M_[X_1] Z_1] X_2 b_2,TSLS]
= (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2 b_2,OLS - (X_2' M_[Z] X_2)^{-1} X_2' P_[M_[X_1] Z_1] X_2 b_2,TSLS
= (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2 b_2,OLS + [I - (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2] b_2,TSLS,

where the second equality follows from M_[Z] = M_[X_1] - P_[M_[X_1] Z_1], the third equality is
obtained by premultiplying the two terms in parentheses of the second equality by identity
matrices represented as AA^{-1}, the fourth equality follows from Exercise 85 and the last
from X_2' P_[M_[X_1] Z_1] X_2 = X_2' M_[X_1] X_2 - X_2' M_[Z] X_2. Hence,

(β_2 + ρ)^ = K b_2,OLS + (I - K) b_2,TSLS,

with K ≡ (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2, proving that the test indeed follows the Hausman
test's general principle of assessing the distance between an asymptotically efficient estimator
and a consistent but inefficient estimator under the null hypothesis.
The DWH test can be easily robustified for heteroskedasticity and/or clustered errors by
estimating (9.7.2) with regress and a suitable robust option, vce(robust) or vce(cluster
clustervar ). More easily, DWH can be immediately implemented in Stata through the
ivregress postestimation command estat endogenous.
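The control-function mechanics behind (9.7.2) can be sketched in NumPy (the data-generating process is invented for illustration): regress the endogenous regressor on the instruments, append the first-stage residuals to the structural regression, and look at their coefficient. The sketch also checks the result proved above, that the auxiliary OLS returns exactly the TSLS estimator for β.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
z = rng.normal(size=n)
v2 = rng.normal(size=n)                  # first-stage error: the source of endogeneity
x2 = 0.7 * z + v2                        # endogenous regressor
eps = 0.8 * v2 + rng.normal(size=n)      # structural error, correlated with x2 via v2
y = 1.0 + 2.0 * x2 + eps

X = np.column_stack([np.ones(n), x2])    # X = [X1, X2] with X1 just the constant
Z = np.column_stack([np.ones(n), z])

# First-stage residuals: v2_hat = M_[Z] X2
pi = np.linalg.solve(Z.T @ Z, Z.T @ x2)
v2_hat = x2 - Z @ pi

# Auxiliary regression (9.7.2): y on X and v2_hat; DWH tests the coefficient on v2_hat
W = np.column_stack([X, v2_hat])
coef = np.linalg.solve(W.T @ W, W.T @ y)
rho_hat = coef[-1]                       # nonzero estimate signals endogeneity

# TSLS for comparison: the auxiliary OLS returns exactly b_TSLS for beta
ZZinv = np.linalg.inv(Z.T @ Z)
XZ = X.T @ Z
b_tsls = np.linalg.solve(XZ @ ZZinv @ XZ.T, XZ @ ZZinv @ (Z.T @ y))
```

In practice one would take robust standard errors for rho_hat from the auxiliary regression, which is exactly what the regress-based robustification described above delivers.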
. * Robust Durbin-Wu-Hausman test of endogeneity implemented by estat endogenous
. ivregress 2sls ldrugexp (hi_empunion = ssiratio) $xlist, vce(robust)
[robust coefficient table omitted]
Instrumented: hi_empunion
Instruments: totchr age female blhisp linc ssiratio
.
. estat endogenous
Tests of endogeneity
Ho: variables are exogenous
[output omitted]
.
. estat endogenous, forcenonrobust
Tests of endogeneity
Ho: variables are exogenous
[output omitted]
9.8. ENDOGENOUS BINARY VARIABLES 169
. test v1hat
( 1) v1hat = 0
F( 1, 10081) = 26.43
Prob > F = 0.0000
The linear IV-GMM approach outlined so far fits the case of binary endogenous variables,
producing consistent estimates. However, a first-stage regression fully accounting for the
binary structure of the endogenous variables may provide considerable efficiency gains. The
implied (non-linear) model is as follows:

y_i = x_1i' β_1 + x_2i β_2 + ε_i
x*_2i = x_1i' π_1 + z_i' π_2 + v_i
x_2i = 1 if x*_2i > 0, 0 otherwise

(ε_i, v_i)' ~ N(0, Σ), with Σ having diagonal elements σ_ε² and 1 and off-diagonal element σ_εv.
Staiger and Stock's rule of thumb: the partial F statistic in the first-stage regression should
exceed 10. It is simple, but not rigorous; it tends to reject weak instruments too often and has
no obvious implementation when there is more than one endogenous variable.
Two tests by Stock and Yogo (2005) overcome all of the above difficulties. They are both
based on the minimum eigenvalue of the matrix analog of the partial F statistic, a statistic
introduced by Cragg and Donald (1993) to test nonidentification. Importantly, the large-sample
properties of both tests have been derived under the assumption of homoskedastic
and independent errors. Caution must be taken, then, when drawing conclusions from the
tests if the errors are not i.i.d.
Both tests are implemented by the ivregress postestimation command estat firststage.
Montiel Olea and Pflueger (2013) derive a new test for weak instruments that extends that by Stock
and Yogo (2005) to heteroskedasticity and cluster correlation. It is implemented in Stata
by weakivtest after ivregress.
Conditional inference on the coefficients of the endogenous variables in the presence of weak
instruments is implemented through the command condivreg by Mikusheva and Poi (2006). The theory
is reviewed and expanded in Andrews et al. (2007). The command produces three alternative
confidence sets for the coefficient of the endogenous regressor, obtained from the conditional
LR, Anderson-Rubin (option ar) and LM statistics (option lm). The syntax of condivreg is
similar to that of ivregress.
It's a system estimator including structural equations for all endogenous variables. Identification
is ensured by standard (sufficient) rank and (necessary) order conditions. It is seldom
used, as it is inconsistent in the presence of heteroskedastic errors, which is the norm in most
micro applications. The Stata command is reg3.
Consistency of both estimators requires that E (|Z) = 0. These estimators are implemented
in Stata by xtivreg.
Situations in which past decisions have an impact on current behaviour are ubiquitous in
economics. For example, in the presence of input adjustment costs, short-run input demands
depend also on past input levels. In such cases fitting a static model to the data will lead to what
is referred to as dynamic underspecification. With a panel data set, however, it is possible to
implement a dynamic model from the outset in order to describe the phenomena of interest.
To make things simple, let us get started with the simple autoregressive process

(9.13.1)   y_it = γ y_{i,t-1} + ε_it,

t = 1, ..., T, i = 1, ..., N.
Model (9.13.1) can be easily extended to allow for time-invariant individual terms:

(9.13.2)   y_it = γ y_{i,t-1} + η_i + ε_it,

t = 1, ..., T, i = 1, ..., N. In vector notation, stacking time observations for each individual,

y_i = γ y_{-1,i} + η_i 1_T + ε_i,

i = 1, ..., N, where

y_i = (y_i1, ..., y_it, ..., y_iT)',   y_{-1,i} = (y_i0, ..., y_{i,t-1}, ..., y_{i,T-1})',   ε_i = (ε_i1, ..., ε_it, ..., ε_iT)'

are all (T × 1) vectors.
For each individual there are T + 1 observations available in the data set, from yi0 to yiT , but
only T are usable since one is lost to taking lags.
The problem here is that E(ε_it | y_{-1,i}, η_i) = 0 cannot hold for all t = 1, ..., T, that is,
y_{-1,i} is not strictly exogenous. Given (9.13.2), the observation of y_{-1,i} at time t is y_{i,t-1} =
f(y_i0, η_i, ε_i1, ε_i2, ..., ε_{i,t-1}), and so all future observations of y_{-1,i}, from y_it = f(y_i0, η_i, ε_i1, ..., ε_it) to
y_{i,T-1} = f(y_i0, η_i, ε_i1, ..., ε_it, ..., ε_{i,T-1}), depend on ε_it, which makes E(ε_it | y_{-1,i}, η_i) = 0 fail.
Example 86. Let T = 2 for simplicity and assume strict exogeneity, E(ε_it | y_i0, y_i1, η_i) = 0,
t = 1, 2; then, by the LIE,

cov(ε_i1, y_i1 | y_i0, η_i) = E(ε_i1 y_i1 | y_i0, η_i) = 0,

but, given y_i1 = γ y_i0 + η_i + ε_i1, cov(ε_i1, y_i1 | y_i0, η_i) = σ_ε², contradicting strict exogeneity.
Assumption A.1 is also considered in Wooldridge (2010, chapter 11), where it is referred to as
sequential exogeneity conditional on the unobserved effect. It may sometimes be convenient to
maintain also the following (sequential) conditional homoskedasticity assumption.
Assumption A.1 implies zero serial correlation in the errors, as the following exercise asks you to prove.
Exercise 87. Prove that Equation (9.13.2) and Assumption A.1 imply the following (use
the LIE and ε_{i,t-j} = y_{i,t-j} - γ y_{i,t-j-1} - η_i):

E(ε_it ε_{i,t-j}) = 0, t = 2, ..., T, j = 1, ..., t - 1.

By the LIE, E(ε_it ε_{i,t-j}) = E[E(ε_it ε_{i,t-j} | y_i^{t-1}, η_i)], and so it suffices to note that

E(ε_it ε_{i,t-j} | y_i^{t-1}, η_i) = (y_{i,t-j} - γ y_{i,t-j-1} - η_i) E(ε_it | y_i^{t-1}, η_i) = 0,

where the first equality follows from y_{i,t-j}, y_{i,t-j-1} and η_i all falling into the conditioning set
of the conditional expectation and the last equality follows from A.1.
γ̂_LSDV = γ + [(1/NT) Σ_i Σ_t (y_{i,t-1} - ȳ_{-1,i.}) (ε_it - ε̄_i.)] / [(1/NT) Σ_i Σ_t (y_{i,t-1} - ȳ_{-1,i.})²],

he showed that

plim_{N→∞} (1/NT) Σ_i Σ_t (y_{i,t-1} - ȳ_{-1,i.}) (ε_it - ε̄_i.) = (1/T) Σ_{t=1}^T E[(y_{i,t-1} - ȳ_{-1,i.}) (ε_it - ε̄_i.)]
= -(σ_ε²/T²) [(T - 1) - Tγ + γ^T] / (1 - γ)² ≠ 0.

Hence, the bias vanishes for T → ∞, but it does not for N → ∞ and T fixed. For this reason,
the LSDV estimator is inaccurate in panel data sets with large N and small T and is said to
be semi-inconsistent (see also Sevestre and Trognon, 1996).
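Nickell's result is easy to reproduce in a small Monte Carlo (a sketch; N, T, the value of γ and the mean-stationary start are invented for illustration): with T fixed, the LSDV estimate stays below the true γ however large N is, and the sample cross-moment tracks the plim formula above.

```python
import numpy as np

rng = np.random.default_rng(10)
N, T, gamma = 20_000, 5, 0.5

# Simulate (9.13.2): y_it = gamma*y_{i,t-1} + eta_i + eps_it with a mean-stationary start
eta = rng.normal(size=N)
eps = rng.normal(size=(N, T))
y = np.zeros((N, T + 1))
y[:, 0] = eta / (1 - gamma) + rng.normal(size=N) / np.sqrt(1 - gamma**2)
for t in range(1, T + 1):
    y[:, t] = gamma * y[:, t - 1] + eta + eps[:, t - 1]

# LSDV: OLS on within-demeaned data
ylag_dm = y[:, :-1] - y[:, :-1].mean(axis=1, keepdims=True)
ycur_dm = y[:, 1:] - y[:, 1:].mean(axis=1, keepdims=True)
gamma_lsdv = (ylag_dm * ycur_dm).sum() / (ylag_dm**2).sum()

# Nickell's plim of the numerator (with sigma_eps^2 = 1) vs its sample counterpart
nickell_num = -(1 / T**2) * ((T - 1) - T * gamma + gamma**T) / (1 - gamma)**2
eps_dm = eps - eps.mean(axis=1, keepdims=True)
sample_num = (ylag_dm * eps_dm).sum() / (N * T)
```

With γ = 0.5 and T = 5 the downward bias is roughly a third of the true coefficient, which is why the do-file bias_in_AR1_LSDV.do mentioned earlier focuses on exactly this design dimension.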
Since Nickell (1981), a number of consistent IV and GMM estimators have been proposed in
the econometric literature as an alternative to LSDV. Anderson and Hsiao (1981) (AH) suggest
two simple IV estimators that, upon transforming the model in first differences to eliminate
the unobserved individual heterogeneity, use the second lags of the dependent variable, either
differenced or in levels, as an instrument for the differenced one-time lagged dependent variable.
Arellano and Bond (1991) (AB) propose a GMM estimator for the first-differenced model
which, relying on all available lags of y_{-1,i} as instruments, is more efficient than Anderson
and Hsiao's. Ahn and Schmidt (1995), upon noticing that the Arellano and Bond estimator uses
only linear moment restrictions, suggest a set of non-linear restrictions that may be used in
addition to the linear ones to obtain more efficient estimates. Blundell and Bond (1998) (BB)
observe that with highly persistent data first-differenced IV or GMM estimators may suffer
from a severe small-sample bias due to weak instruments. As a solution, they suggest a system
GMM estimator with first-differenced instruments for the equation in levels and instruments
in levels for the first-differenced equation. Some of the foregoing methods are nowadays very
popular and are surveyed below.
9.13.2. The Anderson and Hsiao IV Estimator. AH take model (9.13.2) in first
differences to eliminate the individual effects:

Δy_it = γ Δy_{i,t-1} + Δε_it, t = 2, ..., T.

This makes the disturbances MA(1) with a unit root, and so induces correlation between the
lagged endogenous variable and the disturbances. This problem can be solved by finding valid
and relevant instruments for Δy_{i,t-1}. AH suggest using the values of the dependent variable
lagged two times, y_{i,t-2}, t = 2, ..., T,¹ since y_{i,t-2} is negatively correlated with Δy_{i,t-1} =
y_{i,t-1} - y_{i,t-2} and uncorrelated with Δε_it = ε_it - ε_{i,t-1} under A.1. So, T - 1 moment restrictions
could be used:

E(y_i0 Δε_i2) = 0
E(y_i1 Δε_i3) = 0
...
E(y_{i,T-2} Δε_iT) = 0,

or more compactly

(9.13.4)   E(y_{i,t-2} Δε_it) = 0, t = 2, ..., T.

AH, though, look for a simple estimator and so they just use the T - 1 restrictions added up
into the single moment restriction

E(Σ_{t=2}^T y_{i,t-2} Δε_it) = 0.

¹Or alternatively Δy_{i,t-2}, but Arellano found a problem of identification for some values of γ in this case.
As such, the AH estimator is an exactly identified IV estimator, consistent under Assumption A.1, but
non-optimal and with a high root mean squared error in applications.
9.13.3. The Arellano and Bond GMM estimator. AB look for a more efficient estimator
than AH's, introducing three efficiency improvements.
First, AB use all the T - 1 restrictions in (9.13.4) from the second lag.
Second, under A.1 all the lags more remote than the second are also uncorrelated with
ε_it - ε_{i,t-1}. So, for the third lag there are T - 2 restrictions:

E(y_i0 Δε_i3) = 0
E(y_i1 Δε_i4) = 0
...
E(y_{i,T-3} Δε_iT) = 0,

for the fourth lag there are T - 3 restrictions and so on until the last usable lag, the T-th, yielding
just one restriction, E(y_i0 Δε_iT) = 0. Eventually (T - 1) + (T - 2) + ... + 1 = T(T - 1)/2 ≡ L
restrictions are available: E(y_{i,t-s} Δε_it) = 0, s = 2, ..., T, t = s, ..., T. The AB estimator minimizes
a criterion function in the sample analogs of the population moments involved in these
restrictions: N^{-1} Σ_{i=1}^N y_{i,t-s} (Δy_it - γ Δy_{i,t-1}) is the sample analog of E(y_{i,t-s} Δε_it), s = 2, ..., T,
t = s, ..., T.
Collecting the L sample moments into the vector m̂(γ), AB look for optimal overidentified
GMM estimators, as discussed in Subsection 9.2.3, and here is their last improvement over
AH.
Z = (Z_1', Z_2', ..., Z_N')',
Δy = γ Δy_{-1} + Δε,

where
Δy is an (N(T - 1) × 1) vector;
Δy_{-1} is an (N(T - 1) × 1) vector;
Δε is an (N(T - 1) × 1) vector.
The criterion function is

(9.13.5)   Q(γ̂) = m̂(γ̂)' A m̂(γ̂)

with

A = [N^{-1} Σ_{i=1}^N Z_i' G Z_i]^{-1}

up to scale, and G the (T - 1) × (T - 1) matrix

      2 -1  0  0 ...  0  0
     -1  2 -1  0 ...  0  0
G =   0 -1  2 -1 ...  .  .
      .  .  .  . ...  .  .
      0  0  0  0 ...  2 -1
      0  0  0  0 ... -1  2
The Arellano-Bond one-step estimator γ̂_1 = argmin Q(γ̂) is

γ̂_1 = [Δy_{-1}' Z (Σ_{i=1}^N Z_i' G Z_i)^{-1} Z' Δy_{-1}]^{-1} Δy_{-1}' Z (Σ_{i=1}^N Z_i' G Z_i)^{-1} Z' Δy.
Exercise 88. Prove that under A.1 and A.2, Var(Δε_i) = σ_ε² G. Hint: Derive the formula
for the generic element of Var(Δε_i) = E(Δε_i Δε_i') and then see how it specializes
for 1) the generic main-diagonal element, 2) the generic first-subdiagonal element and 3) the
generic element on the remaining lower subdiagonals. For all three cases, first work out the
expectations conditional on

y_i^{t-2} = (y_i0, y_i1, ..., y_{i,t-2})',

using A.2, A.3 (which follows from A.1, as seen in Exercise 87) and the Generalized LIE (GLIE);
then work out the unconditional expectations using the LIE.
The generic element of Var(Δε_i) is

(9.13.7)   E[(ε_it - ε_{i,t-1})(ε_{i,t-j} - ε_{i,t-j-1})],

t = 2, ..., T, j = 0, ..., t - 2.
When j = 0 the foregoing term is a generic element of the main diagonal of Var(Δε_i).
Given A.2, A.3 and the GLIE,

E[(ε_it - ε_{i,t-1})² | y_i^{t-2}] = E(ε_it² | y_i^{t-2}) + E(ε_{i,t-1}² | y_i^{t-2}) - 2 E(ε_it ε_{i,t-1} | y_i^{t-2})
(9.13.8)   = 2σ_ε²,

and by the LIE the unconditional expectation is also 2σ_ε².
When t > 2 and j = 1, expression (9.13.7) specializes to the generic element of the first
subdiagonals of Var(Δε_i). Given A.2, A.3 and the GLIE,

E[(ε_it - ε_{i,t-1})(ε_{i,t-1} - ε_{i,t-2}) | y_i^{t-2}] = -E(ε_{i,t-1}² | y_i^{t-2})
(9.13.9)   = -σ_ε²,

and by the LIE the unconditional expectation is also -σ_ε².
When t > 3 and j = 2, ..., t - 2 we are considering all of the other elements of Var(Δε_i).
Then, given A.3 and the GLIE,

(9.13.10)   E[(ε_it - ε_{i,t-1})(ε_{i,t-j} - ε_{i,t-j-1}) | y_i^{t-2}] = 0,

and by the LIE the unconditional expectation is also 0.
where

y_i^t = (y_i0, y_i1, ..., y_it)'

and

Δε_i = (ε_i2 - ε_i1, ε_i3 - ε_i2, ε_i4 - ε_i3)'.

So,

Z_i' Δε_i = (y_i0(ε_i2 - ε_i1), y_i0(ε_i3 - ε_i2), y_i1(ε_i3 - ε_i2), y_i0(ε_i4 - ε_i3), y_i1(ε_i4 - ε_i3), y_i2(ε_i4 - ε_i3))'.
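The block-diagonal instrument matrix Z_i can be built mechanically; below is a sketch (the helper name ab_instruments and the numeric values are invented). For T = 4 it returns a 3 × 6 matrix whose transpose, multiplied by Δε_i, reproduces the vector displayed above.

```python
import numpy as np

def ab_instruments(y_levels):
    """Arellano-Bond instrument matrix Z_i for one individual.

    y_levels holds (y_i0, ..., y_iT). The row for the differenced equation at
    period t = 2, ..., T contains the levels (y_i0, ..., y_{i,t-2}); all other
    entries are zero, giving a (T-1) x L block-diagonal matrix, L = T(T-1)/2.
    """
    T = len(y_levels) - 1
    L = T * (T - 1) // 2
    Z = np.zeros((T - 1, L))
    col = 0
    for t in range(2, T + 1):
        Z[t - 2, col:col + t - 1] = y_levels[:t - 1]
        col += t - 1
    return Z

# T = 4 example, with illustrative numbers
y_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # (y_i0, ..., y_i4)
eps = np.array([0.1, -0.2, 0.3, 0.0])           # (eps_i1, ..., eps_i4)
deps = np.diff(eps)                             # (Δeps_i2, Δeps_i3, Δeps_i4)
Zi = ab_instruments(y_i)
moments = Zi.T @ deps                           # the L = 6 sample moments
```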
Without homoskedasticity (that is, without Assumption A.2), γ̂_1 is no longer optimal, but
it remains consistent and so it can be used to construct the optimal two-step estimator γ̂_2
along the lines described in Subsection 9.4.1. Specifically, γ̂_2 minimizes (9.13.5) with

A = [N^{-1} Σ_{i=1}^N Z_i' Δe_{1i} Δe_{1i}' Z_i]^{-1},

where Δe_{1i} are the one-step residuals.
Monte Carlo experiments in Bowsher (2002) show that the Sargan test based on the full
instrument set has zero power when T, and consequently the number of moment conditions, becomes too
large for given N.
9.13.3.2. Derivation of Var(Z_i' Δε_i) = σ_ε² E(Z_i' G Z_i). We now prove that under A.1 and
A.2, Var(Z_i' Δε_i) = σ_ε² E(Z_i' G Z_i).
Given Exercise 89 it is easy to see that Z_i' Δε_i is the L × 1 zero-mean vector

Z_i' Δε_i = (y_i0(ε_i2 - ε_i1), y_i0(ε_i3 - ε_i2), y_i1(ε_i3 - ε_i2), ..., y_i0(ε_iT - ε_{i,T-1}), y_i1(ε_iT - ε_{i,T-1}), ..., y_{i,T-2}(ε_iT - ε_{i,T-1}))'
and so

Var(Z_i' Δε_i) = E[(Z_i' Δε_i)(Z_i' Δε_i)'],

whose generic block is

E[(ε_it - ε_{i,t-1})(ε_{i,t-j} - ε_{i,t-j-1}) y_i^{t-2} (y_i^{t-j-2})'],

t = 2, ..., T, j = 0, ..., t - 2.
So,

E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) y_i^{t−2} (y_i^{t−j−2})′]
= E{ E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) y_i^{t−2} (y_i^{t−j−2})′ | y_i^{t−2}] }
= E{ E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) | y_i^{t−2}] y_i^{t−2} (y_i^{t−j−2})′ }
= E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) | y_i^{t−2}] E[y_i^{t−2} (y_i^{t−j−2})′]
= E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1})] E[y_i^{t−2} (y_i^{t−j−2})′],

where the first equality follows from the LIE, the second from the fact that all elements of y_i^{t−j−2} belong to y_i^{t−2}, and the third and the last from the fact that the conditional expectation E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) | y_i^{t−2}] is a constant, as shown by (9.13.8)–(9.13.10).
9.13.4. Blundell and Bond (1998) system estimator. Blundell and Bond (1998) demonstrate that when γ is close to unity the instruments in levels are weakly correlated with the first differences, leading to what is known in the econometric literature as weak instrument bias. This is easily seen by considering the following example taken from Blundell and Bond. Let T = 2; then, after taking the model in first differences, there is only a cross-section available for estimation:

Δy_{i,2} = γΔy_{i,1} + Δε_{i,2},  i = 1, …, N.
To what extent is y_{i,0} related to Δy_{i,1}? To answer this question it suffices to work out the reduced form for Δy_{i,1}:

Δy_{i,1} = (γ − 1) y_{i,0} + η_i + ε_{i,1},

from which it is clear that the closer γ is to unity, the weaker the correlation between y_{i,0} and Δy_{i,1}.
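The shrinking correlation can be illustrated numerically. The notes use Stata; the following is a minimal Python sketch with arbitrary data-generating values, simulating the covariance-stationary AR(1) panel model and comparing the correlation between the level instrument y_{i,0} and Δy_{i,1} for a small and a near-unity γ.

```python
import math
import random

# Illustrative sketch (not from the notes): simulate y_i1 = gamma*y_i0 + eta_i + eps_i1
# with a covariance-stationary start, and measure corr(y_i0, dy_i1).
def corr_level_instrument(gamma, n=50_000, seed=42):
    rng = random.Random(seed)
    y0s, dys = [], []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)
        # stationary start: y_i0 = eta/(1-gamma) + u_i0, u_i0 ~ N(0, 1/(1-gamma^2))
        u0 = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - gamma ** 2))
        y0 = eta / (1.0 - gamma) + u0
        dy1 = (gamma - 1.0) * y0 + eta + rng.gauss(0.0, 1.0)
        y0s.append(y0)
        dys.append(dy1)
    my, md = sum(y0s) / n, sum(dys) / n
    cov = sum((a - my) * (b - md) for a, b in zip(y0s, dys)) / n
    vy = sum((a - my) ** 2 for a in y0s) / n
    vd = sum((b - md) ** 2 for b in dys) / n
    return cov / math.sqrt(vy * vd)

low_gamma, high_gamma = corr_level_instrument(0.3), corr_level_instrument(0.95)
print(low_gamma, high_gamma)  # the correlation shrinks towards zero as gamma -> 1
```

The simulated correlation is clearly bounded away from zero for γ = 0.3 but nearly vanishes for γ = 0.95, which is the weak-instrument problem in miniature.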
To solve the problem they suggest exploiting the following additional moment restrictions, which are valid if, along with Assumption A.1, we maintain that the process for y_{i,t} is mean-stationary, that is

A.4:  E(y_{i,0} | η_i) = η_i / (1 − γ).

Assumption A.4 is justified if the process started in the distant past. Starting from the model at observation t = 0 and going backward in time recursively,

y_{i,0} = η_i/(1 − γ) + Σ_{t=0}^∞ γ^t ε_{i,−t} = η_i/(1 − γ) + u_{i,0}.
The last equality follows since E(η_i u_{i,0}) = 0 by construction, E(ε_{i,2} η_i) = 0 and E(ε_{i,1} ε_{i,2}) = 0 by A.1. That also E(ε_{i,2} u_{i,0}) = 0 follows straightforwardly from the fact that by A.1 E(ε_{i,2} y_{i,0}) = 0, that is

(1/(1 − γ)) E(ε_{i,2} η_i) + E(ε_{i,2} u_{i,0}) = 0,

and E(ε_{i,2} η_i) = 0. Thus, Blundell and Bond (1998) suggest a system GMM estimator, which also uses instruments in first differences for the equation in levels.
Hahn (1999) evaluates the efficiency gains brought by exploiting the stationarity of the
initial condition as done by Blundell and Bond, finding that it is substantial also for large T .
Stata's xtabond performs the Arellano and Bond GMM estimator. Then there is xtdpdsys, which implements the system GMM estimator. Third, xtdpd is a more general command that allows more flexibility than both xtabond and xtdpdsys. Finally, the user-written xtabond2 (Roodman 2009) is certainly the most powerful Stata code to implement dynamic panel data models.
9.13.5. Application. Arellano and Bond (1991) illustrate their methods by estimating a dynamic employment equation on a sample of UK manufacturing companies. Their data set in Stata format is contained in abdata.dta. The do-file IV_GMM_DPD.do implements simpler versions of their model through differenced and system GMM using xtabond and xtabond2. The do-file abbest.do by D. M. Roodman replicates exactly the Arellano and Bond results using xtabond2.
9.13.6. Bias-corrected LSDV. IV and GMM estimators in dynamic panel data models are consistent only as N grows large, so they can be severely biased and imprecise in panel data with a small number of cross-sectional units. This certainly applies to most macro panels, but also to micro panels where heterogeneity concerns force the researcher to restrict estimation to small subsamples of individuals.
Monte Carlo studies (Arellano and Bond 1991, Kiviet 1995 and Judson and Owen 1999) demonstrate that LSDV, although inconsistent, has a relatively small variance compared to IV and GMM estimators. So, an alternative approach based upon the correction of LSDV for the finite sample bias has recently become popular in the econometric literature. Kiviet (1995) uses higher-order asymptotic expansion techniques to approximate the small sample bias of the LSDV estimator, including terms of at most order 1/(TN). Monte Carlo evidence therein shows that the bias-corrected LSDV estimator (LSDVC) often outperforms the IV-GMM estimators in terms of bias and root mean squared error (RMSE). Another piece of Monte Carlo evidence, by Judson and Owen (1999), strongly supports LSDVC when N is small, as in most macro panels. In Kiviet (1999) the bias expression is made more accurate by including terms of higher order. Bun and Kiviet (2003) simplify the approximations in Kiviet (1999).

Bruno (2005a) extends the bias approximations in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule. Bruno (2005b) presents the new user-written Stata command xtlsdvc to implement LSDVC.
Kiviet (1995) shows that the bias approximations are even more accurate when there is
a unit root in y. This makes for a simple panel unit-root test based on the bootstrapped
standard errors computed by xtlsdvc.
9.13.6.1. Estimating a dynamic labour demand equation for a given industry. Unlike the
xtabond and xtabond2 applications of Subsection 9.13.5, here we do not use all information
available to estimate the parameters of the labour demand equation in abdata.dta. Instead,
we follow a strategy that, exploiting the industry partition of the cross-sectional dimension
as defined by the categorical variable ind, lets the slopes be industry-specific. This is easily
accomplished by restricting the usable data to the panel of firms belonging to a given industry.
While such a strategy leads to a less restrictive specification for the firm labour demand, it
causes a reduced number of cross-sectional units for use in estimation, so that the researcher
must be prepared to deal with a potentially severe small sample bias in any of the industry
regressions. Clearly, xtlsdvc is the appropriate solution in this case.
The demonstration is kept as simple as possible considering regressions for only one in-
dustry panel, ind=4.
The following instructions are implemented in a Stata do-file.
Part 2
Non-linear models
CHAPTER 10
10.1. Introduction
Non-linear models may provide a more adequate specification than linear models when the dependent variable is not Gaussian. They present three important difficulties, though.
There are two do-files demonstrating the methods of this chapter: nlmr.do, using the data set mus10data.dta, and nlmr2.do, using the data set mus17data.dta. Both data sets are from Cameron and Trivedi (2010).
The regression model specifies the mean of y conditional on a vector of exogenous explanatory variables x by using some known, non-linear functional form

E(y|x) = m(x, β).

Or, equivalently,

y = m(x, β) + u,

where u = y − E(y|x).
10.3. POISSON MODEL FOR COUNT DATA 191
Given the random sample {y_i, x_i}, i = 1, …, n, the non-linear least squares estimator, b_NLS, minimizes the non-linear residual sum of squares

Q = Σ_{i=1}^n [y_i − m(x_i, b)]².
Let y ∈ ℕ be a count variable: doctor visits, car accidents, etc. The Poisson regression model is a non-linear regression model with

(10.3.1)  E(y|x) = exp(x′β).

Or, equivalently,

y = exp(x′β) + u,

with E[y − exp(x′β) | x] = 0, and by the Law of Iterated Expectations there are zero covariances between u and x:

(10.3.2)  E_{y,x}[x(y − exp(x′β))] = 0.
Exercise 90. Equations (10.3.3) are different from the first-order-conditions of the NLS
estimator for the Poisson model. Can you derive them?
f(y) = e^{−μ} μ^y / y!.

Importantly, the Poisson model has the equidispersion property: Var(y) = E(y) = μ.

Letting μ = exp(x′β), we end up with the conditional log-likelihood function

lnL(y_1 … y_n | x_1 … x_n, β) = Σ_{i=1}^n ln{ exp[−exp(x_i′β)] exp(x_i′β)^{y_i} / y_i! }
= Σ_{i=1}^n [ −exp(x_i′β) + y_i x_i′β − ln(y_i!) ].
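Since the score of this log-likelihood is Σ(y_i − exp(x_i′b))x_i and the Hessian is −Σ exp(x_i′b)x_i x_i′, b_ML can be computed by Newton's method. The following Python sketch is illustrative only (the notes use Stata's poisson) and fits the model on a hypothetical toy data set with a constant and one regressor.

```python
import math

# Illustrative sketch: Poisson ML by Newton's method on hypothetical data.
# Score: g = sum (y_i - mu_i) x_i ; negative Hessian: H = sum mu_i x_i x_i'.
z = [0.0, 0.0, 1.0, 1.0]
y = [1, 1, 2, 3]

b = [0.0, 0.0]  # (constant, slope)
for _ in range(50):
    g0 = g1 = 0.0
    h11 = h12 = h22 = 0.0
    for zi, yi in zip(z, y):
        mu = math.exp(b[0] + b[1] * zi)
        g0 += yi - mu
        g1 += (yi - mu) * zi
        h11 += mu
        h12 += mu * zi
        h22 += mu * zi * zi
    det = h11 * h22 - h12 * h12
    # Newton step: b <- b + H^{-1} g
    b[0] += (h22 * g0 - h12 * g1) / det
    b[1] += (-h12 * g0 + h11 * g1) / det
    if abs(g0) + abs(g1) < 1e-10:
        break

print(b)  # b_ML solves the first order conditions sum (y_i - mu_i) x_i = 0
```

Because the toy model is saturated (z takes two values), the ML solution equals the log of the group means, which makes the result easy to check by hand.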
b_ML is consistent: b_ML →_p β.

(10.3.4)  The covariance matrix estimator of b_ML:  V(b_ML) = ( Σ_{i=1}^n μ̂_i x_i x_i′ )^{−1}.
It is easily seen that the k first order conditions that maximize lnL coincide with the equations in (10.3.3), so that b_ML = b_GMM. This proves two things: 1) the GMM estimator is asymptotically efficient if the conditional mean function is correctly specified and the density function is Poisson; 2) the ML estimator is consistent even if the Poisson density is not the correct density function, as long as the conditional mean is correctly specified. In such cases, when the likelihood function is not correctly specified, we refer to the ML estimator as a pseudo ML estimator and a robust covariance matrix estimator should be used for inference rather than (10.3.4):

V_rob(b_ML) = ( Σ_{i=1}^n μ̂_i x_i x_i′ )^{−1} [ Σ_{i=1}^n (y_i − μ̂_i)² x_i x_i′ ] ( Σ_{i=1}^n μ̂_i x_i x_i′ )^{−1}.

If equidispersion holds, (y_i − μ̂_i)² is on average close to μ̂_i, and then V(b_ML) is close to V_rob(b_ML).
The consistency result for the (pseudo) ML estimator holds in general if two conditions are verified: the density belongs to the linear exponential family, f(y|x) = exp{a(y) + b[m(x, β)] + y·c[m(x, β)]}, and the conditional mean m(x, β) is correctly specified. In the Poisson case,

a(y) = −ln(y!),  b[m(x, β)] = −exp(x′β)  and  y·c[m(x, β)] = y x′β.
The Stata command that implements Poisson regression is poisson, with a syntax close to regress. It computes b_ML with standard error estimates obtained by V(b_ML). If the
vce(robust) option is given, then Stata recognizes the more robust pseudo-ML set-up and still provides the b_ML coefficient estimates, but with the robust covariance matrix V_rob(b_ML). Clustering of errors is supported by vce(cluster clustervar); vce(opg) implements the VCE as the inverse outer product of the gradient vector evaluated at b_ML, which is asymptotically equivalent to V(b_ML); finally, bootstrap standard errors can be obtained through vce(bootstrap, reps(#)), with reps(#) setting the number of bootstrap replications.
10.3.2. Empirical application. We apply poisson to estimate a Poisson model for the number of doctor visits in 2002, for a sample of U.S. individuals aged 25-64 years, working in the private sector, not self-employed and not receiving public insurance (Medicare or Medicaid) (see Cameron and Trivedi, 2010).
. keep if year02==1
(25,712 observations deleted)
Robust VCE:

(Poisson regression of docvis: coefficient table with robust standard errors omitted)
.
Poisson can be implemented also through the Stata command performing non-linear GMM, gmm. The syntax is non-standard here: the argument of gmm is a Stata rendition of y − exp(x′β) within parentheses. The option instruments(varlist) tells Stata which variables are exogenous (all, in this application).
GMM estimation
Number of parameters = 5
Number of moments = 5
Initial weight matrix: Unadjusted          Number of obs = 4,412
(coefficient table with robust standard errors omitted)
10.3.3. Marginal effects. The post-estimation command margins with the option dydx(varlist) estimates marginal effects for each of the variables in varlist. Marginal effects can be estimated at a point x₀ (conventionally, the sample mean when variables are continuous; in this case the option atmean must be supplied) or averaged over the sample (the default). The marginal effect of a binary covariate should be computed as the discrete change from its base level, rather than using partial-derivative formulas, which are appropriate only for continuous x's.

Stata uses the appropriate formulas if we indicate which covariates are binary in the regression command preceding margins, as follows:
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
.
Or we can average MEs over the sample to get AMEs. NB: This is the margins default
. quietly poisson docvis i.private i.chronic i.female income, vce(robust)
. margins, dydx(*)
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
.
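To see why the discrete change and the derivative formula disagree for a binary covariate, consider a small numerical sketch. This is illustrative Python (the notes use Stata); the coefficients and data below are hypothetical, not estimates from the application above.

```python
import math

# Illustrative sketch: for a Poisson mean mu = exp(b0 + b1*d + b2*x), the AME of the
# binary covariate d is the average discrete change in the fitted mean as d moves
# from 0 to 1, holding the continuous covariate at its observed values.
b0, b1, b2 = 0.2, 0.5, 0.1  # hypothetical coefficients
xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical continuous covariate

def mu(d, x):
    return math.exp(b0 + b1 * d + b2 * x)

ame_discrete = sum(mu(1, x) - mu(0, x) for x in xs) / len(xs)
# The derivative formula b1*mu (evaluated here at d = 0), appropriate only for
# continuous regressors, gives a different answer:
ame_derivative = sum(b1 * mu(0, x) for x in xs) / len(xs)
print(ame_discrete, ame_derivative)
```

With the exponential mean, the discrete change equals (e^{b1} − 1)·μ(0, x), so for b1 = 0.5 it exceeds the derivative formula b1·μ(0, x) by a fixed factor.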
10.4. MODELLING AND TESTING OVERDISPERSION 199
Let ν be a multiplicative latent heterogeneity term, so that conditional on ν the density is Poisson with mean μν:

f(y|ν) = e^{−μν} (μν)^y / y!,

and the marginal density is

f(y) = E_ν[ e^{−μν} (μν)^y / y! ].

To find it in closed form we need to specify the marginal density function for ν. If ν ∼ Gamma(1, α²), with unity mean and variance α², then f(y) is a negative binomial density function, NB(μ, α²), with E(y) = μ and Var(y) = μ(1 + α²μ). Clearly, if α² = 0, then ν collapses to its unity mean and f(y) is Poisson.

Specifying μ = exp(x′β) yields the NB regression model; β and α² are estimated via ML based on NB(exp(x′β), α²). Testing for overdispersion within this framework boils down to testing H₀: α² = 0.

Overdispersion can be tested also under the null hypothesis of α² = 0, therefore under Poisson regression, against the alternative of Var(y|x) = μ(1 + α²μ), therefore NB regression, using a Lagrange Multiplier test. This is based on an auxiliary regression implemented after poisson estimation, using an estimate of [Var(y|x)/μ] − 1, namely [(y_i − μ̂_i)² − y_i]/μ̂_i, as the dependent variable and μ̂_i = exp(x_i′b_ML) as the only regressor (no constant). The LM test is the t-statistic computed for the OLS coefficient estimate of μ̂_i.
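The auxiliary regression behind the LM test can be sketched as follows. This is illustrative Python, not the Stata implementation: μ_i is taken as known rather than fitted, and y_i is drawn from a Poisson so that the equidispersion null holds.

```python
import math
import random

# Illustrative sketch: LM overdispersion test as a no-constant OLS regression of
# w_i = [(y_i - mu_i)^2 - y_i]/mu_i on mu_i; the statistic is the slope t-ratio.
rng = random.Random(7)

def rpois(lam):
    # Knuth's algorithm, adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

mus = [math.exp(0.1 + 0.3 * rng.gauss(0, 1)) for _ in range(500)]
ys = [rpois(m) for m in mus]
w = [((yi - m) ** 2 - yi) / m for yi, m in zip(ys, mus)]

bhat = sum(wi * m for wi, m in zip(w, mus)) / sum(m * m for m in mus)
resid = [wi - bhat * m for wi, m in zip(w, mus)]
s2 = sum(e * e for e in resid) / (len(w) - 1)
t_lm = bhat / math.sqrt(s2 / sum(m * m for m in mus))
print(t_lm)  # approximately N(0,1) under equidispersion
```

In practice μ̂_i comes from a first-stage poisson fit; large positive values of the t-ratio point towards the NB alternative.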
CHAPTER 11
11.1. Introduction
Binary dependent variable models have a dependent variable that partitions the sample into two categories of a given qualitative dimension of interest. For example:

- Labour supply. There are two categories: work/not work (univariate binary model).
- Supplementary private health insurance. There are two categories: purchase/not purchase (univariate binary model).

Binary models are said to be multivariate when there are multiple dimensions that are possibly related:

- Two related dimensions: [Dimension 1: Being overweight (body mass index > 25), two categories: yes/no] and [Dimension 2: Job satisfaction, two categories: satisfied/dissatisfied] (bivariate binary model).
- Two related dimensions: [Dimension 1: Identity of immigrants with the host country, two categories: yes/no] and [Dimension 2: Identity of immigrants with the country of origin, two categories: yes/no] (bivariate binary model).

In these notes I focus almost exclusively on univariate binary models, except for a digression on the bivariate probit model as estimated by Stata's biprobit.
The do-file bdvm.do is a Stata application on binary models that uses the data set mus14data.dta
from Cameron and Trivedi (2010).
11.2. BINARY MODELS 202
Let A be the event of interest (e.g. buying a supplementary health insurance). Let the indicator function 1(A) be unity if event A occurs and zero if not. Define the discrete random variable y such that

(11.2.1)  y = 1(A).

Then

Pr(y = 1) = Pr(A) ≡ p  and  Pr(y = 0) = 1 − p,
E(y) = p  and  Var(y) = p(1 − p).

(11.2.2)  Pr(y = 1|x) = F(x′β).

Since Pr(y = 1|x) = E(y|x), Model (11.2.2) can always be expressed as the regression model

(11.2.3)  y = F(x′β) + u,  u = y − E(y|x).
11.2.1. Latent regression. When F(·) is a distribution function the binary model can be motivated as a latent regression model. In microeconomics this is a convenient way to model individual choices.

Introduce the latent continuous random variable y* with

(11.2.4)  y* = x′β + ε,

where ε is a zero-mean random variable, independent of x, with ε ∼ F, and F is a distribution function that is symmetric around zero. Then, let y = 1(y* > 0). In the insurance example we may think of y* as the utility variation faced by a subject with observable and latent characteristics x and ε, respectively, when she decides to buy a supplementary insurance, so that event A occurs if and only if y* > 0. Hence

y = 1(ε > −x′β),

so that

Pr(y = 1|x) = Pr(ε > −x′β | x),

and, by the symmetry of F around zero,

(11.2.5)  Pr(y = 1|x) = F(x′β).
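The link between the latent model and the response probability can be checked by simulation. A minimal Python sketch, assuming a standard normal ε (the probit case) and an arbitrary index value:

```python
import math
import random

# Illustrative sketch: with eps ~ N(0,1) in y* = x'beta + eps and y = 1(y* > 0),
# the empirical frequency of y = 1 at a fixed x should match Phi(x'beta).
def Phi(z):  # standard normal CDF via erf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(123)
xb = 0.4        # a fixed (hypothetical) index value x'beta
n = 200_000
freq = sum(1 for _ in range(n) if xb + rng.gauss(0.0, 1.0) > 0) / n
print(freq, Phi(xb))  # the two numbers should agree up to simulation error
```

Replacing the normal draw with a logistic one would reproduce the logit case, since only the distribution F of ε changes.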
11.2.3. An application to the HRS data. To illustrate the methods discussed so far, we use data from the panel study Health and Retirement Study (HRS) 2000, a sample of individuals aged 52-86, all Medicare beneficiaries. We wish to explain the decision to purchase a private insurance from any source: private market, employer- or union-sponsored plans, etc. This decision is described by a yes/no binary variable, ins. Estimation is carried out through a battery of binary model estimators.

Here are the variables involved in the binary regressions, included in Stata's global macros by the Stata command global:
(Stata output omitted: global macro definitions and a regression of ins with robust standard errors)
Stata's hetprobit estimates this heteroskedastic probit model and, importantly, provides a LR test for the null hypothesis of homoskedasticity (δ = 0). The following is an application of hetprobit to the HRS data:
ins            Coef.     Std. Err.      z    P>|z|     [95% Conf. Interval]
retire .1075926 .0476757 2.26 0.024 .0141501 .2010352
age -.0087658 .0062107 -1.41 0.158 -.0209384 .0034069
hstatusg .1629653 .0564771 2.89 0.004 .0522722 .2736584
hhincome .0011135 .000364 3.06 0.002 .0004 .001827
educyear .0642167 .0094184 6.82 0.000 .0457569 .0826765
married .3341699 .0563861 5.93 0.000 .2236551 .4446847
hisp -.4344396 .1055044 -4.12 0.000 -.6412244 -.2276548
_cons -.9089138 .4318121 -2.10 0.035 -1.75525 -.0625776
lnsigma2
chronic -.0442144 .0365848 -1.21 0.227 -.1159193 .0274906
.
(regression of ins: coefficient table with robust standard errors omitted)
.
11.3. COEFFICIENT ESTIMATES AND MARGINAL EFFECTS 211
There is no exact relationship between the coefficient estimates from the three foregoing models. Amemiya (1981) works out rough conversion factors (table omitted here). This is not very interesting, though, since we are often interested in marginal effects on response probabilities rather than coefficients.

In logit and probit models, the marginal effect of x on the success probability, F(x′β), depends on how large the index value t = x′β is, since F(t) → 1 or 0 as t → +∞ or −∞, respectively. Hence, if x is large in size and β ≠ 0, so that t = x′β is also large, F(t) tends to be constant and ∇_x F(t) to be negligible.

The general expression for ∇_x F(t) can be worked out by using the chain rule, so that ∇_x F(t) = ∂_t F(t) ∇_x(x′β), and since the first derivative of a distribution function is its density function f, ∇_x F(t) = f(x′β)β. The marginal effects of x at observation i are estimated by logit and probit as

(∇_x F_i)_ML = f(x_i′b_ML) b_ML

and by the LPM as

(∇_x F_i)_OLS = b_OLS.
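For reference, the two versions of f(x′β)β can be sketched as follows. This is illustrative Python; f is the logistic density Λ(t)[1 − Λ(t)] in logit and the standard normal density φ(t) in probit.

```python
import math

# Illustrative sketch: marginal effects f(x'b)*b at a given index t = x'b.
def logit_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

def me_logit(t, b):
    F = logit_cdf(t)
    return F * (1.0 - F) * b  # logistic density is F(1-F)

def me_probit(t, b):
    phi = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return phi * b

# At t = 0 the logit ME is b/4 and the probit ME is about 0.399*b, which is one
# way to see why raw logit and probit coefficients are not directly comparable.
print(me_logit(0.0, 1.0), me_probit(0.0, 1.0))
```

The ratio of the two densities at t = 0 is close to the rough coefficient conversion factors mentioned above.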
The post-estimation command margins with the option dydx(varlist ) estimates marginal
effects for each of the variables in varlist . Marginal effects can be estimated at a point x
11.3. COEFFICIENT ESTIMATES AND MARGINAL EFFECTS 212
(conventionally, the sample mean when variables are continuous and in this case the option
atmean must be supplied) or can be averaged over the sample (default).
Next, we present a battery of experiments in probit and logit with margins, starting with
probit marginal effects (ME) evaluated at an interesting counterfactual point:
. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)

. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp

(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
Probit AMEs:
. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
. quietly logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
Logit AMEs:
. quietly logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
.
Logit and probit AMEs are very close!
11.4. TESTS AND GOODNESS-OF-FIT MEASURES 216
AMEs from the LPM are readily evaluated as the coefficient estimates!
(LPM regression of ins: coefficient table with robust standard errors omitted)
They are quite close to the logit and probit AMEs! This feature of the LPM, producing AMEs that are in line with those of probit and logit, has been noted in Wooldridge (2010).
Parameter restrictions can be tested by Wald tests (test) and LR tests (lrtest). As explained above, hetprobit, besides producing coefficient estimates, provides a heteroskedasticity test.
The most common goodness-of-fit measures reported in logit or probit outputs are the following.

11.4.1. The overall percent correctly predicted (OPCP). Define the 0.5-cutoff predictor ŷ_i of y_i as

ŷ_i = 1 if F(x_i′b) ≥ 0.5,  ŷ_i = 0 else.
OPCP is the number of times ŷ_i = y_i divided by n. A problem with this measure is that it can be high also in cases where the model poorly predicts one outcome. It may be more informative in these cases to compute the percent correctly predicted for each outcome separately: 1) the number of times ŷ_i = y_i = 1 over the number of times y_i = 1, and 2) the number of times ŷ_i = y_i = 0 over the number of times y_i = 0 (it turns out that OPCP is a weighted average of the foregoing two measures).

These measures are obtained through the Stata post-estimation command estat classification (not supported by hetprobit). It reports several statistics, indeed. The percent correctly predicted of successes (y = 1) is referred to as sensitivity and symbolized Pr(+|D); the percent correctly predicted of failures (y = 0) is referred to as specificity and symbolized Pr(−|~D). The OPCP value is reported under the label "correctly classified".
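The three measures are easy to compute by hand. A minimal Python sketch with hypothetical predictions (not the HRS output):

```python
# Illustrative sketch: OPCP, sensitivity and specificity from 0.5-cutoff predictions.
y     = [1, 1, 1, 0, 0, 0, 0, 0]                  # hypothetical outcomes
p_hat = [0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.4, 0.1]  # hypothetical fitted probabilities
y_hat = [1 if p >= 0.5 else 0 for p in p_hat]

sens = sum(1 for a, b in zip(y, y_hat) if a == b == 1) / sum(y)       # Pr(+|D)
spec = sum(1 for a, b in zip(y, y_hat) if a == b == 0) / y.count(0)   # Pr(-|~D)
opcp = sum(1 for a, b in zip(y, y_hat) if a == b) / len(y)
# OPCP is the weighted average of sensitivity and specificity,
# with weights equal to the outcome shares.
print(sens, spec, opcp)
```

With these numbers the outcome shares are 3/8 and 5/8, and OPCP = (3/8)·sensitivity + (5/8)·specificity, which illustrates the weighted-average claim above.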
. estat classification
(classification table omitted: Classified {+, −} by True {D, ~D} counts)
We verify the estat classification results through the Stata command display (used
here as a pocket calculator!)
average probability of the same outcome within a subsample S of interest and its complementary S^c (for example, S = females and S^c = males):

ȳ_S ≡ (1/m_S) Σ_{i∈S} y_i  compared to  p̄_S ≡ (1/m_S) Σ_{i∈S} F(x_i′b)

and

ȳ_{S^c} ≡ (1/m_{S^c}) Σ_{i∈S^c} y_i  compared to  p̄_{S^c} ≡ (1/m_{S^c}) Σ_{i∈S^c} F(x_i′b).
(tabulate ins female output: frequency and column-percentage table omitted)
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .396303 .008885 44.60 0.000 .3788887 .4137172
1 .3770161 .0087613 43.03 0.000 .3598443 .3941878
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .396303 .008885 44.60 0.000 .3788888 .4137173
1 .3770162 .0087613 43.03 0.000 .3598444 .394188
.
Predictions, virtually identical between probit and logit, are quite satisfactory: ȳ_females = 0.35 against p̄_females = 0.38, and ȳ_males = 0.42 against p̄_males = 0.40.
The actual frequencies ȳ_S and ȳ_{S^c} can also be obtained from any binary model saturated for the sample partition S = {S, S^c}. A model is said to be saturated for a sample partition S = {S_1, …, S_M} if it has enough variables to explain the column percentages in each cell of S. In our example, a model is saturated for S if it contains the indicators for S and for S^c, or, equivalently, the indicator for S (or S^c) and the indicator for the whole estimation sample S ∪ S^c, i.e. the constant term.

ȳ_S is always exactly reproduced by a saturated logit model, and with a negligible error by a saturated probit model.
. quietly logit ins i.female i.retire $xlist
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .4241338 .0115715 36.65 0.000 .4014542 .4468135
1 .3466057 .0116829 29.67 0.000 .3237077 .3695038
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .4236579 .0115466 36.69 0.000 .4010271 .4462888
1 .3450644 .0116708 29.57 0.000 .32219 .3679388
.
. margins
(margins output: Delta-method Margin table omitted)
. estat gof
11.4.5. The pseudo R-squared. The last goodness-of-fit measure we consider is the pseudo R-squared: R̃² = 1 − L(β̂)/L(ȳ), where L(β̂) is the value of the maximized log-likelihood and L(ȳ) is the log-likelihood evaluated for the model with only the intercept.
11.5. NUMERICAL PROBLEMS 224
11.5.1. One-way causation. One-way causation occurs when a category of a binary covariate x predicts one of the outcomes of y perfectly. For example

x = (1, 1, 0, 0, 0)′ and y = (1, 1, 0, 1, 0)′.

As a consequence, there does not exist a finite solution maximizing the likelihood function. This happens because the model is saturated for the x categories and so necessarily ȳ_{x=1} = p̄_{x=1}, exactly in logit, approximately in probit. But since ȳ_{x=1} = 1, then p̄_{x=1} = 1, which holds only if the estimate of the coefficient on x is +∞.
Should, instead, x = 1 predict y = 0 perfectly,

x = (1, 1, 0, 0, 0)′ and y = (0, 0, 1, 1, 0)′,

then p̄_{x=1} = 0, which holds only if the estimate of the coefficient on x is −∞. Also in this case, there does not exist a finite solution maximizing the likelihood function.
Stata solves the problem in two steps. First, it removes the individuals in the x category predicting the outcome perfectly. Then, it removes the variable x, which after discarding those records is constant and so collinear with the intercept.
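The divergence of the likelihood under one-way causation can be verified directly. In the following Python sketch (illustrative; intercept fixed at zero for simplicity), the logit log-likelihood for the first example keeps increasing as the coefficient on x grows, so no finite maximizer exists.

```python
import math

# Illustrative sketch: x = 1 predicts y = 1 perfectly, so the logit log-likelihood
# is strictly increasing in the coefficient b on x.
x = [1, 1, 0, 0, 0]
y = [1, 1, 0, 1, 0]

def loglik(b):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-b * xi))  # logistic success probability
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

print([round(loglik(b), 4) for b in (1.0, 5.0, 10.0)])  # monotonically increasing
```

The observations with x = 0 contribute a constant, while the perfectly predicted x = 1 observations contribute ln σ(b), which increases without bound towards zero as b → +∞.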
.
. **** One-way causation
.
. tabulate foreign repair /// level 1 predicts
> /// failure (domestic)
> /// perfectly.
>
                    repair
Car type        1      2      3   Total
Domestic       10     27      9      46
Foreign         0      3      9      12
Total          10     30     18      58
repair
    1           0  (empty)
    2   -2.197225   .7698003   -2.85   0.004   -3.706005   -.6884436
.
One-way causation occurs also when a pattern of two or more covariates (dummies, or continuous variables with repeated values, such as age) predicts the outcome perfectly.
. list, separator(0) /// pattern x1=0 and x2=0 predicts y=0 perfectly
>
y x1 x2
1. 0 0 0
2. 0 0 0
3. 0 1 0
4. 1 1 0
5. 0 0 1
6. 1 0 1
.
. logit y x1 x2
end of do-file
Stata removes the records perfectly predicted, but then the remaining variables are perfectly collinear with the constant term and convergence fails.

Stata, differently from the case of a single predictor dummy, does not solve the multicollinearity issue by dropping a variable. But it displays a message alerting the user, who can take action by her/himself (see the do-file numerical_problems.do).
x y
1. -2.210688 0
2. -2.195971 0
3. -.5270714 0
4. .8169003 1
5. 1.472326 1
. probit y x
(output omitted)
end of do-file
A case of quasi-separation occurs when x is a great predictor of y, although not a perfect
one.
x y
1. -1.938717 0
2. -1.165771 0
3. -1.06293 0
4. -1.027958 0
5. -.8778074 0
6. -.8016517 0
7. -.1507115 0
8. -.1269327 0
9. -.0573579 1
10. .0050004 0
11. .0515246 1
12. .1676564 1
13. .1892474 1
14. .2048065 1
15. .2656516 1
16. .2677172 1
17. .9665207 1
18. 1.020527 1
19. 1.504177 1
20. 1.85036 1
. probit y x
Numerically the model is sound and Stata carries out estimation.
. logit y x
.
The consequences are a large coefficient estimate on the predictor variable and predicted probabilities that are virtually zero over the smallest values of x and virtually unity over the largest ones. Stata issues a final message illustrating the situation. No intervention is required from the user.

Stata's message says "4 failures and 4 successes completely determined". Let's see why.
. predict p
(option pr assumed; Pr(y))
. sort p
. list p in 1/4
1. 4.38e-18
2. 4.61e-11
3. 3.96e-10
4. 8.23e-10
. list p in 17/20
17. 1
18. 1
19. 1
20. 1
.
p is virtually zero over the smallest 4 values of x and virtually unity over the largest 4
ones.
The first order conditions for the ML estimator are given by the k-equation system

(11.6.1)  Σ_{i=1}^n { [y_i − F(x_i′b)] f(x_i′b) / (F(x_i′b)[1 − F(x_i′b)]) } x_i = 0.

In logit, f(t) = F(t)[1 − F(t)], so (11.6.1) reduces to

(11.6.2)  Σ_{i=1}^n [y_i − F(x_i′b)] x_i = 0.

Then, if the constant is in x_i, Σ_{i=1}^n [y_i − F(x_i′b)] = 0, which explains why in logit

(11.6.3)  ȳ = (1/n) Σ_{i=1}^n F(x_i′b).
If a dummy

d_i = 1 if i ∈ S,  d_i = 0 if i ∈ S^c,

and the constant belong to x_i in logit, then from equations (11.6.2) and (11.6.3)

(11.6.4)  ȳ_S = (1/m_S) Σ_{i∈S} F(x_i′b)  and  ȳ_{S^c} = (1/m_{S^c}) Σ_{i∈S^c} F(x_i′b).
In probit, instead, f(t) ≠ F(t)[1 − F(t)], so the weights f(x_i′b)/{F(x_i′b)[1 − F(x_i′b)]} in (11.6.1) vary across observations. The first order condition for the constant then no longer implies Σ_{i=1}^n [y_i − F(x_i′b)] = 0 exactly, so ȳ (and ȳ_S in a saturated model) is reproduced only approximately.
In the presence of endogenous regressors in the latent regression model, say y_2, one can use an instrumental variable probit estimator. This is implemented by Stata's ivprobit, with a syntax similar to ivregress.

We re-estimate an extended version, with more regressors, of our insurance model, treating the log of head-of-household income as potentially endogenous. We use the individual and the spouse professional statuses (retire and sretire) as instruments. Notice that now that the retire variable is used as an instrument, it has to be removed from the model.
11.8. INDEPENDENT LATENT HETEROGENEITY 234
. global xlist2 female age age2 educyear married hisp white chronic adl hstatusg
(ivprobit output: coefficient table with robust standard errors omitted)
Instrumented: linc
Instruments: female age age2 educyear married hisp white chronic adl hstatusg
sretire retire
Wald test of exogeneity (corr = 0): chi2(1) = 3.51 Prob > chi2 = 0.0610
In the latent regression model (11.2.4) all explanatory variables are observed. But it may be the case that relevant explanatory variables are latent, as allowed by the following model:

y* = x′β + q + ε,  with ε|x ∼ N(0, 1) and, independently, q|x ∼ N(0, σ²).

Then ε + q|x ∼ N(0, 1 + σ²) and

y*/√(1 + σ²) = x′β/√(1 + σ²) + (ε + q)/√(1 + σ²)

is a legitimate probit model. In fact, y*/√(1 + σ²) is latent,

(ε + q)/√(1 + σ²) | x ∼ N(0, 1),

and so

Φ(x′β/√(1 + σ²)) = Pr(y = 1|x).

It follows that we can apply standard probit ML estimation: the resulting estimator is consistent for β/√(1 + σ²), and the implied Φ(x′β̂) is consistent for the response probabilities Pr(y = 1|x).

From the above analysis it clearly emerges that the probit estimator estimates β with a downward bias (Yatchew and Griliches (1985)). Nonetheless, if our interest centers on marginal effects ∇_x Pr(y|q, x) averaged over q (AMEs), E_q[∇_x Pr(y|q, x)], this is no problem.

Indeed, given f(q|x), the conditional density function of q, it is generally true that

Pr(y|x) = ∫ Pr(y|x, q) f(q|x) dq = E_{q|x}[Pr(y|x, q)].

Hence, under mild regularity conditions that permit interchanging integrals and derivatives, ∇_x Pr(y|x) = E_{q|x}[∇_x Pr(y|x, q)].

The above result is important, for it establishes that to estimate Pr(y|x) and ∇_x Pr(y|x) is to estimate E_q[Pr(y|q, x)] and E_q[∇_x Pr(y|q, x)], respectively. So, Φ(x′β̂) is a consistent estimator for E_q[Pr(y|q, x)]; likewise, its gradient φ(x′β̂)β̂ is a consistent estimator of E_q[∇_x Pr(y|q, x)], possibly averaged over the sample.
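The key identity behind this argument, E_q[Φ(a + q)] = Φ(a/√(1 + σ²)) for q ∼ N(0, σ²), can be checked numerically. An illustrative Python sketch using simple trapezoid quadrature (the values of a and σ² below are arbitrary):

```python
import math

# Illustrative numeric check: averaging Phi(a+q) over q ~ N(0, s2) reproduces the
# attenuated-index probit probability Phi(a / sqrt(1 + s2)).
def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def averaged_prob(a, s2, n=20_000, width=8.0):
    # trapezoid rule for the integral of Phi(a+q) * N(0, s2) density over q
    s = math.sqrt(s2)
    lo, hi = -width * s, width * s
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        q = lo + k * h
        dens = math.exp(-0.5 * (q / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        wgt = 0.5 if k in (0, n) else 1.0
        total += wgt * Phi(a + q) * dens
    return total * h

a, s2 = 0.7, 0.64
print(averaged_prob(a, s2), Phi(a / math.sqrt(1.0 + s2)))  # should agree closely
```

This is exactly why the attenuated probit coefficients still deliver consistent estimates of the averaged response probabilities and AMEs.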
11.9. MULTIVARIATE PROBIT MODELS 237

A multivariate probit model consists of m latent regressions

(11.9.1)  y*_j = x′β_j + ε_j,

j = 1, …, m, where β_j, x and ε_j are, respectively, the p × 1 vectors of parameters and explanatory variables, and the error term. Stacking all ε_j's into the vector ε ≡ (ε_1, …, ε_m)′, we assume ε|x ∼ N(0, R). The covariance matrix R is subject to normalization restrictions that will be made explicit below. Equation-specific regressors are accommodated by allowing β_j to have zeroes in the positions of the variables in x that are excluded from equation j. Cross-equation restrictions on the β's are also permitted. R is normalized for scale and so has unity diagonal elements and arbitrary off-diagonal elements, ρ_ij, which allows for possible cross-equation correlation of errors. It may or may not present constraints beyond normalization. If m = 2 we have the bivariate probit model, which is estimated by the Stata command biprobit, with a syntax similar to probit.
Consider the recursive bivariate probit model

y*_1 = x′β_1 + γy_2 + ε_1
(11.9.2)  y*_2 = x′β_2 + ε_2.
It is then evident that estimating a bivariate recursive probit model is ancillary to estimation
of a univariate probit model with a binary endogenous regressor, the first equation of system
(11.9.2).
The feature that makes the recursive multivariate probit model appealing is that it accommodates endogenous, binary explanatory variables without special provisions for endogeneity, simply maximizing the log-likelihood function as if the explanatory variables were all ordinary exogenous variables (see Maddala 1983, Wooldridge 2010, Greene 2012 and, for a general proof, Roodman 2011). This can be easily seen here in the case of the recursive bivariate model, evaluated at y_2 = 1:

Pr(y_1 = 1, y_2 = 1|x) = Pr(ε_1 > −x′β_1 − γ, ε_2 > −x′β_2 | x) = Φ_2(x′β_1 + γ, x′β_2; ρ),

where Φ_2 denotes the bivariate standard normal distribution function with correlation ρ.
1Wooldridge (2010) argues that, although not strictly necessary for formal identification, substantial identification in recursive models may require exclusion restrictions in the equations of interest. For example, in system (11.9.2) substantial identification requires some zeroes in β_1, where the corresponding variables may then be thought of as instruments for y_2.
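As a numerical cross-check of the probability above, Φ_2 can be evaluated with scipy; the helper name biprobit_p11 and all parameter values below are mine, purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def biprobit_p11(x, b1, b2, gamma, rho):
    """Pr(y1 = 1, y2 = 1 | x) = Phi_2(x'b1 + gamma, x'b2; rho)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])  # unit variances, correlation rho
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([x @ b1 + gamma, x @ b2])

x = np.array([1.0, 0.5])                      # includes a constant
b1, b2 = np.array([0.2, 0.3]), np.array([-0.1, 0.4])
p = biprobit_p11(x, b1, b2, gamma=0.5, rho=0.6)
# with rho = 0 the joint probability factors into two univariate probits
p0 = biprobit_p11(x, b1, b2, gamma=0.5, rho=0.0)
```

With ρ = 0 the bivariate normal CDF factors, so p0 equals the product of the two univariate probit probabilities — a useful sanity check on any implementation.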
CHAPTER 12
12.1. Introduction
12.2. Tobit models

The tobit model is a censored regression model: the latent variable

y* = x'β + ε,   ε|x ~ N(0, σ²),

is observed only above a known lower limit L,

y = y* if y* > L,
y = L if y* ≤ L.
Restricting attention to the uncensored observations gives

y = E(y|x, y > L) + u
= E(x'β + ε | x, ε > L − x'β) + u
(12.2.1) = x'β + E(ε | x, ε > L − x'β) + u,
where u = y − E(y|x, y > L). The following results for the density and moments of the truncated normal distribution are useful (see Greene 2012, pp. 874–876):
For z ~ N(μ, σ²) and truncation point a,

f(z|z > a) = (1/σ) φ[(z − μ)/σ] / {1 − Φ[(a − μ)/σ]}
f(z|z < a) = (1/σ) φ[(z − μ)/σ] / Φ[(a − μ)/σ]
E(z|z > a) = μ + σ φ[(a − μ)/σ] / {1 − Φ[(a − μ)/σ]}
E(z|z < a) = μ − σ φ[(a − μ)/σ] / Φ[(a − μ)/σ].
The foregoing equalities are all based on the following representation of the general normal density function, f_(μ,σ²):

f_(μ,σ²)(z) = 1/(σ√(2π)) exp[−(z − μ)²/(2σ²)] = (1/σ) φ[(z − μ)/σ].
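The truncated-mean formulas above are easy to verify against scipy's truncated normal distribution; the helper names and parameter values below are mine.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def trunc_mean_lower(mu, sigma, a):
    """E(z | z > a) for z ~ N(mu, sigma^2): mu + sigma*phi(alpha)/(1 - Phi(alpha))."""
    alpha = (a - mu) / sigma
    return mu + sigma * norm.pdf(alpha) / (1.0 - norm.cdf(alpha))

def trunc_mean_upper(mu, sigma, a):
    """E(z | z < a) for z ~ N(mu, sigma^2): mu - sigma*phi(alpha)/Phi(alpha)."""
    alpha = (a - mu) / sigma
    return mu - sigma * norm.pdf(alpha) / norm.cdf(alpha)

mu, sigma, a = 1.0, 2.0, 0.5
m_lo = trunc_mean_lower(mu, sigma, a)   # E(z | z > a)
m_hi = trunc_mean_upper(mu, sigma, a)   # E(z | z < a)
```

scipy's truncnorm takes standardized truncation points (a − μ)/σ, which is a common source of bugs; the cross-check in the functions above makes the standardization explicit.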
Then, Model (12.2.1) can be written in closed form as

y = x'β + σ φ[(L − x'β)/σ] / {1 − Φ[(L − x'β)/σ]} + u

or, exploiting the symmetry of the normal distribution,

(12.2.2) y = x'β + σ φ[(x'β − L)/σ] / Φ[(x'β − L)/σ] + u,

which for L = 0 reduces to

y = x'β + σ φ(x'β/σ) / Φ(x'β/σ) + u.
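The last closed form (the L = 0 case) states that E(y|x, y > 0) = x'β + σ φ(x'β/σ)/Φ(x'β/σ); a quick Monte Carlo check, with illustrative values of x'β and σ of my choosing:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
xb, sigma = 0.5, 1.0                      # illustrative values of x'beta and sigma
y_star = xb + sigma * rng.normal(size=2_000_000)
sim_mean = y_star[y_star > 0].mean()      # E(y* | y* > 0) by simulation
closed_form = xb + sigma * norm.pdf(xb / sigma) / norm.cdf(xb / sigma)
```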
12.2.1. Estimation. There is a random sample {y_i, x_i}, i = 1, ..., n, for estimation. Let d_i = 1(y_i > L). Estimation can be via ML or two-step LS.

The log-likelihood function assembles the density functions peculiar to the subsample of individuals with d_i = 1 and those peculiar to individuals with d_i = 0 (left-censored). For an individual with d_i = 1, y_i = y*_i and we know that y_i|x_i ~ N(x'_iβ, σ²), so we can evaluate the density (1/σ) φ[(y_i − x'_iβ)/σ]; for an individual with d_i = 0 we only know that Pr(y*_i ≤ L|x_i) = Φ[(L − x'_iβ)/σ]. Hence,

lnL = Σ_i { d_i ln[(1/σ) φ((y_i − x'_iβ)/σ)] + (1 − d_i) ln Φ((L − x'_iβ)/σ) }.

The two-step estimator exploits Equation (12.2.2). The first step applies probit of d_i on x_i, which consistently estimates β/σ (L/σ is subsumed in the constant term). In the second step, apply OLS regression of y_i on x_i and the estimated ratio φ̂_i/Φ̂_i, restricting to the uncensored subsample d_i = 1. β̂_2step is consistent, but standard errors need to be adjusted, since the second step includes an estimated regressor.
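The tobit log-likelihood can also be maximized directly; the following is a sketch on simulated data (all names and parameter values are mine; in practice one would use Stata's tobit):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, L = 2000, 0.0
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([1.0, 2.0]), 1.5
y_star = x @ beta_true + sigma_true * rng.normal(size=n)
y = np.maximum(y_star, L)                   # left-censoring at L
d = (y_star > L).astype(float)              # d_i = 1(y_i > L)

def neg_loglik(theta):
    b, s = theta[:2], np.exp(theta[2])      # parameterize log(sigma) so s > 0
    xb = x @ b
    ll = d * (norm.logpdf((y - xb) / s) - np.log(s)) \
        + (1.0 - d) * norm.logcdf((L - xb) / s)
    return -ll.sum()

res = minimize(neg_loglik, np.zeros(3), method="BFGS")
b_ml, s_ml = res.x[:2], np.exp(res.x[2])
```

Note the two pieces of the likelihood match the text: the normal density for uncensored observations and the normal CDF of the censoring event for the rest.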
Upper limits can be dealt with similarly:

y = y* if y* < U,
y = U if y* ≥ U.
The Stata command that computes β̂_ML in the tobit model is tobit. The syntax is similar to regress, requiring in addition options specifying the lower limit, ll(#), and the upper limit, ul(#) (the two-step estimator is implemented by the Stata command heckman, see below).
Marginal effects of interest are ∂E(y|x)/∂x = β Φ[(x'β − L)/σ] for the observed response and ∂E(y*|x)/∂x = β for the latent variable.
12.2.2. Heteroskedasticity and clustering. The same considerations made for binary models in Sections 11.2.4 and 11.2.5 hold here. While heteroskedasticity breaks down the specification of the conditional expectations, clustering does not. Therefore, it makes sense to apply the Stata option vce(cluster clustervar).
12.3. Sample selection

We consider cases of incidental truncation, that is, when the values of the dependent variable are observed only depending on the values of some other variable. For example, in a wage equation the wage is observed only for individuals who are employed. The employment-status indicator acts as a selection rule: the wage is observed only when the indicator equals one.
12.3.1. Incidental truncation in the linear model. There are two processes: the first selects the units into the sample, the second generates y. If the two processes are related, selection is endogenous and as such cannot be ignored.

The selection process:

s* = z'γ + η,
s = 1(s* > 0).

The outcome process:

y = x'β + ε,

with y observed only if s = 1. Interest is on β. Assume

(η, ε)'|z, x ~ N( (0, 0)', [1, ρσ; ρσ, σ²] ).

The two processes are related if ρ ≠ 0. In this case selection would be endogenous and could not be ignored.
Estimation is via ML. The log-likelihood is

lnL = Σ_{i=1}^{n} { s_i ln[f(y_i|s_i = 1) Pr(s_i = 1)] + (1 − s_i) ln[Pr(s_i = 0)] }.
The Stata command that computes β̂_ML in the selection model is heckman, named after the American econometrician who introduced these methods into econometrics. The syntax is similar to regress, requiring in addition a (required) option specifying the list of variables in the selection process, s and z: select(varlist_s). For substantial identification the selection equation should contain at least one variable that is not in the outcome equation.
Heckman also developed a two-step estimator that is asymptotically equivalent to the ML. In fact, a model of the truncated mean analogous to Equation (12.2.1) can be worked out:

y = x'β + ρσ φ(z'γ)/Φ(z'γ) + u.

The first step applies probit of s on z, yielding a variable of consistent estimates of the inverse Mills ratio values (also known as the non-selection hazard):

imr̂_i = φ(z'_iγ̂)/Φ(z'_iγ̂).

The second step applies OLS of y on x and imr̂ over the selected subsample s_i = 1.
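The two steps can be sketched on simulated data; all names, the instrument, and parameter values are mine, and the sketch omits the standard-error correction that Stata's heckman performs.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 4000
z = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
x = z[:, :2]                                  # exclusion restriction: z adds one instrument
gamma = np.array([0.3, 1.0, 1.0])
beta, sigma, rho = np.array([1.0, 2.0]), 1.0, 0.5
eta = rng.normal(size=n)
eps = sigma * (rho * eta + np.sqrt(1 - rho**2) * rng.normal(size=n))
s = (z @ gamma + eta > 0).astype(float)       # selection indicator
y = x @ beta + eps                            # observed only when s == 1

# Step 1: probit of s on z by ML, then the inverse Mills ratio (non-selection hazard)
nll = lambda g: -(s * norm.logcdf(z @ g) + (1 - s) * norm.logcdf(-(z @ g))).sum()
g_hat = minimize(nll, np.zeros(3), method="BFGS").x
imr = norm.pdf(z @ g_hat) / norm.cdf(z @ g_hat)

# Step 2: OLS of y on x and the estimated inverse Mills ratio, selected subsample only
W = np.column_stack([x, imr])[s == 1]
coef, *_ = np.linalg.lstsq(W, y[s == 1], rcond=None)
b_2step, rho_sigma_hat = coef[:2], coef[2]    # last coefficient estimates rho * sigma
```

The coefficient on the inverse Mills ratio estimates ρσ, so a test of its significance is an informal test of selection endogeneity.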
12.3.2. Incidental truncation in probit. Incidental truncation may also affect binary
models. The selection process is exactly as for heckman:
s* = z'γ + η,
s = 1(s* > 0).
12.3.3. Incidental truncation in linear panel data models. The methods imple-
mented by heckman are not feasible for linear panel data models, either ML or two-step, if
latent-heterogeneity components affect both the model of interest and the selection rule. The
reason is simple and boils down to an incidental-parameter bias in the probit selection rule.
An alternative procedure suggested by Wooldridge (2010) can be implemented. Details are in the dofile sample_selection_pd.do.
NB: Standard-error estimates must be corrected; Wooldridge provides the analytical formulas for this. Alternatively, one can estimate standard errors by bootstrap. In the dofile bs_sample_selection_pd.do, I follow the latter strategy and implement bootstrap standard errors.
12.4. Attrition: inverse probability weighting

12.4.1. Concepts and assumptions. Attrition occurs in panel data when individuals drop out of the sample at subsequent dates. Attrition is non-ignorable when dropouts are related to latent elements in the model of interest. Inverse probability weighting (IPW) is a general way to deal with non-ignorable attrition and can be used in non-linear models.
Consider a panel data framework and the selection variable

s_it = 1 if (i, t) is observed,
s_it = 0 otherwise.
Consider also the set of variables that may potentially affect attrition, v_it = (y_it, x'_it, z'_it)', where z_it is a vector of variables that are external to the model, not necessarily exogenous. Setting z_it = (y_{i,t−1}, x'_{i,t−1})' (see Moffitt et al. 1999; Wooldridge 2010) ensures that z_it is observed for dropouts at occasion t. This also seems quite plausible, as attrition is often closely linked to individuals changing location.
We assume that the whole history of v_it, v_i = (v'_i1, ..., v'_iT)', is redundant in explaining attrition at occasion t, once z_it is accounted for in the conditional probability of s_it = 1:

Pr(s_it = 1 | v_i, s_{i,t−1} = 1) = Pr(s_it = 1 | z_it, s_{i,t−1} = 1),   t = 2, ..., T.
Given this assumption, the probabilities for use in IPW can be constructed sequentially: for all i = 1, ..., N, set p_i1 = 1 and p_it = ∏_{r=2}^{t} π_ir, where π_it = Pr(s_it = 1 | z_it, s_{i,t−1} = 1), t = 2, ..., T.
Estimates of p_it, p̂_it, are then used for IPW of the conditional log-likelihood functions as follows:

l(θ) = Σ_{i=1}^{N} Σ_{t=1}^{T} (s_it/p̂_it) l_it(θ).
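The sequential construction of p̂_it and the weights s_it/p̂_it can be illustrated with two units and T = 4; the fitted π̂_it values below are made up.

```python
import numpy as np

# pi_hat[i, t-2] = fitted Pr(s_it = 1 | z_it, s_i,t-1 = 1) for t = 2, 3, 4
pi_hat = np.array([[0.9, 0.8, 0.7],
                   [0.6, 0.5, 0.4]])
# p_i1 = 1 by construction; p_it is the running product of the pi_ir's
p_hat = np.column_stack([np.ones(2), np.cumprod(pi_hat, axis=1)])

s = np.array([[1, 1, 1, 0],    # unit 1 attrites at t = 4
              [1, 1, 0, 0]])   # unit 2 attrites at t = 3
w = np.where(s == 1, s / p_hat, 0.0)   # IPW weights s_it / p_hat_it
```

Each observed (i, t) cell is up-weighted by the inverse of its estimated probability of still being in the sample; unobserved cells get zero weight and drop out of the weighted log-likelihood.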
Bibliography

Abowd, J. M., Kramarz, F., Margolis, D. N., 1999. High wage workers and high wage firms. Econometrica 67, 251–333.
Anderson, T. W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 570–606.
Andrews, D. W. K., Moreira, M. J., Stock, J. H., 2007. Performance of conditional Wald tests in IV regression with weak instruments. Journal of Econometrics 139, 116–132.
Angrist, J. D., Pischke, J. S., 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics 49 (4), 431–434.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–297.
Baltagi, B. H., 2008. Econometric Analysis of Panel Data. New York: Wiley.
Blundell, R., Bond, S., 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–143.
Bowsher, C. G., 2002. On testing overidentifying restrictions in dynamic panel data models. Economics Letters 77, 211–220.
Bruno, G. S. F., 2005a. Approximating the bias of the LSDV estimator for dynamic unbalanced panel data models. Economics Letters 87, 361–366.
Bruno, G. S. F., 2005b. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. The Stata Journal 5, 473–500.
Bun, M. J. G., Kiviet, J. F., 2003. On the diminishing returns of higher order terms in asymptotic expansions of bias. Economics Letters 79, 145–152.
Cameron, A. C., Gelbach, J. B., Miller, D. L., 2011. Robust inference with multiway clustering. Journal of Business & Economic Statistics 29, 238–249.
Cameron, A. C., Trivedi, P. K., 2010. Microeconometrics Using Stata - Revised Edition. Stata Press, College Station, TX.
Cappellari, L., Jenkins, S. P., 2003. Multivariate probit regression using simulated maximum likelihood. The Stata Journal 3, 278–294.
Cragg, J., Donald, S., 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9, 222–240.
Entorf, H., 2012. Expected recidivism among young offenders: Comparing specific deterrence under juvenile and adult criminal law. European Journal of Political Economy 28, 414–429.
Evans, W. N., Schwab, R. M., 1995. Finishing high school and starting college: Do Catholic schools make a difference? The Quarterly Journal of Economics 110, 941–974.
Fichera, E., Sutton, M., 2011. State and self investment in health. Journal of Health Economics 30, 1164–1173.
Greene, W. H., 1998. Gender economics courses in liberal arts colleges: Further results. Journal of Economic Education 29, 291–300.
Greene, W. H., 2008. Econometric Analysis, sixth Edition. Upper Saddle River, NJ: Prentice
Hall.
Greene, W. H., 2012. Econometric Analysis, seventh Edition. Upper Saddle River, NJ: Prentice
Hall.
Hansen, L. P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029–1054.
Hausman, J., 1978. Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J. A., Taylor, W., 1981. Panel data models and unobservable individual effects. Econometrica 49, 1377–1398.
Searle, S. R., 1982. Matrix Algebra Useful for Statistics. New York: Wiley.
Stock, J. H., Watson, M. W., 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76, 155–174.
Stock, J. H., Yogo, M., 2005. Testing for weak instruments in linear IV regression. In: Andrews, D., Stock, J. H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge: Cambridge University Press, pp. 80–108.
Swamy, P. A. B., Arora, S. S., 1972. The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40 (2), 261–275.
White, H., 2001. Asymptotic Theory for Econometricians, revised Edition. Emerald.
Windmeijer, F., 2005. A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics 126, 25–51.
Wooldridge, J. M., 2005a. Unobserved heterogeneity and estimation of average partial effects.
In: Andrews, D. W. K., Stock, J. H. (Eds.), Identification And Inference For Econometric
Models: Essays In Honor Of Thomas Rothenberg. Cambridge University Press, New York.
Wooldridge, J. M., 2005b. Violating ignorability of treatment by controlling for too many factors. Econometric Theory 21, 1026–1028.
Wooldridge, J. M., 2010. Econometric Analysis of Cross Section and Panel Data, 2nd Edition.
The MIT Press, Cambridge, MA.
Yatchew, A., Griliches, Z., 1985. Specification error in probit models. Review of Economics and Statistics 67, 134–139.
Zyskind, G., 1967. On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Annals of Mathematical Statistics 36, 1092–1109.