Bruno Lecture Notes PDF
Econometrics
Giovanni Bruno
Chapter 1. Introduction
1.1. Introduction
1.2. The linear population model
Chapter 5. The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity
5.1. Introduction
5.2. Embedding the Oaxaca model into a pooled regression framework
5.3. The OLS estimator in the Oaxaca model is BLUE
5.4. Zyskind's condition is verified in the Oaxaca model
Bibliography
Part 1
Linear Models
CHAPTER 1
Introduction
1.1. Introduction
Indeed, causation is not the same as correlation. Econometrics uses economic theory, mathematics and statistics to quantify economic structural relationships, often in search of causal links among the variables of interest.
Although rather schematic, the following discussion should convey the basic intuition of
how this process works.
Economic theory provides the econometrician with an economic structural model,
(1.1.1) y = f(x, ε),
from the economic factors of interest, x, to the economic response or dependent variable y. Often in applications q = 1, which means that ε is treated as a catch-all random scalar.
For example, f(x, ε) may be the expenditure function in a population of (possibly) heterogeneous consumers, with preferences ε and facing income and prices x; or it may be the Marshallian demand function for some good in the same population, with x denoting prices and total consumption expenditure; or it may be the demand function for some input of a population of (possibly) heterogeneous firms facing input and output prices x, with ε comprising technological latent heterogeneity, and so on.¹
The individual f(x, ε), with its gradient vector of marginal effects, ∇x f(x, ε), and Hessian matrix, Dxx f(x, ε), are typically the structural objects of interest, but sometimes attention is centered upon aggregate structural objects, such as the population-averaged structural function,
∫ f(x, ε) dF(ε),
¹Wooldridge (2010) prefers to think of f(x, ε) as a structural conditional expectation: E(y|x, ε) = f(x, ε). There is nothing in the present analysis that prevents such an interpretation.
The key question is under what conditions these estimable statistical objects are informative on f(x, ε). Evidently, to establish a mapping between the structural economic object of interest and the foregoing statistical objects, the econometrician needs to model the relationship between observables and unobservables in f(x, ε) and do so in a plausible way. The restrictions used to this purpose are called identification restrictions. The next sections describe the simplest probabilistic model for equation (1.1.1), the linear population model.
1.2. The linear population model
Equation (1.1.1) is a linear model of the population if the following assumptions hold.
(1.2.1) y = x′β + ε,
The following proves the equivalence of P.2 and P.2b (not crucial and rather technical; it can be skipped when preparing for the exam). I exploit the properties of the expectation and rank operators. Assume P.2 and Pr(a′x = 0) = 1 for some conformable constant vector a. Then E(a′xx′a) = 0, and so a′E(xx′)a = 0, which implies a = 0 by P.2, proving P.2b. Now, assume P.2b and pick any a ≠ 0. Then Pr(a′x = 0) ≠ 1 and so Pr(a′x ≠ 0) > 0. But since a′x ≠ 0 is equivalent to a′xx′a > 0, then Pr(a′xx′a > 0) = Pr(a′x ≠ 0) > 0. So, since Pr(a′xx′a ≥ 0) = 1, E(a′xx′a) > 0, which in turn implies a′E(xx′)a > 0. Therefore, E(xx′) is positive definite and so non-singular, that is P.2.
Solution:
E(xx′) = [ 1, E(x1); E(x1), E(x1²) ]
and so det E(xx′) = E(x1²) − E²(x1) = Var(x1), and the claim is proved by noting that the determinant, and hence E(xx′), is non-singular if and only if Var(x1) > 0.
∇x f(x, ε) = β.
By assumption P.3 and the law of iterated expectations, E(xε) = 0. Since ε = y − x′β, then we have the system of k moment conditions
(1.2.2) E(xy − xx′β) = 0
or E(xy) = E(xx′)β. Assumption P.2, then, ensures that the foregoing system can be solved for β to have
(1.2.3) β = E(xx′)⁻¹ E(xy)
At this point the linear probabilistic model establishes a precise mapping between, on the one hand, the structural objects of interest, f(x, ε) and ∇x f(x, ε), and on the other the observable or estimable objects y, x, E(xx′) and E(xy). Indeed, f(x, ε) and ∇x f(x, ε) are equal to unique known transformations of y, x, E(xx′) and E(xy). This means that f(x, ε) and ∇x f(x, ε) can be estimated using estimators for E(xx′) and E(xy), whose choice depends on the underlying sampling mechanism. The most basic strategy is to carry out estimation within the linear regression model and its variants. In essence, the linear regression model is the linear probabilistic model supplemented by a random sampling assumption. This ensures optimal properties of the ordinary least squares (OLS) estimator and its various generalizations.
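Although the empirical work in these notes is carried out in Stata, the sample-analogue logic behind (1.2.3) is easy to check in any matrix language. The following is a minimal Python/NumPy sketch on simulated data; the coefficient values and sample size are purely illustrative assumptions.

```python
import numpy as np

# Simulate the linear population model y = x'beta + eps with E(x eps) = 0.
rng = np.random.default_rng(0)
n = 100_000
beta = np.array([1.0, 2.0, -0.5])                  # illustrative "true" coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta + rng.normal(size=n)

# Sample analogues of the population moments E(xx') and E(xy) ...
Sxx = X.T @ X / n
Sxy = X.T @ y / n
# ... and the solved moment conditions, beta = E(xx')^{-1} E(xy) (eq. 1.2.3)
beta_hat = np.linalg.solve(Sxx, Sxy)
```

By the law of large numbers the sample moments converge to their population counterparts, so beta_hat approaches beta as n grows; this is exactly the OLS estimator introduced in the next chapter.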
A more restrictive specification of the linear model maintains the assumptions of conditional homoskedasticity and normality
P.4: Var(ε|x) = σ².
P.5: ε|x ∼ N(0, σ²).
A more general variant of the linear model, instead, replaces assumption P.3 with
P.3b: E(xε) = 0.
Under P.3b it is still true that β = E(xx′)⁻¹E(xy) and ∇x f(x, ε) = β, with the virtue that the conditional expectation E(y|x) is left unrestricted. Therefore, with P.3b replacing P.3, the model is more general.
The function x′β, with β = E(xx′)⁻¹E(xy), is relevant in either version of the linear model and is called the linear projection of y onto x.
2.1. From the linear population model to the linear regression model
Consider the linear model of the previous chapter: the population equation (1.1.1)
y = f(x, ε)
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi, xi1, xi2, ..., xik), i = 1, ..., n} are independently and identically distributed (i.i.d.) random vectors.
Given P.1-P.3 and RS, we have the linear regression model (LRM)
(2.1.1) yi = xi′β + εi
with xi′ = (xi1, xi2, ..., xik), i = 1, ..., n, and {εi = yi − xi′β, i = 1, ..., n} is a sequence of unobserved i.i.d. error terms.
2.2. The properties of the LRM
(2.2.1) y = Xβ + ε
where y is the (n×1) vector with i-th element yi, X is the (n×k) matrix with i-th row xi′, and ε is the (n×1) vector with i-th element εi.
It is not hard to see that model (2.2.1), given P.1-P.3 and RS, satisfies the following properties.
LRM.1 is obvious. LRM.2 requires that no column of X can be obtained as a linear combination of other columns of X or, equivalently, that a = 0 if Xa = 0, or also equivalently that for any a ≠ 0 there exists at least one observation i = 1, ..., n such that xi′a ≠ 0. P.2 ensures that this occurs with non-zero probability, which approaches unity as n → ∞. LRM.3, instead, is a consequence of P.3 and RS. This is proved as follows. By P.3, E(εi|xi′) = 0, i = 1, ..., n, or E(yi|xi′) − xi′β = 0, i = 1, ..., n. Since
E(εi|x1′, ..., xi′, ..., xn′) = E(yi|x1′, ..., xi′, ..., xn′) − xi′β
and by RS, E(yi|xi′) = E(yi|x1′, x2′, ..., xn′), then
If, in addition, P.4 (conditional homoskedasticity) and P.5 (conditional normality) hold for the population model, then one can verify that
LRM.4: Var(ε|X) = σ²In.
LRM.5: ε|X ∼ N(0, σ²In)
While LRM.1-LRM.5 are less restrictive than P.1-P.5 and RS and, in most cases, sufficient for accurate and precise inference, they are still strong assumptions to maintain. Finally, if P.3 is replaced by P.3b, E(xε) = 0, then LRM.3 gets replaced by
E(X′ε) = 0.
2.3. Difficulties and extensions
Some or all of LRM.1-LRM.5 may not be verified if the population model assumptions and/or the RS mechanism are not verified in reality. Here is a list of the most important population issues.
Non-linearities (P.1 fails): the model is non-linear in the parameters. This leads LRM.1 to fail.
Perfect multicollinearity (P.2 fails): some variables in x are indeed linear combinations of the others. LRM.2 fails, but in general this is not a serious problem; it simply indicates that the model has not been parametrized correctly to begin with. A different parametrization will restore identification in most cases.
Endogeneity (P.3 fails): some variables in x are related to ε. LRM.3 fails.
Conditional heteroskedasticity (P.4 fails): the conditional variance depends on x. LRM.4 fails.
Omitted variables: some of the variables in x are not sampled. This implies that the missing variables cannot enter the conditioning set and have to be treated as unobserved errors, along with ε, which could make LRM.3-LRM.5 fail.
Measurement error: some of the variables in x are measured with error. We have the wrong variables in the conditioning set. As in the case of omitted variables, LRM.3-LRM.5 may fail.
Endogenous selection: some units in the sample are missing due to events related to ε. Also in this case, LRM.3-LRM.5 are likely to fail.
Notice that often problems in the RS mechanism have their roots in the population model. For example, the presence of non-random variables in x is not in general compatible with an identically distributed sample and, in consequence, with RS. It is easy to verify, though, that non-random x along with a weaker sampling mechanism requiring only independent sampling is compatible with LRM.1-LRM.5. Also, the presence of variables in x at different levels of aggregation may not be compatible with independent sampling, as observed by Moulton (1990). In this case, the sampling mechanism can be relaxed by maintaining independence only across groups of observations and not across observations themselves. See for example the sampling mechanism described in Section 7.6 for panel data models, in which the sample is neither identically distributed nor independent across observations.
Finally, it is important to emphasize that even if all the population assumptions and the RS mechanism are valid, data problems may arise in the form of multicollinearity among regressors. As we will see in Chapter 4, although multicollinearity does not affect the statistical properties of the estimators in finite samples, it can severely affect the precision of the coefficient estimates in terms of large standard errors.
CHAPTER 3
We do not agree with Larry (the adult croc), do we? Algebra may be boring, but only if its purpose is left obscure. Algebra in econometrics provides the bricks to construct estimators and tests. The fact that most estimators and tests are automatically implemented by statistical packages is no excuse to neglect the underlying algebra. First, because most does not mean all, and it may be the case that for our research work we have to build the technique ourselves. This is especially true for the most recent techniques. A robust Hausman test for panel data models and multiway cluster-robust standard errors are just a few examples of techniques that are not yet coded in the popular statistical packages. Second, even if the technique is available as a built-in procedure in our preferred statistical package, to use it correctly we have to know how it is made, which boils down to understanding its algebra. Finally, interpretation of results often requires that we are aware of the algebraic properties of estimators and tests. So the material here may seem rather intricate at times, but it is certainly of practical use.
This chapter is based on my lecture notes in matrix algebra as well as on Greene (2008), Searle (1982) and Rao (1973). Throughout, I denotes a conformable identity matrix; 0 denotes a conformable null matrix, vector or scalar, with the appropriate meaning being clear from the context; y is a real (n×1) vector containing the observations of the dependent variable; X is a real (n×k) regressor matrix of full column rank.
The do-file algebra_OLS.do demonstrates the results of this chapter using the Stata data set US_gasoline.dta.
Given the real (n×k) matrix A, the columns of A are said to be linearly dependent if there exists some non-zero (k×1) vector b such that Ab = 0.
Given the real (n×k) matrix A, the columns of A are said to be linearly independent if Ab = 0 only if b = 0.
Two real non-zero (n×1) vectors a and b are said to be orthogonal if a′b = 0. Given two real non-zero matrices A and B with the same number of rows, if each column of A is orthogonal to all columns of B, so that A′B = 0, then A and B are said to be orthogonal.
3.3. OLS: definition and properties
We do not have any model in mind here, just data for the response variable,
y = (y1, ..., yi, ..., yn)′.
where
S(bo) = (y − Xbo)′(y − Xbo)
= y′y − 2y′Xbo + bo′X′Xbo,
where the second equation follows from the fact that all of the right hand side components are 1×1 matrices and as such necessarily symmetric, so that bo′X′y = y′Xbo. Then, remembering the derivation rules for matrices (∂(Az)/∂z = A′ for any matrix A, ∂(z′Bz)/∂z = 2Bz for any symmetric matrix B),
∂S(bo)/∂bo = −2X′y + 2X′Xbo,
so that the first order conditions (OLS normal equations) of the minimization problem are
(3.3.1) −X′y + X′Xb = 0,
(3.3.2) b = (X′X)⁻¹X′y
Notice that
∂²S(bo)/∂bo∂bo′ = 2X′X
is a positive definite matrix (i.e. S(bo) is globally convex in bo). The latter is easily proved as follows. A matrix A is said to be positive definite if the quadratic form c′Ac > 0 for any conformable vector c ≠ 0. By the full column rank assumption, z = Xc ≠ 0 for any c ≠ 0; therefore c′X′Xc = z′z = Σᵢ₌₁ⁿ zᵢ² > 0 for any c ≠ 0.
(3.3.3) e = y − Xb
(3.3.4) X′(y − Xb) = 0.
Therefore, if X contains a column of all unity elements, say 1, three important implications follow.
(1) The sample mean of e is zero: 1′e = Σᵢ₌₁ⁿ eᵢ = 0 and consequently ē = (1/n) Σᵢ₌₁ⁿ eᵢ = 0.
(2) The OLS regression line passes through the point of sample means (ȳ, x̄), that is ȳ = x̄′b, where ȳ = (Σᵢ₌₁ⁿ yᵢ)/n and
x̄′ = (n⁻¹ Σᵢ₌₁ⁿ x1i, ..., n⁻¹ Σᵢ₌₁ⁿ xki)
(3) The fitted values
(3.3.5) ŷ = Xb
have the same sample mean as y: the sample mean of ŷ equals ȳ.
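The normal equations and the three implications above can be verified numerically. Here is a small Python/NumPy sketch on simulated data (all values are illustrative; the quantities below are numerically zero up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes the unity column 1
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS, eq. (3.3.2)
e = y - X @ b                           # residuals, eq. (3.3.3)
y_hat = X @ b                           # fitted values, eq. (3.3.5)

check_normal_eq = X.T @ e               # eq. (3.3.4): X'e = 0
check_mean_e = e.mean()                 # implication (1): zero-mean residuals
check_through_means = y.mean() - X.mean(axis=0) @ b   # implication (2): ybar = xbar'b
check_mean_fit = y_hat.mean() - y.mean()              # implication (3): mean(yhat) = ybar
```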
3.3.1. Stata implementation: get your Stata data file with use. All Stata data
files can be recognized by their filetype dta. Suppose you have y and X within a Stata data
file called, say, mydata.dta, stored in your Stata working directory and that you have just
launched Stata on your laptop. To get your data into memory, from the Stata command line
execute use followed by the name of the data file (specifying the filetype dta is not necessary
since use only supports dta files):
use mydata
If mydata.dta is not in your Stata working directory but somewhere else on your laptop, then you must specify the path of the dta file. For example, if you have a Mac and your data file is in the folder /Users/giovanni you will write
use /Users/giovanni/mydata
while under Windows you will write something like
use c:\giovanni\mydata
If the path involves folders with names that include blanks, then enclose the whole path in double quotes. For example:
use "c:\my folder\mydata"
3.3.2. Stata implementation: the help command. To know syntax, options, usage
and examples for any Stata command, write help from the command line followed by the
name of the command for which you want help. For example,
help use
Title
Syntax
Menu
Description
In the second syntax for use, a subset of the data may be read.
Options
clear specifies that it is okay to replace the data in memory, even though
the current data have not been saved to disk.
nolabel prevents value labels in the saved data from being loaded. It is
unlikely that you will ever want to specify this option.
Examples
. use http://www.stata-press.com/data/r11/auto
. replace rep78 = 3 in 12
Also see
Help: [D] compress, [D] datasignature, [D] fdasave, [D] haver, [D]
infile (free format), [D] infile (fixed format), [D] infix, [D]
insheet, [D] odbc, [D] save, [D] sysuse, [D] webuse
.
3.3.3. Stata implementation: OLS estimates with regress. Now that you have
loaded your data into memory, Stata can work with them. Suppose your dependent variable
y is called depvar and that X contains two variables, x1 and x2. To run the OLS regression of
depvar on x1 and x2 with the constant term included, you write regress followed by depvar
and, then, the names of the regressors:
regress depvar x1 x2
The following example shows the regression in example 1.2 of Greene (2008) with annual
values of US aggregate consumption (c) used as the dependent variable and regressed on
annual values of US personal income (y) for the period 1970-1979.
regress includes the constant term (the unity vector) by default and always with the name _cons. If you don't want it, just add the regress option noconstant:
regress depvar x1 x2, noconstant
Notice that, according to a general rule of the Stata syntax, the options of any Stata command always follow the comma symbol. This means that if you wish to specify options you have to write the comma symbol after the last argument of the command, so that everything to the right of the comma symbol is held by Stata as an option. There can be more than one option. Of course, if you do not wish to include options, don't write the comma symbol.
After execution, regress leaves behind a number of objects in memory, mainly scalars
and matrices, that will stay there, available for use, until execution of the next estimation
command. To know what these objects are, consult the section Saved results in the help
of regress, where you will find the following description.
Saved results
Scalars
e(N) number of observations
e(mss) model sum of squares
e(df_m) model degrees of freedom
e(rss) residual sum of squares
e(df_r) residual degrees of freedom
e(r2) R-squared
e(r2_a) adjusted R-squared
e(F) F statistic
e(rmse) root mean squared error
e(ll) log likelihood under additional assumption of i.i.d.
normal errors
e(ll_0) log likelihood, constant-only model
e(N_clust) number of clusters
e(rank) rank of e(V)
Macros
e(cmd) regress
e(cmdline) command as typed
e(depvar) name of dependent variable
e(model) ols or iv
e(wtype) weight type
e(wexp) weight expression
e(title) title in estimation output when vce() is not ols
e(clustvar) name of cluster variable
e(vce) vcetype specified in vce()
e(vcetype) title used to label Std. Err.
e(properties) b V
e(estat_cmd) program used to implement estat
e(predict) program used to implement predict
e(marginsok) predictions allowed by margins
e(asbalanced) factor variables fvset as asbalanced
e(asobserved) factor variables fvset as asobserved
Matrices
e(b) coefficient vector
e(V) variance-covariance matrix of the estimators
e(V_modelbased) model-based variance
Functions
e(sample) marks estimation sample
You should already be familiar with some of the e() objects in the Scalars and Matrices parts. At the end of the course you will be able to understand most of them. Don't worry about the Macros and Functions parts; they are rather technical and, in any case, not relevant for our purposes.
To know the values taken on by the e() objects, execute the command ereturn list just
after the regress instruction. In our regression example we have:
. ereturn list
scalars:
e(N) = 10
e(df_m) = 1
e(df_r) = 8
e(F) = 959.919036180133
e(r2) = .9917348458900325
e(rmse) = 8.193020017500434
e(mss) = 64435.11918375102
e(rss) = 537.0046160573024
e(r2_a) = .9907017016262866
e(ll) = -34.10649331948547
e(ll_0) = -58.08502782843004
e(rank) = 2
macros:
e(cmdline) : "regress c y"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "c"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 2
e(V) : 2 x 2
functions:
e(sample)
3.4. Spanning sets and orthogonal projections
Consider the n-dimensional Euclidean space Rⁿ and the (n×k) real matrix A. Then, each column of A belongs to Rⁿ and the set of all linear combinations of the columns of A is called the space spanned by the columns of A (or also the range of A), denoted by R(A).
R(A) can easily be proved to be a subspace of Rⁿ (it is obvious that R(A) ⊆ Rⁿ; R(A) is a vector space since, given any two vectors a1 and a2 belonging to R(A), then a1 + a2 ∈ R(A) and ca1 ∈ R(A) for any real scalar c). Since each element of R(A) is a vector of n components, R(A) is said to be a vector space of order n. The dimension of R(A), denoted
by dim[R(A)], is the maximum number of linearly independent vectors in R(A). Therefore, dim[R(A)] = rank(A) and if A is of full column rank, then dim[R(A)] = k.
The set of all vectors in Rⁿ that are orthogonal to the vectors of R(A) is denoted by A⊥.
I now prove that A⊥ is a subspace of Rⁿ. A⊥ ⊆ Rⁿ by definition. Given any two vectors b1 and b2 belonging to A⊥ and for any a ∈ R(A), b1′a = 0 and b2′a = 0, but then also (b1 + b2)′a = 0 and, for any scalar c, (cb1)′a = 0, which completes the proof.
Importantly, it is possible to prove, although not pursued here, that
(3.4.1) dim A⊥ = n − rank(A).
A⊥ is commonly referred to as the space orthogonal to R(A), or also the orthogonal complement of R(A).
For simplicity, assume A of full column rank and define the operator P[A] as
P[A] = A(A′A)⁻¹A′.
As an exercise you can verify that P[A] is a symmetric (P[A]′ = P[A]) and idempotent (P[A]P[A] = P[A]) matrix. With these two properties, P[A] is said to be an orthogonal projector. In geometrical terms, P[A] projects vectors onto R(A) along a direction that is parallel to the space orthogonal to R(A), A⊥. Symmetrically,
M[A] = I − P[A]
is the orthogonal projector that projects vectors onto A⊥ along a direction that is parallel to the space orthogonal to A⊥, R(A).
Exercise 4. Prove that M[A] is an orthogonal projector (hint: just verify that M[A] is symmetric and idempotent).
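The defining properties of P[A] and M[A] (symmetry, idempotency, mutual orthogonality, and their action on vectors lying in R(A)) can also be checked numerically. A Python/NumPy sketch with an arbitrary full-column-rank A (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 3
A = rng.normal(size=(n, k))               # full column rank with probability one

P = A @ np.linalg.inv(A.T @ A) @ A.T      # P[A] = A(A'A)^{-1}A'
M = np.eye(n) - P                         # M[A] = I - P[A]

sym_idem_P = np.allclose(P, P.T) and np.allclose(P @ P, P)
sym_idem_M = np.allclose(M, M.T) and np.allclose(M @ M, M)
orthogonal = np.allclose(P @ M, 0)        # P[A] and M[A] are orthogonal

v = A @ rng.normal(size=k)                # any vector lying in R(A)
projects = np.allclose(P @ v, v) and np.allclose(M @ v, 0)
```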
The properties of orthogonal projectors, established by the following exercises, are readily understood once one grasps the geometrical meaning of orthogonal projectors. They can also be demonstrated algebraically, which is what the exercises require.
Exercise 5. Given two (n×k) real matrices A and B, both of full column rank, prove that if A and B span the same space then P[A] = P[B] (hint: prove that A can always be expressed as A = BK, where K is a non-singular (k×k) matrix).
Solution: If R(A) coincides with R(B), then every column of A belongs to R(B), and as such every column of A can be expressed as a linear combination of the columns of B, A = BK, where K is (k×k). Therefore, P[A] = BK(K′B′BK)⁻¹K′B′. An important result of linear algebra states that given two conformable matrices C and D, then rank(CD) ≤ min[rank(C), rank(D)] (see Greene (2008), p. 957, (A-44)). Since both A and B have rank equal to k, in the light of the foregoing inequality, k ≤ min[k, rank(K)], which implies that rank(K) ≥ k, and since rank(K) > k is not possible, then rank(K) = k and K is non-singular. Finally, by the property of the inverse of the product of square matrices (see Greene (2008), p. 963, (A-64))
P[A] = BK(K′B′BK)⁻¹K′B′
= BKK⁻¹(B′B)⁻¹(K′)⁻¹K′B′
= P[B].
Exercise 6. Prove that P[A] and M[A] are orthogonal, that is, P[A]M[A] = 0.
The foregoing results are useful to properly understand the properties of OLS. But before going on, do the following exercise.
3.5. OLS residuals and fitted values
Exercise 7. Given any (n×1) real vector v lying in R(A), prove that P[A]v = v and M[A]v = 0 (hint: express v as v = Ac, where c is a real (k×1) vector).
(3.5.1) P[A]A = A
and
(3.5.2) M[A]A = 0.
(3.5.3) e = M[X]y,
where
M[X] = I − X(X′X)⁻¹X′.
Therefore, the OLS residual vector, e, is the orthogonal projection of y onto the space orthogonal to that spanned by the regressors, X⊥. For this reason M[X] is called the residual maker. From (3.3.2) and (3.3.5) it follows that
ŷ = P[X]y
and
(3.5.4) y = ŷ + e.
The geometry of OLS can be grasped within the framework of an n = 2, k = 1 regression model.
predict y_hat
You have thereby created a new variable with name y_hat that contains the ŷ values. Fitted values are the default calculation of predict; if you want residuals just add the res option:
predict resid, res
and you have got a new variable in your data called resid that contains the e values.
It is important to stress that predict supports any estimation command, not only regress. So, it can be implemented, for example, after xtreg in the context of panel data.
3.6. Partitioned regression
It is often the case in econometrics that interest centers on a subset of explanatory variables and the corresponding marginal effects, and that consequently we are interested in the specific formula for the corresponding OLS subvector.
Partition X as X = (X1 X2) and, accordingly,
b = (b1′, b2′)′,
where b = (X′X)⁻¹X′y. The following important theorem provides the answer, yielding the OLS formulas for the b1 and b2 subvectors.
Theorem 8. Assume X has full column rank; then there is a linear relationship between b1 and b2,
(3.6.1) bi = (Xi′Xi)⁻¹Xi′(y − Xj bj), i, j = 1, 2 and i ≠ j.
or
(3.6.5) b1 = (X1′X1)⁻¹X1′(y − X2b2),
and by symmetry b2 = (X2′X2)⁻¹X2′(y − X1b1), which shows the first part of the theorem. To prove the second part, replace the right hand side of equation (3.6.5) into the second system in (3.6.4) to obtain
X2′y − X2′X1(X1′X1)⁻¹X1′(y − X2b2) − X2′X2b2 = 0
or equivalently, using the orthogonal projector notation P[X1] for X1(X1′X1)⁻¹X1′,
X2′M[X1]y − X2′M[X1]X2b2 = 0,
so that eventually
b2 = (X2′M[X1]X2)⁻¹X2′M[X1]y.
Finally, by symmetry,
b1 = (X1′M[X2]X1)⁻¹X1′M[X2]y.
Theorem 8 is an important result and is known in the econometric literature as the Partitioned Regression Theorem or also the Frisch-Waugh-Lovell Theorem, after the names of the economists who independently proved and re-proved it over time. Since the partitioning considered by Theorem 8 is completely arbitrary, the Theorem's formulas (3.6.2) permit computing the OLS estimator for any specific subvector of variables in the regression. For example, the formula for a single coefficient bi is readily obtained by specifying the partitioning as X1 = xi, X2 = X−i, where xi is the i-th column in X and X−i is the matrix X with column xi removed. So, applying the Theorem's formulas (3.6.2) to such a partitioning we find that
(3.6.6) bi = (xi′M[X−i]xi)⁻¹xi′M[X−i]y = xi′M[X−i]y / (xi′M[X−i]xi).
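Theorem 8 can be checked numerically: the coefficients on X2 from the full regression coincide with those from the reduced regression of M[X1]y on M[X1]X2. A Python/NumPy sketch on simulated data (dimensions and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # first block (with constant)
X2 = rng.normal(size=(n, 2))                                  # second block
X = np.hstack([X1, X2])
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)                         # full regression

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # M[X1]
Z = M1 @ X2                                                   # X2 partialled on X1
b2 = np.linalg.solve(Z.T @ Z, Z.T @ (M1 @ y))                 # (X2'M[X1]X2)^{-1} X2'M[X1]y
```

Here b2 reproduces the last two elements of b, exactly as the theorem asserts.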
Theorem 8 is also useful in empirical work to reduce the dimension of the regression model without affecting the OLS estimates. Indeed, b1 can be obtained by a reduced regression where the dependent variable is the residual vector obtained by regressing y on X2, that is M[X2]y, and the regressors are the residuals obtained from the regressions of each column of X1 on X2, that is M[X2]X1. The number of regressors in the reduced regression is equal to the number of columns in X1. For example, bi can be obtained by regressing the dependent variable M[X−i]y on the single regressor M[X−i]xi and as such could be represented graphically as the slope of the resulting fitted regression line in a two-way graph in Stata. The simple reduced regression of y on xi will not do, as it does not account for the variation in X−i. An empirical demonstration of the foregoing utilization of Theorem 8 is given by the following portion of the Stata dofile algebra_OLS.do, estimating an aggregate gasoline expenditure function for the US economy.
The dofile concludes with the graphical representation of the estimated coefficient on Gasp (−0.070) as the slope of the estimated regression line, with the scatterplot of ln_G_e and ln_Gasp_e in the background.
bi = (Xi′Xi)⁻¹Xi′y, i = 1, 2.
Corollary 9 is proved at once by substituting Xi′Xj = 0 into the first equation of Theorem 8. It ensures that in the presence of orthogonal regressors, b1 (b2) can be obtained by the reduced OLS regression of y on X1 (X2). From an empirical point of view orthogonal regressors are a narrow case. Yet, Corollary 9 will often be of use in derivations.
Exercise 10. Partitioning X = (X1 1), where 1 is the (n×1) vector of all unity elements, prove that M[1] = I − 1(1′1)⁻¹1′ transforms all variables into deviations from their sample means, and so that the OLS estimator b1 can be obtained by regressing y demeaned on X1 demeaned.
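Exercise 10's claim that M[1] demeans is immediate to verify numerically (a Python/NumPy sketch; the vector length is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
ones = np.ones((n, 1))
M1 = np.eye(n) - ones @ ones.T / n    # M[1] = I - 1(1'1)^{-1}1'

x = rng.normal(size=n)
demeaned = M1 @ x                     # equals x minus its sample mean
```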
Remark 11. Since b exists, so do its components and the linear relationships among them, which proves at once that all of the inverse matrices in the formulas of Theorem 8 exist. In other words, X1 and X2 are both of full column rank and X1′M[X2]X1 and X2′M[X1]X2 are non-singular when X is of full column rank. This result can also be verified by direct inspection, as suggested by the following exercises.
Exercise 12. Prove that if X is of full column rank, so are X1 and X2 (easy!).
Exercise 13. Prove that if X is of full column rank, then M[X1 ] X2 is of f.c.r.
Solution: I prove the result by contradiction and assume that P[X1]X2b = X2b for some vector b ≠ 0. Therefore, X1a = X2b, where a = (X1′X1)⁻¹X1′X2b, or equivalently Xc = 0, where
c = (a′, −b′)′,
which leads to a contradiction since c ≠ 0 and X is of f.c.r.
Exercise 14. Prove that X2′M[X1]X2 is positive definite if X is of f.c.r. (hint: use Exercise 13 to prove that M[X1]X2 is of full column rank and then use the fact that M[X1] is symmetric and idempotent).
Exercise 15.
The following result on the decomposition of orthogonal projectors into orthogonal components will be useful on a number of occasions later on.
Proof. Equation (3.6.7) can be proved directly using the formula for the inverse of a 2×2 partitioned matrix. An indirect but easier proof is based on Theorem 8 and goes as follows. Notice that, for any y and X = (X1 X2) of f.c.r., P[X]y = X1b1 + X2b2, where by Theorem 8 b1 = (X1′X1)⁻¹X1′(y − X2b2) and b2 = (X2′M[X1]X2)⁻¹X2′M[X1]y. So, plugging the right hand side of b1 into the right hand side of P[X]y = X1b1 + X2b2 yields
P[X]y = X1(X1′X1)⁻¹X1′(y − X2b2) + X2b2
= P[X1]y − P[X1]X2b2 + X2b2
= P[X1]y + M[X1]X2b2
Then, plugging the right hand side of b2 into the foregoing equation yields
P[X]y = P[X1]y + M[X1]X2(X2′M[X1]X2)⁻¹X2′M[X1]y
or P[X]y = (P[X1] + P[M[X1]X2])y, which must hold for any y, so that P[X] = P[X1] + P[M[X1]X2].
Remark 17. Clearly, for one partitioning there are always two representations of P[X], since the subscripts in Equation (3.6.7) can be interchanged and so we may also write P[X] = P[X2] + P[M[X2]X1].
Exercise 18. Verify that the two matrices in the right hand side of (3.6.7), P[X1] and P[M[X1]X2], are orthogonal.
3.6.1. Residuals from the partitioned OLS regressions. Consider the OLS regression of M[X2]y on the columns in M[X2]X1 as regressors and the corresponding residual vector
ePR = M[X2]y − M[X2]X1b1.
The next exercise asks to prove that ePR equals the OLS residual vector e = M[X]y.
Solution. By Theorem 8, b1 = (X1′M[X2]X1)⁻¹X1′M[X2]y; replacing it into the right hand side of the ePR equation yields
ePR = M[X2]y − M[X2]X1(X1′M[X2]X1)⁻¹X1′M[X2]y
= M[X2]y − P[M[X2]X1]y.
Then, since M[X2] = M[X] + P[M[X2]X1] (apply the decomposition of Remark 17 to M[X] = I − P[X]),
ePR = (M[X] + P[M[X2]X1])y − P[M[X2]X1]y
= M[X]y
= e.
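The equality ePR = e can also be checked numerically; a Python/NumPy sketch on simulated data (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X = np.hstack([X1, X2])
y = rng.normal(size=n)

def resid_maker(A):
    # M[A] = I - A(A'A)^{-1}A'
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ A) @ A.T

e = resid_maker(X) @ y                             # full-regression residuals, M[X]y
Z = resid_maker(X2) @ X1                           # M[X2]X1
b1 = np.linalg.solve(Z.T @ Z, Z.T @ (resid_maker(X2) @ y))
e_pr = resid_maker(X2) @ y - Z @ b1                # partitioned-regression residuals
```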
Consider the initial regressor matrix X and the additional regressors Z, so that there is a larger regressor matrix, W, partitioned as W = (X Z).
Theorem 21. Let e be the residual vector from the OLS regression of y on X and u the residual vector from the OLS regression of y on W, so that e = M[X]y and u = M[W]y. Then, u′u ≤ e′e.
or
u = e − P[M[X]Z]y
(3.6.9) u′u = e′e − y′P[M[X]Z]y
and, since P[M[X]Z] is positive semi-definite (being symmetric and idempotent), y′P[M[X]Z]y ≥ 0, so that
u′u ≤ e′e.
Exercise 22. How does the formula for u′u in Equation (3.6.9) simplify if the new regressors Z are orthogonal to X, that is if X′Z is a null matrix?
3.7. Goodness of fit and the analysis of variance
Assume that the unity vector, 1, is part of the regressor matrix X. Total variation in y can be expressed by the following sum of squares, referred to as the total sum of squared deviations
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
or, equivalently,
TSS = y′M[1]y
Notice that TSS is the sample variance of y, TSS/(n − 1), times the appropriate degrees-of-freedom correction n − 1. Incidentally, the degrees-of-freedom correction in the sample variance is just n − 1 and not n, since M[1]y are the residuals from the regression of y on 1 (see Exercise 10) and so there can be no more than n − 1 linearly independent vectors in the space to which M[1]y belongs, 1⊥. In fact, since rank(1) = 1, then given equation (3.4.1), dim 1⊥ = n − 1.
Write

M[1]y = M[1]ŷ + M[1]e.

But since e and X are orthogonal and X contains 1, it follows that 1'e = 0, thereby

(3.7.1) M[1]e = e

and

M[1]y = M[1]ŷ + e.

Then,

TSS = ŷ'M[1]ŷ + 2e'M[1]ŷ + e'e.
3.7. GOODNESS OF FIT AND THE ANALYSIS OF VARIANCE 43
Since e'M[1]ŷ = e'ŷ = 0,

TSS = ŷ'M[1]ŷ + e'e.

Throughout, I refer to ŷ'M[1]ŷ as ESS (explained sum of squares) and e'e as RSS (residual sum of squares).
As it happens for TSS, RSS is the sample variance of the residuals times the appropriate degrees-of-freedom correction, n − k. Again, the degrees-of-freedom correction in the sample variance is just n − k and not n, since in the residual space, X⊥, there can be no more than n − k linearly independent vectors. This follows from the assumption that X is of full column rank, thereby rank(X) = k; then, given equation (3.4.1), dim X⊥ = n − k.
(3.7.2) R² = ESS/TSS = ŷ'M[1]ŷ / (y'M[1]y)

and since ŷ'M[1]ŷ = y'M[1]y − e'e,

R² = 1 − e'e / (y'M[1]y).
Therefore, if the constant term is included in the regression it holds that 0 ≤ R² ≤ 1, and R² measures the portion of total variation in y explained by the OLS regression; in this sense R² is a measure of goodness of fit¹. There are two interesting extreme cases. If all regressors, apart from 1, are null vectors, then ŷ lies in the space spanned by 1 and M[1]ŷ = 0, so that eventually R² = 0. Only the constant term has explanatory power in this case, and the regression is a horizontal line with intercept equal to the sample mean of y. If y lies already in R(X), then ŷ = y (and also e'e = 0) and R² = 1, a perfect (but useless) fit².

¹If the constant term is not included in the regression, then (3.7.1) does not hold and R² may be negative.
²I'm maintaining throughout the obvious assumption that in any case y ∉ R(1). Why?
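As a numerical illustration (simulated data of my choosing, not part of the original text), the two equivalent expressions for R² in (3.7.2) can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])   # constant term included
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
yc = y - y.mean()                      # M[1]y

TSS = yc @ yc
# ESS/TSS: since 1 is in X, the mean of the fitted values equals the mean of y
R2_ratio = ((X @ b - y.mean()) ** 2).sum() / TSS
R2_resid = 1 - (e @ e) / TSS           # 1 - RSS/TSS

assert np.isclose(R2_ratio, R2_resid)
assert 0 <= R2_resid <= 1
```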
3.8. CENTERED AND UNCENTERED GOODNESS-OF-FIT MEASURES 44
A problem with the R² measure is that it never decreases when a regressor is added to X (this is a straightforward implication of Theorem 21), and in principle one can obtain an artificially high R² by inflating the model with regressors (the extreme case of R² = 1 is attained if n = k, since in this case y ends up lying in R(X)). This problem may be obviated by using the corrected R², R̄², defined by including in the formula of R² the appropriate degrees-of-freedom corrections:

R̄² = 1 − [RSS/(n − k)] / [TSS/(n − 1)].

In fact, R̄² does not necessarily increase when one more regressor is added.
Exercise 23. Let W = (X Z), R²_W be the R-squared from the regression of y on W and R² the R-squared from the regression of y on X. Prove that R²_W is no smaller than R². Given W, u and e defined as in Section 3.6.2, the coefficient of determination resulting from the regression of y on W is

R²_W = R² + (1 − R²) r²_yz.

Notice also that R̄² can be written in terms of R² as

R̄² = 1 − [(n − 1)/(n − k)] (1 − R²).
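A small simulation (illustrative only; data and seed are mine) makes the contrast concrete: plain R² never falls when a useless regressor is added, while R̄² adjusts by the degrees of freedom. The identity R̄² = 1 − [(n − 1)/(n − k)](1 − R²) is also checked:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.standard_normal(n)
y = 1.0 + x + rng.standard_normal(n)

def r2_pair(X, y):
    # Returns (R2, adjusted R2) for a regression with a constant in X
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    yc = y - y.mean()
    nn, kk = X.shape
    R2 = 1 - (e @ e) / (yc @ yc)
    R2adj = 1 - ((e @ e) / (nn - kk)) / ((yc @ yc) / (nn - 1))
    return R2, R2adj

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.standard_normal(n)])  # add a useless regressor

R2_s, adj_s = r2_pair(X_small, y)
R2_b, adj_b = r2_pair(X_big, y)

assert R2_b >= R2_s  # plain R2 never decreases
# the adjusted measure may decrease: compare adj_b with adj_s
assert np.isclose(adj_s, 1 - (n - 1) / (n - 2) * (1 - R2_s))  # the identity, with k = 2
```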
Consider the OLS regression of y on the sample regressor matrix X and let b denote the OLS vector. The centered and uncentered R-squared measures (see Hayashi (2000), p. 20) for this regression are defined as

(3.8.1) R² = ŷ'M[1]ŷ / (y'M[1]y) = b'X'M[1]Xb / (y'M[1]y) = y'P[X]M[1]P[X]y / (y'M[1]y)
and

(3.8.2) R²_u = ŷ'ŷ / (y'y) = b'X'Xb / (y'y) = y'P[X]y / (y'y),

so that

R²_u = 1 − e'e / (y'y)

whether or not the unity vector 1 is included in X. In fact, since y = Xb + e and X'e = 0, y'y = ŷ'ŷ + e'e. The same is not true for the centered measure. Indeed, 0 ≤ R² ≤ 1 and

(3.8.3) R² = 1 − e'e / (y'M[1]y)

if and only if a) the constant is included, or b) all of the variables (y, X) have zero sample mean, that is M[1]y = y and M[1]X = X. Clearly, in the latter case, R² = R²_u.
A convenient property of the centered R-squared, when 1 is included in X, is that it coincides with the squared simple correlation between y and ŷ, r²_{y,ŷ}, that is

(3.8.4) R² = (ŷ'M[1]y)² / [(y'M[1]y)(ŷ'M[1]ŷ)].

Exercise 25. Prove equation (3.8.4) (hint: given the definition of R² in (3.7.2), this boils down to proving that ŷ'M[1]y = ŷ'M[1]ŷ).
Solution. Since y = ŷ + e, then

ŷ'M[1]y = ŷ'M[1](ŷ + e)
= ŷ'M[1]ŷ + ŷ'M[1]e
= ŷ'M[1]ŷ + ŷ'e
= ŷ'M[1]ŷ,

where the third equality follows from M[1]e = e, since the constant is included, and the last from the orthogonality of ŷ and the OLS residuals.

This property is not shared by the uncentered R-squared, unless variables have zero sample means.
Note that, when 1 is included in X, P[X]M[1]P[X] = P[M[1]X], so that eventually

(3.8.5) R² = y'P[M[1]X]y / (y'M[1]y),

which proves at once that the R² defined in (3.8.1) can also be obtained as the uncentered R-squared from the OLS regression of M[1]y on M[1]X, namely the OLS regression of y in mean-residuals on X in mean-residuals.
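Equation (3.8.5) can be verified numerically: the centered R² from the regression of y on X (with constant) equals the uncentered R² from the demeaned regression. A sketch with simulated data of my choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.standard_normal(n)

def proj(A):
    # Orthogonal projector onto the column space of A
    return A @ np.linalg.solve(A.T @ A, A.T)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
yc = y - y.mean()                       # M[1]y
R2_centered = 1 - (e @ e) / (yc @ yc)

# demean the nonconstant columns (the constant is annihilated by M[1])
Xc = X[:, 1:] - X[:, 1:].mean(axis=0)
R2_unc_demeaned = (yc @ proj(Xc) @ yc) / (yc @ yc)

assert np.isclose(R2_centered, R2_unc_demeaned)
```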
CHAPTER 4
4.1. Introduction
This chapter is on the finite-sample statistical properties of OLS applied to the LRM. Finite-sample means that we focus on a fixed sample size n, as opposed to n → ∞, a case that will be covered in Chapter 6. We will learn under what assumptions on the LRM and in which sense the estimator is optimal. We will also learn how to test linear restrictions on the model parameters. Finally, we will study an important case of inaccuracy for the OLS, which is the omitted-variables problem.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using
the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2010).
4.2. Unbiasedness

(4.2.1) b = β + (X'X)^{-1} X'ε.

E(b|X) = β + (X'X)^{-1} X' E(ε|X)
= β.
4.3. THE GAUSS-MARKOV THEOREM 48
E(b) = E_X[E(b|X)]
= E_X[β]
= β.
Notice that unbiasedness does not follow if we replace LRM.3 with the weaker LRM.3b.

Let's work out the conditional and unconditional covariance matrices for OLS under LRM.1-LRM.4. I get started with Var(b|X). Since

Var(b|X) = E[(b − β)(b − β)'|X],
I define the following partial order in the space of the l × l symmetric matrices:

Definition 26. Given any two l × l symmetric matrices A and B, A is said to be no smaller than B if and only if A − B is non-negative definite (n.n.d.).
Next I prove that the OLS estimator has the smallest covariance matrix in the class of linear unbiased estimators, an important optimality result for OLS that is universally known as the Gauss-Markov Theorem.

Theorem 27. Assume LRM.1-LRM.4; then the OLS estimator has the smallest covariance matrix, in the sense of Definition 26, in the class of linear unbiased estimators.
Proof. Define the generic member of the class of linear unbiased estimators as

b_o = Cy,

where C is a generic k × n matrix that depends only on the sample information in X and, given unbiasedness, such that CX = I_k. b is of course a member of the class, with its own C equal to C_OLS = (X'X)^{-1}X'. It is not hard to prove that Var(b_o|X) = σ²CC'. Define now D = C − C_OLS; then DX = 0 and so

Var(b_o|X) = σ² [D + (X'X)^{-1}X'][D + (X'X)^{-1}X']'
= σ²(X'X)^{-1} + σ²DD'.

Since σ²DD' is n.n.d., according to Definition 26 the OLS conditional covariance matrix, Var(b|X), is no greater than the variance of any linear unbiased estimator. The same result extends to the unconditional variances. First, notice that, b_o being unbiased, we have for the unconditional covariance matrix of b_o a result that parallels Equation (4.3.1).
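The structure of the proof — any linear unbiased estimator is C_OLS plus a D with DX = 0, and Var(b_o|X) − Var(b|X) = σ²DD' — can be checked numerically. A sketch (the simulated X and the particular construction of D are mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 25, 3
X = rng.standard_normal((n, k))
sigma2 = 2.0

C_ols = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X'
# another linear unbiased estimator: C = C_ols + D with DX = 0
A = rng.standard_normal((k, n))
D = A @ (np.eye(n) - X @ C_ols)         # rows of D annihilate X
C = C_ols + D

assert np.allclose(C @ X, np.eye(k))    # unbiasedness: CX = I_k

V_ols = sigma2 * np.linalg.inv(X.T @ X)
V_alt = sigma2 * C @ C.T

# Var(b_o|X) - Var(b|X) = sigma^2 DD', which is n.n.d. (all eigenvalues >= 0)
assert np.allclose(V_alt - V_ols, sigma2 * D @ D.T)
assert np.all(np.linalg.eigvalsh(V_alt - V_ols) > -1e-8)
```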
The natural question arises of whether the partial order of Definition 26 is of any relevance in real-world applications. It is, since it readily translates into the total order of the real numbers, which is the domain of the variances of random scalars. Indeed, if A is no smaller than B, then r'(A − B)r ≥ 0 for any conformable r. But then, according to the Gauss-Markov Theorem, we can say that any linear combination of the components of b, r'b, has no larger variance than r'b_o. Formally, the theorem implies that r'[Var(b_o) − Var(b)]r ≥ 0. Then, Var(r'b) = r'Var(b)r and Var(r'b_o) = r'Var(b_o)r, and hence Var(r'b_o) ≥ Var(r'b).

The importance of this hinges upon the fact that in empirical applications we are interested in linear combinations of the population coefficients, as in the following example, where it is shown that the estimates of individual coefficients can always be expressed as specific linear combinations of the k components of the estimators.

Example 28. On noticing that b_i = r_i'b and b_oi = r_i'b_o, i = 1, ..., k, where r_i is the k × 1 vector with all zero elements except the i-th entry, which equals unity, and given the Gauss-Markov Theorem, we conclude that Var(b_oi) ≥ Var(b_i), i = 1, ..., k.
4.4. ESTIMATING THE COVARIANCE MATRIX OF OLS 51
In general, the OLS estimator of any linear combination r'β is given by r'b and, as the foregoing discussion demonstrates, under LRM.1-LRM.4 r'b is BLUE (you can easily verify that E(r'b) = r'β).
(4.4.1) E(s²|X) = [1/(n − k)] E(ε'M[X]ε|X).

Since ε'M[X]ε is a scalar, ε'M[X]ε = tr(ε'M[X]ε) and so, by the permutation rule of the trace of a matrix product, ε'M[X]ε = tr(ε'M[X]ε) = tr(M[X]εε'). Replacing the right hand side of the foregoing equation into equation (4.4.1) yields

E(s²|X) = [1/(n − k)] E[tr(M[X]εε')|X].

Then, exploiting the fact that both trace and expectation are linear operators,

E(s²|X) = [1/(n − k)] tr[E(M[X]εε'|X)]
= [1/(n − k)] tr[M[X] E(εε'|X)]
(4.4.2) = [σ²/(n − k)] tr(M[X]),
where the last equality follows from LRM.3 and LRM.4. Now, focus on tr(M[X]):

tr(M[X]) = tr[I_n − X(X'X)^{-1}X']
= tr(I_n) − tr[(X'X)^{-1}X'X]
= n − k.
With s² at hand we can get an unbiased estimator for Var(b). It is obtained by replacing σ² with s² in the OLS conditional variance formula to get

V̂ar(b) = s²(X'X)^{-1}.

Theorem 30. Assume LRM.1-LRM.4; then E[V̂ar(b)] = Var(b).
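Both tr(M[X]) = n − k and the unbiasedness of s² lend themselves to a quick numerical check. A Monte Carlo sketch (normal errors, arbitrary seed and sizes of my choosing):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 20, 4
X = rng.standard_normal((n, k))
MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# tr(M[X]) = n - k
assert np.isclose(np.trace(MX), n - k)

# Monte Carlo check that E(s^2) = sigma^2, holding X fixed across replications
sigma = 1.5
reps = 10000
beta = np.ones(k)
s2 = np.empty(reps)
for r in range(reps):
    eps = sigma * rng.standard_normal(n)
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2[r] = (e @ e) / (n - k)

# the simulation average should be close to sigma^2 = 2.25
assert abs(s2.mean() - sigma**2) < 0.05
```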
4.5. EXACT TESTS OF SIGNIFICANCE WITH NORMALLY DISTRIBUTED ERRORS 53
b|X ~ N(β, σ²(X'X)^{-1}).

Also, since e = M[X]ε, e|X ~ N(0, σ²M[X]). Using a result in Rao (1973), it is also possible to prove at once that b and e are jointly normal with zero covariances, conditional on X. Specifically, since

(b; e) = (β; 0) + ((X'X)^{-1}X'; M[X]) ε

(stacking b over e), we have

(b; e)|X ~ N( (β; 0), blockdiag[σ²(X'X)^{-1}, σ²M[X]] ).
Exercise 32. Verify, by direct computation of Cov(b, e|X), that Cov(b, e|X) = 0_{k×n}.¹

¹In general, the matrix of conditional covariances between two random vectors x and y, conditional on z, is E{[x − E(x|z)][y − E(y|z)]'|z}.
Solution.

Cov(b, e|X) = E[(X'X)^{-1}X' εε' M[X]|X]
= (X'X)^{-1}X' E(εε'|X) M[X]
= σ²(X'X)^{-1}X'M[X]
= 0_{k×n}.
Exercise 34. Is Var[(b; e)|X] non-singular? Why or why not?
and, given the properties of the normal distribution, b_i can be standardized to have

(4.5.1) (b_i − β_i) / √[σ²(X'X)^{-1}_ii] |X ~ N(0, 1),

i = 1, ..., k. Were σ² known, then the above statistics could be used to test hypotheses on β_i, H_o: β_i = β̄_i, by replacing the unknown β_i with β̄_i, where β̄_i is a value of interest fixed by the researcher. For example, to test H_o: β_i = 0 one would use

b_i / √[σ²(X'X)^{-1}_ii] ~ N(0, 1).
The problem is, of course, that σ² is generally unknown and so the foregoing approach cannot be used as it is. With some adjustment we can make it operational, though. Just replace σ² with s² to form

(4.5.2) t_i = (b_i − β̄_i) / √[s²(X'X)^{-1}_ii]

and then prove that t_i has a t distribution with n − k degrees of freedom when β_i = β̄_i. The denominator term in expression (4.5.2) is the standard error estimate for coefficient b_i.
First, notice that since s² = e'e/(n − k) = ε'M[X]ε/(n − k),

(4.5.3) (n − k) s²/σ² = ε'M[X]ε/σ²

and

(n − k) s²/σ² ~ χ²(n − k).

Further,

t_i = { (b_i − β̄_i) / √[σ²(X'X)^{-1}_ii] } / √{ [(n − k)s²/σ²] / (n − k) },

and recall that if z ~ N(0, 1), x ~ χ²(p) and z and x are independent, then z/√(x/p) has a t distribution with p degrees of freedom.
Finally, since the t distribution does not depend on the sample information and, specifically, on X, then t_i and any component of X are statistically independent, so that the above holds also unconditionally, that is t_i ~ t(n − k), i = 1, ..., k.

Often we wish to test hypotheses involving linear combinations of β, r'β, where r is a k × 1 vector of known constants. Under the null H0: r'β = q,

(r'b − q) / √[s² r'(X'X)^{-1}r] ~ t(n − k).
. regress c y
The OLS coefficient estimates, b, are displayed in the first column (labeled Coef.). Then, the second column reports the standard error estimates peculiar to each OLS coefficient,

se_i = √[s²(X'X)^{-1}_ii], i = 1, ..., k.

The third column reports the values of the t statistics for H0: β_i = 0,

t_i = b_i / √[s²(X'X)^{-1}_ii].
The test is two-sided in that the alternative is H1: β_i > 0 or β_i < 0. The fourth column reports the so-called p-value for the two-sided t-test. It is defined as the probability that a t-distributed random variable is more extreme than the outcome of t_i in absolute value: Pr[(t < −|t_i|) ∪ (t > |t_i|)], or more compactly Pr(|t| > |t_i|). Clearly, if the p-value is smaller than the chosen size of the test (the Type I error), then t_i falls for sure into the critical region and we reject the null at the chosen size. In other words, the p-value indicates the lowest size of the critical region (the lowest Type I error) we could have fixed to reject the null, given the test outcome. In this sense, the p-value is more informative than critical values. In the regress example, if we choose a critical region of 5% size, given that Pr(|t| > 2.42) = 0.042 < 0.05, we can reject at 5% the hypothesis that the constant term is equal to zero, knowing that we could also have rejected at, say, 4.5%, but not at 1%. A 1% size is smaller than the test p-value, which is the minimum size allowing rejection, and for this reason we can't reject at 1%. This is a clear case of borderline significance, one which we could not have identified with such precision by simply looking at the 5% critical values. On the other hand, the p-value for the coefficient on y is virtually zero (as low as 0.000). This therefore indicates that, no matter how conservative we are towards the null, we can reject it at any conventional level of significance (conventional sizes, with an increasing degree of conservativeness, are 10%, 5%, 1%) and also at a less conventional 0.1% (since 0.001 > 0.000).
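The mechanics of the t statistic and its two-sided p-value can be sketched as follows (simulated data of my choosing; the p-value is computed by Monte Carlo draws from z/√(χ²_{n−k}/(n−k)) rather than from a t-table):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 40, 2
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.0 * x + rng.standard_normal(n)   # the true slope is zero

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - k)
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(s2 * np.diag(XtX_inv))
t_slope = b[1] / se[1]                       # t statistic for H0: slope = 0

# empirical two-sided p-value: Pr(|t| > |t_slope|), t = z / sqrt(chi2_{n-k}/(n-k))
reps = 200000
z = rng.standard_normal(reps)
chi2 = rng.chisquare(n - k, reps)
t_draws = z / np.sqrt(chi2 / (n - k))
p_value = np.mean(np.abs(t_draws) > abs(t_slope))

assert 0.0 <= p_value <= 1.0
```

Since the null is true in this design, the p-value is a draw from a (roughly) uniform distribution on [0, 1], so no particular value should be expected.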
4.5.2. From tests to confidence intervals. Let us fix the α·100% critical region for our two-sided t test for the null H0: β_i = β̄_i against the alternative H1: β_i ≠ β̄_i, and let ±t_{α/2} be the corresponding critical values: Pr[(t < −t_{α/2}) ∪ (t > t_{α/2})] = α. Then, the probability of not rejecting the null when it is true is (1 − α). Formally,

Pr[|b_i − β̄_i|/se_i < t_{α/2}] = Pr[−t_{α/2} < (b_i − β̄_i)/se_i < t_{α/2}]
= Pr[−se_i t_{α/2} < b_i − β̄_i < se_i t_{α/2}]
= (1 − α).

But (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) is a (1 − α)·100% confidence interval for β_i. This proves that the (1 − α)·100% confidence interval (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) contains all of the null hypotheses β_i = β̄_i that we cannot reject at α·100%. So, while a given t test is informative only for the specific null it is testing, the confidence interval conveys to the researcher much more information. The last column of the regress output reports the 95% confidence intervals for each OLS coefficient.
Exercise 36. Your regression output for a given coefficient β_i yields b_i = 9.320 and se_i = 1.760. 1) Compute the t-statistic for the null H0: β_i = 0. 2) In your regression n − k = 334; this implies that t_{0.025} = 1.967, where t_{0.025}: Pr(t > t_{0.025}) = 0.025. Will you reject or not H0: β_i = 0 against H1: β_i ≠ 0 at a significance level of 5%? Why? 3) Given your answer to Question 2, will you expect that 0 belongs to the 95% confidence interval for β_i? 4) Compute the 95% confidence interval for β_i. On the basis of the information from the confidence interval alone, do you reject H0: β_i = 6 against H1: β_i ≠ 6 at 5%? Why? 5) Using only your answers to Question 4, can you assert that the p-value of that test is greater than 0.05? Also, do you expect the absolute value of the t statistic for H0: β_i = 6 to be greater or smaller than 1.967? Why? Verify your answer by directly computing the value of the t statistic for H0: β_i = 6. 6) Consider now the test of H0: β_i ≤ 0 against H1: β_i > 0 with a 5% significance level. Is the critical level for this test equal to, smaller or greater than 1.967?
Exercise 37. Your regression output for a given coefficient β_i yields b_i = 6.668 with se_i = 3.577. The outcome of the t-test for H0: β_i = 0 against H1: β_i ≠ 0 shows p-value = 0.07. Can you reject the null at 10%? Can you at 5%?
4.5.3. Testing joint linear restrictions. We want to test jointly J linear restrictions: H0: Rβ − q = 0, where R and q are, respectively, a J × k matrix and a J × 1 vector of fixed known constants, and such that no row of R can be obtained as a linear combination of the others, that is, R is of full row rank J.

Under the null, Rb − q = R(b − β), and it has

W = (Rb − q)'[R(X'X)^{-1}R']^{-1}(Rb − q) / σ², with W|X ~ χ²(J).
Again, σ² is not known and so W is unfeasible as a test for H0. We can go about as in the previous section and replace σ² with s². In addition, then, divide the result by J to get the statistic

F = (Rb − q)'[R(X'X)^{-1}R']^{-1}(Rb − q) / (Js²).

Now consider another distributional result:

Given two independent random scalars x1 ~ χ²(p1) and x2 ~ χ²(p2), then (x1/p1)/(x2/p2) ~ F(p1, p2).
It is not hard to see that the above result can be applied to F, since it can be reformulated as the ratio of two conditionally independent chi-squared random variables, each corrected by its own degrees of freedom. In fact, at the numerator we have

(Rb − q)'[R(X'X)^{-1}R']^{-1}(Rb − q) / (Jσ²)

and at the denominator s²/σ². Conditional on X, the former is a function of b alone, while the latter is a function of e alone. Therefore, in the light of Theorem 31, the two are conditionally independent and so we can invoke the foregoing distributional result to establish F|X ~ F(J, n − k).

As with the t statistic, since the F distribution does not depend on the sample information, we have that the above holds unconditionally: F ~ F(J, n − k).
When H0 is a set of J exclusion restrictions, then q = 0 and each row of R has all zero elements except unity in the entry corresponding to the parameter that is set to zero. For example, with three parameters β' = (β1 β2 β3) and two exclusion restrictions β1 = 0 and β3 = 0, then J = 2, q' = (0 0) and

R = (1 0 0)
    (0 0 1).
The F-test can always be rewritten as a function of the residual sum of squares under the unrestricted model, e'e, and the residual sum of squares under the model with the restrictions imposed, say e*'e*:

F = [(e*'e* − e'e)/J] / [e'e/(n − k)].

This is proved for the case of exclusion restrictions by using Lemma 16.
Partition the sample regressor matrix as X = (X1 X2) and consider the F test for the set of exclusion restrictions H0: β2 = 0:

F = b2'[(X2'M[X1]X2)^{-1}]^{-1} b2 / (k2 s²)
= b2'(X2'M[X1]X2) b2 / (k2 s²).

The numerator of the right hand side of the foregoing equation can be written more compactly as y'P[M[X1]X2]y. Hence, by Lemma 16,

F = y'(P[X] − P[X1])y / (k2 s²)
= y'(M[X1] − M[X])y / (k2 s²)
= [(e*'e* − e'e)/k2] / s².
It is not hard to prove that, if the constant term is kept in both models, then

F = [(R² − R*²)/J] / [(1 − R²)/(n − k)],

where R² is the R-squared from the unrestricted model and R*² is the R-squared from the restricted model.
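The equality between the Wald form of F and its RSS-comparison form can be verified numerically (simulated data with H0: β2 = 0 true; the helper rss and the design are mine):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k1, k2 = 50, 2, 2
k = k1 + k2
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, k2))
X = np.hstack([X1, X2])
y = X1 @ np.array([1.0, 0.5]) + rng.standard_normal(n)   # H0: beta2 = 0 holds

def rss(X, y):
    # OLS residual sum of squares and coefficient vector
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e, b

ee_u, b = rss(X, y)       # unrestricted model
ee_r, _ = rss(X1, y)      # restricted model (beta2 = 0 imposed)
s2 = ee_u / (n - k)

# Wald form for H0: R beta = 0, with R selecting the last k2 coefficients
R = np.hstack([np.zeros((k2, k1)), np.eye(k2)])
XtX_inv = np.linalg.inv(X.T @ X)
Rb = R @ b
F_wald = (Rb @ np.linalg.solve(R @ XtX_inv @ R.T, Rb)) / (k2 * s2)

# RSS-comparison form
F_rss = ((ee_r - ee_u) / k2) / s2

assert np.isclose(F_wald, F_rss)
```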
The general form of the law of iterated expectations (LIE) can be stated as in Wooldridge
(2010), pp. 19-20.
LIE(scalar|vector): Given the random variable y and the random vectors w and x,
where x = f (w), then E (y|x) = E [E (y|w) |x].
Since the above result holds for any function f (), x can just be any subvector of w, as the
following example shows.
LIE(vector|vector): Given the random vector y and the random vectors w and x, where x = f(w), then E(y|x) = E[E(y|w)|x], where

E(y|x) = (E(y1|x), ..., E(yn|x))' and E(y|w) = (E(y1|w), ..., E(yn|w))'.
Remark 39. Notice that in the formulation of conditional expectations the way the con-
ditioning set is represented is just a matter of notational convenience. What matters are the
random scalars that enter the conditioning set and not the way they are organized therein.
For example, E(y|w1, w2, w3, w4) can equivalently be expressed as E(y|w') or E(y|w), where w = (w1 w2 w3 w4)', or E(y|W), where

W = (w1 w3)
    (w2 w4).
Given Remark 39 the general LIE can be formulated with conditional expectations having
the conditioning set organized in the form of random matrices rather than random vectors, as
follows.
LIE(vector|matrix): Given the random vector y and the random matrices W and X,
where X = f (W ), then E (y|X) = E [E (y|W ) |X].
Paralleling the consideration made above, since f(·) is a generic function, from LIE(vector|matrix) follows a special LIE for the case in which X is a submatrix of W. Therefore, given W = (W1 W2),

E(y|W1) = E[E(y|W)|W1]

and

E(y|W2) = E[E(y|W)|W2].

4.7. THE OMITTED VARIABLE BIAS
If explanatory variables that are relevant in the population model are, for some reason, not included in the statistical model - they may be intrinsically latent, such as individual skills, or the specific data-set in use does not report them, or also, although observed and available, the researcher failed to account for them in the model specification - then our OLS estimator may undergo what is known in the econometric literature as an omitted variable bias. Let's see when and why.

Assume that the population model is

y = x'β + ε,

with x and β both k × 1 vectors and P.1-P.4 satisfied, and consider the RS mechanism

RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi xi1 xi2 ... xik), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors.
So far we are in the classical regression framework, but now let x' = (x1' x2'), with x1 being a k1 × 1 vector, x2 a k2 × 1 vector and k = k1 + k2, and maintain that x2 is latent or, in any case, not included in the statistical model, and let's explore the implications for the statistical model. P.1 implies that

(4.7.1) y = X1β1 + (X2β2 + ε).
The OLS estimator

b1 = (X1'X1)^{-1}X1'y

is likely to be biased. Indeed, the bias can be easily derived as follows. Replacing the right hand side of equation (4.7.1) into the OLS formula yields

b1 = β1 + (X1'X1)^{-1}X1'(X2β2 + ε)
= β1 + (X1'X1)^{-1}X1'X2β2 + (X1'X1)^{-1}X1'ε.

Hence

E(b1|X) = β1 + (X1'X1)^{-1}X1'X2β2

and, by the law of iterated expectations, we have the unconditional bias

(4.7.2) E(b1) − β1 = E[(X1'X1)^{-1}X1'X2β2].
There are two specific instances, however, in which the bias is zero.

The first instance is that analyzed in Greene (2008), when X1'X2 = 0_{k1×k2}. In this case (X1'X1)^{-1}X1'X2β2 = 0 and so the bias in equation (4.7.2) becomes zero.

The second instance occurs if in the population E(x2'β2|x1) = 0, as I now show. Since in the population E(ε|x) = 0, then by the general law of iterated expectations also E(ε|x1) = 0. Hence, E(x2'β2 + ε|x1) = 0, which along with RS yields E(X2β2 + ε|X1) = 0. Therefore, the composite vector X2β2 + ε in model (4.7.1) behaves like a conventional error term that satisfies LRM.3. The upshot is that b1 is unbiased.
The two situations are not related. Clearly, E(X2β2|X1) = 0 does not imply X1'X2 = 0_{k1×k2}. But also the converse is not true, and X1'X2 = 0_{k1×k2} may hold even if E(X2β2|X1) ≠ 0, as shown by the following example.
Be that as it may, the foregoing two instances of unbiasedness constitute a narrow case,
and in general omitted variables will bring about bias and inconsistency in the coefficient
estimates. Solutions are typically given by proxy variables, panel data estimators and instru-
mental variables estimators. The first method is briefly described below, the classical panel
data estimators are pursued in Chapter 7, while IV methods are described in Chapter 9.
To conclude, observe that if relevant variables are omitted, LRM.4 does not generally hold, unless Var(x2'β2 + ε|x1) = σ̃² < +∞, a constant.
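A short Monte Carlo (illustrative design and numbers of my choosing) shows the omitted variable bias at work: with Cov(x1, x2) ≠ 0, the short-regression slope is centered on β1 + β2·Cov(x1, x2)/Var(x1) rather than on β1:

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 200, 2000
beta1, beta2 = 1.0, 2.0
b1_short = np.empty(reps)

for r in range(reps):
    x2 = rng.standard_normal(n)
    x1 = 0.8 * x2 + rng.standard_normal(n)     # x1 correlated with the omitted x2
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)
    X1 = np.column_stack([np.ones(n), x1])     # short regression omits x2
    b = np.linalg.solve(X1.T @ X1, X1.T @ y)
    b1_short[r] = b[1]

# expected slope: beta1 + beta2 * Cov(x1, x2)/Var(x1) = 1 + 2 * 0.8/1.64 ≈ 1.976
bias = b1_short.mean() - beta1
assert abs(bias - beta2 * 0.8 / 1.64) < 0.05
```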
Lemma 41. Given any two non-singular square matrices of the same dimension, A and B, if A − B is n.n.d. then B^{-1} − A^{-1} is n.n.d.
The foregoing lemma signifies that in the space of non-singular square matrices of a given
dimension if A is no smaller than B, then A1 is no greater than B 1 . It is useful in
situations in which the difference of inverse matrices is more easily worked out than that of
the original matrices.
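Lemma 41 can be illustrated numerically for positive definite matrices (the construction A = B + vv' is just an arbitrary way of getting A − B n.n.d.):

```python
import numpy as np

rng = np.random.default_rng(10)
l = 4
Q = rng.standard_normal((l, l))
B = Q @ Q.T + np.eye(l)          # positive definite
v = rng.standard_normal((l, 1))
A = B + v @ v.T                  # A - B = vv' is n.n.d.

# B^{-1} - A^{-1} should be n.n.d.: all eigenvalues >= 0 (up to rounding)
diff = np.linalg.inv(B) - np.linalg.inv(A)
assert np.all(np.linalg.eigvalsh(diff) > -1e-10)
```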
The following exercise asks you to think through the consequences of overfitting, namely
applying OLS to a statistical model with variables that are redundant in the population model.
Exercise 42. Assume the population model

y = x'β + ε,

with x and β both k × 1 vectors and P.1-P.4 satisfied. Assume also that the l × 1 vector z of observable variables is available, such that rank[E(ww')] = k + l, where w' = (x' z'). Also, assume E(ε|x, z) = 0 and Var(ε|x, z) = σ², i.e. z is redundant in the population equation. Finally, assume there is a sample of size n from the population, such that the elements of the sequence {(yi xi' zi'), i = 1, ..., n} are i.i.d. 1 × (1 + k + l) random vectors. Applying the usual notation for the sample variables - with y (n × 1) stacking the yi, X (n × k) stacking the rows xi', Z (n × l) stacking the rows zi' and ε (n × 1) stacking the εi - the statistical model is

y = Xβ + ε
1) Verify that the OLS estimator

(4.7.3) b = (X'X)^{-1}X'y

is BLUE. 2) Prove that the overfitting strategy of regressing y on X and Z yields an unbiased estimator for β, and call it b_ofit. 3) Derive the covariance matrix of b_ofit. 4) Use Lemma 41 and verify that, indeed, the conditional covariance matrix of b_ofit is no smaller than that of b in (4.7.3). 5) A byproduct of the overfitting strategy is the l × 1 vector of OLS coefficients for the variables in Z. Let's call it c. Express c using the first equation of Theorem 8 as

c = (Z'Z)^{-1}Z'(y − X b_ofit)

and prove that the overfitting residual vector e_ofit ≡ y − X b_ofit − Zc equals M[M[Z]X] M[Z]y.
Answer: 1) Obvious, since in the population and the sampling mechanism we have all we need for the statistical properties LRM.1-LRM.4 to be true. 2) This is proved at once by noting that, from RS and E(ε|x, z) = 0, E(ε|X, Z) = 0. 3) Prove that Var(ε|X, Z) = σ²I and then prove that

Var[(X'M[Z]X)^{-1}X'M[Z]y | X, Z] = σ²(X'M[Z]X)^{-1}.

4) Write X'M[Z]X as X'M[Z]X = X'X − X'P[Z]X and then verify you have all that is needed to invoke the lemma. 5) Easy, it's just algebra: replace b_ofit and c into e_ofit ≡ y − X b_ofit − Zc and rearrange. 6) First, use the formula of the overfitting residual vector derived in the previous question, M[M[Z]X] M[Z]y, to set up the estimator
4.7.1. The proxy variables solution. Assume for simplicity that there is only one omitted variable x2 from the population equation

(4.7.4) y = x1'β1 + x2β2 + ε.

(1) The z variables are redundant in the population equation, that is E(y|x, z) = x'β.
(2) Once conditioning on z, the omitted variable x2 and the included explanatory variables, x1, are independent in conditional mean: E(x2|x1, z) = E(x2|z). Also, E(x2|z) = z'δ.
(3) rank E[(x1' z')'(x1' z')] = k1 + l. This is analogous to property P.2 in Chapter 1 and permits identification of the coefficients in the proxy variable regression, as we will see below.

Writing the projection error as r ≡ x2 − z'δ, so that

(4.7.5) x2 = z'δ + r,

and replacing into (4.7.4) yields

(4.7.6) y = x1'β1 + z'(δβ2) + (rβ2 + ε).

It follows that E(rβ2 + ε|x1, z) = 0 and so, along with P.1 and P.2 (given Assumption 3), also P.3 is satisfied for equation (4.7.6). With the following RS mechanism

RS(x1 z): There is a sample of size n from the population, such that the elements of the sequence {(yi xi1 ... xik1 zi1 ... zil), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors,

the resulting statistical model will satisfy LRM.1-LRM.3 and so yield unbiased OLS estimates.
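A Monte Carlo sketch of the proxy-variable logic (all design choices are mine): the short regression of y on x1 is biased, while adding the proxy z restores unbiasedness for β1, because conditional on z the omitted x2 is mean-independent of x1:

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 300, 1000
beta1, beta2, delta = 1.0, 2.0, 0.7
b_short = np.empty(reps)
b_proxy = np.empty(reps)

for r in range(reps):
    z = rng.standard_normal(n)
    x2 = delta * z + rng.standard_normal(n)    # x2 = z*delta + r, with E(x2|z) linear in z
    x1 = 0.5 * z + rng.standard_normal(n)      # x1 related to x2 only through z
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

    X_s = np.column_stack([np.ones(n), x1])            # omits x2, no proxy
    b_short[r] = np.linalg.solve(X_s.T @ X_s, X_s.T @ y)[1]

    X_p = np.column_stack([np.ones(n), x1, z])         # z used as a proxy for x2
    b_proxy[r] = np.linalg.solve(X_p.T @ X_p, X_p.T @ y)[1]

assert abs(b_proxy.mean() - beta1) < 0.03   # proxy regression: centered on beta1
assert abs(b_short.mean() - beta1) > 0.2    # short regression: clearly biased
```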
Suppose that attention is centered on a given explanatory variable whose observations are collected into the (n × 1) column vector x_i, and that there are k − 1 control variables collected into the n × (k − 1) matrix X_i. Without loss of generality, partition the (n × k) regressor matrix as X = (X_i x_i) and, correspondingly, the (k × 1) OLS vector as

b = (b_{-i}', b_i)',

so that

Var(b_i|X) = σ²(X'X)^{-1}_ii,

where (X'X)^{-1}_ii indicates the last entry on the main diagonal of (X'X)^{-1}.
Thus, given

y = X_i β_{-i} + x_i β_i + ε,

it has

b_i = (x_i'M[X_i]x_i)^{-1} x_i'M[X_i] (X_i β_{-i} + x_i β_i + ε)
4.8. THE VARIANCE OF AN OLS INDIVIDUAL COEFFICIENT 71
and consequently

b_i = β_i + (x_i'M[X_i]x_i)^{-1} x_i'M[X_i]ε.

Finally,

Var(b_i|X) = E[(x_i'M[X_i]x_i)^{-1} x_i'M[X_i] εε' M[X_i]x_i (x_i'M[X_i]x_i)^{-1} | X]
= (x_i'M[X_i]x_i)^{-1} x_i'M[X_i] E(εε'|X) M[X_i]x_i (x_i'M[X_i]x_i)^{-1}
= σ²(x_i'M[X_i]x_i)^{-1}
(4.8.1) = σ² / (x_i'M[X_i]x_i),

so that

(4.8.2) (X'X)^{-1}_ii = 1 / (x_i'M[X_i]x_i).
Remark 43. Equation (4.8.2) is a general algebraic result providing the formula for the generic i-th main diagonal element of the inverse of any non-singular cross-product matrix X'X. I have proved it in quite a peculiar way, using a well-known and easy-to-remember econometric result! Above all, I could get away without referring to the hard-to-remember result on the inverse of the (2 × 2) partitioned matrix, which is instead the route followed by Greene (Theorem 3.4 in Greene (2008), p. 30).
4.8.1. The three determinants of Var(b_i|X) when 1 is a regressor. Now I get back to Var(b_i|X) in equation (4.8.1),

Var(b_i|X) = σ²(x_i'M[X_i]x_i)^{-1},

and assume that X_i contains the n × 1 unity vector 1, or X_i = (X̃_i 1). Notice, now, that M[X_i]x_i is the residual vector from the OLS regression of x_i on X_i, and so x_i'M[X_i]x_i is the residual sum of squares for this regression. Since the unity vector is a column of X_i, the centered R-squared of this regression is

R_i² = 1 − x_i'M[X_i]x_i / (x_i'M[1]x_i),

and eventually²

Var(b_i|X) = σ² / [(1 − R_i²) x_i'M[1]x_i].
Also, it has

x_i'M[1]x_i = Σ_{j=1}^{n} (x_{ji} − x̄_i)²,

that is, x_i'M[1]x_i is the total variation in x_i around its sample mean, x̄_i. Therefore,

(4.8.3) Var(b_i|X) = σ² / [(1 − R_i²) Σ_{j=1}^{n} (x_{ji} − x̄_i)²].
²An alternative proof is the following. Given Lemma 16, M[X_i] = I − P[1] − P[M[1]X̃_i] and so

Var(b_i|X) = σ² [x_i'(M[1] − P[M[1]X̃_i])x_i]^{-1}
= σ² [x_i'M[1]x_i − x_i'P[M[1]X̃_i]x_i]^{-1}
= σ² {x_i'M[1]x_i [1 − x_i'P[M[1]X̃_i]x_i / (x_i'M[1]x_i)]}^{-1}
= σ² [x_i'M[1]x_i (1 − R_i²)]^{-1},

where

R_i² ≡ x_i'P[M[1]X̃_i]x_i / (x_i'M[1]x_i).

Given (3.8.5), R_i² is the centered R-squared for the regression of x_i on X_i (or, equivalently, the uncentered R-squared from the regression of M[1]x_i on M[1]X̃_i).
Equation (4.8.3) shows that Var(b_i|X) increases when:
- other things constant, R_i² increases, in words the correlation between x_i and the other regressors increases (this is the multicollinearity effect on the variance of the OLS individual coefficient);
- other things constant, the total variation in x_i, Σ_{j=1}^{n} (x_{ji} − x̄_i)², decreases;
- other things constant, the regression variance σ² increases.

Multicollinearity is perfect when x_i belongs to R(X̃_i). In this case R_i² = 1 (see Section 3.7) and the variance of b_i diverges to infinity. Coefficient β_i cannot be estimated given the available data (X is not of f.c.r. in this case).
Remark 44. Multicollinearity, when it does not degenerate into perfect multicollinearity,
i.e. det (X 0 X) = 0, does not affect the finite sample properties of OLS. Nonetheless, it may
severely reduce the precision of our estimates, in terms of larger standard errors and confidence
intervals.
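The decomposition (4.8.3) - and hence the multicollinearity effect - can be checked numerically: the direct formula σ²(X'X)^{-1}_ii and the decomposition in terms of R_i² and the total variation of x_i coincide. A sketch (simulated data; the focal variable is placed last):

```python
import numpy as np

rng = np.random.default_rng(12)
n = 80
sigma2 = 1.0

z = rng.standard_normal(n)
x = 0.6 * z + rng.standard_normal(n)            # x correlated with the control z
X = np.column_stack([np.ones(n), z, x])         # focal variable in the last column
XtX_inv = np.linalg.inv(X.T @ X)
var_direct = sigma2 * XtX_inv[-1, -1]           # sigma^2 (X'X)^{-1}_ii

# decomposition (4.8.3): sigma^2 / [(1 - R_i^2) * total variation of x]
Xi = X[:, :2]                                   # the other regressors (constant and z)
Mi = np.eye(n) - Xi @ np.linalg.solve(Xi.T @ Xi, Xi.T)
xc = x - x.mean()
Ri2 = 1 - (x @ Mi @ x) / (xc @ xc)              # R^2 of x on the other regressors
var_decomp = sigma2 / ((1 - Ri2) * (xc @ xc))

assert np.isclose(var_direct, var_decomp)
```

Raising the coefficient on z in the construction of x drives R_i² toward one and inflates both expressions in lockstep, which is exactly the multicollinearity effect described above.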
Exercise 45. Partition X as X = (X̃ 1) and, accordingly, the OLS (k × 1) vector as b = (b̃' b0)', where b̃ is of dimension (k − 1) × 1 and b0 is the OLS estimator of the constant term. Prove that

b0 = ȳ − x̄'b̃,

where ȳ is the sample mean of y and x̄ is the (k − 1) × 1 vector of sample means for the X̃ regressors (hint: just use the first equation of Theorem 8 to show that b1 = (X1'X1)^{-1}X1'(y − X2b2)).

Exercise 46. Use 1) Var(ȳ|X) = E(ε̄²|X) = σ²/n,
2) Var(x̄'b̃|X) = x̄'Var(b̃|X)x̄, and

3) cov(ȳ, x̄'b̃|X) = E{[ȳ − E(ȳ|X)] x̄'[b̃ − E(b̃|X)] | X}
= x̄' (X̃'M[1]X̃)^{-1} X̃'M[1] E(εε'|X) 1 (1/n)
= (σ²/n) x̄'(X̃'M[1]X̃)^{-1} X̃'M[1] 1 = 0,

since M[1]1 = 0, to prove that

Var(b0|X) = x̄'Var(b̃|X)x̄ + σ²/n.
4.9. A general result

Zyskind (1967) provides a general necessary and sufficient condition for the OLS estimator to be BLUE.

Theorem 47. Given the regressor matrix, $X$, and the conditional covariance matrix $\Omega$, $Var(\varepsilon|X) = \Omega$, the OLS estimator, $b = (X'X)^{-1}X'y$, is BLUE if and only if $P_{[X]}\Omega = \Omega P_{[X]}$.
As a final remark, recall that the Zyskind condition ensures only that the OLS coefficients are BLUE; it says nothing about the properties of the OLS standard error estimates, and indeed we have seen in the previous sections that these may be biased even if $b$ is BLUE. The following exercise on partitioning provides another instance of such an occurrence.
Exercise 48. Consider the partitioned regression model
$$(4.9.1)\qquad y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$
maintaining LRM.1-LRM.4. 1) Verify that premultiplying both sides of the foregoing equation by $M_{[X_2]}$ boils down to the reduced regression model
$$(4.9.2)\qquad \tilde y = \tilde X_1\beta_1 + \tilde\varepsilon,$$
where $\tilde y = M_{[X_2]}y$, $\tilde X_1 = M_{[X_2]}X_1$ and $\tilde\varepsilon = M_{[X_2]}\varepsilon$.
2) How can you interpret the variables in model (4.9.2)? 3) As far as $\beta_1$ is concerned, does OLS applied to model (4.9.2) yield the same estimator as OLS applied to model (4.9.1)? Why or why not? 4) Does the reduced model (4.9.2) satisfy LRM.1-LRM.4? Which ones, if any, are not satisfied? 5) The degrees of freedom of the reduced regression are $n - k_1$. Do you think that the resulting OLS estimate for $\sigma^2$ would be unbiased? 6) Verify that the reduced model (4.9.2) satisfies the Zyskind condition.
5) It is biased, since we know that the unbiased OLS estimator uses $n - k$ degrees of freedom to correct the OLS residual sum of squares (which is nonetheless the same for both models (4.9.1) and (4.9.2), as seen in Subsection 3.6.1). 6) You have just to verify that
$$P_{[M_{[X_2]}X_1]}M_{[X_2]} = M_{[X_2]}P_{[M_{[X_2]}X_1]},$$
which is readily done by noting that $M_{[X_2]}$, symmetric and idempotent, is both the first and the last factor in $P_{[M_{[X_2]}X_1]}$.
The within regression examined in Chapter 7 (equation (7.2.7)) is a special case of model
(4.9.2) in exercise 48.
4.10. Estimation of treatment effects

4.10.1. Introduction. Estimation of average treatment effects (ATE) and of ATE on the treated (ATET) is at the heart of the econometric evaluation of socio-economic policies.
As an example, we consider the experimental data jtrain2.dta from Wooldridge (2010) to
estimate the causal effect of job training on earnings
. use jtrain2, clear
. tabulate train
(output omitted: tabulation of train, "=1 if assigned to job training", with Freq., Percent and Cum. columns)
. describe train age educ black hisp married re78
(output omitted)
As a first attempt at evaluation of the training program we may apply a difference-of-means
estimator evaluating the difference in earning means between treated and untreated
(regression output omitted: re78 Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval])
We observe a significantly positive increase of $1,800 in average earnings for workers who got job training. The question arises whether the difference-of-means estimator provides an accurate estimate of the causal effect of training on labour earnings. Rubin's potential-outcome framework is the best way to answer this kind of question, as explained next.
4.10.2. Potential outcomes. Let us consider a binary treatment variable $w \in \{0, 1\}$, training in our example, and assume that $Var(w) \neq 0$ in the population. Given the structural equation of interest $y = g(w, x, \varepsilon)$, define $y^1 = g(1, x, \varepsilon)$ and $y^0 = g(0, x, \varepsilon)$, the pair of potential outcomes for an individual in the population with observed and latent characteristics $x$ and $\varepsilon$.
For each individual we observe either $y^0$ or $y^1$. Formally, the observed dependent variable $y = g(w, x, \varepsilon)$ can be expressed in terms of potential outcomes as
$$y = \begin{cases} y^0 & \text{if } w = 0 \\ y^1 & \text{if } w = 1. \end{cases}$$
So, in the training example, $y^1$ is the post-training earning, had the individual been assigned to training, and $y^0$ is what the same individual would earn without training.
Our objects of interest are the Average Treatment Effect,
$$ATE = E\left(y^1 - y^0\right),$$
and the Average Treatment Effect on the Treated,
$$ATET = E\left(y^1 - y^0\,|w = 1\right).$$
Under certain conditions, the two effects can be estimated as the constant parameters of a simple linear regression model. In fact,
$$(4.10.1)\qquad y = y^0 + w\left(y^1 - y^0\right).$$
Assume mean independence of the potential outcomes from the treatment, IM: $E(y^j|w) = E(y^j)$, $j = 0, 1$. Then $ATE$ is identified, that is, it can be recovered from moments of the distribution of the observables $(y, w)$:
$$ATE = E(y^1) - E(y^0) = E(y^1|w = 1) - E(y^0|w = 0) = E(y|w = 1) - E(y|w = 0).$$
Hence, using data from random sampling, ATE can be estimated consistently by the difference-of-means estimator: the average of $y$ over the treated individuals minus the average of $y$ over the untreated individuals.
Under IM, $ATE = ATET$, since
$$ATET = E(y^1|w = 1) - E(y^0|w = 1) = E(y^1|w = 1) - E(y^0|w = 0) = E(y|w = 1) - E(y|w = 0).$$
From the above it is also clear that a weaker version of IM, say IM$'$: $E(y^0|w) = E(y^0)$, suffices for the identification of $ATET$.
Given
$$(4.10.2)\qquad y = y^0 + w\left(y^1 - y^0\right),$$
assume conditional mean independence, ICM: $E(y^j|w, x) = E(y^j|x)$, $j = 0, 1$. Then the conditional ATE is identified:
$$ATE(x) = E(y^1|x) - E(y^0|x) = E(y^1|w = 1, x) - E(y^0|w = 0, x) = E(y|w = 1, x) - E(y|w = 0, x).$$
Also, $ATE(x) = ATET(x)$. It is also clear that, for identification of $ATET(x)$ only, the weaker ICM$'$ assumption, $E(y^0|w, x) = E(y^0|x)$, suffices:
$$ATET(x) = E(y^1|w = 1, x) - E(y^0|w = 1, x) = E(y^1|w = 1, x) - E(y^0|w = 0, x) = E(y|w = 1, x) - E(y|w = 0, x).$$
Averaging over the distribution of $x$ recovers the unconditional effect:
$$ATE = E_x\big[E(y^1|x) - E(y^0|x)\big] = E_x\big[E(y^1|w = 1, x) - E(y^0|w = 0, x)\big].$$
Overlap: Let $\mathcal X$ denote the support of $x$; then $0 < Pr(w = 1|x) < 1$ over $\mathcal X$.
The overlap assumption assures that in the population, for any given pattern of covariates, both the treatment and the control groups are sizable.
For identification of $ATET$, we restrict to the treated subsample $w = 1$ and so we average over $x|w = 1$:
$$ATET = E_{x|w=1}\big[E(y^1|w = 1, x) - E(y^0|w = 1, x)\,\big|w = 1\big] = E(y^1|w = 1) - E_{x|w=1}\big[E(y^0|w = 0, x)\,\big|w = 1\big].$$
From the above it is clear that in the case of $ATET$ we need to estimate only $m_0(x) \equiv E(y|w = 0, x)$, and so a weaker overlap assumption may be applied, one that requires that just the control group be sizable for all covariate patterns: $Pr(w = 1|x) < 1$ over $\mathcal X$.
In general, for the ICM assumption to hold it is necessary that there is no bad control in the $x$ vector. A covariate $x$ is said to be a bad control if, like $y$, it is affected by the treatment variable $w$, so that $x = x^0 + w(x^1 - x^0)$ (with $x^0$ and $x^1$ differently distributed, of course). Assume random sampling:
RS: The elements of the sequence $\{(y_i\; w_i\; x_{i1}\; x_{i2}\; ...\; x_{ik}),\; i = 1, ..., n\}$ are i.i.d. random vectors.
4.10.3. Pooled regression. Under the assumption of an additive and constant causal effect of $w$, potential outcomes differ only by a constant element and estimation can be implemented through a simple multiple regression framework using the treatment variable as a covariate.
Let $E(y^0|w, x) = x'\beta$, so write $y^0 = x'\beta + \varepsilon$, where $\varepsilon = y^0 - E(y^0|w, x)$; then assume a constant treatment effect $y^1 - y^0 = \tau$. On plugging the right-hand sides of the foregoing two equations into $y = y^0 + w(y^1 - y^0)$ we get
$$(4.10.3)\qquad y = x'\beta + \tau w + \varepsilon,$$
with $E(\varepsilon|w, x) = 0$ satisfied. Then, maintaining $\det E\left[\binom{w}{x}\,(w \;\; x')\right] \neq 0$ assures that Equation (4.10.3) can be estimated through a LRM. The OLS estimate of $\tau$ is the ATE estimate (and also the ATET estimate). Wooldridge refers to this estimator as pooled regression adjustment.
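As a cross-check of the regression-adjustment logic, here is a minimal simulation sketch in Python/numpy (the data-generating process and all names are mine, not the jtrain2 data; the effect size is chosen only to echo the $1,800 estimate): with randomized treatment and a constant effect, the OLS coefficient on $w$ in the regression of $y$ on $(1, x, w)$ recovers $\tau$.

```python
import numpy as np

# Sketch of pooled regression adjustment (Eq. 4.10.3) on simulated data.
rng = np.random.default_rng(1)
n, tau = 5000, 1800.0
x = rng.normal(size=n)
w = (rng.uniform(size=n) < 0.5).astype(float)    # randomized treatment
y = 2.0 * x + tau * w + rng.normal(scale=500.0, size=n)

Z = np.column_stack([np.ones(n), x, w])          # (1, x, w)
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
tau_hat = coef[2]                                # OLS coefficient on w
print(tau_hat)
```

With $n = 5000$ the sampling error of `tau_hat` is of the order of a few tens of dollars, so the estimate lands close to 1800.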
. regress re78 i.train age educ black hisp married re74 re75, vce(robust)
(regression output omitted: re78 Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval])
. sum durat ldurat afchnge highearn ky mi male married indust injtype, sep(0)
(output omitted)
The research question is the following: Does the more generous cap increase the log of
out-of-work duration (ldurat) of injured high earners (highearn)?
We can start by restricting the analysis to high earners in Kentucky and work out an ATE based on averages of high earners before and after the policy change. As seen at the beginning of our discussion on ATE, this estimator would be consistent under the IM assumption $E(y^0_{it}|i = h, w_t) = E(y^0_{it}|i = h)$ and $E(y^1_{it}|i = h, w_t) = E(y^1_{it}|i = h)$, where the $y^j_{it}$, $j = 0, 1$, are the potential outcomes of $y_{it}$ (ldurat in the data), $i = h, l$ indicates whether the worker is a high ($h$) or a low earner ($l$), $t = a, b$ indicates whether the worker is observed after ($a$) or before ($b$) the policy change, and $w_t = 1(t = a)$ is our treatment variable in this case, the time dummy indicating the period after the policy change (afchnge in the data).
The difference-of-means estimator is readily obtained from the regression of ldurat on 1 and afchnge, restricted to high earners in Kentucky.
(regression output omitted: ldurat Coef., Robust Std. Err., t, P>|t|, [95% Conf. Interval])
The ATE estimate of 0.198 is bound to be inconsistent, since it is very likely that the $y^j_{it}$ are affected by a confounding time trend.
The DD estimator accommodates the trend by extending the sample to low earners and assuming that the trend is constant across high and low earners. So, letting $w_i = 1(i = h)$ be the dummy indicating high earners, assume
$$E(y^0_{it}|i, t, w_{it}) = \gamma_i + \lambda_t,$$
where $w_{it} = w_i w_t$ is our new treatment variable indicating high earners in the after-policy period. Assume also a constant ATE, $\delta$:
$$E(y^1_{it}|i, t, w_{it}) = E(y^0_{it}|i, t, w_{it}) + \delta.$$
Then,
$$E(y_{it}|i = l, t = b, w_{it}) = \gamma_l + \lambda_b$$
$$E(y_{it}|i = l, t = a, w_{it}) = \gamma_l + \lambda_a$$
$$E(y_{it}|i = h, t = b, w_{it}) = \gamma_h + \lambda_b$$
$$E(y_{it}|i = h, t = a, w_{it}) = \gamma_h + \lambda_a + \delta.$$
So
$$E(y_{it}|i = h, t = a, w_{it}) - E(y_{it}|i = h, t = b, w_{it}) = \delta + \lambda_a - \lambda_b$$
and since
$$E(y_{it}|i = l, t = a, w_{it}) - E(y_{it}|i = l, t = b, w_{it}) = \lambda_a - \lambda_b,$$
then $\delta$ is identified:
$$(4.10.4)\qquad \delta = \big[E(y_{it}|i = h, t = a) - E(y_{it}|i = h, t = b)\big] - \big[E(y_{it}|i = l, t = a) - E(y_{it}|i = l, t = b)\big].$$
The DD estimator simply replaces the theoretical means in Equation (4.10.4) with the sample analogs.
. qui sum ldurat if highearn&ky&afchnge
. sca y_hKa=r(mean)
. qui sum ldurat if highearn&ky&!afchnge
. sca y_hKb=r(mean)
. qui sum ldurat if !highearn&ky&afchnge
. sca y_lKa=r(mean)
. qui sum ldurat if !highearn&ky&!afchnge
. sca y_lKb=r(mean)
. sca DD=y_hKa-y_hKb-(y_lKa-y_lKb)
. sca list DD
DD = .1906012
It is easy to prove that the DD estimator is implemented at once by running the OLS regression of $y_{it}$ on 1, $w_t$, $w_i$ and $w_{it}$. The model is fully saturated, since it has enough parameters to fit the four conditional means
$$E(y_{it}|i = l, t = b, w_{it}) = \gamma_l + \lambda_b$$
$$E(y_{it}|i = l, t = a, w_{it}) = \gamma_l + \lambda_a$$
$$E(y_{it}|i = h, t = b, w_{it}) = \gamma_h + \lambda_b$$
$$E(y_{it}|i = h, t = a, w_{it}) = \gamma_h + \lambda_a + \delta$$
without constraints. The constant term estimates $E(y_{it}|i = l, t = b, w_{it})$, the coefficient on $w_t$ estimates $\lambda_a - \lambda_b$, the coefficient on $w_i$ estimates $\gamma_h - \gamma_l$ and the coefficient on $w_{it}$ estimates our ATE, $\delta$:
. regress ldurat i.highearn i.afchnge i.afchnge#i.highearn if ky, vce(robust)
(output omitted except the interaction row)
afchnge#highearn 1 1: Coef. .1906012, Robust Std. Err. .068982, t = 2.76, P>|t| = 0.006, [95% Conf. Interval] .0553699 .3258325
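The equivalence between the four-means formula and the saturated regression can be sketched in Python/numpy as follows (simulated data; all names and the effect size 0.19 are mine, chosen to mimic the example):

```python
import numpy as np

# DD estimator two ways: (i) the four-means formula (4.10.4);
# (ii) OLS of y on (1, w_t, w_i, w_i*w_t) -- the interaction coefficient.
rng = np.random.default_rng(2)
n = 4000
gi = (rng.uniform(size=n) < 0.5).astype(float)   # w_i: high-earner dummy
gt = (rng.uniform(size=n) < 0.5).astype(float)   # w_t: after-policy dummy
delta = 0.19
y = 1.0 + 0.3 * gi + 0.1 * gt + delta * gi * gt + rng.normal(size=n)

m = lambda a, b: y[(gi == a) & (gt == b)].mean() # cell mean
dd_means = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

Z = np.column_stack([np.ones(n), gt, gi, gi * gt])
dd_ols = np.linalg.lstsq(Z, y, rcond=None)[0][3] # coefficient on w_i*w_t
print(dd_means, dd_ols)
```

Because the regression is fully saturated, the two numbers coincide exactly (up to floating point), not just asymptotically.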
The DD estimator can also be implemented by the user-written Stata command diff (Villa, 2016):
Baseline: Control 1.126; Treated 1.382; Diff (T-C) 0.256, Std. Err. 0.047, t = 5.41, p = 0.000***
Follow-up: Control 1.133; Treated 1.580; Diff (T-C) 0.447, Std. Err. 0.049, t = 9.05, p = 0.000***
R-square: 0.02
* Means and Standard Errors are estimated by linear regression
** Inference: *** p<0.01; ** p<0.05; * p<0.1
CHAPTER 5

Oaxaca's model: OLS, optimal weighted least squares and group-wise heteroskedasticity

5.1. Introduction

Oaxaca's model is a good way to check your comprehension of things so far. The treatment is more complete than Greene (2008)'s. Importantly, it serves as a motivation for the Zyskind condition, introduced in Section 4.9. It may also serve as an introduction to a number of topics that will be covered later on: in particular, dummy variables; heteroskedasticity; generalized least squares estimation.
5.2. Embedding Oaxaca's model into a pooled regression framework

Consider the two separate regressions, one for male and one for female workers,
$$y_m = X_m\beta_m + \varepsilon_m$$
$$y_f = X_f\beta_f + \varepsilon_f,$$
where the two error variances, $\sigma_m^2$ and $\sigma_f^2$, are not necessarily equal (group-wise heteroskedasticity). Hence, the resulting OLS estimators from the two separate regressions are $b_m = (X_m'X_m)^{-1}X_m'y_m$ and $b_f = (X_f'X_f)^{-1}X_f'y_f$.
Let $1$ denote the $(n\times 1)$ vector of all unity elements and construct the $(n\times 1)$ vector $d$, such that its first $n_m$ entries are all unity elements and the last $n_f$ all zero elements.
Variables like d are usually referred to as dummy variables or indicator variables, since
they indicate whether any observation in the sample belongs or not to a given group. In
this particular case, d is the male dummy variable indicating whether any observation in the
sample is specific to the male group. Since the two groups are mutually exclusive, the female
dummy variable can be constructed as the complementary vector 1 d. By construction, d
and 1 d are orthogonal, that is d0 (1 d) = 0.
Let $x_i'$ be the $(1\times k)$ row vector indicating the $i$-th row of the regressor matrix, and let $y_i$, $\varepsilon_i$ and $d_i$ be scalars indicating the $i$-th components of $y$, $\varepsilon$ and $d$, respectively.
With this in hand, the model for the generic worker $i = 1, ..., n$ is
$$(5.2.1)\qquad y_i = d_i x_i'\beta_m + (1 - d_i)x_i'\beta_f + \varepsilon_i.$$
Letting
$$X = \begin{pmatrix} X_m & 0_{(n_m\times k)} \\ 0_{(n_f\times k)} & X_f \end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_m \\ \beta_f\end{pmatrix},$$
where $0_{(s\times t)}$ indicates a $(s\times t)$ matrix of all zero elements, model (5.2.1) can be reformulated in matrix form as
$$(5.2.2)\qquad y = X\beta + \varepsilon.$$
Exercise 49. Prove that X has f.c.r. if and only if both Xm and Xf have f.c.r.
Summing up, we have two equivalent representations of the same model: 1) that in Greene (2008), with the two separate regressions; 2) that presented here, with a single regression model, represented by (5.2.2). The two frameworks are equivalent as far as efficient estimation of the population coefficients is concerned. Indeed, as I prove next, the OLS estimator, $b$, from model (5.2.2) is numerically identical to the OLS estimators from the two separate regressions as presented in Greene (2008), i.e. $b = (b_m'\;\; b_f')'$. Let
$$b = (X'X)^{-1}X'y.$$
By construction,
$$X'y = \begin{pmatrix} X_m'y_m \\ X_f'y_f \end{pmatrix}$$
and
$$X'X = \begin{pmatrix} X_m'X_m & 0_{(k\times k)} \\ 0_{(k\times k)} & X_f'X_f \end{pmatrix}.$$
Then, by a well-known property of the inverse of a block-diagonal matrix (see (A-73) in Greene (2008)),
$$(X'X)^{-1} = \begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix}.$$
Hence,
$$b = \begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix}\begin{pmatrix} X_m'y_m \\ X_f'y_f \end{pmatrix} = \begin{pmatrix} (X_m'X_m)^{-1}X_m'y_m \\ (X_f'X_f)^{-1}X_f'y_f \end{pmatrix} = \begin{pmatrix} b_m \\ b_f \end{pmatrix}.$$
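This numerical identity is easy to confirm with a small numpy sketch (random data; all names are mine):

```python
import numpy as np

# Check that OLS on the pooled block-diagonal model (5.2.2) returns
# the stacked group-wise OLS estimators b = (b_m', b_f')'.
rng = np.random.default_rng(3)
nm, nf, k = 60, 40, 3
Xm, Xf = rng.normal(size=(nm, k)), rng.normal(size=(nf, k))
ym, yf = rng.normal(size=nm), rng.normal(size=nf)

bm = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)       # separate male regression
bf = np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)       # separate female regression

X = np.block([[Xm, np.zeros((nm, k))],           # pooled regressor matrix
              [np.zeros((nf, k)), Xf]])
y = np.concatenate([ym, yf])
b = np.linalg.solve(X.T @ X, X.T @ y)            # pooled OLS
print(np.max(np.abs(b - np.concatenate([bm, bf]))))
```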
It must be pointed out that model (5.2.2) does not satisfy assumption LRM.4. The disturbances $\varepsilon$, although independently distributed, suffer from what is usually referred to as group-wise heteroskedasticity, as the model does not maintain $\sigma_m^2 = \sigma_f^2$. Indeed, the covariance matrix is
$$Var(\varepsilon|X) = \Omega = \begin{pmatrix} \sigma_m^2 I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 I_{n_f} \end{pmatrix}.$$
In this sense, model (5.2.2) is not a classical regression model. Does this mean that $b$ is not BLUE? No, and for an important reason. Assumptions LRM.1-LRM.4 are sufficient for the OLS estimator to be BLUE, as proved in Section 4.3, but not necessary. In specific circumstances, even if LRM.4 is not met, the OLS estimator is still BLUE, and Oaxaca's model is one such case. This is verified in the next two sections.
5.3. The OLS estimator in Oaxaca's model is BLUE

Model (5.2.2) can be transformed into a classical regression model by using a standard procedure in econometrics and statistics: weighting. Let
$$H = \begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^{-1} I_{n_f} \end{pmatrix}.$$
As stated by the exercise below, the matrix $H$, when premultiplied to any conformable vector, transforms the vector so that its first $n_m$ elements get divided by $\sigma_m$ and the remaining by $\sigma_f$. This is what we refer to as weighting.
Exercise 51. Verify by direct inspection that, given any $(n_m\times 1)$ vector $x_m$, any $(n_f\times 1)$ vector $x_f$ and
$$x = \begin{pmatrix} x_m \\ x_f \end{pmatrix},$$
then
$$Hx = \begin{pmatrix} \sigma_m^{-1}x_m \\ \sigma_f^{-1}x_f \end{pmatrix}.$$
Premultiplying model (5.2.2) by $H$ gives
$$Hy = HX\beta + H\varepsilon,$$
or
$$(5.3.1)\qquad \tilde y = \tilde X\beta + \tilde\varepsilon,$$
where the tilde indicates weighted variables. Two important facts are worth observing at this point. First, the population parameter vector, $\beta$, in the weighted model is the same as in model (5.2.2). Second, the weighted errors satisfy LRM.4 with covariance matrix equal to $I_n$ (so, if LRM.5 holds, they are independent standard normal variables), since
$$Var(\tilde\varepsilon|\tilde X) = H\Omega H' = H\Omega H = \begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0 \\ 0 & \sigma_f^{-1} I_{n_f} \end{pmatrix}\begin{pmatrix} \sigma_m^{2} I_{n_m} & 0 \\ 0 & \sigma_f^{2} I_{n_f} \end{pmatrix}\begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0 \\ 0 & \sigma_f^{-1} I_{n_f} \end{pmatrix}$$
$$= \begin{pmatrix} \sigma_m I_{n_m} & 0 \\ 0 & \sigma_f I_{n_f} \end{pmatrix}\begin{pmatrix} \sigma_m^{-1} I_{n_m} & 0 \\ 0 & \sigma_f^{-1} I_{n_f} \end{pmatrix} = I_n.$$
Therefore, the weighted model is a classical regression model that identifies the parameters of interest, and hence, by the Gauss-Markov Theorem, the OLS estimator applied to the weighted model (5.3.1), referred to as the weighted least squares (WLS) estimator, $b_w$, is BLUE for $\beta$. Let us work out its formula, using exercise 51:
$$b_w = \begin{pmatrix} \sigma_m^{-2}X_m'X_m & 0_{(k\times k)} \\ 0_{(k\times k)} & \sigma_f^{-2}X_f'X_f \end{pmatrix}^{-1}\begin{pmatrix} \sigma_m^{-2}X_m'y_m \\ \sigma_f^{-2}X_f'y_f \end{pmatrix} = \begin{pmatrix} \sigma_m^{2}(X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & \sigma_f^{2}(X_f'X_f)^{-1} \end{pmatrix}\begin{pmatrix} \sigma_m^{-2}X_m'y_m \\ \sigma_f^{-2}X_f'y_f \end{pmatrix}$$
$$= \begin{pmatrix} (X_m'X_m)^{-1}X_m'y_m \\ (X_f'X_f)^{-1}X_f'y_f \end{pmatrix},$$
which proves that $b = b_w$, namely that in Oaxaca's model the OLS estimator coincides with the optimal WLS estimator.
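The invariance of the estimator to the group weights can also be verified numerically; the following numpy sketch (random data, names mine) weights the pooled model by $H$ and checks that $b_w = b$:

```python
import numpy as np

# Check that weighting by H = diag(1/sigma_m, ..., 1/sigma_f) leaves
# the pooled OLS estimator unchanged in Oaxaca's model (b_w = b).
rng = np.random.default_rng(4)
nm, nf, k = 50, 30, 2
Xm, Xf = rng.normal(size=(nm, k)), rng.normal(size=(nf, k))
ym, yf = rng.normal(size=nm), rng.normal(size=nf)
sm, sf = 2.0, 0.5                                 # group standard deviations

X = np.block([[Xm, np.zeros((nm, k))], [np.zeros((nf, k)), Xf]])
y = np.concatenate([ym, yf])
h = np.concatenate([np.full(nm, 1 / sm), np.full(nf, 1 / sf)])  # diag of H

b = np.linalg.solve(X.T @ X, X.T @ y)             # pooled OLS
Xw, yw = h[:, None] * X, h * y                    # weighted data
bw = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)        # WLS
print(np.max(np.abs(b - bw)))
```

The key design fact is that each weight multiplies an entire block that involves only one group's coefficients, so it cancels inside each block.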
Does this imply that we can do inference in Oaxaca's model feeding the Stata regress command with the variables of model (5.2.2) without further caution? Not quite. Although the single OLS regression provides the BLUE estimator for the population coefficients $\beta$, the OLS estimate of $Var(b|X)$ that would be computed by regress,
$$\widehat{Var}(b|X) = s^2\begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix},$$
with $s^2$ obtained from the sum of squares of the pooled residuals, is biased. The reason is that $\widehat{Var}(b|X)$ forces the regression variance estimate to be constant across the two samples. Luckily, the same is not true for the separate regressions on the two subsamples, which provide us with the unbiased estimators of the model coefficients, $b_m$ and $b_f$, and the unbiased estimator of the covariance matrix
$$\widehat{Var}(b|X) = \begin{pmatrix} s_m^2(X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & s_f^2(X_f'X_f)^{-1} \end{pmatrix},$$
where $s_m^2 = \frac{1}{n_m - k}\sum_{i=1}^{n_m}e_i^2$ and $s_f^2 = \frac{1}{n_f - k}\sum_{i=n_m+1}^{n}e_i^2$. Alternatively, one can implement a feasible version of the weighted regression explained above, using $s_m$ and $s_f$ as weights. But this is clearly more computationally cumbersome than carrying out the two separate regressions.
5.4. The Zyskind condition is verified in Oaxaca's model

By construction,
$$P_{[X]} = X(X'X)^{-1}X' = \begin{pmatrix} X_m & 0_{(n_m\times k)} \\ 0_{(n_f\times k)} & X_f \end{pmatrix}\begin{pmatrix} (X_m'X_m)^{-1} & 0_{(k\times k)} \\ 0_{(k\times k)} & (X_f'X_f)^{-1} \end{pmatrix}\begin{pmatrix} X_m' & 0_{(k\times n_f)} \\ 0_{(k\times n_m)} & X_f' \end{pmatrix}$$
$$= \begin{pmatrix} X_m(X_m'X_m)^{-1}X_m' & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & X_f(X_f'X_f)^{-1}X_f' \end{pmatrix} = \begin{pmatrix} P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & P_{[X_f]} \end{pmatrix}.$$
Therefore,
$$\Omega P_{[X]} = \begin{pmatrix} \sigma_m^2 I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 I_{n_f} \end{pmatrix}\begin{pmatrix} P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & P_{[X_f]} \end{pmatrix} = \begin{pmatrix} \sigma_m^2 P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 P_{[X_f]} \end{pmatrix}$$
$$= \begin{pmatrix} P_{[X_m]} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & P_{[X_f]} \end{pmatrix}\begin{pmatrix} \sigma_m^2 I_{n_m} & 0_{(n_m\times n_f)} \\ 0_{(n_f\times n_m)} & \sigma_f^2 I_{n_f} \end{pmatrix} = P_{[X]}\Omega.$$
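The commutation of $\Omega$ and $P_{[X]}$ can be checked directly on a random block-diagonal design (a numpy sketch; names mine):

```python
import numpy as np

# Numerical check of the Zyskind condition P_[X] Omega = Omega P_[X]
# for a block-diagonal Oaxaca-type design.
rng = np.random.default_rng(5)
nm, nf, k = 20, 15, 2
Xm, Xf = rng.normal(size=(nm, k)), rng.normal(size=(nf, k))
X = np.block([[Xm, np.zeros((nm, k))], [np.zeros((nf, k)), Xf]])

P = X @ np.linalg.solve(X.T @ X, X.T)            # projector P_[X]
omega = np.diag(np.concatenate([np.full(nm, 4.0),   # sigma_m^2 = 4
                                np.full(nf, 0.25)]))  # sigma_f^2 = 0.25
lhs, rhs = P @ omega, omega @ P
print(np.max(np.abs(lhs - rhs)))
```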
CHAPTER 6
6.1. Introduction
The linear regression model may present departures from LRM.4, such as heteroskedastic-
ity and/or cluster correlation. In this chapter we study common econometric techniques that
accommodate these issues, for both estimation and inference: primarily, the Generalized LS
(GLS) estimator for the regression coefficients and robust covariance estimators.
All the statistical properties are derived for $n \to \infty$, and so the techniques we consider in this chapter work well in large samples.
I spell out the assumptions needed for consistency and asymptotic normality of OLS and
GLS estimators, providing the derivation of the large-sample properties.
Strict exogeneity is maintained throughout:
SE: E (|X) = 0.
A weaker version of the random sampling assumption, one which does not maintain identical distributions across records, is invoked when proving asymptotic normality and consistency of the variance estimators:
RS$'$: There is a sample of size $n$ such that the elements of the sequence $\{(y_i\; x_i'),\; i = 1, ..., n\}$ are independent (NB: not necessarily identically distributed) random vectors.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using the
data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2010).
6.2. OLS with non-spherical error covariance matrix

6.2.1. Consistency. Write
$$\mathrm{plim}(b) = \beta + \mathrm{plim}\left[\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}\right] = \beta + Q^{-1}\,\mathrm{plim}\left(\frac{X'\varepsilon}{n}\right).$$
By strict exogeneity,
$$E\left(\frac{X'\varepsilon}{n}\right) = 0;$$
then
$$Var\left(\frac{X'\varepsilon}{n}\,\Big|X\right) = E\left(\frac{X'\varepsilon}{n}\frac{\varepsilon'X}{n}\,\Big|X\right) = \frac{1}{n}\frac{X'\Omega X}{n},$$
and so
$$Var\left(\frac{X'\varepsilon}{n}\right) = \frac{1}{n}\,E\left(\frac{X'\Omega X}{n}\right),$$
which goes to zero as $n \to \infty$ by assumption OLS.1. Hence $X'\varepsilon/n$ converges in squared mean, and consequently in probability, to zero.
Clearly, the above implies that OLS is consistent in the classical case of LRM.4.
6.2. OLS WITH NON-SPHERICAL ERROR COVARIANCE MATRIX 99
is p.d. and arbitrary, in the sense that all i2 , i = 1, ..., n, are unconstrained and may depend
on X. LRM.4H permits arbitrary heteroskedasticity but not correlation. Partition X row-wise
0
x1
x0 2
X = . .
..
xn0
V ar (xi i ) = E E 2i xi x0 i |xi = E i2 xi x0 i
and
n n
1X 1X
E i2 xi x0i
V ar (xi i ) =
n n
i=1 i=1
0
X X
= E .
n
Therefore,
n
X 0 X
1X
lim V ar (xi i ) = lim E ,
n n n n
i=1
6.2. OLS WITH NON-SPHERICAL ERROR COVARIANCE MATRIX 100
which is a finite matrix by assumption and so, by the (multivariate) Lindeberg-Feller theorem,
X n
X 0
0
n X X
xi i N 0, plim .
n n d n
i=1
Eventually, given the rules for limiting distributions (Theorem D.16 in Greene (2008)),
1
X 0X X 0 X 0
n (b ) Q1 ,
n n d n
and so
0
X X
n (b ) N 0, Q1 plim Q1 .
d n
X 0 X
1
Avar (b) = Q1 plim Q1 ,
n n
1 1
(6.2.1) \
Avar (b) = X 0 X X 0 X
X 0X ,
where
e21 0 0
.. ..
0 e22 . .
=
.
.. .. ..
. . . 0
0 0 e2n
An equivalent way to express $\widehat\Omega$, one that will be used intensively in Chapters 7 and 8, is the following:
$$\widehat\Omega = ee' \odot I_n,$$
where the symbol $\odot$ stands for the element-by-element matrix product (also known as the Hadamard product). Similarly, $\Omega = E(\varepsilon\varepsilon' \odot I_n\,|X)$.
Econometric software packages routinely compute robust OLS standard errors: these are just the square roots of the main diagonal elements of $\widehat{Avar}(b)$ in (6.2.1). In Stata this is done through the regress option vce(robust) (or, equivalently, simply robust).
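For intuition, the sandwich formula (6.2.1) and the $ee'\odot I_n$ form of $\widehat\Omega$ can be sketched in a few lines of numpy (simulated heteroskedastic data; names mine):

```python
import numpy as np

# White estimator (6.2.1): (X'X)^{-1} X' Omega_hat X (X'X)^{-1},
# with Omega_hat = diag(e_1^2, ..., e_n^2) = ee' (Hadamard) I_n.
rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1.0 + np.abs(x))  # heteroskedastic

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (e[:, None] ** 2 * X)                 # X' diag(e^2) X
meat_alt = X.T @ (np.outer(e, e) * np.eye(n)) @ X  # X' (ee' ⊙ I_n) X
V_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))             # robust standard errors
print(se_robust)
```

The two "meat" computations are identical; the first avoids forming the $n\times n$ matrix and is how software actually implements it.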
6.2.4. White's heteroskedasticity test. The White estimator remains consistent under homoskedasticity, therefore one can test for heteroskedasticity by assessing the statistical discrepancy between $s^2(X'X)^{-1}$ and $(X'X)^{-1}X'\widehat\Omega X(X'X)^{-1}$. Under the null hypothesis of homoskedasticity, the discrepancy will be small. This is the essence of White's heteroskedasticity test. The statistic measuring such discrepancy can be implemented through an auxiliary regression of the squared OLS residuals on the regressors, their squares and their cross-products, including the constant term.
We may implement the White test manually, saving the OLS residuals through predict and then generating squares and interactions as appropriate, or more easily by giving the following post-estimation command after regress: imtest, white.
6.2.5. Cluster correlation. Data are often naturally grouped into clusters: firms operate in different sectors, workers live in different provinces, immigrants in a given country belong to different ethnic groups, and so on.
Clustering cannot be neglected in empirical work. In the case of firm data, for example, it is likely that there is correlation across the productivity shocks hitting firms in the same sectoral cluster, with a resulting bias in the standard error estimates, even if White robust.
The White estimator can be made robust to cluster correlation quite easily. I explain
this in terms of the firm data example. Assume that we have cross-sectional data of n firms,
indexed by $i = 1, ..., n$. There are $G$ sectors, indexed by $g = 1, ..., G$, and we know which sector each firm belongs to. This information is contained in the $(n\times G)$ matrix $D$ of sectoral indicators: the element of $D$ in row $i$ and column $j$, say $d(i, j)$, is unity if firm $i$ belongs to sector $j$ and zero if not. The clustering assumption below generalizes both LRM.4 and LRM.4H:
$$Var(\varepsilon|X) = \Omega = E(\varepsilon\varepsilon' \odot DD'\,|X).$$
The cluster-correlation and heteroskedasticity consistent estimator for the asymptotic covariance matrix of $b$ is then assembled by simply replacing $\widehat\Omega$ in Equation (6.2.1) with
$$\widehat\Omega_c = ee' \odot DD'.$$
Stata does this through the regress option vce(cluster clustervar ), where clustervar is
the name of the cluster identifier in the data set.
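The construction $\widehat\Omega_c = ee' \odot DD'$ can be sketched directly in numpy (simulated firm/sector data with a common within-cluster shock; all names are mine):

```python
import numpy as np

# Cluster-robust covariance via Omega_hat_c = ee' (Hadamard) DD'.
rng = np.random.default_rng(7)
n, G = 200, 10
g = rng.integers(0, G, size=n)                    # cluster (sector) ids
D = (g[:, None] == np.arange(G)[None, :]).astype(float)  # n x G indicators
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=G)                            # common cluster shocks
y = 1.0 + 2.0 * x + u[g] + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
omega_c = np.outer(e, e) * (D @ D.T)              # zero outside clusters
XtX_inv = np.linalg.inv(X.T @ X)
V_cluster = XtX_inv @ (X.T @ omega_c @ X) @ XtX_inv
print(np.sqrt(np.diag(V_cluster)))
```

Note that $DD'$ has unity in cell $(i, j)$ exactly when firms $i$ and $j$ share a cluster, so the Hadamard product keeps only within-cluster residual products, i.e. the same "meat" as summing $X_g'e_g e_g'X_g$ over clusters.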
Chapter 8 will cover cases of multi-clustering, that is data that are grouped along more
than one dimension.
6.2.6. Average variance estimate (skip it). I prove now that a consistent estimate of the average variance
$$\bar\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2$$
is given by
$$\bar s_n^2 = \frac{1}{n}\sum_{i=1}^{n}e_i^2.$$
Since $e = M_{[X]}\varepsilon$,
$$\mathrm{plim}\,\bar s_n^2 = \mathrm{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right) - 0'Q^{-1}0 = \mathrm{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right).$$
By the RS$'$ assumption the squared errors, $\varepsilon_i^2$, are all independently distributed with means $E(\varepsilon_i^2) = \sigma_i^2$, and, given that
$$\frac{\varepsilon'\varepsilon}{n} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2,$$
I can apply Markov's strong law of large numbers to have
$$\mathrm{plim}\left[\frac{\varepsilon'\varepsilon}{n} - \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\right] = 0.$$
6.3. GLS

The estimation strategy described in the previous sections is based on OLS estimates of the regression coefficients with standard error estimates corrected for heteroskedasticity and/or cluster correlation. The drawback of this approach is a loss in efficiency if the departures from LRM.4 are of a known form. We will see that in this case the BLUE can always be found.
To formalize the new set-up, let $Var(\varepsilon|X) = \sigma^2\Omega$, where $\Omega$ is a known symmetric, positive definite (p.d.) $(n\times n)$ matrix and $\sigma^2$ is an unknown strictly positive scalar (that is, $Var(\varepsilon|X)$ is known up to a strictly positive multiplicative scalar).
6.3. GLS 104
(6.3.1) 1 = C1 C 0
and
(6.3.3) y = X + ,
Exercise 52. Verify by direct inspection that indeed 1 = 1 = and 1/2 1/2 =
1 .
1 = C1 C 0 CC 0 = C1 IC 0 = C1 C 0 = CC 0 = I
and
1 = CC 0 C1 C 0 = CI1 C 0 = C1 C 0 = CC 0 = I.
6.3. GLS 105
The rest is proved similarly on considering that 1/2 is diagonal and so 1/2 1/2 = 1 .
Exercise 53. Use (6.3.1) and (6.3.2) to prove: 1) $X^{*\prime}X^* = X'\Omega^{-1}X$; 2) $X^{*\prime}\varepsilon^* = X'\Omega^{-1}\varepsilon$; and 3) $Var(\varepsilon^*|X) = \sigma^2 I_n$; then use the general law of iterated expectations to prove that also $Var(\varepsilon^*|X^*) = \sigma^2 I_n$.
Given the results of the foregoing exercise, OLS applied to the transformed model (6.3.3) is the Gauss-Markov estimator for $\beta$ and has the formula
$$b_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*$$
$$(6.3.4)\qquad\;\; = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y,$$
with $Var(b_{GLS}|X) = \sigma^2(X'\Omega^{-1}X)^{-1}$.
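The equivalence between the transformed-model OLS and the closed-form (6.3.4) is immediate to verify in numpy (simulated data with a known diagonal $\Omega$; names mine):

```python
import numpy as np

# GLS two ways: the formula (6.3.4) and OLS on data premultiplied by
# Omega^{-1/2}; both give the same b_GLS.
rng = np.random.default_rng(8)
n, k = 80, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, -0.5, 2.0])
d = rng.uniform(0.5, 3.0, size=n)                 # Omega = diag(d), known
y = X @ beta + rng.normal(size=n) * np.sqrt(d)    # heteroskedastic errors

omega_inv = np.diag(1.0 / d)
b_gls = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)

w = 1.0 / np.sqrt(d)                              # Omega^{-1/2} weights
Xs, ys = w[:, None] * X, w * y                    # transformed data
b_trans = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)   # OLS on transformed model
print(np.max(np.abs(b_gls - b_trans)))
```

For diagonal $\Omega$, premultiplying by $\Omega^{-1/2}$ is just per-observation weighting, which is why GLS under pure heteroskedasticity reduces to WLS.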
The transformation based on $\Omega^{-1/2}$,
$$(6.3.5)\qquad \Omega^{-1/2}y = \Omega^{-1/2}X\beta + \Omega^{-1/2}\varepsilon,$$
is also a GLS transformation, that is, OLS applied to model (6.3.5) yields $b_{GLS}$.
Solution: By exercise 52,
$$\big(X'\Omega^{-1/2}\Omega^{-1/2}X\big)^{-1}X'\Omega^{-1/2}\Omega^{-1/2}y = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
The estimator $b_{GLS}$ is OLS applied to a classical regression model and as such it is BLUE. The following exercise asks you to verify by direct inspection that GLS is better than OLS in terms of covariance.
Exercise 55. Prove that
$$\sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1} - \sigma^2(X'\Omega^{-1}X)^{-1}$$
is a n.n.d. matrix.
Solution: Let
$$D \equiv (X'X)^{-1}X' - (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}.$$
Therefore,
$$(X'X)^{-1}X' = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + D,$$
with $DX = 0$, and so
$$(X'X)^{-1}X'\Omega X(X'X)^{-1} = \big[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + D\big]\,\Omega\,\big[\Omega^{-1}X(X'\Omega^{-1}X)^{-1} + D'\big] = (X'\Omega^{-1}X)^{-1} + D\Omega D'.$$
Since $\Omega$ is p.d., for any $(n\times 1)$ vector $z$, $z'\Omega z \geq 0$, being equal to zero if and only if $z = 0$. But then $z'\Omega z \geq 0$ when, in particular, $z = D'w$ for any $(k\times 1)$ vector $w$, which is equivalent to saying that $w'D\Omega D'w \geq 0$ for any $(k\times 1)$ vector $w$, or that $D\Omega D'$ is n.n.d., proving the result.
Exercise 56. Under GLS.1, SE and RS$'$, prove that
$$\mathrm{plim}(b_{GLS}) = \beta.$$
Hint: write
$$b_{GLS} = \beta + \left(\frac{X^{*\prime}X^*}{n}\right)^{-1}\frac{X^{*\prime}\varepsilon^*}{n},$$
then consider that $Var(\varepsilon^*|X^*) = \sigma^2 I_n$ (see Exercise 53) and, finally, follow the same steps as in Section 6.2.1.
6.3.2. Asymptotic normality. I prove asymptotic normality of $b_{GLS}$ under GLS.1, SE and RS$'$ (again, remember that $Var(\varepsilon^*|X^*) = \sigma^2 I_n$).
By strict exogeneity $E(x_i^*\varepsilon_i^*) = 0$ and hence
$$Var(x_i^*\varepsilon_i^*) = \sigma^2 E(x_i^* x_i^{*\prime})$$
and
$$\frac{1}{n}\sum_{i=1}^{n}Var(x_i^*\varepsilon_i^*) = \frac{\sigma^2}{n}\sum_{i=1}^{n}E(x_i^* x_i^{*\prime}) = \sigma^2 E\left(\frac{X'\Omega^{-1}X}{n}\right).$$
Therefore,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}Var(x_i^*\varepsilon_i^*) = \sigma^2\lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right),$$
which is a finite matrix by assumption. By the Lindeberg-Feller central limit theorem,
$$\frac{\sqrt n}{n}\sum_{i=1}^{n}x_i^*\varepsilon_i^* = \sqrt n\,\frac{X^{*\prime}\varepsilon^*}{n} \;\xrightarrow{d}\; N\left(0,\; \sigma^2 Q^*\right),$$
where $Q^* \equiv \mathrm{plim}\,X'\Omega^{-1}X/n$, and since
$$\sqrt n\,(b_{GLS} - \beta) = \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\sqrt n\,\frac{X'\Omega^{-1}\varepsilon}{n} \;\approx\; Q^{*-1}\sqrt n\,\frac{X'\Omega^{-1}\varepsilon}{n},$$
then
$$\sqrt n\,(b_{GLS} - \beta) \;\xrightarrow{d}\; N\left(0,\; \sigma^2 Q^{*-1}\right).$$
The asymptotic covariance matrix is
$$Avar(b_{GLS}) = \frac{\sigma^2}{n}\,Q^{*-1},$$
and is estimated by
$$\widehat{Avar}(b_{GLS}) = s^2_{GLS}\,(X'\Omega^{-1}X)^{-1},$$
where
$$s^2_{GLS} = \frac{(y^* - X^*b_{GLS})'(y^* - X^*b_{GLS})}{n - k} = \frac{(y - Xb_{GLS})'\Omega^{-1}(y - Xb_{GLS})}{n - k}.$$
Exercise 57. (This may be skipped) Under GLS.1, SE and RS$'$, prove that $\mathrm{plim}\, s^2_{GLS} = \sigma^2$.
6.3.3. Feasible GLS. In general situations we may know the form of $\Omega$ but not the values taken on by its elements. Therefore, to make GLS operational we need an estimate of $\Omega$, say $\widehat\Omega$. Replacing $\Omega$ by $\widehat\Omega$ in (6.3.4) delivers the feasible GLS, henceforth FGLS, estimator:
$$b_{FGLS} = \big(X'\widehat\Omega^{-1}X\big)^{-1}X'\widehat\Omega^{-1}y.$$
Since GLS is consistent, knowing that $b_{GLS}$ and $b_{FGLS}$ are asymptotically equivalent, i.e. $\mathrm{plim}(b_{FGLS} - b_{GLS}) = 0$, is enough to ensure that $b_{FGLS}$ is consistent, but not that
$$\sqrt n\,(b_{FGLS} - \beta) \;\xrightarrow{d}\; N\left(0,\; \sigma^2 Q^{*-1}\right).$$
For this we need the stronger condition that $\sqrt n\,(b_{FGLS} - \beta)$ and $\sqrt n\,(b_{GLS} - \beta)$ be asymptotically equivalent, or
$$(6.3.6)\qquad \sqrt n\,(b_{FGLS} - b_{GLS}) \;\xrightarrow{p}\; 0.$$
Exercise 58. Assume
$$(6.3.7)\qquad \mathrm{plim}\left(\frac{X'\widehat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0$$
and
$$(6.3.8)\qquad \mathrm{plim}\left(\frac{X'\widehat\Omega^{-1}\varepsilon}{\sqrt n} - \frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = 0,$$
and prove (6.3.6).
Solution: Given
$$\mathrm{plim}\,\frac{X'\Omega^{-1}X}{n} = Q^*$$
and (6.3.7), then
$$\mathrm{plim}\left[\frac{X'\Omega^{-1}X}{n} + \left(\frac{X'\widehat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right)\right] = Q^*,$$
and so
$$(6.3.9)\qquad \mathrm{plim}\left(\frac{X'\widehat\Omega^{-1}X}{n}\right)^{-1} = Q^{*-1}.$$
Moreover,
$$\sqrt n\,(b_{FGLS} - \beta) = \left(\frac{X'\widehat\Omega^{-1}X}{n}\right)^{-1}\frac{X'\widehat\Omega^{-1}\varepsilon}{\sqrt n}$$
and
$$\sqrt n\,(b_{GLS} - \beta) = \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}.$$
The last two equalities, along with the maintained conditions (6.3.7) and (6.3.8), the asymptotic results (6.3.9) and (6.3.10) and the Slutsky Theorem, prove that $\sqrt n\,(b_{GLS} - \beta)$ and $\sqrt n\,(b_{FGLS} - \beta)$ are asymptotically equivalent, both being asymptotically equal to $Q^{*-1}X'\Omega^{-1}\varepsilon/\sqrt n$.
Conditions (6.3.7) and (6.3.8) must be verified on a case-by-case basis. Importantly, they may hold even in cases in which $\widehat\Omega$ is not consistent for $\Omega$, as shown in the context of FGLS panel data estimators.

6.4. Large sample tests

6.4.1. Introduction. This section covers large sample tests in more detail than Greene (2008). For the exam you can skip the derivations of the asymptotic results.
Assume that the following results hold:
(1) $\sqrt n\,(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$;
(2) $\mathrm{plim}\,\dfrac{X'X}{n} = Q$;
(3) $\mathrm{plim}\, s^2 = \sigma^2$;
and consider the following lemma, referred to as the product rule. For more on this see
and consider the following lemma, referred to as the product rule. For more on this see
White (2001) p. 67 (notice that the product rule is not mentioned in Greene (2008), although
implicitly used for proving the asymptotic distributions of the tests).
Lemma 59 (The product rule). Let $A_n$ be a sequence of random $(l\times k)$ matrices and $b_n$ a sequence of random $(k\times 1)$ vectors such that $\mathrm{plim}(A_n) = 0$ and $b_n \xrightarrow{d} z$. Then, $\mathrm{plim}(A_n b_n) = 0$.
6.4.2. The t-ratio test (skip derivations). We wish to derive the asymptotic distribution of the t-ratio test for the null hypothesis $H_o: \beta_k = \beta_k^o$. We begin by noting that under $H_o$
$$(6.4.1)\qquad \frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{\sigma^2 Q^{-1}_{kk}}} \;\xrightarrow{d}\; N(0, 1).$$
The t-ratio is
$$t = \frac{b_k - \beta_k^o}{\sqrt{s^2\,(X'X)^{-1}_{kk}}},$$
where $(X'X)^{-1}_{kk} = \big(x_k' M_{[X_{(k)}]} x_k\big)^{-1}$ and $X = (X_{(k)}\;\; x_k)$ (see Section 4.8). Since $t$ can be reformulated as
$$t = \frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}},$$
then
$$(6.4.2)\qquad \mathrm{plim}\left[\frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{\sqrt n\,(b_k - \beta_k^o)}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right] = \mathrm{plim}\left[\left(\frac{1}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{1}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right)\sqrt n\,(b_k - \beta_k^o)\right] = 0,$$
where the second equality follows from the product rule, given that, by results 2-3 and the Slutsky Theorem (Theorem D.12 in Greene (2008), p. 1045), the first factor in the second plim converges in probability to zero and, by result 1, the second factor converges in distribution to a normal random scalar. Hence, the two sequences in the plim of equation (6.4.2) are asymptotically equivalent and by Theorem D.16(3) have the same limiting distribution. Given (6.4.1), this proves that
$$\frac{b_k - \beta_k^o}{\sqrt{s^2\,(X'X)^{-1}_{kk}}} \;\xrightarrow{d}\; N(0, 1).$$
More generally, for a non-random $(k\times 1)$ vector $r$, under $H_o: r'\beta = r'\beta^o$,
$$(6.4.3)\qquad \frac{r'(b - \beta)}{\sqrt{s^2\, r'(X'X)^{-1}r}} \;\xrightarrow{d}\; N(0, 1).$$
Exercise 60. (skip) Prove (6.4.3). Hint: by the Slutsky Theorem, $\mathrm{plim}\; r'\left(\dfrac{X'X}{n}\right)^{-1}r = r'Q^{-1}r$.
6.4.3. The Chi-squared test (skip derivations). We wish to test the null hypothesis $H_o: R\beta - q = 0$, where $R$ is a non-random $(J\times k)$ matrix of full row rank and $q$ is a $(J\times 1)$ vector of constants.
Now let $A \equiv \sigma^2 RQ^{-1}R'$. Since $A$ is p.d. ($R$ is f.r.r.), there exists a p.d. matrix $A^{1/2}$ such that $A^{1/2}A^{1/2} = A$ and $(A^{1/2})^{-1} = A^{-1/2}$. Then, by result 1 and the Slutsky Theorem,
$$(6.4.4)\qquad A^{-1/2}R\sqrt n\,(b - \beta) \;\xrightarrow{d}\; N(0, I_J).$$
Similarly, let $\widehat A \equiv s^2 R\left(\frac{X'X}{n}\right)^{-1}R'$. Since $\widehat A$ is p.d., there exists a p.d. matrix $\widehat A^{1/2}$ such that $\widehat A^{1/2}\widehat A^{1/2} = \widehat A$ and $(\widehat A^{1/2})^{-1} = \widehat A^{-1/2}$. Then
$$(6.4.5)\qquad \mathrm{plim}\left[\widehat A^{-1/2}R\sqrt n\,(b - \beta) - A^{-1/2}R\sqrt n\,(b - \beta)\right] = \mathrm{plim}\left[\left(\widehat A^{-1/2} - A^{-1/2}\right)R\sqrt n\,(b - \beta)\right] = 0,$$
where the second equality follows from the product rule, given that
$$\mathrm{plim}\,\widehat A^{-1/2} = A^{-1/2}$$
and
$$R\sqrt n\,(b - \beta) \;\xrightarrow{d}\; N(0, A),$$
by result 1 and Theorem D.16(2) in Greene (2008). Hence, by Theorem D.16(3) the two sequences in the left-hand-side plim of equation (6.4.5) have the same limiting distribution and, given (6.4.4), this proves that
$$\widehat A^{-1/2}R\sqrt n\,(b - \beta) \;\xrightarrow{d}\; N(0, I_J).$$
Let $w \equiv \widehat A^{-1/2}R\sqrt n\,(b - \beta)$; then, by Theorem D.16(2),
$$(6.4.6)\qquad w'w \;\xrightarrow{d}\; \chi^2(J).$$
Under $H_o$, $R\beta = q$ and
$$w'w = \sqrt n\,(b - \beta)'R'\,\widehat A^{-1/2}\widehat A^{-1/2}R\sqrt n\,(b - \beta) = \sqrt n\,(b - \beta)'R'\left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b - \beta) = JF,$$
and so, by (6.4.6),
$$JF \;\xrightarrow{d}\; \chi^2(J).$$
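The Wald form of the statistic is straightforward to compute; the following numpy sketch (simulated data under the null; names mine) builds $JF$ for the joint hypothesis that two slope coefficients are zero:

```python
import numpy as np

# Wald statistic JF = [R(b-beta)]' [s^2 R (X'X)^{-1} R']^{-1} [R(b-beta)],
# asymptotically chi-squared(J) under Ho; here Ho: beta_2 = beta_3 = 0.
rng = np.random.default_rng(9)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.0, 0.0])                  # Ho holds in the DGP
y = X @ beta + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - k)                              # regression variance
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])  # J = 2 restrictions
q = np.zeros(2)
A_hat = s2 * R @ np.linalg.inv(X.T @ X) @ R.T
wald = (R @ b - q) @ np.linalg.solve(A_hat, R @ b - q)
print(wald)  # compare with the chi2(2) 5% critical value, 5.99
```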
CHAPTER 7
7.1. Introduction
This chapter covers the two most important panel data models: the fixed effect and the
random effect models.
For simplicity we start directly from the statistical models. The sampling mechanism will
be introduced when proving asymptotic normality.
Results in this chapter are demonstrated through the do-file paneldata.do using the data-set airlines.dta, a panel data-set that I have extracted from costfn.dta (Baltagi et al. 1998).
7.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)
Consider the following panel data regression model expressed at the observation level, that is, for individuals $i = 1, ..., N$ and times $t = 1, ..., T$:
$$(7.2.1)\qquad y_{it} = x_{it}'\beta + \gamma_i + \varepsilon_{it}.$$
Define $d_{it}^j$ as the value taken on by the dummy variable indicating individual $j = 1, ..., N$ at observation $(i, t)$, that is,
$$d_{it}^j = \begin{cases} 1 & \text{if } i = j, \text{ any } t = 1, ..., T \\ 0 & \text{if } i \neq j, \text{ any } t = 1, ..., T. \end{cases}$$
Stacking the T observations of individual i yields the group-level representation

(7.2.2)  y_i = X_i β + α_i 1_T + ε_i,

where

y_i = (y_i1, ..., y_it, ..., y_iT)'  (T × 1),
X_i = (x_i1, ..., x_it, ..., x_iT)'  (T × k),
ε_i = (ε_i1, ..., ε_it, ..., ε_iT)'  (T × 1),

and

d_ji = 1_T if i = j,  d_ji = 0_T if i ≠ j,

where 1_T indicates the (T × 1) vector of all unity elements and 0_T the (T × 1) vector of all zero elements.
Stacking data by individuals, an even more compact representation of the regression model (7.2.2), at the level of the whole data-set, is

(7.2.3)  y = Xβ + Dα + ε,
7.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL) 117
where

y = (y_1', ..., y_i', ..., y_N')'  (NT × 1),
X = (X_1', ..., X_i', ..., X_N')'  (NT × k),
ε = (ε_1', ..., ε_i', ..., ε_N')'  (NT × 1),
α = (α_1, ..., α_i, ..., α_N)'  (N × 1),

and D is the (NT × N) matrix of individual dummies, D = I_N ⊗ 1_T, or equivalently D = (d_1 d_2 ... d_N). Under the following assumptions model (7.2.3) is a classical regression model that includes individual dummies:
FE.1: The extended regressor matrix (X D) has f.c.r. Therefore, not only is X of f.c.r.,
but also none of its columns can be expressed as a linear combination of the dummy
variables, which boils down to saying that no column of X can be time-constant,
which in turn implies that X does not include the unity vector (indeed, there is a
constant term in model (7.2.3), but one that jumps across individuals).
FE.2: E(ε|X) = 0. Hence, the variables in X are strictly exogenous with respect to ε, but the statistical relationship between α and X is left completely unrestricted. Model (7.2.3), therefore, automatically accommodates any form of omitted-variable bias due to the omission of time-constant regressors. Notice that D is taken as a non-random matrix, therefore conditioning on (X D) or simply X is exactly the same.
FE.3: Var(ε|X) = σ_ε² I_NT.
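A small simulation can make the assumptions concrete. The sketch below (my own illustration, with numpy standing in for the notes' Stata) builds the dummy matrix D = I_N ⊗ 1_T, generates regressors deliberately correlated with the individual effects (allowed under FE.2), and runs OLS on the extended regressor matrix (X D):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k = 50, 6, 2
alpha = rng.standard_normal(N)                       # individual effects
# regressors correlated with the effects: the FE model allows this
X = rng.standard_normal((N * T, k)) + np.repeat(alpha, T)[:, None]
beta = np.array([1.0, -2.0])
D = np.kron(np.eye(N), np.ones((T, 1)))              # NT x N dummy matrix
y = X @ beta + D @ alpha + 0.1 * rng.standard_normal(N * T)

Z = np.hstack([X, D])                                # extended regressor matrix (X D)
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
b_lsdv, a_lsdv = coef[:k], coef[k:]                  # slope and effect estimates
```

Despite the correlation between X and α, the dummy regression recovers β; a pooled regression omitting D would not.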
Exercise 61. Prove that the following model with the constant term is an equivalent reparametrization of Model (7.2.3):

(7.2.4)  y = 1_NT γ_0 + Xβ + D_1 γ_1 + ε.

Solution. Partition D = (d_1 D_1) and α = (α_1, α'_{(1)})', where α_{(1)} = (α_2, ..., α_N)', so that model (7.2.3) can be written as

(7.2.5)  y = d_1 α_1 + Xβ + D_1 α_{(1)} + ε,

with d_1 + D_1 1_{N−1} = 1_NT. Then, adding and subtracting 1_NT α_1,

y = 1_NT α_1 + d_1 α_1 + Xβ + D_1 α_{(1)} − (d_1 + D_1 1_{N−1}) α_1 + ε
  = 1_NT α_1 + Xβ + D_1 α_{(1)} − D_1 1_{N−1} α_1 + ε
  = 1_NT γ_0 + Xβ + D_1 γ_1 + ε,

where γ_0 ≡ α_1 and γ_1 ≡ α_{(1)} − 1_{N−1} α_1.
Remark 62. Exercise 61 demonstrates that after the reparametrization the interpretation of the β coefficients is unchanged, the constant term is α_1 and the coefficients on the remaining individual dummies are no longer the individual effects of the remaining individuals, α_i, i = 2, ..., N, but rather the contrasts of α_i with respect to α_1, i = 2, ..., N. Of course, the reference individual need not be the first one in the data-set and can be freely chosen among the N individuals by the researcher at her/his own convenience. In Stata this is implemented by using regress followed by the dependent variable, the X regressors and N − 1 dummy variables (see paneldata.do).
Remark 63. The interpretation of the constant in Exercise 61 is different from that in
the Stata transformation (see 10/04/12 Exercises) of Model (7.2.3). In the former case the
constant term is the effect of the individual whose dummy is removed from the regression, in
the latter it is the average of the N individual effects.
The LSDV estimator is just the OLS estimator applied to model (7.2.3) and, given FE.1-3, it is the BLUE. The separate formulas of LSDV for β and α are obtained by applying the partitioned-regression Theorem to (7.2.3). So,

b_LSDV = (X'M_[D]X)^{-1} X'M_[D]y

is the LSDV estimator for β and

a_LSDV = (D'D)^{-1} D'(y − X b_LSDV)

is the LSDV estimator for α. As already mentioned, both are BLUEs, but while b_LSDV converges in probability to β when N → ∞ or T → ∞ or both, a_LSDV converges in probability to α only when T → ∞. This discrepant large-sample behavior of b_LSDV and a_LSDV is due to the fact that the dimension of α increases as N increases, whereas that of β is kept fixed at k.
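The partitioned-regression formulas can be verified directly. In the sketch below (my own numerical check, not from the notes), the two-step formulas for b_LSDV and a_LSDV reproduce exactly the coefficients of one long OLS regression on (X D):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, k = 30, 5, 3
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
y = X @ np.array([0.5, 1.0, -1.0]) + D @ rng.standard_normal(N) \
    + rng.standard_normal(N * T)

P_D = D @ np.linalg.inv(D.T @ D) @ D.T       # projection on the dummies
M_D = np.eye(N * T) - P_D                    # within (annihilator) projector

b_lsdv = np.linalg.solve(X.T @ M_D @ X, X.T @ M_D @ y)
a_lsdv = np.linalg.inv(D.T @ D) @ D.T @ (y - X @ b_lsdv)

# the same numbers from one long OLS regression on (X D)
coef, *_ = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)
```

The agreement is exact (up to floating-point error), not merely asymptotic.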
The individual effect estimates can equivalently be written as

a_LSDV,i = ȳ_i − x̄'_i b_LSDV,

where x̄'_i = (x̄_1i ... x̄_ki) is the (1 × k) vector of group means for individual i. It is also clear that for any (NT × 1) vector z

P_[D] z = D (D'D)^{-1} D' z = (z̄_1 1_T', ..., z̄_i 1_T', ..., z̄_N 1_T')',

the (NT × 1) vector stacking the group means z̄_1, ..., z̄_N, each repeated T times.
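The group-mean property of P_[D] is easy to see numerically (a minimal check of my own, not from the notes):

```python
import numpy as np

N, T = 4, 3
D = np.kron(np.eye(N), np.ones((T, 1)))
z = np.arange(N * T, dtype=float)

# P_[D] z replaces each observation with its group mean
P_D_z = D @ np.linalg.inv(D.T @ D) @ D.T @ z
group_means_repeated = np.repeat(z.reshape(N, T).mean(axis=1), T)
```

Correspondingly, M_[D] z = z − P_[D] z takes each observation in deviation from its own group mean.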
The conditional variance-covariance matrix of b_LSDV is Var(b_LSDV|X) = σ_ε² (X'M_[D]X)^{-1}. It is estimated by replacing σ_ε² with the Anova estimator s²_LSDV, based on the LSDV residuals e_LSDV = M_[D]y − M_[D]X b_LSDV:

(7.2.6)  s²_LSDV = e'_LSDV e_LSDV / (NT − N − k).
Exercise 66. Prove that E(s²_LSDV) = σ_ε². This is a long one, but when done you can tell yourself BRAVO! I just give you a few hints. First, on noting that y is determined by the right hand side of (7.2.3), prove that e_LSDV = M_[M_[D]X] M_[D] ε; then elaborate the conditional mean of ε'M_[D] M_[M_[D]X] M_[D] ε using the trace operator as we did for s²; finally apply the law of iterated expectations.
It is not hard to verify (do it) that b_LSDV can be obtained from the OLS regression of model (7.2.3) transformed in group-mean deviations (this transformation is referred to in the panel-data literature as the within transformation):

(7.2.7)  M_[D] y = M_[D] X β + M_[D] ε.
The intuition is simple: since the group mean of any time-constant element, such as α_i, coincides with the element itself, all time-constant elements in model (7.2.3) are wiped out; this also explains why X cannot contain time-constant variables. So, in a sense, the within transformation controls out the whole time-constant heterogeneity, latent or not, in model (7.2.3), making it look almost like a classical LRM. In particular, it can be proved easily that LRM.1-LRM.3 hold. Notice, however, that the errors in the transformed model, M_[D]ε, have a non-diagonal conditional covariance matrix (it is, indeed, block-diagonal and singular, can you derive it?). Specifically, the vector M_[D]ε presents within-group serial correlation, since for each individual group there are only T − 1 linearly independent demeaned errors. As a consequence, LRM.4 does not apply to model (7.2.7). All the same, b_LSDV is BLUE. This is true because the condition of Theorem 47 in Section 4.9 is met (if you have answered the previous question on the covariance matrix of M_[D]ε, you should be able to verify this claim too).
One should not conclude from the foregoing discussion that OLS on the within-transformed model (7.2.7) is a safe strategy. As in the Oaxacas pooled model of Section 5.2, the fact that the error covariance matrix is not spherical, presenting in this specific case within-group serial correlation, has bad consequences as far as standard error estimates are concerned. Indeed, should we leave the econometric software free to treat model (7.2.7) as a classical LRM, and so regress M_[D]y on M_[D]X, it would compute coefficient estimates just fine. Nonetheless, it would estimate Var(b_LSDV|X) by s² (X'M_[D]X)^{-1}, with s² = e'_LSDV e_LSDV / (NT − k) ≠ s²_LSDV, which is biased since it uses a wrong degrees-of-freedom correction. The econometric software is not aware that for each individual in the sample there are only T − 1 linearly independent demeaned errors and so, rather than dividing the residual sum of squares by N(T − 1) − k, it divides it by NT − k. The upshot is that standard errors estimated in this way need rectifying by multiplying each of them by the correction factor √((NT − k)/(NT − N − k)).
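The degrees-of-freedom correction can be illustrated with a short simulation (my own sketch, numpy in place of the notes' Stata):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, k = 40, 5, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
y = X @ np.array([1.0, 2.0]) + D @ rng.standard_normal(N) \
    + rng.standard_normal(N * T)

M_D = np.eye(N * T) - D @ np.linalg.inv(D.T @ D) @ D.T
Xw, yw = M_D @ X, M_D @ y                     # within-transformed data
b = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
e = yw - Xw @ b

s2_naive = e @ e / (N * T - k)                # what a classical-LRM routine uses
s2_lsdv = e @ e / (N * T - N - k)             # correct LSDV degrees of freedom
XtXinv = np.linalg.inv(Xw.T @ Xw)
se_naive = np.sqrt(s2_naive * np.diag(XtXinv))
se_lsdv = np.sqrt(s2_lsdv * np.diag(XtXinv))
factor = np.sqrt((N * T - k) / (N * T - N - k))
```

Multiplying the naive standard errors by the correction factor reproduces the correct LSDV standard errors exactly.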
An interesting assumption to test is that of the absence of individual heterogeneity, H_0: α_1 = α_2 = ... = α_N. Under the restriction implied by H_0, model (7.2.3) pools together all data with no attention to the individual clustering and can be written as

(7.2.8)  y = X* δ + ε,

where

X* = (1_NT X),  δ = (α_0, β')'.

Hence, under H_0, the pooled OLS (POLS) estimator

(7.2.9)  b_POLS = (X*'X*)^{-1} X*'y

is the BLUE, with residual vector

(7.2.10)  e_POLS = y − X* b_POLS.

Comparing the restricted (POLS) and unrestricted (LSDV) residual sums of squares yields the F statistic

(7.2.11)  F = [(e'_POLS e_POLS − e'_LSDV e_LSDV)/(N − 1)] / [e'_LSDV e_LSDV/(NT − N − k)].

If F does not reject H_0, POLS is a legitimate, more efficient than LSDV, estimation procedure. If F rejects H_0, then POLS is biased and LSDV should be adopted.
Exercise 67. On reparametrizing the LSDV model as in Exercise 61, the hypothesis of no individual heterogeneity becomes H_0: γ_1 = 0. Prove that the resulting F-test is numerically identical to F in Equation (7.2.11).

Solution. Easy. Since models (7.2.3) and (7.2.4) are indeed the same model, the resulting F-test is numerically identical to the F-test in Equation (7.2.11). This is demonstrated empirically in the paneldata.do Stata do-file.
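A sketch of the pooling test (my own numerical illustration; the design deliberately builds in strong heterogeneity so that the F statistic comparing pooled and LSDV residual sums of squares rejects):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, k = 30, 6, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
alpha = 2.0 * rng.standard_normal(N)          # strong heterogeneity: H0 is false
y = X @ np.array([1.0, -1.0]) + D @ alpha + rng.standard_normal(N * T)

# unrestricted (LSDV) residual sum of squares
Z = np.hstack([X, D])
e_u = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
# restricted (pooled OLS with a common constant) residual sum of squares
Xs = np.column_stack([np.ones(N * T), X])
e_r = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]

F = ((e_r @ e_r - e_u @ e_u) / (N - 1)) / (e_u @ e_u / (N * T - N - k))
```

With genuinely heterogenous effects, F is far above any conventional F(N − 1, NT − N − k) critical value.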
The random effect model has the same algebraic structure of model (7.2.1). At the observation level, i = 1, ..., N and t = 1, ..., T, we have

(7.3.1)  y_it = α_0 + x'_it β + u_i + ε_it,

or, at the level of the whole data-set,

(7.3.2)  y = X* δ + w,

where

X* = (1_NT X),  δ = (α_0, β')'  and  w = ε + Du.

The following is maintained.

RE.1: X* has f.c.r.
RE.2: E(ε|X*) = 0 and E(u|X*) = 0.
RE.3: Var(ε|X*) = σ_ε² I_NT, Var(u|X*) = σ_u² I_N and Cov(ε, u|X*) = 0.

Hence,

Ω ≡ Var(w|X*) = Var(ε|X*) + Var(Du|X*)
  = σ_ε² I_NT + σ_u² DD'.

This means that under RE.1-3 the covariance matrix of w, although homoskedastic, is non-diagonal and the POLS estimator in (7.2.9) is unbiased (verify this) but not BLUE (unless σ_u² = 0). The BLUE estimator for δ is therefore the GLS Random effect estimator

b_GLSRE = (X*'Ω^{-1}X*)^{-1} X*'Ω^{-1}y.
7.3. THE RANDOM EFFECT MODEL 126
Using DD' = T P_[D],

Ω = σ_ε² I_NT + T σ_u² P_[D]
  = σ_ε² M_[D] + σ_1² P_[D],

where σ_1² = σ_ε² + T σ_u². Therefore,

(7.3.3)  Ω^{-1} = (1/σ_ε²) M_[D] + (1/σ_1²) P_[D]

and

b_GLSRE = [X*'((1/σ_ε²)M_[D] + (1/σ_1²)P_[D])X*]^{-1} X*'((1/σ_ε²)M_[D] + (1/σ_1²)P_[D]) y.
Exercise 69. 1) Verify that (1/σ_ε²)M_[D] + (1/σ_1²)P_[D] is indeed the inverse of Ω, that is

[(1/σ_ε²)M_[D] + (1/σ_1²)P_[D]] [σ_ε² M_[D] + σ_1² P_[D]] = I_NT.

2) Verify that premultiplying all variables of model (7.3.2) by M_[D] + (σ_ε/σ_1)P_[D] transforms it into a classical regression model, so that b_GLSRE can be obtained at once by applying OLS to the transformed model. 3) Verify that the operator M_[D] + (σ_ε/σ_1)P_[D] can also be written as

(7.3.5)  M_[D] + (σ_ε/σ_1)P_[D] = I_NT − (1 − σ_ε/σ_1)P_[D].
The operator in (7.3.5), M_[D] + (σ_ε/σ_1)P_[D], transforms any conformable variable that pre-multiplies it into quasi-mean deviations, or partial deviations, in the sense that it only removes a portion of the group mean from the variable. For this reason, the coefficients on time-constant variables are identified in the RE model: time-constant variables, when premultiplied by M_[D] + (σ_ε/σ_1)P_[D], are not wiped out, but rescaled by a factor σ_ε/σ_1. The RE model under the GLS transformation is therefore

[M_[D] + (σ_ε/σ_1)P_[D]] y = [M_[D] + (σ_ε/σ_1)P_[D]] X* δ + [M_[D] + (σ_ε/σ_1)P_[D]] w.
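The algebra of Ω, its inverse and the quasi-demeaning operator can be verified mechanically (a small check of my own, not from the notes):

```python
import numpy as np

N, T = 3, 4
s2_eps, s2_u = 1.5, 0.7
s2_1 = s2_eps + T * s2_u                        # sigma_1^2

D = np.kron(np.eye(N), np.ones((T, 1)))
P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P

Omega = s2_eps * M + s2_1 * P                   # = s2_eps*I_NT + T*s2_u*P
Omega_inv = M / s2_eps + P / s2_1               # formula (7.3.3)
theta = np.sqrt(s2_eps / s2_1)
A = M + theta * P                               # the quasi-demeaning operator
```

The assertions below confirm the two representations of Ω, that (7.3.3) is indeed its inverse, and the identity (7.3.5).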
7.3.1. The Feasible GLS. The feasible version of b_GLSRE, say b_FGLSRE, the one that is actually implemented in econometric software, can be obtained through the method by Swamy and Arora (1972). The estimator for σ_ε² is simply s²_LSDV in (7.2.6) and that for σ_1² is obtained as follows.

Define the Between residual vector e_B as

(7.3.7)  e_B = P_[D] y − P_[D] X* b_B,

where b_B = (X*'P_[D]X*)^{-1} X*'P_[D] y is the Between estimator,¹ and define

s²_B = e'_B e_B / (N − k − 1).

Exercise. Prove that E(s²_B|X*) = σ_1². Same hint as for Exercise 66: first, on noting that y is determined by the right hand side of (7.3.2), prove that e_B = M_[P_[D]X*] P_[D] w; then elaborate the conditional mean of w'P_[D] M_[P_[D]X*] P_[D] w using the trace operator as we did for s²; finally apply the law of iterated expectations.

Solution. Replacing the formula of b_B into the right hand side of equation (7.3.7) gives

e_B = [I − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D]] P_[D] y
    = [I − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D]] (P_[D]X* δ + P_[D] w)
    = [I − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D]] P_[D] w
    = M_[P_[D]X*] P_[D] w.

Therefore,

e'_B e_B = w'P_[D] M_[P_[D]X*] M_[P_[D]X*] P_[D] w = w'M_[P_[D]X*] P_[D] w,

where the first equality follows from the idempotence of M_[P_[D]X*] and the second from the fact that P_[D] and M_[P_[D]X*] commute, and so

E(e'_B e_B|X*) = E[tr(M_[P_[D]X*] P_[D] ww')|X*]
               = tr(M_[P_[D]X*] P_[D] E(ww'|X*))
               = tr(M_[P_[D]X*] P_[D] Ω).

Since Ω = σ_ε² M_[D] + σ_1² P_[D], we get P_[D] Ω = σ_1² P_[D], given that P_[D] is idempotent and P_[D] M_[D] = 0_(NT×NT). Hence

E(e'_B e_B|X*) = σ_1² tr(M_[P_[D]X*] P_[D]).

Given

M_[P_[D]X*] P_[D] = P_[D] − P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D],

we have

tr(M_[P_[D]X*] P_[D]) = tr(P_[D]) − tr(P_[D]X* (X*'P_[D]X*)^{-1} X*'P_[D])
                      = tr(I_N) − tr(I_{k+1})
                      = N − k − 1.

¹Technical note: I maintain that no column of X is either time-constant or already in group-mean deviations, so that both b_LSDV and b_B are uniquely defined (in fact, with such an assumption X*'P_[D]X* and X'M_[D]X are both non-singular). Indeed, this is only made for simplicity, since it is possible to prove that s²_B and s²_LSDV are uniquely defined even if b_LSDV and b_B are not. The proof requires that all inverse matrices in the residual formulas are replaced with generalized inverse matrices. But don't worry, I won't pursue it further.
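The Swamy-Arora variance components can be recovered in a simulation (my own sketch; numpy in place of the notes' Stata, with the within and between regressions computed by brute force):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, k = 200, 5, 2
s2_eps, s2_u = 1.0, 0.5
X = rng.standard_normal((N * T, k))
Xs = np.column_stack([np.ones(N * T), X])
D = np.kron(np.eye(N), np.ones((T, 1)))
y = Xs @ np.array([0.5, 1.0, -1.0]) + D @ (np.sqrt(s2_u) * rng.standard_normal(N)) \
    + np.sqrt(s2_eps) * rng.standard_normal(N * T)

P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P

# within residuals -> estimate of sigma_eps^2
bw = np.linalg.lstsq(M @ X, M @ y, rcond=None)[0]
ew = M @ y - M @ X @ bw
s2_lsdv = ew @ ew / (N * T - N - k)

# between residuals -> estimate of sigma_1^2 = sigma_eps^2 + T*sigma_u^2
bb = np.linalg.lstsq(P @ Xs, P @ y, rcond=None)[0]
eb = P @ y - P @ Xs @ bb
s2_B = eb @ eb / (N - k - 1)
```

The two estimates can then be plugged into Ω^{-1} (or into the quasi-demeaning factor σ_ε/σ_1) to obtain b_FGLSRE.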
7.4. STATA IMPLEMENTATION OF STANDARD PANEL DATA ESTIMATORS 130
Exercise. Prove that in the RE model

E[e'_LSDV e_LSDV / (NT − N − k) | X*] = σ_ε²

(hint: follow the same steps as above, noticing that M_[D]w = M_[D]ε.)
Exercise 73. Prove that the formula for the subvector of b_GLSRE estimating the β vector is

(7.3.8)  b_GLSRE,β = [X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT]))X]^{-1} X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT])) y.

Hint: show that

[M_[D] + (σ_ε/σ_1)P_[D]] M_[1_NT] [M_[D] + (σ_ε/σ_1)P_[D]] = M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT])

and eventually

b_GLSRE,β = [X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT]))X]^{-1} X'(M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT])) y.
Both fixed effects and random effects estimators are implemented through the Stata com-
mand xtreg, with the usual Stata syntax for regression commands: the command is followed
by the name of the dependent variable and then the list of regressors. The noconstant option
is not admitted in this case.
As a preliminary step, however, a panel data declaration is needed to make Stata aware
of which variables in our data identify time and individuals. Suppose that in our data the
individual variable is named id and the time variable time, then the panel data declaration is
carried out by the instruction
xtset id time
The random effect estimator is the default of xtreg, while the fixed effects (LSDV) esti-
mator requires the option fe.
Sometimes, you may find it convenient to implement the FE and RE estimators by hand, using regress rather than xtreg. The greater computational effort may pay for the simple reason that regress, being the most popular estimation command in Stata, is updated more frequently to accommodate the most recent developments in statistics and econometrics, and so typically has more options than any other estimation command in Stata. To implement b_LSDV and a_LSDV at once you may just apply regress to the LSDV model (7.2.3). This requires generating a full set of individual dummies from the individual identifier id in your panel. This is done through the tabulate command with the generate() option, as follows

tabulate id, generate(id_)

where id_ is a name of choice. If N equals, say, 100, tabulate will add the full set of 100 individual dummies to your data, with names id_1, id_2, ..., id_100, and you can just treat them as regressors in a regress instruction to get b_LSDV as the coefficient estimates for the X variables and a_LSDV as the coefficient estimates for the id_1-id_100 variables. Degrees of freedom are correctly calculated as NT − N − k and so no correction of standard errors is needed. Notice that if you include all 100 dummies, then the constant term should be removed by the noconstant option. Alternatively, you can leave it there and include N − 1 dummies. While the b_LSDV estimates remain unchanged, the coefficient estimates on the included dummies do not. The latter must now be thought of as contrasts with respect to
7.5. TESTING FIXED EFFECTS AGAINST RANDOM EFFECTS MODELS 132
the constant estimate, which turns out to equal the individual effect estimate peculiar to
the individual excluded from the regression, who is therefore treated as the base individual.
Nothing is lost by choosing either identification strategy.
When N is large the foregoing regress strategy is not practical. The b_LSDV estimator can, then, be manually implemented by applying the within transformation, carrying out OLS on the transformed model and then correcting standard errors appropriately. Implementation of b_FGLSRE by hand is trickier and goes through the following steps: 1) get the two variance-components estimates from the within and between regressions; 2) transform the variables (including the constant) in partial deviations; and 3) apply OLS to the transformed variables. Details can be found in a Stata do-file available on the learning space.

I recommend always using the official xtreg command to implement the standard panel data estimators in empirical applications, unless it is strictly necessary to do otherwise (for example, if I explicitly ask you to!).
As Hausman (1978) and Mundlak (1978) independently found (in two papers that appeared in the same Econometrica issue!), the RE model is a special case of the FE model. In fact, while in the former model assumption RE.2 restricts the relationship between the random individual components, u, and X (E(u|X) = 0), the latter leaves it completely unrestricted. In consequence, the RE model is nested into the FE model, so that a test discriminating between them can be easily implemented with E(u|X) = 0 as the null hypothesis.
I present here two popular tests that, moving from the foregoing consideration, can provide
some guidance in the choice between RE and FE models.
7.5.1. The Hausman test. Under H_o: E(u|X) = 0, both the LSDV and FGLS-RE estimators are consistent for N → ∞, but the LSDV estimator is inefficient: redundant individual effects are included in the regression when they could have rather been regarded as random disturbances, saving on degrees of freedom. On the other hand, if H_o is not true, the LSDV estimator remains consistent, but FGLS does not, undergoing an omitted-variable bias. The basic idea of the Hausman test (Hausman, 1978), therefore, is that under H_o the difference between the two estimators should not be significantly different from zero in large samples.
Hausman proves that, under RE.1-RE.3, such difference can be measured by the statistic

H = (b_LSDV − b_FGLSRE)' [Âvar(b_LSDV − b_FGLSRE)]^{-1} (b_LSDV − b_FGLSRE).

Hausman also provides a useful computational result. He shows that, since b_FGLSRE is asymptotically efficient and b_LSDV is inefficient under the null,

Acov(b_LSDV, b_FGLSRE) = Avar(b_FGLSRE),

so

Acov(b_LSDV − b_FGLSRE, b_FGLSRE) = Acov(b_LSDV, b_FGLSRE) − Avar(b_FGLSRE) = 0

and

Avar(b_LSDV − b_FGLSRE) = Avar(b_LSDV) − Avar(b_FGLSRE).

Hence,

H = (b_LSDV − b_FGLSRE)' [Âvar(b_LSDV) − Âvar(b_FGLSRE)]^{-1} (b_LSDV − b_FGLSRE).
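A numerical sketch of the H statistic (my own illustration, not from the notes; for brevity the true variance components are treated as known, so the RE estimator is the infeasible GLS rather than FGLS):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T, k = 100, 5, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
u = 0.8 * rng.standard_normal(N)              # E(u|X) = 0: the RE null holds
y = X @ np.array([1.0, -0.5]) + D @ u + rng.standard_normal(N * T)

P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P
P1 = np.full((N * T, N * T), 1.0 / (N * T))   # projection on the unit vector

s2_eps, s2_1 = 1.0, 1.0 + T * 0.8 ** 2        # true variance components, taken as known
A = M + (s2_eps / s2_1) * (P - P1)            # operator from (7.3.8)

b_fe = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
b_re = np.linalg.solve(X.T @ A @ X, X.T @ A @ y)

V_fe = s2_eps * np.linalg.inv(X.T @ M @ X)
V_re = s2_eps * np.linalg.inv(X.T @ A @ X)
d = b_fe - b_re
H = d @ np.linalg.solve(V_fe - V_re, d)       # ~ chi2(k) under the null
```

Note that V_fe − V_re is positive definite here, so no generalized inverse is needed; with aggregate or time-constant regressors this can fail, as discussed below.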
Wooldridge (2010) (pp. 328-334) points out two difficulties with the Hausman test.
First, Avar(b_LSDV) − Avar(b_FGLSRE) is singular if X includes aggregate variables, such as time dummies. Therefore, along with the coefficients on time-constant variables, also those on aggregate variables must be excluded from the Hausman statistic.

Second, and more importantly, if RE.3 fails, then, on the one hand, the asymptotic distribution of H is not standard even if RE.2 holds, so that H would be of little guidance in detecting violations of RE.2, with an actual size that may be significantly different from the nominal size. On the other hand, H is designed to detect violations of RE.2 and not of RE.3. In fact, if RE.2 holds, both LSDV and FGLS-RE are consistent, regardless of RE.3, and H converges in distribution rather than diverging, which means that the probability of rejecting RE.3 when it is false does not tend to unity as N → ∞, making H inconsistent. The solution is thus to consider H as a test of RE.2 only, but in a version that is robust to violations of RE.3.
The approach I describe next is well suited to solve both difficulties at once.
Exercise 74. Given two random vectors w, z of the same dimension and two constants a and b, prove that

(1) Var(aw + bz) = a² Var(w) + b² Var(z) + ab [Cov(w, z) + Cov(w, z)']

and

(2) Cov(aw + bz, z) = a Cov(w, z) + b Var(z).

Solution. (1) By definition,

Var(aw + bz) = E[(aw + bz)(aw + bz)'] − E(aw + bz) E(aw + bz)'.

Expanding the two terms and collecting gives

a² [E(ww') − E(w)E(w)'] + b² [E(zz') − E(z)E(z)'] + ab [E(wz') − E(w)E(z)'] + ab [E(zw') − E(z)E(w)'],

which is the stated result. (2) By definition,

Cov(aw + bz, z) = E[(aw + bz)z'] − E(aw + bz)E(z)' = a[E(wz') − E(w)E(z)'] + b[E(zz') − E(z)E(z)'] = a Cov(w, z) + b Var(z).
We now verify that, under RE.1-RE.3, Cov(b_LSDV, b_GLSRE|X*) = Var(b_GLSRE|X*), so that

Var(b_LSDV − b_GLSRE|X*) = Var(b_LSDV|X*) − Var(b_GLSRE|X*),

where b_GLSRE denotes, as in (7.3.8), the subvector estimating β. First, using M_[D]D = 0,

b_LSDV − β = (X'M_[D]X)^{-1} X'M_[D] w = (X'M_[D]X)^{-1} X'M_[D] ε.

To save on notation, let A ≡ M_[D] + (σ_ε²/σ_1²)(P_[D] − P_[1_NT]), so that, by (7.3.8), b_GLSRE − β = (X'AX)^{-1} X'A w. Therefore,

Cov(b_LSDV, b_GLSRE|X*) = E[(X'M_[D]X)^{-1} X'M_[D] ε w' A X (X'AX)^{-1} |X*]
= (X'M_[D]X)^{-1} X'M_[D] E(εw'|X*) A X (X'AX)^{-1}
= σ_ε² (X'M_[D]X)^{-1} X'M_[D] A X (X'AX)^{-1}
= σ_ε² (X'M_[D]X)^{-1} X'M_[D] X (X'AX)^{-1}
= σ_ε² (X'AX)^{-1} = Var(b_GLSRE|X*),

where the third equality uses E(εw'|X*) = E(εε'|X*) + E(εu'|X*)D' = σ_ε² I_NT and the fourth uses M_[D]A = M_[D] (since M_[D]P_[D] = 0 and M_[D]P_[1_NT] = 0). Finally, by Exercise 74(1) with a = 1 and b = −1,

Var(b_LSDV − b_GLSRE|X*) = Var(b_LSDV|X*) + Var(b_GLSRE|X*) − Cov(b_LSDV, b_GLSRE|X*) − Cov(b_LSDV, b_GLSRE|X*)'.

Since RE.1-RE.3 hold, Cov(b_LSDV, b_GLSRE|X*) = Var(b_GLSRE|X*), which completes the proof.
7.5.2. The Mundlak test. Mundlak (1978) asks the following question. Is it possible to find an estimator that is more efficient than LSDV within a framework that allows correlation between the individual effects, taken as random variables, and X? To provide an answer, he starts from model (7.2.3) and supposes that the individual effects are linearly related to the group means of the regressors:

α = 1_N α_0 + (D'D)^{-1}D'X γ + u,

with E(α|X) = 1_N α_0 + (D'D)^{-1}D'X γ, and so E(u|X) = 0. Pre-multiplying both sides of the foregoing equation by D and then replacing the right-hand side of the resulting equation into (7.2.3) yields

(7.5.1)  y = 1_NT α_0 + Xβ + P_[D]X γ + Du + ε,

which is evidently a RE model extended to the inclusion of the P_[D]X regressors. Model (7.5.1) arises from (7.2.3) through a restriction and hence seems promising for more efficient estimates. But this is not the case. Mundlak proves, in fact, that FGLS-RE applied to equation (7.5.1) returns the LSDV estimator, b_LSDV, for the β coefficients, b_B − b_LSDV for the γ coefficients and b_0B for the constant term α_0, where b_0B and b_B are the components of the Between estimator presented in Section 7.3.1.
To summarize Mundlak's results:
The standard LSDV estimator for β in the FE model (equation (7.2.3)) is the FGLS-RE estimator for β in the general RE model (7.5.1).
The standard FGLS-RE estimator in the RE model (equation (7.3.2)) can be equivalently obtained as a constrained FGLS estimator applied to the general RE model (7.5.1) with constraints γ = 0.
Therefore, the validity of the RE model can be tested by applying a standard Wald test of joint significance for the null hypothesis that γ = 0 in the context of Mundlak's equation (7.5.1):

M = (b_LSDV − b_B)' [Âvar(b_LSDV − b_B)]^{-1} (b_LSDV − b_B).

Under H_0: γ = 0, M →_d χ²(k).
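Mundlak's regression-based logic can be illustrated with OLS (my own sketch, not from the notes; OLS rather than FGLS is used, which suffices to show the exact Frisch-Waugh coincidence of the slope on X with b_LSDV):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T, k = 80, 4, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
P = D @ np.linalg.inv(D.T @ D) @ D.T
M = np.eye(N * T) - P

# effects correlated with group means of X: the RE null (gamma = 0) is false
gamma = np.array([1.5, -1.0])
y = X @ np.array([1.0, 0.5]) + P @ X @ gamma \
    + D @ (0.3 * rng.standard_normal(N)) + rng.standard_normal(N * T)

b_lsdv = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)

# OLS on Mundlak's augmented equation: y on constant, X, and P_[D]X
W = np.column_stack([np.ones(N * T), X, P @ X])
coef, *_ = np.linalg.lstsq(W, y, rcond=None)
b_aug, g_aug = coef[1:k + 1], coef[k + 1:]
```

The coefficient on X in the augmented regression equals b_LSDV exactly, since partialling out the constant and P_[D]X from X leaves precisely M_[D]X; a joint significance test on g_aug then implements the Mundlak test.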
7.6. LARGE-SAMPLE RESULTS FOR THE LSDV ESTIMATOR 138
Hausman and Taylor (1981) prove that the statistics H and M are numerically identical (for a simple proof see also Baltagi (2008)). Wooldridge (2010), p. 334, nonetheless, recommends using the regression-based version of the test because it can be made fully robust to violations of RE.3 (for example, heteroskedasticity and/or arbitrary within-group serial correlation) using the standard robustness options available for the regression commands in most econometric packages. In addition, it is relatively easy to detect and solve singularity problems in the context of regression-based tests.
7.5.3. Stata implementation. The Stata implementation of most results in this section
is demonstrated through a Stata do file available on the course learning space.
7.6.1. Introduction. This section proves consistency and asymptotic normality of the
LSDV estimator, then describes the heteroskedasticity and within-group serial correlation
consistent covariance estimator and finally provides a remark for practitioners.
Notation is standard. X denotes the (NT × k) regressors matrix (of all time-varying regressors) and is partitioned by stacking individuals

(7.6.1)  X = (X_1', ..., X_i', ..., X_N')',

and M_[1_T] denotes the (T × T) demeaning matrix

M_[1_T] = I_T − 1_T 1_T' / T.
So,

(7.6.2)  M_[D] = diag(M_[1_T], ..., M_[1_T]),

an (NT × NT) block-diagonal matrix with N diagonal blocks equal to M_[1_T], and

b_LSDV = (X'M_[D]X)^{-1} X'M_[D]y.
SE: E(ε|X) = 0.
The following random sampling assumption is invoked for the asymptotic normality of
bLSDV and the consistency of the bLSDV asymptotic covariance estimator:
RS: There is a sample of size n = N T , such that the elements of the sequence
{(yi Xi ) , i = 1, ..., N } are independent (NB not necessarily identically distributed)
random matrices.
Exercise 77. (This has been done in class) Prove that under LSDV.1 and LSDV.2, plim_{N→∞} b_LSDV = β.
7.6.4. Asymptotic normality. Assumptions LSDV.1 and LSDV.2 hold along with RS and the following: Ω ≡ E(εε'|X) = diag(Ω_1, ..., Ω_N) is a block-diagonal (NT × NT) positive definite matrix. Notice that the blocks of Ω are arbitrary and heterogenous, so that both arbitrary correlation across the time observations of the same individual (referred to as within-group serial correlation) and heteroskedasticity across individuals and over time are permitted. What is not permitted by the block-diagonal structure is correlation of the ε realizations across different individuals.

Now focus on the generic individual i = 1, ..., N and notice that, given the block-diagonal form of M_[D] as in (7.6.2),
M_[D]X = diag(M_[1_T], ..., M_[1_T]) (X_1', ..., X_i', ..., X_N')' = ((M_[1_T]X_1)', ..., (M_[1_T]X_i)', ..., (M_[1_T]X_N)')'.
The proof of asymptotic normality for b_LSDV parallels that in 6.2.2, with the only difference that the random objects from which we start are not (k × 1) vectors at the observation level but (k × 1) vectors at the individual level, X_i'M_[1_T]ε_i, i = 1, ..., N.
Note that E(X_i'M_[1_T]ε_i) = 0 given SE, so that

Var(X_i'M_[1_T]ε_i) = E(X_i'M_[1_T] ε_i ε_i' M_[1_T] X_i).

Therefore,

lim_{N→∞} (1/N) Σ_{i=1}^N Var(X_i'M_[1_T]ε_i) = lim_{N→∞} E(X'M_[D] Ω M_[D] X / N) ≡ Q*,

which is a finite matrix by assumption LSDV.1, so that the Lindeberg-Feller theorem applies to yield

√N (1/N) Σ_{i=1}^N X_i'M_[1_T]ε_i = X'M_[D]ε / √N →_d N(0, Q*).
Finally, since

√N (b_LSDV − β) = (X'M_[D]X / N)^{-1} (X'M_[D]ε / √N) →_d Q^{-1} N(0, Q*),

we have

√N (b_LSDV − β) →_d N(0, Q^{-1} Q* Q^{-1})

and

(7.6.3)  Avar(b_LSDV) = (1/N) Q^{-1} Q* Q^{-1}.
7.7. A ROBUST COVARIANCE ESTIMATOR 142
Letting Ω̂ denote the block-diagonal matrix with diagonal blocks e_LSDV,i e'_LSDV,i, i = 1, ..., N, where e_LSDV,i is the (T × 1) vector of LSDV residuals for individual i, a consistent estimator for the asymptotic covariance matrix of b_LSDV in equation (7.6.3) is given by the White estimator:

(7.7.1)  Âvar(b_LSDV) = (X'M_[D]X)^{-1} X'M_[D] Ω̂ M_[D]X (X'M_[D]X)^{-1}.
Remark 78. The estimator in (7.7.1) is robust to arbitrary heteroskedasticity and within-group serial correlation. Stock and Watson (2008) prove that in the LSDV model the White estimator correcting for heteroskedasticity only, where Ω̂ is a diagonal matrix with generic element e²_LSDV,it (see the first formula of section 9.6.1 in Greene (2008)), is inconsistent for N → ∞. The crux of Stock and Watson's argument is essentially algebraic, in that demeaned residuals are correlated over time by construction and this correlation does not vanish for N → ∞. The recommendation for practitioners is then to correct for both heteroskedasticity and within-group serial correlation using the estimator (7.7.1), which is not affected by the Stock and Watson critique.
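A sketch of the cluster-robust estimator (7.7.1) (my own numpy illustration; the Hadamard form with the indicator matrix DD' anticipates the notation of Chapter 8 and is checked against the explicit cluster-by-cluster sum):

```python
import numpy as np

rng = np.random.default_rng(8)
N, T, k = 60, 5, 2
X = rng.standard_normal((N * T, k))
D = np.kron(np.eye(N), np.ones((T, 1)))
y = X @ np.array([1.0, 1.0]) + D @ rng.standard_normal(N) \
    + rng.standard_normal(N * T)

M = np.eye(N * T) - D @ np.linalg.inv(D.T @ D) @ D.T
Xw = M @ X
b = np.linalg.solve(Xw.T @ Xw, Xw.T @ M @ y)
e = M @ y - Xw @ b                            # LSDV residuals

# "meat": X'M_[D] Omega_hat M_[D] X with blocks e_i e_i'
S = D @ D.T                                   # entry 1 iff same individual
meat = Xw.T @ (np.outer(e, e) * S) @ Xw
bread = np.linalg.inv(Xw.T @ Xw)
V_cluster = bread @ meat @ bread

# equivalent cluster-by-cluster accumulation
meat_loop = np.zeros((k, k))
for i in range(N):
    idx = slice(i * T, (i + 1) * T)
    s_i = Xw[idx].T @ e[idx]
    meat_loop += np.outer(s_i, s_i)
```

The Hadamard-product form and the loop deliver the same matrix, which is symmetric positive semi-definite by construction.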
Remark 79. In Stata the robust covariance matrix of LSDV is computed easily by using
the xtreg command with the options fe and vce(cluster id), where id is the name of the
individual categorical variable in your Stata data set.
A similar correction can be carried out for POLS and FGLS-RE. For POLS we have

Âvar(b_POLS) = (X*'X*)^{-1} X*'Ω̂X* (X*'X*)^{-1},

where

Ω̂ = e_POLS e'_POLS ⊙ DD',

with ⊙ the element-by-element (Hadamard) matrix product and the POLS residual vector defined as in equation (7.2.10), whereas for FGLS-RE we have

Âvar(b_FGLSRE) = (X*'Ω̂_RE^{-1}X*)^{-1} X*'Ω̂_RE^{-1} Λ̂ Ω̂_RE^{-1} X* (X*'Ω̂_RE^{-1}X*)^{-1},

where Ω̂_RE is the estimated RE covariance matrix of Section 7.3.1,

Λ̂ = e_FGLSRE e'_FGLSRE ⊙ DD',

and

e_FGLSRE = y − X* b_FGLSRE.
Remark 80. In Stata the robust asymptotic covariance matrices of POLS and FGLS-RE are estimated by using, respectively, the regress and the xtreg, re commands, both with the option vce(cluster id), as in the LSDV case.
All of the methods so far have been described with a balanced panel data set in mind, but
nothing prevents applying the same methods to unbalanced panels (different numbers of time
observations across individuals).
Unbalanced panels only require a slight change in notation. As always we index individuals
by i = 1, ..., N , but now the size of each individual cluster, or group, of observations varies
across individuals and so the time index is t = 1, ..., Ti . This implies the following three facts.
(1) As in balanced panels, each observation in the data is uniquely identified by the two indexes: the pair (i, t) identifies the t-th observation of the i-th individual.
(2) Differently from balanced panels, the group size, T_i, is no longer constant across clusters.
(3) Differently from balanced panels, the sample size is n = Σ_{i=1}^N T_i.

The LSDV estimator is implemented without any problem either creating individual dummies or taking variables in group-mean deviations, where group means are at the individual level. The random effect estimator requires only some algebraic modifications in the formulas allowing for unbalancedness. The Arellano estimator also requires simple modifications in notation to accommodate unbalancedness: there is now a (T_i × 1) LSDV residual vector given by

e_LSDV,i = M_[1_{T_i}] (y_i − X_i b_LSDV).
8.1. Introduction
The panel-data sets considered in these notes, with a large individual dimension and a small time dimension, are an example of one-way clustering. If the data-set is balanced, there are n = NT observations clustered into N individual groups, each comprising T observations. If the data-set is unbalanced, as is often the case with real-world panels, there are n = Σ_{i=1}^N T_i observations clustered into N individual groups of varying sizes.
8.2. TWO-WAY CLUSTERING 146
Clustering can occur along more than one dimension. In a student survey, for example, there could be an additional level of clustering given by teachers, or classes, within schools. Similarly, patients can be clustered along the two dimensions, not necessarily nested, of doctors and hospitals. In a cross-sectional data-set of bilateral trade flows, the cross-sectional units are the pairs of countries and these are naturally clustered along two dimensions: the first and the second country in the pair (Cameron et al., 2011). In matched employer-employee data there is the worker dimension, the firm dimension and the time dimension (Abowd et al., 1999).
Is it possible to do inference that is robust to multi-way clustering as we do inference that is robust to one-way clustering? A recent paper by Cameron et al. (2011) offers a computationally simple solution extending the White estimator to multi-way contexts. In essence, their method boils down to computing a number of one-way robust covariance estimators, which are then combined linearly to yield the multi-way robust covariance estimator. It is, therefore, crucial for the accuracy of the multi-way estimator that the one-way estimators be also accurate, and so that the data-set have dimensions with a large number of clusters. Such an asymptotic requirement makes the analysis in Cameron et al. (2011) not well suited for dealing with both individual- and time-clustering in the typical micro-econometric panel data set, where T is fixed. Indeed, their Monte Carlo experiments show that the robust covariance estimator has good finite-sample properties in data-sets with dimensions of 100 clusters.

To illustrate the method I focus on two-way clustering, using a notation that is close to that in Cameron et al. (2011).
Notation is general enough to embrace cases in which cluster affiliations are not sufficient to uniquely identify an observation. There is a data-set with n observations indexed by i ∈ {1, ..., n}. Observations are clustered into two dimensions, g ∈ {1, ..., G} and h ∈ {1, ..., H}. Asymptotics is for both G → ∞ and H → ∞. The data-sets that I have in mind are, for example,

surveys of students with, at least, moderately large numbers of teachers and schools;
surveys of patients with, at least, moderately large numbers of doctors and hospitals;
bilateral trade-flows data with, at least, a moderately large number of countries;
matched employer-employee data with, at least, moderately large numbers of firms and workers.

For each dimension, it is known to which cluster a given observation i = 1, ..., n belongs. This information is contained in the mappings g: {1, ..., n} → {1, ..., G} and h: {1, ..., n} → {1, ..., H}. From the mappings g and h we can also construct the (n × G) dummy variables matrix D_G and the (n × H) dummy variables matrix D_H, as the following definitions indicate:

d_ig = 1 if g(i) = g and d_ig = 0 otherwise, i ∈ {1, ..., n}, g ∈ {1, ..., G};
d_ih = 1 if h(i) = h and d_ih = 0 otherwise, i ∈ {1, ..., n}, h ∈ {1, ..., H}.

Then, D_G and D_H are the (n × G) and (n × H) matrices with (i, g) element d_ig and (i, h) element d_ih, respectively.
Given g and h, we can define an intersection dimension, say GH, such that each cluster in
GH contains only observations that belong to one unique cluster in {1, ..., G} and one unique
cluster in {1, ..., H}. This yields the matrix of dummy variables D_GH. By construction, each
row of D_GH has a single unit entry. For example, if n = 6 and the intersection dimension
comprises four clusters, with observations 1 and 2 in the first cluster, observation 3 in the
second, observations 4 and 5 in the third and observation 6 in the fourth, then

         1 0 0 0
         1 0 0 0
         0 1 0 0
D_GH =   0 0 1 0
         0 0 1 0
         0 0 0 1
This framework allows that in a survey of patients, for example, there could be more than
one patient admitted to the same hospital and under the assistance of the same doctor. Or,
similarly, that in a panel data matching workers with firms the same worker can move across
firms over time or that, conversely, the same firm may employ different workers over time.
Then, define three n × n indicator matrices: S_G = D_G D_G', S_H = D_H D_H' and S_GH = D_GH D_GH'.
It is easy to verify that:
- S_G has ij-th entry equal to one if observations i and j share a cluster g in {1, ..., G}; zero otherwise.
- S_H has ij-th entry equal to one if observations i and j share a cluster h in {1, ..., H}; zero otherwise.
- S_GH has ij-th entry equal to one if observations i and j share both a cluster g in {1, ..., G} and a cluster h in {1, ..., H}; zero otherwise.
Also, the iith entries in S G , S H and S GH equal one for all i = 1, ..., n, so the three indicator
matrices have main diagonals with all unity elements.
Consider now a linear regression model allowing for two-way clustering

y_i = x_i'β + ε_i, i = 1, ..., n,

with conditional error covariance matrix

(8.2.1)   E(εε'|X) = E(εε' ⊙ S_G | X) + E(εε' ⊙ S_H | X) - E(εε' ⊙ S_GH | X),

where the symbol ⊙ stands for the element-by-element matrix product (also known as the Hadamard
product) between matrices of equal dimension (verify the equivalence of LRM.4b and (8.2.1)).
As we know, OLS, in this case, is consistent and unbiased but not efficient. More importantly,
OLS standard errors are biased, and so we need a two-way robust covariance estimator
for inference. The covariance estimator devised by Cameron et al. (2011) is the combination
of three one-way covariance estimators à la White. It is constructed along the following steps.
Carry out OLS and obtain the OLS residuals e. The first one-way covariance estimator is

Avar^G(b) = (X'X)^{-1} X' Ω^G X (X'X)^{-1}, with Ω^G = ee' ⊙ S_G,

where Avar^G(b) is a White estimator that is robust to clustering only along the G dimension.
The second one-way covariance estimator is

Avar^H(b) = (X'X)^{-1} X' Ω^H X (X'X)^{-1}, with Ω^H = ee' ⊙ S_H,

where Avar^H(b) is a White estimator that is robust to clustering only along the H dimension.
The third one-way covariance estimator is

Avar^GH(b) = (X'X)^{-1} X' Ω^GH X (X'X)^{-1},
8.3. STATA IMPLEMENTATION 151
with Ω^GH = ee' ⊙ S_GH, where Avar^GH(b) is a White estimator that is robust to clustering only
along the GH dimension.
Finally, the two-way robust covariance estimator is

(8.2.2)   Avar(b) = Avar^G(b) + Avar^H(b) - Avar^GH(b).

Avar(b) is robust to clustering along both the G and H dimensions and is the estimator that
is used to construct our robust tests.
Remark 82. Writing Avar(b) as

Avar(b) = (X'X)^{-1} X' (Ω^G + Ω^H - Ω^GH) X (X'X)^{-1}

and then considering equation (8.2.1) uncovers the analogy principle on which the two-way
robust covariance estimator rests.
Remark 83. Cameron et al. (2011) also present a general multi-way version of Avar(b),
which is derived from a simple extension of the foregoing analysis. The additional cost is only
in terms of a more cumbersome notation. For the formulas I refer you to that paper.
\
While there is no official command for the two-way Avar (b) in Stata, it can be simply
implemented by means of three one-way OLS regressions. Suppose that in our data-set the
two categorical variables for dimensions G and H are called doctor and hospital. You can
\
assembleAvar (b) along the following steps.
(1) Create the categorical variable for the intersection dimension, G ∩ H, through the
following instruction: egen doc_hosp = group(doctor hospital)
(2) Implement the first regress instruction with the option vce(cluster doctor) and
then save the covariance matrix estimate through the command: matrix V_d=e(V)
(V_d is a name of choice).
(3) Implement the second regress instruction with the option vce(cluster hospital)
and then save the covariance matrix estimate with: matrix V_h=e(V) (V_h is a
name of choice).
(4) Implement the last regress instruction with the option vce(cluster doc_hosp) and
then save the covariance matrix estimate with: matrix V_dh=e(V) (V_dh is a name
of choice).1
(5) Finally, work out the two-way robust covariance estimator by executing: matrix
V_robust=V_d+V_h-V_dh (V_robust is a name of choice). To see the content of
V_robust do: matrix list V_robust. The robust standard errors are just the
square roots of the main diagonal elements in V_robust.
1It may happen that clusters in the intersection dimension are all singletons (i.e. each cluster has only one
observation). In this case Stata will refuse to work with the option vce(cluster doc_hosp). This is no
problem, though, since correcting standard errors when clusters are singletons is clearly equivalent to correcting
for heteroskedasticity. Therefore, instead of vce(cluster doc_hosp), simply write vce(robust).
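The algebra of (8.2.2) and the five steps above can also be checked against a direct matrix computation. The following is a minimal NumPy sketch on invented data (the cluster counts, the design and the helper name one_way are purely illustrative): it forms each one-way estimator as (X'X)^{-1} X'(ee' ⊙ S)X (X'X)^{-1} and combines the three exactly as matrix V_robust=V_d+V_h-V_dh does in Stata.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
G, H = 10, 8                       # cluster counts along the two dimensions
g = rng.integers(0, G, n)          # cluster id along dimension G (e.g. doctor)
h = rng.integers(0, H, n)          # cluster id along dimension H (e.g. hospital)
gh = g * H + h                     # intersection dimension G ∩ H
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)      # OLS coefficients
e = y - X @ b                              # OLS residuals

def one_way(ids):
    """White estimator robust to clustering along one dimension:
    (X'X)^{-1} X' (ee' ⊙ S) X (X'X)^{-1}, with S the cluster indicator matrix."""
    S = (ids[:, None] == ids[None, :]).astype(float)
    Omega = np.outer(e, e) * S             # ee' ⊙ S (Hadamard product)
    XX_inv = np.linalg.inv(X.T @ X)
    return XX_inv @ X.T @ Omega @ X @ XX_inv

# Two-way robust covariance estimator, equation (8.2.2)
V_twoway = one_way(g) + one_way(h) - one_way(gh)
se = np.sqrt(np.diag(V_twoway))            # two-way robust standard errors
```

Canned routines avoid the n × n indicator matrices by summing within-cluster score outer products, which gives the same X'(ee' ⊙ S)X term cluster by cluster.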
CHAPTER 9
9.1. Introduction
two variables is likely to be unrelated with the individual education levels. Such information,
so close to the latent variable, is often unavailable, though.
If the latent variables are invariant across individuals and/or over time and a panel data
set is available, the endogeneity problem is solved by applying the panel-data methods introduced
in Chapter 7. But panel data are not always available and, even when they are, the disturbing
omitted factors may not meet the time-constancy requirement. For example, idiosyncratic
productivity shocks may well be related to input factors in the estimation of a production
function.
Neither proxy variables nor panel-data methods are generally usable when endogeneity
springs from reverse causality. In the strip, Wally questions the exogeneity of the exercise
variable as a determinant of individual health, hinting at an endogeneity bias due to reverse
causality. If the exercise activity is indeed affected by the health status, exercise would depend
on the observable and unobservable determinants of health, and so cannot be exogenous.
Instrumental variables (IV) and Generalized Method of Moments (GMM) estimators offer
a general solution to the endogeneity problem. Roughly speaking, they solve the endogeneity
problem in two stages. The first stage attempts to identify the exogenous-variation components
of x through a set of exogenous variables, some of which are external to the model,
called instrumental variables. The second stage applies regression analysis using only the first-stage
exogenous components as explanatory variables. IV and GMM methods are preferred
tools of econometric analysis, compared to alternative techniques, since often the first stage
can be justified on the grounds of economic theory.
There are various IV-GMM applications showing the methods of this chapter: IV_GMM.do
using mus06data.dta, IV_GMM_panel.do using costfn.dta, and IV_GMM_DPD.do and abest.do,
both using abdata.dta. There is also a Monte Carlo application implemented by bias_in_AR1_LSDV.do.
9.2. THE METHOD OF MOMENTS 155
9.2.1. The linear regression model. Consider the linear model of Chapter 1 and the
system of moment conditions (1.2.3)

E(xy) = E(xx')β.

So, the true coefficient vector, β, solves the population moment conditions and is equal to
β = [E(xx')]^{-1} E(xy). By the analogy principle a consistent estimator for β, b, will satisfy
the system of k analog sample moment conditions:

(1/n) Σ_{i=1}^n x_i (y_i - x_i'b) = 0.

Hence,

b = (Σ_{i=1}^n x_i x_i')^{-1} Σ_{i=1}^n x_i y_i = (X'X)^{-1} X'y,

which is exactly the OLS estimator.
9.2.2. The Instrumental Variable (IV) regression model in the just-identified
case. Consider the linear model of Chapter 1 but without assumption P.3, E(ε|x) = 0, or
even the weaker P.3b, E(xε) = 0. This means that some of the variables in x are potentially
endogenous, that is, related in some way to ε. Assume, instead, conditional mean independence
for an L × 1 vector of variables z, that is E(ε|z) = 0, with L = k. The vector z is generally
different from x; if it is not, then we are back to the classical regression model and there is no
endogeneity problem. Replace assumption P.2 with det E(zx') ≠ 0. Then, as before, using the LIE,

E(zε) = E_z[E(zε|z)] = E_z[z E(ε|z)] = 0.

It follows that

E[z(y - x'β)] = 0,

or equivalently

E(zy) = E(zx')β.
So, the true coefficient vector, β, solves the population moment conditions and is equal to
β = [E(zx')]^{-1} E(zy). By the analogy principle a consistent estimator for β, b, will satisfy
the system of k analog sample moment conditions:

(1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = 0.

Hence,

b = (Σ_{i=1}^n z_i x_i')^{-1} Σ_{i=1}^n z_i y_i = (Z'X)^{-1} Z'y,

which is the classical IV estimator.
The intuition is straightforward: since the true coefficients solve the population moment
conditions, if the sample moments provide good estimates for the population moments, then
one might expect that the estimator solving the sample moment conditions will provide good
estimates of the true coefficients.
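This intuition can be checked in a small simulation (the data-generating process below is invented for illustration): OLS is biased because x is correlated with the error through a common shock, while the sample-moment solution b = (Z'X)^{-1} Z'y recovers the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
z = rng.normal(size=n)                  # instrument: independent of the error
u = rng.normal(size=n)                  # common shock driving endogeneity
x = 0.8 * z + u + rng.normal(size=n)    # regressor correlated with the error via u
eps = u + rng.normal(size=n)            # error: correlated with x, not with z
beta = 2.0
y = beta * x + eps

Z = np.column_stack([np.ones(n), z])    # the constant instruments itself
X = np.column_stack([np.ones(n), x])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # inconsistent: ignores endogeneity
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # classical IV: (Z'X)^{-1} Z'y
```

With this design the OLS slope converges to beta plus cov(x, ε)/var(x) ≈ 0.38, while the IV slope converges to beta.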
What if there are more moment conditions than unknown parameters, that is if L > k?
Then we turn to GMM estimation.
9.2.3. The GMM estimator. Collect the L population moments into the vector

m(θ) ≡ E[f(w, θ)],

with the true parameter vector θ solving

m(θ) = 0.

The sample analog is

m̂(θ̂) ≡ (1/n) Σ_{i=1}^n f(w_i, θ̂);

hence there are L equations and k unknowns, so that for L > k no estimator θ̂ can solve the system of
sample moment conditions exactly. Instead, there exists a θ̂ that can make m̂(θ̂) as close to zero as
possible:

(9.2.1)   θ̂_GMM = arg min_{θ̂} Q̂(θ̂),

where Q̂(θ̂) ≡ m̂(θ̂)' A m̂(θ̂) is a quadratic criterion function of the sample moments and
A is a positive definite matrix weighting the squares and the cross-products of the sample
moments in Q̂(θ̂).
Note that Q̂(θ̂) ≥ 0 and, since A is positive definite, Q̂(θ̂) = 0 only if m̂(θ̂) = 0.
Thus, in the just-identified linear model, to minimize Q̂(θ̂) is to make it exactly zero, and the
GMM estimator coincides with the IV estimator of Subsection 9.2.2.
9.2.4. The optimal weighting matrix. The weighting matrix A can be chosen so that
the GMM estimator is asymptotically efficient. Hansen (1982) proves that the weighting
matrix in the optimal two-step GMM estimator is

(9.2.2)   A = (Z' Ω̂ Z / n)^{-1},

a consistent estimator of the inverse of the asymptotic covariance matrix of the sample moments,
Var(n^{-1/2} Z'ε). The matrix Ω̂ in Equation (9.2.2) depends on the properties of ε.
It may be a known matrix, in which case the resulting GMM estimator is said to be a one-step
estimator. More often Ω is unknown and Ω̂ is generally computed through the residuals from
a consistent one-step GMM estimator. In this case the resulting GMM estimator is said to be a
two-step estimator. For more on the choice of the weighting matrix see Section 9.4.
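The one-step/two-step recipe can be sketched in NumPy as follows. The design (instrument strength, heteroskedasticity pattern, the helper name gmm_linear) is invented for illustration; with the homoskedastic choice A = (Z'Z/n)^{-1} the first step reproduces TSLS, and the second step re-weights with Ω̂ = diag(e²) built from the one-step residuals.

```python
import numpy as np

def gmm_linear(y, X, Z, A):
    """Linear GMM for a given weighting matrix A: b = (X'Z A Z'X)^{-1} X'Z A Z'y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ A @ XZ.T, XZ @ A @ (Z.T @ y))

rng = np.random.default_rng(1)
n = 5_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z1 + 0.5 * z2 + u                              # endogenous regressor
eps = (u + rng.normal(size=n)) * (1 + 0.5 * np.abs(z1))  # heteroskedastic error
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])                # L = 3 > k = 2: overidentified

# One step: homoskedastic choice, A = (Z'Z/n)^{-1}; this is TSLS
A1 = np.linalg.inv(Z.T @ Z / n)
b1 = gmm_linear(y, X, Z, A1)

# Two step: Omega-hat = diag(e^2) from one-step residuals, A = (Z' Omega-hat Z / n)^{-1}
e1 = y - X @ b1
A2 = np.linalg.inv((Z * (e1**2)[:, None]).T @ Z / n)     # Z' diag(e^2) Z / n
b2 = gmm_linear(y, X, Z, A2)
```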
9.2.5. The TSLS estimator. The well-known Two-Stage Least Squares (TSLS) estimator
belongs to the class of one-step GMM estimators. Consider the linear regression
model of Section 9.2.2 with L ≥ k instruments. Then, there are the following population
moments

m(β) ≡ E[z(y - x'β)],

with

E[z(y - x'β)] = 0.

Consider a p.d. weighting matrix A, which we leave generic for the moment. Then, since
m(β)'Am(β) = 0 and m(β_o)'Am(β_o) ≥ 0 for any β_o,

β = argmin_{β_o} m(β_o)' A m(β_o)

or

β = argmin_{β_o} [E(yz') A E(zy) - 2 E(yz') A E(zx') β_o + β_o' E(xz') A E(zx') β_o]
  = [E(xz') A E(zx')]^{-1} E(xz') A E(zy).
The L sample moments are collected into the (L × 1) vector m̂(b):

m̂(b) ≡ (1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = (1/n) Z'(y - Xb).

TSLS sets

A = [(1/n) Σ_{i=1}^n z_i z_i']^{-1} = n (Z'Z)^{-1}.

Then

Q̂(b) = (1/n) (y - Xb)' Z (Z'Z)^{-1} Z' (y - Xb),

with the following normal equations for the minimization problem:

∂Q̂(b)/∂b = -(2/n) X'Z (Z'Z)^{-1} Z'(y - Xb) = 0,

whence

b_TSLS = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y,

or more compactly

b_TSLS = (X' P_[Z] X)^{-1} X' P_[Z] y.
9.3. STATA IMPLEMENTATION OF THE TSLS ESTIMATOR 160
The estimator's name derives from the fact that it is computed in two stages:
(1) Regress each column of X on Z using OLS to obtain the OLS fitted values of X:
Z(Z'Z)^{-1}Z'X = P_[Z]X. Thus, X = P_[Z]X + M_[Z]X, where P_[Z]X is an approximately
exogenous component, whose covariance with ε goes to zero as n → ∞, and
M_[Z]X is a residual, potentially endogenous, component. Only P_[Z]X is used in the
second stage.
(2) Regress y on the fitted values, P_[Z]X, to obtain b_TSLS.
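The two stages can be verified numerically to reproduce the closed-form b_TSLS; the data-generating process below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # L = 3 instruments
u = rng.normal(size=n)
x = Z @ np.array([0.2, 0.7, -0.4]) + u        # endogenous regressor (u also in the error)
X = np.column_stack([np.ones(n), x])          # k = 2 < L = 3: overidentified
y = X @ np.array([1.0, 2.0]) + u + rng.normal(size=n)

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)        # projection matrix P_[Z]

# Closed form: b_TSLS = (X' P_[Z] X)^{-1} X' P_[Z] y
b_formula = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)

# Stage 1: OLS fitted values X_hat = P_[Z] X; Stage 2: OLS of y on X_hat
X_hat = PZ @ X
b_twostage = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```

The two routes agree exactly because P_[Z] is symmetric and idempotent, so X_hat'X_hat = X'P_[Z]X and X_hat'y = X'P_[Z]y.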
If the population moment conditions are true, then Q̂(b_TSLS) should not be significantly
different from zero. This provides a test for the validity of the L - k over-identifying moment
conditions based on the following statistic (Hansen-Sargan test):

S ≡ n Q̂(b_TSLS) ~ χ²(L - k).
Solution: with L = k, Z'X is invertible, so

b_TSLS = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y
       = (Z'X)^{-1} (Z'Z) (X'Z)^{-1} X'Z (Z'Z)^{-1} Z'y
       = (Z'X)^{-1} Z'y.
The notation so far was general enough to accommodate models with and without endogenous
variables. Let's adopt a less generic notation to suit real-world empirical problems, where we
know that there are potentially endogenous variables and which they are.
As before, z contains all available exogenous variables. The list of explanatory variables x is
partitioned into exogenous, x_1, and potentially endogenous, x_2, with k_1 and k_2 indicating
their respective numbers.
First-stage regressions
Instrumented: hi_empunion
Instruments: totchr age female blhisp linc ssiratio
The two-step GMM estimator is implemented by ivregress gmm followed by the names of
the dependent variable, the included exogenous variables and, within parentheses, all the right-hand-side
endogenous variables and the external instruments, as follows
9.4.1. Choosing the weighting matrix. As discussed in Subsection 9.2.4, the weighting
matrix in the optimal two-step GMM estimator is A = (Z' Ω̂ Z / n)^{-1}, a consistent estimator
of the inverse of Var(n^{-1/2} Z'ε), with choices of Ω̂ depending on the properties of ε, as
discussed next.
If ε is homoskedastic and independent, Ω̂ = I, A = n(Z'Z)^{-1} and the resulting GMM estimator
is TSLS. It's implemented through the ivregress gmm option: wmatrix(unadjusted).
If ε is heteroskedastic and independent, Ω̂ is diagonal:

Ω̂ = diag(e_1², e_2², ..., e_n²),
9.4. STATA IMPLEMENTATION OF THE TWO-STEP GMM ESTIMATOR 163
with e_i the residuals from a consistent one-step estimator, i = 1, ..., n. It's implemented through the
ivregress gmm option, wmatrix(robust), which is the default option.
If errors are clustered, then Ω̂ is a block-diagonal matrix with generic block equal to the
outer product of the residuals peculiar to the corresponding cluster:

Ω̂ = diag(Ω̂_1, Ω̂_2, ..., Ω̂_N),

with Ω̂_i = e_i e_i' and e_i = y_i - X_i b_TSLS the vector of TSLS residuals (or those of any consistent
one-step estimator) peculiar to cluster i = 1, ..., N. It's implemented through the
ivregress gmm option: wmatrix(cluster cluster_var). This option is recommended when
ivregress gmm is applied to panel data: with cluster_var equal to the panel identifier in the
data, it calculates a GMM estimator that is optimal under heteroskedasticity and arbitrary
correlation within individual clusters.
In time-series data, when ε is heteroskedastic and serially correlated, the optimal weighting
matrix A may be assembled by using the Newey-West heteroskedasticity- and autocorrelation-consistent
(HAC) estimator. This is implemented by specifying wmatrix(hac kernel #),
which requests a weighting matrix using the specified kernel (see below) with # lags. The
bandwidth of a kernel is equal to the number of lags plus one. Specifying wmatrix(hac
kernel opt) requests an HAC weighting matrix using the specified kernel, with the lag order
selected using Newey and West's (1994) optimal lag-selection algorithm. Specifying
wmatrix(hac kernel) requests an HAC weighting matrix using the specified kernel and n - 2
lags, where n is the sample size. There are three kernels available for HAC weighting matrices:
bartlett (or nwest) requests the Bartlett (Newey-West) kernel; parzen (or gallant)
9.5. ROBUST VARIANCE ESTIMATORS 164
requests the Parzen (Gallant 1987) kernel; quadraticspectral (or andrews) requests the
quadratic spectral (Andrews 1991) kernel.
The less efficient, but computationally simpler and still consistent, TSLS estimator is
often used in estimation. Its robust variance-covariance matrix Var(b_TSLS) is consistently
estimated as

Var(b_TSLS) = (X' P_[Z] X)^{-1} X' P_[Z] Ω̂ P_[Z] X (X' P_[Z] X)^{-1},
9.7. DURBIN-WU-HAUSMAN EXOGENEITY TEST 165
As already noted, if the population moment conditions are true, the minimized GMM
criterion function Q̂(b_TSLS) should not be significantly different from zero. This provides
a test for the validity of the L - k over-identifying moment conditions based on the Hansen-Sargan
statistic

HS = n Q̂(b_TSLS) ~ χ²(L - k).

. estat overid
[output omitted]
In this case the HS test supports the specification.
A conventional Hausman test can always be implemented, based on the Hausman statistic
measuring the statistical difference between the IV and OLS estimates. It is not robust to
heteroskedastic and clustered errors, though. Wu suggests an alternative. But first do this
exercise, which will prove useful in the derivations below.
b_2,TSLS = (X_2' P_[Z] M_[X_1] P_[Z] X_2)^{-1} X_2' P_[Z] M_[X_1] P_[Z] y.

But then P_[Z] = P_[X_1] + P_[M_[X_1] Z_1] also assures that P_[Z] M_[X_1] = P_[M_[X_1] Z_1], proving the result.
The DWH test provides a robust version of the H test. It maintains instrument validity,
E(ε|Z) = 0, and is based on the so-called control-function approach, which recasts the
endogeneity problem as a misspecification problem affecting the structural equation

(9.7.1)   y = Xβ + V_2 ρ + u,

where E(u|X, V_2) = 0 and V_2 is the n × k_2 matrix of the errors in the first-stage equations
of the variables X_2. As such, V_2 is responsible for the endogeneity of X_2.
Replacing V_2 in (9.7.1) with the residuals from the first-stage regressions, V̂_2 = M_[Z] X_2,
makes the DWH test operational as a simple test of joint significance for ρ in the auxiliary
OLS regression

(9.7.2)   y = Xβ + M_[Z] X_2 ρ + u*.
The test works well since, under the alternative of ρ ≠ 0, OLS estimation of the auxiliary
regression yields the TSLS estimators. This is proved as follows. Decomposing X = P_[Z]X + M_[Z]X
in (9.7.2),

y = P_[Z]Xβ + M_[Z]Xβ + M_[Z]X_2 ρ + u*

and so, since M_[Z]X_1 = 0,

y = P_[Z]Xβ + M_[Z]X_2 β_2 + M_[Z]X_2 ρ + u*
y = P_[Z]Xβ + M_[Z]X_2 (β_2 + ρ) + u*
and since P_[Z]X and M_[Z]X_2 are orthogonal, Theorem 8 assures that the OLS estimator
for β is

b_TSLS = (X' P_[Z] X)^{-1} X' P_[Z] y

and also

(β_2 + ρ)^ = (X_2' M_[Z] X_2)^{-1} X_2' M_[Z] y.
Rearrange (β_2 + ρ)^ as follows:

(β_2 + ρ)^ = (X_2' M_[Z] X_2)^{-1} X_2' M_[Z] y
= (X_2' M_[Z] X_2)^{-1} [X_2' M_[X_1] y - X_2' P_[M_[X_1] Z_1] y]
= (X_2' M_[Z] X_2)^{-1} [X_2' M_[X_1] X_2 (X_2' M_[X_1] X_2)^{-1} X_2' M_[X_1] y
  - X_2' P_[M_[X_1] Z_1] X_2 (X_2' P_[M_[X_1] Z_1] X_2)^{-1} X_2' P_[M_[X_1] Z_1] y]
= (X_2' M_[Z] X_2)^{-1} [X_2' M_[X_1] X_2 b_2,OLS - X_2' P_[M_[X_1] Z_1] X_2 b_2,TSLS]
= (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2 b_2,OLS - (X_2' M_[Z] X_2)^{-1} X_2' P_[M_[X_1] Z_1] X_2 b_2,TSLS
= (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2 b_2,OLS + [I - (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2] b_2,TSLS,

where the second equality follows from M_[Z] = M_[X_1] - P_[M_[X_1] Z_1], the third equality is
obtained by premultiplying the two terms in parentheses of the second equality by identity
matrices represented as AA^{-1}, the fourth equality follows from Exercise 85 and the last
from X_2' P_[M_[X_1] Z_1] X_2 = X_2' M_[X_1] X_2 - X_2' M_[Z] X_2. Hence,

(β_2 + ρ)^ = K b_2,OLS + (I - K) b_2,TSLS,

with K ≡ (X_2' M_[Z] X_2)^{-1} X_2' M_[X_1] X_2, proving that the test indeed follows the Hausman
test's general principle of assessing the distance between an asymptotically efficient estimator
and a consistent but inefficient estimator under the null hypothesis.
The DWH test can be easily robustified for heteroskedasticity and/or clustered errors by
estimating (9.7.2) with regress and a suitable robust option, vce(robust) or vce(cluster
clustervar ). More easily, DWH can be immediately implemented in Stata through the
ivregress postestimation command estat endogenous.
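The control-function mechanics behind (9.7.2) can be sketched in NumPy (the data-generating process is invented for illustration): regress the endogenous regressor on the instruments, append the first-stage residuals to the structural regression, and look at their coefficient. The sketch also checks the result proved above, that the auxiliary OLS returns exactly the TSLS estimator for β.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
z = rng.normal(size=n)
v2 = rng.normal(size=n)                  # first-stage error: the source of endogeneity
x2 = 0.7 * z + v2                        # endogenous regressor
eps = 0.8 * v2 + rng.normal(size=n)      # structural error, correlated with x2 via v2
y = 1.0 + 2.0 * x2 + eps

X = np.column_stack([np.ones(n), x2])    # X = [X1, X2] with X1 just the constant
Z = np.column_stack([np.ones(n), z])

# First-stage residuals: v2_hat = M_[Z] X2
pi = np.linalg.solve(Z.T @ Z, Z.T @ x2)
v2_hat = x2 - Z @ pi

# Auxiliary regression (9.7.2): y on X and v2_hat; DWH tests the coefficient on v2_hat
W = np.column_stack([X, v2_hat])
coef = np.linalg.solve(W.T @ W, W.T @ y)
rho_hat = coef[-1]                       # nonzero estimate signals endogeneity

# TSLS for comparison: the auxiliary OLS returns exactly b_TSLS for beta
ZZinv = np.linalg.inv(Z.T @ Z)
XZ = X.T @ Z
b_tsls = np.linalg.solve(XZ @ ZZinv @ XZ.T, XZ @ ZZinv @ (Z.T @ y))
```

In practice one would take robust standard errors for rho_hat from the auxiliary regression, which is exactly what the regress-based robustification described above delivers.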
. * Robust Durbin-Wu-Hausman test of endogeneity implemented by estat endogenous
. ivregress 2sls ldrugexp (hi_empunion = ssiratio) $xlist, vce(robust)
[robust coefficient table omitted]
Instrumented: hi_empunion
Instruments: totchr age female blhisp linc ssiratio
.
. estat endogenous
Tests of endogeneity
Ho: variables are exogenous
[output omitted]
.
. estat endogenous, forcenonrobust
Tests of endogeneity
Ho: variables are exogenous
[output omitted]
9.8. ENDOGENOUS BINARY VARIABLES 169
. test v1hat
( 1) v1hat = 0
F( 1, 10081) = 26.43
Prob > F = 0.0000
The linear IV-GMM approach outlined so far fits the case of binary endogenous variables,
producing consistent estimates. However, a first-stage regression fully accounting for the
binary structure of the endogenous variables may provide considerable efficiency gains. The
implied (non-linear) model is as follows:

y_i = x_1i' β_1 + x_2i β_2 + ε_i
x*_2i = x_1i' π_1 + z_i' π_2 + v_i
x_2i = 1 if x*_2i > 0, 0 otherwise

(ε_i, v_i)' ~ N(0, Σ), with Σ having diagonal elements σ_ε² and 1 and off-diagonal element σ_εv.
Staiger and Stock's rule of thumb: the partial F statistic in the first-stage regression should
exceed 10. It is simple, but not rigorous; it tends to reject weak instruments too often and has
no obvious implementation when there is more than one endogenous variable.
Two tests by Stock and Yogo (2005) overcome all of the above difficulties. They are both
based on the minimum eigenvalue of the matrix analog of the partial F statistic, a statistic
introduced by Cragg and Donald (1993) to test nonidentification. Importantly, the large-sample
properties of both tests have been derived under the assumption of homoskedastic
and independent errors. Caution must be taken, then, when drawing conclusions from the
tests if the errors are not i.i.d.
Both tests are implemented by the ivregress postestimation command estat firststage.
Montiel Olea and Pflueger (2013) derive a new test for weak instruments that extends that by Stock
and Yogo (2005) to heteroskedasticity and cluster correlation. It is implemented in Stata
by weakivtest after ivregress.
Conditional inference on the coefficients of the endogenous variables in the presence of weak
instruments is implemented through the command condivreg by Mikusheva and Poi (2006). The theory
is reviewed and expanded in Andrews et al. (2007). The command produces three alternative
confidence sets for the coefficient of the endogenous regressor, obtained from the conditional
LR, Anderson-Rubin (option ar) and LM statistics (option lm). The syntax of condivreg is
similar to that of ivregress.
It's a system estimator including structural equations for all endogenous variables. Identification
is ensured by standard (sufficient) rank and (necessary) order conditions. It is seldom
used, as it is inconsistent in the presence of heteroskedastic errors, which is the norm in most
micro applications. The Stata command is reg3.
Consistency of both estimators requires that E (|Z) = 0. These estimators are implemented
in Stata by xtivreg.
Situations in which past decisions have an impact on current behaviour are ubiquitous in
economics. For example, in the presence of input adjustment costs, short-run input demands
depend also on past input levels. In such cases fitting a static model to the data will lead to what
is referred to as dynamic underspecification. With a panel data set, however, it is possible to
implement a dynamic model from the outset in order to describe the phenomena of interest.
To make things simple, let us get started with the simple autoregressive process

(9.13.1)   y_it = γ y_{i,t-1} + ε_it,

t = 1, ..., T, i = 1, ..., N.
Model (9.13.1) can be easily extended to allow for time-invariant individual terms:

(9.13.2)   y_it = γ y_{i,t-1} + η_i + ε_it,

t = 1, ..., T, i = 1, ..., N. In vector notation, stacking time observations for each individual,

y_i = γ y_{-1,i} + η_i 1_T + ε_i,

i = 1, ..., N, where

y_i = (y_i1, ..., y_it, ..., y_iT)',   y_{-1,i} = (y_i0, ..., y_{i,t-1}, ..., y_{i,T-1})',   ε_i = (ε_i1, ..., ε_it, ..., ε_iT)'

are all (T × 1) vectors.
For each individual there are T + 1 observations available in the data set, from yi0 to yiT , but
only T are usable since one is lost to taking lags.
The problem here is that E(ε_it | y_{-1,i}, η_i) = 0 cannot hold for all t = 1, ..., T, that is,
y_{-1,i} is not strictly exogenous. Given (9.13.2), the observation of y_{-1,i} at time t is y_{i,t-1} =
f(y_i0, η_i, ε_i1, ε_i2, ..., ε_{i,t-1}), and so all future observations of y_{-1,i}, from y_it = f(y_i0, η_i, ε_i1, ..., ε_it) to
y_{i,T-1} = f(y_i0, η_i, ε_i1, ..., ε_it, ..., ε_{i,T-1}), depend on ε_it, which makes E(ε_it | y_{-1,i}, η_i) = 0 fail.
Example 86. Let T = 2 for simplicity and assume strict exogeneity, E(ε_it | y_i0, y_i1, η_i) = 0,
t = 1, 2; then, by the LIE,

cov(ε_i1, y_i1 | y_i0, η_i) = E(ε_i1 y_i1 | y_i0, η_i) = 0,

but, given y_i1 = γ y_i0 + η_i + ε_i1, cov(ε_i1, y_i1 | y_i0, η_i) = σ_ε², contradicting strict exogeneity.
Assumption A.1 is also considered in Wooldridge (2010, chapter 11), where it is referred to as
sequential exogeneity conditional on the unobserved effect. It may sometimes be convenient to
maintain also the following (sequential) conditional homoskedasticity assumption.
Assumption A.1 implies zero serial correlation in the errors, as the following exercise asks you to prove.
Exercise 87. Prove that Equation (9.13.2) and Assumption A.1 imply the following (use
the LIE and ε_{i,t-j} = y_{i,t-j} - γ y_{i,t-j-1} - η_i):

E(ε_it ε_{i,t-j}) = 0, t = 2, ..., T, j = 1, ..., t - 1.

By the LIE, E(ε_it ε_{i,t-j}) = E[E(ε_it ε_{i,t-j} | y_i^{t-1}, η_i)], and so it suffices to note that

E(ε_it ε_{i,t-j} | y_i^{t-1}, η_i) = (y_{i,t-j} - γ y_{i,t-j-1} - η_i) E(ε_it | y_i^{t-1}, η_i) = 0,

where the first equality follows from y_{i,t-j}, y_{i,t-j-1} and η_i all falling into the conditioning set
of the conditional expectation and the last equality follows from A.1.
γ̂_LSDV = γ + [(1/NT) Σ_i Σ_t (y_{i,t-1} - ȳ_{-1,i.}) (ε_it - ε̄_i.)] / [(1/NT) Σ_i Σ_t (y_{i,t-1} - ȳ_{-1,i.})²],

he showed that

plim_{N→∞} (1/NT) Σ_i Σ_t (y_{i,t-1} - ȳ_{-1,i.}) (ε_it - ε̄_i.) = (1/T) Σ_{t=1}^T E[(y_{i,t-1} - ȳ_{-1,i.}) (ε_it - ε̄_i.)]
= -(σ_ε²/T²) [(T - 1) - Tγ + γ^T] / (1 - γ)² ≠ 0.

Hence, the bias vanishes for T → ∞, but it does not for N → ∞ and T fixed. For this reason,
the LSDV estimator is inaccurate in panel data sets with large N and small T and is said to
be semi-inconsistent (see also Sevestre and Trognon, 1996).
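Nickell's result is easy to reproduce in a small Monte Carlo (a sketch; N, T, the value of γ and the mean-stationary start are invented for illustration): with T fixed, the LSDV estimate stays below the true γ however large N is, and the sample cross-moment tracks the plim formula above.

```python
import numpy as np

rng = np.random.default_rng(10)
N, T, gamma = 20_000, 5, 0.5

# Simulate (9.13.2): y_it = gamma*y_{i,t-1} + eta_i + eps_it with a mean-stationary start
eta = rng.normal(size=N)
eps = rng.normal(size=(N, T))
y = np.zeros((N, T + 1))
y[:, 0] = eta / (1 - gamma) + rng.normal(size=N) / np.sqrt(1 - gamma**2)
for t in range(1, T + 1):
    y[:, t] = gamma * y[:, t - 1] + eta + eps[:, t - 1]

# LSDV: OLS on within-demeaned data
ylag_dm = y[:, :-1] - y[:, :-1].mean(axis=1, keepdims=True)
ycur_dm = y[:, 1:] - y[:, 1:].mean(axis=1, keepdims=True)
gamma_lsdv = (ylag_dm * ycur_dm).sum() / (ylag_dm**2).sum()

# Nickell's plim of the numerator (with sigma_eps^2 = 1) vs its sample counterpart
nickell_num = -(1 / T**2) * ((T - 1) - T * gamma + gamma**T) / (1 - gamma)**2
eps_dm = eps - eps.mean(axis=1, keepdims=True)
sample_num = (ylag_dm * eps_dm).sum() / (N * T)
```

With γ = 0.5 and T = 5 the downward bias is roughly a third of the true coefficient, which is why the do-file bias_in_AR1_LSDV.do mentioned earlier focuses on exactly this design dimension.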
Since Nickell (1981), a number of consistent IV and GMM estimators have been proposed in
the econometric literature as an alternative to LSDV. Anderson and Hsiao (1981) (AH) suggest
two simple IV estimators that, upon transforming the model in first differences to eliminate
the unobserved individual heterogeneity, use the second lags of the dependent variable, either
differenced or in levels, as an instrument for the differenced one-time lagged dependent variable.
Arellano and Bond (1991) (AB) propose a GMM estimator for the first-differenced model
which, relying on all available lags of y_{-1,i} as instruments, is more efficient than Anderson
and Hsiao's. Ahn and Schmidt (1995), upon noticing that the Arellano and Bond estimator uses
only linear moment restrictions, suggest a set of non-linear restrictions that may be used in
addition to the linear ones to obtain more efficient estimates. Blundell and Bond (1998) (BB)
observe that with highly persistent data first-differenced IV or GMM estimators may suffer
from a severe small-sample bias due to weak instruments. As a solution, they suggest a system
GMM estimator with first-differenced instruments for the equation in levels and instruments
in levels for the first-differenced equation. Some of the foregoing methods are nowadays very
popular and are surveyed below.
9.13.2. The Anderson and Hsiao IV Estimator. AH take model (9.13.2) in first
differences to eliminate the individual effects:

Δy_it = γ Δy_{i,t-1} + Δε_it, t = 2, ..., T.

This makes the disturbances MA(1) with a unit root, and so induces correlation between the
lagged endogenous variable and the disturbances. This problem can be solved by finding valid
and relevant instruments for Δy_{i,t-1}. AH suggest using the values of the dependent variable
lagged two times, y_{i,t-2}, t = 2, ..., T,¹ since y_{i,t-2} is negatively correlated with Δy_{i,t-1} =
y_{i,t-1} - y_{i,t-2} and uncorrelated with Δε_it = ε_it - ε_{i,t-1} under A.1. So, T - 1 moment restrictions
could be used:

E(y_i0 Δε_i2) = 0
E(y_i1 Δε_i3) = 0
...
E(y_{i,T-2} Δε_iT) = 0,

or more compactly

(9.13.4)   E(y_{i,t-2} Δε_it) = 0, t = 2, ..., T.

AH, though, look for a simple estimator and so they just use the T - 1 restrictions added up
into the single moment restriction

E(Σ_{t=2}^T y_{i,t-2} Δε_it) = 0.

¹Or alternatively Δy_{i,t-2}, but Arellano found a problem of identification for some values of γ in this case.
As such, the AH estimator is an exactly identified IV estimator, consistent under Assumption A.1, but
non-optimal and with a high root mean squared error in applications.
9.13.3. The Arellano and Bond GMM estimator. AB look for a more efficient estimator
than AH's, introducing three efficiency improvements.
First, AB use all the T - 1 restrictions in (9.13.4) from the second lag.
Second, under A.1 all the lags more remote than the second are also uncorrelated with
ε_it - ε_{i,t-1}. So, for the third lag there are T - 2 restrictions:

E(y_i0 Δε_i3) = 0
E(y_i1 Δε_i4) = 0
...
E(y_{i,T-3} Δε_iT) = 0,

for the fourth lag there are T - 3 restrictions and so on until the last usable lag, the T-th, yielding
just one restriction, E(y_i0 Δε_iT) = 0. Eventually (T - 1) + (T - 2) + ... + 1 = T(T - 1)/2 ≡ L
restrictions are available: E(y_{i,t-s} Δε_it) = 0, s = 2, ..., T, t = s, ..., T. The AB estimator minimizes
a criterion function in the sample analogs of the population moments involved in these
restrictions: N^{-1} Σ_{i=1}^N y_{i,t-s} (Δy_it - γ Δy_{i,t-1}) is the sample analog of E(y_{i,t-s} Δε_it), s = 2, ..., T,
t = s, ..., T.
Collecting the L sample moments into the vector m̂(γ), AB look for optimal overidentified
GMM estimators, as discussed in Subsection 9.2.3, and here is their last improvement over
AH.
Z = (Z_1', Z_2', ..., Z_N')',
Δy = γ Δy_{-1} + Δε,

where
Δy is an (N(T - 1) × 1) vector;
Δy_{-1} is an (N(T - 1) × 1) vector;
Δε is an (N(T - 1) × 1) vector.
The criterion function is

(9.13.5)   Q(γ̂) = m̂(γ̂)' A m̂(γ̂)

with

A = [N^{-1} Σ_{i=1}^N Z_i' G Z_i]^{-1}

up to scale, and G the (T - 1) × (T - 1) matrix

      2 -1  0  0 ...  0  0
     -1  2 -1  0 ...  0  0
G =   0 -1  2 -1 ...  .  .
      .  .  .  . ...  .  .
      0  0  0  0 ...  2 -1
      0  0  0  0 ... -1  2
The Arellano-Bond one-step estimator γ̂_1 = argmin Q(γ̂) is

γ̂_1 = [Δy_{-1}' Z (Σ_{i=1}^N Z_i' G Z_i)^{-1} Z' Δy_{-1}]^{-1} Δy_{-1}' Z (Σ_{i=1}^N Z_i' G Z_i)^{-1} Z' Δy.
Exercise 88. Prove that under A.1 and A.2, Var(Δε_i) = σ_ε² G. Hint: Derive the formula
for the generic element of Var(Δε_i) = E(Δε_i Δε_i') and then see how it specializes
for 1) the generic main-diagonal element, 2) the generic first-subdiagonal element and 3) the
generic element on the remaining lower subdiagonals. For all three cases, first work out the
expectations conditional on

y_i^{t-2} = (y_i0, y_i1, ..., y_{i,t-2})',

using A.2, A.3 (which follows from A.1, as seen in Exercise 87) and the Generalized LIE (GLIE);
then work out the unconditional expectations using the LIE.
The generic element of Var(Δε_i) is

(9.13.7)   E[(ε_it - ε_{i,t-1})(ε_{i,t-j} - ε_{i,t-j-1})],

t = 2, ..., T, j = 0, ..., t - 2.
When j = 0 the foregoing term is a generic element of the main diagonal of Var(Δε_i).
Given A.2, A.3 and the GLIE,

E[(ε_it - ε_{i,t-1})² | y_i^{t-2}] = E(ε_it² | y_i^{t-2}) + E(ε_{i,t-1}² | y_i^{t-2}) - 2 E(ε_it ε_{i,t-1} | y_i^{t-2})
(9.13.8)   = 2σ_ε²,

and by the LIE the unconditional expectation is also 2σ_ε².
When t > 2 and j = 1, expression (9.13.7) specializes to the generic element of the first
subdiagonals of Var(Δε_i). Given A.2, A.3 and the GLIE,

E[(ε_it - ε_{i,t-1})(ε_{i,t-1} - ε_{i,t-2}) | y_i^{t-2}] = -E(ε_{i,t-1}² | y_i^{t-2})
(9.13.9)   = -σ_ε²,

and by the LIE the unconditional expectation is also -σ_ε².
When t > 3 and j = 2, ..., t - 2 we are considering all of the other elements of Var(Δε_i).
Then, given A.3 and the GLIE,

(9.13.10)   E[(ε_it - ε_{i,t-1})(ε_{i,t-j} - ε_{i,t-j-1}) | y_i^{t-2}] = 0,

and by the LIE the unconditional expectation is also 0.
where

y_i^t = (y_i0, y_i1, ..., y_it)'

and

Δε_i = (ε_i2 - ε_i1, ε_i3 - ε_i2, ε_i4 - ε_i3)'.

So,

Z_i' Δε_i = (y_i0(ε_i2 - ε_i1), y_i0(ε_i3 - ε_i2), y_i1(ε_i3 - ε_i2), y_i0(ε_i4 - ε_i3), y_i1(ε_i4 - ε_i3), y_i2(ε_i4 - ε_i3))'.
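The block-diagonal instrument matrix Z_i can be built mechanically; below is a sketch (the helper name ab_instruments and the numeric values are invented). For T = 4 it returns a 3 × 6 matrix whose transpose, multiplied by Δε_i, reproduces the vector displayed above.

```python
import numpy as np

def ab_instruments(y_levels):
    """Arellano-Bond instrument matrix Z_i for one individual.

    y_levels holds (y_i0, ..., y_iT). The row for the differenced equation at
    period t = 2, ..., T contains the levels (y_i0, ..., y_{i,t-2}); all other
    entries are zero, giving a (T-1) x L block-diagonal matrix, L = T(T-1)/2.
    """
    T = len(y_levels) - 1
    L = T * (T - 1) // 2
    Z = np.zeros((T - 1, L))
    col = 0
    for t in range(2, T + 1):
        Z[t - 2, col:col + t - 1] = y_levels[:t - 1]
        col += t - 1
    return Z

# T = 4 example, with illustrative numbers
y_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # (y_i0, ..., y_i4)
eps = np.array([0.1, -0.2, 0.3, 0.0])           # (eps_i1, ..., eps_i4)
deps = np.diff(eps)                             # (Δeps_i2, Δeps_i3, Δeps_i4)
Zi = ab_instruments(y_i)
moments = Zi.T @ deps                           # the L = 6 sample moments
```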
Without homoskedasticity (that is, without Assumption A.2), γ̂_1 is no longer optimal, but
it remains consistent and so it can be used to construct the optimal two-step estimator γ̂_2
along the lines described in Subsection 9.4.1. Specifically, γ̂_2 minimizes (9.13.5) with

A = [N^{-1} Σ_{i=1}^N Z_i' Δe_{1i} Δe_{1i}' Z_i]^{-1},

where Δe_{1i} are the one-step residuals.
Monte Carlo experiments in Bowsher (2002) show that the Sargan test based on the full
instrument set has zero power when T, and consequently the number of moment conditions, becomes too
large for given N.
9.13.3.2. Derivation of Var(Z_i' Δε_i) = σ_ε² E(Z_i' G Z_i). We now prove that under A.1 and
A.2, Var(Z_i' Δε_i) = σ_ε² E(Z_i' G Z_i).
Given Exercise 89 it is easy to see that Z_i' Δε_i is the L × 1 zero-mean vector

Z_i' Δε_i = (y_i0(ε_i2 - ε_i1), y_i0(ε_i3 - ε_i2), y_i1(ε_i3 - ε_i2), ..., y_i0(ε_iT - ε_{i,T-1}), y_i1(ε_iT - ε_{i,T-1}), ..., y_{i,T-2}(ε_iT - ε_{i,T-1}))'
and so

Var(Z_i' Δε_i) = E[(Z_i' Δε_i)(Z_i' Δε_i)'],

whose generic block is

E[(ε_it - ε_{i,t-1})(ε_{i,t-j} - ε_{i,t-j-1}) y_i^{t-2} (y_i^{t-j-2})'],

t = 2, ..., T, j = 0, ..., t - 2.
So,

E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) y_i^{t−2} (y_i^{t−j−2})′]
= E{ E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) y_i^{t−2} (y_i^{t−j−2})′ | y_i^{t−2}] }
= E{ E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) | y_i^{t−2}] y_i^{t−2} (y_i^{t−j−2})′ }
= E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) | y_i^{t−2}] E[y_i^{t−2} (y_i^{t−j−2})′]
= E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1})] E[y_i^{t−2} (y_i^{t−j−2})′],

where the first equality follows from the LIE, the second from the fact that all elements of y_i^{t−j−2} belong to y_i^{t−2}, and the third and the last from the fact that the conditional expectation E[(ε_{it} − ε_{i,t−1})(ε_{i,t−j} − ε_{i,t−j−1}) | y_i^{t−2}] is a constant, as shown by (9.13.8)–(9.13.10).
9.13.4. Blundell and Bond (1998) system estimator. Blundell and Bond (1998) demonstrate that when γ is close to unity the instruments in levels are weakly correlated with the first differences, leading to what is known in the econometric literature as weak instrument bias. This is easily seen by considering the following example taken from Blundell and Bond. Let T = 2; then, after taking the model in first differences, there is only a cross-section available for estimation:

Δy_{i,2} = γΔy_{i,1} + Δε_{i,2},  i = 1, …, N.
To what extent is y_{i,0} related to Δy_{i,1}? To answer this question it suffices to work out the reduced form for Δy_{i,1}:

Δy_{i,1} = (γ − 1) y_{i,0} + η_i + ε_{i,1},

from which it is clear that the closer γ is to unity, the weaker the correlation between y_{i,0} and Δy_{i,1}.
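The shrinking correlation can be illustrated numerically. The notes use Stata; the following is a minimal Python sketch with arbitrary data-generating values, simulating the covariance-stationary AR(1) panel model and comparing the correlation between the level instrument y_{i,0} and Δy_{i,1} for a small and a near-unity γ.

```python
import math
import random

# Illustrative sketch (not from the notes): simulate y_i1 = gamma*y_i0 + eta_i + eps_i1
# with a covariance-stationary start, and measure corr(y_i0, dy_i1).
def corr_level_instrument(gamma, n=50_000, seed=42):
    rng = random.Random(seed)
    y0s, dys = [], []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)
        # stationary start: y_i0 = eta/(1-gamma) + u_i0, u_i0 ~ N(0, 1/(1-gamma^2))
        u0 = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - gamma ** 2))
        y0 = eta / (1.0 - gamma) + u0
        dy1 = (gamma - 1.0) * y0 + eta + rng.gauss(0.0, 1.0)
        y0s.append(y0)
        dys.append(dy1)
    my, md = sum(y0s) / n, sum(dys) / n
    cov = sum((a - my) * (b - md) for a, b in zip(y0s, dys)) / n
    vy = sum((a - my) ** 2 for a in y0s) / n
    vd = sum((b - md) ** 2 for b in dys) / n
    return cov / math.sqrt(vy * vd)

low_gamma, high_gamma = corr_level_instrument(0.3), corr_level_instrument(0.95)
print(low_gamma, high_gamma)  # the correlation shrinks towards zero as gamma -> 1
```

The simulated correlation is clearly bounded away from zero for γ = 0.3 but nearly vanishes for γ = 0.95, which is the weak-instrument problem in miniature.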
To solve the problem they suggest exploiting the following additional moment restrictions, which are valid if, along with Assumption A.1, we maintain that the process for y_{i,t} is mean-stationary, that is

A.4:  E(y_{i,0} | η_i) = η_i / (1 − γ).

Assumption A.4 is justified if the process started in the distant past. Starting from the model at observation t = 0 and going backward in time recursively,

y_{i,0} = η_i/(1 − γ) + Σ_{t=0}^∞ γ^t ε_{i,−t} = η_i/(1 − γ) + u_{i,0}.
The last equality follows since E(η_i u_{i,0}) = 0 by construction, E(ε_{i,2} η_i) = 0 and E(ε_{i,1} ε_{i,2}) = 0 by A.1. That also E(ε_{i,2} u_{i,0}) = 0 follows straightforwardly from the fact that by A.1 E(ε_{i,2} y_{i,0}) = 0, that is

(1/(1 − γ)) E(ε_{i,2} η_i) + E(ε_{i,2} u_{i,0}) = 0,

and E(ε_{i,2} η_i) = 0. Thus, Blundell and Bond (1998) suggest a system GMM estimator, which also uses instruments in first differences for the equation in levels.
Hahn (1999) evaluates the efficiency gains brought by exploiting the stationarity of the
initial condition as done by Blundell and Bond, finding that it is substantial also for large T .
Stata's xtabond performs the Arellano and Bond GMM estimator. Then there is xtdpdsys, which implements the system GMM estimator. Third, xtdpd is a more general command that allows more flexibility than both xtabond and xtdpdsys. Finally, the user-written xtabond2 (Roodman 2009) is certainly the most powerful Stata code to implement dynamic panel data models.
9.13.5. Application. Arellano and Bond (1991) illustrate their methods by estimating a dynamic employment equation on a sample of UK manufacturing companies. Their data set in Stata format is contained in abdata.dta. The do-file IV_GMM_DPD.do implements simpler versions of their model through differenced and system GMM using xtabond and xtabond2. The do-file abbest.do by D. M. Roodman replicates exactly the Arellano and Bond results using xtabond2.
9.13.6. Bias-corrected LSDV. IV and GMM estimators in dynamic panel data models are consistent only as N grows large, so they can be severely biased and imprecise in panel data with a small number of cross-sectional units. This certainly applies to most macro panels, but also to micro panels where heterogeneity concerns force the researcher to restrict estimation to small subsamples of individuals.
Monte Carlo studies (Arellano and Bond 1991, Kiviet 1995 and Judson and Owen 1999) demonstrate that LSDV, although inconsistent, has a relatively small variance compared to IV and GMM estimators. So, an alternative approach based upon the correction of LSDV for the finite sample bias has recently become popular in the econometric literature. Kiviet (1995) uses higher-order asymptotic expansion techniques to approximate the small sample bias of the LSDV estimator, including terms of at most order 1/(TN). Monte Carlo evidence therein shows that the bias-corrected LSDV estimator (LSDVC) often outperforms the IV-GMM estimators in terms of bias and root mean squared error (RMSE). Another piece of Monte Carlo evidence, by Judson and Owen (1999), strongly supports LSDVC when N is small, as in most macro panels. In Kiviet (1999) the bias expression is made more accurate by including terms of higher order. Bun and Kiviet (2003) simplify the approximations in Kiviet (1999).

Bruno (2005a) extends the bias approximations in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule. Bruno (2005b) presents the new user-written Stata command xtlsdvc to implement LSDVC.
Kiviet (1995) shows that the bias approximations are even more accurate when there is
a unit root in y. This makes for a simple panel unit-root test based on the bootstrapped
standard errors computed by xtlsdvc.
9.13.6.1. Estimating a dynamic labour demand equation for a given industry. Unlike the
xtabond and xtabond2 applications of Subsection 9.13.5, here we do not use all information
available to estimate the parameters of the labour demand equation in abdata.dta. Instead,
we follow a strategy that, exploiting the industry partition of the cross-sectional dimension
as defined by the categorical variable ind, lets the slopes be industry-specific. This is easily
accomplished by restricting the usable data to the panel of firms belonging to a given industry.
While such a strategy leads to a less restrictive specification for the firm labour demand, it
causes a reduced number of cross-sectional units for use in estimation, so that the researcher
must be prepared to deal with a potentially severe small sample bias in any of the industry
regressions. Clearly, xtlsdvc is the appropriate solution in this case.
The demonstration is kept as simple as possible considering regressions for only one in-
dustry panel, ind=4.
The following instructions are implemented in a Stata do-file.
Part 2
Non-linear models
CHAPTER 10
10.1. Introduction
Non-linear models may provide a more adequate specification than linear models when the dependent variable is not Gaussian. They present three important difficulties, though.
There are two do-files demonstrating the methods of this chapter: nlmr.do, using the data set mus10data.dta, and nlmr2.do, using the data set mus17data.dta. Both data sets are from Cameron and Trivedi (2010).
The regression model specifies the mean of y conditional on a vector of exogenous explanatory variables x by using some known, non-linear functional form

E(y|x) = m(x, β).

Or, equivalently,

y = m(x, β) + u,

where u = y − E(y|x).
10.3. POISSON MODEL FOR COUNT DATA 191
Given the random sample {y_i, x_i}, i = 1, …, n, the non-linear least squares estimator, b_NLS, minimizes the non-linear residual sum of squares

Q = Σ_{i=1}^n [y_i − m(x_i, b)]².
Let y ∈ ℕ be a count variable: doctor visits, car accidents, etc. The Poisson regression model is a non-linear regression model with

(10.3.1)  E(y|x) = exp(x′β).

Or, equivalently,

y = exp(x′β) + u,

with E[y − exp(x′β) | x] = 0, and by the Law of Iterated Expectations there are zero covariances between u and x:

(10.3.2)  E_{y,x}[x(y − exp(x′β))] = 0.
Exercise 90. Equations (10.3.3) are different from the first-order-conditions of the NLS
estimator for the Poisson model. Can you derive them?
f(y) = e^{−μ} μ^y / y!.

Importantly, the Poisson model has the equidispersion property: Var(y) = E(y) = μ.

Letting μ = exp(x′β), we end up with the conditional log-likelihood function

lnL(y_1 … y_n | x_1 … x_n, β) = Σ_{i=1}^n ln{ exp[−exp(x_i′β)] exp(x_i′β)^{y_i} / y_i! }
= Σ_{i=1}^n [ −exp(x_i′β) + y_i x_i′β − ln(y_i!) ].
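Since the score of this log-likelihood is Σ(y_i − exp(x_i′b))x_i and the Hessian is −Σ exp(x_i′b)x_i x_i′, b_ML can be computed by Newton's method. The following Python sketch is illustrative only (the notes use Stata's poisson) and fits the model on a hypothetical toy data set with a constant and one regressor.

```python
import math

# Illustrative sketch: Poisson ML by Newton's method on hypothetical data.
# Score: g = sum (y_i - mu_i) x_i ; negative Hessian: H = sum mu_i x_i x_i'.
z = [0.0, 0.0, 1.0, 1.0]
y = [1, 1, 2, 3]

b = [0.0, 0.0]  # (constant, slope)
for _ in range(50):
    g0 = g1 = 0.0
    h11 = h12 = h22 = 0.0
    for zi, yi in zip(z, y):
        mu = math.exp(b[0] + b[1] * zi)
        g0 += yi - mu
        g1 += (yi - mu) * zi
        h11 += mu
        h12 += mu * zi
        h22 += mu * zi * zi
    det = h11 * h22 - h12 * h12
    # Newton step: b <- b + H^{-1} g
    b[0] += (h22 * g0 - h12 * g1) / det
    b[1] += (-h12 * g0 + h11 * g1) / det
    if abs(g0) + abs(g1) < 1e-10:
        break

print(b)  # b_ML solves the first order conditions sum (y_i - mu_i) x_i = 0
```

Because the toy model is saturated (z takes two values), the ML solution equals the log of the group means, which makes the result easy to check by hand.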
b_ML is consistent: b_ML →_p β.

(10.3.4)  The covariance matrix estimator of b_ML:  V(b_ML) = ( Σ_{i=1}^n μ̂_i x_i x_i′ )^{−1}.
It is easily seen that the k first order conditions that maximize lnL coincide with the equations in (10.3.3), so that b_ML = b_GMM. This proves two things: 1) the GMM estimator is asymptotically efficient if the conditional mean function is correctly specified and the density function is Poisson; 2) the ML estimator is consistent even if the Poisson density is not the correct density function, as long as the conditional mean is correctly specified. In such cases, when the likelihood function is not correctly specified, we refer to the ML estimator as a pseudo ML estimator and a robust covariance matrix estimator should be used for inference rather than (10.3.4):

V_rob(b_ML) = ( Σ_{i=1}^n μ̂_i x_i x_i′ )^{−1} [ Σ_{i=1}^n (y_i − μ̂_i)² x_i x_i′ ] ( Σ_{i=1}^n μ̂_i x_i x_i′ )^{−1}.

If equidispersion holds, (y_i − μ̂_i)² is on average close to μ̂_i, and then V(b_ML) is close to V_rob(b_ML).
The consistency result for the (pseudo) ML estimator holds in general if two conditions are verified: the density belongs to the linear exponential family, f(y|x) = exp{a(y) + b[m(x, β)] + y·c[m(x, β)]}, and the conditional mean m(x, β) is correctly specified. In the Poisson case,

a(y) = −ln(y!),  b[m(x, β)] = −exp(x′β)  and  y·c[m(x, β)] = y x′β.
The Stata command that implements Poisson regression is poisson, with a syntax close to regress. It computes b_ML with standard error estimates obtained by V(b_ML). If the
vce(robust) option is given, then Stata recognizes the more robust pseudo-ML set-up and still provides the b_ML coefficient estimates, but with the robust covariance matrix V_rob(b_ML). Clustering of errors is supported by vce(cluster clustervar); vce(opg) implements the VCE as the inverse outer product of the gradient vector evaluated at b_ML, which is asymptotically equivalent to V(b_ML); finally, bootstrap standard errors can be obtained through vce(bootstrap, reps(#)), with reps(#) setting the number of bootstrap replications.
10.3.2. Empirical application. We apply poisson to estimate a Poisson model for the number of doctor visits in 2002, for a sample of U.S. individuals aged 25-64 years, working in the private sector, not self-employed and not receiving public insurance (Medicare or Medicaid) (see Cameron and Trivedi, 2010).
. keep if year02==1
(25,712 observations deleted)
Robust VCE:

(Poisson regression of docvis: coefficient table with robust standard errors omitted)
.
Poisson can be implemented also through the Stata command performing non-linear GMM, gmm. The syntax is non-standard here: the argument of gmm is a Stata rendition of y − exp(x′β) within parentheses. The option instruments(varlist) tells Stata which variables are exogenous (all, in this application).
GMM estimation
Number of parameters = 5
Number of moments = 5
Initial weight matrix: Unadjusted          Number of obs = 4,412
(coefficient table with robust standard errors omitted)
10.3.3. Marginal effects. The post-estimation command margins with the option dydx(varlist) estimates marginal effects for each of the variables in varlist. Marginal effects can be estimated at a point x₀ (conventionally, the sample mean when variables are continuous; in this case the option atmean must be supplied) or averaged over the sample (the default). The marginal effect of a binary covariate should be computed as the discrete change from its base level, rather than using partial-derivative formulas, which are appropriate only for continuous x's.

Stata uses the appropriate formulas if we indicate which covariates are binary in the regression command preceding margins, as follows:
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
.
Or we can average MEs over the sample to get AMEs. NB: This is the margins default
. quietly poisson docvis i.private i.chronic i.female income, vce(robust)
. margins, dydx(*)
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
.
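To see why the discrete change and the derivative formula disagree for a binary covariate, consider a small numerical sketch. This is illustrative Python (the notes use Stata); the coefficients and data below are hypothetical, not estimates from the application above.

```python
import math

# Illustrative sketch: for a Poisson mean mu = exp(b0 + b1*d + b2*x), the AME of the
# binary covariate d is the average discrete change in the fitted mean as d moves
# from 0 to 1, holding the continuous covariate at its observed values.
b0, b1, b2 = 0.2, 0.5, 0.1  # hypothetical coefficients
xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical continuous covariate

def mu(d, x):
    return math.exp(b0 + b1 * d + b2 * x)

ame_discrete = sum(mu(1, x) - mu(0, x) for x in xs) / len(xs)
# The derivative formula b1*mu (evaluated here at d = 0), appropriate only for
# continuous regressors, gives a different answer:
ame_derivative = sum(b1 * mu(0, x) for x in xs) / len(xs)
print(ame_discrete, ame_derivative)
```

With the exponential mean, the discrete change equals (e^{b1} − 1)·μ(0, x), so for b1 = 0.5 it exceeds the derivative formula b1·μ(0, x) by a fixed factor.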
10.4. MODELLING AND TESTING OVERDISPERSION 199
Let ν be a multiplicative latent heterogeneity term, so that conditional on ν the density is Poisson with mean μν:

f(y|ν) = e^{−μν} (μν)^y / y!,

and the marginal density is

f(y) = E_ν[ e^{−μν} (μν)^y / y! ].

To find it in closed form we need to specify the marginal density function for ν. If ν ∼ Gamma(1, α²), with unity mean and variance α², then f(y) is a negative binomial density function, NB(μ, α²), with E(y) = μ and Var(y) = μ(1 + α²μ). Clearly, if α² = 0, then ν collapses to its unity mean and f(y) is Poisson.

Specifying μ = exp(x′β) yields the NB regression model; β and α² are estimated via ML based on NB(exp(x′β), α²). Testing for overdispersion within this framework boils down to testing H₀: α² = 0.

Overdispersion can be tested also under the null hypothesis of α² = 0, therefore under Poisson regression, against the alternative of Var(y|x) = μ(1 + α²μ), therefore NB regression, using a Lagrange Multiplier test. This is based on an auxiliary regression implemented after poisson estimation, using an estimate of [Var(y|x)/μ] − 1, namely [(y_i − μ̂_i)² − y_i]/μ̂_i, as the dependent variable and μ̂_i = exp(x_i′b_ML) as the only regressor (no constant). The LM test is the t-statistic computed for the OLS coefficient estimate of μ̂_i.
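The auxiliary regression behind the LM test can be sketched as follows. This is illustrative Python, not the Stata implementation: μ_i is taken as known rather than fitted, and y_i is drawn from a Poisson so that the equidispersion null holds.

```python
import math
import random

# Illustrative sketch: LM overdispersion test as a no-constant OLS regression of
# w_i = [(y_i - mu_i)^2 - y_i]/mu_i on mu_i; the statistic is the slope t-ratio.
rng = random.Random(7)

def rpois(lam):
    # Knuth's algorithm, adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

mus = [math.exp(0.1 + 0.3 * rng.gauss(0, 1)) for _ in range(500)]
ys = [rpois(m) for m in mus]
w = [((yi - m) ** 2 - yi) / m for yi, m in zip(ys, mus)]

bhat = sum(wi * m for wi, m in zip(w, mus)) / sum(m * m for m in mus)
resid = [wi - bhat * m for wi, m in zip(w, mus)]
s2 = sum(e * e for e in resid) / (len(w) - 1)
t_lm = bhat / math.sqrt(s2 / sum(m * m for m in mus))
print(t_lm)  # approximately N(0,1) under equidispersion
```

In practice μ̂_i comes from a first-stage poisson fit; large positive values of the t-ratio point towards the NB alternative.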
CHAPTER 11
11.1. Introduction
Binary dependent variable models have a dependent variable that partitions the sample into two categories of a given qualitative dimension of interest. For example:

- Labour supply. There are two categories: work/not work (univariate binary model).
- Supplementary private health insurance. There are two categories: purchase/not purchase (univariate binary model).

Binary models are said to be multivariate when there are multiple dimensions that are possibly related:

- Two related dimensions: [Dimension 1: Being overweight (body mass index > 25), two categories: yes/no] and [Dimension 2: Job satisfaction, two categories: satisfied/dissatisfied] (bivariate binary model).
- Two related dimensions: [Dimension 1: Identity of immigrants with the host country, two categories: yes/no] and [Dimension 2: Identity of immigrants with the country of origin, two categories: yes/no] (bivariate binary model).

In these notes I focus almost exclusively on univariate binary models, except for a digression on the bivariate probit model as estimated by Stata's biprobit.
The do-file bdvm.do is a Stata application on binary models that uses the data set mus14data.dta
from Cameron and Trivedi (2010).
11.2. BINARY MODELS 202
Let A be the event of interest (e.g. buying a supplementary health insurance). Let the indicator function 1(A) be unity if event A occurs and zero if not. Define the discrete random variable y such that

(11.2.1)  y = 1(A).

Then

Pr(y = 1) = Pr(A) ≡ p  and  Pr(y = 0) = 1 − p,
E(y) = p  and  Var(y) = p(1 − p).

(11.2.2)  Pr(y = 1|x) = F(x′β).

Since Pr(y = 1|x) = E(y|x), Model (11.2.2) can always be expressed as the regression model

(11.2.3)  y = F(x′β) + u,  u = y − E(y|x).
11.2.1. Latent regression. When F(·) is a distribution function the binary model can be motivated as a latent regression model. In microeconomics this is a convenient way to model individual choices.

Introduce the latent continuous random variable y* with

(11.2.4)  y* = x′β + ε,

where ε is a zero-mean random variable, independent of x, with ε ∼ F, and F is a distribution function that is symmetric around zero. Then, let y = 1(y* > 0). In the insurance example we may think of y* as the utility variation faced by a subject with observable and latent characteristics x and ε, respectively, when she decides to buy a supplementary insurance, so that event A occurs if and only if y* > 0. Hence

y = 1(ε > −x′β),

so that

Pr(y = 1|x) = Pr(ε > −x′β | x),

and, by the symmetry of F around zero,

(11.2.5)  Pr(y = 1|x) = F(x′β).
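The link between the latent model and the response probability can be checked by simulation. A minimal Python sketch, assuming a standard normal ε (the probit case) and an arbitrary index value:

```python
import math
import random

# Illustrative sketch: with eps ~ N(0,1) in y* = x'beta + eps and y = 1(y* > 0),
# the empirical frequency of y = 1 at a fixed x should match Phi(x'beta).
def Phi(z):  # standard normal CDF via erf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(123)
xb = 0.4        # a fixed (hypothetical) index value x'beta
n = 200_000
freq = sum(1 for _ in range(n) if xb + rng.gauss(0.0, 1.0) > 0) / n
print(freq, Phi(xb))  # the two numbers should agree up to simulation error
```

Replacing the normal draw with a logistic one would reproduce the logit case, since only the distribution F of ε changes.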
11.2.3. An application to the HRS data. To illustrate the methods discussed so far, we use data from the panel study Health and Retirement Study (HRS) 2000, a sample of individuals aged 52-86, all Medicare beneficiaries. We wish to explain the decision to purchase a private insurance from any source: private market, employer- or union-sponsored plans, etc. This decision is described by a yes/no binary variable, ins. Estimation is carried out through a battery of binary model estimators.

Here are the variables involved in the binary regressions, included in Stata's global macros by the Stata command global:
(Stata output omitted: global macro definitions and a regression of ins with robust standard errors)
Stata's hetprobit estimates this heteroskedastic probit model and, importantly, provides a LR test for the null hypothesis of homoskedasticity (δ = 0). The following is an application of hetprobit to the HRS data:
ins            Coef.     Std. Err.      z    P>|z|     [95% Conf. Interval]
retire .1075926 .0476757 2.26 0.024 .0141501 .2010352
age -.0087658 .0062107 -1.41 0.158 -.0209384 .0034069
hstatusg .1629653 .0564771 2.89 0.004 .0522722 .2736584
hhincome .0011135 .000364 3.06 0.002 .0004 .001827
educyear .0642167 .0094184 6.82 0.000 .0457569 .0826765
married .3341699 .0563861 5.93 0.000 .2236551 .4446847
hisp -.4344396 .1055044 -4.12 0.000 -.6412244 -.2276548
_cons -.9089138 .4318121 -2.10 0.035 -1.75525 -.0625776
lnsigma2
chronic -.0442144 .0365848 -1.21 0.227 -.1159193 .0274906
.
(regression of ins: coefficient table with robust standard errors omitted)
.
11.3. COEFFICIENT ESTIMATES AND MARGINAL EFFECTS 211
There is no exact relationship between the coefficient estimates from the three foregoing models. Amemiya (1981) works out rough conversion factors (table omitted here). This is not very interesting, though, since we are often interested in marginal effects on response probabilities rather than coefficients.

In logit and probit models, the marginal effect of x on the success probability, F(x′β), depends on how large the index value t = x′β is, since F(t) → 1 or 0 as t → +∞ or −∞, respectively. Hence, if x is large in size and β ≠ 0, so that t = x′β is also large, F(t) tends to be constant and ∇_x F(t) to be negligible.

The general expression for ∇_x F(t) can be worked out by using the chain rule, so that ∇_x F(t) = ∂_t F(t) ∇_x(x′β), and since the first derivative of a distribution function is its density function f, ∇_x F(t) = f(x′β)β. The marginal effects of x at observation i are estimated by logit and probit as

(∇_x F_i)_ML = f(x_i′b_ML) b_ML

and by the LPM as

(∇_x F_i)_OLS = b_OLS.
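For reference, the two versions of f(x′β)β can be sketched as follows. This is illustrative Python; f is the logistic density Λ(t)[1 − Λ(t)] in logit and the standard normal density φ(t) in probit.

```python
import math

# Illustrative sketch: marginal effects f(x'b)*b at a given index t = x'b.
def logit_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

def me_logit(t, b):
    F = logit_cdf(t)
    return F * (1.0 - F) * b  # logistic density is F(1-F)

def me_probit(t, b):
    phi = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return phi * b

# At t = 0 the logit ME is b/4 and the probit ME is about 0.399*b, which is one
# way to see why raw logit and probit coefficients are not directly comparable.
print(me_logit(0.0, 1.0), me_probit(0.0, 1.0))
```

The ratio of the two densities at t = 0 is close to the rough coefficient conversion factors mentioned above.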
The post-estimation command margins with the option dydx(varlist ) estimates marginal
effects for each of the variables in varlist . Marginal effects can be estimated at a point x
11.3. COEFFICIENT ESTIMATES AND MARGINAL EFFECTS 212
(conventionally, the sample mean when variables are continuous and in this case the option
atmean must be supplied) or can be averaged over the sample (default).
Next, we present a battery of experiments in probit and logit with margins, starting with
probit marginal effects (ME) evaluated at an interesting counterfactual point:
. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)

. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp

(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
Probit AMEs:
. quietly probit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
. quietly logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
Logit AMEs:
. quietly logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp
(margins output: Delta-method dy/dx table omitted. Note: dy/dx for factor levels is the discrete change from the base level.)
.
Logit and probit AMEs are very close!
11.4. TESTS AND GOODNESS-OF-FIT MEASURES 216
AMEs from the LPM are readily evaluated as the coefficient estimates!
(LPM regression of ins: coefficient table with robust standard errors omitted)
They are quite close to the logit and probit AMEs! This feature of the LPM, producing AMEs that are in line with those of probit and logit, has been noted in Wooldridge (2010).
Parameter restrictions can be tested by Wald tests (test) and LR tests (lrtest). As explained above, hetprobit, besides producing coefficient estimates, provides a heteroskedasticity test.
The most common goodness-of-fit measures reported in logit or probit outputs are the following.

11.4.1. The overall percent correctly predicted (OPCP). Define the 0.5-cutoff predictor ŷ_i of y_i as

ŷ_i = 1 if F(x_i′b) ≥ 0.5,  ŷ_i = 0 else.
OPCP is the number of times ŷ_i = y_i divided by n. A problem with this measure is that it can be high also in cases where the model poorly predicts one outcome. It may be more informative in these cases to compute the percent correctly predicted for each outcome separately: 1) the number of times ŷ_i = y_i = 1 over the number of times y_i = 1, and 2) the number of times ŷ_i = y_i = 0 over the number of times y_i = 0 (it turns out that OPCP is a weighted average of the foregoing two measures).

These measures are obtained through the Stata post-estimation command estat classification (not supported by hetprobit). It reports several statistics, indeed. The percent correctly predicted of successes (y = 1) is referred to as sensitivity and symbolized Pr(+|D); the percent correctly predicted of failures (y = 0) is referred to as specificity and symbolized Pr(−|~D). The OPCP value is reported under the label "correctly classified".
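The three measures are easy to compute by hand. A minimal Python sketch with hypothetical predictions (not the HRS output):

```python
# Illustrative sketch: OPCP, sensitivity and specificity from 0.5-cutoff predictions.
y     = [1, 1, 1, 0, 0, 0, 0, 0]                  # hypothetical outcomes
p_hat = [0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.4, 0.1]  # hypothetical fitted probabilities
y_hat = [1 if p >= 0.5 else 0 for p in p_hat]

sens = sum(1 for a, b in zip(y, y_hat) if a == b == 1) / sum(y)       # Pr(+|D)
spec = sum(1 for a, b in zip(y, y_hat) if a == b == 0) / y.count(0)   # Pr(-|~D)
opcp = sum(1 for a, b in zip(y, y_hat) if a == b) / len(y)
# OPCP is the weighted average of sensitivity and specificity,
# with weights equal to the outcome shares.
print(sens, spec, opcp)
```

With these numbers the outcome shares are 3/8 and 5/8, and OPCP = (3/8)·sensitivity + (5/8)·specificity, which illustrates the weighted-average claim above.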
. estat classification
(classification table omitted: Classified {+, −} by True {D, ~D} counts)
We verify the estat classification results through the Stata command display (used
here as a pocket calculator!)
average probability of the same outcome within a subsample S of interest and its complementary S^c (for example, S = females and S^c = males):

ȳ_S ≡ (1/m_S) Σ_{i∈S} y_i  compared to  p̄_S ≡ (1/m_S) Σ_{i∈S} F(x_i′b)

and

ȳ_{S^c} ≡ (1/m_{S^c}) Σ_{i∈S^c} y_i  compared to  p̄_{S^c} ≡ (1/m_{S^c}) Σ_{i∈S^c} F(x_i′b).
(tabulate ins female output: frequency and column-percentage table omitted)
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .396303 .008885 44.60 0.000 .3788887 .4137172
1 .3770161 .0087613 43.03 0.000 .3598443 .3941878
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .396303 .008885 44.60 0.000 .3788888 .4137173
1 .3770162 .0087613 43.03 0.000 .3598444 .394188
.
Predictions, virtually identical between probit and logit, are quite satisfactory: ȳ_females = 0.35 against p̄_females = 0.38, and ȳ_males = 0.42 against p̄_males = 0.40.
The actual frequencies ȳ_S and ȳ_{S^c} can also be obtained from any binary model saturated for the sample partition S = {S, S^c}. A model is said to be saturated for a sample partition S = {S_1, …, S_M} if it has enough variables to explain the column percentages in each cell of S. In our example, a model is saturated for S if it contains the indicators for S and for S^c, or, equivalently, the indicator for S (or S^c) and the indicator for the whole estimation sample S ∪ S^c, i.e. the constant term.

ȳ_S is always exactly reproduced by a saturated logit model, and with a negligible error by a saturated probit model.
. quietly logit ins i.female i.retire $xlist
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .4241338 .0115715 36.65 0.000 .4014542 .4468135
1 .3466057 .0116829 29.67 0.000 .3237077 .3695038
. margins, by(female)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
female
0 .4236579 .0115466 36.69 0.000 .4010271 .4462888
1 .3450644 .0116708 29.57 0.000 .32219 .3679388
.
. margins
(margins output: Delta-method Margin table omitted)
. estat gof
11.4.5. The pseudo R-squared. The last goodness-of-fit measure we consider is the pseudo R-squared: R̃² = 1 − L(β̂)/L(ȳ), where L(β̂) is the value of the maximized log-likelihood and L(ȳ) is the log-likelihood evaluated for the model with only the intercept.
11.5. NUMERICAL PROBLEMS 224
11.5.1. One-way causation. One-way causation occurs when a category of a binary covariate x predicts one of the outcomes of y perfectly. For example

x = (1, 1, 0, 0, 0)′ and y = (1, 1, 0, 1, 0)′.

As a consequence, there does not exist a finite solution maximizing the likelihood function. This happens because the model is saturated for the x categories and so necessarily ȳ_{x=1} = p̄_{x=1}, exactly in logit, approximately in probit. But since ȳ_{x=1} = 1, then p̄_{x=1} = 1, which holds only if the estimate of the coefficient on x is +∞.
Should, instead, x = 1 predict y = 0 perfectly,

x = (1, 1, 0, 0, 0)′ and y = (0, 0, 1, 1, 0)′,

then p̄_{x=1} = 0, which holds only if the estimate of the coefficient on x is −∞. Also in this case, there does not exist a finite solution maximizing the likelihood function.
Stata solves the problem in two steps. First, it removes the individuals in the x category predicting the outcome perfectly. Then, it removes the variable x, which after discarding those records is constant and so collinear with the intercept.
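The divergence of the likelihood under one-way causation can be verified directly. In the following Python sketch (illustrative; intercept fixed at zero for simplicity), the logit log-likelihood for the first example keeps increasing as the coefficient on x grows, so no finite maximizer exists.

```python
import math

# Illustrative sketch: x = 1 predicts y = 1 perfectly, so the logit log-likelihood
# is strictly increasing in the coefficient b on x.
x = [1, 1, 0, 0, 0]
y = [1, 1, 0, 1, 0]

def loglik(b):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-b * xi))  # logistic success probability
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

print([round(loglik(b), 4) for b in (1.0, 5.0, 10.0)])  # monotonically increasing
```

The observations with x = 0 contribute a constant, while the perfectly predicted x = 1 observations contribute ln σ(b), which increases without bound towards zero as b → +∞.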
.
. **** One-way causation
.
. tabulate foreign repair /// level 1 predicts
> /// failure (domestic)
> /// perfectly.
>
                    repair
Car type        1      2      3   Total
Domestic       10     27      9      46
Foreign         0      3      9      12
Total          10     30     18      58
repair
    1           0  (empty)
    2   -2.197225   .7698003   -2.85   0.004   -3.706005   -.6884436
.
One-way causation occurs also when a pattern of two or more covariates (dummies, or continuous variables with repeated values, such as age) predicts the outcome perfectly.
. list, separator(0) /// pattern x1=0 and x2=0 predicts y=0 perfectly
>
y x1 x2
1. 0 0 0
2. 0 0 0
3. 0 1 0
4. 1 1 0
5. 0 0 1
6. 1 0 1
.
. logit y x1 x2
end of do-file
Stata removes the records perfectly predicted, but then the remaining variables are perfectly collinear with the constant term and convergence fails.

Stata, differently from the case of a single predictor dummy, does not solve the multicollinearity issue by dropping a variable. But it displays a message alerting the user, who can take action by her/himself (see the do-file numerical_problems.do).
x y
1. -2.210688 0
2. -2.195971 0
3. -.5270714 0
4. .8169003 1
5. 1.472326 1
. probit y x
(output omitted)
end of do-file
A case of quasi-separation occurs when x is a great predictor of y, although not a perfect
one.
x y
1. -1.938717 0
2. -1.165771 0
3. -1.06293 0
4. -1.027958 0
5. -.8778074 0
6. -.8016517 0
7. -.1507115 0
8. -.1269327 0
9. -.0573579 1
10. .0050004 0
11. .0515246 1
12. .1676564 1
13. .1892474 1
14. .2048065 1
15. .2656516 1
16. .2677172 1
17. .9665207 1
18. 1.020527 1
19. 1.504177 1
20. 1.85036 1
. probit y x
Numerically the model is sound and Stata carries out estimation.
. logit y x
.
The consequences are a large coefficient estimate on the predictor variable and predicted probabilities that are virtually zero over the smallest values of x and virtually unity over the largest ones. Stata issues a final message illustrating the situation. No intervention is required from the user.

Stata's message says "4 failures and 4 successes completely determined". Let's see why.
. predict p
(option pr assumed; Pr(y))
. sort p
. list p in 1/4
1. 4.38e-18
2. 4.61e-11
3. 3.96e-10
4. 8.23e-10
. list p in 17/20
17. 1
18. 1
19. 1
20. 1
.
p is virtually zero over the smallest 4 values of x and virtually unity over the largest 4
ones.
The first order conditions for the ML estimator are given by the k-equation system

(11.6.1)  Σ_{i=1}^n { [y_i − F(x_i′b)] f(x_i′b) / (F(x_i′b)[1 − F(x_i′b)]) } x_i = 0.

In logit, f(t) = F(t)[1 − F(t)], so (11.6.1) reduces to

(11.6.2)  Σ_{i=1}^n [y_i − F(x_i′b)] x_i = 0.

Then, if the constant is in x_i, Σ_{i=1}^n [y_i − F(x_i′b)] = 0, which explains why in logit

(11.6.3)  ȳ = (1/n) Σ_{i=1}^n F(x_i′b).
If a dummy

d_i = 1 if i ∈ S,  d_i = 0 if i ∈ S^c,

and the constant belong to x_i in logit, then from equations (11.6.2) and (11.6.3)

(11.6.4)  ȳ_S = (1/m_S) Σ_{i∈S} F(x_i′b)  and  ȳ_{S^c} = (1/m_{S^c}) Σ_{i∈S^c} F(x_i′b).
In probit, instead, f(t) ≠ F(t)[1 − F(t)], so the weights f(x_i′b)/{F(x_i′b)[1 − F(x_i′b)]} in (11.6.1) vary across observations. The first order condition for the constant then no longer implies Σ_{i=1}^n [y_i − F(x_i′b)] = 0 exactly, so ȳ (and ȳ_S in a saturated model) is reproduced only approximately.
In the presence of endogenous regressors in the latent regression model, say y_2, one can use an instrumental variable probit estimator. This is implemented by Stata's ivprobit, with a syntax similar to ivregress.

We re-estimate an extended version, with more regressors, of our insurance model, treating the log of head-of-household income as potentially endogenous. We use the individual and the spouse professional statuses (retire and sretire) as instruments. Notice that now that the retire variable is used as an instrument, it has to be removed from the model.
11.8. INDEPENDENT LATENT HETEROGENEITY 234
. global xlist2 female age age2 educyear married hisp white chronic adl hstatusg
(ivprobit output: coefficient table with robust standard errors omitted)
Instrumented: linc
Instruments: female age age2 educyear married hisp white chronic adl hstatusg
sretire retire
Wald test of exogeneity (corr = 0): chi2(1) = 3.51 Prob > chi2 = 0.0610
In the latent regression model (11.2.4) all explanatory variables are observed. But it may be the case that relevant explanatory variables are latent, as allowed by the following model:

y* = x′β + q + ε,  with ε|x ∼ N(0, 1) and, independently, q|x ∼ N(0, σ²).

Then ε + q|x ∼ N(0, 1 + σ²) and

y*/√(1 + σ²) = x′β/√(1 + σ²) + (ε + q)/√(1 + σ²)

is a legitimate probit model. In fact, y*/√(1 + σ²) is latent,

(ε + q)/√(1 + σ²) | x ∼ N(0, 1),

and so

Φ(x′β/√(1 + σ²)) = Pr(y = 1|x).

It follows that we can apply standard probit ML estimation: the resulting estimator is consistent for β/√(1 + σ²), and the implied Φ(x′β̂) is consistent for the response probabilities Pr(y = 1|x).

From the above analysis it clearly emerges that the probit estimator estimates β with a downward bias (Yatchew and Griliches (1985)). Nonetheless, if our interest centers on marginal effects ∇_x Pr(y|q, x) averaged over q (AMEs), E_q[∇_x Pr(y|q, x)], this is no problem.

Indeed, given f(q|x), the conditional density function of q, it is generally true that

Pr(y|x) = ∫ Pr(y|x, q) f(q|x) dq = E_{q|x}[Pr(y|x, q)].

Hence, under mild regularity conditions that permit interchanging integrals and derivatives, ∇_x Pr(y|x) = E_{q|x}[∇_x Pr(y|x, q)].

The above result is important, for it establishes that to estimate Pr(y|x) and ∇_x Pr(y|x) is to estimate E_q[Pr(y|q, x)] and E_q[∇_x Pr(y|q, x)], respectively. So, Φ(x′β̂) is a consistent estimator for E_q[Pr(y|q, x)]; likewise, its gradient φ(x′β̂)β̂ is a consistent estimator of E_q[∇_x Pr(y|q, x)], possibly averaged over the sample.
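The key identity behind this argument, E_q[Φ(a + q)] = Φ(a/√(1 + σ²)) for q ∼ N(0, σ²), can be checked numerically. An illustrative Python sketch using simple trapezoid quadrature (the values of a and σ² below are arbitrary):

```python
import math

# Illustrative numeric check: averaging Phi(a+q) over q ~ N(0, s2) reproduces the
# attenuated-index probit probability Phi(a / sqrt(1 + s2)).
def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def averaged_prob(a, s2, n=20_000, width=8.0):
    # trapezoid rule for the integral of Phi(a+q) * N(0, s2) density over q
    s = math.sqrt(s2)
    lo, hi = -width * s, width * s
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        q = lo + k * h
        dens = math.exp(-0.5 * (q / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        wgt = 0.5 if k in (0, n) else 1.0
        total += wgt * Phi(a + q) * dens
    return total * h

a, s2 = 0.7, 0.64
print(averaged_prob(a, s2), Phi(a / math.sqrt(1.0 + s2)))  # should agree closely
```

This is exactly why the attenuated probit coefficients still deliver consistent estimates of the averaged response probabilities and AMEs.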
11.9. MULTIVARIATE PROBIT MODELS 237

A multivariate probit model consists of m latent regressions

(11.9.1)  y*_j = x′β_j + ε_j,

j = 1, …, m, where β_j, x and ε_j are, respectively, the p × 1 vectors of parameters and explanatory variables, and the error term. Stacking all ε_j's into the vector ε ≡ (ε_1, …, ε_m)′, we assume ε|x ∼ N(0, R). The covariance matrix R is subject to normalization restrictions that will be made explicit below. Equation-specific regressors are accommodated by allowing β_j to have zeroes in the positions of the variables in x that are excluded from equation j. Cross-equation restrictions on the β's are also permitted. R is normalized for scale and so has unity diagonal elements and arbitrary off-diagonal elements, ρ_ij, which allows for possible cross-equation correlation of errors. It may or may not present constraints beyond normalization. If m = 2 we have the bivariate probit model, which is estimated by the Stata command biprobit, with a syntax similar to probit.
Consider the recursive bivariate probit model

y*_1 = x′β_1 + γy_2 + ε_1
(11.9.2)  y*_2 = x′β_2 + ε_2.
It is then evident that estimating a bivariate recursive probit model is ancillary to estimation
of a univariate probit model with a binary endogenous regressor, the first equation of system
(11.9.2).
The feature that makes the recursive multivariate probit model appealing is that it accommodates endogenous, binary explanatory variables without special provisions for endogeneity, simply maximizing the log-likelihood function as if the explanatory variables were all ordinary exogenous variables (see Maddala 1983, Wooldridge 2010, Greene 2012 and, for a general proof, Roodman 2011). This can be easily seen here in the case of the recursive bivariate model, evaluated at y_2 = 1:

Pr(y_1 = 1, y_2 = 1|x) = Pr(ε_1 > −x′β_1 − γ, ε_2 > −x′β_2 | x) = Φ_2(x′β_1 + γ, x′β_2; ρ),

where Φ_2 denotes the bivariate standard normal distribution function with correlation ρ.
1Wooldridge (2010) argues that, although not strictly necessary for formal identification, substantial identification in recursive models may require exclusion restrictions in the equations of interest. For example, in system (11.9.2) substantial identification requires some zeroes in β_1, where the corresponding variables may then be thought of as instruments for y_2.
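As a numerical cross-check of the probability above, Φ_2 can be evaluated with scipy; the helper name biprobit_p11 and all parameter values below are mine, purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def biprobit_p11(x, b1, b2, gamma, rho):
    """Pr(y1 = 1, y2 = 1 | x) = Phi_2(x'b1 + gamma, x'b2; rho)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])  # unit variances, correlation rho
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([x @ b1 + gamma, x @ b2])

x = np.array([1.0, 0.5])                      # includes a constant
b1, b2 = np.array([0.2, 0.3]), np.array([-0.1, 0.4])
p = biprobit_p11(x, b1, b2, gamma=0.5, rho=0.6)
# with rho = 0 the joint probability factors into two univariate probits
p0 = biprobit_p11(x, b1, b2, gamma=0.5, rho=0.0)
```

With ρ = 0 the bivariate normal CDF factors, so p0 equals the product of the two univariate probit probabilities — a useful sanity check on any implementation.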
CHAPTER 12
12.1. Introduction
12.2. Tobit models

The tobit model is a censored regression model: the latent variable

y* = x'β + ε,   ε|x ~ N(0, σ²),

is observed only above a known lower limit L,

y = y* if y* > L,
y = L if y* ≤ L.
Restricting attention to the uncensored observations gives

y = E(y|x, y > L) + u
= E(x'β + ε | x, ε > L − x'β) + u
(12.2.1) = x'β + E(ε | x, ε > L − x'β) + u,
where u = y − E(y|x, y > L). The following results for the density and moments of the truncated normal distribution are useful (see Greene 2012, pp. 874–876):
For z ~ N(μ, σ²) and truncation point a,

f(z|z > a) = (1/σ) φ[(z − μ)/σ] / {1 − Φ[(a − μ)/σ]}
f(z|z < a) = (1/σ) φ[(z − μ)/σ] / Φ[(a − μ)/σ]
E(z|z > a) = μ + σ φ[(a − μ)/σ] / {1 − Φ[(a − μ)/σ]}
E(z|z < a) = μ − σ φ[(a − μ)/σ] / Φ[(a − μ)/σ].
The foregoing equalities are all based on the following representation of the general normal density function, f_(μ,σ²):

f_(μ,σ²)(z) = 1/(σ√(2π)) exp[−(z − μ)²/(2σ²)] = (1/σ) φ[(z − μ)/σ].
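The truncated-mean formulas above are easy to verify against scipy's truncated normal distribution; the helper names and parameter values below are mine.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def trunc_mean_lower(mu, sigma, a):
    """E(z | z > a) for z ~ N(mu, sigma^2): mu + sigma*phi(alpha)/(1 - Phi(alpha))."""
    alpha = (a - mu) / sigma
    return mu + sigma * norm.pdf(alpha) / (1.0 - norm.cdf(alpha))

def trunc_mean_upper(mu, sigma, a):
    """E(z | z < a) for z ~ N(mu, sigma^2): mu - sigma*phi(alpha)/Phi(alpha)."""
    alpha = (a - mu) / sigma
    return mu - sigma * norm.pdf(alpha) / norm.cdf(alpha)

mu, sigma, a = 1.0, 2.0, 0.5
m_lo = trunc_mean_lower(mu, sigma, a)   # E(z | z > a)
m_hi = trunc_mean_upper(mu, sigma, a)   # E(z | z < a)
```

scipy's truncnorm takes standardized truncation points (a − μ)/σ, which is a common source of bugs; the cross-check in the functions above makes the standardization explicit.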
Then, Model (12.2.1) can be written in closed form as

y = x'β + σ φ[(L − x'β)/σ] / {1 − Φ[(L − x'β)/σ]} + u

or, exploiting the symmetry of the normal distribution,

(12.2.2) y = x'β + σ φ[(x'β − L)/σ] / Φ[(x'β − L)/σ] + u,

which for L = 0 reduces to

y = x'β + σ φ(x'β/σ) / Φ(x'β/σ) + u.
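The last closed form (the L = 0 case) states that E(y|x, y > 0) = x'β + σ φ(x'β/σ)/Φ(x'β/σ); a quick Monte Carlo check, with illustrative values of x'β and σ of my choosing:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
xb, sigma = 0.5, 1.0                      # illustrative values of x'beta and sigma
y_star = xb + sigma * rng.normal(size=2_000_000)
sim_mean = y_star[y_star > 0].mean()      # E(y* | y* > 0) by simulation
closed_form = xb + sigma * norm.pdf(xb / sigma) / norm.cdf(xb / sigma)
```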
12.2.1. Estimation. There is a random sample {y_i, x_i}, i = 1, ..., n, for estimation. Let d_i = 1(y_i > L). Estimation can be via ML or two-step LS.

The log-likelihood function assembles the density functions peculiar to the subsample of individuals with d_i = 1 and those peculiar to individuals with d_i = 0 (left-censored). For an individual with d_i = 1, y_i = y*_i and we know that y_i|x_i ~ N(x'_iβ, σ²), so we can evaluate the density (1/σ) φ[(y_i − x'_iβ)/σ]; for an individual with d_i = 0 we only know that Pr(y*_i ≤ L|x_i) = Φ[(L − x'_iβ)/σ]. Hence,

lnL = Σ_i { d_i ln[(1/σ) φ((y_i − x'_iβ)/σ)] + (1 − d_i) ln Φ((L − x'_iβ)/σ) }.

The two-step estimator exploits Equation (12.2.2). The first step applies probit of d_i on x_i, which consistently estimates β/σ (L/σ is subsumed in the constant term). In the second step, apply OLS regression of y_i on x_i and the estimated ratio φ̂_i/Φ̂_i, restricting to the uncensored subsample d_i = 1. β̂_2step is consistent, but standard errors need to be adjusted, since the second step includes an estimated regressor.
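The tobit log-likelihood can also be maximized directly; the following is a sketch on simulated data (all names and parameter values are mine; in practice one would use Stata's tobit):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, L = 2000, 0.0
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([1.0, 2.0]), 1.5
y_star = x @ beta_true + sigma_true * rng.normal(size=n)
y = np.maximum(y_star, L)                   # left-censoring at L
d = (y_star > L).astype(float)              # d_i = 1(y_i > L)

def neg_loglik(theta):
    b, s = theta[:2], np.exp(theta[2])      # parameterize log(sigma) so s > 0
    xb = x @ b
    ll = d * (norm.logpdf((y - xb) / s) - np.log(s)) \
        + (1.0 - d) * norm.logcdf((L - xb) / s)
    return -ll.sum()

res = minimize(neg_loglik, np.zeros(3), method="BFGS")
b_ml, s_ml = res.x[:2], np.exp(res.x[2])
```

Note the two pieces of the likelihood match the text: the normal density for uncensored observations and the normal CDF of the censoring event for the rest.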
Upper limits can be dealt with similarly:

y = y* if y* < U,
y = U if y* ≥ U.
The Stata command that computes β̂_ML in the tobit model is tobit. The syntax is similar to regress, requiring in addition options specifying the lower limit, ll(#), and the upper limit, ul(#) (the two-step estimator is implemented by the Stata command heckman, see below).
Marginal effects of interest are ∂E(y|x)/∂x = β Φ[(x'β − L)/σ] for the observed response and ∂E(y*|x)/∂x = β for the latent variable.
12.2.2. Heteroskedasticity and clustering. The same considerations made for binary models in Sections 11.2.4 and 11.2.5 hold here. While heteroskedasticity breaks down the specification of the conditional expectations, clustering does not. Therefore, it makes sense to apply the Stata option vce(cluster clustervar).
12.3. Sample selection

We consider cases of incidental truncation, that is, when the values of the dependent variable are observed only depending on the values of some other variable. For example, in a wage equation the wage is observed only for individuals who are employed. The employment-status indicator acts as a selection rule: the wage is observed only when the indicator equals one.
12.3.1. Incidental truncation in the linear model. There are two processes: the first selects the units into the sample, the second generates y. If the two processes are related, selection is endogenous and as such cannot be ignored.

The selection process:

s* = z'γ + η,
s = 1(s* > 0).

The outcome process:

y = x'β + ε,

with y observed only if s = 1. Interest is on β. Assume

(η, ε)'|z, x ~ N( (0, 0)', [1, ρσ; ρσ, σ²] ).

The two processes are related if ρ ≠ 0. In this case selection would be endogenous and could not be ignored.
Estimation is via ML. The log-likelihood is

lnL = Σ_{i=1}^{n} { s_i ln[f(y_i|s_i = 1) Pr(s_i = 1)] + (1 − s_i) ln[Pr(s_i = 0)] }.
The Stata command that computes β̂_ML in the selection model is heckman, named after the American econometrician who introduced these methods into econometrics. The syntax is similar to regress, requiring in addition a (required) option specifying the list of variables in the selection process, s and z: select(varlist_s). For substantial identification the selection equation should contain at least one variable that is not in the outcome equation.
Heckman also developed a two-step estimator that is asymptotically equivalent to the ML. In fact, a model of the truncated mean analogous to Equation (12.2.1) can be worked out:

y = x'β + ρσ φ(z'γ)/Φ(z'γ) + u.

The first step applies probit of s on z, yielding a variable of consistent estimates of the inverse Mills ratio values (also known as the non-selection hazard):

imr̂_i = φ(z'_iγ̂)/Φ(z'_iγ̂).

The second step applies OLS of y on x and imr̂ over the selected subsample s_i = 1.
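The two steps can be sketched on simulated data; all names, the instrument, and parameter values are mine, and the sketch omits the standard-error correction that Stata's heckman performs.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 4000
z = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
x = z[:, :2]                                  # exclusion restriction: z adds one instrument
gamma = np.array([0.3, 1.0, 1.0])
beta, sigma, rho = np.array([1.0, 2.0]), 1.0, 0.5
eta = rng.normal(size=n)
eps = sigma * (rho * eta + np.sqrt(1 - rho**2) * rng.normal(size=n))
s = (z @ gamma + eta > 0).astype(float)       # selection indicator
y = x @ beta + eps                            # observed only when s == 1

# Step 1: probit of s on z by ML, then the inverse Mills ratio (non-selection hazard)
nll = lambda g: -(s * norm.logcdf(z @ g) + (1 - s) * norm.logcdf(-(z @ g))).sum()
g_hat = minimize(nll, np.zeros(3), method="BFGS").x
imr = norm.pdf(z @ g_hat) / norm.cdf(z @ g_hat)

# Step 2: OLS of y on x and the estimated inverse Mills ratio, selected subsample only
W = np.column_stack([x, imr])[s == 1]
coef, *_ = np.linalg.lstsq(W, y[s == 1], rcond=None)
b_2step, rho_sigma_hat = coef[:2], coef[2]    # last coefficient estimates rho * sigma
```

The coefficient on the inverse Mills ratio estimates ρσ, so a test of its significance is an informal test of selection endogeneity.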
12.3.2. Incidental truncation in probit. Incidental truncation may also affect binary
models. The selection process is exactly as for heckman:
s* = z'γ + η,
s = 1(s* > 0).
12.3.3. Incidental truncation in linear panel data models. The methods imple-
mented by heckman are not feasible for linear panel data models, either ML or two-step, if
latent-heterogeneity components affect both the model of interest and the selection rule. The
reason is simple and boils down to an incidental-parameter bias in the probit selection rule.
An alternative procedure suggested by Wooldridge (2010) can be implemented. Details are in the dofile sample_selection_pd.do.
NB: Standard-error estimates must be corrected; Wooldridge provides the analytical formulas for this. Alternatively, one can estimate standard errors by bootstrap. In the dofile bs_sample_selection_pd.do, I follow the latter strategy and implement bootstrap standard errors.
12.4. Attrition: inverse probability weighting

12.4.1. Concepts and assumptions. Attrition occurs in panel data when individuals drop out of the sample at subsequent dates. Attrition is non-ignorable when dropouts are related to latent elements in the model of interest. Inverse probability weighting (IPW) is a general way to deal with non-ignorable attrition and can be used in non-linear models.
Consider a panel data framework and the selection variable

s_it = 1 if (i, t) is observed,
s_it = 0 otherwise.
Consider also the set of variables that may potentially affect attrition, v_it = (y_it, x'_it, z'_it)', where z_it is a vector of variables that are external to the model, not necessarily exogenous. Setting z_it = (y_{i,t−1}, x'_{i,t−1})' (see Moffitt et al. 1999; Wooldridge 2010) ensures that z_it is observed for dropouts at occasion t. This also seems quite plausible, as attrition is often closely linked to individuals changing location.
We assume that the whole history of v_it, v_i = (v'_i1, ..., v'_iT)', is redundant in explaining attrition at occasion t, once z_it is accounted for in the conditional probability of s_it = 1:

Pr(s_it = 1 | v_i, s_{i,t−1} = 1) = Pr(s_it = 1 | z_it, s_{i,t−1} = 1),   t = 2, ..., T.
Given this assumption, the probabilities for use in IPW can be constructed sequentially: for all i = 1, ..., N, set p_i1 = 1 and p_it = ∏_{r=2}^{t} π_ir, where π_it = Pr(s_it = 1 | z_it, s_{i,t−1} = 1), t = 2, ..., T.
Estimates of p_it, p̂_it, are then used for IPW of the conditional log-likelihood functions as follows:

l(θ) = Σ_{i=1}^{N} Σ_{t=1}^{T} (s_it/p̂_it) l_it(θ).
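The sequential construction of p̂_it and the weights s_it/p̂_it can be illustrated with two units and T = 4; the fitted π̂_it values below are made up.

```python
import numpy as np

# pi_hat[i, t-2] = fitted Pr(s_it = 1 | z_it, s_i,t-1 = 1) for t = 2, 3, 4
pi_hat = np.array([[0.9, 0.8, 0.7],
                   [0.6, 0.5, 0.4]])
# p_i1 = 1 by construction; p_it is the running product of the pi_ir's
p_hat = np.column_stack([np.ones(2), np.cumprod(pi_hat, axis=1)])

s = np.array([[1, 1, 1, 0],    # unit 1 attrites at t = 4
              [1, 1, 0, 0]])   # unit 2 attrites at t = 3
w = np.where(s == 1, s / p_hat, 0.0)   # IPW weights s_it / p_hat_it
```

Each observed (i, t) cell is up-weighted by the inverse of its estimated probability of still being in the sample; unobserved cells get zero weight and drop out of the weighted log-likelihood.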
Bibliography

Abowd, J. M., Kramarz, F., Margolis, D. N., 1999. High wage workers and high wage firms. Econometrica 67, 251–333.
Anderson, T. W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 570–606.
Andrews, D. W. K., Moreira, M. J., Stock, J. H., 2007. Performance of conditional Wald tests in IV regression with weak instruments. Journal of Econometrics 139, 116–132.
Angrist, J. D., Pischke, J. S., 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics 49 (4), 431–434.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–297.
Baltagi, B. H., 2008. Econometric Analysis of Panel Data. New York: Wiley.
Blundell, R., Bond, S., 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–143.
Bowsher, C. G., 2002. On testing overidentifying restrictions in dynamic panel data models. Economics Letters 77, 211–220.
Bruno, G. S. F., 2005a. Approximating the bias of the LSDV estimator for dynamic unbalanced panel data models. Economics Letters 87, 361–366.
Bruno, G. S. F., 2005b. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. The Stata Journal 5, 473–500.
Bun, M. J. G., Kiviet, J. F., 2003. On the diminishing returns of higher order terms in asymptotic expansions of bias. Economics Letters 79, 145–152.
Cameron, A. C., Gelbach, J. B., Miller, D. L., 2011. Robust inference with multiway clustering. Journal of Business & Economic Statistics 29, 238–249.
Cameron, A. C., Trivedi, P. K., 2010. Microeconometrics Using Stata - Revised Edition. Stata Press, College Station, TX.
Cappellari, L., Jenkins, S. P., 2003. Multivariate probit regression using simulated maximum likelihood. The Stata Journal 3, 278–294.
Cragg, J., Donald, S., 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9, 222–240.
Entorf, H., 2012. Expected recidivism among young offenders: Comparing specific deterrence under juvenile and adult criminal law. European Journal of Political Economy 28, 414–429.
Evans, W. N., Schwab, R. M., 1995. Finishing high school and starting college: Do Catholic schools make a difference? The Quarterly Journal of Economics 110, 941–974.
Fichera, E., Sutton, M., 2011. State and self investment in health. Journal of Health Economics 30, 1164–1173.
Greene, W. H., 1998. Gender economics courses in liberal arts colleges: Further results. Journal of Economic Education 29, 291–300.
Greene, W. H., 2008. Econometric Analysis, sixth Edition. Upper Saddle River, NJ: Prentice
Hall.
Greene, W. H., 2012. Econometric Analysis, seventh Edition. Upper Saddle River, NJ: Prentice
Hall.
Hansen, L. P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029–1054.
Hausman, J., 1978. Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J. A., Taylor, W., 1981. Panel data models and unobservable individual effects. Econometrica 49, 1377–1398.
Searle, S. R., 1982. Matrix Algebra Useful for Statistics. New York: Wiley.
Stock, J. H., Watson, M. W., 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76, 155–174.
Stock, J. H., Yogo, M., 2005. Testing for weak instruments in linear IV regression. In: Andrews, D., Stock, J. H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge: Cambridge University Press, pp. 80–108.
Swamy, P. A. B., Arora, S. S., 1972. The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40 (2), 261–275.
White, H., 2001. Asymptotic Theory for Econometricians, revised Edition. Emerald.
Windmeijer, F., 2005. A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics 126, 25–51.
Wooldridge, J. M., 2005a. Unobserved heterogeneity and estimation of average partial effects.
In: Andrews, D. W. K., Stock, J. H. (Eds.), Identification And Inference For Econometric
Models: Essays In Honor Of Thomas Rothenberg. Cambridge University Press, New York.
Wooldridge, J. M., 2005b. Violating ignorability of treatment by controlling for too many factors. Econometric Theory 21, 1026–1028.
Wooldridge, J. M., 2010. Econometric Analysis of Cross Section and Panel Data, 2nd Edition.
The MIT Press, Cambridge, MA.
Yatchew, A., Griliches, Z., 1985. Specification error in probit models. Review of Economics and Statistics 67, 134–139.
Zyskind, G., 1967. On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Annals of Mathematical Statistics 36, 1092–1109.