Nothing Special   »   [go: up one dir, main page]

Causal Inference and Research Design Scott Cunningham (Baylor)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 1056

Causal Inference and Research Design

Scott Cunningham (Baylor)

Figure: xkcd
Where to find this material

A lot of this material is drawn from my book,


Causal Inference: The Mixtape, which you can download from
my website www.scunning.com

Cunningham Causal Inference


Structure and Assessment

The fundamental theme linking all lectures will be the estimation of


causal effects
Part 1 covers “the core” of applied econometrics, including
hidden curriculum
Part 2 covers causality foundations like potential outcomes and
DAGs
Part 3 covers contemporary research designs
Stata and R

A secondary goal of the workshop is to provide you with


programming examples in Stata and R for implementing some but
not all of the procedures we’ll cover
R and Stata code are provided many procedures (with more to
come)
I wrote the Stata and had the written by my RAs reviewed by
a third exceptional student
Programs and data can be downloaded from my GitHub
repository (https://github.com/scunning1975/mixtape)
Textbooks

Helpful Textbooks imho


1 Cunningham (2018) (Mixtape) (under contract with Yale, but
I can’t share the new version yet – this deck is the closest
thing to it)
2 Angrist and Pischke (2009) Mostly Harmless Econometrics
(MHE)
3 Morgan and Winship (2014) Counterfactuals and Causal
Inference (MW)
Readings

Readings:
We will also discuss a number of papers in each lecture, each
of which you will need to learn inside and out.
Lecture slides and reading lists are available
Key literature is contained in the shared dropbox folder which
I’ll distribute beforehand
About me

Professor of economics at Baylor (Waco Texas),


Graduated in 2007 from University of Georgia with a field in
econometrics, IO, public, and labor field courses
I knew I was going to be an empiricist, so I made econometrics
my main field – passed field exam on second attempt
Since graduating I’ve focused on topics in crime and risky sex
such as sex work, drug policy, abortion, mental healthcare.
I knew I couldn’t achieve my goals without learning causal
inference which I could tell I had only a vague understanding of
This is because causal inference isn’t taught historically in
traditional econometrics

Cunningham Causal Inference


Sad story (to me!)

Once upon a time there was a boy who wrote a job market
paper using the NLSY97.
This boy presented the findings a half dozen times, spoke to
the media a few times, got 17 interviews at the ASSA, 7
flyouts, and an offer from Baylor
He submitted the job market paper to the Journal of Human
Resources, a top field journal in labor, and received a “revise
and resubmit” request from the editor (woo hoo!)
The horror!

But then digging into his one directory, he found countless


versions of his do file and hundreds of files with random names
And once he finally was able to get the code running again, he
found a critical coding error that when corrected (“destroyed”)
his results
The young boy was devastated and never resubmitted which
he does not recommend (but he was sad!)
All competent empirical work is a mousetrap

“Happy families are all alike; every unhappy family is unhappy in its
own way.” - Leo Tolstoy, Anna Karenina

“Good empirical work is all alike; every bad empirical work is bad in
its own way.” - Scott Cunningham, This slide

Cunningham Causal Inference


Cunningham Empirical Workflow Conjecture

The cause of most of your errors is not due to insufficient


knowledge of syntax in your chosen programming language
The cause of most of your errors is due to a poorly designed
empirical workflow
Workflow

Wikipedia definition:
“A workflow consists of an orchestrated and repeatable
pattern of activity, enabled by the systematic organization
of resources into processes that transform materials,
provide services, or process information.”
Dictionary definition:
“the sequence of industrial, administrative, or other
processes through which a piece of work passes from
initiation to completion.”
Empirical workflow

Workflow is a fixed set of routines you bind yourself to which


when followed identifies the most common errors
Think of it as your morning routine: alarm goes off, go to
wash up, make your coffee, check Twitter, repeat ad infinitum
Finding the outlier errors is a different task; empirical
workflows catch typical and common errors created by the
modal data generating processes
Why do we use checklists?

Before going on a trip, you use a checklist to make sure you


have everything you need
Charger (check), underwear (check), toothbrush (check),
passport (oops), . . .
The empirical checklist is solely referring to the intermediate
step between “getting the data” and “analyzing the data”
It largely focuses on ensuring data quality for the most
common, easiest to identify, situations you’ll find yourself in
Simple checks

Your checklist should be a few simple, yet non-negotiable,


programming commands and exercises to check for coding
errors
Let’s discuss a few
Time

People often think empirical research is about “getting the


data” and “analyzing the data”
They have an “off to the races” mindset
Just like running a marathon involves far far more time
training than you ever spend running the marathon, doing
empirical research involves far far more time doing tedious,
repetitive tasks
Since you do the tedious tasks repeatedly, they have the most
potential for error which can be catastrophic
How can we minimize these errors through a checklist?
Figure: Image from Wenfei Xu at Columbia
Read the codebook

We stand on the shoulders of giants


Few like reading the codebook as it is not gripping literature
But the codebook explains how to interpret the data you have
acquired and it is not a step you can skip
Set aside time to study it, and have it in a place where you
can regularly return to it
This goes for the readme that accompanies some datasets,
too.
Look at the data

The eyeball is not nearly appreciated enough for its ability to


spot problems
Use browse or excel to just read the spreadsheet with your
eyes.
Scroll through the variables and accompany yourself with what
you’ve got visually
Missing observations

Check the size of your dataset in Stata using count


Check the number of observations per variable in Stata using
summarize
String variables will always report zero observations under
summarize so count if X=="" will work
Use tabulate also because oftentimes missing observations
are recorded with a −9 or some other illogical negative value
Missing years

Panel data can be overwhelming bc looking at each


state/city/firm/county borders on the impossible
Start with collapse to the national level by year and simply
list to see if anything looks strange
What’s “strange” look like?
Well wouldn’t it be strange if national unemployment rates
were zero in any year?
You can use xtline to see time series for panel identifiers,
with or without the subcommand of overlay
Panel observations are N × T

Say you have 51 state units (50 states plus DC) and 10 years
51 × 10 = 510 observations
If you do not have 510 observations, then you have an
unbalanced panel; if you have 510 observations you have a
balanced panel
Check the patterns using xtdescribe and simple counting
tricks
Merge

During a stage of arranging datasets, you will likely merge –


oftentimes a lot
Make sure you count before and after you merge so you can
figure out what went wrong, if anything
Also make sure you’re using the contemporary m:m syntax as
many an excellent empiricists have been hurt by merge syntax
errors
Don’t forget the question

“Exploring the data” is intoxicating to the point of distracting


“All you can do is write the best paper on the question you’re
studying” – Mark Hoekstra
Note he didn’t say “Write the best paper you’re capable of
writing”
He said the best paper
Important therefore to choose the right questions with real
upside
Slow down, think big picture, force yourself to figure out
exactly what your question is, who is in your sample (and
importantly who won’t be) and what time periods you’ll pull
Organize your directories

After the coding error fiasco, I spent a lot of time wondering


how this could happen
I decided it was partly because of four problems related to
1 organized subdirectories
2 automation
3 naming conventions
4 version control
I’ll discuss each but I highly recommend that you just read
Gentzkow and Shapiro’s excellent resource “Code and Data for
the Social Sciences: A Practitioner’s Guide” https://web.
stanford.edu/~gentzkow/research/CodeAndData.pdf

Cunningham Causal Inference


No correct organization

There is no correct way to organize your directories,


But all competent empiricists have adopted an intentional
philosophy of how to organize their directories
Why? Because you’re writing for your future self, and your
future self is lazy, distracted, disinterested and busy
Directories

The typical applied micro project may have hundreds of files of


various type and will take years just to finish not including
time to publication
So simply finding the files you need becomes more difficult if
everything is stored in the same place
When I start a new project, the first thing I do is create the
following directories

Cunningham Causal Inference


Subdirectory organization

1) Name the project (“Texas”)


Subdirectory organization

2) A subdirectory for all articles you cite in the paper


Subdirectory organization

3) Data subdirectory containing all datasets


Subdirectory organization

4) A subdirectory for all do files and log files


Subdirectory organization

5) All figures produced by Stata or image files


Subdirectory organization

6) Project-specific heterogeneity (e.g., “Inference”, “Grants”,


“Interview notes”, “Presentations”, “Misc”)
Subdirectory organization

7) All tables generated by Stata (e.g., .tex tables produced by


-estout-)
Subdirectory organization

8) A subdirectory reserved only for writing


Always use scripting programs NOT GUI

Guess what - your future self doesn’t even remember making


do files, tables or figures, let alone typing into GUI command
line
Therefore throw her a bone, hold her hand and walk her
exactly through everything
Which means you’ve got to have replicable scripting files*
* Sure, sometimes use the the command line for messing
around
But then put that messing around in the program

Cunningham Causal Inference


Good text editor

Remember: the goal is to make beautiful programs


Invest in a good text editor which has bundling capabilities
that will integrate with Stata, R or LaTeX
I use Textmate 2 because I use a Mac and in addition to a
Stata and R bundle, it also allows for column editing
PC users tend to love Sublime for the same reasons
Stata and Rstudio also come with built-in text editors, which
use slick colors for various types of programming commands
Headers
Speak clearly

“Be conservative in what you do; be liberal in what you accept from
others.” - Jon Postel

Smart sounding quote about both programming and


relationships
Your future self is time constrained, so explain everything to
her as well as write clear code
Optimally document your programs
But speak your future self’s love language so she understands
Automating Tables and Figures

Your goal is to make “beautiful tables” that are never edited


post-production as well as readable on their own
Large fixed costs learning commands like -estout- or -outreg2-:
incur them bc marginal costs are zero
I use -estout- because Jann has written an excellent help file at
http://repec.org/bocode/e/estout/hlp_esttab.html
but many like -outreg2-
Learn -twoway- and/or -ggplot2- and make “beautiful pictures”
too
Different elements

When I found my error, and after I regained my exposure, I


eventually developed a system of naming
1 variables,
2 datasets, and
3 do files
As these are the three things you repeatedly use, you need to
have a system, even if not mine

Cunningham Causal Inference


Naming conventions for variables

Variables should be readable to a stranger


Say that you want to create the product of two variables.
Name it the two variables with an underscore
gen price_mpg = price * mpg
Otherwise name the variable exactly what it is
gen bmi = weight / (height^2 * 703)
Avoid meaningless words (e.g., lmb2), dating (e.g.,
temp05012020) and numbering (e.g., outcome25) as your
future self will be confused
Naming datasets and do files

The overarching goal is always to name things so that a


stranger seeing them can know what they are
One day you will be the stranger on your own project! Make it
easy on your future self!
Choose some combination of simplicity and clarity but
whatever you do, be consistent
Avoid numbering datasets unless the numbers correspond to
some meaningful thing, like randomization inference where
each file is a set of coefficients and numbered according to
FIPS index
Version control

People swear by git, particularly Gentzkow and Shapiro


I use Dropbox, and have for years. They have some version
history for instance, though I’m not sure if it compares to git’s
capabilities.
I’m slowly learning git and use git Tower, but many use the
command line in Terminal
Ideally your system allows you to revert to earlier versions
without having ten billion files with names like
prison_03102019_sc.do, etc.
Selling your work

If you don’t advocate for your work, no one will.


Network, network, network
You will need to become an expert in 1.5 areas, and you will
need experts in those 1.5 areas to agree
Study the effective of rhetoric of successful economists who
expertly communicate their work to others both in their
writing of the actual manuscript, as well as the presentation
and promotion of their work

Cunningham Causal Inference


Find your mentors and sponsors

Working with senior people at some point becomes necessary


Good news: many senior people want to help you
Bad news: they don’t know who you are and can’t find you
It’s a two sided matching problem
Introduce yourself in socially appropriate ways!
Al Roth story

I wrote Al Roth in 2007 and like Robert Browning to Elizabeth


Barrett introduced myself by saying “I love your book on
twosided matching with Sotomayor with all my heart.”
We became pen pals and then he won the Nobel Prize
Scared, I wrote to congratulate him on the day he won and he
immediately asked to help me
“Interpersonal favors are meant to be paid forward not
backwards” - Roth to me after a second favor!
Nobody can help you if you don’t know them bc help,
sponsorship and mentoring is a two sided matching problem
More readings

I’ve put several deck of slides and helpful articles for you in the
dropbox folder
Jesse Shapiro’s “How to Present an Applied Micro Paper”
Gentzkow and Shapiro’s coding practices manual
Rachael Meager on presenting as an academic
Ljubica “LJ” Ristovska’s language agnostic guide to
programming for economists
Grant McDermott on Version Control using Github
https://raw.githack.com/uo-ec607/lectures/master/
02-git/02-Git.html#1
Data Visualization

Every project should present compelling graphics summarizing the


main results and main takeaway
Study other people’s pictures and get help from experts
1 Kieran Healy’s 2018 Visualization: A Practical Introduction
(Princeton University Press); free version is
http://socviz.co/index.html#preface.
2 Ed Tufte’s book Visual display of quantitative information is
classic, but more a coffee table book plus no programming
assistance.
Learn Stata’s -twoway- capabilities and/or R’s -ggplot2-
Introduction: OLS Review

Derivation of the OLS estimator


Algebraic properties of OLS
Statistical Properties of OLS
Variance of OLS and standard errors

Cunningham Causal Inference


Foundations of scientific knowledge

Scientific methodologies are the epistemological foundation of


scientific knowledge
Science does not collect evidence in order to “prove” what
people already believe or want others to believe
Science accepts unexpected and even undesirable answers
Science is process oriented, not outcome oriented
Terminology
y x
Dependent Variable Independent Variable
Explained Variable Explanatory Variable
Response Variable Control Variable
Predicted Variable Predictor Variable
Regressand Regressor
LHS RHS

The terms “explained” and “explanatory” are probably best, as they


are the most descriptive and widely applicable. But “dependent”
and “independent” are used often. (The “independence” here is not
really statistical independence.)
We said we must confront three issues:
1 How do we allow factors other than x to affect y ?
2 What is the functional relationship between y and x?
3 How can we be sure we are capturing a ceteris paribus
relationship between y and x?
We will argue that the simple regression model

y = β0 + β1 x + u (1)
addresses each of them.
Simple linear regression model

The simple linear regression (SLR) model is a population


model.
When it comes to estimating β1 (and β0 ) using a random
sample of data, we must restrict how u and x are related to
each other.
What we must do is restrict the way u and x relate to each
other in the population.
The error term

We make a simplifying assumption (without loss of generality): the


average, or expected, value of u is zero in the population:

E (u) = 0 (2)
where E (·) is the expected value operator.
The intercept

The presence of β0 in

y = β0 + β1 x + u (3)

allows us to assume E (u) = 0. If the average of u is different from


zero, say α0 , we just adjust the intercept, leaving the slope the
same:
y = (β0 + α0 ) + β1 x + (u − α0 ) (4)
where α0 = E (u). The new error is u − α0 and the new intercept is
β0 + α0 . The important point is that the slope, β1 , has not
changed.
Mean independence of the error term

An assumption that meshes well with our introductory treatment


involves the mean of the error term for each “slice” of the
population determined by values of x:

E (u|x) = E (u), all values x (5)


where E (u|x) means “the expected value of u given x”.
Then, we say u is mean independent of x.
Distribution of ability across education

Suppose u is “ability” and x is years of education. We need,


for example,

E (ability |x = 8) = E (ability |x = 12) = E (ability |x = 16)

so that the average ability is the same in the different portions


of the population with an 8th grade education, a 12th grade
education, and a four-year college education.
Because people choose education levels partly based on ability,
this assumption is almost certainly false.
Zero conditional mean assumption

Combining E (u|x) = E (u) (the substantive assumption) with


E (u) = 0 (a normalization) gives the zero conditional mean
assumption.

E (u|x) = 0, all values x (6)


Population regression function

Because the conditional expected value is a linear operator,


E (u|x) = 0 implies

E (y |x) = β0 + β1 x (7)
which shows the population regression function is a linear
function of x.
The straight line in the graph on the next page is what
Wooldridge calls the population regression function, and
what Angrist and Pischke call the conditional expectation
function
E (y |x) = β0 + β1 x
The conditional distribution of y at three different values of x
are superimposed. for a given value of x, we see a range of y
values: remember, y = β0 + β1 x + u, and u has a distribution
in the population.
Deriving the Ordinary Least Squares Estimates

Given data on x and y , how can we estimate the population


parameters, β0 and β1 ?
Let {(xi , yi ) : i = 1, 2, ..., n} be a random sample of size n
(the number of observations) from the population.
Plug any observation into the population equation:

yi = β0 + β1 xi + ui (8)

where the i subscript indicates a particular observation.


We observe yi and xi , but not ui (but we know it is there).
We use the two population restrictions:

E (u) = 0
Cov (x, u) = 0

to obtain estimating equations for β0 and β1 . We talked about the


first condition. The second condition means that x and u are
uncorrelated. Both conditions are implied by E (u|x) = 0
With E (u) = 0, Cov (x, u) = 0 is the same as E (xu) = 0. Next we
plug in for u:

E (y − β0 − β1 x) = 0
E [x(y − β0 − β1 x)] = 0

These are the two conditions in the population that effectively


determine β0 and β1 .
So we use their sample counterparts (which is a method of
moments approach to estimation):

n
X
−1
n (yi − β̂0 − β̂1 xi ) = 0
i=1
n
X
−1
n xi (yi − β̂0 − β̂1 xi ) = 0
i=1

where β̂0 and β̂1 are the estimates from the data.
These are two linear equations in the two unknowns β̂0 and β̂1 .
Pass the summation operator through the first equation:
n
X
−1
n (yi − β̂0 − β̂1 xi ) (9)
i=1
n
X n
X n
X
−1 −1 −1
=n yi − n β̂0 − n β̂1 xi (10)
i=1 i=1 i=1
n n
!
X X
= n−1 yi − β̂0 − β̂1 n−1 xi (11)
i=1 i=1

= y − β̂0 − β̂1 x (12)


We use the standard notation y = n−1 ni=1 yi for the average of
P
the n numbers {yi : i = 1, 2, ..., n}. For emphasis, we call y a
sample average.
We have shown that the first equation,
n
X
n−1 (yi − β̂0 − β̂1 xi ) = 0 (13)
i=1

implies

y = β̂0 + β̂1 x (14)


Now, use this equation to write the intercept in terms of the slope:

β̂0 = y − β̂1 x (15)


Plug this into the second equation (but where we take away the
division by n):
n
X
xi (yi − β̂0 − β̂1 xi ) = 0 (16)
i=1
so
n
X
xi [yi − (y − β̂1 x) − β̂1 xi ] = 0 (17)
i=1

Simple algebra gives


n n
" #
X X
xi (yi − y ) = β̂1 xi (xi − x) (18)
i=1 i=1
So, the equation to solve is
n n
" #
X X
(xi − x)(yi − y ) = β̂1 (xi − x)2 (19)
i=1 i=1
Pn
If i=1 (xi − x)2 > 0, we can write

Pn
(x − x)(yi − y ) Sample Covariance(xi , yi )
Pn i
β̂1 = i=1 2
= (20)
i=1 (xi − x) Sample Variance(xi )
OLS

The previous formula for β̂1 is important. It shows us how to


take the data we have and compute the slope estimate.
β̂1 is called the ordinary least squares (OLS) slope estimate.
It can be computed whenever the sample variance of the xi is
not zero, which only rules out the case where each xi has the
same value.
The intuition is that the variation in x is what permits us to
identify its impact on y .
Solving for βb

Once we have β̂1 , we compute β̂0 = y − β̂1 x. This is the OLS


intercept estimate.
These days, we let the computer do the calculations, which are
tedious even if n is small.
Predicting y

For any candidates β̂0 and β̂1 , define a fitted value for each i
as

ŷi = β̂0 + β̂1 xi (21)


We have n of these.
ŷi is the value we predict for yi given that x = xi and β = β̂.
The residual

The “mistake” from our prediction is called the residual:

ûi = yi − ŷi
= yi − β̂0 − β̂1 xi

Suppose we measure the size of the mistake, for each i, by


squaring it. Then we add them all up to get the sum of
squared residuals

n
X n
X
ûi2 = (yi − β̂0 − β̂1 xi )2
i=1 i=1

Choose β̂0 and β̂1 to minimize the sum of squared residuals


which gives us the same solutions we obtained before.
Algebraic Properties of OLS Statistics
Remembering how the first moment condition allows us to obtain
β̂0 and β̂1 , we have:
n
X
(yi − β̂0 − β̂1 xi ) = 0 (22)
i=1

Notice the logic here: this means the OLS residuals always add up
to zero, by construction,
n
X
ûi = 0 (23)
i=1

Because yi = ŷi + ûi by definition,


n
X n
X n
X
n−1 yi = n−1 ŷi + n−1 ûi (24)
i=1 i=1 i=1

and so y = ŷ .
Second moment

Similarly the way we obtained our estimates,


n
X
n−1 xi (yi − β̂0 − β̂1 xi ) = 0 (25)
i=1

The sample covariance (and therefore the sample correlation)


between the explanatory variables and the residuals is always zero:
n
X
n−1 xi ûi = 0 (26)
i=1
Bringing things together

Because the ŷi are linear functions of the xi , the fitted values and
residuals are uncorrelated, too:
n
X
n−1 ŷi ûi = 0 (27)
i=1
Averages

A third property is that the point (x, y ) is always on the OLS


regression line. That is, if we plug in the average for x, we predict
the sample average for y :

y = β̂0 + β̂1 x (28)


Again, we chose the estimates to make this true.
Expected Value of OLS
Mathematical statistics: How do our estimators behave across
different samples of data? On average, would we get the right
answer if we could repeatedly sample?
We need to find the expected value of the OLS estimators – in
effect, the average outcome across all possible random samples
– and determine if we are right on average.
Leads to the notion of unbiasedness, which is a “desirable”
characteristic for estimators.

E (β̂) = β (29)
Don’t forget why we’re here

Plato’s allegory of the cave - reality is outside the cave, the


reflections on the wall are our estimates of that reality.
The population parameter that describes the relationship
between y and x is β1
For this class, β1 is a causal parameter, and our sole objective
is to estimate β1 with a sample of data
But never forget that β̂1 is an estimator of that causal
parameter obtained with a specific sample from the
population.
Uncertainty and sampling variance

Different samples will generate different estimates (β̂1 ) for the


“true” β1 which makes β̂1 a random variable.
Unbiasedness is the idea that if we could take as many random
samples on Y as we want from the population, and compute
an estimate each time, the average of these estimates would
be equal to β1 .
But, this also implies that βˆ1 has spread and therefore variance
Assumptions
Assumption SLR.1 (Linear in Parameters)
The population model can be written as

y = β0 + β1 x + u (30)
where β0 and β1 are the (unknown) population parameters.
We view x and u as outcomes of random variables; thus, y is
random.
Stating this assumption formally shows that our goal is to
estimate β0 and β1 .
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, {(xi , yi ) : i = 1, ..., n},
following the population model.
We know how to use this data to estimate β0 and β1 by OLS.
Because each i is a draw from the population, we can write,
for each i,

yi = β0 + β1 xi + ui (31)
Notice that ui here is the unobserved error for observation i. It
is not the residual that we compute from the data!
Assumption SLR.3 (Sample Variation in the Explanatory
Variable)
The sample outcomes on xi are not all the same value.
This is the same as saying the sample variance of
{xi : i = 1, ..., n} is not zero.
In practice, this is no assumption at all. If the xi are all the
same value, we cannot learn how x affects y in the population.
Assumption SLR.4 (Zero Conditional Mean)
In the population, the error term has zero mean given any
value of the explanatory variable:

E (u|x) = E (u) = 0. (32)


This is the key assumption for showing that OLS is unbiased,
with the zero value not being important once we assume
E (u|x) does not change with x.
Note that we can compute the OLS estimates whether or not
this assumption holds, or even if there is an underlying
population model.
Showing OLS is unbiased

How do we show β̂1 is unbiased for β1 ? What we need to show is

E (β̂1 ) = β1 (33)
where the expected value means averaging across random samples.
Step 1: Write down a formula for β̂1 . It is convenient to use
Pn
(xi − x)yi
β̂1 = Pi=1
n 2
(34)
i=1 (xi − x)
which is one of several equivalent forms.
It is convenient to define SSTx = ni=1 (xi − x)2 , to total variation
P
in the xi , and write
Pn
(xi − x)yi
β̂1 = i=1 (35)
SSTx
Remember, SSTx is just some positive number. The existence of
β̂1 is guaranteed by SLR.3.

Step 2: Replace each yi with yi = β0 + β1 xi + ui (which uses


SLR.1 and the fact that we have data from SLR.2).
The numerator becomes
n
X n
X
(xi − x)yi = (xi − x)(β0 + β1 xi + ui ) (36)
i=1 i=1
n
X n
X n
X
= β0 (xi − x) + β1 (xi − x)xi + (xi − x)ui (37)
i=1 i=1 i=1
n
X n
X
= 0 + β1 (xi − x)2 + (xi − x)ui (38)
i=1 i=1
n
X
= β1 SSTx + (xi − x)ui (39)
i=1
Pn Pn Pn
We used i=1 (xi − x) = 0 and i=1 (xi − x)xi = i=1 (xi − x)2 .
We have shown

Pn Pn
β1 SSTx + i=1 (xi − x)ui i=1 (xi− x)ui
β̂1 = = β1 + (40)
SSTx SSTx
Note how the last piece is the slope coefficient from the OLS
regression of ui on xi , i = 1, ..., n. We cannot do this regression
because the ui are not observed.
Now define

(xi − x)
wi = (41)
SSTx
so we have
n
X
β̂1 = β1 + w i ui (42)
i=1

β̂1 is a linear function of the unobserved errors, ui . The wi are


all functions of {x1 , x2 , ..., xn }.
The (random) difference between β̂1 and β1 is due to this
linear function of the unobservables.
Step 3: Find E (β̂1 ).
Under Assumptions SLR.2 and SLR.4, E (ui |x1 , x2 , ..., xn ) = 0.
That means, conditional on {x1 , x2 , ..., xn },

E (wi ui |x1 , x2 , ..., xn ) = wi E (ui |x1 , x2 , ..., xn ) = 0


because wi is a function of {x1 , x2 , ..., xn }. (In the next slides I
omit the conditioning in the expectations)
This would not be true if, in the population, u and x are
correlated.
Now we can complete the proof: conditional on {x1 , x2 , ..., xn },
n
!
X
E (β̂1 ) = E β1 + w i ui (43)
i=1
n
X n
X
= β1 + E (wi ui ) = β1 + wi E (ui ) (44)
i=1 i=1

= β1 (45)

Remember, β1 is the fixed constant in the population. The


estimator, β̂1 , varies across samples and is the random outcome:
before we collect our data, we do not know what β̂1 will be.
THEOREM (Unbiasedness of OLS)
Under Assumptions SLR.1 through SLR.4

E (β̂0 ) = β0 and E (β̂1 ) = β1 . (46)

Omit the proof for β̂0 .


Each sample leads to a different estimate, β̂0 and β̂1 . Some
will be very close to the true values β0 = 3 and β1 = 2.
Nevertheless, some could be very far from those values.
If we repeat the experiment again and again, and average the
estimates, we would get very close to 2.
The problem is, we do not know which kind of sample we
have. We can never know whether we are close to the
population value.
We hope that our sample is "typical" and produces a slope
estimate close to β1 but we can never know.
Reminder

Errors are the vertical distances between observations and the


unknown Conditional Expectation Function. Therefore, they
are unknown.
Residuals are the vertical distances between observations and
the estimated regression function. Therefore, they are known.
SE and the data

The correct SE estimation procedure is given by the underlying


structure of the data
It is very unlikely that all observations in a dataset are
unrelated, but drawn from identical distributions
(homoskedasticity)
For instance, the variance of income is often greater in families
belonging to top deciles than among poorer families
(heteroskedasticity)
Some phenomena do not affect observations individually, but
they do affect groups of observations uniformly within each
group (clustered data)
Variance of the OLS Estimators

Under SLR.1 to SLR.4, the OLS estimators are unbiased. This


tells us that, on average, the estimates will equal the
population values.
But we need a measure of dispersion (spread) in the sampling
distribution of the estimators. We use the variance (and,
ultimately, the standard deviation).
We could characterize the variance of the OLS estimators
under SLR.1 to SLR.4 (and we will later). For now, it is easiest
to introduce an assumption that simplifies the calculations.
Assumption SLR.5 (Homoskedasticity, or Constant Variance)
The error has the same variance given any value of the explanatory
variable x:

Var (u|x) = σ 2 > 0 (47)


where σ 2 is (virtually always) unknown.

Because we assume SLR.4, that is, E (u|x) = 0 whenever we


assume SLR.5, we can also write

E (u 2 |x) = σ 2 = E (u 2 ) (48)
Under the population Assumptions SLR.1 (y = β0 + β1 x + u),
SRL.4 (E (u|x) = 0) and SLR.5 (Var (u|x) = σ 2 ),

E (y |x) = β0 + β1 x
Var (y |x) = σ 2

So the average or expected value of y is allowed to change with x –


in fact, this is what interests us – but the variance does not change
with x. (See Graphs on next two slides)
THEOREM (Sampling Variances of OLS)
Under Assumptions SLR.1 to SLR.2,

σ2 σ2
Var (β̂1 |x) = Pn 2
=
i=1 (xi − x) SSTx
2 −1
Pn 2

σ n i=1 xi
Var (β̂0 |x) =
SSTx
(conditional on the outcomes {x1 , x2 , ..., xn }).
To show this, write, as before,
n
X
β̂1 = β1 + w i ui (49)
i=1

where wi = (xi − x)/SSTx . We are treating this as nonrandom in


the derivation. Because β1 is a constant, it does not affect Var (β̂1 ).
Now, we need to use the fact that, for uncorrelated random
variables, the variance of the sum is the sum of the variances.
The {ui : i = 1, 2, ..., n} are actually independent across i, and so
they are uncorrelated. So (remember that if we know x, we know
w)

n
!
X
Var (β̂1 |x) = Var wi ui |x
i=1
n
X n
X
= Var (wi ui |x) = wi2 Var (ui |x)
i=1 i=1
n
X n
X
= wi2 σ 2 =σ 2
wi2
i=1 i=1

where the second-to-last equality uses Assumption SLR.5, so that


the variance of ui does not depend on xi .
Now we have

n n Pn 2
X X (xi − x)2 i=1 (xi − x)
wi2 = =
(SSTx )2 (SSTx )2
i=1 i=1
SSTx 1
= 2
=
(SSTx ) SSTx

We have shown

σ2
Var (β̂1 ) = (50)
SSTx
Usually we are interested in β1 . We can easily study the two factors
that affect its variance.

σ2
Var (β̂1 ) = (51)
SSTx

1 As the error variance increases, i.e, as σ 2 increases, so does


Var (β̂1 ). The more “noise” in the relationship between y and
x – that is, the larger variability in u – the harder it is to learn
about β1 .
2 By contrast, more variation in {xi } is a good thing:

SSTx ↑ implies Var (β̂1 ) ↓ (52)


Notice that SSTx /n is the sample variance in x. We can think of
this as getting close to the population variance of x, σx2 , as n gets
large. This means

SSTx ≈ nσx2 (53)


which means, as n grows, Var (β̂1 ) shrinks at the rate 1/n. This is
why more data is a good thing: it shrinks the sampling variance of
our estimators.
The standard deviation of β̂1 is the square root of the variance. So
σ
sd(β̂1 ) = √ (54)
SSTx
This turns out to be the measure of variation that appears in
confidence intervals and test statistics.
Estimating the Error Variance
In the formula

σ2
Var (β̂1 ) = (55)
SSTx
we can compute SSTx from {xi : i = 1, ..., n}. But we need to
estimate σ 2 .
Recall that

σ 2 = E (u 2 ). (56)
Therefore, if we could observe a sample on the errors,
{ui : i = 1, 2, ..., n}, an unbiased estimator of σ 2 would be the
sample average
n
X
n−1 ui2 (57)
i=1

But this not an estimator because we cannot compute it from the


data we observe, since ui are unobserved.
How about replacing each ui with its “estimate”, the OLS residual
ûi ?

ui = yi − β0 − β1 xi
ûi = yi − β̂0 − β̂1 xi
ûi can be computed from the data because it depends on the
estimators β̂0 and β̂1 . Except by fluke,

ûi 6= ui (58)
for any i.

ûi = yi − β̂0 − β̂1 xi = (β0 + β1 xi + ui ) − β̂0 − β̂1 xi


= ui − (β̂0 − β0 ) − (β̂1 − β1 )xi

E (β̂0 ) = β0 and E (β̂1 ) = β1 , but the estimators almost always


differ from the population values in a sample.
Now, what about this as an estimator of σ 2 ?
n
X
n−1 ûi2 = SSR/n (59)
i=1

It is a true estimator and easily computed from the data after OLS.
As it turns out, this estimator is slightly biased: its expected value
is a little less than σ 2 .
The estimator does not account for the two restrictions on the
residuals, used to obtain β̂0 and β̂1 :

n
X
ûi = 0
i=1
n
X
xi ûi = 0
i=1

There is no such restriction on the unobserved errors.


The unbiased estimator of σ 2 uses a degrees-of-freedom
adjustment. The residuals have only n − 2 degrees-of-freedom, not
n.

SSR
σ̂ 2 = (60)
(n − 2)
THEOREM: Unbiased Estimator of σ 2
Under Assumptions SLR.1 to SLR.5,

E (σ̂ 2 ) = σ 2 (61)
In regression output, it is
s
√ SSR
σ̂ = σ̂ 2 = (62)
(n − 2)
that is usually reported. This is an estimator ofPsd(u), the standard
deviation of the population error. And SSR = ni=1 ub2 .
σ̂ is called the standard error of the regression, which
means it is an estimate of the standard deviation of the error
in the regression. Stata calls it the root mean squared error.
Given σ̂, we can now estimate sd(β̂1 ) and sd(β̂0 ). The
estimates of these are called the standard errors of the β̂j .
We just plug σ̂ in for σ:
σ̂
se(β̂1 ) = √ (63)
SSTx

where both the numerator and denominator are computed


from the data.
For reasons we will see, it is useful to report the standard errors
below the corresponding coefficient, usually in parentheses.
OLS inference is generally faulty in the presence of
heteroskedasticity
Fortunately, OLS is still useful
Assume SLR.1-4 hold, but not SLR.5. Therefore

Var (ui |xi ) = σi2

The variance of our estimator, βb1 equals:


Pn
(xi − x)2 σi2
Var (βb1 ) = i=1
SSTx2

When σi2 = σ 2 for all i, this formula reduces to the usual form,
σ2
SSTx2
A valid estimator of Var(βb1 ) for heteroskedasticity of any form
(including homoskedasticity) is
Pn
− x)2 ubi 2
i=1 (xi
Var (βb1 ) =
SSTx2

which is easily computed from the data after the OLS


regression
As a rule, you should always use the , robust command in
STATA.
Clustered data

But what if errors are not iid?


For instance, maybe observations between units in a group are
related to each other
You want to regress kids’ grades on class size to determine the
effect of class size on grades
The unobservables of kids belonging to the same classroom
will be correlated (e.g., teacher quality, recess routines) while
will not be correlated with kids in far away classrooms
Then i.i.d. is violated. But maybe i.i.d. holds across clusters,
just not within clusters
Simulations

Let’s first try to understand what’s going on with a few


simulations
We will begin with a baseline of non-clustered data
We’ll show the distribution of estimates in Monte Carlo
simulation for 1000 draws and iid errors
We’ll then show the number of times you reject the null
incorrectly at α = 0.05.
Figure: Distribution of the least squares estimator over 1,000 random
draws.
Figure: Distribution of the 95% confidence intervals with coloring
showing those which are incorrectly rejecting the null.
Clustered data and heteroskedastic robust

Now let’s look at clustered data


But this time we will estimate the model using heteroskedastic
robust standard errors
Earlier we saw mass all the way to -2.5 to 2; what do we get
when we incorrectly estimate the standard errors?
Figure: Distribution of the least squares estimator over 1,000 random
draws. Clustered data without correcting for clustering
Figure: Distribution of 1,000 95% confidence intervals with dashed region
representing those estimates that incorrectly reject the null.
Over-rejecting the null

Those 95 percent confidence intervals are based on an


α = 0.05.
Look how many parameter estimates are different from zero;
that’s what we mean by “over-rejecting the null”
You saw signs of it though in the variance of the estimated
effect, bc the spread only went from -.15 to .15 (whereas
earlier it had gone from -.25 to .2)
Now let’s correct for arbitrary within group correlations using
the cluster robust option in Stata/R
Figure: Distribution of 1,000 95% confidence intervals from a cluster
robust least squares regression with dashed region representing those
estimates that incorrectly reject the null.
Cluster robust standard errors

Better. We don’t have the same over-rejection problem as


before. If anything it’s more conservative.
The formula for estimating standard errors changes when
allowing for arbitrary serial correlation within group.
Instead of summing over each individual, we first sum over
groups
I’ll use matrix notation as it’s easier for me to explain by
stacking the data.
Clustered data

Let’s stack the observations by cluster

yg = xg β + ug

The OLS estimator of β is:

βb = [X 0 X ]−1 X 0 y

The variance is given by:

Var (β) = E [[X 0 X ]−1 X 0 ΩX [X 0 X ]−1 ]


Clustered data

With this in mind, we can now write the variance-covariance matrix


for clustered data
G
X
b = [X 0 X ]−1 [
Var (β) xg0 ubg ubg0 xg ][X 0 X ]−1
i=1

where ûg are residuals from the stacked regression


In STATA: vce(cluster clustervar). Where clustervar
is a variable that identifies the groups in which unobservables
are allowed to correlate
The importance of knowing your data

In real world you should never go with the “independent and


identically distributed” (i.e., homoskedasticity) case. Life is not
that simple.
You need to know your data in order to choose the correct
error structure and then infer the required SE calculation
If you have aggregate variables, like class size, clustering at
that level is required
Foundations of scientific knowledge

Scientific methodologies are the epistemological foundation of


scientific knowledge, which is a particular kind of knowledge
Science does not collect evidence in order to “prove” what
people already believe or want others to believe.
Science is process oriented, not outcome oriented.
Therefore science allows us to accept unexpected and
sometimes even undesirable answers.
My strong pragmatic claim

“Credible” causal inference is essential to scientific discovery,


publishing and your career
Non-credibly identified empirical micro papers, even ones with
ingenious theory, will have trouble getting published and won’t
be taken seriously
Causal inference in 2019 is a necessary, not a sufficient,
condition
Outline

Properties of the conditional expectation function (CEF)


Reasons for using linear regression
Regression anatomy theorem
Omitted variable bias
Properties of the conditional expectation function

Assume we are interested in the returns to schooling in a wage


regression.
We can summarize the predictive power of schooling’s effect
on wages with the conditional expectation function

E (yi |xi ) (64)

The CEF for a dependent variable, yi , given covariates Xi , is


the expectation, or population average, of yi with xi held
constant.
E (yi |xi ) gives the expected value of y for given values of x
It provides a reasonable representation of how y changes with
x
If x is random, then E (yi |xi ) is a random function
When there are only two values that xi can take on, then there
are only two values the CEF can take on – but the dummy
variable is a special case
We’re often interested in CEFs that are functions of many
variables, conveniently subsumed in the vector xi , and for a
specific value of xi , we will write

E (yi |xi = x)
Helpful result: Law of Iterated Expectations

Definition of Law of Iterated Expectations (LIE)


The unconditional expectation of a random variable is equal to the
expectation of the conditional expectation of the random variable
conditional on some other random variable

E (Y ) = E (E [Y |X ])

.
We use LIE for a lot of stuff, and it’s actually quite intuitive. You
may even know it and not know you know it!
Simple example of LIE

Say you want to know average IQ but only know average IQ by


gender.
LIE says we get the former by taking conditional expectations
by gender and combining them (properly weighted)

E [IQ] = E (E [IQ|Sex])
X
= Pr (Sexi ) · E [IQ|Sexi ]
Sexi
= Pr (Male) · E [IQ|Male]
+Pr (Female) · E [IQ|Female]

In words: the weighted average of the conditional averages is


the unconditional average.
Person Gender IQ
1 M 120
2 M 115
3 M 110
4 F 130
5 F 125
6 F 120

E[IQ] = 120
E[IQ | Male] = 115; E[IQ | Female] = 125
LIE: E ( E [ IQ | Sex ] ) = (0.5)×115 + (0.5)×125 = 120
Proof.
For the continuous case:
Z
E [E (Y |X )] = E (Y |X = u)gx (u)du
Z Z 
= tfy |x (t|X = u)dt gx (u)du
Z Z
= tfy |x (t|X = u)gx (u)dudt
Z Z 
= t fy |x (t|X = u)gx (u)du dt
Z
= t [fx,y du] dt
Z
= tgy (t)dt
= E (y )
Proof.
For the discrete case,
X
E (E [Y |X ]) = E [Y |X = x]p(x)
x
!
X X
= yp(y |x) p(x)
x y
XX
= yp(x, y )
x y
X X
= y p(x, y )
y x
X
= yp(y )
y
= E (Y )
Property 1: CEF Decomposition Property

The CEF Decomposition Property

yi = E (yi |xi ) + ui
where
1 ui is mean independent of xi ; that is

E (ui |xi ) = 0

2 ui is uncorrelated with any function of xi

In words: Any random variable, yi , can be decomposed into two


parts: the part that can be explained by xi and the part left over
that can’t be explained by xi . Proof is in Angrist and Pischke (ch.
3)
Property 2: CEF Prediction Property

The CEF Prediction Property


Let m(xi ) be any function of xi . The CEF solves

E (yi |xi ) = arg minm(xi ) E [(yi − m(xi ))2 ].

In words: The CEF is the minimum mean squared error predictor of


yi given xi . Proof is in Angrist and Pischke (ch. 3)
3 reasons why linear regression may be of interest

Linear regression may be interesting even if the underlying CEF is


not linear. We review some of the linear theorems now. These are
merely to justify the use of linear models to approximate the CEF.
The Linear CEF Theorem
Suppose the CEF is linear. Then the population regression is it.

Comment: Trivial theorem imho because if the population CEF is


linear, then it makes the most sense to use linear regression to
estimate it. Proof in Angrist and Pischke (ch. 3). Proof uses the
CEF Decomposition Property from earlier.
The Best Linear Predictor Theorem
1 The CEF, E (y |x ), is the minimum mean squared error
i i
(MMSE) predictor of yi given xi in the class of all functions xi
by the CEF prediction property
2 The population regression function, E (xi yi )E (xi xi0 )−1 , is the
best we can do in the class of all linear functions
Proof is in Angrist and Pischke (ch. 3).
The Regression CEF Theorem
The function xi β provides the minimum mean squared error
(MMSE) linear approximation to E (yi |xi ), that is

β = arg minb E {(E (yi |xi ) − xi0 b)2 }

Again, proof in Angrist and Pischke (ch. 3).


Random families

We are interested in the causal effect of family size on labor


supply so we regress labor supply onto family size

labor _supplyi = β0 + β1 numkidsi + εi

If couples had kids by flipping coins, then numkidsi


independent of εi , then estimation is simple - just compare
families with different sizes to get the causal effect of numkids
on labor _supply
But how do we interpret βb1 if families don’t flip coins?
Non-random families

If family size is random, you could visualize the causal effect


with a scatter plot and the regression line
If family size is non-random, then we can’t do this because we
need to control for multiple variables just to remove the
factors causing family size to be correlated with ε
Non-random families

Assume that family size is random once we condition on race,


age, marital status and employment.

labor_supplyi = β0 + β1 Numkidsi + γ1 Whitei + γ2 Marriedi


+γ3 Agei + γ4 Employedi + εi

To estimate this model, we need:


1 a data set with all 6 variables;
2 Numkids must be randomly assigned conditional on the other
4 variables
Now how do we interpret βb1 ? And can we visualize βb1 when
there’s multiple dimensions to the data? Yes, using the
regression anatomy theorem, we can.
Regression Anatomy Theorem
Assume your main multiple regression model of interest:

yi = β0 + β1 x1i + · · · + βk xki + · · · + βK xKi + ei

and an auxiliary regression in which the variable x1i is regressed on


all the remaining independent variables

x1i = γ0 + γk−1 xk−1i + γk+1 xk+1i + · · · + γK xKi + fi

and x̃1i = x1i − xb1i being the residual from the auxiliary regression.
The parameter β1 can be rewritten as:

Cov (yi , x̃1i )


β1 =
Var (x̃1i )

In words: The regression anatomy theorem says that βb1 is a scaled


covariance with the x˜1 residual used instead of the actual data x.
Regression Anatomy Proof
To prove the theorem, note E [x̃ki ] = E [xki ] − E [b
xki ] = E [fi ], and plug yi and residual
x̃ki from xki auxiliary regression into the covariance cov (yi , x̃ki )

cov (yi , x̃ki )


βk =
var (x̃ki )
cov (β0 + β1 x1i + · · · + βk xki + · · · + βK xKi + ei , x̃ki )
=
var (x̃ki )
cov (β0 + β1 x1i + · · · + βk xki + · · · + βK xKi + ei , fi )
=
var (fi )

1 Since by construction E [fi ] = 0, it follows that the term β0 E [fi ] = 0.


2 Since fi is a linear combination of all the independent variables with the
exception of xki , it must be that

β1 E [fi x1i ] = · · · = βk−1 E [fi xk−1i ] = βk+1 E [fi xk+1i ] = · · · = βK E [fi xKI ] = 0
Regression Anatomy Proof (cont.)
3 Consider now the term E [ei fi ]. This can be written as:

E [ei fi ] = E [ei fi ]
= E [ei x̃ki ]
= E [ei (xki − xbki )]
= E [ei xki ] − E [ei x̃ki ]

Since ei is uncorrelated with any independent variable, it is also uncorrelated


with xki : accordingly, we have E [ei xki ] = 0. With regard to the second term of
the subtraction, substituting the predicted value from the xki auxiliary
regression, we get

E [ei x̃ki ] = E [ei (γb0 + γb1 x1i + · · · + γ bk+1 xk+1i + · · · + γ


bk−1 xk−1 i + γ bK xKi )]

Once again, since ei is uncorrelated with any independent variable, the expected
value of the terms is equal to zero. Then, it follows E [ei fi ] = 0.
Regression Anatomy Proof (cont.)
4 The only remaining term is E [βk xki fi ] which equals E [βk xki x̃ki ] since fi = x̃ki . The
term xki can be substituted using a rewriting of the auxiliary regression model, xki ,
such that
xki = E [xki |X−k ] + x̃ki
This gives

E [βk xki x̃ki ] = E [βk E [x̃ki (E [xki |X−k ] + x̃ki )]]


= βk E [x̃ki (E [xki |X−k ] + x̃ki )]
= βk {E [x̃ki2 ] + E [(E [xki |X−k ]x̃ki )]}
= βk var (x̃ki )

which follows directly from the orthogonoality between E [xki |X−k ] and x̃ki . From
previous derivations we finally get

cov (yi , x̃ki ) = βk var (x̃ki )

which completes the proof.


Stata command: reganat (i.e., regression anatomy)

. ssc install reganat, replace


. sysuse auto
. regress price length weight headroom mpg
. reganat price length weight headroom mpg, dis(weight length) biline
Big picture

1 Regression provides the best linear predictor for the dependent


variable in the same way that the CEF is the best unrestricted
predictor of the dependent variable
2 If we prefer to think of approximating E (yi |xi ) as opposed to
predicting yi , the regression CEF theorem tells us that even if
the CEF is nonlinear, regression provides the best linear
approximation to it.
3 Regression anatomy theorem helps us interpret a single slope
coefficient in a multiple regression model by the
aforementioned decomposition.
Omitted Variable Bias

A typical problem is when a key variable is omitted. Assume


schooling causes earnings to rise:

Yi = β0 + β1 Si + β2 Ai + ui

Yi = log of earnings
Si = schooling measured in years
Ai = individual ability

Typically the econometrician cannot observe Ai ; for instance,


the Current Population Survey doesn’t present adult
respondents’ family background, intelligence, or motivation.
Shorter regression

What are the consequences of leaving ability out of the


regression? Suppose you estimated this shorter regression
instead:
Yi = β0 + β1 Si + ηi
where ηi = β2 Ai + ui ; β0 , β1 , and β2 are population regression
coefficients; Si is correlated with ηi through Ai only; and ui is
a regression residual uncorrelated with all regressors by
definition.
Derivation of Ability Bias

Suppressing the i subscripts, the OLS estimator for β1 is:

Cov (Y , S) E [YS] − E [Y ]E [S]


βb1 = =
Var (S) Var (S)

Plugging in the true model for Y , we get:

Cov [(β0 + β1 S + β2 A + u), S]


βb1 =
Var (S)
E [(β0 S + β1 S 2 + β2 SA + uS)] − E (S)E [β0 + β1 S + β2 A + u]
=
Var (S)
β1 E (S 2 ) − β1 E (S)2 + β2 E (AS) − β2 E (S)E (A) + E (uS) − E (S)E (u)
=
Var (S)
Cov (A, S)
= β1 + β2
Var (S)

If β2 > 0 and Cov(A, S)> 0 the coefficient on schooling in the shortened


regression (without controlling for A) would be upward biased
Summary

When Cov (A, S) > 0 then ability and schooling are correlated.
When ability is unobserved, then not even multiple regression
will identify the causal effect of schooling on wages.
Here we see one of the main justifications for this workshop –
what will we do when the treatment variable is endogenous?
We will need an identification strategy to recover the causal
effect
Introduction to the Selection Problem

Aliens come and orbit earth, see sick people in hospitals and
conclude “these ‘hospitals’ are hurting people”
Motivated by anger and compassion, they kill the doctors to
save the patients
Sounds stupid, but earthlings do this too - all the time

Cunningham Causal Inference


#1: Correlation and causality are very different concepts

Causal question:

“If I hospitalize (D) my child, will her health (Y) improve?”

Correlation question:

1 Cov (D, Y )
√ √
n VarD VarY

These are not the same thing


#2: Coming first may not mean causality!

Every morning the rooster crows and then the sun rises
Did the rooster cause the sun to rise? Or did the sun cause
the rooster to crow?
Post hoc ergo propter hoc: “after this, therefore, because of
this”
#3: No correlation does not mean no causality!

A sailor sails her sailboat across a lake


Wind blows, and she perfectly counters by turning the rudder
The same aliens observe from space and say “Look at the way
she’s moving that rudder back and forth but going in a straight
line. That rudder is broken.” So they send her a new rudder
They’re wrong but why are they wrong? There is, after all, no
correlation
Introduction to potential outcomes model

Let the treatment be a binary variable:


(
1 if hospitalized at time t
Di,t =
0 if not hospitalized at time t

where i indexes an individual observation, such as a person


Potential outcomes:
(
j 1 health if hospitalized at time t
Yi,t =
0 health if not hospitalized at time t

where j indexes a counterfactual state of the world


Moving between worlds

I’ll drop t subscript, but note – these are potential outcomes


for the same person at the exact same moment in time
A potential outcome Y 1 is not the historical outcome Y either
conceptually or notationally
Potential outcomes are hypothetical states of the world but
historical outcomes are ex post realizations
Major philosophical move here: go from the potential worlds
to the actual (historical) world based on your treatment
assignment
Important definitions

Definition 1: Individual treatment Definition 2: Average treatment effect


effect (ATE)
The individual treatment effect, δi , The average treatment effect is the
equals Yi1 − Yi0 population average of all i individual
treatment effects

Definition 3: Switching equation E [δi ] = E [Yi1 − Yi0 ]


An individual’s observed health = E [Yi1 ] − E [Yi0 ]
outcomes, Y , is determined by
treatment assignment, Di , and
corresponding potential outcomes:

Yi = Di Yi1 + (1 − Di )Yi0
(
Yi1 if Di = 1
Yi =
Yi0 if Di = 0
So what’s the problem?

Definition 4: Fundamental problem of causal inference


It is impossible to observe both Yi1 and Yi0 for the same individual
and so individual causal effects, δi , are unknowable.
Conditional Average Treatment Effects

Definition 5: Average Treatment Effect on the Treated (ATT)


The average treatment effect on the treatment group is equal to
the average treatment effect conditional on being a treatment
group member:

E [δ|D = 1] = E [Y 1 − Y 0 |D = 1]
= E [Y 1 |D = 1] − E [Y 0 |D = 1]

Definition 6: Average Treatment Effect on the Untreated (ATU)


The average treatment effect on the untreated group is equal to
the average treatment effect conditional on being untreated:

E [δ|D = 0] = E [Y 1 − Y 0 |D = 0]
= E [Y 1 |D = 0] − E [Y 0 |D = 0]
Causality and comparisons

Comparisons are at the heart of the causal problem, but not all
comparisons are equal because of the selection problem
Does the hospital make me sick? Or am I sick, and that’s why
I went to the hospital?
Why can’t I just compare my health (Scott) with someone
who isn’t in the hospital (Nathan)? Aren’t we supposed to
have a “control group”?
What are we actually measuring if we compare average health
outcomes for the hospitalized with the non-hospitalized?
Definition 7: Simple difference in mean outcomes (SDO)
A simple difference in mean outcomes (SDO) is the difference
between the population average outcome for the treatment and
control groups, and can be approximated by the sample averages:

SDO = E [Y 1 |D = 1] − E [Y 0 |D = 0]
= EN [Y |D = 1] − EN [Y |D = 0]

in large samples.
SDO vs. ATE

Notice the subtle difference between the SDO and ATE notation:

E [Y |D = 1] − E [Y |D = 0] <
> E [Y 1 ] − E [Y 0 ]

The SDO is an estimate, whereas ATE is a parameter


SDO is a crank that turns data into numbers
ATE is a parameter that is unknowable because of the
fundamental problem of causal inference
SDO can line up with the ATE and also cannot line up with
the ATE.
Biased simple difference in mean outcomes

Decomposition of the SDO


The simple difference in mean outcomes can be decomposed into
three parts (ignoring sample average notation):

E [Y 1 |D = 1] − E [Y 0 |D = 0] = ATE
+E [Y 0 |D = 1] − E [Y 0 |D = 0]
+(1 − π)(ATT − ATU)

Seeing is believing so let’s work through this identity


Decomposition of SDO

ATE is equal to sum of conditional average expectations by LIE

ATE = E [Y 1 ] − E [Y 0 ]
= {πE [Y 1 |D = 1] + (1 − π)E [Y 1 |D = 0]}
−{πE [Y 0 |D = 1] + (1 − π)E [Y 0 |D = 0]}

Use simplified notations

E [Y 1 |D = 1] = a
E [Y 1 |D = 0] = b
E [Y 0 |D = 1] = c
E [Y 0 |D = 0] = d
ATE = e

Rewrite ATE

e = {πa + (1 − π)b}
    − {πc + (1 − π)d}

Move SDO terms to the LHS

e = πa + b − πb − πc − d + πd
0 = e − πa − b + πb + πc + d − πd
a − d = e − πa − b + πb + πc + d − πd + a − d
a − d = e + (c − d) + a − πa − b + πb − c + πc + d − πd
a − d = e + (c − d) + (1 − π)a − (1 − π)b − (1 − π)c + (1 − π)d
a − d = e + (c − d) + (1 − π)(a − c) − (1 − π)(b − d)

Substitute conditional means

E[Y1|D = 1] − E[Y0|D = 0] = ATE
    + (E[Y0|D = 1] − E[Y0|D = 0])
    + (1 − π)({E[Y1|D = 1] − E[Y0|D = 1]}
              − {E[Y1|D = 0] − E[Y0|D = 0]})

E[Y1|D = 1] − E[Y0|D = 0] = ATE
    + (E[Y0|D = 1] − E[Y0|D = 0])
    + (1 − π)(ATT − ATU)
Decomposition of difference in means

EN[yi|di = 1] − EN[yi|di = 0]   (SDO)
    = E[Y1] − E[Y0]                      (Average Treatment Effect)
    + E[Y0|D = 1] − E[Y0|D = 0]          (Selection bias)
    + (1 − π)(ATT − ATU)                 (Heterogeneous treatment effect bias)

where EN[Y|D = 1] → E[Y1|D = 1], EN[Y|D = 0] → E[Y0|D = 0], and (1 − π) is
the share of the population in the control group.
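
To make the decomposition concrete, here is a minimal Stata sketch with simulated data (all parameter values are made up for illustration) that computes each piece and verifies the identity numerically:

* Sketch: verify SDO = ATE + selection bias + (1 - pi)(ATT - ATU)
clear all
set seed 1
set obs 100000
gen y0 = rnormal(5, 1)
gen y1 = y0 + 2 + rnormal(0, 1)            // heterogeneous treatment effects
gen d  = (y1 - y0 + rnormal(0, 1) > 2.5)   // selection related to gains
gen y  = d*y1 + (1 - d)*y0                 // switching equation

su y if d == 1, meanonly
scalar ey1_t = r(mean)                     // E[Y1|D=1]
su y if d == 0, meanonly
scalar ey0_c = r(mean)                     // E[Y0|D=0]
su y0 if d == 1, meanonly
scalar ey0_t = r(mean)                     // E[Y0|D=1] (known only because simulated)
su y1 if d == 0, meanonly
scalar ey1_c = r(mean)                     // E[Y1|D=0] (known only because simulated)
su y1, meanonly
scalar m_y1 = r(mean)
su y0, meanonly
scalar m_y0 = r(mean)
su d, meanonly
scalar pshare = r(mean)                    // pi, the share treated

scalar sdo     = ey1_t - ey0_c
scalar ate     = m_y1 - m_y0
scalar selbias = ey0_t - ey0_c
scalar hetbias = (1 - pshare)*((ey1_t - ey0_t) - (ey1_c - ey0_c))

display "SDO                        = " sdo
display "ATE + selection + het bias = " ate + selbias + hetbias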
Independence assumption

Independence assumption
Treatment is independent of potential outcomes

(Y0, Y1) ⊥⊥ D

In words: Random assignment means that the treatment has been assigned to units
independent of their potential outcomes. Thus, mean potential outcomes for the
treatment group and control group are the same for a given state of the world

E [Y 0 |D = 1] = E [Y 0 |D = 0]
E [Y 1 |D = 1] = E [Y 1 |D = 0]



Random Assignment Solves the Selection Problem

EN[yi|di = 1] − EN[yi|di = 0]   (SDO)
    = E[Y1] − E[Y0]                      (Average Treatment Effect)
    + E[Y0|D = 1] − E[Y0|D = 0]          (Selection bias)
    + (1 − π)(ATT − ATU)                 (Heterogeneous treatment effect bias)

If treatment is independent of potential outcomes, then swap


out equations and selection bias zeroes out:

E [Y 0 |D = 1] − E [Y 0 |D = 0] = 0
Random Assignment Solves the Heterogenous Treatment Effects

How does randomization affect heterogeneity treatment effects bias from the
third line? Rewrite definitions for ATT and ATU:

ATT = E [Y 1 |D = 1] − E [Y 0 |D = 1]
ATU = E [Y 1 |D = 0] − E [Y 0 |D = 0]

Rewrite the third row bias after 1 − π:

ATT − ATU = E[Y1|D = 1] − E[Y0|D = 1]
            − E[Y1|D = 0] + E[Y0|D = 0]
          = 0

If treatment is independent of potential outcomes, then:

EN [yi |di = 1] − EN [yi |di = 0] = E [Y 1 ] − E [Y 0 ]


SDO = ATE
Careful with this notation

Independence only implies that the average values for a


given potential outcome (i.e., Y1 or Y0) are the same for the
group that received the treatment as for the group that did not
Independence does not imply

E[Y1|D = 1] = E[Y0|D = 0]
SUTVA

Potential outcomes model places a limit on what we can


measure: the “stable unit-treatment value assumption” .
Horrible acronym.
1 S: stable
2 U: across all units, or the population
3 TV: treatment-value (“treatment effect”, “causal effect”)
4 A: assumption
SUTVA means that average treatment effects are parameters
that assume (1) homogenous dosage, (2) potential outcomes
are invariant to who else (and how many) is treated (e.g.,
externalities), and (3) partial equilibrium
SUTVA: Homogenous dose

SUTVA constrains what the treatment can be.


Individuals are receiving the same treatment – i.e., the “dose”
of the treatment to each member of the treatment group is
the same. That’s the “stable unit” part.
If we are estimating the effect of hospitalization on health
status, we assume everyone is getting the same dose of the
hospitalization treatment.
Easy to imagine violations if hospital quality varies, though,
across individuals. But, that just means we have to be careful
what we are and are not defining as the treatment
SUTVA: No spillovers to other units

What if hospitalizing Scott (hospitalized, D = 1) is actually


about vaccinating Scott from small pox?
If Scott is vaccinated for small pox, then Nathan’s potential
health status (without vaccination) may be higher than when
he isn’t vaccinated.
In other words, Nathan's Y0 may vary with what Scott does
regardless of whether Nathan himself receives treatment.
SUTVA means that you don’t have a problem like this.
If there are no externalities from treatment, then δi is stable
for each i unit regardless of whether someone else receives the
treatment too.
SUTVA: Partial equilibrium only

Easier to imagine this with a different example.


Scaling up can be a problem because of rising costs of
production
Let’s say we estimate a causal effect of early childhood
intervention in some state
Now the President wants to adopt it for the whole United
States – will it have the same effect as we found?
What if expansion requires hiring lower quality teachers just to
staff the classrooms?
Demand for Learning HIV Status

Rebecca Thornton implemented an RCT in rural Malawi for


her job market paper at Harvard in mid-2000s
At the time, it was an article of faith that you could fight the
HIV epidemic in Africa by encouraging people to get tested;
but Thornton wanted to see if this was true
She randomly assigned cash incentives to people to incentivize
learning their HIV status
Also examined whether learning changed sexual behavior.
Experimental design

Respondents were offered a free door-to-door HIV test


Treatment is randomized vouchers worth between zero and
three dollars
These vouchers were redeemable once they visited a nearby
voluntary counseling and testing center (VCT)
Estimates her models using OLS with controls
Why Include Control Variables?

To evaluate experimental data, one may want to add


additional controls in the multivariate regression model. So,
instead of estimating the prior equation, we might estimate:

Yi = α + δDi + γXi + ηi

There are 2 main reasons for including additional controls in


the regression models:
1 Conditional random assignment. Sometimes randomization is
done conditional on some observable (e.g., gender, school,
districts)
2 Exogenous controls increase precision. Although control
variables Xi are uncorrelated with Di , they may have
substantial explanatory power for Yi . Including controls thus
reduces variance in the residuals which lowers the standard
errors of the regression estimates.
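
As a quick illustration of the second point, here is a small simulated sketch (made-up data, not Thornton's) showing that a covariate that predicts Yi but is unrelated to the randomized Di leaves the estimate centered in the same place while shrinking its standard error:

* Sketch: exogenous controls increase precision in a randomized design
clear all
set seed 2
set obs 1000
gen d = runiform() >= 0.5            // randomly assigned treatment
gen x = rnormal()                    // covariate unrelated to d by design
gen y = 1 + 0.5*d + 3*x + rnormal()  // x has substantial explanatory power

reg y d                              // unbiased for 0.5, but noisier
reg y d x                            // same estimand, smaller SE on d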
Table: Impact of Monetary Incentives and Distance on Learning HIV Results
(five specifications, (1)–(5); standard errors in parentheses)

Any incentive         0.431***  0.309***  0.219***  0.220***  0.219***
                      (0.023)   (0.026)   (0.029)   (0.029)   (0.029)
Amount of incentive   0.091***  0.274***  0.274***  0.273***
                      (0.012)   (0.036)   (0.035)   (0.036)
Amount of incentive²  −0.063*** −0.063*** −0.063***
                      (0.011)   (0.011)   (0.011)
HIV                   −0.055*   −0.052    −0.05     −0.058*   −0.055*
                      (0.031)   (0.032)   (0.032)   (0.031)   (0.031)
Distance (km)         −0.076***
                      (0.027)
Distance²             0.010**
                      (0.005)
Controls              Yes       Yes       Yes       Yes       Yes
Sample size           2,812     2,812     2,812     2,812     2,812
Average attendance    0.69      0.69      0.69      0.69      0.69
Figure: Visual representation of cash transfers on learning HIV test
results.
Results

Even small incentives were effective


Any incentive increased learning of HIV status by 43 percentage points
compared to the control group (mean 34%)
Next she looks at the effect that learning HIV status has on
risky sexual behavior
Figure: Visual representation of cash transfers on condom purchases for
HIV positive individuals.
Table: Reactions to Learning HIV Results among Sexually Active at Baseline

                        Bought condoms         Number of condoms bought
Dependent variable:     OLS        IV          OLS        IV
Got results             −0.022     −0.069      −0.193     −0.303
                        (0.025)    (0.062)     (0.148)    (0.285)
Got results × HIV       0.418***   0.248       1.778***   1.689**
                        (0.143)    (0.169)     (0.564)    (0.784)
HIV                     −0.175**   −0.073      −0.873     −0.831
                        (0.085)    (0.123)     (0.275)    (0.375)
Controls                Yes        Yes         Yes        Yes
Sample size             1,008      1,008       1,008      1,008
Mean                    0.26       0.26        0.95       0.95
Results

For those who were HIV+ and got their test results: 42 percentage
points more likely to buy condoms (but the estimate shrinks and becomes
insignificant at conventional levels with IV).
Number of condoms bought – very small. HIV+ respondents
who learned their status bought 2 more condoms
Randomization inference and causal inference

“In randomization-based inference, uncertainty in estimates


arises naturally from the random assignment of the
treatments, rather than from hypothesized sampling from a
large population.” (Athey and Imbens 2017)
Athey and Imbens is part of growing trend of economists using
randomization-based methods for doing causal inference



Lady tasting tea experiment

Ronald Aylmer Fisher (1890-1962)


Two classic books on statistics: Statistical Methods for
Research Workers (1925) and The Design of Experiments
(1935), as well as a famous work in genetics, The Genetical
Theory of Natural Selection
Developed many fundamental notions of modern statistics
including the theory of randomized experimental design.
Lady tasting tea

Muriel Bristol (1888-1950)


A PhD scientist back in the days when women weren’t PhD
scientists
Worked with Fisher at the Rothamsted Experiment Station
(which she established) in 1919
During afternoon tea, Muriel claimed she could tell from taste
whether the milk was added to the cup before or after the tea
Scientists were incredulous, but Fisher was inspired by her
strong claim
He devised a way to test her claim which she passed using
randomization inference
Description of the tea-tasting experiment

Original claim: Given a cup of tea with milk, Bristol claims she
can discriminate the order in which the milk and tea were
added to the cup
Experiment: To test her claim, Fisher prepares 8 cups of tea –
4 milk then tea and 4 tea then milk – and presents each
cup to Bristol for a taste test
Question: How many cups must Bristol correctly identify to
convince us of her unusual ability to identify the order in which
the milk was poured?
Fisher’s sharp null: Assume she can’t discriminate. Then
what’s the likelihood that random chance was responsible for
her answers?
Choosing subsets

The lady performs the experiment by selecting 4 cups, say, the


ones she claims to have had the tea poured first.
 
C(n, k) = n! / (k!(n − k)!)

"8 choose 4", i.e., C(8, 4), ways to choose 4 cups out of 8


Numerator: 8 × 7 × 6 × 5 = 1,680 ways to choose a first cup,
a second cup, a third cup, and a fourth cup, in order.
Denominator: 4 × 3 × 2 × 1 = 24 ways to order any 4 cups.
Choosing subsets

There are 70 ways to choose 4 cups out of 8, and therefore a


1.4% probability of producing the correct answer by chance:

24 / 1,680 = 1/70 ≈ 0.014

For example, the probability that she would correctly identify
all 4 cups is 1/70
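
A quick way to check these counts is Stata's built-in comb() function:

display comb(8,4)       // 70 ways to choose 4 cups out of 8
display 1/comb(8,4)     // .01428571, the chance of guessing all 4 by luck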
Statistical significance

Suppose the lady correctly identifies all 4 cups. Then . . .


1 Either she has no ability, and has chosen the correct 4 cups
purely by chance, or
2 She has the discriminatory ability she claims.
Since choosing correctly is highly unlikely in the first case (one
chance in 70), the second seems plausible.
1 Fisher is the originator of the convention that a result is
considered “statistically significant” if the probability of its
occurrence by chance is < 0.05, or, less than 1 out of 20.
Bristol actually got all four correct
Replication

Let’s look at tea.do and tea.R to see this experiment


Null hypothesis

In this example, the null hypothesis is the hypothesis that the


lady has no special ability to discriminate between the cups of
tea.
We can never prove the null hypothesis, but the data may
provide evidence to reject it.
In most situations, rejecting the null hypothesis is what we
hope to do.
Null hypothesis of no effect

Randomization inference allows us to make probability


calculations revealing whether the treatment assignment was
“unusual”
Fisher's sharp null is when we entertain the possibility that no unit
has a treatment effect
This allows us to make “exact” p-values which do not depend
on large sample approximations
It also means the inference is not dependent on any particular
distribution (e.g., Gaussian); sometimes called nonparametric
Sidebar: bootstrapping is different

Sometimes people confuse randomization inference with


bootstrapping
Bootstrapping repeatedly re-samples the observations (with
replacement) for estimation; "uncertainty over the sample"
Randomization inference randomly reassigns the treatment;
“uncertainty over treatment assignment”
6-step guide to randomization inference

1 Choose a sharp null hypothesis (e.g., no treatment effects)


2 Calculate a test statistic T (a scalar based on D and Y)
3 Pick a different ("randomized") treatment vector D̃1
4 Calculate the test statistic associated with (D̃, Y)
5 Repeat steps 3 and 4 for all possible combinations to get
  T̃ = {T̃1, . . . , T̃K}
6 Calculate the exact p-value as p = (1/K) Σ_{k=1}^{K} I(T̃k ≥ T)
Pretend experiment

Table: Pretend DBT intervention for some homeless population

Name D Y Y0 Y1
Andy 1 10 . 10
Ben 1 5 . 5
Chad 1 16 . 16
Daniel 1 3 . 3
Edith 0 5 5 .
Frank 0 7 7 .
George 0 8 8 .
Hank 0 10 10 .

For concreteness, assume a program where we pay homeless people


$15 to take dialectical behavioral therapy. Outcomes are some
measure of mental health 0-20 with higher scores being better.
Step 1: Sharp null of no effect

Fisher’s Sharp Null Hypothesis


H0 : δi = Yi1 − Yi0 = 0 ∀i

Assuming no effect means any test statistic is due to chance


Neyman and Fisher test statistics were different – Fisher was
exact, Neyman was not
Neyman’s null was no average treatment effect (ATE=0). If
you have a treatment effect of 5 and I have a treatment effect
of -5, our ATE is zero. This is not the sharp null even though
it also implies a zero ATE
More sharp null

Since under the Fisher sharp null δi = 0, it means each unit’s


potential outcomes under both states of the world are the same
We therefore know each unit’s missing counterfactual
The randomization we will perform cycles through alternative


treatment assignments; under the null, treatment assignment
doesn't matter because every assignment is associated with
zero unit-level treatment effects
We are looking for evidence against the null
Step 1: Fisher’s sharp null and missing potential outcomes

Table: Missing potential outcomes are no longer missing

Name D Y Y0 Y1
Andy 1 10 10 10
Ben 1 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 0 7 7 7
George 0 8 8 8
Hank 0 10 10 10

Fisher sharp null allows us to fill in the missing counterfactuals bc


under the null there’s zero treatment effect at the unit level. This
guarantees zero ATE, but is different in formulation than Neyman’s
null effect of no ATE.
Step 2: Choosing a test statistic

Test Statistic
A test statistic T (D, Y ) is a scalar quantity calculated from the
treatment assignments D and the observed outcomes Y

By scalar, I just mean it’s a number (vs. a function) measuring


some relationship between D and Y
Ultimately there are many tests to choose from; I’ll review a
few later
If you want a test statistic with high statistical power, you
need large values when the null is false, and small values when
the null is true (i.e., extreme)
Simple difference in means

Consider the absolute SDO from earlier


δSDO = | (1/NT) Σ_{i=1}^{N} Di Yi − (1/NC) Σ_{i=1}^{N} (1 − Di) Yi |

Larger values of δSDO are evidence against the sharp null


Good estimator for constant, additive treatment effects and
relatively few outliers in the potential outcomes
Step 2: Calculate test statistic, T (D, Y )

Table: Calculate T using D and Y

Name D Y Y0 Y1 δi
Andy 1 10 10 10 0
Ben 1 5 5 5 0
Chad 1 16 16 16 0
Daniel 1 3 3 3 0
Edith 0 5 5 5 0
Frank 0 7 7 7 0
George 0 8 8 8 0
Hank 0 10 10 10 0

We'll start with the simple difference in means test
statistic, T(D, Y): δSDO = |34/4 − 30/4| = 1
Steps 3-5: Null randomization distribution

Randomization steps reassign treatment assignment for every


combination, calculating test statistics each time, to obtain
the entire distribution of counterfactual test statistics
The key insight of randomization inference is that under
Fisher’s sharp null, the treatment assignment shouldn’t matter
Ask yourself:
if there is no unit level treatment effect, can you picture a
distribution of counterfactual test statistics?
and if there is no unit level treatment effect, what must
average counterfactual test statistics equal?
Step 6: Calculate “exact” p-values

Question: how often would we get a test statistic as big or


bigger as our “real” one if Fisher’s sharp null was true?
This can be calculated “easily” (sometimes) once we have the
randomization distribution from steps 3-5
The number of randomized test statistics T(D̃, Y) at least as large as
the observed statistic T(D, Y), divided by the total number of
randomizations:

Pr(T(D̃, Y) ≥ T(D, Y) | δ = 0) = Σ_{D̃ ∈ Ω} I(T(D̃, Y) ≥ T(D, Y)) / K
These are “exact” tests when they use every possible
combination of D
When you can’t use every combination, then you can get
approximate p-values from a simulation (TBD)
With a rejection threshold of α (e.g., 0.05), randomization
inference test will falsely reject less than 100×α% of the time
First permutation (holding NT fixed)

Name D˜2 Y Y0 Y1
Andy 1 10 10 10
Ben 0 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 1 7 7 7
George 0 8 8 8
Hank 0 10 10 10

T̃1 = |36/4 − 28/4|= 9 − 7 = 2


Second permutation (again holding NT fixed)

Name D˜3 Y Y0 Y1
Andy 1 10 10 10
Ben 0 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 0 7 7 7
George 1 8 8 8
Hank 0 10 10 10

T̃2 = |37/4 − 27/4| = |9.25 − 6.75| = 2.5


Sidebar: Should it be 4 treatment groups each time?

In this experiment, I’ve been using the same NT under the


assumption that NT had been fixed when the experiment was
drawn.
But if the original treatment assignment had been generated
by something like a Bernoulli distribution (e.g., coin flips over
every unit), then you should be doing a complete permutation
that is also random in this way
This means that for 8 units, sometimes you’d have 1 treated,
or even 8
Correct inference requires you know the original data
generating process
Randomization distribution

Assignment   D1  D2  D3  D4  D5  D6  D7  D8   |T̃i|
True D        1   1   1   1   0   0   0   0    1
D̃2            1   0   1   1   0   1   0   0    2
D̃3            1   0   1   1   0   0   1   0    2.5
...
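
Here is a minimal Stata sketch of this exercise using the built-in permute command, which reshuffles D while holding the number of treated units fixed and recomputes the test statistic each time (Monte Carlo draws rather than enumerating all 70 assignments):

* Sketch: randomization inference for the 8-person example
clear
input str8 name d y
"Andy"   1 10
"Ben"    1  5
"Chad"   1 16
"Daniel" 1  3
"Edith"  0  5
"Frank"  0  7
"George" 0  8
"Hank"   0 10
end

* Observed test statistic: the difference in means (the coefficient on d)
reg y d

* Reshuffle d many times and recompute the coefficient to build the
* randomization distribution and a permutation p-value
permute d _b[d], reps(1000) seed(1): reg y d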
Step 2: Other test statistics

The simple difference in means is fine when effects are


additive, and there are few outliers in the data
But outliers create more variation in the randomization
distribution
What are some alternative test statistics?
Transformations

What if there was a constant multiplicative effect:


Yi1 /Yi0 = C ?
Difference in means will have low power to detect this
alternative hypothesis
So we transform the observed outcome using the natural log:
Tlog = | (1/NT) Σ_{i=1}^{N} Di ln(Yi) − (1/NC) Σ_{i=1}^{N} (1 − Di) ln(Yi) |

This is useful for skewed distributions of outcomes


Difference in medians/quantiles

We can protect against outliers using other test statistics such


as the difference in quantiles
Difference in medians:

Tmedian = |median(YT ) − median(YC )|

We could also estimate the difference in quantiles at any point


in the distribution (e.g., 25th or 75th quantile)
Rank test statistics

Basic idea is rank the outcomes (higher values of Yi are


assigned higher ranks)
Then calculate a test statistic based on the transformed
ranked outcome (e.g., mean rank)
Useful with continuous outcomes, small datasets and/or many
outliers
Rank statistics formally

Rank is the domination of others (including oneself):


Ri = Ri(Y1, . . . , YN) = Σ_{j=1}^{N} I(Yj ≤ Yi)

Normalize the ranks to have mean 0


R̃i = Ri − (N + 1)/2

Calculate the absolute difference in average ranks:


Trank = |R̄T − R̄C| = | Σ_{i: Di=1} R̃i / NT − Σ_{i: Di=0} R̃i / NC |

Minor adjustment (averages) for ties


Randomization distribution

Name     D   Y   Y0   Y1   Rank   R̃i
Andy     1   10  10   10   6.5    2
Ben      1   5   5    5    2.5    -2
Chad     1   16  16   16   8      3.5
Daniel   1   3   3    3    1      -3.5
Edith    0   5   5    5    2.5    -2
Frank    0   7   7    7    4      -0.5
George   0   8   8    8    5      0.5
Hank     0   10  10   10   6.5    2

Trank = |0 − 0| = 0
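
A short sketch of this calculation in Stata (it assumes the d and y variables from the 8-person example above are in memory):

* Sketch: normalized rank test statistic (ties get the average rank)
egen rnk = rank(y)
gen rtilde = rnk - (_N + 1)/2        // normalize ranks to mean zero
su rtilde if d == 1, meanonly
scalar rbar_t = r(mean)
su rtilde if d == 0, meanonly
scalar rbar_c = r(mean)
display "T_rank = " abs(rbar_t - rbar_c)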


Effects on outcome distributions

Focused so far on “average” differences between groups.


Kolmogorov-Smirnov test statistics is based on the difference
in the distribution of outcomes
Empirical cumulative distribution function (eCDF):

F̂C(y) = (1/NC) Σ_{i: Di=0} 1(Yi ≤ y)

F̂T(y) = (1/NT) Σ_{i: Di=1} 1(Yi ≤ y)

Proportion of observed outcomes below a chosen value for


treated and control separately
If two distributions are the same, then F̂C(Y) = F̂T(Y)
Kolmogorov-Smirnov statistic

Test statistics are scalars not functions


eCDFs are functions, not scalars
Solution: use the maximum discrepancy between the two
eCDFs:

TKS = max_i |F̂T(Yi) − F̂C(Yi)|
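
In Stata, a quick way to get this statistic is the built-in two-sample test (the variable names here are placeholders):

* Sketch: KS statistic comparing the outcome distribution across groups
ksmirnov y, by(d)       // reports the largest distance between the two eCDFs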
Kernel density by group status

Figure: Kernel density estimates of the outcome y, plotted separately for
the treatment and control groups.
eCDFs by treatment status and test statistic

Figure: Empirical CDFs of y for the treatment and control groups; the KS
statistic is the maximum vertical distance between the two eCDFs.
KS Test Statistic

Treatment D Exact P-value


K-S 0.4500 0.034

Max distance is 0.45. Exact p is 0.034.


“Which bear is best?” – Jim Halpert

A good test statistic is the one that best fits your data. Some test
statistics will have weird properties in the randomization as we’ll
see in synthetic control.
One-sided or two-sided?

So far, we have defined all test statistics as absolute values


We are testing against a two-sided alternative hypothesis

H0 : δi = 0 ∀i
H1 : δi ≠ 0 for some i

What about a one-sided alternative

H0 : δi = 0 ∀i
H1 : δi > 0 for some i

For these, use a test statistic that is bigger under the


alternative:

Tdiff* = ȲT − ȲC
Small vs. Modest Sample Sizes are non-trivial

Computing the exact randomization distribution is not always


feasible (Wolfram Alpha)
N = 6 and NT = 3 gives us 20 assignment vectors
N = 8 and NT = 4 gives us 70 assignment vectors
N = 10 and NT = 5 gives us 252 assignment vectors
N = 20 and NT = 10 gives us 184,756 assignment vectors
N = 50 and NT = 25 gives us 1.2641061 × 10^14 assignment
vectors
Exact p calculations are not realistic bc the number of assignments
explodes at even modest size
Approximate p values

Use simulation to get approximate p-values


Take K samples from the treatment assignment space
Calculate the randomization distribution in the K samples
Tests no longer exact, but bias is under your control (increase
K)
Imbens and Rubin show that p values converge to stable p
values pretty quickly (in their example after 1000 replications)
Sample dataset

Let’s do this now with Thornton’s data. You can replicate that
using thorton_ri.do or thornton_ri.R
Thornton’s experiment

ATE Iteration Rank p no. trials


0.45 1 1 0.01 100
0.45 1 1 0.002 500
0.45 1 1 0.001 1000

Table: Estimated p-value using different number of trials.


Including covariate information

Let Xi be a pretreatment measure of the outcome


One option is to use the gain score, Yi − Xi, in place of Yi
Causal effects are the same: (Yi1 − Xi) − (Yi0 − Xi) = Yi1 − Yi0
But the test statistic is different:


Tgain = |(ȲT − ȲC) − (X̄T − X̄C)|

If Xi is strongly predictive of Yi0, then this could have higher


power
Tgain will have lower variance under the null
This makes it easier to detect smaller effects
Regression in RI

We can extend this to use covariates in more complicated ways


For instance, we can use an OLS regression:

Yi = α + δDi + βXi + ε

Then our test statistic could be TOLS = δ̂


RI is justified even if the model is wrong
OLS is just another way to generate a test statistic
The more the model is “right” (read: predictive of Yi0 ), the
higher the power TOLS will have
See if you can do this in Thornton’s dataset using the loops
and saving the OLS coefficient (or just use ritest)
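
A sketch of the second route using the user-written ritest package (ssc install ritest); the variable names below are placeholders, not necessarily those in Thornton's dataset:

* Sketch: randomization inference with an OLS coefficient as the test statistic
* (treat, outcome, x1, x2 are placeholder variable names)
ritest treat _b[treat], reps(1000) seed(1): regress outcome treat x1 x2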
Judea Pearl and DAGs

Judea Pearl and colleagues in Artificial Intelligence at UCLA


developed DAG modeling to create a formalized causal
inference methodology
They make causality concepts extremely clear, they provide a map to
the estimation strategy, and maybe best of all, they
communicate to others what must be true about the data
generating process to recover the causal effect



Judea Pearl, 2011 Turing Award winner, drinking his first IPA
Further reading

1 Pearl (2018) The Book of Why: The


New Science of Cause and Effect, Basic Books (popular)
2 Morgan and Winship (2014)
Counterfactuals and Causal Inference: Methods and Principles
for Social Research, Cambridge University Press, 2nd edition
(excellent)
3 Pearl, Glymour and Jewell (2016)
Causal Inference In Statistics: A Primer, Wiley Books
(accessible)
4 Pearl (2009) Causality: Models, Reasoning and Inference,
Cambridge, 2nd edition (difficult)
5 Cunningham (2021) Causal Inference: The Mixtape, Yale, 1st
edition (best choice, no question)
Causal model

The causal model is sometimes called the structural model,


but for us, I prefer the former as it’s less alienating
It’s the system of equations describing the relevant aspects of
the world
It necessarily is filled with causal effects associated with some
particular comparative statics
To illustrate, I will assume a Beckerian human capital model
Human capital model: statements and graphs

Let’s describe my simplified Beckerian human capital model.


Individuals maximize utility by choosing consumption and
schooling (D) subject to multi-period budget constraint
Education has current costs but longterm returns
But people choose different levels of schooling based on a
number of things we will call “background” (B) which won’t be
in the dataset (“unobserved”)
And own schooling will also be affected by parental education
(PE) and family income (I)
Finally, wages (Y) are a function of own schooling (D) and family income (I)
Becker’s human capital causal model

We can represent that causal model visually

PE I

D Y
B

PE is parental education, B is “unobserved background factors


(i.e., “ability”)”, I is family income, D is college education and Y is
log wages. The DAG is an approximation of Becker’s underlying
(causal) human capital model.
Arrows, but also missing arrows

Before we dive into all this notation, couple of things

PE I

D Y
B

PE and D are caused by B. But why doesn’t B cause Y ?? Do you


believe this? Why/why not? We can dispute this, but notice – we
can see the assumption, which is transparent and communicates
the author’s beliefs, as well as the needed assumptions in their
forthcoming empirical model. Every empirical strategy makes
assumptions, but oftentimes they are not as transparent to us as
this is.
PE I

D Y
B

B is a parent of PE and D
PE and D are descendants of B
There is a direct (causal) path from D to Y
There is a mediated (causal) path from B to Y through D
There are four paths from PE to Y but none are direct, and
one is unlike the others
Colliders

PE I

D Y
B

Notice anything different with this DAG? Look closely.


D is a collider along the path B → D ← I (i.e., “colliding” at
D)
D is a noncollider along the path B → D → Y
Summarizing Value of DAGs imo

1 Facilitates the task of designing identification strategy for


estimating average causal effects
2 Facilitates the task of testing compatibility of the model with
your data
3 Visualizes the identifying assumptions which opens up the
model to critical scrutiny
Creating DAGs

The DAG depicts the relevant causal relationships describing the


relationship between D and Y
It will include:
All direct causal effects among the relevant variables in the
graph
All common causes of any pair of relevant variables in the
graph
No need to model a dinosaur stepping on a bug that, a million
years later, produces some evolved creature that impacted your decision
to go to college
We get ideas for DAGs from theory, models, observation,
experience, prior studies, intuition
Sometimes called the data generating process.
Confounding

Omitted variable bias has a name in DAGs: “confounding”


Confounding occurs when the treatment and the
outcome have a common cause or parent, which creates
spurious correlation between D and Y
D Y

The correlation between D and Y no longer reflects the causal


effect of D on Y
Backdoor Paths

Confounding creates backdoor paths between treatment and


outcome (D ← X → Y ) – i.e., spurious correlations
Not the same as mediation (D → X → Y )
We can “block” backdoor paths by conditioning on the
common cause X
Once we condition on X , the correlation between D and Y
estimates the causal effect of D on Y
Conditioning means calculating
E [Y |D = 1, X ] − E [Y |D = 0, X ] for each value of X then
combining (e.g., integrating)

D Y

X
Blocked backdoor paths

A backdoor path is blocked if and only if:


It contains a noncollider that has been conditioned on
Or it contains a collider that has not been conditioned on
Examples of blocked paths

Examples:
1 Conditioning on a noncollider blocks a path:

  X → Z → Y (conditioning on Z blocks the path)
2 Conditioning on a collider opens a path (i.e., creates spurious
correlations):
  Z → X ← Y (conditioning on the collider X opens the path)
3 Not conditioning on a collider blocks a path:
  Z → X ← Y (leaving the collider X alone keeps the path blocked)
Backdoor criterion

Backdoor criterion
Conditioning on X satisfies the backdoor criterion with respect to
(D, Y ) directed path if:
1 All backdoor paths are blocked by X
2 No element of X is a collider
In words: If X satisfies the backdoor criterion with respect to
(D, Y ), then controlling for or matching on X identifies the causal
effect of D on Y
What control strategy meets the backdoor criterion?

List all backdoor paths from D to Y . I’ll wait.

X1 D Y

X2

What are the necessary and sufficient set of controls which will
satisfy the backdoor criterion?
What if you have an unobservable?

List all the backdoor paths from D to Y .


X1

U X2 D Y

What are the necessary and sufficient set of controls which will
satisfy the backdoor criterion?
What about the unobserved variable, U?
Multiple strategies

X1

X3 D Y

X2
X1

X3 D Y

X2

Conditioning on the common causes, X1 and X2 , is sufficient


. . . but so is conditioning on X3
Testing the Validity of the DAG

The DAG makes testable predictions


Conditional on D and I , parental education (PE ) should no
longer be correlated with Y
Can be hard to figure this out by hand, but software can help
(e.g., DAGitty at dagitty.net is browser based)

PE I

D Y
B
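
One way to see what this testable implication means in practice is to simulate data from a DAG like this one and check it. The sketch below uses made-up linear equations and coefficients (and a continuous D) purely for illustration:

* Sketch: simulate a Becker-style DAG and test the implied conditional independence
clear all
set seed 10
set obs 10000
gen b   = rnormal()                            // unobserved background (ability)
gen pe  = 0.8*b + rnormal()                    // parental education
gen inc = 0.7*pe + rnormal()                   // family income
gen d   = 0.5*b + 0.4*pe + 0.3*inc + rnormal() // schooling
gen y   = 1.0*d + 0.6*inc + rnormal()          // wages: no direct arrow from B or PE

* Implication of the DAG: conditional on D and I, PE should not predict Y
reg y pe d inc        // coefficient on pe should be close to zero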
Collider bias

Conditioning on a collider introduces spurious


correlations; can even mask causal directions
There is only one backdoor path from D to Y

X1 D Y

X2

Conditioning on X1 blocks the backdoor path


But what if we also condition on X2 ?

X1 D Y

X2

Conditioning on X2 opens up a new path, creating new


spurious correlations between D and Y
Even controlling for pretreatment covariates can create
bias
Name the backdoor paths. Is it open or closed?
U1

X D Y

U2

But what if we condition on X ?


U1

X D Y

U2
Living in reality - he doesn’t love you

Fact #1: We can't know if we have a collider or confounder


problem without making assumptions about the
causal model (i.e., it's not in the codebook)
Fact # 2: You can’t just haphazardly throw in a bunch of
controls on the RHS (i.e., “the kitchen sink”) bc you may
inadvertently be conditioning on a collider which can lead to
massive biases
Fact # 3: You have no choice but to leverage economic
theory, intuition, intimate familiarity with institutional details
and background knowledge for research designs.
Fact #4: You can only estimate causal effects with data and
assumptions.
Examples of collider bias
Bad controls

Angrist and Pischke in MHE talk about a specific type of


danger associated with controlling for an outcome – “bad
controls”
The problem is not controlling for an outcome per se;
the problem is controlling for a collider and not correcting for
that
This has implications for when you work with non-random
administrative data, too
Sample selection example of collider bias

Important: Since unconditioned colliders block back-door paths,


what exactly does conditioning on a collider do? Let’s illustrate
with a fun example and some made-up data
CNN.com headline: Megan Fox voted worst – but sexiest –
actress of 2009 (link)
Are these two things actually negatively correlated in the
world?
Assume talent and beauty are independent, but each causes
someone to become a movie star. What’s the correlation
between talent and beauty for a sample of movie stars
compared to the population as a whole (stars and non-stars)?
What if the sample consists only of movie stars?

Movie Star

Talent Beauty
Stata code

clear all
set seed 3444

* 2500 independent draws from standard normal distribution
set obs 2500
generate beauty = rnormal()
generate talent = rnormal()

* Creating the collider variable (star)
gen score = (beauty + talent)
egen c85 = pctile(score), p(85)
gen star = (score >= c85)
label variable star "Movie star"

* Conditioning on the top 15%
twoway (scatter beauty talent, mcolor(black) msize(small) msymbol(smx)), ///
    ytitle(Beauty) xtitle(Talent) subtitle(Aspiring actors and actresses) ///
    by(star, total)
Figure: Top left figure: non-star sample scatter plot of beauty (vertical axis) and talent
(horizontal axis). Top right figure: star sample scatter plot of beauty and talent.
Bottom left figure: entire (stars and non-stars combined) sample scatter plot of beauty and
talent.
Stata

Run Stata file star.do


Occupational sorting and discrimination example of collider
bias

Let’s look at another example: very common for think tanks


and journalists to say that the gender gap in earnings
disappears once you control for occupation.
But what if occupation is a collider, which it could be in a
model with occupational sorting
Then controlling for occupation in a wage regression searching
for discrimination can lead to all kinds of crazy results even in
a simulation where we explicitly design there to be
discrimination
DAG

F y

o A

F is female, d is discrimination, o is occupation, y is earnings and


A is ability. Dashed lines mean the variable cannot be observed.
Note, by design, being a female has no direct effect on earnings or
occupation except through discrimination, and has no relationship with
ability. So variation in earnings comes through discrimination, occupation,
and ability.
d

F y

o A

Mediation and Backdoor paths


1 d →o→y
2 d →o←A→y
Stata model (Erin Hengel)

Erin Hengel (www.erinhengel.com) and I worked out this


code and she gave me permission to put in my Mixtape
Let’s look at collider_discrimination.do or
collider_discrimination.R together
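
If you don't have the file handy, here is a minimal sketch of a data generating process consistent with the DAG above and the table that follows (the actual collider_discrimination.do may differ in details):

* Sketch: discrimination operates only through F -> d; occupation is a collider
clear all
set seed 541
set obs 10000
gen female = runiform() >= 0.5
gen ability = rnormal()
gen discrimination = female
gen occupation = 1 + 2*ability + 0*female - 2*discrimination + rnormal()
gen wage = 1 - 1*discrimination + 1*occupation + 2*ability + rnormal()

reg wage female                      // combined effect (direct plus via occupation)
reg wage female occupation           // conditioning on the collider: sign can flip
reg wage female occupation ability   // recovers the -1 direct wage effect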
Table: Regressions illustrating collider bias with simulated gender disparity

                              Unbiased          Biased      Unbiased wage
Covariates:                   combined effect               effect only

Female                        -3.074***         0.601***    -0.994***
                              (0.000)           (0.000)     (0.000)
Occupation                                      1.793***    0.991***
                                                (0.000)     (0.000)
Ability                                                     2.017***
                                                            (0.000)

N                             10,000            10,000      10,000
Mean of dependent variable    0.45              0.45        0.45

Recall we designed there to be a discrimination coefficient of -1


If we do not control for occupation, then we get the combined effect of
d → o → y and d → y
Because it seems intuitive to control for occupation, notice column 2 - the sign
flips!
We are only able to isolate the direct causal effect by conditioning on ability and
occupation, but ability is unobserved
Administrative data

Admin data has become extremely common, if not absolutely


necessary
But naive use of admin data can be dangerous if the drawing
of the sample is itself a collider problem (Heckman 1979;
Elwert and Winship 2014)
Let’s look at a new paper by Fryer (2019) and a critique by
Knox, et al. (2019)
Collider bias and police use of force

Claims of excessive and discriminator use of police force


against minorities (e.g., Black Lives Matter, Trayvon Martin,
Michael Brown, Eric Garner)
Challenging to identify
Police-citizen interactions are conditional on interactions
having already been triggered
That initial interaction is unobserved
Fryer (2019) is a monumental study for its data collection and
analysis: Stop and Frisk, Police-Public Contact Survey, and
admin data from two jurisdictions
Codes up almost 300 variables from arrest narratives which
range from 2-100 pages in length – shoeleather!
Initial interaction

Fryer finds that blacks and Hispanics were more than 50%
more likely to have an interaction with the police in NYC Stop
and Frisk as well as the Police-Public Contact survey
It survives extensive controls – magnitudes fall, but still very
large (21%)
Moves to admin data
Conditional on police interaction, no racial differences in
officer-related shootings
Fryer calls it one of the most surprising findings in his career
Lots of eyes on this study as a result of the counter intuitive
results; published in JPE
Knox, et al. (2020) claim his data is itself a collider. What?
Controls
X

Minority Stop Force


D M Y

U
Suspicion

Fryer told us D → M exists from both Stop and Frisk and


Police-Public Contact data. But note: the admin data consist only of
stops (M), so the sample itself conditions on a collider. If this DAG is
true, then conditioning on M opens spurious correlations between D and Y
which may dilute our ability to estimate causal effects.
Knox, et al (2020)

Move from DAG to more contemporary potential outcomes


notation to design relevant parameters
Use potential outcomes and bounds
Even their lower bound estimates of the incidence of police
violence against civilians are more than 5x higher than what
Fryer (2019) finds
Heckman (1979) – we cannot afford to ignore sample selection
Summarizing all of this

Your dataset will not come with a codebook flagging some


variables as “confounders” and other variables as “colliders”
because those terms are always context specific
Except for some unique situations that aren’t generally
applicable, you also don’t always know statistically you have an
omitted variable bias problem; but both of these are fatal for
any application
You only know to do what you’re doing based on knowledge
about data generating process.
All identification must be guided by theory, experience,
observation, common sense and knowledge of institutions
DAGs absorb that information and can be then used to write
out the explicit identifying model
DAGs are not panacea

DAGs cannot handle, though, reverse causality or simultaneity


So there are limitations. “All models are wrong but some are
useful”
They are also not popular (see Twitter ongoing debates which
have descended into light hearted jokes as well as aggressive
debates)
But I think they are helpful and while not necessary, showcase
what is necessary – assumptions
Heckman (1979) can maybe provide some justification at times
What is regression discontinuity design?

Very popular particular type of research design known as regression


discontinuity design (RDD). Cook (2008) has a fascinating history
of thought on how and why.
Donald Campbell, educational psychologist, invented
regression discontinuity design (Thistlethwaite and Campbell,
1960), but then it went dormant for decades (Cook 2008).
Angrist and Lavy (1999) and Black (1999) independently
rediscover it. It’s become incredibly popular in economics



Tell me what you think is happening

Figure: Fraction enrolled at the flagship state university, plotted against
SAT points above the admission cutoff, with local averages (Figure 1 from
"The Effect of Attending the Flagship State University on Earnings").

Tell me what you think is happening

Figure: Natural log of annual earnings for white men ten to fifteen years
after high school graduation, fit with a cubic polynomial of adjusted SAT
score. Estimated discontinuity = 0.095 (z = 3.01) (Figure 2 from the same
paper).


What is a regression discontinuity design?

We want to estimate some causal effect of a treatment on


some outcome, but we're worried about selection bias

E[Y0|D = 1] ≠ E[Y0|D = 0]

due to self-selection into treatment


RDD is based on an idea: if treatment assignment occurs
abruptly when some underlying variable X called the "running
variable" passes a cutoff c0, then we can use that to estimate
the causal effect even of a self-selected treatment
Running and jumping

Firms, schools and govt agencies have running variables that


are used to assign treatments in their rules
And consequently, probabilities of treatment will “jump” when
that running variable exceeds a known threshold
Most effective RDD studies involve programs where running
variables assign treatments based on a “hair trigger”
Good reasons; inexplicable reasons; arbitrary rules; a choice
made by necessity and resource constraints; natural
experiments
Selection examples and solutions from the literature

Think of these in light of a treatment where


E[Y0|D = 1] ≠ E[Y0|D = 0]
Yelp rounded a continuous score of ratings to generate stars
which Anderson and Magruder 2011 used to study firm revenue
US targeted air strikes in Vietnam using rounded risk scores
which Dell and Querubin 2018 used to study the military and
political activities of the communist state
Card, Dobkin, and Maestas 2008 studied the effect of
universal healthcare on mortality and healthcare usage
exploiting jumps at age 65
Almond, et al. 2010 studied the effect of intensive medical
attention on health outcomes when a newborn’s birthweight
fell just below 1,500 grams
Hungry, hungry hippo

Data requirements can be substantial. Large sample sizes are


characteristic features of the RDD
If there are strong trends, one typically needs a lot of data for
reasons I’ll explain soon
Researchers are typically using administrative data or settings
such as birth records where there are many observations
Might explain why the method never caught on until the 00’s
(A) Data generating graph (B) Limiting graph

X U X → c0 U

D Y D Y



Figure: Sharp vs. Fuzzy RDD. The conditional probability of treatment
(vertical axis, 0 to 1) is plotted against the running variable X: in the
sharp design it jumps from 0 to 1 at the cutoff; in the fuzzy design it
jumps discontinuously but by less than one.


Sharp vs. Fuzzy RDD

There’s traditionally thought to be two kinds of RD designs:


1 Sharp RDD: Treatment is a deterministic function of running
variable, X . Example: Medicare benefits.
2 Fuzzy RDD: Discontinuous “jump” in the probability of
treatment when X > c0 . Cutoff is used as an instrumental
variable for treatment. Example: attending state flagship
Fuzzy is a type of IV strategy and requires explicit IV
estimators like 2SLS; sharp is reduced form IV and doesn’t
require IV-like estimators
Overlap

Independence implies an equal distribution of characteristics


across two groups guaranteeing overlap
In an RCT you can find 65 year olds treated and untreated
But RDD doesn’t have this feature bc you don’t have groups
with the same value of X in each group, so no overlap
64 year olds are in the control group, not treatment. 66 year olds are
in treatment, not control
Some methods require overlap and therefore are off the table
without it; but RDD has a workaround using extrapolation
Treatment assignment in the sharp RDD

Deterministic treatment assignment ("sharp RDD")


In Sharp RDD, treatment status is a deterministic and
discontinuous function of a covariate, Xi:

Di = 1 if Xi ≥ c0
Di = 0 if Xi < c0

where c0 is a known threshold or cutoff. In other words, if you


know the value of Xi for a unit i, you know treatment assignment
for unit i with certainty.

Universal health insurance: Americans aged 64 are not eligible for


Medicare, but Americans aged 65 or older (X ≥ c0 = 65) are eligible for
Medicare (ignoring disability exemptions)
Treatment effect definition and estimation

Definition of treatment effect


The treatment effect parameter, δ, is the discontinuity in the
conditional expectation function:

δ = lim_{Xi → c0+} E[Yi1 | Xi = c0] − lim_{Xi → c0−} E[Yi0 | Xi = c0]
  = lim_{Xi → c0+} E[Yi | Xi = c0] − lim_{Xi → c0−} E[Yi | Xi = c0]

The sharp RDD estimation is interpreted as an average causal


effect of the treatment at the discontinuity

δSRD = E [Yi1 − Yi0 |Xi = c0 ]

D is correlated with X and deterministic function of X ; overlap


only occurs in the limit and thus the treatment effect is in the limit
as X approaches c0
Extrapolation

In RDD, the counterfactuals are conditional on X .


We use extrapolation in estimating treatment effects with the
sharp RDD bc we do not have overlap
Left of cutoff, only non-treated observations, Di = 0 for
X < c0
Right of cutoff, only treated observations, Di = 1 for X ≥ c0
The extrapolation is to a counterfactual



Extrapolation

Estimation methods attempt to approximate the limiting parameter


using units left and right of the cutoff

Figure: Dashed lines are extrapolations


Key identifying assumption

Smoothness (or continuity) of conditional expectation functions


(Hahn, Todd and Van der Klaauw 2001)
E [Yi0 |X = c0 ] and E [Yi1 |X = c0 ] are continuous (smooth) in X at
c0 .

Potential outcomes not actual outcomes


If population average potential outcomes, Y 1 and Y 0 , are
smooth functions of X through the cutoff, c0 , then potential
average outcomes won’t jump at c0 .
Implies the cutoff is exogenous – i.e., nothing else changes
related to potential outcomes at c0
Unobservables are evolving smoothly, too, through the cutoff
Smoothness is the identifying assumption and untestable

The smoothness assumption allows us to use average outcome


of units right below the cutoff as a valid counterfactual for
units right above the cutoff.
In other words, extrapolation is allowed if smoothness is
credible, and extrapolation is nonsensical if smoothness isn't
credible
The causal effect of the treatment will be based on
extrapolation from the trend, E [Yi0 |X < c0 ], to those values
of X > c0 for the E [Yi0 |X > c0 ].
Means you have to think long and hard about smoothness and
what violations mean in your context
Why then is it not directly testable? Because potential
outcomes are counterfactual
Graphical example of the smoothness assumption

Note these are potential not actual outcomes


Graphical example of the treatment effect, not the
smoothness assumption

Figure: Outcome (Y) plotted against test score (X); the vertical distance
at the cutoff is the treatment effect.

Note that these are actual, not potential outcomes


Re-centering the data

It is common for authors to transform X by “centering” at c0 :

Yi = α + β(Xi − c0 ) + δDi + εi

This doesn’t change the interpretation of the treatment effect


– only the interpretation of the intercept.
Re-centering the data

Example: Medicare and age 65. Center the running variable


(age) by subtracting 65:

Y = β0 + β1 (Age − 65) + β2 Edu


= β0 + β1 Age − β1 65 + β2 Edu
= α + β1 Age + β2 Edu

where α = β0 − β1 65.
All other coefficients, notice, have the same interpretation,
except for the intercept.
Regression without re-centering

reg y D x

Regression with centering

gen x_c = x - 140

reg y D x_c
Nonlinearity bias

Smoothness and linearity are different things.


What if the trend relation E [Yi0 |Xi ] does not jump at c0 but
rather is simply nonlinear?
Then your linear model will identify a treatment effect when
there isn’t because the functional form had poor predictive
properties beyond the cutoff
Let’s look at a simulation
gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x + x2 + rnormal(0,1000)

scatter y x if D==0, msize(vsmall) || scatter y x ///
    if D==1, msize(vsmall) legend(off) xline(140, ///
    lstyle(foreground)) ylabel(none) || lfit y x ///
    if D==0, color(red) || lfit y x if D==1, ///
    color(red) xtitle("Test Score (X)") ///
    ytitle("Outcome (Y)")
See how the two lines don't touch at c0 even though empirically they
should? That's bc the linear fit is the wrong functional form – we know
this from the simulation.
Sharp RDD: Nonlinear Case

Suppose the nonlinear relationship is E [Yi0 |Xi ] = f (Xi ) for


some reasonably smooth function f (Xi ) (drumroll – like a
cubic!)
In that case we’d fit the regression model:

Yi = f (Xi ) + δDi + ηi

Since f (Xi ) is counterfactual for values of Xi > c0 , how will


we model the nonlinearity?
There are 2 common ways of approximating f (Xi )
Nonlinearities

Until Gelman and Imbens 2018, people favored "higher order


polynomials," but this is problematic due to overfitting. Gelman and
Imbens 2018 recommend at most a quadratic
1 Use global and local regressions with f (Xi ) equalling a p th
order polynomial

Yi = α + δDi + β1 xi + β2 xi2 + · · · + βp xip + ηi

2 Or use some nonparametric kernel method which I’ll cover later
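
For the nonparametric route, a common sketch uses the user-written rdrobust package (installable via ssc); y, x, and the cutoff of 140 below follow the simulation in these slides:

* Sketch: local-polynomial RD estimation and the standard RD plot
rdrobust y x, c(140) p(1)    // local linear fit on each side, data-driven bandwidth
rdplot y x, c(140)           // binned scatter with polynomial fits overlaid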


Different polynomials on the 2 sides of the discontinuity

We can generalize the function, f (xi ), by allowing it to differ


on both sides of the cutoff by including them both individually
and interacting them with Di .
In that case we have:

E [Yi0 |Xi ] = α + β01 X̃i + β02 X̃i2 + · · · + β0p X̃ip


E [Yi1 |Xi ] = α + δ + β11 X̃i + β12 X̃i2 + · · · + β1p X̃ip

where X̃i is the centered running variable (i.e., Xi − c0 ).


Lines to the left, lines to the right of the cutoff

Re-centering at c0 ensures that the treatment effect at


Xi = c0 is the coefficient on Di in a regression model with
interaction terms
As Lee and Lemieux (2010) note, allowing different functions
on both sides of the discontinuity should be the main results in
an RDD paper
Different polynomials on the 2 sides of the discontinuity

To derive a regression model, first note that the observed


values must be used in place of the potential outcomes:

E[Y|X] = E[Y0|X] + (E[Y1|X] − E[Y0|X])D


which is the switching equation from earlier expressed in terms


of conditional expectation functions
Regression model you estimate is:

Yi = α + β01 x̃i + β02 x̃i2 + · · · + β0p x̃ip


+δDi + β1∗ Di x̃i + β2∗ Di x̃i2 + · · · + βp∗ Di x̃ip + εi

where β1* = β11 − β01, β2* = β12 − β02, and βp* = β1p − β0p
The treatment effect at c0 is δ
Polynomial simulation example

capture drop y x2 x3

gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x + x2 + rnormal(0,1000)

reg y D x x2 x3
predict yhat

scatter y x if D==0, msize(vsmall) || scatter y x ///
    if D==1, msize(vsmall) legend(off) xline(140, ///
    lstyle(foreground)) ylabel(none) || line yhat x ///
    if D==0, color(red) sort || line yhat x if D==1, ///
    sort color(red) xtitle("Test Score (X)") ///
    ytitle("Outcome (Y)")
Polynomial simulation example

Figure: Third degree polynomial. Actual model second degree polynomial.

Notice: no more gap at c0 once we model the function f (x)


Stata simulation

gen x2_c = x2 - 140
gen x3_c = x3 - 140

reg y D x x2
reg y D x_c x2_c
Polynomial simulation example

Notice: no more gap at c0 once we model the function f (x) (e.g.,


D is insignificant once we include polynomials)
Polynomial simulation example

And centering did nothing to the interpretation of the main results


(D), only to the intercept.
Robustness against what?

Are you done now that you have your main results? No
Your main results are only causal insofar as smoothness is a
credible belief, and since smoothness isn’t guaranteed by “the
science” like an RCT, you have to build your case
You must now scrutinize alternative hypotheses that are
consistent with your main results through sensitivity checks,
placebos and alternative approaches



Main Challenges

Classify your concern regarding smoothness violations into two


categories:
Manipulation on the running variable
Endogeneity of the cutoff
Most robustness checks are aimed at building credibility around these
two concerns

Treatment is not as good as randomly assigned around the


cutoff, c0 , when agents are able to manipulate their running
variable scores. This happens when:
1 the assignment rule is known in advance
2 agents are interested in adjusting
3 agents have time to adjust
4 administrative quirks like nonrandom heaping along the
running variable
Examples include re-taking an exam, self-reported income,
certain types of non-random rounding.
Since necessarily treatment assignment is no longer
independent of potential outcomes, it’s likely this implies
smoothness has been violated
Test 1: Manipulation of the running variable

Manipulation of the running variable


Assume a desirable treatment, D, and an assignment rule X ≥ c0 .
If individuals sort into D by choosing X such that X ≥ c0 , then we
say individuals are manipulating the running variable.

Also can be called “sorting on the running variable” – same thing


A badly designed RCT

Suppose a doctor randomly assigns heart patients to statin


and placebo to study the effect of the statin on heart attacks
within 10 years
Patients are placed in two different waiting rooms, A and B,
and plans to give those in A the statin and those in B the
placebo.
The doors are unlocked and movement between the two can
happen
Versions of this happened with HIV RCTs in the 1980s
ironically in which medication from treatment group was given
to the control group, but I’m talking about something a little
different
McCrary Density Test

We would expect waiting room A to become crowded. In the RDD


context, sorting on the running variable implies heaping on the
“good side” of c0
McCrary (2008) suggests a formal test: under the null the
density should be continuous at the cutoff point.
Under the alternative hypothesis, the density should increase
at the kink (where D is viewed as good)
1 Partition the assignment variable into bins and calculate
frequencies (i.e., number of observations) in each bin
2 Treat those frequency counts as dependent variable in a local
linear regression
This is oftentimes visualized with confidence intervals
illustrating the effect of the discontinuity on density - you need
no jump to pass this test
McCrary density test

The McCrary Density Test has become mandatory for every


analysis using RDD.
If you can estimate the conditional expectations, you evidently
have data on the running variable. So in principle you can
always do a density test
You can download the (no longer supported) Stata ado
package, DCdensity, to implement McCrary’s density test
(http://eml.berkeley.edu/~jmccrary/DCdensity/)
You can also install the rdrobust and rddensity packages for Stata
and R; rddensity implements a local polynomial density test in the same spirit
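
A sketch of the density test with the user-written rddensity package (running variable and cutoff follow the earlier simulation; installable via ssc):

* Sketch: local-polynomial density test at the cutoff (McCrary-style)
rddensity x, c(140)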
Caveats about McCrary Density Test

For RDD to be useful, you already need to know something about the
mechanism generating the assignment variable and how susceptible it
could be to manipulation. Note the rationality of economic actors that
this test is built on.
A discontinuity in the density is "suspicious" – it suggests
manipulation of X around the cutoff is probably going on. In principle
one doesn't need continuity.
This is a high-powered test. You need a lot of observations at c0 to
distinguish a discontinuity in the density from noise.

Figure: Panels from McCrary (2008), Figure 2 ("Hypothetical example:
gaming the system with an income-tested job training program"). Panel C
is the density of income when there is no pre-announcement and no
manipulation. Panel D is the density of income with pre-announcement and
manipulation.
Visualizing manipulation

Figure: Figures 2 and 3 from Eric Allen, Patricia Dechow, Devin Pope
and George Wu’s (2013) “Reference-Dependent Preferences: Evidence from
Marathon Runners”. Figure 2 shows the distribution of marathon
finishing times (n = 9,378,546), with the dark bars highlighting the
density in the minute bin just prior to each 30-minute threshold.
Figure 3 plots the McCrary z-statistic from running the test at each
minute threshold from 2:40 to 7:00.
http://faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf

Newborn mortality and medical expenditure

Almond, et al. 2010 attempted to estimate the causal effect of


medical expenditures on health outcomes, which is ordinarily
rife with selection bias due to endogenous physician behavior
(independence is violated)
In the US, newborns whose birthweight falls below 1500 grams
receive heightened medical attention because 1500 grams marks the
“very low birth weight” range, which is quite dangerous for infants
Used RDD with hospital administrative records and found
1-year infant mortality decreased by 1pp just below 1500
grams compared to just above – medical expenditures are
cost-effective
Heaping problem

Figure: Distribution of births by gram from Almond, et al. 2010


Heaping, Running and Jumping

This picture shows “heaping” which is excess mass at certain


points along the running variable
It is unlikely that births actually heap at certain intervals; more
likely someone is rounding
Some scales may be less sophisticated, some practices may be
more common in some types of hospitals than others, and there
could be outright manipulation
Failure to reject

Almond, et al. 2010 used the McCrary density test but found
no evidence of manipulation
Ironically, the McCrary density test may fail to reject in a
heaping scenario
In this scenario, the heaping is associated with high mortality
children who are outliers compared to newborns both to the
left and to the right
“This [heaping at 1500 grams] may be a signal that
poor-quality hospitals have relatively high propensities to
round birth weights but is also consistent with
manipulation of recorded birth weights by doctors, nurses,
or parents to obtain favorable treatment for their children.
Barreca, et al. 2011 show that this nonrandom heaping
leads one to conclude that it is “good” to be strictly less
than any 100-g cutoff between 1,000 and 3,000 grams.”
Donut holes

RDD compares means as we approach c0 from either direction


along X
Estimates should not logically be sensitive to the observations
at the cutoff – if they are, then smoothness may be violated
Through Monte Carlos, Barreca, et al. 2016 suggest an
alternative strategy – drop the units in the vicinity of 1500
grams, and re-estimate the model
They call this a “donut” RDD because you drop the units at the
cutoff (the “donut hole”) and estimate your model on the units
in the neighborhood instead
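A minimal donut sketch, assuming variables named bweight and death1year and an arbitrary 10-gram hole around the 1,500-gram cutoff:

. preserve
. * drop the heaped observations right at the cutoff (the donut hole)
. drop if abs(bweight - 1500) < 10
. * re-estimate on the surrounding neighborhood; treatment here is being below 1,500 grams, so interpret the sign accordingly
. rdrobust death1year bweight, c(1500)
. restore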
Newborn mortality and medical expenditure

Dropping units (e.g., trimming) always changes the parameter


we’re estimating – it’s not the ATE, not the ATT, and not even the
LATE except under strong assumptions
In this case, dropping at the threshold reduced sample size by
2%
But the strength of this practice is that it allows for the
possibility that units at the heap differ markedly, due to
selection bias, from those in the surrounding area
Donut RDD analysis found effect sizes that were
approximately 50% smaller than Almond, et al. 2010
Caution with heaping is a good attitude to have
Endogenous cutoffs

(A) Data generating graph (B) Limiting graph

X U X → c0 U

D Y D Y
Endogenous cutoffs

RCT randomization breaks all ordinary backdoor paths


between D and Y because that’s how “the science” of
randomization works
RDD blocks the backdoor path from D ← X ←? → U → Y ;
it assumes away the backdoor path D ← U → Y
But if cutoffs are endogenous, then that second backdoor path is
present, which means that, absent the treatment, smoothness would’ve
been violated anyway
Smoothness isn’t guaranteed by an RDD unless D ← U → Y
isn’t present – which is why it is the critical identifying
assumption
Endogenous cutoffs

Examples of endogenous cutoffs


Age thresholds used for policy (e.g., a person turns 18 and faces
more severe penalties for crime) are correlated with other
variables that affect the outcome (e.g., graduation, voting
rights, etc.)
Age 65 is correlated with factors that directly affect healthcare
expenditure and mortality such as retirement
But some of these can be weakly defended with balance tests
(observables), or may be directly testable through placebos
assuming you have the data
Evaluating smoothness through balance

Balance tests and placebo tests are related but distinct


We can’t directly test smoothness bc we are missing
counterfactuals
Ask yourself: why should average values of exogenous
covariates jump if potential outcomes are smooth through the
cutoff?
If there are exogenous (non-collider) covariates strongly
associated with the potential outcomes but not caused by the
treatment, then they should be the same on either side of the
cutoff if smoothness holds
In this sense, balance tests are an indirect search for evidence
supporting smoothness
Balance implementation

Don’t make it hard – do what you did to Y , only to Z


Choose other noncolliders associated with potential outcomes,
Z
Create similar graphical plots as you did for Y
Could also conduct the parametric and nonparametric
estimation on Z
You do not want to see a jump around the cutoff, c0
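A minimal sketch, assuming a predetermined covariate mom_age and a running variable bweight with a cutoff at 1,500 grams: rerun the same RD machinery with the covariate as the outcome and hope for an estimate near zero.

. * balance check: the covariate should not jump at the cutoff
. rdrobust mom_age bweight, c(1500)
. rdplot mom_age bweight, c(1500)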
Visualizing Balance

Figure: Figure III from Lee, Moretti and Butler (2004), “Do Voters
Affect or Elect Policies?”, Quarterly Journal of Economics: similarity
of constituents’ characteristics in bare Democrat and Republican
districts. Panels refer to (from top left to bottom right) the
following district characteristics: real income, percentage with
high-school degree, percentage black, percentage eligible to vote.
Circles represent the average characteristic within intervals of 0.01
in Democrat vote share. The continuous line represents the predicted
values from a fourth-order polynomial in vote share fitted separately
for points above and below the 50 percent threshold. The dotted line
represents the 95 percent confidence interval.
Placebos at non-discontinuous points

Placebos in time are common with panels; placebo in running


variables are their equivalent in RDD
Imbens and Lemieux (2010) suggest we look at one side of the
discontinuity (e.g., X < c0 ), take the median value of the
running variable in that section, and pretend it was a
discontinuity, c0′
Then test whether in reality there is a discontinuity at c0′. You
do not want to find anything.
Remember though: smoothness at placebo points is neither
necessary nor sufficient for smoothness in the potential
outcomes at the cutoff
So there are Type I and Type II risks of error with this
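A sketch of that placebo exercise, assuming an outcome y, running variable x, and a true cutoff at 0:

. * use the median of the running variable on the left side as a fake cutoff
. summarize x if x < 0, detail
. local c_placebo = r(p50)
. rdrobust y x if x < 0, c(`c_placebo')
. * you do not want to find a significant jump here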
Pictures, pictures and more pictures

Synthetic control and RDD are visually intense


Eyeball tests are rampant (and deservedly so) in RDD studies
Even if your main results are all parametric, you’ll still want to
present at least some nonparametric style pictures according to
Imbens and Lemieux (2010)
Let’s review some of the graphs you have to include



Outcomes

1 Outcome by running variable, (Xi ):


Construct bins and average the outcome within bins on both
sides of the cutoff
Look at different bin sizes when constructing these graphs
Plot the running variable, Xi , on the horizontal axis and the
average of Yi for each bin on the vertical axis
Consider plotting a relatively flexible regression line on top of
the bin means, but some readers prefer an eyeball test without
the regression line to avoid “priming”
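A minimal sketch of building such a plot by hand, assuming an outcome y, running variable x, a cutoff at 0, and bins of width 0.1:

. * assign each observation to a 0.1-wide bin labeled by its left endpoint
. gen xbin = floor(x/0.1)*0.1
. preserve
. collapse (mean) ybar = y, by(xbin)
. twoway scatter ybar xbin, xline(0)
. restore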
Figure: Outcomes by running variable. From Lee and Lemieux (2010)
based on Lee (2008).
Figure: Outcomes by running variable with smaller bins. From Lee and
Lemieux (2010) based on Lee (2008).
Probability of treatment

2 Probability of treatment by running variable if fuzzy


RDD
In a fuzzy RDD, you also want to see that the treatment
variable jumps at c0
This tells you whether you have a first stage (“bite”)
Let’s look at that again from earlier Hoekstra (2008) and
enrollment at the flagship
Figure: Fraction enrolled at the flagship state university, plotted
as local averages against the admissions running variable centered at
the cutoff. From Hoekstra, Figure 1.
McCrary Density

3 Density of the running variable


One should plot the number of observations in each bin.
This plot allows one to investigate whether there is a discontinuity
or heaping in the distribution of the running variable at the
threshold
Heaping or discontinuities in the density suggest that people
can manipulate their running variable score
This is an indirect test of the identifying assumption that each
individual has imprecise control over the assignment variable;
precise control would likely violate smoothness
Figure: Density of the running variable. From Lee and Lemieux (2010)
based on Lee (2008).
Balance pictures

4 Covariates by a running variable


Construct a similar graph to the outcomes graph but use a
noncollider covariate as the “outcome”
Balance implies smoothness through the cutoff, c0 .
If noncollider covariates jump at the cutoff, one is probably
justified in concluding that the potential outcomes jump there too,
violating smoothness
Figure: Covariates by running variable. From Lee and Lemieux (2010)
based on Lee (2008).
Inference – honesty

Lee and Card (2008) and Lee and Lemieux (2010) recommend
clustering standard errors on the running variable
Kolesár and Rothe (2018) provide extensive theoretical and
simulation-based evidence that this is not good; you’d be
better off with plain heteroskedasticity-robust standard errors
They propose two alternative confidence intervals that achieve
correct coverage in large samples – called “honest” (great
intro! Still studying this procedure)
Unavailable in Stata, but is available in R – RDHonest – at
https://github.com/kolesarm/RDHonest



Inference – randomization inference

Cattaneo, et al. (2015) say to consider that the cutoff is a


randomized experiment
Use randomization inference which is a test of the null of no
individual unit level treatment effect at the cutoff
Parametric vs. nonparametric approaches

Least squares approaches, because they model the
counterfactual using functional forms, are parametric
As a result, they can have poor predictive properties for
counterfactuals above/below the cutoff
Another way of approximating f (Xi ) is to use a nonparametric
kernel, which has its own problems – just not that one
Kernel regressions

The nonparametric kernel method has its problems in this case
because you are trying to estimate regressions at the cutoff point.
This results in a “boundary problem”.
While the “true” effect is AB, with a certain bandwidth a
rectangular kernel would estimate the effect as A′B′
There is therefore systematic bias with the kernel method if
f (X ) is upwards or downwards sloping
Kernel weighted local polynomial regression

The nonparametric one-sided kernel estimation problems are


called “boundary problems” at the cutoff (Hahn, Todd and Van
der Klaauw 2001)
Kernel estimation (such as lowess) may have poor properties
because the point of interest is at a boundary
They proposed to use “local linear nonparametric regressions”
instead
Local linear regression with weights

Local linear nonparametric regression substantially reduces the


bias
Think of it as a weighted regression restricted to a window –
kernel provides the weights to that regression.

(â, b̂) ≡ argmin_{a,b} ∑_{i=1}^{n} (yi − a − b(xi − c0))² K((xi − c0)/h) 1(xi > c0)

where xi is the value of the running variable, c0 is the cutoff, K is a


kernel function and h > 0 is a suitable bandwidth
Animation of a local linear regression

https://twitter.com/page_eco/status/958687180104245248
Estimation

Stata’s lpoly command estimates kernel-weighted local polynomial
regressions.
A rectangular kernel would give the same result as E [Y ] at a
given bin on X . The triangular kernel gives more importance
to observations close to the center.
This method will be sensitive to how large the bandwidth
(window) you choose
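A minimal sketch of fitting the smoothers separately on each side of the cutoff, assuming variables y and x, a cutoff at 0, and an arbitrary bandwidth of 0.1:

. * local linear fit with a triangle kernel, left and right of the cutoff
. lpoly y x if x < 0, degree(1) kernel(triangle) bwidth(0.1) nograph generate(x0 yhat0)
. lpoly y x if x >= 0, degree(1) kernel(triangle) bwidth(0.1) nograph generate(x1 yhat1)
. twoway (line yhat0 x0) (line yhat1 x1), xline(0)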
Optimal bandwidths

A rectangular kernel would give the same result as taking E [Y ]


at a given bin on X whereas the triangular kernel gives more
importance to the observations closer to the center.
While estimating this in a given window of width h around the
cutoff is straightforward, it’s more difficult to choose this
bandwidth (or window), and the method is sensitive to the
choice of bandwidth.
Bandwidths

Several methods for choosing the optimal bandwidth


(window), but it’s always a trade off between bias and variance
In practical applications, you want to check for balance around
that window
Standard error of the treatment effects can be bootstrapped
but there are also other alternatives
You could add other variables to nonparametric methods.
Bandwidths

Imbens and Kalyanaraman (2012), and more recently Calonico,


et al. (2017), have proposed methods for estimating “optimal”
bandwidths which may differ on either side of the cutoff.
Calonico, et al (2017) propose local-polynomial regression
discontinuity estimators with robust confidence intervals
Stata ado package and R package are both called rdrobust
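A minimal rdrobust sketch, assuming variables y and x and a cutoff at 0 (these options are in fact the command’s defaults):

. * local linear RD with a triangular kernel and an MSE-optimal bandwidth,
. * reported with robust bias-corrected confidence intervals
. rdrobust y x, c(0) p(1) kernel(triangular) bwselect(mserd)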
Implementation

The following paper is a seminal paper in public choice both


scientifically and methodologically – the close election RDD
I call the close election RDD a type of sub-RDD in that it’s
widely used in political science and economics to the point
that it’s taken on a life of its own
Let’s take everything we’ve done and apply it by replicating
this paper using programs I’ve provided
Public choice

There are two fundamentally different views of the role of voters in


a representative democracy.
1 Convergence: Voters force candidates to become relatively
moderate depending on their size in the distribution (Downs
1957).
“Competition for votes can force even the most
partisan Republicans and Democrats to moderate
their policy choices. In the extreme case, competition
may be so strong that it leads to ‘full policy
convergence’: opposing parties are forced to adopt
identical policies” – Lee, Moretti, and Butler 2004.

2 Divergence: Voters pick the official and after taking office,


she pursues her most-preferred policy.
Falsification of either hypothesis had been hard

Very difficult to test either one of these since you don’t observe
the counterfactual votes of the loser for the same district/time
Winners in a district are selected based on their policy’s
conforming to unobserved voter preferences, too
Lee, Moretti and Butler (2004) develop the “close election
RDD” which has the aim of determining whether convergence,
while theoretically appealing, has any explanatory power in
Congress
The metaphor of the RCT is useful here: maybe close elections
are being determined by coin flips (e.g., a few votes here, a
few votes there)
Outcome is Congress person’s liberal voting score

Liberal voting score is a report card from the Americans for


Democratic Action (ADA) for the House election results
1946-1995
Authors use the ADA score for all US House Representatives
from 1946 to 1995 as their voting record index
For each Congress, ADA chooses about twenty high-profile
roll-call votes and creates an index varying between 0 and 100 for
each Representative of the House, measuring liberal voting record
Democratic “voteshare” is the running variable

Voteshare from the same races


The running variable is voteshare which is the share of all
votes that went to a Democrat.
They use a close Democratic victory to check whether
convergence or divergence is correct (what’s smoothness here?)
Discontinuity in the running variable occurs at
voteshare= 0.5. When voteshare> 0.5, the Democratic
candidate wins.
I’ll show lmb1.do to lmb10.do (and R) at times just so we
can all see the simple estimation methods ourselves.
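In that spirit, a sketch of the simplest close-election estimate; the variable names (score, lagdemocrat, lagdemvoteshare, id) are assumptions about the lmb data file, so check them against the provided programs:

. * ADA score at t+1 regressed on a Democratic win at t, close elections only
. reg score lagdemocrat if lagdemvoteshare > .48 & lagdemvoteshare < .52, cluster(id)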
Remember these results
TABLE I. Results based on ADA scores — close elections sample
Column (1), ADA t+1 (total effect, γ): estimated gap 21.2 (1.9)
Column (2), ADA t: estimated gap 47.6 (1.3)
Column (3), DEM t+1: estimated gap 0.48 (0.02)
Column (4), “elect” component, π1 (P^D_{t+1} − P^R_{t+1}) = (col 2) × (col 3): 22.84 (2.2)
Column (5), “affect” component, π0 (P^{∗D}_{t+1} − P^{∗R}_{t+1}) = (col 1) − (col 4): −1.64 (2.0)

Standard errors are in parentheses. The unit of observation is a
district-congressional session. The sample includes only observations
where the Democrat vote share at time t is strictly between 48 percent
and 52 percent. The estimated gap is the difference in the average of
the relevant variable for observations for which the Democrat vote
share at time t is strictly between 50 percent and 52 percent and
observations for which it is strictly between 48 percent and 50
percent. Time t and t + 1 refer to congressional sessions. ADA t is
the adjusted ADA voting score. Higher ADA scores correspond to more
liberal roll-call voting records. Sample size is 915.
Figure: Lee, Moretti, and Butler 2004, Table 1.


Nonparametric estimation

Hahn, Todd and Van der Klaauw (2001) emphasized using


local polynomial regressions
Estimate E [Y |X ] in such a way that doesn’t require
committing to a functional form
That model would be something general like

Y = f (X ) + ε
Nonparametric estimation (cont.)

We’ll do this estimation just rolling E [ADA] across the running


variable voteshare visually
Stata has a user-written command to do this called cmogram, and it
has a lot of useful options, though many people prefer to graph it
themselves because that gives more flexibility.
We can recreate Figures I, IIA and IIB using it
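If you would rather not lean on cmogram’s options, rdplot from the rdrobust package produces a similar binned scatter with separate fourth-order polynomial fits on each side; a sketch with hypothetical variable names:

. * binned means of the ADA score against the Democratic vote share, cutoff at 0.5
. rdplot score demvoteshare, c(0.5) p(4)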
Future liberal voting score
Figure: Lee, Moretti, and Butler 2004, Figure I. Total effect of
initial win on future ADA scores: γ ≈ 20. The figure plots ADA scores
after the election at time t + 1 against the Democrat vote share at
time t. Each circle is the average ADA score within 0.01 intervals of
the Democrat vote share. Solid lines are fitted values from
fourth-order polynomial regressions on either side of the
discontinuity; dotted lines are pointwise 95 percent confidence
intervals. ADA scores are a smooth function of vote shares everywhere
except at the threshold that determines party membership, where there
is a large discontinuous jump (annotated in the figure with its
“elect” and “affect” components).
Contemporaneous liberal voting score
Figure: Lee, Moretti, and Butler 2004, Figure IIa. Effect of party
affiliation: π1 ≈ 45. The panel plots ADA scores after the election at
time t against the Democrat vote share at time t.

Incumbency advantage

Figure: Lee, Moretti, and Butler 2004, Figure IIb. Effect of initial
win on winning the next election: (P^D_{t+1} − P^R_{t+1}) ≈ 0.50. The
panel plots the probability of Democrat victory at t + 1 against the
Democrat vote share at time t.
Concluding remarks

Caughey and Sekhon (2011) questioned the finding (not the


design per se) saying that bare winners and bare losers in the
US House elections differed considerably on pretreatment
covariates (imbalance), which got worse in the closest elections
Eggers, et al. (2014) evaluated 40,000 close elections
including the House in other time periods, mayor races, and
other types of US races including nine other countries
They couldn’t find another instance where Caughey and
Sekhon’s critique applied
Assumptions behind close election design therefore probably
holds and is one of the best RD designs we have
Fuzzy RDD, IV and ITT

Fuzzy RDD is an IV estimator, and requires those assumptions


You may be more comfortable with presenting the
intent-to-treat (ITT) parameter, which is just the reduced form
regression of Y on Z
Many papers will not present an IV-style parameter, but rather
a blizzard of ITT parameters, out of a “fear” that the exclusion
restrictions may not hold
But let’s review the IV approach anyway for completeness
(more IV to come!)



Probability of treatment jumps at discontinuity

Probabilistic treatment assignment (i.e. “fuzzy RDD”)


The probability of receiving treatment changes discontinuously at
the cutoff, c0 , but need not go from 0 to 1

limXi →c0 Pr (Di = 1|Xi = c0 ) ≠ limc0 ←Xi Pr (Di = 1|Xi = c0 )

Examples: Incentives to participate in some program may change


discontinuously at the cutoff but are not powerful enough to move
everyone from non participation to participation.
Deterministic (sharp) vs. probabilistic (fuzzy)

In the sharp RDD, Di was determined by Xi ≥ c0


In the fuzzy RDD, the conditional probability of treatment
jumps at c0 .
The relationship between the conditional probability of
treatment and Xi can be written as:

P[Di = 1|Xi ] = g0 (Xi ) + [g1 (Xi ) − g0 (Xi )]Zi

where Zi = 1 if (Xi ≥ c0 ) and 0 otherwise.


Visualization of identification strategy (i.e. smoothness)

E [Y 0 |X ] and E [Y 1 |X ] for D = 0, 1 are the dashed/solid


continuous functions
E [Y |X ] is the solid which jumps at X = 6
Hoekstra flagship school
Figure: Fraction enrolled at the flagship state university, plotted
as local averages against the admissions running variable centered at
the cutoff. From Hoekstra, Figure 1.
Instrumental variables

As said, fuzzy designs are numerically equivalent and


conceptually similar to IV
“Reduced form” Numerator: “jump” in the regression of the
outcome on the running variable, X .
“First stage” Denominator: “jump” in the regression of the
treatment indicator on the running variable X .
Same IV assumptions, caveats about compliers vs. defiers, and
statistical tests that we will discuss in next lecture with
instrumental variables apply here – e.g., check for weak
instruments using F test on instrument in first stage, etc.
Wald estimator

Wald estimator of treatment effect under Fuzzy RDD


Average causal effect of the treatment is the Wald IV parameter

δFuzzy RDD = [limX→c0 E[Y|X = c0] − limc0←X E[Y|X = c0]] / [limX→c0 E[D|X = c0] − limc0←X E[D|X = c0]]
RDD’s Relationship to IV

Center X so it’s equal to zero at c0 and define Z = 1(X ≥ 0)


The coefficient on Z in a regression like

. reg Y Z X X2 X3

is the reduced form discontinuity, and

. reg D Z X X2 X3

is the first stage discontinuity


Ratio of discontinuities is estimate of δFuzzy RDD
Simple way to implement is IV

. ivregress 2sls Y (D=Z) X X2 X3
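To make that recipe concrete, a sketch of building the pieces first (the raw running variable name runvar and the 0.5 cutoff are hypothetical):

. * center the running variable, build the polynomial, and define the cutoff dummy
. gen X = runvar - 0.5
. gen X2 = X^2
. gen X3 = X^3
. gen Z = (X >= 0)
. ivregress 2sls Y (D = Z) X X2 X3, robust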


First stage relationship between X and D

One can use both Zi as well as the interaction terms as


instruments for Di .
If one uses only Zi as the IV, then it is a “just identified”
model, which usually has good finite sample properties.
In the just identified case, the first stage would be:

Di = γ0 + γ1 Xi + γ2 Xi2 + · · · + γp Xip + πZi + ε1i

where π is the causal effect of Z on the conditional probability


of treatment.
The fuzzy RD reduced form is:

Yi = µ + κ1 Xi + κ2 Xi2 + · · · + κp Xip + ρπZi + ε2i


Fuzzy RDD with varying Treatment Effects - Second Stage

As in the sharp RDD case one can allow the smooth function
to be different on both sides of the discontinuity.
The second stage model with interaction terms would be the
same as before:

Yi = α + β01 x̃i + β02 x̃i2 + · · · + β0p x̃ip


+ρDi + β1∗ Di x̃i + β2∗ Di x̃i2 + · · · + βp∗ Di x̃ip + ηi

Where x̃ are now not only normalized with respect to c0 but


are also fitted values obtained from the first stage regression.
Fuzzy RDD with Varying Treatment Effects - First Stages

Again one can use both Zi as well as the interaction terms as


instruments for Di
Only using Z the estimated first stages would be:

Di = γ00 + γ01 X̃i + γ02 X̃i² + · · · + γ0p X̃i^p
     + πZi + γ1∗ X̃i Zi + γ2∗ X̃i² Zi + · · · + γp∗ X̃i^p Zi + ε1i

We would also construct analogous first stages for X̃i Di ,


X̃i2 Di , . . . , X̃ip Di .
Limitations of the LATE

Fuzzy RDD has assumptions of all standard IV framework


(exclusion, independence, nonzero first stage, and
monotonicity)
As with other binary IVs, the fuzzy RDD is estimating LATE:
the local average treatment effect for the group of compliers
In RDD, the compliers are those whose treatment status
changed as we moved the value of xi from just to the left of c0
to just to the right of c0
Means we can use Medicare age cutoff to estimate the effect
of public insurance on mortality (LATE) and still not know the
effect of public insurance on mortality (ATE)
Instrumental variables

If treatment is tied to an unobservable, then conditioning


strategies, even RDD, are invalid
Instrumental variables offers some hope at recovering the
causal effect of D on Y
The best instruments come from deep knowledge of
institutional details (Angrist and Krueger 1991)
Certain types of natural experiments can be the source of such
opportunities and may be useful



When is IV used?

Instrumental variables methods are typically used to address the


following kinds of problems encountered in naive regressions
1 Omitted variable bias
2 Measurement error
3 Simultaneity bias
4 Reverse causality
5 Randomized control trials with noncompliance
Selection on unobservables

D Y

Then D is endogenous due to backdoor path D ← U → Y and


causal effect D → Y is not identified using the backdoor criterion.
Instruments

Z U

D Y

Notice how the path from Z → D ← U → Y is blocked by a


collider.
Philip Wright

Philip Wright was a renaissance man - published in JASA,


QJE, AER, you name it, while on a very intense teaching load.
Also published poetry, and even personally published Carl
Sandburg’s first book of poetry!
Spent a long time at Tufts
He was very concerned about the negative effects of tariffs and
wrote a book about commodity markets
Elasticity of demand is unidentified

James Stock notes that his publications had a theme regarding


identification
He knew, for instance, that he couldn’t simply look at
correlations between price and quantity if he wanted the
elasticity of demand due to simultaneous shifts in supply and
demand
The pairs of quantity and price weren’t demand, or supply -
they were demand and supply equilibrium values and therefore
didn’t reflect the demand or the supply curve, both of which
are counterfactuals
Those points are nothing more than a bunch of numbers – no
more, no less – that have no practical use, scientific or
otherwise
Exhibit 1
The Graphical Demonstration of the Identification Problem in Appendix B (p. 296)

Figure: Wright’s graphical demonstration of the identification
problem: “Price-output data fail to reveal either supply or demand
curve.”


Sewell Wright

Sewell was his son, who did not go into the family business
Rather, he decided to become a genius and invent genetics
Developed path diagrams (which Pearl revived 50 years later
for causal inference)
Father and son engage in letter correspondence as Philip tried
to solve the “identification problem”
Figure: Wright’s letter to Sewell, his son
Figure: Recognize these?
QJE Rejects

QJE misses a chance to make history and rejects his paper


proving an IV estimator
Sticks his proof in Appendix B of 1928 book,
The Tariff on Animal and Vegetable Oils
His work on IV is ignored, and is then rediscovered 15 years
later (e.g., Olav Reiersøl).
James Stock and others have helped correct the record
Sidebar: stylometric analysis

Long standing question was who wrote Appendix B? Answer


according to Stock and Trebbi (2003) using stylometric
methods is that Philip wrote it.
But who invented it? It was collaborative, but Sewell
acknowledged he didn’t know how to handle endogeneity and
simultaneity (that was Philip)
Constant treatment effects

Constant treatment effects (i.e., β is constant across all


individual units)
Constant treatment effects is the traditional econometric
pedagogy when first learning instrumental variables, and
doesn’t need the potential outcomes model or notation to get
the point across
Constant treatment effects is identical to assuming that
ATE=ATT=ATU because constant treatment effects assumes
βi = β−i = β for all units
Heterogenous treatment effects

Heterogeneous treatment effects (i.e., βi varies across


individual units)
Heterogeneous treatment effects means that the
ATE 6= ATT 6= ATU because βi differs across the population
This is equivalent to assuming the coefficient, βi , is a random
variable that varies across the population
Heterogenous treatment effects is based on work by Angrist,
Imbens and Rubin (1996) and Imbens and Angrist (1994)
which introduced the “local average treatment effect” (LATE)
concept
Data requirements

Your data isn’t going to come with a codebook saying


“instrumental variable”. So how do you find it?
Well, sometimes the researcher just knows.
That is, the researcher knows of a variable (Z ) that actually is
randomly assigned and that affects the endogenous variable
but not the outcome (except via the endogenous variable)
Such a variable is called an “instrument”.
Picking a good instrument

The best instruments are ones you think of first, and then you seek
the data second (but often students go in the reverse order, which is
basically guaranteed to produce a crappy instrument)
If you want to use IV, then ask:
What moves around the covariate of interest that
might be plausibly random?

Is there any element in the treatment that could be construed


as random?
If you were to find that random piece, then you have found an
instrument
Once you have identified such a variable, begin to think about
what data sets might have information on an outcome of
interest, the treatment, and the instrument you have put your
finger on.
Does family size reduce labor supply or is it selection?

Angrist and Evans (1998), “Children and their parents’ labor


supply” American Economic Review,
They want to know the effect of family size on labor supply,
but need exogenous changes in family size
So what if I told you if the first two children born were of the
same gender, then you’re less likely to work. What?!
Angrist and Evans cont.

Many parents have a preference for having at least one child of


each gender
Consider a couple whose first two kids were both boys; they
will often have a third, hoping to have a girl
Consider a couple whose first two kids were girls; they will
often have a third, hoping for a boy
Consider a couple with one boy and one girl; they will often
not have a third kid
The gender of your kids is arguably randomly assigned (maybe
not exactly, but close enough)
Good instruments must be a bit strange

On its face, it’s puzzling that the first two kids’ gender
predicts labor market participation
Instrumental variables strategies formalize the strangeness of the
instrument – the puzzled reaction of an intelligent layperson with
no particular knowledge of the phenomena or background in
statistics.
You need more information, in other words, otherwise the
layperson can’t understand what same gender of your children
has to do with working
When a good IV strategy finally makes sense

But then the researchers point out that women whose first two
children are of the same gender are more likely to have
additional children than women whose first two children are of
different genders
The layperson then asks himself, “Hm. I wonder if the labor
market differences are due solely to the differences in the
number of kids the woman has...”
Sunday Candy is a good instrument

Let’s listen to a few lines from “Ultralight Beam” by Kanye


West. Chance the Rapper sings on it and says
“I made Sunday Candy, I’m never going to hell
I met Kanye West, I’m never going to fail.”
- Chance the Rapper

What does making a song have to do with hell? What does


meeting Kanye West have to do with success? Let’s consider
each in order
What are we missing?

“I made Sunday Candy,


I’m never going to hell”,

There must be more to this story, right?


So what if it’s something like this

“I made Sunday Candy


this pastor invited me to church on Sunday,
I’m never going to hell”
Sunday Candy DAG

Sunday Candy U

Church Hell
Kanye West is a bad instrument

Chance long idolized and was inspired by Kanye West – both


Chicago, both very creative hip hop artists
Kanye West is not a good instrument for Chance’s inspiration,
though, because Kanye West can singlehandedly make a
person’s career
Kanye is not strange enough
Kanye West DAG

Kanye West U

Inspiration Success
Foreshadowing the questions you need to be asking

1 Is our instrument highly correlated with the treatment? With


the outcome? Can you test that?
2 Are there random elements within the treatment? Why do you
think that?
3 Is the instrument exogenous? Why do you think that?
4 Could the instrument affect outcomes directly? Why do you
think that?
5 Could the instrument be associated with anything that causes
the outcome even if it doesn’t directly? Why do you think
that?
Our causal model: Returns to schooling again

Y = α + δS + γA + ν

where Y is log earnings, S is years of schooling, A is unobserved


ability, and ν is the error term
Suppose there exists a variable, Zi , that is correlated with Si .
We can estimate δ with this variable, Z :
How can IV be used to obtain consistent estimates?

Cov (Y , Z ) = Cov (α + δS + γA + ν, Z )
= E [(α + δS + γA + ν)Z ] − E [α + δS + γA + ν]E [Z ]
= {αE (Z ) − αE (Z )} + δ{E (SZ ) − E (S)E (Z )}
+γ{E (AZ ) − E (A)E (Z )} + E (νZ ) − E (ν)E (Z )
Cov (Y , Z ) = δCov (S, Z ) + γCov (A, Z ) + Cov (ν, Z )

Divide both sides by Cov (S, Z ): the LHS becomes Cov (Y , Z )/Cov (S, Z )
– the ratio of the reduced form to the first stage – and the RHS
becomes δ plus two other scaled terms.
Consistency

What conditions must hold for a valid IV design?


Cov (S, Z ) 6= 0 – “first stage” exists. S and Z are correlated
Cov (A, Z ) = Cov (ν, Z ) = 0 – “exclusion restriction”. This
means Z is orthogonal both to unobserved ability, A, and to the
structural disturbance term, ν
Assuming the first stage exists and that the exclusion
restriction holds, then we can estimate δ with δIV :

Cov (Y , Z )
δIV =
Cov (S, Z )
= δ
IV is Consistent if IV Assumptions are Satisfied

The IV estimator is consistent if the IV assumptions are


satisfied. Substitute true model for Y :
δIV = Cov(α + δS + γA + ν, Z) / Cov(S, Z)
    = δ Cov(S, Z)/Cov(S, Z) + γ Cov(A, Z)/Cov(S, Z) + Cov(ν, Z)/Cov(S, Z)
    = δ + Cov(η, Z)/Cov(S, Z), where η = γA + ν
Identifying assumptions and consistency

Taking the probability limit which is an asymptotic operation


to show consistency:

plim δ̂IV = plim [ δ + Cov(η, Z)/Cov(S, Z) ]
         = δ

because Cov ([A], Z ) = 0 and Cov ([ν], Z ) = 0 due to the


exclusion restriction, and Cov (S, Z ) ≠ 0 (due to the first
stage)
IV Assumptions

But, if Z is not independent of η (either correlated with A or


ν), and if the correlation between S and Z is “weak”, then the
second term blows up.
We will explore the problems created by weak instruments in
just a moment.
First, let’s look at a DAG summarizing all this information
One of these DAGs is not like the other

Z A

S Y
(a)

Z A

S Y
(b)

Notice - the top DAG, a, satisfies both exclusion and relevance


(i.e., non-zero first stage), but the bottom DAG, b, satisfies
relevance but not exclusion.
Two-stage least squares

The two-stage least squares estimator was developed by Theil


(1953) and Basmann (1957) independently
Note, while IV is a research design, 2SLS is a specific
estimator.
Others include LIML, the Wald estimator, jackknife IV, two
sample IV, and more



Two Sample IV

In a pinch, you can even get by with two different data sets
1 Dataset 1 needs information on the outcome and the
instrument
2 Dataset 2 needs information on the treatment and the
instrument.
This is known as “Two sample IV” because there are two
samples involved, rather than the traditional one sample.
Once we define what IV is measuring carefully, you will see
why this works.
Two-stage least squares concepts

Causal model. Sometimes called the structural model:

Yi = α + δSi + ηi

First-stage regression. Gets the name because of two-stage


least squares:
Si = γ + ρZi + ζi
Second-stage regression. Notice the fitted values, Ŝ:

Yi = β + δŜi + νi
Reduced form

Some people like a simpler approach because they don’t want


to defend IV’s assumptions
Reduced form is a regression of Y onto the instrument:

Yi = ψ + πZi + εi

This would be like regressing hell onto Sunday Candy, as


opposed to regressing hell onto church with Sunday Candy
instrumenting for church
Two-stage least squares

Suppose you have a sample of data on Y , X , and Z . For each


observation i we assume the data are generated according to

Yi = α + δSi + ηi
Si = γ + ρZi + ζi

where Cov (Z , ηi ) = 0 and ρ ≠ 0.


Two-stage least squares

Plug in covariance and write out the following:

δ̂2sls = Cov(Z, Y) / Cov(Z, S)
      = [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)(Yi − Ȳ) ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)(Si − S̄) ]
      = [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Yi ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Si ]
Two-stage least squares

Substitute the causal model definition of Y to get:


δ̂2sls = [ (1/n) ∑_{i=1}^{n} (Zi − Z̄){α + δSi + ηi} ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Si ]
      = δ + [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)ηi ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Si ]
      = δ + “small if n is large”

Where did the first term go? Why did the second term become δ?
Two-stage least squares

Calculate the ratio of “reduced form” (π) to “first stage”


coefficient (ρ):
δ̂2sls = Cov(Z, Y)/Cov(Z, S) = [Cov(Z, Y)/Var(Z)] / [Cov(Z, S)/Var(Z)] = π̂/ρ̂

Rewrite ρb as

ρ̂ = Cov(Z, S)/Var(Z)
ρ̂ Var(Z) = Cov(Z, S)
Two-stage least squares

Then rewrite δb2sls

δ̂2sls = Cov(Z, Y)/Cov(Z, S) = ρ̂ Cov(Z, Y)/[ρ̂ Cov(Z, S)] = ρ̂ Cov(Z, Y)/[ρ̂² Var(Z)]
      = Cov(ρ̂Z, Y)/Var(ρ̂Z)
Two-stage least squares

Recall
Si = γ + ρZi + ζi
Then
Ŝ = γ̂ + ρ̂Z
Then

δ̂2sls = Cov(ρ̂Z, Y)/Var(ρ̂Z) = Cov(Ŝ, Y)/Var(Ŝ)

Proof.
We will show that ρ̂Cov(Y, Z) = Cov(Ŝ, Y). I will leave it to you
to show that Var(ρ̂Z) = Var(Ŝ).

Cov(Ŝ, Y) = E[ŜY] − E[Ŝ]E[Y]
          = E(Y[γ̂ + ρ̂Z]) − E(Y)E(γ̂ + ρ̂Z)
          = γ̂E(Y) + ρ̂E(YZ) − γ̂E(Y) − ρ̂E(Y)E(Z)
          = ρ̂[E(YZ) − E(Y)E(Z)]
Cov(Ŝ, Y) = ρ̂Cov(Y, Z)
Intuition of 2SLS

Two stage least squares is nice because in addition to being an


estimator, there’s also great intuition contained in it which you
can use as a device for thinking about IV more generally.
The intuition is that the 2SLS estimator replaces S with the fitted
values of S (i.e., Ŝ) from the first stage regression of S onto Z
and all other covariates.
By using the fitted values of the endogenous regressor from
the first stage regression, our regression now uses only the
exogenous variation in the regressor due to the instrumental
variable itself
Intuition of IV in 2SLS

. . . but think about it – that variation was there before, but


was just a subset of all the variation in the regressor
Go back to what we said in the beginning - we need the
endogenous variable to have pieces that are random, and IV
finds them.
Instrumental variables therefore reduces the variation in the
data, but that variation which is left is exogenous
“With a long enough [instrument], you can [estimate any
causal effect]” - Scott Cunningham paraphrasing Archimedes
Estimation with software

One manual way is just to estimate the reduced form and first
stage coefficients and take the ratio of the respective
coefficients on Z
But while it is always a good idea to run these two regressions,
don’t compute your IV estimate this way
Estimation with software

It is often the case that a pattern of missing data will differ


between Y and S
In such a case, the usual procedure of “casewise deletion” is to
keep the subsample with non-missing data on Y , S, and Z .
But the reduced form and first stage regressions would be
estimated off of different sub-samples if you used the two step
method before
The standard errors from the second stage regression are also
wrong
Estimation with software

Estimate this in Stata using -ivregress 2sls-.


Estimate this in R using -ivreg()-, which is in the AER package
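A minimal sketch; the variable names echo the table shown later in this deck (lwage, educ, the controls, and a college-in-the-county instrument here called nearc4), but they are assumptions about any particular data file:

. ivregress 2sls lwage exper black south married smsa (educ = nearc4), robust
. * first-stage diagnostics: F statistic and the minimum-eigenvalue (Cragg-Donald) statistic
. estat firststage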
Weak instruments

A weak instrument is one that is not strongly correlated with


the endogenous variable in the first stage
This can happen if the two variables are independent or the
sample is small
If you have a weak instrument, then the bias of 2SLS is
centered on the bias of OLS and the cure ends up being worse
than the disease
We knew this was a problem, but it was brought into sharp
focus with Angrist and Krueger (1991) and some papers that
followed



Angrist and Krueger (1991)

In practice, it is often difficult to find convincing instruments –


usually because potential instruments don’t satisfy the
exclusion restriction
But in an early paper in the causal inference movement,
Angrist and Krueger (1991) wrote a very interesting and
influential instrumental variables study
They were interested in schooling’s effect on earnings and
instrumented for it with which quarter of the year you were
born
Remember Chance quote - what the heck would birth quarter
have to do with earnings such that it was an excludable
instrument?
Compulsory schooling

In the US, you could drop out of school once you turned 16
“School districts typically require a student to have turned age
six by January 1 of the year in which he or she enters school”
(Angrist and Krueger 1991, p. 980)
Children have different ages when they start school, though,
and this creates different lengths of schooling at the time they
turn 16 (potential drop out age):
Timeline sketch (two rows): born in December → turn 6 → start school
→ accumulate schooling S → turn 16; born in January → turn 6 → start
school → turn 16, with less schooling accumulated by then.

If you’re born in the fourth quarter, you hit 16 with more schooling
than those born in the first quarter
Visuals

You need good data visualization for IV partly because of the


scrutiny around the design
The two pieces you should be ready to build pictures for are
the first stage and the reduced form
Angrist and Krueger (1991) provide simple, classic and
compelling pictures of both
First Stage

Figure: Men born earlier in the year have lower schooling. This
indicates that there is a first stage. Notice all the 3s and 4s at
the top, but then notice how it attenuates over time . . .
Reduced Form

Figure: Do differences in schooling due to different quarter of birth
translate into different earnings?
Two Stage Least Squares model

The causal model is


Yi = δSi + ε
The first stage regression is:

Si = X π10 + π11 Zi + η1i

The reduced form regression is:

Yi = X π20 + π21 Zi + η2i

The covariate-adjusted IV estimator is the sample analog of
the ratio π21 /π11
Two Stage Least Squares

Angrist and Krueger instrument for schooling using three


quarter of birth dummies: dummies for the 2nd, 3rd and 4th quarters of birth
Their estimated first-stage regression is:

Si = X π10 + Z1i π11 + Z2i π12 + Z3i π13 + η1

The second stage is the same as before, but the fitted values
are from the new first stage
First stage regression results
Figure: First-stage regressions in Angrist & Krueger (1991). Quarter
of birth is a strong predictor of total years of education.

First stage regression results: Placebos

IV Results

Figure: IV estimates, birth cohorts 20-29, 1980 Census.
Sidebar: Wald estimator

Recall that 2SLS uses the predicted values from a first stage
regression – but we showed that the 2SLS method was
equivalent to Cov(Y, Z)/Cov(X, Z)
The Wald estimator simply calculates the return to education
as the ratio of the difference in earnings by quarter of birth to
the difference in years of education by quarter of birth – it’s a
version of the above
Formally, IVWald = [E(Y|Z = 1) − E(Y|Z = 0)] / [E(D|Z = 1) − E(D|Z = 0)]
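A sketch of computing the Wald estimator by hand with a binary instrument, assuming variables named y, d and z:

. * numerator: difference in mean outcomes by instrument status
. summarize y if z == 1
. scalar y1 = r(mean)
. summarize y if z == 0
. scalar y0 = r(mean)
. * denominator: difference in mean treatment by instrument status
. summarize d if z == 1
. scalar d1 = r(mean)
. summarize d if z == 0
. scalar d0 = r(mean)
. display (y1 - y0) / (d1 - d0)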
Mechanism

In addition to log weekly wage, they examined the impact of


compulsory schooling on log annual salary and weeks worked
The main impact of compulsory schooling is on the log weekly
wage – not on weeks worked
More instruments
Problem enters with many quarter of birth interactions

They want to increase the precision of their 2SLS estimates, so


they load up their first stage with more instruments
Specifications with 30 (quarter of birth × year) dummy
variables and 150 (quarter of birth × state) instruments
What’s the intuition here? The effect of quarter of birth may
vary by birth year or by state
It reduced the standard errors, but that comes at a cost of
potentially having a weak instruments problem
More instruments
More instruments
Weak Instruments

For a long time, applied empiricists were not attentive to the


small sample bias of IV
But in the early 1990s, a number of papers highlighted that IV
can be severely biased – in particular, when instruments have
only a weak correlation with the endogenous variable of
interest and when many instruments are used to instrument for
one endogenous variable (i.e., there are many overidentifying
restrictions).
In the worst case, if the instruments are so weak that there is
no first stage, then the 2SLS sampling distribution is centered
on the probability limit of OLS
Causal model

Let’s consider a model with a single endogenous regressor and


a simple constant treatment effect (i.e., “just identified”)
The causal model of interest is:

Y = βX + ν
Matrices and instruments

We’ll sadly need some matrix notation, but I’ll try to make it
painless.
The matrix of instrumental variables is Z with the first stage
equation:
X = Zπ + η
And let PZ be the projection matrix that produces fitted values from
the population regression of X on Z:

PZ = Z(Z′Z)⁻¹Z′
Weak instruments and bias towards OLS

If νi and ηi are correlated, estimating the first equation by


OLS would lead to biased results, wherein the OLS bias is:
E[β̂OLS − β] = Cov(ν, X)/Var(X)

If νi and ηi are correlated, the OLS bias is therefore σνη/ση²
Deriving the bias of 2SLS

β̂2sls = (X′PZ X)⁻¹ X′PZ Y
      = β + (X′PZ X)⁻¹ X′PZ ν

substitution of Y = βX + ν
2SLS bias

β̂2sls − β = (X′PZ X)⁻¹ X′PZ ν
          = a X′PZ ν, where a = (X′PZ X)⁻¹
          = a [π′Z′ + η′] PZ ν
          = a π′Z′ν + a η′PZ ν
          = (X′PZ X)⁻¹ π′Z′ν + (X′PZ X)⁻¹ η′PZ ν

The bias of 2SLS comes from the non-zero expectation of terms on


the right-hand-side even though Z and ν are not correlated.
Taking expectations

Angrist and Pischke (ch. 4) note that taking expectations of


that prior expression is hard because the expectation operator
won’t pass through (X′PZ X)⁻¹.
However, the expectation of the ratios in the second term can
be closely approximated

β̂2sls − β = (X′PZ X)⁻¹ π′Z′ν + (X′PZ X)⁻¹ η′PZ ν

E[β̂2sls − β] ≈ E[X′PZ X]⁻¹ E[π′Z′ν] + E[X′PZ X]⁻¹ E[η′PZ ν]
Approximate bias of 2SLS

We know E[π′Z′ν] = 0 and E[π′Z′η] = 0. So let E[η′PZ ν] = b,
because this is hard for me otherwise

E[β̂2sls − β] ≈ E[X′PZ X]⁻¹ b
            ≈ E[X′Z(Z′Z)⁻¹Z′X]⁻¹ b
            ≈ E[(Zπ + η)′PZ (Zπ + η)]⁻¹ b
            ≈ [E(π′Z′Zπ) + E(η′PZ η)]⁻¹ b
            ≈ [E(π′Z′Zπ) + E(η′PZ η)]⁻¹ E[η′PZ ν]

That last term is what creates the bias so long as η and ν are
correlated – and it is precisely because they are that you picked up
2SLS to begin with
First stage F

With some algebra and manipulation, Angrist and Pischke show that
the bias of 2SLS is approximately

E[β̂2sls − β] ≈ (σνη/ση²) [E(π′Z′Zπ)/(Q ση²) + 1]⁻¹

where Q is the number of instruments and the first term inside the
brackets is the population F-statistic for the joint significance of
all the instruments in the first stage
Weak instruments and bias towards OLS

Substituting F for that big term, we can derive the


approximate bias of 2SLS as:
E[β̂2SLS − β] ≈ (σνη/ση²) · 1/(F + 1)

Consider the intuition all that work bought us now: if the first
stage is weak (i.e., F → 0), then the bias of 2SLS approaches
σνη/ση²
Weak instruments and bias towards OLS

This is the same as the OLS bias: for π = 0 in the second equation
on the earlier slide (i.e., there is no first stage relationship),
σx² = ση², and therefore the OLS bias σνη/σx² becomes σνη/ση².
But if the first stage is very strong (F → ∞) then the 2SLS
bias is approaching 0.
Cool thing is – you can test this with an F test on the joint
significance of Z in the first stage
It’s absolutely critical therefore that you choose instruments
that are strongly correlated with the endogenous regressor,
otherwise the cure is worse than the disease
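A small simulation sketch (not from the lecture) of that intuition: the true effect is 1, the structural and first-stage errors are correlated, and 2SLS with a nearly useless instrument lands near the biased OLS estimate while a strong instrument recovers the truth.

. clear
. set seed 1
. set obs 1000
. * correlated errors: nu (structural) and eta (first stage)
. matrix C = (1, .8 \ .8, 1)
. drawnorm nu eta, corr(C)
. gen z = rnormal()
. gen x_weak = .02*z + eta
. gen x_strong = z + eta
. gen y_weak = x_weak + nu
. gen y_strong = x_strong + nu
. reg y_weak x_weak
. ivregress 2sls y_weak (x_weak = z)
. ivregress 2sls y_strong (x_strong = z)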
Weak Instruments - Adding More Instruments

Adding more weak instruments will increase the bias of 2SLS


By adding further instruments without predictive power, the
first stage F -statistic goes toward zero and the bias increases
We will see this more closely when we cover judge fixed effects
If the model is “just identified” – meaning the same number of
instrumental variables as there are endogenous covariates –
weak instrument bias is less of a problem
Weak instrument problem

After Angrist and Krueger study, there were new papers


highlighting issues related to weak instruments and finite
sample bias
Key papers are Nelson and Startz (1990), Buse (1992), Bekker
(1994) and especially Bound, Jaeger and Baker (1995)
Bound, Jaeger and Baker (1995) highlighted this problem for
the Angrist and Krueger study.
Bound, Jaeger and Baker (1995)

Remember, AK present findings from expanding their instruments


to include many interactions
1 Quarter of birth dummies → 3 instruments
2 Quarter of birth dummies + (quarter of birth) × (year of birth)
+ (quarter of birth) × (state of birth) → 180 instruments
So if any of these are weak, then the approximate bias of 2SLS gets
worse
Adding instruments in Angrist and Krueger

Figure: Table from Bound, Jaeger, and Baker (1995) – 3 and 30 IVs.
Adding more weak instruments reduced the first-stage F-statistic and
increased the bias of 2SLS. Notice the coefficient also moved closer
to OLS.

Adding instruments in Angrist and Krueger

Figure: Table from Bound, Jaeger, and Baker (1995) – 180 IVs. More
instruments increase precision, but they drive down F, so we know the
problem has gotten worse.
Guidance on working around weak instruments

Use a just identified model with your strongest IV


Use a limited information maximum likelihood estimator
(LIML) as it is approximately median unbiased for over
identified constant effects models and provides the same
asymptotic distribution as 2SLS (under constant effects) with
a finite-sample bias reduction.
Find stronger instruments – easier said than done
Look at the reduced form

1 Look at the reduced form


The reduced form is estimated with OLS and is therefore
unbiased
If you can’t see the causal relationship of interest in the
reduced form, it is probably not there



Report the first stage

2 Report the first stage (preferably in the same table as your


main results)
Does it make sense?
Do the coefficients have the right magnitude and sign?
Please make beautiful IV tables – you’ll be celebrated across
the land if you do
Report F statistic and OLS

3 Report the F -statistic on the excluded instrument(s).


Stock, Wright and Yogo (2002) suggest that F -statistics > 10
indicate that you do not have a weak instrument problem –
this is not a proof, but more like a rule of thumb
If you have more than one endogenous regressor for which you
want to instrument, reporting the first stage F -statistic is not
enough (because 1 instrument could affect both endogenous
variables and the other could have no effect – the model would
be under identified). In that case, you want to report the
Cragg-Donald (minimum eigenvalue) statistic.
4 Report OLS – you said it was biased, but we want to still see it
Table: OLS and 2SLS regressions of Log Earnings on Schooling

Dependent variable Log wage


OLS 2SLS
educ 0.071*** 0.124**
(0.003) (0.050)
exper 0.034*** 0.056***
(0.002) (0.020)
black -0.166*** -0.116**
(0.018) (0.051)
south -0.132*** -0.113***
(0.015) (0.023)
married -0.036*** -0.032***
(0.003) (0.005)
smsa 0.176*** 0.148***
(0.015) (0.031)

First Stage Instrument


College in the county 0.327***
Robust standard error 0.082
F statistic for IV in first stage 15.767
N 3,003 3,003
Mean Dependent Variable 6.262 6.262
Std. Dev. Dependent Variable 0.444 0.444
Standard errors in parenthesis. * p<0.10, ** p<0.05, *** p<0.01
Practical Tips for IV Papers

5 If you have many IVs, pick your best instrument and report the
just identified model (weak instrument problem is much less
problematic)
6 Check over identified 2SLS models with LIML
Make beautiful pictures of first stage and reduced form

7 This cannot be overstated: you must present your main results


in beautiful pictures
Show pictures of the first stage. Convince the reader
something is there. The eyeball is underrated
You can’t show a second stage with raw data, so instead show
pictures of the reduced form.
Visualizing the instrument: supply shocks on meth prices
Visualizing the first stage
Visualizing the reduced form
Heterogenous Treatment Effects

Up to this point, we only considered models where the causal


effect was the same for all individuals
Constant treatment effects (where Yi1 − Yi0 = δ for all i units)
Let’s now try to understand what instrumental variables
estimation is measuring if treatment effects are heterogenous
Yi1 − Yi0 = δi which varies across the population



Why do we care about heterogeneity?

Heterogeneity, it turns out, makes life interesting and


challenging
There are two issues here:
1 We care about internal validity: Does the design successfully
uncover causal effects for the population that we are studying?
2 We care about external validity: Does the study’s results
inform us about different populations?
What parameter did we even estimate using IV when there
were heterogenous treatment effects?
Potential outcome notation

“Potential treatment status” (D j ) versus “observed” treatment


status (D)
Di1 = i’s treatment status when Zi = 1
Di0 = i’s treatment status when Zi = 0
We’ll represent outcomes as a function of both treatment status
and instrument status. In other words, Yi (Di = 0, Zi = 1) is
represented as Yi (0, 1)
Switching equation

Move from potential treatment status to observed treatment status

Di = Di0 + (Di1 − Di0 )Zi


= π0i + π1i Zi + ζi

π0i = E [Di0 ]
π1i = (Di1 − Di0 ) is the heterogenous causal effect of the IV
on Di .
E [π1i ] = The average causal effect of Zi on Di
Identifying assumptions under heterogenous treatment
effects

1 Stable Unit Treatment Value Assumption (SUTVA)


2 Random Assignment
3 Exclusion Restriction
4 Nonzero First Stage
5 Monotonicity
Stable Unit Treatment Value Assumption (SUTVA)

Stable Unit Treatment Value Assumption (SUTVA)


If Zi = Zi′, then Di (Z) = Di (Z′)
If Zi = Zi′ and Di = Di′, then Yi (D, Z) = Yi (D′, Z′)

Potential outcomes for each person i are unrelated to the


treatment status of other individuals.
Example: Your instrument is the death of a CEO for hirings.
But if a CEO dies, then perhaps other companies lose a CEO
as they are hired in the vacant spots.
In which case, the instrument is related to treatment status of
other individuals.
Independence assumption

Independence assumption (e.g., “as good as random assignment”)


{Yi(Di1, 1), Yi(Di0, 0), Di1, Di0} ⊥⊥ Zi

The IV is independent of the vector of potential outcomes and


potential treatment assignments (i.e. “as good as randomly
assigned”)
First two children of the same gender are assigned to families
randomly. That is, same-sex sibling pairs show up among parents
with a higher likelihood of working just as often as among those
less likely to work.
It’s all about the randomness of the instrument, in other
words, not the instrument’s effect.
Independence

Independence means that the first stage measures the causal effect
of Zi on Di :

E [Di |Zi = 1] − E [Di |Zi = 0] = E [Di1 |Zi = 1] − E [Di0 |Zi = 0]


= E [Di1 − Di0 ]
Independence

The independence assumption is sufficient for a causal


interpretation of the reduced form:

E [Yi |Zi = 1] − E [Yi |Zi = 0] = E [Yi (Di1 , 1)|Zi = 1]


−E [Yi (Di0 , 0)|Zi = 0]
= E [Yi (Di1 , 1)] − E [Yi (Di0 , 0)]
Exclusion Restriction

Exclusion Restriction
Y(D,Z) = Y(D,Z’) for all Z, Z’, and for all D

Any effect of Z on Y must be via the effect of Z on D. In


other words, Yi (Di , Zi ) is a function of D only. Or formally:

Yi (Di , 0) = Yi (Di , 1) for D = 0, 1

Sometimes called the “only through” assumption because


you’re assuming the effect of Z on Y is “only through” its
effect on D.
Recall the DAG and the missing arrows from Z to u and from
Z to Y .
Exclusion restriction

Use the exclusion restriction to define potential outcomes


indexed solely against treatment status:

Yi1 = Yi (1, 1) = Yi (1, 0)


Yi0 = Yi (0, 1) = Yi (0, 0)

Rewrite the switching equation:

Yi = Yi (0, Zi ) + [Yi (1, Zi ) − Yi (0, Zi )]Di


Yi = Yi0 + [Yi1 − Yi0 ]Di

Random coefficients notation for this is:

Yi = α0 + δi Di
with α0 = E [Yi0 ] and δi = Yi1 − Yi0
Spotting violations of exclusion is a sport

Watch the gears turn:


We are interested in causal effect of military service on
earnings, and so use draft number are instrument for military
service.
Draft number is generated by a random number generator.
Therefore independence is met as draft number is independent
of potential outcomes and potential treatment status.
But, people with higher draft numbers evade draft by investing
in schooling. Earnings change for reasons other than military
service. Exclusion is violated
In other words, random lottery numbers (independence) do not
imply that the exclusion restriction is satisfied
Strong first stage

Nonzero Average Causal Effect of Z on D


E[Di1 − Di0] ≠ 0

Di1 is the treatment status when the instrument is turned on, and
Di0 when it is turned off. We need treatment status to change when
the instrument changes.
Z has to have some statistically significant effect on the
average probability of treatment
First two children of the same gender makes you more likely to
have a third.
Finally – a testable assumption. We have data on Z and D
Monotonicity

Monotonicity
Either π1i ≥ 0 for all i or π1i ≤ 0 for all i = 1, . . . , N

Recall that π1i is the reduced form causal effect of the


instrumental variable on an individual i’s treatment status.
Monotonicity requires that the instrumental variable (weakly)
operate in the same direction on all individual units.
In other words, while the instrument may have no effect on
some people, all those who are affected are affected in the
same direction (i.e., positively or negatively, but not both).
Monotonicity cont.

We instrument for schooling with quarter of birth. Under
monotonicity, one of two scenarios holds for everyone:
1 they get more schooling or the same schooling if born in the
fourth quarter
2 they get less schooling or the same schooling if born in the
fourth quarter
Monotonicity says either of these can be true, but they cannot
both be true in your data – yet it’s not hard to imagine
violations where two people respond differently
Without monotonicity, IV estimators are not guaranteed to
estimate a weighted average of the underlying causal effects of
the affected group, Yi1 − Yi0 .
Force yourself to think of monotonicity violations

In the quarter of birth example for schooling, this assumption


may not be satisfied (see Barua and Lang 2009).
Being born in the 4th quarter (which typically increases
schooling) may have reduced schooling for some because their
school enrollment was held back by their parents
Local average treatment effect

If all 1-5 assumptions are satisfied, then IV estimates the local


average treatment effect (LATE) of D on Y :

$$\delta_{IV,LATE} = \frac{\text{Effect of } Z \text{ on } Y}{\text{Effect of } Z \text{ on } D}$$
Estimand

Instrumental variables (IV) estimand:

$$\delta_{IV,LATE} = \frac{E[Y_i(D_i^1,1) - Y_i(D_i^0,0)]}{E[D_i^1 - D_i^0]} = E[(Y_i^1 - Y_i^0) \mid D_i^1 - D_i^0 = 1]$$
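Because the estimand is the reduced form divided by the first stage, you can
compute it by hand. A minimal Stata sketch with a hypothetical binary instrument z
and binary treatment d:

    * Reduced form: effect of the instrument on the outcome
    reg y z, robust
    scalar rf = _b[z]

    * First stage: effect of the instrument on treatment
    reg d z, robust
    scalar fs = _b[z]

    * Wald / LATE estimate = reduced form over first stage
    display "Wald estimate of the LATE: " rf/fs

    * One-step equivalent in the just-identified case
    ivregress 2sls y (d = z), robust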
Local Average Treatment Effect

The LATE parameters is the average causal effect of D on Y


for those whose treatment status was changed by the
instrument, Z
For example, IV estimates the average effect of military service
on earnings for the subpopulation who enrolled in military
service because of the draft but would not have served
otherwise.
LATE does not tell us what the causal effect of military service
was for patriots (volunteers) or those who were exempted from
military service for medical reasons
LATE cont.

We have reviewed the properties of IV with heterogenous


treatment effects using a very simple dummy endogenous
variable, dummy IV, and no additional controls example.
The intuition of LATE generalizes to most cases where we
have continuous endogenous variables and instruments, and
additional control variables.
LATE and subpopulations

The instrument partitions any population into 4 distinct groups:


1 Compliers: The subpopulation with Di1 = 1 and Di0 = 0. Their
treatment status is affected by the instrument in the “correct
direction”.
2 Always takers: The subpopulation with Di1 = Di0 = 1. They
always take the treatment independently of Z .
3 Never takers: The subpopulation with Di1 = Di0 = 0. They
never take the treatment independently of Z .
4 Defiers: The subpopulation with Di1 = 0 and Di0 = 1. Their
treatment status is affected by the instrument in the “wrong
direction”.
Subpopulations of soldiers

Examples of subpopulations:
1 Compliers: I only enrolled in the military because I was drafted
otherwise I wouldn’t have served
2 Always takers: My family have always served, so I serve
regardless of whether I am drafted
3 Never takers: I’m a conscientious objector so under no
circumstances will I serve, even if drafted
4 Defiers: When I was drafted, I dodged. But had I not been
drafted, I would have served. I can’t make up my mind.
Never-takers: Di1 − Di0 = 0 and Yi(0,1) − Yi(0,0) = 0.
  By the exclusion restriction, the causal effect of Z on Y is zero.
Compliers: Di1 − Di0 = 1 and Yi(1,1) − Yi(0,0) = Yi(1) − Yi(0).
  This is the average treatment effect among compliers.
Defiers: Di1 − Di0 = −1 and Yi(0,1) − Yi(1,0) = Yi(0) − Yi(1).
  By monotonicity, no one is in this group.
Always-takers: Di1 − Di0 = 0 and Yi(1,1) − Yi(1,0) = 0.
  By the exclusion restriction, the causal effect of Z on Y is zero.
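Under monotonicity (no defiers), the sizes of these groups are identified from data
on Z and D alone: always-takers are the treated share when Z = 0, never-takers the
untreated share when Z = 1, and compliers are the remainder, i.e. the first stage.
A sketch with hypothetical binary variables z and d:

    * Always-takers: take treatment even with the instrument switched off
    sum d if z == 0
    scalar p_always = r(mean)

    * Never-takers: refuse treatment even with the instrument switched on
    sum d if z == 1
    scalar p_never = 1 - r(mean)

    * Compliers: the remainder, which equals the first stage E[D|Z=1] - E[D|Z=0]
    scalar p_comply = 1 - p_always - p_never
    display "always-takers = " p_always "  never-takers = " p_never "  compliers = " p_comply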
Monotonicity Ensures that there are no defiers

Why is it important to not have defiers?


If there were defiers, effects on compliers could be (partly)
canceled out by opposite effects on defiers
One could then observe a reduced form which is close to zero
even though treatment effects are positive for everyone (but
the compliers are pushed in one direction by the instrument
and the defiers in the other direction)
Monotonicity assumes there are no defiers
What Does IV (Not) Estimate?

As said, with all 5 assumptions satisfied, IV estimates the


average treatment effect for compliers, or LATE
Without further assumptions (e.g., constant causal effects),
LATE is not informative about effects on never-takers or
always-takers because the instrument does not affect their
treatment status
So what? Well, it matters because in most applications, we
would be mostly interested in estimating the average
treatment effect on the whole population:

ATE = E [Yi1 − Yi0 ]

But that’s not possible usually with IV


Sensitivity to assumptions: exclusion restriction

Someone at risk of draft (low lottery number) changes


education plans to retain draft deferments and avoid
conscription.
Increased bias to IV estimand through two channels:
Average direct effect of Z on Y for compliers
Average direct effect of Z on Y for noncompliers multiplied by
odds of being a non-complier
Severity depends on:
Odds of noncompliance (smaller → less bias)
“Strength” of instrument (stronger → less bias)
Effect of the alternative channel on Y
Sensitivity to assumptions: Monotonicity violations

Someone who would have volunteered for Army when not at


risk of draft (high lottery number) chooses to avoid military
service when at risk of being drafted (low lottery number)
Bias to IV estimand (multiplication of 2 terms):
Proportion defiers relative to compliers
Difference in average causal effects of D on Y for compliers
and defiers
Severity depends on:
Proportion of defiers (small → less bias)
“Strength” of instrument (stronger → less bias)
Variation in effect of D on Y (less → less bias)
Summarizing

The potential outcomes framework gives a more subtle


interpretation of what IV is measuring
In the constant coefficients world, IV measures δ which is “the”
causal effect of Di on Yi , and assumed to be the same for all i
units
In the random coefficients world, IV measures instead an
average of heterogeneous causal effects across a particular
population – E [δi ] for some group of i units
IV, therefore, measures the local average treatment effect or
LATE parameter, which is the average of causal effects across
the subpopulation of compliers, or those units whose covariate
of interest, Di , is influenced by the instrument.
Summarizing

Under heterogenous treatment effects, Angrist and Evans


(1996) identify the causal effect of the gender composition of
the first two kids on labor supply
This is not the same thing as identifying the causal effect of
children on labor supply; the former is a LATE whereas the
latter might be better described as an ATE
Ex post this is probably obvious, but like many obvious things,
it wasn’t obvious until it was worked out. This was a real
breakthrough (see Angrist, Imbens and Rubin 1996; Imbens
and Angrist 1994)
IV in Randomized Trials

In many randomized trials, participation is nonetheless


voluntary among those randomly assigned to treatment
Consequently, noncompliance is not uncommon, and comparing
outcomes by treatment received rather than by assignment creates
selection bias
IV designs may even be helpful when evaluating a randomized
trial, even though treatment was randomly assigned
The solution is to instrument for treatment with whether you
“won the lottery” and estimate LATE

Cunningham Causal Inference


Lottery designs

The instrument is your randomized lottery


Examples might be randomized lottery for attending charter
schools to study effect of charter schools on educational
outcomes, or a randomized voucher to encourage the
collection of health information
Recall Thornton (2008) instrumented for getting HIV results
to estimate causal effect of learning one was HIV+ on condom
purchases
We’ll discuss two papers from 2012 and 2014 evaluating a
lottery-based expansion of Medicaid health insurance in
Oregon on numerous health and financial outcomes
Overarching question

What are the effects of expanding access to public health


insurance for low income adults?
Magnitudes, and even the signs, associated with that question
were uncertain
Limited existing evidence
Institute of Medicine review of evidence was suggestive, but a
lot of uncertainty
Observational studies are confounded by selection into health
insurance
Quasi-experimental work often focuses on elderly and children
Only one randomized experiment in a developed country: the
RAND health insurance experiment
1970s experiment on a general population
Randomized cost-sharing, not coverage itself
The Oregon Health Insurance Experiment

Setting: Oregon Health Plan Standard


Oregon’s Medicaid expansion program for poor adults
Eligibility
Poor (<100% federal poverty line) adults 19-64
Not eligible for other programs
Uninsured > 6 months
Legal residents
Comprehensive coverage (no dental or vision)
Minimum cost-sharing
Similar to other states in payments, management
Closed to new enrollment in 2004
The Oregon Medicaid Experiment

Oregon held a lottery


Waiver to operate lottery
5-week sign-up period, heavy advertising (January to February
2008)
Low barriers to sign up, no eligibility pre-screening
Limited information on list
Randomly drew 30,000 out of 85,000 on list (March-October
2008)
Those selected given chance to apply
Treatment at household level
Had to return application within 45 days
60% applied; 50% of those deemed eligible → 10,000 enrollees
Oregon Health Insurance Experiment

Evaluate effects of Medicaid using lottery as randomized


controlled trial (RCT)
Intent-to-treat: Reduced form comparison of outcomes
between treatment group (lottery selected individuals) and
controls (not selected)
LATE: IV using lottery as instrument for insurance coverage
First stage: about a 25 percentage point increase in insurance
coverage
Archived analysis plan
Massive data collection effort – primary and secondary
Similar to ACA expansion but limits to generalizability
Partial equilibrium vs. General equilibrium
Mandate and external validity
Oregon vs. other states
Short vs. Long-run
Examine Broad Range of Outcomes

Costs: Health care utilization


Insurance increases resources (income) and lowers price,
increasing utilization
But improved efficiency (and improved health), decreasing
utilization (“offset”)
Additional uncertainty when comparing Medicaid to no
insurance
Benefits I: Financial risk exposure
Insurance supposed to smooth consumption
But for very low income, is most care de jure or de facto free?
Benefits II: Health
Expected to improve (via increased quantity / quality of care)
But could discourage health investments (“ex ante moral
hazard”)
Data

Pre-randomization demographic information


From lottery sign-up
State administrative records on Medicaid enrollment
Primary measure of first stage (i.e., insurance coverage)
Outcomes
Administrative data (∼16 months post-notification): Hospital
discharge data, mortality, credit reports
Mail surveys (∼15 months): some questions ask 6-month
look-back; some ask current
In-person survey and measurements (∼25 months): Detailed
questionnaires, blood samples, blood pressure, body mass index
Study Population

Empirical Framework

They present reduced form estimates of the causal effect of


lottery selection

Yihj = β0 + β1 LOTTERYh + Xih β2 + Vih β3 + εihj

Validity of experimental design: randomization; balance on


treatment and control. This is what readers expect
Empirical framework

They also present IV results because they want to isolate the


causal effect of insurance coverage

$$INSURANCE_{ihj} = \delta_0 + \delta_1 LOTTERY_{ih} + X_{ih}\delta_2 + V_{ih}\delta_3 + \mu_{ihj}$$

$$y_{ihj} = \pi_0 + \pi_1 \widehat{INSURANCE}_{ih} + X_{ih}\pi_2 + V_{ih}\pi_3 + v_{ihj}$$

Effect of lottery on coverage: about 25 percentage points


We have independence guaranteed; now we need exclusion: the
primary pathway of the lottery must be via being on Medicaid
Could affect participation in other programs, but actually small
“Warm glow” of winning – especially early
Analysis plan, multiple inference adjustment
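A minimal Stata sketch of these two steps, with hypothetical variable names
(outcome y, lottery selection lottery, Medicaid coverage insurance, household
identifier hhid, and lottery-list controls x1 x2); the authors' archived analysis
plan and code are the definitive versions:

    * Reduced form / intent-to-treat effect of lottery selection
    reg y lottery x1 x2, vce(cluster hhid)

    * First stage: should show the roughly 25 percentage point jump in coverage
    reg insurance lottery x1 x2, vce(cluster hhid)

    * LATE: instrument insurance coverage with lottery selection
    ivregress 2sls y x1 x2 (insurance = lottery), vce(cluster hhid)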
Effect of lottery on coverage (first stage)
Effects of Lottery on Coverage (1st Stage)

                                             Full sample       Credit subsample    Survey respondents
                                            Control  Est. FS   Control  Est. FS    Control  Est. FS
Ever on Medicaid                             0.141    0.256     0.135    0.255      0.135    0.290
                                                     (0.004)            (0.004)             (0.007)
Ever on OHP Standard                         0.027    0.264     0.028    0.264      0.026    0.302
                                                     (0.003)            (0.004)             (0.005)
# of Months on Medicaid                      1.408    3.355     1.352    3.366      1.509    3.943
                                                     (0.045)            (0.055)             (0.09)
On Medicaid, end of study period             0.106    0.148     0.101    0.151      0.105    0.189
                                                     (0.003)            (0.004)             (0.006)
Currently have any insurance (self report)                                          0.325    0.179
                                                                                             (0.008)
Currently have private ins. (self report)                                           0.128   -0.008
                                                                                             (0.005)
Currently on Medicaid (self report)                                                 0.117    0.197
                                                                                             (0.006)
Currently on Medicaid                                                               0.093    0.177
                                                                                             (0.006)
Amy Finkelstein, et al. (2012). “The Oregon Health
Insurance Experiment: Evidence from the First Year”,
Quarterly Journal of Economics, vol. 127, issue 3, August.
Effects of Medicaid

Use primary and secondary data to gauge 1-year effects


Mail surveys: 70,000 surveys at baseline, 12 months
Administrative data
Medicaid enrollment records
Statewide Hospital discharge data, 2007-2010
Credit report data, 2007-2010
Mortality data, 2007-2010
Mail survey data

Fielding protocol
∼70,000 people, surveyed at baseline and 12 months later
Basic protocol: three-stage mail survey protocol,
English/Spanish
Intensive protocol on a 30% subsample included additional
tracking, mailings, phone attempts (done to adjust for
non-response bias)
Response rate
Effective response rate = 50%
Non-response bias always possible, but response rate and
pre-randomization measures in administrative data were
balanced between treatment and control
Administrative data

Medicaid records
Pre-randomization demographics from list
Enrollment records to assess “first stage” (how many of the
selected got insurance coverage)
Hospital discharge data
Probabilistically matched to list, de-identified at Oregon
Health Plan
Includes dates and source of admissions, diagnoses,
procedures, length of stay, hospital identifier
Includes years before and after randomization
Other data
Mortality data from Oregon death records
Credit report data, probabilistically matched, de-identified
Sample

89,824 unique individuals on the waiting list


Sample exclusions (based on pre-randomization data only)
Ineligible for OHP Standard (out of state address, age, etc.)
Individuals with institutional addresses on list
Final sample: 79,922 individuals in 66,385 households
29,834 treated individuals (surveyed 29,589)
40,088 control individuals (surveyed 28,816)
Sample characteristics
Outcomes

Access and use of care


Is access to care improved? Do the insured use more care? Is
there a shift in the types of care being used?
Mail surveys and hospital discharge data
Financial strain
How much does insurance protect against financial strain?
What are the out-of-pocket implications?
Mail surveys and credit reports
Health
What are the short-term impacts on self-reported physical and
mental health?
Mail surveys and vital statistics (mortality)
Results: Access & Use of Care

Gaining insurance resulted in better access to care and higher
satisfaction with care (conditional on actually getting care).

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Have a usual place of care          49.9%     +9.9%      +33.9%     .0001
Have a personal doctor              49.0%     +8.1%      +28.0%     .0001
Got all needed health care          68.4%     +6.9%      +23.9%     .0001
Got all needed prescriptions        76.5%     +5.6%      +19.5%     .0001
Satisfied with quality of care      70.8%     +4.3%      +14.2%     .001

SOURCE: Survey data
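Notice how the ITT and LATE columns line up with the Wald logic: dividing the
reduced-form (ITT) effect by the survey-respondent first stage of roughly 0.29
recovers the IV column. For the first row, for example,

$$\delta_{LATE} \approx \frac{\text{ITT}}{\text{First stage}} = \frac{0.099}{0.290} \approx 0.34,$$

which matches the +33.9 percentage point IV estimate.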
Results: Access & Use of Care

Gaining insurance resulted in an increased probability of hospital
admissions, primarily driven by non-emergency department (non-ED)
admissions.

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Any hospital admission               6.7%     +.50%      +2.1%      .004
--Admits through ED                  4.8%     +.2%       +.7%       .265
--Admits NOT through ED              2.9%     +.4%       +1.6%      .002

SOURCE: Hospital Discharge Data

Overall, this represents a 30% higher probability of admission,
although admissions are still rare events.
Total Use By Condition

Summary: Access and use of care

Overall, utilization and costs went up relative to controls


30% increase in probability of an inpatient admission
35% increase in probability of an outpatient visit
15% increase in probability of taking prescription medications
Total $777 increase in average spending (a 25% increase)
With this increased spending, those who gained insurance were

35% more likely to get all needed care


25% more likely to get all needed medications
Far more likely to follow preventive care guidelines, such as
mammograms (60%) and PAP tests (45%)
Results: Financial Strain

Gaining insurance resulted in a reduced probability of having medical
collections in credit reports, and in lower amounts owed.

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Had a bankruptcy                     1.4%     +0.2%      +0.9%      .358
Had a collection                    50.0%     -1.2%      -4.8%      .013
--Medical collections               28.1%     -1.6%      -6.4%      .0001
--Non-medical collections           39.2%     -0.5%      -1.8%      .455
$ owed medical collections          $1,999    -$99       -$390      .025

SOURCE: Credit report data
Summary: Financial Strain

Overall, reductions in collections on credit reports were evident

25% decrease in probability of a medical collection


Those with a collection owed significantly less
Household financial strain related to medical costs was
mitigated
Substantial reduction across all financial strain measures
Captures “informal channels” people use to make it work
Implications for both patients and providers
Only 2% of bills sent to collections are ever paid
Results: Self-Reported Health

Self-reported measures showed significant improvements one year
after randomization.

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Health good, v good, excellent      54.8%     +3.9%      +13.3%     .0001
Health stable or improving          71.4%     +3.3%      +11.3%     .0001
Depression screen NEGATIVE          67.1%     +2.3%      +7.8%      .003
CDC Healthy Days (physical)         21.86     +.381      +1.31      .018
CDC Healthy Days (mental)           18.73     +.603      +2.08      .003

SOURCE: Survey data
Summary: Self-reported health

Overall, big improvements in self-reported physical and mental


health
25% increase in probability of good, very good or excellent
health
10% decrease in probability of screening for depression
Physical health measures open to several interpretations
Improvements consistent with findings of increased utilization,
better access, and improved quality
BUT in their baseline surveys, results appeared shortly after
coverage (∼2/3rds magnitude of full result)
May suggest increase in perception of well-being rather than
physical health
Biomarker data can shed light on this issue
Discussion

At 1 year, found increases in utilization, reductions in financial


strain, and improvements in self-reported health
Medicaid expansion had benefits and costs – didn’t “pay for
itself”
Confirmed biases inherent in observational studies – would
have estimated bigger increases in use and smaller
improvements in outcomes
Policy-makers may have different views on value of different
aspects of improved well-being
“I have an incredible amount of fear because I don’t know if
the cancer has spread or not.”
“A lot of times I wanted to rob a bank so I could pay for the
medicine I was just so scared . . . People with cancer either
have a good chance or no chance. In my case it’s hard to
recover from lung cancer but it’s possible. Insurance took so
long to kick in that I didn’t think I would get it. Now there is
a big bright light shining on me.” (Anecdotes)
Important to have broad evidence on multifaceted effects of
Medicaid expansions
Baicker, Katherine, et al. (2014). “The Oregon Experiment
– Effects of Medicaid on Clinical Outcomes”, The New
England Journal of Medicine.
In-person data collection

Questionnaire and health examination including


Survey questions
Anthropometric and blood pressure measurement
Dried blood spot collection
Catalog of all medications
Fielded between September 2009 and December 2010
Average response ∼25 months after lottery began
Limited to Portland area: 20,745 person sample
12,229 interviews for effective response rate of 73%
Analytic approach

Intent to treat effect of lottery selection


Comparing all selected with all not selected
Random treatment assignment
No differential selection for outcome measurement
Local average treatment effect on Medicaid coverage
Using lottery selection as an instrument for coverage
∼24 percentage point increase in Medicaid enrollment
No change in private insurance (no crowd-out)
No effect of lottery except via Medicaid coverage
Statistical inference is the same for both
Results

1 Health care use


2 Financial strain
3 Clinical health outcomes
Health care use results

Increases in use in various settings


Increases in probability and number of outpatient visits
Increases in probability and number of prescription drugs
No discernible change in hospital or ED use (imprecise)
Increases in preventive care across range of services
Increases in perceived access and quality
Implied 35% increase in spending for insured
Results

1 Health care use


2 Financial strain
3 Clinical health outcomes
Financial Hardship Results

Reduction in strain, out-of-pocket (OOP), money owed


Substantial reduction across measures
Elimination of catastrophic OOP health spending
Implications for distribution of burden/benefits
Some borne by patients, some by providers
Non-financial burden of medical expenses and debt
Results

1 Health care use


2 Financial strain
3 Clinical health outcomes
Focusing on specific conditions

Measured:
Blood pressure
Cholesterol levels
Glycated hemoglobin
Depression
Reasons for selecting these:
Reasonably prevalent conditions
Clinically effective medications exist
Markers of longer term risk of cardiovascular disease
Can be measured by trained interviewers and lab tests
A limited window into health status
Results on specific conditions

Large reductions in depression


Increases in diagnoses and medication
In-person estimate of −9 percentage points in being depressed
Glycated hemoglobin
Increases in diagnosis and medication
No significant effect on HbA1c; wide confidence intervals
Blood pressure and cholesterol
No significant effects on diagnosis or medication
No significant effects on outcomes
Framingham risk score
No significant effect (in general or in sub-populations)
Summary

One to two years after expanded access to Medicaid:


Increases in health care use and associated costs
Increases in compliance with recommended preventive care
Improvements in quality and access
Reductions in financial strain
Improvements in self-reported health
Improvements in depression
No significant change in specific physical measures
Sense of the relative magnitude of the effects
Use and access, financial benefits, general health, depression
Physical measures of specific chronic conditions
Extrapolation to Obamacare (ACA) Expansion

Context quite relevant for health care reform:


States can choose to cover a similar population in planned
2014 Medicaid expansions (up to 138% of federal poverty line)
But important caveats to bear in mind
Oregon and Portland vs. US generally
Voluntary enrollment vs. mandate
Partial vs. general equilibrium effects
Short-run (1-2 years) vs. medium or long run
We will revisit this again later in the difference-in-differences
section when discussing Miller, et al. (2019)
Updating Priors based on Study’s Findings

“Medicaid is worthless or worse than no insurance”


Studies found increases in utilization and perceived access and
quality
Reductions in financial strain, improvement in self-reported
health
Improvement in depression
Can reject large declines in several physical measures
“Health insurance expansion saves money”
In short run, studies showed increases in utilization and cost
and no change in ED use
Increases in preventive care, improvements in self-reported
health, improvements in depression
Conclusion

Effects of expanding Medicaid likely to be manifold


Hard to establish with observational data and often misleading
Expanding Medicaid generates both costs and benefits
Increased spending
Measurably improves some aspects of health but not others
Important caveats about generalizability
Weighing them depends on policy priorities
Further research on alternative policies needed
Many steps in pathway between insurance and outcome
Role for innovation in insurance coverage
Complements to health care (e.g., social determinants)
Judge fixed effects designs

Imagine the following:


1 A person moves through a pipeline and hits a critical point
where treatment occurs as a result of some decision-maker
2 There are many different decision-makers and you’re assigned
randomly to one of them
3 Each decision-maker differs in terms of their leniency in
assigning the treatment
Very popular in criminal justice bc of how often judges are
randomly assigned to defendants (Kling 2006; Mueller-Smith
2015; Dobbie, et al. 2018) or even children to foster care case
workers (Doyle 2007; Doyle 2008)

Cunningham Causal Inference


Juvenile incarceration

Aizer and Doyle (2015) were interested in the causal effect of


juvenile imprisonment on future crime and human capital
accumulation
Extremely important policy question given the US has the
world’s highest incarceration rate and prison population of any
country in the world by a significant margin (500 prisoners per
100,000, over 2 million adults imprisoned, 4.8 million under
supervision)
High rates of incarceration extend to juveniles: in 2010, the
stock of juvenile detainees stood at 70,792, a rate of 2.3 per
1,000 aged 10-19.
Including supervision, US has a juvenile corrections rate 5x
higher than the next highest country, South Africa
Confounding

[DAG: D → Y, with unobserved factors affecting both D and Y]

We are interested in the causal effect of juvenile incarceration


(D) on life outcomes, like adult crime and high school
completion
But youth choose to commit crimes, and that choice may be
due to unobserved criminogenic factors like poverty or
underlying criminal propensities which are themselves causing
those future outcomes
Leniency as an instrument

[IV DAG: Z → D → Y, with unobserved e affecting both D and Y]

Aizer and Doyle (2015) propose an instrument - the propensity


to convict by the judge the youth is randomly assigned
If judge assignment is random, and the various assumptions
hold, then the IV strategy identifies the local average
treatment effect of juvenile incarceration on life outcomes
The Main Idea

“Plausibly exogenous” variation in juvenile detention stemming


from the random assignment of cases to judges who vary in
their sentencing
Consider two juveniles randomly assigned to two different
judges with different incarceration tendencies (Scott and Bob)
Random assignment ensures that differences in incarceration
between Scott and Bob are due to the judge, not themselves,
because remember, they’re identical
Data

Administrative records on 35,000 juveniles over 10 years who
came before a juvenile court in Chicago (Juvenile Court of Cook
County Delinquency Database)
Data were linked to public school data for Chicago (Chicago
Public Schools) and adult incarceration data for Illinois (Illinois
Dept. of Corrections Adult Admissions and Exits)
They wanted to know the effect of juvenile incarceration on
high school completion (2nd data needed) and adult crime
(3rd data needed) using randomized judge assignment (1st
data needed)
They need personal identifying information in each data set to
make this link (i.e., name, DOB, address)
Preview of findings

Juvenile incarceration decreased high school graduation by 13


percentage points (vs. 39pp in OLS)
Increased adult incarceration by 23 percentage points (vs.
41pp in OLS)
Marginal cases are high risk of adult incarceration and low risk
of high school completion as a result of juvenile custody
Unlikely to ever return to school after incarcerated, but when
they do return, they are more likely to be classified as special
ed students, and more likely to be classified for special ed
services due to behavioral/emotional disorders (as opposed to
cognitive disability)
“Plausibly” exogenous

Very common in these studies for the assignment to some


decision-maker to be arbitrary but not clearly random (i.e., not
random no. generator)
In this case, juveniles charged with a crime are assigned to a
calendar corresponding to their neighborhood and calendars
have 1-2 judges who preside over them
1/5 of hearings are presided over by judges who cover the
calendar when the main judge can’t, known as swing judges
Judge assignment is a function of the sequence with which
cases happen to enter into the system and judge availability
that is set in advance
No scope for manipulating which judge you see first; conversations
with court administrators confirm it’s random
Structural equation

Yi = β0 + β1 JIi + β2 Xi + εi

where Xi is controls and εi is an error term. In this, juvenile


incarceration is likely correlated with the error term.

This is the “long” causal model. But note, from the prior DAG, we
cannot control for e because it is unobserved. But it is confounding
the estimation of juvenile incarceration’s effect on outcomes.
Incarceration Propensity as an Instrument

The instrument is the propensity to incarcerate of the randomly
assigned judge
“Leave-one-out mean”

$$Z_{j(i)} = \frac{1}{n_{j(i)} - 1} \sum_{k \neq i} \widetilde{JI}_k$$

The term $n_{j(i)}$ is the total number of cases seen by the judge
$j(i)$ assigned to juvenile $i$, and $\widetilde{JI}_k$ equals 1 if
juvenile $k$ was incarcerated during their first case
Thus the instrument is the judge’s incarceration rate among first
cases, computed from all of their other cases
It’s basically a judge fixed effect given the likelihood two
judges have precisely the same propensity is small
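A minimal Stata sketch of building the leave-one-out propensity, assuming one row
per juvenile's first case, a judge identifier judge_id, and an incarceration dummy
ji (all hypothetical names):

    * Each judge's total incarcerations and caseload among first cases
    egen ji_total = total(ji), by(judge_id)
    egen n_cases  = count(ji), by(judge_id)

    * Leave-one-out mean: the judge's incarceration rate excluding case i itself
    gen z_loo = (ji_total - ji) / (n_cases - 1)

    * z_loo then serves as the instrument for juvenile incarceration, e.g.
    * ivregress 2sls y x1 x2 (ji = z_loo), vce(cluster judge_id)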
Information about the instrument

There are 62 judges in the data, and the average number of


initial cases per judge is 607
Substantial variation in the data - raw measure ranges from
4% to 21%
Residualized measure based on controls still has substantial
variation from 6% to 18%
Variation comes from two sources: variation among the regular
(nonswing) judges (80% of cases) and variation from the
swing judges (20% of cases)
Distribution of IV
Balance test
First stage
High school completion
Adult crime
Crime type
High school transfers
Developing emotional problems
Concluding remarks

Sad, but important, paper - the marginal kid shouldn’t have


been incarcerated
More generally, leniency designs are very powerful and very
common if you know how to look for them
Bottleneck, influential decision-makers, discretion - these are
the three elements of the design
Comments on judge fixed effects

Leave-one-out average propensity of the decision-maker, or


some residualized instrument, is very common
More often you’ll see jackknife IV (JIVE), which constructs each
observation’s fitted instrument leaving that observation out, in
order to reduce finite-sample bias
The biggest threats aren’t exclusion probably (though
sometimes), but monotonicity
Might judges be harsh in some situations (violent crimes) but
lenient in others (female defendants, first-time offenders)?
Tests for violations

New paper by Frandsen, Lefgren and Leslie (2019) proposes a


test
They show that the identifying assumptions imply a conditional
expectation of the outcome of interest given the judge
assignment is a continuous function of the judge propensity
They propose a two-part test that generalizes the
Sargan-Hansen overidentification test and assesses whether the
pattern of treatment effects across judge propensities is
consistent with the identifying assumptions
Software available on Emily Leslie’s website
Multi-dimensional instrument

Peter Hull, in a cautionary note, points out that while collapsing
the judge fixed effects into a single leave-out propensity is
numerically equivalent, you are still effectively instrumenting
with a series of dummies
Therefore it’s very important to keep in mind the lessons we
learned from weak instruments – the more weak instruments
you have when a parameter is overidentified, the larger the bias
It’s ongoing at the moment to think about ways to improve
instrument selection, but not settled
I encourage you to read Peter’s note on his website and begin
thinking about this yourself
Discussion questions

When working on a judge fixed effects project, write down an


IV DAG
Whereas monotonicity cannot be visualized to my knowledge
on a DAG, exclusion can – so what does an exclusion violation
mean in this context?
Use logic and conversations with those administering the
program to answer the following – what does monotonicity
mean in this context and how might it be violated?
Empirical exercise

Let’s estimate the effect of cash bail on defendant outcomes


using 2SLS and JIVE
Excellent paper by Megan Stevenson
-bail.do- and -bail.r- in dropbox and github
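Before opening the do-files, here is a hedged sketch of the basic pipeline with
hypothetical names (outcome y, cash-bail dummy bail, leave-one-out judge propensity
z_loo built as in the previous section, judge identifier judge_id); the actual
-bail.do- and -bail.r- may be organized differently:

    * 2SLS using the leave-one-out judge propensity as the instrument
    ivregress 2sls y x1 x2 (bail = z_loo), vce(cluster judge_id)

    * JIVE is not built into Stata; one route is a user-written command
    * (try -search jive-), or construct leave-out fitted values by hand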
Twoway fixed effects

When working with panel data, the so-called “twoway fixed


effects” (TWFE) estimator is the workhorse estimator
It’s easy to run, a version of OLS, and many people are just
interested in mean effects anyway
It’s the most common model for estimating treatment effects
in a difference-in-differences, and so for all these reasons, we
need to spend some time understanding what it is

Cunningham Causal Inference


Panel Data

Panel data: we observe the same units (individuals, firms,


countries, schools, etc.) over several time periods
Often our outcome variable depends on unobserved factors
which are also correlated with our explanatory variable of
interest
If these omitted variables are constant over time, we can use
panel data estimators to consistently estimate the effect of our
explanatory variable
What I will cover

I will cover pooled OLS and twoway fixed effects


But I won’t be covering random effects, Arellano and Bond
and any number of important panel estimators because the
purpose here is to present the modal regression model used in
difference-in-differences
[Panel DAG with outcomes Yi1, Yi2, Yi3, treatments Di1, Di2, Di3,
covariates Xi, and unobserved heterogeneity ci]

Sorry - drawing the DAG for a simple panel model is somewhat


messy!
When to use this

Traditionally, this was used for estimating constant treatment


effects with unobserved time-invariant heterogeneity – recall
the ci was constant across all time periods
It’s a linear model, so you’ll be estimating conditional mean
treatment effects – if you want the median, you can’t use this
Once you enter into a world with dynamic treatment effects
and differential timing, this loses all value
Problems that fixed effects cannot solve

Reverse causality: Becker predicted police reduce crime, but


when you regress crime onto police, it’s usually positive
β̂FE is inconsistent unless strict exogeneity conditional on ci
holds:
E[εit | xi1, xi2, . . . , xiT, ci] = 0; t = 1, 2, . . . , T
implies εit uncorrelated with past, current and future
regressors
Time-varying unobserved heterogeneity
It’s the time-varying unobservables you have to worry about in
fixed effects
Can include time-varying controls, but as always, don’t
condition on a collider
Formal panel notation

Let y and x ≡ (x1 , x2 , . . . , xk ) be observable random variables


and c be an unobservable random variable
We are interested in the partial effects of variable xj in the
population regression function

E [y |x1 , x2 , . . . , xk , c]
Formal panel notation cont.

We observe a sample of i = 1, 2, . . . , N cross-sectional units


for t = 1, 2, . . . , T time periods (a balanced panel)
For each unit i, we denote the observable variables for all time
periods as {(yit , xit ) : t = 1, 2, . . . , T }
xit ≡ (xit1 , xit2 , . . . , xitk ) is a 1 × K vector
Typically assume that cross-sectional units are i.i.d. draws
from the population: {yi, xi, ci}, i = 1, . . . , N, are i.i.d.
(cross-sectional independence)
yi ≡ (yi1, yi2, . . . , yiT)′ and xi ≡ (xi1, xi2, . . . , xiT)
Consider asymptotic properties with T fixed and N → ∞
Formal panel notation

Single unit:

$$y_i = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{it} \\ \vdots \\ y_{iT} \end{pmatrix}_{T \times 1} \qquad
X_i = \begin{pmatrix} X_{i,1,1} & X_{i,1,2} & \cdots & X_{i,1,j} & \cdots & X_{i,1,K} \\ \vdots & \vdots & & \vdots & & \vdots \\ X_{i,t,1} & X_{i,t,2} & \cdots & X_{i,t,j} & \cdots & X_{i,t,K} \\ \vdots & \vdots & & \vdots & & \vdots \\ X_{i,T,1} & X_{i,T,2} & \cdots & X_{i,T,j} & \cdots & X_{i,T,K} \end{pmatrix}_{T \times K}$$

Panel with all units:

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_N \end{pmatrix}_{NT \times 1} \qquad
X = \begin{pmatrix} X_1 \\ \vdots \\ X_i \\ \vdots \\ X_N \end{pmatrix}_{NT \times K}$$
Unobserved heterogeneity

For a randomly drawn cross-sectional unit i, the model is given


by
yit = xit β + ci + εit , t = 1, 2, . . . , T

yit : log wages i in year t


xit : 1 × K vector of variable events for person i in year t, such
as education, marriage, etc. plus an intercept
β : K × 1 vector of marginal effects of events
ci : sum of all time-invariant inputs known to person i (but
unobserved by the researcher), e.g., ability, beauty, grit, etc.,
often called unobserved heterogeneity or the fixed effect
εit : time-varying unobserved factors, such as a recession,
unknown to the person at the time the decisions on the events
xit are made, sometimes called the idiosyncratic error
Pooled OLS

When we ignore the panel structure and regress yit on xit we


get
yit = xit β + vit ; t = 1, 2, . . . , T
with composite error vit ≡ ci + εit
What happens when we regress yit on xit if x is correlated with
ci ?
Then x ends up correlated with v , the composite error term.
Somehow we need to eliminate this bias, but how?
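One way to see the problem is a tiny made-up simulation in which the unobserved
effect is built to be correlated with the regressor (all names and parameter values
below are purely illustrative): pooled OLS overstates the true coefficient of 1,
while the fixed effects estimator introduced next recovers it.

    clear
    set seed 12345
    set obs 500
    gen id = _n
    gen c  = rnormal()              // unobserved heterogeneity
    expand 5                        // five time periods per unit
    bysort id: gen t = _n
    gen x = 0.5*c + rnormal()       // regressor correlated with c
    gen y = 1*x + c + rnormal()     // true coefficient on x is 1

    reg y x                         // pooled OLS: biased upward
    xtset id t
    xtreg y x, fe                   // fixed effects: close to 1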

Cunningham Causal Inference


Pooled OLS

Main assumption to obtain consistent estimates for β is:


E [vit |xi1 , xi2 , . . . , xiT ] = E [vit |xit ] = 0 for t = 1, 2, . . . , T
xit are strictly exogenous: the composite error vit in each time
period is uncorrelated with the past, current and future
regressors
But: education xit likely depends on grit and ability ci and so
we have omitted variable bias and β̂ is not consistent
No correlation between xit and vit implies no correlation
between unobserved effect ci and xit for all t
Violations are common: whenever we omit a time-constant
variable that is correlated with the regressors (heterogeneity
bias)
Additional problem: vit are serially correlated for same i since
ci is present in each t and thus pooled OLS standard errors are
invalid
Pooled OLS

Always ask: is there a time-constant unobserved variable (ci )


that is correlated with the regressors?
If yes, then pooled OLS is problematic
This is how we motivate a fixed effects model: because we
believe unobserved heterogeneity is the main driving force
making the treatment variable endogenous
Fixed effect regression

Our unobserved effects model is:

yit = xit β + ci + εit ; t = 1, 2, . . . , T

If we have data on multiple time periods, we can think of ci as


fixed effects to be estimated
OLS estimation with fixed effects yields

$$(\hat{\beta}, \hat{c}_1, \ldots, \hat{c}_N) = \underset{b, m_1, \ldots, m_N}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - x_{it}b - m_i)^2$$

this amounts to including N individual dummies in regression


of yit on xit
Derivation: fixed effects regression

$$(\hat{\beta}, \hat{c}_1, \ldots, \hat{c}_N) = \underset{b, m_1, \ldots, m_N}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - x_{it}b - m_i)^2$$

The first-order conditions (FOC) for this minimization problem are:

$$\sum_{i=1}^{N} \sum_{t=1}^{T} x_{it}'(y_{it} - x_{it}\hat{\beta} - \hat{c}_i) = 0$$

and

$$\sum_{t=1}^{T} (y_{it} - x_{it}\hat{\beta} - \hat{c}_i) = 0$$

for i = 1, . . . , N.
Derivation: fixed effects regression

Therefore, for i = 1, . . . , N,

$$\hat{c}_i = \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}\hat{\beta}) = \bar{y}_i - \bar{x}_i\hat{\beta},$$

where

$$\bar{x}_i \equiv \frac{1}{T}\sum_{t=1}^{T} x_{it}; \qquad \bar{y}_i \equiv \frac{1}{T}\sum_{t=1}^{T} y_{it}$$

Plug this result into the first FOC to obtain:

$$\hat{\beta} = \bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)'(x_{it} - \bar{x}_i)\bigg)^{-1}\bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)'(y_{it} - \bar{y}_i)\bigg)$$

$$\hat{\beta} = \bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}'\ddot{x}_{it}\bigg)^{-1}\bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}'\ddot{y}_{it}\bigg)$$

with time-demeaned variables $\ddot{x}_{it} \equiv x_{it} - \bar{x}_i$, $\ddot{y}_{it} \equiv y_{it} - \bar{y}_i$


Fixed effects regression

Running a regression with the time-demeaned variables
ÿit ≡ yit − ȳi and ẍit ≡ xit − x̄i is numerically equivalent to a
regression of yit on xit and unit-specific dummy variables.

Even better, the regression with the time-demeaned variables is
consistent for β even when Cov[xit, ci] ≠ 0 because
time-demeaning eliminates the unobserved effects:

yit = xit β + ci + εit
ȳi = x̄i β + ci + ε̄i
(yit − ȳi) = (xit − x̄i)β + (ci − ci) + (εit − ε̄i)
ÿit = ẍit β + ε̈it
Fixed effects regression: main results

Identification assumptions:
1 E[εit | xi1, xi2, . . . , xiT, ci] = 0; t = 1, 2, . . . , T
regressors are strictly exogenous conditional on the unobserved
effect
allows xit to be arbitrarily related to ci
2 rank( Σt=1,...,T E[ẍit′ ẍit] ) = K
regressors vary over time for at least some i and are not collinear
Fixed effects estimator
1 Demean and regress ÿit on ẍit (need to correct degrees of
freedom)
2 Regress yit on xit and unit dummies (dummy variable
regression)
3 Regress yit on xit with canned fixed effects routine
Stata: xtreg y x, fe i(PanelID)
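A minimal sketch of the three equivalent implementations, with generic names y, x,
id, and t assumed:

    xtset id t

    * (1) Demean by hand, then OLS on the demeaned data
    egen ybar = mean(y), by(id)
    egen xbar = mean(x), by(id)
    gen ydd = y - ybar
    gen xdd = x - xbar
    reg ydd xdd, vce(cluster id)        // degrees of freedom need correcting

    * (2) Dummy-variable regression (same point estimate)
    areg y x, absorb(id) vce(cluster id)

    * (3) Canned fixed effects routine
    xtreg y x, fe vce(cluster id)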
FE main results

Properties (under assumptions 1-2):


β̂FE is consistent: plim (N→∞) β̂FE,N = β
β̂FE is unbiased conditional on X
Fixed effects regression: main issues

Inference:
Standard errors have to be “clustered” by panel unit (e.g., the
individual) to allow correlation in the εit’s for the same i.
Yields valid inference as long as number of clusters is
reasonably large
Typically we care about β, but unit fixed effects ci could be of
interest
ĉi from the dummy variable regression is unbiased but not
consistent for ci (based on fixed T and N → ∞)
Application: SASP

From 2008-2009, I fielded a survey of Internet sex workers


(685 respondents, 5% response rate)
I asked two types of questions: static provider-specific
information (e.g., age, weight) and dynamic session
information over last 5 sessions
Let’s look at the panel aspect of this analysis together

Cunningham Causal Inference


Risk premium equation

Yis = βi Xi + δDis + γis Zis + ui + εis


Ÿis = γis Z̈is + η̈is

where Y is log price, D is unprotected sex with a client in a


session, X are client and session characteristics, Z is unobserved
heterogeneity, and ui is both unobserved and correlated with Zis .
Table: POLS, FE and Demeaned OLS Estimates of the Determinants of
Log Hourly Price for a Panel of Sex Workers
Depvar: POLS FE Demeaned OLS

Unprotected sex with client of any kind 0.013 0.051* 0.051*


(0.028) (0.028) (0.026)
Ln(Length) -0.308*** -0.435*** -0.435***
(0.028) (0.024) (0.019)
Client was a Regular -0.047* -0.037** -0.037**
(0.028) (0.019) (0.017)
Age of Client -0.001 0.002 0.002
(0.009) (0.007) (0.006)
Age of Client Squared 0.000 -0.000 -0.000
(0.000) (0.000) (0.000)
Client Attractiveness (Scale of 1 to 10) 0.020*** 0.006 0.006
(0.007) (0.006) (0.005)
Second Provider Involved 0.055 0.113* 0.113*
(0.067) (0.060) (0.048)
Asian Client -0.014 -0.010 -0.010
(0.049) (0.034) (0.030)
Black Client 0.092 0.027 0.027
(0.073) (0.042) (0.037)
Hispanic Client 0.052 -0.062 -0.062
(0.080) (0.052) (0.045)
Other Ethnicity Client 0.156** 0.142*** 0.142***
(0.068) (0.049) (0.045)
Met Client in Hotel 0.133*** 0.052* 0.052*
(0.029) (0.027) (0.024)
Gave Client a Massage -0.134*** -0.001 -0.001
(0.029) (0.028) (0.024)
Age of provider 0.003 0.000 0.000
(0.012) (.) (.)
Age of provider squared -0.000 0.000 0.000
(0.000) (.) (.)
Table: POLS, FE and Demeaned OLS Estimates of the Determinants of
Log Hourly Price for a Panel of Sex Workers
Depvar: POLS FE Demeaned OLS

Body Mass Index -0.022*** 0.000 0.000


(0.002) (.) (.)
Hispanic -0.226*** 0.000 0.000
(0.082) (.) (.)
Black 0.028 0.000 0.000
(0.064) (.) (.)
Other -0.112 0.000 0.000
(0.077) (.) (.)
Asian 0.086 0.000 0.000
(0.158) (.) (.)
Imputed Years of Schooling 0.020** 0.000 0.000
(0.010) (.) (.)
Cohabitating (living with a partner) but unmarried -0.054 0.000 0.000
(0.036) (.) (.)
Currently married and living with your spouse 0.005 0.000 0.000
(0.043) (.) (.)
Divorced and not remarried -0.021 0.000 0.000
(0.038) (.) (.)
Married but not currently living with your spouse -0.056 0.000 0.000
(0.059) (.) (.)

N 1,028 1,028 1,028


Mean of dependent variable 5.57 5.57 0.00
Heteroskedasticity-robust standard errors in parentheses, clustered at the provider level. * p<0.10,
** p<0.05, *** p<0.01
Unit specific time trends often eliminate “results”

Table: Demeaned OLS Estimates of the Determinants of Log Hourly


Price for a Panel of Sex Workers with provider specific trends
Depvar: FE w/provider trends

Unprotected sex with client of any kind 0.004


(0.046)
Ln(Length) -0.450***
(0.020)
Client was a Regular -0.071**
(0.023)
Age of Client 0.008
(0.005)
Age of Client Squared -0.000
(0.000)
Client Attractiveness (Scale of 1 to 10) 0.003
(0.003)
Second Provider Involved 0.126*
(0.055)
Asian Client -0.048***
(0.007)
Black Client 0.017
(0.043)
Hispanic Client -0.015
(0.022)
Other Ethnicity Client 0.135***
(0.031)
Met Client in Hotel 0.073***
(0.019)
Gave Client a Massage 0.022
(0.012)
Concluding remarks

This is not a review of panel econometrics; for that see


Wooldridge and other excellent options
We reviewed POLS and TWFE because they are commonly
used with individual level panel data and
difference-in-differences
Their main value is how they control for unobserved
heterogeneity through a simple demeaning
Now let’s discuss difference-in-differences which will at various
times use the TWFE model
John Snow

John Snow was a practicing anesthesiologist in the mid 19th


century London
He was then famous for inventing a machine that would
carefully deliver chloroform to patients in homogenous dosage
which reduced mortality from anasthesia
But he is now famous for providing convincing evidence that
cholera was a waterborne disease during the 1854 outbreak
Published two works on cholera – an essay in 1849, and a
book in 1855
Died of a stroke in 1858

Cunningham Causal Inference


Figure: Daily cholera deaths, London (Coleman 2019)
Cholera background

Cholera hits London three times in the early to mid 1800s


causing large waves of tens of thousands of deaths
Three London epidemics – 1831-1832, 1848-1849, 1853-1854
Cholera attacked victims suddenly, with a 50% survival rate,
and very painful symptoms included vomiting and acute
diarrhea
Miasma

19th century London was a filthy place with waste collecting in


cesspools under houses or emptied into open ditches and
sewers
Majority opinion about disease was the miasma theory
Miasma theory hypothesized that disease transmission was caused by
vapors and smells; it was unclear how relevant this was person-to-person
Never before seen microorganism

Microscopes were around but had horrible resolution


Most human pathogens couldn’t be seen
Johnson (2007) reports Snow did track down a microscope but
could only see blurry things moving around
Isolating these microorganisms wouldn’t occur for half a
century
Snow’s hypothesis

Snow (as well as a few others like Rev. Henry Whitehead)


believe miasma is not relevant for explaining cholera
Snow hypothesizes that the active agent was a living organism
that entered the body, got into the alimentary canal with food
or drink, multiplied in the body, and generated some poison
that caused the body to expel water
The organism passed out of the body with these evacuations,
entered the water supply and infected new victims
The process repeated itself, growing rapidly through the
common water supply, causing an epidemic
Thought Experiment

How will he convince anyone that cholera is waterborne and


not due to “bad air”?
Consider the ideal experiment: randomize households by coin
flip to receive water from runoff (control) vs. water without
runoff (treatment)
Unethical, impractical and unrealistic
Even if the randomized experiment is not possible, the thought
experiment suggests the observational equivalent
Multiple sources of evidence, not just one

Snow makes his argument with many pieces of evidence that when
taken together are very compelling that water, not air, is the cause
of the cholera epidemics. These can be categorized as:
1 Observation
2 Broad Street Pump
3 Grand Experiment
Observation

Observed progression of the disease for years


Tracked Patient Zero
Treatments didn’t work: Snow would cover with burlap sacks,
which did nothing
Strange irregular patterns – higher deaths in close proximity to
a public pump on Broad Street, fewer deaths at a pub

“cholera extended to nearly all the houses in which the


water was thus tainted, and to no others.” (Snow 1849)
Broad street outbreak

“The most terrible outbreak of cholera which ever


occurred in this kingdom, is probably that which took
place in Broad Street, Golden Square, and the adjoining
streets, a few weeks ago. Within two hundred and fifty
yards of the spot where Cambridge Street [now Lexington
St.] joins Broad Street [now Broadwick], there were
upwards of five hundred fatal attacks of cholera in ten
days.” (Snow 1855)
How he argues for the Broad street pump

Famous map showing unusual mass of cholera deaths near the


public Broad street pump
He was looking for the source, but he was not inductively
forming his theory with this map because he already knew the
mechanism
He was assembling evidence that would further refute the
explanations of those who advocated an alternative
explanation of the outbreak
Figure: Cholera deaths laid over a small area of London near Broad Street
Map was important but not enough on its own

“[Snow] could see at a glance that he’d be able to


demonstrate that the outbreak was clustered around the
pump, yet he knew from experience that that kind of
evidence, on its own, would not satisfy a miasmatist. The
cluster could just as easily reflect some pocket of poisoned
air that had settled over that part of Soho, something
emanating from the gulley holes or cesspools – or perhaps
even from the pump itself. Snow knew that the case
would be made in the exceptions from the norm. Pockets
of life where you could expect death, pockets of death
where you would expect life.” Johnson (2007) p. 140
Two companies fight for customers

Southwark and Vauxhall Waterworks Company and the


Lambeth Water Company competed over some of the regions
south of the Thames
In 16 sub-districts, with a population of 300,000, they
competed directly, even supplying customers side-by-side

“In many cases a single house has a supply different from


that on either side. Each company supplies both rich and
poor, both large houses and small; there is no difference
in the condition or occupation of the persons receiving
the water of the different companies.” Snow (1855) p 75
Lambeth moves its pipe

During the 1849 epidemic, both companies drew water from


Thames which was polluted with sewage and cholera
London passes legislation requiring water companies to move
their intake pipes upstream of the city
In 1852, the Lambeth Company, a water utility company,
changed supply from Hungerford Bridge
It moved its intake pipe upstream to cleaner water and in
response to legislation (SV delayed)
This created a natural experiment because Southwark and
Vauxhall left its intake pipe in place
Meticulous Data Collection

Two types of data: DD uses aggregate deaths bc of mixing of


customers whereas his Broad Street evidence focused on
individuals
Collected detailed information from households with cholera
deaths on utility subscription (Lambeth or SV)
Many residents didn’t know their water company – distant
landlords paid for it
He knew Lambeth water was four times saltier, so he’d take a
sample and test it using a saline test back at his office
Shoeleather and knowledge of institutional details

Careful balance checks – “the pipes of each Company go down


all the streets into nearly all the courts and alleys”
Concern for sample selection bias –“No fewer than 3000 people
of both sexes [of all types affected]”
Treatment assignment was arbitrary – “a few houses supplied
by one Company and a few by the other”
Table XII

Modified Table XII (Snow 1854)


Company name 1849 1854
Southwark and Vauxhall 135 147
Lambeth 85 19

Estimated ATT using DD is 78 fewer deaths per 10,000
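The 78 is just the difference of the two differences in Table XII (deaths per 10,000):

$$\hat{\delta}_{DD} = \underbrace{(19 - 85)}_{\text{Lambeth}} - \underbrace{(147 - 135)}_{\text{Southwark and Vauxhall}} = -66 - 12 = -78$$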


Failure to convince

“In spite of what has since been recognized as a classic


exercise in data, analysis, and argument, Snow failed to
convince the medical profession, the policy-making
establishment, or the public.” (Coleman 2019)
Final victory

Another cholera outbreak in 1866, east of London, is when


Snow’s ideas were gradually and reluctantly accepted by public
officials and the scientific community
1866 outbreak was confined only to the east of London, which
was the last area not yet connected to the newly constructed
sewage system, which discharged sewage farther down the Thames
The rest of London didn’t have an outbreak
This was the final piece of evidence that swayed skeptics and
led to a more reasoned assessment of Snow’s data and analysis
Merits of Snow’s work

Long commitment to the topic led him to reject unsound


hypotheses and form new ones based on observation and
experience (shoe leather)
Expert handling of data analysis, data visualization, and a
framing of evidence with a ladder of reasoning
Layered rhetoric of research

“The strength of his model derived from its ability to use


observed phenomena on one scale to make predictions
about behavior on other scales up and down the chain. ...
If cholera were waterborne then the patterns of infection
must correlate with the patterns of water distribution in
London's neighborhoods. Snow's theory was like a
ladder; each individual rung was impressive enough, but
the power of it lay in ascending from bottom to top, from
the membrane of the small intestine all the way up to the
city itself.” (Johnson, Ghost Map)
Simple cross-sectional design

Table: Lambeth and Southwark and Vauxhall, 1854

Company                     Cholera mortality

Lambeth                     Y = L + D
Southwark and Vauxhall      Y = SV

Interrupted time series design

Table: Lambeth, 1849 and 1854

Company     Time    Cholera mortality

Lambeth     1849    Y = L
            1854    Y = L + (T + D)
Difference-in-differences

Table: Lambeth and Southwark and Vauxhall, 1849 and 1854

Companies                  Time      Outcome          D1              D2

Lambeth                    Before    Y = L
                           After     Y = L + T + D    T + D
                                                                      D
Southwark and Vauxhall     Before    Y = SV
                           After     Y = SV + T       T
Sample averages

   
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$
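As a minimal sketch (not from the course do files), the same 2x2 can be computed by hand in Stata from the four cell means, assuming a dataset with outcome y and hypothetical dummies treated and post:

* compute the four sample means and take the double difference
summarize y if treated == 1 & post == 1
scalar y_k_post = r(mean)
summarize y if treated == 1 & post == 0
scalar y_k_pre = r(mean)
summarize y if treated == 0 & post == 1
scalar y_u_post = r(mean)
summarize y if treated == 0 & post == 0
scalar y_u_pre = r(mean)
display (y_k_post - y_k_pre) - (y_u_post - y_u_pre)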
Population expectations

   
$$\hat{\delta}^{2x2}_{kU} = \Big(E[Y_k|Post] - E[Y_k|Pre]\Big) - \Big(E[Y_U|Post] - E[Y_U|Pre]\Big)$$
Potential outcomes and the switching equation

   
$$\hat{\delta}^{2x2}_{kU} = \underbrace{\Big(E[Y^1_k|Post] - E[Y^0_k|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Switching equation}} + \underbrace{E[Y^0_k|Post] - E[Y^0_k|Post]}_{\text{Adding zero}}$$
Parallel trends bias

$$\hat{\delta}^{2x2}_{kU} = \underbrace{E[Y^1_k|Post] - E[Y^0_k|Post]}_{ATT} + \underbrace{\Big(E[Y^0_k|Post] - E[Y^0_k|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Non-parallel trends bias in 2x2 case}}$$
Another famous DD study

Card and Krueger (1994) was a seminal study on the minimum


wage both for the result and for the design
Not the first time we saw DD in the modern period - there’s
Ashenfelter (1978) and Card (1991) - but got a lot of attention
Competitive vs noncompetitive markets

Suppose you are interested in the effect of minimum wages on


employment which is a classic and divisive question.
In a competitive input market, increases in the minimum wage
would move us up a downward sloping labor demand curve →
employment would fall
Monopsony (imperfect labor markets) suggests the opposite effect, whereby raising the minimum wage can increase employment
Monopsony’s minimum wage predictions
Card and Krueger (1994)

In February 1992, New Jersey increased the state minimum


wage from $4.25 to $5.05. Pennsylvania’s minimum wage
stayed at $4.25.
Locations of Restaurants (Card and Krueger 2000)


They surveyed about 400 fast food stores both in New Jersey
and Pennsylvania before and after the minimum wage increase
in New Jersey - shoeleather!
Parallel trends assumption

Key identifying assumption is the “parallel trends” assumption


$$\underbrace{\Big[E[Y^0_{NJ}|Post] - E[Y^0_{NJ}|Pre]\Big] - \Big[E[Y^0_{PA}|Post] - E[Y^0_{PA}|Pre]\Big]}_{\text{Non-parallel trends bias}}$$

Note the counterfactual - it is not testable no matter what someone tells you, because New Jersey's post-period potential employment in a world with a lower minimum wage is unobserved
Let's look at this a couple of different ways, including a graphic showing the binding minimum wage
Let’s look at this a couple of different ways, including a
graphic showing the binding minimum wage
Wages After Rise in Minimum Wage

Table: Card and Krueger (1994), Table 3 - average FTE employment per store before and after the rise in the New Jersey minimum wage, for PA, NJ, and the NJ-PA difference (FTE employment before and after, and changes in mean FTE employment; values omitted)

Surprisingly, employment rose in NJ relative to PA after the minimum wage change - consistent with monopsony theory
Regression DD

Remember, I said there are good reasons to use TWFE


It’s easy to calculate the standard errors
We can control for other variables which may reduce the
residual variance (lead to smaller standard errors)
It’s easy to include multiple periods
We can study treatments with different treatment intensity.
(e.g., varying increases in the minimum wage for different
states)
But there are bad reasons, too, which I’ll discuss under
differential timing
Regression DD

The typical regression model we estimate is

$$Y_{it} = \beta_1 + \beta_2 Treat_i + \beta_3 Post_t + \beta_4 (Treat \times Post)_{it} + \varepsilon_{it}$$

where Treat is a dummy if the observation is in the treatment


group and Post is a post treatment dummy
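As a minimal Stata sketch of this regression (variable names y, treat, post, and state are hypothetical), the DD estimate is the coefficient on the interaction and standard errors are clustered on the group identifier:

gen treat_post = treat * post
reg y treat post treat_post, vce(cluster state)

* equivalent factor-variable syntax: the DD estimate is the coefficient on 1.treat#1.post
reg y i.treat##i.post, vce(cluster state)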
Regression DD - Card and Krueger

In the Card and Krueger case, the equivalent regression would


be:
$$Y_{its} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ \times d)_{st} + \varepsilon_{its}$$

NJ is a dummy equal to 1 if the observation is from NJ


d is a dummy equal to 1 if the observation is from November
(the post period)
This equation takes the following values
PA Pre: α
PA Post: α + λ
NJ Pre: α + γ
NJ Post: α + γ + λ + δ
DD estimate: (NJ Post - NJ Pre) - (PA Post - PA Pre) = δ
Graph - Observed Data

Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Key assumption of any DD strategy: Parallel trends

The key assumption for any DD strategy is that the outcome


in treatment and control group would follow the same time
trend in the absence of the treatment
This doesn’t mean that they have to have the same mean of
the outcome
But regardless of whether parallel trends holds, OLS always estimates the vertical distance shown on the next slide
Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Losing parallel trends

If parallel trends doesn’t hold, then ATT is not identified


But, regardless of whether ATT is identified, OLS always
estimates the same thing
That’s because OLS uses the slope of the control group to
estimate the DD parameter, which is only unbiased if that
slope is the correct counterfactual for the treatment group
Figure: DD regression diagram without parallel trends - observed NJ, observed PA, and counterfactual NJ labor supply between February and November, with the gap between δ^{OLS} and δ^{ATT} marked


Compositional differences violate parallel trends

One of the risks of a repeated cross-section is that the


composition of the sample may have changed between the pre
and post period
Hong (2011) uses repeated cross-sectional data from the
Consumer Expenditure Survey (CEX) containing music
expenditure and internet use for a random sample of
households
Study exploits the emergence of Napster (the first file sharing software widely used by Internet users) in June 1999 as a natural experiment
Study compares internet users and internet non-users before
and after emergence of Napster
Compositional differences?

Figure: Internet Diffusion and Average Quarterly Music Expenditure in the CEX (Hong 2011) - average music expenditure (in 1998 dollars) for the Internet user and non-user groups, 1996-2001, plotted alongside the share of households with an Internet connection
Table: Descriptive Statistics for Internet User and Non-user Groups (selected rows; statistics weighted using CEX weights)

                      1997                1998                1999
                  User     Non-user   User     Non-user   User     Non-user
Recorded Music    $25.73   $10.90     $24.18   $9.97      $20.92   $9.37
Age               40.2     49.0       42.3     49.0       44.1     49.4
Income            $52,887  $30,459    $51,995  $28,169    $49,970  $26,649
College Grad.     .43      .21        .45      .21        .42      .20
Households (mil.) 15       91         22       86         28       80

Diffusion of the Internet changes samples (e.g., younger music fans are early adopters)
Parallel leads, not trends

The identifying assumption for all DD designs is some


representation of a counterfactual parallel trend
Parallel trends cannot be directly verified because technically
one of the parallel trends is an unobserved counterfactual
But one often will check using pre-treatment data to show
that the trends had been the same prior to treatment
But, even if pre-trends are the same one still has to worry
about other policies changing at the same time (omitted
variable bias)

Plot the raw data when there’s only two groups
Differential timing makes pre-treatment undefined for
untreated groups

New Jersey treated in late 1992, New York in late 1993,


Pennsylvania never treated
Pre-treatment:
New Jersey: <1992
New York: <1993
Pennsylvania: undefined
So how do we check parallel leads?
Randomize treatment dates to control units

Figure: Anderson, et al. (2013) display of raw traffic fatality rates for
re-centered treatment states and control states with randomized
treatment dates
Randomizing arbitrary treatment dates to control counties can be misleading

Figure: Average birth rates per 1,000 around the (re-centered) treatment date, Craigslist counties vs. non-Craigslist counties, where control counties were assigned randomized treatment dates (from one of my studies). Looks decent, right?


Event study regression

Including leads into the DD model is an easy way to analyze


pre-treatment trends
Lags can be included to analyze whether the treatment effect
changes over time after assignment
The estimated regression would be:

$$Y_{its} = \gamma_s + \lambda_t + \sum_{\tau=-q}^{-1} \gamma_\tau D_{s\tau} + \sum_{\tau=0}^{m} \delta_\tau D_{s\tau} + x_{ist} + \varepsilon_{ist}$$

Treatment occurs in year 0


Includes q leads or anticipatory effects
Includes m lags or post treatment effects
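A minimal Stata sketch of building these event-time dummies, assuming a state-year panel with hypothetical variables y, state, year, and treat_year (missing for never-treated states); in practice you would also bin or drop event times outside the chosen window:

gen rel = year - treat_year              // event time; missing for never-treated states
forvalues k = 2/5 {
    gen lead`k' = (rel == -`k')          // anticipatory (lead) dummies; t = -1 is the omitted reference
}
forvalues k = 0/5 {
    gen lag`k' = (rel == `k')            // treatment-year and post-treatment (lag) dummies
}
areg y lead2-lead5 lag0-lag5 i.year, absorb(state) vce(cluster state)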
Figure: Event study of birth rates per 15-44yo per 1,000 by months relative to CL entry; DD Coefficient = -0.18 (s.e. = 0.02)
Same data as a couple slides ago, leads don’t look good


Medicaid and Affordable Care Act example

Miller, et al. (2019) examine a rollout of Medicaid under the


Affordable Care Act
They link large-scale survey data with administrative death
records
9.3 percent reduction in annual mortality caused by Medicaid expansion
Driven by a reduction in disease-related deaths which grows
over time
Figure: Miller, et al. (2019) estimates of Medicaid expansion's effects on annual mortality
Standard errors in DD strategies

Many paper using DD strategies use data from many years –


not just 1 pre and 1 post period
The variables of interest in many of these setups only vary at a
group level (say a state level) and outcome variables are often
serially correlated
As Bertrand, Duflo and Mullainathan (2004) point out,
conventional standard errors often severely understate the
standard deviation of the estimators – standard errors are
biased downward (i.e., too small, over reject)
Standard errors in DD – practical solutions

Bertrand, Duflo and Mullainathan propose the following


solutions:
1 Block bootstrapping standard errors (if you analyze states the
block should be the states and you would sample whole states
with replacement for bootstrapping)
2 Clustering standard errors at the group level (in Stata one
would simply add , cluster(state) to the regression
equation if one analyzes state level variation)
3 Aggregating the data into one pre and one post period. This literally works if there is only one treatment date. With staggered treatment dates one should adopt the following procedure:
Regress Yst onto state FE, year FE and relevant covariates
Obtain residuals from the treatment states only and divide
them into 2 groups: pre and post treatment
Then regress the two groups of residuals onto a post dummy
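As a minimal sketch of the third fix, assuming a state-year panel with hypothetical variables y, covariates x1 and x2, identifiers state and year, an ever-treated dummy treat_state, and a post dummy defined relative to each state's treatment date:

areg y x1 x2 i.year, absorb(state)     // step 1: state FE, year FE, covariates
predict ehat, residuals                // residuals net of the fixed effects
keep if treat_state == 1               // step 2: residuals from treatment states only
collapse (mean) ehat, by(state post)   // one pre and one post average per state
reg ehat post, vce(cluster state)      // step 3: regress the residual averages on post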
Note about groups

Correct treatment of standard errors sometimes makes the


number of groups very small: in the Card and Krueger study
the number of groups is only 2.
DD Robustness

Very common for readers and others to request a variety of


“robustness checks” from a DD design
Think of these as along the same lines as the leads and lags
we already discussed
Event study (already discussed)
Falsification test using data for alternative control group
Falsification test using alternative “placebo” outcome that
should not be affected by the treatment
Within group controls - triple diff

Table: Differences-in-differences-in-differences

States  Group                  Period   Outcomes                 D1                  D2              D

NJ      Low wage employment    After    NJ + T + NJt + lt + D    T + NJt + lt + D
                               Before   NJ                                           D + lt − st
        High wage employment   After    NJ + T + NJt + st        T + NJt + st
                               Before   NJ                                                           D
PA      Low wage employment    After    PA + T + PAt + lt        T + PAt + lt
                               Before   PA                                           lt − st
        High wage employment   After    PA + T + PAt + st        T + PAt + st
                               Before   PA
Difference-in-Differences: Threats to Validity
DDD Example by Gruber
Triple DDD: Mandated Maternity Benefits (Gruber, 1994)
DDD in Regression

$$Y_{ijt} = \alpha + \beta_1 X_{ijt} + \beta_2 \tau_t + \beta_3 \delta_j + \beta_4 D_i + \beta_5 (\delta \times \tau)_{jt} + \beta_6 (\tau \times D)_{ti} + \beta_7 (\delta \times D)_{ij} + \beta_8 (\delta \times \tau \times D)_{ijt} + \varepsilon_{ijt}$$

The DDD estimate is the difference between the DD of


interest and a placebo DD (which is supposed to be zero)
If the placebo DD is non-zero, it might be difficult to convince
the reviewer that the DDD removed all the bias
If the placebo DD is zero, then DD and DDD give the same
results but DD is preferable because standard errors are
smaller for DD than DDD
But now you have multiple parallel trends assumptions - the control-state trends must be a good counterfactual for the treatment states, and the within-state placebo group trends must be a good counterfactual for the within-state treatment group
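A minimal Stata sketch of this DDD regression, with hypothetical variables y, treat (treated state), post, group (the within-state group the policy targets), and state for clustering; the DDD estimate is the coefficient on the triple interaction:

reg y i.treat##i.post##i.group, vce(cluster state)

The ## operator expands all of the lower-order dummies and two-way interactions in the equation above, so only the triple interaction needs to be read off.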
Implementing DDD

Have to get the structure of the data correct because now you
have (1) before and after, (2) treatment and control states,
and (3) within state placebo
I give an example in my Mixtape (p. 278) looking at abortion
legalization’s effect on longterm risky sexual behavior,
including do file
Let’s review first the paper, then work through the exercise
itself using data.
Figure: Longrun effects of abortion legalization on Risky Sex
Motivation

Legalization caused teen childbearing to fall by 12% (Levine


2004)
Gruber, et al. (1999) showed that the marginal child would
have been 60% more likely to live in a single-parent household,
50% more likely to live in poverty, and 45% more likely to be a
recipient of public services
Mechanism was believed to be non-random selection
associated with high risk conditions
Emerging influence

Donohue and Levitt (2001) linked abortion legalization to


declining crime in the 1990s, one of several reasons given for
his John Bates Clark award
Freakonomics popularizes the sensational theory
Other papers followed like Charles and Stephens (2006) who
find that children exposed in utero to legalization were less
likely to use illegal substances
Controversy

Triple diff by Joyce finds no evidence for it when using an


(arbitrary) cutoff of the median abortion rate within early
repeal treatment states
Foote and Goetz (2008) argue the abortion ratio was
constructed incorrectly, and report a coding error leaving out
state-year fixed effects; construction problem destroys results,
state-year fixed effects somewhat attenuates
Literature stops and theory is ignored
In defense of Steve Levitt

I want to remind people though: we only know about the coding error because Levitt posted his do files and gave them to anyone who asked (it is very easy to "lose do files")
Levitt had and has oodles of scientific integrity for his
willingness to cooperate; not always the case
“If abortion lowers homicide rates by 20 – 30%, then it is
likely to have affected an entire spectrum of outcomes
associated with well-being: infant health, child
development, schooling, earnings and marital status.
Similarly, the policy implications are broader than
abortion. Other interventions that affect fertility control
and that lead to fewer unwanted births – contraception or
sexual abstinence – have huge potential payoffs. In short,
a causal relationship between legalized abortion and crime
has such significant ramifications for social policy and at
the same time is so controversial, that further assessment
of the identifying assumptions and their robustness to
alternative strategies is warranted.” Ted Joyce in his triple
diff paper
Figure: Light bending around the sun, predicted by Einstein, and confirmed in a natural experiment involving an eclipse. Artwork by Seth Hahne ©
In defense of falsifiable predictions

Theories which make falsifiable predictions (comparative


statics) are more convincing of causal effects than simpler
reduced form studies
Great paper by Coleman (2019) on Snow's rhetoric in his 1849 essay and his 1855 book on cholera - he mounts different kinds of data to make his argument, some of which is of this nature
Those predictions are threefold:
Where we should find effects
Where we should not find effects
The kind of effects we should find
If all three are met, an identified causal effect becomes
epistemologically more credible
Falsifiable predictions contained in a diff-in-diff

Figure: Group-time differential exposure predicts a temporary parabolic


ATT
Figure: Raw data for repeal and Roe states.
Estimating equation

$$Y_{st} = \beta_1 Repeal_s + \beta_2 DT_t + \beta_{3t} Repeal_s \times DT_t + X_{st}\psi + \alpha_s DS_s + \gamma_1 t + \gamma_{2s} \times t + \varepsilon_{st}$$
Estimated effect of abortion legalization on gonorrhea
Black females 15-19 year-olds

Whisker plots are estimated coefficients of DD estimates

Figure: Differences in black female gonorrhea incidence between repeal


and Roe cohorts.
Assuaging doubt

Maybe spurious - something happened in those years, but


what?
Crack epidemic maybe? But we control for the crack index by
Fryer, et al.
Maybe something else - let’s try a within-state control group
(the older cohort)
DDD Equation

$$Y_{ast} = \beta_1 Repeal_s + \beta_2 DT_t + \beta_3 DA + \beta_{4t} Repeal_s \cdot DT_t + \beta_5 Repeal_s \cdot DA + \beta_{6t} DA \cdot DT_t + \beta_{7t} Repeal_s \cdot DA \cdot DT_t + X_{st}\xi + \alpha_{1s} DS_s + \alpha_{2s} DS_s \cdot DA + \gamma_1 t + \gamma_{2s} DS_s \cdot t + \gamma_3 DA \cdot t + \gamma_{4s} DS_s \cdot DA \cdot t + \varepsilon_{ast}$$

One will be dropped, but I want to focus your attention on the


number of interactions needed to identify DDD parameters
Stacking Structure
DDD Results

Estimated effect of abortion legalization on gonorrhea


Black females 15-19 year-olds vs Black females 25-29 year-olds
Whisker plots are the Repeal × 15-19yo × year estimated DDD coefficients, by year (1986-2000)
My original conclusions

Model made narrow predictions of a parabola within a given


window but only for the treatment cohort
Amazingly we actually found that very shape in the DD – did
we vindicate Gruber, et al. and Donohue and Levitt then?
Also used older group as within-state controls in a DDD, and
still found the parabola, though not as great a look as DD
which is a bit of a red flag
Paper also illustrates the usefulness of having a specific
theoretical prediction. Limits the number of competing
hypotheses (Popperian type of reasoning).
But was I done? Look back at the table
Going beyond Cornwell and Cunningham (2013)

Figure: Second theoretical prediction - this time for 20-24 year olds
Estimated effect of abortion legalization on gonorrhea
Black females 20-24 year-olds
Whisker plots are estimated coefficients of DD estimates
Second prediction fails second DD model

Ugh. lo tov (Hebrew to English: not good)


Well, maybe DDD will look better?
Estimated effect of abortion legalization on gonorrhea
Black females 20-24 year-olds vs Black females 25-29 year-olds

Whisker plots are estimated DDD coefficients
Second prediction fails DDD too

Notice that when we exploited just one testable prediction, we


found evidence
But when we exploit all of the testable predictions, the results
fall apart, suggesting original DD was spurious
Imagine for a moment, though – what if we had seen the
group-time ATT moving with the cohort as they aged?
Other alternative is the repeal-Roe effects dissipate by early to
late 20s, but what does Ockham’s Razor say is the more
credible explanation?
Perhaps the Gruber, et al. (1999) and Donohue and Levitt
(2001) hypothesis was always spurious
Stata replication

Let’s replicate this using the abortion.do file. Pay close attention to
the stacking of the data by group-state, not just state, and the
exact way in which the interactions must therefore be constructed
Falsification test with alternative outcome

The within-group control group (DDD) is a form of placebo


analysis using the same outcome
But there are also placebos using a different outcome – but
you need a hypothesis of mechanisms to figure out what is in
fact a different outcome
Figure out what those are, and test them – finding no effect
raises the epistemological credibility of the first result,
interestingly
Cheng and Hoekstra (2013) examine the effect of castle
doctrine gun laws on non-gun related offenses like grand theft
auto and find no evidence of an effect
Rational addiction as a placebo critique

Sometimes, an empirical literature may be criticized using nothing


more than placebo analysis
“A majority of [our] respondents believe the literature is a
success story that demonstrates the power of economic
reasoning. At the same time, they also believe the
empirical evidence is weak, and they disagree both on the
type of evidence that would validate the theory and the
policy implications. Taken together, this points to an
interesting gap. On the one hand, most of the
respondents claim that the theory has valuable real world
implications. On the other hand, they do not believe the
theory has received empirical support.”
Placebo as critique of empirical rational addiction

Auld and Grootendorst (2004) estimated standard “rational


addiction” models (Becker and Murphy 1988) on data with
milk, eggs, oranges and apples.
They find these plausibly non-addictive goods are addictive,
which casts doubt on the empirical rational addiction models.
Placebo as critique of peer effects

Several studies found evidence for “peer effects” involving


inter-peer transmission of smoking, alcohol use and happiness
tendencies
Christakis and Fowler (2007) found significant network effects
on outcomes like obesity
Cohen-Cole and Fletcher (2008) use similar models and data
and find similar network “effects” for things that aren’t
contagious like acne, height and headaches
Ockham’s razor - given social interaction endogeneity (Manski
1993), homophily more likely explanation
State federalism and differential timing

We’ve been considering situations where treatment occurs in


one area for the most part
But the modal situation is when there is differential timing
This happens in America usually because each area (state,
municipality) will adopt a policy whenever they want to, which
creates tendencies for roll out to occur
Example might be the minimum wage though we will look at
others
Summary

Cheng and Hoekstra (2013) are interested in whether


expansions to “castle doctrine statutes” at the state level
increase or decrease gun violence.
Prior to these expansions, English common law principle
required “duty to retreat” before using lethal force against an
assailant except when the assailant is an intruder in the home
The home is one’s “castle” – hence, “castle doctrine”
When intruders threatened the victim in the home, the duty to
retreat was waived and lethal force in self-defense was allowed
Castle doctrine law explained

In 2005, Florida passed a law that expanded self-defense


protections beyond the house
From 2000 to 2010, 21 states explicitly put "castle doctrine" into statute, and (more importantly) extended it to places outside the home
In other words, 21 states removed the duty to retreat in
specified circumstances
Other changes:
Presumption of reasonable fear is added
Civil liability for those acting under the law is removed
Economic theory predicts more lethal homicides

Workers supply legal or illegal labor and are therefore


responsive to costs and benefits
Castle doctrine expansions lowered the (expected) cost of
killing someone in self-defense
If people are rational, then lowering the price of lethal
self-defense should increase lethal homicides
Economic theory also predicts less crime from deterrence

Although deterrence is a theoretical possibility, note that the goal of the laws was to protect and enhance victims' rights, not deter crime
Testable prediction with data and same design
Treatment passage

Summary:
21 states passed laws removing “duty to retreat” in places
outside the home
17 states removed “duty to retreat” in any place one had a
legal right to be
13 states include a presumption of reasonable fear
18 states remove civil liability when force was justified under
law
Cheng and Hoekstra’s identification strategy

Panel fixed effects estimation

$$Y_{it} = \beta_1 D_i + \beta_2 T_t + \beta_3 CDL_{it} + \alpha_1 X_{it} + c_i + u_t + \varepsilon_{it}$$

CDL is a fraction between 0 and 1 depending on the percent


of the year the state has a castle doctrine law
Preferred specifications include "region-by-year fixed effects"
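A minimal Stata sketch of that specification, assuming hypothetical variables l_homicide, cdl, sid (state), year, region, and time-varying controls x1 and x2 (Stata drops whatever year dummies become collinear with the region-by-year cells):

areg l_homicide cdl x1 x2 i.year i.region#i.year, absorb(sid) vce(cluster sid)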
Data

FBI Uniform Crime Reports Part 1 Offenses (2000-2010)


State-level crime rates, or “offenses per 100,000 population”
Falsification outcomes: motor vehicle theft and larceny
Dataset on justifiable homicides by private citizens
Outcomes (in order)

Deterrence and homicide outcomes:


1 Burglary: the unlawful entry of a structure to commit a felony
or a theft
2 Robbery: the taking or attempting to take anything of value from the care, custody or control of a person or persons by force or threat of force or violence and/or putting the victim in fear
3 Aggravated assault: unlawful attack by one person upon
another for the purpose of inflicting severe or aggravated
bodily injury
Homicide categories
1 Total homicides – murder plus non-negligent manslaughter
(∼14,000 per year)
2 Justifiable homicides by private citizens (∼250/year)
Inference: Clustering

Statistical inference: cluster standard errors at the state level


Are the disturbances random draws from an independent and identically distributed process?
It’s likely that within a state, unobserved determinants of
crime are serially correlated
They follow Bertand, Duflo and Mullainathan (2004) and
adjust for serial correlation in unobserved disturbances within
states at the level of the treatment
Inference: Fisher’s sharp null

How likely is it that we estimate effects of this magnitude


when using randomly chosen pre-treatment time periods and
randomly assigning placebo treatments?
Randomizes dates within-state for the pre-treatment period
(<2000)
Randomization inference and exact p-values
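A minimal sketch of this kind of placebo-date randomization in Stata - one simple variant of the idea, not the authors' exact procedure - assuming a state-year panel of pre-treatment years with hypothetical variables y, sid, year, and an ever_treated dummy:

set seed 12345
tempname sims
postfile `sims' b_placebo using placebo_estimates.dta, replace
forvalues i = 1/500 {
    preserve
    bysort sid (year): gen double u = runiform() if _n == 1
    bysort sid (year): replace u = u[1]
    gen placebo_year = 1985 + floor(10 * u)            // random placebo adoption year
    gen placebo_cdl  = ever_treated * (year >= placebo_year)
    areg y placebo_cdl i.year, absorb(sid) vce(cluster sid)
    post `sims' (_b[placebo_cdl])
    restore
}
postclose `sims'
* the exact p-value is the share of placebo estimates at least as large as the actual estimate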
Region-by-year fixed effects

Absent passing castle doctrine laws, outcomes in these 21


states would have changed similar to other states in their same
region
Recall the “region-by-year fixed effects” in the X term
By including “region-by-year fixed effects”, they are arguing
that unobserved changes in crime are running “parallel” to the
treatment states within region over time
Need not hold across regions since the across region variation
is not being used in this analysis due to the saturation of the
model with “region-by-year fixed effects”
State specific time trends

Alabama, et al. dummy interacted with TREND which equals


1 in 2000, 2 in 2001, . . . , 11 in 2010
Forces the identification to come from variation in outcomes
around the state-specific linear trend
Outcomes must be large enough and different enough from a
state-specific linear trend otherwise it is collinear with the
state-trend
Same argument applies to any control though
Goodman-Bacon (2019) suggests group-trends are less taxing
and satisfying than unit-specific trends
Control variables

Controls (X matrix in earlier equation)


Full-time police employment per 100,000 state residents from the LEOKA data (FBI data)
Persons incarcerated in state prison per 100,000 residents
Shares of white/black men in 15-24 and 25-44 age groups
State per capita spending on public assistance
State per capita spending on public welfare
Parallel Leads

Look at each set of treatment states against never-treated


figure by figure (rare)
Use a one-period lead in the regression model (not as common)
I’m going to look at event study coefficients (most common)
Step one: Falsification test

Policy-makers are not just randomly flipping coins when


passing laws, but presumably do so because of things they
observe on the ground
Address concerns up front this isn’t driven by spurious crime
results
Cheng and Hoekstra (2013) present falsification of larceny and
motor vehicle theft first, then results
Step one (cont.)

Results will be presented separately under six different


specifications
Each new specification adds more controls
Pop quiz: What should you expect to find on key variables of
interest when conducting a falsification and why?
Answer

No statistically significant association between the CDL


passage and the placebos; preferably precise zeroes
No association on the one-year lead either
Basically, you should not find effects where there are no
theoretical policy effects; gun laws shouldn’t affect non-violent
offenses
Step one (cont.)

How do you interpret coefficients?


His model is “log outcomes” regressed onto a dummy variable
(level), so these are semi-elasticities and approximate
percentage changes – but you should transform them by taking
the exponential of each coefficient and then differencing it
from one to find the actual percentage change
Ex: CDL = -0.0137 (column 12, Table 3, "Log (larceny rate)" outcome). exp(-0.0137) = 0.986, and so 1 - 0.986 = 0.014, i.e., 1.4 percent. Thus, CDL reduced larceny rates by about 1.4 percent, which is not statistically significant.
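For example, the transformation of that coefficient is just:

display (exp(-0.0137) - 1) * 100    // about -1.4, i.e., roughly a 1.4 percent decline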
Results – Falsification Exercise

Table 3: Placebo Tests


OLS - Unweighted
7 8 9 10 11 12
Panel A: Larceny Log (Larceny Rate)
Castle Doctrine Law 0.00745 0.00145 -0.00188 -0.00445 -0.00361 -0.0137
(0.0227) (0.0205) (0.0210) (0.0226) (0.0201) (0.0228)

One Year Before Adoption of -0.0103


Castle Doctrine Law (0.0114)

Observation 550 550 550 550 550 550


Panel B: Motor Vehicle Theft Log (Motor Vehicle Theft Rate)

Castle Doctrine Law 0.0767* 0.0138 0.00814 0.00775 0.00977 -0.00373


(0.0413) (0.0444) (0.0407) (0.0462) (0.0391) (0.0361)

One Year Before Adoption of -0.00155


Castle Doctrine Law (0.0287)

Observation 550 550 550 550 550 550


State and Year Fixed Effects Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes
Controls for Larceny or Motor Theft Yes
State-Specific Linear Time Trends Yes
Notes: Each column in each panel represents a separate regression. The unit of observation is state-year. Robust standard errors are
clustered at the state level. Time-varying controls include policing and incarceration rates, welfare and public assistance spending, median
income, poverty rate, unemployment rate, and demographics.
Step two: testing the deterrence hypothesis

Having found no effect on their placebos, Cheng and Hoekstra


(2013) examine the effect of CDL on three deterrence
outcomes: burglary, robbery and aggravated assault
They will, again, have six specifications per outcome in the
“weighted” regression, and then another five for the
“unweighted” regression
Pop quiz: What does deterrence look like?
Answer

Negative signs on the CDL variable is consistent with


deterrence – these crimes were “deterred”, in other words
Based on early work by Becker (1968) and 1970s work by his
student Isaac Ehrlich; higher probabilities of getting hurt in
public may cause offenders to avoid violence in public
altogether
Bounds on the magnitudes from the standard errors are used
to provide some confidence about the estimates as well
Results – Deterrence

OLS - Weighted by State Population OLS - Unweighted


1 2 3 4 5 6 7 8 9 10 11 12
Panel A: Burglary Log (Burglary Rate) Log (Burglary Rate)
Castle Doctrine Law 0.0780*** 0.0290 0.0223 0.0164 0.0327* 0.0237 0.0572** 0.00961 0.00663 0.00277 0.00683 0.0207
(0.0255) (0.0236) (0.0223) (0.0247) (0.0165) (0.0207) (0.0272) (0.0291) (0.0268) (0.0304) (0.0222) (0.0259)

One Year Before Adoption of -0.0201 -0.0154


Castle Doctrine Law (0.0139) (0.0214)
Panel B: Robbery Log (Robbery Rate) Log (Robbery Rate)
Castle Doctrine Law 0.0408 0.0344 0.0262 0.0216 0.0376** 0.0515* 0.0448 0.0320 0.00839 0.00552 0.00874 0.0267
(0.0254) (0.0224) (0.0229) (0.0246) (0.0181) (0.0274) (0.0331) (0.0421) (0.0387) (0.0437) (0.0339) (0.0299)

One Year Before Adoption of -0.0156 -0.0115


Castle Doctrine Law (0.0167) (0.0283)
Panel C: Aggravated Assault Log (Aggravated Assault Rate) Log (Aggravated Assault Rate)
Castle Doctrine Law 0.0434 0.0397 0.0372 0.0362 0.0424 0.0414 0.0555 0.0698 0.0343 0.0305 0.0341 0.0317
(0.0387) (0.0407) (0.0319) (0.0349) (0.0291) (0.0285) (0.0604) (0.0630) (0.0433) (0.0478) (0.0405) (0.0380)

One Year Before Adoption of -0.00343 -0.0150


Castle Doctrine Law (0.0161) (0.0251)
Observations 550 550 550 550 550 550 550 550 550 550 550 550
State and Year Fixed Effects Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes Yes Yes Yes Yes
Contemporaneous Crime Rates Yes Yes
State-Specific Linear Time Trends Yes Yes
Conclusion

“In short, these estimates provide strong evidence against the


possibility that castle doctrine laws cause economically
meaningful deterrence effects” (p. 17)
Translation: They can’t find evidence of large deterrence
effects
“Thus, while castle doctrine law may well have benefits to
those legally justified in protecting themselves in self-defense,
there is no evidence that the law provides positive spillovers by
deterring crime more generally” (p. 17)
They note in footnote 24 that they cannot measure the
benefits to victims whose crimes were deterred, or the benefits
from lower legal costs; their focus is limited to whether it
deterred the crimes, not whether the net benefits from the
laws were positive
Obviously, if there is no deterrence, though, then the net
benefits are lower from CDL than they would be if they did
deter
Step 3: Homicides

The key finding in this study focuses on CDL and its effect on
homicides and non-negligent manslaughter
Pop quiz: what should the sign on CDL be here?
Answer

Effects should be positive


Cheng and Hoekstra want to show the raw data, but have
differential timing
Differential timing means you can’t show pre-treatment raw
data for the never-treated groups
So they show it one by one – which isn’t the most aesthetically
pleasing way to do it, but which has the benefit of being
transparent
Parallel pre-treatment trends

Keep your eyes on whether pre-treatment trends are parallel


for treatment and control groups
Remember, though – they need parallel trends within region – these figures don't show that
But starting with pictures and raw data has value
Log Homicide Rates – 2005 Adopter = Florida


Treatment: Florida (law enacted in October 2005)


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2006 Adopter (13 states)


Treatment: States that enacted the law in 2006


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2007 Adopter (4 states)


Treatment: States that enacted the law in 2007


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2008 Adopter (2 states)


Treatment: States that enacted the law in 2008


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2009 Adopter = Montana


Treatment: State that enacted the law in 2009 (Montana)


Control: States that did not enact a law 2000 - 2010
Modeling

They use a class of estimators more appropriate for "counts", called count models, like the negative binomial estimated with maximum likelihood
Results are robust to least squares and count models
Homicide – Negative Binomial; Murder – OLS

1 2 3 4 5 6
Panel C: Homicide (Negative Binomial - Unweighted)

Castle Doctrine Law 0.0565* 0.0734** 0.0879*** 0.0783** 0.0937*** 0.108***


(0.0331) (0.0305) (0.0313) (0.0355) (0.0302) (0.0346)

One Year Before Adoption of Castle Doctrine -0.0352


Law (0.0260)

Observations 550 550 550 550 550 550


Panel D: Log Murder Rate (OLS - Weighted)

Castle Doctrine Law 0.0906** 0.0955** 0.0916** 0.0884** 0.0981** 0.0813


(0.0424) (0.0389) (0.0382) (0.0404) (0.0391) (0.0520)
One Year Before Adoption of Castle Doctrine -0.0110
Law (0.0230)
Observations 550 550 550 550 550 550

State and Year Fixed Effects Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes
Contemporaneous Crime Rates Yes
State-Specific Linear Time Trends Yes
Fisher sharp null

Move the 11-year panel back one year at a time (covering


1960-2009) and estimate 40 placebo “effects” of passing CDL 1 to
40 years earlier

Method               Average estimate   Estimates larger than actual estimate
Weighted OLS         -0.003             0/40
Unweighted OLS       0.001              1/40
Negative binomial    0.001              0/40
My replication using event study plots

Log Murder Rate: event study point estimates for leads 9 through 1 and lags 1 through 5
Figure: Homicide event study plots using coefplot
Log Murders by years before and after castle doctrine expansion; DD Coefficient = 0.08 (s.e. = 0.03)
Figure: Homicide event study plots using twoway


Log Murders by years before and after castle doctrine expansion; DD Coefficient = 0.08 (s.e. = 0.03)

Figure: Homicide event study plots using twoway and force early leads
into one coefficient
Log Murders by years before and after castle doctrine expansion; DD Coefficient = 0.09 (s.e. = 0.03)

Figure: Homicide event study plots using twoway dropping imbalanced


states
Interpretation

No evidence that Castle Doctrine/Stand Your Ground Laws


deter violent crimes such as burglary, robbery and aggravated
assault
These laws do lead to an 8% net increase in homicide rates,
translating to around 600 additional homicides per year across
the 21 adopting states
Unlikely that all of the additional homicides were legally
justified
Incentives matter in some contexts (lethal force) but not
others (deterrence)
Where to from here?

Now that we’ve reviewed the twoway fixed effects with


treatment that differed across time, how does this more
general form of “differential timing” compare with the 2x2 DD
that we reviewed?
Complicated derivation, but simple interpretation - twoway
fixed effects with differential timing estimates a weighted
average of all 2x2
Andrew Goodman-Bacon (2018; 2019) and Callaway and Sant'Anna (2019)
I will be making the argument that under certain modal
situations, the twoway fixed effects model has major problems,
even fatal ones, due to biases even when parallel trends
plausibly holds
Reminder of 2x2 DD

To understand differential timing, we need to remind ourselves 2x2


form

   
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$

Post to pre difference for treatment group compared to the post to


pre difference for never treated
Different treatment dates by panel unit

$$\underbrace{y_{it} = \beta D_i + \tau Post_t + \delta (D_i \times Post_t) + X_{it} + \alpha_i + \alpha_t + \varepsilon_{it}}_{\text{2x2 DD}}$$

$$\underbrace{y_{it} = \delta D_{it} + X_{it} + \alpha_i + \alpha_t + \epsilon_{it}}_{\text{Twoway FE}}$$

We know a lot about the 2x2, but what about the twoway fixed effects estimator when it comes to DD designs?
Decomposition Preview

Linear panel models estimate a treatment parameter that is a


weighted average over all 2x2 in your sample
The estimator is a weighted average of all potential δ 2×2 in
which treated units act as both controls and treatment
depending on the situation
Weights are function of sample sizes of each “group” and the
variance of the treatment dummies for the groups
Decomposition (cont.)

Under the assumptions of variance weighted common trends


(VWCT) and time invariant treatment effects, the estimator
called the variance weighted ATT is a weighted average of all
possible ATTs
Under more restrictive assumptions it perfectly matches the
ATT
Time varying treatment effects generate a bias that needs to
be accounted for
3 Group Example

Suppose two treatment groups (k,l) and one untreated group


(u)
k,l define the groups based on when they receive treatment
(differently in time) with k receiving it later than l
Denote $\bar{D}_k$ as the share of time each group spends in treatment status
Denote $\hat{\delta}^{2x2,j}_{ab}$ as the canonical 2x2 DD estimator for groups a and b, where j is the treatment group
So what are the possible 2 × 2 combinations?
How many 2x2?

A lot!
When there’s three groups - a never treated (U), an early
treated (k) and a late treated (l), there are four 2x2s
But typically, we have more than 3 groups making the number
of potential 2x2 even larger
With K timing groups and one untreated group, there are K² distinct 2x2 DDs
K² distinct DDs

Assume 3 timing groups (a, b and c) and one untreated group (U).
Then there should be 9 2x2 DDs. Here they are:

a to b b to a c to a
a to c b to c c to b
a to U b to U c to U
Simple example with 3 groups

We’ll stick with two groups, k and l, who will get the treatment
at tk∗ and tl∗ , and the third group U will never get treated
The earlier period before anyone is treated is “pre”, the period
between k and l treatment is “mid”, and the period after l is
treated is “post”
Three important 2x2 DDs

  
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$
$$\hat{\delta}^{2x2}_{kl} = \Big(\bar{y}^{mid(k,l)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{mid(k,l)}_l - \bar{y}^{pre(k)}_l\Big)$$
$$\hat{\delta}^{2x2}_{lk} = \Big(\bar{y}^{post(l)}_l - \bar{y}^{mid(k,l)}_l\Big) - \Big(\bar{y}^{post(l)}_k - \bar{y}^{mid(k,l)}_k\Big)$$

where the first 2x2 is any timing group compared to untreated, the
second is a group compared to yet-to-be-treated timing group, and
the last is the eventually-treated compared to the already-treated
controls.
   
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$
$$\hat{\delta}^{2x2}_{lU} = \Big(\bar{y}^{post(l)}_l - \bar{y}^{pre(l)}_l\Big) - \Big(\bar{y}^{post(l)}_U - \bar{y}^{pre(l)}_U\Big)$$
$$\hat{\delta}^{2x2,k}_{kl} = \Big(\bar{y}^{MID(k,l)}_k - \bar{y}^{PRE(k,l)}_k\Big) - \Big(\bar{y}^{MID(k,l)}_l - \bar{y}^{PRE(k,l)}_l\Big)$$
$$\hat{\delta}^{2x2,l}_{lk} = \Big(\bar{y}^{POST(k,l)}_l - \bar{y}^{MID(k,l)}_l\Big) - \Big(\bar{y}^{POST(k,l)}_k - \bar{y}^{MID(k,l)}_k\Big)$$
Second, what makes up the DD estimator?

The least squares estimate yields a weighted combination of each


groups’ respective 2x2 (of which there are 4 in this example)
$$\hat{\delta}^{DD} = \sum_{k \neq U} s_{kU}\, \hat{\delta}^{2x2}_{kU} + \sum_{k \neq U}\sum_{l > k} s_{kl}\Big[\mu_{kl}\, \hat{\delta}^{2x2,k}_{kl} + (1 - \mu_{kl})\, \hat{\delta}^{2x2,l}_{lk}\Big]$$

where that first 2x2 is the k compared to U and the l compared to


U (combined to make the equation shorter)
Third, the Weights

$$s_{kU} = \frac{n_k n_U \bar{D}_k (1 - \bar{D}_k)}{\widehat{Var}(\tilde{D}_{it})}$$
$$s_{kl} = \frac{n_k n_l (\bar{D}_k - \bar{D}_l)\big(1 - (\bar{D}_k - \bar{D}_l)\big)}{\widehat{Var}(\tilde{D}_{it})}$$
$$\mu_{kl} = \frac{1 - \bar{D}_k}{1 - (\bar{D}_k - \bar{D}_l)}$$

where the n terms are group sample sizes, the $\bar{D}_k(1-\bar{D}_k)$ and $(\bar{D}_k - \bar{D}_l)(1-(\bar{D}_k - \bar{D}_l))$ expressions refer to the variance of treatment, and $\mu_{kl}$ splits the weight between the two 2x2s formed by a pair of timing groups.
Weights discussion

Two things pop out of these weights


"Group" variation matters more than unit-level variation. A group is, for example, the set of states treated in 1995 - they are the 1995 group. The more units in a group, the bigger that 2x2's practical weight.
Within-group treatment variance matters a lot.
Think about what causes the treatment variance to be as big
as possible. Let’s think about the sku weights.
1 D = 0.1. Then 0.1 × 0.9 = 0.09
2 D = 0.4. Then 0.4 × 0.6 = 0.24
3 D = 0.5. Then 0.5 × 0.5 = 0.25
What’s this mean? The weight on treatment variance is
maximized for groups treated in middle of the panel
More weights discussion

But what about the "treated on treated" weights? What's this $\bar{D}_k - \bar{D}_l$ business about?
Well, same principle as before - when the difference in treatment shares is close to 0.5, those 2x2s are given the greatest weight
For instance, say $t^*_k = 0.15$ and $t^*_l = 0.67$. Then $\bar{D}_k - \bar{D}_l = 0.52$, and thus $0.52 \times 0.48 = 0.2496$.
TWFE and centralities

Groups in the middle of the panel weight up their respective


2x2s via the variance weighting
But when looking at treated-to-treated comparisons, when differences in timing have a spacing of around 1/2, those also weight up the respective 2x2s via variance weighting
But there's no theoretical reason why we should prefer this, as it's just a weighting procedure determined by how we drew the panel
This is the first thing about TWFE that should give us pause,
as not all estimators do this
Potential outcomes

Previous just showed that DD was based on a weighted


“adding up” of particular 2x2s. That tells us what DD is
numerically. But that’s not the end
Because the decomposition theorem expresses the DD
coefficient in terms of sample averages, the movement to
potential outcomes is easy.
Now we express DD in terms of ATT which is essential for
understanding identification and bias
Average treatment effect on the treatment group (ATT)

Define the year-specific ATT as

$$ATT_k(\tau) = E[Y^1_{it} - Y^0_{it} \,|\, k, t = \tau]$$

Now define it over a time window W (e.g., a post-treatment window)

$$ATT_k(W) = E[Y^1_{it} - Y^0_{it} \,|\, k, \tau \in W]$$

Define differences in average potential outcomes over time as:

$$\Delta Y^h_k(W_1, W_0) = E[Y^h_{it} \,|\, k, W_1] - E[Y^h_{it} \,|\, k, W_0]$$

for h = 0 (i.e., $Y^0$) or h = 1 (i.e., $Y^1$)


Changing potential outcomes

Figure: With trends, differences in mean potential outcomes are non-zero


From 2x2 to ATT

   
$$\hat{\delta}^{2x2}_{kU} = \Big(E[Y_j|Post] - E[Y_j|Pre]\Big) - \Big(E[Y_U|Post] - E[Y_U|Pre]\Big)$$
$$= \underbrace{\Big(E[Y^1_j|Post] - E[Y^0_j|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Switching equation}} + \underbrace{E[Y^0_j|Post] - E[Y^0_j|Post]}_{\text{Adding zero}}$$
$$= \underbrace{E[Y^1_j|Post] - E[Y^0_j|Post]}_{ATT} + \underbrace{\Big(E[Y^0_j|Post] - E[Y^0_j|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Non-parallel trends bias in 2x2 case}}$$
Potential outcomes

$$\hat{\delta}^{2x2}_{kU} = ATT_{Post,j} + \underbrace{\Delta Y^0_{Post,Pre,j} - \Delta Y^0_{Post,Pre,U}}_{\text{Selection bias!}}$$

Ha! It's another selection bias term, like when we decomposed the simple difference in outcomes! But here we see its basis - non-parallel trends in the potential outcomes themselves. Notice that one of these terms is a counterfactual, but which one?
Two benign 2x2

$$\hat{\delta}^{2x2}_{kU} = ATT_k(Post) + \Delta Y^0_k(Post(k), Pre(k)) - \Delta Y^0_U(Post(k), Pre)$$
$$\hat{\delta}^{2x2}_{kl} = ATT_k(MID) + \Delta Y^0_k(MID, Pre) - \Delta Y^0_l(MID, Pre)$$

These look the same because you’re always comparing the treated
unit with an untreated unit (though in the second case it’s just that
they haven’t been treated yet).
The dangerous 2x2

But what about the 2x2 that compared the late groups to the
already-treated earlier groups? With a lot of substitutions like we
did we get:

$$\hat{\delta}^{2x2}_{lk} = ATT_{l,Post(l)} + \underbrace{\Delta Y^0_l(Post(l), MID) - \Delta Y^0_k(Post(l), MID)}_{\text{Parallel trends bias}} - \underbrace{\big(ATT_k(Post) - ATT_k(Mid)\big)}_{\text{Heterogeneity bias!}}$$
Heterogeneity bias?

That old decomposition of the simple difference in outcomes rears


its ugly head!

$$\hat{\delta}^{2x2}_{lk} = ATT_{l,Post(l)} + \Delta Y^0_l(Post(l), MID) - \Delta Y^0_k(Post(l), MID) - \big(ATT_k(Post) - ATT_k(Mid)\big)$$

The first term is the ATT we are looking for
The second term is the selection bias, which only zeroes out if $Y^0$ for k and l follows the same parallel trend from the mid to the post period
The third term is the heterogeneity bias, which is non-zero if the ATT for k differs over time; if it is constant, it just zeroes out.
Substitute all this stuff into the decomposition formula

 
$$\hat{\delta}^{DD} = \sum_{k \neq U} s_{kU}\, \hat{\delta}^{2x2}_{kU} + \sum_{k \neq U}\sum_{l > k} s_{kl}\Big[\mu_{kl}\, \hat{\delta}^{2x2,k}_{kl} + (1 - \mu_{kl})\, \hat{\delta}^{2x2,l}_{lk}\Big]$$

where we will make these substitutions

$$\hat{\delta}^{2x2}_{kU} = ATT_k(Post) + \Delta Y^0_k(Post, Pre) - \Delta Y^0_U(Post, Pre)$$
$$\hat{\delta}^{2x2,k}_{kl} = ATT_k(Mid) + \Delta Y^0_k(Mid, Pre) - \Delta Y^0_l(Mid, Pre)$$
$$\hat{\delta}^{2x2,l}_{lk} = ATT_l(Post(l)) + \Delta Y^0_l(Post(l), MID) - \Delta Y^0_k(Post(l), MID) - \big(ATT_k(Post) - ATT_k(Mid)\big)$$

Notice all those potential sources of biases!


Potential Outcome Notation

$$\text{plim}_{n \to \infty}\, \hat{\delta}^{DD} = \delta^{DD} = VWATT + VWCT - \Delta ATT$$

Notice the number of assumptions needed even to estimate


this very strange weighted ATT (which is a function of how
you drew the panel in the first place).
With dynamic treatment effects, the ∆ATT term attenuates the estimate (a bias) and can even reverse its sign, even when the underlying effects all have the same, reinforcing sign!
Let’s look at each of these three parts more closely
Variance weighted ATT

$$VWATT = \sum_{k \neq U} \sigma_{kU}\, ATT_k(Post(k)) + \sum_{k \neq U}\sum_{l > k} \sigma_{kl}\Big[\mu_{kl}\, ATT_k(MID) + (1 - \mu_{kl})\, ATT_l(POST(l))\Big]$$

where σ is like s, only population terms rather than sample terms.


Weights sum to one.
Note, if all the ATT are identical, then the weighting is
irrelevant.
But otherwise, it’s basically weighting each of the individual
sets of ATT we have been discussing, where weights depend
on group size and variance
Variance weighted common trends

VWCT can be understood as a variance weighted common


trends component,
This is the collection of selection biases we previously wrote
out,
But notice – identification requires variance weighted common
trends to hold.
You get this with identical trends, but you don’t need identical
trends anymore as the weights can make it hold without.
Huge pain to write out, unfortunately.
Variance weighted common trends

$$VWCT = \sum_{k \neq U} \sigma_{kU}\Big[\Delta Y^0_k(Post(k), Pre) - \Delta Y^0_U(Post(k), Pre)\Big] + \sum_{k \neq U}\sum_{l > k} \sigma_{kl}\Big[\mu_{kl}\big\{\Delta Y^0_k(Mid, Pre(k)) - \Delta Y^0_l(Mid, Pre(k))\big\} + (1 - \mu_{kl})\big\{\Delta Y^0_l(Post(l), Mid) - \Delta Y^0_k(Post(l), Mid)\big\}\Big]$$

This is new. But while this is a lot to be equal to zero, it's ironically a weaker identifying assumption than we thought, because you don't need identical common trends - the weights can technically correct for unequal trends.
Heterogeneity bias

$$\Delta ATT = \sum_{k \neq U}\sum_{l > k} (1 - \mu_{kl})\Big[ATT_k(Post(l)) - ATT_k(Mid)\Big]$$

Now, if the ATT is constant over time, then this difference is zero,
but what if the ATT is not constant? Then TWFE is biased, and
depending on the dynamics and the VWATT, may even flip signs
Case 1: ATT varies across units but not time

$$\text{plim}_{n \to \infty}\, \hat{\delta}^{DD} = VWATT + VWCT$$

because ∆ATT = 0 here. Assume VWCT = 0. Then the VWATT equals

$$VWATT = \sum_{k \neq U} ATT_k\Big[\sigma_{kU} + \sum_{j=1}^{k-1}\sigma_{jk}(1 - \mu_{jk}) + \sum_{j=k+1}^{K}\sigma_{kj}\mu_{kj}\Big] = \sum_{k \neq U} ATT_k\, w^T_k$$

the VWATT weights together group-specific ATTs by a function of


sample shares and treatment variance.
Case 1 cont.

The processes that determine treatment timing are central to


the interpretation of VWATT.
Assume treatment rolls out first to units with the largest
ATTs.
Then regression DD underestimates the sample-weighted ATT if $t^*_1$ is early enough, or if there are a lot of post periods, so that $\bar{D}_1(1 - \bar{D}_1)$ is very small and $\bar{D}_k \approx 0.5$
Regression DD overestimates if $t^*_1$ is late enough (or if there are a lot of pre periods) so that $\bar{D}_1 \approx 0.5$ and $\bar{D}_k$ is small
Goodman-Bacon (2018) suggests scattering the weights
against each group’s sample share. They may be close if there
is little variation in treatment timing, if the untreated group is
very large, or if some timing groups are very large
Case 2: Constant ATT across units, but heterogenous over
time

Time varying treatment effects, even if they are identical


across units, generate cross-group heterogeneity because of the
differing post-treatment windows
Let's consider a case where the counterfactual outcomes are identical, but the treatment effect is a linear break in the trend. For instance, $Y^1_{it} = Y^0_{it} + \theta(t - t^*_1 + 1)$, similar to Meer and West (2013)
Treatment effect is break in trend
Case 2 cont.

The first 2x2 uses the later group as its control in the middle
period. But in the late period, the later treated unit is using
the earlier treated as its control
But notice, this effect is biased because the control group is
experiencing a trend in outcomes (heterogeneous treatment
effects)
This bias feeds through to the later 2x2 according to the size
of the weight (1 − µkl )
Variance weighted common trends

If treatment effects are constant over time, then we only need


VWCT = 0 to identify VWATT. “Only”!
The assumption itself is not testable because common trends
is based on counterfactual Y 0 for the treatment groups in the
post-treatment period, and we only have pre-treatment data
But let’s assume differential counterfactual trends Yk0 are
linear throughout the panel. Then we can get a convenient
approximation to the VWCT on the next slide
Variance weighted common trends

$$VWCT = \sum_{k \neq U} \Delta Y^0_k\Big[\sigma_{kU} + \sum_{j=1}^{k-1}\sigma_{jk}(1 - 2\mu_{jk}) + \sum_{j=k+1}^{K}\sigma_{kj}(2\mu_{kj} - 1)\Big] - \Delta Y^0_U \sum_{k \neq U}\sigma_{kU}$$

Obviously, for this bias to be inconsequential, we need the sum of the two weighted counterfactual trends to be zero. You get this with identical trends, but identical trends are not necessary because of the weights' ability to shift non-identical trends so as to satisfy the zero condition.
Variance weighted common trends

The weight on each group's counterfactual trend equals the difference between the total weight it gets when it acts as a treatment group ($w^T_k$) minus the total weight it gets when it acts as a control ($w^C_k$).

$$\sum_k \Delta Y^0_k\big[w^T_k - w^C_k\big] = 0$$

where $w^T_k$ is the sum of all weights where group k is the treatment group

$$w^T_k = \sigma_{kU} + \sum_{j=1}^{k-1}\sigma_{jk}(1 - \mu_{jk}) + \sum_{j=k+1}^{K}\sigma_{kj}\mu_{kj}$$

and $w^C_k$ is the sum of all weights where group k is the control group

$$w^C_k = \sum_{j=1}^{k-1}\sigma_{jk}\mu_{jk} + \sum_{j=k+1}^{K}\sigma_{kj}(1 - \mu_{kj})$$
Variance weighted common trends

The bias induced by each group will depend on whether it is a


net treatment/control group
A positive pre-trend for group j will bias the results upwards if j is a net treatment group ($w^T_j > w^C_j$) or downwards if it is a net control group; if the two weights are equal, then the bias will be zero regardless of the group pre-trend
Units treated towards the ends of the panel get relatively more
weight when they act as controls.
Needless to say, the size of the bias from a given trend is larger
for groups with more weight
Variance weighted common trends

What this means is that while all units are acting as controls, treatment timing causes some units to be controls more often - hence why they become negative (e.g., $w^T_k - w^C_k < 0$ implies $w^C_k$ has become relatively large)
The earliest and/or latest units get more weight as controls
than treatments
Units treated in the middle of the panel have high treatment
variance as we’ve noted repeatedly, and so get more weight
when they act as the treatment group
Variance weighted common trend weights
Testing VWCT

The identifying assumption $\sum_k \Delta Y^0_k[w^T_k - w^C_k] = 0$ shows us how to exactly weight averages of $x_{it}$ and perform a single t-test that directly captures the identifying assumption.
1 Generate a dummy for the effective treatment group
$$1[B_k] = w^T_k - w^C_k > 0$$
2 Estimate
$$x_k = \beta B_k + \varepsilon_k$$
weighted by $|w^T_k - w^C_k|$
The coefficient $\hat{\beta}$ equals covariate differences weighted by the actual identifying variation, and its t-statistic tests the null of reweighted balance implied by the VWCT equality
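A minimal Stata sketch of that balance test, assuming one observation per timing group with hypothetical variables netw (the group's w^T - w^C, computed from the decomposition) and x (the group-level covariate average):

gen B = (netw > 0) if !missing(netw)   // effective treatment group dummy
gen absw = abs(netw)                   // weight groups by |wT - wC|
reg x B [aw = absw]                    // the t-statistic on B tests the reweighted balance condition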
Software to check the 2x2s and weights

Austin Nichols and Thomas Goldring have made available a


package in Stata called ddtiming.ado
This will estimate each individual 2x2 and the weights
associated with a simple twoway fixed effects model
Let's look at it. First download the Cheng and Hoekstra data from earlier (castle-doctrine-2000-2010.dta)
Now install ddtiming.ado and use the do file that I’ve supplied
called hoekstra-cheng.do
Stata

. use castle-doctrine-2000-2010.dta, clear


. areg l_murder post i.year, a(sid) robust

Dep var Log homicide


Castle doctrine law 0.105
(0.032)
Recall the estimated ATT is 0.105

. ddtiming l_murder post, i(sid) t(year)

DD Comparison Weight Avg DD Est


Earlier T vs. Later C 0.060 -0.039
Later T vs. Earlier C 0.032 0.063
T vs. Never treated 0.908 0.116

. di (0.060*-0.039) + (0.032*0.063) + (0.908*0.116)


. 0.105

Most of the 0.105 is coming from comparing treatment units to


never treated units; the others cancel out
2x2s and their corresponding weights

[Figure: each 2x2 DD estimate plotted against its weight (weights from 0.00 to 0.60; estimates from roughly -0.6 to 0.4). Markers distinguish: Earlier Group Treatment vs. Later Group Control; Later Group Treatment vs. Earlier Group Control; Treatment vs. Never Treated.]
Biased DD with OLS

Review baker.do
So we see – with differential timing, and heterogeneous
treatment effects over time, the TWFE bias can be gigantic
because:

plim $\hat{\delta}^{DD}$ = VWATT + VWCT $- \Delta ATT$

New papers are coming out focused on the issues that we are
seeing with TWFE
Callaway and Sant’anna (2019) is one of these (currently R&R
at Journal of Econometrics)
Preliminary

Callaway and Sant’anna consider identification, estimation and


inference procedures for ATE in DD models with
1 multiple time periods
2 variation in treatment timing (i.e., differential timing)
3 parallel trends only holds after conditioning on observables
Group-time ATE

Key concept: the ATE for a specific group and time


Groups are basically cohorts of units treated at the same time
Their method will calculate an ATE per group/time which
yields many individual ATE estimates
Group-time ATE estimates are not determined by the
estimation method one adopts (first difference or FE)
Does not directly restrict heterogeneity with respect to
observed covariates, timing or the evolution of treatment
effects over time
Provides a way to aggregate over these to get a single ATE
Another contribution

Typical econometrics paper: they propose estimators and


provide asymptotically valid inference procedures for the causal
parameter of interest
Uses a particular kind of bootstrapping that is computationally convenient for obtaining confidence intervals
This is an extension of an older Abadie (2006) paper on
semi-parametric DD with some subtle and substantive
differences
The estimator will look awfully similar to an inverse probability
weighting estimator down to the use of propensity scores
Parallel trends assumption

Parallel trends is never directly testable


If you assume that because it holds in the pre-treatment period it therefore holds in the counterfactual post-treatment periods, then fine (IMO, this begs the question [as in assumes the conclusion]. Obviously if treatment is endogenous then parallel trends doesn't hold going forward even if it did hold prior; see Kahn-Lang and Lang 2018)
Notation

T periods going from t = 1, . . . , T


Units are either treated (Dt = 1) or untreated (Dt = 0) but
once treated cannot revert to untreated state
$G_g$ signifies a group and is binary; it equals one if the unit is first treated in time period g
C is also binary and indicates a control group unit equalling
one if “never treated”
Recall the problem with OLS on using treatment units as
controls
Callaway and Sant’anna seem to know this and working to
specifically address it by essentially not using those units at all
as controls
Generalized propensity score: $\hat{p}(X) = \Pr(G_g = 1 \mid X,\, G_g + C = 1)$
Propensity scores

They’ll estimate a propensity score based on group covariates


using probit or logit (but not OLS)
That score will then be normalized (e.g., Hajek weight) which
improves finite sample bias
You may need to trim it on the [0.1,0.9] interval as is
commonly suggested in other applications
Essentially, units in the control group will be weighted up if their propensity scores are high, and weighted down if low, making for more apples-to-apples comparisons
Detour into IPW

Horvitz-Thompson weights
\[
\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{p}(X_i)}{1 - \hat{p}(X_i)}
\]

Hajek weights
\[
\hat{\delta}_{ATT} = \left[ \sum_{i=1}^{N} \frac{Y_i D_i}{\hat{p}(X_i)} \Big/ \sum_{i=1}^{N} \frac{D_i}{\hat{p}(X_i)} \right] - \left[ \sum_{i=1}^{N} \frac{Y_i (1-D_i)}{1-\hat{p}(X_i)} \Big/ \sum_{i=1}^{N} \frac{(1-D_i)}{1-\hat{p}(X_i)} \right]
\]
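A minimal Stata sketch of the first (non-normalized) estimator above; y, d, and x are hypothetical placeholder names for the outcome, treatment, and covariate:

* estimate the propensity score and build the ATT weight
logit d x
predict phat, pr
gen wt  = (d - phat)/(1 - phat)
gen ywt = y*wt
quietly summarize d
scalar NT = r(sum)                      // number of treated observations
quietly summarize ywt
display "IPW estimate of the ATT = " r(sum)/NT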
Parameter of interest

ATT (g , t) = E [Yt1 − Yt0 |Gg = 1]


Potential uses of this estimator

1 Are treatment effects heterogenous by time of adoption?


2 Does treatment effect change over time?
3 Are short-run effects more pronounced than long-run effects?
4 Do treatment effect dynamics differ if people are first treated
in a recession relative to expansion years?
Assumptions

Assumption 1: Sampling is iid (panel data)

Assumption 2: Conditional parallel trends

\[
E[Y_t^0 - Y_{t-1}^0 \mid X, G_g = 1] = E[Y_t^0 - Y_{t-1}^0 \mid X, C = 1]
\]

Assumption 3: Irreversible treatment

Assumption 4: Common support (propensity score)


Estimator

Theorem 1
\[
ATT(g,t) = E\left[ \left( \frac{G_g}{E[G_g]} - \frac{\dfrac{\hat{p}(X)\,C}{1-\hat{p}(X)}}{E\!\left[\dfrac{\hat{p}(X)\,C}{1-\hat{p}(X)}\right]} \right) \big(Y_t - Y_{g-1}\big) \right]
\]
Which units will and will not be controls?

Callaway and Sant’anna are keeping us from calculating DD’s


using TWFE, which is problematic in part because you're implicitly calculating 2x2s by comparing later treated units to earlier treated units, which is a sin
But what if you never have a true control group, or “never
treated”?
Remarks about “staggered adoption” with universal coverage

Remark 1: In some applications, eventually all units are treated, implying that C is never equal to one. In such cases one can consider the "not yet treated" ($D_t = 0$) as a control group instead of the "never treated" (C = 1).
Aggregated vs single year/group ATT

The method they propose is really just identifying very narrow


ATT per group time.
But we are often interested in more aggregate parameters, like
the ATT across all groups and all times
They present two alternative methods for building “interesting
parameters”

“We can aggregate the group-time treatment effects into


fewer interpretable causal effect parameters, which makes
interpretation easier, and also increases statistical power
and reduces estimation uncertainty.” - Andrew Baker
Interesting Parameter 1

\[
\frac{2}{T(T-1)} \sum_{g=2}^{T} \sum_{t=2}^{T} 1\{g \le t\}\, ATT(g,t)
\]

where T is the number of pre-treatment years (Assumption 2 regarding conditional parallel trends). Let's look at an example.
Aggregating the first way

ATT (1986, 1986) = 10


ATT (1986, 1987) = 15
ATT (1986, 1988) = 20

Let data run from 1983 - 1988. Thus T = 3. ATT simple average
is 15.
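Plugging those three group-time estimates into the formula above with T = 3 reproduces the simple average:
\[
\frac{2}{T(T-1)} \sum_{g}\sum_{t} 1\{g \le t\}\,ATT(g,t) = \frac{2}{3 \cdot 2}\,(10 + 15 + 20) = \frac{45}{3} = 15
\]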
Interesting Parameter 2

\[
\frac{1}{k} \sum_{g=2}^{T} \sum_{t=2}^{T} 1\{g \le t\}\, ATT(g,t)\, P(G = g)
\]

This is a weighted average of each ATT (g , t) putting more weight


on ATT (g , t) with larger group sizes
Bootstrap inference

They propose a bootstrap procedure to conduct asymptotically


valid inference which can adjust for autocorrelation and clustering
Stata example

See baker.do
Concluding remarks on DD

Chances are you are going to write more papers using DD than
any other design
Goodman-Bacon (2018, 2019) is worth your time so that you
know what you are estimating
And Callaway and Sant'anna (2019) is an extremely useful
contribution to the DD toolbox for showing a way to estimate
the group-time ATT using any variety of approaches, including
regression
What is synthetic control

Synthetic control has been called the most important


innovation in causal inference of the last 15 years (Athey and
Imbens 2017)
It’s extremely useful for case studies, which is nice because
that’s often all we have
Continues to also be methodologically a frontier for applied
econometrics
Consider this talk a starting point for you



What is a comparative case study

Single treated unit – country, state, whatever


Social scientists tackle such situations in two ways:
qualitatively and quantitatively
In political science, and probably other fields, you see a stark dividing line between the two camps
Not so much in economics
Qualitative comparative case studies

In qualitative comparative case studies, the goal is to reason


inductively the causal effects of events or characteristics of a
single unit on some outcome, oftentimes through logic and
historical analysis.
May not answer the causal questions at all because there is
rarely a counterfactual, or if so, it’s ad hoc.
A classic example of the comparative case study approach is Alexis de Tocqueville's Democracy in America (though he is regularly comparing the US to France)
Traditional quantitative comparative case studies

Quantitative comparative case studies are often explicitly


causal designs.
Usually a natural experiment applied to a single aggregate unit
(e.g., city, school, firm, state, country)
Method compares the evolution of an aggregate outcome for
the unit affected by the intervention to the evolution of the
same ad hoc aggregate control group (Card 1990; Card and
Krueger 1994)
Pros and cons of traditional case study approaches

Pros:
Policy interventions often take place at an aggregate level
Aggregate/macro data are often available
Cons:
Selection of control group is ad hoc
Standard errors do not reflect uncertainty about the ability of
the control group to reproduce the counterfactual of interest
Description of the Mariel Boatlift

How do inflows of immigrants affect the wages and


employment of natives in local labor markets?
Card (1990) uses the Mariel Boatlift of 1980 as a natural
experiment to measure the effect of a sudden influx of
immigrants on unemployment among less-skilled natives
The Mariel Boatlift increased the Miami labor force by 7%
Individual-level data on unemployment from the Current
Population Survey (CPS) for Miami and four comparison cities
(Atlanta, Los Angeles, Houston, Tampa-St. Petersburg)
Why these four?
Motivating Example: The Mariel Boatlift

[Scanned exhibit from Card (1990) shown on the slide; not legible in the extracted text.]
Card’s main results

[Scanned table of Card's main results from Card (1990); not legible in the extracted text.]
Can this ever lead to subjective biases?

Card found that the Mariel boatlift reduced unemployment


compared to the four cities he chose
Is there anything principled we could do that doesn't give the researcher so much control over the control group?
Enter synthetic control (Abadie and Gardeazabal 2003;
Abadie, Diamond and Hainmueller 2010)
Synthetic Control

First appears in Abadie and Gardeazabal (2003) in a study of a


terrorist attack in Spain (Basque) on GDP
Revisited in a 2010 JASA article with Diamond and Hainmueller, two political scientists who were PhD students at Harvard (more proofs and inference)
A combination of comparison units often does a better job
reproducing the characteristics of a treated unit than single
comparison unit alone
Researcher’s objectives

Our goal here is to reproduce the counterfactual of a treated


unit by finding the combination of untreated units that best
resembles the treated unit before the intervention in terms of
the values of k relevant covariates (predictors of the outcome
of interest)
Method selects weighted average of all potential comparison
units that best resembles the characteristics of the treated
unit(s) - called the “synthetic control”
Synthetic control method: advantages

Precludes extrapolation (unlike regression) because


counterfactual forms a convex hull
Does not require access to post-treatment outcomes in the
“design” phase of the study - no peeking
Makes explicit the contribution of each comparison unit to the
counterfactual
Formalizing the way comparison units are chosen has direct
implications for inference
Synthetic control method: disadvantages

1 Subjective researcher bias kicked down to the model selection


stage
2 Significant diversity at the moment as to how to principally
select models - from machine learning to modifications - as
well as estimation and software
Ferman and Pinto (2018) recommend showing a few different results in their "cherry picking" JPAM paper
Synthetic control method: estimation

Suppose that we observe J + 1 units in periods 1, 2, . . . , T


Unit “one” is exposed to the intervention of interest (that is,
“treated”) during periods T0 + 1, . . . , T
The remaining J are an untreated reservoir of potential
controls (a “donor pool”)
Potential outcomes notation

Let Yit0 be the outcome that would be observed for unit i at


time t in the absence of the intervention
Let Yit1 be the outcome that would be observed for unit i at
time t if unit i is exposed to the intervention in periods T0 + 1
to T .
Dynamic ATT

Treatment effect parameter is defined as dynamic ATT where

\[
\delta_{1t} = Y_{1t}^1 - Y_{1t}^0 = Y_{1t} - Y_{1t}^0
\]
for each post-treatment period $t > T_0$, where $Y_{1t}$ is the outcome for unit one at time t. We will estimate $Y_{1t}^0$ using the J units in the donor pool
Estimating optimal weights

Let $W = (w_2, \ldots, w_{J+1})'$ with $w_j \ge 0$ for $j = 2, \ldots, J+1$ and $w_2 + \cdots + w_{J+1} = 1$. Each value of W represents a potential synthetic control
Let X1 be a (k × 1) vector of pre-intervention characteristics
for the treated unit. Similarly, let X0 be a (k × J) matrix
which contains the same variables for the unaffected units.
The vector $W^* = (w_2^*, \ldots, w_{J+1}^*)'$ is chosen to minimize $\|X_1 - X_0 W\|$, subject to our weight constraints


Optimal weights differ by another weighting matrix

Abadie, et al. consider


\[
\|X_1 - X_0 W\| = \sqrt{(X_1 - X_0 W)'\, V\, (X_1 - X_0 W)}
\]
where $X_{jm}$ is the value of the m-th covariate for unit j and V is some $(k \times k)$ symmetric and positive semidefinite matrix
More on the V matrix

Typically, V is diagonal with main diagonal $v_1, \ldots, v_k$. Then the synthetic control weights $w_2^*, \ldots, w_{J+1}^*$ minimize:
\[
\sum_{m=1}^{k} v_m \left( X_{1m} - \sum_{j=2}^{J+1} w_j X_{jm} \right)^2
\]

where vm is a weight that reflects the relative importance that we


assign to the m-th variable when we measure the discrepancy
between the treated unit and the synthetic controls
Choice of V is critical

The synthetic control W ∗ (V ∗ ) is meant to reproduce the


behavior of the outcome variable for the treated unit in the
absence of the treatment
Therefore, the V ∗ weights directly shape W ∗
Estimating the V matrix

Choice of v1 , . . . , vk can be based on


Assess the predictive power of the covariates using regression
Subjectively assess the predictive power of each of the
covariates, or calibration inspecting how different values for
v1 , . . . , vk affect the discrepancies between the treated unit
and the synthetic control
Minimize mean square prediction error (MSPE) for the
pre-treatment period (default):
\[
\sum_{t=1}^{T_0} \left( Y_{1t} - \sum_{j=2}^{J+1} w_j^*(V)\, Y_{jt} \right)^2
\]
Cross validation

Divide the pre-treatment period into an initial training period


and a subsequent validation period
For any given V , calculate W ∗ (V ) in the training period.
Minimize the MSPE of W ∗ (V ) in the validation period
Suppose Y 0 is given by a factor model

What about unmeasured factors affecting the outcome variables as


well as heterogeneity in the effect of observed and unobserved
factors?

\[
Y_{it}^0 = \alpha_t + \theta_t Z_i + \lambda_t u_i + \varepsilon_{it}
\]

where αt is an unknown common factor with constant factor


loadings across units, and λt is a vector of unobserved common
factors
With some manipulation

\[
Y_{1t}^0 - \sum_{j=2}^{J+1} w_j^* Y_{jt} = \sum_{j=2}^{J+1} w_j^* \sum_{s=1}^{T_0} \lambda_t \left( \sum_{n=1}^{T_0} \lambda_n' \lambda_n \right)^{-1} \lambda_s'\, (\varepsilon_{js} - \varepsilon_{1s}) - \sum_{j=2}^{J+1} w_j^*\, (\varepsilon_{jt} - \varepsilon_{1t})
\]

If $\sum_{t=1}^{T_0} \lambda_t' \lambda_t$ is nonsingular, then the RHS will be close to zero if the number of pre-intervention periods is "large" relative to the size of the transitory shocks
Only units that are alike in observables and unobservables
should produce similar trajectories of the outcome variable
over extended periods of time
Proof in Appendix B of ADH (2011)
Example: California’s Proposition 99

In 1988, California first passed comprehensive tobacco control


legislation:
increased cigarette tax by 25 cents/pack
earmarked tax revenues to health and anti-smoking budgets
funded anti-smoking media campaigns
spurred clean-air ordinances throughout the state
produced more than $100 million per year in anti-tobacco
projects
Other states that subsequently passed control programs are
excluded from donor pool of controls (AK, AZ, FL, HI, MA,
MD, MI, NJ, OR, WA, DC)
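As a software aside, here is a minimal sketch of how a synthetic control for this application can be estimated with the user-written synth command in Stata (Abadie, Diamond and Hainmueller's implementation). The variable names follow the smoking dataset distributed with that package; treat the exact predictors, periods and options shown as assumptions to adapt:

* assumes: ssc install synth, and the package's smoking data are in memory;
* trunit(3) is California and treatment begins in 1989
tsset state year
synth cigsale beer lnincome retprice age15to24 ///
      cigsale(1988) cigsale(1980) cigsale(1975), ///
      trunit(3) trperiod(1989) fig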
Cigarette Consumption: CA and the Rest of the US

[Figure: per-capita cigarette sales (in packs), 1970-2000, for California and the rest of the U.S.; a vertical line marks the passage of Proposition 99.]
Cigarette Consumption: CA and synthetic CA
[Figure: per-capita cigarette sales (in packs), 1970-2000, for California and synthetic California; a vertical line marks the passage of Proposition 99.]
Predictor Means: Actual vs. Synthetic California

                                     California            Average of
Variables                         Real     Synthetic    38 control states
Ln(GDP per capita)               10.08       9.86              9.86
Percent aged 15-24               17.40      17.40             17.29
Retail price                     89.42      89.41             87.27
Beer consumption per capita      24.28      24.20             23.75
Cigarette sales per capita 1988  90.10      91.62            114.20
Cigarette sales per capita 1980 120.20     120.43            136.58
Cigarette sales per capita 1975 127.10     126.99            132.81
Note: All variables except lagged cigarette sales are averaged for the
1980-1988 period (beer consumption is averaged 1984-1988).
Smoking Gap between CA and synthetic CA
[Figure: gap in per-capita cigarette sales (in packs) between California and synthetic California, 1970-2000; a vertical line marks the passage of Proposition 99.]
Inference

To assess significance, we calculate exact p-values under Fisher's sharp null using a test statistic equal to the post-to-pre (after-to-before) ratio of the RMSPE
Exact p-value method
Iteratively apply the synthetic method to each country/state in
the donor pool and obtain a distribution of placebo effects
Compare the gap (RMSPE) for California to the distribution of
the placebo gaps. For example, the post-Prop. 99 RMSPE is:
\[
RMSPE = \left( \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \Big( Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt} \Big)^2 \right)^{1/2}
\]

and the exact p-value is the treatment unit rank divided by J
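As a hypothetical illustration of that arithmetic: if the treated unit has the largest post/pre RMSPE ratio relative to J = 38 donors, its rank is 1 and
\[
p = \frac{\text{rank}}{J} = \frac{1}{38} \approx 0.026
\]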


Smoking Gap for CA and 38 control states

(All States in Donor Pool)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California (dark line) and placebo gaps for the 38 control states (light lines); a vertical line marks the passage of Proposition 99.]
Smoking Gap for CA and 34 control states

(Pre-Prop. 99 MSPE ≤ 20 Times Pre-Prop. 99 MSPE for CA)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California and placebo gaps for the 34 remaining control states; a vertical line marks the passage of Proposition 99.]
Smoking Gap for CA and 29 control states

(Pre-Prop. 99 MSPE ≤ 5 Times Pre-Prop. 99 MSPE for CA)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California and placebo gaps for the 29 remaining control states; a vertical line marks the passage of Proposition 99.]
Smoking Gap for CA and 19 control states

(Pre-Prop. 99 MSPE ≤ 2 Times Pre-Prop. 99 MSPE for CA)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California and placebo gaps for the 19 remaining control states; a vertical line marks the passage of Proposition 99.]
Ratio Post-Prop. 99 RMSPE to Pre-Prop. 99 RMSPE

(All 38 States in Donor Pool)
[Figure: histogram of the post/pre-Proposition 99 mean squared prediction error ratios (frequency vs. ratio, 0 to 120) across the donor pool, with California marked in the far right tail.]


Facts

The US has the highest prison population of any OECD


country in the world
2.3 million are currently incarcerated in US federal and state
prisons and county jails
Another 4.75 million are on parole
From the early 1970s to the present, incarceration and prison
admission rates quintupled in size



Prison constraints

Prisons are and have been at capacity for a long time.


Requires managing flows through
Prison construction
Overcrowding
Paroles
Texas prison boom

Ruiz v. Estelle 1980


Class action lawsuit against TX Dept of Corrections (Estelle,
warden).
TDC lost. Lengthy period of appeals and legal decrees.
Lengthy period of time relying on paroles to manage flows
Governor Ann Richards (D) 1991-1995
Operational prison capacity increased 30-35% in 1993, 1994 and
1995.
Prison capacity increased from 55,000 in 1992 to 130,000 in
1995.
Building of new prisons (private and public)
New prison construction

[Figure: number of new prison constructions per year, 1840-2020; the red dashed line marks 1993.]
Texas prison growth
Operational capacity

[Figure: Texas operational prison capacity (left axis, roughly 40,000 to 160,000) and the percent change in operational capacity (right axis), 1982-2004.]


Texas Prison Flow Measures per 100,000 Population

[Figure: Texas prison admissions, prison releases, and discretionary paroles per 100,000 population, 1980-2005.]
Total incarceration per 100 000
Texas vs US

[Figure: total incarceration rates per 100,000, 1980-2005, for TX and the USA (excluding TX).]


1993 starts the prison expansion
Data

National Prisoner Statistics - prison measures, including race


and gender-specific incarceration
Current Population Survey - controls
SEER - population
Incarcerated persons per 100,000
1993 Treatment

[Figure: gap in prediction error (per-capita incarceration) between Texas and synthetic Texas, 1975-2005; treatment begins in 1993.]


Texas rank: 2, p-value: 0.04
What if you can't conduct a randomized experiment?

Problems with the experimental design itself:


non-compliance by administrators
non-compliance by members of the treatment group
non-compliance by members of the control group
Experiments may be impractical due to:
Too expensive
Unethical
Not feasible for some other reason



From Hill, Millar and Connelly:

[Figure 1: Lung Cancer at Autopsy: Combined Results from 18 Studies; per cent of autopsies, 1860-1950, observed and fitted.]
Mortality Statistics ("The Great Debate"): The Registrar General of England and Wales began publishing the numbers of deaths for specific cancer sites in 1911. The death rates for cancer of the lung from 1911 to 1955 were published by Percy Stocks. The rates increased exponentially over the period: 10% per year in males and 6% per year in females. Canadian rates for the period 1931-52 were published by A. J. Phillips. The rates were consistently lower in Canada than in England and Wales, but also increased exponentially at 8% per year in males and 4% per year in females.
The British and Canadian rates are shown in Figure 2. The rates (a) for males and (b) for females have been age-standardized, and the trends extended to 1990 using data published by Richard Peto and colleagues and by Statistics Canada. In British males the rates reached a maximum in the mid-1970s and then declined. In Canadian males the initial rise was more prolonged, reaching a maximum in 1990. Among females the age-standardized rates continue to climb in both countries, the rise being steeper in Canada than in Britain.
The fact that mortality was lower at first in Canada than in Britain may be explained by the difference in smoking in the two countries. Percy Stocks cited data on the annual consumption per adult of cigarettes in various countries between 1939 and 1957.

[Figure 2(a): Mortality from Cancer of the Lung in Males; rate per 100,000, 1910-2000; England & Wales, Canada, United Kingdom.]
[Figure 4: Smoking and Lung Cancer Case-control Studies; odds ratios for "less than 20" and "20 or more" cigarettes per day, males and females, with the weighted mean across studies. The accompanying text notes that the odds ratio increases with the amount smoked.]

Cohort Studies: Cohort studies, though less prone to bias, are much more difficult to perform than case-control studies, since it is necessary to assemble many thousands of individuals, determine their smoking status, and follow them up for several years to determine how many develop lung cancer. Four such studies were mounted in the 1950s. The subjects used were British doctors, United States veterans, Canadian veterans, and volunteers assembled by the American Cancer Society. All four used mortality as the end-point.
Figure 5 shows the combined mortality ratios for cancer of the lung in males by level of cigarette smoking. Two of the studies involved females, but the numbers of lung cancer deaths were too small to provide precise estimates. Since all causes of death were recorded in the cohort studies it was possible to determine the relationship between smoking and diseases other than lung cancer. Significant associations were found in relation to several types of cancer (e.g. mouth, pharynx, larynx, esophagus, bladder) and with chronic respiratory disease and cardiovascular disease.

[Figure 5: Smoking and Lung Cancer Cohort Studies in Males; mortality ratios by cigarettes per day (less than 10, 10 to 19, 20 or more), weighted mean of 4 studies.]
Does Smoking Cause Cancer?

Smoking, S, causes lung cancer, C (S → C ) versus spurious


correlation due to backdoor path:

[DAG: nodes S and C, connected by the direct arrow S → C and by a backdoor path between S and C; diagram not legible in the extracted text.]
Nature of the criticism

Criticisms from Joseph Berkson, Jerzy Neyman and Ronald Fisher:


(Hill, Millar and Connelly 2003)
1 Correlation b/w smoking and lung cancer is spurious due to
biased selection of subjects (e.g., conditioning on collider
problem)
2 Functional form complaints about using “risk ratios” and “odds
ratios”
3 Confounder, Z , creates backdoor path between smoking and
cancer
4 Implausible magnitudes
5 No experimental evidence to incriminate smoking as a cause of
lung cancer
Fisher’s confounding theory

Fisher, equally famous as a geneticist, argued from logic,


statistics and genetic evidence for a hypothetical confounding
genome, Z , and therefore smokers and non-smokers were not
exchangeable (violation of independence assumption)
Other studies showed that cigarette smokers and non-smokers
were different on observables – more extraverted than
non-smokers and pipe smokers, differed in age, differed in
income, differed in education, etc.
Hindsight is 20/20

Fisher was a chain smoking pipe smoker, he died of cancer,


and he was a paid expert witness for the tobacco industry.
But cynicism aside, it is easy to criticize Fisher because we
look back with more information to when the smoking/lung
cancer link was not universally accepted, and evidence for the
causal link was shallow:
"they [the epidemiologists] turned out to be right, but only because bad logic does not necessarily lead to wrong conclusions." - Robert Hooke (1983)
Motivation: Smoking and Mortality

Table: Death rates per 1,000 person-years (Cochran 1968)

Smoking group Canada U.K. U.S.


Non-smokers 20.2 11.3 13.5
Cigarettes 20.5 14.1 13.5
Cigars/pipes 35.5 20.7 17.4

Are cigars dangerous?


Non-smokers and smokers differ in mortality and age

Table: Mean ages, years (Cochran 1968)

Smoking group Canada U.K. U.S.


Non-smokers 54.9 49.1 57.0
Cigarettes 50.5 49.8 53.2
Cigars/pipes 65.9 55.7 59.7

Older people die at a higher rate, and for reasons other than
just smoking cigars
Maybe cigar smokers' higher observed death rates are because they're older on average
Subclassification

One way to think about the problem is that the covariates are
not balanced – their mean values differ for treatment and
control group. So let’s try to balance them.
Worth a pause - blocking on confounders vs controlling for
covariates. The latter reduces residual variance, but shouldn’t
affect the bias of the estimator. Ceteris paribus vs blocking
Subclassification (also called stratification): Compare mortality
rates across the different smoking groups within age groups so
as to neutralize covariate imbalances in the observed sample
Subclassification

Divide the smoking group samples into age groups


For each of the smoking group samples, calculate the mortality
rates for the age group
Construct probability weights for each age group as the
proportion of the sample with a given age
Compute the weighted averages of the age groups mortality
rates for each smoking group using the probability weights
Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What is the average death rate for pipe smokers?


Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What is the average death rate for pipe smokers?


     
11 13 16
15 · + 35 · + 50 · = 35.5
40 40 40
Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
     
29 9 2
15 · + 35 · + 50 · = 21.2
40 40 40
Table: Adjusted death rates using 3 age groups (Cochran 1968)

Smoking group Canada U.K. U.S.


Non-smokers 20.2 11.3 13.5
Cigarettes 28.3 12.8 17.7
Cigars/pipes 21.2 12.0 14.2
Covariates

Definition: Predetermined Covariates


Variable X is predetermined with respect to the treatment D (also
called “pretreatment”) if for each individual i, Xi0 = Xi1 , i.e., the
value of Xi does not depend on the value of Di . Such
characteristics are called covariates.
Comment I: Does not imply X and D are independent
Comment II: Predetermined variables are often time invariant (e.g.,
sex, race), but time invariance is not a necessary condition
Comment III: Beware of colliders
Outcomes

Definition: Outcomes
Those variables, Y , that are (possibly) not predetermined are called
outcomes (for some individual i, Yi0 6= Yi1 )
Adjustment for Observables

Subclassification (Cochran 1968)


Nearest Neighbor matching (Abadie and Imbens 2006, 2008)
Propensity score (Rosenbaum and Rubin 1983)
Multivariate regression
Identification under independence

Recall that randomization implies

\[
(Y^0, Y^1) \perp\!\!\!\perp D
\]
and therefore:
\begin{align*}
E[Y \mid D=1] - E[Y \mid D=0] &= E[Y^1 \mid D=1] - E[Y^0 \mid D=0] && \text{(by the switching equation)} \\
&= E[Y^1] - E[Y^0] && \text{(by independence)} \\
&= E[Y^1 - Y^0] && \text{(ATE)}
\end{align*}

As well as that ATT = ATE :

E [Y 1 − Y 0 ] = E [Y 1 − Y 0 |D = 1]
Identification under conditional independence

Identification assumptions:
1 $(Y^1, Y^0) \perp\!\!\!\perp D \mid X$ (conditional independence)
2 0 < Pr (D = 1|X ) < 1 with probability one (common support)
Identification result:
Given assumption 1:

\begin{align*}
E[Y^1 - Y^0 \mid X] &= E[Y^1 - Y^0 \mid X, D=1] \\
&= E[Y \mid X, D=1] - E[Y \mid X, D=0]
\end{align*}
Given assumption 2:
\begin{align*}
\delta_{ATE} &= E[Y^1 - Y^0] \\
&= \int E[Y^1 - Y^0 \mid X, D=1]\, d\Pr(X) \\
&= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X)
\end{align*}
Identification under conditional independence

Identification assumptions:
1 $(Y^1, Y^0) \perp\!\!\!\perp D \mid X$ (conditional independence)
2 0 < Pr (D = 1|X ) < 1 with probability one (common support)
Identification result:
Similarly

\begin{align*}
\delta_{ATT} &= E[Y^1 - Y^0 \mid D=1] \\
&= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X \mid D=1)
\end{align*}

To identify δATT the conditional independence and common


support assumptions can be relaxed to:
1 Y0 ⊥⊥ D|X
2 Pr (D = 1|X ) < 1 (with Pr (D = 1) > 0)
Subclassification estimator

The identification result is:


\begin{align*}
\delta_{ATE} &= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X) \\
\delta_{ATT} &= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X \mid D=1)
\end{align*}
Assume X takes on K different cells $\{X^1, \ldots, X^k, \ldots, X^K\}$. Then the analogy principle suggests the following estimators:
\begin{align*}
\hat{\delta}_{ATE} &= \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N} \\
\hat{\delta}_{ATT} &= \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}
\end{align*}
where $N^k$ is the number of observations and $N_T^k$ is the number of treatment observations in cell k; $\bar{Y}^{1,k}$ is the mean outcome for the treated in cell k; $\bar{Y}^{0,k}$ is the mean outcome for the control in cell k
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
\[
4 \cdot \left(\frac{13}{30}\right) + 6 \cdot \left(\frac{17}{30}\right) = 5.13
\]
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
\[
4 \cdot \left(\frac{3}{10}\right) + 6 \cdot \left(\frac{7}{10}\right) = 5.4
\]
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Problem: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Problem: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
Not identified!
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
\[
4 \cdot \left(\frac{3}{10}\right) + 5 \cdot \left(\frac{3}{10}\right) + 6 \cdot \left(\frac{4}{10}\right) = 5.1
\]
Curse of Dimensionality

Subclassification may become less feasible in finite samples as


the number of covariates grows (e.g., K = 4 was too many for
this sample)
Assume we have k covariates and we divide each into 3 coarse
categories (e.g., age: young, middle age, old; income: low,
medium, high, etc.)
The number of subclassification cells (or "strata") is $3^k$. For k = 10, that's $3^{10} = 59{,}049$
Curse of Dimensionality

If sparseness occurs, it means many cells may contain either


only treatment units or only control units but not both. If so,
we cannot use sub classification.
Subclassification is also a problem if the cells are “too coarse”.
We can always use "finer" classifications, but finer cells worsen the dimensionality problem, so we don't gain much from that. For example, using 10 variables and 5 categories for each, we get $5^{10} = 9{,}765{,}625$ cells.
Nearest Neighbor Matching

See Abadie and Imbens (2006). “Large sample properties of


matching estimators for average treatment effects”.
Econometrica
We could also estimate δATT by imputing the missing potential
outcome of each treatment unit i using the observed outcome
from that outcome’s “nearest” neighbor j in the control set
\[
\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})
\]
where $Y_{j(i)}$ is the observed outcome of a control unit such that $X_{j(i)}$ is the closest value to $X_i$ among all of the control observations (e.g., match on X)



Matching

We could also use the average observed outcome over M


closest matches:
\[
\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left[ Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right] \qquad (1)
\]

Works well when we can find good matches for each treatment
group unit, so M is usually defined to be small (i.e., M = 1 or
M = 2)
Matching

We can also use matching to estimate δATE . In that case, we


match in both directions:
1 If observation i is treated, we impute Yi0 using the control
matches, {Yj1 (i) , . . . , YjM (i) }
2 If observation i is control, we impute Yi1 using the treatment
matches, {Yj1 (i) , . . . , YjM (i) }
The estimator is:
\[
\hat{\delta}_{ATE} = \frac{1}{N} \sum_{i=1}^{N} (2D_i - 1) \left[ Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right]
\]
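For reference, Stata's built-in teffects nnmatch command implements Abadie-Imbens style nearest-neighbor matching like the estimators above (it is not the only way to compute them). A minimal sketch with hypothetical variable names y (outcome), d (treatment), and age (matching covariate):

* ATT with one nearest neighbor, matching on age
teffects nnmatch (y age) (d), atet nneighbor(1)
* ATE version with bias correction on the matching covariate (bias correction is discussed later)
teffects nnmatch (y age) (d), nneighbor(1) biasadj(age)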
Matching example with single covariate

Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 6 ? 1 3
2 1 ? 1 1
3 0 ? 1 10
4 0 0 2
5 9 0 3
6 1 0 -2
7 1 0 -4

Question: What is $\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})$?
Matching example with single covariate

Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 6 ? 1 3
2 1 ? 1 1
3 0 ? 1 10
4 0 0 2
5 9 0 3
6 1 0 -2
7 1 0 -4

Question: What is $\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})$?
Match and plug in!
Matching example with single covariate

Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 6 9 1 3
2 1 0 1 1
3 0 9 1 10
4 0 0 2
5 9 0 3
6 1 0 -2
7 1 0 -4

Question: What is $\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})$?
\[
\hat{\delta}_{ATT} = \frac{1}{3}(6-9) + \frac{1}{3}(1-0) + \frac{1}{3}(0-9) = -3.7
\]
A Training Example
Trainees Non-Trainees
unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800 8 28 8800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800 8 28 8800
Average: 28.5 16426 20 23 9500 Average: 28.5 13982
21 32 25900
Average: 33 20724
Age Distribution: Before Matching

[Figure: age histograms (frequency by age, 20-60) for panel A: Trainees and panel B: Non-Trainees, before matching.]
Age Distribution: After Matching

[Figure: age histograms (frequency by age, 20-60) for panel A: Trainees and panel B: matched Non-Trainees, after matching.]
Training Effect Estimates

Difference in average earnings between trainees and non-trainees

Before matching:

16426 − 20724 = −4298

After matching:

16426 − 13982 = 2444


Alternative distance metric: Euclidean distance

When the vector of matching covariates, $X = (X_1, X_2, \ldots, X_k)'$, has more than one dimension (k > 1), we will need a new definition of distance to measure "closeness".
Definition: Euclidean distance
\[
\|X_i - X_j\| = \sqrt{(X_i - X_j)'(X_i - X_j)} = \sqrt{\sum_{n=1}^{k} (X_{ni} - X_{nj})^2}
\]

Comment: The Euclidean distance is not invariant to changes in


the scale of the X ’s. For this reason, alternative distance metrics
that are invariant to changes in scale are used
Normalized Euclidean distance

Definition: Normalized Euclidean distance


A commonly used distance is the normalized Euclidean distance:
\[
\|X_i - X_j\| = \sqrt{(X_i - X_j)'\, \hat{V}^{-1}\, (X_i - X_j)}
\]
where $\hat{V}$ is the diagonal matrix of sample variances,
\[
\hat{V} = \begin{pmatrix} \hat{\sigma}_1^2 & 0 & \cdots & 0 \\ 0 & \hat{\sigma}_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\sigma}_k^2 \end{pmatrix}
\]
Notice that the normalized Euclidean distance is equal to:
\[
\|X_i - X_j\| = \sqrt{\sum_{n=1}^{k} \frac{(X_{ni} - X_{nj})^2}{\hat{\sigma}_n^2}}
\]
Thus, if there are changes in the scale of $X_{ni}$, these changes also affect $\hat{\sigma}_n^2$, and the normalized Euclidean distance does not change
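A toy example (hypothetical numbers) of why the normalization matters: let $X_i = (2, 100)$ and $X_j = (3, 90)$ with sample variances $\hat{\sigma}_1^2 = 1$ and $\hat{\sigma}_2^2 = 400$. The raw Euclidean distance is dominated by the second, high-variance covariate, while the normalized distance is not:
\[
\sqrt{(2-3)^2 + (100-90)^2} = \sqrt{101} \approx 10.05, \qquad \sqrt{\frac{(2-3)^2}{1} + \frac{(100-90)^2}{400}} = \sqrt{1.25} \approx 1.12
\]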
Mahalanobis distance

Definition: Mahalanobis distance


The Mahalanobis distance is the scale-invariant distance metric:
\[
\|X_i - X_j\| = \sqrt{(X_i - X_j)'\, \hat{\Sigma}_X^{-1}\, (X_i - X_j)}
\]
where $\hat{\Sigma}_X$ is the sample variance-covariance matrix of X.
Arbitrary weights

Or, you could just create your own arbitrary weights


\[
\|X_i - X_j\| = \sqrt{\sum_{n=1}^{k} \omega_n \cdot (X_{ni} - X_{nj})^2}
\]

(with all ωn ≥ 0) so that we assign large ωn ’s to those covariates


that we want to match particularly well.
Matching and the Curse of Dimensionality

Dimensionality creates headaches for us in matching.


Bad news: Matching discrepancies ||Xi − Xj(i) || tend to
increase with k, the dimension of X
Good news: Matching discrepancies converge to zero . . .
Bad news: . . . but they converge very slowly if k is large
Good news: Mathematically, it can be shown that $\|X_i - X_{j(i)}\|$ converges to zero at the same rate as $\frac{1}{N^{1/k}}$
Bad news: It’s hard to find good matches when X has a large
dimension: you need many observations if k is big.
Deriving the matching bias

Derive the matching bias by first writing out the sample ATT estimate:

\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)}),

where each i and j(i) are matched units, X_i \approx X_{j(i)} and D_{j(i)} = 0.

Define potential outcomes and the switching equation:

\mu^0(x) = E[Y | X = x, D = 0] = E[Y^0 | X = x]
\mu^1(x) = E[Y | X = x, D = 1] = E[Y^1 | X = x]
Y_i = \mu^{D_i}(X_i) + \varepsilon_i

Substitute and distribute terms:

\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \big[ (\mu^1(X_i) + \varepsilon_i) - (\mu^0(X_{j(i)}) + \varepsilon_{j(i)}) \big]
                  = \frac{1}{N_T} \sum_{D_i=1} (\mu^1(X_i) - \mu^0(X_{j(i)})) + \frac{1}{N_T} \sum_{D_i=1} (\varepsilon_i - \varepsilon_{j(i)})
Deriving the matching bias

The difference between the sample estimate and the population parameter is:

\hat{\delta}_{ATT} - \delta_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (\mu^1(X_i) - \mu^0(X_{j(i)})) - \delta_{ATT} + \frac{1}{N_T} \sum_{D_i=1} (\varepsilon_i - \varepsilon_{j(i)})

Algebraic manipulation and simplification:

\hat{\delta}_{ATT} - \delta_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (\mu^1(X_i) - \mu^0(X_i) - \delta_{ATT})
                                  + \frac{1}{N_T} \sum_{D_i=1} (\varepsilon_i - \varepsilon_{j(i)})
                                  + \frac{1}{N_T} \sum_{D_i=1} (\mu^0(X_i) - \mu^0(X_{j(i)})).
Deriving the matching bias

Apply the central limit theorem: the scaled difference

\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT})

converges to a Normal distribution with zero mean. But, however,

E[\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT})] = E[\sqrt{N_T}(\mu^0(X_i) - \mu^0(X_{j(i)})) | D = 1].

Now consider the implications if k is large:
The difference between X_i and X_{j(i)} converges to zero very slowly
The difference \mu^0(X_i) - \mu^0(X_{j(i)}) converges to zero very slowly
E[\sqrt{N_T}(\mu^0(X_i) - \mu^0(X_{j(i)})) | D = 1] may not converge to zero and can be very large!
E[\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT})] may not converge to zero because the bias from the matching discrepancies dominates the matching estimator!
Bias is often an issue when we match in many dimensions
Solutions to matching bias problem

The bias of the matching estimator is caused by large matching


discrepancies ||Xi − Xj(i) ||. The curse of dimensionality virtually
guarantees this. However:
1 But the matching discrepancies are observed. We can always
check in the data how well we’re matching the covariates.
2 For δbATT we can always make the matching discrepancies
small by using a large reservoir of untreated units to select the
matches (that is, by making NC large).
3 If the matching discrepancies are large, so we are worried
about potential biases, we can apply bias correction techniques
4 Partial solution: propensity score methods (coming soon. . . )
Matching with bias correction

Each treated observation contributes \mu^0(X_i) - \mu^0(X_{j(i)}) to the bias.

Bias-corrected (BC) matching:

\hat{\delta}_{ATT}^{BC} = \frac{1}{N_T} \sum_{D_i=1} \big[ (Y_i - Y_{j(i)}) - (\hat{\mu}^0(X_i) - \hat{\mu}^0(X_{j(i)})) \big]

where \hat{\mu}^0(x) is an estimate of E[Y | X = x, D = 0], for example using OLS.

Under some conditions, the bias correction eliminates the bias of the matching estimator without affecting the variance.
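In Stata, a sketch of the regression-based correction uses the biasadj() option of -teffects nnmatch- (variable names are illustrative; check the help file for the exact option behavior):

teffects nnmatch (re78 age educ re74 re75) (treat), atet nneighbor(1) ///
    biasadj(age educ re74 re75)     // OLS bias adjustment for the listed covariates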
Bias adjustment in matched data

                Potential Outcome
unit   under Treatment   under Control
 i           Yi1              Yi0         Di    Xi
 1           10                8           1     3
 2            4                1           1     1
 3           10                9           1    10
 4            .                8           0     4
 5            .                1           0     0
 6            .                9           0     8

\hat{\delta}_{ATT} = \frac{10 - 8}{3} + \frac{4 - 1}{3} + \frac{10 - 9}{3} = 2

For the bias correction, estimate \hat{\mu}^0(X) = \hat{\beta}_0 + \hat{\beta}_1 X = 2 + X on the control units. Then

\hat{\delta}_{ATT} = \frac{(10 - 8) - (\hat{\mu}^0(3) - \hat{\mu}^0(4))}{3} + \frac{(4 - 1) - (\hat{\mu}^0(1) - \hat{\mu}^0(0))}{3} + \frac{(10 - 9) - (\hat{\mu}^0(10) - \hat{\mu}^0(8))}{3} = 1.33
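If it helps to see the arithmetic, here is a sketch that reproduces the toy numbers by hand in Stata (the data and the matched pairs are typed in from the table above; nothing here is a real dataset):

clear
input id y1 y0 d x
1 10  8 1  3
2  4  1 1  1
3 10  9 1 10
4  .  8 0  4
5  .  1 0  0
6  .  9 0  8
end
regress y0 x if d == 0                                           // fits mu0_hat(X) = 2 + X exactly on the controls
display ((10-8) + (4-1) + (10-9))/3                              // unadjusted ATT = 2
display (((10-8)-(5-6)) + ((4-1)-(3-2)) + ((10-9)-(12-10)))/3    // bias-corrected ATT = 1.33, using mu0_hat = 2 + X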
Matching bias: Implications for practice

Bias arises because of the effect of large matching discrepancies on


µ0 (Xi ) − µ0 (Xj(i) ). To minimize matching discrepancies:
1 Use a small M (e.g., M = 1). Larger values of M produce
large matching discrepancies.
2 Use matching with replacement. Because matching with
replacement can use untreated units as a match more than
once, matching with replacement produces smaller matching
discrepancies than matching without replacement.
3 Try to match covariates with a large effect on µ0 (·)
particularly well.
Large sample distribution for matching estimators

Matching estimators have a Normal distribution in large samples (provided the bias is small):

\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT}) \xrightarrow{d} N(0, \sigma^2_{ATT})

For matching without replacement, the "usual" variance estimator:

\hat{\sigma}^2_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} - \hat{\delta}_{ATT} \right)^2

is valid.
Large sample distribution for matching estimators

For matching with replacement:

\hat{\sigma}^2_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} - \hat{\delta}_{ATT} \right)^2 + \frac{1}{N_T} \sum_{D_i=0} \frac{K_i(K_i - 1)}{M^2} \widehat{var}(\varepsilon_i | X_i, D_i = 0)

where K_i is the number of times observation i is used as a match.

\widehat{var}(Y_i | X_i, D_i = 0) can also be estimated by matching. For example, take two observations with D_i = D_j = 0 and X_i \approx X_j; then

\widehat{var}(Y_i | X_i, D_i = 0) = \frac{(Y_i - Y_j)^2}{2}

is an unbiased estimator of var(\varepsilon_i | X_i, D_i = 0)

The bootstrap doesn't work!
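The analytic variance above is what -teffects nnmatch- reports by default (Abadie-Imbens standard errors); a sketch follows, with the caveat that the vce() suboption name should be verified in the help file, and with the reminder that bootstrapping this estimator is not valid:

* Variable names illustrative; nn(#) sets the matches used for the conditional-variance estimate.
teffects nnmatch (re78 age educ re74 re75) (treat), atet vce(robust, nn(2))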
Avoiding dimensionality problems

Curse of dimensionality makes matching on K covariates


challenging
Rubin (1977) and Rosenbaum and Rubin (1983) develop a method that condenses the K covariates used for adjustment into a single scalar
Insofar as treatment is as good as random conditional on the K covariates, one can use the propensity score to adjust for confounders

Least squares

OLS is best linear predictor and approximation to the


conditional expectation function
But if probability of treatment is nonlinear, this conditional
mean may be less informative
Propensity scores relax the linearity assumption and have other
advantages
The Idea behind propensity scores

Earlier we matched on X ’s to compare units “near” one


another based on some distance but matching discrepancies
and sparseness created problems
Propensity scores summarize covariate information about
treatment selection into a single number bounded between 0
and 1 (i.e., a probability)
Now we compare units with similar estimated probabilities of
treatment
And once we adjust using the propensity score, we no longer
need to adjust for X
Identifying assumptions

We need two assumptions for propensity scores to help us


identify causal effects
1 Conditional independence, or unconfoundedness
2 Common support or overlap
The first is based on state of the art and institutional details
sufficient to warrant such a judgment call, making propensity
scores arguably more, not less, advanced
The latter is testable
Identifying assumption I: Conditional independence
(Yi0, Yi1) ⊥⊥ D | Xi. There exists a set X of observable covariates such that, after controlling for these covariates, treatment assignment is independent of potential outcomes.

Conditional on X, treatment assignment is 'as good as random'.
'As good as random' is English for the potential-outcomes jargon 'independent of potential outcomes'
Also sometimes called 'ignorable treatment assignment', 'unconfoundedness', 'selection on observables', 'exogeneity', 'conditional zero mean'
CIA is assumed, not tested, because potential outcomes are missing. Consult your doctor
Identifying assumption II: Common support
For ranges of X , there is a positive probability of being both
treated and untreated

We’ll talk about the propensity score in just a second; for now
this assumption is only about X
Assumption requires that there are units in both treatment
and control for the range of propensity score
Recall, RDD did not have common support so relied on
extrapolation sensitive to functional form assumptions
Common support ensures we can find similar enough donors in
the control pool
Unlike CIA, common support is testable
Formal Definition

Definition of Propensity score


A propensity score is a number bounded between 0 and 1
measuring the probability of treatment assignment conditional on a
vector of confounding variables: p(X ) = Pr (D = 1|X )

Two Necessary Identification Assumptions:


1 (Y0, Y1) ⊥⊥ D | X (CIA)
2 0 < Pr (D = 1|X ) < 1 (common support)
Steps

1 Estimate the propensity score using logit/probit


2 Estimate a particular ATE incorporating the propensity score
using stratification, imputation, regression, or inverse
probability weighting
3 Estimate standard errors
Estimating the propensity score

Estimate the conditional probability of treatment using probit


or logit model
Pr (Di = 1|Xi ) = F (βXi )
Use the estimated coefficients to calculate the propensity score for each unit i:

\hat{\rho}_i = F(X_i \hat{\beta})

The propensity score is the predicted conditional probability of treatment, i.e., the fitted probability for each unit – same thing
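A minimal first-stage sketch in Stata; the variable names (treat, age, educ, black, hispanic, married, nodegree, re74, re75) are assumptions standing in for whatever is in your data:

logit treat age educ black hispanic married nodegree re74 re75
predict pscore, pr        // rho_hat_i = F(X_i * beta_hat): the fitted probability of treatment
summarize pscore if treat == 1
summarize pscore if treat == 0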
Identification

A group of unit’s average treatment effect may depend on


some characteristic, X

E [δi (Xi )] = E [Yi1 − Yi0 |Xi = x]


= E [Yi1 |Xi = x] − E [Yi0 |Xi = x]

CIA allows us to substitute

E[Yi | Di = 1, Xi = x] = E[Yi1 | Xi = x]

and similarly for the Y0 term, using the switching equation
Common support allows us to estimate both terms
Visualizing the propensity score theorem

[Figure: DAG with nodes D, Y, and p(X) – conditioning on p(X) closes the back-door path between D and Y]
It’s similar to the visualization of the RDD strategy from earlier


except that it achieves common support
Propensity score theorem
If (Y1, Y0) ⊥⊥ D | X (CIA), then (Y1, Y0) ⊥⊥ D | ρ(X), where ρ(X) = Pr(D = 1|X) is the propensity score

Conditioning on the propensity score is enough to have


independence between D and (Y 1 , Y 0 ) (Rosenbaum and
Rubin 1983)
Valuable theorem because of dimension reduction and
convergence rate issues which can introduce biases
Big picture: You can toss X out if you have ρb because all
information from X have been absorbed into ρb
Proof

Before diving into the proof, first recognize that

Pr(D = 1 | Y0, Y1, ρ(X)) = E[D | Y0, Y1, ρ(X)]

because

E[D | Y0, Y1, ρ(X)] = 1 × Pr(D = 1 | Y0, Y1, ρ(X)) + 0 × Pr(D = 0 | Y0, Y1, ρ(X))

and the second term cancels out.

Proof.
Assume (Y1, Y0) ⊥⊥ D | X (CIA). Then:

Pr(D = 1 | Y1, Y0, ρ(X)) = E[D | Y1, Y0, ρ(X)]                             (see above)
                         = E[ E[D | Y1, Y0, ρ(X), X] | Y1, Y0, ρ(X) ]       (by LIE)
                         = E[ E[D | Y1, Y0, X] | Y1, Y0, ρ(X) ]             (given X, we know ρ(X))
                         = E[ E[D | X] | Y1, Y0, ρ(X) ]                     (by CIA)
                         = E[ ρ(X) | Y1, Y0, ρ(X) ]                         (propensity score definition)
                         = ρ(X)
Similar proof

We can also show that the probability of treatment conditional on the propensity score is the propensity score, using a similar argument:

Pr(D = 1 | ρ(X)) = E[D | ρ(X)]                (previous slide)
                 = E[ E[D | X] | ρ(X) ]        (LIE)
                 = E[ ρ(X) | ρ(X) ]            (definition)
                 = ρ(X)

and Pr(D = 1 | Y1, Y0, ρ(X)) = Pr(D = 1 | ρ(X)) by CIA
Unbiased estimation of the ATE

Exact methods to do this to be discussed later, but until then, we


can say this:
Corollary: Estimating the ATE
If (Y 1 , Y 0 ) ⊥
⊥ D|X , we can estimate average treatment effects:

E [Y 1 − Y 0 |ρ(X )] = E [Y |D = 1, ρ(X )] − E [Y |D = 0, ρ(X )]


Balancing property

Because the propensity score is a function of X, we know:

Pr (D = 1|X , ρ(X )) = Pr (D = 1|X )


= ρ(X )

Conditional on ρ(X ), the probability that D = 1 does not


depend on X .
D and X are independent conditional on ρ(X):

D ⊥⊥ X | ρ(X)
Balancing property

So we obtain the balancing property of the propensity score:

Pr (X |D = 1, p(X )) = Pr (X |D = 0, p(X ))

Conditional on the propensity score, the distribution of the covariates is the same for treatment and control group units
We can use this to check if our estimated propensity score
actually produces balance:

Pr (X |D = 1, pb(X )) = Pr (X |D = 0, pb(X ))
Propensity score theorem

This theorem tells us the only covariate we need to adjust for


is the conditional probability of treatment itself (i.e., the
propensity score)
It does not tell us which method we should use to do that
adjustment, though, which is an estimation question
There are options: inverse probability weighting, forms of
imputation, stratification, and sometimes even regressions will
incorporate the score as weights
Checking the common support assumption

We can summarize the propensity scores in the treatment and


control group and count how many units are off-support
Crump, et al. (2009) offer a rule of thumb: keep scores on
interval [0.1,0.9].
Tossing out observations beyond those min and max scores
A histogram of propensity scores by treatment and control
group also highlights the overlap problem; software also can
help such as teffects overlap
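A sketch of these checks once the estimated score is in memory (variable names illustrative):

summarize pscore if treat == 1, detail
summarize pscore if treat == 0, detail
histogram pscore, by(treat) bin(40)       // eyeball the overlap by group
count if pscore < 0.1 | pscore > 0.9      // Crump et al. (2009) rule-of-thumb count of off-support units
* after any -teffects- fit, -teffects overlap- plots the score densities for you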
Inverse probability weighting

I really like the simple method of inverse probability weighting


aesthetically because there are no black boxes; it’s all
non-parametric averaging done through a particular kind of
weights based on the propensity score
IPW involves fewer implementation choices like number of
neighbors, common support, etc.
And because IPW is a smooth estimator, the bootstrap is valid
for inference unlike covariate nearest neighbor matching which
Abadie and Imbens (2008) show is not valid

Inverse probability weighting

IPW is basically a reweighting of the outcomes by the


propensity score developed in Robins and Rotnitzky (1995),
Imbens (2000), Hirano and Imbens (2001)
The weights can be expressed in two ways – without
normalization (Horvitz and Thompson 1952) or normalized
(Hajek 1971) – the difference being how well either approach
can handle extreme values of the propensity score; the
differences come out of the survey sampling literature
The notation is far scarier than what we are actually doing, so I'll show you this in a Stata and R simulation to help pin down the intuition a little better
We’ll start with the basic idea using the Horvitz and Thompson
(1952) expression of the weights as it’s not as messy.
Inverse Probability Weighting

Proposition
If (Y1, Y0) ⊥⊥ D | X, then

\delta_{ATE} = E[Y^1 - Y^0] = E\left[ Y \cdot \frac{D - \rho(X)}{\rho(X) \cdot (1 - \rho(X))} \right]

\delta_{ATT} = E[Y^1 - Y^0 | D = 1] = \frac{1}{Pr(D = 1)} \cdot E\left[ Y \cdot \frac{D - \rho(X)}{1 - \rho(X)} \right]
IPW Proof

Proof.

E\left[ Y \cdot \frac{D - \rho(X)}{\rho(X)(1 - \rho(X))} \Big| X \right] = E\left[ \frac{Y}{\rho(X)} \Big| X, D = 1 \right] \rho(X) + E\left[ \frac{-Y}{1 - \rho(X)} \Big| X, D = 0 \right] (1 - \rho(X))
                                                                        = E[Y | X, D = 1] - E[Y | X, D = 0]

and the results follow from integrating over P(X) and P(X | D = 1).
Weighting on the propensity score

Previous formulas used population concepts. Switching to samples, we use a two-step estimator:
1 Estimate the propensity score: \hat{\rho}(X)
2 Use the estimated score to produce analog estimators. Let \hat{\delta}_{ATE} and \hat{\delta}_{ATT} be estimates of the ATE and ATT parameters:

\hat{\delta}_{ATE} = \frac{1}{N} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{\rho}(X_i)}{\hat{\rho}(X_i) \cdot (1 - \hat{\rho}(X_i))}

\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{\rho}(X_i)}{1 - \hat{\rho}(X_i)}
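As a sketch, both analog estimators can be computed by hand once pscore is in memory (variable names as before; inference is discussed on the next slide):

gen ate_i = re78 * (treat - pscore) / (pscore * (1 - pscore))
gen att_i = re78 * (treat - pscore) / (1 - pscore)

quietly summarize ate_i
display "IPW ATE = " r(mean)          // (1/N) x sum of ate_i over all i

quietly count if treat == 1
scalar NT = r(N)
quietly summarize att_i
display "IPW ATT = " r(sum) / NT      // (1/NT) x sum of att_i over all i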
Weighting on the propensity score

Standard errors can be constructed a few different ways:


We need to adjust the standard errors for first-step estimation
of ρ(X )
Parametric first step: Newey and McFadden (1994)
Non-parametric first step: Newey (1994)
Or bootstrap the entire two-step procedure (Adudumilli 2018
and Bodory et al. 2020)
Implementation with software

I like estimating with IPW manually because I like being


reminded how simple a procedure it is
But Stata’s -teffects- and R’s -ipw- do it too, and -teffects-
uses the Hajek normalization weights which will produce
identical estimates to my program
My programs don’t do the inference, but I think that would be
fun and easy to do using the bootstrap
Let’s look at it real quickly now with an example from
LaLonde’s 1986 paper on the NSW job trainings program
(which I’ll discuss again soon)
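A hedged sketch of both routes – Stata's packaged IPW command and a bootstrap of the entire two-step procedure (the program name ipw_att and all variable names are made up for illustration):

teffects ipw (re78) (treat age educ black hispanic married nodegree re74 re75, logit), atet

capture program drop ipw_att
program define ipw_att, rclass
    tempvar ps s
    quietly logit treat age educ black hispanic married nodegree re74 re75
    quietly predict `ps', pr
    quietly gen `s' = re78 * (treat - `ps') / (1 - `ps')
    quietly count if treat == 1
    local NT = r(N)
    quietly summarize `s'
    return scalar att = r(sum) / `NT'
end

bootstrap att = r(att), reps(500): ipw_att     // resamples the data and redoes both steps each time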
Double robust estimators

Lots of papers: Robins and Rotnitzky (1995) originally, Hirano


and Imbens (2001), etc.
Basic idea is you are going to control for covariates twice:
through regression and then through the propensity score
We say that estimators combining regression with IPW are
double robust so long as
The regression for the outcome is properly specified, or
The propensity score is properly specified
Hence the name “double robust”. We give ourselves two
chances to get it right (either/or not both/and)
Estimation of outcome model

y_i = \alpha_0 + X_i \beta + \tilde{\alpha}_1 D_i + \theta_0 \frac{D_i}{\hat{\rho}(X_i)} + \theta_1 \frac{1 - D_i}{1 - \hat{\rho}(X_i)} + \tilde{\varepsilon}_i
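A sketch of estimating that outcome model directly (the two correction terms are built from the earlier pscore; names illustrative), plus Stata's packaged doubly robust estimator:

gen d_over_p   = treat / pscore
gen dc_over_pc = (1 - treat) / (1 - pscore)
regress re78 treat age educ black hispanic married nodegree re74 re75 ///
    d_over_p dc_over_pc
* Stata also ships doubly robust estimators, e.g. augmented IPW:
teffects aipw (re78 age educ re74 re75) (treat age educ re74 re75, logit)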
Propensity score matching

Matching, or what I like to call “imputation”, is another way


that utilizes the ρb
They all use the same first stage, but differ on their second
and third stages
Part of the second stage may be imposing common support
through “trimming”, but for different reasons because now this
idea of distance is entering and maybe you think some units
are “too far away” to be relevant counterfactuals

Standard matching strategy

Pair each treatment unit i with one or more comparable


control group unit j, where comparability is in terms of
proximity to the estimated propensity score
Impute the unit’s missing counterfactual outcome Yi(j) based
on the unit or units chosen in the previous step
If more than one are “nearest neighbors”, then use the
neighbors’ weighted outcomes
Y_{i(j)} = \sum_{j \in C(i)} w_{ij} Y_j

where C(i) is the set of neighbors with W = 0 of the treatment unit i and w_{ij} is the weight of control group unit j, with \sum_{j \in C(i)} w_{ij} = 1
Imputing the counterfactuals

A parameter of interest:

E [Yi1 |Di = 1] − E [Yi0 |Di = 1]

We estimate it as follows
1 X  
[
ATT = = Yi − Yi(j)
NT
i:Wi =1

where NT is the number of matched treatment units in the sample.


Note the difference between imputation and weighting
Matching methods

The probability of observing two units with exactly the same


propensity score is in principle zero because p(x) is continuous
Several matching methods have been proposed in the
literature, but the most widely used are:
Stratification matching
Nearest-neighbor matching (with or without caliper)
Radius matching
Kernel matching
Typically, one treatment unit i is matched to several control
units j, but sometimes one-to-one matching is used
Stratification

Stratification is used to force covariate balance by finding


strata where there is no difference in mean covariate values.
You then use those strata to calculate within differences in
means and sum over properly weighted strata. See Becker and
Ichino (2002)
Stratification is a brute force method for imposing balance by
grouping the data and testing for differences in covariate
means
It’s actually kind of similar to coarsened exact matching, only
using the propensity score for the “stratification” not the
covariates
Stratification: Achieving Balance

The algorithm is brute force covariate balancing


1 Sort the data by propensity score and divide into groups of
observations with similar propensity scores (e.g., percentiles)
2 Within each group, test (using a t-test) whether the means of
the covariates (X ) are equal between treatment and control
3 If so, then stop. If not, it means the covariates aren’t balanced
within that group. Divide the group in half and repeat
4 If a particular covariate is unbalanced for multiple groups,
modify the initial logit or probit equation by including higher
order terms and/or interactions with that covariate and repeat
Historically this could be done with -pscore2.ado- or by hand if one felt so inclined, but it was dropped with -teffects-
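A sketch of the within-stratum balance check in step 2 (five starting blocks, and all variable names, are illustrative):

xtile block = pscore, nquantiles(5)                      // start with propensity score quintiles
foreach x of varlist age educ black hispanic married nodegree {
    forvalues b = 1/5 {
        capture quietly ttest `x' if block == `b', by(treat)
        if _rc == 0 display "`x', block `b': t = " %6.2f r(t)
    }
}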
Nearest Neighbor

Pretty similar to covariate matching. The formula is

\widehat{ATT}_{NN} = \frac{1}{N_T} \sum_{i: W_i = 1} \left( Y_i - \sum_{j \in C(i)_M} w_{ij} Y_j \right)

N_T is the number of treatment units i
w_{ij} is equal to 1/N_C if j is a matched control unit and zero otherwise; N_C is the number of matched control units j
And unit j is chosen as a control for i if its propensity score is nearest to that of i
NN Matching: Bias vs. Variance

But how far away on the propensity score are you willing to go for a match? Herein lie the different types of matching proposed
Matching just one nearest neighbor minimizes bias at the cost
of larger variance
Matching using additional nearest neighbors increases the bias
but decreases the variance
Matching with or without replacement
with replacement keeps bias low at the cost of larger variance
without replacement keeps variance low but at the cost of
potential bias
Distance between treatment and control units

What was historically done was limiting “distance” through


various ad hoc choices
Imagine these choices as creating a lasso (like the cowboy rope)
Anything within the lasso could be used for the imputation;
anything outside the lasso could not
There were two common ways – caliper matching and radius
matching.
Caliper matching

Caliper matching is a variation on NN matching that tries to build brakes into the algorithm so as to avoid "bad neighbors"
It does this by imposing a tolerable maximum distance (e.g., 0.2 units in the propensity score away from a treatment unit i's propensity score)
Note – this is a one-to-one imputation, and if no control group unit j exists within that "caliper", then treatment unit i is discarded
Means we aren’t estimating the ATE anymore once we start
dropping units
It’s difficult to know what this caliper should be ex ante, hence
why I said it is somewhat ad hoc
Radius matching

Each treatment unit i is matched with the control group units


whose propensity score are in a predefined neighborhood of the
propensity score of the treatment unit.
All the control units with ρbj falling within a radius r from ρbi
are matched to the treatment unit i – this is what
distinguishes it from calipers, and makes it more similar to
covariate matching (Abadie and Imbens 2006, 2008)
The smaller the radius, the better the quality of the matches,
but the higher the possibility some treatment units are not
matched because the neighborhood does not contain control
group units j
Software

I think you can use -teffects, psmatch- to get at these two


nearest neighbor approaches by setting the number of matches
You can use -pscore2- for stratification, but I think the
standard errors are wrong, so you may need to just do it
manually using bootstrapping or variance approximation, and
that may be a pain to program up
Not sure of the R command, but I know it’s out there
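A sketch of those nearest-neighbor choices via -teffects psmatch- (variable names illustrative; the caliper() option's availability depends on your Stata version):

teffects psmatch (re78) (treat age educ black hispanic married nodegree re74 re75, logit), ///
    atet nneighbor(1)
teffects psmatch (re78) (treat age educ black hispanic married nodegree re74 re75, logit), ///
    atet nneighbor(3) caliper(0.05)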
King and Nielsen (2019)

There is a King and Nielsen (2019) critique of these methods


that is popularly known but not popularly studied
King and Nielsen (2019) is not a critique of the propensity
score, because it does not apply to stratification, regression
adjustment, or inverse probability weighting
It only applies to nearest neighbor and is related to forced
balance through trimming and a myriad of other common
choices made by the researcher
"[The] more balanced the data, or the more balanced it becomes by [trimming] some of the observations through matching, the more likely propensity score matching will degrade inferences." – King and Nielsen (2019)
Examples of propensity score matching

Workhorse example of propensity score matching is the Job


Trainings Program (NSW)
First studied by LaLonde (1986) evaluating multiple
econometric models for program evaluation
All the standard estimators failed to estimate the known ATE
when replacing experimental controls with non-experimental
controls – even difference-in-differences
Dehejia and Wahba (1999; 2002) use LaLonde’s data with
propensity score matching and found better results
Critiques by Petra Todd, Jeff Smith and others followed which
I won’t review here for sake of time
Description of NSW Job Trainings Program

The National Supported Work Demonstration (NSW), operated by


Manpower Demonstration Research Corp in the mid-1970s:
was a temporary employment program designed to help
disadvantaged workers lacking basic job skills move into the
labor market by giving them work experience and counseling in
a sheltered environment
was also unique in that it randomly assigned qualified
applicants to training positions:
Treatment group: received all the benefits of NSW program
Control group: left to fend for themselves
admitted AFDC females, ex-drug addicts, ex-criminal
offenders, and high school dropouts of both sexes
NSW Program

Treatment group members were:


guaranteed a job for 9-18 months depending on the target
group and site
divided into crews of 3-5 participants who worked together and
met frequently with an NSW counselor to discuss grievances
and performance
paid for their work
Control group members were randomized too, so in expectation they were the same
Note: the randomization balanced observables and
unobservables across the two arms, thus enabling the
estimation of an ATE for the people who self-selected into the
program
NSW Program

Other details about the NSW program:


Wages: NSW offered the trainees lower wage rates than they
would’ve received on a regular job, but allowed their earnings
to increase for satisfactory performance and attendance
Post-treatment: after their term expired, they were forced to
find regular employment
Job types: varied within sites – gas station attendant, working
at a printer shop – and males and females were frequently
performing different kinds of work
NSW Data

NSW data collection:


MDRC collected earnings and demographic information from
both treatment and control at baseline and every 9 months
thereafter
Conducted up to 4 post-baseline interviews
Different sample sizes from study to study can be confusing, but have simple explanations
NSW Data

Estimation:
NSW was a randomized job trainings program; therefore
estimating the average treatment effect is straightforward:
\frac{1}{N_t} \sum_{D_i=1} Y_i - \frac{1}{N_c} \sum_{D_i=0} Y_i \approx E[Y^1 - Y^0]

in large samples assuming treatment selection is independent of potential outcomes (randomization) – i.e., (Y0, Y1) ⊥⊥ D.
NSW worked: Treatment group participants’ real earnings
post-treatment (1978) was positive and economically
meaningful – ≈ $900 (LaLonde 1986) to $1,800 (Dehejia and
Wahba 2002) depending on the sample used
LaLonde, Robert J. (1986). “Evaluating the Econometric
Evaluations of Training Programs with Experimental Data”.
American Economic Review.

LaLonde’s study was not an evaluation of the NSW program, as


that had been done, but rather an evaluation of econometric
models done by:
replacing the experimental NSW control group with
non-experimental control group drawn from two nationally
representative survey datasets: Current Population Survey
(CPS) and Panel Study of Income Dynamics (PSID)
estimating the average effect using non-experimental workers
as controls for the NSW trainees
comparing his non-experimental estimates to the experimental
estimates of $900
LaLonde (1986)

LaLonde’s conclusion: available econometric approaches were


biased and inconsistent
His estimates were way off and usually the wrong sign
Conclusion was influential in policy circles and led to greater
push for more experimental evaluations
Imbalanced covariates for experimental and non-experimental
samples

                          CPS                      NSW
                  All            Controls      Trainees
                                 Nc = 15,992   Nt = 297
covariate         mean   (s.d.)  mean          mean       t-stat   diff
Black             0.09   0.28    0.07          0.80       47.04    -0.73
Hispanic          0.07   0.26    0.07          0.94        1.47    -0.02
Age               33.07  11.04   33.2          24.63      13.37     8.6
Married           0.70   0.46    0.71          0.17       20.54     0.54
No degree         0.30   0.46    0.30          0.73       16.27    -0.43
Education         12.0   2.86    12.03         10.38       9.85     1.65
1975 Earnings     13.51  9.31    13.65         3.1        19.63    10.6
1975 Unemp        0.11   0.32    0.11          0.37       14.29    -0.26
Dehejia and Wahba (1999)

Dehejia and Wahba (DW) update LaLonde’s original study


using propensity score matching
1 Dehejia, Rajeev H. and Sadek Wahba (1999). “Causal Effects
in Nonexperimental Studies: Reevaluating the Evaluation of
Training Programs”. Journal of the American Statistical
Association, vol. 94(448): 1053-1062
Can propensity score matching improve over the estimators
that LaLonde examined?
Proposition 2

X ⊥⊥ D | p(X)

Conditional on the propensity score, the covariates are


independent of the treatment, suggesting that the distribution
of covariate values should be the same for both treatment and
control groups
This can be checked as we have data on all three once we’ve
estimated the propensity score
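One way to operationalize that check in Stata is sketched below (variable names illustrative; the exact post-estimation commands should be confirmed in -help tebalance-):

teffects psmatch (re78) (treat age educ black hispanic married nodegree re74 re75, logit), atet
tebalance summarize        // standardized differences and variance ratios, raw vs. matched
tebalance density age      // compare a covariate's distribution across groups after matching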
Trimming the data

Common terms are “trimming” or “pruning”


Drop units which do not overlap in terms of estimated
propensity score
Sometimes as a rule of thumb, just keep units on the
[0.1,0.9] interval
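A sketch of the rule of thumb with the earlier pscore in memory (names illustrative):

count if !inrange(pscore, 0.1, 0.9)     // how many units are off-support under the [0.1, 0.9] rule?
keep if inrange(pscore, 0.1, 0.9)       // or generate a flag instead of dropping, to keep the full sample around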
[Figures and tables omitted: common support, overlap of the estimated propensity scores, results, covariate balance]
Estimation in Stata

I have written up code that will implement IPW on the DW


data
It’s nonparametric, so it doesn’t use any packages
But you are welcome to try some packages, particularly the
-teffects- command
Kernel matching

Alternatively we can perform propensity score matching with a


kernel-based method.
Notice on the next slide that the estimate of the ATT switches
sign relative to that produced by the NN matching algorithm
Stata syntax

psmatch2 treated, pscore(score) outcome(re78) kernel k(normal) bw(0.01)
pstest2 age black hispanic married educ nodegree re78, sum graph
Matching vs. Propensity score
Table 2. Experimental and nonexperimental estimates for the NSW data

M=1 M=4 M = 16 M = 64 M = 2490


Est. (SE) Est. (SE) Est. (SE) Est. (SE) Est. (SE)
Panel A:
Experimental estimates
Covariate matching 1.22 (0.84) 1.99 (0.74) 1.75 (0.74) 2.20 (0.70) 1.79 (0.67)
Bias-adjusted cov matching 1.16 (0.84) 1.84 (0.74) 1.54 (0.75) 1.74 (0.71) 1.72 (0.68)
Pscore matching 1.43 (0.81) 1.95 (0.69) 1.85 (0.69) 1.85 (0.68) 1.79 (0.67)
Bias-adjusted pscore matching 1.22 (0.81) 1.89 (0.71) 1.78 (0.70) 1.67 (0.69) 1.72 (0.68)
Regression estimates
Mean difference 1.79 (0.67)
Linear 1.72 (0.68)
Quadratic 2.27 (0.80)
Weighting on pscore 1.79 (0.67)
Weighting and linear regression 1.69 (0.66)
Panel B:
Nonexperimental estimates
Simple matching 2.07 (1.13) 1.62 (0.91) 0.47 (0.85) −0.11 (0.75) −15.20 (0.61)
Bias-adjusted matching 2.42 (1.13) 2.51 (0.90) 2.48 (0.83) 2.26 (0.71) 0.84 (0.63)
Pscore matching 2.32 (1.21) 2.06 (1.01) 0.79 (1.25) −0.18 (0.92) −1.55 (0.80)
Bias-adjusted pscore matching 3.10 (1.21) 2.61 (1.03) 2.37 (1.28) 2.32 (0.94) 2.00 (0.84)
Regression estimates
Mean difference −15.20 (0.66)
Linear 0.84 (0.88)
Quadratic 3.26 (1.04)
Weighting on pscore 1.77 (0.67)
Weighting and linear regression 1.65 (0.66)
NOTE: The outcome is earnings in 1978 in thousands of dollars.
Subsequent studies

Heckman et al. (1996, 1998) used experimental data from the


US National Job Training Partnership Act (JTPA)
They conclude that in order for matching estimators to have
low bias, it is important that the data include a rich set of
variables related to program participation and labor market
outcomes, that the nonexperimental comparison group be
drawn from the same local labor markets as the participants
and the dependent variable (typically earnings) be measured in
the same way for participants and nonparticipants
All three of these conditions fail to hold in DW (1999, 2002)
according to Smith and Todd (2005)
Smith and Todd

Difference-in-differences with propensity scores tended to work


well in Smith and Todd (2005)
But hard to make this a rule, because it’s hard to know ex ante
if we’ve specified the propensity score correctly (i.e., have CIA)
It is vital that you know your data if you're going to use these methods, which means understanding at a deep level the way in which selection (i.e., treatment assignment) works in your data
Beating a dead horse

The propensity score can make groups comparable but only on


the variables used to estimate the propensity score in the first
place. There is NO guarantee you are balancing on
unobserved covariates.
If you know that there are important unobservable variables,
you may need another tool.
Remember: randomization ensures that both observable and unobservable variables are balanced
Coarsened exact matching

There are two kinds of matching as we’ve said


1 Exact matching matches a treated unit to all of the control
units with the same covariate value. Sometimes this is
impossible (e.g., continuous covariate).
2 Approximate matching specifies a metric to find control units
that are close to the treated unit. Requires a distance metric,
such as Euclidean, Mahalanobis, or the propensity score. All of
which can be implemented in Stata’s teffects.
Iacus, King and Porro (2011) propose another version of matching they call coarsened exact matching (CEM). Some big picture ideas follow.

Checking imbalance

Iacus, King and Porro (2008) say that in practice approximate


matching requires setting the matching solution beforehand,
then checking for imbalance after.
Start over, repeat, until the user is exhausted by checking for
imbalance.
CEM Algorithm

1 Begin with covariates X . Make a copy called X ∗


2 Coarsen X ∗ according to user-defined cutpoints or CEM’s
automatic binning algorithm
Schooling → less than high school, high school, some college,
college, post college
3 Create one stratum per unique observation of X ∗ and place
each observation in a stratum
4 Assign these strata to the original data, X , and drop any
observation whose stratum doesn’t contain at least one treated
and control unit
You then add weights for stratum size and analyze without
matching.
Tradeoffs

Larger bins mean more coarsening. This results in fewer strata.


Fewer strata result in more diverse observations within the
same strata and thus higher imbalance
CEM prunes both treatment and control group units, which
changes the parameter of interest. Be transparent about this
as you’re not estimating the ATE or the ATT when you start
pruning
Benefits

The key benefit of CEM is that it is in a class of matching


methods called monotonic imbalance bounding
MIB methods bound the maximum imbalance in some feature
of the empirical distributions by an ex ante decision by the user
In CEM, this ex ante choice is the coarsening decision
By choosing the coarsening beforehand, users can control the
amount of imbalance in the matching solution
It’s also wicked fast.
Imbalance

There are several ways of measuring imbalance, but here we focus on the L1(f, g) measure, which is

L_1(f, g) = \frac{1}{2} \sum_{l_1 \ldots l_k} | f_{l_1 \ldots l_k} - g_{l_1 \ldots l_k} |

where f and g record the relative frequencies for the treatment and control group units.

Perfect global balance is indicated by L1 = 0. Larger values indicate larger imbalance between the groups, with a maximum of L1 = 1.
Stata

Download cem from Stata: ssc install cem, replace


You will automatically compute the global imbalance measure,
as well as several unidimensional measures of imbalance, when
using cem
I got a L1 = 0.55. What does it mean?
By itself, it’s meaningless. It’s a reference point between
matching solutions.
Once we have a matching solution, we will compare its L1 to
0.55 and gauge the increase in balance due to the matching
solution from that difference.
Thus L1 works for imbalance as R 2 works for model fit: the
absolute values mean less than comparisons between matching
solutions.
More Stata

Because cem bounds the imbalance ex ante, the most


important information in the Stata output is the number of
observations matched.
You can also choose the coarsening as opposed to relying on
the algorithm’s automated binning.
Once you have estimated the strata, you regress the outcome
onto the treatment and then weight the regression by
cem_weights. For instance,

regress re78 treat [iweight=cem_weights]

For more on this, see Blackwell, et al. Stata journal article


from 2009.
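A sketch of the full cem workflow (install with ssc install cem; the variables and cutpoints are illustrative – see the Blackwell et al. 2009 article for the exact syntax):

cem age (17.5 25.5 35.5 45.5 55.5) educ black hispanic married nodegree, treatment(treat)
* cem reports the multivariate L1 imbalance and creates cem_weights for the matched solution
regress re78 treat [iweight=cem_weights]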
The credibility revolution won

People like Heckman, Rubin, Ashenfelter, LaLonde, Angrist,


Krueger, Card, Imbens, Athey, Duflo, Abadie and many others
built on their backs a movement of sorts
The movement tried to shift economists and other social
scientists away from naive empirical methods that couldn’t
hope to estimate behavioral causal parameters towards things
that might
It’s in your best interest to study empirical methods, papers
that use them, how they communicate their findings and the
econometricians so that you can be ready when the
opportunity arises

Make the stone stoney again

A man walks up the mountain barefoot till he can't feel his feet again – Viktor Shklovsky said art is there to make "the stone feel like a stone again". I want research to feel like research again for you
Research is a quest for honest answers to good faith questions
that people care about
Most of all, research is truly fun for those who find such things
fun. It’s a form of self-expression and creativity for many of us
And it is fun to understand the answers you get and why those
answers are reliable which requires checklists, workflows,
clearly defined assumptions and proper tools for the job
It is not fun to get a bad answer to a poorly defined question
that you’re not confident about
A Priori Knowledge is Necessary for Identification

Think hard about these questions:


Can you write down a DAG or otherwise model the data
generating process?
What parameter do you think is interesting (e.g., ATT, LATE)
What are the assumptions needed for identifying that
parameter?
Pick estimators based on these questions, not the other way
around
Less so, pick data based on these questions, not the other way
around (usually)
What’s a good research design?

A good research design is one you are excited to tell people


about – that’s basically what characterizes all research designs,
whether propensity score matching or regression discontinuity
designs
Don’t get enamored by statistical modeling that obscures the
identification problem from plain sight.
Always understand what assumptions you must make, be clear
which parameters you are and are not identifying
Good research designs help you believe and not be afraid of
your answers
What’s the reason for your work?

Causal identification is a necessary but not a sufficient


condition for publishing well these days because the credibility
revolution won
Must also be an “interesting” question - admittedly subjective
If it must be interesting, then the best thing you can do for
yourself is choose a topic that you care about
Publishing is simply too difficult to be working on something
you find trivial
Free disposal advice

My colleague said “a good study and a bad study take the


same amount of time” – don’t work on stuff just to work on it
Finding projects with upside, in a set of potential projects, is a
good idea
The sooner you can cut bait on a bad project and move on,
the better – beware the sunk cost fallacy
For my personality, questions are practically existential quests
for the meaning of life, but not everyone needs extreme
incentives
So know yourself, work to your strengths, figure out things
that downplay your weaknesses, believe in yourself, find your
sponsors and mentors, seek help
