Nothing Special   »   [go: up one dir, main page]

Causal Inference and Research Design Scott Cunningham (Baylor)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 1056

Causal Inference and Research Design

Scott Cunningham (Baylor)

Figure: xkcd
Where to find this material

A lot of this material is drawn from my book,


Causal Inference: The Mixtape, which you can download from
my website www.scunning.com

Cunningham Causal Inference


Structure and Assessment

The fundamental theme linking all lectures will be the estimation of


causal effects
Part 1 covers “the core” of applied econometrics, including
hidden curriculum
Part 2 covers causality foundations like potential outcomes and
DAGs
Part 3 covers contemporary research designs
Stata and R

A secondary goal of the workshop is to provide you with


programming examples in Stata and R for implementing some but
not all of the procedures we’ll cover
R and Stata code are provided many procedures (with more to
come)
I wrote the Stata and had the written by my RAs reviewed by
a third exceptional student
Programs and data can be downloaded from my GitHub
repository (https://github.com/scunning1975/mixtape)
Textbooks

Helpful Textbooks imho


1 Cunningham (2018) (Mixtape) (under contract with Yale, but
I can’t share the new version yet – this deck is the closest
thing to it)
2 Angrist and Pischke (2009) Mostly Harmless Econometrics
(MHE)
3 Morgan and Winship (2014) Counterfactuals and Causal
Inference (MW)
Readings

Readings:
We will also discuss a number of papers in each lecture, each
of which you will need to learn inside and out.
Lecture slides and reading lists are available
Key literature is contained in the shared dropbox folder which
I’ll distribute beforehand
About me

Professor of economics at Baylor (Waco Texas),


Graduated in 2007 from University of Georgia with a field in
econometrics, IO, public, and labor field courses
I knew I was going to be an empiricist, so I made econometrics
my main field – passed field exam on second attempt
Since graduating I’ve focused on topics in crime and risky sex
such as sex work, drug policy, abortion, mental healthcare.
I knew I couldn’t achieve my goals without learning causal
inference which I could tell I had only a vague understanding of
This is because causal inference isn’t taught historically in
traditional econometrics

Cunningham Causal Inference


Sad story (to me!)

Once upon a time there was a boy who wrote a job market
paper using the NLSY97.
This boy presented the findings a half dozen times, spoke to
the media a few times, got 17 interviews at the ASSA, 7
flyouts, and an offer from Baylor
He submitted the job market paper to the Journal of Human
Resources, a top field journal in labor, and received a “revise
and resubmit” request from the editor (woo hoo!)
The horror!

But then digging into his one directory, he found countless


versions of his do file and hundreds of files with random names
And once he finally was able to get the code running again, he
found a critical coding error that when corrected (“destroyed”)
his results
The young boy was devastated and never resubmitted which
he does not recommend (but he was sad!)
All competent empirical work is a mousetrap

“Happy families are all alike; every unhappy family is unhappy in its
own way.” - Leo Tolstoy, Anna Karenina

“Good empirical work is all alike; every bad empirical work is bad in
its own way.” - Scott Cunningham, This slide

Cunningham Causal Inference


Cunningham Empirical Workflow Conjecture

The cause of most of your errors is not due to insufficient


knowledge of syntax in your chosen programming language
The cause of most of your errors is due to a poorly designed
empirical workflow
Workflow

Wikipedia definition:
“A workflow consists of an orchestrated and repeatable
pattern of activity, enabled by the systematic organization
of resources into processes that transform materials,
provide services, or process information.”
Dictionary definition:
“the sequence of industrial, administrative, or other
processes through which a piece of work passes from
initiation to completion.”
Empirical workflow

Workflow is a fixed set of routines you bind yourself to which


when followed identifies the most common errors
Think of it as your morning routine: alarm goes off, go to
wash up, make your coffee, check Twitter, repeat ad infinitum
Finding the outlier errors is a different task; empirical
workflows catch typical and common errors created by the
modal data generating processes
Why do we use checklists?

Before going on a trip, you use a checklist to make sure you


have everything you need
Charger (check), underwear (check), toothbrush (check),
passport (oops), . . .
The empirical checklist is solely referring to the intermediate
step between “getting the data” and “analyzing the data”
It largely focuses on ensuring data quality for the most
common, easiest to identify, situations you’ll find yourself in
Simple checks

Your checklist should be a few simple, yet non-negotiable,


programming commands and exercises to check for coding
errors
Let’s discuss a few
Time

People often think empirical research is about “getting the


data” and “analyzing the data”
They have an “off to the races” mindset
Just like running a marathon involves far far more time
training than you ever spend running the marathon, doing
empirical research involves far far more time doing tedious,
repetitive tasks
Since you do the tedious tasks repeatedly, they have the most
potential for error which can be catastrophic
How can we minimize these errors through a checklist?
Figure: Image from Wenfei Xu at Columbia
Read the codebook

We stand on the shoulders of giants


Few like reading the codebook as it is not gripping literature
But the codebook explains how to interpret the data you have
acquired and it is not a step you can skip
Set aside time to study it, and have it in a place where you
can regularly return to it
This goes for the readme that accompanies some datasets,
too.
Look at the data

The eyeball is not nearly appreciated enough for its ability to


spot problems
Use browse or excel to just read the spreadsheet with your
eyes.
Scroll through the variables and accompany yourself with what
you’ve got visually
Missing observations

Check the size of your dataset in Stata using count


Check the number of observations per variable in Stata using
summarize
String variables will always report zero observations under
summarize so count if X=="" will work
Use tabulate also because oftentimes missing observations
are recorded with a −9 or some other illogical negative value
Missing years

Panel data can be overwhelming bc looking at each


state/city/firm/county borders on the impossible
Start with collapse to the national level by year and simply
list to see if anything looks strange
What’s “strange” look like?
Well wouldn’t it be strange if national unemployment rates
were zero in any year?
You can use xtline to see time series for panel identifiers,
with or without the subcommand of overlay
Panel observations are N × T

Say you have 51 state units (50 states plus DC) and 10 years
51 × 10 = 510 observations
If you do not have 510 observations, then you have an
unbalanced panel; if you have 510 observations you have a
balanced panel
Check the patterns using xtdescribe and simple counting
tricks
Merge

During a stage of arranging datasets, you will likely merge –


oftentimes a lot
Make sure you count before and after you merge so you can
figure out what went wrong, if anything
Also make sure you’re using the contemporary m:m syntax as
many an excellent empiricists have been hurt by merge syntax
errors
Don’t forget the question

“Exploring the data” is intoxicating to the point of distracting


“All you can do is write the best paper on the question you’re
studying” – Mark Hoekstra
Note he didn’t say “Write the best paper you’re capable of
writing”
He said the best paper
Important therefore to choose the right questions with real
upside
Slow down, think big picture, force yourself to figure out
exactly what your question is, who is in your sample (and
importantly who won’t be) and what time periods you’ll pull
Organize your directories

After the coding error fiasco, I spent a lot of time wondering


how this could happen
I decided it was partly because of four problems related to
1 organized subdirectories
2 automation
3 naming conventions
4 version control
I’ll discuss each but I highly recommend that you just read
Gentzkow and Shapiro’s excellent resource “Code and Data for
the Social Sciences: A Practitioner’s Guide” https://web.
stanford.edu/~gentzkow/research/CodeAndData.pdf

Cunningham Causal Inference


No correct organization

There is no correct way to organize your directories,


But all competent empiricists have adopted an intentional
philosophy of how to organize their directories
Why? Because you’re writing for your future self, and your
future self is lazy, distracted, disinterested and busy
Directories

The typical applied micro project may have hundreds of files of


various type and will take years just to finish not including
time to publication
So simply finding the files you need becomes more difficult if
everything is stored in the same place
When I start a new project, the first thing I do is create the
following directories

Cunningham Causal Inference


Subdirectory organization

1) Name the project (“Texas”)


Subdirectory organization

2) A subdirectory for all articles you cite in the paper


Subdirectory organization

3) Data subdirectory containing all datasets


Subdirectory organization

4) A subdirectory for all do files and log files


Subdirectory organization

5) All figures produced by Stata or image files


Subdirectory organization

6) Project-specific heterogeneity (e.g., “Inference”, “Grants”,


“Interview notes”, “Presentations”, “Misc”)
Subdirectory organization

7) All tables generated by Stata (e.g., .tex tables produced by


-estout-)
Subdirectory organization

8) A subdirectory reserved only for writing


Always use scripting programs NOT GUI

Guess what - your future self doesn’t even remember making


do files, tables or figures, let alone typing into GUI command
line
Therefore throw her a bone, hold her hand and walk her
exactly through everything
Which means you’ve got to have replicable scripting files*
* Sure, sometimes use the the command line for messing
around
But then put that messing around in the program

Cunningham Causal Inference


Good text editor

Remember: the goal is to make beautiful programs


Invest in a good text editor which has bundling capabilities
that will integrate with Stata, R or LaTeX
I use Textmate 2 because I use a Mac and in addition to a
Stata and R bundle, it also allows for column editing
PC users tend to love Sublime for the same reasons
Stata and Rstudio also come with built-in text editors, which
use slick colors for various types of programming commands
Headers
Speak clearly

“Be conservative in what you do; be liberal in what you accept from
others.” - Jon Postel

Smart sounding quote about both programming and


relationships
Your future self is time constrained, so explain everything to
her as well as write clear code
Optimally document your programs
But speak your future self’s love language so she understands
Automating Tables and Figures

Your goal is to make “beautiful tables” that are never edited


post-production as well as readable on their own
Large fixed costs learning commands like -estout- or -outreg2-:
incur them bc marginal costs are zero
I use -estout- because Jann has written an excellent help file at
http://repec.org/bocode/e/estout/hlp_esttab.html
but many like -outreg2-
Learn -twoway- and/or -ggplot2- and make “beautiful pictures”
too
Different elements

When I found my error, and after I regained my exposure, I


eventually developed a system of naming
1 variables,
2 datasets, and
3 do files
As these are the three things you repeatedly use, you need to
have a system, even if not mine

Cunningham Causal Inference


Naming conventions for variables

Variables should be readable to a stranger


Say that you want to create the product of two variables.
Name it the two variables with an underscore
gen price_mpg = price * mpg
Otherwise name the variable exactly what it is
gen bmi = weight / (height^2 * 703)
Avoid meaningless words (e.g., lmb2), dating (e.g.,
temp05012020) and numbering (e.g., outcome25) as your
future self will be confused
Naming datasets and do files

The overarching goal is always to name things so that a


stranger seeing them can know what they are
One day you will be the stranger on your own project! Make it
easy on your future self!
Choose some combination of simplicity and clarity but
whatever you do, be consistent
Avoid numbering datasets unless the numbers correspond to
some meaningful thing, like randomization inference where
each file is a set of coefficients and numbered according to
FIPS index
Version control

People swear by git, particularly Gentzkow and Shapiro


I use Dropbox, and have for years. They have some version
history for instance, though I’m not sure if it compares to git’s
capabilities.
I’m slowly learning git and use git Tower, but many use the
command line in Terminal
Ideally your system allows you to revert to earlier versions
without having ten billion files with names like
prison_03102019_sc.do, etc.
Selling your work

If you don’t advocate for your work, no one will.


Network, network, network
You will need to become an expert in 1.5 areas, and you will
need experts in those 1.5 areas to agree
Study the effective of rhetoric of successful economists who
expertly communicate their work to others both in their
writing of the actual manuscript, as well as the presentation
and promotion of their work

Cunningham Causal Inference


Find your mentors and sponsors

Working with senior people at some point becomes necessary


Good news: many senior people want to help you
Bad news: they don’t know who you are and can’t find you
It’s a two sided matching problem
Introduce yourself in socially appropriate ways!
Al Roth story

I wrote Al Roth in 2007 and like Robert Browning to Elizabeth


Barrett introduced myself by saying “I love your book on
twosided matching with Sotomayor with all my heart.”
We became pen pals and then he won the Nobel Prize
Scared, I wrote to congratulate him on the day he won and he
immediately asked to help me
“Interpersonal favors are meant to be paid forward not
backwards” - Roth to me after a second favor!
Nobody can help you if you don’t know them bc help,
sponsorship and mentoring is a two sided matching problem
More readings

I’ve put several deck of slides and helpful articles for you in the
dropbox folder
Jesse Shapiro’s “How to Present an Applied Micro Paper”
Gentzkow and Shapiro’s coding practices manual
Rachael Meager on presenting as an academic
Ljubica “LJ” Ristovska’s language agnostic guide to
programming for economists
Grant McDermott on Version Control using Github
https://raw.githack.com/uo-ec607/lectures/master/
02-git/02-Git.html#1
Data Visualization

Every project should present compelling graphics summarizing the


main results and main takeaway
Study other people’s pictures and get help from experts
1 Kieran Healy’s 2018 Visualization: A Practical Introduction
(Princeton University Press); free version is
http://socviz.co/index.html#preface.
2 Ed Tufte’s book Visual display of quantitative information is
classic, but more a coffee table book plus no programming
assistance.
Learn Stata’s -twoway- capabilities and/or R’s -ggplot2-
Introduction: OLS Review

Derivation of the OLS estimator


Algebraic properties of OLS
Statistical Properties of OLS
Variance of OLS and standard errors

Cunningham Causal Inference


Foundations of scientific knowledge

Scientific methodologies are the epistemological foundation of


scientific knowledge
Science does not collect evidence in order to “prove” what
people already believe or want others to believe
Science accepts unexpected and even undesirable answers
Science is process oriented, not outcome oriented
Terminology
y x
Dependent Variable Independent Variable
Explained Variable Explanatory Variable
Response Variable Control Variable
Predicted Variable Predictor Variable
Regressand Regressor
LHS RHS

The terms “explained” and “explanatory” are probably best, as they


are the most descriptive and widely applicable. But “dependent”
and “independent” are used often. (The “independence” here is not
really statistical independence.)
We said we must confront three issues:
1 How do we allow factors other than x to affect y ?
2 What is the functional relationship between y and x?
3 How can we be sure we are capturing a ceteris paribus
relationship between y and x?
We will argue that the simple regression model

y = β0 + β1 x + u (1)
addresses each of them.
Simple linear regression model

The simple linear regression (SLR) model is a population


model.
When it comes to estimating β1 (and β0 ) using a random
sample of data, we must restrict how u and x are related to
each other.
What we must do is restrict the way u and x relate to each
other in the population.
The error term

We make a simplifying assumption (without loss of generality): the


average, or expected, value of u is zero in the population:

E (u) = 0 (2)
where E (·) is the expected value operator.
The intercept

The presence of β0 in

y = β0 + β1 x + u (3)

allows us to assume E (u) = 0. If the average of u is different from


zero, say α0 , we just adjust the intercept, leaving the slope the
same:
y = (β0 + α0 ) + β1 x + (u − α0 ) (4)
where α0 = E (u). The new error is u − α0 and the new intercept is
β0 + α0 . The important point is that the slope, β1 , has not
changed.
Mean independence of the error term

An assumption that meshes well with our introductory treatment


involves the mean of the error term for each “slice” of the
population determined by values of x:

E (u|x) = E (u), all values x (5)


where E (u|x) means “the expected value of u given x”.
Then, we say u is mean independent of x.
Distribution of ability across education

Suppose u is “ability” and x is years of education. We need,


for example,

E (ability |x = 8) = E (ability |x = 12) = E (ability |x = 16)

so that the average ability is the same in the different portions


of the population with an 8th grade education, a 12th grade
education, and a four-year college education.
Because people choose education levels partly based on ability,
this assumption is almost certainly false.
Zero conditional mean assumption

Combining E (u|x) = E (u) (the substantive assumption) with


E (u) = 0 (a normalization) gives the zero conditional mean
assumption.

E (u|x) = 0, all values x (6)


Population regression function

Because the conditional expected value is a linear operator,


E (u|x) = 0 implies

E (y |x) = β0 + β1 x (7)
which shows the population regression function is a linear
function of x.
The straight line in the graph on the next page is what
Wooldridge calls the population regression function, and
what Angrist and Pischke call the conditional expectation
function
E (y |x) = β0 + β1 x
The conditional distribution of y at three different values of x
are superimposed. for a given value of x, we see a range of y
values: remember, y = β0 + β1 x + u, and u has a distribution
in the population.
Deriving the Ordinary Least Squares Estimates

Given data on x and y , how can we estimate the population


parameters, β0 and β1 ?
Let {(xi , yi ) : i = 1, 2, ..., n} be a random sample of size n
(the number of observations) from the population.
Plug any observation into the population equation:

yi = β0 + β1 xi + ui (8)

where the i subscript indicates a particular observation.


We observe yi and xi , but not ui (but we know it is there).
We use the two population restrictions:

E (u) = 0
Cov (x, u) = 0

to obtain estimating equations for β0 and β1 . We talked about the


first condition. The second condition means that x and u are
uncorrelated. Both conditions are implied by E (u|x) = 0
With E (u) = 0, Cov (x, u) = 0 is the same as E (xu) = 0. Next we
plug in for u:

E (y − β0 − β1 x) = 0
E [x(y − β0 − β1 x)] = 0

These are the two conditions in the population that effectively


determine β0 and β1 .
So we use their sample counterparts (which is a method of
moments approach to estimation):

n
X
−1
n (yi − β̂0 − β̂1 xi ) = 0
i=1
n
X
−1
n xi (yi − β̂0 − β̂1 xi ) = 0
i=1

where β̂0 and β̂1 are the estimates from the data.
These are two linear equations in the two unknowns β̂0 and β̂1 .
Pass the summation operator through the first equation:
n
X
−1
n (yi − β̂0 − β̂1 xi ) (9)
i=1
n
X n
X n
X
−1 −1 −1
=n yi − n β̂0 − n β̂1 xi (10)
i=1 i=1 i=1
n n
!
X X
= n−1 yi − β̂0 − β̂1 n−1 xi (11)
i=1 i=1

= y − β̂0 − β̂1 x (12)


We use the standard notation y = n−1 ni=1 yi for the average of
P
the n numbers {yi : i = 1, 2, ..., n}. For emphasis, we call y a
sample average.
We have shown that the first equation,
n
X
n−1 (yi − β̂0 − β̂1 xi ) = 0 (13)
i=1

implies

y = β̂0 + β̂1 x (14)


Now, use this equation to write the intercept in terms of the slope:

β̂0 = y − β̂1 x (15)


Plug this into the second equation (but where we take away the
division by n):
n
X
xi (yi − β̂0 − β̂1 xi ) = 0 (16)
i=1
so
n
X
xi [yi − (y − β̂1 x) − β̂1 xi ] = 0 (17)
i=1

Simple algebra gives


n n
" #
X X
xi (yi − y ) = β̂1 xi (xi − x) (18)
i=1 i=1
So, the equation to solve is
n n
" #
X X
(xi − x)(yi − y ) = β̂1 (xi − x)2 (19)
i=1 i=1
Pn
If i=1 (xi − x)2 > 0, we can write

Pn
(x − x)(yi − y ) Sample Covariance(xi , yi )
Pn i
β̂1 = i=1 2
= (20)
i=1 (xi − x) Sample Variance(xi )
OLS

The previous formula for β̂1 is important. It shows us how to


take the data we have and compute the slope estimate.
β̂1 is called the ordinary least squares (OLS) slope estimate.
It can be computed whenever the sample variance of the xi is
not zero, which only rules out the case where each xi has the
same value.
The intuition is that the variation in x is what permits us to
identify its impact on y .
Solving for βb

Once we have β̂1 , we compute β̂0 = y − β̂1 x. This is the OLS


intercept estimate.
These days, we let the computer do the calculations, which are
tedious even if n is small.
Predicting y

For any candidates β̂0 and β̂1 , define a fitted value for each i
as

ŷi = β̂0 + β̂1 xi (21)


We have n of these.
ŷi is the value we predict for yi given that x = xi and β = β̂.
The residual

The “mistake” from our prediction is called the residual:

ûi = yi − ŷi
= yi − β̂0 − β̂1 xi

Suppose we measure the size of the mistake, for each i, by


squaring it. Then we add them all up to get the sum of
squared residuals

n
X n
X
ûi2 = (yi − β̂0 − β̂1 xi )2
i=1 i=1

Choose β̂0 and β̂1 to minimize the sum of squared residuals


which gives us the same solutions we obtained before.
Algebraic Properties of OLS Statistics
Remembering how the first moment condition allows us to obtain
β̂0 and β̂1 , we have:
n
X
(yi − β̂0 − β̂1 xi ) = 0 (22)
i=1

Notice the logic here: this means the OLS residuals always add up
to zero, by construction,
n
X
ûi = 0 (23)
i=1

Because yi = ŷi + ûi by definition,


n
X n
X n
X
n−1 yi = n−1 ŷi + n−1 ûi (24)
i=1 i=1 i=1

and so y = ŷ .
Second moment

Similarly the way we obtained our estimates,


n
X
n−1 xi (yi − β̂0 − β̂1 xi ) = 0 (25)
i=1

The sample covariance (and therefore the sample correlation)


between the explanatory variables and the residuals is always zero:
n
X
n−1 xi ûi = 0 (26)
i=1
Bringing things together

Because the ŷi are linear functions of the xi , the fitted values and
residuals are uncorrelated, too:
n
X
n−1 ŷi ûi = 0 (27)
i=1
Averages

A third property is that the point (x, y ) is always on the OLS


regression line. That is, if we plug in the average for x, we predict
the sample average for y :

y = β̂0 + β̂1 x (28)


Again, we chose the estimates to make this true.
Expected Value of OLS
Mathematical statistics: How do our estimators behave across
different samples of data? On average, would we get the right
answer if we could repeatedly sample?
We need to find the expected value of the OLS estimators – in
effect, the average outcome across all possible random samples
– and determine if we are right on average.
Leads to the notion of unbiasedness, which is a “desirable”
characteristic for estimators.

E (β̂) = β (29)
Don’t forget why we’re here

Plato’s allegory of the cave - reality is outside the cave, the


reflections on the wall are our estimates of that reality.
The population parameter that describes the relationship
between y and x is β1
For this class, β1 is a causal parameter, and our sole objective
is to estimate β1 with a sample of data
But never forget that β̂1 is an estimator of that causal
parameter obtained with a specific sample from the
population.
Uncertainty and sampling variance

Different samples will generate different estimates (β̂1 ) for the


“true” β1 which makes β̂1 a random variable.
Unbiasedness is the idea that if we could take as many random
samples on Y as we want from the population, and compute
an estimate each time, the average of these estimates would
be equal to β1 .
But, this also implies that βˆ1 has spread and therefore variance
Assumptions
Assumption SLR.1 (Linear in Parameters)
The population model can be written as

y = β0 + β1 x + u (30)
where β0 and β1 are the (unknown) population parameters.
We view x and u as outcomes of random variables; thus, y is
random.
Stating this assumption formally shows that our goal is to
estimate β0 and β1 .
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, {(xi , yi ) : i = 1, ..., n},
following the population model.
We know how to use this data to estimate β0 and β1 by OLS.
Because each i is a draw from the population, we can write,
for each i,

yi = β0 + β1 xi + ui (31)
Notice that ui here is the unobserved error for observation i. It
is not the residual that we compute from the data!
Assumption SLR.3 (Sample Variation in the Explanatory
Variable)
The sample outcomes on xi are not all the same value.
This is the same as saying the sample variance of
{xi : i = 1, ..., n} is not zero.
In practice, this is no assumption at all. If the xi are all the
same value, we cannot learn how x affects y in the population.
Assumption SLR.4 (Zero Conditional Mean)
In the population, the error term has zero mean given any
value of the explanatory variable:

E (u|x) = E (u) = 0. (32)


This is the key assumption for showing that OLS is unbiased,
with the zero value not being important once we assume
E (u|x) does not change with x.
Note that we can compute the OLS estimates whether or not
this assumption holds, or even if there is an underlying
population model.
Showing OLS is unbiased

How do we show β̂1 is unbiased for β1 ? What we need to show is

E (β̂1 ) = β1 (33)
where the expected value means averaging across random samples.
Step 1: Write down a formula for β̂1 . It is convenient to use
Pn
(xi − x)yi
β̂1 = Pi=1
n 2
(34)
i=1 (xi − x)
which is one of several equivalent forms.
It is convenient to define SSTx = ni=1 (xi − x)2 , to total variation
P
in the xi , and write
Pn
(xi − x)yi
β̂1 = i=1 (35)
SSTx
Remember, SSTx is just some positive number. The existence of
β̂1 is guaranteed by SLR.3.

Step 2: Replace each yi with yi = β0 + β1 xi + ui (which uses


SLR.1 and the fact that we have data from SLR.2).
The numerator becomes
n
X n
X
(xi − x)yi = (xi − x)(β0 + β1 xi + ui ) (36)
i=1 i=1
n
X n
X n
X
= β0 (xi − x) + β1 (xi − x)xi + (xi − x)ui (37)
i=1 i=1 i=1
n
X n
X
= 0 + β1 (xi − x)2 + (xi − x)ui (38)
i=1 i=1
n
X
= β1 SSTx + (xi − x)ui (39)
i=1
Pn Pn Pn
We used i=1 (xi − x) = 0 and i=1 (xi − x)xi = i=1 (xi − x)2 .
We have shown

Pn Pn
β1 SSTx + i=1 (xi − x)ui i=1 (xi− x)ui
β̂1 = = β1 + (40)
SSTx SSTx
Note how the last piece is the slope coefficient from the OLS
regression of ui on xi , i = 1, ..., n. We cannot do this regression
because the ui are not observed.
Now define

(xi − x)
wi = (41)
SSTx
so we have
n
X
β̂1 = β1 + w i ui (42)
i=1

β̂1 is a linear function of the unobserved errors, ui . The wi are


all functions of {x1 , x2 , ..., xn }.
The (random) difference between β̂1 and β1 is due to this
linear function of the unobservables.
Step 3: Find E (β̂1 ).
Under Assumptions SLR.2 and SLR.4, E (ui |x1 , x2 , ..., xn ) = 0.
That means, conditional on {x1 , x2 , ..., xn },

E (wi ui |x1 , x2 , ..., xn ) = wi E (ui |x1 , x2 , ..., xn ) = 0


because wi is a function of {x1 , x2 , ..., xn }. (In the next slides I
omit the conditioning in the expectations)
This would not be true if, in the population, u and x are
correlated.
Now we can complete the proof: conditional on {x1 , x2 , ..., xn },
n
!
X
E (β̂1 ) = E β1 + w i ui (43)
i=1
n
X n
X
= β1 + E (wi ui ) = β1 + wi E (ui ) (44)
i=1 i=1

= β1 (45)

Remember, β1 is the fixed constant in the population. The


estimator, β̂1 , varies across samples and is the random outcome:
before we collect our data, we do not know what β̂1 will be.
THEOREM (Unbiasedness of OLS)
Under Assumptions SLR.1 through SLR.4

E (β̂0 ) = β0 and E (β̂1 ) = β1 . (46)

Omit the proof for β̂0 .


Each sample leads to a different estimate, β̂0 and β̂1 . Some
will be very close to the true values β0 = 3 and β1 = 2.
Nevertheless, some could be very far from those values.
If we repeat the experiment again and again, and average the
estimates, we would get very close to 2.
The problem is, we do not know which kind of sample we
have. We can never know whether we are close to the
population value.
We hope that our sample is "typical" and produces a slope
estimate close to β1 but we can never know.
Reminder

Errors are the vertical distances between observations and the


unknown Conditional Expectation Function. Therefore, they
are unknown.
Residuals are the vertical distances between observations and
the estimated regression function. Therefore, they are known.
SE and the data

The correct SE estimation procedure is given by the underlying


structure of the data
It is very unlikely that all observations in a dataset are
unrelated, but drawn from identical distributions
(homoskedasticity)
For instance, the variance of income is often greater in families
belonging to top deciles than among poorer families
(heteroskedasticity)
Some phenomena do not affect observations individually, but
they do affect groups of observations uniformly within each
group (clustered data)
Variance of the OLS Estimators

Under SLR.1 to SLR.4, the OLS estimators are unbiased. This


tells us that, on average, the estimates will equal the
population values.
But we need a measure of dispersion (spread) in the sampling
distribution of the estimators. We use the variance (and,
ultimately, the standard deviation).
We could characterize the variance of the OLS estimators
under SLR.1 to SLR.4 (and we will later). For now, it is easiest
to introduce an assumption that simplifies the calculations.
Assumption SLR.5 (Homoskedasticity, or Constant Variance)
The error has the same variance given any value of the explanatory
variable x:

Var (u|x) = σ 2 > 0 (47)


where σ 2 is (virtually always) unknown.

Because we assume SLR.4, that is, E (u|x) = 0 whenever we


assume SLR.5, we can also write

E (u 2 |x) = σ 2 = E (u 2 ) (48)
Under the population Assumptions SLR.1 (y = β0 + β1 x + u),
SRL.4 (E (u|x) = 0) and SLR.5 (Var (u|x) = σ 2 ),

E (y |x) = β0 + β1 x
Var (y |x) = σ 2

So the average or expected value of y is allowed to change with x –


in fact, this is what interests us – but the variance does not change
with x. (See Graphs on next two slides)
THEOREM (Sampling Variances of OLS)
Under Assumptions SLR.1 to SLR.2,

σ2 σ2
Var (β̂1 |x) = Pn 2
=
i=1 (xi − x) SSTx
2 −1
Pn 2

σ n i=1 xi
Var (β̂0 |x) =
SSTx
(conditional on the outcomes {x1 , x2 , ..., xn }).
To show this, write, as before,
n
X
β̂1 = β1 + w i ui (49)
i=1

where wi = (xi − x)/SSTx . We are treating this as nonrandom in


the derivation. Because β1 is a constant, it does not affect Var (β̂1 ).
Now, we need to use the fact that, for uncorrelated random
variables, the variance of the sum is the sum of the variances.
The {ui : i = 1, 2, ..., n} are actually independent across i, and so
they are uncorrelated. So (remember that if we know x, we know
w)

n
!
X
Var (β̂1 |x) = Var wi ui |x
i=1
n
X n
X
= Var (wi ui |x) = wi2 Var (ui |x)
i=1 i=1
n
X n
X
= wi2 σ 2 =σ 2
wi2
i=1 i=1

where the second-to-last equality uses Assumption SLR.5, so that


the variance of ui does not depend on xi .
Now we have

n n Pn 2
X X (xi − x)2 i=1 (xi − x)
wi2 = =
(SSTx )2 (SSTx )2
i=1 i=1
SSTx 1
= 2
=
(SSTx ) SSTx

We have shown

σ2
Var (β̂1 ) = (50)
SSTx
Usually we are interested in β1 . We can easily study the two factors
that affect its variance.

σ2
Var (β̂1 ) = (51)
SSTx

1 As the error variance increases, i.e, as σ 2 increases, so does


Var (β̂1 ). The more “noise” in the relationship between y and
x – that is, the larger variability in u – the harder it is to learn
about β1 .
2 By contrast, more variation in {xi } is a good thing:

SSTx ↑ implies Var (β̂1 ) ↓ (52)


Notice that SSTx /n is the sample variance in x. We can think of
this as getting close to the population variance of x, σx2 , as n gets
large. This means

SSTx ≈ nσx2 (53)


which means, as n grows, Var (β̂1 ) shrinks at the rate 1/n. This is
why more data is a good thing: it shrinks the sampling variance of
our estimators.
The standard deviation of β̂1 is the square root of the variance. So
σ
sd(β̂1 ) = √ (54)
SSTx
This turns out to be the measure of variation that appears in
confidence intervals and test statistics.
Estimating the Error Variance
In the formula

σ2
Var (β̂1 ) = (55)
SSTx
we can compute SSTx from {xi : i = 1, ..., n}. But we need to
estimate σ 2 .
Recall that

σ 2 = E (u 2 ). (56)
Therefore, if we could observe a sample on the errors,
{ui : i = 1, 2, ..., n}, an unbiased estimator of σ 2 would be the
sample average
n
X
n−1 ui2 (57)
i=1

But this not an estimator because we cannot compute it from the


data we observe, since ui are unobserved.
How about replacing each ui with its “estimate”, the OLS residual
ûi ?

ui = yi − β0 − β1 xi
ûi = yi − β̂0 − β̂1 xi
ûi can be computed from the data because it depends on the
estimators β̂0 and β̂1 . Except by fluke,

ûi 6= ui (58)
for any i.

ûi = yi − β̂0 − β̂1 xi = (β0 + β1 xi + ui ) − β̂0 − β̂1 xi


= ui − (β̂0 − β0 ) − (β̂1 − β1 )xi

E (β̂0 ) = β0 and E (β̂1 ) = β1 , but the estimators almost always


differ from the population values in a sample.
Now, what about this as an estimator of σ 2 ?
n
X
n−1 ûi2 = SSR/n (59)
i=1

It is a true estimator and easily computed from the data after OLS.
As it turns out, this estimator is slightly biased: its expected value
is a little less than σ 2 .
The estimator does not account for the two restrictions on the
residuals, used to obtain β̂0 and β̂1 :

n
X
ûi = 0
i=1
n
X
xi ûi = 0
i=1

There is no such restriction on the unobserved errors.


The unbiased estimator of σ 2 uses a degrees-of-freedom
adjustment. The residuals have only n − 2 degrees-of-freedom, not
n.

SSR
σ̂ 2 = (60)
(n − 2)
THEOREM: Unbiased Estimator of σ 2
Under Assumptions SLR.1 to SLR.5,

E (σ̂ 2 ) = σ 2 (61)
In regression output, it is
s
√ SSR
σ̂ = σ̂ 2 = (62)
(n − 2)
that is usually reported. This is an estimator ofPsd(u), the standard
deviation of the population error. And SSR = ni=1 ub2 .
σ̂ is called the standard error of the regression, which
means it is an estimate of the standard deviation of the error
in the regression. Stata calls it the root mean squared error.
Given σ̂, we can now estimate sd(β̂1 ) and sd(β̂0 ). The
estimates of these are called the standard errors of the β̂j .
We just plug σ̂ in for σ:
σ̂
se(β̂1 ) = √ (63)
SSTx

where both the numerator and denominator are computed


from the data.
For reasons we will see, it is useful to report the standard errors
below the corresponding coefficient, usually in parentheses.
OLS inference is generally faulty in the presence of
heteroskedasticity
Fortunately, OLS is still useful
Assume SLR.1-4 hold, but not SLR.5. Therefore

Var (ui |xi ) = σi2

The variance of our estimator, βb1 equals:


Pn
(xi − x)2 σi2
Var (βb1 ) = i=1
SSTx2

When σi2 = σ 2 for all i, this formula reduces to the usual form,
σ2
SSTx2
A valid estimator of Var(βb1 ) for heteroskedasticity of any form
(including homoskedasticity) is
Pn
− x)2 ubi 2
i=1 (xi
Var (βb1 ) =
SSTx2

which is easily computed from the data after the OLS


regression
As a rule, you should always use the , robust command in
STATA.
Clustered data

But what if errors are not iid?


For instance, maybe observations between units in a group are
related to each other
You want to regress kids’ grades on class size to determine the
effect of class size on grades
The unobservables of kids belonging to the same classroom
will be correlated (e.g., teacher quality, recess routines) while
will not be correlated with kids in far away classrooms
Then i.i.d. is violated. But maybe i.i.d. holds across clusters,
just not within clusters
Simulations

Let’s first try to understand what’s going on with a few


simulations
We will begin with a baseline of non-clustered data
We’ll show the distribution of estimates in Monte Carlo
simulation for 1000 draws and iid errors
We’ll then show the number of times you reject the null
incorrectly at α = 0.05.
Figure: Distribution of the least squares estimator over 1,000 random
draws.
Figure: Distribution of the 95% confidence intervals with coloring
showing those which are incorrectly rejecting the null.
Clustered data and heteroskedastic robust

Now let’s look at clustered data


But this time we will estimate the model using heteroskedastic
robust standard errors
Earlier we saw mass all the way to -2.5 to 2; what do we get
when we incorrectly estimate the standard errors?
Figure: Distribution of the least squares estimator over 1,000 random
draws. Clustered data without correcting for clustering
Figure: Distribution of 1,000 95% confidence intervals with dashed region
representing those estimates that incorrectly reject the null.
Over-rejecting the null

Those 95 percent confidence intervals are based on an


α = 0.05.
Look how many parameter estimates are different from zero;
that’s what we mean by “over-rejecting the null”
You saw signs of it though in the variance of the estimated
effect, bc the spread only went from -.15 to .15 (whereas
earlier it had gone from -.25 to .2)
Now let’s correct for arbitrary within group correlations using
the cluster robust option in Stata/R
Figure: Distribution of 1,000 95% confidence intervals from a cluster
robust least squares regression with dashed region representing those
estimates that incorrectly reject the null.
Cluster robust standard errors

Better. We don’t have the same over-rejection problem as


before. If anything it’s more conservative.
The formula for estimating standard errors changes when
allowing for arbitrary serial correlation within group.
Instead of summing over each individual, we first sum over
groups
I’ll use matrix notation as it’s easier for me to explain by
stacking the data.
Clustered data

Let’s stack the observations by cluster

yg = xg β + ug

The OLS estimator of β is:

βb = [X 0 X ]−1 X 0 y

The variance is given by:

Var (β) = E [[X 0 X ]−1 X 0 ΩX [X 0 X ]−1 ]


Clustered data

With this in mind, we can now write the variance-covariance matrix


for clustered data
G
X
b = [X 0 X ]−1 [
Var (β) xg0 ubg ubg0 xg ][X 0 X ]−1
i=1

where ûg are residuals from the stacked regression


In STATA: vce(cluster clustervar). Where clustervar
is a variable that identifies the groups in which unobservables
are allowed to correlate
The importance of knowing your data

In real world you should never go with the “independent and


identically distributed” (i.e., homoskedasticity) case. Life is not
that simple.
You need to know your data in order to choose the correct
error structure and then infer the required SE calculation
If you have aggregate variables, like class size, clustering at
that level is required
Foundations of scientific knowledge

Scientific methodologies are the epistemological foundation of


scientific knowledge, which is a particular kind of knowledge
Science does not collect evidence in order to “prove” what
people already believe or want others to believe.
Science is process oriented, not outcome oriented.
Therefore science allows us to accept unexpected and
sometimes even undesirable answers.
My strong pragmatic claim

“Credible” causal inference is essential to scientific discovery,


publishing and your career
Non-credibly identified empirical micro papers, even ones with
ingenious theory, will have trouble getting published and won’t
be taken seriously
Causal inference in 2019 is a necessary, not a sufficient,
condition
Outline

Properties of the conditional expectation function (CEF)


Reasons for using linear regression
Regression anatomy theorem
Omitted variable bias
Properties of the conditional expectation function

Assume we are interested in the returns to schooling in a wage


regression.
We can summarize the predictive power of schooling’s effect
on wages with the conditional expectation function

E (yi |xi ) (64)

The CEF for a dependent variable, yi , given covariates Xi , is


the expectation, or population average, of yi with xi held
constant.
E (yi |xi ) gives the expected value of y for given values of x
It provides a reasonable representation of how y changes with
x
If x is random, then E (yi |xi ) is a random function
When there are only two values that xi can take on, then there
are only two values the CEF can take on – but the dummy
variable is a special case
We’re often interested in CEFs that are functions of many
variables, conveniently subsumed in the vector xi , and for a
specific value of xi , we will write

E (yi |xi = x)
Helpful result: Law of Iterated Expectations

Definition of Law of Iterated Expectations (LIE)


The unconditional expectation of a random variable is equal to the
expectation of the conditional expectation of the random variable
conditional on some other random variable

E (Y ) = E (E [Y |X ])

.
We use LIE for a lot of stuff, and it’s actually quite intuitive. You
may even know it and not know you know it!
Simple example of LIE

Say you want to know average IQ but only know average IQ by


gender.
LIE says we get the former by taking conditional expectations
by gender and combining them (properly weighted)

E [IQ] = E (E [IQ|Sex])
X
= Pr (Sexi ) · E [IQ|Sexi ]
Sexi
= Pr (Male) · E [IQ|Male]
+Pr (Female) · E [IQ|Female]

In words: the weighted average of the conditional averages is


the unconditional average.
Person Gender IQ
1 M 120
2 M 115
3 M 110
4 F 130
5 F 125
6 F 120

E[IQ] = 120
E[IQ | Male] = 115; E[IQ | Female] = 125
LIE: E ( E [ IQ | Sex ] ) = (0.5)×115 + (0.5)×125 = 120
Proof.
For the continuous case:
Z
E [E (Y |X )] = E (Y |X = u)gx (u)du
Z Z 
= tfy |x (t|X = u)dt gx (u)du
Z Z
= tfy |x (t|X = u)gx (u)dudt
Z Z 
= t fy |x (t|X = u)gx (u)du dt
Z
= t [fx,y du] dt
Z
= tgy (t)dt
= E (y )
Proof.
For the discrete case,
X
E (E [Y |X ]) = E [Y |X = x]p(x)
x
!
X X
= yp(y |x) p(x)
x y
XX
= yp(x, y )
x y
X X
= y p(x, y )
y x
X
= yp(y )
y
= E (Y )
Property 1: CEF Decomposition Property

The CEF Decomposition Property

yi = E (yi |xi ) + ui
where
1 ui is mean independent of xi ; that is

E (ui |xi ) = 0

2 ui is uncorrelated with any function of xi

In words: Any random variable, yi , can be decomposed into two


parts: the part that can be explained by xi and the part left over
that can’t be explained by xi . Proof is in Angrist and Pischke (ch.
3)
Property 2: CEF Prediction Property

The CEF Prediction Property


Let m(xi ) be any function of xi . The CEF solves

E (yi |xi ) = arg minm(xi ) E [(yi − m(xi ))2 ].

In words: The CEF is the minimum mean squared error predictor of


yi given xi . Proof is in Angrist and Pischke (ch. 3)
3 reasons why linear regression may be of interest

Linear regression may be interesting even if the underlying CEF is


not linear. We review some of the linear theorems now. These are
merely to justify the use of linear models to approximate the CEF.
The Linear CEF Theorem
Suppose the CEF is linear. Then the population regression is it.

Comment: Trivial theorem imho because if the population CEF is


linear, then it makes the most sense to use linear regression to
estimate it. Proof in Angrist and Pischke (ch. 3). Proof uses the
CEF Decomposition Property from earlier.
The Best Linear Predictor Theorem
1 The CEF, E (y |x ), is the minimum mean squared error
i i
(MMSE) predictor of yi given xi in the class of all functions xi
by the CEF prediction property
2 The population regression function, E (xi yi )E (xi xi0 )−1 , is the
best we can do in the class of all linear functions
Proof is in Angrist and Pischke (ch. 3).
The Regression CEF Theorem
The function xi β provides the minimum mean squared error
(MMSE) linear approximation to E (yi |xi ), that is

β = arg minb E {(E (yi |xi ) − xi0 b)2 }

Again, proof in Angrist and Pischke (ch. 3).


Random families

We are interested in the causal effect of family size on labor


supply so we regress labor supply onto family size

labor _supplyi = β0 + β1 numkidsi + εi

If couples had kids by flipping coins, then numkidsi


independent of εi , then estimation is simple - just compare
families with different sizes to get the causal effect of numkids
on labor _supply
But how do we interpret βb1 if families don’t flip coins?
Non-random families

If family size is random, you could visualize the causal effect


with a scatter plot and the regression line
If family size is non-random, then we can’t do this because we
need to control for multiple variables just to remove the
factors causing family size to be correlated with ε
Non-random families

Assume that family size is random once we condition on race,


age, marital status and employment.

labor_supplyi = β0 + β1 Numkidsi + γ1 Whitei + γ2 Marriedi


+γ3 Agei + γ4 Employedi + εi

To estimate this model, we need:


1 a data set with all 6 variables;
2 Numkids must be randomly assigned conditional on the other
4 variables
Now how do we interpret βb1 ? And can we visualize βb1 when
there’s multiple dimensions to the data? Yes, using the
regression anatomy theorem, we can.
Regression Anatomy Theorem
Assume your main multiple regression model of interest:

yi = β0 + β1 x1i + · · · + βk xki + · · · + βK xKi + ei

and an auxiliary regression in which the variable x1i is regressed on


all the remaining independent variables

x1i = γ0 + γk−1 xk−1i + γk+1 xk+1i + · · · + γK xKi + fi

and x̃1i = x1i − xb1i being the residual from the auxiliary regression.
The parameter β1 can be rewritten as:

Cov (yi , x̃1i )


β1 =
Var (x̃1i )

In words: The regression anatomy theorem says that βb1 is a scaled


covariance with the x˜1 residual used instead of the actual data x.
Regression Anatomy Proof
To prove the theorem, note E [x̃ki ] = E [xki ] − E [b
xki ] = E [fi ], and plug yi and residual
x̃ki from xki auxiliary regression into the covariance cov (yi , x̃ki )

cov (yi , x̃ki )


βk =
var (x̃ki )
cov (β0 + β1 x1i + · · · + βk xki + · · · + βK xKi + ei , x̃ki )
=
var (x̃ki )
cov (β0 + β1 x1i + · · · + βk xki + · · · + βK xKi + ei , fi )
=
var (fi )

1 Since by construction E [fi ] = 0, it follows that the term β0 E [fi ] = 0.


2 Since fi is a linear combination of all the independent variables with the
exception of xki , it must be that

β1 E [fi x1i ] = · · · = βk−1 E [fi xk−1i ] = βk+1 E [fi xk+1i ] = · · · = βK E [fi xKI ] = 0
Regression Anatomy Proof (cont.)
3 Consider now the term E [ei fi ]. This can be written as:

E [ei fi ] = E [ei fi ]
= E [ei x̃ki ]
= E [ei (xki − xbki )]
= E [ei xki ] − E [ei x̃ki ]

Since ei is uncorrelated with any independent variable, it is also uncorrelated


with xki : accordingly, we have E [ei xki ] = 0. With regard to the second term of
the subtraction, substituting the predicted value from the xki auxiliary
regression, we get

E [ei x̃ki ] = E [ei (γb0 + γb1 x1i + · · · + γ bk+1 xk+1i + · · · + γ


bk−1 xk−1 i + γ bK xKi )]

Once again, since ei is uncorrelated with any independent variable, the expected
value of the terms is equal to zero. Then, it follows E [ei fi ] = 0.
Regression Anatomy Proof (cont.)
4 The only remaining term is E [βk xki fi ] which equals E [βk xki x̃ki ] since fi = x̃ki . The
term xki can be substituted using a rewriting of the auxiliary regression model, xki ,
such that
xki = E [xki |X−k ] + x̃ki
This gives

E [βk xki x̃ki ] = E [βk E [x̃ki (E [xki |X−k ] + x̃ki )]]


= βk E [x̃ki (E [xki |X−k ] + x̃ki )]
= βk {E [x̃ki2 ] + E [(E [xki |X−k ]x̃ki )]}
= βk var (x̃ki )

which follows directly from the orthogonoality between E [xki |X−k ] and x̃ki . From
previous derivations we finally get

cov (yi , x̃ki ) = βk var (x̃ki )

which completes the proof.


Stata command: reganat (i.e., regression anatomy)

. ssc install reganat, replace


. sysuse auto
. regress price length weight headroom mpg
. reganat price length weight headroom mpg, dis(weight length) biline
Big picture

1 Regression provides the best linear predictor for the dependent


variable in the same way that the CEF is the best unrestricted
predictor of the dependent variable
2 If we prefer to think of approximating E (yi |xi ) as opposed to
predicting yi , the regression CEF theorem tells us that even if
the CEF is nonlinear, regression provides the best linear
approximation to it.
3 Regression anatomy theorem helps us interpret a single slope
coefficient in a multiple regression model by the
aforementioned decomposition.
Omitted Variable Bias

A typical problem is when a key variable is omitted. Assume


schooling causes earnings to rise:

Yi = β0 + β1 Si + β2 Ai + ui

Yi = log of earnings
Si = schooling measured in years
Ai = individual ability

Typically the econometrician cannot observe Ai ; for instance,


the Current Population Survey doesn’t present adult
respondents’ family background, intelligence, or motivation.
Shorter regression

What are the consequences of leaving ability out of the


regression? Suppose you estimated this shorter regression
instead:
Yi = β0 + β1 Si + ηi
where ηi = β2 Ai + ui ; β0 , β1 , and β2 are population regression
coefficients; Si is correlated with ηi through Ai only; and ui is
a regression residual uncorrelated with all regressors by
definition.
Derivation of Ability Bias

Suppressing the i subscripts, the OLS estimator for β1 is:

Cov (Y , S) E [YS] − E [Y ]E [S]


βb1 = =
Var (S) Var (S)

Plugging in the true model for Y , we get:

Cov [(β0 + β1 S + β2 A + u), S]


βb1 =
Var (S)
E [(β0 S + β1 S 2 + β2 SA + uS)] − E (S)E [β0 + β1 S + β2 A + u]
=
Var (S)
β1 E (S 2 ) − β1 E (S)2 + β2 E (AS) − β2 E (S)E (A) + E (uS) − E (S)E (u)
=
Var (S)
Cov (A, S)
= β1 + β2
Var (S)

If β2 > 0 and Cov(A, S)> 0 the coefficient on schooling in the shortened


regression (without controlling for A) would be upward biased
Summary

When Cov (A, S) > 0 then ability and schooling are correlated.
When ability is unobserved, then not even multiple regression
will identify the causal effect of schooling on wages.
Here we see one of the main justifications for this workshop –
what will we do when the treatment variable is endogenous?
We will need an identification strategy to recover the causal
effect
Introduction to the Selection Problem

Aliens come and orbit earth, see sick people in hospitals and
conclude “these ‘hospitals’ are hurting people”
Motivated by anger and compassion, they kill the doctors to
save the patients
Sounds stupid, but earthlings do this too - all the time

Cunningham Causal Inference


#1: Correlation and causality are very different concepts

Causal question:

“If I hospitalize (D) my child, will her health (Y) improve?”

Correlation question:

1 Cov (D, Y )
√ √
n VarD VarY

These are not the same thing


#2: Coming first may not mean causality!

Every morning the rooster crows and then the sun rises
Did the rooster cause the sun to rise? Or did the sun cause
the rooster to crow?
Post hoc ergo propter hoc: “after this, therefore, because of
this”
#3: No correlation does not mean no causality!

A sailor sails her sailboat across a lake


Wind blows, and she perfectly counters by turning the rudder
The same aliens observe from space and say “Look at the way
she’s moving that rudder back and forth but going in a straight
line. That rudder is broken.” So they send her a new rudder
They’re wrong but why are they wrong? There is, after all, no
correlation
Introduction to potential outcomes model

Let the treatment be a binary variable:


(
1 if hospitalized at time t
Di,t =
0 if not hospitalized at time t

where i indexes an individual observation, such as a person


Potential outcomes:
(
j 1 health if hospitalized at time t
Yi,t =
0 health if not hospitalized at time t

where j indexes a counterfactual state of the world


Moving between worlds

I’ll drop t subscript, but note – these are potential outcomes


for the same person at the exact same moment in time
A potential outcome Y 1 is not the historical outcome Y either
conceptually or notationally
Potential outcomes are hypothetical states of the world but
historical outcomes are ex post realizations
Major philosophical move here: go from the potential worlds
to the actual (historical) world based on your treatment
assignment
Important definitions

Definition 1: Individual treatment Definition 2: Average treatment effect


effect (ATE)
The individual treatment effect, δi , The average treatment effect is the
equals Yi1 − Yi0 population average of all i individual
treatment effects

Definition 3: Switching equation E [δi ] = E [Yi1 − Yi0 ]


An individual’s observed health = E [Yi1 ] − E [Yi0 ]
outcomes, Y , is determined by
treatment assignment, Di , and
corresponding potential outcomes:

Yi = Di Yi1 + (1 − Di )Yi0
(
Yi1 if Di = 1
Yi =
Yi0 if Di = 0
So what’s the problem?

Definition 4: Fundamental problem of causal inference


It is impossible to observe both Yi1 and Yi0 for the same individual
and so individual causal effects, δi , are unknowable.
Conditional Average Treatment Effects

Definition 5: Average Treatment Effect on the Treated (ATT)


The average treatment effect on the treatment group is equal to
the average treatment effect conditional on being a treatment
group member:

E [δ|D = 1] = E [Y 1 − Y 0 |D = 1]
= E [Y 1 |D = 1] − E [Y 0 |D = 1]

Definition 6: Average Treatment Effect on the Untreated (ATU)


The average treatment effect on the untreated group is equal to
the average treatment effect conditional on being untreated:

E [δ|D = 0] = E [Y 1 − Y 0 |D = 0]
= E [Y 1 |D = 0] − E [Y 0 |D = 0]
Causality and comparisons

Comparisons are at the heart of the causal problem, but not all
comparisons are equal because of the selection problem
Does the hospital make me sick? Or am I sick, and that’s why
I went to the hospital?
Why can’t I just compare my health (Scott) with someone
who isn’t in the hospital (Nathan)? Aren’t we supposed to
have a “control group”?
What are we actually measuring if we compare average health
outcomes for the hospitalized with the non-hospitalized?
Definition 7: Simple difference in mean outcomes (SDO)
A simple difference in mean outcomes (SDO) is the difference
between the population average outcome for the treatment and
control groups, and can be approximated by the sample averages:

SDO = E [Y 1 |D = 1] − E [Y 0 |D = 0]
= EN [Y |D = 1] − EN [Y |D = 0]

in large samples.
SDO vs. ATE

Notice the subtle difference between the SDO and ATE notation:

E [Y |D = 1] − E [Y |D = 0] <
> E [Y 1 ] − E [Y 0 ]

The SDO is an estimate, whereas ATE is a parameter


SDO is a crank that turns data into numbers
ATE is a parameter that is unknowable because of the
fundamental problem of causal inference
SDO can line up with the ATE and also cannot line up with
the ATE.
Biased simple difference in mean outcomes

Decomposition of the SDO


The simple difference in mean outcomes can be decomposed into
three parts (ignoring sample average notation):

E [Y 1 |D = 1] − E [Y 0 |D = 0] = ATE
+E [Y 0 |D = 1] − E [Y 0 |D = 0]
+(1 − π)(ATT − ATU)

Seeing is believing so let’s work through this identity


Decomposition of SDO

ATE is equal to sum of conditional average expectations by LIE

ATE = E [Y 1 ] − E [Y 0 ]
= {πE [Y 1 |D = 1] + (1 − π)E [Y 1 |D = 0]}
−{πE [Y 0 |D = 1] + (1 − π)E [Y 0 |D = 0]}

Use simplified notations

E [Y 1 |D = 1] = a
E [Y 1 |D = 0] = b
E [Y 0 |D = 1] = c
E [Y 0 |D = 0] = d
ATE = e

Rewrite ATE

e = {πa + (1 − π)b}
    − {πc + (1 − π)d}

Move SDO terms to the LHS

e = πa + b − πb − πc − d + πd
0 = e − πa − b + πb + πc + d − πd
a − d = e − πa − b + πb + πc + d − πd + a − d
a − d = e + (c − d) + a − πa − b + πb − c + πc + d − πd
a − d = e + (c − d) + (1 − π)a − (1 − π)b − (1 − π)c + (1 − π)d
a − d = e + (c − d) + (1 − π)(a − c) − (1 − π)(b − d)

Substitute conditional means

E[Y1|D = 1] − E[Y0|D = 0] = ATE
    + (E[Y0|D = 1] − E[Y0|D = 0])
    + (1 − π)({E[Y1|D = 1] − E[Y0|D = 1]}
              − {E[Y1|D = 0] − E[Y0|D = 0]})

E[Y1|D = 1] − E[Y0|D = 0] = ATE
    + (E[Y0|D = 1] − E[Y0|D = 0])
    + (1 − π)(ATT − ATU)
Decomposition of difference in means

EN[yi|di = 1] − EN[yi|di = 0]   (SDO)
    = E[Y1] − E[Y0]                      (Average Treatment Effect)
    + E[Y0|D = 1] − E[Y0|D = 0]          (Selection bias)
    + (1 − π)(ATT − ATU)                 (Heterogeneous treatment effect bias)

where EN[Y|D = 1] → E[Y1|D = 1], EN[Y|D = 0] → E[Y0|D = 0], and (1 − π) is
the share of the population in the control group.
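
To make the decomposition concrete, here is a minimal Stata sketch with simulated data (all parameter values are made up for illustration) that computes each piece and verifies the identity numerically:

* Sketch: verify SDO = ATE + selection bias + (1 - pi)(ATT - ATU)
clear all
set seed 1
set obs 100000
gen y0 = rnormal(5, 1)
gen y1 = y0 + 2 + rnormal(0, 1)            // heterogeneous treatment effects
gen d  = (y1 - y0 + rnormal(0, 1) > 2.5)   // selection related to gains
gen y  = d*y1 + (1 - d)*y0                 // switching equation

su y if d == 1, meanonly
scalar ey1_t = r(mean)                     // E[Y1|D=1]
su y if d == 0, meanonly
scalar ey0_c = r(mean)                     // E[Y0|D=0]
su y0 if d == 1, meanonly
scalar ey0_t = r(mean)                     // E[Y0|D=1] (known only because simulated)
su y1 if d == 0, meanonly
scalar ey1_c = r(mean)                     // E[Y1|D=0] (known only because simulated)
su y1, meanonly
scalar m_y1 = r(mean)
su y0, meanonly
scalar m_y0 = r(mean)
su d, meanonly
scalar pshare = r(mean)                    // pi, the share treated

scalar sdo     = ey1_t - ey0_c
scalar ate     = m_y1 - m_y0
scalar selbias = ey0_t - ey0_c
scalar hetbias = (1 - pshare)*((ey1_t - ey0_t) - (ey1_c - ey0_c))

display "SDO                        = " sdo
display "ATE + selection + het bias = " ate + selbias + hetbias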
Independence assumption

Independence assumption
Treatment is independent of potential outcomes

(Y0, Y1) ⊥⊥ D

In words: Random assignment means that the treatment has been assigned to units
independent of their potential outcomes. Thus, mean potential outcomes for the
treatment group and control group are the same for a given state of the world

E [Y 0 |D = 1] = E [Y 0 |D = 0]
E [Y 1 |D = 1] = E [Y 1 |D = 0]



Random Assignment Solves the Selection Problem

EN[yi|di = 1] − EN[yi|di = 0]   (SDO)
    = E[Y1] − E[Y0]                      (Average Treatment Effect)
    + E[Y0|D = 1] − E[Y0|D = 0]          (Selection bias)
    + (1 − π)(ATT − ATU)                 (Heterogeneous treatment effect bias)

If treatment is independent of potential outcomes, then swap


out equations and selection bias zeroes out:

E [Y 0 |D = 1] − E [Y 0 |D = 0] = 0
Random Assignment Solves the Heterogenous Treatment Effects

How does randomization affect heterogeneity treatment effects bias from the
third line? Rewrite definitions for ATT and ATU:

ATT = E [Y 1 |D = 1] − E [Y 0 |D = 1]
ATU = E [Y 1 |D = 0] − E [Y 0 |D = 0]

Rewrite the third row bias after 1 − π:

ATT − ATU = E[Y1|D = 1] − E[Y0|D = 1]
            − E[Y1|D = 0] + E[Y0|D = 0]
          = 0

If treatment is independent of potential outcomes, then:

EN [yi |di = 1] − EN [yi |di = 0] = E [Y 1 ] − E [Y 0 ]


SDO = ATE
Careful with this notation

Independence only implies that the average values for a


given potential outcome (i.e., Y1 or Y0) are the same for the
group that received the treatment as for the group that did not
Independence does not imply

E[Y1|D = 1] = E[Y0|D = 0]
SUTVA

Potential outcomes model places a limit on what we can


measure: the “stable unit-treatment value assumption” .
Horrible acronym.
1 S: stable
2 U: across all units, or the population
3 TV: treatment-value (“treatment effect”, “causal effect”)
4 A: assumption
SUTVA means that average treatment effects are parameters
that assume (1) homogenous dosage, (2) potential outcomes
are invariant to who else (and how many) is treated (e.g.,
externalities), and (3) partial equilibrium
SUTVA: Homogenous dose

SUTVA constrains what the treatment can be.


Individuals are receiving the same treatment – i.e., the “dose”
of the treatment to each member of the treatment group is
the same. That’s the “stable unit” part.
If we are estimating the effect of hospitalization on health
status, we assume everyone is getting the same dose of the
hospitalization treatment.
Easy to imagine violations if hospital quality varies, though,
across individuals. But, that just means we have to be careful
what we are and are not defining as the treatment
SUTVA: No spillovers to other units

What if hospitalizing Scott (hospitalized, D = 1) is actually


about vaccinating Scott from small pox?
If Scott is vaccinated for small pox, then Nathan’s potential
health status (without vaccination) may be higher than when
he isn’t vaccinated.
In other words, Nathan's Y0 may vary with what Scott does
regardless of whether Nathan himself receives treatment.
SUTVA means that you don’t have a problem like this.
If there are no externalities from treatment, then δi is stable
for each i unit regardless of whether someone else receives the
treatment too.
SUTVA: Partial equilibrium only

Easier to imagine this with a different example.


Scaling up can be a problem because of rising costs of
production
Let’s say we estimate a causal effect of early childhood
intervention in some state
Now the President wants to adopt it for the whole United
States – will it have the same effect as we found?
What if expansion requires hiring lower quality teachers just to
staff the classrooms?
Demand for Learning HIV Status

Rebecca Thornton implemented an RCT in rural Malawi for


her job market paper at Harvard in mid-2000s
At the time, it was an article of faith that you could fight the
HIV epidemic in Africa by encouraging people to get tested;
but Thornton wanted to see if this was true
She randomly assigned cash incentives to people to incentivize
learning their HIV status
Also examined whether learning changed sexual behavior.
Experimental design

Respondents were offered a free door-to-door HIV test


Treatment is randomized vouchers worth between zero and
three dollars
These vouchers were redeemable once they visited a nearby
voluntary counseling and testing center (VCT)
Estimates her models using OLS with controls
Why Include Control Variables?

To evaluate experimental data, one may want to add


additional controls in the multivariate regression model. So,
instead of estimating the prior equation, we might estimate:

Yi = α + δDi + γXi + ηi

There are 2 main reasons for including additional controls in


the regression models:
1 Conditional random assignment. Sometimes randomization is
done conditional on some observable (e.g., gender, school,
districts)
2 Exogenous controls increase precision. Although control
variables Xi are uncorrelated with Di , they may have
substantial explanatory power for Yi . Including controls thus
reduces variance in the residuals which lowers the standard
errors of the regression estimates.
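
As a quick illustration of the second point, here is a small simulated sketch (made-up data, not Thornton's) showing that a covariate that predicts Yi but is unrelated to the randomized Di leaves the estimate centered in the same place while shrinking its standard error:

* Sketch: exogenous controls increase precision in a randomized design
clear all
set seed 2
set obs 1000
gen d = runiform() >= 0.5            // randomly assigned treatment
gen x = rnormal()                    // covariate unrelated to d by design
gen y = 1 + 0.5*d + 3*x + rnormal()  // x has substantial explanatory power

reg y d                              // unbiased for 0.5, but noisier
reg y d x                            // same estimand, smaller SE on d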
Table: Impact of Monetary Incentives and Distance on Learning HIV Results
(five specifications, (1)–(5); standard errors in parentheses)

Any incentive         0.431***  0.309***  0.219***  0.220***  0.219***
                      (0.023)   (0.026)   (0.029)   (0.029)   (0.029)
Amount of incentive   0.091***  0.274***  0.274***  0.273***
                      (0.012)   (0.036)   (0.035)   (0.036)
Amount of incentive²  −0.063*** −0.063*** −0.063***
                      (0.011)   (0.011)   (0.011)
HIV                   −0.055*   −0.052    −0.05     −0.058*   −0.055*
                      (0.031)   (0.032)   (0.032)   (0.031)   (0.031)
Distance (km)         −0.076***
                      (0.027)
Distance²             0.010**
                      (0.005)
Controls              Yes       Yes       Yes       Yes       Yes
Sample size           2,812     2,812     2,812     2,812     2,812
Average attendance    0.69      0.69      0.69      0.69      0.69
Figure: Visual representation of cash transfers on learning HIV test
results.
Results

Even small incentives were effective


Any incentive increased learning of HIV status by 43 percentage points
compared to the control group (mean 34%)
Next she looks at the effect that learning HIV status has on
risky sexual behavior
Figure: Visual representation of cash transfers on condom purchases for
HIV positive individuals.
Table: Reactions to Learning HIV Results among Sexually Active at Baseline

                        Bought condoms         Number of condoms bought
Dependent variable:     OLS        IV          OLS        IV
Got results             −0.022     −0.069      −0.193     −0.303
                        (0.025)    (0.062)     (0.148)    (0.285)
Got results × HIV       0.418***   0.248       1.778***   1.689**
                        (0.143)    (0.169)     (0.564)    (0.784)
HIV                     −0.175**   −0.073      −0.873     −0.831
                        (0.085)    (0.123)     (0.275)    (0.375)
Controls                Yes        Yes         Yes        Yes
Sample size             1,008      1,008       1,008      1,008
Mean                    0.26       0.26        0.95       0.95
Results

For those who were HIV+ and got their test results: 42 percentage
points more likely to buy condoms (but the estimate shrinks and becomes
insignificant at conventional levels with IV).
Number of condoms bought – very small. HIV+ respondents
who learned their status bought 2 more condoms
Randomization inference and causal inference

“In randomization-based inference, uncertainty in estimates


arises naturally from the random assignment of the
treatments, rather than from hypothesized sampling from a
large population.” (Athey and Imbens 2017)
Athey and Imbens is part of growing trend of economists using
randomization-based methods for doing causal inference



Lady tasting tea experiment

Ronald Aylmer Fisher (1890-1962)


Two classic books on statistics: Statistical Methods for
Research Workers (1925) and The Design of Experiments
(1935), as well as a famous work in genetics, The Genetical
Theory of Natural Selection
Developed many fundamental notions of modern statistics
including the theory of randomized experimental design.
Lady tasting tea

Muriel Bristol (1888-1950)


A PhD scientist back in the days when women weren’t PhD
scientists
Worked with Fisher at the Rothamsted Experiment Station
(which she established) in 1919
During afternoon tea, Muriel claimed she could tell from taste
whether the milk was added to the cup before or after the tea
Scientists were incredulous, but Fisher was inspired by her
strong claim
He devised a way to test her claim which she passed using
randomization inference
Description of the tea-tasting experiment

Original claim: Given a cup of tea with milk, Bristol claims she
can discriminate the order in which the milk and tea were
added to the cup
Experiment: To test her claim, Fisher prepares 8 cups of tea –
4 milk then tea and 4 tea then milk – and presents each
cup to Bristol for a taste test
Question: How many cups must Bristol correctly identify to
convince us of her unusual ability to identify the order in which
the milk was poured?
Fisher’s sharp null: Assume she can’t discriminate. Then
what’s the likelihood that random chance was responsible for
her answers?
Choosing subsets

The lady performs the experiment by selecting 4 cups, say, the


ones she claims to have had the tea poured first.
 
C(n, k) = n! / (k!(n − k)!)

"8 choose 4", i.e., C(8, 4), ways to choose 4 cups out of 8


Numerator: 8 × 7 × 6 × 5 = 1,680 ways to choose a first cup,
a second cup, a third cup, and a fourth cup, in order.
Denominator: 4 × 3 × 2 × 1 = 24 ways to order any 4 cups.
Choosing subsets

There are 70 ways to choose 4 cups out of 8, and therefore a


1.4% probability of producing the correct answer by chance:

24 / 1,680 = 1/70 ≈ 0.014

For example, the probability that she would correctly identify
all 4 cups is 1/70
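
A quick way to check these counts is Stata's built-in comb() function:

display comb(8,4)       // 70 ways to choose 4 cups out of 8
display 1/comb(8,4)     // .01428571, the chance of guessing all 4 by luck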
Statistical significance

Suppose the lady correctly identifies all 4 cups. Then . . .


1 Either she has no ability, and has chosen the correct 4 cups
purely by chance, or
2 She has the discriminatory ability she claims.
Since choosing correctly is highly unlikely in the first case (one
chance in 70), the second seems plausible.
1 Fisher is the originator of the convention that a result is
considered “statistically significant” if the probability of its
occurrence by chance is < 0.05, or, less than 1 out of 20.
Bristol actually got all four correct
Replication

Let’s look at tea.do and tea.R to see this experiment


Null hypothesis

In this example, the null hypothesis is the hypothesis that the


lady has no special ability to discriminate between the cups of
tea.
We can never prove the null hypothesis, but the data may
provide evidence to reject it.
In most situations, rejecting the null hypothesis is what we
hope to do.
Null hypothesis of no effect

Randomization inference allows us to make probability


calculations revealing whether the treatment assignment was
“unusual”
Fisher's sharp null is when we entertain the possibility that no unit
has a treatment effect
This allows us to make “exact” p-values which do not depend
on large sample approximations
It also means the inference is not dependent on any particular
distribution (e.g., Gaussian); sometimes called nonparametric
Sidebar: bootstrapping is different

Sometimes people confuse randomization inference with


bootstrapping
Bootstrapping repeatedly re-samples the observations (with
replacement) for estimation; "uncertainty over the sample"
Randomization inference randomly reassigns the treatment;
“uncertainty over treatment assignment”
6-step guide to randomization inference

1 Choose a sharp null hypothesis (e.g., no treatment effects)


2 Calculate a test statistic T (a scalar based on D and Y)
3 Pick a different ("randomized") treatment vector D̃1
4 Calculate the test statistic associated with (D̃, Y)
5 Repeat steps 3 and 4 for all possible combinations to get
  T̃ = {T̃1, . . . , T̃K}
6 Calculate the exact p-value as p = (1/K) Σ_{k=1}^{K} I(T̃k ≥ T)
Pretend experiment

Table: Pretend DBT intervention for some homeless population

Name D Y Y0 Y1
Andy 1 10 . 10
Ben 1 5 . 5
Chad 1 16 . 16
Daniel 1 3 . 3
Edith 0 5 5 .
Frank 0 7 7 .
George 0 8 8 .
Hank 0 10 10 .

For concreteness, assume a program where we pay homeless people


$15 to take dialectical behavioral therapy. Outcomes are some
measure of mental health 0-20 with higher scores being better.
Step 1: Sharp null of no effect

Fisher’s Sharp Null Hypothesis


H0 : δi = Yi1 − Yi0 = 0 ∀i

Assuming no effect means any test statistic is due to chance


Neyman and Fisher test statistics were different – Fisher was
exact, Neyman was not
Neyman’s null was no average treatment effect (ATE=0). If
you have a treatment effect of 5 and I have a treatment effect
of -5, our ATE is zero. This is not the sharp null even though
it also implies a zero ATE
More sharp null

Since under the Fisher sharp null δi = 0, it means each unit’s


potential outcomes under both states of the world are the same
We therefore know each unit’s missing counterfactual
The randomization we will perform cycles through alternative


treatment assignments; under the null, treatment assignment
doesn't matter because every assignment is associated with
zero unit-level treatment effects
We are looking for evidence against the null
Step 1: Fisher’s sharp null and missing potential outcomes

Table: Missing potential outcomes are no longer missing

Name D Y Y0 Y1
Andy 1 10 10 10
Ben 1 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 0 7 7 7
George 0 8 8 8
Hank 0 10 10 10

Fisher sharp null allows us to fill in the missing counterfactuals bc


under the null there’s zero treatment effect at the unit level. This
guarantees zero ATE, but is different in formulation than Neyman’s
null effect of no ATE.
Step 2: Choosing a test statistic

Test Statistic
A test statistic T (D, Y ) is a scalar quantity calculated from the
treatment assignments D and the observed outcomes Y

By scalar, I just mean it’s a number (vs. a function) measuring


some relationship between D and Y
Ultimately there are many tests to choose from; I’ll review a
few later
If you want a test statistic with high statistical power, you
need large values when the null is false, and small values when
the null is true (i.e., extreme)
Simple difference in means

Consider the absolute SDO from earlier


δSDO = | (1/NT) Σ_{i=1}^{N} Di Yi − (1/NC) Σ_{i=1}^{N} (1 − Di) Yi |

Larger values of δSDO are evidence against the sharp null


Good estimator for constant, additive treatment effects and
relatively few outliers in the potential outcomes
Step 2: Calculate test statistic, T (D, Y )

Table: Calculate T using D and Y

Name D Y Y0 Y1 δi
Andy 1 10 10 10 0
Ben 1 5 5 5 0
Chad 1 16 16 16 0
Daniel 1 3 3 3 0
Edith 0 5 5 5 0
Frank 0 7 7 7 0
George 0 8 8 8 0
Hank 0 10 10 10 0

We'll start with the simple difference in means test
statistic, T(D, Y): δSDO = |34/4 − 30/4| = 1
Steps 3-5: Null randomization distribution

Randomization steps reassign treatment assignment for every


combination, calculating test statistics each time, to obtain
the entire distribution of counterfactual test statistics
The key insight of randomization inference is that under
Fisher’s sharp null, the treatment assignment shouldn’t matter
Ask yourself:
if there is no unit level treatment effect, can you picture a
distribution of counterfactual test statistics?
and if there is no unit level treatment effect, what must
average counterfactual test statistics equal?
Step 6: Calculate “exact” p-values

Question: how often would we get a test statistic as big or


bigger as our “real” one if Fisher’s sharp null was true?
This can be calculated “easily” (sometimes) once we have the
randomization distribution from steps 3-5
The number of randomized test statistics T(D̃, Y) at least as large as
the observed statistic T(D, Y), divided by the total number of
randomizations:

Pr(T(D̃, Y) ≥ T(D, Y) | δ = 0) = Σ_{D̃ ∈ Ω} I(T(D̃, Y) ≥ T(D, Y)) / K
These are “exact” tests when they use every possible
combination of D
When you can’t use every combination, then you can get
approximate p-values from a simulation (TBD)
With a rejection threshold of α (e.g., 0.05), randomization
inference test will falsely reject less than 100×α% of the time
First permutation (holding NT fixed)

Name D˜2 Y Y0 Y1
Andy 1 10 10 10
Ben 0 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 1 7 7 7
George 0 8 8 8
Hank 0 10 10 10

T̃1 = |36/4 − 28/4|= 9 − 7 = 2


Second permutation (again holding NT fixed)

Name D˜3 Y Y0 Y1
Andy 1 10 10 10
Ben 0 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 0 7 7 7
George 1 8 8 8
Hank 0 10 10 10

T̃2 = |37/4 − 27/4| = |9.25 − 6.75| = 2.5


Sidebar: Should it be 4 treatment groups each time?

In this experiment, I’ve been using the same NT under the


assumption that NT had been fixed when the experiment was
drawn.
But if the original treatment assignment had been generated
by something like a Bernoulli distribution (e.g., coin flips over
every unit), then you should be doing a complete permutation
that is also random in this way
This means that for 8 units, sometimes you’d have 1 treated,
or even 8
Correct inference requires you know the original data
generating process
Randomization distribution

Assignment   D1  D2  D3  D4  D5  D6  D7  D8   |T̃i|
True D        1   1   1   1   0   0   0   0    1
D̃2            1   0   1   1   0   1   0   0    2
D̃3            1   0   1   1   0   0   1   0    2.5
...
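
Here is a minimal Stata sketch of this exercise using the built-in permute command, which reshuffles D while holding the number of treated units fixed and recomputes the test statistic each time (Monte Carlo draws rather than enumerating all 70 assignments):

* Sketch: randomization inference for the 8-person example
clear
input str8 name d y
"Andy"   1 10
"Ben"    1  5
"Chad"   1 16
"Daniel" 1  3
"Edith"  0  5
"Frank"  0  7
"George" 0  8
"Hank"   0 10
end

* Observed test statistic: the difference in means (the coefficient on d)
reg y d

* Reshuffle d many times and recompute the coefficient to build the
* randomization distribution and a permutation p-value
permute d _b[d], reps(1000) seed(1): reg y d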
Step 2: Other test statistics

The simple difference in means is fine when effects are


additive, and there are few outliers in the data
But outliers create more variation in the randomization
distribution
What are some alternative test statistics?
Transformations

What if there was a constant multiplicative effect:


Yi1 /Yi0 = C ?
Difference in means will have low power to detect this
alternative hypothesis
So we transform the observed outcome using the natural log:
Tlog = | (1/NT) Σ_{i=1}^{N} Di ln(Yi) − (1/NC) Σ_{i=1}^{N} (1 − Di) ln(Yi) |

This is useful for skewed distributions of outcomes


Difference in medians/quantiles

We can protect against outliers using other test statistics such


as the difference in quantiles
Difference in medians:

Tmedian = |median(YT ) − median(YC )|

We could also estimate the difference in quantiles at any point


in the distribution (e.g., 25th or 75th quantile)
Rank test statistics

Basic idea is rank the outcomes (higher values of Yi are


assigned higher ranks)
Then calculate a test statistic based on the transformed
ranked outcome (e.g., mean rank)
Useful with continuous outcomes, small datasets and/or many
outliers
Rank statistics formally

Rank is the domination of others (including oneself):


Ri = Ri(Y1, . . . , YN) = Σ_{j=1}^{N} I(Yj ≤ Yi)

Normalize the ranks to have mean 0


R̃i = Ri − (N + 1)/2

Calculate the absolute difference in average ranks:


Trank = |R̄T − R̄C| = | Σ_{i: Di=1} R̃i / NT − Σ_{i: Di=0} R̃i / NC |

Minor adjustment (averages) for ties


Randomization distribution

Name     D   Y   Y0   Y1   Rank   R̃i
Andy     1   10  10   10   6.5    2
Ben      1   5   5    5    2.5    -2
Chad     1   16  16   16   8      3.5
Daniel   1   3   3    3    1      -3.5
Edith    0   5   5    5    2.5    -2
Frank    0   7   7    7    4      -0.5
George   0   8   8    8    5      0.5
Hank     0   10  10   10   6.5    2

Trank = |0 − 0| = 0
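
A short sketch of this calculation in Stata (it assumes the d and y variables from the 8-person example above are in memory):

* Sketch: normalized rank test statistic (ties get the average rank)
egen rnk = rank(y)
gen rtilde = rnk - (_N + 1)/2        // normalize ranks to mean zero
su rtilde if d == 1, meanonly
scalar rbar_t = r(mean)
su rtilde if d == 0, meanonly
scalar rbar_c = r(mean)
display "T_rank = " abs(rbar_t - rbar_c)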


Effects on outcome distributions

Focused so far on “average” differences between groups.


Kolmogorov-Smirnov test statistics is based on the difference
in the distribution of outcomes
Empirical cumulative distribution function (eCDF):

F̂C(y) = (1/NC) Σ_{i: Di=0} 1(Yi ≤ y)

F̂T(y) = (1/NT) Σ_{i: Di=1} 1(Yi ≤ y)

Proportion of observed outcomes below a chosen value for


treated and control separately
If two distributions are the same, then F̂C(Y) = F̂T(Y)
Kolmogorov-Smirnov statistic

Test statistics are scalars not functions


eCDFs are functions, not scalars
Solution: use the maximum discrepancy between the two
eCDFs:

TKS = max_i |F̂T(Yi) − F̂C(Yi)|
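
In Stata, a quick way to get this statistic is the built-in two-sample test (the variable names here are placeholders):

* Sketch: KS statistic comparing the outcome distribution across groups
ksmirnov y, by(d)       // reports the largest distance between the two eCDFs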
Kernel density by group status

Figure: Kernel density estimates of the outcome y, plotted separately for
the treatment and control groups.
eCDFs by treatment status and test statistic

Figure: Empirical CDFs of y for the treatment and control groups; the KS
statistic is the maximum vertical distance between the two eCDFs.
KS Test Statistic

Treatment D Exact P-value


K-S 0.4500 0.034

Max distance is 0.45. Exact p is 0.034.


“Which bear is best?” – Jim Halpert

A good test statistic is the one that best fits your data. Some test
statistics will have weird properties in the randomization as we’ll
see in synthetic control.
One-sided or two-sided?

So far, we have defined all test statistics as absolute values


We are testing against a two-sided alternative hypothesis

H0 : δi = 0 ∀i
H1 : δi ≠ 0 for some i

What about a one-sided alternative

H0 : δi = 0 ∀i
H1 : δi > 0 for some i

For these, use a test statistic that is bigger under the


alternative:

Tdiff* = ȲT − ȲC
Small vs. Modest Sample Sizes are non-trivial

Computing the exact randomization distribution is not always


feasible (Wolfram Alpha)
N = 6 and NT = 3 gives us 20 assignment vectors
N = 8 and NT = 4 gives us 70 assignment vectors
N = 10 and NT = 5 gives us 252 assignment vectors
N = 20 and NT = 10 gives us 184,756 assignment vectors
N = 50 and NT = 25 gives us 1.2641061 × 10^14 assignment
vectors
Exact p calculations are not realistic bc the number of assignments
explodes at even modest size
Approximate p values

Use simulation to get approximate p-values


Take K samples from the treatment assignment space
Calculate the randomization distribution in the K samples
Tests no longer exact, but bias is under your control (increase
K)
Imbens and Rubin show that p values converge to stable p
values pretty quickly (in their example after 1000 replications)
Sample dataset

Let’s do this now with Thornton’s data. You can replicate that
using thorton_ri.do or thornton_ri.R
Thornton’s experiment

ATE Iteration Rank p no. trials


0.45 1 1 0.01 100
0.45 1 1 0.002 500
0.45 1 1 0.001 1000

Table: Estimated p-value using different number of trials.


Including covariate information

Let Xi be a pretreatment measure of the outcome


One option is to use the gain score, Yi − Xi, in place of Yi
Causal effects are the same: (Yi1 − Xi) − (Yi0 − Xi) = Yi1 − Yi0
But the test statistic is different:


Tgain = |(ȲT − ȲC) − (X̄T − X̄C)|

If Xi is strongly predictive of Yi0, then this could have higher


power
Tgain will have lower variance under the null
This makes it easier to detect smaller effects
Regression in RI

We can extend this to use covariates in more complicated ways


For instance, we can use an OLS regression:

Yi = α + δDi + βXi + ε

Then our test statistic could be TOLS = δ̂


RI is justified even if the model is wrong
OLS is just another way to generate a test statistic
The more the model is “right” (read: predictive of Yi0 ), the
higher the power TOLS will have
See if you can do this in Thornton’s dataset using the loops
and saving the OLS coefficient (or just use ritest)
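
A sketch of the second route using the user-written ritest package (ssc install ritest); the variable names below are placeholders, not necessarily those in Thornton's dataset:

* Sketch: randomization inference with an OLS coefficient as the test statistic
* (treat, outcome, x1, x2 are placeholder variable names)
ritest treat _b[treat], reps(1000) seed(1): regress outcome treat x1 x2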
Judea Pearl and DAGs

Judea Pearl and colleagues in Artificial Intelligence at UCLA


developed DAG modeling to create a formalized causal
inference methodology
They make causality concepts extremely clear, they provide a map to
the estimation strategy, and maybe best of all, they
communicate to others what must be true about the data
generating process to recover the causal effect



Judea Pearl, 2011 Turing Award winner, drinking his first IPA
Further reading

1 Pearl (2018) The Book of Why: The


New Science of Cause and Effect, Basic Books (popular)
2 Morgan and Winship (2014)
Counterfactuals and Causal Inference: Methods and Principles
for Social Research, Cambridge University Press, 2nd edition
(excellent)
3 Pearl, Glymour and Jewell (2016)
Causal Inference In Statistics: A Primer, Wiley Books
(accessible)
4 Pearl (2009) Causality: Models, Reasoning and Inference,
Cambridge, 2nd edition (difficult)
5 Cunningham (2021) Causal Inference: The Mixtape, Yale, 1st
edition (best choice, no question)
Causal model

The causal model is sometimes called the structural model,


but for us, I prefer the former as it’s less alienating
It’s the system of equations describing the relevant aspects of
the world
It necessarily is filled with causal effects associated with some
particular comparative statics
To illustrate, I will assume a Beckerian human capital model
Human capital model: statements and graphs

Let’s describe my simplified Beckerian human capital model.


Individuals maximize utility by choosing consumption and
schooling (D) subject to multi-period budget constraint
Education has current costs but longterm returns
But people choose different levels of schooling based on a
number of things we will call “background” (B) which won’t be
in the dataset (“unobserved”)
And own schooling will also be affected by parental education
(PE) and family income (I)
Finally, wages (Y) are a function of own schooling (D) and family income (I)
Becker’s human capital causal model

We can represent that causal model visually

PE I

D Y
B

PE is parental education, B is “unobserved background factors


(i.e., “ability”)”, I is family income, D is college education and Y is
log wages. The DAG is an approximation of Becker’s underlying
(causal) human capital model.
Arrows, but also missing arrows

Before we dive into all this notation, couple of things

PE I

D Y
B

PE and D are caused by B. But why doesn’t B cause Y ?? Do you


believe this? Why/why not? We can dispute this, but notice – we
can see the assumption, which is transparent and communicates
the author’s beliefs, as well as the needed assumptions in their
forthcoming empirical model. Every empirical strategy makes
assumptions, but oftentimes they are not as transparent to us as
this is.
PE I

D Y
B

B is a parent of PE and D
PE and D are descendants of B
There is a direct (causal) path from D to Y
There is a mediated (causal) path from B to Y through D
There are four paths from PE to Y but none are direct, and
one is unlike the others
Colliders

PE I

D Y
B

Notice anything different with this DAG? Look closely.


D is a collider along the path B → D ← I (i.e., “colliding” at
D)
D is a noncollider along the path B → D → Y
Summarizing Value of DAGs imo

1 Facilitates the task of designing identification strategy for


estimating average causal effects
2 Facilitates the task of testing compatibility of the model with
your data
3 Visualizes the identifying assumptions which opens up the
model to critical scrutiny
Creating DAGs

The DAG depicts the relevant causal relationships describing the


relationship between D and Y
It will include:
All direct causal effects among the relevant variables in the
graph
All common causes of any pair of relevant variables in the
graph
No need to model a dinosaur stepping on a bug that, a million
years later, produces some evolved creature that impacted your decision
to go to college
We get ideas for DAGs from theory, models, observation,
experience, prior studies, intuition
Sometimes called the data generating process.
Confounding

Omitted variable bias has a name in DAGs: “confounding”


Confounding occurs when the treatment and the
outcome have a common cause or parent, which creates
spurious correlation between D and Y
D Y

The correlation between D and Y no longer reflects the causal


effect of D on Y
Backdoor Paths

Confounding creates backdoor paths between treatment and


outcome (D ← X → Y ) – i.e., spurious correlations
Not the same as mediation (D → X → Y )
We can “block” backdoor paths by conditioning on the
common cause X
Once we condition on X , the correlation between D and Y
estimates the causal effect of D on Y
Conditioning means calculating
E [Y |D = 1, X ] − E [Y |D = 0, X ] for each value of X then
combining (e.g., integrating)

D Y

X
Blocked backdoor paths

A backdoor path is blocked if and only if:


It contains a noncollider that has been conditioned on
Or it contains a collider that has not been conditioned on
Examples of blocked paths

Examples:
1 Conditioning on a noncollider blocks a path:

  X → Z → Y (conditioning on Z blocks the path)
2 Conditioning on a collider opens a path (i.e., creates spurious
correlations):
  Z → X ← Y (conditioning on the collider X opens the path)
3 Not conditioning on a collider blocks a path:
  Z → X ← Y (leaving the collider X alone keeps the path blocked)
Backdoor criterion

Backdoor criterion
Conditioning on X satisfies the backdoor criterion with respect to
(D, Y ) directed path if:
1 All backdoor paths are blocked by X
2 No element of X is a collider
In words: If X satisfies the backdoor criterion with respect to
(D, Y ), then controlling for or matching on X identifies the causal
effect of D on Y
What control strategy meets the backdoor criterion?

List all backdoor paths from D to Y . I’ll wait.

X1 D Y

X2

What are the necessary and sufficient set of controls which will
satisfy the backdoor criterion?
What if you have an unobservable?

List all the backdoor paths from D to Y .


X1

U X2 D Y

What are the necessary and sufficient set of controls which will
satisfy the backdoor criterion?
What about the unobserved variable, U?
Multiple strategies

X1

X3 D Y

X2
X1

X3 D Y

X2

Conditioning on the common causes, X1 and X2 , is sufficient


. . . but so is conditioning on X3
Testing the Validity of the DAG

The DAG makes testable predictions


Conditional on D and I , parental education (PE ) should no
longer be correlated with Y
Can be hard to figure this out by hand, but software can help
(e.g., DAGitty at dagitty.net is browser based)

PE I

D Y
B
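
One way to see what this testable implication means in practice is to simulate data from a DAG like this one and check it. The sketch below uses made-up linear equations and coefficients (and a continuous D) purely for illustration:

* Sketch: simulate a Becker-style DAG and test the implied conditional independence
clear all
set seed 10
set obs 10000
gen b   = rnormal()                            // unobserved background (ability)
gen pe  = 0.8*b + rnormal()                    // parental education
gen inc = 0.7*pe + rnormal()                   // family income
gen d   = 0.5*b + 0.4*pe + 0.3*inc + rnormal() // schooling
gen y   = 1.0*d + 0.6*inc + rnormal()          // wages: no direct arrow from B or PE

* Implication of the DAG: conditional on D and I, PE should not predict Y
reg y pe d inc        // coefficient on pe should be close to zero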
Collider bias

Conditioning on a collider introduces spurious


correlations; can even mask causal directions
There is only one backdoor path from D to Y

X1 D Y

X2

Conditioning on X1 blocks the backdoor path


But what if we also condition on X2 ?

X1 D Y

X2

Conditioning on X2 opens up a new path, creating new


spurious correlations between D and Y
Even controlling for pretreatment covariates can create
bias
Name the backdoor paths. Is it open or closed?
U1

X D Y

U2

But what if we condition on X ?


U1

X D Y

U2
Living in reality - he doesn’t love you

Fact #1: We can't know if we have a collider or confounder


problem without making assumptions about the
causal model (i.e., it's not in the codebook)
Fact # 2: You can’t just haphazardly throw in a bunch of
controls on the RHS (i.e., “the kitchen sink”) bc you may
inadvertently be conditioning on a collider which can lead to
massive biases
Fact # 3: You have no choice but to leverage economic
theory, intuition, intimate familiarity with institutional details
and background knowledge for research designs.
Fact #4: You can only estimate causal effects with data and
assumptions.
Examples of collider bias
Bad controls

Angrist and Pischke in MHE talk about a specific type of


danger associated with controlling for an outcome – “bad
controls”
The problem is not controlling for an outcome per se;
the problem is controlling for a collider and not correcting for
that
This has implications for when you work with non-random
administrative data, too
Sample selection example of collider bias

Important: Since unconditioned colliders block back-door paths,


what exactly does conditioning on a collider do? Let’s illustrate
with a fun example and some made-up data
CNN.com headline: Megan Fox voted worst – but sexiest –
actress of 2009 (link)
Are these two things actually negatively correlated in the
world?
Assume talent and beauty are independent, but each causes
someone to become a movie star. What’s the correlation
between talent and beauty for a sample of movie stars
compared to the population as a whole (stars and non-stars)?
What if the sample consists only of movie stars?

Movie Star

Talent Beauty
Stata code

clear all
set seed 3444

* 2500 independent draws from standard normal distribution
set obs 2500
generate beauty = rnormal()
generate talent = rnormal()

* Creating the collider variable (star)
gen score = (beauty + talent)
egen c85 = pctile(score), p(85)
gen star = (score >= c85)
label variable star "Movie star"

* Conditioning on the top 15%
twoway (scatter beauty talent, mcolor(black) msize(small) msymbol(smx)), ///
    ytitle(Beauty) xtitle(Talent) subtitle(Aspiring actors and actresses) ///
    by(star, total)
Figure: Top left figure: non-star sample scatter plot of beauty (vertical axis) and talent
(horizontal axis). Top right figure: star sample scatter plot of beauty and talent.
Bottom left figure: entire (stars and non-stars combined) sample scatter plot of beauty and
talent.
Stata

Run Stata file star.do


Occupational sorting and discrimination example of collider
bias

Let’s look at another example: very common for think tanks


and journalists to say that the gender gap in earnings
disappears once you control for occupation.
But what if occupation is a collider, which it could be in a
model with occupational sorting
Then controlling for occupation in a wage regression searching
for discrimination can lead to all kinds of crazy results even in
a simulation where we explicitly design there to be
discrimination
DAG

F y

o A

F is female, d is discrimination, o is occupation, y is earnings and


A is ability. Dashed lines mean the variable cannot be observed.
Note, by design, being a female has no direct effect on earnings or
occupation except through discrimination, and has no relationship with
ability. So variation in earnings comes through discrimination, occupation,
and ability.
d

F y

o A

Mediation and Backdoor paths


1 d →o→y
2 d →o←A→y
Stata model (Erin Hengel)

Erin Hengel (www.erinhengel.com) and I worked out this


code and she gave me permission to put in my Mixtape
Let’s look at collider_discrimination.do or
collider_discrimination.R together
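
If you don't have the file handy, here is a minimal sketch of a data generating process consistent with the DAG above and the table that follows (the actual collider_discrimination.do may differ in details):

* Sketch: discrimination operates only through F -> d; occupation is a collider
clear all
set seed 541
set obs 10000
gen female = runiform() >= 0.5
gen ability = rnormal()
gen discrimination = female
gen occupation = 1 + 2*ability + 0*female - 2*discrimination + rnormal()
gen wage = 1 - 1*discrimination + 1*occupation + 2*ability + rnormal()

reg wage female                      // combined effect (direct plus via occupation)
reg wage female occupation           // conditioning on the collider: sign can flip
reg wage female occupation ability   // recovers the -1 direct wage effect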
Table: Regressions illustrating collider bias with simulated gender disparity

                              Unbiased          Biased      Unbiased wage
Covariates:                   combined effect               effect only

Female                        -3.074***         0.601***    -0.994***
                              (0.000)           (0.000)     (0.000)
Occupation                                      1.793***    0.991***
                                                (0.000)     (0.000)
Ability                                                     2.017***
                                                            (0.000)

N                             10,000            10,000      10,000
Mean of dependent variable    0.45              0.45        0.45

Recall we designed there to be a discrimination coefficient of -1


If we do not control for occupation, then we get the combined effect of
d → o → y and d → y
Because it seems intuitive to control for occupation, notice column 2 - the sign
flips!
We are only able to isolate the direct causal effect by conditioning on ability and
occupation, but ability is unobserved
Administrative data

Admin data has become extremely common, if not absolutely


necessary
But naive use of admin data can be dangerous if the drawing
of the sample is itself a collider problem (Heckman 1979;
Elwert and Winship 2014)
Let’s look at a new paper by Fryer (2019) and a critique by
Knox, et al. (2019)
Collider bias and police use of force

Claims of excessive and discriminator use of police force


against minorities (e.g., Black Lives Matter, Trayvon Martin,
Michael Brown, Eric Garner)
Challenging to identify
Police-citizen interactions are conditional on interactions
having already been triggered
That initial interaction is unobserved
Fryer (2019) is a monumental study for its data collection and
analysis: Stop and Frisk, Police-Public Contact Survey, and
admin data from two jurisdictions
Codes up almost 300 variables from arrest narratives which
range from 2-100 pages in length – shoeleather!
Initial interaction

Fryer finds that blacks and Hispanics were more than 50%
more likely to have an interaction with the police in NYC Stop
and Frisk as well as the Police-Public Contact survey
It survives extensive controls – magnitudes fall, but still very
large (21%)
Moves to admin data
Conditional on police interaction, no racial differences in
officer-related shootings
Fryer calls it one of the most surprising findings in his career
Lots of eyes on this study as a result of the counter intuitive
results; published in JPE
Knox, et al. (2020) claim his data is itself a collider. What?
Controls
X

Minority Stop Force


D M Y

U
Suspicion

Fryer told us D → M exists from both Stop and Frisk and


Police-Public Contact data. But note: the admin data consist only of
stops (M), so the sample itself conditions on a collider. If this DAG is
true, then conditioning on M opens spurious correlations between D and Y
which may dilute our ability to estimate causal effects.
Knox, et al (2020)

Move from DAG to more contemporary potential outcomes


notation to design relevant parameters
Use potential outcomes and bounds
Even their lower bound estimates of the incidence of police
violence against civilians are more than 5x higher than what
Fryer (2019) finds
Heckman (1979) – we cannot afford to ignore sample selection
Summarizing all of this

Your dataset will not come with a codebook flagging some


variables as “confounders” and other variables as “colliders”
because those terms are always context specific
Except for some unique situations that aren’t generally
applicable, you also don’t always know statistically you have an
omitted variable bias problem; but both of these are fatal for
any application
You only know to do what you’re doing based on knowledge
about data generating process.
All identification must be guided by theory, experience,
observation, common sense and knowledge of institutions
DAGs absorb that information and can be then used to write
out the explicit identifying model
DAGs are not panacea

DAGs cannot handle, though, reverse causality or simultaneity


So there are limitations. “All models are wrong but some are
useful”
They are also not popular (see Twitter ongoing debates which
have descended into light hearted jokes as well as aggressive
debates)
But I think they are helpful and while not necessary, showcase
what is necessary – assumptions
Heckman (1979) can maybe provide some justification at times
What is regression discontinuity design?

Very popular particular type of research design known as regression


discontinuity design (RDD). Cook (2008) has a fascinating history
of thought on how and why.
Donald Campbell, educational psychologist, invented
regression discontinuity design (Thistlethwaite and Campbell,
1960), but then it went dormant for decades (Cook 2008).
Angrist and Lavy (1999) and Black (1999) independently
rediscover it. It’s become incredibly popular in economics



Tell me what you think is happening

Figure: Fraction enrolled at the flagship state university, plotted against
SAT points above the admission cutoff, with local averages (Figure 1 from
"The Effect of Attending the Flagship State University on Earnings").

Tell me what you think is happening

Figure: Natural log of annual earnings for white men ten to fifteen years
after high school graduation, fit with a cubic polynomial of adjusted SAT
score. Estimated discontinuity = 0.095 (z = 3.01) (Figure 2 from the same
paper).


What is a regression discontinuity design?

We want to estimate some causal effect of a treatment on


some outcome, but we're worried about selection bias

E[Y0|D = 1] ≠ E[Y0|D = 0]

due to self-selection into treatment


RDD is based on an idea: if treatment assignment occurs
abruptly when some underlying variable X called the "running
variable" passes a cutoff c0, then we can use that to estimate
the causal effect even of a self-selected treatment
Running and jumping

Firms, schools and govt agencies have running variables that


are used to assign treatments in their rules
And consequently, probabilities of treatment will “jump” when
that running variable exceeds a known threshold
Most effective RDD studies involve programs where running
variables assign treatments based on a “hair trigger”
Good reasons; inexplicable reasons; arbitrary rules; a choice
made by necessity and resource constraints; natural
experiments
Selection examples and solutions from the literature

Think of these in light of a treatment where


E[Y0|D = 1] ≠ E[Y0|D = 0]
Yelp rounded a continuous score of ratings to generate stars
which Anderson and Magruder 2011 used to study firm revenue
US targeted air strikes in Vietnam using rounded risk scores
which Dell and Querubin 2018 used to study the military and
political activities of the communist state
Card, Dobkin, and Maestas 2008 studied the effect of
universal healthcare on mortality and healthcare usage
exploiting jumps at age 65
Almond, et al. 2010 studied the effect of intensive medical
attention on health outcomes when a newborn’s birthweight
fell just below 1,500 grams
Hungry, hungry hippo

Data requirements can be substantial. Large sample sizes are


characteristic features of the RDD
If there are strong trends, one typically needs a lot of data for
reasons I’ll explain soon
Researchers are typically using administrative data or settings
such as birth records where there are many observations
Might explain why the method never caught on until the 00’s
(A) Data generating graph (B) Limiting graph

X U X → c0 U

D Y D Y



Figure: Sharp vs. Fuzzy RDD. The conditional probability of treatment
(vertical axis, 0 to 1) is plotted against the running variable X: in the
sharp design it jumps from 0 to 1 at the cutoff; in the fuzzy design it
jumps discontinuously but by less than one.


Sharp vs. Fuzzy RDD

There’s traditionally thought to be two kinds of RD designs:


1 Sharp RDD: Treatment is a deterministic function of running
variable, X . Example: Medicare benefits.
2 Fuzzy RDD: Discontinuous “jump” in the probability of
treatment when X > c0 . Cutoff is used as an instrumental
variable for treatment. Example: attending state flagship
Fuzzy is a type of IV strategy and requires explicit IV
estimators like 2SLS; sharp is reduced form IV and doesn’t
require IV-like estimators
Overlap

Independence implies an equal distribution of characteristics


across two groups guaranteeing overlap
In an RCT you can find 65 year olds treated and untreated
But RDD doesn’t have this feature bc you don’t have groups
with the same value of X in each group, so no overlap
64 year olds are in the control group, not treatment. 66 year olds are
in treatment, not control
Some methods require overlap and therefore are off the table
without it; but RDD has a workaround using extrapolation
Treatment assignment in the sharp RDD

Deterministic treatment assignment ("sharp RDD")


In Sharp RDD, treatment status is a deterministic and
discontinuous function of a covariate, Xi:

Di = 1 if Xi ≥ c0
Di = 0 if Xi < c0

where c0 is a known threshold or cutoff. In other words, if you


know the value of Xi for a unit i, you know treatment assignment
for unit i with certainty.

Universal health insurance: Americans aged 64 are not eligible for


Medicare, but Americans aged 65 or older (X ≥ c0 = 65) are eligible for
Medicare (ignoring disability exemptions)
Treatment effect definition and estimation

Definition of treatment effect


The treatment effect parameter, δ, is the discontinuity in the
conditional expectation function:

δ = lim_{Xi → c0+} E[Yi1 | Xi = c0] − lim_{Xi → c0−} E[Yi0 | Xi = c0]
  = lim_{Xi → c0+} E[Yi | Xi = c0] − lim_{Xi → c0−} E[Yi | Xi = c0]

The sharp RDD estimation is interpreted as an average causal


effect of the treatment at the discontinuity

δSRD = E [Yi1 − Yi0 |Xi = c0 ]

D is correlated with X and deterministic function of X ; overlap


only occurs in the limit and thus the treatment effect is in the limit
as X approaches c0
Extrapolation

In RDD, the counterfactuals are conditional on X .


We use extrapolation in estimating treatment effects with the
sharp RDD bc we do not have overlap
Left of cutoff, only non-treated observations, Di = 0 for
X < c0
Right of cutoff, only treated observations, Di = 1 for X ≥ c0
The extrapolation is to a counterfactual



Extrapolation

Estimation methods attempt to approximate the limiting parameter


using units left and right of the cutoff

Figure: Dashed lines are extrapolations


Key identifying assumption

Smoothness (or continuity) of conditional expectation functions


(Hahn, Todd and Van der Klaauw 2001)
E [Yi0 |X = c0 ] and E [Yi1 |X = c0 ] are continuous (smooth) in X at
c0 .

Potential outcomes not actual outcomes


If population average potential outcomes, Y 1 and Y 0 , are
smooth functions of X through the cutoff, c0 , then potential
average outcomes won’t jump at c0 .
Implies the cutoff is exogenous – i.e., nothing else changes
related to potential outcomes at c0
Unobservables are evolving smoothly, too, through the cutoff
Smoothness is the identifying assumption and untestable

The smoothness assumption allows us to use average outcome


of units right below the cutoff as a valid counterfactual for
units right above the cutoff.
In other words, extrapolation is allowed if smoothness is
credible, and extrapolation is nonsensical if smoothness isn't
credible
The causal effect of the treatment will be based on
extrapolation from the trend, E [Yi0 |X < c0 ], to those values
of X > c0 for the E [Yi0 |X > c0 ].
Means you have to think long and hard about smoothness and
what violations mean in your context
Why then is it not directly testable? Because potential
outcomes are counterfactual
Graphical example of the smoothness assumption

Note these are potential not actual outcomes


Graphical example of the treatment effect, not the
smoothness assumption

Figure: Outcome (Y) plotted against test score (X); the vertical distance
at the cutoff is the treatment effect.

Note that these are actual, not potential outcomes


Re-centering the data

It is common for authors to transform X by “centering” at c0 :

Yi = α + β(Xi − c0 ) + δDi + εi

This doesn’t change the interpretation of the treatment effect


– only the interpretation of the intercept.
Re-centering the data

Example: Medicare and age 65. Center the running variable


(age) by subtracting 65:

Y = β0 + β1 (Age − 65) + β2 Edu


= β0 + β1 Age − β1 65 + β2 Edu
= α + β1 Age + β2 Edu

where α = β0 − β1 65.
All other coefficients, notice, have the same interpretation,
except for the intercept.
Regression without re-centering

reg y D x

Regression with centering

gen x_c = x - 140

reg y D x_c
Nonlinearity bias

Smoothness and linearity are different things.


What if the trend relation E [Yi0 |Xi ] does not jump at c0 but
rather is simply nonlinear?
Then your linear model will identify a treatment effect when
there isn’t because the functional form had poor predictive
properties beyond the cutoff
Let’s look at a simulation
gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x + x2 + rnormal(0,1000)

scatter y x if D==0, msize(vsmall) || scatter y x ///
    if D==1, msize(vsmall) legend(off) xline(140, ///
    lstyle(foreground)) ylabel(none) || lfit y x ///
    if D==0, color(red) || lfit y x if D==1, ///
    color(red) xtitle("Test Score (X)") ///
    ytitle("Outcome (Y)")
See how the two lines don't touch at c0 even though empirically they
should? That's bc the linear fit is the wrong functional form – we know
this from the simulation.
Sharp RDD: Nonlinear Case

Suppose the nonlinear relationship is E [Yi0 |Xi ] = f (Xi ) for


some reasonably smooth function f (Xi ) (drumroll – like a
cubic!)
In that case we’d fit the regression model:

Yi = f (Xi ) + δDi + ηi

Since f (Xi ) is counterfactual for values of Xi > c0 , how will


we model the nonlinearity?
There are 2 common ways of approximating f (Xi )
Nonlinearities

Until Gelman and Imbens 2018, people favored "higher order


polynomials," but this is problematic due to overfitting. Gelman and
Imbens 2018 recommend at most a quadratic
1 Use global and local regressions with f (Xi ) equalling a p th
order polynomial

Yi = α + δDi + β1 xi + β2 xi2 + · · · + βp xip + ηi

2 Or use some nonparametric kernel method which I’ll cover later
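
For the nonparametric route, a common sketch uses the user-written rdrobust package (installable via ssc); y, x, and the cutoff of 140 below follow the simulation in these slides:

* Sketch: local-polynomial RD estimation and the standard RD plot
rdrobust y x, c(140) p(1)    // local linear fit on each side, data-driven bandwidth
rdplot y x, c(140)           // binned scatter with polynomial fits overlaid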


Different polynomials on the 2 sides of the discontinuity

We can generalize the function, f (xi ), by allowing it to differ


on both sides of the cutoff by including them both individually
and interacting them with Di .
In that case we have:

E [Yi0 |Xi ] = α + β01 X̃i + β02 X̃i2 + · · · + β0p X̃ip


E [Yi1 |Xi ] = α + δ + β11 X̃i + β12 X̃i2 + · · · + β1p X̃ip

where X̃i is the centered running variable (i.e., Xi − c0 ).


Lines to the left, lines to the right of the cutoff

Re-centering at c0 ensures that the treatment effect at


Xi = c0 is the coefficient on Di in a regression model with
interaction terms
As Lee and Lemieux (2010) note, allowing different functions
on both sides of the discontinuity should be the main results in
an RDD paper
Different polynomials on the 2 sides of the discontinuity

To derive a regression model, first note that the observed


values must be used in place of the potential outcomes:

E[Y|X] = E[Y0|X] + (E[Y1|X] − E[Y0|X])D


which is the switching equation from earlier expressed in terms


of conditional expectation functions
Regression model you estimate is:

Yi = α + β01 x̃i + β02 x̃i2 + · · · + β0p x̃ip


+δDi + β1∗ Di x̃i + β2∗ Di x̃i2 + · · · + βp∗ Di x̃ip + εi

where β1* = β11 − β01, β2* = β12 − β02, and βp* = β1p − β0p
The treatment effect at c0 is δ
Polynomial simulation example

capture drop y x2 x3

gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x + x2 + rnormal(0,1000)

reg y D x x2 x3
predict yhat

scatter y x if D==0, msize(vsmall) || scatter y x ///
    if D==1, msize(vsmall) legend(off) xline(140, ///
    lstyle(foreground)) ylabel(none) || line yhat x ///
    if D==0, color(red) sort || line yhat x if D==1, ///
    sort color(red) xtitle("Test Score (X)") ///
    ytitle("Outcome (Y)")
Polynomial simulation example

Figure: Third degree polynomial. Actual model second degree polynomial.

Notice: no more gap at c0 once we model the function f (x)


Stata simulation

gen x2_c = x2 - 140
gen x3_c = x3 - 140

reg y D x x2
reg y D x_c x2_c
Polynomial simulation example

Notice: no more gap at c0 once we model the function f (x) (e.g.,


D is insignificant once we include polynomials)
Polynomial simulation example

And centering did nothing to the interpretation of the main results


(D), only to the intercept.
Robustness against what?

Are you done now that you have your main results? No
Your main results are only causal insofar as smoothness is a
credible belief, and since smoothness isn’t guaranteed by “the
science” like an RCT, you have to build your case
You must now scrutinize alternative hypotheses that are
consistent with your main results through sensitivity checks,
placebos and alternative approaches



Main Challenges

Classify your concern regarding smoothness violations into two


categories:
Manipulation on the running variable
Endogeneity of the cutoff
Most robustness checks are aimed at building credibility around these
two concerns

Treatment is not as good as randomly assigned around the


cutoff, c0 , when agents are able to manipulate their running
variable scores. This happens when:
1 the assignment rule is known in advance
2 agents are interested in adjusting
3 agents have time to adjust
4 administrative quirks like nonrandom heaping along the
running variable
Examples include re-taking an exam, self-reported income,
certain types of non-random rounding.
Since necessarily treatment assignment is no longer
independent of potential outcomes, it’s likely this implies
smoothness has been violated
Test 1: Manipulation of the running variable

Manipulation of the running variable


Assume a desirable treatment, D, and an assignment rule X ≥ c0 .
If individuals sort into D by choosing X such that X ≥ c0 , then we
say individuals are manipulating the running variable.

Also can be called “sorting on the running variable” – same thing


A badly designed RCT

Suppose a doctor randomly assigns heart patients to statin


and placebo to study the effect of the statin on heart attacks
within 10 years
Patients are placed in two different waiting rooms, A and B,
and plans to give those in A the statin and those in B the
placebo.
The doors are unlocked and movement between the two can
happen
Versions of this happened with HIV RCTs in the 1980s
ironically in which medication from treatment group was given
to the control group, but I’m talking about something a little
different
McCrary Density Test

We would expect waiting room A to become crowded. In the RDD


context, sorting on the running variable implies heaping on the
“good side” of c0
McCrary (2008) suggests a formal test: under the null the
density should be continuous at the cutoff point.
Under the alternative hypothesis, the density should increase
at the kink (where D is viewed as good)
1 Partition the assignment variable into bins and calculate
frequencies (i.e., number of observations) in each bin
2 Treat those frequency counts as dependent variable in a local
linear regression
This is oftentimes visualized with confidence intervals
illustrating the effect of the discontinuity on density - you need
no jump to pass this test
McCrary density test

The McCrary Density Test has become mandatory for every


analysis using RDD.
If you can estimate the conditional expectations, you evidently
have data on the running variable. So in principle you can
always do a density test
You can download the (no longer supported) Stata ado
package, DCdensity, to implement McCrary’s density test
(http://eml.berkeley.edu/~jmccrary/DCdensity/)
You can also install the rdrobust and rddensity packages for Stata
and R; rddensity implements a local polynomial density test in the same spirit
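
A sketch of the density test with the user-written rddensity package (running variable and cutoff follow the earlier simulation; installable via ssc):

* Sketch: local-polynomial density test at the cutoff (McCrary-style)
rddensity x, c(140)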
Caveats about McCrary Density Test

For RDD to be useful, you already need to know something about the
mechanism generating the assignment variable and how susceptible it
could be to manipulation. Note the rationality of economic actors that
this test is built on.
A discontinuity in the density is "suspicious" – it suggests
manipulation of X around the cutoff is probably going on. In principle
one doesn't need continuity.
This is a high-powered test. You need a lot of observations at c0 to
distinguish a discontinuity in the density from noise.

Figure: Panels from McCrary (2008), Figure 2 ("Hypothetical example:
gaming the system with an income-tested job training program"). Panel C
is the density of income when there is no pre-announcement and no
manipulation. Panel D is the density of income with pre-announcement and
manipulation.
Visualizing manipulation

Figure: Figures 2 and 3 from Eric Allen, Patricia Dechow, Devin Pope
and George Wu’s (2013) “Reference-Dependent Preferences: Evidence from
Marathon Runners”. Figure 2 shows the distribution of marathon
finishing times (n = 9,378,546), with the dark bars highlighting the
density in the minute bin just prior to each 30-minute threshold.
Figure 3 plots the McCrary z-statistic from running the test at each
minute threshold from 2:40 to 7:00.
http://faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf

Newborn mortality and medical expenditure

Almond, et al. 2010 attempted to estimate the causal effect of


medical expenditures on health outcomes, which is ordinarily
rife with selection bias due to endogenous physician behavior
(independence is violated)
In the US, newborns whose birthweight falls below 1500 grams
receive heightened medical attention because 1500 grams marks the
“very low birth weight” range, which is quite dangerous for infants
Used RDD with hospital administrative records and found
1-year infant mortality decreased by 1pp just below 1500
grams compared to just above – medical expenditures are
cost-effective
Heaping problem

Figure: Distribution of births by gram from Almond, et al. 2010


Heaping, Running and Jumping

This picture shows “heaping” which is excess mass at certain


points along the running variable
It is unlikely that births actually heap at certain intervals; more
likely someone is rounding
Some scales may be less sophisticated, some practices may be
more common in some types of hospitals than others, and there
could be outright manipulation
Failure to reject

Almond, et al. 2010 used the McCrary density test but found
no evidence of manipulation
Ironically, the McCrary density test may fail to reject in a
heaping scenario
In this scenario, the heaping is associated with high mortality
children who are outliers compared to newborns both to the
left and to the right
“This [heaping at 1500 grams] may be a signal that
poor-quality hospitals have relatively high propensities to
round birth weights but is also consistent with
manipulation of recorded birth weights by doctors, nurses,
or parents to obtain favorable treatment for their children.
Barreca, et al. 2011 show that this nonrandom heaping
leads one to conclude that it is “good” to be strictly less
than any 100-g cutoff between 1,000 and 3,000 grams.”
Donut holes

RDD compares means as we approach c0 from either direction


along X
Estimates should not logically be sensitive to the observations
at the cutoff – if they are, then smoothness may be violated
Through Monte Carlos, Barreca, et al. 2016 suggest an
alternative strategy – drop the units in the vicinity of 1500
grams, and re-estimate the model
They call this a “donut” RDD because you drop the units at the
cutoff (the “donut hole”) and estimate your model on the units
in the neighborhood instead
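A minimal donut sketch, assuming variables named bweight and death1year and an arbitrary 10-gram hole around the 1,500-gram cutoff:

. preserve
. * drop the heaped observations right at the cutoff (the donut hole)
. drop if abs(bweight - 1500) < 10
. * re-estimate on the surrounding neighborhood; treatment here is being below 1,500 grams, so interpret the sign accordingly
. rdrobust death1year bweight, c(1500)
. restore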
Newborn mortality and medical expenditure

Dropping units (e.g., trimming) always changes the parameter


we’re estimating – it’s not the ATE, not the ATT, and not even the
LATE except under strong assumptions
In this case, dropping at the threshold reduced sample size by
2%
But the strength of this practice is that it allows for the
possibility that units at the heap differ markedly, due to
selection bias, from those in the surrounding area
Donut RDD analysis found effect sizes that were
approximately 50% smaller than Almond, et al. 2010
Caution with heaping is a good attitude to have
Endogenous cutoffs

(A) Data generating graph (B) Limiting graph

X U X → c0 U

D Y D Y
Endogenous cutoffs

RCT randomization breaks all ordinary backdoor paths


between D and Y because that’s how “the science” of
randomization works
RDD blocks the backdoor path from D ← X ←? → U → Y ;
it assumes away the backdoor path D ← U → Y
But if cutoffs are endogenous, then that second backdoor path is
present, which means that, absent the treatment, smoothness would’ve
been violated anyway
Smoothness isn’t guaranteed by an RDD unless D ← U → Y
isn’t present – which is why it is the critical identifying
assumption
Endogenous cutoffs

Examples of endogenous cutoffs


Age thresholds used for policy (e.g., a person turns 18 and faces
more severe penalties for crime) are correlated with other
variables that affect the outcome (e.g., graduation, voting
rights, etc.)
Age 65 is correlated with factors that directly affect healthcare
expenditure and mortality such as retirement
But some of these can be weakly defended with balance tests
(observables), or may be directly testable through placebos
assuming you have the data
Evaluating smoothness through balance

Balance tests and placebo tests are related but distinct


We can’t directly test smoothness bc we are missing
counterfactuals
Ask yourself: why should average values of exogenous
covariates jump if potential outcomes are smooth through the
cutoff?
If there are exogenous (non-collider) covariates strongly
associated with the potential outcomes but not caused by the
treatment, then they should be the same on either side of the
cutoff if smoothness holds
In this sense, balance tests are an indirect search for evidence
supporting smoothness
Balance implementation

Don’t make it hard – do what you did to Y , only to Z


Choose other noncolliders associated with potential outcomes,
Z
Create similar graphical plots as you did for Y
Could also conduct the parametric and nonparametric
estimation on Z
You do not want to see a jump around the cutoff, c0
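A minimal sketch, assuming a predetermined covariate mom_age and a running variable bweight with a cutoff at 1,500 grams: rerun the same RD machinery with the covariate as the outcome and hope for an estimate near zero.

. * balance check: the covariate should not jump at the cutoff
. rdrobust mom_age bweight, c(1500)
. rdplot mom_age bweight, c(1500)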
Visualizing Balance

Figure: Figure III from Lee, Moretti and Butler (2004), “Do Voters
Affect or Elect Policies?”, Quarterly Journal of Economics: similarity
of constituents’ characteristics in bare Democrat and Republican
districts. Panels refer to (from top left to bottom right) the
following district characteristics: real income, percentage with
high-school degree, percentage black, percentage eligible to vote.
Circles represent the average characteristic within intervals of 0.01
in Democrat vote share. The continuous line represents the predicted
values from a fourth-order polynomial in vote share fitted separately
for points above and below the 50 percent threshold. The dotted line
represents the 95 percent confidence interval.
Placebos at non-discontinuous points

Placebos in time are common with panels; placebo in running


variables are their equivalent in RDD
Imbens and Lemieux (2010) suggest we look at one side of the
discontinuity (e.g., X < c0 ), take the median value of the
running variable in that section, and pretend it was a
discontinuity, c0′
Then test whether in reality there is a discontinuity at c0′. You
do not want to find anything.
Remember though: smoothness at placebo points is neither
necessary nor sufficient for smoothness in the potential
outcomes at the cutoff
So there are Type I and Type II risks of error with this
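A sketch of that placebo exercise, assuming an outcome y, running variable x, and a true cutoff at 0:

. * use the median of the running variable on the left side as a fake cutoff
. summarize x if x < 0, detail
. local c_placebo = r(p50)
. rdrobust y x if x < 0, c(`c_placebo')
. * you do not want to find a significant jump here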
Pictures, pictures and more pictures

Synthetic control and RDD are visually intense


Eyeball tests are rampant (and deservedly so) in RDD studies
Even if your main results are all parametric, you’ll still want to
present at least some nonparametric style pictures according to
Imbens and Lemieux (2010)
Let’s review some of the graphs you have to include



Outcomes

1 Outcome by running variable, (Xi ):


Construct bins and average the outcome within bins on both
sides of the cutoff
Look at different bin sizes when constructing these graphs
Plot the running variable, Xi , on the horizontal axis and the
average of Yi for each bin on the vertical axis
Consider plotting a relatively flexible regression line on top of
the bin means, but some readers prefer an eyeball test without
the regression line to avoid “priming”
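A minimal sketch of building such a plot by hand, assuming an outcome y, running variable x, a cutoff at 0, and bins of width 0.1:

. * assign each observation to a 0.1-wide bin labeled by its left endpoint
. gen xbin = floor(x/0.1)*0.1
. preserve
. collapse (mean) ybar = y, by(xbin)
. twoway scatter ybar xbin, xline(0)
. restore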
Figure: Outcomes by running variable. From Lee and Lemieux (2010)
based on Lee (2008).
Figure: Outcomes by running variable with smaller bins. From Lee and
Lemieux (2010) based on Lee (2008).
Probability of treatment

2 Probability of treatment by running variable if fuzzy


RDD
In a fuzzy RDD, you also want to see that the treatment
variable jumps at c0
This tells you whether you have a first stage (“bite”)
Let’s look at that again from earlier Hoekstra (2008) and
enrollment at the flagship
Figure: Fraction enrolled at the flagship state university, plotted
as local averages against the admissions running variable centered at
the cutoff. From Hoekstra, Figure 1.
McCrary Density

3 Density of the running variable


One should plot the number of observations in each bin.
This plot allows one to investigate whether there is a discontinuity
or heaping in the distribution of the running variable at the
threshold
Heaping or discontinuities in the density suggest that people
can manipulate their running variable score
This is an indirect test of the identifying assumption that each
individual has imprecise control over the assignment variable;
precise control would likely violate smoothness
Figure: Density of the running variable. From Lee and Lemieux (2010)
based on Lee (2008).
Balance pictures

4 Covariates by a running variable


Construct a similar graph to the outcomes graph but use a
noncollider covariate as the “outcome”
Balance implies smoothness through the cutoff, c0 .
If noncollider covariates jump at the cutoff, one is probably
justified in concluding that the potential outcomes jump there too,
violating smoothness
Figure: Covariates by running variable. From Lee and Lemieux (2010)
based on Lee (2008).
Inference – honesty

Lee and Card (2008) and Lee and Lemieux (2010) recommend
clustering standard errors on the running variable
Kolesár and Rothe (2018) provide extensive theoretical and
simulation-based evidence that this is not good; you’d be
better off with plain heteroskedasticity-robust standard errors
They propose two alternative confidence intervals that achieve
correct coverage in large samples – called “honest” (great
intro! Still studying this procedure)
Unavailable in Stata, but is available in R – RDHonest – at
https://github.com/kolesarm/RDHonest



Inference – randomization inference

Cattaneo, et al. (2015) say to consider that the cutoff is a


randomized experiment
Use randomization inference which is a test of the null of no
individual unit level treatment effect at the cutoff
Parametric vs. nonparametric approaches

Least squares approaches, because they model the
counterfactual using functional forms, are parametric
As a result, they can have poor predictive properties for
counterfactuals above/below the cutoff
Another way of approximating f (Xi ) is to use a nonparametric
kernel, which has its own problems – just not that one
Kernel regressions

The nonparametric kernel method has its problems in this case
because you are trying to estimate regressions at the cutoff point.
This results in a “boundary problem”.
While the “true” effect is AB, with a certain bandwidth a
rectangular kernel would estimate the effect as A′B′
There is therefore systematic bias with the kernel method if
f (X ) is upwards or downwards sloping
Kernel weighted local polynomial regression

The nonparametric one-sided kernel estimation problems are


called “boundary problems” at the cutoff (Hahn, Todd and Van
der Klaauw 2001)
Kernel estimation (such as lowess) may have poor properties
because the point of interest is at a boundary
They proposed to use “local linear nonparametric regressions”
instead
Local linear regression with weights

Local linear nonparametric regression substantially reduces the


bias
Think of it as a weighted regression restricted to a window –
kernel provides the weights to that regression.

(â, b̂) ≡ argmin_{a,b} ∑_{i=1}^{n} (yi − a − b(xi − c0))² K((xi − c0)/h) 1(xi > c0)

where xi is the value of the running variable, c0 is the cutoff, K is a


kernel function and h > 0 is a suitable bandwidth
Animation of a local linear regression

https://twitter.com/page_eco/status/958687180104245248
Estimation

Stata’s lpoly command estimates kernel-weighted local polynomial
regressions.
A rectangular kernel would give the same result as E [Y ] at a
given bin on X . The triangular kernel gives more importance
to observations close to the center.
This method will be sensitive to how large the bandwidth
(window) you choose
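A minimal sketch of fitting the smoothers separately on each side of the cutoff, assuming variables y and x, a cutoff at 0, and an arbitrary bandwidth of 0.1:

. * local linear fit with a triangle kernel, left and right of the cutoff
. lpoly y x if x < 0, degree(1) kernel(triangle) bwidth(0.1) nograph generate(x0 yhat0)
. lpoly y x if x >= 0, degree(1) kernel(triangle) bwidth(0.1) nograph generate(x1 yhat1)
. twoway (line yhat0 x0) (line yhat1 x1), xline(0)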
Optimal bandwidths

A rectangular kernel would give the same result as taking E [Y ]


at a given bin on X whereas the triangular kernel gives more
importance to the observations closer to the center.
While estimating this in a given window of width h around the
cutoff is straightforward, it’s more difficult to choose this
bandwidth (or window), and the method is sensitive to the
choice of bandwidth.
Bandwidths

Several methods for choosing the optimal bandwidth


(window), but it’s always a trade off between bias and variance
In practical applications, you want to check for balance around
that window
Standard error of the treatment effects can be bootstrapped
but there are also other alternatives
You could add other variables to nonparametric methods.
Bandwidths

Imbens and Kalyanaraman (2012), and more recently Calonico,


et al. (2017), have proposed methods for estimating “optimal”
bandwidths which may differ on either side of the cutoff.
Calonico, et al (2017) propose local-polynomial regression
discontinuity estimators with robust confidence intervals
Stata ado package and R package are both called rdrobust
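A minimal rdrobust sketch, assuming variables y and x and a cutoff at 0 (these options are in fact the command’s defaults):

. * local linear RD with a triangular kernel and an MSE-optimal bandwidth,
. * reported with robust bias-corrected confidence intervals
. rdrobust y x, c(0) p(1) kernel(triangular) bwselect(mserd)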
Implementation

The following paper is a seminal paper in public choice both


scientifically and methodologically – the close election RDD
I call the close election RDD a type of sub-RDD in that it’s
widely used in political science and economics to the point
that it’s taken on a life of its own
Let’s take everything we’ve done and apply it by replicating
this paper using programs I’ve provided
Public choice

There are two fundamentally different views of the role of voters in


a representative democracy.
1 Convergence: Voters force candidates to become relatively
moderate depending on their size in the distribution (Downs
1957).
“Competition for votes can force even the most
partisan Republicans and Democrats to moderate
their policy choices. In the extreme case, competition
may be so strong that it leads to ‘full policy
convergence’: opposing parties are forced to adopt
identical policies” – Lee, Moretti, and Butler 2004.

2 Divergence: Voters pick the official and after taking office,


she pursues her most-preferred policy.
Falsification of either hypothesis had been hard

Very difficult to test either one of these since you don’t observe
the counterfactual votes of the loser for the same district/time
Winners in a district are selected based on their policy’s
conforming to unobserved voter preferences, too
Lee, Moretti and Butler (2004) develop the “close election
RDD” which has the aim of determining whether convergence,
while theoretically appealing, has any explanatory power in
Congress
The metaphor of the RCT is useful here: maybe close elections
are being determined by coin flips (e.g., a few votes here, a
few votes there)
Outcome is Congress person’s liberal voting score

Liberal voting score is a report card from the Americans for


Democratic Action (ADA) for the House election results
1946-1995
Authors use the ADA score for all US House Representatives
from 1946 to 1995 as their voting record index
For each Congress, ADA chooses about twenty high-profile
roll-call votes and creates an index varying between 0 and 100 for
each Representative of the House, measuring liberal voting record
Democratic “voteshare” is the running variable

Voteshare from the same races


The running variable is voteshare which is the share of all
votes that went to a Democrat.
They use a close Democratic victory to check whether
convergence or divergence is correct (what’s smoothness here?)
Discontinuity in the running variable occurs at
voteshare= 0.5. When voteshare> 0.5, the Democratic
candidate wins.
I’ll show lmb1.do to lmb10.do (and R) at times just so we
can all see the simple estimation methods ourselves.
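In that spirit, a sketch of the simplest close-election estimate; the variable names (score, lagdemocrat, lagdemvoteshare, id) are assumptions about the lmb data file, so check them against the provided programs:

. * ADA score at t+1 regressed on a Democratic win at t, close elections only
. reg score lagdemocrat if lagdemvoteshare > .48 & lagdemvoteshare < .52, cluster(id)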
Remember these results
TABLE I. Results based on ADA scores — close elections sample
Column (1), ADA t+1 (total effect, γ): estimated gap 21.2 (1.9)
Column (2), ADA t: estimated gap 47.6 (1.3)
Column (3), DEM t+1: estimated gap 0.48 (0.02)
Column (4), “elect” component, π1 (P^D_{t+1} − P^R_{t+1}) = (col 2) × (col 3): 22.84 (2.2)
Column (5), “affect” component, π0 (P^{∗D}_{t+1} − P^{∗R}_{t+1}) = (col 1) − (col 4): −1.64 (2.0)

Standard errors are in parentheses. The unit of observation is a
district-congressional session. The sample includes only observations
where the Democrat vote share at time t is strictly between 48 percent
and 52 percent. The estimated gap is the difference in the average of
the relevant variable for observations for which the Democrat vote
share at time t is strictly between 50 percent and 52 percent and
observations for which it is strictly between 48 percent and 50
percent. Time t and t + 1 refer to congressional sessions. ADA t is
the adjusted ADA voting score. Higher ADA scores correspond to more
liberal roll-call voting records. Sample size is 915.
Figure: Lee, Moretti, and Butler 2004, Table 1.


Nonparametric estimation

Hahn, Todd and Van der Klaauw (2001) emphasized using


local polynomial regressions
Estimate E [Y |X ] in such a way that doesn’t require
committing to a functional form
That model would be something general like

Y = f (X ) + ε
Nonparametric estimation (cont.)

We’ll do this estimation just rolling E [ADA] across the running


variable voteshare visually
Stata has a user-written command to do this called cmogram, and it
has a lot of useful options, though many people prefer to graph it
themselves because that gives more flexibility.
We can recreate Figures I, IIA and IIB using it
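If you would rather not lean on cmogram’s options, rdplot from the rdrobust package produces a similar binned scatter with separate fourth-order polynomial fits on each side; a sketch with hypothetical variable names:

. * binned means of the ADA score against the Democratic vote share, cutoff at 0.5
. rdplot score demvoteshare, c(0.5) p(4)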
Future liberal voting score
Figure: Lee, Moretti, and Butler 2004, Figure I. Total effect of
initial win on future ADA scores: γ ≈ 20. The figure plots ADA scores
after the election at time t + 1 against the Democrat vote share at
time t. Each circle is the average ADA score within 0.01 intervals of
the Democrat vote share. Solid lines are fitted values from
fourth-order polynomial regressions on either side of the
discontinuity; dotted lines are pointwise 95 percent confidence
intervals. ADA scores are a smooth function of vote shares everywhere
except at the threshold that determines party membership, where there
is a large discontinuous jump (annotated in the figure with its
“elect” and “affect” components).
Contemporaneous liberal voting score
Figure: Lee, Moretti, and Butler 2004, Figure IIa. Effect of party
affiliation: π1 ≈ 45. The panel plots ADA scores after the election at
time t against the Democrat vote share at time t.

Incumbency advantage

Figure: Lee, Moretti, and Butler 2004, Figure IIb. Effect of initial
win on winning the next election: (P^D_{t+1} − P^R_{t+1}) ≈ 0.50. The
panel plots the probability of Democrat victory at t + 1 against the
Democrat vote share at time t.
Concluding remarks

Caughey and Sekhon (2011) questioned the finding (not the


design per se) saying that bare winners and bare losers in the
US House elections differed considerably on pretreatment
covariates (imbalance), which got worse in the closest elections
Eggers, et al. (2014) evaluated 40,000 close elections
including the House in other time periods, mayor races, and
other types of US races including nine other countries
They couldn’t find another instance where Caughey and
Sekhon’s critique applied
Assumptions behind close election design therefore probably
holds and is one of the best RD designs we have
Fuzzy RDD, IV and ITT

Fuzzy RDD is an IV estimator, and requires those assumptions


You may be more comfortable with presenting the
intent-to-treat (ITT) parameter, which is just the reduced form
regression of Y on Z
Many papers will not present an IV-style parameter, but rather
a blizzard of ITT parameters, out of a “fear” that the exclusion
restrictions may not hold
But let’s review the IV approach anyway for completeness
(more IV to come!)



Probability of treatment jumps at discontinuity

Probabilistic treatment assignment (i.e. “fuzzy RDD”)


The probability of receiving treatment changes discontinuously at
the cutoff, c0 , but need not go from 0 to 1

limXi →c0 Pr (Di = 1|Xi = c0 ) ≠ limc0 ←Xi Pr (Di = 1|Xi = c0 )

Examples: Incentives to participate in some program may change


discontinuously at the cutoff but are not powerful enough to move
everyone from non participation to participation.
Deterministic (sharp) vs. probabilistic (fuzzy)

In the sharp RDD, Di was determined by Xi ≥ c0


In the fuzzy RDD, the conditional probability of treatment
jumps at c0 .
The relationship between the conditional probability of
treatment and Xi can be written as:

P[Di = 1|Xi ] = g0 (Xi ) + [g1 (Xi ) − g0 (Xi )]Zi

where Zi = 1 if (Xi ≥ c0 ) and 0 otherwise.


Visualization of identification strategy (i.e. smoothness)

E [Y 0 |X ] and E [Y 1 |X ] for D = 0, 1 are the dashed/solid


continuous functions
E [Y |X ] is the solid which jumps at X = 6
Hoekstra flagship school
Figure: Fraction enrolled at the flagship state university, plotted
as local averages against the admissions running variable centered at
the cutoff. From Hoekstra, Figure 1.
Instrumental variables

As said, fuzzy designs are numerically equivalent and


conceptually similar to IV
“Reduced form” Numerator: “jump” in the regression of the
outcome on the running variable, X .
“First stage” Denominator: “jump” in the regression of the
treatment indicator on the running variable X .
Same IV assumptions, caveats about compliers vs. defiers, and
statistical tests that we will discuss in next lecture with
instrumental variables apply here – e.g., check for weak
instruments using F test on instrument in first stage, etc.
Wald estimator

Wald estimator of treatment effect under Fuzzy RDD


Average causal effect of the treatment is the Wald IV parameter

δFuzzy RDD = [limX→c0 E[Y|X = c0] − limc0←X E[Y|X = c0]] / [limX→c0 E[D|X = c0] − limc0←X E[D|X = c0]]
RDD’s Relationship to IV

Center X so it’s equal to zero at c0 and define Z = 1(X ≥ 0)


The coefficient on Z in a regression like

. reg Y Z X X2 X3

is the reduced form discontinuity, and

. reg D Z X X2 X3

is the first stage discontinuity


Ratio of discontinuities is estimate of δFuzzy RDD
Simple way to implement is IV

. ivregress 2sls Y (D=Z) X X2 X3
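To make that recipe concrete, a sketch of building the pieces first (the raw running variable name runvar and the 0.5 cutoff are hypothetical):

. * center the running variable, build the polynomial, and define the cutoff dummy
. gen X = runvar - 0.5
. gen X2 = X^2
. gen X3 = X^3
. gen Z = (X >= 0)
. ivregress 2sls Y (D = Z) X X2 X3, robust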


First stage relationship between X and D

One can use both Zi as well as the interaction terms as


instruments for Di .
If one uses only Zi as the IV, then it is a “just identified”
model, which usually has good finite sample properties.
In the just identified case, the first stage would be:

Di = γ0 + γ1 Xi + γ2 Xi2 + · · · + γp Xip + πZi + ε1i

where π is the causal effect of Z on the conditional probability


of treatment.
The fuzzy RD reduced form is:

Yi = µ + κ1 Xi + κ2 Xi2 + · · · + κp Xip + ρπZi + ε2i


Fuzzy RDD with varying Treatment Effects - Second Stage

As in the sharp RDD case one can allow the smooth function
to be different on both sides of the discontinuity.
The second stage model with interaction terms would be the
same as before:

Yi = α + β01 x̃i + β02 x̃i2 + · · · + β0p x̃ip


+ρDi + β1∗ Di x̃i + β2∗ Di x̃i2 + · · · + βp∗ Di x̃ip + ηi

Where x̃ are now not only normalized with respect to c0 but


are also fitted values obtained from the first stage regression.
Fuzzy RDD with Varying Treatment Effects - First Stages

Again one can use both Zi as well as the interaction terms as


instruments for Di
Only using Z the estimated first stages would be:

Di = γ00 + γ01 X̃i + γ02 X̃i² + · · · + γ0p X̃i^p
     + πZi + γ1∗ X̃i Zi + γ2∗ X̃i² Zi + · · · + γp∗ X̃i^p Zi + ε1i

We would also construct analogous first stages for X̃i Di ,


X̃i2 Di , . . . , X̃ip Di .
Limitations of the LATE

Fuzzy RDD has assumptions of all standard IV framework


(exclusion, independence, nonzero first stage, and
monotonicity)
As with other binary IVs, the fuzzy RDD is estimating LATE:
the local average treatment effect for the group of compliers
In RDD, the compliers are those whose treatment status
changed as we moved the value of xi from just to the left of c0
to just to the right of c0
Means we can use Medicare age cutoff to estimate the effect
of public insurance on mortality (LATE) and still not know the
effect of public insurance on mortality (ATE)
Instrumental variables

If treatment is tied to an unobservable, then conditioning


strategies, even RDD, are invalid
Instrumental variables offers some hope at recovering the
causal effect of D on Y
The best instruments come from deep knowledge of
institutional details (Angrist and Krueger 1991)
Certain types of natural experiments can be the source of such
opportunities and may be useful



When is IV used?

Instrumental variables methods are typically used to address the


following kinds of problems encountered in naive regressions
1 Omitted variable bias
2 Measurement error
3 Simultaneity bias
4 Reverse causality
5 Randomized control trials with noncompliance
Selection on unobservables

D Y

Then D is endogenous due to backdoor path D ← U → Y and


causal effect D → Y is not identified using the backdoor criterion.
Instruments

Z U

D Y

Notice how the path from Z → D ← U → Y is blocked by a


collider.
Philip Wright

Philip Wright was a renaissance man - published in JASA,


QJE, AER, you name it, while on a very intense teaching load.
Also published poetry, and even personally published Carl
Sandburg’s first book of poetry!
Spent a long time at Tufts
He was very concerned about the negative effects of tariffs and
wrote a book about commodity markets
Elasticity of demand is unidentified

James Stock notes that his publications had a theme regarding


identification
He knew, for instance, that he couldn’t simply look at
correlations between price and quantity if he wanted the
elasticity of demand due to simultaneous shifts in supply and
demand
The pairs of quantity and price weren’t demand, or supply -
they were demand and supply equilibrium values and therefore
didn’t reflect the demand or the supply curve, both of which
are counterfactuals
Those points are nothing more than a bunch of numbers – no
more, no less – that have no practical use, scientific or
otherwise
Exhibit 1
The Graphical Demonstration of the Identification Problem in Appendix B (p. 296)

Figure: Wright’s graphical demonstration of the identification
problem: “Price-output data fail to reveal either supply or demand
curve.”


Sewell Wright

Sewell was his son, who did not go into the family business
Rather, he decided to become a genius and invent genetics
Developed path diagrams (which Pearl revived 50 years later
for causal inference)
Father and son engage in letter correspondence as Philip tried
to solve the “identification problem”
Figure: Wright’s letter to Sewell, his son
Figure: Recognize these?
QJE Rejects

QJE misses a chance to make history and rejects his paper


proving an IV estimator
Sticks his proof in Appendix B of 1928 book,
The Tariff on Animal and Vegetable Oils
His work on IV is ignored, and is then rediscovered 15 years
later (e.g., Olav Reiersøl).
James Stock and others have helped correct the record
Sidebar: stylometric analysis

Long standing question was who wrote Appendix B? Answer


according to Stock and Trebbi (2003) using stylometric
methods is that Philip wrote it.
But who invented it? It was collaborative, but Sewell
acknowledged he didn’t know how to handle endogeneity and
simultaneity (that was Philip)
Constant treatment effects

Constant treatment effects (i.e., β is constant across all


individual units)
Constant treatment effects is the traditional econometric
pedagogy when first learning instrumental variables, and
doesn’t need the potential outcomes model or notation to get
the point across
Constant treatment effects is identical to assuming that
ATE=ATT=ATU because constant treatment effects assumes
βi = β−i = β for all units
Heterogenous treatment effects

Heterogeneous treatment effects (i.e., βi varies across


individual units)
Heterogeneous treatment effects means that the
ATE 6= ATT 6= ATU because βi differs across the population
This is equivalent to assuming the coefficient, βi , is a random
variable that varies across the population
Heterogenous treatment effects is based on work by Angrist,
Imbens and Rubin (1996) and Imbens and Angrist (1994)
which introduced the “local average treatment effect” (LATE)
concept
Data requirements

Your data isn’t going to come with a codebook saying


“instrumental variable”. So how do you find it?
Well, sometimes the researcher just knows.
That is, the researcher knows of a variable (Z ) that actually is
randomly assigned and that affects the endogenous variable
but not the outcome (except via the endogenous variable)
Such a variable is called an “instrument”.
Picking a good instrument

The best instruments are ones you think of first, and then you seek
the data second (but often students go in the reverse order, which is
basically guaranteed to produce a crappy instrument)
If you want to use IV, then ask:
What moves around the covariate of interest that
might be plausibly random?

Is there any element in the treatment that could be construed


as random?
If you were to find that random piece, then you have found an
instrument
Once you have identified such a variable, begin to think about
what data sets might have information on an outcome of
interest, the treatment, and the instrument you have put your
finger on.
Does family size reduce labor supply or is it selection?

Angrist and Evans (1998), “Children and their parents’ labor


supply” American Economic Review,
They want to know the effect of family size on labor supply,
but need exogenous changes in family size
So what if I told you if the first two children born were of the
same gender, then you’re less likely to work. What?!
Angrist and Evans cont.

Many parents have a preference for having at least one child of


each gender
Consider a couple whose first two kids were both boys; they
will often have a third, hoping to have a girl
Consider a couple whose first two kids were girls; they will
often have a third, hoping for a boy
Consider a couple with one boy and one girl; they will often
not have a third kid
The gender of your kids is arguably randomly assigned (maybe
not exactly, but close enough)
Good instruments must be a bit strange

On its face, it’s puzzling that the first two kids’ gender
predicts labor market participation
Instrumental variables strategies formalize the strangeness of the
instrument – the puzzled reaction of an intelligent layperson with
no particular knowledge of the phenomena or background in
statistics.
You need more information, in other words, otherwise the
layperson can’t understand what same gender of your children
has to do with working
When a good IV strategy finally makes sense

But then the researchers point out that women whose first two
children are of the same gender are more likely to have
additional children than women whose first two children are of
different genders
The layperson then asks himself, “Hm. I wonder if the labor
market differences are due solely to the differences in the
number of kids the woman has...”
Sunday Candy is a good instrument

Let’s listen to a few lines from “Ultralight Beam” by Kanye


West. Chance the Rapper sings on it and says
“I made Sunday Candy, I’m never going to hell
I met Kanye West, I’m never going to fail.”
- Chance the Rapper

What does making a song have to do with hell? What does


meeting Kanye West have to do with success? Let’s consider
each in order
What are we missing?

“I made Sunday Candy,


I’m never going to hell”,

There must be more to this story, right?


So what if it’s something like this

“I made Sunday Candy


this pastor invited me to church on Sunday,
I’m never going to hell”
Sunday Candy DAG

Sunday Candy U

Church Hell
Kanye West is a bad instrument

Chance long idolized and was inspired by Kanye West – both


Chicago, both very creative hip hop artists
Kanye West is not a good instrument for Chance’s inspiration,
though, because Kanye West can singlehandedly make a
person’s career
Kanye is not strange enough
Kanye West DAG

Kanye West U

Inspiration Success
Foreshadowing the questions you need to be asking

1 Is our instrument highly correlated with the treatment? With


the outcome? Can you test that?
2 Are there random elements within the treatment? Why do you
think that?
3 Is the instrument exogenous? Why do you think that?
4 Could the instrument affect outcomes directly? Why do you
think that?
5 Could the instrument be associated with anything that causes
the outcome even if it doesn’t directly? Why do you think
that?
Our causal model: Returns to schooling again

Y = α + δS + γA + ν

where Y is log earnings, S is years of schooling, A is unobserved


ability, and ν is the error term
Suppose there exists a variable, Zi , that is correlated with Si .
We can estimate δ with this variable, Z :
How can IV be used to obtain consistent estimates?

Cov (Y , Z ) = Cov (α + δS + γA + ν, Z )
= E [(α + δS + γA + ν)Z ] − E [α + δS + γA + ν]E [Z ]
= {αE (Z ) − αE (Z )} + δ{E (SZ ) − E (S)E (Z )}
+γ{E (AZ ) − E (A)E (Z )} + E (νZ ) − E (ν)E (Z )
Cov (Y , Z ) = δCov (S, Z ) + γCov (A, Z ) + Cov (ν, Z )

Divide both sides by Cov (S, Z ): the LHS becomes Cov (Y , Z )/Cov (S, Z )
– the ratio of the reduced form to the first stage – and the RHS
becomes δ plus two other scaled terms.
Consistency

What conditions must hold for a valid IV design?


Cov (S, Z ) 6= 0 – “first stage” exists. S and Z are correlated
Cov (A, Z ) = Cov (ν, Z ) = 0 – “exclusion restriction”. This
means Z is orthogonal both to unobserved ability, A, and to the
structural disturbance term, ν
Assuming the first stage exists and that the exclusion
restriction holds, then we can estimate δ with δIV :

Cov (Y , Z )
δIV =
Cov (S, Z )
= δ
IV is Consistent if IV Assumptions are Satisfied

The IV estimator is consistent if the IV assumptions are


satisfied. Substitute true model for Y :
δIV = Cov(α + δS + γA + ν, Z) / Cov(S, Z)
    = δ Cov(S, Z)/Cov(S, Z) + γ Cov(A, Z)/Cov(S, Z) + Cov(ν, Z)/Cov(S, Z)
    = δ + Cov(η, Z)/Cov(S, Z), where η = γA + ν
Identifying assumptions and consistency

Taking the probability limit which is an asymptotic operation


to show consistency:

plim δ̂IV = plim [ δ + Cov(η, Z)/Cov(S, Z) ]
         = δ

because Cov ([A], Z ) = 0 and Cov ([ν], Z ) = 0 due to the


exclusion restriction, and Cov (S, Z ) ≠ 0 (due to the first
stage)
IV Assumptions

But, if Z is not independent of η (either correlated with A or


ν), and if the correlation between S and Z is “weak”, then the
second term blows up.
We will explore the problems created by weak instruments in
just a moment.
First, let’s look at a DAG summarizing all this information
One of these DAGs is not like the other

Z A

S Y
(a)

Z A

S Y
(b)

Notice - the top DAG, a, satisfies both exclusion and relevance


(i.e., non-zero first stage), but the bottom DAG, b, satisfies
relevance but not exclusion.
Two-stage least squares

The two-stage least squares estimator was developed by Theil


(1953) and Basmann (1957) independently
Note, while IV is a research design, 2SLS is a specific
estimator.
Others include LIML, the Wald estimator, jackknife IV, two
sample IV, and more



Two Sample IV

In a pinch, you can even get by with two different data sets
1 Dataset 1 needs information on the outcome and the
instrument
2 Dataset 2 needs information on the treatment and the
instrument.
This is known as “Two sample IV” because there are two
samples involved, rather than the traditional one sample.
Once we define what IV is measuring carefully, you will see
why this works.
Two-stage least squares concepts

Causal model. Sometimes called the structural model:

Yi = α + δSi + ηi

First-stage regression. Gets the name because of two-stage


least squares:
Si = γ + ρZi + ζi
Second-stage regression. Notice the fitted values, Ŝ:

Yi = β + δŜi + νi
Reduced form

Some people like a simpler approach because they don’t want


to defend IV’s assumptions
Reduced form is a regression of Y onto the instrument:

Yi = ψ + πZi + εi

This would be like regressing hell onto Sunday Candy, as


opposed to regressing hell onto church with Sunday Candy
instrumenting for church
Two-stage least squares

Suppose you have a sample of data on Y , X , and Z . For each


observation i we assume the data are generated according to

Yi = α + δSi + ηi
Si = γ + ρZi + ζi

where Cov (Z , ηi ) = 0 and ρ ≠ 0.


Two-stage least squares

Plug in covariance and write out the following:

δ̂2sls = Cov(Z, Y) / Cov(Z, S)
      = [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)(Yi − Ȳ) ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)(Si − S̄) ]
      = [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Yi ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Si ]
Two-stage least squares

Substitute the causal model definition of Y to get:


δ̂2sls = [ (1/n) ∑_{i=1}^{n} (Zi − Z̄){α + δSi + ηi} ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Si ]
      = δ + [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)ηi ] / [ (1/n) ∑_{i=1}^{n} (Zi − Z̄)Si ]
      = δ + “small if n is large”

Where did the first term go? Why did the second term become δ?
Two-stage least squares

Calculate the ratio of “reduced form” (π) to “first stage”


coefficient (ρ):
δ̂2sls = Cov(Z, Y)/Cov(Z, S) = [Cov(Z, Y)/Var(Z)] / [Cov(Z, S)/Var(Z)] = π̂/ρ̂

Rewrite ρb as

ρ̂ = Cov(Z, S)/Var(Z)
ρ̂ Var(Z) = Cov(Z, S)
Two-stage least squares

Then rewrite δb2sls

δ̂2sls = Cov(Z, Y)/Cov(Z, S) = ρ̂ Cov(Z, Y)/[ρ̂ Cov(Z, S)] = ρ̂ Cov(Z, Y)/[ρ̂² Var(Z)]
      = Cov(ρ̂Z, Y)/Var(ρ̂Z)
Two-stage least squares

Recall
Si = γ + ρZi + ζi
Then
Ŝ = γ̂ + ρ̂Z
Then

δ̂2sls = Cov(ρ̂Z, Y)/Var(ρ̂Z) = Cov(Ŝ, Y)/Var(Ŝ)

Proof.
We will show that ρ̂Cov(Y, Z) = Cov(Ŝ, Y). I will leave it to you
to show that Var(ρ̂Z) = Var(Ŝ).

Cov(Ŝ, Y) = E[ŜY] − E[Ŝ]E[Y]
          = E(Y[γ̂ + ρ̂Z]) − E(Y)E(γ̂ + ρ̂Z)
          = γ̂E(Y) + ρ̂E(YZ) − γ̂E(Y) − ρ̂E(Y)E(Z)
          = ρ̂[E(YZ) − E(Y)E(Z)]
Cov(Ŝ, Y) = ρ̂Cov(Y, Z)
Intuition of 2SLS

Two stage least squares is nice because in addition to being an


estimator, there’s also great intuition contained in it which you
can use as a device for thinking about IV more generally.
The intuition is that the 2SLS estimator replaces S with the fitted
values of S (i.e., Ŝ) from the first stage regression of S onto Z
and all other covariates.
By using the fitted values of the endogenous regressor from
the first stage regression, our regression now uses only the
exogenous variation in the regressor due to the instrumental
variable itself
Intuition of IV in 2SLS

. . . but think about it – that variation was there before, but


was just a subset of all the variation in the regressor
Go back to what we said in the beginning - we need the
endogenous variable to have pieces that are random, and IV
finds them.
Instrumental variables therefore reduces the variation in the
data, but that variation which is left is exogenous
“With a long enough [instrument], you can [estimate any
causal effect]” - Scott Cunningham paraphrasing Archimedes
Estimation with software

One manual way is just to estimate the reduced form and first
stage coefficients and take the ratio of the respective
coefficients on Z
But while it is always a good idea to run these two regressions,
don’t compute your IV estimate this way
Estimation with software

It is often the case that a pattern of missing data will differ


between Y and S
In such a case, the usual procedure of “casewise deletion” is to
keep the subsample with non-missing data on Y , S, and Z .
But the reduced form and first stage regressions would be
estimated off of different sub-samples if you used the two step
method before
The standard errors from the second stage regression are also
wrong
Estimation with software

Estimate this in Stata using -ivregress 2sls-.


Estimate this in R using -ivreg()-, which is in the AER package
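A minimal sketch; the variable names echo the table shown later in this deck (lwage, educ, the controls, and a college-in-the-county instrument here called nearc4), but they are assumptions about any particular data file:

. ivregress 2sls lwage exper black south married smsa (educ = nearc4), robust
. * first-stage diagnostics: F statistic and the minimum-eigenvalue (Cragg-Donald) statistic
. estat firststage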
Weak instruments

A weak instrument is one that is not strongly correlated with


the endogenous variable in the first stage
This can happen if the two variables are independent or the
sample is small
If you have a weak instrument, then the bias of 2SLS is
centered on the bias of OLS and the cure ends up being worse
than the disease
We knew this was a problem, but it was brought into sharp
focus with Angrist and Krueger (1991) and some papers that
followed



Angrist and Krueger (1991)

In practice, it is often difficult to find convincing instruments –


usually because potential instruments don’t satisfy the
exclusion restriction
But in an early paper in the causal inference movement,
Angrist and Krueger (1991) wrote a very interesting and
influential instrumental variables study
They were interested in schooling’s effect on earnings and
instrumented for it with which quarter of the year you were
born
Remember Chance quote - what the heck would birth quarter
have to do with earnings such that it was an excludable
instrument?
Compulsory schooling

In the US, you could drop out of school once you turned 16
“School districts typically require a student to have turned age
six by January 1 of the year in which he or she enters school”
(Angrist and Krueger 1991, p. 980)
Children have different ages when they start school, though,
and this creates different lengths of schooling at the time they
turn 16 (potential drop out age):
Timeline sketch (two rows): born in December → turn 6 → start school
→ accumulate schooling S → turn 16; born in January → turn 6 → start
school → turn 16, with less schooling accumulated by then.

If you’re born in the fourth quarter, you hit 16 with more schooling
than those born in the first quarter
Visuals

You need good data visualization for IV partly because of the


scrutiny around the design
The two pieces you should be ready to build pictures for are
the first stage and the reduced form
Angrist and Krueger (1991) provide simple, classic and
compelling pictures of both
First Stage

Figure: Men born earlier in the year have lower schooling. This
indicates that there is a first stage. Notice all the 3s and 4s at
the top, but then notice how it attenuates over time . . .
Reduced Form

Figure: Do differences in schooling due to different quarter of birth
translate into different earnings?
Two Stage Least Squares model

The causal model is


Yi = δSi + ε
The first stage regression is:

Si = X π10 + π11 Zi + η1i

The reduced form regression is:

Yi = X π20 + π21 Zi + η2i

The covariate-adjusted IV estimator is the sample analog of
the ratio π21 /π11
Two Stage Least Squares

Angrist and Krueger instrument for schooling using three


quarter of birth dummies: dummies for the 2nd, 3rd and 4th quarters of birth
Their estimated first-stage regression is:

Si = X π10 + Z1i π11 + Z2i π12 + Z3i π13 + η1

The second stage is the same as before, but the fitted values
are from the new first stage
First stage regression results
Figure: First-stage regressions in Angrist & Krueger (1991). Quarter
of birth is a strong predictor of total years of education.

First stage regression results: Placebos

IV Results

Figure: IV estimates, birth cohorts 20-29, 1980 Census.
Sidebar: Wald estimator

Recall that 2SLS uses the predicted values from a first stage
regression – but we showed that the 2SLS method was
equivalent to Cov(Y, Z)/Cov(X, Z)
The Wald estimator simply calculates the return to education
as the ratio of the difference in earnings by quarter of birth to
the difference in years of education by quarter of birth – it’s a
version of the above
Formally, IVWald = [E(Y|Z = 1) − E(Y|Z = 0)] / [E(D|Z = 1) − E(D|Z = 0)]
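A sketch of computing the Wald estimator by hand with a binary instrument, assuming variables named y, d and z:

. * numerator: difference in mean outcomes by instrument status
. summarize y if z == 1
. scalar y1 = r(mean)
. summarize y if z == 0
. scalar y0 = r(mean)
. * denominator: difference in mean treatment by instrument status
. summarize d if z == 1
. scalar d1 = r(mean)
. summarize d if z == 0
. scalar d0 = r(mean)
. display (y1 - y0) / (d1 - d0)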
Mechanism

In addition to log weekly wage, they examined the impact of


compulsory schooling on log annual salary and weeks worked
The main impact of compulsory schooling is on the log weekly
wage – not on weeks worked
More instruments
Problem enters with many quarter of birth interactions

They want to increase the precision of their 2SLS estimates, so


they load up their first stage with more instruments
Specifications with 30 (quarter of birth × year) dummy
variables and 150 (quarter of birth × state) instruments
What’s the intuition here? The effect of quarter of birth may
vary by birth year or by state
It reduced the standard errors, but that comes at a cost of
potentially having a weak instruments problem
More instruments
More instruments
Weak Instruments

For a long time, applied empiricists were not attentive to the


small sample bias of IV
But in the early 1990s, a number of papers highlighted that IV
can be severely biased – in particular, when instruments have
only a weak correlation with the endogenous variable of
interest and when many instruments are used to instrument for
one endogenous variable (i.e., there are many overidentifying
restrictions).
In the worst case, if the instruments are so weak that there is
no first stage, then the 2SLS sampling distribution is centered
on the probability limit of OLS
Causal model

Let’s consider a model with a single endogenous regressor and


a simple constant treatment effect (i.e., “just identified”)
The causal model of interest is:

Y = βX + ν
Matrices and instruments

We’ll sadly need some matrix notation, but I’ll try to make it
painless.
The matrix of instrumental variables is Z with the first stage
equation:
X = Zπ + η
And let PZ be the projection matrix that produces fitted values from
the population regression of X on Z:

PZ = Z(Z′Z)⁻¹Z′
Weak instruments and bias towards OLS

If νi and ηi are correlated, estimating the first equation by


OLS would lead to biased results, wherein the OLS bias is:
E[β̂OLS − β] = Cov(ν, X)/Var(X)

If νi and ηi are correlated, the OLS bias is therefore σνη/ση²
Deriving the bias of 2SLS

β̂2sls = (X′PZ X)⁻¹ X′PZ Y
      = β + (X′PZ X)⁻¹ X′PZ ν

substitution of Y = βX + ν
2SLS bias

β̂2sls − β = (X′PZ X)⁻¹ X′PZ ν
          = a X′PZ ν, where a = (X′PZ X)⁻¹
          = a [π′Z′ + η′] PZ ν
          = a π′Z′ν + a η′PZ ν
          = (X′PZ X)⁻¹ π′Z′ν + (X′PZ X)⁻¹ η′PZ ν

The bias of 2SLS comes from the non-zero expectation of terms on


the right-hand-side even though Z and ν are not correlated.
Taking expectations

Angrist and Pischke (ch. 4) note that taking expectations of


that prior expression is hard because the expectation operator
won’t pass through (X′PZ X)⁻¹.
However, the expectation of the ratios in the second term can
be closely approximated

β̂2sls − β = (X′PZ X)⁻¹ π′Z′ν + (X′PZ X)⁻¹ η′PZ ν

E[β̂2sls − β] ≈ E[X′PZ X]⁻¹ E[π′Z′ν] + E[X′PZ X]⁻¹ E[η′PZ ν]
Approximate bias of 2SLS

We know E[π′Z′ν] = 0 and E[π′Z′η] = 0. So let E[η′PZ ν] = b,
because this is hard for me otherwise

E[β̂2sls − β] ≈ E[X′PZ X]⁻¹ b
            ≈ E[X′Z(Z′Z)⁻¹Z′X]⁻¹ b
            ≈ E[(Zπ + η)′PZ (Zπ + η)]⁻¹ b
            ≈ [E(π′Z′Zπ) + E(η′PZ η)]⁻¹ b
            ≈ [E(π′Z′Zπ) + E(η′PZ η)]⁻¹ E[η′PZ ν]

That last term is what creates the bias so long as η and ν are
correlated – and it is precisely because they are that you picked up
2SLS to begin with
First stage F

With some algebra and manipulation, Angrist and Pischke show that
the bias of 2SLS is approximately

E[β̂2sls − β] ≈ (σνη/ση²) [E(π′Z′Zπ)/(Q ση²) + 1]⁻¹

where Q is the number of instruments and the first term inside the
brackets is the population F-statistic for the joint significance of
all the instruments in the first stage
Weak instruments and bias towards OLS

Substituting F for that big term, we can derive the


approximate bias of 2SLS as:
E[β̂2SLS − β] ≈ (σνη/ση²) · 1/(F + 1)

Consider the intuition all that work bought us now: if the first
stage is weak (i.e., F → 0), then the bias of 2SLS approaches
σνη/ση²
Weak instruments and bias towards OLS

This is the same as the OLS bias: for π = 0 in the second equation
on the earlier slide (i.e., there is no first stage relationship),
σx² = ση², and therefore the OLS bias σνη/σx² becomes σνη/ση².
But if the first stage is very strong (F → ∞) then the 2SLS
bias is approaching 0.
Cool thing is – you can test this with an F test on the joint
significance of Z in the first stage
It’s absolutely critical therefore that you choose instruments
that are strongly correlated with the endogenous regressor,
otherwise the cure is worse than the disease
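A small simulation sketch (not from the lecture) of that intuition: the true effect is 1, the structural and first-stage errors are correlated, and 2SLS with a nearly useless instrument lands near the biased OLS estimate while a strong instrument recovers the truth.

. clear
. set seed 1
. set obs 1000
. * correlated errors: nu (structural) and eta (first stage)
. matrix C = (1, .8 \ .8, 1)
. drawnorm nu eta, corr(C)
. gen z = rnormal()
. gen x_weak = .02*z + eta
. gen x_strong = z + eta
. gen y_weak = x_weak + nu
. gen y_strong = x_strong + nu
. reg y_weak x_weak
. ivregress 2sls y_weak (x_weak = z)
. ivregress 2sls y_strong (x_strong = z)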
Weak Instruments - Adding More Instruments

Adding more weak instruments will increase the bias of 2SLS


By adding further instruments without predictive power, the
first stage F -statistic goes toward zero and the bias increases
We will see this more closely when we cover judge fixed effects
If the model is “just identified” – meaning the same number of
instrumental variables as there are endogenous covariates –
weak instrument bias is less of a problem
Weak instrument problem

After Angrist and Krueger study, there were new papers


highlighting issues related to weak instruments and finite
sample bias
Key papers are Nelson and Startz (1990), Buse (1992), Bekker
(1994) and especially Bound, Jaeger and Baker (1995)
Bound, Jaeger and Baker (1995) highlighted this problem for
the Angrist and Krueger study.
Bound, Jaeger and Baker (1995)

Remember, AK present findings from expanding their instruments


to include many interactions
1 Quarter of birth dummies → 3 instruments
2 Quarter of birth dummies + (quarter of birth) × (year of birth)
+ (quarter of birth) × (state of birth) → 180 instruments
So if any of these are weak, then the approximate bias of 2SLS gets
worse
Adding instruments in Angrist and Krueger

Figure: Table from Bound, Jaeger, and Baker (1995) – 3 and 30 IVs.
Adding more weak instruments reduced the first-stage F-statistic and
increased the bias of 2SLS. Notice the coefficient also moved closer
to OLS.

Adding instruments in Angrist and Krueger

Figure: Table from Bound, Jaeger, and Baker (1995) – 180 IVs. More
instruments increase precision, but they drive down F, so we know the
problem has gotten worse.
Guidance on working around weak instruments

Use a just identified model with your strongest IV


Use a limited information maximum likelihood estimator
(LIML) as it is approximately median unbiased for over
identified constant effects models and provides the same
asymptotic distribution as 2SLS (under constant effects) with
a finite-sample bias reduction.
Find stronger instruments – easier said than done
Look at the reduced form

1 Look at the reduced form


The reduced form is estimated with OLS and is therefore
unbiased
If you can’t see the causal relationship of interest in the
reduced form, it is probably not there



Report the first stage

2 Report the first stage (preferably in the same table as your


main results)
Does it make sense?
Do the coefficients have the right magnitude and sign?
Please make beautiful IV tables – you’ll be celebrated across
the land if you do
Report F statistic and OLS

3 Report the F -statistic on the excluded instrument(s).


Stock, Wright and Yogo (2002) suggest that F -statistics > 10
indicate that you do not have a weak instrument problem –
this is not a proof, but more like a rule of thumb
If you have more than one endogenous regressor for which you
want to instrument, reporting the first stage F -statistic is not
enough (because 1 instrument could affect both endogenous
variables and the other could have no effect – the model would
be under identified). In that case, you want to report the
Cragg-Donald (minimum eigenvalue) statistic.
4 Report OLS – you said it was biased, but we want to still see it
Table: OLS and 2SLS regressions of Log Earnings on Schooling

Dependent variable Log wage


OLS 2SLS
educ 0.071*** 0.124**
(0.003) (0.050)
exper 0.034*** 0.056***
(0.002) (0.020)
black -0.166*** -0.116**
(0.018) (0.051)
south -0.132*** -0.113***
(0.015) (0.023)
married -0.036*** -0.032***
(0.003) (0.005)
smsa 0.176*** 0.148***
(0.015) (0.031)

First Stage Instrument


College in the county 0.327***
Robust standard error 0.082
F statistic for IV in first stage 15.767
N 3,003 3,003
Mean Dependent Variable 6.262 6.262
Std. Dev. Dependent Variable 0.444 0.444
Standard errors in parenthesis. * p<0.10, ** p<0.05, *** p<0.01
Practical Tips for IV Papers

5 If you have many IVs, pick your best instrument and report the
just identified model (weak instrument problem is much less
problematic)
6 Check over identified 2SLS models with LIML
Make beautiful pictures of first stage and reduced form

7 This cannot be overstated: you must present your main results


in beautiful pictures
Show pictures of the first stage. Convince the reader
something is there. The eyeball is underrated
You can’t show a second stage with raw data, so instead show
pictures of the reduced form.
Visualizing the instrument: supply shocks on meth prices
Visualizing the first stage
Visualizing the reduced form
Heterogenous Treatment Effects

Up to this point, we only considered models where the causal


effect was the same for all individuals
Constant treatment effects (where Yi1 − Yi0 = δ for all i units)
Let’s now try to understand what instrumental variables
estimation is measuring if treatment effects are heterogenous
Yi1 − Yi0 = δi which varies across the population



Why do we care about heterogeneity?

Heterogeneity, it turns out, makes life interesting and


challenging
There are two issues here:
1 We care about internal validity: Does the design successfully
uncover causal effects for the population that we are studying?
2 We care about external validity: Does the study’s results
inform us about different populations?
What parameter did we even estimate using IV when there
were heterogenous treatment effects?
Potential outcome notation

“Potential treatment status” (D j ) versus “observed” treatment


status (D)
Di1 = i’s treatment status when Zi = 1
Di0 = i’s treatment status when Zi = 0
We’ll represent outcomes as a function of both treatment status
and instrument status. In other words, Yi (Di = 0, Zi = 1) is
represented as Yi (0, 1)
Switching equation

Move from potential treatment status to observed treatment status

Di = Di0 + (Di1 − Di0 )Zi


= π0i + π1i Zi + ζi

π0i = E [Di0 ]
π1i = (Di1 − Di0 ) is the heterogenous causal effect of the IV
on Di .
E [π1i ] = The average causal effect of Zi on Di
Identifying assumptions under heterogenous treatment
effects

1 Stable Unit Treatment Value Assumption (SUTVA)


2 Random Assignment
3 Exclusion Restriction
4 Nonzero First Stage
5 Monotonicity
Stable Unit Treatment Value Assumption (SUTVA)

Stable Unit Treatment Value Assumption (SUTVA)


If Zi = Zi′, then Di (Z) = Di (Z′)
If Zi = Zi′ and Di = Di′, then Yi (D, Z) = Yi (D′, Z′)

Potential outcomes for each person i are unrelated to the


treatment status of other individuals.
Example: Your instrument is the death of a CEO for hirings.
But if a CEO dies, then perhaps other companies lose a CEO
as they are hired in the vacant spots.
In which case, the instrument is related to treatment status of
other individuals.
Independence assumption

Independence assumption (e.g., “as good as random assignment”)


{Yi(Di1, 1), Yi(Di0, 0), Di1, Di0} ⊥⊥ Zi

The IV is independent of the vector of potential outcomes and


potential treatment assignments (i.e. “as good as randomly
assigned”)
First two children of the same gender are assigned to families
randomly. That is, same-sex sibling pairs show up among parents
with a higher likelihood of working just as often as among those
less likely to work.
It’s all about the randomness of the instrument, in other
words, not the instrument’s effect.
Independence

Independence means that the first stage measures the causal effect
of Zi on Di :

E [Di |Zi = 1] − E [Di |Zi = 0] = E [Di1 |Zi = 1] − E [Di0 |Zi = 0]


= E [Di1 − Di0 ]
Independence

The independence assumption is sufficient for a causal


interpretation of the reduced form:

E [Yi |Zi = 1] − E [Yi |Zi = 0] = E [Yi (Di1 , 1)|Zi = 1]


−E [Yi (Di0 , 0)|Zi = 0]
= E [Yi (Di1 , 1)] − E [Yi (Di0 , 0)]
Exclusion Restriction

Exclusion Restriction
Y(D,Z) = Y(D,Z’) for all Z, Z’, and for all D

Any effect of Z on Y must be via the effect of Z on D. In


other words, Yi (Di , Zi ) is a function of D only. Or formally:

Yi (Di , 0) = Yi (Di , 1) for D = 0, 1

Sometimes called the “only through” assumption because


you’re assuming the effect of Z on Y is “only through” its
effect on D.
Recall the DAG and the missing arrows from Z to u and from
Z to Y .
Exclusion restriction

Use the exclusion restriction to define potential outcomes


indexed solely against treatment status:

Yi1 = Yi (1, 1) = Yi (1, 0)


Yi0 = Yi (0, 1) = Yi (0, 0)

Rewrite the switching equation:

Yi = Yi (0, Zi ) + [Yi (1, Zi ) − Yi (0, Zi )]Di


Yi = Yi0 + [Yi1 − Yi0 ]Di

Random coefficients notation for this is:

Yi = α0 + δi Di
with α0 = E [Yi0 ] and δi = Yi1 − Yi0
Spotting violations of exclusion is a sport

Watch the gears turn:


We are interested in causal effect of military service on
earnings, and so use draft number are instrument for military
service.
Draft number is generated by a random number generator.
Therefore independence is met as draft number is independent
of potential outcomes and potential treatment status.
But, people with higher draft numbers evade draft by investing
in schooling. Earnings change for reasons other than military
service. Exclusion is violated
In other words, random lottery numbers (independence) do not
imply that the exclusion restriction is satisfied
Strong first stage

Nonzero Average Causal Effect of Z on D


E[Di1 − Di0] ≠ 0

Di1 is the treatment status when the instrument is turned on, and
Di0 when it is turned off. We need treatment status to change when
the instrument changes.
Z has to have some statistically significant effect on the
average probability of treatment
First two children of the same gender makes you more likely to
have a third.
Finally – a testable assumption. We have data on Z and D
Monotonicity

Monotonicity
Either π1i ≥ 0 for all i or π1i ≤ 0 for all i = 1, . . . , N

Recall that π1i is the reduced form causal effect of the


instrumental variable on an individual i’s treatment status.
Monotonicity requires that the instrumental variable (weakly)
operate in the same direction on all individual units.
In other words, while the instrument may have no effect on
some people, all those who are affected are affected in the
same direction (i.e., positively or negatively, but not both).
Monotonicity cont.

We instrument for schooling with quarter of birth. Under
monotonicity, one of two scenarios holds for everyone:
1 they get more schooling or the same schooling if born in the
fourth quarter
2 they get less schooling or the same schooling if born in the
fourth quarter
Monotonicity says either of these can be true, but they cannot
both be true in your data – yet it’s not hard to imagine
violations where two people respond differently
Without monotonicity, IV estimators are not guaranteed to
estimate a weighted average of the underlying causal effects of
the affected group, Yi1 − Yi0 .
Force yourself to think of monotonicity violations

In the quarter of birth example for schooling, this assumption


may not be satisfied (see Barua and Lang 2009).
Being born in the 4th quarter (which typically increases
schooling) may have reduced schooling for some because their
school enrollment was held back by their parents
Local average treatment effect

If all 1-5 assumptions are satisfied, then IV estimates the local


average treatment effect (LATE) of D on Y :

$$\delta_{IV,LATE} = \frac{\text{Effect of } Z \text{ on } Y}{\text{Effect of } Z \text{ on } D}$$
Estimand

Instrumental variables (IV) estimand:

$$\delta_{IV,LATE} = \frac{E[Y_i(D_i^1,1) - Y_i(D_i^0,0)]}{E[D_i^1 - D_i^0]} = E[(Y_i^1 - Y_i^0) \mid D_i^1 - D_i^0 = 1]$$
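Because the estimand is the reduced form divided by the first stage, you can
compute it by hand. A minimal Stata sketch with a hypothetical binary instrument z
and binary treatment d:

    * Reduced form: effect of the instrument on the outcome
    reg y z, robust
    scalar rf = _b[z]

    * First stage: effect of the instrument on treatment
    reg d z, robust
    scalar fs = _b[z]

    * Wald / LATE estimate = reduced form over first stage
    display "Wald estimate of the LATE: " rf/fs

    * One-step equivalent in the just-identified case
    ivregress 2sls y (d = z), robust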
Local Average Treatment Effect

The LATE parameters is the average causal effect of D on Y


for those whose treatment status was changed by the
instrument, Z
For example, IV estimates the average effect of military service
on earnings for the subpopulation who enrolled in military
service because of the draft but would not have served
otherwise.
LATE does not tell us what the causal effect of military service
was for patriots (volunteers) or those who were exempted from
military service for medical reasons
LATE cont.

We have reviewed the properties of IV with heterogenous


treatment effects using a very simple dummy endogenous
variable, dummy IV, and no additional controls example.
The intuition of LATE generalizes to most cases where we
have continuous endogenous variables and instruments, and
additional control variables.
LATE and subpopulations

The instrument partitions any population into 4 distinct groups:


1 Compliers: The subpopulation with Di1 = 1 and Di0 = 0. Their
treatment status is affected by the instrument in the “correct
direction”.
2 Always takers: The subpopulation with Di1 = Di0 = 1. They
always take the treatment independently of Z .
3 Never takers: The subpopulation with Di1 = Di0 = 0. They
never take the treatment independently of Z .
4 Defiers: The subpopulation with Di1 = 0 and Di0 = 1. Their
treatment status is affected by the instrument in the “wrong
direction”.
Subpopulations of soldiers

Examples of subpopulations:
1 Compliers: I only enrolled in the military because I was drafted
otherwise I wouldn’t have served
2 Always takers: My family have always served, so I serve
regardless of whether I am drafted
3 Never takers: I’m a conscientious objector so under no
circumstances will I serve, even if drafted
4 Defiers: When I was drafted, I dodged. But had I not been
drafted, I would have served. I can’t make up my mind.
Never-takers: Di1 − Di0 = 0 and Yi(0,1) − Yi(0,0) = 0.
  By the exclusion restriction, the causal effect of Z on Y is zero.
Compliers: Di1 − Di0 = 1 and Yi(1,1) − Yi(0,0) = Yi(1) − Yi(0).
  This is the average treatment effect among compliers.
Defiers: Di1 − Di0 = −1 and Yi(0,1) − Yi(1,0) = Yi(0) − Yi(1).
  By monotonicity, no one is in this group.
Always-takers: Di1 − Di0 = 0 and Yi(1,1) − Yi(1,0) = 0.
  By the exclusion restriction, the causal effect of Z on Y is zero.
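Under monotonicity (no defiers), the sizes of these groups are identified from data
on Z and D alone: always-takers are the treated share when Z = 0, never-takers the
untreated share when Z = 1, and compliers are the remainder, i.e. the first stage.
A sketch with hypothetical binary variables z and d:

    * Always-takers: take treatment even with the instrument switched off
    sum d if z == 0
    scalar p_always = r(mean)

    * Never-takers: refuse treatment even with the instrument switched on
    sum d if z == 1
    scalar p_never = 1 - r(mean)

    * Compliers: the remainder, which equals the first stage E[D|Z=1] - E[D|Z=0]
    scalar p_comply = 1 - p_always - p_never
    display "always-takers = " p_always "  never-takers = " p_never "  compliers = " p_comply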
Monotonicity Ensures that there are no defiers

Why is it important to not have defiers?


If there were defiers, effects on compliers could be (partly)
canceled out by opposite effects on defiers
One could then observe a reduced form which is close to zero
even though treatment effects are positive for everyone (but
the compliers are pushed in one direction by the instrument
and the defiers in the other direction)
Monotonicity assumes there are no defiers
What Does IV (Not) Estimate?

As said, with all 5 assumptions satisfied, IV estimates the


average treatment effect for compliers, or LATE
Without further assumptions (e.g., constant causal effects),
LATE is not informative about effects on never-takers or
always-takers because the instrument does not affect their
treatment status
So what? Well, it matters because in most applications, we
would be mostly interested in estimating the average
treatment effect on the whole population:

ATE = E [Yi1 − Yi0 ]

But that’s not possible usually with IV


Sensitivity to assumptions: exclusion restriction

Someone at risk of draft (low lottery number) changes


education plans to retain draft deferments and avoid
conscription.
Increased bias to IV estimand through two channels:
Average direct effect of Z on Y for compliers
Average direct effect of Z on Y for noncompliers multiplied by
odds of being a non-complier
Severity depends on:
Odds of noncompliance (smaller → less bias)
“Strength” of instrument (stronger → less bias)
Effect of the alternative channel on Y
Sensitivity to assumptions: Monotonicity violations

Someone who would have volunteered for Army when not at


risk of draft (high lottery number) chooses to avoid military
service when at risk of being drafted (low lottery number)
Bias to IV estimand (multiplication of 2 terms):
Proportion defiers relative to compliers
Difference in average causal effects of D on Y for compliers
and defiers
Severity depends on:
Proportion of defiers (small → less bias)
“Strength” of instrument (stronger → less bias)
Variation in effect of D on Y (less → less bias)
Summarizing

The potential outcomes framework gives a more subtle


interpretation of what IV is measuring
In the constant coefficients world, IV measures δ which is “the”
causal effect of Di on Yi , and assumed to be the same for all i
units
In the random coefficients world, IV measures instead an
average of heterogeneous causal effects across a particular
population – E [δi ] for some group of i units
IV, therefore, measures the local average treatment effect or
LATE parameter, which is the average of causal effects across
the subpopulation of compliers, or those units whose covariate
of interest, Di , is influenced by the instrument.
Summarizing

Under heterogenous treatment effects, Angrist and Evans


(1996) identify the causal effect of the gender composition of
the first two kids on labor supply
This is not the same thing as identifying the causal effect of
children on labor supply; the former is a LATE whereas the
latter might be better described as an ATE
Ex post this is probably obvious, but like many obvious things,
it wasn’t obvious until it was worked out. This was a real
breakthrough (see Angrist, Imbens and Rubin 1996; Imbens
and Angrist 1994)
IV in Randomized Trials

In many randomized trials, participation is nonetheless


voluntary among those randomly assigned to treatment
Consequently, noncompliance is not uncommon, and comparing
outcomes by treatment received rather than by assignment creates
selection bias
IV designs may even be helpful when evaluating a randomized
trial, even though treatment was randomly assigned
The solution is to instrument for treatment with whether you
“won the lottery” and estimate LATE

Cunningham Causal Inference


Lottery designs

The instrument is your randomized lottery


Examples might be randomized lottery for attending charter
schools to study effect of charter schools on educational
outcomes, or a randomized voucher to encourage the
collection of health information
Recall Thornton (2008) instrumented for getting HIV results
to estimate causal effect of learning one was HIV+ on condom
purchases
We’ll discuss two papers from 2012 and 2014 evaluating a
lottery-based expansion of Medicaid health insurance in
Oregon on numerous health and financial outcomes
Overarching question

What are the effects of expanding access to public health


insurance for low income adults?
Magnitudes, and even the signs, associated with that question
were uncertain
Limited existing evidence
Institute of Medicine review of evidence was suggestive, but a
lot of uncertainty
Observational studies are confounded by selection into health
insurance
Quasi-experimental work often focuses on elderly and children
Only one randomized experiment in a developed country: the
RAND health insurance experiment
1970s experiment on a general population
Randomized cost-sharing, not coverage itself
The Oregon Health Insurance Experiment

Setting: Oregon Health Plan Standard


Oregon’s Medicaid expansion program for poor adults
Eligibility
Poor (<100% federal poverty line) adults 19-64
Not eligible for other programs
Uninsured > 6 months
Legal residents
Comprehensive coverage (no dental or vision)
Minimum cost-sharing
Similar to other states in payments, management
Closed to new enrollment in 2004
The Oregon Medicaid Experiment

Oregon held a lottery


Waiver to operate lottery
5-week sign-up period, heavy advertising (January to February
2008)
Low barriers to sign up, no eligibility pre-screening
Limited information on list
Randomly drew 30,000 out of 85,000 on list (March-October
2008)
Those selected given chance to apply
Treatment at household level
Had to return application within 45 days
60% applied; 50% of those deemed eligible → 10,000 enrollees
Oregon Health Insurance Experiment

Evaluate effects of Medicaid using lottery as randomized


controlled trial (RCT)
Intent-to-treat: Reduced form comparison of outcomes
between treatment group (lottery selected individuals) and
controls (not selected)
LATE: IV using lottery as instrument for insurance coverage
First stage: about a 25 percentage point increase in insurance
coverage
Archived analysis plan
Massive data collection effort – primary and secondary
Similar to ACA expansion but limits to generalizability
Partial equilibrium vs. General equilibrium
Mandate and external validity
Oregon vs. other states
Short vs. Long-run
Examine Broad Range of Outcomes

Costs: Health care utilization


Insurance increases resources (income) and lowers price,
increasing utilization
But improved efficiency (and improved health), decreasing
utilization (“offset”)
Additional uncertainty when comparing Medicaid to no
insurance
Benefits I: Financial risk exposure
Insurance supposed to smooth consumption
But for very low income, is most care de jure or de facto free?
Benefits II: Health
Expected to improve (via increased quantity / quality of care)
But could discourage health investments (“ex ante moral
hazard”)
Data

Pre-randomization demographic information


From lottery sign-up
State administrative records on Medicaid enrollment
Primary measure of first stage (i.e., insurance coverage)
Outcomes
Administrative data (∼16 months post-notification): Hospital
discharge data, mortality, credit reports
Mail surveys (∼15 months): some questions ask 6-month
look-back; some ask current
In-person survey and measurements (∼25 months): Detailed
questionnaires, blood samples, blood pressure, body mass index
Study Population

Empirical Framework

They present reduced form estimates of the causal effect of


lottery selection

Yihj = β0 + β1 LOTTERYh + Xih β2 + Vih β3 + εihj

Validity of experimental design: randomization; balance on


treatment and control. This is what readers expect
Empirical framework

They also present IV results because they want to isolate the


causal effect of insurance coverage

$$INSURANCE_{ihj} = \delta_0 + \delta_1 LOTTERY_{ih} + X_{ih}\delta_2 + V_{ih}\delta_3 + \mu_{ihj}$$

$$y_{ihj} = \pi_0 + \pi_1 \widehat{INSURANCE}_{ih} + X_{ih}\pi_2 + V_{ih}\pi_3 + v_{ihj}$$

Effect of lottery on coverage: about 25 percentage points


We have independence guaranteed; now we need exclusion: the
primary pathway of the lottery must be via being on Medicaid
Could affect participation in other programs, but actually small
“Warm glow” of winning – especially early
Analysis plan, multiple inference adjustment
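A minimal Stata sketch of these two steps, with hypothetical variable names
(outcome y, lottery selection lottery, Medicaid coverage insurance, household
identifier hhid, and lottery-list controls x1 x2); the authors' archived analysis
plan and code are the definitive versions:

    * Reduced form / intent-to-treat effect of lottery selection
    reg y lottery x1 x2, vce(cluster hhid)

    * First stage: should show the roughly 25 percentage point jump in coverage
    reg insurance lottery x1 x2, vce(cluster hhid)

    * LATE: instrument insurance coverage with lottery selection
    ivregress 2sls y x1 x2 (insurance = lottery), vce(cluster hhid)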
Effect of lottery on coverage (first stage)
Effects of Lottery on Coverage (1st Stage)

                                             Full sample       Credit subsample    Survey respondents
                                            Control  Est. FS   Control  Est. FS    Control  Est. FS
Ever on Medicaid                             0.141    0.256     0.135    0.255      0.135    0.290
                                                     (0.004)            (0.004)             (0.007)
Ever on OHP Standard                         0.027    0.264     0.028    0.264      0.026    0.302
                                                     (0.003)            (0.004)             (0.005)
# of Months on Medicaid                      1.408    3.355     1.352    3.366      1.509    3.943
                                                     (0.045)            (0.055)             (0.09)
On Medicaid, end of study period             0.106    0.148     0.101    0.151      0.105    0.189
                                                     (0.003)            (0.004)             (0.006)
Currently have any insurance (self report)                                          0.325    0.179
                                                                                             (0.008)
Currently have private ins. (self report)                                           0.128   -0.008
                                                                                             (0.005)
Currently on Medicaid (self report)                                                 0.117    0.197
                                                                                             (0.006)
Currently on Medicaid                                                               0.093    0.177
                                                                                             (0.006)
Amy Finkelstein, et al. (2012). “The Oregon Health
Insurance Experiment: Evidence from the First Year”,
Quarterly Journal of Economics, vol. 127, issue 3, August.
Effects of Medicaid

Use primary and secondary data to gauge 1-year effects


Mail surveys: 70,000 surveys at baseline, 12 months
Administrative data
Medicaid enrollment records
Statewide Hospital discharge data, 2007-2010
Credit report data, 2007-2010
Mortality data, 2007-2010
Mail survey data

Fielding protocol
∼70,000 people, surveyed at baseline and 12 months later
Basic protocol: three-stage mail survey protocol,
English/Spanish
Intensive protocol on a 30% subsample included additional
tracking, mailings, phone attempts (done to adjust for
non-response bias)
Response rate
Effective response rate = 50%
Non-response bias always possible, but response rate and
pre-randomization measures in administrative data were
balanced between treatment and control
Administrative data

Medicaid records
Pre-randomization demographics from list
Enrollment records to assess “first stage” (how many of the
selected got insurance coverage)
Hospital discharge data
Probabilistically matched to list, de-identified at Oregon
Health Plan
Includes dates and source of admissions, diagnoses,
procedures, length of stay, hospital identifier
Includes years before and after randomization
Other data
Mortality data from Oregon death records
Credit report data, probabilistically matched, de-identified
Sample

89,824 unique individuals on the waiting list


Sample exclusions (based on pre-randomization data only)
Ineligible for OHP Standard (out of state address, age, etc.)
Individuals with institutional addresses on list
Final sample: 79,922 individuals in 66,385 households
29,834 treated individuals (surveyed 29,589)
40,088 control individuals (surveyed 28,816)
Sample characteristics
Outcomes

Access and use of care


Is access to care improved? Do the insured use more care? Is
there a shift in the types of care being used?
Mail surveys and hospital discharge data
Financial strain
How much does insurance protect against financial strain?
What are the out-of-pocket implications?
Mail surveys and credit reports
Health
What are the short-term impacts on self-reported physical and
mental health?
Mail surveys and vital statistics (mortality)
Results: Access & Use of Care

Gaining insurance resulted in better access to care and higher
satisfaction with care (conditional on actually getting care).

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Have a usual place of care          49.9%     +9.9%      +33.9%     .0001
Have a personal doctor              49.0%     +8.1%      +28.0%     .0001
Got all needed health care          68.4%     +6.9%      +23.9%     .0001
Got all needed prescriptions        76.5%     +5.6%      +19.5%     .0001
Satisfied with quality of care      70.8%     +4.3%      +14.2%     .001

SOURCE: Survey data
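Notice how the ITT and LATE columns line up with the Wald logic: dividing the
reduced-form (ITT) effect by the survey-respondent first stage of roughly 0.29
recovers the IV column. For the first row, for example,

$$\delta_{LATE} \approx \frac{\text{ITT}}{\text{First stage}} = \frac{0.099}{0.290} \approx 0.34,$$

which matches the +33.9 percentage point IV estimate.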
Results: Access & Use of Care

Gaining insurance resulted in an increased probability of hospital
admissions, primarily driven by non-emergency department (non-ED)
admissions.

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Any hospital admission               6.7%     +.50%      +2.1%      .004
--Admits through ED                  4.8%     +.2%       +.7%       .265
--Admits NOT through ED              2.9%     +.4%       +1.6%      .002

SOURCE: Hospital Discharge Data

Overall, this represents a 30% higher probability of admission,
although admissions are still rare events.
Total Use By Condition

Summary: Access and use of care

Overall, utilization and costs went up relative to controls


30% increase in probability of an inpatient admission
35% increase in probability of an outpatient visit
15% increase in probability of taking prescription medications
Total $777 increase in average spending (a 25% increase)
With this increased spending, those who gained insurance were

35% more likely to get all needed care


25% more likely to get all needed medications
Far more likely to follow preventive care guidelines, such as
mammograms (60%) and PAP tests (45%)
Results: Financial Strain

Gaining insurance resulted in a reduced probability of having medical
collections in credit reports, and in lower amounts owed.

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Had a bankruptcy                     1.4%     +0.2%      +0.9%      .358
Had a collection                    50.0%     -1.2%      -4.8%      .013
--Medical collections               28.1%     -1.6%      -6.4%      .0001
--Non-medical collections           39.2%     -0.5%      -1.8%      .455
$ owed medical collections          $1,999    -$99       -$390      .025

SOURCE: Credit report data
Summary: Financial Strain

Overall, reductions in collections on credit reports were evident

25% decrease in probability of a medical collection


Those with a collection owed significantly less
Household financial strain related to medical costs was
mitigated
Substantial reduction across all financial strain measures
Captures “informal channels” people use to make it work
Implications for both patients and providers
Only 2% of bills sent to collections are ever paid
Results: Self-Reported Health

Self-reported measures showed significant improvements one year
after randomization.

                                   CONTROL   RF Model   IV Model   P-Value
                                              (ITT)      (LATE)
Health good, v good, excellent      54.8%     +3.9%      +13.3%     .0001
Health stable or improving          71.4%     +3.3%      +11.3%     .0001
Depression screen NEGATIVE          67.1%     +2.3%      +7.8%      .003
CDC Healthy Days (physical)         21.86     +.381      +1.31      .018
CDC Healthy Days (mental)           18.73     +.603      +2.08      .003

SOURCE: Survey data
Summary: Self-reported health

Overall, big improvements in self-reported physical and mental


health
25% increase in probability of good, very good or excellent
health
10% decrease in probability of screening for depression
Physical health measures open to several interpretations
Improvements consistent with findings of increased utilization,
better access, and improved quality
BUT in their baseline surveys, results appeared shortly after
coverage (∼2/3rds magnitude of full result)
May suggest increase in perception of well-being rather than
physical health
Biomarker data can shed light on this issue
Discussion

At 1 year, found increases in utilization, reductions in financial


strain, and improvements in self-reported health
Medicaid expansion had benefits and costs – didn’t “pay for
itself”
Confirmed biases inherent in observational studies – would
have estimated bigger increases in use and smaller
improvements in outcomes
Policy-makers may have different views on value of different
aspects of improved well-being
“I have an incredible amount of fear because I don’t know if
the cancer has spread or not.”
“A lot of times I wanted to rob a bank so I could pay for the
medicine I was just so scared . . . People with cancer either
have a good chance or no chance. In my case it’s hard to
recover from lung cancer but it’s possible. Insurance took so
long to kick in that I didn’t think I would get it. Now there is
a big bright light shining on me.” (Anecdotes)
Important to have broad evidence on multifaceted effects of
Medicaid expansions
Baicker, Katherine, et al. (2014). “The Oregon Experiment
– Effects of Medicaid on Clinical Outcomes”, The New
England Journal of Medicine.
In-person data collection

Questionnaire and health examination including


Survey questions
Anthropometric and blood pressure measurement
Dried blood spot collection
Catalog of all medications
Fielded between September 2009 and December 2010
Average response ∼25 months after lottery began
Limited to Portland area: 20,745 person sample
12,229 interviews for effective response rate of 73%
Analytic approach

Intent to treat effect of lottery selection


Comparing all selected with all not selected
Random treatment assignment
No differential selection for outcome measurement
Local average treatment effect on Medicaid coverage
Using lottery selection as an instrument for coverage
∼24 percentage point increase in Medicaid enrollment
No change in private insurance (no crowd-out)
No effect of lottery except via Medicaid coverage
Statistical inference is the same for both
Results

1 Health care use


2 Financial strain
3 Clinical health outcomes
Health care use results

Increases in use in various settings


Increases in probability and number of outpatient visits
Increases in probability and number of prescription drugs
No discernible change in hospital or ED use (imprecise)
Increases in preventive care across range of services
Increases in perceived access and quality
Implied 35% increase in spending for insured
Results

1 Health care use


2 Financial strain
3 Clinical health outcomes
Financial Hardship Results

Reduction in strain, out-of-pocket (OOP), money owed


Substantial reduction across measures
Elimination of catastrophic OOP health spending
Implications for distribution of burden/benefits
Some borne by patients, some by providers
Non-financial burden of medical expenses and debt
Results

1 Health care use


2 Financial strain
3 Clinical health outcomes
Focusing on specific conditions

Measured:
Blood pressure
Cholesterol levels
Glycated hemoglobin
Depression
Reasons for selecting these:
Reasonably prevalent conditions
Clinically effective medications exist
Markers of longer term risk of cardiovascular disease
Can be measured by trained interviewers and lab tests
A limited window into health status
Results on specific conditions

Large reductions in depression


Increases in diagnoses and medication
In-person estimate of −9 percentage points in being depressed
Glycated hemoglobin
Increases in diagnosis and medication
No significant effect on HbA1c; wide confidence intervals
Blood pressure and cholesterol
No significant effects on diagnosis or medication
No significant effects on outcomes
Framingham risk score
No significant effect (in general or in sub-populations)
Summary

One to two years after expanded access to Medicaid:


Increases in health care use and associated costs
Increases in compliance with recommended preventive care
Improvements in quality and access
Reductions in financial strain
Improvements in self-reported health
Improvements in depression
No significant change in specific physical measures
Sense of the relative magnitude of the effects
Use and access, financial benefits, general health, depression
Physical measures of specific chronic conditions
Extrapolation to Obamacare (ACA) Expansion

Context quite relevant for health care reform:


States can choose to cover a similar population in planned
2014 Medicaid expansions (up to 138% of federal poverty line)
But important caveats to bear in mind
Oregon and Portland vs. US generally
Voluntary enrollment vs. mandate
Partial vs. general equilibrium effects
Short-run (1-2 years) vs. medium or long run
We will revisit this again later in the difference-in-differences
section when discussing Miller, et al. (2019)
Updating Priors based on Study’s Findings

“Medicaid is worthless or worse than no insurance”


Studies found increases in utilization and perceived access and
quality
Reductions in financial strain, improvement in self-reported
health
Improvement in depression
Can reject large declines in several physical measures
“Health insurance expansion saves money”
In short run, studies showed increases in utilization and cost
and no change in ED use
Increases in preventive care, improvements in self-reported
health, improvements in depression
Conclusion

Effects of expanding Medicaid likely to be manifold


Hard to establish with observational data and often misleading
Expanding Medicaid generates both costs and benefits
Increased spending
Measurably improves some aspects of health but not others
Important caveats about generalizability
Weighing them depends on policy priorities
Further research on alternative policies needed
Many steps in pathway between insurance and outcome
Role for innovation in insurance coverage
Complements to health care (e.g., social determinants)
Judge fixed effects designs

Imagine the following:


1 A person moves through a pipeline and hits a critical point
where treatment occurs as a result of some decision-maker
2 There are many different decision-makers and you’re assigned
randomly to one of them
3 Each decision-maker differs in terms of their leniency in
assigning the treatment
Very popular in criminal justice bc of how often judges are
randomly assigned to defendants (Kling 2006; Mueller-Smith
2015; Dobbie, et al. 2018) or even children to foster care case
workers (Doyle 2007; Doyle 2008)

Cunningham Causal Inference


Juvenile incarceration

Aizer and Doyle (2015) were interested in the causal effect of


juvenile imprisonment on future crime and human capital
accumulation
Extremely important policy question given the US has the
world’s highest incarceration rate and prison population of any
country in the world by a significant margin (500 prisoners per
100,000, over 2 million adults imprisoned, 4.8 million under
supervision)
High rates of incarceration extend to juveniles: in 2010, the
stock of juvenile detainees stood at 70,792, a rate of 2.3 per
1,000 aged 10-19.
Including supervision, US has a juvenile corrections rate 5x
higher than the next highest country, South Africa
Confounding

[DAG: D → Y, with unobserved factors affecting both D and Y]

We are interested in the causal effect of juvenile incarceration


(D) on life outcomes, like adult crime and high school
completion
But youth choose to commit crimes, and that choice may be
due to unobserved criminogenic factors like poverty or
underlying criminal propensities which are themselves causing
those future outcomes
Leniency as an instrument

[IV DAG: Z → D → Y, with unobserved e affecting both D and Y]

Aizer and Doyle (2015) propose an instrument - the propensity


to convict by the judge the youth is randomly assigned
If judge assignment is random, and the various assumptions
hold, then the IV strategy identifies the local average
treatment effect of juvenile incarceration on life outcomes
The Main Idea

“Plausibly exogenous” variation in juvenile detention stemming


from the random assignment of cases to judges who vary in
their sentencing
Consider two juveniles randomly assigned to two different
judges with different incarceration tendencies (Scott and Bob)
Random assignment ensures that differences in incarceration
between Scott and Bob are due to the judge, not themselves,
because remember, they’re identical
Data

Administrative records on 35,000 juveniles over 10 years who
came before a juvenile court in Chicago (Juvenile Court of Cook
County Delinquency Database)
Data were linked to public school data for Chicago (Chicago
Public Schools) and adult incarceration data for Illinois (Illinois
Dept. of Corrections Adult Admissions and Exits)
They wanted to know the effect of juvenile incarceration on
high school completion (2nd data needed) and adult crime
(3rd data needed) using randomized judge assignment (1st
data needed)
They need personal identifying information in each data set to
make this link (i.e., name, DOB, address)
Preview of findings

Juvenile incarceration decreased high school graduation by 13


percentage points (vs. 39pp in OLS)
Increased adult incarceration by 23 percentage points (vs.
41pp in OLS)
Marginal cases are high risk of adult incarceration and low risk
of high school completion as a result of juvenile custody
Unlikely to ever return to school after incarcerated, but when
they do return, they are more likely to be classified as special
ed students, and more likely to be classified for special ed
services due to behavioral/emotional disorders (as opposed to
cognitive disability)
“Plausibly” exogenous

Very common in these studies for the assignment to some


decision-maker to be arbitrary but not clearly random (i.e., not
random no. generator)
In this case, juveniles charged with a crime are assigned to a
calendar corresponding to their neighborhood and calendars
have 1-2 judges who preside over them
1/5 of hearings are presided over by judges who cover the
calendar when the main judge can’t, known as swing judges
Judge assignment is a function of the sequence with which
cases happen to enter into the system and judge availability
that is set in advance
No scope for manipulating which judge you see first; conversations
with court administrators confirm it’s random
Structural equation

Yi = β0 + β1 JIi + β2 Xi + εi

where Xi is controls and εi is an error term. In this, juvenile


incarceration is likely correlated with the error term.

This is the “long” causal model. But note, from the prior DAG, we
cannot control for e because it is unobserved. But it is confounding
the estimation of juvenile incarceration’s effect on outcomes.
Incarceration Propensity as an Instrument

The instrument is the propensity to incarcerate of the randomly
assigned judge
“Leave-one-out mean”

$$Z_{j(i)} = \frac{1}{n_{j(i)} - 1} \sum_{k \neq i} \widetilde{JI}_k$$

The term $n_{j(i)}$ is the total number of cases seen by the judge
$j(i)$ assigned to juvenile $i$, and $\widetilde{JI}_k$ equals 1 if
juvenile $k$ was incarcerated during their first case
Thus the instrument is the judge’s incarceration rate among first
cases, computed from all of their other cases
It’s basically a judge fixed effect given the likelihood two
judges have precisely the same propensity is small
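A minimal Stata sketch of building the leave-one-out propensity, assuming one row
per juvenile's first case, a judge identifier judge_id, and an incarceration dummy
ji (all hypothetical names):

    * Each judge's total incarcerations and caseload among first cases
    egen ji_total = total(ji), by(judge_id)
    egen n_cases  = count(ji), by(judge_id)

    * Leave-one-out mean: the judge's incarceration rate excluding case i itself
    gen z_loo = (ji_total - ji) / (n_cases - 1)

    * z_loo then serves as the instrument for juvenile incarceration, e.g.
    * ivregress 2sls y x1 x2 (ji = z_loo), vce(cluster judge_id)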
Information about the instrument

There are 62 judges in the data, and the average number of


initial cases per judge is 607
Substantial variation in the data - raw measure ranges from
4% to 21%
Residualized measure based on controls still has substantial
variation from 6% to 18%
Variation comes from two sources: variation among the regular
(nonswing) judges (80% of cases) and variation from the
swing judges (20% of cases)
Distribution of IV
Balance test
First stage
High school completion
Adult crime
Crime type
High school transfers
Developing emotional problems
Concluding remarks

Sad, but important, paper - the marginal kid shouldn’t have


been incarcerated
More generally, leniency designs are very powerful and very
common if you know how to look for them
Bottleneck, influential decision-makers, discretion - these are
the three elements of the design
Comments on judge fixed effects

Leave-one-out average propensity of the decision-maker, or


some residualized instrument, is very common
More often you’ll see jackknife IV (JIVE), which constructs each
observation’s fitted instrument leaving that observation out, in
order to reduce finite-sample bias
The biggest threats aren’t exclusion probably (though
sometimes), but monotonicity
Might judges be harsh in some situations (violent crimes) but
lenient in others (female defendants, first-time offenders)?
Tests for violations

New paper by Frandsen, Lefgren and Leslie (2019) proposes a


test
They show that the identifying assumptions imply a conditional
expectation of the outcome of interest given the judge
assignment is a continuous function of the judge propensity
They propose a two-part test that generalizes the
Sargan-Hansen overidentification test and assesses whether the
pattern of treatment effects across judge propensities is
consistent with the identifying assumptions
Software available on Emily Leslie’s website
Multi-dimensional instrument

Peter Hull, in a cautionary note, points out that while collapsing
the judge fixed effects into a single leave-out propensity is
numerically equivalent, you are still effectively instrumenting
with a series of dummies
Therefore it’s very important to keep in mind the lessons we
learned from weak instruments – the more weak instruments
you have when a parameter is overidentified, the larger the bias
It’s ongoing at the moment to think about ways to improve
instrument selection, but not settled
I encourage you to read Peter’s note on his website and begin
thinking about this yourself
Discussion questions

When working on a judge fixed effects project, write down an


IV DAG
Whereas monotonicity cannot be visualized to my knowledge
on a DAG, exclusion can – so what does an exclusion violation
mean in this context?
Use logic and conversations with those administering the
program to answer the following – what does monotonicity
mean in this context and how might it be violated?
Empirical exercise

Let’s estimate the effect of cash bail on defendant outcomes


using 2SLS and JIVE
Excellent paper by Megan Stevenson
-bail.do- and -bail.r- in dropbox and github
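Before opening the do-files, here is a hedged sketch of the basic pipeline with
hypothetical names (outcome y, cash-bail dummy bail, leave-one-out judge propensity
z_loo built as in the previous section, judge identifier judge_id); the actual
-bail.do- and -bail.r- may be organized differently:

    * 2SLS using the leave-one-out judge propensity as the instrument
    ivregress 2sls y x1 x2 (bail = z_loo), vce(cluster judge_id)

    * JIVE is not built into Stata; one route is a user-written command
    * (try -search jive-), or construct leave-out fitted values by hand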
Twoway fixed effects

When working with panel data, the so-called “twoway fixed


effects” (TWFE) estimator is the workhorse estimator
It’s easy to run, a version of OLS, and many people are just
interested in mean effects anyway
It’s the most common model for estimating treatment effects
in a difference-in-differences, and so for all these reasons, we
need to spend some time understanding what it is

Cunningham Causal Inference


Panel Data

Panel data: we observe the same units (individuals, firms,


countries, schools, etc.) over several time periods
Often our outcome variable depends on unobserved factors
which are also correlated with our explanatory variable of
interest
If these omitted variables are constant over time, we can use
panel data estimators to consistently estimate the effect of our
explanatory variable
What I will cover

I will cover pooled OLS and twoway fixed effects


But I won’t be covering random effects, Arellano and Bond
and any number of important panel estimators because the
purpose here is to present the modal regression model used in
difference-in-differences
[Panel DAG with outcomes Yi1, Yi2, Yi3, treatments Di1, Di2, Di3,
covariates Xi, and unobserved heterogeneity ci]

Sorry - drawing the DAG for a simple panel model is somewhat


messy!
When to use this

Traditionally, this was used for estimating constant treatment


effects with unobserved time-invariant heterogeneity – recall
the ci was constant across all time periods
It’s a linear model, so you’ll be estimating conditional mean
treatment effects – if you want the median, you can’t use this
Once you enter into a world with dynamic treatment effects
and differential timing, this loses all value
Problems that fixed effects cannot solve

Reverse causality: Becker predicted police reduce crime, but


when you regress crime onto police, it’s usually positive
β̂FE is inconsistent unless strict exogeneity conditional on ci
holds:
E[εit | xi1, xi2, . . . , xiT, ci] = 0; t = 1, 2, . . . , T
implies εit uncorrelated with past, current and future
regressors
Time-varying unobserved heterogeneity
It’s the time-varying unobservables you have to worry about in
fixed effects
Can include time-varying controls, but as always, don’t
condition on a collider
Formal panel notation

Let y and x ≡ (x1 , x2 , . . . , xk ) be observable random variables


and c be an unobservable random variable
We are interested in the partial effects of variable xj in the
population regression function

E [y |x1 , x2 , . . . , xk , c]
Formal panel notation cont.

We observe a sample of i = 1, 2, . . . , N cross-sectional units


for t = 1, 2, . . . , T time periods (a balanced panel)
For each unit i, we denote the observable variables for all time
periods as {(yit , xit ) : t = 1, 2, . . . , T }
xit ≡ (xit1 , xit2 , . . . , xitk ) is a 1 × K vector
Typically assume that cross-sectional units are i.i.d. draws
from the population: {yi, xi, ci}, i = 1, . . . , N, are i.i.d.
(cross-sectional independence)
yi ≡ (yi1, yi2, . . . , yiT)′ and xi ≡ (xi1, xi2, . . . , xiT)
Consider asymptotic properties with T fixed and N → ∞
Formal panel notation

Single unit:

$$y_i = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{it} \\ \vdots \\ y_{iT} \end{pmatrix}_{T \times 1} \qquad
X_i = \begin{pmatrix} X_{i,1,1} & X_{i,1,2} & \cdots & X_{i,1,j} & \cdots & X_{i,1,K} \\ \vdots & \vdots & & \vdots & & \vdots \\ X_{i,t,1} & X_{i,t,2} & \cdots & X_{i,t,j} & \cdots & X_{i,t,K} \\ \vdots & \vdots & & \vdots & & \vdots \\ X_{i,T,1} & X_{i,T,2} & \cdots & X_{i,T,j} & \cdots & X_{i,T,K} \end{pmatrix}_{T \times K}$$

Panel with all units:

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_N \end{pmatrix}_{NT \times 1} \qquad
X = \begin{pmatrix} X_1 \\ \vdots \\ X_i \\ \vdots \\ X_N \end{pmatrix}_{NT \times K}$$
Unobserved heterogeneity

For a randomly drawn cross-sectional unit i, the model is given


by
yit = xit β + ci + εit , t = 1, 2, . . . , T

yit : log wages i in year t


xit : 1 × K vector of variable events for person i in year t, such
as education, marriage, etc. plus an intercept
β : K × 1 vector of marginal effects of events
ci : sum of all time-invariant inputs known to person i (but
unobserved by the researcher), e.g., ability, beauty, grit, etc.,
often called unobserved heterogeneity or the fixed effect
εit : time-varying unobserved factors, such as a recession,
unknown to the person at the time the decisions on the events
xit are made, sometimes called the idiosyncratic error
Pooled OLS

When we ignore the panel structure and regress yit on xit we


get
yit = xit β + vit ; t = 1, 2, . . . , T
with composite error vit ≡ ci + εit
What happens when we regress yit on xit if x is correlated with
ci ?
Then x ends up correlated with v , the composite error term.
Somehow we need to eliminate this bias, but how?
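One way to see the problem is a tiny made-up simulation in which the unobserved
effect is built to be correlated with the regressor (all names and parameter values
below are purely illustrative): pooled OLS overstates the true coefficient of 1,
while the fixed effects estimator introduced next recovers it.

    clear
    set seed 12345
    set obs 500
    gen id = _n
    gen c  = rnormal()              // unobserved heterogeneity
    expand 5                        // five time periods per unit
    bysort id: gen t = _n
    gen x = 0.5*c + rnormal()       // regressor correlated with c
    gen y = 1*x + c + rnormal()     // true coefficient on x is 1

    reg y x                         // pooled OLS: biased upward
    xtset id t
    xtreg y x, fe                   // fixed effects: close to 1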

Cunningham Causal Inference


Pooled OLS

Main assumption to obtain consistent estimates for β is:


E [vit |xi1 , xi2 , . . . , xiT ] = E [vit |xit ] = 0 for t = 1, 2, . . . , T
xit are strictly exogenous: the composite error vit in each time
period is uncorrelated with the past, current and future
regressors
But: education xit likely depends on grit and ability ci and so
we have omitted variable bias and β̂ is not consistent
No correlation between xit and vit implies no correlation
between unobserved effect ci and xit for all t
Violations are common: whenever we omit a time-constant
variable that is correlated with the regressors (heterogeneity
bias)
Additional problem: vit are serially correlated for same i since
ci is present in each t and thus pooled OLS standard errors are
invalid
Pooled OLS

Always ask: is there a time-constant unobserved variable (ci )


that is correlated with the regressors?
If yes, then pooled OLS is problematic
This is how we motivate a fixed effects model: because we
believe unobserved heterogeneity is the main driving force
making the treatment variable endogenous
Fixed effect regression

Our unobserved effects model is:

yit = xit β + ci + εit ; t = 1, 2, . . . , T

If we have data on multiple time periods, we can think of ci as


fixed effects to be estimated
OLS estimation with fixed effects yields

$$(\hat{\beta}, \hat{c}_1, \ldots, \hat{c}_N) = \underset{b, m_1, \ldots, m_N}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - x_{it}b - m_i)^2$$

this amounts to including N individual dummies in regression


of yit on xit
Derivation: fixed effects regression

$$(\hat{\beta}, \hat{c}_1, \ldots, \hat{c}_N) = \underset{b, m_1, \ldots, m_N}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - x_{it}b - m_i)^2$$

The first-order conditions (FOC) for this minimization problem are:

$$\sum_{i=1}^{N} \sum_{t=1}^{T} x_{it}'(y_{it} - x_{it}\hat{\beta} - \hat{c}_i) = 0$$

and

$$\sum_{t=1}^{T} (y_{it} - x_{it}\hat{\beta} - \hat{c}_i) = 0$$

for i = 1, . . . , N.
Derivation: fixed effects regression

Therefore, for i = 1, . . . , N,

$$\hat{c}_i = \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}\hat{\beta}) = \bar{y}_i - \bar{x}_i\hat{\beta},$$

where

$$\bar{x}_i \equiv \frac{1}{T}\sum_{t=1}^{T} x_{it}; \qquad \bar{y}_i \equiv \frac{1}{T}\sum_{t=1}^{T} y_{it}$$

Plug this result into the first FOC to obtain:

$$\hat{\beta} = \bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)'(x_{it} - \bar{x}_i)\bigg)^{-1}\bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)'(y_{it} - \bar{y}_i)\bigg)$$

$$\hat{\beta} = \bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}'\ddot{x}_{it}\bigg)^{-1}\bigg(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}'\ddot{y}_{it}\bigg)$$

with time-demeaned variables $\ddot{x}_{it} \equiv x_{it} - \bar{x}_i$, $\ddot{y}_{it} \equiv y_{it} - \bar{y}_i$


Fixed effects regression

Running a regression with the time-demeaned variables
ÿit ≡ yit − ȳi and ẍit ≡ xit − x̄i is numerically equivalent to a
regression of yit on xit and unit-specific dummy variables.

Even better, the regression with the time-demeaned variables is
consistent for β even when Cov[xit, ci] ≠ 0 because
time-demeaning eliminates the unobserved effects:

yit = xit β + ci + εit
ȳi = x̄i β + ci + ε̄i
(yit − ȳi) = (xit − x̄i)β + (ci − ci) + (εit − ε̄i)
ÿit = ẍit β + ε̈it
Fixed effects regression: main results

Identification assumptions:
1 E[εit | xi1, xi2, . . . , xiT, ci] = 0; t = 1, 2, . . . , T
regressors are strictly exogenous conditional on the unobserved
effect
allows xit to be arbitrarily related to ci
2 rank( Σt=1,...,T E[ẍit′ ẍit] ) = K
regressors vary over time for at least some i and are not collinear
Fixed effects estimator
1 Demean and regress ÿit on ẍit (need to correct degrees of
freedom)
2 Regress yit on xit and unit dummies (dummy variable
regression)
3 Regress yit on xit with canned fixed effects routine
Stata: xtreg y x, fe i(PanelID)
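A minimal sketch of the three equivalent implementations, with generic names y, x,
id, and t assumed:

    xtset id t

    * (1) Demean by hand, then OLS on the demeaned data
    egen ybar = mean(y), by(id)
    egen xbar = mean(x), by(id)
    gen ydd = y - ybar
    gen xdd = x - xbar
    reg ydd xdd, vce(cluster id)        // degrees of freedom need correcting

    * (2) Dummy-variable regression (same point estimate)
    areg y x, absorb(id) vce(cluster id)

    * (3) Canned fixed effects routine
    xtreg y x, fe vce(cluster id)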
FE main results

Properties (under assumptions 1-2):


β̂FE is consistent: plim (N→∞) β̂FE,N = β
β̂FE is unbiased conditional on X
Fixed effects regression: main issues

Inference:
Standard errors have to be “clustered” by panel unit (e.g., the
individual) to allow correlation in the εit’s for the same i.
Yields valid inference as long as number of clusters is
reasonably large
Typically we care about β, but unit fixed effects ci could be of
interest
ĉi from the dummy variable regression is unbiased but not
consistent for ci (based on fixed T and N → ∞)
Application: SASP

From 2008-2009, I fielded a survey of Internet sex workers


(685 respondents, 5% response rate)
I asked two types of questions: static provider-specific
information (e.g., age, weight) and dynamic session
information over last 5 sessions
Let’s look at the panel aspect of this analysis together

Cunningham Causal Inference


Risk premium equation

Yis = βi Xi + δDis + γis Zis + ui + εis


Ÿis = γis Z̈is + η̈is

where Y is log price, D is unprotected sex with a client in a


session, X are client and session characteristics, Z is unobserved
heterogeneity, and ui is both unobserved and correlated with Zis .
Table: POLS, FE and Demeaned OLS Estimates of the Determinants of
Log Hourly Price for a Panel of Sex Workers
Depvar: POLS FE Demeaned OLS

Unprotected sex with client of any kind 0.013 0.051* 0.051*


(0.028) (0.028) (0.026)
Ln(Length) -0.308*** -0.435*** -0.435***
(0.028) (0.024) (0.019)
Client was a Regular -0.047* -0.037** -0.037**
(0.028) (0.019) (0.017)
Age of Client -0.001 0.002 0.002
(0.009) (0.007) (0.006)
Age of Client Squared 0.000 -0.000 -0.000
(0.000) (0.000) (0.000)
Client Attractiveness (Scale of 1 to 10) 0.020*** 0.006 0.006
(0.007) (0.006) (0.005)
Second Provider Involved 0.055 0.113* 0.113*
(0.067) (0.060) (0.048)
Asian Client -0.014 -0.010 -0.010
(0.049) (0.034) (0.030)
Black Client 0.092 0.027 0.027
(0.073) (0.042) (0.037)
Hispanic Client 0.052 -0.062 -0.062
(0.080) (0.052) (0.045)
Other Ethnicity Client 0.156** 0.142*** 0.142***
(0.068) (0.049) (0.045)
Met Client in Hotel 0.133*** 0.052* 0.052*
(0.029) (0.027) (0.024)
Gave Client a Massage -0.134*** -0.001 -0.001
(0.029) (0.028) (0.024)
Age of provider 0.003 0.000 0.000
(0.012) (.) (.)
Age of provider squared -0.000 0.000 0.000
(0.000) (.) (.)
Table: POLS, FE and Demeaned OLS Estimates of the Determinants of
Log Hourly Price for a Panel of Sex Workers
Depvar: POLS FE Demeaned OLS

Body Mass Index -0.022*** 0.000 0.000


(0.002) (.) (.)
Hispanic -0.226*** 0.000 0.000
(0.082) (.) (.)
Black 0.028 0.000 0.000
(0.064) (.) (.)
Other -0.112 0.000 0.000
(0.077) (.) (.)
Asian 0.086 0.000 0.000
(0.158) (.) (.)
Imputed Years of Schooling 0.020** 0.000 0.000
(0.010) (.) (.)
Cohabitating (living with a partner) but unmarried -0.054 0.000 0.000
(0.036) (.) (.)
Currently married and living with your spouse 0.005 0.000 0.000
(0.043) (.) (.)
Divorced and not remarried -0.021 0.000 0.000
(0.038) (.) (.)
Married but not currently living with your spouse -0.056 0.000 0.000
(0.059) (.) (.)

N 1,028 1,028 1,028


Mean of dependent variable 5.57 5.57 0.00
Heteroskedasticity-robust standard errors in parentheses, clustered at the provider level. * p<0.10,
** p<0.05, *** p<0.01
Unit specific time trends often eliminate “results”

Table: Demeaned OLS Estimates of the Determinants of Log Hourly


Price for a Panel of Sex Workers with provider specific trends
Depvar: FE w/provider trends

Unprotected sex with client of any kind 0.004


(0.046)
Ln(Length) -0.450***
(0.020)
Client was a Regular -0.071**
(0.023)
Age of Client 0.008
(0.005)
Age of Client Squared -0.000
(0.000)
Client Attractiveness (Scale of 1 to 10) 0.003
(0.003)
Second Provider Involved 0.126*
(0.055)
Asian Client -0.048***
(0.007)
Black Client 0.017
(0.043)
Hispanic Client -0.015
(0.022)
Other Ethnicity Client 0.135***
(0.031)
Met Client in Hotel 0.073***
(0.019)
Gave Client a Massage 0.022
(0.012)
Concluding remarks

This is not a review of panel econometrics; for that see


Wooldridge and other excellent options
We reviewed POLS and TWFE because they are commonly
used with individual level panel data and
difference-in-differences
Their main value is how they control for unobserved
heterogeneity through a simple demeaning
Now let’s discuss difference-in-differences which will at various
times use the TWFE model
John Snow

John Snow was a practicing anesthesiologist in the mid 19th


century London
He was then famous for inventing a machine that would
carefully deliver chloroform to patients in homogenous dosage
which reduced mortality from anasthesia
But he is now famous for providing convincing evidence that
cholera was a waterborne disease during the 1854 outbreak
Published two works on cholera – an essay in 1849, and a
book in 1855
Died of a stroke in 1858

Cunningham Causal Inference


Figure: Daily cholera deaths, London (Coleman 2019)
Cholera background

Cholera hits London three times in the early to mid 1800s


causing large waves of tens of thousands of deaths
Three London epidemics – 1831-1832, 1848-1849, 1853-1854
Cholera attacked victims suddenly, with a 50% survival rate,
and very painful symptoms included vomiting and acute
diarrhea
Miasma

19th century London was a filthy place with waste collecting in


cesspools under houses or emptied into open ditches and
sewers
Majority opinion about disease was the miasma theory
Miasma theory hypothesized that disease transmission was caused by
vapors and smells; it was unclear how relevant this was person-to-person
Never before seen microorganism

Microscopes were around but had horrible resolution


Most human pathogens couldn’t be seen
Johnson (2007) reports Snow did track down a microscope but
could only see blurry things moving around
Isolating these microorganisms wouldn’t occur for half a
century
Snow’s hypothesis

Snow (as well as a few others like Rev. Henry Whitehead)


believe miasma is not relevant for explaining cholera
Snow hypothesizes that the active agent was a living organism
that entered the body, got into the alimentary canal with food
or drink, multiplied in the body, and generated some poison
that caused the body to expel water
The organism passed out of the body with these evacuations,
entered the water supply and infected new victims
The process repeated itself, growing rapidly through the
common water supply, causing an epidemic
Thought Experiment

How will he convince anyone that cholera is waterborne and


not due to “bad air”?
Consider the ideal experiment: randomize households by coin
flip to receive water from runoff (control) vs. water without
runoff (treatment)
Unethical, impractical and unrealistic
Even if the randomized experiment is not possible, the thought
experiment suggests the observational equivalent
Multiple sources of evidence, not just one

Snow makes his argument with many pieces of evidence that when
taken together are very compelling that water, not air, is the cause
of the cholera epidemics. These can be categorized as:
1 Observation
2 Broad Street Pump
3 Grand Experiment
Observation

Observed progression of the disease for years


Tracked Patient Zero
Treatments didn’t work: Snow would cover with burlap sacks,
which did nothing
Strange irregular patterns – higher deaths in close proximity to
a public pump on Broad Street, fewer deaths at a pub

“cholera extended to nearly all the houses in which the


water was thus tainted, and to no others.” (Snow 1849)
Broad street outbreak

“The most terrible outbreak of cholera which ever


occurred in this kingdom, is probably that which took
place in Broad Street, Golden Square, and the adjoining
streets, a few weeks ago. Within two hundred and fifty
yards of the spot where Cambridge Street [now Lexington
St.] joins Broad Street [now Broadwick], there were
upwards of five hundred fatal attacks of cholera in ten
days.” (Snow 1855)
How he argues for the Broad street pump

Famous map showing unusual mass of cholera deaths near the


public Broad street pump
He was looking for the source, but he was not inductively
forming his theory with this map because he already knew the
mechanism
He was assembling evidence that would further refute the
explanations of those who advocated an alternative
explanation of the outbreak
Figure: Cholera deaths laid over a small area of London near Broad Street
Map was important but not enough on its own

“[Snow] could see at a glance that he’d be able to


demonstrate that the outbreak was clustered around the
pump, yet he knew from experience that that kind of
evidence, on its own, would not satisfy a miasmatist. The
cluster could just as easily reflect some pocket of poisoned
air that had settled over that part of Soho, something
emanating from the gulley holes or cesspools – or perhaps
even from the pump itself. Snow knew that the case
would be made in the exceptions from the norm. Pockets
of life where you could expect death, pockets of death
where you would expect life.” Johnson (2007) p. 140
Two companies fight for customers

Southwark and Vauxhall Waterworks Company and the


Lambeth Water Company competed over some of the regions
south of the Thames
In 16 sub-districts, with a population of 300,000, they
competed directly, even supplying customers side-by-side

“In many cases a single house has a supply different from


that on either side. Each company supplies both rich and
poor, both large houses and small; there is no difference
in the condition or occupation of the persons receiving
the water of the different companies.” Snow (1855) p 75
Lambeth moves its pipe

During the 1849 epidemic, both companies drew water from


Thames which was polluted with sewage and cholera
London passes legislation requiring water companies to move
their intake pipes upstream of the city
In 1852, the Lambeth Company, a water utility company,
changed supply from Hungerford Bridge
It moved its intake pipe upstream to cleaner water and in
response to legislation (SV delayed)
This created a natural experiment because Southwark and
Vauxhall left its intake pipe in place
Meticulous Data Collection

Two types of data: DD uses aggregate deaths bc of mixing of


customers whereas his Broad Street evidence focused on
individuals
Collected detailed information from households with cholera
deaths on utility subscription (Lambeth or SV)
Many residents didn’t know their water company – distant
landlords paid for it
He knew Lambeth water was four times saltier, so he’d take a
sample and test it using a saline test back at his office
Shoeleather and knowledge of institutional details

Careful balance checks – “the pipes of each Company go down


all the streets into nearly all the courts and alleys”
Concern for sample selection bias –“No fewer than 3000 people
of both sexes [of all types affected]”
Treatment assignment was arbitrary – “a few houses supplied
by one Company and a few by the other”
Table XII

Modified Table XII (Snow 1854)


Company name 1849 1854
Southwark and Vauxhall 135 147
Lambeth 85 19

Estimated ATT using DD is 78 fewer deaths per 10,000
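The 78 is just the difference of the two differences in Table XII (deaths per 10,000):

$$\hat{\delta}_{DD} = \underbrace{(19 - 85)}_{\text{Lambeth}} - \underbrace{(147 - 135)}_{\text{Southwark and Vauxhall}} = -66 - 12 = -78$$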


Failure to convince

“In spite of what has since been recognized as a classic


exercise in data, analysis, and argument, Snow failed to
convince the medical profession, the policy-making
establishment, or the public.” (Coleman 2019)
Final victory

Another cholera outbreak in 1866, east of London, is when


Snow’s ideas were gradually and reluctantly accepted by public
officials and the scientific community
1866 outbreak was confined only to the east of London, which
was the last area not yet connected to the newly constructed
sewage system, which discharged sewage farther down the Thames
The rest of London didn’t have an outbreak
This was the final piece of evidence that swayed skeptics and
led to a more reasoned assessment of Snow’s data and analysis
Merits of Snow’s work

Long commitment to the topic led him to reject unsound


hypotheses and form new ones based on observation and
experience (shoe leather)
Expert handling of data analysis, data visualization, and a
framing of evidence with a ladder of reasoning
Layered rhetoric of research

“The strength of his model derived from its ability to use


observed phenomena on one scale to make predictions
about behavior on other scales up and down the chain. ...
If cholera were waterborne then the patterns of infection
must correlate with the patterns of water distribution in
London's neighborhoods. Snow's theory was like a
ladder; each individual rung was impressive enough, but
the power of it lay in ascending from bottom to top, from
the membrane of the small intestine all the way up to the
city itself.” (Johnson, Ghost Map)
Simple cross-sectional design

Table: Lambeth and Southwark and Vauxhall, 1854

Company                     Cholera mortality

Lambeth                     Y = L + D
Southwark and Vauxhall      Y = SV

Interrupted time series design

Table: Lambeth, 1849 and 1854

Company     Time    Cholera mortality

Lambeth     1849    Y = L
            1854    Y = L + (T + D)
Difference-in-differences

Table: Lambeth and Southwark and Vauxhall, 1849 and 1854

Companies                  Time      Outcome          D1              D2

Lambeth                    Before    Y = L
                           After     Y = L + T + D    T + D
                                                                      D
Southwark and Vauxhall     Before    Y = SV
                           After     Y = SV + T       T
Sample averages

   
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$
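As a minimal sketch (not from the course do files), the same 2x2 can be computed by hand in Stata from the four cell means, assuming a dataset with outcome y and hypothetical dummies treated and post:

* compute the four sample means and take the double difference
summarize y if treated == 1 & post == 1
scalar y_k_post = r(mean)
summarize y if treated == 1 & post == 0
scalar y_k_pre = r(mean)
summarize y if treated == 0 & post == 1
scalar y_u_post = r(mean)
summarize y if treated == 0 & post == 0
scalar y_u_pre = r(mean)
display (y_k_post - y_k_pre) - (y_u_post - y_u_pre)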
Population expectations

   
$$\hat{\delta}^{2x2}_{kU} = \Big(E[Y_k|Post] - E[Y_k|Pre]\Big) - \Big(E[Y_U|Post] - E[Y_U|Pre]\Big)$$
Potential outcomes and the switching equation

   
$$\hat{\delta}^{2x2}_{kU} = \underbrace{\Big(E[Y^1_k|Post] - E[Y^0_k|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Switching equation}} + \underbrace{E[Y^0_k|Post] - E[Y^0_k|Post]}_{\text{Adding zero}}$$
Parallel trends bias

$$\hat{\delta}^{2x2}_{kU} = \underbrace{E[Y^1_k|Post] - E[Y^0_k|Post]}_{ATT} + \underbrace{\Big(E[Y^0_k|Post] - E[Y^0_k|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Non-parallel trends bias in 2x2 case}}$$
Another famous DD study

Card and Krueger (1994) was a seminal study on the minimum


wage both for the result and for the design
Not the first time we saw DD in the modern period - there’s
Ashenfelter (1978) and Card (1991) - but got a lot of attention
Competitive vs noncompetitive markets

Suppose you are interested in the effect of minimum wages on


employment which is a classic and divisive question.
In a competitive input market, increases in the minimum wage
would move us up a downward sloping labor demand curve →
employment would fall
Monopsony (imperfect labor markets) suggests the opposite effect, whereby raising the minimum wage can increase employment
Monopsony’s minimum wage predictions
Card and Krueger (1994)

In February 1992, New Jersey increased the state minimum


wage from $4.25 to $5.05. Pennsylvania’s minimum wage
stayed at $4.25.
Locations of Restaurants (Card and Krueger 2000)


They surveyed about 400 fast food stores both in New Jersey
and Pennsylvania before and after the minimum wage increase
in New Jersey - shoeleather!
Parallel trends assumption

Key identifying assumption is the “parallel trends” assumption


$$\underbrace{\Big[E[Y^0_{NJ}|Post] - E[Y^0_{NJ}|Pre]\Big] - \Big[E[Y^0_{PA}|Post] - E[Y^0_{PA}|Pre]\Big]}_{\text{Non-parallel trends bias}}$$

Note the counterfactual - it is not testable no matter what someone tells you, because New Jersey's post-period potential employment in a world with a lower minimum wage is unobserved
Let's look at this a couple of different ways, including a graphic showing the binding minimum wage
Let’s look at this a couple of different ways, including a
graphic showing the binding minimum wage
Wages After Rise in Minimum Wage

Table: Card and Krueger (1994), Table 3 - average FTE employment per store before and after the rise in the New Jersey minimum wage, for PA, NJ, and the NJ-PA difference (FTE employment before and after, and changes in mean FTE employment; values omitted)

Surprisingly, employment rose in NJ relative to PA after the minimum wage change - consistent with monopsony theory
Regression DD

Remember, I said there are good reasons to use TWFE


It’s easy to calculate the standard errors
We can control for other variables which may reduce the
residual variance (lead to smaller standard errors)
It’s easy to include multiple periods
We can study treatments with different treatment intensity.
(e.g., varying increases in the minimum wage for different
states)
But there are bad reasons, too, which I’ll discuss under
differential timing
Regression DD

The typical regression model we estimate is

$$Y_{it} = \beta_1 + \beta_2 Treat_i + \beta_3 Post_t + \beta_4 (Treat \times Post)_{it} + \varepsilon_{it}$$

where Treat is a dummy if the observation is in the treatment


group and Post is a post treatment dummy
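As a minimal Stata sketch of this regression (variable names y, treat, post, and state are hypothetical), the DD estimate is the coefficient on the interaction and standard errors are clustered on the group identifier:

gen treat_post = treat * post
reg y treat post treat_post, vce(cluster state)

* equivalent factor-variable syntax: the DD estimate is the coefficient on 1.treat#1.post
reg y i.treat##i.post, vce(cluster state)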
Regression DD - Card and Krueger

In the Card and Krueger case, the equivalent regression would


be:
$$Y_{its} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ \times d)_{st} + \varepsilon_{its}$$

NJ is a dummy equal to 1 if the observation is from NJ


d is a dummy equal to 1 if the observation is from November
(the post period)
This equation takes the following values
PA Pre: α
PA Post: α + λ
NJ Pre: α + γ
NJ Post: α + γ + λ + δ
DD estimate: (NJ Post - NJ Pre) - (PA Post - PA Pre) = δ
Graph - Observed Data

Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Key assumption of any DD strategy: Parallel trends

The key assumption for any DD strategy is that the outcome


in treatment and control group would follow the same time
trend in the absence of the treatment
This doesn’t mean that they have to have the same mean of
the outcome
But regardless of whether parallel trends holds, OLS always estimates the vertical distance shown on the next slide
Graph - DD

$$Y_{ist} = \alpha + \gamma NJ_s + \lambda d_t + \delta (NJ_s \times d_t) + \varepsilon_{ist}$$
Losing parallel trends

If parallel trends doesn’t hold, then ATT is not identified


But, regardless of whether ATT is identified, OLS always
estimates the same thing
That’s because OLS uses the slope of the control group to
estimate the DD parameter, which is only unbiased if that
slope is the correct counterfactual for the treatment group
Figure: DD regression diagram without parallel trends - observed NJ, observed PA, and counterfactual NJ labor supply between February and November, with the gap between δ^{OLS} and δ^{ATT} marked


Compositional differences violate parallel trends

One of the risks of a repeated cross-section is that the


composition of the sample may have changed between the pre
and post period
Hong (2011) uses repeated cross-sectional data from the
Consumer Expenditure Survey (CEX) containing music
expenditure and internet use for a random sample of
households
Study exploits the emergence of Napster (the first file sharing software widely used by Internet users) in June 1999 as a natural experiment
Study compares internet users and internet non-users before
and after emergence of Napster
Compositional differences?

Figure: Internet Diffusion and Average Quarterly Music Expenditure in the CEX (Hong 2011) - average music expenditure (in 1998 dollars) for the Internet user and non-user groups, 1996-2001, plotted alongside the share of households with an Internet connection
Table: Descriptive Statistics for Internet User and Non-user Groups (selected rows; statistics weighted using CEX weights)

                      1997                1998                1999
                  User     Non-user   User     Non-user   User     Non-user
Recorded Music    $25.73   $10.90     $24.18   $9.97      $20.92   $9.37
Age               40.2     49.0       42.3     49.0       44.1     49.4
Income            $52,887  $30,459    $51,995  $28,169    $49,970  $26,649
College Grad.     .43      .21        .45      .21        .42      .20
Households (mil.) 15       91         22       86         28       80

Diffusion of the Internet changes samples (e.g., younger music fans are early adopters)
Parallel leads, not trends

The identifying assumption for all DD designs is some


representation of a counterfactual parallel trend
Parallel trends cannot be directly verified because technically
one of the parallel trends is an unobserved counterfactual
But one often will check using pre-treatment data to show
that the trends had been the same prior to treatment
But, even if pre-trends are the same one still has to worry
about other policies changing at the same time (omitted
variable bias)

Plot the raw data when there’s only two groups
Differential timing makes pre-treatment undefined for
untreated groups

New Jersey treated in late 1992, New York in late 1993,


Pennsylvania never treated
Pre-treatment:
New Jersey: <1992
New York: <1993
Pennsylvania: undefined
So how do we check parallel leads?
Randomize treatment dates to control units

Figure: Anderson, et al. (2013) display of raw traffic fatality rates for
re-centered treatment states and control states with randomized
treatment dates
Randomizing arbitrary treatment dates to control counties can be misleading

Figure: Average birth rates per 1,000 around the (re-centered) treatment date, Craigslist counties vs. non-Craigslist counties, where control counties were assigned randomized treatment dates (from one of my studies). Looks decent, right?


Event study regression

Including leads into the DD model is an easy way to analyze


pre-treatment trends
Lags can be included to analyze whether the treatment effect
changes over time after assignment
The estimated regression would be:

$$Y_{its} = \gamma_s + \lambda_t + \sum_{\tau=-q}^{-1} \gamma_\tau D_{s\tau} + \sum_{\tau=0}^{m} \delta_\tau D_{s\tau} + x_{ist} + \varepsilon_{ist}$$

Treatment occurs in year 0


Includes q leads or anticipatory effects
Includes m lags or post treatment effects
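A minimal Stata sketch of building these event-time dummies, assuming a state-year panel with hypothetical variables y, state, year, and treat_year (missing for never-treated states); in practice you would also bin or drop event times outside the chosen window:

gen rel = year - treat_year              // event time; missing for never-treated states
forvalues k = 2/5 {
    gen lead`k' = (rel == -`k')          // anticipatory (lead) dummies; t = -1 is the omitted reference
}
forvalues k = 0/5 {
    gen lag`k' = (rel == `k')            // treatment-year and post-treatment (lag) dummies
}
areg y lead2-lead5 lag0-lag5 i.year, absorb(state) vce(cluster state)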
Figure: Event study of birth rates per 15-44yo per 1,000 by months relative to CL entry; DD Coefficient = -0.18 (s.e. = 0.02)
Same data as a couple slides ago, leads don’t look good


Medicaid and Affordable Care Act example

Miller, et al. (2019) examine a rollout of Medicaid under the


Affordable Care Act
They link large-scale survey data with administrative death
records
9.3 percent reduction in annual mortality caused by Medicaid expansion
Driven by a reduction in disease-related deaths which grows
over time
Figure: Miller, et al. (2019) estimates of Medicaid expansion's effects on annual mortality
Standard errors in DD strategies

Many paper using DD strategies use data from many years –


not just 1 pre and 1 post period
The variables of interest in many of these setups only vary at a
group level (say a state level) and outcome variables are often
serially correlated
As Bertrand, Duflo and Mullainathan (2004) point out,
conventional standard errors often severely understate the
standard deviation of the estimators – standard errors are
biased downward (i.e., too small, over reject)
Standard errors in DD – practical solutions

Bertrand, Duflo and Mullainathan propose the following


solutions:
1 Block bootstrapping standard errors (if you analyze states the
block should be the states and you would sample whole states
with replacement for bootstrapping)
2 Clustering standard errors at the group level (in Stata one
would simply add , cluster(state) to the regression
equation if one analyzes state level variation)
3 Aggregating the data into one pre and one post period. This literally works if there is only one treatment date. With staggered treatment dates one should adopt the following procedure:
Regress Yst onto state FE, year FE and relevant covariates
Obtain residuals from the treatment states only and divide
them into 2 groups: pre and post treatment
Then regress the two groups of residuals onto a post dummy
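As a minimal sketch of the third fix, assuming a state-year panel with hypothetical variables y, covariates x1 and x2, identifiers state and year, an ever-treated dummy treat_state, and a post dummy defined relative to each state's treatment date:

areg y x1 x2 i.year, absorb(state)     // step 1: state FE, year FE, covariates
predict ehat, residuals                // residuals net of the fixed effects
keep if treat_state == 1               // step 2: residuals from treatment states only
collapse (mean) ehat, by(state post)   // one pre and one post average per state
reg ehat post, vce(cluster state)      // step 3: regress the residual averages on post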
Note about groups

Correct treatment of standard errors sometimes makes the


number of groups very small: in the Card and Krueger study
the number of groups is only 2.
DD Robustness

Very common for readers and others to request a variety of


“robustness checks” from a DD design
Think of these as along the same lines as the leads and lags
we already discussed
Event study (already discussed)
Falsification test using data for alternative control group
Falsification test using alternative “placebo” outcome that
should not be affected by the treatment
Within group controls - triple diff

Table: Differences-in-differences-in-differences

States  Group                  Period   Outcomes                 D1                  D2              D

NJ      Low wage employment    After    NJ + T + NJt + lt + D    T + NJt + lt + D
                               Before   NJ                                           D + lt − st
        High wage employment   After    NJ + T + NJt + st        T + NJt + st
                               Before   NJ                                                           D
PA      Low wage employment    After    PA + T + PAt + lt        T + PAt + lt
                               Before   PA                                           lt − st
        High wage employment   After    PA + T + PAt + st        T + PAt + st
                               Before   PA
Difference-in-Differences: Threats to Validity
DDD Example by Gruber
Triple DDD: Mandated Maternity Benefits (Gruber, 1994)
DDD in Regression

$$Y_{ijt} = \alpha + \beta_1 X_{ijt} + \beta_2 \tau_t + \beta_3 \delta_j + \beta_4 D_i + \beta_5 (\delta \times \tau)_{jt} + \beta_6 (\tau \times D)_{ti} + \beta_7 (\delta \times D)_{ij} + \beta_8 (\delta \times \tau \times D)_{ijt} + \varepsilon_{ijt}$$

The DDD estimate is the difference between the DD of


interest and a placebo DD (which is supposed to be zero)
If the placebo DD is non-zero, it might be difficult to convince
the reviewer that the DDD removed all the bias
If the placebo DD is zero, then DD and DDD give the same
results but DD is preferable because standard errors are
smaller for DD than DDD
But now you have multiple parallel trends assumptions - the control-state trends must be a good counterfactual for the treatment states, and the within-state placebo group trends must be a good counterfactual for the within-state treatment group
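A minimal Stata sketch of this DDD regression, with hypothetical variables y, treat (treated state), post, group (the within-state group the policy targets), and state for clustering; the DDD estimate is the coefficient on the triple interaction:

reg y i.treat##i.post##i.group, vce(cluster state)

The ## operator expands all of the lower-order dummies and two-way interactions in the equation above, so only the triple interaction needs to be read off.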
Implementing DDD

Have to get the structure of the data correct because now you
have (1) before and after, (2) treatment and control states,
and (3) within state placebo
I give an example in my Mixtape (p. 278) looking at abortion
legalization’s effect on longterm risky sexual behavior,
including do file
Let’s review first the paper, then work through the exercise
itself using data.
Figure: Longrun effects of abortion legalization on Risky Sex
Motivation

Legalization caused teen childbearing to fall by 12% (Levine


2004)
Gruber, et al. (1999) showed that the marginal child would
have been 60% more likely to live in a single-parent household,
50% more likely to live in poverty, and 45% more likely to be a
recipient of public services
Mechanism was believed to be non-random selection
associated with high risk conditions
Emerging influence

Donohue and Levitt (2001) linked abortion legalization to


declining crime in the 1990s, one of several reasons given for
his John Bates Clark award
Freakonomics popularizes the sensational theory
Other papers followed like Charles and Stephens (2006) who
find that children exposed in utero to legalization were less
likely to use illegal substances
Controversy

Triple diff by Joyce finds no evidence for it when using an


(arbitrary) cutoff of the median abortion rate within early
repeal treatment states
Foote and Goetz (2008) argue the abortion ratio was
constructed incorrectly, and report a coding error leaving out
state-year fixed effects; construction problem destroys results,
state-year fixed effects somewhat attenuates
Literature stops and theory is ignored
In defense of Steve Levitt

I want to remind people though: we only know about the coding error because Levitt posted his do files and gave them to anyone who asked (it is very easy to "lose do files")
Levitt had and has oodles of scientific integrity for his
willingness to cooperate; not always the case
“If abortion lowers homicide rates by 20 – 30%, then it is
likely to have affected an entire spectrum of outcomes
associated with well-being: infant health, child
development, schooling, earnings and marital status.
Similarly, the policy implications are broader than
abortion. Other interventions that affect fertility control
and that lead to fewer unwanted births – contraception or
sexual abstinence – have huge potential payoffs. In short,
a causal relationship between legalized abortion and crime
has such significant ramifications for social policy and at
the same time is so controversial, that further assessment
of the identifying assumptions and their robustness to
alternative strategies is warranted.” Ted Joyce in his triple
diff paper
Figure: Light bending around the sun, predicted by Einstein, and confirmed in a natural experiment involving an eclipse. Artwork by Seth Hahne ©
In defense of falsifiable predictions

Theories which make falsifiable predictions (comparative


statics) are more convincing of causal effects than simpler
reduced form studies
Great paper by Coleman (2019) on Snow's rhetoric in his 1849 essay and his 1855 book on cholera - he mounts different kinds of data to make his argument, some of which is of this nature
Those predictions are threefold:
Where we should find effects
Where we should not find effects
The kind of effects we should find
If all three are met, an identified causal effect becomes
epistemologically more credible
Falsifiable predictions contained in a diff-in-diff

Figure: Group-time differential exposure predicts a temporary parabolic


ATT
Figure: Raw data for repeal and Roe states.
Estimating equation

$$Y_{st} = \beta_1 Repeal_s + \beta_2 DT_t + \beta_{3t} Repeal_s \times DT_t + X_{st}\psi + \alpha_s DS_s + \gamma_1 t + \gamma_{2s} \times t + \varepsilon_{st}$$
Estimated effect of abortion legalization on gonorrhea
Black females 15-19 year-olds

Whisker plots are estimated coefficients of DD estimates

Figure: Differences in black female gonorrhea incidence between repeal


and Roe cohorts.
Assuaging doubt

Maybe spurious - something happened in those years, but


what?
Crack epidemic maybe? But we control for the crack index by
Fryer, et al.
Maybe something else - let’s try a within-state control group
(the older cohort)
DDD Equation

$$Y_{ast} = \beta_1 Repeal_s + \beta_2 DT_t + \beta_3 DA + \beta_{4t} Repeal_s \cdot DT_t + \beta_5 Repeal_s \cdot DA + \beta_{6t} DA \cdot DT_t + \beta_{7t} Repeal_s \cdot DA \cdot DT_t + X_{st}\xi + \alpha_{1s} DS_s + \alpha_{2s} DS_s \cdot DA + \gamma_1 t + \gamma_{2s} DS_s \cdot t + \gamma_3 DA \cdot t + \gamma_{4s} DS_s \cdot DA \cdot t + \varepsilon_{ast}$$

One will be dropped, but I want to focus your attention on the


number of interactions needed to identify DDD parameters
Stacking Structure
DDD Results

Estimated effect of abortion legalization on gonorrhea


Black females 15-19 year-olds vs Black females 25-29 year-olds
Whisker plots are the Repeal × 15-19yo × year estimated DDD coefficients, by year (1986-2000)
My original conclusions

Model made narrow predictions of a parabola within a given


window but only for the treatment cohort
Amazingly we actually found that very shape in the DD – did
we vindicate Gruber, et al. and Donohue and Levitt then?
Also used older group as within-state controls in a DDD, and
still found the parabola, though not as great a look as DD
which is a bit of a red flag
Paper also illustrates the usefulness of having a specific
theoretical prediction. Limits the number of competing
hypotheses (Popperian type of reasoning).
But was I done? Look back at the table
Going beyond Cornwell and Cunningham (2013)

Figure: Second theoretical prediction - this time for 20-24 year olds
Estimated effect of abortion legalization on gonorrhea
Black females 20-24 year-olds
Whisker plots are estimated coefficients of DD estimates
Second prediction fails second DD model

Ugh. lo tov (Hebrew to English: not good)


Well, maybe DDD will look better?
Estimated effect of abortion legalization on gonorrhea
Black females 20-24 year-olds vs Black females 25-29 year-olds

Whisker plots are estimated DDD coefficients
Second prediction fails DDD too

Notice that when we exploited just one testable prediction, we


found evidence
But when we exploit all of the testable predictions, the results
fall apart, suggesting original DD was spurious
Imagine for a moment, though – what if we had seen the
group-time ATT moving with the cohort as they aged?
Other alternative is the repeal-Roe effects dissipate by early to
late 20s, but what does Ockham’s Razor say is the more
credible explanation?
Perhaps the Gruber, et al. (1999) and Donohue and Levitt
(2001) hypothesis was always spurious
Stata replication

Let’s replicate this using the abortion.do file. Pay close attention to
the stacking of the data by group-state, not just state, and the
exact way in which the interactions must therefore be constructed
Falsification test with alternative outcome

The within-group control group (DDD) is a form of placebo


analysis using the same outcome
But there are also placebos using a different outcome – but
you need a hypothesis of mechanisms to figure out what is in
fact a different outcome
Figure out what those are, and test them – finding no effect
raises the epistemological credibility of the first result,
interestingly
Cheng and Hoekstra (2013) examine the effect of castle
doctrine gun laws on non-gun related offenses like grand theft
auto and find no evidence of an effect
Rational addiction as a placebo critique

Sometimes, an empirical literature may be criticized using nothing


more than placebo analysis
“A majority of [our] respondents believe the literature is a
success story that demonstrates the power of economic
reasoning. At the same time, they also believe the
empirical evidence is weak, and they disagree both on the
type of evidence that would validate the theory and the
policy implications. Taken together, this points to an
interesting gap. On the one hand, most of the
respondents claim that the theory has valuable real world
implications. On the other hand, they do not believe the
theory has received empirical support.”
Placebo as critique of empirical rational addiction

Auld and Grootendorst (2004) estimated standard “rational


addiction” models (Becker and Murphy 1988) on data with
milk, eggs, oranges and apples.
They find these plausibly non-addictive goods are addictive,
which casts doubt on the empirical rational addiction models.
Placebo as critique of peer effects

Several studies found evidence for “peer effects” involving


inter-peer transmission of smoking, alcohol use and happiness
tendencies
Christakis and Fowler (2007) found significant network effects
on outcomes like obesity
Cohen-Cole and Fletcher (2008) use similar models and data
and find similar network “effects” for things that aren’t
contagious like acne, height and headaches
Ockham’s razor - given social interaction endogeneity (Manski
1993), homophily more likely explanation
State federalism and differential timing

We’ve been considering situations where treatment occurs in


one area for the most part
But the modal situation is when there is differential timing
This happens in America usually because each area (state,
municipality) will adopt a policy whenever they want to, which
creates tendencies for roll out to occur
Example might be the minimum wage though we will look at
others
Summary

Cheng and Hoekstra (2013) are interested in whether


expansions to “castle doctrine statutes” at the state level
increase or decrease gun violence.
Prior to these expansions, English common law principle
required “duty to retreat” before using lethal force against an
assailant except when the assailant is an intruder in the home
The home is one’s “castle” – hence, “castle doctrine”
When intruders threatened the victim in the home, the duty to
retreat was waived and lethal force in self-defense was allowed
Castle doctrine law explained

In 2005, Florida passed a law that expanded self-defense


protections beyond the house
From 2000 to 2010, 21 states explicitly put "castle doctrine" into statute, and (more importantly) extended it to places outside the home
In other words, 21 states removed the duty to retreat in
specified circumstances
Other changes:
Presumption of reasonable fear is added
Civil liability for those acting under the law is removed
Economic theory predicts more lethal homicides

Workers supply legal or illegal labor and are therefore


responsive to costs and benefits
Castle doctrine expansions lowered the (expected) cost of
killing someone in self-defense
If people are rational, then lowering the price of lethal
self-defense should increase lethal homicides
Economic theory also predicts less crime from deterrence

Although deterrence is a theoretical possibility, note that the goal of the laws was to protect and enhance victims' rights, not deter crime
Testable prediction with data and same design
Treatment passage

Summary:
21 states passed laws removing “duty to retreat” in places
outside the home
17 states removed “duty to retreat” in any place one had a
legal right to be
13 states include a presumption of reasonable fear
18 states remove civil liability when force was justified under
law
Cheng and Hoekstra’s identification strategy

Panel fixed effects estimation

$$Y_{it} = \beta_1 D_i + \beta_2 T_t + \beta_3 CDL_{it} + \alpha_1 X_{it} + c_i + u_t + \varepsilon_{it}$$

CDL is a fraction between 0 and 1 depending on the percent


of the year the state has a castle doctrine law
Preferred specifications include "region-by-year fixed effects"
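A minimal Stata sketch of that specification, assuming hypothetical variables l_homicide, cdl, sid (state), year, region, and time-varying controls x1 and x2 (Stata drops whatever year dummies become collinear with the region-by-year cells):

areg l_homicide cdl x1 x2 i.year i.region#i.year, absorb(sid) vce(cluster sid)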
Data

FBI Uniform Crime Reports Part 1 Offenses (2000-2010)


State-level crime rates, or “offenses per 100,000 population”
Falsification outcomes: motor vehicle theft and larceny
Dataset on justifiable homicides by private citizens
Outcomes (in order)

Deterrence and homicide outcomes:


1 Burglary: the unlawful entry of a structure to commit a felony
or a theft
2 Robbery: the taking or attempting to take anything of value from the care, custody or control of a person or persons by force or threat of force or violence and/or putting the victim in fear
3 Aggravated assault: unlawful attack by one person upon
another for the purpose of inflicting severe or aggravated
bodily injury
Homicide categories
1 Total homicides – murder plus non-negligent manslaughter
(∼14,000 per year)
2 Justifiable homicides by private citizens (∼250/year)
Inference: Clustering

Statistical inference: cluster standard errors at the state level


Are the disturbances random draws from an independent and identically distributed process?
It’s likely that within a state, unobserved determinants of
crime are serially correlated
They follow Bertand, Duflo and Mullainathan (2004) and
adjust for serial correlation in unobserved disturbances within
states at the level of the treatment
Inference: Fisher’s sharp null

How likely is it that we estimate effects of this magnitude


when using randomly chosen pre-treatment time periods and
randomly assigning placebo treatments?
Randomizes dates within-state for the pre-treatment period
(<2000)
Randomization inference and exact p-values
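A minimal sketch of this kind of placebo-date randomization in Stata - one simple variant of the idea, not the authors' exact procedure - assuming a state-year panel of pre-treatment years with hypothetical variables y, sid, year, and an ever_treated dummy:

set seed 12345
tempname sims
postfile `sims' b_placebo using placebo_estimates.dta, replace
forvalues i = 1/500 {
    preserve
    bysort sid (year): gen double u = runiform() if _n == 1
    bysort sid (year): replace u = u[1]
    gen placebo_year = 1985 + floor(10 * u)            // random placebo adoption year
    gen placebo_cdl  = ever_treated * (year >= placebo_year)
    areg y placebo_cdl i.year, absorb(sid) vce(cluster sid)
    post `sims' (_b[placebo_cdl])
    restore
}
postclose `sims'
* the exact p-value is the share of placebo estimates at least as large as the actual estimate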
Region-by-year fixed effects

Absent passing castle doctrine laws, outcomes in these 21


states would have changed similar to other states in their same
region
Recall the “region-by-year fixed effects” in the X term
By including “region-by-year fixed effects”, they are arguing
that unobserved changes in crime are running “parallel” to the
treatment states within region over time
Need not hold across regions since the across region variation
is not being used in this analysis due to the saturation of the
model with “region-by-year fixed effects”
State specific time trends

Alabama, et al. dummy interacted with TREND which equals


1 in 2000, 2 in 2001, . . . , 11 in 2010
Forces the identification to come from variation in outcomes
around the state-specific linear trend
Outcomes must be large enough and different enough from a
state-specific linear trend otherwise it is collinear with the
state-trend
Same argument applies to any control though
Goodman-Bacon (2019) suggests group-trends are less taxing
and satisfying than unit-specific trends
Control variables

Controls (X matrix in earlier equation)


Full-time police employment per 100,000 state residents from the LEOKA data (FBI data)
Persons incarcerated in state prison per 100,000 residents
Shares of white/black men in 15-24 and 25-44 age groups
State per capita spending on public assistance
State per capita spending on public welfare
Parallel Leads

Look at each set of treatment states against never-treated


figure by figure (rare)
Use a one-period lead in the regression model (not as common)
I’m going to look at event study coefficients (most common)
Step one: Falsification test

Policy-makers are not just randomly flipping coins when


passing laws, but presumably do so because of things they
observe on the ground
Address concerns up front this isn’t driven by spurious crime
results
Cheng and Hoekstra (2013) present falsification of larceny and
motor vehicle theft first, then results
Step one (cont.)

Results will be presented separately under six different


specifications
Each new specification adds more controls
Pop quiz: What should you expect to find on key variables of
interest when conducting a falsification and why?
Answer

No statistically significant association between the CDL


passage and the placebos; preferably precise zeroes
No association on the one-year lead either
Basically, you should not find effects where there are no
theoretical policy effects; gun laws shouldn’t affect non-violent
offenses
Step one (cont.)

How do you interpret coefficients?


His model is “log outcomes” regressed onto a dummy variable
(level), so these are semi-elasticities and approximate
percentage changes – but you should transform them by taking
the exponential of each coefficient and then differencing it
from one to find the actual percentage change
Ex: CDL = -0.0137 (column 12, Table 3, "Log (larceny rate)" outcome). exp(-0.0137) = 0.986, and so 1 - 0.986 = 0.014, i.e., 1.4 percent. Thus, CDL reduced larceny rates by about 1.4 percent, which is not statistically significant.
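For example, the transformation of that coefficient is just:

display (exp(-0.0137) - 1) * 100    // about -1.4, i.e., roughly a 1.4 percent decline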
Results – Falsification Exercise

Table 3: Placebo Tests


OLS - Unweighted
7 8 9 10 11 12
Panel A: Larceny Log (Larceny Rate)
Castle Doctrine Law 0.00745 0.00145 -0.00188 -0.00445 -0.00361 -0.0137
(0.0227) (0.0205) (0.0210) (0.0226) (0.0201) (0.0228)

One Year Before Adoption of -0.0103


Castle Doctrine Law (0.0114)

Observation 550 550 550 550 550 550


Panel B: Motor Vehicle Theft Log (Motor Vehicle Theft Rate)

Castle Doctrine Law 0.0767* 0.0138 0.00814 0.00775 0.00977 -0.00373


(0.0413) (0.0444) (0.0407) (0.0462) (0.0391) (0.0361)

One Year Before Adoption of -0.00155


Castle Doctrine Law (0.0287)

Observation 550 550 550 550 550 550


State and Year Fixed Effects Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes
Controls for Larceny or Motor Theft Yes
State-Specific Linear Time Trends Yes
Notes: Each column in each panel represents a separate regression. The unit of observation is state-year. Robust standard errors are
clustered at the state level. Time-varying controls include policing and incarceration rates, welfare and public assistance spending, median
income, poverty rate, unemployment rate, and demographics.
Step two: testing the deterrence hypothesis

Having found no effect on their placebos, Cheng and Hoekstra


(2013) examine the effect of CDL on three deterrence
outcomes: burglary, robbery and aggravated assault
They will, again, have six specifications per outcome in the
“weighted” regression, and then another five for the
“unweighted” regression
Pop quiz: What does deterrence look like?
Answer

Negative signs on the CDL variable is consistent with


deterrence – these crimes were “deterred”, in other words
Based on early work by Becker (1968) and 1970s work by his
student Isaac Ehrlich; higher probabilities of getting hurt in
public may cause offenders to avoid violence in public
altogether
Bounds on the magnitudes from the standard errors are used
to provide some confidence about the estimates as well
Results – Deterrence

OLS - Weighted by State Population OLS - Unweighted


1 2 3 4 5 6 7 8 9 10 11 12
Panel A: Burglary Log (Burglary Rate) Log (Burglary Rate)
Castle Doctrine Law 0.0780*** 0.0290 0.0223 0.0164 0.0327* 0.0237 0.0572** 0.00961 0.00663 0.00277 0.00683 0.0207
(0.0255) (0.0236) (0.0223) (0.0247) (0.0165) (0.0207) (0.0272) (0.0291) (0.0268) (0.0304) (0.0222) (0.0259)

One Year Before Adoption of -0.0201 -0.0154


Castle Doctrine Law (0.0139) (0.0214)
Panel B: Robbery Log (Robbery Rate) Log (Robbery Rate)
Castle Doctrine Law 0.0408 0.0344 0.0262 0.0216 0.0376** 0.0515* 0.0448 0.0320 0.00839 0.00552 0.00874 0.0267
(0.0254) (0.0224) (0.0229) (0.0246) (0.0181) (0.0274) (0.0331) (0.0421) (0.0387) (0.0437) (0.0339) (0.0299)

One Year Before Adoption of -0.0156 -0.0115


Castle Doctrine Law (0.0167) (0.0283)
Panel C: Aggravated Assault Log (Aggravated Assault Rate) Log (Aggravated Assault Rate)
Castle Doctrine Law 0.0434 0.0397 0.0372 0.0362 0.0424 0.0414 0.0555 0.0698 0.0343 0.0305 0.0341 0.0317
(0.0387) (0.0407) (0.0319) (0.0349) (0.0291) (0.0285) (0.0604) (0.0630) (0.0433) (0.0478) (0.0405) (0.0380)

One Year Before Adoption of -0.00343 -0.0150


Castle Doctrine Law (0.0161) (0.0251)
Observations 550 550 550 550 550 550 550 550 550 550 550 550
State and Year Fixed Effects Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes Yes Yes Yes Yes
Contemporaneous Crime Rates Yes Yes
State-Specific Linear Time Trends Yes Yes
Conclusion

“In short, these estimates provide strong evidence against the


possibility that castle doctrine laws cause economically
meaningful deterrence effects” (p. 17)
Translation: They can’t find evidence of large deterrence
effects
“Thus, while castle doctrine law may well have benefits to
those legally justified in protecting themselves in self-defense,
there is no evidence that the law provides positive spillovers by
deterring crime more generally” (p. 17)
They note in footnote 24 that they cannot measure the
benefits to victims whose crimes were deterred, or the benefits
from lower legal costs; their focus is limited to whether it
deterred the crimes, not whether the net benefits from the
laws were positive
Obviously, if there is no deterrence, though, then the net
benefits are lower from CDL than they would be if they did
deter
Step 3: Homicides

The key finding in this study focuses on CDL and its effect on
homicides and non-negligent manslaughter
Pop quiz: what should the sign on CDL be here?
Answer

Effects should be positive


Cheng and Hoekstra want to show the raw data, but have
differential timing
Differential timing means you can’t show pre-treatment raw
data for the never-treated groups
So they show it one by one – which isn’t the most aesthetically
pleasing way to do it, but which has the benefit of being
transparent
Parallel pre-treatment trends

Keep your eyes on whether pre-treatment trends are parallel


for treatment and control groups
Remember, though – they need parallel trends within region – these figures don't show that
But starting with pictures and raw data has value
Log Homicide Rates – 2005 Adopter = Florida


Treatment: Florida (law enacted in October 2005)


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2006 Adopter (13 states)


Treatment: States that enacted the law in 2006


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2007 Adopter (4 states)


Treatment: States that enacted the law in 2007


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2008 Adopter (2 states)


Treatment: States that enacted the law in 2008


Control: States that did not enact a law 2000 - 2010
Log Homicide Rates – 2009 Adopter = Montana


Treatment: State that enacted the law in 2009 (Montana)


Control: States that did not enact a law 2000 - 2010
Modeling

They use a class of estimators more appropriate for "counts", called count models, like the negative binomial estimated with maximum likelihood
Results are robust to least squares and count models
Homicide – Negative Binomial; Murder – OLS

1 2 3 4 5 6
Panel C: Homicide (Negative Binomial - Unweighted)

Castle Doctrine Law 0.0565* 0.0734** 0.0879*** 0.0783** 0.0937*** 0.108***


(0.0331) (0.0305) (0.0313) (0.0355) (0.0302) (0.0346)

One Year Before Adoption of Castle Doctrine -0.0352


Law (0.0260)

Observations 550 550 550 550 550 550


Panel D: Log Murder Rate (OLS - Weighted)

Castle Doctrine Law 0.0906** 0.0955** 0.0916** 0.0884** 0.0981** 0.0813


(0.0424) (0.0389) (0.0382) (0.0404) (0.0391) (0.0520)
One Year Before Adoption of Castle Doctrine -0.0110
Law (0.0230)
Observations 550 550 550 550 550 550

State and Year Fixed Effects Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes
Contemporaneous Crime Rates Yes
State-Specific Linear Time Trends Yes
Fisher sharp null

Move the 11-year panel back one year at a time (covering


1960-2009) and estimate 40 placebo “effects” of passing CDL 1 to
40 years earlier

Method               Average estimate   Estimates larger than actual estimate
Weighted OLS         -0.003             0/40
Unweighted OLS       0.001              1/40
Negative binomial    0.001              0/40
My replication using event study plots

Log Murder Rate: event study point estimates for leads 9 through 1 and lags 1 through 5
Figure: Homicide event study plots using coefplot
Log Murders by years before and after castle doctrine expansion; DD Coefficient = 0.08 (s.e. = 0.03)
Figure: Homicide event study plots using twoway


Log Murders by years before and after castle doctrine expansion; DD Coefficient = 0.08 (s.e. = 0.03)

Figure: Homicide event study plots using twoway and force early leads
into one coefficient
Log Murders by years before and after castle doctrine expansion; DD Coefficient = 0.09 (s.e. = 0.03)

Figure: Homicide event study plots using twoway dropping imbalanced


states
Interpretation

No evidence that Castle Doctrine/Stand Your Ground Laws


deter violent crimes such as burglary, robbery and aggravated
assault
These laws do lead to an 8% net increase in homicide rates,
translating to around 600 additional homicides per year across
the 21 adopting states
Unlikely that all of the additional homicides were legally
justified
Incentives matter in some contexts (lethal force) but not
others (deterrence)
Where to from here?

Now that we’ve reviewed the twoway fixed effects with


treatment that differed across time, how does this more
general form of “differential timing” compare with the 2x2 DD
that we reviewed?
Complicated derivation, but simple interpretation - twoway
fixed effects with differential timing estimates a weighted
average of all 2x2
Andrew Goodman-Bacon (2018; 2019) and Callaway and Sant'Anna (2019)
I will be making the argument that under certain modal
situations, the twoway fixed effects model has major problems,
even fatal ones, due to biases even when parallel trends
plausibly holds
Reminder of 2x2 DD

To understand differential timing, we need to remind ourselves 2x2


form

   
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$

Post to pre difference for treatment group compared to the post to


pre difference for never treated
Different treatment dates by panel unit

$$\underbrace{y_{it} = \beta D_i + \tau Post_t + \delta (D_i \times Post_t) + X_{it} + \alpha_i + \alpha_t + \varepsilon_{it}}_{\text{2x2 DD}}$$

$$\underbrace{y_{it} = \delta D_{it} + X_{it} + \alpha_i + \alpha_t + \epsilon_{it}}_{\text{Twoway FE}}$$

We know a lot about the 2x2, but what about the twoway fixed effects estimator when it comes to DD designs?
Decomposition Preview

Linear panel models estimate a treatment parameter that is a


weighted average over all 2x2 in your sample
The estimator is a weighted average of all potential δ 2×2 in
which treated units act as both controls and treatment
depending on the situation
Weights are function of sample sizes of each “group” and the
variance of the treatment dummies for the groups
Decomposition (cont.)

Under the assumptions of variance weighted common trends


(VWCT) and time invariant treatment effects, the estimator
called the variance weighted ATT is a weighted average of all
possible ATTs
Under more restrictive assumptions it perfectly matches the
ATT
Time varying treatment effects generate a bias that needs to
be accounted for
3 Group Example

Suppose two treatment groups (k,l) and one untreated group


(u)
k,l define the groups based on when they receive treatment
(differently in time) with k receiving it later than l
Denote $\bar{D}_k$ as the share of time each group spends in treatment status
Denote $\hat{\delta}^{2x2,j}_{ab}$ as the canonical 2x2 DD estimator for groups a and b, where j is the treatment group
So what are the possible 2 × 2 combinations?
How many 2x2?

A lot!
When there’s three groups - a never treated (U), an early
treated (k) and a late treated (l), there are four 2x2s
But typically, we have more than 3 groups making the number
of potential 2x2 even larger
With K timing groups and one untreated group, there are K² distinct 2x2 DDs
K² distinct DDs

Assume 3 timing groups (a, b and c) and one untreated group (U).
Then there should be 9 2x2 DDs. Here they are:

a to b b to a c to a
a to c b to c c to b
a to U b to U c to U
Simple example with 3 groups

We’ll stick with two groups, k and l, who will get the treatment
at tk∗ and tl∗ , and the third group U will never get treated
The earlier period before anyone is treated is “pre”, the period
between k and l treatment is “mid”, and the period after l is
treated is “post”
Three important 2x2 DDs

  
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$
$$\hat{\delta}^{2x2}_{kl} = \Big(\bar{y}^{mid(k,l)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{mid(k,l)}_l - \bar{y}^{pre(k)}_l\Big)$$
$$\hat{\delta}^{2x2}_{lk} = \Big(\bar{y}^{post(l)}_l - \bar{y}^{mid(k,l)}_l\Big) - \Big(\bar{y}^{post(l)}_k - \bar{y}^{mid(k,l)}_k\Big)$$

where the first 2x2 is any timing group compared to untreated, the
second is a group compared to yet-to-be-treated timing group, and
the last is the eventually-treated compared to the already-treated
controls.
   
$$\hat{\delta}^{2x2}_{kU} = \Big(\bar{y}^{post(k)}_k - \bar{y}^{pre(k)}_k\Big) - \Big(\bar{y}^{post(k)}_U - \bar{y}^{pre(k)}_U\Big)$$
$$\hat{\delta}^{2x2}_{lU} = \Big(\bar{y}^{post(l)}_l - \bar{y}^{pre(l)}_l\Big) - \Big(\bar{y}^{post(l)}_U - \bar{y}^{pre(l)}_U\Big)$$
$$\hat{\delta}^{2x2,k}_{kl} = \Big(\bar{y}^{MID(k,l)}_k - \bar{y}^{PRE(k,l)}_k\Big) - \Big(\bar{y}^{MID(k,l)}_l - \bar{y}^{PRE(k,l)}_l\Big)$$
$$\hat{\delta}^{2x2,l}_{lk} = \Big(\bar{y}^{POST(k,l)}_l - \bar{y}^{MID(k,l)}_l\Big) - \Big(\bar{y}^{POST(k,l)}_k - \bar{y}^{MID(k,l)}_k\Big)$$
Second, what makes up the DD estimator?

The least squares estimate yields a weighted combination of each


groups’ respective 2x2 (of which there are 4 in this example)
$$\hat{\delta}^{DD} = \sum_{k \neq U} s_{kU}\, \hat{\delta}^{2x2}_{kU} + \sum_{k \neq U}\sum_{l > k} s_{kl}\Big[\mu_{kl}\, \hat{\delta}^{2x2,k}_{kl} + (1 - \mu_{kl})\, \hat{\delta}^{2x2,l}_{lk}\Big]$$

where that first 2x2 is the k compared to U and the l compared to


U (combined to make the equation shorter)
Third, the Weights

$$s_{kU} = \frac{n_k n_U \bar{D}_k (1 - \bar{D}_k)}{\widehat{Var}(\tilde{D}_{it})}$$
$$s_{kl} = \frac{n_k n_l (\bar{D}_k - \bar{D}_l)\big(1 - (\bar{D}_k - \bar{D}_l)\big)}{\widehat{Var}(\tilde{D}_{it})}$$
$$\mu_{kl} = \frac{1 - \bar{D}_k}{1 - (\bar{D}_k - \bar{D}_l)}$$

where the n terms are group sample sizes, the $\bar{D}_k(1-\bar{D}_k)$ and $(\bar{D}_k - \bar{D}_l)(1-(\bar{D}_k - \bar{D}_l))$ expressions refer to the variance of treatment, and $\mu_{kl}$ splits the weight between the two 2x2s formed by a pair of timing groups.
Weights discussion

Two things pop out of these weights


"Group" variation matters more than unit-level variation. A group is, for example, the set of states treated in 1995 - they are the 1995 group. The more units in a group, the bigger that 2x2's practical weight.
Within-group treatment variance matters a lot.
Think about what causes the treatment variance to be as big
as possible. Let’s think about the sku weights.
1 D = 0.1. Then 0.1 × 0.9 = 0.09
2 D = 0.4. Then 0.4 × 0.6 = 0.24
3 D = 0.5. Then 0.5 × 0.5 = 0.25
What’s this mean? The weight on treatment variance is
maximized for groups treated in middle of the panel
More weights discussion

But what about the "treated on treated" weights? What's this $\bar{D}_k - \bar{D}_l$ business about?
Well, same principle as before - when the difference in treatment shares is close to 0.5, those 2x2s are given the greatest weight
For instance, say $t^*_k = 0.15$ and $t^*_l = 0.67$. Then $\bar{D}_k - \bar{D}_l = 0.52$, and thus $0.52 \times 0.48 = 0.2496$.
TWFE and centralities

Groups in the middle of the panel weight up their respective


2x2s via the variance weighting
But when looking at treated-to-treated comparisons, when differences in timing have a spacing of around 1/2, those also weight up the respective 2x2s via variance weighting
But there's no theoretical reason why we should prefer this, as it's just a weighting procedure determined by how we drew the panel
This is the first thing about TWFE that should give us pause,
as not all estimators do this
Potential outcomes

Previous just showed that DD was based on a weighted


“adding up” of particular 2x2s. That tells us what DD is
numerically. But that’s not the end
Because the decomposition theorem expresses the DD
coefficient in terms of sample averages, the movement to
potential outcomes is easy.
Now we express DD in terms of ATT which is essential for
understanding identification and bias
Average treatment effect on the treatment group (ATT)

Define the year-specific ATT as

$$ATT_k(\tau) = E[Y^1_{it} - Y^0_{it} \,|\, k, t = \tau]$$

Now define it over a time window W (e.g., a post-treatment window)

$$ATT_k(W) = E[Y^1_{it} - Y^0_{it} \,|\, k, \tau \in W]$$

Define differences in average potential outcomes over time as:

$$\Delta Y^h_k(W_1, W_0) = E[Y^h_{it} \,|\, k, W_1] - E[Y^h_{it} \,|\, k, W_0]$$

for h = 0 (i.e., $Y^0$) or h = 1 (i.e., $Y^1$)


Changing potential outcomes

Figure: With trends, differences in mean potential outcomes are non-zero


From 2x2 to ATT

   
$$\hat{\delta}^{2x2}_{kU} = \Big(E[Y_j|Post] - E[Y_j|Pre]\Big) - \Big(E[Y_U|Post] - E[Y_U|Pre]\Big)$$
$$= \underbrace{\Big(E[Y^1_j|Post] - E[Y^0_j|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Switching equation}} + \underbrace{E[Y^0_j|Post] - E[Y^0_j|Post]}_{\text{Adding zero}}$$
$$= \underbrace{E[Y^1_j|Post] - E[Y^0_j|Post]}_{ATT} + \underbrace{\Big(E[Y^0_j|Post] - E[Y^0_j|Pre]\Big) - \Big(E[Y^0_U|Post] - E[Y^0_U|Pre]\Big)}_{\text{Non-parallel trends bias in 2x2 case}}$$
Potential outcomes

$$\hat{\delta}^{2x2}_{kU} = ATT_{Post,j} + \underbrace{\Delta Y^0_{Post,Pre,j} - \Delta Y^0_{Post,Pre,U}}_{\text{Selection bias!}}$$

Ha! It's another selection bias term, like when we decomposed the simple difference in outcomes! But here we see its basis - non-parallel trends in the potential outcomes themselves. Notice that one of these terms is a counterfactual, but which one?
Two benign 2x2

$$\hat{\delta}^{2x2}_{kU} = ATT_k(Post) + \Delta Y^0_k(Post(k), Pre(k)) - \Delta Y^0_U(Post(k), Pre)$$
$$\hat{\delta}^{2x2}_{kl} = ATT_k(MID) + \Delta Y^0_k(MID, Pre) - \Delta Y^0_l(MID, Pre)$$

These look the same because you’re always comparing the treated
unit with an untreated unit (though in the second case it’s just that
they haven’t been treated yet).
The dangerous 2x2

But what about the 2x2 that compared the late groups to the
already-treated earlier groups? With a lot of substitutions like we
did we get:

$$\hat{\delta}^{2x2}_{lk} = ATT_{l,Post(l)} + \underbrace{\Delta Y^0_l(Post(l), MID) - \Delta Y^0_k(Post(l), MID)}_{\text{Parallel trends bias}} - \underbrace{\big(ATT_k(Post) - ATT_k(Mid)\big)}_{\text{Heterogeneity bias!}}$$
Heterogeneity bias?

That old decomposition of the simple difference in outcomes rears


its ugly head!

$$\hat{\delta}^{2x2}_{lk} = ATT_{l,Post(l)} + \Delta Y^0_l(Post(l), MID) - \Delta Y^0_k(Post(l), MID) - \big(ATT_k(Post) - ATT_k(Mid)\big)$$

The first term is the ATT we are looking for
The second term is the selection bias, which only zeroes out if $Y^0$ for k and l follows the same parallel trend from the mid to the post period
The third term is the heterogeneity bias, which is non-zero if the ATT for k differs over time; if it is constant, it just zeroes out.
Substitute all this stuff into the decomposition formula

 
$$\hat{\delta}^{DD} = \sum_{k \neq U} s_{kU}\, \hat{\delta}^{2x2}_{kU} + \sum_{k \neq U}\sum_{l > k} s_{kl}\Big[\mu_{kl}\, \hat{\delta}^{2x2,k}_{kl} + (1 - \mu_{kl})\, \hat{\delta}^{2x2,l}_{lk}\Big]$$

where we will make these substitutions

$$\hat{\delta}^{2x2}_{kU} = ATT_k(Post) + \Delta Y^0_k(Post, Pre) - \Delta Y^0_U(Post, Pre)$$
$$\hat{\delta}^{2x2,k}_{kl} = ATT_k(Mid) + \Delta Y^0_k(Mid, Pre) - \Delta Y^0_l(Mid, Pre)$$
$$\hat{\delta}^{2x2,l}_{lk} = ATT_l(Post(l)) + \Delta Y^0_l(Post(l), MID) - \Delta Y^0_k(Post(l), MID) - \big(ATT_k(Post) - ATT_k(Mid)\big)$$

Notice all those potential sources of biases!


Potential Outcome Notation

$$\text{plim}_{n \to \infty}\, \hat{\delta}^{DD} = \delta^{DD} = VWATT + VWCT - \Delta ATT$$

Notice the number of assumptions needed even to estimate


this very strange weighted ATT (which is a function of how
you drew the panel in the first place).
With dynamic treatment effects, the ∆ATT term attenuates the estimate (a bias) and can even reverse its sign, even when the underlying effects all have the same, reinforcing sign!
Let’s look at each of these three parts more closely
Variance weighted ATT

$$VWATT = \sum_{k \neq U} \sigma_{kU}\, ATT_k(Post(k)) + \sum_{k \neq U}\sum_{l > k} \sigma_{kl}\Big[\mu_{kl}\, ATT_k(MID) + (1 - \mu_{kl})\, ATT_l(POST(l))\Big]$$

where σ is like s, only population terms rather than sample terms.


Weights sum to one.
Note, if all the ATT are identical, then the weighting is
irrelevant.
But otherwise, it’s basically weighting each of the individual
sets of ATT we have been discussing, where weights depend
on group size and variance
Variance weighted common trends

VWCT can be understood as a variance weighted common


trends component,
This is the collection of selection biases we previously wrote
out,
But notice – identification requires variance weighted common
trends to hold.
You get this with identical trends, but you don’t need identical
trends anymore as the weights can make it hold without.
Huge pain to write out, unfortunately.
Variance weighted common trends

$$VWCT = \sum_{k \neq U} \sigma_{kU}\Big[\Delta Y^0_k(Post(k), Pre) - \Delta Y^0_U(Post(k), Pre)\Big] + \sum_{k \neq U}\sum_{l > k} \sigma_{kl}\Big[\mu_{kl}\big\{\Delta Y^0_k(Mid, Pre(k)) - \Delta Y^0_l(Mid, Pre(k))\big\} + (1 - \mu_{kl})\big\{\Delta Y^0_l(Post(l), Mid) - \Delta Y^0_k(Post(l), Mid)\big\}\Big]$$

This is new. But while this is a lot to be equal to zero, it's ironically a weaker identifying assumption than we thought, because you don't need identical common trends - the weights can technically correct for unequal trends.
Heterogeneity bias

$$\Delta ATT = \sum_{k \neq U}\sum_{l > k} (1 - \mu_{kl})\Big[ATT_k(Post(l)) - ATT_k(Mid)\Big]$$

Now, if the ATT is constant over time, then this difference is zero,
but what if the ATT is not constant? Then TWFE is biased, and
depending on the dynamics and the VWATT, may even flip signs
Case 1: ATT varies across units but not time

$$\text{plim}_{n \to \infty}\, \hat{\delta}^{DD} = VWATT + VWCT$$

because ∆ATT = 0 here. Assume VWCT = 0. Then the VWATT equals

$$VWATT = \sum_{k \neq U} ATT_k\Big[\sigma_{kU} + \sum_{j=1}^{k-1}\sigma_{jk}(1 - \mu_{jk}) + \sum_{j=k+1}^{K}\sigma_{kj}\mu_{kj}\Big] = \sum_{k \neq U} ATT_k\, w^T_k$$

the VWATT weights together group-specific ATTs by a function of


sample shares and treatment variance.
Case 1 cont.

The processes that determine treatment timing are central to


the interpretation of VWATT.
Assume treatment rolls out first to units with the largest
ATTs.
Then regression DD underestimates the sample-weighted ATT if $t^*_1$ is early enough, or if there are a lot of post periods, so that $\bar{D}_1(1 - \bar{D}_1)$ is very small and $\bar{D}_k \approx 0.5$
Regression DD overestimates if $t^*_1$ is late enough (or if there are a lot of pre periods) so that $\bar{D}_1 \approx 0.5$ and $\bar{D}_k$ is small
Goodman-Bacon (2018) suggests scattering the weights
against each group’s sample share. They may be close if there
is little variation in treatment timing, if the untreated group is
very large, or if some timing groups are very large
Case 2: Constant ATT across units, but heterogenous over
time

Time varying treatment effects, even if they are identical


across units, generate cross-group heterogeneity because of the
differing post-treatment windows
Let's consider a case where the counterfactual outcomes are identical, but the treatment effect is a linear break in the trend. For instance, $Y^1_{it} = Y^0_{it} + \theta(t - t^*_1 + 1)$, similar to Meer and West (2013)
Treatment effect is break in trend
Case 2 cont.

The first 2x2 uses the later group as its control in the middle
period. But in the late period, the later treated unit is using
the earlier treated as its control
But notice, this effect is biased because the control group is
experiencing a trend in outcomes (heterogeneous treatment
effects)
This bias feeds through to the later 2x2 according to the size
of the weight (1 − µkl )
Variance weighted common trends

If treatment effects are constant over time, then we only need


VWCT = 0 to identify VWATT. “Only”!
The assumption itself is not testable because common trends
is based on counterfactual Y 0 for the treatment groups in the
post-treatment period, and we only have pre-treatment data
But let’s assume differential counterfactual trends Yk0 are
linear throughout the panel. Then we can get a convenient
approximation to the VWCT on the next slide
Variance weighted common trends

$$VWCT = \sum_{k \neq U} \Delta Y^0_k\Big[\sigma_{kU} + \sum_{j=1}^{k-1}\sigma_{jk}(1 - 2\mu_{jk}) + \sum_{j=k+1}^{K}\sigma_{kj}(2\mu_{kj} - 1)\Big] - \Delta Y^0_U \sum_{k \neq U}\sigma_{kU}$$

Obviously, for this bias to be inconsequential, we need the sum of the two weighted counterfactual trends to be zero. You get this with identical trends, but identical trends are not necessary because of the weights' ability to shift non-identical trends so as to satisfy the zero condition.
Variance weighted common trends

The weight on each group's counterfactual trend equals the difference between the total weight it gets when it acts as a treatment group ($w^T_k$) minus the total weight it gets when it acts as a control ($w^C_k$).

$$\sum_k \Delta Y^0_k\big[w^T_k - w^C_k\big] = 0$$

where $w^T_k$ is the sum of all weights where group k is the treatment group

$$w^T_k = \sigma_{kU} + \sum_{j=1}^{k-1}\sigma_{jk}(1 - \mu_{jk}) + \sum_{j=k+1}^{K}\sigma_{kj}\mu_{kj}$$

and $w^C_k$ is the sum of all weights where group k is the control group

$$w^C_k = \sum_{j=1}^{k-1}\sigma_{jk}\mu_{jk} + \sum_{j=k+1}^{K}\sigma_{kj}(1 - \mu_{kj})$$
Variance weighted common trends

The bias induced by each group will depend on whether it is a


net treatment/control group
A positive pre-trend for group j will bias the results upwards if j is a net treatment group ($w^T_j > w^C_j$) or downwards if it is a net control group; if the two weights are equal, then the bias will be zero regardless of the group pre-trend
Units treated towards the ends of the panel get relatively more
weight when they act as controls.
Needless to say, the size of the bias from a given trend is larger
for groups with more weight
Variance weighted common trends

What this means is that while all units are acting as controls, treatment timing causes some units to be controls more often - hence why they become negative (e.g., $w^T_k - w^C_k < 0$ implies $w^C_k$ has become relatively large)
The earliest and/or latest units get more weight as controls
than treatments
Units treated in the middle of the panel have high treatment
variance as we’ve noted repeatedly, and so get more weight
when they act as the treatment group
Variance weighted common trend weights
Testing VWCT

The identifying assumption $\sum_k \Delta Y^0_k[w^T_k - w^C_k] = 0$ shows us how to exactly weight averages of $x_{it}$ and perform a single t-test that directly captures the identifying assumption.
1 Generate a dummy for the effective treatment group
$$1[B_k] = w^T_k - w^C_k > 0$$
2 Estimate
$$x_k = \beta B_k + \varepsilon_k$$
weighted by $|w^T_k - w^C_k|$
The coefficient $\hat{\beta}$ equals covariate differences weighted by the actual identifying variation, and its t-statistic tests the null of reweighted balance implied by the VWCT equality
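A minimal Stata sketch of that balance test, assuming one observation per timing group with hypothetical variables netw (the group's w^T - w^C, computed from the decomposition) and x (the group-level covariate average):

gen B = (netw > 0) if !missing(netw)   // effective treatment group dummy
gen absw = abs(netw)                   // weight groups by |wT - wC|
reg x B [aw = absw]                    // the t-statistic on B tests the reweighted balance condition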
Software to check the 2x2s and weights

Austin Nichols and Thomas Goldring have made available a


package in Stata called ddtiming.ado
This will estimate each individual 2x2 and the weights
associated with a simple twoway fixed effects model
Let's look at it. First download the Cheng and Hoekstra data from earlier (castle-doctrine-2000-2010.dta)
Now install ddtiming.ado and use the do file that I’ve supplied
called hoekstra-cheng.do
Stata

. use castle-doctrine-2000-2010.dta, clear


. areg l_murder post i.year, a(sid) robust

Dep var Log homicide


Castle doctrine law 0.105
(0.032)
Recall the estimated ATT is 0.105

. ddtiming l_murder post, i(sid) t(year)

DD Comparison Weight Avg DD Est


Earlier T vs. Later C 0.060 -0.039
Later T vs. Earlier C 0.032 0.063
T vs. Never treated 0.908 0.116

. di (0.060*-0.039) + (0.032*0.063) + (0.908*0.116)


. 0.105

Most of the 0.105 is coming from comparing treatment units to


never treated units; the others cancel out
2x2s and their corresponding weights

[Figure: each 2x2 DD estimate plotted against its weight (weights from 0.00 to 0.60; estimates from roughly -0.6 to 0.4). Markers distinguish: Earlier Group Treatment vs. Later Group Control; Later Group Treatment vs. Earlier Group Control; Treatment vs. Never Treated.]
Biased DD with OLS

Review baker.do
So we see – with differential timing, and heterogeneous
treatment effects over time, the TWFE bias can be gigantic
because:

plim $\hat{\delta}^{DD}$ = VWATT + VWCT $- \Delta ATT$

New papers are coming out focused on the issues that we are
seeing with TWFE
Callaway and Sant’anna (2019) is one of these (currently R&R
at Journal of Econometrics)
Preliminary

Callaway and Sant’anna consider identification, estimation and


inference procedures for ATE in DD models with
1 multiple time periods
2 variation in treatment timing (i.e., differential timing)
3 parallel trends only holds after conditioning on observables
Group-time ATE

Key concept: the ATE for a specific group and time


Groups are basically cohorts of units treated at the same time
Their method will calculate an ATE per group/time which
yields many individual ATE estimates
Group-time ATE estimates are not determined by the
estimation method one adopts (first difference or FE)
Does not directly restrict heterogeneity with respect to
observed covariates, timing or the evolution of treatment
effects over time
Provides a way to aggregate over these to get a single ATE
Another contribution

Typical econometrics paper: they propose estimators and


provide asymptotically valid inference procedures for the causal
parameter of interest
Uses a particular kind of bootstrapping that is computationally convenient for obtaining confidence intervals
This is an extension of an older Abadie (2006) paper on
semi-parametric DD with some subtle and substantive
differences
The estimator will look awfully similar to an inverse probability
weighting estimator down to the use of propensity scores
Parallel trends assumption

Parallel trends is never directly testable


If you assume that because it holds in the pre-treatment period it therefore holds in the counterfactual post-treatment periods, then fine (IMO, this begs the question [as in assumes the conclusion]. Obviously if treatment is endogenous then parallel trends doesn't hold going forward even if it did hold prior; see Kahn-Lang and Lang 2018)
Notation

T periods going from t = 1, . . . , T


Units are either treated (Dt = 1) or untreated (Dt = 0) but
once treated cannot revert to untreated state
$G_g$ signifies a group and is binary; it equals one if the unit is first treated in time period g
C is also binary and indicates a control group unit equalling
one if “never treated”
Recall the problem with OLS on using treatment units as
controls
Callaway and Sant’anna seem to know this and working to
specifically address it by essentially not using those units at all
as controls
Generalized propensity score: $\hat{p}(X) = \Pr(G_g = 1 \mid X,\, G_g + C = 1)$
Propensity scores

They’ll estimate a propensity score based on group covariates


using probit or logit (but not OLS)
That score will then be normalized (e.g., Hajek weight) which
improves finite sample bias
You may need to trim it on the [0.1,0.9] interval as is
commonly suggested in other applications
Essentially, units in the control group will be weighted up if their propensity scores are high, and weighted down if low, making for more apples-to-apples comparisons
Detour into IPW

Horvitz-Thompson weights
\[
\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{p}(X_i)}{1 - \hat{p}(X_i)}
\]

Hajek weights
\[
\hat{\delta}_{ATT} = \left[ \sum_{i=1}^{N} \frac{Y_i D_i}{\hat{p}(X_i)} \Big/ \sum_{i=1}^{N} \frac{D_i}{\hat{p}(X_i)} \right] - \left[ \sum_{i=1}^{N} \frac{Y_i (1-D_i)}{1-\hat{p}(X_i)} \Big/ \sum_{i=1}^{N} \frac{(1-D_i)}{1-\hat{p}(X_i)} \right]
\]
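A minimal Stata sketch of the first (non-normalized) estimator above; y, d, and x are hypothetical placeholder names for the outcome, treatment, and covariate:

* estimate the propensity score and build the ATT weight
logit d x
predict phat, pr
gen wt  = (d - phat)/(1 - phat)
gen ywt = y*wt
quietly summarize d
scalar NT = r(sum)                      // number of treated observations
quietly summarize ywt
display "IPW estimate of the ATT = " r(sum)/NT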
Parameter of interest

ATT (g , t) = E [Yt1 − Yt0 |Gg = 1]


Potential uses of this estimator

1 Are treatment effects heterogenous by time of adoption?


2 Does treatment effect change over time?
3 Are short-run effects more pronounced than long-run effects?
4 Do treatment effect dynamics differ if people are first treated
in a recession relative to expansion years?
Assumptions

Assumption 1: Sampling is iid (panel data)

Assumption 2: Conditional parallel trends

\[
E[Y_t^0 - Y_{t-1}^0 \mid X, G_g = 1] = E[Y_t^0 - Y_{t-1}^0 \mid X, C = 1]
\]

Assumption 3: Irreversible treatment

Assumption 4: Common support (propensity score)


Estimator

Theorem 1
\[
ATT(g,t) = E\left[ \left( \frac{G_g}{E[G_g]} - \frac{\dfrac{\hat{p}(X)\,C}{1-\hat{p}(X)}}{E\!\left[\dfrac{\hat{p}(X)\,C}{1-\hat{p}(X)}\right]} \right) \big(Y_t - Y_{g-1}\big) \right]
\]
Which units will and will not be controls?

Callaway and Sant’anna are keeping us from calculating DD’s


using TWFE, which is problematic in part because you're implicitly calculating 2x2s by comparing later treated units to earlier treated units, which is a sin
But what if you never have a true control group, or “never
treated”?
Remarks about “staggered adoption” with universal coverage

Remark 1: In some applications, eventually all units are treated, implying that C is never equal to one. In such cases one can consider the "not yet treated" ($D_t = 0$) as a control group instead of the "never treated" (C = 1).
Aggregated vs single year/group ATT

The method they propose is really just identifying very narrow


ATT per group time.
But we are often interested in more aggregate parameters, like
the ATT across all groups and all times
They present two alternative methods for building “interesting
parameters”

“We can aggregate the group-time treatment effects into


fewer interpretable causal effect parameters, which makes
interpretation easier, and also increases statistical power
and reduces estimation uncertainty.” - Andrew Baker
Interesting Parameter 1

\[
\frac{2}{T(T-1)} \sum_{g=2}^{T} \sum_{t=2}^{T} 1\{g \le t\}\, ATT(g,t)
\]

where T is the number of pre-treatment years (Assumption 2 regarding conditional parallel trends). Let's look at an example.
Aggregating the first way

ATT (1986, 1986) = 10


ATT (1986, 1987) = 15
ATT (1986, 1988) = 20

Let data run from 1983 - 1988. Thus T = 3. ATT simple average
is 15.
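Plugging those three group-time estimates into the formula above with T = 3 reproduces the simple average:
\[
\frac{2}{T(T-1)} \sum_{g}\sum_{t} 1\{g \le t\}\,ATT(g,t) = \frac{2}{3 \cdot 2}\,(10 + 15 + 20) = \frac{45}{3} = 15
\]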
Interesting Parameter 2

\[
\frac{1}{k} \sum_{g=2}^{T} \sum_{t=2}^{T} 1\{g \le t\}\, ATT(g,t)\, P(G = g)
\]

This is a weighted average of each ATT (g , t) putting more weight


on ATT (g , t) with larger group sizes
Bootstrap inference

They propose a bootstrap procedure to conduct asymptotically


valid inference which can adjust for autocorrelation and clustering
Stata example

See baker.do
Concluding remarks on DD

Chances are you are going to write more papers using DD than
any other design
Goodman-Bacon (2018, 2019) is worth your time so that you
know what you are estimating
And Callaway and Sant'anna (2019) is an extremely useful
contribution to the DD toolbox for showing a way to estimate
the group-time ATT using any variety of approaches, including
regression
What is synthetic control

Synthetic control has been called the most important


innovation in causal inference of the last 15 years (Athey and
Imbens 2017)
It’s extremely useful for case studies, which is nice because
that’s often all we have
Continues to also be methodologically a frontier for applied
econometrics
Consider this talk a starting point for you



What is a comparative case study

Single treated unit – country, state, whatever


Social scientists tackle such situations in two ways:
qualitatively and quantitatively
In political science, and probably other fields, you see a stark dividing line between the two camps
Not so much in economics
Qualitative comparative case studies

In qualitative comparative case studies, the goal is to reason


inductively the causal effects of events or characteristics of a
single unit on some outcome, oftentimes through logic and
historical analysis.
May not answer the causal questions at all because there is
rarely a counterfactual, or if so, it’s ad hoc.
A classic example of the comparative case study approach is Alexis de Tocqueville's Democracy in America (though he is regularly comparing the US to France)
Traditional quantitative comparative case studies

Quantitative comparative case studies are often explicitly


causal designs.
Usually a natural experiment applied to a single aggregate unit
(e.g., city, school, firm, state, country)
Method compares the evolution of an aggregate outcome for
the unit affected by the intervention to the evolution of the
same ad hoc aggregate control group (Card 1990; Card and
Krueger 1994)
Pros and cons of traditional case study approaches

Pros:
Policy interventions often take place at an aggregate level
Aggregate/macro data are often available
Cons:
Selection of control group is ad hoc
Standard errors do not reflect uncertainty about the ability of
the control group to reproduce the counterfactual of interest
Description of the Mariel Boatlift

How do inflows of immigrants affect the wages and


employment of natives in local labor markets?
Card (1990) uses the Mariel Boatlift of 1980 as a natural
experiment to measure the effect of a sudden influx of
immigrants on unemployment among less-skilled natives
The Mariel Boatlift increased the Miami labor force by 7%
Individual-level data on unemployment from the Current
Population Survey (CPS) for Miami and four comparison cities
(Atlanta, Los Angeles, Houston, Tampa-St. Petersburg)
Why these four?
Motivating Example: The Mariel Boatlift

[Scanned exhibit from Card (1990) shown on the slide; not legible in the extracted text.]
Card’s main results

[Scanned table of Card's main results from Card (1990); not legible in the extracted text.]
Can this ever lead to subjective biases?

Card found that the Mariel boatlift reduced unemployment


compared to the four cities he chose
Is there anything principled we could do that doesn't give the researcher so much control over the control group?
Enter synthetic control (Abadie and Gardeazabal 2003;
Abadie, Diamond and Hainmueller 2010)
Synthetic Control

First appears in Abadie and Gardeazabal (2003) in a study of a


terrorist attack in Spain (Basque) on GDP
Revisited in a 2010 JASA article with Diamond and Hainmueller, two political scientists who were PhD students at Harvard (more proofs and inference)
A combination of comparison units often does a better job
reproducing the characteristics of a treated unit than single
comparison unit alone
Researcher’s objectives

Our goal here is to reproduce the counterfactual of a treated


unit by finding the combination of untreated units that best
resembles the treated unit before the intervention in terms of
the values of k relevant covariates (predictors of the outcome
of interest)
Method selects weighted average of all potential comparison
units that best resembles the characteristics of the treated
unit(s) - called the “synthetic control”
Synthetic control method: advantages

Precludes extrapolation (unlike regression) because


counterfactual forms a convex hull
Does not require access to post-treatment outcomes in the
“design” phase of the study - no peeking
Makes explicit the contribution of each comparison unit to the
counterfactual
Formalizing the way comparison units are chosen has direct
implications for inference
Synthetic control method: disadvantages

1 Subjective researcher bias kicked down to the model selection


stage
2 Significant diversity at the moment as to how to principally
select models - from machine learning to modifications - as
well as estimation and software
Ferman and Pinto (2018) recommend showing a few different results in their "cherry picking" JPAM paper
Synthetic control method: estimation

Suppose that we observe J + 1 units in periods 1, 2, . . . , T


Unit “one” is exposed to the intervention of interest (that is,
“treated”) during periods T0 + 1, . . . , T
The remaining J are an untreated reservoir of potential
controls (a “donor pool”)
Potential outcomes notation

Let Yit0 be the outcome that would be observed for unit i at


time t in the absence of the intervention
Let Yit1 be the outcome that would be observed for unit i at
time t if unit i is exposed to the intervention in periods T0 + 1
to T .
Dynamic ATT

Treatment effect parameter is defined as dynamic ATT where

\[
\delta_{1t} = Y_{1t}^1 - Y_{1t}^0 = Y_{1t} - Y_{1t}^0
\]
for each post-treatment period $t > T_0$, where $Y_{1t}$ is the outcome for unit one at time t. We will estimate $Y_{1t}^0$ using the J units in the donor pool
Estimating optimal weights

Let $W = (w_2, \ldots, w_{J+1})'$ with $w_j \ge 0$ for $j = 2, \ldots, J+1$ and $w_2 + \cdots + w_{J+1} = 1$. Each value of W represents a potential synthetic control
Let X1 be a (k × 1) vector of pre-intervention characteristics
for the treated unit. Similarly, let X0 be a (k × J) matrix
which contains the same variables for the unaffected units.
The vector $W^* = (w_2^*, \ldots, w_{J+1}^*)'$ is chosen to minimize $\|X_1 - X_0 W\|$, subject to our weight constraints


Optimal weights differ by another weighting matrix

Abadie, et al. consider


\[
\|X_1 - X_0 W\| = \sqrt{(X_1 - X_0 W)'\, V\, (X_1 - X_0 W)}
\]
where $X_{jm}$ is the value of the m-th covariate for unit j and V is some $(k \times k)$ symmetric and positive semidefinite matrix
More on the V matrix

Typically, V is diagonal with main diagonal $v_1, \ldots, v_k$. Then the synthetic control weights $w_2^*, \ldots, w_{J+1}^*$ minimize:
\[
\sum_{m=1}^{k} v_m \left( X_{1m} - \sum_{j=2}^{J+1} w_j X_{jm} \right)^2
\]

where vm is a weight that reflects the relative importance that we


assign to the m-th variable when we measure the discrepancy
between the treated unit and the synthetic controls
Choice of V is critical

The synthetic control W ∗ (V ∗ ) is meant to reproduce the


behavior of the outcome variable for the treated unit in the
absence of the treatment
Therefore, the V ∗ weights directly shape W ∗
Estimating the V matrix

Choice of v1 , . . . , vk can be based on


Assess the predictive power of the covariates using regression
Subjectively assess the predictive power of each of the
covariates, or calibration inspecting how different values for
v1 , . . . , vk affect the discrepancies between the treated unit
and the synthetic control
Minimize mean square prediction error (MSPE) for the
pre-treatment period (default):
\[
\sum_{t=1}^{T_0} \left( Y_{1t} - \sum_{j=2}^{J+1} w_j^*(V)\, Y_{jt} \right)^2
\]
Cross validation

Divide the pre-treatment period into an initial training period


and a subsequent validation period
For any given V , calculate W ∗ (V ) in the training period.
Minimize the MSPE of W ∗ (V ) in the validation period
Suppose Y 0 is given by a factor model

What about unmeasured factors affecting the outcome variables as


well as heterogeneity in the effect of observed and unobserved
factors?

\[
Y_{it}^0 = \alpha_t + \theta_t Z_i + \lambda_t u_i + \varepsilon_{it}
\]

where αt is an unknown common factor with constant factor


loadings across units, and λt is a vector of unobserved common
factors
With some manipulation

\[
Y_{1t}^0 - \sum_{j=2}^{J+1} w_j^* Y_{jt} = \sum_{j=2}^{J+1} w_j^* \sum_{s=1}^{T_0} \lambda_t \left( \sum_{n=1}^{T_0} \lambda_n' \lambda_n \right)^{-1} \lambda_s'\, (\varepsilon_{js} - \varepsilon_{1s}) - \sum_{j=2}^{J+1} w_j^*\, (\varepsilon_{jt} - \varepsilon_{1t})
\]

If $\sum_{t=1}^{T_0} \lambda_t' \lambda_t$ is nonsingular, then the RHS will be close to zero if the number of pre-intervention periods is "large" relative to the size of the transitory shocks
Only units that are alike in observables and unobservables
should produce similar trajectories of the outcome variable
over extended periods of time
Proof in Appendix B of ADH (2011)
Example: California’s Proposition 99

In 1988, California first passed comprehensive tobacco control


legislation:
increased cigarette tax by 25 cents/pack
earmarked tax revenues to health and anti-smoking budgets
funded anti-smoking media campaigns
spurred clean-air ordinances throughout the state
produced more than $100 million per year in anti-tobacco
projects
Other states that subsequently passed control programs are
excluded from donor pool of controls (AK, AZ, FL, HI, MA,
MD, MI, NJ, OR, WA, DC)
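As a software aside, here is a minimal sketch of how a synthetic control for this application can be estimated with the user-written synth command in Stata (Abadie, Diamond and Hainmueller's implementation). The variable names follow the smoking dataset distributed with that package; treat the exact predictors, periods and options shown as assumptions to adapt:

* assumes: ssc install synth, and the package's smoking data are in memory;
* trunit(3) is California and treatment begins in 1989
tsset state year
synth cigsale beer lnincome retprice age15to24 ///
      cigsale(1988) cigsale(1980) cigsale(1975), ///
      trunit(3) trperiod(1989) fig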
Cigarette Consumption: CA and the Rest of the US

[Figure: per-capita cigarette sales (in packs), 1970-2000, for California and the rest of the U.S.; a vertical line marks the passage of Proposition 99.]
Cigarette Consumption: CA and synthetic CA
[Figure: per-capita cigarette sales (in packs), 1970-2000, for California and synthetic California; a vertical line marks the passage of Proposition 99.]
Predictor Means: Actual vs. Synthetic California

                                     California            Average of
Variables                         Real     Synthetic    38 control states
Ln(GDP per capita)               10.08       9.86              9.86
Percent aged 15-24               17.40      17.40             17.29
Retail price                     89.42      89.41             87.27
Beer consumption per capita      24.28      24.20             23.75
Cigarette sales per capita 1988  90.10      91.62            114.20
Cigarette sales per capita 1980 120.20     120.43            136.58
Cigarette sales per capita 1975 127.10     126.99            132.81
Note: All variables except lagged cigarette sales are averaged for the
1980-1988 period (beer consumption is averaged 1984-1988).
Smoking Gap between CA and synthetic CA
[Figure: gap in per-capita cigarette sales (in packs) between California and synthetic California, 1970-2000; a vertical line marks the passage of Proposition 99.]
Inference

To assess significance, we calculate exact p-values under Fisher's sharp null using a test statistic equal to the post-to-pre (after-to-before) ratio of the RMSPE
Exact p-value method
Iteratively apply the synthetic method to each country/state in
the donor pool and obtain a distribution of placebo effects
Compare the gap (RMSPE) for California to the distribution of
the placebo gaps. For example, the post-Prop. 99 RMSPE is:
\[
RMSPE = \left( \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \Big( Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt} \Big)^2 \right)^{1/2}
\]

and the exact p-value is the treatment unit rank divided by J
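As a hypothetical illustration of that arithmetic: if the treated unit has the largest post/pre RMSPE ratio relative to J = 38 donors, its rank is 1 and
\[
p = \frac{\text{rank}}{J} = \frac{1}{38} \approx 0.026
\]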


Smoking Gap for CA and 38 control states

(All States in Donor Pool)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California (dark line) and placebo gaps for the 38 control states (light lines); a vertical line marks the passage of Proposition 99.]
Smoking Gap for CA and 34 control states

(Pre-Prop. 99 MSPE ≤ 20 Times Pre-Prop. 99 MSPE for CA)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California and placebo gaps for the 34 remaining control states; a vertical line marks the passage of Proposition 99.]
Smoking Gap for CA and 29 control states

(Pre-Prop. 99 MSPE ≤ 5 Times Pre-Prop. 99 MSPE for CA)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California and placebo gaps for the 29 remaining control states; a vertical line marks the passage of Proposition 99.]
Smoking Gap for CA and 19 control states

(Pre-Prop. 99 MSPE ≤ 2 Times Pre-Prop. 99 MSPE for CA)
[Figure: gap in per-capita cigarette sales (in packs), 1970-2000, for California and placebo gaps for the 19 remaining control states; a vertical line marks the passage of Proposition 99.]
Ratio Post-Prop. 99 RMSPE to Pre-Prop. 99 RMSPE

(All 38 States in Donor Pool)
[Figure: histogram of the post/pre-Proposition 99 mean squared prediction error ratios (frequency vs. ratio, 0 to 120) across the donor pool, with California marked in the far right tail.]


Facts

The US has the highest prison population of any OECD


country in the world
2.3 million are currently incarcerated in US federal and state
prisons and county jails
Another 4.75 million are on parole
From the early 1970s to the present, incarceration and prison
admission rates quintupled in size



Prison constraints

Prisons are and have been at capacity for a long time.


Requires managing flows through
Prison construction
Overcrowding
Paroles
Texas prison boom

Ruiz v. Estelle 1980


Class action lawsuit against TX Dept of Corrections (Estelle,
warden).
TDC lost. Lengthy period of appeals and legal decrees.
Lengthy period of time relying on paroles to manage flows
Governor Ann Richards (D) 1991-1995
Operational prison capacity increased 30-35% in 1993, 1994 and
1995.
Prison capacity increased from 55,000 in 1992 to 130,000 in
1995.
Building of new prisons (private and public)
New prison construction

[Figure: number of new prison constructions per year, 1840-2020; the red dashed line marks 1993.]
Texas prison growth
Operational capacity

[Figure: Texas operational prison capacity (left axis, roughly 40,000 to 160,000) and the percent change in operational capacity (right axis), 1982-2004.]


Texas Prison Flow Measures per 100,000 Population

[Figure: Texas prison admissions, prison releases, and discretionary paroles per 100,000 population, 1980-2005.]
Total incarceration per 100 000
Texas vs US

[Figure: total incarceration rates per 100,000, 1980-2005, for TX and the USA (excluding TX).]


1993 starts the prison expansion
Data

National Prisoner Statistics - prison measures, including race


and gender-specific incarceration
Current Population Survey - controls
SEER - population
Incarcerated persons per 100,000
1993 Treatment

[Figure: gap in prediction error (per-capita incarceration) between Texas and synthetic Texas, 1975-2005; treatment begins in 1993.]


Texas rank: 2, p-value: 0.04
What if you can't conduct a randomized experiment?

Problems with the experimental design itself:


non-compliance by administrators
non-compliance by members of the treatment group
non-compliance by members of the control group
Experiments may be impractical due to:
Too expensive
Unethical
Not feasible for some other reason



From Hill, Millar and Connelly:

[Figure 1: Lung Cancer at Autopsy: Combined Results from 18 Studies; per cent of autopsies, 1860-1950, observed and fitted.]
Mortality Statistics ("The Great Debate"): The Registrar General of England and Wales began publishing the numbers of deaths for specific cancer sites in 1911. The death rates for cancer of the lung from 1911 to 1955 were published by Percy Stocks. The rates increased exponentially over the period: 10% per year in males and 6% per year in females. Canadian rates for the period 1931-52 were published by A. J. Phillips. The rates were consistently lower in Canada than in England and Wales, but also increased exponentially at 8% per year in males and 4% per year in females.
The British and Canadian rates are shown in Figure 2. The rates (a) for males and (b) for females have been age-standardized, and the trends extended to 1990 using data published by Richard Peto and colleagues and by Statistics Canada. In British males the rates reached a maximum in the mid-1970s and then declined. In Canadian males the initial rise was more prolonged, reaching a maximum in 1990. Among females the age-standardized rates continue to climb in both countries, the rise being steeper in Canada than in Britain.
The fact that mortality was lower at first in Canada than in Britain may be explained by the difference in smoking in the two countries. Percy Stocks cited data on the annual consumption per adult of cigarettes in various countries between 1939 and 1957.

[Figure 2(a): Mortality from Cancer of the Lung in Males; rate per 100,000, 1910-2000; England & Wales, Canada, United Kingdom.]
[Figure 4: Smoking and Lung Cancer Case-control Studies; odds ratios for "less than 20" and "20 or more" cigarettes per day, males and females, with the weighted mean across studies. The accompanying text notes that the odds ratio increases with the amount smoked.]

Cohort Studies: Cohort studies, though less prone to bias, are much more difficult to perform than case-control studies, since it is necessary to assemble many thousands of individuals, determine their smoking status, and follow them up for several years to determine how many develop lung cancer. Four such studies were mounted in the 1950s. The subjects used were British doctors, United States veterans, Canadian veterans, and volunteers assembled by the American Cancer Society. All four used mortality as the end-point.
Figure 5 shows the combined mortality ratios for cancer of the lung in males by level of cigarette smoking. Two of the studies involved females, but the numbers of lung cancer deaths were too small to provide precise estimates. Since all causes of death were recorded in the cohort studies it was possible to determine the relationship between smoking and diseases other than lung cancer. Significant associations were found in relation to several types of cancer (e.g. mouth, pharynx, larynx, esophagus, bladder) and with chronic respiratory disease and cardiovascular disease.

[Figure 5: Smoking and Lung Cancer Cohort Studies in Males; mortality ratios by cigarettes per day (less than 10, 10 to 19, 20 or more), weighted mean of 4 studies.]
Does Smoking Cause Cancer?

Smoking, S, causes lung cancer, C (S → C ) versus spurious


correlation due to backdoor path:

[DAG: nodes S and C, connected by the direct arrow S → C and by a backdoor path between S and C; diagram not legible in the extracted text.]
Nature of the criticism

Criticisms from Joseph Berkson, Jerzy Neyman and Ronald Fisher:


(Hill, Millar and Connelly 2003)
1 Correlation b/w smoking and lung cancer is spurious due to
biased selection of subjects (e.g., conditioning on collider
problem)
2 Functional form complaints about using “risk ratios” and “odds
ratios”
3 Confounder, Z , creates backdoor path between smoking and
cancer
4 Implausible magnitudes
5 No experimental evidence to incriminate smoking as a cause of
lung cancer
Fisher’s confounding theory

Fisher, equally famous as a geneticist, argued from logic,


statistics and genetic evidence for a hypothetical confounding
genome, Z , and therefore smokers and non-smokers were not
exchangeable (violation of independence assumption)
Other studies showed that cigarette smokers and non-smokers
were different on observables – more extraverted than
non-smokers and pipe smokers, differed in age, differed in
income, differed in education, etc.
Hindsight is 20/20

Fisher was a chain smoking pipe smoker, he died of cancer,


and he was a paid expert witness for the tobacco industry.
But cynicism aside, it is easy to criticize Fisher because we
look back with more information to when the smoking/lung
cancer link was not universally accepted, and evidence for the
causal link was shallow:
"they [the epidemiologists] turned out to be right, but only because bad logic does not necessarily lead to wrong conclusions." - Robert Hooke (1983)
Motivation: Smoking and Mortality

Table: Death rates per 1,000 person-years (Cochran 1968)

Smoking group Canada U.K. U.S.


Non-smokers 20.2 11.3 13.5
Cigarettes 20.5 14.1 13.5
Cigars/pipes 35.5 20.7 17.4

Are cigars dangerous?


Non-smokers and smokers differ in mortality and age

Table: Mean ages, years (Cochran 1968)

Smoking group Canada U.K. U.S.


Non-smokers 54.9 49.1 57.0
Cigarettes 50.5 49.8 53.2
Cigars/pipes 65.9 55.7 59.7

Older people die at a higher rate, and for reasons other than
just smoking cigars
Maybe cigar smokers' higher observed death rates are because they're older on average
Subclassification

One way to think about the problem is that the covariates are
not balanced – their mean values differ for treatment and
control group. So let’s try to balance them.
Worth a pause - blocking on confounders vs controlling for
covariates. The latter reduces residual variance, but shouldn’t
affect the bias of the estimator. Ceteris paribus vs blocking
Subclassification (also called stratification): Compare mortality
rates across the different smoking groups within age groups so
as to neutralize covariate imbalances in the observed sample
Subclassification

Divide the smoking group samples into age groups


For each of the smoking group samples, calculate the mortality
rates for the age group
Construct probability weights for each age group as the
proportion of the sample with a given age
Compute the weighted averages of the age groups mortality
rates for each smoking group using the probability weights
Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What is the average death rate for pipe smokers?


Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What is the average death rate for pipe smokers?


     
11 13 16
15 · + 35 · + 50 · = 35.5
40 40 40
Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
Subclassification: example

Death rates Number of


Pipe-smokers Pipe-smokers Non-smokers
Age 20-50 15 11 29
Age 50-70 35 13 9
Age +70 50 16 2
Total 40 40

Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
     
29 9 2
15 · + 35 · + 50 · = 21.2
40 40 40
Table: Adjusted death rates using 3 age groups (Cochran 1968)

Smoking group Canada U.K. U.S.


Non-smokers 20.2 11.3 13.5
Cigarettes 28.3 12.8 17.7
Cigars/pipes 21.2 12.0 14.2
Covariates

Definition: Predetermined Covariates


Variable X is predetermined with respect to the treatment D (also
called “pretreatment”) if for each individual i, Xi0 = Xi1 , i.e., the
value of Xi does not depend on the value of Di . Such
characteristics are called covariates.
Comment I: Does not imply X and D are independent
Comment II: Predetermined variables are often time invariant (e.g.,
sex, race), but time invariance is not a necessary condition
Comment III: Beware of colliders
Outcomes

Definition: Outcomes
Those variables, Y , that are (possibly) not predetermined are called
outcomes (for some individual i, Yi0 6= Yi1 )
Adjustment for Observables

Subclassification (Cochran 1968)


Nearest Neighbor matching (Abadie and Imbens 2006, 2008)
Propensity score (Rosenbaum and Rubin 1983)
Multivariate regression
Identification under independence

Recall that randomization implies

\[
(Y^0, Y^1) \perp\!\!\!\perp D
\]
and therefore:
\begin{align*}
E[Y \mid D=1] - E[Y \mid D=0] &= E[Y^1 \mid D=1] - E[Y^0 \mid D=0] && \text{(by the switching equation)} \\
&= E[Y^1] - E[Y^0] && \text{(by independence)} \\
&= E[Y^1 - Y^0] && \text{(ATE)}
\end{align*}

As well as that ATT = ATE :

E [Y 1 − Y 0 ] = E [Y 1 − Y 0 |D = 1]
Identification under conditional independence

Identification assumptions:
1 $(Y^1, Y^0) \perp\!\!\!\perp D \mid X$ (conditional independence)
2 0 < Pr (D = 1|X ) < 1 with probability one (common support)
Identification result:
Given assumption 1:

\begin{align*}
E[Y^1 - Y^0 \mid X] &= E[Y^1 - Y^0 \mid X, D=1] \\
&= E[Y \mid X, D=1] - E[Y \mid X, D=0]
\end{align*}
Given assumption 2:
\begin{align*}
\delta_{ATE} &= E[Y^1 - Y^0] \\
&= \int E[Y^1 - Y^0 \mid X, D=1]\, d\Pr(X) \\
&= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X)
\end{align*}
Identification under conditional independence

Identification assumptions:
1 $(Y^1, Y^0) \perp\!\!\!\perp D \mid X$ (conditional independence)
2 0 < Pr (D = 1|X ) < 1 with probability one (common support)
Identification result:
Similarly

\begin{align*}
\delta_{ATT} &= E[Y^1 - Y^0 \mid D=1] \\
&= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X \mid D=1)
\end{align*}

To identify δATT the conditional independence and common


support assumptions can be relaxed to:
1 Y0 ⊥⊥ D|X
2 Pr (D = 1|X ) < 1 (with Pr (D = 1) > 0)
Subclassification estimator

The identification result is:


\begin{align*}
\delta_{ATE} &= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X) \\
\delta_{ATT} &= \int \big( E[Y \mid X, D=1] - E[Y \mid X, D=0] \big)\, d\Pr(X \mid D=1)
\end{align*}
Assume X takes on K different cells $\{X^1, \ldots, X^k, \ldots, X^K\}$. Then the analogy principle suggests the following estimators:
\begin{align*}
\hat{\delta}_{ATE} &= \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N} \\
\hat{\delta}_{ATT} &= \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}
\end{align*}
where $N^k$ is the number of observations and $N_T^k$ is the number of treatment observations in cell k; $\bar{Y}^{1,k}$ is the mean outcome for the treated in cell k; $\bar{Y}^{0,k}$ is the mean outcome for the control in cell k
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
\[
4 \cdot \left(\frac{13}{30}\right) + 6 \cdot \left(\frac{17}{30}\right) = 5.13
\]
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
Subclassification by Age (K = 2)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old 28 24 4 3 10
Young 22 16 6 7 10
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
\[
4 \cdot \left(\frac{3}{10}\right) + 6 \cdot \left(\frac{7}{10}\right) = 5.4
\]
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Problem: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Problem: What is $\hat{\delta}_{ATE} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N^k}{N}$?
Not identified!
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
Subclassification by Age and Gender (K = 4)

Death Rate Number of


Xk Smokers Non-smokers Diff. Smokers Non-smokers
Old Males 28 22 4 3 7
Old Females 24 3
Young Males 21 16 5 3 4
Young Females 23 17 6 4 6
Total 10 20

Question: What is $\hat{\delta}_{ATT} = \sum_{k=1}^{K} (\bar{Y}^{1,k} - \bar{Y}^{0,k}) \cdot \frac{N_T^k}{N_T}$?
\[
4 \cdot \left(\frac{3}{10}\right) + 5 \cdot \left(\frac{3}{10}\right) + 6 \cdot \left(\frac{4}{10}\right) = 5.1
\]
Curse of Dimensionality

Subclassification may become less feasible in finite samples as


the number of covariates grows (e.g., K = 4 was too many for
this sample)
Assume we have k covariates and we divide each into 3 coarse
categories (e.g., age: young, middle age, old; income: low,
medium, high, etc.)
The number of subclassification cells (or "strata") is $3^k$. For k = 10, that's $3^{10} = 59{,}049$
Curse of Dimensionality

If sparseness occurs, it means many cells may contain either


only treatment units or only control units but not both. If so,
we cannot use sub classification.
Subclassification is also a problem if the cells are “too coarse”.
We can always use "finer" classifications, but finer cells worsen the dimensionality problem, so we don't gain much from that. For example, using 10 variables and 5 categories for each, we get $5^{10} = 9{,}765{,}625$ cells.
Nearest Neighbor Matching

See Abadie and Imbens (2006). “Large sample properties of


matching estimators for average treatment effects”.
Econometrica
We could also estimate δATT by imputing the missing potential
outcome of each treatment unit i using the observed outcome
from that outcome’s “nearest” neighbor j in the control set
\[
\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})
\]
where $Y_{j(i)}$ is the observed outcome of a control unit such that $X_{j(i)}$ is the closest value to $X_i$ among all of the control observations (e.g., match on X)



Matching

We could also use the average observed outcome over M


closest matches:
\[
\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left[ Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right] \qquad (1)
\]

Works well when we can find good matches for each treatment
group unit, so M is usually defined to be small (i.e., M = 1 or
M = 2)
Matching

We can also use matching to estimate δATE . In that case, we


match in both directions:
1 If observation i is treated, we impute Yi0 using the control
matches, {Yj1 (i) , . . . , YjM (i) }
2 If observation i is control, we impute Yi1 using the treatment
matches, {Yj1 (i) , . . . , YjM (i) }
The estimator is:
\[
\hat{\delta}_{ATE} = \frac{1}{N} \sum_{i=1}^{N} (2D_i - 1) \left[ Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right]
\]
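For reference, Stata's built-in teffects nnmatch command implements Abadie-Imbens style nearest-neighbor matching like the estimators above (it is not the only way to compute them). A minimal sketch with hypothetical variable names y (outcome), d (treatment), and age (matching covariate):

* ATT with one nearest neighbor, matching on age
teffects nnmatch (y age) (d), atet nneighbor(1)
* ATE version with bias correction on the matching covariate (bias correction is discussed later)
teffects nnmatch (y age) (d), nneighbor(1) biasadj(age)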
Matching example with single covariate

Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 6 ? 1 3
2 1 ? 1 1
3 0 ? 1 10
4 0 0 2
5 9 0 3
6 1 0 -2
7 1 0 -4

Question: What is $\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})$?
Matching example with single covariate

Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 6 ? 1 3
2 1 ? 1 1
3 0 ? 1 10
4 0 0 2
5 9 0 3
6 1 0 -2
7 1 0 -4

Question: What is $\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})$?
Match and plug in!
Matching example with single covariate

Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 6 9 1 3
2 1 0 1 1
3 0 9 1 10
4 0 0 2
5 9 0 3
6 1 0 -2
7 1 0 -4

Question: What is $\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)})$?
\[
\hat{\delta}_{ATT} = \frac{1}{3}(6-9) + \frac{1}{3}(1-0) + \frac{1}{3}(0-9) = -3.7
\]
A Training Example
Trainees Non-Trainees
unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800 8 28 8800
Average: 28.5 16426 20 23 9500 Average:
21 32 25900
Average: 33 20724
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800 8 28 8800
Average: 28.5 16426 20 23 9500 Average: 28.5 13982
21 32 25900
Average: 33 20724
Age Distribution: Before Matching

[Figure: age histograms (frequency by age, 20-60) for panel A: Trainees and panel B: Non-Trainees, before matching.]
Age Distribution: After Matching

[Figure: age histograms (frequency by age, 20-60) for panel A: Trainees and panel B: matched Non-Trainees, after matching.]
Training Effect Estimates

Difference in average earnings between trainees and non-trainees

Before matching:

16426 − 20724 = −4298

After matching:

16426 − 13982 = 2444


Alternative distance metric: Euclidean distance

When the vector of matching covariates, $X = (X_1, X_2, \ldots, X_k)'$, has more than one dimension (k > 1), we will need a new definition of distance to measure "closeness".
Definition: Euclidean distance
\[
\|X_i - X_j\| = \sqrt{(X_i - X_j)'(X_i - X_j)} = \sqrt{\sum_{n=1}^{k} (X_{ni} - X_{nj})^2}
\]

Comment: The Euclidean distance is not invariant to changes in


the scale of the X ’s. For this reason, alternative distance metrics
that are invariant to changes in scale are used
Normalized Euclidean distance

Definition: Normalized Euclidean distance


A commonly used distance is the normalized Euclidean distance:
\[
\|X_i - X_j\| = \sqrt{(X_i - X_j)'\, \hat{V}^{-1}\, (X_i - X_j)}
\]
where $\hat{V}$ is the diagonal matrix of sample variances,
\[
\hat{V} = \begin{pmatrix} \hat{\sigma}_1^2 & 0 & \cdots & 0 \\ 0 & \hat{\sigma}_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\sigma}_k^2 \end{pmatrix}
\]
Notice that the normalized Euclidean distance is equal to:
\[
\|X_i - X_j\| = \sqrt{\sum_{n=1}^{k} \frac{(X_{ni} - X_{nj})^2}{\hat{\sigma}_n^2}}
\]
Thus, if there are changes in the scale of $X_{ni}$, these changes also affect $\hat{\sigma}_n^2$, and the normalized Euclidean distance does not change
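A toy example (hypothetical numbers) of why the normalization matters: let $X_i = (2, 100)$ and $X_j = (3, 90)$ with sample variances $\hat{\sigma}_1^2 = 1$ and $\hat{\sigma}_2^2 = 400$. The raw Euclidean distance is dominated by the second, high-variance covariate, while the normalized distance is not:
\[
\sqrt{(2-3)^2 + (100-90)^2} = \sqrt{101} \approx 10.05, \qquad \sqrt{\frac{(2-3)^2}{1} + \frac{(100-90)^2}{400}} = \sqrt{1.25} \approx 1.12
\]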
Mahalanobis distance

Definition: Mahalanobis distance


The Mahalanobis distance is the scale-invariant distance metric:
\[
\|X_i - X_j\| = \sqrt{(X_i - X_j)'\, \hat{\Sigma}_X^{-1}\, (X_i - X_j)}
\]
where $\hat{\Sigma}_X$ is the sample variance-covariance matrix of X.
Arbitrary weights

Or, you could just create your own arbitrary weights


\[
\|X_i - X_j\| = \sqrt{\sum_{n=1}^{k} \omega_n \cdot (X_{ni} - X_{nj})^2}
\]

(with all ωn ≥ 0) so that we assign large ωn ’s to those covariates


that we want to match particularly well.
Matching and the Curse of Dimensionality

Dimensionality creates headaches for us in matching.


Bad news: Matching discrepancies ||Xi − Xj(i) || tend to
increase with k, the dimension of X
Good news: Matching discrepancies converge to zero . . .
Bad news: . . . but they converge very slowly if k is large
Good news: Mathematically, it can be shown that $\|X_i - X_{j(i)}\|$ converges to zero at the same rate as $\frac{1}{N^{1/k}}$
Bad news: It’s hard to find good matches when X has a large
dimension: you need many observations if k is big.
Deriving the matching bias

Derive the matching bias by first writing out the sample ATT estimate:

\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (Y_i - Y_{j(i)}),

where each i and j(i) are matched units, X_i \approx X_{j(i)} and D_{j(i)} = 0.

Define potential outcomes and the switching equation:

\mu^0(x) = E[Y | X = x, D = 0] = E[Y^0 | X = x]
\mu^1(x) = E[Y | X = x, D = 1] = E[Y^1 | X = x]
Y_i = \mu^{D_i}(X_i) + \varepsilon_i

Substitute and distribute terms:

\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \big[ (\mu^1(X_i) + \varepsilon_i) - (\mu^0(X_{j(i)}) + \varepsilon_{j(i)}) \big]
                  = \frac{1}{N_T} \sum_{D_i=1} (\mu^1(X_i) - \mu^0(X_{j(i)})) + \frac{1}{N_T} \sum_{D_i=1} (\varepsilon_i - \varepsilon_{j(i)})
Deriving the matching bias

The difference between the sample estimate and the population parameter is:

\hat{\delta}_{ATT} - \delta_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (\mu^1(X_i) - \mu^0(X_{j(i)})) - \delta_{ATT} + \frac{1}{N_T} \sum_{D_i=1} (\varepsilon_i - \varepsilon_{j(i)})

Algebraic manipulation and simplification:

\hat{\delta}_{ATT} - \delta_{ATT} = \frac{1}{N_T} \sum_{D_i=1} (\mu^1(X_i) - \mu^0(X_i) - \delta_{ATT})
                                  + \frac{1}{N_T} \sum_{D_i=1} (\varepsilon_i - \varepsilon_{j(i)})
                                  + \frac{1}{N_T} \sum_{D_i=1} (\mu^0(X_i) - \mu^0(X_{j(i)})).
Deriving the matching bias

Apply the central limit theorem: the scaled difference

\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT})

converges to a Normal distribution with zero mean. But, however,

E[\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT})] = E[\sqrt{N_T}(\mu^0(X_i) - \mu^0(X_{j(i)})) | D = 1].

Now consider the implications if k is large:
The difference between X_i and X_{j(i)} converges to zero very slowly
The difference \mu^0(X_i) - \mu^0(X_{j(i)}) converges to zero very slowly
E[\sqrt{N_T}(\mu^0(X_i) - \mu^0(X_{j(i)})) | D = 1] may not converge to zero and can be very large!
E[\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT})] may not converge to zero because the bias from the matching discrepancies dominates the matching estimator!
Bias is often an issue when we match in many dimensions
Solutions to matching bias problem

The bias of the matching estimator is caused by large matching


discrepancies ||Xi − Xj(i) ||. The curse of dimensionality virtually
guarantees this. However:
1 But the matching discrepancies are observed. We can always
check in the data how well we’re matching the covariates.
2 For δbATT we can always make the matching discrepancies
small by using a large reservoir of untreated units to select the
matches (that is, by making NC large).
3 If the matching discrepancies are large, so we are worried
about potential biases, we can apply bias correction techniques
4 Partial solution: propensity score methods (coming soon. . . )
Matching with bias correction

Each treated observation contributes \mu^0(X_i) - \mu^0(X_{j(i)}) to the bias.

Bias-corrected (BC) matching:

\hat{\delta}_{ATT}^{BC} = \frac{1}{N_T} \sum_{D_i=1} \big[ (Y_i - Y_{j(i)}) - (\hat{\mu}^0(X_i) - \hat{\mu}^0(X_{j(i)})) \big]

where \hat{\mu}^0(x) is an estimate of E[Y | X = x, D = 0], for example using OLS.

Under some conditions, the bias correction eliminates the bias of the matching estimator without affecting the variance.
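In Stata, a sketch of the regression-based correction uses the biasadj() option of -teffects nnmatch- (variable names are illustrative; check the help file for the exact option behavior):

teffects nnmatch (re78 age educ re74 re75) (treat), atet nneighbor(1) ///
    biasadj(age educ re74 re75)     // OLS bias adjustment for the listed covariates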
Bias adjustment in matched data

                Potential Outcome
unit   under Treatment   under Control
 i           Yi1              Yi0         Di    Xi
 1           10                8           1     3
 2            4                1           1     1
 3           10                9           1    10
 4            .                8           0     4
 5            .                1           0     0
 6            .                9           0     8

\hat{\delta}_{ATT} = \frac{10 - 8}{3} + \frac{4 - 1}{3} + \frac{10 - 9}{3} = 2

For the bias correction, estimate \hat{\mu}^0(X) = \hat{\beta}_0 + \hat{\beta}_1 X = 2 + X on the control units. Then

\hat{\delta}_{ATT} = \frac{(10 - 8) - (\hat{\mu}^0(3) - \hat{\mu}^0(4))}{3} + \frac{(4 - 1) - (\hat{\mu}^0(1) - \hat{\mu}^0(0))}{3} + \frac{(10 - 9) - (\hat{\mu}^0(10) - \hat{\mu}^0(8))}{3} = 1.33
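If it helps to see the arithmetic, here is a sketch that reproduces the toy numbers by hand in Stata (the data and the matched pairs are typed in from the table above; nothing here is a real dataset):

clear
input id y1 y0 d x
1 10  8 1  3
2  4  1 1  1
3 10  9 1 10
4  .  8 0  4
5  .  1 0  0
6  .  9 0  8
end
regress y0 x if d == 0                                           // fits mu0_hat(X) = 2 + X exactly on the controls
display ((10-8) + (4-1) + (10-9))/3                              // unadjusted ATT = 2
display (((10-8)-(5-6)) + ((4-1)-(3-2)) + ((10-9)-(12-10)))/3    // bias-corrected ATT = 1.33, using mu0_hat = 2 + X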
Matching bias: Implications for practice

Bias arises because of the effect of large matching discrepancies on


µ0 (Xi ) − µ0 (Xj(i) ). To minimize matching discrepancies:
1 Use a small M (e.g., M = 1). Larger values of M produce
large matching discrepancies.
2 Use matching with replacement. Because matching with
replacement can use untreated units as a match more than
once, matching with replacement produces smaller matching
discrepancies than matching without replacement.
3 Try to match covariates with a large effect on µ0 (·)
particularly well.
Large sample distribution for matching estimators

Matching estimators have a Normal distribution in large samples (provided the bias is small):

\sqrt{N_T}(\hat{\delta}_{ATT} - \delta_{ATT}) \xrightarrow{d} N(0, \sigma^2_{ATT})

For matching without replacement, the "usual" variance estimator:

\hat{\sigma}^2_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} - \hat{\delta}_{ATT} \right)^2

is valid.
Large sample distribution for matching estimators

For matching with replacement:

\hat{\sigma}^2_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} - \hat{\delta}_{ATT} \right)^2 + \frac{1}{N_T} \sum_{D_i=0} \frac{K_i(K_i - 1)}{M^2} \widehat{var}(\varepsilon_i | X_i, D_i = 0)

where K_i is the number of times observation i is used as a match.

\widehat{var}(Y_i | X_i, D_i = 0) can also be estimated by matching. For example, take two observations with D_i = D_j = 0 and X_i \approx X_j; then

\widehat{var}(Y_i | X_i, D_i = 0) = \frac{(Y_i - Y_j)^2}{2}

is an unbiased estimator of var(\varepsilon_i | X_i, D_i = 0)

The bootstrap doesn't work!
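The analytic variance above is what -teffects nnmatch- reports by default (Abadie-Imbens standard errors); a sketch follows, with the caveat that the vce() suboption name should be verified in the help file, and with the reminder that bootstrapping this estimator is not valid:

* Variable names illustrative; nn(#) sets the matches used for the conditional-variance estimate.
teffects nnmatch (re78 age educ re74 re75) (treat), atet vce(robust, nn(2))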
Avoiding dimensionality problems

Curse of dimensionality makes matching on K covariates


challenging
Rubin (1977) and Rosenbaum and Rubin (1983) develop a method that condenses the K covariates used for adjustment into a single scalar
Insofar as treatment is as good as random conditional on the K covariates, one can use the propensity score to adjust for confounders

Least squares

OLS is best linear predictor and approximation to the


conditional expectation function
But if probability of treatment is nonlinear, this conditional
mean may be less informative
Propensity scores relax the linearity assumption and have other
advantages
The Idea behind propensity scores

Earlier we matched on X ’s to compare units “near” one


another based on some distance but matching discrepancies
and sparseness created problems
Propensity scores summarize covariate information about
treatment selection into a single number bounded between 0
and 1 (i.e., a probability)
Now we compare units with similar estimated probabilities of
treatment
And once we adjust using the propensity score, we no longer
need to adjust for X
Identifying assumptions

We need two assumptions for propensity scores to help us


identify causal effects
1 Conditional independence, or unconfoundedness
2 Common support or overlap
The first is based on state of the art and institutional details
sufficient to warrant such a judgment call, making propensity
scores arguably more, not less, advanced
The latter is testable
Identifying assumption I: Conditional independence
(Yi0, Yi1) ⊥⊥ D | Xi. There exists a set X of observable covariates such that, after controlling for these covariates, treatment assignment is independent of potential outcomes.

Conditional on X, treatment assignment is 'as good as random'.
'As good as random' is English for the potential-outcomes jargon 'independent of potential outcomes'
Also sometimes called 'ignorable treatment assignment', 'unconfoundedness', 'selection on observables', 'exogeneity', 'conditional zero mean'
CIA is assumed, not tested, because potential outcomes are missing. Consult your doctor
Identifying assumption II: Common support
For ranges of X , there is a positive probability of being both
treated and untreated

We’ll talk about the propensity score in just a second; for now
this assumption is only about X
Assumption requires that there are units in both treatment
and control for the range of propensity score
Recall, RDD did not have common support so relied on
extrapolation sensitive to functional form assumptions
Common support ensures we can find similar enough donors in
the control pool
Unlike CIA, common support is testable
Formal Definition

Definition of Propensity score


A propensity score is a number bounded between 0 and 1
measuring the probability of treatment assignment conditional on a
vector of confounding variables: p(X ) = Pr (D = 1|X )

Two Necessary Identification Assumptions:


1 (Y0, Y1) ⊥⊥ D | X (CIA)
2 0 < Pr (D = 1|X ) < 1 (common support)
Steps

1 Estimate the propensity score using logit/probit


2 Estimate a particular ATE incorporating the propensity score
using stratification, imputation, regression, or inverse
probability weighting
3 Estimate standard errors
Estimating the propensity score

Estimate the conditional probability of treatment using probit


or logit model
Pr (Di = 1|Xi ) = F (βXi )
Use the estimated coefficients to calculate the propensity score for each unit i:

\hat{\rho}_i = F(X_i \hat{\beta})

The propensity score is the predicted conditional probability of treatment, i.e., the fitted probability for each unit – same thing
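A minimal first-stage sketch in Stata; the variable names (treat, age, educ, black, hispanic, married, nodegree, re74, re75) are assumptions standing in for whatever is in your data:

logit treat age educ black hispanic married nodegree re74 re75
predict pscore, pr        // rho_hat_i = F(X_i * beta_hat): the fitted probability of treatment
summarize pscore if treat == 1
summarize pscore if treat == 0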
Identification

A group of unit’s average treatment effect may depend on


some characteristic, X

E [δi (Xi )] = E [Yi1 − Yi0 |Xi = x]


= E [Yi1 |Xi = x] − E [Yi0 |Xi = x]

CIA allows us to substitute

E[Yi | Di = 1, Xi = x] = E[Yi1 | Xi = x]

and similarly for the Y0 term, using the switching equation
Common support allows us to estimate both terms
Visualizing the propensity score theorem

[Figure: DAG with nodes D, Y, and p(X) – conditioning on p(X) closes the back-door path between D and Y]
It’s similar to the visualization of the RDD strategy from earlier


except that it achieves common support
Propensity score theorem
If (Y1, Y0) ⊥⊥ D | X (CIA), then (Y1, Y0) ⊥⊥ D | ρ(X), where ρ(X) = Pr(D = 1|X) is the propensity score

Conditioning on the propensity score is enough to have


independence between D and (Y 1 , Y 0 ) (Rosenbaum and
Rubin 1983)
Valuable theorem because of dimension reduction and
convergence rate issues which can introduce biases
Big picture: You can toss X out if you have ρb because all
information from X have been absorbed into ρb
Proof

Before diving into the proof, first recognize that

Pr(D = 1 | Y0, Y1, ρ(X)) = E[D | Y0, Y1, ρ(X)]

because

E[D | Y0, Y1, ρ(X)] = 1 × Pr(D = 1 | Y0, Y1, ρ(X)) + 0 × Pr(D = 0 | Y0, Y1, ρ(X))

and the second term cancels out.

Proof.
Assume (Y1, Y0) ⊥⊥ D | X (CIA). Then:

Pr(D = 1 | Y1, Y0, ρ(X)) = E[D | Y1, Y0, ρ(X)]                             (see above)
                         = E[ E[D | Y1, Y0, ρ(X), X] | Y1, Y0, ρ(X) ]       (by LIE)
                         = E[ E[D | Y1, Y0, X] | Y1, Y0, ρ(X) ]             (given X, we know ρ(X))
                         = E[ E[D | X] | Y1, Y0, ρ(X) ]                     (by CIA)
                         = E[ ρ(X) | Y1, Y0, ρ(X) ]                         (propensity score definition)
                         = ρ(X)
Similar proof

We can also show that the probability of treatment conditional on the propensity score is the propensity score, using a similar argument:

Pr(D = 1 | ρ(X)) = E[D | ρ(X)]                (previous slide)
                 = E[ E[D | X] | ρ(X) ]        (LIE)
                 = E[ ρ(X) | ρ(X) ]            (definition)
                 = ρ(X)

and Pr(D = 1 | Y1, Y0, ρ(X)) = Pr(D = 1 | ρ(X)) by CIA
Unbiased estimation of the ATE

Exact methods to do this to be discussed later, but until then, we


can say this:
Corollary: Estimating the ATE
If (Y 1 , Y 0 ) ⊥
⊥ D|X , we can estimate average treatment effects:

E [Y 1 − Y 0 |ρ(X )] = E [Y |D = 1, ρ(X )] − E [Y |D = 0, ρ(X )]


Balancing property

Because the propensity score is a function of X, we know:

Pr (D = 1|X , ρ(X )) = Pr (D = 1|X )


= ρ(X )

Conditional on ρ(X ), the probability that D = 1 does not


depend on X .
D and X are independent conditional on ρ(X):

D ⊥⊥ X | ρ(X)
Balancing property

So we obtain the balancing property of the propensity score:

Pr (X |D = 1, p(X )) = Pr (X |D = 0, p(X ))

Conditional on the propensity score, the distribution of the covariates is the same for treatment and control group units
We can use this to check if our estimated propensity score
actually produces balance:

Pr (X |D = 1, pb(X )) = Pr (X |D = 0, pb(X ))
Propensity score theorem

This theorem tells us the only covariate we need to adjust for


is the conditional probability of treatment itself (i.e., the
propensity score)
It does not tell us which method we should use to do that
adjustment, though, which is an estimation question
There are options: inverse probability weighting, forms of
imputation, stratification, and sometimes even regressions will
incorporate the score as weights
Checking the common support assumption

We can summarize the propensity scores in the treatment and


control group and count how many units are off-support
Crump, et al. (2009) offer a rule of thumb: keep scores on
interval [0.1,0.9].
Tossing out observations beyond those min and max scores
A histogram of propensity scores by treatment and control
group also highlights the overlap problem; software also can
help such as teffects overlap
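A sketch of these checks once the estimated score is in memory (variable names illustrative):

summarize pscore if treat == 1, detail
summarize pscore if treat == 0, detail
histogram pscore, by(treat) bin(40)       // eyeball the overlap by group
count if pscore < 0.1 | pscore > 0.9      // Crump et al. (2009) rule-of-thumb count of off-support units
* after any -teffects- fit, -teffects overlap- plots the score densities for you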
Inverse probability weighting

I really like the simple method of inverse probability weighting


aesthetically because there are no black boxes; it’s all
non-parametric averaging done through a particular kind of
weights based on the propensity score
IPW involves fewer implementation choices like number of
neighbors, common support, etc.
And because IPW is a smooth estimator, the bootstrap is valid
for inference unlike covariate nearest neighbor matching which
Abadie and Imbens (2008) show is not valid

Inverse probability weighting

IPW is basically a reweighting of the outcomes by the


propensity score developed in Robins and Rotnitzky (1995),
Imbens (2000), Hirano and Imbens (2001)
The weights can be expressed in two ways – without
normalization (Horvitz and Thompson 1952) or normalized
(Hajek 1971) – the difference being how well either approach
can handle extreme values of the propensity score; the
differences come out of the survey sampling literature
The notation is far scarier than what we are actually doing, so I'll show you this in a Stata and R simulation to help pin down the intuition a little better
We’ll start with the basic idea using the Horvitz and Thompson
(1952) expression of the weights as it’s not as messy.
Inverse Probability Weighting

Proposition
If (Y1, Y0) ⊥⊥ D | X, then

\delta_{ATE} = E[Y^1 - Y^0] = E\left[ Y \cdot \frac{D - \rho(X)}{\rho(X) \cdot (1 - \rho(X))} \right]

\delta_{ATT} = E[Y^1 - Y^0 | D = 1] = \frac{1}{Pr(D = 1)} \cdot E\left[ Y \cdot \frac{D - \rho(X)}{1 - \rho(X)} \right]
IPW Proof

Proof.

E\left[ Y \cdot \frac{D - \rho(X)}{\rho(X)(1 - \rho(X))} \Big| X \right] = E\left[ \frac{Y}{\rho(X)} \Big| X, D = 1 \right] \rho(X) + E\left[ \frac{-Y}{1 - \rho(X)} \Big| X, D = 0 \right] (1 - \rho(X))
                                                                        = E[Y | X, D = 1] - E[Y | X, D = 0]

and the results follow from integrating over P(X) and P(X | D = 1).
Weighting on the propensity score

Previous formulas used population concepts. Switching to samples, we use a two-step estimator:
1 Estimate the propensity score: \hat{\rho}(X)
2 Use the estimated score to produce analog estimators. Let \hat{\delta}_{ATE} and \hat{\delta}_{ATT} be estimates of the ATE and ATT parameters:

\hat{\delta}_{ATE} = \frac{1}{N} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{\rho}(X_i)}{\hat{\rho}(X_i) \cdot (1 - \hat{\rho}(X_i))}

\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{\rho}(X_i)}{1 - \hat{\rho}(X_i)}
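As a sketch, both analog estimators can be computed by hand once pscore is in memory (variable names as before; inference is discussed on the next slide):

gen ate_i = re78 * (treat - pscore) / (pscore * (1 - pscore))
gen att_i = re78 * (treat - pscore) / (1 - pscore)

quietly summarize ate_i
display "IPW ATE = " r(mean)          // (1/N) x sum of ate_i over all i

quietly count if treat == 1
scalar NT = r(N)
quietly summarize att_i
display "IPW ATT = " r(sum) / NT      // (1/NT) x sum of att_i over all i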
Weighting on the propensity score

Standard errors can be constructed a few different ways:


We need to adjust the standard errors for first-step estimation
of ρ(X )
Parametric first step: Newey and McFadden (1994)
Non-parametric first step: Newey (1994)
Or bootstrap the entire two-step procedure (Adudumilli 2018
and Bodory et al. 2020)
Implementation with software

I like estimating with IPW manually because I like being


reminded how simple a procedure it is
But Stata’s -teffects- and R’s -ipw- do it too, and -teffects-
uses the Hajek normalization weights which will produce
identical estimates to my program
My programs don’t do the inference, but I think that would be
fun and easy to do using the bootstrap
Let’s look at it real quickly now with an example from
LaLonde’s 1986 paper on the NSW job trainings program
(which I’ll discuss again soon)
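A hedged sketch of both routes – Stata's packaged IPW command and a bootstrap of the entire two-step procedure (the program name ipw_att and all variable names are made up for illustration):

teffects ipw (re78) (treat age educ black hispanic married nodegree re74 re75, logit), atet

capture program drop ipw_att
program define ipw_att, rclass
    tempvar ps s
    quietly logit treat age educ black hispanic married nodegree re74 re75
    quietly predict `ps', pr
    quietly gen `s' = re78 * (treat - `ps') / (1 - `ps')
    quietly count if treat == 1
    local NT = r(N)
    quietly summarize `s'
    return scalar att = r(sum) / `NT'
end

bootstrap att = r(att), reps(500): ipw_att     // resamples the data and redoes both steps each time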
Double robust estimators

Lots of papers: Robins and Rotnitzky (1995) originally, Hirano


and Imbens (2001), etc.
Basic idea is you are going to control for covariates twice:
through regression and then through the propensity score
We say that estimators combining regression with IPW are
double robust so long as
The regression for the outcome is properly specified, or
The propensity score is properly specified
Hence the name “double robust”. We give ourselves two
chances to get it right (either/or not both/and)
Estimation of outcome model

y_i = \alpha_0 + X_i \beta + \tilde{\alpha}_1 D_i + \theta_0 \frac{D_i}{\hat{\rho}(X_i)} + \theta_1 \frac{1 - D_i}{1 - \hat{\rho}(X_i)} + \tilde{\varepsilon}_i
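A sketch of estimating that outcome model directly (the two correction terms are built from the earlier pscore; names illustrative), plus Stata's packaged doubly robust estimator:

gen d_over_p   = treat / pscore
gen dc_over_pc = (1 - treat) / (1 - pscore)
regress re78 treat age educ black hispanic married nodegree re74 re75 ///
    d_over_p dc_over_pc
* Stata also ships doubly robust estimators, e.g. augmented IPW:
teffects aipw (re78 age educ re74 re75) (treat age educ re74 re75, logit)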
Propensity score matching

Matching, or what I like to call “imputation”, is another way


that utilizes the ρb
They all use the same first stage, but differ on their second
and third stages
Part of the second stage may be imposing common support
through “trimming”, but for different reasons because now this
idea of distance is entering and maybe you think some units
are “too far away” to be relevant counterfactuals

Standard matching strategy

Pair each treatment unit i with one or more comparable


control group unit j, where comparability is in terms of
proximity to the estimated propensity score
Impute the unit’s missing counterfactual outcome Yi(j) based
on the unit or units chosen in the previous step
If more than one are “nearest neighbors”, then use the
neighbors’ weighted outcomes
Y_{i(j)} = \sum_{j \in C(i)} w_{ij} Y_j

where C(i) is the set of neighbors with W = 0 of the treatment unit i and w_{ij} is the weight of control group unit j, with \sum_{j \in C(i)} w_{ij} = 1
Imputing the counterfactuals

A parameter of interest:

E [Yi1 |Di = 1] − E [Yi0 |Di = 1]

We estimate it as follows
1 X  
[
ATT = = Yi − Yi(j)
NT
i:Wi =1

where NT is the number of matched treatment units in the sample.


Note the difference between imputation and weighting
Matching methods

The probability of observing two units with exactly the same


propensity score is in principle zero because p(x) is continuous
Several matching methods have been proposed in the
literature, but the most widely used are:
Stratification matching
Nearest-neighbor matching (with or without caliper)
Radius matching
Kernel matching
Typically, one treatment unit i is matched to several control
units j, but sometimes one-to-one matching is used
Stratification

Stratification is used to force covariate balance by finding


strata where there is no difference in mean covariate values.
You then use those strata to calculate within differences in
means and sum over properly weighted strata. See Becker and
Ichino (2002)
Stratification is a brute force method for imposing balance by
grouping the data and testing for differences in covariate
means
It’s actually kind of similar to coarsened exact matching, only
using the propensity score for the “stratification” not the
covariates
Stratification: Achieving Balance

The algorithm is brute force covariate balancing


1 Sort the data by propensity score and divide into groups of
observations with similar propensity scores (e.g., percentiles)
2 Within each group, test (using a t-test) whether the means of
the covariates (X ) are equal between treatment and control
3 If so, then stop. If not, it means the covariates aren’t balanced
within that group. Divide the group in half and repeat
4 If a particular covariate is unbalanced for multiple groups,
modify the initial logit or probit equation by including higher
order terms and/or interactions with that covariate and repeat
Historically this could be done with -pscore2.ado- or by hand if one felt so inclined, but it was dropped with -teffects-
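A sketch of the within-stratum balance check in step 2 (five starting blocks, and all variable names, are illustrative):

xtile block = pscore, nquantiles(5)                      // start with propensity score quintiles
foreach x of varlist age educ black hispanic married nodegree {
    forvalues b = 1/5 {
        capture quietly ttest `x' if block == `b', by(treat)
        if _rc == 0 display "`x', block `b': t = " %6.2f r(t)
    }
}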
Nearest Neighbor

Pretty similar to covariate matching. The formula is

\widehat{ATT}_{NN} = \frac{1}{N_T} \sum_{i: W_i = 1} \left( Y_i - \sum_{j \in C(i)_M} w_{ij} Y_j \right)

N_T is the number of treatment units i
w_{ij} is equal to 1/N_C if j is a matched control unit and zero otherwise; N_C is the number of matched control units j
And unit j is chosen as a control for i if its propensity score is nearest to that of i
NN Matching: Bias vs. Variance

But how far away on the propensity score are you willing to go for a match? Herein lie the different types of matching proposed
Matching just one nearest neighbor minimizes bias at the cost
of larger variance
Matching using additional nearest neighbors increases the bias
but decreases the variance
Matching with or without replacement
with replacement keeps bias low at the cost of larger variance
without replacement keeps variance low but at the cost of
potential bias
Distance between treatment and control units

What was historically done was limiting “distance” through


various ad hoc choices
Imagine these choices as creating a lasso (like the cowboy rope)
Anything within the lasso could be used for the imputation;
anything outside the lasso could not
There were two common ways – caliper matching and radius
matching.
Caliper matching

Caliper matching is a variation on NN matching that tries to build brakes into the algorithm so as to avoid "bad neighbors"
It does this by imposing a tolerable maximum distance (e.g., 0.2 units in the propensity score away from a treatment unit i's propensity score)
Note – this is a one-to-one imputation, and if no control group unit j exists within that "caliper", then treatment unit i is discarded
Means we aren’t estimating the ATE anymore once we start
dropping units
It’s difficult to know what this caliper should be ex ante, hence
why I said it is somewhat ad hoc
Radius matching

Each treatment unit i is matched with the control group units


whose propensity score are in a predefined neighborhood of the
propensity score of the treatment unit.
All the control units with ρbj falling within a radius r from ρbi
are matched to the treatment unit i – this is what
distinguishes it from calipers, and makes it more similar to
covariate matching (Abadie and Imbens 2006, 2008)
The smaller the radius, the better the quality of the matches,
but the higher the possibility some treatment units are not
matched because the neighborhood does not contain control
group units j
Software

I think you can use -teffects, psmatch- to get at these two


nearest neighbor approaches by setting the number of matches
You can use -pscore2- for stratification, but I think the
standard errors are wrong, so you may need to just do it
manually using bootstrapping or variance approximation, and
that may be a pain to program up
Not sure of the R command, but I know it’s out there
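A sketch of those nearest-neighbor choices via -teffects psmatch- (variable names illustrative; the caliper() option's availability depends on your Stata version):

teffects psmatch (re78) (treat age educ black hispanic married nodegree re74 re75, logit), ///
    atet nneighbor(1)
teffects psmatch (re78) (treat age educ black hispanic married nodegree re74 re75, logit), ///
    atet nneighbor(3) caliper(0.05)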
King and Nielsen (2019)

There is a King and Nielsen (2019) critique of these methods


that is popularly known but not popularly studied
King and Nielsen (2019) is not a critique of the propensity
score, because it does not apply to stratification, regression
adjustment, or inverse probability weighting
It only applies to nearest neighbor and is related to forced
balance through trimming and a myriad of other common
choices made by the researcher
"[The] more balanced the data, or the more balanced it becomes by [trimming] some of the observations through matching, the more likely propensity score matching will degrade inferences." – King and Nielsen (2019)
Examples of propensity score matching

Workhorse example of propensity score matching is the Job


Trainings Program (NSW)
First studied by LaLonde (1986) evaluating multiple
econometric models for program evaluation
All the standard estimators failed to estimate the known ATE
when replacing experimental controls with non-experimental
controls – even difference-in-differences
Dehejia and Wahba (1999; 2002) use LaLonde’s data with
propensity score matching and found better results
Critiques by Petra Todd, Jeff Smith and others followed which
I won’t review here for sake of time
Description of NSW Job Trainings Program

The National Supported Work Demonstration (NSW), operated by


Manpower Demonstration Research Corp in the mid-1970s:
was a temporary employment program designed to help
disadvantaged workers lacking basic job skills move into the
labor market by giving them work experience and counseling in
a sheltered environment
was also unique in that it randomly assigned qualified
applicants to training positions:
Treatment group: received all the benefits of NSW program
Control group: left to fend for themselves
admitted AFDC females, ex-drug addicts, ex-criminal
offenders, and high school dropouts of both sexes
NSW Program

Treatment group members were:


guaranteed a job for 9-18 months depending on the target
group and site
divided into crews of 3-5 participants who worked together and
met frequently with an NSW counselor to discuss grievances
and performance
paid for their work
Control group members were randomized too, so in expectation they were the same
Note: the randomization balanced observables and
unobservables across the two arms, thus enabling the
estimation of an ATE for the people who self-selected into the
program
NSW Program

Other details about the NSW program:


Wages: NSW offered the trainees lower wage rates than they
would’ve received on a regular job, but allowed their earnings
to increase for satisfactory performance and attendance
Post-treatment: after their term expired, they were forced to
find regular employment
Job types: varied within sites – gas station attendant, working
at a printer shop – and males and females were frequently
performing different kinds of work
NSW Data

NSW data collection:


MDRC collected earnings and demographic information from
both treatment and control at baseline and every 9 months
thereafter
Conducted up to 4 post-baseline interviews
Different sample sizes from study to study can be confusing, but have simple explanations
NSW Data

Estimation:
NSW was a randomized job trainings program; therefore
estimating the average treatment effect is straightforward:
\frac{1}{N_t} \sum_{D_i=1} Y_i - \frac{1}{N_c} \sum_{D_i=0} Y_i \approx E[Y^1 - Y^0]

in large samples assuming treatment selection is independent of potential outcomes (randomization) – i.e., (Y0, Y1) ⊥⊥ D.
NSW worked: Treatment group participants’ real earnings
post-treatment (1978) was positive and economically
meaningful – ≈ $900 (LaLonde 1986) to $1,800 (Dehejia and
Wahba 2002) depending on the sample used
LaLonde, Robert J. (1986). “Evaluating the Econometric
Evaluations of Training Programs with Experimental Data”.
American Economic Review.

LaLonde’s study was not an evaluation of the NSW program, as


that had been done, but rather an evaluation of econometric
models done by:
replacing the experimental NSW control group with
non-experimental control group drawn from two nationally
representative survey datasets: Current Population Survey
(CPS) and Panel Study of Income Dynamics (PSID)
estimating the average effect using non-experimental workers
as controls for the NSW trainees
comparing his non-experimental estimates to the experimental
estimates of $900
LaLonde (1986)

LaLonde’s conclusion: available econometric approaches were


biased and inconsistent
His estimates were way off and usually the wrong sign
Conclusion was influential in policy circles and led to greater
push for more experimental evaluations
Imbalanced covariates for experimental and non-experimental
samples

                          CPS                      NSW
                  All            Controls      Trainees
                                 Nc = 15,992   Nt = 297
covariate         mean   (s.d.)  mean          mean       t-stat   diff
Black             0.09   0.28    0.07          0.80       47.04    -0.73
Hispanic          0.07   0.26    0.07          0.94        1.47    -0.02
Age               33.07  11.04   33.2          24.63      13.37     8.6
Married           0.70   0.46    0.71          0.17       20.54     0.54
No degree         0.30   0.46    0.30          0.73       16.27    -0.43
Education         12.0   2.86    12.03         10.38       9.85     1.65
1975 Earnings     13.51  9.31    13.65         3.1        19.63    10.6
1975 Unemp        0.11   0.32    0.11          0.37       14.29    -0.26
Dehejia and Wahba (1999)

Dehejia and Wahba (DW) update LaLonde’s original study


using propensity score matching
1 Dehejia, Rajeev H. and Sadek Wahba (1999). “Causal Effects
in Nonexperimental Studies: Reevaluating the Evaluation of
Training Programs”. Journal of the American Statistical
Association, vol. 94(448): 1053-1062
Can propensity score matching improve over the estimators
that LaLonde examined?
Proposition 2

X ⊥⊥ D | p(X)

Conditional on the propensity score, the covariates are


independent of the treatment, suggesting that the distribution
of covariate values should be the same for both treatment and
control groups
This can be checked as we have data on all three once we’ve
estimated the propensity score
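One way to operationalize that check in Stata is sketched below (variable names illustrative; the exact post-estimation commands should be confirmed in -help tebalance-):

teffects psmatch (re78) (treat age educ black hispanic married nodegree re74 re75, logit), atet
tebalance summarize        // standardized differences and variance ratios, raw vs. matched
tebalance density age      // compare a covariate's distribution across groups after matching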
Trimming the data

Common terms are “trimming” or “pruning”


Drop units which do not overlap in terms of estimated
propensity score
Sometimes as a rule of thumb, just keep units on the
[0.1,0.9] interval
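A sketch of the rule of thumb with the earlier pscore in memory (names illustrative):

count if !inrange(pscore, 0.1, 0.9)     // how many units are off-support under the [0.1, 0.9] rule?
keep if inrange(pscore, 0.1, 0.9)       // or generate a flag instead of dropping, to keep the full sample around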
[Figures and tables omitted: common support, overlap of the estimated propensity scores, results, covariate balance]
Estimation in Stata

I have written up code that will implement IPW on the DW


data
It’s nonparametric, so it doesn’t use any packages
But you are welcome to try some packages, particularly the
-teffects- command
Kernel matching

Alternatively we can perform propensity score matching with a


kernel-based method.
Notice on the next slide that the estimate of the ATT switches
sign relative to that produced by the NN matching algorithm
Stata syntax

psmatch2 treated, pscore(score) outcome(re78) kernel k(normal) bw(0.01)
pstest2 age black hispanic married educ nodegree re78, sum graph
Matching vs. Propensity score
Table 2. Experimental and nonexperimental estimates for the NSW data

M=1 M=4 M = 16 M = 64 M = 2490


Est. (SE) Est. (SE) Est. (SE) Est. (SE) Est. (SE)
Panel A:
Experimental estimates
Covariate matching 1.22 (0.84) 1.99 (0.74) 1.75 (0.74) 2.20 (0.70) 1.79 (0.67)
Bias-adjusted cov matching 1.16 (0.84) 1.84 (0.74) 1.54 (0.75) 1.74 (0.71) 1.72 (0.68)
Pscore matching 1.43 (0.81) 1.95 (0.69) 1.85 (0.69) 1.85 (0.68) 1.79 (0.67)
Bias-adjusted pscore matching 1.22 (0.81) 1.89 (0.71) 1.78 (0.70) 1.67 (0.69) 1.72 (0.68)
Regression estimates
Mean difference 1.79 (0.67)
Linear 1.72 (0.68)
Quadratic 2.27 (0.80)
Weighting on pscore 1.79 (0.67)
Weighting and linear regression 1.69 (0.66)
Panel B:
Nonexperimental estimates
Simple matching 2.07 (1.13) 1.62 (0.91) 0.47 (0.85) −0.11 (0.75) −15.20 (0.61)
Bias-adjusted matching 2.42 (1.13) 2.51 (0.90) 2.48 (0.83) 2.26 (0.71) 0.84 (0.63)
Pscore matching 2.32 (1.21) 2.06 (1.01) 0.79 (1.25) −0.18 (0.92) −1.55 (0.80)
Bias-adjusted pscore matching 3.10 (1.21) 2.61 (1.03) 2.37 (1.28) 2.32 (0.94) 2.00 (0.84)
Regression estimates
Mean difference −15.20 (0.66)
Linear 0.84 (0.88)
Quadratic 3.26 (1.04)
Weighting on pscore 1.77 (0.67)
Weighting and linear regression 1.65 (0.66)
NOTE: The outcome is earnings in 1978 in thousands of dollars.
Subsequent studies

Heckman et al. (1996, 1998) used experimental data from the


US National Job Training Partnership Act (JTPA)
They conclude that in order for matching estimators to have
low bias, it is important that the data include a rich set of
variables related to program participation and labor market
outcomes, that the nonexperimental comparison group be
drawn from the same local labor markets as the participants
and the dependent variable (typically earnings) be measured in
the same way for participants and nonparticipants
All three of these conditions fail to hold in DW (1999, 2002)
according to Smith and Todd (2005)
Smith and Todd

Difference-in-differences with propensity scores tended to work


well in Smith and Todd (2005)
But hard to make this a rule, because it’s hard to know ex ante
if we’ve specified the propensity score correctly (i.e., have CIA)
It is vital that you know your data if you're going to use these methods, which means understanding at a deep level the way in which selection (i.e., treatment assignment) works in your data
Beating a dead horse

The propensity score can make groups comparable but only on


the variables used to estimate the propensity score in the first
place. There is NO guarantee you are balancing on
unobserved covariates.
If you know that there are important unobservable variables,
you may need another tool.
Remember: randomization ensures that both observable and unobservable variables are balanced
Coarsened exact matching

There are two kinds of matching as we’ve said


1 Exact matching matches a treated unit to all of the control
units with the same covariate value. Sometimes this is
impossible (e.g., continuous covariate).
2 Approximate matching specifies a metric to find control units
that are close to the treated unit. Requires a distance metric,
such as Euclidean, Mahalanobis, or the propensity score. All of
which can be implemented in Stata’s teffects.
Iacus, King and Porro (2011) propose another version of matching they call coarsened exact matching (CEM). Some big picture ideas follow.

Checking imbalance

Iacus, King and Porro (2008) say that in practice approximate


matching requires setting the matching solution beforehand,
then checking for imbalance after.
Start over, repeat, until the user is exhausted by checking for
imbalance.
CEM Algorithm

1 Begin with covariates X . Make a copy called X ∗


2 Coarsen X ∗ according to user-defined cutpoints or CEM’s
automatic binning algorithm
Schooling → less than high school, high school, some college,
college, post college
3 Create one stratum per unique observation of X ∗ and place
each observation in a stratum
4 Assign these strata to the original data, X , and drop any
observation whose stratum doesn’t contain at least one treated
and control unit
You then add weights for stratum size and analyze without
matching.
Tradeoffs

Larger bins mean more coarsening. This results in fewer strata.


Fewer strata result in more diverse observations within the
same strata and thus higher imbalance
CEM prunes both treatment and control group units, which
changes the parameter of interest. Be transparent about this
as you’re not estimating the ATE or the ATT when you start
pruning
Benefits

The key benefit of CEM is that it is in a class of matching


methods called monotonic imbalance bounding
MIB methods bound the maximum imbalance in some feature
of the empirical distributions by an ex ante decision by the user
In CEM, this ex ante choice is the coarsening decision
By choosing the coarsening beforehand, users can control the
amount of imbalance in the matching solution
It’s also wicked fast.
Imbalance

There are several ways of measuring imbalance, but here we focus on the L1(f, g) measure, which is

L_1(f, g) = \frac{1}{2} \sum_{l_1 \ldots l_k} | f_{l_1 \ldots l_k} - g_{l_1 \ldots l_k} |

where f and g record the relative frequencies for the treatment and control group units.

Perfect global balance is indicated by L1 = 0. Larger values indicate larger imbalance between the groups, with a maximum of L1 = 1.
Stata

Download cem from Stata: ssc install cem, replace


You will automatically compute the global imbalance measure,
as well as several unidimensional measures of imbalance, when
using cem
I got a L1 = 0.55. What does it mean?
By itself, it’s meaningless. It’s a reference point between
matching solutions.
Once we have a matching solution, we will compare its L1 to
0.55 and gauge the increase in balance due to the matching
solution from that difference.
Thus L1 works for imbalance as R 2 works for model fit: the
absolute values mean less than comparisons between matching
solutions.
More Stata

Because cem bounds the imbalance ex ante, the most


important information in the Stata output is the number of
observations matched.
You can also choose the coarsening as opposed to relying on
the algorithm’s automated binning.
Once you have estimated the strata, you regress the outcome
onto the treatment and then weight the regression by
cem_weights. For instance,

regress re78 treat [iweight=cem_weights]

For more on this, see Blackwell, et al. Stata journal article


from 2009.
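A sketch of the full cem workflow (install with ssc install cem; the variables and cutpoints are illustrative – see the Blackwell et al. 2009 article for the exact syntax):

cem age (17.5 25.5 35.5 45.5 55.5) educ black hispanic married nodegree, treatment(treat)
* cem reports the multivariate L1 imbalance and creates cem_weights for the matched solution
regress re78 treat [iweight=cem_weights]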
The credibility revolution won

People like Heckman, Rubin, Ashenfelter, LaLonde, Angrist,


Krueger, Card, Imbens, Athey, Duflo, Abadie and many others
built on their backs a movement of sorts
The movement tried to shift economists and other social
scientists away from naive empirical methods that couldn’t
hope to estimate behavioral causal parameters towards things
that might
It’s in your best interest to study empirical methods, papers
that use them, how they communicate their findings and the
econometricians so that you can be ready when the
opportunity arises

Make the stone stoney again

A man walks up the mountain barefoot till he can't feel his feet again – Viktor Shklovsky said art is there to make "the stone feel like a stone again". I want research to feel like research again for you
Research is a quest for honest answers to good faith questions
that people care about
Most of all, research is truly fun for those who find such things
fun. It’s a form of self-expression and creativity for many of us
And it is fun to understand the answers you get and why those
answers are reliable which requires checklists, workflows,
clearly defined assumptions and proper tools for the job
It is not fun to get a bad answer to a poorly defined question
that you’re not confident about
A Priori Knowledge is Necessary for Identification

Think hard about these questions:


Can you write down a DAG or otherwise model the data
generating process?
What parameter do you think is interesting (e.g., ATT, LATE)
What are the assumptions needed for identifying that
parameter?
Pick estimators based on these questions, not the other way
around
Less so, pick data based on these questions, not the other way
around (usually)
What’s a good research design?

A good research design is one you are excited to tell people


about – that’s basically what characterizes all research designs,
whether propensity score matching or regression discontinuity
designs
Don’t get enamored by statistical modeling that obscures the
identification problem from plain sight.
Always understand what assumptions you must make, be clear
which parameters you are and are not identifying
Good research designs help you believe and not be afraid of
your answers
What’s the reason for your work?

Causal identification is a necessary but not a sufficient


condition for publishing well these days because the credibility
revolution won
Must also be an “interesting” question - admittedly subjective
If it must be interesting, then the best thing you can do for
yourself is choose a topic that you care about
Publishing is simply too difficult to be working on something
you find trivial
Free disposal advice

My colleague said “a good study and a bad study take the


same amount of time” – don’t work on stuff just to work on it
Finding projects with upside, in a set of potential projects, is a
good idea
The sooner you can cut bait on a bad project and move on,
the better – beware the sunk cost fallacy
For my personality, questions are practically existential quests
for the meaning of life, but not everyone needs extreme
incentives
So know yourself, work to your strengths, figure out things
that downplay your weaknesses, believe in yourself, find your
sponsors and mentors, seek help
