Causal Inference and Research Design Scott Cunningham (Baylor)
Figure: xkcd
Where to find this material
Readings:
We will also discuss a number of papers in each lecture, each
of which you will need to learn inside and out.
Lecture slides and reading lists are available
Key literature is contained in the shared dropbox folder which
I’ll distribute beforehand
About me
Once upon a time there was a boy who wrote a job market
paper using the NLSY97.
This boy presented the findings a half dozen times, spoke to
the media a few times, got 17 interviews at the ASSA, 7
flyouts, and an offer from Baylor
He submitted the job market paper to the Journal of Human
Resources, a top field journal in labor, and received a “revise
and resubmit” request from the editor (woo hoo!)
The horror!
“Happy families are all alike; every unhappy family is unhappy in its
own way.” - Leo Tolstoy, Anna Karenina
“Good empirical work is all alike; every bad empirical work is bad in
its own way.” - Scott Cunningham, This slide
Wikipedia definition:
“A workflow consists of an orchestrated and repeatable
pattern of activity, enabled by the systematic organization
of resources into processes that transform materials,
provide services, or process information.”
Dictionary definition:
“the sequence of industrial, administrative, or other
processes through which a piece of work passes from
initiation to completion.”
Empirical workflow
Say you have 51 state units (50 states plus DC) and 10 years
51 × 10 = 510 observations
If you do not have 510 observations, then you have an
unbalanced panel; if you have 510 observations you have a
balanced panel
Check the patterns using xtdescribe and simple counting
tricks
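A minimal sketch of that check in Stata (the variable names state and year are assumptions):

* assumes a panel with id variable state and time variable year
xtset state year
xtdescribe
* count observations per state: a balanced 10-year panel shows 10 everywhere
bysort state: gen n_obs = _N
tab n_obs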
Merge
“Be conservative in what you do; be liberal in what you accept from
others.” - Jon Postel
I’ve put several decks of slides and helpful articles for you in the
dropbox folder
Jesse Shapiro’s “How to Present an Applied Micro Paper”
Gentzkow and Shapiro’s coding practices manual
Rachael Meager on presenting as an academic
Ljubica “LJ” Ristovska’s language agnostic guide to
programming for economists
Grant McDermott on Version Control using Github
https://raw.githack.com/uo-ec607/lectures/master/02-git/02-Git.html#1
Data Visualization
Simple linear regression model
y = β0 + β1 x + u (1)
E (u) = 0 (2)
where E (·) is the expected value operator.
The intercept
The presence of β0 in
y = β0 + β1 x + u (3)
E (y |x) = β0 + β1 x (7)
which shows the population regression function is a linear
function of x.
The straight line in the graph on the next page is what
Wooldridge calls the population regression function, and
what Angrist and Pischke call the conditional expectation
function
E (y |x) = β0 + β1 x
The conditional distributions of y at three different values of x are superimposed. For a given value of x, we see a range of y values: remember, y = β0 + β1 x + u, and u has a distribution in the population.
Deriving the Ordinary Least Squares Estimates
yi = β0 + β1 xi + ui (8)
E (u) = 0
Cov (x, u) = 0
E (y − β0 − β1 x) = 0
E [x(y − β0 − β1 x)] = 0
n^{-1} Σ_{i=1}^n (yi − β̂0 − β̂1 xi) = 0
n^{-1} Σ_{i=1}^n xi (yi − β̂0 − β̂1 xi) = 0
where β̂0 and β̂1 are the estimates from the data.
These are two linear equations in the two unknowns β̂0 and β̂1 .
Pass the summation operator through the first equation:
n^{-1} Σ_{i=1}^n (yi − β̂0 − β̂1 xi) (9)
= n^{-1} Σ_{i=1}^n yi − n^{-1} Σ_{i=1}^n β̂0 − n^{-1} Σ_{i=1}^n β̂1 xi (10)
= n^{-1} Σ_{i=1}^n yi − β̂0 − β̂1 (n^{-1} Σ_{i=1}^n xi) (11)
implies
β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Sample Covariance(xi, yi) / Sample Variance(xi) (20)
OLS
For any candidates β̂0 and β̂1, define a fitted value ŷi = β̂0 + β̂1 xi and a residual for each i as
ûi = yi − ŷi = yi − β̂0 − β̂1 xi
Σ_{i=1}^n ûi² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)²
Notice the logic here: this means the OLS residuals always add up to zero, by construction:
Σ_{i=1}^n ûi = 0 (23)
and so the sample average of the yi equals the sample average of the ŷi.
Second moment
Because the ŷi are linear functions of the xi , the fitted values and
residuals are uncorrelated, too:
n^{-1} Σ_{i=1}^n ŷi ûi = 0 (27)
Averages
E (β̂) = β (29)
Don’t forget why we’re here
y = β0 + β1 x + u (30)
where β0 and β1 are the (unknown) population parameters.
We view x and u as outcomes of random variables; thus, y is
random.
Stating this assumption formally shows that our goal is to
estimate β0 and β1 .
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, {(xi , yi ) : i = 1, ..., n},
following the population model.
We know how to use this data to estimate β0 and β1 by OLS.
Because each i is a draw from the population, we can write,
for each i,
yi = β0 + β1 xi + ui (31)
Notice that ui here is the unobserved error for observation i. It
is not the residual that we compute from the data!
Assumption SLR.3 (Sample Variation in the Explanatory
Variable)
The sample outcomes on xi are not all the same value.
This is the same as saying the sample variance of
{xi : i = 1, ..., n} is not zero.
In practice, this is no assumption at all. If the xi are all the
same value, we cannot learn how x affects y in the population.
Assumption SLR.4 (Zero Conditional Mean)
In the population, the error term has zero mean given any value of the explanatory variable:
E(u|x) = 0 (32)
Under SLR.1–SLR.4, OLS is unbiased:
E(β̂1) = β1 (33)
where the expected value means averaging across random samples.
Step 1: Write down a formula for β̂1 . It is convenient to use
β̂1 = Σ_{i=1}^n (xi − x̄) yi / Σ_{i=1}^n (xi − x̄)² (34)
which is one of several equivalent forms.
It is convenient to define SSTx = Σ_{i=1}^n (xi − x̄)², the total variation in the xi, and write
β̂1 = Σ_{i=1}^n (xi − x̄) yi / SSTx (35)
Remember, SSTx is just some positive number. The existence of
β̂1 is guaranteed by SLR.3.
β̂1 = (β1 SSTx + Σ_{i=1}^n (xi − x̄) ui) / SSTx = β1 + Σ_{i=1}^n (xi − x̄) ui / SSTx (40)
Note how the last piece is the slope coefficient from the OLS
regression of ui on xi , i = 1, ..., n. We cannot do this regression
because the ui are not observed.
Now define
wi = (xi − x̄) / SSTx (41)
so we have
β̂1 = β1 + Σ_{i=1}^n wi ui (42)
Taking expectations conditional on the xi, and using E(ui|x) = 0:
E(β̂1) = β1 + Σ_{i=1}^n wi E(ui|x) = β1 (45)
E (u 2 |x) = σ 2 = E (u 2 ) (48)
Under the population Assumptions SLR.1 (y = β0 + β1 x + u), SLR.4 (E(u|x) = 0), and SLR.5 (Var(u|x) = σ²),
E (y |x) = β0 + β1 x
Var (y |x) = σ 2
Var(β̂1|x) = σ² / Σ_{i=1}^n (xi − x̄)² = σ² / SSTx
Var(β̂0|x) = σ² (n^{-1} Σ_{i=1}^n xi²) / SSTx
(conditional on the outcomes {x1 , x2 , ..., xn }).
To show this, write, as before,
β̂1 = β1 + Σ_{i=1}^n wi ui (49)
Var(β̂1|x) = Var(Σ_{i=1}^n wi ui | x)
= Σ_{i=1}^n Var(wi ui|x) = Σ_{i=1}^n wi² Var(ui|x)
= Σ_{i=1}^n wi² σ² = σ² Σ_{i=1}^n wi²
Σ_{i=1}^n wi² = Σ_{i=1}^n (xi − x̄)² / (SSTx)² = SSTx / (SSTx)² = 1 / SSTx
We have shown
Var(β̂1) = σ² / SSTx (50)
Usually we are interested in β1 . We can easily study the two factors
that affect its variance.
Var(β̂1) = σ² / SSTx (51)
Given
Var(β̂1) = σ² / SSTx (55)
we can compute SSTx from {xi : i = 1, ..., n}. But we need to estimate σ².
Recall that
σ 2 = E (u 2 ). (56)
Therefore, if we could observe a sample on the errors,
{ui : i = 1, 2, ..., n}, an unbiased estimator of σ 2 would be the
sample average
n^{-1} Σ_{i=1}^n ui² (57)
ui = yi − β0 − β1 xi
ûi = yi − β̂0 − β̂1 xi
ûi can be computed from the data because it depends on the
estimators β̂0 and β̂1 . Except by fluke,
ûi ≠ ui (58)
for any i.
Replacing the errors with the residuals gives σ̃² = n^{-1} Σ_{i=1}^n ûi². It is a true estimator and easily computed from the data after OLS.
As it turns out, this estimator is slightly biased: its expected value is a little less than σ².
The estimator does not account for the two restrictions on the
residuals, used to obtain β̂0 and β̂1 :
Σ_{i=1}^n ûi = 0
Σ_{i=1}^n xi ûi = 0
The unbiased estimator adjusts the degrees of freedom accordingly:
σ̂² = SSR / (n − 2) (60)
THEOREM: Unbiased Estimator of σ 2
Under Assumptions SLR.1 to SLR.5,
E (σ̂ 2 ) = σ 2 (61)
In regression output, it is
σ̂ = √σ̂² = √(SSR / (n − 2)) (62)
that is usually reported. This is an estimator of sd(u), the standard deviation of the population error. And SSR = Σ_{i=1}^n ûi².
σ̂ is called the standard error of the regression, which
means it is an estimate of the standard deviation of the error
in the regression. Stata calls it the root mean squared error.
Given σ̂, we can now estimate sd(β̂1 ) and sd(β̂0 ). The
estimates of these are called the standard errors of the β̂j .
We just plug σ̂ in for σ:
se(β̂1) = σ̂ / √SSTx (63)
A valid estimator of Var(β̂1) for heteroskedasticity of any form (including homoskedasticity) is
V̂ar(β̂1) = Σ_{i=1}^n (xi − x̄)² ûi² / SSTx²
When σi² = σ² for all i, this formula reduces to the usual form, σ² / SSTx.
yg = xg β + ug
β̂ = (X′X)^{-1} X′y
E (yi |xi = x)
Helpful result: Law of Iterated Expectations
E(Y) = E(E[Y|X])
We use LIE for a lot of stuff, and it’s actually quite intuitive. You
may even know it and not know you know it!
Simple example of LIE
E[IQ] = E(E[IQ|Sex])
= Σ_{Sex} Pr(Sex) · E[IQ|Sex]
= Pr(Male) · E[IQ|Male] + Pr(Female) · E[IQ|Female]
E[IQ] = 120
E[IQ | Male] = 115; E[IQ | Female] = 125
LIE: E ( E [ IQ | Sex ] ) = (0.5)×115 + (0.5)×125 = 120
Proof.
For the continuous case:
E[E(Y|X)] = ∫ E(Y|X = u) g_x(u) du
= ∫ [∫ t f_{y|x}(t|X = u) dt] g_x(u) du
= ∫∫ t f_{y|x}(t|X = u) g_x(u) du dt
= ∫ t [∫ f_{y|x}(t|X = u) g_x(u) du] dt
= ∫ t [∫ f_{x,y}(u, t) du] dt
= ∫ t g_y(t) dt
= E(Y)
Proof.
For the discrete case,
E(E[Y|X]) = Σ_x E[Y|X = x] p(x)
= Σ_x (Σ_y y p(y|x)) p(x)
= Σ_x Σ_y y p(x, y)
= Σ_y y Σ_x p(x, y)
= Σ_y y p(y)
= E(Y)
Property 1: CEF Decomposition Property
yi = E (yi |xi ) + ui
where
1 ui is mean independent of xi ; that is
E (ui |xi ) = 0
and x̃1i = x1i − x̂1i being the residual from the auxiliary regression. The parameter β1 can be rewritten as β1 = Cov(yi, x̃1i) / Var(x̃1i).
Since fi = x̃ki is orthogonal to every regressor other than xki, all of those terms vanish:
β1 E[fi x1i] = · · · = β_{k−1} E[fi x_{k−1,i}] = β_{k+1} E[fi x_{k+1,i}] = · · · = βK E[fi x_{Ki}] = 0
Regression Anatomy Proof (cont.)
3 Consider now the term E[ei fi]. This can be written as:
E[ei fi] = E[ei x̃ki]
= E[ei (xki − x̂ki)]
= E[ei xki] − E[ei x̂ki]
Once again, since ei is uncorrelated with any independent variable, the expected value of both terms is equal to zero. Then, it follows that E[ei fi] = 0.
Regression Anatomy Proof (cont.)
4 The only remaining term is E[βk xki fi], which equals E[βk xki x̃ki] since fi = x̃ki. The term xki can be substituted using a rewriting of the auxiliary regression model, such that
xki = E[xki|X−k] + x̃ki
This gives
E[βk xki x̃ki] = βk E[(E[xki|X−k] + x̃ki) x̃ki] = βk E[x̃ki²] = βk Var(x̃ki),
which follows directly from the orthogonality between E[xki|X−k] and x̃ki. From the previous derivations we finally get
βk = Cov(yi, x̃ki) / Var(x̃ki)
Yi = β0 + β1 Si + β2 Ai + ui
Yi = log of earnings
Si = schooling measured in years
Ai = individual ability
When Cov (A, S) > 0 then ability and schooling are correlated.
When ability is unobserved, then not even multiple regression
will identify the causal effect of schooling on wages.
Here we see one of the main justifications for this workshop –
what will we do when the treatment variable is endogenous?
We will need an identification strategy to recover the causal
effect
Introduction to the Selection Problem
Aliens come and orbit earth, see sick people in hospitals and
conclude “these ‘hospitals’ are hurting people”
Motivated by anger and compassion, they kill the doctors to
save the patients
Sounds stupid, but earthlings do this too - all the time
Causal question:
Correlation question:
ρ(D, Y) = Cov(D, Y) / (√Var(D) · √Var(Y))
Every morning the rooster crows and then the sun rises
Did the rooster cause the sun to rise? Or did the sun cause
the rooster to crow?
Post hoc ergo propter hoc: “after this, therefore, because of
this”
#3: No correlation does not mean no causality!
Yi = Di Yi1 + (1 − Di )Yi0
Yi = Yi1 if Di = 1, and Yi = Yi0 if Di = 0
So what’s the problem?
E [δ|D = 1] = E [Y 1 − Y 0 |D = 1]
= E [Y 1 |D = 1] − E [Y 0 |D = 1]
E [δ|D = 0] = E [Y 1 − Y 0 |D = 0]
= E [Y 1 |D = 0] − E [Y 0 |D = 0]
Causality and comparisons
Comparisons are at the heart of the causal problem, but not all
comparisons are equal because of the selection problem
Does the hospital make me sick? Or am I sick, and that’s why
I went to the hospital?
Why can’t I just compare my health (Scott) with someone
who isn’t in the hospital (Nathan)? Aren’t we supposed to
have a “control group”?
What are we actually measuring if we compare average health
outcomes for the hospitalized with the non-hospitalized?
Definition 7: Simple difference in mean outcomes (SDO)
A simple difference in mean outcomes (SDO) is the difference
between the population average outcome for the treatment and
control groups, and can be approximated by the sample averages:
SDO = E[Y1|D = 1] − E[Y0|D = 0]
≈ EN[Y|D = 1] − EN[Y|D = 0]
in large samples.
SDO vs. ATE
Notice the subtle difference between the SDO and ATE notation:
E[Y|D = 1] − E[Y|D = 0] ≠ E[Y1] − E[Y0]
The SDO decomposes as
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ (E[Y0|D = 1] − E[Y0|D = 0])
+ (1 − π)(ATT − ATU)
ATE = E [Y 1 ] − E [Y 0 ]
= {πE [Y 1 |D = 1] + (1 − π)E [Y 1 |D = 0]}
−{πE [Y 0 |D = 1] + (1 − π)E [Y 0 |D = 0]}
E [Y 1 |D = 1] = a
E [Y 1 |D = 0] = b
E [Y 0 |D = 1] = c
E [Y 0 |D = 0] = d
ATE = e
Rewrite ATE
e = {πa + (1 − π)b}
−{πc + (1 − π)d}
Move SDO terms to the LHS:
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ (E[Y0|D = 1] − E[Y0|D = 0])
+ (1 − π)({E[Y1|D = 1] − E[Y0|D = 1]} − {E[Y1|D = 0] − E[Y0|D = 0]})
which is
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ (E[Y0|D = 1] − E[Y0|D = 0])
+ (1 − π)(ATT − ATU)
Decomposition of difference in means:
SDO = ATE
+ E[Y0|D = 1] − E[Y0|D = 0] (selection bias)
+ (1 − π)(ATT − ATU) (heterogeneous treatment effect bias)
where EN[Y|D = 1] → E[Y1|D = 1], EN[Y|D = 0] → E[Y0|D = 0], and (1 − π) is the share of the population in the control group.
Independence assumption
Treatment is independent of potential outcomes
(Y0, Y1) ⊥⊥ D
In words: Random assignment means that the treatment has been assigned to units
independent of their potential outcomes. Thus, mean potential outcomes for the
treatment group and control group are the same for a given state of the world
E [Y 0 |D = 1] = E [Y 0 |D = 0]
E [Y 1 |D = 1] = E [Y 1 |D = 0]
Return to the decomposition:
SDO = ATE
+ E[Y0|D = 1] − E[Y0|D = 0] (selection bias)
+ (1 − π)(ATT − ATU) (heterogeneous treatment effect bias)
Under independence, the selection bias term vanishes:
E[Y0|D = 1] − E[Y0|D = 0] = 0
Random Assignment Solves the Heterogeneous Treatment Effects
How does randomization affect the heterogeneous treatment effects bias from the third line? Rewrite the definitions for ATT and ATU:
ATT = E[Y1|D = 1] − E[Y0|D = 1]
ATU = E[Y1|D = 0] − E[Y0|D = 0]
Under independence, E[Y1|D = 1] = E[Y1|D = 0] and E[Y0|D = 1] = E[Y0|D = 0], so ATT = ATU and the heterogeneity bias is also zero.
SUTVA
Yi = α + δDi + γXi + ηi
                        (1)       (2)       (3)       (4)       (5)
Any incentive         0.431***  0.309***  0.219***  0.220***  0.219***
                      (0.023)   (0.026)   (0.029)   (0.029)   (0.029)
Amount of incentive             0.091***  0.274***  0.274***  0.273***
                                (0.012)   (0.036)   (0.035)   (0.036)
Amount of incentive²                     −0.063*** −0.063*** −0.063***
                                          (0.011)   (0.011)   (0.011)
HIV                  −0.055*   −0.052    −0.05     −0.058*   −0.055*
                      (0.031)   (0.032)   (0.032)   (0.031)   (0.031)
Distance (km)                                      −0.076***
                                                    (0.027)
Distance²                                                     0.010**
                                                              (0.005)
Controls                Yes       Yes       Yes       Yes       Yes
Sample size           2,812     2,812     2,812     2,812     2,812
Average attendance     0.69      0.69      0.69      0.69      0.69
Figure: Visual representation of cash transfers on learning HIV test
results.
Results
For those who were HIV+ and got their test results, they were 42% more likely to buy condoms (but the effect shrinks and becomes insignificant at conventional levels with IV).
The number of condoms bought was very small: HIV+ respondents who learned their status bought only about 2 more condoms.
Randomization inference and causal inference
Original claim: Given a cup of tea with milk, Bristol claims she
can discriminate the order in which the milk and tea were
added to the cup
Experiment: To test her claim, Fisher prepares 8 cups of tea –
4 milk then tea and 4 tea then milk – and presents each
cup to Bristol for a taste test
Question: How many cups must Bristol correctly identify to
convince us of her unusual ability to identify the order in which
the milk was poured?
Fisher’s sharp null: Assume she can’t discriminate. Then
what’s the likelihood that random chance was responsible for
her answers?
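A quick worked count under the sharp null: there are 8!/(4!·4!) = 70 ways to choose which 4 of the 8 cups are labeled "milk first," so identifying all four correctly by pure chance has probability 1/70 ≈ 0.014.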
Choosing subsets
Name D Y Y0 Y1
Andy 1 10 . 10
Ben 1 5 . 5
Chad 1 16 . 16
Daniel 1 3 . 3
Edith 0 5 5 .
Frank 0 7 7 .
George 0 8 8 .
Hank 0 10 10 .
Name D Y Y0 Y1
Andy 1 10 10 10
Ben 1 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 0 7 7 7
George 0 8 8 8
Hank 0 10 10 10
Test Statistic
A test statistic T (D, Y ) is a scalar quantity calculated from the
treatment assignments D and the observed outcomes Y
Name D Y Y0 Y1 δi
Andy 1 10 10 10 0
Ben 1 5 5 5 0
Chad 1 16 16 16 0
Daniel 1 3 3 3 0
Edith 0 5 5 5 0
Frank 0 7 7 7 0
George 0 8 8 8 0
Hank 0 10 10 10 0
We'll start with the simple difference in means test statistic, T(D, Y): δSDO = 34/4 − 30/4 = 1
Steps 3-5: Null randomization distribution
Name D̃2 Y Y0 Y1
Andy 1 10 10 10
Ben 0 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 1 7 7 7
George 0 8 8 8
Hank 0 10 10 10
Name D̃3 Y Y0 Y1
Andy 1 10 10 10
Ben 0 5 5 5
Chad 1 16 16 16
Daniel 1 3 3 3
Edith 0 5 5 5
Frank 0 7 7 7
George 1 8 8 8
Hank 0 10 10 10
Assignment  D1 D2 D3 D4 D5 D6 D7 D8  |Ti|
True D       1  1  1  1  0  0  0  0   1
D̃2           1  0  1  1  0  1  0  0   2
D̃3           1  0  1  1  0  0  1  0   2.5
...
Step 2: Other test statistics
Name    D  Y   Y0  Y1  Rank  R̃i
Andy    1  10  10  10  6.5    2
Ben     1  5   5   5   2.5   −2
Chad    1  16  16  16  8      3.5
Daniel  1  3   3   3   1     −3.5
Edith   0  5   5   5   2.5   −2
Frank   0  7   7   7   4     −0.5
George  0  8   8   8   5      0.5
Hank    0  10  10  10  6.5    2
(R̃i is the rank centered at the mean rank of 4.5.)
F̂C(Y) = (1/NC) Σ_{i: Di=0} 1(Yi ≤ Y)
F̂T(Y) = (1/NT) Σ_{i: Di=1} 1(Yi ≤ Y)
Figure: kernel densities (kdensity y) by treatment status, treatment vs. control (Kolmogorov-Smirnov test).
Figure: eCDFs of y by treatment status, treatment vs. control, and the KS test statistic.
A good test statistic is the one that best fits your data. Some test
statistics will have weird properties in the randomization as we’ll
see in synthetic control.
One-sided or two-sided?
H0: δi = 0 ∀i vs. H1: δi ≠ 0 for some i (two-sided)
H0: δi = 0 ∀i vs. H1: δi > 0 for some i (one-sided)
T*_diff = ȲT − ȲC
Small vs. Modest Sample Sizes are non-trivial
Let’s do this now with Thornton’s data. You can replicate that
using thorton_ri.do or thornton_ri.R
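A minimal sketch of randomization inference in Stata using the built-in permute prefix (the variable names y and d are assumptions; the do-files above are the canonical versions):

* permute the treatment indicator and re-estimate the test statistic each time
permute d _b[d], reps(1000) seed(1234): regress y d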
Thornton’s experiment
Yi = α + δDi + βXi + ε
Figure: DAG with nodes PE, I, B, D, and Y (shown twice, then the subgraph D → Y with B).
B is a parent of PE and D
PE and D are descendants of B
There is a direct (causal) path from D to Y
There is a mediated (causal) path from B to Y through D
There are four paths from PE to Y but none are direct, and
one is unlike the others
Colliders
Figure: the DAG with nodes PE, I, B, D, and Y again, and a second DAG with nodes D, Y, and X.
Blocked backdoor paths
Examples:
1 Conditioning on a noncollider blocks a path:
X → Z → Y (conditioning on Z blocks the path)
2 Conditioning on a collider opens a path (i.e., creates spurious correlations):
Z → X ← Y (conditioning on the collider X opens the path)
3 Not conditioning on a collider blocks a path:
Z → X ← Y (left alone, the collider X keeps the path closed)
Backdoor criterion
Conditioning on X satisfies the backdoor criterion with respect to (D, Y) if:
1 All backdoor paths are blocked by X
2 No element of X is a collider
In words: If X satisfies the backdoor criterion with respect to
(D, Y ), then controlling for or matching on X identifies the causal
effect of D on Y
What control strategy meets the backdoor criterion?
Figure: DAG with nodes X1, X2, D, and Y.
What are the necessary and sufficient set of controls which will
satisfy the backdoor criterion?
What if you have an unobservable?
Figure: DAG with nodes U, X2, D, and Y.
What are the necessary and sufficient set of controls which will
satisfy the backdoor criterion?
What about the unobserved variable, U?
Multiple strategies
Figure: two DAGs with nodes X1, X2, X3, D, and Y, where more than one control strategy satisfies the backdoor criterion.
Figure: the DAG with nodes PE, I, B, D, and Y revisited.
Collider bias
Figure: collider bias DAGs with nodes X1, X2, D, and Y, and with nodes X, U2, D, and Y.
Living in reality - he doesn’t love you
Talent → Movie Star ← Beauty (movie stardom is a collider)
Stata code
clear all
set seed 3444
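The rest of the simulation, as a sketch in the spirit of the Mixtape's movie-star exercise (the sample size and the 85th-percentile stardom cutoff are assumptions):

set obs 2500
generate beauty = rnormal()     // beauty and talent are independent
generate talent = rnormal()
generate score = beauty + talent
summarize score, detail
generate star = (score > r(p85))   // top 15% become movie stars
* conditioning on the collider induces a negative beauty-talent correlation
correlate beauty talent
correlate beauty talent if star == 1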
Figure: Top left: non-star sample scatter plot of beauty (vertical axis) and talent (horizontal axis). Top right: star sample scatter plot of beauty and talent. Bottom left: entire (stars and non-stars combined) sample scatter plot of beauty and talent.
Stata
Figure: DAGs for the police use-of-force example (nodes F, o, A, and y).
Fryer finds that blacks and Hispanics were more than 50% more likely to have an interaction with the police in NYC Stop and Frisk as well as the Police-Public Contact Survey
It survives extensive controls – magnitudes fall, but still very
large (21%)
Moves to admin data
Conditional on police interaction, no racial differences in
officer-related shootings
Fryer calls it one of the most surprising findings in his career
Lots of eyes on this study as a result of the counter intuitive
results; published in JPE
Knox, et al. (2020) claim his data is itself conditioned on a collider. What?
Figure: DAG with controls X, unobservable U, and suspicion.
Figure: probability of treatment against the running variable, −300 to 350 points around the cutoff (local averages).
Tell me what you think is happening
From Hoekstra (2009), "V. Results, A. Earnings Discontinuities at the Admission Cutoff": "To the extent that there are economic returns to attending the flagship state university, one should observe a discontinuity in earnings at the admission cutoff. This is shown for white men in figure 2, which shows a regression of residual earnings on a cubic polynomial of adjusted SAT score. Table 1 shows the discontinuity estimates that result from varying functional form."
Figure 2: Natural log of annual earnings for white men ten to fifteen years after high school graduation (fit with a cubic polynomial of adjusted SAT score); the x-axis is SAT points above the admission cutoff.
E[Y0|D = 1] ≠ E[Y0|D = 0]
Figure: two RDD DAGs with nodes X, U, D, and Y; in the second, X operates through the cutoff (X → c0).
Sharp RD Design
Figure: treatment assignment jumps from 0 to 1 at the cutoff of the running variable X.
Yi = α + β(Xi − c0 ) + δDi + εi
where α = β0 + β1 · 65 (re-centering at c0 = 65).
All other coefficients, notice, have the same interpretation,
except for the intercept.
Regression without re-centering:
reg y D x
Regression with centering:
reg y D x_c
Nonlinearity bias
gen x3 = x*x*x
scatter y x if D==0, msize(vsmall) || scatter y x ///
    if D==1, msize(vsmall) legend(off) xline(140, ///
    lstyle(foreground)) ylabel(none) || lfit y x ///
    if D==0, color(red) || lfit y x if D==1, ///
    color(red) xtitle("Test Score (X)") ///
    ytitle("Outcome (Y)")
See how the two lines don't touch at c0 but empirically should? That's because the linear fit is the wrong functional form – we know this from the simulation.
Sharp RDD: Nonlinear Case
Yi = f (Xi ) + δDi + ηi
E[Y|X] = E[Y0|X] + (E[Y1|X] − E[Y0|X]) D
where β1* = β11 − β01, β2* = β12 − β02, and βp* = β1p − β0p
The treatment effect at c0 is δ
Polynomial simulation example
capture drop y x2 x3
gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x + x2 + rnormal(0,1000)
reg y D x x2 x3
predict yhat
scatter y x if D==0, msize(vsmall) || scatter y x ///
    if D==1, msize(vsmall) legend(off) xline(140, ///
    lstyle(foreground)) ylabel(none) || line yhat x ///
    if D==0, color(red) sort || line yhat x if D==1, ///
    sort color(red) xtitle("Test Score (X)") ///
    ytitle("Outcome (Y)")
Polynomial simulation example
reg y D x x2
reg y D x_c x2_c
Polynomial simulation example
Are you done now that you have your main results? No
Your main results are only causal insofar as smoothness is a credible belief, and since smoothness isn't guaranteed by "the science" like an RCT, you have to build your case
You must now scrutinize alternative hypotheses that are
consistent with your main results through sensitivity checks,
placebos and alternative approaches
For RDD to be useful, you already need to know something about the mechanism generating the assignment variable and how susceptible it could be to manipulation. Note the rationality of economic actors that this test is built on.
Fig. 1: The agent's problem (McCrary 2008).
A discontinuity in the density at the cutoff is "suspicious" – it suggests manipulation of X around the cutoff is probably going on. In principle one doesn't need continuity of the density, but a jump is a red flag.
This is a high-powered test. You need a lot of observations at c0 to distinguish a discontinuity in the density from noise.
Fig. 2 (McCrary 2008). Hypothetical example: gaming the system with an income-tested job training program: (A) conditional expectation of returns to treatment with no pre-announcement and no manipulation; (B) conditional expectation of returns to treatment with pre-announcement and manipulation; (C) density of income with no pre-announcement and no manipulation; (D) density of income with pre-announcement and manipulation.
Visualizing manipulation
Figure: Figures 2 and 3 from Eric Allen, Patricia Dechow, Devin Pope, and George Wu (2013), "Reference-Dependent Preferences: Evidence from Marathon Runners." The dark bars highlight the density in the minute bin just prior to each 30-minute threshold; the McCrary test is run at each minute threshold from 2:40 to 7:00 to test whether there is a significant discontinuity in the density function at that threshold.
http://faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf
Newborn mortality and medical expenditure
Almond, et al. 2010 used the McCrary density test but found
no evidence of manipulation
Ironically, the McCrary density test may fail to reject in a
heaping scenario
In this scenario, the heaping is associated with high mortality
children who are outliers compared to newborns both to the
left and to the right
“This [heaping at 1500 grams] may be a signal that
poor-quality hospitals have relatively high propensities to
round birth weights but is also consistent with
manipulation of recorded birth weights by doctors, nurses,
or parents to obtain favorable treatment for their children.
Barreca, et al. 2011 show that this nonrandom heaping
leads one to conclude that it is “good” to be strictly less
than any 100-g cutoff between 1,000 and 3,000 grams.”
Donut holes
Figure: the two RDD DAGs again (nodes X, U, c0, D, and Y).
Endogenous cutoffs
Example: outcomes by running variable with smaller bins. From Lee and Lemieux (2010), based on Lee (2008).
Figure: probability of treatment against the running variable (local averages).
McCrary density
Balance pictures
Inference – honesty
Lee and Card (2008) and Lee and Lemieux (2010) recommend
clustering standard errors on the running variable
Kolesár and Rothe (2018) provide extensive theoretical and
simulation-based evidence that this is not good; you’d be
better off just with heteroskedastic robust
They propose two alternative confidence intervals that achieve
correct coverage in large samples – called “honest” (great
intro! Still studying this procedure)
Unavailable in Stata, but is available in R – RDHonest – at
https://github.com/kolesarm/RDHonest
(â, b̂) = argmin_{a,b} Σ_{i=1}^n (yi − a − b(xi − c0))² K((xi − c0)/h) · 1(xi > c0)
https://twitter.com/page_eco/status/958687180104245248
Estimation
Very difficult to test either one of these since you don’t observe
the counterfactual votes of the loser for the same district/time
Winners in a district are selected based on their policy’s
conforming to unobserved voter preferences, too
Lee, Moretti and Butler (2004) develop the “close election
RDD” which has the aim of determining whether convergence,
while theoretically appealing, has any explanatory power in
Congress
The metaphor of the RCT is useful here: maybe close elections
are being determined by coin flips (e.g., a few votes here, a
few votes there)
Outcome is Congress person’s liberal voting score
TABLE I: Results based on ADA scores – close elections sample.
Columns: (1) ADA_{t+1}; (2) ADA_t; (3) DEM_{t+1}; (4) (col. (2)) × (col. (3)); (5) (col. (1)) − (col. (4)).
Standard errors are in parentheses. The unit of observation is a district-congressional session. The sample includes only observations where the Democrat vote share at time t is strictly between 48 percent and 52 percent. The estimated gap is the difference in the average of the relevant variable for observations for which the Democrat vote share at time t is strictly between 50 percent and 52 percent and observations for which it is strictly between 48 percent and 50 percent. Time t and t + 1 refer to congressional sessions. ADA_t is the adjusted ADA voting score. Higher ADA scores correspond to more liberal roll-call voting records. Sample size is 915.
Y = f (X ) + ε
Nonparametric estimation (cont.)
Figure: Lee, Moretti, and Butler (2004), Figure I. γ ≈ 20. ADA scores are a continuous and smooth function of vote shares everywhere except at the 50 percent threshold that determines party membership, where there is a large discontinuous jump.
Contemporaneous liberal voting score
Figure: Lee, Moretti, and Butler (2004), Figure IIb. Effect of initial win on winning the next election: (P^D_{t+1} − P^R_{t+1}) ≈ 0.50. The top panel plots ADA scores after the election at time t against Democrat vote share at time t; the bottom panel plots the probability of Democrat victory at t + 1 against Democrat vote share at time t.
Concluding remarks
Figure: Hoekstra (2009), "The Effect of Attending the Flagship State University on Earnings": local averages by SAT points above the admission cutoff.
Instrumental variables
. reg Y Z X X2 X3
. reg D Z X X2 X3
As in the sharp RDD case one can allow the smooth function
to be different on both sides of the discontinuity.
The second stage model with interaction terms would be the
same as before:
Figure: IV DAGs with nodes Z, D, Y, and unobservable U.
Figure: Wright's graphical demonstration of the identification problem: factors which (A) affect demand conditions without affecting cost conditions, or which (B) affect cost conditions without affecting demand conditions.
Sewall was his son, who did not go into the family business
Rather, he decided to become a genius and invent genetics
Developed path diagrams (which Pearl revived 50 years later for causal inference)
Father and son engaged in letter correspondence as Philip tried to solve the "identification problem"
Figure: Wright's letter to Sewall, his son
Figure: Recognize these?
QJE Rejects
The best instruments are the ones you think of first, seeking the data second (but students often go in the reverse order, which basically guarantees a crappy instrument)
If you want to use IV, then ask:
What moves around the covariate of interest that
might be plausibly random?
On its face, it’s puzzling that the first two kids’ gender
predicts labor market participation
Instrumental variables strategies formalize the strangeness of the instrument – the puzzled inference drawn by an intelligent layperson with no particular knowledge of the phenomena or background in statistics.
You need more information, in other words; otherwise the layperson can't understand what the same gender of your first two children has to do with working
When a good IV strategy finally makes sense
But then the researchers point out that women whose first two
children are of the same gender are more likely to have
additional children than women whose first two children are of
different genders
The layperson then asks himself, “Hm. I wonder if the labor
market differences are due solely to the differences in the
number of kids the woman has...”
Sunday Candy is a good instrument
Figure: DAG with instrument Sunday Candy, treatment Church, outcome Hell, and unobservable U.
Kanye West is a bad instrument
Figure: DAG with instrument Kanye West, treatment Inspiration, outcome Success, and unobservable U.
Foreshadowing the questions you need to be asking
Y = α + δS + γA + ν
Cov (Y , Z ) = Cov (α + δS + γA + ν, Z )
= E [(α + δS + γA + ν)Z ] − E [α + δS + γA + ν]E [Z ]
= {αE (Z ) − αE (Z )} + δ{E (SZ ) − E (S)E (Z )}
+γ{E (AZ ) − E (A)E (Z )} + E (νZ ) − E (ν)E (Z )
Cov (Y , Z ) = δCov (S, Z ) + γCov (A, Z ) + Cov (ν, Z )
Divide both sides by Cov (S, Z ) and the first term becomes δ, the
LHS becomes the ratio of the reduced form to the first stage, plus
two other scaled terms.
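Carrying out that division:
δ_IV = Cov(Y, Z) / Cov(S, Z) = δ + γ · Cov(A, Z) / Cov(S, Z) + Cov(ν, Z) / Cov(S, Z)
so δ_IV recovers δ only if the instrument is uncorrelated with both ability and the structural error.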
Consistency
δ_IV = Cov(Y, Z) / Cov(S, Z) = δ
IV is consistent if the IV assumptions are satisfied:
plim δ̂_IV = δ + γ · Cov(η, Z) / Cov(S, Z) = δ
where the second equality holds because the instrument is uncorrelated with η.
Figure: two IV DAGs, (a) and (b), with nodes Z, A, S, and Y.
In a pinch, you can even get by with two different data sets
1 Dataset 1 needs information on the outcome and the instrument
2 Dataset 2 needs information on the treatment and the instrument.
This is known as “Two sample IV” because there are two
samples involved, rather than the traditional one sample.
Once we define what IV is measuring carefully, you will see
why this works.
Two-stage least squares concepts
Yi = α + δSi + ηi
Yi = β + δ Sbi + νi
Reduced form
Yi = ψ + πZi + εi
Yi = α + δSi + ηi
Si = γ + ρZi + ζi
δ̂_2sls = Cov(Z, Y) / Cov(Z, S)
= [n^{-1} Σ_{i=1}^n (Zi − Z̄)(Yi − Ȳ)] / [n^{-1} Σ_{i=1}^n (Zi − Z̄)(Si − S̄)]
= [n^{-1} Σ_{i=1}^n (Zi − Z̄) Yi] / [n^{-1} Σ_{i=1}^n (Zi − Z̄) Si]
Two-stage least squares
Where did the first term go? Why did the second term become δ?
Two-stage least squares
Rewrite ρ̂ as
ρ̂ = Cov(Z, S) / Var(Z)
ρ̂ Var(Z) = Cov(Z, S)
Two-stage least squares
Recall
Si = γ + ρZi + ζi
Then
Ŝ = γ̂ + ρ̂Z
Then
δ̂_2sls = Cov(ρ̂Z, Y) / Var(ρ̂Z) = Cov(Ŝ, Y) / Var(Ŝ)
Proof.
We will show that Cov(ρ̂Z, Y) = Cov(Ŝ, Y). I will leave it to you to show that Var(ρ̂Z) = Var(Ŝ).
One manual way is just to estimate the reduced form and first
stage coefficients and take the ratio of the respective
coefficients on Z
But while it is always a good idea to run these two regressions,
don’t compute your IV estimate this way
Estimation with software
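A minimal sketch in Stata (the variable names y, s, and z are assumptions):

* 2SLS in one step; endogenous s instrumented by z
ivregress 2sls y (s = z), robust
estat firststage        // reports the first-stage F-statistic
* still good practice: inspect the reduced form and first stage directly
regress y z, robust
regress s z, robust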
In the US, you could drop out of school once you turned 16
“School districts typically require a student to have turned age
six by January 1 of the year in which he or she enters school”
(Angrist and Krueger 1991, p. 980)
Children have different ages when they start school, though,
and this creates different lengths of schooling at the time they
turn 16 (potential drop out age):
Figure: timeline from birth (December) to turning 6 to school start, and accumulated schooling S.
If you’re born in the fourth quarter, you hit 16 with more schooling
than those born in the first quarter
Visuals
The second stage is the same as before, but the fitted values
are from the new first stage
First stage regression results
Quarter of birth is a strong predictor of total years of education (Angrist and Krueger 1991).
Sidebar: Wald estimator
Recall that 2SLS uses the predicted values from a first stage regression – but we showed that the 2SLS method was equivalent to Cov(Y, Z) / Cov(X, Z)
The Wald estimator simply calculates the return to education
as the ratio of the difference in earnings by quarter of birth to
the difference in years of education by quarter of birth – it’s a
version of the above
Formally, δ̂_Wald = (E[Y|Z = 1] − E[Y|Z = 0]) / (E[D|Z = 1] − E[D|Z = 0])
Mechanism
Y = βX + ν
Matrices and instruments
We’ll sadly need some matrix notation, but I’ll try to make it
painless.
The matrix of instrumental variables is Z with the first stage
equation:
X = Zπ + η
And let PZ be the projection matrix that produces fitted values from the population regression of X on Z:
PZ = Z(Z′Z)^{-1} Z′
Weak instruments and bias towards OLS
β̂_2sls = (X′PZ X)^{-1} X′PZ Y
= β + (X′PZ X)^{-1} X′PZ ν
by substituting Y = Xβ + ν.
2SLS bias
β̂_2sls − β = (X′PZ X)^{-1} X′PZ ν
= a X′PZ ν, with a ≡ (X′PZ X)^{-1}
= a[π′Z′ + η′] PZ ν
= aπ′Z′ν + aη′PZ ν
= (X′PZ X)^{-1} π′Z′ν + (X′PZ X)^{-1} η′PZ ν
Taking expectations and using E[Z′ν] = 0, group asymptotics give approximately
E[β̂_2sls − β] ≈ [E(π′Z′Zπ) + E(η′PZ η)]^{-1} E[η′PZ ν]
That last term is what creates the bias so long as η and ν are correlated – and it's precisely because they are correlated that you picked up 2SLS to begin with
First stage F
With some algebra and manipulation, Angrist and Pischke show that the bias of 2SLS is approximately
E[β̂_2sls − β] ≈ (σ_νη / σ_η²) · [E(π′Z′Zπ) / (Q σ_η²) + 1]^{-1}
where the interior term is the population F-statistic for the joint significance of all regressors in the first stage
Weak instruments and bias towards OLS
Consider the intuition all that work bought us: if the first stage is weak (i.e., F → 0), then the bias of 2SLS approaches σ_νη / σ_η².
Weak instruments and bias towards OLS
Adding more weak instruments reduces the first-stage F-statistic and increases the bias of 2SLS. Notice the coefficient also moves closer to OLS.
Adding instruments in Angrist and Krueger
Table from Bound, Jaeger, and Baker (1995) – 180 IVs.
Adding more weak instruments reduces the first-stage F-statistic and moves the coefficient towards the OLS coefficient.
More instruments increase precision but drive down F, so we know the problem has gotten worse.
Guidance on working around weak instruments
5 If you have many IVs, pick your best instrument and report the
just identified model (weak instrument problem is much less
problematic)
6 Check over identified 2SLS models with LIML
Make beautiful pictures of first stage and reduced form
π0i = E [Di0 ]
π1i = (Di1 − Di0) is the heterogeneous causal effect of the IV on Di.
E[π1i] is the average causal effect of Zi on Di.
Identifying assumptions under heterogenous treatment
effects
Independence means that the first stage measures the causal effect
of Zi on Di :
Exclusion Restriction
Y(D,Z) = Y(D,Z’) for all Z, Z’, and for all D
Yi = α0 + δi Di
with α0 = E [Yi0 ] and δi = Yi1 − Yi0
Spotting violations of exclusion is a sport
Monotonicity
Either π1i ≥ 0 for all i or π1i ≤ 0 for all i = 1, . . . , N
δ_IV,LATE = (Effect of Z on Y) / (Effect of Z on D)
Estimand
Examples of subpopulations:
1 Compliers: I only enrolled in the military because I was drafted
otherwise I wouldn’t have served
2 Always takers: My family have always served, so I serve
regardless of whether I am drafted
3 Never takers: I'm a conscientious objector, so under no circumstances will I serve, even if drafted
4 Defiers: When I was drafted, I dodged. But had I not been
drafted, I would have served. I can’t make up my mind.
Never-takers: Di1 − Di0 = 0; Yi(0,1) − Yi(0,0) = 0. By the exclusion restriction, the causal effect of Z on Y is zero.
Compliers: Di1 − Di0 = 1; Yi(1,1) − Yi(0,0) = Yi(1) − Yi(0). The average treatment effect among compliers.
Defiers: Di1 − Di0 = −1; Yi(0,1) − Yi(1,0) = Yi(0) − Yi(1). By monotonicity, no one is in this group.
Always-takers: Di1 − Di0 = 0; Yi(1,1) − Yi(1,0) = 0. By the exclusion restriction, the causal effect of Z on Y is zero.
Monotonicity ensures that there are no defiers.
Empirical Framework
Amy Finkelstein, et al. (2012). “The Oregon Health
Insurance Experiment: Evidence from the First Year”,
Quarterly Journal of Economics, vol. 127, issue 3, August.
Effects of Medicaid
Fielding protocol
∼70,000 people, surveyed at baseline and 12 months later
Basic protocol: three-stage mail survey protocol, English/Spanish
Intensive protocol on a 30% subsample included additional
tracking, mailings, phone attempts (done to adjust for
non-response bias)
Response rate
Effective response rate = 50%
Non-response bias always possible, but response rate and pre-randomization measures in administrative data were balanced between treatment and control
Administrative data
Medicaid records
Pre-randomization demographics from list
Enrollment records to assess “first stage” (how many of the
selected got insurance coverage)
Hospital discharge data
Probabilistically matched to list, de-identified at Oregon
Health Plan
Includes dates and source of admissions, diagnoses,
procedures, length of stay, hospital identifier
Includes years before and after randomization
Other data
Mortality data from Oregon death records
Credit report data, probabilistically matched, de-identified
Sample
Outcomes
Gaining insurance resulted in better access to care and higher satisfaction with care (conditional on actually getting care).
Results: Access & Use of Care
Effect of lottery on coverage
Summary: Access and use of care
Gaining insurance resulted in a reduced probability of having medical collections in credit reports, and in lower amounts owed.
Summary: Financial Strain
Self-reported measures showed significant improvements one year after randomization
                                   Control   RF model (ITT)   IV model (LATE)   p-value
Health good, v. good, excellent     54.8%        +3.9%            +13.3%         .0001
Health stable or improving          71.4%        +3.3%            +11.3%         .0001
Depression screen negative          67.1%        +2.3%            +7.8%          .003
Summary: Self-reported health
Measured:
Blood pressure
Cholesterol levels
Glycated hemoglobin
Depression
Reasons for selecting these:
Reasonably prevalent conditions
Clinically effective medications exist
Markers of longer term risk of cardiovascular disease
Can be measured by trained interviewers and lab tests
A limited window into health status
Results on specific conditions
Figure: DAGs with nodes D, Y, Z, and unobservable e.
Yi = β0 + β1 JIi + β2 Xi + εi
This is the “long” causal model. But note, from the prior DAG, we
cannot control for e because it is unobserved. But it is confounding
the estimation of juvenile incarceration’s effect on outcomes.
Incarceration Propensity as an Instrument
Xi
ci
E [y |x1 , x2 , . . . , xk , c]
Formal panel notation cont.
Single unit i: yi is the T × 1 vector (yi1, ..., yit, ..., yiT)′, and Xi is the T × K matrix whose (t, j) element is Xi,t,j.
(β̂, ĉ1, ..., ĉN) = argmin_{b, m1, ..., mN} Σ_{i=1}^N Σ_{t=1}^T (yit − xit b − mi)²
and
Σ_{t=1}^T (yit − xit β̂ − ĉi) = 0
for i = 1, ..., N.
Derivation: fixed effects regression
Therefore, for i = 1, . . . , N,
ĉi = (1/T) Σ_{t=1}^T (yit − xit β̂) = ȳi − x̄i β̂,
where
x̄i ≡ (1/T) Σ_{t=1}^T xit;  ȳi ≡ (1/T) Σ_{t=1}^T yit
β̂ = (Σ_{i=1}^N Σ_{t=1}^T ẍit′ ẍit)^{-1} (Σ_{i=1}^N Σ_{t=1}^T ẍit′ ÿit)
Identification assumptions:
1 E[εit | xi1, xi2, ..., xiT, ci] = 0; t = 1, 2, ..., T
regressors are strictly exogenous conditional on the unobserved effect
allows xit to be arbitrarily related to ci
2 rank(Σ_{t=1}^T E[ẍit′ ẍit]) = K
regressors vary over time for at least some i and are not collinear
Fixed effects estimator
1 Demean and regress ÿit on ẍit (need to correct degrees of
freedom)
2 Regress yit on xit and unit dummies (dummy variable
regression)
3 Regress yit on xit with canned fixed effects routine
Stata: xtreg y x, fe i(PanelID)
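A sketch of all three routes (the names PanelID, y, and x follow the bullet above; areg is one standard implementation of the dummy-variable route):

* (1) canned fixed effects routine
xtset PanelID
xtreg y x, fe
* (2) dummy-variable regression, absorbing the unit dummies
areg y x, absorb(PanelID)
* (3) manual demeaning (degrees of freedom must be corrected by hand)
bysort PanelID: egen ybar = mean(y)
bysort PanelID: egen xbar = mean(x)
gen ydd = y - ybar
gen xdd = x - xbar
regress ydd xdd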
FE main results
Inference:
Standard errors have to be “clustered” by panel unit (e.g.,
farm) to allow correlation in the εit ’s for the same i.
Yields valid inference as long as number of clusters is
reasonably large
Typically we care about β, but unit fixed effects ci could be of
interest
cbi from dummy variable regression is unbiased but not
consistent for ci (based on fixed T and N → ∞)
Application: SASP
Snow makes his argument with many pieces of evidence that when
taken together are very compelling that water, not air, is the cause
of the cholera epidemics. These can be categorized as:
1 Observation
2 Broad Street Pump
3 Grand Experiment
Observation
δ̂^{2x2}_{kU} = (ȳ^{post(k)}_k − ȳ^{pre(k)}_k) − (ȳ^{post(k)}_U − ȳ^{pre(k)}_U)
Population expectations
δ̂^{2x2}_{kU} = (E[Yk|Post] − E[Yk|Pre]) − (E[YU|Post] − E[YU|Pre])
Potential outcomes and the switching equation
δ̂^{2x2}_{kU} = (E[Yk1|Post] − E[Yk0|Pre]) − (E[YU0|Post] − E[YU0|Pre])   [switching equation]
+ E[Yk0|Post] − E[Yk0|Post]   [adding zero]
Parallel trends bias
δ̂^{2x2}_{kU} = E[Yk1|Post] − E[Yk0|Post]   [ATT]
+ (E[Yk0|Post] − E[Yk0|Pre]) − (E[YU0|Post] − E[YU0|Pre])   [non-parallel trends bias in the 2x2 case]
Another famous DD study
They surveyed about 400 fast food stores both in New Jersey
and Pennsylvania before and after the minimum wage increase
in New Jersey - shoeleather!
Parallel trends assumption
Table 3 (Card and Krueger): Average employment per store before and after the New Jersey minimum wage increase. Columns: (i) PA, (ii) NJ, (iii) difference NJ−PA. Rows: 1. FTE employment before, all available observations; 2. FTE employment after, all available observations; 3. change in mean FTE employment; 4. change in mean FTE employment, balanced sample of stores; 5. change in mean FTE employment, setting FTE at temporarily closed stores to 0.
Surprisingly, employment rose in NJ relative to PA after the minimum wage change – consistent with monopsony theory.
Regression DD
Graph - DD
Yist = α + γNJs + λdt + δ(NJs × dt) + εist
Key assumption of any DD strategy: Parallel trends
Yist = α + γNJs + λdt + δ(NJs × dt) + εist
Losing parallel trends
Figure: observed PA and NJ series from February to November, the counterfactual NJ trend, and the gap between δ_OLS and δ_ATT when parallel trends fails.
Figure 1: Internet Diffusion and Average Quarterly Music Expenditure in the CEX, 1996–2001 (in 1998 dollars).
Table 1: Descriptive Statistics for Internet User and Non-user Groups
Figure: Anderson, et al. (2013) display of raw traffic fatality rates for
re-centered treatment states and control states with randomized
treatment dates
Randomizing control counties to receive arbitrary treatment dates can be misleading
Figure: average birth rates per 1,000 plotted against months relative to CL entry (re-centered treatment dates).
Table: Differences-in-differences-in-differences
Low-wage employment:  After = PA + T + PAt + lt;  Before = PA;  D = T + PAt + lt
High-wage employment: After = PA + PAt + st;      Before = PA;  D = PAt + st
DDD = T + (lt − st)
Difference-in-Differences: Threats to Validity
DDD Example by Gruber
Triple difference (DDD): Mandated Maternity Benefits (Gruber 1994)
DDD in Regression
Have to get the structure of the data correct because now you have (1) before and after, (2) treatment and control states, and (3) a within-state placebo – a generic specification sketch follows below
I give an example in my Mixtape (p. 278) looking at abortion
legalization’s effect on longterm risky sexual behavior,
including do file
Let’s review first the paper, then work through the exercise
itself using data.
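For concreteness, a generic DDD specification (my notation, not necessarily the paper's): with treated state δs, post period τt, and within-state treatment group Dg, the triple difference is the coefficient on the three-way interaction:
Ygst = α + β1 δs + β2 τt + β3 Dg + β4 (δs × τt) + β5 (δs × Dg) + β6 (τt × Dg) + β7 (δs × τt × Dg) + εgst
where β7 is the DDD estimate.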
Figure: Longrun effects of abortion legalization on Risky Sex
Motivation
Figure: Repeal × year estimated coefficients by year, 1986–2000 (event-study plot).
Figure: Second theoretical prediction - this time for 20-24 year olds
Figure: estimated effect of abortion legalization on gonorrhea, Black females 20–24 year-olds (Repeal × year coefficients, event-study plot).
Let’s replicate this using the abortion.do file. Pay close attention to
the stacking of the data by group-state, not just state, and the
exact way in which the interactions must therefore be constructed
Falsification test with alternative outcome
Summary:
21 states passed laws removing “duty to retreat” in places
outside the home
17 states removed “duty to retreat” in any place one had a
legal right to be
13 states include a presumption of reasonable fear
18 states remove civil liability when force was justified under
law
Cheng and Hoekstra’s identification strategy
The key finding in this study focuses on CDL and its effect on
homicides and non-negligent manslaughter
Pop quiz: what should the sign on CDL be here?
Answer
Figure: log homicide rates, 2000–2010, plotted for treatment and control groups (sequence of raw-data comparisons).
Specification checklist across columns (1)–(6), Panel C: Homicide (negative binomial, unweighted):
State and year fixed effects: all columns
Region-by-year fixed effects: columns (2)–(6)
Time-varying controls: columns (3)–(6)
Contemporaneous crime rates: one further column
State-specific linear time trends: one further column
Fisher sharp null
Figure: homicide event study plot using coefplot, leads 9 through 1 and lags 1 through 5 (point estimates ranging from about −0.304 to 0.172).
Figure: homicide event study plot (log murders) using twoway, forcing the early leads into one coefficient; the x-axis runs from 9 years before to 5 years after castle doctrine expansion.
Figure: log murders event study plot.
δ̂^{2x2}_{kU} = (ȳ^{post(k)}_k − ȳ^{pre(k)}_k) − (ȳ^{post(k)}_U − ȳ^{pre(k)}_U)
We know a lot about the 2x2, but what about the twoway fixed effects estimator when it comes to DD designs?
Decomposition Preview
A lot!
When there are three groups – a never treated (U), an early treated (k), and a late treated (l) – there are four 2x2s
But typically we have more than 3 groups, making the number of potential 2x2s even larger
With K timing groups and one untreated group, there are K² distinct 2x2 DDs
K 2 distinct DDs
Assume 3 timing groups (a, b and c) and one untreated group (U).
Then there should be 9 2x2 DDs. Here they are:
a to b b to a c to a
a to c b to c c to b
a to U b to U c to U
Simple example with 3 groups
We’ll stick with two groups, k and l, who will get the treatment
at tk∗ and tl∗ , and the third group U will never get treated
The earlier period before anyone is treated is “pre”, the period
between k and l treatment is “mid”, and the period after l is
treated is “post”
Three important 2x2 DDs
δ̂^{2x2}_{kU} = (ȳ^{post(k)}_k − ȳ^{pre(k)}_k) − (ȳ^{post(k)}_U − ȳ^{pre(k)}_U)
δ̂^{2x2}_{kl} = (ȳ^{mid(k,l)}_k − ȳ^{pre(k)}_k) − (ȳ^{mid(k,l)}_l − ȳ^{pre(k)}_l)
δ̂^{2x2}_{lk} = (ȳ^{post(l)}_l − ȳ^{mid(k,l)}_l) − (ȳ^{post(l)}_k − ȳ^{mid(k,l)}_k)
where the first 2x2 is any timing group compared to untreated, the
second is a group compared to yet-to-be-treated timing group, and
the last is the eventually-treated compared to the already-treated
controls.
δ̂^{2x2}_{kU} = (ȳ^{post(k)}_k − ȳ^{pre(k)}_k) − (ȳ^{post(k)}_U − ȳ^{pre(k)}_U)
δ̂^{2x2}_{lU} = (ȳ^{post(l)}_l − ȳ^{pre(l)}_l) − (ȳ^{post(l)}_U − ȳ^{pre(l)}_U)
δ̂^{2x2,k}_{kl} = (ȳ^{MID(k,l)}_k − ȳ^{PRE(k,l)}_k) − (ȳ^{MID(k,l)}_l − ȳ^{PRE(k,l)}_l)
δ̂^{2x2,l}_{lk} = (ȳ^{POST(k,l)}_l − ȳ^{MID(k,l)}_l) − (ȳ^{POST(k,l)}_k − ȳ^{MID(k,l)}_k)
Second, what makes up the DD estimator?
s_kU = n_k n_U D̄_k (1 − D̄_k) / V̂ar(D̃_it)
s_kl = n_k n_l (D̄_k − D̄_l)(1 − (D̄_k − D̄_l)) / V̂ar(D̃_it)
μ_kl = (1 − D̄_k) / (1 − (D̄_k − D̄_l))
δ̂^{2x2}_{kU} = (E[Yk|Post] − E[Yk|Pre]) − (E[YU|Post] − E[YU|Pre])
= (E[Yk1|Post] − E[Yk0|Pre]) − (E[YU0|Post] − E[YU0|Pre])   [switching equation]
+ E[Yk0|Post] − E[Yk0|Post]   [adding zero]
δ̂^{2x2}_{kU} = ATT_{Post,k} + (ΔY0_{Post,Pre,k} − ΔY0_{Post,Pre,U})   [selection bias!]
δ̂^{2x2}_{kU} = ATT_k(Post) + ΔY0_k(Post(k), Pre) − ΔY0_U(Post(k), Pre)
δ̂^{2x2}_{kl} = ATT_k(Mid) + ΔY0_k(Mid, Pre) − ΔY0_l(Mid, Pre)
These look the same because you’re always comparing the treated
unit with an untreated unit (though in the second case it’s just that
they haven’t been treated yet).
The dangerous 2x2
But what about the 2x2 that compared the late groups to the
already-treated earlier groups? With a lot of substitutions like we
did we get:
δ̂^{2x2}_{lk} = ATT_{l,Post(l)} + [ΔY0_l(Post(l), Mid) − ΔY0_k(Post(l), Mid)]   [parallel trends bias]
− (ATT_k(Post) − ATT_k(Mid))   [heterogeneity bias!]
Heterogeneity bias?
δ̂^{2x2}_{lk} = ATT_{l,Post(l)}
+ ΔY0_l(Post(l), Mid) − ΔY0_k(Post(l), Mid)
− (ATT_k(Post) − ATT_k(Mid))
δ̂^{DD} = Σ_{k≠U} s_kU δ̂^{2x2}_{kU} + Σ_{k≠U} Σ_{l>k} s_kl [μ_kl δ̂^{2x2,k}_{kl} + (1 − μ_kl) δ̂^{2x2,l}_{lk}]
p lim_{n→∞} δ̂^{DD} = δ^{DD} = VWATT + VWCT − ΔATT
VWATT = Σ_{k≠U} σ_kU ATT_k(Post(k))
+ Σ_{k≠U} Σ_{l>k} σ_kl [μ_kl ATT_k(MID) + (1 − μ_kl) ATT_l(POST(l))]
VWCT = Σ_{k≠U} σ_kU [ΔY0_k(Post(k), Pre) − ΔY0_U(Post(k), Pre)]
+ Σ_{k≠U} Σ_{l>k} σ_kl [μ_kl {ΔY0_k(Mid, Pre(k)) − ΔY0_l(Mid, Pre(k))} + (1 − μ_kl) {ΔY0_l(Post(l), Mid) − ΔY0_k(Post(l), Mid)}]
ΔATT = Σ_{k≠U} Σ_{l>k} (1 − μ_kl) [ATT_k(Post(l)) − ATT_k(Mid)]
Now, if the ATT is constant over time, then this difference is zero,
but what if the ATT is not constant? Then TWFE is biased, and
depending on the dynamics and the VWATT, may even flip signs
Case 1: ATT varies across units but not time
p lim_{n→∞} δ̂^{DD} = VWATT + VWCT
VWATT = Σ_{k≠U} ATT_k [σ_kU + Σ_{j=1}^{k−1} σ_jk (1 − μ_jk) + Σ_{j=k+1}^K σ_kj μ_kj]
= Σ_{k≠U} ATT_k w^T_k
The first 2x2 uses the later group as its control in the middle
period. But in the late period, the later treated unit is using
the earlier treated as its control
But notice, this effect is biased because the control group is
experiencing a trend in outcomes (heterogeneous treatment
effects)
This bias feeds through to the later 2x2 according to the size
of the weight (1 − µkl )
Variance weighted common trends
VWCT = Σ_{k≠U} ΔY0_k [σ_kU + Σ_{j=1}^{k−1} σ_jk (1 − 2μ_jk) + Σ_{j=k+1}^K σ_kj (2μ_kj − 1)] − ΔY0_U Σ_{k≠U} σ_kU
where w^T_k is the sum of all weights where group k is the treatment group:
w^T_k = σ_kU + Σ_{j=1}^{k−1} σ_jk (1 − μ_jk) + Σ_{j=k+1}^K σ_kj μ_kj
and w^C_k is the sum of all weights where group k is the control group:
w^C_k = Σ_{j=1}^{k−1} σ_jk μ_jk + Σ_{j=k+1}^K σ_jk (1 − μ_jk)
Variance weighted common trends
What this means is that while all units are acting as controls,
treatment timing causes some units to be controls more often
- hence why they become negative (e.g., wkT − wkC < 0 implies
wkC has become relatively large)
The earliest and/or latest units get more weight as controls
than treatments
Units treated in the middle of the panel have high treatment
variance as we’ve noted repeatedly, and so get more weight
when they act as the treatment group
Variance weighted common trend weights
Testing VWCT
2 Estimate
x_k = β B_k + ε_k
weighted by |w^T_k − w^C_k|
The coefficient β̂ equals covariate differences weighted by the actual identifying variation, and its t-statistic tests the null of reweighted balance implied by the VWCT equality
Software to check the 2x2s and weights
Figure: scatter of 2x2 DD estimates against their decomposition weights.
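A sketch using the community bacondecomp package (an assumption: installed from SSC; variable names are placeholders):

ssc install bacondecomp
xtset state year
* decompose the TWFE DD estimate into its 2x2s and weights
bacondecomp y d, ddetail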
Review baker.do
So we see – with differential timing, and heterogeneous
treatment effects over time, the TWFE bias can be gigantic
because:
New papers are coming out focused on the issues that we are
seeing with TWFE
Callaway and Sant'Anna (2019) is one of these (currently R&R at the Journal of Econometrics)
Preliminary
Horvitz-Thompson weights:
δ̂_ATT = (1/N_T) Σ_{i=1}^N Yi · (Di − p̂(Xi)) / (1 − p̂(Xi))
Hájek (normalized) weights:
δ̂_ATT = [Σ_{i=1}^N Yi Di / p̂] / [Σ_{i=1}^N Di / p̂] − [Σ_{i=1}^N Yi (1 − Di) / (1 − p̂)] / [Σ_{i=1}^N (1 − Di) / (1 − p̂)]
Parameter of interest
E[Y0_t − Y0_{t−1} | X, G_g = 1] = E[Y0_t − Y0_{t−1} | X, C = 1]
Theorem 1
ATT(g, t) = E[( G_g / E[G_g] − (p̂(X)C / (1 − p̂(X))) / E[p̂(X)C / (1 − p̂(X))] ) (Y_t − Y_{g−1})]
Which units will and will not be controls?
Remark 1: In some applications, eventually all units are treated, implying that C is never equal to one. In such cases one can consider the "not yet treated" (D_t = 0) as a control group instead of the "never treated" (C = 1).
Aggregated vs single year/group ATT
(2 / (T(T − 1))) Σ_{g=2}^T Σ_{t=2}^T 1{g ≤ t} ATT(g, t)
Let data run from 1983 - 1988. Thus T = 3. ATT simple average
is 15.
Interesting Parameter 2
(1/κ) Σ_{g=2}^T Σ_{t=2}^T 1{g ≤ t} ATT(g, t) P(G = g)
See baker.do
Concluding remarks on DD
Chances are you are going to write more papers using DD than
any other design
Goodman-Bacon (2018, 2019) is worth your time so that you
know what you are estimating
And Callaway and Sant'Anna (2019) is an extremely useful contribution to the DD toolbox for showing a way to estimate the group-time ATT using any variety of approaches, including regression
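For a Stata route to the group-time ATTs, one later community package is csdid (an assumption: installed from SSC; variable names are placeholders):

ssc install csdid
* group-time ATTs a la Callaway and Sant'Anna, then a simple aggregation
csdid y, ivar(id) time(year) gvar(first_treat)
estat simple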
What is synthetic control
Pros:
Policy interventions often take place at an aggregate level
Aggregate/macro data are often available
Cons:
Selection of control group is ad hoc
Standard errors do not reflect uncertainty about the ability of
the control group to reproduce the counterfactual of interest
Description of the Mariel Boatlift
Card’s main results
Can this ever lead to subjective biases?
For each post-treatment period t > T_0, the estimated effect is the gap between Y_1t, the outcome for unit one at time t, and its synthetic counterpart. We will estimate Y_1t^0 using the J units in the donor pool
Estimating optimal weights
where X_jm is the value of the m-th covariate for unit j and V is some (k × k) symmetric and positive semidefinite matrix
More on the V matrix
Σ_{m=1}^{k} v_m ( X_{1m} − Σ_{j=2}^{J+1} w_j X_{jm} )²
Yit0 = αt + θt Zi + λt ui + εit
Y_{1t}^0 − Σ_{j=2}^{J+1} w_j* Y_{jt} = Σ_{j=2}^{J+1} w_j* Σ_{s=1}^{T_0} λ_t ( Σ_{n=1}^{T_0} λ_n′ λ_n )^{−1} λ_s′ (ε_js − ε_1s) − Σ_{j=2}^{J+1} w_j* (ε_jt − ε_1t)
If Σ_{t=1}^{T_0} λ_t′ λ_t is nonsingular, then the RHS will be close to zero if the number of pre-intervention periods is "large" relative to the size of the transitory shocks
Only units that are alike in observables and unobservables
should produce similar trajectories of the outcome variable
over extended periods of time
Proof in Appendix B of ADH (2011)
Example: California’s Proposition 99
[Figure: per-capita cigarette sales (in packs) over time, California vs. rest of the U.S.; vertical line marks the passage of Proposition 99]
Cigarette Consumption: CA and synthetic CA
[Figure: per-capita cigarette sales (in packs), California vs. synthetic California; vertical line marks the passage of Proposition 99]
Predictor Means: Actual vs. Synthetic California

                                       California          Average of
Variables                          Real    Synthetic   38 control states
Ln(GDP per capita)                10.08       9.86           9.86
Percent aged 15-24                17.40      17.40          17.29
Retail price                      89.42      89.41          87.27
Beer consumption per capita       24.28      24.20          23.75
Cigarette sales per capita 1988   90.10      91.62         114.20
Cigarette sales per capita 1980  120.20     120.43         136.58
Cigarette sales per capita 1975  127.10     126.99         132.81

Note: All variables except lagged cigarette sales are averaged for the 1980-1988 period (beer consumption is averaged 1984-1988).
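A sketch of how such a fit is produced with the user-written synth package, following its shipped Proposition 99 example (the dataset and predictor list below come from that demo, so treat the exact variable names as assumptions):

* ssc install synth, all   // installs the package and the smoking demo data
sysuse smoking, clear
tsset state year
synth cigsale retprice age15to24 beer cigsale(1988) cigsale(1980) ///
    cigsale(1975), trunit(3) trperiod(1989) fig
* trunit(3) is California; fig plots the treated and synthetic series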
Smoking Gap between CA and synthetic CA
[Figure: gap in per-capita cigarette sales (in packs) between California and synthetic California; vertical line marks the passage of Proposition 99]
Inference
[Figure: placebo gaps in per-capita cigarette sales for California and all control states; vertical line marks the passage of Proposition 99]
Smoking Gap for CA and 34 control states
[Figure: placebo gaps in per-capita cigarette sales (in packs)]
Smoking Gap for CA and 29 control states
[Figure: placebo gaps in per-capita cigarette sales (in packs)]
Smoking Gap for CA and 19 control states
[Figure: placebo gaps in per-capita cigarette sales (in packs)]
Ratio Post-Prop. 99 RMSPE to Pre-Prop. 99 RMSPE
[Figure: histogram of the ratio of post- to pre-Proposition 99 RMSPE across states, with California marked]
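A hedged sketch of the placebo loop behind this histogram, continuing the smoking example; it assumes synth's keep() option saves _time, _Y_treated, and _Y_synthetic, which you should verify against your installed version:

levelsof state, local(states)
foreach s of local states {
    quietly synth cigsale cigsale(1988) cigsale(1980) cigsale(1975), ///
        trunit(`s') trperiod(1989) keep(placebo_`s') replace
    preserve
    use placebo_`s', clear
    gen double gap2 = (_Y_treated - _Y_synthetic)^2
    quietly summarize gap2 if _time < 1989
    local pre = sqrt(r(mean))
    quietly summarize gap2 if _time >= 1989 & _time < .
    display "state `s': post/pre RMSPE ratio = " sqrt(r(mean)) / `pre'
    restore
}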
[Figure: number of new prison constructions per year, 1840-2020; the red dashed line marks 1993]
Texas prison growth
[Figure: Texas prison operational capacity, 1982-2004]
[Figure: total incarceration rates (counts per 100,000 population)]
[Figure: gap in prediction error]
Figure 1: Lung Cancer at Autopsy: Combined Results from 18 Studies
[Figure: observed and fitted values, 1860-1950]
Mortality Statistics
"The Great Debate"
The Registrar General of England and Wales began publishing the numbers of deaths for specific cancer sites in 1911. The death rates for cancer of the lung from 1911 to 1955 were published by Percy Stocks. The rates increased exponentially over the period: 10% per year in males and 6% per year in females. Canadian rates for the period 1931-52 were published by A. J. Phillips. The rates were consistently lower in Canada than in England and Wales, but also increased exponentially at 8% per year in males and 4% per year in females.
The British and Canadian rates are shown in Figure 2. The rates (a) for males and (b) for females have been age-standardized, and the trends extended to 1990, using data published by Richard Peto and colleagues, and by Statistics Canada. In British males the rates reached a maximum in the mid-1970s and then declined. In Canadian males the initial rise was more prolonged, reaching a maximum in 1990. Among females the age-standardized rates continue to climb in both countries, the rise being steeper in Canada than in Britain.
[Figure 2(a): Mortality from Cancer of the Lung in Males; rate per 100,000, 1910-2000; England & Wales, Canada, United Kingdom]
The fact that mortality was lower at first in Canada than in Britain may be explained by the difference in smoking in the two countries. Percy Stocks cited data on the annual consumption per adult of cigarettes in various countries between 1939 and 1957.
Figure 4
Smoking and Lung Cancer Case-control Studies
[Figure: odds ratios of lung cancer by smoking, males and females; the odds ratio increases with the amount smoked]
Cohort Studies
Cohort studies, though less prone to bias, are much more difficult to perform than case-control studies, since it is necessary to assemble many thousands of individuals, determine their smoking status, and follow them up for several years to determine how many develop lung cancer. Four such studies were mounted in the 1950s. The subjects used were British doctors, United States veterans, Canadian veterans, and volunteers assembled by the American Cancer Society. All four used mortality as the end-point.
Figure 5 shows the combined mortality ratios for cancer of the lung in males by level of cigarette smoking. Two of the studies involved females, but the numbers of lung cancer deaths were too small to provide precise estimates. Since all causes of death were recorded in the cohort studies it was possible to determine the relationship between smoking and diseases other than lung cancer. Significant associations were found in relation to several types of cancer (e.g. mouth, pharynx, larynx, esophagus, bladder) and with chronic respiratory disease and cardiovascular disease.
Figure 5
Smoking and Lung Cancer Cohort Studies in Males
[Figure: mortality ratios by level of cigarette smoking]
Nature of the criticism
Older people die at a higher rate, and for reasons other than just smoking cigars
Maybe cigar smokers' higher observed death rates are because they're older on average
Subclassification
One way to think about the problem is that the covariates are
not balanced – their mean values differ for treatment and
control group. So let’s try to balance them.
Worth a pause: blocking on confounders vs. controlling for covariates. Controlling for a covariate reduces residual variance but shouldn't affect the bias of the estimator; blocking on a confounder is what delivers the ceteris paribus comparison
Subclassification (also called stratification): Compare mortality
rates across the different smoking groups within age groups so
as to neutralize covariate imbalances in the observed sample
Subclassification
Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
Subclassification: example
Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
15 · (29/40) + 35 · (9/40) + 50 · (2/40) = 21.2
Table: Adjusted death rates using 3 age groups (Cochran 1968)
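The same adjustment takes a few lines in Stata; a minimal sketch that hard-codes the pipe smokers' age-specific death rates and the non-smokers' age counts from the slide:

clear
input rate n_nonsmoker
15 29
35  9
50  2
end
* reweight pipe smokers' age-specific rates by the non-smokers' age shares
gen double contrib = rate * n_nonsmoker / 40
quietly summarize contrib
display "age-adjusted mortality for pipe smokers = " r(sum)   // 21.25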
Definition: Outcomes
Those variables, Y, that are (possibly) not predetermined are called outcomes (for some individual i, Y_i^0 ≠ Y_i^1)
Adjustment for Observables
(Y^0, Y^1) ⊥⊥ D
and therefore:
E[Y | D = 1] − E[Y | D = 0] = E[Y^1 | D = 1] − E[Y^0 | D = 0]   (by the switching equation)
                            = E[Y^1] − E[Y^0]                    (by independence)
                            = E[Y^1 − Y^0]                       (ATE)
E[Y^1 − Y^0] = E[Y^1 − Y^0 | D = 1]
Identification under conditional independence
Identification assumptions:
1. (Y^1, Y^0) ⊥⊥ D | X (conditional independence)
2. 0 < Pr(D = 1 | X) < 1 with probability one (common support)
Identification result:
Given assumption 1:
E[Y^1 − Y^0 | X] = E[Y^1 − Y^0 | X, D = 1]
                 = E[Y | X, D = 1] − E[Y | X, D = 0]
Given assumption 2:
δ_ATE = E[Y^1 − Y^0]
      = ∫ E[Y^1 − Y^0 | X, D = 1] dPr(X)
      = ∫ (E[Y | X, D = 1] − E[Y | X, D = 0]) dPr(X)
Identification under conditional independence
Identification assumptions:
1. (Y^1, Y^0) ⊥⊥ D | X (conditional independence)
2. 0 < Pr(D = 1 | X) < 1 with probability one (common support)
Identification result:
Similarly,
δ_ATT = E[Y^1 − Y^0 | D = 1]
      = ∫ (E[Y | X, D = 1] − E[Y | X, D = 0]) dPr(X | D = 1)
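Stata's built-in teffects ra implements exactly this logic: fit E[Y | X, D] and integrate the difference over the appropriate covariate distribution. A minimal sketch with hypothetical variables y, d, and x:

teffects ra (y x) (d), ate    // integrates over dPr(X)
teffects ra (y x) (d), atet   // integrates over dPr(X | D = 1)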
Question: What is δ̂_ATE = Σ_{k=1}^{K} (Ȳ^{1,k} − Ȳ^{0,k}) · (N_k / N)?
Subclassification by Age (K = 2)
Question: What is δ̂_ATE = Σ_{k=1}^{K} (Ȳ^{1,k} − Ȳ^{0,k}) · (N_k / N)?
4 · (13/30) + 6 · (17/30) = 5.13
Subclassification by Age (K = 2)
Question: What is δ̂_ATT = Σ_{k=1}^{K} (Ȳ^{1,k} − Ȳ^{0,k}) · (N_T^k / N_T)?
Subclassification by Age (K = 2)
Question: What is δ̂_ATT = Σ_{k=1}^{K} (Ȳ^{1,k} − Ȳ^{0,k}) · (N_T^k / N_T)?
4 · (3/10) + 6 · (7/10) = 5.4
Subclassification by Age and Gender (K = 4)
Problem: What is δ̂_ATE = Σ_{k=1}^{K} (Ȳ^{1,k} − Ȳ^{0,k}) · (N_k / N)?
Not identified!
Subclassification by Age and Gender (K = 4)
Question: What is δ̂_ATT = Σ_{k=1}^{K} (Ȳ^{1,k} − Ȳ^{0,k}) · (N_T^k / N_T)?
4 · (3/10) + 5 · (3/10) + 6 · (4/10) = 5.1
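With microdata, the stratified ATT is a short calculation once you build the cell means; a minimal sketch, assuming hypothetical variables y, d, agegrp, and gender:

egen cell = group(agegrp gender)
bysort cell: egen double ybar1 = mean(cond(d == 1, y, .))
bysort cell: egen double ybar0 = mean(cond(d == 0, y, .))
gen double diff = ybar1 - ybar0
* averaging over treated units weights each cell difference by NT_k / NT
quietly summarize diff if d == 1
display "subclassification ATT = " r(mean)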
Curse of Dimensionality
Works well when we can find good matches for each treatment
group unit, so M is usually defined to be small (i.e., M = 1 or
M = 2)
Matching
Potential outcomes: Y_i^1 (under treatment), Y_i^0 (under control)

unit i   Y_i^1   Y_i^0   D_i   X_i
  1        6       ?      1     3
  2        1       ?      1     1
  3        0       ?      1    10
  4        .       0      0     2
  5        .       9      0     3
  6        .       1      0    -2
  7        .       1      0    -4

Question: What is δ̂_ATT = (1/N_T) Σ_{D_i=1} (Y_i − Y_{j(i)})?
Matching example with single covariate
Match and plug in!
Matching example with single covariate
Potential outcomes: Y_i^1 (under treatment), Y_i^0 (under control)

unit i   Y_i^1   Y_i^0   D_i   X_i
  1        6       9      1     3
  2        1       0      1     1
  3        0       9      1    10
  4        .       0      0     2
  5        .       9      0     3
  6        .       1      0    -2
  7        .       1      0    -4

δ̂_ATT = (1/3)·(6 − 9) + (1/3)·(1 − 0) + (1/3)·(0 − 9) = −3.7
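The same plug-in estimate falls out of Stata's built-in nearest-neighbor matching; a minimal sketch entering the table's data (with only seven observations the Abadie-Imbens standard errors are not meaningful, and the final display reproduces the hand calculation on its own):

clear
input y d x
 6 1  3
 1 1  1
 0 1 10
 0 0  2
 9 0  3
 1 0 -2
 1 0 -4
end
teffects nnmatch (y x) (d), atet
display "by hand: " ((6 - 9) + (1 - 0) + (0 - 9)) / 3   // -3.67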
A Training Example
Trainees Non-Trainees
unit age earnings unit age earnings
1 28 17700 1 43 20900
2 34 10200 2 50 31000
3 29 14400 3 30 21000
4 25 20800 4 27 9300
5 29 6100 5 54 41100
6 23 28600 6 48 29800
7 33 21900 7 39 42000
8 27 28800 8 28 8800
9 31 20300 9 24 25500
10 26 28100 10 33 15500
11 25 9400 11 26 400
12 27 14300 12 31 26600
13 29 12500 13 26 16500
14 24 19700 14 34 24200
15 25 10100 15 25 23300
16 43 10700 16 24 9700
17 28 11500 17 29 6200
18 27 10700 18 35 30200
19 28 16300 19 32 17800
Average: 28.5 16426 20 23 9500
21 32 25900
Average: 33 20724
Match each trainee to the non-trainee(s) nearest in age, with replacement, averaging earnings when two non-trainees tie in age (e.g., trainee 10, age 26, matches non-trainees 11 and 13, so the matched earnings are (400 + 16500)/2 = 8450). Filling in the matched sample row by row yields the completed table:
A Training Example
Trainees Non-Trainees Matched Sample
unit age earnings unit age earnings unit age earnings
1 28 17700 1 43 20900 8 28 8800
2 34 10200 2 50 31000 14 34 24200
3 29 14400 3 30 21000 17 29 6200
4 25 20800 4 27 9300 15 25 23300
5 29 6100 5 54 41100 17 29 6200
6 23 28600 6 48 29800 20 23 9500
7 33 21900 7 39 42000 10 33 15500
8 27 28800 8 28 8800 4 27 9300
9 31 20300 9 24 25500 12 31 26600
10 26 28100 10 33 15500 11,13 26 8450
11 25 9400 11 26 400 15 25 23300
12 27 14300 12 31 26600 4 27 9300
13 29 12500 13 26 16500 17 29 6200
14 24 19700 14 34 24200 9,16 24 17700
15 25 10100 15 25 23300 15 25 23300
16 43 10700 16 24 9700 1 43 20900
17 28 11500 17 29 6200 8 28 8800
18 27 10700 18 35 30200 4 27 9300
19 28 16300 19 32 17800 8 28 8800
Average: 28.5 16426 20 23 9500 Average: 28.5 13982
21 32 25900
Average: 33 20724
Age Distribution: Before Matching
[Figure: age histograms, trainees vs. non-trainees]
Age Distribution: After Matching
[Figure: age histograms, trainees vs. matched non-trainees]
Training Effect Estimates
After matching: δ̂_ATT = 16426 − 13982 = 2444
The normalized Euclidean distance weights each covariate by the inverse of its sample variance:
‖X_i − X_j‖ = √( (X_i − X_j)′ V̂^{−1} (X_i − X_j) ),   where V̂ = diag(σ̂_1², σ̂_2², …, σ̂_k²)
Thus, if there are changes in the scale of X_ni, these changes also affect σ̂_n², and the normalized Euclidean distance does not change
Mahalanobis distance
The Mahalanobis distance replaces V̂ with the full covariance matrix:
‖X_i − X_j‖ = √( (X_i − X_j)′ Σ̂_X^{−1} (X_i − X_j) ),   where Σ̂_X is the sample variance-covariance matrix of X
Arbitrary weights
where each treated unit i is matched to a control unit j(i) with X_i ≈ X_{j(i)} and D_{j(i)} = 0
Define potential outcomes and switching eq.
µ0 (x) = E [Y |X = x, D = 0] = E [Y 0 |X = x],
µ1 (x) = E [Y |X = x, D = 1] = E [Y 1 |X = x],
Yi = µDi (Xi ) + εi
Substitute and distribute terms
δ̂_ATT = (1/N_T) Σ_{D_i=1} [ (µ^1(X_i) + ε_i) − (µ^0(X_{j(i)}) + ε_{j(i)}) ]
       = (1/N_T) Σ_{D_i=1} (µ^1(X_i) − µ^0(X_{j(i)})) + (1/N_T) Σ_{D_i=1} (ε_i − ε_{j(i)})
Deriving the matching bias
Each matched pair contributes the discrepancy µ^0(X_i) − µ^0(X_{j(i)}) to the bias.
Bias-corrected (BC) matching:
δ̂_ATT^BC = (1/N_T) Σ_{D_i=1} [ (Y_i − Y_{j(i)}) − (µ̂^0(X_i) − µ̂^0(X_{j(i)})) ]
Potential Outcome
unit under Treatment under Control
i Yi1 Yi0 Di Xi
1 10 8 1 3
2 4 1 1 1
3 10 9 1 10
4 8 0 4
5 1 0 0
6 9 0 8
δ̂_ATT = (10 − 8)/3 + (4 − 1)/3 + (10 − 9)/3 = 2
Bias adjustment in matched data
For the bias correction, estimate µ̂^0(X) = β̂_0 + β̂_1 X = 2 + X by regressing Y on X among the control units.
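A minimal sketch of that bias correction, using the six observations from the table above; the regression on the three control units recovers µ̂^0(X) = 2 + X exactly:

clear
input y d x
10 1  3
 4 1  1
10 1 10
 8 0  4
 1 0  0
 9 0  8
end
regress y x if d == 0   // fits muhat0(X) = 2 + X on the controls
predict double mu0, xb
* matched pairs on nearest x: observations 1->4, 2->5, 3->6
display ((10 - 8) - (mu0[1] - mu0[4]) + (4 - 1) - (mu0[2] - mu0[5]) ///
    + (10 - 9) - (mu0[3] - mu0[6])) / 3   // bias-corrected ATT = 1.33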
Large sample distribution for matching estimators
We’ll talk about the propensity score in just a second; for now
this assumption is only about X
The assumption requires that there are units in both treatment and control across the range of the propensity score
Recall that RDD did not have common support, so it relied on extrapolation sensitive to functional form assumptions
Common support ensures we can find similar enough donors in
the control pool
Unlike CIA, common support is testable
Formal Definition
because
Pr(D = 1 | ρ(X)) = E[D | ρ(X)]            (previous slide)
                 = E[ E[D | X] | ρ(X) ]    (LIE)
                 = E[ p(X) | ρ(X) ]        (definition)
                 = ρ(X)
it follows that
D ⊥⊥ X | ρ(X)
Balancing property
Pr (X |D = 1, p(X )) = Pr (X |D = 0, p(X ))
Pr (X |D = 1, pb(X )) = Pr (X |D = 0, pb(X ))
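A minimal sketch of checking this balancing property, with hypothetical variables d, x1, x2: stratify on the estimated score and compare covariate means within strata:

logit d x1 x2
predict double phat, pr
xtile block = phat, nq(5)   // quintiles of the estimated score
forvalues b = 1/5 {
    display "score block `b'"
    tabstat x1 x2 if block == `b', by(d) statistics(mean)
}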
Propensity score theorem
Proposition: If (Y^1, Y^0) ⊥⊥ D | X, then
δ_ATE = E[Y^1 − Y^0] = E[ Y · (D − ρ(X)) / (ρ(X) · (1 − ρ(X))) ]
δ_ATT = E[Y^1 − Y^0 | D = 1] = (1 / Pr(D = 1)) · E[ Y · (D − ρ(X)) / (1 − ρ(X)) ]
IPW Proof
Proof.
E[ Y · (D − ρ(X)) / (ρ(X)(1 − ρ(X))) | X ]
  = E[ Y / ρ(X) | X, D = 1 ] · ρ(X) + E[ −Y / (1 − ρ(X)) | X, D = 0 ] · (1 − ρ(X))
  = E[Y | X, D = 1] − E[Y | X, D = 0]
y_i = α_0 + X_i β + α̃_1 D_i + θ_0 · D_i / ρ̂(X_i) + θ_1 · (1 − D_i) / (1 − ρ̂(X_i)) + ε̃_i
Propensity score matching
A parameter of interest: the ATT.
We estimate it as follows:
δ̂_ATT = (1/N_T) Σ_{i:W_i=1} (Y_i − Y_{j(i)})
But how far away on the propensity score should a match be allowed to be? Herein lie the different types of matching proposed
Matching just one nearest neighbor minimizes bias at the cost
of larger variance
Matching using additional nearest neighbors increases the bias
but decreases the variance
Matching with or without replacement
with replacement keeps bias low at the cost of larger variance
without replacement keeps variance low but at the cost of
potential bias
Distance between treatment and control units
Estimation:
NSW was a randomized job training program; therefore estimating the average treatment effect is straightforward:
(1/N_t) Σ_{D_i=1} Y_i − (1/N_c) Σ_{D_i=0} Y_i ≈ E[Y^1 − Y^0]
                        CPS                       NSW
                 All          Controls          Trainees
                            (Nc = 15,992)      (Nt = 297)
covariate      mean  (s.d.)    mean               mean      t-stat    diff
Black          0.09   0.28     0.07               0.80      47.04    -0.73
Hispanic       0.07   0.26     0.07               0.94       1.47    -0.02
Age           33.07  11.04    33.2               24.63      13.37     8.6
Married        0.70   0.46     0.71               0.17      20.54     0.54
No degree      0.30   0.46     0.30               0.73      16.27    -0.43
Education     12.0    2.86    12.03              10.38       9.85     1.65
1975 Earnings 13.51   9.31    13.65               3.1       19.63    10.6
1975 Unemp     0.11   0.32     0.11               0.37      14.29    -0.26
Dehejia and Wahba (1999)
X ⊥⊥ D | p(X)
psmatch2 treated, pscore(score) outcome(re78) kernel k(normal) bw(0.01)
pstest2 age black hispanic married educ nodegree re78, sum graph
Matching vs. Propensity score
"…Panel B for outcomes. Notice the large differences in background characteristics between the program participants and the PSID sample. This is what makes drawing causal inferences…"
"…pretreatment covariates in Table 1, panel A, but do not include any higher order terms or interactions, with only the control units that are used as a match [the units j such that W_j = 0 and…]"
L_1(f, g) = (1/2) Σ_{l_1…l_k} | f_{l_1…l_k} − g_{l_1…l_k} |
A man walks up the mountain barefoot until he can't feel his feet again; Viktor Shklovsky said art is there to make "the stone feel like a stone again". I want research to feel like research again for you
Research is a quest for honest answers to good faith questions
that people care about
Most of all, research is truly fun for those who find such things
fun. It’s a form of self-expression and creativity for many of us
And it is fun to understand the answers you get and why those
answers are reliable which requires checklists, workflows,
clearly defined assumptions and proper tools for the job
It is not fun to get a bad answer to a poorly defined question
that you’re not confident about
A Priori Knowledge is Necessary for Identification