
Chapter 6

Introduction to
Multiple Regression
Outline
1. Omitted variable bias
2. Causality and regression analysis
3. Multiple regression and OLS
4. Measures of fit
5. Sampling distribution of the OLS estimator

2
Omitted Variable Bias
(SW Section 6.1)

The error u arises because of factors that influence Y but are not
included in the regression function; so, there are always omitted
variables.

Sometimes, the omission of those variables can lead to bias in the OLS estimator.

3
Omitted variable bias, ctd.
The bias in the OLS estimator that occurs as a result of an
omitted factor is called omitted variable bias. For omitted
variable bias to occur, the omitted factor “Z” must be:

1. A determinant of Y (i.e. Z is part of u); and

2. Correlated with the regressor X (i.e. corr(Z,X) ≠ 0)

Both conditions must hold for the omission of Z to result in omitted variable bias.
4
Omitted variable bias, ctd.
In the test score example:
1. English language ability (whether the student has English as
a second language) plausibly affects standardized test
scores: Z is a determinant of Y.
2. Immigrant communities tend to be less affluent and thus
have smaller school budgets – and higher STR: Z is
correlated with X.

Accordingly, ˆ1 is biased. What is the direction of this bias?


• What does common sense suggest?
• If common sense fails you, there is a formula…
5
Omitted variable bias, ctd.
A formula for omitted variable bias: recall the equation,

\[
\hat\beta_1-\beta_1
=\frac{\dfrac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)\,u_i}{\dfrac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2}
=\frac{\dfrac{1}{n}\sum_{i=1}^{n}v_i}{\left(\dfrac{n-1}{n}\right)s_X^2}
\]

where vi = (Xi − X̄)ui ≈ (Xi − μX)ui. Under Least Squares Assumption #1,

E[(Xi − μX)ui] = cov(Xi, ui) = 0.

But what if E[(Xi − μX)ui] = cov(Xi, ui) = σXu ≠ 0?

6
Omitted variable bias, ctd.
In general (that is, even if Assumption #1 is not true),
\[
\hat\beta_1-\beta_1
=\frac{\dfrac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)\,u_i}{\dfrac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2}
\;\xrightarrow{\;p\;}\;\frac{\sigma_{Xu}}{\sigma_X^{2}}
=\left(\frac{\sigma_u}{\sigma_X}\right)\left(\frac{\sigma_{Xu}}{\sigma_X\sigma_u}\right)
=\left(\frac{\sigma_u}{\sigma_X}\right)\rho_{Xu},
\]

where ρXu = corr(X, u). If Assumption #1 is valid, then ρXu = 0, but if not we have….
7
The omitted variable bias formula:
\[
\hat\beta_1 \;\xrightarrow{\;p\;}\; \beta_1+\left(\frac{\sigma_u}{\sigma_X}\right)\rho_{Xu}
\]

If an omitted factor Z is both:
(1) a determinant of Y (that is, it is contained in u); and
(2) correlated with X,
then ρXu ≠ 0 and the OLS estimator β̂1 is biased (and is not consistent).
The math makes precise the idea that districts with few ESL
students (1) do better on standardized tests and (2) have
smaller classes (bigger budgets), so ignoring the ESL factor
results in overstating the class size effect.
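
One way to make the sign concrete: PctEL sits in u and lowers test scores, while corr(STR, PctEL) > 0, so corr(STR, u) < 0 and hence ρXu < 0. Plugging into the formula above (with X = STR),

\[
\hat\beta_1 \;\xrightarrow{\;p\;}\; \beta_1+\left(\frac{\sigma_u}{\sigma_X}\right)\rho_{Xu}\;<\;\beta_1,
\]

so OLS converges to something more negative than the true class-size effect.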
Is this actually going on in the CA data?
8
• Districts with fewer English Learners have higher test scores
• Districts with lower percent EL (PctEL) have smaller classes
• Among districts with comparable PctEL, the effect of class size is small (recall overall “test score gap” = 7.4)
9
Digression on causality and
regression analysis
What do we want to estimate?

• What is, precisely, a causal effect?
• The common-sense definition of causality isn’t precise enough for our purposes.
• In this course, we define a causal effect as the effect that is measured in an ideal randomized controlled experiment.

10
Ideal Randomized Controlled
Experiment
• Ideal: subjects all follow the treatment protocol – perfect compliance, no errors in reporting, etc.!
• Randomized: subjects from the population of interest are randomly assigned to a treatment or control group (so there are no confounding factors)
• Controlled: having a control group permits measuring the differential effect of the treatment
• Experiment: the treatment is assigned as part of the experiment: the subjects have no choice, so there is no “reverse causality” in which subjects choose the treatment they think will work best.

11
Back to class size:
• Conceive an ideal randomized controlled experiment for measuring the effect on TestScore of reducing STR…
• How does our observational data differ from this ideal?
  • The treatment is not randomly assigned
  • Consider PctEL – percent English learners – in the district. It plausibly satisfies the two criteria for omitted variable bias: Z = PctEL is:
    1. a determinant of Y; and
    2. correlated with the regressor X.
• The “control” and “treatment” groups differ in a systematic way – corr(STR, PctEL) ≠ 0
12
• Randomized controlled experiments:
  • Randomization + control group means that any differences between the treatment and control groups are random – not systematically related to the treatment
• We can eliminate the difference in PctEL between the large (control) and small (treatment) groups by examining the effect of class size among districts with the same PctEL.
  • If the only systematic difference between the large and small class size groups is in PctEL, then we are back to the randomized controlled experiment – within each PctEL group.
  • This is one way to “control” for the effect of PctEL when estimating the effect of STR.
13
Return to omitted variable bias
Three ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which treatment
(STR) is randomly assigned: then PctEL is still a determinant
of TestScore, but PctEL is uncorrelated with STR. (But this is
unrealistic in practice.)
2. Adopt the “cross tabulation” approach, with finer gradations of STR and PctEL – within each group, all classes have the same PctEL, so we control for PctEL. (But soon we will run out of data, and what about other determinants like family income and parental education?)
3. Use a regression in which the omitted variable (PctEL) is no longer omitted: include PctEL as an additional regressor in a multiple regression (a simulated sketch of this approach follows below).
14
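
A minimal simulated sketch of approach #3 in Stata (hypothetical variables z, x, and y standing in for PctEL, STR, and TestScore – not the California data):

clear
set obs 1000
set seed 42
generate z = rnormal()                   // the omitted factor (plays the role of PctEL)
generate x = 0.5*z + rnormal()           // regressor correlated with z (like STR and PctEL)
generate y = 1 + 2*x + 3*z + rnormal()   // the true coefficient on x is 2
regress y x      // short regression: the slope on x is biased upward (it absorbs part of z's effect)
regress y x z    // long regression: with z included, the slope on x is again close to 2

In this design the short-regression slope converges to roughly 2 + 3(0.5)/1.25 = 3.2 – the (σu/σX)ρXu term from the omitted variable bias formula at work.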
The Population Multiple Regression
Model (SW Section 6.2)
Consider the case of two regressors:
Yi = 0 + 1X1i + 2X2i + ui, i = 1,…,n

 Y is the dependent variable


 X1, X2 are the two independent variables (regressors)
 (Yi, X1i, X2i) denote the ith observation on Y, X1, and X2.
 0 = unknown population intercept
 1 = effect on Y of a change in X1, holding X2 constant
 2 = effect on Y of a change in X2, holding X1 constant
 ui = the regression error (omitted factors)
15
Interpretation of coefficients in
multiple regression
Yi = 0 + 1X1i + 2X2i + ui, i = 1,…,n

Consider changing X1 by X1 while holding X2 constant:


Population regression line before the change:

Y = 0 + 1X1 + 2X2

Population regression line, after the change:

Y + Y = 0 + 1(X1 + X1) + 2X2


16
Before: Y = 0 + 1(X1 + X1) + 2X2

After: Y + Y = 0 + 1(X1 + X1) + 2X2

Difference: Y = 1X1
So:
Y
1 = , holding X2 constant
X 1

Y
2 = , holding X1 constant
X 2

0 = predicted value of Y when X1 = X2 = 0.


17
The OLS Estimator in Multiple
Regression (SW Section 6.3)
With two regressors, the OLS estimator solves:

\[
\min_{b_0,\,b_1,\,b_2}\;\sum_{i=1}^{n}\bigl[Y_i-(b_0+b_1X_{1i}+b_2X_{2i})\bigr]^2
\]

• The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (predicted value) based on the estimated line.
• This minimization problem is solved using calculus
• This yields the OLS estimators of β0, β1, and β2.
18
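
As a sketch of the calculus step, setting the derivatives of the sum of squared prediction mistakes with respect to b0, b1, b2 to zero gives the three OLS “normal equations”:

\[
\begin{aligned}
\sum_{i=1}^{n}\bigl[Y_i-(\hat\beta_0+\hat\beta_1X_{1i}+\hat\beta_2X_{2i})\bigr]&=0,\\
\sum_{i=1}^{n}X_{1i}\bigl[Y_i-(\hat\beta_0+\hat\beta_1X_{1i}+\hat\beta_2X_{2i})\bigr]&=0,\\
\sum_{i=1}^{n}X_{2i}\bigl[Y_i-(\hat\beta_0+\hat\beta_1X_{1i}+\hat\beta_2X_{2i})\bigr]&=0.
\end{aligned}
\]

In words: the OLS residuals sum to zero and are uncorrelated (in sample) with each regressor; solving these three equations in three unknowns yields β̂0, β̂1, β̂2.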
Example: the California test score
data
Regression of TestScore against STR:

\( \widehat{TestScore} = 698.9 - 2.28\,STR \)

Now include percent English Learners in the district (PctEL):

\( \widehat{TestScore} = 686.0 - 1.10\,STR - 0.65\,PctEL \)

• What happens to the coefficient on STR?
• Why? (Note: corr(STR, PctEL) = 0.19)
19
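
A quick worked illustration of what the new slope means: holding PctEL constant, cutting STR by 2 students per teacher is predicted to change test scores by

\[
\Delta\widehat{TestScore} = -1.10\times(-2) \approx +2.2 \text{ points},
\]

roughly half of the ≈ 4.6 points the single-regressor line (slope –2.28) would have predicted.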
Multiple regression in STATA
reg testscr str pctel, robust;

Regression with robust standard errors Number of obs = 420


F( 2, 417) = 223.82
Prob > F = 0.0000
R-squared = 0.4264
Root MSE = 14.464

------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .4328472 -2.54 0.011 -1.95213 -.2504616
pctel | -.6497768 .0310318 -20.94 0.000 -.710775 -.5887786
_cons | 686.0322 8.728224 78.60 0.000 668.8754 703.189
------------------------------------------------------------------------------

\( \widehat{TestScore} = 686.0 - 1.10\,STR - 0.65\,PctEL \)

More on this printout later…


20
Measures of Fit for Multiple
Regression (SW Section 6.4)
Actual = predicted + residual: Yi = Ŷi + ûi

SER = std. deviation of ûi (with d.f. correction)

RMSE = std. deviation of ûi (without d.f. correction)

R² = fraction of variance of Y explained by X

R̄² = “adjusted R²” = R² with a degrees-of-freedom correction that adjusts for estimation uncertainty; R̄² < R²
21
SER and RMSE
As in regression with a single regressor, the SER and the RMSE
are measures of the spread of the Y’s around the regression line:

\[
SER=\sqrt{\frac{1}{n-k-1}\sum_{i=1}^{n}\hat u_i^{2}}\,,
\qquad
RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\hat u_i^{2}}
\]

22
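
A minimal Stata sketch (assuming the California data used above are in memory) of how the SER and RMSE relate to the residuals:

regress testscr str pctel
predict uhat, residuals
quietly summarize uhat
display "RMSE = " sqrt(r(Var)*(r(N)-1)/r(N))         // no degrees-of-freedom correction
display "SER  = " sqrt(r(Var)*(r(N)-1)/(r(N)-2-1))   // k = 2 regressors, so divide by n-k-1

Note that the “Root MSE” in the Stata header earlier uses the degrees-of-freedom correction, so it matches the SER (14.5) rather than the RMSE as defined on this slide.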
R² and R̄²
The R² is the fraction of the variance explained – same definition as in regression with a single regressor:

\[
R^{2}=\frac{ESS}{TSS}=1-\frac{SSR}{TSS},
\]

where
\[
ESS=\sum_{i=1}^{n}\bigl(\hat Y_i-\overline{\hat Y}\bigr)^{2},\qquad
SSR=\sum_{i=1}^{n}\hat u_i^{2},\qquad
TSS=\sum_{i=1}^{n}\bigl(Y_i-\bar Y\bigr)^{2}.
\]

• The R² always increases when you add another regressor (why?) – a bit of a problem for a measure of “fit”

23
R² and R̄², ctd.
The R̄² (the “adjusted R²”) corrects this problem by “penalizing” you for including another regressor – the R̄² does not necessarily increase when you add another regressor.

Adjusted R²:
\[
\bar R^{2} = 1-\left(\frac{n-1}{n-k-1}\right)\frac{SSR}{TSS}
\]

Note that R̄² < R²; however, if n is large the two will be very close.

24
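
A short Stata sketch of the same adjustment, using the results regress stores (e(r2), e(N), e(df_m) = k, and e(r2_a)):

regress testscr str pctel
display "R-squared          = " e(r2)
display "adjusted R-squared = " 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
display "Stata's e(r2_a)    = " e(r2_a)

Since 1 – R² = SSR/TSS, the expression in the second display line is the same formula as above, just written in terms of R².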
Measures of fit, ctd.
Test score example:

(1) \( \widehat{TestScore} = 698.9 - 2.28\,STR \),
    R² = .05, SER = 18.6

(2) \( \widehat{TestScore} = 686.0 - 1.10\,STR - 0.65\,PctEL \),
    R² = .426, R̄² = .424, SER = 14.5

• What – precisely – does this tell you about the fit of regression (2) compared with regression (1)?
• Why are the R² and the R̄² so close in (2)?
25
The Least Squares Assumptions for
Multiple Regression (SW Section 6.5)
Yi = 0 + 1X1i + 2X2i + … + kXki + ui, i = 1,…,n

1. The conditional distribution of u given the X’s has mean


zero, that is, E(u|X1 = x1,…, Xk = xk) = 0.
2. (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
3. Large outliers are rare: X1,…, Xk, and Y have four moments:
E( X 1i4 ) <  ,…, E( X ki4 ) <  , E(Yi 4 ) <  .
4. There is no perfect multicollinearity.

26
Assumption #1: the conditional mean of
u given the included X’s is zero.
E(u|X1 = x1,…, Xk = xk) = 0

• This has the same interpretation as in regression with a single regressor.
• If an omitted variable (1) belongs in the equation (so is in u) and (2) is correlated with an included X, then this condition fails.
• Failure of this condition leads to omitted variable bias.
• The solution – if possible – is to include the omitted variable in the regression.
27
Assumption #2: (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
This is satisfied automatically if the data are collected by
simple random sampling.

Assumption #3: large outliers are rare (finite fourth moments)
This is the same assumption as we had before for a single
regressor. As in the case of a single regressor, OLS can be
sensitive to large outliers, so you need to check your data
(scatterplots!) to make sure there are no crazy values (typos
or coding errors).

28
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.

Example: Suppose you accidentally include STR twice:


regress testscr str str, robust
Regression with robust standard errors Number of obs = 420
F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
str | (dropped)
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
29
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.
• In the previous regression, β1 is the effect on TestScore of a unit change in STR, holding STR constant (???)
• We will return to perfect (and imperfect) multicollinearity shortly, with more examples…

With these least squares assumptions in hand, we can now derive the sampling distribution of β̂1, β̂2, …, β̂k.

30
The Sampling Distribution of the
OLS Estimator (SW Section 6.6)
Under the four Least Squares Assumptions,
• The exact (finite-sample) distribution of β̂1 has mean β1, and var(β̂1) is inversely proportional to n; so too for β̂2.
• Other than its mean and variance, the exact (finite-n) distribution of β̂1 is very complicated; but for large n…
• β̂1 is consistent: β̂1 →p β1 (law of large numbers)
• (β̂1 − E(β̂1)) / √var(β̂1) is approximately distributed N(0,1) (CLT)
• So too for β̂2, …, β̂k
Conceptually, there is nothing new here!
31
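
A hypothetical Monte Carlo sketch in Stata (illustrative program and variable names; not from the text) of what the sampling distribution of β̂1 looks like:

capture program drop olssim
program define olssim, rclass
    clear
    set obs 200
    generate x1 = rnormal()
    generate x2 = rnormal()
    generate y  = 1 + 2*x1 + 3*x2 + rnormal()
    regress y x1 x2
    return scalar b1 = _b[x1]
end
set seed 12345
simulate b1 = r(b1), reps(1000) nodots: olssim
summarize b1           // the mean of the simulated b1's is close to the true value 2
histogram b1, normal   // and their distribution is approximately normal

Raising set obs inside the program shrinks the spread of b1 – the “variance inversely proportional to n” behaviour described above.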
Multicollinearity, Perfect and
Imperfect (SW Section 6.7)
Some more examples of perfect multicollinearity
• The example from earlier: you include STR twice.
• Second example: regress TestScore on a constant, D, and B, where: Di = 1 if STR ≤ 20, = 0 otherwise; Bi = 1 if STR > 20, = 0 otherwise, so Bi = 1 – Di and there is perfect multicollinearity.
• Would there be perfect multicollinearity if the intercept (constant) were somehow dropped (that is, omitted or suppressed) in this regression?
• This example is a special case of…
32
The dummy variable trap
Suppose you have a set of multiple binary (dummy)
variables, which are mutually exclusive and exhaustive – that is,
there are multiple categories and every observation falls in one
and only one category (Freshmen, Sophomores, Juniors, Seniors,
Other). If you include all these dummy variables and a constant,
you will have perfect multicollinearity – this is sometimes called
the dummy variable trap.
• Why is there perfect multicollinearity here?
• Solutions to the dummy variable trap (sketched in Stata below):
  1. Omit one of the groups (e.g. Senior), or
  2. Omit the intercept
• What are the implications of (1) or (2) for the interpretation of the coefficients?
33
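
A minimal Stata sketch of the trap and both fixes, using the D and B dummies defined two slides back (assumes the California data with testscr and str are loaded):

generate D = (str <= 20)           // = 1 for "small class" districts
generate B = 1 - D                 // = 1 for "large class" districts, so D + B = 1: the trap
regress testscr D B                // perfect multicollinearity: Stata drops one of the regressors
regress testscr D                  // fix 1: omit one group; B becomes the base category
regress testscr D B, noconstant    // fix 2: keep both dummies but suppress the intercept

Under fix 1 the coefficient on D is the small-versus-large difference in mean test scores; under fix 2 each dummy’s coefficient is simply that group’s mean.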
Perfect multicollinearity, ctd.
• Perfect multicollinearity usually reflects a mistake in the definitions of the regressors, or an oddity in the data.
• If you have perfect multicollinearity, your statistical software will let you know – either by crashing or giving an error message or by “dropping” one of the variables arbitrarily.
• The solution to perfect multicollinearity is to modify your list of regressors so that you no longer have perfect multicollinearity.

34
Imperfect multicollinearity
Imperfect and perfect multicollinearity are quite different despite
the similarity of the names.

Imperfect multicollinearity occurs when two or more regressors are very highly correlated.

• Why this term? If two regressors are very highly correlated, then their scatterplot will pretty much look like a straight line – they are collinear – but unless the correlation is exactly 1, that collinearity is imperfect.

35
Imperfect multicollinearity, ctd.
Imperfect multicollinearity implies that one or more of the
regression coefficients will be imprecisely estimated.
• Intuition: the coefficient on X1 is the effect of X1 holding X2 constant; but if X1 and X2 are highly correlated, there is very little variation in X1 once X2 is held constant – so the data are pretty much uninformative about what happens when X1 changes but X2 doesn’t, so the variance of the OLS estimator of the coefficient on X1 will be large.
• Imperfect multicollinearity (correctly) results in large standard errors for one or more of the OLS coefficients.
• The math? See SW, App. 6.2. (A small simulated illustration follows below.)
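
A hypothetical simulation sketch in Stata (illustrative names x1, x2, y) of how near-collinearity inflates standard errors:

clear
set obs 500
set seed 1
generate x1 = rnormal()
generate x2 = 0.99*x1 + 0.1*rnormal()   // corr(x1, x2) is very close to 1
generate y  = 1 + x1 + x2 + rnormal()
regress y x1 x2    // both slopes are very imprecisely estimated (large standard errors)
estat vif          // huge variance inflation factors flag the near-collinearity

Changing 0.99 to, say, 0.2 and re-running shrinks the standard errors dramatically, even though nothing else about the design changes.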

Next topic: hypothesis tests and confidence intervals…
36
