
Chapter 3 Econometrics


Chapter Three

Multiple Linear Regression Analysis

Multiple Regression
 A statistical model that uses two or more quantitative and/or
qualitative explanatory variables (X1, ..., Xp) to predict a
quantitative dependent variable Y.
 Caution: there must be at least two explanatory variables.
 Multiple regression simultaneously considers the influence of
multiple explanatory variables on a response variable Y.

Simple vs. Multiple
Simple regression:
• β represents the unit change in Y per unit change in X.
• Does not take into account any other variable besides the single
independent variable.
• R2: proportion of variation in Y predictable from X.
Multiple regression:
• βi represents the unit change in Y per unit change in Xi.
• Takes into account the effect of the other independent variables.
• R2: proportion of variation in Y predictable from the set of X’s.

Multiple Regression Models
• Linear models: linear (first-order), dummy variable, interaction.
• Non-linear models: polynomial, square root, log, reciprocal,
exponential.
Multiple Linear Regression Model Building

Idea: examine the linear relationship between one dependent
variable (Y) and two or more independent variables (Xi).

Multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + ... + βkXki + Ui

where β0 is the Y-intercept, β1, ..., βk are the population slopes,
and Ui is the random error.
• The coefficients of the multiple regression model are estimated
using sample data with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + ... + bkXki

where Ŷi is the estimated (predicted) value of Y, b0 is the
estimated intercept, and b1, ..., bk are the estimated slope
coefficients.

• Interpretation of the slopes:
– b1 = the change in the mean of Y per unit change in X1, holding
the rest of the Xj’s constant (i.e., net of the other Xj’s).
– b0 = the Y intercept, interpreted as in simple regression.
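The estimated coefficients above are obtained by ordinary least squares. A minimal, self-contained sketch (using made-up data, not any dataset from these slides) of solving the normal equations b = (X'X)⁻¹X'y with only the standard library:

```python
def transpose(M):
    return [list(r) for r in zip(*M)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def solve(A, v):
    # Gauss-Jordan elimination with partial pivoting for the system A b = v.
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * z for x, z in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical data: a column of ones (intercept) plus two regressors X1, X2.
X = [[1, x1, x2] for x1, x2 in [(2, 1), (4, 3), (6, 2), (8, 5), (10, 4)]]
y = [5, 9, 11, 16, 18]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(xj * yi for xj, yi in zip(col, y)) for col in Xt]
b = solve(XtX, Xty)                  # [b0, b1, b2]: intercept and the two slopes
print([round(v, 3) for v in b])
```

In practice a statistical package (STATA, SPSS, etc.) does this, along with the standard errors; the sketch only shows where the coefficient column comes from.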
ASSUMPTIONS
• Linear regression model: The regression model is linear in the
parameters, though it may or may not be linear in variables.
• The X variables are independent of the error term. This means
that we require zero covariance between ui and each X variable:
cov(ui, X1i) = cov(ui, X2i) = ... = cov(ui, Xki) = 0
• Zero mean value of the disturbance term ui. Given the value
of Xi, the mean, or the expected value of the random disturbance
term ui is zero.
E(ui)= 0 for each i
• Homoscedasticity or constant variance of ui . This implies that
the variance of the error term is the same, regardless of the value
of X.
var(ui) = σ²
• No auto-correlation between the disturbance terms.

cov(ui, uj) = 0 for i ≠ j

 This implies that the observations are sampled independently.


• The number of observations n must be greater than the number
of parameters to be estimated.
• There must be variation in the values of the X variables.
Technically, var(X) must be a positive number.
• No strong/perfect multicollinearity: no exact linear relationship
exists between any of the explanatory variables.
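The last assumption can be seen numerically. A small sketch (with illustrative data) that flags perfect multicollinearity through pairwise correlations between explanatory variables:

```python
import math

def corr(x, z):
    # Pearson correlation coefficient between two equal-length samples.
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    cov = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    var_x = sum((a - mx) ** 2 for a in x)
    var_z = sum((b - mz) ** 2 for b in z)
    return cov / math.sqrt(var_x * var_z)

x1 = [2, 4, 6, 8, 10]
x2 = [1, 3, 2, 5, 4]
x3 = [2 * v for v in x1]       # x3 is an exact linear function of x1

print(corr(x1, x2))   # 0.8: correlated, but estimation is still possible
print(corr(x1, x3))   # 1.0: perfect multicollinearity; X'X becomes singular
```

With corr = 1 the normal equations have no unique solution, which is why the assumption rules out exact linear relationships among the X’s.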
Estimation of parameters and standard errors

The coefficient of determination and test of model adequacy

Test of the Significance of Individual Variables
• Use t-tests of the individual variable slopes
• to test whether there is a linear relationship between the variable
Xi and Y.
Hypotheses:
• H0: βi = 0 (no linear relationship)
• H1: βi ≠ 0 (a linear relationship does exist between Xi and Y)
• Test statistic:
t* = (bi − 0) / S(bi)

• Confidence interval for the population slope βi:
bi ± tc · S(bi)
• Then, as before, if |t*| > tc we reject the null hypothesis, and if
|t*| < tc we fail to reject (accept) the null.
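The test statistic and confidence interval above can be sketched numerically. The numbers here are hypothetical (a slope of −0.162 with standard error 0.0258, and 2.201 as the assumed two-tailed 5% critical value for 11 degrees of freedom):

```python
b_i, s_bi = -0.162, 0.0258   # hypothetical slope estimate and its standard error
t_c = 2.201                  # assumed two-tailed 5% critical value, 11 df

t_star = (b_i - 0) / s_bi    # test statistic for H0: beta_i = 0
ci_low = b_i - t_c * s_bi    # lower bound of the 95% confidence interval
ci_high = b_i + t_c * s_bi   # upper bound

reject_h0 = abs(t_star) > t_c
print(round(t_star, 2), reject_h0)          # -6.28 True
print(round(ci_low, 4), round(ci_high, 4))
```

Since |t*| far exceeds the critical value, the null of no linear relationship would be rejected; note also that the confidence interval excludes zero, which is the same conclusion stated differently.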
Assumptions and Procedures to Conduct Multiple
Linear Regression
 When you choose to analyse your data using multiple
regression, make sure that the data you want to analyse can
actually be analysed using multiple regression.
 It is only appropriate to use multiple regression if your data
"passes" eight assumptions that are required for multiple
regression to give you a valid result.
 Let's take a look at these eight assumptions:
Assumption #1:
 Your dependent variable should be measured on a continuous
scale.
Assumption #2:
 You should have two or more independent variables, which can
be either continuous or categorical (dummy).
Assumption #3:
 You should have independence of residuals, which you can
easily check using the Durbin-Watson statistic.
Assumption #4:
 There needs to be a linear relationship between:
 the dependent variable and each of your independent variables.
Assumption #5:
 Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move
along the line.
Assumption #6:
 Your data must not show multicollinearity, which occurs when
you have two or more independent variables that are highly
correlated with each other.
Assumption #7:
 There should be no significant outliers.
 Outliers can distort the output that the statistical software
produces and reduce the predictive accuracy of your results as
well as their statistical significance.
Assumption #8:
 Finally, you need to check that the residuals (errors) are
normally distributed.
You can check assumptions #3, #4, #5, #6, #7 and #8 using
STATA/SPSS.
Assumptions #1 and #2 should be checked first, before moving
on to assumptions #3, #4, #5, #6, #7 and #8.
 Just remember that if you do not run the statistical tests on
these assumptions correctly, the results you get when
running multiple regression might not be valid.

Given the assumptions and data on Y and a set of IVs (X1, ...,
XK), the following are suggested steps to conduct multiple
linear regression:
1. Select variables that you believe are linearly related to the
dependent variable.
2. Use software to generate the coefficients and the statistics
used to assess the model.
3. Diagnose violations of the required conditions/assumptions.
 If there are problems, attempt to remedy them.
4. Assess the model’s fit.
5. Test and interpret the coefficients.
6. Use the model to predict a value of the DV.
Regression Output Interpretation
Example
 In a study of consumer demand (Qd), multiple regression
analysis is done to examine the relationship between quantity
demanded and four potential predictors.
The four independent variables are: price, income, tax and Price
of related goods.
The output for this example is interpreted as follows:
The multiple correlation coefficient is 0.971.
 R is the correlation between the observed values of Y and the
values of Y predicted by the model.

Source | SS df MS Number of obs = 16

-------------+------------------------------ F(4,11) = 45.76

Model | 16478.6652 4 4119.66629 Prob > F = 0.0000

Residual | 990.272334 11 90.0247576 R-squared = 0.9433

-------------+------------------------------ Adj R-squared = 0.9227

Total | 17468.9375 15 1164.59583 Root MSE = 9.4881

------------------------------------------------------------------------------

Qd | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

p | -.1619978 .0258238 -6.27 0.000 -.2188356 -.1051599

I | .000022 .0008965 0.02 0.981 -.0019511 .001995

pr | -.4774324 .1851878 -2.58 0.026 -.8850281 -.0698367

tax | 6.270663 2.953327 2.12 0.057 -.229565 12.77089

_cons | 100.8259 14.82823 6.80 0.000 68.18922 133.4627
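The summary statistics in the table above are not independent numbers: each can be reproduced from the reported sums of squares (SS) and degrees of freedom (df). A short check:

```python
ss_model, df_model = 16478.6652, 4     # Model row of the ANOVA table
ss_resid, df_resid = 990.272334, 11    # Residual row

ss_total = ss_model + ss_resid          # Total SS = 17468.9375
r2 = ss_model / ss_total                # R-squared: explained share of variation
f_stat = (ss_model / df_model) / (ss_resid / df_resid)   # F(4, 11)
root_mse = (ss_resid / df_resid) ** 0.5                  # Root MSE

print(round(r2, 4), round(f_stat, 2), round(root_mse, 4))
# 0.9433 45.76 9.4881
```

These match the R-squared, F(4,11), and Root MSE printed by STATA in the header of the output.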


 Therefore, large values of R represent a large correlation between the
predicted and observed values of the outcome.

 An R of 1 implies the model perfectly predicts the observed values.

 0.971 implies a very strong correspondence between the predicted
and the actual Y.

 The R2 is 0.943.

 This means that the IVs explain 94.3% of the variation in the DV.

The adjusted R-squared, a measure of explanatory power, is 0.922.

 Unlike the R2, it is not interpreted as a percentage of variation
explained, and unlike the F-statistic it is not a test of significance;
it penalizes R2 for the number of regressors and is mainly used to
compare models.
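The adjusted R-squared reported above follows directly from the standard adjustment formula, adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), with n = 16 observations and k = 4 regressors:

```python
r2, n, k = 0.9433, 16, 4   # R-squared, sample size, number of regressors

# Penalize R-squared for each regressor used up in estimation.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))    # 0.9227
```

This reproduces the "Adj R-squared = 0.9227" line of the STATA output.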
The p value for the F statistic is <.01.
 This means that at least one of the IVs is a significant predictor
of the dependent variable (quantity demanded).
 This indicates rejection of the null hypothesis.

Interpreting Parameter Values (Model Coefficients)

The results of the estimated regression line include the estimated
coefficients, the standard error of the coefficients, the calculated
t-statistic, the corresponding p-value, and the bounds of the 95%
confidence intervals (STATA Version 12).
Finally, the above table will help us to determine whether quantity
demanded and explanatory variables are significantly related,
and the direction and strength of their relationship.
 The prediction equation is written as:

Qd-hat = 100.82 − 0.162p + 0.000022I − 0.477pr + 6.27tax
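The prediction equation can be applied directly. A sketch with hypothetical regressor values (p = 50, income = 1000, pr = 20, tax = 2; these inputs are made up for illustration):

```python
# Coefficients from the fitted equation above (income coefficient = 0.000022).
b0, b_p, b_inc, b_pr, b_tax = 100.82, -0.162, 0.000022, -0.477, 6.27

def predict_qd(p, income, pr, tax):
    # Plug the regressor values into the estimated regression line.
    return b0 + b_p * p + b_inc * income + b_pr * pr + b_tax * tax

qd_hat = predict_qd(50, 1000, 20, 2)
print(round(qd_hat, 2))   # 95.74
```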


Results of the multiple linear regression model showed that, of the 4
explanatory variables entered into the model, 2 of them, namely the price
of the product and the price of related commodities, were found to be
statistically significant at 5%, while tax is significant only at 10%.

 Results of the statistically significant variables are discussed as follows:

 The Constant is the predicted value of quantity demanded when all of the
independent variables have a value of zero.

 The b coefficient associated with price (-0.162) is negative,
indicating an inverse relationship in which a higher price of the
product is associated with a lower quantity demanded.
 For the independent variable price, the probability of the t
statistic (0.000) for the b coefficient is less than the level of
significance of 0.05.
 We reject the null hypothesis that the slope associated with
price is equal to zero and conclude that there is a statistically
significant relationship between price and quantity demanded.
 A unit increase/decrease in the price of the product leads to a
0.162 decrease/increase in quantity demanded, ceteris paribus.

 The income variable is found to be positively but insignificantly
(even at the 10% level of significance) related to quantity
demanded. Statistically, there is no relationship between income
and the quantity demanded of this good.
 The tax coefficient is statistically significant (at the 10%
probability level) and carries a positive sign.
 The slope of tax is 6.27. This means that for every one unit
increase/decrease in tax on the commodity, quantity demanded
will increase/decrease by 6.27 units, ceteris paribus. Of course,
this positive sign is not an economically valid conclusion.

Dummy Independent Variables
Describing Qualitative Information
• In regression analysis the dependent variable can be
influenced by variables that are essentially qualitative in
nature,
 such as sex, race, color, religion, nationality, geographical
region, political upheavals, and party affiliation.
• One way we could “quantify” such attributes is by
constructing artificial variables that take on values of 1 or 0,
 1 indicating the presence (or possession) of that attribute and 0
indicating the absence of that attribute.
• Variables that assume such 0 and 1 values are called dummy/
indicator/ binary/ categorical/ dichotomous variables.
Example 1:
Yi = α1 + α2Di + ui
where Yi = annual salary of a college professor
Di = 1 if male college professor
   = 0 otherwise (i.e., female professor)

 The model may enable us to find out whether sex makes any
difference in a college professor’s salary, assuming, of course,
that all other variables such as age, degree attained, and years of
experience are held constant.
 Mean salary of female college professor: E(Yi | Di = 0) = α1
 Mean salary of male college professor: E(Yi | Di = 1) = α1 + α2
 α2 tells by how much the mean salary of a male college professor
differs from the mean salary of his female counterpart.
 A test of the null hypothesis that there is no sex discrimination
(H0: α2 = 0) can easily be made by finding out whether the
estimated α2 is statistically significant on the basis of the t test.
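In this intercept-only dummy model, the OLS estimate of α2 is simply the difference between the two group means. A sketch with hypothetical salaries (in $1000s, invented for illustration):

```python
# Hypothetical salaries for female (D = 0) and male (D = 1) professors.
female = [52, 55, 58, 61]
male = [60, 63, 66, 69]

a1_hat = sum(female) / len(female)        # estimate of alpha1: female mean
a2_hat = sum(male) / len(male) - a1_hat   # estimate of alpha2: male-female gap

print(a1_hat, a2_hat)   # 56.5 8.0
```

Regressing salary on the dummy alone would return exactly these two numbers as the intercept and slope, which is why the dummy coefficient is read as a difference in group means.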
Example 2:
Yi = α1 + α2Di + βXi + ui
where Xi = years of teaching experience
Mean salary of female college professor: E(Yi | Xi, Di = 0) = α1 + βXi
Mean salary of male college professor: E(Yi | Xi, Di = 1) = (α1 + α2) + βXi

 The male and female college professors’ salary functions in relation
to the years of teaching experience have the same slope (β) but
different intercepts.
 Male intercept = α1 + α2
 Female intercept = α1
 Difference = α2
Note: If a qualitative variable has ‘m’ categories, introduce only ‘m−1’
dummy variables.
 The group, category, or classification that is assigned the value of 0
is often referred to as the base, benchmark, control, comparison,
reference, or omitted category.
Example 3: qualitative variable with more than two classes
 Regress the annual expenditure on health care by an individual
on the income and education of the individual.

Yi = α1 + α2D2i + α3D3i + βXi + ui

where Yi = annual expenditure on health care
Xi = annual income
D2 = 1 if high school education
   = 0 otherwise
D3 = 1 if college education
   = 0 otherwise

 “Less than high school education” is taken as the base category.
 Therefore, the intercept α1 will reflect the intercept for this
category.
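The m − 1 rule above can be sketched as a small encoding function: three education categories become two dummies, with the base category getting zeros on both (function name and return format are illustrative):

```python
def education_dummies(education):
    # "less than high school" is the omitted base category: both dummies 0.
    return {"D2": int(education == "high school"),
            "D3": int(education == "college")}

print(education_dummies("less than high school"))  # {'D2': 0, 'D3': 0}
print(education_dummies("college"))                # {'D2': 0, 'D3': 1}
```

Adding a third dummy for the base category would make the dummies sum to 1 for every observation, an exact linear relationship with the intercept column, i.e., perfect multicollinearity (the "dummy variable trap").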
• The mean health care expenditure functions for the three levels
of education, namely, less than high school, high school, and
college, are:

E(Yi | D2 = 0, D3 = 0, Xi) = α1 + βXi
E(Yi | D2 = 1, D3 = 0, Xi) = (α1 + α2) + βXi
E(Yi | D2 = 0, D3 = 1, Xi) = (α1 + α3) + βXi

Log-Level

 In a log-level model, log(wage) = β0 + β1educ + u, 100·β1 gives
the approximate percentage change in wage per unit change in
educ.
 Here, wage increases by about 8.3 percent for every additional
year of education.
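The 8.3 percent figure is the approximate effect, 100·β1; the exact percentage change implied by a log-level coefficient is 100·(exp(β1) − 1). A quick check, taking β1 = 0.083 from the statement above:

```python
import math

b1 = 0.083                          # coefficient on educ (figure quoted above)
approx_pct = 100 * b1               # approximate % change in wage
exact_pct = 100 * (math.exp(b1) - 1)   # exact % change

print(round(approx_pct, 1), round(exact_pct, 2))   # 8.3 8.65
```

For small coefficients the two are close; the gap widens as the coefficient grows, which is why the 100·β1 reading is only an approximation.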
Log-Log:

 The coefficient of log(sales) is the estimated elasticity of salary
with respect to sales.
• It implies that a 1 percent increase in firm sales increases salary
by about 0.257 percent, the usual interpretation of an elasticity.

Level-Log:
 This form arises less often in practice.

Y = β0 + β1log(x) + u

With Ŷ = 110 + 12·log(x), a 1 percent increase in x changes Ŷ by
about β1/100 = 12/100 = 0.12 units.
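The β1/100 rule can be verified directly from the fitted equation, using an arbitrary hypothetical starting value of x (the 500 below is invented; the result does not depend on it):

```python
import math

def y_hat(x):
    return 110 + 12 * math.log(x)   # the fitted level-log equation from the text

x0 = 500                            # arbitrary hypothetical starting value
change = y_hat(1.01 * x0) - y_hat(x0)   # effect of a 1 percent increase in x
print(round(change, 3))             # 0.119, close to 12/100 = 0.12
```

The change equals 12·log(1.01) regardless of x0, which is why the rule of thumb divides the coefficient by 100.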
