Chapter 3

Multiple Regression Model


Regression Model & Regression Equation
• Multiple regression analysis is the study of how a
dependent variable y is related to two or more
independent variables.
• In the general case, we will use p to denote the
number of independent variables.
• The concepts of a regression model and a regression
equation introduced in the preceding chapter are
applicable in the multiple regression case.
• The equation that describes how the dependent
variable y is related to the independent variables x1,
x2, . . . , xp and an error term is called the multiple
regression model.
• We begin with the assumption that the multiple
regression model takes the following form.
• MULTIPLE REGRESSION MODEL
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
• ESTIMATED MULTIPLE REGRESSION EQUATION
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
where
• b0, b1, b2, . . . , bp are the point estimates of β0, β1, β2, . . . , βp
• ŷ is the estimated value of the dependent variable
Eg. If ŷ = 27 + 9x1 + 12x2,
ŷ is predicted sales ('000), x1 is capital investment ('000)
and x2 is marketing expenditure ('000).
Thus, 9,000 is the estimated increase in sales (y)
corresponding to a 1,000 increase in capital investment when
marketing expenditure is held constant.
NB: in MR, each coefficient is interpreted as the estimated
change in y corresponding to a one-unit change in that variable,
when all other variables are held constant.
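A minimal Python sketch of this interpretation, using the hypothetical sales equation from the example above (the function name and input values are illustrative):

```python
# Hypothetical estimated equation from the example above:
# y-hat = 27 + 9*x1 + 12*x2, with sales, capital and marketing all in '000.
def predicted_sales(capital, marketing):
    return 27 + 9 * capital + 12 * marketing

base = predicted_sales(10, 5)        # x1 = 10, x2 = 5
plus_one = predicted_sales(11, 5)    # x1 raised by 1, x2 held constant
print(plus_one - base)               # prints 9: sales rise by 9 ('000) per unit of x1
```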
Estimation Process
• Multiple regression model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε, with
unknown parameters β0, β1, β2, . . . , βp.
• Multiple regression equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp.
• Sample data on x1, x2, . . . , xp and y are used to compute the sample
statistics b0, b1, b2, . . . , bp.
• Estimated multiple regression equation: ŷ = b0 + b1x1 + b2x2 + . . . + bpxp,
where b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp.
Least Squares Method
• In Chapter 2, we used the least squares
method to develop the estimated regression
equation that best approximated the straight-
line relationship between the dependent and
independent variables.
• This same approach is used to develop the
estimated multiple regression equation. The
least squares criterion is restated as follows.
LEAST SQUARES CRITERION
Min Σ(yi − ŷi)²
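As an illustration, here is a minimal sketch of the least squares criterion in Python (NumPy), fitted to the delivery data that appears later in this chapter; it is only a sketch, not the chapter's own computation:

```python
import numpy as np

# x: (n, p) matrix of independent variables; y: (n,) vector of responses.
# Data are the miles-traveled and number-of-deliveries columns used later.
x = np.array([[88, 4], [66, 1], [78, 3], [111, 6], [44, 1],
              [77, 3], [80, 3], [66, 2], [109, 5], [76, 3]], dtype=float)
y = np.array([7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4])

X = np.column_stack([np.ones(len(y)), x])     # prepend a column of 1s for b0
b, *_ = np.linalg.lstsq(X, y, rcond=None)     # b = [b0, b1, ..., bp]
sse = np.sum((y - X @ b) ** 2)                # the quantity being minimized
print(b.round(4), round(sse, 4))
```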
Assumptions About the Error Term e
• The error e is a random variable with a mean of zero.
• The variance of e, denoted by σ², is the same for all
values of the independent variables.
• The values of e are independent.
• The error e is a normally distributed random variable
reflecting the deviation between the y value and the
expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.
Testing for Significance
• In simple linear regression, the F and t tests provide the same
conclusion.
• In multiple regression, the F and t tests have different purposes.
• The F test is used to determine whether a significant relationship exists
between the dependent variable and the set of all the independent
variables.
• The F test is referred to as the test for overall significance.
• The F statistic tests the null hypothesis that all of the regression
coefficients are equal to zero.
• Thus, if the p-value of the F test is less than α, you reject H0.
Testing for Significance: t Test
• If the F test shows overall significance, the t test is used to determine
whether each of the individual independent variables is significant.
• A separate t test is conducted for each of the independent variables in
the model.
• We refer to each of these t tests as a test for individual significance.
Testing for Significance: F Test
Hypotheses:      H0: β1 = β2 = . . . = βp = 0
                 Ha: One or more of the parameters is not equal to zero.
Test statistic:  F = MSR/MSE
                 where MSR = SSR/p and MSE = SSE/(n − p − 1)
Rejection rule:  Reject H0 if p-value < α or if F > Fα,
                 where Fα is based on an F distribution
                 with p d.f. in the numerator and
                 n − p − 1 d.f. in the denominator.
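A short Python (SciPy) sketch of this F test; the SSR, SSE, n and p figures plugged in are those of the miles-traveled regression shown later in the chapter:

```python
from scipy import stats

def f_test(ssr, sse, n, p, alpha=0.05):
    """Overall F test of H0: beta_1 = ... = beta_p = 0."""
    msr = ssr / p                             # MSR = SSR / p
    mse = sse / (n - p - 1)                   # MSE = SSE / (n - p - 1)
    f = msr / mse
    p_value = stats.f.sf(f, p, n - p - 1)     # upper-tail area of F(p, n-p-1)
    return f, p_value, p_value < alpha        # True -> reject H0

print(f_test(ssr=5.8115, sse=0.9575, n=10, p=1))   # F ~ 48.6, p ~ 0.0001
```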
Testing for Significance: t Test
Hypotheses:      H0: βi = 0
                 Ha: βi ≠ 0
Test statistic:  t = bi / sbi
Rejection rule:  Reject H0 if p-value < α or
                 if t < −tα/2 or t > tα/2, where tα/2
                 is based on a t distribution
                 with n − p − 1 degrees of freedom.
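A matching sketch for the individual t test; the coefficient and standard error plugged in are those of the miles-traveled slope reported later in the chapter:

```python
from scipy import stats

def t_test(b_i, s_b_i, n, p, alpha=0.05):
    """Individual t test of H0: beta_i = 0 (two-sided)."""
    t = b_i / s_b_i
    df = n - p - 1
    p_value = 2 * stats.t.sf(abs(t), df)      # two-tailed p-value
    return t, p_value, p_value < alpha        # True -> reject H0

print(t_test(b_i=0.040288, s_b_i=0.005782, n=10, p=1))   # t ~ 6.97, p ~ 0.0001
```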
Testing for Significance: Multicollinearity
• The term multicollinearity refers to the correlation among the
independent variables.
• When the independent variables are highly correlated (say, |r| > 0.7), it
is not possible to determine the separate effect of any particular
independent variable on the dependent variable.
• If the estimated regression equation is to be used only for predictive
purposes, multicollinearity is usually not a serious problem.
• Every attempt should be made to avoid including independent variables
that are highly correlated.
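One quick way to screen for this is a pairwise correlation check; a hedged sketch using pandas and the |r| > 0.7 rule of thumb above (column names are illustrative, data are from the delivery example later in the chapter):

```python
import pandas as pd

X = pd.DataFrame({
    "miles":      [88, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "deliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gas_price":  [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
})
corr = X.corr()
print(corr.round(2))

# Flag predictor pairs whose correlation exceeds the 0.7 rule of thumb.
pairs = [(a, b, round(corr.loc[a, b], 2))
         for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > 0.7]
print(pairs)    # expect miles & deliveries to be flagged
```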
Signs of Multicollinearity
 High correlation between pairs of predictor variables (X1, X2, . . . , Xn).
 The magnitude or sign of the regression coefficients changes markedly
when a predictor variable is inserted into or deleted from the model.
NB: the idea is for all independent variables to be correlated with the
dependent variable but not with each other.
Pre-work
• Because of the multicollinearity problem and the non-significance of some
regressors, one has to check whether these occur using any of the
following:
1. Scatter plots
2. Correlations
3. Simple regressions
Some independent variables (IVs), or sets of IVs, are better at predicting
the dependent variable; some contribute nothing.
We must therefore identify which independent variables to include in the
model and which to exclude.
Steps
1. Generate a list of potential variables (IV & DV)
2. Collect data on the variables
3. Check the relationships between each IV and the
DV using scatterplots and correlations
4. Check the relationships among the IVs using
scatter plot and correlations
5. Use the non-redundant simple linear regression
for each IV & DV pair
6. Use the best fitting model to make predictions
about the DV
Example
• Let's assume that you are the owner of a small regional delivery service
that offers same-day delivery for letters, packages and other small
cargo.
• You have grouped individual deliveries into one trip to reduce time and
fuel costs. Therefore some trips will have more than one delivery.
• As the owner, you would like to be able to estimate how long a delivery
will take based on three factors: the distance of a trip, the number of
deliveries made in a trip, and the gas price.
Step 1 & 2
֎ List of potential variables
֎ Collecting data on each variable

Miles traveled (x1)   Number of deliveries (x2)   Gas price (x3)   Travel time (y)
88                    4                           3.84             7
66                    1                           3.19             5.4
78                    3                           3.78             6.6
111                   6                           3.89             7.4
44                    1                           3.57             4.8
77                    3                           3.57             6.4
80                    3                           3.03             7
66                    2                           3.51             5.6
109                   5                           3.54             7.3
76                    3                           3.25             6.4
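If you prefer to carry out steps 3 and 4 in Python rather than Excel, here is a hedged sketch that loads the table above into pandas and produces the correlations and scatter plots (column names are illustrative; matplotlib is assumed to be installed):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "miles":      [88, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "deliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gas_price":  [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
    "time":       [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

# Step 3: correlation of each candidate IV with the DV.
print(df.corr()["time"].drop("time").round(2))

# Scatter plots corresponding to panels A-C on the next slide.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, col in zip(axes, ["gas_price", "deliveries", "miles"]):
    ax.scatter(df[col], df["time"])
    ax.set_xlabel(col)
    ax.set_ylabel("travel time (hr)")
plt.show()
```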
Step 3
Scatter plots of the dependent variable against each explanatory variable
(figures omitted):
A. Travel time (y) & gas price
B. Number of deliveries & travel time
C. Miles traveled & travel time
Scatterplot Summary
• As observed from the scatter plots between each IV and the DV, the
relationship between gas price and travel time (y) is not linear and no
correlation exists.
• Gas price is therefore not used as an explanatory variable in the
regression model.
• However, number of deliveries and miles traveled are each linearly
related to travel time. These two IVs are highly correlated with the DV.
Step 4
Scatter plots between the explanatory variables (figures omitted):
• X1 & X2
• X1 & X3
• X2 & X3
Summary
• X1 & X2 are highly correlated variables.
• X1 & X3 have no collinear relation.
• X2 & X3 have no collinear relation.
• In this regard, X1 & X2 show multicollinearity, so only one of them
should normally be used in the regression model to reduce redundancy.
• For now, let us consider X1 & X2 in the model but not X3, because X3 is
not correlated with the dependent variable y.
Step 5
Regression of variables
• Let us perform a simple regression for each independent variable
individually. We can do this using Excel.
• Then discuss the interpretation of:
Coefficients: values, t-statistics, p-values
ANOVA: F-value, p-value
R-squared, adjusted R-squared
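A sketch of the same step in Python (statsmodels); column names are illustrative, and the deliveries and gas-price runs may differ slightly from the Excel output below, which used 9 observations for those variables:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "miles":      [88, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "deliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gas_price":  [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
    "time":       [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

# One simple regression of travel time on each candidate variable.
for col in ["miles", "deliveries", "gas_price"]:
    fit = sm.OLS(df["time"], sm.add_constant(df[col])).fit()
    print(col, "R2 =", round(fit.rsquared, 3),
          "F =", round(fit.fvalue, 2), "p =", round(fit.f_pvalue, 5))
    # fit.params, fit.tvalues and fit.pvalues hold the coefficient table
```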
DF
• Degrees of freedom (df) is the number of values in a dataset that are
free to vary. It is used when estimating parameters in statistical
analysis, or when finding a missing or unknown value in a final
calculation.
• In other words, degrees of freedom refers to the number of independent
values in a data sample that can be used to find a missing piece of
information without violating any constraints. Because these values are
free to vary, the unknown or missing value in a dataset can be
determined.
Travel time (y) and miles traveled (x1): Regression
SUMMARY OUTPUT

Table 1: Regression Statistics
Multiple R           0.926577
R Square             0.858546
Adjusted R Square    0.840864
Standard Error       0.345959
Observations         10

Table 2: ANOVA
              df    SS          MS          F           Significance F
Regression    1     5.811496    5.811496    48.55539    0.000116
Residual      8     0.957504    0.119688
Total         9     6.769000

Table 3: Coefficients
               Coefficients   Standard Error   t Stat      P-value     Lower 95%
Intercept      3.187130       0.472483         6.745489    0.000146    2.097582
X Variable 1   0.040288       0.005782         6.968170    0.000116    0.026955
Result interpretations
Table 1: Regression statistics
 Multiple R (0.926577) is the same as the correlation between the two
variables (X1 & y).
 R2 represents the proportion (85.8%) of the variations of y accounted for
by independent variable X1.
 Adjusted R2 is just the R2 adjusted for the number of independent
variables in the model.
Adj. R Squared: Each time you add an independent variable to the model,
the R-squared increases, even if the independent variable is not significant.
It never decreases. Whereas Adjusted R-squared increases only when the
independent variable is significant and affects the dependent variable.
In our case, both are high enough to indicate that the independent
variable sufficiently explains the variable y.
 Standard error of regression: 0.345959 is the average distance of the
data points from the regression line, in units of the dependent variable:
the data points are on average about 0.346 hrs away from the regression
line. It measures how tightly the data points cluster around the
regression line; the smaller it is, the tighter the clustering, and the
larger it is, the more scattered the data are.
Interpreting…
• The t value column displays the test statistics. Unless
otherwise specified, the test statistic used in linear
regression is the t-value from a two-sided t-test.
The larger the test statistic, the less likely it is that the
results occurred by chance.
• The P-value column shows the p-value. This shows
how likely the calculated t-value would have
occurred by chance if the null hypothesis of no effect
of the parameter were true.
• If these values are very low (p < 0.001 in both cases), we can reject the
null hypothesis and conclude that X1 & X2 likely influence Y.
Interpreting the results
• The Estimate column is the estimated effect, also called the regression
coefficient. The estimates in the table tell us that for every one-unit
increase in X1 there is an associated increase (or decrease) in Y equal
to that coefficient, and likewise for every one-unit increase in X2,
with the other variable held constant.
• The Std. Error column displays the standard error of the estimate. This
number shows how much variation there is around the estimate of the
regression coefficient.
Results….
Table 2: ANOVA is a technique for understanding the variance of variables.
It gives us the significance of the overall model: F-statistic 48.55 with a
p-value of 0.000116. Since the p-value is less than α, the model is
significant.
 The F test of overall significance in regression tests whether or not
your linear regression model provides a better fit to the dataset than a
model with no predictor variables.
• If the p-value associated with the F-statistic is ≥ α: then no
significant relationship between any of the independent variables and Y
is found (the model is not significant).
• If the p-value associated with the F-statistic is < α: then AT LEAST one
independent variable is significantly related to Y.
• An additional part is the Sum of Squares (SS) decomposition:

ANOVA
              df    SS
Regression    1     5.811496    (SSR)
Residual      8     0.957504    (SSE)
Total         9     6.769000    (SST)
Results…
• Table 3: Coefficients
 Intercept: 3.187 is the estimated value of y when the value of X (the
explanatory variable) is zero.
 Slope: 0.040288, rounded to 0.0403; for every additional mile traveled,
travel time increases by 0.0403 hours.
 The p-value of the miles-traveled coefficient is the same as the p-value
of the ANOVA (0.000116) because there is only a single variable. It shows
significance.
Thus, Ŷ = 3.187 + 0.0403X1
For X (miles traveled) = 84:
Ŷ = 3.187 + (0.0403 × 84) = 3.187 + 3.385 = 6.572 hrs ≈ 6:34*
Interval: 6.572 ± (2.31 × 0.3459)
Ŷ = 5.773 to 7.371 hrs**    (2.31 is obtained from the t-table with df = 8
(n − 2) and α = 0.05)
* POINT ESTIMATE
** INTERVAL ESTIMATE
≈ 5.77 to 7.37 hours of travel for 84 miles traveled, with an approximate
95% interval.
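The same point and interval estimate in a short Python (SciPy) sketch; note the exact t critical value for df = 8 is about 2.306, so the interval differs marginally from the hand calculation with 2.31:

```python
from scipy import stats

b0, b1, s, n = 3.187, 0.0403, 0.3459, 10    # figures from the output above
x_new = 84

y_hat = b0 + b1 * x_new                     # point estimate
t_crit = stats.t.ppf(0.975, df=n - 2)       # ~2.306 for df = 8, alpha = 0.05
low, high = y_hat - t_crit * s, y_hat + t_crit * s
print(round(y_hat, 3), round(low, 3), round(high, 3))   # ~6.572, ~5.775, ~7.370
```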
Regression: travel time (y) vs. number of deliveries (x2)
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.913347
R Square             0.834202
Adjusted R Square    0.810517
Standard Error       0.387987
Observations         9

ANOVA
              df    SS          MS          F           Significance F
Regression    1     5.301818    5.301818    35.22009    0.000579
Residual      7     1.053737    0.150534
Total         8     6.355556

                  Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept         4.849495       0.279836         17.32979    5.24E-07    4.187789    5.511201
X Variable (x2)   0.490909       0.082719         5.934652    0.000579    0.295309    0.686509
Interpretation of results
R² represents the proportion (83%) of the variation of y accounted for by
the independent variable X2, and the adjusted R² (0.81) is high enough.
Coefficient
4.849 is the intercept and 0.491 is b2, meaning that
Ŷ = 4.849 + 0.491X2
For 4 deliveries, the total estimated delivery time is
4.849 + (4 × 0.491) = 6.813 hrs (≈ 6:49)
Regression: travel time (y) vs. gas price (x3)

Regression Statistics
Multiple R           0.190070
R Square             0.036126
Adjusted R Square    -0.101570
Standard Error       0.935487
Observations         9

ANOVA
              df    SS          MS          F           Significance F
Regression    1     0.229604    0.229604    0.262364    0.624262
Residual      7     6.125952    0.875136
Total         8     6.355556

                  Coefficients   Standard Error   t Stat      P-value     Lower 95%
Intercept         4.200875       4.153244         1.011468    0.345470    -5.619987
X Variable (x3)   0.609388       1.189713         0.512214    0.624262    -2.203835
Summary

   F         P-value    S       R²       Adj. R²    Variable
   48.55     0.000116   0.345   0.8585   0.8408     X1
   35.22     0.0005     0.387   0.83     0.81       X2
   0.26234   0.6242     0.935   0.036    -0.1015    X3

From the above summary and the foregoing regressions, we understand that X3
has no relation with the dependent variable and does not significantly
describe it. Thus only X1 & X2 are included in the regression model.
If you are asked to suggest the single best variable to use, you will
definitely suggest X1, because it has the highest F value, the lowest
standard error, and the highest R and R².
• The data for the two variables are given below:

Miles traveled (x1)   Number of deliveries (x2)   Total travel hr (y)
89                    4                           7
66                    1                           5.4
78                    3                           6.6
111                   6                           7.4
44                    1                           4.8
77                    3                           6.4
80                    3                           7
66                    2                           5.6
109                   5                           7.3
76                    3                           6.4

• Here, travel hours is the dependent variable (DV), depending on two
factors: miles traveled (IV1) and number of deliveries (IV2).
• The aim is to be able to predict how long it takes, in hours, to make a
delivery.
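A hedged sketch of fitting the two-variable model to this table with statsmodels (column names are illustrative):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "miles":      [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "deliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "time":       [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})
X = sm.add_constant(df[["miles", "deliveries"]])
model = sm.OLS(df["time"], X).fit()

print(model.params.round(4))              # b0, b1 (miles), b2 (deliveries)
print("R2 =", round(model.rsquared, 3),
      "adj R2 =", round(model.rsquared_adj, 3))
print("F =", round(model.fvalue, 2), "p =", round(model.f_pvalue, 5))
```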
Is R² misleading?
R²: R² increases when you add variables into your model, whether the
variables are significant or not.
Adjusted R²: a modified version of R² adjusted for the number of
predictors in the model. It increases only when the added independent
variable is significant and affects the dependent variable.
 Predicted R²: indicates how well a regression model predicts responses
for new observations.
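For reference, the usual adjusted R² formula, checked against the miles-traveled regression earlier in the chapter (R² = 0.8585, n = 10, p = 1):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes R-squared for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.8585, 10, 1), 4))   # ~0.8408, matching the output above
```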
Model options summary
• Assume the following summary is obtained from regression:

   F       P-value   S         Adj R²    Pred R²   X1   X2   X3   Selection
   23.72   0.001     0.35264   83.47%    59.95     x    x
   22.36   0.001     0.35988   82.78%    68.11     x         x    NO
   27.63   <0.001    0.32970   85.55%    71.76          x    x    NO
   16.99   0.002     0.34469   84.2      57.19     x    x    x    NO
• SELECTION of the best fitting model:
 high F-value and p-value below α
 high adjusted R² and lower standard error
 low drop in predicted R²
Assumptions of MLR
Multiple linear regression makes all of the same assumptions as the simple
linear regression model (SLRM):
• Homogeneity of variance (homoscedasticity): the size of the
error in our prediction doesn’t change significantly across the
values of the independent variable.
• Independence of observations: the observations in the dataset
were collected using statistically valid methods, and there are no
hidden relationships among variables.
• In MLR, it is possible that some of the independent variables are
actually correlated with one another, so it is important to check
these before developing the regression model.
• If two independent variables are too highly correlated (r2 > ~0.6),
then only one of them should be used in the regression model.
• Normality: The data follows a NORMAL DISTRIBUTION.
• Linearity: the line of best fit through the data points is a straight
line, rather than a curve or some sort of grouping factor.
Report the results
Use the following format to report the results of an MLRM:
• Multiple linear regression was used to test whether [predictor variables]
significantly predicted [response variable]. The fitted regression model
was: [fitted regression equation].
• The overall regression was statistically significant (R² = [R² value],
F(df regression, df residual) = [F-value], p = [p-value]).
• It was found that [predictor variable 1] significantly predicted
[response variable] (β = [β-value], p = [p-value]).
• It was found that [predictor variable 2] did not significantly predict
[response variable] (β = [β-value], p = [p-value]).
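A sketch of pulling these reporting quantities from a fitted statsmodels model, using the chapter's miles-traveled regression as the example (names are illustrative):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"miles": [88, 66, 78, 111, 44, 77, 80, 66, 109, 76],
                   "time":  [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4]})
m = sm.OLS(df["time"], sm.add_constant(df["miles"])).fit()

print(f"The overall regression was statistically significant "
      f"(R2 = {m.rsquared:.3f}, F({int(m.df_model)}, {int(m.df_resid)}) = "
      f"{m.fvalue:.2f}, p = {m.f_pvalue:.4g}).")
print(f"Miles traveled significantly predicted travel time "
      f"(beta = {m.params['miles']:.4f}, p = {m.pvalues['miles']:.4g}).")
```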
