Applied Business Forecasting and Planning: Multiple Regression Analysis
Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε
• β₀, β₁, …, βₖ are parameters
• X₁, X₂, …, Xₖ are known constants
• ε, the error terms, are independent N(0, σ²)
Estimating the parameters of the model
The values of the regression parameters i are not
known. We estimate them from data.
As in the simple linear regression case, we use the
least-squares method to fit a linear function
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
to the data.
The least-squares method chooses the b’s that
make the sum of squares of the residuals as small
as possible.
Estimating the parameters of the model
The least-squares estimates are the values that
minimize the quantity
Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

The three related sums of squares are Σ(yᵢ − ȳ)² (total), Σ(ŷᵢ − ȳ)² (regression), and Σ(yᵢ − ŷᵢ)² (error), which satisfy the decomposition

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²
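To make this concrete, here is a minimal NumPy sketch of the least-squares fit on invented data; np.linalg.lstsq chooses the b's that minimize the residual sum of squares:

```python
import numpy as np

# Invented data: 6 observations, 2 predictors.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Prepend a column of ones so the first coefficient is the intercept b0.
X1 = np.column_stack([np.ones(len(y)), X])

b, rss, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
print("b0, b1, b2:", b)
print("residual sum of squares:", np.sum((y - X1 @ b) ** 2))
```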
Example: Sales Forecasting
We want to forecast product sales with the help of other variables that are likely to have a bearing on sales.
The growth of cable television has created vast new potential markets.
Model with four predictors:
Regression Statistics
Multiple R 0.884267744
R Square 0.781929444
Adjusted R Square 0.723777295
Standard Error 142.9354188
Observations 20
ANOVA
df SS MS F Significance F
Regression 4 1098857.84 274714.4601 13.44626923 7.52E-05
Residual 15 306458.0092 20430.53395
Total 19 1405315.85
Model with three predictors:
Regression Statistics
Multiple R 0.882638739
R Square 0.779051144
Adjusted R Square 0.737623233
Standard Error 139.3069743
Observations 20
ANOVA
df SS MS F Significance F
Regression 3 1094812.92 364937.64 18.80498277 1.69966E-05
Residual 16 310502.9296 19406.4331
Total 19 1405315.85
Model with two predictors:
Regression Statistics
Multiple R 0.8802681
R Square 0.774871928
Adjusted R Square 0.748386273
Standard Error 136.4197776
Observations 20
ANOVA
df SS MS F Significance F
Regression 2 1088939.802 544469.901 29.2562866 3.13078E-06
Residual 17 316376.0474 18610.35573
Total 19 1405315.85
SUBSCRIBE = 51.42 + 0.27(ADRATE) − 0.02(SIGNAL) + 0.44(APIPOP) + 16.23(COMPETE)
Example: Sales Forecasting
Fit the model
APIPOP = β₀ + β₁(COMPETE) + β₂(ADRATE) + β₃(SIGNAL)

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.878054
R Square 0.770978
Adjusted R Square 0.728036
Standard Error 264.3027
Observations 20
ANOVA
df SS MS F Significance F
Regression 3 3762601 1254200 17.9541 2.25472E-05
Residual 16 1117695 69855.92
Total 19 4880295
Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept -472.685 139.7492 -3.38238 0.003799 -768.9402258 -176.43
Compete 159.8413 28.29157 5.649786 3.62E-05 99.86587622 219.8168
ADRATE 0.048173 0.149395 0.322455 0.751283 -0.268529713 0.364876
Signal 0.037937 0.083011 0.457012 0.653806 -0.138038952 0.213913
Example: Sales Forecasting
Fit the model
COMPETE = β₀ + β₁(ADRATE) + β₂(APIPOP) + β₃(SIGNAL)
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.882936
R Square 0.779575
Adjusted R Square 0.738246
Standard Error 1.34954
Observations 20
ANOVA
df SS MS F Significance F
Regression 3 103.0599 34.35329 18.86239 1.66815E-05
Residual 16 29.14013 1.821258
Total 19 132.2
Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept 3.10416 0.520589 5.96278 1.99E-05 2.000559786 4.20776
ADRATE 0.000491 0.000755 0.649331 0.525337 -0.001110874 0.002092
Signal 0.000334 0.000418 0.799258 0.435846 -0.000552489 0.001221
APIPOP 0.004167 0.000738 5.649786 3.62E-05 0.002603667 0.005731
Example: Sales Forecasting
Fit the model
SIGNAL = β₀ + β₁(ADRATE) + β₂(APIPOP) + β₃(COMPETE)
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.512244
R Square 0.262394
Adjusted R Square 0.124092
Standard Error 790.8387
Observations 20
ANOVA
df SS MS F Significance F
Regression 3 3559789 1186596 1.897261 0.170774675
Residual 16 10006813 625425.8
Total 19 13566602
Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept 5.171093 547.6089 0.009443 0.992582 -1155.707711 1166.05
APIPOP 0.339655 0.743207 0.457012 0.653806 -1.235874129 1.915184
Compete 114.8227 143.6617 0.799258 0.435846 -189.7263711 419.3718
ADRATE -0.38091 0.438238 -0.86919 0.397593 -1.309935875 0.548109
Example: Sales Forecasting
Fit the model
ADRATE = β₀ + β₁(SIGNAL) + β₂(APIPOP) + β₃(COMPETE)
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.399084
R Square 0.159268
Adjusted R Square 0.001631
Standard Error 440.8588
Observations 20
ANOVA
df SS MS F Significance F
Regression 3 589101.7 196367.2 1.010346 0.413876018
Residual 16 3109703 194356.5
Total 19 3698805
Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept 253.7304 298.6063 0.849716 0.408018 -379.2865355 886.7474
Signal -0.11837 0.136186 -0.86919 0.397593 -0.407073832 0.170329
APIPOP 0.134029 0.415653 0.322455 0.751283 -0.747116077 1.015175
Compete 52.3446 80.61309 0.649331 0.525337 -118.5474784 223.2367
Example: Sales Forecasting
VIF Calculation Results:
Variable  R-Squared  VIF
ADRATE 0.159268 1.19
COMPETE 0.779575 4.54
SIGNAL 0.262394 1.36
APIPOP 0.770978 4.36
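These VIF figures can be reproduced by regressing each predictor on the remaining ones, exactly as in the auxiliary regressions above. A minimal sketch, assuming the 20 × 4 predictor matrix from the example is available as a NumPy array:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS regression of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def vif_table(data, names):
    """Regress each predictor on the others; VIF = 1 / (1 - R^2)."""
    for j, name in enumerate(names):
        others = np.delete(data, j, axis=1)
        r2 = r_squared(others, data[:, j])
        print(f"{name:8s} R^2 = {r2:.4f}  VIF = {1 / (1 - r2):.2f}")

# Usage, assuming X holds the 20 x 4 predictor matrix from the example:
# vif_table(X, ["ADRATE", "COMPETE", "SIGNAL", "APIPOP"])
```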
Indicator variables take on only the values 0 and 1.
For the insurance innovation example, where the
qualitative variable has two classes, we might
define the indicator variable x₂ as follows:
x₂ = 1 if stock company, 0 otherwise
Indicator variables
A qualitative variable with c classes will be
represented by c-1 indicator variables.
A regression function with an indicator
variable with two levels (c = 2) will yield
two estimated lines.
Interpretation of Regression Coefficients
In our insurance innovation example, the
regression model is:
y = β₀ + β₁x₁ + β₂x₂ + ε
Where:
x₁ = size of firm
x₂ = 1 if stock company, 0 otherwise
Interpretation of Regression Coefficients
To understand the meaning of the regression
coefficients in this model, consider first the case
of a mutual firm. For such a firm, x₂ = 0 and the
response function is:
ŷ = b₀ + b₁x₁ + b₂(0) = b₀ + b₁x₁  (mutual firms)
For a stock firm, x₂ = 1 and the response
function is:
ŷ = b₀ + b₁x₁ + b₂(1) = (b₀ + b₂) + b₁x₁  (stock firms)
Interpretation of Regression Coefficients
The response function for mutual firms is a
straight line with y-intercept β₀ and slope β₁.
For stock firms, this also is a straight line, with the
same slope β₁ but with y-intercept β₀ + β₂.
With reference to the insurance innovation
example, the mean time elapsed before the
innovation is adopted is a linear function of firm
size (x₁), with the same slope β₁ for both types of
firms.
Interpretation of Regression Coefficients
β₂ indicates how much lower or higher the
response function for stock firms is than the one for
mutual firms.
β₂ measures the differential effect of firm type.
In general, β₂ shows how much higher (lower) the
mean response line is for the class coded 1 than
the line for the class coded 0, for any level of x₁.
Example: Insurance Innovation Adoption
Here is the data set for the insurance innovation example:
Months Elapsed  Size of Firm  x₂  Type of Firm
17 151 0 Mutual
26 92 0 Mutual
21 175 0 Mutual
30 31 0 Mutual
22 104 0 Mutual
0 277 0 Mutual
12 210 0 Mutual
19 120 0 Mutual
4 290 0 Mutual
16 238 1 Stock
28 164 1 Stock
15 272 1 Stock
11 295 1 Stock
38 68 1 Stock
31 85 1 Stock
21 224 1 Stock
20 166 1 Stock
13 305 1 Stock
30 124 1 Stock
14 246 1 Stock
Example: Insurance Innovation Adoption
Fitting the regression model
y = β₀ + β₁x₁ + β₂x₂ + ε
Where
x₁ = size of firm
x₂ = 1 if stock company, 0 otherwise
Regression Statistics
Multiple R 0.95993655
R Square 0.92147818
Adjusted R Square 0.91224031
Standard Error 2.78630562
Observations 20
ANOVA
df SS MS F Significance F
Regression 2 1548.820517 774.4103 99.75016 4.04966E-10
Residual 17 131.979483 7.763499
Total 19 1680.8
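This fit can be reproduced directly from the 20 observations listed in the table above; a minimal NumPy sketch:

```python
import numpy as np

# The 20 observations from the table above.
months = np.array([17, 26, 21, 30, 22, 0, 12, 19, 4,
                   16, 28, 15, 11, 38, 31, 21, 20, 13, 30, 14], dtype=float)
size = np.array([151, 92, 175, 31, 104, 277, 210, 120, 290,
                 238, 164, 272, 295, 68, 85, 224, 166, 305, 124, 246],
                dtype=float)
stock = np.array([0] * 9 + [1] * 11, dtype=float)  # x2: 1 = stock company

X = np.column_stack([np.ones(20), size, stock])
b, *_ = np.linalg.lstsq(X, months, rcond=None)
print("b0, b1, b2:", b)   # b2 estimates the stock-vs-mutual differential
```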
[Figure: Private Housing Starts (PHS) in Thousands of Units, quarterly, Mar-90 through Nov-98. "1" marks the first quarter of each year.]
Example: Private Housing Starts (PHS)
To account for and measure this seasonality in a
regression model, we will use three dummy
variables: Q2 for the second quarter, Q3 for the
third quarter, and Q4 for the fourth quarter. These
are coded as follows:
Q2 = 1 for all second quarters and zero otherwise.
Q3 = 1 for all third quarters and zero otherwise.
Q4 = 1 for all fourth quarters and zero otherwise.
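A small pandas sketch of this coding, with illustrative dates (the column names are assumptions):

```python
import pandas as pd

# Illustrative quarterly dates.
df = pd.DataFrame({"period": pd.to_datetime(
    ["1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
     "1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31"])})

q = df["period"].dt.quarter
for k in (2, 3, 4):                  # Q1 stays all-zero: the base quarter
    df[f"Q{k}"] = (q == k).astype(int)

print(df)
```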
Example: Private Housing Starts (PHS)
Data for private housing starts (PHS), the
mortgage rate (MR), and these seasonal indicator
variables are shown in the following slide.
Examine the data carefully to verify your
understanding of the coding for Q2, Q3, Q4.
Since we have assigned dummy variables for the
second, third, and fourth quarters, the first quarter
is the base quarter for our regression model.
Note that any quarter could be used as the base,
with indicator variables to adjust for differences in
other quarters.
Example: Private Housing Starts (PHS)
PERIOD PHS MR Q2 Q3 Q4
31-Mar-90 217 10.1202 0 0 0
30-Jun-90 271.3 10.3372 1 0 0
30-Sep-90 233 10.1033 0 1 0
31-Dec-90 173.6 9.9547 0 0 1
31-Mar-91 146.7 9.5008 0 0 0
30-Jun-91 254.1 9.5265 1 0 0
30-Sep-91 239.8 9.2755 0 1 0
31-Dec-91 199.8 8.6882 0 0 1
31-Mar-92 218.5 8.7098 0 0 0
30-Jun-92 296.4 8.6782 1 0 0
30-Sep-92 276.4 8.0085 0 1 0
31-Dec-92 238.8 8.2052 0 0 1
31-Mar-93 213.2 7.7332 0 0 0
30-Jun-93 323.7 7.4515 1 0 0
30-Sep-93 309.3 7.0778 0 1 0
31-Dec-93 279.4 7.0537 0 0 1
31-Mar-94 252.6 7.2958 0 0 0
30-Jun-94 354.2 8.4370 1 0 0
30-Sep-94 325.7 8.5882 0 1 0
31-Dec-94 265.9 9.0977 0 0 1
31-Mar-95 214.2 8.8123 0 0 0
30-Jun-95 296.7 7.9470 1 0 0
30-Sep-95 308.2 7.7012 0 1 0
31-Dec-95 257.2 7.3508 0 0 1
31-Mar-96 240 7.2430 0 0 0
30-Jun-96 344.5 8.1050 1 0 0
30-Sep-96 324 8.1590 0 1 0
31-Dec-96 252.4 7.7102 0 0 1
31-Mar-97 237.8 7.7905 0 0 0
30-Jun-97 324.5 7.9255 1 0 0
30-Sep-97 314.6 7.4692 0 1 0
31-Dec-97 256.8 7.1980 0 0 1
31-Mar-98 258.4 7.0547 0 0 0
30-Jun-98 360.4 7.0938 1 0 0
30-Sep-98 348 6.8657 0 1 0
31-Dec-98 304.6 6.7633 0 0 1
31-Mar-99 294.1 6.8805 0 0 0
30-Jun-99 377.1 7.2037 1 0 0
30-Sep-99 355.6 7.7990 0 1 0
31-Dec-99 308.1 7.8338 0 0 1
Example: Private Housing Starts (PHS)
The regression model for private housing
starts (PHS) is:
PHS = β₀ + β₁(MR) + β₂(Q2) + β₃(Q3) + β₄(Q4) + ε
In this model we expect b1 to have a
negative sign, and we would expect b2, b3,
b4 all to have positive signs. Why?
Regression results for this model are shown
in the next slide.
Example: Private Housing Starts (PHS)
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.885398221
R Square 0.78393001
Adjusted R Square 0.759236296
Standard Error 26.4498851
Observations 40
ANOVA
df SS MS F Significance F
Regression 4 88837.93624 22209.48406 31.74613731 3.33637E-11
Residual 35 24485.87476 699.5964217
Total 39 113323.811
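A sketch of this fit in statsmodels; only the first eight quarters from the table are typed out here, and in practice all 40 rows would be used:

```python
import pandas as pd
import statsmodels.api as sm

# First eight quarters (1990-1991) from the table above.
df = pd.DataFrame({
    "PHS": [217.0, 271.3, 233.0, 173.6, 146.7, 254.1, 239.8, 199.8],
    "MR":  [10.1202, 10.3372, 10.1033, 9.9547,
            9.5008, 9.5265, 9.2755, 8.6882],
    "Q2":  [0, 1, 0, 0, 0, 1, 0, 0],
    "Q3":  [0, 0, 1, 0, 0, 0, 1, 0],
    "Q4":  [0, 0, 0, 1, 0, 0, 0, 1],
})

X = sm.add_constant(df[["MR", "Q2", "Q3", "Q4"]])
model = sm.OLS(df["PHS"], X).fit()
print(model.params)   # expect a negative sign on MR, positive on Q2-Q4
```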
[Figure: Private Housing Starts (PHS) with a Simple Regression Forecast (PHSF1) and a Multiple Regression Forecast (PHSF2), in Thousands of Units, plotted quarterly from Mar-90 through Nov-98.]
Regression Diagnostics and Residual
Analysis
It is important to check the adequacy of the model before it
becomes part of the decision making process.
Residual plots can be used to check the model
assumptions.
It is important to study outlying observations to decide
whether they should be retained or eliminated.
If retained, we must decide whether their influence should be
reduced in the fitting process or whether the regression
function should be revised.
Time Series Data and the Problem of
Serial Correlation
In the regression models we assume that the
error terms εᵢ are independent.
In business and economics, many regression
applications involve time series data.
For such data, the assumption of
uncorrelated or independent error terms is
often not appropriate.
Problems of Serial Correlation
If the error terms in the regression model are
autocorrelated, the use of ordinary least squares
procedures has a number of important
consequences:
MSE underestimates the variance of the error terms.
The confidence intervals and tests using the t and F
distributions are no longer strictly applicable.
The standard errors of the regression coefficients
underestimate the variability of the estimated regression
coefficients. Spurious regression can result.
First order serial correlation
The error term in current period is directly related
to the error term in the previous time period.
Let the subscript t represent time, then the simple
linear regression model is:
yₜ = β₀ + β₁xₜ + εₜ
Where
εₜ = ρεₜ₋₁ + νₜ
εₜ = error at time t
ρ = the parameter that measures the correlation between
adjacent error terms
νₜ = normally distributed error terms with mean zero and
variance σ²
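To see what such errors look like, here is a short simulation sketch with illustrative parameter values (ρ = 0.8, σ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma = 100, 0.8, 1.0     # illustrative values

nu = rng.normal(0.0, sigma, n)    # nu_t: independent N(0, sigma^2)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + nu[t]   # eps_t = rho*eps_{t-1} + nu_t

x = np.arange(n, dtype=float)
y = 2.0 + 0.5 * x + eps           # y_t = beta0 + beta1*x_t + eps_t
```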
Example
The effect of positive serial correlation in a
simple linear regression model.
Misleading forecasts of future y values.
The standard error of the estimate, S_{y·x}, will
underestimate the variability of the y's about
the true regression line.
Strong autocorrelation can make two unrelated
variables appear to be related.
Durbin-Watson Test for Serial
Correlation
Recall the first-order serial correlation model:
yₜ = β₀ + β₁xₜ + εₜ
εₜ = ρεₜ₋₁ + νₜ
The hypotheses to be tested are:
H₀: ρ = 0
Hₐ: ρ > 0
The alternative hypothesis is ρ > 0 because
business and economic time series tend to show
positive autocorrelation.
Durbin-Watson Test for Serial
Correlation
The Durbin-Watson statistic is defined as

DW = Σₜ₌₂ⁿ (eₜ − eₜ₋₁)² / Σₜ₌₁ⁿ eₜ²

Where
eₜ = yₜ − ŷₜ, the residual for time period t
eₜ₋₁ = yₜ₋₁ − ŷₜ₋₁, the residual for time period t − 1
Durbin-Watson Test for Serial
Correlation
The autocorrelation coefficient ρ can be
estimated by the lag-1 residual
autocorrelation r₁(e):

r₁(e) = Σₜ₌₂ⁿ eₜeₜ₋₁ / Σₜ₌₁ⁿ eₜ²
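Both quantities are easy to compute from a residual vector; a minimal NumPy sketch:

```python
import numpy as np

def durbin_watson(e):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    d = np.diff(e)                        # e_t - e_{t-1} for t = 2..n
    return np.sum(d ** 2) / np.sum(e ** 2)

def lag1_autocorr(e):
    """r1(e) = sum(e_t * e_{t-1}) / sum(e_t^2)."""
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

# The two are linked: DW is approximately 2 * (1 - r1(e)), so DW near 2
# suggests no serial correlation and DW well below 2 suggests positive
# serial correlation.
```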
[Figure: scatter plot of company sales ($ millions, 0-35) against industry sales ($ millions, 0-200) for the Blaisdell Company example.]
Example
The scatter plot suggests that a linear regression
model is appropriate.
The least-squares method was used to fit a
regression line to the data.
The residuals were plotted against the fitted
values.
The plot shows that the residuals are consistently
above or below the fitted value for extended
periods.
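A sketch of the residual-versus-fitted plot used for this diagnosis, built from the company and industry sales figures tabulated a few slides below:

```python
import numpy as np
import matplotlib.pyplot as plt

# Blaisdell data: company sales (y) and industry sales (x), 20 quarters.
y = np.array([20.96, 21.40, 21.96, 21.52, 22.39, 22.76, 23.48, 23.66,
              24.10, 24.01, 24.54, 24.30, 25.00, 25.64, 26.36, 26.98,
              27.52, 27.78, 28.24, 28.78])
x = np.array([127.3, 130.0, 132.7, 129.4, 135.0, 137.1, 141.2, 142.8,
              145.5, 145.3, 148.3, 146.4, 150.2, 153.1, 157.3, 160.7,
              164.2, 165.6, 168.7, 171.7])

X = np.column_stack([np.ones(len(x)), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
resid = y - fitted

plt.scatter(fitted, resid)        # look for runs above/below zero
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```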
Example
To confirm this graphic diagnosis we will use the
Durbin-Watson test for:
H₀: ρ = 0
Hₐ: ρ > 0
DW = Σₜ₌₂ⁿ (eₜ − eₜ₋₁)² / Σₜ₌₁ⁿ eₜ²
Example
Year  Quarter  t  Company Sales (y)  Industry Sales (x)  eₜ  eₜ − eₜ₋₁  (eₜ − eₜ₋₁)²  eₜ²
1983 1 1 20.96 127.3 -0.02605 0.000679
2 2 21.4 130 -0.06202 -0.03596 0.001293 0.003846
3 3 21.96 132.7 0.022021 0.084036 0.007062 0.000485
4 4 21.52 129.4 0.163754 0.141733 0.020088 0.026815
1984 1 5 22.39 135 0.04657 -0.11718 0.013732 0.002169
2 6 22.76 137.1 0.046377 -0.00019 3.76E-08 0.002151
3 7 23.48 141.2 0.043617 -0.00276 7.61E-06 0.001902
4 8 23.66 142.8 -0.05844 -0.10205 0.010415 0.003415
1985 1 9 24.1 145.5 -0.0944 -0.03596 0.001293 0.008911
2 10 24.01 145.3 -0.14914 -0.05474 0.002997 0.022243
3 11 24.54 148.3 -0.14799 0.001152 1.33E-06 0.021901
4 12 24.3 146.4 -0.05305 0.094937 0.009013 0.002815
1986 1 13 25 150.2 -0.02293 0.030125 0.000908 0.000526
2 14 25.64 153.1 0.105852 0.12878 0.016584 0.011205
3 15 26.36 157.3 0.085464 -0.02039 0.000416 0.007304
4 16 26.98 160.7 0.106102 0.020638 0.000426 0.011258
1987 1 17 27.52 164.2 0.029112 -0.07699 0.005927 0.000848
2 18 27.78 165.6 0.042316 0.013204 0.000174 0.001791
3 19 28.24 168.7 -0.04416 -0.08648 0.007478 0.00195
4 20 28.78 171.7 -0.03301 0.011152 0.000124 0.00109
Sum: Σ(eₜ − eₜ₋₁)² = 0.097941, Σeₜ² = 0.133302
Blaisdell Company Example
Example
DW = 0.097941 / 0.133302 = 0.735
Using the Durbin-Watson table in your textbook,
for k = 1, n = 20, and α = .01, we find
d_U = 1.15 and d_L = .95.
Since DW = .735 falls below d_L = .95, we
reject the null hypothesis and conclude that the
error terms are positively autocorrelated.
Remedial Measures for Serial Correlation
Addition of one or more independent
variables to the regression model.
One major cause of autocorrelated error terms
is the omission from the model of one or more
key variables that have time-ordered effects on
the dependent variable.
Use transformed variables.
The regression model is specified in terms of
changes rather than levels.
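As an illustration of the second remedy, here is a sketch of regressing first differences rather than levels, using the first five Blaisdell quarters from the table above (in practice the full series would be used):

```python
import pandas as pd
import statsmodels.api as sm

# Company sales (y) and industry sales (x), first five quarters.
y = pd.Series([20.96, 21.40, 21.96, 21.52, 22.39])
x = pd.Series([127.3, 130.0, 132.7, 129.4, 135.0])

# Regress quarter-to-quarter changes rather than levels.
dy, dx = y.diff().dropna(), x.diff().dropna()
model = sm.OLS(dy, sm.add_constant(dx)).fit()
print(model.params)
```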
Extensions of the Multiple Regression
Model
In some situations, nonlinear terms may be needed
as independent variables in a regression analysis.
Business or economic logic may suggest that non-
linearity is expected.
A graphic display of the data may be helpful in
determining whether non-linearity is present.
One common economic cause for non-linearity is
diminishing returns.
For example, the effect of advertising on sales may
diminish as increased advertising is used.
Extensions of the Multiple Regression
Model
Some common forms of nonlinear functions are:
Y = β₀ + β₁X + β₂X²
Y = β₀ + β₁X + β₂X² + β₃X³
Y = β₀ + β₁(1/X)
Y = e^(β₀) · X^(β₁)
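As a sketch of the first (quadratic) form, one simply adds a squared column to the design matrix; the data here are invented to show a diminishing-returns shape:

```python
import numpy as np

# Invented data with a diminishing-returns shape.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.8, 4.9, 5.4, 5.6, 5.5])

# Design matrix with intercept, X, and X^2 columns.
X = np.column_stack([np.ones(len(x)), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2:", b)   # a negative b2 captures the curvature
```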
Extensions of the Multiple Regression
Model
To illustrate the use and interpretation of a
non-linear term, we return to the problem of
developing a forecasting model for private
housing starts (PHS).
So far we have looked at the following
model
PHS = β₀ + β₁(MR) + β₂(Q2) + β₃(Q3) + β₄(Q4) + ε
Where MR is the mortgage rate and Q2, Q3, and Q4 are
indicator variables for quarters 2, 3, and 4.
Example: Private Housing Start
First we add real disposable personal
income per capita (DPI) as an independent
variable. Our new model for this data set is:
PHS = β₀ + β₁(MR) + β₂(Q2) + β₃(Q3) + β₄(Q4) + β₅(DPI) + ε
Regression Statistics
Multiple R 0.943791346
R Square 0.890742104
Adjusted R Square 0.874187878
Standard Error 19.05542121
Observations 39
ANOVA
df SS MS F Significance F
Regression 5 97690.01942 19538 53.80753 6.51194E-15
Residual 33 11982.59955 363.1091
Total 38 109672.619
[Figure: scatter plot of PHS against DPI, suggesting a curved (nonlinear) relation between income and housing starts.]
To capture this curvature, the square of DPI is added to the regression model.
Example: Private Housing Start
We also add the dependent variable, lagged
one quarter, as an independent variable in
order to help reduce serial correlation.
The third model that we fit to our data set
is:
PHS = β₀ + β₁(MR) + β₂(Q2) + β₃(Q3) + β₄(Q4) + β₅(DPI) + β₆(DPI²) + β₇(LPHS) + ε
Regression Statistics
Multiple R 0.97778626
R Square 0.956065971
Adjusted R Square 0.946145384
Standard Error 12.46719572
Observations 39
ANOVA
df SS MS F Significance F
Regression 7 104854.2589 14979.17985 96.37191 3.07085E-19
Residual 31 4818.360042 155.4309691
Total 38 109672.619
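Finally, a sketch of this extended fit; the file name is hypothetical, and the data set is assumed to contain the quarterly series PHS, MR, Q2, Q3, Q4, and DPI:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("phs_quarterly.csv")   # hypothetical file name

df["DPI2"] = df["DPI"] ** 2        # squared-income term
df["LPHS"] = df["PHS"].shift(1)    # PHS lagged one quarter
est = df.dropna()                  # the lag drops the first quarter,
                                   # leaving 39 of the 40 observations

X = sm.add_constant(est[["MR", "Q2", "Q3", "Q4", "DPI", "DPI2", "LPHS"]])
print(sm.OLS(est["PHS"], X).fit().summary())
```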