STAB27
Contact information
First get data from disk; copy, paste into Minitab (say c1).
Confidence interval for mean
In Minitab, select Stat, Basic Statistics, 1-sample t, then select C1 again. This time, click on Test Mean, and put 7.0 into the box. I got this:

T-Test of the Mean

The P-value 0.66 at the end is not small (e.g. smaller than 0.05), so there is no reason to reject the null hypothesis (note wording). The population mean could be 7, even though the sample mean is 7.3, just by chance. The mean is not significantly different from 7.
Why do confidence intervals and tests?

Summary
Some symbols

We need some symbols to describe what we're doing.

Let y denote the variable we're trying to predict (like SAT score). Called the dependent variable or response variable.

Let x denote the variable we're trying to predict it from (e.g. high-school GPA). Called the independent variable or predictor variable.

There could be chance involved, so let's split up y into two parts:

y = E(y) + ε.

E(y) is the population mean, while ε is a random (chance) error that prevents y from being exactly E(y).

There can be more than one predictor variable: in that case, call them x1, x2, ..., xp.
Straight lines, mathematically
Scatterplot

A nice way to see x, y data is to plot it. If you plot each x-value against its corresponding y-value, you get something like this:

This shows sales (on the vertical scale) against amount of advertising (on the horizontal scale). It appears that more advertising goes with more sales!

That is, which slope and intercept best describe this relationship?
Least squares

We want the straight line to go "close" to the points on the scatterplot.

Pick a line, say y = 0.2 + 0.5x. In the data, when x = 3, y = 2; on this line, when x = 3, ŷ = 0.2 + (0.5)(3) = 1.7. (Use ŷ to denote the point on the line.) This is off by 2 − 1.7 = 0.3.

Can do this for all 5 points for this line. Get an "error" at each point – some + (above line), some − (below). To combine the errors, square them first to make them all +, then add up to get the sum of squared errors.

Can try again with a different line; get a different sum of squared errors. Idea: choose the line that makes the SSE smallest. This is the principle of least squares.

Formulas via calculus for the slope and intercept of this best line (p. 96), but we'll get answers from Minitab. Select Stat, Regression, Regression again. Response is sales; double click to select. Predictor is advert; again select. Click OK. I got this:
Regression Analysis

(Errors of the data points from this line are 0.4, -0.3, 0, -0.7, 0.6; the sum of squares of these is 1.1.)

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  4.9000  4.9000  13.36  0.035
Residual Error   3  1.1000  0.3667
Total            4  6.0000
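The sum-of-squared-errors idea is easy to sketch in Python. The five (x, y) points below are hypothetical, chosen to be consistent with the numbers quoted in these notes (y = 2 at x = 3, and residuals 0.4, −0.3, 0, −0.7, 0.6 about the least-squares line):

```python
# Hypothetical 5-point data set (x = advertising, y = sales), consistent
# with the numbers quoted in these notes.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

def sse(a, b):
    """Sum of squared errors of the data about the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

trial = sse(0.2, 0.5)   # the trial line y = 0.2 + 0.5x
best = sse(-0.1, 0.7)   # the least-squares line; SSE = 1.1
print(trial, best)
```

Any other line gives a larger SSE than the least-squares line; that is what "least squares" means.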
Assumptions

We earlier wrote y = E(y) + ε, to allow for the points not being exactly on the line (the ε are "random errors").

Make two assumptions about the random ε:

• ε have mean 0 and constant SD σ (same for all x).
• ε have a normal distribution.

σ is a population parameter, so we don't know it. However, we can estimate it from the data.

The further the points are off the line, the larger σ should be. Now, SSE measures how far the points are off the line, so can estimate σ² from SSE: precisely, SSE divided by n − 2, where n is the number of data points you have.
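As a quick check, this estimate can be reproduced from the ADSALES numbers quoted in these notes (SSE = 1.1, n = 5); the square root of SSE/(n − 2) is exactly the "S" that Minitab prints:

```python
import math

sse, n = 1.1, 5       # SSE and number of points, from the ADSALES example
ms = sse / (n - 2)    # estimate of sigma^2 (the error MS: 0.3667)
s = math.sqrt(ms)     # estimate of sigma itself (Minitab's S = 0.6055)
print(round(ms, 4), round(s, 4))
```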
The estimated slope is 0.7. The figure in the “StDev” column next to
it expresses uncertainty about the slope.
Confidence interval for population slope

The reason we don't know the "real" slope is that we don't know about the population; we only have a sample and a sample slope (from the regression line). If we knew the population slope exactly, the "StDev" figure would be 0.

We can make a confidence interval for the population slope. We use the sample slope, its StDev, and the t-distribution (because we don't know σ).

For, say, a 95% interval, look in the t table (p. 762). 95% means 5% "cut off", or 0.025 each end. Look in the 0.025 column and the row for the error df, 3 here. This gives 3.182.

Then the 95% confidence interval for the slope is

0.7 ± 3.182(0.1915) = (0.09, 1.31).

This doesn't pin down the population slope very well, because we don't have much data.

For a different interval (say 90%), decide how much to cut off each end – here half of 10%, or 0.05. Use the error df again.
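The interval arithmetic is simple enough to check directly, using the quoted slope, StDev and t value:

```python
slope, se = 0.7, 0.1915   # sample slope and its StDev from the output
tcrit = 3.182             # t table, 0.025 column, 3 error df
lo, hi = slope - tcrit * se, slope + tcrit * se
print(round(lo, 2), round(hi, 2))   # the interval (0.09, 1.31)
```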
Look along the advert line. The P-value for this test is 0.035,
which is small (smaller than 0.05).
Correlation and R-squared

Correlations (Pearson)
The correlation squared, called R-squared, has another nice interpretation in regression. Recall this (ADSALES data again):

Regression Analysis

The regression equation is
sales = - 0.100 + 0.700 advert

Predictor      Coef   StDev      T      P
Constant    -0.1000  0.6351  -0.16  0.885
advert       0.7000  0.1915   3.66  0.035

S = 0.6055   R-Sq = 81.7%   R-Sq(adj) = 75.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  4.9000  4.9000  13.36  0.035
Residual Error   3  1.1000  0.3667
Total            4  6.0000

The correlation squared is 0.904² = 0.817, the same as R-Sq above.

But also regression SS divided by total SS is 4.9/6 = 0.817. Thus R-squared also says this here: "out of the variation in y, 81.7% of it is explained by the fact that y depends on x". The higher the better.

In multiple regression (later), correlation is not so useful, but R-squared is still helpful.
Confidence interval for mean y and prediction interval for new y at a particular x

Having decided that the regression line is useful, the next step is to use it for prediction.

To simply predict y at a particular x, just use the line. For example (ADSALES), the line was y = −0.1 + 0.7x. To predict sales when advertising is 4, put in x = 4 to get −0.1 + 0.7(4) = 2.7.

In the population, though, we don't know the "true" sales when advertising is 4; we only have sample estimates, with uncertainty. Do 2 kinds of prediction:

1. Mean value of y for a given x.
2. Value of y for a new observation with a given x.

The first says: "imagine all sales in the population where advertising was 4; make a CI for the mean of those". The second says: we have one new observation with advertising 4; guess what sales is for that.

The second interval (the prediction interval) has more uncertainty: even if the line is known very well, there is still uncertainty about y.
Fit     StDev Fit   95.0% CI          95.0% PI
2.700   0.332       ( 1.645, 3.755)   ( 0.503, 4.897)

Fit 2.7 is the value from putting x = 4 into the line. The 95% CI says that mean sales when advertising is 4 lies between 1.645 and 3.755 – not very useful! But the PI is even less useful: sales for a new month when advertising is 4 could be between 0.5 and 4.9 – not useful at all! (Quite typical of PIs.)

Multiple regression
Multiple regression model

Sales of a product might depend on advertising, but also on season, inventory, sales force, productivity. In other words, our y-variable (sales) depends on not one but several x-variables.

Kinds of multiple regression model

The x's can represent several different things:
Regression Analysis
Interpreting intercept and slopes

The regression equation is
price = 1470 + 0.814 land + 0.820 improve + 13.5 area

1470 is the intercept: the value of y (price) when the other variables (land, improve, area) are all 0. Doesn't mean much here.

0.814 is the slope for land value. Says that if land value increases by 1 unit, with the other variables not changing, price will increase by 0.814 units. Same idea for the slopes of improve and area.

Note that the interpretation of slopes requires the other variables in the model to be held fixed; gives an idea of the effect of that variable over and above the others.

Confidence intervals and tests for slopes

As in 1-variable regression. Bear in mind the interpretation.

Don't know the "real" slope for a variable because we don't know about the population; we only have a sample and a sample slope (from the regression equation).

Can make a confidence interval for the population slope for a variable. As before, use the sample slope, its StDev, and the t-distribution (because we don't know σ).

Assumptions same as before:

• ε have mean 0 and constant SD σ (same for all x).
• ε have a normal distribution.
Previous example: slope for land value 0.814, StDev 0.5122. More of the output:

S = 7919   R-Sq = 89.7%   R-Sq(adj) = 87.8%

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       3  8779676741  2926558914  46.66  0.000
Residual Error  16  1003491259    62718204
Total           19  9783168000

95% interval for the slope of land value: use the t table (p. 762). Look in the 0.025 column and the row for the error df, 16 here. This gives 2.120.

Then the 95% confidence interval for the slope is

0.814 ± 2.120(0.5122) = (−0.27, 1.90).
The regression equation is
price = 98 + 0.960 improve + 16.4 area

Predictor     Coef   StDev     T      P
Constant        98    5931  0.02  0.987
improve     0.9604  0.2004  4.79  0.000
area        16.373   6.617  2.47  0.024

area now clearly has an effect on price.

Explanation: land and area predict price in a similar way, so that having both is unnecessary, but having one is useful.

(In the regression with land and improve only, land is nearly significant.)

R-squared, correlation, and test for whole model

Minitab quotes an R-squared for multiple regressions as well. Interpret as before (regression containing all 3 x's):

S = 7919   R-Sq = 89.7%   R-Sq(adj) = 87.8%

This is high, 89.7%, so the regression is doing a good job of predicting selling price.

Doesn't mean that the regression is best in any way, just good.
Correlation is only defined between two variables, so it is not meaningful in multiple regression. R-squared here is defined as regression SS divided by total SS.

The above table also contains a P-value, 0 to 3 decimals. So some null hypothesis is being rejected, but what?
Comparing two models: the partial F-test

So far, we know how to test for one x-variable (t-test) and how to test for all of them (global F-test). How to test some of the x's? Answer: the partial F-test.

Fit the regression containing all the x's under consideration; then remove those you want to test, and see if the fit is "significantly" worse.

Example: in the real estate data, see if the regression with all 3 x's is better than the regression with only land value.

First, do the regression with all 3 x's, and note the SS:

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       3  8779676741  2926558914  46.66  0.000
Residual Error  16  1003491259    62718204
Total           19  9783168000

Then do the regression with land only:

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       1  6102224089  6102224089  29.84  0.000
Residual Error  18  3680943911   204496884
Total           19  9783168000
Then take the difference in error SS divided by the difference in error df, and divide that by (smaller error SS divided by its df):

F = [(3680943911 − 1003491259)/2] / (1003491259/16) = 21.35.

The P-value for this is very small (Minitab or tables).

So we conclude that the smaller regression fits worse: that is, we should include both value of improvements and area of property in the regression, rather than taking both out.

Another way to do the same test is via only the regression with all three variables, provided you include the ones to be tested last. That is, do a regression with land, improve and area in that order. You get sequential SS:

Source   DF      Seq SS
land      1  6102224089
improve   1  2412784751
area      1   264667901

Add these up to get the top of the test statistic:

F = [(2412784751 + 264667901)/2] / (1003491259/16) = 21.35,

same as before. The conclusion is the same: those two variables should be in the regression, since the fit is much worse without them.
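Both versions of the partial F statistic can be checked with a few lines of Python, using the SS values quoted above:

```python
# Error SS from the two ANOVA tables (land only vs. all 3 x's)
sse_land_only, sse_full = 3680943911, 1003491259
df_tested, df_error = 2, 16

# Version 1: difference in error SS between the two fits
f_drop = ((sse_land_only - sse_full) / df_tested) / (sse_full / df_error)

# Version 2: sum of the sequential SS for the variables being tested
seq_improve, seq_area = 2412784751, 264667901
f_seq = ((seq_improve + seq_area) / df_tested) / (sse_full / df_error)

print(round(f_drop, 2), round(f_seq, 2))   # both give 21.35
```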
Variations on regression

In this chapter, we will see how to use non-linear functions of x (like x²), how to include categorical x's, and what interactions are and how to model them. We'll also see how to fit some non-linear models (that can be made to look like multiple regression).
Quadratic models
To calculate x² in Minitab: select Calc then Calculator. Type a name like oxygensq in the Store Result box. In the Expression box, select oxygen by double-clicking on it. Select * (for multiplication), then select oxygen again. This multiplies the oxygen values by themselves, and saves the results in a new column.

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor      Coef   StDev      T      P
Constant    -1464.4   411.4  -3.56  0.001
oxygen        88.31   16.47   5.36  0.000
oxygensq    -0.5362  0.1582  -3.39  0.002

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%
The significant t-test for the slope of oxygensq says that this adds something to the prediction over and above the other variables. That is, the squared term is needed to capture the curve – the curve is real, not just chance.

The slope for oxygensq is −0.54. This is not far from zero, but far enough to be significant. The negative sign means that the curve "opens downward", as you see from the scatter plot.

A squared term captures one kind of curvedness in a relationship. Here's another kind: consider antique selling; in particular, selling grandfather clocks at auction. You might expect that an older clock will sell for a higher price, but also that the price will be higher if the auction is more competitive: that is, if more people bid. This suggests predicting selling price from both age of clock and number of bidders. Using the GFCLOCKS data set (p. 173):
The regression equation is
price = - 1339 + 12.7 age + 86.0 bidders

Predictor      Coef   StDev      T      P
Constant    -1339.0   173.8  -7.70  0.000
age         12.7406  0.9047  14.08  0.000
bidders      85.953   8.729   9.85  0.000

S = 133.5   R-Sq = 89.2%   R-Sq(adj) = 88.5%

So far so good: R-squared is high, and both variables are strongly significant.

Let's see how well this model predicts in a different way: first separate the auctions into high and low numbers of bidders, then plot selling price against age, showing whether bidders is high or low.

First, in Minitab, select Manip then Code. We will code the number of bidders into "high" and "low", so select bidders into the top box. Below that, enter a new column, like c4. Then define "low" (0–8) and "high" (8–20). Column 4 will contain an "l" for each auction with a low number of bidders, and an "h" for "high".

Now to plot: select Graph and then Plot. Select price and age into the y and x boxes, then click the arrow next to Annotation and then Data Labels. Click next to "use labels from column" and enter c4 in the box. Click OK twice. Plot:
Notice how the h's are at the top of the picture and the l's at the bottom. Also, the h's seem to go up more quickly. That is, auction prices go up faster with age when there are many bidders than when there are few.

To see what this means for a regression model, consider

y = 1 + 2x1 + 3x2.

x1 = 0, x2 = 0: y = 1 + 0 + 0 = 1;
x1 = 1, x2 = 0: y = 1 + 2 + 0 = 3.

That is, y goes up by 2, the slope of x1. This is true for any x2 (try it!); it's what the slope means.
Now suppose

y = 1 + 2x1 + 3x2 + 4x1x2.

Put x2 = 0; when x1 = 0, y = 1, and when x1 = 1, y = 3 as before; y goes up by 2.

Now put x2 = 1. When x1 = 0, y = 1 + 0 + 3 + 0 = 4; when x1 = 1, y = 1 + 2 + 3 + 4 = 10. Now increasing x1 by 1 increases y by 6.

That is, in this model, the effect of x1 on y depends on the value of x2.

The term 4x1x2 is called an interaction term: it describes how x1 and x2 combine to influence y.

This is exactly what we want for the grandfather clocks: when the number of bidders is high, the price should increase faster with age than when the number of bidders is low.

So let's put an interaction term into our regression. First create a column containing the age values times the bidders values. Select Calc and Calculator. Name the new variable something like agebid (type it into the top box) and define it as age * bidders. (Similar to defining x² before.) Then do a regression predicting price from age, bidders and agebid:
S = 88.91   R-Sq = 95.4%   R-Sq(adj) = 94.9%

R-squared has gone up (was 89% before); more important, the t-test for the interaction term is significant.

That is, the interaction helps to predict over and above age and bidders. The selling price depends on age differently for each number of bidders.

Interpret a higher-order model by first testing the highest-order terms. Don't remove any lower-order terms containing the same variables, even if they're not significant.

Thus for the grandfather clocks, the right model includes age and bidders as well as the interaction, even though age isn't significant.
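The effect of an interaction term can be demonstrated numerically; a tiny sketch of the illustrative model y = 1 + 2x1 + 3x2 + 4x1x2 from above:

```python
def y(x1, x2):
    # the illustrative model with an interaction term
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

effect_x2_0 = y(1, 0) - y(0, 0)   # increase x1 by 1 when x2 = 0
effect_x2_1 = y(1, 1) - y(0, 1)   # increase x1 by 1 when x2 = 1
print(effect_x2_0, effect_x2_1)   # 2 and 6: the effect of x1 depends on x2
```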
Other higher-order models

The amount charged for sending a package using a regional express delivery service depends on its weight and the distance sent. But the cost to the delivery company depends on other things too (like size of package and how full the delivery truck is).

Supposing we only had weight of package and distance shipped, could we predict cost of delivery (data set EXPRESS)?

First fit a model predicting cost from weight and distance. The fit is good: R-squared 91.5%, error SS 37.9 with 17 df.

The relationship doesn't have to be linear. Try a second-order model including weight squared, distance squared, and the weight by distance interaction. Define all these variables and give them names, then run the regression including them:

The regression equation is
cost = 0.827 - 0.609 weight + 0.00402 distance + 0.0898 wtsq + 0.000015 distsq + 0.00733 wtdist

Predictor         Coef       StDev      T      P
Constant        0.8270      0.7023   1.18  0.259
weight         -0.6091      0.1799  -3.39  0.004
distance      0.004021    0.007998   0.50  0.623
wtsq           0.08975     0.02021   4.44  0.001
distsq      0.00001507  0.00002243   0.67  0.513
wtdist       0.0073271   0.0006374  11.49  0.000

S = 0.4428   R-Sq = 99.4%   R-Sq(adj) = 99.2%
Analysis of Variance
Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086

F = [(37.19 − 2.745)/3] / (2.745/14) = 58.6;

the P-value from the F-distribution with 3 and 14 df is very small; reject the null hypothesis that the second-order terms are useless – that is, keep them.
Using categorical x-variables

Variables come in two kinds: numerical (counted or measured), and categorical (classified). Examples: surveys on people might …

Example: a certain drug may increase anxiety level in patients; the rate of increase is suspected to be different for males and females. Data in ANXIETY. Results:

The regression equation is
score = 13.6 + 0.341 dose + 2.80 sex
Predictor      Coef    StDev      T      P
Constant    15.3000   0.5983  25.57  0.000
dose        0.19143  0.04523   4.23  0.000
sex         -0.7000   0.8461  -0.83  0.412
sexdose     0.30000  0.06396   4.69  0.000

The interaction is significant. What does this mean?

It means the way score depends on dose is different for each sex. That is, fitting a straight line for each sex separately, the lines have different slopes.

Because the interaction is significant, we should keep both dose and sex in the model even though sex by itself is not significant.

If the interaction had not been significant, we would have had a straight line for each sex with the same slope, and the first regression would have been appropriate.
Categorical variables with more than 2 levels

With more than 2 levels, define a string of dummy variables like this:

Age group   x1   x2   x3
18–25        0    0    0
…
45–54        0    0    1

4 levels, so define 3 dummy variables (1 less). In words: the 1st dummy variable is 1 if it's the 2nd level, 0 otherwise; the 2nd is 1 if the 3rd level and 0 otherwise, and so on. No need to define a dummy variable specific to the first level, because if it's not the others, it must be the first.

A consulting firm sells a computerized system for monitoring road construction bids. It wants to compare the mean annual …
Note the layout: all costs, regardless of state, in 1 column. Then x1 = 1 if Kentucky (2nd), 0 otherwise; x2 = 1 if Texas (3rd), 0 otherwise. Then run a regression predicting cost from x1 and x2:

The regression equation is
cost = 280 + 80.3 kentucky + 198 texas

Predictor     Coef  StDev     T      P
Constant    279.60  53.43  5.23  0.000
kentucky     80.30  75.56  1.06  0.297
texas       198.20  75.56  2.62  0.014

S = 168.9   R-Sq = 20.5%   R-Sq(adj) = 14.6%

The intercept of 280 is the mean cost for Kansas (the omitted state). The mean for Kentucky is $80.30 larger than for Kansas, and the mean for Texas is $198.20 bigger than for Kansas.

The slope for a dummy variable compares its category (where it is 1) with the "baseline" category (Kansas).

How to tell if maintenance costs really do differ between states? The null hypothesis of no difference: all states have the same mean, so the dummy variables for all states are 0 (difference from Kansas 0). These are all the slopes in this regression, so the global F-test tells the story:

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       2  198772  99386  3.48  0.045
Residual Error  27  770671  28543
Total           29  969443

Just significant at the 0.05 level. Mean costs do differ among states.
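The fitted state means implied by the dummy-variable slopes can be recovered directly from the equation; a small sketch:

```python
# Coefficients from the dummy-variable regression output
intercept, b_kentucky, b_texas = 279.60, 80.30, 198.20

means = {
    "Kansas": intercept,                 # baseline: both dummies are 0
    "Kentucky": intercept + b_kentucky,  # kentucky = 1, texas = 0
    "Texas": intercept + b_texas,        # texas = 1, kentucky = 0
}
print(means)   # Kansas 279.60, Kentucky 359.90, Texas 477.80
```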
In the SYNFUELS data set, diesel engines with different brake power were run with three different fuels: a diesel called DF-2, a synthetic blended fuel, and a blended fuel with advanced timing. The mass burning rate was measured.

One numerical x, brake power, and one categorical, fuel type (3 levels). Define 3 − 1 = 2 dummy variables for the fuels. In the data set: df2 is 1 for fuel DF-2, 0 otherwise; bln is 1 for blended fuel, 0 otherwise.

Predict burn rate from brake power and the dummy variables:

Predictor     Coef   StDev      T      P
Constant    13.320   6.931   1.92  0.084
brake       4.3650  0.8057   5.42  0.000
df2        -22.600   5.464  -4.14  0.002
bln         -7.360   5.464  -1.35  0.208

S = 8.057   R-Sq = 81.2%   R-Sq(adj) = 75.6%

R-squared is high, so the model is useful. Brake power definitely helps to predict over and above fuel type. For fixed brake power, the advanced-timing fuel has the highest burn rate, followed by the blended fuel, followed by DF-2. (The dummy variable slopes are negative.)
To see whether fuel type helps to predict (over and above brake power), need a partial F-test to compare with the fit of the regression containing just brake power. Can use the sequential SS for this:

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       3  2807.90  935.97  14.42  0.001
Residual Error  10   649.08   64.91
Total           13  3456.99

Source  DF   Seq SS
brake    1  1603.93
df2      1  1086.22
bln      1   117.76

F = [(1086.22 + 117.76)/2] / (649.08/10) = 9.27.

This has 2 and 10 df. P-value 0.0053. The fuel type definitely affects the burn rate.
Introduction

We are often faced with data sets containing a y-variable and many x-variables.

Problem: some of the x's may have nothing to do with y; including …
To run all possible regressions in Minitab, select Regression, then Best Subsets. Enter the y-variable in Response, then enter all the x's into Free Predictors. I got this output (Minitab prints the x-variable names vertically above the X columns on the right; an X marks a variable included in that regression):

Response is y

                 Adj.
Vars   R-Sq   R-Sq   C-p        s
1      34.5   33.2   18.8  12.701  X
1      24.9   23.4   28.6  13.599  X
2      43.6   41.3   11.5  11.902  X X
2      43.3   41.0   11.8  11.934  X X
3      48.1   44.8    8.9  11.543  X X X
3      47.6   44.3    9.4  11.597  X X X
4      51.8   47.7    7.1  11.235  X X X X
4      51.6   47.4    7.4  11.265  X X X X
5      54.5   49.6    6.4  11.036  X X X X X
5      54.3   49.4    6.5  11.055  X X X X X
6      56.1   50.2    6.8  10.961  X X X X X X
6      56.0   50.1    6.9  10.978  X X X X X X
7      56.8   50.0    8.0  10.990  X X X X X X X

This gives the best regression with each possible number of x-variables (up to 7 here). For instance, the first line says that the best regression with 1 x-variable has an R-squared of 34.5%, and it predicts total hours worked from the number of cheques cashed (read cheques downwards on the right). Also shown is the second-best 1-variable regression, which has an R-squared of 24.9%.
Looking further down, the best 6-variable regression has an R-squared of 56.1%, and contains all the variables except the number of bus tickets sold.

But how do you compare regressions with different numbers of x-variables?

Looking at R-squared: every time you add a new x-variable, even if it's useless, R-squared will go up. Remembering Occam's Razor, we only want to add a new variable if it's useful. So R-squared is no good.

Adjusted R-squared

One way: adjust the definition of R-squared so that it goes down when a worthless x-variable is added.

R-squared is regression SS / total SS, or 1 − error SS / total SS. The total SS is the same for any regression with the same y, so R-squared really depends on the error SS.

Idea: base adjusted R-squared on the error MS:

Ra² = 1 − (n − 1) × (error MS / total SS),

n being the number of observations.

Since the error MS can go up or down depending on the usefulness of x, so can this adjusted R-squared.
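The formula can be checked against the ADSALES output quoted earlier in these notes (n = 5, error MS 0.3667, total SS 6.0, R-Sq(adj) = 75.6%):

```python
n, error_ms, total_ss = 5, 0.3667, 6.0   # from the ADSALES output
adj_r_sq = 1 - (n - 1) * error_ms / total_ss
print(round(100 * adj_r_sq, 1))          # 75.6, matching R-Sq(adj)
```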
Minitab puts "R-sq (adj)" on regression output and also has a column "adj R-sq" on best subsets output. Look for the highest. In the example:

Mallows' Cp
Cautions

We now know that the "best" regression has the largest adjusted R-squared or the smallest Cp.

These will sometimes disagree about the "best" regression (as in our example): usually Cp favours fewer x's.

Tempting to go with the "best" regression for further analysis, but:

• picking the "best" regression gives an optimistic R-squared (that future studies won't repeat)
• because of chance, the "best" regression may not actually be the best one
• by doing many regressions and taking the best, we may "capitalize on chance" ("if you look at enough things, you're bound to find something good")
• an automatic procedure is no substitute for subject-matter knowledge of which variables should be important.

The best use of these methods is as a suggestion for future study. Collect a new data set with the suggested variables, then analyze that.

Another technique, "stepwise regression", should be ignored completely! (It has all these problems and more.)
Detecting and correcting

Introduction

Not everything will go smoothly in a regression analysis. We need to see whether anything is not as it appears, and decide how (if possible) to fix it.
Observational data vs. designed experiments

Recall the distinction between these two:

observational data occur when the data are just observed: no effort is made to control anything.

designed experiments occur when the x-variables are controlled and can be changed by the experimenter.

Example: a study where researchers gave IQ tests to 2-year-old infants (score y); also noted whether the mother admitted using cocaine during pregnancy: x = 1, 0.

This is an observational study because the value of x is not controlled (impossible!).

Mothers were not randomly assigned to cocaine use or not, so the groups could differ on other variables not recorded. E.g. IQ might differ by socioeconomic status of the mothers, and this might be related to cocaine use.

General principle: be cautious about drawing conclusions from an observational study.
In a regression, we want y to be correlated with (at least some of) the x's. But what if the x's are correlated among themselves?

Example: the American Federal Trade Commission measures cigarette brands according to tar, nicotine and carbon monoxide. Can we predict carbon monoxide from the other variables plus weight?

Predictor     Coef   StDev      T      P
Constant     3.202   3.462   0.93  0.365
tar         0.9626  0.2422   3.97  0.001
nicotine    -2.632   3.901  -0.67  0.507
weight      -0.130   3.885  -0.03  0.974

S = 1.446   R-Sq = 91.9%   R-Sq(adj) = 90.7%

R-squared is high (good). CO depends on the amount of tar in a positive …
Reason: tar and nicotine are highly correlated with each other:

Correlation of tar and nicotine = 0.977

so the two variables predict CO in the same way: once you have one, you don't need the other.

How do you tell this has happened? Usual clue: an expected significant variable is non-significant.

Remedy: calculate variance inflation factors. These are based on the correlation from a fictitious regression predicting each x-variable from the other x's. Here, expect the VIFs for tar and nicotine to be high, because we can predict one from the other.

To get VIFs in Minitab: select Stat, Regression, Regression. Click Options, select Variance Inflation Factors. Click OK twice. Get this:

Predictor     Coef   StDev      T      P   VIF
Constant     3.202   3.462   0.93  0.365
tar         0.9626  0.2422   3.97  0.001  21.6
nicotine    -2.632   3.901  -0.67  0.507  21.9
weight      -0.130   3.885  -0.03  0.974   1.3

A VIF greater than 10 is "large". Here, as expected, the VIFs for tar and nicotine are high: the variables are correlated with each other.

Suppose now one x-variable is correlated with the sum of two others. Then there are no high correlations between pairs of x's, but the VIF for that one variable is high. Thus VIFs are better than correlations in general.
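The VIF is just 1/(1 − R²), where R² comes from regressing one x-variable on the others. With only tar and nicotine involved, that R² is the squared correlation 0.977², which already reproduces the size of the large VIFs in the output:

```python
r = 0.977               # correlation of tar and nicotine, from the output
vif = 1 / (1 - r ** 2)  # VIF = 1 / (1 - R^2)
print(round(vif, 1))    # about 22, close to the 21.6 and 21.9 Minitab reports
```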
Finally, to illustrate that we could use either tar or nicotine in the example, do the regression without tar:

The regression equation is
co = 1.61 + 12.4 nicotine + 0.06 weight

Predictor     Coef  StDev     T      P
Constant     1.614  4.447  0.36  0.720
nicotine    12.388  1.245  9.95  0.000
weight       0.059  5.024  0.01  0.991

S = 1.870   R-Sq = 85.7%   R-Sq(adj) = 84.4%

R-squared hasn't changed much, but now the slope for nicotine is positive and significant.

Let's find confidence and prediction intervals for a cigarette brand with 12.2 mg of tar, 0.88 mg of nicotine, and weighing 0.97 g. First, use the regression with all x's. In Regression, click Options, enter 13 1.1 1.05 on the "new observations" line, then click OK.

Predicted Values

Fit      StDev Fit   95.0% CI            95.0% PI
12.503   0.290       ( 11.901, 13.106)   (  9.437, 15.569)

Now compare with the regression omitting nicotine (so no correlation problems). Take out the 1.1 from the "new observations" line:

Predicted Values

Fit      StDev Fit   95.0% CI            95.0% PI
12.515   0.286       ( 11.923, 13.107)   (  9.496, 15.535)
Fit      StDev Fit   95.0% CI            95.0% PI
-141.8   74.9        ( -325.1,   41.5)   ( -338.5,   54.9)  XX

X denotes a row with X values away from the center
XX denotes a row with very extreme X values

What happened? Prices in the data set only go up to $3.70, so we are extrapolating. A price of $5 is a very extreme x-value.

Here, we got nonsense, but often you get an apparently reasonable prediction. For extrapolation to work, the linear relationship would have to continue beyond the data, and we have no way of knowing that this will happen.
Note also the size of the intervals, compared with those for price $3.30 (an average price in the data):

Predicted Values

Fit     StDev Fit   95.0% CI          95.0% PI
891.8   10.5        ( 866.0, 917.6)   ( 815.9, 967.6)

The intervals are narrower – we have more "nearby" data to work with.

Transformations of y and x

For the coffee data, the predicted demand for a price of $5 was nonsense. A plot shows why:
Only the endmost points are above the line: the relationship is curved, not linear.

We only know how to fit straight lines, not curves. But to make things more linear, we can use functions of x and y instead of x and y themselves.
Two steps:

• create new column(s) of transformed data
• run the analysis on the transformed data.

For the coffee data, create a new column of 1/price. To do this:

• give the new column a name (type it into the header row)
• select Calc, Calculator. Double-click the name of the new column, then go to the Expression box. There type or select 1/, then double-click price.
• click OK; the new column appears in the worksheet.

Now predict demand from 1/price, including the new column as the x-variable in the regression:

The regression equation is
demand = - 1180 + 6808 1/price

Predictor     Coef   StDev       T      P
Constant   -1180.5   107.7  -10.96  0.000
1/price     6808.1   358.4   19.00  0.000

S = 20.90   R-Sq = 98.4%   R-Sq(adj) = 98.1%

A plot shows that the relationship is (a little) more linear now.
Predict demand for price $3.30, first by hand:

demand = −1180.5 + 6808.1(1/3.3) = 882.4.

Or can do it in Minitab. Now predicting from 1/price, so in Regression, Options, under "prediction intervals for new observations", enter 0.303, which is 1/3.3. Click OK:

Predicted Values

Fit      StDev Fit   95.0% CI            95.0% PI
882.37   7.47        ( 864.08, 900.66)   ( 828.04, 936.71)

The fit is better, so the intervals are shorter.

Other transformations

Can transform y instead of, or as well as, x. Consider these data:

Row   x   y
1     1   1.0
2     2   1.5
3     3   2.0
4     4   2.75
5     5   3.5
6     6   4.5

y seems to increase faster as x increases, so a linear relationship is no good. Predict instead √y from x. (A plot indicates this relationship is straight.)

Define a new column sqrty containing the square root of y (in Calculator, select "square root" or type sqrt('y')).
Predict the square root of y from x:

The regression equation is
sqrty = 0.769 + 0.223 x

Predictor       Coef     StDev      T      P
Constant     0.76934   0.01542  49.90  0.000
x           0.222541  0.003959  56.21  0.000

S = 0.01656   R-Sq = 99.9%   R-Sq(adj) = 99.8%

The fit is very good.

How to predict y for x = 3? First, use the line, put in x = 3, get

0.769 + 0.223(3) = 1.438.

But this is the predicted value of the square root of y, so have to undo the square root: square this to get the predicted value of y:

(1.438)² = 2.07,

which fits well with the data.
Power relationships

If y is a power of x, we can fit this using regression. If y = kx^c, then

ln y = ln(kx^c) = ln k + ln(x^c) = ln k + c ln x.

Since k is a constant, ln y is a linear function of ln x.

Consider the following data, where y is approximately 0.1x³:

x   1    2  3  4   5   6
y   0.1  1  3  6  12  22

The relationship is definitely curved. Calculate (natural) logs of x and y:

The regression equation is
logy = - 2.20 + 2.95 logx

Predictor       Coef    StDev       T      P
Constant    -2.20406  0.09557  -23.06  0.000
logx         2.94686  0.07631   38.62  0.000

S = 0.1131   R-Sq = 99.7%   R-Sq(adj) = 99.7%

The fit is again very good. But since ln(kx^c) = ln k + c ln x, note that −2.20 ≃ ln k, so k ≃ 0.11 and c = 2.95 ≃ 3; the relationship is approximately y = 0.1x³.
133 134
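The log-log fit above can be reproduced directly; a sketch in Python with the same six data points, recovering c from the slope and k from the intercept:

```python
import math

# Regress ln(y) on ln(x): the slope estimates c, the intercept estimates ln(k)
x = [1, 2, 3, 4, 5, 6]
y = [0.1, 1, 3, 6, 12, 22]

lx = [math.log(v) for v in x]
ly = [math.log(v) for v in y]
n = len(x)
xbar, ybar = sum(lx) / n, sum(ly) / n
c = (sum((a - xbar) * (b - ybar) for a, b in zip(lx, ly))
     / sum((a - xbar) ** 2 for a in lx))
k = math.exp(ybar - c * xbar)
print(round(c, 2), round(k, 2))  # → 2.95 0.11
```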
Regression should summarize all relationship between y and x’s, so should be no pattern in plots of residuals vs. anything.

Any pattern indicates problem; kind of pattern indicates kind of problem.

In Regression, fill in Response and Predictor as usual, then click Storage. Select Residuals and Fits. Two kinds of useful plots:

• residuals vs. fitted values: any patterns: problem with y-variable
• residuals vs. x-variables: any patterns: problem with that x-variable
Residuals vs. fitted values: select Graph, then Plot. Notice residuals (RESI1) and fitted values (FITS1) available for plot. Select them with residuals as y. Get this:

Interpret: curved relation needed, or one obs. unusual.

Plot residuals against one x-variable, fat:
Example: social worker data

The SOCWORK data set contains years of experience and salary data for 50 social workers. How does salary depend on experience?

Plotting salary against years of experience suggests a curved relationship, so calculate experience-squared and add that as well. (Or: do straight-line regression, look at residuals, note that relationship not straight line, add experience-squared.)

Use Calc-Calculator to create column of experience-squared values. Add to regression. I get this:

The regression equation is
salary = 20242 + 522 experience + 53.0 expˆ2

Predictor    Coef  StDev     T      P
Constant    20242   4423  4.58  0.000
experien    522.3  616.7  0.85  0.401
expˆ2       53.01  19.57  2.71  0.009

S = 8123   R-Sq = 81.6%   R-Sq(adj) = 80.8%

Definitely need expˆ2 term.

But store residuals, fitted values; plot:
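As an aside, the fitted quadratic is easy to use directly; a small sketch (coefficients from the output above; the example experience values are made up):

```python
# Fitted quadratic from the output above
# (example experience values of 0, 10, 20 years are made up)
def salary(exp):
    return 20242 + 522.3 * exp + 53.01 * exp ** 2

# curvature: equal steps in experience give growing steps in salary
print(salary(0), round(salary(10), 1), round(salary(20), 1))
```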
Stored residuals and fits again (in RESI2 and FITS2). Has plot improved?

Also, look at regression: experience-squared term no longer significant. That is, by using log of salary, get simpler model without expˆ2. Redo regression:

The regression equation is
logsal = 9.84 + 0.0500 experience
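To predict an actual salary from this model, undo the log; a sketch (assuming logsal is the natural log, which the size of the intercept suggests; the example experience value is made up):

```python
import math

# Back-transform the log-salary fit: logsal = 9.84 + 0.0500 * experience,
# so salary = e^logsal (assuming natural logs; example value made up)
def salary(exp):
    return math.exp(9.84 + 0.0500 * exp)

print(round(salary(10)))  # a salary around 31000 at 10 years' experience
```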
Introduction

If we can select x’s, have a statistical experiment. This gives us a chance of saying “x’s cause y” (rather than “x’s and y happen to vary together”). Thus can hope to prove cause and effect.

Terminology

• Objects on which y measured (people, animals, plants, samples) called experimental units.
• x-variables that can be controlled called factors.
• Chosen value of factor called level.
• Combination of factor levels called a treatment.
Randomization

Experimental units are not all the same. Eg. some weeks are better for selling coffee than others, and this has nothing to do with shelf or location. (We therefore don’t care about it.) Likewise, people or animals are different physically.

If all the best units go with a particular treatment combination, that treatment combination will look best even when it is not. Want to “share out” experimental units.

Basic idea: randomly allocate experimental units to treatment combinations. Result: no factor has advantage, because of experimental units, over any other. Thus any difference must be because of factor.

Various approaches to randomization. Simplest is completely randomized design.
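The allocation step can be sketched in a few lines; a minimal completely randomized design (unit and treatment names here are made up for illustration):

```python
import random

# Completely randomized design: shuffle the experimental units, then
# deal them out evenly across the treatments (names are made up)
random.seed(1)  # fixed seed so the allocation is reproducible
units = [f"unit{i}" for i in range(1, 13)]   # 12 experimental units
treatments = ["A", "B", "C"]

random.shuffle(units)
allocation = {t: units[i::len(treatments)] for i, t in enumerate(treatments)}
for t in treatments:
    print(t, allocation[t])
```

Each treatment gets the same number of units, and which units it gets is left entirely to chance.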
ANOVA: analysis of completely randomized design

In regression, issue was not only whether y appeared to depend on x’s, but whether it did so more than chance.

Likewise, here want to know whether, if you looked at whole population of experimental units, factors would affect response. That is, is effect observed in data stronger than chance?

Example: effect of displays on beverage sales; is one display so much more effective than the others in data (sample) that we would be confident in claiming this display best for all supermarkets in all weeks (population), not just ones in sample?

Example: GPA and socioeconomic class

Seven students were randomly selected from each socioeconomic class (lower, middle, upper), their GPAs taken from university files at end of academic year.

Does GPA depend on socioeconomic class (for all students at that university)?

Data set GPA3. Classes numbered 1–3.

Know how to do this with regression: define dummy variables picking out students in classes 2 and 3. That is, dummy1 1 for students from class 2, 0 otherwise; dummy2 1 for students from class 3, 0 otherwise.
Then do regression predicting GPA from these 2 dummy variables:

The regression equation is
gpa = 2.52 + 0.727 dummy1 + 0.021 dummy2

Predictor    Coef   StDev      T      P
Constant   2.5214  0.1934  13.04  0.000
dummy1     0.7271  0.2735   2.66  0.016
dummy2     0.0214  0.2735   0.08  0.938

S = 0.5116   R-Sq = 33.7%   R-Sq(adj) = 26.4%

Analysis of Variance

Source          DF      SS      MS     F      P
Regression       2  2.3969  1.1984  4.58  0.025
Residual Error  18  4.7111  0.2617
Total           20  7.1080

Intercept 2.52, mean of class 1 (not interesting: comparing classes). Slope for dummy1 0.73, so students from class 2 have higher GPA on average than class 1 (by 0.73). Likewise, students from class 3 have slightly higher average GPA than class 1 (by 0.02).

Issue whether these differences just chance (would be different with different data) or real (different data would show same pattern).

Recall what analysis of variance table says: is there any effect of any x’s, that is, is there any real difference in GPA between socioeconomic classes? Null hypothesis: all the groups have the same mean; alternative is that the null is not true, ie. that one or more groups has a different mean from the others.

Here, P-value is 0.025, smaller than 0.05, so are justified in saying there is a real difference in GPA between socioeconomic classes.
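The coefficients translate directly into group means: the intercept is the class 1 mean, and each slope is that class's difference from class 1. A quick check:

```python
# Group means implied by the dummy-variable regression above
b0, b1, b2 = 2.5214, 0.7271, 0.0214
mean1, mean2, mean3 = b0, b0 + b1, b0 + b2
print(mean1, round(mean2, 4), round(mean3, 4))  # → 2.5214 3.2485 2.5428
```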
Analysis of variance in a completely randomized design

Analysis of Variance for gpa

Source  DF     SS     MS     F      P
class    2  2.397  1.198  4.58  0.025
Recall regression approach for GPA data set:

• what if we want to compare classes 2 and 3?
• more important, one of these t tests only works if we decided before collecting the data that we wanted to compare these and only these 2 groups.
If we compare all possible pairs of groups (1–2, 1–3, 2–3), are doing several tests at once.

Why does this matter?

Think about how tests work: by rejecting when P-value less than 0.05, we have 5% chance of incorrectly rejecting null when actually true. Here, null hypothesis for each test is that groups being compared have same mean.

Suppose all groups have same mean. By doing more than one test, increase chance of declaring some pair of groups to be different when actually not. (Have 3 chances to make mistake, not just 1.)

Better idea (Tukey): if all groups have same mean, figure out how big difference between largest and smallest sample mean could be. Any sample means further apart than this significantly different. Doesn’t matter how many groups; idea still works.
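The "3 chances to make a mistake" point can be made numerically; a simplified sketch that treats the 3 pairwise tests as independent (they are not exactly, but the inflation is similar):

```python
# With 3 tests each at level 0.05, the chance of at least one false
# rejection is well above 0.05 (independence assumed for simplicity)
alpha = 0.05
tests = 3
p_at_least_one_error = 1 - (1 - alpha) ** tests
print(round(p_at_least_one_error, 3))  # → 0.143
```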
We’ll assume that all the groups are the same size. First, note that overall F is significant, so there are some differences to find:

Analysis of Variance for gpa

Source  DF     SS     MS     F      P
class    2  2.397  1.198  4.58  0.025
Error   18  4.711  0.262
Total   20  7.108

Track down “Pooled StDev”, here 0.5116. Call this s. In the DF column, look for the “error” row. Result here is 18. Call this v. Note we have 3 groups to compare. Call this p. All groups have n = 7 observations.
Turn to Table 11 in Appendix C of the text. Find the row for v and the column for p. Here v = 18, p = 3. Number in table is 3.61. Call this q.

Calculate w = qs/√n. Here w = (3.61)(0.5116)/√7 = 0.698.

Finally, any groups whose means differ by more than w are significantly different (at the 5% level). Here the differences are:

2 vs 3:  3.2486 − 2.5429 = 0.7057  Significant
2 vs 1:  3.2486 − 2.5214 = 0.7272  Significant
3 vs 1:  2.5429 − 2.5214 = 0.0215  Not significant

so class 2’s mean GPA is significantly higher than that of the other classes, but the class 1–3 difference is probably just chance.

A nice way to illustrate this: list means in order, match up with groups, then put line on top of means that are not significantly different.

        --------------
Mean    2.5214  2.5429  3.2486
Class        1       3       2

This shows that group 2’s mean is significantly bigger than the others.
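The Tukey steps above are plain arithmetic once q, s and n are in hand; a sketch in Python:

```python
import math

# Tukey threshold: w = q * s / sqrt(n)
q, s, n = 3.61, 0.5116, 7      # q from Table 11 with p = 3, v = 18
w = q * s / math.sqrt(n)
print(round(w, 3))  # → 0.698

# compare each pairwise difference in sample means with w
means = {1: 2.5214, 2: 3.2486, 3: 2.5429}
for a, b in [(2, 3), (2, 1), (3, 1)]:
    diff = means[a] - means[b]
    print(f"{a} vs {b}: {diff:.4f}", "significant" if diff > w else "not significant")
```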
If groups not all same size, smaller groups have to be further apart to be significantly different (more room for chance when groups smaller).

Calculate a w for each pair of groups. For comparing groups i and j with ni and nj observations,

wij = q (s/√2) √(1/ni + 1/nj).

Everything else as before. Sometimes known as “Tukey-Kramer”.

If can choose size of each group, best to have groups of equal size: gives best chance to detect any significant differences.

Study made of chemical properties of 3 types of hazardous solvents. Measured “sorption rate” for samples of each solvent type: aromatics, chloroalkanes and esters. Data in SORPRATE and on p. 564 of text.

ANOVA itself has no difficulties:

Analysis of Variance for sorprate

Source   DF      SS      MS      F      P
solvent   2  3.3054  1.6527  24.51  0.000
Error    29  1.9553  0.0674
Total    31  5.2607

Pooled StDev = 0.2597
Are significant differences among the solvents. In data, esters seem to be lower than other two.

Tukey: need mean sorption rates for each solvent: 0.94, 1.01 and 0.33 (order as above).

From table C-11 (text): p = 3 groups, v = 29 error df, q = 3.49 (used 30 df).

From ANOVA, pooled SD is s = 0.2597. 9 aromatics, 8 chloroalkanes, 12 esters.

w12 = (3.49)(0.2597/√2) √(1/9 + 1/8)  = 0.31.
w13 = (3.49)(0.2597/√2) √(1/9 + 1/12) = 0.28.
w23 = (3.49)(0.2597/√2) √(1/8 + 1/12) = 0.29.

w’s similar because group sizes similar (smallest groups have largest w).

Summary:

Group  Esters  Aromatics  Chloroalkanes
Mean     0.33       0.94           1.01
               ------------------------
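The pairwise thresholds can be computed in a loop; a sketch using the standard Tukey-Kramer form wij = q(s/√2)√(1/ni + 1/nj), with q, s, group sizes and means taken from the slides above:

```python
import math

# Tukey-Kramer thresholds for the solvent data (values from the slides)
q, s = 3.49, 0.2597
n = {"aromatics": 9, "chloroalkanes": 8, "esters": 12}
mean = {"aromatics": 0.94, "chloroalkanes": 1.01, "esters": 0.33}

def w(g1, g2):
    return q * (s / math.sqrt(2)) * math.sqrt(1 / n[g1] + 1 / n[g2])

for g1, g2 in [("aromatics", "chloroalkanes"), ("aromatics", "esters"),
               ("chloroalkanes", "esters")]:
    diff = abs(mean[g1] - mean[g2])
    verdict = "significant" if diff > w(g1, g2) else "not significant"
    print(f"{g1} vs {g2}: w = {w(g1, g2):.2f}, diff = {diff:.2f}: {verdict}")
```

Esters differ significantly from both other solvents, while aromatics and chloroalkanes (difference 0.07) do not differ, matching the summary diagram above.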
Example: Prompting in a walking program
The ANOVA table has two tests now: one for any difference in number of walkers due to call strategy, one for any difference due to weeks.

Difference due to weeks no surprise: expected this (reason for including weeks as blocks in the first place).

Difference due to call strategy. Strictly, “difference due to call strategy allowing for differences due to weeks” (like regression). This is what we want.

Tukey for randomized blocks

Having decided that call strategy does make a difference, now want to decide which call strategies are better.

First get pooled SD as square root of error MS:

s = √7.43 = 2.7258.

Then same calculation as for completely randomized ANOVA. Comparing p = 5 call strategies, v = 20 error df. In table C-11: q = 4.23.

Calculate w = qs/√n = (4.23)(2.7258)/√6 = 4.71. Each call strategy mean is based on the 6 weeks’ worth of data, so divide by √6. Any means differing by more than 4.71 are significantly different.
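The randomized-block version of the threshold is the same arithmetic with the error MS from the blocked ANOVA; a quick check:

```python
import math

# Randomized-block Tukey threshold: s from the error MS, then w = q*s/sqrt(n)
error_ms = 7.43
s = math.sqrt(error_ms)            # pooled SD
q, n = 4.23, 6                     # q from table C-11 (p = 5, v = 20); 6 weeks per strategy
w = q * s / math.sqrt(n)
print(round(s, 4), round(w, 2))  # → 2.7258 4.71
```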
Decides to run experiment with ratios 0.5, 1 and 2, supply 15, 18, 21 tons (typical values). (Ratio 0.5 means “make half as much of

Results:

Two-way Analysis of Variance
Story so far: look first at interaction test. If significant, test means for
all combinations (by Tukey).
Interaction not significant. So now look at size and diet (main effects).

Diet not significant, but rat size is. That is, kind of diet has no effect on kidney weight, but size before experiment does.

Usually here do Tukey on any significant factors (ratsize only). But in this experiment, no point: only two different sizes of rats, so sizes sig. different in kidney weight from each other. Looking at data, obese rats have larger kidney weights than lean rats.

General procedure for 2-way ANOVA

In general, when interaction not significant, test each of the main effects (from the ANOVA). For any significant ones, do Tukey as necessary.

When interaction significant, that is the finding. Do Tukey on means for all combinations of groups.

Diagram on p. 588 of text:
Checking ANOVA assumptions
Testing variances

Usually, normal distribution assumption not crucial. Are tests for normality, but tend to be too sensitive: reject normality when data are “normal enough”.

So concentrate on testing spread within groups. Tests based on variance (SD squared). Best test is Levene’s test.

Select Stat, ANOVA, homogeneity of variance. Fill in response and factors (kidneywt is response, others factors). Output includes a graph as well as this text:

Levene’s Test (any continuous distribution)
Test Statistic: 1.036
P-Value       : 0.394
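The idea behind Levene's test is simple enough to sketch from scratch: do a one-way ANOVA F computation on the absolute deviations from each group's mean. (This is the mean-centred version; Minitab's implementation differs in detail, and the data below are made up for illustration.)

```python
import statistics

# Levene's statistic, mean-centred version: one-way ANOVA F on the
# absolute deviations from group means (illustrative made-up data)
groups = [
    [4.1, 5.0, 4.6, 4.8],
    [5.9, 6.3, 6.1, 5.7],
    [4.9, 5.2, 5.1, 5.4],
]

def levene_stat(groups):
    # absolute deviations of each observation from its group mean
    z = [[abs(x - statistics.mean(g)) for x in g] for g in groups]
    flat = [v for g in z for v in g]
    grand = statistics.mean(flat)
    k, n = len(z), len(flat)
    between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in z) / (k - 1)
    within = sum((v - statistics.mean(g)) ** 2 for g in z for v in g) / (n - k)
    return between / within

print(round(levene_stat(groups), 2))
```

A large statistic (small P-value) would suggest the group spreads differ; for these made-up data the statistic is small, consistent with equal spreads.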
Second example: Vanadium important trace element found in living organisms. Experiment done to compare V concentrations in different materials (oyster tissue, citrus leaves, bovine liver, human serum). Data p. 640 and in VANADIUM.

Ultimately: test whether mean concentration same in materials. But first test assumption of equal spread with Levene’s test:

Levene’s Test (any continuous distribution)
Test Statistic: 3.214
P-Value       : 0.070

P-value not smaller than 0.05, but is small, so have doubts about equal-spread assumption. Need transformation of response variable, as in regression.

Get means and SDs for groups. Can be done by running ANOVA and ignoring half the results!

                          Individual 95% CIs For Mean
                          Based on Pooled StDev
Level  N    Mean   StDev  ----+---------+---------+---------+--
1      3  1.3300  1.0053            (-----*------)
2      3  3.1600  0.8884                            (-----*------)
3      3  0.4100  0.1212  (-----*------)
4      5  0.1460  0.0279  (----*----)
                          ----+---------+---------+---------+--
Pooled StDev = 0.6027       0.0       1.2       2.4       3.6

Groups with larger mean also have larger SD. When this happens (quite common), transformation like logarithm or square root often helps. May be theory to guide choice, eg. if percent change in response meaningful, log better.
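The effect of taking logs when SD grows with the mean can be seen in a small made-up example (the group names echo the vanadium materials, but the numbers are invented for illustration):

```python
import math
import statistics

# Made-up groups whose SD grows roughly in proportion to their mean;
# compare spreads before and after taking logs
groups = {
    "oyster": [0.5, 1.3, 2.2],
    "citrus": [2.3, 3.1, 4.1],
    "liver":  [0.30, 0.41, 0.52],
    "serum":  [0.12, 0.15, 0.17],
}
raw_sd = {k: statistics.stdev(v) for k, v in groups.items()}
log_sd = {k: statistics.stdev([math.log(x) for x in v]) for k, v in groups.items()}
for k in groups:
    print(k, round(raw_sd[k], 3), round(log_sd[k], 3))
```

On the raw scale the largest SD is tens of times the smallest; on the log scale the SDs are much more even.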
This is a lot better. Running ANOVA shows that we have (somewhat) evened out the SDs.

Now can do F-test: there is a significant difference in mean (log-)concentrations among the 4 materials, since P-value is 0 to accuracy shown. Confident that assumptions of ANOVA are OK.
Follow up here with Tukey-Kramer (groups of different sizes). All group means further apart than these, so all materials have significantly different means.