CVEN2002 Week11
CVEN2002/2702
Week 11
This lecture
11. Regression Analysis
11.1 Introduction
11.2 Simple Linear Regression
11.3 Least Squares Estimators
11.4 Inferences in simple linear regression
11.5 Prediction of new observations
11.6 Adequacy of the regression model
11.7 Correlation
CVEN2002/2702 (Statistics)
Dr Justin Wishart
2 / 49
11.1 Introduction
Introduction
The main objective of many statistical investigations is to make predictions, preferably on the basis of mathematical equations.
For instance, an engineer may wish to predict the amount of oxide that will form on the surface of a metal baked in an oven for one hour at 200°C, the amount of deformation of a ring subjected to a certain compressive force, or the number of miles to wear out a tire as a function of tread thickness and composition.
Usually, such predictions require that a formula be found which relates the dependent variable whose value we want to predict (usually called the response) to one or more other variables, usually called predictors (or regressors).
The collection of statistical tools used to model and explore relationships between related variables is called regression analysis, and it is one of the most widely used statistical techniques.
As an illustration, consider the following data, where the $y_i$'s are the observed purity of oxygen produced in a chemical distillation process, and the $x_i$'s are the corresponding observed percentages of hydrocarbons present in the main condenser of the distillation unit.
Data ($n = 20$ observations):

 i    x_i (%)   y_i (%)
 1     0.99      90.01
 2     1.02      89.05
 3     1.15      91.43
 4     1.29      93.74
 5     1.46      96.73
 6     1.36      94.45
 7     0.87      87.59
 8     1.23      91.77
 9     1.55      99.42
10     1.40      93.65
11     1.19      93.54
12     1.15      92.52
13     0.98      90.56
14     1.01      89.54
15     1.11      89.85
16     1.20      90.39
17     1.26      93.25
18     1.32      93.41
19     1.43      94.98
20     0.95      87.33

[Scatterplot: oxygen purity $y_i$ (%) on the vertical axis (88 to 98) against hydrocarbon level $x_i$ (%) on the horizontal axis (0.9 to 1.5), showing a roughly linear increasing trend]
[Scatterplot of the oxygen purity observations $Y$ (88 to 98%) against the hydrocarbon level $X$ (1.0 to 1.5%), suggesting that a straight-line relationship is plausible]
11.2 Simple Linear Regression

The simple linear regression model assumes that the mean response is a straight-line function of the predictor:
$$\mu_{Y|X=x} = \beta_0 + \beta_1 x$$
The question is how to determine the estimators of $\beta_0$ and $\beta_1$ (and then an estimator for $\sigma$)
11.3 Least Squares Estimators

[Figure: scatterplot of the data with a candidate straight line; the vertical distances from the points to the line are the residuals]
The least squares estimators minimise the sum of squared vertical deviations
$$R(a, b) = \sum_{i=1}^n \big(Y_i - (a + b X_i)\big)^2$$
Setting the partial derivatives of $R$ to zero gives the normal equations
$$\frac{\partial R}{\partial a}(a, b) = -2 \sum_{i=1}^n \big(Y_i - (a + b X_i)\big) = 0$$
$$\frac{\partial R}{\partial b}(a, b) = -2 \sum_{i=1}^n \big(Y_i - (a + b X_i)\big) X_i = 0$$
whose solutions are
$$\hat\beta_1 = \frac{\sum_i X_i Y_i - \frac{(\sum_i X_i)(\sum_i Y_i)}{n}}{\sum_i X_i^2 - \frac{(\sum_i X_i)^2}{n}} \qquad \text{and} \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$
Writing
$$S_{XX} = \sum_{i=1}^n (X_i - \bar X)^2 \qquad \text{and} \qquad S_{XY} = \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y),$$
we have:
$$\hat\beta_1 = \frac{S_{XY}}{S_{XX}} \qquad \text{and} \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
Note: as $\bar Y = \hat\beta_0 + \hat\beta_1 \bar X$, the estimated straight line will always go through the point $(\bar X, \bar Y)$
From an observed sample $\{(x_i, y_i), i = 1, \ldots, n\}$, define
$$s_{xx} = \sum_{i=1}^n (x_i - \bar x)^2 \qquad \text{and} \qquad s_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)$$
and thus the estimates $b_1$ and $b_0$ of $\beta_1$ and $\beta_0$:
$$b_1 = \frac{s_{xy}}{s_{xx}} \qquad \text{and} \qquad b_0 = \bar y - \frac{s_{xy}}{s_{xx}}\,\bar x$$
The estimated or fitted regression line is therefore $b_0 + b_1 x$, which is an estimate of $\mu_{Y|X=x}$
Now, we know that estimates of means are also typically used for prediction of future observations (Slide 21, Week 8) ⇒ $b_0 + b_1 x$ is also used for predicting the future observation of $Y$ when $X$ is set to $x$, and is often denoted $\hat y(x)$:
$$\hat y(x) = b_0 + b_1 x$$
Least squares estimates: example

For the oxygen purity data ($n = 20$):
$$\bar x = 1.196, \quad \bar y = 92.1605, \quad \sum_i x_i = 23.92, \quad \sum_i y_i = 1843.21, \quad \sum_i x_i y_i = 2214.6566, \quad \sum_i x_i^2 = 29.2892$$
so that
$$s_{xx} = \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n} = 29.2892 - \frac{23.92^2}{20} = 0.68088$$
$$s_{xy} = \sum_i x_i y_i - \frac{(\sum_i x_i)(\sum_i y_i)}{n} = 2214.6566 - \frac{23.92 \times 1843.21}{20} = 10.17744$$
Hence
$$b_1 = \frac{s_{xy}}{s_{xx}} = \frac{10.17744}{0.68088} = 14.947 \qquad \text{and} \qquad b_0 = \bar y - b_1 \bar x = 74.283$$
and the fitted regression line is $\hat y(x) = 74.283 + 14.947\,x$
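This computation can be reproduced in a few lines of code. The sketch below uses only the Python standard library; the data are the 20 $(x_i, y_i)$ pairs from the table in the introduction.

```python
# Least squares fit of the oxygen purity data (pure Python, no libraries).
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# s_xx and s_xy, computed directly from the deviations about the means
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx           # slope estimate
b0 = ybar - b1 * xbar    # intercept estimate

print(f"sxx = {sxx:.5f}, sxy = {sxy:.5f}")     # 0.68088, 10.17744
print(f"fitted line: yhat(x) = {b0:.3f} + {b1:.3f} x")
```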
Estimating σ²

The variance $\sigma^2$ of the error term $\varepsilon = Y - (\beta_0 + \beta_1 X)$ is another unknown parameter
⇒ estimate it from the residuals of the fitted model, i.e.
$$\hat e_i = y_i - (b_0 + b_1 x_i) = y_i - \hat y(x_i), \qquad i = 1, 2, \ldots, n$$
Note that the residuals always average to zero:
$$\bar{\hat e} = \frac{1}{n}\sum_{i=1}^n \hat e_i = \bar y - (b_0 + b_1 \bar x) = 0$$
(by definition of the estimated coefficients $b_0$ and $b_1$)
Estimating σ²

Also, recall (Slide 43, Week 10) that the number of degrees of freedom for the usual sample variance is $n - 1$, because we have to estimate one parameter ($\bar x$ estimates the true $\mu$) (bias!)
Here we have to first estimate two parameters ($\beta_0$ and $\beta_1$)
⇒ the number of degrees of freedom must now be $n - 2$
⇒ an unbiased estimate of $\sigma^2$ is
$$s^2 = \frac{1}{n-2} \sum_{i=1}^n \hat e_i^2 \qquad \left(\text{estimator: } S^2 = \frac{1}{n-2} \sum_{i=1}^n \big(Y_i - (\hat\beta_0 + \hat\beta_1 x_i)\big)^2\right)$$
Estimating σ²: example

In the previous example, we fitted $\hat y(x) = 74.283 + 14.947x$, so that we get a series of fitted values $\hat y(x_i) = 74.283 + 14.947 x_i$, for $i = 1, \ldots, 20$, from which the residuals can be computed: $\hat e_i = y_i - \hat y(x_i)$, for $i = 1, \ldots, 20$

 i    x_i     y_i      ŷ(x_i)    ê_i
 1    0.99    90.01    89.051     0.959
 2    1.02    89.05    89.498    -0.448
 3    1.15    91.43    91.435    -0.005
 4    1.29    93.74    93.521     0.219
 5    1.46    96.73    96.054     0.676
 6    1.36    94.45    94.564    -0.114
 7    0.87    87.59    87.263     0.327
 8    1.23    91.77    92.627    -0.857
 9    1.55    99.42    97.395     2.025
10    1.40    93.65    95.160    -1.510
11    1.19    93.54    92.031     1.509
12    1.15    92.52    91.435     1.085
13    0.98    90.56    88.902     1.658
14    1.01    89.54    89.349     0.191
15    1.11    89.85    90.839    -0.989
16    1.20    90.39    92.180    -1.790
17    1.26    93.25    93.074     0.176
18    1.32    93.41    93.968    -0.558
19    1.43    94.98    95.607    -0.627
20    0.95    87.33    88.455    -1.125

We find: $s^2 = \frac{1}{18} \sum_{i=1}^{20} \hat e_i^2 = 1.1824$ (%²) ⇒ $s = 1.0874$ (%)

[Figure: scatterplot of the observations $y_i$ with the fitted line $\hat y(x)$]
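Recomputing $s^2$ directly from the least squares fit can be done as below (a pure-Python sketch; small differences from the slide's 1.1824 come from the rounding of the tabulated fitted values):

```python
# Residuals and the unbiased variance estimate s^2 = SSE / (n - 2).
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # e_hat_i
sse = sum(e * e for e in resid)
s2 = sse / (n - 2)   # n - 2 = 18 degrees of freedom
s = s2 ** 0.5

print(f"sum of residuals = {sum(resid):.2e}")  # essentially 0 by construction
print(f"s^2 = {s2:.4f}, s = {s:.4f}")
```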
Fixed design

From now on we will assume that the values of the $x_i$'s have been chosen before the experiment is performed, and are therefore fixed
⇒ this is known as a fixed design
So only the $Y_i$'s are random, which substantially simplifies the coming developments, in particular the derivation of the sampling properties of the estimators $\hat\beta_0$ and $\hat\beta_1$
Sampling distribution of β̂₁

Since $\sum_i (x_i - \bar x)(Y_i - \bar Y) = \sum_i (x_i - \bar x) Y_i$, we can write $\hat\beta_1 = \frac{\sum_i (x_i - \bar x) Y_i}{s_{xx}}$, so its expectation is
$$E(\hat\beta_1) = \frac{\sum_i (x_i - \bar x)\,E(Y_i)}{s_{xx}} = \frac{\sum_i (x_i - \bar x)(\beta_0 + \beta_1 x_i)}{s_{xx}} = \beta_1 \frac{\sum_i x_i (x_i - \bar x)}{s_{xx}} = \beta_1$$
⇒ unbiased estimator of $\beta_1$
Similarly, its variance is
$$\text{Var}(\hat\beta_1) = \frac{\sum_i (x_i - \bar x)^2\,\text{Var}(Y_i)}{s_{xx}^2} = \frac{\sigma^2 \sum_i (x_i - \bar x)^2}{s_{xx}^2} = \frac{\sigma^2}{s_{xx}}$$
and, as a linear combination of the normal $Y_i$'s,
$$\hat\beta_1 \sim \mathcal{N}\!\left(\beta_1,\; \frac{\sigma}{\sqrt{s_{xx}}}\right)$$
Sampling distribution of β̂₀

Likewise, as $\hat\beta_0 = \bar Y - \hat\beta_1 \bar x = \frac{1}{n}\sum_{i=1}^n Y_i - \hat\beta_1 \bar x$,
$$E(\hat\beta_0) = \frac{1}{n}\sum_{i=1}^n E(Y_i) - E(\hat\beta_1)\,\bar x = \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1 x_i) - \beta_1 \bar x = \beta_0 + \beta_1 \bar x - \beta_1 \bar x = \beta_0$$
⇒ unbiased estimator of $\beta_0$
Similarly, we find $\text{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{s_{xx}}\right)$, and
$$\hat\beta_0 \sim \mathcal{N}\!\left(\beta_0,\; \sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}}\right)$$
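These sampling properties can be checked with a quick Monte Carlo simulation. This is only a sketch: the "true" parameter values and the fixed design points below are chosen arbitrarily for illustration.

```python
# Simulate the fixed-design model Y_i = beta0 + beta1*x_i + eps_i, eps_i ~ N(0, sigma),
# many times and check that b0 and b1 average out to beta0 and beta1 (unbiasedness).
import random

random.seed(1)
beta0, beta1, sigma = 74.3, 14.9, 1.1      # illustrative "true" values
x = [0.90 + 0.03 * i for i in range(20)]   # fixed design: the x_i are NOT random
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

b0s, b1s = [], []
for _ in range(2000):
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b1s.append(b1)
    b0s.append(ybar - b1 * xbar)

mean_b0 = sum(b0s) / len(b0s)
mean_b1 = sum(b1s) / len(b1s)
print(f"average b1 over 2000 simulations: {mean_b1:.3f} (true beta1 = {beta1})")
print(f"average b0 over 2000 simulations: {mean_b0:.3f} (true beta0 = {beta0})")
```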
Inferences concerning β₁

An important hypothesis to consider regarding the simple linear regression model $Y = \beta_0 + \beta_1 X + \varepsilon$ is the hypothesis that $\beta_1 = 0$
⇒ $\beta_1 = 0$ is equivalent to stating that the response does not depend on the predictor $X$ (as we would have $Y \equiv \beta_0 + \varepsilon$)
We can set up a formal hypothesis test. The appropriate hypotheses are:
$$H_0: \beta_1 = 0 \qquad \text{against} \qquad H_a: \beta_1 \neq 0$$
⇒ we reject $H_0$ when the estimate $b_1$ is too different from 0
From the sampling distribution of $\hat\beta_1$, we get $\sqrt{s_{xx}}\,\frac{\hat\beta_1 - \beta_1}{\sigma} \sim \mathcal{N}(0, 1)$, and replacing $\sigma$ with its estimator $S$,
$$\sqrt{s_{xx}}\,\frac{\hat\beta_1 - \beta_1}{S} \sim t_{n-2}$$
Inferences concerning β₁

From this result, all the inferential procedures that we introduced previously can be readily adapted ⇒ compare Slides 32-33, Week 9
At significance level $\alpha \times 100\%$, the rejection criterion for $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$ is
$$\text{reject } H_0 \text{ if } b_1 \notin \left[-t_{n-2;1-\alpha/2}\,\frac{s}{\sqrt{s_{xx}}},\; t_{n-2;1-\alpha/2}\,\frac{s}{\sqrt{s_{xx}}}\right]$$
with the estimated standard deviation $s = \sqrt{\frac{1}{n-2}\sum_{i=1}^n \hat e_i^2}$ (Slide 16), and from the observed value of the test statistic under $H_0$ (i.e. with $\beta_1 = 0$)
$$t_0 = \sqrt{s_{xx}}\,\frac{b_1}{s}$$
we can compute the p-value
$$p = 1 - P(T \in [-|t_0|, |t_0|]) = 2\,P(T > |t_0|)$$
where $T$ is a r.v. with distribution $t_{n-2}$ (⇒ find $p$ in the table)
Inferences concerning β₁

As $\sqrt{s_{xx}}\,\frac{\hat\beta_1 - \beta_1}{S} \sim t_{n-2}$, we can directly write
$$P\!\left(-t_{n-2;1-\alpha/2} \le \sqrt{s_{xx}}\,\frac{\hat\beta_1 - \beta_1}{S} \le t_{n-2;1-\alpha/2}\right) = 1 - \alpha$$
or equivalently
$$P\!\left(\hat\beta_1 - t_{n-2;1-\alpha/2}\,\frac{S}{\sqrt{s_{xx}}} \le \beta_1 \le \hat\beta_1 + t_{n-2;1-\alpha/2}\,\frac{S}{\sqrt{s_{xx}}}\right) = 1 - \alpha$$
From an observed sample for which we find $s$ and $b_1$, a two-sided $100 \times (1-\alpha)\%$ confidence interval for the parameter $\beta_1$ is
$$\left[b_1 - t_{n-2;1-\alpha/2}\,\frac{s}{\sqrt{s_{xx}}},\; b_1 + t_{n-2;1-\alpha/2}\,\frac{s}{\sqrt{s_{xx}}}\right]$$
Inferences concerning β₀

Although of less practical interest, inferences concerning the parameter $\beta_0$ can be made in exactly the same way from the sampling distribution of $\hat\beta_0$
We find a two-sided $100 \times (1-\alpha)\%$ confidence interval for $\beta_0$:
$$\left[b_0 - t_{n-2;1-\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}},\; b_0 + t_{n-2;1-\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}}\right]$$
as well as a rejection criterion for a hypothesis $H_0: \beta_0 = 0$ (no intercept in the model) tested against $H_a: \beta_0 \neq 0$: at level $\alpha \times 100\%$,
$$\text{reject } H_0 \text{ if } b_0 \notin \left[-t_{n-2;1-\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}},\; t_{n-2;1-\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}}\right]$$
with a p-value calculated from the observed value of the test statistic
$$t_0 = \frac{b_0}{s\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}}}, \qquad T \sim t_{n-2}$$
Inferences concerning β₁: example

In the oxygen purity example, test at significance level $\alpha = 0.01$:
$$H_0: \beta_1 = 0 \qquad \text{against} \qquad H_a: \beta_1 \neq 0$$
With $t_{18;0.995} = 2.878$, $s = 1.0874$ and $s_{xx} = 0.68088$, the rejection criterion is
$$\text{reject } H_0 \text{ if } b_1 \notin \left[-2.878\,\frac{1.0874}{\sqrt{0.68088}},\; 2.878\,\frac{1.0874}{\sqrt{0.68088}}\right] = [-3.793, 3.793]$$
Here, with $b_1 = 14.947$, we clearly reject $H_0$
⇒ the true slope $\beta_1$ between oxygen purity and hydrocarbon level is certainly different from 0 ⇒ hydrocarbon level does influence oxygen purity
The observed value of the test statistic is
$$t_0 = \sqrt{s_{xx}}\,\frac{b_1}{s} = \sqrt{0.68088} \times \frac{14.947}{1.0874} = 11.35$$
and the p-value is $p = 2\,P(T > |t_0|)$ with $T \sim t_{18}$ (MATLAB reports $p$ of order $10^{-9}$)
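The test can be carried out programmatically. In this pure-Python sketch the critical value $2.878 = t_{18;0.995}$ is taken from the t-table, since computing t-distribution tail probabilities needs more than the standard library:

```python
# t-test of H0: beta1 = 0 against Ha: beta1 != 0 for the oxygen purity data.
import math

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

se_b1 = s / math.sqrt(sxx)   # standard error of b1 (about 1.317)
t0 = b1 / se_b1              # observed test statistic
t_crit = 2.878               # t_{18;0.995}, i.e. alpha = 0.01, from tables

print(f"t0 = {t0:.2f}")      # 11.35
print("reject H0" if abs(t0) > t_crit else "do not reject H0")
```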
Computer output

Statistical software reports the fitted regression in a table of the following form:

Regression Analysis: Y versus X

Predictor     Coef     SE Coef        T        P
Constant    74.283       1.593    46.62    0.000
X           14.947       1.317    11.35    0.000

S = 1.0874    R-Sq = 87.74%

The column Coef gives the estimates $b_0$ and $b_1$, the column SE Coef their estimated standard errors ($s\sqrt{\frac{1}{n} + \frac{\bar x^2}{s_{xx}}}$ and $\frac{s}{\sqrt{s_{xx}}}$), the column T is for the observed values $t_0$ of the test statistics (when testing $H_0: \beta_0 = 0$ and $H_0: \beta_1 = 0$), and the column P gives the associated p-values. Finally, S is the estimate $s$ of $\sigma$
Estimating the mean response

The natural estimator of the mean response $\mu_{Y|X=x} = \beta_0 + \beta_1 x$ at a given $x$ is $\hat\mu_{Y|X=x} = \hat\beta_0 + \hat\beta_1 x$
Note that, as a linear combination of normal random variables, the estimator $\hat\mu_{Y|X=x}$ is also normally distributed. Its expectation is:
$$E(\hat\mu_{Y|X=x}) = E(\hat\beta_0 + \hat\beta_1 x) = E(\hat\beta_0) + E(\hat\beta_1)\,x = \beta_0 + \beta_1 x = \mu_{Y|X=x}$$
⇒ unbiased estimator for $\mu_{Y|X=x}$
Its variance is
$$\text{Var}(\hat\mu_{Y|X=x}) = \sigma^2\left(\frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}\right)$$
⇒ the variance grows as $x$ moves away from $\bar x$: the term $\frac{(x - \bar x)^2}{s_{xx}}$ reflects the uncertainty in the estimated slope
At $x = \bar x$, $\text{Var}(\hat\mu_{Y|X=\bar x}) = \frac{\sigma^2}{n}$, which is just the variance of $\bar Y$
Hence
$$\hat\mu_{Y|X=x} \sim \mathcal{N}\!\left(\mu_{Y|X=x},\; \sigma\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}}\right)$$
Studentising with $S$ in place of $\sigma$,
$$\frac{\hat\mu_{Y|X=x} - \mu_{Y|X=x}}{S\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}}} \sim t_{n-2}$$
which directly leads to the following confidence interval for $\mu_{Y|X=x}$:
From an observed sample for which we find $s$ and $\hat y(x)$ from the fitted model $\hat y(x) = b_0 + b_1 x$, a two-sided $100 \times (1-\alpha)\%$ confidence interval for the parameter $\mu_{Y|X=x}$, that is, the mean response of $Y$ when $X = x$, is
$$\left[\hat y(x) - t_{n-2;1-\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}},\;\; \hat y(x) + t_{n-2;1-\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}}\right]$$
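For instance, a 95% confidence interval for the mean purity when the hydrocarbon level is $x = 1.00\%$ (an x-value chosen here purely for illustration; $t_{18;0.975} = 2.101$ is taken from the tables):

```python
# 95% CI for the mean response mu_{Y|X=x} at x = 1.00 (oxygen purity data).
import math

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 1.00
yhat = b0 + b1 * x0                     # estimated mean response at x0
t_975 = 2.101                           # t_{18;0.975}, from tables
half = t_975 * s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
print(f"95% CI for the mean response at x = {x0}: "
      f"[{yhat - half:.2f}, {yhat + half:.2f}]")
```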
[Figure: oxygen purity data (88 to 98%) against hydrocarbon level (0.9 to 1.5%) with the fitted regression line and the confidence band for the mean response; the band is narrowest at $x = \bar x$]
11.5 Prediction of new observations

When predicting a new observation $Y$ at $X = x$, the error $Y - \hat Y(x)$ combines the natural variability of the new observation about its mean (variance $\sigma^2$, independent of the sample used to fit the line) with the uncertainty of the fitted line, so
$$\text{Var}\big(Y - \hat Y(x)\big) = \sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}\right) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}\right)$$
and we find
$$Y|(X = x) - \hat Y(x) \sim \mathcal{N}\!\left(0,\; \sigma\sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}}\right)$$
From an observed sample, a two-sided $100 \times (1-\alpha)\%$ prediction interval for a future observation of $Y$ at $X = x$ is
$$\left[\hat y(x) - t_{n-2;1-\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}},\;\; \hat y(x) + t_{n-2;1-\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{s_{xx}}}\right]$$
Note that the prediction interval is always wider than the confidence interval for the mean response at the same $x$, because of the extra term 1 under the square root
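Continuing the illustration at $x = 1.00$ (same assumed $t_{18;0.975} = 2.101$ from the tables), the prediction interval simply adds the extra 1 under the square root and comes out noticeably wider than the confidence interval for the mean:

```python
# 95% prediction interval for a NEW purity observation at x = 1.00,
# compared with the 95% CI for the mean response at the same x.
import math

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0, t_975 = 1.00, 2.101    # illustrative x, t_{18;0.975} from tables
yhat = b0 + b1 * x0
half_ci = t_975 * s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
half_pi = t_975 * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)

print(f"CI for the mean: [{yhat - half_ci:.2f}, {yhat + half_ci:.2f}]")
print(f"PI for a new Y:  [{yhat - half_pi:.2f}, {yhat + half_pi:.2f}]")
```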
[Figure: oxygen purity data (88 to 98%) against hydrocarbon level $x$ (0.9 to 1.5%) with the fitted line, the confidence band for the mean response, and the wider prediction band]

11.6 Adequacy of the regression model

The inferences above rely on the model assumptions
$$Y_i | (X_i = x_i) \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\, \sigma) \qquad \text{for } i = 1, 2, \ldots, n$$
(independently), i.e. linearity of the mean response, constant standard deviation $\sigma$, and normality and independence of the errors ⇒ these assumptions should be checked
Residual analysis

Unfortunately, we do not have access to the values of the errors $e_i$ (as we do not know $\beta_0$ and $\beta_1$)
However, the observed residuals of the fitted model
$$\hat e_i = y_i - \hat y(x_i) = y_i - (b_0 + b_1 x_i)$$
are good proxies for those $e_i$'s ⇒ residual analysis
It is frequently helpful to plot the residuals (1) in time sequence (if known), (2) against the fitted values $\hat y(x_i)$, and (3) against the predictor values $x_i$
Typically, these graphs will look like one of the four general patterns shown on the next slide
As suggested by their name, the residuals are everything the model will not consider ⇒ no information should be observed in the residuals; they should look like noise
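Two of these checks need no plotting at all: by construction, least squares residuals sum to zero and are exactly uncorrelated with the $x_i$ (and hence with the fitted values), so any structure seen in a residual plot reflects a violated assumption rather than the fitting itself. A quick numeric check on the oxygen purity data:

```python
# Algebraic properties of least squares residuals: sum(e_hat) = 0 and
# sum(e_hat * (x - xbar)) = 0, up to floating-point error.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(resid)
sum_ex = sum(e * (xi - xbar) for e, xi in zip(resid, x))
print(f"sum of residuals: {sum_e:.1e}, sum of residuals*(x - xbar): {sum_ex:.1e}")
```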
Residual analysis

[Figure: four typical residual patterns, $\hat e_i$ plotted against fitted values or time: (a) random scatter, (b) funnel shape, (c) alternating waves, (d) bow shape]
Residual analysis

Pattern (a) thus represents the ideal situation (nothing to report)
In (b), the variance of the error terms $e_i$ (and thus that of the responses $Y_i$) seems to be increasing with time or with the magnitude of $Y_i$ or $X_i$
Plot (c) indicates some sort of dependence in the error terms
In (d), we get a clear indication of model inadequacy: the residuals are systematically positive for extreme values and negative for medium values
⇒ the model is not complete, there is still much information in the residuals: higher-order terms (like $X^2$) or other predictors should be considered in the model
Finally, a normal probability plot (or a histogram) of the residuals is constructed so as to check the normality assumption
Residual analysis: example

[Figure: normal QQ plot of the residuals $\hat e_i$ (sample quantiles against theoretical quantiles), and residuals plotted against $x_i$ (0.9 to 1.5) and against the fitted values $\hat y(x_i)$ (88 to 96), for the oxygen purity data]

⇒ nothing to report ⇒ the assumptions we made look totally valid
Variability decomposition

Similarly to the notation on Slide 11, we can define the total sum of squares of the responses,
$$s_{yy} = \sum_{i=1}^n (y_i - \bar y)^2 \qquad (\text{also denoted } sst)$$
Because the $x_i$ values are different, all $Y_i$ have different means. This variability is quantified by the regression sum of squares:
$$ssr = \sum_{i=1}^n \big(\hat y(x_i) - \bar y\big)^2$$
while the error sum of squares $sse = \sum_{i=1}^n \big(y_i - \hat y(x_i)\big)^2$ measures the variability left unexplained; one can show that $sst = ssr + sse$
Coefficient of determination

Suppose $sst \simeq ssr$ and $sse \simeq 0$: the variability in the responses due to the effect of the predictor is almost the total variability in the responses
⇒ all the dots are very close to the straight line, the predictions are very accurate: the linear regression model fits the data very well
Now suppose $sst \simeq sse$ and $ssr \simeq 0$: almost the whole variation in the responses is due to the error terms
⇒ the dots are very far away from the fitted straight line, the predictions are very imprecise: the regression model is useless
⇒ comparing $ssr$ to $sst$ allows us to judge the model adequacy
The quantity $r^2$, called the coefficient of determination, is defined as
$$r^2 = \frac{ssr}{sst}$$
Coefficient of determination

Clearly, the coefficient of determination has a value between 0 and 1:
a value of $r^2$ near 1 indicates a good fit to the data
a value of $r^2$ near 0 indicates a poor fit to the data
Fact
If the regression model is able to explain most of the variation in the response data, then it is considered to fit the data well, and is regarded as a good model
In our running example, the regression output on Slide 27 reports a value of $r^2$ (R-Sq) equal to 87.74%
⇒ almost 88% of the variation in oxygen purity is explained by the level of hydrocarbons that was used; the remaining 12% or so is due to the natural variability in oxygen purity even when the hydrocarbon level is fixed at a given level
Here $r^2$ is quite close to 1, which makes our model a good one
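The decomposition and $r^2$ can be verified numerically for the oxygen purity data (pure-Python sketch):

```python
# Variability decomposition sst = ssr + sse and coefficient of determination.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares (s_yy)
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares

r2 = ssr / sst
print(f"sst = {sst:.2f}, ssr = {ssr:.2f}, sse = {sse:.2f}")
print(f"r^2 = {r2:.4f}")   # about 0.8774, i.e. R-Sq = 87.74%
```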
11.7 Correlation

Correlation

On Slide 47, Week 4, we introduced the correlation coefficient between two random variables $X$ and $Y$:
$$\rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} = \frac{E\big((X - E(X))(Y - E(Y))\big)}{\sqrt{E\big((X - E(X))^2\big)\,E\big((Y - E(Y))^2\big)}}$$
This coefficient quantifies the strength of the linear relationship between $X$ and $Y$
⇒ if $\rho$ is close to $1$ or $-1$, there is a strong linear relationship between $X$ and $Y$
⇒ observations in a random sample $\{(x_i, y_i), i = 1, \ldots, n\}$ drawn from the joint distribution of $(X, Y)$ should fall close to a straight line
⇒ a linear regression model linking $Y$ to $X$, based on that sample, should be a good model, with a value of $r^2$ close to 1
Correlation

We can write:
$$r^2 = \frac{ssr}{sst} = \frac{sst - sse}{s_{yy}} = \frac{s_{xy}^2}{s_{xx}\,s_{yy}} = \frac{\big(\sum_i (x_i - \bar x)(y_i - \bar y)\big)^2}{\sum_i (x_i - \bar x)^2 \sum_i (y_i - \bar y)^2}$$
⇒ we observe that
$$r = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2 \sum_i (y_i - \bar y)^2}}$$
is exactly the sample version of $\rho$: the coefficient of determination of the fitted line is the square of the sample correlation coefficient between the $x_i$'s and the $y_i$'s
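Numerically, for the oxygen purity data, the sample correlation and its square can be checked against the $r^2$ found earlier (pure-Python sketch):

```python
# Sample correlation coefficient r and the identity r^2 = ssr / sst.
import math

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # sample correlation coefficient

# coefficient of determination computed from the fitted line
b1 = sxy / sxx
sse = sum((yi - (ybar + b1 * (xi - xbar))) ** 2 for xi, yi in zip(x, y))
r2 = 1 - sse / syy               # = ssr / sst

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, ssr/sst = {r2:.4f}")
```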
Objectives
Now you should be able to:
Use simple linear regression for building models for engineering
and scientific data
Understand how the method of least squares is used to estimate
the regression parameters
Analyse residuals to determine whether the regression model is an adequate fit to the data and to see if any underlying assumptions are violated
Test statistical hypotheses and construct confidence intervals on
regression parameters
Use the regression model to make a prediction of a future
observation and construct an appropriate prediction interval
Understand how the linear regression model and the correlation
coefficient are related
Recommended exercises: ; Q7 p.104, Q13, Q15 p.114, Q21 p.126,
Q1 p.499, Q5, Q8 p.500, Q13 p.507, Q17 p.508, Q19 (a-c) p.515