
Lecture - Correlation and Regression GEG 222


ENGINEERING STATISTICS
Dr K. O. Orolu

Covariance and Correlation
Variance
Variance is a measure of the dispersion of a univariate distribution.
Additional statistics are required to describe the joint distribution of two or more variables.

The covariance provides a natural measure of the association between two variables, and it appears in the analysis of many engineering problems.
Correlation

Finding the relationship between two quantitative variables without being able to infer causal relationships.

Correlation is a statistical technique used to determine the degree to which two variables are related.
Scatter diagram
 Rectangular coordinate
 Two quantitative variables
 One variable is called independent (X) and
the second is called dependent (Y)
 Points are not joined
 No frequency table
Example
Scatter diagram of weight and systolic blood pressure
Scatter plots

The pattern of data is indicative of the type of relationship between your two variables:
 positive relationship
 negative relationship
 no relationship
Positive relationship
Negative relationship (e.g. reliability vs. age of car)
No relation
Variance vs Covariance
 Do two variables change together?
Variance:
• Gives information on variability of a single variable.

Covariance:
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x’s error scores by y’s error scores, as opposed to squaring x’s error scores.
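As a minimal sketch of the parallel described above, the sample variance and sample covariance can be written side by side in Python (a divisor of n − 1 is assumed, which matches the covariances of 7000/6 and 28/6 computed in the later high-variance/low-variance example):

def sample_variance(x):
    # variance: average of x's squared error scores (divisor n - 1)
    n = len(x)
    mean_x = sum(x) / n
    return sum((xi - mean_x) ** 2 for xi in x) / (n - 1)

def sample_covariance(x, y):
    # covariance: x's error scores multiplied by y's error scores,
    # instead of squaring x's error scores (divisor n - 1 assumed)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)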
Covariance

 When X and Y move in the same direction: cov(x, y) is positive.
 When X and Y move in opposite directions: cov(x, y) is negative.
 When there is no consistent relationship: cov(x, y) = 0.
Example Covariance

x    y    xi − x̄    yi − ȳ    (xi − x̄)(yi − ȳ)
0    3    −3         0          0
2    2    −1        −1          1
3    4     0         1          0
4    0     1        −3         −3
6    6     3         3          9
x̄ = 3    ȳ = 3                 ∑ = 7

What does this number tell us?
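A quick numeric check of the table above (NumPy's np.cov divides by n − 1, an assumption since the slide leaves the divisor implicit):

import numpy as np

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

# sum of (xi - x_bar)(yi - y_bar) reproduces the table's total of 7
products = (np.array(x) - np.mean(x)) * (np.array(y) - np.mean(y))
print(products.sum())        # 7.0

# np.cov divides by n - 1, so cov(x, y) = 7 / 4 = 1.75
print(np.cov(x, y)[0, 1])    # 1.75

The positive value says the two variables tend to increase together; its magnitude, however, depends on the spread of the data, which is exactly the problem raised on the next slide.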
Problem with Covariance:
 The value obtained by covariance is dependent on
the size of the data’s standard deviations: if large,
the value will be greater than if small… even if the
relationship between x and y is exactly the same in
the large versus small standard deviation datasets.
Example of how the covariance value relies on variance

             High variance data                Low variance data
Subject   x     y     x error × y error     x     y     x error × y error
1         101   100   2500                  54    53    9
2         81    80    900                   53    52    4
3         61    60    100                   52    51    1
4         51    50    0                     51    50    0
5         41    40    100                   50    49    1
6         21    20    900                   49    48    4
7         1     0     2500                  48    47    9
Mean      51    50                          51    50

Sum of x error × y error: 7000              Sum of x error × y error: 28
Covariance: 7000 / 6 = 1166.67              Covariance: 28 / 6 = 4.67
Correlation Coefficient

A statistic showing the degree of relation between two variables.
Simple Correlation coefficient (r)

 It is also called Pearson's correlation or product-moment correlation coefficient.
 It measures the nature and strength of the relationship between two variables of the quantitative type.
The sign of r denotes the nature of the association, while the value of r denotes the strength of the association.
 If the sign is +ve this means the relation is direct (an increase in one variable is associated with an increase in the other variable, and a decrease in one variable is associated with a decrease in the other variable).

 While if the sign is −ve this means an inverse or indirect relationship (an increase in one variable is associated with a decrease in the other).
 The value of r ranges between −1 and +1.
 The value of r denotes the strength of the association, as illustrated by the following diagram.

−1 ——— −0.75 ——— −0.25 ——— 0 ——— 0.25 ——— 0.75 ——— +1
perfect   strong   intermediate   weak   no relation   weak   intermediate   strong   perfect
(indirect correlation)                                        (direct correlation)
If r = 0, there is no association or correlation between the two variables.

If 0 < |r| < 0.25: weak correlation.

If 0.25 ≤ |r| < 0.75: intermediate correlation.

If 0.75 ≤ |r| < 1: strong correlation.

If |r| = 1: perfect correlation.
How to compute the simple correlation
coefficient (r)
r = degree to which X and Y vary together / degree to which X and Y vary separately
  = covariability of X and Y / variability of X and Y separately
How to compute the simple correlation
coefficient (r)
Simpler calculation formula:

r = ( ∑xy − (∑x)(∑y)/n ) / √[ ( ∑x² − (∑x)²/n )( ∑y² − (∑y)²/n ) ]

The term ∑xy − (∑x)(∑y)/n is the numerator of the covariance; the two bracketed terms under the square root are the numerators of the variances of x and y.
How to compute the simple correlation
coefficient (r)
Example:
A sample of 6 children was selected; data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (Kg)
1           7             12
2           6             8
3           8             12
4           5             10
5           6             11
6           9             13
These two variables are of the quantitative type: one variable (age) is the independent variable, denoted (X), and the other (weight) is the dependent variable, denoted (Y). To find the relation between age and weight, compute the simple correlation coefficient using the formula above and the following working table:
Serial no.   Age (x)   Weight (y)   xy          x²          y²
1            7         12           84          49          144
2            6         8            48          36          64
3            8         12           96          64          144
4            5         10           50          25          100
5            6         11           66          36          121
6            9         13           117         81          169
Total        ∑x = 41   ∑y = 66      ∑xy = 461   ∑x² = 291   ∑y² = 742
r = (461 − (41)(66)/6) / √[(291 − 41²/6)(742 − 66²/6)] = 10 / √(10.83 × 16) ≈ 0.759

Strong direct correlation
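As a sketch, the same computation in Python using the calculation formula above (the sums match the working table):

import math

age    = [7, 6, 8, 5, 6, 9]        # x
weight = [12, 8, 12, 10, 11, 13]   # y
n = len(age)

sum_x  = sum(age)                                   # 41
sum_y  = sum(weight)                                # 66
sum_xy = sum(a * w for a, w in zip(age, weight))    # 461
sum_x2 = sum(a * a for a in age)                    # 291
sum_y2 = sum(w * w for w in weight)                 # 742

r = (sum_xy - sum_x * sum_y / n) / math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
print(r)   # ~0.7596, i.e. the 0.759 quoted above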
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)   X²          Y²          XY
10            2                100         4           20
8             3                64          9           24
2             9                4           81          18
1             7                1           49          7
5             6                25          36          30
6             5                36          25          30
∑X = 32       ∑Y = 32          ∑X² = 230   ∑Y² = 204   ∑XY = 129
Calculating Correlation Coefficient

r = (129 − (32)(32)/6) / √[(230 − 32²/6)(204 − 32²/6)] = −41.67 / √(59.33 × 33.33) ≈ −0.94

Strong indirect correlation
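If SciPy is available, the hand calculation can be cross-checked with its built-in Pearson routine; this is just a convenience check, not part of the lecture's hand method:

from scipy import stats   # assumes SciPy is installed

anxiety = [10, 8, 2, 1, 5, 6]   # X
score   = [2, 3, 9, 7, 6, 5]    # Y

r, p_value = stats.pearsonr(anxiety, score)
print(round(r, 2))   # -0.94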


Regression Analyses
Regression: a technique concerned with predicting some variables by knowing others.

The process of predicting variable Y using variable X.

Regression
 Uses a variable (x) to predict some outcome variable (y)
 Tells you how values in y change as a function of changes in values of x
Correlation and Regression

 Correlation describes the strength of a linear relationship between two variables
 Linear means “straight line”
 Regression tells us how to draw the straight line described by the correlation
Regression
 Calculates the “best-fit” line for a certain set of data
 The regression line makes the sum of the squares of the residuals smaller than for any other line
 Regression minimizes residuals

By using the least squares method (a procedure that minimizes the vertical deviations of the plotted points from a straight line), we are able to construct a best-fitting straight line to the scatter diagram points and then formulate a regression equation of the form:

 x y
 xy  n
bb1 
( x) 2
 x 2

n
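A minimal sketch of these least-squares formulas in code (the helper name least_squares_line is my own choice):

def least_squares_line(x, y):
    """Return (b0, b1) for the fitted line y = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    # slope: b1 = (sum(xy) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # intercept: b0 = y_bar - b1 * x_bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1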
Regression Equation

 Regression equation describes the regression line mathematically
 Intercept
 Slope
Linear Equations
Hours studying and grades: regressing grades on hours

[Scatter plot with fitted line — Final grade in course = 59.95 + 3.17 × study hours; R-Square = 0.88. X-axis: number of hours spent studying; Y-axis: final grade in course.]

Predicted final grade in class = 59.95 + 3.17 × (number of hours you study per week)
Predicted final grade in class = 59.95 + 3.17 × (hours of study)

Predict the final grade of…

 Someone who studies for 12 hours:
 Final grade = 59.95 + (3.17 × 12) = 97.99

 Someone who studies for 1 hour:
 Final grade = 59.95 + (3.17 × 1) = 63.12
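The same two predictions as a tiny sketch (the function name predicted_grade is illustrative):

def predicted_grade(hours):
    # fitted line from the scatter plot above
    return 59.95 + 3.17 * hours

print(round(predicted_grade(12), 2))   # 97.99
print(round(predicted_grade(1), 2))    # 63.12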
Exercise
A sample of 6 persons was selected; the values of their age (x variable) and their weight are shown in the following table. Find the regression equation and the predicted weight when age is 8.5 years.
Serial no.   Age (x)   Weight (y)
1            7         12
2            6         8
3            8         12
4            5         10
5            6         11
6            9         13
Answer

Serial no.   Age (x)   Weight (y)   xy    x²    y²
1            7         12           84    49    144
2            6         8            48    36    64
3            8         12           96    64    144
4            5         10           50    25    100
5            6         11           66    36    121
6            9         13           117   81    169
Total        41        66           461   291   742


Regression equation
b1 = (461 − (41)(66)/6) / (291 − 41²/6) = 10 / 10.83 ≈ 0.923
b0 = ȳ − b1x̄ = 11 − 0.923 × 6.83 ≈ 4.69

Regression equation: weight (y) ≈ 4.69 + 0.923 × age (x)

Predicted weight at age 8.5 years ≈ 4.69 + 0.923 × 8.5 ≈ 12.54 kg

[Plot of the fitted regression line: weight (in Kg) on the y-axis against age (in years) on the x-axis, drawn over ages 7 to 9.]

We create a regression line by plotting two estimated values of y against their x components, then extending the line to the right and left.
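A short sketch reproducing this answer numerically with the same least-squares formulas:

age    = [7, 6, 8, 5, 6, 9]
weight = [12, 8, 12, 10, 11, 13]
n = len(age)

sum_x  = sum(age)                                   # 41
sum_y  = sum(weight)                                # 66
sum_xy = sum(a * w for a, w in zip(age, weight))    # 461
sum_x2 = sum(a * a for a in age)                    # 291

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)   # ~0.923
b0 = sum_y / n - b1 * sum_x / n                                 # ~4.69

print(round(b0 + b1 * 8.5, 2))   # ~12.54 kg predicted at age 8.5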
Exercise 2
The following are the age (in years) and systolic blood pressure (B.P.) of 20 apparently healthy adults.

Age (x)   B.P. (y)      Age (x)   B.P. (y)
20        120           46        128
43        128           53        136
63        141           60        146
26        126           20        124
53        134           63        143
31        128           43        130
58        136           26        124
46        132           19        121
58        140           31        126
70        144           23        123
Find the correlation between age and blood pressure using simple and Spearman's correlation coefficients, and comment.
Find the regression equation.
What is the predicted blood pressure for a man aged 25 years?
Serial   x     y      xy       x²
1        20    120    2400     400
2        43    128    5504     1849
3        63    141    8883     3969
4        26    126    3276     676
5        53    134    7102     2809
6        31    128    3968     961
7        58    136    7888     3364
8        46    132    6072     2116
9        58    140    8120     3364
10       70    144    10080    4900
11       46    128    5888     2116
12       53    136    7208     2809
13       60    146    8760     3600
14       20    124    2480     400
15       63    143    9009     3969
16       43    130    5590     1849
17       26    124    3224     676
18       19    121    2299     361
19       31    126    3906     961
20       23    123    2829     529
Total    852   2630   114486   41678
 x y
 xy 
n
b1  =
(  x) 2

x  n
2

=112.13 + 0.4547 x

For age 25:
B.P. = 112.13 + 0.4547 × 25 ≈ 123.5 mm Hg
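A quick check of the slope, intercept and prediction from the column totals above (a sketch reusing the same formulas):

n = 20
sum_x, sum_y   = 852, 2630       # column totals from the table above
sum_xy, sum_x2 = 114486, 41678

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)   # ~0.4547
b0 = sum_y / n - b1 * sum_x / n                                 # ~112.13

print(round(b0 + b1 * 25, 1))    # ~123.5 mm Hg at age 25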
Example
The strength of paper used in the manufacture of cardboard boxes (y)
is related to the percentage of hardwood concentration in the original
pulp (x). Under controlled conditions, a pilot plant manufactures 16
samples, each from a different batch of pulp, and measures the tensile
strength as shown in the Table

y 101.4 117.4 117.1 106.2 131.9 146.9 146.8 133.9 111.0 123.0 125.1 145.2 134.3 144.5 143.7 146.9

x 1 1.5 1.5 1.5 2 2 2.2 2.4 2.5 2.5 2.8 2.8 3.0 3.0 3.2 3.3
Example
• a. Describe the correlation between the tensile strength of paper and the hardwood concentration.
• b. Derive a simple linear regression equation to predict tensile strength from the percentage of hardwood concentration in the pulp.
• c. Predict the tensile strength when the concentration is 1.7.
• d. Obtain the fitted value of y when x = 2.2 and calculate the corresponding residual.
Solution
a. Describe the correlation between the tensile strength of paper and the hardwood concentration.

• Let x = percentage of hardwood concentration in the original pulp
• y = tensile strength of paper used in the manufacture of cardboard boxes

• The correlation (r) between the tensile strength of paper and the hardwood concentration in the original pulp can be expressed as:

r = ( ∑xy − (∑x)(∑y)/n ) / √[ ( ∑x² − (∑x)²/n )( ∑y² − (∑y)²/n ) ]
Solution
Hence the correlation is calculated as follows
N     Concentration (x)   Strength (y)   xy        x²       y²
1 1 101.4 101.4 1 10281.96
2 1.5 117.4 176.1 2.25 13782.76
3 1.5 117.1 175.65 2.25 13712.41
4 1.5 106.2 159.3 2.25 11278.44
5 2 131.9 263.8 4 17397.61
6 2 146.9 293.8 4 21579.61
7 2.2 146.8 322.96 4.84 21550.24
8 2.4 133.9 321.36 5.76 17929.21
9 2.5 111 277.5 6.25 12321
10 2.5 123 307.5 6.25 15129
11 2.8 125.1 350.28 7.84 15650.01
12 2.8 145.2 406.56 7.84 21083.04
13 3 134.3 402.9 9 18036.49
14 3 144.5 433.5 9 20880.25
15 3.2 143.7 459.84 10.24 20649.69
16 3.3 146.9 484.77 10.89 21579.61
TOTAL 37.2 2075.3 4937.22 93.66 272841
Mean 2.325 129.70625      
r = ( ∑xy − (∑x)(∑y)/n ) / √[ ( ∑x² − (∑x)²/n )( ∑y² − (∑y)²/n ) ] = 112.15 / √(7.17 × 3661.95) ≈ 0.69

From the result, there is a direct/positive intermediate correlation between the tensile strength of paper and the hardwood concentration in the original pulp.
Regression
• Derive a simple linear regression equation to predict tensile strength
from percentage of hardwood concentration in the pulp.
• To derive the linear regression, compute the slope (b1) and intercept (b0):

b1 = ( ∑xy − (∑x)(∑y)/n ) / ( ∑x² − (∑x)²/n ) = 112.15 / 7.17 = 15.641
b0 = ȳ − b1x̄ = 129.71 − 15.641 × 2.325 ≈ 93.34

The required simple linear regression equation to predict tensile strength from percentage of hardwood concentration in the pulp is:

y = 93.34 + 15.64x
c. Predict tensile strength when concentration = 1.7

From the regression equation, at x = 1.7: y = 93.34 + 15.64 × 1.7 ≈ 119.9
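Parts a to d can be checked with a short script built on the same formulas; part d is not worked on the slides, so the last two lines below simply apply the fitted equation at x = 2.2 (a sketch, not the lecture's own solution):

import math

x = [1, 1.5, 1.5, 1.5, 2, 2, 2.2, 2.4, 2.5, 2.5, 2.8, 2.8, 3.0, 3.0, 3.2, 3.3]
y = [101.4, 117.4, 117.1, 106.2, 131.9, 146.9, 146.8, 133.9,
     111.0, 123.0, 125.1, 145.2, 134.3, 144.5, 143.7, 146.9]
n = len(x)

sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # ~112.15
sxx = sum(a * a for a in x) - sum(x) ** 2 / n                  # ~7.17
syy = sum(b * b for b in y) - sum(y) ** 2 / n                  # ~3661.9

r  = sxy / math.sqrt(sxx * syy)     # part a: ~0.69
b1 = sxy / sxx                      # part b: ~15.64
b0 = sum(y) / n - b1 * sum(x) / n   #         ~93.34

print(b0 + b1 * 1.7)                # part c: ~119.9
fitted = b0 + b1 * 2.2              # part d: fitted value ~127.8
print(146.8 - fitted)               #         residual ~19.0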
Multiple Regression

Multiple regression analysis is a straightforward extension of simple regression analysis which allows more than one independent variable.
Multiple regression

Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3, etc., on a single dependent variable, y.

The different x variables are combined in a linear way and each has its own regression coefficient:

y = a1x1 + a2x2 + … + anxn + b + ε

The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for.
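A minimal sketch of fitting such a model by ordinary least squares, assuming NumPy is available; the numbers below are made up purely for illustration:

import numpy as np

# Made-up illustrative data: one response y and two predictors x1, x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([5.1, 6.9, 11.2, 12.8, 17.1, 18.9])

# Design matrix with a column of ones for the intercept b,
# so the model is y = a1*x1 + a2*x2 + b + error.
X = np.column_stack([x1, x2, np.ones_like(x1)])

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a1, a2, b = coeffs
print(a1, a2, b)   # fitted regression coefficients and intercept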
Further Reading
For more on linear and multiple regression, see Chapters 11 and 12 of this book:

Applied Statistics and Probability for Engineers (Third Edition) by Douglas C. Montgomery and George C. Runger
