
Chapter 4


Faculty of medicine and Pharmacy / Al-mergib University - Academic year 2022/2023

Biostatistics for Premedical students - Dr. A. A. Aziz - Dr. M.O. Alshrani

Chapter IV

Correlation and Regression

4.1 Correlation

Correlation is a single number that describes the degree of relationship between two variables.

Correlation is any statistical relationship between two random variables,
regardless of whether the relationship is causal (one variable causes the other)
or not. Although correlation technically refers to any statistical association,
it is typically used to describe how linearly related two variables are.

Even though correlation cannot be used to prove a causal relationship
between two variables, it can be used to make predictions. For example,
given two variables that are highly correlated, we can relatively accurately
predict the value of one given the other.

- Scatter plots
Scatter plots are graphs that depict clusters of dots that represent all of the
pairs of data in an experiment. For example, a plot of weight vs. height will
show a positive correlation: as height increases, weight also increases.

Scatter plots are constructed by plotting two variables along the horizontal
(x) and vertical (y) axes. Below are examples of scatter plots showing a
positive correlation, negative correlation, and no or little correlation. Note
that the more closely the cluster of dots represents a straight line, the
stronger the correlation.


• Positive correlation - The two random variables increase together.
There is a positive correlation between height and weight: weight
increases as height increases.

• Negative correlation - One of the random variables increases as the
other decreases. There is a negative correlation between speed and
the amount of time it takes to get somewhere: as speed increases, it
takes a shorter amount of time to get to a destination.

• No correlation - There is no linear relationship between the two
random variables. There is no correlation between being able to
write in cursive and the number of fish in the ocean.

4.1.1 Correlation coefficient


A correlation coefficient is a numerical representation of the relationship
between a pair of random variables. There are several different correlation
coefficients, the most commonly used of which is the Pearson correlation
coefficient.
- The Pearson correlation coefficient:
The Pearson correlation coefficient (r), also referred to as Pearson's r, is a
value between -1 and +1 that describes the linear relationship between two
random variables.
The closer r is to -1 or +1, the stronger the linear relationship between the
variables. An r of 0 would mean that there is no linear correlation between
the variables at all:
• r = 1: perfect positive correlation
• r = -1: perfect negative correlation
• r = 0: no linear correlation

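The Pearson correlation coefficient is calculated using the following equation:

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$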


Positive Correlations

• A positive correlation results when an increase in one variable is
accompanied by an increase in the other.
• The association is measured with a correlation coefficient (r) that ranges
between 0 and 1 for a positive correlation.
• An r value of 0 suggests that there is no linear association present (no
correlation).
• An r value of 1 suggests that there is a perfect linear association present
(perfect positive correlation).

Negative Correlation
• A negative correlation results when an increase in one variable is
accompanied by a decrease in the other.
• An r value of -1 suggests that there is a perfect linear association present
(perfect negative correlation).


Examples of Correlation in Real Life:

1) Time Spent Running vs. Body Fat (negative correlation)

The more time an individual spends running, the lower their body fat tends
to be.

2) Time Spent Watching TV vs. Exam Scores (negative correlation)

The more time a student spends watching TV, the lower their exam scores
tend to be.


3) Height vs. Weight (positive correlation)

The correlation between the height of an individual and their weight tends
to be positive.

4) Temperature vs. Ice Cream Sales (positive correlation)

When it's hotter outside, the total ice cream sales of companies tend to be
higher, since more people buy ice cream when it's hot out.


5) Coffee Consumption vs. Intelligence (no correlation)

The amount of coffee that individuals consume and their IQ level have a
correlation of zero.

6) Shoe Size vs. Movies Watched (no correlation)

The shoe size of individuals and the number of movies they watch per year
have a correlation of zero.


Example 1:

Find the correlation coefficient for the data in this table:

X 22 35 16 40 10 25 20 45 20 12
Y 5 4 7 3 8 5 6 2 4 7

What is the type of correlation?

Solution

          X     Y     XY     X²    Y²
         22     5    110    484    25
         35     4    140   1225    16
         16     7    112    256    49
         40     3    120   1600     9
         10     8     80    100    64
         25     5    125    625    25
         20     6    120    400    36
         45     2     90   2025     4
         20     4     80    400    16
         12     7     84    144    49
Total   245    51   1061   7259   293

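$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$

$$r = \frac{(10)(1061) - (245)(51)}{\sqrt{(10)(7259) - (245)^2}\,\sqrt{(10)(293) - (51)^2}} = -0.93$$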

We can see that this correlation is negative and strong.
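
The hand calculation can be checked with a few lines of Python. This is only a minimal sketch of the raw-sums formula; the function name `pearson_r` is an illustrative choice and does not come from the course material.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from raw sums (the formula used above)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data of the example above
x = [22, 35, 16, 40, 10, 25, 20, 45, 20, 12]
y = [5, 4, 7, 3, 8, 5, 6, 2, 4, 7]
r = pearson_r(x, y)
print(round(r, 2))  # -0.93
```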

Example 2:

Find the correlation coefficient for the data in this table:

X 1 3 4 6 8 9 11 14
Y 1 2 4 4 5 7 8 9

What is the type of correlation?


Solution

          X     Y     XY     X²    Y²
          1     1      1      1     1
          3     2      6      9     4
          4     4     16     16    16
          6     4     24     36    16
          8     5     40     64    25
          9     7     63     81    49
         11     8     88    121    64
         14     9    126    196    81
Total    56    40    364    524   256

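$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$

$$r = \frac{(8)(364) - (56)(40)}{\sqrt{(8)(524) - (56)^2}\,\sqrt{(8)(256) - (40)^2}} = \frac{672}{\sqrt{1056}\,\sqrt{448}} = 0.977$$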

We can see that this correlation is positive and strong.
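
As a cross-check, the same coefficient can be obtained from deviations about the means, which is algebraically equivalent to the raw-sums formula. The sketch below is illustrative only (the name `pearson_r_centered` is an assumption, not part of the notes); for this data it evaluates to about 0.977.

```python
import math
from statistics import mean

def pearson_r_centered(x, y):
    """Pearson correlation from deviations about the means."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Data of the example above
x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 4, 4, 5, 7, 8, 9]
r = pearson_r_centered(x, y)
print(round(r, 3))  # 0.977
```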

- The Spearman rank correlation coefficient:

When the data are ordinal (ranks or ordered descriptive categories), we use the
Spearman rank correlation.

Solution steps:

1) Rank the values of each variable, X and Y, separately.
2) If several observations have the same value (ties), give each of them the
average of the ranks they occupy.
3) Calculate the difference between the X rank and the Y rank of each pair:

D = Rx - Ry

4) Find the Spearman rank correlation coefficient by:

$$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$


Example 1:

Find the Spearman rank correlation coefficient for the data in this table:

X 5 7 9 10 8 6
Y 1 3 2 6 5 4

Solution

Rank each variable, giving rank 1 to the largest value:

X:  10 9 8 7 6 5
Rx:  1 2 3 4 5 6
Y:   6 5 4 3 2 1
Ry:  1 2 3 4 5 6

         X    Y   Rx   Ry   D = Rx - Ry   D²
         5    1    6    6        0         0
         7    3    4    4        0         0
         9    2    2    5       -3         9
        10    6    1    1        0         0
         8    5    3    2        1         1
         6    4    5    3        2         4
Total                                     14

$$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)} = 1 - \frac{6(14)}{6(6^2 - 1)} = 1 - 0.40 = 0.60$$

This relationship is positive and moderate.
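
The ranking and the formula can be sketched in Python as follows. The helper names are illustrative assumptions, and this simple ranking only works when all values of a variable are distinct, as they are in this example.

```python
def descending_ranks(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Spearman rank correlation via 1 - 6*sum(D^2) / (n*(n^2 - 1))."""
    rx, ry = descending_ranks(x), descending_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

x = [5, 7, 9, 10, 8, 6]
y = [1, 3, 2, 6, 5, 4]
print(descending_ranks(x))         # [6, 4, 2, 1, 3, 5]
print(descending_ranks(y))         # [6, 4, 5, 1, 2, 3]
print(round(spearman_r(x, y), 2))  # 0.6
```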
Example 2:
Two doctors each reported on the state of the same 10 patients:
Doctor1 (X) Good Excellent Bad Bad Good V.Good Excellent V.Bad Bad Good
Doctor2 (Y) Good V.Good Bad V.Bad Excellent Excellent Excellent Bad Bad V.Good
What is the relationship between the two reports?

Solution
Order the grades of each doctor from best to worst and number the positions
from 10 down to 1. Patients with the same grade share the average of the
positions they occupy.

Position  10         9          8          7       6       5     4    3    2    1
X         Excellent  Excellent  V.Good     Good    Good    Good  Bad  Bad  Bad  V.Bad
Rx        9.5        9.5        8          6       6       6     3    3    3    1

Here (9+10)/2 = 9.5, (7+6+5)/3 = 6 and (4+3+2)/3 = 3.

Position  10         9          8          7       6       5     4    3    2    1
Y         Excellent  Excellent  Excellent  V.Good  V.Good  Good  Bad  Bad  Bad  V.Bad
Ry        9          9          9          6.5     6.5     5     3    3    3    1

Here (10+9+8)/3 = 9, (7+6)/2 = 6.5 and (4+3+2)/3 = 3.


X          Y          Rx    Ry    D      D²
Good       Good       6     5     1      1
Excellent  V.Good     9.5   6.5   3      9
Bad        Bad        3     3     0      0
Bad        V.Bad      3     1     2      4
Good       Excellent  6     9     -3     9
V.Good     Excellent  8     9     -1     1
Excellent  Excellent  9.5   9     0.5    0.25
V.Bad      Bad        1     3     -2     4
Bad        Bad        3     3     0      0
Good       V.Good     6     6.5   -0.5   0.25
Total                                    28.5

$$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)} = 1 - \frac{6(28.5)}{10(10^2 - 1)} = 0.83$$

This relationship is positive and strong.
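
The average-rank step for tied grades can be sketched in Python. The grade ordering in `GRADE_ORDER` and all function names are assumptions made for illustration, read off the worked solution. Note that the shortcut formula is exact only when there are no ties; with ties it is an approximation, which is how it is used in this chapter.

```python
# Assumed ordering of the grades from worst to best (read off the solution).
GRADE_ORDER = {"V.Bad": 0, "Bad": 1, "Good": 2, "V.Good": 3, "Excellent": 4}

def average_ranks(grades):
    """Rank 1 = worst grade; tied grades share the mean of their positions."""
    scores = [GRADE_ORDER[g] for g in grades]
    ordered = sorted(scores)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1   # first position occupied by this grade
        count = ordered.count(s)       # number of patients sharing the grade
        ranks.append(first + (count - 1) / 2)  # mean of first..first+count-1
    return ranks

def spearman_from_ranks(rx, ry):
    """Shortcut formula 1 - 6*sum(D^2) / (n*(n^2 - 1)) applied to given ranks."""
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

doctor1 = ["Good", "Excellent", "Bad", "Bad", "Good",
           "V.Good", "Excellent", "V.Bad", "Bad", "Good"]
doctor2 = ["Good", "V.Good", "Bad", "V.Bad", "Excellent",
           "Excellent", "Excellent", "Bad", "Bad", "V.Good"]
rx = average_ranks(doctor1)
ry = average_ranks(doctor2)
print(rx)  # [6.0, 9.5, 3.0, 3.0, 6.0, 8.0, 9.5, 1.0, 3.0, 6.0]
print(ry)  # [5.0, 6.5, 3.0, 1.0, 9.0, 9.0, 9.0, 3.0, 3.0, 6.5]
print(round(spearman_from_ranks(rx, ry), 2))  # 0.83
```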

Example 3:

The ranks of two variables X and Y are given as:

RX 2 4.5 2 6 4.5 8 2 8 8
RY 8.5 4.5 6 4.5 8.5 2.5 7 2.5 1

Find the relationship between X and Y.

Solution

Rx     Ry     D      D²
2      8.5    -6.5   42.25
4.5    4.5    0      0
2      6      -4     16
6      4.5    1.5    2.25
4.5    8.5    -4     16
8      2.5    5.5    30.25
2      7      -5     25
8      2.5    5.5    30.25
8      1      7      49


$$\sum_{i=1}^{n} D_i^2 = 211$$

$$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)} = 1 - \frac{6(211)}{9(9^2 - 1)} = -0.76$$

It is a strong negative relationship.

4.2 Simple Linear Regression

Regression describes the relationship between two or more variables.


The usual purpose of regression analysis is to explain and predict the change in
the magnitude of an outcome variable in terms of change in magnitude of an
explanatory variable in case of a simple regression model. The variable to be
explained is called the dependent variable, outcome variable, or response
variable, usually denoted by Y, and the remaining variable for which change in
outcome variable is studied is known as the independent, predictor, or
explanatory variable, usually denoted by X. For instance, Y may denote the
weight of an individual and X may represent associated height or daily calorie
intake. Generally, one does not predict the exact value of the occurrence. We are
usually satisfied if the predictions are, on the average, reasonably close. The
statistician usually wants to find the equation of the curve of best fit to express
the relationship of the variables.
The regression analysis deals with studying the form of the relationship between
the variables in order to find a mathematical equation relating the outcome
variables with explanatory variables. Once the mathematical form of the
relationship is determined, we can utilize it to predict the value of the response
(dependent) variable (Y) by knowing the value of the predictor (independent)
variable (X). In other words, the objective of regression analysis is to estimate
the mean of response variable (Y) by using the value of the predictor variable
(X). For instance, we might be interested in predicting the blood pressure level
of a person by using his weight, or predicting the height of an adult male by using
the height of his father.


- Linear Regression Model:

The equation that explains the relationship between X and Y is called the
regression model:

$$Y = \beta_0 + \beta_1 X + e_i$$

This model is called a simple regression model because it takes into
account a single independent variable X to explain the dependent variable,
Y. There are two regression parameters in this model, both unknown
constants: $\beta_0$ is the intercept and $\beta_1$ is the slope, known as the
regression coefficient of the independent variable, X.

Where

$\beta_0$: the intercept (regression constant)

$\beta_1$: the slope, known as the regression coefficient

$e_i$: the experimental errors (residuals)

- Estimation of 𝛽0 and 𝛽1

The parameters of the regression model, $\beta_0$ and $\beta_1$, are unknown
population values. We have to estimate these parameters using the sample
data. There is more than one method of estimating parameters. For
estimating the parameters of a regression model, we generally use the
method of least squares, which is preferred because of its good properties
for linear regression models, some of which will be highlighted in this
chapter.

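1- The regression equation of Y on X (Y/X):

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$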

Where


$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

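2- The regression equation of X on Y (X/Y):

$$\hat{X} = \hat{\alpha}_0 + \hat{\alpha}_1 Y$$

$$\hat{\alpha}_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}$$

$$\hat{\alpha}_0 = \bar{X} - \hat{\alpha}_1 \bar{Y}$$

Where

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$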

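• Important Note:

From the two equations (Y/X) and (X/Y) we can find the correlation coefficient as

$$\rho = \pm\sqrt{\hat{\beta}_1 \times \hat{\alpha}_1}$$

where $\rho$ takes the same sign as the two regression coefficients $\hat{\beta}_1$ and $\hat{\alpha}_1$ (they always share a sign).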

- Interpretations of the Parameters of the Regression Line

The parameters of the population regression model, $Y = \beta_0 + \beta_1 X + e_i$,
are $\beta_0$ and $\beta_1$. The first parameter, $\beta_0$, is the intercept of the line; it is the
value of Y when X = 0. The second parameter, $\beta_1$, is the slope of the
regression line; it is the amount of change in the value of Y, on average,
when the value of X is increased by one unit.

In particular, the value of $\beta_1$ determines the nature of the linear relationship
between X and Y.

- If there is no linear relationship between X and Y, then 𝛽1 = 0.


- If there is a positive linear relationship, then 𝛽1 > 0.
- If there is a negative linear relationship, then 𝛽1 < 0.


Example1:

Find the regression equation (Y/X) using the following data:

Y 19 20 21 19 23 18 22
X 8 7 8 7 9 6 8

Then find the value of Y when X = 10.

Solution

          X     Y     XY    X²
          8    19    152    64
          7    20    140    49
          8    21    168    64
          7    19    133    49
          9    23    207    81
          6    18    108    36
          8    22    176    64
Total    53   142   1084   407

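$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{7(1084) - (53)(142)}{7(407) - (53)^2} = \frac{62}{40} = 1.55$$

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{53}{7} = 7.571, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{142}{7} = 20.286$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 20.286 - (1.55)(7.571) = 8.55$$

Then

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

$$\hat{Y} = 8.55 + 1.55X$$

If X = 10, then:

$$\hat{Y} = 8.55 + 1.55(10) = 24.05$$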


We can note that the relationship between X and Y is positive
because $\hat{\beta}_1 > 0$.
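
The least-squares estimates for this example can be checked numerically. The sketch below is a direct transcription of the formulas for the slope and intercept; the function name `fit_line` is an illustrative assumption. With the sample means 53/7 ≈ 7.571 and 142/7 ≈ 20.286 it gives a slope of 1.55, an intercept of about 8.55 and a predicted value of about 24.05 at X = 10.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)  # Y-bar minus slope * X-bar
    return intercept, slope

x = [8, 7, 8, 7, 9, 6, 8]
y = [19, 20, 21, 19, 23, 18, 22]
b0, b1 = fit_line(x, y)
prediction = b0 + b1 * 10      # estimated Y when X = 10
print(round(b1, 2), round(b0, 2))
print(round(prediction, 2))
```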

Example2:

If the regression equation (Y/X) is given as:

$$\hat{Y} = 12 - 1.5X$$

and the regression equation (X/Y) is given as:

$$\hat{X} = 7 - 0.42Y$$

Find the correlation coefficient ($\rho$) and describe the relationship between
X and Y.

Solution

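Since $\hat{\beta}_1 = -1.5$ and $\hat{\alpha}_1 = -0.42$, both regression coefficients are negative, so the correlation coefficient is negative as well.

Therefore

$$\rho = -\sqrt{\hat{\beta}_1 \times \hat{\alpha}_1} = -\sqrt{(-1.5) \times (-0.42)} = -\sqrt{0.63} = -0.79$$

Note:
The sign of the correlation coefficient is the same as the sign of the regression
coefficients.

The relationship between X and Y is negative and strong.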
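
The sign rule can be made explicit in a short Python sketch. The function name `r_from_slopes` is hypothetical; the square root supplies the magnitude and the sign is copied from the slope.

```python
import math

def r_from_slopes(beta1, alpha1):
    """Correlation from the two regression slopes; the sign follows the slopes."""
    product = beta1 * alpha1
    if product < 0:
        # The slopes of Y on X and of X on Y always share a sign.
        raise ValueError("the two slopes must have the same sign")
    return math.copysign(math.sqrt(product), beta1)

rho = r_from_slopes(-1.5, -0.42)
print(round(rho, 2))  # -0.79
```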

Example3:

1- Find the regression equation (Y/X) using the following data:

X 1 2 3 4 5
Y 2 3 5 7 8

2- Prove that $\sum_{i=1}^{n} (Y_i - \hat{Y}_i) = 0$.


Solution

1- The Regression Equation (Y/X)

          X     Y    XY    X²
          1     2     2     1
          2     3     6     4
          3     5    15     9
          4     7    28    16
          5     8    40    25
Total    15    25    91    55

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{5(91) - (15)(25)}{5(55) - (15)^2} = 1.6$$

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{15}{5} = 3, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{25}{5} = 5$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 5 - (1.6)(3) = 0.2$$

Then

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

$$\hat{Y} = 0.2 + 1.6X \qquad (1)$$

2- In order to prove that $\sum_{i=1}^{n} (Y_i - \hat{Y}_i) = 0$, we have to find all of the
differences between the observed values of Y and the estimated values.

From regression equation (1), the estimated values of Y are:

X = 1: $\hat{Y} = 0.2 + 1.6(1) = 1.8$
X = 2: $\hat{Y} = 0.2 + 1.6(2) = 3.4$
X = 3: $\hat{Y} = 0.2 + 1.6(3) = 5$
X = 4: $\hat{Y} = 0.2 + 1.6(4) = 6.6$
X = 5: $\hat{Y} = 0.2 + 1.6(5) = 8.2$


          X     Y     Ŷ     Y - Ŷ
          1     2    1.8     0.2
          2     3    3.4    -0.4
          3     5    5       0
          4     7    6.6     0.4
          5     8    8.2    -0.2
Total                        0

As we see,

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i) = 0$$
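
The same check can be done in Python. This sketch uses exact fractions for the estimates 0.2 and 1.6 from equation (1) so that the residuals sum to exactly zero instead of a tiny floating-point remainder; the variable names are illustrative. The sum is zero by construction, because the least-squares intercept is chosen as the mean of Y minus the slope times the mean of X.

```python
from fractions import Fraction

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 8]
b0 = Fraction(1, 5)   # 0.2, intercept from equation (1)
b1 = Fraction(8, 5)   # 1.6, slope from equation (1)

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print([float(f) for f in fitted])     # [1.8, 3.4, 5.0, 6.6, 8.2]
print([float(e) for e in residuals])  # [0.2, -0.4, 0.0, 0.4, -0.2]
print(sum(residuals) == 0)            # True
```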

Exercises
1) Consider the following data on weight of women in kg (Y) and height
in cm (X). The sample size is 20.
- Find the correlation between X and Y and interpret your result.
X Y
148.1 46.4
158.1 53.2
158.1 52.8
151.4 42.0
152.9 50.8
159.1 43.0
151.0 51.9
158.2 59.2
148.2 55.1
147.3 38.9
145.6 49.7
155.1 49.9
155.2 43.1
149.7 42.2
147.0 52.7
152.2 49.8
149.1 50.7
145.2 44.8
145.9 49.2
149.7 47.7


2) The following data show the ranks given by two judges to the 10 students
who took part in a college essay competition:

Student Judge 1 Judge 2


1 3 2
2 1 1
3 2 3
4 5 5
5 4 4
6 6 6
7 9 7
8 10 9
9 7 8
10 8 10
Find the Spearman’s rank correlation and interpret your result.

3) Using the following data, estimate the parameters of the regression models
(Y/X) and (X/Y) and interpret your estimates.

X 2 1 3 2 5
Y 4 3 5 3 6

References:
1) HRS (Health And Retirement Study) (2014). Public use dataset. Produced and distributed by
the University of Michigan with funding from the National Institute on Aging (grant number
NIAU01AG09740). Ann Arbor, MI.
2) https://www.statology.org/correlation-examples-in-real-life/
