
Correlation Notes


Correlation

So far we have confined ourselves to distributions involving only one variable, known as univariate distributions. Normally we collect data on many variables. Whenever we conduct an experiment we gather information on several related variables. A questionnaire, for example, will record details such as name, age, education, gender, years of experience, income, etc.

Correlation is the study of the relationship between two or more variables.

When there are two related variables, their joint distribution is known as a bivariate distribution (a bivariate normal distribution when the variables are jointly normal), and if there are more than two variables their joint distribution is known as a multivariate distribution.

In the case of a bivariate or multivariate distribution, we are interested in discovering and measuring the magnitude and direction of the relationship between two or more variables. For this we use the tool known as correlation.

Suppose we have two continuous variables X and Y. If a change in X is accompanied by a systematic change in Y, the variables are said to be correlated. In other words, the systematic relationship between the variables is termed correlation. When only two variables are involved the correlation is known as simple correlation, and when more than two variables are involved it is known as multiple correlation. For example, if there are two variables X and Y the relationship is called simple correlation, and if there are three variables, say Y, X1 and X2, the relationship is known as multiple correlation.

Types of correlation

If an increase in the value of X is accompanied by an increase in the value of Y, the variables move in the same direction. If a decrease in X is accompanied by a decrease in Y, they also move in the same direction. Similarly, if an increase in X is accompanied by a decrease in Y, or a decrease in X by an increase in Y, they move in opposite directions.

When the variables move in the same direction they are said to be positively correlated, and when they move in opposite directions they are said to be negatively correlated. Positive correlation is also known as direct correlation, and negative correlation is also termed inverse correlation. When the two variables are not related at all they are said to be independent. An example of each of the above is:

• rainfall and yield of a crop - positive correlation
• the speed of a vehicle and the time taken to reach a destination - negative correlation
• amount of rainfall in Mumbai and yield of a crop in Chennai - independent

In correlation we need not know which is the cause

variable and which is the effect variable.

Other examples of positive correlation:

• Height and weight. Taller people tend to be heavier.
• Day temperature and sale of ice cream.

An example of negative correlation is height above sea level and temperature. As you climb a mountain (increase in height) it gets colder (decrease in temperature).

A zero correlation (independence) exists when there is no relationship between two variables. For example, there is no relationship between the amount of tea drunk and level of intelligence.

Some more examples are

• age and salary
• experience and salary
• quantity of fertiliser applied and yield of a crop
• demand and supply
• the cost (in rupees) of a call and its duration (length)
• time spent running on a treadmill and calories burned: the more time you spend running, the more calories you will burn
• height and shoe size: taller people tend to have larger shoe sizes and shorter people smaller shoe sizes
• the number of days of absence in a course and the final exam grade

Scatter Diagram

In correlation studies we first have to investigate whether there is a relationship between the variables X and Y. A correlation can be expressed visually by drawing a scatter diagram (also known as a scatterplot, scatter graph or scatter chart).

A scatter diagram is a graphical display that shows the relationship between two numerical variables, with each pair of scores represented as a point (or dot). A scatter diagram indicates the strength and direction of the correlation between the variables.

To investigate whether there is any relationship between the variables X and Y we use a scatter diagram. Let (x1, y1), (x2, y2), ..., (xn, yn) be n pairs of observations. If the variables X and Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet, the resulting diagram of dots is known as a scatter diagram. From the scatter diagram we can say whether there is any correlation between x and y, whether it is positive or negative, and whether it is linear or curvilinear. It may take any one of the following forms.

When you draw a scatter diagram it doesn't matter which variable goes on the x-axis and which goes on the y-axis. Remember, in correlation we are always dealing with paired scores, so the values of the two variables taken together are used to make the diagram. Decide which variable goes on each axis and then simply put a dot or a cross at the point where the two values meet.

There is no fixed rule for determining what size of correlation is considered strong, moderate or weak. The interpretation of the coefficient depends on the topic of study. As a rough guide, if the absolute value of the correlation coefficient is greater than 0.4 we say the relationship is moderate, and if it is greater than 0.75 we say it is relatively strong.

The correlation coefficient can take the following values:

• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
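
The rough guide above can be sketched as a small Python helper. This is only an illustration: the function name describe_correlation is hypothetical, and the label "weak" for values at or below 0.4 is an assumption, since the notes give names only to the 0.4 and 0.75 thresholds.

```python
def describe_correlation(r):
    """Describe the direction and rough strength of a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        # A value outside [-1, +1] indicates a computational error.
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size > 0.75:
        strength = "relatively strong"
    elif size > 0.4:
        strength = "moderate"
    else:
        strength = "weak"  # assumed label; the notes do not name this range
    return f"{strength} {direction} correlation"


print(describe_correlation(0.78))  # relatively strong positive correlation
print(describe_correlation(-0.5))  # moderate negative correlation
```

The thresholds are conventions rather than rules, so they may reasonably be changed for a given field of study.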


Pearson's product moment correlation coefficient

The most common measure of correlation is Pearson's product-moment correlation, which is commonly referred to as the correlation coefficient. The measure of the degree of relationship between two continuous variables is called the correlation coefficient. It is denoted by r (in the case of a sample) and ρ (rho, in the case of a population). The correlation coefficient r is known as Pearson's correlation coefficient as it was developed by Karl Pearson. It is also called the product moment correlation.

The correlation coefficient r is given as the ratio of the covariance of the variables X and Y to the product of the standard deviations of X and Y.

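Symbolically,

\[
r = \frac{\frac{1}{n-1}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2}\;\sqrt{\frac{1}{n-1}\sum (y - \bar{y})^2}}
\]

As the factor 1/(n - 1) is common to the numerator and the denominator, it cancels, and the formula can be written as

\[
r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}
\]
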

Where:

• rxy – the correlation coefficient of the linear relationship between

the variables x and y

• xi – the values of the x-variable in a sample

• x̅ – the mean of the values of the x-variable

• yi – the values of the y-variable in a sample

• ȳ – the mean of the values of the y-variable


In order to calculate the correlation coefficient using the formula above, you must undertake the following steps:

1. Obtain a data sample with the values of the x-variable and the y-variable.

2. Calculate the means (averages) x̅ for the x-variable and ȳ for the y-variable.

3. For the x-variable, subtract the mean from each value of the x-variable (let's call this new variable "a"). Do the same for the y-variable (let's call this variable "b").

4. Multiply each a-value by the corresponding b-value and find the sum of these products (the final value is the numerator in the formula).

5. Square each a-value and calculate the sum of the results. Do the same for the b-values.

6. Multiply the two sums obtained in the previous step and find the square root of the product (this is the denominator in the formula).

7. Divide the value obtained in step 4 by the value obtained in step 6.

Equivalently, in terms of the covariance and the variances,

\[
r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x) \times \mathrm{var}(y)}}
\]
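
The steps above can be sketched in Python as follows. This is a minimal illustration using only the standard library. The function name pearson_r is hypothetical, and the sample data are the perfectly linear X and Y values used later in these notes to illustrate linear correlation.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed by following the seven steps literally."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 pairs)")
    n = len(xs)
    # Step 2: means of x and y.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Step 3: deviations from the means ("a" and "b").
    a = [x - x_bar for x in xs]
    b = [y - y_bar for y in ys]
    # Step 4: sum of products of deviations (numerator).
    numerator = sum(ai * bi for ai, bi in zip(a, b))
    # Steps 5 and 6: sums of squared deviations, multiplied, then square-rooted.
    # Note: this is zero (and the division fails) if either variable is constant.
    denominator = math.sqrt(sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b))
    # Step 7: the ratio is r.
    return numerator / denominator


x = [2, 4, 6, 8, 10]
y = [5, 8, 11, 14, 17]
r = pearson_r(x, y)
print(r)  # 1.0, since y increases by exactly 3 for every increase of 2 in x
```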

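Another formula that can be used is

\[
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}
  = \frac{SP(x, y)}{\sqrt{SS(x) \times SS(y)}}
\]
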
This correlation coefficient r is known as Pearson's product moment correlation coefficient. The numerator is termed the sum of products of X and Y and is abbreviated as SP(XY). In the denominator, the first term is called the sum of squares of X, i.e. SS(X), and the second term is called the sum of squares of Y, i.e. SS(Y).

\[
r = \frac{SP(XY)}{\sqrt{SS(X)\, SS(Y)}}
\]

The denominator in the above formula is always positive. The numerator may be positive or negative, so r may be either positive or negative.

Assumptions in correlation analysis:

Correlation coefficient r is used under certain assumptions

and they are

1. The variables under study are continuous random

variables and they are normally distributed

2. The relationship between the variables is linear

3. Each pair of observations is independent of the other pairs

Problem:
Compute Pearson's coefficient of correlation between plant height (cm) X and yield (kg) Y for the data given below:

X   39  65  62  90  82  75  25  98  36  78
Y   47  53  58  86  62  68  60  91  51  84

The means are x̅ = 650/10 = 65 and ȳ = 660/10 = 66.

  X     Y   (x-65)  (y-66)  (x-65)²  (y-66)²  (x-65)(y-66)
 39    47    -26     -19      676      361        494
 65    53      0     -13        0      169          0
 62    58     -3      -8        9       64         24
 90    86     25      20      625      400        500
 82    62     17      -4      289       16        -68
 75    68     10       2      100        4         20
 25    60    -40      -6     1600       36        240
 98    91     33      25     1089      625        825
 36    51    -29     -15      841      225        435
 78    84     13      18      169      324        234
650   660                    5398     2224       2704   (totals)

rxy = 2704 / √(5398 × 2224) = 0.7804

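Using the second formula with n = 10, Σx = 650, Σy = 660, Σxy = 45604, Σx² = 47648 and Σy² = 45784:

\[
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}
  = \frac{45604 - \dfrac{(650)(660)}{10}}{\sqrt{\left(47648 - \dfrac{(650)^2}{10}\right)\left(45784 - \dfrac{(660)^2}{10}\right)}}
  = \frac{45604 - 42900}{(73.47)(47.16)} = 0.7804
\]
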
The correlation coefficient is positive, so plant height and yield are positively correlated.
Both methods give the same result.
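
As a check on the arithmetic, the two methods can be reproduced with a short Python sketch. The variable names are illustrative only; the data are the plant height and yield values from the problem above.

```python
import math

height = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]    # X, plant height (cm)
yield_kg = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]  # Y, yield (kg)
n = len(height)

# Method 1: deviations from the means.
x_bar = sum(height) / n
y_bar = sum(yield_kg) / n
sp_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, yield_kg))
ss_x_dev = sum((x - x_bar) ** 2 for x in height)
ss_y_dev = sum((y - y_bar) ** 2 for y in yield_kg)
r_dev = sp_dev / math.sqrt(ss_x_dev * ss_y_dev)

# Method 2: raw sums, SP(XY) / sqrt(SS(X) * SS(Y)).
sum_x, sum_y = sum(height), sum(yield_kg)
sum_xy = sum(x * y for x, y in zip(height, yield_kg))
sum_x2 = sum(x ** 2 for x in height)
sum_y2 = sum(y ** 2 for y in yield_kg)
sp = sum_xy - sum_x * sum_y / n
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
r_raw = sp / math.sqrt(ss_x * ss_y)

print(sp, ss_x, ss_y)                    # 2704.0 5398.0 2224.0
print(round(r_dev, 4), round(r_raw, 4))  # 0.7804 0.7804
```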

A second data set, with deviations u and v taken from the (rounded) means 5.13 of Y and 79.41 of X:

   Y       X     u = y-5.13   v = x-79.41      u²         v²       u × v
  5.22    94.2      0.09        14.79        0.0081     218.74     1.3311
  8.13    69.3      3.00       -10.11        9          102.21   -30.33
  6.52   114.3      1.39        34.89        1.9321    1217.3     48.497
  4.16    83.3     -0.97         3.89        0.9409      15.132   -3.773
  8.98    85.4      3.85         5.99       14.823       35.88    23.062
  3.05    68.1     -2.08       -11.31        4.3264     127.92    23.525
  3.49    50.7     -1.64       -28.71        2.6896     824.26    47.084
  5.4     96.2      0.27        16.79        0.0729     281.9      4.5333
  2.39    76.1     -2.74        -3.31        7.5076      10.956    9.0694
  2.71    52       -2.42       -27.41        5.8564     751.31    66.332
  3.97    82.1     -1.16         2.69        1.3456       7.2361  -3.12
  7.56    81.3      2.43         1.89        5.9049       3.5721   4.5927
 61.58   953        0.02         0.08       54.407     3596.4    190.8     (totals)

r = 190.8 / √(54.407 × 3596.4) = 0.4313

The same data after a change of origin (3 subtracted from Y and 20 subtracted from X):

   Y       X      y-3     x-20
  5.22    94.2    2.22    74.2
  8.13    69.3    5.13    49.3
  6.52   114.3    3.52    94.3
  4.16    83.3    1.16    63.3
  8.98    85.4    5.98    65.4
  3.05    68.1    0.05    48.1
  3.49    50.7    0.49    30.7
  5.4     96.2    2.4     76.2
  2.39    76.1   -0.61    56.1
  2.71    52     -0.29    32
  3.97    82.1    0.97    62.1
  7.56    81.3    4.56    61.3

Correlation: r = 0.4313

Properties

1. It is a unit-free measure.

2. The correlation coefficient ranges between -1 and +1. If we get a value of r beyond these limits, it is an indication of wrong computation.

3. The correlation coefficient is not affected by a change of origin or scale or both. When a constant is added to or subtracted from the original values of a variable we say that the origin is changed. When the original values of a variable are multiplied or divided by a positive constant we say that the scale is changed. (Multiplying one variable by a negative constant reverses the sign of r but leaves its magnitude unchanged.)

4. It is symmetric, i.e. rxy = ryx.
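
Properties 2 to 4 can be checked numerically with a short Python sketch. It uses the 12-pair Y and X data set tabulated before this section. The helper name pearson_r and the scale factors (dividing X by 10, multiplying Y by 100) are assumptions chosen only for illustration; the shifts of 20 and 3 are those used in the table above.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of products and squares of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


y = [5.22, 8.13, 6.52, 4.16, 8.98, 3.05, 3.49, 5.4, 2.39, 2.71, 3.97, 7.56]
x = [94.2, 69.3, 114.3, 83.3, 85.4, 68.1, 50.7, 96.2, 76.1, 52.0, 82.1, 81.3]

r_original = pearson_r(x, y)
# Change of origin: subtract 20 from every x and 3 from every y.
r_shifted = pearson_r([xi - 20 for xi in x], [yi - 3 for yi in y])
# Change of scale (illustrative positive constants).
r_scaled = pearson_r([xi / 10 for xi in x], [yi * 100 for yi in y])
# Symmetry: swap the roles of x and y.
r_swapped = pearson_r(y, x)

print(round(r_original, 4), round(r_shifted, 4),
      round(r_scaled, 4), round(r_swapped, 4))
```

All four values should agree up to floating-point rounding, at about 0.4313.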

Consider

X   2   4   6   8  10
Y   5   8  11  14  17

For every increase of 2 units in X there is an increase of 3 units in Y. The rate of change is constant, and we say the variables are linearly related. The correlation calculated for such data is called simple linear correlation.

Rank correlation

One of the assumptions of correlation analysis is that the two variables are normally distributed. When the variables are not normally distributed, the linear correlation procedure is not applicable. In such cases we use rank correlation. There are two methods available to calculate rank correlation: one proposed by Spearman and the other by Kendall. Both can be applied to the same data, but Spearman's rank correlation is the more popular of the two.

Spearman's rank correlation


This is denoted by rs. The procedure starts with ranking the measurements of the variables X and Y separately. The ranks are assigned with the highest value getting the 1st rank. The differences between the ranks of each of the n pairs are then found. They are denoted by d. Spearman's rank correlation is then calculated using the formula

\[
r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}
\]

when there are no ties.

Calculate Spearman's rank correlation for the following data (X and Y are given as ranks, and d = X - Y).

  X    Y    d    d²
  5    7   -2    4
  6    6    0    0
  4    2    2    4
  8    4    4   16
  1    3   -2    4
  3    1    2    4
  2    5   -3    9
  7   10   -3    9
  9    8    1    1
 10    9    1    1
          Σd² = 52

rs = 1 - (6 Σd² / (n(n² - 1)))
rs = 1 - (6 × 52 / (10(10² - 1)))
rs = 1 - (312/990) = 0.68
The following are the marks assigned by two judges in an interview.


Calculate Spearman's rank correlation.

Judge A 40, 48, 25, 32, 50, 41, 15, 18, 27, 36, 43, 49

Judge B 35, 40, 22, 25, 47, 38, 17, 19, 26, 33, 41, 39
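
When the data are raw marks rather than ranks, the ranking step has to be done first. The sketch below shows one possible way to do this for the judges' marks. The helper rank_descending is a hypothetical name, and it deliberately refuses tied values, because the simple formula above applies only when there are no ties.

```python
def rank_descending(values):
    """Rank values so that the highest value gets rank 1 (no ties allowed)."""
    if len(set(values)) != len(values):
        raise ValueError("tied values present; the simple formula does not apply")
    ordered = sorted(values, reverse=True)
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return [position[value] for value in values]


judge_a = [40, 48, 25, 32, 50, 41, 15, 18, 27, 36, 43, 49]
judge_b = [35, 40, 22, 25, 47, 38, 17, 19, 26, 33, 41, 39]

ranks_a = rank_descending(judge_a)
ranks_b = rank_descending(judge_b)
n = len(judge_a)
sum_d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks_a, ranks_b))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(ranks_a)
print(ranks_b)
print(sum_d2, round(rs, 3))
```

For these marks the two judges disagree on only four candidates (Σd² = 10), so rs comes out close to +1.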
