Correlation Analysis - Students Notes, MAR 2023


CORRELATION ANALYSIS

Introduction

Statistical methods of measures of central tendency, dispersion, skewness and kurtosis are
helpful for the purpose of comparison and analysis of distributions involving only one
variable, i.e. univariate distributions. However, describing the relationship between two or
more variables is another important part of statistics.

In many business research situations, the key to decision making lies in understanding the
relationships between two or more variables.

The statistical method of correlation is helpful in knowing the relationship between two or
more variables which may be related in some way.

In all these cases involving two or more variables, we may be interested in seeing:

1) if there is any association between the variables;


2) if there is an association, is it strong enough to be useful;
3) if so, what form the relationship between the two variables takes;
4) how we can make use of that relationship for predictive purposes, that is,
forecasting; and
5) how good such predictions will be.

What is Correlation?

Correlation is a measure of association between two or more variables. When two or more
variables vary in sympathy so that movement in one tends to be accompanied by
corresponding movements in the other variable(s), they are said to be correlated.

Types of Correlation

Correlation can be classified in several ways. The important ways of classifying correlation
are:

1) Positive and negative;


2) Linear and non-linear (curvilinear); and
3) Simple, partial and multiple.

Positive and Negative Correlation

If both the variables move in the same direction, we say that there is a positive correlation.

If the variables vary in opposite directions, we say that it is a case of negative
correlation; e.g., movements of price and demand.

Linear and Non-linear (Curvilinear) Correlation

If the change in one variable is accompanied by change in another variable in a constant ratio,
it is a case of linear correlation. Observe the following data:

X : 10 20 30 40 50

Y: 25 50 75 100 125

The ratio of change in the above example is the same. It is, thus, a case of linear correlation.

If the amount of change in one variable does not follow a constant ratio with the change in
another variable, it is a case of non-linear or curvilinear correlation.
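As a quick check, the constant ratio of change in the data above can be verified with a short sketch (plain Python, no libraries assumed):

```python
# Data from the linear correlation example above.
X = [10, 20, 30, 40, 50]
Y = [25, 50, 75, 100, 125]

# The change in Y per unit change in X between successive observations:
ratios = [(Y[i + 1] - Y[i]) / (X[i + 1] - X[i]) for i in range(len(X) - 1)]
print(ratios)   # [2.5, 2.5, 2.5, 2.5] -> a constant ratio, hence linear correlation
```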

Simple, Partial and Multiple Correlation

The distinction amongst these three types of correlation depends upon the number of
variables involved in a study.

If only two variables are involved in a study, then the correlation is said to be simple
correlation.

When three or more variables are involved in a study, then it is a problem of either partial
or multiple correlation.

In multiple correlation, three or more variables are studied simultaneously.

But in partial correlation we consider only two variables influencing each other while the
effect of other variable(s) is held constant.

Suppose we have a problem comprising three variables X, Y and Z.

1) X is the number of hours studied;


2) Y is I.Q.; and
3) Z is the number of marks obtained in the examination.

In a multiple correlation, we will study the relationship between the marks obtained (Z) and
the two variables, number of hours studied (X) and I.Q. (Y).

In contrast, when we study the relationship between X and Z, keeping an average I.Q. (Y) as
constant, it is said to be a study involving partial correlation.

Correlation does not necessarily mean Causation

The correlation analysis, in discovering the nature and degree of relationship between
variables, does not necessarily imply any cause and effect relationship between the
variables.

1. The correlation may be due to chance particularly when the data pertain to a small
sample.

2. It is possible that both the variables are influenced by one or more other variables.
3. There may be another situation where both the variables may be influencing each
other, so that we cannot say which is the cause and which is the effect.

Correlation analysis is also used along with regression analysis to measure how well the
regression line explains the variations of the dependent variable with the independent variable.

Some of the widely used methods include:

1. Scatter Diagram;
2. Correlation Graph;
3. Pearson's Coefficient of Correlation;
4. Spearman's Rank Correlation; and
5. Concurrent Deviation Method.

Scatter Diagram

This method is also known as Dotogram or Dot diagram. Scatter diagram is one of the
simplest methods of diagrammatic representation of a bivariate distribution. Under this
method, both the variables are plotted on the graph paper by putting dots. The diagram so
obtained is called "Scatter Diagram".

We should keep the following points in mind while interpreting correlation:

1) If the plotted points are very close to each other, it indicates high degree of
correlation. If the plotted points are away from each other, it indicates low degree of
correlation.

Figure: Scatter Diagrams

2) If the points on the diagram reveal any trend (either upward or downward), the
variables are said to be correlated and if no trend is revealed, the variables are
uncorrelated.
3) If there is an upward trend rising from lower left-hand corner and going upward to the
upper right-hand corner, the correlation is positive. If the points depict a downward
trend from the upper left hand corner to the lower right hand corner, the correlation is
negative.
4) If all the points lie on a straight line starting from the left bottom and going up
towards the right top, the correlation is perfect and positive, and if all the points lie
on a straight line starting from left top and coming down to right bottom, the
correlation is perfect and negative.

Example 1

Given the following data on sales (in thousand units) and expenses (in thousand KSHs.) of a
firm for 10 months:

Month: J F M A M J J A S O

Sales: 50 50 55 60 62 65 68 60 60 50

Expenses: 11 13 14 16 16 15 15 14 13 13

a) Make a Scatter Diagram

b) Do you think that there is a correlation between sales and expenses of the firm? Is
it positive or negative? Is it high or low?

Solution: (a) The Scatter Diagram of the given data is shown in the Figure below

Figure Scatter Diagram

(b) The figure shows that the plotted points are close to each other and reveal an upward
trend. So there is a high degree of positive correlation between sales and expenses of the firm.

Correlation Graph

This method, also known as a Correlogram, is very simple. The data pertaining to the two
series are plotted on a graph sheet. We can find out the correlation by examining the direction
and closeness of the two curves.

Example 2

Find out graphically if there is any correlation between the yield per plot (qtls), denoted by
Y, and the quantity of fertilizer used (kg), denoted by X.

Plot No.: 1 2 3 4 5 6 7 8 9 10

Y: 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3

X: 6 8 9 12 10 15 17 20 18 24

Solution: The Correlogram of the given data is shown in Figure below

Figure: Correlation Graph

The figure shows that the two curves move in the same direction and, moreover, they are very
close to each other, suggesting a close relationship between the yield per plot (qtls) and the
quantity of fertilizer used (kg).

Remark: Both the graphic methods - the scatter diagram and the correlation graph - provide a
'feel' for the data by giving a visual representation of the association between the variables.

To quantify the extent of correlation, we make use of algebraic methods which calculate
correlation coefficient.

Pearson’s Coefficient of Correlation

A mathematical method for measuring the intensity or the magnitude of linear relationship
between two variables was suggested by Karl Pearson (1857-1936).

Karl Pearson's measure, known as the Pearsonian correlation coefficient between two
variables X and Y, usually denoted by r(X,Y), rxy or simply r, is a numerical measure of the
linear relationship between them and is defined as:

The ratio of the covariance between X and Y, to the product of the standard deviations of X
and Y.

Symbolically,

rxy = Cov(X,Y) / (Sx Sy)          ............(4.1)

where, if (X1, Y1); (X2, Y2); ...; (XN, YN) are N pairs of observations of the variables X
and Y in a bivariate distribution,

Cov(X,Y) = (1/N) Σ(X − X̄)(Y − Ȳ)          ............(4.2a)

Sx = √[ (1/N) Σ(X − X̄)² ]          ............(4.2b)

Sy = √[ (1/N) Σ(Y − Ȳ)² ]          ............(4.2c)

Thus, by substituting Eqs. (4.2) in Eq. (4.1), we can write the Pearsonian correlation
coefficient as

rxy = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]          ............(4.3)

If we denote dx = X − X̄ and dy = Y − Ȳ, then

rxy = Σ dx dy / √( Σdx² · Σdy² )          ............(4.3a)

We can further simplify the calculations of Eqs. (4.2). We have

Cov(X,Y) = (1/N) ΣXY − X̄ Ȳ          ............(4.4)

Sx² = (1/N) ΣX² − X̄²          ............(4.5a)

Similarly, we have

Sy² = (1/N) ΣY² − Ȳ²          ............(4.5b)

So the Pearsonian correlation coefficient may be found as

rxy = [ N ΣXY − ΣX ΣY ] / √{ [N ΣX² − (ΣX)²] · [N ΣY² − (ΣY)²] }          ............(4.6)

Remark:

a) Eq. (4.3) or Eq. (4.3a) is quite convenient to apply if the means X̄ and Ȳ come out to
be integers;
b) If X̄ or/and Ȳ is (are) fractional then the Eq. (4.3) or Eq. (4.3a) is quite cumbersome to
apply, since the computations of ∑(X − X̄)2 , ∑(Y − Ȳ)2 and ∑(X − X̄)(Y − Ȳ) are
quite time consuming and tedious;
c) In such a case Eq. (4.6) may be used provided the values of X or/ and Y are small.
d) But if X and Y assume large values, the calculation of ∑ X 2 , ∑Y 2 and ∑ XY is
again quite time consuming.

Thus, if:

(i) X̄ and Ȳ are fractional; and

(ii) X and Y assume large values,

then Eq. (4.3) and Eq. (4.6) are not generally used for numerical problems.

In such cases, the step deviation method where we take the deviations of the variables X and
Y from any arbitrary points is used.
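As a rough sketch (function names are my own), the deviation form Eq. (4.3) and the raw-sums form Eq. (4.6) can be implemented and checked against each other; both should give the same r for any data set. The data used here are the sales/expenses figures from Example 3 later in these notes:

```python
from math import sqrt

def pearson_r(x, y):
    # Deviation form, Eq. (4.3): r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pearson_r_raw(x, y):
    # Raw-sums form, Eq. (4.6)
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
               (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

sales    = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

print(round(pearson_r(sales, expenses), 2))       # 0.79
print(round(pearson_r_raw(sales, expenses), 2))   # 0.79
```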

Properties of Pearsonian Correlation Coefficient

The following are important properties of Pearsonian correlation coefficient:

1. Pearsonian correlation coefficient cannot exceed 1 numerically.

Symbolically,

-1 ≤ r ≤1

Remarks:

(i) If in any problem the obtained value of r lies outside the limits ±1, this implies that there
is some mistake in our calculations.

(ii) The sign of r indicates the nature of the correlation. A positive value of r indicates positive
correlation, whereas a negative value indicates negative correlation; r = 0 indicates absence
of correlation.

2. Pearsonian Correlation coefficient is independent of the change of origin and scale.

If given variables X and Y are transformed to new variables U and V by change of origin and
scale, i. e.

U = (X − A)/h and V = (Y − B)/k

where A, B, h and k are constants and h > 0, k > 0;

Then the correlation coefficient between X and Y is same as the correlation coefficient
between U and V i.e.,

r(X,Y) = r(U,V), i.e., rxy = ruv          ............(4.7)

Remark:

This is one of the very important properties of the correlation coefficient and is extremely
helpful in numerical computation of r.

We had already stated that Eq. (4.3) and Eq.(4.6) become quite tedious to use in numerical
problems if X and/or Y are in fractions or if X and Y are large.

In such cases we can conveniently change the origin and scale (if possible) in X or/and Y to
get new variables U and V and compute the correlation between U and V by the Eq. (4.7)
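The invariance property can be verified numerically; in the sketch below the data set and the constants A, B, h, k are arbitrary choices for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    # Deviation form of Pearson's r (Eq. 4.3)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

X = [110, 120, 130, 140, 150]
Y = [4.2, 4.8, 5.0, 5.6, 6.4]

# Change of origin and scale: U = (X - A)/h, V = (Y - B)/k with h, k > 0.
A, h = 130, 10
B, k = 5.0, 0.2
U = [(x - A) / h for x in X]
V = [(y - B) / k for y in Y]

# r is unchanged, since it is independent of change of origin and scale.
print(round(pearson_r(X, Y), 6) == round(pearson_r(U, V), 6))   # True
```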

3. Two independent variables are uncorrelated but the converse is not true

If X and Y are independent variables then

rxy = 0

As an illustration consider the following bivariate distribution.

X : 1 2 3 -3 -2 -1

Y: 1 4 9 9 4 1

For this distribution, value of r will be 0.

Hence in the above example the variable X and Y are uncorrelated.

But if we examine the data carefully we find that X and Y are not independent but are
connected by the relation Y = X2.
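This can be checked directly; a minimal sketch computing r for the bivariate distribution above:

```python
from math import sqrt

X = [1, 2, 3, -3, -2, -1]
Y = [x * x for x in X]          # Y is completely determined by X (Y = X^2)

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(X, Y)) / n
sx = sqrt(sum((a - mx) ** 2 for a in X) / n)
sy = sqrt(sum((b - my) ** 2 for b in Y) / n)

r = cov / (sx * sy)
print(r)   # 0.0 -> uncorrelated, yet X and Y are clearly not independent
```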

Remarks:

One should not confuse uncorrelatedness with independence.

rxy = 0 i.e., uncorrelation between the variables X and Y simply implies the absence of any
linear (straight line) relationship between them.

4. The Pearsonian coefficient of correlation is the geometric mean of the two regression
coefficients, i.e.

rxy = ±√( bxy · byx )

The signs of both the regression coefficients are the same, and so the value of r will also have
the same sign.

5. The square of the Pearsonian correlation coefficient is known as the coefficient of
determination.

The coefficient of determination, which measures the percentage of variation in the dependent
variable that is accounted for by the independent variable, is a much better and more useful
measure for interpreting the value of r.
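For example, an illustrative r = 0.7 implies a coefficient of determination of only 0.49, i.e. 49 percent of the variation in the dependent variable is explained:

```python
r = 0.7                          # an illustrative correlation coefficient
r_squared = round(r ** 2, 2)     # coefficient of determination
print(r_squared)                 # 0.49 -> only 49% of the variation is explained
```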

Probable Error of Correlation Coefficient

The correlation coefficient establishes the relationship of the two variables.

Probable error of the correlation coefficient is such a measure of testing the reliability of the
observed value of the correlation coefficient, when we consider it as satisfying the conditions
of the random sampling.

If r is the observed value of the correlation coefficient in a sample of N pairs of
observations for the two variables under consideration, then the Probable Error, denoted by
PE(r), is expressed as

PE(r) = 0.6745 × (1 − r²) / √N
There are two main functions of probable error:

1. Determination of limits: The limits of population correlation coefficient are r ± PE(r),


implying that if we take another random sample of the size N from the same population, then
the observed value of the correlation coefficient in the second sample can be expected to lie
within the limits given above, with 0.5 probability.

When the sample size N is small, the concept or value of PE may lead to wrong conclusions.

Hence, to use the concept of PE effectively, the sample size N should be fairly large.

2. Interpretation of 'r': The interpretation of 'r' based on PE is as under:

• If r < PE(r), there is no evidence of correlation, i.e. a case of insignificant
correlation.

• If r > 6 PE(r), the correlation is significant.

• If r < 6 PE(r), it is insignificant.

• If the probable error is small, correlation exists where r > 0.5.
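A small sketch of these rules, assuming the usual formula PE(r) = 0.6745(1 − r²)/√N; the sample values r = 0.8 and N = 25 are illustrative:

```python
from math import sqrt

def probable_error(r, n):
    # Probable error of r for a sample of n pairs:
    # PE(r) = 0.6745 * (1 - r^2) / sqrt(n)
    return 0.6745 * (1 - r * r) / sqrt(n)

r, n = 0.8, 25
pe = probable_error(r, n)
print(round(pe, 4))              # 0.0486

# Interpretation rules from the notes:
print(r > 6 * pe)                # True -> correlation is significant
print((r - pe, r + pe))          # limits for the population correlation
```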

Example 3

Find the Pearsonian correlation coefficient between sales (in thousand units) and expenses (in
thousand KSHs.) of the following 10 firms:

Firm: 1 2 3 4 5 6 7 8 9 10

Sales: 50 50 55 60 65 65 65 60 60 50

Expenses: 11 13 14 16 16 15 15 14 13 13

Solution: Let sales of a firm be denoted by X and expenses be denoted by Y

Calculations for Coefficient of Correlation, with dx = X − X̄ and dy = Y − Ȳ

{Using Eq. (4.3) or (4.3a)}

Firm    X        Y        dx = X−X̄   dy = Y−Ȳ   dx²       dy²      dx·dy
1       50       11       −8          −3          64        9        24
2       50       13       −8          −1          64        1        8
3       55       14       −3          0           9         0        0
4       60       16       2           2           4         4        4
5       65       16       7           2           49        4        14
6       65       15       7           1           49        1        7
7       65       15       7           1           49        1        7
8       60       14       2           0           4         0        0
9       60       13       2           −1          4         1        −2
10      50       13       −8          −1          64        1        8
        ΣX=580   ΣY=140                           Σdx²=360  Σdy²=22  Σdx·dy=70

X̄ = Σ X / N = 580/10 = 58 and Ȳ = Σ Y/N = 140/10 = 14

Applying the Eq. (4.3a), we have, Pearsonian coefficient of correlation

rxy = 70 / √(360 × 22) = 70 / √7920 = 0.79

The value rxy = 0.79 indicates a high degree of positive correlation between sales and
expenses.

Example 4

The data on price and quantity purchased relating to a commodity for 5 months is given
below:

Month January February March April May


Price KSHs. 10 10 11 12 12
Quantity (Kg) 5 6 4 3 3

Find the Pearsonian correlation coefficient between prices and quantity and comment on its
sign and magnitude.

Solution: Let price of the commodity be denoted by X and quantity be denoted by Y

Calculations for Coefficient of Correlation

{Using Eq. (4.6)}

Month   X       Y       X²       Y²       XY
1       10      5       100      25       50
2       10      6       100      36       60
3       11      4       121      16       44
4       12      3       144      9        36
5       12      3       144      9        36
        ΣX=55   ΣY=21   ΣX²=609  ΣY²=95   ΣXY=226

Applying the Eq. (4.6), we have, Pearsonian coefficient of correlation

rxy = [(5 × 226) − (55 × 21)] / √{[(5 × 609) − (55 × 55)] × [(5 × 95) − (21 × 21)]}
    = (1130 − 1155) / √(20 × 34)
    = −25 / √680

rxy = −0.96

The negative sign of r indicates negative correlation and its large magnitude indicates a very
high degree of correlation.

Spearman’s Rank Correlation

Sometimes we come across statistical series in which the variables under consideration are
not capable of quantitative measurement but can be arranged in serial order.

In such situations Karl Pearson’s coefficient of correlation cannot be used as such.

Let the random variables X and Y denote the ranks of the individuals in the characteristics A
and B respectively. If we assume that there is no tie, i.e., if no two individuals get the same
rank in a characteristic then, obviously, X and Y assume numerical values ranging from 1 to
N.

The Pearsonian correlation coefficient between the ranks X and Y is called the rank
correlation coefficient between the characteristics A and B for the group of individuals.

Spearman's rank correlation coefficient, usually denoted by ρ (rho), is given by the equation

ρ = 1 − [6 Σd²] / [N(N² − 1)]          ........................(4.8)

where d is the difference between the pair of ranks of the same individual in the two
characteristics and N is the number of pairs.
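Eq. (4.8) is easy to compute; a sketch (helper names are my own) applied to the rankings of judges J1 and J2 from Example 6 below, and to the quantitative data of Example 7 below after converting it into ranks:

```python
def spearman_rho(rank_x, rank_y):
    # Eq. (4.8): rho = 1 - 6*sum(d^2) / (N(N^2 - 1)); assumes no tied ranks.
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

def to_ranks(values):
    # Rank 1 for the highest value, as in the notes (assumes no ties).
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

# Rankings of judges J1 and J2 (already ranks, so used directly).
j1 = [9, 3, 7, 5, 1, 6, 2, 4, 10, 8]
j2 = [9, 1, 10, 4, 3, 8, 5, 2, 7, 6]
print(round(spearman_rho(j1, j2), 2))   # 0.71

# Quantitative data must first be converted into ranks.
x = [75, 88, 95, 70, 60, 80, 81, 50]
y = [120, 134, 150, 115, 110, 140, 142, 100]
print(round(spearman_rho(to_ranks(x), to_ranks(y)), 2))   # 0.93
```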

Example 6

Ten entries are submitted for a competition. Three judges study each entry and list the ten in
rank order. Their rankings are as follows:

Entry: A B C D E F G H I J

JudgeJ1: 9 3 7 5 1 6 2 4 10 8

JudgeJ2: 9 1 10 4 3 8 5 2 7 6

JudgeJ3: 6 3 8 7 2 4 1 5 9 10

Calculate the appropriate rank correlation to help you answer the following questions:

(i) Which pair of judges agrees the most?

(ii) Which pair of judges disagrees the most?

Solution:

Calculations for Coefficient of Rank Correlation

{Using Eq.(4.8)}

        Rank by Judges          Differences in Ranks
Entry   J1   J2   J3   d(J1,J2)   d²   d(J1,J3)   d²   d(J2,J3)   d²
A       9    9    6    0          0    +3         9    +3         9
B       3    1    3    +2         4    0          0    −2         4
C       7    10   8    −3         9    −1         1    +2         4
D       5    4    7    +1         1    −2         4    −3         9
E       1    3    2    −2         4    −1         1    +1         1
F       6    8    4    −2         4    +2         4    +4         16
G       2    5    1    −3         9    +1         1    +4         16
H       4    2    5    +2         4    −1         1    −3         9
I       10   7    9    +3         9    +1         1    −2         4
J       8    6    10   +2         4    −2         4    −4         16
                       Σd²=48          Σd²=26          Σd²=88

ρ(J1 & J2) = 1 − 6Σd² / [N(N² − 1)] = 1 − (6 × 48)/[10(10² − 1)] = 1 − 288/990 = 1 − 0.29 = +0.71

ρ(J1 & J3) = 1 − (6 × 26)/[10(10² − 1)] = 1 − 156/990 = 1 − 0.16 = +0.84

ρ(J2 & J3) = 1 − (6 × 88)/[10(10² − 1)] = 1 − 528/990 = 1 − 0.53 = +0.47

So (i) Judges J1 and J3 agree the most (highest ρ), and

(ii) Judges J2 and J3 disagree the most (lowest ρ).

Spearman's rank correlation Eq. (4.8) can also be used even if we are dealing with variables
which are measured quantitatively; in that case, we shall have to convert the data into ranks.

The highest (or the smallest) observation is given the rank 1. The next highest (or the next
lowest) observation is given rank 2, and so on.

Example 7

Calculate the rank coefficient of correlation from the following data:

X: 75 88 95 70 60 80 81 50

Y: 120 134 150 115 110 140 142 100

Solution:

Calculations for Coefficient of Rank Correlation

{Using Eq.(4.8)}

X     Rank Rx   Y     Rank Ry   d = Rx − Ry   d²
75    5         120   5         0             0
88    2         134   4         −2            4
95    1         150   1         0             0
70    6         115   6         0             0
60    7         110   7         0             0
80    4         140   3         +1            1
81    3         142   2         +1            1
50    8         100   8         0             0
                                              Σd²=6

ρ = 1 − 6Σd² / [N(N² − 1)] = 1 − (6 × 6)/[8(8² − 1)] = 1 − 36/504 = 1 − 0.07 = +0.93

Hence, there is a high degree of positive correlation between X and Y

Repeated Ranks

In case of attributes if there is a tie i.e., if any two or more individuals are placed together in
any classification w.r.t. an attribute or if in case of variable data there is more than one item
with the same value in either or both the series then Spearman’s Eq.(4.8) for calculating the
rank correlation coefficient breaks down, since in this case the variables X [the ranks of
individuals in characteristic A (1st series)] and Y [the ranks of individuals in characteristic B
(2nd series)] do not take the values from 1 to N.

In this case common ranks are assigned to the repeated items. These common ranks are the
arithmetic mean of the ranks, which these items would have got if they were different from
each other and the next item will get the rank next to the rank used in computing the common
rank.

If only a small proportion of the ranks are tied, this technique may be applied together with
Eq. (4.8). If a large proportion of ranks are tied, it is advisable to apply an adjustment or a
correction factor to Eq. (4.8), as explained below:

"In the Eq. (4.8), add the factor

m(m² − 1) / 12          ............(4.8a)

to Σd², where m is the number of times an item is repeated. This correction factor is to be
added for each repeated value in both the series."
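A sketch of the tie-handling just described: average ("common") ranks are assigned to repeated items and the correction factor (4.8a) is added for each group of m tied values. Helper names are my own; the result is checked against the share/debenture data of Example 8 below:

```python
from collections import Counter

def average_ranks(values):
    # Common ranks: the arithmetic mean of the ranks the tied items would
    # have received; rank 1 goes to the highest value, as in the notes.
    order = sorted(values, reverse=True)
    return [order.index(v) + (order.count(v) + 1) / 2 for v in values]

def spearman_rho_tied(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Correction factor m(m^2 - 1)/12 for every repeated value in either series.
    cf = sum(m * (m * m - 1) / 12
             for series in (x, y)
             for m in Counter(series).values() if m > 1)
    return 1 - 6 * (d2 + cf) / (n * (n * n - 1))

prefs      = [73.2, 85.8, 78.9, 75.8, 77.2, 81.2, 83.8]   # preference shares (X)
debentures = [97.8, 99.2, 98.8, 98.3, 98.3, 96.7, 97.1]   # debentures (Y)
print(spearman_rho_tied(prefs, debentures))   # 0.125
```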

Example 8

For a certain joint stock company, the prices of preference shares (X) and debentures (Y) are
given below:

X: 73.2 85.8 78.9 75.8 77.2 81.2 83.8

Y: 97.8 99.2 98.8 98.3 98.3 96.7 97.1

Use the method of rank correlation to determine the relationship between preference prices
and debentures prices.

Solution:

Calculations for Coefficient of Rank Correlation

{Using Eq. (4.8) and (4.8a)}


X      Y      Rank of X (XR)   Rank of Y (YR)   d = XR − YR   d²
73.2   97.8   7                5                +2            4
85.8   99.2   1                1                0             0
78.9   98.8   4                2                +2            4
75.8   98.3   6                3.5              +2.5          6.25
77.2   98.3   5                3.5              +1.5          2.25
81.2   96.7   3                7                −4            16
83.8   97.1   2                6                −4            16
                                                Σd=0          Σd²=48.50

Now we also have to apply the correction factor m(m² − 1)/12 to Σd², where m is the number
of times a value is repeated; here m = 2 (the value 98.3 occurs twice in the Y series).

ρ = 1 − 6[Σd² + m(m² − 1)/12] / [N(N² − 1)]
  = 1 − 6[48.5 + 2(4 − 1)/12] / [7(7² − 1)]
  = 1 − (6 × 49)/(7 × 48)
  = 1 − 0.875 = 0.125

Hence, there is a very low degree of positive correlation (probably no correlation) between
preference share prices and debenture prices.

Remarks on Spearman’s Rank Correlation Coefficient

1. We always have Σd = 0, which provides a check for numerical calculations.


2. Since Spearman’s rank correlation coefficient, ρ, is nothing but Karl Pearson’s
correlation coefficient, r, between the ranks, it can be interpreted in the same way as
the Karl Pearson’s correlation coefficient.
3. Karl Pearson's correlation coefficient assumes that the parent population from which
sample observations are drawn is normal. If this assumption is violated then we need
a measure which is distribution-free (or non-parametric). Spearman's ρ is such a
distribution-free measure, since no strict assumptions are made about the population
from which sample observations are drawn.

4. Spearman’s formula is easy to understand and apply as compared to Karl Pearson’s
formula. The values obtained by the two formulae, viz Pearsonian r and Spearman’s ρ
are generally different.
5. Spearman’s formula is the only formula to be used for finding correlation coefficient
if we are dealing with qualitative characteristics, which cannot be measured
quantitatively but can be arranged serially. It can also be used where actual data are
given.
6. Spearman’s formula has its limitations also. It is not practicable in the case of
bivariate frequency distribution. For N >30, this formula should not be used unless the
ranks are given.

Concurrent Deviation Method

This is a casual method of determining the correlation between two series when we are not
very serious about its precision.

Under this method, we compare each value of a variable with the preceding value: we put a
plus (+) sign, minus (−) sign or equality (=) sign for the deviation according as the value of
the variable is greater than, less than or equal to the preceding value, respectively.

The deviations in the values of two variables are said to be concurrent if they have the same
sign (either both deviations are positive or both are negative or both are equal).

The formula used for computing the correlation coefficient rc by this method is given by

rc = ±√[ ±(2c − N) / N ]          ............(4.9)

where:

c is the number of pairs of concurrent deviations; and

N is the number of pairs of deviations.

If (2c − N) is positive, we take the positive sign inside and outside the square root in
Eq. (4.9); and if (2c − N) is negative, we take the negative sign inside and outside the
square root.
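Eq. (4.9), including its sign convention, can be sketched as below (the function name is my own); it is checked on the supply/price data of Example 9:

```python
from math import sqrt, copysign

def concurrent_deviation_r(x, y):
    # Sign of deviation from the preceding value: +1, -1 or 0 (for "=").
    def signs(s):
        return [(v > p) - (v < p) for p, v in zip(s, s[1:])]
    sx, sy = signs(x), signs(y)
    n = len(sx)                                # number of pairs of deviations
    c = sum(a == b for a, b in zip(sx, sy))    # concurrent deviations
    inner = (2 * c - n) / n
    # Take the same sign inside and outside the square root, per Eq. (4.9).
    return copysign(sqrt(abs(inner)), inner)

supply = [112, 125, 126, 118, 118, 121, 125, 125, 131, 135]
price  = [106, 102, 102, 104, 98, 96, 97, 97, 95, 90]
rc = concurrent_deviation_r(supply, price)
print(round(rc, 2))    # -0.75
```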

Example 9

Calculate coefficient of correlation by the concurrent deviation method

Supply: 112 125 126 118 118 121 125 125 131 135

Price: 106 102 102 104 98 96 97 97 95 90

Solution:

Calculations for Coefficient of Concurrent Deviations

{Using Eq. (4.9)}

Supply (X)   Sign of deviation     Price (Y)   Sign of deviation     Concurrent
             from preceding X                  from preceding Y      deviations
112                                106
125          +                     102         −
126          +                     102         =
118          −                     104         +
118          =                     98          −
121          +                     96          −
125          +                     97          +                     + (c)
125          =                     97          =                     = (c)
131          +                     95          −
135          +                     90          −

We have

Number of pairs of deviations, N =10 – 1 = 9

c = Number of concurrent deviations

= Number of deviations having like signs

=2

Coefficient of correlation by the method of concurrent deviations is given by:

rc = ±√[ ±(2c − N)/N ] = ±√[ ±{(2 × 2) − 9}/9 ]

Since 2c − N = −5 (negative), we take the negative sign inside and outside the square root:

rc = −√[ −(−5/9) ] = −√0.5556 = −0.75

Hence there is a fairly good degree of negative correlation between supply and price.

Limitations of Correlation Analysis

1. Correlation analysis cannot determine cause-and-effect relationship.


2. Another mistake that occurs frequently is on account of misinterpretation of the
coefficient of correlation. Suppose in one case r = 0.7, it will be wrong to interpret that
correlation explains 70 percent of the total variation in Y. The error can be seen easily when
we calculate the coefficient of determination. Here, the coefficient of determination r2 will be
0.49. This means that only 49 percent of the total variation in Y is explained.
3. Another mistake in the interpretation of the coefficient of correlation occurs when one
concludes a positive or negative relationship even though the two variables are actually
unrelated.
