The Significance of Correlation

Introduction
Two variables [bivariate] or more than two variables [multivariate] can be studied simultaneously in an attempt to find the relationship among them in quantitative or qualitative form. In reality, we have many such related variables, such as crop yield per acre and fertilizer applied, height and weight, birth and death rates, blood pressure readings based on two different methods, age of elephants and annual maintenance cost, quantum of pesticides applied and intensity of food poisoning, dietary component and plasma lipid
level, size of crops and percentage of worms, age and blood pressure, and antibiotics and bacteria. The methodology for studying the strength of relationship among variables was given by Sir Francis Galton and Karl Pearson.
Correlation
It is a statistical measure used to evaluate the strength and degree of relationship between two or more variables under study. Here the term 'relationship' refers to the tendency of the variables to move together. The movement of the variables may be in the same or in the opposite direction. The correlation is said to be positive if the variables move in the same direction, and negative if they move in opposite directions. If the movement of one variable is not accompanied by any systematic movement in the other, the variables are not related.
It is classified into 1. simple correlation, 2. rank correlation and 3. group correlation.
Simple Correlation/Correlation
This measure can be evaluated for a discrete series that is quantitative in nature. It is denoted by r. The value of r lies in the closed interval [-1, 1]. If the value of r is towards +1, the variables are said to be positively correlated or directly related [if X increases, Y also increases, and if X decreases, Y also decreases]. If it is towards -1, they are said to be negatively correlated or inversely related [if X increases, Y decreases, and if X decreases, Y increases], and if it is 0, the variables are said to be uncorrelated [a change in X does not affect the variable Y, and vice versa].
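As an informal illustration of these three cases [not part of the original text], the short Python sketch below uses NumPy's corrcoef to compute r for a directly related pair, an inversely related pair, and an unrelated pair; all data values are invented purely for demonstration.

```python
import numpy as np

# Hypothetical data, chosen only to illustrate the three cases.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)

y_direct  = 2 * x + 1                 # moves with x     -> r = +1 (perfect linear fit)
y_inverse = 20 - 3 * x                # moves against x  -> r = -1 (perfect linear fit)
rng = np.random.default_rng(0)
y_none    = rng.normal(size=x.size)   # unrelated to x   -> r near 0

for label, y in [("direct", y_direct), ("inverse", y_inverse), ("none", y_none)]:
    r = np.corrcoef(x, y)[0, 1]       # Pearson's r
    print(f"{label:8s} r = {r:+.3f}")
```

The first two series are exactly linear in x, so they also illustrate the note below: when every point lies on a straight line, r reaches +1 or -1.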
Rank Correlation
This measure can be evaluated for a discrete series that is qualitative in nature. It is denoted by R. The value of R lies in the closed interval [-1, 1].
Group Correlation
This measure can be evaluated for a continuous series of grouped data. It is also denoted by r. The value of r lies in the closed interval [-1, 1].
Note: The larger the absolute value of r, the stronger the linear relationship between Y and X. If r = -1 or r = +1, the regression line passes through all the data points and the line is a perfect fit.
Assumptions for Karl Pearson’s Coefficient of Correlation
1. The relationship between the two series [X and Y] is linear [the amount of variation in X bears a
constant ratio to the corresponding amount of variation in Y].

2. Either one of the series is dependent on the other, or both are dependent on a third series.
3. Correlation analysis is applied to most scientific data where inferences are to be made: in agriculture, the amount of fertilizer and the crop yield; in economics, prices and demand, or money supply and prices; in medicine, cigarette use and the incidence of lung cancer, or use of a new drug and the percentage of cases cured; in sociology, unemployment and crime, or welfare expenditure and labour efficiency; in demography, wealth and fertility; and so on.
4. The correlation coefficient r, like other statistics of the sample, is tested to see how far the sample results may be generalized to the parent population.

Properties of Correlation
1. It is independent of any change of origin of reference and of the units of measurement [illustrated in the sketch below].
2. Its value lies in the interval [-1, 1].
3. It is a single constant value that measures the relationship between the two variables.
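Property 1 can be checked numerically: shifting the origin and changing the units of measurement leaves r unchanged. The sketch below is a minimal illustration with made-up values; the particular shifts and scale factors are arbitrary.

```python
import numpy as np

x = np.array([12., 15., 18., 21., 27., 30.])
y = np.array([40., 44., 43., 50., 55., 60.])

r_original = np.corrcoef(x, y)[0, 1]

# Change of origin (subtract constants) and of units (divide/multiply by positive constants)
x_new = (x - 20.0) / 3.0      # x expressed as deviations from 20, in steps of 3
y_new = (y - 45.0) * 10.0     # y shifted and rescaled

r_transformed = np.corrcoef(x_new, y_new)[0, 1]
print(np.isclose(r_original, r_transformed))   # True: r is unaffected
```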

Scatter Diagram
The scatter diagram is a very valuable graphic device for showing the existence of correlation between two variables. Represent the variable X on the x-axis and Y on the y-axis, and mark the coordinate points [x, y]; the existence of correlation can then be studied from the way the points cluster. The direction of the scatter reveals the nature and strength of the correlation between the variables.
The scatter diagrams for r = +1 and 0 < r < 1 show that the path is linear and the variables move in the same direction. This indicates that the correlation is positive [the relationship between the variables is direct].
The scatter diagrams for r = -1 and -1 < r < 0 indicate that the variables move in opposite directions and the path is linear.
The scatter diagram for r = 0 indicates that the variables have no linear relationship; the points do not follow any linear path.
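Scatter diagrams of these typical patterns are easy to produce with standard plotting tools; the sketch below [matplotlib and NumPy, with invented data] draws the three cases described above and annotates each panel with its value of r.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)

# Hypothetical Y series illustrating positive, negative, and no correlation
cases = {
    "positive (0 < r < 1)":  2 * x + rng.normal(scale=2, size=x.size),
    "negative (-1 < r < 0)": -2 * x + rng.normal(scale=2, size=x.size),
    "uncorrelated (r ~ 0)":  rng.normal(scale=5, size=x.size),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, (title, y) in zip(axes, cases.items()):
    ax.scatter(x, y, s=12)
    ax.set_title(f"{title}, r = {np.corrcoef(x, y)[0, 1]:+.2f}")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
fig.tight_layout()
plt.show()
```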

Karl Pearson’s Coefficient of Correlation


Consider the pairs of values [X1, Y1], [X2, Y2], …, [Xn, Yn] of the variables X and Y. Then the covariance of the two variables X and Y can be defined as

Cov(X, Y) = (1/n) Σ (Xi − X̄)(Yi − Ȳ)
The standard deviations of X and Y are given by

σX = √[(1/n) Σ (Xi − X̄)²],  σY = √[(1/n) Σ (Yi − Ȳ)²]
The correlation coefficient r can then be defined as

r = Cov(X, Y) / (σX σY)
Equivalent alternate formulae for r are

r = Σ (Xi − X̄)(Yi − Ȳ) / √[Σ (Xi − X̄)² · Σ (Yi − Ȳ)²]
  = [n Σ XiYi − (Σ Xi)(Σ Yi)] / √{[n Σ Xi² − (Σ Xi)²][n Σ Yi² − (Σ Yi)²]}
Value of r using an assumed mean
To derive the result, we use the fact that the correlation coefficient is independent of the choice of origin. Take Ui = Xi − a and Vi = Yi − b, where a is any one value of X and b is any one value of Y. Then

r = [n Σ UiVi − (Σ Ui)(Σ Vi)] / √{[n Σ Ui² − (Σ Ui)²][n Σ Vi² − (Σ Vi)²]}
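To make these formulae concrete, the sketch below [plain NumPy; the data values and the assumed means a and b are hypothetical] computes r from the covariance/standard-deviation definition, from the product-moment form, and from the assumed-mean deviations, and checks that all three agree with NumPy's own corrcoef.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([65., 66., 67., 67., 68., 69., 70., 72.])
y = np.array([67., 68., 65., 68., 72., 72., 69., 71.])
n = x.size

# 1. Definition: r = Cov(X, Y) / (sigma_X * sigma_Y), with 1/n divisors
cov_xy  = np.mean((x - x.mean()) * (y - y.mean()))
sigma_x = np.sqrt(np.mean((x - x.mean()) ** 2))
sigma_y = np.sqrt(np.mean((y - y.mean()) ** 2))
r_def = cov_xy / (sigma_x * sigma_y)

# 2. Product-moment (computational) form
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
              (n * np.sum(y ** 2) - np.sum(y) ** 2))
r_alt = num / den

# 3. Assumed-mean shortcut: same formula applied to U = X - a, V = Y - b
a, b = 68.0, 69.0                      # any convenient values of X and Y
u, v = x - a, y - b
num_uv = n * np.sum(u * v) - np.sum(u) * np.sum(v)
den_uv = np.sqrt((n * np.sum(u ** 2) - np.sum(u) ** 2) *
                 (n * np.sum(v ** 2) - np.sum(v) ** 2))
r_assumed = num_uv / den_uv

print(r_def, r_alt, r_assumed, np.corrcoef(x, y)[0, 1])   # all four values agree
```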

Coefficient of Correlation for Grouped Data


For grouped data, the information is given in a correlation table. In each compartment [cell] of the table, the deviation of x from the average of x and the deviation of y from the average of y for that compartment are multiplied and written within brackets. This product is further multiplied by the frequency of that cell. Adding all such values leads to Σ f (x − x̄)(y − ȳ), from which

r = Σ f (x − x̄)(y − ȳ) / √[Σ f (x − x̄)² · Σ f (y − ȳ)²]
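As a rough illustration of this computation [the class mid-points and cell frequencies below are invented], the sketch evaluates r directly from a small correlation table and then cross-checks it by expanding the table back into individual observations.

```python
import numpy as np

# Hypothetical correlation table: class mid-points and cell frequencies f[i, j]
x_mid = np.array([10., 20., 30., 40.])        # mid-points of the X classes
y_mid = np.array([5., 15., 25.])              # mid-points of the Y classes
f = np.array([[4, 2, 0],                      # f[i, j] = frequency of (x_mid[i], y_mid[j])
              [3, 6, 2],
              [1, 5, 4],
              [0, 2, 5]], dtype=float)

N = f.sum()
X, Y = np.meshgrid(x_mid, y_mid, indexing="ij")

x_bar = (f * X).sum() / N
y_bar = (f * Y).sum() / N

# Sum over cells of f * (x - x_bar)(y - y_bar), as described in the text
s_xy = (f * (X - x_bar) * (Y - y_bar)).sum()
s_xx = (f * (X - x_bar) ** 2).sum()
s_yy = (f * (Y - y_bar) ** 2).sum()
r_grouped = s_xy / np.sqrt(s_xx * s_yy)

# Cross-check: expand the table into raw pairs and use np.corrcoef
xs = np.repeat(X.ravel(), f.ravel().astype(int))
ys = np.repeat(Y.ravel(), f.ravel().astype(int))
print(np.isclose(r_grouped, np.corrcoef(xs, ys)[0, 1]))   # True
```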
Probable Error of the Coefficient of Correlation
Normally, we use sample data to evaluate the correlation coefficient. So, whenever the result is interpreted, it is necessary to check how reliably the evaluated sample correlation represents the population coefficient. This is determined by the probable error, which is evaluated using the result

Probable error = 0.6745 × [standard error of r],  where the standard error of r = (1 − r²)/√n

Here r is the correlation coefficient and n is the number of pairs of items. The interpretation is that if the P.E. of r = ±a, where 'a' is a constant, then the correlation of the population lies approximately in the range [r − a, r + a]. This probable-error calculation can be used only when the whole data are normal or nearly normal, and the sample must be selected without bias. In relation to the probable error, the significance of the coefficient of correlation may be judged as follows: the coefficient of correlation is significant if it is more than six times the probable error, or when the probable error is not large and r exceeds 0.5; it is not significant at all if it is less than the probable error.
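A small sketch of this rule [the values of r and n are hypothetical]: compute the probable error from r and n, state the approximate range for the population correlation, and apply the "six times the probable error" test described above.

```python
import math

def probable_error(r: float, n: int) -> float:
    """Probable error of r: 0.6745 * (1 - r**2) / sqrt(n)."""
    return 0.6745 * (1.0 - r ** 2) / math.sqrt(n)

r, n = 0.80, 25            # hypothetical sample correlation and number of pairs
pe = probable_error(r, n)

print(f"P.E. = {pe:.4f}")
print(f"population correlation likely in [{r - pe:.3f}, {r + pe:.3f}]")
print("significant" if r > 6 * pe else "not clearly significant")
```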

Rank Correlation
Pearson's correlation coefficient r gives a numerical measure of the degree of relationship existing between the two variables X and Y. However, it requires the joint distribution of X and Y to be normal. These requirements can be overcome by the rank correlation coefficient, which is based on the ranking of the variates. It was introduced by Charles Edward Spearman in 1904. It helps in dealing with qualitative characteristics such as beauty and intelligence, and it is most suitable when the variables can be arranged in order of merit. It is denoted by R and is given by

R = 1 − [6 Σ di²] / [n(n² − 1)]

where di is the difference between the ranks assigned to the i-th pair and n is the number of pairs.
Note for repeated ranks
The above formula holds good if the ranks are not repeated. For repeated ranks, say if a rank is repeated m times, the value m(m² − 1)/12 should be added to Σ di². This correction must be applied for each group of repeated ranks.
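A minimal sketch of Spearman's R for untied ranks [the judges' scores are invented, and SciPy is used only as a cross-check]: the rank differences di are computed and substituted into the formula above.

```python
import numpy as np
from scipy import stats

# Hypothetical scores given by two judges to 8 candidates (no tied values)
judge_a = np.array([86, 74, 91, 62, 55, 70, 83, 68])
judge_b = np.array([80, 72, 88, 65, 50, 73, 78, 60])

def spearman_R(x, y):
    """R = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when no rank is repeated.
    For repeated ranks, the text's correction m(m^2 - 1)/12 would be added to sum(d_i^2)
    for each group of m tied values."""
    rx = stats.rankdata(x)          # ranks of x (1 = smallest)
    ry = stats.rankdata(y)
    d = rx - ry
    n = len(x)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(spearman_R(judge_a, judge_b))
print(stats.spearmanr(judge_a, judge_b).correlation)   # agrees when there are no ties
```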
Merits of rank correlation coefficient
1. It is simple to understand and easy to evaluate.
2. It is very useful for qualitative data.
3. It can also be evaluated for quantitative data.

REFERENCE:
Biostatistics: An Introduction by Dr P. Mariappan.
