
Pearson's Correlation

Google Health searches for various common medical topics from 2004 to
2017.

By
F005 Mithil Dedia
F006 Ishant Deopura
F007 Parth Desai
Problem Statement
The Pearson correlation coefficient is a statistic which measures
the degree of linear correlation between two variables.
In this project, we'll use the Pearson correlation coefficient to
gauge the degree of linearity in Google health search volume.
Does the correlation between searches go up or down with time?
Karl Pearson’s Correlation Coefficient

A coefficient of correlation is generally used in statistics to quantify the relationship between two variables. It expresses, as a single value, the degree of linear relationship between two linearly related variables.

Karl Pearson’s coefficient of correlation, or PCC, is a widely used mathematical method that gives a numerical measure of the level of relation between linearly related variables. The coefficient of correlation is denoted by “r”.
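To calculate Pearson's correlation coefficient, you need two paired sets of data, X and Y, with the same number of observations. (Normality of X and Y matters for significance tests on r, not for computing r itself.) The formula for Pearson's correlation coefficient is:

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$

Alternative Formula (Covariance formula)

$$ r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}, \qquad \operatorname{cov}(X, Y) = \frac{1}{N} \sum (X - \bar{X})(Y - \bar{Y}) $$

Where:
Σ means sum of
X and Y are the variables being evaluated
X̅ and Ȳ are the mean values of variables X and Y, respectively
N is the sample size
σ_X and σ_Y are the standard deviations of X and Y, respectively.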
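The two forms of the coefficient can be checked against each other numerically. The following is a minimal Python sketch, not the R program used in this project; the function names `pearson_r` and `pearson_r_cov` and the sample values are made up for illustration, and the standard deviations are assumed to divide by N.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


def pearson_r_cov(xs, ys):
    """Same coefficient as cov(X, Y) / (sigma_X * sigma_Y), dividing by N."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


# Made-up illustrative values, not taken from the health search dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(round(pearson_r(x, y), 4))      # 0.8
print(round(pearson_r_cov(x, y), 4))  # 0.8
```

The factors of N cancel between the covariance and the two standard deviations, which is why both functions return the same value.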
Karl Pearson’s correlation coefficient has a value that falls in the range of -1 to +1.

The following chart demonstrates the correlation coefficients of various distributions:
[Chart: correlation coefficients of various distributions]

The more tightly linear two variables X and Y are, the closer PCC will be to either -1, if the relationship is negative, or +1, if the relationship is positive. Perfectly linearly uncorrelated numbers will receive a PCC of 0.
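
The reason r cannot leave the range -1 to +1 can be sketched with the Cauchy-Schwarz inequality. Writing a_i = X_i - X̅ and b_i = Y_i - Ȳ (symbols introduced here only for this illustration), and assuming neither variable is constant:

$$ \left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right) \;\Longrightarrow\; r^2 = \frac{\left(\sum_i a_i b_i\right)^2}{\sum_i a_i^2 \, \sum_i b_i^2} \le 1 \;\Longrightarrow\; -1 \le r \le +1 $$

Equality holds exactly when b_i = c a_i for some constant c, that is, when all points lie on a straight line; the sign of c gives the sign of r.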
Complete Dataset

https://www.kaggle.com/code/residentmario/pearson-s-r-with-health-searches
Cleaned Dataset
Application

We're now going to look at an application of the Pearson correlation coefficient to real data. We'll use the Google health search dataset, helpfully published on Kaggle by Google's News Lab, to do so.

The health search dataset includes an index of search volumes for various common medical topics across an assortment of areas in the United States. The data covers the period 2004 through 2017, with a different index value for every place and every year. This dataset is interesting because we naturally expect there to be a relationship between the search volume for a specific term in 2004 and in 2017, but we expect the relationship with 2017 to grow stronger the closer the comparison year gets to 2017, maxing out in 2016.

We can verify that this is indeed the case by plotting the joint distributions of these variables using an R program.
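
The same check can also be done numerically by computing r between each earlier year and 2017. Below is a rough Python sketch rather than the R program used in this project. It assumes a CSV with one row per region and one column per year and topic named like `2004+cancer`; that naming, the helper `yearly_correlations`, and the four sample rows are assumptions for illustration, and the numbers are made up rather than taken from the dataset.

```python
import csv
import io
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length, non-constant samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def yearly_correlations(csv_text, condition, reference_year):
    """Map each non-reference year to r(year column, reference-year column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ref_col = f"{reference_year}+{condition}"
    reference = [float(row[ref_col]) for row in rows]
    result = {}
    for col in rows[0]:
        # Columns are assumed to look like "<year>+<condition>".
        year, sep, cond = col.partition("+")
        if sep and cond == condition and year != str(reference_year):
            values = [float(row[col]) for row in rows]
            result[int(year)] = pearson_r(values, reference)
    return result


# Hypothetical sample in the assumed layout; values are invented.
SAMPLE = """dma,2004+cancer,2016+cancer,2017+cancer
Region A,40,62,65
Region B,55,58,60
Region C,35,70,72
Region D,60,75,74
"""

correlations = yearly_correlations(SAMPLE, "cancer", 2017)
for year in sorted(correlations):
    print(year, round(correlations[year], 2))
```

With real data, the sample string would be replaced by the contents of the downloaded file, and the column names adjusted to whatever the file actually uses.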
Analysis

Earlier, we claimed that the relationship between searches in 2016 and 2017 would be much stronger than that between 2004 and 2017.
Running the program in R gave us the Pearson correlation coefficients for both scenarios.
For the Cancer searches of 2004 and 2017, Karl Pearson’s correlation coefficient, r, was found to be 0.47. There is a moderately positive correlation between the searches of 2004 and 2017.
For the Cancer searches of 2016 and 2017, Karl Pearson’s correlation coefficient, r, was found to be 0.89. There is a strongly positive correlation between the searches of 2016 and 2017.
This means that, of the two years compared, the searches of 2004 were less similar to those of 2017, whereas the searches of 2016 were more similar to those of 2017.

The second distribution (2016 and 2017) is considerably closer to being linear, that is, to having perfect correlation, than the first distribution (2004 and 2017).
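
One common way to read the gap between the two coefficients is through r squared, which in a straight-line fit gives the share of variance in one variable accounted for by the other. This is an added illustration based on the two reported values, not part of the R analysis:

$$ r_{2004,2017}^2 = 0.47^2 \approx 0.22, \qquad r_{2016,2017}^2 = 0.89^2 \approx 0.79 $$

On this reading, 2016 search volumes account for roughly 79% of the variation in 2017 volumes, against roughly 22% for 2004.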
Conclusion

In this project we introduced ourselves to Karl Pearson’s Correlation Coefficient.

We made a claim that similarities between Google health searches are higher when the years are closer together.

Our calculations support this claim: the Pearson correlation coefficient (within a certain amount of randomness) does go up over time!

We have successfully used R programming to analyze and answer the question, “Does the correlation between searches go up or down with time?”
