Pearson's Correlation
Google Health searches for various common medical topics from 2004 to
2017.
By
F005 Mithil Dedia
F006 Ishant Deopura
F007 Parth Desai
Problem Statement
The Pearson correlation coefficient is a statistic which measures
the degree of linear correlation between two variables.
In this project, we'll use the Pearson correlation coefficient to
gauge the degree of linearity in Google health search volume.
Does the correlation between searches go up or down with time?
Karl Pearson’s Correlation Coefficient
r = Σ (X − X̅)(Y − Ȳ) / (N · σX · σY)
Where:
Σ means "sum of"
X and Y are the variables being evaluated
X̅ and Ȳ are the mean values of variables X and Y, respectively
N is the sample size
σX and σY are the standard deviations of X and Y, respectively.
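As a minimal sketch, the definitions above can be turned into a few lines of Python. It assumes the population form of the standard deviation (dividing by N), so that r = Σ(X − X̅)(Y − Ȳ) / (N · σX · σY). The function name `pearson_r` is chosen here for illustration and is not taken from the original analysis, which was run in R.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r = sum((x - mean_x) * (y - mean_y)) / (N * sigma_x * sigma_y).

    Assumption: sigma_x and sigma_y are population standard deviations
    (the squared deviations are divided by N, not N - 1).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from the means.
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sum_products / (n * sigma_x * sigma_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

The check for a zero standard deviation matters in practice: if every value of one variable is identical, the denominator is zero and the coefficient is undefined.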
Karl Pearson’s correlation coefficient has a value that falls in the range of -1 to +1.
The more tightly linear two variables X and Y are, the closer PCC will be to either -1, if the
relationship is negative, or +1, if the relationship is positive. Perfectly linearly
uncorrelated numbers will receive a PCC of 0.
The following chart demonstrates the correlation coefficients of various distributions:
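The three cases described above (PCC near +1, near -1, and exactly 0) can be illustrated with small hand-made samples. This is a sketch only: the data are invented for demonstration, and the helper `pearson_r` is a hypothetical name. It divides by the square root of the product of the summed squared deviations, which is algebraically the same as dividing by N · σX · σY.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r; the denominator equals N * sigma_x * sigma_y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
rising = [3 * v + 1 for v in x]    # exact increasing line
falling = [-2 * v + 5 for v in x]  # exact decreasing line
parabola = [v * v for v in x]      # clear pattern, but not a linear one

r_rising = pearson_r(x, rising)      # very close to +1
r_falling = pearson_r(x, falling)    # very close to -1
r_parabola = pearson_r(x, parabola)  # 0: no linear trend at all

print(r_rising, r_falling, r_parabola)
```

The parabola case is worth noting: the two variables are perfectly related, yet the coefficient is 0, because PCC only measures how well a straight line describes the relationship.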
Complete Dataset
https://www.kaggle.com/code/residentmario/pearson-s-r-with-health-searches
Cleaned Dataset
Application
We're now going to look at an application of the Pearson correlation coefficient to real data. We'll use the Google
health search dataset, helpfully published on Kaggle by Google's News Lab, to do so.
The health search dataset includes an index of search volume for various common medical topics across a range of
areas in the United States. The data covers the period 2004 through 2017, with a different index
value for every place and every year. This dataset is interesting because we naturally expect there to be a relationship
between the search volume for a specific term in 2004 and 2017, but we expect this relationship to be much stronger
between 2016 and 2017.
We can verify that this is indeed the case by plotting the joint distributions of these variables using an R program.
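The comparison described above can be sketched in a few lines. The original analysis was run in R; this Python version is only an illustration of the idea. The index values below are made up for five hypothetical areas and are not taken from the Google dataset, and the variable names (`cancer_2004`, `cancer_2016`, `cancer_2017`) are assumptions introduced for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r; the denominator equals N * sigma_x * sigma_y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented search-index values, one entry per hypothetical area.
cancer_2004 = [30, 55, 40, 35, 60]
cancer_2016 = [48, 61, 69, 82, 88]
cancer_2017 = [50, 60, 70, 80, 90]

# Correlate each earlier year against 2017, area by area.
r_2004_2017 = pearson_r(cancer_2004, cancer_2017)
r_2016_2017 = pearson_r(cancer_2016, cancer_2017)

print(f"2004 vs 2017: r = {r_2004_2017:.2f}")
print(f"2016 vs 2017: r = {r_2016_2017:.2f}")
```

With the real data, each list would hold one year's index values for the same topic, in the same order of areas, so that every pair of numbers refers to one place.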
Analysis
Earlier, we claimed that the relationship between searches in 2016 and 2017
would be much stronger than that between 2004 and 2017.
By running the program in R, we obtained the Pearson correlation coefficients for both
scenarios.
For the Cancer searches of 2004 and 2017, Karl Pearson’s correlation
coefficient, r, was found to be 0.47. This is a moderate positive correlation
between the searches of 2004 and 2017.
For the Cancer searches of 2016 and 2017, Karl Pearson’s correlation
coefficient, r, was found to be 0.89. This is a strong positive correlation
between the searches of 2016 and 2017.
This means that the searches of 2004 were less similar to those of 2017, whereas
the searches of 2016 were more similar to those of 2017.
The second distribution, i.e., 2016 and 2017, is much closer to being linear, or
having perfect correlation, than the first distribution, i.e., 2004 and 2017.
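The verbal labels used in this analysis ("moderate", "strong") can be made explicit with a small helper. This is a sketch under stated assumptions: the cutoffs of 0.3 and 0.7 are a common rule of thumb, not values given in this project, and the function name `describe_strength` is hypothetical.

```python
def describe_strength(r, moderate_cutoff=0.3, strong_cutoff=0.7):
    """Turn a Pearson coefficient into a rough verbal label.

    Assumption: the cutoffs are conventional rules of thumb; other
    sources draw the lines differently.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson coefficient must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= strong_cutoff:
        strength = "strong"
    elif size >= moderate_cutoff:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"

print(describe_strength(0.47))  # moderate positive correlation
print(describe_strength(0.89))  # strong positive correlation
```

Under these assumed cutoffs, the two coefficients reported above fall into the same categories that the analysis assigns to them.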
Conclusion
As our calculations show, this claim is correct: the Pearson correlation
coefficient with the 2017 searches (within a certain amount of randomness) does go up
as the year being compared gets closer to 2017!