Hypothesis Testing

Leveraging customer information is of paramount importance for most businesses. Imagine
that you are an insurance analyst (actuary) who needs to assess the insurability or the risks
of your customers. Part of your job is to look at customer attributes such as age, sex, BMI,
smoker/non-smoker status, location, etc., and to use them in your decision-making process on
whether to approve or deny their claims.

This lab will guide you through the series of steps in hypothesis testing to help you decide,
using statistical evidence, whether an attribute or a combination of attributes has an effect
on the insurance claims. This lab only introduces the initial steps of the decision-making
process, before building the prediction models and classifiers.

Objectives
After completing this lab you will be able to:

Understand the elements of hypothesis testing


Choose a sample statistic
Define the hypotheses (null and alternative)
Set the decision criteria
Evaluate and interpret the results

Setup
For this lab, we will be using the following libraries:

pandas for managing the data.
numpy for mathematical operations.
seaborn for visualizing the data.
matplotlib for visualizing the data.
scipy.stats for statistical analysis.
statsmodels for statistical analysis.

1. T-Test:

When to use: Use a t-test when you want to compare the means of two groups.
Example 1: We use a t-test to compare the mean BMI of males and the mean BMI of females.
Example 2: We use a t-test to compare the average charges of smokers and non-smokers.
2. ANOVA

When to use: Use ANOVA when you have three or more independent groups and
want to determine if there are significant differences in the means of these groups.
Example 3: We use ANOVA to compare the mean BMI of women with no children, one child,
and two children; ANOVA can help determine if there is a statistically significant
difference in BMI across these groups.
3. Chi-Squared Test:

When to use: Use a chi-squared test when you have categorical variables and want
to test if there is a significant association between them.
Example 4: We use a chi-squared test to determine if the proportion of smokers is
significantly different across the different regions.
Exercise 5: We use a chi-squared test to examine the association between sex and
smoking status.
(A short sketch of the library calls behind these three tests follows this list.)
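As referenced above, here is a minimal sketch (not part of the original lab) showing which library call corresponds to each of the three tests, using small synthetic samples. The variable names are hypothetical and only illustrate the call signatures; the examples that follow apply the same functions to the insurance data.

In [ ]: # Illustrative only: map each test to its library call on synthetic data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(30, 6, size=100)      # e.g., BMI-like values for group 1
group2 = rng.normal(31, 6, size=100)      # BMI-like values for group 2
group3 = rng.normal(30.5, 6, size=100)    # BMI-like values for group 3

# 1. T-test: compare the means of two groups
t_stat, p_t = stats.ttest_ind(group1, group2)

# 2. ANOVA: compare the means of three or more groups
f_stat, p_f = stats.f_oneway(group1, group2, group3)

# 3. Chi-squared test: association between two categorical variables
sex = rng.choice(["male", "female"], size=300)
smoker = rng.choice(["yes", "no"], size=300)
chi2_stat, p_c, dof, expected = stats.chi2_contingency(pd.crosstab(sex, smoker))

print(p_t, p_f, p_c)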

Installing Required Libraries


The following required modules are pre-installed in the Skills Network Labs environment.
However, if you run this notebook's commands in a different Jupyter environment (e.g., Watson
Studio or Anaconda), you will need to install these libraries by removing the # sign before
!mamba in the code cell below.

In [1]: # All libraries required for this lab are listed below. The libraries are pre-installed in the lab environment.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 s
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [2]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as stats
from scipy.stats import chi2_contingency
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

Reading and understanding our data

For this lab, we will be using the insurance.csv file, hosted on IBM Cloud Object Storage.

This dataset contains information about age, sex, BMI, the number of children, whether the
client is a smoker or non-smoker, the region where the client lives, and the charges to their
insurance company.

Let's read the data into a pandas DataFrame and look at the first 5 rows using the head()
method.

In [3]: data = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cl


data.head()

Out[3]: age sex bmi children smoker region charges

0 19 female 27.900 0 yes southwest 16884.92400

1 18 male 33.770 1 no southeast 1725.55230

2 28 male 33.000 3 no southeast 4449.46200

3 33 male 22.705 0 no northwest 21984.47061

4 32 male 28.880 0 no northwest 3866.85520

Using the info() function, we will take a look at the data types.

In [4]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

The describe() function provides summary statistics for the numeric variables.

In [5]: data.describe().T
Out[5]: count mean std min 25% 50% 75%

age 1338.0 39.207025 14.049960 18.0000 27.00000 39.000 51.000000

bmi 1338.0 30.663397 6.098187 15.9600 26.29625 30.400 34.693750

children 1338.0 1.094918 1.205493 0.0000 0.00000 1.000 2.000000

charges 1338.0 13270.422265 12110.011237 1121.8739 4740.28715 9382.033 16639.912515

According to the brief preview of our data, we will define the 'charges' to be our response
variable, and 'age', 'sex', 'bmi', 'children', 'smoker', and 'region' to be our predictor variables.
In this lab, we will test how our predictor variables influence the insurance 'charges'.

Steps in Hypothesis Testing

Example 1
In this first example, we will show how to test, with statistical evidence, whether
the BMI of females is different from that of males.

1. Choose a sample statistic

The first step in hypothesis testing is to choose a sample test statistic. Hypothesis testing
allows us to check the sample statistic against a statistic of another sample or population.
Let $\mu_{1}$ be the population mean BMI of males and $\mu_{2}$ be the population mean BMI
of females. We will compare these mean values, $\mu_{1}$ and $\mu_{2}$, statistically.

2. Define hypothesis (Null and Alternative)

The next step is to define the hypotheses to be tested. Two hypotheses are defined: the null
hypothesis and the alternative hypothesis. The null hypothesis is a statistical hypothesis which
assumes that the difference in observations is due to a random factor. It is denoted by $H_{0}$.
The alternative hypothesis is the opposite of the null hypothesis: it assumes that the difference
in observations is the result of a real effect. The alternative hypothesis is denoted by $H_{A}$.

$ H_{0}: \mu_{1}-\mu_{2} = 0 $ : There is no difference between the BMI of males and the BMI of
females.
$ H_{A}: \mu_{1}-\mu_{2} \neq 0 $ : There is a difference between the BMI of males and the BMI of
females.

The $\neq$ sign in the alternative hypothesis indicates that this is a two-tailed test.
3. Set the decision criteria

To set the criteria for a decision, we state the level of significance for the test. It could be 5%,
1%, or 0.5%. Based on the level of significance, we decide whether to reject the null
hypothesis in favour of the alternative, or fail to reject it.

The diagram above describes the principles of hypothesis testing. We will choose a 5%
significance level, so $ \alpha=0.05 $. Since we have a two-tailed test, the rejection region is
split between the two tails, with $\alpha/2 = 0.025$ in each. The two-tailed p-value returned by
the test is compared directly with alpha: if the calculated p-value is less than alpha, we reject
the null hypothesis. The significance level is based on the business requirements. If
you would like to learn more about statistical significance, please visit the corresponding wikipedia article.
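As a quick illustration of this decision rule (a sketch, not part of the original lab), the cell below computes the two-tailed critical t-values with scipy.stats.t.ppf, assuming scipy.stats has been imported above as stats; the degrees of freedom and p-value used here are hypothetical.

In [ ]: # Sketch of the two-tailed decision rule at alpha = 0.05.
alpha = 0.05
df = 1336                                   # hypothetical degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)     # critical value for each tail
print("rejection region: t < {:.3f} or t > {:.3f}".format(-t_crit, t_crit))

p_value = 0.09                              # hypothetical two-tailed p-value
print("reject H0" if p_value < alpha else "fail to reject H0")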

In this lab, we will use one of the t-test, z-score, f-score or chi-squared statistics to evaluate
our results.

A t-test is used for testing the mean of one population against a standard, or comparing the
means of two populations, if you do not know the standard deviation of the population and
when you have a limited sample (n < 30). If you know the standard deviation of the
populations, you may use a z-test.
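For reference, the pooled two-sample t-statistic (the default computed by scipy's ttest_ind when equal variances are assumed) is

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} $$

where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes.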

A z-test is used for testing the mean of a population versus a standard, or comparing the
means of two populations, with large (n ≥ 30) samples, whether you know the population
standard deviation or not. It is also used for testing the proportion of some characteristic
versus a standard proportion, or comparing the proportions of two populations.

An f-test is used to compare the variances of two populations. The samples can be any size.
It is the basis of ANOVA.

A chi-squared test is used to determine whether there is a statistically significant difference
between the expected and the observed frequencies in one or more categories of a
contingency table. A contingency table is a tabular representation of categorical data: it
shows the frequency distribution of the variables.

To learn more about t-test, z-score, f-score or chi-squared statistics and contingency tables,
please visit their corresponding wikipedia links.
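Since the z-test is not demonstrated elsewhere in this lab, here is a minimal sketch (on synthetic samples, not the insurance data) using the ztest function from statsmodels.stats.weightstats:

In [ ]: # Minimal two-sample z-test sketch on synthetic data.
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(1)
sample1 = rng.normal(30, 6, size=500)    # hypothetical large sample 1
sample2 = rng.normal(31, 6, size=500)    # hypothetical large sample 2

# Two-sided test of the difference in means (H0: mean1 - mean2 = 0)
z_stat, p_value = ztest(sample1, sample2, value=0)
print("z =", z_stat, ", p =", p_value)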

4. Evaluate and interpret the result


First, let's get all observations for females and males by using the loc indexer.

In [6]: female=data.loc[data.sex=="female"]
male=data.loc[data.sex=="male"]

Now, let's select the bmi values for females and males.

In [7]: f_bmi = female.bmi


m_bmi = male.bmi

Now, we will plot the distribution of 'bmi' values for females and males using seaborn's
distplot() function.

In [8]: sns.distplot(f_bmi,color='green',hist=False)
sns.distplot(m_bmi,color='red',hist=False)

Out[8]: <AxesSubplot:xlabel='bmi'>
From the graph, we already see that the two distributions are very similar.

Now, let's calculate the mean values for females and males bmi.

In [9]: female.bmi.mean()

Out[9]: 30.377749244713023

In [10]: male.bmi.mean()

Out[10]: 30.943128698224832

Next, we will obtain our statistics, t-value and p-value. We will use scipy.stats library and
ttest_ind() function to calculate these parameters.

In [11]: alpha=0.05
t_value1, p_value1 = stats.ttest_ind(m_bmi, f_bmi)
print("t_value1 = ",t_value1, ", p_value1 = ", p_value1)

t_value1 = 1.696752635752224 , p_value1 = 0.08997637178984932

Next, although optional, it is useful to print 'if/else' statements to make our conclusions
about the hypothesis.

In [12]: if p_value1 < alpha:
    print("Conclusion: since p_value {} is less than alpha {} ". format (p_value1,a
    print("Reject the null hypothesis that there is no difference between bmi of fe
    print("Verdict: There is a difference b/w female and male BMI")
else:
    print("Conclusion: since p_value {} is greater than alpha {} ". format (p_value
    print("Fail to reject the null hypothesis that there is no difference between b
    print("Verdict: There is no difference b/w female and male BMI")

Conclusion: since p_value 0.08997637178984932 is greater than alpha 0.05
Fail to reject the null hypothesis that there is no difference between bmi of females and bmi of males.
Verdict: There is no difference b/w female and male BMI

Conclusion: We fail to reject the null hypothesis and conclude that there is no statistically
significant difference between the female and male BMI.

Example 2
In this example, we would like to test, with statistical evidence, whether the medical claims
made by people who smoke are greater than those made by people who don't.

We will compare the mean charges ($\mu$) of the population of people who smoke and of those
who do not smoke. First, we need to calculate the mean values for the smoking and non-smoking
populations.

In [13]: smoker = data.loc[data.smoker=="yes"]


smoker_char = smoker.charges
sch_mean = smoker_char.mean()
sch_mean

Out[13]: 32050.23183153285

In [14]: nonsmoker = data.loc[data.smoker=="no"]


nonsmoker_char = nonsmoker.charges

Exercise 1
Calculate population mean of the nonsmokers.

In [29]: # Enter your code below and run the cell


nsch_mean = nonsmoker_char.mean()
nsch_mean

Out[29]: 8434.268297856199

Solution (Click Here)

Now, let's define our null and alternative hypotheses.

$ H_{0}: \mu_{1} \leq \mu_{2} $ : The average charges of smokers are less than or equal to
those of nonsmokers.
$ H_{A}: \mu_{1} > \mu_{2} $ : The average charges of smokers are greater than those of
nonsmokers.

The '>' sign in the alternative hypothesis indicates that the test is right-tailed. To compare the
mean charges of the smoking and nonsmoking populations, we will use a t-test. If the
t-statistic falls into the rejection region on the right side of the distribution curve, we
reject the null hypothesis.
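As a side note (not part of the original lab), SciPy versions 1.6 and later also let you request the right-tailed p-value directly via the alternative argument of ttest_ind, instead of halving the two-tailed p-value as is done later in this example. This sketch reuses the smoker_char and nonsmoker_char series defined above:

In [ ]: # Right-tailed test done directly (requires scipy >= 1.6).
# Equivalent to halving the two-tailed p-value when the t-statistic is positive.
t_val_right, p_right = stats.ttest_ind(smoker_char, nonsmoker_char, alternative='greater')
print("t =", t_val_right, ", one-tailed p =", p_right)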

Now, let's plot our smoking versus nonsmoking populations by using seaborn boxplot()
function. It is always useful to have a visual representation of the data that we are working
with.

In [16]: sns.boxplot(x=data.charges,y=data.smoker,data=data).set(title="Fig:1 Smoker vs Char

Out[16]: [Text(0.5, 1.0, 'Fig:1 Smoker vs Charges')]

Now, we will calculate t-value and p-value of charges for smoking and nonsmoking
populations.

In [17]: alpha=0.05
t_val2, p_value2 = stats.ttest_ind(smoker_char, nonsmoker_char)
p_value_onetail=p_value2/2
print("t_value = {} , p_value ={} , p_value_onetail = {}".format(t_val2, p_value2,

t_value = 46.66492117272371 , p_value =8.271435842179102e-283 , p_value_onetail = 4.135717921089551e-283

Exercise 2
Use print() function to state your conclusions based on the calculated statistics. What are
the conclusions?

In [37]: # Enter your code and run the cell
if p_value_onetail < alpha:
    print("Conclusion: The p_value {} is less than alpha value {}".format(p_value_
    print("We Successfully reject the null hypothesis that average charges of smoke
    print("Verdict: We consider alternate hypothesis and conclude that \nThe averag
else:
    print("Conclusion:Since p value {} is greater than alpha {} ".format (p_value_o
    print("Failed to reject null hypothesis that average charges for smokers are le
    print("Verdict: We consider null hypothesis and conclude that \n The average ch

Conclusion: The p_value 4.135717921089551e-283 is less than alpha value 0.05
We successfully reject the null hypothesis that the average charges of smokers are less than or equal to those of nonsmokers.
Verdict: We consider the alternative hypothesis and conclude that the average charges of smokers are greater than those of nonsmokers.

Solution (Click Here)

Example 3
In this example, we will use statistical evidence to compare the BMI of women with no
children, one child, and two children.

Now, let's define our null and alternative hypotheses.

$ H_{0}: \mu_{1}=\mu_{2}=\mu_{3} $ : The mean BMI of women with no children, one child,
and two children is the same.
$ H_{A} $: At least one of the mean BMI values is different.

For this example, we will use a one-way ANOVA and its F-statistic to test whether the mean
BMI differs across these three groups. We will set the alpha value to 0.05.

First, we need to filter data for women with 0, 1 and 2 children.

In [40]: female_children = female.loc[female['children']<=2]


We will use the groupby() function to group the data by the number of children and look at
the mean bmi value for each group.

In [20]: female_children.groupby([female_children.children]).mean().bmi

Out[20]: children
0 30.361522
1 30.052658
2 30.649790
Name: bmi, dtype: float64

Exercise 3
In this exercise, draw the boxplots to visualize the difference in bmi values between these 3
groups.

In [46]: # Enter your code and run the cell


sns.boxplot(x="children" , y="bmi", data=female_children).set(title="BMI of Women w

Out[46]: [Text(0.5, 1.0, 'BMI of Women with 0,1 and 2 children')]

In [47]: # Plot can also be displayed by the code below


# sns.boxplot(x=female_children.children, y=female_children.bmi)
# plt.grid()
# plt.show()
Solution (Click Here)

Now, we will construct the ANOVA table and check each group (0, 1, and 2 children)
against the bmi values. We will use the ols (ordinary least squares) model to estimate
the unknown parameters.

OLS, or Ordinary Least Squares, is a method in statistics used to find the best-fitting line in
linear regression. It minimizes the sum of squared differences between observed and
predicted values. OLS is used to estimate relationships between variables and make
predictions in fields like economics, finance, and social sciences.

1. formula = 'bmi ~ C(children)' : This line defines the formula for the regression
model. It states that the variable bmi is the dependent variable (~) and is linearly
related to the categorical variable children . The C() function is used to treat
children as a categorical variable, indicating that it has distinct categories rather than
being a continuous variable.

2. model = ols(formula, female_children).fit() : Here, the ols function from
the statsmodels library is used to specify and fit the linear regression model. The
formula is passed as an argument along with the dataset female_children. The
fit() method then fits the model to the data.

In [22]: formula = 'bmi ~ C(children)'
model = ols(formula, female_children).fit()
aov_table = anova_lm(model)
aov_table

Out[22]: df sum_sq mean_sq F PR(>F)

C(children) 2.0 24.590123 12.295062 0.334472 0.715858

Residual 563.0 20695.661583 36.759612 NaN NaN

Conclusion: the p-value is 0.715858, which is greater than alpha (0.05); therefore, we fail to
reject the null hypothesis and conclude that there is no statistically significant difference in
the mean BMI of women with no children, one child, and two children.
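As an optional cross-check (not part of the original lab), the same F-statistic and p-value can be obtained with scipy's f_oneway, passing the BMI values of each group in female_children directly:

In [ ]: # Cross-check of the ANOVA table using scipy's one-way ANOVA.
groups = [grp["bmi"].values for _, grp in female_children.groupby("children")]
f_stat, p_val_anova = stats.f_oneway(*groups)
print("F =", f_stat, ", p =", p_val_anova)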

Example 4
In this example, we will determine if the proportion of smokers is significantly different
across the different regions.

First, let's define our null and alternative hypotheses.

$ H_{0} $: The proportions of smokers are not significantly different across the different regions.
$ H_{A} $: The proportions of smokers are significantly different across the different regions.

Here, we are comparing two different categorical variables: smoker/non-smoker status and
region. For this type of analysis, we will perform a chi-squared test.

First, we will build a contingency table of smoker and non-smoker counts in each region.
For this, we will use the pandas crosstab() function.

In [23]: contingency= pd.crosstab(data.region, data.smoker)


contingency

Out[23]: smoker no yes

region

northeast 257 67

northwest 267 58

southeast 273 91

southwest 267 58

Next, let's plot the distribution of nonsmokers/smokers across 4 different regions using the
plot() function.

In [52]: contingency.plot(kind='bar')

Out[52]: <AxesSubplot:xlabel='region'>
Now, using the chi2_contingency() function from scipy.stats, we will calculate the
chi-squared statistic, p-value, degrees of freedom, and expected frequencies for our
data.

In [25]: chi2, p_val, dof, exp_freq = chi2_contingency(contingency, correction = False)


print('chi-square statistic: {} , p_value: {} , degree of freedom: {} ,expected fre

chi-square statistic: 7.343477761407071 , p_value: 0.06171954839170541 , degree of freedom: 3 ,expected frequencies: [[257.65022422 66.34977578]
[258.44544096 66.55455904]
[289.45889387 74.54110613]
[258.44544096 66.55455904]]

Exercise 4
Based on the above results, print your conclusion statements on whether to reject or fail to
reject the null hypothesis. What are your conclusions about the hypothesis?

In [54]: # Enter your code and run the cell
if (p_val < 0.05):
    print('Reject the null hypothesis, that the smokers proportions are not signifi
else:
    print('Fail to reject the null hypothesis, that the smokers proportions are not

Fail to reject the null hypothesis, that the smokers proportions are not significantly different across the different regions

Conclusion: We fail to reject the null hypothesis that the proportions of smokers are not
significantly different across the regions. Therefore, we conclude that the proportions of
smokers do not differ significantly across the regions.

Solution (Click Here)

Answer (Click Here)

Exercise 5
In this final exercise, we will determine if there is a significant difference in the distribution of
smokers between males and females.

First, let's define our null and alternative hypotheses.

$ H_{0} $: There is no significant difference in the distribution of smokers
between males and females.
$ H_{A} $: There is a significant difference in the distribution of smokers
between males and females.

Here, we are comparing two different categorical variables: smoker/non-smoker status and
sex (male/female). For this type of analysis, we will perform a chi-squared test.

Explanation:

1. Import necessary libraries, including pandas for data manipulation and


chi2_contingency from scipy.stats for the chi-squared test.
2. Specify the null ($ 𝐻_{0}$) and alternative ($ 𝐻_{A}$) hypotheses.
3. Create a contingency table using pd.crosstab to summarize the counts of smokers
and non-smokers for each sex, and print it.
4. Perform the chi-squared test using chi2_contingency .
5. Print the chi-squared value and p-value.
6. Compare the p-value to the significance level (commonly 0.05) and make a decision
about whether to reject the null hypothesis.
7. Create a Bar chart using .plot method on contingency

In [78]: #BEGIN SOLUTION
import pandas as pd
from scipy.stats import chi2_contingency

# Hypotheses:
# H0: There is no significant difference in the distribution of smokers between mal
# H1: There is a significant difference in the distribution of smokers between male

# Create a contingency table
contingency_table = pd.crosstab(data['sex'], data['smoker'])
print(contingency_table)

# Perform the chi-squared test
chi2, p_val, dof, exp_freq = chi2_contingency(contingency_table)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print('\nchi-square statistic: {} , p_value: {} , degree of freedom: {} ,expected f

# Compare the p-value to the significance level
if p_val < alpha:
    print("Conclusion:- Alpha value {} is more than p value {} ".format(alpha,p_val
    print("Reject the null hypothesis (There is no significant difference in the di
    print("Verdict:- There is a significant difference in the distribution of smok
else:
    print("Fail to reject the null hypothesis. There is no significant difference i

# Create a bar chart (use the contingency table built in this cell)
contingency_table.plot(kind='bar')
#END SOLUTION

smoker no yes
sex
female 547 115
male 517 159

chi-square statistic: 7.39291081459996 , p_value: 0.006548143503580696 , degree of freedom: 1 ,expected frequencies: [[526.43348281 135.56651719]
[537.56651719 138.43348281]]

Conclusion:- Alpha value 0.05 is more than p value 0.006548143503580696
Reject the null hypothesis (There is no significant difference in the distribution of smokers between males and females
Verdict:- There is a significant difference in the distribution of smokers between males and females.
Out[78]: <AxesSubplot:xlabel='sex'>
