Hypothesis Testing

Leveraging customer information is of paramount importance for most businesses. Imagine
that you are an insurance analyst (actuary) who needs to assess the insurability or the risks
of your customers. Part of your job is to look at customer attributes such as age, sex, BMI,
smoker/non-smoker status, location, etc., and to use them in your decision-making process on
whether to approve or deny their claims.

This lab will guide you through the series of steps in hypothesis testing to help you decide,
using statistical evidence, whether an attribute or a combination of attributes has an effect
on the insurance claims. This lab only introduces the initial steps of the decision-making
process, before building the prediction models and classifiers.

Objectives
After completing this lab you will be able to:

Understand the elements of hypothesis testing


Choose a sample statistic
Define the hypotheses (null and alternative)
Set the decision criteria
Evaluate and interpret the results

Setup
For this lab, we will be using the following libraries:

pandas for managing the data.
numpy for mathematical operations.
seaborn for visualizing the data.
matplotlib for visualizing the data.
scipy.stats for statistical analysis.
statsmodels for statistical analysis.

1. T-Test:

When to use: Use a t-test when you want to compare the means of two groups.
Example 1: We use a t-test to compare the mean BMI of males and the mean BMI of females.
Example 2: We use a t-test to compare the average charges of smokers and non-smokers.
2. ANOVA

When to use: Use ANOVA when you have three or more independent groups and
want to determine if there are significant differences in the means of these groups.
Example 3: We use ANOVA to compare the mean BMI of women with no children, one child,
and two children; ANOVA can help determine if there is a statistically significant
difference in BMI across these groups.
3. Chi-Squared Test:

When to use: Use a chi-squared test when you have categorical variables and want
to test if there is a significant association between them.
Example 4: We use a chi-squared test to determine if the proportion of smokers is
significantly different across the different regions.
Exercise 5: We use a chi-squared test to examine the association between sex and
smoking status.
(A short sketch of the library calls behind these three tests follows this list.)
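As referenced above, here is a minimal sketch (not part of the original lab) showing which library call corresponds to each of the three tests, using small synthetic samples. The variable names are hypothetical and only illustrate the call signatures; the examples that follow apply the same functions to the insurance data.

In [ ]: # Illustrative only: map each test to its library call on synthetic data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(30, 6, size=100)      # e.g., BMI-like values for group 1
group2 = rng.normal(31, 6, size=100)      # BMI-like values for group 2
group3 = rng.normal(30.5, 6, size=100)    # BMI-like values for group 3

# 1. T-test: compare the means of two groups
t_stat, p_t = stats.ttest_ind(group1, group2)

# 2. ANOVA: compare the means of three or more groups
f_stat, p_f = stats.f_oneway(group1, group2, group3)

# 3. Chi-squared test: association between two categorical variables
sex = rng.choice(["male", "female"], size=300)
smoker = rng.choice(["yes", "no"], size=300)
chi2_stat, p_c, dof, expected = stats.chi2_contingency(pd.crosstab(sex, smoker))

print(p_t, p_f, p_c)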

Installing Required Libraries


The following required modules are pre-installed in the Skills Network Labs environment.
However, if you run this notebook's commands in a different Jupyter environment (e.g., Watson
Studio or Anaconda), you will need to install these libraries by removing the # sign before
!mamba in the code cell below.

In [1]: # All libraries required for this lab are listed below. The libraries are pre-installed in the lab environment.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 s
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [2]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as stats
from scipy.stats import chi2_contingency
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

Reading and understanding our data

For this lab, we will be using the insurance.csv file, hosted on IBM Cloud Object Storage.

This dataset contains information about age, sex, BMI, the number of children, whether the
client is a smoker or non-smoker, the region where the client lives, and the charges to their
insurance company.

Let's read the data into a pandas DataFrame and look at the first 5 rows using the head()
method.

In [3]: data = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cl


data.head()

Out[3]: age sex bmi children smoker region charges

0 19 female 27.900 0 yes southwest 16884.92400

1 18 male 33.770 1 no southeast 1725.55230

2 28 male 33.000 3 no southeast 4449.46200

3 33 male 22.705 0 no northwest 21984.47061

4 32 male 28.880 0 no northwest 3866.85520

Using the info() function, we will take a look at the data types.

In [4]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

The describe() function provides summary statistics for the numeric variables.

In [5]: data.describe().T
Out[5]: count mean std min 25% 50% 75%

age 1338.0 39.207025 14.049960 18.0000 27.00000 39.000 51.000000

bmi 1338.0 30.663397 6.098187 15.9600 26.29625 30.400 34.693750

children 1338.0 1.094918 1.205493 0.0000 0.00000 1.000 2.000000

charges 1338.0 13270.422265 12110.011237 1121.8739 4740.28715 9382.033 16639.912515

According to the brief preview of our data, we will define the 'charges' to be our response
variable, and 'age', 'sex', 'bmi', 'children', 'smoker', and 'region' to be our predictor variables.
In this lab, we will test how our predictor variables influence the insurance 'charges'.

Steps in Hypothesis Testing

Example 1
In this first example, we will show how to test, with statistical evidence, whether
the BMI of females is different from that of males.

1. Choose a sample statistic

The first step in hypothesis testing is to choose a sample test statistic. Hypothesis testing
allows us to check the sample statistic against a statistic of another sample or population.
Let $\mu_{1}$ be the population mean BMI of males and $\mu_{2}$ be the population mean BMI
of females. We will compare these mean values, $\mu_{1}$ and $\mu_{2}$, statistically.

2. Define hypothesis (Null and Alternative)

The next step is to define the hypotheses to be tested. Two hypotheses are defined: the null
hypothesis and the alternative hypothesis. The null hypothesis is a statistical hypothesis which
assumes that the difference in observations is due to a random factor. It is denoted by $H_{0}$.
The alternative hypothesis is the opposite of the null hypothesis: it assumes that the difference
in observations is the result of a real effect. The alternative hypothesis is denoted by $H_{A}$.

$ H_{0}: \mu_{1}-\mu_{2} = 0 $ : There is no difference between the BMI of males and the BMI of
females.
$ H_{A}: \mu_{1}-\mu_{2} \neq 0 $ : There is a difference between the BMI of males and the BMI of
females.

The $\neq$ sign in the alternative hypothesis indicates that this is a two-tailed test.
3. Set the decision criteria

To set the criteria for a decision, we state the level of significance for the test. It could be 5%,
1%, or 0.5%. Based on the level of significance, we decide whether to reject the null
hypothesis in favour of the alternative, or fail to reject it.

The diagram above describes the principles of hypothesis testing. We will choose a 5%
significance level, so $ \alpha=0.05 $. Since we have a two-tailed test, the rejection region is
split between the two tails, with $\alpha/2 = 0.025$ in each. The two-tailed p-value returned by
the test is compared directly with alpha: if the calculated p-value is less than alpha, we reject
the null hypothesis. The significance level is based on the business requirements. If
you would like to learn more about statistical significance, please visit the corresponding wikipedia article.
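As a quick illustration of this decision rule (a sketch, not part of the original lab), the cell below computes the two-tailed critical t-values with scipy.stats.t.ppf, assuming scipy.stats has been imported above as stats; the degrees of freedom and p-value used here are hypothetical.

In [ ]: # Sketch of the two-tailed decision rule at alpha = 0.05.
alpha = 0.05
df = 1336                                   # hypothetical degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)     # critical value for each tail
print("rejection region: t < {:.3f} or t > {:.3f}".format(-t_crit, t_crit))

p_value = 0.09                              # hypothetical two-tailed p-value
print("reject H0" if p_value < alpha else "fail to reject H0")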

In this lab, we will use one of the t-test, z-score, f-score or chi-squared statistics to evaluate
our results.

A t-test is used for testing the mean of one population against a standard, or comparing the
means of two populations, if you do not know the standard deviation of the population and
when you have a limited sample (n < 30). If you know the standard deviation of the
populations, you may use a z-test.
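For reference, the pooled two-sample t-statistic (the default computed by scipy's ttest_ind when equal variances are assumed) is

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} $$

where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes.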

A z-test is used for testing the mean of a population versus a standard, or comparing the
means of two populations, with large (n ≥ 30) samples, whether you know the population
standard deviation or not. It is also used for testing the proportion of some characteristic
versus a standard proportion, or comparing the proportions of two populations.

An f-test is used to compare the variances of two populations. The samples can be any size.
It is the basis of ANOVA.

A chi-squared test is used to determine whether there is a statistically significant difference
between the expected and the observed frequencies in one or more categories of a
contingency table. A contingency table is a tabular representation of categorical data: it
shows the frequency distribution of the variables.

To learn more about t-test, z-score, f-score or chi-squared statistics and contingency tables,
please visit their corresponding wikipedia links.
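Since the z-test is not demonstrated elsewhere in this lab, here is a minimal sketch (on synthetic samples, not the insurance data) using the ztest function from statsmodels.stats.weightstats:

In [ ]: # Minimal two-sample z-test sketch on synthetic data.
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(1)
sample1 = rng.normal(30, 6, size=500)    # hypothetical large sample 1
sample2 = rng.normal(31, 6, size=500)    # hypothetical large sample 2

# Two-sided test of the difference in means (H0: mean1 - mean2 = 0)
z_stat, p_value = ztest(sample1, sample2, value=0)
print("z =", z_stat, ", p =", p_value)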

4. Evaluate and interpret the result


First, let's get all observations for females and males by using the loc indexer.

In [6]: female=data.loc[data.sex=="female"]
male=data.loc[data.sex=="male"]

Now, let's select the bmi values for females and males.

In [7]: f_bmi = female.bmi


m_bmi = male.bmi

Now, we will plot the distribution of 'bmi' values for females and males using seaborn's
distplot() function.

In [8]: sns.distplot(f_bmi,color='green',hist=False)
sns.distplot(m_bmi,color='red',hist=False)

Out[8]: <AxesSubplot:xlabel='bmi'>
From the graph, we already see that the two distributions are very similar.

Now, let's calculate the mean values for females and males bmi.

In [9]: female.bmi.mean()

Out[9]: 30.377749244713023

In [10]: male.bmi.mean()

Out[10]: 30.943128698224832

Next, we will obtain our statistics, t-value and p-value. We will use scipy.stats library and
ttest_ind() function to calculate these parameters.

In [11]: alpha=0.05
t_value1, p_value1 = stats.ttest_ind(m_bmi, f_bmi)
print("t_value1 = ",t_value1, ", p_value1 = ", p_value1)

t_value1 = 1.696752635752224 , p_value1 = 0.08997637178984932

Next, although optional, it is useful to print 'if/else' statements to make our conclusions
about the hypothesis.

In [12]: if p_value1 < alpha:
    print("Conclusion: since p_value {} is less than alpha {} ". format (p_value1,a
    print("Reject the null hypothesis that there is no difference between bmi of fe
    print("Verdict: There is a difference b/w female and male BMI")
else:
    print("Conclusion: since p_value {} is greater than alpha {} ". format (p_value
    print("Fail to reject the null hypothesis that there is no difference between b
    print("Verdict: There is no difference b/w female and male BMI")

Conclusion: since p_value 0.08997637178984932 is greater than alpha 0.05
Fail to reject the null hypothesis that there is no difference between bmi of females and bmi of males.
Verdict: There is no difference b/w female and male BMI

Conclusion: We fail to reject the null hypothesis and conclude that there is no statistically
significant difference between the female and male BMI.

Example 2
In this example, we would like to test, with statistical evidence, whether the medical claims
made by people who smoke are greater than those made by people who don't.

We will compare the mean charges ($\mu$) of the population of people who smoke and of those
who do not smoke. First, we need to calculate the mean values for the smoking and non-smoking
populations.

In [13]: smoker = data.loc[data.smoker=="yes"]


smoker_char = smoker.charges
sch_mean = smoker_char.mean()
sch_mean

Out[13]: 32050.23183153285

In [14]: nonsmoker = data.loc[data.smoker=="no"]


nonsmoker_char = nonsmoker.charges

Exercise 1
Calculate population mean of the nonsmokers.

In [29]: # Enter your code below and run the cell


nsch_mean = nonsmoker_char.mean()
nsch_mean

Out[29]: 8434.268297856199

Solution (Click Here)

Now, let's define our null and alternative hypotheses.

$ H_{0}: \mu_{1} \leq \mu_{2} $ : The average charges of smokers are less than or equal to
those of nonsmokers.
$ H_{A}: \mu_{1} > \mu_{2} $ : The average charges of smokers are greater than those of
nonsmokers.

The '>' sign in the alternative hypothesis indicates that the test is right-tailed. To compare the
mean charges of the smoking and nonsmoking populations, we will use a t-test. If the
t-statistic falls into the rejection region on the right side of the distribution curve, we
reject the null hypothesis.
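As a side note (not part of the original lab), SciPy versions 1.6 and later also let you request the right-tailed p-value directly via the alternative argument of ttest_ind, instead of halving the two-tailed p-value as is done later in this example. This sketch reuses the smoker_char and nonsmoker_char series defined above:

In [ ]: # Right-tailed test done directly (requires scipy >= 1.6).
# Equivalent to halving the two-tailed p-value when the t-statistic is positive.
t_val_right, p_right = stats.ttest_ind(smoker_char, nonsmoker_char, alternative='greater')
print("t =", t_val_right, ", one-tailed p =", p_right)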

Now, let's plot our smoking versus nonsmoking populations by using seaborn boxplot()
function. It is always useful to have a visual representation of the data that we are working
with.

In [16]: sns.boxplot(x=data.charges,y=data.smoker,data=data).set(title="Fig:1 Smoker vs Char

Out[16]: [Text(0.5, 1.0, 'Fig:1 Smoker vs Charges')]

Now, we will calculate t-value and p-value of charges for smoking and nonsmoking
populations.

In [17]: alpha=0.05
t_val2, p_value2 = stats.ttest_ind(smoker_char, nonsmoker_char)
p_value_onetail=p_value2/2
print("t_value = {} , p_value ={} , p_value_onetail = {}".format(t_val2, p_value2,

t_value = 46.66492117272371 , p_value =8.271435842179102e-283 , p_value_onetail = 4.135717921089551e-283

Exercise 2
Use print() function to state your conclusions based on the calculated statistics. What are
the conclusions?

In [37]: # Enter your code and run the cell
if p_value_onetail < alpha:
    print("Conclusion: The p_value {} is less than alpha value {}".format(p_value_
    print("We Successfully reject the null hypothesis that average charges of smoke
    print("Verdict: We consider alternate hypothesis and conclude that \nThe averag
else:
    print("Conclusion:Since p value {} is greater than alpha {} ".format (p_value_o
    print("Failed to reject null hypothesis that average charges for smokers are le
    print("Verdict: We consider null hypothesis and conclude that \n The average ch

Conclusion: The p_value 4.135717921089551e-283 is less than alpha value 0.05
We successfully reject the null hypothesis that the average charges of smokers are less than or equal to those of nonsmokers.
Verdict: We consider the alternative hypothesis and conclude that the average charges of smokers are greater than those of nonsmokers.

Solution (Click Here)

Example 3
In this example, we will use statistical evidence to compare the BMI of women with no
children, one child, and two children.

Now, let's define our null and alternative hypotheses.

$ H_{0}: \mu_{1}=\mu_{2}=\mu_{3} $ : The mean BMI of women with no children, one child,
and two children is the same.
$ H_{A} $: At least one of the mean BMI values is different.

For this example, we will use a one-way ANOVA and its F-statistic to test whether the mean
BMI differs across these three groups. We will set the alpha value to 0.05.

First, we need to filter data for women with 0, 1 and 2 children.

In [40]: female_children = female.loc[female['children']<=2]


We will use the groupby() function to group the data by the number of children and look at
the mean bmi value for each group.

In [20]: female_children.groupby([female_children.children]).mean().bmi

Out[20]: children
0 30.361522
1 30.052658
2 30.649790
Name: bmi, dtype: float64

Exercise 3
In this exercise, draw the boxplots to visualize the difference in bmi values between these 3
groups.

In [46]: # Enter your code and run the cell


sns.boxplot(x="children" , y="bmi", data=female_children).set(title="BMI of Women w

Out[46]: [Text(0.5, 1.0, 'BMI of Women with 0,1 and 2 children')]

In [47]: # Plot can also be displayed by the code below


# sns.boxplot(x=female_children.children, y=female_children.bmi)
# plt.grid()
# plt.show()
Solution (Click Here)

Now, we will construct the ANOVA table and check each group (0, 1, and 2 children)
against the bmi values. We will use the ols (ordinary least squares) model to estimate
the unknown parameters.

OLS, or Ordinary Least Squares, is a method in statistics used to find the best-fitting line in
linear regression. It minimizes the sum of squared differences between observed and
predicted values. OLS is used to estimate relationships between variables and make
predictions in fields like economics, finance, and social sciences.

1. formula = 'bmi ~ C(children)' : This line defines the formula for the regression
model. It states that the variable bmi is the dependent variable (~) and is linearly
related to the categorical variable children . The C() function is used to treat
children as a categorical variable, indicating that it has distinct categories rather than
being a continuous variable.

2. model = ols(formula, female_children).fit() : Here, the ols function from
the statsmodels library is used to specify and fit the linear regression model. The
formula is passed as an argument along with the dataset female_children. The
fit() method then fits the model to the data.

In [22]: formula = 'bmi ~ C(children)'
model = ols(formula, female_children).fit()
aov_table = anova_lm(model)
aov_table

Out[22]: df sum_sq mean_sq F PR(>F)

C(children) 2.0 24.590123 12.295062 0.334472 0.715858

Residual 563.0 20695.661583 36.759612 NaN NaN

Conclusion: the p-value is 0.715858, which is greater than alpha (0.05); therefore, we fail to
reject the null hypothesis and conclude that there is no statistically significant difference in
the mean BMI of women with no children, one child, and two children.
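As an optional cross-check (not part of the original lab), the same F-statistic and p-value can be obtained with scipy's f_oneway, passing the BMI values of each group in female_children directly:

In [ ]: # Cross-check of the ANOVA table using scipy's one-way ANOVA.
groups = [grp["bmi"].values for _, grp in female_children.groupby("children")]
f_stat, p_val_anova = stats.f_oneway(*groups)
print("F =", f_stat, ", p =", p_val_anova)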

Example 4
In this example, we will determine if the proportion of smokers is significantly different
across the different regions.

First, let's define our null and alternative hypotheses.

$ H_{0} $: The proportions of smokers are not significantly different across the different regions.
$ H_{A} $: The proportions of smokers are significantly different across the different regions.

Here, we are comparing two different categorical variables: smoker/non-smoker status and
region. For this type of analysis, we will perform a chi-squared test.

First, we will build a contingency table of smoker and non-smoker counts in each region.
For this, we will use the pandas crosstab() function.

In [23]: contingency= pd.crosstab(data.region, data.smoker)


contingency

Out[23]: smoker no yes

region

northeast 257 67

northwest 267 58

southeast 273 91

southwest 267 58

Next, let's plot the distribution of nonsmokers/smokers across 4 different regions using the
plot() function.

In [52]: contingency.plot(kind='bar')

Out[52]: <AxesSubplot:xlabel='region'>
Now, using the chi2_contingency() function from scipy.stats, we will calculate the
chi-squared statistic, p-value, degrees of freedom, and expected frequencies for our
data.

In [25]: chi2, p_val, dof, exp_freq = chi2_contingency(contingency, correction = False)


print('chi-square statistic: {} , p_value: {} , degree of freedom: {} ,expected fre

chi-square statistic: 7.343477761407071 , p_value: 0.06171954839170541 , degree of freedom: 3 ,expected frequencies: [[257.65022422 66.34977578]
[258.44544096 66.55455904]
[289.45889387 74.54110613]
[258.44544096 66.55455904]]

Exercise 4
Based on the above results, print your conclusion statements on whether to reject or fail to
reject the null hypothesis. What are your conclusions about the hypothesis?

In [54]: # Enter your code and run the cell
if (p_val < 0.05):
    print('Reject the null hypothesis, that the smokers proportions are not signifi
else:
    print('Fail to reject the null hypothesis, that the smokers proportions are not

Fail to reject the null hypothesis, that the smokers proportions are not significantly different across the different regions

Conclusion: We fail to reject the null hypothesis that the proportions of smokers are not
significantly different across the regions. Therefore, we conclude that the proportions of
smokers do not differ significantly across the regions.

Solution (Click Here)

Answer (Click Here)

Exercise 5
In this final exercise, we will determine if there is a significant difference in the distribution of
smokers between males and females.

First, let's define our null and alternative hypotheses.

$ H_{0} $: There is no significant difference in the distribution of smokers
between males and females.
$ H_{A} $: There is a significant difference in the distribution of smokers
between males and females.

Here, we are comparing two different categorical variables: smoker/non-smoker status and
sex (male/female). For this type of analysis, we will perform a chi-squared test.

Explanation:

1. Import necessary libraries, including pandas for data manipulation and


chi2_contingency from scipy.stats for the chi-squared test.
2. Specify the null ($ 𝐻_{0}$) and alternative ($ 𝐻_{A}$) hypotheses.
3. Create a contingency table using pd.crosstab to summarize the counts of smokers
and non-smokers for each sex, and print it.
4. Perform the chi-squared test using chi2_contingency .
5. Print the chi-squared value and p-value.
6. Compare the p-value to the significance level (commonly 0.05) and make a decision
about whether to reject the null hypothesis.
7. Create a Bar chart using .plot method on contingency

In [78]: #BEGIN SOLUTION
import pandas as pd
from scipy.stats import chi2_contingency

# Hypotheses:
# H0: There is no significant difference in the distribution of smokers between mal
# H1: There is a significant difference in the distribution of smokers between male

# Create a contingency table
contingency_table = pd.crosstab(data['sex'], data['smoker'])
print(contingency_table)

# Perform the chi-squared test
chi2, p_val, dof, exp_freq = chi2_contingency(contingency_table)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print('\nchi-square statistic: {} , p_value: {} , degree of freedom: {} ,expected f

# Compare the p-value to the significance level
if p_val < alpha:
    print("Conclusion:- Alpha value {} is more than p value {} ".format(alpha,p_val
    print("Reject the null hypothesis (There is no significant difference in the di
    print("Verdict:- There is a significant difference in the distribution of smok
else:
    print("Fail to reject the null hypothesis. There is no significant difference i

# Create a bar chart (use the contingency table built in this cell)
contingency_table.plot(kind='bar')
#END SOLUTION

smoker no yes
sex
female 547 115
male 517 159

chi-square statistic: 7.39291081459996 , p_value: 0.006548143503580696 , degree of freedom: 1 ,expected frequencies: [[526.43348281 135.56651719]
[537.56651719 138.43348281]]

Conclusion:- Alpha value 0.05 is more than p value 0.006548143503580696
Reject the null hypothesis (There is no significant difference in the distribution of smokers between males and females
Verdict:- There is a significant difference in the distribution of smokers between males and females.
Out[78]: <AxesSubplot:xlabel='sex'>
