Hypothesis Testing
This lab guides you through the steps of hypothesis testing so that you can decide, using
statistical evidence, whether an attribute or a combination of attributes has an effect
on the insurance claims. It covers only the initial steps of decision making, before
building prediction models and classifiers.
Objectives
After completing this lab you will be able to:
Setup
For this lab, we will be using the following libraries:
1. T-Test:
When to use: Use a t-test when you want to compare the means of two groups.
Example 1: We use a t-test to compare the mean BMI of males with the mean BMI of
females.
Example 2: We use a t-test to compare the average charges of smokers and
non-smokers.
2. ANOVA
When to use: Use ANOVA when you have three or more independent groups and
want to determine if there are significant differences in the means of these groups.
Example 3: We use ANOVA to compare the mean BMI of women with no children, one
child, and two children. ANOVA can help determine whether there is a
statistically significant difference in BMI across these groups.
3. Chi-Squared Test:
When to use: Use a chi-squared test when you have categorical variables and want
to test if there is a significant association between them.
Example 4: We use a chi-squared test to determine whether the proportion of
smokers is significantly different across the regions.
Exercise 5: We use a chi-squared test to examine the association between sex
and smoking status.
In [1]: # All Libraries required for this lab are listed below. The libraries pre-installed
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 s
# Note: If your environment doesn't support "!mamba install", use "!pip install"
This dataset contains information about age, sex, BMI, the number of children, whether the
client is a smoker, the region where the client lives, and the charges to their
insurance company.
Let's read the data into a pandas DataFrame and look at the first 5 rows using the head()
method.
In [4]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
The describe() function provides summary statistics for the numeric variables.
In [5]: data.describe().T
Out[5]: count mean std min 25% 50% 75%
According to the brief preview of our data, we will define the 'charges' to be our response
variable, and 'age', 'sex', 'bmi', 'children', 'smoker', and 'region' to be our predictor variables.
In this lab, we will test how our predictor variables influence the insurance 'charges'.
Example 1
In this first example, we will show how to test, with statistical evidence, whether
the BMI of females is different from that of males.
The equal sign in the null hypothesis, paired with a 'not equal' alternative hypothesis, indicates that it is a 2-tailed test.
3. Set the decision criteria
To set the criteria for a decision, we state the level of significance for a test. It could be 5%,
1% or 0.5%. Based on the level of significance, we decide whether to reject the null
hypothesis in favor of the alternate, or fail to reject it.
The diagram above describes the principles of hypothesis testing. We will choose a 5%
significance level, so $ \alpha=0.05 $. Since we have a 2-tailed test, the rejection region is
split between the two tails, with $ \alpha/2 = 0.025 $ in each. The p-value reported by a
two-sided test (the default in scipy.stats.ttest_ind()) already accounts for both tails, so we
compare it directly with alpha: if the calculated p-value is less than alpha, we will
reject the null hypothesis. The significance level is chosen based on the business requirements. If
you would like to learn more about statistical significance, please visit this wikipedia link.
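The decision rule described above can be sketched in a few lines of plain Python. The function name decide and the sample p-values are hypothetical and serve only as an illustration; they are not results from the insurance data.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Hypothetical p-values, not results from the insurance data
for p in (0.003, 0.20):
    print(f"p = {p}: {decide(p)}")
```

Keeping the rule in one small function makes it easy to reuse the same alpha for every test in the lab.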
In this lab, we will use one of the t-test, z-score, f-score or chi-squared statistics to evaluate
our results.
A t-test is used for testing the mean of one population against a standard or comparing the
means of two populations if you do not know the standard deviation of the population and
when you have a limited sample (n < 30). If you know the standard deviation of the
populations, you may use a z-test.
A z-test is used for testing the mean of a population versus a standard, or comparing the
means of two populations, with large (n ≥ 30) samples, whether you know the population
standard deviation or not. It is also used for testing the proportion of some characteristic
versus a standard proportion, or comparing the proportions of two populations.
An f-test is used to compare variances between 2 populations. The samples can be any size.
It is the basis of ANOVA.
To learn more about t-test, z-score, f-score or chi-squared statistics and contingency tables,
please visit their corresponding wikipedia links.
In [6]: female=data.loc[data.sex=="female"]
male=data.loc[data.sex=="male"]
Now, let's select the bmi values for females and males; the cells below refer to them as f_bmi and m_bmi.
Now, we will plot the distribution of 'bmi' values for females and males using seaborn's
distplot() function (deprecated since seaborn 0.11; kdeplot() draws the same density curves in newer releases).
In [8]: sns.distplot(f_bmi,color='green',hist=False)
sns.distplot(m_bmi,color='red',hist=False)
Out[8]: <AxesSubplot:xlabel='bmi'>
From the graph, we already see that the two distributions are very similar.
Now, let's calculate the mean bmi values for females and males.
In [9]: female.bmi.mean()
Out[9]: 30.377749244713023
In [10]: male.bmi.mean()
Out[10]: 30.943128698224832
Next, we will obtain our statistics, the t-value and p-value. We will use the ttest_ind()
function from the scipy.stats library to calculate them.
In [11]: alpha=0.05
t_value1, p_value1 = stats.ttest_ind(m_bmi, f_bmi)
print("t_value1 = ",t_value1, ", p_value1 = ", p_value1)
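To make the numbers returned by ttest_ind() less of a black box, the sketch below computes the pooled-variance two-sample t statistic by hand, which is the statistic ttest_ind() uses with its default equal_var=True. The function name pooled_t and the two small samples are hypothetical; they are not taken from the insurance data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal population variances."""
    n1, n2 = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom


# Hypothetical BMI-like samples
group_a = [31.2, 29.8, 33.1, 30.4, 28.9]
group_b = [30.1, 27.5, 32.0, 29.3, 28.6]
t_stat, dof = pooled_t(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The p-value then comes from the t distribution with the returned degrees of freedom, which is the part scipy.stats handles for us.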
Next, although optional, it is useful to print 'if/else' statements to state our conclusions
about the hypothesis.
else:
print("Conclusion: since p_value {} is greater than alpha {} ". format (p_value
print("Fail to reject the null hypothesis that there is a difference between bm
print("Verdict: There is no difference b/w female and male BMI")
Conclusion: We fail to reject the null hypothesis; the data do not provide statistically
significant evidence of a difference between the female and male bmi.
Example 2
In this example, we would like to test whether the medical claims made by the
people who smoke are greater than those made by people who don't.
We will compare the mean values (𝜇) of the population of people who smoke and the population
of those who do not. First, we need to calculate the mean values of the smoking and nonsmoking
populations.
Out[13]: 32050.23183153285
Exercise 1
Calculate the population mean of the nonsmokers.
Out[29]: 8434.268297856199
The '>' sign in the alternate hypothesis indicates the test is right tailed. To compare the
mean values of smoking and nonsmoking populations, we will use a t-test. If the t-value
(calculated from the t-test) falls into the rejection region on the right side of the distribution
curve, we reject the null hypothesis.
Now, let's plot our smoking versus nonsmoking populations by using seaborn's boxplot()
function. It is always useful to have a visual representation of the data that we are working
with.
Now, we will calculate the t-value and p-value of charges for the smoking and nonsmoking
populations.
In [17]: alpha=0.05
t_val2, p_value2 = stats.ttest_ind(smoker_char, nonsmoker_char)
p_value_onetail=p_value2/2
print("t_value = {} , p_value ={} , p_value_onetail = {}".format(t_val2, p_value2,
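Halving the two-sided p-value gives the right-tailed p-value only when the t-value is positive, that is, when the sample difference points in the direction of the alternative hypothesis. The helper below is a minimal sketch of that conversion; the name right_tailed_p and the example inputs are hypothetical.

```python
def right_tailed_p(t_value, p_two_sided):
    """Convert a two-sided p-value to a right-tailed one (H1: mean1 > mean2)."""
    half = p_two_sided / 2
    # If t is negative, the data point away from H1, so the right-tail area is large
    return half if t_value > 0 else 1 - half


# Hypothetical inputs, not results from the insurance data
print(right_tailed_p(2.1, 0.04))   # positive t: the two-sided p-value is halved
print(right_tailed_p(-2.1, 0.04))  # negative t: the right-tailed p-value is near 1
```

In SciPy 1.6 and later, ttest_ind() also accepts alternative='greater', which returns the right-tailed p-value directly.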
Exercise 2
Use print() function to state your conclusions based on the calculated statistics. What are
the conclusions?
Example 3
In this example, using the statistical evidence, we will compare the BMI of women with no
children, one child, and two children.
For this example, we will use a one-way ANOVA and its f-score statistic, which compares the
variance between the three group means with the variance within the groups. We will set the alpha value to 0.05.
In [20]: female_children.groupby('children')['bmi'].mean()
Out[20]: children
0 30.361522
1 30.052658
2 30.649790
Name: bmi, dtype: float64
Exercise 3
In this exercise, draw the boxplots to visualize the difference in bmi values between these 3
groups.
Now, we will construct the ANOVA table and compare the bmi values across the three groups
(0, 1, and 2 children). We will use the ols (ordinary least squares) model to estimate
the unknown parameters.
OLS, or Ordinary Least Squares, is a method in statistics used to find the best-fitting line in
linear regression. It minimizes the sum of squared differences between observed and
predicted values. OLS is used to estimate relationships between variables and make
predictions in fields like economics, finance, and social sciences.
1. formula = 'bmi ~ C(children)' : This line defines the formula for the regression
model. The ~ separates the dependent variable bmi (on the left) from the predictor (on the
right), so bmi is modeled as linearly related to the categorical variable children . The C()
function tells the model to treat children as a categorical variable with distinct categories
rather than as a continuous variable.
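Fitting 'bmi ~ C(children)' by OLS amounts to estimating one mean per group, and the F statistic in the ANOVA table compares the variation between those group means with the variation within the groups. The sketch below computes that F statistic by hand in plain Python; the function name one_way_f and the three small samples are hypothetical and are not the lab's data.

```python
from statistics import mean


def one_way_f(*groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-group sum of squares: group means versus the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations versus their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical BMI-like samples for 0, 1, and 2 children
f_stat, df_b, df_w = one_way_f(
    [30.1, 29.4, 31.8, 30.6],
    [29.9, 30.7, 28.8, 30.2],
    [31.0, 30.3, 29.6, 31.5],
)
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

A small F means the group means differ by about as much as random variation within the groups would produce, which is the situation in which the null hypothesis is not rejected.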
Conclusion: The p-value is 0.715858, which is greater than alpha (0.05). Therefore, we fail to
reject the null hypothesis: there is no statistically significant evidence that the mean bmi
of women with no children, one child, and two children differs.
Example 4
In this example, we will determine if the proportion of smokers is significantly different
across the different regions.
First, we will build a contingency table of nonsmoker and smoker counts in the different
regions. For this, we will use the pandas crosstab() function.
smoker      no  yes
region
northeast  257   67
northwest  267   58
southeast  273   91
southwest  267   58
Next, let's plot the distribution of nonsmokers/smokers across 4 different regions using the
plot() function.
In [52]: contingency.plot(kind='bar')
Out[52]: <AxesSubplot:xlabel='region'>
Now, using the chi2_contingency() function from scipy.stats, we will calculate the
chi-squared statistic, p-value, degrees of freedom, and expected frequencies for our
data.
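As a cross-check on chi2_contingency(), the sketch below recomputes the expected frequencies and the chi-squared statistic directly from the region-by-smoker counts in the contingency table above, using only the standard library. The function name chi_squared is hypothetical, the column order (no, yes) is assumed from the column totals, and the closed-form tail probability is valid only for 3 degrees of freedom; chi2_contingency() remains the tool used in the lab.

```python
from math import erfc, exp, pi, sqrt

# Counts from the contingency table: rows are regions, columns are (no, yes)
observed = {
    "northeast": (257, 67),
    "northwest": (267, 58),
    "southeast": (273, 91),
    "southwest": (267, 58),
}


def chi_squared(table):
    """Pearson chi-squared statistic, degrees of freedom, expected counts."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    # Expected count = row total * column total / grand total
    expected = [[sum(r) * c / grand_total for c in col_totals] for r in rows]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


stat, dof, expected = chi_squared(observed)
# Upper-tail probability of a chi-squared variable with exactly 3 degrees of freedom
p_value = erfc(sqrt(stat / 2)) + sqrt(2 * stat / pi) * exp(-stat / 2)
print(f"chi2 = {stat:.4f}, dof = {dof}, p = {p_value:.4f}")
```

With these counts the p-value comes out slightly above 0.05, so at the 5% level the null hypothesis is not rejected.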
Exercise 4
Based on the above results, print your conclusion statements on whether to reject or fail to
reject the null hypothesis. What are your conclusions about the hypothesis?
Conclusion: We fail to reject the null hypothesis that the proportions of smokers are the same
across the different regions. Therefore, we do not have statistically significant evidence that
the proportions of smokers differ across regions.
Exercise 5
In this final exercise, we will determine if there is a significant difference in the distribution of
smokers between males and females.
Explanation:
import pandas as pd
from scipy.stats import chi2_contingency
# Hypothesis:
# H0: There is no significant difference in the distribution of smokers between mal
# H1: There is a significant difference in the distribution of smokers between male
# Create a contingency table
contingency_table = pd.crosstab(data['sex'], data['smoker'])
print(contingency_table)
smoker no yes
sex
female 547 115
male 517 159
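For a 2x2 table such as this one, chi2_contingency() applies Yates' continuity correction by default (its correction argument), which moves each observed count 0.5 toward its expected count before squaring. The sketch below reproduces that calculation from the counts printed above using only the standard library; the function name yates_chi_squared is hypothetical, and the erfc-based tail probability holds only for 1 degree of freedom.

```python
from math import erfc, sqrt

# Counts printed above: rows are (female, male), columns are (no, yes)
table = [(547, 115), (517, 159)]


def yates_chi_squared(rows):
    """Yates-corrected chi-squared statistic for a 2x2 table."""
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        for observed, col_total in zip(row, col_totals):
            expected = sum(row) * col_total / grand_total
            # Shrink the gap by 0.5, but never past zero
            gap = max(abs(observed - expected) - 0.5, 0.0)
            stat += gap ** 2 / expected
    return stat


stat = yates_chi_squared(table)
# Upper-tail probability of a chi-squared variable with 1 degree of freedom
p_value = erfc(sqrt(stat / 2))
print(f"chi2 = {stat:.4f}, p = {p_value:.5f}")
```

Here the p-value falls below 0.05, so at the 5% level the null hypothesis of no association between sex and smoking status is rejected.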