Pearson Chi Square Test
Test of Goodness-of-Fit
Hypothesis tests can also assist in assessing the quality of a model. In particular, the chi-squared goodness-of-fit
test checks whether a proposed distribution agrees with observed data.
Start with $n$ independent observations, each of which must be classified as one of $r$ mutually exclusive categories. Define $n_i$ as the number of observations classified as Category $i$, where $i = 1, 2, \ldots, r$. Hence, $n_i / n$ is the proportion of observations in Category $i$.
Now consider a model that describes the distribution among the categories. If the model is properly specified, then $p_i$, the probability an observation belongs to Category $i$, should be similar to $n_i / n$ for all $i$. As such, the hypotheses can be written as
• $H_0$: $p_i = n_i / n$ for all $i = 1, 2, \ldots, r$
• $H_1$: At least one $p_i \neq n_i / n$ for $i = 1, 2, \ldots, r$
In other words, failing to reject H0 suggests that the model fits the data adequately, whereas rejecting H0
suggests that the model fits the data poorly.
Without discussing the proof, this is a right-tailed test with a test statistic calculated as
$$\sum_{i=1}^{r} \frac{(n_i - n p_i)^2}{n p_i}$$
which comes from a $\chi^2$ sampling distribution with $r - 1$ degrees of freedom. Therefore, reject $H_0$ when
$$\sum_{i=1}^{r} \frac{(n_i - n p_i)^2}{n p_i} \ge \chi^2_{1-\alpha,\, r-1}$$
As a reminder,
• $r$ is the number of unique categories,
• $n_i$ is the number of Category $i$ observations,
• $n$ is the total number of observations,
• $p_i$ is the model's probability of a Category $i$ observation,
• $\alpha$ is the significance level, and
• $\chi^2_{p,\, \nu}$ is the $100p$th percentile of a $\chi^2$ random variable with $\nu$ degrees of freedom.
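The statistic above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `gof_statistic` and the coin-flip data are hypothetical and do not come from the text.

```python
def gof_statistic(observed, probs):
    """Return (statistic, degrees_of_freedom) for a chi-squared goodness-of-fit test.

    observed -- list of category counts n_i
    probs    -- list of model probabilities p_i (must sum to 1)
    """
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("model probabilities must sum to 1")
    n = sum(observed)  # total number of observations
    statistic = sum(
        (n_i - n * p_i) ** 2 / (n * p_i)  # (observed - expected)^2 / expected
        for n_i, p_i in zip(observed, probs)
    )
    return statistic, len(observed) - 1  # r - 1 degrees of freedom


# Hypothetical data: 100 coin flips (60 heads, 40 tails) tested against a fair coin.
stat, df = gof_statistic([60, 40], [0.5, 0.5])
print(stat, df)
```

For this hypothetical coin, the statistic is 4 with 1 degree of freedom. Since the 95th percentile of a chi-squared distribution with 1 degree of freedom is about 3.841, the fair-coin model would be rejected at the 5% significance level.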
EXAMPLE 4.2.5
Outcome    Observed count
1          17
2          18
3          24
4          29
5          33
6          29
Let $\chi^2_{p,\, \nu}$ be the $100p$th percentile of a chi-squared random variable with $\nu$ degrees of freedom. The following
table lists values of $\chi^2_{p,\, \nu}$ for specific combinations of $p$ and $\nu$:
Test whether the die is fair using the chi-squared goodness-of-fit test.
SOLUTION
There are six categories, one for each die roll outcome. Therefore, $r = 6$.
In addition, a fair die implies that each die roll outcome is equally likely, meaning $p_i = 1/6$ for all $i$.
With 150 observations,

$i$    $n p_i$
1      150 ⋅ (1/6) = 25
2      150 ⋅ (1/6) = 25
⋮      ⋮
6      150 ⋅ (1/6) = 25
$$\sum_{i=1}^{r} \frac{(n_i - n p_i)^2}{n p_i} = \frac{(17 - 25)^2}{25} + \frac{(18 - 25)^2}{25} + \ldots + \frac{(29 - 25)^2}{25} = 8.4$$
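Because $r - 1 = 5$, the statistic is compared against a $\chi^2$ distribution with 5 degrees of freedom. The observed value of 8.4 falls below even the 90th percentile of that distribution, $\chi^2_{0.90,\, 5} \approx 9.24$, so it also falls below the 94th percentile, $\chi^2_{0.94,\, 5}$.
In conclusion, we fail to reject $H_0$ at the 6% significance level, suggesting that the assumption of a fair die is reasonable for this data of 150 rolls.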
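As a cross-check on this example, the sketch below recomputes the statistic and also computes a p-value. It is an illustration, not the method of the text, which relies on a percentile table. The helper `chi2_sf_odd_df` is a hypothetical name, and it uses a standard finite-sum expression for the chi-squared upper tail that applies only when the degrees of freedom are odd.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with ODD df."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form only applies to odd degrees of freedom")
    root = math.sqrt(x)
    normal_tails = math.erfc(root / math.sqrt(2))        # 2 * P(Z >= sqrt(x))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal pdf at sqrt(x)
    total, term = 0.0, root                               # first term: x^(1/2) / 1
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)                           # next: x^((2k+1)/2) / (1*3*...*(2k+1))
    return normal_tails + 2 * density * total


observed = [17, 18, 24, 29, 33, 29]   # counts for outcomes 1 through 6
n = sum(observed)                      # 150 rolls
expected = n / 6                       # fair die: n * p_i = 25 for every outcome
statistic = sum((n_i - expected) ** 2 / expected for n_i in observed)
p_value = chi2_sf_odd_df(statistic, df=len(observed) - 1)

print(round(statistic, 4), round(p_value, 4))
```

The p-value comes out near 0.136, which exceeds 0.06, so this agrees with the decision not to reject at the 6% level.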
Test of Independence
A contingency table records the frequency of observations described by two categorical variables. It is used to
examine the presence of dependence between the two variables. This is achieved using the same procedure as
the goodness-of-fit test. The hypotheses are
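One variable has $r$ categories, while the other variable has $s$. Each of the $n$ observations belongs to one of the $r$-by-$s$ combinations. Let
• $n_{ij}$ be the number of observations in Category $i$ for the first variable and Category $j$ for the second variable,
• $n_{i\cdot}$ be the subtotal number of observations in Category $i$ for the first variable, across all categories of the second variable, and
• $n_{\cdot j}$ be the subtotal number of observations in Category $j$ for the second variable, across all categories of the first variable,
as laid out in the following contingency table:

                           Second Variable
                    Cat 1     Cat 2     ⋯     Cat s     Total
First      Cat 1    n_11      n_12      ⋯     n_1s      n_1·
Variable   Cat 2    n_21      n_22      ⋯     n_2s      n_2·
           ⋮        ⋮         ⋮               ⋮         ⋮
           Cat r    n_r1      n_r2      ⋯     n_rs      n_r·
Total               n_·1      n_·2      ⋯     n_·s      n

If the two variables are independent, the expected number of observations in cell $(i, j)$ is $n_{i\cdot} n_{\cdot j} / n$. The test statistic is
$$\sum_{i=1}^{r} \sum_{j=1}^{s} \frac{\left(n_{ij} - n_{i\cdot} n_{\cdot j} / n\right)^2}{n_{i\cdot} n_{\cdot j} / n}$$
which comes from a $\chi^2$ sampling distribution with $(r - 1)(s - 1)$ degrees of freedom. Therefore, reject $H_0$ when
$$\sum_{i=1}^{r} \sum_{j=1}^{s} \frac{\left(n_{ij} - n_{i\cdot} n_{\cdot j} / n\right)^2}{n_{i\cdot} n_{\cdot j} / n} \ge \chi^2_{1-\alpha,\, (r-1)(s-1)}$$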
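The link to the goodness-of-fit test can be sketched as follows. Treat each of the $rs$ cells as one category with probability $p_{ij}$. The marginal probabilities $p_{i\cdot}$ and $p_{\cdot j}$, and their estimates marked with hats, are notation assumed here for illustration and are not taken from the text. Under independence, each cell probability factors into its two marginals, and estimating the marginals by the observed proportions gives the expected count for each cell:

$$
p_{ij} = p_{i\cdot}\, p_{\cdot j}, \qquad
\hat{p}_{i\cdot} = \frac{n_{i\cdot}}{n}, \qquad
\hat{p}_{\cdot j} = \frac{n_{\cdot j}}{n}
\quad\Longrightarrow\quad
n\, \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}
$$

The degrees of freedom start at $rs - 1$, as in a goodness-of-fit test with $rs$ categories, and drop by one for each marginal probability that has to be estimated from the data:

$$
(rs - 1) - (r - 1) - (s - 1) = (r - 1)(s - 1)
$$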
EXAMPLE 4.2.6
                           Year
                    Cat 1     Cat 2     Total
Type  Cars          40        60        100
      Motorcycles   10        40        50
Total               50        100       150
Let $\chi^2_{p,\, \nu}$ be the $100p$th percentile of a chi-squared random variable with $\nu$ degrees of freedom. The following
table lists values of $\chi^2_{p,\, \nu}$ for specific combinations of $p$ and $\nu$:
SOLUTION
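Note that $r = 2$ and $s = 2$, since each variable (type and year) has two categories. The test statistic therefore has $(2 - 1)(2 - 1) = 1$ degree of freedom.
The expected count for each cell, (row total × column total) / 150, is 33.33 and 66.67 for cars and 16.67 and 33.33 for motorcycles, which gives a test statistic of 6. This lies between $\chi^2_{0.975,\, 1} = 5.024$ and $\chi^2_{0.99,\, 1} = 6.635$.
In conclusion, reject $H_0$ at the 2.5% significance level, but not at the 1% level. This is fairly strong evidence that vehicle type and year are dependent.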
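The arithmetic behind this conclusion can be reproduced with a short sketch. It assumes the 2-by-2 counts from the example (rows are cars and motorcycles, columns are the two year categories). The p-value line is an addition that is not part of the original solution, and it relies on the fact that a chi-squared variable with exactly 1 degree of freedom is the square of a standard normal.

```python
import math

# Observed counts: rows are vehicle types, columns are the two year categories.
observed = [
    [40, 60],  # cars
    [10, 40],  # motorcycles
]

row_totals = [sum(row) for row in observed]          # n_i. = [100, 50]
col_totals = [sum(col) for col in zip(*observed)]    # n_.j = [50, 100]
n = sum(row_totals)                                  # 150

statistic = 0.0
for i, row in enumerate(observed):
    for j, n_ij in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # count expected under independence
        statistic += (n_ij - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r - 1)(s - 1)

# Valid only because df == 1: P(Z^2 >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 4), df, round(p_value, 4))
```

The statistic is 6 with 1 degree of freedom, and the p-value falls between 0.01 and 0.025, which matches rejecting at the 2.5% level but not at the 1% level.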