
DS-2, Week 3 - Lectures


MDS 202: Data Science II with R

Lectures 7-9: What are Distributions and Hypothesis Testing in Statistics?

Dr. Shatrughan Singh∗

Week 3 (20-24 February) 2023

1 LEARNING OBJECTIVES
1.1 What are you expected to learn from this lecture?
• Learn about different distributions in data science.
• What is hypothesis testing?
• How to calculate statistical tests in R?

2 Probability Distributions
The probability of an event describes how likely that event is to occur. For example, when tossing a fair
coin, Heads and Tails each occur with equal probability of 50%. Hence, the applications of probability
begin with numbers p₀, p₁, p₂, …, which give the probability of each possible outcome (in a coin toss,
there are only two possible outcomes).
Random variables are variables whose values depend on some random phenomenon or event. There are two
types of random variables: Discrete (a finite or countable number of values) and Continuous (an infinite
number of values).
A few probability distributions are described below; however, only the normal distribution is discussed in
detail.

2.1 Binomial Distribution


The binomial distribution is the discrete probability distribution of the “number of successes” in a sequence
of N independent experiments, where the Boolean-valued outcome of each trial is either ‘1’ or ‘0’ (success or
failure, heads or tails). The probability of success is given by ‘p’ and the probability of failure by ‘(1 − p)’.
Taking a coin toss as an example, a fair coin has a probability of 1/2 or 0.5 (50%) each for heads and tails.
EX: The number of heads obtained when tossing a fair coin 10 times can be modelled with a Binomial Distribution.
The formula for the binomial distribution is shown below:
P(x) = \frac{N!}{x!(N - x)!}\,\pi^{x}(1 - \pi)^{N - x}
where, 𝑃 (𝑥) is the probability of 𝑥 successes out of 𝑁 trials, 𝑁 is the number of trials, and 𝜋 is the probability
of success on a given trial.
∗ Amity University Rajasthan (Jaipur), ssingh9@jpr.amity.edu



Figure 1: An example of Binomial Distributions with changes in sizes and probabilities.

The mean of a binomial distribution with parameters 𝑁 (the number of trials) and 𝜋 (the probability of
success on each trial) is:
𝜇 = 𝑁𝜋
where 𝜇 is the mean of the binomial distribution. The standard deviation of a binomial distribution
is:
𝜎 = √(𝑁𝜋(1 − 𝜋))
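As a quick check in R (a minimal sketch; the fair-coin example with N = 10 and π = 0.5 is assumed), the built-in dbinom function returns binomial probabilities, and the formulas above give the mean and standard deviation:

# Probability of exactly 5 heads in 10 tosses of a fair coin
dbinom(5, size = 10, prob = 0.5)
[1] 0.2460938

# Mean and standard deviation of this binomial distribution
10 * 0.5                    # mu = N * pi = 5
sqrt(10 * 0.5 * (1 - 0.5))  # sigma = sqrt(N * pi * (1 - pi)), about 1.58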

2.2 Poisson Distribution


The Poisson distribution is a discrete probability distribution of the “number of events” occurring in a
specified time period, given the average number of times the event occurs over that time period. In other
words, the Poisson distribution can be used to calculate the probabilities of various numbers of successes
based on the “mean number of successes”. Keep in mind that the various events must be independent in
order to apply the Poisson distribution.
EX: The average number of calls to a police station on a weekday is 10. What is the probability that on a
given weekday there would be 15 calls? The number of calls can be modelled with a Poisson Distribution.
The Poisson distribution function is very popular for modeling countable events occurring within a given
time interval. Assuming a random variable X follows a Poisson distribution, the probability of observing x
events over a time period can be expressed by the following probability function:
P(X = x) = \frac{e^{-\mu}\,\mu^{x}}{x!}

For the Example above,


P(X = 15) = \frac{e^{-10}\,10^{15}}{15!} \approx 0.035
where, 𝑥 is the number of successes in question, 𝜇 is the average number of successes, and 𝑒 is Euler’s
Number (= base of natural logarithms = 2.7183).
The mean of the Poisson distribution is 𝜇. The variance is also equal to 𝜇.
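The example above can be checked in R with dpois (a minimal sketch based on the police-station example):

# Probability of exactly 15 calls when the weekday average is 10
dpois(15, lambda = 10)     # approximately 0.035, matching the hand calculation

# Probability of 15 or more calls on a given weekday
1 - ppois(14, lambda = 10)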



Figure 2: An example of Poisson Distributions with changes in average number of times an event occurs.

NOTE: The assumption of “Independent and Identically Distributed” events for the Poisson Distribution may
not always hold. For example, the failure of one bank may be linked to the failure of other banks in the
same time period.

2.3 Exponential Distribution


The exponential distribution is a continuous probability distribution of the time between events in a Poisson
point process. It describes the interarrival time in a sequence of randomly recurring, independent events.
If 𝜆 is the mean waiting time for the next event recurrence, its probability density function is:
f(x) = \begin{cases} \frac{1}{\lambda} e^{-x/\lambda}, & x \ge 0 \\ 0, & x < 0 \end{cases}

The mean of the exponential distribution is 𝜇 = 𝜆 (the mean waiting time defined above), and the variance is 𝜎² = 𝜆².
EX: Suppose the average checkout time of a bank cashier is five minutes. Find the probability of a customer
checkout being completed by the cashier in less than three minutes. The checkout time can be modelled with an
Exponential Distribution.
Answer: The checkout processing rate equals one over the average checkout completion time. Hence,
the processing rate is 1/5 checkouts per minute. Using R, we can apply the exponential distribution
function ‘pexp’ with “rate = 1/5”.
pexp(3, rate=1/5)
[1] 0.4511884
The probability of completing a checkout in less than three minutes by the cashier would be roughly 45%.
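As a small extension of the same example (a minimal sketch), the complementary probability and the mean waiting time can also be checked in R:

# Probability that the checkout takes three minutes or longer
1 - pexp(3, rate = 1/5)
[1] 0.5488116

# Mean waiting time implied by the rate (1 / rate)
1 / (1/5)
[1] 5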

2.4 Normal Distribution


The Normal distribution is a continuous probability distribution for a real-valued random variable. It
is also known as the Gaussian Distribution after the mathematician Carl Friedrich Gauss. It is the



Figure 3: An example of Exponential Distributions with changes in lambda values.

famous bell-shaped curve and is one of the most popular distribution functions, commonly used in the
natural and social sciences for modeling purposes. The normal distribution is defined by the following
probability density function, where 𝜇 is the population mean and 𝜎² is the variance. The symbol exp
denotes the exponential function, whose base 𝑒 is the base of the natural logarithm.

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(\frac{-(x - \mu)^{2}}{2\sigma^{2}}\right), \qquad -\infty < x < +\infty

where the parameter (𝜇) is the “mean” of the distribution, also referred to as the location parameter, and
the parameter (𝜎) is the “standard deviation” of the distribution, also referred to as the scale parameter.
The number 𝜋 (pi) is a mathematical constant approximately equal to 3.14.
The normal distribution has a central place in statistical analysis. It can be produced by taking random
samples from any distribution and “creating a distribution from the average of these samples”. This is the
basis for the Central Limit Theorem.
If a random variable ‘X’ follows the normal distribution, then it can be written as:

𝑋 ∼ 𝑁 (𝜇, 𝜎2 )

In particular, the normal distribution with 𝜇 = 0 and 𝜎 = 1 is called the Standard Normal Distribution
and is denoted N(0,1). It is a symmetric distribution around its mean.
Important characteristics of a normal distribution are listed as:
• Normal distributions are symmetric around their mean.
• The mean, median, and mode of a normal distribution are all equal.
• The area under the normal curve is equal to 1.0 (or unity).
• Normal distributions are denser in the center and less dense in the tails.
• Normal distributions are defined by two parameters, the mean (𝜇) and the standard deviation (𝜎).
• 68% of the area of a normal distribution is within one standard deviation of the mean.
• Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
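The last two properties can be verified directly in R with pnorm (a minimal sketch using the standard normal distribution N(0,1)):

# Area within one standard deviation of the mean
pnorm(1) - pnorm(-1)
[1] 0.6826895

# Area within two standard deviations of the mean
pnorm(2) - pnorm(-2)
[1] 0.9544997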



Figure 4: Normal and Standard Normal Distributions with changes in mean and standard deviation values.

2.5 CENTRAL LIMIT THEOREM


2.5.1 Sampling Theory and Central Limit Theorem
Sampling theory is the study of the relationships existing between a population and the samples drawn from
that population. Consider all the possible samples of size 𝑛 that can be drawn from the population. For
each sample, we can compute a statistic, such as the mean or the standard deviation, that will vary from
sample to sample. In this way we obtain a distribution called the sampling distribution of the statistic.
If the statistic is the sample mean, then the distribution is called the sampling distribution of the mean.
Similarly, we can have sampling distributions of the standard deviation, variance, median, proportion, etc.
The Central Limit Theorem states that the sampling distribution of the mean of any independent random
variable will be normal or near normal, regardless of the underlying distribution. If the sample size is large
enough, we get a nice bell-shaped curve.
EX: A fair die can be modelled with a discrete random variable with outcomes 1 through 6, each with equal
probability of 1/6.
Then, the expected value is
(1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5

Suppose we throw the die 10,000 times and plot the frequency of each outcome. Here is what we could
see.
# Plot Settings
par(mar = c(4,4,0.5,0.5) + 0.1, mgp = c(3,1,0), cex = 1, las = 1)

DiceOutcome <- sample(1:6, 10000, replace = TRUE)


hist(DiceOutcome, col ="light blue", main = "", xlab = "Possible Dice Outcomes")
abline(v = 3.5, col = "red", lty = 1, lwd = 2)

Now, we will take samples of, say, size = 10 from the dice-roll outcomes above, calculate the mean of each
sample, and then plot these sample means. Repeating this for k = 10,000 samples, we get:
# Plot Settings
par(mar = c(4,4,0.5,0.5) + 0.1, mgp = c(3,1,0), cex = 1, las = 1)

x10 <- c()


k = 10000




Figure 5: Histogram of Dice Outcomes



for (i in 1:k) {
  x10[i] = mean(sample(1:6, 10, replace = TRUE))
}
hist(x10, col ="pink", main = "Sample size = 10", xlab ="Dice Roll Outcomes")
abline(v = 3.5, col = "blue", lty = 1, lwd = 2)
abline(v = mean(x10), col = "black", lty = 2, lwd = 1.5)


Figure 6: Histogram of Dice Outcomes with Sample Size = 10

Sample Size
As we know, as the sample size increases, a bell-shaped curve forms. As 𝑛 approaches ∞, we get
a normal distribution. So, let's demonstrate this by increasing the sample size to 50, 100, and 1000 in the above
example.
# Plot Settings
par(mar = c(4,4,0.75,0.5) + 0.1, mgp = c(3,1,0), cex = 1, las = 1)
layout(matrix(1:3, nrow=1))

x50 <- c()


x100 <- c()
x1000 <- c()
k = 10000
for (i in 1:k) {
  x50[i] = mean(sample(1:6, 50, replace = TRUE))
  x100[i] = mean(sample(1:6, 100, replace = TRUE))
  x1000[i] = mean(sample(1:6, 1000, replace = TRUE))



}
#par(mfrow=c(1,3))
hist(x50, col ="lightgreen", main = "n = 50", xlab ="Dice Rolls")
abline(v = mean(x50), col = "blue", lty = 3, lwd = 2.5)

hist(x100, col ="lightblue", main = "n = 100", xlab = "Dice Rolls")


abline(v = mean(x100), col = "red", lty = 2, lwd = 2)

hist(x1000, col ="coral", main = "n = 1000", xlab = "Dice Rolls")


abline(v = mean(x1000), col = "black", lty = 1, lwd = 1.75)


Figure 7: Histogram of Dice Outcomes with Sample Sizes = 50, 100, and 1000

3 Hypothesis Testing
In hypothesis testing, a variety of statistical tests are used to check whether the “null
hypothesis”, 𝐻0, is rejected or not rejected (more precisely, we fail to reject it). In principle, these statistical tests
assume that the null hypothesis represents no relationship or no difference between groups (indicated by a difference
in means equal to zero). A good “null hypothesis” is always one that can be falsified in favor of the
“alternate hypothesis”, 𝐻𝑎.
Once the “null” and the “alternative” hypotheses are stated and the test assumptions are defined, the next
step is to determine which statistical test is appropriate and to calculate the test statistic. Whether
to reject or fail to reject the “null” can be determined by comparing the test statistic with the critical
value. This comparison shows whether or not the observed test statistic is more extreme than the defined



Figure 8: Hypothesis testing is done for Null Hypothesis, mostly to reject the null.

critical value, and it can have two possible results:


• The test statistic is greater than (>) the critical value → the null hypothesis can be rejected.
• The test statistic is less than (<) the critical value → we fail to reject the null hypothesis.
The critical value is based on a pre-specified significance level “𝛼” (usually chosen to be equal to 5%) and
the type of probability distribution the test statistic follows. The critical value divides the area under
this probability distribution curve into the rejection region(s) and the non-rejection region. There are
numerous statistical tests used to test various hypotheses; some of these, such as
Student's t-test, the F-test, and the Chi-squared test, are discussed below.
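Before turning to those tests, here is a minimal sketch in R of the comparison described above, assuming a two-sided z-test at 𝛼 = 5% and a hypothetical test statistic:

alpha <- 0.05
z_calculated <- 2.3                # hypothetical observed test statistic
z_critical <- qnorm(1 - alpha/2)   # two-sided critical value, about 1.96
abs(z_calculated) > z_critical     # TRUE -> reject the null hypothesis
[1] TRUE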

3.1 Type I and Type II Errors


While performing statistical hypothesis testing we must consider two conceptual types of errors: Type I
error and Type II error. A Type I error occurs when the null hypothesis is wrongly rejected, whereas a Type
II error occurs when we wrongly fail to reject a false null hypothesis. A Confusion Matrix, shown below, can
help clearly visualize the severity of these two types of errors.

Figure 9: Types of Errors showing false positive and false negative as well as true positive and true negative.
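A small simulation sketch (the sample size, seed, and number of repetitions are assumptions) makes the Type I error rate concrete: when the null hypothesis is actually true, a test at 𝛼 = 0.05 should wrongly reject it roughly 5% of the time.

set.seed(1)
# 10,000 one-sample t-tests on data generated under a TRUE null hypothesis (mu = 0)
pvals <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
mean(pvals < 0.05)   # proportion of false rejections, close to 0.05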



4 Common Statistical Tests
4.1 T-Test
• It is a parametric test of hypothesis testing based on Student's t-distribution.
• It essentially tests the significance of the difference of the mean values when the sample size is
small (i.e., less than 30) and when the population standard deviation is not available.
• Assumptions of this test:
– Population distribution is normal.
– Samples are random and independent.
– The sample size is small.
– Population standard deviation is not known.
• Mann-Whitney ‘U’ test is a non-parametric counterpart of the T-test.

4.1.1 One-Sample T-Test


To compare a sample mean with the population mean.
t = \frac{\bar{x} - \mu}{s/\sqrt{n}}
where, 𝑥̄ is the sample mean, 𝑠 is the sample standard deviation, 𝑛 is the sample size, and 𝜇 is the population
mean.
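In R, the one-sample t-test is performed with t.test (a minimal sketch; the simulated sample and the hypothesized mean of 50 are assumptions for illustration):

set.seed(42)
x <- rnorm(20, mean = 52, sd = 5)   # hypothetical sample of size n = 20
# Test H0: population mean equals 50 (two-sided by default)
t.test(x, mu = 50)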

4.1.2 Two-Sample T-Test


To compare the means of two different samples. For the two-sample t-test, we need two variables. One
variable defines the two groups (or categories). The second variable is the measurement of interest.
Example: Body Weights of Men versus Women in a population.
While conducting a two-sample t-test when the variances of the two independent groups are assumed EQUAL,
we must keep the following in mind.
• Data values must be independent. Measurements for one observation do not affect measurements for
any other observation.
• Data in each group must be obtained via a random sample from the population.
• Data in each group are normally distributed.
• Data values are continuous.
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}
where, 𝑥1̄ and 𝑥2̄ are the sample means of two groups, 𝑠𝑝 is the ‘pooled’ sample standard deviation, 𝑛1 and
𝑛2 are the sample sizes of two groups respectively.
The ‘pooled’ variance is calculated as:
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
If 𝑛1 and 𝑛2 are EQUAL, then the pooled variance will be:

s_p^2 = \frac{s_1^2 + s_2^2}{2}
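In R, the pooled (equal-variance) two-sample t-test is obtained by setting var.equal = TRUE in t.test (a minimal sketch; the body-weight values below are hypothetical):

set.seed(7)
men   <- rnorm(25, mean = 78, sd = 8)   # hypothetical body weights (kg) of men
women <- rnorm(25, mean = 68, sd = 8)   # hypothetical body weights (kg) of women
t.test(men, women, var.equal = TRUE)    # pooled two-sample t-test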
When the variances of two independent groups are NOT EQUAL, we cannot use the pooled estimate
of standard deviation. Instead, we take the standard error for each group separately. The t-statistic will be:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
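This unequal-variance form (Welch's t-test) is the default behaviour of t.test in R when var.equal is not set (a minimal sketch with hypothetical data):

set.seed(7)
group1 <- rnorm(30, mean = 78, sd = 12)  # hypothetical group with a larger spread
group2 <- rnorm(20, mean = 68, sd = 5)
t.test(group1, group2)                   # Welch two-sample t-test (var.equal = FALSE)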



CONCLUSIONS
• If the value of the test statistic (𝑡𝑐𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑒𝑑 ) is greater (>) than the table t-value (𝑡𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 ), then Reject
the null hypothesis.

• If the value of the test statistic (𝑡𝑐𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑒𝑑 ) is less (<) than the table t-value (𝑡𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 ), then Fail to reject
(or DO NOT REJECT) the null hypothesis.
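This decision rule can be carried out directly in R (a minimal sketch; the sample and the significance level 𝛼 = 0.05 are assumptions), comparing the calculated t-statistic with the critical t-value from qt:

set.seed(3)
x <- rnorm(15, mean = 51, sd = 4)              # hypothetical sample
res <- t.test(x, mu = 50)
t_calculated <- unname(res$statistic)
t_critical   <- qt(0.975, df = length(x) - 1)  # two-sided critical value at alpha = 0.05
abs(t_calculated) > t_critical                 # TRUE -> reject H0, FALSE -> fail to reject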

End of the Lecture !!

