DS-2, Week 3 - Lectures
1 LEARNING OBJECTIVES
1.1 What are you expected to learn from this lecture?
• Learn about different distributions in data science.
• What is hypothesis testing?
• How to perform statistical tests in R?
2 Probability Distributions
The probability of an event describes how likely that event is to occur. For example, when tossing a
fair coin, Heads and Tails each occur with equal probability of 50%. Hence, the applications of
probability begin with the numbers 𝑝0, 𝑝1, 𝑝2, …, which give the probability of each possible outcome (in the
case of a coin toss, there are only two possible outcomes).
Random variables are variables whose values depend on some random phenomenon/event. There are two types
of random variables: Discrete (taking a finite or countable number of values) and Continuous (taking an
infinite number of values).
A few of the probability distributions are shown below; however, only the normal distribution is discussed
in detail.
The mean of a binomial distribution with parameters 𝑁 (the number of trials) and 𝜋 (the probability of
success on each trial) is:
𝜇 = 𝑁𝜋
where 𝜇 is the mean of the binomial distribution. While the standard deviation of a binomial distribution
is:
𝜎 = √𝑁 𝜋(1 − 𝜋)
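The two formulas above can be checked by simulation. A minimal sketch, with illustrative values N = 20 and π = 0.3 (not from the lecture):

```r
# Sketch: check the binomial mean and SD formulas by simulation.
# N (number of trials) and p (success probability) are illustrative values.
set.seed(42)
N <- 20
p <- 0.3
draws <- rbinom(100000, size = N, prob = p)  # 100,000 binomial draws

mu_theory    <- N * p                  # mu = N * pi
sigma_theory <- sqrt(N * p * (1 - p))  # sigma = sqrt(N * pi * (1 - pi))

mean(draws)  # close to mu_theory (6)
sd(draws)    # close to sigma_theory (~2.05)
```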
NOTE: The assumption of "independent and identically distributed" events for the Poisson distribution may
not always hold. For example, the failure of one bank may be linked to the failures of other banks in a
given time period.
The normal distribution produces the famous bell-shaped curve and is one of the most popular distribution
functions, commonly used in the natural and social sciences for modeling purposes. The normal distribution
is defined by the following probability density function, where 𝜇 is the population mean and 𝜎² is the
variance. The symbol exp denotes exponentiation with base e, the base of the natural logarithm.
𝑓(𝑥) = (1 / (𝜎√(2𝜋))) exp(−(𝑥 − 𝜇)² / (2𝜎²)),  −∞ < 𝑥 < +∞
where the parameter 𝜇 is the "mean" of the distribution, also referred to as the location parameter, and
the parameter 𝜎 is the "standard deviation" of the distribution, also referred to as the scale parameter.
The number 𝜋 (pi) is a mathematical constant approximately equal to 3.14159.
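The density formula above can be coded directly and compared against R's built-in `dnorm()`. A minimal sketch (the function name `norm_pdf` is ours, not from the lecture):

```r
# Sketch: implement the normal pdf from the formula and compare with dnorm().
norm_pdf <- function(x, mu, sigma) {
  (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)
all.equal(norm_pdf(x, mu = 0, sigma = 1), dnorm(x, mean = 0, sd = 1))  # TRUE
```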
The normal distribution has a central place in statistical analysis. It arises when we take random
samples from any distribution and form a new distribution from the averages of those samples. This is the
basis of the Central Limit Theorem.
If a random variable ‘X’ follows the normal distribution, then it can be written as:
𝑋 ∼ 𝑁 (𝜇, 𝜎2 )
In particular, the normal distribution with 𝜇 = 0 and 𝜎 = 1 is called the standard normal distribution,
and is denoted N(0, 1). It, too, is symmetric around its mean.
Important characteristics of a normal distribution are listed as:
• Normal distributions are symmetric around their mean.
• The mean, median, and mode of a normal distribution are all equal.
• The area under the normal curve is equal to 1.0 (or unity).
• Normal distributions are denser in the center and less dense in the tails.
• Normal distributions are defined by two parameters, the mean (𝜇) and the standard deviation (𝜎).
• 68% of the area of a normal distribution is within one standard deviation of the mean.
• Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
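The last two properties can be verified with R's standard normal CDF, `pnorm()`:

```r
# The area within 1 and 2 standard deviations of the mean, via the
# standard normal CDF:
pnorm(1) - pnorm(-1)   # ~0.6827 (68% rule)
pnorm(2) - pnorm(-2)   # ~0.9545 (95% rule)
```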
Suppose we throw a die 10,000 times and plot the frequency of each outcome. Here is what we could
see.
# Plot Settings
par(mar = c(4,4,0.5,0.5) + 0.1, mgp = c(3,1,0), cex = 1, las = 1)
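The simulation itself is not shown above; a minimal sketch, assuming a fair six-sided die:

```r
# Sketch: roll a fair six-sided die 10,000 times and plot outcome frequencies.
set.seed(1)
rolls <- sample(1:6, size = 10000, replace = TRUE)
barplot(table(rolls), xlab = "Outcome", ylab = "Frequency")
```

Each of the six outcomes should appear roughly 10,000/6 ≈ 1,667 times.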
Now we take samples of, say, size 10 from the above 10,000 dice-roll outcomes, calculate the mean of
each sample, and plot those means. Repeating this for k = 10,000 samples, we get:
# Plot Settings
par(mar = c(4,4,0.5,0.5) + 0.1, mgp = c(3,1,0), cex = 1, las = 1)
[Figure: Histogram of the 10,000 dice-roll outcomes]
[Figure: Histogram of 10,000 sample means, sample size = 10]
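The resampling step above can be sketched as follows (variable names are ours):

```r
# Sketch: draw k = 10,000 samples of size 10 from the simulated rolls,
# compute each sample's mean, and plot the distribution of those means.
set.seed(1)
rolls <- sample(1:6, size = 10000, replace = TRUE)
sample_means <- replicate(10000, mean(sample(rolls, size = 10, replace = TRUE)))
hist(sample_means, xlab = "Sample mean", ylab = "Frequency")
```

The means cluster around 3.5, the expected value of a fair die.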
As the sample size increases, a bell-shaped curve forms; as 𝑛 approaches ∞, we get a normal
distribution. So let's repeat the above example with sample sizes of 50, 100, and 1000.
# Plot Settings
par(mar = c(4,4,0.75,0.5) + 0.1, mgp = c(3,1,0), cex = 1, las = 1)
layout(matrix(1:3, nrow=1))
Figure 7: Histogram of Dice Outcomes with Sample Sizes = 50, 100, and 1000
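The narrowing of the histograms in Figure 7 reflects the Central Limit Theorem: the spread of the sample means shrinks like 𝜎/√𝑛. A minimal sketch checking this (5,000 repetitions per sample size is our illustrative choice):

```r
# Sketch: the SD of the sample means shrinks like sigma / sqrt(n).
set.seed(1)
sigma_die <- sqrt(mean((1:6 - 3.5)^2))  # SD of one fair-die roll, ~1.708
for (n in c(50, 100, 1000)) {
  means <- replicate(5000, mean(sample(1:6, size = n, replace = TRUE)))
  cat("n =", n, " sd of means =", round(sd(means), 3),
      " theory =", round(sigma_die / sqrt(n), 3), "\n")
}
```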
3 Hypothesis Testing
In hypothesis testing, a variety of statistical tests are used to decide whether the "null hypothesis",
𝐻0, is rejected or not rejected (strictly speaking, we "fail to reject" it). In principle, these tests
assume that the null hypothesis represents no relationship or no difference between groups (indicated by a
difference in means equal to zero). A good null hypothesis is one that can be falsified in favor of the
"alternative hypothesis", 𝐻𝑎.
Once the "null" and "alternative" hypotheses are stated and the test assumptions are defined, the next
step is to determine which statistical test is appropriate and to calculate the test statistic. Whether
to reject or fail to reject the null can be determined by comparing the test statistic with the critical
value. This comparison shows whether or not the observed test statistic is more extreme than the defined
critical value.
Figure 9: Types of Errors showing false positive and false negative as well as true positive and true negative.
When the variances of the two groups are assumed equal (and the sample sizes are the same), the pooled
variance estimate is:
𝑠ₚ² = (𝑠₁² + 𝑠₂²) / 2
When the variances of two independent groups are NOT EQUAL, we cannot use the pooled estimate
of standard deviation. Instead, we take the standard error for each group separately. The t-statistic will be:
𝑡 = (𝑥̄₁ − 𝑥̄₂) / √(𝑠₁²/𝑛₁ + 𝑠₂²/𝑛₂)
• If the value of the test statistic (𝑡calculated) is less than the table t-value (𝑡critical), then fail to
reject (DO NOT REJECT) the null hypothesis.
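The unequal-variances t-statistic above is what R's `t.test()` computes with `var.equal = FALSE` (Welch's test). A minimal sketch with illustrative simulated groups (the group sizes, means, and SDs are our choices, not from the lecture):

```r
# Sketch: Welch two-sample t-test (unequal variances), illustrative data.
set.seed(7)
g1 <- rnorm(30, mean = 5, sd = 1)
g2 <- rnorm(40, mean = 5.5, sd = 2)

# Manual t-statistic: (xbar1 - xbar2) / sqrt(s1^2/n1 + s2^2/n2)
t_manual <- (mean(g1) - mean(g2)) /
  sqrt(var(g1) / length(g1) + var(g2) / length(g2))

res <- t.test(g1, g2, var.equal = FALSE)  # Welch's t-test
res$statistic  # matches t_manual
```

Comparing `res$statistic` with `t_manual` confirms the formula; the p-value in `res$p.value` then drives the reject / fail-to-reject decision.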