
CS 434

Data Analytics
Lecture 4
Probability & Statistics Review
Part 2

Dr. Firas Jabloun

Sampling Methods

Introduction

A sample is a subset of people or items from a larger population that you collect and analyse to make inferences.

Statistical inference is the process of deducing properties of an underlying distribution by analysis of data.

The population is assumed to be larger than the observed data set; in other words, the observed data is assumed to be sampled from a larger population.
Population, Sampling, Sample

❖The purpose of statistical inference is to obtain information about a population from information contained in a sample.

▪ A population is the set of all the elements of interest.
▪ A parameter is a numerical characteristic of a population.
▪ The sampling method defines how we will select the individuals in the population to form the sample.
▪ A sample is a subset of the population.
▪ The sample results provide only estimates of the values of the population characteristics.
Population and Sample Parameters
• Sampling is important because most of the time it is difficult and expensive to observe all the elements of a population.
• If the sample is selected with an adequate criterion, conclusions or inferences about the population can be drawn with precision.

❖ Population parameters (population = set of N statistical units: individuals, households, businesses…):
▪ N = population size
▪ Population mean: µ = Σ Xᵢ / N
▪ Population variance: σ² = Σ (Xᵢ − µ)² / N

❖ Sample parameters:
▪ n = sample size
▪ Sample mean: X̄ = Σ Xᵢ / n
▪ Sample variance: s² = Σ (Xᵢ − X̄)² / (n − 1)
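A minimal sketch of these formulas in code, assuming NumPy is available: the N versus n − 1 divisor is NumPy's ddof parameter.

```python
import numpy as np

population = np.array([18, 20, 22, 24])   # toy population used later in this lecture
mu = population.mean()                     # µ = Σ Xᵢ / N = 21
sigma2 = population.var(ddof=0)            # σ² = Σ (Xᵢ − µ)² / N  (divide by N)

sample = np.array([18, 22, 24])            # a sample drawn from the population
x_bar = sample.mean()                      # X̄ = Σ Xᵢ / n
s2 = sample.var(ddof=1)                    # s² = Σ (Xᵢ − X̄)² / (n − 1)

print(mu, sigma2, x_bar, round(s2, 2))
```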
Sampling methods
• There are 2 types of sampling methods:
▪ Non-probabilistic sampling (empirical methods): the sample selection process doesn't require a random algorithm.
▪ Probabilistic sampling (random methods): each individual in the population has a certain known probability of belonging to a sample.
Probabilistic Sampling Methods
1. Simple random sampling (with or without replacement)
2. Systematic sampling
3. Stratified sampling (proportionate or disproportionate)
4. Cluster sampling (one-stage or two-stage)
5. Multistage sampling
Systematic Sampling

❖The process of obtaining the systematic sample is much like an arithmetic progression:
1. Starting number: the researcher selects an integer that must be less than the total number of individuals in the population. This integer will correspond to the first subject.
2. Interval: the researcher picks another integer which will serve as the constant difference between any two consecutive numbers in the progression. The integer is typically selected so that the researcher obtains the correct sample size.
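A minimal sketch of this procedure; the function name and the rule k = N // n are illustrative assumptions chosen so that the sample size comes out right.

```python
import random

def systematic_sample(population, n):
    """Take every k-th element after a random starting index."""
    N = len(population)
    k = N // n                    # step 2: the constant difference (interval)
    start = random.randrange(k)   # step 1: starting number, less than N
    return population[start::k][:n]

print(systematic_sample(list(range(100)), n=10))
```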
Stratified Sampling

❖Stratified random sampling, also called proportional or quota random sampling, involves dividing your population into homogeneous subgroups and then taking a simple random sample in each subgroup. In more formal terms:
❖Objective: divide the population into non-overlapping groups (i.e., strata) N1, N2, N3, ..., Ni, such that N1 + N2 + N3 + ... + Ni = N.
▪ Then do a simple random sample of f = n/N in each stratum.
▪ f is referred to as the sampling fraction.
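A minimal sketch of proportionate stratified sampling with pandas (the column names are illustrative): sampling the same fraction f within each stratum keeps the sample proportionate.

```python
import pandas as pd

df = pd.DataFrame({
    "stratum": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,  # N = 100
    "value": range(100),
})

f = 0.2  # sampling fraction f = n / N
# Simple random sample of the same fraction within each stratum
sample = df.groupby("stratum", group_keys=False).sample(frac=f, random_state=0)
print(sample["stratum"].value_counts())  # 12 A, 6 B, 2 C: proportions preserved
```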
Benefits of Stratified Sampling
❖There are several major reasons why you might prefer stratified sampling over simple
random sampling.
1. It assures that you will be able to represent not only the overall population, but also key
subgroups of the population, especially small minority groups.
2. It will generally have more statistical precision than simple random sampling. This will only be true if the strata or groups are homogeneous. If they are, we expect that the variability within groups is lower than the variability for the population as a whole. Stratified sampling capitalizes on that fact.
Measures of Location &
Variability

Measures of Location and Variability

Location          Variability
Mean              Range
Median            Interquartile Range
Mode              Variance
Percentiles       Standard Deviation
Quartiles         Coefficient of Variation
Variability Measures

The Range

• Largest Value – Smallest Value

The interquartile range

• The difference between the third quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.

Coefficient of variation

• Indicates how large the standard deviation is in relation to the mean.
• If the dataset is a population: CV = (σ / µ) × 100
• If the dataset is a sample: CV = (s / x̄) × 100
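A minimal sketch computing these variability measures with NumPy, using the first ten demand values from the example later in this deck:

```python
import numpy as np

data = np.array([235, 374, 309, 499, 253, 421, 361, 514, 462, 369])

data_range = data.max() - data.min()         # largest value - smallest value
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                # range of the middle 50% of the data
cv = data.std(ddof=1) / data.mean() * 100    # sample coefficient of variation, %

print(data_range, iqr, round(cv, 1))
```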

The Empirical Rule

For approximately normal data: about 68% of values fall within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD.
Sampling Distribution &
Estimation

Sampling Distribution Definition
• A sampling distribution is the distribution of all of the possible values of a sample statistic, for a given sample size, selected from a population.

• For example, suppose we sample 50 students from a university and record their GPAs.
• If we obtained many different samples of 50, we would compute a different mean for each sample.
• We are interested in the distribution of all potential mean GPAs we might calculate for any given sample of 50 students.
Sampling Distribution: Example
❖Assume there is a population …
➢ Population size N=4
➢ Random variable, X is age of individuals
➢ Values of X: 18, 20, 22, 24 (years)
Summary measures for the population distribution:

µ = Σ Xᵢ / N = (18 + 20 + 22 + 24) / 4 = 21

σ = √( Σ (Xᵢ − µ)² / N ) = 2.236
Sampling Distribution

❖ Now consider all possible samples of size n = 2 (with replacement)
❖ Nⁿ = 4² = 16 possible samples

The 16 samples (1st observation in rows, 2nd observation in columns):

        18      20      22      24
18    18,18   18,20   18,22   18,24
20    20,18   20,20   20,22   20,24
22    22,18   22,20   22,22   22,24
24    24,18   24,20   24,22   24,24

The corresponding 16 sample means:

        18    20    22    24
18      18    19    20    21
20      19    20    21    22
22      20    21    22    23
24      21    22    23    24

Sampling Distribution
❖Sampling Distribution of All Sample Means

The 16 sample means form the sample means distribution P(X̄):

X̄       18     19     20     21     22     23     24
P(X̄)   1/16   2/16   3/16   4/16   3/16   2/16   1/16

For example, P(X̄ = 21) = 4/16. Note that the distribution is not uniform.
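A minimal sketch that reproduces this enumeration:

```python
from itertools import product
from collections import Counter

ages = [18, 20, 22, 24]
# All 4**2 = 16 samples of size n = 2, with replacement
means = [(a + b) / 2 for a, b in product(ages, repeat=2)]

for m, count in sorted(Counter(means).items()):
    print(f"P(X̄ = {m}) = {count}/16")   # e.g. P(X̄ = 21.0) = 4/16
```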
Sampling Distribution
❖Summary measures of this sampling distribution:

µ_X̄ = Σ X̄ᵢ / N = (18 + 19 + 19 + … + 24) / 16 = 21

σ_X̄ = √( Σ (X̄ᵢ − µ_X̄)² / N ) = √( ((18 − 21)² + (19 − 21)² + … + (24 − 21)²) / 16 ) = 1.58
Comparing the Population Distribution to the Sample Means Distribution

Population (N = 4): µ = 21, σ = 2.236
Sample means distribution (n = 2): µ_X̄ = 21, σ_X̄ = 1.58

[Figure: the population distribution is uniform over 18, 20, 22, 24; the sample means distribution peaks at 21.]
Sample Mean Sampling Distribution:
Standard Error of the Mean
• Different samples of the same size from the same population will yield different sample means.
• A measure of the variability in the mean from sample to sample is given by the standard error of the mean:

σ_X̄ = σ / √n

• (This assumes that sampling is with replacement, or sampling is without replacement from an infinite population.)
• Note that the standard error of the mean decreases as the sample size increases, as the quick check below illustrates.
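A quick numeric check of the σ/√n formula, using σ = 2.236 from the four-age population above. For n = 2 this gives 1.58, matching σ_X̄ in the example.

```python
import math

sigma = 2.236
for n in [2, 8, 32, 128]:
    print(n, round(sigma / math.sqrt(n), 3))  # SE halves each time n quadruples
```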
Sample Mean Sampling Distribution:
If the Population is Normal
❖If a population is normal with mean µ and standard deviation σ, the sampling distribution of X̄ is also normally distributed, with

µ_X̄ = µ and σ_X̄ = σ / √n
Sampling Distribution Properties

❖A normal population distribution (centered at µ) produces a normal sampling distribution (centered at µ_X̄ = µ).
Sampling Distribution Properties
❖As n increases, σ_x̄ decreases: a larger sample size produces a narrower sampling distribution around µ than a smaller sample size.
Sample Mean Sampling Distribution:
If the Population is not Normal

❖ Population distribution: not normal (centered at µ).
❖ Sampling distribution (becomes normal as n increases): a larger sample size gives a distribution that is narrower and closer to normal than a smaller sample size.
Central limit theorem
As the sample size gets large enough, the sampling distribution of the sample mean becomes almost normal regardless of the shape of the population.
Central limit theorem

If the population follows a normal probability distribution, then for any sample size the sampling distribution of the sample mean will also be normal.

If the population distribution is symmetrical (but not normal), the normal shape of the distribution of the sample mean emerges with samples as small as 10.

If the distribution is skewed or has thick tails, it may require samples of 30 or more to observe the normality feature.

The mean of the sampling distribution equals µ, and its variance equals σ²/n.
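A minimal simulation sketch of the theorem with NumPy; an exponential population stands in as the skewed case.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in [2, 10, 30]:
    # 10,000 sample means, each from a skewed exponential population (mu=2, sigma=2)
    samples = rng.exponential(scale=2.0, size=(10_000, n))
    means = samples.mean(axis=1)
    # The spread of the means approaches sigma / sqrt(n) as n grows
    print(n, round(means.mean(), 3), round(means.std(), 3), round(2.0 / np.sqrt(n), 3))
```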
Estimation
Distinctions Between Parameters and Statistics

              Parameters          Statistics
Source        Population          Sample
Notation      Greek (e.g., µ)     Roman (e.g., x̄)
Vary          No                  Yes
Calculated    No                  Yes
Estimation

• The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.

• E.g., the sample mean (x̄) is employed to estimate the population mean (µ).

• There are two types of inference:
▪ estimation
▪ hypothesis testing
Estimating
❖ Estimator
➢ A statistic that is used to estimate a population parameter.
➢ For example, the sample mean X̄ is an estimator of µ, the mean of the population.

❖ Estimate
➢ The estimate is the particular value that the estimator takes.
Estimation

• The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.

• There are two types of estimators:
▪ Point estimator
▪ Interval estimator
Point Estimator Vs. Interval Estimator

• A point estimator draws inferences about a population by estimating the value of an unknown parameter using a single value or point.

• An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval, from a lower bound to an upper bound.

→ That is, we say (with some ___% certainty) that the population parameter of interest is between some lower and upper bounds.
Example

• For example, suppose we want to estimate the mean summer income of a class of business students. For n = 25 students:

• x̄ is calculated to be 400 $/week; this is the point estimate.

• An alternative statement is the interval estimate: the mean income is between 380 and 420 $/week.
Qualities of a good estimator

• A good estimator is one which is close to the true value of the parameter.

• A good estimator must possess the following characteristics:
1. Unbiasedness
2. Consistency
3. Efficiency
4. Sufficiency
Qualities of a good estimator:

• An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter:
E(θ̂) = θ
• An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
• If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively efficient.
• An estimator is sufficient if no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter.
Confidence Interval Estimator for 𝜇: The population mean
❖Since X̄ ~ N(µ, σ/√n):

▪ P( −z_{α/2} < (X̄ − µ)/(σ/√n) < z_{α/2} ) = 1 − α

where P(Z > z_{α/2}) = α/2 and P(Z < −z_{α/2}) = α/2

▪ → P( X̄ − z_{α/2} · σ/√n < µ < X̄ + z_{α/2} · σ/√n ) = 1 − α

▪ The probability 1 − α is called the confidence level.
Confidence Interval Estimator for 𝜇: The population mean
❖The probability 1 − α is called the confidence level.
❖The interval is usually represented with a "plus/minus" (±) sign: x̄ ± z_{α/2} · σ/√n, with endpoints the lower confidence limit (LCL) and the upper confidence limit (UCL).
Graphically

❖The actual location of the population mean µ may be here… or here… or possibly even here: different samples produce different intervals, which may or may not contain µ.
Interval Width…
A wide interval provides little information.
• For example, suppose we estimate with 95% confidence that an accountant’s average
starting salary is between $15,000 and $100,000.

• Contrast this with: a 95% confidence interval estimate of starting salaries between $42,000
and $45,000.

• The second estimate is much narrower, providing accounting students more precise
information about starting salaries.

Interval Width…

• Increasing the sample size decreases the width of the confidence interval while the confidence level can remain unchanged. Note: this also increases the cost of obtaining additional data.

• Larger values of σ produce wider confidence intervals.

• A larger confidence level produces a wider confidence interval.
Values to know

Common two-sided critical values: z_{α/2} = 1.64 for 90% confidence, 1.96 for 95%, and 2.58 for 99%.
Example

• A computer company samples demand during lead time over 25 time periods:

235 374 309 499 253


421 361 514 462 369
394 439 348 344 330
261 374 302 466 535
386 316 296 332 334

• It is known that the standard deviation of demand over lead time is 75 computers.

• We want to estimate the mean demand over lead time with 95% confidence in order
to set inventory levels…
Example

❖ “We want to estimate the mean demand over lead time with 95% confidence in
order to set inventory levels…”

❖Thus, the parameter to be estimated is the population mean 𝜇:

❖And so our confidence interval estimator will be:


Lower bound: x̄ − z_{α/2} · σ/√n
Upper bound: x̄ + z_{α/2} · σ/√n
Solution

Confidence level = 1 − α = 95% → α = 5%

x̄ = 370.16 (calculated from the data)
z_{α/2} = z_{0.025} = 1.96 (statistical table)
σ = 75 (given)
n = 25 (given)

• The lower and upper confidence limits are:
• Lower bound: 370.16 − 1.96 · 75/√25 = 340.76
• Upper bound: 370.16 + 1.96 · 75/√25 = 399.56
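A minimal sketch that reproduces this interval; scipy supplies the critical value, and the numbers match the solution above.

```python
import numpy as np
from scipy import stats

demand = np.array([235, 374, 309, 499, 253, 421, 361, 514, 462, 369,
                   394, 439, 348, 344, 330, 261, 374, 302, 466, 535,
                   386, 316, 296, 332, 334])
sigma, alpha = 75, 0.05

x_bar = demand.mean()                          # 370.16
z = stats.norm.ppf(1 - alpha / 2)              # 1.96
half_width = z * sigma / np.sqrt(len(demand))
print(x_bar - half_width, x_bar + half_width)  # ~ (340.76, 399.56)
```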
Hypothesis Testing

What is a Hypothesis?

Hypothesis testing is also called significance testing.

A hypothesis is a statement or an assumption about relationships between variables.

A hypothesis test assesses a claim about a parameter using evidence (data in a sample).
Criteria for Hypothesis Construction

It should be empirically testable, whether it is right or wrong.

It should be specific and precise.

The statements in the hypothesis should not be contradictory.

It should specify the variables between which the relationship is to be established.

It should describe one issue only.
Null and Alternative Hypotheses

Null Hypothesis (H0)


Alternative Hypothesis (Ha or H1)

Each hypothesis test pairs a null hypothesis with an alternative hypothesis; for example, H0: µ = 170 versus Ha: µ ≠ 170 (as in the body-weight cases below).
Formulate a Decision Rule to Accept Null Hypothesis

❖Accept H0 if the test statistic value falls within the area of acceptance.

Reject otherwise.
Errors in Hypothesis Testing

❖A Type I error is rejecting H0 when it is true; its probability is the significance level α.
❖A Type II error is accepting H0 when it is false; its probability is denoted β.
Case 1: Illustrative Example: “Body Weight”
• The problem: In the 1970s, 20–29 year old men in the U.S. had a mean μ body
weight of 170 pounds. Standard deviation σ was 40 pounds.

• We want to test whether the mean body weight in the population nowadays
(i.e. in 2016) differs.

• We take a sample of 64 individuals and compute the mean body weight as 173 pounds. What is your conclusion regarding the test?
“Body Weight” Hypothesis Testing

• Null hypothesis H0: μ = 170 (“no difference”)

• The alternative hypothesis can be either


Ha: μ > 170 (one-sided test) or
Ha: μ ≠ 170 (two-sided test)
Case 1: Test Statistic

This is an example of a one-sample test of a mean when σ is known. Use this statistic to test the problem:

z_stat = (x̄ − µ0) / SE_x̄

where µ0 is the population mean assuming H0 is true, and SE_x̄ = σ/√n.

Under H0: µ0 = 170 with n = 64, SE_x̄ = 40/√64 = 5, so the sampling distribution of x̄ is x̄ ~ N(170, 5).
Case 1: z statistic Method

❖For the illustrative example, µ0 = 170.
❖We know σ = 40.
❖Take a sample of n = 64. Therefore SE_x̄ = σ/√n = 40/√64 = 5.
❖The sample mean is 173, then:

z_stat = (x̄ − µ0) / SE_x̄ = (173 − 170) / 5 = 0.60

❖If we consider a 95% confidence interval, 0.6 < 1.96, so we accept H0.
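A minimal sketch of this z-test in Python; scipy is used only for an optional two-sided p-value.

```python
import math
from scipy import stats

mu0, sigma, n, x_bar = 170, 40, 64, 173

se = sigma / math.sqrt(n)          # 40 / 8 = 5
z = (x_bar - mu0) / se             # 0.60
p = 2 * stats.norm.sf(abs(z))      # two-sided p-value, about 0.55
print(z, round(p, 2), "reject H0" if abs(z) > 1.96 else "accept H0")
```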


Case 1: The Confidence Interval Method

❖Construct a confidence interval around the sample mean 173

❖Lower bound: 173-1.96*5=163.2


❖Upper bound: 173+1.96*5=182.8

❖The value 170 (i.e., H0) is a possible value of µ (as it is inside the confidence interval) and therefore we accept H0.
Case 2: Illustrative Example: “Body Weight”

• The problem: In the 1970s, 20–29 year old men in the U.S. had a mean μ body
weight of 170 pounds. Standard deviation σ was 40 pounds.

• We want to test whether the mean body weight in the population now differs.

• We take another sample of 64 individuals and compute the mean body weight as 185 pounds. What is your conclusion regarding the test?
Case 2: z statistic Method

We found a sample mean of 185, then:

z_stat = (x̄ − µ0) / SE_x̄ = (185 − 170) / 5 = 3.00

If we consider a 95% confidence interval, 3 > 1.96, so we reject H0.


Case 2: The Confidence Interval Method
• Construct a confidence interval around the sample mean 185:

Lower bound: 185 − 1.96 · 5 = 175.2
Upper bound: 185 + 1.96 · 5 = 194.8

• The value 170 (i.e., H0) is not a possible value of µ at the 95% confidence level (as it is not inside the CI) and we therefore reject H0.

p-value approach to testing

• We assume a certain hypothesis, the null hypothesis H0, in contrast to another hypothesis Ha.
• The p-value (sometimes referred to as the observed significance level) is another way to reach a statistical conclusion in hypothesis testing.
• p-value: the probability of obtaining a test statistic equal to or more extreme than the observed sample value, given that H0 is true.
Hypothesis Testing Example
❖Given the null hypothesis H0: mean = 0,
❖If H0 is true, how likely is it to get a Z of 2.75, or something farther from the mean (0), in either direction?

• p-value = P(Z ≥ 2.75) + P(Z ≤ −2.75) = 2 × 0.003 = 0.006
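A quick check of this p-value with scipy:

```python
from scipy import stats

z = 2.75
p_value = 2 * stats.norm.sf(z)  # P(Z ≥ 2.75) + P(Z ≤ −2.75), by symmetry
print(round(p_value, 3))        # 0.006
```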
p-value approach to testing
❖The p-value is also called the observed level of significance.
▪ H0 can be rejected if the p-value is less than α (with α being your level of significance).
▪ No preset value of α is needed in the p-value method.
▪ The p-value is the smallest value of α for which the null hypothesis can be rejected.
• E.g., p-value = 0.038 is the smallest value of α for which H0 can be rejected:
o α = 0.05 > 0.038 → reject H0
o α = 0.01 < 0.038 → accept H0
p-value approach to testing

▪ Compare the p-value with 𝛼 (the level of significance)


▪ If p-value < 𝛼 → reject H0
▪ If p-value ≥ 𝛼 → do not reject H0
❖Remember: If the p-value is low then H0 must go.
❖10% is usually the largest value of alpha used and most researchers use 5%.
What is a T-distribution?

• A t-distribution is like a Z distribution, except it has slightly fatter tails to reflect the uncertainty added by estimating σ.
• The bigger the sample size (i.e., the bigger the sample used to estimate σ), the closer t becomes to Z.
• If n > 100, t approaches Z.
T-distributions with df = n − 1 degrees of freedom (n = number of observations):
• df = 1 (n = 2)
• df = 4 (n = 5)
• df = 9 (n = 10)
• df = 29 (n = 30)
• df = 99 (n = 100): looks a lot like Z!
Student's t Distribution

Note: t → Z as n increases. The standard normal is t with df = ∞.
t-distributions are bell-shaped and symmetric, but have 'fatter' tails than the normal (compare df = 13 with df = 5: the smaller the df, the fatter the tails).
Student's t Table

Let n = 3, so df = n − 1 = 2. The upper-tail area p is in the columns, df in the rows; the body of the table contains t values, not probabilities.

df      .25     .10     .05
1      1.000   3.078   6.314
2      0.817   1.886   2.920
3      0.765   1.638   2.353

For df = 2 and p = 0.05, t = 2.920, i.e., P(t > 2.920) = 0.05.
t distribution values, with comparison to the Z value:

Confidence   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
.80          1.372         1.325         1.310         1.28
.90          1.812         1.725         1.697         1.64
.95          2.228         2.086         2.042         1.96
.99          3.169         2.845         2.750         2.58

Note: t → Z as n increases.
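A minimal sketch that regenerates this table with scipy:

```python
from scipy import stats

for conf in [0.80, 0.90, 0.95, 0.99]:
    p = 1 - (1 - conf) / 2   # upper-tail cutoff for a two-sided interval
    t_vals = [round(stats.t.ppf(p, df), 3) for df in (10, 20, 30)]
    print(conf, t_vals, round(stats.norm.ppf(p), 2))
```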
The t probability density function

What does t look like mathematically? (You may at least recognize some resemblance to the normal distribution function.) For ν degrees of freedom, the density is:

f(t) = Γ((ν + 1)/2) / ( √(νπ) · Γ(ν/2) ) · (1 + t²/ν)^(−(ν + 1)/2)

Where:
• ν is the degrees of freedom
• Γ (gamma) is the Gamma function
• π is the constant Pi (3.14...)

The t-distribution looks like a mess! You don't want to integrate it by hand; luckily, there are SPSS, SAS, etc. You MUST SPECIFY THE DEGREES OF FREEDOM!