
CS 434

Data Analytics
Lecture 4
Probability & Statistics Review
Part 2

Dr. Firas Jabloun

Sampling Methods

Introduction

A sample is a subset of people or items from a larger population that you collect and analyse to make inferences.

Statistical inference is the process of deducing properties of an underlying distribution by analysis of data.

The population is assumed to be larger than the observed data set; in other words, the observed data is assumed to be sampled from a larger population.
Population, Sampling, Sample

❖The purpose of statistical inference is to obtain information about a population from information contained in a sample.

▪ A population is the set of all the elements of interest.
▪ A parameter is a numerical characteristic of a population.
▪ The sampling method defines how we will select the individuals in the population to form the sample.
▪ A sample is a subset of the population.
▪ The sample results provide only estimates of the values of the population characteristics.
Population and Sample Parameters
• Sampling is important because most of the time it is difficult and expensive to observe all the elements of a population.
• If the sample is selected with an adequate criterion, conclusions or inferences about the population can be drawn with precision.

❖ Population parameters (population = set of N statistical units: individuals, households, businesses…):
▪ N = population size
▪ Population mean: µ = Σ Xᵢ / N
▪ Population variance: σ² = Σ (Xᵢ − µ)² / N

❖ Sample parameters:
▪ n = sample size
▪ Sample mean: X̄ = Σ Xᵢ / n
▪ Sample variance: s² = Σ (Xᵢ − X̄)² / (n − 1)
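A minimal sketch of these formulas in code, assuming NumPy is available: the N versus n − 1 divisor is NumPy's ddof parameter.

```python
import numpy as np

population = np.array([18, 20, 22, 24])   # toy population used later in this lecture
mu = population.mean()                     # µ = Σ Xᵢ / N = 21
sigma2 = population.var(ddof=0)            # σ² = Σ (Xᵢ − µ)² / N  (divide by N)

sample = np.array([18, 22, 24])            # a sample drawn from the population
x_bar = sample.mean()                      # X̄ = Σ Xᵢ / n
s2 = sample.var(ddof=1)                    # s² = Σ (Xᵢ − X̄)² / (n − 1)

print(mu, sigma2, x_bar, round(s2, 2))
```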
Sampling methods
• There are 2 types of sampling methods:
▪ Non-probabilistic sampling (empirical methods): the sample selection process doesn't require a random algorithm.
▪ Probabilistic sampling (random methods): each individual in the population has a certain known probability of belonging to a sample.
Probabilistic Sampling Methods
1. Simple random sampling (with or without replacement)
2. Systematic sampling
3. Stratified sampling (proportionate or disproportionate)
4. Cluster sampling (one-stage or two-stage)
5. Multistage sampling
Systematic Sampling

❖The process of obtaining the systematic sample is much like an arithmetic progression:
1. Starting number: the researcher selects an integer that must be less than the total number of individuals in the population. This integer will correspond to the first subject.
2. Interval: the researcher picks another integer which will serve as the constant difference between any two consecutive numbers in the progression. The integer is typically selected so that the researcher obtains the correct sample size.
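A minimal sketch of this procedure; the function name and the rule k = N // n are illustrative assumptions chosen so that the sample size comes out right.

```python
import random

def systematic_sample(population, n):
    """Take every k-th element after a random starting index."""
    N = len(population)
    k = N // n                    # step 2: the constant difference (interval)
    start = random.randrange(k)   # step 1: starting number, less than N
    return population[start::k][:n]

print(systematic_sample(list(range(100)), n=10))
```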
Stratified Sampling

❖Stratified random sampling, also called proportional or quota random sampling, involves dividing your population into homogeneous subgroups and then taking a simple random sample in each subgroup. In more formal terms:
❖Objective: divide the population into non-overlapping groups (i.e., strata) N1, N2, N3, ..., Ni, such that N1 + N2 + N3 + ... + Ni = N.
▪ Then do a simple random sample of f = n/N in each stratum.
▪ f is referred to as the sampling fraction.
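A minimal sketch of proportionate stratified sampling with pandas (the column names are illustrative): sampling the same fraction f within each stratum keeps the sample proportionate.

```python
import pandas as pd

df = pd.DataFrame({
    "stratum": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,  # N = 100
    "value": range(100),
})

f = 0.2  # sampling fraction f = n / N
# Simple random sample of the same fraction within each stratum
sample = df.groupby("stratum", group_keys=False).sample(frac=f, random_state=0)
print(sample["stratum"].value_counts())  # 12 A, 6 B, 2 C: proportions preserved
```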
Benefits of Stratified Sampling
❖There are several major reasons why you might prefer stratified sampling over simple
random sampling.
1. It assures that you will be able to represent not only the overall population, but also key
subgroups of the population, especially small minority groups.
2. It will generally have more statistical precision than simple random sampling. This will only be true if the strata or groups are homogeneous. If they are, we expect that the variability within groups is lower than the variability for the population as a whole. Stratified sampling capitalizes on that fact.
Measures of Location &
Variability

Measures of Location and Variability

Location          Variability
Mean              Range
Median            Interquartile Range
Mode              Variance
Percentiles       Standard Deviation
Quartiles         Coefficient of Variation
Variability Measures

The Range

• Largest Value – Smallest Value

The interquartile range

• The difference between the third quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.

Coefficient of variation

• Indicates how large the standard deviation is in relation to the mean.
• If the dataset is a population: CV = (σ / µ) × 100
• If the dataset is a sample: CV = (s / x̄) × 100
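A minimal sketch computing these variability measures with NumPy, using the first ten demand values from the example later in this deck:

```python
import numpy as np

data = np.array([235, 374, 309, 499, 253, 421, 361, 514, 462, 369])

data_range = data.max() - data.min()         # largest value - smallest value
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                # range of the middle 50% of the data
cv = data.std(ddof=1) / data.mean() * 100    # sample coefficient of variation, %

print(data_range, iqr, round(cv, 1))
```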

The Empirical Rule

For approximately normal data: about 68% of values fall within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD.
Sampling Distribution &
Estimation

Sampling Distribution Definition
• A sampling distribution is the distribution of all of the possible values of a sample statistic, for a given sample size, selected from a population.

• For example, suppose we sample 50 students from a university and record their GPAs.
• If we obtained many different samples of 50, we would compute a different mean for each sample.
• We are interested in the distribution of all potential mean GPAs we might calculate for any given sample of 50 students.
Sampling Distribution: Example
❖Assume there is a population …
➢ Population size N=4
➢ Random variable, X is age of individuals
➢ Values of X: 18, 20, 22, 24 (years)
Summary measures for the population distribution:

µ = Σ Xᵢ / N = (18 + 20 + 22 + 24) / 4 = 21

σ = √( Σ (Xᵢ − µ)² / N ) = 2.236
Sampling Distribution

❖ Now consider all possible samples of size n = 2 (with replacement)
❖ Nⁿ = 4² = 16 possible samples

The 16 samples (1st observation in rows, 2nd observation in columns):

        18      20      22      24
18    18,18   18,20   18,22   18,24
20    20,18   20,20   20,22   20,24
22    22,18   22,20   22,22   22,24
24    24,18   24,20   24,22   24,24

The corresponding 16 sample means:

        18    20    22    24
18      18    19    20    21
20      19    20    21    22
22      20    21    22    23
24      21    22    23    24

Sampling Distribution
❖Sampling Distribution of All Sample Means

The 16 sample means form the sample means distribution P(X̄):

X̄       18     19     20     21     22     23     24
P(X̄)   1/16   2/16   3/16   4/16   3/16   2/16   1/16

For example, P(X̄ = 21) = 4/16. Note that the distribution is not uniform.
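A minimal sketch that reproduces this enumeration:

```python
from itertools import product
from collections import Counter

ages = [18, 20, 22, 24]
# All 4**2 = 16 samples of size n = 2, with replacement
means = [(a + b) / 2 for a, b in product(ages, repeat=2)]

for m, count in sorted(Counter(means).items()):
    print(f"P(X̄ = {m}) = {count}/16")   # e.g. P(X̄ = 21.0) = 4/16
```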
Sampling Distribution
❖Summary measures of this sampling distribution:

µ_X̄ = Σ X̄ᵢ / N = (18 + 19 + 19 + … + 24) / 16 = 21

σ_X̄ = √( Σ (X̄ᵢ − µ_X̄)² / N ) = √( ((18 − 21)² + (19 − 21)² + … + (24 − 21)²) / 16 ) = 1.58
Comparing the Population Distribution to the Sample Means Distribution

Population (N = 4): µ = 21, σ = 2.236
Sample means distribution (n = 2): µ_X̄ = 21, σ_X̄ = 1.58

[Figure: the population distribution is uniform over 18, 20, 22, 24; the sample means distribution peaks at 21.]
Sample Mean Sampling Distribution:
Standard Error of the Mean
• Different samples of the same size from the same population will yield different sample means.
• A measure of the variability in the mean from sample to sample is given by the standard error of the mean:

σ_X̄ = σ / √n

• (This assumes that sampling is with replacement, or sampling is without replacement from an infinite population.)
• Note that the standard error of the mean decreases as the sample size increases, as the quick check below illustrates.
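A quick numeric check of the σ/√n formula, using σ = 2.236 from the four-age population above. For n = 2 this gives 1.58, matching σ_X̄ in the example.

```python
import math

sigma = 2.236
for n in [2, 8, 32, 128]:
    print(n, round(sigma / math.sqrt(n), 3))  # SE halves each time n quadruples
```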
Sample Mean Sampling Distribution:
If the Population is Normal
❖If a population is normal with mean µ and standard deviation σ, the sampling distribution of X̄ is also normally distributed, with

µ_X̄ = µ and σ_X̄ = σ / √n
Sampling Distribution Properties

❖A normal population distribution (centered at µ) produces a normal sampling distribution (centered at µ_X̄ = µ).
Sampling Distribution Properties
❖As n increases, σ_x̄ decreases: a larger sample size produces a narrower sampling distribution around µ than a smaller sample size.
Sample Mean Sampling Distribution:
If the Population is not Normal

❖ Population distribution: not normal (centered at µ).
❖ Sampling distribution (becomes normal as n increases): a larger sample size gives a distribution that is narrower and closer to normal than a smaller sample size.
Central limit theorem
As the sample size gets large enough, the sampling distribution of the sample mean becomes almost normal regardless of the shape of the population.
Central limit theorem

If the population follows a normal probability distribution, then for any sample size the sampling distribution of the sample mean will also be normal.

If the population distribution is symmetrical (but not normal), the normal shape of the distribution of the sample mean emerges with samples as small as 10.

If the distribution is skewed or has thick tails, it may require samples of 30 or more to observe the normality feature.

The mean of the sampling distribution equals µ, and its variance equals σ²/n.
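A minimal simulation sketch of the theorem with NumPy; an exponential population stands in as the skewed case.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in [2, 10, 30]:
    # 10,000 sample means, each from a skewed exponential population (mu=2, sigma=2)
    samples = rng.exponential(scale=2.0, size=(10_000, n))
    means = samples.mean(axis=1)
    # The spread of the means approaches sigma / sqrt(n) as n grows
    print(n, round(means.mean(), 3), round(means.std(), 3), round(2.0 / np.sqrt(n), 3))
```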
Estimation
Distinctions Between Parameters and Statistics

              Parameters          Statistics
Source        Population          Sample
Notation      Greek (e.g., µ)     Roman (e.g., x̄)
Vary          No                  Yes
Calculated    No                  Yes
Estimation

• The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.

• E.g., the sample mean (x̄) is employed to estimate the population mean (µ).

• There are two types of inference:
▪ estimation
▪ hypothesis testing
Estimating
❖ Estimator
➢ A statistic that is used to estimate a population parameter.
➢ For example, the sample mean X̄ is an estimator of µ, the mean of the population.

❖ Estimate
➢ The estimate is the particular value that the estimator takes.
Estimation

• The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.

• There are two types of estimators:
▪ Point estimator
▪ Interval estimator
Point Estimator Vs. Interval Estimator

• A point estimator draws inferences about a population by estimating the value of an unknown parameter using a single value or point.

• An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval, from a lower bound to an upper bound.

→ That is, we say (with some ___% certainty) that the population parameter of interest is between some lower and upper bounds.
Example

• For example, suppose we want to estimate the mean summer income of a class of business students. For n = 25 students:

• x̄ is calculated to be 400 $/week; this is the point estimate.

• An alternative statement is the interval estimate: the mean income is between 380 and 420 $/week.
Qualities of a good estimator

• A good estimator is one which is close to the true value of the parameter.

• A good estimator must possess the following characteristics:
1. Unbiasedness
2. Consistency
3. Efficiency
4. Sufficiency
Qualities of a good estimator:

• An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter:
E(θ̂) = θ
• An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
• If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively efficient.
• An estimator is sufficient if no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter.
Confidence Interval Estimator for 𝜇: The population mean
❖Since X̄ ~ N(µ, σ/√n):

▪ P( −z_{α/2} < (X̄ − µ)/(σ/√n) < z_{α/2} ) = 1 − α

where P(Z > z_{α/2}) = α/2 and P(Z < −z_{α/2}) = α/2

▪ → P( X̄ − z_{α/2} · σ/√n < µ < X̄ + z_{α/2} · σ/√n ) = 1 − α

▪ The probability 1 − α is called the confidence level.
Confidence Interval Estimator for 𝜇: The population mean
❖The probability 1 − α is called the confidence level.
❖The interval is usually represented with a "plus/minus" (±) sign: x̄ ± z_{α/2} · σ/√n, with endpoints the lower confidence limit (LCL) and the upper confidence limit (UCL).
Graphically

❖The actual location of the population mean µ may be here… or here… or possibly even here: different samples produce different intervals, which may or may not contain µ.
Interval Width…
A wide interval provides little information.
• For example, suppose we estimate with 95% confidence that an accountant’s average
starting salary is between $15,000 and $100,000.

• Contrast this with: a 95% confidence interval estimate of starting salaries between $42,000
and $45,000.

• The second estimate is much narrower, providing accounting students more precise
information about starting salaries.

Interval Width…

• Increasing the sample size decreases the width of the confidence interval while the confidence level can remain unchanged. Note: this also increases the cost of obtaining additional data.

• Larger values of σ produce wider confidence intervals.

• A larger confidence level produces a wider confidence interval.
Values to know

Common two-sided critical values: z_{α/2} = 1.64 for 90% confidence, 1.96 for 95%, and 2.58 for 99%.
Example

• A computer company samples demand during lead time over 25 time periods:

235 374 309 499 253


421 361 514 462 369
394 439 348 344 330
261 374 302 466 535
386 316 296 332 334

• It is known that the standard deviation of demand over lead time is 75 computers.

• We want to estimate the mean demand over lead time with 95% confidence in order
to set inventory levels…
Example

❖ “We want to estimate the mean demand over lead time with 95% confidence in
order to set inventory levels…”

❖Thus, the parameter to be estimated is the population mean 𝜇:

❖And so our confidence interval estimator will be:


Lower bound: x̄ − z_{α/2} · σ/√n
Upper bound: x̄ + z_{α/2} · σ/√n
Solution

Confidence level = 1 − α = 95% → α = 5%

x̄ = 370.16 (calculated from the data)
z_{α/2} = z_{0.025} = 1.96 (statistical table)
σ = 75 (given)
n = 25 (given)

• The lower and upper confidence limits are:
• Lower bound: 370.16 − 1.96 · 75/√25 = 340.76
• Upper bound: 370.16 + 1.96 · 75/√25 = 399.56
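A minimal sketch that reproduces this interval; scipy supplies the critical value, and the numbers match the solution above.

```python
import numpy as np
from scipy import stats

demand = np.array([235, 374, 309, 499, 253, 421, 361, 514, 462, 369,
                   394, 439, 348, 344, 330, 261, 374, 302, 466, 535,
                   386, 316, 296, 332, 334])
sigma, alpha = 75, 0.05

x_bar = demand.mean()                          # 370.16
z = stats.norm.ppf(1 - alpha / 2)              # 1.96
half_width = z * sigma / np.sqrt(len(demand))
print(x_bar - half_width, x_bar + half_width)  # ~ (340.76, 399.56)
```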
Hypothesis Testing

What is a Hypothesis?

Hypothesis testing is also called significance testing.

A hypothesis is a statement or an assumption about relationships between variables.

A hypothesis test assesses a claim about a parameter using evidence (data in a sample).
Criteria for Hypothesis Construction

It should be empirically testable, whether it is right or wrong.

It should be specific and precise.

The statements in the hypothesis should not be contradictory.

It should specify the variables between which the relationship is to be established.

It should describe one issue only.
Null and Alternative Hypotheses

Null Hypothesis (H0)


Alternative Hypothesis (Ha or H1)

Each hypothesis test pairs a null hypothesis with an alternative hypothesis; for example, H0: µ = 170 versus Ha: µ ≠ 170 (as in the body-weight cases below).
Formulate a Decision Rule to Accept Null Hypothesis

❖Accept H0 if the test statistic value falls within the area of acceptance.

Reject otherwise.
Errors in Hypothesis Testing

❖A Type I error is rejecting H0 when it is true; its probability is the significance level α.
❖A Type II error is accepting H0 when it is false; its probability is denoted β.
Case 1: Illustrative Example: “Body Weight”
• The problem: In the 1970s, 20–29 year old men in the U.S. had a mean μ body
weight of 170 pounds. Standard deviation σ was 40 pounds.

• We want to test whether the mean body weight in the population nowadays
(i.e. in 2016) differs.

• We take a sample of 64 individuals and compute the mean body weight as 173 pounds. What is your conclusion regarding the test?
“Body Weight” Hypothesis Testing

• Null hypothesis H0: μ = 170 (“no difference”)

• The alternative hypothesis can be either


Ha: μ > 170 (one-sided test) or
Ha: μ ≠ 170 (two-sided test)
Case 1: Test Statistic

This is an example of a one-sample test of a mean when σ is known. Use this statistic to test the problem:

z_stat = (x̄ − µ0) / SE_x̄

where µ0 is the population mean assuming H0 is true, and SE_x̄ = σ/√n.

Under H0: µ0 = 170 with n = 64, SE_x̄ = 40/√64 = 5, so the sampling distribution of x̄ is x̄ ~ N(170, 5).
Case 1: z statistic Method

❖For the illustrative example, µ0 = 170.
❖We know σ = 40.
❖Take a sample of n = 64. Therefore SE_x̄ = σ/√n = 40/√64 = 5.
❖The sample mean is 173, then:

z_stat = (x̄ − µ0) / SE_x̄ = (173 − 170) / 5 = 0.60

❖If we consider a 95% confidence interval, 0.6 < 1.96, so we accept H0.
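A minimal sketch of this z-test in Python; scipy is used only for an optional two-sided p-value.

```python
import math
from scipy import stats

mu0, sigma, n, x_bar = 170, 40, 64, 173

se = sigma / math.sqrt(n)          # 40 / 8 = 5
z = (x_bar - mu0) / se             # 0.60
p = 2 * stats.norm.sf(abs(z))      # two-sided p-value, about 0.55
print(z, round(p, 2), "reject H0" if abs(z) > 1.96 else "accept H0")
```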


Case 1: The Confidence Interval Method

❖Construct a confidence interval around the sample mean 173

❖Lower bound: 173-1.96*5=163.2


❖Upper bound: 173+1.96*5=182.8

❖The value 170 (i.e., H0) is a possible value of µ (as it is inside the confidence interval) and therefore we accept H0.
Case 2: Illustrative Example: “Body Weight”

• The problem: In the 1970s, 20–29 year old men in the U.S. had a mean μ body
weight of 170 pounds. Standard deviation σ was 40 pounds.

• We want to test whether the mean body weight in the population now differs.

• We take another sample of 64 individuals and compute the mean body weight as 185 pounds. What is your conclusion regarding the test?
Case 2: z statistic Method

We found a sample mean of 185, then:

z_stat = (x̄ − µ0) / SE_x̄ = (185 − 170) / 5 = 3.00

If we consider a 95% confidence interval, 3 > 1.96, so we reject H0.


Case 2: The Confidence Interval Method
• Construct a confidence interval around the sample mean 185:

Lower bound: 185 − 1.96 · 5 = 175.2
Upper bound: 185 + 1.96 · 5 = 194.8

• The value 170 (i.e., H0) is not a possible value of µ at the 95% confidence level (as it is not inside the CI) and we therefore reject H0.

p-value approach to testing

• We assume a certain hypothesis, the null hypothesis H0, in contrast to another hypothesis Ha.
• The p-value (sometimes referred to as the observed significance level) is another way to reach a statistical conclusion in hypothesis testing.
• p-value: the probability of obtaining a test statistic equal to or more extreme than the observed sample value, given that H0 is true.
Hypothesis Testing Example
❖Given the null hypothesis H0: mean = 0,
❖If H0 is true, how likely is it to get a Z of 2.75, or something farther from the mean (0), in either direction?

• p-value = P(Z ≥ 2.75) + P(Z ≤ −2.75) = 2 × 0.003 = 0.006
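A quick check of this p-value with scipy:

```python
from scipy import stats

z = 2.75
p_value = 2 * stats.norm.sf(z)  # P(Z ≥ 2.75) + P(Z ≤ −2.75), by symmetry
print(round(p_value, 3))        # 0.006
```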
p-value approach to testing
❖The p-value is also called the observed level of significance.
▪ H0 can be rejected if the p-value is less than α (with α being your level of significance).
▪ No preset value of α is needed in the p-value method.
▪ The p-value is the smallest value of α for which the null hypothesis can be rejected.
• E.g., p-value = 0.038 is the smallest value of α for which H0 can be rejected:
o α = 0.05 > 0.038 → reject H0
o α = 0.01 < 0.038 → accept H0
p-value approach to testing

▪ Compare the p-value with 𝛼 (the level of significance)


▪ If p-value < 𝛼 → reject H0
▪ If p-value ≥ 𝛼 → do not reject H0
❖Remember: If the p-value is low then H0 must go.
❖10% is usually the largest value of alpha used and most researchers use 5%.
What is a T-distribution?

• A t-distribution is like a Z distribution, except it has slightly fatter tails to reflect the uncertainty added by estimating σ.
• The bigger the sample size (i.e., the bigger the sample used to estimate σ), the closer t becomes to Z.
• If n > 100, t approaches Z.
T-distributions with df = n − 1 degrees of freedom (n = number of observations):
• df = 1 (n = 2)
• df = 4 (n = 5)
• df = 9 (n = 10)
• df = 29 (n = 30)
• df = 99 (n = 100): looks a lot like Z!
Student's t Distribution

Note: t → Z as n increases. The standard normal is t with df = ∞.
t-distributions are bell-shaped and symmetric, but have 'fatter' tails than the normal (compare df = 13 with df = 5: the smaller the df, the fatter the tails).
Student's t Table

Let n = 3, so df = n − 1 = 2. The upper-tail area p is in the columns, df in the rows; the body of the table contains t values, not probabilities.

df      .25     .10     .05
1      1.000   3.078   6.314
2      0.817   1.886   2.920
3      0.765   1.638   2.353

For df = 2 and p = 0.05, t = 2.920, i.e., P(t > 2.920) = 0.05.
t distribution values, with comparison to the Z value:

Confidence   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
.80          1.372         1.325         1.310         1.28
.90          1.812         1.725         1.697         1.64
.95          2.228         2.086         2.042         1.96
.99          3.169         2.845         2.750         2.58

Note: t → Z as n increases.
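A minimal sketch that regenerates this table with scipy:

```python
from scipy import stats

for conf in [0.80, 0.90, 0.95, 0.99]:
    p = 1 - (1 - conf) / 2   # upper-tail cutoff for a two-sided interval
    t_vals = [round(stats.t.ppf(p, df), 3) for df in (10, 20, 30)]
    print(conf, t_vals, round(stats.norm.ppf(p), 2))
```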
The t probability density function

What does t look like mathematically? (You may at least recognize some resemblance to the normal distribution function.) For ν degrees of freedom, the density is:

f(t) = Γ((ν + 1)/2) / ( √(νπ) · Γ(ν/2) ) · (1 + t²/ν)^(−(ν + 1)/2)

Where:
• ν is the degrees of freedom
• Γ (gamma) is the Gamma function
• π is the constant Pi (3.14...)

The t-distribution looks like a mess! You don't want to integrate it by hand; luckily, there are SPSS, SAS, etc. You MUST SPECIFY THE DEGREES OF FREEDOM!