Analysis of Statistical Data

Central Tendency (Center) and Dispersion (Variability)
• Central tendency: measures of the degree to which scores are clustered around the mean of a distribution
• Dispersion: measures the fluctuations (variability) around the characteristics of central tendency
Measures of Center
• A measure along the horizontal axis of
the data distribution that locates the
center of the distribution.
Arithmetic Mean or Average
• The mean of a set of measurements is
the sum of the measurements divided
by the total number of measurements.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $n$ = number of measurements and $\sum x_i$ = sum of all the measurements
Example
• The set: 2, 9, 11, 5, 6
$$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$$

If we were able to enumerate the whole population, the population mean would be called µ (the Greek letter “mu”).
Example:
• Resistance of 5 coils: 3.35, 3.37, 3.28, 3.34, 3.30 ohm.
• The average:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3.35 + 3.37 + 3.28 + 3.34 + 3.30}{5} = 3.33$$
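The coil example can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the slides.

```python
from statistics import mean

coils = [3.35, 3.37, 3.28, 3.34, 3.30]  # resistances in ohm, from the example

# Definition: sum of the measurements divided by their count.
x_bar = sum(coils) / len(coils)

print(round(x_bar, 2))        # 3.33
print(round(mean(coils), 2))  # same value via the standard library
```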
Weighted Mean
• The weighted mean of the positive real numbers $x_1, x_2, \ldots, x_n$ with weights $w_1, w_2, \ldots, w_n$ is defined to be
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
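A short sketch of the weighted mean follows. The function name `weighted_mean` and the scores and weights are hypothetical, chosen only to illustrate the formula.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: three scores with weights 1, 2 and 3.
scores = [70, 80, 90]
weights = [1, 2, 3]
print(round(weighted_mean(scores, weights), 2))  # 83.33
```

With equal weights the result reduces to the arithmetic mean.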
Geometric Mean
• The geometric mean is defined as the positive $n$-th root of the product of the $n$ observations. Symbolically,
$$GM = (x_1 x_2 x_3 \cdots x_n)^{1/n}$$
• It is often used for a set of numbers whose values are exponential in nature, such as data on the growth of the human population or interest rates of a financial investment.
• Find the geometric mean of the rates of growth: 34, 27, 45, 55, 22, 34
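One way to work the growth-rate exercise is sketched below. It computes the geometric mean directly from the definition, through logarithms, and with `statistics.geometric_mean` (Python 3.8 or later); all three should agree up to rounding.

```python
import math
from statistics import geometric_mean

growth = [34, 27, 45, 55, 22, 34]  # rates of growth from the exercise

# Direct form of the definition: n-th root of the product.
gm_direct = math.prod(growth) ** (1 / len(growth))

# Equivalent log form, which avoids very large products for long lists.
gm_logs = math.exp(sum(math.log(x) for x in growth) / len(growth))

print(gm_direct, gm_logs, geometric_mean(growth))
```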
Harmonic Mean
• The harmonic mean is the number of values divided by the sum of the reciprocals of the values.
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
• Useful for ratios such as speed (= distance/time).
• Exercise: Find the harmonic mean of 1, 2, and 4


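The harmonic mean exercise can be checked as follows. This sketch uses `Fraction` to keep the arithmetic exact; that is a choice made here, not something the slides prescribe.

```python
from fractions import Fraction
from statistics import harmonic_mean

values = [1, 2, 4]

# n divided by the sum of reciprocals, kept exact with Fraction.
hm_exact = Fraction(len(values)) / sum(Fraction(1, x) for x in values)

print(hm_exact)                         # 12/7
print(round(harmonic_mean(values), 3))  # 1.714
```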
Median
• The median of a set of measurements
is the middle measurement when the
measurements are ranked from
smallest to largest.
• The position of the median is 0.5(n + 1) once the measurements have been ordered.
Example
• The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
• Sort: 2, 3, 4, 5, 6, 8, 9
• Position: 0.5(n + 1) = 0.5(7 + 1) = 4th
• Median = 4th ordered measurement = 5

• The set: 2, 4, 9, 8, 6, 5 (n = 6)
• Sort: 2, 4, 5, 6, 8, 9
• Position: 0.5(n + 1) = 0.5(6 + 1) = 3.5th
• Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th ordered measurements
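The position rule 0.5(n + 1) can be sketched in code as below. The helper name `median_by_position` is illustrative; `statistics.median` from the standard library gives the same results.

```python
def median_by_position(data):
    """Median at position 0.5 * (n + 1) of the sorted data (1-based)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the position is an integer, take that measurement.
        return ordered[mid]
    # Even n: the position ends in .5, average the two neighbours.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([2, 4, 9, 8, 6, 5, 3]))  # 5
print(median_by_position([2, 4, 9, 8, 6, 5]))     # 5.5
```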
Mode
• The mode is the measurement which occurs
most frequently.
• The set: 2, 4, 9, 8, 8, 5, 3
• The mode is 8, which occurs twice
• The set: 2, 2, 9, 8, 8, 5, 3
• There are two modes—8 and 2 (bimodal)
• The set: 2, 4, 9, 8, 5, 3
• There is no mode (each value is unique).
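A small sketch of the mode with the three sets above. It follows the convention used here that a set of unique values has no mode; note that `statistics.multimode` follows a different convention and returns every value in that case. The function name `modes` is illustrative.

```python
from collections import Counter


def modes(data):
    """Return all most-frequent values, or [] when every value is unique."""
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # each value occurs once: no mode
    return sorted(value for value, count in counts.items() if count == top)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```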
Example
The number of quarts of milk purchased by 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3
3 3 3 4 4 4 5
• Mean? $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$
• Median? m = 2
• Mode? (Highest peak) mode = 2
[Figure: relative frequency histogram of quarts purchased (0 to 5); vertical axis: relative frequency, from 0 to 10/25]
Extreme Values
• The mean is more easily affected by extremely large or small values than the median.
• The median is often used as a measure of center when the distribution is skewed.
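The effect of an extreme value can be illustrated with a quick sketch. The second list is hypothetical: it is the earlier seven-value set with its largest value replaced by 90.

```python
from statistics import mean, median

base = [2, 3, 4, 5, 6, 8, 9]
with_outlier = [2, 3, 4, 5, 6, 8, 90]  # hypothetical extreme value

print(round(mean(base), 2), median(base))                  # 5.29 5
print(round(mean(with_outlier), 2), median(with_outlier))  # 16.86 5
```

The mean moves well above most of the data, while the median is unchanged.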
Extreme Values

Symmetric: Mean = Median

Skewed right: Mean > Median

Skewed left: Mean < Median


Measures of Variability
• A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.
• Range: difference between maximum and minimum values
• Interquartile range: difference between third and first quartile (Q3 - Q1)
• Variance: average of the squared deviations from the mean
• Standard deviation: square root of the variance

[Figure: panels labeled "Variability", "Variability", and "No Variability"]
The Range
• The range, R, of a set of n measurements is the
difference between the largest and smallest
measurements.
• Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
• The range is R = 14 – 5 = 9.
Quartiles
• The quartiles Q1, Q2, and Q3 divide the ordered data into four parts, each containing 25% of the measurements.
Percentile
• 50th percentile ≡ median (Q2)
• 25th percentile ≡ lower quartile (Q1)
• 75th percentile ≡ upper quartile (Q3)
Interquartile range: IQR = Q3 – Q1
• The position of the p-th percentile is (p/100)(n + 1)
• The position of Q1 is 0.25(n + 1)
• The position of Q3 is 0.75(n + 1)
once the measurements have been ordered. If the positions are not integers, find the quartiles by interpolation.
Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95

Position of Q1 = 0.25(18 + 1) = 4.75
Position of Q3 = 0.75(18 + 1) = 14.25

Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or Q1 = 65 + 0.75(65 - 65) = 65.
Q3 is 1/4 of the way between the 14th and 15th ordered measurements, or
Q3 = 74 + 0.25(75 - 74) = 74.25
and
IQR = Q3 – Q1 = 74.25 - 65 = 9.25
90th percentile P90
• For the prices ($) of the 18 brands of walking shoes, the position of the 90th percentile is 0.9(18 + 1) = 17.1
• P90 is 1/10 of the way between the 17th and 18th ordered measurements, or P90 = 90 + 0.10(95 - 90) = 90.5
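The (n + 1) position rule with interpolation can be sketched as below. The function name `percentile` and the clamping at the ends of the data are assumptions of this sketch. For quartiles, `statistics.quantiles` with its default `method="exclusive"` uses the same positions.

```python
from statistics import quantiles


def percentile(data, p):
    """p-th percentile (0 < p < 100) at position (p/100) * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    pos = p / 100 * (n + 1)  # 1-based position, possibly fractional
    lower = int(pos)         # whole part of the position
    frac = pos - lower       # fractional part used for interpolation
    if lower < 1:            # assumption: clamp below the first value
        return ordered[0]
    if lower >= n:           # assumption: clamp above the last value
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


prices = [40, 60, 65, 65, 65, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]

q1 = percentile(prices, 25)
q3 = percentile(prices, 75)
print(q1, q3, q3 - q1)                   # 65.0 74.25 9.25
print(round(percentile(prices, 90), 2))  # 90.5
print(quantiles(prices, n=4))            # [65.0, 70.0, 74.25]
```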


The Variance
• The variance is a measure of variability that uses all the measurements. It measures the average squared deviation of the measurements about their mean.
• Flower petals: 5, 12, 6, 8, 14
$$\bar{x} = \frac{45}{5} = 9$$
[Figure: dot plot of the petal counts on an axis from 4 to 14]
The Variance
• The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean µ.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
• The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n – 1).
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
The Standard Deviation
• In calculating the variance, we squared all of
the deviations, and in doing so changed the
scale of the measurements.
• To return this measure of variability to the
original units of measure, we calculate the
standard deviation, the positive square root of
the variance.

Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Two Ways to Calculate the Sample Variance
Use the definition formula:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15, \qquad s = \sqrt{s^2} = \sqrt{15} = 3.87$$

x_i    x_i - x̄    (x_i - x̄)^2
5      -4         16
12     3          9
6      -3         9
8      -1         1
14     5          25
Sum 45 0          60
Two Ways to Calculate the Sample Variance
Use the calculation formula:
$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = 15, \qquad s = \sqrt{s^2} = \sqrt{15} = 3.87$$

x_i    x_i^2
5      25
12     144
6      36
8      64
14     196
Sum 45 465
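Both routes to the sample variance can be compared in a short sketch using the flower petal data. The variable names are illustrative.

```python
import math
from statistics import variance

petals = [5, 12, 6, 8, 14]
n = len(petals)
x_bar = sum(petals) / n

# Definition formula: squared deviations about the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in petals) / (n - 1)

# Calculation formula: uses only the sum and the sum of squares.
s2_shortcut = (sum(x * x for x in petals) - sum(petals) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
print(s2_definition, s2_shortcut)  # 15.0 15.0
print(round(s, 2))                 # 3.87
```

The two formulas are algebraically identical. With floating-point data whose mean is large relative to its spread, the calculation formula can lose precision, so the definition form is usually the safer one in code.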
Example: ungrouped data
• Sample: the moisture contents (%) of kraft paper are 6.7, 6.0, 6.4, 6.4, 5.9, and 5.8.
$$s = \sqrt{\frac{231.26 - \frac{(37.2)^2}{6}}{6 - 1}} = 0.35$$
• Sample standard deviation, s = 0.35
Using Measures of Center and Spread:
The Empirical Rule
Given a distribution of measurements
that is approximately mound-shaped:
The interval µ ± σ contains approximately 68% of the
measurements.
The interval µ ± 2σ contains approximately 95% of
the measurements.
The interval µ ± 3σ contains approximately 99.7% of
the measurements.
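The three percentages can be checked against an exact normal curve. This is only a sketch: treating "mound-shaped" as exactly normal is an assumption made here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k)
            for k in (1, 2, 3)}

for k, share in coverage.items():
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```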
The Empirical Rule: An Example
Measures of Relative Standing
• Where does one particular measurement stand in
relation to the other measurements in the data
set?
• How many standard deviations away from the
mean does the measurement lie? This is measured
by the z-score.

Suppose s = 2, $\bar{x} = 5$, and x = 9. Then
$$z\text{-score} = \frac{x - \bar{x}}{s} = \frac{9 - 5}{2} = 2$$
so x = 9 lies z = 2 standard deviations from the mean.
z-Scores
• z-scores between –2 and 2 are not unusual. z-scores
should not be more than 3 in absolute value. z-scores
larger than 3 in absolute value would indicate a
possible outlier.

[Figure: z-score axis from -3 to 3. Not unusual: between -2 and 2; somewhat unusual: between 2 and 3 in absolute value; outlier: beyond 3 in absolute value.]
Example of z-Scores
X     z-Score     X     z-Score
10    -1.28244    10    -0.29204
15    0.625954    500   3.473714
10    -1.28244    10    -0.29204
16    1.007634    16    -0.24593
11    -0.90076    11    -0.28435
17    1.389313    17    -0.23824
14    0.244275    14    -0.2613
13    -0.1374     13    -0.26898
10    -1.28244    10    -0.29204
16    1.007634    16    -0.24593
11    -0.90076    11    -0.28435
17    1.389313    17    -0.23824
14    0.244275    14    -0.2613
13    -0.1374     13    -0.26898
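The two columns of the table can be reproduced with the sketch below, which uses the sample standard deviation. The first column of tabulated z-scores appears to have been computed from a mean and standard deviation rounded to two decimals (13.36 and 2.62), so values computed at full precision can differ slightly in the third decimal place.

```python
from statistics import mean, stdev


def z_scores(data):
    """z = (x - mean) / s for every measurement, with s the sample std dev."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]


first = [10, 15, 10, 16, 11, 17, 14, 13, 10, 16, 11, 17, 14, 13]
second = [10, 500, 10, 16, 11, 17, 14, 13, 10, 16, 11, 17, 14, 13]

print([round(z, 2) for z in z_scores(first)])
print(round(z_scores(second)[1], 2))  # z-score of 500: 3.47
```

In the second set the single value 500 has a z-score above 3, while every other value sits within a fraction of a standard deviation of the mean.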
Coefficient of Variation (CV)
• When comparing data sets with different units or widely different means, one should use the coefficient of variation for comparison instead of the standard deviation.
• The coefficient of variation can be written as
$$CV = \frac{s}{\bar{x}}$$
• We express CV as a percentage by multiplying by 100.
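As a sketch, the CV can be computed for two data sets used earlier, the flower petal counts and the kraft paper moisture contents. The function name `cv_percent` is illustrative.

```python
from statistics import mean, stdev


def cv_percent(data):
    """Coefficient of variation, s / mean, expressed as a percentage."""
    return stdev(data) / mean(data) * 100


petals = [5, 12, 6, 8, 14]
moisture = [6.7, 6.0, 6.4, 6.4, 5.9, 5.8]

print(round(cv_percent(petals), 2))    # 43.03
print(round(cv_percent(moisture), 2))  # 5.68
```

The petal counts vary far more relative to their mean than the moisture readings do, even though the two sets are measured in different units.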
Skewness
• Skewness measures the degree of asymmetry exhibited by the data.
• The data can exhibit positive skewness or negative skewness.
• If the mean of the data is greater than its median, the data is positively skewed; and if the mean of the data is less than its median, the data is negatively skewed.
• Mathematically,
$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
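The skewness formula can be sketched directly. This version assumes s is the sample standard deviation (divisor n - 1), as defined earlier; statistical packages often use other normalisations, so their results may differ slightly.

```python
from statistics import mean, stdev


def skewness(data):
    """Sum of cubed deviations divided by n * s**3 (s = sample std dev)."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)
    return sum((x - x_bar) ** 3 for x in data) / (n * s ** 3)


print(round(skewness([5, 12, 6, 8, 14]), 4))  # 0.2066 (flower petals)
print(skewness([1, 2, 3, 4, 5]))              # 0.0 for symmetric data
```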
Skewness
[Figure: three distributions, negatively skewed, symmetric (not skewed), and positively skewed, each marked with the positions of its mean, median, and mode]
Kurtosis
• Kurtosis measures the peakedness of the data relative to the normal distribution.
• Data with a high degree of peakedness is said to be leptokurtic and has a kurtosis value of more than 3.
• Flat data has a kurtosis value of less than 3, and it is called platykurtic.
• Mathematically,
$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4}$$
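A matching sketch for the kurtosis formula, again assuming s is the sample standard deviation. With only five measurements the value says little about shape; the example only shows the arithmetic.

```python
from statistics import mean, variance


def kurtosis(data):
    """Sum of fourth-power deviations divided by n * s**4."""
    n = len(data)
    x_bar = mean(data)
    s2 = variance(data)  # sample variance, so s**4 == s2**2
    return sum((x - x_bar) ** 4 for x in data) / (n * s2 ** 2)


print(round(kurtosis([5, 12, 6, 8, 14]), 3))  # 0.928 (flower petals)
```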
Kurtosis
• Peakedness of a distribution
• Leptokurtic: high and thin
• Mesokurtic: normal in shape
• Platykurtic: flat and spread out
[Figure: leptokurtic, mesokurtic, and platykurtic curves]
Skewness and Kurtosis
