STA101: Lesson Note 10

Exploratory Data Analysis (EDA)

• Get to know your data!
  • distributions (symmetric, normal, skewed)
  • data quality problems
  • outliers
  • correlations and inter-relationships
  • subsets of interest
• Goal: get a general sense of the data
  • means, medians, quantiles, histograms, boxplots
• You should always look at every variable; you will learn something!
• Think interactive and visual
  • Humans are the best pattern recognizers
  • You can use more than 2 dimensions!
    • x, y, z, space, color, time, ...
• Especially useful in early stages of data mining
  • detect outliers (e.g. assess data quality)
  • test assumptions (e.g. symmetrical distributions or skewed?)
  • identify useful raw data & transforms (e.g. log(x))
• Bottom line: it is always well worth looking at your data!
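As a minimal sketch of the "general sense of the data" step, the summary measures listed above can be computed with Python's standard library. The sample values are hypothetical (they do not come from this note), and the 1.5 × IQR fence used to flag outliers is a common convention assumed here, not a rule stated in the lesson.

```python
import statistics

# Hypothetical sample, for illustration only (not data from this note).
values = [12, 15, 14, 13, 16, 15, 14, 48, 13, 15]

mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles: statistics.quantiles returns the three cut points for n=4.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: flag values beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("mean:", mean, "median:", median)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("possible outliers:", outliers)
```

A mean well above the median, as in this sample, is a quick hint of positive skew and is worth confirming with a histogram or boxplot.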

STA101 (Introduction to Statistics) _Lesson Note 10_Summer 2022


Some visualizations with pie chart & histogram.

The crude death rate (per 1,000 people) in Bangladesh was reported at 5.529 in 2018
(source: World Bank, August 2020). The distribution shown is positively skewed.



The total population of Bangladesh was estimated at 165.2 million people in 2019, according to
the latest census figures and projections from Trading Economics. The distribution shown is
negatively skewed.



Skewness:

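Formulas:

1. Pearson's coefficient of skewness:

\[
\text{Pearson's coefficient of skewness}
= \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}
\approx \frac{\text{mean} - \text{mode}}{\text{standard deviation}}
\]

The two forms agree approximately for moderately skewed distributions, where mean − mode ≈ 3(mean − median).

2. Bowley's coefficient of skewness, where Q_2 is the median:

\[
\text{Bowley's coefficient of skewness}
= \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}
\]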

Example:

For a distribution we have:

mean = 30.892, median = 30.58, SD = 2.219, Q1 = 29.50, Q3 = 32.1

Is the distribution positively skewed? Why? What is the value of the coefficient of skewness?


\[
\text{Pearson's coefficient of skewness}
= \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}
= \frac{3(30.892 - 30.58)}{2.219}
= 0.42
\]

\[
\text{Bowley's coefficient of skewness}
= \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}
= \frac{(32.1 - 30.58) - (30.58 - 29.50)}{32.1 - 29.50}
= 0.17
\]

Yes, the distribution is positively skewed, because the coefficient of skewness is greater than 0
under both measures: Pearson's coefficient is 0.42 and Bowley's coefficient is 0.17.
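The arithmetic in this example can be checked with a short Python sketch. The function names `pearson_skewness` and `bowley_skewness` are illustrative choices, not names from the lesson; the inputs are the values given in the example, with the median used as Q2.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd


def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


# Values from the worked example; the median plays the role of Q2.
pearson = pearson_skewness(mean=30.892, median=30.58, sd=2.219)
bowley = bowley_skewness(q1=29.50, q2=30.58, q3=32.1)

print(round(pearson, 2))  # 0.42
print(round(bowley, 2))   # 0.17
print("positively skewed" if pearson > 0 and bowley > 0 else "not positively skewed")
```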



The Ordered Array:

• A sequence of data in rank order:
  • Shows range (min to max)
  • Provides some signals about variability within the range
  • May help identify outliers (unusual observations)
  • If the data set is large, the ordered array is less useful
• Data in raw form (as collected):
  24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• Data in ordered array from smallest to largest:
  21, 24, 24, 26, 27, 27, 30, 32, 38, 41
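A minimal Python sketch of the same step, using the ten raw values above. Printing the gaps between neighbouring ordered values is one informal way (an assumption of this sketch, not a method from the note) to see where the data are spread out.

```python
raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]

# Ordered array: the raw data sorted from smallest to largest.
ordered = sorted(raw)

smallest, largest = ordered[0], ordered[-1]
data_range = largest - smallest

# Differences between neighbouring ordered values; large gaps hint at
# sparse regions or unusual observations.
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

print(ordered)
print("min:", smallest, "max:", largest, "range:", data_range)
print("gaps:", gaps)
```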



The Stem and Leaf Display:

A stem and leaf plot is a graphical technique for representing quantitative data that can be used
to examine the shape of a frequency distribution. Here the “stem” represents the leading digits
(the tens, for two-digit values) and the “leaf” represents the trailing digit (the units).
Compared with other techniques, it is an easy and quick way of displaying data.

Tukey (1977) first proposed the technique. It uses the information contained in a
frequency distribution to show:

• The range of scores
• The concentration of scores
• The shape of the distribution
• Whether any specific values or scores are absent from the data set
• Whether there are any stray or extreme values in the distribution

It is a simple way to see distribution details in a data set.

Method: Separate the sorted data series into leading digits (the stems) and trailing digits
(the leaves).
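The method can be sketched in Python as follows, assuming non-negative whole numbers whose last digit is the leaf. The helper names `stem_and_leaf` and `format_plot` are illustrative, and the data are the ten values from the ordered array section.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sort first, so leaves come out ordered
        stem, leaf = divmod(v, 10)    # e.g. 27 -> stem 2, leaf 7
        plot[stem].append(leaf)
    return dict(plot)


def format_plot(plot):
    """Render one line per stem, including stems that have no leaves."""
    lines = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return "\n".join(lines)


data = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
print(format_plot(stem_and_leaf(data)))
# 2 | 1 4 4 6 7 7
# 3 | 0 2 8
# 4 | 1
```

Printing empty stems as well keeps gaps in the data visible, which is one of the things the display is meant to reveal.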



Example 1:



Example 2:

The following data represent the marks obtained by 20 students in a statistics test.

84 17 78 45 47 53 76 54 75 22

66 65 55 54 51 33 39 19 54 72

Use a stem and leaf plot to display the data.

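Solution: Here the highest score is 84 and the lowest score is 17, so the stems run from 1 to 8.
The stem and leaf diagram is given below.

Unorganized (leaves in order of appearance):

Stem | Leaf
1    | 7 9
2    | 2
3    | 3 9
4    | 5 7
5    | 3 4 5 4 1 4
6    | 6 5
7    | 8 6 5 2
8    | 4

Organized (leaves sorted):

Stem | Leaf
1    | 7 9
2    | 2
3    | 3 9
4    | 5 7
5    | 1 3 4 4 4 5
6    | 5 6
7    | 2 5 6 8
8    | 4

Key: 8 | 4 = 84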



Example 3:

The following data represent the marks (out of 10) obtained by 20 students in a quiz.

8.4 1.7 7.8 4.5 4.7 5.3 7.6 5.4 7.5 2.2

6.6 6.5 5.5 5.4 5.1 3.3 3.9 1.9 5.4 7.2

Use a stem and leaf plot to display the data.

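Solution: Here the highest score is 8.4 and the lowest score is 1.7, so the stems (whole-number
parts) run from 1 to 8 and the leaves are the first decimal digits. The stem and leaf diagram is
given below.

Unorganized (leaves in order of appearance):

Stem | Leaf
1    | 7 9
2    | 2
3    | 3 9
4    | 5 7
5    | 3 4 5 4 1 4
6    | 6 5
7    | 8 6 5 2
8    | 4

Organized (leaves sorted):

Stem | Leaf
1    | 7 9
2    | 2
3    | 3 9
4    | 5 7
5    | 1 3 4 4 4 5
6    | 5 6
7    | 2 5 6 8
8    | 4

Key: 8 | 4 = 8.4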
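For decimal data such as the quiz marks in Example 3, one way to build both the unorganized and the organized display is to scale each value by 10 so that the first decimal digit becomes the leaf. This is a sketch under that assumption (one decimal place); the function name `decimal_stem_and_leaf` is illustrative.

```python
from collections import defaultdict

marks = [8.4, 1.7, 7.8, 4.5, 4.7, 5.3, 7.6, 5.4, 7.5, 2.2,
         6.6, 6.5, 5.5, 5.4, 5.1, 3.3, 3.9, 1.9, 5.4, 7.2]


def decimal_stem_and_leaf(values, organized=True):
    """Stem = whole-number part, leaf = first decimal digit."""
    plot = defaultdict(list)
    for v in values:
        # round() guards against floating-point error when scaling by 10.
        stem, leaf = divmod(round(v * 10), 10)
        plot[stem].append(leaf)
    if organized:
        for leaves in plot.values():
            leaves.sort()
    return dict(sorted(plot.items()))  # stems in ascending order


unorganized = decimal_stem_and_leaf(marks, organized=False)
organized = decimal_stem_and_leaf(marks, organized=True)

for stem, leaves in organized.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
print("Key: 8 | 4 = 8.4")
```

The unorganized version keeps leaves in the order the marks were recorded; sorting each row afterwards gives the organized display.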



Example 4:

Exercises for practice:

1. The following data represent the amount of insurance (in units of thousand taka) purchased
by 30 people from an insurance company in a given week:

31 44 51 35 76 84 110 50 56 61

40 48 61 85 90 92 40 65 120 125

100 105 115 70 77 120 75 80 92 115

Construct a stem and leaf plot to display the data.



2. Let us consider the following data:

17.0 17.7 15.9 15.2 16.2 17.1 15.7 17.3 13.5 16.3

14.6 15.8 15.3 16.4 13.7 16.2 16.4 16.1 17.0 15.9

Construct a stem and leaf plot to display the data.

• Statistical Techniques in Business & Economics, Douglas A. Lind, William G. Marchal, and
Samuel A. Wathen, 17th edition, McGraw-Hill Education. Pages 101-102, Q. 7-10.
