STATISTICS Grade 12
Revision
Terminology
Measures of central tendency:
Provide information on the data values at the centre of the data set.
To find the median: we first sort the data in ascending or descending order and then
pick out the value in the middle of the sorted list. If the middle is in between two values,
the median is the average of those two values.
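The procedure above can be sketched in a few lines of Python. This is only an illustration; the function name `median` and the sample lists are chosen here for the example.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # sort in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd number of values: one value sits exactly in the middle
        return ordered[middle]
    # even number of values: average the two values around the middle
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([7, 3, 9]))                                      # 7
print(median([3, 6, 7, 7, 9, 11, 12, 16, 31, 35, 40, 45]))    # 11.5
```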
MEASURES OF DISPERSION:
Tell us how spread out a data set is. If a measure of dispersion is small, the data are
clustered in a small region.
If a measure of dispersion is large, the data are spread out over a large region.
THE RANGE is the difference between the maximum and minimum values in the data
set.
The inter-quartile range is the difference between the first and third quartiles of the
data set. The quartiles are computed in a similar way to the median.
The median is halfway into the ordered data set and is sometimes also called the
second quartile.
THE FIRST QUARTILE is one quarter of the way into the ordered data set, whereas the
THIRD QUARTILE is three quarters of the way into the ordered data set.
If you begin numbering your ordered data set with the number 1, the formulae for the
location of each quartile are as follows:
Location of Q1 = $\frac{1}{4}(n - 1) + 1$
Location of Q2 = $\frac{1}{2}(n - 1) + 1$
Location of Q3 = $\frac{3}{4}(n - 1) + 1$
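As a quick check of the location formulae, the sketch below evaluates them for a data set of 12 values. The helper names `quartile_location` and `neighbouring_positions` are hypothetical and only used for this illustration.

```python
import math


def quartile_location(n, k):
    """Location (counting from 1) of quartile k (1, 2 or 3) among n ordered values."""
    return k / 4 * (n - 1) + 1


def neighbouring_positions(location):
    """Positions of the values on either side of a (possibly fractional) location."""
    return math.floor(location), math.ceil(location)


for k in (1, 2, 3):
    location = quartile_location(12, k)
    print(k, location, neighbouring_positions(location))
# 1 3.75 (3, 4)
# 2 6.5 (6, 7)
# 3 9.25 (9, 10)
```

When the location is a whole number, both neighbouring positions are the same and the quartile is simply the value at that position.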
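The variance of the data is the average squared distance between the mean and each
data value:
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
The standard deviation is the square root of the variance:
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$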
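The two formulae translate directly into code. The sketch below is one possible implementation; the function names and the small sample list are made up for illustration and do not come from these notes.

```python
import math


def population_variance(values):
    """Average squared distance between each value and the mean (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_std_dev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


sample = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative data, mean = 5
print(population_variance(sample))     # 4.0
print(population_std_dev(sample))      # 2.0
```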
The five number summary describes a data set using its minimum, quartiles and maximum.
More precisely, the five number summary is written in the following order:
minimum;
first quartile;
median;
third quartile;
maximum.
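A minimal sketch of how the five number summary could be computed is given below. It assumes the convention used in the worked examples of these notes: when a quartile location falls between two positions, the quartile is the average of the two neighbouring values. The function names are hypothetical; the data is the ordered data set from Example 2 later in these notes.

```python
import math


def quartile(ordered, k):
    """Quartile k (1, 2 or 3) of an already ordered list, using location k/4 (n - 1) + 1."""
    location = k / 4 * (len(ordered) - 1) + 1
    lower = ordered[math.floor(location) - 1]   # value just below the location
    upper = ordered[math.ceil(location) - 1]    # value just above the location
    return (lower + upper) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    return (ordered[0], quartile(ordered, 1), quartile(ordered, 2),
            quartile(ordered, 3), ordered[-1])


data = [3, 6, 7, 7, 9, 11, 12, 16, 31, 35, 40, 45]
summary = five_number_summary(data)
print(summary)                        # (3, 7.0, 11.5, 33.0, 45)

data_range = summary[4] - summary[0]  # maximum - minimum
iqr = summary[3] - summary[1]         # Q3 - Q1
print(data_range, iqr)                # 42 26.0
```

Note that statistical software often interpolates between the two neighbouring values instead of averaging them, so its quartiles may differ slightly from the ones obtained here.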
THE FIVE NUMBER SUMMARY is often presented visually using a box and whisker
diagram, illustrated below.
WORKED EXAMPLE :
FIVE NUMBER SUMMARY
Draw a box and whisker diagram for the following data set:
1,25 ; 1,5 ; 2,5 ; 2,5 ; 3,1 ; 3,2 ; 4,1 ; 4,25 ; 4,75 ; 4,8 ; 4,95 ; 5,1
STEP 1:
Since the data set is already ordered, we can read off the minimum as the first value
(1,25) and the maximum as the last value ( 5,1).
STEP 2:
There are 12 values in the data set. We can use the figure below or the formulae to
determine where the quartiles are located.
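Using the figure above we can see that the median is between the sixth and seventh
values. We can confirm this using the formula:
Location of Q2 = $\frac{1}{2}(12 - 1) + 1 = 6,5$
The median is therefore $\frac{3,2 + 4,1}{2} = 3,65$. In the same way, the first quartile is at
location 3,75, between the third and fourth values (both 2,5), and the third quartile is at
location 9,25, between the ninth and tenth values, giving $\frac{4,75 + 4,8}{2} = 4,775$.
We now have the five number summary as (1,25; 2,5; 3,65; 4,775; 5,1). The box and
whisker diagram representing the five number summary is given below.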
WORKED EXAMPLE 2: VARIANCE AND STANDARD DEVIATION
You flip a coin 100 times and it lands on heads 44 times. You then use the same
coin and do another 100 flips. This time it lands on heads 49 times. You repeat
this experiment a total of 10 times and get the following results for the number of
heads.
{44; 49; 52; 62; 53; 48; 54; 49; 46; 51}
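For the data set above:
Step 1. Calculate the mean.
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{508}{10} = 50.8$
Step 2. Calculate the variance.
The formula for the variance is $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$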
We first subtract the mean from each data point and then square the result.
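Data value ($x_i$):           44     49     52     62      53     48     54     49     46     51
Deviation ($x_i - \bar{x}$):  -6.8   -1.8   1.2    11.2    2.2    -2.8   3.2    -1.8   -4.8   0.2
Squared deviation:            46.24  3.24   1.44   125.44  4.84   7.84   10.24  3.24   23.04  0.04    (sum = 225.6)
The variance is the sum of the last row in this table divided by 10, so $\sigma^2 = \frac{225.6}{10} = 22.56$.
The standard deviation is the square root of the variance, so $\sigma = \sqrt{22.56} \approx 4.75$.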
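The rows of the table can be reproduced step by step with a short script. This is only a check of the arithmetic in this worked example; the variable names are chosen here for illustration.

```python
heads = [44, 49, 52, 62, 53, 48, 54, 49, 46, 51]

n = len(heads)
mean = sum(heads) / n                       # Step 1: the mean
deviations = [x - mean for x in heads]      # second row: x_i minus the mean
squared = [d ** 2 for d in deviations]      # third row: squared deviations
variance = sum(squared) / n                 # average of the squared deviations
std_dev = variance ** 0.5                   # standard deviation

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
# 50.8 22.56 4.75
```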
CALCULATE THE VARIANCE USING A CALCULATOR
Symmetric and skewed data
Last year you learnt about three shapes of data distribution: symmetric, left skewed and
right skewed.
A symmetric distribution is one where the left and right hand sides of the distribution
are roughly equally balanced around the mean. The histogram below shows a typical
symmetric distribution.
For symmetric distributions, the mean is approximately equal to the median and the left
and right tails are equally balanced, meaning that they have about the same length.
If a large amount of data is collected from a population, the graph will often have a bell
shape.
If the data were, say, examination results, a few learners usually get very high marks, a
few get very low marks and most get a mark in the middle range. This is a common type of
symmetric data known as a normal distribution.
For a normal distribution, approximately 68% of the sample lies within one standard
deviation of the mean, 95% within two standard deviations and 99.7% within three
standard deviations of the mean.
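These percentages can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 1))
```

The printed percentages are close to 68%, 95% and 99.7% respectively.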
What happens if the test was very easy or very difficult? Then the distribution may not
be symmetrical. If extremely high or extremely low scores are added to a distribution,
then the mean and median tend to shift towards these scores and the curve becomes
skewed.
If the test was very difficult, the mean and median scores are shifted to the left. In this
case, we say the distribution is positively skewed, or skewed right.
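A distribution that is skewed right has the following characteristics:
1. the mean is typically greater than the median;
2. the tail of the distribution is longer on the right hand side than on the left hand
side; and
3. the median is closer to the first quartile than to the third quartile.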
If the test was very easy, then many learners would get high scores, and the mean and
median of the distribution would be shifted to the right. We say the distribution
is negatively skewed, or skewed left.
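A distribution that is skewed left has the following characteristics:
1. the mean is typically less than the median;
2. the tail of the distribution is longer on the left hand side than on the right hand
side; and
3. the median is closer to the third quartile than to the first quartile.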
WORKED EXAMPLE 3: SKEWED AND SYMMETRIC DATA
Three Matric classes wrote a Mathematics test. The test is out of 40 marks and each
class has 21 learners. The results of the test are shown in the table below:
1. For each class, determine the five number summary and draw a box and whisker
diagram on the same set of axes using an appropriate scale.
2. Determine the mean and standard deviation for each class.
3. Comparing the mean and median values for each class, comment on the
distribution of the test marks for each class.
1.
First, we order the data from smallest to largest. This has already been done for us.
Then, we divide our data into quartiles:
Since there are 21 values in the data set, the median lies on the eleventh mark, making
it equal to 16 for Gr. 12A, 32 for Gr. 12B and 21 for Gr. 12C.
The first quartile lies between the fifth and sixth values, making it equal to 12 for Gr.
12A, 16 for Gr. 12B and 14 for Gr. 12C.
The third quartile lies between the 16th and 17th values, making it equal to 36 for Gr.
12A and Gr. 12B, and $\frac{30 + 32}{2} = 31$ for Gr. 12C.
Therefore, we are able to formulate the following five number summaries and
subsequent box and whisker plots:
Gr. 12A =[4;12;16;36;40]
Gr. 12B =[4;16;32;36;40]
Gr. 12C =[4;14;21;31;40]
DETERMINE THE MEAN AND STANDARD DEVIATION FOR EACH CLASS.
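2. The mean and standard deviation are calculated using:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ ; $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Gr. 12A:
mean ($\bar{x}$) = $\frac{496}{21} \approx 23.6$
standard deviation ($\sigma$) = $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \approx 12.70$
Gr. 12B:
mean ($\bar{x}$) = $\frac{556}{21} \approx 26.5$
standard deviation ($\sigma$) = $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \approx 10.65$
Gr. 12C:
mean ($\bar{x}$) = $\frac{453}{21} \approx 21.6$
standard deviation ($\sigma$) = $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \approx 10.54$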
3.
If the mean is greater than the median, the data is typically positively skewed and if
the mean is less than the median, the data is typically negatively skewed.
Gr. 12A:
mean−median=23,6−16=7,6
The marks for 12A are therefore positively skewed, meaning that there were many
low marks in the class with the high marks being more spread out.
Gr. 12B:
mean−median=26,5−32=−5,5
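The marks for 12B are therefore negatively skewed, meaning that there were many
high marks in the class with the low marks being more spread out.
Gr. 12C:
mean−median=21,6−21=0,6
The mean and median for 12C are nearly equal, so the marks for 12C are approximately
symmetric.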
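The comparison of mean and median can be written as a small helper. This is a sketch only: the function name and the `tolerance` of 1 mark (below which the difference is treated as negligible) are assumptions made for this illustration, not values prescribed by these notes.

```python
def skew_from_mean_and_median(mean, median, tolerance=1.0):
    """Classify skewness by comparing the mean with the median.

    `tolerance` is a hypothetical cut-off: smaller differences are treated as negligible.
    """
    difference = mean - median
    if difference > tolerance:
        return "positively skewed"
    if difference < -tolerance:
        return "negatively skewed"
    return "approximately symmetric"


# (mean, median) for each class, taken from the worked example
classes = {"Gr. 12A": (23.6, 16), "Gr. 12B": (26.5, 32), "Gr. 12C": (21.6, 21)}
for name, (mean, median) in classes.items():
    print(name, skew_from_mean_and_median(mean, median))
# Gr. 12A positively skewed
# Gr. 12B negatively skewed
# Gr. 12C approximately symmetric
```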
EXERCISE 9.1
State whether each of the following data sets is symmetric, skewed right or skewed left.
a. A data set with this distribution:
skewed right
b. A data set with this box and whisker plot:
symmetric
c. A data set with this histogram:
skewed left
d. A data set with this frequency polygon:
skewed right
e. A data set with this distribution:
skewed left
first quartile: 66.5
median: 99.5
third quartile: 130.5
Note that we get contradicting indications from the different ways of determining whether the
data is skewed right or left.
The mean is slightly greater than the median. This would indicate that the data set is
skewed right.
The median is slightly closer to the third quartile than the first quartile. This would
indicate that the data set is skewed left.
Since these differences are so small and since they contradict each other, we conclude that the
data set is symmetric.
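The quartile-based indicator can be checked with the values given above (first quartile 66.5, median 99.5, third quartile 130.5). The function name below is hypothetical, and the sketch covers only this one indicator.

```python
def skew_from_quartiles(q1, median, q3):
    """Classify skewness from the position of the median between the quartiles."""
    lower_gap = median - q1     # distance from the first quartile to the median
    upper_gap = q3 - median     # distance from the median to the third quartile
    if lower_gap < upper_gap:
        return "skewed right"   # median closer to the first quartile
    if lower_gap > upper_gap:
        return "skewed left"    # median closer to the third quartile
    return "symmetric"


print(99.5 - 66.5, 130.5 - 99.5)                # 33.0 31.0
print(skew_from_quartiles(66.5, 99.5, 130.5))   # skewed left
```

The two gaps differ by only 2 units out of an inter-quartile range of 64, which supports treating the data set as symmetric.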
Example 2
a. 40 ; 45 ; 12 ; 6 ; 9 ; 16 ; 11 ; 7 ; 35 ; 7 ; 31 ; 3
Ordered data: 3 ; 6 ; 7 ; 7 ; 9 ; 11 ; 12 ; 16 ; 31 ; 35 ; 40 ; 45
Sum = 222 ; mean = $\frac{222}{12} = 18.5$
STEP 1:
DETERMINE THE MINIMUM AND MAXIMUM
Once the data set is ordered, we can read off the minimum as the first value (3) and the
maximum as the last value (45).
Minimum = 3
Maximum = 45
STEP 2:
DETERMINE THE QUARTILES
There are 12 values in the data set (n = 12).
Value:     3   6   7   7   9   11  12  16  31  35  40  45
Position:  1   2   3   4   5   6   7   8   9   10  11  12
Using the numbered positions above we can see that the median is between the sixth and
seventh values. We can confirm this using the formula:
Location of Q2 = $\frac{1}{2}(12 - 1) + 1 = 6.5$
Therefore, the value of the median is:
Q2 = $\frac{11 + 12}{2} = 11.5$
The first quartile lies between the third and fourth values. We can
confirm this using the formula:
Location of Q1 = $\frac{1}{4}(12 - 1) + 1 = 3.75$
Therefore, the value of the first quartile is:
Q1 = $\frac{7 + 7}{2} = 7$
The third quartile lies between the ninth and tenth values. We can
confirm this using the formula:
Location of Q3 = $\frac{3}{4}(12 - 1) + 1 = 9.25$
Therefore, the value of the third quartile is:
Q3 = $\frac{31 + 35}{2} = 33$
STEP 3.
DRAW THE BOX AND WHISKER DIAGRAM
We now have the five number summary as (3; 7; 11.5; 33; 45).
The box and whisker diagram representing the five number
summary is given below.