Ibrahim's Statistics - 221221 - 085821

Stat 329: Principles of Statistics and Probability

Dr. Ibrahim Almanjahie

King Khalid University

Department of Mathematics

imalmanjahi@kku.edu.sa

1442 AH

Reference: Allan G. Bluman. Elementary Statistics: A Step by Step Approach.

Dr. Ibrahim Almanjahie Principles of Statistics and Probability

Chapter 1: Introduction to Statistics

You may be familiar with probability and statistics through radio,
television, newspapers, and magazines. For example, you may have
read statements like the following in newspapers:
- In Saudi Arabia, 95% of high school graduate students go to
universities.
- The average salary of employees in Saudi Arabia is 6000 SAR.
- The probability of getting infected by COVID-19 is 90%.

Definition of statistics
Statistics is the science of conducting studies to collect, organize,
summarize, analyze, and draw conclusions from data.

Branches of Statistics
A- Descriptive Statistics consists of the collection, organization,
summarization, and presentation of data.
Methods of Descriptive Statistics
✦ Frequency distributions (frequency tables), graphs, ...
✦ Measures of central tendency (averages), measures of dispersion, ...
B- Inferential Statistics consists of generalizing from samples to
populations, performing estimations and hypothesis tests, determining
relationships among variables, and making predictions.

Population and Sample


Population:
The collection of all individuals or items that are being studied.

Types of Population:

✦ Limited (finite): consists of a limited number of individuals, such as
the students enrolled in this Stat 329 class.

✦ Unlimited (infinite): consists of an unlimited number of individuals
that can be distinguished from each other but cannot practically be
counted, such as the fish in the sea.

Sample:
The part of the population from which information is collected.

Taking a sample from the population saves time and effort, for example
when examining a sample of eggs or testing the lifetime of the light
bulbs produced by a factory.
Data
A set of observations taken during a specific study. Data may be
numerical (quantitative), such as the heights and weights of a group of
students, or non-numerical (qualitative), such as skin color, gender,
etc.

Parameter and statistic

Parameter: A numerical value summarizing all the data of an entire
population. For example, the average monthly income of families in
Saudi Arabia.

Statistic: A numerical value summarizing the sample data. A statistic
can be used to make inferences about unknown parameters. For example,
the average monthly income of a sample of 100 families in Saudi
Arabia.

Variables:
Characteristics that vary from one person or thing to another.

Types:
✦ Qualitative variables are the variables that yield non-numerical
data. For example, gender (male or female), hair colour, eye colour, ...
✦ Quantitative variables are the variables that yield numerical data.
For example, weight, height, IQ score, ...
Sources of data collection: Two sources

First: Historical data, taken from archived records such as birth
and death records, statistics of the United Nations, and others.
Second: Survey data, collected from all members of the society, or
part of it, by direct contact (personal interview) or indirect
contact, such as regular mail, e-mail, and telephone.
Data Collection and Sampling Techniques

Two methods for data collection:

A- The comprehensive survey: data are collected from all elements of
the statistical population. The results of this method are characterized
by high accuracy, clarity, detail, and reliability.
B- Sampling: The main sampling methods are listed below:
1- Simple random sample: Here, every member of the statistical
population has the same chance of being chosen. To select a random
sample of, say, 15 subjects out of 85 subjects, number each subject
from 01 to 85. Then generate random numbers with a computer or
calculator, and select the 15 subjects whose numbers are generated.
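The selection of 15 subjects out of 85 can be sketched in Python as follows. This is only an illustration: the function name `simple_random_sample` and the seed value are assumptions made here, not part of the course material.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct subject numbers from 1..population_size."""
    rng = random.Random(seed)  # a seed is used only to make the run repeatable
    subjects = range(1, population_size + 1)
    # random.sample draws without replacement, so every subject
    # has the same chance of being chosen and no subject repeats.
    return sorted(rng.sample(subjects, sample_size))

chosen = simple_random_sample(85, 15, seed=329)
print(chosen)
```

Running it without a seed gives a different sample of 15 subject numbers each time.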

2- Stratified Sampling: We use it when the statistical population is
divided into homogeneous groups (strata). When the strata differ in
size, the number of subjects to be selected from each stratum is
calculated as follows:

$$\text{Sample size from a stratum} = \frac{\text{Number of subjects in the stratum}}{\text{Total number of subjects in the population}} \times \text{Total sample size}$$

Example (1-1):
Suppose we want to choose a sample of 30 students from the first-stage
students of the College of Science. The number of admitted students
is 130 in the Department of Life Sciences, 110 in the Department of
Chemistry, 50 in the Department of Mathematics, and 100 in the
Department of Physics. How many students do we choose from each
department?
Sol:
Total number of students = 130 + 110 + 50 + 100 = 390
From the Department of Life Sciences: $\frac{130}{390} \times 30 = 10$
From the Department of Chemistry: $\frac{110}{390} \times 30 \approx 8.46 \approx 8$
From the Department of Mathematics: $\frac{50}{390} \times 30 \approx 3.85 \approx 4$
From the Department of Physics: $\frac{100}{390} \times 30 \approx 7.69 \approx 8$
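The allocation in Example (1-1) can be checked with a short Python sketch. The function name `proportional_allocation` is hypothetical, and rounding each share to the nearest whole student is an assumption that happens to reproduce the numbers above.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Share a sample among strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    # (stratum size / population size) * sample size, rounded to a whole subject
    return {name: round(size * sample_size / total)
            for name, size in strata_sizes.items()}

departments = {
    "Life Sciences": 130,
    "Chemistry": 110,
    "Mathematics": 50,
    "Physics": 100,
}
allocation = proportional_allocation(departments, 30)
print(allocation)
print(sum(allocation.values()))
```

Here the rounded shares add up to 30. With other stratum sizes, simple rounding may give a total that is one more or one less than the required sample size, so the total should always be checked.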

3- Cluster Sampling: Here, the population is divided into groups, these
groups are divided into subgroups, and so on; the smallest subgroup is
called a cluster. Then, we choose a simple random sample from each
cluster to get a cluster sample.
Example (1-2):
Suppose we want to study the employment opportunities of King Khalid
University students after graduation. Which sampling method gives the
best sample?

Sol: Use a cluster sample, because the students are already grouped
into colleges, and within each college into departments.

4- Systematic Sampling:
Researchers obtain systematic samples by numbering each subject of
the population and then selecting every k-th subject.
Example (1-3):
Suppose there were 2000 subjects in the population and a sample of
50 subjects was needed.

Sol: Since 2000/50 = 40, then k = 40, and every 40th subject would
be selected; however, the first subject (numbered between 1 and 40)
would be selected at random. Suppose subject 12 were the first subject
selected; then the sample would consist of the subjects whose numbers
were 12, 52, 92, etc.
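As a minimal sketch of Example (1-3), the subject numbers can be generated in Python. The function name `systematic_sample` is an assumption made for illustration; the starting subject 12 is passed explicitly to match the example, and is drawn at random otherwise.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Select every k-th subject, where k = population_size // sample_size."""
    k = population_size // sample_size
    if start is None:
        start = random.randint(1, k)  # random first subject between 1 and k
    return list(range(start, population_size + 1, k))[:sample_size]

sample = systematic_sample(2000, 50, start=12)
print(sample[:3], "...", sample[-1])
print(len(sample))
```

With k = 40 and a start of 12, the selected subjects are 12, 52, 92, and so on up to 1972, giving 50 subjects in total.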
Note: See page 727 for other types of sampling techniques.
Chapter 2
Frequency Distributions and Graphs

Content:
✦ Organizing Data
✦ Histograms, Frequency Polygons, and Ogives
✦ Other Types of Graphs


1. Organizing Data

✦ When data are collected in original form, they are called raw data.
✦ After collecting raw data, we organize them in a table, called a
frequency distribution, to make them easier to handle and study.
A frequency distribution is the organization of raw data in table form,
using classes and frequencies; the frequency of a class is denoted by f.
To construct a frequency distribution for a data set, follow these
steps:
1. Find the lowest (L) and the highest (H) values in the raw data.
2. Calculate the range (R), where
R = highest value − lowest value = H − L.
3. Decide on the number (n) of classes (or intervals) desired; use 5 to 15
classes.
4. Find the width (W) of the class using W = R/n. (Always round up.)
5. Find the lower limit (LL) and upper limit (UL) of the first class by:
LL = L, UL = LL + W − 1
6. For the second class limits, use
LL = upper limit of first class + 1, UL = LL + W − 1
and use the same method for the other classes. Then, calculate the
frequency for each class.

Example (2-1):
The data below show the marks of 50 students in mathematics:
27 36 72 47 48 29 18 57 33 61 44 10 76 15 67 52 35
43 71 73 56 32 81 64 85 55 19 69 50 46 68 25 36 43
54 52 27 44 98 64 61 42 36 29 42 51 38 90 67 63.
Summarize the data in a frequency distribution table.

Sol:
1. Note that L = 10 and H = 98. Then R = 98 − 10 = 88.
2. In this example, we will choose n = 9. Then

$$W = \frac{R}{n} = \frac{88}{9} \approx 9.78 \approx 10$$

3. Find the lower limit (LL) and upper limit (UL) of the first class
by: LL = 10, UL = 10 + 10 − 1 = 19.
For the second class limits, we use
LL = 19 + 1 = 20, UL = 20 + 10 − 1 = 29, and so on for the others.
Then, find the frequency for each class.
4. The frequency distribution is finally constructed as follows.
Class Limits    Frequency fi
10 − 19         4
20 − 29         5
30 − 39         7
40 − 49         9
50 − 59         8
60 − 69         9
70 − 79         4
80 − 89         2
90 − 99         2
Sum             50

Note that the class limit “10-19” is read as “from 10 to 19”.
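The construction steps can be reproduced with a short Python sketch on the marks of Example (2-1). The function name `frequency_distribution` is hypothetical. The sketch assumes whole-number data and uses `math.ceil` for the "always round up" rule, which leaves W unchanged in the special case where R/n is already a whole number.

```python
import math

marks = [27, 36, 72, 47, 48, 29, 18, 57, 33, 61, 44, 10, 76, 15, 67, 52, 35,
         43, 71, 73, 56, 32, 81, 64, 85, 55, 19, 69, 50, 46, 68, 25, 36, 43,
         54, 52, 27, 44, 98, 64, 61, 42, 36, 29, 42, 51, 38, 90, 67, 63]

def frequency_distribution(data, n_classes):
    """Return a list of (lower limit, upper limit, frequency) tuples."""
    low, high = min(data), max(data)               # step 1: L and H
    width = math.ceil((high - low) / n_classes)    # steps 2-4: W = R / n, rounded up
    classes = []
    lower = low
    while lower <= high:
        upper = lower + width - 1                  # UL = LL + W - 1
        freq = sum(lower <= x <= upper for x in data)
        classes.append((lower, upper, freq))
        lower = upper + 1                          # next LL = previous UL + 1
    return classes

table = frequency_distribution(marks, 9)
for lower, upper, freq in table:
    print(f"{lower} - {upper}: {freq}")
```

The printed classes and frequencies match the table of Example (2-1), and the frequencies add up to 50.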



Class boundaries: The class boundaries are found by subtracting 0.5
from the lower class limit and adding 0.5 to the upper class limit.

For example: Lower limit − 0.5 = 10 − 0.5 = 9.5 = lower boundary
Upper limit + 0.5 = 19 + 0.5 = 19.5 = upper boundary

Class midpoint: The class midpoint xi is obtained by adding the
lower and upper boundaries (or, equivalently, the lower and upper
limits) and dividing by 2:

$$x_i = \frac{\text{lower boundary} + \text{upper boundary}}{2}$$

For example, for the class with boundaries 9.5 and 19.5: $x_1 = \frac{9.5 + 19.5}{2} = 14.5$

Relative frequency (RF):

$$RF = \frac{\text{Frequency of the class}}{\text{Sum of all frequencies}}$$

For example, for the class 10 − 19: $RF = \frac{4}{50} = 0.08$
Percentage frequency (PF): PF = RF × 100
For example: PF(10 − 19) = 0.08 × 100 = 8%
The frequency distribution table becomes:
Class limits   Class boundaries   Midpoint xi   Frequency fi   Relative frequency   Percentage frequency (%)
10 − 19        9.5 − 19.5         14.5          4              4/50 = 0.08          8
20 − 29        19.5 − 29.5        24.5          5              0.10                 10
30 − 39        29.5 − 39.5        34.5          7              0.14                 14
40 − 49        39.5 − 49.5        44.5          9              0.18                 18
50 − 59        49.5 − 59.5        54.5          8              0.16                 16
60 − 69        59.5 − 69.5        64.5          9              0.18                 18
70 − 79        69.5 − 79.5        74.5          4              0.08                 8
80 − 89        79.5 − 89.5        84.5          2              0.04                 4
90 − 99        89.5 − 99.5        94.5          2              0.04                 4
Sum                                             50             1                    100
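The extra columns of this table can be computed from the class limits and frequencies. The following Python sketch is one possible way to do it; the variable names are chosen here for illustration only.

```python
# (lower limit, upper limit, frequency) for each class of Example (2-1)
classes = [(10, 19, 4), (20, 29, 5), (30, 39, 7), (40, 49, 9), (50, 59, 8),
           (60, 69, 9), (70, 79, 4), (80, 89, 2), (90, 99, 2)]

n = sum(f for _, _, f in classes)  # sum of all frequencies

rows = []
for lower, upper, f in classes:
    lb, ub = lower - 0.5, upper + 0.5   # class boundaries
    midpoint = (lb + ub) / 2            # class midpoint x_i
    rf = f / n                          # relative frequency
    pf = 100 * f / n                    # percentage frequency
    rows.append((lb, ub, midpoint, rf, pf))

for (lower, upper, f), (lb, ub, mid, rf, pf) in zip(classes, rows):
    print(f"{lower}-{upper}  {lb}-{ub}  {mid}  {f}  {rf:.2f}  {pf:.0f}%")
```

For the first class this gives boundaries 9.5 and 19.5, midpoint 14.5, relative frequency 0.08, and percentage frequency 8%.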

Remark:
To facilitate the construction of the statistical tables derived from the
frequency distribution table, as well as the various statistical
calculations explained later, the classes in the frequency tables must
be expressed using their real (true) class boundaries.

Cumulative Frequency (or “Less than” Cumulative Frequency): a
distribution that shows the number of data values less than or equal to
a specific value (usually an upper boundary). The values are found by
adding the frequencies of the classes less than or equal to the upper
class boundary of a specific class.
Classes Cumulative Frequency
Less than 9.5 0
Less than 19.5 4
Less than 29.5 9
Less than 39.5 16
Less than 49.5 25
Less than 59.5 33
Less than 69.5 42
Less than 79.5 46
Less than 89.5 48
Less than 99.5 50
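The cumulative frequencies are running totals of the class frequencies. A minimal Python sketch, using the frequencies and upper class boundaries of Example (2-1), is:

```python
from itertools import accumulate

frequencies = [4, 5, 7, 9, 8, 9, 4, 2, 2]
upper_boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5, 99.5]

# Running totals: 4, 4 + 5, 4 + 5 + 7, ...
cumulative = list(accumulate(frequencies))

# No value lies below the first lower boundary 9.5, so that row has frequency 0.
less_than_table = [(9.5, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, cf in less_than_table:
    print(f"Less than {boundary}: {cf}")
```

The last cumulative frequency always equals the total number of observations, here 50.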
Bivariate Frequency Tables
This type of table is used when the statistical data to be summarized
involve two variables, such as the height and weight of students.
A bivariate table is a grid (matrix) of rows and columns. The joint
frequency of each pair of values of the two variables is written in
the corresponding cell, and the row and column totals are written at
the end of each row and column.
Example (2-2):
The following data represent the grades of 20 students in the subjects of
chemistry and mathematics. Organize these data in a bivariate frequency
table.
Chemistry    C C D E A A B C C B
Mathematics  C B B E A C C B C C
Chemistry    B C C A D A C B B A
Mathematics  B A B B E A A C A C

Sol.
Note that these bivariate data are qualitative: 20 grades in each of
chemistry and mathematics. We create a bivariate frequency table for
these data as follows (rows: mathematics grade, columns: chemistry grade):

Math \ Chemistry   A   B   C   D   E   Sum
A                  2   1   2   0   0   5
B                  1   1   3   1   0   6
C                  2   3   2   0   0   7
D                  0   0   0   0   0   0
E                  0   0   0   1   1   2
Sum                5   5   7   2   1   20
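The counting behind this table can be sketched in Python with `collections.Counter`. The second row of mathematics grades is read here as B A B B E A A C A C, which is an assumption; it is the reading that reproduces the row and column totals of the table.

```python
from collections import Counter

# Grades of the 20 students, in the order given in Example (2-2)
chemistry = list("CCDEAABCCB" "BCCADACBBA")
mathematics = list("CBBEACCBCC" "BABBEAACAC")

# Count how often each (mathematics grade, chemistry grade) pair occurs
counts = Counter(zip(mathematics, chemistry))

grades = "ABCDE"
print("Math\\Chem " + "  ".join(grades) + "  Sum")
for m in grades:
    row = [counts[(m, c)] for c in grades]
    print(f"{m}         " + "  ".join(str(v) for v in row) + f"  {sum(row)}")
```

A pair that never occurs, such as mathematics grade D with any chemistry grade, is simply counted as 0.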
Note the ease of creating a bivariate frequency table in Example (2-2).
What about quantitative data?
Example (2-3):
The following table represents the marks of 30 students in both the statistics
and mathematics subjects. Summarize this data in a bivariate frequency
distribution table.
Math Stat Math Stat Math Stat Math Stat Math Stat
71 76 90 57 50 53 75 80 55 50
93 93 75 73 65 72 68 71 72 70
67 64 92 90 86 85 65 62 80 81
96 94 72 74 52 56 82 83 60 61
72 77 92 91 81 86 60 63 85 82
77 78 70 75 57 60 81 84 75 79

Sol:
Note that these bivariate data are quantitative (numerical) data for the marks of
30 students in the subjects of statistics and mathematics. As the marks range
from 50 to 100, the appropriate class width for both statistics and
mathematics in this example is 10. We construct the bivariate
frequency table for these data as follows:

Math \ Stat   50−59   60−69   70−79   80−89   90−99   Sum
50−59         3       1       0       0       0       4
60−69         0       4       2       0       0       6
70−79         0       0       8       1       0       9
80−89         0       0       0       6       0       6
90−99         1       0       0       0       4       5
Sum           4       5       10      7       4       30
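For quantitative data, each mark is first assigned to its class and the pairs of classes are then counted. A minimal Python sketch for Example (2-3) follows; the helper name `class_label` is hypothetical and assumes classes of width 10 that start at multiples of 10.

```python
from collections import Counter

# (Math, Stat) marks of the 30 students in Example (2-3)
pairs = [(71, 76), (90, 57), (50, 53), (75, 80), (55, 50),
         (93, 93), (75, 73), (65, 72), (68, 71), (72, 70),
         (67, 64), (92, 90), (86, 85), (65, 62), (80, 81),
         (96, 94), (72, 74), (52, 56), (82, 83), (60, 61),
         (72, 77), (92, 91), (81, 86), (60, 63), (85, 82),
         (77, 78), (70, 75), (57, 60), (81, 84), (75, 79)]

def class_label(mark, width=10):
    """Return the class limits containing the mark, e.g. 76 -> '70-79'."""
    lower = (mark // width) * width
    return f"{lower}-{lower + width - 1}"

counts = Counter((class_label(m), class_label(s)) for m, s in pairs)

labels = ["50-59", "60-69", "70-79", "80-89", "90-99"]
for m in labels:
    row = [counts[(m, s)] for s in labels]
    print(m, row, sum(row))
```

Each printed row corresponds to one mathematics class, with the statistics classes as columns, as in the table above.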

Note that the class boundaries can be determined by subtracting
0.5 from the lower class limit and adding 0.5 to the upper class limit.
2. Graphic display

A graphical display is one method that can be used to describe data in
terms of the shape of the distribution and the extent to which the
data are centralized. In many applications, a graphical presentation
describes the phenomenon under study more easily and quickly.

Frequency distributions can be displayed graphically using:

✦ Histogram
✦ Frequency polygon
✦ Frequency curve
✦ Cumulative frequency graph, or ogive

1. Histogram

The histogram is a graph that displays the data by using contiguous
vertical bars (unless the frequency of a class is 0) of various heights to
represent the frequencies of the classes. To draw the histogram, follow
the steps below:
Step 1: Draw and label the x and y axes. The x axis is always the
horizontal axis, and the y axis is always the vertical axis.
Step 2: Represent the frequency on the y axis and the class
boundaries on the x axis.
Step 3: Using the frequencies as the heights, draw vertical bars
for each class.
The histogram for Example (2-1) is

[Figure: histogram of the marks. x axis: Marks (class boundaries 9.5, 19.5, ..., 99.5); y axis: Frequency (number of students).]

As the histogram shows, the classes with the greatest frequency (9
data values each) are 39.5–49.5 and 59.5–69.5, followed by 49.5–59.5
with 8. The graph also has two peaks.

2. Frequency Polygon
The frequency polygon is a graph that displays the data by using lines
that connect points plotted for the frequencies at the midpoints of the
classes (at x-axis). The frequencies are represented by the heights of
the points (at y-axis).

The frequency polygon for Example (2-1) is plotted using the following
steps:
Step 1: Find the midpoint of each class as in Example (2-1).
Then, label the x axis with the midpoint of each class, and the y
axis with the frequencies, i.e.
x   14.5  24.5  34.5  44.5  54.5  64.5  74.5  84.5  94.5
y   4     5     7     9     8     9     4     2     2
Step 2: Using the midpoints for the x values and the frequencies
as the y values, plot the points.
Step 3: Connect adjacent points with line segments.

[Figure: frequency polygon of the marks. x axis: Marks (midpoints 14.5, 24.5, ..., 94.5); y axis: Frequency (number of students).]

Remark: To close the frequency polygon, draw a line back to the x
axis at the beginning and end of the graph, at the same distance that
the previous and next midpoints would be located.

3. Frequency Curve
The frequency curve is drawn by following the same steps as for the
frequency polygon, except that the broken line is smoothed into a
curve that passes through as many of the points as possible. For
Example (2-1), the frequency curve can be drawn as:

[Figure: frequency curve of the marks. x axis: Marks (midpoints 14.5, 24.5, ..., 94.5); y axis: Frequency (number of students).]

Remark: The relative and percentage frequency curves can be drawn in the same
way.
4. Cumulative Frequency graph, or ogive
The ogive is a graph that represents the cumulative frequencies for the
classes in a frequency distribution. Steps for plotting ogive are:
Step 1: Find the cumulative frequency for each class.
Step 2: Draw the x and y axes. Label the x axis with the class
boundaries. Use an appropriate scale for the y axis to represent
the cumulative frequencies.
Step 3: Plot the cumulative frequency at each upper class bound-
ary. Upper boundaries are used since the cumulative frequencies
represent the number of data values accumulated up to the upper
boundary of each class.
Step 4: Starting with the first class boundary, connect adjacent
points with line segments.

For Example (2-1), the ogive can be drawn as:

[Figure: ogive of the marks. x axis: Marks (class boundaries 9.5, 19.5, ..., 99.5); y axis: Frequency (number of students), scale 10 to 50.]

Remark: Cumulative frequency graphs are used to visually represent how
many values are below a certain upper class boundary. For example, to find
out how many students scored less than 49.5, locate 49.5 on the x axis, draw a
vertical line up until it intersects the graph, and then draw a horizontal line
from that point to the y axis. The y axis value is 25.
Some shapes for frequency curves

✦ Symmetric distribution curves
A symmetric distribution is a type of distribution where the left
side of the distribution mirrors the right side.

✦ Asymmetric distribution curves
A distribution is asymmetric if it is not symmetric (a symmetric
distribution has zero skewness). An asymmetric distribution is
either left-skewed or right-skewed.

Other Types of Graphs

We now introduce other types of graphs, which are among the most
important methods used to illustrate the relationship between variables.
These are:

✦ Line Graph
✦ Bar Graph
✦ Pie Chart
✦ Stem-and-Leaf diagrams

Line Graph (or a time series graph)

A line graph is a type of chart used to represent data that occur over
a specific period of time. The horizontal axis represents time (years,
months, or days) and the vertical axis represents the values of data.

Example (2-4):
The following table contains information collected about the speed of
a particle at certain time periods:

Time (s)      0   1   2   3    4    5    6
Speed (m/s)   0   3   7   12   20   30   45

Representing the previous table graphically is a good way to display
the exact values.

[Figure: line graph of the particle's speed. x axis: Time (s); y axis: Speed (m/s).]
Example (2-5):
The following table contains information gathered on the number of
secondary schools for boys and girls in the Kingdom of Saudi Arabia
from 1395 to 1401 H. Construct a compound line graph for the data.

Year           1395   1396   1397   1398   1399   1400
Boy Schools    177    209    273    322    343    375
Girl Schools   35     48     58     85     113    138

The data in the above table can be plotted using a compound line
graph, that is, several line graphs on the same axes. A different
colour or pattern should be used for each line. In this example, the
x-axis represents “Year” and the y-axis represents “Number of schools”.

[Figure: compound line graph. x axis: Years (1395 to 1400); y axis: Number of Schools; legend: Boy Schools, Girl Schools.]
Bar Graphs
A bar graph represents the data by using vertical or horizontal bars
whose heights or lengths represent the frequencies of the data. Bar
graphs can be used when the data are qualitative or categorical.
Bar graphs differ from histograms in three main ways:
• The columns (bars) are positioned over a label that represents a
categorical variable.
• The columns do not have a class width.
• There is a gap between columns.
There are three types of bar graphs:
• Simple bar chart
• Grouped bar chart
• Stacked bar chart

• Simple bar chart
is used to represent data involving only one variable classified on a
spatial, quantitative or temporal basis.
A simple bar graph can be used to represent the total number of
schools (Boy schools + Girl schools) for each year in Example (2-5):

[Figure: simple bar chart. x axis: Years (1395 to 1400); y axis: Number of Schools, ticks from 0 to 500.]
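The totals plotted in the simple bar chart are the sums of the two rows of Example (2-5). As a rough illustration, the Python sketch below computes them and prints a crude text bar for each year; the scale of one `#` per 25 schools is an arbitrary choice made here.

```python
years = [1395, 1396, 1397, 1398, 1399, 1400]
boy_schools = [177, 209, 273, 322, 343, 375]
girl_schools = [35, 48, 58, 85, 113, 138]

# Total number of schools per year (Boy schools + Girl schools)
totals = [b + g for b, g in zip(boy_schools, girl_schools)]

for year, total in zip(years, totals):
    bar = "#" * (total // 25)  # bar length proportional to the total
    print(f"{year}  {bar} {total}")
```

The same two lists, kept separate, are the inputs for the grouped and stacked bar charts described next.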
• Grouped bar chart
is used to represent and compare different categories of two or more
groups.
The grouped bar graph is drawn using the following steps:

• Draw two or more adjacent bars representing the values of the
categorical variables under study, so that the length of each bar is
proportional to the number it represents.
• Distinguish between the bars by shading or different colors, and make
this clear on the drawing by adding a legend for these bars.
• Make sure that the bases of the bars are equal and that the
distances among them are equal.

For Example (2-5), the development of the number of boy schools and
the number of girl schools in each year can be compared using the
grouped bar graph as follows:

[Figure: grouped bar chart. x axis: Years (1395 to 1400); y axis: Number of Schools; legend: Boy Schools, Girl Schools.]
• Stacked bar chart
is used to break down and compare parts of a whole. Each bar in the
chart represents a whole, and segments in the bar represent different
parts or categories of that whole.
For Example (2-5), the evolution of the number of boy schools and the
number of girl schools in each year can be compared using the
stacked bar graph as follows:

[Figure: stacked bar chart. x axis: Years (1395 to 1400); y axis: Number of Schools, ticks from 0 to 500; legend: Boy Schools, Girl Schools.]

Pie Chart

A pie graph is a circle that is divided into sections or wedges according to the
percentage of frequencies in each category of the distribution.
To construct a pie graph for the data, follow these steps:

Step 1: Since there are 360° in a circle, the frequency for each class
must be converted into a proportional part of the circle. This conversion
is done by using the formula

$$\text{Degrees} = \frac{f}{n} \times 360^\circ$$

where f is the frequency for each class and n is the sum of the frequencies.

Step 2: Each frequency must also be converted to a percentage. The
conversion is done by using the formula

$$\% = \frac{f}{n} \times 100$$

Step 3: Using a protractor and a compass, draw the graph using the
appropriate degree measures found in step 1, and label each section with
the name and percentage.
Example (2-6):
The following table represents the areas (in million km²) of the 6
continents. Construct a pie chart for the data.
Continent                  Area (million km²)
Africa                     30.3
Asia                       47.4
Europe                     4.9
North America              24.3
Australia (plus Oceania)   8.5
South America              17.9

Sol: Using the above steps, we construct the following table:

Continent                  Area (million km²)   Central angle (degrees)              Percentage (%)
Africa                     30.3                 (30.3/133.3) × 360 = 81.83 ≈ 82      (30.3/133.3) × 100 = 22.73 ≈ 23
Asia                       47.4                 128.01 ≈ 128                         36
Europe                     4.9                  13.23 ≈ 13                           4
North America              24.3                 65.63 ≈ 66                           18
Australia (plus Oceania)   8.5                  22.96 ≈ 23                           6
South America              17.9                 48.34 ≈ 48                           13
Sum                        133.3                360                                  100
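The two conversions can be checked with a short Python sketch that applies Degrees = (f/n) × 360 and % = (f/n) × 100 to the areas of Example (2-6). The dictionary and variable names are illustrative assumptions.

```python
areas = {
    "Africa": 30.3,
    "Asia": 47.4,
    "Europe": 4.9,
    "North America": 24.3,
    "Australia (plus Oceania)": 8.5,
    "South America": 17.9,
}

total = sum(areas.values())  # n, the sum of all areas (about 133.3)

angles = {}
percentages = {}
for name, area in areas.items():
    angles[name] = round(area / total * 360)       # central angle in degrees
    percentages[name] = round(area / total * 100)  # share of the whole in %

for name in areas:
    print(f"{name}: {angles[name]} degrees, {percentages[name]}%")
```

In this example the rounded angles add up to 360 and the rounded percentages to 100. With other data, rounding can make the totals differ slightly from 360 and 100.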


Using the data in the previous table, the pie chart can be drawn as
follows:

[Figure: pie chart of the continent areas. Asia 36%, Africa 23%, North America 18%, South America 13%, Australia 6%, Europe 4%.]
Sometimes the data are already given as percentages. In that case, the
angles are calculated as follows:

Angle = % × 3.6

Exercise (2-1):
The following percentages indicate the sources of energy used
worldwide. Construct the pie graph.

Energy Percentages
Petroleum 39.8
Coal 23.2
Dry natural gas 22.4
Hydroelectric 7.0
Nuclear 6.4
Other (wind, solar, etc.) 1.2

Stem-and-Leaf diagrams
A stem and leaf plot is a data plot that uses part of the data value as the
stem and part of the data value as the leaf to form groups or classes. It
was presented by the statistician John Tukey for the first time in 1960.
Its advantages are:
• It helps to gain a broad idea of the data in terms of the spread of
the data and how they are centered.
• It shows any gaps in the data and reveals the extreme values
in the data.
Each number in a stem-and-leaf plot has two parts:
1- The leaf is the last (rightmost) digit of the number.
2- The stem is the rest of the number.
For example, for the number 35, the leaf is 5 and the stem is 3. For the
number 137, the leaf is 7 and the stem is 13.
To construct the stem-and-leaf diagram, follow these steps:
Step 1: Arrange the data in order.
Step 2: Separate the data according to the first digit.
Step 3: A display can be made by using the leading digit as the
stem and the trailing digit as the leaf.

Example (2-7):
At an outpatient testing center, the number of cardiograms performed
each day for 20 days is shown. Construct a stem and leaf plot for the
data.

25 31 20 32 13 14 43 02 57 23
36 32 33 32 44 32 52 44 51 45


Sol:

Step 1: 02, 13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44,
44, 45, 51, 52, 57
Step 2:
02
13, 14
20, 23, 25
31, 32, 32, 32, 32, 33, 36
43, 44, 44, 45
51, 52, 57
Step 3:
Stem | Leaves
0    | 2
1    | 3 4
2    | 0 3 5
3    | 1 2 2 2 2 3 6
4    | 3 4 4 5
5    | 1 2 7
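The three steps can be sketched in Python for the cardiogram data of Example (2-7). The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers, so that the leaf is the last digit and the stem is everything before it.

```python
from collections import defaultdict

cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

def stem_and_leaf(values):
    """Group the leaves (last digits) of the sorted values by stem."""
    plot = defaultdict(list)
    for value in sorted(values):        # step 1: arrange the data in order
        stem, leaf = divmod(value, 10)  # steps 2-3: leading part and last digit
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(cardiograms)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

A stem with no data values would be missing from the output, which is how a gap in the data would show up in this sketch.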
From the stem and leaf plot we see that the distribution peaks in the
center and that there are no gaps in the data. For 7 of the 20 days, the
number of patients receiving cardiograms was between 31 and 36. The
plot also shows that the testing center treated from a minimum of 2
patients to a maximum of 57 patients in any one day.


Chapter 3
A. Measures of Central Tendency

Content:
✦ Mean
✦ Weighted Mean
✦ Median
✦ Mode
✦ Geometric Mean
✦ Harmonic Mean
Summation $\sum$

Let $x_1, x_2, \ldots, x_n$ be the data observations. The sum of these values
is given by

$$x_1 + x_2 + \ldots + x_n = \sum_{i=1}^{n} x_i,$$

where $\sum_{i=1}^{n} x_i$ denotes the sum of all values of x.
Let $y_1, y_2, \ldots, y_n$ denote other data observations, and assume that
$c \in \mathbb{R}$. Then the following properties are valid.
(i) $\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$,
(ii) $\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$,
(iii) $\sum_{i=1}^{n} c = nc$.
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . 50
Dr. Ibrahim Almanjahie Principles of Statistics and Probability

1. Mean (Arithmetic Average)

The mean is one of the most important and most widely used measures of
central tendency in statistical analysis, owing to its good statistical
properties.
Definition
The mean is the sum of the values, divided by the total number of
values. The symbol x̄ represents the sample mean and µ represents
the population mean.
To find the mean of the data, we must differentiate between two cases:

• Mean for raw data.
• Mean for grouped data.
First: Mean for raw data

Assume that the sample size is $n$ and that the sample observations are
$x_1, x_2, \ldots, x_n$. Then, the mean (arithmetic average) is calculated by
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

For a population with data $x_1, x_2, \ldots, x_N$, the mean is
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$

Example(3-1)
The data represent the number of days off per year for a sample of
individuals selected from nine different countries:
20, 26, 40, 36, 23, 42, 35, 24, 30. Find the mean.

Sol:
$$\bar{x} = \frac{\sum_{i=1}^{9} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_9}{9} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} \approx 30.7 \text{ (days)}$$

Hence, the mean number of days off is 30.7 days.
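
As a quick check of Example (3-1), the raw-data formula can be evaluated directly in Python. The variable names are illustrative only.

```python
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

# Mean for raw data: sum of the values divided by the number of values
sample_mean = sum(days_off) / len(days_off)
print(round(sample_mean, 1))  # 30.7
```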


Second: Mean for grouped Data

Note that when the data are grouped and summarized in a frequency distribution table:

• The original data are unknown, but the number of observations in each
class (the class frequency) is known.

• To compute the mean, the class midpoint is used to represent all raw
data values in that class.

To find the mean, first construct the following table:

Classes     Midpoints xi    Frequency fi    xi.fi
Class 1     x1              f1              x1.f1
Class 2     x2              f2              x2.f2
:           :               :               :
Class k     xk              fk              xk.fk
Sum                         n = Σ fi        Σ xi.fi

Then the mean is computed by
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{n} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{f_1 + f_2 + \cdots + f_k}$$

Example (3-2)
Find the mean of the daily wage for a number of workers (in Riyals)
in a factory for a sample of 50 people whose wages are summarized in
the following table.
Class limits    Frequency fi
20 - 29         9
30 - 39         12
40 - 49         15
50 - 59         8
60 - 69         4
70 - 79         2
Sum             n = Σ fi = 50
Sol:
First, we construct the following table:
Class limits    Midpoints xi    Frequency fi     xi.fi
20 - 29         24.5            9                220.5
30 - 39         34.5            12               414.0
40 - 49         44.5            15               667.5
50 - 59         54.5            8                436.0
60 - 69         64.5            4                258.0
70 - 79         74.5            2                149.0
Sum                             n = Σ fi = 50    Σ xi.fi = 2145

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{2145}{50} = 42.90 \text{ SAR}$$
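
The table-based computation of Example (3-2) can be reproduced with a short Python sketch. The list of (lower limit, upper limit, frequency) triples is just one convenient, assumed way to store the frequency table.

```python
# (lower class limit, upper class limit, frequency) from Example (3-2)
classes = [(20, 29, 9), (30, 39, 12), (40, 49, 15),
           (50, 59, 8), (60, 69, 4), (70, 79, 2)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]   # 24.5, 34.5, ...
freqs = [f for _, _, f in classes]

n = sum(freqs)                                           # total frequency
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))    # sum of xi * fi
grouped_mean = sum_xf / n
print(n, sum_xf, grouped_mean)
```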


Some properties of the mean

1- The sum of the deviations of the observations from their arithmetic mean
is always zero, i.e.
$$\sum_{i=1}^{n} (x_i - \bar{x}) = (x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x}) = 0$$

2- The mean is easily subjected to algebraic operations as follows:

Sample data                                               Mean
$x_1, x_2, \ldots, x_n$                                   $\bar{x}$
$x_1 \pm b, x_2 \pm b, \ldots, x_n \pm b$                 $\overline{(x \pm b)} = \bar{x} \pm b$
$a x_1, a x_2, \ldots, a x_n$                             $\overline{a x} = a\bar{x}$
$a x_1 \pm b, a x_2 \pm b, \ldots, a x_n \pm b$           $\overline{(a x \pm b)} = a\bar{x} \pm b$
Example(3-3)
If the mean of the sample observations $x_1, x_2, \ldots, x_n$ is 20, find the
mean of $\frac{x_1 - 11}{3}, \frac{x_2 - 11}{3}, \ldots, \frac{x_n - 11}{3}$.

Sol:
Note that $\bar{x} = 20$. Then, using the properties of the mean, the solution is
$$\frac{\bar{x} - 11}{3} = \frac{20 - 11}{3} = \frac{9}{3} = 3$$

3- Assume that we have two samples of data where $n_1$ and $\bar{x}_1$ are
the sample size and the mean of the first sample, respectively, and $n_2$
and $\bar{x}_2$ are the sample size and the mean of the second sample,
respectively. Then, the mean of the combined sample can be calculated by
the following formula:
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

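4- If we have the sample $x_1, x_2, \ldots, x_n$, then
$$\sum_{i=1}^{n} (x_i - c)^2 > \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad c \in \mathbb{R},\ c \neq \bar{x}$$
That is, the sum of squared deviations is smallest when it is taken about the mean.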

Merits of the mean:

• The mean is found by using all the values of the data and is easily
subjected to algebraic operations.
• The mean for the data set is unique and not necessarily one of the data
values.
Demerits of the mean:

• The mean is affected by outliers.
• The mean cannot be used for qualitative data.
• The mean cannot be computed for data in a frequency distribution
that has an open-ended class, because the midpoint of that class is unknown.
2. Weighted Mean

Sometimes, you must find the mean of a data set in which not all values
are equally represented. The type of mean used in this case is
called the weighted mean.
Definition
Assume we have the sample $x_1, x_2, \ldots, x_n$ with the corresponding
weights $w_1, w_2, \ldots, w_n$. Then, the weighted mean is computed by
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


Example(3-4):
Find the weighted mean of a student’s marks, given that weight is the
number of hours, for the courses listed below.
Course       Number of hours (wi)    Marks (xi)
Math         3                       60
Physics      4                       75
Biology      4                       82
Chemistry    4                       70

Sol:
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{(3 \times 60) + (4 \times 75) + (4 \times 82) + (4 \times 70)}{3 + 4 + 4 + 4} = \frac{1088}{15} = 72.53 \text{ marks}$$
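
The weighted mean of Example (3-4) can be checked with a few lines of Python. The variable names are illustrative; the credit hours play the role of the weights.

```python
hours = [3, 4, 4, 4]        # weights w_i (number of hours)
marks = [60, 75, 82, 70]    # values x_i (marks)

# Weighted mean: sum of w_i * x_i divided by the sum of the weights
weighted_sum = sum(w * x for w, x in zip(hours, marks))
weighted_mean = weighted_sum / sum(hours)
print(weighted_sum, round(weighted_mean, 2))  # 1088 72.53
```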
Remarks:
1- The mean for grouped data is a weighted mean in which the weights
are the class frequencies, i.e.

$w_1 = f_1, w_2 = f_2, \ldots, w_k = f_k$

2- The mean for raw data is a weighted mean in which all weights are equal, i.e.

$w_1 = w_2 = \ldots = w_n = 1$

3- The general GPA and the semester GPA for a student at King
Khalid University are two weighted means of points, where
the hours are the weights.

3. Median

The median is the halfway point in a data set. Before you can find
this point, the data must be arranged in order. When the data set is
ordered, the median is the middle value, i.e. 50% of the data are less
than or equal to the median and 50% of the data are greater than or equal
to the median. The median either will be a specific value in the data set
or will fall between two values.
Definition
The median is the midpoint of the data array. The symbol for the
median is MD.



To find the median of the data, we must differentiate between two cases:
• Median for raw data.
• Median for grouped data.
First: Median for raw data:

To find the median for raw data, follow these steps:
Step 1: Arrange the data in order.
Step 2: Select the middle value.
Step 3: If the sample size n is odd, then the median is the
actual value in the middle. If the sample size n is even, then the
median falls between the two middle values and is taken as their average.

Example (3-5):
The following two sample data sets are the heights (in cm) of
science college students. Find the median of each sample.
Sample 1: 130, 145, 138, 142, 160, 158, 148
Sample 2: 135, 130, 145, 138, 142, 160, 158, 148

Sol:
Sample 1 (ordered): 130, 138, 142, 145, 148, 158, 160
Since n = 7 is odd, MD = 145 cm (the 4th value).
Sample 2 (ordered): 130, 135, 138, 142, 145, 148, 158, 160
Since n = 8 is even, $MD = \frac{142 + 145}{2} = 143.5$ cm.
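
The steps for raw data translate directly into code. The following is a minimal sketch; the function name median_raw is hypothetical.

```python
def median_raw(values):
    """Median of raw data: middle value, or average of the two middle values."""
    ordered = sorted(values)              # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # odd n: the actual middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the two

sample1 = [130, 145, 138, 142, 160, 158, 148]
sample2 = [135, 130, 145, 138, 142, 160, 158, 148]
print(median_raw(sample1), median_raw(sample2))  # 145 143.5
```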

Second: Median for grouped data

There are two methods to calculate the median for data summarized
in a frequency table:
A- Computational method: We follow these steps:
Step 1: Construct the cumulative frequency table.
Step 2: Determine $\frac{n}{2}$, one-half of the total number of observations.
Then, locate the median class in the cumulative frequency table by
using $\frac{n}{2}$.
Step 3: Determine the following:
A: The lower boundary of the median class.
L: The class width of the median class.
F1: The cumulative frequency just before $\frac{n}{2}$ (at the lower boundary of the median class).
F2: The cumulative frequency just after $\frac{n}{2}$ (at the upper boundary of the median class).
Step 4: Compute the median by
$$MD = A + \left( \frac{\frac{n}{2} - F_1}{F_2 - F_1} \right) \times L$$

Example(3-6):
Use Example (3-2) to find the median value, computationally, of
worker daily wages (in Riyals).

Sol:
Note that
$$\frac{n}{2} = \frac{50}{2} = 25$$
Now, we construct the cumulative frequency table for Example (3-2)
as follows.
Daily wages             Cumulative frequency
Less than 19.5          0
Less than 29.5          9
Less than 39.5 (= A)    21 (= F1)
                        <- n/2 = 25 lies between 21 and 36
Less than 49.5          36 (= F2)
Less than 59.5          44
Less than 69.5          48
Less than 79.5          50

$A = 39.5$, $L = 49.5 - 39.5 = 10$, $F_1 = 21$, $F_2 = 36$.
$$MD = A + \left( \frac{\frac{n}{2} - F_1}{F_2 - F_1} \right) \times L = 39.5 + \left( \frac{\frac{50}{2} - 21}{36 - 21} \right) \times 10 = 39.5 + \left( \frac{4}{15} \right) \times 10 = 42.17 \text{ SAR}$$
This result means that about 50% of the workers have a daily wage
less than or equal to 42.17 SAR.
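
The computational method can be sketched in Python as follows. The lists of class boundaries and frequencies are an assumed representation of the table in Example (3-2), and the rule used to locate the median class (the first class whose upper cumulative frequency reaches n/2) is one reasonable reading of Step 2.

```python
from itertools import accumulate

boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
freqs = [9, 12, 15, 8, 4, 2]

# Step 1: cumulative frequencies at each boundary: [0, 9, 21, 36, 44, 48, 50]
cum = [0] + list(accumulate(freqs))
n = cum[-1]
half = n / 2                      # Step 2: n/2 = 25

# Median class: first class whose upper cumulative frequency reaches n/2
j = next(i for i in range(len(freqs)) if cum[i + 1] >= half)

# Step 3: A, L, F1, F2 for the median class
A = boundaries[j]
L = boundaries[j + 1] - boundaries[j]
F1, F2 = cum[j], cum[j + 1]

# Step 4: linear interpolation inside the median class
median_grouped = A + (half - F1) / (F2 - F1) * L
print(A, L, F1, F2, round(median_grouped, 2))
```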

B- Graphical method: We follow these steps:


Step 1: Construct the cumulative frequency graph (ogive).
Step 2: Determine $\frac{n}{2}$, and locate it on the y-axis of the ogive.
Step 3: Draw a horizontal line (parallel to the x-axis) from the
location of $\frac{n}{2}$ until it intersects the ogive. From the point of
intersection, draw a vertical line down to the x-axis. The point where
this line meets the x-axis gives the approximate value of the median.
Example(3-7):
Use Example (3-2) to find the median value, graphically, of worker
daily wages (in Riyals).
Sol:
Note that the cumulative frequency table is already constructed in
Example(3-6). Therefore, the ogive is plotted as follows:

[Figure: ogive with daily wage on the x-axis (class boundaries 19.5 to 79.5) and cumulative frequency (number of workers, 0 to 50) on the y-axis. The horizontal line at n/2 = 25 meets the ogive at MD ≈ 42.]


Merits of the median:

• The median for the data set is unique.
• It can be determined graphically.
• It can be used for an open-ended distribution.
• It is affected less than the mean by extremely high or extremely
low values.

Demerits of the median:

• It does not consider all data values because it is a positional average.
• It is not suitable for further algebraic calculation.
• It is unsuitable for fractions and percentages.
4. Mode

Definition
The value that occurs most often in a data set is called the mode. The
symbol for mode is MOD.

Types of data based on the definition of Mode:

• Unimodal: A data set that has only one value that occurs with the
greatest frequency.
• Bimodal: A data set that has two values that occur with the same greatest
frequency.
• Multimodal: A data set that has more than two values that occur with
the same greatest frequency.


To find the mode of the data, we must differentiate between two cases:
• Mode for raw data.
• Mode for grouped data.
In the following, we discuss the above two cases in detail.
First: Mode for raw data

For the raw data, we look for the value that occurs most often in the
data set. This value is the mode.
Example (3-8):
Find the mode for the following data sets:
Data set 1: 65, 55, 31, 65, 48, 65
Data set 2: 11, 23, 14, 17, 25
Data set 3: 33, 35, 33, 24, 28, 31, 29, 24
Sol:

Data set    Mode      Type of data set
(1)         65        Unimodal
(2)         None      No mode
(3)         24, 33    Bimodal
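
Finding the mode of raw data amounts to counting occurrences. The sketch below uses Python's collections.Counter; the function name modes and the convention of returning an empty list when no value repeats are assumptions made to match the "No mode" row above.

```python
from collections import Counter

def modes(values):
    """Return all values with the greatest frequency (empty list if none repeats)."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                 # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

data1 = [65, 55, 31, 65, 48, 65]
data2 = [11, 23, 14, 17, 25]
data3 = [33, 35, 33, 24, 28, 31, 29, 24]
print(modes(data1), modes(data2), modes(data3))  # [65] [] [24, 33]
```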

Second: Mode for grouped data

There are two methods to calculate the mode for data summarized
in a frequency table:
A- Computational method: We follow these steps:
• Step 1: From the frequency table, identify the class that has the
highest frequency (the modal class). Denote the frequency of the modal
class by F.
• Step 2: Determine the frequency F1 of the class before the modal class
and the frequency F2 of the class after it.
• Step 3: Identify the lower boundary of the modal class and
denote it by A. Also, calculate L, the width of the modal class.
• Step 4: Compute the mode by the following formula:
$$MOD = A + \left( \frac{F - F_1}{2F - F_2 - F_1} \right) \times L$$

Example(3-9):
Use Example (3-2) to find the mode value, computationally, of
worker daily wages (in Riyals).
Sol:

Class limits    Class boundaries    Frequency fi
20 - 29         19.5 - 29.5         9
30 - 39         29.5 - 39.5         12
40 - 49         39.5 - 49.5         15
50 - 59         49.5 - 59.5         8
60 - 69         59.5 - 69.5         4
70 - 79         69.5 - 79.5         2

Note that the highest frequency is 15, i.e. $F = 15$. Then, $F_1 = 12$ and
$F_2 = 8$. The modal class is (39.5, 49.5), so $A = 39.5$ and $L = 10$.
$$MOD = A + \left( \frac{F - F_1}{2F - F_2 - F_1} \right) \times L = 39.5 + \left( \frac{15 - 12}{(2 \times 15) - 8 - 12} \right) \times 10 = 42.5 \text{ SAR}$$
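
The mode formula for grouped data can be sketched in Python as below. The data layout is assumed, and treating a missing neighboring class as having frequency 0 is an added convention that the text does not state (it is not needed for this example).

```python
boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
freqs = [9, 12, 15, 8, 4, 2]

# Step 1: modal class = class with the highest frequency
j = max(range(len(freqs)), key=lambda i: freqs[i])
F = freqs[j]

# Step 2: frequencies of the neighboring classes
# (assumption: a missing neighbor counts as frequency 0)
F1 = freqs[j - 1] if j > 0 else 0
F2 = freqs[j + 1] if j < len(freqs) - 1 else 0

# Step 3: lower boundary and width of the modal class
A = boundaries[j]
L = boundaries[j + 1] - boundaries[j]

# Step 4: mode formula
mode_grouped = A + (F - F1) / (2 * F - F2 - F1) * L
print(round(mode_grouped, 2))  # 42.5
```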

B- Graphical method:
To find the mode graphically, use the following steps:
• Step 1: Plot the histogram of the given grouped data.
• Step 2: Identify the modal class and the bar representing it.
• Step 3: Draw lines from the top corners of the modal class bar to
the near corners of the neighboring bars.
• Step 4: Draw a perpendicular line from the intersection of the two
lines until it touches the horizontal axis. Then, read the mode from
the horizontal axis (x-axis).

Example(3-10):
Use Example (3-2) to find the mode value, graphically, of worker
daily wages (in Riyals).
Sol:

[Figure: histogram with daily wage on the x-axis (class boundaries 19.5 to 79.5) and frequency (number of workers) on the y-axis. Lines are drawn from the top corners of the modal class bar to the near corners of the neighboring bars, and a green circle marks the mode on the x-axis.]

From the green circle on the x-axis of the above figure, the mode is
approximately equal to 42 SAR.

Merits of the mode:

• It can be obtained graphically.
• It can be used for qualitative data.
• It can be determined in distributions with open-ended classes.
• It is not affected by extreme values. It can be obtained even if the
extreme values are not known.
Demerits of the mode:
• The mode is not based on all the observations.
• It is not always possible to find a unique mode. Sometimes the
data have no mode, and some distributions may have two or more
modes (bimodal or multimodal).
• It is not capable of further algebraic treatment. Unlike the mean, it is
impossible to find the combined mode of several series.
Relationship between Mean, Median and Mode

• Case 1: If the curve is perfectly symmetric, then the three
averages coincide, i.e. Mode = Median = Mean; see Figure (a).
• Case 2: If the curve is asymmetric and skewed to the right, then
Mean > Median > Mode; see Figure (b).
• Case 3: If the curve is asymmetric and skewed to the left, then
Mode > Median > Mean; see Figure (c).
• Case 4: If the skewness of the distribution is moderate (i.e. a
slight skew), then approximately
$$\frac{\text{Mode} - \text{Mean}}{3} = \text{Median} - \text{Mean}$$

[Figures: (a) Symmetric curve. (b) Curve skewed to the right. (c) Curve skewed to the left.]

5. Geometric Mean

Definition
The geometric mean (GM) is defined as the nth root of the product of
n values.

To find the geometric mean of the data, we must differentiate between
two cases:
• Geometric Mean for raw data.
• Geometric Mean for grouped data.
We will start with the first case.
First: Geometric mean for raw data.

The geometric mean (G.M) for the sample $x_1, x_2, \ldots, x_n$ is
$$G.M = \sqrt[n]{x_1 x_2 \cdots x_n}$$

The above formula is not easy to compute when the sample size is large.
To overcome this problem, we take the logarithm of both sides, i.e.,
$$\log G.M = \log \sqrt[n]{x_1 x_2 \cdots x_n} = \log (x_1 x_2 \cdots x_n)^{\frac{1}{n}} = \frac{1}{n} \log (x_1 x_2 \cdots x_n) = \frac{1}{n} \left( \sum_{i=1}^{n} \log x_i \right)$$

Hence,
$$G.M = 10^{M}, \quad \text{where} \quad M = \frac{1}{n} \left( \sum_{i=1}^{n} \log x_i \right)$$
Example (3-10):
Find the GM for 8, 4, 2.

Sol: $G.M = \sqrt[3]{(2)(4)(8)} = \sqrt[3]{64} = 4$
or
$\log G.M = \frac{1}{3} (\log 2 + \log 4 + \log 8) = 0.60206 \Rightarrow G.M = 10^{0.60206} = 4.$

Remark:
The previous example can also be solved using the natural logarithm
as follows:
$\ln G.M = \frac{1}{3} (\ln 2 + \ln 4 + \ln 8) = 1.386 \Rightarrow G.M = \exp(1.386) = 4.$

The geometric mean is useful in finding the average of percentages,
ratios, indexes, or growth rates.


Example (3-10):
A person receives a 20% raise after 1 year of service and a 10%
raise after the second year of service. Find the average percentage
raise per year.

Sol: Note that the average percentage raise per year is not 15% but
14.89%, as shown below. Each raise multiplies the salary by a growth
factor (1.20 and 1.10), and the geometric mean of these factors is
$$G.M = \sqrt[2]{(1.20)(1.10)} = 1.1489$$
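
Both geometric mean examples above can be verified with the logarithmic form of the formula. This is a small sketch; the function name geometric_mean is hypothetical, and it assumes all values are positive.

```python
import math

def geometric_mean(values):
    """G.M = 10**M, where M is the mean of the base-10 logs (positive values only)."""
    m = sum(math.log10(v) for v in values) / len(values)
    return 10 ** m

gm_small = geometric_mean([8, 4, 2])        # close to 4
gm_raise = geometric_mean([1.20, 1.10])     # average growth factor per year
print(round(gm_small, 4), round(gm_raise, 4))  # 4.0 1.1489
```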

Second: Geometric mean for grouped data

To find the geometric mean for grouped data, follow these two steps:
Step 1: Find the class midpoints $x_1, x_2, \ldots, x_k$.
Step 2: Compute the geometric mean for grouped data by the
following formula:
$$G.M = \sqrt[n]{x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k}}, \qquad (1)$$
where $f_1, f_2, \ldots, f_k$ are the class frequencies and $n = \sum_{i=1}^{k} f_i$.
When the sample size is large, we may prefer to work with
logarithms. So, taking the logarithm of equation (1), we get
$$\log G.M = \log \sqrt[n]{x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k}} = \log \left( x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k} \right)^{\frac{1}{n}} = \frac{1}{n} \log \left( x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k} \right) = \frac{1}{n} \left( \sum_{i=1}^{k} f_i \log x_i \right)$$
$$\therefore G.M = 10^{M}, \quad \text{where} \quad M = \frac{1}{n} \left( \sum_{i=1}^{k} f_i \log x_i \right)$$


Merits of geometric mean:

• It is based on all the observations of the series, and gives more
weight to the small values and less weight to the large values.
• It is suitable for measuring relative changes.
• It is used in averaging ratios and percentages and in determining
the rate of gradual increase and decrease.
• It is capable of further algebraic treatment.
Demerits of geometric mean:
• It is difficult to calculate, as it involves finding the root of
the product of the values, either directly or through
logarithmic operations.
• It cannot be calculated if the number of negative values is odd.
• It cannot be calculated if any value of a series is zero.
6. Harmonic Mean

Definition
The harmonic mean (HM) is defined as the number of values divided
by the sum of the reciprocals of each value.

To find the harmonic mean of the data, we must differentiate between
two cases:
• Harmonic Mean for raw data.
• Harmonic Mean for grouped data.
We will start with the first case.

First: Harmonic mean for raw data.

The harmonic mean (H.M) for the sample $x_1, x_2, \ldots, x_n$ is
$$\frac{1}{H.M} = \frac{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}{n} = \frac{1}{n} \left( \frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n} \right) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}$$
$$\therefore H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

Remark:
The harmonic mean is useful for finding the average speed.
Example(3-11):
Find the harmonic mean for 8, 4, 2.

Sol:
Note that n = 3. Then
$$\frac{1}{H.M} = \frac{1}{3} \sum_{i=1}^{3} \frac{1}{x_i} = \frac{1}{3} \left( \frac{1}{2} + \frac{1}{4} + \frac{1}{8} \right) = \frac{1}{3} \times \frac{7}{8} = \frac{7}{24}$$

Hence,
$$H.M = \frac{24}{7} = 3.43$$

Example(3-12):
Suppose a person drove 100 miles at 40 miles per hour and returned
driving 50 miles per hour. Find the average miles per hour.


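Sol: Note that the average speed is not 45 miles per hour,
which is found by adding 40 and 50 and dividing by 2. Because the two
trips cover the same distance, the average is the harmonic mean of the
two speeds:
$$H.M = \frac{2}{\frac{1}{40} + \frac{1}{50}} = 44.44 \text{ miles per hour.}$$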
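
The two harmonic mean examples above can be checked numerically. The sketch below assumes all values are nonzero; the function name harmonic_mean is hypothetical. The last lines confirm the driving result independently as total distance divided by total time.

```python
def harmonic_mean(values):
    """H.M = n divided by the sum of the reciprocals (values must be nonzero)."""
    return len(values) / sum(1 / v for v in values)

hm_small = harmonic_mean([8, 4, 2])      # 24/7
hm_speed = harmonic_mean([40, 50])       # average speed over equal distances

# Independent check: 200 miles in total, 100/40 + 100/50 hours in total
total_time = 100 / 40 + 100 / 50
avg_speed = 200 / total_time
print(round(hm_small, 2), round(hm_speed, 2), round(avg_speed, 2))
```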

Second: Harmonic mean for grouped data

The harmonic mean for grouped data is given by
$$\frac{1}{H.M} = \frac{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_k}{x_k}}{n} = \frac{1}{n} \left( \frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_k}{x_k} \right) = \frac{1}{n} \sum_{i=1}^{k} \frac{f_i}{x_i}$$
$$\therefore H.M = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}$$

where $x_i$, $i = 1, 2, \ldots, k$ are the class midpoints and
$f_i$, $i = 1, 2, \ldots, k$ are the class frequencies.
Example(3-12):
Use Example (3-2) to find the geometric mean and the harmonic
mean of worker daily wages (in Riyals).

Sol:
To simplify the solution, we construct the following table.

Class limits    Midpoints xi    Frequency fi    fi log xi    fi / xi
20 - 29         24.5            9               12.50        0.367
30 - 39         34.5            12              18.45        0.348
40 - 49         44.5            15              24.73        0.337
50 - 59         54.5            8               13.89        0.147
60 - 69         64.5            4               7.24         0.062
70 - 79         74.5            2               3.74         0.027
Sum                             Σ fi = 50       80.55        1.288

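The geometric mean is computed as follows:
$$\log G.M = \frac{1}{n} \left( \sum_{i=1}^{k} f_i \log x_i \right) = \frac{1}{50} (80.55) = 1.611$$
$$\therefore G.M = 10^{1.611} = 40.83 \text{ SAR}$$

The harmonic mean is computed as follows:
$$\frac{1}{H.M} = \frac{1}{n} \left( \sum_{i=1}^{k} \frac{f_i}{x_i} \right) = \frac{1}{50} \times 1.288 \Longrightarrow H.M = \frac{50}{1.288} = 38.82 \text{ SAR}$$

Note that, together with the mean $\bar{x} = 42.90$ SAR found in Example (3-2),
these results illustrate the following relation:

$$HM \leq GM \leq \bar{x}$$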
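
The grouped geometric and harmonic means can be recomputed without rounding the intermediate table entries. This is a sketch under the same assumed data layout as before; because the logarithms are not rounded to two decimals, the geometric mean comes out very slightly above the hand-computed 40.83.

```python
import math

midpoints = [24.5, 34.5, 44.5, 54.5, 64.5, 74.5]
freqs = [9, 12, 15, 8, 4, 2]
n = sum(freqs)

# Geometric mean: 10**M with M = (1/n) * sum of f_i * log10(x_i)
M = sum(f * math.log10(x) for x, f in zip(midpoints, freqs)) / n
gm_grouped = 10 ** M

# Harmonic mean: n divided by the sum of f_i / x_i
hm_grouped = n / sum(f / x for x, f in zip(midpoints, freqs))

# Arithmetic mean for comparison
mean_grouped = sum(f * x for x, f in zip(midpoints, freqs)) / n

print(round(hm_grouped, 2), round(gm_grouped, 2), round(mean_grouped, 2))
```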
B. Measures of Position
Quartiles, Deciles and Percentiles

In addition to measures of central tendency, there are measures of
position or location. These measures include:
✦ Quartiles
✦ Deciles
✦ Percentiles
They are used to locate the relative position of a data value in the data
set.

First: For raw data

Remember that the median is the midpoint of the arranged data array.

                     50%                                 50%
Smallest |...................................|...................................| Largest
                                             MD

Quartiles
Quartiles divide the distribution into four groups, separated by $Q_1$, $Q_2$, $Q_3$.

The above division points are:
$Q_1$ = 1st quartile = the value with 25% of the data below it, and its rank = $\frac{n}{4}$
$Q_2$ = 2nd quartile = the value with 50% of the data below it, and its rank = $\frac{2n}{4}$
$Q_3$ = 3rd quartile = the value with 75% of the data below it, and its rank = $\frac{3n}{4}$

Finding Data Values Corresponding to Q1 , Q2 , and Q3 :


Step 1: Arrange the data in order from lowest to highest.
Step 2: Find the median of the data values. This is the value for
Q2 .
Step 3: Find the median of the data values that fall below Q2 .
This is the value for Q1 .
Step 4: Find the median of the data values that fall above Q2 .
This is the value for Q3 .

Deciles
Deciles divide the distribution into 10 groups, separated by $D_1, D_2, \ldots, D_9$.

The above division points are:

$D_1$ = 1st decile = the value with 10% of the data below it, and its rank = $\frac{n}{10}$
$D_2$ = 2nd decile = the value with 20% of the data below it, and its rank = $\frac{2n}{10}$

In the same way, we find the rest of the deciles until

$D_9$ = 9th decile = the value with 90% of the data below it, and its rank = $\frac{9n}{10}$

Finding Data Values Corresponding to $D_1$, $D_2$, ..., $D_9$:

Arrange the data in order from lowest to highest. Then, find the
corresponding percentile and follow the steps for finding data values
corresponding to percentiles; these are explained in the Percentiles section.
Percentiles
Percentiles divide the data set into 100 equal groups.

The above division points are:

$P_1$ = 1st percentile = the value with 1% of the data below it, and its rank = $\frac{n}{100}$
$P_2$ = 2nd percentile = the value with 2% of the data below it, and its rank = $\frac{2n}{100}$

In the same way, we find the rest of the percentiles until

$P_{99}$ = 99th percentile = the value with 99% of the data below it, and its rank = $\frac{99n}{100}$

Finding Data Values Corresponding to $P_1$, $P_2$, ..., $P_{99}$:

Step 1: Arrange the data in order from lowest to highest.
Step 2: Substitute into the formula $R(P_k) = \frac{kn}{100}$, where k =
percentile and n = total number of values.

Step 3: If $R(P_k)$ is an integer, use the value halfway between (the
average of) the values in positions $R(P_k)$ and $R(P_k) + 1$, counting
up from the lowest value. If $R(P_k)$ is not an integer, round it up to
the next integer; then, starting at the lowest value, count over
to the value in that position.

Percentile Formula: to find the percentile rank of a given value

The percentile corresponding to a given value X is computed by
using the following formula:
$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\%$$
From the above we infer the following:

1. $MD = Q_2 = D_5 = P_{50}$

2. $Q_1 = P_{25}$, $Q_3 = P_{75}$

3. $D_1 = P_{10}$, $D_2 = P_{20}$, $D_3 = P_{30}$, ..., $D_9 = P_{90}$

4. Quartile number k is denoted by $Q_k$ and its rank is $\frac{kn}{4}$.

5. Decile number k is denoted by $D_k$ and its rank is $\frac{kn}{10}$.

6. Percentile number k is denoted by $P_k$ and its rank is $\frac{kn}{100}$.


Example (3-13):
A teacher gives a 20-point test to 10 students. The scores are:
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
1. Find the 3rd quartile.
2. Find the 25th percentile.
3. Find the 5th decile.
4. Find the percentile rank of a score of 12.

Sol:
Arrange the data in order from lowest to highest:

2, 3, 5, 6, 8, 10, 12, 15, 18, 20

1. Note that $MD = Q_2 = \frac{8 + 10}{2} = 9$. Then, $Q_3$ is the median of the data
values that fall above $Q_2$, namely 10, 12, 15, 18, 20. In this case, $Q_3 = 15$.
2. Note that $R(P_{25}) = \frac{25 \times 10}{100} = 2.5$. Since 2.5 is not an
integer, we round up, so $R(P_{25}) = 3$. Start at the
lowest value and count over to the third value. Then $P_{25} = 5$.
3. Note that $D_5 = P_{50} = MD = Q_2 = 9$.
4. Note that
$$\text{Percentile} = \frac{(\text{number of values below } 12) + 0.5}{\text{total number of values}} \times 100\% = \frac{6 + 0.5}{10} \times 100\% = 65\text{th percentile.}$$
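
The percentile steps and the percentile-rank formula used in Example (3-13) can be sketched in Python. The function names percentile_value and percentile_rank are hypothetical, and the sketch assumes k is between 1 and 99, as in the text.

```python
import math

def percentile_value(data, k):
    """Data value for the k-th percentile (1 <= k <= 99), following the steps above."""
    ordered = sorted(data)                    # Step 1
    rank = k * len(ordered) / 100             # Step 2: R(P_k) = kn/100
    if rank == int(rank):                     # Step 3, integer rank:
        r = int(rank)                         # average of positions r and r+1
        return (ordered[r - 1] + ordered[r]) / 2
    return ordered[math.ceil(rank) - 1]       # otherwise round the rank up

def percentile_rank(data, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in data if v < x)
    return (below + 0.5) / len(data) * 100

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
p25 = percentile_value(scores, 25)     # rank 2.5 -> 3rd value
d5 = percentile_value(scores, 50)      # rank 5 -> average of 5th and 6th values
rank_12 = percentile_rank(scores, 12)
print(p25, d5, round(rank_12))
```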


Second: For grouped data

A- Computational method: the steps are similar to those for calculating
the median for grouped data:
Step 1: Construct the cumulative frequency table.
Step 2: Determine the rank $R(\cdot)$, which is the rank of the required
measure.
Step 3: Determine the following:
A: The lower boundary of the quartile, decile or percentile class.
L: The class width.
F1: The cumulative frequency just before $R(\cdot)$ (at the lower boundary
of the quartile, decile or percentile class).
F2: The cumulative frequency just after $R(\cdot)$ (at the upper boundary
of the quartile, decile or percentile class).
Then, compute the required measure (quartile, decile or
percentile) by
$$\text{measure } (Q, D \text{ or } P) = A + \left( \frac{R(\cdot) - F_1}{F_2 - F_1} \right) \times L$$

Example(3-14):
Use Example (3-2) data that represents the worker daily wages (in
Riyals) to find:
(i) D2
(ii) P99

Sol:

1. Construct the cumulative frequency table. This step is already
done in Example(3-6).

2. Find the ranks for $D_2$ and $P_{99}$:
$$R(D_2) = \frac{2n}{10} = \frac{2 \times 50}{10} = 10, \qquad R(P_{99}) = \frac{99n}{100} = \frac{99 \times 50}{100} = 49.5$$

3. Use the table to locate the positions of $R(D_2)$ and $R(P_{99})$ and
then compute $D_2$ and $P_{99}$.
Daily wages       Cumulative frequency
Less than 19.5    0
Less than 29.5    9  (= F1 for D2)
Less than 39.5    21 (= F2 for D2)
Less than 49.5    36
Less than 59.5    44
Less than 69.5    48 (= F1 for P99)
Less than 79.5    50 (= F2 for P99)
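(i) A = 29.5, L = 39.5 − 29.5 = 10, F1 = 9, F2 = 21

\[ D_2 = A + \left( \frac{\frac{2n}{10} - F_1}{F_2 - F_1} \right) \times L = 29.5 + \left( \frac{10 - 9}{21 - 9} \right) \times 10 = 30.33 \]

(ii) A = 69.5, L = 79.5 − 69.5 = 10, F1 = 48, F2 = 50

\[ P_{99} = A + \left( \frac{\frac{99n}{100} - F_1}{F_2 - F_1} \right) \times L = 69.5 + \left( \frac{49.5 - 48}{50 - 48} \right) \times 10 = 77 \]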
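
The interpolation used in Example(3-14) can be sketched in Python as follows. This is only an illustration and not part of the original notes: the function name grouped_quantile and the way the table is stored (class boundaries paired with "less than" cumulative frequencies) are assumptions made for this sketch.

```python
def grouped_quantile(boundaries, cum_freq, rank):
    """Interpolate a quartile, decile or percentile from a cumulative table.

    cum_freq[i] is the number of observations below boundaries[i]
    (a "less than" cumulative frequency table).
    """
    for i in range(1, len(boundaries)):
        f1, f2 = cum_freq[i - 1], cum_freq[i]
        if f1 <= rank <= f2 and f2 > f1:
            a = boundaries[i - 1]                       # lower boundary A
            width = boundaries[i] - boundaries[i - 1]   # class width L
            return a + (rank - f1) / (f2 - f1) * width
    raise ValueError("rank lies outside the cumulative frequency table")


# Cumulative table of Example (3-14); n = 50 workers.
boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
cum_freq = [0, 9, 21, 36, 44, 48, 50]
n = cum_freq[-1]

d2 = grouped_quantile(boundaries, cum_freq, 2 * n / 10)      # rank 10
p99 = grouped_quantile(boundaries, cum_freq, 99 * n / 100)   # rank 49.5
print(round(d2, 2), round(p99, 2))
```

With the wage table above, the two printed values agree with D2 = 30.33 and P99 = 77.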

B- Graphical method: We follow the same steps as for determining the median graphically.
Example(3-15):
Use Example (3-2) data that represents the worker daily wages (in
Riyals) to find graphically:
(i) D2 (ii) P99

Sol:
Note that the cumulative frequency table is already constructed in Example(3-6). Therefore, the ogive is plotted as follows:

[Figure: ogive of the daily wages. x-axis: Daily wage (class boundaries 19.5, 29.5, ..., 79.5); y-axis: Frequency (Number of workers), from 0 to 50. Readings from the curve: D2 ≈ 30, P99 ≈ 77.]
Remark
The measures of central tendency, as well as the quartiles, deciles, and percentiles, have the same unit as the original data. If the data are in Riyals, then all measures of central tendency, as well as the quartiles, deciles, and percentiles, are also in Riyals.

Chapter 4
Measures of Dispersion

Content:
✦ Range and Semi-interquartile range
✦ Mean deviation
✦ Standard deviation
✦ Coefficient of variation
✦ Chebyshev’s Theorem
✦ Z-score

1. Range and Semi-interquartile Range

Range
First: Range for raw data

We have mentioned in Chapter 2 how we calculate the range for raw data. Recall that the symbol R is used for the range. Then

R = Highest value − Lowest value

The range is useful for showing the spread within a dataset and for comparing the spread between similar datasets.
Second: Range for grouped data

There are two methods for calculating the range in the case of grouped data:

Method 1: R = Midpoint of highest class − Midpoint of lowest class


Method 2: R = Upper boundary of highest class − Lower boundary of lowest class

Example (4-1):
Find the range for the marks of 30 students in the statistics course shown in the following table:

Classes     24-28   29-33   34-38   39-43   44-48   49-53
Frequency     3       4       7       6       8       2

Sol:

Method 1: R = Midpoint of highest class − Midpoint of lowest class

\[ R = \frac{53.5 + 48.5}{2} - \frac{28.5 + 23.5}{2} = 51 - 26 = 25 \]

Method 2: R = Upper boundary of highest class − Lower boundary of lowest class

\[ R = 53.5 - 23.5 = 30 \]

We notice that the two methods give two different answers for the range. The first method is preferable in calculating the range.
Merits of Range:

• It is simple to understand and easy to calculate.


• It is less time consuming.
• Commonly used, especially in daily temperatures, in production,
and in the stock exchange (stock prices traded per day).

Demerits of Range:

• It is not based on all the values of the data.


• It is very much affected by the extreme values.
• It is not capable of further algebraic treatment.
• Range cannot be computed in case of open-end classes.

Semi-interquartile range
We already know that the range is affected by extreme values and, because of this, it is not a reliable measure of the dispersion of the data. Therefore, we need another measure that is not affected by the extreme values at the top and the bottom of the data. This measure is the semi-interquartile range.

First: Semi-interquartile range for raw data

To find the semi-interquartile range for raw data, follow these steps:
Step 1: Arrange the data in order.
Step 2: Calculate the first and third quartiles, Q1 and Q3.
Step 3: Compute the semi-interquartile range by

\[ Q = \frac{Q_3 - Q_1}{2} \]
Example (4-2):
Find the semi-interquartile range for the following two samples:
Sample 1: 22, 24, 36, 21, 25, 30, 20, 28
Sample 2: 21, 20, 25, 17, 19, 15, 22, 18, 23, 24

Sol:
Sample 1: order the data first:

20, 21, 22, 24, | 25, 28, 30, 36

Then, $Q_1 = \frac{21 + 22}{2} = \frac{43}{2} = 21.5$ and $Q_3 = \frac{28 + 30}{2} = \frac{58}{2} = 29$. Hence,

\[ Q = \frac{Q_3 - Q_1}{2} = \frac{29 - 21.5}{2} = \frac{7.5}{2} = 3.75 \]


Sample 2: order the data first:

15, 17, 18, 19, 20, |21, 22, 23, 24, 25

Then, Q1 = 18 and Q3 = 23. Hence,

\[ Q = \frac{Q_3 - Q_1}{2} = \frac{23 - 18}{2} = \frac{5}{2} = 2.5 \]
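
The two samples of Example (4-2) can be checked with a short Python sketch. It assumes the rank rule for percentiles from Chapter 3 (compute c = n·p/100; if c is a whole number, average the c-th and (c+1)-th ordered values, otherwise round c up). The helper names quantile_by_rank and semi_interquartile_range are hypothetical and chosen only for this illustration.

```python
import math


def quantile_by_rank(sorted_values, p):
    """p-th percentile by the rank rule assumed from Chapter 3."""
    n = len(sorted_values)
    c = n * p / 100
    if c == int(c):
        c = int(c)
        # whole rank: average the c-th and (c+1)-th values (1-based)
        return (sorted_values[c - 1] + sorted_values[c]) / 2
    # fractional rank: round up and take that value
    return sorted_values[math.ceil(c) - 1]


def semi_interquartile_range(values):
    data = sorted(values)                 # Step 1: arrange the data
    q1 = quantile_by_rank(data, 25)       # Step 2: first quartile
    q3 = quantile_by_rank(data, 75)       #         third quartile
    return (q3 - q1) / 2                  # Step 3: Q = (Q3 - Q1) / 2


sample_1 = [22, 24, 36, 21, 25, 30, 20, 28]
sample_2 = [21, 20, 25, 17, 19, 15, 22, 18, 23, 24]
q_sample_1 = semi_interquartile_range(sample_1)
q_sample_2 = semi_interquartile_range(sample_2)
print(q_sample_1, q_sample_2)
```

Other software may use a different quartile convention and give slightly different quartiles for the same data.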
Second: Semi-interquartile range for grouped data

We have already explained how to calculate the quartiles for grouped data in Chapter 3; review that section if needed. Once Q1 and Q3 are known, calculating the semi-interquartile range for grouped data is straightforward.

Exercise (4-1):
Use Example (4-1) to find the semi-interquartile range.


2. Mean deviation

The sum of deviations of the items from their arithmetic mean is always zero, so it cannot be used to measure dispersion. To overcome this, we take the absolute values of the deviations, which leads to the mean deviation.
Mean deviation
is the mean of the absolute deviations of a set of data about the data’s
mean. The symbol for mean deviation is MAD.

First: Mean deviation for raw data

For a sample of size n, the mean deviation is defined by

\[ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]
Example (4-2):
Find the mean deviation for 14, 16, 10, 8, 2.

Sol:
We should compute the mean first:

\[ \bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{2 + 8 + 10 + 16 + 14}{5} = \frac{50}{5} = 10. \]

Then, the MAD is

\[ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| = \frac{1}{5} \left[ |2 - 10| + |8 - 10| + |10 - 10| + |16 - 10| + |14 - 10| \right] = \frac{1}{5} [8 + 2 + 0 + 6 + 4] = \frac{20}{5} = 4. \]
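
As a quick check of the raw-data formula, the following Python sketch recomputes the mean deviation for the data 14, 16, 10, 8, 2. The function name mean_absolute_deviation is a hypothetical helper introduced only for this illustration.

```python
def mean_absolute_deviation(values):
    """Mean of the absolute deviations about the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


data = [14, 16, 10, 8, 2]
mean = sum(data) / len(data)

# The signed deviations always cancel, which is why absolute values are used.
signed_sum = sum(x - mean for x in data)
mad = mean_absolute_deviation(data)
print(signed_sum, mad)
```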

Second: Mean deviation for grouped data

The mean deviation for grouped data is calculated by

\[ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{k} f_i |x_i - \bar{x}| \]

where fi is the frequency of class i and xi is the midpoint of class i.
Example (4-3):
Use Example (4-1) to find the mean deviation.

Sol:
We construct the following table.

Classes    Midpoints xi   Frequency fi   fi xi   xi − x̄   fi |xi − x̄|
24 − 28        26              3           78      −13          39
29 − 33        31              4          124       −8          32
34 − 38        36              7          252       −3          21
39 − 43        41              6          246        2          12
44 − 48        46              8          368        7          56
49 − 53        51              2          102       12          24
Sum                           30         1170                  184

\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{1170}{30} = 39. \]

\[ \mathrm{MAD} = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n} = \frac{184}{30} = 6.13 \]

Remark: The mean deviation can be calculated using the median or any other mean.
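
The grouped-data calculation of Example (4-3) can be reproduced with the Python sketch below. It is an illustration only; the variable names are assumptions, and the class midpoints are computed from the class limits of Example (4-1).

```python
# Class limits and frequencies from Example (4-1).
classes = [(24, 28), (29, 33), (34, 38), (39, 43), (44, 48), (49, 53)]
freqs = [3, 4, 7, 6, 8, 2]

midpoints = [(low + high) / 2 for low, high in classes]
n = sum(freqs)

# Grouped mean: sum of f_i * x_i divided by n.
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Grouped mean deviation: sum of f_i * |x_i - mean| divided by n.
mad = sum(f * abs(x - mean) for f, x in zip(freqs, midpoints)) / n
print(mean, round(mad, 2))
```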

3. Variance and Standard deviation

First: Variance and Standard deviation for raw data

Let x1, x2, ..., xN represent the observation values of a population with mean µ. The squared deviations of these values from their mean are (x1 − µ)², (x2 − µ)², ..., (xN − µ)².
Variance
measures how far a data set is spread out. It is defined as the average of the squared differences from the mean. The symbol for the population variance is σ².
The formula for the population variance is

\[ \sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \ldots + (x_N - \mu)^2}{N} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

Standard Deviation
is the square root of the variance. The symbol for the population standard deviation is σ.

The formula for the population standard deviation is

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]

Standard deviation is considered to be the best measure of dispersion and is, therefore, the most widely used one. Although it is tedious to calculate by hand for a large statistical population, computers have removed this difficulty.
For a sample with size n and mean x̄, the variance, denoted by S², is computed by

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{2} \]

Note that, in the case of sampling, we find the variance of the sample by dividing by n − 1 instead of n, giving a slightly larger value and an unbiased estimate of the population variance.
The standard deviation of a sample (denoted by S) is

\[ S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \]
Remark: If the sample size is large (greater than 30), then σ and S are approximately equal.
Example (4-4):
Find the variance and standard deviation for the data:
12, 15, 11, 17, 18, 20, 19.
Sol: Note that $\bar{x} = \sum x_i / n = 112/7 = 16$. Then

xi     (xi − x̄)          (xi − x̄)²
12     12 − 16 = −4       16
15     15 − 16 = −1        1
11     11 − 16 = −5       25
17     17 − 16 = 1         1
18     18 − 16 = 2         4
20     20 − 16 = 4        16
19     19 − 16 = 3         9
Σ xi = 112                Σ (xi − x̄)² = 72

The variance is: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 72/6 = 12$

The standard deviation is: $S = \sqrt{12} = 3.46$
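
The same numbers can be obtained in Python. The sketch below applies formula (2) directly and, as a cross-check, calls variance, pvariance and stdev from the standard library module statistics; the helper name sample_variance is hypothetical.

```python
import statistics


def sample_variance(values):
    """Formula (2): squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [12, 15, 11, 17, 18, 20, 19]

s2 = sample_variance(data)              # divides by n - 1
s = s2 ** 0.5
sigma2 = statistics.pvariance(data)     # divides by n (population formula)

print(s2, round(s, 2), round(sigma2, 2))
print(statistics.variance(data), round(statistics.stdev(data), 2))
```

Dividing by n instead of n − 1 gives 72/7, about 10.29, which is slightly smaller than the sample variance 12, as noted above.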

The formula in (2) can be simplified to

\[ S^2 = \frac{1}{n-1} \left[ \sum x^2 - \frac{(\sum x)^2}{n} \right] \tag{3} \]

Proof:

\[ S^2 = \frac{1}{n-1} \sum (x - \bar{x})^2 = \frac{1}{n-1} \sum (x^2 - 2x\bar{x} + \bar{x}^2) \]
\[ = \frac{1}{n-1} \left[ \sum x^2 - 2\bar{x} \sum x + n\bar{x}^2 \right] \]
\[ = \frac{1}{n-1} \left[ \sum x^2 - 2 \left( \frac{\sum x}{n} \right) \sum x + n \left( \frac{\sum x}{n} \right)^2 \right] \]
\[ = \frac{1}{n-1} \left[ \sum x^2 - \frac{(\sum x)^2}{n} \right] \]
Example (4-5):
Use Example (4-4) and the formula in (3) to find the variance and
standard deviation.

Sol:
xi     xi²
12     144
15     225
11     121
17     289
18     324
20     400
19     361
Σ xi = 112    Σ xi² = 1864

\[ S^2 = \frac{1}{n-1} \left[ \sum x^2 - \frac{(\sum x)^2}{n} \right] = \frac{1}{7-1} \left[ 1864 - \frac{(112)^2}{7} \right] = \frac{72}{6} = 12 \]

\[ S = \sqrt{S^2} = \sqrt{12} = 3.46 \]

(what do you notice?)
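
A small Python sketch can confirm that the shortcut formula (3) and the definition (2) agree on the data of Example (4-4). The function names variance_definition and variance_shortcut are hypothetical and used only here.

```python
def variance_definition(values):
    """Formula (2): uses the deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """Formula (3): needs only the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [12, 15, 11, 17, 18, 20, 19]
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
v_def = variance_definition(data)
v_short = variance_shortcut(data)
print(sum_x, sum_x2, v_def, v_short)
```

Both routes give the same variance, which is the point of the question at the end of Example (4-5).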

Remark:
The variance is a non-negative number. It equals zero when all observations are equal, and its value increases as the variation among the observations increases.
Example (4-6):
The following samples represent the scores of 3 students in the first
monthly test of Statistics. Find the mean and variance. What do you
see?

Sol:
Sample 1: 10, 10, 10; x̄ = 10, S² = 0
Sample 2: 8, 10, 12; x̄ = 10, S² = 4
Sample 3: 4, 10, 16; x̄ = 10, S² = 36
Note that although the mean for all samples is equal, the sample variations are different.
Properties of Variance and Standard deviation
Property 1: Variance (and standard deviation) is not affected by addition and subtraction operations.
Proof
Suppose we have the sample x1, x2, ..., xn and c ∈ R is a constant. When adding or subtracting the number c from each sample value we obtain d1 = x1 ± c, d2 = x2 ± c, ..., dn = xn ± c. This means that d = x ± c. Now, substitute x = d ∓ c into $S^2 = \frac{1}{n-1} \sum (x - \bar{x})^2$ to get

\[ S^2 = \frac{1}{n-1} \sum \left[ (d \mp c) - (\bar{d} \mp c) \right]^2 = \frac{1}{n-1} \sum (d - \bar{d})^2 = \frac{1}{n-1} \left[ \sum d^2 - \frac{1}{n} \left( \sum d \right)^2 \right] \tag{4} \]


Property 2: The value of variance (and standard deviation) is affected when all data values are multiplied by a constant number.
Proof: Suppose we have the sample x1, x2, ..., xn and c ∈ R, c ≠ 0, is a constant. When multiplying each sample value by the number c we obtain d1 = cx1, d2 = cx2, ..., dn = cxn. This means that d = cx. Now, substitute $x = \frac{d}{c}$ into $S^2 = \frac{1}{n-1} \sum (x - \bar{x})^2$ to get

\[ S_x^2 = \frac{1}{n-1} \sum \left( \frac{d}{c} - \frac{\bar{d}}{c} \right)^2 = \frac{1}{c^2} \cdot \frac{1}{n-1} \sum (d - \bar{d})^2 \]

From the above we infer the following:

\[ S_x^2 = \frac{1}{c^2} S_d^2 \;\rightarrow\; S_x = \frac{1}{|c|} S_d \]

\[ \therefore \; S_{new}^2 = c^2 S_x^2 \;\rightarrow\; S_{new} = |c| \, S_x \]

Property 3: The sum of squared deviations of the values from their mean x̄ is smaller than the sum of the squared deviations of the values from any other assumed mean a, where a ≠ x̄.
Proof:

\[ \sum (x - a)^2 = \sum (x - \bar{x} + \bar{x} - a)^2 \]
\[ = \sum [(x - \bar{x}) + (\bar{x} - a)]^2 \]
\[ = \sum (x - \bar{x})^2 + 2(\bar{x} - a) \sum (x - \bar{x}) + n(\bar{x} - a)^2 \]
\[ = \sum (x - \bar{x})^2 + n(\bar{x} - a)^2 \qquad \text{since } \sum (x - \bar{x}) = 0 \]
\[ > \sum (x - \bar{x})^2 \]

This implies that $\sum (x - \bar{x})^2 < \sum (x - a)^2$.
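
Properties 1 to 3 can be illustrated numerically. The Python sketch below reuses the data of Example (4-4); the constant c = 5 and the alternative centre x̄ + 1 are arbitrary values chosen for the demonstration, and sum_sq_about is a hypothetical helper.

```python
from statistics import mean, variance

data = [12, 15, 11, 17, 18, 20, 19]   # data of Example (4-4)
c = 5                                  # arbitrary constant for the demo

shifted = [x + c for x in data]        # Property 1: add a constant
scaled = [c * x for x in data]         # Property 2: multiply by a constant

var_x = variance(data)
var_shifted = variance(shifted)        # unchanged by the shift
var_scaled = variance(scaled)          # multiplied by c squared


def sum_sq_about(values, a):
    """Sum of squared deviations of the values about the point a."""
    return sum((x - a) ** 2 for x in values)


# Property 3: the sum of squares is smallest about the mean.
x_bar = mean(data)
about_mean = sum_sq_about(data, x_bar)
about_other = sum_sq_about(data, x_bar + 1)

print(var_x, var_shifted, var_scaled)
print(about_mean, about_other)
```

Here the sum of squares about x̄ + 1 exceeds the one about x̄ by n(x̄ − a)² = 7, in line with the proof of Property 3.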

Property 4: The variance obtained by combining two samples with sizes n1 and n2 and variances S1² and S2², respectively, when x̄1 = x̄2, is

\[ S^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 1} \]

Proof:
Suppose the two samples are x1, x2, ..., x_{n1} and y1, y2, ..., y_{n2}, with common mean x̄. Then

\[ S_1^2 = \frac{1}{n_1 - 1} \sum (x_i - \bar{x})^2, \qquad S_2^2 = \frac{1}{n_2 - 1} \sum (y_i - \bar{x})^2 \]
\[ S_1^2 (n_1 - 1) = \sum (x_i - \bar{x})^2, \qquad S_2^2 (n_2 - 1) = \sum (y_i - \bar{x})^2 \]
\[ S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1) = \sum (x_i - \bar{x})^2 + \sum (y_i - \bar{x})^2 = \sum_{i=1}^{n_1 + n_2} (x_i - \bar{x})^2, \quad \text{where } x_i = y_{i - n_1} \text{ for } i > n_1 \]
\[ S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1) = S^2 (n_1 + n_2 - 1), \]

This implies that

\[ S^2 = \frac{S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)}{(n_1 + n_2 - 1)} \]

Example (4-6):
Find the variance and standard deviation after merging the following
two samples:
Sample 1: n1 = 5, x̄1 = 4, S12 = 3.5
Sample 2: n2 = 6, x̄2 = 4, S22 = 3

Sol:

\[ S^2 = \frac{3.5(5 - 1) + 3(6 - 1)}{(5 + 6 - 1)} = \frac{3.5 \times 4 + 3 \times 5}{10} = 2.9 \]

\[ S = \sqrt{S^2} = \sqrt{2.9} = 1.7 \]
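
Property 4 can be sketched in Python as below. The function name combined_variance_equal_means is hypothetical, and the two small samples xs and ys are made-up values (both with mean 4) used only to compare the formula with a direct calculation on the merged data.

```python
from statistics import variance


def combined_variance_equal_means(n1, var1, n2, var2):
    """Variance of two merged samples that share the same mean."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 1)


# Summary figures from the example above.
s2 = combined_variance_equal_means(5, 3.5, 6, 3)
s = s2 ** 0.5
print(round(s2, 2), round(s, 2))

# Cross-check on made-up data with equal means (both means are 4).
xs = [2, 4, 6]
ys = [3, 4, 5]
by_formula = combined_variance_equal_means(
    len(xs), variance(xs), len(ys), variance(ys)
)
direct = variance(xs + ys)
print(by_formula, direct)
```

The formula relies on the two means being equal; if they differ, the direct calculation on the merged data will not match it.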

Property 5: The standard deviation of a set of data is greater than or equal to its mean deviation (why?).
Second: Variance and Standard deviation for grouped data

If we have sample data of size n summarised in a frequency distribution table where:
• The number of classes is k.
• Midpoints are: x1, x2, ..., xk.
• Class frequencies are: f1, f2, ..., fk.
Then, the variance is computed by using one of the following formulas.

\[ S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n-1}, \quad \text{where } \bar{x} = \frac{\sum f_i x_i}{n} \]

\[ S^2 = \frac{1}{n-1} \left( \sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{n} \right) \]

\[ S^2 = \frac{1}{n-1} \left( \sum f_i d_i^2 - \frac{(\sum f_i d_i)^2}{n} \right), \quad \text{where } d_i = x_i \pm c \text{ for a constant } c \text{ (see Property 1)} \]

The standard deviation is obtained by taking the square root of the variance.
Example (4-7):
Use Example (4-1) to find the variance and standard deviation.

Sol:
Construct the following table:

Class limits   Midpoints xi   Frequency fi   fi xi   (xi − x̄)²   fi (xi − x̄)²
24 − 28            26              3           78       169          507
29 − 33            31              4          124        64          256
34 − 38            36              7          252         9           63
39 − 43            41              6          246         4           24
44 − 48            46              8          368        49          392
49 − 53            51              2          102       144          288
Sum                               30         1170                   1530

Using the first formula of variance, we get

\[ \bar{x} = \frac{\sum_{i=1}^{6} f_i x_i}{n} = \frac{1170}{30} = 39, \]

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{6} f_i (x_i - \bar{x})^2 = \frac{1}{30-1} (1530) = 52.76, \]

\[ S = \sqrt{S^2} = \sqrt{52.76} = 7.26. \]
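
The grouped-data variance of Example (4-7) can be verified with a short Python sketch that evaluates both the deviation form and the sum-of-squares form of the formula. Variable names are assumptions made for this illustration.

```python
# Midpoints and frequencies of the classes in Example (4-1).
midpoints = [26, 31, 36, 41, 46, 51]
freqs = [3, 4, 7, 6, 8, 2]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
mean = sum_fx / n

# First formula: weighted squared deviations from the mean.
var_dev = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / (n - 1)

# Second formula: uses only the sums of f*x and f*x^2.
var_short = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

sd = var_dev ** 0.5
print(mean, round(var_dev, 2), round(var_short, 2), round(sd, 2))
```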
4. Coefficient of Variation

Whenever two samples have the same units of measure, the variance and standard deviation for each can be compared directly. But what if the units of the two samples are different?

How can we compare two variables with different units?

Sol:
A statistic that allows you to compare standard deviations when the units are different is called the coefficient of variation.

Definition
The coefficient of variation is a relative measure, denoted by CV, and
defined as the standard deviation divided by the mean.

For samples: $CV = \frac{s}{\bar{x}}$. For populations: $CV = \frac{\sigma}{\mu}$.

The CV can also be computed by using the first and third quartiles as

\[ CV = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]

The sample with the larger coefficient of variation has the greater relative dispersion, i.e. it is less homogeneous, and vice versa.
Uses of the coefficient of Variation:
• To assess the precision of a technique.
• Used as a measure of variability when the standard deviation is
proportional to the mean.
• To compare the variability of measurements made in different units.
• Data values must be positive.
• The arithmetic mean must be greater than zero.


Example (4-8):
The mean of the number of sales of cars over a 3-month period is 87,
and the standard deviation is 5. The mean of the commissions is
$5225, and the standard deviation is $773. Compare the variations of
the two.

Sol:
The coefficients of variation are

\[ CV = \frac{s_x}{\bar{x}} = \frac{5}{87} = 0.057 \quad \text{(sales)} \]

\[ CV = \frac{s_y}{\bar{y}} = \frac{773}{5225} = 0.148 \quad \text{(commissions)} \]

Since the coefficient of variation is larger for commissions, the commissions are more variable (or dispersed) than the sales.
Example (4-9):
The table below includes two data sets from a study that was applied
to a group of people to measure weight (in kilograms) and height (in
centimeters). Which set of data is more variable?
Person    1     2     3     4     5     6     7     8
Height   170   155   161   160   152   164   156   149
Weight    66    90    64    68    73    65    58    57

Sol:
First find the mean and standard deviation for each variable. The mean and standard deviation for the heights, say X, are

\[ \bar{x} = \frac{\sum x}{n} = \frac{1267}{8} = 158.38, \qquad S_x = \sqrt{\frac{1}{n-1} \sum (x - \bar{x})^2} = 6.78 \]

For the weights, say Y:

\[ \bar{y} = \frac{\sum y}{n} = \frac{541}{8} = 67.63, \qquad S_y = \sqrt{\frac{1}{n-1} \sum (y - \bar{y})^2} = 10.41 \]

Now, the coefficients of variation are

\[ CV_x = \frac{S_x}{\bar{x}} = \frac{6.78}{158.38} = 0.043 \]

\[ CV_y = \frac{S_y}{\bar{y}} = \frac{10.41}{67.63} = 0.154 \]

Note that CV_y > CV_x. Since the coefficient of variation is larger for weights, the weights are more variable (or dispersed) than the heights.
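
The height and weight comparison of Example (4-9) can be reproduced with the Python sketch below, which uses mean and stdev from the standard library module statistics. The function name coefficient_of_variation is a hypothetical helper.

```python
from statistics import mean, stdev

heights = [170, 155, 161, 160, 152, 164, 156, 149]   # centimeters
weights = [66, 90, 64, 68, 73, 65, 58, 57]           # kilograms


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return stdev(values) / mean(values)


cv_heights = coefficient_of_variation(heights)
cv_weights = coefficient_of_variation(weights)
print(round(cv_heights, 3), round(cv_weights, 3))
print("weights more variable:", cv_weights > cv_heights)
```

Because the units cancel in the ratio, the two values can be compared even though one variable is in centimeters and the other in kilograms.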
5. Chebyshev’s Theorem

Chebyshev’s theorem, developed by the Russian mathematician Chebyshev (1821–1894), specifies the proportions of the spread in terms of the standard deviation. When the mean and standard deviation of a sample’s data are known, Chebyshev’s theorem is used to find the minimum percentage of data values that fall in a given interval, or to determine the interval within which at least a certain percentage of the data falls.

Theorem
For any k > 1, at least $\left( 1 - \frac{1}{k^2} \right)$ of the data lie within ±k standard deviations, S, of the mean, x̄, i.e. in (x̄ − kS, x̄ + kS), regardless of the shape of the distribution.

Example (4-10):
It was found that the mean amount of vitamin C in a particular type of fruit was 0.24 mg with a standard deviation of 0.004 mg. What is the minimum percentage of fruits whose vitamin C content falls in (0.232, 0.248) mg?

Sol:
Note that x̄ = 0.24 and S = 0.004. Then

\[ (\bar{x} - kS, \; \bar{x} + kS) = (0.232, 0.248) \;\Rightarrow\; \bar{x} + kS = 0.248 \]

\[ \Leftrightarrow \; 0.24 + k(0.004) = 0.248 \;\Rightarrow\; k = 2 \]

\[ \therefore \; \left( 1 - \frac{1}{k^2} \right) = \left( 1 - \frac{1}{2^2} \right) = \frac{3}{4} = 0.75 \]

Hence, at least 75% of the fruits have a vitamin C content in (0.232, 0.248) mg.
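
The two steps of Example (4-10), finding k from the interval and then applying the bound, can be sketched in Python. The function name chebyshev_lower_bound is hypothetical, and the sketch assumes the interval is symmetric about the mean.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


mean, sd = 0.24, 0.004
lower, upper = 0.232, 0.248

# Step 1: number of standard deviations from the mean to the interval end.
k = (upper - mean) / sd

# Step 2: apply the theorem.
bound = chebyshev_lower_bound(k)
print(round(k, 6), round(bound, 6))
```

For example, k = 3 gives a bound of 1 − 1/9, i.e. at least about 88.9% of the data within three standard deviations.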
Revisit the measures of position: Standard Scores

Definition
A z score or standard score for a value is obtained by subtracting the
mean from the value and dividing the result by the standard deviation.
The symbol for a standard score is z. Mathematically,

For samples x1, x2, ..., xn:

\[ z_i = \frac{x_i - \bar{x}}{S} \]

For populations x1, x2, ..., xN:

\[ z_i = \frac{x_i - \mu}{\sigma} \]

The z score represents the number of standard deviations that a data value falls above or below the mean.

Example (4-11):
A student scored 83, 85 and 80 on statistics, mathematics and physics tests, respectively. The tests had means 73, 71 and 68 and standard deviations 5, 8 and 11, respectively. Compare his relative positions on these three tests.

Sol:
Let z1, z2, z3 denote the student's standard scores in statistics, mathematics and physics, respectively. Then

\[ z_1 = \frac{x_1 - \bar{x}_1}{S_1} = \frac{83 - 73}{5} = \frac{10}{5} = 2, \]

\[ z_2 = \frac{x_2 - \bar{x}_2}{S_2} = \frac{85 - 71}{8} = \frac{14}{8} = 1.75, \]

\[ z_3 = \frac{x_3 - \bar{x}_3}{S_3} = \frac{80 - 68}{11} = \frac{12}{11} = 1.09. \]

Since the z score for statistics is the largest, his relative position in the statistics class is higher than his relative position in the other classes.
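
The comparison in Example (4-11) can be written as a small Python sketch. The function name z_score and the dictionary layout are assumptions made for this illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd


# (score, class mean, class standard deviation) for each test.
tests = {
    "statistics": (83, 73, 5),
    "mathematics": (85, 71, 8),
    "physics": (80, 68, 11),
}

z = {name: z_score(*figures) for name, figures in tests.items()}
best = max(z, key=z.get)

for name, value in z.items():
    print(name, round(value, 2))
print("highest relative position:", best)
```

Note that the highest raw mark (85 in mathematics) does not give the highest relative position, because the z score also accounts for the class mean and spread.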
Chapter 5
Correlation and Simple Linear Regression

Content:
✦ Introduction
✦ Scatter plot
✦ Pearson’s coefficient of linear correlation
✦ Spearman’s rank correlation coefficient
✦ Coefficient of Association, Coefficient of Contingency and Kendall
Rank Coefficient
✦ Simple linear regression

1. Introduction

In previous chapters, our focus was on studying one variable of a phenomenon, such as weights for a group of students or wages for a group of workers.

In this chapter, we will study:

• the relationship between two variables, e.g. the relationship between income and spending.
• measuring the strength of the correlation between the variables.
• expressing the relationship by a mathematical equation.
Suppose that we want to study the relationship between the weight (X) and the height (Y) for a sample of size n, where

(X, Y) = (x1, y1), (x2, y2), ..., (xn, yn)

These data can be represented graphically on two perpendicular axes:
• x-axis: represents the weight (X).
• y-axis: represents the height (Y).

2. Scatter diagrams

When representing the data (X, Y) on the coordinate plane, we get what is called a scatter diagram. These diagrams take different shapes, depending on the nature of the relationship between the variables (X, Y).

[Figure: four scatter diagrams.
(d) No relationship.
(e) There is a linear relationship: when X increases, Y increases.
(f) There is an inverse linear relationship: when X increases, Y decreases.
(g) There is a non-linear relationship.]

From the scatter plots, it is possible to notice a linear relationship between the two variables (X, Y).

Question: How can the strength of the linear correlation between the two variables (X, Y) be measured?

3. Pearson’s coefficient of linear correlation

Pearson’s coefficient of linear correlation
is a measure of the strength of a linear association between two variables (X, Y), and is denoted by r. It is used for quantitative data when the populations from which the samples are obtained are normally distributed.
First: Pearson’s coefficient of linear correlation for raw data

Suppose that X and Y are taken from two populations with size N, where
X = x1, x2, ..., xN
Y = y1, y2, ..., yN
The Pearson’s coefficient of linear correlation r between X and Y is given by:

\[ r = \frac{1}{N} \cdot \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}, \]

where:
N: the population size.
µx: mean of the variable X.
µy: mean of the variable Y.
σx: standard deviation of the variable X.
σy: standard deviation of the variable Y.
For a sample with size n, the Pearson’s coefficient of linear correlation r is given by:

\[ r = \frac{1}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{S_x S_y} \]

Properties of Pearson’s coefficient of linear correlation

• −1 ≤ r ≤ 1
• r = 1 indicates perfect positive correlation between X and Y, while r = −1 indicates perfect negative correlation.
• r = 0 indicates no linear correlation between X and Y.
• Positive values denote positive linear correlation, and negative values denote negative linear correlation.
• The closer the value is to +1 or −1, the stronger the linear correlation.

Based on |r|, Evans (1996) suggested the following table as a guideline for describing the strength of correlation:

|r| values             0.00 − 0.19   0.20 − 0.39   0.40 − 0.59   0.60 − 0.79   0.80 − 1.00
Types of correlation   very weak     weak          moderate      strong        very strong
Remark 1: Note that:

• The correlation coefficient is not related to the gradient (slope) of the line beyond sharing its positive or negative sign.

• The correlation coefficient is a measure of the linear relationship between variables. If the correlation coefficient is zero, it signifies that there is no linear relationship between the variables. However, this is only for a linear relationship. It is possible that the variables have a strong non-linear relationship. For example, in Figure (d), the scatter plot implies no (linear) correlation; however, there is a perfect quadratic relationship.

Remark 2: From the formula of Pearson correlation, the correlation coefficient r is a relative scale that depends on measuring the amount of deviation of the two variables X and Y from their arithmetic means. Therefore, the value of the correlation coefficient does not change when a constant is added to or subtracted from the values of a variable, or when the values are multiplied or divided by a positive constant (a negative constant only reverses the sign of r).

Remark 3:
Calculating the correlation coefficient r requires calculating Sx, Sy, x̄ and ȳ. These calculations make determining r laborious, so the formula for r can be simplified as follows:

\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left( n \sum x_i^2 - (\sum x_i)^2 \right) \left( n \sum y_i^2 - (\sum y_i)^2 \right)}} \]

where:
n: number of pairs (xi, yi).
Σ xi yi: sum of the products of X and Y.
Σ xi: sum of X.
Σ yi: sum of Y.
Σ xi²: sum of the squares of the variable X.
Σ yi²: sum of the squares of the variable Y.
Example (5-1):
Use the Pearson method to find the correlation coefficient between the expense (X) and the daily spending (Y), in riyals, for seven students whose data are shown in the following table:

Expense (X)    18   20   12   13   14   15   16
Spending (Y)   14   18   11   12   14   12   14

Sol:
Use the following steps to simplify the solution:
• Subtract a constant number, say a = 11, from the X and Y values. (This is an optional step.)
• Construct the following table.

xi   yi   x′i = xi − 11   y′i = yi − 11   x′i y′i   x′i²   y′i²
18   14         7               3            21       49      9
20   18         9               7            63       81     49
12   11         1               0             0        1      0
13   12         2               1             2        4      1
14   14         3               3             9        9      9
15   12         4               1             4       16      1
16   14         5               3            15       25      9
Sum            31              18           114      185     78

\[ r = \frac{n \sum x'_i y'_i - \sum x'_i \sum y'_i}{\sqrt{\left( n \sum x_i'^2 - (\sum x'_i)^2 \right) \left( n \sum y_i'^2 - (\sum y'_i)^2 \right)}} = \frac{7 \times 114 - 31 \times 18}{\sqrt{(7 \times 185 - (31)^2)(7 \times 78 - (18)^2)}} = \frac{240}{272.3} = 0.88 \]

That means there is a very strong positive correlation between the student’s expense and daily spending.
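
The calculation of Example (5-1) can be sketched in Python using the simplified formula of Remark 3. The function name pearson_r is a hypothetical helper; the second call repeats the calculation on the shifted values (subtracting 11) to illustrate that the optional shift does not change r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation via the simplified (sums only) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


expense = [18, 20, 12, 13, 14, 15, 16]
spending = [14, 18, 11, 12, 14, 12, 14]

r = pearson_r(expense, spending)
r_shifted = pearson_r([x - 11 for x in expense], [y - 11 for y in spending])
print(round(r, 2), round(r_shifted, 2))
```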
Second: Pearson’s coefficient of linear correlation for grouped data

In the case of grouped data, the Pearson correlation coefficient is computed by

\[ r = \frac{n \sum f_{x_i y_i} x_i y_i - \sum f_{x_i} x_i \sum f_{y_i} y_i}{\sqrt{\left( n \sum f_{x_i} x_i^2 - (\sum f_{x_i} x_i)^2 \right) \left( n \sum f_{y_i} y_i^2 - (\sum f_{y_i} y_i)^2 \right)}} \]

where xi and yi are the midpoints, f_{xi yi} is the joint frequency of xi and yi, and f_{xi}, f_{yi} are the corresponding marginal (row and column total) frequencies.
Example (5-2):
Find the correlation coefficient r for the marks of 30 students in the subjects of statistics and mathematics, shown in the following table:

Math \ Stat   50 − 59   60 − 69   70 − 79   80 − 89   90 − 99   Sum
50 − 59          3         1         0         0         0        4
60 − 69          0         4         2         0         0        6
70 − 79          0         0         8         1         0        9
80 − 89          0         0         0         6         0        6
90 − 99          1         0         0         0         4        5
Sum              4         5        10         7         4       30

Sol:
Use the following steps to simplify the solution:
• Find the midpoints for all classes. Let X represent the mathematics marks and Y the statistics marks. Note that the midpoints are equal for both subjects. Hence,
xi = yi : 54.5, 64.5, 74.5, 84.5, 94.5,   i = 1, 2, . . . , 5
• The class width is L = 10. Therefore, subtract a constant, say a = 74.5, from the X and Y midpoints, and then divide by L: for X, $u_i = \frac{x_i - 74.5}{10}$, and for Y, $v_i = \frac{y_i - 74.5}{10}$.
• Construct the following table.

The numbers in red in the small squares are calculated by taking the product of ui, vi and fui vi. For example, the number 12 is obtained by (−2) × (−2) × 3 = 12.


Now, the Pearson correlation coefficient is obtained by

$$r = \frac{n\sum f_{u_i v_i} u_i v_i - \sum f_{u_i} u_i \sum f_{v_i} v_i}{\sqrt{\left(n\sum f_{u_i} u_i^2 - (\sum f_{u_i} u_i)^2\right)\left(n\sum f_{v_i} v_i^2 - (\sum f_{v_i} v_i)^2\right)}}$$

$$= \frac{(30 \times 36) - (2 \times 2)}{\sqrt{[(30 \times 48) - 2^2][(30 \times 44) - 2^2]}}$$

$$= \frac{1076}{\sqrt{1436 \times 1316}} = \frac{1076}{1374.69} = 0.78$$

From this result, it is clear that there is a strong positive correlation between the marks in statistics and mathematics.
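The grouped-data calculation of Example (5-2) can be sketched in Python as follows. This is a minimal illustration, assuming the joint frequency table is stored as a list of rows (mathematics classes) and that the coded midpoints u and v take the values −2, …, 2; the name `grouped_pearson` is hypothetical.

```python
from math import sqrt

# Joint frequencies from Example (5-2): rows = mathematics classes (X),
# columns = statistics classes (Y).
freq = [
    [3, 1, 0, 0, 0],
    [0, 4, 2, 0, 0],
    [0, 0, 8, 1, 0],
    [0, 0, 0, 6, 0],
    [1, 0, 0, 0, 4],
]
codes = [-2, -1, 0, 1, 2]  # u_i = v_i = (midpoint - 74.5) / 10

def grouped_pearson(freq, u, v):
    """Pearson's r for a two-way frequency table with coded midpoints."""
    n = sum(sum(row) for row in freq)
    row_tot = [sum(row) for row in freq]          # frequencies of X classes
    col_tot = [sum(col) for col in zip(*freq)]    # frequencies of Y classes
    s_uv = sum(freq[i][j] * u[i] * v[j]
               for i in range(len(u)) for j in range(len(v)))
    s_u = sum(f * a for f, a in zip(row_tot, u))
    s_v = sum(f * b for f, b in zip(col_tot, v))
    s_uu = sum(f * a * a for f, a in zip(row_tot, u))
    s_vv = sum(f * b * b for f, b in zip(col_tot, v))
    return (n * s_uv - s_u * s_v) / sqrt((n * s_uu - s_u ** 2) * (n * s_vv - s_v ** 2))

r = grouped_pearson(freq, codes, codes)
print(round(r, 2))
```

The intermediate sums in the function correspond to the quantities 36, 2, 2, 48 and 44 used in the hand calculation.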
4. Spearman’s rank correlation coefficient

The Spearman rank correlation coefficient, denoted by rs, is a measure of the relationship between two variables when the data are available in the form of rank orders. It is a nonparametric method and can be used for both quantitative and ordinal qualitative data. This method is based on data that can be ranked. For example:
• Students’ grades in a subject, which can be arranged from the highest grade to the lowest or the reverse.
• Students’ marks in a subject, which can be arranged from the largest mark to the smallest or the reverse.


Remark: Ranking means finding the order of the readings of each of the two variables (X, Y) while each reading remains in its original position; ranks can be assigned in ascending or descending order.
Example (5-3):
Find the ranks of X, whose values are given below:

X 14 10 12 8 3 5 6

Sol: Order the data set:


3, 5, 6, 8, 10, 12, 14

Assign ranks 1, 2, 3, 4, 5, 6, 7 to the above values. The solution is

X        14   10   12   8   3   5   6
Rank X   7    5    6    4   1   2   3

Example (5-4):
Find the ranks of the following grades:
B, C, B, F, D, D, A

Sol:
Rank the grades by arranging them from the lowest grade to the highest
grade:
F, D, D, C, B, B, A
Assign ranks 1, 2, 3, 4, 5, 6, 7 to the above grades. Note that the two “D” grades would receive different ranks (2 and 3). In this case, we assign to each one the average of these ranks, that is (2 + 3)/2 = 2.5, and do the same for the two “B” grades, which each receive (5 + 6)/2 = 5.5. The final solution is
Grades X   B     C   B     F   D     D     A
Ranks X    5.5   4   5.5   1   2.5   2.5   7

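The ranking rule used in Examples (5-3) and (5-4), including the averaging of tied ranks, can be sketched in Python. The function name `average_ranks` and its `order` parameter are assumptions made for this illustration; `order` lists the categories from lowest to highest when the data are grades rather than numbers.

```python
def average_ranks(values, order=None):
    """Ascending ranks; tied values share the average of their positions.

    `order` is an optional sequence giving the categories from lowest to
    highest (needed for letter grades, where 'F' < 'D' < ... < 'A').
    """
    key = (lambda v: order.index(v)) if order else (lambda v: v)
    positions = {}
    for pos, v in enumerate(sorted(values, key=key), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

ranks_x = average_ranks([14, 10, 12, 8, 3, 5, 6])             # Example (5-3)
ranks_g = average_ranks(list("BCBFDDA"), order="FDCBA")       # Example (5-4)
print(ranks_x)
print(ranks_g)
```

Each value keeps its original position in the output list, which matches the remark that readings remain in place while ranks are assigned.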

Steps for calculating the Spearman rank correlation coefficient


1- Arrange the values of X and the values of Y, each separately, in ascending order.
2- Assign ranks 1, 2, 3, ..., n to the values of each variable.
3- If a value is repeated, each occurrence receives the average of the ranks occupied by the repeated values.
4- Compute rs by

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}, \qquad d_i = \text{Rank } x_i - \text{Rank } y_i$$

Remark:
All of the properties of Pearson’s correlation coefficient apply to Spearman’s rank correlation coefficient. For quantitative data, the value of rs is very close to r, but rs is distinguished by its ease and accuracy, especially when the number of pairs is less than n = 30.

Example (5-5):
The following data represent the grades of eight students in chemistry and physics.
Chemistry   A   B   D   F   C   D   F   B
Physics     A   C   F   D   C   D   F   B
Compute the correlation coefficient for the students’ grades in chemistry and physics.

Sol: Construct the following table.

Chemistry xi   Physics yi   Rank xi   Rank yi   di = Rank xi − Rank yi   di²
A              A            8         8         0                        0
B              C            6.5       5.5       1                        1
D              F            3.5       1.5       2                        4
F              D            1.5       3.5       −2                       4
C              C            5         5.5       −0.5                     0.25
D              D            3.5       3.5       0                        0
F              F            1.5       1.5       0                        0
B              B            6.5       7         −0.5                     0.25
Sum                                                                      9.5


Note that n = 8. Then,

$$r_s = 1 - \frac{6\sum_{i=1}^{8} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 9.5}{8(8^2 - 1)} = 1 - \frac{57}{504} = 0.89$$

This means that there is a very strong positive correlation between the two subjects.
Example (5-6):
Compute the Spearman rank correlation coefficient between X and Y .

X 0 1 2 3 4 5 2
Y −1 2 2 8 4 14 5

Sol: Construct a table like the one in the previous example.

xi   yi   Rank xi   Rank yi   di = Rank xi − Rank yi   di²
0    −1   1         1         0                        0
1    2    2         2.5       −0.5                     0.25
2    2    3.5       2.5       1                        1
3    8    5         6         −1                       1
4    4    6         4         2                        4
5    14   7         7         0                        0
2    5    3.5       5         −1.5                     2.25
Sum                                                    8.5

Note that n = 7. Then,

$$r_s = 1 - \frac{6\sum_{i=1}^{7} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 8.5}{7(7^2 - 1)} = 1 - \frac{51}{336} = 0.85$$

This means that there is a very strong positive correlation between the variables X and Y.
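The four steps for rs can be put together in a short Python sketch, shown here on the data of Example (5-6). The helper names `ranks` and `spearman` are hypothetical, and the sketch applies the shortcut formula exactly as stated in the notes.

```python
def ranks(values):
    """Ascending ranks, with tied values replaced by their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), with d_i the rank differences."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return d2, 1 - 6 * d2 / (n * (n * n - 1))

x = [0, 1, 2, 3, 4, 5, 2]
y = [-1, 2, 2, 8, 4, 14, 5]
sum_d2, rs = spearman(x, y)
print(sum_d2, round(rs, 2))
```

When ties are present, this shortcut formula can differ slightly from Pearson's r computed directly on the ranks; the sketch follows the notes and does not apply a tie correction.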

5. Coefficient of Association, Coefficient of Contingency and Kendall Rank Coefficient

Sometimes we have qualitative data with distinct categories that cannot be ordered. For example:
- Education (educated, uneducated).
- Marital status (single, married, widowed, divorced).
- Hair color (white, black, blond, etc.).
In all of these cases, the correlation between the characteristics cannot be measured using the Pearson or the Spearman method. Hence, we need other methods that can measure the correlation between these characteristics.
Yule’s Q Coefficient of Association
Yule’s Q is a measure of association between two binary variables, that is, variables that each have only two possible values, so that the data form a 2 × 2 table. For a 2 × 2 table for binary variables X and Y with frequencies or proportions,

Y \ X   x1   x2
y1      a    b
y2      c    d

the coefficient of association, denoted by rQ, is given by

$$r_Q = \frac{ad - bc}{ad + bc}$$

where a is the joint frequency of x1 and y1, and so on.

Remark: The absolute value of rQ is between 0 and 1, that is 0 ≤ |rQ| ≤ 1. Note that the sign of rQ has no meaning here, since it depends on the order in which the categories are arranged in the table; therefore, its absolute value is used to indicate whether or not a relationship exists.
Example (5-7):
Calculate the correlation between smoking and cancer for the data of 100 persons shown below:

Cancer \ Smoking   Smoker   Non-smoker
Infected           55       10
Not infected       5        30
Sol:

$$r_Q = \frac{ad - bc}{ad + bc} = \frac{(55 \times 30) - (10 \times 5)}{(55 \times 30) + (10 \times 5)} = \frac{1650 - 50}{1650 + 50} = \frac{1600}{1700} = 0.94$$

This result indicates a very strong correlation between smoking and cancer.
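As a small illustration of the formula and of the remark about the sign, the following Python sketch computes Yule's Q for Example (5-7) and again after swapping the two columns of the table. The function name `yules_q` is an assumption for this sketch.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2 x 2 table [[a, b], [c, d]].

    Undefined (division by zero) when ad + bc = 0.
    """
    return (a * d - b * c) / (a * d + b * c)

# Example (5-7): rows = infected / not infected, columns = smoker / non-smoker.
q = yules_q(55, 10, 5, 30)
# Same data with the two columns listed in the opposite order.
q_swapped = yules_q(10, 55, 30, 5)
print(round(q, 2), round(q_swapped, 2))
```

Swapping the columns only flips the sign, so the absolute value is the quantity that carries the strength of the association.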

Coefficient of Contingency:
Suggested by Cramer in 1946. It is used to measure the strength of the correlation between two qualitative variables when each variable is divided into more than two categories (that is, the table contains more than 4 cells).
The coefficient of contingency is computed by the following steps:
1- Suppose we have X with r categories and Y with s categories. The following table shows the contingency between the two variables:

X \ Y   y1    y2    ...   ys    Sum
x1      f11   f12   ...   f1s   f1.
x2      f21   f22   ...   f2s   f2.
...     ...   ...   ...   ...   ...
xr      fr1   fr2   ...   frs   fr.
Sum     f.1   f.2   ...   f.s   f..

2- Calculate B by

$$B = \frac{f_{11}^2}{f_{\cdot 1} \times f_{1 \cdot}} + \frac{f_{12}^2}{f_{\cdot 2} \times f_{1 \cdot}} + \cdots + \frac{f_{rs}^2}{f_{\cdot s} \times f_{r \cdot}}$$

3- Compute the coefficient of contingency, denoted by rc, by:

$$r_c = \sqrt{\frac{B - 1}{B}}$$

Example (5-8):
A researcher wanted to measure the association between the eye colour of fathers and that of their children. He therefore collected data from a group of fathers and their children, with the following results:

Father's eyes \ Children's eyes   Black   Green   Brown
Black                             2       4       4
Green                             3       1       6
Brown                             5       2       3

Is there a correlation between the fathers’ eye colour and that of their children?

Sol:
We add the row and column sums as follows:

Father's eyes \ Children's eyes   Black   Green   Brown   Sum
Black                             2       4       4       10
Green                             3       1       6       10
Brown                             5       2       3       10
Sum                               10      7       13      30

Now, calculate B:

$$B = \frac{2^2}{10 \times 10} + \frac{3^2}{10 \times 10} + \frac{5^2}{10 \times 10} + \frac{4^2}{7 \times 10} + \frac{1^2}{7 \times 10} + \frac{2^2}{7 \times 10} + \frac{4^2}{13 \times 10} + \frac{6^2}{13 \times 10} + \frac{3^2}{13 \times 10} = 1.15$$

Then,

$$r_c = \sqrt{\frac{B - 1}{B}} = \sqrt{\frac{1.15 - 1}{1.15}} = \sqrt{\frac{0.15}{1.15}} = 0.36$$

From the above result, the coefficient of contingency is rc = 0.36, which shows a weak association between the eyes of fathers and their children.

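The three steps for the coefficient of contingency can be sketched in Python, using the eye-colour table of Example (5-8). The function name `contingency_coefficient` and the list-of-rows layout are assumptions of this sketch.

```python
from math import sqrt

def contingency_coefficient(table):
    """Return (B, r_c) for a two-way frequency table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # B is the sum of f_ij^2 / (row total * column total) over all cells.
    b = sum(f * f / (row_tot[i] * col_tot[j])
            for i, row in enumerate(table)
            for j, f in enumerate(row))
    return b, sqrt((b - 1) / b)

# Rows = father's eyes (black, green, brown); columns = children's eyes.
eyes = [
    [2, 4, 4],
    [3, 1, 6],
    [5, 2, 3],
]
B, rc = contingency_coefficient(eyes)
print(round(B, 2), round(rc, 2))
```

The sketch keeps B unrounded when taking the square root; the hand calculation rounds B to 1.15 first, and both routes give the same value to two decimals here.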

The Kendall Rank Coefficient:


Suggested by Kendall in 1938. It is used to measure the correlation between two variables X and Y based on their ranks. The Kendall coefficient is denoted by the symbol τ, read “tau”. It is calculated using the following relationship:

$$\tau = 1 - \frac{4Q}{n(n - 1)}$$

where n is the number of pairs (xi, yi). We find Q using the following steps:
1- Arrange the X ranks in natural order and put the corresponding Y ranks underneath.
2- Starting from the smallest X rank, count, for each Y rank, the number of Y ranks to its right that are smaller than it. Adding these counts gives the value of Q.

Example (5-9):
Two referees ranked 10 artistic paintings in order of preference, as shown in the following table:
Painting    C   E   F   G   H   J   I   B    D   A
Referee 1   2   1   3   4   5   6   8   7    9   10
Referee 2   1   2   4   3   6   5   7   10   9   8
Calculate the Kendall correlation coefficient between the ranks of the two referees.

Sol:
1- Arrange the ranks of referee 1, say X, in natural order and put the corresponding ranks of referee 2, say Y, underneath.
2- Starting from the smallest X rank, count, for each Y rank, the number of Y ranks to its right that are smaller than it. Adding these counts gives the value of Q.


Painting                       E   C   F   G   H   J   B    I   D   A
Referee 1                      1   2   3   4   5   6   7    8   9   10
Referee 2                      2   1   4   3   6   5   10   7   9   8
# of small ranks to the right  1   0   1   0   1   0   3    0   1   0

Then, Q = 1 + 0 + 1 + 0 + 1 + 0 + 3 + 0 + 1 + 0 = 7, and the Kendall coefficient is

$$\tau = 1 - \frac{4Q}{n(n - 1)} = 1 - \frac{4 \times 7}{10 \times (10 - 1)} = 0.69$$

From the value τ = 0.69, it is clear that there is a strong correlation between the ranks of the two referees.

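The counting procedure for Q can be sketched in Python on the data of Example (5-9). The function name `kendall_tau` is hypothetical, and the sketch assumes there are no tied ranks, as in the example.

```python
def kendall_tau(x_ranks, y_ranks):
    """Return (Q, tau) using tau = 1 - 4Q / (n (n - 1)); assumes no ties."""
    # Step 1: put the X ranks in natural order, carrying the Y ranks along.
    y_sorted = [y for _, y in sorted(zip(x_ranks, y_ranks))]
    n = len(y_sorted)
    # Step 2: for each Y rank, count the smaller Y ranks to its right.
    q = sum(1 for i in range(n) for j in range(i + 1, n)
            if y_sorted[j] < y_sorted[i])
    return q, 1 - 4 * q / (n * (n - 1))

referee1 = [2, 1, 3, 4, 5, 6, 8, 7, 9, 10]
referee2 = [1, 2, 4, 3, 6, 5, 7, 10, 9, 8]
Q, tau = kendall_tau(referee1, referee2)
print(Q, round(tau, 2))
```

The double loop compares every pair once, which is adequate for small n such as the ten paintings here.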
6. Simple linear regression

The objective of studying regression is to predict the value of one variable, say Y, knowing the value of another variable, say X. In this case, the variable Y is called the dependent variable and the variable X is called the independent variable. The simple linear regression of Y on X is

y = β0 + β1 x
where x and y represent the values of the random variables X and Y respectively; β1 is the regression coefficient of Y on X, also known as the slope of the regression line, and β0 is the point at which the line cuts the y-axis, called the intercept.

In linear regression, if the value of x is given, the value of y can be estimated. In this case, we use ŷ to distinguish the estimate from the real value of y. The estimated regression line is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

and each observed value can be written as y = ŷ + ϵ. The above regression is simple because only one independent variable is used to predict the dependent variable.

Remarks:
• ŷ is the predicted value of the dependent variable (y) for any given value of the independent variable (x).
• $\hat{\beta}_0$ is the estimated intercept, the predicted value of y when x = 0.
• $\hat{\beta}_1$ is the estimated regression coefficient (slope): how much we expect y to change when x increases by one unit.
• ϵ is the error of the estimate: the difference between the observed value y and the predicted value ŷ.

• If $\hat{\beta}_1 > 0$, there is a positive relationship between Y and X.
• If $\hat{\beta}_1 = 0$, there is no linear relationship between Y and X, i.e. $\hat{y} = \hat{\beta}_0$.
• If $\hat{\beta}_1 < 0$, there is a negative relationship between Y and X.
• If $\hat{\beta}_0 = 0$, the regression line passes through the origin.
• If θ is the angle that the regression line makes with the positive direction of the x-axis, then $\hat{\beta}_1 = \tan\theta$.
The following figure shows the line of best fit using linear regression.


Note that the linear regression of X on Y is:

$$\hat{x} = \hat{\beta}'_0 + \hat{\beta}'_1 y$$

To find the linear regression equation for y = β0 + β1 x, we estimate β0 and β1 using the least squares method, where

$$\hat{\beta}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}, \qquad \hat{\beta}_0 = \frac{\sum y_i}{n} - \hat{\beta}_1 \frac{\sum x_i}{n}$$

For the regression x = β0′ + β1′ y, we have

$$\hat{\beta}'_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum y_i^2 - (\sum y_i)^2}, \qquad \hat{\beta}'_0 = \frac{\sum x_i}{n} - \hat{\beta}'_1 \frac{\sum y_i}{n}$$

Example (5-10):
In a car trading company, the following table gives the age of each car in years, say X, and its selling price in thousands of riyals, say Y.

X   3    2    1    1    5    6    1    4
Y   31   44   60   70   18   17   71   29
1- Plot the data and comment on its behaviour.
2- Find the linear regression equation of Y on X.
3- Find the linear regression equation of X on Y.
4- What is the predicted selling price of a 2.5-year-old car?
5- Plot the data with the regression line.

Sol:

1- The scatter plot of the data is displayed below.

[Figure: scatter plot of selling price (vertical axis) against car age (horizontal axis)]

We see from the above scatter plot that there is a negative correlation between the car age and the selling price.
To find the simple linear regression, we first construct the following table:
xi    yi    xi yi   xi²   yi²
3     31    93      9     961
2     44    88      4     1936
1     60    60      1     3600
1     70    70      1     4900
5     18    90      25    324
6     17    102     36    289
1     71    71      1     5041
4     29    116     16    841
Sum: Σxi = 23, Σyi = 340, Σxi yi = 690, Σxi² = 93, Σyi² = 17892

2- To find the linear regression of Y on X, we first estimate the values of β0 and β1 as follows.

$$\hat{\beta}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} = \frac{(8 \times 690) - (23 \times 340)}{8 \times 93 - (23)^2} = \frac{-2300}{215} = -10.698$$

$$\hat{\beta}_0 = \frac{\sum y_i}{n} - \hat{\beta}_1 \frac{\sum x_i}{n} = \frac{340}{8} - \hat{\beta}_1 \frac{23}{8} = 42.5 - (-10.698) \times (2.875) \approx 73.256$$

where the unrounded slope −2300/215 is used in the product. Hence, the linear regression equation of Y on X is

ŷ = −10.698x + 73.256
3- To find the linear regression of X on Y, we first estimate the values of β0′ and β1′ as follows.

$$\hat{\beta}'_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum y_i^2 - (\sum y_i)^2} = \frac{(8 \times 690) - (23 \times 340)}{8 \times 17892 - (340)^2} = \frac{-2300}{27536} = -0.08353$$

$$\hat{\beta}'_0 = \frac{\sum x_i}{n} - \hat{\beta}'_1 \frac{\sum y_i}{n} = \frac{23}{8} - \hat{\beta}'_1 \frac{340}{8} = 2.875 - (-0.08353) \times (42.5) = 6.425$$

Hence, the linear regression equation of X on Y is

x̂ = −0.08353y + 6.425


4- To find the selling price of a 2.5-year-old car, we use the first regression equation:

ŷ = −10.698 × 2.5 + 73.256 = 46.511 (Thousand Riyals)

5- The following figure shows the line of best fit using the linear regression of Y on X.

[Figure: scatter plot of selling price against car age with the fitted regression line]

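The least squares calculations of Example (5-10) can be reproduced with a short Python sketch. The function name `least_squares` is an assumption; it returns the intercept and slope for the regression of its second argument on its first, so calling it with the arguments swapped gives the regression of X on Y.

```python
def least_squares(xs, ys):
    """Intercept and slope of the least squares line of ys on xs."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * sx / n
    return intercept, slope

age = [3, 2, 1, 1, 5, 6, 1, 4]             # X, car age in years
price = [31, 44, 60, 70, 18, 17, 71, 29]   # Y, selling price in thousands

b0, b1 = least_squares(age, price)         # regression of Y on X
b0_prime, b1_prime = least_squares(price, age)  # regression of X on Y
predicted = b0 + b1 * 2.5                  # price of a 2.5-year-old car
print(round(b1, 3), round(b0, 3), round(b1_prime, 5), round(b0_prime, 3))
print(round(predicted, 2))
```

Because the sketch does not round the coefficients before predicting, the predicted price can differ from the hand calculation in the third decimal place.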
Chapter 6
Principles of Probability

Content:
✦ Introduction
✦ Random Experiment
✦ Sample Space
✦ Events
✦ Probability
✦ Axioms of Probability

Introduction

Probabilities are used in our daily life, for example:
- The probability of rain tomorrow.
- The probability that a certain team will beat another.
- The probability of a war in a country.
From these examples, it is clear that probability is used to describe an event that is not certain to happen, and that the degree of confidence in judging whether such events will occur varies from person to person. Therefore, we need numerical measures in place of verbal expressions of the degree of confidence. The science that studies these measures and their relationship to each other is called probability. To understand probability, we need to know some of its definitions and axioms.
Random Experiment

A random experiment is an experiment whose outcomes cannot be predicted with certainty. However, in most cases the collection of every possible outcome of a random experiment can be listed.

Examples of random experiments:
• Tossing a coin: the outcome of a trial is either a head H or a tail T showing up.
• Checking a lamp: the outcome is either non-defective or defective.
• Tossing a die: the outcome of a trial is one of 1, 2, 3, 4, 5 or 6 showing up.

Sample Space

A sample space of a random experiment is the collection of all possible outcomes. The symbol Ω is used to denote the sample space, and the number of outcomes is denoted by n(Ω).

Examples:
• In the experiment of tossing a coin: Ω = {H, T}.
• In the experiment of checking a lamp: Ω = {0, 1}, where 0 is for non-defective and 1 is for defective.
• In the experiment of tossing a die: Ω = {1, 2, 3, 4, 5, 6}.
Example (6-1):
Find the sample space for tossing a coin three times.

Sol:
Use the tree diagram to find the solution.

From the tree diagram, we have

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

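A sample space built from repeated trials can also be listed mechanically. The following Python sketch uses the standard library function `itertools.product` to enumerate the outcomes of three coin tosses; the variable names are illustrative only.

```python
from itertools import product

# Every sequence of three tosses, each toss being H or T.
omega_three_tosses = ["".join(outcome) for outcome in product("HT", repeat=3)]
print(omega_three_tosses)
print(len(omega_three_tosses))  # n(Omega) = 2 * 2 * 2
```

The enumeration follows the same branching as a tree diagram: the first toss varies slowest and the last toss varies fastest.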

Example (6-2):
Find the sample space for tossing two dice at the same time. What is the number of all possible outcomes?

Sol:
We can use the Cartesian product to find the solution as

Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}

Or the solution can be determined from the following diagram.

From the diagram, we have

Ω = {(1, 1), (1, 2), (1, 3), . . . , (6, 3), (6, 4), (6, 5), (6, 6)}

The number of outcomes for Ω is n(Ω) = 36.

Types of sample space:
There are three types of sample space:
• Finite: for example, toss a coin two times; Ω = {HH, HT, TH, TT}.
• Countably infinite: for example, toss a coin until a head appears and record the number of tosses; Ω = {1, 2, 3, . . . , ∞}. Here ∞ refers to the case when a head never appears.
• Uncountable: for example, the daily temperature in some cities in KSA; Ω = {x : x is a real number}.

Events

Event
An event A is a subset of the sample space Ω.

• A is an event iff A ⊆ Ω.
• The event A occurs if the outcome of the experiment belongs to A.
• The number of outcomes for the event A is denoted by n(A).
• The impossible event is the empty set φ, where φ ⊆ Ω.
• The sure event is Ω itself, where Ω ⊆ Ω.
Example (6-3):
In the experiment of tossing a fair coin two times, find all possible outcomes for the following events and their numbers of elements:

A = {Getting a head in the first toss}
B = {Getting a tail in the first toss}
C = {Getting at least one head}

Sol:

Ω = {HH, HT, TH, TT}; n(Ω) = 4
A = {HH, HT}; n(A) = 2
B = {TH, TT}; n(B) = 2
C = {HH, HT, TH}; n(C) = 3

Example (6-4):
In the experiment of tossing a die two times, find all possible outcomes for the following events and their numbers of elements. Denote the first and second toss outcomes by x and y respectively.

A = {(x, y) : x + y < 4}
B = {(x, y) : x = y}
C = {(x, y) : x = 5}
D = {(x, y) : x + y = 1}

Sol: We already found the sample space in Example (6-2). Hence,

A = {(x, y) : x + y < 4} = {(1, 1), (1, 2), (2, 1)}; n(A) = 3
B = {(x, y) : x = y} = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}; n(B) = 6
C = {(x, y) : x = 5} = {(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6)}; n(C) = 6
D = {(x, y) : x + y = 1} = {} = φ; n(D) = 0
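Events defined by a condition, as in Example (6-4), can be listed by filtering the sample space. The following Python sketch is one way to do this; the set names mirror the example, and the use of set comprehensions is a choice made for this illustration.

```python
from itertools import product

# Sample space for two tosses of a die: all ordered pairs (x, y).
omega = set(product(range(1, 7), repeat=2))

A = {(x, y) for (x, y) in omega if x + y < 4}
B = {(x, y) for (x, y) in omega if x == y}
C = {(x, y) for (x, y) in omega if x == 5}
D = {(x, y) for (x, y) in omega if x + y == 1}   # impossible event

for name, event in (("A", A), ("B", B), ("C", C), ("D", D)):
    print(name, sorted(event), len(event))
```

Each event is a subset of `omega` by construction, which matches the definition of an event as a subset of the sample space.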
Algebraic operations of events
Note that the sample space is a set and an event is a subset of it. Therefore, we need to study the algebraic operations on events before studying probability.
Union: A ∪ B
The event that occurs when either A or B or both occur. Mathematically,
A ∪ B = {x ∈ Ω : x ∈ A or x ∈ B}

[Figure: Venn diagram of A ∪ B]


Intersection: A ∩ B
The event that occurs when both A and B occur simultaneously. Mathematically,

A ∩ B = {x ∈ Ω : x ∈ A and x ∈ B}

[Figure: Venn diagram of A ∩ B]

Complement: $A^c$
The event that occurs when A does not occur. Mathematically,

$A^c = \bar{A} = \{x \in \Omega : x \notin A\}$

[Figure: Venn diagram of $A^c$]

The number of elements of $A^c$ is $n(A^c) = n(\Omega) - n(A)$.


Difference between two Events: A − B


The event that occurs when A occurs but B does not. Mathematically,

$A - B = \{x \in \Omega : x \in A \text{ and } x \notin B\}$

[Figure: Venn diagram of A − B]

From the above, we infer the following:

$(A^c)^c = A$,   $φ^c = Ω$,   $Ω^c = φ$,   A ∩ A = A
A ∪ A = A,   A ∩ Ω = A,   A ∪ Ω = Ω,   A ⊆ B ⇒ A ∩ B = A
A ∪ φ = A,   $A^c = Ω − A$,   A ∩ φ = φ,   A ⊆ B ⇒ A ∪ B = B
Corollary:

$A - B = A \cap B^c$

Corollary: De Morgan’s laws

$(A \cup B)^c = A^c \cap B^c$
$(A \cap B)^c = A^c \cup B^c$

Disjoint (Mutually Exclusive) Events

Two events that cannot occur at the same time, i.e. A ∩ B = φ.
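The algebraic operations on events map directly onto Python's built-in set operators. The sketch below uses the sample space of a single die with two illustrative events; the particular events A and B are assumptions chosen for the demonstration, not taken from the notes.

```python
omega = set(range(1, 7))   # sample space of one toss of a die
A = {2, 4, 6}              # illustrative event: an even number
B = {4, 5, 6}              # illustrative event: a number greater than 3

def complement(event):
    """Complement of an event relative to the sample space omega."""
    return omega - event

union = A | B              # A or B or both
intersection = A & B       # both A and B
difference = A - B         # A occurs but B does not

# De Morgan's laws and the corollary A - B = A intersected with B^c.
de_morgan_1 = complement(A | B) == complement(A) & complement(B)
de_morgan_2 = complement(A & B) == complement(A) | complement(B)
corollary = difference == A & complement(B)

# Two events are disjoint when their intersection is empty.
disjoint = A.isdisjoint({1, 3, 5})
print(union, intersection, difference, de_morgan_1, de_morgan_2, corollary, disjoint)
```

Checking the identities on one pair of events is only an illustration; it does not replace the general set-theoretic argument.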

Equally likely Outcomes


Equally likely outcomes are outcomes that have the same chance of occurring.
Examples:
• Tossing a coin: the chance of getting (H) is equal to the chance of getting (T).
• Tossing a fair die: the chances of getting 1, 2, 3, 4, 5, or 6 are equal.
Favourable Events
The outcomes that ensure the required event happens are called favourable events. For example, when throwing a die with the aim of getting an even number, the outcomes 2, 4 and 6 are the favourable cases.
Exhaustive Events
The events A1, A2, . . . , An of a random experiment are said to be exhaustive events if at least one of them necessarily occurs, i.e.

$$\bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \ldots \cup A_n = \Omega$$

Example (6-5):
In the experiment of tossing a die, what do you conclude about the following events?

1- A = {2, 4, 6} and B = {1, 3, 5}.
2- A1 = {1, 2, 3}, A2 = {2, 3, 4} and A3 = {3, 4, 5, 6}.

Sol:

1- Note that A ∪ B = {1, 2, 3, 4, 5, 6} = Ω and A ∩ B = φ. Hence, A and B are exhaustive events and also mutually exclusive events.
2- Note that A1 ∪ A2 ∪ A3 = {1, 2, 3, 4, 5, 6} = Ω and Ai ∩ Aj ≠ φ for i ≠ j, where i, j = 1, 2, 3. Hence, A1, A2 and A3 are exhaustive events but not mutually exclusive events.

Exercise (6-6):
Use Example (6-1) to find the following:
1- Find all possible outcomes and the number of elements for:

A = {Getting a head in the first toss}
B = {Getting at least one head}
C = {Getting a tail in the first toss and a head in the third toss}

2- Find the number of elements for the following events:

$A \cap B$,   $A \cup C$,   $A^c \cup B^c$,   $(A \cap B)^c$,   $A \cap B^c$

Probability

Classical Definition of Probability:


For a random experiment with equally likely outcomes and a finite sample space with n(Ω) outcomes, the probability of the event A is given by

$$P(A) = \frac{n(A)}{n(\Omega)}$$

where n(A) is the number of elements of the event A.
Example (6-7):
In the experiment of tossing a coin two times, find the probability of getting a tail twice.

Sol: The sample space for this experiment is

Ω = {HH, HT, TH, TT}, n(Ω) = 4

Let A represent the event of getting a tail twice:

A = {TT}, n(A) = 1

Hence, the probability of getting a tail twice is $P(A) = \frac{n(A)}{n(\Omega)} = \frac{1}{4}$.
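The classical definition amounts to counting, which the following Python sketch illustrates for Example (6-7). The function name `classical_probability` is hypothetical, and `fractions.Fraction` is used so that the result stays an exact ratio.

```python
from fractions import Fraction
from itertools import product

def classical_probability(event, omega):
    """P(A) = n(A) / n(Omega), valid only for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

# Sample space for tossing a coin two times.
omega = {"".join(outcome) for outcome in product("HT", repeat=2)}
A = {"TT"}  # getting a tail twice
p = classical_probability(A, omega)
print(p)
```

The intersection with `omega` inside the function simply guards against counting outcomes that are not in the sample space.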
Remark:
The classical definition of probability is conceptually simple for many situations. However, it is limited, since many situations do not have finitely many equally likely outcomes. For example, tossing a weighted die has finitely many outcomes, but they are not equally likely. Studying people’s incomes over time is a situation where we need to consider infinitely many possible outcomes.
Relative Frequency Probability:
If an experiment is repeated an extremely large number of times, say n times, under the same conditions, and the number of times the event A occurs is r(A), the probability P(A) is defined by

$$P(A) = \lim_{n \to \infty} \frac{r(A)}{n}$$

Remark:
The relative frequency definition of probability covers more cases than the classical definition. However, repeating an identical experiment an infinite number of times is physically impossible.
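Although an infinite number of repetitions is impossible, the idea behind the relative frequency definition can be illustrated by simulation. The following Python sketch simulates tossing a fair coin and reports r(A)/n for increasing n, where A is the event of getting a head. The function name, the fixed seed and the choice of a fair coin are assumptions made for this illustration.

```python
import random

def relative_frequency(trials, p=0.5, seed=0):
    """Simulate `trials` repetitions and return r(A) / n for an event of chance p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    occurrences = sum(1 for _ in range(trials) if rng.random() < p)
    return occurrences / trials

for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

For small n the ratio can be far from 0.5, and it tends to settle near 0.5 as n grows, which is the behaviour the limit in the definition describes.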

Axioms of Probability (proposed by Kolmogorov):


Let Ω be a sample space and let P be a real-valued function defined on the class of events of Ω (the subsets of Ω), with values in [0, 1]. The function P is called a probability function, and P(A) is called the probability of the event A ⊆ Ω, when the following hold:
• P(A) ≥ 0
• P(Ω) = 1
• If A1, A2, A3, . . . are events with Ai ∩ Aj = φ for i ≠ j, then

P(A1 ∪ A2 ∪ A3 ∪ . . .) = P(A1) + P(A2) + P(A3) + . . .

that is,

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

For a finite sample space with equally likely outcomes, $P(A) = \frac{n(A)}{n(\Omega)}$ is such a function.
Some results of probability axioms
1- If φ is the empty event, then P(φ) = 0.
2- If A1, A2, . . . , An are events with Ai ∩ Aj = φ for i ≠ j, then

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$$

3- If A is any event in Ω, then $P(A^c) = 1 - P(A)$.
4- If A and B are any events in Ω, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
5- If A and B are any events in Ω, then

$P(A \cap B^c) = P(A) - P(A \cap B) \iff P(A) = P(A \cap B) + P(A \cap B^c)$

6- If A and B are two events and A ⊆ B, then P(A) ≤ P(B).


Example (6-8):
Tickets numbered 1 to 20 are mixed up and then a ticket is drawn at
random. What is the probability that the ticket drawn has a number
which is a multiple of 3 or 5?

Sol:
Here, Ω = {1, 2, 3, 4, . . . , 19, 20}.
Let E1 = event of getting a multiple of 3 = {3, 6, 9, 12, 15, 18}.
Let E2 = event of getting a multiple of 5 = {5, 10, 15, 20}.
Then E1 ∩ E2 = {15}. Hence,

$$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2) = \frac{6}{20} + \frac{4}{20} - \frac{1}{20} = \frac{9}{20}.$$

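The addition rule in Example (6-8) can be checked by direct counting. In the Python sketch below, the events are built from the ticket numbers 1 to 20, and the helper name `prob` is an assumption of this illustration.

```python
from fractions import Fraction

omega = set(range(1, 21))                  # tickets numbered 1 to 20
E1 = {k for k in omega if k % 3 == 0}      # multiples of 3
E2 = {k for k in omega if k % 5 == 0}      # multiples of 5

def prob(event):
    """Classical probability of an event within omega."""
    return Fraction(len(event), len(omega))

direct = prob(E1 | E2)                              # count the union directly
by_rule = prob(E1) + prob(E2) - prob(E1 & E2)       # addition rule
print(sorted(E1), sorted(E2), sorted(E1 & E2))
print(direct, by_rule)
```

Counting the union directly and applying the addition rule give the same value, because the rule subtracts the outcomes that would otherwise be counted twice.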
Example (6-9):
A breakdown of the sources of energy used in the United States is shown below.
Oil   Nuclear   Gas   Hydropower   Coal   Others
39%   8%        24%   3%           23%    3%
Choose one energy source at random. Find the probability that it is
1- Not oil.
2- Gas or oil.
3- Not nuclear and not hydropower.

Sol:
Let O = event of choosing oil, N = event of choosing nuclear, G = event of choosing gas, H = event of choosing hydropower, C = event of choosing coal and T = event of choosing others. Then
1- $P(O^c) = 1 - P(O) = 1 - 0.39 = 0.61$

2- P(G or O) = P(G ∪ O) = P(G) + P(O) − P(G ∩ O) = 0.24 + 0.39 − 0 = 0.63

3- $P(N^c \cap H^c) = P((N \cup H)^c) = 1 - P(N \cup H) = 1 - (P(N) + P(H)) = 1 - (0.08 + 0.03) = 0.89$

Completed, praise be to God.
