Chapter 3. Systematization and Processing of Statistical Data
Marcu, 2019-2020
1. Definitions
Statistical processing of data is the stage in which the individual data, obtained by
statistical observation for each unit of the population, are transformed into indicators
characterizing the population as a whole.
Statistical data encoding involves assigning numerical codes to variables expressed
in words, in order to allow easier processing by a computer.
Statistical classification is the process of systematizing mass data into classes (groups),
based on common attributes, so that each class is as homogeneous as possible.
The Frequency Distribution [1] is made up of two parallel series of data: one gives the
grouping intervals (classes) and the other gives the frequency of the values falling in each
interval.
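As a minimal sketch of statistical data encoding, the Python snippet below assigns numerical codes to a variable expressed in words. The variable (occupation), its values and the code numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical answers for a variable expressed in words (illustrative data).
occupations = ["teacher", "engineer", "teacher", "farmer", "engineer", "teacher"]

# Assign one numerical code per distinct label, in order of first appearance.
codebook = {}
for label in occupations:
    if label not in codebook:
        codebook[label] = len(codebook) + 1

# Replace each label with its code so the data can be processed numerically.
encoded = [codebook[label] for label in occupations]

print(codebook)  # {'teacher': 1, 'engineer': 2, 'farmer': 3}
print(encoded)   # [1, 2, 1, 3, 2, 1]
```

The codes here are only labels: their numerical order carries no meaning for the variable itself.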
2. Level of measurement [2]
Data characterizing a statistical unit may be measured and expressed by one of the
following levels of measurement:
- Nominal;
- Ordinal;
- Interval;
- Ratio.
Nominal-Level Data is used for qualitative variables expressed in words. This level of
measurement does not allow algebraic calculations; the observations can only be classified and
counted. There is no particular order to the labels. Ex. occupation, area of specialization of
individuals, goods for consumption, etc.
Ordinal-Level Data is used when the values of the studied variable cannot be measured, but can be
ordered ascending or descending. Ex. the opinion about the price of a product can be: "very
expensive", "expensive", "acceptable", "cheap", "very cheap". This level allows ordering units by
rank. Although algebraic calculation with the variable values is not possible, we can identify a
middle (median) score.
Interval-Level Data is used for variables that are assigned numerical values for which the
distance (interval) between values is meaningful. This level includes all the characteristics of
the ordinal level but, in addition, equal differences between values represent equal amounts.
The zero point is conventional, so ratios between values are not meaningful. Ex. temperature in
degrees Celsius, calendar time. Thus, temperatures or moments in time can be easily ranked, and
we can also determine the difference between two temperatures/moments.
Ratio-Level Data is used for numerical variables. It enables all numerical operations, and the
ratio between two values is meaningful. Ex. wages, units of production, changes in stock prices,
etc. The 0 point is also meaningful: zero lei means you have no money, for example.
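The difference between the nominal and ordinal levels can be sketched in Python as follows. The categories come from the price-opinion example above; the list of answers is hypothetical, and using the lower median as the "middle" category is an assumption of this sketch.

```python
from collections import Counter
from statistics import median_low

# Ordered categories from the price-opinion example, lowest to highest rank.
scale = ["very cheap", "cheap", "acceptable", "expensive", "very expensive"]
rank = {label: position for position, label in enumerate(scale, start=1)}

# Hypothetical survey answers (illustrative data, not from the text).
answers = ["expensive", "acceptable", "cheap", "expensive", "very expensive",
           "acceptable", "expensive"]

# Nominal-level operation: classify and count.
counts = Counter(answers)

# Ordinal-level operation: order the units by rank and take the middle one.
ranks = sorted(rank[a] for a in answers)
middle_category = scale[median_low(ranks) - 1]

print(counts["expensive"])  # 3
print(middle_category)      # expensive
```

Averaging the ranks would treat the gaps between categories as equal, which the ordinal level does not guarantee; the median only relies on the ordering.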
[1] In Romanian: "seria de distributie de frecvente" or "seria de distributie".
[2] In Romanian: "scale de masurare".
Ass. Prof. Ph. D. L. Marcu, 2019-2020
The grouping of data is the process of dividing the units of the statistical population into
homogeneous classes, depending on the variation of one or more variables, called
"grouping variables".
Grouping variable = variable taken into account when the statistical units of the population
are divided into homogeneous classes.
Homogeneous class = group of units within which the variation of the variable is minimal.
Grouping by equal intervals is used when the data vary fairly uniformly, with continuous
variation between the minimum and maximum values of the series. Its defining characteristic is
that all intervals are equal.
The steps for this type of grouping are the following:
a. Order the series of individual data;
b. Determine the absolute amplitude of the variation: $A = x_{\max} - x_{\min}$;
c. Determine the number of classes and the size of the interval (the class interval, or width):
Case 1. When the number of classes ($r$) is known, we determine the size of the interval
($h$) using the formula: $h = A / r$.
Case 2. When the number of classes is unknown, the size of the interval is determined
with Sturges' formula: $h = \frac{x_{\max} - x_{\min}}{1 + 3.322 \lg N}$, where $N$ = total number of
units in the population. Then we calculate the number of classes: $r = A / h$.
d. Build the Frequency Distribution:
– The lower limit of the first interval is equal to $x_{\min}$ or a convenient
lower value;
– The upper limit of the first interval is obtained by adding the interval
size to the lower limit of the interval;
– The upper limit of the first interval becomes the lower limit of the
next interval, after which the process is repeated.
e. Count the number of items in each class. The number of observations in each
class is called the class frequency.
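Case 2 of step c can be sketched in Python as below. The function name `sturges_interval` and the sample data are hypothetical; rounding the number of classes up to a whole number is an assumption of this sketch, since the text does not say how to round.

```python
import math

def sturges_interval(values):
    """Return (h, r): interval size and number of classes (Case 2)."""
    n = len(values)                               # N, number of units
    amplitude = max(values) - min(values)         # A = x_max - x_min
    h = amplitude / (1 + 3.322 * math.log10(n))   # Sturges' formula
    r = math.ceil(amplitude / h)                  # r = A / h, rounded up (assumption)
    return h, r

# Hypothetical population of 100 units with values 1..100 (illustrative data).
h, r = sturges_interval(list(range(1, 101)))
print(round(h, 2))  # interval size before any convenient rounding
print(r)            # 8
```

In practice the interval size is usually rounded to a convenient value before the class limits are built.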
Example:
For 20 employees we know the following data on length of service (in years):
3, 5, 7, 29, 11, 13, 16, 20, 17, 18, 19, 19, 3, 21, 23, 25, 2, 25, 29, 16.
Group the employees into five equal intervals by length of service.
Solution:
a. The ordered series of individual data:
2, 3, 3, 5, 7, 11, 13, 16, 16, 17, 18, 19, 19, 20, 21, 23, 25, 25, 29, 29
b. $A = x_{\max} - x_{\min} = 29 - 2 = 27$ years
c. $r = 5$, $h = A / r = 27 / 5 = 5.4$ years, rounded up to 6 years
d. The frequency distribution is (a value equal to a shared class limit, such as 18, is
counted in the class where it is the upper limit):
Classes (years)    Frequency
0-6                4
6-12               2
12-18              5
18-24              5
24-30              4
Total              20
We choose 0 for the lower limit instead of 2 (the smallest length of service), because this
does not affect the classification. In addition, a person with a length of service lower than
2 years may be hired in the future, in which case this person will be assigned to a class
already formed (0-6 years).
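The grouping in this example can be checked with a short Python sketch that counts the listed data directly. The counts depend on how a value that falls exactly on a class limit is treated; the sketch assumes it belongs to the class where it is the upper limit, which is a convention chosen here and not stated in the text.

```python
# Length of service (years) for the 20 employees in the example.
service_years = [3, 5, 7, 29, 11, 13, 16, 20, 17, 18,
                 19, 19, 3, 21, 23, 25, 2, 25, 29, 16]

lower, h, r = 0, 6, 5  # lower limit, interval size, number of classes

# Class limits: 0-6, 6-12, 12-18, 18-24, 24-30.
classes = [(lower + k * h, lower + (k + 1) * h) for k in range(r)]

# Assumed convention: each class is (low, high], upper limit included.
frequencies = [sum(1 for x in service_years if low < x <= high)
               for low, high in classes]

for (low, high), f in zip(classes, frequencies):
    print(f"{low}-{high}: {f}")
print("Total:", sum(frequencies))  # Total: 20
```

With the opposite convention (lower limit included), the value 18 moves from class 12-18 to class 18-24 and the total stays the same.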
Types of tables:
- Descriptive tables: used for observation and recording;
- Tables for processing data: used for the algorithm and calculation of indicators;
- Simple tables: used for simple groups of data;
- Tables of groups: used for presenting simple classes of data and also total values
by classes and frequencies;
- Contingency tables (crosstab or cross tabulation): used for presenting data grouped by
two variables;
- Association tables (or associative tables): reflect the link between two alternative
(binary) variables.
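A contingency table can be sketched in Python by counting the units that share each pair of values. The two variables (gender and area) and all the data below are hypothetical.

```python
from collections import Counter

# Hypothetical units described by two variables (illustrative data).
units = [("female", "urban"), ("male", "urban"), ("female", "rural"),
         ("female", "urban"), ("male", "rural"), ("male", "urban")]

# Each cell of the contingency table counts the units sharing both values.
cells = Counter(units)
rows = sorted({gender for gender, _ in units})
cols = sorted({area for _, area in units})

# Print the table with one row per gender, one column per area, and row totals.
print("gender".ljust(8) + "".join(c.rjust(7) for c in cols) + "  total")
for g in rows:
    counts = [cells[(g, c)] for c in cols]
    print(g.ljust(8) + "".join(str(n).rjust(7) for n in counts)
          + str(sum(counts)).rjust(7))
```

When both variables have only two possible values, the same layout gives an association table.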
A statistical graph is a way of presenting data using conventional images in order to allow
identification of the essential aspects of the studied phenomenon.
The graph describes a phenomenon in a simplified manner using figures and sizes. It
facilitates the formation of a visual image regarding:
- the trend of the phenomenon,
- the interdependence of variables,
- structures and their changes in time and space.
A graph helps in choosing the methods of statistical calculation and in approximating
statistical sizes. Graphs can accompany tables. A graph is used when the user does not intend
to make their own calculations. The graph neglects details, carries only suggestive data, and
reports briefly on trends and the interdependence of phenomena. It reflects reality correctly
only if the principle of proportionality is respected (the correct choice of scale and type of chart).
The elements of a graph are:
- the title of the chart: suggests the nature of the data and the time and space to which the data apply;
- the axes of the graph: usually rectangular axes;
- the scale of representation: ensures the proportionality of the representation, indicates the
equivalent of a unit and helps with the gradation of the axes (the scale may be linear or logarithmic);
- the network of the chart: lines parallel to the rectangular axes, or networks of concentric circles;
- the chart legend: explains the conventional symbols and colors used;
- the data source: mentioned below the chart.
Types of graphs:
a) Column Chart: on the horizontal axis (Ox) we place as many columns as there are indicators.
The vertical axis (height, Oy) shows the size of each indicator.
b) Bar Chart: replaces the vertical columns of the column chart with horizontal bars. This
graph is used, for example, to represent the demographic pyramid.
c) Geometric Chart (square, rectangle, parallelogram): for the representation of volume
indicators or structures.
d) Charts for structure presentation: (d1) circle divided into sectors (pie chart); (d2) column
(bar): the total area is equal to 100% and the rectangle is divided into as many parts as
the phenomenon has structural elements.
e) Charts for the representation of a Frequency Distribution: (e1) histogram: Ox shows the
classes and Oy the frequencies; (e2) frequency polygon: obtained by joining the midpoints
of the tops of the histogram columns. On Ox we represent the centres of the
intervals and on Oy the frequency of each interval.
f) Other types of charts: time chart [4] (used to represent time series [5]), polar diagrams,
cartograms, natural or symbolic figures.
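The quantities behind a histogram, a frequency polygon and a pie chart can be sketched in Python without a plotting library. The class limits and frequencies below are hypothetical values used only for illustration.

```python
# Illustrative class limits and frequencies (hypothetical values).
classes = [(0, 6), (6, 12), (12, 18), (18, 24), (24, 30)]
frequencies = [3, 5, 6, 4, 2]

# Frequency polygon: x = centre of each interval, y = class frequency.
polygon_points = [((low + high) / 2, f)
                  for (low, high), f in zip(classes, frequencies)]

# Text histogram: one row per class, bar length proportional to frequency.
for (low, high), f in zip(classes, frequencies):
    print(f"{low:>2}-{high:<2} | " + "#" * f)

# Pie chart: each sector angle is proportional to the class share of the total.
total = sum(frequencies)
sector_angles = [360 * f / total for f in frequencies]

print(polygon_points)
print(sector_angles)  # [54.0, 90.0, 108.0, 72.0, 36.0]
```

Keeping bar lengths and sector angles proportional to the frequencies is what the principle of proportionality requires.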
Examples of graphs (figures not reproduced): Pie Chart; Bar Chart used for value comparison; Bar Chart used for structure presentation.
[4] In Romanian: "cronograma".
[5] In Romanian: "serii cronologice".
[6] In Romanian: "Pictograma".
Relative statistical measurements [7] are statistical indicators resulting from a ratio between
two absolute, average or relative values. They suggest the proportions between the indicators compared.
Depending on how they are determined and on their meaning, there are five categories: structure [9];
coordination [11]; dynamic [13]; plan [12]; intensity [14].
Any relative measurement, with the exception of the relative intensity measurement, can be expressed:
- as a coefficient (the result of the ratio);
- as a percentage (the result of the ratio × 100);
- as a per-mil [8] (the result of the ratio × 1000).
For the attributive variable: $x_i^* = \frac{x_i}{\sum x_i} \cdot 100$, where $x_i^*$ = weight of $x_i$ in the total;
For recurrent frequency [10]: $f_i^* = \frac{f_i}{\sum f_i} \cdot 100$, where $f_i^*$ = weight of the frequency of class "i" in the total population;
For a Frequency Distribution based on grouping by classes: $y_i^* = \frac{x_i f_i}{\sum x_i f_i} \cdot 100$, where $y_i^*$ = weight of class $x_i$, having frequency $f_i$, in the total population.
The sum of the relative structural measurements is 100%. These relative measurements are
mainly represented graphically by a pie chart.
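The relative structural measurements can be sketched in Python as below. The function names and the numerical values are hypothetical; the two functions follow the first and third formulas above (weights of values, and weights of classes with frequencies).

```python
def structure_weights(values):
    """Weight of each x_i in the total, in percent: x_i / sum(x_i) * 100."""
    total = sum(values)
    return [x * 100 / total for x in values]

def grouped_structure_weights(values, frequencies):
    """Weight of each class in percent: x_i * f_i / sum(x_i * f_i) * 100."""
    products = [x * f for x, f in zip(values, frequencies)]
    return structure_weights(products)

# Hypothetical production of three products (illustrative values).
weights = structure_weights([50, 30, 20])
print(weights)       # [50.0, 30.0, 20.0]
print(sum(weights))  # 100.0

# Hypothetical classes x = 10 and x = 20 with frequencies 3 and 1.
print(grouped_structure_weights([10, 20], [3, 1]))  # [60.0, 40.0]
```

In both cases the weights add up to 100%, which is a quick check on the calculation.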
[7] In Romanian: "Marimi relative".
[8] Parts per thousand (‰).
[9] In Romanian: "marimile relative de structura".
[10] In Romanian: "frecventa de repetitie".
[11] In Romanian: "marimile relative de coordonare".
$K_{A/C} = \frac{x_A}{x_C} \cdot 100$ and $K_{C/A} = \frac{x_C}{x_A} \cdot 100$
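The two coordination ratios can be sketched in Python as below, assuming that $x_A$ and $x_C$ are the levels of two parts (A and C) being compared. The function name and the figures (120 and 80 employees) are hypothetical.

```python
def coordination(x_a, x_c):
    """K_A/C in percent: level of part A relative to part C."""
    return x_a * 100 / x_c

# Hypothetical example: two groups of employees in the same company.
group_a, group_c = 120, 80

k_a_c = coordination(group_a, group_c)
k_c_a = coordination(group_c, group_a)

print(k_a_c)            # 150.0
print(round(k_c_a, 2))  # 66.67
```

Expressed as coefficients, the two ratios are reciprocals of each other (1.5 and about 0.667), so either one determines the other.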
Relative plan measurement expresses the extent to which targets have been met by a
company. Example:
$MRP = \frac{x_{real}}{x_{pl}} \cdot 100$, where $x_{real}$ = achieved level of the variable; $x_{pl}$ = planned level of the variable.
- Contract coverage:
$MRAC = \frac{x_{contracted}}{x_{planned}} \cdot 100$
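Both plan ratios can be sketched in Python as below. The function names and the planned, achieved and contracted levels are hypothetical values used only to show the arithmetic.

```python
def plan_fulfilment(x_real, x_planned):
    """MRP in percent: achieved level relative to the planned level."""
    return x_real * 100 / x_planned

def contract_coverage(x_contracted, x_planned):
    """MRAC in percent: contracted level relative to the planned level."""
    return x_contracted * 100 / x_planned

# Hypothetical levels of production (illustrative values, same unit).
planned, achieved, contracted = 200, 210, 180

print(plan_fulfilment(achieved, planned))      # 105.0
print(contract_coverage(contracted, planned))  # 90.0
```

A value above 100% means the target was exceeded; a value below 100% means it was not fully met or covered.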
Relative dynamic measurements allow comparing the levels reached by a variable in two
periods of time (0 – reference period, t – current period). They are used to describe the
evolution of a phenomenon over time (time series).
$i = \frac{x_t}{x_0} \cdot 100$
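The dynamic ratio can be sketched in Python for a short time series. The function name and the sales figures are hypothetical; the first period is taken as the reference period 0.

```python
def dynamic_index(x_t, x_0):
    """i in percent: level in current period t relative to reference period 0."""
    return x_t * 100 / x_0

# Hypothetical yearly sales; the first year is the reference period.
sales = [400, 440, 500]

fixed_base = [dynamic_index(x, sales[0]) for x in sales]
print(fixed_base)  # [100.0, 110.0, 125.0]
```

An index of 125.0 means the level in that period is 25% higher than in the reference period.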
[12] In Romanian: "marimile relative ale planului".
[13] In Romanian: "marimile relative de dinamica".
[14] In Romanian: "marimile relative de intensitate".
2. What are the most representative ways to present the structure of a population?
Required: determine all possible relative measurements and represent the structure of
production graphically.