Chapter Two: Data Classification, Collection, Tabulation, and Presentation
Classification of data is the process of arranging data in groups/classes on the basis of certain
properties. The classification of statistical data serves the following purposes:
- It condenses the raw data into a form suitable for statistical analysis.
- It removes complexities and highlights the features of the data.
- It facilitates comparisons and the drawing of inferences from the data. For example, if
university students in a particular course are divided according to sex, their results can be
compared.
- It provides information about the mutual relationships among elements of a data set. For
example, based on the literacy and criminal tendency of a group of people, it can be
established whether or not literacy has any impact on criminal tendency.
- It helps in statistical analysis by separating elements of the data set into homogeneous groups.
Requisites of Ideal Classification
The classification of data is decided after taking into consideration the nature, scope, and purpose
of the investigation. However, an ideal classification should have the following characteristics:
It should be unambiguous: it is necessary that the various classes should be so defined
that there is no room for confusion. There must be only one class for each element of
the data set. For example, if the population of the country is divided into two classes,
say literates and illiterates, then an exhaustive definition of the terms used would be
essential.
Classes should be exhaustive and mutually exclusive: Each element of the data set
must belong to a class. For this, an extra class can be created with the title ‘others’ so
as to accommodate all the remaining elements of the data set. Each class should be
mutually exclusive so that each element must belong to only one class. For example,
classification of students according to the age: below 25 years and more than 20 years,
is not correct because students of age 20 to 25 may belong to both the classes.
By: Asimamaw B. (MSc.)
It should be stable: The classification of a data set into various classes must be done
in such a manner that if each time an investigation is conducted, it remains unchanged
and hence the results of one investigation may be compared with that of another. For
example, classification of the country’s population by a census survey based on
occupation suffers from this defect because various occupations are defined in different
ways in successive censuses and, as such, these figures are not strictly comparable.
It should be flexible: A classification should be flexible so that suitable adjustments
can be made in new situations and circumstances. However, flexibility does not mean
instability. The data should be divided into a few major classes which must be further
subdivided. Ordinarily there would not be many changes in the major classes. Only
small sub-classes may need a change and the classification can thus retain the merit of
stability and yet have flexibility.
- The term stability does not mean rigidity of classes. The term is used in a relative
sense. A one-time classification cannot remain stable forever. With change in time,
some classes become obsolete and have to be dropped and fresh classes have to be
added. The classification may be called ideal if it can adjust itself to these changes
and yet retain its stability.
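The requirement that classes be exhaustive and mutually exclusive can be checked mechanically. The following is a minimal sketch, not a prescribed procedure: the function name `check_classes`, the sample ages, and the repaired class boundaries are all assumptions made for illustration. It reports values that fall in no class (not exhaustive) and values that fall in more than one class (not mutually exclusive), using the faulty age classes from the example above.

```python
def check_classes(classes, values):
    """Return (unclassified, ambiguous) lists for the given values.

    classes maps a label to a predicate. A value is unclassified when no
    predicate accepts it and ambiguous when more than one predicate does.
    """
    unclassified, ambiguous = [], []
    for v in values:
        hits = [label for label, pred in classes.items() if pred(v)]
        if not hits:
            unclassified.append(v)
        elif len(hits) > 1:
            ambiguous.append(v)
    return unclassified, ambiguous


# The faulty age classes from the text: the two classes overlap.
faulty = {
    "below 25": lambda age: age < 25,
    "more than 20": lambda age: age > 20,
}
# One possible repair (an assumption, not from the text): split at 25.
repaired = {
    "below 25": lambda age: age < 25,
    "25 and above": lambda age: age >= 25,
}

ages = [18, 20, 22, 24, 25, 30]  # hypothetical sample ages
faulty_result = check_classes(faulty, ages)
repaired_result = check_classes(repaired, ages)
print(faulty_result)
print(repaired_result)
```

With the faulty classes, the ages 22 and 24 are reported as ambiguous; with the repaired classes, both lists are empty.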
Basis of Classification
Generally, data are classified on the following four bases:
Geographical Classification: In geographical classification, data are classified on the basis
of geographical or locational differences, such as cities, districts, or villages, between
various elements of the data set.
Chronological Classification: When data are classified on the basis of time, the
classification is known as chronological classification. Such classifications are also called
time series because data are usually listed in chronological order starting with the earliest
period.
Qualitative Classification: In qualitative classification, data are classified on the basis of
descriptive characteristics or on the basis of attributes like sex, literacy, region, caste, or
education, which cannot be quantified. This is done in two ways: (i) Simple classification:
In this type of classification, each class is subdivided into two sub-classes and only one
attribute is studied such as: male and female; blind and not blind, educated and uneducated,
and so on. (ii) Manifold classification: In this type of classification, a class is subdivided
into more than two sub-classes which may be sub-divided further. An example of this form
of classification is shown in the box:
Quantitative Classification: In this classification, data are classified on the basis of some
characteristics which can be measured such as height, weight, income, expenditure,
production, or sales.
- Quantitative variables can be divided into the following two types. The term
variable refers to any quantity or attribute whose value varies from one
investigation to another.
Continuous variable is the one that can take any value within the range of
numbers. Thus the height or weight of individuals can be of any value within
the limits. In such a case data are obtained by measurement.
Discrete (also called discontinuous) variable is the one whose values change
by steps or jumps and cannot assume a fractional value. The number of
children in a family, number of workers (or employees), and number of students
in a class are a few examples of a discrete variable. In such a case data are
obtained by counting.
Table 1: Examples of continuous and discrete variables in a data set
Discrete series              Continuous series
Value      Frequency         Class interval    Frequency
0          10                100 to 110        10
1          30                110 to 120        20
2          60                120 to 130        25
3          90                130 to 140        35
4          110               140 to 150        50
5          20
Total      320               Total             140
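The distinction between counting and measuring can be illustrated with a short sketch. The observations below are hypothetical and invented for illustration only, and the helper `class_label` is an assumed name. The sketch also assumes that each class interval includes its lower limit and excludes its upper limit, a convention the text does not state.

```python
from collections import Counter


def class_label(value, start=100, width=10):
    """Label of the class interval containing a measured value.

    Assumes intervals include their lower limit and exclude their upper one.
    """
    lower = start + width * int((value - start) // width)
    return f"{lower} to {lower + width}"


# Hypothetical observations, invented for illustration only.
children_per_family = [0, 2, 3, 3, 1, 2]       # discrete: obtained by counting
weights = [104.5, 118.2, 110.0, 149.9, 131.7]  # continuous: obtained by measurement

# A discrete variable is tabulated value by value.
discrete_freq = Counter(children_per_family)
# A continuous variable is first grouped into class intervals.
continuous_freq = Counter(class_label(w) for w in weights)

print(sorted(discrete_freq.items()))
print(sorted(continuous_freq.items()))
```

Under the stated convention, a weight of exactly 110.0 is placed in the class "110 to 120", not in "100 to 110".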
event in which he is interested. Sometimes mechanical devices are also used to record the
desired data.
Interview method: the interview method of collecting data involves presentation of oral-
verbal stimuli and reply in terms of oral-verbal responses. This method can be used through
personal interviews and, if possible, through telephone interviews.
Questionnaire method: one of the most conventional methods of data collection,
particularly in wider areas having big inquiries, is the questionnaire method of primary data
collection.
- In this method, a questionnaire is prepared to fit the objective of the study
and is generally sent by post to the respondents with a request to answer the
questions.
- A questionnaire consists of a number of questions printed or typed in a definite
order on a form or set of forms.
- The questionnaire is mailed to respondents who are expected to read and
understand the questions and write down the reply in the space meant for the
purpose in the questionnaire itself.
- The respondents have to answer the questions on their own.
Schedule method: a schedule is the tool or instrument used to collect data from the respondents
while an interview is conducted.
- A schedule contains questions, statements (on which opinions are elicited), and blank
spaces/tables in which the responses are filled in.
- Schedule is the name usually applied to a set of questions which are asked and filled
in by an interviewer in a face-to-face situation with another person.
- This method of data collection is very much like the collection of data through a
questionnaire, the main difference being that schedules (proformas
containing a set of questions) are filled in by enumerators who are
specially appointed for the purpose.
- These enumerators go to the respondents with the schedules, put to them the
questions from the proforma in the order the questions are listed, and record the
replies in the space meant for the same in the proforma.
Other methods such as:
- Warranty cards: Warranty cards are usually postal sized cards which are used by
dealers of consumer durables to collect information regarding their products.
- Distributor or store audits: Distributor or store audits are performed by
distributors as well as manufacturers through their salesmen at regular intervals.
- Pantry audits: Pantry audit technique is used to estimate consumption of the basket
of goods at the consumer level. In this type of audit, the investigator collects an
inventory of types, quantities, and prices of commodities consumed.
Methods of collection of Secondary data
- Secondary data may either be published data or unpublished data.
Usually published data are available in: (a) various publications of the central, state, or
local governments; (b) various publications of foreign governments or of international
bodies and their subsidiary organizations; (c) technical and trade journals; (d) books,
magazines and newspapers; (e) reports and publications of various associations connected
with business and industry, banks, stock exchanges, etc.; (f) reports prepared by research
scholars, universities, economists, etc. in different fields; and (g) public records and
statistics, historical documents, and other sources of published information.
The sources of unpublished data are many; they may be found in diaries, letters,
unpublished biographies and autobiographies and also may be available with scholars and
research workers, trade associations, labor bureaus and other public/ private individuals
and organizations.
And here’s the frequency distribution. You can see that for each range of scores, there are
associated frequency counts.
Table 3: frequency distribution of 50 scores on a test of statistics for Business
- Class interval is a range of numbers, and the first step in the creation of a frequency
distribution is to define how large each interval will be.
Simply put, there are no hard-and-fast rules about creating class intervals on the
way to creating a frequency distribution. Here are six general rules:
Decide on the number of class intervals.
o The following two rules are often used to decide the approximate number of classes in a frequency
distribution:
I. If k represents the number of classes and N the total number of
observations, then the value of k will be the smallest exponent of the
number 2, so that $2^k \geq N$.
o Example: we have N = 30 observations. If we apply this rule, then we shall have $2^3 = 8$ (< 30);
$2^4 = 16$ (< 30); $2^5 = 32$ (> 30). Thus we may choose k = 5 as the number of classes.
II. According to Sturges' rule, the number of classes can be determined by
the formula
$k = 1 + 3.322 \log_{10} N$
where k is the number of classes and $\log_{10} N$ is the common (base-10) logarithm of the total
number of observations.
Applying this rule with N = 30, we get
$k = 1 + 3.322 \log_{10} 30$
$= 1 + 3.322\,(1.4771) = 5.907 \approx 6$
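Both rules can be sketched in a few lines of code. This is an illustrative sketch only, and the function names are assumptions. For the second rule it uses the standard form of Sturges' rule, k = 1 + 3.322 log10 N, where 3.322 is approximately 1/log10(2).

```python
import math


def classes_by_power_of_two(n):
    """Smallest k such that 2**k >= n (rule I)."""
    k = 0
    while 2 ** k < n:
        k += 1
    return k


def classes_by_sturges(n):
    """Sturges' rule, k = 1 + 3.322 * log10(n), before rounding (rule II)."""
    return 1 + 3.322 * math.log10(n)


n = 30  # number of observations, as in the worked example
k_power = classes_by_power_of_two(n)
k_sturges = classes_by_sturges(n)
print(k_power)
print(round(k_sturges, 3))
```

For N = 30 the first rule gives 5 classes and the second gives about 5.907, which rounds to 6. Both rules give only an approximate guide, so a difference of one class between them is unremarkable.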
Value    Tally      Frequency
84       ||         2
85       ||         2
86       -          0
87       |          1
88       ||||       4
89       |||        3
90       ||         2
91       ||         2
92       ||         2
93       ||||| |    6
94       |||||      5
95       |          1
Total               30
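A tally of this kind can be produced by counting how often each value occurs. The sketch below is illustrative only: the raw observations are hypothetical (the original raw data behind the tally are not reproduced here), and `tally_table` is an assumed helper name.

```python
from collections import Counter


def tally_table(observations):
    """Return (value, tally marks, frequency) rows over the full value range."""
    counts = Counter(observations)
    rows = []
    for value in range(min(observations), max(observations) + 1):
        f = counts[value]
        # A value that never occurs still gets a row, with a dash as its tally.
        rows.append((value, "|" * f if f else "-", f))
    return rows


# Hypothetical raw observations, invented for illustration only.
scores = [84, 85, 84, 87, 88, 88, 85, 88]
rows = tally_table(scores)
for value, marks, f in rows:
    print(f"{value:>3}  {marks:<6} {f}")

# The frequencies must add up to the number of observations.
total = sum(f for _, _, f in rows)
print(total)
```

Checking that the frequency column sums to the number of observations is a quick guard against miscounting.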
Prefatory or head note: If need be, a prefatory note is given just below the title
for its further description in a prominent type. It is usually enclosed in brackets and
is about the unit of measurement.
Foot notes: Anything written below the table is called a footnote. It is written to
further clarify the title, captions, or stubs.
o Example: the educational differences between households that participated in off-farm activities
and those that did not are presented in the following table.
- All these advantages necessitate a clear understanding of the various forms of graphic
representation of a frequency distribution.
o Example: the trend of inflation (INF) in Ethiopia is shown in the following graph.
Although it is said that ‘one picture is worth a thousand words’, a diagram neither proves nor
disproves a particular fact, nor is it suitable for further analysis of data.
However, if diagrams are properly drawn, they highlight the different characteristics of
data.
The following general guidelines are taken into consideration while preparing diagrams:
- Title: Each diagram should have a suitable title. It may be given either at the top of
the diagram or below it.
- Size: The size and portion of each component of a diagram should be such that all
the relevant characteristics of the data are properly displayed and can be easily
understood.
- Proportion of length and breadth: An appropriate proportion between the length
and breadth of the diagram should be maintained.
- Proper scale: There are again no fixed rules for selection of scale. The diagram
should neither be too small nor too large. The scale for the diagram should be
decided after taking into consideration the magnitude of data and the size of the
paper on which it is to be drawn. The scale showing the values as far as possible,
should be in even numbers or in multiples of 5, 10, 20, and so on. The scale should
specify the size of the unit and the nature of data it represents, for example,
‘millions of tonnes’, ‘Rs thousand’, and the like. The scale adopted should be
indicated on both vertical and horizontal axes if different scales are used. Otherwise,
it can be indicated at some suitable place on the graph paper.
- Footnotes and source note: To clarify or elucidate any points which need further
explanation but cannot be shown in the graph, footnotes are given at the bottom of
the diagrams.
- Index: A brief index explaining the different types of lines, shades, designs, or
colours used in the construction of the diagram should be given to understand its
contents.
- Simplicity: Diagrams should be prepared in such a way that they can be understood
easily. To keep it simple, too much information should not be loaded in a single
diagram as it may create confusion.
variant types.
A. A frequency line for discrete as well as for continuous distributions can be represented
graphically by drawing ordinates equal to the frequency on a convenient scale at different
values of the variable, X. For the example of yield, we shall have different yield classes on
the horizontal X-axis and frequencies on the vertical Y-axis as shown in Fig. 2.
B. Bar diagram: Instead of drawing a line joining the class frequencies, one represents the
frequencies in the form of bars. In bar diagrams, equal bases on a horizontal (or vertical) line
are selected, and rectangles are constructed with length proportional to the given frequencies
on a suitably chosen scale. The bars should be drawn at equal distances from one another (Fig.
3).
C. Histogram: A histogram is similar to a bar diagram for discrete data, except that no gap is
left between consecutive bars, reflecting the absence of any gap between consecutive
classes. Continuous grouped data are
usually represented graphically by a histogram. The rectangles are drawn with bases
corresponding to the true class intervals and with heights proportional to the frequencies. With
all the class intervals equal, the areas of the rectangles also represent the corresponding
frequencies. If the class intervals are not all equal, then the heights are to be suitably adjusted
to make the areas proportional to the frequencies (Fig. 4).
E. Pie Chart: The basic idea behind a pie diagram is to treat the total
frequency as 100% and present it as a circle with an angle of 360° at the center. Either the
ordinary or the relative frequencies of a frequency distribution table can be presented
effectively in the form of a pie diagram. For example, for the yield data, a pie chart is prepared
with the class frequencies (Fig. 6).
F. Cumulative Frequency Curve (Ogive): Partitioning the whole data set can very well be made
with the help of a cumulative frequency graph, also known as OGIVE.
G. Pictorial Diagram: To make the information lively and easy to understand by any user,
sometimes information is presented in pictorial forms. Instead of a bar diagram or line
diagram or pie chart, one can use pictures in the diagrams.
H. Maps: Statistical maps are generally used to represent the distribution of particular parameters
like a forest area in a country, paddy-producing zone, and different mines located at different
places in a country, rainfall pattern, population density, etc.
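The quantities behind three of the diagrams described above (histogram, pie chart, and ogive) can be computed directly from a frequency distribution. The following is a minimal sketch, not a plotting routine. It uses the continuous series from Table 1. The function names are assumptions, and merging the first two classes is an illustrative choice made only to show the height adjustment for unequal class intervals.

```python
# Continuous series from Table 1: (lower limit, upper limit, frequency).
classes = [(100, 110, 10), (110, 120, 20), (120, 130, 25),
           (130, 140, 35), (140, 150, 50)]


def histogram_heights(rows):
    """Frequency per unit of class width, so that area is proportional to frequency."""
    return [f / (upper - lower) for lower, upper, f in rows]


def pie_angles(rows):
    """Central angle in degrees for each class; the angles sum to 360."""
    n = sum(f for _, _, f in rows)
    return [360 * f / n for _, _, f in rows]


def cumulative_frequencies(rows):
    """'Less than' cumulative frequency at each upper class limit (ogive points)."""
    running, points = 0, []
    for _, upper, f in rows:
        running += f
        points.append((upper, running))
    return points


# Merging the first two classes gives unequal widths (an illustrative choice).
merged = [(100, 120, 30)] + classes[2:]
heights = histogram_heights(merged)
angles = pie_angles(classes)
ogive_points = cumulative_frequencies(classes)

print(heights)
print([round(a, 1) for a in angles])
print(ogive_points)
```

The merged class 100 to 120 has frequency 30 over a width of 20, so its height is 1.5 per unit of width, not 30. This is the adjustment that keeps the areas proportional to the frequencies.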