UNIT-II
Data Analysis: Editing, Coding, Transformation of Data
Once the collection of data is over, the next step is to organize the data so that
meaningful conclusions may be drawn. The information content of the observations has
to be reduced to a relatively few concepts and aggregates, and the data collected from
the field have to be processed as laid down in the research plan. Data processing
involves editing, coding, classification and tabulation of the collected data so that
they are amenable to analysis. It is an intermediary stage between the collection of
data and their analysis and interpretation.
In case a researcher is confronted with a very large volume of data, it is
imperative to use computer processing. For this purpose, statistical packages
such as SPSS may be used. Computer technology can prove to be a boon because a
huge volume of complex data can be processed speedily and with greater accuracy.
Editing of Data
1. Editing is the first stage in data processing.
2. Editing is the process of examining the data collected through various methods to
detect errors and omissions and to correct them for further analysis (a sketch of
such checks appears after this list).
3. While editing, care has to be taken to see that the data are as accurate and
complete as possible.
4. The editor should have a copy of the instructions given to the interviewers.
5. The editor should not destroy or erase the original entry. Original entries should
be crossed out in such a manner that they remain legible.
6. Before tabulation of data, it is good practice to prepare an operations manual
that lays down the process for identifying inconsistencies and errors, as well as
the methods to edit and correct them.
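Below is a minimal sketch of how such editing checks might be automated in Python with pandas; the column names, valid ranges and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical survey data; column names and valid ranges are assumptions.
df = pd.DataFrame({
    "age": [25, 41, 230, None, 33],      # 230 is an implausible entry
    "satisfaction": [3, 5, 2, 6, 4],     # scale assumed to run from 1 to 5
})

# Flag omissions (missing entries) for follow-up; do not erase originals.
missing = df[df.isna().any(axis=1)]

# Flag out-of-range values as candidates for correction.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]
bad_satisfaction = df[~df["satisfaction"].between(1, 5)]

print("Rows with omissions:\n", missing)
print("Implausible ages:\n", bad_age)
print("Out-of-scale satisfaction scores:\n", bad_satisfaction)
```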
Coding of Data
1. Coding refers to the process by which data are categorized into groups and
numerals or other symbols or both are assigned to each item depending on the
class it falls in.
2. Hence, coding involves: (i) deciding the categories to be used, and (ii) assigning
individual codes to them.
3. In general, coding reduces the huge amount of information collected into a form
that is amenable to analysis.
4. A coding manual is to be prepared with the details of variable names, codes and
instructions. Normally, the coding manual should be prepared before the collection
of data.
5. For open-ended and partially coded questions, the categories and codes have to be
decided after the data collection (see the sketch below, where unlisted responses
are left uncoded).
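As an illustration, a coding manual can be expressed as a simple mapping from response categories to numeric codes; the categories and codes below are hypothetical.

```python
import pandas as pd

# Hypothetical coding manual: response category -> numeric code.
coding_manual = {
    "Strongly agree": 5, "Agree": 4, "Neutral": 3,
    "Disagree": 2, "Strongly disagree": 1,
}

responses = pd.Series(["Agree", "Neutral", "Strongly agree", "Disagree"])

# Apply the codes; responses not in the manual (e.g. open-ended answers)
# map to NaN and must be categorized after data collection.
coded = responses.map(coding_manual)
print(coded.tolist())  # [4, 3, 5, 2]
```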
Classification of Data
1. Classification of data is, in simple terms, the process of dividing data into
different groups or classes according to their similarities and dissimilarities
(a sketch follows this list).
2. The groups should be homogeneous within and heterogeneous between
themselves.
3. Classification condenses a huge amount of data and helps in understanding its
important underlying features.
4. It enables us to make comparison, draw inferences, locate facts and also helps in
bringing out relationships, so as to draw meaningful conclusions.
5. In fact, classification of data provides a basis for the tabulation and analysis of data.
6. Data may be classified according to one or more external characteristics or one or
more internal characteristics or both.
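A brief sketch of classifying numeric data into classes with pandas; the marks and class limits are hypothetical.

```python
import pandas as pd

marks = pd.Series([12, 35, 47, 58, 63, 71, 88, 92, 55, 41])

# Divide the data into classes: homogeneous within, heterogeneous between.
classes = pd.cut(marks, bins=[0, 25, 50, 75, 100],
                 labels=["0-25", "25-50", "50-75", "75-100"])

# The class frequencies give a condensed view of the raw data.
print(classes.value_counts().sort_index())
```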
Tabulation of Data
1. Arranging the data in an orderly manner in rows and columns is called tabulation
of data (a small example follows this list).
2. Quite frequently, data presented in tabular form are much easier to read and
understand than data presented in running text.
3. In classification, the data is divided on the basis of similarity and resemblance,
whereas tabulation is the process of recording the classified facts in rows and
columns. Therefore, after classifying the data into various classes, they should be
shown in the tabular form.
4. Tables may be classified, depending upon the use and objectives of the data to be
presented, into simple tables and complex tables.
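For instance, classified data can be recorded in rows and columns with pandas.crosstab; the variables and values here are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "opinion": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# A simple two-way table with marginal totals in the last row and column.
table = pd.crosstab(df["gender"], df["opinion"], margins=True)
print(table)
```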
Diagrammatic Presentation of Data
1. Diagrammatic presentation is one of the techniques of visual presentation of
statistical data.
2. It is a fact that diagrams do not add new meaning to the statistical facts, but
they reveal the facts of the data more quickly and clearly.
3. One must note that the diagrams must be geometrically accurate.
Therefore, they should be drawn on the graphic axes, i.e., the X axis (horizontal
line) and the Y axis (vertical line).
4. Every diagram must have a concise and self-explanatory title, which may be
written at the top or bottom of the diagram.
5. In order to draw the readers' attention, diagrams must be attractive and well
proportioned.
6. Different colours or shades should be used to exhibit various components of
diagrams and also an index must be provided for identification.
7. Common types include the bar diagram, the multiple bar diagram and the pie
chart, as sketched below.
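A minimal matplotlib sketch of two such diagrams; the categories, values and titles are hypothetical.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [30, 45, 25]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar diagram drawn on the X and Y axes, with a concise title.
ax1.bar(categories, values, color=["tab:blue", "tab:orange", "tab:green"])
ax1.set_title("Bar diagram")

# Pie chart with labels serving as an index to the components.
ax2.pie(values, labels=categories, autopct="%1.0f%%")
ax2.set_title("Pie chart")

plt.tight_layout()
plt.show()
```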
Graphical Presentation of Data
1. The graphic presentation of data leaves an impact on the mind of readers, as a
result of which it is easier to draw trends from the statistical data.
2. The shape of a graph offers easy and appropriate answers to several questions
about the data at a glance.
3. The direction of curves on the graph makes it very easy to draw comparisons.
4. The presentation of time series data on a graph makes it possible to interpolate
or extrapolate the values, which helps in forecasting.
5. The graph of a frequency distribution helps us to determine the values of the
mode, median, quartiles, percentiles, etc.
6. The shape of the graph helps in demonstrating the degree of inequality and the
direction of correlation.
7. For all these advantages, it is necessary for a researcher to have an understanding
of the different types of graphic presentation of data; a small time-series example
follows.
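A short sketch of a time-series graph of the kind described above; the years and sales figures are hypothetical.

```python
import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021, 2022]
sales = [110, 125, 95, 140, 160]

# The direction of the curve makes the trend easy to read, and the
# plotted points allow rough interpolation or extrapolation.
plt.plot(years, sales, marker="o")
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Trend of sales over time")
plt.show()
```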
Some Problems in Processing the Data
(1) The problem concerning 'Don't Know' (or DK) responses: While processing the data,
the researcher often comes across some responses that are difficult to handle. One
category of such responses is the 'Don't Know' response, or simply the DK response.
When the DK response group is small, it is of little significance. But when it is
relatively big, it becomes a matter of major concern. How are DK responses to be dealt
with by researchers?
The best way is to design a better type of question.
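One practical first step is to measure how large the DK group actually is before deciding how to treat it; a sketch with hypothetical responses.

```python
import pandas as pd

answers = pd.Series(["Yes", "No", "DK", "Yes", "DK", "No", "Yes"])

# A small DK share is of little significance; a large share points
# to a question-design problem.
dk_share = (answers == "DK").mean()
print(f"DK responses: {dk_share:.1%}")
```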
Setting of Hypothesis
What is a Hypothesis?
A hypothesis is a tentative explanation for an observation, phenomenon, or
scientific problem that can be tested by further investigation. A hypothesis states
what we are looking for; it is a proposition which can be put to a test to
determine its validity.
Level of significance:- A result is said to be significant at the 5 per cent level if it
(the sample evidence) has a less than 0.05 probability of occurring if Ho is true. In
other words, the 5 per cent level of significance means that the researcher is willing
to take as much as a 5 per cent risk of rejecting the null hypothesis when it (Ho)
happens to be true.
Decision rule or test of hypothesis:- Given a hypothesis Ho and an alternative
hypothesis Ha, we make a rule, known as the decision rule, according to which we
accept Ho (i.e., reject Ha) or reject Ho (i.e., accept Ha).
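In practice the decision rule is often applied through a p-value; a minimal sketch using a one-sample t-test on hypothetical data (the sample values and the hypothesized mean of 50 are assumptions).

```python
from scipy import stats

# Hypothetical sample; Ho: population mean = 50.
sample = [52, 48, 55, 53, 49, 57, 51, 54]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05  # the 5 per cent level of significance
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject Ho")
else:
    print(f"p = {p_value:.3f} >= {alpha}: do not reject Ho")
```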
Standard error:- The standard deviation of the sampling distribution of a statistic is
known as its standard error (S.E.) and is considered the key to sampling theory.
1. The standard error helps in testing whether the difference between observed and
expected frequencies could arise due to chance. The criterion usually adopted is
that if a difference is less than 3 times the S.E., it is regarded as arising due to
fluctuations of sampling; if it is more than 3 times the S.E., it is regarded as
significant.
2. The standard error gives an idea about the reliability and precision of a sample.
The smaller the S.E., the greater the uniformity of sampling distribution and
hence, greater is the reliability of sample.
3. The standard error enables us to specify the limits within which the parameters of
the population are expected to lie with a specified degree of confidence. Such an
interval is usually known as a confidence interval (a worked sketch follows).
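A brief sketch computing the standard error of a sample mean and the corresponding approximate 95 per cent confidence interval; the sample values are hypothetical.

```python
import math
import statistics

sample = [52, 48, 55, 53, 49, 57, 51, 54]
n = len(sample)
mean = statistics.mean(sample)

# S.E. of the mean = s / sqrt(n), with s the sample standard deviation.
se = statistics.stdev(sample) / math.sqrt(n)

# Approximate 95% confidence interval (normal approximation).
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, S.E. = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```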
Chi-Square Test
The chi-square test is one of the simplest and most widely used non-parametric
tests in statistical work. It was first used by Karl Pearson. Pearson's chi-squared
test is used to assess two types of comparison: tests of goodness of fit and tests
of independence.
A test of goodness of fit establishes whether or not an observed frequency
distribution differs from a theoretical distribution.
A test of independence assesses whether paired observations on two variables,
expressed in a contingency table, are independent of each other (e.g. polling responses
from people of different nationalities to see if one's nationality affects the response).
The procedure of the test includes the following steps:
1. Calculate the chi-squared test statistic, χ² = Σ (O − E)² / E, which resembles a
normalized sum of squared deviations between the observed frequencies O and
the theoretical frequencies E.
2. Determine the degrees of freedom of that statistic, which is essentially the
number of frequencies reduced by the number of parameters of the fitted
distribution.
3. Compare the test statistic to the critical value from the chi-squared distribution,
which in many cases gives a good approximation of the distribution of the
statistic. A test that does not rely on this approximation is Fisher's exact test;
it is substantially more accurate in obtaining a significance level, especially
with few observations. A sketch of both kinds of chi-square test follows.
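The steps above can be carried out with scipy; the observed frequencies below are hypothetical.

```python
from scipy import stats

# Goodness of fit: do 120 observed die rolls fit a fair (uniform) die?
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6                      # theoretical frequencies
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"goodness of fit: chi2 = {chi2:.2f}, p = {p:.3f}")   # df = 6 - 1 = 5

# Independence: is opinion independent of gender in a 2x2 table?
table = [[30, 10],
         [25, 15]]
chi2, p, dof, exp = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```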
Degrees of freedom:- By degrees of freedom (d.f.) we mean the number of classes to
which values can be assigned arbitrarily, or at will, without violating the restrictions
or limitations placed.
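For example, in a goodness-of-fit test with k classes the degrees of freedom are k − 1, and in a contingency table with r rows and c columns they are (r − 1)(c − 1); a one-line check against the sketch above.

```python
# Degrees of freedom for the two chi-square settings sketched earlier.
k = 6                       # number of classes (goodness of fit)
r, c = 2, 2                 # rows and columns (contingency table)
print(k - 1)                # 5
print((r - 1) * (c - 1))    # 1
```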