UNIT-II
Data Analysis: Editing, Coding, Transformation of Data
Once the collection of data is over, the next step is to organize the data so that
meaningful conclusions may be drawn. The information content of the observations has
to be reduced to a relatively few concepts and aggregates, and the data collected from
the field have to be processed as laid down in the research plan. Data processing
involves editing, coding, classification and tabulation of the collected data so that
they are amenable to analysis. It is an intermediary stage between the collection of
data and their analysis and interpretation.
In case a researcher is confronted with a very large volume of data, it is
imperative to use computer processing. For this purpose, statistical packages
such as SPSS may be used. Computer technology can prove to be a boon because a
huge volume of complex data can be processed speedily and with greater accuracy.
Editing of Data
1. Editing is the first stage in data processing.
2. Editing is the process of examining the data collected through various methods to
detect errors and omissions and to correct them for further analysis (a sketch of
such checks appears after this list).
3. While editing, care has to be taken to see that the data are as accurate and
complete as possible.
4. The editor should have a copy of the instructions given to the interviewers.
5. The editor should not destroy or erase the original entry. Original entries should
be crossed out in such a manner that they remain legible.
6. Before tabulation of data, it is good practice to prepare an operations manual
that lays down the process for identifying inconsistencies and errors, as well as
the methods to edit and correct them.
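Below is a minimal sketch of how such editing checks might be automated in Python with pandas; the column names, valid ranges and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical survey data; column names and valid ranges are assumptions.
df = pd.DataFrame({
    "age": [25, 41, 230, None, 33],      # 230 is an implausible entry
    "satisfaction": [3, 5, 2, 6, 4],     # scale assumed to run from 1 to 5
})

# Flag omissions (missing entries) for follow-up; do not erase originals.
missing = df[df.isna().any(axis=1)]

# Flag out-of-range values as candidates for correction.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]
bad_satisfaction = df[~df["satisfaction"].between(1, 5)]

print("Rows with omissions:\n", missing)
print("Implausible ages:\n", bad_age)
print("Out-of-scale satisfaction scores:\n", bad_satisfaction)
```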
Coding of Data
1. Coding refers to the process by which data are categorized into groups and
numerals or other symbols or both are assigned to each item depending on the
class it falls in.
2. Hence, coding involves: (i) deciding the categories to be used, and (ii) assigning
individual codes to them.
3. In general, coding reduces the huge amount of information collected into a form
that is amenable to analysis.
4. A coding manual is to be prepared with the details of variable names, codes and
instructions. Normally, the coding manual should be prepared before the collection
of data.
5. For open-ended and partially coded questions, the categories and codes have to be
decided after the data collection (see the sketch below, where unlisted responses
are left uncoded).
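As an illustration, a coding manual can be expressed as a simple mapping from response categories to numeric codes; the categories and codes below are hypothetical.

```python
import pandas as pd

# Hypothetical coding manual: response category -> numeric code.
coding_manual = {
    "Strongly agree": 5, "Agree": 4, "Neutral": 3,
    "Disagree": 2, "Strongly disagree": 1,
}

responses = pd.Series(["Agree", "Neutral", "Strongly agree", "Disagree"])

# Apply the codes; responses not in the manual (e.g. open-ended answers)
# map to NaN and must be categorized after data collection.
coded = responses.map(coding_manual)
print(coded.tolist())  # [4, 3, 5, 2]
```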
Classification of Data
1. Classification of data is, in simple terms, the process of dividing data into
different groups or classes according to their similarities and dissimilarities
(a sketch follows this list).
2. The groups should be homogeneous within and heterogeneous between
themselves.
3. Classification condenses a huge amount of data and helps in understanding its
important underlying features.
4. It enables us to make comparison, draw inferences, locate facts and also helps in
bringing out relationships, so as to draw meaningful conclusions.
5. In fact, classification of data provides a basis for the tabulation and analysis of data.
6. Data may be classified according to one or more external characteristics or one or
more internal characteristics or both.
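A brief sketch of classifying numeric data into classes with pandas; the marks and class limits are hypothetical.

```python
import pandas as pd

marks = pd.Series([12, 35, 47, 58, 63, 71, 88, 92, 55, 41])

# Divide the data into classes: homogeneous within, heterogeneous between.
classes = pd.cut(marks, bins=[0, 25, 50, 75, 100],
                 labels=["0-25", "25-50", "50-75", "75-100"])

# The class frequencies give a condensed view of the raw data.
print(classes.value_counts().sort_index())
```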
Tabulation of Data
1. Arranging the data in an orderly manner in rows and columns is called tabulation
of data (a small example follows this list).
2. Quite frequently, data presented in tabular form are much easier to read and
understand than data presented in running text.
3. In classification, the data is divided on the basis of similarity and resemblance,
whereas tabulation is the process of recording the classified facts in rows and
columns. Therefore, after classifying the data into various classes, they should be
shown in the tabular form.
4. Tables may be classified, depending upon the use and objectives of the data to be
presented, into simple tables and complex tables.
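For instance, classified data can be recorded in rows and columns with pandas.crosstab; the variables and values here are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "opinion": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# A simple two-way table with marginal totals in the last row and column.
table = pd.crosstab(df["gender"], df["opinion"], margins=True)
print(table)
```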
Diagrammatic Presentation of Data
1. Diagrammatic presentation is one of the techniques of visual presentation of
statistical data.
2. It is a fact that diagrams do not add new meaning to the statistical facts, but
they reveal the facts of the data more quickly and clearly.
3. One must note that the diagrams must be geometrically accurate.
Therefore, they should be drawn on the graphic axes, i.e., the X axis (horizontal
line) and the Y axis (vertical line).
4. Every diagram must have a concise and self-explanatory title, which may be
written at the top or bottom of the diagram.
5. In order to draw the readers' attention, diagrams must be attractive and well
proportioned.
6. Different colours or shades should be used to exhibit various components of
diagrams and also an index must be provided for identification.
7. Common types include the bar diagram, the multiple bar diagram and the pie
chart, as sketched below.
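A minimal matplotlib sketch of two such diagrams; the categories, values and titles are hypothetical.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [30, 45, 25]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar diagram drawn on the X and Y axes, with a concise title.
ax1.bar(categories, values, color=["tab:blue", "tab:orange", "tab:green"])
ax1.set_title("Bar diagram")

# Pie chart with labels serving as an index to the components.
ax2.pie(values, labels=categories, autopct="%1.0f%%")
ax2.set_title("Pie chart")

plt.tight_layout()
plt.show()
```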
Graphical Presentation of Data
1. The graphic presentation of data leaves an impact on the mind of readers, as a
result of which it is easier to draw trends from the statistical data.
2. The shape of a graph offers easy and appropriate answers to several questions
about the data at a glance.
3. The direction of curves on the graph makes it very easy to draw comparisons.
4. The presentation of time series data on a graph makes it possible to interpolate
or extrapolate the values, which helps in forecasting.
5. The graph of a frequency distribution helps us to determine the values of the
mode, median, quartiles, percentiles, etc.
6. The shape of the graph helps in demonstrating the degree of inequality and the
direction of correlation.
7. For all these advantages, it is necessary for a researcher to have an understanding
of the different types of graphic presentation of data; a small time-series example
follows.
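A short sketch of a time-series graph of the kind described above; the years and sales figures are hypothetical.

```python
import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021, 2022]
sales = [110, 125, 95, 140, 160]

# The direction of the curve makes the trend easy to read, and the
# plotted points allow rough interpolation or extrapolation.
plt.plot(years, sales, marker="o")
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Trend of sales over time")
plt.show()
```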
Some Problems in Processing the Data
(1) The problem concerning 'Don't Know' (or DK) responses: While processing the data,
the researcher often comes across some responses that are difficult to handle. One
category of such responses is the 'Don't Know' response, or simply the DK response.
When the DK response group is small, it is of little significance. But when it is
relatively big, it becomes a matter of major concern. How are DK responses to be dealt
with by researchers?
The best way is to design a better type of question.
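One practical first step is to measure how large the DK group actually is before deciding how to treat it; a sketch with hypothetical responses.

```python
import pandas as pd

answers = pd.Series(["Yes", "No", "DK", "Yes", "DK", "No", "Yes"])

# A small DK share is of little significance; a large share points
# to a question-design problem.
dk_share = (answers == "DK").mean()
print(f"DK responses: {dk_share:.1%}")
```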
Setting of Hypothesis
What is a Hypothesis?
A hypothesis is a tentative explanation for an observation, phenomenon, or
scientific problem that can be tested by further investigation. A hypothesis states
what we are looking for; it is a proposition which can be put to a test to
determine its validity.
Level of significance:- A result is said to be significant at the 5 per cent level if it
(the sample evidence) has a less than 0.05 probability of occurring if Ho is true. In
other words, the 5 per cent level of significance means that the researcher is willing
to take as much as a 5 per cent risk of rejecting the null hypothesis when it (Ho)
happens to be true.
Decision rule or test of hypothesis:- Given a hypothesis Ho and an alternative
hypothesis Ha, we make a rule, known as the decision rule, according to which we
accept Ho (i.e., reject Ha) or reject Ho (i.e., accept Ha).
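In practice the decision rule is often applied through a p-value; a minimal sketch using a one-sample t-test on hypothetical data (the sample values and the hypothesized mean of 50 are assumptions).

```python
from scipy import stats

# Hypothetical sample; Ho: population mean = 50.
sample = [52, 48, 55, 53, 49, 57, 51, 54]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05  # the 5 per cent level of significance
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject Ho")
else:
    print(f"p = {p_value:.3f} >= {alpha}: do not reject Ho")
```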
Standard error:- The standard deviation of the sampling distribution of a statistic is
known as its standard error (S.E.) and is considered the key to sampling theory.
1. The standard error helps in testing whether the difference between observed and
expected frequencies could arise due to chance. The criterion usually adopted is
that if a difference is less than 3 times the S.E., it is regarded as arising due to
fluctuations of sampling; if it is more than 3 times the S.E., it is regarded as
significant.
2. The standard error gives an idea about the reliability and precision of a sample.
The smaller the S.E., the greater the uniformity of sampling distribution and
hence, greater is the reliability of sample.
3. The standard error enables us to specify the limits within which the parameters of
the population are expected to lie with a specified degree of confidence. Such an
interval is usually known as a confidence interval (a worked sketch follows).
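A brief sketch computing the standard error of a sample mean and the corresponding approximate 95 per cent confidence interval; the sample values are hypothetical.

```python
import math
import statistics

sample = [52, 48, 55, 53, 49, 57, 51, 54]
n = len(sample)
mean = statistics.mean(sample)

# S.E. of the mean = s / sqrt(n), with s the sample standard deviation.
se = statistics.stdev(sample) / math.sqrt(n)

# Approximate 95% confidence interval (normal approximation).
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, S.E. = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```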
Chi-Square Test
The chi-square test is one of the simplest and most widely used non-parametric
tests in statistical work. It was first used by Karl Pearson. Pearson's chi-squared
test is used to assess two types of comparison: tests of goodness of fit and tests
of independence.
A test of goodness of fit establishes whether or not an observed frequency
distribution differs from a theoretical distribution.
A test of independence assesses whether paired observations on two variables,
expressed in a contingency table, are independent of each other (e.g. polling responses
from people of different nationalities to see if one's nationality affects the response).
The procedure of the test includes the following steps:
1. Calculate the chi-squared test statistic, χ² = Σ (O − E)² / E, which resembles a
normalized sum of squared deviations between the observed frequencies O and
the theoretical frequencies E.
2. Determine the degrees of freedom of that statistic, which is essentially the
number of frequencies reduced by the number of parameters of the fitted
distribution.
3. Compare the test statistic to the critical value from the chi-squared distribution,
which in many cases gives a good approximation of the distribution of the
statistic. A test that does not rely on this approximation is Fisher's exact test;
it is substantially more accurate in obtaining a significance level, especially
with few observations. A sketch of both kinds of chi-square test follows.
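The steps above can be carried out with scipy; the observed frequencies below are hypothetical.

```python
from scipy import stats

# Goodness of fit: do 120 observed die rolls fit a fair (uniform) die?
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6                      # theoretical frequencies
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"goodness of fit: chi2 = {chi2:.2f}, p = {p:.3f}")   # df = 6 - 1 = 5

# Independence: is opinion independent of gender in a 2x2 table?
table = [[30, 10],
         [25, 15]]
chi2, p, dof, exp = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```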
Degrees of freedom:- By degrees of freedom (d.f.) we mean the number of classes to
which values can be assigned arbitrarily, or at will, without violating the restrictions
or limitations placed.
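For example, in a goodness-of-fit test with k classes the degrees of freedom are k − 1, and in a contingency table with r rows and c columns they are (r − 1)(c − 1); a one-line check against the sketch above.

```python
# Degrees of freedom for the two chi-square settings sketched earlier.
k = 6                       # number of classes (goodness of fit)
r, c = 2, 2                 # rows and columns (contingency table)
print(k - 1)                # 5
print((r - 1) * (c - 1))    # 1
```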