
The Art of Data Analysis: January 2015



The art of Data Analysis

Article · January 2015


1 author:

Muhammad Ibrahim
Govt. M A O College

All content following this page was uploaded by Muhammad Ibrahim on 27 October 2015.



Reviewed Article

The art of Data Analysis

Muhammad Ibrahim1

Abstract
After the collection of reliable data, the next step is to analyze the data. This article is a brief review of the data analysis procedure.

1. Muhammad Ibrahim,
Department of Statistics, Govt. MAO College, Lahore
Contact No. 0300-4668681
Email: Ibrahim_ap98@yahoo.com

Corresponding Author:
Muhammad Ibrahim
Department of Statistics, Govt. MAO
College, Lahore
Contact No. +92-300-4668681
Email: ibrahim.ap12@gmail.com

Cite this article

Ibrahim, M. (2015). The Art of data analysis. Journal of Allied Health Sciences Pakistan, 1(1), 98-104.

98 Journal of Allied Health Sciences Pakistan



Introduction

After accurate and reliable data have been collected from the source by an appropriate method, the next step is to extract the pertinent and useful information buried in the data for further manipulation and interpretation. The process of performing certain calculations and evaluations in order to extract relevant information from data is called data analysis. Data analysis may take several steps to reach certain conclusions. Simple data can be organized very easily, while complex data require proper processing. The word “processing” means recasting and handling the data to make them ready for analysis.

The word analysis refers to closely related operations that are performed with the purpose of summarizing the collected data and organizing them in such a manner that they yield answers to the questions. In simple words, it means studying the data to determine inherent facts.

According to Selltize and Johoda, the term analysis refers to processes that prepare the data for operations designed to draw conclusions for further manipulation. Analysis of data involves organizing the data in a proper way. The problem of data analysis varies from study to study.

Several steps are involved in the “integrated operations” that are called analysis of data. Those steps are:

i. Classification & tabulation
ii. Graphical representation
iii. Measure of location
iv. Measure of Variability
v. Measure of relationship
vi. Estimating the unknown
vii. Testing of hypothesis

CLASSIFICATION OF DATA

Classification of data is the process of arranging data in classes according to some resemblance or common characteristic. Classification is also called categorization of data. Classification can be done on the basis of a quality or attribute such as gender, color, literacy, beauty, IQ. This type of classification is called qualitative classification. The second type is quantitative classification, which is done on the basis of a variable like height or weight. The third type is geographical classification, such as village, ward, city, urban, rural. The fourth type is chronological classification, which is done on the basis of time, such as weekly, monthly, yearly.

After classification, the frequency of the classes may be converted into a proportion or percentage.

Percentage (%) = Part/Total * 100

Proportion (p) = Part/Total

b) Tabulation or frequency distribution

The technique of presenting quantitative data like height, weight, BP, temperature and other biological characteristics, which are measured on a physical scale, in rows and columns is called tabulation.

Tabulation is also known as the frequency distribution of a variable. The main objective of tabulation is to condense the data and to make comparison easy. In the tabular form of data, the required interpretation is easily accessible and the data look very clear. Preparation of a table is an art and requires expert handling of data. The preparation of a table depends upon the size and nature of the data.

GRAPHICAL REPRESENTATION OF DATA

The data can be displayed with the help of graphs and diagrams instead of classification or tabulation. There are many reasons for drawing a graph. The most appealing reason is that one simple graph says more than twenty pages of prose. Graphs present a summary of the data. It is usually suggested that the graphical representation should be looked at before proceeding to the formal statistical calculation.

Graphs give a visual presentation of the data and are useful in fitting models to data. By looking at the graphs, one can understand the data in an easy way.

Common types of graphs:
1. Simple bar chart
2. Multiple bar chart
3. Pie chart
4. Histogram
5. Scatter diagram

SIMPLE BAR CHART

The simple bar chart consists of vertical or horizontal bars of equal width, with length proportional to the magnitude of the value they represent.

[Figure: Simple bar chart with three bars labelled Ist, IInd, IIIrd]

Multiple or Sub divided chart

It is simply an extension of the simple bar chart which represents more than one related set of data.

[Figure: Multiple charts (China, Pakistan, India, UK, USA)]

[Figure: Sub divided charts (China, Pakistan, India, UK, USA)]

PIE CHARTS

The pie chart consists of a circle which is subdivided into sectors whose areas are proportional to the different components of the total quantity.

Angle of component = (component value / total) * 360

[Figure: Pie charts (China 17%, USA 38%, Pakistan 6%, India 11%, UK 28%)]

HISTOGRAM

A histogram is a graph of continuous data like weight, height, age etc., used to see the theoretical shape of the data. The curve of the histogram tells us whether the data are skewed or symmetrical. A histogram consists of a series of adjacent rectangles drawn for a grouped frequency distribution.

[Figure: Histogram of AGE, symmetrical curve (Std. Dev = 9.95, Mean = 56.1, N = 387)]
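
As a small worked example of the pie chart rule (angle of a component = component value / total * 360), the sketch below converts the percentage shares shown in the pie chart figure into sector angles. The function name `pie_angles` is hypothetical, and the shares are as read from the figure labels.

```python
def pie_angles(components):
    """Map each component to its sector angle in degrees."""
    total = sum(components.values())
    # Angle of component = component value / total * 360
    return {name: value / total * 360 for name, value in components.items()}


# Shares (percent of total) as labelled in the pie chart figure.
shares = {"China": 17, "USA": 38, "Pakistan": 6, "India": 11, "UK": 28}
angles = pie_angles(shares)

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

Whatever the input values, the angles add up to 360 degrees, which is a quick check on the arithmetic before drawing the chart.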
[Figure: Histogram of AGE, negatively skewed (Std. Dev = 13.92, Mean = 44.2, N = 55)]

[Figure: Histogram of AGE (Std. Dev = 10.57, Mean = 24.6, N = 55)]

[Figure: Scatter diagram]

Measure of Location

The graphical representation of data gives us only a tentative picture of the data; it cannot provide the exact picture of the distribution, nor can it be used to estimate or predict a value. The important feature which enables us to describe the data in numerical or quantitative form is the measure of location or central tendency. The most common measures of central tendency are the arithmetic mean, median and mode. A measure of central tendency summarizes the data in a single value which normally lies in the centre of the distribution. These measures are used according to the situation or the data collected by the researcher. The arithmetic mean is useful when the data are relatively homogeneous, the median is used when the values are relatively heterogeneous, and the mode is useful when one value occurs more frequently than the others. When the relative weights of the values are not the same, we use the weighted mean.

ARITHMETIC MEAN:

The arithmetic mean of a set of values is obtained by adding all the values and then dividing by their number. Symbolically it is described as:

$$\bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Example: Suppose we have a data set as 1.7, 2.2, 3.9, 3.11, 14.7. The mean is calculated as:

$$\bar{x} = \frac{\sum x}{n} = \frac{1.7 + 2.2 + 3.9 + 3.11 + 14.7}{5} = 5.12$$

MEDIAN:

The median is defined as the central value of the arranged data; it does not depend on the magnitude of the extreme values but only on the position of the values in the ordered data. Mathematically it is written as:

$$\text{median} = \left(\frac{n+1}{2}\right)\text{th value of the arranged data}$$

In the above example the data arranged in ascending order are 1.7, 2.2, 3.11, 3.9, 14.7, so the median is the 3rd value, 3.11. Other allied measures of the median are quartiles and percentiles, which are useful in the application of probability theory.

MODE:

It is the value which occurs most frequently in the data. Suppose we have the collar sizes of patients as:
12, 14, 12, 15, 12, 13, 14, 12, 15. Here 12 is the mode of the data.

MEASURE OF VARIATION:

Sample variability plays an important role in data analysis. In biological and pathological characteristics, variation is common in real life. There are many measures of variation, but the most useful are the range, standard deviation and variance.

Range:

It is the difference between the maximum and minimum values. The range depends only on the extreme values of the data and does not consider the other values. The occurrence of extreme values in the data greatly influences the range, so it is not considered a good measure of dispersion, even though it is used in certain circumstances.

STANDARD DEVIATION:

The most common and useful measure of dispersion is the standard deviation. The square of the standard deviation is called the variance. The variance is more useful for further mathematical manipulation.

Mathematically the standard deviation is defined as:

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

The standard deviation is very useful in many cases. The prime objective of calculating the standard deviation is to find out the variation in the data: the greater the value of the standard deviation, the greater the variation in the data. A quantity related to the standard deviation is the standard error. That is:

$$SE(\bar{x}) = \frac{SD}{\sqrt{n}}$$

Another quantity related to the standard deviation is the coefficient of variation:

$$CV = \frac{SD}{\bar{x}} \times 100$$

The coefficient of variation is used to compare data from different sources. The standard deviation is very useful in describing the shape of a distribution.

MEASURE OF RELATIONSHIP:

In certain situations the researcher is interested in finding out the relationship between variables, that is, whether there is a strong or a weak relationship between two variables.

The measure of relationship is classified as:

1. Regression.
2. Correlation.

Regression analysis is used to predict or estimate one variable on the basis of another variable. In regression we intend to describe the dependence of one variable on the other.

The regression line is mathematically defined as:

Y = a + bx

where b is the slope of the line, which measures the change in the dependent variable for a unit change in the independent variable, and “a” is the intercept, i.e. the value of the dependent variable when x = 0.

CORRELATION:

It describes the relationship between two variables.

Co-relation coefficient:

It measures the degree of relationship between two variables and is mathematically defined as:

$$r = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sqrt{\left(\sum x^2 - n\,\bar{x}^2\right)\left(\sum y^2 - n\,\bar{y}^2\right)}}$$

The value of “r” lies between -1 and +1. The coefficient of determination, 100 r^2, gives the percentage of the variation in the dependent variable that is explained by the independent variable.
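
The measures of location and variation defined in the preceding sections can be checked numerically with a short sketch. It reuses the article's own data (the five values from the arithmetic mean example and the collar sizes from the mode example); the variable names are illustrative, and the standard deviation uses the divisor n, as in the formula given in the text.

```python
import math
import statistics

data = [1.7, 2.2, 3.9, 3.11, 14.7]          # arithmetic mean example
collar = [12, 14, 12, 15, 12, 13, 14, 12, 15]  # mode example
n = len(data)

mean = sum(data) / n                  # arithmetic mean
median = statistics.median(data)      # middle value after sorting the data
mode = statistics.mode(collar)        # most frequent value

value_range = max(data) - min(data)   # maximum minus minimum
# Standard deviation with divisor n, as defined in the text.
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
variance = sd ** 2                    # square of the standard deviation
se = sd / math.sqrt(n)                # standard error of the mean
cv = sd / mean * 100                  # coefficient of variation (percent)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range:.2f} sd={sd:.2f} se={se:.2f} cv={cv:.1f}%")
```

The value 14.7 pulls the mean (5.12) well above the median (3.11), which illustrates why the median is preferred when the values are heterogeneous. The standard library function `statistics.pstdev` computes the same divisor-n standard deviation.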

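The regression line Y = a + bx and the correlation coefficient r can be illustrated with a minimal sketch. Two points are assumptions rather than statements from the article: the slope and intercept are estimated by ordinary least squares (the article does not say how a and b are obtained), and the five (x, y) pairs are hypothetical. The function name `regression_and_correlation` is also illustrative.

```python
import math


def regression_and_correlation(x, y):
    """Return (a, b, r) for the line Y = a + b*x and the correlation r."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Building blocks of the correlation formula in the text.
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    syy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    b = sxy / sxx                    # least-squares slope (assumption)
    a = y_bar - b * x_bar            # intercept: value of Y when x = 0
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    return a, b, r


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r = regression_and_correlation(x, y)

print(f"Y = {a:.2f} + {b:.2f}x, r = {r:.3f}, 100 r^2 = {100 * r ** 2:.1f}%")
```

For these values the fitted line is Y = 2.20 + 0.60x, and about 60% of the variation in y is explained by x.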
STATISTICAL INFERENCE

Sometimes we have to decide which of two things is better; this is difficult to answer if one does not have sufficient knowledge about both. Statistical inference is the science of drawing conclusions and making decisions about a population on the basis of sample information. The categories of statistical inference are estimation and testing of hypothesis.

ESTIMATION

It is simply the process or procedure of estimating the unknown parameters of a population. E.g. the sample mean is an estimator of the population mean. When the population parameter is estimated by a single value, the estimate is called a point estimate. Sometimes a point estimator does not provide a good estimate of the population parameter; we then use interval estimation, in which the parameter lies within an interval with a certain degree of confidence. E.g. the 95% confidence interval for the population mean for a large sample is:

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$

And for small samples:

$$\bar{x} \pm t\,\frac{s}{\sqrt{n}}$$

TESTING OF HYPOTHESIS

The procedure which leads us to accept or reject a specified statement about a population parameter is called testing of hypothesis, e.g. we wish to decide whether the average life of a certain product is two years.

Procedure involved in testing of hypothesis

The following are the steps:

i. State the null hypothesis and the alternative hypothesis.
ii. Selection of level of significance.
iii. Test statistic.
iv. Critical region
v. Computation
vi. Decision

Null hypothesis

A statistical hypothesis is simply a statement or assumption about the population parameter under study. The hypothesis to be tested is called the null hypothesis and is represented by H0; the hypothesis which is accepted when the null hypothesis is rejected is called the alternative hypothesis.

Level of significance

It is denoted by α and is the probability of rejecting a true null hypothesis. It is a pre-assigned value, normally 5% or 1%.

Test statistic

A test statistic is simply a formula, function or method used to decide whether to accept or reject the null hypothesis. The following are the commonly used test statistics:

i. Z-statistic
ii. T-statistic
iii. F-statistic
iv. χ2-statistic

Z-statistic is used to test the population mean when the sample size is large, normally more than 30.

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

T-statistic is used to test the population mean when the sample size is small and the population standard deviation is unknown.

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
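
The interval estimate and the Z and t statistics above reduce to a few arithmetic steps, sketched here. The function names and all numerical inputs are hypothetical; the sample standard deviation s is computed with the divisor n - 1 (via `statistics.stdev`), which is an assumption since the article does not define s.

```python
import math
import statistics


def ci95_large_sample(sample_mean, sigma, n):
    """95% confidence interval for the mean: x_bar +/- 1.96 * sigma / sqrt(n)."""
    half_width = 1.96 * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (x_bar - mu) / (sigma / sqrt(n)), for large samples."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


def t_statistic(sample, mu0):
    """t = (x_bar - mu) / (s / sqrt(n)), for small samples."""
    n = len(sample)
    s = statistics.stdev(sample)   # divisor n - 1 (assumption)
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(n))


# Hypothetical large sample: n = 100, mean life 2.1 years, sigma = 0.5.
low, high = ci95_large_sample(2.1, 0.5, 100)
z = z_statistic(2.1, 2.0, 0.5, 100)

# Hypothetical small sample of five measurements, tested against mu = 1.5.
t = t_statistic([1.8, 2.2, 2.0, 2.4, 1.6], 1.5)

print(f"95% CI: ({low:.3f}, {high:.3f}), Z = {z:.2f}, t = {t:.2f}")
```

The t value is compared with a t table at n - 1 degrees of freedom, which is not computed here.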
F-statistic is used to test the equality of two population variances on the basis of sample variances.
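
A minimal sketch of the F-statistic follows. The article gives no formula for it, so the form used here, the ratio of the two sample variances with the larger one in the numerator, is an assumption based on the usual textbook definition. The function name `f_statistic` and the two samples are hypothetical.

```python
import statistics


def f_statistic(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator) for two samples.

    F is the ratio of the sample variances (divisor n - 1), with the
    larger variance placed in the numerator so that F >= 1.
    """
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical measurements from two groups.
group_a = [10, 12, 14, 16, 18]
group_b = [11, 12, 13, 14, 15]
f_value, df1, df2 = f_statistic(group_a, group_b)

print(f"F = {f_value:.2f} with ({df1}, {df2}) degrees of freedom")
```

An F value close to 1 is consistent with equal population variances; how far above 1 it must be to reject equality is read from an F table for the given degrees of freedom.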

χ2-statistic is used to test the goodness of fit between observed data and expected data.
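
The formula for this statistic did not survive in the text, so the sketch below assumes the standard Pearson form, the sum of (observed - expected)^2 / expected over all classes. The function name `chi_square_statistic` and the frequencies are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequencies for five classes, expected to be equally likely.
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]

chi2 = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

print(f"chi-square = {chi2:.2f} with {degrees_of_freedom} degrees of freedom")
```

A small value means the observed frequencies are close to the expected ones; the computed value is compared with a chi-square table at the chosen level of significance.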

Critical region

After selecting the suitable test statistic according to the problem, the next step is to decide which values of the test statistic should lead us to reject or not reject the null hypothesis. This amounts to partitioning the sample space of the test statistic into two sets. The set of values which would lead us to reject the null hypothesis is called the critical region.

Computation

The proper computations are made after selecting the test statistic.

DECISION

If the computed value of the test statistic lies in the critical region, we reject the null hypothesis; otherwise we do not reject the null hypothesis.
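
The critical region and decision steps can be sketched for a Z-statistic as follows. The article does not say whether the test is one-sided or two-sided; a two-sided test is assumed here, so the critical region is |Z| greater than the standard normal quantile at 1 - α/2. The function name `decide` and the example Z values are hypothetical.

```python
from statistics import NormalDist


def decide(z, alpha=0.05):
    """Return (critical_value, reject) for a two-sided Z test (assumption)."""
    # Boundary of the critical region: about 1.96 when alpha = 0.05.
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    reject = abs(z) > critical_value   # True if z lies in the critical region
    return critical_value, reject


# Hypothetical computed values of the test statistic.
for z in (2.0, 1.0):
    critical_value, reject = decide(z)
    verdict = "reject H0" if reject else "do not reject H0"
    print(f"Z = {z:.2f}, critical value = {critical_value:.2f}: {verdict}")
```

With α = 0.05, Z = 2.0 falls in the critical region while Z = 1.0 does not; lowering α to 0.01 widens the acceptance region, so Z = 2.0 would no longer lead to rejection.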

Reliable and accurate data analysis requires a good understanding of, and sufficient background in, statistics; otherwise the results may lead us to false decisions.
