
Part 1 Notes AGB Unit1


Biostatistics and Computer Application (AGB-Unit I)
Dr. Rojan P. M.
Assistant Professor, AGB

Topics covered
1. Introduction and importance of statistics and biostatistics
2. Parameter, statistic and observation
3. Sampling methods
4. Classification and tabulation of data
5. Graphical and diagrammatic representation of data

INTRODUCTION AND IMPORTANCE OF STATISTICS AND BIOSTATISTICS


Statistics
'Statistics is the study of statistical methods and procedures involving collection, classification,
presentation, analysis and interpretation of numerical data, and to make inference from it'. (According
to Croxton and Cowden)

Derivation of Term Statistics


The term 'statistics' is derived from the Latin word 'status', the Italian word 'statistica', the
French word 'statistique', or the German word 'statistik'. All these words refer to a political state or
a government.

Biostatistics
'Biostatistics is the application of statistical methods to a wide variety of fields of biology or life
sciences including human biology, medicine, public health, agriculture, veterinary, microbiology and
genetics.' Biostatistics is also called biometry, literally meaning biological measurement (Greek origin,
bios = life + metron = measure).

• Francis Galton (1822-1911), first cousin of Charles Darwin, is called the 'Father of Biostatistics and
Eugenics'.
• Karl Pearson (1857-1936) laid the foundation for the 'descriptive and correlational statistics'. He also
emphasised that the whole doctrine of heredity rests on statistical basis.
• The term biometry was coined by W.F.R. Weldon (1860-1906), a zoologist at University College,
London.
• Ronald A. Fisher (1890-1962) was a dominant figure in statistics and biometry.

Limitations of Statistical Methods


• Statistical methods cannot be applied to all kinds of phenomena and cannot answer all our doubts.
They have certain limitations.
• Statistical laws are not exact laws, like mathematical or chemical laws. They are based on averages.
• Statistical data can be treated as approximations or estimates and not as precise measurements.
• Statistical methods deal with aggregates of facts and not with individual facts.
• They are derived from a majority of cases and are not true for every individual. Thus statistical
inferences are uncertain.
• Statistics is only a tool and not an end in itself. It cannot help in formulating policy prescriptions. It
merely helps to throw light on the phenomenon.
• The greatest limitation of statistics is that only one who has a sound knowledge of statistical methods
can efficiently handle statistical data. Persons with poor expertise can, knowingly or unknowingly,
draw faulty conclusions.
• Manipulation of figures can lead to wrong conclusions.
• Statistics neither proves nor disproves anything.

Part I – Page 1
DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS
The two branches of statistics are Descriptive Statistics and Inferential Statistics.
Descriptive Statistics
Descriptive statistics is the study of statistical procedures that deal with the collection,
organisation, graphical representation and processing or summarisation of data to make it informative
and comprehensible.
There are two types of descriptive statistics:
• Measures of central tendency, e.g., mean, median and mode.
• Measures of spread, e.g., range, variance and standard deviation.

Inferential Statistics
Inferential statistics involves those statistical procedures which are used to draw an inference
about the conditions or characteristics of a large population by studying attributes of some small
samples drawn from that population randomly. The inference is considered as a generalisation about
the large population.
An example of inferential statistics is testing the efficacy of a new antihypertensive or a new cancer
drug, where the physician has only a limited number of patients on whom to assess the efficacy of the drug.

PARAMETER, STATISTIC AND OBSERVATION


Population
Biological definition of population is the totality of individuals of a given species per given time
in a given area. In genetics, a population means a group of all the individuals of a species that
interbreed. But a statistical population refers to any well-defined group of individuals who are being
studied or ‘the total number of observations of a particular type about which inferences are to be
made’. More simply, a group of study elements is called population.
For example, all university students in KVASU could be a population.

Parameter
Parameter is any numerical property, characteristic or fact that is descriptive of a population.
Usually all the characteristics of a population can be specified in terms of a few parameters.

Sample
A sample is a small group or subset selected from a population that represents the attributes of the
entire population and can be used for investigating its properties. Say researchers want to find out
some specific feature of a population, but it is not possible to study every single individual in the
population. They select a small number of individuals from the population, study them and use that
information to draw conclusions about the whole population. This selected subset is called a sample.
For example, suppose we want to study the average height of students studying at KVASU. It is not
necessary to measure the height of all the students. Instead, we can take a small
representative sample of a few students from different batches, measure them and report the results.

Statistic
Statistic is defined to represent any descriptive characteristic or measure obtained from sample
data. In other words, statistic is a function of sample observations.
Example: Average height of students obtained from sample observations.

Data or Statistical Data


Data are a set of facts expressed in quantitative or qualitative form in a manner suitable for
presentation and analysis.
Accuracy: Accuracy is the closeness of a measured or computed value to its true value.
Precision: Precision is the closeness of repeated measurements of the same quantity to one another.
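
The distinction can be made concrete with a small sketch. The readings below are hypothetical, and using the distance of the average from the true value for accuracy and the standard deviation of repeated readings for precision is one reasonable choice of summary, not the only one.

```python
from statistics import mean, pstdev

def accuracy_error(readings, true_value):
    """Distance between the average reading and the true value (smaller = more accurate)."""
    return abs(mean(readings) - true_value)

def precision_spread(readings):
    """Standard deviation of repeated readings (smaller = more precise)."""
    return pstdev(readings)

# Hypothetical repeated weighings (kg) of an animal whose true weight is 50 kg.
true_weight = 50.0
scale_a = [52.1, 52.0, 52.2, 51.9]  # readings agree closely but sit far from 50
scale_b = [48.0, 52.5, 49.5, 50.0]  # readings scatter widely but centre on 50

for name, readings in (("scale A", scale_a), ("scale B", scale_b)):
    print(name,
          "accuracy error:", round(accuracy_error(readings, true_weight), 2),
          "spread:", round(precision_spread(readings), 2))
```

In this made-up data, scale A is precise but not accurate, while scale B is accurate on average but not precise.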

VARIABLE
Variable is a quality or a characteristic which is being observed or measured and can vary from
one individual to another. For example, animals of same species may differ in their length, weight, age,
sex, etc. Variables may be of two types:

1. Quantitative variable is a characteristic which can be measured on a scale in some appropriate
units, e.g., measurement of age, weight, height, etc., or counting the number of persons, places or
things belonging to various categories. It is further divided into two types:
i) Discontinuous or discrete random variable is one which is incapable of taking all possible
values. It is characterised by gaps or interruptions in the values or absence of values in between. For
example, the number of rooms in a house or the number of persons in the family can take only the
integral values such as 2, 3, 4, etc. Here a count of 2.5 is not possible.
ii) Continuous random variable is one which can take any numerical value within a certain
range, i.e., it does not possess any gaps or interruptions. The height of a child at various ages when he
grows from 120 cm to 150 cm, assumes all possible values within the limit even in fractions.

2. Qualitative variable is one which is immeasurable and inexpressible in magnitude. It can be
expressed in qualities which are called attributes, e.g., colour of flowers, texture of leaves, etc.
Conventionally, the quantitative variables are termed as variables and the qualitative variables
are termed as attributes.

Random variable: Whenever the height, weight, or age of an individual is determined, the
result is referred to as a value of the respective variable. When the values obtained arise as a result of
chance factors, the variable is called a random variable. Values obtained from measurement
procedures are described as observations. Random variables are of the two types described above:
discrete and continuous.

A variable can be dependent or independent. A variable is called a dependent variable when it
is an outcome of interest; it changes in response to some intervention. The independent variable is
the variable which is itself being manipulated.

When observations within the data for a particular variable do not have the same value, they
exhibit variation. Variation between observations can be due to many factors. For example, the
variation in human height is hereditary but it may also be due to diet or disease.

Variables
    Quantitative
        Discrete (Interval, Ratio)
        Continuous (Interval, Ratio)
    Qualitative
        Nominal
        Ordinal

STATISTICAL DATA
Any record, descriptive or qualitative account or symbolic representation of any attribute,
event or process expressed in quantitative form is considered as data. The scientific records of the
results or observations of an experiment or a series of experiments are also called data.

Sources of statistical data
The main sources for the collection of biological data are:
1. Experiments 2. Surveys 3. Records

Experiments
Experiments are performed in the fields, laboratories (biochemistry, physiology,
pharmacology), and in hospitals. Data is collected with a specific objective by one or more workers and
is compiled for analysis and conclusion. The data is made available to various scientific workers through
theses and scientific papers published in scientific journals.

Surveys
Surveys are carried out for epidemiological data or animal biometric (height, length, girth) data
collection from the field studies, carried out by trained teams.

Records
Data collected through experiments or surveys are maintained in registers or books and
journals over a long period of time to be consulted or referred for future use for various purposes.
Records provide a secondary source of data, while experiments and surveys form primary sources.

Primary and secondary data


Depending on the source, the data may be primary or secondary.

Primary Data
The data collected by the investigator from personal experimental studies or measurements
are called primary data. They are original and raw.

Secondary Data
The data obtained from some secondary source such as journals, magazines, newspapers or
research papers, etc. are known as secondary data. These data have already been collected by some
other person and organised by statistical procedures. These are in finished form and ready to analyse
and interpret. Although secondary data save time and money, they may not be very accurate.

Qualitative and Quantitative Data


Qualitative Data
Qualitative data include observations that are not numerical but descriptive. They represent
the number of individuals with the same characteristic or attribute and not the measurement of the
attribute. It means qualitative data have only one variable, i.e., the number of individuals (=frequency).
There is no magnitude or size of the characteristic as the same cannot be measured. Persons with same
attribute are counted. They form a group or class such as young, old, infants, healthy, patients, treated,
nontreated, or drug or placebo, etc. These characteristics or variables cannot be measured but the
frequency of persons of each type is determined and represented.

Quantitative Data or Numerical Data


Quantitative data are represented in numbers. They have both frequency as well as magnitude.
Each attribute studied has two variables: the characteristic and the frequency. Characteristic is a feature
like height, number of RBCs, or amount of haemoglobin per mL blood. Frequency is the number of
persons with same characteristic and in the same range. The quantification of data leads to numerical
measurements which are interpreted in a meaningful way.

THE LEVELS OF MEASUREMENTS

Generally, measurements of statistical data can be represented on four different levels or
scales. These are Nominal scales, Ordinal scales, Interval scales and Ratio scales.

Nominal: Nominal scale measure is used to identify by name or label of categories. The order,
distance, and ratio of measurement are not meaningful, and thus can be used only for identification
by names or labels of various categories. Nominal data measure qualitative characteristics of data
expressed by various categories. Examples: gender (male or female), disease category (acute or
chronic), place of residence (rural or urban), etc.

Ordinal: Ordinal data satisfy both identification and order criteria, but if we consider interval
and ratio between measurements, then there is no meaningful interpretation in case of ordinal data.
Ordinal Variables may be qualitative or quantitative. Examples: Educational level of respondent (no
schooling, primary incomplete, primary complete, secondary incomplete, secondary complete,
college, or higher), status of a disease (severe, moderate, normal), etc.

Interval: Interval data have better properties than the nominal and ordinal data. In addition to
identification and order, interval data possess the additional property that the difference between
interval scale measurements is meaningful. However, there is a limitation of the interval data due to
the fact that there is no true starting point (zero) in case of interval scale data. Examples: temperature,
IQ level, ranking an experience, score in a competition, etc. If we consider temperature data, then the
zero temperature is arbitrary and does not mean the absence of any temperature implying that zero
temperature is not absolute zero. Hence, any ratio between two values of temperature by Celsius or
Fahrenheit scales is not meaningful.

Ratio: Ratio data are the highest level of measurement with optimum properties. Ratios
between measurements are meaningful because there is a true starting point (zero). The ratio scale
satisfies all four criteria (identification, order, meaningful differences and an absolute zero), implying
that not only the difference between two values but also their ratio is meaningful. Examples: age, height, weight, etc.
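
The point about ratios on interval versus ratio scales can be checked numerically. The sketch below uses hypothetical temperatures and weights: the ratio of two temperatures changes when the unit changes from Celsius to Fahrenheit, whereas the ratio of two weights stays the same when the unit changes from kilograms to grams.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

def kg_to_g(kg):
    """Convert kilograms to grams."""
    return kg * 1000

# Interval scale: the zero point is arbitrary, so ratios depend on the unit.
t_low, t_high = 10.0, 20.0  # hypothetical temperatures in degrees Celsius
ratio_celsius = t_high / t_low
ratio_fahrenheit = celsius_to_fahrenheit(t_high) / celsius_to_fahrenheit(t_low)

# Ratio scale: zero means "none", so ratios survive a change of unit.
w_low, w_high = 20.0, 40.0  # hypothetical weights in kg
ratio_kg = w_high / w_low
ratio_g = kg_to_g(w_high) / kg_to_g(w_low)

print("temperature ratio:", ratio_celsius, "vs", round(ratio_fahrenheit, 2))
print("weight ratio:", ratio_kg, "vs", ratio_g)
```

Saying that 20 °C is "twice as hot" as 10 °C therefore has no meaning, while saying that 40 kg is twice as heavy as 20 kg does.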

SAMPLING DISTRIBUTION

Statistic: A measure computed from the data of a sample (Eg: Sample Mean, x̄)
Parameter: A measure calculated from the data of a population (Eg: Population Mean, μ)
Sampling distribution is the distribution of sample means. As sample size ‘n’ increases, the
shape of the distribution of the sample means obtained from any population (irrespective of
population distribution) with mean μ and standard deviation σ will approach a normal distribution
with mean μ and standard deviation σ/√n (Central limit theorem)

Standard deviation of sample mean is known as standard error. (σ/√n)


The confidence limits of the population mean, based on a sample of size n, are as follows:
95% confidence limits of population mean = x̄ ± 1.96 * σ/√n
99% confidence limits of population mean = x̄ ± 2.58 * σ/√n
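
As an illustration, the standard error and confidence limits above can be computed directly, and the σ/√n result can be checked with a small simulation. All numbers below (sample mean 30, σ = 5, n = 25, and the simulated uniform population) are hypothetical and chosen only for this sketch.

```python
import math
import random
from statistics import mean, pstdev

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def confidence_limits(sample_mean, sigma, n, z=1.96):
    """Lower and upper limits: sample_mean +/- z * sigma / sqrt(n)."""
    margin = z * standard_error(sigma, n)
    return sample_mean - margin, sample_mean + margin

# Worked example with assumed values: x-bar = 30, sigma = 5, n = 25.
lower, upper = confidence_limits(sample_mean=30.0, sigma=5.0, n=25)
print("95% limits:", round(lower, 2), "to", round(upper, 2))

# Simulation: draw many samples from a clearly non-normal (uniform) population
# and compare the spread of the sample means with sigma / sqrt(n).
rng = random.Random(1)
population = [rng.uniform(0, 100) for _ in range(10_000)]
sigma = pstdev(population)
n = 36
sample_means = [mean(rng.sample(population, n)) for _ in range(2_000)]
observed_se = pstdev(sample_means)
print("observed:", round(observed_se, 2), "theory:", round(standard_error(sigma, n), 2))
```

For the 99% limits, the same function can be called with z=2.58.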

ESTIMATION
Generally parameters are unknown; they have to be estimated by the corresponding statistics.
Estimate: The value of a population parameter obtained from sample data
Estimator: The method (rule) used to estimate the value of a population parameter

Types of Estimate
• Point estimate: It is the single value which is used to estimate the population parameter.
• Interval estimate: It is an interval within which the population parameter is expected to lie. It is
called the confidence interval.
Properties of estimator
• Unbiasedness
An estimator is said to be unbiased if its expected value is identical with the population
parameter being estimated.
• Consistency
As the sample size increases, the estimates (produced by the estimator)
"converge" to the true value.
• Minimum variance
Among competing estimators, the one whose estimates vary least from sample to sample is preferred.
Best estimator
If an estimator is unbiased and consistent, it is called the best estimator.
Efficient estimator
An unbiased, minimum variance estimator is called the efficient estimator.
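
Consistency can be illustrated with a short simulation. The sketch below assumes a hypothetical population of 5,000 heights and uses the sample mean as the estimator of the population mean; the average size of the estimation error shrinks as the sample size grows.

```python
import random
from statistics import fmean

def mean_abs_error(population, n, repeats, rng):
    """Average absolute gap between the sample mean and the population mean."""
    mu = fmean(population)
    errors = [abs(fmean(rng.sample(population, n)) - mu) for _ in range(repeats)]
    return fmean(errors)

rng = random.Random(42)
# Hypothetical population: 5,000 heights (cm) scattered around 150.
population = [rng.gauss(150, 10) for _ in range(5000)]

errors_by_n = {n: mean_abs_error(population, n, repeats=200, rng=rng)
               for n in (10, 100, 1000)}
for n, err in errors_by_n.items():
    print("n =", n, "average error =", round(err, 3))
```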

THEORY OF SAMPLING
Census and Sampling are the two methods by which any required information or data may be
collected.

CENSUS METHOD
The census method is the complete enumeration of data from each and every unit of the population or universe.
(Refer: Livestock census)
Merits
• Data are obtained from each and every unit
• More accurate
Demerits
• Difficult if the population is very large
• Requires more effort, money, time, etc.

SAMPLING METHOD
Sampling is learning about the population on the basis of samples drawn from it. A
portion of the population is known as a sample. The process of selecting a sample is sampling. For statistical
inferences about a population from a sample, it is essential that the samples are representative of the
population.
Merits
• Saves time
• Less cost
Demerits
• It gives only an estimate of the population parameter

SAMPLING METHODS

Random/Probability
1. Simple random sampling or unrestricted random sampling
2. Restricted random samples
   a. Stratified sampling
   b. Systematic sampling
   c. Cluster sampling

Non-random/Non-probability
1. Judgment sampling
2. Quota sampling
3. Convenience sampling

i. Random/Probability Sampling Methods

Simple Random Sampling


Simple random sampling is the technique of drawing samples from a population in such a way that
each unit of the population has an equal and independent chance of being included in the sample. It is
well suited when the population is small, homogeneous and readily available. It is also called unrestricted random sampling.
Two methods
• Lottery method
• Method using random number tables, e.g., Tippett's random number table, Fisher and Yates random
number table, Kendall and Smith random number table
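
On a computer, a pseudo-random number generator plays the role of the lottery or the random number table. A minimal sketch, assuming a hypothetical list of 100 student identifiers as the population:

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(units, n)

# Hypothetical population of 100 students.
students = [f"student_{i}" for i in range(1, 101)]
sample = simple_random_sample(students, 10, seed=7)
print(sample)
```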

Stratified Random Sampling

The population of size 'N' is subdivided into a definite number of non-overlapping and distinct
sub-populations of sizes N1, N2, …, Nk such that N1 + N2 + … + Nk = N.
The procedure of dividing the population into distinct sub-populations is called stratification and each
sub-population is called a stratum. While forming strata, we ensure that the units within each stratum
are more homogeneous with respect to the character under study.
Within each stratum of size Ni, a random sample of size ni is drawn such that n1 + n2 + … + nk = n, where n is the
size of the sample.
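
The notes do not say how the stratum sample sizes ni are chosen. The sketch below assumes proportional allocation (ni proportional to Ni), which is one common choice; the strata and their sizes are hypothetical.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample from every stratum.

    strata: dict mapping stratum name -> list of units.
    Proportional allocation is assumed: n_i = n * N_i / N, rounded.
    Rounding can make the n_i sum to slightly more or less than n.
    """
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())  # N
    sample = {}
    for name, units in strata.items():
        n_i = round(n * len(units) / total)
        sample[name] = rng.sample(units, n_i)
    return sample

# Hypothetical strata: three batches of students of sizes 50, 30 and 20.
strata = {
    "batch_1": [f"b1_{i}" for i in range(50)],
    "batch_2": [f"b2_{i}" for i in range(30)],
    "batch_3": [f"b3_{i}" for i in range(20)],
}
sample = stratified_sample(strata, n=10, seed=3)
print({name: len(units) for name, units in sample.items()})
```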

Systematic Sampling

Systematic sampling consists of selecting only the 1st unit at random, the rest being selected
according to some predetermined pattern involving regular spacing of units (e.g., randomly selecting the
first item, then selecting every 10th item).
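
A minimal sketch of this scheme, assuming the units are held in an ordered list and that the random start is taken among the first k units:

```python
import random

def systematic_sample(units, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0 .. k-1
    return units[start::k]

# Hypothetical population: animals tagged 1 to 100, sampling every 10th animal.
animals = list(range(1, 101))
sample = systematic_sample(animals, k=10, seed=5)
print(sample)
```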

Cluster sampling
The total population is divided, depending on the problem under study, into some recognizable
subdivisions called clusters, and a simple random sample of these clusters is drawn.
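
One common form of cluster sampling selects whole clusters at random and then studies every unit in the chosen clusters. The sketch below assumes that one-stage form; the villages and animals are hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Select whole clusters at random and keep every unit in each chosen cluster.

    clusters: dict mapping cluster name -> list of units.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters: goats grouped by village.
villages = {
    "village_A": ["goat_1", "goat_2", "goat_3"],
    "village_B": ["goat_4", "goat_5"],
    "village_C": ["goat_6", "goat_7", "goat_8", "goat_9"],
    "village_D": ["goat_10"],
}
sample = cluster_sample(villages, num_clusters=2, seed=11)
print(sample)
```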

ii. Non-random Methods

Purposive/Deliberate/Subjective/Judgment sampling
It is the one in which the investigator selects the samples exclusively at his or her discretion.
Convenience sampling
Here the investigator chooses the samples at his or her convenience or ease of access.
Quota sampling
It is a type of judgment sampling wherein quotas are set up according to some specified characteristics.
CLASSIFICATION AND TABULATION OF DATA
The data obtained by the investigator are unorganised and do not give much information. This
data set has to be rearranged. The most elementary rearrangement of data is called an array. It is 'an
arrangement of the observations according to size/magnitude'. The observed values are
arranged in order of magnitude, and the rearranged data are termed arrayed data. An
arrangement of observed values in order of magnitude from the smallest value to the largest value is called
an ordered array.
Objectives of Classification of Data
The objectives of compilation and classification of data are:
• to make the data simple and meaningful, and to leave a lasting impression.
• to make the data easily accessible, easily understandable and fit for proper use.
• to present data in condensed form by summation of items so that it is easy to draw statistical inferences.
• to ensure easy detection of errors and omissions in the data.
• to help define the problem and suggest a solution too.
• to ensure quick comparison and easy study of data.
Methods of classification of data
1. Classification by Space (Geographical Data)
In this classification, data is classified by location of occurrence, i.e., according to area or region. The
data is organised in sets of categories in the order of their geographical location. An example is
the statewise production of fish in India.
2. Classification by Time (Chronological Data)
In this classification, the data is classified by the time of occurrence of the observations or occurrence
of an event. The categories are arranged in chronological order. For example, data of egg production
of a poultry farm for the last five years.

3. Classification by Attribute (Qualitative Data)


In this classification data is collected and classified on the basis of some qualitative characteristics
termed as attributes. The qualitative attributes cannot be measured. Some examples are sex, colour,
health condition. The number of males and females in the fish population of a species from a pond
represents qualitative data. Here sex is an attribute for classification. The qualitative data can be of
two types:
Simple or Dichotomous classification: In this type of classification, the population is divided
just into two categories based only on one attribute: 1. possessing the attribute and 2. without
the attribute.
Manifold classification: In this type of classification two or more attributes are considered
simultaneously. Therefore, data is classified into several classes, depending on the number of
attributes.
4. Classification by Size (Quantitative Data)
The data pertaining to such characteristics that are represented by quantitative measurements and
are expressed by numerical values are called quantitative data. The classification of quantitative
attribute is described as quantitative classification or classification of variable.
Differences between variable and attribute
A measurable characteristic which can be expressed numerically in terms of some unit is called a
variable, e.g., height, weight, income, number of plants in a field, number of dogs in a village or number
of outdoor patients visiting a hospital daily. Variables represent quantitative data. A non-measurable
characteristic, which cannot be measured or represented numerically, is called an attribute.
Attributes are qualitative in nature.
Differences between classification and tabulation
The process of arranging and presenting the primary or raw data in a systematic way is called
classification of data. This is a prelude to tabulation.
Tabulation is presentation of statistical data in the form of tables. A table is a systematic organisation
of statistical data in columns and rows. A properly constructed and adequately labelled table can be
read and understood independently without consulting the accompanying text. Therefore, tables are
designed in such a way that these enable the reader to grasp the information that tables intend to
convey.
FREQUENCY DISTRIBUTION
The classification of data according to class interval is called frequency distribution. When the
data are grouped into classes of appropriate interval, showing the number in each class, we get the
frequency distribution e.g. the frequency table showing the distribution of goats in different body
weight classes.
Body weight    Frequency
10 - 20        16
20 - 30        24
30 - 40        25
In constructing a frequency distribution, it is desirable to consider the following basic rules or criteria:
• Number of classes: As a general rule, the number of classes should be about 15, never more
than 30 and not less than 6. However, the number of classes in a frequency distribution is not
fixed. An ideal number of classes for any frequency distribution would be that which gives the
maximum information about the data.
• Class intervals: The class intervals depend on the range of the data and the number of classes.
The range is the difference between the highest and the lowest value of the variable. The class
width is equal to the range divided by the number of classes. The following formula may be
used to estimate the class interval:
i = (L - S) / c
where i = class interval or class width
L = largest value
S = smallest value
c = number of classes
Here, the number of classes can be decided with the help of Sturges' rule. According to
Sturges, the number of classes would be:
c = 1 + 3.322 log N
where N = total number of observations
log = logarithm (base 10) of the number
or
Yule's rule
c = 2.5 × N^(1/4)
c = Number of required classes
N = Number of observations
For example, as per Sturges' rule, if 100 observations are being studied, the number of classes
will be: c = 1 + (3.322 x 2) = 1 + 6.644 = 7.644 or 8
Similarly, if 500 observations are being considered, the number of classes will be:
c = 1 + (3.322 x 2.699) = 1 + 8.966 = 9.966 or 10
For 1,000 observations, the number of classes would be:
c = 1 + (3.322 x 3) = 1 + 9.966 = 10.966 or 11
• The class width should not be too broad or too narrow (range between 2 - 5 or 2 to 10).
• The class width should be the same for all classes
• The headings and stubs should be clear
• Groups must be tabulated either in descending or ascending order of magnitude
• If certain observations are not included, the reasons for the same must be given
• The range of classes should contain the entire data and the classes should be continuous
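
The rules for the number of classes and the class width can be checked with a few lines of code. The sketch below follows the formulas given above (Sturges' rule with 3.322 and a base-10 logarithm, Yule's rule, and class width = (largest value - smallest value) / number of classes); the function names are chosen only for this illustration.

```python
import math

def sturges_classes(n):
    """Number of classes by Sturges' rule: c = 1 + 3.322 log10(N)."""
    return 1 + 3.322 * math.log10(n)

def yule_classes(n):
    """Number of classes by Yule's rule: c = 2.5 * N^(1/4)."""
    return 2.5 * n ** 0.25

def class_width(largest, smallest, num_classes):
    """Class interval i = (L - S) / c."""
    return (largest - smallest) / num_classes

for n in (100, 500, 1000):
    c = sturges_classes(n)
    print("N =", n, "Sturges:", round(c, 3), "rounded to", round(c))

print("Yule for N = 100:", round(yule_classes(100), 3))
print("Class width for L = 40, S = 10, c = 3:", class_width(40, 10, 3))
```

The printed Sturges values reproduce the three worked examples above (8, 10 and 11 classes).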

Class and Class Interval

When data consisting of a large number of observations are divided into groups that have
defined upper and lower limits, each group is called a class. The size of the class is called the class interval.
For example, in the body weight table above, 10-20, 20-30 and 30-40 are classes of the series 10 to 40.

There are two ways of classifying the data on the basis of class intervals:

(i) Exclusive method: In the exclusive method, the upper limit of a class is the lower limit of the
succeeding class. This method ensures the continuity of the data (left-hand table).

Exclusive method                      Inclusive method
Body weight            Frequency      Body weight    Frequency
10 but less than 20    16             10 - 19        16
20 but less than 30    24             20 - 29        24
30 but less than 40    25             30 - 39        25
40 but less than 50    10             40 - 49        10
50 but less than 60    25             50 - 59        25

(ii) Inclusive method: Under the inclusive method, the upper limit of one class is included in that class
(right-hand table). For continuous variables, the exclusive method should be used; in the case of
discrete variables, it is possible to use the inclusive method.
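
The difference between the two methods shows up when a value falls exactly on a class limit or between two classes. The sketch below uses the class limits from the tables above; the function names are illustrative only, and the inclusive version assumes whole-number data.

```python
def exclusive_class(value, lower_limits, width):
    """Return (lower, upper) such that lower <= value < upper, or None."""
    for lower in lower_limits:
        if lower <= value < lower + width:
            return (lower, lower + width)
    return None

def inclusive_class(value, lower_limits, width):
    """Return (lower, upper) such that lower <= value <= upper, or None."""
    for lower in lower_limits:
        upper = lower + width - 1  # e.g. 10 - 19 for a width of 10
        if lower <= value <= upper:
            return (lower, upper)
    return None

lower_limits = [10, 20, 30, 40, 50]

print(exclusive_class(20, lower_limits, 10))    # a weight of exactly 20
print(exclusive_class(19.5, lower_limits, 10))  # a fractional weight
print(inclusive_class(19, lower_limits, 10))    # a whole-number value
print(inclusive_class(19.5, lower_limits, 10))  # falls in the gap between 19 and 20
```

The last call returns None, which is why the inclusive method is reserved for discrete variables: a continuous value such as 19.5 has no class to go to.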

Class Limits
The two ends of a class are called class limits. The smaller value of the class represents the lower class
limit and the higher value represents the upper class limit. For example, in the class 91-94,
91 is the lower class limit and 94 is the upper class limit.

Class Boundary
The class boundaries are the limits up to which the two limits of each class or group may be extended
to fill up the gap existing between the classes. For example, in the class 91-94, the lower class limit
can be extended to 90.5, which is called the lower class boundary. Similarly, its upper class limit can be
extended to 94.5, which represents the upper class boundary.

Class Intervals    Class Boundaries    Class Width           Mid value
91 - 94            90.5 - 94.5         94.5-90.5 = 4         (91+94)/2 = 92.5
95 - 98            94.5 - 98.5         98.5-94.5 = 4         (95+98)/2 = 96.5
99 - 102           98.5 - 102.5        102.5-98.5 = 4        (99+102)/2 = 100.5
103 - 106          102.5 - 106.5       106.5-102.5 = 4       (103+106)/2 = 104.5
107 - 110          106.5 - 110.5       110.5-106.5 = 4       (107+110)/2 = 108.5
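
The rows of this table can be reproduced with a short function. The sketch assumes, as in the table, that consecutive classes are separated by a gap of 1, so each limit is extended by half of that gap to obtain the boundary.

```python
def class_summary(lower_limit, upper_limit, gap=1):
    """Boundaries, width and mid value of a class given its limits.

    gap is the distance between the upper limit of one class and the
    lower limit of the next (1 for classes such as 91-94, 95-98).
    """
    lower_boundary = lower_limit - gap / 2
    upper_boundary = upper_limit + gap / 2
    return {
        "boundaries": (lower_boundary, upper_boundary),
        "width": upper_boundary - lower_boundary,
        "mid_value": (lower_limit + upper_limit) / 2,
    }

for limits in [(91, 94), (95, 98), (99, 102), (103, 106), (107, 110)]:
    print(limits, class_summary(*limits))
```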

Class Width or Class Magnitude
The difference between the upper and lower class boundaries is described as class magnitude. It is also
called class size, class range or class width. The class width can be calculated by the formula:
Class width or Class range = Largest value (or Upper class boundary) – Smallest value (or Lower class boundary)
Mid Value of Class
The value just at the middle of the class is called the mid value of the class. It is also known as the mid point
or central value. It is calculated as the arithmetic mean of the upper class limit and lower class limit of
a class. For example, the mid value of the class 91-94 will be 92.5. The formula used for calculating
mid value is:
Mid value = (Lower class limit + Upper class limit) / 2
The arrangement of the frequencies of a variable and their presentation in defined groups is called a
frequency table. The number of times a value occurs in a series is called the frequency of that value of
the variable.

Procedure for preparation of a frequency table


• The data given should be arranged in ascending or descending order.
• The number of observations (N) in the given data is to be calculated.
• The number of classes required is to be calculated using Sturges' or Yule's rule
• The class width is to be calculated.
• The class frequencies are then calculated using tally mark method and table is prepared
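
The steps above can be sketched as a small function. The body weights are hypothetical, the classes are formed by the exclusive method, and the class width is rounded up to the next whole number above (L - S) / c so that the largest value still falls inside the last class; that rounding is an assumption of this sketch, not a rule from the notes.

```python
import math

def frequency_table(data, num_classes=None):
    """Build a frequency table of exclusive classes: (lower, upper) -> count."""
    values = sorted(data)                      # step 1: arrange in ascending order
    n = len(values)                            # step 2: number of observations N
    if num_classes is None:                    # step 3: Sturges' rule, rounded
        num_classes = round(1 + 3.322 * math.log10(n))
    smallest, largest = values[0], values[-1]
    # step 4: class width, pushed just above (L - S) / c (assumption, see lead-in)
    width = math.floor((largest - smallest) / num_classes) + 1
    counts = [0] * num_classes                 # step 5: tally each observation
    for v in values:
        counts[int((v - smallest) // width)] += 1
    return [((smallest + i * width, smallest + (i + 1) * width), counts[i])
            for i in range(num_classes)]

# Hypothetical body weights of 16 goats.
weights = [12, 15, 18, 21, 22, 25, 27, 29, 31, 33, 34, 36, 38, 39, 41, 44]
table = frequency_table(weights)
for (lower, upper), freq in table:
    print(f"{lower} but less than {upper}: {freq}")
```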

PARTS OF A TABLE
A table should essentially contain seven parts namely:
1. Table number: When a book or report contains more than one table, each table must have a
number.
2. Title of the table: Every table must have a suitable heading. The heading should be short, clear and
convey the purpose of the table.
3. Captions and stubs: Captions refer to the vertical column headings while stubs refer to the
horizontal row headings.
4. Head notes: A head note is a statement given below the title which clarifies the contents of the table. It
explains the entire table or main parts of it. For example, the milk yield of the state in different years is
usually expressed in a head note as metric tonnes, or the cattle population of the state in millions.
5. Body: The figures that are to be presented to the readers are called the body of the table, which must
contain subtotals and grand totals.
6. Source: When secondary data are presented in a table, the source of the data must be
given. The source should give the name of the book, page number, table number, etc., from which the
data have been collected.
7. Foot note: A footnote is a pointer; it tells the reader that whatever bit of text they are reading
requires additional information to make complete sense. For example in a table giving information on
the actual milk production of the state for the years from 2000 to 2018 , if the projected value is given
only for 2018, it needs to be marked as foot note as "projected value" which will be mentioned at the
bottom of the table.

GRAPHICAL AND DIAGRAMMATIC REPRESENTATION OF DATA
One of the most convincing and appealing ways in which statistical results may be presented is
through diagrams and graphs. There are numerous ways in which statistical data may be displayed
pictorially such as different types of diagrams, graphs and maps.

Difference between diagrams and graphs


For constructing a graph we generally make use of graph paper, whereas a diagram is generally
constructed on plain paper. In other words, a graph represents a mathematical relationship between
two variables whereas a diagram does not.

DIAGRAMMATIC REPRESENTATION OF DATA


Diagrams always help the statistician to visualize the meaning of a numerical complex at a
single glance. A large number of diagrams are used in biostatistical analyses. The important types of
diagrams which are commonly used for presentation of qualitative data are given here.
They are as follows:

1. One-dimensional diagrams
(a) Line diagram
(b) Bar diagram
There are four types of bar diagrams.
(i) Simple bar diagram
(ii) Divided bar diagram
(iii) Percentage bar diagram
(iv) Multiple bar diagram
2. Two-dimensional diagrams or area diagrams
Rectangles, squares and circles (pie diagram)
3. Three-dimensional diagrams or volume diagrams
Cubes, cylinders and spheres
4. Pictograms and cartograms

LINE DIAGRAM

It is the simplest type of diagram. The frequencies of a discrete variable can be presented by a
line diagram. The variable is taken on the x-axis, and the frequencies of the observations on the
y-axis. Straight lines are drawn whose lengths are proportional to the frequencies.

BAR DIAGRAM

Bar diagrams are commonly used in practice to represent statistical data. They are also known as
one-dimensional diagrams because only the length of the bar is important, and not the width. If
rectangular bars of equal width are constructed in place of the straight lines of a line diagram, the
representation is called a bar diagram. In the case of a large number of items, line diagrams may be
drawn instead of bars.
The following points should be taken into consideration while constructing a bar diagram.
(i) They may be in the shape of horizontal or vertical bars.
(ii) The width of the bars should be uniform throughout the diagram.
(iii) The gap between one and the other bar should be uniform throughout.
Bar diagrams can be of the following types:
Simple bar diagram: A simple bar diagram is used to represent only one variable. As one bar represents
only one figure, there are as many bars as the number of figures.

Divided bar diagram: In a divided bar diagram, each bar represents a total frequency and is divided
into segments proportional to its different components.

Percentage bar diagram: In percentage bar diagrams, the length of each bar is kept equal to 100 and
the divisions of the bar correspond to the percentages of the different components. This diagram is
also called a percentage divided bar diagram.
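
The conversion behind a percentage bar diagram can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `percentage_components` and the cost figures are hypothetical and are not taken from these notes.

```python
def percentage_components(components):
    """Convert the component values of one bar into percentages of the bar total."""
    total = sum(components.values())
    if total == 0:
        raise ValueError("the total of the components must be non-zero")
    return {name: 100 * value / total for name, value in components.items()}


# Hypothetical cost components of one farm (arbitrary units)
farm_costs = {"feed": 60, "labour": 25, "medicine": 10, "other": 5}
shares = percentage_components(farm_costs)

# Each share is the length of one division of a bar whose full length is 100
for name, share in shares.items():
    print(f"{name:<10}{share:6.1f} %")
```

Because every bar is rescaled to 100, bars for different farms or years can be compared component by component even when their totals differ.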

Multiple bar diagram: Multiple bar diagrams are preferred whenever a comparison between two or
more related variables is to be made. The technique of simple bar diagrams can be extended to
represent two or more sets of interrelated data in a diagram.

Pie diagram
It is used for percentage distribution. Different components are represented by means of sectors of a
circle. The full circle, which subtends an angle of 360° at the centre, represents the total, and the
angles of the sectors are proportional to the respective values or measurements of the different
components (usually a pie chart is not used for depicting large numbers). This sort of representation
is called a pie chart. A pie chart is also known as a circular chart or sector chart. (See above RHS picture)
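
The sector angles of a pie diagram follow directly from the proportionality rule: angle = (component / total) × 360°. A minimal Python sketch, assuming a hypothetical livestock breakdown (the function name `sector_angles` and the figures are illustrative only):

```python
def sector_angles(components):
    """Return the central angle (in degrees) of each sector of a pie diagram."""
    total = sum(components.values())
    if total == 0:
        raise ValueError("the total of the components must be non-zero")
    return {name: 360 * value / total for name, value in components.items()}


# Hypothetical percentage composition of a livestock population
herd = {"cattle": 50, "buffalo": 25, "goat": 15, "sheep": 10}
angles = sector_angles(herd)

for name, angle in angles.items():
    print(f"{name:<10}{angle:6.1f} degrees")
```

The angles always add up to 360°, which is a quick check on the arithmetic before the sectors are drawn.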

Pictogram:
When statistical data are represented by pictures, the presentation is
more attractive, and such pictures are called pictograms. Pictograms are
diagrams of pictorial or semi-pictorial nature and are drawn in different
sizes according to scale.

Cartogram or map diagram


When numerical facts are shown in the form of maps, they are termed
cartograms. Cartograms are more suitable for geographical data. The
different values on a map can be represented by different colors, by varying
degrees of shading or cross-hatching, by dots of similar size with different
density, or by dots of proportional size, etc.

Diagrams are extremely useful because of the following reasons:


 They give a more attractive representation of data as compared to
figures.
 They simplify the complexity.
 They facilitate comparison of data.
 They have universal utility.
 They give more information.
 They save time and labour.

It will be better for the statistician to present tables for detailed reference, and diagrams for rapid
understanding. Diagrammatic representation has the following limitations:
 Diagrams can give only a limited amount of information because they show approximate values.
 They can be used only for comparative studies.
 Diagrams cannot be analysed further.
 Diagrammatic representation is only useful to the common man. Its utility to an expert is
limited.
GRAPHIC REPRESENTATION OF DATA

Graphic methods enable statisticians to present quantitative data in a simple, clear and
effective manner. An important step in statistical analysis is to prepare a frequency
distribution table, and the graphic representation of the data in that table reveals
relationships that might be overlooked in the table itself. The frequency table of most biological
variables develops a distribution which can be compared with the standard distributions, such as the
normal, binomial and Poisson. A graph is a visual form of representation of statistical data. Comparisons
can be made between two or more phenomena very easily with the help of a graph. The frequency
distribution can be represented graphically in any of the following ways:
(i) Histogram
(ii) Frequency polygon
(iii) Frequency curve
(iv) Cumulative frequency curve
(v) Scatter diagram or dot diagram

It will be convenient in discussing frequency graphs to use the conventional mathematical
terms, ordinate and abscissa. In the construction of frequency graphs, the values of the variable are
measured on the x-axis and the corresponding frequencies on the y-axis. For construction of graphs,
two lines are drawn which cut each other at right angles. The horizontal line is called the abscissa and
the vertical line is called the ordinate. The point where these lines cut each other is called the point of
origin or O. Together, the abscissa and ordinate of a point are called its co-ordinates.
For construction of a graph, the graph paper is divided into four parts, called 'quadrants'.
Both the x and y values are positive in quadrant I. In quadrant II, the x values are negative and the y
values are positive. In quadrant III, both x and y are negative. In quadrant IV, x values are positive and
y values are negative. Usually the first quadrant is used for graphical presentation of statistical data.
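
The sign rules for the four quadrants can be written as a small Python function. This is only a sketch; the name `quadrant` is hypothetical, and points lying exactly on an axis are reported separately because they belong to no quadrant.

```python
def quadrant(x, y):
    """Return the quadrant (I to IV) in which the point (x, y) lies."""
    if x == 0 or y == 0:
        return "on an axis"
    if x > 0 and y > 0:
        return "I"      # x positive, y positive
    if x < 0 and y > 0:
        return "II"     # x negative, y positive
    if x < 0 and y < 0:
        return "III"    # x negative, y negative
    return "IV"         # x positive, y negative


for point in [(3, 5), (-3, 5), (-3, -5), (3, -5)]:
    print(point, "lies in quadrant", quadrant(*point))
```

Frequencies and most biological measurements cannot be negative, which is why such data fall in quadrant I.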

The following essential steps are required in constructing a graph:

 Both the X-axis and the Y-axis should begin at 0.
 The scale and intervals for each axis should match the magnitude of variables to be plotted.
 Each axis must be labelled fully in terms of variable.
 The range of variable should have equally spaced intervals.
 The values on graph proceed from left to right. On the X-axis, the lower numbers should be on
the left and on the Y-axis they should be from bottom towards top.
 The points plotted on the graph are called coordinates. They represent corresponding values
of two variables.
 The points on the graph are marked by × or by ⊙ and never by a dot alone.
 The points marked on graph are joined by a series of straight line segments, by smooth curve
or by a regression line.
 Points on graph represent actual data. But estimates of other values can be obtained from
reading the coordinates on any point on the line. This is called interpolation. The coordinates
outside the range of graph are determined by extending the line of the graph. This technique
is known as extrapolation. Both interpolation and extrapolation give only estimates.
 A small and appropriate unit bar line is selected to represent the statistical data on graph. In
case we have to represent large numbers such as 1000, 2000, 3000, 4000 and so on, they can
be represented on the graph as 1, 2, 3, 4 and so on.
 To plot frequency distributions on graph, the values of variables are shown on X-axis and the
frequencies on Y-axis.
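
Interpolation and extrapolation, as described in the steps above, can be illustrated for the simplest case in which the plotted points are joined by a straight line. The sketch below is in Python; the function name `linear_estimate` and the two data points are hypothetical and serve only as an example.

```python
def linear_estimate(x, point_1, point_2):
    """Estimate y at x from the straight line through two plotted points."""
    (x1, y1), (x2, y2) = point_1, point_2
    if x1 == x2:
        raise ValueError("the two points must have different x values")
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x - x1)


# Hypothetical plotted points: (age in weeks, body weight in kg)
first_point = (2, 10.0)
last_point = (6, 18.0)

interpolated = linear_estimate(4, first_point, last_point)   # inside the plotted range
extrapolated = linear_estimate(8, first_point, last_point)   # outside the plotted range

print("estimate at 4 weeks:", interpolated)
print("estimate at 8 weeks:", extrapolated)
```

The same formula serves both purposes; the extrapolated value is less reliable because it assumes the line keeps its direction beyond the observed data.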

HISTOGRAM
Histogram is the most important method for displaying the frequency distribution. Histogram
is a set of vertical bars whose areas are proportional to the frequencies represented. In constructing
the histogram, the variable should be taken on the horizontal axis (x-axis) and the frequencies
depending on it on the vertical axis (y-axis). Each class is represented by a distance which is always
proportional to its class interval. When all the classes are of equal lengths, the heights of the rectangles
will be proportional to the frequencies of the respective classes. In this way, there are a number of
rectangles each with a class interval distance as its width, and the frequency distance as its height.
A histogram is two-dimensional, where both the length and the width are important, whereas a
bar diagram is one-dimensional, i.e. only the length of the bar is important, and not the width.
The histogram can be constructed in two ways depending upon the class-intervals:
(i) For distributions that have equal class intervals.
(ii) For distributions that have unequal class intervals.
In the first case, if the class intervals are equal, the height of the rectangles will be proportional
to the frequency. In case of unequal class intervals, a correction must be made. One has to take the
smallest class interval into consideration for making suitable adjustments in the frequencies of other
classes. For example, if one class interval is twice as wide as the smallest class interval, the height
of its rectangle is divided by two. On the other hand, if it is three times as wide, we divide the height of
its rectangle by three, and so on. In this way, the frequency remains equal to the area of the bar.
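
The correction for unequal class intervals can be sketched in Python as follows. The function name `adjusted_heights` and the frequency table are hypothetical; the narrowest class is taken as the unit width, which is an assumption of this sketch.

```python
def adjusted_heights(classes):
    """Return the rectangle height of each class, using the narrowest class as unit width.

    classes is a list of (lower limit, upper limit, frequency).
    """
    unit_width = min(upper - lower for lower, upper, _ in classes)
    heights = []
    for lower, upper, frequency in classes:
        width_factor = (upper - lower) / unit_width   # 1 for the narrowest class
        heights.append(frequency / width_factor)      # twice as wide -> divide by two
    return heights


# Hypothetical frequency table with class widths 10, 10, 20 and 30
classes = [(0, 10, 4), (10, 20, 10), (20, 40, 12), (40, 70, 6)]
heights = adjusted_heights(classes)

for (lower, upper, frequency), height in zip(classes, heights):
    print(f"{lower:>3}-{upper:<3} frequency {frequency:>2}  height {height}")
```

Multiplying each height by its width factor gives back the original frequency, so the area of every rectangle still represents the frequency of its class.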

FREQUENCY POLYGON
Frequency distribution can be portrayed graphically by means of a frequency polygon. To
construct a frequency polygon, we mark the frequencies on the vertical axis and the values of the
variable on the horizontal axis as in the case of histogram. A dot is placed above the mid-point of each
class and the height of a given dot corresponds to the frequency of the relevant class interval. By
connecting successive dots with straight lines, the frequency polygon is prepared. A frequency polygon
is simply a line graph that connects the mid-points of the tops of all the bars in a histogram.
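
The points to be joined in a frequency polygon can be computed as below. This is a minimal Python sketch; the function name `polygon_points` and the frequency table are hypothetical.

```python
def polygon_points(classes):
    """Return (class mid-point, frequency) pairs for a frequency polygon.

    classes is a list of (lower limit, upper limit, frequency).
    """
    return [((lower + upper) / 2, frequency) for lower, upper, frequency in classes]


# Hypothetical frequency table with equal class intervals
classes = [(0, 10, 4), (10, 20, 10), (20, 30, 7), (30, 40, 3)]
points = polygon_points(classes)

for mid_point, frequency in points:
    print(f"mid-point {mid_point:5.1f}  frequency {frequency}")
```

Plotting these pairs and joining them in order with straight lines gives the polygon.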

FREQUENCY CURVE

As the number of observations increases and the class intervals are made narrower, the frequency
polygon or histogram will approach more and more the form of a smooth curve. Such a curve is obtained
in the normal distribution of individuals in a large sample or in a population. In a majority of the
biological characters, the frequency distributions approximate to a symmetrical bell-shaped curve known
as the normal curve. The frequency curve is drawn freehand to eliminate, as far as possible, the
accidental variations that might be present in the biological, agricultural and other data. The total
area under the curve should be equal to the area under the original histogram or polygon.
CUMULATIVE FREQUENCY CURVE or OGIVE
It is often desirable to determine the number of observations that fall above or below a
certain value rather than within a given interval. In such cases,
the regular frequency distribution may be converted to a cumulative frequency distribution. A graph
of cumulative frequency distribution is called the ogive (pronounced "oh-jive"). There are two methods
of constructing an ogive, namely:
(i) The "less than" method
(ii) The "more than" method
In the "less than" method, we start with the upper limit of the classes and go on adding the frequencies
(Less than cumulative frequencies plotted against the upper class limits). However, in the case of
"more than" method, we start with the lower limit of classes (More than cumulative frequencies
plotted against the lower class limits). The first method gives a rising curve, whereas the second
method shows a declining curve.
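
Both methods of constructing the ogive can be sketched in Python using running totals. The function names and the frequency table are hypothetical; `itertools.accumulate` from the standard library supplies the cumulative sums.

```python
from itertools import accumulate


def less_than_ogive(classes):
    """Pair each upper class limit with the cumulative frequency below it."""
    upper_limits = [upper for _, upper, _ in classes]
    cumulative = list(accumulate(frequency for _, _, frequency in classes))
    return list(zip(upper_limits, cumulative))


def more_than_ogive(classes):
    """Pair each lower class limit with the cumulative frequency at or above it."""
    lower_limits = [lower for lower, _, _ in classes]
    frequencies = [frequency for _, _, frequency in classes]
    # Accumulate from the last class backwards, then restore the original order
    cumulative = list(accumulate(reversed(frequencies)))[::-1]
    return list(zip(lower_limits, cumulative))


# Hypothetical frequency table: (lower limit, upper limit, frequency)
classes = [(0, 10, 4), (10, 20, 10), (20, 30, 7), (30, 40, 3)]
less_than = less_than_ogive(classes)
more_than = more_than_ogive(classes)

print("less than:", less_than)
print("more than:", more_than)
```

The "less than" totals increase from the first class to the last and the "more than" totals decrease, which is why one curve rises and the other declines.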
Lorenz Curve is a modification of the Ogive when the variables and the cumulative frequencies
are expressed as percentages. It is a graphical method of studying dispersion.
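
One common way of obtaining the points of a Lorenz curve is sketched below in Python. The sketch assumes ungrouped values that are sorted in ascending order, with the cumulative percentage of units on one axis and the cumulative percentage of the total on the other; the function name `lorenz_points` and the yields are hypothetical.

```python
from itertools import accumulate


def lorenz_points(values):
    """Return (cumulative % of units, cumulative % of total) pairs, starting at (0, 0)."""
    ordered = sorted(values)
    total = sum(ordered)
    if total == 0:
        raise ValueError("the total of the values must be non-zero")
    count = len(ordered)
    points = [(0.0, 0.0)]
    for rank, running_total in enumerate(accumulate(ordered), start=1):
        points.append((100 * rank / count, 100 * running_total / total))
    return points


# Hypothetical daily milk yields (litres) of four animals
milk_yields = [2, 4, 6, 8]
points = lorenz_points(milk_yields)

for units_pct, total_pct in points:
    print(f"{units_pct:6.1f} % of animals give {total_pct:6.1f} % of the milk")
```

If every unit had the same value, the points would lie on the diagonal; the further the curve sags below the diagonal, the greater the dispersion.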

SCATTER DIAGRAM OR DOT DIAGRAM


It is prepared in cases in which the values of at least two variables have been related. Of them,
one variable is independent and the other is dependent. The independent variable is regarded as the
cause of the changes in the dependent variable (See above RHS picture).

Advantages of graphic representation


(i) Graphs are the simplest method of presenting data.
(ii) They give an attractive, interesting, and impressive view.
(iii) They make comparison easy.
(iv) They are helpful in ascertaining certain statistical measures.
(v) They save time and labour.

Limitations of graphic representation


Graphic representation of statistical data is a valuable tool for the biometricians. However,
graphic presentation of data has the following limitations:
(i) A graph simply shows tendency and fluctuations; actual values are not known.
(ii) Complete accuracy is not possible on a graph.
(iii) Graphs cannot be quoted in support of some statements.
(iv) Only a few characteristics can be depicted on a graph. However, in the case of many figures, it
is difficult to follow the graph.
(v) Graphic representation may often give misleading impressions.