Statistical Background

• Statistics deals with the collection, presentation, analysis, and interpretation of numerical
data. Modern statistics deals primarily with statistical inference.
• The objective of statistical inference is to make an inference about a population based on the
information obtained from one or more samples.
• One of the most common and useful presentations of a data set is the frequency table and its
corresponding graph, i.e., the histogram.
• A frequency table records how often the observed values fall within certain intervals or classes.
The histogram presents the same frequency table graphically.
• The important features of most histograms can be captured by a few summary statistics such as
measures of location, measures of spread and measures of shape.
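
As a rough illustration of the frequency table described above, the sketch below counts how many values fall in each class and prints a text histogram. The data, the class limits and the helper name `frequency_table` are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical observations (not from the text), chosen only for illustration.
data = [2.1, 3.4, 4.8, 5.0, 5.5, 6.2, 6.9, 7.3, 8.8, 9.6]

# Assumed classes: [2,4), [4,6), [6,8), [8,10)
low, width, n_classes = 2.0, 2.0, 4

def frequency_table(values, low, width, n_classes):
    """Count how many values fall in each half-open class [lower, upper)."""
    counts = Counter()
    for v in values:
        k = int((v - low) // width)      # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return [(low + k * width, low + (k + 1) * width, counts[k])
            for k in range(n_classes)]

table = frequency_table(data, low, width, n_classes)
for lower, upper, freq in table:
    # One row of stars per class gives a crude text histogram.
    print(f"[{lower:4.1f}, {upper:4.1f})  {freq:2d}  {'*' * freq}")
```

Each row is one class of the frequency table; the length of the row of stars plays the role of the histogram bar.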
Measures of location

• Measures of location tell us where various parts of the distribution lie. The mean, median
and mode are the most useful statistics for measuring the location of the centre of the
distribution.
The sample mean, $\bar{x}$, is the arithmetic average of the data values:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where n is the number of data values and $x_1, x_2, \ldots, x_n$ are the individual data values.

The sample mean provides an estimate of the population mean µ. Hence

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
For grouped data
Let $x_i$ = midpoint of the i-th class interval
$f_i$ = frequency associated with the i-th class interval
n = total number of observations
$\bar{x}$ = sample mean

$$\bar{x} = \frac{1}{n}\sum_{i} f_i x_i$$

Example: for the data 14, 12, 15, 17, 16,

$$\bar{x} = \frac{14 + 12 + 15 + 17 + 16}{5} = 14.8$$
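
The two mean formulas can be checked with a short sketch. The raw data are the example values from the text; the class midpoints and frequencies used for the grouped mean are hypothetical.

```python
# Example values from the text.
data = [14, 12, 15, 17, 16]
x_bar = sum(data) / len(data)        # (1/n) * sum of x_i
print(x_bar)

# Grouped data: hypothetical class midpoints and frequencies (not from the text).
midpoints = [5, 15, 25]
freqs = [2, 5, 3]
n = sum(freqs)                       # total number of observations
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
print(grouped_mean)
```

The grouped mean treats every observation in a class as if it sat at the class midpoint, so it only approximates the mean of the raw data.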
Point And Interval Estimates Of Population Mean

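• Case 1: when the population is normal and the sample size n < 30

Point estimate of µ:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Interval estimate of µ:

$$\bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{1-\alpha/2}\,\frac{s}{\sqrt{n}}$$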
• Case 2: when the sample size is large (n ≥ 30)
Interval estimate of µ:

$$\bar{x} + z_{\alpha/2}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}$$
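where $\bar{x}$ is the sample mean,
s is the sample standard deviation,
µ is the population mean,
n is the number of observations in the sample,
$t_{1-\alpha/2}$ and $z_{1-\alpha/2}$ are the standard variates (quantiles of the t and standard normal
distributions) at a (1-α) confidence level. By symmetry, $t_{\alpha/2} = -t_{1-\alpha/2}$ and
$z_{\alpha/2} = -z_{1-\alpha/2}$. In Case 1 the t variate has n - 1 degrees of freedom.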
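
A minimal sketch of the large-sample (Case 2) interval is given below. The summary statistics are hypothetical, and the normal quantile comes from Python's standard `statistics.NormalDist`. For Case 1 the same arithmetic applies with a t quantile on n - 1 degrees of freedom in place of z, taken from a t table or a statistics library.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics for a large sample (not from the text).
x_bar, s, n = 50.0, 6.0, 36
alpha = 0.05                               # gives a 95% confidence level

z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{1-alpha/2}, about 1.96
half_width = z * s / sqrt(n)               # z * s / sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
print(f"{1 - alpha:.0%} CI for mu: ({lower:.2f}, {upper:.2f})")
```

Because the standard normal distribution is symmetric, subtracting and adding the same half-width is equivalent to using the lower and upper quantiles separately.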
• Median: The median, M, is the midpoint of the observed values when they are arranged in
increasing order. Once the data are ordered so that $x_1 \le x_2 \le \cdots \le x_n$, the median can be
calculated from one of the following equations:

$$M = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases}$$

For grouped data:

$$\text{Median} = L + \frac{0.5n - cf_b}{f_m}\, w$$

where L = lower limit of the interval that contains the median
n = total frequency
$cf_b$ = cumulative frequency of all classes before the median class
$f_m$ = frequency of the class interval containing the median
w = interval width
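
The median rules for raw and grouped data can be sketched as follows. The function names and the grouped frequency table are hypothetical; the raw data reuse the example values 14, 12, 15, 17, 16.

```python
def median(values):
    """Median of raw data using the odd/even rule."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    # Python lists are 0-indexed, so x_{(n+1)/2} is xs[mid] when n is odd.
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def grouped_median(L, n, cfb, fm, w):
    """Median = L + ((0.5 * n - cfb) / fm) * w for grouped data."""
    return L + (0.5 * n - cfb) / fm * w

odd_case = median([14, 12, 15, 17, 16])    # n = 5, middle ordered value
even_case = median([14, 12, 15, 17])       # n = 4, mean of the two middle values
print(odd_case, even_case)

# Hypothetical classes [0,10), [10,20), [20,30) with frequencies 2, 5, 3.
# Half of n = 10 is reached inside [10,20), so that is the median class.
grouped = grouped_median(L=10, n=10, cfb=2, fm=5, w=10)
print(grouped)
```

The grouped formula interpolates linearly inside the median class, which assumes the observations are spread evenly across that class.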
• Mode: The mode is the most frequent value. A data set can have more than one mode.
For a histogram, a mode is a relative maximum.
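
A small sketch of a data set with one mode and one with two modes, using `statistics.multimode` from the Python standard library (Python 3.8 or later). Both data sets are hypothetical.

```python
from statistics import multimode

# Hypothetical data sets (not from the text).
one_mode = multimode([1, 2, 2, 3, 4])        # a single most frequent value
two_modes = multimode([1, 2, 2, 3, 3, 4])    # two values tie for most frequent
print(one_mode, two_modes)
```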
Measures of spread

• Spread can be measured by the variance, the standard deviation and the range.
• Variance: The variance is a measure of how spread out a distribution is. It is computed as
the average squared difference of the observed values from their mean. Since it involves
squared differences, the variance is sensitive to erratic high values. The sample variance, $s^2$,
is an estimate of the population variance $\sigma^2$ and is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

• Standard deviation: The standard deviation, σ, is simply the square root of the variance. It is
often used instead of the variance since its units are the same as the units of the variable
being described. The sample standard deviation, s, provides an estimate of the population
standard deviation:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• Range: The range is the simplest measure of dispersion and is defined as the difference
between the extreme values of the series. Thus

Range = $x_n - x_1$

where, for data ordered so that $x_1 \le \cdots \le x_n$, $x_n$ is the highest value and $x_1$ is the lowest
value. The range is useful because it gives a quick idea of the variability.
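
The three measures of spread can be computed directly from their definitions. The sketch below reuses the example values 14, 12, 15, 17, 16 given earlier in the text; the variable names are illustrative.

```python
from math import sqrt

data = [14, 12, 15, 17, 16]            # example values given earlier in the text
n = len(data)
x_bar = sum(data) / n

s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # sample variance
s = sqrt(s2)                                          # sample standard deviation
data_range = max(data) - min(data)                    # highest minus lowest value

print(round(s2, 3), round(s, 3), data_range)
```

Note the divisor n - 1 in the sample variance; dividing by n instead would give a smaller value.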

Measures of shape
• Skewness – Skewness is a measure of asymmetry and shows the manner in which the items
are clustered around the average. A distribution is skewed if one of its tails is longer than the
other. It is measured by:

$$\text{skewness} = \frac{E[(X - \mu)^3]}{\sigma^3}$$

• For a symmetric distribution the skewness is zero.
• Kurtosis – Kurtosis is a measure of the flat-toppedness of a curve. It is calculated as

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\,\sigma^4} - 3$$

where σ is the standard deviation.
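
A minimal sketch of the two shape measures, computed from central moments. It assumes σ is calculated with divisor n so that it matches the moment averages; the function name `shape_measures` and both data sets are hypothetical.

```python
from math import sqrt

def shape_measures(values):
    """Return (skewness, kurtosis minus 3) computed from central moments.

    Assumption: sigma uses divisor n, the same divisor as the moments.
    """
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    sigma = sqrt(m2)
    return m3 / sigma ** 3, m4 / sigma ** 4 - 3

# Hypothetical symmetric data: the skewness is zero.
skew, kurt = shape_measures([1, 2, 3, 4, 5])
print(skew, kurt)

# Hypothetical data with a long right tail: the skewness is positive.
right_skew = shape_measures([1, 1, 2, 2, 10])[0]
print(right_skew)
```

With the "- 3" term a normal distribution has kurtosis zero, so negative values indicate a flatter top than the normal curve.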


• Consider two random variables X and Y.

• One way to visualize the relationship between two variables is to draw a scatter
plot and then measure the strength of the relationship.

i    X    Y
1   3.0  2.1
2   2.6  2.5
3   3.0  3.4
4   1.2  1.0
5   2.0  1.6

[Figure: scatter plot of Y against X for the five observations]
Covariance(X, Y) = $\sigma_{xy} = E[(X - \mu_X)(Y - \mu_Y)]$

Sample covariance(X, Y) = $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

For the five observations:

$\bar{x} = 2.36$, $\bar{y} = 2.12$

$s_x^2 = 0.588$, $s_y^2 = 0.827$ (sample variances)

$s_{xy} = 0.596$

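Correlation coefficient:

$$\rho_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

• Sample correlation coefficient:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

For the five observations, $r_{xy} = 0.855$.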
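
The quoted values can be reproduced from the five (X, Y) pairs in the text. The sketch below applies the sample covariance and correlation formulas directly; the variable names are illustrative.

```python
from math import sqrt

# The five (X, Y) pairs listed in the text.
x = [3.0, 2.6, 3.0, 1.2, 2.0]
y = [2.1, 2.5, 3.4, 1.0, 1.6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)    # sample variance of X
s_yy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)    # sample variance of Y
s_xy = sum((xi - x_bar) * (yi - y_bar)
           for xi, yi in zip(x, y)) / (n - 1)          # sample covariance
r_xy = s_xy / sqrt(s_xx * s_yy)                        # sample correlation

print(round(x_bar, 2), round(y_bar, 2))
print(round(s_xx, 3), round(s_yy, 3), round(s_xy, 3), round(r_xy, 3))
```

The values 0.588 and 0.827 come out as the sample variances of X and Y; their square roots are the standard deviations that enter the denominator of the correlation coefficient.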
Correlation Analysis

• Pearson Correlation Coefficient
• Step 1: First, we specify the null and alternative hypotheses about the population
correlation coefficient ρ:
Null hypothesis H0: ρ = 0
Alternative hypothesis H1: ρ ≠ 0
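• Step 2: Second, we calculate the value of the test statistic from the sample correlation
coefficient r and the sample size n. The statistic usually used for this test is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which follows a t distribution with n - 2 degrees of freedom when H0 is true.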
• Step 3: Decision
If p-value < significance level: reject H0 in favour of H1
If p-value ≥ significance level: do not reject H0
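
The three steps can be sketched for the five-point example (r = 0.855, n = 5). The sketch assumes the usual t statistic for testing a zero correlation, t = r√(n-2)/√(1-r²) with n - 2 degrees of freedom, and compares it with a two-sided 5% critical value from a standard t table instead of computing a p-value. The two decision rules are equivalent: |t| exceeds the critical value exactly when the p-value is below the significance level.

```python
from math import sqrt

r, n = 0.855, 5              # sample correlation and sample size from the text
df = n - 2                   # degrees of freedom of the t statistic

# Assumed test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat = r * sqrt(df) / sqrt(1 - r ** 2)

# Two-sided 5% critical value t_{0.975, 3}, taken from a standard t table.
t_crit = 3.182

reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 2), reject_h0)
```

With only five observations even a sample correlation of 0.855 does not lead to rejecting H0 at the 5% level, which shows how strongly the test depends on the sample size.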

