
Chapter 2 - 2012


Chapter 2: Single-Stage Simple Random Sampling

After a target population has been defined and the decision has been made to use a sample survey as the
method of data collection, there are several statistical sampling techniques available for the intended
purpose. These include simple random sampling, stratified random sampling, systematic sampling, cluster
sampling, and multi-stage sampling. All of these techniques have one sampling characteristic in
common: each deals with a method of random sample selection from a defined target population to be
investigated. The random selection procedure ensures that each population unit has an equal chance of
being included in the sample. It is this random selection method that provides representative samples
impartially and without bias, by avoiding any human influence. Each technique will be
discussed in subsequent chapters.

This chapter presents the basic principles and characteristics of the single-stage simple random
sampling technique. Simple random sampling is fundamental to the development of the theory of
sampling, and it serves as a central reference for all other sampling designs.

2.1 Definition and Basic Concepts

Simple random sampling (SRS) is a basic probability sampling selection technique in which a
predetermined number of sample units are selected from a population list (sampling frame) so that
each unit on the list has an equal chance of being included in the sample. Simple random sampling also
makes the selection of every possible combination of the desired number of units equally likely. In this
way, each sample has an equal chance of being selected. If the population has N units then a random
method of selection is one which gives each of the N units in the population to be covered a calculable
probability of being selected.
To undertake a sample selection, there are two types of random sampling selection: sampling with
replacement (wr) and sampling without replacement (wor).

Sampling Without Replacement:

Sampling without replacement (wor) means that once a unit has been selected, it cannot be selected
again. In other words, no unit can appear more than once in the sample. If there are n
sample units required for selection from a population having N units, then there are $\binom{N}{n}$ ways of
selecting n units out of a total of N units without replacement, disregarding the order of the n units.
Hence, simple random sampling is equivalent to the selection of one of the $\binom{N}{n}$ possible samples, with
an equal probability $1/\binom{N}{n}$ assigned to each sample.

In simple random sampling without replacement, the probability of a specified unit of the population
being selected at any given draw is equal to the probability of its being selected at the first draw, that
is, $1/N$. For a sample of size n, the sum of the probabilities of these n mutually exclusive events is
$n/N$, which is the probability that a specified unit is included in the sample.

Sampling With Replacement:

The process of sampling with replacement (wr) allows a unit to be selected on more than one draw.
There are $N^n$ ways of selecting n units out of a total of N units with replacement when the order of
selection is taken into account. All selections are independent, since the selected unit is returned to the
population before the next selection is made. Thus, the probability of selecting any specific element is
$1/N$ on each of the n draws.
Simple random sampling with or without replacement is practically identical if the sample size is a
very small fraction of the population size. Generally, sampling without replacement yields more
precise results and is operationally more convenient.
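The distinction between the two schemes can be illustrated with a minimal Python sketch (the population labels below are hypothetical): `random.sample` draws without replacement, while `random.choices` draws with replacement.

```python
import random

population = list(range(1, 21))  # N = 20 units, labeled 1..20
n = 5                            # sample size

random.seed(42)                  # arbitrary seed, for reproducibility

# Without replacement (wor): no unit can appear more than once.
srs_wor = random.sample(population, n)

# With replacement (wr): a unit may be selected on more than one draw.
srs_wr = random.choices(population, k=n)

print("without replacement:", sorted(srs_wor))
print("with replacement:   ", sorted(srs_wr))
# Under wr each draw selects any specific unit with probability 1/N;
# under wor each unit's overall inclusion probability is n/N.
```

When n/N is very small the two schemes behave almost identically, which matches the remark above.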

2.2 Simple Random Sample Selection Procedures

In sample surveys, when sample units are selected from a population, there is a possibility of bias
in the selection procedure, which may arise from the use of a non-random method; that is, the
selection is consciously or unconsciously influenced by human judgment. Such
bias can be avoided by using a random selection method. True randomness can be ensured by using
a method of selection that cannot be affected by human influence.

There are different random sample selection methods. The important aspect of random selection in
each method is that the selection of each unit is based purely on chance. This chance is known as the
probability of selection, and it eliminates selection bias. If there is bias in the selection, it may
prevent the sample from being representative of the population. Representative here means that probability
samples permit scientific approaches in which the samples give accurate estimates for the total
population. We consider here two basic and common random selection procedures.

Lottery Method:

This is a very common method of taking a random sample. Under this method, we label each member
of the population with an identifiable disc, ticket, or piece of paper. The discs or tickets must be of
identical size, color, and shape. They are placed in a container (urn or bowl) and well mixed before each
draw; then, without looking into the container, the designated labels are drawn with or
without replacement. The series of draws continues until a sample of the required size is
selected. This procedure ensures that the selection of each item depends entirely on chance.

For example, if we want to take a sample of 18 persons out of a population of 90 persons, the
procedure is to write the names of all the 90 persons on separate slips (tickets) of paper. The slips
(tickets) of paper must be of identical size, color and shape. The next step is to fold these slips, mix
them thoroughly and then make a blindfold selection of 18 slips one at a time without replacement.
This lottery method becomes quite cumbersome and time consuming to use as the sizes of sample and
population increase. To avoid such problems and to reduce the labor of selection process, another
method known as a random number table selection process can be used.

The Use of Random Numbers:

A table of random numbers consists of digits from 0 to 9, which are equally represented with no
pattern or order, produced by a computer random number generator. The members of the population
are numbered from 1 to N and n numbers are selected from one of the random tables in any convenient
and systematic way. The procedure of selection is outlined as follows.

- Identify the population units (N) and assign serial numbers from 1 to N. This total number N
  determines how many random digits we need to read when selecting the sample elements, and it
  requires the preparation of an accurate sampling frame.
- Decide the sample size (n) to be selected, which indicates how many serial numbers are to be drawn.
- Select a starting point in the table of random numbers; you can start from any one of the
  columns, which can be determined randomly.
- Since each digit has an equal chance of being selected at any draw, you may read down the
  columns of digits in the table.
- Depending on the population size N, you can read the numbers in pairs, three at a time, four at a
  time, and so on.
- Selected numbers that are less than or equal to the population size N are taken as
  sample serial numbers.
- All selected numbers greater than N are ignored.
- For sampling without replacement, reject numbers that come up a second time.
- The selection process continues until n distinct units are obtained.

For example, consider a population of size N = 5000, and suppose it is desired to take a sample of 25
items out of the 5000 without replacement. Since N = 5000, we need four-digit numbers. All items from 1
to 5000 should be numbered, and we can start anywhere in the table and select numbers four digits at a
time. Thus, using the random number table found at the end of this chapter, if we start from column five
and read down the columns, we obtain 2913, 2108, 2993, 2425, 1365, 1760, 2104, 1266, 4033, 4147, 0334,
4225, 0150, 2940, 1836, 1322, 2362, 3942, 3172, 2893, 3933, 2514, 1578, 3649, 0784, ignoring all
numbers greater than 5000.
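Because printed random-number tables are rarely used today, the same procedure is usually carried out with a computer's random number generator. A minimal sketch for N = 5000 and n = 25 as in the example (the seed is an arbitrary assumption), which mimics reading four-digit numbers and discarding those above N or already drawn:

```python
import random

N, n = 5000, 25          # population size and required sample size
random.seed(1)           # arbitrary seed, for reproducibility

selected = []
while len(selected) < n:
    number = random.randint(0, 9999)      # a four-digit random number
    if 1 <= number <= N and number not in selected:
        selected.append(number)           # keep valid, unseen serial numbers
    # numbers greater than N, 0000, and repeats are ignored (wor selection)

print(selected)
```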

2.3 Review of Sampling Distribution

Basic Notations:

We will adopt the following basic notation to represent population parameters and sample
statistics. This notation will be used throughout this book, with slight modifications to
suit the specific design under consideration.

For Population parameters:

N = the total number of units in the population (population size).


$Y_i$ = value of the "y" variable for the i-th population element (i = 1, 2, ..., N).

$Y = \sum_{i=1}^{N} Y_i$ is the population total for the "y" variable.

$\bar{Y} = \frac{Y}{N} = \frac{\sum_{i=1}^{N} Y_i}{N} = \mu_y$ is the population mean per element of the "y" variable. We will use $\bar{Y}$ and $\mu_y$
interchangeably.

$\sigma_y^2 = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}{N}$ and $S_y^2 = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}{N-1}$ are the variances of the population elements.

The relationship between these two variances can be established by expressing each in terms
of the other, i.e., $S_y^2 = \frac{N}{N-1}\sigma_y^2$ or $\sigma_y^2 = \frac{N-1}{N}S_y^2$.

Taking the square root of the variance gives the standard deviation of the population elements,
which is represented by $S_y$ or $\sigma_y$.

$S_{xy} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{N-1}$ or $\sigma_{xy} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{N}$ is the covariance of the random variables X and Y.

$\rho_{xy} = \frac{S_{xy}}{S_x S_y}$ or $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ is the population correlation coefficient.

For Sample Statistics:

n = the number of sample units selected from the population (the sample size).

$y_i$ = value of the "y" variable for the i-th sample element (i = 1, 2, ..., n).

$y = \sum_{i=1}^{n} y_i$ is the sample total for the "y" variable.

$\bar{y} = \frac{y}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$ is the sample mean per element of the "y" variable.

$s_y^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}$ is the variance of the sample elements, and its square root, denoted by $s_y$, is the
standard deviation of the sample elements.

$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$ is the sample covariance.

$\hat{\rho} = \frac{s_{xy}}{s_x s_y}$ is the sample correlation coefficient.

$f = \frac{n}{N}$ is the sampling fraction.

The sample statistics are computed from the results of sample surveys; the primary objective of a
sample survey is to provide estimates of the population parameters, since in practice
almost all population parameters are unknown.

Sampling Variability

The sample statistics, calculated from the selected sample units, are subject to sampling variability. They
depend on the number of sample units (the sample size) and on which units are included in the sample. Each
unit in the population has a different characteristic and/or value. For example, salary varies from
individual to individual, so the figures entering the calculation of
average income depend on which employees are selected from the total workforce. Similarly, the
number of sample workers selected (the sample size) affects the sample values. This indicates that
sample statistics such as the mean, total, variance, ratio, and proportion are random variables. Like other
random variables, these sample statistics possess a probability distribution, which is more commonly
known as a sampling distribution.

Sampling Distribution:

What is sampling distribution? What is the purpose of computing sampling distribution? The following
example will illustrate the basic idea of sampling distribution and its use.

Example 2.1:

For demonstration purposes we will consider a very small hypothetical population of 5 farmers, who
use fertilizer in their farming. Suppose the amount of fertilizer used (in kg) by each farmer is 70, 78,
80, 80, and 95. The following population parameters and sample values (statistics) are
computed to illustrate the basic idea behind estimation.

Population Parameters:

Let Yi denotes the amount of fertilizer used by each farmer (i =1, 2, - - -, 5). The population size is 5,
i.e. N = 5. The total amount of fertilizer used by all farmers and the average fertilizer consumption per
farmer are computed as follows.

The total amount of fertilizer used is $Y = \sum_{i=1}^{N} Y_i = 70 + 78 + 80 + 80 + 95 = 403$ kg.

The mean consumption of fertilizer per farmer is $\bar{Y} = \frac{Y}{N} = \frac{403}{5} = 80.6$ kg.
Regarding fertilizer consumption variability among farmers, both types of population variances and
their corresponding standard deviations are calculated.

$S_y^2 = \frac{\sum(Y_i - \bar{Y})^2}{N-1} = \frac{(70-80.6)^2 + (78-80.6)^2 + (80-80.6)^2 + (80-80.6)^2 + (95-80.6)^2}{4} = \frac{327.2}{4} = 81.8$.

$\sigma_y^2 = \frac{\sum(Y_i - \bar{Y})^2}{N} = \frac{327.2}{5} = 65.44$.

Taking the square root of each variance gives the standard deviation of the population:
$S_y = 9.044$ and $\sigma_y = 8.089$. In reality all these population characteristics are mostly unknown for
populations of relatively large size and must be estimated from survey results collected and
summarized from the sample elements.

Now we want to estimate these population values from sample elements assuming that population
parameters are unknown. In the following sampling distribution we will examine all possible samples.

Assume that a sample of three farmers is selected from the total farmers to estimate the population
parameters. The total number of possible samples can be calculated as $\binom{N}{n} = \binom{5}{3} = 10$. The following
table shows the ten possible samples with their corresponding values and sample means. Let $F_i$
represent the i-th farmer, i = 1, 2, ..., 5.

Sample   Sample Units   Values of elements   Sample Mean ($\bar{y}_k$)
1        F1 F2 F3       70, 78, 80           76.00
2        F1 F2 F4       70, 78, 80           76.00
3        F1 F2 F5       70, 78, 95           81.00
4        F1 F3 F4       70, 80, 80           76.67
5        F1 F3 F5       70, 80, 95           81.67
6        F1 F4 F5       70, 80, 95           81.67
7        F2 F3 F4       78, 80, 80           79.33
8        F2 F3 F5       78, 80, 95           84.33
9        F2 F4 F5       78, 80, 95           84.33
10       F3 F4 F5       80, 80, 95           85.00

The Sample Mean:

For each possible sample, dividing the sum of the amounts of fertilizer used by the sample size
gives the sample mean $\bar{y}_k$. For instance, the mean of the first sample is $\frac{70+78+80}{3} = 76.00$,
and the remaining sample means can be calculated in a similar way.

From the values of random variable y k , we can construct the frequency distribution as shown below.
From this frequency we obtain the probabilities of the random variable y k , by dividing the frequency
of the random variable y k by the sum of the frequencies.

Values of $\bar{y}_k$   Frequency (f)   Probability of $\bar{y}_k$
76.00             2               2/10 = 0.2
76.67             1               1/10 = 0.1
79.33             1               1/10 = 0.1
81.00             1               1/10 = 0.1
81.67             2               2/10 = 0.2
84.33             2               2/10 = 0.2
85.00             1               1/10 = 0.1
Total             10              1.00

This table gives the sampling distribution of $\bar{y}_k$. If we draw just one sample of three farmers from the
population of five farmers, we may draw any one of the 10 possible samples of farmers. Hence, the
sample mean $\bar{y}$ can assume any one of the values listed above with the corresponding probabilities.
For instance, the probability of the mean 81.67 is $P(\bar{y}_k = 81.67) = \frac{2}{10} = 0.2$. This shows that the
sample average $\bar{y}_k$ is a random variable that depends on which sample is selected. Its values vary
from 76.00 to 85.00, and some of these values are lower or higher than the population mean $\bar{Y} = 80.6$.

The overall mean, calculated from all possible samples, is equal to the true population
mean. That is, the expected value of $\bar{y}_k$, denoted by $E(\bar{y}_k)$ and taken over all possible samples, equals the
true mean of the population. From the table,

$E(\bar{y}_k) = \frac{\sum f\,\bar{y}_k}{\sum f} = \frac{806}{10} = 80.6$, which is the same as $\bar{Y}$.

It can also be calculated using the probability concept, that is,

$E(\bar{y}_k) = \sum_{i=1}^{k} \bar{y}_i\,P(\bar{y}_i) = 76 \times \frac{2}{10} + \cdots + 85 \times \frac{1}{10} = 80.6 = \mu_y$

What is the deviation of the sample mean from the true population mean?

It can be observed that the sample mean is either equal to or different from the true population mean.
This deviation can be assessed in terms of probability. We continue with the same example to
explain the properties of this deviation.

We consider the cases where the deviation is within one, two, or four units of the true
population mean.

$P(-1 \le \bar{y}_k - \bar{Y} \le +1) = P(80.6-1 \le \bar{y}_k \le 80.6+1) = P(79.6 \le \bar{y}_k \le 81.6) = 1/10 = 0.1$

$P(-2 \le \bar{y}_k - \bar{Y} \le +2) = P(80.6-2 \le \bar{y}_k \le 80.6+2) = P(78.6 \le \bar{y}_k \le 82.6) = 4/10 = 0.4$

$P(-4 \le \bar{y}_k - \bar{Y} \le +4) = P(80.6-4 \le \bar{y}_k \le 80.6+4) = P(76.6 \le \bar{y}_k \le 84.6) = 7/10 = 0.7$

This indicates that the greater the demands we make of being close to "true" value, the smaller the
chance we have of fulfilling it.

Variability of the mean:

The sampling variance of the mean, $V(\bar{y})$, is defined as the average of the squared deviations of the
sample means from the true mean, that is,

$V(\bar{y}) = E(\bar{y}_i - \bar{Y})^2 = \frac{\sum_{i=1}^{k}(\bar{y}_i - \bar{Y})^2}{k}$,

where k is the total number of possible samples, $\bar{y}_i$ is the mean of the i-th sample, and $\bar{Y}$ is the true
mean of the population.

The square root of the sampling variance, $\sqrt{\text{Var}(\bar{y})}$, is called the standard error (S.E.) of the sample
mean. The smaller the standard error of the mean, the greater its reliability.

For each possible i-th sample, we can compute the sample variance $s_i^2$. The mean of the sample
variances ($s^2$) is equal to the population variance $S_y^2$, i.e., $E(s^2) = \frac{\sum_{i=1}^{k} s_i^2}{k} = S_y^2$, where k is the total
number of possible samples.

Consider again Example 2.1, the population consisting of 5 farmers. The sample variances for all 10
possible samples of size 3 can be computed as

$s_i^2 = \frac{\sum_{j=1}^{n}(y_{ij} - \bar{y}_i)^2}{n-1}$, where $\bar{y}_i = \frac{\sum_{j=1}^{n} y_{ij}}{n}$, for the i-th sample of size n = 3.

$s_1^2 = \frac{(70-76)^2 + (78-76)^2 + (80-76)^2}{2} = \frac{56}{2} = 28$,

$s_2^2 = \frac{(70-76)^2 + (78-76)^2 + (80-76)^2}{2} = \frac{56}{2} = 28$,

$\vdots$

$s_{10}^2 = \frac{(80-85)^2 + (80-85)^2 + (95-85)^2}{2} = \frac{150}{2} = 75$.

A summary of the calculated sample variances is listed below.

$s_1^2$   $s_2^2$   $s_3^2$   $s_4^2$   $s_5^2$    $s_6^2$    $s_7^2$   $s_8^2$   $s_9^2$   $s_{10}^2$
28     28     163    33.4   158.34  158.34  1.34   86.34  86.34  75

Therefore, the mean of the sample variances ($s^2$) is computed as

$E(s^2) = \frac{\sum_{i=1}^{k} s_i^2}{k} = \frac{28 + 28 + \cdots + 75}{10} = \frac{818.1}{10} = 81.81$.

We know that the population variance is $S_y^2 = 81.8$, and this shows that $E(s^2) = S^2$ up to rounding
error. But the sampling variance $V(\bar{y})$ is not the same as the population variance $S^2$; that
is, $V(\bar{y}) \ne S^2$. The relationship between them is

$V(\bar{y}) = (1-f)\frac{S^2}{n}$, where $(1-f)$ is the finite population correction (fpc).

This shows that the sampling variance $V(\bar{y})$ depends on the population parameter $S^2$, which is mostly
unknown and must be estimated by the sample variance.

EX. Verify that $V(\bar{y}) \ne S^2$.
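The exercise can be checked by brute force for this small population: the variance of the 10 possible sample means equals $(1-f)S^2/n$, which is much smaller than $S^2$ itself. A sketch:

```python
from itertools import combinations
from statistics import mean

population = [70, 78, 80, 80, 95]
N, n = len(population), 3

Y_bar = mean(population)
S2 = sum((y - Y_bar) ** 2 for y in population) / (N - 1)   # 81.8

# Sampling variance: average squared deviation of all sample means.
means = [mean(s) for s in combinations(population, n)]
V = sum((m - Y_bar) ** 2 for m in means) / len(means)

f = n / N
print(V)                    # sampling variance of the mean
print((1 - f) * S2 / n)     # identical: (1 - f) S^2 / n
print(S2)                   # 81.8, clearly not equal to V
```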

2.4 Properties of Estimates

An estimator is a rule that tells how to calculate an estimate based on the measurements contained in a
sample. It is a sample statistic used to estimate a population parameter. Thus, the sample mean $\bar{y}$ is an
estimator of the population mean $\bar{Y}$. The value assigned to a population parameter based on the
value of a sample statistic is called an estimate. For instance, from the above example, $\bar{y}_k = 76.00$ is
an estimate of $\bar{Y}$.

Unbiasedness: Let $\hat{\theta}$ be a point estimator of a parameter $\theta$ computed from the sample. If $E(\hat{\theta}) = \theta$, we
say $\hat{\theta}$ is unbiased. If $E(\hat{\theta}) \ne \theta$, we say $\hat{\theta}$ is a biased estimator of $\theta$, and $E(\hat{\theta}) - \theta = \beta$, where $\beta$ is the
bias. For example, an estimator of the mean is unbiased if the mean of its sampling distribution equals the
population mean. That is, if $E(\bar{y}) = \bar{Y}$, then we say $\bar{y}$ is an unbiased estimator. Most estimators in
common use are unbiased, though occasionally it may be convenient to use an estimator which suffers
from some small degree of bias. In that case, $E(\bar{y}) \ne \bar{Y}$, and $E(\bar{y}) - \bar{Y} = \text{Bias}$.

For a biased estimator, the mean square error (MSE) measures the variability of the sampling
distribution about the target parameter. It is defined as $\text{MSE}(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = \text{Var}(\hat{\theta}) + \beta^2$ (verify). For the mean,
$\text{MSE}(\bar{y}) = \text{Var}(\bar{y}) + \beta^2$ = sampling variance + the square of the bias, where $\beta$ = bias. For an unbiased
estimator, $\text{MSE}(\bar{y}) = \text{Var}(\bar{y})$, since $\beta = 0$. Thus, the smaller the mean square error of an estimate, the
greater is the accuracy.
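The identity $\text{MSE} = \text{Var} + \beta^2$ can be verified numerically by deliberately biasing the sample mean from Example 2.1; the added constant 1 is an artificial bias chosen purely for illustration:

```python
from itertools import combinations
from statistics import mean

population = [70, 78, 80, 80, 95]
Y_bar = mean(population)                      # true mean, 80.6

means = [mean(s) for s in combinations(population, 3)]
biased = [m + 1 for m in means]               # artificially biased estimator

var = sum((b - mean(biased)) ** 2 for b in biased) / len(biased)
bias = mean(biased) - Y_bar                   # equals 1 by construction
mse = sum((b - Y_bar) ** 2 for b in biased) / len(biased)

print(mse)                  # mean square error about the true mean
print(var + bias ** 2)      # same value: Var + (bias)^2
```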
Consistency: An estimator is said to be consistent if it tends to the population value as the sample size
increases. Let $\hat{\theta}_n$ be an estimator of a population parameter $\theta$. Then $\hat{\theta}_n$ is a consistent
estimator of $\theta$ if:

- For any positive number $\varepsilon$, $\lim_{n \to \infty} P\left(\left|\hat{\theta}_n - \theta\right| \ge \varepsilon\right) = 0$. This indicates that $\hat{\theta}_n$ approaches $\theta$ as n
  approaches $\infty$.

- $\lim_{n \to \infty} E(\hat{\theta}_n - \theta)^2 = \lim_{n \to \infty} \text{Var}(\hat{\theta}_n) = 0$ (the equality holding for an unbiased estimator).
Example:
An estimator is said to be consistent if it tends to the population value with increasing sample size: as
the size of the sample increases, the sample estimates concentrate around the population value. By
considering the population of 5 farmers, we can find all possible samples of size 2, 3, and 4 without
replacement and compute the sample results. The sampling distribution has already been calculated
for a sample size of three, and in a similar way the sampling distributions can be calculated for
sample sizes two and four. The following ranges of possible sample means are observed for the three
different sample sizes.

$74.00 \le \bar{y} \le 87.50$, when the sample size n = 2, with 10 possible samples.
$76.00 \le \bar{y} \le 85.00$, when the sample size n = 3, with 10 possible samples.
$77.00 \le \bar{y} \le 83.25$, when the sample size n = 4, with 5 possible samples.

This example shows that as the sample size increases, the sample mean tends to the population mean
from both directions.
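These ranges can be verified by enumerating every possible sample of each size; the spread of the possible means narrows as n grows, which is the consistency property at work. A sketch:

```python
from itertools import combinations
from statistics import mean

population = [70, 78, 80, 80, 95]

# For each sample size, list every possible wor sample and its mean.
for n in (2, 3, 4):
    means = [mean(s) for s in combinations(population, n)]
    print(f"n={n}: {len(means)} samples, "
          f"means range from {min(means):.2f} to {max(means):.2f}")
```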

Efficiency: A particular sampling scheme is said to be more "efficient" than another if, for a fixed
sample size, the sampling variance of the survey estimates for the first scheme is less than that for the
second. For the same population, comparisons of efficiency are often made with simple random
sampling as the baseline scheme, using the ratio of the variances.

For example, if $\bar{y}_1$ and $\bar{y}_2$ are two estimators of a parameter $\bar{Y}$, with equal sample size and with
variances $V(\bar{y}_1)$ and $V(\bar{y}_2)$ respectively, then the efficiency of $\bar{y}_1$ relative to $\bar{y}_2$ is given as follows:

$\text{Efficiency}(\bar{y}_1, \bar{y}_2) = \frac{V(\bar{y}_2)}{V(\bar{y}_1)}$, for unbiased estimators, and

$\text{Efficiency}(\bar{y}_1, \bar{y}_2) = \frac{\text{MSE}(\bar{y}_2)}{\text{MSE}(\bar{y}_1)}$, for biased estimators.

Thus, if this ratio is greater than one, then $\bar{y}_1$ is a better estimator than $\bar{y}_2$.

EX: From the distribution given above, which one is more efficient, $\bar{y}_1$ or $\bar{y}_7$?

2.5 The Sample Mean and Its Variances and Standard Errors

Theorem 1:

The sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$, i.e., $E(\bar{y}) = \bar{Y}$. Prove this
theorem.

Theorem 2:

The variance of the mean $\bar{y}$ from a simple random sample is:

$V(\bar{y}) = (1-f)\frac{S^2}{n}$, for sampling without replacement (wor), and

$V(\bar{y}) = \frac{\sigma^2}{n}$, for sampling with replacement (wr),

where $f = \frac{n}{N}$ is the sampling fraction and $1-f = \frac{N-n}{N}$ is the finite population correction. Prove
theorem 2.

Corollary: The standard error is $S.E.(\bar{y}) = \sqrt{V(\bar{y})} = \sqrt{1-f}\,\frac{S}{\sqrt{n}}$ (wor), or $S.E.(\bar{y}) = \frac{\sigma}{\sqrt{n}}$ (wr).

Corollary: i) $\hat{Y} = N\bar{y}$ is an unbiased estimate of the population total Y.

ii) If $\hat{Y} = N\bar{y}$ is an unbiased estimate of the population total Y, then its variance is given by:

$V(\hat{Y}) = N^2 V(\bar{y}) = N^2\frac{S^2}{n}(1-f)$, for sampling without replacement (wor), and

$V(\hat{Y}) = N^2 V(\bar{y}) = N^2\frac{\sigma^2}{n}$, for sampling with replacement (wr).

Their corresponding standard errors are:

$S.E.(\hat{Y}) = N\,S.E.(\bar{y}) = N\sqrt{1-f}\,\frac{S}{\sqrt{n}}$, and $S.E.(\hat{Y}) = N\frac{\sigma}{\sqrt{n}}$, respectively.

Theorem 3:

If a pair of variables, $x_i$ and $y_i$, defined on every unit in the population have the corresponding sample
means $\bar{x}$ and $\bar{y}$ from a simple random sample of size n, then the covariance is given by

$\text{Cov}(\bar{x}, \bar{y}) = \frac{S_{xy}}{n}\left(\frac{N-n}{N}\right) = (1-f)\frac{S_{xy}}{n}$, for sampling without replacement, and

$\text{Cov}(\bar{x}, \bar{y}) = \frac{\sigma_{xy}}{n}$, for sampling with replacement, where $S_{xy}$ and $\sigma_{xy}$ are the population covariances of X
and Y for the two types of sampling respectively. Prove this theorem.
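Theorem 3 can be checked numerically on a small population; the x values below are invented for illustration and paired with the fertilizer amounts from Example 2.1. Enumerating all without-replacement samples of size 3 reproduces $(1-f)S_{xy}/n$:

```python
from itertools import combinations
from statistics import mean

# Hypothetical paired population values (x might be farm size in hectares).
X = [2, 3, 4, 4, 6]
Y = [70, 78, 80, 80, 95]
N, n = 5, 3
f = n / N

xbar, ybar = mean(X), mean(Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (N - 1)

# Enumerate every wor sample of size 3 and compute Cov(xbar_s, ybar_s).
samples = list(combinations(zip(X, Y), n))
mx = [mean(p[0] for p in s) for s in samples]
my = [mean(p[1] for p in s) for s in samples]
cov = sum((a - xbar) * (b - ybar) for a, b in zip(mx, my)) / len(samples)

print(cov)                  # covariance of the sample means
print((1 - f) * Sxy / n)    # same value, as the theorem asserts
```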

2.6 Estimation of Standard Error from a Sample

Since the variances $S^2$ and $\sigma^2$ of the population are mostly unknown, we use the estimate $s^2$
computed from the sample observations measured in a single survey. For a simple random sampling
design, the sample variance $s^2$ is an unbiased estimator of $S^2$ or $\sigma^2$.

Theorem 4:

For a simple random sample, the sample variance, s2, is an unbiased estimator of S2 or 2 for sampling
without replacement and sampling with replacement respectively. Prove this theorem for both cases.

In practice the sample variance $s^2$ is used, and therefore the unbiased estimates of the variances of the
sample mean and total are given as follows.

For the mean: The estimated variance for sampling without replacement is $v(\bar{y}) = (1-f)\frac{s^2}{n}$, and its
standard error is $s.e.(\bar{y}) = \sqrt{1-f}\,\frac{s}{\sqrt{n}}$.

Similarly, for sampling with replacement, $v(\bar{y}) = \frac{s^2}{n}$, and its standard error is $s.e.(\bar{y}) = \frac{s}{\sqrt{n}}$.

For the total: $v(\hat{Y}) = N^2\frac{s^2}{n}(1-f)$ and $s.e.(\hat{Y}) = N\sqrt{1-f}\,\frac{s}{\sqrt{n}}$ for sampling without replacement, and
$v(\hat{Y}) = N^2\frac{s^2}{n}$, with standard error $s.e.(\hat{Y}) = N\frac{s}{\sqrt{n}}$, for sampling with replacement.

If we look at all these expressions, we can observe that as n increases, $\sqrt{n}$ also increases
and hence the standard error decreases. The standard error from a sample is used for various
purposes. It is mainly used:
- To compare the precision of an estimate from SRS with that from other sampling methods;
- To determine the sample size required in a survey; and
- To estimate the actual precision of the survey.

2.7 Confidence Intervals

In practice surveys are conducted only once for one specific objective. In other words, one does not
draw all possible samples to calculate the variance or the standard error of an estimate. However, if
probability-sampling methods are used, the sample estimates and their associated measures of
sampling error can be determined on the basis of a single sample.

Therefore, any specific value or estimate obtained from sample observations may differ from the
population parameter; the estimate from the sample could be less than, greater than, or equal to the
population value. Because of this discrepancy, an assessment must be made of the accuracy of the
estimate. The question is: how confident can we reasonably be that our inference is correct?
Estimates are often presented in terms of confidence intervals to express precision in a
meaningful way. A confidence interval constitutes a statement on the level of confidence that the true
value for the population lies within a specified range of values.

A 95% confidence interval can be described as follows: if sampling is repeated indefinitely, each
sample leads to a new confidence interval, and in 95% of the samples the interval will cover the
true population value. For example, consider a sample mean $\bar{y}$, which is an unbiased estimate of the
population mean $\mu_y$. The confidence interval for $\mu_y$ is $\mu_y = \bar{y} \pm$ sampling error, where the sampling
error depends on the sampling distribution of $\bar{y}$. Translating this into a description of a normal
distribution, an approximate $100(1-\alpha)\%$ confidence interval for $\bar{Y}$ is:

$P\left(\bar{y} - Z_{\alpha/2}\,S.E.(\bar{y}) \le \mu_y \le \bar{y} + Z_{\alpha/2}\,S.E.(\bar{y})\right) = 1 - \alpha$,

where $\mu_y$ is the unknown population parameter, $1-\alpha$ is the confidence level, and $\alpha$ is the permissible
level of error, i.e., the probability that one is willing to be wrong, known as the significance level.
$Z_{\alpha/2}$ is the critical value of the standard normal distribution, $\bar{y} + Z_{\alpha/2}\,S.E.(\bar{y})$ is the upper confidence
limit, and $\bar{y} - Z_{\alpha/2}\,S.E.(\bar{y})$ is the lower confidence limit.

Similarly, for the population total (parameter), the confidence limits are given by
$Y = \hat{Y} \pm Z_{\alpha/2}\,S.E.(\hat{Y})$, or $Y = N\bar{y} \pm Z_{\alpha/2}\,N\,S.E.(\bar{y})$. Since $S.E.(\bar{y})$ is not known, we substitute for
it the sample standard error $s.e.(\bar{y})$ computed from the sample observations.

Example: See Cochran 3rd edition page 27.
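As a numerical sketch, a 95% confidence interval for the mean under without-replacement sampling can be computed as below; the sample values and the population size N = 200 are hypothetical, not taken from the text:

```python
import math

sample = [72, 81, 79, 85, 90, 68, 77, 83, 88, 74]   # hypothetical SRS values
N = 200                                             # assumed population size
n = len(sample)

y_bar = sum(sample) / n
s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)
f = n / N
se = math.sqrt((1 - f) * s2 / n)    # s.e.(y_bar) with the fpc

z = 1.96                            # Z for a 95% confidence level
lower, upper = y_bar - z * se, y_bar + z * se
print(f"95% CI for the mean:  ({lower:.2f}, {upper:.2f})")

# For the population total, multiply the limits by N:
print(f"95% CI for the total: ({N * lower:.1f}, {N * upper:.1f})")
```

With a sample this small, a t critical value would usually replace 1.96 in practice; the normal value is kept here to match the formula in the text.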

2.8 Estimation for Sub-populations

The need sometimes arises to estimate population parameters not only for the entire population, but also
for its subdivisions or subpopulations, known as domains of study. Such divisions could be by
residence, age, sex, geographical area, income group, etc. Note that in some cases study domains may
coincide with strata, and in others they may differ.

Notation:

N = the number of elements in the population
$N_j$ = the number of elements in the j-th domain
$n_j$ = the number of sample elements in an SRS of size n that happen to fall in the j-th domain
$Y_{jk}$ = the measurement on the k-th element in the j-th domain, for k = 1, 2, ..., $n_j$ in the sample and
k = 1, 2, ..., $N_j$ in the population

The objective is to estimate the subpopulation parameters such as the mean, $\bar{Y}_j$, and the total, $Y_j$, for the j-th
domain. These parameters and their estimators are computed as follows.

i) Subpopulation Mean ($\bar{Y}_j$)

The subpopulation mean is defined as $\bar{Y}_j = \mu_j = \frac{\sum_{k=1}^{N_j} Y_{jk}}{N_j}$, and its sample estimator is given by $\bar{y}_j = \frac{\sum_{k=1}^{n_j} y_{jk}}{n_j}$.

a) $E(\bar{y}_j) = \mu_j$;  b) $\text{Var}(\bar{y}_j) = (1-f_j)\frac{S_j^2}{n_j}$, where $S_j^2 = \frac{\sum_{k=1}^{N_j}(Y_{jk} - \bar{Y}_j)^2}{N_j - 1}$

and $f_j = n_j/N_j$ is the sampling fraction for the j-th domain.

The sample estimate of the variance is given by $v(\bar{y}_j) = (1-f_j)\frac{s_j^2}{n_j}$, where $s_j^2 = \frac{\sum_{k=1}^{n_j}(y_{jk} - \bar{y}_j)^2}{n_j - 1}$,

and its standard error is $s.e.(\bar{y}_j) = \sqrt{1-f_j}\,\frac{s_j}{\sqrt{n_j}}$. If $N_j$ is not known, use $f = n/N$ in place of $f_j = n_j/N_j$.
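A short sketch of domain estimation under the formulas above; the sample records, domain labels, and the assumed domain size $N_j$ are invented for illustration. The SRS is filtered down to the domain, and the usual formulas are applied with $n_j$ and $N_j$:

```python
import math

# Hypothetical SRS of n = 8 from a population of N = 100;
# each record is (domain, y value), e.g. domain "u"(rban) vs "r"(ural).
sample = [("u", 12), ("r", 9), ("u", 15), ("u", 11),
          ("r", 7), ("u", 14), ("r", 10), ("u", 13)]
N, n = 100, len(sample)
N_j = 60                       # assumed known size of the "u" domain

y_j = [y for d, y in sample if d == "u"]   # sample elements in the domain
n_j = len(y_j)

ybar_j = sum(y_j) / n_j                                   # domain mean
s2_j = sum((y - ybar_j) ** 2 for y in y_j) / (n_j - 1)    # domain variance
f_j = n_j / N_j
se_j = math.sqrt((1 - f_j) * s2_j / n_j)                  # s.e. of ybar_j

print(f"estimated domain mean: {ybar_j:.2f} (s.e. {se_j:.3f})")
```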

ii) Subpopulation Total $Y_j$: It is given by $Y_j = \sum_{k=1}^{N_j} Y_{jk}$, and we consider two cases for its
estimator $\hat{Y}_j$.

Case 1, when $N_j$ is known: a) $\hat{Y}_j = N_j \bar{y}_j$;  b) $\text{Var}(\hat{Y}_j) = (1-f_j)\frac{N_j^2 S_j^2}{n_j}$.

Case 2, when $N_j$ is unknown: Estimate $N_j$ by $\hat{N}_j = \frac{N}{n} n_j$. Then the total estimate is:

a) $\hat{Y}_j = \hat{N}_j \bar{y}_j = \frac{N}{n}\sum_{k=1}^{n_j} y_{jk}$;  b) $\text{Var}(\hat{Y}_j) = (1-f)\frac{N^2 S'^2}{n}$, where

$S'^2 = \frac{\sum_{i=1}^{N} Y_i'^2 - \frac{Y_j^2}{N}}{N-1}$, and $Y_i' = \begin{cases} Y_i & \text{if the unit is in the } j\text{-th domain } (N_j \text{ units}) \\ 0 & \text{otherwise } (N - N_j \text{ units}) \end{cases}$

Note that $\sum_{i=1}^{N} Y_i' = \sum_{i \in j\text{-th domain}} Y_i = Y_j$. Verify (a) and (b).

The sample estimates of the variances are given by:

a) $v(\hat{Y}_j) = (1-f_j)\frac{N_j^2 s_j^2}{n_j}$, if $N_j$ is known;

b) $v(\hat{Y}_j) = (1-f)\frac{N^2 s'^2}{n}$, if $N_j$ is unknown, where

$s'^2 = \frac{\sum_{i=1}^{n} y_i'^2 - \frac{\left(\sum_{i=1}^{n} y_i'\right)^2}{n}}{n-1}$, and $y_i' = \begin{cases} y_i & \text{if the unit is in the } j\text{-th domain } (n_j \text{ units}) \\ 0 & \text{if the unit is not in the } j\text{-th domain } (n - n_j \text{ units}) \end{cases}$

Example: See Cochran 3rd edition page 37.

Comparison between Domain Means

Consider the population units that are classified into two domains. Let us say for example jth and kth
domains with the sample means y j and y k from simple random sampling. The variance of the
difference of the means is given by:
$\text{Var}(\bar{y}_j - \bar{y}_k) = \text{Var}(\bar{y}_j) + \text{Var}(\bar{y}_k) = (1-f_j)\frac{S_j^2}{n_j} + (1-f_k)\frac{S_k^2}{n_k}$ (verify this).

If the fpc is ignored, then $\text{Var}(\bar{y}_j - \bar{y}_k) = \frac{S_j^2}{n_j} + \frac{S_k^2}{n_k}$.

Comparisons are often made between two populations in order to assess population
characteristics. For example, two different treatments may be applied to two independent sets of similar
subjects, or the same treatment may be applied to two different kinds of subjects. Depending on the objective
of the survey, we construct confidence intervals and test hypotheses about the difference between the two
population parameters when the samples are independent.

2.9 Sample Size Determination for One Item

In the planning of a sample survey one of the first considerations is the sample size determination.
Since every survey is different, there can be no hard and fast rules for determining sample size.
Generally, the factors, which decide the scale of the survey operations, have to do with cost, time,
operational constraints and the desired precision of the results. Once these points have been appraised
and individually assessed, the investigators are in a better position to decide the size of the sample.

2.9.1 Desired Precision of Sample Estimates

One of the major considerations in deciding sample size is the level of error that one deems tolerable and acceptable. Measures of sampling error such as the standard error or the coefficient of variation are frequently used to indicate the precision of sample estimates. Since high precision is desirable, large samples are also desirable: the larger the sample, the more precise the estimates will be. The sample size can therefore be determined by specifying the precision required for each major finding to be produced from the survey.

The sample size required under simple random sampling for estimating the population mean Ȳ is obtained as follows. Suppose the sample estimate ȳ is to differ in absolute value from the true unknown mean Ȳ by no more than d; that is, we specify an absolute error d = |ȳ − Ȳ|, or a relative error ε = |ȳ − Ȳ|/Ȳ, in which case d = εȲ.

Specifying the maximum allowable difference d between ȳ and Ȳ, and allowing a small probability α that the error may exceed that difference, choose a sample size n such that P(|ȳ − Ȳ| > d) ≤ α.

With SRS we can show that, assuming the estimate ȳ is approximately normally distributed, the sample size n must satisfy

n ≥ (Z²S²/d²) / (1 + Z²S²/(Nd²)),  i.e.,  n ≥ n₀/(1 + n₀/N),  where n₀ = Z²S²/d²

and Z is the reliability coefficient, the upper α/2 point of the standard normal distribution.

If the population size N is very much greater than the required sample size n, the relation above can be approximated by n ≈ Z²S²/d² = n₀. As a first approximation, calculate n₀ = Z²S²/d². If the sampling fraction n₀/N is very small, say less than 5%, we may take n₀ as a satisfactory approximation to the required sample size n. Otherwise calculate n = n₀/(1 + n₀/N).

Z 2 S Y 
2
Z 2S 2 2 S 
2
Z 2 CV(2y )
If we use the relative error d   Y , then we get n o  Z     , where
d2 d  2 2
CV(y) is coefficient of variation.
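The two-step recipe (first approximation n₀, then the finite-population correction) can be sketched as a small Python function. This is only a sketch, using the relative-error form n₀ = Z²CV²(y)/ε²; the argument names are illustrative:

```python
import math

def sample_size(z, cv, rel_err, N=None):
    """Sample size for estimating a mean to within relative error rel_err,
    at reliability coefficient z, given an anticipated CV of the variable."""
    n0 = (z ** 2) * (cv ** 2) / (rel_err ** 2)   # first approximation n0
    if N is not None and n0 / N >= 0.05:         # sampling fraction not negligible:
        n0 = n0 / (1 + n0 / N)                   # apply n = n0 / (1 + n0/N)
    return math.ceil(n0 - 1e-9)                  # tiny guard against float round-up

# Retail-outlet figures (Z = 3, CV = 0.2, 10% relative error, N = 2500 outlets).
n = sample_size(z=3, cv=0.2, rel_err=0.1, N=2500)   # -> 36
```

With Z = 3, CV = 0.2, ε = 0.1 and N = 2500 this reproduces the n = 36 of the worked example in Section 2.10.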

2.9.2 Sample Size with More Than One Item

In practical situations more than one variable is used as the basis for the calculation of sample size. The decision on sample size will in fact be largely governed by the way the results are to be analyzed, so the investigator must at the outset consider, at least in broad terms, the breakdowns or sub-populations to be made in the final tabulations. Such sub-populations might be defined in terms of age/sex groups or geographic areas. Given the “multi-purpose” nature of most surveys, we also deal with many variables, for which an estimate of sample size is needed separately for each variable.

The sample size falling into each sub-population (variable) should be large enough to enable estimates
to be produced at specified levels of precision. Therefore, several of the most important variables are
chosen and sample sizes are calculated for each of these variables. The final sample size chosen might
then be the largest of these calculated sample sizes. If funds are not available to take the largest of
these calculated sample sizes, then, as a compromise measure, the median or mean of the calculated
ns might be taken.

2.10 Relative Error

Statistical measures such as the standard deviation and the standard error are expressed in the units of measurement of the variables. Such measurement units may cause difficulties in making some comparisons. Relative measures, such as the coefficient of variation, can be used to overcome these problems.
The element coefficient of variation can be expressed as CV(y) = S_y/Ȳ and estimated by cv(y) = s_y/ȳ.

For the mean ȳ, the coefficient of variation is given by CV(ȳ) = S.E.(ȳ)/Ȳ, and estimated by cv(ȳ) = se(ȳ)/ȳ. For the total Ŷ, the coefficient of variation is given by CV(Ŷ) = S.E.(Ŷ)/E(Ŷ), and estimated by cv(Ŷ) = N·se(ȳ)/(N·ȳ) = se(ȳ)/ȳ, which is the same as the estimated coefficient of variation of the mean.
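A minimal sketch, using hypothetical data, showing that the factor N cancels so that cv(Ŷ) and cv(ȳ) coincide:

```python
import math

def cv_estimates(sample, N):
    """Estimated coefficients of variation of the mean and of the total under
    SRS; the factor N cancels, so the two estimates coincide."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)   # sample variance
    se_ybar = math.sqrt((s2 / n) * (1 - n / N))           # S.E. of the mean, with fpc
    cv_mean = se_ybar / ybar                              # cv(ybar) = se(ybar)/ybar
    cv_total = (N * se_ybar) / (N * ybar)                 # cv(Y_hat): N cancels
    return cv_mean, cv_total

# Hypothetical sample of 5 prices drawn from a population of 100 units.
cv_mean, cv_total = cv_estimates([5, 7, 6, 8, 9], N=100)
```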

Example: A sample survey of retail outlets is to be conducted in a city that contains 2,500 outlets. The objective is to estimate the average retail price of 20 items of a commonly used food. An estimate is needed that is within 10% of the true average retail price in the city. An SRS will be taken from an available list of all outlets. An earlier survey of the same population showed an average price of $7.00 for the 20 items, with a standard deviation of $1.4. Assuming a 99.7% confidence level, determine the sample size.
Solution: N = 2500, s = 1.4, s² = (1.4)², ε = 0.1, ȳ = 7.00.

CV²(y) = s²/ȳ² = (1.4)²/7² = 0.04,  Z = 3 for 99.7%.

n₀ = Z²CV²(y)/ε² = 3² × 0.04/(0.1)² = 36,  n₀/N = 36/2500 = 0.0144 < 5%.

Therefore n ≈ n₀ = 36, which is a good approximation for the sample size. Calculating n exactly,

n = n₀/(1 + n₀/N) = 36/(1 + 36/2500) = 36/1.0144 ≈ 35.5 ≈ 36.

2.11 Limitations of Simple Random Sampling

Under simple random sampling any particular sample of n elements from a population of N elements
can be chosen and in addition, is as likely to be chosen as any other sample. In this sense, it is
conceptually the simplest possible method, and hence it is one against which all other methods can be
compared. However, despite such importance, simple random sampling has the following limitations:

 It can be expensive and often not feasible in practice, since it requires that all elements be identified and labeled prior to sampling. Where this prior identification is not possible, a simple random sample of elements cannot be drawn.
 Since it gives each element in the population an equal chance of being chosen in the sample, it
may result in samples that are spread out over a large geographic area. Such a geographic
distribution of the sample would be very costly to implement.
 It would not be good for those surveys in which interest is focused on subgroups that comprise
a small proportion of the population. For example, it is not likely to be an efficient design for
rare events such as disability and special crops.
